Hadoop exercises hands-on, Exercises of Computer Science

Hadoop exercises for practicing

Typology: Exercises

2020/2021

Uploaded on 06/05/2021

sa-lma-2
Apache Hadoop:
Hands-On Exercises

General Notes
Hands-On Exercise: Using HDFS
Hands-On Exercise: Running a MapReduce Job
Hands-On Exercise: Writing a MapReduce Java Program
Hands-On Exercise: More Practice With MapReduce Java Programs
Optional Hands-On Exercise: Writing a MapReduce Streaming Program
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework
Hands-On Exercise: Creating an Inverted Index
Hands-On Exercise: Calculating Word Co-Occurrence
Hands-On Exercise: Importing Data With Sqoop
Hands-On Exercise: Manipulating Data With Hive
Hands-On Exercise: Running an Oozie Workflow

201403

shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.

Points to note during the exercises

1. For most exercises, three folders are provided. Which you use will depend on how you would like to work on the exercises:
   - stubs: contains minimal skeleton code for the Java classes you’ll need to write. These are best for those with Java experience.
   - hints: contains Java class stubs that include additional hints about what’s required to complete the exercise. These are best for developers with limited Java experience.
   - solution: contains fully implemented Java code, which may be run “as-is”, or you may wish to compare your own solution to the examples provided.

2. As the exercises progress, and you gain more familiarity with Hadoop and MapReduce, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it’s up to you to solve the problem! You should feel free to refer to the hints or solutions provided, ask your instructor for assistance, or consult with your fellow students!

3. There are additional challenges for some of the Hands-On Exercises. If you finish the main exercise, please attempt the additional steps.

Hands-On Exercise: Using HDFS

Files Used in This Exercise:

Data files (local):
~/training_materials/developer/data/shakespeare.tar.gz
~/training_materials/developer/data/access_log.gz

In this exercise you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

Set Up Your Environment

1. Before starting the exercises, run the course setup script in a terminal window:

$ ~/scripts/developer/training_setup_dev.sh

Hadoop

Hadoop is already installed, configured, and running on your virtual machine.

Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:

$ hadoop

The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.

Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.

Step 2: Uploading Files

Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.

1. Change directories to the local filesystem directory containing the sample data we will be using in the course.

$ cd ~/training_materials/developer/data

If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.

2. Unzip shakespeare.tar.gz by running:

$ tar zxvf shakespeare.tar.gz

This creates a directory named shakespeare/ containing several files on your local filesystem.

3. Insert this directory into HDFS:

$ hadoop fs -put shakespeare /user/training/shakespeare

This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.

4. List the contents of your HDFS home directory now:

$ hadoop fs -ls /user/training

You should see an entry for the shakespeare directory.

5. Now try the same fs -ls command but without a path argument:

$ hadoop fs -ls

You should see the same results. If you don’t pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.

Relative paths

If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they are considered relative to your home directory.
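The resolution rule described above can be illustrated with a small plain-Java sketch. It does not call any Hadoop API; the class and method names are hypothetical, and the home directory /user/training is taken from this exercise's text.

```java
class HdfsPathSketch {
    // Home directory assumed from the exercise text; on another cluster it would differ.
    static final String HOME = "/user/training";

    // Mimics the rule described above: absolute paths are used as given,
    // relative paths are interpreted against the home directory, and an
    // empty path stands for the home directory itself.
    static String resolve(String path) {
        if (path == null || path.isEmpty()) {
            return HOME;
        }
        if (path.startsWith("/")) {
            return path;
        }
        return HOME + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("weblog/access_log"));          // /user/training/weblog/access_log
        System.out.println(resolve("/user/training/shakespeare")); // /user/training/shakespeare
        System.out.println(resolve(""));                           // /user/training
    }
}
```

This is only a model of the naming rule, which is why hadoop fs -ls and hadoop fs -ls /user/training list the same directory.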

6. We also have a Web server log file, which we will put into HDFS for use in future exercises. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:

$ hadoop fs -mkdir weblog

7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.

$ gunzip -c access_log.gz \
  | hadoop fs -put - weblog/access_log

8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.

9. The access log file is quite large – around 500 MB. Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises.

good idea to pipe the output of the fs -cat command into head, tail, more, or less.

4. To download a file to work with on the local filesystem use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:

$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt

Other Commands

There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: mv, cp, mkdir, etc.

1. Enter:

$ hadoop fs

This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.

This is the end of the Exercise

Hands-On Exercise: Running a MapReduce Job

Files and Directories Used in this Exercise

Source directory: ~/workspace/wordcount/src/solution

Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program.

In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
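Before running the real job, it may help to see the word-count logic in miniature. The following is a plain-Java sketch with no Hadoop dependency; it is not the course's WordCount, WordMapper, or SumReducer code. The class and method names are hypothetical, and splitting on non-word characters is an assumption about how words are tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {

    // Map step: split one line into words. In a real job each word
    // would be emitted as a (word, 1) pair.
    static List<String> mapLine(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Shuffle/sort and reduce steps combined: the TreeMap keeps keys in
    // sorted order, and merge() sums the 1s emitted for each word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : mapLine(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not", "to be");
        System.out.println(countWords(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```

In Hadoop the map and reduce steps run as separate tasks on the worker nodes, and the framework performs the grouping between them; the sketch collapses all of that into one process purely for illustration.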

4. Collect your compiled Java files into a JAR file:

$ jar cvf wc.jar solution/*.class

5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare wordcounts

This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job. Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.

6. Try running this same command again without any change:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare wordcounts

Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design; since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.

7. Review the result of your MapReduce job:

$ hadoop fs -ls wordcounts

This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)

8. View the contents of the output for your job:

$ hadoop fs -cat wordcounts/part-r-00000 | less

You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.

Wildcards in HDFS file paths

Take care when using wildcards (e.g. *) when specifying HDFS filenames; because of how Linux works, the shell will attempt to expand the wildcard before invoking hadoop, and then pass incorrect references to local files instead of HDFS files. You can prevent this by enclosing the wildcarded HDFS filenames in single quotes, e.g. hadoop fs -cat 'wordcounts/*'

9. Try running the WordCount job against a single file:

$ hadoop jar wc.jar solution.WordCount \
  shakespeare/poems pwords

When the job completes, inspect the contents of the pwords HDFS directory.

10. Clean up the output files produced by your job runs:

$ hadoop fs -rm -r wordcounts pwords

Stopping MapReduce Jobs

It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself.

Hands-On Exercise: Writing a MapReduce Java Program

Projects and Directories Used in this Exercise

Eclipse project: averagewordlength

Java files:
AverageReducer.java (Reducer)
LetterMapper.java (Mapper)
AvgWordLength.java (driver)

Test data (HDFS): shakespeare

Exercise directory: ~/workspace/averagewordlength

In this exercise you will write a MapReduce job that reads any text input and computes the average length of all words that start with each character.

For any text input, the job should report the average length of words that begin with ‘a’, ‘b’, and so forth. For example, for input:

No now is definitely not the time

The output would be:

N 2.0

n 3.0

d 10.0

i 2.0

t 3.5

(For the initial solution, your program should be case-sensitive as shown in this example.)

The Algorithm

The algorithm for this program is a simple one-pass MapReduce program:

The Mapper

The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:

No now is definitely not the time

Your Mapper should emit:

N 2

n 3

i 2

d 10

n 3

t 3

t 4
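The Mapper logic described above can be sketched in plain Java without the Hadoop API. This is only an illustration of the per-line logic, not the LetterMapper class you are asked to write; the class and method names are hypothetical, and splitting on non-word characters is an assumed tokenization.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LetterMapperSketch {

    // For each word in the line, produce (first letter, word length).
    // A real Mapper would write these pairs to its output context
    // instead of collecting them in a list.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                pairs.add(new SimpleEntry<String, Integer>(
                        word.substring(0, 1), word.length()));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints one "letter length" pair per line, in input order.
        for (Map.Entry<String, Integer> pair
                : mapLine("No now is definitely not the time")) {
            System.out.println(pair.getKey() + " " + pair.getValue());
        }
    }
}
```

Run on the example sentence, the sketch prints the same seven pairs listed above, in the same order. The empty-word check guards against lines that begin with punctuation or whitespace, where split() yields an empty first token.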

The Reducer

Thanks to the shuffle and sort phase built into MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So, for the Mapper output above, the Reducer receives this:

If you are using Eclipse, open the stub files (located in the src/stubs package) in the averagewordlength project. If you prefer to work in the shell, the files are in ~/workspace/averagewordlength/src/stubs.

You may wish to refer back to the wordcount example (in the wordcount project in Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:

3. Define the driver

This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.

4. Define the Mapper

Note these simple string operations in Java:

str.substring(0, 1) // String : first letter of str
str.length() // int : length of str

5. Define the Reducer

In a single invocation the reduce() method receives a string containing one letter (the key) along with an iterable collection of integers (the values), and should emit a single key-value pair: the letter and the average of the integers.
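The averaging step at the heart of reduce() can be sketched in plain Java as follows. This is not the AverageReducer class itself and uses no Hadoop types; the class and method names are hypothetical, and rejecting an empty collection is a choice made for this sketch.

```java
import java.util.Arrays;

class AverageReducerSketch {

    // Computes the average of the word lengths received for one key.
    // The cast to double avoids integer division, which would turn
    // the average of 3 and 4 into 3 instead of 3.5.
    static double average(Iterable<Integer> values) {
        long sum = 0;
        long count = 0;
        for (int value : values) {
            sum += value;
            count++;
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values for key");
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        System.out.println("t " + average(Arrays.asList(3, 4))); // t 3.5
        System.out.println("n " + average(Arrays.asList(3, 3))); // n 3.0
    }
}
```

Because the values arrive as an iterable rather than a list, the sketch counts them while summing instead of asking for a size.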

6. Compile your classes and assemble the jar file

To compile your classes and build the JAR file, you may either use the javac command on the command line, as you did earlier in the “Running a MapReduce Job” exercise, or follow the steps below (“Use Eclipse to Compile Your Solution”) to use Eclipse.

Step 3: Use Eclipse to Compile Your Solution

Follow these steps to use Eclipse to complete this exercise.

Note: These same steps will be used for all subsequent exercises. The instructions will not be repeated each time, so take note of the steps.

1. Verify that your Java code does not have any compiler errors or warnings.

The Eclipse software in your VM is pre-configured to compile code automatically, so you do not need to perform any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.

2. In the Package Explorer, open the Eclipse project for the current exercise (i.e. averagewordlength). Right-click the default package under the src entry and select Export.

A red X indicates a compiler error