






































Hadoop exercises for practicing







































General Notes
Hands-On Exercise: Using HDFS
Hands-On Exercise: Running a MapReduce Job
Hands-On Exercise: Writing a MapReduce Java Program
Hands-On Exercise: More Practice With MapReduce Java Programs
Optional Hands-On Exercise: Writing a MapReduce Streaming Program
Hands-On Exercise: Writing Unit Tests With the MRUnit Framework
Hands-On Exercise: Creating an Inverted Index
Hands-On Exercise: Calculating Word Co-Occurrence
Hands-On Exercise: Importing Data With Sqoop
Hands-On Exercise: Manipulating Data With Hive
Hands-On Exercise: Running an Oozie Workflow
shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.
1. For most exercises, three folders are provided. Which you use will depend on how you would like to work on the exercises:
- stubs: contains minimal skeleton code for the Java classes you’ll need to write. These are best for those with Java experience.
- hints: contains Java class stubs that include additional hints about what’s required to complete the exercise. These are best for developers with limited Java experience.
- solution: contains fully implemented Java code, which you may run “as-is” or compare against your own solution.
2. As the exercises progress, and you gain more familiarity with Hadoop and MapReduce, we provide fewer step-by-step instructions; as in the real world, we merely give you a requirement and it’s up to you to solve the problem! You should feel free to refer to the hints or solutions provided, ask your instructor for assistance, or consult with your fellow students!
3. There are additional challenges for some of the Hands-On Exercises. If you finish the main exercise, please attempt the additional steps.
Data files (local):
~/training_materials/developer/data/shakespeare.tar.gz
~/training_materials/developer/data/access_log.gz
In this exercise you will begin to get acquainted with the Hadoop tools. You
will manipulate files in HDFS, the Hadoop Distributed File System.
1. Before starting the exercises, run the course setup script in a terminal window:
$ ~/scripts/developer/training_setup_dev.sh
Hadoop is already installed, configured, and running on your virtual machine.
Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.
Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.
1. Change directories to the local filesystem directory containing the sample data we will be using in the course.
$ cd ~/training_materials/developer/data
If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.
2. Unzip shakespeare.tar.gz by running:
$ tar zxvf shakespeare.tar.gz
This creates a directory named shakespeare/ containing several files on your local filesystem.
3. Insert this directory into HDFS:
$ hadoop fs -put shakespeare /user/training/shakespeare
This copies the local shakespeare directory and its contents into a remote, HDFS directory named /user/training/shakespeare.
4. List the contents of your HDFS home directory now:
$ hadoop fs -ls /user/training
You should see an entry for the shakespeare directory.
5. Now try the same fs -ls command but without a path argument:
$ hadoop fs -ls
You should see the same results. If you don’t pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.
If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they are considered relative to your home directory.
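The resolution rule described above can be sketched in a few lines of plain Java. This is only an illustration of the rule and does not call any Hadoop API; the class name `HdfsPathSketch` and the method `resolve` are hypothetical, and the home directory `/user/training` is taken from the exercise text.

```java
// Sketch only: mimics how a relative HDFS path is interpreted.
// It does not contact HDFS or use any Hadoop classes.
class HdfsPathSketch {
    // Home directory of the course user, as stated in the exercise text.
    static final String HOME = "/user/training";

    static String resolve(String path) {
        if (path == null || path.isEmpty()) {
            return HOME;              // no path argument: the home directory itself
        }
        if (path.startsWith("/")) {
            return path;              // absolute paths are used unchanged
        }
        return HOME + "/" + path;     // relative paths are anchored at the home directory
    }

    public static void main(String[] args) {
        System.out.println(resolve("weblog/access_log"));
        System.out.println(resolve("/user/training/shakespeare"));
        System.out.println(resolve(""));
    }
}
```

Under this rule, `shakespeare` and `/user/training/shakespeare` name the same HDFS directory for the training user.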
6. We also have a Web server log file, which we will put into HDFS for use in future exercises. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
$ hadoop fs -mkdir weblog
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
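The idea behind this pipeline, decompressing while copying so that no uncompressed intermediate file is written, can be sketched in plain Java with `java.util.zip`. This is a minimal sketch, not how the hadoop command is implemented: the class `StreamingGunzipSketch` and its methods are hypothetical, and an in-memory stream stands in for the HDFS destination.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class StreamingGunzipSketch {
    // Decompress gzip data from 'compressed' straight into 'destination',
    // one buffer at a time, and return the number of bytes written.
    static long gunzipTo(InputStream compressed, OutputStream destination) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(compressed)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                destination.write(buffer, 0, n);
                total += n;
            }
        }
        destination.flush();
        return total;
    }

    // Helper for demonstration: compress 'data' in memory, then stream it back out.
    static byte[] roundTrip(byte[] data) {
        try {
            ByteArrayOutputStream gz = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(gz)) {
                out.write(data);
            }
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            gunzipTo(new ByteArrayInputStream(gz.toByteArray()), restored);
            return restored.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] line = "GET /index.html 200\n".getBytes(StandardCharsets.UTF_8);
        System.out.print(new String(roundTrip(line), StandardCharsets.UTF_8));
    }
}
```

Because the data moves through a fixed-size buffer, memory use stays constant regardless of the size of the log file.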
8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
9. The access log file is quite large (around 500 MB). Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent exercises.
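The "first 5000 lines" step amounts to reading a stream line by line and stopping at a limit. The following plain-Java sketch shows that logic only; the class `FirstLinesSketch` and the method `firstLines` are hypothetical names, and the sketch does not read from or write to HDFS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FirstLinesSketch {
    // Return at most 'limit' lines from the reader, stopping early at end of input.
    static List<String> firstLines(BufferedReader reader, int limit) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            // The size check comes first so no line beyond the limit is consumed.
            while (lines.size() < limit && (line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        BufferedReader log = new BufferedReader(new StringReader("a\nb\nc\n"));
        System.out.println(firstLines(log, 2));
    }
}
```

For the exercise itself the limit would be 5000, and the input would be the decompressed access log rather than an in-memory string.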
good idea to pipe the output of the fs -cat command into head, tail, more, or less.
4. To download a file to work with on the local filesystem use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt
There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.
Source directory: ~/workspace/wordcount/src/solution
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program.
In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
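Before running the real job, it may help to see the word-count logic in isolation. The sketch below is plain Java and does not use the Hadoop API or the course's WordMapper and SumReducer classes; the class `WordCountSketch` is hypothetical, and splitting on non-word characters is an assumption about how words are delimited.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // "Map" step: break each line into words.
    // "Reduce" step: sum a count of 1 per occurrence of each word.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();   // sorted by word, like job output keys
        for (String line : lines) {
            // Assumption: words are separated by runs of non-word characters.
            for (String word : line.split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("to be or not to be", "to sleep")));
        // prints {be=2, not=1, or=1, sleep=1, to=3}
    }
}
```

In the actual job the two steps run as separate Mapper and Reducer tasks, possibly on different machines, with Hadoop grouping the intermediate pairs by key between them.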
4. Collect your compiled Java files into a JAR file:
$ jar cvf wc.jar solution/*.class
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (solution.WordCount), and the HDFS input and output directories to use for the MapReduce job. Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.
6. Try running this same command again without any change:
$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design; since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
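The fail-fast behavior described above can be imitated on the local filesystem with `java.nio.file`. This is only an analogy under stated assumptions: the class `OutputDirCheckSketch` and its methods are hypothetical, the check runs against a local temporary directory rather than HDFS, and Hadoop's own exception type and message will differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class OutputDirCheckSketch {
    // Refuse to proceed if the output directory already exists; otherwise create it.
    static void prepareOutput(Path outputDir) {
        if (Files.exists(outputDir)) {
            throw new IllegalStateException("Output directory " + outputDir + " already exists");
        }
        try {
            Files.createDirectories(outputDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulate two runs with the same output directory; the second should be rejected.
    static boolean secondRunFails() {
        try {
            Path base = Files.createTempDirectory("wc-sketch");
            Path out = base.resolve("wordcounts");
            try {
                prepareOutput(out);          // first run: directory is created
                try {
                    prepareOutput(out);      // second run: directory already exists
                    return false;
                } catch (IllegalStateException expected) {
                    return true;
                }
            } finally {
                Files.deleteIfExists(out);   // clean up the temporary directories
                Files.deleteIfExists(base);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second run rejected: " + secondRunFails());
    }
}
```

Checking before any work is done means a mistaken rerun costs almost nothing, which is the point of the design.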
7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
Take care when using wildcards (e.g. *) when specifying HDFS filenames; because of how Linux works, the shell will attempt to expand the wildcard before invoking hadoop, and then pass incorrect references to local files instead of HDFS files. You can prevent this by enclosing the wildcarded HDFS filenames in single quotes, e.g. hadoop fs -cat 'wordcounts/*'
9. Try running the WordCount job against a single file:
$ hadoop jar wc.jar solution.WordCount \
shakespeare/poems pwords
When the job completes, inspect the contents of the pwords HDFS directory.
10. Clean up the output files produced by your job runs:
$ hadoop fs -rm -r wordcounts pwords
It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself.
Eclipse project: averagewordlength
Java files:
AverageReducer.java (Reducer)
LetterMapper.java (Mapper)
AvgWordLength.java (driver)
Test data (HDFS): shakespeare
Exercise directory: ~/workspace/averagewordlength
In this exercise you will write a MapReduce job that reads any text input and computes the average length of all words that start with each character.
For any text input, the job should report the average length of words that begin with ‘a’, ‘b’, and so forth. For example, for input:
No now is definitely not the time
The output would be:
N 2.0
n 3.0
d 10.0
i 2.0
t 3.5
(For the initial solution, your program should be case-sensitive as shown in this example.)
The algorithm for this program is a simple one-pass MapReduce program:
The Mapper
The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:
No now is definitely not the time
Your Mapper should emit:
N 2
n 3
i 2
d 10
n 3
t 3
t 4
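The Mapper logic described above can be tried out in plain Java before writing the Hadoop class. This is a sketch, not the course's LetterMapper: the class `LetterMapperSketch` is hypothetical, it returns a list instead of writing to a Hadoop context, and splitting on non-word characters is an assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LetterMapperSketch {
    // For each word in the line, emit (first letter, word length).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Assumption: words are separated by runs of non-word characters.
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word.substring(0, 1), word.length()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map("No now is definitely not the time")) {
            System.out.println(pair.getKey() + " " + pair.getValue());
        }
    }
}
```

Run against the sample line, this prints the same seven pairs listed above, in the order the words appear.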
The Reducer
Thanks to the shuffle and sort phase built in to MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So, for the Mapper output above, the Reducer receives this:
N (2)
d (10)
i (2)
n (3,3)
t (3,4)
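The grouping performed by shuffle and sort can be approximated with a sorted map. The sketch below is plain Java and only imitates the effect on a single machine; the class `ShuffleSortSketch` is hypothetical. It assumes keys compare in the same order as Java strings, which holds for the ASCII letters used here (uppercase sorts before lowercase). In a real job the order of values within one key is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSortSketch {
    // Group values by key and keep the keys in sorted order.
    static Map<String, List<Integer>> group(String[] keys, int[] values) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Mapper output for "No now is definitely not the time"
        String[] keys = {"N", "n", "i", "d", "n", "t", "t"};
        int[] values = {2, 3, 2, 10, 3, 3, 4};
        System.out.println(group(keys, values));
        // prints {N=[2], d=[10], i=[2], n=[3, 3], t=[3, 4]}
    }
}
```

Each entry of the resulting map corresponds to one call to the Reducer: one key with all of its values.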
If you are using Eclipse, open the stub files (located in the src/stubs package) in the averagewordlength project. If you prefer to work in the shell, the files are in ~/workspace/averagewordlength/src/stubs.
You may wish to refer back to the wordcount example (in the wordcount project in Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:
3. Define the driver
This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys and values.
4. Define the Mapper
Note these simple string operations in Java:
str.substring(0, 1) // String : first letter of str
str.length() // int : length of str
5. Define the Reducer
In a single invocation the reduce() method receives a string containing one letter (the key) along with an iterable collection of integers (the values), and should emit a single key-value pair: the letter and the average of the integers.
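The averaging step at the heart of the Reducer can be sketched as follows. This is plain Java rather than the course's AverageReducer: the class `AverageReducerSketch` is hypothetical, and it returns the average instead of writing it to a Hadoop context. Note the cast to double; dividing two integer types would truncate 3.5 to 3.

```java
import java.util.Arrays;

class AverageReducerSketch {
    // Compute the average of all values received for one key.
    static double average(Iterable<Integer> values) {
        long sum = 0;
        long count = 0;
        for (int v : values) {
            sum += v;
            count++;
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values for key");
        }
        return (double) sum / count;   // cast first to avoid integer division
    }

    public static void main(String[] args) {
        System.out.println("t " + average(Arrays.asList(3, 4)));   // prints t 3.5
        System.out.println("n " + average(Arrays.asList(3, 3)));   // prints n 3.0
    }
}
```

The values are traversed once, which matters in a real Reducer because the iterable supplied by Hadoop is meant to be consumed in a single pass.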
6. Compile your classes and assemble the jar file
To compile and jar, you may either use the command line javac command as you did earlier in the “Running a MapReduce Job” exercise, or follow the steps below (“Using Eclipse to Compile Your Solution”) to use Eclipse.
Follow these steps to use Eclipse to complete this exercise.
Note: These same steps will be used for all subsequent exercises. The instructions will not be repeated each time, so take note of the steps.
1. Verify that your Java code does not have any compiler errors or warnings.
The Eclipse software in your VM is pre-configured to compile code automatically, without requiring any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.
2. In the Package Explorer, open the Eclipse project for the current exercise (i.e. averagewordlength). Right-click the default package under the src entry and select Export.
A red X indicates a compiler error