Assignment on Big Data, Assignments of Data Mining

Lovely Professional University Data Mining


Typology: Assignments

2019/2020

Uploaded on 09/19/2020

cabinet-shah-zhylnzpttr 🇮🇳


ASSIGNMENT

INT 312

(BIG DATA FUNDAMENTALS)

SET SB

SUBMITTED BY: Cabinet Kumar Shah, 11812420, A35 KOM59

SUBMITTED TO: Prof. Mamoon Rashid, 20574, School of CSE dept.

Question 1

Organization X holds 10 PB of data, and the available HDD has an access rate of 100 MB/s. The HDD has four channels for data storage and retrieval. As a Big Data Engineer, calculate the time required to retrieve this data with the given features. Also suggest how organization X can retrieve this data in minimum time, and state the requirements for implementing your idea. (10)

(Hint: Data will be read from four channels in parallel. As a result, you need to calculate the time for only half the data, as the other half will be read in parallel at the same time via another channel.)

Solution:

Data = 10 PB
= 10 * 1000 TB (1 PB = 1000 TB)
= 10,000 TB
= (10,000 * 1,048,576) MB (1 TB = 1,048,576 MB)
= 10,485,760,000 MB

Average data access rate = 100 MB/s.

Reading 100 MB takes 1 second, so reading 10,485,760,000 MB (10 PB) through one channel takes X seconds, where

X = 10,485,760,000 / 100 s
= 104,857,600 s
= 1,747,626.67 minutes (about 1,213.6 days)
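The hint in the question says the data is read from several channels in parallel, a step the solution above does not apply yet. The following is a minimal sketch of that arithmetic, not part of the original answer. It assumes that every channel sustains the full 100 MB/s and that the data splits evenly across the channels; the variable names are illustrative only.

```shell
#!/bin/sh
# Figures taken from the solution above.
data_mb=$((10000 * 1048576))   # 10 PB in MB, using the solution's conversion factors
rate_mb_per_s=100              # access rate of one channel, in MB/s
channels=4                     # number of parallel channels stated in the question

# Time to read everything through a single channel.
single_s=$((data_mb / rate_mb_per_s))

# Assumption: each channel reads its own equal share of the data at the
# same time, so the elapsed time is divided by the number of channels.
parallel_s=$((single_s / channels))

echo "Single channel: ${single_s} s ($((single_s / 86400)) full days)"
echo "${channels} channels: ${parallel_s} s ($((parallel_s / 86400)) full days)"
```

The hint itself speaks of half the data, which corresponds to two channels; setting channels=2 reproduces that reading. Either way, the time stays in the range of hundreds of days, which is why a single disk is not a practical way to retrieve 10 PB.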

Question 2

Hadoop installation

Show the steps and commands used to install Apache Hadoop as a one-node cluster on your machine. Each step must be supported with screenshots of your machine with your name on the terminal. Explain the functionality of each file used in the configuration of Apache Hadoop.

Step 1: If Java is not installed, install it with the command below.
$ sudo apt-get install openjdk-8-jdk

Step 2: Check the Java version.
$ java -version

Step 3: Run the command below and copy its output.
$ readlink -f /usr/bin/java | sed "s:bin/java::"

Step 6: Generate an RSA key pair with an empty passphrase.
$ ssh-keygen -t rsa -P ""

Step 7: Add the public key to the list of authorized keys.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 8: Check that SSH to the local machine works.
$ ssh localhost

Step 9: Open the .bashrc file with the command
$ gedit ~/.bashrc
then paste the following lines:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd
export HADOOP_INSTALL=/home/cabinetshah/hadoop-3.1.
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

Step 12: Run these two commands to create the temporary directory and make your user its owner.
$ sudo mkdir -p /home/cabinetshah/hadoop-3.1.3/tmp
$ sudo chown cabinetshah:cabinetshah /home/cabinetshah/hadoop-3.1.3/tmp

Step 13: Go to the hadoop-3.1.3 folder, then etc/hadoop, open core-site.xml in a text editor, and paste the following lines inside it.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/cabinetshah/hadoop-3.1.3/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

Save and quit.

Step 16: Go to the hadoop-3.1.3 folder, then etc/hadoop, open hdfs-site.xml in a text editor, and paste the following lines inside it.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/home/cabinetshah/hadoop-3.1.3/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/home/cabinetshah/hadoop-3.1.3/hadoop_store/hdfs/datanode</value>
</property>

Save and quit.

Step 17: Run the following command to format the NameNode.
$ hadoop namenode -format

Remarks: If the NameNode does not appear after starting the daemons, delete the hadoop_store folder and the tmp folder, then run all the commands again. Also check the four configuration files inside the etc folder of the hadoop-3.1.3 folder, as well as the .bashrc file.

Question 3

Part I

Create a text file named temp.txt and save it in the local file system. Write a Hadoop command to move this file into HDFS, and later change the replication factor for this file to 4 on HDFS. Support your answer with a screenshot of the CLI fetching the text file on HDFS.

Step 1: Create a file named temp.txt on the local file system.

Command to move the file from the local file system to HDFS.

Step 2: hdfs dfs -moveFromLocal /home/cabinetshah/ /KOM59/

Command to change the replication factor to 4.

Step 3: hdfs dfs -setrep -R 4 /KOM59/

Part II

Create a text file named test.txt and save it on HDFS in one directory. Write a Hadoop command to move it to another HDFS directory and then display its contents. Support your answer with a screenshot of the CLI fetching the text file.

Command to create a directory named 'KOM59' on the HDFS file system.

Step 1: hdfs dfs -mkdir /KOM

Command to create a directory named 'K18PV' on the HDFS file system.