Assignment Spark Data Analytics, Study Guides, Projects, Research of Advanced Data Analysis

Cairo University Advanced Data Analysis

**Title: Leveraging Apache Spark for Big Data Analytics**

**Introduction**

Big Data Analytics has become an indispensable part of modern businesses, enabling organizations to derive valuable insights from vast volumes of data. Apache Spark has emerged as a leading framework for Big Data processing and analytics due to its speed, ease of use, and versatility. In this assignment, we will explore the fundamentals of Apache Spark and its application in various aspects of Big Data Analytics.

**Understanding Apache Spark**

Apache Spark is an open-source distributed computing framework designed for large-scale data processing. It provides a unified platform for batch processing, real-time streaming, machine learning, and graph processing, making it suitable for a wide range of Big Data analytics tasks. Key features of Apache Spark include:

1. **Speed**: Spark offers in-memory computation, which significantly accelerates data processing compared to traditional disk-based systems like Hado

Typology: Study Guides, Projects, Research

2022/2023

Uploaded on 02/13/2024

lama-ahmed 🇪🇬

1 document


Cairo University
Faculty of Computers and Artificial Intelligence
Managing and Modelling Big Data (2023/2024)

Assignment 2

Dataset – Wikimedia Project:

The Wikimedia Foundation supports hundreds of thousands of people around the world in creating the largest free knowledge projects in history. The work of volunteers helps millions of people around the globe discover information, contribute knowledge, and share it with others no matter their bandwidth.

In this task you are going to explore the page views of Wikimedia projects. Download the page view statistics generated between 0-1 am on Jan 1, 2016 from here.

Each line, delimited by a white space, contains the statistics for one Wikimedia page. The schema looks as follows:

| Field        | Meaning                                    |
|--------------|--------------------------------------------|
| Project code | The project identifier for each page.      |
| Page title   | A string containing the title of the page. |
| Page hits    | Number of requests on the specific hour.   |
| Page size    | Size of the page                           |
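As a rough illustration of the schema, the sketch below parses one whitespace-delimited line into a typed record. The names `PageView` and `parse_line` and the sample line are hypothetical and not part of the assignment. Skipping malformed lines is an assumption about how to treat bad input.

```python
from typing import NamedTuple, Optional


class PageView(NamedTuple):
    """One line of the page view file (hypothetical record type)."""
    project: str  # project code, e.g. "en"
    title: str    # page title, terms separated by "_"
    hits: int     # number of requests in the hour
    size: int     # size of the page


def parse_line(line: str) -> Optional[PageView]:
    """Return a PageView, or None when the line does not match the schema."""
    parts = line.split()  # splits on any run of whitespace
    if len(parts) != 4:
        return None
    project, title, hits, size = parts
    try:
        return PageView(project, title, int(hits), int(size))
    except ValueError:
        # hits or size was not an integer; treat the line as malformed
        return None


# Made-up sample line for illustration only
record = parse_line("en The_Beatles 12 345678")
print(record)
```

In a Spark job, a function like this would typically be applied to each line of the input file, with the `None` results filtered out before running the queries.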

Develop a Spark application in any programming language that implements the functions below twice: once using the map-reduce paradigm in Spark and once using Spark loops. Compare the performance of the two implementations in terms of execution time.
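One way to read this comparison is sketched below in plain Python, using the min, max, and average computation as the example. It assumes "Spark loops" means iterating over the records explicitly, which is only one interpretation of the handout. The map and reduce functions are written so that they could be passed to Spark's `RDD.map` and `RDD.reduce`, but `functools.reduce` stands in here so the sketch runs without a cluster. The `sizes` list is made-up data, and both functions assume a non-empty input.

```python
import time
from functools import reduce

# Hypothetical page sizes standing in for the fourth field of each record
sizes = [120, 4500, 78, 9900, 310]


def merge(a, b):
    """Combine two (min, max, sum, count) summaries; associative and commutative."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def stats_map_reduce(values):
    # map: turn each size into a one-element summary
    mapped = map(lambda s: (s, s, s, 1), values)
    # reduce: fold the summaries into a single one (fails on empty input)
    lo, hi, total, count = reduce(merge, mapped)
    return lo, hi, total / count


def stats_loop(values):
    # explicit loop keeping running min, max, sum, and count
    lo = hi = None
    total = count = 0
    for s in values:
        lo = s if lo is None else min(lo, s)
        hi = s if hi is None else max(hi, s)
        total += s
        count += 1
    return lo, hi, total / count


def timed(fn, values):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(values)
    return result, time.perf_counter() - start


for fn in (stats_map_reduce, stats_loop):
    result, elapsed = timed(fn, sizes)
    print(fn.__name__, result, f"{elapsed:.6f}s")
```

Timings on a five-element list say nothing about real performance. On the actual dataset, the same `timed` wrapper would need to surround a Spark action (such as a reduce), because Spark transformations are lazy and do no work until an action runs.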

You must also create a document that includes all the results of each query:

1) Compute the min, max, and average page size.

2) Determine the number of page titles that start with the article “The”. How many of those page titles are not part of the English project (pages that are part of the English project have “en” as the first field)?

3) Determine the number of unique terms appearing in the page titles. Note that in page titles, terms are delimited by “_” instead of a white space. You can use any number of normalization steps (e.g. lowercasing, removal of non-alphanumeric characters).

4) Extract each title and the number of times it was repeated.

5) Combine the data of pages that share the same title, and save each pair of pages' data so that it can be displayed.
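Queries 2 to 5 can be sketched on a handful of in-memory records as follows. This is plain Python on made-up data, not a Spark implementation. In Spark, the corresponding RDD operations would likely be `filter`, `flatMap` with `distinct`, `reduceByKey`, and `join` or `groupByKey`. Two assumptions are made and marked in the comments: "The" must be a whole first term (so "Theory" does not count), and normalization keeps only lowercase ASCII letters and digits.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (project, title, hits, size)
records = [
    ("en", "The_Beatles", 12, 3456),
    ("de", "The_Beatles", 3, 2100),
    ("en", "Theory_of_relativity", 7, 9000),
    ("fr", "Paris", 5, 1500),
    ("en", "the_end", 1, 200),
]


def starts_with_the(title):
    # Assumption: case-sensitive, and "The" must be a complete first term
    return title == "The" or title.startswith("The_")


# Query 2: titles starting with "The", and how many are not English
the_pages = [r for r in records if starts_with_the(r[1])]
the_total = len(the_pages)
the_non_english = sum(1 for r in the_pages if r[0] != "en")


def terms(title):
    # Assumption: lowercase, strip anything outside a-z and 0-9, drop empties
    cleaned = (re.sub(r"[^a-z0-9]", "", t.lower()) for t in title.split("_"))
    return [t for t in cleaned if t]


# Query 3: unique normalized terms across all titles
unique_terms = {t for _, title, _, _ in records for t in terms(title)}

# Query 4: each title with the number of times it appears
title_counts = Counter(title for _, title, _, _ in records)

# Query 5: group records by title, then emit every pair within a group
by_title = defaultdict(list)
for r in records:
    by_title[r[1]].append(r)
pairs = [p for group in by_title.values() for p in combinations(group, 2)]

print(the_total, the_non_english, len(unique_terms))
print(title_counts.most_common(1))
print(pairs)
```

Using `combinations` yields each unordered pair once. A self-join on title in Spark would instead produce both orderings plus self-pairs, which would need filtering if the same output is wanted.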


Important Notes:

This is a group assignment of at most 4 members, and the members should be from the same group/lab.
All team members should contribute to and fully understand everything in the assignment. Even if you divide the questions among yourselves, you should understand your colleagues' questions.
The due date is Thursday, 21st of December. No late submission is allowed. No submission through e-mail.
Do not share your code with anyone, so that no other student can take your files and submit them under their own name.
Any cheating will result in a grade of ZERO for both teams.
Any team member who misses the discussion will receive ZERO.
