Introduction Data Engineering, Lecture notes of Data Mining

Introduction to data engineering and data science; comparison of the roles of data scientists and data engineers

Typology: Lecture notes

2019/2020

Uploaded on 05/30/2020

An Overview of Data Engineering

Objectives

In this first chapter, you will:
• Be exposed to the world of data engineering.
• Explore the differences between a data engineer and a data scientist.
• Get an overview of the various tools data engineers use.
• Expand your understanding of how cloud technology plays a role in data engineering.

The DATA Problem

• Preparing data for analytics is hard.
• Data is often the biggest challenge of self-service analytics.
• Self-service analytics allows end users to easily analyze their data by building their own reports and modifying existing ones with little to no training.

The Data Problem in Self Service Analytics

• Half of the organizations are accessing external data sources.
• Data is scattered.

The Data Problem in Self Service Analytics

• Databases need to be optimized so that they are:
  ◦ faster to query
  ◦ free of corrupt data

The DATA Problem Solution

• In comes the data engineer to the rescue.

Tasks of a Data Engineer

The tasks of a data engineer consist of:
• developing a scalable data architecture (schema)
• streamlining data acquisition
• setting up processes that bring data together from several sources
• safeguarding data quality by cleaning up corrupt data
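
Two of the tasks above, bringing data together from several sources and cleaning up corrupt data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "sources" (`CRM_CSV` and `ORDERS`), their fields, and the cleaning rules are hypothetical and do not come from the lecture.

```python
import csv
import io

# Hypothetical raw inputs standing in for two separate sources.
CRM_CSV = """customer_id,email
1,ana@example.com
2,
3,carl@example.com
"""

ORDERS = [
    {"customer_id": "1", "amount": "19.99"},
    {"customer_id": "3", "amount": "not-a-number"},  # corrupt amount
    {"customer_id": "3", "amount": "5.00"},
    {"customer_id": "9", "amount": "7.50"},  # unknown customer
]


def load_customers(text):
    """Read customers from CSV text, skipping rows with a missing email."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["customer_id"]: row["email"] for row in rows if row["email"]}


def clean_orders(orders):
    """Keep only orders whose amount parses as a number."""
    cleaned = []
    for order in orders:
        try:
            amount = float(order["amount"])
        except ValueError:
            continue  # drop the corrupt record
        cleaned.append({"customer_id": order["customer_id"], "amount": amount})
    return cleaned


def combine(customers, orders):
    """Attach each clean order to its customer's email; drop orphan orders."""
    return [
        {"email": customers[o["customer_id"]], "amount": o["amount"]}
        for o in orders
        if o["customer_id"] in customers
    ]


combined = combine(load_customers(CRM_CSV), clean_orders(ORDERS))
```

Real pipelines would usually log or quarantine rejected records rather than silently dropping them; the sketch drops them only to stay short.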

Tasks of a Data Engineer

• Data engineers design, build, and maintain data architectures for large-scale applications.
• This career path requires strong software engineering skills.
• Essentially, a data engineer needs the skills to build a data pipeline that connects all the pieces of the data ecosystem and keeps it up and running.
• Data engineering is the first, and arguably the most crucial, step toward a successful data strategy.
• Data engineers make sure data scientists have the data they need to perform data science.

Tasks of a Data Engineer

• To see just how important data engineering is for data science, consider the hierarchy of needs proposed by Monica Rogati.

Exercise 1

• There are some differences between the tasks of data scientists and those of data engineers.
• Below are three essential tasks that need to happen in a data-driven company. Which one best fits the job of a data engineer?
  ◦ Apply a statistical model to a large dataset to find outliers.
  ◦ Set up scheduled ingestion of data from the application databases to an analytical database.
  ◦ Come up with a database schema for an application.

Exercise 2

Classify each task as belonging to the data engineer (red) or the data scientist (blue):
• Cloud technology
• Mining data for patterns
• Monitor business processes
• Streamline data acquisition
• Clean statistical outliers in data
• Set up processes to bring together data
• Statistical modeling
• Develop scalable data architecture
• Predictive models using machine learning
• Clean corrupt data

Tools of the Data Engineer: Database Systems

• Data engineers are expert users of database systems.
• A database is a computer system that holds large amounts of data.
• Applications rely on databases to provide certain functionality.
• Other databases are used for analyses.
• The data engineer’s task begins and ends at databases.
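
The difference between a database serving an application and one used for analysis shows up in the kind of query each runs. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and its rows are hypothetical, and a single table plays both roles only to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("carl", 5.0), ("ana", 10.0)],
)

# Application-style access: look up a single record by its key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# Analytical-style access: aggregate over the whole table.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

conn.close()
```

In practice the two workloads often live in separate systems, with the data engineer moving data from the application database into the analytical one.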

Tools of the Data Engineer: Processing

• Data engineers use tools for quickly processing data, in order to:
  ◦ clean data
  ◦ aggregate data
  ◦ join data
• Huge volumes of data have to be processed, which is where parallel processing comes into play.
• Data engineers use clusters of machines to process data.
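
The structure of parallel processing (split the data, let each worker aggregate its own part, then merge the partial results) can be sketched on a single machine. The example below is an assumption-laden illustration: the records are hypothetical, and a thread pool from Python's standard library stands in for a cluster of machines. CPython threads do not give real CPU parallelism for this kind of work, so the sketch shows the shape of the computation, not a speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (country, sales amount).
RECORDS = [("JO", 10), ("DE", 5), ("JO", 7), ("US", 3), ("DE", 1), ("US", 4)]


def split(records, n_chunks):
    """Split the records into n_chunks roughly equal slices."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def aggregate_chunk(chunk):
    """Sum the amounts per country within one chunk."""
    partial = Counter()
    for country, amount in chunk:
        partial[country] += amount
    return partial


def parallel_totals(records, n_workers=3):
    """Aggregate each chunk on its own worker, then merge the partial sums."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(aggregate_chunk, chunks)
        totals = Counter()
        for partial in partials:
            totals.update(partial)  # Counter.update adds the counts together
    return dict(totals)


sales_totals = parallel_totals(RECORDS)
```

Cluster frameworks follow the same split, aggregate, and merge pattern, with the chunks distributed across machines instead of threads.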