The Data Engineering Cookbook
Mastering The Plumbing Of Data Science
Andreas Kretz
August 10, 2019
v2.2


Part I

Introduction

1 How To Use This Cookbook

What do you actually need to learn to become an awesome data engineer? Look no further, you’ll find it here.

If you are looking for AI algorithms and similar data science topics, this book is not for you.

How to use this document: First of all, this is not a training course! This cookbook is a collection of skills that I value highly in my daily work as a data engineer. It’s intended to be a starting point for you to find the topics to look into and become an awesome data engineer.

You are going to find five types of content in this book: articles I wrote, links to my podcast episodes (video & audio), more than 200 links to helpful websites I like, data engineering interview questions and case studies.

This book is a work in progress! As you can see, this book is not finished. I’m constantly adding new stuff and doing videos for the topics. But obviously, because I do this as a hobby, my time is limited. You can help make this book even better.

Help make this book awesome! If you have some cool links or topics for the cookbook, please become a contributor on GitHub: https://github.com/andkret/Cookbook. Pull the repo, add them and create a pull request. Or join the discussion by opening Issues. You can also write me an email any time to [email protected]. Tell me your thoughts, what you value, what you think should be included, or correct me where I am wrong.

This Cookbook is and will always be free! I don’t want to sell you this book, but please support what you like and join my Patreon: https://www.patreon.com/plumbersofds

Check out this podcast episode where I talk in detail about why I decided to share all this information for free: #079 Trying to stay true to myself and making the cookbook public on GitHub

Yes, machine learning, it works like this:

You feed an algorithm with measurement data. It generates a model and optimises it based on the data you fed it. That model basically represents a pattern of how your data looks. You show that model new data and the model will tell you if the new data still resembles the data you trained it with. This technique can also be used for predicting machine failure in advance. Of course the whole process is not that simple.
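To make the idea concrete, here is a minimal sketch in Python. It is not a real machine learning algorithm: the "model" is just the mean and standard deviation of the training data, and the function names, the sample readings and the threshold of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def train_model(measurements):
    """'Train' a very simple model: learn the typical level and spread of the data."""
    return {"mean": mean(measurements), "std": stdev(measurements)}


def fits_pattern(model, value, max_deviations=3.0):
    """Return True if a new measurement still looks like the training data."""
    deviation = abs(value - model["mean"]) / model["std"]
    return deviation <= max_deviations


# Hypothetical vibration readings from a healthy machine (made-up numbers)
training_data = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 0.98]
model = train_model(training_data)

print(fits_pattern(model, 1.02))  # reading close to the learned pattern
print(fits_pattern(model, 2.5))   # reading far away from the learned pattern
```

A real pipeline would swap the mean and standard deviation for a proper algorithm, but the two steps stay the same: learn a pattern from data, then compare new data against it.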

The actual process of training and applying a model is not that hard. A lot of work for the data scientist is to figure out how to pre-process the data that gets fed to the algorithms.

To train an algorithm you need useful data. If you use just any data for the training, the resulting model will be very unreliable.

An unreliable model for predicting machine failure would tell you that your machine is damaged even if it is not. Or even worse: It would tell you the machine is ok even when there is a malfunction.

Model outputs are very abstract. You also need to post-process the model outputs to obtain health values from 0 to 100.
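As a rough illustration of such post-processing, the sketch below maps an abstract deviation score onto a health value between 0 and 100. The linear mapping, the function name `health_value` and the `worst_deviation` cut-off are assumptions; the text does not say how the mapping is actually done.

```python
def health_value(deviation, worst_deviation=6.0):
    """Map a raw deviation score onto a health value from 0 (failed) to 100 (healthy).

    Assumption: health falls linearly as the deviation grows, and anything
    at or beyond worst_deviation counts as completely unhealthy.
    """
    clipped = min(max(deviation, 0.0), worst_deviation)
    return round(100.0 * (1.0 - clipped / worst_deviation))


print(health_value(0.0))   # no deviation from the learned pattern
print(health_value(3.0))   # halfway to the assumed worst case
print(health_value(10.0))  # beyond the assumed worst case, clipped
```

Clipping the score first keeps the result inside the 0 to 100 range even when the model output is far outside anything seen in training.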

Figure 2.1: The Machine Learning Pipeline

2.2 Data Engineer

Data Engineers are the link between the management’s big data strategy and the data scientists who need to work with data.

What they do is build the platforms that enable data scientists to do their magic.

These platforms are usually used in four different ways:

  • Data ingestion and storage of large amounts of data
  • Algorithm creation by data scientists
  • Automation of the data scientist’s machine learning models and algorithms for production use
  • Data visualisation for employees and customers

Most of the time these guys start as traditional solution architects for systems that involve SQL databases, web servers, SAP installations and other “standard” systems.

But to create big data platforms the engineer needs to be an expert in specifying, setting up and maintaining big data technologies like Hadoop, Spark, HBase, Cassandra, MongoDB, Kafka, Redis and more.

They also need experience in deploying systems on cloud infrastructure, such as Amazon's or Google's, or on on-premise hardware.

Podcast Episode: #048 From Wannabe Data Scientist To Engineer My Journey
In this episode Kate Strachnyi interviews me for her Humans of Data Science podcast. We talk about how I found out that I am more into the engineering part of data science.
YouTube: Click here to watch
Audio: Click here to listen

Table 2.2: Podcast: 048 From Wannabe Data Scientist To Engineer My Journey

2.3 Who Companies Need

For a good company it is absolutely important to get well-trained data engineers and data scientists. Think of the data scientist as the professional race car driver. A fit athlete with talent and driving skills like you have never seen.

What he needs to win races is someone who will provide him with the perfect race car to drive. That’s what the solution architect is for.

Like the driver and his team, the data scientist and the data engineer need to work closely together. They need to know the different big data tools inside and out.

That’s why companies are looking for people with Spark experience. It is a common ground between both that drives innovation.

Spark gives data scientists the tools to do analytics and helps engineers to bring the data scientist’s algorithms into production. After all, those two decide how good the data

Part II

Basic Data Engineering Skills

3 Learn To Code

Why this is important: Without coding you cannot do much in data engineering. I cannot count the number of times I needed a quick Java hack.

The possibilities are endless:

  • Writing data to or quickly getting some data out of a SQL DB
  • Testing the production of messages to a Kafka topic
  • Understanding the source code of a Java web service
  • Reading counter statistics out of an HBase key-value store
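As an example of the first item, here is a small sketch of such a "quick hack" that writes a few rows to a SQL database and reads an aggregate back out. It is written in Python rather than Java to keep it short, and it uses an in-memory SQLite database as a stand-in for a real server. The table name, column names and values are made up for illustration.

```python
import sqlite3

# In-memory database so the sketch runs without any setup
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (machine_id TEXT, temperature REAL)")

# Write some (made-up) data
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [("m1", 71.5), ("m1", 73.0), ("m2", 64.2)],
)
conn.commit()

# Quickly get some data back out: average temperature per machine
rows = conn.execute(
    "SELECT machine_id, AVG(temperature) FROM sensor_readings "
    "GROUP BY machine_id ORDER BY machine_id"
).fetchall()
print(rows)

conn.close()
```

Against a production database only the connection line would change, assuming a driver that follows the same Python DB-API conventions.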

So, which language do I recommend then?

I highly recommend Java. It’s everywhere!

When you are getting into data processing with Spark, you should use Scala. But after learning Java, this is easy to do.

Python is also a great choice. It is super versatile.

Personally, however, I am not that big into Python. But I am going to look into it.

Where to Learn? There’s a Java Course on Udemy you could look at: https://www.udemy.com/java-programming-tutorial-for-beginners

  • OOP (object-oriented programming)
  • What unit tests are, and how they make sure your code is working
  • Functional programming
  • How to use build management tools like Maven
  • Resilient testing (?)
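Two of these topics, functional programming and unit tests, fit into one short sketch. The function `total_valid` and its test cases are hypothetical examples, written in Python with the standard `unittest` module. In Java the same idea would typically be expressed with streams and a test framework such as JUnit.

```python
import unittest
from functools import reduce


def total_valid(readings):
    """Functional style: filter out invalid readings, then fold the rest into a sum."""
    valid = filter(lambda r: r >= 0, readings)
    return reduce(lambda acc, r: acc + r, valid, 0)


class TotalValidTest(unittest.TestCase):
    """Unit tests that check the function does what we expect."""

    def test_ignores_negative_readings(self):
        self.assertEqual(total_valid([1, -5, 2, 3]), 6)

    def test_empty_input(self):
        self.assertEqual(total_valid([]), 0)


# Run the tests programmatically and keep the result object around
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TotalValidTest)
result = unittest.TextTestRunner().run(suite)
```

The function has no side effects, which is what makes it so easy to test: the same input always gives the same output.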

I talked about the importance of learning by doing in this podcast: https://anchor.fm/andreaskayy/episodes/Learning-By-Doing-Is-The-Best-Thing-Ever---PoDS-035-e25g

5 Agile Development

Agility is the ability to adapt quickly to changing circumstances.

These days everyone wants to be agile. Big company or small, people are looking for the “startup mentality”.

Many think it’s the corporate culture. Others think it’s the process by which we create things that matters.

In this article I am going to talk about agility and self-reliance, and about how you can incorporate agility in your professional career.

5.1 Why is agile so important?

Historically, development has been practiced as a rigidly defined process. You think of something, specify it, have it developed and then have it built in mass production.

It’s a bit of an arrogant process. You assume that you already know exactly what a customer wants. Or how a product has to look and how everything works out.

The problem is that the world does not work this way!

Oftentimes the circumstances change because of internal factors.

Sometimes things just do not work out as planned or stuff is harder than you think.

You need to adapt.

Other times you find out that you built something customers do not like and that needs to be changed.

You need to adapt.

That’s why people jump on the Scrum train. Because Scrum is the definition of agile development, right?

5.2 Agile rules I learned over the years

5.2.1 Is the method making a difference?

Yes, Scrum or Google’s OKR can help you to be more agile. The secret to being agile, however, is not only in how you create.

What makes me cringe is when people try to tell you that being agile starts in your head. So, the problem is you?

No!

The biggest lesson I have learned over the past years is this: Agility goes down the drain when you outsource work.

5.2.2 The problem with outsourcing

I know on paper outsourcing seems like a no-brainer: development costs weighed against fixed costs.

It is expensive to tie up existing resources on a task. It is even more expensive if you need to hire new employees.

The problem with outsourcing is that you pay someone to build stuff for you.

It does not matter who you pay to do something for you. He needs to make money.

His agenda will be to spend as little time as possible on your work. That is why outsourcing requires contracts, detailed specifications, timetables and delivery dates.

He doesn’t want to spend additional time on a project just because you want changes in the middle. Every unplanned change costs him time and therefore money.

If you do want changes, you need to make another detailed specification and a contract change.

He is not going to put his mind into improving the product while developing. Firstly because he does not have the big picture. Secondly because he does not want to.

He is doing as he is told.

Who can blame him? If I were the subcontractor I would do exactly the same!

Does this sound agile to you?