































An overview of data engineering, focusing on the three types of data: structured, semi-structured, and unstructured. It describes the characteristics, examples, and applications of each type and explains why understanding data structures matters in data engineering. It also covers SQL, its role in data engineering and data science, and the concept of database schemas. Finally, it discusses data warehouses and data lakes, how they differ, and why data catalogs matter for data governance.
































UNDERSTANDING DATA ENGINEERING
Hadrien Lacroix
Content Developer at DataCamp
Structured data:
- Easy to search and organize
- Consistent model, rows and columns
- Defined types
- Can be grouped to form relations
- Stored in relational databases
- About 20% of the data is structured
- Created and queried using SQL
office   address         number  city      zipcode
Belgium  Martelarenlaan  38      Leuven    3010
UK       Old Street      207     London    EC1V 9NR
USA      5th Ave         350     New York  10118
index  last_name   first_name  office   address         number  city      zipcode
0      Thien       Vivian      Belgium  Martelarenlaan  38      Leuven    3010
1      Huong       Julian      Belgium  Martelarenlaan  38      Leuven    3010
2      Duplantier  Norbert     UK       Old Street      207     London    EC1V 9NR
3      McColgan    Jeff        USA      5th Ave         350     New York  10118
4      Sanchez     Rick        USA      5th Ave         350     New York  10118
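The point that structured data "can be grouped to form relations" can be illustrated with the two tables above. The following is a minimal sketch using Python's built-in sqlite3 module, not a definitive schema: it assumes the office details are stored once in an `offices` table and that employees refer to them by office name. The table names, the `idx` column (renamed from `index`, which is a reserved word in SQL), and the choice of TEXT for `zipcode` are assumptions made for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# One table per entity: each office is stored exactly once.
conn.execute("""
    CREATE TABLE offices (
        office  TEXT PRIMARY KEY,
        address TEXT,
        number  INTEGER,
        city    TEXT,
        zipcode TEXT  -- TEXT because UK postcodes contain letters
    )
""")
conn.execute("""
    CREATE TABLE employees (
        idx        INTEGER PRIMARY KEY,  -- 'index' in the slide
        last_name  TEXT,
        first_name TEXT,
        office     TEXT REFERENCES offices(office)
    )
""")

conn.executemany(
    "INSERT INTO offices VALUES (?, ?, ?, ?, ?)",
    [
        ("Belgium", "Martelarenlaan", 38, "Leuven", "3010"),
        ("UK", "Old Street", 207, "London", "EC1V 9NR"),
        ("USA", "5th Ave", 350, "New York", "10118"),
    ],
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (0, "Thien", "Vivian", "Belgium"),
        (1, "Huong", "Julian", "Belgium"),
        (2, "Duplantier", "Norbert", "UK"),
        (3, "McColgan", "Jeff", "USA"),
        (4, "Sanchez", "Rick", "USA"),
    ],
)


def employees_with_addresses(connection):
    """Join the two relations back into the wide employee table."""
    return connection.execute("""
        SELECT e.idx, e.last_name, e.first_name,
               o.office, o.address, o.number, o.city, o.zipcode
        FROM employees AS e
        JOIN offices AS o ON o.office = e.office
        ORDER BY e.idx
    """).fetchall()


rows = employees_with_addresses(conn)
for row in rows:
    print(row)
```

Storing each address once means that an office move is a single-row update in `offices` rather than an edit to every employee row.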
Semi-structured data, for example JSON:

    {
      "user_1645156": {
        "last_name": "Lacroix",
        "first_name": "Hadrien",
        "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
      },
      "user_5913764": {
        "last_name": "Billen",
        "first_name": "Sara",
        "favorite_artists": ["Tamino", "Taylor Swift"]
      },
      "user_8436791": {
        "last_name": "Sulmont",
        "first_name": "Lis",
        "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
      },
      ...
    }
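To show why records like these count as semi-structured, here is a small sketch using Python's standard json module. It assumes the intended layout is one object per user ID, which is an interpretation of the example above, and the helper names `artist_counts` and `flatten` are hypothetical. Each record has the same keys, but the `favorite_artists` list varies in length, so the data does not fit a fixed set of columns until it is flattened.

```python
import json

# The three users from the example, written as valid JSON.
raw = """
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  }
}
"""

users = json.loads(raw)


def artist_counts(users):
    """Number of favorite artists per user; the count differs per record."""
    return {
        user_id: len(profile.get("favorite_artists", []))
        for user_id, profile in users.items()
    }


def flatten(users):
    """Turn the nested records into (user_id, artist) rows, one per pair."""
    return [
        (user_id, artist)
        for user_id, profile in users.items()
        for artist in profile.get("favorite_artists", [])
    ]


print(artist_counts(users))
for row in flatten(users):
    print(row)
```

The flattened rows have a fixed shape and could be loaded into a relational table, which is one common way of moving from semi-structured to structured data.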
Unstructured data:
- Does not follow a model, can't be contained in rows and columns
- Difficult to search and organize
- Usually text, sound, pictures or videos
- Usually stored in data lakes, but can appear in data warehouses or databases
- Most of the data is unstructured
- Can be extremely valuable
Ways to make unstructured data easier to work with:
- Use AI to search and organize unstructured data
- Add information to make it semi-structured
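As a rough illustration of adding information to make unstructured data semi-structured, the sketch below wraps a piece of free text in a record with a few descriptive fields. Everything in it is an assumption made for illustration: the field names, the `GENRE_KEYWORDS` table, and the sample review are invented, and the keyword lookup is only a stand-in for the AI model that would do the tagging in practice.

```python
import json

# Hypothetical keyword lists standing in for a trained classifier.
GENRE_KEYWORDS = {
    "metal": ("riff", "growl", "distortion"),
    "pop": ("catchy", "chorus", "dance"),
}


def tag_text(text):
    """Return the genres whose keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(
        genre
        for genre, words in GENRE_KEYWORDS.items()
        if any(word in lowered for word in words)
    )


def to_semi_structured(doc_id, text):
    """Wrap raw text in a record with searchable metadata fields."""
    return {
        "id": doc_id,
        "genres": tag_text(text),
        "length": len(text),
        "body": text,  # the original unstructured content is kept as-is
    }


record = to_semi_structured(
    "review_1", "Heavy riffs and a surprisingly catchy chorus."
)
print(json.dumps(record, indent=2))
```

Once the text carries fields such as `genres`, it can be filtered and grouped by them even though the body itself remains free text.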
Summary:
- Structured data
- Semi-structured data
- Unstructured data
- Differences between the three
- Examples of each
- SQL stands for Structured Query Language
- Industry standard for Relational Database Management Systems (RDBMS)
- Allows you to access many records at once, and group, filter or aggregate them
- Close to written English, easy to write and understand
- Data engineers use SQL to create and maintain databases
- Data scientists use SQL to query (request information from) databases
Data engineers use SQL to create, maintain and update tables.

    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    );
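The create, maintain and update steps can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions rather than a production setup: SQLite stands in for whichever RDBMS is actually used, the roles, teams and full-time flags in the sample rows are made up, and the `start_date` column is a hypothetical addition used only to show a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: the same columns as the CREATE TABLE statement above.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    )
""")

# Sample rows; roles, teams and full_time values are invented.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "Vivian", "Thien", "Data Engineer", "Platform", True, "Belgium"),
        (1, "Julian", "Huong", "Data Scientist", "Analytics", True, "Belgium"),
        (2, "Norbert", "Duplantier", "Accountant", "Finance", False, "UK"),
    ],
)

# Update: change a value in an existing row.
conn.execute(
    "UPDATE employees SET team = ? WHERE employee_id = ?",
    ("Platform", 2),
)

# Maintain: evolve the schema (start_date is a hypothetical new column).
conn.execute("ALTER TABLE employees ADD COLUMN start_date TEXT")

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
print(columns)
```

Existing rows get NULL in the new column, so adding it does not require rewriting the data that is already stored.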
Data scientists use SQL to query, filter, group and aggregate data in tables.

    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%';
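The query above filters rows; grouping and aggregating work in a similar way. The sketch below runs both kinds of query with Python's built-in sqlite3 module against a small `employees` table. The roles, teams and full-time flags in the sample rows are made up for illustration, and SQLite is only an assumed stand-in for the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    )
""")

# Sample rows; roles, teams and full_time values are invented.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "Vivian", "Thien", "Data Engineer", "Platform", True, "Belgium"),
        (1, "Julian", "Huong", "Data Scientist", "Analytics", True, "Belgium"),
        (2, "Norbert", "Duplantier", "Accountant", "Finance", False, "UK"),
        (3, "Jeff", "McColgan", "Office Manager", "Operations", True, "USA"),
        (4, "Rick", "Sanchez", "Data Analyst", "Analytics", True, "USA"),
    ],
)

# Filter: the query from the slide, with an ORDER BY for stable output.
data_people = conn.execute("""
    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%'
    ORDER BY employee_id
""").fetchall()

# Group and aggregate: count employees per office.
headcount = conn.execute("""
    SELECT office, COUNT(*) AS n_employees
    FROM employees
    GROUP BY office
    ORDER BY office
""").fetchall()

print(data_people)
print(headcount)
```

The `%` wildcards in the LIKE pattern match any sequence of characters, so the filter keeps every role that contains "Data" anywhere in its name.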