Understanding Data Engineering: Structured, Semi-structured, and Unstructured Data

This document gives an overview of data engineering, focusing on the three types of data: structured, semi-structured, and unstructured. It covers the characteristics, examples, and applications of each type and explains why understanding data structures matters in data engineering. It also covers SQL, its role in data engineering and data science, and the concept of database schemas. Finally, it discusses data warehouses and data lakes, their differences, and the importance of data catalogs for data governance.

Typology: Summaries

2023/2024

Uploaded on 12/26/2024

vimalan-kumarakulasingam

Data structures

Understanding Data Engineering
Hadrien Lacroix, Content Developer at DataCamp

Structured data

Easy to search and organize
Consistent model, rows and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of all data is structured
Created and queried using SQL

Relational database

office   address         number  city      zipcode
Belgium  Martelarenlaan  38      Leuven    3010
UK       Old Street      207     London    EC1V 9NR
USA      5th Ave         350     New York  10118

Relational database

index  last_name   first_name  office   address         number  city      zipcode
0      Thien       Vivian      Belgium  Martelarenlaan  38      Leuven    3010
1      Huong       Julian      Belgium  Martelarenlaan  38      Leuven    3010
2      Duplantier  Norbert     UK       Old Street      207     London    EC1V 9NR
3      McColgan    Jeff        USA      5th Ave         350     New York  10118
4      Sanchez     Rick        USA      5th Ave         350     New York  10118
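The employee table above repeats the full office address on every row. As a minimal sketch of how structured data "can be grouped to form relations", the same rows could be split into two related tables and joined back together. This uses Python's built-in `sqlite3` module; the `offices`/`employees` split, the column types, and the `id` column name (used here in place of `index`, which is a reserved word in SQL) are assumptions for illustration and do not come from the slides.

```python
import sqlite3

# Hypothetical split of the wide table into two related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offices (
    office  TEXT PRIMARY KEY,
    address TEXT,
    number  INTEGER,
    city    TEXT,
    zipcode TEXT
);
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT,
    first_name TEXT,
    office     TEXT REFERENCES offices(office)
);
""")

# Each office address is stored once.
conn.executemany(
    "INSERT INTO offices VALUES (?, ?, ?, ?, ?)",
    [
        ("Belgium", "Martelarenlaan", 38, "Leuven", "3010"),
        ("UK", "Old Street", 207, "London", "EC1V 9NR"),
        ("USA", "5th Ave", 350, "New York", "10118"),
    ],
)

# Employees only carry the name of their office.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (0, "Thien", "Vivian", "Belgium"),
        (1, "Huong", "Julian", "Belgium"),
        (2, "Duplantier", "Norbert", "UK"),
        (3, "McColgan", "Jeff", "USA"),
        (4, "Sanchez", "Rick", "USA"),
    ],
)

# Joining on the shared "office" column rebuilds the wide table.
rows = conn.execute("""
    SELECT e.id, e.last_name, e.first_name,
           o.office, o.address, o.number, o.city, o.zipcode
    FROM employees AS e
    JOIN offices AS o ON o.office = e.office
    ORDER BY e.id
""").fetchall()

for row in rows:
    print(row)
```

With this layout, a change of office address would be made in one row of `offices` rather than in every matching employee row.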

Favorite artists JSON file

{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  },
  "user_8436791": {
    "last_name": "Sulmont",
    "first_name": "Lis",
    "favorite_artists": ["Arctic Monkeys", "Rihanna", "Nina Simone"]
  },
  ...
}
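A file like this can be read with a few lines of code even though it does not fit a fixed table. The sketch below uses Python's standard `json` module on two of the records above, written out as valid JSON; the variable names are illustrative only.

```python
import json

# Two of the records from the file above, written as valid JSON.
raw = """
{
  "user_1645156": {
    "last_name": "Lacroix",
    "first_name": "Hadrien",
    "favorite_artists": ["Fools in Deed", "Gojira", "Pain", "Nanowar of Steel"]
  },
  "user_5913764": {
    "last_name": "Billen",
    "first_name": "Sara",
    "favorite_artists": ["Tamino", "Taylor Swift"]
  }
}
"""

users = json.loads(raw)

# Every record has the same keys, but the lists differ in length,
# which is awkward to fit into fixed rows and columns.
artist_counts = {
    user_id: len(user["favorite_artists"]) for user_id, user in users.items()
}

for user_id, user in users.items():
    print(user_id, user["first_name"], user["last_name"], artist_counts[user_id])
```

The varying number of favorite artists per user is the kind of flexibility that a strict row-and-column model does not offer directly.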

Unstructured data

Does not follow a model, can't be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most of the data is unstructured
Can be extremely valuable

Adding some structure

Use AI to search and organize unstructured data
Add information to make it semi-structured
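One way to picture "adding information" is to wrap a piece of free text in a small record with descriptive fields. The sketch below is only an illustration: the `add_metadata` helper, the field names, and the sample review are all hypothetical, and real pipelines might derive the tags with a model instead of passing them in by hand.

```python
import json


def add_metadata(text, source, tags):
    """Wrap free text in a record with a few descriptive fields (hypothetical schema)."""
    return {
        "source": source,
        "tags": sorted(tags),
        "word_count": len(text.split()),
        "text": text,
    }


# Made-up unstructured input: a plain sentence with no model behind it.
review = "Great concert last night, the sound was perfect."

record = add_metadata(review, source="customer_review", tags={"music", "positive"})
print(json.dumps(record, indent=2))
```

The text itself stays unstructured, but the surrounding fields make it possible to search and group records by source or tag.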

Summary

Structured data
Semi-structured data
Unstructured data
Differences between the three
Give examples

SQL databases

SQL

Structured Query Language
Industry standard for Relational Database Management Systems (RDBMS)
Allows you to access many records at once, and group, filter or aggregate them
Close to written English, easy to write and understand
Data engineers use SQL to create and maintain databases
Data scientists use SQL to query (request information from) databases

SQL for data engineers

Data engineers use SQL to create, maintain and update tables.

CREATE TABLE employees (
    employee_id INT,
    first_name VARCHAR(255),
    last_name VARCHAR(255),
    role VARCHAR(255),
    team VARCHAR(255),
    full_time BOOLEAN,
    office VARCHAR(255)
);
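The statement above can be tried out locally. The following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database, which is an assumption of this example (the slides do not name a database engine). The inserted row and the extra `start_date` column are hypothetical and only serve to show a table being created, filled, and then updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The CREATE TABLE statement from the slide, unchanged.
conn.execute("""
    CREATE TABLE employees (
        employee_id INT,
        first_name VARCHAR(255),
        last_name VARCHAR(255),
        role VARCHAR(255),
        team VARCHAR(255),
        full_time BOOLEAN,
        office VARCHAR(255)
    )
""")

# Hypothetical sample row; role and team values are made up.
conn.execute(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, "Vivian", "Thien", "Data Engineer", "Platform", True, "Belgium"),
)

# Maintaining a table can mean changing its structure later on.
# "start_date" is a hypothetical column added for illustration.
conn.execute("ALTER TABLE employees ADD COLUMN start_date DATE")

# PRAGMA table_info is SQLite-specific; index 1 of each row is the column name.
columns = [info[1] for info in conn.execute("PRAGMA table_info(employees)")]
row = conn.execute("SELECT * FROM employees").fetchone()

print(columns)
print(row)
```

Existing rows get an empty value in the new column, so the earlier insert does not have to be repeated.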

SQL for data scientists

Data scientists use SQL to query, filter, group and aggregate data in tables.

SELECT first_name, last_name
FROM employees
WHERE role LIKE '%Data%';
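To see the filter in action, and a grouping with an aggregate next to it, here is a small sketch run against an in-memory SQLite database through Python's `sqlite3` module. The reduced table layout and the role values are hypothetical sample data; only the employee names and offices are taken from the earlier relational database example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced, hypothetical version of the employees table.
conn.execute(
    "CREATE TABLE employees (first_name TEXT, last_name TEXT, role TEXT, office TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Vivian", "Thien", "Data Engineer", "Belgium"),
        ("Julian", "Huong", "Data Scientist", "Belgium"),
        ("Norbert", "Duplantier", "Software Developer", "UK"),
        ("Jeff", "McColgan", "Business Developer", "USA"),
        ("Rick", "Sanchez", "Data Analyst", "USA"),
    ],
)

# Filter: the query from the slide, with an ORDER BY added for stable output.
data_people = conn.execute("""
    SELECT first_name, last_name
    FROM employees
    WHERE role LIKE '%Data%'
    ORDER BY last_name
""").fetchall()

# Group and aggregate: number of employees per office.
headcount = conn.execute("""
    SELECT office, COUNT(*)
    FROM employees
    GROUP BY office
    ORDER BY office
""").fetchall()

print(data_people)
print(headcount)
```

The `%` signs in the `LIKE` pattern match any sequence of characters, so every role containing "Data" is returned.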