




























































































Notes for Data Programming for AOU Course TM351
Typology: Lecture notes





























































































PART 3
Data preparation
Workload (total about 3 hours)
content and make extensive use of IPython (Jupyter) Notebooks.
tabular data in SQL and pandas DataFrames.
limiting (40 minutes).
datasets (30 minutes).
to show outliers for cleaning (30 minutes).
Data preparation
Activities:
Activities also known as:
Note:
Looking ahead:
This week you will look first at some basic data cleansing issues that apply to single and multiple tabular datasets, and then at the processes used to combine and shape them: selection, projection, aggregation, and joins. Many of these techniques can also be straightforwardly applied to data structures other than tables.
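As a rough illustration of the four shaping operations named above, the following pandas sketch applies each one to two small tables. The tables, column names and values are made up for this example and are not taken from the course materials.

```python
import pandas as pd

# Hypothetical example data, invented for illustration only.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "town": ["Leeds", "York", "Leeds", "York"],
    "age": [34, 51, 29, 45],
})
towns = pd.DataFrame({
    "town": ["Leeds", "York"],
    "region": ["West Yorkshire", "North Yorkshire"],
})

# Selection: keep only the rows that satisfy a condition.
over_30 = people[people["age"] > 30]

# Projection: keep only some of the columns.
names_and_towns = people[["name", "town"]]

# Aggregation: summarise groups of rows, here the mean age per town.
mean_age = people.groupby("town")["age"].mean()

# Join: combine rows from two tables that share a key column.
joined = people.merge(towns, on="town", how="inner")
```

In SQL terms these correspond roughly to a WHERE clause, a column list in SELECT, GROUP BY with an aggregate function, and JOIN.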
2 Data cleansing
Is the process of:
irrelevant data – a decision must be made about how to handle them.
Table 3.1 Fictitious details of family members
Classification of error types
Accuracy
Checking correctness requires some external ‘gold standard’ to check values against (e.g. a table of valid postcodes would show that M60 9HP isn’t a postcode that is currently in use). Otherwise, hints based on spelling and capitalisation are the best hope.
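A check against a reference table might look something like the sketch below. The `records` table and the contents of `valid_postcodes` are assumptions made for this example: only M60 9HP comes from the text, and a real check would load an authoritative postcode list rather than the two-entry stand-in used here.

```python
import pandas as pd

# Hypothetical records; only M60 9HP is mentioned in the notes.
records = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "postcode": ["MK7 6AA", "M60 9HP"],
})

# Stand-in for an external 'gold standard' table of valid postcodes.
# The entries are illustrative and assumed valid for this example only.
valid_postcodes = pd.Series(["MK7 6AA", "LS1 1UR"])

# Flag each record according to whether its postcode is in the reference table.
records["postcode_ok"] = records["postcode"].isin(valid_postcodes)

# Rows that fail the check are candidates for correction or follow-up.
suspect = records[~records["postcode_ok"]]
```

A failed lookup only shows that a value is not in the reference table; deciding whether it is a typo, an out-of-date code or a genuinely new one still needs human judgement.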
Completeness
and a postcode, although they may not know the value (assuming they are in the UK – if they live elsewhere they may not have a postcode), but can the dataset be considered complete with some of these missing? This will depend on the purpose of any future analysis.
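Before deciding whether a dataset is complete enough for a given analysis, it can help to count how much is actually missing. The sketch below does this with pandas on a small made-up table; the column names and values are assumptions and do not reproduce Table 3.1.

```python
import pandas as pd

# Hypothetical table with some unknown values recorded as None.
family = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "dob": ["1980-02-01", None, "1975-11-23"],
    "postcode": ["MK7 6AA", None, None],
})

# Number of missing values in each column.
missing_per_column = family.isna().sum()

# Rows with no missing values at all.
complete_rows = family.dropna()
```

Whether to drop incomplete rows, fill them in, or keep them as they are depends on the purpose of the later analysis.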
Uniformity
The DOB field contains date values drawn from two different calendars, which would create problems in later processing. It would be necessary to choose a base or canonical representation and translate all values to that form. A similar issue appears in the Income column.
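Translating values to a canonical form can be sketched as a small conversion function. The example below assumes, purely for illustration, that income values arrive as strings such as "£25,000", "25k" or "25000", and that the chosen canonical form is a whole number of pounds. The actual formats in the table are not shown here, and the function name `to_canonical_income` is hypothetical.

```python
import re

def to_canonical_income(raw):
    """Convert an income string to a whole number of pounds (assumed canonical form)."""
    # Strip whitespace, the currency symbol and thousands separators.
    text = raw.strip().lower().replace("£", "").replace(",", "")
    # Accept a plain number with an optional 'k' suffix meaning thousands.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(k?)", text)
    if match is None:
        return None  # unrecognised values are left for manual inspection
    amount = float(match.group(1))
    if match.group(2) == "k":
        amount *= 1000
    return int(round(amount))

# Hypothetical raw values in three assumed formats, plus one unparseable entry.
raw_incomes = ["£25,000", "25k", "25000", "unknown"]
canonical = [to_canonical_income(value) for value in raw_incomes]
```

Returning `None` for anything that does not match keeps unexpected values visible instead of silently guessing at their meaning. The same approach applies to dates, although converting between calendars generally needs a dedicated library.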
2.2 Combining data from multiple sources
a common (aka canonical) form for non-uniform data.
data sources use different base representations.
2.2 Combining data from multiple sources (Examples)
with subjective values: