TM351 Data Programming Notes, Lecture notes of Database Programming

Notes for Data Programming for AOU Course TM351

Typology: Lecture notes

2016/2017

Uploaded on 12/23/2017

joe-titan
joe-titan 🇯🇴

2 documents

1 / 240

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
TM351
Data management and analysis
3
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59
pf5a
pf5b
pf5c
pf5d
pf5e
pf5f
pf60
pf61
pf62
pf63
pf64

Partial preview of the text

Download TM351 Data Programming Notes and more Lecture notes Database Programming in PDF only on Docsity!

TM

Data management and analysis

3 1

  • SESSION
  • parts 3 and

PART 3

Data preparation

Workload (total about 3 hours)

  • During the course of this week you will work through the VLE

content and make extensive use of IPython (Jupyter) Notebooks.

  • You will work through five Notebooks, looking at manipulating

tabular data in SQL and pandas DataFrames.

  • Activity 3.5 uses 03.1 Cleaning data (30 minutes).
  • Activity 3.9 (optional) recaps features of SQL.
  • (^) Activity 3.10 uses 03.2 Selecting and projecting, sorting and

limiting (40 minutes).

  • Activity 3.11 uses 03.3 Combining data from multiple

datasets (30 minutes).

  • Activity 3.12 uses 03.4 Handling missing data (20 minutes).
  • In addition there are 3 screencasts showing how to use OpenRefine.
  • Activity 3.4: Cleaning data with OpenRefine and clustering data

to show outliers for cleaning (30 minutes).

  • Activity 3.8: Reshaping data with OpenRefine (20 minutes).

Data preparation

Activities :

  1. Data cleansing: remove or repair obvious errors and inconsistencies in the dataset
  2. (^) Data integration: combining datasets
  3. data transformation: shaping datasets

Activities also known as:

  • In data warehousing, the acronym ETL ( E xtract, T ransform, and L oad) is used for the process of taking data from operational systems and loading them into the warehouse.
  • Terms like data harmonisation and data enhancement are also used.

Note :

  • Some of the techniques used in data preparation – especially in transformation and integration – are also used to manipulate data during analysis
  • Conversely, some analysis techniques are also used in data preparation.

Looking ahead:

This week you will look first F 08 CF 02 0at some basic data cleansing issues that apply to single and multiple tabular datasets, and then F 08 DF 02 0at the processes used to combine and shape them: selection, projection, aggregation, and joins. Many of these techniques can also be straightforwardly applied to data structures other than tables.

2 Data cleansing

Is the process of:

  • detecting and correcting errors in a dataset.
  • It can even mean removing irrelevant parts of the data
    • we will look at this later in the section.
  • Having found errors – incomplete, incorrect, inaccurate or

irrelevant data – a decision must be made about how to

handle them.

Table 3.1 Fictitious details of family members

Classification of error types

  • Validity
  • Accuracy
  • Completeness
  • Consistency
  • Uniformity

Accuracy

Checking correctness requires some external ‘gold standard’ to check them against (e.g. a table of valid postcodes, would show hat M60 9HP isn’t a postcode that is currently is use). Otherwise, hints based on spelling and capitalisation are the best hope.

Completeness

  • Are all the values required present? Everyone has a DOB

and a postcode, although they may not know the value (assuming they are in the UK – if they live elsewhere they may not have a postcode), but can the dataset be considered complete with some of these missing? This will depend on the purpose of any future analysis.

Uniformity

The DOB field contains date values drawn from two different calendars, which would create problems in later processing. It would be necessary to choose a base or canonical representation and translate all values to that form. A similar issue appears in the ncome column.

2.2 Combining data from multiple

sources

  • Harmonisation is the data cleansing activity of creating

a common (aka canonical ) form for non-uniform data.

  • Mixed forms more often occur when two or more

data sources use different base representations.

2.2 Combining data from multiple

sources (Examples)

  • There are limits to how much harmonisation can be achieved

with subjective values:

  • Figure 3.2 The challenge of agreeing subjective values