TM351 Data Warehousing Notes, Lecture notes of Data Warehousing

Lecture Notes for Data Warehousing AOU Course TM351

Typology: Lecture notes

2016/2017

Uploaded on 12/23/2017

joe-titan

TM351
Data management and analysis

Caveat

  • These slides DO NOT replace the course learning materials
  • Exams WILL BE derived from the full set of the course learning materials

MODULE GUIDE

1.3 Module structure

The major areas are:

  • The data analysis pipeline (Parts 2–5), which looks at specific issues around data acquisition, preparation, analysis and presentation
  • Relational database management systems (Parts 8–12)
  • Document and non-relational systems, and distributed storage and processing (Parts 13–18)
  • Data warehousing and data mining (Parts 19–22)
  • Linked data and the ‘semantic web’ (Parts 24–26)

Learning Python

  • Online tutorial: https://docs.python.org/3/tutorial/index.html
  • Textbook: http://interactivepython.org/runestone/static/pythonds/Introduction/ReviewofBasicPython.html
  • Start by completing the Bootcamp notebooks (Part 1 notebooks 1.1-1.5)

Learning materials

  • Downloadable from the Central LMS (course contents).
  • Course software (downloadable from the internet)

Assessment

MTA: 30%

TMA: 20%

Final: 50%

SOFTWARE GUIDE

General guidelines

  • Read the software guide now
  • Install Anaconda now
  • Go through the bootcamp notebooks 1.1-1.5 now

PART 1

Introducing data management and analysis

Transient: data that is generated but is of so little value that it is not collected (for example, cursor positions on websites)

Data and data sets (Rob Kitchin 2014)

Characterisation and characteristic values:

Producer

  • Primary: generated by the producer for their own use
  • Secondary: data provided by a producer to another user for (re)use over and above the primary use
  • Tertiary: derived data published for use by third parties, e.g. statistical tables and reports

Type

  • Indexical: data that includes unique identifiers (e.g. a UK National Insurance number), allowing data items to be linked across distinct data collections
  • Attribute: properties or attributes of a data item; multiple attributes of the same item (e.g. a customer’s name, age and postcode) may be unique to that item
  • Metadata: data about data (see Section 2.4 below)
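To illustrate how indexical data lets items be linked across distinct data collections, here is a minimal Python sketch. The two collections, the field names and the identifier values are all made up for illustration, and `link_on_identifier` is a hypothetical helper, not part of the course software.

```python
# Hypothetical records: identifiers and field names are invented for illustration.
customers = [
    {"ni_number": "QQ123456A", "name": "Alice", "postcode": "MK7 6AA"},
    {"ni_number": "QQ654321B", "name": "Bob", "postcode": "LS1 4AP"},
]
tax_records = [
    {"ni_number": "QQ654321B", "tax_code": "1100L"},
    {"ni_number": "QQ123456A", "tax_code": "BR"},
]


def link_on_identifier(left, right, key):
    """Join two lists of dicts on a shared unique identifier (the indexical field)."""
    # Index the right-hand collection by identifier for direct lookup.
    index = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Combine the attribute data held in both collections.
            linked.append({**row, **match})
    return linked


linked = link_on_identifier(customers, tax_records, "ni_number")
for row in linked:
    print(row)
```

The attribute fields (name, postcode, tax code) describe each item, but it is the shared identifier that makes the join reliable.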

3.1 Stakeholders

  • A dataset or database may have a very broad range of stakeholders
  • Different stakeholders will have widely different concerns.
  • For example, if data about an individual is being analysed, then:
      - that individual is a stakeholder
      - so is the Information Commissioner (Data Protection Act), who protects the legal interests of all data subjects.

3.2 Scale

The three (or six) Vs

  • Volume: the traditional measure of data size; how much of it there is.
  • Variety: data comes in many different, sometimes incompatible, forms and representations.
  • Velocity: how fast new data is generated and has to be processed.

Three more Vs are now becoming current:

  • Veracity: the quality of the data; how ‘clean’ it is.
  • Validity: the extent to which the facts the data incorporates are correct and consistent for their context.
  • Volatility: how quickly data changes or becomes invalid.
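Some of the Vs can be given rough numerical proxies. The Python sketch below is only one possible illustration: the sensor readings, the field names and the plausible temperature range are assumptions made up for this example, and the three functions are crude stand-ins rather than standard definitions of velocity, veracity or validity.

```python
# Hypothetical sensor readings: "t" is seconds since the start, "temp_c" a temperature.
readings = [
    {"t": 0.0, "temp_c": 21.5},
    {"t": 0.5, "temp_c": None},   # missing value
    {"t": 1.0, "temp_c": 22.0},
    {"t": 1.5, "temp_c": 950.0},  # implausible for a room sensor
    {"t": 2.0, "temp_c": 21.8},
]


def velocity(rows):
    """Records per second over the observed time span (rough proxy for velocity)."""
    span = rows[-1]["t"] - rows[0]["t"]
    return len(rows) / span if span > 0 else float("nan")


def completeness(rows, field):
    """Share of rows where the field is present (one crude proxy for veracity)."""
    present = [r for r in rows if r[field] is not None]
    return len(present) / len(rows)


def in_range_share(rows, field, low, high):
    """Share of present values inside an assumed plausible range (crude validity check)."""
    values = [r[field] for r in rows if r[field] is not None]
    if not values:
        return float("nan")
    valid = [v for v in values if low <= v <= high]
    return len(valid) / len(values)


print("velocity (records/s):", velocity(readings))
print("completeness:", completeness(readings, "temp_c"))
# The -10..50 range is an assumption about what is plausible in this context.
print("in-range share:", in_range_share(readings, "temp_c", -10.0, 50.0))
```

Which range counts as valid depends on the context the data describes, which is why the bounds are passed in rather than fixed.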