TM351
Data management and analysis
Caveat
- These slides DO NOT replace the course learning materials
- Exams WILL BE derived from the full set of the course learning materials
MODULE GUIDE
1.3 Module structure
The major areas are:
- the data analysis pipeline (Parts 2–5), which looks at specific issues around data acquisition, preparation, analysis and presentation
- relational database management systems (Parts 8–12)
- document and non-relational systems, and distributed storage and processing (Parts 13–18)
- data warehousing and data mining (Parts 19–22)
- linked data and the ‘semantic web’ (Parts 24–26).
Learning Python
- Online tutorial: https://docs.python.org/3/tutorial/index.html
- Textbook: http://interactivepython.org/runestone/static/pythonds/Introduction/ReviewofBasicPython.html
- Start by completing the Bootcamp notebooks (part 1 notebooks 1.1-1.5)
Learning materials
- Downloadable from the Central LMS (course contents).
- Course software (downloadable from the internet)
Assessment
MTA: 30%
TMA: 20%
Final: 50%
SOFTWARE GUIDE
General guidelines
- Read the software guide now
- Install Anaconda now
- Go through the bootcamp notebooks 1.1-1.5 now
PART 1
Introducing data management and analysis
Transient: data that is generated but is of so little value that it is not collected (for example, cursor positions on websites)
Data and data sets (Rob Kitchin 2014)
Characterisation: characteristic values
Producer:
- Primary: generated by the producer for their own use
- Secondary: data provided by a producer to another user for (re)use over and above the primary use
- Tertiary: derived data published for use by third parties, e.g. statistical tables and reports
Type:
- Indexical: data that includes unique identifiers (e.g. a UK National Insurance number), allowing data items to be linked across distinct data collections
- Attribute: properties or attributes of a data item; multiple attributes of the same item (e.g. a customer’s name, age and postcode) may be unique to that item
- Metadata: data about data – see Section 2.4 below
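The way indexical data lets items be linked across distinct collections can be sketched in a few lines of Python. This is only an illustrative sketch: the two collections, the `customer_id` field and the `link_on_key` helper are all hypothetical and are not taken from the module materials.

```python
# Two hypothetical data collections that share an indexical field.
# All identifiers and values are invented for illustration.
customers = [
    {"customer_id": "C001", "name": "Ada", "postcode": "AB1 2CD"},
    {"customer_id": "C002", "name": "Ben", "postcode": "EF3 4GH"},
]
orders = [
    {"order_id": 1, "customer_id": "C002", "total": 19.99},
    {"order_id": 2, "customer_id": "C001", "total": 5.00},
    {"order_id": 3, "customer_id": "C002", "total": 42.50},
]


def link_on_key(left, right, key):
    """Join two lists of dicts on a shared unique identifier (an inner join)."""
    # Build a look-up table from identifier to the matching row in `left`.
    index = {row[key]: row for row in left}
    linked = []
    for row in right:
        match = index.get(row[key])
        if match is not None:
            # Combine the attributes of both rows into one record.
            linked.append({**match, **row})
    return linked


linked_rows = link_on_key(customers, orders, "customer_id")
for row in linked_rows:
    print(row["order_id"], row["name"], row["total"])
```

The attribute data (name, postcode, total) says something about each item, while the indexical field is what makes the link between the two collections possible.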
3.1 Stakeholders
- A dataset or database may have a very broad range of stakeholders
- Different stakeholders will have widely different concerns.
- For example, if data about an individual is being analysed, then:
  - that individual is a stakeholder
  - so is the Information Commissioner (Data Protection Act), who protects the legal interests of all data subjects.
3.2 Scale
The three (or six) Vs
- Volume : our traditional measure of data size – how much there is of it.
- Variety : in many different, sometimes incompatible forms and representations.
- Velocity : how fast new data is generated and has to be processed.
- Three more Vs are now becoming current:
- Veracity : the quality of the data; how ‘clean’ it is.
- Validity : the extent to which the facts the data incorporates are correct and consistent for their context
- Volatility : how quickly data changes, or becomes invalid.
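Some of the Vs can be given rough numerical indicators. The sketch below computes crude proxies for volume, velocity and veracity on a tiny set of hypothetical sensor readings. The data, the variable names and the choice of "fraction of complete records" as a stand-in for veracity are all assumptions made for illustration, not definitions from the module.

```python
from datetime import datetime

# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"time": datetime(2024, 1, 1, 12, 0, 0), "value": 20.5},
    {"time": datetime(2024, 1, 1, 12, 0, 10), "value": None},
    {"time": datetime(2024, 1, 1, 12, 0, 20), "value": 21.0},
    {"time": datetime(2024, 1, 1, 12, 0, 30), "value": 20.8},
    {"time": datetime(2024, 1, 1, 12, 0, 40), "value": 21.3},
]

# Volume: how much data there is (here, simply the number of records).
volume = len(readings)

# Velocity: how fast data arrives (records per second over the observed span).
span_seconds = (readings[-1]["time"] - readings[0]["time"]).total_seconds()
velocity = volume / span_seconds if span_seconds > 0 else float("nan")

# Veracity (crude proxy): the fraction of records with no missing value.
complete = sum(1 for r in readings if r["value"] is not None)
completeness = complete / volume

print("volume:", volume)
print("velocity (records/s):", velocity)
print("completeness:", completeness)
```

Variety, validity and volatility are harder to reduce to a single number, since they depend on the forms the data takes, its context and how it changes over time.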