Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

On Line Transaction Processing - Data Warehousing - Lecture Handouts, Lecture notes of Data Warehousing

Quaid-i-Azam University (QAU)Data Warehousing

Online Transaction Processing, Management Information System, Data Warehouse environment, Decision making is AdHoc, Running in circles, Different patterns of hardware utilization are discusses points of document. This handout is for data warehousing course from Virtual University of Pakistan.

Typology: Lecture notes

2011/2012

Uploaded on 11/06/2012

ahsen 🇵🇰

4.6

(88)

84 documents

1 / 11

This page cannot be seen from the preview

Don't miss anything!

Data Warehousing Course Code: CS614

[email protected]

Virtual University of Pakistan

18

Lecture Handout

Data Warehousing

Lecture No. 03

It is a blend of many technologies, the basic concept being:

• Take all data from different operational systems.

• If necessary, add relevant data from industry.

• Transform all data and bring into a uniform format.

• Integrate all data as a single entity.

• Store data in a format supporting easy access for decision support.

• Create performance enhancing indices.

• Implement performance enhancement joins.

• Run ad-hoc queries with low selectivity.

A Data Warehouse is not something shrink-wrapped i.e. you take a set of CDs and install

into a box and soon you have a Data Warehouse up and running. A Data Warehouse

evolves over time, you don’t buy it. Basically it is about taking/collecting data from

different heterogeneous sources. Heterogeneous means not only the operating system is

different but so is the underlying file format, different databases, and even with same

Discover Lecture notes of Data Warehousing Quaid-i-Azam University (QAU)

Partial preview of the text

Download On Line Transaction Processing - Data Warehousing - Lecture Handouts and more Lecture notes Data Warehousing in PDF only on Docsity!

[email protected]

Virtual University of Pakistan

Lecture Handout

Data Warehousing

Lecture No. 03

It is a blend of many technologies, the basic concept being:

Take all data from different operational systems.
If necessary, add relevant data from industry.
Transform all data and bring into a uniform format.
Integrate all data as a single entity.
Store data in a format supporting easy access for decision support.
Create performance enhancing indices.
Implement performance enhancement joins.
Run ad-hoc queries with low selectivity.

A Data Warehouse is not something shrink-wrapped i.e. you take a set of CDs and install into a box and soon you have a Data Warehouse up and running. A Data Warehouse evolves over time, you don’t buy it. Basically it is about taking/collecting data from different heterogeneous sources. Heterogeneous means not only the operating system is different but so is the underlying file format, different databases, and even with same

[email protected]

Virtual University of Pakistan

database systems different representations for the same entity. This could be anything from different columns names to different data types for the same entity.

Companies collect and record their own operational data, but at the same time they also use reference data obtained from external sources such as codes, prices etc. This is not the only external data, but customer lists with their contact information are also obtained from external sources. Therefore, all this external data is also added to the data warehouse.

As mentioned earlier, even the data collected and obtained from within the company is not standard for a host of different reasons. For example, different operational systems being used in the company were developed by different vendors over a period of time, and there is no or minimal evenness in data representation etc. When that is the state of affairs (and is normal) within a company, then there is no control on the quality of data obtained from external sources. Hence all the data has to be transformed into a uniform format, standardized and integrated before it can go into the data warehouse.

In a decision support environment, the end user i.e. the decision maker is interested in the big picture. Typical DSS queries do not involve using a primary key or asking questions about a particular customer or account. DSS queries deal with number of variables spanning across number of tables (i.e. join operations) and looking at lots of historical data. As a result large number of records are processed and retrieved. For such a case, specialized or different database architectures/topologies are required, such as the star schema. We will cover this in detail in the relevant lecture.

Recall that a B-Tree is a data structure that supports dictionary operations. In the context of a database, a B-Tree is used as an index that provides access to records without actually scanning the entire table. However, for very large databases the corresponding B-Trees becomes very large. Typically the node of a B-Tree is stored in a memory block, and traversing a B-Tree involves O(log n) page faults. This is highly undesirable, because by default the height of the B-Tree would be very large for very large data bases.

[email protected]

Virtual University of Pakistan

How is it different?

Decision making is Ad-Hoc

Figure-3.1: Running in circles

Consider a decision maker or a business user who wants some of his questions to be answered. He/she sets a meeting with the IT people, and explains the requirements. The IT people go over the cycle of system analysis and design, that takes anywhere from couple of weeks to couple of months and they finally design and develop the system. Happy and proud with their achievement the IT people go to the business user with the reporting system or MIS system. After a learning curve the business users spends some

[email protected]

Virtual University of Pakistan

time with the brand new system, and may get some answers to the required questions. But then those answers results in more questions. The business user has no choice to meet the IT people with a new set of requirements. The business user is frustrated that his questions are not getting answered, while the IT people are frustrated that the business user always changes the requirements. Both are correct in their frustration.

Different patterns of hardware utilization

Figure-3.2: Different patterns of CPU Usage

Although there are peaks and valleys in the operational processing, but ultimately there is relatively static pattern of utilization. There is an essentially different pattern of hardware utilization in the data warehouse environment i.e. a binary pattern of utilization, either the hardware is utilized fully or not at all. Calculating a mean utilization for a DWH is not a meaningful activity. Therefore, trying to mix the two environments is a recipe for disaster. You can optimize the machine for the performance of one type of application, not for both.

Bus vs. Train Analogy Consider the analogy of a bus and train. I believe you can find dozens of buses operating between Lahore and Rawalpindi almost every 30 minutes. As a consequence, literally there are buses moving between Lahore and Rawalpindi almost continuously through out the day. But how many times a dedicated train moves between the two cities? Only twice a day and carries a bulk of passengers and cargo. Binary operation i.e. either traveling or not. The train can NOT be optimized for every 30-min travel, it will never fill to capacity

100%

0% Operational (^) DWH

[email protected]

Virtual University of Pakistan

Because you don’t need historical data to do business today, but you do need the historical data to understand patterns of business usage to do business tomorrow, such why a customer left?

How much History?

Depends on: Industry.

Cost of storing historical data.

Economic value of historical data.

Industries and history Telecomm calls are much much more as compared to bank transactions- 18 months of historical data.

Retailers interested in analyzing yearly seasonal patterns- 65 weeks of historical data.

Insurance companies want to do actuary analysis, use the historical data in order to predict risk- 7 years of historical data.

Hence, a DWH NOT a complete repository of data

How back do you look historically? It really depends a lot on the industry. Typically it’s an economic equation. How far back depends on how much dose it cost to store that extra years work of data and what is it’s economic value? So for example in financial organizations, they typically store at least 3 years of data going backward. Again it’s typical. It’s not a hard and fast rule.

[email protected]

Virtual University of Pakistan

In a telecommunications company, for example, typically around 18 months of data is stored. Because there are a lot more call details records then there are deposits and withdrawals from a bank account so the storage period is less, as one can not afford to store as much of it typically. Another important point is, the further back in history you store the data, the less value it has normally. Most of the times, most of the access into the data is within that last 3 months to 6 months. That’s the most predictive data.

In retail business, retailers typically store at least 65 weeks of data. Why do they do that? Because they want to be able to look at this season’s selling history to last season’s selling history. For example, if it is Eid buying season, I want to look at the transit- buying this Eid and compare it with the year ago. Which means I need 65 weeks in order to get year going back, actually more then a year. It’s a year and a season. So 13 weeks are additionally added to do the analysis. So it really depends a lot on the industry. But normally you expect at least 13 months.

Economic value of data Vs. Storage cost

Data Warehouse a complete repository of data?

This raises an interesting question, do we decide about storage of historical data using only time, or consider space also, or both?

[email protected]

Virtual University of Pakistan

even once every night. It depends on the volume of data you are working with, and how important the timings of the data are and so on.

Deviation from the PURIST approach

Let me first explain what/who a purist is. A purist is an idealist or traditionalist who wants everything to be done by the book or the old arcane ways (only he/she knows), in short he/she is not pragmatic or realist. Because the purist wants everything perfect, so he/she has good excuses of doing nothing, as it is not a perfect world. When automobiles were first invented, it was the purists who said that the automobiles will fail, as they scare the horses. As Iqbal very rightly said “ Aina no sa durna Tarzay Kuhan Pay Arna… ”

As data warehouses become mainstream and the corresponding technology also becomes mainstream technology, some traditional attributes are being deviated in order to meet the increasing demands of the user’s. We have already discussed and reconciled with the fact that a data warehouse is NOT the complete repository of data. The other most noticeable deviations being time variance and nonvolatility.

Deviation from Time Variance and Nonvolatility As the size of data warehouse grows over time (e.g., in terabytes), reloading and appending data can become a very tedious and time consuming task. Furthermore, as business users get the “hang of it” they start demanding that more up-to-date data be available in the data warehouse. Therefore, instead of sticking to the traditional data warehouse characteristic of keeping the data nonvolatile and time variant, new data is being added to the data warehouse on a daily basis, if not on a real-time basis and at the same time historical data removed to make room for the “fresh” data. Thus, new approaches are being made to tackle this task. Two possible methods are as follows:

Perform hourly/daily batch updates from shadow tables or log files. Transformation rules are executed during the loading process. Thus, when the

[email protected]

Virtual University of Pakistan

data reaches the target data warehouse database, it is already transformed, cleansed and summarized.

Perform real-time updates from shadow tables or log files. Again, transformation rules are executed during the loading process. Instead of batch updates, this takes place on a per transaction basis that meets certain business selection criteria.

On Line Transaction Processing - Data Warehousing - Lecture Handouts, Lecture notes of Data Warehousing

Related documents

Partial preview of the text

Download On Line Transaction Processing - Data Warehousing - Lecture Handouts and more Lecture notes Data Warehousing in PDF only on Docsity!

Lecture Handout

Data Warehousing

Lecture No. 03