






















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
A Comprehensive Guide To data ware housing where students can learn everything about the DW
Typology: Study notes
1 / 30
This page cannot be seen from the preview
Don't miss anything!























In this chapter, the nature of data warehousing is discussed. The purpose of the chapter is to provide background knowledge for the forthcoming chapters on the relationship between data warehousing and systems thinking, rather than to give a complete description of data warehousing design methods.
The terms data warehouse and data warehousing may be confusing. Therefore, it was decided to use the term data warehouse as a noun and data warehousing as the process to create a data warehouse. A data warehouse is throughout this thesis regarded as a system.
The first section investigates the definition of a data warehouse. Data warehouses are then compared with operational information systems. The explanation of data warehousing is clarified by a discussion on data warehousing architecture. The main stages in the data warehousing lifecycle, namely requirements collection, data modelling, data staging and data access are discussed to highlight different views on data warehousing methods.
Data warehousing success is of critical importance to the industry. The Cutter consortium (Anonymous, 2003:1) reported that 41% of data warehousing professionals has experienced data warehousing projects that failed. A review of current literature on data warehousing success factors is given to highlight the problems and opportunities in this field. An Internet research study on perceived critical success factors and main causes of failures serves as a link between the formal literature and the practices of data warehousing professionals.
The chapter concludes with a literature investigation into the combination of systems thinking and data warehousing practices, which serves as an investigation of current research for the overall study presented in this thesis.
This chapter represents the practice level of the philosophy, methodology 1 and practice model presented in this thesis. Although IS professionals would recognise
the information presented in this chapter as data warehousing methodology , it is viewed as a generalisation of practices of data warehousing professionals and therefore as practice. The term “method” is used to indicate data warehousing methodology. The practice layer in the model can be divided into generalised practices and individual practices. The generalised practices are presented in this chapter and the individual practices in the next chapter.
Although the association between data warehousing practices and systems thinking is only done in chapter 5, it is possible to identify systems thinking ideas in data warehousing practices as presented in this chapter. The educated systems thinker is able to identify conflicts in the definitions given for key data warehousing terminology.
Different systems thinking ideas are already visible in definitions of information systems. Mallach (2000:88) defines information systems as “a system whose purpose is to store, process, and communicate information”. This definition can be compared to that of Du Plooy et al. (1993:01): “Information systems is an inter- disciplinary field of scholarly inquiry, where information, information systems and the integration thereof with the organisation is studied in order to benefit the total system (technology, people, organisation and society).” It is clear that the system in Mallach’s definition has a tighter boundary than that of Du Plooy et al. (1993). The latter follows a more holistic (soft systems) approach to IS.
Data warehouses are examples of decision support systems (DSS). A DSS can be defined as a “computer-based information system whose primary purpose is to provide knowledge workers with information on which to base informed decisions.” (Mallach, 2000:13). DSS can be divided into data-oriented DSS, model-oriented DSS and process-oriented DSS. A data-oriented DSS uses data base systems as source of the decision support, in contrast to a model-oriented DSS which uses mathematical models to support business decisions and a process-oriented DSS which simulates human decision making processes (Mallach, 2000:143). Data warehouses are the primary example of data-oriented DSS today.
large-scale organisational development or change management project, rather than a traditional IS project (Markus, 2000:44).
Data warehouses are also known as online analytical processing (OLAP) systems because they serve managers and knowledge workers in the field of data analysis and decision making.
Online transaction processing (OLTP) systems, or operational systems, are those information systems that support the daily processing that an organisation does. OLTP systems’ main purpose is to capture information about the economic activities of an organisation. One might argue that the purpose of OLTP systems is to get data into computers, whereas the purpose of data warehouses is to get data or information out of computers.
Han and Kamber (2001:43) describe the differences between data warehouses and OLTP systems. The key differences are summarised in table 4.1.
Feature OLTP OLAP
Characteristic operational processing informational processing
Orientation transaction analysis
User clerk, data base administrator(DBA), data base professional knowledge worker (e.g. manager,executive, analyst)
Function day-to-day operations long-term informationalrequirements, decision support
Data base (DB) design
entity relational (ER) based, application oriented star / snowflake, subject oriented
Data current; guaranteed up-to-date historical; accuracy maintainedover time
Summarisation primitive, highly detailed summarised, consolidated
View detailed, flat relational summarised, multidimensional
Unit of work short, simple transaction complex query
Access read / write mostly read
Focus data in information out
Operations index / hash on primary key lots of scans
Number of records accessed tens^ millions
Number of users thousands hundreds
DB size 100 MB to GB 100 GB to TB
Priority high performance, high availability high flexibility, end-user autonomy
Metric transaction throughput query throughput, response time
Table 4-1 Comparison between OLTP and OLAP systems (Han & Kamber, 2001:43)
Han and Kamber (2001:42) argue that an OLTP system is customer-oriented as opposed to a data warehouse that is market-oriented.
It is difficult to combine data warehousing (OLAP) and OLTP capabilities in one system. The dimensional data design model used in data warehouses is much more effective for querying than the relational model used in OLTP systems. Furthermore, data warehouses may use more that one data base as data source. The dimensional design of a data warehouse is not suitable for OLTP systems, mainly due to redundancy and the loss of referential integrity of the data. Organisations choose to have two separate information systems, one OLTP system and one OLAP system.
Poe et al. (1998:3) stress the fact that analysis using OLAP systems, are primarily done through comparisons, or by analysing patterns and trends. For example, sales trends are analysed along with marketing strategies to determine the relative success of specific marketing strategies with regard to sales patterns. Such analysis is difficult to perform with OLTP systems since the information accessed is stored in different systems across several departments in the organisation.
Corey et al. (2001:16) highlight the fact that usage of OLTP systems is very predictable. For example, a bank clerk always performs the same actions on the system. The usage of a data warehouse system on the other hand is very unpredictable. It is not possible to predict which trends will be analysed by which managers during which time period.
Kimball et al. (1998) differ from this approach by following a requirements-driven development lifecycle. This difference will feature strongly in the arguments presented in chapter 5.
This discussion aims to give a holistic view on data warehousing. The section begins with a high level view presented by The Data Warehouse Institute (TDWI) of businesses intelligence (BI). This is followed by a discussion based on the proposed high level architecture given by Kimball et al. (1998). Differences to this approach will be discussed in section 4.5 where a more detailed view is taken on the key issues of data warehousing.
Eckerson (2003) from TDWI did a study on the success factors in implementing BI systems in organisations and the role of data warehouses in this process. Eckerson (2003:4) views the BI process holistically as a “data refinery”. Data from different OLTP systems are integrated, which leads to a new product called information. The data warehouse staging process is responsible for this transformation. Users equipped with programs such as specialised reporting tools, OLAP tools and data mining tools transform information to knowledge. This is done through analysis that identifies trends, patterns and exceptions. Kimball et al. (1998:329) include this process as part of the data warehouse project. The next step is to transform knowledge to rules. Users create rules from knowledge; these may be simple rules such as “Order 50 new units whenever inventory falls below 25 units”, or complex rules generated by statistical algorithms or models. Rules lead to plans of action that implement these rules. The actual implementation of these plans creates a cycle when new data enters the data warehouse, to be transformed once again into information and so forth. Although a data warehouse is only one tool in this process, it illustrates the value and purpose of a data warehouse in the organisation.
Kimball et al. (1998:329) give a graphic representation of data warehouse architecture. Figure 4.1 depicts the operation of the data warehouse in the organisation. The aim of the data warehouse is to give end-users (mostly managers) easy access to data in the organisation. In order to do this, it is necessary to capture everyday operational data from the operational systems of the organisation. These are transactional systems (OLTP), for example point of sale systems that are
designed around relational databases. Such systems become the source systems of the data warehouse.
The data from the source systems go through a process called data staging to the presentation servers (Kimball et al. , 1998:345). Data staging involves four very important actions. Firstly, the data is extracted from the source systems. The data required for the data warehouse is usually distributed in various different source systems with different file formats running on different hardware and operating system platforms. Secondly, the data is transformed to the data warehouse format. Errors and inconsistencies are removed during this phase. Thirdly, the data is loaded into data marts in the presentation server. The final task of data staging is to schedule this process.
Data stagingarea
Metadatacatalog
High level warehouse technical architecture The back room (^) The front room Data staging services
Query services
Dimensional data martsincluding atomic data
Application models(e.g. data mining)
Key element^ Data Service element
Source systems Presentation servers
Downstream operational / systems
Desktop dataaccess tools
reporting toolsStandard
Dimensional data marts withonly aggregated data
warehouseThe data bus
Conformeddimensions & Data stagingarea conformedfacts
Metadatacatalog
High level warehouse technical architecture The back room (^) The front room Data staging services
Query services
Dimensional data martsincluding atomic data
Application models(e.g. data mining)
Key element^ Data Service element
Source systems Presentation servers
Downstream operational / systems
Desktop dataaccess tools
reporting toolsStandard
Dimensional data marts withonly aggregated data
warehouseThe data bus
Conformeddimensions & conformedfacts
Figure 4.1 High level data warehouse architecture (Kimball et al., 1998:329)
The extraction of data from the operational source system influences the availability of these systems, therefore these processes should be done during off-peak times and as quickly as possible. High quality data warehouse output is dependent on high quality data in the data warehouse (Redman, 1996:32). Therefore, the staging process is most important from a data quality perspective.
A data warehouse is a read-only data source, which means that end-users may not change the value of data elements in the data warehouse. However, figure 4.1 does contain a feedback arrow from the end-user systems towards the data staging area. Specialised users may add data to the warehouse. A typical example is clustering information that may be associated with customers, as a result of data mining procedures that were carried out on the data in the data warehouse. For example, risk factors might be assigned to customers in a financial institute’s data warehouse.
Inmon’s (1996) approach to data warehouse architecture differs from that of Kimball et al. (1998). Kimball et al. (1998) describe a data mart as a subset of the data warehouse. The data warehouse is the sum of all the data marts, each representing a business process in the organisation. Inmon (1996) views a data mart as an interface between the data warehouse and the end-user.
Figure 4.2 Data marts: Inmon vs. Kimball (adapted from Mailvaganam, 2003:2)
Data warehouse Data mart
OLTP User
Inmon’s model of a data mart
OLTP User
Kimball’s model of a data mart
Data mart
Data warehouse
A data mart is a separate copy of a subset of the data in the data warehouse, organised in a star schema to be accessed by end-users. This difference is illustrated graphically in figure 4.2.
This section contains different views of different authors on data warehousing aspects. It will be shown in chapter 5 that these different views can be traced back to different systems thinking methodologies1&2. The aim of this section is to give a practice level description of different views on various data warehousing aspects.
4.5.1 The data warehouse development lifecycle
A data warehouse development lifecycle is a sequence of high-level tasks required for effective data warehouse design, development, and deployment (Kimball et al. , 1998:33). Different authors have radically different views on the order of these tasks in the development lifecycle for data warehouses. These differences are presented in the following paragraphs.
Inmon (1996:290) advocates the use of a data-driven method. This means that DSS processing begins with data and ends with requirements. Inmon calls this method the CLDS (the reverse of SDLC) as depicted in table 4.2. According to Inmon (1996:44), a data warehouse starts with building a central data store for one subject- area, which is populated from operational systems. As the analytical ability of the new data warehouse is discovered, demand for an integrated data store for another subject area will grow and this process will repeat itself until a complete data warehouse has been developed. Although Inmon (1996) presents the lifecycle of a data warehouse to be opposite to the requirements-driven lifecycle of OLTP systems, it is interesting to note that in his data warehouse review checklist (Inmon, 1996:297), the second question (of a 54 question - checklist) is whether the end-user requirements have been anticipated, or not.
In contrast to Inmon’s approach, Kimball et al. (1998:33) advocate the use of a requirements-driven method. The process is depicted in figure 4.3. The data warehouse starts with project planning to determine the readiness of the organisation
It is clear that the requirements-driven method differs from an OLTP system’s design method in the amount of time and effort spent on feasibility studies. Kimball et al. (1998:43) argue that a certain degree of readiness of the organisation for a data warehouse is essential for the development effort to succeed. This would include the presence of a strong business management sponsor, a compelling business motivation, a well functioning business and IS department partnership, the current analytic decision making culture in the organisation, and technical feasibility based on the current infrastructure of the organisation.
4.5.2 Collecting requirements
Collecting requirements is the foundation for all subsequent stages according to Kimball et al. (1998:96). Kimball et al. (1998:97) state, “You can’t just ask users what data they would like to see in the data warehouse. Instead, you need to talk to them about their jobs, their objectives, and their challenges and try to figure out how they make decisions, both today and in the future”.
Bischoff and Alexander (1997:67) advise that only requirements that support the initial business area and nothing more should be investigated. This statement will be used in the mapping of systems thinking methodologies1&2 in chapter 5, because Kimball et al. (1998:266) accentuate the advantages of an investigation into the entire organisation’s data usage, before deciding which business area and therefore which data mart to develop initially. It is clear that this difference of opinion is rooted in different systems views. Bischoff and Alexander’s opinion is motivated by hard systems thinking and Kimball’s by soft systems thinking.
Inmon (1996:144) states, “Requirements for the data warehouse cannot be known a priori.” The main idea of the data-driven method is to create a data warehouse from existing data and to supply the decision makers with data to satisfy their needs, without having to specify those needs upfront.
Kimball et al. (1998:97) give a detailed description on requirements collection for data warehouse projects. The data warehousing team should begin by talking to the business users, rather than talking to source systems experts. Business users are not technically skilled and the data warehousing team should talk to them about their jobs, rather than the data warehouse. The team may use facilitated sessions and/or
personal interviews for this process. Both these techniques require the interview team to gain prior knowledge on the operations of the organisation. Kimball et al. (1998:101) advise the interviewing team to do research into prior data warehouse development attempts, since business users might feel that the current team is duplicating previous work. Business users as well as IS personnel should be interviewed.
Interviews with business users should involve users on different levels in the organisation. When business executives are interviewed, the first question should be to establish the objectives of the organisation. Success measures for measuring the current status should be discussed. Business opportunities and causes for concern should be identified. A very important part of the interview is to discover future developments in the organisation, as well as the information needs thereof (Kimball et al. , 1998:116). Heads of departments should be interviewed with a strong focus on identifying routine decisions and current reports used for analysis. They should also be questioned on their need for analysis in addition to the current available information.
Interviews with IS personnel are conducted to determine the availability of data in support of the business users’ requirements. These interviews serve as a reality check, since the requirements of the business users are tested against the available data. During the data staging phase, IS interviews will be followed up by detailed sessions to work out all the technical problems embedded in the data. During this first round of IS interviews, the team aims to understand the source systems in the organisation, as well as to investigate current analysis methods. Questions are asked to determine what type of analysis is done routinely. The current procedures for handling ad hoc queries are investigated. It is very important to establish and manage the expectations of IS personnel about the intended data warehouse (Kimball et al. , 1998:121).
During the closure of the interviews, users should be asked about the success criteria for the project. One needs to determine measurable criteria for the success of the data warehouse, which can be used as success metrics for the completed project. These success criteria should specify availability of the data warehouse, ease of use, data availability and business impact metrics. Interview information should be written down as quickly as possible following the interview. The individual interview write-up documentation is followed by a requirements finding document.
The above can be illustrated by setting up an ERD and a star schema for the same organisation. The organisation manufactures products and sells them to chain retailers. The chain retailer’s sales are also measured. Comparing the ERD in figure 4.4 to the star schema in figure 4.5, the first problem with the ERD is that the entire enterprise entity structure is represented on one diagram. Although this is acceptable from a soft systems approach, it makes it very difficult to understand. The star schema represents only one business process, i.e. the retail sales process. Another major advantage of the star schema is that it includes the attributes of each dimension. These, for example, may refer to the detailed information about the products or the stores. There is simply no space to put this information on the ERD. By looking at the star schema, the user will easily spot missing data fields.
The centre table in the star schema contains the numerical data, such as dollar amounts of the event represented by the star schema, while the other fields in the centre table are links to all other aspects of interest. The table in the centre is called a fact table and the other descriptive tables dimension tables. This brief explanation suffices for a non-technical business user to fully understand the star schema. It would be extremely challenging to come up with a two-, or three-sentence explanation of an ERD, especially since the cardinality of the relations is always important.
It should be noted that there are various technical differences between ERDs and star schemas that make star schemas very effective to use in data warehouses, but also very ineffective to use in production systems. ERDs are much more effective in production systems, mainly because of the limited redundancy of data compared to the star schema.
Inmon (1996:85) proposes the use of an ERD data model for a data warehouse. The corporate ERD of the data warehouse is a composite of many individual ERDs that reflect the different views of people across the organisation. Inmon (1996:143) also describes star schemas (which he refers to as star joins). A brief discussion on star joins follows a detailed discussion on ERDs. Inmon concludes that a combination of star joins and ERDs will lead to an optimal warehouse design. He offers little explanation on how exactly this is achieved.
Figure 4.5 A star schema isolating the retail sales process from figure 4.4 (Kimball et al., 1998:145)
Figure 4.4 An entity-relationship model of an enterprise that manufactures goods (Kimball et al., 1998:143)
Production
Time Sales rep. Customer Promotion Product Plant Distr. center
Time Sales rep. Customer Promotion Product Plant Distr. center
Time Sales rep. Customer Promotion Product Plant Distr. center
Figure 4.6 Conformed dimensions used by two data marts (Kimball et al.,1998:347)
4.5.4 Data staging
Data staging is the process of moving data from the operational database to the data warehouse. The main tasks in this process are extracting the data from the source systems, transforming the data to the data warehouse standards and loading the transformed data into the data warehouse. The transformation process also includes cleansing of the data. The data staging process is often called the ETL process (extract, transform, and load). The ETL process is a very technical part of the data warehouse development process, and although many different procedures are followed, most authors have reached consensus about the technical detail of the process. The market is overloaded with ETL-tools that are designed to assist the data warehouse development team in the data staging process. Although the technical detail of data staging is not of great importance to this study, each of the aspects will be discussed briefly in order to familiarise the reader with the key concepts. Since data quality assurance and the ownership thereof in the
organisation is of greater importance to the study, this section will be concluded with a description of quality issues of data warehousing.
Extraction is the process of copying relevant data from the source systems. It is essential to perform this process with as little disruption as possible to the source system. This process soon becomes very technical when changes to data loaded earlier in the data warehouse, needs to be managed. Technology of the source systems may differ substantially from the data warehouse technology. The causes of the problems include operating system platforms supporting only specific programming languages and different data formats. Since the availability of the source systems is of major importance to the organisations, the transformation of data is done as a separate stage. The data is copied from the source systems without any transformation to an intermediary storage system.
Since most data warehouses receive data from more than one source system, the data needs to be transformed before it is loaded into the data warehouse. Data attribute formats must be consolidated, for example date formats of source systems may be different. Measurements, such as currency, need to be consolidated. Data fields might have to be separated or joined, for example name fields. Most data warehouse text books contain detailed descriptions on data transformation.
Data quality is addressed during the transformation process. Good quality data is essential to the success of the data warehouse. Mallach (2000:121) discusses eleven information quality factors: