Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

Data Warehousing and Data Mining, Study notes of Business Demography and Environmental Studies

Visvesvaraya Technological University Business Demography and Environmental Studies

This notes covers Data Warehousing and Data Mining concepts in very detailed. Notes are provided by very profound professor of our University. Highly focused on exams perspective and also highly recommended to go through these.

Typology: Study notes

2025/2026

Available from 01/10/2026

veensadds 🇮🇳

4 documents

1 / 119

This page cannot be seen from the preview

Don't miss anything!

Discover Study notes of Business Demography and Environmental Studies Visvesvaraya Technological University

Partial preview of the text

Download Data Warehousing and Data Mining and more Study notes Business Demography and Environmental Studies in PDF only on Docsity!

expands the boundaries of Isle of Capri’s innovative business processes. With real-time data flows, Isle of Capri can view the current state of its business and quickly identify problems, which is the first and foremost step toward solving them analytically. Decision makers require concise, dependable information about current operations, trends, and changes. Data are often fragmented in distinct operational systems, so managers often make decisions with partial information, at best. Data warehousing cuts through this obstacle by accessing, integrating, and organizing key operational data in a form that is consistent, reliable, timely, and readily available, wherever and whenever needed. What Is a Data Warehouse? In simple terms, a data warehouse (DW) is a pool of data produced to support decision making; it is also a repository of current and historical data of potential interest to managers throughout the organization. Data are usually structured to be available in a form ready for analytical processing activities (i.e., online analytical processing [OLAP], data mining, querying, reporting, and other decision support applications). A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process. A Historical Perspective to Data Warehousing Even though data warehousing is a relatively new term in information technology, its roots can be traced back in time, even before computers were widely used. In the early 1900s, people were using data (though mostly via manual methods) to formulate trends to help business users make informed decisions, which is the most prevailing purpose of data warehousing. The motivations that led to the development of data warehousing technologies go back to the 1970s, when the computing world was dominated by mainframes. Real business data-processing applications, the ones run on the corporate mainframes, had complicated file structures using early-generation databases (not the table-oriented relational databases most applications use today) in which they stored data. Although these applications did a decent job of performing routine transactional data-processing functions, the data created as a result of these functions (such as information about customers, the products they ordered, and how much money they spent) was locked away in the depths of the files and databases. When aggregated information such as sales trends by region and by product type was needed, one had to formally request it from the data-processing department, where it was put on a waiting list with a couple hundred other report requests (Hammergren and Simon, 2009). Even though the need for information and the data used to generate it existed, the database technology was not there to satisfy it. Figure 2.1 shows a timeline where some of the significant events that led to the development of data warehousing are shown. ¥ Mainframe computers v Centralized data storage ¥ Big Data analytics ¥ Simple data entry v Data warehousing was born ¥ Social media analytics v Routine reporting ¥ Inmon, Building the Data Warehouse V Text and Web analytics v Primitive database structures v Kimball, The Data Warehouse Toolkit “ Hadoop, MapReduce, NoSQL ¥ Teradata incorporated ¥ EDW architecture design ¥ In-memory, in-database > 1970s > 1980s > 1990s > 2000s > 2010s > ¥ Mini/personal computers (PCs) v Exponentially growing data Web data Y Business applications for PCs Y Consolidation of DW/BI industry Y Distributer DBMS ¥ Data warehouse appliances emerged ¥ Relational DBMS Y Business intelligence popularized ¥ Teradata ships commercial DBs ¥ Data mining and predictive modeling Y Business Data Warehouse coined ~ Open source software v Saas, PaaS, Cloud computing Figure 2.1 A List of Events That Led to Data Warehousing Development. Later in the last century, commercial hardware and software companies began to emerge with solutions to this problem. Between 1976 and 1979, the concept for a new company, Teradata, grew out of research at the California Institute of Technology (Caltech), driven from discussions with Citibank’s advanced technology group. Founders worked to design a database management system for parallel processing with multiple microprocessors, targeted specifically for decision support. Teradata was incorporated on July 13, 1979, and started in a garage in Brentwood, California. The name Teradata was chosen to symbolize the ability to manage terabytes (trillions of bytes) of data. The 1980s were the decade of personal computers and minicomputers. Before anyone knew it, real computer applications were no longer only on mainframes; they were all over the place—everywhere you looked in an organization. That led to a portentous problem called is/ands of data. The solution to this problem led to a new type of software, called a distributed database management system, which would magically pull the requested data from databases across the organization, bring all the data back to the same place, and then consolidate it, sort it, and do whatever else was necessary to answer the user’s question. Although the concept was a good one and early results from research were promising, the results were plain and simple: They just didn’t work efficiently in the real world, and the islands-of-data problem still existed. Meanwhile, Teradata began shipping commercial products to solve this problem. Wells Fargo Bank received the first Teradata test system in 1983, a parallel RDBMS (relational database management system) for decision support—the world’s first. By 1984, Teradata released a production version of their product, and in 1986, Fortune magazine named Teradata “Product of the Year.” Teradata, still in existence today, built the first data warehousing appliance—a combination of hardware and software to solve the data warehousing needs of many. Other companies began to formulate their strategies, as well. During the 1980s several other events occurred, collectively making it the decade of data warehousing innovation. For instance, Ralph Kimball founded Red Brick Systems in 1986. Red Brick began to emerge as a visionary software e Web based. Data warehouses are typically designed to provide an efficient computing environment for Web-based applications. e Relational/multidimensional. A data warehouse uses either a relational structure or a multidimensional structure. A recent survey on multidimensional structures can be found in Romero and Abellé (2009). e Client/server. A data warehouse uses the client/server architecture to provide easy access for end users. e Real time. Newer data warehouses provide real-time, or active, data-access and analysis capabilities (see Basu, 2003; and Bonde and Kuckuk, 2004). e Include metadata. A data warehouse contains metadata (data about data) about how the data are organized and how to effectively use them. Whereas a data warehouse is a repository of data, data warehousing is literally the entire process (see Watson, 2002). Data warehousing is a discipline that results in applications that provide decision support capability, allows ready access to business information, and creates business insight. The three main types of data warehouses are data marts, operational data stores (ODS), and enterprise data warehouses (EDW). In addition to discussing these three types of warehouses next, we also discuss metadata. Data Marts Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department. A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g., marketing, operations). A data mart can be either dependent or independent. A dependent data mart is a subset that is created directly from the data warehouse. It has the advantages of using a consistent data model and providing quality data. Dependent data marts support the concept of a single enterprise-wide data model, but the data warehouse must be constructed first. A dependent data mart ensures that the end user is viewing the same version of the data that is accessed by all other data warehouse users. The high cost of data warehouses limits their use to large companies. As an alternative, many firms use a lower-cost, scaled-down version of a data warehouse referred to as an independent data mart. An independent data mart is a small warehouse designed for a strategic business unit (SBU) or a department, but its source is not an EDW. Operational Data Stores An operational data store (ODS) provides a fairly recent form of customer information file (CIF). This type of database is often used as an interim staging area for a data warehouse. Unlike the static contents of a data warehouse, the contents of an ODS are updated throughout the course of business operations. An ODS is used for short-term decisions involving mission-critical applications rather than for the medium- and long-term decisions associated with an EDW. An ODS is similar to short-term memory in that it stores only very recent information. In comparison, a data warehouse is like long-term memory because it stores permanent information. An ODS consolidates data from multiple source systems and provides a near-real-time, integrated view of volatile, current data. The exchange, transfer, and load (ETL) processes (discussed later in this chapter) for an ODS are identical to those for a data warehouse. Finally, oper marts (see Imhoff, 2001) are created when operational data needs to be analyzed multidimensionally. The data for an oper mart come from an ODS. Enterprise Data Warehouses (EDW) An enterprise data warehouse (EDW) is a large-scale data warehouse that is used across the enterprise for decision support. It is the type of data warehouse that Isle of Capri Casinos developed, as described in the opening vignette. The large-scale nature provides integration of data from many sources into a standard format for effective BI and decision support applications. EDW are used to provide data for many types of DSS, including CRM, supply chain management (SCM), business performance management (BPM), business activity monitoring (BAM), product lifecycle management (PLM), revenue management, and sometimes even knowledge management systems (KMS). Application Case 2.1 shows the variety of benefits that telecommunication companies leverage from implementing data warehouse driven analytics solutions. Application Case 2.1 A Better Data Plan: Well-Established TELCOs Leverage Data Warehousing and Analytics to Stay on Top in a Competitive Industry Mobile service providers (i.e., Telecommunication Companies, or TELCOs in short) that helped trigger the explosive growth of the industry in the mid- to late-1990s have long reaped the benefits of being first to market. But to stay competitive, these companies must continuously refine everything from customer service to plan pricing. In fact, veteran carriers face many of the same challenges that up-and-coming carriers do: retaining customers, decreasing costs, fine-tuning pricing models, improving customer satisfaction, acquiring new customers and understanding the role of social media in customer loyalty. Highly targeted data analytics play an ever-more-critical role in helping carriers secure or improve their standing in an increasingly competitive marketplace. Here’s how some of the world’s leading providers are creating a strong future based on solid business and customer intelligence. Customer Retention It's no secret that the speed and success with which a provider handles service requests directly affects customer satisfaction and, in turn, the propensity to churn. But getting down to which factors have the greatest impact is a challenge. “If we could trace the steps involved with each process, we could understand points of failure and acceleration,” notes Roxanne Garcia, manager of the Commercial Operations Center for Telefonica de Argentina. “We could measure workflows both within and across functions, anticipate rather than react to performance indicators, and improve the overall satisfaction with onboarding new customers.” The company’s solution was its traceability project, which began with 10 dashboards in 2009. It has since realized US$2.4 million in annualized revenues and cost savings, shortened customer provisioning times and reduced customer defections by 30%. Cost Reduction Staying ahead of the game in any industry depends, in large part, on keeping costs in line. For France’s Bouygues Telecom, cost reduction came in the form of automation. Aladin, the company’s Teradata-based marketing operations management system, automates marketing/communications collateral production. It delivered more than US$1 million in savings in a single year while tripling email campaign and content production. “The goal is to be more productive and responsive, to simplify teamwork, [and] to standardize and protect our expertise,” notes Catherine Corrado, the company’s project lead and retail communications manager. “[Aladin lets] team members focus on value-added work by reducing low-value tasks. The end result is more quality and more creative [output].” An unintended but very welcome benefit of Aladin is that other departments have been inspired to begin deploying similar projects for everything from call center support to product/offer launch processes. Customer Acquisition of data), structural metadata (i.e., data describing the structure of the data), and semantic metadata (i.e., data describing the meaning of the data in a specific domain). We next explain traditional metadata patterns and insights into how to implement an effective metadata strategy via a holistic approach to enterprise metadata integration. The approach includes ontology and metadata registries; enterprise information integration (Ell); extraction, transformation, and load (ETL); and service-oriented architectures (SOA). Effectiveness, extensibility, reusability, interoperability, efficiency and performance, evolution, entitlement, flexibility, segregation, user interface, versioning, versatility, and low maintenance cost are some of the key requirements for building a successful metadata-driven enterprise. According to Kassam (2002), business metadata comprise information that increases our understanding of traditional (i.e., structured) data. The primary purpose of metadata should be to provide context to the reported data; that is, it provides enriching information that leads to the creation of knowledge. Business metadata, though difficult to provide efficiently, release more of the potential of structured data. The context need not be the same for all users. In many ways, metadata assist in the conversion of data and information into knowledge. Metadata form a foundation for a metabusiness architecture (see Bell, 2001). Tannenbaum (2002) described how to identify metadata requirements. Vaduva and Vetterli (2001) provided an overview of metadata management for data warehousing. Zhao (2005) described five levels of metadata management maturity: (1) ad hoc, (2) discovered, (3) managed, (4) optimized, and (5) automated. These levels help in understanding where an organization is in terms of how and how well it uses its metadata. The design, creation, and use of metadata—descriptive or summary data about data—and its accompanying standards may involve ethical issues. There are ethical considerations involved in the collection and ownership of the information contained in metadata, including privacy and intellectual property issues that arise in the design, collection, and dissemination stages (for more, see Brody, 2003). Section 2.2 Review Questions 1. What is a data warehouse? 2. How does a data warehouse differ from a database? 3. What is an ODS? 4. Differentiate among a data mart, an ODS, and an EDW. 5. Explain the importance of metadata. 2.3 Data Warehousing Process Overview Organizations, private and public, continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems. Maintaining and using these data and information becomes extremely complex, especially as scalability issues arise. In addition, the number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet. Working with multiple databases, either integrated in a data warehouse or not, has become an extremely difficult task requiring considerable expertise, but it can provide immense benefits far exceeding its cost. As an illustrative example, Figure 2.2 shows business benefits of the enterprise data warehouse built by Teradata for a major automobile manufacturer. Many organizations need to create data warehouses—massive data stores of time-series data for decision support. Data are imported from various external and internal resources and are cleansed and organized in a manner consistent with the organization’s needs. After the data are populated in the data warehouse, data marts can be loaded for a specific area or department. Alternatively, data marts can be created first, as needed, and then integrated into an EDW. Often, though, data marts are not developed, but data are simply loaded onto PCs or left in their original state for direct manipulation using BI tools. Figure 2.2 Data-Driven Decision Making—Business Benefits of the Data Warehouse. Source: Teradata. Application Case 2.2 Data Warehousing Helps MultiCare Save More Lives In the spring of 2012, leadership at MultiCare Health System (MultiCare)—a Tacoma, Washington-based health system—realized the results of a 12-month journey to reduce septicemia. The effort was supported by the system’s top leadership, who participated in a data-driven approach to prioritize care improvement based on an analysis of resources consumed and variation in care outcomes. Reducing septicemia (mortality rates) was a top priority for MultiCare as a result of three hospitals performing below, and one that was performing well below, national mortality averages. In September 2010, MultiCare implemented Health Catalyst’s Adaptive Data Warehouse, a healthcare-specific data model, and subsequent clinical and process improvement services to measure and effect care through organizational and process improvements. Two major factors contributed to the rapid reduction in septicemia mortality. Clinical Data to Drive Imprvovement The Adaptive Data Warehouse™ organized and simplified data from multiple data sources across the continuum of care. It became the single source of truth requisite to see care improvement opportunities and to measure change. It also proved to be an important means to unify clinical, IT, and financial leaders and to drive accountability for performance improvement. Because it proved difficult to define sepsis due to the complex comorbidity factors leading to septicemia, MultiCare partnered with Health Catalyst to refine the clinical definition of sepsis. Health Catalyst’s data work allowed MultiCare to explore around the boundaries of the definition and to ultimately settle on an algorithm that defined a septic patient. The iterative work resulted in increased confidence in the severe sepsis cohort. System-Wide Critical Care Collaborative The establishment and collaborative efforts of permanent, integrated teams consisting of clinicians, technologists, analysts, and quality personnel were essential for accelerating MultiCare’s efforts to reduce septicemia mortality. Together the collaborative addressed three key bodies of work—standard of care definition, early identification, and efficient delivery of defined-care standard. access data. There are many front-end applications that business users can use to interact with data stored in the data repositories, including data mining, OLAP, reporting tools, and data visualization tools. No data marts option / p— Access Data mart (Marketing) eo Data mart (Engineering) Seas) Data mart (Finance) API/Middleware ~a Applications (Visualization) Routine eag business reports 4 Data /text mining ‘ OLAP. aq Dashboard, Web \ ‘| ory ae AEH Data | Sources | —— ETL =a | Process” ] ; ERP / —_' Select >_< —— Lega Metadata sil Extract === Pos Transform Enterprise Data — Integrate Warehouse Other OLT/WEB ~~ Load aad External ; L__data__J Replication a Figure 2.3 A Data Warehouse Framework and Views. Section 2.3 Review Questions 1. Describe the data warehousing process. 2. Describe the major components of a data warehouse. 3. ) —_—_—_> N Identify and discuss the role of middleware tools. Tier 1: Client workstation Tier 2: Applica tion server Custom-built applications ) L ) YY TY Tier 3: Database server Figure 2.4 Architecture of a Three-Tier Data Warehouse. 2.4 Data Warehousing Architectures There are several basic information system architectures that can be used for data warehousing. Generally speaking, these architectures are commonly called client/server or n-tier architectures, of which two-tier and three-tier architectures are the most common (see Figures 2.4 and 2.5), but sometimes there is simply one tier. These types of multi-tiered architectures are known to be capable of serving the needs of large-scale, performance-demanding information systems such as data warehouses. Referring to the use of n-tiered architectures for data warehousing, Hoffer et al. (2007) distinguished among these architectures by dividing the data warehouse into three parts: Tier 1: Tier 2: Client workstation Application and database server Figure 2.5 Architecture of a Two-Tier Data Warehouse. 1. The data warehouse itself, which contains the data and associated software 2. Data acquisition (back-end) software, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse 3. Client (front-end) software, which allows users to access and analyze data from the warehouse (a DSS/Bl/business analytics [BA] engine) In a three-tier architecture, operational systems contain the data and the software for data acquisition in one tier (i.e., the server), the data warehouse is another tier, and the third tier includes the DSS/BI/BA engine (i.e., the application server) and the client (see Figure 2.4). Data from the warehouse are processed twice and deposited in an additional multidimensional database, organized for easy multidimensional analysis and presentation, or replicated in data marts. The advantage of the three-tier architecture is its separation of the functions of the data warehouse, which eliminates resource constraints and makes it possible to easily create data marts. The Vanguard Group moved to a Web-based, three-tier architecture for its enterprise architecture to integrate all its data and provide customers with the same views of data as internal users (Dragoon, 2003). Likewise, Hilton migrated all its independent client/server systems to a three-tier data warehouse, using a Web design enterprise system. This change involved an investment of $3.8 million (excluding labor) and affected 1,500 users. It increased processing efficiency (speed) by a factor of six. When it was deployed, Hilton expected to save $4.5 to $5 million annually. Finally, Hilton experimented with Dell’s clustering (i.e., parallel computing) technology to enhance scalability and speed (see Anthes, 2003). Web architectures for data warehousing are similar in structure to other data warehousing architectures, requiring a design choice for housing the Web data warehouse with the transaction server or as a separate server(s). Page-loading speed is an important consideration in designing Web-based applications; therefore, server capacity must be planned carefully. Several issues must be considered when deciding which architecture to use. Among them are the following: e Which database management system (DBMS) should be used? Most data warehouses are built using relational database management systems (RDBMS). Oracle (Oracle Corporation, oracle.com), SQL Server (Microsoft Corporation, ), and DB2 (IBM Corporation, ) are the ones most commonly used. Each of these products supports both client/server and Web-based architectures. e Will parallel processing and/or partitioning be used? Parallel processing enables multiple CPUs to process data warehouse query requests simultaneously and provides scalability. Data warehouse designers need to decide whether the database tables will be partitioned (i.e., split into smaller tables) for access efficiency and what the criteria will be. This is an important consideration that is necessitated by the large amounts of data contained in a typical data warehouse. A recent survey on parallel and distributed data warehouses can be found in Furtado (2009). Teradata (' ) has successfully adopted and often commended on its novel implementation of this approach. e Will data migration tools be used to load the data warehouse? Moving data from an existing system into a data warehouse is a tedious and laborious task. Depending on the diversity and the location of the data assets, migration may be a relatively simple procedure or (on the contrary) a months-long project. The results of a thorough assessment of the existing data assets should be used to determine whether to use migration tools and, if so, what capabilities to seek in those commercial tools. e What tools will be used to support data retrieval and analysis? Often it is necessary to use specialized tools to periodically locate, access, analyze, extract, transform, and load necessary data into a data warehouse. A decision has to be made on (1) developing the migration tools in-house, (2) purchasing them from a third-party provider, or (3) using the ones provided with the data warehouse system. Overly complex, real-time migrations warrant specialized third-part ETL tools. Alternative Data Warehousing Architectures At the highest level, data warehouse architecture design viewpoints can be categorized into enterprise-wide data warehouse (EDW) design and data mart (DM) design (Golfarelli and Rizzi, 2009). In Figure 2.7 (parts a—e), we show some alternatives to the basic architectural design types that are neither pure EDW nor pure DM, but in between or beyond the traditional architectural structures. Notable new ones include hub-and-spoke and federated architectures. The five architectures shown in Figure 2.7 (parts a—e) are proposed by Ariyachandra and Watson (2005, 2006a, and 2006b). Previously, in an extensive study, Sen and Sinha (2005) identified 15 different data warehousing methodologies. The sources of these methodologies are classified into three broad categories: core-technology vendors, infrastructure vendors, and information-modeling companies. 1. Independent data marts. This is arguably the simplest and the least costly architecture alternative. The data marts are developed to operate independently of each another to serve the needs of individual organizational units. Because of their independence, they may have inconsistent data definitions and different dimensions and measures, making it difficult to analyze data across the data marts (i.e., itis difficult, if not impossible, to get to the “one version of the truth”). Data mart bus architecture. This architecture is a viable alternative to the independent data marts where the individual marts are linked to each other via some kind of middleware. Because the data are linked among the individual marts, there is a better chance of maintaining data consistency across the enterprise (at least at the metadata level). Even though it allows for complex data queries across data marts, the performance of these types of analysis may not be at a satisfactory level. Hub-and-spoke architecture. This is perhaps the most famous data warehousing architecture today. Here the attention is focused on building a scalable and maintainable infrastructure (often developed in an iterative way, subject area by subject area) that includes a centralized data warehouse and several dependent data marts (each for an organizational unit). This architecture allows for easy and customization of user interfaces and reports. On the negative side, this architecture lacks the holistic enterprise view and may lead to data redundancy and data latency. (a) Independent Data Marts Architecture o a Se Source Staging Independent data marts ae systems area (atomic/summarized data) BGCEsS ‘aN applications (b) Data Mart Bus Architecture with Linked Dimensional Data Marts ss) ge Dimensionalized SS Source Staging data marts linked by systems area conformed dimensions End-user access and applications (atomic/summarized data) central data warehouse architecture, which is advocated mainly by Teradata Corp., advises using data warehouses without any data marts (see Figure 2.8). rosconius ag] | |g) Su) BH Transactional Data Physical Database Design Data Transformation Operational Data Store (ODS) “Enterprise” Data Warehouse Data Replication Middleware /Enterprise Message Bus Logical Data Model Data Marts co es | a a | Business and Technology - Consultation Support and Education Services Enterprise, System, and Database Management Metadata Strategic Tactical Reporting Data Event-driven/ Users Users OLAP Users Miners Closed Loop 1. Figure 2.8 Teradata Corporation’s Enterprise Data Warehouse. 2. Source: Teradata Corporation (ieradata.com). Used with permission. 4. Federated data warehouse. The federated approach is a concession to the natural forces that undermine the best plans for developing a perfect system. It uses all possible means to integrate analytical resources from multiple sources to meet changing needs or business conditions. Essentially, the federated approach involves integrating disparate systems. In a federated architecture, existing decision support structures are left in place, and data are accessed from those sources as needed. The federated approach is supported by middleware vendors that propose distributed query and join capabilities. These eXtensible Markup Language (XML)-based tools offer users a global view of distributed data sources, including data warehouses, data marts, Web sites, documents, and operational systems. When users choose query objects from this view and press the submit button, the tool automatically queries the distributed sources, joins the results, and presents them to the user. Because of performance and data quality issues, most experts agree that federated approaches work well to supplement data warehouses, not replace them (see Eckerson, 2005). Ariyachandra and Watson (2005) identified 10 factors that potentially affect the architecture selection decision: 1. Information interdependence between organizational units 2. Upper management's information needs 3. Urgency of need for a data warehouse 4. Nature of end-user tasks 5. Constraints on resources 6. Strategic view of the data warehouse prior to implementation 7. Compatibility with existing systems 8. Perceived ability of the in-house IT staff 9. Technical issues 10. Social/political factors These factors are similar to many success factors described in the literature for information systems projects and DSS and BI projects. Technical issues, beyond providing technology that is feasibly ready for use, is important, but often not as important as behavioral issues, such as meeting upper management's information needs and user involvement in the development process (a social/political factor). Each data warehousing architecture has specific applications for which it is most (and least) effective and thus provides maximal benefits to the organization. However, overall, the data mart structure seems to be the least effective in practice. See Ariyachandra and Watson (2006a) for some additional details. Which Architecture Is the Best? Ever since data warehousing became a critical part of modern enterprises, the question of which data warehouse architecture is the best has been a topic of regular discussion. The two gurus of the data warehousing field, Bill Inmon and Ralph Kimball, are at the heart of this discussion. Inmon advocates the hub-and-spoke architecture (e.g., the Corporate Information Factory), whereas Kimball promotes the data mart bus architecture with conformed dimensions. Other architectures are possible, but these two options are fundamentally different approaches, and each has strong advocates. To shed light on this controversial question, Ariyachandra and Watson (2006b) conducted an empirical study. To collect the data, they used a Web-based survey targeted at individuals involved in data warehouse implementations. Their survey included questions about the respondent, the respondent's company, the company’s data warehouse, and the success of the data warehouse architecture. In total, 454 respondents provided usable information. Surveyed companies ranged from small (less than $10 million in revenue) to large (in excess of $10 billion). Most of the companies were located in the United States (60%) and represented a variety of industries, with the financial services industry (15%) providing the most responses. The predominant architecture was the hub-and-spoke architecture (39%), followed by the bus architecture (26%), the centralized architecture (17%), independent data marts (12%), and the federated architecture (4%). The most common platform for hosting the data warehouses was Oracle (41%), followed by Microsoft (19%) and IBM (18%). The average (mean) gross revenue varied from $3.7 billion for independent data marts to $6 billion for the federated architecture. They used four measures to assess the success of the architectures: (1) information quality, (2) system quality, (3) individual impacts, and (4) organizational impacts. The questions used a seven-point scale, with the higher score indicating a more successful architecture. Table 2.1 shows the average scores for the measures across the architectures. As the results of the study indicate, independent data marts scored the lowest on all measures. This finding confirms the conventional wisdom that independent data marts are a poor architectural solution. Next lowest on all measures was the federated architecture. Firms sometimes have disparate decision-support platforms resulting from mergers and acquisitions, and they may choose a federated approach, at least in the short run. The findings suggest that the federated architecture is not an optimal long-term solution. What is interesting, however, is the similarity of the averages for the bus, hub-and-spoke, and centralized architectures. The differences are sufficiently small that no claims can be made for a particular architecture’s superiority over the others, at least based on a simple comparison of these success measures. Data Integration Data integration comprises three major processes that, when correctly implemented, permit data to be accessed and made accessible to an array of ETL and analysis tools and the data warehousing environment: data access (i.e., the ability to access and extract data from any data source), data federation (i.e., the integration of business views across multiple data stores), and change capture (based on the identification, capture, and delivery of the changes made to enterprise data sources). See Application Case 2.3 for an example of how BP Lubricant benefits from implementing a data warehouse that integrates data from many sources. Some vendors, such as SAS Institute, Inc., have developed strong data integration tools. The SAS enterprise data integration server includes customer data integration tools that improve data quality in the integration process. The Oracle Business Intelligence Suite assists in integrating data as well. Application Case 2.3 BP Lubricants Achieves BIGS Success BP Lubricants established the BIGS program following recent merger activity to deliver globally consistent and transparent management information. As well as timely business intelligence, BIGS provides detailed, consistent views of performance across functions such as finance, marketing, sales, and supply and logistics. BP is one of the world’s largest oil and petrochemicals groups. Part of the BP plc group, BP Lubricants is an established leader in the global automotive lubricants market. Perhaps best known for its Castrol brand of oils, the business operates in over 100 countries and employs 10,000 people. Strategically, BP Lubricants is concentrating on further improving its customer focus and increasing its effectiveness in automotive markets. Following recent merger activity, the company is undergoing transformation to become more effective and agile and to seize opportunities for rapid growth. Challenge Following recent merger activity, BP Lubricants wanted to improve the consistency, transparency, and accessibility of management information and business intelligence. In order to do so, it needed to integrate data held in disparate source systems, without the delay of introducing a standardized ERP system. Solution BP Lubricants implemented the pilot for its Business Intelligence and Global Standards (BIGS) program, a strategic initiative for management information and business intelligence. At the heart of BIGS is Kalido, an adaptive enterprise data warehousing solution for preparing, implementing, operating, and managing data warehouses. Kalido’s federated enterprise data warehousing solution supported the pilot program’s complex data integration and diverse reporting requirements. To adapt to the program’s evolving reporting requirements, the software also enabled the underlying information architecture to be easily modified at high speed while preserving all information. The system integrates and stores information from multiple source systems to provide consolidated views for: e Marketing Customer proceeds and margins for market segments with drill down to invoice-level detail e Sales Sales invoice reporting augmented with both detailed tariff costs and actual payments e Finance Globally standard profit and loss, balance sheet, and cash flow statements—with audit ability; customer debt management supply and logistics; consolidated view of order and movement processing across multiple ERP platforms Benefits By improving the visibility of consistent, timely data, BIGS provides the information needed to assist the business in identifying a multitude of business opportunities to maximize margins and/or manage associated costs. Typical responses to the benefits of consistent data resulting from the BIGS pilot include: e Improved consistency and transparency of business data e Easier, faster, and more flexible reporting e Accommodation of both global and local standards e Fast, cost-effective, and flexible implementation cycle e Minimal disruption of existing business processes and the day-to-day business e Identification of data quality issues and encourages their resolution e Improved ability to respond intelligently to new business opportunities Questions for Discussion 1. What is BIGS? 2. What were the challenges, the proposed solution, and the obtained results with BIGS? Sources: Kalido, “BP Lubricants Achieves BIGS, Key IT Solutions,” www.kalido.com/customer-stories/bp-ple.htm (accessed July 2013); and BP Lubricant homepage, (accessed July 2013). A major purpose of a data warehouse is to integrate data from multiple systems. Various integration technologies enable data and metadata integration: e Enterprise application integration (EAI) e Service-oriented architecture (SOA) e Enterprise information integration (Ell) e Extraction, transformation, and load (ETL) Enterprise application integration (EAI) provides a vehicle for pushing data from source systems into the data warehouse. It involves integrating application functionality and is focused on sharing functionality (rather than data) across systems, thereby enabling flexibility and reuse. Traditionally, EAl solutions have focused on enabling application reuse at the application programming interface (API) level. Recently, EAI is accomplished by using SOA coarse-grained services (a collection of business processes or functions) that are well defined and documented. Using Web services is a specialized way of implementing an SOA. EAI can be used to facilitate data acquisition directly into a near-real-time data warehouse or to deliver decisions to the OLTP systems. There are many different approaches to and tools for EAI implementation. Enterprise information integration (Ell) is an evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases. It is a mechanism for pulling data from source systems to satisfy a request for information. Ell tools use predefined metadata to populate views that make integrated data appear relational to end users. XML may be the most important aspect of Ell because XML allows data to be tagged either at creation time or later. These tags can be extended and modified to accommodate almost any area of knowledge (see Kay, 2005). Physical data integration has conventionally been the main mechanism for creating an integrated view with data warehouses and data marts. With the advent of Ell tools (see Kay, 2005), new virtual data integration patterns are

Data Warehousing and Data Mining, Study notes of Business Demography and Environmental Studies

Related documents

Partial preview of the text

Download Data Warehousing and Data Mining and more Study notes Business Demography and Environmental Studies in PDF only on Docsity!