















A comprehensive overview of data warehousing and data mining concepts, covering topics such as data sources, data types, data warehousing processes, data mining techniques, and data visualization. It explores the importance of data quality, data cleansing, and data preparation for successful data mining projects. The document also discusses common myths and mistakes associated with data mining, emphasizing the need for clear business objectives and effective data analysis.
Type: Lecture notes
















Business is the act of doing something productive to serve someone’s needs, and thus earn a living and make the world a better place. Business activities are recorded on paper or using electronic media, and then these records become data. There is more data from customers’ responses and on the industry as a whole. All this data can be analysed and mined using special tools and techniques to generate patterns and intelligence, which reflect how the business is functioning. These ideas can then be fed back into the business so that it can evolve to become more effective and efficient in serving customer needs. And the cycle continues on.
Any business organization needs to continually monitor its business environment and its own performance, and then rapidly adjust its future plans. This includes monitoring the industry, the competitors, the suppliers, and the customers. Business intelligence (BI) is a broad set of information technology (IT) solutions that includes tools for gathering, analyzing, and reporting information to users about the performance of the organization and its environment. These IT solutions are among the most highly prioritized solutions for investment.
A pattern is a design or model that helps grasp something. Patterns help connect things that may not appear to be connected. Patterns help cut through complexity and reveal simpler understandable trends. A perfect pattern or model is one that (a) accurately describes a situation, (b) is broadly applicable, and (c) can be described in a simple manner.
Data is the new natural resource. Data lies at the heart of business intelligence. There is a sequence of steps to be followed to benefit from the data in a systematic way. Data can be modelled and stored in a database. Relevant data can be extracted from the operational data stores according to certain reporting and analyzing purposes, and stored in a data warehouse. The data from the warehouse can be combined with other sources of data, and mined using data mining techniques to generate new insights. The insights need to be visualized and communicated to the right audience in real time for competitive advantage.
Anything that is recorded is data.
A database is a modeled collection of data that is accessible in many ways. A data model can be designed to integrate the operational data of the organization. Databases have grown tremendously over time. They have grown in complexity, in terms of the number of objects and properties being recorded. They have also grown in the quantity of data being stored. Many database management systems (DBMSs) are available to help store and manage this data. These include commercial systems such as Oracle and DB2.
The goal for an organization is to make effective decisions, while reducing risk. Businesses calculate risks and make decisions based on a broad set of facts and insights. Reliable knowledge about the future can help managers make the right decisions with lower levels of risk.
There are two main kinds of decisions: strategic decisions and operational decisions. BI can help make both better.
● Strategic decisions are those that impact the direction of the company. The decision to reach out to a new customer set would be a strategic decision. In strategic decision-making, the goal itself may or may not be clear, and the same is true for the path to reach the goal. The consequences of the decision would be apparent some time later. Thus, one is constantly scanning for new possibilities and new paths to achieve the goals. BI can help with what-if analysis of many possible scenarios. BI can also help create new ideas based on new patterns found from data mining.
● Operational decisions are more routine and tactical decisions, focused on developing greater efficiency. Operational decisions can be made more efficient using an analysis of past data. A classification system can be created and modeled using the data of past instances to develop a good model of the domain. This model can help improve operational decisions in the future. BI can help automate operations level decision-making and improve efficiency by making millions of microlevel operational decisions in a model-driven way.
BI includes a variety of software tools and techniques to provide managers with the information and insights needed to run the business. BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining. A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. A dashboarding system, such as IBM Cognos or Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. These are industrial-strength systems that provide capabilities to apply a wide range of analytical models on large data sets. Open source systems, such as Weka, are popular platforms designed to help mine large amounts of data to discover patterns.
A skilled and experienced BI specialist should be open enough to go outside the box, open the aperture and see a wider perspective that includes more dimensions and variables, in order to find important patterns and insights. The problem needs to be looked at from a wider perspective to consider many more angles that may not be immediately obvious. An imaginative solution should be proposed for the problem so that interesting and useful results can emerge.
A business exists to serve a customer. A happy customer becomes a repeat customer. BI applications can impact many aspects of marketing.
Retail organizations grow by meeting customer needs with quality products, in a convenient, timely, and cost-effective manner. Understanding emerging customer shopping patterns can help retailers organize their products, inventory, store layout, and web presence in order to delight their customers, which in turn would help increase revenue and profits.
Banks make loans and offer credit cards to millions of customers. They are most interested in improving the quality of loans and reducing bad debts. They also want to retain more good customers, and sell more services to them.
Chapter 3: Data Warehousing
A data warehouse (DW) is an organized collection of integrated, subject-oriented databases designed to support decision-making functions. A DW provides clean, enterprise-wide data in a standardized format for reports, queries, and analysis. It is physically and functionally separate from operational and transactional databases. Creating a DW for analysis and queries represents a significant investment of time and effort. It has to be constantly kept up to date to be useful. A DW offers many business and technical benefits. For example, a DW can provide a competitive advantage by facilitating decision making and helping reform business processes.
Design Considerations for DW
Here are some requirements for a good DW:
There are two approaches to developing a DW: top down and bottom up. The top-down approach is to make a comprehensive DW that covers all the reporting needs of the enterprise. The bottom-up approach is to produce small data marts, for the reporting needs of different departments or functions, as needed. The top-down approach provides consistency but takes more time and resources. The bottom-up approach leads to healthy local ownership and maintainability of data.
The heart of a useful DW is the processes to populate the DW with good quality data. This is called the Extract-Transform-Load (ETL) cycle.
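The three ETL steps can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the CSV export, the column names, and the use of a plain Python list as a stand-in for a warehouse table. A real ETL pipeline would read from operational systems and write to a database.

```python
import csv
import io

# Hypothetical raw export from an operational system (assumption, not from the source).
RAW_CSV = """order_id,customer,amount,order_date
1001, alice ,19.99,2024-01-05
1002,BOB,5.00,2024-01-06
1003,carol,not_available,2024-01-06
"""

def extract(text):
    """Extract: read raw rows from the operational source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: standardize names, convert types, drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows whose amount is not numeric
        clean.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": amount,
            "order_date": row["order_date"],
        })
    return clean

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse table (a plain list here)."""
    warehouse.extend(rows)
    return len(rows)

warehouse_orders = []
loaded = load(transform(extract(RAW_CSV)), warehouse_orders)
print(loaded, [r["customer"] for r in warehouse_orders])  # 2 ['Alice', 'Bob']
```

Keeping the three steps as separate functions mirrors the cycle itself: each stage can be tested and rerun independently as new operational data arrives.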
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for the codes used in the central table. For example, the central table may use a numeric code to represent a salesperson. The lookup table then provides the name for that salesperson code.
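A star schema of this kind can be sketched with Python's built-in sqlite3 module. The table and column names (fact_sales, dim_salesperson, dim_product) and the sample rows are hypothetical, chosen only to illustrate a central fact table that stores codes and lookup tables that resolve them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Lookup (dimension) tables: one row per code, with its descriptive value.
CREATE TABLE dim_salesperson (salesperson_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);

-- Central fact table: stores only codes and measures.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    salesperson_id INTEGER REFERENCES dim_salesperson(salesperson_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);

INSERT INTO dim_salesperson VALUES (1, 'Ana'), (2, 'Raj');
INSERT INTO dim_product VALUES (10, 'Laptop'), (20, 'Monitor');
INSERT INTO fact_sales VALUES (1, 1, 10, 1200.0), (2, 2, 20, 300.0), (3, 1, 20, 250.0);
""")

# Resolve the salesperson code through the lookup table and total the sales.
totals = conn.execute("""
    SELECT s.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_salesperson AS s ON s.salesperson_id = f.salesperson_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(totals)  # [('Ana', 1450.0), ('Raj', 300.0)]
```

In a snowflake variant, a lookup table such as dim_product would itself reference a further lookup table, for example one for product categories.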
Other schemas include the snowflake architecture. The difference between a star and snowflake is that in the latter, the look-up tables can have their own further look up tables.
Data from the DW could be accessed for many purposes, by many users, through many devices.
A data warehousing project reflects a significant investment into information technology (IT). All of the best practices in implementing any IT project should be followed:
Gathering and curating data takes time and effort, particularly when it is unstructured or semistructured. Data can come in many forms, such as databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put into rectangular shapes with clear columns and rows, before submitting it to data mining. Knowledge of the business domain helps select the right streams of data for pursuing new insights. The data elements should be relevant, and should suitably address the problem being solved.
Data cleansing and preparation
The quality of data is critical to the success and value of the data mining project. The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it is more likely to be accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable. Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed before it is ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up to 60-70% of the time needed for a data mining project.
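A few typical cleansing steps can be sketched as follows. The records, the field names, and the choice to fill missing ages with the median are all illustrative assumptions; the appropriate rules depend on the data and the business problem.

```python
from statistics import median

# Hypothetical raw records (assumption): inconsistent labels, a duplicate, a missing value.
raw = [
    {"id": 1, "city": " new york ", "age": "34"},
    {"id": 2, "city": "NEW YORK", "age": ""},
    {"id": 2, "city": "NEW YORK", "age": ""},  # duplicate of the previous record
    {"id": 3, "city": "Boston", "age": "41"},
    {"id": 4, "city": "boston", "age": "29"},
]

def cleanse(records):
    """Return rectangular rows: de-duplicated, standardized, with no missing ages."""
    seen, rows = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue  # remove duplicate records by id
        seen.add(rec["id"])
        rows.append({
            "id": rec["id"],
            "city": rec["city"].strip().title(),  # standardize the label
            "age": int(rec["age"]) if rec["age"].strip() else None,
        })
    # Fill missing ages with the median of the known ages (one possible policy).
    fill = median(r["age"] for r in rows if r["age"] is not None)
    for r in rows:
        if r["age"] is None:
            r["age"] = fill
    return rows

clean = cleanse(raw)
for row in clean:
    print(row)
```

After cleansing, every row has the same three columns with consistent types, which is the rectangular shape that data mining tools expect.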
One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch.
A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in the coastal areas". Centroids are the typical representation of the output of a cluster analysis exercise.
Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those who buy milk and bread will also buy butter (with 80 percent probability).
The output can also be in the form of a regression equation or mathematical function that represents the best-fitting curve for the data. This equation may include linear and non-linear terms. Regression equations are a good way of representing the output of classification exercises. They are also a good representation of forecasting formulae.
Evaluating Data Mining Results
There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. A common metric for all classification techniques is predictive accuracy.
Predictive Accuracy = (Correct Predictions) / (Total Predictions)
Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. The model is then used to predict other data instances. When a truly positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a truly negative data point is classified as negative, that is a true negative (TN). On the other hand, when a truly positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a truly negative data point is classified as positive, that is called a false positive (FP).
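The four outcomes and the accuracy formula can be worked through in a short sketch. The labels below are made-up data (1 = positive, 0 = negative), and the function names are illustrative. Since the correct predictions are the true positives plus the true negatives, accuracy equals (TP + TN) / (TP + TN + FP + FN).

```python
def confusion_counts(actual, predicted):
    """Count true positives, true negatives, false positives, and false negatives."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1   # positive classified as positive
        elif not a and not p:
            tn += 1   # negative classified as negative
        elif not a and p:
            fp += 1   # negative classified as positive
        else:
            fn += 1   # positive classified as negative
    return tp, tn, fp, fn

def predictive_accuracy(tp, tn, fp, fn):
    """Correct predictions divided by total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical ground truth and model output for ten patients.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)                       # 3 5 1 1
print(predictive_accuracy(tp, tn, fp, fn))  # 0.8
```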
be used to model and predict the energy consumption as a function of daily temperature. Once a regression model has been developed, the energy consumption on any future day can be predicted using an equation. The accuracy of the regression model depends heavily on the dataset used, often more than on the particular algorithm or tool.
Artificial Neural Networks (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream of Computer Science. It mimics the behavior of the human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. There could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It continues to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. At some point, the neural network will have learned enough to begin to match the predictive accuracy of a human expert. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than those of human experts. At that point, the ANNs can begin to be seriously considered for deployment in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and comparatively tolerant of noisy data. However, they require a lot of data to train before they develop good predictive ability.
Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things.
Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters from past data. The output is the centroid of each cluster and the allocation of data points to their clusters. Clustering is also a part of the artificial intelligence family of techniques.
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, the method helps in answering questions about cross-selling opportunities. This is the heart of the personalization engines used by ecommerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. As a form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities.
Tools and Platforms for Data Mining
Data mining tools have existed for many decades. However, they have recently become more important as the value of data has grown. There is a wide range of data mining platforms available in the market today.
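The strength of an association rule X → Y is commonly expressed with two standard measures that the text does not name: support (the share of baskets containing the items) and confidence (the share of baskets containing X that also contain Y). The following sketch computes both for the rule {milk, bread} → {butter}. The baskets are invented so that the confidence comes out at 80 percent.

```python
# Hypothetical shopping baskets (assumption, not from the source).
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"eggs", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(x, y, baskets):
    """Of the baskets that contain X, the fraction that also contain Y."""
    return support(x | y, baskets) / support(x, baskets)

x, y = {"milk", "bread"}, {"butter"}
print(support(x | y, baskets))   # 0.5
print(confidence(x, y, baskets)) # 0.8
```

A rule with high confidence but very low support describes a strong affinity that occurs rarely, so both numbers are usually examined together.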
algorithms gets called in. Understanding the relative strengths of various algorithms is helpful but not mandatory.
Myth #2: Data Mining is about predictive accuracy. While important, predictive accuracy is a feature of the algorithm. As in myth #1, the quality of output is a strong function of the right problem, the right hypothesis, and the right data.
Myth #3: Data Mining requires a data warehouse. Some data mining problems may benefit from clean data available directly from the DW, but a DW is not mandatory.
Myth #4: Data Mining requires large quantities of data. Many interesting data mining exercises are done using small or medium-sized data sets, at low cost, using end-user tools.
Myth #5: Data Mining requires a technology expert. Many interesting data mining exercises are done by end users and executives using simple everyday tools like spreadsheets.
Data Mining Mistakes
Here are some of the more common mistakes in data mining, which should be avoided.
Mistake #1: Selecting the wrong problem for data mining: Without the right goals, data mining leads to a waste of time. Getting the right answer to an irrelevant question could be interesting, but it would be pointless from a business perspective. A good goal is one that would deliver a good ROI to the organization.
Mistake #2: Buried under mountains of data without clear metadata: It is more important to be engaged with the data than to have lots of data. The relevant data required may be much less than initially thought. There may be insufficient knowledge about the data, or metadata.
Mistake #3: Disorganized data mining: Without clear goals, much time is wasted. Running the same tests with the same mining algorithms repeatedly and blindly, without a plan or a thought for the next stage, leads to wasted time and energy. Not leaving sufficient time for data acquisition, selection, and preparation can lead to data quality issues, and to garbage in, garbage out (GIGO). Similarly, not providing enough time for testing the model, training the users, and deploying the system can make the project a failure.
Mistake #4: Insufficient business knowledge: Without a deep understanding of the business domain, the results would be meaningless. Don't rule out anything when observing data analysis results. Don't ignore suspicious (good or bad) findings. Even when insights emerge at one level, it is important to slice and dice the data at other levels to see if more powerful insights can be extracted.
Mistake #5: Incompatibility of data mining tools and datasets: All the tools, from data gathering and preparation to mining and visualization, should work together. Use tools that can work with data from multiple sources in multiple industry-standard formats.
Mistake #6: Looking only at aggregated results and not at individual records or predictions: It is possible that the right results at the aggregate level lead to absurd conclusions at the individual record level. Diving into the data at the right angle can yield insights at many levels of data.
Mistake #7: Measuring your results differently from the way your sponsor measures them: If the data mining team loses its sense of business objectives and begins to mine data for its own sake, it will lose respect and executive support very quickly. The BIDM cycle should be remembered.
Review Questions
Data Visualization is the art and science of making data easy to understand and consume for the end user. Ideal visualization shows the right amount of data, in the right order and form, to convey the high-priority information. The right visualization requires an understanding of the consumer's needs, the nature of the data, and the many tools and techniques available to present data. Data visualization is the last step in the data life cycle. It is where the data is processed for presentation in an easy manner to the right audience for the right purpose. The data should be converted into a language and format that is well understood by the consumer of the data, and the presentation should highlight the insights from the data in an actionable manner. It is important to remember that if the data is presented in too much detail, the consumer might lose interest.
EXCELLENCE IN VISUALIZATION
Data can be presented as rectangular tables, or as colorful graphs of various types. "Small, non-comparative, highly-labeled data sets usually belong in tables" (Ed Tufte, 2001, p. 33). However, for larger amounts of data, graphs are preferable. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence: