Data Warehousing and Data Mining: A Comprehensive Guide, Lecture Notes in Business Planning

A comprehensive overview of data warehousing and data mining concepts, covering topics such as data sources, data types, data warehousing processes, data mining techniques, and data visualization. It explores the importance of data quality, data cleansing, and data preparation for successful data mining projects. The document also discusses common myths and mistakes associated with data mining, emphasizing the need for clear business objectives and effective data analysis.

Type: Lecture notes

2022/2023

On sale since 10/12/2024

r30

Business Intelligence and Data Mining

Chapter 1: Wholeness of Data Analytics

Business is the act of doing something productive to serve someone’s needs, and thus earn a living and make the world a better place. Business activities are recorded on paper or using electronic media, and then these records become data. There is more data from customers’ responses and on the industry as a whole. All this data can be analysed and mined using special tools and techniques to generate patterns and intelligence, which reflect how the business is functioning. These ideas can then be fed back into the business so that it can evolve to become more effective and efficient in serving customer needs. And the cycle continues on.

BUSINESS INTELLIGENCE

Any business organization needs to continually monitor its business environment and its own performance, and then rapidly adjust its future plans. This includes monitoring the industry, the competitors, the suppliers, and the customers. Business intelligence is a broad set of information technology (IT) solutions that includes tools for gathering, analyzing, and reporting information to the users about the performance of the organization and its environment. These IT solutions are among the most highly prioritised solutions for investment.

PATTERN RECOGNITION

A pattern is a design or model that helps grasp something. Patterns help connect things that may not appear to be connected. Patterns help cut through complexity and reveal simpler understandable trends. A perfect pattern or model is one that (a) accurately describes a situation, (b) is broadly applicable, and (c) can be described in a simple manner.

DATA PROCESSING CHAIN

Data is the new natural resource. Data lies at the heart of business intelligence. There is a sequence of steps to be followed to benefit from the data in a systematic way. Data can be modelled and stored in a database. Relevant data can be extracted from the operational data stores according to certain reporting and analyzing purposes, and stored in a data warehouse. The data from the warehouse can be combined with other sources of data, and mined using data mining techniques to generate new insights. The insights need to be visualized and communicated to the right audience in real time for competitive advantage.

Data

Anything that is recorded is data.

  1. Data can come from any number of sources. It can come from operational records inside an organization, and it can come from records compiled by the organization, by individuals, and so on.
  2. Data can come in many forms. It may come as paper reports. It may come as a file stored on a computer.
  3. There is also data about data. It is called metadata. For example, people regularly upload videos on YouTube. The format of the video file (whether it was a high-def file or lower resolution) is metadata.

There are different types of data:

  1. Data could be an unordered collection of values, e.g. a retailer sells shirts of red, blue, and green colors. There is no intrinsic ordering among these color values. One can hardly argue that any one color is higher or lower than the other. This is called nominal (meaning names) data.
  2. Data could be ordered values, e.g. small, medium, large. This is called ordinal (ordered) data.
  3. Data could have numeric values defined in a certain range, with the assumption of equal distance between values. This is called interval data.
  4. Data could take on any numeric value. This is called ratio data.
  5. Data such as audio, video, and graphs must first be structured before it can be analysed. These kinds of data lend themselves to different forms of analysis and mining. They may contain sentiment and intention, but these are not quantitatively precise.

The precision of analysis increases as data becomes more numeric. Ratio data could be subjected to rigorous mathematical analysis. For example, precise weather data about temperature, pressure, and humidity can be used to create rigorous mathematical models that can accurately predict future weather.

Note: Datafication is a new term that means that almost every phenomenon is now being observed and stored. More devices are connected to the Internet. More people are constantly connected to “the grid,” by their phone network or the Internet, and so on.
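The four measurement scales described above differ in which summaries are meaningful. The following is a minimal sketch of that idea, not a prescribed method: the sample values, the `SIZE_ORDER` mapping, and the helper names are all hypothetical and invented for illustration.

```python
from statistics import mean, mode

# Hypothetical sample values, invented for illustration only.
shirt_colors = ["red", "blue", "green", "blue"]        # nominal: names only
shirt_sizes = ["small", "large", "medium", "medium"]   # ordinal: ordered labels
temperatures_c = [18.0, 21.5, 19.0]                    # interval: equal spacing
sales_amounts = [120.0, 80.0, 100.0]                   # ratio: ratios are meaningful

# Assumed ordering of the ordinal labels.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}


def most_common(values):
    """Nominal data supports counting, so the mode is a valid summary."""
    return mode(values)


def ordinal_median(values, order):
    """Ordinal data can be ranked, so a median label is meaningful.

    For an even number of values this returns the upper of the two
    middle labels.
    """
    ranked = sorted(values, key=order.__getitem__)
    return ranked[len(ranked) // 2]


def ratio_between(a, b):
    """Ratio data supports statements such as 'a is 1.5 times b'."""
    return a / b


print(most_common(shirt_colors))                  # mode of nominal values
print(ordinal_median(shirt_sizes, SIZE_ORDER))    # median of ordinal values
print(mean(temperatures_c))                       # mean of interval values
print(ratio_between(sales_amounts[0], sales_amounts[1]))
```

Averaging the shirt colors, by contrast, would have no meaning, which is why the choice of analysis should follow the type of data.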

Database

A database is a modeled collection of data that is accessible in many ways. A data model can be designed to integrate the operational data of the organization. Databases have grown tremendously over time. They have grown in complexity, in terms of the number of objects and properties being recorded. They have also grown in the quantity of data being stored. Many database management software systems (DBMSs) are available to help store and manage this data. These include commercial systems, such as Oracle and DB2.

Data Warehouse

  1. Choose wisely from a palette of graphs to suit the data.
  2. Organize the results to make the central point stand out.
  3. Ensure that the visuals accurately reflect the numbers. Inappropriate visuals can create misinterpretations and misunderstandings.
  4. Make the presentation unique, imaginative, and memorable.

Executive dashboards are designed to provide information on a select few variables for every executive. They use graphs, dials, and lists to show the status of important parameters. These dashboards also have a drill-down capability to enable a root-cause analysis of exception situations.

Possible questions

  1. Describe the Business Intelligence and Data Mining cycle.
  2. Describe the data processing chain.
  3. What are the similarities between diamond mining and data mining?
  4. What are the different data mining techniques? Which of these would be relevant in your current work?
  5. What is a dashboard? How does it help?

Chapter 2: Business Intelligence Concepts and Applications

The goal for an organization is to make effective decisions, while reducing risk. Businesses calculate risks and make decisions based on a broad set of facts and insights. Reliable knowledge about the future can help managers make the right decisions with lower levels of risk.

Decision Types

There are two main kinds of decisions: strategic decisions and operational decisions. BI can help make both better.

● Strategic decisions are those that impact the direction of the company. The decision to reach out to a new customer set would be a strategic decision. In strategic decision-making, the goal itself may or may not be clear, and the same is true for the path to reach the goal. The consequences of the decision would be apparent some time later. Thus, one is constantly scanning for new possibilities and new paths to achieve the goals. BI can help with what-if analysis of many possible scenarios. BI can also help create new ideas based on new patterns found from data mining.

● Operational decisions are more routine and tactical decisions, focused on developing greater efficiency. Operational decisions can be made more efficient using an analysis of past data. A classification system can be created and modeled using the data of past instances to develop a good model of the domain. This model can help improve operational decisions in the future. BI can help automate operations-level decision-making and improve efficiency by making millions of microlevel operational decisions in a model-driven way.

BI Tools

BI includes a variety of software tools and techniques to provide the managers with the information and insights needed to run the business. BI tools include data warehousing, online analytical processing, social media analytics, reporting, dashboards, querying, and data mining. A spreadsheet tool, such as Microsoft Excel, can act as an easy but effective BI tool by itself. Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights, then presented in the form of graphs and tables. A dashboarding system, such as IBM Cognos or Tableau, can offer a sophisticated set of tools for gathering, analyzing, and presenting data. They are industrial-strength systems that provide capabilities to apply a wide range of analytical models on large data sets. Open source systems, such as Weka, are popular platforms designed to help mine large amounts of data to discover patterns.

BI Skills

A skilled and experienced BI specialist should be open enough to go outside the box, open the aperture and see a wider perspective that includes more dimensions and variables, in order to find important patterns and insights. The problem needs to be looked at from a wider perspective to consider many more angles that may not be immediately obvious. An imaginative solution should be proposed for the problem so that interesting and useful results can emerge.

Customer Relationship Management

A business exists to serve a customer. A happy customer becomes a repeat customer. BI applications can impact many aspects of marketing.

  1. Maximize the return on marketing campaigns: Understanding the customer’s pain points from data-based analysis can ensure that the marketing messages are fine-tuned to better resonate with customers.

Retail

Retail organizations grow by meeting customer needs with quality products, in a convenient, timely, and cost-effective manner. Understanding emerging customer shopping patterns can help retailers organize their products, inventory, store layout, and web presence in order to delight their customers, which in turn would help increase revenue and profits.

  1. Optimize inventory levels at different locations: Retailers need to manage their inventories carefully. Carrying too much inventory imposes carrying costs, while carrying too little inventory can cause stock-outs and lost sales opportunities.
  2. Improve store layout and sales promotions: A market basket analysis can develop predictive models of which products sell together often.
  3. Optimize logistics for seasonal effects: Seasonal products offer tremendously profitable short-term sales opportunities, yet they also offer the risk of unsold inventories at the end of the season.
  4. Minimize losses due to limited shelf life: Perishable goods offer challenges in terms of disposing of the inventory in time.
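The market basket analysis mentioned in item 2 above can be illustrated with a minimal sketch that counts which product pairs appear together and how often one set of products accompanies another. The baskets, product names, and function names are hypothetical; "confidence" is the usual association-rule term for the second measure and is used here as an assumption about terminology, not as something defined in these notes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
BASKETS = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"eggs"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent <= b)
    return hits / len(with_antecedent)


print(pair_counts(BASKETS).most_common(3))
print(confidence(BASKETS, {"milk", "bread"}, {"butter"}))
```

Pairs with high counts are candidates for adjacent shelf placement or joint promotions; a real analysis would run over far more transactions than this toy list.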

Banking

Banks make loans and offer credit cards to millions of customers. They are most interested in improving the quality of loans and reducing bad debts. They also want to retain more good customers, and sell more services to them.

  1. Automate the loan application process: Decision models can be generated from past data that predict the likelihood of a loan proving successful.
  2. Detect fraudulent transactions: Billions of financial transactions happen around the world every day. Exception-seeking models can identify patterns of fraudulent transactions.
  3. Maximize customer value (cross-, up-selling). Selling more products and services to existing customers is often the easiest way to increase revenue.
  4. Optimize cash reserves with forecasting. Banks have to maintain certain liquidity to meet the needs of depositors who may like to withdraw money. Using past data and trend analysis, banks can forecast how much to keep and invest the rest to earn interest.

Possible questions

  1. Why should organizations invest in business intelligence solutions? Are these more important than IT security solutions? Why or why not?
  2. List 3 business intelligence applications in the hospitality industry.

Chapter 3: Data Warehousing

A data warehouse (DW) is an organized collection of integrated, subject-oriented databases designed to support decision support functions. DW provides clean enterprise-wide data in a standardized format for reports, queries, and analysis. It is physically and functionally separate from operational and transactional databases. Creating a DW for analysis and queries represents a significant investment in time and effort. It has to be constantly kept up-to-date for it to be useful. DW offers many business and technical benefits. For example, DW can present a competitive advantage by facilitating decision making and helping reform business processes.

Design Considerations for DW

Here are some requirements for a good DW:

  1. Subject oriented: To be effective, a DW should be designed around a subject domain, i.e. to help solve a certain category of problems.
  2. Integrated: The DW should include data from many functions that can shed light on a particular subject area.
  3. Time-variant: The data in DW should grow over time.
  4. Nonvolatile: DW should be persistent and consistently available for analysis, across the organization and over time.
  5. Summarized: DW contains rolled-up data at the right level for queries and analysis.
  6. Not normalized: DW often uses a star schema, which is a rectangular central table, surrounded by some look-up tables. The single table view significantly enhances speed of queries.
  7. Metadata: Many of the variables in the database are computed from other variables in the operational database. Every element in the DW should be sufficiently well-defined.
  8. Near real-time and/or right-time (active): DWs should be updated in near real-time in many high transaction volume industries, such as airlines. The cost of implementing and updating DW in real time could be discouraging, though. Another downside of real-time DW is the possibility of inconsistencies in reports drawn just a few minutes apart.

DW Development Approaches

There are two approaches to developing DW: top down and bottom up. The top-down approach is to make a comprehensive DW that covers all the reporting needs of the enterprise. The bottom-up approach is to produce small data marts, for the reporting needs of different departments or functions, as needed. The top-down approach provides consistency but takes more time and resources. The bottom-up approach leads to healthy local ownership and maintainability of data.

  1. Specialized applications : This includes applications such as Point of Sale (POS) terminals, and e-commerce applications, that also provide customer-facing data.
  2. External syndicated data : This includes publicly available data such as weather or economic activity data. It provides good contextual information to decision makers.

Data Loading Processes

The heart of a useful DW is the processes to populate the DW with good quality data. This is called the Extract-Transform-Load (ETL) cycle.

  1. Data should be extracted from the operational (transactional) database sources, as well as from other applications, on a regular basis.
  2. The extracted data should be aligned together by key fields and integrated into a single data set. It should be cleansed of any irregularities or missing values. It should be rolled-up together to the same level of granularity. Desired fields, such as daily sales totals, should be computed. The entire data should then be brought to the same format as the central table of DW.
  3. This transformed data should then be uploaded into the DW.

This ETL process should be run at a regular frequency, for example so that the DW is up to date every morning. ETL work is usually automated using programming scripts that are written, tested, and then deployed for periodically updating the DW.
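The three ETL steps above can be sketched as a small script. This is only an illustration under stated assumptions: the operational rows, the `daily_sales` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the DW.

```python
import sqlite3
from collections import defaultdict

# Hypothetical operational records: (sale_date, salesperson_code, amount).
# None marks a missing amount; the repeated row simulates a duplicate feed.
OPERATIONAL_ROWS = [
    ("2024-01-01", 7, 100.0),
    ("2024-01-01", 7, 100.0),   # duplicate
    ("2024-01-01", 9, 40.0),
    ("2024-01-02", 7, None),    # missing value
    ("2024-01-02", 9, 60.0),
]


def extract():
    """Extract: a real job would query the operational sources here."""
    return list(OPERATIONAL_ROWS)


def transform(rows):
    """Transform: cleanse the rows, then roll them up to daily sales totals."""
    # Simplification: a set drops exact duplicates, but it would also drop
    # two genuine identical sales; a real job would key on a transaction id.
    cleansed = {row for row in rows if row[2] is not None}
    totals = defaultdict(float)
    for sale_date, _code, amount in cleansed:
        totals[sale_date] += amount
    return sorted(totals.items())


def load(conn, daily_totals):
    """Load: write the rolled-up rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales "
        "(sale_date TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?)", daily_totals
    )
    conn.commit()


def run_etl(conn):
    """Run one full ETL cycle; safe to repeat on a schedule."""
    load(conn, transform(extract()))
```

Calling `run_etl(sqlite3.connect(":memory:"))` populates the table once; because the load step replaces rows by date, re-running the cycle does not double-count.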

Data Warehouse Design

Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of the information of interest. There are lookup tables that provide detailed values for codes used in the central table. For example, the central table may use a numeric code to represent a salesperson. The lookup table will help provide the name for that salesperson code.

Other schemas include the snowflake architecture. The difference between a star and snowflake is that in the latter, the look-up tables can have their own further look up tables.
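A star schema with one fact table and one lookup table might look like the following sketch. The table names, column names, and sample rows are hypothetical, and an in-memory SQLite database is assumed purely for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: decodes the salesperson code used in the fact table.
    CREATE TABLE salesperson (
        salesperson_code INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Central fact table: stores codes and numeric measures only.
    CREATE TABLE sales_fact (
        sale_date TEXT NOT NULL,
        salesperson_code INTEGER NOT NULL
            REFERENCES salesperson(salesperson_code),
        amount REAL NOT NULL
    );
    INSERT INTO salesperson VALUES (7, 'Alice'), (9, 'Bob');
    INSERT INTO sales_fact VALUES
        ('2024-01-01', 7, 100.0),
        ('2024-01-01', 9, 40.0),
        ('2024-01-02', 9, 60.0);
""")


def sales_by_name(connection):
    """Join the fact table to its lookup table to report totals by name."""
    query = """
        SELECT s.name, SUM(f.amount)
        FROM sales_fact AS f
        JOIN salesperson AS s
          ON s.salesperson_code = f.salesperson_code
        GROUP BY s.name
        ORDER BY s.name
    """
    return connection.execute(query).fetchall()


print(sales_by_name(conn))
```

In a snowflake variant, `salesperson` could itself carry a code (for example a region code) that points to a further lookup table.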

DW Access

Data from the DW could be accessed for many purposes, by many users, through many devices.

  1. A primary use of DW is to produce routine management and monitoring reports. For example, the data from DW can be used to populate customized performance dashboards for executives.
  2. The data from the DW could be used for ad-hoc queries and any other applications that make use of the internal data.
  3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and then combined with other relevant data, for data mining.

DW Best Practices

A data warehousing project reflects a significant investment into information technology (IT). All of the best practices in implementing any IT project should be followed:

  1. The DW project should align with the corporate strategy. Top management should be consulted for setting objectives. Financial viability (ROI) should be established. The

Gathering and curating data takes time and effort, particularly when it is unstructured or semistructured. Unstructured data can come in many forms like databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining. Knowledge of the business domain helps select the right streams of data for pursuing new insights. The data elements should be relevant, and suitably address the problem being solved.

Data cleansing and preparation

The quality of data is critical to the success and value of the data mining project. The quality of incoming data varies by the source and nature of data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of business, and is less likely to be reliable. Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed before it can be ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up to 60-70% of the time needed for a data mining project.
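As a rough illustration of what such cleansing involves, the sketch below applies three of the steps itemized in the list that follows: removing duplicates, filling missing values with the average, and binning a continuous value. The records, the bin thresholds, and the function names are hypothetical assumptions, not values taken from these notes.

```python
from statistics import mean

# Hypothetical records: (person_id, years_of_experience); None = missing.
RAW_RECORDS = [(1, 2.0), (2, None), (1, 2.0), (3, 12.0), (4, 7.0)]


def remove_duplicates(rows):
    """Drop repeated rows while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def fill_missing(rows):
    """Replace missing values with the average of the known values."""
    average = mean(value for _, value in rows if value is not None)
    return [(key, average if value is None else value) for key, value in rows]


def bin_experience(years, low_max=3.0, high_min=10.0):
    """Bin a continuous value into low / medium / high (assumed cut-offs)."""
    if years < low_max:
        return "low"
    if years >= high_min:
        return "high"
    return "medium"


def cleanse(rows):
    """Apply the three steps in sequence."""
    filled = fill_missing(remove_duplicates(rows))
    return [(key, bin_experience(value)) for key, value in filled]


print(cleanse(RAW_RECORDS))
```

The order matters: duplicates are removed before the average is computed so that repeated rows do not distort the fill value.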

  1. Duplicate data needs to be removed. The same data may be received from multiple sources.
  2. Missing values need to be filled in, or those rows should be removed from analysis. Missing values can be filled in with average, modal, or default values.
  3. Data elements should be comparable. They may need to be (a) transformed from one unit to another, and (b) stored at the same granularity. For example, sales data may be available daily, but the salesperson compensation data may only be available monthly. To relate these variables, the data must be brought to the lowest common denominator, in this case, monthly.
  4. Continuous values may need to be binned into a few buckets to help with some analyses. For instance, work experience could be binned as low, medium, and high.
  5. Outlier data elements need to be removed after careful review, to avoid the skewing of results. For example, one big donor could skew the analysis of alumni donors in an educational setting.
  6. Ensure that the data is representative of the phenomena under analysis by correcting for any biases in the selection of data. For example, if the data includes many more members of one gender than is typical of the population of interest, then adjustments need to be applied to the data.
  7. Data may need to be selected to increase information density. Some data may not show much variability, because it was not properly recorded. This data may dull the effects of other differences in the data and should be removed to improve the information density of the data.

Outputs of Data Mining

The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.

One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch.

A population “centroid” is a statistical measure for describing central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be “middle-aged, highly educated, high-net worth professionals, married with two children, living in the coastal areas”. These are typical representations of the output of a cluster analysis exercise.

Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those that buy milk and bread will also buy butter (with 80 percent probability).

The output can be in the form of a regression equation or mathematical function that represents the best fitting curve to represent the data. This equation may include linear and non-linear terms. Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae.

Evaluating Data Mining Results

There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. A common metric for all classification techniques is predictive accuracy.

Predictive Accuracy = (Correct Predictions) / (Total Predictions)

Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. The model is then used to predict other data instances. When a truly positive data point is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a truly negative data point is classified as negative, that is a true negative (TN). On the other hand, when a truly positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a truly negative data point is classified as positive, that is a false positive (FP).
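The four outcomes and the predictive accuracy formula can be computed directly from paired actual and predicted labels. The following is a minimal sketch; the label lists and function names are hypothetical and invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Tally TP, TN, FP, and FN from paired boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if truth and guess:
            counts["TP"] += 1       # positive case, predicted positive
        elif not truth and not guess:
            counts["TN"] += 1       # negative case, predicted negative
        elif truth and not guess:
            counts["FN"] += 1       # positive case, predicted negative
        else:
            counts["FP"] += 1       # negative case, predicted positive
    return counts


def predictive_accuracy(counts):
    """Correct predictions (TP + TN) divided by total predictions."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical labels: True means the condition is present.
ACTUAL = [True, True, True, False, False, False, False, True]
PREDICTED = [True, True, False, False, False, True, False, True]

counts = confusion_counts(ACTUAL, PREDICTED)
print(counts, predictive_accuracy(counts))
```

Keeping the four counts separate, rather than reporting accuracy alone, makes it visible whether the errors are mostly false negatives or false positives, which can matter a great deal in a medical setting.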

be used to model and predict the energy consumption as a function of daily temperature. Once a regression model has been developed, the energy consumption on any future day can be predicted using an equation. The accuracy of the regression model depends chiefly upon the dataset used, rather than on the algorithm or tools used.

Artificial Neural Networks (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. It mimics the behavior of human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. There could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. At some point, the neural network will have learned enough and begin to match the predictive accuracy of a human expert. The predictions of some ANNs that have been trained over a long period of time with a large amount of data have become decisively more accurate than human experts. At that point, the ANNs can begin to be seriously considered for deployment, in real situations in real time. ANNs are popular because they are eventually able to reach a high predictive accuracy. ANNs are also relatively simple to implement and are relatively tolerant of data quality issues. However, they require a lot of data to train them to develop good predictive ability.

Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things.
Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters from past data. The output is the centroids for each cluster and the allocation of data points to their cluster. Clustering is also a part of the artificial intelligence family of techniques.

Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by ecommerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable; and there are no right or wrong answers. There are just stronger and weaker affinities.

Tools and Platforms for Data Mining

Data Mining tools have existed for many decades. However, they have recently become more important as the values of data have grown. There are a wide range of data mining platforms available in the market today.

  1. Simple or sophisticated: There are simple end-user data mining tools such as MS Excel, and there are more sophisticated tools such as IBM SPSS Modeler.
  2. Stand-alone or embedded: There are stand-alone tools and there are tools embedded in an existing transaction processing or data warehousing or ERP system.
  3. Open source or commercial: There are open source and freely available tools such as Weka, and there are commercial products.
  4. User interface: There are text-based tools that require some programming skills, and there are GUI-based drag-and-drop format tools.
  5. Data formats: There are tools that work only on proprietary data formats and there are those that directly accept data from a host of popular data management tool formats.

MS Excel is a relatively simple and easy data mining tool. It can get quite versatile once the Analysis ToolPak and some other add-on products are installed on it. IBM’s SPSS Modeler is an industry-leading data mining platform. It offers a powerful set of tools and algorithms for most popular data mining capabilities. It has a colorful GUI format with drag-and-drop capabilities. It can accept data in multiple formats, including reading Excel files directly. Weka is an open-source GUI-based tool that offers a large number of data mining algorithms. ERP systems include some data analytic capabilities, too. SAP has its Business Objects (BO) software. BO is considered one of the leading BI suites in the industry, and is often used by organizations that use SAP.

Data Mining Best Practices

Effective and successful use of data mining activity requires both business and technology skills. The business aspects help understand the domain and the key questions. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform. An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The Data Mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 4.3):

algorithms gets called in. Understanding the relative strengths of various algorithms is helpful but not mandatory.

Myth #2: Data Mining is about predictive accuracy. While important, predictive accuracy is a feature of the algorithm. As in Myth #1, the quality of output is a strong function of the right problem, the right hypothesis, and the right data.

Myth #3: Data Mining requires a data warehouse. Some data mining problems may benefit from clean data available directly from the DW, but a DW is not mandatory.

Myth #4: Data Mining requires large quantities of data. Many interesting data mining exercises are done using small or medium-sized data sets, at low cost, using end-user tools.

Myth #5: Data Mining requires a technology expert. Many interesting data mining exercises are done by end users and executives using simple everyday tools like spreadsheets.

Data Mining Mistakes

Here are some of the more common mistakes in data mining that should be avoided.

Mistake #1: Selecting the wrong problem for data mining. Without the right goals, data mining leads to a waste of time. Getting the right answer to an irrelevant question could be interesting, but it would be pointless from a business perspective. A good goal is one that would deliver a good ROI to the organization.

Mistake #2: Getting buried under mountains of data without clear metadata. It is more important to be engaged with the data than to have lots of data. The relevant data required may be much less than initially thought. There may also be insufficient knowledge about the data, that is, insufficient metadata.

Mistake #3: Disorganized data mining. Without clear goals, much time is wasted. Running the same tests with the same mining algorithms repeatedly and blindly, without a plan and without thinking about the next stage, leads to wasted time and energy. Not leaving sufficient time for data acquisition, selection, and preparation can lead to data quality issues and to garbage in, garbage out (GIGO). Similarly, not providing enough time for testing the model, training the users, and deploying the system can make the project a failure.

Mistake #4: Insufficient business knowledge. Without a deep understanding of the business domain, the results would be meaningless. Don’t rule out anything when observing data analysis results. Don’t ignore suspicious (good or bad) findings. Even when insights emerge at one level, it is important to slice and dice the data at other levels to see if more powerful insights can be extracted.

Mistake #5: Incompatibility of data mining tools and datasets. All the tools, from data gathering and preparation to mining and visualization, should work together. Use tools that can work with data from multiple sources in multiple industry-standard formats.

Mistake #6: Looking only at aggregated results and not at individual records or predictions. Results that look right at the aggregate level may lead to absurd conclusions at the individual record level. Diving into the data at the right angle can yield insights at many levels of data.

Mistake #7: Measuring your results differently from the way your sponsor measures them. If the data mining team loses its sense of the business objectives and begins to mine data for its own sake, it will lose respect and executive support very quickly. The BIDM cycle should be remembered.
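The point of Mistake #6 can be illustrated with a small sketch. The records, customer IDs, and function names below are hypothetical and not taken from the text; they only show how a prediction that is perfect in total can still be badly wrong for individual records.

```python
from statistics import mean

# Hypothetical records: (customer_id, actual_spend, predicted_spend).
# These values are invented for illustration only.
records = [
    ("C1", 100.0, 10.0),
    ("C2", 20.0, 110.0),
    ("C3", 60.0, 55.0),
    ("C4", 40.0, 45.0),
]


def aggregate_error(rows):
    """Total predicted spend minus total actual spend."""
    total_predicted = sum(predicted for _, _, predicted in rows)
    total_actual = sum(actual for _, actual, _ in rows)
    return total_predicted - total_actual


def mean_absolute_error(rows):
    """Average size of the error on each individual record."""
    return mean(abs(predicted - actual) for _, actual, predicted in rows)


def worst_records(rows, threshold):
    """IDs of records whose individual error exceeds the threshold."""
    return [cid for cid, actual, predicted in rows
            if abs(predicted - actual) > threshold]


print(aggregate_error(records))      # 0.0: the totals match exactly
print(mean_absolute_error(records))  # 47.5: yet individual errors are large
print(worst_records(records, 50.0))  # ['C1', 'C2']
```

Here the aggregate view suggests a perfect model, while the record-level view shows that two of the four predictions are off by 90. Checking both levels is one simple way to catch this kind of problem before results reach the sponsor.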

Review Questions

  1. What is data mining? What are supervised and unsupervised learning techniques?
  2. Describe the key steps in the data mining process. Why is it important to follow these processes?
  3. What is a confusion matrix?
  4. Why is data preparation so important and time consuming?
  5. What are some of the most popular data mining techniques?
  6. What are the major mistakes to be avoided when doing data mining?
  7. What are the key requirements for a skilled data analyst?

Chapter 5: Data Visualization

Data visualization is the art and science of making data easy to understand and consume for the end user. An ideal visualization shows the right amount of data, in the right order and form, to convey the high-priority information. The right visualization requires an understanding of the consumer’s needs, the nature of the data, and the many tools and techniques available to present data. Data visualization is the last step in the data life cycle: it is where the data is processed for presentation in an accessible manner to the right audience for the right purpose. The data should be converted into a language and format that is well understood by the consumer of the data, and the presentation should highlight the insights from the data in an actionable manner. It is important to remember that if the data is presented in too much detail, the consumer might lose interest.

EXCELLENCE IN VISUALIZATION

Data can be presented as rectangular tables, or as colorful graphs of various types. “Small, non-comparative, highly-labeled data sets usually belong in tables” (Ed Tufte, 2001, p 33). However, for larger amounts of data, graphs are preferable. Tufte, a pioneering expert on data visualization, presents the following objectives for graphical excellence:

  1. Show, and even reveal, the data : The data should tell a story, and it is important for that story to be contextualized and coherent with the data.
  2. Induce the viewer to think of the substance of the data : The format of the graph should be so well connected to the data that it lets the data shine.
  3. Avoid distorting what the data have to say : Statistics can be used to lie. Therefore, it’s important to provide a context in order to avoid the distortion of the data.
  4. Make large data sets coherent : By giving shape to data, visualizations can help bring the data together to tell a comprehensive story.
  5. Encourage the eyes to compare different pieces of data : Organize the chart in ways the eyes would naturally move to derive insights from the graph.
  6. Reveal the data at several levels of detail : Graphs lead to insights, which raise further curiosity, and thus presentations should help get to the root cause.
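Objective 3 (avoid distorting what the data have to say) can be checked numerically. One measure Tufte proposes elsewhere in his work, and which these notes do not cover, is the "lie factor": the size of the effect shown in the graphic divided by the size of the effect in the data. The sketch below assumes that definition; the function names and example numbers are hypothetical.

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, drawn_first, drawn_second):
    """Effect shown in the graphic divided by the effect in the data.

    A value near 1 means the graphic represents the change faithfully;
    values far above 1 mean the graphic exaggerates the change.
    """
    data_effect = effect_size(data_first, data_second)
    if data_effect == 0:
        raise ValueError("the data show no change, so the ratio is undefined")
    return effect_size(drawn_first, drawn_second) / data_effect


# Hypothetical example: sales grow from 100 to 150 (a 50% increase),
# but a truncated axis draws the bars 10 and 30 units tall (a 200% increase).
print(lie_factor(100.0, 150.0, 10.0, 30.0))  # 4.0
```

In this invented case the chart overstates the change by a factor of four, which is the kind of distortion that starting the axis at zero, or stating the context clearly, would avoid.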