Data Mining vs. OLAP: Differences & Importance in Business Intelligence - Prof. Stephen Lo, Study notes of Management Information Systems

These notes cover the differences between data mining and online analytical processing (OLAP) in business intelligence, highlighting the objectives, tools, skill sets, and implementation methods of each technique. Data mining focuses on identifying hidden patterns and new hypotheses, while OLAP provides fast, consistent, interactive access to data from various perspectives. The notes also cover the importance of data mining readiness assessment and the role of business champions in successful implementation.


Uploaded on 08/16/2009



The Lowdown on Data Mining

by Evan Levy

Data mining is a high-yield but complex form of knowledge discovery. Before you even think about using this technology, make sure you know what it is, what it takes, and what you need.

In its 1997-1998 study of data mining market trends, META Group claimed that nearly 80 percent of companies interviewed expected data mining to be a critical success factor by 1999. More recently, Forrester Research weighed in on data mining, claiming that, while many companies were still evaluating the technology, most planned on using it by 2001. Other analysts and independent research firms polling companies to find out who’s doing what in the data mining space are finding that the common denominator is intention, not practice. Is this because companies are solidifying their infrastructures first? Are companies too intimidated to admit they have no intention of doing data mining at all? Or is there still a pervasive misunderstanding of what data mining really is–and isn’t?

I recently spoke at a database marketing conference on this point. The title of my presentation was "Data Mining in the Real World," and the room was brimming with both technicians and marketers. When I got to the part of the presentation that discussed the differences between data mining and OLAP, I noticed a guy a few rows from the front. He had stopped taking notes and had put down his pen. After the presentation, he buttonholed me, taking me to task for my definition of data mining.

At first, I figured he worked for an OLAP vendor, one of the many who had labeled its multidimensional analysis or query generation tool as a data mining product. But after listening to his harangue for a few minutes, I was able to piece together that he was a data analyst for a marketing organization and had been telling everyone that his company was doing data mining. I had burst his bubble by classifying his cherished "data mining" tool as a simple OLAP application, and I had clearly called into question his status as a knowledge worker.

My point is not that OLAP is less valuable than data mining, but that they are two separate breeds of analysis with entirely different objectives, not to mention tools, skill sets, and implementation methods.

UNDERSTANDING THE PLAYERS

Most people wouldn’t use a spreadsheet tool to write a book. Even a crack statistician wouldn’t use SAS to fill out an expense report. Different software tools exist to tackle different business functions, just as different decision-support tools exist because there are different classes of questions. The major classes of decision support are:

Canned reports. This is the most basic type of decision support, if not the most pervasive. Nearly every data warehouse starts out by generating reports. The delivery of timely, accurate reports containing business information is incredibly valuable–especially in places where this data never before existed. Such an application focuses on well-defined, well-understood business questions. It also allows users to gradually change their businesses to leverage this new information.

Ad hoc querying. Submitting free-form or ad hoc questions to the database is the next logical step in the evolution of a company’s data warehouse. After receiving hard copy reports, business users inevitably have additional questions. They can submit ad hoc queries with tools such as Hummingbird Communications’ BI/Query or Cognos Inc.’s Impromptu.

OLAP. Online analytical processing takes various forms (slicing and dicing the data by dimension, complex multi-statement queries, and so on), but the common denominator of these forms is that they all provide analysts with fast, consistent, interactive access to data from a variety of perspectives. OLAP enables analysts not only to ask many questions, each building on the answers and details of the previous one, but also to organize the results.

For example, consider a business analyst for a utility company who is reviewing electricity use for customers in a particular geographic region. An OLAP tool lets the analyst ask questions about customers in a particular region. If the analyst happens to identify a region with lower-than-expected usage, he or she could redirect the focus on that region’s usage to a more specific time frame to determine if the usage shortfall relates to a specific day of the week.

OLAP tools let users "drill down" into more detail, allowing them to examine the same data from multiple perspectives, limited only by the metrics available in the database and their own imagination.
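The utility example can be sketched in a few lines of Python. This is only an illustration of the roll-up and drill-down idea, not how any particular OLAP product works; the regions, days, and usage figures are invented, and `roll_up` and `drill_down` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical usage facts: (region, day_of_week, kWh). All values are made up.
USAGE = [
    ("North", "Mon", 120.0), ("North", "Tue", 118.0), ("North", "Sat", 95.0),
    ("South", "Mon", 80.0), ("South", "Tue", 82.0), ("South", "Sat", 31.0),
]

def roll_up(rows, key_index):
    """Sum kWh by the dimension at key_index (0 = region, 1 = day of week)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

def drill_down(rows, region):
    """Keep one region's rows, then break its usage out by day of week."""
    return roll_up([r for r in rows if r[0] == region], key_index=1)

# Step 1: the analyst looks at usage by region and spots the lowest one.
by_region = roll_up(USAGE, key_index=0)
low_region = min(by_region, key=by_region.get)

# Step 2: the analyst drills into that region by day of week.
by_day = drill_down(USAGE, low_region)
print(low_region, by_day)
```

Note that the analyst supplies every question here: which dimension to total, which region to inspect, which time grain to use. That is the sense in which the end user, not the tool, defines the hypothesis.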

Data mining. With canned reports, ad hoc querying, and OLAP, the end user defines a hypothesis and determines which data to examine. With data mining, the tool identifies the hypothesis, and it actually tells the user where in the data to start the exploration process.

Rather than using SQL to filter out values and methodically reduce the data into a concise answer set, data mining uses algorithms that exhaustively review the relationships among data elements to determine if any patterns exist.

The whole purpose of data mining is to yield new business information that a business person can act on. The mining activity itself is, by necessity, "back-office" work. Forget the myth that data mining will eventually "mature" to become a desktop application. (The fact that many organizations are buying into this myth accounts for much of the current reluctance to adopt the technology and the preference, instead, to "wait and see.") The truth is, data mining inherently requires a certain amount of application-specific data manipulation in order to yield effective results. This means that the IT organization must deploy the data mining tools–just as they run other technical functions–to load new business intelligence into the data warehouse. This "closed-loop" process allows end users to query the results of data mining without directly operating the data mining tool.

The actual analysis techniques that current data mining tools use aren’t new; in fact, some of the algorithms have existed for more than 20 years. The innovation is in the recent commercialization of these algorithms into software products that address business-oriented problems. Data mining products have been reengineered from the traditional mainframe and supercomputer class systems to leverage the more popular (and considerably less expensive) SMP and MPP platforms.

Data mining tools are typically classified by the type of algorithm they use to identify hidden patterns. There are many different algorithms in use, but the four most popular are association, sequence, clustering (or segmentation), and predictive modeling.

to process in parallel.

With the advent of customer relationship management, association analysis is at the forefront of data mining because it crisply identifies the products customers purchase along with which products and services drive additional sales. Other decision-support tools don’t support the analysis of such product combinations.
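As a rough illustration of what association analysis computes, the sketch below counts how often each pair of products appears together in the same order and reports that share (often called "support"). The orders and most product names are invented for the example, and commercial tools use far more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; each set holds the products bought together.
ORDERS = [
    {"caller ID", "voice mail", "call waiting"},
    {"caller ID", "voice mail"},
    {"call waiting"},
    {"caller ID", "voice mail", "long distance"},
]

def pair_support(orders):
    """Return the fraction of orders that contain each product pair."""
    counts = Counter()
    for order in orders:
        # sorted() gives every pair a single canonical ordering.
        for pair in combinations(sorted(order), 2):
            counts[pair] += 1
    return {pair: n / len(orders) for pair, n in counts.items()}

support = pair_support(ORDERS)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

The point of the sketch is that nobody told the code which pair to look for; it examines every combination and surfaces the strongest one.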

SEQUENCE

Sequential analysis helps data miners identify a set of order-specific items or events. Association identifies the existence of patterns or groups of items; sequential analysis identifies the order of those patterns or groups of items.

At a phone company recently, product managers using OLAP and canned queries were monitoring new customer orders and cancellations. They calculated that in one out of every eight orders, a customer canceled Product A, while in one out of every 10 orders, a customer canceled Product B. Based on customer purchase history, these cancellation rates were five times greater than normal. We were asked to employ data mining to find out why.

We uncovered 12 order combinations that included disconnects of both Product A and Product B. Table 2 shows one such combination.

Table 2. Sequential analysis results.

(Percentage of orders displaying this pattern = 7.3)

Pattern:

  Activity     Product
  Disconnect   Product A
  Disconnect   Product B
  Connect      Product A
  Connect      Product B
  Connect      Product C

In this example, 7.3 percent of all the orders weren’t disconnecting either product, but simply purchasing a new product (Product C). Unfortunately, the legacy order system could not add a new product to a customer’s bill. Instead, it had to disconnect both existing products and then reconnect them together with the new product. The high rate of disconnects actually represented a high rate of service upgrades, not cancellations. The 11 other sequences reflected similar activity with other product additions. Thus, the 7.3 percent problem actually indicated that about 80 percent of the company’s disconnect orders were in fact new product orders.

Customers weren’t disconnecting products out of dissatisfaction; they were upgrading their products! In reality, disconnect levels had not increased over historical norms, and product managers had discovered that their disconnect costs were in fact insignificant.
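The phone company finding can be illustrated with a small sketch. It assumes each order is available as an ordered list of (activity, product) steps; the sample orders, the `classify` rule, and the function names are hypothetical and are not the tool or method used in the engagement described above.

```python
from collections import Counter

# Hypothetical order log: each order is an ordered list of (activity, product).
TABLE_2 = [("Disconnect", "A"), ("Disconnect", "B"),
           ("Connect", "A"), ("Connect", "B"), ("Connect", "C")]
ORDERS = [
    TABLE_2,
    TABLE_2,
    [("Disconnect", "A")],
    [("Connect", "B")],
    [("Disconnect", "B"), ("Connect", "B"), ("Connect", "D")],
]

def pattern_shares(orders):
    """Fraction of orders displaying each exact, order-specific sequence."""
    counts = Counter(tuple(order) for order in orders)
    return {pattern: n / len(orders) for pattern, n in counts.items()}

def classify(order):
    """Label an order as an upgrade, a cancellation, or a new service."""
    disconnected, reconnected, added = [], set(), set()
    for activity, product in order:
        if activity == "Disconnect":
            disconnected.append(product)
        elif product in disconnected:   # connected again after a disconnect
            reconnected.add(product)
        else:                           # a product the order did not remove
            added.add(product)
    if not disconnected:
        return "new service"
    if set(disconnected) == reconnected and added:
        return "upgrade"
    return "cancellation"

shares = pattern_shares(ORDERS)
labels = [classify(order) for order in ORDERS]
disconnect_orders = [label for label in labels if label != "new service"]
upgrade_share = disconnect_orders.count("upgrade") / len(disconnect_orders)
print(shares[tuple(TABLE_2)], labels, upgrade_share)
```

Counting raw "Disconnect" rows would report four disconnect orders in this toy log; reading the steps in sequence shows that three of them are upgrades.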

Cluster 1: 3.5% of account holders, 14.9% had loan product.

  • 3 times as many credit cards, high money market balance, IRAs

Cluster 2: 4.2% of account holders, 28.6% had loan product.

  • 8 times more likely to have a business checking account.

Cluster 3: 7.2% of account holders, 19.8% had loan product.

  • loan to value ratio < 45%

Cluster 4: 6.8% of population, 4.7% had loan product.

  • <4% have credit cards, all had savings account.

The company could never have differentiated "disconnects" from "upgrades" without data mining. Applying OLAP technology to this problem would have meant running queries against connect and disconnect orders for millions of monthly orders, covering a portfolio of more than 200 products. In fact, that approach would require 200 to the fifth power, or 320 billion, queries, each of them most likely a full-table scan. Extrapolation shows that if each of these queries took a second, the work would take roughly 10,000 years to perform!
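The arithmetic behind that estimate is easy to check. The sketch below assumes, as the text does, 200 products and one second per query; reading the exponent of five as the number of steps in a pattern such as the one in Table 2 is an interpretation, not something the text states.

```python
PRODUCTS = 200            # size of the product portfolio, from the text
PATTERN_LENGTH = 5        # assumption: steps per order pattern, as in Table 2
SECONDS_PER_QUERY = 1     # the text's hypothetical query time
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# One query per possible ordered combination of products.
queries = PRODUCTS ** PATTERN_LENGTH
years = queries * SECONDS_PER_QUERY / SECONDS_PER_YEAR

print(f"{queries:,} queries, about {years:,.0f} years")
```

The result is 320 billion queries and a little over 10,000 years, which matches the figures quoted above.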

Sequential analysis is an ideal algorithm for uncovering event patterns that aren’t obvious. In environments that involve thousands or millions of events over time, sequential analysis can be invaluable in revealing interesting characteristics and behaviors.

CLUSTERING

Cluster analysis lets the data miner assemble data into unforeseen groups containing similar characteristics. Also known as "segmentation," this type of data mining is probably the most widely used.

In one case, a company I was consulting with wanted to find new data about customers owning a specific loan product. We performed cluster analysis on data relating to several hundred thousand customers and including more than 150 attributes for each customer. (Customer attributes included account name, home address, and outstanding balance.) We instructed the data mining tool to ignore any groups that contained less than 1 percent of the population. A four-cluster subset of the overall results is illustrated in the accompanying text box.

These four clusters provide some rather interesting insight into the behavior of this group of account holders. The first cluster–comprising 3.5 percent of account holders–actually had the most credit cards and the highest money market balance, making it an attractive market for cross-selling new banking products. The same holds true for two other clusters–business checking account holders and the low debt group.

The most interesting cluster, however, is the cluster containing 6.8 percent of account holders. These account holders don’t represent a likely target market: They don’t use credit cards, and they have the least profitable account the bank offers. Obviously, future marketing campaigns should exclude this audience in order to save the bank money.

This example illustrates how data mining can provide a business user with a hypothesis from detailed data. Any experienced analyst could have constructed queries to peruse the data; however, it’s unlikely that he or she could have guessed the precise combination of attributes that would reveal the insights data mining is capable of.
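The clustering itself is the data mining tool's job, but the reporting step described in this example (dropping groups below 1 percent of the population, then comparing loan ownership per cluster) can be sketched as follows. The cluster labels and counts are invented, and `summarize` is a hypothetical helper rather than part of any product.

```python
from collections import defaultdict

def summarize(assignments, min_share=0.01):
    """Report each cluster's share of the population and its loan rate.

    assignments: list of (cluster_id, has_loan) pairs, assumed to come
    from a clustering tool. Clusters below min_share are ignored.
    """
    total = len(assignments)
    sizes = defaultdict(int)
    loans = defaultdict(int)
    for cluster_id, has_loan in assignments:
        sizes[cluster_id] += 1
        loans[cluster_id] += int(has_loan)
    summary = {}
    for cluster_id, size in sizes.items():
        share = size / total
        if share < min_share:
            continue  # too small to act on, as in the 1 percent rule above
        summary[cluster_id] = {
            "share": share,
            "loan_rate": loans[cluster_id] / size,
        }
    return summary

# Hypothetical assignments for 1,000 account holders.
ASSIGNMENTS = (
    [("a", True)] * 200 + [("a", False)] * 500
    + [("b", True)] * 59 + [("b", False)] * 236
    + [("c", False)] * 5
)
summary = summarize(ASSIGNMENTS)
print(summary)
```

Cluster "c" holds only 0.5 percent of the records, so it is left out of the summary.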

While query tools are certainly useful for examining cluster results, they’re not capable of identifying

service without caller ID, and 3) are in three states where customers are highly likely to purchase the custom-calling product. The predictive tool tells us that three different attributes (location, other product usage, and industry) define this segment. It also tells us the attributes that aren’t influencing the outcome. Thus, the user can avoid trial-and-error query submission. The tool will identify the hypothesis.

Predictive modeling tools test themselves by checking their hypotheses against the actual data matching the criteria. The key to supporting predictive modeling, however, is the availability of data relating to the prediction. Predictive modeling can’t occur without information relating to the event or outcome that you want to analyze. In other words, if you want to predict the propensity to buy a particular product, you need sales history about that product and the customers who have purchased it.
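One simple way to picture this self-check is to compare the purchase rate inside a proposed segment with the rate across all customers. The sketch below does that for a hand-written segment rule; the records, field names, states, and industry are all invented for illustration, and a real tool would derive the rule itself and test it on data held back from model building.

```python
# Hypothetical customer records with a known outcome ("bought").
CUSTOMERS = [
    {"state": "TX", "has_caller_id": False, "industry": "retail", "bought": True},
    {"state": "TX", "has_caller_id": False, "industry": "retail", "bought": True},
    {"state": "OK", "has_caller_id": False, "industry": "retail", "bought": False},
    {"state": "NM", "has_caller_id": False, "industry": "retail", "bought": True},
    {"state": "TX", "has_caller_id": True, "industry": "retail", "bought": False},
    {"state": "CA", "has_caller_id": False, "industry": "retail", "bought": False},
    {"state": "CA", "has_caller_id": True, "industry": "finance", "bought": False},
    {"state": "OK", "has_caller_id": False, "industry": "finance", "bought": True},
]

TARGET_STATES = {"TX", "OK", "NM"}  # assumption: the three high-propensity states

def in_segment(customer):
    """The hypothesis: location, other product usage, and industry."""
    return (customer["state"] in TARGET_STATES
            and not customer["has_caller_id"]
            and customer["industry"] == "retail")

def purchase_rate(customers):
    return sum(c["bought"] for c in customers) / len(customers)

segment = [c for c in CUSTOMERS if in_segment(c)]
segment_rate = purchase_rate(segment)
overall_rate = purchase_rate(CUSTOMERS)
lift = segment_rate / overall_rate
print(segment_rate, overall_rate, lift)
```

A lift well above 1 suggests the segment really does buy more often than the customer base as a whole. None of this is possible without the "bought" column, which is the point the paragraph above makes about needing data on the outcome.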

Although difficult, it is possible to model events that haven’t yet occurred. To model brand-new events, you must have access to data for similar events. This type of analysis is sometimes used in the entertainment industry. For a new Mel Gibson action movie, a studio can mine box-office data from Gibson’s prior movies and also include information regarding the number of screens, costars, opening seasons, and other details. Such a predictive model could tell a studio the customer segments that are most likely–and least likely–to watch this kind of movie, as well as the probable revenues for opening weekend.

Once again, since query tools, including OLAP, can only answer questions that can be fully qualified, the studio could never uncover this information without data mining.

ARE YOU READY FOR DATA MINING?

Just because you have a data warehouse doesn’t mean you’re necessarily ready for data mining. Much of the work our company does in the data mining arena has more to do with data mining readiness assessment than with actually performing data mining.

Granted, it’s a lot easier for clients to commit to an assessment than it is to justify the purchase and integration of a new data mining tool suite. But many of our clients are still working on the post-assessment recommendations and haven’t yet talked to data mining vendors. So what does it take to be prepared for data mining?

Here are some metrics you can use to gauge your data mining readiness:

Do you have a staff of experienced knowledge workers? Everyone is excited about the opportunities that data mining affords, but few understand the implications of presenting this new type of business intelligence to knowledge workers. Ten years ago, retailers had their hands full when they delivered fresh data warehouse reports to their merchandisers because these reports frequently challenged traditional views of the popular and profitable products. The initial response was that the data warehouse was just plain wrong.

Many companies want to bypass traditional decision support and go directly to data mining, but it’s highly risky. If business users aren’t experienced in using data and they haven’t yet transformed their business processes to use metrics (instead of gut-level instinct) to drive their decisions, it’s unlikely that they’ll accept the results of data mining blindly.

Do you have the data? As funny as this question sounds, it often elicits blank stares. If you haven’t got the data, you can’t mine it. If your data mining business case is to establish customer buying trends, but you don’t have access to customer purchase data, you’ve picked the wrong business case. You have to have data relevant to the problem you’re targeting.

Can your data support data mining? Even with advanced data mining technologies, the old garbage in, garbage out adage still applies. Data mining focuses on the quantity and accuracy of the attribute detail. If your mining activity focuses on customer traits and habits, it’s important to provide as many data attributes about the customer as possible. No business has perfect data; what’s important is knowing the inherent limitations of the data before beginning a data mining effort.

Do you have marketing processes in place that can use this data? I once reviewed a data mining activity that analyzed the pricing of products in a Midwestern state. It proved to be a highly insightful activity; unfortunately, it was worthless. There was no way to reprice telecommunications products regulated by state and federal agencies.

It’s important to understand how the results of a mining activity will be used. In fact, this is something you should review during the requirements gathering step.

Is your problem a data mining problem? As I’ve discussed, data mining provides the ability to identify patterns and new hypotheses about data. Data analysts usually implement it after they have exhausted their ability to identify new business intelligence from the data warehouse.

Because of the hype and visibility of data mining within many different industries, many business users new to decision support are convinced they need data mining. In many instances, what they need is desktop ad hoc query support, not advanced data mining algorithms.

I recently attended a meeting with a specialty retailer. Business users were screaming that they wanted advanced analysis, and the technology group wanted data mining. However, after a short discussion with the marketing analysts, I established that their advanced analysis needs didn’t indicate data mining at all, but rather a way to drill down into their weekly report information. Their requirements in fact pointed to ad hoc and OLAP analysis, not data mining.

Do you have a business champion who can embrace the process and results? The tried and true principle with data warehousing is that without a business champion, your data warehouse won’t succeed. The same holds true for data mining. Without a champion who is interested in new business intelligence, there’s little likelihood that a new technology that challenges traditional thought and practices is going to be accepted.

Do you have the technology infrastructure to support advanced analysis? Data mining analysis is a new and complex technology. Consequently, it requires additional hardware, software, and technical skills. Although this should come as no surprise to most IT professionals, it is almost always an issue.

There’s little question that in order to be successful with data mining, you need a lot of detailed data–not summary or aggregated detail, but baseline business detail. As discussed in "To Sample Or Not To Sample", anything that filters or rolls up data has the potential of filtering out an important facet of new business intelligence. It’s also important to realize that, even with the relative newness of this technology to the commercial marketplace, data mining consumes a significant amount of processing horsepower

  1. Does your tool scale? Can it break problems into multiple concurrent steps? If so, how?
  2. Is your data mining tool business or function focused? A "business-focused" data mining tool focuses on a specific function such as "churn." A function-focused tool is more aligned with the type of algorithm (such as clustering) and can usually apply to more than one business problem.
  3. Is it a learning or static model tool? (A static model tool requires the user to identify the specific attributes and their relative weightings. A learning model tool analyzes all available data attributes and determines the appropriate weightings and values itself.)

To Sample or Not to Sample

One of the most heated debates in data mining circles is whether or not to employ data sampling. Sampling is a method by which the data mining engines use only a subset of data to perform the analysis activity. The benefit of this approach is obvious: The data mining involves fewer processing resources because it’s analyzing less data.

Although the concept is very straightforward, the impact on results isn’t as obvious. The whole premise behind data mining is to identify hidden patterns in data. Sampling introduces the risk of omitting hidden patterns.

Nonetheless, sampling has proven to be a very successful statistical analysis strategy. Its benefit is that you can work with a fraction, say 10 percent, of the total data. Assuming your sample reflects the information contained in the full volume of data, analyzing only 10 percent of it will give you the same insight at lower overhead. You should thus be able to analyze smaller data quantities more exhaustively.

In order to sample effectively, it is important to sample data that is consistent and homogeneous throughout. And that’s the problem. Sampling assumes that the data is homogeneous and can be sampled without losing vital detail. Mining assumes that there are hidden patterns in the data and that you need all the detail to find them. How can you take an accurate sample that preserves the informational content of the base data unless you already know that content? And knowing the content isn’t practical until you mine the data to identify the hidden patterns.
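A back-of-the-envelope calculation shows why rare patterns are the ones at risk. The sketch assumes each record is kept independently with probability equal to the sample rate (a simplification of real sampling schemes), and the function names are illustrative.

```python
def prob_pattern_missed(occurrences, sample_rate=0.10):
    """Chance that none of a pattern's records lands in the sample,
    assuming each record is kept independently at sample_rate."""
    return (1.0 - sample_rate) ** occurrences

def expected_in_sample(occurrences, sample_rate=0.10):
    """Average number of the pattern's records that survive sampling."""
    return occurrences * sample_rate

for n in (1, 10, 50):
    print(n, round(prob_pattern_missed(n), 4), expected_in_sample(n))
```

Under these assumptions, a pattern backed by 10 records is missed entirely about a third of the time, and even when it is not missed, the sample typically holds only one such record, which is rarely enough for an algorithm to treat as a pattern.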

In much the same way that data warehousing has established detailed data as the only sure way to capture reality, the only surefire means of mining data is to use as much detail as possible. In simplistic terms, sampling’s biggest advocates are the statistical software tool vendors who rely on data sampling techniques in order to extrapolate findings. The fact that most of the data mining tools currently on the market cater to much smaller volumes of data than typically found in a data warehouse has lent a lot of backing to the sampling approach. Sampling’s biggest detractors are vendors and commercial companies who want the data mining tool to consider all their data, not just a sample.

Sampling’s proponents will insist, and rightly so, that sampling is a statistically valid way to apply analysis findings. Does that mean, however, that you should join the sampling bandwagon? It depends on several factors:

  • First, how much data do you have? The answer to this question is probably the greatest factor. If you haven’t got a platform that can support mining execution of an entire data set, sampling or subsetting may be the only reasonable solutions.

  • Can the data be subsetted based on the business problem? Although a utility may have 20 million customers in its five-state region, practical issues dictate that it can only market and manage customers along the state boundaries (for legal reasons). This is a perfect situation for subsetting the data into five smaller data sets. In situations in which a business is managed at a divisional or regional level, subsetting the data into multiple sets along the boundaries of operations is actually beneficial. Identifying customer traits and profiles unique to a region will add more value than identifying a more generalized profile across the entire company.

  • Do you need all the data in the first place? I can’t count the number of times that one of my initial mining activities used all the data in the warehouse. Be careful to focus on data that’s useful and valuable to the outcome. I once worked on a project for a prison system that initially included race and religion as part of an analysis; however, such analysis is against the law. New rules and business practices cannot be created simply based on the data you’ve got.

  • Will your analysis technique require lots of data? Predictive modeling can use every piece of data that you can throw at it. Association and sequence algorithms have practical limits. Additionally, mining focuses on specific data elements, not necessarily the entire warehouse. The golden rule of data mining should be "concentrate first on how you will use the data and what your business drivers are." Only then should you decide on the analysis technique, the specific tool, and whether or not to sample.

Evan Levy is the president of Baseline Consulting Group, a worldwide consulting firm specializing in industry-specific business intelligence and database marketing solutions. You can contact him at [email protected] or through Baseline’s Web site at www.baseline-consulting.com. Copyright © 1999 Miller Freeman Inc. All Rights Reserved. Redistribution without permission is prohibited.