Data Mining
CS57300 / STAT 59800-
Purdue University, April 28, 2009
Data mining systems
How to choose a data mining system
- Commercial data mining systems have little in common
- Different data mining functionality or methodology
- May even work with completely different kinds of data
- Need to consider multiple dimensions in selection
- Data types: relational, transactional, sequential, spatial?
- Data sources: ASCII text files? multiple relational data sources? support open database connectivity (ODBC) connections?
- System issues: runs on only one or on several operating systems? client/server architecture? Web-based interfaces? XML data as input and output?
Choosing a system
- Dimensions (cont):
- Data mining functions and methodologies
- One vs. multiple data mining functions
- One vs. variety of methods per function
- More functions and methods per function provide the user with greater flexibility and analysis power
- Coupling with DB and/or data warehouse systems
- Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
- Ideally, a data mining system should be tightly coupled with a database system
Example systems
- Microsoft SQL Server 2008
- Integrate DB and OLAP with multiple mining methods
- Supports Object Linking and Embedding, Database (OLE DB) -- access to a wider range of data formats than ODBC alone
- Vero Insight MineSet
- Multiple data mining algorithms and advanced statistics
- Advanced visualization tools (originally developed by Silicon Graphics)
- PASW Modeler (SPSS)
- Integrated data mining development environment for end-users and developers
- Multiple data mining algorithms and visualization tools
Example systems
- DBMiner (developed by Jiawei Han at SFU)
- Multiple data mining modules: discovery-driven OLAP analysis, association, classification, and clustering
- Efficient association and sequential-pattern mining functions, and a visual classification tool
- Mining both relational databases and data warehouses
Top Ten Data Mining Mistakes
(source: John Elder, Elder Research)
You’ve made a mistake if you...
- Lack data
- Focus on training
- Rely on one technique
- Ask the wrong question
- Listen (only) to the data
- Accept leaks from the future
- Discount pesky cases
- Extrapolate
- Answer every inquiry
- Sample casually
- Believe the best model
2: Rely on one technique
- "To a person with a hammer, all the world's a nail."
- For best work, need a whole toolkit.
- At very least, compare your method to a conventional one (e.g., naive Bayes, logistic regression)
- A particular modeling technique only occasionally makes a big difference, and it is hard to predict when it will.
- Best approach: use a handful of good tools (each adds only 5-10% effort)
© 2004 Elder Research, Inc.
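The advice to compare a method against conventional ones can be sketched as a small harness that fits several techniques on the same training set and reports each one's error on the same held-out set. This is a minimal illustration under stated assumptions: the three techniques (majority-class baseline, nearest centroid, 1-nearest-neighbor), the function names, and the toy data are all hypothetical stand-ins, not the methods or datasets used in the slides.

```python
from collections import Counter
from math import dist

def majority_baseline(train):
    """Always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_centroid(train):
    """Predict the label whose class mean is closest to x."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {
        y: tuple(sum(col) / len(col) for col in zip(*xs))
        for y, xs in groups.items()
    }
    return lambda x: min(centroids, key=lambda y: dist(x, centroids[y]))

def one_nearest_neighbor(train):
    """Predict the label of the single closest training point."""
    return lambda x: min(train, key=lambda row: dist(x, row[0]))[1]

def error_rate(predict, test):
    """Fraction of held-out cases that are misclassified."""
    return sum(predict(x) != y for x, y in test) / len(test)

def compare(techniques, train, test):
    """Fit every technique on the same data and score it the same way."""
    return {name: error_rate(fit(train), test) for name, fit in techniques.items()}

# Toy data (invented for illustration): two features, binary label.
train = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.1, 0.3), 0),
         ((1.0, 1.0), 1), ((0.9, 1.2), 1)]
test = [((0.1, 0.1), 0), ((1.1, 0.9), 1), ((0.8, 1.0), 1), ((0.3, 0.2), 0)]

results = compare(
    {
        "majority baseline": majority_baseline,
        "nearest centroid": nearest_centroid,
        "1-nearest-neighbor": one_nearest_neighbor,
    },
    train,
    test,
)
for name, err in results.items():
    print(f"{name}: error {err:.2f}")
```

Because every technique shares the same `fit(train) -> predict` shape, adding another tool to the comparison is one more dictionary entry, which is the sense in which each extra tool costs little additional effort.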
Relative Performance Examples: 5 Algorithms on 6 Datasets
(with Stephen Lee, U. Idaho, 1997 )
[Figure: error relative to peer techniques (lower is better) for five algorithms (Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, Decision Tree) on six datasets (Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment)]
Essentially every Bundling method improves performance
[Figure: error relative to peer techniques (lower is better) on the same six datasets (Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment) for four bundling methods: Advisor Perceptron, AP weighted average, Vote, Average]
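Two of the bundling methods named in the figure, Vote and Average, can be sketched in a few lines. The sketch assumes each model outputs a score between 0 and 1 per case; the function names, the 0.5 threshold, and the example scores are illustrative assumptions, and the Advisor Perceptron variants are not shown.

```python
from statistics import mean

def average_bundle(scores_per_model):
    """Average the models' scores case by case."""
    return [mean(case) for case in zip(*scores_per_model)]

def vote_bundle(scores_per_model, threshold=0.5):
    """Majority vote over thresholded scores; a tie counts as a negative."""
    votes = [[s > threshold for s in scores] for scores in scores_per_model]
    return [sum(case) > len(case) / 2 for case in zip(*votes)]

# Hypothetical scores from three models on four cases.
model_scores = [
    [0.9, 0.4, 0.2, 0.7],
    [0.8, 0.6, 0.1, 0.4],
    [0.6, 0.3, 0.4, 0.8],
]

averaged = average_bundle(model_scores)
voted = vote_bundle(model_scores)
print(averaged)
print(voted)
```

Averaging keeps a graded score, which matters if cases are later ranked; voting discards the margin and returns only a decision.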
3: Ask the wrong question
- Project Goal: Aim at the right target
- Fraud Detection at AT&T Labs: predict fraud in international calls
- Didn't attempt to classify fraud/nonfraud for a general call, but characterized normal behavior for each account, then flagged outliers: a brilliant success.
- Model Goal: Evaluate appropriately
- Most researchers use squared error or accuracy for their convenience
- Ask the algorithm to do what's most helpful for the system, not what's easiest for it
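The reframing in the fraud example (characterize normal behavior per account, then flag outliers) can be sketched as follows. This is not the AT&T method; it is a minimal illustration that assumes call duration is the only feature and uses a per-account z-score with a hypothetical threshold of 3.

```python
from statistics import mean, stdev

def flag_outliers(history, new_calls, z_threshold=3.0):
    """Flag calls that are unusual for the account that made them.

    history:   {account: [past call minutes]}
    new_calls: [(account, minutes)]
    """
    flagged = []
    for account, minutes in new_calls:
        past = history.get(account, [])
        if len(past) < 2:
            continue  # too little history to characterize "normal"
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            unusual = minutes != mu
        else:
            unusual = abs(minutes - mu) / sigma > z_threshold
        if unusual:
            flagged.append((account, minutes))
    return flagged

# Invented data: account A makes short calls, account B makes long ones.
history = {
    "A": [5, 6, 4, 5, 7, 5],
    "B": [60, 55, 70, 65, 58, 62],
}
new_calls = [("A", 6), ("A", 58), ("B", 58), ("B", 61)]

flagged = flag_outliers(history, new_calls)
print(flagged)
```

A 58-minute call is flagged for account A but not for account B, which is the point of modeling each account separately instead of learning one global fraud/nonfraud boundary.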
5: Accept leaks from the future
- Example:
- Forecasting interest rate at Chicago Bank
- Neural network was 95% accurate, but output was a candidate input
- Example 2:
- Used moving average of 3 days, but centered on today
- Look for variables which work (too) well
- Example: Insurance code associated with 25% of purchasers turned out to describe type of cancellation
- Need domain knowledge about collection process
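The moving-average leak in Example 2 can be made concrete. A 3-day average centered on day t includes day t+1, so a model that uses it on day t has already seen the future. The sketch below contrasts it with a trailing average and adds a simple check that perturbs future values to see whether a feature changes; the function names and the rate series are assumptions made for illustration.

```python
def centered_ma(series, t, window=3):
    """Mean of the window centered on day t; includes day t+1 (the future)."""
    half = window // 2
    return sum(series[t - half : t + half + 1]) / window

def trailing_ma(series, t, window=3):
    """Mean of the last `window` days up to and including day t."""
    return sum(series[t - window + 1 : t + 1]) / window

def uses_future(feature, series, t):
    """Return True if feature(series, t) changes when values after day t change."""
    tampered = series[: t + 1] + [v + 1000.0 for v in series[t + 1 :]]
    return feature(series, t) != feature(tampered, t)

# Invented daily rates; t must leave at least one later day to tamper with.
rates = [5.0, 5.1, 5.3, 5.2, 5.6, 5.8]
t = 3

print("centered leaks:", uses_future(centered_ma, rates, t))
print("trailing leaks:", uses_future(trailing_ma, rates, t))
```

The same perturbation check can be applied to any candidate input whose construction is not fully understood, as a cheap complement to domain knowledge about how the data were collected.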
6: Discount pesky cases
- Outliers may be skewing results (e.g. decimal point error on price) or be the whole answer (e.g. Ozone hole), so examine carefully!
- The most exciting phrase in research isn't "Aha!" but "That's odd..."
- Inconsistencies in the data may be clues to problems with the information flow process
- Example: Direct mail
- Persistent questioning of oddities found errors in the merge-purge process and was a major contributor to doubling sales per catalog
7: Extrapolate
- Tend to learn too much from first few experiences
- Hard to "erase" findings after an upstream error is discovered
- Curse of Dimensionality: low-dimensional intuition is useless in high dimensions
- Human and computer strengths are more complementary than alike
8: Answer every inquiry
- "Don't Know" is a useful model output state
- Could estimate the uncertainty for each output (a function of the number and spread of samples near X)
- However, few algorithms provide an estimate of uncertainty along with their predictions
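One way to give a model a "Don't Know" state, following the idea that uncertainty is a function of the number and spread of samples near X, is a local-average predictor that abstains when too few neighbors are found or when their targets disagree too much. This is a sketch under assumptions: a single numeric input, and hypothetical values for `radius`, `min_neighbors`, and `max_spread`.

```python
from statistics import mean, pstdev

def predict_or_abstain(train, x, radius=1.0, min_neighbors=3, max_spread=0.5):
    """Return the mean target of nearby samples, or None for "don't know".

    train: list of (x_i, y_i) pairs with scalar x_i.
    """
    nearby = [y for xi, y in train if abs(xi - x) <= radius]
    if len(nearby) < min_neighbors:
        return None  # too few samples near x
    if pstdev(nearby) > max_spread:
        return None  # nearby samples disagree too much
    return mean(nearby)

# Invented data: a dense, consistent region near x=1 and a noisy one near x=5.
train = [(1.0, 2.0), (1.2, 2.1), (0.8, 1.9), (1.5, 2.2),
         (5.0, 9.0), (5.1, 3.0), (4.9, 6.0)]

print(predict_or_abstain(train, 1.1))   # confident region
print(predict_or_abstain(train, 5.0))   # neighbors disagree
print(predict_or_abstain(train, 10.0))  # no neighbors at all
```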
10: Believe the best model
- Interpretability is not always necessary
- Model can be useful without being "correct"
- In practice there are often many very similar variables available and the selected variables may have only barely won out
- And structural similarity is different from functional similarity -- competing models often look different, but act the same
- Best estimator is likely to be an ensemble of models
Example: Lift chart (%purchasers vs. %prospects)
- Ex: The last quintile of customers is 4 times more expensive to obtain than the first quintile (10% vs. 40% of prospects to gain 20% of purchasers)
- Decision tree provides relatively few decision points
[Figure: lift chart; both axes run from 0% to 100%]
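The points of a lift chart of this kind (%purchasers reached vs. %prospects contacted) can be computed by ranking prospects by model score and accumulating purchasers. The sketch below assumes one score and one 0/1 purchase flag per prospect; the function name and the data are invented for illustration.

```python
def cumulative_gains(scores, purchased, steps=10):
    """Return [(fraction of prospects contacted, fraction of purchasers reached)].

    Prospects are contacted in order of decreasing model score.
    """
    ranked = [p for _, p in sorted(zip(scores, purchased),
                                   key=lambda sp: sp[0], reverse=True)]
    total = sum(ranked)
    n = len(ranked)
    points = []
    for k in range(1, steps + 1):
        cutoff = round(n * k / steps)
        points.append((k / steps, sum(ranked[:cutoff]) / total))
    return points

# Ten hypothetical prospects, four of whom purchased.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
purchased = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gains = cumulative_gains(scores, purchased, steps=5)
for prospects, purchasers in gains:
    print(f"{prospects:.0%} of prospects -> {purchasers:.0%} of purchasers")
```

Reading such a curve quintile by quintile gives the cost comparison on the slide: if the first 20% of purchasers takes 10% of prospects and the last 20% takes 40%, the last quintile costs four times as many contacts.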
[Figure labels: Overall population; Target population; Ensemble of 5 trees]
Bundling 5 Trees improves accuracy and smoothness
[Figure: several lift-chart panels, each with both axes running from 0% to 100%]
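Credit Scoring Model Performance
[Figure: defaulters missed (fewer is better) vs. number of models combined (averaging output rank). Models: Bundled Trees, Stepwise Regression, Polynomial Network, Neural Network, MARS. Combinations are labeled with letter codes, from pairs (NT, NS, ST, MT, PS, PT, NP, MS, MN, MP) through triples (SNT, MPN, SPT, PNT, SMT, MPT, SPN, MNT, SMN, SMP) and quadruples (SPNT, SMPT, SMNT, SMPN, MPNT) to all five (SMPNT); the letters presumably abbreviate the model names.]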
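The combination rule named on the figure's axis, averaging output rank, can be sketched as follows: each model's scores are converted to ranks, and the ranks are averaged per case. Working with ranks lets models with different score scales be combined directly. The function names, the tie handling (tied scores share their average rank), and the example scores are assumptions; the slides do not spell out these details.

```python
def ranks(scores):
    """Rank cases from 1 (lowest score) upward; ties share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of cases tied with the case at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def average_rank(model_scores):
    """Combine models by averaging each case's rank across models."""
    per_model = [ranks(s) for s in model_scores]
    return [sum(case) / len(case) for case in zip(*per_model)]

# Two hypothetical models scoring four applicants on different scales.
model_a = [0.9, 0.1, 0.5, 0.3]
model_b = [700, 200, 650, 650]

combined = average_rank([model_a, model_b])
print(combined)
```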
Myths and pitfalls of data mining
(source: Tom Khabaza, DMReview)
Myth
- Data mining is all about algorithms
- Data mining is a process consisting of many elements, such as formulating business goals, mapping business goals to data mining goals, acquiring, understanding and preprocessing the data, evaluating and presenting the results of analysis and deploying these results to achieve business benefits
- A problem occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent of the data mining process
Myth
- Data mining is all about predictive accuracy
- Predictive models should have some degree of accuracy because this demonstrates that they have truly discovered patterns in the data
- However, the usefulness of an algorithm or model is also determined by a number of other properties, one of which is understandability
- This is because the data mining process is driven by business expertise -- it relies on the input and involvement of non-technical business professionals in order to be successful
Myth
- Data mining requires a data warehouse
- Data mining can benefit from warehoused data that is well organized, relatively clean and easy to access
- But warehoused data may be less useful than the source or operational data -- in the worst case, warehoused data may be completely useless (e.g. if only summary data is stored)
- Data mining benefits from a properly designed data warehouse and constructing such a warehouse often benefits from doing some exploratory DM
Pitfalls
- Buried under mountains of data
- Do not always need to build models from millions of examples just because the data are available
- The Mysterious Disappearing Terabyte
- For a given data mining problem, the amount of available and relevant data may be much less than initially supposed
Pitfalls
- Disorganized Data Mining
- Data mining can occasionally, despite the best of intentions, take place in an ad hoc manner, with no clear goals and no idea of how the results will be used -- this leads to wasted time and unusable results
- Insufficient Business Knowledge
- Business knowledge is critical -- without it, organizations can neither achieve useful results nor guide the data mining process towards them
Pitfalls
- Insufficient Data Knowledge
- In order to perform data mining, we must be able to answer questions such as "What do the codes in this field mean?" and "Can there be more than one record per customer in this table?" In some cases, this information is surprisingly hard to come by
- Erroneous Assumptions (courtesy of experts)
- Business and data experts are crucial resources, but this does not mean that the data miner should unquestioningly accept every statement they make
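Both pitfalls above suggest checking claims about the data against the data itself. As a minimal sketch, the two data-knowledge questions (which codes occur in a field, and whether a customer can have more than one record) can be answered with simple counts. The field names `customer_id` and `status` and the rows are hypothetical.

```python
from collections import Counter

def customers_with_multiple_records(rows, key="customer_id"):
    """Return {customer: record count} for customers appearing more than once."""
    counts = Counter(row[key] for row in rows)
    return {cid: n for cid, n in counts.items() if n > 1}

def code_frequencies(rows, field):
    """Count how often each code occurs, so unknown codes can be queried."""
    return Counter(row[field] for row in rows)

# Invented table extract.
rows = [
    {"customer_id": 101, "status": "A"},
    {"customer_id": 102, "status": "C"},
    {"customer_id": 101, "status": "A"},
    {"customer_id": 103, "status": "X9"},
]

duplicates = customers_with_multiple_records(rows)
codes = code_frequencies(rows, "status")
print(duplicates)
print(codes)
```

If an expert states that the table holds one row per customer, a non-empty `duplicates` result is a concrete oddity to take back to them.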
Pitfalls
- Incompatibility of Data Mining Tools
- No toolkit will provide every possible capability, especially when the individual preferences of analysts are taken into account, so the toolkit should interface easily with other available tools and third-party options
- Locked in the Data Jail House
- Some tools require the data to be in a proprietary format that is not compatible with commonly used database systems
- This can result in high overhead costs and create difficulty in deployment into an organization's system