Comparing Data Mining Systems: Commercial Options and Selection Criteria - Prof. Jennifer , Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Jennifer L. Neville

An overview of data mining systems, discussing the importance of choosing the right system based on various dimensions such as data types, data mining functions and methodologies, coupling with databases, and data visualization. It also lists several example data mining systems and their features.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-bm2 🇺🇸

Data Mining

CS57300 / STAT 59800-024

Purdue University

April 28, 2009

Data mining systems

How to choose a data mining system

Commercial data mining systems have little in common
- Different data mining functionality or methodology
- May even work with completely different kinds of data
Need to consider multiple dimensions in selection
- Data types: relational, transactional, sequential, spatial?
- Data sources: ASCII text files? multiple relational data sources? support open database connectivity (ODBC) connections?
- System issues: running on only one or on several operating systems? a client/server architecture? provide Web-based interfaces and allow XML data as I/O?

Choosing a system

Dimensions (cont):
- Data mining functions and methodologies
  - One vs. multiple data mining functions
  - One vs. variety of methods per function
  - More functions and methods per function provide the user with greater flexibility and analysis power
- Coupling with DB and/or data warehouse systems
  - Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
  - Ideally, a data mining system should be tightly coupled with a database system
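
As a rough illustration of the coupling spectrum, the sketch below computes the support of an item pair twice: once by fetching every row into the client (closer to loose coupling) and once by letting the database do the counting (closer to tight coupling). The `purchases` table, its rows, and both function names are hypothetical, and SQLite stands in for whatever database system is actually used.

```python
import sqlite3

# Hypothetical transactions table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (txn_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"),
     (3, "milk"), (3, "bread"), (4, "eggs")],
)

def support_loose(conn, a, b):
    """Fetch all rows, then count co-occurrence in the client."""
    baskets = {}
    for txn_id, item in conn.execute("SELECT txn_id, item FROM purchases"):
        baskets.setdefault(txn_id, set()).add(item)
    hits = sum(1 for items in baskets.values() if a in items and b in items)
    return hits / len(baskets)

def support_tight(conn, a, b):
    """Let the database count; only two numbers leave it (assumes a != b)."""
    (hits,) = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT txn_id FROM purchases WHERE item IN (?, ?)"
        "  GROUP BY txn_id HAVING COUNT(DISTINCT item) = 2)",
        (a, b),
    ).fetchone()
    (total,) = conn.execute(
        "SELECT COUNT(DISTINCT txn_id) FROM purchases"
    ).fetchone()
    return hits / total

print(support_loose(conn, "bread", "milk"), support_tight(conn, "bread", "milk"))
```

Both functions return the same support; the difference is how much data has to leave the database to get it.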

Example systems

Microsoft SQL Server 2008
- Integrate DB and OLAP with multiple mining methods
- Supports Object Linking and Embedding Database (OLEDB) -- access to wider formats of data than just ODBC
Vero Insight MineSet
- Multiple data mining algorithms and advanced statistics
- Advanced visualization tools (originally developed by Silicon Graphics)
PASW Modeler (SPSS)
- Integrated data mining development environment for end-users and developers
- Multiple data mining algorithms and visualization tools

Example systems

DBMiner (developed by Jiawei Han at SFU)
- Multiple data mining modules: discovery-driven OLAP analysis, association, classification, and clustering
- Efficient association and sequential-pattern mining functions, and a visual classification tool
- Mining both relational databases and data warehouses

Top Ten Data Mining Mistakes

(source: John Elder, Elder Research)

You’ve made a mistake if you...

0: Lack data
1: Focus on training
2: Rely on one technique
3: Ask the wrong question
4: Listen (only) to the data
5: Accept leaks from the future
6: Discount pesky cases
7: Extrapolate
8: Answer every inquiry
9: Sample casually
10: Believe the best model
2: Rely on one technique

"To a person with a hammer, all the world's a nail."
For best work, need a whole toolkit.
At very least, compare your method to a conventional one (e.g., naive Bayes, logistic regression)
It is somewhat unusual for a particular modeling technique to make a big difference, and it is hard to predict when it will.
Best approach: use a handful of good tools (each adds only 5-10% effort)

Relative Performance Examples: 5 Algorithms on 6 Datasets

(with Stephen Lee, U. Idaho, 1997 )

[Figure: error relative to peer techniques (lower is better) on six datasets (Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, Investment) for five algorithms: Neural Network, Logistic Regression, Linear Vector Quantization, Projection Pursuit Regression, Decision Tree.]

Essentially every Bundling method improves performance

[Figure: error relative to peer techniques (lower is better) on the same six datasets for four bundling methods: Advisor Perceptron, AP weighted average, Vote, Average.]
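
A minimal sketch of the simplest bundling method named in the figure, Vote, using only the standard library. The three prediction lists are made up and deliberately constructed so that the models err on different cases; real models usually have correlated errors, so the gain is typically smaller.

```python
from collections import Counter

def vote(predictions):
    """Return the majority label for each case across several models."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

def error_rate(predicted, actual):
    """Fraction of cases where the prediction differs from the true label."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

# Illustrative labels only: each model is wrong on two different cases.
actual  = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 1, 0, 0, 1, 1]
model_b = [0, 0, 1, 1, 0, 1, 1, 0]
model_c = [1, 1, 1, 1, 0, 0, 0, 0]

bundled = vote([model_a, model_b, model_c])
for name, preds in [("a", model_a), ("b", model_b), ("c", model_c), ("vote", bundled)]:
    print(name, error_rate(preds, actual))
```

Because no two models are wrong on the same case here, the majority is right everywhere even though each individual model misses a quarter of the cases.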

3: Ask the wrong question

Project Goal: Aim at the right target
- Fraud Detection at AT&T Labs: predict fraud in international calls
- Didn't attempt to classify fraud/nonfraud for calls in general, but characterized normal behavior for each account and then flagged outliers: a brilliant success.
Model Goal: Evaluate appropriately
- Most researchers use squared error or accuracy for their convenience
- Ask the algorithm to do what's most helpful for the system, not what's easiest for it
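
The per-account idea can be sketched as follows. This is not the AT&T system, whose details the slides do not give; it is a minimal z-score check on call duration, and the account names, durations, and threshold are assumptions for illustration.

```python
from statistics import mean, stdev

def flag_outliers(history, new_calls, threshold=3.0):
    """Flag calls that are far from that account's own normal behavior."""
    flagged = []
    for account, minutes in new_calls:
        past = history.get(account, [])
        if len(past) < 2:
            continue  # not enough history to characterize this account
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(minutes - mu) / sigma > threshold:
            flagged.append((account, minutes))
    return flagged

# Hypothetical call durations in minutes, per account.
history = {
    "acct_1": [2, 3, 2, 4, 3],       # normally makes short calls
    "acct_2": [40, 55, 60, 45, 50],  # normally makes long calls
}
new_calls = [("acct_1", 45), ("acct_2", 45), ("acct_1", 3)]

print(flag_outliers(history, new_calls))
```

The same 45-minute call is flagged for the first account and ignored for the second, which a single global fraud/nonfraud rule on duration could not do.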

5: Accept leaks from the future

Example:
- Forecasting interest rate at Chicago Bank
- Neural network was 95% accurate, but output was a candidate input
Example 2:
- Used moving average of 3 days, but centered on today
Look for variables which work (too) well
- Example: Insurance code associated with 25% of purchasers turned out to describe type of cancellation
Need domain knowledge about collection process
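
The moving-average leak in Example 2 can be made concrete with a small sketch. The rate series and function names are invented for illustration; the point is only which days each window reads.

```python
def trailing_average(series, t, window=3):
    """Average of the values ending at day t: uses only the past and today."""
    start = max(0, t - window + 1)
    return sum(series[start:t + 1]) / (t + 1 - start)

def centered_average(series, t, window=3):
    """Average centered on day t: reads day t+1, which is unknown on day t."""
    half = window // 2
    start, end = max(0, t - half), min(len(series), t + half + 1)
    return sum(series[start:end]) / (end - start)

# Made-up daily rates with a jump on day 3.
rates = [5.0, 5.1, 5.2, 6.5, 6.6]
today = 2

print(trailing_average(rates, today))  # built from days 0, 1, 2
print(centered_average(rates, today))  # built from days 1, 2, 3
```

On day 2 the centered feature already reflects the jump that happens on day 3, so a model trained on it looks better in testing than it can be in deployment.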

6: Discount pesky cases

Outliers may be skewing results (e.g. decimal point error on price) or be the whole answer (e.g. Ozone hole), so examine carefully!
The most exciting phrase in research isn't "Aha!" but "That's odd..."
Inconsistencies in the data may be clues to problems with the information flow process
- Example: Direct mail
  - Persistent questioning of oddities found errors in the merge-purge process and was a major contributor to doubling sales per catalog

7: Extrapolate

Tend to learn too much from first few experiences
Hard to "erase" findings after an upstream error is discovered
Curse of Dimensionality: low-dimensional intuition is useless in high dimensions
Human and computer strengths are more complementary than alike

8: Answer every inquiry

"Don't Know" is a useful model output state
Could estimate the uncertainty for each output (a function of the number and spread of samples near X)
However, few algorithms provide an estimate of uncertainty along with their predictions
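
One way to give a model a "Don't Know" state, sketched under simple assumptions: a one-dimensional input, a fixed neighborhood radius, and hand-picked limits on the number and spread of nearby samples. The function name and all thresholds are hypothetical.

```python
from statistics import mean, pstdev

def predict_or_abstain(train, x, radius=1.0, min_neighbors=3, max_spread=0.5):
    """Predict the mean of nearby targets, or None ("don't know")."""
    nearby = [y for xi, y in train if abs(xi - x) <= radius]
    if len(nearby) < min_neighbors:
        return None  # too few samples near x
    if pstdev(nearby) > max_spread:
        return None  # nearby samples disagree too much
    return mean(nearby)

# Made-up (x, y) pairs: consistent near x=1, contradictory near x=5.
train = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.0),
         (5.0, 9.0), (5.1, 3.0), (4.9, 6.0)]

print(predict_or_abstain(train, 1.0))   # dense, consistent region
print(predict_or_abstain(train, 5.0))   # enough samples, but high spread
print(predict_or_abstain(train, 10.0))  # no samples nearby
```

The two abstention checks correspond to the two quantities the slide mentions: the number of samples near X and their spread.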

10: Believe the best model

Interpretability is not always necessary
- Model can be useful without being "correct"
- In practice there are often many very similar variables available and the selected variables may have only barely won out
- And structural similarity is different from functional similarity -- competing models often look different, but act the same
Best estimator is likely to be an ensemble of models

Example: Lift chart

Lift Chart: %purchasers vs. %prospects

Ex: The last quintile of customers is 4 times more expensive to obtain than the first quintile (10% vs. 40% of prospects to gain 20% of purchasers)
A single decision tree provides relatively few decision points
Bundling 5 trees improves accuracy and smoothness

[Figure: lift charts of % purchasers (target population) against % prospects (overall population) for individual decision trees and for an ensemble of 5 trees.]
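
The curve in a lift chart can be computed as below. This is a minimal sketch with made-up scores and outcomes; `cumulative_gains` is a hypothetical helper, not part of any particular tool.

```python
def cumulative_gains(scores, purchased, n_bins=5):
    """Fraction of all purchasers captured after contacting each successive
    bin of prospects, with prospects ranked from highest to lowest score."""
    ranked = [p for _, p in sorted(zip(scores, purchased),
                                   key=lambda sp: sp[0], reverse=True)]
    total = sum(ranked)
    gains = []
    for b in range(1, n_bins + 1):
        cutoff = round(len(ranked) * b / n_bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains

# Illustrative model scores and outcomes (1 = purchased) for ten prospects.
scores    = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
purchased = [1,   1,   1,   0,   1,   0,   0,   1,   0,   0]

gains = cumulative_gains(scores, purchased)
print(gains)
```

Plotting `gains` against the fraction of prospects contacted gives the lift curve; the diagonal is what random selection would achieve. In this toy data the top 20% of prospects contain 40% of the purchasers, a lift of 2 for the first quintile.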

Credit Scoring Model Performance

[Figure: defaulters missed (fewer is better) against the number of models combined (averaging output rank). Models: Bundled Trees, Stepwise Regression, Polynomial Network, Neural Network, MARS. Combinations are labeled by initials, from pairs such as NT up to all five, SMPNT.]
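
Averaging output rank, the combination rule named on the figure's axis, can be sketched as follows. The two score lists are hypothetical and use different scales on purpose, since ranking removes the need to calibrate the models against each other.

```python
def ranks(scores):
    """Rank cases from 1 (lowest score) to n (highest); ties broken by position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def average_rank(model_scores):
    """Combine models by averaging each case's rank across the models."""
    per_model = [ranks(s) for s in model_scores]
    return [sum(r) / len(r) for r in zip(*per_model)]

# Hypothetical default-risk scores for four applicants from two models.
tree_scores = [0.10, 0.80, 0.40, 0.95]
net_scores  = [12.0, 30.0, 45.0, 90.0]

combined = average_rank([tree_scores, net_scores])
print(combined)
```

Applicants are then ordered by `combined`; the models agree on the safest and riskiest applicants, and their disagreement on the middle two is averaged out.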

Myths and pitfalls of data mining

(source: Tom Khabaza, DMReview)

Myth

Data mining is all about algorithms
- Data mining is a process consisting of many elements, such as formulating business goals, mapping business goals to data mining goals, acquiring, understanding and preprocessing the data, evaluating and presenting the results of analysis and deploying these results to achieve business benefits
- A problem occurs when data miners focus too much on the algorithms and ignore the other 90-95 percent of the data mining process

Myth

Data mining is all about predictive accuracy
- Predictive models should have some degree of accuracy because this demonstrates that they have truly discovered patterns in the data
- However, the usefulness of an algorithm or model is also determined by a number of other properties, one of which is understandability
- This is because the data mining process is driven by business expertise -- it relies on the input and involvement of non-technical business professionals in order to be successful

Myth

Data mining requires a data warehouse
- Data mining can benefit from warehoused data that is well organized, relatively clean and easy to access
- But warehoused data may be less useful than the source or operational data -- in the worst case, warehoused data may be completely useless (e.g. if only summary data is stored)
Data mining benefits from a properly designed data warehouse and constructing such a warehouse often benefits from doing some exploratory DM

Pitfalls

Buried under mountains of data
- Do not always need to build models from millions of examples just because the data are available
The Mysterious Disappearing Terabyte
- For a given data mining problem, the amount of available and relevant data may be much less than initially supposed

Pitfalls

Disorganized Data Mining
- Data mining can occasionally, despite the best of intentions, take place in an ad hoc manner, with no clear goals and no idea of how the results will be used -- this leads to wasted time and unusable results
Insufficient Business Knowledge
- Business knowledge is critical -- without it, organizations can neither achieve useful results nor guide the data mining process towards them

Pitfalls

Insufficient Data Knowledge
- In order to perform data mining, we must be able to answer questions such as: What do the codes in this field mean? Can there be more than one record per customer in this table? In some cases, this information is surprisingly hard to come by
Erroneous Assumptions (courtesy of experts)
- Business and data experts are crucial resources, but this does not mean that the data miner should unquestioningly accept every statement they make
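
The two data-knowledge questions above (what codes a field contains, and whether a customer can have more than one record) can at least be checked against the data itself. A minimal sketch, assuming rows arrive as dictionaries; the field names and values are hypothetical.

```python
from collections import Counter

def customers_with_multiple_records(rows, key="customer_id"):
    """Return {customer: record count} for customers appearing more than once."""
    counts = Counter(row[key] for row in rows)
    return {cust: n for cust, n in counts.items() if n > 1}

def code_frequencies(rows, field):
    """Count how often each code appears in a field, as a first look at its meaning."""
    return Counter(row[field] for row in rows)

# Made-up rows standing in for a customer table.
rows = [
    {"customer_id": 101, "status": "A"},
    {"customer_id": 102, "status": "C"},
    {"customer_id": 101, "status": "A"},
    {"customer_id": 103, "status": "X"},
]

print(customers_with_multiple_records(rows))
print(code_frequencies(rows, "status"))
```

Such checks only show what is in the data; what a rare code like "X" means still has to come from someone who knows the collection process, and that answer should itself be checked against the data.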

Pitfalls

Incompatibility of Data Mining Tools
- No toolkit will provide every possible capability, especially when the individual preferences of analysts are taken into account, so the toolkit should interface easily with other available tools and third-party options
Locked in the Data Jail House
- Some tools require the data to be in a proprietary format that is not compatible with commonly used database systems
- This can result in high overhead costs and create difficulty in deployment into an organization's system
