Midterm Review for Data Mining | CS 57300, Exams of Computer Science

Purdue University Computer Science

Material Type: Exam; Class: Data Mining; Subject: CS-Computer Sciences; University: Purdue University - Main Campus; Term: Spring 2009;

Uploaded on 07/30/2009

Data Mining

CS57300 / STAT 59800-024

Purdue University

Mar 3, 2009

Midterm review

Topics

Elements of data mining algorithms
Data preparation and exploration
Statistical foundations
Predictive modeling

Readings

Chapter 1: Introduction
Chapter 2: Measurement and Data
Chapter 3: Visualizing and Exploring Data
Chapter 4: Data Analysis and Uncertainty
Chapter 5: Systematic Overview of Data Mining Algorithms (5.1-5.3.1 only)
Chapter 6: Models and Patterns (6.1-6.3 only)
Chapter 7: Score Functions
Chapter 8: Search and Optimization
Chapter 10: Predictive Modeling for Classification

Elements of data mining

Task specification
- Description of the characteristics of the analysis and desired result
Data representation
- Data structure used for representing individual and collections of measurements
Knowledge representation
- Description of the results of the data mining algorithm
- Defines a set of possible models or patterns

Elements of data mining

Learning technique
- Search: Method for generating possible models/patterns and optimizing their score
- Evaluation: Associates a numerical score with each possible model/pattern
Inference technique
- Method to apply learned model to data for prediction

Data

Distance measures

Distance measures indicate dissimilarity between objects
A metric d(i,j) is a dissimilarity measure that satisfies the following properties:
- d(i,j) ≥ 0 for all i,j and d(i,j) = 0 iff i = j
- d(i,j) = d(j,i) for all i,j
- d(i,j) ≤ d(i,k) + d(k,j) for all i,j,k
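
As an illustration of the three properties, here is a minimal sketch that checks them on a small set of points. The helper names (`euclidean`, `squared_euclidean`, `is_metric`) and the sample points are assumptions made for this example, and passing the check on a finite sample does not prove that a function is a metric in general.

```python
import itertools
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Squared Euclidean distance: a dissimilarity, but not a metric."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_metric(points, d, tol=1e-12):
    """Check the three metric properties of d on a finite set of points."""
    for i, j in itertools.product(points, repeat=2):
        # Non-negativity, and zero distance only between identical points.
        if d(i, j) < 0 or (d(i, j) <= tol) != (i == j):
            return False
        # Symmetry.
        if abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in itertools.product(points, repeat=3):
        # Triangle inequality.
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(is_metric(points, euclidean))          # True on this sample
print(is_metric(points, squared_euclidean))  # False: 4 > 1 + 1 along the x-axis
```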

Statistics

Topics

Bayesian vs. Frequentist view
- Although interpretation of probability is different, underlying calculus is the same
Random variables
Expectation, variance
Common distributions
- Bernoulli, Binomial, Multinomial, Normal
Independence, conditional independence
Populations and samples

Properties of estimators

Let θ̂ be an estimate for a population parameter θ
For different samples D, we will have different estimates θ̂_D
Bias is the difference between the mean estimate for θ̂ and the true parameter θ
Variance is the average squared deviation of θ̂_D from the mean of the sampling distribution for θ̂
Bias-variance tradeoff: reducing one often increases the other
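
The definitions of bias and variance can be made concrete with a small simulation. The sketch below is an assumption-laden example, not part of the course material: it draws repeated samples from a normal population with variance 4 and compares two estimators of that variance, one dividing by n and one dividing by n - 1. All function names and settings are chosen for illustration.

```python
import random


def var_mle(sample):
    """Variance estimate dividing by n (the maximum likelihood estimate)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


def var_unbiased(sample):
    """Variance estimate dividing by n - 1."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)


def bias_and_variance(estimator, true_value, n=10, trials=20000, seed=0):
    """Approximate the bias and variance of an estimator over many samples D."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Each sample D has n draws from N(0, 2^2), so the true variance is 4.
        sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
        estimates.append(estimator(sample))
    mean_estimate = sum(estimates) / trials
    bias = mean_estimate - true_value
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / trials
    return bias, variance


bias_mle, spread_mle = bias_and_variance(var_mle, true_value=4.0)
bias_unb, spread_unb = bias_and_variance(var_unbiased, true_value=4.0)
print(f"divide by n:     bias={bias_mle:.3f}, variance={spread_mle:.3f}")
print(f"divide by n - 1: bias={bias_unb:.3f}, variance={spread_unb:.3f}")
```

In this setup the divide-by-n estimator is biased downward (by about σ²/n) but has the smaller variance, which is one small instance of the tradeoff.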

Hypothesis testing

Formulation
- Null hypothesis (H0): Presumed true until rejected
- Alternative hypothesis (H1): Rival hypothesis
- α: critical threshold for rejecting H0
Test
- Identify a test statistic to assess the truth of H0
- Compute the p-value for the test statistic using the sampling distribution under H0
- Reject H0 if p < α
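
The procedure can be sketched with an exact one-sided binomial test. The scenario (20 coin flips, H0: the coin is fair, H1: heads is more likely) and the function name `binomial_p_value` are assumptions made for this example.

```python
from math import comb


def binomial_p_value(successes, n, p0):
    """One-sided p-value: P(X >= successes) for X ~ Binomial(n, p0) under H0."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


alpha = 0.05                      # critical threshold for rejecting H0
p_value = binomial_p_value(successes=16, n=20, p0=0.5)
reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")  # p-value = 0.0059, reject H0: True
```

Here the test statistic is the number of heads, and its sampling distribution under H0 is Binomial(20, 0.5). With 12 heads instead of 16, the same function gives a p-value of about 0.25, so H0 would not be rejected at α = 0.05.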

Parameter estimation

Infer the value of model parameters from data
Maximum likelihood estimation (MLE)
- Choose the parameter value that maximizes the likelihood
- Take the derivative of the log-likelihood and set it to zero to solve for the parameter, then check that the value is a maximum
Maximum a posteriori estimation (MAP)
- Choose the parameter value that maximizes the posterior (proportional to likelihood × prior)
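
As a worked example, consider k successes in n Bernoulli trials. Setting the derivative of the log-likelihood to zero gives the MLE k/n; with a Beta(a, b) prior (an assumption introduced here, with a, b > 1), the posterior mode is (k + a - 1) / (n + a + b - 2). The sketch below, with illustrative function names, computes both and confirms the MLE numerically on a grid.

```python
import math


def log_likelihood(theta, k, n):
    """Bernoulli log-likelihood of k successes in n trials."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def theta_mle(k, n):
    """Closed-form MLE from setting the log-likelihood derivative to zero."""
    return k / n


def theta_map(k, n, a=2.0, b=2.0):
    """Posterior mode under an assumed Beta(a, b) prior, valid for a, b > 1."""
    return (k + a - 1) / (n + a + b - 2)


k, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]
grid_best = max(grid, key=lambda t: log_likelihood(t, k, n))

print(theta_mle(k, n))   # 0.7
print(grid_best)         # 0.7, the grid maximum agrees with the closed form
print(theta_map(k, n))   # 8/12, pulled toward the prior mode of 0.5
```

With a uniform prior (a = b = 1) the MAP estimate reduces to the MLE.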

Predictive modeling

Predictive models

Predictive models predict the value of one variable of interest given known values of other variables
- Focus on modeling conditional distribution P(Y | X) or decision boundary for Y
Data representation: training set of (x(i), y(i)) pairs
Task: estimate a function y = f(x; θ) which maps observed x values to y values

Discriminative vs. probabilistic

Discriminative
- Model the decision boundary directly
- Direct mapping from inputs x to class label y
- May seek a discriminant function f(x; θ) that maximizes a measure of separation between classes
Probabilistic
- Model the underlying probability distributions (posterior class distribution or full joint distribution)
- Indirect mapping from inputs x to class label y through posterior class distribution p(y|x)

Learning algorithms

Search:
- Set of states (defined by knowledge representation)
- Set of search operators (actions to move in state space)
- Search algorithm (input state, choice of actions, stopping criterion)
Evaluation function:
- Internal: associate score with state for use in search
- External: measure quality of pattern/model
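
The search components can be sketched as a generic greedy hill-climber. This is only one possible search algorithm, and the toy state space (integers), operators (add or subtract one), and score function are assumptions chosen to keep the example short.

```python
def greedy_search(initial_state, operators, score, max_steps=100):
    """Hill-climb: repeatedly move to the best-scoring neighboring state."""
    state, best = initial_state, score(initial_state)
    for _ in range(max_steps):
        # Apply every search operator to generate the neighboring states.
        neighbors = [op(state) for op in operators]
        if not neighbors:
            break
        candidate = max(neighbors, key=score)
        # Stopping criterion: no neighbor improves the internal score.
        if score(candidate) <= best:
            break
        state, best = candidate, score(candidate)
    return state, best


# Toy problem: states are integers, and the score peaks at 3.
operators = [lambda s: s + 1, lambda s: s - 1]


def score(s):
    return -((s - 3) ** 2)


print(greedy_search(0, operators, score))  # (3, 0)
```

In a real learner the states would be candidate models or patterns (for example, partial decision trees) and `score` would play the role of the internal evaluation function.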

Tree learning

Top-down recursive divide and conquer algorithm
- Start with all examples at root
- Select best attribute/feature
- Partition examples by selected attribute
- Recurse and repeat
Evaluation functions:
- Information gain, gini gain, chi-square statistic
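
The "select best attribute" step can be sketched with information gain and Gini gain as the evaluation functions. The function names and the four-row toy dataset below are assumptions for illustration; the chi-square statistic is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gain(rows, labels, attribute, impurity=entropy):
    """Impurity reduction from partitioning the examples by one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    n = len(labels)
    remainder = sum(len(p) / n * impurity(p) for p in partitions.values())
    return impurity(labels) - remainder


def best_attribute(rows, labels, attributes, impurity=entropy):
    """Attribute with the largest gain under the chosen impurity measure."""
    return max(attributes, key=lambda a: gain(rows, labels, a, impurity))


rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(gain(rows, labels, "outlook"))           # 1.0: outlook separates the classes
print(gain(rows, labels, "windy"))             # 0.0: windy is uninformative
print(gain(rows, labels, "outlook", gini))     # 0.5
print(best_attribute(rows, labels, ["outlook", "windy"]))  # outlook
```

A full tree learner would partition the examples by the chosen attribute and recurse on each partition.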

Naive Bayes classifier

P(C | X) = P(X | C) P(C) / P(X) ∝ [∏_{i=1}^{m} P(X_i | C)] P(C)

Estimate class prior P(C) and conditional attribute distributions P(X_i | C) independently with MLE
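
A minimal sketch of this classifier for categorical attributes follows. It uses plain relative-frequency (MLE) estimates with no smoothing, so an attribute value never seen with a class gets probability zero. The function names and the five-row dataset are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """MLE estimates of P(C) and P(X_i | C) from categorical training data."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: count / n for c, count in class_counts.items()}
    # cond_counts[(i, c)][v] = number of class-c examples with X_i = v
    cond_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(i, c)][value] += 1
    return prior, cond_counts, class_counts


def predict(model, row):
    """Return the most probable class and the normalized posterior P(C | X)."""
    prior, cond_counts, class_counts = model
    scores = {}
    for c in prior:
        score = prior[c]
        for i, value in enumerate(row):
            # Multiply in the MLE of P(X_i = value | C = c).
            score *= cond_counts[(i, c)][value] / class_counts[c]
        scores[c] = score
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()} if total > 0 else scores
    return max(posterior, key=posterior.get), posterior


rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("sunny", "no")]
labels = ["play", "play", "stay", "stay", "stay"]

model = train_naive_bayes(rows, labels)
label, posterior = predict(model, ("sunny", "no"))
print(label, posterior)  # play, with P(play | X) = 0.6 and P(stay | X) = 0.4
```

Dividing by the sum of the class scores plays the role of P(X), which is why it can be dropped when only the most probable class is needed.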

Nearest neighbor

Store training data and delay processing until a new instance must be classified
All points represented in m-dimensional space
Nearest neighbors are calculated using Euclidean distance
For a new test instance, return the most common class value among the k closest training examples
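
A minimal sketch of k-nearest-neighbor classification with Euclidean distance and majority voting is shown below. The function name `knn_predict` and the toy two-dimensional data are assumptions for this example.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # All work happens at query time: compute the distance to every stored example.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_x, train_y, (0.5, 0.5)))  # a
print(knn_predict(train_x, train_y, (5.5, 5.5)))  # b
```

There is no training step beyond storing the data, which is what "delay processing" refers to.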

Ensemble methods

Bagging
- Learn many unstable predictors on data drawn with replacement from original sample
- Aggregate predictions on test instances
- Reduces variance of model
Boosting
- Learn many weak predictors on reweighted versions of the original sample
- Aggregate predictions on test instances
- Reduces bias of model
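
Bagging can be sketched with decision stumps on a single numeric feature as the base predictor. Everything here (the stump learner, the names `fit_stump` and `bagging`, the toy data, and the number of models) is an assumption for illustration; boosting would instead reweight the examples between rounds and is not shown.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold and orientation with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= t else right) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def bagging(xs, ys, n_models=25, seed=0):
    """Train stumps on bootstrap samples and aggregate them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: n draws with replacement from the original data.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = bagging(xs, ys)
print(predict(2), predict(8))
```

Each stump sees a slightly different sample, so individual thresholds vary, and the vote averages out that variation.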

Pathologies

Overfitting

Overfitting the training data
- Given a model space M, a model m ∈ M is overfitting the training data if ∃ m' ∈ M such that m has smaller error than m' on the training data, but m' has smaller error on the entire distribution of instances
Approaches for avoiding overfitting
- Decision trees: pruning
- Naive Bayes: smoothing

Pathologies of induction algorithms

Overfitting
- Adding components to models that reduce performance or leave it unchanged
Oversearching
- Selecting models with lower performance as the size of search space grows
Attribute selection errors
- Preferring attributes with many possible values despite lower performance

Jensen and Cohen (2000)
