Midterm Review for Data Mining | CS 57300, Exams of Computer Science

Material Type: Exam; Class: Data Mining; Subject: CS-Computer Sciences; University: Purdue University - Main Campus; Term: Spring 2009;


Uploaded on 07/30/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
Mar 3, 2009
Midterm review

Topics

  • Elements of data mining algorithms
  • Data preparation and exploration
  • Statistical foundations
  • Predictive modeling

Readings

  • Chapter 1: Introduction
  • Chapter 2: Measurement and Data
  • Chapter 3: Visualizing and Exploring Data
  • Chapter 4: Data Analysis and Uncertainty
  • Chapter 5: Systematic Overview of Data Mining Algorithms (5.1-5.3.1 only)
  • Chapter 6: Models and Patterns (6.1-6.3 only)
  • Chapter 7: Score Functions
  • Chapter 8: Search and Optimization
  • Chapter 10: Predictive Modeling for Classification

Elements of data mining

  • Task specification
    • Description of the characteristics of the analysis and desired result
  • Data representation
    • Data structure used for representing individual measurements and collections of measurements
  • Knowledge representation
    • Description of the results of the data mining algorithm
    • Defines a set of possible models or patterns

Elements of data mining

  • Learning technique
    • Search: Method for generating possible models/patterns and optimizing their score
    • Evaluation: Associates a numerical score with each possible model/pattern
  • Inference technique
    • Method to apply learned model to data for prediction

Data

Distance measures

  • Distance measures indicate dissimilarity between objects
  • A metric d(i,j) is a dissimilarity measure that satisfies the following properties:
    • d(i,j) ≥ 0 for all i,j and d(i,j) = 0 iff i = j
    • d(i,j) = d(j,i) for all i,j
    • d(i,j) ≤ d(i,k) + d(k,j) for all i,j,k
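
The three properties can be checked numerically on a small set of points. The sketch below is illustrative only: the helper name `is_metric_on` and the sample points are assumptions, and passing a finite check shows only that no violation was found on those points, not that a function is a metric in general. It contrasts Euclidean distance with squared Euclidean distance, which fails the triangle inequality.

```python
import itertools
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Squared Euclidean distance (a dissimilarity, but not a metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_metric_on(points, d, tol=1e-12):
    """Check the three metric properties on every combination of the given points."""
    for i, j in itertools.product(points, repeat=2):
        # Non-negativity, and d(i,j) = 0 exactly when i = j
        if d(i, j) < 0 or (d(i, j) == 0) != (i == j):
            return False
        # Symmetry
        if abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in itertools.product(points, repeat=3):
        # Triangle inequality
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True

# Hypothetical sample points, chosen to lie on a line
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(is_metric_on(points, euclidean))          # True
print(is_metric_on(points, squared_euclidean))  # False: 4 > 1 + 1
```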

Statistics

Topics

  • Bayesian vs. Frequentist view
    • Although interpretation of probability is different, underlying calculus is the same
  • Random variables
  • Expectation, variance
  • Common distributions
    • Bernoulli, Binomial, Multinomial, Normal
  • Independence, conditional independence
  • Populations and samples

Properties of estimators

  • Let θ̂ be an estimate for a population parameter θ
  • For different samples D, we will have different estimates θ̂_D
  • Bias is the difference between the mean of the estimate θ̂ and the true parameter θ
  • Variance is the average squared deviation of θ̂_D from the mean of the sampling distribution for θ̂
  • Bias-variance tradeoff: reducing one often increases the other
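
Bias and variance of an estimator can be approximated by drawing many samples and looking at the spread of the resulting estimates. The following is a minimal simulation sketch, not part of the original slides: the standard normal population, the sample size of 5, the trial count, and the helper name `simulate` are all assumptions. It compares two estimators of the population variance, one dividing by n (`statistics.pvariance`) and one dividing by n - 1 (`statistics.variance`).

```python
import random
import statistics

def simulate(estimator, true_value, n=5, trials=20000, seed=0):
    """Approximate the bias and variance of an estimator over repeated samples D."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Each sample D is n draws from a standard normal population (assumed)
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(estimator(sample))
    mean_est = statistics.fmean(estimates)
    bias = mean_est - true_value
    variance = statistics.fmean((e - mean_est) ** 2 for e in estimates)
    return bias, variance

# The true population variance is 1.0 for a standard normal
bias_mle, var_mle = simulate(statistics.pvariance, 1.0)  # divides by n
bias_unb, var_unb = simulate(statistics.variance, 1.0)   # divides by n - 1
print(f"divide by n:     bias={bias_mle:+.3f}, variance={var_mle:.3f}")
print(f"divide by n - 1: bias={bias_unb:+.3f}, variance={var_unb:.3f}")
```

In this setup the divide-by-n estimator is biased downward (its expectation is (n - 1)/n times the true variance) but has the smaller variance, which is one small instance of the tradeoff.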

Hypothesis testing

  • Formulation
    • Null hypothesis (H₀): Presumed true until rejected
    • Alternative hypothesis (H₁): Rival hypothesis
    • α: critical threshold for rejecting H₀
  • Test
    • Identify a test statistic to assess the truth of H₀
    • Compute the p-value for the test statistic using the sampling distribution under H₀
    • Reject H₀ if p < α
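
As a worked instance of the procedure above, consider testing whether a coin is fair. This is a hedged sketch: the coin example, the one-sided alternative, the threshold of 0.05, and the function name `binomial_p_value` are assumptions for illustration. The test statistic is the number of heads, and its sampling distribution under the null hypothesis is Binomial(n, 0.5).

```python
from math import comb

def binomial_p_value(k, n, p0=0.5):
    """One-sided p-value: P(K >= k) when K ~ Binomial(n, p0) under the null hypothesis."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

alpha = 0.05  # assumed critical threshold

# 9 heads in 10 flips: p = 11/1024, so the null hypothesis is rejected
p_nine = binomial_p_value(k=9, n=10)
print(p_nine, p_nine < alpha)

# 7 heads in 10 flips: p = 176/1024, so the null hypothesis is not rejected
p_seven = binomial_p_value(k=7, n=10)
print(p_seven, p_seven < alpha)
```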

Parameter estimation

  • Infer the value of model parameters from data
  • Maximum likelihood estimation (MLE)
    • Choose the parameter value that maximizes likelihood
    • Take the derivative of the log-likelihood and set it to zero to solve for the parameter; check that the value is a maximum
  • Maximum a posteriori estimation (MAP)
    • Choose the parameter value that maximizes the posterior (proportional to likelihood × prior)
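
To make the difference between the two estimates concrete, here is a small sketch for a Bernoulli parameter. The Beta(a, b) prior is an assumption chosen because it gives a closed-form posterior mode; the function names are hypothetical. For k successes in n trials the log-likelihood is k log θ + (n - k) log(1 - θ), and setting its derivative to zero gives θ = k/n.

```python
def bernoulli_mle(successes, n):
    """MLE for a Bernoulli parameter: the root of the log-likelihood derivative, k/n."""
    return successes / n

def bernoulli_map(successes, n, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior with a, b >= 1.

    The posterior is Beta(a + k, b + n - k); its mode is (k + a - 1) / (n + a + b - 2).
    """
    return (successes + a - 1) / (n + a + b - 2)

# Three successes in three trials: MLE commits fully, MAP is pulled toward the prior
print(bernoulli_mle(3, 3))                 # 1.0
print(bernoulli_map(3, 3, a=2.0, b=2.0))   # 0.8
print(bernoulli_map(3, 3, a=1.0, b=1.0))   # 1.0 (uniform prior: MAP equals MLE)
```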

Predictive modeling

Predictive models

  • Predictive models predict the value of one variable of interest given known values of other variables
    • Focus on modeling the conditional distribution P(Y | X) or the decision boundary for Y
  • Data representation: training set of (x(i), y(i)) pairs
  • Task: estimate a function y = f(x; θ) that maps observed x values to y values

Discriminative vs. probabilistic

  • Discriminative
    • Model the decision boundary directly
    • Direct mapping from inputs x to class label y
    • May seek a discriminant function f(x; θ) that maximizes a measure of separation between classes
  • Probabilistic
    • Model the underlying probability distributions (posterior class distribution or full joint distribution)
    • Indirect mapping from inputs x to class label y through posterior class distribution p(y|x)

Learning algorithms

  • Search:
    • Set of states (defined by knowledge representation)
    • Set of search operators (actions to move in state space)
    • Search algorithm (input state, choice of actions, stopping criterion)
  • Evaluation function:
    • Internal: associate score with state for use in search
    • External: measure quality of pattern/model
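
The search components listed above can be illustrated with greedy hill climbing, which is one possible search algorithm among many. In this sketch the state space (integers), the operators (step left or right), and the scoring function are toy assumptions standing in for models and their internal evaluation scores.

```python
def hill_climb(start, neighbors, score, max_steps=1000):
    """Greedy search: move to the best-scoring neighbor until no neighbor improves."""
    state = start
    for _ in range(max_steps):
        # Apply every search operator and keep the best resulting state
        best = max(neighbors(state), key=score, default=state)
        if score(best) <= score(state):
            return state  # stopping criterion: local optimum reached
        state = best
    return state

# Toy problem (assumed): states are integers, the score peaks at 7
def score(x):
    return -(x - 7) ** 2

def neighbors(x):
    return [x - 1, x + 1]

print(hill_climb(0, neighbors, score))  # 7
```

Greedy search of this kind can stop at a local optimum, which is why the choice of operators and stopping criterion matters for larger model spaces.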

Tree learning

  • Top-down recursive divide and conquer algorithm
    • Start with all examples at root
    • Select best attribute/feature
    • Partition examples by selected attribute
    • Recurse and repeat
  • Evaluation functions:
    • Information gain, gini gain, chi-square statistic
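
The "select best attribute" step can be sketched with information gain as the evaluation function. The tiny data set and the attribute names `outlook` and `windy` below are made up for illustration; only the entropy and information gain definitions are standard.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after partitioning on an attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical training examples at the root of the tree
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(information_gain(rows, labels, "outlook"))  # 1.0: the split is pure
print(information_gain(rows, labels, "windy"))    # 0.0: the split is uninformative
print(best_attribute(rows, labels, ["outlook", "windy"]))  # outlook
```

A full tree learner would partition the examples on the chosen attribute and recurse on each partition until a stopping condition is met.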

Naive Bayes classifier

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)} \propto \left[\prod_{i=1}^{m} P(X_i \mid C)\right] P(C)$$

  • Estimate the class prior P(C) and the conditional attribute distributions P(Xi | C) independently with MLE
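
For categorical attributes, the MLE estimates are just relative frequencies. The sketch below assumes categorical attributes stored as tuples and uses made-up weather data; the function names are hypothetical. It applies no smoothing, so an attribute value never seen with a class contributes a probability of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """MLE estimates of P(C) and P(X_i = v | C) from relative frequencies."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: count / n for c, count in class_counts.items()}
    # cond[(i, c)][v] = number of class-c rows whose i-th attribute equals v
    cond = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    likelihood = {
        key: {v: cnt / class_counts[key[1]] for v, cnt in counter.items()}
        for key, counter in cond.items()
    }
    return prior, likelihood

def posterior(prior, likelihood, row):
    """Normalized P(C | X) computed as P(C) times the product of P(X_i | C)."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for i, v in enumerate(row):
            # Unseen attribute value for this class gives 0 (no smoothing)
            score *= likelihood.get((i, c), {}).get(v, 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total > 0 else scores

# Hypothetical training data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "hot"), ("sunny", "mild")]
labels = ["no", "no", "yes", "yes", "yes"]

prior, likelihood = train_naive_bayes(rows, labels)
post = posterior(prior, likelihood, ("sunny", "mild"))
print(post)  # approximately {'no': 0.6, 'yes': 0.4}
```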

Nearest neighbor

  • Store training data and delay processing until a new instance must be classified
  • All points represented in m-dimensional space
  • Nearest neighbors are calculated using Euclidean distance
  • For a new test instance, returns the most common value among k closest training examples
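
A minimal k-nearest-neighbor classifier following the description above might look like this. The training points, labels, and the default k = 3 are illustrative assumptions; `math.dist` computes Euclidean distance and requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return the most common label among the k training points closest to the query."""
    # No training step: all work is deferred until a query arrives
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional training set with two well-separated classes
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_x, train_y, (0.5, 0.5)))  # a
print(knn_predict(train_x, train_y, (5.5, 5.5)))  # b
```

With an even k or more than two classes, votes can tie; this sketch breaks ties by whichever label was counted first, which is an arbitrary choice.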

Ensemble methods

  • Bagging
    • Learn many unstable predictors on data drawn with replacement from original sample
    • Aggregate predictions on test instances
    • Reduces variance of model
  • Boosting
    • Learn many weak predictors on reweighted versions of the original sample
    • Aggregate predictions on test instances
    • Reduces bias of model
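
Bagging can be sketched in a few lines; boosting is not shown here. In this illustration the base learner is a 1-nearest-neighbor classifier, chosen only as a simple example of an unstable predictor, and the data, the number of models, and the function names are assumptions. Each model is trained on a bootstrap sample drawn with replacement, and predictions are aggregated by majority vote.

```python
import math
import random
from collections import Counter

def one_nn(train):
    """Return a 1-nearest-neighbor predictor over (point, label) pairs."""
    def predict(query):
        return min(train, key=lambda pair: math.dist(pair[0], query))[1]
    return predict

def bagging(data, learner, n_models=25, seed=0):
    """Train n_models learners on bootstrap samples and aggregate by majority vote."""
    rng = random.Random(seed)
    # rng.choices samples with replacement, giving one bootstrap sample per model
    models = [learner(rng.choices(data, k=len(data))) for _ in range(n_models)]
    def predict(query):
        votes = Counter(model(query) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Hypothetical training data: (point, label) pairs
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

ensemble = bagging(data, one_nn)
print(ensemble((0.2, 0.2)))
print(ensemble((5.2, 5.2)))
```

An odd number of models avoids tied votes in the two-class case.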

Pathologies

Overfitting

  • Overfitting the training data
    • Given a model space M, a model m ∈ M is overfitting the training data if ∃ m' ∈ M such that m has smaller error than m' on the training data, but m' has smaller error on the entire distribution of instances
  • Approaches for avoiding overfitting
    • Decision trees: pruning
    • Naive Bayes: smoothing

Pathologies of induction algorithms

  • Overfitting
    • Adding components to models that reduce performance or leave it unchanged
  • Oversearching
    • Selecting models with lower performance as the size of search space grows
  • Attribute selection errors
    • Preferring attributes with many possible values despite lower performance

Jensen and Cohen (2000)