Predictive Modeling in Data Mining: Components and Approaches - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

This document from Purdue University, CS57300 / STAT 59800-024, discusses predictive modeling in data mining. It covers the components of data mining: task specification, data representation, knowledge representation, learning technique, and inference technique. It also differentiates between descriptive and predictive modeling and gives classification and regression as examples of predictive modeling approaches. The document also touches on knowledge representation and modeling approaches.

Uploaded on 07/31/2009

Data Mining
CS57300 / STAT 59800-024
Purdue University
February 5, 2009

Predictive modeling: representation

Data mining components

  • Task specification: Prediction
  • Data representation: Homogeneous IID data
  • Knowledge representation
  • Learning technique
  • Inference technique

Descriptive vs. predictive modeling

  • Descriptive models summarize the data
    • Provide insights into the domain
    • Focus on modeling joint distribution P(X)
    • May be used for classification, but not primary goal
  • Predictive models predict the value of one variable of interest given known values of other variables
    • Focus on modeling conditional distribution P(Y | X) or decision boundary for Y
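
The remark that a descriptive model may still be used for classification follows from the definition of conditional probability. As a sketch, assume the modeled variables are split into inputs X and a discrete target Y (a split the P(X) notation above does not make explicit). The conditional distribution used for prediction can then be derived from the joint:

```latex
P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{P(X, Y)}{\sum_{y} P(X, y)}
```

A predictive model skips the joint and estimates the left-hand side, or only the decision boundary, directly.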

Predictive modeling

  • Data representation: training set of (x(i), y(i)) pairs
  • Task: estimate a predictive function f(x; θ) = y
    • Assume that there is a function y = f(x) that maps data instances (x) to class labels (y)
    • Construct a model that approximates the mapping
      • Classification: if y is categorical
      • Regression: if y is real-valued
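
The regression case of the task above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the linear model family, the least-squares fitting criterion, the toy data, and the names `fit_line` and `f` are all assumptions made for the example.

```python
# Hypothetical training set of (x, y) pairs; the values are made up.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def fit_line(pairs):
    """Estimate theta = (intercept, slope) by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def f(x, theta):
    """Predictive function f(x; theta) for the assumed linear model family."""
    intercept, slope = theta
    return intercept + slope * x

theta = fit_line(train)
print(theta)          # (1.0, 2.0)
print(f(4.0, theta))  # 9.0
```

Here the parameters are estimated from the training pairs and then reused to predict y for an unseen x. A classification version would differ only in that y takes categorical values.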

Modeling approaches

Classification

  • In its simplest form, a classification model defines a decision boundary and labels for each side of the boundary
  • Input: x = {x1, x2, ..., xn} is a set of attributes; function f assigns a label y to input x, where y is a discrete variable with a finite number of values

[Figure: axes X1 and X2, with h marked]

Classification output

  • Different classification tasks can require different kinds of output
    • Class labels — Crisp class boundaries only
    • Ranking — Allows for exploration of many potential class boundaries
    • Probabilities — Allows for more refined reasoning about sets of instances
  • Each requires progressively more accurate models (e.g., a poor probability estimator can still produce an accurate ranking)
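
The relationship between the three output types can be sketched as follows. The scores, the instance names, and the helper names `to_ranking` and `to_labels` are hypothetical. The squared scores stand in for a poorly calibrated probability estimator.

```python
# Hypothetical estimated probabilities for five instances (invented values).
probs = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.2, "e": 0.1}

def to_ranking(scores):
    """Order instance ids from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def to_labels(scores, threshold):
    """Crisp class labels: 1 if the score exceeds the threshold, else 0."""
    return {k: int(v > threshold) for k, v in scores.items()}

# A distorted estimator: squaring changes the probability values, but it is
# monotone on [0, 1], so the ordering of the instances is unchanged.
poor_probs = {k: v ** 2 for k, v in probs.items()}

ranking = to_ranking(probs)
print(ranking)                            # ['a', 'b', 'c', 'd', 'e']
print(to_ranking(poor_probs) == ranking)  # True
print(to_labels(probs, 0.5))              # one fixed boundary
print(to_labels(probs, 0.3))              # a different boundary, same ranking
```

A ranking supports many thresholds, each giving a different set of crisp labels. Probabilities give a ranking, but a ranking does not give back calibrated probabilities.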

Knowledge representation

  • Model: high-level global description of dataset
  • Choose model family
    • “All models are wrong, some models are useful” G. Box and N. Draper (1987)
  • Estimate model parameters and possibly model structure from training data

Classification tree

Perceptron

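\[
f(x) =
\begin{cases}
+1 & \text{if } \sum_j w_j x_j > 0 \\
-1 & \text{if } \sum_j w_j x_j \le 0
\end{cases}
\]

(The two output values are written here as +1 and -1.)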
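
A minimal sketch of the perceptron's decision rule, which thresholds the weighted sum Σ_j w_j x_j at zero. Three things here are assumptions for illustration rather than details from the slides: the output labels +1 and -1, the weight values, and the convention of fixing the first input at 1 so that its weight acts as a bias.

```python
def perceptron_predict(w, x):
    """Return +1 if the weighted sum of the inputs is positive, else -1."""
    activation = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if activation > 0 else -1

# Hypothetical weights; x[0] is fixed at 1 so that w[0] plays the role of a bias.
w = [-1.0, 1.0, 1.0]

print(perceptron_predict(w, [1.0, 0.2, 0.3]))  # -1 (weighted sum is negative)
print(perceptron_predict(w, [1.0, 0.9, 0.8]))  # 1  (weighted sum is positive)
print(perceptron_predict(w, [1.0, 0.5, 0.5]))  # -1 (sum is exactly 0, and 0 is not > 0)
```

The weights define a linear decision boundary. With a fixed number of inputs, the number of parameters is fixed as well.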

Naive Bayes classifier

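\[
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}
            = \frac{\prod_i p(x_i \mid y)\, p(y)}{\sum_j p(x \mid y_j)\, p(y_j)}
\]

[Figure: Naive Bayes graphical model with class Y and attributes X1, X2, ..., Xn]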
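
The Naive Bayes posterior can be computed directly from a class prior and per-attribute conditional tables. The sketch below assumes two classes and two binary attributes. Every probability value and the names `prior`, `likelihood`, and `posterior` are invented for illustration.

```python
# Hypothetical class prior p(y).
prior = {"pos": 0.4, "neg": 0.6}

# Hypothetical conditionals p(x_i | y): one table per attribute, keyed by value.
likelihood = {
    "pos": [{1: 0.8, 0: 0.2}, {1: 0.5, 0: 0.5}],
    "neg": [{1: 0.1, 0: 0.9}, {1: 0.5, 0: 0.5}],
}

def posterior(x):
    """Return p(y | x) for each class y, assuming attributes are independent given y."""
    joint = {}
    for y in prior:
        p = prior[y]
        for i, xi in enumerate(x):
            p *= likelihood[y][i][xi]   # product over attributes of p(x_i | y)
        joint[y] = p                    # numerator: prod_i p(x_i | y) * p(y)
    evidence = sum(joint.values())      # denominator: sum_j p(x | y_j) p(y_j)
    return {y: joint[y] / evidence for y in joint}

post = posterior([1, 0])
print(post)
print(max(post, key=post.get))  # pos
```

For x = (1, 0) the numerators are 0.4 · 0.8 · 0.5 = 0.16 for "pos" and 0.6 · 0.1 · 0.5 = 0.03 for "neg". The posterior for "pos" is therefore 0.16 / 0.19, or about 0.84.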

Parametric vs. non-parametric

  • Parametric
    • Particular functional form is assumed (e.g., Binomial)
    • Number of parameters is fixed in advance
    • Examples: Naive Bayes, perceptron
  • Non-parametric
    • Few assumptions are made about the functional form
    • Model structure is determined from data
    • Examples: classification tree, nearest neighbor
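
The contrast can be sketched with two small classifiers on the same data. The nearest-neighbor classifier is one of the non-parametric examples listed above. The class-mean classifier is a stand-in parametric model chosen for brevity; it is an assumption of this sketch and not one of the examples listed above. The data and function names are hypothetical.

```python
# Hypothetical one-dimensional training set of (x, label) pairs.
train = [(1.0, "a"), (2.0, "a"), (6.0, "b"), (12.0, "b")]

def fit_class_means(rows):
    """Parametric: one mean per class, however many training rows there are."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def class_means_predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(means[label] - x))

def nearest_neighbor_predict(rows, x):
    """Non-parametric: the stored training rows themselves are the model."""
    _, label = min(rows, key=lambda pair: abs(pair[0] - x))
    return label

means = fit_class_means(train)
print(means)                                 # {'a': 1.5, 'b': 9.0}
print(class_means_predict(means, 4.5))       # a
print(nearest_neighbor_predict(train, 4.5))  # b
```

The class-mean model keeps two numbers regardless of training set size. The nearest-neighbor model keeps every row, so its structure follows the data, which is why the two disagree at x = 4.5.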