Lecture Slides on Support Vector Machines - Data Mining | STAT 59800, Study notes of Data Analysis & Statistical Methods

Material Type: Notes; Professor: Neville; Class: Explore Stat Sci Research; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Spring 2009;

Data Mining
CS57300 / STAT 59800-024
Purdue University
February 19, 2009

Support vector machines

  • Discriminative model
  • General idea:
    • Find best boundary points (support vectors) and build classifier on top of them
  • Linear and non-linear SVMs

Choosing hyperplanes

Source: Introduction to Data Mining, Tan, Steinbach, and Kumar

Constrained optimization

Eq 1: $x^{(j)} \cdot w + b \geq +1$ for $y^{(j)} = +1$
Eq 2: $x^{(j)} \cdot w + b \leq -1$ for $y^{(j)} = -1$
Eq 3: $y^{(j)} (x^{(j)} \cdot w + b) - 1 \geq 0$ for all $j$
H1: $x^{(j)} \cdot w + b = +1$
H2: $x^{(j)} \cdot w + b = -1$
$d_+ = d_- = \frac{1}{\|w\|}$
margin = $\frac{2}{\|w\|}$

  • Can maximize the margin by minimizing $\|w\|^2$ subject to the constraints
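As a minimal sketch of the constraints and margin above, the following pure-Python snippet checks Eq 3 for every training point and computes the margin 2/||w||. The toy points, the weight vector `w`, and the bias `b` are illustrative assumptions, not values from the lecture.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def satisfies_constraints(X, y, w, b):
    """Check Eq 3, y(j) * (x(j) . w + b) - 1 >= 0, for every training point."""
    return all(label * (dot(x, w) + b) - 1 >= 0 for x, label in zip(X, y))

def margin(w):
    """Distance between the hyperplanes H1 and H2: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical toy data: two positive and two negative points.
X = [(1.0, 1.0), (2.0, 3.0), (-1.0, -1.0), (-3.0, -2.0)]
y = [+1, +1, -1, -1]
w, b = (0.5, 0.5), 0.0  # assumed separating hyperplane, chosen by hand

print(satisfies_constraints(X, y, w, b))
print(margin(w))
```

Shrinking `w` widens the margin, but only as long as every point still satisfies Eq 3; for example, `w = (0.1, 0.1)` violates the constraint for the point `(1.0, 1.0)`.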

Constrained optimization

  • Can maximize margin by minimizing ||w|| subject to constraints on Eq
  • Introduce Lagrange multipliers and minimize

Lagrange function

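  • Minimize $L_P$ with respect to $w$, $b$, and $\alpha_i \geq 0$
  • Convex programming problem
  • Dual problem: maximize $L_D$ wrt constraints on $w$, $b$, and $\alpha_i \geq 0$

$L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{I} \alpha_i y^{(i)} [x^{(i)} \cdot w + b] + \sum_{i=1}^{I} \alpha_i$

$L_D = \sum_{i=1}^{I} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} [x^{(i)} \cdot x^{(j)}]$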
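The step from the primal Lagrangian to the dual is not spelled out on the slide. As a sketch, it follows from setting the derivatives of the primal with respect to $w$ and $b$ to zero and substituting the results back in:

```latex
\begin{align*}
\frac{\partial L_P}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i} \alpha_i y^{(i)} x^{(i)} \\
\frac{\partial L_P}{\partial b} = 0 &\;\Rightarrow\; \sum_{i} \alpha_i y^{(i)} = 0 \\
L_D &= \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \left[ x^{(i)} \cdot x^{(j)} \right]
\end{align*}
```

The first condition expresses $w$ as a weighted sum of training points, and the second removes the term in $b$, which is why the dual contains only the multipliers and inner products.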

Dual problem

  • For a convex problem (no local minima) the dual problem is equivalent to the primal problem (i.e. we can switch between them)
  • Dual depends on inner product between feature vectors
  • → simpler quadratic programming problem
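To make the primal/dual equivalence concrete, here is a minimal sketch on a hypothetical two-point dataset. It assumes the standard stationarity conditions (w is the sum of alpha_i y_i x_i, and the alpha_i y_i sum to zero), and it replaces a real quadratic programming solver with a coarse grid search, which only works because this toy problem has a single free multiplier.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """L_D = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    pairwise = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * pairwise

# Hypothetical two-point problem: one positive and one negative example.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]

# With two points, sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a,
# so a grid search over a >= 0 is enough for this sketch.
grid = [k / 1000 for k in range(0, 1001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], X, y))

# Recover the primal solution: w = sum_i alpha_i y_i x_i.
alpha = [best_a, best_a]
w = tuple(sum(alpha[i] * y[i] * X[i][d] for i in range(2)) for d in range(2))
b = y[0] - dot(w, X[0])  # first point is a support vector: y (w . x + b) = 1

primal_value = 0.5 * dot(w, w)
dual_value = dual_objective(alpha, X, y)
print(best_a, w, b, primal_value, dual_value)
```

The maximized dual value and the primal objective ||w||^2 / 2 coincide here, and only inner products between training points were needed to find the solution.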

Kernel trick

  • Note that the dual problem only depends on $\phi_i^T \phi_j$
  • Move to an infinite number of features by replacing $\phi$ with a kernel: $\phi_i^T \phi_j \rightarrow K(i,j)$
  • Here kernel $K$ is a function that returns the value of the dot product between the feature vectors of its two arguments
  • As long as the kernel is symmetric and positive semi-definite you can forget about the features
    • Example: Polynomial kernel $K(i,j) = [r + x^{(i)T} x^{(j)}]^d$
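As an illustration of the kernel trick, the sketch below checks that a polynomial kernel with d = 2 on 2-D inputs returns the same value as a dot product in an explicit six-dimensional feature space. The feature map `explicit_features` is written out by hand for this special case and the input points are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z, r=1.0, d=2):
    """Polynomial kernel K(x, z) = (r + x . z)^d."""
    return (r + dot(x, z)) ** d

def explicit_features(x, r=1.0):
    """Explicit feature map matching the kernel for d = 2 and 2-D inputs."""
    x1, x2 = x
    return (
        r,
        math.sqrt(2 * r) * x1,
        math.sqrt(2 * r) * x2,
        x1 * x1,
        x2 * x2,
        math.sqrt(2) * x1 * x2,
    )

# Hypothetical input points.
x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly_kernel(x, z)
k_features = dot(explicit_features(x), explicit_features(z))
print(k_direct, k_features)
```

The kernel needs one 2-D dot product and a power, while the explicit route builds six features per point; the gap grows quickly with the input dimension and the degree d.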

Kernel SVMs

  • State-of-the-art classifier (with good kernel)
  • Solves computational problem of working in high dimensional space
  • Non-parametric classifier (keeps all data around in kernel)
  • Learning: O(n^2) (approximations available for O(n))
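Because a kernel SVM keeps training data around, a prediction sums kernel evaluations over the stored points. The sketch below assumes the standard decision function f(x) = sum_i alpha_i y_i K(x_i, x) + b, which these slides do not show; the stored points, multipliers, and bias are hypothetical values chosen for illustration.

```python
def linear_kernel(x, z):
    """Plain dot product, the simplest valid kernel."""
    return sum(a * b for a, b in zip(x, z))

def predict(x_new, support_X, support_y, alpha, b, kernel=linear_kernel):
    """Sign of f(x) = sum_i alpha_i y_i K(x_i, x) + b over the stored points."""
    score = sum(
        a * label * kernel(x_i, x_new)
        for a, label, x_i in zip(alpha, support_y, support_X)
    )
    score += b
    return +1 if score >= 0 else -1

# Hypothetical stored training points with assumed multipliers and bias.
support_X = [(1.0, 1.0), (-1.0, -1.0)]
support_y = [+1, -1]
alpha = [0.25, 0.25]
b = 0.0

print(predict((2.0, 0.0), support_X, support_y, alpha, b))
print(predict((-0.5, -3.0), support_X, support_y, alpha, b))
```

Swapping `linear_kernel` for another symmetric positive semi-definite function changes the decision boundary without touching `predict`, though the multipliers would then have to be learned with that kernel.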

Predictive modeling: evaluation

Empirical evaluation

  • Given observed accuracy of a model on limited data, how well does this estimate generalize for additional examples?
  • Given that one model outperforms another on some sample of data, how likely is it that this model is more accurate in general?
  • When data are limited, what is the best way to use the data to both learn and evaluate a model?

Approach

  • Use k-fold cross-validation to get k estimates of error for models $M_A$ and $M_B$
  • The mean of the k estimates is an estimate of the expected error
  • Use a paired t-test to assess whether the two error distributions are statistically different from each other
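The comparison above can be sketched with the standard library alone. The per-fold error rates below are hypothetical, and the function returns only the t statistic and its degrees of freedom; turning that into a p-value needs a t-distribution table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for per-fold errors of two models on the same folds."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both models need one error estimate per fold")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (k - 1)
    t = mean_diff / (sd_diff / math.sqrt(k))
    return t, k - 1  # statistic and degrees of freedom

# Hypothetical per-fold error rates from 5-fold cross-validation.
errors_a = [0.10, 0.12, 0.08, 0.11, 0.09]
errors_b = [0.14, 0.15, 0.11, 0.13, 0.12]

t, dof = paired_t_statistic(errors_a, errors_b)
print(statistics.mean(errors_a), statistics.mean(errors_b), t, dof)
```

The test is paired because both models are evaluated on the same folds, so the fold-to-fold variation cancels in the differences. With 4 degrees of freedom, the two-sided 5% critical value is 2.776.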

Issues

  • Assumes errors for all individuals are equally weighted
    • May want to weight recent instances more heavily
    • May want to include information about reliability of sets of measurements
  • Assumes errors for all contexts are equally weighted
    • May want to weight false positives and false negatives differently
    • May be costs associated with acting on particular classifications (e.g., marketing to individuals)

Examples

  • Loan decisions
    • Cost of lending to a defaulter is much greater than the lost business of refusing loan to a non-defaulter
  • Oil-slick detection
    • Cost of failing to detect an environmental disaster is far greater than the cost of a false alarm
  • Promotional mailing
    • Cost of sending junk mail to a household that doesn’t respond is far less than the lost business of not sending it to a household that would have responded

Cost-sensitive models

  • Define a score function based on a cost matrix
  • If ~y is the predicted class and y is the true class, then need to define a matrix of costs C(~y,y)
  • Reflects the severity of classifying an instance with true class y to class ~y

Simple measures on tables

                 Actual +    Actual -
  Predicted +    TP          FP
  Predicted -    FN          TN

  • True positive rate (TPR) = TP/(TP+FN)
  • False positive rate (FPR) = FP/(FP+TN)
  • Recall = TP/(TP+FN) = TPR
  • Precision = TP/(TP+FP)
  • Specificity = TN/(FP+TN)
  • Sensitivity = TPR
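A minimal sketch of these table measures, together with a cost-matrix score, follows. The counts and the cost matrix are hypothetical; the cost matrix is keyed by (predicted class, true class) to mirror C(~y, y).

```python
def table_measures(tp, fp, fn, tn):
    """Simple measures computed from the four cells of a confusion table."""
    return {
        "tpr": tp / (tp + fn),          # also recall and sensitivity
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),
    }

def total_cost(counts, cost):
    """Sum of count * C(predicted, actual) over all cells of the table."""
    return sum(counts[cell] * cost[cell] for cell in counts)

# Hypothetical counts, keyed by (predicted, actual).
counts = {("+", "+"): 40, ("+", "-"): 10, ("-", "+"): 20, ("-", "-"): 130}
# Hypothetical cost matrix: a false negative costs 5x a false positive.
cost = {("+", "+"): 0, ("+", "-"): 1, ("-", "+"): 5, ("-", "-"): 0}

m = table_measures(tp=40, fp=10, fn=20, tn=130)
print(m)
print(total_cost(counts, cost))
```

Two classifiers with the same accuracy can have very different total cost once false positives and false negatives are priced differently, which is the point of scoring with a cost matrix.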

Bias-variance analysis

Conventional bias/variance framework

[Figure: labels Training Set, Samples, Models (M1, M2, M3), Test Set, Model predictions]
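One way to read this framework: models learned from different samples each predict the same test point, and the spread and offset of those predictions give variance and bias. The sketch below assumes squared error and a noise-free target, and the three predictions are hypothetical.

```python
import statistics

def bias_variance(predictions, target):
    """Split the mean squared error of several models' predictions at one test point."""
    mean_pred = statistics.mean(predictions)
    bias_sq = (mean_pred - target) ** 2            # offset of the average prediction
    variance = statistics.pvariance(predictions)   # spread across models
    mse = statistics.mean([(p - target) ** 2 for p in predictions])
    return bias_sq, variance, mse

# Hypothetical predictions from models M1, M2, M3 for one test instance.
predictions = [2.0, 3.0, 4.0]
target = 2.5  # assumed noise-free true value

bias_sq, variance, mse = bias_variance(predictions, target)
print(bias_sq, variance, mse)
```

Under these assumptions the mean squared error equals squared bias plus variance; with a noisy target an irreducible noise term would be added.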

Findings

  • Bias
    • Often related to size of model space
    • More complex models tend to have lower bias
  • Variance
    • Often related to size of dataset
    • When data is large enough to estimate parameters well then models have lower variance
  • Simple models can perform surprisingly well due to lower variance

Bias/variance tradeoff

[Figure: expected MSE plotted against size of parameter space, with regions labeled "Low bias, High variance" and "High bias, Low variance"]

Next class

  • Topic:
    • Bagging and boosting
    • Pathologies of learning algorithms