Data Mining
CS57300 / STAT 59800 - Purdue University, February 19, 2009
Support vector machines
- Discriminative model
- General idea:
- Find best boundary points (support vectors) and build classifier on top of them
- Linear and non-linear SVMs
Choosing hyperplanes
Source: Introduction to Data Mining, Tan, Steinbach, and Kumar
Constrained optimization
Eq 1: $x^{(j)} \cdot w + b \geq +1$ for $y^{(j)} = +1$
Eq 2: $x^{(j)} \cdot w + b \leq -1$ for $y^{(j)} = -1$
Eq 3: $y^{(j)} (x^{(j)} \cdot w + b) - 1 \geq 0 \quad \forall j$
H1: $x^{(j)} \cdot w + b = +1$
H2: $x^{(j)} \cdot w + b = -1$
$d_+ = d_- = \frac{1}{\|w\|}$
- Can maximize margin by minimizing $\|w\|^2$ subject to constraints
margin = $\frac{2}{\|w\|}$
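As a quick numerical check of the constraints and the margin formula, the sketch below evaluates Eq 3 on a small, linearly separable toy set. The data points and the values of `w` and `b` are illustrative assumptions chosen by hand, not values from the lecture.

```python
import math

# Hypothetical toy data: (feature vector, label) with labels in {+1, -1}.
data = [((1.0, 1.0), +1), ((2.0, 2.0), +1),
        ((-1.0, -1.0), -1), ((-2.0, -1.0), -1)]

# Candidate hyperplane, chosen by hand for this toy set (an assumption).
w = (0.5, 0.5)
b = 0.0


def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def constraint_slack(x, y, w, b):
    """Left-hand side of Eq 3: y * (x . w + b) - 1, which must be >= 0."""
    return y * (dot(x, w) + b) - 1.0


slacks = [constraint_slack(x, y, w, b) for x, y in data]
feasible = all(s >= -1e-12 for s in slacks)

# Points whose slack is zero lie exactly on H1 or H2.
on_margin = [x for (x, _), s in zip(data, slacks) if abs(s) < 1e-12]

norm_w = math.sqrt(dot(w, w))
margin = 2.0 / norm_w  # d+ plus d- = 2 / ||w||

print("feasible:", feasible)
print("points on H1/H2:", on_margin)
print("margin:", round(margin, 4))
```

For this toy set the two points on H1 and H2 are (1, 1) and (-1, -1), and the margin equals the distance between them, which is what one would expect when the hyperplane sits midway between the closest points of the two classes.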
Constrained optimization
- Can maximize margin by minimizing ||w|| subject to constraints on Eq
- Introduce Lagrange multipliers and minimize
Lagrange function
- Minimize $L_P$ with respect to $w$, $b$, and $\alpha_i \geq 0$
- Convex programming problem
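- Dual problem: maximize $L_D$ with respect to the constraints on $w$, $b$, and $\alpha_i \geq 0$
$L_P = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{I} \alpha_i y^{(i)} [x^{(i)} \cdot w + b] + \sum_{i=1}^{I} \alpha_i$
$L_D = \sum_{i=1}^{I} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} [x^{(i)} \cdot x^{(j)}]$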
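The step from the primal to the dual Lagrangian is not spelled out on the slide. A short derivation sketch, using the same symbols: set the derivatives of $L_P$ with respect to $w$ and $b$ to zero, then substitute the results back into $L_P$.

```latex
\begin{aligned}
\frac{\partial L_P}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i} \alpha_i y^{(i)} x^{(i)} \\
\frac{\partial L_P}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i} \alpha_i y^{(i)} = 0 \\
L_P &= \frac{1}{2} w \cdot w - w \cdot w - b \sum_{i} \alpha_i y^{(i)} + \sum_{i} \alpha_i \\
    &= \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)}
       \left[ x^{(i)} \cdot x^{(j)} \right] = L_D
\end{aligned}
```

After the substitution, $w$ and $b$ no longer appear, and the training vectors enter only through the inner products $x^{(i)} \cdot x^{(j)}$.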
Dual problem
- For a convex problem (no local minima) the dual problem is equivalent to the primal problem (i.e. we can switch between them)
- Dual depends on inner product between feature vectors
- ⇒ simpler quadratic programming problem
Kernel trick
- Note that the dual problem only depends on $x_i^T x_j$
- Move to an infinite number of features by replacing $x_i^T x_j$ with a kernel: $x_i^T x_j \rightarrow K(i,j)$
- Here kernel K is a function that returns the value of the dot product between the (implicit) feature vectors of its two arguments
- As long as kernel is symmetric and positive semi-definite you can forget about the features
- Example: Polynomial kernel $K(i,j) = [r + x^{(i)T} x^{(j)}]^d$
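To see why the features can be "forgotten", the sketch below compares the polynomial kernel with an explicit feature map. It assumes r = 1, d = 2 and two-dimensional inputs; the function names and the sample vectors are hypothetical.

```python
import math


def poly_kernel(x, z, r=1.0, d=2):
    """Polynomial kernel K(x, z) = (r + x^T z)^d."""
    return (r + sum(xi * zi for xi, zi in zip(x, z))) ** d


def explicit_features(x):
    """Explicit feature map matching r = 1, d = 2 for 2-D inputs (6 features)."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return (1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2)


# Hypothetical input vectors.
a = (1.0, 2.0)
c = (3.0, -1.0)

# Kernel value computed directly in the 2-D input space.
k_value = poly_kernel(a, c)

# Same quantity computed as a dot product in the 6-D feature space.
phi_dot = sum(p * q for p, q in zip(explicit_features(a), explicit_features(c)))

# Gram matrix over a small hypothetical sample; symmetric by construction.
points = [a, c, (0.5, 0.5)]
gram = [[poly_kernel(p, q) for q in points] for p in points]

print(k_value, phi_dot)
```

Both quantities come out as 4 (up to floating-point rounding), but the kernel never builds the six-dimensional vectors. For larger d the explicit map grows quickly while the kernel evaluation stays a single dot product and a power.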
Kernel SVMs
- State-of-the-art classifier (with good kernel)
- Solves computational problem of working in high dimensional space
- Non-parametric classifier (keeps all data around in kernel)
- Learning: O(n^2) (approximations available for O(n))
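The "keeps all data around" point can be made concrete. A trained kernel SVM predicts with the sign of the sum of alpha_i * y(i) * K(x(i), x) over the training points with alpha_i > 0, plus b. The sketch below assumes a two-point toy problem with a linear kernel, for which the multipliers can be worked out by hand; all names and values are hypothetical.

```python
def linear_kernel(x, z):
    """K(x, z) = x^T z, the simplest kernel."""
    return sum(xi * zi for xi, zi in zip(x, z))


def decision_value(x, support, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(alpha * y * kernel(xi, x) for xi, y, alpha in support) + b


def predict(x, support, b, kernel):
    """Predicted class label in {+1, -1}."""
    return 1 if decision_value(x, support, b, kernel) >= 0 else -1


# Hypothetical support vectors as (x_i, y_i, alpha_i).
# For these two points, alpha = 0.25 and b = 0 satisfy the margin
# constraints with equality (worked out by hand, not by a solver).
support = [((1.0, 1.0), +1, 0.25), ((-1.0, -1.0), -1, 0.25)]
b = 0.0

f_sv = decision_value((1.0, 1.0), support, b, linear_kernel)
label_new = predict((2.0, 0.5), support, b, linear_kernel)
print(f_sv, label_new)
```

The stored training points themselves act as the model: prediction cost grows with the number of support vectors, and swapping `linear_kernel` for another kernel changes the decision boundary without changing this code.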
Predictive modeling: evaluation
Empirical evaluation
- Given observed accuracy of a model on limited data, how well does this estimate generalize for additional examples?
- Given that one model outperforms another on some sample of data, how likely is it that this model is more accurate in general?
- When data are limited, what is the best way to use the data to both learn and evaluate a model?
Approach
- Use k-fold cross-validation to get k estimates of error for models $M_A$ and $M_B$
- Mean is estimate of expected error
- Use paired t-test to assess whether the two distributions are statistically different from each other
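A minimal sketch of this approach, assuming the per-fold error rates have already been computed for both models on the same 10 folds. The error values are made up for illustration, and the critical value is the standard two-sided 5% value of Student's t for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical per-fold error rates from 10-fold cross-validation.
# Both models are evaluated on the same folds, so the samples are paired.
errors_a = [0.20, 0.22, 0.19, 0.25, 0.21, 0.23, 0.20, 0.24, 0.22, 0.21]
errors_b = [0.17, 0.21, 0.18, 0.20, 0.19, 0.22, 0.18, 0.21, 0.20, 0.19]


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the per-fold differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    k = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (divides by k - 1)
    return mean_d / (sd_d / math.sqrt(k)), k - 1


# Mean over folds is the estimate of expected error for each model.
mean_a = statistics.mean(errors_a)
mean_b = statistics.mean(errors_b)

t_stat, dof = paired_t_statistic(errors_a, errors_b)

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees of freedom.
T_CRIT_9DF = 2.262
significant = abs(t_stat) > T_CRIT_9DF

print("mean error A:", round(mean_a, 3), "mean error B:", round(mean_b, 3))
print("t =", round(t_stat, 2), "dof =", dof, "significant:", significant)
```

Pairing matters because fold difficulty varies: the test looks at the per-fold differences rather than at the two sets of errors separately. The folds share training data, so the k estimates are not fully independent and the result is best read as approximate.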
Issues
- Assumes errors for all individuals are equally weighted
- May want to weight recent instances more heavily
- May want to include information about reliability of sets of measurements
- Assumes errors for all contexts are equally weighted
- May want to weight false positives and false negatives differently
- There may be costs associated with acting on particular classifications (e.g., marketing to individuals)
Examples
- Loan decisions
- Cost of lending to a defaulter is much greater than the lost business of refusing loan to a non-defaulter
- Oil-slick detection
- Cost of failing to detect an environmental disaster is far greater than the cost of a false alarm
- Promotional mailing
- Cost of sending junk mail to a household that doesn’t respond is far less than the lost business of not sending it to a household that would have responded
Cost-sensitive models
- Define a score function based on a cost matrix
- If ~y is the predicted class and y is the true class, then need to define a matrix of costs C(~y,y)
- Reflects the severity of classifying an instance with true class y to class ~y
Simple measures on tables
                  Actual
                  +     -
Predicted    +    TP    FP
             -    FN    TN
- True positive rate (TPR) = TP/(TP+FN)
- False positive rate (FPR) = FP/(FP+TN)
- Recall = TP/(TP+FN) = TPR
- Precision = TP/(TP+FP)
- Specificity = TN/(FP+TN)
- Sensitivity = TPR
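The measures above, and a score based on a cost matrix C(~y, y), can both be computed from the four cells of the table. In the sketch below the counts and the cost values are hypothetical; the 5-to-1 cost ratio is only an assumption for illustration.

```python
# Hypothetical confusion-matrix counts (not from the lecture).
TP, FP, FN, TN = 40, 10, 20, 130


def rates(tp, fp, fn, tn):
    """Simple measures derived from the 2x2 table."""
    return {
        "TPR": tp / (tp + fn),          # also recall and sensitivity
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "specificity": tn / (fp + tn),  # equals 1 - FPR
    }


# Hypothetical cost matrix keyed by (predicted class, true class).
# A false negative is assumed to cost five times as much as a false positive.
cost = {("+", "+"): 0.0, ("+", "-"): 1.0,
        ("-", "+"): 5.0, ("-", "-"): 0.0}


def total_cost(tp, fp, fn, tn, cost):
    """Sum of cell counts weighted by the matching cost-matrix entries."""
    return (tp * cost[("+", "+")] + fp * cost[("+", "-")]
            + fn * cost[("-", "+")] + tn * cost[("-", "-")])


measures = rates(TP, FP, FN, TN)
cost_total = total_cost(TP, FP, FN, TN, cost)

print(measures)
print("total cost:", cost_total)
```

With unequal costs, two classifiers with the same accuracy can have very different total cost, which is why the score function is defined on the cost matrix rather than on the error rate alone.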
Bias-variance analysis
[Figure: Conventional bias/variance framework: training set samples → models M1, M2, M3 → model predictions on the test set]
Findings
- Bias
- Often related to size of model space
- More complex models tend to have lower bias
- Variance
- Often related to size of dataset
- When data is large enough to estimate parameters well then models have lower variance
- Simple models can perform surprisingly well due to lower variance
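These findings can be checked with a small simulation in the spirit of the framework above: draw many training sets, fit a model to each, and look at the spread of predictions at one test point. The sketch assumes a linear ground truth, Gaussian noise, and a fixed design of 10 inputs; the constants and function names are all illustrative choices.

```python
import random
import statistics

random.seed(0)


def true_f(x):
    """Assumed ground-truth function for the simulation."""
    return 2.0 * x


XS = [0.05 + 0.1 * i for i in range(10)]  # fixed design, n = 10
X0 = 0.9             # test point where predictions are compared
NOISE_SD = 1.0
N_TRAIN_SETS = 2000


def sample_training_set():
    """One noisy training set drawn around the true function."""
    return [(x, true_f(x) + random.gauss(0.0, NOISE_SD)) for x in XS]


def fit_constant(train):
    """Simple model: always predict the mean of the training targets."""
    mean_y = statistics.mean(y for _, y in train)
    return lambda x: mean_y


def fit_linear(train):
    """More complex model: least-squares line through the training data."""
    mx = statistics.mean(x for x, _ in train)
    my = statistics.mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def bias_variance(fit):
    """Estimate squared bias and variance of the prediction at X0."""
    preds = [fit(sample_training_set())(X0) for _ in range(N_TRAIN_SETS)]
    bias_sq = (statistics.mean(preds) - true_f(X0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


bias_const, var_const = bias_variance(fit_constant)
bias_lin, var_lin = bias_variance(fit_linear)

print("constant: bias^2 =", round(bias_const, 3), "variance =", round(var_const, 3))
print("linear:   bias^2 =", round(bias_lin, 3), "variance =", round(var_lin, 3))
```

In this setup the constant model shows the larger squared bias and the linear model the larger variance. Which one has the lower total error depends on the noise level and the amount of data, so the sketch illustrates the two components rather than a general ranking of the models.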
Bias/variance tradeoff
[Figure: expected MSE versus size of parameter space, with regions labeled "low bias, high variance" and "high bias, low variance"]
Next class
- Topic:
- Bagging and boosting
- Pathologies of learning algorithms