
















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Material Type: Exam; Class: ST: Prog Analy &Mechanization; Subject: Computer Science; University: University of New Mexico; Term: Unknown 1989;
Typology: Exams
1 / 24
This page cannot be seen from the preview
Don't miss anything!

















Sec 4.
Axis orthagonal, hyperrectangular, piecewise- constant models
Categorical labels
Non-metric
Usual to “hold out” a separate set of data for testing; not used to train classifier
A.k.a., test set, holdout set, evaluation set, etc.
E.g.,
is training set accuracy
is test set (or generalization ) accuracy
What if you’re unlucky when you split data into train/test?
E.g., all train data are class A and all test are class B?
No “red” things show up in training data
Best answer: stratification
Try to make sure class (+feature) ratios are same in train/test sets (and same as original data)
Why does this work?
Almost as good: randomization
Shuffle data randomly before split
Why does this work?
Original data [ X ’; Y ’] Random shuffle k -way partition [ X1 ’ Y1 ’] [ X2 ’ Y2 ’] [ Xk ’ Yk ’] ... k train/ test sets k accuracies 53.7% 85.1% 73.2%
Now we know how well our models are performing
But are they really learning?
Maybe any classifier would do as well
E.g., a default classifier (pick the most likely class) or a random classifier
How can we tell if the model is learning anything?
Go back to first definitions
What does it mean to learn something?
Cross validation helps you get better estimate of accuracy for small data
Randomization (shuffling the data) helps guard against poor splits/ordering of the data
Learning curves help assess learning rate/asymptotic accuracy
Still one big missing component: variance
Definition: Variance of a classifier is the fraction of error due to the specific data set it’s trained on
Variance tells you how much you expect your classifier/performance to change when you train it on a new (but similar) data set
E.g., take 5 samplings of a data source; train/test 5 classifiers
Accuracies: 74.2, 90.3, 58.1, 80.6, 90.
Mean accuracy: 78.7%
Std dev of acc: 13.4%
Variance is usually a function of both classifier and data source
High variance classifiers are very susceptible to small changes in data
10 20 30 40 50 60 70 80 90 40 50 60 70 80 90 100 % data size accuracy “hepatitis” data
Decision trees are non-metric
Don’t know anything about relations between instances, except sets induced by feature splits
Often, we have well-defined distances between points
Idea of distance encapsulated by a metric
Examples:
Euclidean distance
a
b
a 1
b 1
2
a d
b d
2
a
b
T
a
b
1 2
d
i=
a i
b i
2
Examples:
Manhattan (taxicab) distance
Distance travelled along a grid between two points
No diagonals allowed
a
b
a 1
b 1
a d
d b
d
i=
a i
b i
Examples:
What if some attribute is categorical?
Typical answer is 0/1 distance :
For each attribute, add 1 if the instances differ in that attribute, else 0
0 / 1
d
i=
a i
b i
Nearest neighbor : find the nearest instance to the query point in feature space, return the class of that instance
Simplest possible distance-based classifier
With more notation:
X ′ ∈Xtrain
′