Instance-Based Learning: Understanding k-Nearest Neighbors and k-Means Clustering

An introduction to instance-based learning, focusing on the k-nearest neighbors (k-NN) algorithm and k-means clustering. Instance-based learning can be an effective approach for dealing with large numeric data sets. k-NN can be used in both supervised and unsupervised settings, while k-means is commonly used for unsupervised clustering. The slides cover the basics of these methods, including how they work, their advantages, and their challenges.

Uploaded on 07/30/2009

Artificial Intelligence
Programming
Instance Based Learning
Chris Brooks
Department of Computer Science
University of San Francisco

Instance-Based Learning

So far, all of the learning algorithms we’ve studied construct an explicit hypothesis about the data set. This is nice because it lets us do a lot of the training ahead of time. It has the weakness that we must then use the same hypothesis for each element in the test set.

One way to get around this is to construct different hypotheses for each test example. This potentially gives better results, but more computation is needed at evaluation time.

We can use this in either a supervised or unsupervised setting.

Department of Computer Science — University of San Francisco – p.1/

Supervised kNN

Training is trivial: store the training set. Assume each individual is an n-dimensional vector, plus a classification.

Testing is more computationally complex: find the k closest points and collect their classifications, then use majority rule to classify the unseen point.
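
The store-then-vote procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference implementation: the function names (euclidean, knn_classify), the use of Euclidean distance, and the sample data are all assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(training_set, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples.

    training_set is a list of (vector, label) pairs; "training" is just storing it.
    """
    # Sort the stored examples by distance to the query and keep the k closest.
    neighbors = sorted(training_set, key=lambda pair: euclidean(pair[0], query))[:k]
    # Majority rule over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical data, invented for illustration only.
training = [((1, 1), "a"), ((2, 1), "a"), ((1, 2), "a"),
            ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b")]
prediction = knn_classify(training, (2, 2), k=3)
```

All the work happens at query time: each classification scans the whole stored training set, which is the evaluation-time cost mentioned earlier.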

kNN Example

Suppose we have the following data points and are using 3-NN:

X1    X2    Class
4     3

We see the following data point: x1 = 3, x2 = 1. How should we classify it?

Discussion

K-NN can be a very effective algorithm when you have lots of data. It is easy to compute and resistant to noise.

Bias: points that are “close” to each other share a classification.

Discussion

Issues:

How to choose the best k? Search using cross-validation.

Distance is computed globally. Recall the data we used for decision tree training: part of the goal was to eliminate irrelevant attributes.

All neighbors get an equal vote.
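
One way to search for k with cross-validation is sketched below. It assumes leave-one-out cross-validation (each point is classified using all the others), which is only one of several possible schemes; the helper names (loo_accuracy, choose_k) and the data set, including its single deliberately mislabeled point, are hypothetical.

```python
import math
from collections import Counter

def knn_classify(training_set, query, k):
    """Majority vote among the k stored examples closest to `query`."""
    nearest = sorted(training_set, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: hold out each point in turn and classify it."""
    correct = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # everything except the held-out point
        if knn_classify(rest, point, k) == label:
            correct += 1
    return correct / len(data)

def choose_k(data, candidates):
    """Return the candidate k with the highest leave-one-out accuracy."""
    return max(candidates, key=lambda k: loo_accuracy(data, k))

# Hypothetical data: two groups plus one noisy "b" point inside the "a" group.
data = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((2, 2), "a"),
        ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b"), ((7, 7), "b"),
        ((1.5, 1.5), "b")]
best_k = choose_k(data, [1, 3, 5])
```

On this data a single nearest neighbor is misled by the noisy point, while a larger k outvotes it. Odd candidate values are used here to avoid tied votes between two classes.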

Attribute Weighting

A more serious problem with kNN is the presence of irrelevant attributes. In many data sets, there are a large number of attributes that are completely unrelated to classification. With such attributes, more data per instance actually lowers classification performance. This is sometimes called the curse of dimensionality.

Attribute Weighting

We can address this problem by assigning a weight to each component of the distance calculation:

$$d(p_1, p_2) = \sqrt{\sum_i \big( w[i]\,(p_1[i] - p_2[i]) \big)^2}$$

where w is a vector of weights.

This has the effect of transforming or stretching the instance space. More useful features have larger weights.
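
A weighted distance of this kind, where each coordinate difference is scaled by its weight before squaring and summing, might look like the sketch below. The function name weighted_distance and the example weights are illustrative assumptions, not values from the slides.

```python
import math

def weighted_distance(p1, p2, w):
    """Euclidean distance with each coordinate difference scaled by its weight."""
    return math.sqrt(sum((wi * (a - b)) ** 2 for wi, a, b in zip(w, p1, p2)))

# Equal weights give ordinary Euclidean distance.
plain = weighted_distance((0, 0), (3, 4), (1, 1))
# A zero weight removes an attribute from the calculation entirely.
first_only = weighted_distance((0, 0), (3, 4), (1, 0))
```

Setting a weight to zero drops an irrelevant attribute altogether, while a weight above one stretches the space along a useful attribute so that differences in it count for more.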

Unsupervised Learning

What if we want to group instances, but we don’t know their classes? We just want “similar” instances to be in the same group.

Examples: clustering documents based on text, grouping users with similar preferences, identifying demographic groups.

K-means Clustering

Let’s suppose we want to group our items into K clusters. For the moment, assume K is given.

Approach 1: Choose K items at random. We will call these the centers. Each center gets its own cluster. For each other item, assign it to the cluster that minimizes the distance between it and the center.

This is called K-means clustering.
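
The first pass described above (pick K random items as centers, then attach every item to its nearest center) can be sketched as follows. The function name initial_clusters, the seed parameter, and the sample items are assumptions for illustration; Euclidean distance is assumed as the distance measure.

```python
import math
import random

def initial_clusters(items, k, seed=0):
    """Pick k random items as centers and assign every item to the nearest one."""
    rng = random.Random(seed)        # seeded only to make the example repeatable
    centers = rng.sample(items, k)   # k distinct items chosen at random
    clusters = {i: [] for i in range(k)}
    for item in items:
        nearest = min(range(k), key=lambda i: math.dist(item, centers[i]))
        clusters[nearest].append(item)
    return centers, clusters

# Hypothetical items, invented for illustration only.
items = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),
         (20.0, 20.0), (21.0, 20.0), (20.0, 23.0)]
centers, clusters = initial_clusters(items, k=2)
```

Each chosen center lands in its own cluster because its distance to itself is zero. The quality of this first grouping depends entirely on which items happened to be picked.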

K-means Clustering

To evaluate this, we measure the sum of all distances between instances and the center of their cluster.

But how do we know that we picked good centers? We don’t. We need to adjust them.
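
The evaluation measure can be written directly as a sum over clusters. In this sketch the function name total_distance, the dictionary layout (cluster index mapped to its member points), and the numbers are all assumptions chosen for the example.

```python
import math

def total_distance(centers, clusters):
    """Sum of distances from every item to the center of its own cluster."""
    return sum(math.dist(item, centers[i])
               for i, members in clusters.items()
               for item in members)

# Hypothetical clustering, invented for illustration only.
centers = [(0, 0), (10, 0)]
clusters = {0: [(0, 0), (3, 4)], 1: [(10, 0), (10, 2)]}
score = total_distance(centers, clusters)
```

A lower total means the items sit closer to their centers, so this value can be used to compare two candidate sets of centers on the same data.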

Tuning the centers

For each cluster, find its mean. This is the point c that minimizes the total squared distance to all points in the cluster.

But what if some points are now in the wrong cluster?
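
A short derivation shows why the mean is the natural choice of center, assuming the quantity being minimized is the sum of squared Euclidean distances. The symbols J, x_i, and n are introduced here for the derivation and do not appear in the slides: x_1, ..., x_n are the points in one cluster and J(c) is the cost of placing the center at c.

```latex
J(c) = \sum_{i=1}^{n} \lVert x_i - c \rVert^2,
\qquad
\nabla_c J(c) = -2 \sum_{i=1}^{n} (x_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Setting the gradient to zero gives the coordinate-wise average of the cluster's points. If plain (unsquared) distances were summed instead, the minimizer would in general be a different point.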

K-means pseudocode

centers = random items
while not done:
    foreach item:
        assign to closest center
    foreach center:
        find mean of its cluster
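
The pseudocode can be turned into a small runnable Python sketch. Several details here are assumptions that the slides leave open: "done" is taken to mean that no item changed cluster, a center whose cluster becomes empty is left where it is, distance is Euclidean, and the names (k_means, mean, max_iters, seed) and sample data are invented for the example.

```python
import math
import random

def mean(points):
    """Coordinate-wise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(items, k, max_iters=100, seed=0):
    """Alternate between assigning items to centers and recomputing centers."""
    rng = random.Random(seed)
    centers = rng.sample(items, k)           # centers = random items
    assignment = None
    for _ in range(max_iters):
        # foreach item: assign to closest center
        new_assignment = [min(range(k), key=lambda i: math.dist(item, centers[i]))
                          for item in items]
        if new_assignment == assignment:     # assumed stopping rule: nothing moved
            break
        assignment = new_assignment
        # foreach center: find mean of its cluster
        for i in range(k):
            members = [item for item, a in zip(items, assignment) if a == i]
            if members:                      # assumption: keep old center if empty
                centers[i] = mean(members)
    return centers, assignment

# Hypothetical, well-separated data invented for illustration only.
items = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),
         (20.0, 20.0), (21.0, 20.0), (20.0, 23.0)]
centers, assignment = k_means(items, k=2)
```

The returned assignment lists a cluster index for each item, in the same order as the input. The max_iters cap is a safety limit for the sketch; on small data like this the loop stops on its own after a few passes.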

Hierarchical Clustering

K-means produces a flat set of clusters. Each document is in exactly one cluster.

What if we want a tree of clusters, with topics and subtopics and relationships between clusters? We can do this using hierarchical clustering.
