Instance-Based Learning: Understanding k-Nearest Neighbors and k-Means Clustering, Exams of Computer Science

University of San Francisco (USF), Computer Science

An introduction to instance-based learning, focusing on the k-nearest neighbors (k-NN) algorithm and k-means clustering. Instance-based learning is an effective approach for dealing with large numeric data sets. The approach can be used in both supervised and unsupervised settings: k-NN is the supervised method covered here, while k-means is commonly used for unsupervised clustering. The slides cover the basics of these methods, including how they work, their advantages, and their challenges.

Typology: Exams

Pre 2010

Uploaded on 07/30/2009

Artificial Intelligence Programming
Instance Based Learning

Chris Brooks
Department of Computer Science
University of San Francisco


Instance-Based Learning

- So far, all of the learning algorithms we’ve studied construct an explicit hypothesis about the data set.
- This is nice because it lets us do a lot of the training ahead of time.
- It has the weakness that we must then use the same hypothesis for each element in the test set.
- One way to get around this is to construct different hypotheses for each test example.
  - Potentially better results, but more computation needed at evaluation time.
- We can use this in either a supervised or unsupervised setting.

Department of Computer Science — University of San Francisco – p.1/

Supervised kNN

- Training is trivial.
  - Store the training set.
- Assume each individual is an n-dimensional vector, plus a classification.
- Testing is more computationally complex:
  - Find the k closest points and collect their classifications.
  - Use majority rule to classify the unseen point.
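The store-then-vote procedure can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: it assumes Euclidean distance, and the names `euclidean`, `knn_classify` and the sample data are hypothetical.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training_set, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `training_set` is a list of (vector, label) pairs; "training" is just
    keeping this list around.
    """
    # Sort the stored examples by distance to the query and keep the k closest.
    nearest = sorted(training_set, key=lambda pair: euclidean(pair[0], query))[:k]
    # Count the labels of those neighbors; the most common label wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data, invented for illustration only.
training = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((2.0, 1.0), "a"),
    ((6.0, 5.0), "b"),
    ((7.0, 7.0), "b"),
    ((6.5, 6.0), "b"),
]
print(knn_classify(training, (2.0, 2.0), k=3))  # prints: a
```

All of the work happens at query time: every call sorts the whole training set, which is the evaluation-time cost mentioned earlier.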

kNN Example

- Suppose we have the following data points and are using 3-NN:

  [Table with columns X1, X2, Class; its rows did not survive extraction, apart from the fragment "4 3".]

- We see the following data point: x1 = 3, x2 = 1. How should we classify it?

Discussion

- K-NN can be a very effective algorithm when you have lots of data.
  - Easy to compute
  - Resistant to noise.
- Bias: points that are “close” to each other share a classification.

Discussion

- Issues:
  - How to choose the best k?
    - Search using cross-validation.
  - Distance is computed globally.
    - Recall the data we used for decision tree training. Part of the goal was to eliminate irrelevant attributes.
  - All neighbors get an equal vote.
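One way to search for k with cross-validation is sketched below. It assumes leave-one-out cross-validation and Euclidean distance, neither of which the slides specify; `loo_accuracy`, `choose_k` and the data set (which contains one deliberately mislabeled point) are hypothetical.

```python
from collections import Counter
import math


def knn_classify(training_set, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(training_set, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: classify each point using all the others."""
    correct = 0
    for i, (vector, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out point i
        if knn_classify(rest, vector, k) == label:
            correct += 1
    return correct / len(data)


def choose_k(data, candidates):
    """Return the candidate k with the highest leave-one-out accuracy.

    Ties go to the smallest k, because max() keeps the first best candidate.
    """
    return max(sorted(candidates), key=lambda k: loo_accuracy(data, k))


# Hypothetical data: two groups, plus one noisy "b" label inside the "a" group.
data = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.8, 1.2), "a"),
    ((1.1, 1.1), "b"),  # noise
    ((5.0, 5.0), "b"),
    ((5.2, 4.8), "b"),
    ((4.8, 5.2), "b"),
]
print(choose_k(data, [1, 3, 5]))  # prints: 3
```

On this toy data k=1 is misled by the noisy point, and k=5 reaches past the small "a" group into the other one, so k=3 scores best.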

Attribute Weighting

- A more serious problem with kNN is the presence of irrelevant attributes.
- In many data sets, there are a large number of attributes that are completely unrelated to classification.
- More data, in the form of extra attributes, can actually lower classification performance.
  - This is sometimes called the curse of dimensionality.

Attribute Weighting

- We can address this problem by assigning a weight to each component of the distance calculation.
  - $d(p_1, p_2) = \sqrt{\sum_i \left( w[i] \, (p_1[i] - p_2[i]) \right)^2}$, where $w$ is a vector of weights.
- This has the effect of transforming or stretching the instance space.
- More useful features have larger weights.
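The weighted distance translates directly into code. The sketch below is illustrative: `weighted_distance` and the sample points are hypothetical, and how the weights are chosen is left open, as in the slides.

```python
import math


def weighted_distance(p1, p2, w):
    """Weighted Euclidean distance: sqrt(sum_i (w[i] * (p1[i] - p2[i]))**2)."""
    return math.sqrt(sum((wi * (a - b)) ** 2 for a, b, wi in zip(p1, p2, w)))


# Hypothetical points: the second attribute is treated as irrelevant noise.
p1 = (1.0, 100.0)
p2 = (4.0, 0.0)

# With equal weights, the irrelevant attribute dominates the distance.
print(weighted_distance(p1, p2, (1.0, 1.0)))
# A weight of 0 removes the attribute from the calculation entirely.
print(weighted_distance(p1, p2, (1.0, 0.0)))  # prints: 3.0
```

Setting a weight to zero is the extreme case of "stretching" the instance space: that axis is collapsed, so the attribute no longer affects which neighbors are closest.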

Unsupervised Learning

- What if we want to group instances, but we don’t know their classes?
- We just want “similar” instances to be in the same group.
- Examples:
  - Clustering documents based on text
  - Grouping users with similar preferences
  - Identifying demographic groups

K-means Clustering

- Let’s suppose we want to group our items into K clusters.
  - For the moment, assume K is given.
- Approach 1:
  - Choose K items at random. We will call these the centers.
  - Each center gets its own cluster.
  - For each other item, assign it to the cluster that minimizes the distance between it and the center.
- This is called K-means clustering.
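This first pass (pick K random items, then assign everything to the nearest one) might look as follows. It is a sketch under the assumption of Euclidean distance; `assign_to_centers`, `initial_clusters` and the sample items are hypothetical names and data.

```python
import math
import random


def assign_to_centers(items, centers):
    """Group items by nearest center; returns one list of items per center."""
    clusters = [[] for _ in centers]
    for item in items:
        # Index of the center closest to this item.
        closest = min(range(len(centers)), key=lambda i: math.dist(item, centers[i]))
        clusters[closest].append(item)
    return clusters


def initial_clusters(items, k, rng=random):
    """Pick k distinct items at random as centers, then assign every item."""
    centers = rng.sample(items, k)
    return centers, assign_to_centers(items, centers)


# Hypothetical items, with the centers fixed by hand so the result is repeatable.
items = [(1.0, 1.0), (1.5, 2.0), (6.0, 5.0), (7.0, 7.0)]
print(assign_to_centers(items, [(1.0, 1.0), (7.0, 7.0)]))
# prints: [[(1.0, 1.0), (1.5, 2.0)], [(6.0, 5.0), (7.0, 7.0)]]
```

Each chosen center is at distance zero from itself, so it lands in its own cluster without special handling.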

K-means Clustering

- To evaluate this, we measure the sum of all distances between instances and the center of their cluster.
- But how do we know that we picked good centers?
  - We don’t. We need to adjust them.
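The evaluation measure is a single sum. The sketch below assumes Euclidean distance; `total_distance` and both sample clusterings are hypothetical.

```python
import math


def total_distance(clusters, centers):
    """Sum of distances from each item to the center of its cluster."""
    return sum(
        math.dist(item, center)
        for cluster, center in zip(clusters, centers)
        for item in cluster
    )


# Two hypothetical clusterings of the same four items; lower is better.
good = total_distance(
    [[(0.0, 0.0), (0.0, 2.0)], [(5.0, 5.0), (5.0, 7.0)]],
    [(0.0, 1.0), (5.0, 6.0)],
)
bad = total_distance(
    [[(0.0, 0.0)], [(0.0, 2.0), (5.0, 5.0), (5.0, 7.0)]],
    [(0.0, 0.0), (5.0, 5.0)],
)
print(good)  # prints: 4.0
print(good < bad)  # prints: True
```

The second clustering scores worse because one of its centers sits on a single item while a distant point is forced into the other cluster.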

Tuning the centers

- For each cluster, find its mean.
  - This is the point c that minimizes the total squared distance to all points in the cluster.
- But what if some points are now in the wrong cluster?
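A short derivation of why the mean is the natural update, assuming the quantity being minimized is the total squared Euclidean distance (the symbols $x_1, \dots, x_m$ for the points in one cluster are introduced here for illustration):

```latex
\[
  J(c) = \sum_{j=1}^{m} \lVert x_j - c \rVert^2,
  \qquad
  \nabla_c J(c) = -2 \sum_{j=1}^{m} (x_j - c) = 0
  \;\Longrightarrow\;
  c = \frac{1}{m} \sum_{j=1}^{m} x_j .
\]
```

Since $J$ is convex in $c$, this stationary point is the minimum. If plain (unsquared) distance were summed instead, the minimizer would generally be a different point, the geometric median.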

K-means pseudocode

    centers = random items
    while not done:
        foreach item:
            assign to closest center
        foreach center:
            find mean of its cluster
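The pseudocode can be turned into runnable Python along these lines. This is one possible reading, with assumptions the slides leave open: "done" is taken to mean that no center moved, distance is Euclidean, an empty cluster keeps its previous center, and `max_iters` is a safety cap. The names `mean`, `k_means` and the sample items are hypothetical.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(items, k, rng=random, max_iters=100):
    """Alternate assignment and mean updates until the centers stop moving."""
    centers = rng.sample(items, k)  # centers = random items
    for _ in range(max_iters):      # while not done
        # foreach item: assign to closest center
        clusters = [[] for _ in range(k)]
        for item in items:
            closest = min(range(k), key=lambda i: math.dist(item, centers[i]))
            clusters[closest].append(item)
        # foreach center: find mean of its cluster
        # (an empty cluster keeps its old center; this is one possible choice)
        new_centers = [
            mean(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # done: no center moved
            break
        centers = new_centers
    return centers, clusters


# Hypothetical items forming two well-separated groups.
items = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0), (6.5, 6.0)]
centers, clusters = k_means(items, 2, rng=random.Random(0))
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Passing a seeded `random.Random` makes the run repeatable. On data less cleanly separated than this, different starting centers can lead to different final clusters.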

Hierarchical Clustering

- K-means produces a flat set of clusters.
- Each document is in exactly one cluster.
- What if we want a tree of clusters?
  - Topics and subtopics.
  - Relationships between clusters.
- We can do this using hierarchical clustering.
