Machine Learning a Quick Introduction - Lecture Slides | CSE 591, Study Guides, Projects, Research of Computer Science

Material Type: Project; Professor: Hakenberg; Class: Introduction to Image Processing and Analysis; Subject: Computer Science and Engineering; University: Arizona State University - Tempe; Term: Fall 2008;

Uploaded on 09/02/2009

CSE 591
Machine learning
-a quick introduction-
Fall 2008
http://www.public.asu.edu/~jhakenbe/591/

Class format

• class project: 3 building blocks

- named entity recognition

• protein, drug, disease, organ/tissue, biol. process, cell. location

- sentence classification

• does a sentence discuss a certain type of relation?

- relation mining

• find partners in relation

• 4 groups (2 students each)

- genetic implications in disease

- gene-drug associations

- cellular locations of proteins

- protein-protein interactions

• 15-minute presentations per group

- dictionary-based NER

- naive Bayes for sentence classification

- four types of relation mining: pairwise classification, pattern-based (POS), pattern-based (parse tree), tree kernel

POS ambiguity

• Are there ambiguities other than NN/VB?

- bank, store, home, call, … might be nouns or verbs

• JJ or NN?

- The Japanese_JJ system for the classification of gastric cancer

- Five haplotypes were identified in the Japanese_NNP population

• JJ or RB?

- rapid_JJ growth_NN

- rapid_RB growing_VBG organisms

• NN or RB?

- 4 patients were returned back home_RB today.

- These deaths took place at home_NN.

• IN or WDT?

- The fact that_IN Marimastat reduced in vitro invasion ...

- Propolis (PP) is a sticky substance that_WDT is collected from plants by honeybees.

• NN, noun; VB, verb; JJ, adjective; RB, adverb; IN, preposition; WDT, which-determiner
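
One way to surface such ambiguities in data is to collect every tag that a word form receives in a tagged corpus. The sketch below does this for the tagged words from the examples above; the `word_TAG` input format and the helper name `ambiguous_words` are assumptions made for illustration, not part of the course material.

```python
from collections import defaultdict

# Tagged tokens taken from the slide examples, written as word_TAG.
TAGGED = [
    "Japanese_JJ", "Japanese_NNP",
    "rapid_JJ", "rapid_RB",
    "home_RB", "home_NN",
    "that_IN", "that_WDT",
    "growth_NN", "growing_VBG",
]

def ambiguous_words(tagged_tokens):
    """Map each word seen with more than one tag to its sorted list of tags."""
    tags_by_word = defaultdict(set)
    for token in tagged_tokens:
        word, tag = token.rsplit("_", 1)  # split at the last underscore
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}

AMBIGUOUS = ambiguous_words(TAGGED)
```

Words that only ever appear with one tag (here `growth` and `growing`) are dropped from the result, so what remains is the list of forms a tagger has to disambiguate from context.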

Machine learning

• make computers “learn”

• learn rules that explain a given data set

• model (a small snippet of) the world

• makes sense when we have massive data sets to analyze, especially for repeated tasks:

- predict the weather (rain or not, humidity, temperature)

- stock market analysis

- credit card fraud detection

- time series analysis

- handwriting recognition

- filter emails (ham or spam)

- sort a text into a category (sport, politics)

Supervised learning

• starts with known data

- given a set of observations

- we know the outcome for each

- training (= learning from labeled examples) on these data points

• predicts the outcome of unknown data

- will it rain at 70 degrees and 30% humidity?

• common form: classification (e.g., yes / no)

We call an observation (or list of observations) with an outcome a “labeled example”.

Temperature  Humidity  Rain
55           10%       no
40           40%       yes
95           20%       no
75           45%       yes

A set of labeled examples is a “training set”.
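
To make the idea of predicting from labeled examples concrete, here is a minimal sketch that stores the four examples from the table and answers the question about 70 degrees and 30% humidity. It uses a 1-nearest-neighbor rule, which is a technique swapped in for illustration (the slides do not name a classifier at this point); the function name `predict_rain` and the unscaled squared distance are likewise assumptions.

```python
# Training set from the table: (temperature, humidity in %, rain label).
TRAINING_SET = [
    (55, 10, "no"),
    (40, 40, "yes"),
    (95, 20, "no"),
    (75, 45, "yes"),
]

def predict_rain(temperature, humidity, training_set=TRAINING_SET):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    def squared_distance(example):
        t, h, _label = example
        return (t - temperature) ** 2 + (h - humidity) ** 2
    nearest = min(training_set, key=squared_distance)
    return nearest[2]

PREDICTION = predict_rain(70, 30)
```

For 70 degrees and 30% humidity the closest labeled example is (75, 45%), so this sketch predicts "yes". With only four training examples, that answer illustrates the mechanics and should not be read as a reliable forecast.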

Unsupervised learning

• starts with observations

• but we don’t know their labels

• typically used to find a structure in a data set

- group texts by similarity ➠ groups of texts that share a similar “topic” ➠ but we don’t know what that “topic” might be

• common form: clustering

We call an observation (or list of observations) without an outcome an “unlabeled example”.
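
Grouping texts by similarity needs a way to compare two unlabeled examples. The sketch below assumes a plain bag-of-words representation (token counts) and the cosine of the angle between two such vectors; the three example texts and the helper names are invented for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Turn a text into a sparse vector: token -> count."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical unlabeled examples, invented for illustration.
TEXTS = [
    "the team won the match",
    "the team lost the match",
    "parliament passed the budget",
]
VECTORS = [bag_of_words(t) for t in TEXTS]
SIM_01 = cosine(VECTORS[0], VECTORS[1])  # two texts on a similar "topic"
SIM_02 = cosine(VECTORS[0], VECTORS[2])  # texts that share only "the"
```

The first two texts score higher with each other than with the third, which is the kind of signal a clustering method uses to place them in the same group without knowing what the shared "topic" is.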

Vector space representation

[Figure: data points in vector space, in two panels labeled “supervised” and “unsupervised”; the legend marks the learned result]

Immediate applications

• supervised learning

- classify new data (blue) using the learned rules

- e.g., by checking on which side of the hyperplane they are

- hypothesis: new data will correspond to the examples on the same side ➠ same label

• unsupervised learning

- similar data are already clustered together

- we could check a few examples per cluster and label them

- then assign the label to all examples in the same cluster
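
The unsupervised workflow above (check a few examples per cluster, then spread their label) can be sketched as follows. The majority vote among the checked examples, the function name `label_clusters`, and the toy clustering are assumptions made for illustration.

```python
from collections import Counter

def label_clusters(cluster_of, checked_labels):
    """Give every example the majority label of the checked examples in its cluster.

    cluster_of:     example id -> cluster id (output of some clustering step)
    checked_labels: example id -> label, for the few examples inspected by hand
    """
    votes = {}
    for example, label in checked_labels.items():
        votes.setdefault(cluster_of[example], Counter())[label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    # Clusters without any checked example stay unlabeled (None).
    return {ex: majority.get(c) for ex, c in cluster_of.items()}

# Hypothetical clustering of six texts into two clusters.
CLUSTER_OF = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 1, "t6": 1}
CHECKED = {"t1": "sport", "t2": "sport", "t4": "politics"}
LABELS = label_clusters(CLUSTER_OF, CHECKED)
```

Only three of the six texts were inspected, yet all six end up with a label. The result is only as good as the clustering: a cluster that mixes topics passes the wrong label to some of its members.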

Support vector machine

(overall idea only)

• supervised ML

• learns a separation of data points ➠ high-dimensional vector space (one dimension per feature) ➱ learns a hyperplane

• iteratively adapts a hyperplane until all or most training examples lie on the correct side

• hyperplane is represented by support vectors ➱ SVM

• classification: project a new vector onto the normal of the hyperplane (signed distance to the hyperplane) ➠ check sign ➠ predict class

• details in three weeks
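
The classification step can be illustrated without any training: given a hyperplane with normal vector w and offset b, the sign of w·x + b tells on which side a new vector x lies. In the sketch below, `W` and `B` are hypothetical values picked by hand so that they separate the four weather examples from the supervised-learning table; a real SVM would learn them from the training set.

```python
def decision_value(x, w, b):
    """Signed score w·x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x, w, b):
    """Predict 'yes' on the positive side of the hyperplane, 'no' otherwise."""
    return "yes" if decision_value(x, w, b) > 0 else "no"

# Hypothetical hyperplane over (temperature, humidity in %), picked by hand so
# that it separates the four weather examples; an SVM would learn w and b.
W = (0.0, 1.0)
B = -30.0

EXAMPLES = [(55, 10), (40, 40), (95, 20), (75, 45)]
PREDICTIONS = [classify(x, W, B) for x in EXAMPLES]
```

Dividing the decision value by the length of w gives the geometric distance of x from the hyperplane; for the class prediction itself only the sign matters.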

k-means clustering

• unsupervised ML

• decide on number of clusters, k

• decide on similarity measure (e.g., cosine coefficient)

1. randomized initialization with k centroids

2. assign remaining points by similarity to these centroids

3. compute actual centroid per cluster

4. re-assign all points to the most similar centroid

5. repeat steps 3 and 4 until the centroids no longer change
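
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses squared Euclidean distance rather than the cosine coefficient mentioned on the slide, picks the initial centroids from the data points, and the names `k_means`, `seed`, and `max_iter` are chosen here for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centroids, assignment list)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: random initial centroids
    assignment = None
    for _ in range(max_iter):
        # steps 2 and 4: assign every point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # step 5: nothing changed, stop
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, assignment

# Hypothetical data: two well-separated groups of points.
POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
CENTROIDS, ASSIGNMENT = k_means(POINTS, k=2)
```

On data this cleanly separated the two groups are recovered regardless of the random start. In general k-means can settle in different local solutions depending on the initialization, which is why several restarts are commonly used.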

Summary

• ML helps in explaining a (small, virtual) world

• world consists of observations, e.g., features and their values

  • a weather observation (temperature, humidity, overcast)
  • a text (tokens and their TF*IDF score)

• sometimes we can make use of previously known labels ➠ supervised vs. unsupervised learning

  • given a particular weather situation, was it raining or not?
  • a text is on sports, or business, or politics (or a combination)

• vector space model used in most techniques (at least implicitly)

• ML learns rules that explain a data set ➠ a model

- hyperplanes, decisions, cluster boundaries, centroids

• we can apply the model to new data

- classification: find the label of a new example given some “old” labeled examples

- clustering: group examples by their similarity
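
The summary mentions texts represented by tokens and their TF*IDF scores. Several TF*IDF variants exist; the sketch below assumes one common form, raw term count times log(N / df), with whitespace tokenization. The function name `tf_idf` and the mini-corpus are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: score} dict per document, with score = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df: number of documents that contain the token at least once
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each token in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical mini-corpus, invented for illustration.
DOCS = ["rain rain today", "sun today", "rain tomorrow"]
VECTORS = tf_idf(DOCS)
```

Each resulting dictionary is a sparse vector in the vector space model: one dimension per token, weighted so that tokens occurring in every document score zero and tokens specific to a document score highest.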

What we’ll do next time

• Machine learning

- sequence learning: Hidden Markov Models, Conditional Random Fields

• Evaluation

- predictions: true positive, false positive, false negative, …

- metrics: precision, recall, f-measure, accuracy

• Named entity recognition

- dictionary-based

- CRF-based
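
As a preview of the evaluation metrics named above, the sketch below computes them from prediction counts using their standard definitions. It assumes the balanced F-measure (F1), includes true negatives because accuracy needs them, and returns 0.0 where a metric would be undefined; that last convention, the function name `evaluate`, and the example counts are choices made here for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from the four prediction counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, invented for illustration.
SCORES = evaluate(tp=8, fp=2, fn=4, tn=6)
```

Precision asks how many of the predicted positives were correct, recall asks how many of the actual positives were found, and F1 is their harmonic mean.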