Introduction to Data Mining-Data Mining-Lecture 01 Slides-Computer Science, Slides of Data Mining

This course covers high-correlation mining, min-hashing, locality-sensitive hashing, mining data streams, and clustering for large-scale data.

Keywords: Data Mining, Anand Rajaraman, Jeff Ullman, Data Cleansing, Visualization, Warehousing, Decision Trees, Clusters, Bayes, Hidden-Markov, Applications, Cultures, Models vs. Analytic Processing, Rhine Paradox

Typology: Slides

2011/2012

Uploaded on 01/31/2012

CS345 --- Data Mining
Introductions
What Is It?
Cultures of Data Mining

Course Staff

Instructors: Anand Rajaraman, Jeff Ullman

TA: Robbie Yan

Project

Software implementation related to course subject matter.

Should involve an original component or experiment.

We will provide some databases to mine; others are OK.

Team Projects

Working in pairs OK, but …

We will expect more from a pair than from an individual.

The effort should be roughly evenly distributed.

Typical Kinds of Patterns

Decision trees: succinct ways to classify by testing properties.

Clusters: another succinct classification by similarity of properties.

Bayes, hidden-Markov, and other statistical models, frequent-itemsets: expose important associations within data.

Example: Clusters

[Figure: scatter plot of points, each drawn as an x, falling into visibly separate groups.]
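To make "classification by similarity" concrete, here is a minimal sketch that groups 2-D points around centroids. The slides do not name a clustering algorithm; k-means (Lloyd's iteration) is used here only as a familiar stand-in, and the sample points and starting centroids are made up for illustration.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    moved = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)
    return moved


def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        moved = update(points, assign(points, centroids), centroids)
        if moved == centroids:
            break
        centroids = moved
    return assign(points, centroids), centroids


# Hypothetical data: two small, well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

On data this cleanly separated the labels simply recover the two groups; real data rarely cooperates so well, and the result depends on the starting centroids.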

Applications (Among Many)

Intelligence-gathering: Total Information Awareness.

Web analysis: PageRank.

Marketing: run a sale on diapers; raise the price of beer.

Cultures

Databases: concentrate on large-scale (non-main-memory) data.

AI (machine-learning): concentrate on complex methods, small data.

Statistics: concentrate on inferring models.

(Way too Simple) Example

Given a billion numbers, a DB person might compute their average.

A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
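The contrast can be sketched in a few lines. This is only an illustration under assumptions: "best Gaussian" is taken to mean the maximum-likelihood fit (sample mean, and standard deviation with divisor n), and the eight sample values stand in for the billion numbers.

```python
import math


def average(values):
    """DB-style answer: one aggregate computed in a single pass."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


def fit_gaussian(values):
    """Statistics-style answer: parameters of a fitted model.

    Assumes the maximum-likelihood fit, so the variance uses divisor n.
    """
    values = list(values)
    n = len(values)
    mu = sum(values) / n
    variance = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(variance)


# Hypothetical stand-in for the billion numbers.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(average(data))       # 5.0
print(fit_gaussian(data))  # (5.0, 2.0)
```

Both answers contain the same mean; the difference is that the second one is a claim about a model of the data, not just a summary of it.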

Meaningfulness of Answers

A big risk when data mining is that you will “discover” patterns that are meaningless.

Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Rhine Paradox --- (1)

Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.

He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.

He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
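The "almost 1 in 1000" figure is what pure guessing would produce. A small worked calculation, assuming each guess is an independent 50/50 choice between red and blue; the subject count of 5000 is a hypothetical number, not one given in the slides.

```python
from fractions import Fraction


def prob_all_correct(num_cards=10, num_colors=2):
    """Chance that independent random guesses get every card right."""
    return Fraction(1, num_colors) ** num_cards


def expected_perfect_scores(num_subjects, num_cards=10):
    """Expected number of subjects who score perfectly by luck alone."""
    return num_subjects * prob_all_correct(num_cards)


print(prob_all_correct())                    # 1/1024
print(float(expected_perfect_scores(5000)))  # 4.8828125
```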

Rhine Paradox --- (2)

He told these people they had ESP and called them in for another test of the same type.

Alas, he discovered that almost all of them had lost their ESP.

What did he conclude?



Answer on next slide.
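The two-round experiment can be simulated with subjects who only guess. This is a sketch under assumptions: every guess is an independent fair coin flip, and the number of subjects and the random seed are arbitrary choices made here, not values from the slides.

```python
import random


def guesses_all_correct(rng, num_cards=10):
    """One subject guessing red/blue at random; True if all cards match."""
    return all(rng.random() < 0.5 for _ in range(num_cards))


def retest(num_subjects, seed=0):
    """Run the test, then re-run it only on those who scored perfectly."""
    rng = random.Random(seed)
    first_round = [s for s in range(num_subjects) if guesses_all_correct(rng)]
    second_round = [s for s in first_round if guesses_all_correct(rng)]
    return len(first_round), len(second_round)


passed_once, passed_twice = retest(num_subjects=20000, seed=1)
print(passed_once, passed_twice)
```

With random guessers, a handful typically pass the first round, and each of them again has only a 1-in-1024 chance of passing the second.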

A Concrete Example

This example illustrates a problem with intelligence-gathering.

Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.

We want to find people who at least twice have stayed at the same hotel on the same day.

The Details

10^9 people being tracked.

1000 days.

Each person stays in a hotel 1% of the time (10 days out of 1000).

Hotels hold 100 people (so 10^5 hotels).

If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
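The question can be answered by direct arithmetic. The sketch below assumes 10^9 tracked people (the count consistent with 10^5 hotels of 100 guests at 1% occupancy), independent and uniformly random hotel choices, and that the two meetings need not be at the same hotel. It also ignores pairs that coincide more than twice, so treat the result as an approximation.

```python
from math import comb

NUM_PEOPLE = 10**9           # assumed; consistent with 10^5 hotels
NUM_DAYS = 1000
HOTEL_CAPACITY = 100
HOTEL_DAYS_PER_PERSON = 10   # 10 days out of 1000, i.e. 1%


def hotels_needed():
    """Hotels required to house everyone who is in a hotel on a given day."""
    guests_per_day = NUM_PEOPLE * HOTEL_DAYS_PER_PERSON // NUM_DAYS
    return guests_per_day // HOTEL_CAPACITY


def expected_suspicious_pairs():
    """Expected number of (pair of people, pair of days) coincidences."""
    p_in_hotel = HOTEL_DAYS_PER_PERSON / NUM_DAYS
    # Both people are in hotels that day, and they picked the same hotel.
    p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels_needed()
    # The same coincidence on two given days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    return comb(NUM_PEOPLE, 2) * comb(NUM_DAYS, 2) * p_same_hotel_two_days


print(hotels_needed())
print(round(expected_suspicious_pairs()))
```

Under these assumptions the expected count comes out at about a quarter of a million pairs of people who look suspicious even though nobody is plotting anything.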