















This course covers high-correlation mining, min-hashing, locality-sensitive hashing, mining data streams, and clustering for large-scale data.
Keywords: Data Mining, Anand Rajaraman, Jeff Ullman, Data Cleansing, Visualization, Warehousing, Decision Trees, Clusters, Bayes, Hidden-Markov, Applications, Cultures, Models vs. Analytic Processing, Rhine Paradox
Typology: Slides
















Introductions
What Is It?
Cultures of Data Mining
Instructors:
Anand Rajaraman
Jeff Ullman
Robbie Yan
Software implementation related to course subject matter.
Should involve an original component or experiment.
We will provide some databases to mine; others are OK.
Working in pairs OK, but …
We will expect more from a pair than from an individual.
The effort should be roughly evenly distributed.
Decision trees: succinct ways to classify by testing properties.
Clusters: another succinct classification by similarity of properties.
Bayes, hidden-Markov, and other statistical models, frequent-itemsets: expose important associations within data.
Intelligence-gathering: Total Information Awareness.
Web analysis: PageRank.
Marketing: run a sale on diapers; raise the price of beer.
Databases: concentrate on large-scale (non-main-memory) data.
AI (machine-learning): concentrate on complex methods, small data.
Statistics: concentrate on inferring models.
Given a billion numbers, a DB person might compute their average.
A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation.
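The contrast between the two viewpoints can be sketched in a few lines. This is only an illustration on a tiny made-up sample (the `data` list is an assumption, not course data), and it takes "best Gaussian" to mean the maximum-likelihood fit, whose parameters are the sample mean and the population standard deviation.

```python
import statistics

# Hypothetical toy sample standing in for the "billion numbers".
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Database view: a single aggregate over the data.
average = sum(data) / len(data)

# Statistics view: fit a model. For a Gaussian, the maximum-likelihood
# estimates are the mean and the population standard deviation.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data, mu)

print(f"average = {average}")
print(f"fitted Gaussian: mean = {mu}, standard deviation = {sigma}")
```

Both views compute the same mean here; the difference is that the statistician treats it as a parameter of a model of the data rather than as a query result.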
A big risk when data mining is that you will “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.
Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
He told these people they had ESP and called them in for another test of the same type.
Alas, he discovered that almost all of them had lost their ESP.
What did he conclude?
Answer on next slide.
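The "almost 1 in 1000" figure is what pure guessing would produce. The sketch below computes the chance of getting 10 two-colour cards right by luck, and of doing it again on a retest. The subject count of 1000 is an assumption chosen only to make the expected value easy to read; the slides do not say how many people were tested.

```python
from fractions import Fraction


def p_all_correct(n_cards: int = 10) -> Fraction:
    """Chance of guessing every one of n_cards red/blue cards by luck alone."""
    return Fraction(1, 2) ** n_cards


p_once = p_all_correct()   # one perfect run of 10 guesses
p_twice = p_once ** 2      # a perfect run on the retest as well

subjects = 1000            # hypothetical number of people tested
expected_perfect = subjects * p_once

print(f"P(10 of 10 by chance)    = {p_once} (about {float(p_once):.5f})")
print(f"P(perfect twice in a row) = {p_twice}")
print(f"Expected perfect scorers among {subjects}: {float(expected_perfect):.2f}")
```

With enough subjects, a few perfect scores are expected even if nobody has ESP, and those same people are very unlikely to repeat the feat.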
The following example illustrates a problem with intelligence-gathering.
Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
We want to find people who at least twice have stayed at the same hotel on the same day.
10^9 people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (10 days out of 1000).
Hotels hold 100 people (so 10^5 hotels).
If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
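A back-of-the-envelope estimate of how many "suspicious" pairs random behaviour alone would produce can be sketched as follows. It is an approximation under stated assumptions, not a definitive analysis: people choose days and hotels independently and uniformly, and the expected count is taken as (pairs of people) × (pairs of days) × (probability that a given pair shares a hotel on both of those days). All variable names are illustrative.

```python
from fractions import Fraction
from math import comb

people = 10**9                 # people being tracked
days = 1000                    # days observed
p_in_hotel = Fraction(1, 100)  # chance a person is in some hotel on a given day
hotel_capacity = 100

# 1% of 10^9 people need a room each day, 100 per hotel -> 10^5 hotels.
hotels = int(people * p_in_hotel / hotel_capacity)

# Two given people are both in a hotel on a given day, and it is the same hotel.
p_same_hotel_one_day = p_in_hotel**2 / hotels

# The same coincidence happens on two given (different) days.
p_same_hotel_two_days = p_same_hotel_one_day**2

expected_suspicious_pairs = (
    comb(people, 2) * comb(days, 2) * p_same_hotel_two_days
)

print(f"hotels = {hotels}")
print(f"P(same hotel, one given day) = {float(p_same_hotel_one_day):.1e}")
print(f"expected suspicious pairs ~ {float(expected_suspicious_pairs):,.0f}")
```

Under these assumptions the estimate comes out at roughly a quarter of a million pairs, all produced by chance, which is the kind of outcome Bonferroni's principle warns about.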