Naive Bayes Classifiers - Data Mining - Lecture Slides | STAT 59800, Study notes of Data Analysis & Statistical Methods

Purdue University Data Analysis & Statistical Methods

Prof. Jennifer L. Neville

Material Type: Notes; Professor: Neville; Class: Explore Stat Sci Research; Subject: STAT-Statistics; University: Purdue University - Main Campus; Term: Spring 2009;

Typology: Study notes

Pre 2010

Uploaded on 07/31/2009


Data Mining

CS57300 / STAT 59800-024

Purdue University

February 17, 2009


Naive Bayes classifiers


Example: Home security


Semantics

The full joint distribution is defined as the product of the local conditional distributions:
- $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Parents}(X_i))$
Example:
- $P(j \land m \land a \land \lnot b \land \lnot e) = P(j \mid a)\, P(m \mid a)\, P(a \mid \lnot b, \lnot e)\, P(\lnot b)\, P(\lnot e)$
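
As a rough illustration of the factorization, the sketch below multiplies the five local probabilities from the example. The slides give no CPT values, so every number here is a hypothetical placeholder (textbook-style values for an alarm network), and the variable names assume j, m, a, b, e are binary events of the home security example.

```python
# Hypothetical CPT entries: these numbers are NOT from the slides.
p_b = 0.001               # P(b)
p_e = 0.002               # P(e)
p_a_given_nb_ne = 0.001   # P(a | not b, not e)
p_j_given_a = 0.90        # P(j | a)
p_m_given_a = 0.70        # P(m | a)

# P(j, m, a, not b, not e) = P(j|a) P(m|a) P(a|not b, not e) P(not b) P(not e)
joint = (
    p_j_given_a
    * p_m_given_a
    * p_a_given_nb_ne
    * (1 - p_b)
    * (1 - p_e)
)
print(f"P(j, m, a, not b, not e) = {joint:.8f}")
```

Each factor conditions only on the variable's parents, which is why five small tables are enough to recover any entry of the full joint distribution.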

NBC learning

Estimate prior P(C) and conditional probability distributions P(Xi | C) independently with maximum likelihood estimation (MLE)
P(C=yes) = 9/14
P(I=high | C=yes) = 2/9
P(I=med | C=yes) = 4/9
P(I=low | C=yes) = 3/9
etc.

Classification and prediction (outline, source: CS 590D)
Bayesian Classification
Instance Based Methods
Classification by decision tree induction
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Prediction

Training Dataset (source: CS 590D)

age       income   student   credit_rating   buys_computer
<=30      high     no        fair            no
<=30      high     no        excellent       no
31…40     high     no        fair            yes
>40       medium   no        fair            yes
>40       low      yes       fair            yes
>40       low      yes       excellent       no
31…40     low      yes       excellent       yes
<=30      medium   no        fair            no
<=30      low      yes       fair            yes
>40       medium   yes       fair            yes
<=30      medium   yes       excellent       yes
31…40     medium   no        excellent       yes
31…40     high     yes       fair            yes
>40       medium   no        excellent       no

Learning CPTs from examples
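
A minimal sketch of NBC learning on the training dataset above: the prior and each conditional distribution are estimated independently from relative frequencies. The function names, the tuple encoding of rows, and the reading of the third age bracket as ">40" are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Rows: (age, income, student, credit_rating, buys_computer).
# ">40" is an assumed reading of the third age bracket.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def fit_nbc(rows):
    """MLE estimates: prior P(C) and conditionals P(X_i = v | C)."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(i, row[-1])][value] += 1
    prior = {c: n / len(rows) for c, n in class_counts.items()}
    cond = {
        key: {v: n / class_counts[key[1]] for v, n in counts.items()}
        for key, counts in value_counts.items()
    }
    return prior, cond

def nbc_scores(prior, cond, x):
    """Unnormalized P(C) * prod_i P(x_i | C) for each class."""
    scores = {}
    for c, p in prior.items():
        for i, value in enumerate(x):
            p *= cond[(i, c)].get(value, 0.0)  # unseen value gets probability 0
        scores[c] = p
    return scores

prior, cond = fit_nbc(ROWS)
scores = nbc_scores(prior, cond, ("<=30", "medium", "yes", "fair"))
prediction = max(scores, key=scores.get)
print(prior, scores, prediction)
```

The `.get(value, 0.0)` fallback mirrors plain MLE: an attribute value never seen with a class receives probability zero for that class.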

Counts of X1 by class f(x):

f(x)     X1=Low   X1=Medium   X1=High
False    2        13          0
True     10       13          17

P[X1 = Low | f(x) = True] =

P[f(x) = False] =
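
One way to work out the two quantities above is to read them directly off the count table as relative frequencies. This is a sketch only; the dictionary layout and function names are assumptions.

```python
# Counts of X1 values per class, copied from the table above.
COUNTS = {
    "False": {"Low": 2, "Medium": 13, "High": 0},
    "True": {"Low": 10, "Medium": 13, "High": 17},
}

def mle_conditional(counts, value, label):
    """P[X1 = value | f(x) = label] as a relative frequency within the class."""
    return counts[label][value] / sum(counts[label].values())

def mle_prior(counts, label):
    """P[f(x) = label] as the class share of all examples."""
    total = sum(sum(row.values()) for row in counts.values())
    return sum(counts[label].values()) / total

p_low_given_true = mle_conditional(COUNTS, "Low", "True")  # 10 / 40
p_false = mle_prior(COUNTS, "False")                       # 15 / 55
print(p_low_given_true, p_false)
```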


Zero counts are a problem

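If an attribute value does not occur in the training examples for a class, we assign zero probability to that value
How does that affect the conditional probability P[f(x) | x]?
It equals 0!
Why is this a problem? The classifier multiplies the per-attribute probabilities, so a single zero estimate overrides the evidence from every other attribute
Adjust for zero counts by "smoothing" probability estimates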

Add uniform prior

Smoothing: Laplace correction

Counts of X1 by class f(x):

f(x)     X1=Low   X1=Medium   X1=High
False    2        13          0
True     10       13          17

P[X1 = High | f(x) = False] =
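
A small sketch of the Laplace correction on the same counts, assuming the common add-one form: one pseudo-count is added to every attribute value, which corresponds to a uniform prior over the three values of X1. The function name and the `alpha` parameter are illustrative assumptions.

```python
# Counts of X1 values per class, copied from the table above.
COUNTS = {
    "False": {"Low": 2, "Medium": 13, "High": 0},
    "True": {"Low": 10, "Medium": 13, "High": 17},
}

def laplace_conditional(counts, value, label, alpha=1.0):
    """Smoothed P[X1 = value | f(x) = label] = (n + alpha) / (N + alpha * k)."""
    row = counts[label]
    k = len(row)             # number of distinct attribute values (3 here)
    n_class = sum(row.values())
    return (row[value] + alpha) / (n_class + alpha * k)

p_high_given_false = laplace_conditional(COUNTS, "High", "False")  # (0 + 1) / (15 + 3)
print(p_high_given_false)
```

The smoothed estimates still sum to one over Low, Medium, and High, and no value is left with zero probability.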

NBC learning

Model space?
Search algorithm?
Evaluation function?

Other predictive models

Nearest neighbor

Instance-based method
- Store instances and delay processing until a new instance must be classified
- All points are represented in m-dimensional space
- Nearest neighbors are calculated using Euclidean distance
k-NN returns the most common class value among the k closest training examples
- How to choose k?
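
The k-NN rule described above can be sketched as follows. This is a minimal illustration, assuming numeric feature tuples; the function name, the toy data, and the tie-breaking behavior (whichever label `Counter` lists first) are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
```

All work happens at query time: "training" only stores the examples, which is what makes the method instance-based.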

Nearest neighbor decision boundary

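With k = 1, each training point defines a cell containing the points closer to it than to any other training point. All points in such a cell are labeled by the class of the training point, forming a Voronoi tessellation of the feature space. Source: http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551-Spring2008/slides/cs551_nonbayesian1.pdf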

Neural networks

Analogous to biological systems
Massive parallelism is computationally efficient
First learning algorithm in 1959 (Rosenblatt)
- Perceptron learning rule
- Provide target outputs with inputs for a single neuron
- Incrementally update weights to learn to produce outputs

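Neuron

[Figure (source: CS 590D): a neuron. The input vector x = (x0, x1, ..., xn) and the weight vector w = (w0, w1, ..., wn) are combined into a weighted sum with bias μk, and the activation function f maps the result to the output y.]

For example: $y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$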

Multi-Layer Perceptron


Perceptron

Model:
$f(x) = \sum_{i=1}^{m} w_i x_i + b$
$y = \mathrm{sign}[f(x)]$

Learning:
if $y^{(j)} \left( \sum_{i=1}^{m} w_i x_i^{(j)} + b \right) \le 0$ then $w \leftarrow w + \eta\, y^{(j)} x^{(j)}$
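
The learning rule above can be sketched as a short training loop. The slide only states the weight update; updating the bias as b ← b + ηy, the epoch cap, the zero initialization, and the function names are assumptions added to make the example runnable.

```python
def train_perceptron(X, y, eta=1.0, epochs=100):
    """Mistake-driven perceptron training; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xj, yj in zip(X, y):
            f = sum(wi * xi for wi, xi in zip(w, xj)) + b
            if yj * f <= 0:  # misclassified, or exactly on the boundary
                w = [wi + eta * yj * xi for wi, xi in zip(w, xj)]
                b += eta * yj  # assumption: bias treated as a weight on a constant input 1
                mistakes += 1
        if mistakes == 0:  # every example is on the correct side
            break
    return w, b

def perceptron_predict(w, b, x):
    """y = sign[f(x)], mapping f(x) = 0 to -1 by convention (an assumption)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy linearly separable data (hypothetical).
X = [(2, 2), (3, 1), (2, 3), (-1, -1), (-2, -1), (-1, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(X, y)
print(w, b, [perceptron_predict(w, b, x) for x in X])
```

On data that is not linearly separable the loop never reaches zero mistakes, so the epoch cap is what ends training.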


Perceptron learning

Model space?
Search algorithm?
Evaluation function?

Maximize margin

Source: Introduction to Data Mining, Tan, Steinbach, and Kumar

Linear SVMs

Search for hyperplane with largest margin
Margin = d+ + d-, where d+ is the distance to the closest positive example and d- is the distance to the closest negative example

Constrained optimization

Eq 1: $x^{(j)} \cdot w + b \ge +1$ for $y^{(j)} = +1$
Eq 2: $x^{(j)} \cdot w + b \le -1$ for $y^{(j)} = -1$
Eq 3: $y^{(j)} \left( x^{(j)} \cdot w + b \right) - 1 \ge 0 \quad \forall j$

H1: $x^{(j)} \cdot w + b = +1$
H2: $x^{(j)} \cdot w + b = -1$

$d_+ = d_- = \frac{1}{\|w\|}$

$\text{margin} = d_+ + d_- = \frac{2}{\|w\|}$

Can maximize margin by minimizing $\|w\|^2$ subject to constraints
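
To make the constraint and the margin formula concrete, the sketch below checks Eq 3 for a candidate hyperplane and computes 2/||w||. The data points, the candidate (w, b) pairs, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_eq3(w, b, X, y):
    """Check y(j) (x(j) . w + b) - 1 >= 0 for every training example."""
    return all(yj * (dot(xj, w) + b) - 1 >= 0 for xj, yj in zip(X, y))

def margin(w):
    """Distance between H1 and H2: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

# Toy data (hypothetical): positives on the right, negatives on the left.
X = [(1, 0), (2, 1), (-1, 0), (-2, -1)]
y = [1, 1, -1, -1]

print(satisfies_eq3((1.0, 0.0), 0.0, X, y), margin((1.0, 0.0)))
print(satisfies_eq3((0.5, 0.0), 0.0, X, y), margin((0.5, 0.0)))
```

The second candidate has a smaller norm and therefore a larger nominal margin, but it violates Eq 3, which is why the norm has to be minimized subject to the constraints rather than on its own.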


Constrained optimization

Can maximize margin by minimizing ||w|| subject to constraints on Eq
Introduce Lagrange multipliers and minimize
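
The objective on this slide did not survive the preview. As a hedged reconstruction, the standard textbook form of the problem is given below; it uses $\tfrac{1}{2}\|w\|^2$ (an assumption, with the same minimizer as $\|w\|^2$) and multipliers $\alpha_j \ge 0$, one per training example.

```latex
% Primal Lagrangian: minimize over w and b, with \alpha_j \ge 0
L_P = \frac{1}{2}\|w\|^2
      - \sum_{j} \alpha_j \left[ y^{(j)} \left( x^{(j)} \cdot w + b \right) - 1 \right]

% Stationarity conditions
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{j} \alpha_j\, y^{(j)} x^{(j)},
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{j} \alpha_j\, y^{(j)} = 0

% Dual: maximize over \alpha_j \ge 0 subject to \sum_j \alpha_j y^{(j)} = 0
L_D = \sum_{j} \alpha_j
      - \frac{1}{2} \sum_{j} \sum_{k} \alpha_j \alpha_k\, y^{(j)} y^{(k)}\,
        x^{(j)} \cdot x^{(k)}
```

In the dual, the training examples enter only through the dot products $x^{(j)} \cdot x^{(k)}$.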

Limitations of linear SVMs

Linear classifiers cannot deal with:
- Non-linear concepts
- Noisy data
Solutions:
- Soft margin (e.g., allow mistakes in training data)
- Network of simple linear classifiers (e.g., neural networks)
- Map data into richer feature space (e.g., non-linear features) and then use linear classifier

Map to new features

Define a new set of features where data are linearly separable:

$\Phi(x) = [x_1,\ x_2,\ x_1 x_2,\ x_1^2,\ x_2^2,\ \ldots]$
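
As an illustration of why such a mapping helps, the sketch below uses a hypothetical concept that is not linearly separable in (x1, x2): points inside the unit circle are positive. After mapping to the quadratic features, a single linear rule separates them. The weights and the test points are chosen by hand for this example.

```python
def phi(x):
    """Map (x1, x2) to the feature vector [x1, x2, x1*x2, x1^2, x2^2]."""
    x1, x2 = x
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

def linear_classify(w, b, features):
    """Linear classifier in the new feature space: sign(w . features + b)."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if score > 0 else -1

# Hand-picked linear rule in feature space: 1 - x1^2 - x2^2 > 0 (inside the unit circle).
w = [0.0, 0.0, 0.0, -1.0, -1.0]
b = 1.0

points = [(0.0, 0.0), (0.5, 0.5), (2.0, 0.0), (-1.5, 1.0)]
labels = [linear_classify(w, b, phi(p)) for p in points]
print(labels)
```

The decision boundary is a circle in the original space but a hyperplane in the mapped space, so a linear learner can find it there.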

Kernel trick

Note that the dual problem only depends on $\Phi_i^T \Phi_j$
Move to an infinite number of features by replacing $\Phi$ with a kernel: $\Phi_i^T \Phi_j \rightarrow K(i,j)$
Here kernel K is a function that returns the value of the dot product between the feature-space images of its two arguments
As long as the kernel is symmetric and positive semi-definite you can forget about the features
- Example: Polynomial kernel $K(i,j) = \left[ r + x^{(i)T} x^{(j)} \right]^d$
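
A quick numerical check of the kernel idea for the polynomial kernel with r = 1 and d = 2 on two-dimensional inputs: the kernel value equals the dot product of an explicit six-dimensional feature map. The particular map `phi2`, with its sqrt(2) scalings, is the standard expansion for this special case and is written out here as an illustration, not taken from the slides.

```python
import math

def poly_kernel(x, z, r=1.0, d=2):
    """K(x, z) = [r + x . z]^d, computed without any feature map."""
    return (r + sum(a * b for a, b in zip(x, z))) ** d

def phi2(x):
    """Explicit feature map matching poly_kernel with r = 1, d = 2 on 2-D inputs."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2]

x, z = (1.0, 2.0), (3.0, -1.0)
k_value = poly_kernel(x, z)
explicit = sum(a * b for a, b in zip(phi2(x), phi2(z)))
print(k_value, explicit)
```

The kernel costs one dot product in the original space regardless of d, while the explicit map grows quickly with the degree, which is the computational point of the trick.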

Kernel SVMs

State-of-the-art classifier (with good kernel)
Solves computational problem of working in high dimensional space
Non-parametric classifier (keeps all data around in kernel)
Learning: O(n^2) (approximations available for O(n))