
**Recap**

• We are discussing non-parametric estimation of density functions.

• Here we do not assume any form for the density function.

• The basic idea is to estimate the density by

f̂(x) = k / (n V)

where V is the volume of a small region around x in which k out of the n data samples are found.

PR NPTEL course – p.1/130

• The choice of the size of V is critical for getting good estimates.

• As discussed in the last class, there are two possibilities:

• we can fix V and compute k (Parzen window, or kernel density estimate)

• we can fix k and compute V (k-nearest neighbour density estimate)

• We discuss these in this class.

**Parzen Windows**

• We first consider the Parzen window method.

• Define a function φ : ℜ^d → ℜ by

φ(u) = 1 if |u_i| ≤ 0.5, i = 1, · · · , d
     = 0 otherwise

where u = (u_1, · · · , u_d)^T.

• This defines a unit hypercube in ℜ^d centered at the origin.

• Note that φ(u) = φ(−u).

• φ((u − u_0)/h) is a hypercube of side h centered at u_0.

• Let D = {x_1, · · · , x_n} be the data samples.

• Then, for any x, φ((x − x_i)/h) would be 1 only if x_i falls in a hypercube of side h centered at x.

• Hence the number of data points falling in a hypercube of side h centered at x is

k = ∑_{i=1}^{n} φ((x − x_i)/h)

• A hypercube of side h in ℜ^d has volume h^d.

• Hence we can write our estimated density function as

f̂(x) = (1/n) ∑_{i=1}^{n} (1/h^d) φ((x − x_i)/h)

• This is known as the Parzen window estimate.

• If we store all x_i, we can compute f̂(x) at any x.

• The value of h determines the size of the volume element.
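As a concrete illustration, the hypercube Parzen window estimate can be sketched in a few lines of Python (this sketch and its function names are ours, not part of the course material):

```python
import numpy as np

def phi_hypercube(u):
    """Unit-hypercube window: 1 if every |u_i| <= 0.5, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, data, h):
    """Parzen window estimate f_hat(x) = (1/n) sum_i (1/h^d) phi((x - x_i)/h).

    data has shape (n, d); x has shape (d,)."""
    n, d = data.shape
    u = (x - data) / h              # shape (n, d): one scaled offset per sample
    return phi_hypercube(u).sum() / (n * h ** d)
```

For data drawn uniformly on [0, 1], the estimate near the middle of the interval should be close to the true density value 1.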

• The Parzen window density estimate is a kind of generalization of the histogram idea.

• We are essentially erecting bins where needed and counting.

• We can use other φ functions also in this form of estimate.

• The φ function should satisfy

φ(u) ≥ 0, ∀u, and ∫_{ℜ^d} φ(u) du = 1

• The hypercube window function that we used satisfies these. Let

V = ∫_{ℜ^d} φ(u/h) du = ∫_{ℜ^d} φ((u − u_0)/h) du

• For our hypercube window, V = h^d.

• Then the estimate

f̂(x) = (1/n) ∑_{i=1}^{n} (1/V) φ((x − x_i)/h)

would be a density.

• That is, f̂ would satisfy

f̂(x) ≥ 0, ∀x, and ∫ f̂(x) dx = 1

• We can choose many φ functions that satisfy the earlier conditions.

• Then with an appropriate V we get a density estimate.

• This general method is often called the kernel density estimate.

• For example, we can have

φ(u) = (1/√(2π))^d exp[−||u||²/2]

• This is the d-dimensional Gaussian density.

• For this φ also we get V = h^d.

• The density estimate now is

f̂(x) = (1/n) ∑_{i=1}^{n} (1/(h√(2π)))^d exp[−||x − x_i||²/(2h²)]

• We are essentially taking a Gaussian centered at each data point and representing the unknown density as a mixture of these Gaussians.

• This Gaussian kernel gives a smoother density estimate.
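The Gaussian-kernel version of the estimate can be sketched as follows (again a minimal illustration of ours, not course code):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate with a Gaussian window:
    f_hat(x) = (1/n) sum_i (1/(h*sqrt(2*pi)))^d * exp(-||x - x_i||^2 / (2 h^2)).

    data has shape (n, d); x has shape (d,)."""
    n, d = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)       # ||x - x_i||^2 for each i
    coef = (1.0 / (h * np.sqrt(2.0 * np.pi))) ** d
    return coef * np.exp(-sq_dist / (2.0 * h ** 2)).mean()
```

With a single data point at the origin in one dimension and h = 1, the estimate at x = 0 is just the standard normal density 1/√(2π) ≈ 0.3989.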

• We next look at the convergence of such estimates.

• Let f̂_n denote the density estimate with n samples, and similarly let h_n and V_n denote these quantities when the sample size is n.

• The density estimate is

f̂_n(x) = (1/n) ∑_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

• The question we now ask is: does f̂_n → f ?

• Define

δ_n(x) = (1/V_n) φ(x/h_n)

• Both the amplitude and the width of δ_n are affected by h_n.

• We assume that as n → ∞, we have h_n → 0 and δ_n tends to a delta function.

• This is true for both the φ functions we saw.

• By the properties of φ, we have

∫ (1/V_n) φ((x − x_i)/h_n) dx = 1

• Hence we have

∫ δ_n(x − x_i) dx = 1, ∀i, ∀n.

• We can write f̂_n in terms of δ_n as

f̂_n(x) = (1/n) ∑_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n) = (1/n) ∑_{i=1}^{n} δ_n(x − x_i)

• f̂_n(x) is a random variable because it depends on x_i, i = 1, · · · , n.

• The x_i are iid with density f.

• Let f̄_n(x) be the expectation of f̂_n(x). Then

f̄_n(x) = (1/n) ∑_{i=1}^{n} E[(1/V_n) φ((x − x_i)/h_n)]

       = ∫ (1/V_n) φ((x − z)/h_n) f(z) dz

       = ∫ δ_n(x − z) f(z) dz

• Thus,

f̄_n(x) = ∫ δ_n(x − z) f(z) dz

• We know δ_n becomes a delta function as n → ∞.

• Hence, as n → ∞, E[f̂_n(x)] → f(x), ∀x.

• Now let us calculate the variance of f̂_n(x).

• We have

f̂_n(x) = ∑_{i=1}^{n} (1/n) (1/V_n) φ((x − x_i)/h_n)

• Thus it is a sum of n terms, each being a function of one x_i.

• Since the x_i are iid, the variance of f̂_n(x) would be the sum of the variances of these n random variables.

• The mean of f̂_n(x) is given by

f̄_n(x) = ∑_{i=1}^{n} E[(1/n)(1/V_n) φ((x − x_i)/h_n)]

• Hence each expectation inside the sum is (1/n) f̄_n(x).

• Let σ²_n be the variance of f̂_n(x). Then

σ²_n = n Var[(1/n)(1/V_n) φ((x − x_i)/h_n)]

     = n E[(1/(n² V_n²)) φ²((x − x_i)/h_n)] − n (1/n²) f̄_n²(x)

     = (1/(n V_n)) ∫ (1/V_n) φ²((x − z)/h_n) f(z) dz − (1/n) f̄_n²(x)

Thus we have

σ²_n ≤ (1/(n V_n)) ∫ (1/V_n) φ²((x − z)/h_n) f(z) dz

     ≤ (sup(φ)/(n V_n)) ∫ (1/V_n) φ((x − z)/h_n) f(z) dz

     = sup(φ) f̄_n(x) / (n V_n)

where sup(φ) = max_u φ(u).

• Thus we get

σ²_n ≤ sup(φ) f̄_n(x) / (n V_n)

• This implies σ²_n → 0 as n → ∞, provided we also have n V_n → ∞ (that is, the volume V_n should shrink more slowly than 1/n).

• This finally shows that the kernel density estimate is a consistent estimate.
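The consistency conditions can be checked empirically. The sketch below (ours, not from the course) uses the common choice h_n = n^(−1/5), which satisfies both h_n → 0 and n·h_n → ∞ in one dimension, and estimates the standard normal density at 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kde_at(x, data, h):
    # 1-D Gaussian-kernel density estimate at a single point x
    return np.exp(-(x - data) ** 2 / (2.0 * h * h)).mean() / (h * np.sqrt(2.0 * np.pi))

true_f0 = 1.0 / np.sqrt(2.0 * np.pi)   # true N(0,1) density at x = 0

estimates = {}
for n in (100, 10000):
    data = rng.standard_normal(n)      # iid samples from the true density
    h_n = n ** (-0.2)                  # h_n -> 0 while n * h_n -> infinity
    estimates[n] = gaussian_kde_at(0.0, data, h_n)
```

With n = 10000 the estimate lands close to the true value 1/√(2π) ≈ 0.3989.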

• Kernel density estimates are essentially mixture densities.

f̂(x) = (1/n_1) ∑_{i=1}^{n_1} (1/V) φ((x − x_i)/h)

• We store all the data samples.

• We compute the density whenever needed.

• The kernel density estimators are easy to use.

• However, they are computationally expensive.

• Consider a 2-class problem with n_1 + n_2 = n training samples.

• If we use the Gaussian window function, at any x we need to compute n Gaussians.

• If we can model both class conditional densities as Gaussian, the needed computation is much less.

• Another issue is the size of the volume element.

• Choosing the value for h is difficult.

• Sometimes one may choose different h in different parts of the feature space.

• Kernel density estimates are the more popular non-parametric estimates.

• A different approach to non-parametric density estimation is the k-nearest neighbour approach.

• Here we do not have to choose the size parameter, h.

• Instead we choose k and find the volume V that encloses the k nearest neighbours of x.

• Then we take

f̂(x) = k / (n V)
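In one dimension, this can be sketched as follows (our own illustration; we grow an interval around x until it contains k samples):

```python
import numpy as np

def knn_density(x, data, k):
    """1-D k-nearest-neighbour density estimate: find the smallest interval
    around x containing k samples, then set f_hat(x) = k / (n * V)."""
    n = data.shape[0]
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]          # distance to the k-th nearest neighbour
    V = 2.0 * r               # length of the interval [x - r, x + r]
    return k / (n * V)
```

For evenly spaced samples on [0, 1], the estimate at x = 0.5 comes out near the true uniform density value 1.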

• The nearest neighbour density estimate is closely related to the nearest neighbour classifier.

• Consider a 2-class problem with prior probabilities p_i and class conditional densities f_i, i = 0, 1.

• Let f(x) = p_0 f_0(x) + p_1 f_1(x) be the overall density of the feature vector.

• Suppose there are n data samples, with n_i being from Class-i, i = 0, 1.

• We do k-nearest-neighbour estimation of f. Suppose the needed volume is V.

• Suppose in this volume there are k_i samples of Class-i, i = 0, 1.

• Now, using the same volume element, we estimate the densities f as well as f_i, i = 0, 1.

• Then we have

f̂_i(x) = k_i / (n_i V), i = 0, 1, and f̂(x) = k / (n V)

• The estimates for the priors would be p̂_i = n_i / n.

• Using these estimates, the posterior probabilities are

q_j(x) = f̂_j(x) p̂_j / f̂(x) = (k_j / (n_j V)) · (n_j / n) · (n V / k) = k_j / k

• Now, if we want to implement the Bayes classifier, x would be put in class-j if

q_j(x) ≥ q_i(x), ∀i, which implies k_j ≥ k_i, ∀i

• Thus the Bayes classifier with these estimated densities is the k-nearest neighbour classifier.
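Since the estimated posterior is q_j(x) = k_j / k, the resulting Bayes rule is just a majority vote among the k nearest samples. A minimal sketch (ours, with class labels as integers):

```python
import numpy as np

def knn_classify(x, data, labels, k):
    """k-NN classifier: among the k nearest samples to x, pick the majority
    class. This is the Bayes rule under the k-NN density estimates, since the
    estimated posterior is q_j(x) = k_j / k."""
    dists = np.linalg.norm(data - x, axis=1)   # distance from x to each sample
    nearest = np.argsort(dists)[:k]            # indices of the k nearest samples
    counts = np.bincount(labels[nearest])      # k_j for each class j
    return int(np.argmax(counts))
```

For two well-separated clusters, a query near either cluster is assigned that cluster's label.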

**Implementing Bayes Classifier**

• So far we have been discussing the implementation of the Bayes classifier.

• The Bayes classifier is optimal for minimizing risk.

• To implement it, we need the class conditional densities.

• Class conditional densities can be estimated, given iid samples from each class.

• We could estimate the densities parametrically or non-parametrically.

• In the parametric method, we assume a form for the density and estimate its parameters.

• We have considered both maximum likelihood and Bayesian methods.

• We have also discussed the EM algorithm for ML estimation of mixture densities.

• In non-parametric methods we do not know the form of the density.

• We considered kernel density estimates.

• We also saw k-nearest neighbour estimates.

• The Bayes classifier is optimal when we know the posterior probabilities exactly.

• When we estimate densities, there would be inaccuracies.

• This results in the non-optimality of the implemented Bayes classifier.

• As we discussed in the beginning, another approach to classifier design is based on discriminant functions.

• We consider this next.

• Recall that a discriminant function based classifier is

h(X) = 1 if g(W, X) > 0
     = 0 otherwise

where g is called a discriminant function, specified by the parameter vector W.

• If we choose g(X) = q_1(X) − q_0(X), then this is the Bayes classifier.

• Instead of assuming a functional form for the class conditional densities, we can assume a functional form for g and learn the needed parameters.

**Linear discriminant function**

• Let W = (w_0, w_1, · · · , w_d)^T be the parameter vector and let X = (x_1, · · · , x_d)^T be the feature vector.

• Then g specified as

g(W, X) = ∑_{i=1}^{d} w_i x_i + w_0

is called a linear discriminant function.

• Define X̃ = (1, x_1, · · · , x_d)^T, called the **augmented feature vector**.

• Then we can write g(W, X) = W^T X̃.
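With the augmented feature vector, evaluating g reduces to a single inner product, as this small sketch shows (the helper name is ours):

```python
import numpy as np

def g(W, X):
    """Linear discriminant g(W, X) = W^T X_tilde, where
    X_tilde = (1, x_1, ..., x_d) is the augmented feature vector."""
    X_aug = np.concatenate(([1.0], X))   # prepend the constant 1
    return W @ X_aug                     # inner product W^T X_tilde
```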

• We simply assume that the feature vector is augmented, though we write it as X.

• The training set of patterns, {(X_i, y_i), i = 1, · · · , n}, is said to be **linearly separable** if there exists W* such that

X_i^T W* > 0 if y_i = 1
X_i^T W* < 0 if y_i = 0

• Any W* that satisfies the above is called a separating hyperplane. (There exist infinitely many separating hyperplanes.)


**Learning linear discriminant functions**

• We need to learn an 'optimal' W from the training samples.

• The Perceptron learning algorithm is one of the earliest algorithms for learning linear discriminant functions.

• It finds a separating hyperplane if the training set is linearly separable.

• We can also have a risk-minimization approach to learning discriminant functions.
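The Perceptron learning algorithm mentioned above can be sketched as follows (our own minimal version, assuming augmented feature vectors and the 0/1 labelling used in these slides):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron learning on augmented feature vectors X (first column = 1).
    W is updated whenever a sample is misclassified; the loop stops once every
    sample is correct, which is guaranteed to happen (in finitely many updates)
    if the training set is linearly separable."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if xi @ W > 0 else 0
            if pred != yi:
                # move the hyperplane toward class-1 samples, away from class-0
                W += xi if yi == 1 else -xi
                errors += 1
        if errors == 0:
            break
    return W
```

On a linearly separable training set, the returned W classifies every training sample correctly.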