# Non-Parametric Estimation - Introduction to Pattern Recognition - Lecture Slides, Slides for Advanced Algorithms. West Bengal State University


Recap

• We are discussing non-parametric estimation of density functions.

• Here we do not assume any form for the density function.

• The basic idea is to estimate the density by

f̂(x) = k / (n V)

where V is the volume of a small region around x in which k out of the n data samples are found.

PR NPTEL course – p.1/130

• The choice of size of V is critical for getting good estimates.

• As discussed in the last class, there are two possibilities:
  • we can fix V and compute k (Parzen window or kernel density estimate)
  • we can fix k and compute V (k-nearest-neighbour density estimate)

• We discuss these in this class.

Parzen Windows

• We first consider the Parzen window method.

• Define a function φ : ℜ^d → ℜ by

φ(u) = 1 if |ui| ≤ 0.5, i = 1, · · · , d
     = 0 otherwise

where u = (u1, · · · , ud)^T.

• This defines a unit hypercube in ℜ^d centered at the origin.

• Note that φ(u) = φ(−u).
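As a small sketch (not part of the lecture), the hypercube window above can be written in a few lines of NumPy; the name `phi` is our own choice:

```python
import numpy as np

def phi(u):
    """Unit-hypercube window: 1 if every coordinate satisfies |u_i| <= 0.5, else 0."""
    u = np.atleast_1d(u)
    return 1.0 if np.all(np.abs(u) <= 0.5) else 0.0

# 1 inside the unit hypercube centered at the origin ...
print(phi([0.2, -0.4]))
# ... and 0 as soon as any coordinate leaves [-0.5, 0.5].
print(phi([0.2, 0.6]))
# Symmetry: phi(u) == phi(-u)
print(phi([0.5, -0.5]) == phi([-0.5, 0.5]))
```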

• φ((u − u0)/h) is a hypercube of side h centered at u0.

• Let D = {x1, · · · , xn} be the data samples.

• Then, for any x, φ((x − xi)/h) would be 1 only if xi falls in a hypercube of side h centered at x.

• Hence the number of data points falling in a hypercube of side h centered at x is

k = ∑_{i=1}^{n} φ((x − xi)/h)

• A hypercube of side h in ℜ^d has volume h^d.

• Hence we can write our estimated density function as

f̂(x) = (1/n) ∑_{i=1}^{n} (1/h^d) φ((x − xi)/h)

• This is known as the Parzen window estimate.

• If we store all xi, we can compute f̂(x) at any x.

• The value of h determines the size of the volume element.
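A minimal NumPy sketch of this estimate (the function name `parzen_estimate` and the toy data are our own, not from the lecture):

```python
import numpy as np

def parzen_estimate(x, data, h):
    """Parzen window estimate f(x) = (1/n) * sum_i (1/h^d) * phi((x - x_i)/h),
    with phi the unit-hypercube window."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)                 # n x d array of samples
    n, d = data.shape
    u = (x - data) / h                         # scaled offsets (x - x_i)/h
    inside = np.all(np.abs(u) <= 0.5, axis=1)  # phi = 1 inside the cube
    return inside.sum() / (n * h ** d)

# 1-d toy data: two points near 0 and one far away.
data = np.array([[0.0], [0.2], [5.0]])
# With h = 1, the cube around x = 0 catches x1 and x2 but not x3,
# so the estimate is k/(n h^d) = 2/3.
print(parzen_estimate([0.0], data, h=1.0))
```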

• The Parzen window density estimate is a kind of generalization of the histogram idea.

• We are essentially erecting bins where needed and counting.

• We can use other φ functions also in this form of estimate.

• The φ function should satisfy

φ(u) ≥ 0, ∀u, and ∫_{ℜ^d} φ(u) du = 1

• The hypercube window function that we used satisfies these. Let

V = ∫_{ℜ^d} φ(u/h) du = ∫_{ℜ^d} φ((u − u0)/h) du

• For our hypercube window, V = h^d.
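This V = h^d fact can be checked numerically; a 1-d sketch with grid integration (our own construction, with an arbitrary h):

```python
import numpy as np

# Check numerically that V = integral of phi(u/h) du equals h^d (here d = 1)
# for both windows used in these slides.
h = 0.3
u = np.linspace(-5.0, 5.0, 200001)
du = u[1] - u[0]

box = (np.abs(u / h) <= 0.5).astype(float)                      # hypercube window
gauss = (1.0 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * (u / h) ** 2)  # Gaussian window

V_box = box.sum() * du      # Riemann sum approximating the integral
V_gauss = gauss.sum() * du
print(V_box, V_gauss)       # both approximately h = 0.3
```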

• Then the estimate

f̂(x) = (1/n) ∑_{i=1}^{n} (1/V) φ((x − xi)/h)

would be a density.

• That is, f̂ would satisfy

f̂(x) ≥ 0, ∀x, and ∫ f̂(x) dx = 1
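A quick numerical sanity check that f̂ integrates to 1 (a 1-d sketch with the hypercube window; the data points and h are arbitrary choices of ours):

```python
import numpy as np

# f_hat(x) = (1/n) * sum_i (1/h) * phi((x - x_i)/h), hypercube window, d = 1.
data = np.array([-1.0, 0.0, 0.0, 2.5])
h = 0.5
x = np.linspace(-5.0, 7.0, 240001)
dx = x[1] - x[0]

f_hat = np.zeros_like(x)
for xi in data:
    f_hat += (np.abs((x - xi) / h) <= 0.5) / h   # one box per sample
f_hat /= len(data)

total = f_hat.sum() * dx    # Riemann sum approximating the integral of f_hat
print(total)                # approximately 1
```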

• We can choose many φ functions that satisfy the earlier conditions.

• Then, with an appropriate V, we get a density estimate.

• This general method is often called the kernel density estimate.

• For example, we can have

φ(u) = (1/√(2π))^d exp[ −(1/2) ||u||² ]

• This is the d-dimensional Gaussian density.

• For this φ also we get V = h^d.

• The density estimate now is

f̂(x) = (1/n) ∑_{i=1}^{n} (1/(h√(2π)))^d exp[ −||x − xi||² / (2h²) ]

• We are essentially taking a Gaussian centered at each data point and representing the unknown density as a mixture of these Gaussians.

• This Gaussian kernel gives a smoother density estimate.
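The Gaussian-kernel estimate can be sketched the same way (the name `gaussian_kde` is our own, not a library API):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate with the d-dimensional Gaussian window:
    f(x) = (1/n) * sum_i (1/(h*sqrt(2*pi)))^d * exp(-||x - x_i||^2 / (2 h^2))."""
    x = np.atleast_1d(x)
    data = np.atleast_2d(data)
    n, d = data.shape
    sq = np.sum((x - data) ** 2, axis=1)           # ||x - x_i||^2 per sample
    coef = (1.0 / (h * np.sqrt(2 * np.pi))) ** d   # per-kernel normalizer
    return coef * np.mean(np.exp(-sq / (2 * h ** 2)))

# A single sample at the origin with h = 1 reproduces the standard normal
# density, e.g. 1/sqrt(2*pi) at x = 0.
print(gaussian_kde([0.0], [[0.0]], h=1.0))
```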

• We next look at convergence of such estimates.

• Let f̂n denote the density estimate with n samples, and similarly let hn and Vn denote the corresponding quantities when the sample size is n.

• The density estimate is

f̂n(x) = (1/n) ∑_{i=1}^{n} (1/Vn) φ((x − xi)/hn)

• The question we now ask is: does f̂n → f?

• Define

δn(x) = (1/Vn) φ(x/hn)

• Both the amplitude and the width of δn are affected by hn.

• We assume that as n → ∞ we have hn → 0 and δn tends to a delta function.

• This is true for both the φ functions we saw.

• By the properties of φ, we have

∫ (1/Vn) φ((x − xi)/hn) dx = 1

• Hence we have

∫ δn(x − xi) dx = 1, ∀i, ∀n.
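The delta-function behaviour can be seen numerically; a 1-d sketch with the Gaussian window (so Vn = hn), with hn values picked arbitrarily: as hn shrinks, the peak of δn grows like 1/hn while its integral stays 1.

```python
import numpy as np

# delta_n(x) = (1/V_n) * phi(x/h_n), Gaussian window, d = 1, V_n = h_n.
x = np.linspace(-3.0, 3.0, 600001)
dx = x[1] - x[0]

for hn in [1.0, 0.5, 0.1]:
    Vn = hn
    delta_n = (1.0 / Vn) * (1.0 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * (x / hn) ** 2)
    # Peak height grows as h_n -> 0, but the total mass stays ~1.
    print(hn, delta_n.max(), delta_n.sum() * dx)
```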

• We can write f̂n in terms of δn as

f̂n(x) = (1/n) ∑_{i=1}^{n} (1/Vn) φ((x − xi)/hn) = (1/n) ∑_{i=1}^{n} δn(x − xi)

• f̂n(x) is a random variable because it depends on xi, i = 1, · · · , n.

• The xi are iid with density f.

• Let f̄n(x) be the expectation of f̂n(x). Then

f̄n(x) = (1/n) ∑_{i=1}^{n} E[ (1/Vn) φ((x − xi)/hn) ]

       = ∫ (1/Vn) φ((x − z)/hn) f(z) dz

       = ∫ δn(x − z) f(z) dz

• Thus,

f̄n(x) = ∫ δn(x − z) f(z) dz

• We know δn becomes a delta function as n → ∞.

• Hence, as n → ∞, E[f̂n(x)] → f(x), ∀x.

• Now let us calculate the variance of f̂n(x).

• We have

f̂n(x) = ∑_{i=1}^{n} (1/n) (1/Vn) φ((x − xi)/hn)

• Thus it is a sum of n terms, each being a function of one xi.

• Since the xi are iid, the variance of f̂n(x) would be the sum of the variances of these n random variables.

• The mean of f̂n(x) is given by

f̄n(x) = ∑_{i=1}^{n} E[ (1/n) (1/Vn) φ((x − xi)/hn) ]

• Hence each expectation inside the sum is (1/n) f̄n(x).

• Let σn² be the variance of f̂n(x). Then

σn² = n Var[ (1/n) (1/Vn) φ((x − xi)/hn) ]

    = n E[ (1/(n² Vn²)) φ²((x − xi)/hn) ] − n (1/n²) f̄n²(x)

    = (1/(n Vn)) ∫ (1/Vn) φ²((x − z)/hn) f(z) dz − (1/n) f̄n²(x)

Thus we have

σn² ≤ (1/(n Vn)) ∫ (1/Vn) φ²((x − z)/hn) f(z) dz

    ≤ (1/(n Vn)) sup(φ) ∫ (1/Vn) φ((x − z)/hn) f(z) dz

    = sup(φ) f̄n(x) / (n Vn)

where sup(φ) = max_u φ(u).

• Thus we get

σn² ≤ sup(φ) f̄n(x) / (n Vn)

where sup(φ) = max_u φ(u).

• This implies σn² → 0 as n → ∞, provided n Vn → ∞ (so Vn must shrink slower than 1/n).

• This finally shows that the kernel density estimate is a consistent estimate.
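The variance bound can be illustrated by simulation; a 1-d sketch of our own (f uniform on [0,1], hypercube window, x, hn and n chosen arbitrarily): here f̄n(x) = 1, so the bound is sup(φ)·f̄n(x)/(n Vn) = 1/(100·0.2) = 0.05, while the exact variance is 0.2·0.8/(100·0.2²) = 0.04.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hn, x = 100, 0.2, 0.5
Vn = hn                                        # V_n = h_n^d with d = 1

estimates = []
for _ in range(2000):
    z = rng.uniform(0.0, 1.0, size=n)          # n iid samples from f
    k = np.sum(np.abs((x - z) / hn) <= 0.5)    # samples inside the window at x
    estimates.append(k / (n * Vn))             # f_hat_n(x) = k / (n V_n)

bound = 1.0 * 1.0 / (n * Vn)                   # sup(phi) * fbar_n(x) / (n V_n)
print(np.var(estimates), bound)                # empirical variance stays below the bound
```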

• Kernel density estimates are essentially mixture densities.

f̂(x) = (1/n1) ∑_{i=1}^{n1} (1/V) φ((x − xi)/h)

• We store all the data samples.

• We compute the density whenever needed.

• The kernel density estimators are easy to use.

• However, they are computationally expensive.

• Consider a 2-class problem with n1 + n2 = n training samples.

• If we use the Gaussian window function, at any x we need to compute n Gaussians.

• If we can model both class conditional densities as Gaussian, the needed computation is much less.

• Another issue is the size of the volume element.

• Choosing the value for h is difficult.

• Sometimes one may choose a different h in different parts of the feature space.

• Kernel density estimates are the more popular non-parametric estimates.

• A different approach to non-parametric density estimation is the k-nearest-neighbour approach.

• Here we do not have to choose the size parameter h.

• Instead we choose k and find the V that encloses the k nearest neighbours of x.

• Then we take

f̂(x) = k / (n V)
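A 1-d sketch of this estimate (the function name `knn_density` is our own; in one dimension the "volume" is the interval length 2r, where r is the distance to the k-th nearest sample):

```python
import numpy as np

def knn_density(x, data, k):
    """k-nearest-neighbour density estimate f(x) = k / (n * V) in 1-d,
    with V = 2*r and r the distance from x to its k-th nearest sample."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    dists = np.sort(np.abs(data - x))   # distances from x, ascending
    r = dists[k - 1]                    # radius that encloses k neighbours
    V = 2.0 * r                         # 1-d volume (interval length)
    return k / (n * V)

# Evenly spaced samples 0..9: from x = 5 the three nearest are at distances
# 0, 1, 1, so r = 1, V = 2 and f_hat = 3/(10*2) = 0.15.
print(knn_density(5.0, np.arange(10), k=3))
```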

• The nearest-neighbour density estimate is closely related to the nearest-neighbour classifier.

• Consider a 2-class problem with prior probabilities pi and class conditional densities fi, i = 0, 1.

• Let f(x) = p0 f0(x) + p1 f1(x) be the overall density of the feature vector.

• Suppose there are n data samples, with ni being from Class i, i = 0, 1.

• We do k-nearest-neighbour estimation of f. Suppose the needed volume is V.

• Suppose in this volume there are ki samples of Class i, i = 0, 1.