Nonparametric Classification: Modeling Probabilities & Density for Categorical Variables, Study notes of Engineering

Nonparametric classification techniques for modeling categorical variables and predicting a categorical response variable, g, from one or more predictor variables, x. Two approaches: modeling probability density functions for each group and directly modeling probabilities. It also provides examples using the rhomaa-umaa dataset and techniques such as nearest-neighbor classification and kernel density estimation.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-7ug-1
koofers-user-7ug-1 🇺🇸

9 documents

1 / 24

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
NONPARAMETRIC CLASSIFICATION
TECHNIQUES
C&PE 940, 30 November 2005
Geoff Bohling
Assistant Scientist
Kansas Geological Survey
864-2093
Overheads and other resources available at:
http://people.ku.edu/~gbohling/cpe940
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18

Partial preview of the text

Download Nonparametric Classification: Modeling Probabilities & Density for Categorical Variables and more Study notes Engineering in PDF only on Docsity!

NONPARAMETRIC CLASSIFICATION

TECHNIQUES

C&PE 940, 30 November 2005

Geoff Bohling Assistant Scientist Kansas Geological Survey [email protected] 864-

Overheads and other resources available at:

http://people.ku.edu/~gbohling/cpe

Modeling Categorical Variables: Classification

In classification applications we are trying to develop a model for predicting a categorical response variable, G , from one or more predictor variables, X. That is, if we know that an observation arises from one of K different mutually exclusive classes or groups, Gk , then we are trying to estimate the probability of occurrence of each group at each point in the predictor space

P ˆ k ( x ) = Prob( G (^) k X = x )

We could then assign each estimation point to the class with the highest probability at that point, segmenting the predictor space into regions assigned to the different classes.

The fact that the probabilities are continuous variables immediately suggests the possibility of using the nonparametric regression techniques we discussed last time to model them. However, we

will have to apply transforms to ensure that the estimated Pk ( x )

values represent a legitimate set of probabilities for mutually exclusive events, meaning

0 ≤ Pk ( x ) ≤ 1 , k = 1 ,K, K

( ) 1 1

∑^ =

=

K

k

Pk x.

To use this approach, the group memberships in the training dataset are coded as a set of indicator variables, one for each

group, with yk (^) , i = 1 for the group to which data point i belongs and

0 for all other groups – the set of probabilities corresponding to our certain knowledge of the group membership for each training data point.

Example Data

We will look at predicting facies from logs for a Cretaceous section in north central Kansas:

Facies assignments from core are available from the Jones well, along with a suite of logs including neutron and density porosity, photoelectric factor, and thorium, uranium and potassium components of the spectral gamma ray log. We will recast the density porosity and apparent matrix density (Rhomaa) and the photoelectric factor as apparent matrix volumetric photoelectric absorption (Umaa), so that the six logs employed for discrimination are: Th, U, K, Rhomaa, Umaa, and φN. The six facies picked from core are marine, paralic, floodplain, channel, splay, and paleosol.

So, we will train on this data from the Jones well and look at predictions both in the Jones well and the Kenyon well.

Nearest-Neighbor Classification

This is basically the same as nearest neighbor averaging, except that instead of taking the average of y values at neighboring points, we are assigning the class at each point based on a majority vote of the class memberships at neighboring points. Alternatively, one can compute a vector of group membership probabilities by dividing the count for each group by the total number of neighboring points. As usual, we would want to scale the predictor variables to comparable ranges before computing distances to neighboring points. Applying this approach to the two-facies Rhomaa-Umaa example and predicting over a grid of Rhomaa- Umaa values using 15 nearest neighbors, we get the following map of probability of membership in the marine facies:

Applying nearest-neighbor classification with 20 neighbors to the entire Jones well dataset, using all six logs and predicting all six facies, we get the following table of predicted (columns) versus actual (rows) facies:

Core Predicted Facies Facies (^) Marine Paralic Floodplain Channel Splay Paleosol Marine 112 2 8 0 0 6 Paralic 2 110 19 19 3 3 Floodplain 5 7 206 7 4 3 Channel 0 1 5 254 2 0 Splay 0 0 21 2 15 0 Paleosol 2 5 25 0 0 34

Overall, 82% of the intervals are classified correctly. The predicted facies versus depth look a little noisy:

Kernel Density Estimation

This basically uses the same procedure as kernel regression, using kernel basis functions to produce smooth estimates of the group-

specific density functions, f^ ˆ k^ ( X ). Here the kernel function are not

serving to smooth out the y values, but are in a sense spreading out the location of each training data point in predictor space, turning a delta (spike) function at its exact location into a Gaussian (or similar) curve centered at that location. The kernel density estimate at each location is basically a sum of the surrounding kernels rescaled to represent a density estimate. In general, we would have to estimate fully multivariate density functions, using high-dimensional kernel functions. However, if we make the simplifying assumption of independence among the variables, then the multivariate density function for each group is given by the product of the density functions of the individual variables for that group:

f (^) k ( X ) = fk ( X 1 ) ⋅ fk ( X 2 ) L fk ( Xd )

This simplification means that we only need to develop one- dimensional density function estimates. Plugging these density estimates into Bayes’ theorem leads to a method called “naïve Bayes”.

Applying the naïve Bayes approach to the Rhomaa-Umaa example, we get the following density estimates for each group – multiplying the one-dimensional kernel density estimates along each axis to get the combined density estimates.

Note that the Rhomaa axis in these 3D plots is pointed in the opposite direction than the Rhomaa axis in the 2D plots, so the marine cluster is plotting towards the back rather than towards the front like it should.

Combining the density estimates for the two groups in Bayes theorem give the following probability for the Marine facies:

Or, as a contour plot (oriented the right way):

Applying the naïve Bayes procedure to the full Jones dataset we get the following table of results, with a 72% correct classification rate overall:

Core Predicted Facies Facies Marine Paralic Floodplain Channel Splay Paleosol Marine 116 0 6 0 0 6 Paralic 4 113 18 8 3 10 Floodplain 20 13 156 0 16 27 Channel 0 84 1 174 3 0 Splay 0 1 9 0 28 0 Paleosol 4 5 10 0 0 47

The predicted facies versus depth look like:

Here is a schematic representation of the classification network for our example (leaving the bias nodes out of the picture):

The exponential curves in the output nodes represent the softmax transform.

For d input variables, M hidden-layer nodes ( M +1 including the bias node), and K output classes, the number of weights in this network is M ⋅ ( d + 1 ) + K ⋅( M + 1 ). Again, we can use any of a

number of optimization algorithms to adjust the weights. And, just as for the regression problem, we can include a weight decay term in the objective function, forcing a smoother representation of the boundaries between classes than we would obtain otherwise.

Using a single hidden-layer node for the Rhomaa-Umaa example, the sigmoid basis simply forms a step or boundary between the paralic and marine data points. I’ve included those points on the plot in their indicator form – 0’s for paralic and 1’s for marine:

Since the classification problem is about drawing boundaries, the sigmoid basis functions are in a sense more natural to this problem than to the regression problem, since the step in the basis function forms a boundary.

Fitting a network with 20 hidden-layer nodes to the entire Jones dataset using a decay constant of 0.1 (just guessing at reasonable values for the tuning parameters), yields the following classification table, with 93% of the facies predicted correctly:

Core Predicted Facies Facies (^) Marine Paralic Floodplain Channel Splay Paleosol Marine 124 1 1 0 0 2 Paralic 0 140 5 9 0 2 Floodplain 0 9 216 2 2 3 Channel 0 4 2 256 0 0 Splay 0 2 8 0 28 0 Paleosol 0 2 7 0 0 57

And the predictions versus depth look like:

The predictions in the Kenyon are still quite messy: