Detectanomaly - Mathematics and Statistics - Study Notes, Study notes of Mathematical Statistics

In this study material file, you will learn about: Detect anomaly, Data Assumptions, Algorithm Steps, Modeling Stage, Scoring Stage, Reasoning Stage, Key Formulas from Two-Step Clustering

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar

DETECTANOMALY

The anomaly detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. The procedure is designed to quickly detect unusual cases for data auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case is not specific to any particular application, such as detection of unusual payment patterns in the healthcare industry or money laundering detection in the finance industry in which the definition of an anomaly can be well defined.

Data Assumptions

Data. This procedure works with both continuous and categorical variables. Each row represents a distinct observation, and each column represents a distinct variable upon which the peer groups are based. A case identification variable can be available in the data file for marking output, but it will not be used in the analysis. Missing values are allowed. The SPSS weight variable, if specified, is ignored.

The detection model can be applied to a new test data file. The elements of the test data must be the same as the elements of the training data. And, depending on the algorithm settings, the missing value handling that is used to create the model may be applied to the test data file prior to scoring.

Case Order. Note that the solution may depend on the order of cases. To minimize order effects, randomly order the cases. To verify the stability of a given solution, you may want to obtain several different solutions with cases sorted in different random orders. In situations with extremely large file sizes, multiple runs can be performed, with a sample of cases sorted in different random orders.

Assumptions. The algorithm assumes that all variables are nonconstant and independent and assumes that no case has missing values for all the input variables. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but be aware of how well these assumptions are met.
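The case-order advice can be illustrated with a small Python sketch. It only produces several reproducible random orderings of the same cases; fitting the detection model on each ordering and comparing the solutions is left to whatever tool is in use. The function name `shuffled_orders`, its parameters, and the toy `cases` list are hypothetical and not part of the procedure.

```python
import random

def shuffled_orders(cases, n_runs=3, base_seed=0):
    """Return n_runs independently shuffled copies of the case list."""
    orders = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # one reproducible seed per run
        order = list(cases)                   # copy, so the input is not modified
        rng.shuffle(order)
        orders.append(order)
    return orders

# Toy data: ten cases with an ID and one continuous variable.
cases = [{"ID": i, "x": float(i)} for i in range(10)]
orders = shuffled_orders(cases)
```

Each element of `orders` holds the same cases in a different order, so a solution that changes noticeably between orderings is a sign of order dependence.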

Notation

The following notation is used throughout this chapter unless otherwise stated:

ID: The identity variable of each case in the data file.

$n$: The number of cases in the training data $X_{train}$.

$X_{ok}$, $k = 1, \ldots, K$: The set of input variables in the training data.

$M_k$, $k \in \{1, \ldots, K\}$: If $X_{ok}$ is a continuous variable, $M_k$ represents the grand mean, or average of the variable across the entire training data.

$SD_k$, $k \in \{1, \ldots, K\}$: If $X_{ok}$ is a continuous variable, $SD_k$ represents the grand standard deviation, or standard deviation of the variable across the entire training data.

$X_{K+1}$: A continuous variable created in the analysis. It represents the percentage of variables ($k = 1, \ldots, K$) that have missing values in each case.

$X_k$, $k = 1, \ldots, K$: The set of processed input variables after the missing value handling is applied. For more information, see “Modeling Stage”.

$H$, or the boundaries of $H$, $[H_{min}, H_{max}]$: $H$ is the pre-specified number of cluster groups to create. Alternatively, the bounds $[H_{min}, H_{max}]$ can be used to specify the minimum and maximum numbers of cluster groups.

$n_h$, $h = 1, \ldots, H$: The number of cases in cluster $h$, based on the training data.

$p_h$, $h = 1, \ldots, H$: The proportion of cases in cluster $h$, based on the training data. For each $h$, $p_h = n_h / n$.

$M_{hk}$, $k = 1, \ldots, K+1$, $h = 1, \ldots, H$: If $X_k$ is a continuous variable, $M_{hk}$ represents the cluster mean, or average of the variable in cluster $h$ based on the training data. If $X_k$ is a categorical variable, it represents the cluster mode, or most popular categorical value of the variable in cluster $h$ based on the training data.

$SD_{hk}$, $k \in \{1, \ldots, K+1\}$, $h = 1, \ldots, H$: If $X_k$ is a continuous variable, $SD_{hk}$ represents the cluster standard deviation, or standard deviation of the variable in cluster $h$ based on the training data.

$\{n_{hkj}\}$, $k \in \{1, \ldots, K\}$, $h = 1, \ldots, H$, $j = 1, \ldots, J_k$: The frequency set $\{n_{hkj}\}$ is defined only when $X_k$ is a categorical variable. If $X_k$ has $J_k$ categories, then $n_{hkj}$ is the number of cases in cluster $h$ that fall into category $j$.

$m$: An adjustment weight used to balance the influence between continuous and categorical variables. It is a positive value with a default of 6.

$VDI_k$, $k = 1, \ldots, K+1$: The variable deviation index of a case is a measure of the deviation of variable value $X_k$ from its cluster norm.

$GDI$: The group deviation index GDI of a case is the log-likelihood distance $d(h, s)$, which is the sum of all the variable deviation indices $\{VDI_k, k = 1, \ldots, K+1\}$.

anomaly index: The anomaly index of a case is the ratio of the GDI to the average GDI of the cluster group to which the case belongs.

variable contribution measure: The variable contribution measure of variable $X_k$ for a case is the ratio of the $VDI_k$ to the case’s corresponding GDI.

$pct_{anomaly}$ or $n_{anomaly}$: A pre-specified value $pct_{anomaly}$ determines the percentage of cases to be considered as anomalies. Alternatively, a pre-specified positive integer value $n_{anomaly}$ determines the number of cases to be considered as anomalies.

$cutpoint_{anomaly}$: A pre-specified cut point; cases with anomaly index values greater than $cutpoint_{anomaly}$ are considered anomalous.

$k_{anomaly}$: A pre-specified integer threshold $1 \le k_{anomaly} \le K+1$ determines the number of variables considered as the reasons that the case is identified as an anomaly.

Algorithm Steps

This algorithm is divided into 3 stages:

Modeling. Cases are placed into cluster groups based on their similarities on a set of input variables. The clustering model used to determine the cluster group of a case and the sufficient statistics used to calculate the norms of the cluster groups are stored.

Scoring. The model is applied to each case to identify its cluster group and some indices are created for each case to measure the unusualness of the case with respect to its cluster group. All cases are sorted by the values of the anomaly indices. The top portion of the case list is identified as the set of anomalies.

Scoring Stage

Cases in the scoring data that contain a categorical variable with a valid category that does not appear in the training data are screened out. For example, if Region is a categorical variable with categories IL, MA and CA in the training data, a case in the scoring data that has a valid category FL for Region will be excluded from the analysis.

  1. Missing Value Handling (Optional). For each input variable $X_{ok}$, if $X_{ok}$ is a continuous variable, use all valid values of that variable to compute the grand mean $M_k$ and grand standard deviation $SD_k$. Replace the missing values of the variable by its grand mean. If $X_{ok}$ is a categorical variable, combine all missing values and put together a missing value category. This category is treated as a valid category.
  2. Creation of Missing Value Pct Variable (Optional depending on Modeling Stage). If $X_{K+1}$ is created in the Modeling Stage, it is also computed for the scoring data.
  3. Assign Each Case to its Closest Non-noise Cluster. The clustering model from the Modeling Stage is applied to the processed variables of the scoring data file to create a cluster ID for each case. Cases belonging to the noise cluster are reassigned to their closest non-noise cluster. See the TwoStep Cluster algorithm document for more information on the noise cluster.
  4. Calculate Variable Deviation Indices. Given a case $s$, the closest cluster $h$ is found. The variable deviation index $VDI_k$ of variable $X_k$ is defined as the contribution $d_k(h, s)$ of the variable to its log-likelihood distance $d(h, s)$. The corresponding norm value is $M_{hk}$, which is the cluster sample mean of $X_k$ if $X_k$ is continuous, or the cluster mode of $X_k$ if $X_k$ is categorical.
  5. Calculate Group Deviation Index. The group deviation index GDI of a case is the log-likelihood distance $d(h, s)$, which is the sum of all the variable deviation indices $\{VDI_k, k = 1, \ldots, K+1\}$.
  6. Calculate Anomaly Index and Variable Contribution Measures. Two additional indices are calculated that are easier to interpret than the group deviation index and the variable deviation index.
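Steps 1 and 2 of the list above (missing value handling and the missing value percentage variable) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure's actual implementation: cases are dictionaries, `None` marks a missing value, the grand means are computed from whatever rows are passed in, and the names `handle_missing`, `missing_label`, and `missing_pct` are hypothetical. It also assumes every continuous variable has at least one valid value.

```python
def handle_missing(rows, continuous, categorical, missing_label="<missing>"):
    """Impute continuous variables with the grand mean, give categorical
    variables an explicit missing category, and add a missing-percentage column."""
    all_vars = list(continuous) + list(categorical)

    # Grand mean of each continuous variable over its valid (non-missing) values.
    grand_mean = {}
    for var in continuous:
        valid = [row[var] for row in rows if row[var] is not None]
        grand_mean[var] = sum(valid) / len(valid)

    processed = []
    for row in rows:
        # Count missing inputs before any replacement takes place.
        n_missing = sum(row[var] is None for var in all_vars)
        new_row = {}
        for var in continuous:
            new_row[var] = grand_mean[var] if row[var] is None else row[var]
        for var in categorical:
            new_row[var] = missing_label if row[var] is None else row[var]
        # Percentage of the K input variables that were missing in this case.
        new_row["missing_pct"] = 100.0 * n_missing / len(all_vars)
        processed.append(new_row)
    return processed, grand_mean

rows = [
    {"age": 30.0, "region": "IL"},
    {"age": None, "region": "MA"},
    {"age": 50.0, "region": None},
]
processed, grand_mean = handle_missing(rows, ["age"], ["region"])
```

In this toy data the second case receives the grand mean of `age`, and the third case is placed in the explicit missing category for `region`.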

The anomaly index of a case is an alternative to the GDI, computed as the ratio of the case’s GDI to the average GDI of the cluster to which the case belongs. Larger values of this index correspond to greater deviations from the average and indicate stronger anomaly candidates.

The variable contribution measure of a variable for a case is an alternative to the VDI, computed as the ratio of the variable’s VDI to the case’s GDI. This is the proportional contribution of the variable to the deviation of the case. The larger the value of this measure, the greater the variable’s contribution to the deviation.
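The relationship between the four indices can be sketched in Python. The sketch takes the variable deviation indices as given, because the log-likelihood distance that produces them is defined in the TwoStep Cluster material rather than here. As a simplifying assumption, the average GDI of each cluster is computed from the cases supplied, and zero denominators are not handled. The function name `deviation_indices` and the toy values are hypothetical.

```python
def deviation_indices(vdi_by_case, cluster_of):
    """vdi_by_case maps case id -> list of VDI values (one per processed variable);
    cluster_of maps case id -> cluster id."""
    # GDI of a case: the sum of its variable deviation indices.
    gdi = {case: sum(vdis) for case, vdis in vdi_by_case.items()}

    # Average GDI per cluster (assumption: computed from the supplied cases).
    totals, counts = {}, {}
    for case, value in gdi.items():
        h = cluster_of[case]
        totals[h] = totals.get(h, 0.0) + value
        counts[h] = counts.get(h, 0) + 1
    avg_gdi = {h: totals[h] / counts[h] for h in totals}

    # Anomaly index: the case's GDI relative to its cluster's average GDI.
    anomaly_index = {case: gdi[case] / avg_gdi[cluster_of[case]] for case in gdi}

    # Variable contribution measure: each VDI as a share of the case's GDI.
    contribution = {
        case: [v / gdi[case] for v in vdis] for case, vdis in vdi_by_case.items()
    }
    return gdi, anomaly_index, contribution

vdi_by_case = {"a": [1.0, 1.0], "b": [0.5, 1.5], "c": [6.0, 2.0]}
cluster_of = {"a": 1, "b": 1, "c": 1}
gdi, anomaly_index, contribution = deviation_indices(vdi_by_case, cluster_of)
```

Here case `c` has a GDI of 8 against a cluster average of 4, so its anomaly index is 2, and its first variable accounts for three quarters of the deviation.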

Odd Situations

Zero Divided by Zero

The situation in which the GDI of a case is zero and the average GDI of the cluster that the case belongs to is also zero is possible if the cluster is a singleton or is made up of identical cases and the case in question is the same as the identical cases. Whether this case is considered as an anomaly or not depends upon whether the number of identical cases that make up the cluster is large or small. For example, suppose that there are a total of 10 cases in the training data and 2 clusters result, one of which is a singleton (that is, made up of 1 case) while the other has 9 cases. In this situation, the case in the singleton cluster should be considered as an anomaly as it does not belong to the larger cluster. One way to calculate the anomaly index in this situation is to set it as the ratio of average cluster size to the size of the cluster $h$, which is:

$$\text{anomaly index} = \frac{n / H}{n_h}$$

Continuing the 10-case example, the anomaly index for the case belonging to the singleton cluster would be (10/2)/1 = 5, which should be large enough for the algorithm to catch it as an anomaly. In this situation, the variable contribution measure is set to 1/(K+1), where (K+1) is the number of processed variables in the analysis.

Nonzero Divided by Zero

The situation in which the GDI of a case is nonzero but the average GDI of the cluster that the case belongs to is zero is possible if the corresponding cluster is a singleton or is made up of identical cases and the case in question is not the same as the identical cases. Suppose that case $i$ belongs to cluster $h$, which has zero average GDI; that is, $\text{average}(GDI)_h = 0$, but the GDI between case $i$ and cluster $h$ is nonzero, i.e., $GDI(i, h) \neq 0$. One choice for the anomaly index calculation of case $i$ could be to set the denominator as the weighted average GDI over all other clusters if this value is not zero, else set the calculation as the ratio of average cluster size to the size of the cluster $h$. That is,

$$\text{anomaly index}(i) = \begin{cases} \dfrac{GDI(i, h)}{\overline{GDI}_{-h}} & \text{if } \overline{GDI}_{-h} \neq 0 \\[2ex] \dfrac{n / H}{n_h} & \text{else} \end{cases}$$

where $\overline{GDI}_{-h}$ denotes the weighted average GDI over all clusters other than $h$.

This situation triggers a warning that the case is assigned to a cluster that is made up of identical cases.
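The two odd situations can be folded into a single anomaly index calculation, sketched below in Python. The weighted average GDI over the other clusters is passed in as a precomputed number, since the weights are not spelled out here. The function name `anomaly_index_with_fallbacks` and its parameter names are hypothetical.

```python
def anomaly_index_with_fallbacks(gdi, avg_gdi_h, n_h, n, H, other_avg_gdi=0.0):
    """gdi: GDI of the case; avg_gdi_h: average GDI of its cluster h;
    n_h: size of cluster h; n: number of training cases; H: number of clusters;
    other_avg_gdi: weighted average GDI over all other clusters (precomputed)."""
    size_ratio = (n / H) / n_h  # average cluster size relative to the size of h

    if avg_gdi_h != 0:
        return gdi / avg_gdi_h      # regular case
    if gdi == 0:
        return size_ratio           # zero divided by zero
    if other_avg_gdi != 0:
        return gdi / other_avg_gdi  # nonzero divided by zero
    return size_ratio               # nonzero divided by zero, no usable average

# The 10-case example: a singleton cluster among 2 clusters.
singleton_index = anomaly_index_with_fallbacks(gdi=0.0, avg_gdi_h=0.0, n_h=1, n=10, H=2)
```

For the singleton cluster this reproduces the value (10/2)/1 = 5 from the example above. A full implementation would also issue the warning for the nonzero divided by zero case.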

Reasoning Stage

Every case now has a group deviation index and anomaly index, and a set of variable deviation indices and variable contribution measures. The purpose of this stage is to rank the likely anomalous cases and provide the reasons to suspect them of being anomalous.

  1. Identify the Most Anomalous Cases. Sort the cases in descending order on the values of the anomaly index. The top $pct_{anomaly}$ % (or alternatively the top $n_{anomaly}$) gives the anomaly list, subject to the restriction that cases with anomaly index less than or equal to $cutpoint_{anomaly}$ are not considered anomalous.
  2. Provide Reasons for Considering a Case Anomalous. For each anomalous case, sort the variables by their corresponding $VDI_k$ values in descending order. The names of the top $k_{anomaly}$ variables, their values (of the corresponding original variables $X_{ok}$), and the norm values are displayed as reasoning.
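The two reasoning steps can be sketched in Python as follows. The sketch assumes the anomaly indices and variable deviation indices have already been computed, rounds the percentage cut down to a whole number of cases (an assumption, since the rounding rule is not stated here), and returns only variable names rather than the values and norms that the procedure also displays. The function name `reasoning_stage` and the toy data are hypothetical.

```python
import math

def reasoning_stage(anomaly_index, vdi, var_names, pct_anomaly, cutpoint_anomaly, k_anomaly):
    """anomaly_index maps case id -> anomaly index; vdi maps case id -> list of
    VDI values aligned with var_names."""
    # Step 1: rank cases by anomaly index, keep the top percentage,
    # then drop cases that do not exceed the cut point.
    ranked = sorted(anomaly_index, key=anomaly_index.get, reverse=True)
    n_top = math.floor(len(ranked) * pct_anomaly / 100)  # assumption: round down
    anomalies = [c for c in ranked[:n_top] if anomaly_index[c] > cutpoint_anomaly]

    # Step 2: for each anomaly, list the variables with the largest VDI values.
    reasons = {}
    for case in anomalies:
        order = sorted(range(len(var_names)), key=lambda k: vdi[case][k], reverse=True)
        reasons[case] = [var_names[k] for k in order[:k_anomaly]]
    return anomalies, reasons

anomaly_index = {"a": 0.5, "b": 3.0, "c": 2.5, "d": 1.0, "e": 0.8}
vdi = {
    "a": [0.1, 0.2, 0.2],
    "b": [0.2, 5.0, 1.0],
    "c": [2.0, 1.0, 0.5],
    "d": [0.3, 0.3, 0.4],
    "e": [0.2, 0.2, 0.4],
}
var_names = ["age", "income", "missing_pct"]
anomalies, reasons = reasoning_stage(
    anomaly_index, vdi, var_names, pct_anomaly=40, cutpoint_anomaly=2.6, k_anomaly=2
)
```

With these toy values the top 40% contains cases `b` and `c`, but only `b` exceeds the cut point, and its two leading reasons are `income` and `missing_pct`.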