



In this study material file, you will learn about: Detect anomaly, Data Assumptions, Algorithm Steps, Modeling Stage, Scoring Stage, Reasoning Stage, Key Formulas from Two-Step Clustering




The anomaly detection procedure searches for unusual cases based on deviations from the norms of their cluster groups. The procedure is designed to quickly detect unusual cases for data auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case is not specific to any particular application, such as detection of unusual payment patterns in the healthcare industry or money laundering detection in the finance industry, where an anomaly can be well defined.
Data. This procedure works with both continuous and categorical variables. Each row represents a distinct observation, and each column represents a distinct variable upon which the peer groups are based. A case identification variable can be available in the data file for marking output, but it will not be used in the analysis. Missing values are allowed. The SPSS weight variable, if specified, is ignored.
The detection model can be applied to a new test data file. The elements of the test data must be the same as the elements of the training data. Depending on the algorithm settings, the missing value handling that is used to create the model may also be applied to the test data file prior to scoring.
Case Order. Note that the solution may depend on the order of cases. To minimize order effects, randomly order the cases. To verify the stability of a given solution, you may want to obtain several different solutions with cases sorted in different random orders. In situations with extremely large file sizes, multiple runs can be performed, each with a sample of cases sorted in a different random order.
Assumptions. The algorithm assumes that all variables are nonconstant and independent and that no case has missing values for all the input variables. Further, each continuous variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the assumption of independence and the distributional assumptions, but you should still be aware of how well these assumptions are met.
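The case-order advice can be sketched in a few lines of Python. This is only an illustration: the function name `shuffled_orders` and its parameters are hypothetical, and fitting the model on each ordering is left to whatever clustering routine is in use.

```python
import random


def shuffled_orders(cases, n_runs=3, seed=0):
    """Return n_runs independently shuffled copies of the case list.

    Hypothetical helper: each copy can be fed to a separate model run
    to check whether the solution is stable under case reordering.
    """
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    orders = []
    for _ in range(n_runs):
        order = list(cases)  # copy so the original order is untouched
        rng.shuffle(order)
        orders.append(order)
    return orders
```

If the anomalies flagged under the different orderings largely agree, the solution can be regarded as reasonably stable.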
The following notation is used throughout this chapter unless otherwise stated:
ID: The identity variable of each case in the data file.
$n$: The number of cases in the training data $X_{\text{train}}$.
$X_{ok}$, $k = 1, \ldots, K$: The set of input variables in the training data.
$M_k$, $k \in \{1, \ldots, K\}$: If $X_{ok}$ is a continuous variable, $M_k$ represents the grand mean, or average of the variable across the entire training data.
DETECTANOMALY
$SD_k$, $k \in \{1, \ldots, K\}$: If $X_{ok}$ is a continuous variable, $SD_k$ represents the grand standard deviation, or standard deviation of the variable across the entire training data.
$X_{K+1}$: A continuous variable created in the analysis. It represents the percentage of variables ($k = 1, \ldots, K$) that have missing values in each case.
$X_k$, $k = 1, \ldots, K$: The set of processed input variables after the missing value handling is applied. For more information, see “Modeling Stage” on p. 3.
$H$, or the boundaries of $H$: $[H_{\min}, H_{\max}]$: $H$ is the pre-specified number of cluster groups to create. Alternatively, the bounds $[H_{\min}, H_{\max}]$ can be used to specify the minimum and maximum numbers of cluster groups.
$n_h$, $h = 1, \ldots, H$: The number of cases in cluster $h$, based on the training data.
$p_h$, $h = 1, \ldots, H$: The proportion of cases in cluster $h$, based on the training data. For each $h$, $p_h = n_h / n$.
$M_{hk}$, $k = 1, \ldots, K+1$, $h = 1, \ldots, H$: If $X_k$ is a continuous variable, $M_{hk}$ represents the cluster mean, or average of the variable in cluster $h$ based on the training data. If $X_k$ is a categorical variable, it represents the cluster mode, or most popular categorical value of the variable in cluster $h$ based on the training data.
$SD_{hk}$, $k \in \{1, \ldots, K+1\}$, $h = 1, \ldots, H$: If $X_k$ is a continuous variable, $SD_{hk}$ represents the cluster standard deviation, or standard deviation of the variable in cluster $h$ based on the training data.
$\{n_{hkj}\}$, $k \in \{1, \ldots, K\}$, $h = 1, \ldots, H$, $j = 1, \ldots, J_k$: The frequency set $\{n_{hkj}\}$ is defined only when $X_k$ is a categorical variable. If $X_k$ has $J_k$ categories, then $n_{hkj}$ is the number of cases in cluster $h$ that fall into category $j$.
$m$: An adjustment weight used to balance the influence between continuous and categorical variables. It is a positive value with a default of 6.
$VDI_k$, $k = 1, \ldots, K+1$: The variable deviation index of a case is a measure of the deviation of variable value $X_k$ from its cluster norm.
GDI: The group deviation index GDI of a case is the log-likelihood distance $d(h, s)$, which is the sum of all the variable deviation indices $\{VDI_k, k = 1, \ldots, K+1\}$.
anomaly index: The anomaly index of a case is the ratio of its GDI to the average GDI of the cluster group to which the case belongs.
variable contribution measure: The variable contribution measure of variable $X_k$ for a case is the ratio of the $VDI_k$ to the case’s corresponding GDI.
$pct_{\text{anomaly}}$ or $n_{\text{anomaly}}$: A pre-specified value $pct_{\text{anomaly}}$ determines the percentage of cases to be considered as anomalies. Alternatively, a pre-specified positive integer value $n_{\text{anomaly}}$ determines the number of cases to be considered as anomalies.
$cutpoint_{\text{anomaly}}$: A pre-specified cut point; cases with anomaly index values greater than $cutpoint_{\text{anomaly}}$ are considered anomalous.
$k_{\text{anomaly}}$: A pre-specified integer threshold, $1 \le k_{\text{anomaly}} \le K+1$, that determines the number of variables considered as the reasons that the case is identified as an anomaly.
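Several of the quantities in the notation are simple summary statistics. The following is a minimal sketch, not the procedure's implementation. It assumes missing values are encoded as `None` and that the "percentage" is expressed on a 0 to 100 scale; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean


def missing_percentage(case):
    """X_{K+1}: percentage (0-100, an assumption) of the K inputs that are missing."""
    return 100.0 * sum(v is None for v in case) / len(case)


def cluster_proportions(labels):
    """p_h = n_h / n for every cluster label h."""
    n = len(labels)
    return {h: n_h / n for h, n_h in Counter(labels).items()}


def cluster_norms(values, labels, categorical=False):
    """M_hk for one variable: cluster mean if continuous, cluster mode if categorical."""
    by_cluster = {}
    for v, h in zip(values, labels):
        by_cluster.setdefault(h, []).append(v)
    if categorical:
        # most_common(1) returns the single most frequent category in the cluster
        return {h: Counter(vs).most_common(1)[0][0] for h, vs in by_cluster.items()}
    return {h: mean(vs) for h, vs in by_cluster.items()}
```

The cluster standard deviation is left out on purpose, because the text here does not say whether a sample or population estimate is meant.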
This algorithm is divided into 3 stages:
Modeling. Cases are placed into cluster groups based on their similarities on a set of input variables. The clustering model used to determine the cluster group of a case and the sufficient statistics used to calculate the norms of the cluster groups are stored.
Scoring. The model is applied to each case to identify its cluster group and some indices are created for each case to measure the unusualness of the case with respect to its cluster group. All cases are sorted by the values of the anomaly indices. The top portion of the case list is identified as the set of anomalies.
Reasoning. For each case identified as an anomaly, the variables that deviate most from the cluster norms are reported as the reasons the case is considered anomalous.
Cases in the scoring data that contain a categorical variable with a valid category not present in the training data are screened out. For example, if Region is a categorical variable with categories IL, MA and CA in the training data, a case in the scoring data that has a valid category FL for Region will be excluded from the analysis.
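The screening rule can be illustrated with a short sketch. It assumes each case is a dictionary keyed by variable name and that missing values are encoded as `None` (so they are not treated as unseen categories); the function name is hypothetical.

```python
def screen_scoring_cases(train_cases, scoring_cases, categorical_vars):
    """Split scoring cases into (kept, excluded) by unseen valid categories."""
    # Valid categories observed in the training data, per categorical variable
    seen = {
        var: {c[var] for c in train_cases if c[var] is not None}
        for var in categorical_vars
    }
    kept, excluded = [], []
    for case in scoring_cases:
        has_unseen = any(
            case[var] is not None and case[var] not in seen[var]
            for var in categorical_vars
        )
        (excluded if has_unseen else kept).append(case)
    return kept, excluded
```

With training categories IL, MA and CA for Region, a scoring case with Region FL ends up in the excluded list, matching the example above.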
The anomaly index of a case is an alternative to the GDI. It is computed as the ratio of the case’s GDI to the average GDI of the cluster to which the case belongs. Increasing values of this index correspond to greater deviations from the average and indicate better anomaly candidates.
The variable contribution measure of a variable for a case is an alternative to the VDI. It is computed as the ratio of the variable’s VDI to the case’s GDI; that is, the proportional contribution of the variable to the deviation of the case. The larger the value of this measure, the greater the variable’s contribution to the deviation.
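Given the variable deviation indices of each case, the two ratios can be sketched as follows. This is a simplified illustration: it takes the VDI values as already computed, averages the GDI over the same cases that are being scored (an assumption), and does not handle zero denominators. The function names are hypothetical.

```python
def group_deviation_index(vdis):
    """GDI of one case: the sum of its variable deviation indices."""
    return sum(vdis)


def variable_contributions(vdis):
    """Variable contribution measures of one case: VDI_k / GDI for each k."""
    gdi = group_deviation_index(vdis)
    return [v / gdi for v in vdis]


def anomaly_indices(gdis, labels):
    """Anomaly index of each case: its GDI over the average GDI of its cluster."""
    totals, counts = {}, {}
    for g, h in zip(gdis, labels):
        totals[h] = totals.get(h, 0.0) + g
        counts[h] = counts.get(h, 0) + 1
    avg = {h: totals[h] / counts[h] for h in totals}  # average GDI per cluster
    return [g / avg[h] for g, h in zip(gdis, labels)]
```

For example, a case with VDI values 1.0 and 3.0 has a GDI of 4.0 and contribution measures 0.25 and 0.75, which sum to 1.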
Odd Situations
Zero Divided by Zero
The GDI of a case and the average GDI of the cluster that the case belongs to can both be zero if the cluster is a singleton, or if it is made up of identical cases and the case in question is the same as those identical cases. Whether this case is considered an anomaly depends upon whether the number of identical cases that make up the cluster is large or small. For example, suppose that there are a total of 10 cases in the training data and 2 clusters result, one of which is a singleton (that is, made up of 1 case) while the other has 9 cases. In this situation, the case in the singleton cluster should be considered an anomaly because it does not belong to the larger cluster. One way to calculate the anomaly index in this situation is to set it as the ratio of the average cluster size to the size of cluster h, which is:

$$\text{anomaly index} = \frac{n / H}{n_h}$$
Following the 10 cases example, the anomaly index for the case belonging to the singleton cluster would be (10/2)/1 = 5, which should be large enough for the algorithm to catch it as an anomaly. In this situation, the variable contribution measure is set to 1/(K+1), where (K+1) is the number of processed variables in the analysis.
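The zero-divided-by-zero fallback is small enough to check by hand. The sketch below uses hypothetical function names and simply encodes the ratio of average cluster size to cluster size and the equal-share contribution described above.

```python
def size_based_anomaly_index(n, H, n_h):
    """Fallback anomaly index: average cluster size (n / H) over the size of cluster h."""
    return (n / H) / n_h


def equal_contributions(K):
    """Fallback variable contribution measures: 1 / (K + 1) for each processed variable."""
    return [1.0 / (K + 1)] * (K + 1)
```

For the 10-case example, `size_based_anomaly_index(10, 2, 1)` gives 5.0 for the singleton, while a case in the 9-case cluster would get (10/2)/9, which is below 1.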
Nonzero Divided by Zero
The GDI of a case can be nonzero while the average GDI of the cluster that the case belongs to is zero if the corresponding cluster is a singleton or is made up of identical cases and the case in question is not the same as those identical cases. Suppose that case i belongs to cluster h, which has zero average GDI; that is, average(GDI)h = 0, but the GDI between case i and cluster h is nonzero; that is, GDI(i, h) ≠ 0. One choice for the anomaly index calculation of case i is to set the denominator to the weighted average GDI over all other clusters if this value is not zero; otherwise, the anomaly index is set to the ratio of the average cluster size to the size of cluster h. That is,

$$\text{anomaly index}(i) = \begin{cases} \dfrac{GDI(i, h)}{\overline{GDI}_{-h}} & \text{if } \overline{GDI}_{-h} \neq 0, \\[2ex] \dfrac{n / H}{n_h} & \text{else,} \end{cases}$$

where $\overline{GDI}_{-h}$ denotes the weighted average GDI over all clusters other than h.
This situation triggers a warning that the case is assigned to a cluster that is made up of identical cases.
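The nonzero-divided-by-zero rule can be sketched as below. One detail is an assumption: the text says only "weighted average" over the other clusters, and this sketch weights each cluster's average GDI by its cluster size. The function name and the dictionary-based inputs are hypothetical.

```python
def anomaly_index_zero_average(gdi_i, h, avg_gdi, sizes):
    """Anomaly index for a case in cluster h whose cluster average GDI is zero.

    avg_gdi and sizes map each cluster label to its average GDI and size n_j.
    ASSUMPTION: the weighted average over other clusters uses n_j as weights.
    """
    others = [j for j in sizes if j != h]
    n_others = sum(sizes[j] for j in others)
    weighted = (
        sum(sizes[j] * avg_gdi[j] for j in others) / n_others if n_others else 0.0
    )
    if weighted != 0:
        return gdi_i / weighted
    # Fall back to average cluster size over the size of cluster h
    n, H = sum(sizes.values()), len(sizes)
    return (n / H) / sizes[h]
```

When every other cluster also has zero average GDI, the function falls back to the same size-based ratio used in the zero-divided-by-zero case.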
Every case now has a group deviation index and anomaly index, and a set of variable deviation indices and variable contribution measures. The purpose of this stage is to rank the likely anomalous cases and provide the reasons to suspect them of being anomalous.
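The ranking and the selection of reasons can be sketched as follows. The text here does not spell out how the settings combine or how a percentage is rounded, so this sketch makes assumptions: the percentage or count limit is applied first, the cut point is applied afterwards as an additional filter, and the percentage is truncated to a whole number of cases. The function names are hypothetical.

```python
def select_anomalies(anomaly_index, pct_anomaly=None, n_anomaly=None,
                     cutpoint_anomaly=None):
    """Return case positions ranked by anomaly index, limited by the given settings."""
    order = sorted(range(len(anomaly_index)),
                   key=lambda i: anomaly_index[i], reverse=True)
    if pct_anomaly is not None:
        # ASSUMPTION: truncate the percentage to a whole number of cases
        order = order[: int(len(order) * pct_anomaly / 100)]
    elif n_anomaly is not None:
        order = order[:n_anomaly]
    if cutpoint_anomaly is not None:
        order = [i for i in order if anomaly_index[i] > cutpoint_anomaly]
    return order


def top_reasons(vdis, k_anomaly):
    """Positions of the k_anomaly variables with the largest VDI for one case."""
    order = sorted(range(len(vdis)), key=lambda k: vdis[k], reverse=True)
    return order[:k_anomaly]
```

For instance, with anomaly indices 0.5, 3.0, 1.2 and 2.0, asking for the top 2 cases returns positions 1 and 3, and `top_reasons` then names the variables with the largest deviation for each of them.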