DETECTANOMALY

The anomaly detection procedure searches for unusual cases based on deviations from the norms

of their cluster groups. The procedure is designed to quickly detect unusual cases for data

auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This

algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case

is not specific to any particular application, such as detection of unusual payment patterns in the

healthcare industry or money laundering detection in the finance industry, where an anomaly

can be well defined.

Data Assumptions

Data. This procedure works with both continuous and categorical variables. Each row represents a

distinct observation, and each column represents a distinct variable upon which the peer groups

are based. A case identification variable can be available in the data file for marking output, but

it will not be used in the analysis. Missing values are allowed. The SPSS weight variable, if

specified, is ignored.

The detection model can be applied to a new test data file. The elements of the test data must be the

same as the elements of the training data. Depending on the algorithm settings, the missing

value handling that is used to create the model may be applied to the test data file prior to scoring.
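
The requirement that the test data have the same elements as the training data can be checked before scoring. The following is a minimal sketch, assuming cases are held as Python dictionaries keyed by variable name; the function name same_elements and the sample variables are hypothetical and not part of the procedure.

```python
def same_elements(train_rows, test_rows):
    """Compare the variable names found in training and test cases.

    Returns (missing_in_test, extra_in_test) as two sets of names.
    """
    train_vars = set().union(*(row.keys() for row in train_rows))
    test_vars = set().union(*(row.keys() for row in test_rows))
    return train_vars - test_vars, test_vars - train_vars


# Hypothetical training and test cases for illustration only.
train = [{"ID": 1, "age": 34.0, "region": "north"}]
test = [{"ID": 7, "age": 29.0}]

missing_in_test, extra_in_test = same_elements(train, test)
print(missing_in_test, extra_in_test)
```

An empty result for both sets suggests the test file can be scored with the model; any name in missing_in_test points to a variable that must be supplied first.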

Case Order. Note that the solution may depend on the order of cases. To minimize order effects,

randomly order the cases. To verify the stability of a given solution, you may want to obtain

several different solutions with cases sorted in different random orders. In situations with

extremely large file sizes, multiple runs can be performed, with a sample of cases sorted in

different random orders.
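
The reordering advice above can be sketched as follows. This is an illustration only, assuming cases are held in a Python list; the function name shuffled_orders and its parameters are hypothetical. Each run uses its own seed so that a given order can be reproduced, and the optional sample_size covers the large-file situation in which each run works on a random sample.

```python
import random


def shuffled_orders(cases, n_runs=3, base_seed=0, sample_size=None):
    """Return n_runs randomly ordered copies (or random samples) of cases."""
    orders = []
    for run in range(n_runs):
        rng = random.Random(base_seed + run)  # separate, reproducible seed per run
        if sample_size is None:
            order = list(cases)
            rng.shuffle(order)  # full file, new random order
        else:
            # sample() draws without replacement and returns a random order
            order = rng.sample(list(cases), sample_size)
        orders.append(order)
    return orders


# Hypothetical cases for illustration only.
cases = [{"ID": i, "x": float(i)} for i in range(10)]
orders = shuffled_orders(cases, n_runs=3)
samples = shuffled_orders(cases, n_runs=2, sample_size=4)
```

Running the procedure once per element of orders and comparing the flagged cases gives a rough indication of how stable the solution is.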

Assumptions. The algorithm assumes that all variables are nonconstant and independent and

assumes that no case has missing values for all the input variables. Further, each continuous

variable is assumed to have a normal (Gaussian) distribution, and each categorical variable is

assumed to have a multinomial distribution. Empirical internal testing indicates that the procedure

is fairly robust to violations of both the assumption of independence and the distributional

assumptions, but you should still be aware of how well these assumptions are met.
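
Two of the assumptions above, nonconstant variables and no case with every input missing, are easy to check before running the procedure. The following is a minimal sketch, assuming cases are Python dictionaries and that None marks a missing value; the function name check_inputs and the sample variables are hypothetical. It does not test independence or the distributional assumptions.

```python
def check_inputs(rows, input_vars):
    """Return (constant_vars, all_missing_rows) for a list of dict rows.

    None marks a missing value (an assumption of this sketch).
    """
    constant_vars = []
    for var in input_vars:
        observed = {row[var] for row in rows if row[var] is not None}
        if len(observed) <= 1:  # zero or one distinct valid value
            constant_vars.append(var)
    all_missing_rows = [
        i for i, row in enumerate(rows)
        if all(row[var] is None for var in input_vars)
    ]
    return constant_vars, all_missing_rows


# Hypothetical cases for illustration only.
rows = [
    {"age": 34.0, "region": "north", "flag": 1},
    {"age": None, "region": None, "flag": None},
    {"age": 51.0, "region": "south", "flag": 1},
]
constant_vars, all_missing_rows = check_inputs(rows, ["age", "region", "flag"])
print(constant_vars, all_missing_rows)
```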

Notation

The following notation is used throughout this chapter unless otherwise stated:

ID The identity variable of each case in the data file.

n The number of cases in the training data X_train.

X_ok, k = 1, …, K The set of input variables in the training data.

M_k, k ∈ {1, …, K} If X_ok is a continuous variable, M_k represents the grand mean, or average of

the variable across the entire training data.
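
As an illustration of the grand mean of a continuous variable, the following sketch averages one variable over the training cases. It assumes that missing values (None or NaN here) are skipped, which the text above does not state; the function name grand_mean and the sample values are hypothetical.

```python
import math


def grand_mean(values):
    """Average of the valid (non-missing) values of one continuous variable."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if not valid:
        raise ValueError("variable has no valid values")
    return sum(valid) / len(valid)


# Hypothetical values of one continuous variable across four training cases.
income = [40.0, None, 55.0, 61.0]
m_k = grand_mean(income)
print(m_k)
```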
