























Paper ID: 394
FindOut: Finding Outliers in Very Large Datasets
Dantong Yu, Gholam Sheikholeslami and Aidong Zhang, Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA
Abstract
Signal processing techniques have been introduced to transform images for image enhancement, filtering and restoration, analysis, and reconstruction. In this paper, we apply signal processing techniques to solve important KDD problems. In particular, we introduce a novel deviation (or outlier) detection approach, termed FindOut, based on wavelet transform. The main idea in FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research showed that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify various percentages of outliers from large datasets. Thus, we can combine both clustering and outlier detection in a unified approach. Experimental results on very large datasets are presented which show the efficiency and effectiveness of the proposed approach.
1 Introduction
Data mining is a process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial decisions. Modern companies are awash in data on customers, clients, suppliers and industry trends. But data is of little use without intelligence. Here, intelligence refers to combing through the data to notice patterns, devise rules, and make predictions about the future. Data mining technology permits organizations to make the most effective use of the vast amounts of data that they have gathered. For example, in the banking industry, data mining can be used in modeling and predicting credit fraud, evaluating risk, performing trend analysis, analyzing profitability, and helping with direct marketing campaigns.
One of the very interesting problems arising recently in the data mining research community is the deviation (or outlier) detection problem. Here outliers refer to those exceptions which deviate from the values anticipated based on a given (usually statistical) model or from some previously known expectation and norm [17]. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce exceptions, bankruptcy, credit card fraud, and even the analysis of performance statistics of the stock exchange [13]. Such knowledge can lead to detecting abnormal events that happened in the past or predicting potential trends in the future. Such potential trends may become new directions for future investment, marketing, and other purposes.
This research is partially supported by an NSF CAREER grant IIS-9733730.
1.1 Related Work
Many clustering algorithms have been recently proposed. They include partitioning algorithms such as CLARANS [15], improved k-means algorithms [4, 6], hierarchical algorithms such as BIRCH [24], density-based methods such as DBSCAN [5] and CURE [7], or grid-based algorithms including STING [21], DBCLASD [22], and WaveCluster [18]. Some other recent clustering algorithms are CLIQUE [1] and DENCLUE [8]. The goal of these methods is to detect the clusters in the data space. They try to remove or at least tolerate the outliers to better detect the clusters.
Most of the previous research in outlier detection has been in the field of statistics [9, 11, 3]. These methods usually make assumptions about the data distribution, the statistical distribution parameters, and the type or number of outliers. However, these parameters generally cannot be easily determined, which can make these methods difficult to apply. Ruts and Rousseeuw proposed a depth-based method to detect the outliers [16]. The objects are organized in layers in the data space according to some definition of depth. Outliers are expected to be in the deep layers. Such methods do not have the distribution fitting problem; however, in practice they may not be applicable to multidimensional spaces. Sarawagi et al. proposed a discovery-driven method in which the search for exceptions is guided by pre-computed indicators of exceptions at various levels of detail in a data cube [17]. They consider a value in a cell of a data cube to be an exception if it is significantly different from the value anticipated based on a statistical model. Although this method can handle hierarchies and ordered dimensions, model selection is still an issue for it.
Arning et al. introduced a method for outlier detection which relies on the observation that after seeing a series of similar data, an element disturbing the series is considered an outlier [2]. Their method requires a function that can yield the degree to which a data element causes the dissimilarity of the data set to increase. It looks for the subset of data that leads to the greatest reduction in Kolmogorov complexity for the amount of data discarded [2].
Knorr and Ng presented algorithms to detect Distance-Based outliers [12, 13]. They
then get a new function which is called the transformed distribution function. If the transformed distribution function can make the outliers more salient than other data points, we can then easily detect outliers within the dataset.
According to [10], many signal processing operations can be modeled as a linear space system. The two-dimensional case was introduced in [10]. For a $d$-dimensional space, a linear space ($LS_g$) system can be completely described by its impulse response function $g(x_1, x_2, \ldots, x_d)$. The function $g(x_1, x_2, \ldots, x_d)$ is also called the kernel function. When the input to the system is an impulse $\delta(x_1, x_2, \ldots, x_d)$, the output will be exactly the kernel function $g(x_1, x_2, \ldots, x_d)$. Thus, by inputting an impulse, we can obtain the kernel function and understand the functionality of the system. The system works as follows:
Input $f(x_1, x_2, \ldots, x_d)$
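The impulse property stated above can be checked numerically in the two-dimensional case. The sketch below is illustrative only and is not taken from the paper: it assumes that the system's output is the discrete convolution of the input with the kernel (the usual model for a linear, shift-invariant system), and the function name `convolve2d_full` and the sample kernel values are hypothetical.

```python
def convolve2d_full(f, g):
    """Full 2-D discrete convolution of two grids given as nested lists.

    Assumption: the linear system's output is the convolution of the
    input f with the kernel g (not stated explicitly in the text).
    """
    f_rows, f_cols = len(f), len(f[0])
    g_rows, g_cols = len(g), len(g[0])
    out = [[0.0] * (f_cols + g_cols - 1) for _ in range(f_rows + g_rows - 1)]
    for i in range(f_rows):
        for j in range(f_cols):
            if f[i][j] == 0:
                continue  # zero inputs contribute nothing to the output
            # Each nonzero input sample adds a scaled, shifted copy of g.
            for u in range(g_rows):
                for v in range(g_cols):
                    out[i + u][j + v] += f[i][j] * g[u][v]
    return out


# Hypothetical 3x3 kernel, chosen only for illustration.
kernel = [[0.0, 1.0, 0.0],
          [1.0, -4.0, 1.0],
          [0.0, 1.0, 0.0]]

# A unit impulse at the origin: the response is the kernel itself.
impulse = [[1.0]]
response = convolve2d_full(impulse, kernel)

# A unit impulse at position (1, 1) of a 3x3 grid: the response is the
# same kernel, shifted by one row and one column.
shifted_impulse = [[0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]]
shifted_response = convolve2d_full(shifted_impulse, kernel)
```

Under this assumption, feeding in an impulse recovers the kernel, which is why the kernel function is enough to characterize the system.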