
Paper ID: 394

FindOut: Finding Outliers in Very Large Datasets 

Dantong Yu, Gholam Sheikholeslami and Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260, USA

Abstract

Signal processing techniques have been introduced to transform images for image enhancement, filtering and restoration, analysis, and reconstruction. In this paper, we apply signal processing techniques to solve important KDD problems. In particular, we introduce a novel deviation (or outlier) detection approach, termed FindOut, based on wavelet transform. The main idea in FindOut is to remove the clusters from the original data and then identify the outliers. Although previous research showed that such techniques may not be effective because of the nature of the clustering, FindOut can successfully identify various percentages of outliers from large datasets. Thus, we can combine both clustering and outlier detection in a unified approach. Experimental results on very large datasets are presented which show the efficiency and effectiveness of the proposed approach.

1 Introduction

Data mining is a process of extracting valid, previously unknown, and ultimately comprehensible information from large databases and using it to make crucial decisions. Modern companies are awash in data on customers, clients, suppliers and industry trends. But data is of little use without intelligence. Here, intelligence refers to combing through the data to notice patterns, devise rules, and make predictions about the future. Data mining technology permits organizations to make the most effective use of the vast amounts of data that they have gathered. For example, in the banking industry, data mining can be used in modeling and predicting credit fraud, evaluating risk, performing trend analysis, analyzing profitability, and helping with direct marketing campaigns.

One of the very interesting problems arising recently in the data mining research community is the deviation (or outlier) detection problem. Here, outliers refer to those exceptions which deviate from the values anticipated based on a given (usually statistical) model or from some previously known expectation and norm [17]. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce exceptions, bankruptcy, credit card fraud, and even the analysis of performance statistics of stock exchanges [13]. Such knowledge can help in detecting abnormal events that happened in the past or in predicting potential trends in the future. Such potential trends may become new directions for future investment, marketing, and other purposes.

(*) This research is partially supported by an NSF CAREER grant IIS-9733730.

1.1 Related Work

Many clustering algorithms have been proposed recently. They include partitioning algorithms such as CLARANS [15], improved k-means algorithms [4, 6], hierarchical algorithms such as BIRCH [24], density-based methods such as DBSCAN [5] and CURE [7], and grid-based algorithms including STING [21], DBCLASD [22], and WaveCluster [18]. Other recent clustering algorithms include CLIQUE [1] and DENCLUE [8]. The goal of these methods is to detect the clusters in the data space. They try to remove or at least tolerate the outliers to better detect the clusters.

Most of the previous research in outlier detection has been in the field of statistics [9, 11, 3]. These methods usually make assumptions about the data distribution, the statistical distribution parameters, or the type or number of outliers. However, these parameters generally cannot be easily determined, which can make these methods difficult to apply. Ruts and Rousseeuw proposed a depth-based method to detect outliers [16]. The objects are organized in layers in the data space according to some definition of depth. Outliers are expected to be in the deep layers. Such methods do not have the distribution fitting problem; however, in practice they may not be applicable to multidimensional spaces. Sarawagi et al. proposed a discovery-driven method in which the search for exceptions is guided by pre-computed indicators of exceptions at various levels of detail in a data cube [17]. They consider a value in a cell of a data cube to be an exception if it is significantly different from the value anticipated based on a statistical model. Although this method can handle hierarchies and ordered dimensions, model selection is still an issue for it.

Arning et al. introduced a method for outlier detection which relies on the observation that, after seeing a series of similar data, an element disturbing the series is considered an outlier [2]. Their method requires a function that can yield the degree to which a data element causes the dissimilarity of the data set to increase. It looks for the subset of data that leads to the greatest reduction in Kolmogorov complexity for the amount of data discarded [2].

Knorr and Ng presented algorithms to detect Distance-Based outliers [12, 13]. They

then get a new function, which is called the transformed distribution function. If the transformed distribution function makes the outliers more salient than the other data points, we can then easily detect the outliers within the dataset.

According to [10], many signal processing operations can be modeled as a linear space system. The two-dimensional case was introduced in [10]. For a $d$-dimensional space, a linear space system $LS_g$ can be completely described by its impulse response function $g(x_1, x_2, \ldots, x_d)$, which is also called the kernel function. When the input to the system is an impulse $\delta(x_1, x_2, \ldots, x_d)$, the output will be exactly the kernel function $g(x_1, x_2, \ldots, x_d)$. Thus, by inputting an impulse, we can obtain the kernel function and understand the functionality of the system. The system works as follows:

Input $f(x_1, x_2, \ldots, x_d)$ → Linear space system $g(x_1, x_2, \ldots, x_d)$ → Output $\hat{f}_g(x_1, x_2, \ldots, x_d)$

where $f(x_1, x_2, \ldots, x_d)$ and $\hat{f}_g(x_1, x_2, \ldots, x_d)$ are the input and output signals. For such a system, the output $\hat{f}_g(x_1, x_2, \ldots, x_d)$ is the convolution of $f(x_1, x_2, \ldots, x_d)$ with the impulse response $g(x_1, x_2, \ldots, x_d)$ and is defined as follows:

$$\hat{f}_g(x_1, x_2, \ldots, x_d) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{d} f(x'_1, x'_2, \ldots, x'_d)\, g(x_1 - x'_1, \ldots, x_d - x'_d)\, dx'_1 \cdots dx'_d.$$

For discrete functions and a given linear system, this becomes:

$$\hat{f}_g(i_1, i_2, \ldots, i_d) = f(i_1, i_2, \ldots, i_d) \star g(i_1, i_2, \ldots, i_d) = \sum_{k_1=1}^{m_1} \sum_{k_2=1}^{m_2} \cdots \sum_{k_d=1}^{m_d} f(k_1, k_2, \ldots, k_d)\, g(i_1 - k_1, i_2 - k_2, \ldots, i_d - k_d),$$

where $\star$ is the convolution operation. When the kernel is a wavelet function $w$, we call the linear system a wavelet transform system. We denote the output as $\hat{\rho}_w$ when the input is a function $\rho$.
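The impulse-response property described above can be checked numerically. The following is a minimal two-dimensional sketch in plain Python, not the authors' implementation: the function name `convolve2d`, the dictionary representation of the kernel, the zero padding at the grid border, and the example kernel weights are all assumptions made for illustration.

```python
def convolve2d(f, g):
    """Discrete 2-D convolution: out[i][j] = sum over k of f[k1][k2] * g(i - k1, j - k2).

    f is a list of rows. g maps integer offsets (di, dj) to kernel weights and
    is treated as zero everywhere else. Output cells outside the grid are dropped.
    """
    rows, cols = len(f), len(f[0])
    out = [[0.0] * cols for _ in range(rows)]
    for k1 in range(rows):
        for k2 in range(cols):
            if f[k1][k2] == 0:
                continue  # zero entries contribute nothing to the sum
            for (di, dj), weight in g.items():
                i, j = k1 + di, k2 + dj  # chosen so that (i - k1, j - k2) == (di, dj)
                if 0 <= i < rows and 0 <= j < cols:
                    out[i][j] += f[k1][k2] * weight
    return out


# Hypothetical asymmetric kernel, used only to make the orientation visible.
kernel = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.25}

# A unit impulse in the middle of a 5x5 grid.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0

# The response reproduces the kernel, shifted to the position of the impulse.
response = convolve2d(impulse, kernel)
```

Because the input is a single unit impulse, every nonzero entry of `response` is one kernel weight, which is the sense in which the kernel fully characterizes the system.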

2.2 Problem Formalization

Let $A = \{A_1, A_2, \ldots, A_d\}$ be a set of bounded, totally ordered domains and $S = A_1 \times A_2 \times \cdots \times A_d$ be a $d$-dimensional numerical space or feature space. $A_1, \ldots, A_d$ are referred to as the dimensions of $S$. The input dataset is a set of $d$-dimensional points $O = \{o_1, o_2, \ldots, o_N\}$, where $o_i = \langle o_{i1}, o_{i2}, \ldots, o_{id} \rangle$, $1 \le i \le N$. The $j$-th component of $o_i$ is drawn from domain $A_j$.

We first partition the original feature space into nonoverlapping hyper-rectangles, which we call cells, to generate the quantized space. The cells are obtained by segmenting every dimension $A_i$ into $m_i$ intervals. Each cell $c_i$ is the intersection of one interval from each dimension. It has the form $\langle c_{i1}, c_{i2}, \ldots, c_{id} \rangle$, where $c_{ij} = [l_{ij}, h_{ij})$ is the right-open interval in the partitioning of $A_j$. Each cell $c_i$ has a list of statistical parameters $c_i.param$ associated with it.

We say that a point $o_k = \langle o_{k1}, \ldots, o_{kd} \rangle$ is contained in a cell $c_i$ if $l_{ij} \le o_{kj} < h_{ij}$ for $1 \le j \le d$. The list $c_i.param$ keeps track of statistical properties such as the aggregation, mean, variance, and probability distribution of the data points contained in the cell $c_i$. In general, grid-based approaches discover the containment relations and compute the appropriate statistical parameters in a single pass through the dataset. Each cell has information about the density of the data contained in the cell. Thus, the collection of $c_i.param$ summarizes the dataset.
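The single-pass quantization described above can be sketched as follows. This is an illustrative sketch only: the function name `quantize`, the use of the same number of intervals `m` in every dimension, and the clamping of boundary values into the last cell are assumptions, not details given in the text.

```python
from collections import Counter


def quantize(points, lows, highs, m):
    """Count the points falling in each cell of a grid with m intervals per dimension.

    lows[j] and highs[j] bound dimension j. Returns a Counter mapping a cell
    index tuple to the number of points it contains. Empty cells are absent.
    """
    counts = Counter()
    for point in points:
        cell = []
        for value, low, high in zip(point, lows, highs):
            width = (high - low) / m
            index = int((value - low) / width)
            # Intervals are right-open. As an assumption, clamp so that a value
            # equal to the upper bound still lands in the last cell.
            cell.append(min(max(index, 0), m - 1))
        counts[tuple(cell)] += 1
    return counts


points = [(0.1, 0.2), (0.15, 0.2), (0.9, 0.9), (0.5, 0.05)]
density = quantize(points, lows=(0.0, 0.0), highs=(1.0, 1.0), m=4)
```

Each point is visited once, so the pass is linear in the number of points, and only nonempty cells consume memory.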

We choose the number of points contained in each cell as the only statistic to be used. That is, we use $c_i.count$ as $c_i.param$, and we call it the density. Thus a density function $\rho(c_i)$ is specified in the quantized space. A cell in the quantized space with a count value of 0 is called an empty cell. A cell in the quantized space with a nonzero count value is called a nonempty cell. In our approach, we apply wavelet transform on the $c_i.count$ values. The transformed space is the result of applying wavelet transformation to the count values of the cells in the quantized space. The procedure can be easily illustrated by a linear space system. A wavelet transform is described by a wavelet function $w$, thus a linear space system $LS_w$ is defined based on $w$. A density function $\rho$ is defined on the quantized space, which maps a cell $c_i$ to $c_i.param$. The result of $LS_w$ is a new density function $\hat{\rho}_w$ on the transformed space, which maps $c_i$ to the transformed value of $c_i.param$, as follows:

Input $\rho(c_i)$ → Linear space system $w$ → Output $\hat{\rho}_w(c_i)$

We introduce the following definitions to be used in the rest of the paper.

Definition 1 (Significant cell) Given a linear space system $LS_w$ and density function $\rho$, a cell $c$ is a significant cell if $\hat{\rho}_w(c) \ge \tau$, where $\tau = p \times V$, $p$ is an input parameter and $V$ is a statistical value of $\hat{\rho}_w$.

The statistical value $V$ can be the average value, maximum value or summation of $\hat{\rho}_w$.

Definition 2 ($\epsilon$-neighbor) A cell $c_1$ is an $\epsilon$-neighbor of cell $c_2$ if both are significant cells in the transformed space and $D(c_1, c_2) \le \epsilon$, where $D$ is an appropriate distance metric and $\epsilon > 0$.

Definition 3 (Cluster) A cluster $C$ is a set of significant cells $\{c_1, c_2, \ldots, c_m\}$ which are $\epsilon$-connected in the transformed space.
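The three definitions can be put together in a short sketch. Several choices here are assumptions rather than details from the text: the maximum is used as the statistical value V, the Chebyshev distance stands in for the metric D, and "connected" is taken to mean linked through a chain of neighbors. The function names are hypothetical.

```python
from collections import deque


def significant_cells(transformed, p, statistic=max):
    """Cells whose transformed density is at least the threshold p * V."""
    threshold = p * statistic(transformed.values())
    return {cell for cell, value in transformed.items() if value >= threshold}


def chebyshev(c1, c2):
    """Largest per-dimension index difference between two cells."""
    return max(abs(a - b) for a, b in zip(c1, c2))


def clusters(cells, eps=1, distance=chebyshev):
    """Group significant cells into components linked by chains of eps-neighbors."""
    unvisited = set(cells)
    result = []
    while unvisited:
        seed = unvisited.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            current = queue.popleft()
            near = {c for c in unvisited if distance(current, c) <= eps}
            unvisited -= near
            component |= near
            queue.extend(near)
        result.append(component)
    return result


# Toy transformed densities keyed by 2-D cell index.
transformed = {(0, 0): 9.0, (0, 1): 8.0, (1, 1): 7.5,
               (5, 5): 6.0, (5, 6): 7.0, (3, 3): 0.5}
dense = significant_cells(transformed, p=0.5)
found = clusters(dense)
```

The neighbor search above compares each cell against all unvisited cells, which is quadratic in the number of significant cells. It is written for readability, and a grid lookup of adjacent indices would avoid that cost.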

2.4 Wavelet Transform

Wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands (for example, a high frequency subband and a low frequency subband). A one-dimensional signal $s$ can be filtered by convolving the filter coefficients $c_k$ with the signal values:

$$\hat{s}_i = \sum_{k=0}^{M-1} c_k\, s_{i+k-\frac{M}{2}}, \qquad (1)$$

where $M$ is the number of coefficients in the filter and $\hat{s}$ is the result of the convolution. Wavelet transform provides us with a set of interesting filters. For example, Figure 1 shows the Cohen-Daubechies-Feauveau (2,2) biorthogonal wavelet [20].

Figure 1: Cohen-Daubechies-Feauveau (2,2) biorthogonal wavelet.
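Equation (1) can be evaluated directly. The sketch below assumes the sum runs over the M filter coefficients with index k starting at 0, uses integer division for M/2, and treats samples outside the signal as zero. The five coefficients are a low-pass filter commonly quoted for the (2,2) biorthogonal family in a sum-to-one normalization, and should be read as an assumption rather than as the filter used in the paper.

```python
def apply_filter(signal, coeffs):
    """Evaluate s_hat[i] = sum_{k=0}^{M-1} coeffs[k] * signal[i + k - M // 2]."""
    m = len(coeffs)
    n = len(signal)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            j = i + k - m // 2
            if 0 <= j < n:  # samples outside the signal count as zero
                total += c * signal[j]
        out.append(total)
    return out


# Assumed 5-tap low-pass filter whose coefficients sum to 1.
low_pass = [-0.125, 0.25, 0.75, 0.25, -0.125]

constant = [4.0] * 9
smoothed = apply_filter(constant, low_pass)
```

Since the coefficients sum to one, a constant signal passes through unchanged away from the borders. Only the first and last two samples are affected by the zero padding.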

We now briefly review wavelet-based multi-resolution decomposition. More details can be found in Mallat's paper [14]. To have a multi-resolution representation of signals, we can use the discrete wavelet transform. We can compute a coarser approximation of the one-dimensional input signal $S_0$ by convolving it with the low pass filter $\tilde{H}$ and downsampling the signal by two [14]. All the discrete approximations $S_j$, $1 < j < J$ ($J$ is the maximum possible scale), can thus be computed from $S_0$ by repeating this process. Figure 2 illustrates the method.

Figure 2: Block diagram of multi-resolution wavelet transform.

$D_j$ denotes the difference between $S_j$ and $S_{j-1}$ and is called the detail signal at scale $j$. We can compute the detail signal $D_j$ by convolving $S_{j-1}$ with the high pass filter $\tilde{G}$ and returning every other sample of the output. The wavelet representation of a discrete signal $S_0$ can therefore be computed by successively decomposing $S_j$ into $S_{j+1}$ and $D_{j+1}$ for $0 \le j < J$. This representation provides information about the signal approximation and the detail signals at different scales. We can easily generalize the wavelet model to a $d$-dimensional data space, in which the one-dimensional transform can be applied multiple times.
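The successive decomposition can be illustrated with the simplest wavelet. The sketch below uses an unnormalized Haar pair, with pairwise averages as the low-pass output and pairwise half-differences as the high-pass output, each followed by downsampling by two. The choice of Haar, the averaging normalization, and the function names are assumptions made to keep the arithmetic easy to follow.

```python
def haar_step(signal):
    """One level: pairwise averages (approximation) and half-differences (detail)."""
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / 2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in pairs]
    return approx, detail


def decompose(signal, levels):
    """Return the coarsest approximation and the detail signals, finest first."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


s0 = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
s2, (d1, d2) = decompose(s0, levels=2)
```

For this input, the first level gives the approximation 5, 11, 8, 1 with detail -1, -1, 0, -1, and the second level gives the approximation 8, 4.5 with detail -3, 3.5. Each level halves the signal length, so the total work stays linear in the input size.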

2.5 Outlier Detection

We proposed WaveCluster, which partitions the feature space into cells to generate a quantized space [18]. It then applies wavelet transform on the quantized space. After finding the clusters in the transformed space, it assigns labels to the cells according to the cluster that they belong to. Finally, it assigns labels to the objects in the cells based on the cluster label of each cell. WaveCluster can detect arbitrary shape clusters such as convex, concave, or nested clusters [18]. Using the multiresolution property of wavelets, WaveCluster detects the clusters at different degrees of detail and does not require the exact number of clusters as input. WaveCluster takes advantage of the filters in wavelet transform and removes the noise from the feature space without requiring extra processing time. It is a very fast and efficient method for very large databases with a low number of dimensions. Using parallel processing, it can be further sped up.

Wavelet transform convolves its input with a low pass filter. A low pass filter has the property of removing the outliers (noise) from the input. It considers the influence of the neighboring objects in computing its output. For example, Figure 3-a presents a one-dimensional dataset (signal) containing some noise. Figure 3-b shows the signal after convolving with a low pass filter. It shows how the low pass filter removes the outliers and smooths the input data.

Figure 3: a) Original signal; b) Signal after convolving with low pass filter.
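The smoothing effect can be reproduced on a toy one-dimensional density. The sketch below uses a centered moving average as a crude stand-in for a wavelet low pass filter. The window width, the zero padding, and the example counts are assumptions chosen for illustration.

```python
def smooth(signal, width=5):
    """Centered moving average with zero padding, used as a simple low-pass filter."""
    half = width // 2
    n = len(signal)
    return [
        sum(signal[j] for j in range(i - half, i + half + 1) if 0 <= j < n) / width
        for i in range(n)
    ]


# A dense run of cells (a cluster) plus one isolated nonempty cell (an outlier).
counts = [0, 0, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 3, 0, 0, 0]
smoothed = smooth(counts)
```

The middle of the dense run keeps its value of 9, while the isolated cell drops from 3 to 0.6 because none of its neighbors support it. This is the property that lets a threshold on the smoothed values separate cluster cells from isolated ones.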

WaveCluster takes advantage of this property of low pass filters and removes the outliers that may cause problems in detecting the clusters. However, in FindOut our goal is to find

Figure 4-a shows an example of a two-dimensional feature space $S_{orig}$. The gray scale value of each pixel in the image reflects the number of objects in the corresponding cell in the original feature space. Darker pixels have more objects. Figure 4-b shows the two clusters detected by WaveCluster in $S_{1,c}$. If we remove the pixels from the original space which correspond to the cluster cells, we will get the space $S_{orig} - S_{1,c}$ shown in Figure 4-c. For better visualization, all nonempty cells in Figure 4-c (and 4-d) are shown in black instead of gray scale. Figure 4-c shows that, in addition to the outliers, the points on the boundaries of clusters are also included in the result. The goal of WaveCluster is to detect the core of the main clusters in the dataset, so it applies those wavelet transforms that can clear the area around the clusters to make them more distinct. However, such boundary points should not be considered outliers. To detect the clusters, we use the information in the low frequency subbands (approximation signals). The other, detail frequency subbands $S_{l,D}$ have information about the boundaries of the clusters. We can apply this information to remove the cluster boundary points from the set of outliers. To do so, we subtract $S_{l,D}$ from the spaces in the previous step and get the results shown in Figure 4-d.

Figure 4: a) Original data; b) Clusters detected by WaveCluster; c) Original data − clusters; d) Outliers detected by FindOut.

As we mentioned earlier, we present FindOut as a tool in WaveCluster. If we are only interested in the outliers and not the clusters, some steps of WaveCluster can be skipped with little effect on the effectiveness of FindOut. This will make FindOut more efficient. For example, we do not have to find the connected components, and instead of $S_{l,c}$ (the result of step 1.4), we can subtract $S_{l,A}$ from the original space $S_{orig}$. Based on our experiments, the results from the first level of wavelet transform usually contain all the outliers, thus we can avoid the computation required by the next levels of the wavelet transform. We are working on these issues as part of our ongoing research.
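The simplified variant described above, which subtracts the significant cells of the approximation from the original space without finding connected components, can be sketched as follows. This is a rough illustration and not the authors' algorithm: a 3x3 mean filter stands in for the wavelet low pass filter, no downsampling is performed, the maximum is used as the statistical value in the threshold, and the boundary correction based on the detail subbands is omitted. All names are hypothetical.

```python
def low_pass(counts, size):
    """3x3 mean filter over a size x size grid of cell counts, zero padded."""
    out = {}
    for i in range(size):
        for j in range(size):
            total = sum(
                counts.get((i + di, j + dj), 0)
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
            )
            out[(i, j)] = total / 9.0
    return out


def find_outlier_cells(counts, size, p):
    """Nonempty cells of the original space that are not significant after smoothing."""
    smoothed = low_pass(counts, size)
    threshold = p * max(smoothed.values())
    cluster_cells = {cell for cell, value in smoothed.items() if value >= threshold}
    return {cell for cell, n in counts.items() if n > 0 and cell not in cluster_cells}


# Toy quantized space: a dense 3x3 block and two isolated single-point cells.
counts = {(i, j): 20 for i in range(1, 4) for j in range(1, 4)}
counts[(6, 6)] = 1
counts[(0, 7)] = 1

outliers = find_outlier_cells(counts, size=8, p=0.3)
```

Every cell of the dense block stays above the threshold after smoothing, so only the two isolated cells remain once the cluster cells are subtracted from the nonempty cells.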

2.6 Time Complexity

The time complexity of steps 1.1 and 1.4 of FindOut's algorithm is $O(N)$, because they scan all the database objects. Assuming $m$ cells in each dimension of the feature space, there would be $K = m^d$ cells. The complexity of applying wavelet transform would be $O(K)$. To find the connected components, the required time is $O(K)$. Subtracting the spaces in step 2 also has a complexity of $O(K)$. Thus the time complexity of processing the data (without considering I/O), which is performed in steps 1.2, 1.3 and 2, would in fact be $O(K)$. Since this algorithm is applied on very large databases with a low number of dimensions, we can assume that $N \ge K$. As an example, for a database with 1,000,000 objects, this condition holds when the number of dimensions $d$ is less than or equal to 6 and the number of intervals $m$ is 10. Thus, based on this assumption, the overall time complexity of the algorithm will be $O(N)$.
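The example in this paragraph is simple arithmetic on $K = m^d$, and the short check below tabulates it for the stated values. The variable names are illustrative only.

```python
def cell_count(m, d):
    """Total number of cells K when each of d dimensions has m intervals."""
    return m ** d


n_objects = 1_000_000
m = 10

# K for d = 1..7 and whether the assumption N >= K holds for each d.
cells = {d: cell_count(m, d) for d in range(1, 8)}
holds = {d: n_objects >= k for d, k in cells.items()}
```

With these values, $K$ reaches 1,000,000 at $d = 6$, so the assumption holds for every $d$ up to 6 and fails at $d = 7$, where $K$ is ten times larger than $N$.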

3 Experiments

In this section, we evaluate the performance of FindOut and demonstrate its effectiveness and efficiency on different types of data distributions. Tests were done on synthetic datasets generated by us and on a real-world dataset. We present the experiments on 2-dimensional datasets so that the results can be easily visualized. FindOut can also be applied to low-dimensional data with 3, 4, or 5 dimensions. As for higher dimensions, [23] provided a hash-based algorithm to address the high-dimensionality issue. All of the tests were done on a Sun UltraSPARC 168 MHz machine with 1024 MB of main memory.

Figure 7: FindOut on datasets with different levels of outliers: a) datasets with outliers; b) detected outliers.

The important role of thresholding in outlier detection. The selection of the threshold $p$ (defined in Section 2.2) directly affects the results of the outlier detection. Higher thresholds may result in a larger number of outliers. We used a simple uniform threshold value in our tests. But for better performance on real-world datasets, knowledge such as the intensity characteristics of the dataset and the percentage of outliers can be used in thresholding. Figure 8 shows different threshold values applied for detecting outliers in dataset DS3-10. We can see that the truly dense and sparse areas do not change much when the threshold changes, but the border area between them is highly affected by the chosen threshold.
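The effect of the threshold can be seen on a handful of made-up smoothed densities. In the sketch below, the cell labels and values are invented for illustration, the maximum is assumed as the statistical value V, and a nonempty cell is counted as an outlier cell when its smoothed density falls below p times V.

```python
def count_outlier_cells(smoothed, p):
    """Number of cells whose smoothed density falls below the threshold p * max."""
    threshold = p * max(smoothed.values())
    return sum(1 for value in smoothed.values() if value < threshold)


# Invented smoothed densities for nonempty cells in three kinds of regions.
smoothed = {
    "core_a": 20.0, "core_b": 18.0,      # inside a cluster
    "border_a": 8.0, "border_b": 6.0,    # on a cluster boundary
    "sparse_a": 0.7, "sparse_b": 0.3,    # isolated points
}

thresholds = [0.01, 0.02, 0.05, 0.10, 0.25, 0.50]
sweep = [count_outlier_cells(smoothed, p) for p in thresholds]
```

The count never decreases as p grows. The sparse cells are picked up at small thresholds, the core cells are never flagged in this range, and the border cells switch sides only at the largest threshold, which mirrors the sensitivity of the border area described above.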

Figure 8: The outliers of DS3-10 with different threshold values p: a) p=0.01; b) p=0.02; c) p=0.05; d) p=0.10; e) p=0.25; f) p=0.50.

Timing on different quantization and wavelet transform. The quantization $m$ affects the running time. When $m$ is low (very coarse quantization), the total number of cells is small, thus it takes less time. The running time is also affected by wavelet functions with different widths. The width of the wavelet function determines the range of the neighborhood. The larger the neighborhood considered around a data point, the longer it takes to detect the outliers. Figure 9 shows the CPU time when the wavelets Haar, Cohen-Daubechies-Feauveau (2,2), Cohen-Daubechies-Feauveau (4,2) [20], and Daubechies [19] are used on dataset DS3-10 under different quantizations.

The code and the data reported in [13] were not available for experimental comparisons. Their experiments show that their cell-based algorithm is the most efficient when the number of dimensions is less than or equal to 4. However, for a higher number of dimensions, its

3.2 A Real-World Dataset

Dataset sat-hue is created based on the image feature vectors from the University of California at Irvine Machine Learning Repository. The size of the dataset is 392,402, and each element describes the hue and saturation values of segmentations of seven types of images: brickface, sky, foliage, cement, window, path and grass. To scale up the test, we randomly generated 10 to 1000 vectors around each basic feature vector, within the hypersphere of a small radius $r$ centered at the basic feature vector. Figure 10 shows the histogram of dataset sat-hue and the detected outliers. All the detected points have exceptional hue and saturation values.
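The scale-up step can be sketched as follows. The text does not state how points are distributed inside the hypersphere, so uniform sampling is assumed here. The function names, the seed, the radius, and the two base vectors are all hypothetical.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball of the given radius around center."""
    d = len(center)
    # A Gaussian vector gives a uniformly random direction once normalized.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    # Scaling the radius by U**(1/d) makes the point uniform in volume.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return tuple(c + scale * x for c, x in zip(center, direction))


def scale_up(base_vectors, radius, low=10, high=1000, seed=0):
    """Generate between low and high points around each base vector."""
    rng = random.Random(seed)
    points = []
    for center in base_vectors:
        for _ in range(rng.randint(low, high)):
            points.append(sample_in_ball(center, radius, rng))
    return points


base = [(0.2, 0.4), (0.7, 0.1)]
generated = scale_up(base, radius=0.05)
```

Fixing the seed makes the generated dataset reproducible, which helps when comparing runs under different quantizations or thresholds.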

Figure 10: Dataset sat-hue: a) Dataset sat-hue; b) Detected outliers. Both panels plot density against hue and saturation.

4 Conclusion

This paper has addressed the interesting question of how to apply signal processing techniques to solve important KDD problems, and the effectiveness of such approaches. We presented a novel deviation (or outlier) detection approach, termed FindOut, based on wavelet transform. By combining outlier detection with clustering, FindOut can successfully identify various percentages of outliers from large datasets. Thus, our approach is highly cost-effective. Experimental results on very large datasets have demonstrated the efficiency and effectiveness of the proposed approach. This approach can have many potential uses. Based on the outliers found, we can build a model that describes what kind of data have an abnormal distance to the majority and thus become outliers. This model can be used to make predictions on future datasets.

References

[1] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 94-105, Seattle, WA, 1998.

[2] A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 164-169, 1996.

[3] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley, 3rd edition, 1994.

[4] P. S. Bradley, Usama Fayyad, and Cory Reina. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 9-15, New York, August 1998.

[5] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on KDD, 1996.

[6] Usama Fayyad, Cory Reina, and P. S. Bradley. Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 194-198, New York, August 1998.

[7] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 73-84, Seattle, WA, 1998.

[8] Alexander Hinneburg and Daniel A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 58-65, New York, August 1998.

[9] D. Hoaglin, F. Mosteller, and J. Tukey. Understanding Robust and Exploratory Data Analysis. John Wiley, New York, 1983.

[10] Ramesh Jain, Rangachar Kasturi, and Brian G. Schunck. Machine Vision. MIT Press,

[11] R. Johnson. Applied Multivariate Statistical Analysis. Prentice Hall, 1992.

[23] D. Yu, S. Chatterjee, G. Sheikholeslami, and A. Zhang. Efficiently detecting arbitrary shaped clusters in very large datasets with high dimensions. Technical Report 98-8, State University of New York at Buffalo, Department of Computer Science and Engineering, November 1998.

[24] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada,