Density-Based Clustering for Real-Time Stream Data

Yixin Chen

Department of Computer Science and Engineering

Washington University in St. Louis

St. Louis, USA

[email protected]

Li Tu

Institute of Information Science and Technology

Nanjing University of Aeronautics and Astronautics

[email protected]

ABSTRACT

Existing data-stream clustering algorithms such as CluStream are based on k-means. These algorithms cannot find clusters of arbitrary shapes and cannot handle outliers. Further, they require knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component which maps each input data record into a grid and an offline component which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
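The online step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `GridDensityMap`, the uniform `cell_width`, and the exponential decay rule with factor `decay` are all introduced here for the example. The abstract itself only states that records are mapped to grids and that densities decay over time.

```python
import math
from collections import defaultdict


class GridDensityMap:
    """Toy online component: map records to grid cells, keep decayed densities."""

    def __init__(self, cell_width, decay):
        # Both parameters are illustrative assumptions, not values from the paper.
        self.cell_width = cell_width      # side length of every grid cell
        self.decay = decay                # assumed decay factor in (0, 1)
        self.density = defaultdict(float) # density stored at last update time
        self.last_update = {}             # time each cell was last touched

    def cell_of(self, record):
        # Discretize each coordinate into an integer cell index.
        return tuple(math.floor(x / self.cell_width) for x in record)

    def insert(self, record, t):
        g = self.cell_of(record)
        elapsed = t - self.last_update.get(g, t)
        # Fade the stored density, then count the new record with weight 1.
        self.density[g] = self.density[g] * self.decay ** elapsed + 1.0
        self.last_update[g] = t
        return g

    def density_at(self, g, t):
        # Density of cell g as seen at time t, without modifying any state.
        if g not in self.last_update:
            return 0.0
        return self.density[g] * self.decay ** (t - self.last_update[g])
```

Only the touched cell is updated on each arrival, so the per-record cost does not depend on the number of cells. An offline step would periodically read these densities and group neighboring dense cells into clusters; that part is not shown here.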

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - data mining

General Terms

Algorithms, Experimentation, Performance, Theory

Keywords

Stream data mining, density-based clustering, D-Stream, sporadic grids

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

KDD’07, August 12–15, 2007, San Jose, California, USA.

1. INTRODUCTION

Clustering high-dimensional stream data in real time is a difficult and important problem with ample applications such as network intrusion detection, weather monitoring, emergency response systems, stock trading, electronic business, telecommunication, planetary remote sensing, and web site analysis. In these applications, large volumes of multi-dimensional data streams arrive at the data collection center in real time. Examples such as the transactions in a supermarket and the phone records of a mobile phone company illustrate that the raw data typically have massive volume and can only be scanned once following the temporal order [7, 8]. Recently, there has been active research on how to store, query and analyze data streams.

Clustering is a key data mining task. In this paper, we consider clustering multi-dimensional data in the form of a stream, i.e. a sequence of data records stamped and ordered by time. Stream data clustering poses unprecedented difficulty for traditional clustering algorithms. There are several key challenges. First, the data can only be examined in one pass. Second, viewing a data stream as a long vector of data is not adequate in many applications. In fact, in many applications of data stream clustering, users are more interested in the evolving behaviors of clusters.

Recently, there have been different views and approaches to stream data clustering. Earlier clustering algorithms for data streams use a single-phase model which treats data stream clustering as a continuous version of static data clustering [9]. These algorithms use divide-and-conquer schemes that partition data streams into segments and discover clusters in data streams based on a k-means algorithm in finite space [10, 12]. A limitation of such schemes is that they put equal weights on outdated and recent data and cannot capture the evolving characteristics of stream data. Moving-window techniques are proposed to partially address this problem [2, 4].
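The effect of equal weighting can be seen in a small sketch. It is only an analogy under stated assumptions: a one-dimensional running mean stands in for a cluster summary, and the function names, the window size, and the example stream are hypothetical and do not come from the cited algorithms.

```python
from collections import deque


def cumulative_mean(stream):
    """Summary that gives equal weight to every record seen so far."""
    total, out = 0.0, []
    for n, x in enumerate(stream, start=1):
        total += x
        out.append(total / n)
    return out


def moving_window_mean(stream, window):
    """Summary over only the most recent `window` records."""
    buf, total, out = deque(), 0.0, []
    for x in stream:
        buf.append(x)
        total += x
        if len(buf) > window:
            total -= buf.popleft()  # forget the oldest record
        out.append(total / len(buf))
    return out


# Hypothetical stream whose level shifts from 0 to 10 halfway through.
stream = [0.0] * 6 + [10.0] * 6
equal_weight = cumulative_mean(stream)
windowed = moving_window_mean(stream, window=3)
```

At the end of this stream the equal-weight summary is still 5.0, halfway between the old and new levels, while the windowed summary has reached 10.0. The window follows the change but requires the user to choose its length.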

Another recent data stream clustering paradigm proposed by Aggarwal et al. uses a two-phase scheme [1] which consists of an online component that processes the raw data stream and produces summary statistics and an offline component that uses the summary data to generate clusters. Strategies for dividing the time horizon and managing the statistics are studied. The design leads to the CluStream system [1]. Many recent data stream clustering algorithms are based on CluStream’s two-phase framework. Wang et al. proposed an improved offline component using an incomplete partitioning strategy [17]. Extensions of this work including clustering multiple data streams [6], parallel data streams [5], and

Research Track Paper
