Streaming Algorithms: Getting Number of Distinct Elements and Computing Second Moment | Study notes Advanced Algorithms

CS787: Advanced Algorithms

Scribe: Priyananda Shenoy and Shijin Kong

Lecturer: Shuchi Chawla

Topic: Streaming Algorithms (continued)

Date: 10/26/2007

In this lecture we continue our discussion of streaming algorithms, covering algorithms for estimating the number of distinct elements in a stream and for computing the second moment of the element frequencies.

18.1 Introduction and Recap

Let us review the basic framework of streaming. Suppose we have a stream coming in from time 0 to time T. At each time t ∈ [T], the arriving element is a_t ∈ [n]. The frequency of an element i is defined as m_i = |{j : a_j = i}|. We will look at a few problems in this framework. For each problem, we expect a complexity of only O(log n) and O(log T) in storage and in update time. Under these strong constraints we should also make only a few passes over the stream.
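As a point of reference, the frequencies m_i can be computed exactly if memory is not a concern. The sketch below is only a non-streaming baseline: the function name `frequencies` is made up for illustration, and the table it builds needs space proportional to the number of distinct elements, which is exactly what the streaming constraints rule out.

```python
from collections import Counter

def frequencies(stream):
    """Return m_i = |{j : a_j = i}| for every element i that appears."""
    # One pass over the stream, but one counter per distinct element.
    return Counter(stream)

stream = [3, 1, 3, 2, 3, 1]
m = frequencies(stream)
print(m[3], m[1], m[2])  # 3 2 1
print(len(m))            # 3 distinct elements
```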

The first problem we will solve in this lecture is estimating the number of distinct elements. A simple way to do this is to sample the elements at random, maintain statistics over the sample, and then extrapolate the real number of distinct elements from those statistics. For example, if we pick k of the n elements uniformly at random and count the number of distinct elements in the sample, we can roughly estimate the number of distinct elements overall by scaling the result up by n/k. However, in many cases where the elements are unevenly distributed (e.g. some elements dominate), random sampling with a small k misses most of the rare elements, and the resulting estimate can be far from the actual number of distinct elements. In order to be accurate we need k = Ω(n).
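The weakness of the sampling approach can be seen on two extreme inputs. The sketch below assumes the sample count is scaled by n/k and that sampling is done without replacement; the function name `sample_estimate_distinct` is hypothetical. Both inputs are chosen so that the outcome does not depend on which positions happen to be sampled.

```python
import random

def sample_estimate_distinct(stream, k, rng):
    """Estimate the number of distinct elements from k uniform samples."""
    n = len(stream)
    sample = rng.sample(stream, k)   # k positions, without replacement
    return len(set(sample)) * n / k  # scale the sample's distinct count by n/k

rng = random.Random(0)
all_distinct = list(range(1000))  # true answer: 1000
one_value = [7] * 1000            # true answer: 1

print(sample_estimate_distinct(all_distinct, 10, rng))  # 1000.0
print(sample_estimate_distinct(one_value, 10, rng))     # 100.0
```

With k = 10 the sample cannot tell these two streams apart from many in-between cases, so the same rule that is exact on the first input is off by a factor of 100 on the second.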

The second problem, computing the second moment of the element frequencies, Σ_i m_i^2, depends on accurate estimation of the m_i. Again we can apply random sampling in a similar way and estimate the frequency of each element. But we again run into the inaccuracy of random sampling: some elements may not be sampled at all.
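For concreteness, here is a small exact computation of the second moment, using the standard definition as the sum of squared frequencies. It is a non-streaming baseline meant only to pin down the quantity being estimated; the name `second_moment` is an illustrative choice.

```python
from collections import Counter

def second_moment(stream):
    """Exact second moment: the sum over elements i of m_i squared."""
    return sum(m * m for m in Counter(stream).values())

# Frequencies are m_3 = 3, m_1 = 2, m_2 = 1, so the result is 9 + 4 + 1.
print(second_moment([3, 1, 3, 2, 3, 1]))  # 14
```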

To address these problems, we introduce two algorithms that perform better than the random-sampling-based approach. Before that, let us define what an unbiased estimator is:

Definition 18.1.1 An unbiased estimator for a quantity Q is a random variable X such that E[X] = Q.
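A small worked example may help make the definition concrete. Pick a position J uniformly at random from the stream and let X = T if a_J = i, and X = 0 otherwise. Then E[X] = T · Pr[a_J = i] = T · (m_i / T) = m_i, so X is an unbiased estimator for m_i. This estimator is an illustration chosen here, not one taken from the lecture. The sketch below checks the claim by averaging X over all T equally likely positions, using exact rational arithmetic.

```python
from fractions import Fraction

def expected_value(stream, i):
    """E[X] for X = T * 1[a_J = i], with J uniform over the T positions."""
    T = len(stream)
    values = [T if a == i else 0 for a in stream]  # X for each choice of J
    return Fraction(sum(values), T)                # average over all J

stream = [3, 1, 3, 2, 3, 1]
print(expected_value(stream, 3))  # 3, which equals m_3
print(expected_value(stream, 1))  # 2, which equals m_1
```

Being unbiased says nothing about how concentrated X is: here a single sample is either 0 or T, so the variance is large even though the expectation is exactly right.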

18.2 Number of Distinct Elements

Let c denote the number of distinct elements. We are going to prove that we can probabilistically distinguish between c < k and c > 2k using a single bit of memory. Here k is related to a family H of hash functions, where

∀ h ∈ H, h : [n] → [k]
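To illustrate what a single bit of state on top of a hash function h : [n] → [k] can look like, here is a minimal sketch. Everything specific in it is an assumption: the notes do not say which family H is used, so the linear family h(x) = ((a·x + b) mod p) mod k with a prime p ≥ n is a stand-in, buckets are numbered 0 to k−1, and the rule "set the bit if some element lands in bucket 0" is one plausible single-bit test rather than necessarily the one used in the lecture. The intuition is that the more distinct elements the stream has, the more likely some element hits the chosen bucket.

```python
import random

class OneBitSketch:
    """Keeps one bit: has any stream element hashed to bucket 0?"""

    def __init__(self, k, p, a, b):
        # Assumed family: h(x) = ((a*x + b) mod p) mod k, p a prime >= n.
        # The parameters describe h; the bit is the only state that changes.
        self.k, self.p, self.a, self.b = k, p, a, b
        self.bit = False

    def h(self, x):
        return ((self.a * x + self.b) % self.p) % self.k

    def update(self, x):
        # Repeated elements hash to the same bucket, so only the set of
        # distinct elements seen so far affects the bit.
        if self.h(x) == 0:
            self.bit = True

def random_sketch(k, p, rng):
    """Draw h at random from the assumed family."""
    return OneBitSketch(k, p, rng.randrange(1, p), rng.randrange(p))

# Fixed parameters so the run is reproducible: h(x) = ((3x + 5) mod 101) mod 4.
sketch = OneBitSketch(k=4, p=101, a=3, b=5)
for x in [0, 2, 3, 0, 2]:
    sketch.update(x)
print(sketch.bit)  # False: h(0) = 1, h(2) = 3, h(3) = 2
sketch.update(1)
print(sketch.bit)  # True: h(1) = 0
```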

Algorithm 1:


Var[X] = E[X^2] − (E[X])^2

18.3.1 Space requirements

18.3.2 Improving the accuracy: Median of means method

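[Figure: median of means method. The estimators X_i are divided into groups of 8/ε^2 each; the mean of each group is computed, and the median of these means is taken.]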