



These notes cover algorithms for streaming data, focusing on counting the number of distinct elements and computing the second moment of element frequencies. They present algorithms that perform better than random sampling for these tasks, introduce the concept of unbiased estimators, and give proofs of correctness for the algorithms.




CS787: Advanced Algorithms Scribe: Priyananda Shenoy and Shijin Kong Lecturer: Shuchi Chawla Topic: Streaming Algorithms (continued) Date: 10/26/
In this lecture we continue our discussion of streaming algorithms, covering algorithms for counting the number of distinct elements in a stream and for computing the second moment of the element frequencies.
Let us review the basic streaming framework. Suppose a stream arrives from time 0 to time $T$. At each time $t \in [T]$, the arriving element is $a_t \in [n]$. The frequency of an element $i$ is defined as $m_i = |\{j \mid a_j = i\}|$. We will look at several problems in this framework. For each problem, we want both the storage and the update time to be only $O(\log n)$ and $O(\log T)$. Under these strong constraints we should also make only a few passes over the stream.
The first problem we will solve in this lecture is counting the number of distinct elements. A simple approach is to sample elements at random, maintain statistics over the sample, and then extrapolate the true number of distinct elements from those statistics. For example, if we pick $k$ of the $n$ elements uniformly at random and count the number of distinct elements in the sample, we can roughly estimate the number of distinct elements overall by scaling the result by $n/k$. However, in many cases where elements are unevenly distributed (e.g. some elements dominate), random sampling with a small $k$ gives an estimate much lower than the actual number of distinct elements. To be accurate we need $k = \Omega(n)$.
The second problem, computing the second moment of element frequencies, depends on an accurate estimate of each $m_i$. Again we could apply random sampling in a similar way and estimate the frequency of each element, but again random sampling is inaccurate: some elements may not be sampled at all.
To handle these problems better, we introduce two algorithms that perform better than the random-sampling approach. Before that, let us define what an unbiased estimator is:
Definition 18.1.1 An unbiased estimator for quantity Q is a random variable X such that E[X] = Q.
Let $c$ denote the number of distinct elements. We are going to prove that we can probabilistically distinguish between $c < k$ and $c > 2k$ using a single bit of memory. Here $k$ is the range size of a family $H$ of hash functions, where

\[ \forall h \in H, \quad h : [n] \to [k] \]
Algorithm 1:
Let $b$ be the bit we have, initially set to $b = 0$.
If $h(a_t) = 0$ for some $t \in [T]$, then set $b = 1$.
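The single-bit procedure can be sketched in a few lines of Python. This is only a simulation aid under stated assumptions: `make_hash` is a hypothetical stand-in for drawing $h$ from $H$, and it stores a table of size up to $n$, which a real streaming implementation would replace with a compact hash function.

```python
import random

def make_hash(k, seed):
    """Hypothetical stand-in for a random h : [n] -> [k].

    Each new element is assigned a uniform bucket in {0, ..., k-1} the first
    time it is seen. The table costs O(n) memory, so this is for simulation
    only, not for an actual small-space implementation.
    """
    rng = random.Random(seed)
    table = {}

    def h(a):
        if a not in table:
            table[a] = rng.randrange(k)
        return table[a]

    return h

def algorithm1(stream, h):
    """Return the bit b: 1 if some stream element hashes to 0, else 0."""
    b = 0
    for a in stream:
        if h(a) == 0:
            b = 1
    return b

if __name__ == "__main__":
    # Empirical frequency of b = 0 for c = 4 distinct elements and k = 8.
    trials = 2000
    zeros = sum(algorithm1([0, 1, 2, 3, 3, 3], make_hash(8, s)) == 0
                for s in range(trials))
    print(zeros / trials, (1 - 1 / 8) ** 4)
```

Repeated elements do not change the outcome, so the bit depends only on the set of distinct elements in the stream.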
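Let us compute some event probabilities:

\[ \Pr[b = 0] = \left(1 - \frac{1}{k}\right)^c \]

\[ \Pr[b = 0 \mid c < k] = \left(1 - \frac{1}{k}\right)^c \ge \left(1 - \frac{1}{k}\right)^k \ge \frac{1}{4} \]

\[ \Pr[b = 0 \mid c > 2k] = \left(1 - \frac{1}{k}\right)^c \le \left(1 - \frac{1}{k}\right)^{2k} \le \frac{1}{e^2} \]

Now we have a separation between the cases $c < k$ and $c > 2k$. In the next step, we can use multiple bits to boost this separation.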
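As a quick numerical check of this separation, the sketch below evaluates the two boundary probabilities for several values of $k$. The function name `separation` is only illustrative.

```python
import math

def separation(k):
    """Return the boundary values of Pr[b = 0] for the two cases.

    p_few  = (1 - 1/k)^k     lower bound on Pr[b = 0] when c < k
    p_many = (1 - 1/k)^(2k)  upper bound on Pr[b = 0] when c > 2k
    """
    p_few = (1 - 1 / k) ** k
    p_many = (1 - 1 / k) ** (2 * k)
    return p_few, p_many

if __name__ == "__main__":
    for k in (2, 10, 100, 1000):
        p_few, p_many = separation(k)
        print(k, round(p_few, 4), round(p_many, 4),
              "bounds:", 0.25, round(math.exp(-2), 4))
```

For every $k \ge 2$ the first value stays at or above $1/4$ and the second stays below $1/e^2 \approx 0.135$, so any threshold strictly between them separates the two cases.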
Algorithm 2:
Maintain $x$ bits $b_1, b_2, \ldots, b_x$ and run Algorithm 1 independently on each bit.
Next we pick a value between $\frac{1}{4}$ and $\frac{1}{e^2}$, say $\frac{1}{6}$. If $|\{j \mid b_j = 0\}| > \frac{x}{6}$, output $c < k$; otherwise output $c > 2k$.
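A minimal simulation of the multi-bit test might look as follows. As an assumption for illustration, the $x$ independent hash functions are simulated with lazily filled random tables (one per bit), which uses far more memory than the algorithm itself is meant to; the function name and return strings are also illustrative.

```python
import random

def algorithm2(stream, k, x, seed=0):
    """Run x independent copies of the single-bit test and threshold at x/6."""
    rng = random.Random(seed)
    # One table per bit stands in for x independently chosen hash functions.
    tables = [dict() for _ in range(x)]
    bits = [0] * x
    for a in stream:
        for j in range(x):
            if a not in tables[j]:
                tables[j][a] = rng.randrange(k)
            if tables[j][a] == 0:
                bits[j] = 1
    zeros = bits.count(0)
    return "c < k" if zeros > x / 6 else "c > 2k"

if __name__ == "__main__":
    print(algorithm2(range(10), k=100, x=200))    # few distinct elements
    print(algorithm2(range(1000), k=100, x=200))  # many distinct elements
```

When $c$ lies between $k$ and $2k$ the test gives no guarantee either way, which matches the gap in the analysis.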
Claim 18.2.1 The error probability of Algorithm 2 is at most $\delta$ if $x = O(\log \frac{1}{\delta})$.
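Proof: Suppose $c < k$. Then the expected number of bits that are 0 among all $x$ bits is at least $\frac{x}{4}$.
By Chernoff's bound:

\[ \Pr\left[\text{actual number of bits that are 0} < \frac{x}{6}\right] = \Pr\left[\text{actual number of bits that are 0} < \left(1 - \frac{1}{3}\right)\frac{x}{4}\right] \le e^{-\frac{1}{2} \cdot \frac{1}{3^2} \cdot \frac{x}{4}} = e^{-x/72} \]

Using $x = O(\log \frac{1}{\delta})$ gives us the correct answer with probability $1 - \delta$. Similarly, if $c > 2k$, we can show a similar bound on $\Pr\left[\text{actual number of bits that are 0} > \frac{x}{6}\right]$.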
We repeat $\log n$ times, and set $\delta = \frac{\delta'}{\log n}$ for each run. Then, by the union bound, the probability that any run fails is at most $\log n \cdot \frac{\delta'}{\log n} = \delta'$.
To get a $(1 + \epsilon)$ approximation, we would need $O\left(\frac{\log n}{\epsilon^2}\left(\log \log n + \log \frac{1}{\delta}\right)\right)$ bits.
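The expected value of $X$ is taken as the value of $\mu_2$. We will see what the value of $k$ needs to be to get an accurate answer with high probability. To do this, we will apply Chebyshev's bound. Consider the expected value of $X^2$:

\[
E[X^2] = E\Big[ \sum_i m_i^4 Y_i^4 + 4 \sum_{i \ne j} m_i^3 m_j Y_i^3 Y_j + 6 \sum_{i \ne j} m_i^2 m_j^2 Y_i^2 Y_j^2 + 12 \sum_{i \ne j \ne i'} m_i^2 m_j m_{i'} Y_i^2 Y_j Y_{i'} + 24 \sum_{i \ne j \ne i' \ne j'} m_i m_j m_{i'} m_{j'} Y_i Y_j Y_{i'} Y_{j'} \Big]
\]

Here each sum ranges over distinct monomials, so, for example, the pair $\{i, j\}$ contributes a single term $m_i^2 m_j^2$. Assuming that the variables $Y_i$ are 4-way independent, with each $Y_i \in \{-1, +1\}$ and $E[Y_i] = 0$, every term containing an odd power of some $Y_i$ has zero expectation, while $Y_i^2 = Y_i^4 = 1$. We can therefore simplify this to

\[ E[X^2] = \sum_i m_i^4 + 6 \sum_{i \ne j} m_i^2 m_j^2 \]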
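The variance of $X_i$ (as defined in Algorithm 3) is given by

\[
\begin{aligned}
\mathrm{var}[X] &= E[X^2] - (E[X])^2 \\
&= \Big( \sum_i m_i^4 + 6 \sum_{i \ne j} m_i^2 m_j^2 \Big) - \Big( \sum_i m_i^4 + 2 \sum_{i \ne j} m_i^2 m_j^2 \Big) \\
&= 4 \sum_{i \ne j} m_i^2 m_j^2 \\
&\le 2 \Big( \sum_i m_i^2 \Big)^2 \\
&= 2 \mu_2^2
\end{aligned}
\]

where each sum over $i \ne j$ counts every unordered pair once. Averaging $k$ independent copies of $X$ gives a mean $\bar{X}$ with

\[ \mathrm{var}[\bar{X}] = \frac{\mathrm{var}[X]}{k} \le \frac{2 \mu_2^2}{k} \]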
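The moment formulas can be checked exactly on a small frequency vector by enumerating every sign assignment. The sketch assumes, as the expansion above implies, that $X = Z^2$ with $Z = \sum_i m_i Y_i$ and each $Y_i$ uniform on $\{-1, +1\}$. Fully independent signs are used, which are in particular 4-way independent. The function names are illustrative.

```python
from itertools import combinations, product

def exact_moments(m):
    """Return (E[X], E[X^2]) for X = (sum_i m_i * Y_i)^2 by full enumeration."""
    total_x = 0
    total_x2 = 0
    for signs in product((-1, 1), repeat=len(m)):
        z = sum(mi * yi for mi, yi in zip(m, signs))
        x = z * z
        total_x += x
        total_x2 += x * x
    count = 2 ** len(m)
    return total_x / count, total_x2 / count

def variance_formula(m):
    """4 * sum over unordered pairs {i, j} of m_i^2 * m_j^2."""
    return 4 * sum((a * a) * (b * b) for a, b in combinations(m, 2))

if __name__ == "__main__":
    m = [3, 1, 2, 5]
    mu2 = sum(v * v for v in m)
    ex, ex2 = exact_moments(m)
    print("E[X] =", ex, "mu_2 =", mu2)
    print("var[X] =", ex2 - ex * ex, "formula =", variance_formula(m),
          "bound 2*mu_2^2 =", 2 * mu2 * mu2)
```

The enumeration is exponential in the number of elements, so it is only useful as a sanity check on toy inputs.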
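Applying Chebyshev's inequality to the mean $\bar{X}$ of the $k$ runs:

\[ \Pr\left[\,|\bar{X} - \mu_2| \ge \epsilon \mu_2\,\right] \le \frac{\mathrm{var}[\bar{X}]}{\epsilon^2 \mu_2^2} \le \frac{2 \mu_2^2}{k \epsilon^2 \mu_2^2} = \frac{2}{k \epsilon^2} \]

Hence, to compute $\mu_2$ within a factor of $(1 \pm \epsilon)$ with probability $(1 - \delta)$, we need to run the algorithm $\frac{2}{\delta \epsilon^2}$ times.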
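To get a feel for the numbers, the bound can be turned into a small calculator. The helper names below are illustrative, and the run count is rounded up to an integer.

```python
import math

def runs_needed(eps, delta):
    """Smallest integer k with 2 / (k * eps^2) <= delta, i.e. k >= 2 / (delta * eps^2)."""
    return math.ceil(2 / (delta * eps ** 2))

def chebyshev_failure_bound(k, eps):
    """Upper bound 2 / (k * eps^2) on the failure probability after k runs."""
    return 2 / (k * eps ** 2)

if __name__ == "__main__":
    for eps, delta in [(0.5, 0.25), (0.1, 0.05), (0.01, 0.01)]:
        k = runs_needed(eps, delta)
        print(eps, delta, k, chebyshev_failure_bound(k, eps))
```

The linear dependence on $1/\delta$ is what makes plain averaging expensive when a small failure probability is required.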
Let us analyze the space requirements of this algorithm. In each run of the algorithm, we need $O(\log T)$ space to maintain $Z$. If we explicitly stored the $Y_i$, we would need $O(n)$ bits, which is too expensive. We can improve on this by constructing a hash function that generates the values $Y_i$ on the fly. For the above analysis to hold, the hash function should ensure that any group of up to four $Y_i$ are independent (i.e. the hash function belongs to a 4-way independent hash family). We skip the details of how to construct such a hash family, but this can be done using only $O(\log n)$ bits per hash function.
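The notes skip the construction, so the following is only one standard possibility and not necessarily the one intended here: a random polynomial of degree 3 over a prime field gives 4-way independent values, and its low bit can be mapped to a sign. The names `FourWiseSign` and `second_moment_estimate` are hypothetical, and the estimator assumes $X = Z^2$ with $Z = \sum_t Y_{a_t}$. Because the prime is odd, the low bit is biased by about $1/P$, which is negligible for this sketch.

```python
import random

P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1; element ids must be below P

class FourWiseSign:
    """Sign function Y_i from a random degree-3 polynomial modulo P.

    Only the four coefficients are stored, i.e. O(log n) bits for n < P.
    """

    def __init__(self, rng):
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        acc = 0
        for c in self.coeffs:  # Horner evaluation of the polynomial at i
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1

def second_moment_estimate(stream, sign):
    """One run: maintain Z = sum of Y_{a_t} over the stream and return Z^2."""
    z = 0
    for a in stream:
        z += sign(a)
    return z * z

if __name__ == "__main__":
    rng = random.Random(0)
    stream = [1, 1, 2, 3, 3, 3]  # frequencies 2, 1, 3, so mu_2 = 14
    runs = 4000
    avg = sum(second_moment_estimate(stream, FourWiseSign(rng))
              for _ in range(runs)) / runs
    print("average estimate:", avg)
```

Each run keeps only the counter `z` and four field elements, in line with the space bounds discussed above.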
In Algorithm 4, we use the mean of many trials to compute the required value. This has the disadvantage that a few inaccurate trials can adversely affect the result, so we need a large number of samples, linear in $\frac{1}{\delta}$, to get reasonable accuracy. Instead of using the mean, the following procedure gives better results. The idea is to take the median of the means of subsamples. This works better because the median is less sensitive than the mean to outliers in a sample.
Group adjacent $X_i$s into $k$ groups of size $\frac{8}{\epsilon^2}$ each. For each group, calculate the mean. The estimate of the expected value of $X$ is then obtained by taking the median of the $k$ means. The total number of samples of $X$ we use is $k \cdot \frac{8}{\epsilon^2}$.
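The grouping step can be sketched as below. The function names are illustrative, and the group size $\frac{8}{\epsilon^2}$ is rounded up to an integer, which is an assumption made here for convenience.

```python
import math
import statistics

def group_size_for(eps):
    """Group size 8 / eps^2, rounded up to an integer."""
    return math.ceil(8 / eps ** 2)

def median_of_means(samples, k, group_size):
    """Split the first k * group_size samples into k adjacent groups,
    average each group, and return the median of the k averages."""
    if len(samples) < k * group_size:
        raise ValueError("need at least k * group_size samples")
    means = []
    for g in range(k):
        group = samples[g * group_size:(g + 1) * group_size]
        means.append(sum(group) / group_size)
    return statistics.median(means)

if __name__ == "__main__":
    # A single wild sample shifts the plain mean but not the median of means.
    samples = [10, 10, 10, 10, 10, 1000]
    print("plain mean:", sum(samples) / len(samples))
    print("median of means:", median_of_means(samples, k=3, group_size=2))
```

In the toy input a single outlier moves only one group mean, so the median of the three group means is unaffected.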
Fig 1: Median of mean method
To see how this improves the accuracy, consider a particular group as shown in the figure. Let Xi be the mean of the group.