













Johns Hopkins University Scribe: Your Name
In this section, we study three ways to estimate the tail probabilities of random variables. Please note that the more information we know about the random variable, the better the estimate we can derive about a given tail probability.
Theorem 1. Markov Inequality
If $X$ is a non-negative valued random variable with expectation $\mu$, then for any $c > 0$, $P[X \ge c\mu] \le \frac{1}{c}$.
Proof. By definition,
μ =
Proof. Let the random variable $Y = (X - \mu_X)^2$. Then $E[Y] = E[(X - \mu_X)^2] = \sigma_X^2$ by the definition of $\sigma_X$. Note that $Y$ is a non-negative valued random variable.
Now, $P[|X - \mu_X| \ge c\sigma_X] = P[(X - \mu_X)^2 \ge c^2\sigma_X^2] = P[Y \ge c^2\sigma_X^2]$.
Applying the Markov inequality to the random variable $Y$, $P[Y \ge c^2\sigma_X^2] = P[Y \ge c^2\mu_Y] \le \frac{1}{c^2}$.
Note that the random variable $X$ need not be non-negative valued for the Chebychev inequality to hold.
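As a quick sanity check of Theorems 1 and 2, the sketch below compares both bounds with exact tail probabilities. The fair six-sided die is an assumed example distribution, not one used in these notes, and the helper names are illustrative only.

```python
from fractions import Fraction

# Assumed example distribution: a fair six-sided die (not from the notes).
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

mu = sum(v * prob for v in values)                 # E[X]
var = sum((v - mu) ** 2 * prob for v in values)    # Var(X) = sigma_X^2


def tail_at_least(a):
    """Exact P[X >= a]."""
    return sum(prob for v in values if v >= a)


def deviation_at_least(d):
    """Exact P[|X - mu| >= d]."""
    return sum(prob for v in values if abs(v - mu) >= d)


# Markov: P[X >= a] <= mu / a (Theorem 1 with a = c * mu).
markov_exact = tail_at_least(6)
markov_bound = mu / 6

# Chebychev: P[|X - mu| >= d] <= var / d^2 (Theorem 2 with d = c * sigma_X).
d = Fraction(5, 2)
cheb_exact = deviation_at_least(d)
cheb_bound = var / d ** 2

print(markov_exact, "<=", markov_bound)
print(cheb_exact, "<=", cheb_bound)
```

Exact rational arithmetic is used so that the comparison is not blurred by rounding. On this example both bounds hold but are loose, which matches the remark that more information about the random variable gives better estimates.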
The tail estimates given by Theorem 1 and Theorem 2 work for random variables in general. However, if the random variable $X$ can be expressed as a sum of $n$ independent random variables, each of which is 0,1-valued, then we can obtain very tight bounds on the tail estimates. This is expressed in the following theorem, and the bounds are commonly called Chernoff bounds.
Theorem 3. Chernoff Bound for upper tail
Let $X$ be a random variable defined as $X = X_1 + X_2 + \cdots + X_n$, where each $X_i$, $1 \le i \le n$, is a 0,1-valued random variable and all $X_i$'s are independent. Also, let $E[X] = \mu$ and $P[X_i = 1] = p_i$, $1 \le i \le n$. Then for any $\delta > 0$,
$$P[X \ge \mu(1+\delta)] \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}.$$
Proof. While proving Chebychev inequality (Theorem 2), we made use of a second-order moment. Application of higher order moments would generally improve the bound on the tail inequality. We establish the tail estimates of the sums of independent random variables by utilizing the exponential function, which essentially captures a weighted sum of all the moments.
Let $Y = e^{tX}$, for an appropriate positive value of $t$ to be chosen later. If we let $Y_i = e^{tX_i}$, the $Y_i$'s are also independent, since the $X_i$'s are. Note that $Y = Y_1 Y_2 \cdots Y_n$. Note also that
$$E[Y_i] = E[e^{tX_i}] = p_i e^{t} + (1 - p_i)e^{0} = 1 - p_i + p_i e^{t} \tag{1}$$
$$E[Y] = E[Y_1 Y_2 \cdots Y_n] = \prod_{i=1}^{n} E[Y_i] = \prod_{i=1}^{n} (1 - p_i + p_i e^{t}) \tag{2}$$
in which the second equality follows from the independence of Yi’s.
Observe that
$$\mu = \sum_{i=1}^{n} p_i \tag{3}$$
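The product formula (2) can be checked numerically for a small $n$ by enumerating all $2^n$ outcomes. The following is a minimal sketch; the probabilities `ps`, the value of `t`, and the function names are hypothetical choices made only for illustration.

```python
import itertools
import math


def mgf_product(ps, t):
    """Closed form from (2): prod_i (1 - p_i + p_i * e^t)."""
    return math.prod(1 - p + p * math.exp(t) for p in ps)


def mgf_brute_force(ps, t):
    """E[e^{tX}] computed by summing over all 2^n outcomes of (X_1, ..., X_n)."""
    total = 0.0
    for outcome in itertools.product([0, 1], repeat=len(ps)):
        # Probability of this outcome under independence.
        weight = math.prod(p if x == 1 else 1 - p for p, x in zip(ps, outcome))
        total += weight * math.exp(t * sum(outcome))
    return total


ps = [0.1, 0.5, 0.7]   # hypothetical success probabilities p_i
t = 0.8                # hypothetical positive value of t

print(mgf_product(ps, t), mgf_brute_force(ps, t))
print("mu =", sum(ps))  # equation (3)
```

The two values agree up to floating-point rounding, which is exactly the statement that independence lets the expectation of the product factor into the product of expectations.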
That is, it suffices to prove
$$f(\delta) = \delta - (1+\delta)\ln(1+\delta) + \frac{\delta^2}{4} \le 0.$$
Differentiating $f$ twice, we have
$$f'(\delta) = -\ln(1+\delta) + \frac{\delta}{2}, \qquad f''(\delta) = \frac{\delta - 1}{2(1+\delta)}.$$
Note that $f''(\delta) \le 0$ for $0 < \delta \le 1$. Hence $f'(\delta)$ is monotonically non-increasing as $\delta$ varies from 0 to 1.
Since $f'(0) = 0$ and $f''(\delta) < 0$ for $0 < \delta < 1$, we have $f'(\delta) < 0$ for any $0 < \delta \le 1$. Hence $f(\delta)$ is monotonically decreasing as $\delta$ varies from 0 to 1. Since $f(0) = 0$, $f(\delta) \le 0$ for any $0 < \delta \le 1$.
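For the interval $\delta > 1$, we now establish
$$\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu} \le e^{-\mu\delta\ln\delta/2}.$$
Simplifying as above, it suffices to prove $g(\delta) = \delta - (1+\delta)\ln(1+\delta) + \frac{\delta\ln\delta}{2} \le 0$ when $\delta > 1$. Once again differentiating twice,
$$g'(\delta) = \frac{1}{2} - \ln(1+\delta) + \frac{\ln\delta}{2}, \quad\text{and}\quad g''(\delta) = -\frac{1}{1+\delta} + \frac{1}{2\delta} = \frac{1-\delta}{2\delta(1+\delta)}.$$
Note that $g''(\delta) < 0$ when $\delta > 1$. Hence $g'(\delta)$ is monotonically decreasing. Since $g'(1) = \frac{1}{2} - \ln 2$ is negative, $g'(\delta) < 0$ when $\delta \ge 1$. Consequently, $g(\delta)$ is monotonically decreasing as $\delta$ increases. Since $g(1) = 1 - 2\ln 2 < 0$, $g(\delta) < 0$ for any $\delta > 1$.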
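The sign claims for $f$ and $g$ can be spot-checked numerically. The sketch below evaluates both functions on a finite grid, so it illustrates the calculus argument rather than replacing it; the grid ranges and function names are arbitrary assumptions.

```python
import math


def f(d):
    """f(delta) = delta - (1 + delta) ln(1 + delta) + delta^2 / 4."""
    return d - (1 + d) * math.log(1 + d) + d * d / 4


def g(d):
    """g(delta) = delta - (1 + delta) ln(1 + delta) + (delta ln delta) / 2."""
    return d - (1 + d) * math.log(1 + d) + d * math.log(d) / 2


def chernoff_base(d):
    """e^delta / (1 + delta)^(1 + delta), the bound of Theorem 3 for mu = 1."""
    return math.exp(d - (1 + d) * math.log(1 + d))


small = [k / 1000 for k in range(1, 1001)]      # grid on (0, 1]
large = [1 + k / 100 for k in range(1, 5000)]   # grid on (1, 51)

f_ok = all(f(d) <= 0 for d in small)
g_ok = all(g(d) < 0 for d in large)
print("f <= 0 on (0, 1]:", f_ok)
print("g < 0 on (1, 51):", g_ok)
```

Because $f(\delta) \le 0$ is equivalent to `chernoff_base(d) <= exp(-d*d/4)`, the same grid also confirms the simplified bound on $(0, 1]$, and likewise for $g$ and `exp(-d*log(d)/2)` when $\delta > 1$.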
We now apply the tail inequalities to two problems.
2.4.1 n Balls and n Bins
Consider throwing $n$ balls, independently and uniformly at random, into $n$ bins. We are interested in the probability that bin 1 contains at least 7 balls. Define $n$ 0,1-valued random variables $X_i$, $1 \le i \le n$, where $X_i = 1$ if ball $i$ falls into bin 1 and $X_i = 0$ otherwise. By uniformity, $P[X_i = 1] = \frac{1}{n}$. Define the random variable $X = X_1 + X_2 + \cdots + X_n$. Thus $X$ denotes the number of balls that fall in bin 1. By the linearity of expectation, $E[X] = E[\sum_{i=1}^{n} X_i] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1$.
Using the Markov inequality from Theorem 1 with $\mu = 1$ and $c = 7$, we get
$$P[X \ge 7] \le \frac{1}{7}. \tag{7}$$
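For the Chebychev inequality (Theorem 2), we first compute the standard deviation of $X$:
$$\mathrm{Var}(X_i) = E[X_i^2] - E[X_i]^2 = \frac{1}{n} - \frac{1}{n^2}, \text{ and} \tag{8}$$
$$\mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\left(\frac{1}{n} - \frac{1}{n^2}\right) = 1 - \frac{1}{n}, \tag{9}$$
where the first equality in (9) follows from the independence of the $X_i$'s. Hence,
$$\sigma_X = \sqrt{1 - \frac{1}{n}}. \tag{10}$$
Applying the Chebychev inequality to the random variable $X$,
$$P[X \ge 7] \le P[|X - 1| \ge 6] = P\left[|X - 1| \ge \frac{6}{\sqrt{1 - \frac{1}{n}}}\,\sigma_X\right] \le \frac{1 - \frac{1}{n}}{36}. \tag{11}$$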
Using the Chernoff bound from Theorem 4 with $\mu = 1$ and $\delta = 6 > 1$,
$$P[X \ge 7] = P[X \ge (1 + 6) \cdot 1] \le e^{-(6 \ln 6)/2} = \frac{1}{216}. \tag{12}$$
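To see how the three estimates relate to the true tail, the sketch below computes the exact probability $P[X \ge k]$ for $X \sim \mathrm{Binomial}(n, 1/n)$ next to each bound. It uses the general Chernoff form of Theorem 3 rather than a simplified one, and the choice $n = 100$ and all function names are assumptions made for illustration.

```python
import math


def exact_tail(n, k):
    """Exact P[X >= k] for X ~ Binomial(n, 1/n), the load of bin 1."""
    p = 1 / n
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def markov_bound(k):
    """Markov with mu = 1: P[X >= k] <= 1 / k."""
    return 1 / k


def chebychev_bound(n, k):
    """Chebychev with mu = 1 and Var(X) = 1 - 1/n: bound on P[|X - 1| >= k - 1]."""
    return (1 - 1 / n) / (k - 1) ** 2


def chernoff_bound(k):
    """Theorem 3 with mu = 1 and delta = k - 1: e^delta / (1 + delta)^(1 + delta)."""
    d = k - 1
    return math.exp(d) / (1 + d) ** (1 + d)


n, k = 100, 7   # n is an assumed example size; k = 7 as in the text

print("exact    ", exact_tail(n, k))
print("Chernoff ", chernoff_bound(k))
print("Chebychev", chebychev_bound(n, k))
print("Markov   ", markov_bound(k))
```

For this choice of $n$ the four numbers come out in increasing order, so each inequality that uses more information about $X$ lands closer to the exact tail.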
Now comparing equations (7), (11) and (12), we can see that the Chebychev inequality gives a better bound than the Markov inequality, and that a much better bound is obtained using Chernoff bounds. As another example, let us consider the probability that bin 1 has at least $1 + 10 \ln n$ balls. Using the Markov inequality, we get $P[X \ge 1 + 10 \ln n] \le \frac{1}{1 + 10 \ln n}$. Using the Chebychev inequality, we get $P[X \ge 1 + 10 \ln n] \le P[|X - 1| \ge 10 \ln n] \le \frac{1 - \frac{1}{n}}{100 \ln^2 n} \le \frac{1}{100 \ln^2 n}$, whereas, using Chernoff bounds, we
2.6.1 Independent and identical {−1, +1}-valued random variables
Theorem 7. Let $X_i$, $1 \le i \le n$, be $n$ independent and identically distributed $\{-1, +1\}$-valued random variables such that $P[X_i = +1] = P[X_i = -1] = 1/2$. Let the random variable $X$ be defined by $X = \sum_{i=1}^{n} X_i$. Then $P[X \ge \delta] = P[X \le -\delta] \le e^{-\delta^2/(2n)}$ for any $\delta > 0$.
Proof. By symmetry we have $P[X \ge \delta] = P[X \le -\delta]$. Here we prove a weaker form of the theorem, $P[X \ge \delta] \le e^{-\delta^2/(6n)}$, making use of Theorem 4. By making use of the exponential function $e^{tX}$, a direct derivation results in the claimed bound. Observe that $E[X_i] = 0$ and $E[X] = 0$. We define random variables $Y_i = \frac{1 + X_i}{2}$, $1 \le i \le n$, and $Y = \sum_{i=1}^{n} Y_i$. Note that each $Y_i$ is 0,1-valued, and $E[Y_i] = 1/2$. Hence $E[Y] = n/2$. Thus,
$$P[X \ge \delta] = P\left[Y \ge \frac{n}{2} + \frac{\delta}{2}\right] = P\left[Y \ge \frac{n}{2}\left(1 + \frac{\delta}{n}\right)\right] \le e^{-\frac{n}{2} \cdot \frac{\delta^2}{n^2} \cdot \frac{1}{3}} = e^{-\delta^2/(6n)}.$$
When $\delta > n$, there is no need to apply this result, since we know that $X$ can never take a value greater than $n$. An alternative form of the above theorem is $P[X \ge \delta n] \le e^{-\delta^2 n/2}$.
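Theorem 7 can be compared against the exact tail, since $X = 2H - n$ where $H$ counts the $+1$ outcomes and is $\mathrm{Binomial}(n, 1/2)$. The sketch below is only an illustration; the value $n = 100$ and the function names are assumptions.

```python
import math


def exact_upper_tail(n, delta):
    """Exact P[X >= delta] for X a sum of n fair +/-1 variables.

    X = 2H - n with H ~ Binomial(n, 1/2), so X >= delta iff H >= (n + delta) / 2.
    """
    h_min = math.ceil((n + delta) / 2)
    return sum(math.comb(n, h) for h in range(h_min, n + 1)) / 2**n


def theorem7_bound(n, delta):
    """e^{-delta^2 / (2n)}, the bound stated in Theorem 7."""
    return math.exp(-delta**2 / (2 * n))


def weaker_bound(n, delta):
    """e^{-delta^2 / (6n)}, the weaker form derived in the proof."""
    return math.exp(-delta**2 / (6 * n))


n = 100  # assumed example size
for delta in (10, 20, 40):
    print(delta, exact_upper_tail(n, delta), theorem7_bound(n, delta), weaker_bound(n, delta))
```

For each `delta` the exact probability sits below the Theorem 7 bound, which in turn sits below the weaker $e^{-\delta^2/(6n)}$ form.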
3 Set Balancing Problem
In this section, we apply Chernoff bounds to another problem known as the Set Balancing Problem, which is defined as follows. Given an $n \times n$ $\{0, 1\}$ matrix $A$, find a $\{-1, +1\}$-valued column vector $X$ such that the product $AX$ has the smallest maximum absolute entry, i.e., minimize $\|AX\|_\infty$.
Example 1. Let the matrix A be as given below.
For X =
Thus the maximum absolute entry is 2.
Our goal is to make every entry of $AX$ as close to 0 as possible. In a way, we are measuring the discrepancy using the maximum absolute value of the entries of $AX$. In general, it is not possible to make every entry of $AX$ equal to 0, for example when $A$ has a row with an odd number of 1's. A brute force solution for choosing $X$ involves trying all possible column vectors of size $n$, which would take $\Omega(2^n)$ time. Instead, we develop a very simple randomized algorithm that guarantees an expected discrepancy of $O(\sqrt{n \ln n})$. In a subsequent chapter, we derandomize the algorithm and obtain a deterministic polynomial time algorithm with the same guarantee on the discrepancy. It is interesting to note that for this problem, Spencer [] proved that for any matrix $A$ there is a column vector $X$ such that the discrepancy is at most $6\sqrt{n}$. It is not known whether there is a polynomial time randomized algorithm that guarantees a discrepancy of $O(\sqrt{n})$.
The randomized algorithm works as follows. Let $X = [X_1\ X_2\ \cdots\ X_n]^T$. Choose each $X_i$ independently and uniformly at random, with $P[X_i = +1] = P[X_i = -1] = 1/2$, $1 \le i \le n$. Before analyzing the performance guarantee of the randomized algorithm, we state the classic Boole's inequality.
Fact 1. Boole’s Inequality:
Let $E_1, E_2, \ldots, E_n$ be $n$ events. Then $P[E_1 \cup E_2 \cup \cdots \cup E_n] \le P[E_1] + P[E_2] + \cdots + P[E_n]$.
Boole’s inequality has the following application. Let the events Ei be bad events. Then the union of all these bad events defines the event where at least one bad event occurs. To be able to prove that no bad event occurs with high probability, we can bound the probability that some bad event occurs by estimating the probability of each bad event (individually even though we are not given that all bad events are independent) and summing the probabilities.
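Let the product $AX$ be $Y = [Y_1\ Y_2\ \cdots\ Y_n]^T$. Consider any $Y_i$; without loss of generality, let it be $Y_1$. By the definition of matrix multiplication, $Y_1 = A_{11}X_1 + A_{12}X_2 + \cdots + A_{1n}X_n$, where $A_{ij}$ denotes the element of $A$ in the $i$-th row and $j$-th column. Note that $E[X_i] = 0$ and, by linearity of expectation, $E[Y_1] = 0$. Since $Y_1$ is a sum of $k \le n$ of the $X_j$'s (those with $A_{1j} = 1$), Theorem 7 gives a bound of $e^{-\delta^2/(2k)} \le e^{-\delta^2/(2n)}$. For $\delta = 2\sqrt{n \ln n}$, we therefore get
$$P[Y_1 \ge \delta] = P[Y_1 \le -\delta] \le e^{-\frac{4 n \ln n}{2n}} = e^{-2 \ln n} = \frac{1}{n^2}.$$
Hence, $P[|Y_1| \ge \delta] = P[Y_1 \ge \delta] + P[Y_1 \le -\delta] \le \frac{2}{n^2}$.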
Let us interpret each event $|Y_i| \ge 2\sqrt{n \ln n}$, designated $E_i$, as a "bad event". Thus, using Boole's inequality,
$$P\left[\text{for some } i,\ |Y_i| \ge 2\sqrt{n \ln n}\right] \le \sum_{i=1}^{n} P\left[|Y_i| \ge 2\sqrt{n \ln n}\right] \le n \cdot \frac{2}{n^2} = \frac{2}{n}.$$
Thus, with probability at least $1 - \frac{2}{n}$, every entry in $Y$ has absolute value less than $2\sqrt{n \ln n}$.
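Hence with high probability, $\|AX\|_\infty < 2\sqrt{n \ln n}$. We can even upper bound $E(\|AX\|_\infty)$ by observing that when $\|AX\|_\infty$ is not less than $2\sqrt{n \ln n}$, it can have a value at most $n$, and this happens with probability at most $\frac{2}{n}$. Hence, the expected value of the maximum absolute value is at most $\left(1 - \frac{2}{n}\right) 2\sqrt{n \ln n} + \frac{2}{n} \cdot n = O(\sqrt{n \ln n})$.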
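The randomized algorithm analyzed above is short enough to sketch directly. The random 0/1 input matrix, the fixed seed, the size $n = 64$, and the function names below are all assumptions chosen only to make the example self-contained.

```python
import math
import random


def discrepancy(A, x):
    """Return ||Ax||_inf for a 0/1 matrix A (list of rows) and a +/-1 vector x."""
    return max(abs(sum(a * s for a, s in zip(row, x))) for row in A)


def random_signs(n, rng):
    """Choose each X_i independently and uniformly at random from {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def threshold(n):
    """The high-probability bound 2 * sqrt(n ln n) from the analysis."""
    return 2 * math.sqrt(n * math.log(n))


rng = random.Random(0)   # fixed seed, assumed only for reproducibility
n = 64
A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]  # hypothetical input
x = random_signs(n, rng)
d = discrepancy(A, x)

print("discrepancy:", d, "threshold:", threshold(n))
```

A single draw of `x` costs $O(n^2)$ time to evaluate, compared with the $\Omega(2^n)$ brute-force search. By the analysis, the discrepancy falls below the threshold with probability at least $1 - 2/n$, so a run that exceeds it can simply be repeated.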
In this section, we revisit the RandQuickSort algorithm. We will use the Chernoff bound formula to bound a tail probability of the execution time.
We can view the execution of the algorithm on any given input of $n$ numbers as a binary tree of pivots. Every node has an associated set of elements and a pivot element. For
In the path from the root to the $i$-th smallest element, we define a 0,1-valued random variable $X_j$ for the $j$-th node on the path as follows: $X_j$ takes the value 1 if the pivot at that node is in the middle half of the node's elements. Each such pivot leaves at most $3/4$ of the elements on either side. Let $\alpha$ be the number of $X_j$'s that equal 1. If the array at level $24 \ln n$ has more than one element, then $\left(\frac{3}{4}\right)^{\alpha} n > 1$, so
$$\alpha \ln(3/4) + \ln n > 0 \;\Rightarrow\; \alpha < \frac{\ln n}{\ln(4/3)}.$$
Since $1/\ln(4/3) \approx 3.48$, the threshold is taken as $3 \ln n$ below (the exact value is not important).
Thus, if more than $3 \ln n$ of the $X_j$'s are 1, then the path of pivots from the root to the $i$-th element is shorter than $24 \ln n$.
Let $X = X_1 + X_2 + \cdots + X_{24 \ln n}$ (this is the number of $X_j$'s that equal 1). Note that $P(X_j = 1) = 1/2$, hence $E(X) = 12 \ln n$.
$$P(\text{the path of pivots from the root to the } i\text{-th element is longer than } 24 \ln n) \le P(X \le 3 \ln n). \tag{13}$$
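$$P[X \le 3 \ln n] = P\left[X \le 12 \ln n \left(1 - \frac{3}{4}\right)\right] \le e^{-(12 \ln n)(3/4)^2/2} = \frac{1}{n^{27/8}} \le \frac{1}{n^3}.$$
The first inequality follows from the Chernoff bound for the lower tail, in the version $P(X \le (1 - \delta)\mu) \le e^{-\mu\delta^2/2}$, applied with $\mu = 12 \ln n$ and $\delta = 3/4$. Combined with (13), the probability that the path of pivots from the root to the $i$-th element is longer than $24 \ln n$ is at most $\frac{1}{n^3}$.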
Now, using Boole's inequality, $P(\text{the path of pivots from the root to any one of the } n \text{ elements is longer than } 24 \ln n) \le n \cdot \frac{1}{n^3} = \frac{1}{n^2}$.
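Previously, we determined that the expected run time of RandQuickSort is at most $2n \ln n$. Now we have established a high probability bound by sacrificing a constant multiplier in the run time: with probability at least $1 - \frac{1}{n^2}$, every path of pivots has length at most $24 \ln n$, and since each level of the tree involves at most $n$ comparisons, the algorithm makes at most $24 n \ln n$ comparisons.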
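The depth bound can be observed empirically by building the pivot tree of a randomized quicksort and recording its deepest level. This is a minimal sketch, not the RandQuickSort code from the course: it assumes distinct keys, picks pivots uniformly at random, and uses an explicit stack; the seed, $n = 1000$, and the number of trials are arbitrary.

```python
import math
import random


def pivot_tree_depth(items, rng):
    """Depth of the pivot tree built by randomized quicksort on distinct items.

    Implemented with an explicit stack so that deep trees do not hit
    Python's recursion limit.
    """
    max_depth = 0
    stack = [(list(items), 1)]   # (sub-array, level of its pivot)
    while stack:
        sub, level = stack.pop()
        if not sub:
            continue
        max_depth = max(max_depth, level)
        pivot = rng.choice(sub)  # pivot chosen uniformly at random
        stack.append(([v for v in sub if v < pivot], level + 1))
        stack.append(([v for v in sub if v > pivot], level + 1))
    return max_depth


rng = random.Random(1)   # fixed seed, assumed only for reproducibility
n = 1000
depths = [pivot_tree_depth(range(n), rng) for _ in range(20)]
bound = 24 * math.log(n)

print("max observed depth:", max(depths), "bound 24 ln n:", round(bound, 1))
```

In runs like this the observed depth stays far below $24 \ln n$, which is consistent with the constant 24 being chosen for a clean proof rather than for tightness.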