Lecture Notes - Stochastic Processes (Markov Chains)

Type: Lecture notes

2023


0 Notation and Terminology.

This course will be concerned with the applications of information theory concepts in statistics. Much of the course will be based on lectures given by Imre Csiszár at Maryland in 1989. Some recent results about dependent processes will also be given. It is assumed that the reader is familiar with basic information theory ideas as presented, for example, in the initial chapters of the Csiszár-Körner book, and with basic statistical concepts as presented, for example, in the book by Cox and Hinkley. Notation and terminology that will be used in these lectures will be introduced in this section.

The symbol $A = \{a_1, a_2, \ldots, a_{|A|}\}$ will denote a finite set of cardinality $|A|$ and $x_m^n$ will denote the sequence $x_m, x_{m+1}, \ldots, x_n$, where each $x_i \in A$. The set of all $n$-length sequences $x_1^n$ will be denoted by $A^n$, the set of all infinite sequences $x = x_1^\infty$, with $x_i \in A$, $i \ge 1$, will be denoted by $A^\infty$, and the set of all finite sequences drawn from $A$ will be denoted by $A^*$. If $u$ and $v$ are finite length sequences then their concatenation is denoted by $uv$, and $u^k = u^{k-1}u$, $k > 1$.

The entropy $H(P)$ of a probability distribution $P = (P(a))$ on $A$ is defined by the formula
$$H(P) = -\sum_{a \in A} P(a) \log P(a),$$
where here, as elsewhere in these lectures, base two logarithms are used. Random variable notation is often used in this context, that is, $H(X)$ denotes the entropy of the distribution $P$ of the random variable $X$. If $P$ and $Q$ are two distributions on $A$ then their divergence or cross-entropy is defined by
$$D(P\|Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}.$$
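The two definitions above can be evaluated directly on a small alphabet. The following is a minimal sketch; the function names `entropy` and `divergence` and the example distributions are illustrative choices, not part of the notes, and the code assumes the usual conventions that terms with $P(a) = 0$ contribute 0 and that $D(P\|Q) = \infty$ when $P(a) > 0 = Q(a)$.

```python
import math

def entropy(p):
    """Entropy H(P) in bits; terms with P(a) = 0 are taken to contribute 0."""
    return -sum(pa * math.log2(pa) for pa in p if pa > 0)

def divergence(p, q):
    """Divergence D(P||Q) in bits; infinite if P(a) > 0 while Q(a) = 0."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa == 0:
            continue            # convention: 0 log(0/q) = 0
        if qa == 0:
            return math.inf     # convention: p log(p/0) = +infinity
        total += pa * math.log2(pa / qa)
    return total

P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(P))        # 1.5
print(divergence(P, Q))  # equals log2(3) - H(P) because Q is uniform
```

For a uniform $Q$ the divergence reduces to $\log |A| - H(P)$, which gives a quick sanity check on both functions.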
If $P$ is the joint distribution of two random variables $(X, Y)$ then their joint entropy is defined by
$$H(X, Y) = -\sum_{(a,b)} P(a, b) \log P(a, b),$$
while the conditional entropy $H(X|Y)$ and mutual information $I(X \wedge Y)$ are defined, respectively, by
$$H(X|Y) = H(X, Y) - H(Y),$$
$$I(X \wedge Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).$$

Two types of codes will be of interest. A block code is a mapping $C: A^n \mapsto B^m$, while a variable-length code is a mapping $C: A^n \mapsto B^*$. The length function $L: A^n \mapsto \{1, 2, \ldots\}$ for a variable-length code is defined by the formula
$$C(x_1^n) = b_1^{L(x_1^n)}.$$
Thus, in particular, a block code is just a variable-length code whose length function is constant.

A block code $C$ is invertible (or faithful) if it is one-to-one. A variable-length code is uniquely decodable if for any two distinct sequences $u(1), u(2), \ldots, u(m)$ and $v(1), v(2), \ldots, v(k)$, where $u(i), v(j) \in A^n$, $\forall i, j$, the concatenations of the images, $C(u(1))C(u(2)) \cdots C(u(m))$ and $C(v(1))C(v(2)) \cdots C(v(k))$, are not equal. A condition that guarantees unique decodability is the prefix condition. A variable-length code $C$ satisfies the prefix condition if
$$C(v) = C(u)w,\ u, v \in A^n,\ w \in B^* \ \Rightarrow\ w = \Lambda,\ u = v,$$
where $\Lambda$ denotes the empty string.

In most cases of interest to us, the image alphabet will be binary, that is, $B = \{0, 1\}$. It is easy to see that the length function for a binary prefix code must satisfy the so-called Kraft inequality,
$$\sum_{x_1^n} 2^{-L(x_1^n)} \le 1.$$
It can in fact be shown that a uniquely decodable binary code also satisfies the Kraft inequality, and that if $L$ is a positive integer-valued function on $A^n$ for which the Kraft inequality holds then there is a binary prefix code $C$ whose length function is $L$. (Thus, in particular, for any uniquely decodable code $C$ with length function $L$ there is a prefix code $\tilde{C}$ whose length function is also $L$.)
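The prefix condition and the Kraft inequality are both easy to check mechanically for a finite code. The sketch below uses hypothetical helper names (`kraft_sum`, `is_prefix_free`) and two small example codes that are not from the notes; exact rational arithmetic is used so that the comparison with 1 is not affected by rounding.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of the Kraft sum: sum of 2^(-L) over the codeword lengths."""
    return sum(Fraction(1, 2 ** length) for length in lengths)

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (repeated codewords count as a violation)."""
    words = sorted(codewords)
    # After lexicographic sorting, if any word is a prefix of another,
    # it is also a prefix of its immediate successor.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

good = ["0", "10", "110", "111"]   # a binary prefix code
bad = ["0", "01", "11"]            # "0" is a prefix of "01"

print(is_prefix_free(good), kraft_sum(len(w) for w in good))  # True 1
print(is_prefix_free(bad), kraft_sum(len(w) for w in bad))    # False 1
```

The second code satisfies the Kraft inequality without being a prefix code, which illustrates that the inequality constrains the length function only; the existence statement says that some prefix code with the same lengths exists, not that every code with those lengths is a prefix code.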
The reason for the connection between the Kraft inequality and prefix codes is the connection between the Kraft inequality and binary trees, a connection that we now sketch. A (binary) tree is a directed graph $(V, E)$, along with a distinguished vertex $r \in V$, called the root, such that the following properties hold.

1. The outdegree of each vertex is at most 2.
2. The indegree of the root is 0. The indegree of all other vertices is exactly 1.
3. Given any $v \in V - \{r\}$ there is a directed path from $r$ to $v$.

It is easy to see from the above that there is only one path from $r$ to any $v \ne r$; the length of this path is called the depth $d(v)$ of $v$. A vertex is called an outer node if its outdegree is 0; otherwise it is an inner node. Let $O$ denote the set of outer nodes. It is easy to see that the edges of the tree can be labeled by 0's and 1's so that for any vertex $v$ whose outdegree is 2, the two edges leading out of $v$ have different labels. Such a labeling assigns a binary sequence of length $d(v)$ to each outer node $v$ such that distinct outer nodes are assigned distinct sequences. The labeling is therefore just a binary code on the set of outer nodes. Furthermore, the code is a prefix code, due to the simple fact that an outer node is not an inner node! It is clear that
$$\sum_{v \in O} 2^{-d(v)} \le 1.$$
In summary, binary trees lead to binary prefix codes on their outer nodes for which the Kraft inequality holds.

Now suppose $L$ is a positive integer-valued function defined on a set $A$ such that $\sum_{a \in A} 2^{-L(a)} \le 1$. Our goal is to show that there is a prefix code $C$ whose length function is $L$. Without loss of generality it can be assumed that $A$ is labeled so that $L(a_i) \le L(a_{i+1})$, $i < |A|$. The code $C$ is defined by setting $C(a_i) = w(i) \in B^*$, where $w(1)$ is a block of 0's of length $L(a_1)$, and $w(i)$, $i > 1$, is the first $L(a_i)$ bits in the binary expansion of $\sum_{j<i} 2^{-L(a_j)}$. It is left to the reader to show that this defines a prefix code. The code is known as the Shannon-Fano code, or simply the Shannon code.
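The construction just described can be sketched in a few lines. The function name `shannon_code` and the example length function are illustrative assumptions; the code sorts the symbols by length, keeps the running sum $\sum_{j<i} 2^{-L(a_j)}$ as an exact fraction, and reads off its first $L(a_i)$ binary digits.

```python
from fractions import Fraction

def shannon_code(lengths):
    """Build a prefix code from a dict symbol -> positive integer length.

    Follows the construction in the text: symbols are ordered by length and
    the i-th codeword is the first L(a_i) bits of the binary expansion of
    the sum of 2^(-L(a_j)) over the earlier symbols.
    """
    if sum(Fraction(1, 2 ** l) for l in lengths.values()) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    code = {}
    acc = Fraction(0)  # running sum over the symbols already coded
    for symbol, l in sorted(lengths.items(), key=lambda kv: kv[1]):
        # acc < 1 and acc * 2^l is an integer (earlier lengths are <= l),
        # so its l-bit binary representation is the truncated expansion.
        code[symbol] = format(int(acc * 2 ** l), "0{}b".format(l))
        acc += Fraction(1, 2 ** l)
    return code

code = shannon_code({"a": 1, "b": 2, "c": 3, "d": 3})
print(code)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Because the lengths are processed in nondecreasing order, each new codeword is numerically at least $2^{-L(a_j)}$ beyond every earlier one, which is the reason no earlier codeword can be a prefix of a later one.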
The following theorem summarizes this coding construction in a form that will be used later.

Theorem 1 Let $P$ be a probability distribution on $A$ and define $L(a) = \lceil -\log P(a) \rceil$, $a \in A$, where $\lceil \cdot \rceil$ denotes the least integer function. There is a binary prefix code with length function $L$, and its expected length satisfies
$$E(L) = \sum_a L(a) P(a) \le H(P) + 1.$$

We shall also make use of a prefix code defined on the integers, a code that is essentially due to Elias. Let $b(n)$ be the usual binary representation of the integer $n \ge 0$, and let $\ell(n)$ denote the length of $b(n)$, so that $\ell(n) = \lceil \log_2(n + 1) \rceil$. Let $0^k$ denote a sequence of 0's of length $k$. The code is defined by
$$C(n) = 0^{\ell(\ell(n))}\, b(\ell(n))\, b(n).$$
For example $b(12) = 1100$, so $b(\ell(12)) = b(4) = 100$ and $\ell(\ell(12)) = 3$. Thus $C(12) = 0001001100$. The decoding is as follows. The initial block 000 of 0's has length 3. This tells us to look in the next 3 places, where we see 100, the binary representation of 4, which in turn tells us to look in the next 4 places where we see 1100, the binary representation of 12. The code $C$ is a prefix code; the codeword length is $\ell(n) + 2\ell(\ell(n))$, which, for large $n$, is approximately equal to $\log_2 n + 2 \log_2 \log_2 n$.

1 Large Deviations.

One important application of information theory is to the theory of large deviations. A key to this application is the theory of types. The $n$-type of a sequence $x_1^n \in A^n$ is just another name for its empirical distribution $\hat{P} = \hat{P}_{x_1^n}$, that is, the distribution defined by
$$\hat{P}(a) = \frac{|\{i: x_i = a\}|}{n}, \quad a \in A.$$
Two sequences $x_1^n$ and $y_1^n$ are said to be equivalent if they have the same type; the equivalence classes will be called type classes. The type class of $x_1^n$ will be denoted by $T_P^n$, where $P = \hat{P}_{x_1^n}$. The proof of the following lemma is left to the student.

Lemma 1 The number of possible types is
$$\binom{n + |A| - 1}{|A| - 1}.$$

Theorem 2 For any type $P$,
$$\binom{n + |A| - 1}{|A| - 1}^{-1} 2^{nH(P)} \le |T_P^n| \le 2^{nH(P)}.$$

Proof. Fix the type $P$ and define $P^n(x_1^n) = \prod_i P(x_i)$.
A simple calculation shows that if $x_1^n$ has type $P$ then $P^n(x_1^n) = 2^{-nH(P)}$. Since $P^n$ is a probability distribution on $A^n$ we must have $P^n(T_P^n) \le 1$. This gives the desired upper bound since $P^n(T_P^n) = |T_P^n| 2^{-nH(P)}$. The lower bound can be obtained as follows. Let $A = \{a_1, a_2, \ldots, a_t\}$, where $t = |A|$. By definition of types we can write $P(a_i) = k_i/n$, $i = 1, 2, \ldots, t$, with $k_1 + k_2 + \ldots + k_t = n$, where $k_i$ is the number of times $a_i$ appears in $x_1^n$ for any fixed $x_1^n \in T_P^n$. Thus we have
$$|T_P^n| = \frac{n!}{k_1! k_2! \cdots k_t!},$$
so that
$$n^n = (k_1 + \ldots + k_t)^n = \sum \frac{n!}{j_1! \cdots j_t!}\, k_1^{j_1} \cdots k_t^{j_t},$$

[...]

for some $\tilde{t} \in (0, t)$. But
$$\frac{d}{dt} D(P_t\|Q) = \sum_a (P(a) - P^*(a)) \log \frac{P_t(a)}{Q(a)},$$
and this converges (as $t \downarrow 0$) to $-\infty$ if $P^*(a) = 0$ for some $a \in S(P)$, and otherwise to
$$\sum_a (P(a) - P^*(a)) \log \frac{P^*(a)}{Q(a)}. \tag{3}$$
It follows that the first contingency is ruled out, proving that $S(P^*) \supset S(P)$, and also that the quantity (??) is nonnegative, proving the claimed inequality.

Now we examine some situations in which the inequality of Theorem ?? is actually an equality. For any given functions $f_1, f_2, \ldots, f_k$ on $A$ and corresponding numbers $\alpha_1, \alpha_2, \ldots, \alpha_k$, the set
$$\mathcal{L} = \Big\{P: \sum_a P(a) f_i(a) = \alpha_i,\ 1 \le i \le k\Big\}$$
will be called a linear family of probability distributions. For any given functions $f_1, f_2, \ldots, f_k$ on $A$, the set $\mathcal{E}$ of all $P$ such that
$$P(a) = cQ(a) \exp\Big(\sum_{i=1}^k \theta_i f_i(a)\Big), \quad \text{for some } \theta_1, \ldots, \theta_k,$$
will be called an exponential family of probability distributions; here $Q$ is any given distribution and
$$c = c(\theta_1, \ldots, \theta_k) = \Big(\sum_a Q(a) \exp\Big(\sum_{i=1}^k \theta_i f_i(a)\Big)\Big)^{-1}.$$
We will assume that $S(Q) = A$; then $S(P) = A$ for all $P \in \mathcal{E}$. Note that $Q \in \mathcal{E}$. The family $\mathcal{E}$ depends on $Q$, of course, but only in a weak manner, for any element of $\mathcal{E}$ could play the role of $Q$. If necessary to emphasize this dependence on $Q$ we shall write $\mathcal{E} = \mathcal{E}_Q$.

Theorem 6 The I-projection $P^*$ of $Q$ onto a linear family $\mathcal{L}$ satisfies
$$D(P\|Q) = D(P\|P^*) + D(P^*\|Q), \quad \forall P \in \mathcal{L}.$$
Further, if $S(\mathcal{L}) = A$ then $\mathcal{L} \cap \mathcal{E}_Q = \{P^*\}$.
Proof. By the preceding theorem, $S(P^*) = S(\mathcal{L})$. Hence for every $P \in \mathcal{L}$ there is some $t < 0$ such that $P_t = (1 - t)P^* + tP \in \mathcal{L}$. Therefore, we must have $(d/dt) D(P_t\|Q)|_{t=0} = 0$, that is, the quantity (??) in the preceding proof is equal to 0, for all $P \in \mathcal{L}$. This gives the desired identity. Also we can equivalently write
$$\sum_a P(a) \Big[\log \frac{P^*(a)}{Q(a)} - D(P^*\|Q)\Big] = 0, \quad P \in \mathcal{L}. \tag{4}$$
Now, by the definition of $\mathcal{L}$, the distributions $P \in \mathcal{L}$, regarded as $|A|$-dimensional vectors, are in the orthogonal complement of the subspace $F$ spanned by the $k$ vectors $\{f_i(\cdot) - \alpha_i: 1 \le i \le k\}$. If $S(\mathcal{L}) = A$ then the distributions $P \in \mathcal{L}$ also span the orthogonal complement of $F$, from Lemma ??, below, and hence the identity (??) implies that the vector
$$\log \frac{P^*(\cdot)}{Q(\cdot)} - D(P^*\|Q)$$
must be in $F$. This proves that $P^* \in \mathcal{E}_Q$. Finally, if $\tilde{P} \in \mathcal{L} \cap \mathcal{E}_Q$ then it is easily checked that the identity (??) holds for $\tilde{P}$ in place of $P^*$. This implies that $\tilde{P}$ satisfies the Pythagorean identity in the role of $P^*$, and this, in turn, implies that $\tilde{P} = P^*$. The proof of the theorem is finished, once the following linear algebra result is established.

Lemma 2 Suppose $V$ is a subspace of $\mathbb{R}^n$ such that there is a strictly positive vector $p \in V^\perp$, the orthogonal complement of $V$. Then $V^\perp$ is spanned by the probability vectors that belong to it.

Proof. Choose a basis for $V^\perp$ of the form $\{p, q_1, \ldots, q_\ell\}$ and determine $t_i \in (0, 1)$, $1 \le i \le \ell$, such that $p_i = (1 - t_i)p + t_i q_i$ is a nonnegative vector. The vectors $\{p, p_1, \ldots, p_\ell\}$ are easily seen to be a basis for $V^\perp$; each can then be rescaled to obtain a basis for $V^\perp$ that consists of probability vectors. This completes the proof of the lemma.

If $S(\mathcal{L}) \ne A$ then no element of the exponential family $\mathcal{E} = \mathcal{E}_Q$ can belong to $\mathcal{L}$, but since $\mathcal{E}$ is not a closed set in general, some element of the closure $\mathrm{cl}(\mathcal{E})$ may be in $\mathcal{L}$. Indeed, if there is a $\tilde{P} \in \mathcal{L} \cap \mathrm{cl}(\mathcal{E})$ then the Pythagorean identity still holds for $\tilde{P}$, and this implies that $\tilde{P} = P^*$.
A sequence of elements converging to $P^*$ can always be generated by the "generalized iterative scaling" algorithm, which will be discussed at the end of this section. Hence we always have $\mathcal{L} \cap \mathrm{cl}(\mathcal{E}) = \{P^*\}$.

Suppose now that $\mathcal{L}_1, \ldots, \mathcal{L}_m$ are given linear families and generate a sequence of distributions $P_n$ as follows: Set $P_0 = Q$ (any given distribution with $S(Q) = A$), let $P_1$ be the I-projection of $P_0$ onto $\mathcal{L}_1$, $P_2$ the I-projection of $P_1$ onto $\mathcal{L}_2$, and so on, where for $n > m$ we mean by $\mathcal{L}_n$ that $\mathcal{L}_i$ for which $i \equiv n \pmod m$; i.e., $\mathcal{L}_1, \ldots, \mathcal{L}_m$ is repeated cyclically.

Theorem 7 If $\cap_{i=1}^m \mathcal{L}_i = \mathcal{L} \ne \emptyset$ then $P_n \to P^*$, the I-projection of $Q$ onto $\mathcal{L}$.

Proof. By the preceding theorem, we have for every $P \in \mathcal{L}$ (even for $P \in \mathcal{L}_n$) that
$$D(P\|P_{n-1}) = D(P\|P_n) + D(P_n\|P_{n-1}), \quad n = 1, 2, \ldots$$
Adding these equations for $1 \le n \le N$ we get that
$$D(P\|Q) = D(P\|P_0) = D(P\|P_N) + \sum_{n=1}^N D(P_n\|P_{n-1}).$$
By compactness there exists a subsequence $P_{N_k} \to P'$, say, and then from the preceding equality we get for $N_k \to \infty$ that
$$D(P\|Q) = D(P\|P') + \sum_{n=1}^\infty D(P_n\|P_{n-1}). \tag{5}$$
Since this series is convergent we have $D(P_n\|P_{n-1}) \to 0$, and hence also $|P_n - P_{n-1}| \to 0$, where $|P_n - P_{n-1}|$ denotes the usual variational distance $\sum_a |P_n(a) - P_{n-1}(a)|$. This implies that together with $P_{N_k} \to P'$ we also have
$$P_{N_k+1} \to P',\ P_{N_k+2} \to P',\ \ldots,\ P_{N_k+m} \to P'.$$
Since by the periodic construction, among the $m$ consecutive elements $P_{N_k}, P_{N_k+1}, \ldots, P_{N_k+m-1}$ there is one in each $\mathcal{L}_i$, $i = 1, 2, \ldots, m$, it follows that $P' \in \cap \mathcal{L}_i = \mathcal{L}$.

Since $P' \in \mathcal{L}$ it may be substituted for $P$ in (??) to yield
$$D(P'\|Q) = \sum_{n=1}^\infty D(P_n\|P_{n-1}).$$
With this, in turn, (??) becomes
$$D(P\|Q) = D(P\|P') + D(P'\|Q),$$
which proves that $P'$ equals the I-projection of $Q$ onto $\mathcal{L}$. Finally, as $P'$ was the limit of an arbitrary convergent subsequence of the sequence $P_n$, our result means that every convergent subsequence of $P_n$ has the same limit $P^*$. Using compactness again, this proves that $P_n \to P^*$ and completes the proof of the theorem.
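The cyclic projection scheme of Theorem 7 can be illustrated on a two-way table with prescribed row and column sums. The sketch below assumes the fact, established in the notes in the discussion of iterative scaling, that the I-projection onto a family with fixed sums over the cells of a partition is obtained by rescaling each cell of the partition. The table `Q`, the target marginals and all function names are illustrative choices.

```python
def scale_rows(table, row_targets):
    """I-projection onto {P: row sums = row_targets}: rescale each row."""
    return [[x * target / sum(row) for x in row]
            for row, target in zip(table, row_targets)]

def scale_cols(table, col_targets):
    """I-projection onto {P: column sums = col_targets}: rescale each column."""
    col_sums = [sum(col) for col in zip(*table)]
    return [[x * t / s for x, t, s in zip(row, col_targets, col_sums)]
            for row in table]

def cyclic_projections(q, row_targets, col_targets, sweeps=200):
    """Alternate the two projections, starting from P_0 = Q."""
    p = [row[:] for row in q]
    for _ in range(sweeps):
        p = scale_rows(p, row_targets)
        p = scale_cols(p, col_targets)
    return p

Q = [[0.4, 0.1],
     [0.1, 0.4]]
rows, cols = [0.5, 0.5], [0.3, 0.7]
P_star = cyclic_projections(Q, rows, cols)
print([sum(r) for r in P_star])        # close to the row targets
print([sum(c) for c in zip(*P_star)])  # close to the column targets
```

Each scaling step multiplies the table by a factor depending only on the row or only on the column, so the limit keeps the cross-product ratio $Q(1,1)Q(2,2)/(Q(1,2)Q(2,1))$ of the starting table, which is one way to see that it lies in the closure of the exponential family through $Q$.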
Now we discuss iterative scaling, a method for evaluating I-projections that is useful in the analysis of contingency tables, a subject to be discussed in the next section. Let $\mathcal{B} = \{B_1, B_2, \ldots, B_k\}$ be a partition of $A$ and let $P$ be a distribution on $A$. The distribution defined on $\{1, 2, \ldots, k\}$ by the formula
$$P^{\mathcal{B}}(i) = \sum_{a \in B_i} P(a)$$
is called the $\mathcal{B}$-lumping of $P$. Fix nonnegative constants $\alpha_i$, $i \le k$, whose sum is 1, and let $\mathcal{L} = \{P: P^{\mathcal{B}}(i) = \alpha_i, \forall i\}$. The I-projection of any $Q$ onto $\mathcal{L}$ is obtained simply by "scaling":
$$P^*(a) = c_i Q(a),\ a \in B_i, \quad \text{where } c_i = \frac{\alpha_i}{Q^{\mathcal{B}}(i)}. \tag{6}$$
This follows from the fact that lumping does not increase divergence, that is, $D(P\|Q) \ge D(P^{\mathcal{B}}\|Q^{\mathcal{B}})$. The condition that $P \in \mathcal{L}$ is equivalent to the condition that $P(B_i) = \alpha_i$, $\forall i$. If $P^*(a) = \alpha_i Q(a)/Q(B_i)$, $a \in B_i$, then
$$\sum_a P^*(a) \log \frac{P^*(a)}{Q(a)} = \sum_i \sum_{a \in B_i} \frac{\alpha_i Q(a)}{Q(B_i)} \log \frac{\alpha_i}{Q(B_i)} = D(\alpha\|Q^{\mathcal{B}}), \quad \alpha = (\alpha_1, \ldots, \alpha_k).$$
Thus, if $P \in \mathcal{L}$ then $D(P\|Q) \ge D(P^{\mathcal{B}}\|Q^{\mathcal{B}}) = D(\alpha\|Q^{\mathcal{B}})$, which establishes (??).

Now, if $\mathcal{L}_1, \mathcal{L}_2, \ldots, \mathcal{L}_m$ are all of the preceding form, then the iterated sequence of I-projections $P_1, P_2, \ldots$ in Theorem ?? can all be obtained by iterative scaling, and the theorem gives that the so obtained sequence converges to the I-projection of $Q$ onto the intersection $\mathcal{L} = \cap_{i=1}^m \mathcal{L}_i$. In particular, as we shall see in a later section, iterative scaling can be used to evaluate the I-projections that are needed in the analysis of contingency tables.

3 f-divergence and contingency tables.

Let $f(t)$ be a convex function defined for $t > 0$ with $f(1) = 0$. The $f$-divergence of a distribution $P$ from $Q$ is defined by
$$D_f(P\|Q) = \sum_a Q(a) f\Big(\frac{P(a)}{Q(a)}\Big).$$
Here we take $0 f(0/0) = 0$, $f(0) = \lim_{t \to 0} f(t)$, and $0 f(a/0) = \lim_{t \to 0} t f(a/t) = a \lim_{u \to \infty} f(u)/u$.

Some examples include the following.

(1) $f(t) = t \log t \Rightarrow D_f(P\|Q) = D(P\|Q)$.
(2) $f(t) = -\log t \Rightarrow D_f(P\|Q) = D(Q\|P)$.
(3) $f(t) = (t - 1)^2 \Rightarrow D_f(P\|Q) = \sum_a \frac{(P(a) - Q(a))^2}{Q(a)}$.
(4) $f(t) = 1 - \sqrt{t} \Rightarrow D_f(P\|Q) = 1 - \sum_a \sqrt{P(a) Q(a)}$.
(5) $f(t) = |t - 1| \Rightarrow D_f(P\|Q) = |P - Q|$.
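The five examples can be checked numerically from the single defining sum. This is a minimal sketch, assuming strictly positive $P$ and $Q$ so that none of the boundary conventions is needed; the helper name `f_divergence` and the test distributions are illustrative.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)); assumes all P(a), Q(a) > 0."""
    return sum(qa * f(pa / qa) for pa, qa in zip(p, q))

P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]

kl = f_divergence(lambda t: t * math.log2(t), P, Q)          # example (1)
reverse_kl = f_divergence(lambda t: -math.log2(t), P, Q)     # example (2)
chi_square = f_divergence(lambda t: (t - 1) ** 2, P, Q)      # example (3)
hellinger = f_divergence(lambda t: 1 - math.sqrt(t), P, Q)   # example (4)
variation = f_divergence(lambda t: abs(t - 1), P, Q)         # example (5)

print(kl, reverse_kl, chi_square, hellinger, variation)
```

For these particular distributions the $\chi^2$ value is $1/8$ and the variational distance is $1/3$, which can be confirmed by hand from the closed forms in (3) and (5).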
The expression $D_f(P\|Q) = \sum_a \frac{(P(a) - Q(a))^2}{Q(a)}$ will be denoted by $\chi^2(P, Q)$.

The analogue of the log-sum inequality is
$$\sum_i b_i f\Big(\frac{a_i}{b_i}\Big) \ge b f\Big(\frac{a}{b}\Big), \quad a = \sum a_i,\ b = \sum b_i.$$
Using this, many of the properties of the information divergence $D(P\|Q)$ extend to general $f$-divergences, in particular

Lemma 3 $D_f(P\|Q) \ge 0$ and if $f$ is strictly convex at $t = 1$ then $D_f(P\|Q) = 0$ only when $P = Q$. Further, $D_f(P\|Q)$ is a convex function of the pair $(P, Q)$, and the partitioning property, $D_f(P\|Q) \ge D_f(P^{\mathcal{B}}\|Q^{\mathcal{B}})$, holds for any partition $\mathcal{B}$ of $A$.

A basic theorem about $f$-divergences is the following approximation property.

Theorem 8 If $f$ is twice differentiable at $t = 1$ and $f''(1) > 0$ then for any $Q$ with $S(Q) = A$ and $P$ "close" to $Q$ we have
$$D_f(P\|Q) \sim \frac{f''(1)}{2} \chi^2(P, Q).$$
(Formally, $D_f(P\|Q)/\chi^2(P, Q) \to f''(1)/2$ as $\chi^2(P, Q) \to 0$.)

Proof. Since $f(1) = 0$, Taylor's expansion gives
$$f(t) = f'(1)(t - 1) + \frac{f''(1)}{2}(t - 1)^2 + \varepsilon(t)(t - 1)^2,$$
where $\varepsilon(t) \to 0$ as $t \to 1$. Hence
$$Q(a) f\Big(\frac{P(a)}{Q(a)}\Big) = f'(1)(P(a) - Q(a)) + \frac{f''(1)}{2} \frac{(P(a) - Q(a))^2}{Q(a)} + \varepsilon\Big(\frac{P(a)}{Q(a)}\Big) \frac{(P(a) - Q(a))^2}{Q(a)}.$$
Summing over $a \in A$ then establishes the theorem.

Remark 2 The same proof works even if $Q$ is not fixed, provided that no $Q(a)$ can become arbitrarily small. However, the theorem (the "asymptotic equivalence" of $f$-divergences subject to the differentiability hypotheses) does not remain true if $Q$ is not fixed and the probabilities $Q(a)$ are not bounded away from 0.

Corollary 2 If $f$ satisfies the hypotheses of the theorem and $\hat{P}$ is the empirical distribution (i.e., type) of a sample of size $n$ drawn independently from the distribution $Q$, then $(2/f''(1))\, n D_f(\hat{P}\|Q)$ has an asymptotic $\chi^2$ distribution, with $|A| - 1$ degrees of freedom, as $n \to \infty$.

The $\chi^2$ distribution with $k$ degrees of freedom is defined as the distribution of the sum of squares of $k$ independent random variables having the standard normal distribution.
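The approximation property of Theorem 8 can be observed numerically for the information divergence, where $f(t) = t \log t$ with base two logarithms gives $f''(1) = \log e$. The sketch below moves $P$ toward a fixed $Q$ along an arbitrary direction `h` (an illustrative choice whose entries sum to 0) and prints the ratio $D(P\|Q)/\chi^2(P, Q)$ next to the predicted limit $(\log e)/2$.

```python
import math

def kl_bits(p, q):
    """Information divergence D(P||Q) with base two logarithms."""
    return sum(pa * math.log2(pa / qa) for pa, qa in zip(p, q))

def chi2(p, q):
    """chi^2(P, Q) = sum_a (P(a) - Q(a))^2 / Q(a)."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))

Q = [0.5, 0.3, 0.2]
h = [1.0, -0.5, -0.5]            # perturbation direction; entries sum to 0
limit = math.log2(math.e) / 2    # f''(1)/2 for f(t) = t log2(t)

ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    P = [qa + eps * ha for qa, ha in zip(Q, h)]
    ratios.append(kl_bits(P, Q) / chi2(P, Q))
    print(eps, ratios[-1], limit)
```

The printed ratios approach the limit as `eps` shrinks; the factor $2/f''(1) = 2/\log e$ is the normalization that appears in Corollary 2.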
By this corollary, both $(2/\log e)\, n D(\hat{P}\|Q)$ and $(2/\log e)\, n D(Q\|\hat{P})$ are asymptotically $\chi^2$ with $|A| - 1$ degrees of freedom.

One property that distinguishes information divergence among $f$-divergences is transitivity of projections, as summarized in the following lemma. It can, in fact, be shown that the only $f$-divergence for which either of the two properties of the lemma holds is the informational divergence.

Lemma 4 Let $P^*$ be the I-projection of $Q$ onto a linear family $\mathcal{L}$. Then

(i) For any convex subfamily $\mathcal{L}' \subset \mathcal{L}$ the I-projections of $Q$ and of $P^*$ onto $\mathcal{L}'$ are the same.
(ii) For any "translate" $\mathcal{L}'$ of $\mathcal{L}$, the I-projections of $Q$ and of $P^*$ onto $\mathcal{L}'$ are the same, provided $S(P^*) = A$.

Proof. By the Pythagorean identity
$$D(P\|Q) = D(P\|P^*) + D(P^*\|Q), \quad P \in \mathcal{L}.$$
It follows that on any subset of $\mathcal{L}$ the minimum of $D(P\|Q)$ and of $D(P\|P^*)$ are achieved by the same $P$. This establishes (i).

$\mathcal{L}'$ is called a translate of $\mathcal{L}$ if it is defined in terms of the same functions $f_i$, but possibly different $\alpha_i$. Hence, the exponential family corresponding to $\mathcal{L}'$ is the same as it is for $\mathcal{L}$. Since $S(P^*) = A$, we know that $P^*$ belongs to this exponential family. But every element of the exponential family has the same I-projection onto $\mathcal{L}'$, which establishes (ii).

[...]

In the following theorem the sets $\mathcal{P}$ and $\mathcal{Q}$ and the function $D(P\|Q)$ are completely arbitrary. In later applications $D(P\|Q)$ will be the divergence and $\mathcal{P}$ and $\mathcal{Q}$ will be convex sets of distributions on a finite set $A$.

Theorem 10 Let $D(P\|Q)$ be an arbitrary real-valued function defined for $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ such that $P^* = P^*(Q) = \arg\min_P D(P\|Q)$ exists for all $Q \in \mathcal{Q}$ and $Q^* = Q^*(P) = \arg\min_Q D(P\|Q)$ exists for all $P \in \mathcal{P}$. Suppose further that there is a nonnegative function $\delta(P\|P')$ defined on $\mathcal{P} \times \mathcal{P}$ with the following "three-points property,"
$$\delta(P\|P^*(Q)) + D(P^*(Q)\|Q) \le D(P\|Q), \quad \forall P \in \mathcal{P},\ Q \in \mathcal{Q},$$
as well as the following "four-points property,"
$$D(P'\|Q') + \delta(P'\|P) \ge D(P'\|Q^*(P)), \quad \forall P, P' \in \mathcal{P},\ Q' \in \mathcal{Q}.$$
Let $Q_0$ be an arbitrary member of $\mathcal{Q}$ and recursively define
$$P_n = \arg\min_{P \in \mathcal{P}} D(P\|Q_{n-1}), \quad Q_n = \arg\min_{Q \in \mathcal{Q}} D(P_n\|Q). \tag{8}$$
Then
$$\lim_{n \to \infty} D(P_n\|Q_n) = \inf_{P \in \mathcal{P}, Q \in \mathcal{Q}} D(P\|Q).$$
If, in addition, (i) $\min_{Q \in \mathcal{Q}} D(P\|Q)$ is continuous in $P$, (ii) $\mathcal{P}$ is compact, and (iii) $\delta(P\|P_n) \to 0$ iff $P_n \to P$, then for the iteration (??) $P_n$ will converge to some $P^*$, such that if $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P^*\|Q)$ then $D(P^*\|Q^*) = \min_{P \in \mathcal{P}, Q \in \mathcal{Q}} D(P\|Q)$ and, moreover, $\delta(P^*\|P_n) \downarrow 0$ and
$$D(P_n\|Q_n) - D(P^*\|Q^*) \le \delta(P^*\|P_{n-1}) - \delta(P^*\|P_n).$$

Proof. We have, by the three-points property,
$$\delta(P\|P_{n+1}) + D(P_{n+1}\|Q_n) \le D(P\|Q_n),$$
and, by the four-points property,
$$D(P\|Q_n) \le D(P\|Q) + \delta(P\|P_n),$$
for all $P \in \mathcal{P}$, $Q \in \mathcal{Q}$. Hence
$$\delta(P\|P_{n+1}) \le D(P\|Q) - D(P_{n+1}\|Q_n) + \delta(P\|P_n). \tag{9}$$
The inequality (??) implies the desired basic limit result
$$\lim_{n \to \infty} D(P_n\|Q_n) = \inf_{P \in \mathcal{P}, Q \in \mathcal{Q}} D(P\|Q).$$
Indeed, if this were false it would mean that there exist $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ and $\varepsilon > 0$ such that
$$\lim_{n \to \infty} D(P_n\|Q_n) = \lim_{n \to \infty} D(P_{n+1}\|Q_n) > D(P\|Q) + \varepsilon.$$
Then (??) would give that $\delta(P\|P_{n+1}) \le \delta(P\|P_n) - \varepsilon$, $n = 1, 2, \ldots$, which contradicts the assumption that $\delta$ is nonnegative.

Suppose assumptions (i)-(iii) hold. Pick a subsequence $P_{n_k} \to P^*$, as $k \to \infty$, and let $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P^*\|Q)$. Our basic limit result and assumption (i) imply that $(P^*, Q^*)$ achieves $\min_{P,Q} D(P\|Q)$. But it is easy to see that (??) implies that if $(P, Q)$ achieves $\min_P \min_Q D(P\|Q)$ then $\delta(P\|P_{n+1}) \le \delta(P\|P_n)$ for every $n$. Thus $\delta(P^*\|P_n)$ must be nonincreasing, and, by assumption (iii) applied to the subsequence $P_{n_k}$, its limit must be 0. Using assumption (iii) once more, we conclude that $P_n \to P^*$. The final inequality in the statement of the theorem then follows from (??) by replacing $(P, Q)$ by $(P^*, Q^*)$. This completes the proof of the theorem.

Now we wish to apply the theorem to the case when $D(P\|Q)$ is the divergence and $\mathcal{P}$ and $\mathcal{Q}$ are convex, compact sets of nonnegative measures on $A$. No assumption that the measures are probability distributions is made at this point; hence, in particular, $D(P\|Q)$ may have negative values.
Of course, if $\sum P(a) \ge \sum Q(a)$ then $D(P\|Q) \ge 0$. Furthermore, the quantity
$$\delta(P\|Q) = \sum_a \Big[P(a) \log \frac{P(a)}{Q(a)} - (P(a) - Q(a)) \log e\Big]$$
is always nonnegative and vanishes iff $P = Q$. This $\delta$ satisfies assumption (iii) of the theorem as well as the three-points and four-points properties. We verify the four-points property and leave the verification of the other properties to the reader.

Let $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P\|Q)$, let $Q'$ be an arbitrary member of $\mathcal{Q}$, and set $Q_t = (1 - t)Q^* + tQ' \in \mathcal{Q}$, $0 \le t \le 1$. Then
$$0 \le \frac{1}{t}\big[D(P\|Q_t) - D(P\|Q^*)\big] = \frac{d}{dt} D(P\|Q_t)\Big|_{t = \tilde{t}}, \quad 0 < \tilde{t} \le t.$$
With $t \to 0$ it follows that
$$0 \le \lim_{\tilde{t} \to 0} \sum_a \frac{P(a)(Q^*(a) - Q'(a)) \log e}{(1 - \tilde{t})Q^*(a) + \tilde{t}Q'(a)} = \sum_a P(a) \frac{Q^*(a) - Q'(a)}{Q^*(a)} \log e. \tag{10}$$
If we then combine this with the fact that $\log t \ge (1 - 1/t) \log e$ we obtain
$$\sum_a \Big[P'(a) \log \frac{P'(a) Q^*(a)}{Q'(a) P(a)} - (P'(a) - P(a)) \log e\Big] \ge 0,$$
which is just a rewritten version of the four-points property.

Remark 5 Suppose we are given a convex family $\mathcal{F}$ of random variables defined on a finite probability space $(\Omega, P)$ and let $X^*$ be a member of the family for which $E(\log X)$ is maximal. Then, letting $X$ and $X^*$ play the role of $Q'$ and $Q^*$, respectively, the inequality (??) gives that
$$E\Big(\frac{X^* - X}{X}\Big) \ge 0, \quad \text{i.e.,} \quad E\Big(\frac{X^*}{X}\Big) \ge 1, \quad \forall X \in \mathcal{F}.$$
The finiteness assumption is not really needed here, for all that is needed is that $\max E(\log X)$ is attained. This is known as Cover's inequality.

The result of Theorem ?? can be applied to the problem of minimizing divergence from a set of distributions that is the image of a "nice" set in some other space. Let $T: A \mapsto B$ be a given mapping and for any $P$ on $A$ write $P^T$ for its image on $B$, that is,
$$P^T(b) = \sum_{a: Ta = b} P(a).$$

Problem 1. Given a set $\tilde{\mathcal{Q}} = \{Q^T: Q \in \mathcal{Q}\}$ of distributions on $B$ for some set $\mathcal{Q}$ of distributions on $A$, minimize $D(\tilde{P}\|\tilde{Q})$, subject to $\tilde{Q} \in \tilde{\mathcal{Q}}$, for some given $\tilde{P}$ on $B$. Here it is assumed that to any $P \in \mathcal{P}$, a $Q \in \mathcal{Q}$ minimizing $D(P\|Q)$ can "easily" be found.

Problem 2. The same but with the role of $P$ and $Q$ interchanged.
The first problem is relevant for maximum likelihood estimation based on partially observed data, when estimation from the full data would be "easy." The two problems can be solved in similar ways; we concentrate on the first one.

Let $\mathcal{P}$ be the set of all $P$ on $A$ such that $P^T = \tilde{P}$. Here $\tilde{P}$ and the elements of $\tilde{\mathcal{Q}}$ are not necessarily probability distributions; indeed, either $\sum \tilde{P}(b)$ or $\sum \tilde{Q}(b)$ may be less than, equal to, or greater than 1. Nevertheless the partitioning inequality gives
$$D(P\|Q) \ge D(P^T\|Q^T)$$
with equality iff
$$\frac{P(a)}{Q(a)} = \frac{P^T(Ta)}{Q^T(Ta)}, \quad \forall a \in A.$$
Hence $P^* \in \mathcal{P}$, $Q^* \in \mathcal{Q}$ achieve $\min_{P,Q} D(P\|Q)$ iff $\tilde{Q}^* = Q^{*T}$ achieves $\min_{\tilde{Q}} D(\tilde{P}\|\tilde{Q})$.

Such $(P^*, Q^*)$ can be found using Theorem ??. Indeed, to $Q_{n-1} \in \mathcal{Q}$ we can find $P_n \in \mathcal{P}$ minimizing $D(P\|Q_{n-1})$ for $P \in \mathcal{P}$ merely by letting
$$P_n(a) = Q_{n-1}(a) \frac{\tilde{P}(Ta)}{Q_{n-1}^T(Ta)},$$
for by definition $P^T = \tilde{P}$, if $P \in \mathcal{P}$. The alternate step, finding $Q_n \in \mathcal{Q}$ minimizing $D(P_n\|Q)$, is "easily" carried out, by assumption.

Now we apply the preceding discussion to a mixture distribution problem. Let $\tilde{\mathcal{Q}}$ be the set of all $\tilde{Q}$ of the form $\tilde{Q}(b) = \sum_{i=1}^k c_i \mu_i(b)$, where $c_i \ge 0$, $\sum c_i = 1$, and $\mu_i(b)$ are arbitrary nonnegative measures.

Goal: Find $(c_1^*, \ldots, c_k^*)$ achieving $\min_{\tilde{Q}} D(\tilde{P}\|\tilde{Q})$, for a given $\tilde{P}$.

Solution. Let $A$ be the set of all pairs $(i, b)$, $1 \le i \le k$, $b \in B$, and let $T(i, b) = b$. Define $\mathcal{P}$ and $\mathcal{Q}$ as above and apply the iteration scheme. Thus
$$\mathcal{P} = \Big\{P: \sum_{i=1}^k P(i, b) = \tilde{P}(b)\Big\}, \quad \mathcal{Q} = \{Q: Q(i, b) = c_i \mu_i(b)\}.$$
Start with an arbitrary $(c_1^0, \ldots, c_k^0)$ with positive components that sum to 1; this defines $Q_0(i, b) = c_i^0 \mu_i(b)$. If $Q_{n-1}(i, b) = c_i^{n-1} \mu_i(b)$ is already defined let $P_n$ be determined as above, that is,
$$P_n(i, b) = Q_{n-1}(i, b) \frac{\tilde{P}(b)}{\tilde{Q}_{n-1}(b)} = c_i^{n-1} \mu_i(b) \frac{\tilde{P}(b)}{\sum_j c_j^{n-1} \mu_j(b)}.$$
The next step is to find $Q_n \in \mathcal{Q}$ minimizing $D(P_n\|Q)$.
To do this put $P_n(i) = \sum_b P_n(i, b)$, $P_n(b|i) = P_n(i, b)/P_n(i)$ and use the relation $Q(i, b) = c_i \mu_i(b)$ to write
$$D(P_n\|Q) = \sum_{i=1}^k \sum_b P_n(i, b) \log \frac{P_n(i, b)}{Q(i, b)}$$
in the form
$$D(P_n\|Q) = \sum_{i,b} P_n(i) P_n(b|i) \Big[\log \frac{P_n(i)}{c_i} + \log \frac{P_n(b|i)}{\mu_i(b)}\Big]. \tag{11}$$
Note that $\sum_i P_n(i) = \sum_b \tilde{P}(b)$, and hence $D(P_n\|Q)$ is minimized if in (??) we set $c_i = P_n(i)/\sum_b \tilde{P}(b)$ (using the fact that $P_n(b|i)$ is a probability distribution for fixed $i$). Thus the recursion for $c_i^n$ will be
$$c_i^n = c_i^{n-1}\, \frac{\sum_b \dfrac{\tilde{P}(b) \mu_i(b)}{\sum_j c_j^{n-1} \mu_j(b)}}{\sum_b \tilde{P}(b)},$$
and by our general theorem, $c_i^n \to c_i^*$ achieving $\min_{\tilde{Q}} D(\tilde{P}\|\tilde{Q})$.

Remark 6 The finiteness of $B$ is not essential for the convergence of this iteration. In particular, using the remark with Cover's inequality, Remark ??, for positive valued random variables $X_1, \ldots, X_k$, the weights $c_i^*$ maximizing $E(\log \sum_i c_i X_i)$ can be found by the same iteration, i.e.,
$$c_i^n = c_i^{n-1} E\Big(\frac{X_i}{\sum_j c_j^{n-1} X_j}\Big).$$
This is Cover's portfolio algorithm.

Remark 7 The "decomposition of mixtures" algorithm can be used also if the individual $\mu_i$'s depend on some parameter to be estimated, i.e., when
$$\tilde{\mathcal{Q}} = \Big\{\tilde{Q}: \tilde{Q}(b) = \sum_i c_i \mu(b|\theta_i)\Big\}.$$
Then, from (??), $\theta_i^n$ is chosen to minimize the divergence
$$\sum_b P_n(b|i) \log \frac{P_n(b|i)}{\mu(b|\theta)}.$$
Unfortunately, the general theorem is not applicable to this case, because $\tilde{\mathcal{Q}}$ and $\mathcal{Q}$ are not convex. Indeed, the iteration may get stuck at a local minimum and fail to find the global one.

5 Redundancy.

This and the next two sections are concerned with measuring the performance of codes. The symbol $C_n$ will denote a binary prefix $n$-code with length function $L = L(C_n, n)$. The (pointwise) redundancy $R = R_P(C_n, n)$ of the code $C_n$ relative to a distribution $P$ on $A^n$ is defined by
$$R(x_1^n) = L(x_1^n) - \log \frac{1}{P(x_1^n)}.$$
The expected redundancy is
$$\bar{R} = E(R) = \sum_{x_1^n} L(x_1^n) P(x_1^n) - \sum_{x_1^n} P(x_1^n) \log \frac{1}{P(x_1^n)}.$$
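Both redundancy notions can be computed directly for $n = 1$ on a three-letter alphabet. The sketch below compares the length function $\lceil -\log P(a) \rceil$ of Theorem 1 with a second, hypothetical length function `L_other` whose Kraft sum is exactly 1 (so a prefix code with these lengths exists); the distribution and all names are illustrative choices.

```python
import math

def shannon_lengths(p):
    """Length function L(a) = ceil(-log2 P(a)) from Theorem 1."""
    return {a: math.ceil(-math.log2(pa)) for a, pa in p.items()}

def pointwise_redundancy(lengths, p):
    """R(a) = L(a) - log2(1 / P(a)) for every symbol a."""
    return {a: lengths[a] - math.log2(1 / p[a]) for a in p}

def expected_redundancy(lengths, p):
    """E(R) = sum_a P(a) R(a), which equals E(L) - H(P)."""
    r = pointwise_redundancy(lengths, p)
    return sum(p[a] * r[a] for a in p)

P = {"a": 0.5, "b": 0.3, "c": 0.2}
L_shannon = shannon_lengths(P)        # {'a': 1, 'b': 2, 'c': 3}
L_other = {"a": 2, "b": 1, "c": 2}    # Kraft sum 1/4 + 1/2 + 1/4 = 1

print(pointwise_redundancy(L_shannon, P), expected_redundancy(L_shannon, P))
print(pointwise_redundancy(L_other, P), expected_redundancy(L_other, P))
```

With the lengths $\lceil -\log P(a) \rceil$ every pointwise value lies in $[0, 1)$; with `L_other` the symbols `b` and `c` have negative pointwise redundancy while the expected value stays nonnegative.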
The Shannon code determined by the length function $L(x_1^n) = \lceil -\log P(x_1^n) \rceil$ produces essentially zero redundancy and is almost the optimal code for $P$, in that it produces expected coding length within 1 bit of the minimal expected coding length. Thus, in general, redundancy gives an approximate measure of the cost in using the code $C_n$ on $P$-sequences, rather than the optimal code.

Note that the expected redundancy $E(R) = E(L) - H(P)$ is always nonnegative, but the pointwise redundancy $R(x_1^n)$ can take negative values. We will show that for random processes the pointwise redundancy is essentially nonnegative. A random process is an infinite sequence $X_1, X_2, \ldots$ of $A$-valued random variables defined on a probability space $(\Omega, P^*)$. The Kolmogorov representation of a process produces the measure $P$ on the space $A^\infty$ of infinite sequences drawn from $A$, which is defined by requiring that the value of $P$ on cylinder sets $[a_1^n] = \{x \in A^\infty: x_1^n = a_1^n\}$ be given by the formula
$$P([a_1^n]) = \mathrm{Prob}(X_i = a_i,\ 1 \le i \le n).$$
If $P$ is the Kolmogorov measure determined by a process we shall write $P_n$ for the measure on $A^n$ determined by $P_n(a_1^n) = P([a_1^n])$. (In cases where $n$ is clear from the context we write $P$ in place of $P_n$.)

Note that a process defines a sequence of distributions $P_n$, where $P_n$ is defined on $A^n$. The key difference between the concept of process and the general concept of sequences $\{P_n\}$ of distributions is that $P_{n+1}$ is required to be related to $P_n$ by the (Kolmogorov consistency) formula
$$P_n(x_1^n) = \sum_{x_{n+1}} P_{n+1}(x_1^{n+1}).$$
In the remainder of this section, $P$ will denote the Kolmogorov measure of a random process $\{X_n\}$ and

[...]

which implies that $R(x_1^n) \le \log(P(x_1^n)/Q(x_1^n))$. This, combined with (??), implies our desired result that $R(x_1^n)/n \to 0$, a.s. This completes the proof of the theorem.

The following principle, called the minimum description length (MDL) principle, has been suggested by Rissanen.

Principle.
The statistical information in data is best extracted when the shortest possible description of the data is found. The distribution inferred from the data is the one that leads to the shortest description, taking into account that the inferred distribution itself must be described.

Let $\Gamma$ be a given finite or countably infinite list of stationary ergodic processes on the space $A^\infty$. To each $U \in \Gamma$ let a codeword of length $L(U)$ be assigned as a description of $U$; these lengths must satisfy the Kraft inequality. Then, given a sample $x_1^n$, the MDL estimate $\hat{P}_n$ of the unknown distribution $P$ is $\hat{P}_n = U$, where $U$ achieves $\min_{U \in \Gamma} [L(U) - \log U(x_1^n)]$.

Theorem 15 If $P \in \Gamma$ then $\hat{P}_n = P$, eventually almost surely.

Proof. Let
$$Q = \sum_{U \in \Gamma - \{P\}} 2^{-L(U)} U,$$
and note that
$$Q(x_1^n) \ge \max_{U \in \Gamma - \{P\}} 2^{-L(U)} U(x_1^n),$$
that is,
$$\log \frac{1}{Q(x_1^n)} \le \min_{U \in \Gamma - \{P\}} \Big[L(U) + \log \frac{1}{U(x_1^n)}\Big]. \tag{15}$$
Now, $Q$ is singular with respect to $P$, since each stationary, ergodic $U \ne P$ is singular with respect to the stationary, ergodic process $P$; hence by Theorem ?? the redundancy of the Shannon code with respect to $Q$ goes to $+\infty$, that is,
$$\log \frac{1}{Q(x_1^n)} - \log \frac{1}{P(x_1^n)} \to \infty, \quad \text{a.s.}$$
Using the bound (??) we therefore have
$$\min_{U \in \Gamma - \{P\}} \Big[L(U) + \log \frac{1}{U(x_1^n)}\Big] - \log \frac{1}{P(x_1^n)} \to \infty, \quad \text{a.s.,}$$
hence, for sufficiently large $n$,
$$\min_{U \in \Gamma - \{P\}} \Big[L(U) + \log \frac{1}{U(x_1^n)}\Big] > \log \frac{1}{P(x_1^n)} + L(P).$$
The preceding inequality implies that $\hat{P}_n = P$ and completes the proof of the theorem.

Now suppose we are given a finite or countable list of parametric families of (stationary, ergodic) processes $\{P_\theta: \theta \in \Theta_\gamma\}$, where $\gamma \in \Gamma$, and to each family on the list, i.e., to each $\gamma \in \Gamma$, suppose there is assigned a codeword of length $L(\gamma)$ describing this family, such that the Kraft inequality holds. Further, on each parameter set $\Theta_\gamma$ let there be given a "prior" $\nu_\gamma$, i.e., $\nu_\gamma$ is a probability measure on $\Theta_\gamma$. We also assume that the mixture distributions
$$Q_\gamma = \int_{\Theta_\gamma} P_\theta\, \nu_\gamma(d\theta), \quad \gamma \in \Gamma,$$
are mutually singular.
(In particular, this means that the families $\{P_\theta : \theta \in \Theta_\gamma\}$ are essentially disjoint.)

Theorem 16 There exist subsets $\tilde\Theta_\gamma \subset \Theta_\gamma$ of $\nu_\gamma$-measure 1 such that if $P = P_\theta$ with $\theta \in \tilde\Theta_{\gamma^*}$, for some $\gamma^* \in \Gamma$, then
\[ \min_{\gamma \in \Gamma} \left[ L(\gamma) + \log \frac{1}{Q_\gamma(x_1^n)} \right] \]
is attained for $\gamma = \gamma^*$, eventually almost surely.

In other words, the family containing the true distribution will be found with probability 1, unless $P$ is in a subset of this family having $\nu_\gamma$-measure 0.

Proof. Exactly as in the proof of the preceding theorem (replacing $U$ by $Q_\gamma$ and $L(U)$ by $L(\gamma)$) we obtain that, for sufficiently large $n$,
\[ \min_{\gamma \in \Gamma} \left[ L(\gamma) + \log \frac{1}{Q_\gamma(x_1^n)} \right] \]
will be attained for $\gamma = \gamma^*$, with $Q_{\gamma^*}$-probability 1. Let $F$ be the set of all $x \in A^\infty$ for which this "almost sure" statement is true, so that $Q_{\gamma^*}(F^c) = 0$. Since by definition
\[ Q_{\gamma^*}(F^c) = \int_{\Theta_{\gamma^*}} P_\theta(F^c) \, \nu_{\gamma^*}(d\theta), \]
it follows that $\nu_{\gamma^*}(\{\theta : P_\theta(F^c) > 0\}) = 0$, and we can take
\[ \tilde\Theta_{\gamma^*} = \Theta_{\gamma^*} - \{\theta : P_\theta(F^c) > 0\}. \]
This completes the proof of the theorem.

Remark 8 The hypotheses of Theorem ?? are fulfilled, in particular, when the parameter sets $\Theta_\gamma$ are subsets of Euclidean spaces of different dimensions and $\nu_\gamma$ is absolutely continuous with respect to the Lebesgue measure for the corresponding dimension.

6 Redundancy bounds.

Some techniques for obtaining bounds on redundancy for i.i.d. processes will be discussed in this section. Consider the i.i.d. process with alphabet $A = \{1, \ldots, k\}$ and distribution $P$. We then have
\[ P(x_1^n) = \prod_{i=1}^k P(i)^{n_i}, \]
where $n_i$ is the number of times $i$ occurs in $x_1^n$. This probability is maximal if $P(i) = n_i/n$, hence the maximum likelihood value is given by
\[ P_{ML}(x_1^n) = \prod_{i=1}^k \left( \frac{n_i}{n} \right)^{n_i}. \]
When encoding with respect to an auxiliary distribution $Q$, the redundancy satisfies (disregarding at most 1 bit) the following simple bound:
\[ R(x_1^n) = \log \frac{P(x_1^n)}{Q(x_1^n)} \le \log \frac{P_{ML}(x_1^n)}{Q(x_1^n)}. \]
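(16)

Let us take for $Q$ the mixture distribution
\[ Q(x_1^n) = \int U(x_1^n) \, \nu(p) \, dp, \]
with a Dirichlet prior having density
\[ \nu(p) = \frac{\Gamma\left( \sum_{i=1}^k \alpha_i + k \right)}{\prod_{i=1}^k \Gamma(\alpha_i + 1)} \prod_{i=1}^k p_i^{\alpha_i}, \quad p = (p_1, \ldots, p_k). \]
Here $U$ denotes the i.i.d. process with distribution $p$, so that $U(x_1^n) = \prod_{i=1}^k p_i^{n_i}$. For $\alpha_1 = \cdots = \alpha_k = -1/2$ we will get a sharp upper bound on the redundancy (??), a bound depending neither on the true distribution $P$ nor on $x_1^n$. Before we state and derive this bound we obtain a representation for $Q$ that will be useful in constructing the Shannon code for $Q$.

For a Dirichlet prior with arbitrary $\alpha_i > -1$, $\forall i$, we have
\[ Q(x_1^n) = \int U(x_1^n) \, \nu(p) \, dp = \int \prod_{i=1}^k p_i^{n_i + \alpha_i} \, dp \cdot \frac{\Gamma\left( \sum_{i=1}^k \alpha_i + k \right)}{\prod_{i=1}^k \Gamma(\alpha_i + 1)} = \frac{\Gamma\left( \sum_{i=1}^k \alpha_i + k \right)}{\Gamma\left( n + \sum \alpha_i + k \right)} \prod_{i=1}^k \frac{\Gamma(n_i + \alpha_i + 1)}{\Gamma(\alpha_i + 1)}. \]
Using the functional equation $\Gamma(x + 1) = x\Gamma(x)$ we see that $Q(x_1^n)$ is given by the ratio
\[ \frac{\prod_{i=1}^k \left[ (n_i + \alpha_i)(n_i - 1 + \alpha_i) \cdots (1 + \alpha_i) \right]}{\left( n - 1 + \sum \alpha_i + k \right)\left( n - 2 + \sum \alpha_i + k \right) \cdots \left( \sum \alpha_i + k \right)} \]
or, equivalently,
\[ Q(x_1^n) = \prod_{j=1}^n \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k}, \tag{17} \]
where $n(x_j | x_1^{j-1})$ is the number of occurrences of the symbol $x_j$ in the "past" $x_1^{j-1}$.

Theorem 17 If $Q$ is defined by (??) with $\alpha_i = -1/2$, $\forall i$, the redundancy always satisfies
\[ R(x_1^n) \le \log \frac{\Gamma\left(n + \frac{k}{2}\right)\Gamma\left(\frac{1}{2}\right)}{\Gamma\left(n + \frac{1}{2}\right)\Gamma\left(\frac{k}{2}\right)} \le \frac{k - 1}{2} \log n - \log \frac{\Gamma(k/2)}{\Gamma(1/2)} + \varepsilon_n, \]
where $\varepsilon_n \to 0$ as $n \to \infty$.

Proof. The second inequality is a simple consequence of Stirling's formula for the $\Gamma$-function, so it is enough to prove the first inequality. For $\alpha_i \equiv -1/2$ we have
\[ Q(x_1^n) = \frac{\Gamma\left(\frac{k}{2}\right)}{\Gamma\left(n + \frac{k}{2}\right)} \prod_{i=1}^k \frac{\Gamma\left(n_i + \frac{1}{2}\right)}{\Gamma\left(\frac{1}{2}\right)} = \frac{\prod_{i=1}^k \left[ \left(n_i - \frac{1}{2}\right)\left(n_i - \frac{3}{2}\right) \cdots \frac{1}{2} \right]}{\left(n - 1 + \frac{k}{2}\right)\left(n - 2 + \frac{k}{2}\right) \cdots \frac{k}{2}}. \tag{18} \]
Note that, in particular, if $\tilde x_1^n$ consists of identical symbols, say, $\tilde x_i \equiv a$, then
\[ Q(\tilde x_1^n) = \frac{\Gamma\left(\frac{k}{2}\right)\Gamma\left(n + \frac{1}{2}\right)}{\Gamma\left(n + \frac{k}{2}\right)\Gamma\left(\frac{1}{2}\right)}; \]
hence to prove Theorem ?? it is enough to show that $R(x_1^n) \le \log\bigl(1/Q(\tilde x_1^n)\bigr)$. The simple upper bound (??) then tells us that it is enough to show that
\[ P_{ML}(x_1^n) = \prod_{i=1}^k \left( \frac{n_i}{n} \right)^{n_i} \le \frac{Q(x_1^n)}{Q(\tilde x_1^n)}, \]
where $\tilde x_i \equiv a$. The identity (??)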
can then be used to see that it is enough to prove that
\[ \prod_{i=1}^k \left( \frac{n_i}{n} \right)^{n_i} \le \frac{\prod_{i=1}^k \left[ \left(n_i - \frac{1}{2}\right)\left(n_i - \frac{3}{2}\right) \cdots \frac{1}{2} \right]}{\left(n - \frac{1}{2}\right)\left(n - \frac{3}{2}\right) \cdots \frac{1}{2}}, \]
which can be converted to
\[ \prod_{i=1}^k \left( \frac{n_i}{n} \right)^{n_i} \le \frac{\prod_{i=1}^k \left[ 2n_i(2n_i - 1) \cdots (n_i + 1) \right]}{2n(2n - 1) \cdots (n + 1)} \tag{19} \]
since
\[ \left(n - \tfrac{1}{2}\right)\left(n - \tfrac{3}{2}\right) \cdots \tfrac{1}{2} = \frac{1}{n!} \left[ n\left(n - \tfrac{1}{2}\right)(n - 1)\left(n - \tfrac{3}{2}\right) \cdots 1 \cdot \tfrac{1}{2} \right] = \frac{(2n)!}{2^{2n}\, n!} = \frac{2n(2n - 1) \cdots (n + 1)}{2^{2n}}. \]
At last we have arrived at the assertion we shall prove, namely, (??). This will be proved if we show that it is possible to assign to each $\ell = 1, \ldots, n$, in a one-to-one manner, a pair $(i, j)$, $1 \le i \le k$, $1 \le j \le n_i$, such that
\[ \frac{n_i}{n} \le \frac{n_i + j}{n + \ell}. \tag{20} \]
Now, for any given $\ell$ and $i$, (??) holds iff $j \ge n_i \ell / n$. Hence the number of those $1 \le j \le n_i$ that satisfy (??) is greater than $n_i - n_i \ell / n$, and the total number of pairs $(i, j)$, $1 \le i \le k$, $1 \le j \le n_i$, satisfying (??) is greater than
\[ \sum_{i=1}^k \left( n_i - \frac{n_i}{n} \ell \right) = n - \ell. \]
It follows that if we assign to $\ell = n$ any $(i, j)$ satisfying (??) (i.e., $i$ may be chosen arbitrarily and $j = n_i$), and then recursively assign to each $\ell = n - 1, n - 2$, etc., a pair $(i, j)$ satisfying (??) that was not assigned previously, we never get stuck; at each step there will be at least one "free" pair $(i, j)$, because the total number of pairs $(i, j)$ satisfying (??) is greater than $n - \ell$, the number of pairs already assigned. This completes the proof of the theorem.

Our next goal is to show that the result of the preceding theorem is "best possible," even if we don't insist on a uniformly small redundancy (i.e., on a bound valid for every $x_1^n$), but want only the average redundancy $E(R)$ to be small.

Consider any prefix code. Without loss of generality (for the purpose of bounding the redundancy) we may assume that it satisfies the Kraft inequality with the equality sign, and therefore that it is a Shannon code with respect to some $Q$ (not necessarily of mixture type). Then
\[ E(R(X_1^n)) = E \log \frac{P(X_1^n)}{Q(X_1^n)} = D(P_n \| Q). \]
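As a numerical sanity check of Theorem 17, the following minimal sketch evaluates $Q$ through the sequential formula (17) with $\alpha_i = -1/2$ and compares $\log\bigl(P_{ML}(x_1^n)/Q(x_1^n)\bigr)$, which by (16) dominates the redundancy for every i.i.d. $P$, with the first bound of the theorem over all sequences of a small length. The function names and the brute-force enumeration are illustrative assumptions, not part of the notes.

```python
import itertools
import math


def kt_log2_prob(seq, k):
    """log2 Q(x_1^n) from the sequential formula (17) with alpha_i = -1/2."""
    counts = [0] * k
    log_q = 0.0
    for past_len, symbol in enumerate(seq):  # past_len = j - 1
        log_q += math.log2((counts[symbol] + 0.5) / (past_len + k / 2))
        counts[symbol] += 1
    return log_q


def ml_log2_prob(seq, k):
    """log2 P_ML(x_1^n) = sum_i n_i log2(n_i / n)."""
    n = len(seq)
    counts = [seq.count(a) for a in range(k)]
    return sum(c * math.log2(c / n) for c in counts if c > 0)


def theorem17_bound(n, k):
    """log2 of Gamma(n + k/2) Gamma(1/2) / (Gamma(n + 1/2) Gamma(k/2))."""
    log_ratio = (math.lgamma(n + k / 2) + math.lgamma(0.5)
                 - math.lgamma(n + 0.5) - math.lgamma(k / 2))
    return log_ratio / math.log(2)


def max_redundancy(n, k):
    """Largest value of log2(P_ML / Q) over all sequences in {0..k-1}^n."""
    return max(ml_log2_prob(s, k) - kt_log2_prob(s, k)
               for s in itertools.product(range(k), repeat=n))


if __name__ == "__main__":
    for n, k in [(6, 3), (8, 2)]:
        print(n, k, max_redundancy(n, k), theorem17_bound(n, k))
```

For small $n$ and $k$ the maximum should coincide with the bound up to rounding, since the bound is attained by constant sequences, for which $P_{ML} = 1$.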
Since $P$ is unknown, we want to select $Q$ in such a way that, no matter what $P$ is, the average redundancy will be small; that is, we want $Q$ to minimize
\[ \sup_P E_P\bigl(R_P(X_1^n)\bigr) = \sup_P D(P_n \| Q). \]
Suppose we choose $P$ at random with prior distribution $\nu$; then the observation of $x_1^n$ provides information about the unknown $P$, measured by the mutual information
\[ I(\nu) = H(Q_\nu) - \int H(P_n) \, \nu(dP) = H(Q_\nu) - n \int H(P) \, \nu(dP), \]
where $Q_\nu$ is the mixture distribution, $Q_\nu(x_1^n) = \int P(x_1^n) \, \nu(dP)$. Even though this mutual information appears to be unrelated to the previous average redundancy, the remarkable fact is that
\[ \inf_Q \sup_P D(P_n \| Q) = \sup_\nu I(\nu). \]
Indeed, the following lemma holds in general.

Lemma 5 Consider any noisy channel with input alphabet $U = \{1, \ldots, \ell\}$ and output alphabet $V = \{1, \ldots, m\}$, given by the probability distributions $P_i$ on $V$ governing the output if the input is $i$, $i = 1, \ldots, \ell$. For any input distribution $\pi$, let $Q_\pi$ denote the output distribution and let
\[ I(\pi) = \sum_{i,j} \pi(i) P_i(j) \log \frac{P_i(j)}{Q_\pi(j)} = \sum_i \pi(i) D(P_i \| Q_\pi) \]
be the mutual information between input and output. Then
\[ \max_\pi I(\pi) = \min_Q \max_{1 \le i \le \ell} D(P_i \| Q). \]

Proof. The left side is known as the channel capacity. The lemma states that it equals the "radius" of the smallest "divergence ball" that contains all the $P_i$'s. To establish this relation first note that for any distribution $Q$ on $V$,
\[ I(\pi) = \sum_{i=1}^{\ell} \sum_{j=1}^{m} \pi(i) P_i(j) \left[ \log \frac{P_i(j)}{Q(j)} + \log \frac{Q(j)}{Q_\pi(j)} \right] = \sum_{i=1}^{\ell} \pi(i) D(P_i \| Q) - D(Q_\pi \| Q). \]
This identity shows that for any fixed $\pi$
\[ \min_Q \sum_{i=1}^{\ell} \pi(i) D(P_i \| Q) = I(\pi), \]
and hence
\[ \max_\pi I(\pi) = \max_\pi \min_Q \sum_{i=1}^{\ell} \pi(i) D(P_i \| Q). \]

[...]

which is equal to
\[ \prod_{\ell \in S} \frac{\prod_{i=1}^k \bigl(n(i, \ell) - 1/2\bigr)\bigl(n(i, \ell) - 3/2\bigr) \cdots (1/2)}{\bigl(n(\ell) - 1 + k/2\bigr)\bigl(n(\ell) - 2 + k/2\bigr) \cdots (k/2)}, \]
and this code has redundancy
\[ R(x_1^n) \le \sum_{\ell \in S} \left[ \frac{k - 1}{2} \log n(\ell) + \text{constant} \right] \le |S| \frac{k - 1}{2} \log n + \text{constant}. \]
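The product over contexts displayed above can be evaluated directly from the counts $n(i, \ell)$. The sketch below does this through the equivalent $\Gamma$-function form of each factor; the function names and the dictionary layout are assumptions made for illustration. The counts in the usage lines are those of the worked binary example in the solution of Homework problem 6 later in these notes ($n_0 = 22$, $n_1 = 10$ for the i.i.d. model; $n_{00} = 16$, $n_{01} = 6$, $n_{10} = 6$, $n_{11} = 4$ for the first-order Markov model).

```python
import math


def kt_log2_block(counts):
    """log2 of one context factor:
    prod_i (n_i - 1/2)...(1/2) / [(n - 1 + k/2)...(k/2)],
    written as Gamma(k/2)/Gamma(n + k/2) * prod_i Gamma(n_i + 1/2)/Gamma(1/2).
    """
    k = len(counts)
    n = sum(counts)
    log_q = math.lgamma(k / 2) - math.lgamma(n + k / 2)
    for c in counts:
        log_q += math.lgamma(c + 0.5) - math.lgamma(0.5)
    return log_q / math.log(2)


def context_code_length(counts_by_context):
    """Shannon code length ceil(-log2 Q) for the product over all contexts."""
    total = sum(kt_log2_block(c) for c in counts_by_context.values())
    return math.ceil(-total)


if __name__ == "__main__":
    # A single (empty) context reduces to the i.i.d. mixture code.
    print(context_code_length({"iid": [22, 10]}))
    # Two contexts: the previous symbol, as in a first-order Markov model.
    print(context_code_length({0: [16, 6], 1: [6, 4]}))
```

With these counts the two lengths come out as 32 and 33 bits, in agreement with the values stated in that solution.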
Furthermore, Rissanen's theorem implies that even the average redundancy cannot be substantially smaller than the above bound, for any universal code, except possibly for a vanishingly small set of parameter values, i.e., matrices $P(i|\ell)$.

Remark 9 Before leaving this topic of redundancy bounds let us mention an aspect of our discussion which has some practical value in designing codes. In the preceding section we derived the following formula (see (??)), valid for the i.i.d. case:
\[ Q(x_1^n) = \prod_{j=1}^n \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k}, \]
where $n(x_j | x_1^{j-1})$ is the number of occurrences of the symbol $x_j$ in the "past" $x_1^{j-1}$. This formula suggests the conditional probabilities
\[ Q(x_j | x_1^{j-1}) = \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k}. \]
The latter formula can be used as the specification of the conditional probabilities used in arithmetic coding, a (practical) sequential procedure that yields the same asymptotics as the Shannon coding procedure.

Likewise, the Markov discussion in this section leads to the conditional formula
\[ Q(i_k | i_1, \ldots, i_{k-1}) = \frac{n_{k-1}(i, j) + 1/2}{n_{k-1}(i) + k/2}, \quad \text{if } i_{k-1} = i,\ i_k = j, \]
where $n_{k-1}(i, j)$ is the number of consecutive $(i, j)$'s in the sequence $i_0^{k-1}$ and $n_{k-1}(i) = \sum_j n_{k-1}(i, j)$. These conditional probabilities are easily evaluated, because only simple updating is needed to go from $k - 1$ to $k$; arithmetic coding can then be performed.

The corresponding finite context formula is
\[ Q(i_k | i_1, \ldots, i_{k-1}) = \frac{n_{k-1}(i, \ell) + 1/2}{n_{k-1}(\ell) + k/2}, \quad \text{if } s_{k-1} = \ell,\ i_k = i. \]
These can then be used to do arithmetic coding in the finite context case; such coding will also yield the same asymptotics as the Shannon code. Rissanen's theorem implies that even the average redundancy cannot be substantially smaller than the bound above, for any universal code, except possibly for a vanishingly small set of parameters (i.e., matrices $P(i|\ell)$).

8 Additions.

8.1 The scaling formula.

The scaling formula
\[ P^*(a) = c_i Q(a), \quad a \in B_i, \quad \text{where } c_i = \frac{\alpha_i}{Q^B(i)}, \tag{24} \]
see (??)
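can be proved as follows. First, lumping does not increase divergence, that is,
\[ D(P \| Q) \ge D(P^B \| Q^B). \]
The condition that $P \in \mathcal{L}$ is equivalent to the condition that $P(B_i) = \alpha_i$, $\forall i$. If $P^*(a) = \alpha_i Q(a)/Q(B_i)$, $a \in B_i$, then
\[ \sum_a P^*(a) \log \frac{P^*(a)}{Q(a)} = \sum_i \sum_{a \in B_i} \frac{\alpha_i Q(a)}{Q(B_i)} \log \frac{\alpha_i}{Q(B_i)} = D(\alpha \| Q^B), \quad \alpha = (\alpha_1, \ldots, \alpha_k). \]
Thus, if $P \in \mathcal{L}$ then
\[ D(P \| Q) \ge D(P^B \| Q^B) = D(\alpha \| Q^B), \]
which establishes (??).

8.2 Pearson's $\chi^2$.

The chi-square function was defined on page ??. In the case when $P = \hat P$, the empirical distribution, the formula can be rewritten as follows:
\[ \chi^2(\hat P, Q) = \sum_a \frac{\bigl(\hat P(a) - Q(a)\bigr)^2}{Q(a)} = \frac{1}{n} \sum_a \frac{\bigl(n\hat P(a) - nQ(a)\bigr)^2}{nQ(a)} = \frac{1}{n} \chi^2_{k-1}, \]
where
\[ \chi^2_{k-1} = \sum_a \frac{\bigl(n\hat P(a) - nQ(a)\bigr)^2}{nQ(a)} \]
is Pearson's classical chi-square function. Here $n\hat P(a)$ gives the observed count, while $nQ(a)$ gives the expected count, of the number of appearances of $a$.

8.3 Maximum entropy and Likelihood.

There is an important case when divergence minimization corresponds to maximum likelihood, namely, the case when the linear family contains the empirical distribution. Suppose we are given the corresponding linear and exponential families,
\[ \mathcal{L} = \Bigl\{ P : \sum_a P(a) f_i(a) = \alpha_i,\ 1 \le i \le k \Bigr\}, \qquad \mathcal{E} = \Bigl\{ P : P(a) = c(\theta) Q(a) \exp\Bigl( \sum_{i=1}^k \theta_i f_i(a) \Bigr) \Bigr\}. \]

Theorem 20 If $Q(a) > 0$, $\forall a$, and if the empirical distribution $\hat P$ belongs to $\mathcal{L}$, then the maximum likelihood estimate in $\mathcal{E}$ is the I-projection $P^*$ of any member of $\mathcal{E}$ onto $\mathcal{L}$. Furthermore, the minimum value of $D(\hat P \| P)$ for $P \in \mathcal{E}$ is attained at $P^*$.

Proof. Let $P^*$ be the I-projection of $Q$ onto $\mathcal{L}$, that is, $D(P^* \| Q) = D(\mathcal{L} \| Q)$, so that $\mathcal{L} \cap \mathcal{E} = \{P^*\}$. If $P \in \mathcal{E}$ we can write
\[ P(a) = c\,Q(a) \exp\Bigl( \sum \theta_i f_i(a) \Bigr), \qquad P^*(a) = c^* Q(a) \exp\Bigl( \sum \theta^*_i f_i(a) \Bigr). \]
Since $\sum_a P^*(a) f_i(a) = \alpha_i$ we have
\[ 0 \le D(P^* \| P) = \log c^* + \sum \theta^*_i \alpha_i - \Bigl( \log c + \sum \theta_i \alpha_i \Bigr), \]
so that
\[ \log c^* + \sum \theta^*_i \alpha_i = \max_{P \in \mathcal{E}} \Bigl( \log c + \sum \theta_i \alpha_i \Bigr). \]
If $\hat P \in \mathcal{L}$, however, then
\[ D(\hat P \| P) - D(\hat P \| P^*) = D(P^* \| P), \]
since $\sum_a \hat P(a) f_i(a) = \alpha_i$. This proves that the minimum value of $D(\hat P \| P)$ for $P \in \mathcal{E}$ is attained at $P^*$.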
Furthermore, $\log P(x_1^n) = n \sum_a \hat P(a) \log P(a)$, so that if $P \in \mathcal{E}$ and $\hat P \in \mathcal{L}$ then
\[ \log \frac{P^*(x_1^n)}{P(x_1^n)} = n \sum_a \hat P(a) \log \frac{P^*(a)}{P(a)} = n D(P^* \| P) \ge 0, \]
so that $P^*$ is indeed the MLE in $\mathcal{E}$. The argument can be applied to any member of $\mathcal{E}$ in place of the given $Q$, since they all describe the same exponential family.

8.4 Redundancy for the LZ algorithm.

An upper bound on the redundancy of the form $O(\log\log n / \log n)$ for the Lempel-Ziv (LZ) algorithm on the class of i.i.d. processes will now be established. Extensions of these results to the Markov and hidden Markov cases can also be obtained.

Let $c = c(x_1^n)$ be the number of commas in the LZ parsing of $x_1^n$. The final block, which may be empty, is coded by telling the first prior word that this block prefixes. Let $U_{LZ}(x_1^n)$ be the length of the resulting code. Each word, except the final word, can be encoded with at most $\lceil \log c \rceil$ bits to give the location of the prior occurrence of all but its final symbol and $\lceil \log |A| \rceil$ bits to encode this final symbol. Thus we have the upper bound
\[ U_{LZ}(x_1^n) \le (c + 1)\lceil \log c \rceil + c \lceil \log |A| \rceil. \tag{25} \]
The next step in upper bounding the redundancy is to obtain a lower bound on $-\log P(x_1^n)$, stated here as the following lemma.

Lemma 6 There is a positive number $\delta$ such that if $P$ is an i.i.d. process then
\[ -\log P(x_1^n) \ge c \log c - c\delta - c \log \frac{n}{c}. \]

Proof. Let $W = W(x_1^n)$ be the first $c$ words in the LZ parsing of $x_1^n$, let $W_L = W_L(x_1^n)$ be the subset of $W$ consisting of the words of length $L$, and let $c(L)$ be the cardinality of $W_L$. We then have
\[ P(x_1^n) \le \prod_{L=1}^{L_{\max}} \prod_{w \in W_L} P(w), \]
so that
\[ -\log P(x_1^n) \ge -\sum_{L=1}^{L_{\max}} \sum_{w \in W_L} \log P(w) = -\sum_{L=1}^{L_{\max}} c(L) \sum_{w \in W_L} \frac{1}{c(L)} \log P(w) \ge -\sum_{L=1}^{L_{\max}} c(L) \log \frac{\sum_{w \in W_L} P(w)}{c(L)} \ge \sum_{L=1}^{L_{\max}} c(L) \log c(L), \]
where the second inequality comes from Jensen's inequality, and the final inequality uses the fact that $\sum_{w \in W_L} P(w) \le 1$, which holds because the words in $W_L$ are distinct and have fixed length $L$.
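The chain of inequalities just derived, together with the bound (25), can be checked numerically. The following minimal sketch assumes that the LZ parsing is the usual incremental (LZ78-style) parsing, in which each new word is the shortest prefix of the remaining text that has not yet occurred as a word; the function names and the example distribution are illustrative assumptions.

```python
import math
import random
from collections import Counter


def lz_parse(seq):
    """Incremental parsing into distinct words.

    Returns (words, final_block); final_block is empty or equals a prior word.
    """
    seen = set()
    words = []
    current = ()
    for symbol in seq:
        current += (symbol,)
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ()
    return words, current


def code_length_bound(c, alphabet_size):
    """Right side of (25): (c + 1) ceil(log c) + c ceil(log |A|)."""
    return ((c + 1) * math.ceil(math.log2(c))
            + c * math.ceil(math.log2(alphabet_size)))


def word_count_bound(words):
    """sum_L c(L) log2 c(L), where c(L) is the number of words of length L."""
    by_length = Counter(len(w) for w in words)
    return sum(c_l * math.log2(c_l) for c_l in by_length.values())


def neg_log2_prob(seq, p):
    """-log2 P(x_1^n) for the i.i.d. process with marginal p."""
    return -sum(math.log2(p[s]) for s in seq)


if __name__ == "__main__":
    random.seed(0)
    p = [0.7, 0.2, 0.1]
    x = random.choices(range(3), weights=p, k=2000)
    words, tail = lz_parse(x)
    c = len(words)
    print("c =", c, " bound (25) =", code_length_bound(c, len(p)))
    print("-log P =", neg_log2_prob(x, p),
          " sum c(L) log c(L) =", word_count_bound(words))
```

For any i.i.d. $P$ the quantity $\sum_L c(L)\log c(L)$ should not exceed $-\log P(x_1^n)$, which is what the derivation above asserts.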
To obtain a suitable bound on $\sum c(L) \log c(L)$ set
\[ \bar L = \frac{1}{c} \sum_{L=1}^{L_{\max}} L\, c(L), \]
so that
\[
\begin{aligned}
\sum c(L) \log c(L) &= -c \sum \frac{c(L)}{c} \log \frac{1}{c(L)} = -c \sum \frac{c(L)}{c} \log \frac{2^{-L/\bar L}}{c(L)} - c \\
&\overset{(a)}{\ge} -c + c \log c - c \log \sum_{L=1}^{L_{\max}} 2^{-L/\bar L} \\
&\overset{(b)}{\ge} -c + c \log c - c \log \frac{2^{-1/\bar L}}{1 - 2^{-1/\bar L}} \\
&= -c + c \log c + c \log\bigl(2^{1/\bar L} - 1\bigr) \\
&\overset{(c)}{\ge} -c + c \log c + c \log \frac{\ln 2}{\bar L} \\
&\ge c \log c - c\delta - c \log \frac{n}{c},
\end{aligned}
\]
where Jensen's inequality was used in (a) and the finite sum was replaced by the infinite sum (of a geometric series) in (b). The Taylor expansion of $2^x$ was used to obtain (c), while the final line used $\delta = 1 - \log(\ln 2)$ and the fact that $\bar L \le n/c$. This completes the proof of Lemma ??.

Taking the difference between the upper bound on the code length, (??), and the lower bound of Lemma ??, then dividing by $n$, produces the redundancy bound
\[ \frac{1}{n} R_{LZ}(x_1^n) \le K \frac{c}{n} + \frac{\log (n/c)}{n/c}, \tag{26} \]
where $K$ is a constant. To complete the argument a simple bound for $c/n$ will be needed, a bound that follows from the fact that the largest value of $c$ is obtained when all short blocks occur. It is enough to consider the case when all blocks of length up to $t$ occur, so that
\[ c = \sum_{i=1}^t |A|^i \sim |A|^t, \qquad n = \sum_{i=1}^t i |A|^i \sim t |A|^t, \]
which gives the (asymptotic) bound $c/n = O(1/\log n)$. Since $\log x / x$ is decreasing in $x$ for $x > e$, the desired result,
\[ \frac{1}{n} R_{LZ}(x_1^n) = O\left( \frac{\log \log n}{\log n} \right), \]
follows easily from the bound (??).

8.5 Minimization for general measures.

The minimization result claimed in the paragraph following statement (10) on page 10 follows from a general result about nonnegative measures, a result that is a simple consequence of the log-sum inequality. Suppose $P$ is an arbitrary nonnegative measure, suppose $Q$ is a probability distribution, and set $Q^*(a) = P(a) / \sum_b P(b)$. The log-sum inequality then gives
\[ D(P \| Q) \ge \Bigl( \sum_a P(a) \Bigr) \log \Bigl( \sum_a P(a) \Bigr) = D(P \| Q^*). \]

8.6 Cutting off the memory.

Let $P$ be a stationary finite-alphabet process.
The $k$-step Markovization of $P$ is the $k$-step Markov process $P^{(k)}$ defined by the transition probabilities
\[ P(x_{k+1} | x_1^k) = \frac{P(x_1^{k+1})}{P(x_1^k)}. \]
The following general result shows that the conditions stated in Theorems 13 and 14 often hold. For example, the set of all Markov types of all orders is a countable set for which the conditions of Theorem 14 hold for every ergodic process $P$.

Theorem 21 $D_\infty(P \| P^{(k)}) \to 0$ as $k \to \infty$.

Proof. We have
\[ \log \frac{P(x_1^{n+1})}{P^{(k)}(x_1^{n+1})} = \sum_{i=k+1}^{n} \log \frac{P(x_{i+1} | x_1^i)}{P(x_{i+1} | x_{i-k+1}^i)}, \]
so that taking expectations yields
\[ E_P \left( \log \frac{P(x_1^{n+1})}{P^{(k)}(x_1^{n+1})} \right) = \sum_{i=k+1}^{n} \sum_{x_1^{i+1}} P(x_1^{i+1}) \log \frac{P(x_{i+1} | x_1^i)}{P(x_{i+1} | x_{i-k+1}^i)}. \tag{27} \]
To see what this is we use the formula
\[ I(X \wedge Y | Z) = \sum P(x, y, z) \log \frac{P(x | y, z)}{P(x | z)}, \]
with $X = X_{i+1}$, $Y = X_1^i$, $Z = X_{i-k+1}^i$; the sum (??) then takes the form
\[ \sum_{i=k+1}^{n} I(X_{i+1} \wedge X_1^i | X_{i-k+1}^i) = \sum_{i=k+1}^{n} I(X_1 \wedge X_{-i+1}^0 | X_{-k+1}^0), \]
where stationarity was used to obtain the final form. Now we pass to the limit in $n$, using the martingale theorem, to obtain
\[ D_\infty(P \| P^{(k)}) = I(X_1 \wedge X_{-\infty}^0 | X_{-k+1}^0), \]
which goes to 0 as $k \to \infty$, establishing the theorem.

[...]

it suffices to show that for $\theta = \theta_{ML}$ we have
\[ E_\theta f_i = f_i(x_0), \quad i = 1, \ldots, k. \]
But this immediately follows by setting the derivatives $(\partial/\partial\theta_i) \log P_\theta(x_0)$ equal to 0. For this last step it is necessary to assume that $\theta_{ML}$ is an interior point of the set of those $\theta$'s for which $P_\theta$ is defined, that is, for which the integral in the definition of $c(\theta)$ is finite.

Example 7 Let $P_n$ and $Q_n$ be $n$-dimensional distributions on $A^n$ such that $n^{-1} D(P_n \| Q_n) \to 0$. Show that if for some sets $B_n \subset A^n$ we have $Q_n(B_n) < \exp(-\varepsilon n)$ for some $\varepsilon > 0$ that does not depend on $n$, then $P_n(B_n) \to 0$. Is it also true that $P_n(B_n) < \exp(-\varepsilon n)$ implies that $Q_n(B_n) \to 0$?

Solution. For an arbitrary set $A$,
\[
\begin{aligned}
D(P_n \| Q_n) &\ge P_n(A) \log \frac{P_n(A)}{Q_n(A)} + P_n(A^c) \log \frac{P_n(A^c)}{Q_n(A^c)} \\
&\ge P_n(A) \log P_n(A) + P_n(A^c) \log P_n(A^c) + P_n(A) \log \frac{1}{Q_n(A)} \\
&\ge -1 + P_n(A) \log \frac{1}{Q_n(A)}.
\end{aligned}
\]
If here $Q_n(A) \le \exp(-\varepsilon n)$ then it follows that
\[ D(P_n \| Q_n) \ge -1 + \varepsilon n P_n(A), \]
that is,
\[ P_n(A) \le \frac{1}{\varepsilon} \left[ \frac{1}{n} D(P_n \| Q_n) + \frac{1}{n} \right]. \]

Example 8 Let $X, Y, Z$ be real-valued random variables with unknown joint density $p(x, y, z)$ for which $E(X^2) + E(Y^2) + E(Z^2) = a$ and $E(XY) + E(YZ) = b$, where $a$ and $b$ are known. Show that the joint density achieving maximum entropy subject to these constraints is Gaussian with mean 0. Indicate how its covariance matrix could be determined (the actual computation is not required) and show that for this maximum entropy joint distribution $E(X^2) = E(Z^2) \ne E(Y^2)$ and $E(XY) = E(YZ) \ne E(XZ)$.

Solution. Let $f_1(x, y, z) = x^2 + y^2 + z^2$ and $f_2(x, y, z) = xy + yz$. Then the entropy
\[ H(p) = -\int p(u) \log p(u) \, du, \]
where $u = (x, y, z)$, $du = dx\,dy\,dz$, has to be maximized subject to the constraints
\[ \int f_1(u) p(u) \, du = a, \qquad \int f_2(u) p(u) \, du = b. \]
The maximizing density will be in the exponential family
\[ p_\theta(u) = c(\theta) \exp[\theta_1 f_1(u) + \theta_2 f_2(u)], \quad \theta = (\theta_1, \theta_2), \]
provided this family has a member satisfying the constraints. Comparing the family with the 3-dimensional, mean 0, Gaussian densities, that is, those of the form
\[ p(u) = \frac{(\det A)^{1/2}}{(2\pi)^{3/2}} \exp\left\{ -\frac{1}{2} u A u^T \right\}, \]
where $A$ is symmetric and positive definite, we see that our exponential family is a subfamily of these Gaussians, with
\[ A = \begin{pmatrix} -2\theta_1 & -\theta_2 & 0 \\ -\theta_2 & -2\theta_1 & -\theta_2 \\ 0 & -\theta_2 & -2\theta_1 \end{pmatrix}. \]
Computing the covariance matrix $\Sigma = A^{-1}$, the given moment constraints result in two equations for the unknowns $\theta_1$ and $\theta_2$. The solution of these equations is straightforward, but tedious. It is clear, however, from the form of $A$, that the first and last elements of the main diagonal of $\Sigma = A^{-1}$ will be equal and its middle element will be different from these (unless $\theta_2 = 0$, which occurs if $b = 0$, when the maximum entropy distribution is i.i.d.). The remaining assertion of the problem also follows from the form of $A$ without any further calculations.

10 Summary of Process Concepts.

A number of process concepts will be used in the discussion of redundancy.
These concepts and the results to be used are summarized here.

A (stochastic) process is a sequence $\{X_n\}$ of random variables defined on a probability space, say $(X, \Sigma, \mu)$. We shall assume that all the random variables have values in a fixed finite set $A$, called the alphabet. For each $n$ a process defines a probability measure on $A^n$, called the $n$-fold joint distribution, by the formula
\[ P_n(x_1^n) = \mathrm{Prob}(X_i = x_i,\ 1 \le i \le n). \]
The sequence of measures $\{P_n\}$ is not completely arbitrary, for the consistency conditions,
\[ P_n(x_1^n) = \sum_{x_{n+1}} P_{n+1}(x_1^{n+1}), \tag{31} \]
must hold. The space $(X, \Sigma, \mu)$ on which the process is defined is not important; all that matters is the sequence of joint distributions, $\{P_n\}$. In fact, two processes are said to be equivalent if they have the same joint distributions; we are free to choose any convenient space and sequence of functions, as long as the joint distributions are not changed.

The Kolmogorov model takes the space to be the set $A^\infty$ of infinite sequences drawn from $A$, and the functions to be the coordinate functions, defined by $\hat X_n(x) = x_n$, $x \in A^\infty$. The measure is the (unique) Borel measure $P$ defined by the requirement that if $[a_1^n] = \{x : x_i = a_i,\ 1 \le i \le n\}$ is the cylinder set defined by $a_1^n$, then $P([a_1^n]) = P_n(a_1^n)$.

In summary, the concept of process, that is, a sequence of measures $\{P_n\}$ that satisfy the consistency conditions (??), is formally equivalent to the concept of Borel probability measure $P$ on the sequence space $A^\infty$. We usually take the latter as our definition of process; thus, when we say process we shall mean a Borel probability measure $P$ on the sequence space $A^\infty$. We shall use the notation $P(a_1^n)$ for $P([a_1^n])$, as well as sample path terminology. A sample path is a member of $A^\infty$, while a finite sample path is a member of some $A^n$.

A process $P$ is stationary if it is invariant under the shift $T$, which is the transformation on $A^\infty$ defined by the formula $(Tx)_n = x_{n+1}$, $x \in A^\infty$, $n = 1, 2, \ldots$.
Thus $P$ is stationary if and only if $P = P \circ T^{-1}$.

A stationary process is ergodic if almost every sample path is "typical" for the process. The concept of "typical" is defined as follows. The relative frequency of occurrence of $a_1^k$ in the sequence $x_1^n$ is the distribution $\hat P_k = \hat P_k(\cdot | x_1^n)$ on $A^k$ defined by
\[ \hat P_k(a_1^k | x_1^n) = \frac{|\{ i \in [0, n - k] : x_{i+1}^{i+k} = a_1^k \}|}{n - k + 1}. \]
The measure $\hat P_k$ is also called the empirical distribution of overlapping $k$-blocks in the sample path $x_1^n$. The sequence $x$ is said to be typical for the process $P$ if for all $k$ and all $a_1^k$ the following holds:
\[ P(a_1^k) = \lim_{n \to \infty} \hat P_k(a_1^k | x_1^n). \]
The set of sequences that are typical for $P$ will be denoted by $T(P)$. A stationary process $P$ is ergodic if its set of typical sequences has measure 1, that is, if $P(T(P)) = 1$.

The entropy (or entropy-rate) of a stationary process $P$ is defined by $H(P) = \lim_n H_n / n$, where the $n$-th order entropy $H_n = H_n(P)$ is defined by
\[ H_n = -\sum_{a_1^n} P(a_1^n) \log P(a_1^n). \]
The entropy theorem (also known as the Shannon-McMillan-Breiman Theorem) asserts that if $P$ is an ergodic process of entropy $H$ then
\[ \lim_{n \to \infty} \frac{1}{n} \log \frac{1}{P(x_1^n)} = \lim_{n \to \infty} -\frac{1}{n} \log P(x_1^n) = H, \quad \text{a.s.} \]
For ergodic processes we also have that the entropy of the empirical distribution, $H(\hat P_k)$, converges almost surely to the theoretical entropy, $H_k$. Furthermore, if we define transition probabilities by the formula
\[ \hat P_k(a_k | a_1^{k-1}) = \frac{\hat P_k(a_1^k)}{\sum_{a_k} \hat P_k(a_1^k)}, \]
the entropy of the resulting Markov chain will converge almost surely, as the sample path length $n \to \infty$, to the conditional entropy $H(X_k | X_1^{k-1})$, which, in turn, converges as $k \to \infty$ to the entropy-rate $H(P)$.

A stationary process $P$ is always a mixture of ergodic processes, that is, there is a probability space $(Y, \Sigma, \nu)$ and a family $U_y$, $y \in Y$, of ergodic processes such that for each $a_1^n$ the function $y \mapsto U_y(a_1^n)$ is $\Sigma$-measurable and such that
\[ P(a_1^n) = \int U_y(a_1^n) \, \nu(dy). \]
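The empirical distribution of overlapping $k$-blocks and the empirical transition probabilities defined earlier in this section are straightforward to compute. The sketch below is a minimal illustration; the function names are assumptions, and sequences are represented as Python lists, so the window starting at list index $i$ corresponds to $x_{i+1}^{i+k}$ in the notation of the notes.

```python
import math
from collections import Counter


def empirical_block_distribution(x, k):
    """Relative frequencies of overlapping k-blocks among the n - k + 1 windows."""
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(x)")
    counts = Counter(tuple(x[i:i + k]) for i in range(n - k + 1))
    return {block: cnt / (n - k + 1) for block, cnt in counts.items()}


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def empirical_transitions(x, k):
    """P_k(a_k | a_1^{k-1}) = P_k(a_1^k) / sum over a_k of P_k(a_1^k)."""
    blocks = empirical_block_distribution(x, k)
    prefix_mass = Counter()
    for block, p in blocks.items():
        prefix_mass[block[:-1]] += p
    return {block: p / prefix_mass[block[:-1]] for block, p in blocks.items()}


if __name__ == "__main__":
    x = [0, 1, 0, 1, 0, 1, 1, 0]
    print(empirical_block_distribution(x, 2))
    print(entropy(empirical_block_distribution(x, 1)))
    print(empirical_transitions(x, 2))
```

Only the counts of the $k$-blocks are needed, so the distribution can be updated incrementally as the sample path grows.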
The process $\{X_n\}$ is finite-state (hidden Markov) if there is a finite alphabet process $\{S_n\}$ such that the process $Y_n = (X_n, S_n)$ is a Markov chain. Csiszár has shown (unpublished) that if $Q$ is finite-state then for any stationary, ergodic process $P$ the limiting divergence-rate
\[ D_\infty(P \| Q) = \lim_n \frac{1}{n} \sum_{x_1^n} P(x_1^n) \log \frac{P(x_1^n)}{Q(x_1^n)} \]
exists, and, furthermore, $(1/n) \log\bigl(P(x_1^n)/Q(x_1^n)\bigr)$ converges, for $P$-almost all $x$, to the limit $D_\infty(P \| Q)$.

11 Homework # 1.

Due date: October 7.

1. Find the Shannon-Fano code for the distribution $P = (0.4, 0.35, 0.1, 0.1, 0.05)$. Determine the average length and compare it with the entropy $H(P)$. Can you improve this code by shortening some words, without losing the prefix property? Do you get an optimal code in this way?

2. Determine whether there exist binary prefix codes with the following codeword lengths and give such a code if the answer is yes.
(a) 2,3,3,3,4,4,4,4,4,5,5,5
(b) 2,2,3,3,4,4,4,5,6,6

3. Determine which of the following bit sequences can be a code of some sequence of integers, using the prefix code given in the notes.
(a) 0011110000001100010001110101010100011101
(b) 00001001110010000000100111100001011100010010000

4. Let $B \subset A^n$ and let $P = (1/|B|) \sum_{x \in B} P_x$ be the average type of the sequences in $B$.
(a) Prove that $|B| \le \exp[nH(P)]$.
(b) Prove that
\[ \sum_{i=0}^{k} \binom{n}{i} \le 2^{n h(k/n)}, \quad k \le n/2, \]
where $h(p) = -p \log p - (1 - p) \log(1 - p)$.
(c) Given a function $f$ on a finite set $X$, show that for every $\alpha$ that is a possible value of $\sum_{i=1}^n f(x_i)/n$, we have
\[ \left| \left\{ x_1^n : \frac{1}{n} \sum_{i=1}^n f(x_i) = \alpha \right\} \right| \le \exp\left[ n \max_{E(f(X)) = \alpha} H(X) \right]. \]

5. Prove that $H(Y|X)$ is a concave function of the joint distribution of $(X, Y)$, that is, if $P_{XY} = \alpha P_{X_1 Y_1} + (1 - \alpha) P_{X_2 Y_2}$ then $H(Y|X) \ge \alpha H(Y_1|X_1) + (1 - \alpha) H(Y_2|X_2)$.

6. Let $P_1$ and $P_2$ be probability distributions on the finite set $X$ such that $D(P_2 \| P_1) > \gamma$.
Prove that the I-projection $P^*$ of $P_2$ on $\Pi = \{Q : D(Q \| P_1) \le \gamma\}$ is of the form $P^* = c P_1^\theta P_2^{1-\theta}$, where $c > 0$ and $0 < \theta < 1$ are determined by the requirements that $\sum_x P^*(x) = 1$ and $D(P^* \| P_1) = \gamma$. (Hint: first show that $P^*$ is also the I-projection of $P_2$ on the linear family
\[ \mathcal{L} = \Bigl\{ Q : \sum_x Q(x) \log \frac{P_1(x)}{P_2(x)} = \delta - \gamma \Bigr\}, \]
where $\delta = D(P^* \| P_2)$.)

7. Prove that $D(P \| Q) \le \chi^2(P, Q) \log e$.

8. Let $X_1, X_2, \ldots$ be an i.i.d. sequence of $X$-valued random variables with entropy $H$, and let $\hat H_n$ be the empirical entropy of $X_1^n$, that is, the entropy of the empirical distribution $\hat P_n$. Prove that
\[ H - \frac{1}{n} \log \binom{n + |X| - 1}{|X| - 1} \le E(\hat H_n) \le H. \]

9. Given two strictly positive finite distributions $P_1$ and $P_2$ on $X$, determine $\gamma$ such that there is exactly one $P^*$ with $D(P^* \| P_1) = D(P^* \| P_2) = \gamma$. Show that
\[ \gamma = -\log \min_{0 \le \theta \le 1} \sum_x P_1^\theta(x) P_2^{1-\theta}(x). \]

12 Homework # 2.

1. For $k$ simple hypotheses $P_1, \ldots, P_k$, and a classification rule consisting of the partition $\mathcal{A} = (A_1, \ldots, A_k)$ of $X^n$ such that $P_i$ is accepted when the sample belongs to $A_i$, there are $k$ error probabilities, $e_i = P_i^n(A_i^c)$, $i = 1, \ldots, k$. Give a necessary and sufficient condition for the existence of classification rules such that all $k$ error probabilities go to 0 with exponential rate at least some $\gamma > 0$ as the sample size $n$ goes to infinity, that is, $e_i \le \exp(-n(\gamma + o(1)))$, $i = 1, \ldots, k$.

2. Let $\mathcal{E}_1 \subset \mathcal{E}_2$ be exponential families of the form
\[ \mathcal{E}_1 = \Bigl\{ Q : Q(x) = Q_0(x) c(\theta) \exp \sum_{i=1}^{k_1} \theta_i f_i(x) \Bigr\}, \qquad \mathcal{E}_2 = \Bigl\{ Q : Q(x) = Q_0(x) c(\theta) \exp \sum_{i=1}^{k_2} \theta_i f_i(x) \Bigr\}, \]
where $k_2 > k_1$. Given a sample with empirical distribution $\hat P$, let $P_i^* \in \mathcal{E}_i$ be the maximum likelihood estimate for the model $\mathcal{E}_i$, $i = 1, 2$. Prove that $P_2^*$ is the I-projection of $P_1^*$ onto
\[ \mathcal{L} = \Bigl\{ P : \sum_x P(x) f_i(x) = \sum_x \hat P(x) f_i(x),\ 1 \le i \le k_2 \Bigr\}. \]

3. In a telephone network serving $r$ cities, the incoming and outgoing calls were counted in each city on a given day. From these numbers, $x_{in}(k)$ and $x_{out}(k)$, $k = 1, \ldots$
, $r$, the number of calls $x(i, j)$ from city $i$ to city $j$ are inferred by the method of maximum entropy, setting $x^*(i, j) = n p^*(i, j)$; here $n$ is the total number of calls and $P^* = \{p^*(i, j)\}$ is the maximum entropy distribution among those $P = \{p(i, j)\}$ that satisfy the marginal constraints
\[ \sum_{j=1}^r p(k, j) = \frac{1}{n} x_{out}(k), \qquad \sum_{i=1}^r p(i, k) = \frac{1}{n} x_{in}(k), \]
for $k = 1, \ldots, r$, and, in addition, $p(k, k) = 0$, $k = 1, \ldots, r$ (local calls were not counted). Specify the exponential family for which this $P^*$ is the maximum likelihood estimate, and suggest an iterative algorithm for determining $P^*$.

4. Suppose that for a $5 \times 5$ array of random variables $X_{ij}$, each taking values in a finite set $X$, the joint distributions of "neighboring pairs" $(X_{ij}, X_{i(j+1)})$ and $(X_{ij}, X_{(i+1)j})$ are known, where addition modulo 5 is used. Based on this information, the joint distribution of the whole array is estimated by

[...]

3. In general, maximizing $H(P)$ is the same as minimizing $D(P \| Q_0)$ where $Q_0$ is the uniform distribution. In our case we must have $P(i, i) = 0$ for every $i$; therefore, maximizing $H(P)$ is equivalent to minimizing $D(P \| Q)$ where $Q(i, i) = 0$, $\forall i$, and $Q(i, j) = \text{constant}$, $i \ne j$. This minimization can be performed by iteratively adjusting the marginals (iterative scaling).

The exponential family will consist of all distributions of the form $P(i, j) = c\,Q(i, j)\,a(i)\,b(j)$, and the maximum entropy distribution will be the ML estimate for this family. In this case the exponential family through $Q_0$, the uniform distribution, is not appropriate because it does not intersect the set of feasible distributions, all of which have their diagonal elements equal to 0.

4. For each pair $(i, j)$, $1 \le i \le 5$, $1 \le j \le n$, let $\mathcal{L}^{(1)}_{ij}$ denote the set of all joint distributions on $X^{25}$ whose two-dimensional marginal representing the joint distribution of $X_{ij}$ and $X_{i(j+1)}$ equals the given one.
Similarly, let $\mathcal{L}^{(2)}_{ij}$, $1 \le i \le n$, $1 \le j \le 5$, be defined by the given joint distribution of $X_{ij}$ and $X_{(i+1)j}$. Let $\mathcal{L}$ be the intersection of all these linear families and let $P_0$ be the uniform distribution on $X^{25}$. The required maximum entropy joint distribution will be the I-projection of $P_0$ on $\mathcal{L}$. It can be computed by iterative scaling, performing cyclically I-projections on the sets $\mathcal{L}^{(1)}_{ij}$ and $\mathcal{L}^{(2)}_{ij}$ (by adjusting the corresponding two-dimensional marginals). Since $\mathcal{L}$ is the intersection of 40 sets $\mathcal{L}^{(1)}_{ij}$ and $\mathcal{L}^{(2)}_{ij}$, one cycle of the iteration will consist of 40 consecutive scalings.

5. From Section 6,
\[ \nu(p) = \frac{\Gamma\left( \sum_{i=1}^k \alpha_i + k \right)}{\prod_{i=1}^k \Gamma(\alpha_i + 1)} \prod_{i=1}^k p_i^{\alpha_i} \]
is a density (for every $\alpha_1, \ldots, \alpha_k$ greater than $-1$), hence its integral over the probability simplex is 1. Applying this with $k = 2$ and with $n_0$ and $n_1 = n - n_0$ in the role of $\alpha_1$ and $\alpha_2$, it follows that
\[ Q(x_1^n) = \int p^{n_0} (1 - p)^{n_1} \, dp = \frac{\Gamma(n_0 + 1)\Gamma(n_1 + 1)}{\Gamma(n + 2)} = \frac{n_0!\, n_1!}{(n + 1)!} = \frac{1}{n + 1} \cdot \frac{n_0!\, n_1!}{n!}. \]
If $n_0 \sim \alpha n$ then Stirling's formula, $k! \sim k^k e^{-k} \sqrt{2\pi k}$, gives
\[ \frac{n_0!\, n_1!}{n!} \sim \frac{n_0^{n_0} n_1^{n_1}}{n^n} \sqrt{2\pi n \alpha (1 - \alpha)}, \]
which implies that
\[ \log \frac{P_{ML}(x_1^n)}{Q(x_1^n)} = \log \left( \frac{n_0^{n_0} n_1^{n_1}}{n^n} \Big/ Q(x_1^n) \right) \sim \frac{1}{2} \log n + \text{const.} \]
If, on the other hand, $n_0$ is a constant, then
\[ Q(x_1^n) = \frac{1}{n + 1} \cdot \frac{n_0!\, n_1!}{n!} = \frac{n_0!}{(n + 1) n \cdots (n - n_0 + 1)} \]
and
\[ \frac{P_{ML}(x_1^n)}{Q(x_1^n)} = \frac{n_0^{n_0}}{n_0!} \cdot \frac{(n - n_0)^{n - n_0}}{n^n} \cdot (n + 1) n \cdots (n - n_0 + 1) = \frac{n_0^{n_0}}{n_0!} \cdot \left( 1 - \frac{n_0}{n} \right)^{n - n_0} \left[ \left( 1 - \frac{1}{n} \right) \cdots \left( 1 - \frac{n_0 - 1}{n} \right) \right] (n + 1), \]
so that in this case,
\[ \log \frac{P_{ML}(x_1^n)}{Q(x_1^n)} \sim \log n + \text{const.} \]
A better choice for $Q$ is the mixture with respect to the Dirichlet prior with $\alpha_1 = \alpha_2 = -1/2$, i.e., with $\nu(p) = 1/\bigl(\pi \sqrt{p(1 - p)}\bigr)$, for then $\log\bigl(P_{ML}(x_1^n)/Q(x_1^n)\bigr)$ will be asymptotically $(1/2) \log n + \text{constant}$, no matter what $x_1^n$ is.

6. (i) From formula (17) in the lecture notes with $k = 2$ we have the auxiliary distribution
\[ Q(x_1^n) = \frac{\left(n_0 - \frac{1}{2}\right)\left(n_0 - \frac{3}{2}\right) \cdots \frac{1}{2} \cdot \left(n_1 - \frac{1}{2}\right)\left(n_1 - \frac{3}{2}\right) \cdots \frac{1}{2}}{n!}. \]
With $n = 32$, $n_0 = 22$, $n_1 = 10$, we obtain $L(x_1^n) = \lceil -\log Q(x_1^n) \rceil = 32$.
In the Markov case, the formula on page 16 of the notes yields
\[ Q(x_1^n) = \frac{\bigl(n(0, 0) - 1/2\bigr) \cdots (1/2) \cdot \bigl(n(0, 1) - 1/2\bigr) \cdots (1/2)}{n_0!} \times \frac{\bigl(n(1, 0) - 1/2\bigr) \cdots (1/2) \cdot \bigl(n(1, 1) - 1/2\bigr) \cdots (1/2)}{n_1!}. \]
In our case, setting the unspecified initial state equal to 0, we have $n_{00} = 16$, $n_{01} = 6$, $n_{10} = 6$, $n_{11} = 4$ and $n_0 = 22$, $n_1 = 10$. (Note that the present $n_0$ and $n_1$ equal those in part (i) only because the initial state has been set equal to the last bit of the sequence $x_1^n$; otherwise there would be a difference of 1.) Substituting these values we obtain $L(x_1^n) = \lceil -\log Q(x_1^n) \rceil = 33$.

Remark. The perhaps surprising result is that for the given sequence neither method leads to compression. This is so in spite of the fact that the first order empirical entropy $\hat H_n$ is clearly less than 1, and the second order empirical entropy $\hat H_n^{(2)}$ is clearly smaller than $\hat H_n$. The reason is that the true codelength is not $n\hat H_n$ (or $n\hat H_n^{(2)}$, respectively); rather, an additional term $(1/2) \log n$ (or $\log n$, respectively) has to be added, which stands for the description of the ML distribution.

7. Let $r > 0$ be any number such that for some prefix code with word length function $L(x_1^n)$ we have
\[ L(x_1^n) + \log P_\theta(x_1^n) \le r, \quad \theta \in \Theta,\ x_1^n \in A^n. \]
Thus $\log P_\theta(x_1^n) \le -L(x_1^n) + r$, so that
\[ \sup_{\theta \in \Theta} P_\theta(x_1^n) \le 2^{-L(x_1^n)} 2^r. \]
Summing over $x_1^n$ and using the Kraft inequality then gives $S_n \le 2^r$, so that $r_n \ge \log S_n$. On the other hand, for the Shannon code with respect to the auxiliary distribution $Q(x_1^n) = S_n^{-1} \sup_{\theta \in \Theta} P_\theta(x_1^n)$, we have
\[ L(x_1^n) + \log P_\theta(x_1^n) \le \log \frac{P_\theta(x_1^n)}{Q(x_1^n)} + 1 \le \log S_n + 1, \]
which proves that $r_n \le \log S_n + 1$.

8. The first inequality is trivial because $n\hat H_n = -\log P_{ML}(x_1^n)$. Consider the mixture distribution $Q$ with respect to the Dirichlet prior with $\alpha_i \equiv -1/2$. Theorem 17 gives
\[ \log \frac{P_{ML}(x_1^n)}{Q(x_1^n)} \le \frac{k - 1}{2} \log n + \text{const.} \]
Combining this with $n\hat H_n = -\log P_{ML}(x_1^n)$ then yields
\[ \log \frac{1}{Q(x_1^n)} \le n\hat H_n + \frac{k - 1}{2} \log n + \text{const.} \]
On the other hand, the (pointwise) redundancy of the Shannon code with respect to $Q$, though it might be negative for some $x_1^n$, is lower bounded by a random variable that has finite expectation (Corollary 3). Thus $-\log Q(x_1^n) \ge -\log P^*(x_1^n) - Y$, where $E(Y)$ is finite. Putting this together with the preceding inequality yields the bound

$$\log\frac{1}{P^*(x_1^n)} \le n\hat{H}_n + \frac{k-1}{2}\log n + \text{const.} + Y,$$

which completes the proof.

Remark. It follows in a similar manner that if $X_1, X_2, \ldots$ is an $m$-th order Markov chain (with arbitrarily specified states at times $0, -1, \ldots, -m+1$) then for the $m$-th order empirical (conditional) entropy $\hat{H}^m_n$ we have

$$n\hat{H}^m_n \le -\log P^*(x_1^n) \le n\hat{H}^m_n + \frac{|A|^m(|A|-1)}{2}\log n + \text{const.} + Z,$$

where $Z$ is a random variable not depending on $n$, whose expectation is finite. If $X_1, X_2, \ldots$ is Markov of order $\ell$, then it is also Markov of order $m > \ell$ and hence

$$n\hat{H}^\ell_n \le -\log P^*(x_1^n) \le n\hat{H}^m_n + \frac{|A|^m(|A|-1)}{2}\log n + \text{const.} + Z_m,$$

so that

$$\hat{H}^\ell_n - \hat{H}^m_n \le \frac{|A|^m(|A|-1)}{2n}\log n + \text{const.} + \frac{1}{n}Z_m.$$

This allows us to check if the Markov chain is of order $\ell < m$. Here $Z_m$ can be positive or negative, but since it has finite expected value, we can use the Markov inequality to get bounds on the probability that $Z_m > \varepsilon > 0$ and use this in the above.

14.1 Corrections.

Line n+ is the n-th line from the top and line n− is the n-th line from the bottom.

1. Page 2, column 2, line 20+: Change $j_{s+1}$ to $1 + j_s$.
2. Page 3, column 1, line 19+: Change $P(\hat{P}_n \in \Pi)$ to $P(\hat{P}_n \in \Pi_n)$.
3. Page 3, column 1, line 8−: Change $(1/n)\sum_a f(a) > \alpha$ to $(1/n)\sum_i f(x_i) > \alpha$.
4. Page 4, column 1, line 12+: The minimum should be over $P \in \Pi$, not $P^* \in \Pi$.
5. Page 7, column 2, line 18+: $\gamma = (j_1, \ldots, j_d)$ should be $\omega = (j_1, \ldots, j_d)$.
6. Page 8, column 1, line 17−: $\log \hat{P}(\omega_0)/P(\omega_0)$ should be $\log \frac{1-\hat{P}(\omega_0)}{1-P(\omega_0)}$.
7. Page 11, column 1, lines 2− and 11−: Replace $\sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)}$ by $\sum_{n=1}^{\infty}\sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)}$.
8. Page 11, column 2, line 16+: Replace $Z_n = P(x_1^n)/Q(x_1^n)$ by $Z_n = Q(x_1^n)/P(x_1^n)$.
9. Page 11, column 2, line 16−: Replace $P(\tilde{A})$ by $P(\tilde{A}_m)$ and $Q(\tilde{A})$ by $Q(\tilde{A}_m)$.
10. Page 12, column 1, formula (11): In the integral exponent the logarithm should be multiplied by $1/n$.
11. Page 12, column 1, formula (12): Replace $P \in$ by $U \in$.
12. Page 12, column 1, line 19−: Replace $P \in N_\infty$ by $U \in N_\varepsilon$.
13. Page 12, column 1, line 3−: Replace “code” by “process U”.
14. Page 14, column 1, formula (14): Replace $i + 1$ by $i = 1$ in the product.
15. Page 15, column 2, line 14−: Replace $\log \frac{e}{2}$ by $\log \frac{\varepsilon}{2}$.