
























Lecture Notes - Stochastic Processes (Markov Chains)
Type: Lecture notes

























This course will be concerned with the applications of information theory concepts in statistics. Much of the course will be based on lectures given by Imre Csiszár at Maryland in 1989. Some recent results about dependent processes will also be given. It is assumed that the reader is familiar with basic information theory ideas as presented, for example, in the initial chapters of the Csiszár-Körner book, and with basic statistical concepts as presented, for example, in the book by Cox and Hinkley.

Notation and terminology that will be used in these lectures will be introduced in this section. The symbol $A = \{a_1, a_2, \ldots, a_{|A|}\}$ will denote a finite set of cardinality $|A|$ and $x_m^n$ will denote the sequence $x_m, x_{m+1}, \ldots, x_n$, where each $x_i \in A$. The set of all $n$-length sequences $x_1^n$ will be denoted by $A^n$, the set of all infinite sequences $x = x_1^\infty$, with $x_i \in A$, $i \ge 1$, will be denoted by $A^\infty$, and the set of all finite sequences drawn from $A$ will be denoted by $A^*$. If $u$ and $v$ are finite-length sequences then their concatenation is denoted by $uv$, and $u^k = u^{k-1}u$, $k > 1$. The entropy $H(P)$ of a probability distribution $P = (P(a))$ on $A$ is defined by the formula
\[ H(P) = -\sum_{a \in A} P(a) \log P(a), \]
where here, as elsewhere in these lectures, base two logarithms are used. Random variable notation is often used in this context, that is, $H(X)$ denotes the entropy of the distribution $P$ of the random variable $X$. If $P$ and $Q$ are two distributions on $A$ then their divergence or cross-entropy is defined by
\[ D(P\|Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)}. \]
If $P$ is the joint distribution of two random variables $(X, Y)$ then their joint entropy is defined by
\[ H(X, Y) = -\sum_{(a,b)} P(a, b) \log P(a, b), \]
while the conditional entropy $H(X|Y)$ and mutual information $I(X \wedge Y)$ are defined, respectively, by
\[ H(X|Y) = H(X, Y) - H(Y), \]
\[ I(X \wedge Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X). \]
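The definitions above translate directly into a few lines of code. The following is a minimal sketch, not part of the original notes; the function names and the small joint table are illustrative assumptions, distributions are plain lists of probabilities, and logarithms are base two as in the text.

```python
import math

def entropy(p):
    """H(P) in bits; terms with P(a) = 0 contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def divergence(p, q):
    """D(P||Q) in bits; infinite if P puts mass where Q does not."""
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y == 0:
                return math.inf
            total += x * math.log2(x / y)
    return total

def mutual_information(joint):
    """I(X ^ Y) = H(X) + H(Y) - H(X, Y) for a joint table joint[a][b]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hxy = entropy([v for row in joint for v in row])
    return entropy(px) + entropy(py) - hxy

# Hypothetical joint distribution of (X, Y), used only for illustration.
joint = [[0.25, 0.25], [0.0, 0.5]]
info = mutual_information(joint)
```

For an independent pair the mutual information computed this way is zero up to rounding, which is a quick sanity check on the three identities for $I(X \wedge Y)$.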
Two types of codes will be of interest. A block code is a mapping $C\colon A^n \mapsto B^m$, while a variable-length code is a mapping $C\colon A^n \mapsto B^*$. The length function $L\colon A^n \mapsto \{1, 2, \ldots\}$ for a variable-length code is defined by the formula
\[ C(x_1^n) = b_1^{L(x_1^n)}. \]
Thus, in particular, a block code is just a variable-length code whose length function is constant. A block code $C$ is invertible (or faithful) if it is one-to-one. A variable-length code is uniquely decodable if for any two distinct sequences $u(1), u(2), \ldots, u(m)$ and $v(1), v(2), \ldots, v(k)$, where $u(i), v(j) \in A^n$ for all $i, j$, the concatenations of the images, $C(u(1))C(u(2)) \cdots C(u(m))$ and $C(v(1))C(v(2)) \cdots C(v(k))$, are not equal. A condition that guarantees unique decodability is the prefix condition. A variable-length code $C$ satisfies the prefix condition if
\[ C(v) = C(u)w, \quad u, v \in A^n, \ w \in B^* \ \Rightarrow\ w = \Lambda, \ u = v, \]
where $\Lambda$ denotes the empty string. In most cases of interest to us, the image alphabet will be binary, that is, $B = \{0, 1\}$. It is easy to see that the length function for a binary prefix code must satisfy the so-called Kraft inequality,
\[ \sum_{x_1^n} 2^{-L(x_1^n)} \le 1. \]
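A small sketch may help make the Kraft inequality concrete. It checks the Kraft sum exactly with rational arithmetic and builds a prefix code from admissible lengths by the standard construction: sort by length and take, for each word, the first $L$ bits of the binary expansion of the running sum of $2^{-L}$ over the earlier words. The function names are illustrative assumptions, not notation from the notes.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of sum 2^{-L} over the given code word lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Binary code words with the given lengths; raises if Kraft fails."""
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft inequality violated")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    words = [None] * len(lengths)
    acc = Fraction(0)  # running sum of 2^{-L} over words assigned so far
    for i in order:
        l = lengths[i]
        # acc is a multiple of 2^{-l}, so acc * 2^l is an integer whose
        # l-bit representation is the first l bits of the expansion of acc.
        words[i] = format(int(acc * 2 ** l), "b").zfill(l)
        acc += Fraction(1, 2 ** l)
    return words

def is_prefix_free(words):
    """True if no word is a prefix of a different word in the list."""
    n = len(words)
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(n) for j in range(n)
    )

code = prefix_code_from_lengths([1, 2, 3, 3])
```

Sorting by length is what guarantees that the running sum is always a multiple of $2^{-L}$ for the next word, so truncating its binary expansion loses nothing.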
It can in fact be shown that a uniquely decodable binary code also satisfies the Kraft inequality, and that if $L$ is a positive integer-valued function on $A^n$ for which the Kraft inequality holds then there is a binary prefix code $C$ whose length function is $L$. (Thus, in particular, for any uniquely decodable code $C$ with length function $L$ there is a prefix code $\tilde{C}$ whose length function is also $L$.) The reason for the connection between the Kraft inequality and prefix codes is the connection between the Kraft inequality and binary trees, a connection that we now sketch. A (binary) tree is a directed graph $(V, E)$, along with a distinguished vertex $r \in V$, called the root, such that the following properties hold.

It is easy to see from the above that there is only one path from $r$ to any $v \ne r$; the length of this path is called the depth $d(v)$ of $v$. A vertex is called an outer node if its outdegree is 0; otherwise it is an inner node. Let $O$ denote the set of outer nodes. It is easy to see that the edges of the tree can be labeled by 0's and 1's so that for any vertex $v$ whose outdegree is 2, the two edges leading out of $v$ have different labels. Such a labeling assigns a binary sequence of length $d(v)$ to each outer node $v$ such that distinct outer nodes are assigned distinct sequences. The labeling is therefore just a binary code on the set of outer nodes. Furthermore, the code is a prefix code, due to the simple fact that an outer node is not an inner node! It is clear that
\[ \sum_{v \in O} 2^{-d(v)} \le 1. \]
In summary, binary trees lead to binary prefix codes on their outer nodes for which the Kraft inequality holds. Now suppose $L$ is a positive integer-valued function defined on a set $A$ such that $\sum_a 2^{-L(a)} \le 1$. Our goal is to show that there is a prefix code $C$ whose length function is $L$. Without loss of generality it can be assumed that $A$ is labeled so that $L(a_i) \le L(a_{i+1})$, $i < |A|$. The code $C$ is defined by setting $C(a_i) = w(i) \in B^*$, where $w(1)$ is a block of 0's of length $L(a_1)$, and $w(i)$, $i > 1$, is the first $L(a_i)$ bits in the binary expansion of
\[ \sum_{j < i} 2^{-L(a_j)}. \]

It is instructive to see how to calculate the exponent $D(\Pi\|P)$ for the preceding example. Consider the exponential family of distributions $\tilde{P}$ of the form $\tilde{P}(a) = cP(a)2^{tf(a)}$, where $c = \left(\sum_a P(a)2^{tf(a)}\right)^{-1}$. Clearly $\sum_a \tilde{P}(a)f(a)$ is a continuous function of the parameter $t$ and this function tends to $\max_a f(a)$ as $t \to \infty$. (Check!) As $t = 0$ gives $\tilde{P} = P$, it follows by the assumption
\[ \sum_a P(a)f(a) < \alpha < \max_a f(a) \]
that there is an element of the exponential family, with $t > 0$, such that $\sum_a \tilde{P}(a)f(a) = \alpha$. Denote this $\tilde{P}$ by $Q^*$, so that
\[ Q^*(a) = c^*P(a)2^{t^*f(a)}, \quad t^* > 0, \quad \sum_a Q^*(a)f(a) = \alpha. \]
We claim that
\[ D(\Pi\|P) = D(Q^*\|P) = \log c^* + t^*\alpha. \qquad (1) \]
To show that $D(\Pi\|P) = D(Q^*\|P)$ it suffices to show that $D(Q\|P) > D(Q^*\|P)$ for every $Q \in \Pi$, i.e., for every $Q$ for which $\sum_a Q(a)f(a) > \alpha$. A direct calculation gives
\[ \sum_a Q^*(a) \log \frac{Q^*(a)}{P(a)} = \sum_a Q^*(a)\left[\log c^* + t^*f(a)\right] = \log c^* + t^*\alpha \qquad (2) \]
and
\[ \sum_a Q(a) \log \frac{Q^*(a)}{P(a)} = \sum_a Q(a)\left[\log c^* + t^*f(a)\right] > \log c^* + t^*\alpha. \]
Hence
\[ D(Q\|P) - D(Q^*\|P) > D(Q\|P) - \sum_a Q(a) \log \frac{Q^*(a)}{P(a)} = D(Q\|Q^*) \ge 0. \]
This completes the proof of (1).
Remark 1 Replacing $P$ in (1) by any $\tilde{P}$ of the exponential family, i.e., $\tilde{P}(a) = cP(a)2^{tf(a)}$, we get that
\[ D(Q^*\|\tilde{P}) = \log \frac{c^*}{c} + (t^* - t)\alpha = \log c^* + t^*\alpha - (\log c + t\alpha). \]
Since $D(Q^*\|\tilde{P}) > 0$ for $\tilde{P} \ne Q^*$, it follows that
\[ \log c + t\alpha = -\log \sum_a P(a)2^{tf(a)} + t\alpha \]
attains its maximum at $t = t^*$. This means that the "large deviations exponent"
\[ \lim_{n \to \infty} \left[ -\frac{1}{n} \log P^n\left\{ \frac{1}{n} \sum_{i=1}^n f(X_i) > \alpha \right\} \right] \]
can be represented also as
\[ \max_{t \ge 0} \left[ -\log \sum_a P(a)2^{tf(a)} + t\alpha \right]. \]
This latter form is the one usually found in textbooks. Note that the restriction $t \ge 0$ is not needed when $\alpha > \sum_a P(a)f(a)$, because, as just seen, the unconstrained maximum is attained at $t^* > 0$. However, the restriction to $t \ge 0$ takes care also of the case when $\alpha \le \sum_a P(a)f(a)$, when the exponent is equal to 0.
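The calculation of the exponent can be checked numerically. The sketch below is an illustration under stated assumptions: it finds $t^*$ by bisection (the tilted mean is increasing in $t$), assumes $\sum_a P(a)f(a) < \alpha < \max_a f(a)$, and uses a fair coin with $f$ the indicator of heads and $\alpha = 0.75$ as a hypothetical example. The helper names are not from the notes.

```python
import math

def tilt(p, f, t):
    """Return (Q_t, c) with Q_t(a) = c P(a) 2^{t f(a)}."""
    w = [pa * 2.0 ** (t * fa) for pa, fa in zip(p, f)]
    c = 1.0 / sum(w)
    return [c * x for x in w], c

def tilted_mean(p, f, t):
    q, _ = tilt(p, f, t)
    return sum(qa * fa for qa, fa in zip(q, f))

def solve_t_star(p, f, alpha, hi=1.0, iters=200):
    """Bisection for t > 0 with tilted mean alpha.

    Assumes E_P f < alpha < max f; otherwise the bracketing loop
    below would not terminate.
    """
    lo = 0.0
    while tilted_mean(p, f, hi) < alpha:
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if tilted_mean(p, f, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: fair coin, f = indicator of heads, alpha = 0.75.
p, f, alpha = [0.5, 0.5], [0.0, 1.0], 0.75
t_star = solve_t_star(p, f, alpha)
q_star, c_star = tilt(p, f, t_star)
exponent = math.log2(c_star) + t_star * alpha          # log c* + t* alpha
kl = sum(q * math.log2(q / pa) for q, pa in zip(q_star, p))  # D(Q*||P)
```

In this example $Q^* = (0.25, 0.75)$, so `exponent` and `kl` agree with $1 - H(0.25, 0.75)$ up to rounding, as identity (1) predicts.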
The I-projection of a distribution $Q$ onto a closed, convex subset $\Pi$ of distributions on $A$ is the $P^* \in \Pi$ such that
\[ D(P^*\|Q) = \min_{P \in \Pi} D(P\|Q). \]
In the sequel we suppose that $Q(a) > 0$ for all $a \in A$. The function $D(P\|Q)$ is then continuous and strictly convex in $P$, so that $P^*$ exists and is unique. The support of the distribution $P$ is the set $S(P) = \{a\colon P(a) > 0\}$. Since $\Pi$ is convex, among the supports of elements of $\Pi$ there is one that contains all the others; this will be called the support of $\Pi$ and denoted by $S(\Pi)$.

Theorem 5 $S(P^*) = S(\Pi)$ and $D(P\|Q) \ge D(P\|P^*) + D(P^*\|Q)$ for all $P \in \Pi$.

Proof. Of course, if the asserted inequality holds for some $P^* \in \Pi$ and all $P \in \Pi$ then $P^*$ must be the I-projection of $Q$ onto $\Pi$. For arbitrary $P \in \Pi$, by the convexity of $\Pi$ we have $P_t = (1 - t)P^* + tP \in \Pi$ for $0 \le t \le 1$, hence for each $t \in (0, 1)$,
\[ 0 \le \frac{1}{t}\left[D(P_t\|Q) - D(P^*\|Q)\right] = \frac{d}{dt} D(P_t\|Q)\Big|_{t = \tilde{t}}, \]
for some $\tilde{t} \in (0, t)$; the inequality holds because $P^*$ minimizes $D(\cdot\|Q)$ over $\Pi$, and the equality is the mean value theorem. But
\[ \frac{d}{dt} D(P_t\|Q) = \sum_a (P(a) - P^*(a)) \log \frac{P_t(a)}{Q(a)} \]
and this converges (as $t \downarrow 0$) to $-\infty$ if $P^*(a) = 0$ for some $a \in S(P)$, and otherwise to
\[ \sum_a (P(a) - P^*(a)) \log \frac{P^*(a)}{Q(a)}. \qquad (3) \]
It follows that the first contingency is ruled out, proving that $S(P^*) \supset S(P)$, and also that the quantity (3) is nonnegative, proving the claimed inequality.
Now we examine some situations in which the inequality of Theorem 5 is actually an equality. For any given functions $f_1, f_2, \ldots, f_k$ on $A$ and corresponding numbers $\alpha_1, \alpha_2, \ldots, \alpha_k$, the set
\[ \mathcal{L} = \Big\{P\colon \sum_a P(a)f_i(a) = \alpha_i, \ 1 \le i \le k\Big\} \]
will be called a linear family of probability distributions. For any given functions $f_1, f_2, \ldots, f_k$ on $A$, the set $\mathcal{E}$ of all $P$ such that
\[ P(a) = cQ(a) \exp\Big(\sum_{i=1}^k \theta_i f_i(a)\Big), \quad \text{for some } \theta_1, \ldots, \theta_k, \]
will be called an exponential family of probability distributions; here $Q$ is any given distribution and
\[ c = c(\theta_1, \ldots, \theta_k) = \Big(\sum_a Q(a) \exp\Big(\sum_{i=1}^k \theta_i f_i(a)\Big)\Big)^{-1}. \]
We will assume that $S(Q) = A$; then $S(P) = A$ for all $P \in \mathcal{E}$. Note that $Q \in \mathcal{E}$. The family $\mathcal{E}$ depends on $Q$, of course, but only in a weak manner, for any element of $\mathcal{E}$ could play the role of $Q$. If it is necessary to emphasize this dependence on $Q$ we shall write $\mathcal{E} = \mathcal{E}_Q$.
Theorem 6 The I-projection $P^*$ of $Q$ onto a linear family $\mathcal{L}$ satisfies
\[ D(P\|Q) = D(P\|P^*) + D(P^*\|Q), \quad \forall P \in \mathcal{L}. \]
Further, if $S(\mathcal{L}) = A$ then $\mathcal{L} \cap \mathcal{E}_Q = \{P^*\}$.

Proof. By the preceding theorem, $S(P^*) = S(\mathcal{L})$. Hence for every $P \in \mathcal{L}$ there is some $t < 0$ such that $P_t = (1 - t)P^* + tP \in \mathcal{L}$. Therefore, we must have $(d/dt)D(P_t\|Q)|_{t=0} = 0$, that is, the limiting quantity $\sum_a (P(a) - P^*(a)) \log \frac{P^*(a)}{Q(a)}$ from the preceding proof is equal to 0, for all $P \in \mathcal{L}$. This gives the desired identity. Also we can equivalently write
\[ \sum_a P(a)\left[\log \frac{P^*(a)}{Q(a)} - D(P^*\|Q)\right] = 0, \quad P \in \mathcal{L}. \qquad (4) \]
Now, by the definition of $\mathcal{L}$, the distributions $P \in \mathcal{L}$, regarded as $|A|$-dimensional vectors, are in the orthogonal complement of the subspace $\mathcal{F}$ spanned by the $k$ vectors $\{f_i(\cdot) - \alpha_i\colon 1 \le i \le k\}$. If $S(\mathcal{L}) = A$ then the distributions $P \in \mathcal{L}$ also span the orthogonal complement of $\mathcal{F}$, by Lemma 2 below, and hence the identity (4) implies that the vector
\[ \log \frac{P^*(\cdot)}{Q(\cdot)} - D(P^*\|Q) \]
must be in $\mathcal{F}$. This proves that $P^* \in \mathcal{E}_Q$. Finally, if $\tilde{P} \in \mathcal{L} \cap \mathcal{E}_Q$ then it is easily checked that the identity (4) holds for $\tilde{P}$ in place of $P^*$. This implies that $\tilde{P}$ satisfies the Pythagorean identity in the role of $P^*$, and this, in turn, implies that $\tilde{P} = P^*$. The proof of the theorem is finished once the following linear algebra result is established.

Lemma 2 Suppose $V$ is a subspace of $R^n$ such that there is a strictly positive vector $p \in V^\perp$, the orthogonal complement of $V$. Then $V^\perp$ is spanned by the probability vectors that belong to it.

Proof. Choose a basis for $V^\perp$ of the form $\{p, q_1, \ldots, q_\ell\}$ and determine $t_i \in (0, 1)$, $1 \le i \le \ell$, such that $p_i = (1 - t_i)p + t_iq_i$ is a nonnegative vector. The vectors $\{p, p_1, \ldots, p_\ell\}$ are easily seen to be a basis for $V^\perp$; each can then be rescaled to obtain a basis for $V^\perp$ that consists of probability vectors. This completes the proof of the lemma.
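The Pythagorean identity of Theorem 6 is easy to check numerically for a linear family with a single constraint. The sketch below is only an illustration: it searches the exponential family through $Q$ by bisection on one parameter, writes the family with base-two exponentials (the same family as with $\exp$, up to rescaling $\theta$), and uses a hypothetical three-letter alphabet with $f(a) = a$ and $\alpha = 1$. All names are assumptions made for the example.

```python
import math

def kl(p, q):
    """D(P||Q) in bits, assuming Q(a) > 0 wherever P(a) > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def i_projection_one_constraint(q, f, alpha, lo=-50.0, hi=50.0, iters=200):
    """I-projection of Q onto {P: sum_a P(a) f(a) = alpha}.

    Looks for the member c Q(a) 2^{theta f(a)} of the exponential family
    whose f-mean equals alpha; the mean is increasing in theta.
    """
    def member(theta):
        w = [qa * 2.0 ** (theta * fa) for qa, fa in zip(q, f)]
        s = sum(w)
        return [x / s for x in w]

    def mean(theta):
        return sum(pa * fa for pa, fa in zip(member(theta), f))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return member((lo + hi) / 2)

# Hypothetical data: A = {0, 1, 2}, f(a) = a, alpha = 1.
q = [0.5, 0.3, 0.2]
f = [0.0, 1.0, 2.0]
p_star = i_projection_one_constraint(q, f, 1.0)
p = [0.1, 0.8, 0.1]  # another member of the linear family (mean 1)
gap = kl(p, q) - (kl(p, p_star) + kl(p_star, q))
```

Here `gap` vanishes up to rounding, in line with the identity $D(P\|Q) = D(P\|P^*) + D(P^*\|Q)$ for $P$ in the linear family.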
If $S(\mathcal{L}) \ne A$ then no element of the exponential family $\mathcal{E} = \mathcal{E}_Q$ can belong to $\mathcal{L}$, but since $\mathcal{E}$ is not a closed set in general, some element of the closure $\mathrm{cl}(\mathcal{E})$ may be in $\mathcal{L}$. Indeed, if there is a $\tilde{P} \in \mathcal{L} \cap \mathrm{cl}(\mathcal{E})$ then the Pythagorean identity still holds for $\tilde{P}$, and this implies that $\tilde{P} = P^*$. A sequence of elements of $\mathcal{E}$ converging to $P^*$ can always be generated by the "generalized iterative scaling" algorithm, which will be discussed at the end of this section. Hence we always have $\mathcal{L} \cap \mathrm{cl}(\mathcal{E}) = \{P^*\}$.

Suppose now that $\mathcal{L}_1, \ldots, \mathcal{L}_m$ are given linear families and generate a sequence of distributions $P_n$ as follows: set $P_0 = Q$ (any given distribution with $S(Q) = A$), let $P_1$ be the I-projection of $P_0$ onto $\mathcal{L}_1$, $P_2$ the I-projection of $P_1$ onto $\mathcal{L}_2$, and so on, where for $n > m$ we mean by $\mathcal{L}_n$ that $\mathcal{L}_i$ for which $i \equiv n \pmod m$; i.e., $\mathcal{L}_1, \ldots, \mathcal{L}_m$ is repeated cyclically.
(3) $f(t) = (t - 1)^2 \ \Rightarrow\ D_f(P\|Q) = \sum_a \frac{(P(a) - Q(a))^2}{Q(a)}$.

(4) $f(t) = 1 - \sqrt{t} \ \Rightarrow\ D_f(P\|Q) = 1 - \sum_a \sqrt{P(a)Q(a)}$.

(5) $f(t) = |t - 1| \ \Rightarrow\ D_f(P\|Q) = |P - Q|$.
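The three examples can be reproduced with one generic routine. The sketch below assumes the form $D_f(P\|Q) = \sum_a Q(a) f(P(a)/Q(a))$, which is the expression that appears in the proof of the approximation theorem in these notes, and assumes $Q(a) > 0$ for all $a$. The function and variable names, and the two test distributions, are illustrative.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)), assuming Q(a) > 0 for all a."""
    return sum(qa * f(pa / qa) for pa, qa in zip(p, q))

def chi2_f(t):        # example (3)
    return (t - 1) ** 2

def hellinger_f(t):   # example (4)
    return 1 - math.sqrt(t)

def variation_f(t):   # example (5)
    return abs(t - 1)

def kl_f(t):          # f(t) = t log t gives the information divergence
    return t * math.log2(t) if t > 0 else 0.0

# Hypothetical distributions used only for illustration.
p = [0.5, 0.5]
q = [0.25, 0.75]
chi2 = f_divergence(chi2_f, p, q)
hell = f_divergence(hellinger_f, p, q)
var = f_divergence(variation_f, p, q)
div = f_divergence(kl_f, p, q)
```

Passing the convex function as an argument keeps the routine faithful to the definition, so properties stated for general $f$ can be probed by swapping in another $f$.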
The expression $D_f(P\|Q) = \sum_a \frac{(P(a) - Q(a))^2}{Q(a)}$ will be denoted by $\chi^2(P, Q)$. The analogue of the log-sum inequality is
\[ \sum_i b_i f\Big(\frac{a_i}{b_i}\Big) \ge b f\Big(\frac{a}{b}\Big), \quad a = \sum_i a_i, \quad b = \sum_i b_i. \]
Using this, many of the properties of the information divergence $D(P\|Q)$ extend to general $f$-divergences, in particular

Lemma 3 $D_f(P\|Q) \ge 0$ and if $f$ is strictly convex at $t = 1$ then $D_f(P\|Q) = 0$ only when $P = Q$. Further, $D_f(P\|Q)$ is a convex function of the pair $(P, Q)$, and the partitioning property $D_f(P\|Q) \ge D_f(P^{\mathcal{B}}\|Q^{\mathcal{B}})$ holds for any partition $\mathcal{B}$ of $A$.

A basic theorem about $f$-divergences is the following approximation property.

Theorem 8 If $f$ is twice differentiable at $t = 1$ and $f''(1) > 0$ then for any $Q$ with $S(Q) = A$ and $P$ "close" to $Q$ we have
\[ D_f(P\|Q) \sim \frac{f''(1)}{2} \chi^2(P, Q). \]
(Formally, $D_f(P\|Q)/\chi^2(P, Q) \to f''(1)/2$ as $\chi^2(P, Q) \to 0$.)
Proof. Since $f(1) = 0$, Taylor's expansion gives
\[ f(t) = f'(1)(t - 1) + \frac{f''(1)}{2}(t - 1)^2 + \epsilon(t)(t - 1)^2, \]
where $\epsilon(t) \to 0$ as $t \to 1$. Hence
\[ Q(a) f\Big(\frac{P(a)}{Q(a)}\Big) = f'(1)(P(a) - Q(a)) + \frac{f''(1)}{2} \frac{(P(a) - Q(a))^2}{Q(a)} + \epsilon\Big(\frac{P(a)}{Q(a)}\Big) \frac{(P(a) - Q(a))^2}{Q(a)}. \]
Summing over $a \in A$ then establishes the theorem, since $\sum_a (P(a) - Q(a)) = 0$.
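The approximation property can be observed numerically. The following sketch takes $f(t) = t \log t$ with base-two logarithms, for which $f''(1) = \log e$, and moves $P$ toward a fixed $Q$ along a fixed direction; the ratio $D_f/\chi^2$ should approach $(\log e)/2$. The distribution, the direction, and the step sizes are hypothetical choices made for the illustration.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)), assuming Q(a) > 0."""
    return sum(qa * f(pa / qa) for pa, qa in zip(p, q))

def chi_square(p, q):
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))

def kl_f(t):
    # f(t) = t log2 t, so f''(1) = log2(e)
    return t * math.log2(t) if t > 0 else 0.0

q = [0.2, 0.3, 0.5]
direction = [0.1, -0.2, 0.1]  # sums to zero, so P stays normalized

ratios = []
for eps in (1.0, 0.1, 0.01):
    p = [qa + eps * d for qa, d in zip(q, direction)]
    ratios.append(f_divergence(kl_f, p, q) / chi_square(p, q))

limit = math.log2(math.e) / 2  # f''(1) / 2
```

The entries of `ratios` move toward `limit` as `eps` shrinks, with an error that is roughly proportional to `eps`.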
Remark 2 The same proof works even if $Q$ is not fixed, provided that no $Q(a)$ can become arbitrarily small. However, the theorem (the "asymptotic equivalence" of $f$-divergences subject to the differentiability hypotheses) does not remain true if $Q$ is not fixed and the probabilities $Q(a)$ are not bounded away from 0.

Corollary 2 If $f$ satisfies the hypotheses of the theorem and $\hat{P}$ is the empirical distribution (i.e., type) of a sample of size $n$ drawn independently from the distribution $Q$, then $(2/f''(1))\,nD_f(\hat{P}\|Q)$ has an asymptotic $\chi^2$ distribution, with $|A| - 1$ degrees of freedom, as $n \to \infty$.

The $\chi^2$ distribution with $k$ degrees of freedom is defined as the distribution of the sum of squares of $k$ independent random variables having the standard normal distribution. By this corollary, both $(2/\log e)\,nD(\hat{P}\|Q)$ and $(2/\log e)\,nD(Q\|\hat{P})$ are asymptotically $\chi^2$ with $|A| - 1$ degrees of freedom.
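As a concrete illustration of the corollary, the sketch below computes two statistics from a vector of cell counts: $n\chi^2(\hat{P}, Q)$, which is the familiar Pearson statistic $\sum_a (x_a - nQ(a))^2/(nQ(a))$, and $(2/\log e)\,nD(\hat{P}\|Q)$ with base-two logarithms. The counts and the hypothesized distribution are made-up numbers, and the function names are assumptions.

```python
import math

def pearson_statistic(counts, q):
    """n * chi^2(P_hat, Q) = sum_a (x_a - n Q(a))^2 / (n Q(a))."""
    n = sum(counts)
    return sum((x - n * qa) ** 2 / (n * qa) for x, qa in zip(counts, q))

def divergence_statistic(counts, q):
    """(2 / log e) * n * D(P_hat || Q), with D measured in bits."""
    n = sum(counts)
    d = sum((x / n) * math.log2((x / n) / qa)
            for x, qa in zip(counts, q) if x > 0)
    return 2.0 / math.log2(math.e) * n * d

# Hypothetical sample of size 100 and hypothesized distribution Q.
counts = [52, 29, 19]
q = [0.5, 0.3, 0.2]
pearson = pearson_statistic(counts, q)
g_stat = divergence_statistic(counts, q)
```

When the empirical distribution is close to $Q$ the two statistics are nearly equal, which is what the approximation theorem leads one to expect.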
One property that distinguishes information divergence among $f$-divergences is transitivity of projections, as summarized in the following lemma. It can, in fact, be shown that the only $f$-divergence for which either of the two properties of the lemma holds is the information divergence.

Lemma 4 Let $P^*$ be the I-projection of $Q$ onto a linear family $\mathcal{L}$. Then

(i) For any convex subfamily $\mathcal{L}' \subset \mathcal{L}$ the I-projections of $Q$ and of $P^*$ onto $\mathcal{L}'$ are the same.

(ii) For any "translate" $\mathcal{L}'$ of $\mathcal{L}$, the I-projections of $Q$ and of $P^*$ onto $\mathcal{L}'$ are the same, provided $S(P^*) = A$.

Proof. By the Pythagorean identity
\[ D(P\|Q) = D(P\|P^*) + D(P^*\|Q), \quad P \in \mathcal{L}. \]
It follows that on any subset of $\mathcal{L}$ the minima of $D(P\|Q)$ and of $D(P\|P^*)$ are achieved by the same $P$. This establishes (i). $\mathcal{L}'$ is called a translate of $\mathcal{L}$ if it is defined in terms of the same functions $f_i$, but possibly different $\alpha_i$. Hence, the exponential family corresponding to $\mathcal{L}'$ is the same as it is for $\mathcal{L}$. Since $S(P^*) = A$, we know that $P^*$ belongs to this exponential family. But every element of the exponential family has the same I-projection onto $\mathcal{L}'$, which establishes (ii).
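Table 1: A 2-dimensional contingency table.
\[
\begin{array}{cccc|c}
x(0, 0) & x(0, 1) & \cdots & x(0, r_2) & x(0\,\cdot) \\
x(1, 0) & x(1, 1) & \cdots & x(1, r_2) & x(1\,\cdot) \\
\vdots & \vdots & & \vdots & \vdots \\
x(r_1, 0) & x(r_1, 1) & \cdots & x(r_1, r_2) & x(r_1\,\cdot) \\
\hline
x(\cdot\,0) & x(\cdot\,1) & \cdots & x(\cdot\,r_2) & n
\end{array}
\]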
Now we apply some of these ideas to the analysis of contingency tables. A 2-dimensional contingency table is indicated in Table 1. The sample data have two features, with categories $0, \ldots, r_1$ for the first feature and $0, \ldots, r_2$ for the second feature. The cell counts
\[ x(j_1, j_2), \quad 0 \le j_1 \le r_1, \ 0 \le j_2 \le r_2, \]
are nonnegative integers; thus in the sample there were $x(j_1, j_2)$ members that had category $j_1$ for the first feature and $j_2$ for the second. The table has two marginals with marginal counts
\[ x(j_1\,\cdot) = \sum_{j_2 = 0}^{r_2} x(j_1, j_2), \qquad x(\cdot\,j_2) = \sum_{j_1 = 0}^{r_1} x(j_1, j_2). \]
The sum of all the counts is
\[ n = \sum_{j_1} x(j_1\,\cdot) = \sum_{j_2} x(\cdot\,j_2) = \sum_{j_1} \sum_{j_2} x(j_1, j_2). \]
The term contingency table comes from this example, the cell counts being arranged in a table, with the marginal counts appearing at the margins. Other forms are also commonly used, e.g., the marginal empirical probabilities are indicated by replacing $x(j_1\,\cdot)$ by $\hat{p}(j_1\,\cdot) = x(j_1\,\cdot)/n$ and $x(\cdot\,j_2)$ by $\hat{p}(\cdot\,j_2) = x(\cdot\,j_2)/n$, and/or the counts are replaced by the relative counts $\hat{p}(j_1, j_2) = x(j_1, j_2)/n$.

In the general case the sample has $d$ features of interest, with the $i$th feature having categories $0, 1, \ldots, r_i$. The $d$-tuples $\omega = (j_1, \ldots, j_d)$ are called cells; the corresponding cell count $x(\omega)$ is the number of members of the sample such that, for each $i$, the $i$th feature is in the $j_i$th category. The collection of possible cells will be denoted by $\Omega$. The empirical distribution is defined by $\hat{p}(\omega) = x(\omega)/n$, where $n = \sum_\omega x(\omega)$ is the sample size. By a $d$-dimensional contingency table we mean either the aggregate of the cell counts $x(\omega)$, or the empirical distribution $\hat{p}$, or sometimes any distribution $P$ on $\Omega$ (mainly when considered as a model for the "true distribution" from which the sample came).

The marginals of a contingency table are obtained by restricting attention to those features $i$ that belong to some given set $\gamma \subset \{1, 2, \ldots, d\}$. Formally, for $\gamma = (i_1, \ldots, i_k)$ we denote by $\omega(\gamma)$ the $\gamma$-projection of $\omega = (j_1, \ldots, j_d)$, that is, $\omega(\gamma) = (j_{i_1}, j_{i_2}, \ldots, j_{i_k})$. The $\gamma$-marginal of the contingency table is given by the marginal counts
\[ x(\omega(\gamma)) = \sum_{\omega'\colon \omega'(\gamma) = \omega(\gamma)} x(\omega') \]
or the corresponding empirical distribution $\hat{p}(\omega(\gamma)) = x(\omega(\gamma))/n$. In general the $\gamma$-marginal of any distribution $\{P(\omega)\colon \omega \in \Omega\}$ is defined as the distribution $P_\gamma$ defined by the marginal probabilities
\[ P_\gamma(\omega(\gamma)) = \sum_{\omega'\colon \omega'(\gamma) = \omega(\gamma)} P(\omega'). \]
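The $\gamma$-marginal is a sum over all cells that share the same $\gamma$-projection, which is a one-pass aggregation. The sketch below is an illustration only: the table is stored as a dictionary from cell tuples to counts (or probabilities), and features are indexed from 0 rather than from 1 as in the text. The name `marginal` and the sample counts are assumptions.

```python
from collections import defaultdict

def marginal(table, gamma):
    """gamma-marginal of a table {cell tuple: count or probability}.

    gamma is a tuple of 0-based feature indices; the result maps each
    projected cell omega(gamma) to the sum over cells projecting onto it.
    """
    out = defaultdict(int)
    for omega, value in table.items():
        out[tuple(omega[i] for i in gamma)] += value
    return dict(out)

# Hypothetical 2 x 2 table of cell counts x(j1, j2).
table = {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 4}
rows = marginal(table, (0,))   # x(j1 .)
cols = marginal(table, (1,))   # x(. j2)
total = marginal(table, ())    # the sample size n
```

The same function handles tables of any dimension $d$, since the projection only reads the coordinates listed in `gamma`.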
In general a $d$-dimensional contingency table has $d$ one-dimensional marginals, $d(d - 1)/2$ two-dimensional marginals, etc., corresponding to the subsets of $\{1, \ldots, d\}$ of one, two, etc., elements.

For contingency tables the most important linear families of distributions are those defined by fixing certain $\gamma$-marginals, for a family $\Gamma$ of sets $\gamma \subset \{1, \ldots, d\}$. Thus, denoting the fixed marginals by $\bar{P}_\gamma$, $\gamma \in \Gamma$, we consider
\[ \mathcal{L} = \{P\colon P_\gamma = \bar{P}_\gamma, \ \gamma \in \Gamma\}. \]
The exponential family (through any given $Q$) that corresponds to this linear family $\mathcal{L}$ consists of all distributions that can be represented in product form as
\[ P(\omega) = cQ(\omega) \prod_{\gamma \in \Gamma} a_\gamma(\omega(\gamma)). \qquad (7) \]
In particular, if $\mathcal{L}$ is given by fixing the one-dimensional marginals (i.e., $\Gamma$ consists of the one-point subsets of $\{1, \ldots, d\}$) then the corresponding exponential family consists of the distributions of the form
\[ P(i_1, \ldots, i_d) = cQ(i_1, \ldots, i_d)\, a_1(i_1) \cdots a_d(i_d). \]
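For a two-dimensional table with both one-dimensional marginals fixed, cyclic I-projection can be sketched in a few lines. The code below relies on the standard fact, assumed here rather than derived in the visible text, that the I-projection onto the family with one fixed marginal is obtained by rescaling rows (or columns) to match that marginal; this is the classical iterative proportional fitting procedure. It assumes $Q$ is strictly positive, and the function names and numbers are illustrative.

```python
def scale_rows(p, row_target):
    """Rescale each row of p so that its sum equals the target row marginal."""
    return [[x * t / sum(row) for x in row] for row, t in zip(p, row_target)]

def scale_cols(p, col_target):
    """Rescale each column of p so that its sum equals the target column marginal."""
    sums = [sum(col) for col in zip(*p)]
    return [[x * t / s for x, t, s in zip(row, col_target, sums)] for row in p]

def iterative_scaling(q, row_target, col_target, sweeps=100):
    """Alternate row and column rescaling, starting from a positive table q."""
    p = [row[:] for row in q]
    for _ in range(sweeps):
        p = scale_rows(p, row_target)
        p = scale_cols(p, col_target)
    return p

# Hypothetical starting table Q and target marginals.
q = [[0.4, 0.1], [0.2, 0.3]]
fitted = iterative_scaling(q, [0.3, 0.7], [0.6, 0.4])
odds_q = q[0][0] * q[1][1] / (q[0][1] * q[1][0])
odds_fit = fitted[0][0] * fitted[1][1] / (fitted[0][1] * fitted[1][0])
```

Each rescaling multiplies $Q(i_1, i_2)$ by a factor depending on one coordinate only, so every iterate has the product form $cQ(i_1, i_2)a_1(i_1)a_2(i_2)$; in the $2 \times 2$ case this shows up as an unchanged odds ratio.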
The family of all distributions of the form (7) is called the log-linear family with interactions $\gamma \in \Gamma$. In most applications, $Q$ is chosen as the uniform distribution; often the name "log-linear family" is restricted to this case. Then (7) gives that the log of $P(\omega)$ is equal to a sum of terms, each representing an "interaction" $\gamma \in \Gamma$, for it depends on $\omega = (j_1, \ldots, j_d)$ only through $\omega(\gamma) = (j_{i_1}, \ldots, j_{i_k})$, where $\gamma = (i_1, \ldots, i_k)$. A log-linear family is also called a log-linear model. It should be noted that the representation (7) is not
In the following theorem the sets $\mathcal{P}$ and $\mathcal{Q}$ and the function $D(P\|Q)$ are completely arbitrary. In later applications $D(P\|Q)$ will be the divergence and $\mathcal{P}$ and $\mathcal{Q}$ will be convex sets of distributions on a finite set $A$.

Theorem 10 Let $D(P\|Q)$ be an arbitrary real-valued function defined for $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ such that $P^* = P^*(Q) = \arg\min_{P \in \mathcal{P}} D(P\|Q)$ exists for all $Q \in \mathcal{Q}$ and $Q^* = Q^*(P) = \arg\min_{Q \in \mathcal{Q}} D(P\|Q)$ exists for all $P \in \mathcal{P}$. Suppose further that there is a nonnegative function $\delta(P\|P')$ defined on $\mathcal{P} \times \mathcal{P}$ with the following "three-points property,"
\[ \delta(P\|P^*(Q)) + D(P^*(Q)\|Q) \le D(P\|Q), \quad \forall P \in \mathcal{P}, \ Q \in \mathcal{Q}, \]
as well as the following "four-points property,"
\[ D(P'\|Q') + \delta(P'\|P) \ge D(P'\|Q^*(P)), \quad \forall P, P' \in \mathcal{P}, \ Q' \in \mathcal{Q}. \]
Let $Q_0$ be an arbitrary member of $\mathcal{Q}$ and recursively define
\[ P_n = \arg\min_{P \in \mathcal{P}} D(P\|Q_{n-1}), \qquad Q_n = \arg\min_{Q \in \mathcal{Q}} D(P_n\|Q). \qquad (8) \]
Then
\[ \lim_{n \to \infty} D(P_n\|Q_n) = \inf_{P \in \mathcal{P},\, Q \in \mathcal{Q}} D(P\|Q). \]
If, in addition, (i) $\min_{Q \in \mathcal{Q}} D(P\|Q)$ is continuous in $P$, (ii) $\mathcal{P}$ is compact, and (iii) $\delta(P\|P_n) \to 0$ iff $P_n \to P$, then for the iteration (8) $P_n$ will converge to some $P^*$, such that if $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P^*\|Q)$ then $D(P^*\|Q^*) = \min_{P \in \mathcal{P},\, Q \in \mathcal{Q}} D(P\|Q)$ and, moreover, $\delta(P^*\|P_n) \downarrow 0$ and
\[ D(P_n\|Q_n) - D(P^*\|Q^*) \le \delta(P^*\|P_{n-1}) - \delta(P^*\|P_n). \]
Proof. We have, by the three-points property,
\[ \delta(P\|P_{n+1}) + D(P_{n+1}\|Q_n) \le D(P\|Q_n), \]
and, by the four-points property,
\[ D(P\|Q_n) \le D(P\|Q) + \delta(P\|P_n), \]
for all $P \in \mathcal{P}$, $Q \in \mathcal{Q}$. Hence
\[ \delta(P\|P_{n+1}) \le D(P\|Q) - D(P_{n+1}\|Q_n) + \delta(P\|P_n). \qquad (9) \]
The inequality (9) implies the desired basic limit result
\[ \lim_{n \to \infty} D(P_n\|Q_n) = \inf_{P \in \mathcal{P},\, Q \in \mathcal{Q}} D(P\|Q). \]
Indeed, if this were false it would mean that there exist $P \in \mathcal{P}$, $Q \in \mathcal{Q}$ and $\epsilon > 0$ such that
\[ \lim_{n \to \infty} D(P_n\|Q_n) = \lim_{n \to \infty} D(P_{n+1}\|Q_n) > D(P\|Q) + \epsilon. \]
Then (9) would give that $\delta(P\|P_{n+1}) \le \delta(P\|P_n) - \epsilon$, $n = 1, 2, \ldots$, which contradicts the assumption that $\delta$ is nonnegative.

Suppose assumptions (i)-(iii) hold. Pick a subsequence $P_{n_k} \to P^*$ as $k \to \infty$ and let $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P^*\|Q)$. Our basic limit result and assumption (i) imply that $(P^*, Q^*)$ achieves $\min_{P,Q} D(P\|Q)$. But it is easy to see that (9) implies that if $(P, Q)$ achieves $\min_P \min_Q D(P\|Q)$ then $\delta(P\|P_{n+1}) \le \delta(P\|P_n)$ for every $n$. Thus $\delta(P^*\|P_n)$ must be nonincreasing, and, by assumption (iii) applied along the subsequence, its limit must be 0. Using assumption (iii) once more, we conclude that $P_n \to P^*$. The final inequality in the statement of the theorem then follows from (9) by replacing $(P, Q)$ by $(P^*, Q^*)$. This completes the proof of the theorem.
Now we wish to apply the theorem to the case when $D(P\|Q)$ is the divergence and $\mathcal{P}$ and $\mathcal{Q}$ are convex, compact sets of nonnegative measures on $A$. No assumption that the measures are probability distributions is made at this point; hence, in particular, $D(P\|Q)$ may have negative values. Of course, if $\sum_a P(a) \ge \sum_a Q(a)$ then $D(P\|Q) \ge 0$. Furthermore, the quantity
\[ \delta(P\|Q) = \sum_a \left[ P(a) \log \frac{P(a)}{Q(a)} - (P(a) - Q(a)) \log e \right] \]
is always nonnegative and vanishes iff $P = Q$. This $\delta$ satisfies assumption (iii) of the theorem as well as the three-points and four-points properties. We verify the four-points property and leave the verification of the other properties to the reader. Let $Q^* = \arg\min_{Q \in \mathcal{Q}} D(P\|Q)$, let $Q'$ be an arbitrary member of $\mathcal{Q}$, and set $Q_t = (1 - t)Q^* + tQ' \in \mathcal{Q}$, $0 \le t \le 1$. Then
\[ 0 \le \frac{1}{t}\left[D(P\|Q_t) - D(P\|Q^*)\right] = \frac{d}{dt} D(P\|Q_t)\Big|_{t = \tilde{t}}, \quad 0 < \tilde{t} \le t. \]
With $t \to 0$ it follows that
\[ 0 \le \lim_{\tilde{t} \to 0} \sum_a \frac{P(a)\,(Q^*(a) - Q'(a)) \log e}{(1 - \tilde{t})Q^*(a) + \tilde{t}Q'(a)} = \sum_a P(a)\, \frac{Q^*(a) - Q'(a)}{Q^*(a)} \log e. \qquad (10) \]
If we then combine this with the fact that $\log t \ge (1 - 1/t) \log e$ we obtain
\[ \sum_a \left[ P'(a) \log \frac{P'(a)Q^*(a)}{Q'(a)P(a)} - \left( P'(a) - P(a) \right) \log e \right] \ge 0, \]
which is just a rewritten version of the four-points property.
Remark 5 Suppose we are given a convex family $\mathcal{F}$ of random variables defined on a finite probability space $(\Omega, P)$ and let $X^*$ be a member of the family for which $E(\log X)$ is maximal. Then, letting $X$ and $X^*$ play the role of $Q'$ and $Q^*$, respectively, the inequality (10) gives that
\[ E\Big(\frac{X^* - X}{X^*}\Big) \ge 0, \quad \text{i.e.,} \quad E\Big(\frac{X}{X^*}\Big) \le 1, \quad \forall X \in \mathcal{F}. \]
The finiteness assumption is not really needed here, for all that is needed is that $\max E(\log X)$ is attained. This is known as Cover's inequality.
The result of Theorem 10 can be applied to the problem of minimizing divergence from a set of distributions that is the image of a "nice" set in some other space. Let $T\colon A \mapsto B$ be a given mapping and for any $P$ on $A$ write $P^T$ for its image on $B$, that is, $P^T(b) = \sum_{a\colon Ta = b} P(a)$.

Problem 1. Given a set $\tilde{\mathcal{Q}} = \{Q^T\colon Q \in \mathcal{Q}\}$ of distributions on $B$, for some set $\mathcal{Q}$ of distributions on $A$, minimize $D(\tilde{P}\|\tilde{Q})$ subject to $\tilde{Q} \in \tilde{\mathcal{Q}}$, for some given $\tilde{P}$ on $B$. Here it is assumed that to any $P \in \mathcal{P}$, a $Q \in \mathcal{Q}$ minimizing $D(P\|Q)$ can "easily" be found.

Problem 2. The same but with the roles of $P$ and $Q$ interchanged.

The first problem is relevant for maximum likelihood estimation based on partially observed data, when estimation from the full data would be "easy." The two problems can be solved in similar ways; we concentrate on the first one. Let $\mathcal{P}$ be the set of all $P$ on $A$ such that $P^T = \tilde{P}$. Here $\tilde{P}$ and the elements of $\tilde{\mathcal{Q}}$ are not necessarily probability distributions; indeed, either $\sum_b \tilde{P}(b)$ or $\sum_b \tilde{Q}(b)$ may be less than, equal to, or greater than 1. Nevertheless the partitioning inequality gives $D(P\|Q) \ge D(P^T\|Q^T)$, with equality iff
\[ \frac{P(a)}{Q(a)} = \frac{P^T(Ta)}{Q^T(Ta)}, \quad \forall a \in A. \]
Hence $P^* \in \mathcal{P}$, $Q^* \in \mathcal{Q}$ achieve $\min_{P,Q} D(P\|Q)$ iff $\tilde{Q}^* = Q^{*T}$ achieves $\min_{\tilde{Q}} D(\tilde{P}\|\tilde{Q})$.

Such $(P^*, Q^*)$ can be obtained using Theorem 10. Indeed, to $Q_{n-1} \in \mathcal{Q}$ we can find $P_n \in \mathcal{P}$ minimizing $D(P\|Q_{n-1})$ for $P \in \mathcal{P}$ merely by letting
\[ P_n(a) = Q_{n-1}(a)\, \frac{\tilde{P}(Ta)}{Q_{n-1}^T(Ta)}, \]
for by definition $P^T = \tilde{P}$ if $P \in \mathcal{P}$. The alternate step, finding $Q_n \in \mathcal{Q}$ minimizing $D(P_n\|Q)$, is "easy" by assumption.
Now we apply the preceding discussion to a mixture distribution problem. Let $\tilde{\mathcal{Q}}$ be the set of all $\tilde{Q}$ of the form $\tilde{Q}(b) = \sum_{i=1}^k c_i\mu_i(b)$, where $c_i \ge 0$, $\sum_i c_i = 1$, and the $\mu_i(b)$ are arbitrary nonnegative measures.

Goal: Find $(c_1^*, \ldots, c_k^*)$ achieving $\min_{\tilde{Q}} D(\tilde{P}\|\tilde{Q})$, for a given $\tilde{P}$.

Solution. Let $A$ be the set of all pairs $(i, b)$, $1 \le i \le k$, $b \in B$, and let $T(i, b) = b$. Define $\mathcal{P}$ and $\mathcal{Q}$ as above and apply the iteration scheme. Thus
\[ \mathcal{P} = \Big\{P\colon \sum_{i=1}^k P(i, b) = \tilde{P}(b)\Big\}, \qquad \mathcal{Q} = \{Q\colon Q(i, b) = c_i\mu_i(b)\}. \]
Start with an arbitrary $(c_1^0, \ldots, c_k^0)$ with positive components that sum to 1; this defines $Q_0(i, b) = c_i^0\mu_i(b)$. If $Q_{n-1}(i, b) = c_i^{n-1}\mu_i(b)$ is already defined let $P_n$ be determined as above, that is,
\[ P_n(i, b) = Q_{n-1}(i, b)\, \frac{\tilde{P}(b)}{\tilde{Q}_{n-1}(b)} = c_i^{n-1}\mu_i(b)\, \frac{\tilde{P}(b)}{\sum_j c_j^{n-1}\mu_j(b)}. \]
The next step is to find $Q_n \in \mathcal{Q}$ minimizing $D(P_n\|Q)$. To do this put $P_n(i) = \sum_b P_n(i, b)$, $P_n(b|i) = P_n(i, b)/P_n(i)$ and use the relation $Q(i, b) = c_i\mu_i(b)$ to write
\[ D(P_n\|Q) = \sum_{i=1}^k \sum_b P_n(i, b) \log \frac{P_n(i, b)}{Q(i, b)} \]
in the form
\[ D(P_n\|Q) = \sum_{i,b} P_n(i)P_n(b|i) \left[ \log \frac{P_n(i)}{c_i} + \log \frac{P_n(b|i)}{\mu_i(b)} \right]. \qquad (11) \]
Note that $\sum_i P_n(i) = \sum_b \tilde{P}(b)$, and hence $D(P_n\|Q)$ is minimized if in (11) we set $c_i = P_n(i)/\sum_b \tilde{P}(b)$ (using
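The two alternating steps for the mixture problem combine into a single update of the weights, $c_i^n = P_n(i)/\sum_b \tilde{P}(b)$ with $P_n(i) = \sum_b c_i^{n-1}\mu_i(b)\tilde{P}(b)/\sum_j c_j^{n-1}\mu_j(b)$. The sketch below iterates this update. It is an illustration, not the authors' code: measures are lists indexed by $b$, the mixture is assumed positive wherever $\tilde{P}$ is, and the component measures and target are made-up numbers chosen so that the true weights are known.

```python
def mixture_weights(p_tilde, mu, c0, steps=200):
    """Iterate c_i <- P_n(i) / sum_b p_tilde(b) for the mixture problem.

    p_tilde[b]: target measure on B; mu[i][b]: component measures;
    c0: initial positive weights summing to 1.
    Assumes sum_j c_j mu_j(b) > 0 wherever p_tilde(b) > 0.
    """
    c = list(c0)
    total = sum(p_tilde)
    k, nb = len(mu), len(p_tilde)
    for _ in range(steps):
        # Current mixture Q~_{n-1}(b) = sum_j c_j mu_j(b).
        mix = [sum(c[j] * mu[j][b] for j in range(k)) for b in range(nb)]
        # P_n(i) = sum_b c_i mu_i(b) p_tilde(b) / mix(b).
        pn = [
            sum(c[i] * mu[i][b] * p_tilde[b] / mix[b]
                for b in range(nb) if p_tilde[b] > 0)
            for i in range(k)
        ]
        c = [x / total for x in pn]
    return c

# Hypothetical components; the target is their 0.3 / 0.7 mixture.
mu = [[0.9, 0.1], [0.2, 0.8]]
p_tilde = [0.3 * mu[0][b] + 0.7 * mu[1][b] for b in range(2)]
weights = mixture_weights(p_tilde, mu, [0.5, 0.5])
```

Because the target lies in the mixture family here, the minimum divergence is 0 and the iteration recovers the weights used to build it.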
$C_n$ will denote a binary prefix $n$-code, for $n = 1, 2, \ldots$. The word code will mean either the sequence $\{C_n\}$ or one member $C_n$ of this sequence; the context will make clear which possibility is being used. Our first result expresses the idea that for random processes the pointwise redundancy is essentially nonnegative, in that it is very unlikely to asymptotically take large negative values.

Theorem 11 Let $\{c_n\}$ be a sequence of positive numbers satisfying $\sum_n 2^{-c_n} < \infty$. Then $R(x_1^n) \ge -c_n$, eventually almost surely.

Proof. Let
\[ A_n(c) = \{x_1^n\colon R(x_1^n) < -c\} = \{x_1^n\colon 2^{L(x_1^n)}P(x_1^n) < 2^{-c}\}. \]
Then
\[ P(A_n(c)) = \sum_{x_1^n \in A_n(c)} P(x_1^n) < 2^{-c} \sum_{x_1^n \in A_n(c)} 2^{-L(x_1^n)} \le 2^{-c}, \]
where we used the Kraft inequality. Hence
\[ \sum_{n=1}^\infty \mathrm{Prob}\left(R(X_1^n) < -c_n\right) = \sum_{n=1}^\infty P(A_n(c_n)) \le \sum_{n=1}^\infty 2^{-c_n} < \infty. \]
The theorem now follows from the Borel-Cantelli principle.
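The key estimate $P(A_n(c)) \le 2^{-c}$ in the proof can be verified by brute force for small $n$. The sketch below takes the pointwise redundancy to be $R(x_1^n) = L(x_1^n) + \log P(x_1^n)$, as read off from the second description of $A_n(c)$ in the proof, and uses the Shannon code lengths $\lceil -\log Q(x_1^n) \rceil$ for a mismatched model $Q$. The i.i.d. Bernoulli source and model, their parameters, and the function names are hypothetical choices for the illustration.

```python
import itertools
import math

def prob(x, theta):
    """Probability of the binary sequence x under i.i.d. Bernoulli(theta)."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def redundancy_tail(n, p_theta, q_theta, c):
    """P(R(X_1^n) < -c) when L is the Shannon code length for Bernoulli(q_theta)."""
    tail = 0.0
    for x in itertools.product((0, 1), repeat=n):
        px = prob(x, p_theta)
        length = math.ceil(-math.log2(prob(x, q_theta)))
        if length + math.log2(px) < -c:   # R(x) = L(x) + log P(x)
            tail += px
    return tail

# Hypothetical source Bernoulli(0.3) coded with a Bernoulli(0.6) model.
tails = {c: redundancy_tail(10, 0.3, 0.6, c) for c in (0.5, 1.0, 2.0)}
```

Each entry of `tails` stays below $2^{-c}$, which is all that the Kraft inequality guarantees; for a badly mismatched model the tail is typically much smaller than the bound.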
A sharper lower bound can be obtained for the case when there is a process $Q$ such that each $C_n$ is a Shannon code for $Q^n$, or in the case when the sequence of codes $\{C_n\}$ satisfies the strong prefix property, that is, for $m \ne n$ the code word for $x_1^m$ is not a prefix of the code word for $x_1^n$ unless $m \le n$ and $x_1^m$ is a prefix of $x_1^n$. We state this as a corollary as its proof is a modification of the preceding proof.

Corollary 3 For the Shannon code with respect to a process $Q$, or for a strongly prefix code, the pointwise redundancy $R(x_1^n)$ is bounded below by a random variable and $E(\inf_n R(x_1^n)) > -\log e$.

Proof. Let
\[ B_n(c) = \{x_1^n\colon R(x_1^n) < -c, \ R(x_1^k) \ge -c, \ k < n\}. \]
As in the proof of the theorem,
\[ P(B_n(c)) < 2^{-c} \sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)}, \]
and it is sufficient to show that
\[ \sum_{n=1}^\infty \sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)} \le 1, \]
for then the probability that $\inf_n R(x_1^n) < -c$ is at most $2^{-c}$, which integrates to $\log e$ over $c \ge 0$. If the code is a strong prefix code then no member of $B_m(c)$ is a prefix of a member of $B_n(c)$ for $m < n$, so the code words of the sequences in $\bigcup_n B_n(c)$ form a prefix-free set; the Kraft inequality then gives the desired bound and we are done. If the code is a Shannon code for a process $Q$ then
\[ \sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)} \le \sum_{x_1^n \in B_n(c)} Q(x_1^n) = Q(\tilde{B}_n(c)), \]
where $\tilde{B}_n(c)$ is the union of the cylinder sets $[x_1^n]$ for which $x_1^n \in B_n(c)$. Since these sets are disjoint, the sum of their $Q$-measures cannot exceed 1 and we again reach the desired result that $\sum_{n=1}^\infty \sum_{x_1^n \in B_n(c)} 2^{-L(x_1^n)} \le 1$. This completes the proof of the corollary.
If the sequences to be encoded are sample paths from some known random process $P$ then we cannot do significantly better (in the sense of minimizing expected redundancy) than we can by using the Shannon code, which produces expected redundancy of at most 1. In many typical situations, however, the process $P$ is unknown, although it may be known to belong to some parametric family. In such cases it is difficult to design codes for which the redundancy stays bounded. The following result shows that if the code is a Shannon code for some $Q$ then the redundancy will indeed be unbounded, unless $Q$ is already very nearly the same as $P$.

Theorem 12 If $Q$ is singular with respect to $P$ then the $P$-redundancy of the Shannon code with respect to $Q$ goes to infinity with probability 1.

Proof. The redundancy equals $\log(P(x_1^n)/Q(x_1^n))$, up to 1 bit, hence it suffices to show that $Z_n = Q(x_1^n)/P(x_1^n)$ goes to 0, with probability 1. Towards this end, let $\mathcal{F}_n$ be the smallest $\sigma$-algebra for which the sequences $x_1^n$ are measurable, that is, the $\sigma$-algebra generated by the cylinder sets $[x_1^n]$, $x_1^n \in A^n$. Then $\{Z_n\}$ is a martingale with respect to the increasing sequence $\{\mathcal{F}_n\}$ and therefore converges almost surely to some random variable $Z$. It suffices to show that $Z = 0$. Since $Q$ is assumed to be singular with respect to $P$ there is a measurable set $\tilde{A} \subset A^\infty$ such that $P(\tilde{A}) = 1$, $Q(\tilde{A}) = 0$. Let $\mu$ be the measure defined by
\[ \mu(B) = P(B) + Q(B) + \int_B Z\, dP. \]
Since $\bigcup_n \mathcal{F}_n$ generates the entire $\sigma$-algebra, for every $\epsilon > 0$ there exists $\tilde{A}_m \in \mathcal{F}_m$, for sufficiently large $m$, such that the symmetric difference between $\tilde{A}$ and $\tilde{A}_m$ has $\mu$-measure less than $\epsilon$. In particular,
\[ P(\tilde{A}_m) > 1 - \epsilon, \quad Q(\tilde{A}_m) < \epsilon, \quad \int_{\tilde{A}_m} Z\, dP > E(Z) - \epsilon. \]
But for $n \ge m$ the martingale property gives $\int_{\tilde{A}_m} Z_n\, dP = Q(\tilde{A}_m)$, and therefore Fatou's lemma gives
\[ \int_{\tilde{A}_m} Z\, dP \le \liminf_{n \to \infty} \int_{\tilde{A}_m} Z_n\, dP = Q(\tilde{A}_m) < \epsilon. \]
It follows that $E(Z) < 2\epsilon$ and hence that $E(Z) = 0$. Since $Z \ge 0$, we must have $Z = 0$, with probability 1, as claimed. This completes the proof of the theorem.
Good codes are those for which the $P$-redundancy grows slowly as $n \to \infty$. The following theorem gives a condition that guarantees the existence of such codes, under some restrictions about the process $P$. In this and later results, the limiting divergence-rate for processes is defined by
\[ D_\infty(P\|Q) = \lim_{n \to \infty} \frac{1}{n} D(P^n\|Q^n), \]
provided this limit exists. The limit is known to exist for stationary $P$ if $Q$ is i.i.d. or finite-order Markov, but not necessarily otherwise.

Theorem 13 Suppose $P$ is stationary ergodic and let $Q$ be a mixture of stationary ergodic distributions, $Q = \int U\, \nu(dU)$, such that for every $\epsilon > 0$ the set of all finite-order Markov measures $U$ with $D_\infty(P\|U) < \epsilon$ has positive $\nu$-measure. Then for the Shannon code with respect to $Q$, the redundancy satisfies $R(x_1^n)/n \to 0$, almost surely.

Proof. We have to prove that for every $\epsilon > 0$, $\log(P(x_1^n)/Q(x_1^n)) < \epsilon n$, eventually almost surely, or, equivalently,
\[ P(x_1^n) < 2^{\epsilon n} Q(x_1^n), \quad \text{eventually a.s.} \]
Let $N$ be the set of finite-order Markov measures $U$ for which $D_\infty(P\|U) < \epsilon$ and note that
\[ Q(x_1^n) = \int U(x_1^n)\, \nu(dU) \ge \int_N U(x_1^n)\, \nu(dU), \]
so that
\[ \frac{2^{\epsilon n} Q(x_1^n)}{P(x_1^n)} \ge \int_N \frac{2^{\epsilon n} U(x_1^n)}{P(x_1^n)}\, \nu(dU) = \int_N 2^{\,n\left(\epsilon - \frac{1}{n} \log \frac{P(x_1^n)}{U(x_1^n)}\right)}\, \nu(dU). \qquad (12) \]
The entropy theorem implies that
\[ \frac{1}{n} \log \frac{P(x_1^n)}{U(x_1^n)} \to D_\infty(P\|U) < \epsilon, \quad U \in N, \qquad (13) \]
for $P$-almost all infinite sequences (the exceptional set may depend on $U$). This means that the set of all pairs $(x, U)$, where $x \in A^\infty$, for which (13) does not hold, has $P \times \nu$-measure 0; this in turn implies that for $P$-almost all $x$, the set of $U$'s not satisfying (13) has $\nu$-measure 0 (in both cases, by Fubini's theorem). Thus, for $P$-almost all $x$ the integrand in (12) goes to infinity for $\nu$-almost all $U \in N$. It follows by Fatou's lemma that the integral itself goes to $+\infty$, which completes the proof of the theorem.
An important class of examples of codes that satisfy the hypotheses of the preceding theorem is obtained as follows. Let $\Gamma$ be a given (countable) list of stationary ergodic distributions, and let each $U \in \Gamma$ be assigned a "description length" $L(U)$, subject to the Kraft inequality,

$$\sum_{U \in \Gamma} 2^{-L(U)} \le 1.$$

Then $x_1^n$ can be encoded by a prefix code of length

$$\min_{U \in \Gamma} \left[ L(U) + \log \frac{1}{U(x_1^n)} \right];$$
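namely, choose $U \in \Gamma$ achieving this minimum, encode $x_1^n$ by the Shannon code with respect to $U$, and add a preamble of length $L(U)$ to identify $U$ (here, the 1 bit error from dropping the upper integer part symbol is disregarded). Let us call this the code generated by the list $\Gamma$.

Theorem 14 If for any $\epsilon > 0$ there is some finite-order Markov measure $U$ in the list $\Gamma$ with $D_\infty(P\|U) < \epsilon$, then the redundancy of the code generated by $\Gamma$ satisfies $R(x_1^n)/n \to 0$, almost surely.

Proof. Set $Q = \sum_{U \in \Gamma} 2^{-2L(U)} U$; then $Q$ satisfies the hypotheses of Theorem 13, hence

$$\frac{1}{n} \log \frac{P(x_1^n)}{Q(x_1^n)} \to 0, \quad \text{a.s.} \qquad (14)$$

Now we want to show that $R(x_1^n)/n \to 0$, a.s., where $R$ is the redundancy of the code generated by the list $\Gamma$. Towards this end, note that the condition $\sum_{U \in \Gamma} 2^{-L(U)} \le 1$ implies that

$$Q(x_1^n) = \sum_{U \in \Gamma} 2^{-L(U)} \left[ 2^{-L(U)} U(x_1^n) \right] \le \max_{U \in \Gamma} 2^{-L(U)} U(x_1^n),$$

from which it follows that

$$\log \frac{1}{Q(x_1^n)} \ge \min_{U \in \Gamma} \left[ L(U) + \log \frac{1}{U(x_1^n)} \right].$$

The right-hand side is the length of the code generated by $\Gamma$, so the redundancy of that code is at most $\log(P(x_1^n)/Q(x_1^n))$, and the claim follows from (14).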
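The comparison between the two-part code length and the mixture $Q$ used in the proof can be checked on a toy list. The following is a minimal sketch under stated assumptions: the list `GAMMA` of three Bernoulli models and their description lengths is hypothetical, integer rounding of code lengths is ignored, and the weights $2^{-2L(U)}$ follow the construction in the proof.

```python
import math

def bernoulli_prob(x, theta):
    """Probability of the binary sequence x under i.i.d. Bernoulli(theta)."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

# Hypothetical list Gamma: (theta, L(U)) pairs whose Kraft sum is
# 1/2 + 1/4 + 1/4 = 1.
GAMMA = [(0.5, 1), (0.25, 2), (0.75, 2)]

def two_part_length(x):
    """min over U of [L(U) + log2(1 / U(x))], ignoring integer rounding."""
    return min(L - math.log2(bernoulli_prob(x, theta)) for theta, L in GAMMA)

def mixture_length(x):
    """log2(1 / Q(x)) for Q = sum over U of 2^(-2 L(U)) U."""
    q = sum(2.0 ** (-2 * L) * bernoulli_prob(x, theta) for theta, L in GAMMA)
    return -math.log2(q)

x = [1, 1, 0, 1, 1, 1, 0, 1]
print("two-part length:", two_part_length(x))
print("log2(1/Q(x))   :", mixture_length(x))
```

For any input the first printed value should not exceed the second, which is the inequality used in the proof.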
Some techniques for obtaining bounds on redundancy for i.i.d. processes will be discussed in this section. Consider the i.i.d. process with alphabet $A = \{1, \ldots, k\}$ and distribution $P$. We then have

$$P(x_1^n) = \prod_{i=1}^{k} P(i)^{n_i},$$

where $n_i$ is the number of times $i$ occurs in $x_1^n$. This probability is maximized if $P(i) = n_i/n$, hence the maximum likelihood estimate gives

$$P_{ML}(x_1^n) = \prod_{i=1}^{k} \left( \frac{n_i}{n} \right)^{n_i}.$$

When encoding with respect to an auxiliary distribution $Q$, the redundancy satisfies (disregarding at most 1 bit) the following simple bound:

$$R(x_1^n) = \log \frac{P(x_1^n)}{Q(x_1^n)} \le \log \frac{P_{ML}(x_1^n)}{Q(x_1^n)}.$$
Let us take for $Q$ the mixture distribution $Q(x_1^n) = \int U(x_1^n) \nu(p) \, dp$, where $U$ denotes the i.i.d. distribution with marginal $p$, with a Dirichlet prior having density

$$\nu(p) = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i + k \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} \prod_{i=1}^{k} p_i^{\alpha_i}, \qquad p = (p_1, \ldots, p_k).$$

For $\alpha_1 = \cdots = \alpha_k = -1/2$ we will get a sharp upper bound on the redundancy in the inequality above, a bound depending neither on the true distribution $P$ nor on $x_1^n$. Before we state and derive this bound we obtain a representation for $Q$ that will be useful in constructing the Shannon code for $Q$. For a Dirichlet prior with arbitrary $\alpha_i > -1$, $\forall i$, we have
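$$Q(x_1^n) = \int U(x_1^n) \nu(p) \, dp = \int \prod_{i=1}^{k} p_i^{n_i + \alpha_i} \, dp \cdot \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i + k \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i + 1)} = \frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i + k \right)}{\Gamma\left( n + \sum_{i=1}^{k} \alpha_i + k \right)} \prod_{i=1}^{k} \frac{\Gamma(n_i + \alpha_i + 1)}{\Gamma(\alpha_i + 1)}.$$

Using the functional equation $\Gamma(x + 1) = x\Gamma(x)$ we see that $Q(x_1^n)$ is given by the ratio

$$\frac{\prod_{i=1}^{k} \left[ (n_i + \alpha_i)(n_i - 1 + \alpha_i) \cdots (1 + \alpha_i) \right]}{\left( n - 1 + \sum \alpha_i + k \right)\left( n - 2 + \sum \alpha_i + k \right) \cdots \left( \sum \alpha_i + k \right)}$$

or, equivalently,

$$Q(x_1^n) = \prod_{j=1}^{n} \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k},$$

where $n(x_j | x_1^{j-1})$ is the number of occurrences of the symbol $x_j$ in the "past" $x_1^{j-1}$.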
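The equivalence of the Gamma-function form and the sequential product form of $Q(x_1^n)$ can be checked numerically. The sketch below assumes symbols are coded as `0, ..., k-1` and uses the standard-library `math.lgamma`; the function names are illustrative only.

```python
import math
from collections import Counter

def log2_q_sequential(x, alpha):
    """log2 Q(x_1^n) from the product of sequential ratios.

    x is a sequence over {0, ..., k-1}; alpha[i] > -1 are the Dirichlet
    parameters in the parametrization used in the text.
    """
    k = len(alpha)
    total = sum(alpha) + k
    counts = [0] * k
    log_q = 0.0
    for j, symbol in enumerate(x):   # j = number of symbols already seen
        log_q += math.log2((counts[symbol] + 1 + alpha[symbol]) / (j + total))
        counts[symbol] += 1
    return log_q

def log2_q_gamma(x, alpha):
    """log2 Q(x_1^n) from the closed form in terms of Gamma functions."""
    k = len(alpha)
    n = len(x)
    total = sum(alpha) + k
    counts = Counter(x)
    log_q = math.lgamma(total) - math.lgamma(n + total)
    for i in range(k):
        log_q += math.lgamma(counts[i] + alpha[i] + 1) - math.lgamma(alpha[i] + 1)
    return log_q / math.log(2)

x = [0, 2, 1, 0, 0, 2]
alpha = [-0.5, -0.5, -0.5]
print(log2_q_sequential(x, alpha), log2_q_gamma(x, alpha))
```

The two printed values should agree up to floating-point error. The sequential form is the one that is convenient for coding, since it needs only the running counts.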
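Theorem 17 If $Q$ is the Dirichlet mixture defined above with $\alpha_i = -1/2$, $\forall i$, the redundancy always satisfies

$$R(x_1^n) \le \log \frac{\Gamma(n + \frac{k}{2}) \Gamma(\frac{1}{2})}{\Gamma(n + \frac{1}{2}) \Gamma(\frac{k}{2})} = \frac{k - 1}{2} \log n - \log \frac{\Gamma(k/2)}{\Gamma(1/2)} + \epsilon_n,$$

where $\epsilon_n \to 0$ as $n \to \infty$.

Proof. The final equality is a simple consequence of Stirling's formula for the $\Gamma$-function, so it is enough to prove the inequality. For $\alpha_i \equiv -1/2$ we have

$$Q(x_1^n) = \frac{\Gamma(\frac{k}{2})}{\Gamma(n + \frac{k}{2})} \prod_{i=1}^{k} \frac{\Gamma(n_i + \frac{1}{2})}{\Gamma(\frac{1}{2})} = \frac{\prod_{i=1}^{k} \left[ (n_i - \frac{1}{2})(n_i - \frac{3}{2}) \cdots \frac{1}{2} \right]}{(n - 1 + \frac{k}{2})(n - 2 + \frac{k}{2}) \cdots \frac{k}{2}}.$$

Note that, in particular, if the sequence consists of identical symbols, say $\tilde{x}_1^n$ with $\tilde{x}_i \equiv a$, then

$$Q(\tilde{x}_1^n) = \frac{\Gamma(\frac{k}{2}) \Gamma(n + \frac{1}{2})}{\Gamma(n + \frac{k}{2}) \Gamma(\frac{1}{2})},$$

hence to prove the theorem it is enough to show that $R(x_1^n) \le \log(1/Q(\tilde{x}_1^n))$. The simple upper bound $R(x_1^n) \le \log(P_{ML}(x_1^n)/Q(x_1^n))$ then tells us that it is enough to show that

$$P_{ML}(x_1^n) = \prod_{i=1}^{k} \left( \frac{n_i}{n} \right)^{n_i} \le \frac{Q(x_1^n)}{Q(\tilde{x}_1^n)}.$$

The product formula for $Q$ above can then be used to see that it is enough to prove that

$$\prod_{i=1}^{k} \left( \frac{n_i}{n} \right)^{n_i} \le \frac{\prod_{i=1}^{k} \left[ (n_i - \frac{1}{2})(n_i - \frac{3}{2}) \cdots \frac{1}{2} \right]}{(n - \frac{1}{2})(n - \frac{3}{2}) \cdots \frac{1}{2}},$$

which can be converted to

$$\prod_{i=1}^{k} \left( \frac{n_i}{n} \right)^{n_i} \le \frac{\prod_{i=1}^{k} \left[ 2n_i (2n_i - 1) \cdots (n_i + 1) \right]}{2n(2n - 1) \cdots (n + 1)},$$

since

$$\left( n - \tfrac{1}{2} \right)\left( n - \tfrac{3}{2} \right) \cdots \tfrac{1}{2} = \frac{1}{n!} \left[ n \left( n - \tfrac{1}{2} \right)(n - 1)\left( n - \tfrac{3}{2} \right) \cdots 1 \cdot \tfrac{1}{2} \right] = \frac{(2n)!}{2^{2n} n!} = \frac{2n(2n - 1) \cdots (n + 1)}{2^{2n}},$$

and the powers of 2 cancel because $\sum_i n_i = n$.

At last we have arrived at the assertion we shall prove, namely, the last displayed inequality. Both of its sides are products of $n$ factors, so it will be proved if we show that it is possible to assign to each $\ell = 1, \ldots, n$, in a one-to-one manner, a pair $(i, j)$, $1 \le i \le k$, $1 \le j \le n_i$, such that

$$\frac{n_i}{n} \le \frac{n_i + j}{n + \ell}.$$

Now, for any given $\ell$ and $i$, this inequality holds iff $j \ge n_i \ell / n$. Hence the number of those $1 \le j \le n_i$ that satisfy it is greater than $n_i - n_i \ell / n$, and the total number of pairs $(i, j)$, $1 \le i \le k$, $1 \le j \le n_i$, satisfying it is greater than

$$\sum_{i=1}^{k} \left( n_i - \frac{n_i \ell}{n} \right) = n - \ell.$$

It follows that if we assign to $\ell = n$ any $(i, j)$ satisfying the inequality (i.e., $i$ may be chosen arbitrarily and $j = n_i$), and then recursively assign to each $\ell = n - 1, n - 2$, etc., a pair $(i, j)$ satisfying the inequality that was not assigned previously, we never get stuck; at each step there will be at least one "free" pair $(i, j)$ (because the total number of pairs $(i, j)$ satisfying the inequality is greater than $n - \ell$, the number of pairs already assigned). This completes the proof of the theorem.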
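For small $n$ and $k$ the inequality established in the proof can be confirmed by brute force over all count vectors $(n_1, \ldots, n_k)$. The following is a minimal sketch (function names are illustrative; logarithms are base 2). It also compares the exact bound with the asymptotic expression $\frac{k-1}{2}\log n - \log(\Gamma(k/2)/\Gamma(1/2))$.

```python
import math
from itertools import product

def kt_log2_prob(counts):
    """log2 Q(x_1^n) for the mixture with alpha_i = -1/2 (depends only on counts)."""
    k, n = len(counts), sum(counts)
    log_q = math.lgamma(k / 2) - math.lgamma(n + k / 2)
    log_q += sum(math.lgamma(c + 0.5) - math.lgamma(0.5) for c in counts)
    return log_q / math.log(2)

def ml_log2_prob(counts):
    """log2 P_ML(x_1^n) = sum_i n_i log2(n_i / n), with 0 log 0 = 0."""
    n = sum(counts)
    return sum(c * math.log2(c / n) for c in counts if c > 0)

def theorem_bound(n, k):
    """log2 of Gamma(n + k/2) Gamma(1/2) / (Gamma(n + 1/2) Gamma(k/2))."""
    value = (math.lgamma(n + k / 2) + math.lgamma(0.5)
             - math.lgamma(n + 0.5) - math.lgamma(k / 2))
    return value / math.log(2)

def worst_case_gap(n, k):
    """Max over count vectors of log2(P_ML / Q) minus the bound."""
    worst = -math.inf
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) == n:
            gap = ml_log2_prob(counts) - kt_log2_prob(counts) - theorem_bound(n, k)
            worst = max(worst, gap)
    return worst

for n in (1, 5, 12):
    for k in (2, 3):
        print(n, k, worst_case_gap(n, k))
```

The printed gaps should be zero up to rounding, since the bound is attained by sequences of identical symbols and, by the theorem, exceeded by none.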
Our next goal is to show that the result of the preceding theorem is "best possible," even if we do not insist on a uniformly small redundancy (i.e., on a bound valid for every $x_1^n$), but want only the average redundancy $E(R)$ to be small. Consider any prefix code. Without loss of generality (for the purpose of bounding the redundancy) we may assume that it satisfies the Kraft inequality with the equality sign, and therefore that it is a Shannon code with respect to some $Q$ (not necessarily of mixture type). Then

$$E(R(X_1^n)) = E \log \frac{P(X_1^n)}{Q(X_1^n)} = D(P^n\|Q).$$

Since $P$ is unknown, we want to select $Q$ in such a way that no matter what $P$ is the average redundancy will be small, that is, we want $Q$ to minimize

$$\sup_P E_P(R_P(X_1^n)) = \sup_P D(P^n\|Q).$$

Suppose we choose $P$ at random with prior distribution $\nu$; then the observation of $x_1^n$ provides information about the unknown $P$, measured by the mutual information

$$I(\nu) = H(Q_\nu) - \int H(P^n) \, \nu(dP) = H(Q_\nu) - n \int H(P) \, \nu(dP),$$

where $Q_\nu$ is the mixture distribution, $Q_\nu(x_1^n) = \int P(x_1^n) \, \nu(dP)$. Even though this mutual information appears to be unrelated to the previous average redundancy, the remarkable fact is that

$$\inf_Q \sup_P D(P^n\|Q) = \sup_\nu I(\nu).$$
Indeed, the following lemma holds in general.
Lemma 5 Consider any noisy channel with input alphabet $U = \{1, \ldots, \ell\}$ and output alphabet $V = \{1, \ldots, m\}$, given by the probability distributions $P_i$ on $V$ governing the output if the input is $i$, $i = 1, \ldots, \ell$. For any input distribution $\pi$, let $Q_\pi$ denote the output distribution and let

$$I(\pi) = \sum_{i,j} \pi(i) P_i(j) \log \frac{P_i(j)}{Q_\pi(j)} = \sum_i \pi(i) D(P_i\|Q_\pi)$$

be the mutual information between input and output. Then

$$\max_\pi I(\pi) = \min_Q \max_{1 \le i \le \ell} D(P_i\|Q).$$

Proof. The left-hand side is known as the channel capacity. The lemma states that it equals the "radius" of the smallest "divergence ball" that contains all the $P_i$'s. To establish this relation first note that for any distribution $Q$ on $V$,

$$I(\pi) = \sum_{i=1}^{\ell} \sum_{j=1}^{m} \pi(i) P_i(j) \left[ \log \frac{P_i(j)}{Q(j)} + \log \frac{Q(j)}{Q_\pi(j)} \right] = \sum_{i=1}^{\ell} \pi(i) D(P_i\|Q) - D(Q_\pi\|Q).$$

This identity shows that for any fixed $\pi$

$$\min_Q \sum_{i=1}^{\ell} \pi(i) D(P_i\|Q) = I(\pi),$$

and hence

$$\max_\pi I(\pi) = \max_\pi \min_Q \sum_{i=1}^{\ell} \pi(i) D(P_i\|Q).$$
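The identity used in the proof, $I(\pi) = \sum_i \pi(i) D(P_i\|Q) - D(Q_\pi\|Q)$, is easy to confirm numerically. The sketch below uses a hypothetical two-input, three-output channel, and also checks on a binary symmetric channel (an assumed example, with crossover probability 0.1) that the uniform input makes all $D(P_i\|Q_\pi)$ equal to $I(\pi)$, as the lemma requires at the optimum.

```python
import math

def divergence(p, q):
    """D(p||q) in bits for two probability vectors of equal length."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def output_distribution(pi, channel):
    """Q_pi(j) = sum_i pi(i) P_i(j); channel[i] is the row P_i."""
    m = len(channel[0])
    return [sum(pi[i] * channel[i][j] for i in range(len(pi))) for j in range(m)]

def mutual_information(pi, channel):
    """I(pi) = sum_i pi(i) D(P_i || Q_pi)."""
    q_pi = output_distribution(pi, channel)
    return sum(w * divergence(row, q_pi) for w, row in zip(pi, channel))

def average_divergence(pi, channel, q):
    """sum_i pi(i) D(P_i || q) for an arbitrary output distribution q."""
    return sum(w * divergence(row, q) for w, row in zip(pi, channel))

# Hypothetical channel, input distribution and comparison distribution.
CHANNEL = [[0.8, 0.1, 0.1],
           [0.2, 0.6, 0.2]]
PI = [0.3, 0.7]
Q = [0.5, 0.25, 0.25]

lhs = mutual_information(PI, CHANNEL)
rhs = (average_divergence(PI, CHANNEL, Q)
       - divergence(output_distribution(PI, CHANNEL), Q))
print(lhs, rhs)
```

Since $D(Q_\pi\|Q) \ge 0$, the average divergence is minimized over $Q$ exactly at $Q = Q_\pi$, which is how the identity is used in the text.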
on a finite alphabet $A = \{1, \ldots, k\}$. Then $\theta$ may be identified with the vector of the probabilities $(P_1, \ldots, P_k)$, and since these form a $(k - 1)$-dimensional subspace we get that (??) holds for $k$ replaced by $k - 1$, thus proving that the universal codes constructed in the preceding section have asymptotically optimal redundancy.
Our results extend beyond the i.i.d. case; in particular they extend to the Markov case. A Markov chain with transition matrix $P(j|i)$, $1 \le i, j \le k$, is given by the joint distributions

$$\mathrm{Prob}(X_t = i_t, \ 0 \le t \le n) = P(i_0) \prod_{t=1}^{n} P(i_t | i_{t-1}).$$

We will suppose that the initial state $i_0$ is fixed, so that we can rewrite these probabilities in the form

$$\mathrm{Prob}(X_t = i_t, \ 0 \le t \le n) = \prod_{i=1}^{k} \prod_{j=1}^{k} P(j|i)^{n(i,j)}, \qquad (23)$$

where $n(i, j)$ is the number of times the pair $i, j$ occurs in adjacent places in $x_0^n$. Further, let $n(i) = \sum_j n(i, j)$ denote the number of occurrences of $i$ in the block $x_0^{n-1}$ and note that the probability in (23) is maximized for $\hat{P}(j|i) = n(i, j)/n(i)$, that is,

$$P_{ML}(x_0^n) = \prod_{i=1}^{k} \prod_{j=1}^{k} \hat{P}(j|i)^{n(i,j)}.$$
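The transition counts $n(i, j)$ and the maximum likelihood quantities can be computed directly from a sample path. This is a minimal sketch, assuming the path is given as a Python list that includes the fixed initial state as its first entry; the function names are illustrative.

```python
import math
from collections import Counter

def transition_counts(x):
    """n(i, j): number of times the pair (i, j) occurs in adjacent places of x."""
    return Counter(zip(x, x[1:]))

def ml_transition_matrix(x):
    """Maximum likelihood estimates P_hat(j|i) = n(i, j) / n(i), as a dict."""
    pair = transition_counts(x)
    state = Counter()
    for (i, _), count in pair.items():
        state[i] += count            # n(i) = sum_j n(i, j)
    return {(i, j): count / state[i] for (i, j), count in pair.items()}

def log2_ml_probability(x):
    """log2 P_ML(x_0^n) = sum over (i, j) of n(i, j) log2 P_hat(j|i)."""
    pair = transition_counts(x)
    p_hat = ml_transition_matrix(x)
    return sum(count * math.log2(p_hat[ij]) for ij, count in pair.items())

x = [0, 0, 1, 0, 1, 1, 0]
print(ml_transition_matrix(x))
print(log2_ml_probability(x))
```

Only pairs that actually occur are stored, which matches the convention that factors with $n(i, j) = 0$ contribute 1 to the product.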
By analogy with the i.i.d. case we introduce the mixture distribution

$$Q(x_1^n) = \prod_{i=1}^{k} \int \prod_{j=1}^{k} P(j|i)^{n(i,j)} \, \nu(P(\cdot|i)) \, dP,$$

where $\nu$ is the Dirichlet prior with $\alpha_i \equiv -1/2$. Thus $Q(x_1^n)$ is given by the product

$$\prod_{i=1}^{k} \left[ \left( \prod_{j=1}^{k} \frac{\Gamma(n(i, j) + 1/2)}{\Gamma(1/2)} \right) \frac{\Gamma(k/2)}{\Gamma(n(i) + k/2)} \right],$$

which is, in turn, equal to the product

$$\prod_{i=1}^{k} \frac{\prod_{j=1}^{k} (n(i, j) - 1/2)(n(i, j) - 3/2) \cdots (1/2)}{(n(i) - 1 + k/2)(n(i) - 2 + k/2) \cdots (k/2)}.$$
The redundancy of the code based on the above auxiliary distribution can be bounded using the corresponding i.i.d. result. It follows that

$$R(x_1^n) \le \sum_{i=1}^{k} \left[ \frac{k - 1}{2} \log n(i) + \text{constant} \right] \le \frac{k(k - 1)}{2} \log n + \text{constant}.$$

Again, this result is asymptotically best possible (up to the constant term). Indeed, on account of Rissanen's theorem, even the average redundancy cannot be made significantly smaller than $(k(k - 1)/2) \log n$ on a set of positive Lebesgue measure in the parameter space needed to describe Markov chain probabilities.
Rissanen has provided an interesting application of his theorem to a special class of processes, which we will call the chains with finite context. A process has finite context if there is a positive integer $m$ and a function $f : A^m \to S$ (where $S$ is some finite set) such that

$$\mathrm{Prob}(X_t = i_t, \ 0 \le t \le n) = P(i_0) \prod_{t=1}^{n} P(i_t | f(i_{t-m}, \ldots, i_{t-1})),$$

where it is assumed here that $i_{-m+1}, \ldots, i_0$ is fixed. The elements of $S$ are called "contexts" or "states" and $P(i|\ell)$ is interpreted as the "probability of the symbol $i$ in the context $\ell$." Of course, any source that is Markov of order $m$ has finite context, and conversely; the context idea emphasizes that the probability of occurrence of the next symbol may depend on something much simpler than the entire past of length $m$, namely $|S|$ may be much smaller than $|A^m| = k^m$, and it would be nice to take advantage of this fact in coding.

To obtain optimal bounds for processes with finite context we need make only a few changes in our preceding discussion. Let us fix $S$ and $f$ and let $n(i, \ell)$, $1 \le i \le k$, $\ell \in S$, denote the number of times the pair $(i, \ell)$ occurs among the pairs $(i_t, s_{t-1})$, where $s_{t-1} = f(i_{t-m}, \ldots, i_{t-1})$, for $1 \le t \le n$. We then have

$$P(x_1^n) = \prod_{\ell \in S} \prod_{i=1}^{k} P(i|\ell)^{n(i,\ell)}$$

and the maximum likelihood probabilities

$$P_{ML}(x_1^n) = \prod_{\ell \in S} \prod_{i=1}^{k} \hat{P}(i|\ell)^{n(i,\ell)}, \qquad \hat{P}(i|\ell) = \frac{n(i, \ell)}{n(\ell)}, \qquad n(\ell) = \sum_i n(i, \ell).$$
Again, as in the i.i.d. case, an asymptotically optimal universal code is the one based on the auxiliary mixture distribution (with the Dirichlet prior), as follows,

$$Q(x_1^n) = \prod_{\ell \in S} \int \prod_{i=1}^{k} P(i|\ell)^{n(i,\ell)} \, \nu(P(\cdot|\ell)) \, dP,$$

which is equal to

$$\prod_{\ell \in S} \frac{\prod_{i=1}^{k} (n(i, \ell) - 1/2)(n(i, \ell) - 3/2) \cdots (1/2)}{(n(\ell) - 1 + k/2)(n(\ell) - 2 + k/2) \cdots (k/2)},$$

and this code has redundancy

$$R(x_1^n) \le \sum_{\ell \in S} \left[ \frac{k - 1}{2} \log n(\ell) + \text{constant} \right] \le |S| \, \frac{k - 1}{2} \log n + \text{constant}.$$

Furthermore, Rissanen's theorem implies that even the average redundancy cannot be substantially smaller than the above bound, for any universal code, except possibly for a vanishingly small set of parameter values, i.e., matrices $P(i|\ell)$.
Remark 9 Before leaving this topic of redundancy bounds let us mention an aspect of our discussion which has some practical value in designing codes. In the preceding section we derived the following formula (see (??)), valid for the i.i.d. case:

$$Q(x_1^n) = \prod_{j=1}^{n} \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k},$$

where $n(x_j | x_1^{j-1})$ is the number of occurrences of the symbol $x_j$ in the "past" $x_1^{j-1}$. This formula suggests the conditional probabilities

$$Q(x_j | x_1^{j-1}) = \frac{n(x_j | x_1^{j-1}) + 1 + \alpha_{x_j}}{j - 1 + \sum \alpha_i + k}.$$

The latter formula can be used as the specification of the conditional probabilities used in arithmetic coding, a (practical) sequential procedure that yields the same asymptotics as the Shannon coding procedure. Likewise, the Markov discussion in this section leads to the conditional formula

$$Q(i_k | i_1, \ldots, i_{k-1}) = \frac{n_{k-1}(i, j) + 1/2}{n_{k-1}(i) + k/2}, \quad \text{if } i_{k-1} = i, \ i_k = j,$$

where $n_{k-1}(i, j)$ is the number of consecutive $(i, j)$'s in the sequence $i_0^{k-1}$ and $n_{k-1}(i) = \sum_j n_{k-1}(i, j)$. (In these formulas the subscript $k$ is a time index, whereas the $k$ in $k/2$ is the alphabet size.) These conditional probabilities are easily evaluated, because only simple updating is needed to go from $k - 1$ to $k$; arithmetic coding can then be performed. The corresponding finite context formula is

$$Q(i_k | i_1, \ldots, i_{k-1}) = \frac{n_{k-1}(j, \ell) + 1/2}{n_{k-1}(\ell) + k/2}, \quad \text{if } s_{k-1} = \ell, \ i_k = j.$$

These can then be used to do arithmetic coding in the finite context case; such coding will also yield the same asymptotics as the Shannon code. Rissanen's theorem implies that even the average redundancy cannot be substantially smaller than the bound above, for any universal code, except possibly for a vanishingly small set of parameters (i.e., matrices $P(i|\ell)$).
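The sequential update described in the remark can be sketched as follows. The code is an illustration under stated assumptions: symbols are coded as `0, ..., k-1`, the context function `context` and the initial state are supplied by the caller (both hypothetical names), and the first-order Markov case corresponds to taking the previous symbol as the context. The arithmetic coder itself is not shown; only the probabilities it would consume are computed, and their product is compared with the closed-form Gamma product.

```python
import math
from collections import defaultdict

def sequential_log2_prob(x, k, context, initial_state):
    """Sum of log2 of the conditional probabilities
    (n(j, s) + 1/2) / (n(s) + k/2), where s is the current context."""
    pair = defaultdict(int)     # pair[(s, j)]: times symbol j followed context s
    visits = defaultdict(int)   # visits[s]: times context s has been used
    state = initial_state
    past = []
    log_q = 0.0
    for symbol in x:
        log_q += math.log2((pair[(state, symbol)] + 0.5) / (visits[state] + k / 2))
        pair[(state, symbol)] += 1    # simple updating of the counts
        visits[state] += 1
        past.append(symbol)
        state = context(past)
    return log_q, pair, visits

def product_log2_prob(pair, visits, k):
    """The same quantity from the closed-form product of Gamma ratios."""
    log_q = 0.0
    for n_s in visits.values():
        log_q += math.lgamma(k / 2) - math.lgamma(n_s + k / 2)
    for count in pair.values():
        log_q += math.lgamma(count + 0.5) - math.lgamma(0.5)
    return log_q / math.log(2)

def last_symbol(past):
    """First-order Markov context: the state is the previous symbol."""
    return past[-1]

x = [0, 1, 1, 0, 1, 1, 1, 0]
log_q, pair, visits = sequential_log2_prob(x, 2, last_symbol, 0)
print(log_q, product_log2_prob(pair, visits, 2))
```

A coarser context function (one that maps several pasts to the same state) can be passed in place of `last_symbol` without changing the rest of the code, which is the practical appeal of the finite context formulation.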
The scaling formula

$$P^*(a) = c_i Q(a), \quad a \in B_i, \quad \text{where } c_i = \frac{\alpha_i}{Q^B(i)},$$

see (??), can be proved as follows. First, lumping does not increase divergence, that is,

$$D(P\|Q) \ge D(P^B\|Q^B).$$

The condition that $P \in L$ is equivalent to the condition that $P(B_i) = \alpha_i$, $\forall i$. If $P^*(a) = \alpha_i Q(a)/Q(B_i)$, $a \in B_i$, then

$$\sum_a P^*(a) \log \frac{P^*(a)}{Q(a)} = \sum_i \sum_{a \in B_i} \frac{\alpha_i Q(a)}{Q(B_i)} \log \frac{\alpha_i}{Q(B_i)} = D(\alpha\|Q^B), \qquad \alpha = (\alpha_1, \ldots, \alpha_k).$$

Thus, if $P \in L$ then

$$D(P\|Q) \ge D(P^B\|Q^B) = D(\alpha\|Q^B),$$

which establishes (??).
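The scaling formula can be checked on a small example. In the sketch below the distribution `Q`, the partition `BLOCKS`, the prescribed block probabilities `ALPHA` and the competitor `P_OTHER` are all hypothetical values chosen for illustration.

```python
import math

def divergence(p, q):
    """D(p||q) in bits for two probability vectors of equal length."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def lump(p, blocks):
    """Lumped distribution: the probability of each block of the partition."""
    return [sum(p[a] for a in block) for block in blocks]

def scaled_projection(q, blocks, alpha):
    """P*(a) = alpha_i Q(a) / Q(B_i) for a in B_i."""
    p_star = [0.0] * len(q)
    for block, alpha_i in zip(blocks, alpha):
        q_block = sum(q[a] for a in block)
        for a in block:
            p_star[a] = alpha_i * q[a] / q_block
    return p_star

Q = [0.1, 0.2, 0.3, 0.4]
BLOCKS = [[0, 1], [2, 3]]
ALPHA = [0.5, 0.5]

P_STAR = scaled_projection(Q, BLOCKS, ALPHA)
P_OTHER = [0.4, 0.1, 0.25, 0.25]   # another member of L: same block probabilities

print(divergence(P_STAR, Q), divergence(ALPHA, lump(Q, BLOCKS)))
print(divergence(P_OTHER, Q))
```

The first two printed values should coincide, and the third should be no smaller, in line with the lumping inequality.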
The chi-square function was defined on page ??. In the case when $P = \hat{P}$, the empirical distribution, the formula can be rewritten as follows:

$$\chi^2(\hat{P}, Q) = \sum_a \frac{(\hat{P}(a) - Q(a))^2}{Q(a)} = \frac{1}{n} \sum_a \frac{(n\hat{P}(a) - nQ(a))^2}{nQ(a)} = \frac{1}{n} \chi^2_{k-1},$$

where $\chi^2_{k-1} = \sum_a \frac{(n\hat{P}(a) - nQ(a))^2}{nQ(a)}$ is Pearson's classical chi-square function. Here $n\hat{P}(a)$ gives the observed count, while $nQ(a)$ gives the expected count of the number of appearances of $a$.
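The factor $1/n$ relating the two forms is easy to confirm on a small table of counts. The counts and the reference distribution below are hypothetical.

```python
def chi_square_distance(p_hat, q):
    """chi^2(P_hat, Q) = sum_a (P_hat(a) - Q(a))^2 / Q(a)."""
    return sum((p_hat[a] - q[a]) ** 2 / q[a] for a in q)

def pearson_statistic(counts, q):
    """Pearson's statistic: sum_a (observed - expected)^2 / expected."""
    n = sum(counts.values())
    return sum((counts.get(a, 0) - n * q[a]) ** 2 / (n * q[a]) for a in q)

counts = {"a": 18, "b": 22, "c": 20}          # observed counts n * P_hat(a)
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}      # reference distribution Q
n = sum(counts.values())
p_hat = {a: c / n for a, c in counts.items()}

print(chi_square_distance(p_hat, q), pearson_statistic(counts, q) / n)
```

The two printed numbers should agree up to floating-point error.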