Communication Theory, Lecture Notes- Maths - Prof Alison Etheridge, Study notes of Communications Engineering

University of London (Lond)Communications Engineering

Introduction source coding channel capacity and noisy coding

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

Lecture Notes for Communication Theory

February 15, 2011

Please let me know of any errors, typos, or poorly expressed arguments. And please do not reproduce or distribute this document outside Oxford University.

David Stirzaker.

Contents

1 Introduction
- 1.1 Basics
- 1.2 Coding
- 1.3 Source and channel
- 1.4 Entropy
- 1.5 Typicality
- 1.6 Shannon’s first theorem: noiseless (or source) coding
- 1.7 Information
- 1.8 Shannon’s second theorem. Noisy (or channel) coding.
- 1.9 Differential entropy
- 1.10 Interpretation of entropy and information
2 Source coding
- 2.1 Compact symbol codes
- 2.2 Prefix codes
- 2.3 The entropy bound for noiseless coding
- 2.4 Optimality: Huffman codes
- 2.5 Other prefix codes
3 Channel capacity and noisy coding
- 3.1 Introduction: basic channels
- 3.2 Symmetric channels
- 3.3 Special channels
- 3.4 Concavity of H and I
- 3.5 Fano’s inequality and the NCT converse
- 3.6 The noisy coding theorem for the BSC
- 3.7 Another interpretation of entropy and information

1 Introduction

This section outlines the basic problems that communication theory has been developed to solve, introduces the main ideas that are used to deal with those problems, and sketches their solutions. In other words, it amounts to a preview and synopsis of the rest of the course.

1.1 Basics

The principal problem of communication is to accept a message at some point, and then reproduce it as efficiently, reliably, and securely as possible at some other point. The first step in this is taken by the engineers who make a suitable mechanism M for transferring the message, in the form of a signal of some kind; this mechanism will be called the channel whatever its actual physical form, which may be a wire, fibre, disc, aerial and receiver, tape, book and so on. The two ends of the channel are called the source and the receiver (which may be the same if a message is stored for retrieval) and may be separated in time or space, or both. Communication theory begins at this stage, by supposing that there exists such a channel that is imperfect in the senses of being finite, and noisy or insecure, or all three of these. If the channel were capable of transmitting an unbounded number of symbols, arbitrarily quickly, with any desired degree of accuracy, and in total privacy, then no further work would be needed; but no such channels exist. So, as mentioned above, the theorist has three tasks:

1. We seek to transmit the message efficiently, by which we mean that the signal should make the minimum use of the channel required to convey the message to the receiver. Typically, channels cost money to use, or have competing demands for their time, or both.
2. We want the message to be sent reliably, by which we mean that ideally the receiver gets exactly the original message, with no errors. More realistically, we may ask only that the message arrives with an arbitrarily small chance of an error.
3. We may often wish our message to be private, that is to say secret, from others. By this we mean that a spy who records the signals passing through the channel will be no nearer to knowing what the original message is. Sometimes it is sufficient to require only that working out what the original message was (though possible for the enemy) would need an impractical amount of effort. Other aspects of privacy that might be desired include the receiver being confident that the enemy cannot alter the message (or substitute an entirely fresh one), and being able to prove the authenticity and integrity of the received signal to somebody else.

These are strong requirements, but a remarkable sequence of ideas initiated by Claude Shannon, and much developed since, has shown that they are achievable; at least up to a certain well-defined level in each case. The key idea and technique which makes all this possible is that of coding, which we now consider.

1.2 Coding

[...] alphabet $A$. A message of length $n$ is denoted by $\mathbf{x} \in A^n$, where $A^n$ is the set of all strings of length $n$ of symbols from $A$. The set of all finite strings of symbols from $A$ is denoted by $A^*$.

Definition: a code $c(\cdot)$ (or encoding, or code function) for the source $S$ is a mapping from $A^*$ to the set $B^*$ of finite-length strings from an alphabet $B$, which may be called codewords. [If the mapping is to $B^m$, for some $m$, then the code is said to be a block code.] Formally

$$c(\cdot) : x \in A^* \to c(x) \in B^*.$$

In addition, there is a decoder $d(\cdot)$ which maps $B^*$ to the set of possible messages. The length of the codeword $c(x)$ is denoted by $|c(x)|$. For efficiency we would like $|c(x)|$ to be small in the long run; for reliability we would like $d(c(x)) = x$ as often as possible; for secrecy we wish any enemy who knows $c(x)$ not to be able to identify $x$ in general. Briefly, the core of communication theory is devising good codes.

Among the various properties that good codes might have, this one is clearly almost essential.

Definition: a code $c(\cdot)$ is uniquely decipherable if the concatenation $c(x_1)c(x_2)\ldots c(x_n)$ of codewords of symbols (or messages) from $S$ is the image of (corresponds to) at most one sequence $x_1 \ldots x_n$.

An important class of uniquely decipherable codes is this:

Definition: a code is a prefix (or instantaneous) code if no codeword is the prefix of another. [Which is to say that we cannot add letters after some $c(x)$ to get another codeword $c(y) = c(x)b_1 \ldots b_m$.]

Example. Telephone numbers are a prefix code.

The mathematical codes defined above are crucial in communication theory, but the broader concept of coding is of much wider application. For example, the sequence of bases in DNA encodes a number of physical attributes of the individual in question. For more wide-ranging applications we note that musical notation encodes the music itself. Maps encode various features of the surface of the earth. Plans and elevations encode buildings. After some thought, you will realise that speech encodes your thoughts, and writing encodes your speech. This in turn can be given in Morse code. The ultimate conclusion of this process is a binary encoding, which is a string of symbols using an alphabet of just two symbols $\{0, 1\}$. After a little more thought you may agree that anything of practical interest in communication must be capable of encoding as a string of symbols.
[You may care to recall Wittgenstein’s remark: “whereof one cannot speak, thereof one must be silent”.]
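The prefix property and the instantaneous decoding it allows can be tried out in a few lines. The following is a minimal sketch, not part of the original notes: the function names `is_prefix_code` and `decode_prefix`, and the four-letter example code, are illustrative assumptions.

```python
from itertools import combinations


def is_prefix_code(codewords):
    """Return True if the codewords are distinct and none is a prefix of another."""
    words = list(codewords)
    if len(set(words)) != len(words):
        return False
    return not any(
        u.startswith(v) or v.startswith(u) for u, v in combinations(words, 2)
    )


def decode_prefix(code, signal):
    """Decode a concatenation of codewords, emitting a symbol as soon as
    the buffered signal matches a codeword (instantaneous decoding)."""
    inverse = {word: symbol for symbol, word in code.items()}
    message, buffer = [], ""
    for letter in signal:
        buffer += letter
        if buffer in inverse:
            message.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("signal does not end on a codeword boundary")
    return message


# Hypothetical binary prefix code for the source alphabet {a, b, c, d}.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
signal = "".join(code[s] for s in "badcab")
decoded = decode_prefix(code, signal)
```

Because no codeword is a prefix of another, the decoder never has to look ahead: the first match is the only possible match, which is why such codes are uniquely decipherable.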

1.3 Source and channel

The message to be communicated is supplied by the source, about whose nature we need not be specific, but it has three key properties. First, by what we have said above, we can assume that the message comprises a finite string of symbols. [For if it were not, we would simply encode it as such.] Secondly, the message is to be selected from a set of possible messages (which we shall assume to be finite, for simplicity). And thirdly we are uncertain about what the message is to be, because if it were known in advance, it would be unnecessary to send it. It is therefore natural to regard the output of a source as a random sequence of symbols, which we refer to as random variables and vectors (with a slight abuse of the convention that these shall be real-valued).

Definition: a discrete source comprises a sequence of random variables $X_1, X_2, \ldots$ taking values in a finite alphabet $A$. Any finite string is a message. If the $X_r$ are independent and identically distributed, then the source is said to be memoryless, and we can write $P(X_r = x) = p(x)$ for all $r$.

At this stage, we shall assume that sources are discrete and memoryless. [Of course, many real sources do not have these properties, but the ideas and methods that we shall develop in this simple case can be generally extended to deal with more complicated sources.] Thus the probability that the source emits a message $\mathbf{x} = (x_1, \ldots, x_n) \in A^n$ is

$$P(\mathbf{X} = \mathbf{x}) = P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n) = \prod_{r=1}^{n} p(x_r)$$

by the independence. This is encoded as a signal to enter a channel:

Definition: given an alphabet $B$ of possible input symbols, and an alphabet $D$ of possible output symbols, a discrete channel is a family of conditional distributions $p(y \mid x)$, $x \in B$, $y \in D$. This array is called the channel matrix and denoted by $M$, so that for input $X$ and output $Y$

$$M = \big(P(Y = y \mid X = x)\big) = \big(p(y \mid x)\big).$$

Since $\sum_y p(y \mid x) = 1$, $M$ is a stochastic matrix. It may be square, and it may be doubly stochastic (i.e., $\sum_x p(y \mid x) = 1$), but not usually. More generally the $r$th extension of the channel is the family of conditional joint distributions of $r$ uses of $M$, given the input $(x_1, \ldots, x_r) = \mathbf{x}$:

$$p(y_1, \ldots, y_r \mid x_1, \ldots, x_r) = P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}).$$

The channel is said to be memoryless, and denoted by DMC, if

$$p(y_1, \ldots, y_r \mid x_1, \ldots, x_r) = \prod_{i=1}^{r} p(y_i \mid x_i).$$

Thus uses of the channel are conditionally independent, given the input. We shall always assume this. We note two extreme cases:

(a) If Y = X, so that p(y|x) = 1, whenever x = y ∈ B, (and p(y|x) is 0 otherwise) then the channel is perfect, or noiseless.

(b) If p(y|x) does not depend on x, (i.e. p(y|x) = p(y) for all x), then the output is pure noise, independent of the input, and the channel is useless.
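As a small illustration of these definitions, the sketch below stores a channel matrix as a list of rows, checks that it is stochastic, and evaluates the memoryless extension as a product. The names `is_stochastic` and `extension_prob`, and the three example matrices, are assumptions made for this example only.

```python
from itertools import product
from math import isclose, prod


def is_stochastic(matrix):
    """Each row p(. | x) must be a probability distribution over the outputs."""
    return all(
        all(v >= 0 for v in row) and isclose(sum(row), 1.0) for row in matrix
    )


def extension_prob(matrix, xs, ys):
    """p(y_1..y_r | x_1..x_r) for a memoryless channel: the product of p(y_i | x_i)."""
    return prod(matrix[x][y] for x, y in zip(xs, ys))


# matrix[x][y] = p(y | x) for binary input and output alphabets {0, 1}.
noisy = [[0.9, 0.1], [0.1, 0.9]]    # flips each symbol with probability 0.1
perfect = [[1.0, 0.0], [0.0, 1.0]]  # extreme case (a): Y = X
useless = [[0.3, 0.7], [0.3, 0.7]]  # extreme case (b): rows do not depend on x

xs = (0, 1, 1)
one_output = extension_prob(noisy, xs, (0, 1, 0))
# The third extension is itself a channel: its rows must also sum to 1.
row_total = sum(extension_prob(noisy, xs, ys) for ys in product(range(2), repeat=3))
```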

1.4 Entropy

[...] entropy called bits. Thus the flip of a fair coin has unit entropy, because we chose to take logarithms to base 2.

Lemma.

$$H(\mathbf{X}) = \sum_{r=1}^{n} H(X_r) = nH(X)$$

for the discrete memoryless source.

Proof.

$$H(\mathbf{X}) = -E \log p(\mathbf{X}) = -E \log \prod_{r=1}^{n} p(X_r) = -E \sum_{r=1}^{n} \log p(X_r) = \sum_{r=1}^{n} H(X_r).$$

Likewise it is shown that if $X$ and $Y$ are independent then

$$H(X, Y) = H(X) + H(Y).$$
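The additivity lemma is easy to check numerically. The sketch below assumes entropy in bits, $H = -\sum p \log_2 p$, in line with the remark about base-2 logarithms; the helper names `entropy` and `message_dist` and the three-letter source are illustrative assumptions.

```python
from itertools import product
from math import isclose, log2


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def message_dist(letter_dist, n):
    """Distribution of a length-n message from a memoryless source."""
    dist = {}
    for letters in product(letter_dist, repeat=n):
        prob = 1.0
        for a in letters:
            prob *= letter_dist[a]  # independence: probabilities multiply
        dist[letters] = prob
    return dist


coin = {"H": 0.5, "T": 0.5}
source = {"a": 0.5, "b": 0.25, "c": 0.25}

h_letter = entropy(source)                    # entropy of one letter
h_message = entropy(message_dist(source, 4))  # entropy of a 4-letter message
```

Here `h_message` agrees with `4 * h_letter` up to rounding, which is the statement $H(\mathbf{X}) = nH(X)$ for $n = 4$.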

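Lemma: $H(X) = 0$ if and only if $X$ is a constant with probability 1.

Proof. Every term $-p_X(x) \log p_X(x)$ in the sum is non-negative, so $H(X) = 0$ iff each term is zero, that is, iff either $p_X(x) = 0$ or $p_X(x) = 1$ for every $x$. Since the probabilities sum to 1, there must be just one $x$ with the second property.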

Lemma. Let $c(\cdot)$ be an invertible (uniquely decipherable) encoding of $X$. Then the entropy of $c(X)$ is the same as that of $X$. We interpret this as the important result: an invertible code neither increases uncertainty nor loses information.

Proof. Let $Y = c(X)$. Then (in an obvious notation)

$$\begin{aligned}
H(Y) = -\sum_y p_Y(y) \log p_Y(y) &= -\sum_y P(c(X) = y) \log P(c(X) = y) \\
&= -\sum_y P(X = c^{-1}(y)) \log P(X = c^{-1}(y)) \\
&= H(X),
\end{aligned}$$

using the unique decipherability.

We turn from these useful lemmas to an important result, which we shall use often.

Theorem. Gibbs inequality. Let $X$ have distribution $p(x)$, and let $q(x)$, $x \in A$, be any other probability distribution on the same alphabet as $p(x)$. Then the function $G(q) = -E \log q(X)$ is uniquely minimized, over all choices of $q$, at $q = p$, where its value is $H(X)$. That is to say, for any distributions $p$ and $q$ on $A$

$$H(X) = -\sum_x p(x) \log p(x) \le -\sum_x p(x) \log q(x)$$

with equality if and only if $p(x) = q(x)$, $x \in A$.

Proof of Gibbs’ inequality. We give two proofs. For the first, recall Jensen’s inequality for a strictly convex function $u$ of a random variable $X$; that is: $E u(X) \ge u(EX)$, with equality iff $X$ is constant, so that $X = EX$. Now $u(x) = -\log x$ is strictly convex for $x > 0$. Therefore, letting $X$ have distribution $p(x)$,

$$\sum_x p(x) \log \frac{p(x)}{q(x)} = E\left[-\log \frac{q(X)}{p(X)}\right] \ge -\log E\left[\frac{q(X)}{p(X)}\right] = -\log \left[\sum_{x:\,p(x) > 0} \frac{q(x)}{p(x)}\, p(x)\right] \ge -\log 1 = 0,$$

where the last step holds because $\sum_{x:\,p(x) > 0} q(x) \le 1$. There is equality iff $\frac{q(x)}{p(x)} = \text{constant} = E\frac{q(X)}{p(X)}$, so that $p(x) = q(x)$ for all $x$.

For the second proof, recall that $\log_b x \le \dfrac{x - 1}{\log_e b}$, for $b > 1$ and $x > 0$, with equality iff $x = 1$. Hence

$$\begin{aligned}
-\log_e 2 \sum_x p(x) \log \frac{p(x)}{q(x)} &= \sum_{p > 0} p(x) \log_e \frac{q(x)}{p(x)} \\
&\le \sum_{p > 0} p(x) \left[\frac{q(x)}{p(x)} - 1\right], \quad \text{with equality iff } p(x) = q(x) \text{ for all } p(x) > 0 \\
&= \sum_{p > 0} q(x) - 1 \\
&\le 0, \quad \text{with equality iff } \sum_{p > 0} q(x) = 1,
\end{aligned}$$

which entails $q(x) = p(x)$ when $p(x) = 0$. Hence equality holds throughout iff $p(x) = q(x)$ for all $x$.

Here are some useful consequences.

Corollary. $H(X) \le \log |A| = \log a$, with equality if and only if $X$ is uniformly distributed on $A$.

Proof. Let $q(x) = |A|^{-1} = a^{-1}$, $x \in A$. Then

$$H(X) \le -\sum_{x \in A} p(x) \log \frac{1}{a} = \log a,$$

with equality if and only if $p(x) = a^{-1}$.
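Gibbs inequality and its corollary can be checked on a small example. This is a sketch under the assumption of base-2 logarithms; `cross_entropy` is an illustrative name for the function $G(q) = -E \log q(X)$ of the theorem, and the distributions are made up for the example.

```python
from math import inf, isclose, log2


def entropy(p):
    """H(X) in bits for a distribution {letter: probability}."""
    return -sum(px * log2(px) for px in p.values() if px > 0)


def cross_entropy(p, q):
    """G(q) = -E log2 q(X) when X has distribution p.

    The value is infinite if q gives zero probability to a letter that p can emit.
    """
    total = 0.0
    for x, px in p.items():
        if px > 0:
            if q.get(x, 0) == 0:
                return inf
            total -= px * log2(q[x])
    return total


p = {"a": 0.5, "b": 0.25, "c": 0.25}
uniform = {x: 1 / len(p) for x in p}

# Choosing q uniform gives G(q) = log2 |A|, the bound of the corollary.
gap = cross_entropy(p, uniform) - entropy(p)
```

The quantity `gap` is non-negative and vanishes only when the two distributions coincide, so it measures how far $p$ is from uniform.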

We interpret this by regarding $H(X)$ as a measure of how “spread out” the distribution of $X$ is over its possible letters. Now, rearranging Gibbs inequality, we find that if we regard it as a function of the two distributions we have this

$$\sum_x p(x) \log \frac{p(x)}{q(x)} := d(p, q) \ge 0, \quad [...]$$

Hence the probability in (*) is less than or equal to

$$\frac{1}{\delta^2 n^2} E[\log p(\mathbf{X}) + nH(X)]^2 = \frac{1}{\delta^2 n^2} \operatorname{var} L_n = \frac{1}{n\delta^2} \operatorname{var} L_1 \to 0$$

as $n \to \infty$ since $L_1$ has finite variance. [Note that this is essentially a simple weak law of large numbers.]

We use this key theorem in the next section to show that although the total number of possible messages of length $n$ is $|A|^n$, in fact, with probability arbitrarily close to 1, $\mathbf{X}$ is a message lying in a set $T_n$ of messages that is much smaller than $A^n$, except when $X$ is uniform on $A$.

1.5 Typicality

We have shown above that the entropy of a message $\mathbf{X}$ of length $n$ from the source is $nH(X)$, where $H(X)$ is the entropy of any letter. Before the message appears, not much can be said about any particular symbol, except its distribution $p(x)$. But suppose we consider arbitrarily long messages from the source. Claude Shannon’s remarkable insight was that such messages have this property.

Theorem. Typicality. Consider a discrete memoryless source. Then for $\epsilon > 0$ and $\delta > 0$ there exists $n_0 < \infty$ such that for all $n > n_0$ the set $A^n$ of all possible sequences of length $n$ can be divided into disjoint sets $T_n$ and $U_n$ such that $T_n \cup U_n = A^n$ and

(1) $2^{-n(H+\delta)} \le p(\mathbf{x}) \le 2^{-n(H-\delta)}$, for $\mathbf{x} \in T_n$,
(2) $P(\mathbf{X} \in T_n) \ge 1 - \epsilon$,
(3) $(1 - \epsilon)2^{n(H-\delta)} \le |T_n| \le 2^{n(H+\delta)}$.

That is to say, more informally, as $n$ increases $A^n$ can be split into a set $U_n$ of arbitrarily small probability (called the untypical sequences) and a set $T_n$ of probability arbitrarily near 1, by (2), called the typical set. Thus for many practical purposes we can treat the messages of length $n > n_0$ as though there were only $2^{nH}$ of them, by (2) and (3), and with each such typical message having roughly the same probability $2^{-nH}$ of occurring, by (1).

The point of this is that from above, for some $\gamma > 0$, $H(X) \le \log |A| - \gamma$, provided that $X$ is not uniform on $A$. Hence, choosing $\delta < \gamma$,

$$|T_n| \le 2^{n(H+\delta)} \le 2^{n(\delta-\gamma)} |A|^n$$

and we see that the set of typical messages is much smaller than the set of possible messages in the long run. This idea makes possible both Shannon’s source and channel coding theorems, as we see in the following sections. [Note the slightly counter-intuitive fact that the most probable messages are not typical.]

Proof of the theorem. Define the typical set $T_n$ to be those messages $\mathbf{x}$ whose log-likelihood is within a distance $\delta$ from $H$. That is to say

$$T_n = \left\{\mathbf{x} : \left|\frac{1}{n} \log p(\mathbf{x}) + H(X)\right| < \delta\right\}.$$

Rearranging the inequality gives (1). Now using the empirical log-likelihood convergence theorem of the previous section gives (2). It follows that

$$1 - \epsilon \le P(\mathbf{X} \in T_n) = \sum_{\mathbf{x} \in T_n} p(\mathbf{x}) \le 1,$$

and now applying the two bounds in (1) to each $p(\mathbf{x})$ in the sum gives (3). For example,

$$1 \ge \sum_{\mathbf{x} \in T_n} p(\mathbf{x}) \ge \sum_{\mathbf{x} \in T_n} 2^{-n(H+\delta)} = |T_n|\, 2^{-n(H+\delta)},$$

so that $|T_n| \le 2^{n(H+\delta)}$. This theorem is sometimes called the Asymptotic Equipartition Property, or AEP.

Finally, we note that the idea of typicality can be formulated more strongly. The results above address only the probability of a sequence $\mathbf{x}$, so that a sequence $\mathbf{x}$ is typical if $\left|\frac{1}{n} \log p(\mathbf{x}) + H\right| < \delta$. This tells us little about the actual sequence itself, that is to say the actual frequency of occurrence of the letters of $A$ in the message $\mathbf{x}$. Strong typicality addresses exactly that; so we define $N(\alpha, \mathbf{x})$ to be the number of occurrences of $\alpha \in A$ in $\mathbf{x}$. The collection $[N(\alpha, \mathbf{x}) : \alpha \in A]$ is called the type of $\mathbf{x}$.

Definition. Let $\delta > 0$. The message $\mathbf{x} \in A^n$ is said to be $\delta$-strongly typical for $p_X(x)$ if

$$\left|\frac{1}{n} N(\alpha, \mathbf{x}) - p_X(\alpha)\right| < \delta \quad \text{when } p_X(\alpha) > 0,$$

and $N(\alpha, \mathbf{x}) = 0$ whenever $p_X(\alpha) = 0$. That is to say, the empirical distribution $\frac{1}{n} N(\alpha, \mathbf{x})$ is close to the source distribution $p_X(\alpha)$ (in total variation distance, more formally). The set of such sequences is called the strongly typical set, and it turns out to have essentially the same properties as the weakly (or entropy) typical set. That is to say, its probability is arbitrarily close to 1, and its sequences are asymptotically equiprobable. This may be called the strong AEP.
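To get a feel for the sizes involved, the (weakly) typical set of a binary memoryless source can be computed exactly. The sketch below is an illustration only: the source with $P(X = 1) = 0.3$, the block length $n = 200$ and $\delta = 0.1$ are arbitrary assumed values, and `typical_set_stats` is a hypothetical helper name.

```python
from math import comb, log2


def typical_set_stats(p1, n, delta):
    """Entropy, size and probability of the delta-typical set T_n
    for a memoryless binary source with P(X = 1) = p1.

    Messages with the same number k of ones have the same probability,
    so we loop over k rather than over all 2**n messages.
    """
    H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))
    size, prob = 0, 0.0
    for k in range(n + 1):
        log_p = k * log2(p1) + (n - k) * log2(1 - p1)  # log2 p(x)
        if abs(log_p / n + H) < delta:                 # membership test for T_n
            size += comb(n, k)
            prob += comb(n, k) * 2.0 ** log_p
    return H, size, prob


n, delta = 200, 0.1
H, size, prob = typical_set_stats(0.3, n, delta)

fraction = size / 2 ** n  # share of all 2**n messages that are typical
# The single most probable message (all zeros) has log2 p(x) / n = log2(0.7).
all_zeros_typical = abs(log2(0.7) + H) < delta
```

With these values the typical set carries almost all of the probability while containing only a small fraction of the $2^n$ possible messages, and the most probable message lies outside it.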

1.6 Shannon’s first theorem: noiseless (or source) coding

Recall that our task is to use the channel efficiently; an obvious way to do this is to seek a code that minimizes the expected length of the encoded message, or signal, passing through the channel. Remarkably, Shannon showed this:

Theorem. If a source having entropy $H(X)$ is encoded using an alphabet $B$, of size $b = |B|$, then given $\epsilon > 0$, for large enough $n$ there is an encoding function $c(\cdot)$, from $A^n$ to $B^m \cup B^k$, for some $k, m \ge 1$, such that

$$\frac{1}{n} E|c(\mathbf{X})| \le \frac{H(X)}{\log b} + \epsilon.$$

That is to say, the expected number of signal symbols per symbol of $\mathbf{X} = (X_1, \ldots, X_n)$ is arbitrarily close to $\dfrac{H(X)}{\log b}$, as $n \to \infty$.

Conversely, no such invertible block encoding using B can have shorter expected length

1.7 Information

[...] These can be seen as the conditional probabilities defining a channel with input alphabet $A^r$, and output alphabet $B^r$, and this may be called the $r$th extension of the channel $p(y \mid x)$. The array $p(y \mid x)$, $x \in A$, $y \in B$ is called the channel matrix. It is stochastic, and may or may not be a square matrix. For any given input distribution $p_X(x)$, the input and output have joint distribution

$$p(x, y) = p_X(x)\, p(y \mid x).$$

Thus $X$ and $Y$ have respective entropies $H(X)$ and $H(Y)$, and joint entropy $H(X, Y)$. However, in the context of a noisy channel it is natural to consider yet another entropy function: the entropy of the distribution $p(y \mid x)$ for any fixed $x$. This is given by

$$H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$$

and called the conditional entropy of $Y$ given $X = x$. Note that as $x$ ranges over $A$, this defines a random variable, being a function of $X$. It therefore has an expectation, which is the expected value of the entropy in $Y$, conditional on the value of $X$, before the input symbol is supplied. It is given by

$$\begin{aligned}
H(Y \mid X) &= \sum_x p_X(x) H(Y \mid X = x) \\
&= -\sum_{x,y} p(x, y) \log p(y \mid x) \\
&= -E \log p(Y \mid X).
\end{aligned}$$

[This is not a random variable of course, despite the similarity of notation with conditional expectation $E(Y \mid X)$ which is a random variable.]

Lemma. $H(X \mid Y) \ge 0$, with equality iff $X$ is a non-random function of $Y$; i.e. $X = g(Y)$ for some $g(\cdot)$.

Proof. The non-negativity is obvious. Now $H(X \mid Y) = 0$ iff $H(X \mid Y = y) = 0$ for all $y$. But any entropy is 0 iff the distribution is concentrated at a point, so that $x = g(y)$ for some $g$ and all $y$.

We return to this entropy later, but note that $H(X \mid Y)$ is of particular interest, as it represents the expected uncertainty of the receiver of the transmitted signal about what was actually sent. It has been called the equivocation. [And $H(Y \mid X)$ has been called the prevarication.]

Now recall that we mentioned two extreme cases, useless channels in which the output is noise independent of the input, and perfect channels with no noise. Obviously in intermediate cases it would be very useful to have some measure of just how good (or bad) the channel is; that is to say, how close $Y$ is to $X$, in some suitable sense. We would then (we hope) be able to choose $p_X(x)$ to make $Y$ as close to $X$ as possible, thus optimizing the channel’s performance. Fortunately, we have already defined such a measure of closeness above, in the form of the relative entropy (or Kullback-Leibler divergence). We therefore judge our channel by how far it is from being useless, namely the relative entropy between $p(x, y)$ and $p_X(x)p_Y(y)$.

Definition. For random variables $X$ and $Y$ (seen as the input signal and output signal respectively), their mutual information $I(X; Y)$ is the relative entropy between $p(x, y)$ and $p_X(x)p_Y(y)$:

$$I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p_X(x)p_Y(y)} = E \log \frac{p(X, Y)}{p_X(X)p_Y(Y)}.$$

When $X$ and $Y$ are independent $I(X; Y) = 0$, and the channel is useless. When $X = Y$, $I(X; Y) = H(X)$, and the input and output have the same entropy, as expected. In intermediate cases we may choose $p_X(x)$ to get the best we can from the channel. This definition is therefore natural:

Definition. The (Shannon) capacity of a channel with input $X$ and output $Y$ is

$$C = \max_{p_X(x)} I(X; Y).$$
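For a channel with a binary input, the maximization in the definition of capacity can be approximated by a simple grid search over $P(X = 0)$. The sketch below is illustrative: the function names and the channel with crossover probability 0.1 are assumptions, and the comparison value $1 - H(0.1)$ is the standard closed form for the binary symmetric channel, quoted here as a check rather than derived in this section.

```python
from math import isclose, log2


def mutual_information(px, channel):
    """I(X;Y) in bits, for input distribution px[x] and matrix channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    py = [sum(px[x] * channel[x][y] for x in range(len(px))) for y in range(n_out)]
    total = 0.0
    for x, p_x in enumerate(px):
        for y, p_y_given_x in enumerate(channel[x]):
            joint = p_x * p_y_given_x
            if joint > 0:
                total += joint * log2(joint / (p_x * py[y]))
    return total


def capacity_binary_input(channel, steps=1000):
    """Grid search over P(X = 0); returns (approximate capacity, maximizing P(X = 0))."""
    return max(
        (mutual_information([t / steps, 1 - t / steps], channel), t / steps)
        for t in range(steps + 1)
    )


def binary_entropy(e):
    return -(e * log2(e) + (1 - e) * log2(1 - e))


bsc = [[0.9, 0.1], [0.1, 0.9]]      # binary symmetric channel, crossover 0.1
useless = [[0.5, 0.5], [0.5, 0.5]]  # output independent of input

cap, best_p0 = capacity_binary_input(bsc)
```

A grid search is crude but adequate here because $I(X; Y)$ depends on a single parameter; for larger input alphabets an iterative method would be the more usual choice.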

As with $H(X)$, this definition will be further justified by its applications, to follow. For the moment, we note some properties of $I(X; Y)$, and its relationship to entropies.

Theorem.

$$\begin{aligned}
I(X; Y) &= H(X) + H(Y) - H(X, Y) \\
&= H(X) - H(X \mid Y) \\
&= H(Y) - H(Y \mid X) \\
&= I(Y; X) \ge 0
\end{aligned}$$

with equality in the last line if and only if $X$ and $Y$ are independent.

Proof. All follow from the definitions, except the last assertion which is a consequence of Gibbs inequality, yielding equality when $p(x, y) = p_X(x)p_Y(y)$, as required.

Corollaries

1. $H(X) \ge H(X \mid Y)$ with equality iff $X$ and $Y$ are independent (informally we recall this as conditioning reduces entropy).
2. $H(X, Y) \le H(X) + H(Y)$ with equality iff $X$ and $Y$ are independent.
3. $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$. This is called the chain rule.
4. For non-random $g(\cdot)$:
   (a) $H(g(X)) \le H(X)$, with equality iff $g$ is invertible;
   (b) $H(X, g(X)) = H(X)$.

Proof. 1-3 are trivial. For 4, recall that $H(g(X) \mid X) = 0$, so we obtain

(b) $H(X, g(X)) = H(X) + H(g(X) \mid X) = H(X)$

(a) $H(X, g(X)) = H(g(X)) + H(X \mid g(X)) \ge H(g(X))$

with equality iff $H(X \mid g(X)) = 0$, which means $X$ is a function of $g(X)$, as required for the invertibility.
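These identities can be confirmed numerically from any joint distribution. In the sketch below the joint table, the helper names and the choice $g(x) = x^2$ are assumptions made for illustration; each quantity is computed directly from its definition, so the agreement is a genuine check.

```python
from math import isclose, log2


def H(dist):
    """Entropy in bits of {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal of a joint distribution {(x, y): p} on coordinate 0 or 1."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def cond_entropy(joint, given_axis):
    """Entropy of the other coordinate given this one: -sum p(x,y) log p(other|given)."""
    given = marginal(joint, given_axis)
    return -sum(
        p * log2(p / given[pair[given_axis]]) for pair, p in joint.items() if p > 0
    )


def mutual_information(joint):
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )


def pushforward(dist, g):
    """Distribution of g(X) when X has distribution dist."""
    out = {}
    for x, p in dist.items():
        out[g(x)] = out.get(g(x), 0.0) + p
    return out


joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
hx, hy, hxy = H(marginal(joint, 0)), H(marginal(joint, 1)), H(joint)
h_x_given_y = cond_entropy(joint, given_axis=1)
h_y_given_x = cond_entropy(joint, given_axis=0)
info = mutual_information(joint)

# Corollary 4(a): squaring merges -1 and +1, so it loses entropy.
x_dist = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
h_squared = H(pushforward(x_dist, lambda x: x * x))
```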

Example. Shannon noiseless coding bound. Let a source sequence $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ be encoded as binary strings by a uniquely

1.8 Shannon’s second theorem. Noisy (or channel) coding.

[...] Sketch proof. Let $X$ and $Y$ be the input and output of the channel with capacity $C$, and let $X$ have the distribution $p_X(x)$ that actually achieves the capacity $C$. [In practical cases, $I(X; Y)$ is a continuous function on a closed bounded subset of $\mathbb{R}^a$, so the supremum over $p_X(x)$ is indeed attained.] For arbitrarily large $n$, consider the sequence of inputs $(X_1, \ldots, X_n)$, where these are i.i.d. with distribution $p_X(x)$. By typicality (the AEP), we have that

- There are about $2^{nH(X)}$ typical inputs $\mathbf{x}$.
- There are about $2^{nH(Y)}$ typical outputs $\mathbf{y}$.
- The conditional entropy of the input $X$ given the output $Y$ is $H(X \mid Y)$, and thus, also by typicality, to each typical output there corresponds on average about $2^{nH(X \mid Y)}$ typical inputs.

To see this another way, consider the input and output together as a single random vector with entropy $H(X, Y)$. Shannon’s theorem (the AEP) shows that there are about $2^{nH(X,Y)}$ input-output pairs, which are often called jointly typical sequences. Thus there are about

$$\frac{2^{nH(X,Y)}}{2^{nH(Y)}} = 2^{nH(X \mid Y)}$$

typical inputs per typical output on average (as seen above from the other point of view).

Now suppose that there is a source producing messages with entropy rate $R_0 = H < C$; that is to say as $n$ increases it supplies about $2^{nR_0}$ typical messages of length $n$. We wish to encode these for reliable transmission through the channel of capacity $C$. Choose $R$ such that $R_0 < R < C$, and construct a coding scheme as follows:

1. From the $2^{nH(X)}$ typical input messages defined above (with distribution $p_X(x)$ achieving capacity) select $2^{nR}$ independently at random, with replacement; these are the codewords. [Note that since the selection is with replacement, we admit the possibility of having two codewords the same, somewhat counter-intuitively.] Denote this codebook by $\mathbf{x}(1), \ldots, \mathbf{x}(2^{nR})$.
2. The decoding scheme is this: for any output $\mathbf{Y}$ we will look at the set $S(\mathbf{Y})$ of corresponding typical inputs (i.e., those that are jointly typical with $\mathbf{Y}$), of which there are typically about $2^{nH(X \mid Y)}$, as remarked above. If $\mathbf{x}(r)$ is sent, then with probability arbitrarily close to 1, the output $\mathbf{Y}$ will be jointly typical with $\mathbf{x}(r)$. If on examining the set $S(\mathbf{Y})$ of inputs that are jointly typical with $\mathbf{Y}$ we find no other codeword than $\mathbf{x}(r)$ then decoding $\mathbf{Y}$ as $\mathbf{x}(r)$ is correct. Otherwise, if $S(\mathbf{Y})$ contains another codeword, we declare an error in transmission.

With this codebook and decoding scheme, we pick a codeword to send at random from $\mathbf{x}(1), \ldots, \mathbf{x}(2^{nR})$. The average probability of error (averaged over random choice of codebooks and random choice of codeword to send) is therefore

$$\begin{aligned}
p_e &= P(\text{at least one codeword not equal to that sent lies in the set } S(\mathbf{Y})) \\
&\le \sum_{k=1}^{2^{nR}} P(\mathbf{x}(k) \in S(\mathbf{Y})), \quad \text{since } P(\cup A_i) \le \sum P(A_i) \\
&\simeq 2^{nR}\, \frac{2^{nH(X \mid Y)}}{2^{nH(X)}},
\end{aligned}$$

because there are about $2^{nH(X)}$ possible choices for $\mathbf{x}(k)$, of which about $2^{nH(X \mid Y)}$ are in $S(\mathbf{Y})$. Hence $p_e \le 2^{nR}\, 2^{-nC}$ (since $p_X(x)$ achieves $C$), and this tends to 0 as $n \to \infty$, because $R < C$.

It follows that for any $\epsilon > 0$, there exists $n < \infty$ such that there is a fixed set of codewords $\mathbf{x}(1), \ldots, \mathbf{x}(2^{nR})$ that has average (over codeword selected) error smaller than $\epsilon$. Now order these codewords by their probabilities of error, and discard the worst half (with greatest probability of error). The remaining codewords have arbitrarily small maximum probability of error, and there are $2^{n(R - \frac{1}{n})}$ codewords in the book. This exceeds $2^{nR_0}$ for large enough $n$, so the message from the source can thus be invertibly coded, with maximum probability of error as small as we choose.

Note that this is purely a proof of the existence of such a codebook. There is no clue as to how we might find it (except the essentially useless method of searching through all possible codebooks).
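The heuristic bound $p_e \le 2^{nR} 2^{-nC} = 2^{-n(C-R)}$ already indicates how the block length must grow as $R$ approaches $C$. The arithmetic is sketched below under stated assumptions: the capacity value is that of a binary symmetric channel with crossover probability 0.1, the rates and target error are made-up numbers, and the function names are hypothetical. Since the bound comes from a sketch proof, the resulting block lengths are indicative only.

```python
from math import ceil, log2


def binary_entropy(e):
    return -(e * log2(e) + (1 - e) * log2(1 - e))


def union_bound(n, rate, capacity):
    """The heuristic bound 2^{nR} 2^{-nC} on the average error probability."""
    return 2.0 ** (-n * (capacity - rate))


def min_block_length(rate, capacity, eps):
    """Smallest n for which the heuristic bound is at most eps (requires R < C)."""
    if not rate < capacity:
        raise ValueError("the bound only decays when R < C")
    return ceil(log2(1 / eps) / (capacity - rate))


def survives_expurgation(n, source_rate, rate):
    """After discarding the worst half, 2^{nR - 1} codewords remain;
    this is at least 2^{n R_0} exactly when n (R - R_0) >= 1."""
    return n * (rate - source_rate) >= 1


C = 1 - binary_entropy(0.1)  # assumed example channel
R0, R, eps = 0.35, 0.4, 1e-6
n = min_block_length(R, C, eps)
```

Repeating the calculation with `R` closer to `C` shows the required `n` growing in proportion to $1/(C - R)$.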

We conclude this section with a brief look at other popular decoding rules for noisy channels. [For noiseless channels decoding is clearly trivial, since the receiver always sees the codeword that was sent.] Formally, in general, we have this:

Definition. A decoder (or decoding function) $g(\cdot)$ is defined on all possible outputs $\mathbf{y}$ of the channel, and takes values in the set of all codewords, possibly augmented by a symbol $e$ denoting that the decoder declares an error.

Example. The ideal observer (or minimum error decoder) chooses the most likely codeword given the output of the channel. Thus (in an obvious notation)

$$g(\mathbf{y}) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } p(c(x) \mid \mathbf{y}) \text{ is maximal} \\ e & \text{otherwise} \end{cases}$$

This rule has a potential disadvantage, in that it is necessary to know the distribution of the codewords, $p(c)$, since

$$p(c \mid \mathbf{y}) = p(\mathbf{y} \mid c)\, \frac{p(c)}{p(\mathbf{y})}.$$

A decoder without this problem is this:

Example. Maximum likelihood decoder.

$$g(\mathbf{y}) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } p(\mathbf{y} \mid c(x)) \text{ is maximal} \\ e & \text{otherwise} \end{cases}$$

This chooses the codeword that makes the received message most likely. Another way of defining decoders is to view the codewords and signal as points in the same suitable space, with a distance function $\| \cdot \|$.

Example. Minimum distance (or nearest neighbour) decoder.

$$g(\mathbf{y}) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } \|c(x) - \mathbf{y}\| \text{ is minimal} \\ e & \text{otherwise} \end{cases}$$

When alphabets are binary, a very natural distance between binary strings $c(x)$ and $\mathbf{y}$ of length $n$ is the Hamming distance, in which $\|c(x) - \mathbf{y}\|$ is the number of places at which
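The maximum likelihood and minimum distance rules can be compared directly on binary strings. The sketch below takes the Hamming distance to be the number of positions in which two equal-length strings differ, and assumes a memoryless binary channel that flips each symbol independently with probability `eps`; the repetition codebooks and all function names are illustrative assumptions.

```python
ERROR = "e"  # symbol returned when the decoder declares an error


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def unique_best(codebook, scores, best):
    """Return the codeword attaining the best score if it is unique, else ERROR."""
    winners = [c for c, s in zip(codebook, scores) if s == best]
    return winners[0] if len(winners) == 1 else ERROR


def min_distance_decode(codebook, y):
    dists = [hamming(c, y) for c in codebook]
    return unique_best(codebook, dists, min(dists))


def ml_decode(codebook, y, eps):
    """Maximum likelihood decoding when each symbol is flipped with probability eps:
    p(y | c) = eps^d (1 - eps)^(n - d), where d is the Hamming distance."""
    likes = []
    for c in codebook:
        d = hamming(c, y)
        likes.append(eps ** d * (1 - eps) ** (len(y) - d))
    return unique_best(codebook, likes, max(likes))


rep3 = ["000", "111"]    # 3-fold repetition code
rep4 = ["0000", "1111"]  # 4-fold repetition code

single_flip = min_distance_decode(rep3, "010")
tie = min_distance_decode(rep4, "0011")
```

For `eps` below one half the likelihood decreases as the distance grows, so on this channel the two rules select the same codeword and declare errors on the same outputs.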

1.9 Differential entropy

[...] $h(X \mid Y) \le h(X)$, with equality iff $X$ and $Y$ are independent.

Furthermore, the mutual information is defined and behaves likewise.

Definition. $I(X; Y) = h(X) + h(Y) - h(X, Y) = d(f(x, y), f_X(x)f_Y(y)) \ge 0$, with equality if and only if $X$ and $Y$ are independent.

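Example. $Q = (X, Y)$ is a random point uniformly distributed in the square $S$ determined by the four points having Cartesian coordinates $(0, \pm 1)$, $(\pm 1, 0)$. What is the information conveyed about $X$ by $Y$?

Solution. The square has area 2, so the joint and marginal densities are $f(x, y) = \frac{1}{2}$ on $S$; $f_X(x) = 1 - |x|$; $f_Y(y) = 1 - |y|$. Hence, taking natural logarithms,

$$\begin{aligned}
I(X; Y) &= E \log \frac{f(X, Y)}{f_X(X) f_Y(Y)} = -\log 2 - E \log\{(1 - |X|)(1 - |Y|)\} \\
&= -\log 2 - \iint_S \log(1 - |x|)\, dx\, dy, \quad \text{by symmetry} \\
&= -\log 2 - 4 \int_0^1 \int_0^{1-x} \log(1 - x)\, dy\, dx, \quad \text{also by symmetry} \\
&= -\log 2 + 4 \left[\tfrac{1}{2}(1 - x)^2 \log(1 - x)\right]_0^1 + 4 \int_0^1 \tfrac{1}{2}(1 - x)\, dx \\
&= 1 - \log_e 2 \simeq 0.31 \text{ nats} \simeq 0.44 \text{ bits}.
\end{aligned}$$

In the second line, $X$ and $Y$ contribute equally to the expectation, and the factor 2 this produces cancels the density $\frac{1}{2}$.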
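The value $1 - \log_e 2$ can be confirmed by numerical integration of the one-dimensional integral $\int_0^1 (1 - x) \log(1 - x)\, dx$ to which the calculation reduces. The sketch below uses a plain midpoint rule; the helper name `midpoint_integral` and the number of steps are arbitrary choices for this illustration.

```python
from math import log


def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b].

    Midpoints never touch the endpoints, so log(1 - x) is never evaluated at x = 1.
    """
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# I(X;Y) = -log 2 - 4 * integral_0^1 (1 - x) log(1 - x) dx, in nats.
inner = midpoint_integral(lambda x: (1 - x) * log(1 - x), 0.0, 1.0)
info_nats = -log(2) - 4 * inner
info_bits = info_nats / log(2)
```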

Note that if $S$ had been the square $(\pm 1, \pm 1)$, then $I(X; Y) = 0$, as $X$ and $Y$ are then independent. But the covariance of $X$ and $Y$ is zero in both cases, so $I(X; Y)$ is a better measure of association from one point of view.

Finally, we note one important difference between $H(X)$ and $h(X)$. When $X$ is simple and $g(X)$ is a one-one invertible function of $X$, we have $H(X) = H(g(X))$. This is not necessarily true for differential entropy.

Example. Let $a \ne 0$ be constant, and let $X$ have differential entropy $h(X)$. Then $Y = aX$ has density

$$f_Y(y) = \frac{1}{|a|} f_X\left(\frac{y}{a}\right)$$

and

$$h(Y) = h(aX) = -\int \frac{1}{|a|} f_X\left(\frac{y}{a}\right) \left[\log f_X\left(\frac{y}{a}\right) - \log |a|\right] dy = h(X) + \log |a|.$$

And, more generally, if $\mathbf{X}_n$ is a random $n$-vector, and $A$ an $n \times n$ matrix with non-zero determinant $\det A$, then $h(A\mathbf{X}_n) = h(\mathbf{X}_n) + \log |\det A|$.
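The scaling rule $h(aX) = h(X) + \log |a|$ can be checked numerically for a particular density. In the sketch below the density $f_X(x) = 2x$ on $[0, 1]$ and the factor $a = 3$ are arbitrary assumed examples, entropies are in nats, and `differential_entropy` is a hypothetical helper based on the midpoint rule.

```python
from math import log


def differential_entropy(density, lo, hi, steps=100_000):
    """Midpoint-rule approximation of -integral f log f over [lo, hi], in nats."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        f = density(lo + (i + 0.5) * h)
        if f > 0:
            total -= f * log(f) * h
    return total


a = 3.0


def f_x(x):
    """Assumed example density of X: f_X(x) = 2x on [0, 1]."""
    return 2 * x


def f_y(y):
    """Density of Y = aX on [0, a], from f_Y(y) = f_X(y / a) / |a|."""
    return f_x(y / a) / abs(a)


h_x = differential_entropy(f_x, 0.0, 1.0)
h_y = differential_entropy(f_y, 0.0, a)
shift = h_y - h_x  # should be close to log|a|
```

For this density $h(X) = \frac{1}{2} - \log_e 2$, which is negative: unlike $H(X)$, differential entropy is not constrained to be non-negative.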

1.10 Interpretation of entropy and information

While not essential in mathematics, interpretations of mathematical concepts are usually welcome, because they lend plausibility to axioms, and suggest which theorems should be most interesting. So we note that the concepts of entropy and mutual information defined above can be interpreted as measures of our real-world concepts of uncertainty, surprise, and information. We argue as follows:

(a) We defined the entropy $H(X)$ of the random variable $X$ having probability distribution $p(x)$ to be the expected value of the empirical log-likelihood: $H(X) = E\{-\log p(X)\}$. Here is an intuitive interpretation of $H(X)$. Suppose that $E$ is some event that may occur with probability $p = P(E)$. In advance of the relevant experiment we have some level of uncertainty about whether $E$ will occur or not, and if later informed that $E$ has occurred we feel some measure of surprise. The key point is that both our uncertainty and surprise vary according to $P(E)$. To see this consider $E$ and $E^c$ with probabilities $P(E) = 10^{-6}$, and $P(E^c) = 1 - 10^{-6}$. We feel rather more uncertain about the occurrence of $E$ than $E^c$, and equally we would be rather more surprised at the occurrence of $E$ than $E^c$. Since it is the transfer of information that has resolved the uncertainty and created surprise, all of these depend on $P(E) = p$. We claim that the following are intuitively natural properties of the surprise $s(E)$ that we feel about $E$’s occurrence.

(i) It depends only on $p$, and not further on the value of any random variable defined on $E$, nor on any meaning conveyed by $E$, nor any other semantic aspect of $E$. That is to say, $s(E) = u(p)$, for some function $u(p)$, $0 \le p \le 1$, taking numerical values. For example, consider the events
$E_1$ = you win £$10^6$ with probability $10^{-6}$;
$E_2$ = you are struck by lightning with probability $10^{-6}$.
Obviously your feelings (semantic connotations) about these two events, and the random outcomes defined on them, are very different. But you are equally surprised in each case.

(ii) The function $u(p)$ is decreasing in $p$. That is to say, you are more surprised by more unlikely events when they occur.

(iii) The surprise occasioned by the occurrence of independent events is the sum of their surprises. That is to say, if $A$ and $B$ are independent with probabilities $p$ and $q$, then $s(A \cap B) = u(pq) = u(p) + u(q)$.

(iv) The surprise $u(p)$ varies continuously with $p$.

(v) There is no surprise in a certain event, so $s(\Omega) = u(1) = 0$.

From these it follows (by some analysis which we omit) that for some constant $c > 0$

$$u(p) = -c \log p.$$
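The listed properties are easy to confirm for this function, and entropy then appears as the expected surprise. The sketch below assumes base-2 logarithms and $c = 1$, so that surprise is measured in bits; the function names are illustrative.

```python
from math import isclose, log2


def surprise(p, c=1.0):
    """u(p) = -c log2 p: the surprise at an event of probability p."""
    return -c * log2(p)


def expected_surprise(dist):
    """E u(p(X)) = -sum p log2 p, which is the entropy H(X) when c = 1."""
    return sum(p * surprise(p) for p in dist.values() if p > 0)


rare = surprise(1e-6)           # an event of probability 10^-6
common = surprise(1 - 1e-6)     # its complement
joint = surprise(0.5 * 0.25)    # two independent events occurring together
fair_coin = expected_surprise({"H": 0.5, "T": 0.5})
```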
