













































Introduction source coding channel capacity and noisy coding
Please let me know of any errors, typos, or poorly expressed arguments. And please do not reproduce or distribute this document outside Oxford University. David Stirzaker.
- February 15, This section outlines the basic problems that communication theory has been developed to solve, introduces the main ideas that are used to deal with those problems, and sketches their solutions. In other words, it amounts to a preview and synopsis of the rest of the course.
The principal problem of communication is to accept a message at some point, and then reproduce it as efficiently, reliably, and securely as possible at some other point. The first step in this is taken by the engineers who make a suitable mechanism M for transferring the message, in the form of a signal of some kind; this mechanism will be called the channel whatever its actual physical form, which may be a wire, fibre, disc, aerial and receiver, tape, book and so on. The two ends of the channel are called the source and the receiver (which may be the same if a message is stored for retrieval) and may be separated in time or space, or both. Communication theory begins at this stage, by supposing that there exists such a channel that is imperfect in the senses of being finite, and noisy or insecure, or all three of these. If the channel were capable of transmitting an unbounded number of symbols, arbitrarily quickly, with any desired degree of accuracy, and in total privacy, then no further work would be needed; but no such channels exist. So, as mentioned above, the theorist has three tasks:
These are strong requirements, but a remarkable sequence of ideas initiated by Claude Shannon, and much developed since, has shown that they are achievable; at least up to a certain well-defined level in each case. The key idea and technique which makes all this possible is that of coding, which we now consider.
alphabet $A$. A message of length $n$ is denoted by $\mathbf{x} \in A^n$, where $A^n$ is the set of all strings of length $n$ of symbols from $A$. The set of all finite strings of symbols from $A$ is denoted by $A^*$.

Definition: a code $c(\cdot)$ (or encoding, or code function) for the source $S$ is a mapping from $A^*$ to the set $B^*$ of finite-length strings from an alphabet $B$, which may be called codewords. [If the mapping is to $B^m$, for some $m$, then the code is said to be a block code.] Formally,

$$ c(\cdot) : x \in A^* \mapsto c(x) \in B^*. $$
In addition, there is a decoder $d(\cdot)$ which maps $B^*$ to the set of possible messages. The length of the codeword $c(x)$ is denoted by $|c(x)|$. For efficiency we would like $|c(x)|$ to be small in the long run; for reliability we would like $d(c(x)) = x$ as often as possible; for secrecy we wish any enemy who knows $c(x)$ not to be able to identify $x$ in general. Briefly, the core of communication theory is devising good codes.

Among the various properties that good codes might have, this one is clearly almost essential.

Definition: a code $c(\cdot)$ is uniquely decipherable if the concatenation $c(x_1)c(x_2)\ldots c(x_n)$ of codewords of symbols (or messages) from $S$ is the image of (corresponds to) at most one sequence $x_1 \ldots x_n$.

An important class of uniquely decipherable codes is this:

Definition: a code is a prefix (or instantaneous) code if no codeword is the prefix of another. [Which is to say that we cannot add letters after some $c(x)$ to get another codeword $c(y) = c(x)b_1 \ldots b_m$.]

Example. Telephone numbers are a prefix code.

The mathematical codes defined above are crucial in communication theory, but the broader concept of coding is of much wider application. For example, the sequence of bases in DNA encodes a number of physical attributes of the individual in question. For more wide-ranging applications we note that musical notation encodes the music itself. Maps encode various features of the surface of the earth. Plans and elevations encode buildings. After some thought, you will realise that speech encodes your thoughts, and writing encodes your speech. This in turn can be given in Morse code. The ultimate conclusion of this process is a binary encoding, which is a string of symbols using an alphabet of just two symbols $\{0, 1\}$. After a little more thought you may agree that anything of practical interest in communication must be capable of encoding as a string of symbols.
[You may care to recall Wittgenstein's remark: "Whereof one cannot speak, thereof one must be silent".]
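The prefix property defined above is easy to test mechanically, and it is what allows a receiver to decode symbol by symbol without looking ahead. The following is a minimal sketch, not part of the original notes; the four-letter code and the function names are illustrative assumptions.

```python
from itertools import permutations


def is_prefix_code(codewords):
    """Return True if the codewords are distinct and none is a prefix of another."""
    words = list(codewords)
    if len(set(words)) != len(words):
        return False  # two symbols share a codeword, so the code is not invertible
    return not any(v.startswith(u) for u, v in permutations(words, 2))


def encode(message, code):
    """Concatenate the codewords c(x_1)c(x_2)...c(x_n)."""
    return "".join(code[symbol] for symbol in message)


def decode_prefix(signal, code):
    """Decode instantaneously: emit a symbol as soon as a codeword is completed."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ""
    for letter in signal:
        buffer += letter
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("signal ends in the middle of a codeword")
    return decoded


# Hypothetical binary prefix code on the alphabet {a, b, c, d}.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
signal = encode("badcab", code)
recovered = decode_prefix(signal, code)
```

The greedy decoder is only correct because of the prefix property; for a code such as {0, 01} it would emit a symbol too early.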
The message to be communicated is supplied by the source, about whose nature we need not be specific, but it has three key properties. First, by what we have said above, we can assume that the message comprises a finite string of symbols. [For if it were not, we would simply encode it as such.] Secondly, the message is to be selected from a set of possible messages; (which we shall assume to be finite, for simplicity). And thirdly we are uncertain about what the message is to be, because if it were known in advance, it would be unnecessary to send it. It is therefore natural to regard the output of a source as a
random sequence of symbols, which we refer to as random variables and vectors (with a slight abuse of the convention that these shall be real-valued).

Definition: a discrete source comprises a sequence of random variables $X_1, X_2, \ldots$ taking values in a finite alphabet $A$. Any finite string is a message. If the $X_r$ are independent and identically distributed, then the source is said to be memoryless, and we can write $P(X_r = x) = p(x)$ for all $r$.

At this stage, we shall assume that sources are discrete and memoryless. [Of course, many real sources do not have these properties, but the ideas and methods that we shall develop in this simple case can be generally extended to deal with more complicated sources.] Thus the probability that the source emits a message $\mathbf{x} = (x_1, \ldots, x_n) \in A^n$ is

$$ P(\mathbf{X} = \mathbf{x}) = P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n) = \prod_{r=1}^{n} p(x_r) $$
by the independence. This is encoded as a signal to enter a channel:

Definition: given an alphabet $B$ of possible input symbols, and an alphabet $D$ of possible output symbols, a discrete channel is a family of conditional distributions $p(y \mid x)$, $x \in B$, $y \in D$. This array is called the channel matrix and denoted by $M$, so that for input $X$ and output $Y$ the entry of $M$ in row $x$ and column $y$ is

$$ P(Y = y \mid X = x) = p(y \mid x). $$

Since $\sum_y p(y \mid x) = 1$, $M$ is a stochastic matrix. It may be square, and it may be doubly stochastic (i.e., $\sum_x p(y \mid x) = 1$), but not usually.

More generally the $r$th extension of the channel is the family of conditional joint distributions of $r$ uses of $M$, given the input $(x_1, \ldots, x_r) = \mathbf{x}$:

$$ p(y_1, \ldots, y_r \mid x_1, \ldots, x_r) = P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x}). $$

The channel is said to be memoryless, and denoted by DMC, if

$$ p(y_1, \ldots, y_r \mid x_1, \ldots, x_r) = \prod_{i=1}^{r} p(y_i \mid x_i). $$
Thus uses of the channel are conditionally independent, given the input. We shall always assume this. We note two extreme cases:
(a) If Y = X, so that p(y|x) = 1, whenever x = y ∈ B, (and p(y|x) is 0 otherwise) then the channel is perfect, or noiseless.
(b) If p(y|x) does not depend on x, (i.e. p(y|x) = p(y) for all x), then the output is pure noise, independent of the input, and the channel is useless.
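As a concrete illustration of these definitions, the sketch below stores a channel matrix as a nested list with `M[x][y]` standing for p(y|x), checks that it is stochastic, and builds the r-th extension of a memoryless channel as a product of single-use probabilities. The binary symmetric channel with crossover probability 0.1 is an assumed example, and the function names are hypothetical.

```python
from itertools import product
from math import isclose, prod


def is_stochastic(M):
    """Every row of M is a probability distribution over the output alphabet."""
    return all(min(row) >= 0 and isclose(sum(row), 1.0) for row in M)


def extension(M, r):
    """r-th extension of a memoryless channel: p(y|x) = prod_i p(y_i|x_i).

    Returns a dict keyed by (x, y), where x and y are tuples of length r.
    """
    inputs, outputs = range(len(M)), range(len(M[0]))
    return {
        (x, y): prod(M[xi][yi] for xi, yi in zip(x, y))
        for x in product(inputs, repeat=r)
        for y in product(outputs, repeat=r)
    }


bsc = [[0.9, 0.1], [0.1, 0.9]]      # assumed example: binary symmetric channel
perfect = [[1.0, 0.0], [0.0, 1.0]]  # case (a): Y = X
useless = [[0.5, 0.5], [0.5, 0.5]]  # case (b): rows do not depend on the input

ext2 = extension(bsc, 2)
```

Each row of the extended matrix again sums to 1, so the extension is itself a channel, with input and output alphabets consisting of strings of length r.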
entropy called bits. Thus the flip of a fair coin has unit entropy, because we chose to take logarithms to base 2.
Lemma.

$$ H(\mathbf{X}) = \sum_{r=1}^{n} H(X_r) = nH(X) $$

for the discrete memoryless source.
Proof.

$$ H(\mathbf{X}) = -E \log p(\mathbf{X}) = -E \log \prod_{r=1}^{n} p(X_r) = -E \sum_{r=1}^{n} \log p(X_r) = \sum_{r=1}^{n} H(X_r) $$
Likewise it is shown that if X and Y are independent then
H(X, Y ) = H(X) + H(Y )
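The additivity in the lemma can be checked numerically on a small source. This is a minimal sketch using the formula H(X) = −E log p(X) with logarithms to base 2; the three-letter distribution is an assumption chosen so that the entropy is exactly 1.5 bits.

```python
from itertools import product
from math import isclose, log2


def entropy(p):
    """Entropy in bits of a distribution given as a dict symbol -> probability."""
    return -sum(q * log2(q) for q in p.values() if q > 0)


def message_distribution(p, n):
    """Distribution of a message of n independent symbols, each distributed as p."""
    dist = {}
    for x in product(p, repeat=n):
        prob = 1.0
        for symbol in x:
            prob *= p[symbol]
        dist[x] = prob
    return dist


p = {"a": 0.5, "b": 0.25, "c": 0.25}     # assumed example source
H1 = entropy(p)                           # entropy of one letter
H3 = entropy(message_distribution(p, 3))  # entropy of a message of length 3
```

Here `H3` is computed directly from the 27 message probabilities, and it agrees with `3 * H1` up to rounding.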
Lemma: H(X) = 0 if and only if X is a constant with probability 1.
Proof. Each term in the sum is zero iff either pX (x) = 0 or pX (x) = 1. There must thus be just one x with the second property.
Lemma. Let $c(\cdot)$ be an invertible (uniquely decipherable) encoding of $X$. Then the entropy of $c(X)$ is the same as that of $X$. We interpret this as the important result: an invertible code neither increases uncertainty nor loses information.
Proof. Let $Y = c(X)$. Then (in an obvious notation)

$$ H(Y) = -\sum_y p_Y(y) \log p_Y(y) = -\sum_y P(c(X) = y) \log P(c(X) = y) = -\sum_y P(X = c^{-1}(y)) \log P(X = c^{-1}(y)) = H(X), $$

using the unique decipherability.
We turn from these useful lemmas to an important result, which we shall use often:

Theorem. Gibbs inequality. Let $X$ have distribution $p(x)$, and let $q(x)$, $x \in A$, be any other probability distribution on the same alphabet as $p(x)$. Then the function $G(q) = -E \log q(X)$ is uniquely minimized over all choices of $q$ by $q = p$, where it takes the value $H(X)$. That is to say, for any distributions $p$ and $q$ on $A$

$$ H(X) = -\sum_x p(x) \log p(x) \le -\sum_x p(x) \log q(x) $$

with equality if and only if $p(x) = q(x)$, $x \in A$.
Proof of Gibbs' inequality. We give two proofs. For the first, recall Jensen's inequality for a strictly convex function $u(X)$ of a random variable $X$; that is: $E u(X) \ge u(EX)$, with equality iff $X$ is constant, so that $X = EX$. Now $u(x) = -\log x$ is strictly convex for $x > 0$. Therefore, letting $X$ have distribution $p(x)$,

$$ \sum_x p(x) \log \frac{p(x)}{q(x)} = E\left(-\log \frac{q(X)}{p(X)}\right) \ge -\log E\,\frac{q(X)}{p(X)} = -\log \sum_x \frac{q(x)}{p(x)}\, p(x) = -\log \sum_x q(x) = 0, $$

with equality iff $\frac{q(x)}{p(x)} = \text{constant} = E\,\frac{q(X)}{p(X)} = 1$, so that $p(x) = q(x)$ for all $x$.
For the second proof, recall that $\log_b x \le \frac{x - 1}{\log_e b}$, for $b > 1$ and $x > 0$, with equality iff $x = 1$. Hence

$$ -\log_e 2 \sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_{p > 0} p(x) \log_e \frac{q(x)}{p(x)} \le \sum_{p > 0} p(x) \left( \frac{q(x)}{p(x)} - 1 \right), $$

with equality iff $p(x) = q(x)$ for all $x$ such that $p(x) > 0$, and

$$ \sum_{p > 0} p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = \sum_{p > 0} q(x) - 1 \le 0, $$

with equality iff $\sum_{p > 0} q(x) = 1$, which entails $q(x) = p(x)$ when $p(x) = 0$. Hence equality holds throughout iff $p(x) = q(x)$ for all $x$.

Here are some useful consequences.

Corollary. $H(X) \le \log |A| = \log a$, with equality if and only if $X$ is uniformly distributed on $A$.

Proof. Let $q(x) = |A|^{-1} = a^{-1}$, $x \in A$. Then

$$ H(X) \le -\sum_{x \in A} p(x) \log \frac{1}{a} = \log a, $$

with equality if and only if $p(x) = a^{-1}$.
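Gibbs' inequality and the corollary can be checked on a small example. In this sketch `cross_entropy(p, q)` computes the function G(q) = −E log q(X) of the theorem, and `relative_entropy` computes the non-negative gap G(q) − H(X); the function names and the distributions are illustrative assumptions.

```python
from math import isclose, log2


def cross_entropy(p, q):
    """G(q) = -E log2 q(X) for X distributed as p; infinite if q misses a possible symbol."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            if q.get(x, 0) == 0:
                return float("inf")
            total -= px * log2(q[x])
    return total


def relative_entropy(p, q):
    """sum_x p(x) log2(p(x)/q(x)), which Gibbs' inequality shows is non-negative."""
    return cross_entropy(p, q) - cross_entropy(p, p)


p = {"a": 0.5, "b": 0.25, "c": 0.25}   # assumed example distribution
uniform = {symbol: 1 / 3 for symbol in p}

H = cross_entropy(p, p)                # this is H(X)
G_uniform = cross_entropy(p, uniform)  # this is log2 |A|, as in the corollary
```

With these values `H` is 1.5 bits while `G_uniform` is log2(3), about 1.585 bits, so the inequality is strict because p is not uniform.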
We interpret this by regarding $H(X)$ as a measure of how "spread out" the distribution of $X$ is over its possible letters. Now, rearranging Gibbs inequality, we find that if we regard it as a function of the two distributions we have this

$$ \sum_x p(x) \log \frac{p(x)}{q(x)} := d(p, q) \ge 0, $$
Hence the probability in (*) is less than or equal to

$$ \frac{1}{\delta^2 n^2}\, E[\log p(\mathbf{X}) + nH(X)]^2 = \frac{1}{\delta^2 n^2} \operatorname{var} L_n = \frac{1}{n \delta^2} \operatorname{var} L_1 \to 0 $$

as $n \to \infty$, since $L_1$ has finite variance. [Note that this is essentially a simple weak law of large numbers.]
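For a binary memoryless source the bound above can be compared with the exact probability, because the log-likelihood of a message depends only on the number of ones it contains. The sketch below assumes that the probability in (*) is that of the per-symbol log-likelihood deviating from H by at least δ, and that L_1 = −log2 p(X_1); the source parameter θ = 0.2 and the function names are also assumptions.

```python
from math import comb, isclose, log2


def deviation_probability(theta, n, delta):
    """Exact P(|-(1/n) log2 p(X) - H| >= delta) for n i.i.d. Bernoulli(theta) symbols."""
    H = -theta * log2(theta) - (1 - theta) * log2(1 - theta)
    total = 0.0
    for k in range(n + 1):  # k = number of ones in the message
        rate = -(k * log2(theta) + (n - k) * log2(1 - theta)) / n
        if abs(rate - H) >= delta:
            total += comb(n, k) * theta**k * (1 - theta) ** (n - k)
    return total


def chebyshev_bound(theta, n, delta):
    """var(L_1) / (n delta^2), where L_1 = -log2 p(X_1)."""
    var_L1 = theta * (1 - theta) * log2((1 - theta) / theta) ** 2
    return var_L1 / (n * delta**2)


exact = {n: deviation_probability(0.2, n, 0.1) for n in (100, 400)}
bound = {n: chebyshev_bound(0.2, n, 0.1) for n in (100, 400)}
```

The bound decays like 1/n, while the exact probability falls much faster; the Chebyshev argument is crude but sufficient for the convergence claimed.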
We use this key theorem in the next section to show that although the total number of possible messages of length $n$ is $|A|^n$, in fact, with probability arbitrarily close to 1, $\mathbf{X}$ is a message lying in a set $T_n$ of messages that is much smaller than $A^n$, except when $X$ is uniform on $A$.
We have shown above that the entropy of a message $\mathbf{X}$ of length $n$ from the source is $nH(X)$, where $H(X)$ is the entropy of any letter. Before the message appears, not much can be said about any particular symbol, except its distribution $p(x)$. But suppose we consider arbitrarily long messages from the source. Claude Shannon's remarkable insight was that such messages have this property.

Theorem. Typicality. Consider a discrete memoryless source. Then for $\epsilon > 0$ and $\delta > 0$ there exists $n_0 < \infty$ such that for all $n > n_0$ the set $A^n$ of all possible sequences of length $n$ can be divided into disjoint sets $T_n$ and $U_n$ such that $T_n \cup U_n = A^n$ and

(1) $2^{-n(H+\delta)} \le p(\mathbf{x}) \le 2^{-n(H-\delta)}$, for $\mathbf{x} \in T_n$,
(2) $P(\mathbf{X} \in T_n) \ge 1 - \epsilon$,
(3) $(1 - \epsilon)\, 2^{n(H-\delta)} \le |T_n| \le 2^{n(H+\delta)}$.
That is to say, more informally, as $n$ increases $A^n$ can be split into a set $U_n$ of arbitrarily small probability (called the untypical sequences) and a set $T_n$ of probability arbitrarily near 1, by (2), called the typical set. Thus for many practical purposes we can treat the messages of length $n > n_0$ as though there were only $2^{nH}$ of them, by (2) and (3), and with each such typical message having roughly the same probability $2^{-nH}$ of occurring, by (1).

The point of this is that from above, for some $\gamma > 0$, $H(X) \le \log |A| - \gamma$, provided that $X$ is not uniform on $A$. Hence, choosing $\delta < \gamma$,

$$ |T_n| \le 2^{n(H+\delta)} \le 2^{n(\delta - \gamma)} |A|^n $$

and we see that the set of typical messages is much smaller than the set of possible messages in the long run. This idea makes possible both Shannon's source and channel coding theorems, as we see in the following sections. [Note the slightly counter-intuitive fact that the most probable messages are not typical.]

Proof of the theorem. Define the typical set $T_n$ to be those messages $\mathbf{x}$ whose log-likelihood is within a distance $\delta$ from $H$. That is to say

$$ T_n = \left\{ \mathbf{x} : \left| \frac{1}{n} \log p(\mathbf{x}) + H(X) \right| < \delta \right\} $$
Rearranging the inequality gives (1). Now using the empirical log-likelihood convergence theorem of the previous section gives (2). It follows that

$$ 1 - \epsilon \le P(\mathbf{X} \in T_n) = \sum_{\mathbf{x} \in T_n} p(\mathbf{x}) \le 1, $$

and now applying the two bounds in (1) to each $p(\mathbf{x})$ in the sum gives (3). For example,

$$ 1 \ge \sum_{\mathbf{x} \in T_n} p(\mathbf{x}) \ge \sum_{\mathbf{x} \in T_n} 2^{-n(H+\delta)} = |T_n|\, 2^{-n(H+\delta)}, $$

so that $|T_n| \le 2^{n(H+\delta)}$.

This theorem is sometimes called the Asymptotic Equipartition Property, or AEP.

Finally, we note that the idea of typicality can be formulated more strongly. The results above address only the probability of a sequence $\mathbf{x}$, so that a sequence $\mathbf{x}$ is typical if $\left| \frac{1}{n} \log p(\mathbf{x}) + H \right| < \delta$. This tells us little about the actual sequence itself, that is to say the actual frequency of occurrence of the letters of $A$ in the message $\mathbf{x}$. Strong typicality addresses exactly that; so we define $N(\alpha, \mathbf{x})$ to be the number of occurrences of $\alpha \in A$ in $\mathbf{x}$. The collection $[N(\alpha, \mathbf{x}) : \alpha \in A]$ is called the type of $\mathbf{x}$.

Definition. Let $\delta > 0$. The message $\mathbf{x} \in A^n$ is said to be $\delta$-strongly typical for $p_X(x)$ if

$$ \left| \frac{1}{n} N(\alpha, \mathbf{x}) - p_X(\alpha) \right| < \delta \quad \text{when } p_X(\alpha) > 0, $$

and $N(\alpha, \mathbf{x}) = 0$ whenever $p_X(\alpha) = 0$. That is to say, the empirical distribution $\frac{1}{n} N(\alpha, \mathbf{x})$ is close to the source distribution $p_X(\alpha)$ (in total variation distance, more formally). The set of such sequences is called the strongly typical set, and it turns out to have essentially the same properties as the weakly (or entropy) typical set. That is to say, its probability is arbitrarily close to 1, and its sequences are asymptotically equiprobable. This may be called the strong AEP.
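The typical set can be computed exactly for a binary memoryless source, since whether a string is typical depends only on how many ones it contains. The sketch below counts the weakly typical set for an assumed source with P(1) = 0.2, n = 200 and δ = 0.1, and also gives a direct test for δ-strong typicality based on the type of a message; all names and parameter values are illustrative assumptions.

```python
from collections import Counter
from math import comb, log2


def binary_entropy(theta):
    return -theta * log2(theta) - (1 - theta) * log2(1 - theta)


def typical_counts(theta, n, delta):
    """Numbers k of ones for which a binary string of length n is weakly typical."""
    H = binary_entropy(theta)
    counts = []
    for k in range(n + 1):
        rate = -(k * log2(theta) + (n - k) * log2(1 - theta)) / n
        if abs(rate - H) < delta:
            counts.append(k)
    return counts


def typical_set_summary(theta, n, delta):
    """Return (|T_n|, P(X in T_n)) for the Bernoulli(theta) source."""
    ks = typical_counts(theta, n, delta)
    size = sum(comb(n, k) for k in ks)
    prob = sum(comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in ks)
    return size, prob


def is_strongly_typical(x, p, delta):
    """delta-strong typicality of the message x for the distribution p (a dict)."""
    n, type_of_x = len(x), Counter(x)
    if any(p.get(symbol, 0) == 0 for symbol in type_of_x):
        return False  # a letter of probability zero must not occur at all
    return all(abs(type_of_x[a] / n - pa) < delta for a, pa in p.items() if pa > 0)


theta, n, delta = 0.2, 200, 0.1
size, prob = typical_set_summary(theta, n, delta)
H = binary_entropy(theta)
```

For these values the typical set carries most of the probability while containing only a tiny fraction of the 2^200 possible strings. The all-zeros string, which is the single most probable message, is not in it.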
Recall that our task is to use the channel efficiently; an obvious way to do this is to seek a code that minimizes the expected length of the encoded message, or signal, passing through the channel. Remarkably, Shannon showed this:

Theorem. If a source having entropy $H(X)$ is encoded using an alphabet $B$, of size $b = |B|$, then given $\epsilon > 0$, for large enough $n$ there is an encoding function $c(\cdot)$, from $A^n$ to $B^m \cup B^k$, for some $k, m \ge 1$, such that

$$ \frac{1}{n} E|c(\mathbf{X})| \le \frac{H(X)}{\log b} + \epsilon. $$

That is to say, the expected number of signal symbols per symbol of $\mathbf{X} = (X_1, \ldots, X_n)$ is arbitrarily close to $\frac{H(X)}{\log b}$, as $n \to \infty$.
Conversely, no such invertible block encoding using B can have shorter expected length
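One way to obtain a code with just two codeword lengths, as in the theorem, is to index the typical messages with short codewords and send every other message essentially uncoded. The sketch below evaluates the expected length per source symbol of such a scheme for a binary source. It is an assumed construction given for illustration (one flag symbol followed by either an index into the typical set or the raw message), not necessarily the construction used in the notes, and the parameter values are arbitrary.

```python
from math import comb, log2


def symbols_needed(count, b):
    """Smallest m with b**m >= count, computed in exact integer arithmetic."""
    m = 0
    while b**m < count:
        m += 1
    return m


def two_length_code_rate(theta, n, delta, b=2):
    """Expected code symbols per source symbol, and the limiting value H / log2(b).

    Typical messages get a flag symbol plus a fixed-length index into the typical set;
    all other binary messages get a flag symbol plus a fixed-length raw encoding.
    """
    H = -theta * log2(theta) - (1 - theta) * log2(1 - theta)
    size, prob = 0, 0.0
    for ones in range(n + 1):
        rate = -(ones * log2(theta) + (n - ones) * log2(1 - theta)) / n
        if abs(rate - H) < delta:
            size += comb(n, ones)
            prob += comb(n, ones) * theta**ones * (1 - theta) ** (n - ones)
    m = 1 + symbols_needed(size, b)    # length used for typical messages
    k = 1 + symbols_needed(2**n, b)    # length used for untypical messages
    expected_length = prob * m + (1 - prob) * k
    return expected_length / n, H / log2(b)


rate, limit = two_length_code_rate(0.2, 200, 0.1)
```

With these parameters the rate falls clearly below one code symbol per source symbol but stays above the entropy of the source; taking δ smaller and n larger moves it towards the limit.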
These can be seen as the conditional probabilities defining a channel with input alphabet $A^r$, and output alphabet $B^r$, and this may be called the $r$th extension of the channel $p(y \mid x)$. The array $p(y \mid x)$, $x \in A$, $y \in B$ is called the channel matrix. It is stochastic, and may or may not be a square matrix. For any given input distribution $p_X(x)$, the input and output have joint distribution
p(x, y) = pX (x)p(y | x)
Thus $X$ and $Y$ have respective entropies $H(X)$ and $H(Y)$, and joint entropy $H(X, Y)$. However, in the context of a noisy channel it is natural to consider yet another entropy function: the entropy of the distribution $p(y \mid x)$ for any fixed $x$. This is given by

$$ H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x) $$

and called the conditional entropy of $Y$ given $X = x$. Note that as $x$ ranges over $A$, this defines a random variable, being a function of $X$. It therefore has an expectation, which is the expected value of the entropy in $Y$, conditional on the value of $X$, before the input symbol is supplied. It is given by

$$ H(Y \mid X) = \sum_x p_X(x) H(Y \mid X = x) = -\sum_{x,y} p(x, y) \log p(y \mid x) = -E \log p(Y \mid X) $$
[This is not a random variable of course, despite the similarity of notation with conditional expectation $E(Y \mid X)$ which is a random variable.]

Lemma. $H(X \mid Y) \ge 0$, with equality iff $X$ is a non-random function of $Y$; i.e. $X = g(Y)$ for some $g(\cdot)$.

Proof. The non-negativity is obvious. Now $H(X \mid Y) = 0$ iff $H(X \mid Y = y) = 0$ for all $y$. But any entropy is 0 iff the distribution is concentrated at a point, so that $x = g(y)$ for some $g$ and all $y$.

We return to this entropy later, but note that $H(X \mid Y)$ is of particular interest, as it represents the expected uncertainty of the receiver of the transmitted signal about what was actually sent. It has been called the equivocation. [And $H(Y \mid X)$ has been called the prevarication.]

Now recall that we mentioned two extreme cases, useless channels in which the output is noise independent of the input, and perfect channels with no noise. Obviously in intermediate cases it would be very useful to have some measure of just how good (or bad) the channel is; that is to say, how close $Y$ is to $X$, in some suitable sense. We would then (we hope) be able to choose $p_X(x)$ to make $Y$ as close to $X$ as possible, thus optimizing the channel's performance. Fortunately, we have already defined such a measure of closeness above, in the form of the relative entropy (or Kullback-Leibler divergence). We therefore judge our channel by how far it is from being useless, namely the relative entropy between $p(x, y)$ and $p_X(x)p_Y(y)$.

Definition. For random variables $X$ and $Y$ (seen as the input signal and output signal respectively), their mutual information $I(X; Y)$ is the relative entropy between $p(x, y)$ and $p_X(x)p_Y(y)$:

$$ I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p_X(x)p_Y(y)} = E \log \frac{p(X, Y)}{p_X(X)p_Y(Y)} $$
When $X$ and $Y$ are independent $I(X; Y) = 0$, and the channel is useless. When $X = Y$, $I(X; Y) = H(X)$, and the input and output have the same entropy, as expected. In intermediate cases we may choose $p_X(x)$ to get the best we can from the channel. This definition is therefore natural:

Definition. The (Shannon) capacity of a channel with input $X$ and output $Y$ is

$$ C = \max_{p_X(x)} I(X; Y) $$

As with $H(X)$, this definition will be further justified by its applications, to follow. For the moment, we note some properties of $I(X; Y)$, and its relationship to entropies.

Theorem.

$$ \begin{aligned} I(X; Y) &= H(X) + H(Y) - H(X, Y) \\ &= H(X) - H(X \mid Y) \\ &= H(Y) - H(Y \mid X) \\ &= I(Y; X) \ge 0 \end{aligned} $$

with equality in the last line if and only if $X$ and $Y$ are independent.

Proof. All follow from the definitions, except the last assertion which is a consequence of Gibbs inequality, yielding equality when $p(x, y) = p_X(x)p_Y(y)$, as required.

Corollaries

(a) $H(g(X)) \le H(X)$, with equality iff $g$ is invertible
(b) $H(X, g(X)) = H(X)$
Proof. 1-3 are trivial. For 4, recall that H(g(X) | X) = 0, so we obtain
(b) H(X, g(X)) = H(X) + H(g(X) | X) = H(X)
(a) H(X, g(X)) = H(g(X)) + H(X | g(X)) ≥ H(g(X))
with equality iff H(X | g(X)) = 0, which means X is a function of g(X), as required for the invertibility.
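The definitions of mutual information and capacity translate directly into a short computation for small alphabets. The sketch below evaluates I(X; Y) from an input distribution and a channel matrix `M[x][y]` standing for p(y|x), and approximates the capacity of a binary-input channel by a crude grid search over input distributions. The grid search, the function names and the example channels are assumptions made for illustration; the closed form used for comparison, 1 − H(0.1) for a binary symmetric channel with crossover 0.1, is the standard value.

```python
from math import isclose, log2


def mutual_information(px, M):
    """I(X;Y) in bits: sum over x, y of p(x,y) log2[p(x,y) / (pX(x) pY(y))]."""
    py = [sum(px[x] * M[x][y] for x in range(len(px))) for y in range(len(M[0]))]
    total = 0.0
    for x, p_x in enumerate(px):
        for y, p_y_given_x in enumerate(M[x]):
            joint = p_x * p_y_given_x
            if joint > 0:
                total += joint * log2(joint / (p_x * py[y]))
    return total


def capacity_binary_input(M, steps=1000):
    """Approximate max over input distributions (t, 1 - t) on a grid of steps + 1 points."""
    return max(
        mutual_information([t / steps, 1 - t / steps], M) for t in range(steps + 1)
    )


bsc = [[0.9, 0.1], [0.1, 0.9]]      # assumed example channel
perfect = [[1.0, 0.0], [0.0, 1.0]]
useless = [[0.5, 0.5], [0.5, 0.5]]

C_bsc = capacity_binary_input(bsc)
```

For the symmetric channel the maximum is attained at the uniform input, which lies on the grid, so `C_bsc` agrees with the closed form to rounding error. For channels whose optimal input is not on the grid the search only gives a lower approximation.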
Example. Shannon noiseless coding bound. Let a source sequence X = (X 1 , X 2 ,... , Xn) be encoded as binary strings by a uniquely
Sketch proof. Let $X$ and $Y$ be the input and output of the channel with capacity $C$, and let $X$ have the distribution $p_X(x)$ that actually achieves the capacity $C$. [In practical cases, $I(X; Y)$ is a continuous function on a closed bounded subset of $\mathbb{R}^a$, so the supremum over $p_X(x)$ is indeed attained.] For arbitrarily large $n$, consider the sequence of inputs $(X_1, \ldots, X_n)$, where these are i.i.d. with distribution $p_X(x)$. By typicality (the AEP), we have that
To see this another way, consider the input and output together as a single random vector with entropy $H(X, Y)$. Shannon's theorem (the AEP) shows that there are about $2^{nH(X,Y)}$ input-output pairs, which are often called jointly typical sequences. Thus there are about

$$ \frac{2^{nH(X,Y)}}{2^{nH(Y)}} = 2^{nH(X \mid Y)} $$

typical inputs per typical output on average (as seen above from the other point of view). Now suppose that there is a source producing messages with entropy rate $R_0 = H < C$; that is to say as $n$ increases it supplies about $2^{nR_0}$ typical messages of length $n$. We wish to encode these for reliable transmission through the channel of capacity $C$. Choose $R$ such that $R_0 < R < C$, and construct a coding scheme as follows:
With this codebook and decoding scheme, we pick a codeword to send at random from $\mathbf{x}(1), \ldots, \mathbf{x}(2^{nR})$. The average probability of error (averaged over random choice of codebooks and random choice of codeword to send) is therefore

$$ \begin{aligned} p_e &= P(\text{at least one codeword not equal to that sent lies in the set } S(\mathbf{Y})) \\ &\le \sum_{k=1}^{2^{nR}} P(\mathbf{x}(k) \in S(\mathbf{Y})), \quad \text{since } P(\cup A_i) \le \sum P(A_i), \\ &\simeq 2^{nR}\, \frac{2^{nH(X \mid Y)}}{2^{nH(X)}}, \end{aligned} $$

because there are about $2^{nH(X)}$ possible choices for $\mathbf{x}(k)$, of which about $2^{nH(X \mid Y)}$ are in $S(\mathbf{Y})$. Hence

$$ p_e \le 2^{nR}\, 2^{-nC}, \quad \text{since } p_X(x) \text{ achieves } C, $$

and this tends to 0 as $n \to \infty$, because $R < C$. It follows that for any $\epsilon > 0$, there exists $n < \infty$ such that there is a fixed set of codewords $\mathbf{x}(1), \ldots, \mathbf{x}(2^{nR})$ that has average (over codeword selected) error smaller than $\epsilon$. Now order these codewords by their probabilities of error, and discard the worst half (with greatest probability of error). The remaining codewords have arbitrarily small maximum probability of error, and there are $2^{n(R - \frac{1}{n})}$ codewords in the book. This exceeds $2^{nR_0}$ for large enough $n$, so the message from the source can thus be invertibly coded, with maximum probability of error as small as we choose. Note that this is purely a proof of the existence of such a codebook. There is no clue as to how we might find it (except the essentially useless method of searching through all possible codebooks).
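To get a feeling for how quickly the estimate in the sketch proof decays, the lines below evaluate 2^{nR} 2^{-nC} for a few block lengths. The channel is an assumed example, a binary symmetric channel with crossover probability 0.1, for which the standard closed form of the capacity is 1 minus the binary entropy of 0.1; the rate R = 0.4 and the function names are also assumptions.

```python
from math import log2


def binary_entropy(p):
    """Entropy in bits of a coin with bias p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)


def random_coding_estimate(n, rate, capacity):
    """The estimate 2^{nR} 2^{-nC} of the average error probability, capped at 1."""
    return min(1.0, 2.0 ** (n * (rate - capacity)))


capacity = 1 - binary_entropy(0.1)  # about 0.531 bits per channel use
rate = 0.4                          # any rate R with R < C would do
estimates = {n: random_coding_estimate(n, rate, capacity) for n in (10, 100, 1000)}
```

The decay is exponential in n whenever R < C, and the estimate is vacuous (equal to 1 after capping) as soon as R exceeds C.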
We conclude this section with a brief look at other popular decoding rules for noisy channels. [For noiseless channels decoding is clearly trivial, since the receiver always sees the codeword that was sent.] Formally, in general, we have this:

Definition. A decoder (or decoding function) $g(\cdot)$ is defined on all possible outputs $y$ of the channel, and takes values in the set of all codewords, possibly augmented by a symbol $e$ denoting that the decoder declares an error.

Example. The ideal observer (or minimum error decoder) chooses the most likely codeword given the output of the channel. Thus (in an obvious notation)

$$ g(y) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } p(c(x) \mid y) \text{ is maximal} \\ e & \text{otherwise} \end{cases} $$

This rule has a potential disadvantage, in that it is necessary to know the distribution of the codewords, $p(c)$, since

$$ p(c \mid y) = p(y \mid c)\, \frac{p(c)}{p(y)} $$

A decoder without this problem is this:

Example. Maximum likelihood decoder.

$$ g(y) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } p(y \mid c(x)) \text{ is maximal} \\ e & \text{otherwise} \end{cases} $$

This chooses the codeword that makes the received message most likely. Another way of defining decoders is to view the codewords and signal as points in the same suitable space, with a distance function $\| \cdot \|$.

Example. Minimum distance (or nearest neighbour) decoder.

$$ g(y) = \begin{cases} c(x) & \text{if there is a unique } x \text{ such that } \| c(x) - y \| \text{ is minimal} \\ e & \text{otherwise} \end{cases} $$

When alphabets are binary, a very natural distance between binary strings $c(x)$ and $y$ of length $n$ is the Hamming distance, in which $\| c(x) - y \|$ is the number of places at which they differ.
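The three decoding rules can be compared on a toy example. The sketch below assumes a binary symmetric channel with crossover probability `flip`, so that the likelihood of an output depends only on its Hamming distance from the codeword, and a two-word repetition code; these choices, the function names, and the use of the string "e" for the error symbol are all illustrative assumptions.

```python
def hamming(u, v):
    """Number of places at which two strings of equal length differ."""
    return sum(a != b for a, b in zip(u, v))


def likelihood(y, c, flip):
    """p(y | c) for a memoryless binary symmetric channel with crossover probability flip."""
    d = hamming(y, c)
    return flip**d * (1 - flip) ** (len(c) - d)


def unique_best(scores):
    """Return the codeword with the strictly largest score, or 'e' if there is a tie."""
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else "e"


def ml_decode(y, codebook, flip):
    """Maximum likelihood: maximise p(y | c)."""
    return unique_best({c: likelihood(y, c, flip) for c in codebook})


def min_distance_decode(y, codebook):
    """Nearest neighbour in Hamming distance."""
    return unique_best({c: -hamming(y, c) for c in codebook})


def ideal_observer_decode(y, prior, flip):
    """Maximise p(c | y), which is proportional to p(y | c) p(c)."""
    return unique_best({c: likelihood(y, c, flip) * pc for c, pc in prior.items()})


codebook = ["000", "111"]                # assumed example: repetition code
prior = {"000": 0.999, "111": 0.001}     # assumed, heavily skewed codeword distribution
```

On this channel with `flip` below one half, maximum likelihood and minimum distance decoding agree. The ideal observer can disagree with both when the prior is far from uniform: for the received string "110" it returns "000" under the skewed prior above.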
Furthermore, the mutual information is defined and behaves likewise. Definition. I(X; Y ) = H(X) + H(Y ) − H(X, Y ) = d(f (x, y), fX (x)fY (y)) ≥ 0 with equality if and only if X and Y are independent.
Example. $Q = (X, Y)$ is a random point uniformly distributed in the square $S$ determined by the four points having Cartesian coordinates $(0, \pm 1)$, $(\pm 1, 0)$. What is the information conveyed about $X$ by $Y$?

Solution. The square has area 2, so the joint and marginal densities are $f(x, y) = \frac{1}{2}$ on $S$; $f_X(x) = 1 - |x|$; $f_Y(y) = 1 - |y|$. Hence, with natural logarithms,

$$ \begin{aligned} I(X; Y) &= E \log \frac{f(X, Y)}{f_X(X) f_Y(Y)} = -\log 2 - E \log\{(1 - |X|)(1 - |Y|)\} \\ &= -\log 2 - \iint_S \log(1 - |x|)\, dx\, dy, \quad \text{by symmetry} \\ &= -\log 2 - 4 \int_0^1 \int_0^{1-x} \log(1 - x)\, dy\, dx, \quad \text{also by symmetry} \\ &= -\log 2 + 4 \left[ \tfrac{1}{2}(1 - x)^2 \log(1 - x) \right]_0^1 + 4 \int_0^1 \tfrac{1}{2}(1 - x)\, dx \\ &= 1 - \log_e 2 \simeq 0.31 \text{ nats} \simeq 0.44 \text{ bits} \end{aligned} $$
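The value 1 − log_e 2 can be confirmed numerically by integrating the defining expectation directly over one quadrant of the square and multiplying by four. This is a rough midpoint-rule sketch; the function name and the grid size are arbitrary assumptions, and the result is only accurate to a few decimal places.

```python
from math import log


def diamond_mutual_information(steps=200):
    """Midpoint-rule estimate, in nats, of E log[f / (fX fY)] over |x| + |y| <= 1."""
    total = 0.0
    dx = 1.0 / steps
    for i in range(steps):
        x = (i + 0.5) * dx
        dy = (1.0 - x) / steps      # the quadrant is 0 <= y <= 1 - x
        for j in range(steps):
            y = (j + 0.5) * dy
            joint = 0.5             # uniform density on a square of area 2
            ratio = joint / ((1.0 - x) * (1.0 - y))
            total += joint * log(ratio) * dx * dy
    return 4.0 * total              # four congruent quadrants


I_nats = diamond_mutual_information()
I_bits = I_nats / log(2)
```

The estimate agrees with the closed form to well within the stated tolerance of the worked example.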
Note that if $S$ had been the square $(\pm 1, \pm 1)$, then $I(X; Y) = 0$, as $X$ and $Y$ are then independent. But the covariance of $X$ and $Y$ is zero in both cases, so $I(X; Y)$ is a better measure of association from one point of view.

Finally, we note one important difference between $H(X)$ and $h(X)$. When $X$ is simple and $g(X)$ is a one-one invertible function of $X$, we have $H(X) = H(g(X))$. This is not necessarily true for differential entropy.

Example. Let $a \ne 0$ be constant, and let $X$ have differential entropy $h(X)$. Then $Y = aX$ has density

$$ f_Y(y) = \frac{1}{|a|}\, f_X\!\left( \frac{y}{a} \right) $$

and

$$ h(Y) = h(aX) = -\int \frac{1}{|a|}\, f_X\!\left( \frac{y}{a} \right) \left\{ \log f_X\!\left( \frac{y}{a} \right) - \log |a| \right\} dy = h(X) + \log |a| $$

And, more generally, if $X_n$ is a random $n$-vector, and $A$ an $n \times n$ matrix with non-zero determinant $\det A$, then $h(AX_n) = h(X_n) + \log |\det A|$
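The scaling rule h(aX) = h(X) + log |a| can be checked numerically for a specific density. The sketch below uses the triangular density 1 − |x| on [−1, 1] (the marginal from the previous example), whose differential entropy is 1/2 nat, and an arbitrarily chosen scale factor a = −3; the integration routine is a plain midpoint rule and its name is hypothetical.

```python
from math import log


def triangular_density(x):
    """f_X(x) = 1 - |x| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(x))


def differential_entropy(density, lo, hi, steps=2000):
    """Midpoint-rule estimate, in nats, of -integral of f log f over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * width)
        if f > 0:
            total -= f * log(f) * width
    return total


a = -3.0  # assumed scale factor; any non-zero constant would do
h_X = differential_entropy(triangular_density, -1.0, 1.0)
h_Y = differential_entropy(
    lambda y: triangular_density(y / a) / abs(a), -abs(a), abs(a)
)
```

Here `h_Y - h_X` matches log 3 to rounding error, so stretching the variable changes its differential entropy, in contrast to the discrete case.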
While not essential in mathematics, interpretations of mathematical concepts are usually welcome, because they lend plausibility to axioms, and suggest which theorems should be most interesting. So we note that the concepts of entropy and mutual information defined above can be interpreted as measures of our real-world concepts of uncertainty, surprise, and information. We argue as follows:
(a) We defined the entropy $H(X)$ of the random variable $X$ having probability distribution $p(x)$ to be the expected value of the empirical log-likelihood: $H(X) = E\{-\log p(X)\}$. Here is an intuitive interpretation of $H(X)$. Suppose that $E$ is some event that may occur with probability $p = P(E)$. In advance of the relevant experiment we have some level of uncertainty about whether $E$ will occur or not, and if later informed that $E$ has occurred we feel some measure of surprise. The key point is that both our uncertainty and surprise vary according to $P(E)$. To see this consider $E$ and $E^c$ with probabilities $P(E) = 10^{-6}$, and $P(E^c) = 1 - 10^{-6}$. We feel rather more uncertain about the occurrence of $E$ than $E^c$, and equally we would be rather more surprised at the occurrence of $E$ than $E^c$. Since it is the transfer of information that has resolved the uncertainty and created surprise, all of these depend on $P(E) = p$. We claim that the following are intuitively natural properties of the surprise $s(E)$ that we feel about $E$'s occurrence.

(i) It depends only on $p$, and not further on the value of any random variable defined on $E$, nor on any meaning conveyed by $E$, nor any other semantic aspect of $E$. That is to say, $s(E) = u(p)$, for some function $u(p)$, $0 \le p \le 1$, taking numerical values. For example, consider the events
$E_1$ = you win £$10^6$ with probability $10^{-6}$.
$E_2$ = you are struck by lightning with probability $10^{-6}$.
Obviously your feelings (semantic connotations) about these two events, and the random outcomes defined on them, are very different. But you are equally surprised in each case.
(ii) The function $u(p)$ is decreasing in $p$. That is to say, you are more surprised by more unlikely events when they occur.
(iii) The surprise occasioned by the occurrence of independent events is the sum of their surprises. That is to say, if $A$ and $B$ are independent with probabilities $p$ and $q$, then $s(A \cap B) = u(pq) = u(p) + u(q)$.
(iv) The surprise $u(p)$ varies continuously with $p$.
(v) There is no surprise in a certain event, so $s(\Omega) = u(1) = 0$.
From these it follows (by some analysis which we omit) that for some constant c > 0
u(p) = −c log p