Chapter 4 Parameter Estimation, Lecture notes of Statistics

The MLE is also an intuitive and unbiased estimator for the means of normal and. Poisson distributions. Likelihood as function of data or model parameters? In ...

Typology: Lecture notes

2021/2022

Uploaded on 08/01/2022

hal_s95
hal_s95 🇵🇭

4.4

(655)

10K documents

1 / 26

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Chapter 4
Parameter Estimation
Thus far we have concerned ourselves primarily with probability theory: what events may
occur with what probabilities, given a model family and choices for the parameters. This is
useful only in the case where we know the precise model family and parameter values for the
situation of interest. But this is the exception, not the rule, for both scientific inquiry and
human learning & inference. Most of the time, we are in the situation of processing data
whose generative source we are uncertain about. In Chapter 2 we briefly covered elemen-
tary density estimation, using relative-frequency estimation, histograms and kernel density
estimation. In this chapter we delve more deeply into the theory of probability density es-
timation, focusing on inference within parametric families of probability distributions (see
discussion in Section 2.11.2). We start with some important properties of estimators, then
turn to basic frequentist parameter estimation (maximum-likelihood estimation and correc-
tions for bias), and finally basic Bayesian parameter estimation.
4.1 Introduction
Consider the situation of the first exposure of a native speaker of American English to
an English variety with which she has no experience (e.g., Singaporean English), and the
problem of inferring the probability of use of active versus passive voice in this variety with
a simple transitive verb such as hit:
(1) The ball hit the window. (Active)
(2) The window was hit by the ball. (Passive)
There is ample evidence that this probability is contingent on a number of features of the
utterance and discourse context (e.g., Weiner and Labov, 1983), and in Chapter 6 we cover
how to construct such richer models, but for the moment we simplify the problem by assuming
that active/passive variation can be modeled with a binomial distribution (Section 3.4) with
parameter πcharacterizing the probability that a given potentially transitive clause eligible
51
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a

Partial preview of the text

Download Chapter 4 Parameter Estimation and more Lecture notes Statistics in PDF only on Docsity!

Chapter 4

Parameter Estimation

Thus far we have concerned ourselves primarily with probability theory: what events may occur with what probabilities, given a model family and choices for the parameters. This is useful only in the case where we know the precise model family and parameter values for the situation of interest. But this is the exception, not the rule, for both scientific inquiry and human learning & inference. Most of the time, we are in the situation of processing data whose generative source we are uncertain about. In Chapter 2 we briefly covered elemen- tary density estimation, using relative-frequency estimation, histograms and kernel density estimation. In this chapter we delve more deeply into the theory of probability density es- timation, focusing on inference within parametric families of probability distributions (see discussion in Section 2.11.2). We start with some important properties of estimators, then turn to basic frequentist parameter estimation (maximum-likelihood estimation and correc- tions for bias), and finally basic Bayesian parameter estimation.

4.1 Introduction

Consider the situation of the first exposure of a native speaker of American English to an English variety with which she has no experience (e.g., Singaporean English), and the problem of inferring the probability of use of active versus passive voice in this variety with a simple transitive verb such as hit:

(1) The ball hit the window. (Active)

(2) The window was hit by the ball. (Passive)

There is ample evidence that this probability is contingent on a number of features of the utterance and discourse context (e.g., Weiner and Labov, 1983), and in Chapter 6 we cover how to construct such richer models, but for the moment we simplify the problem by assuming that active/passive variation can be modeled with a binomial distribution (Section 3.4) with parameter π characterizing the probability that a given potentially transitive clause eligible

for passivization will in fact be realized as a passive.^1 The question faced by the native American English speaker is thus, what inferences should we make about π on the basis of limited exposure to the new variety? This is the problem of parameter estimation, and it is a central part of statistical inference. There are many different techniques for parameter estimation; any given technique is called an estimator, which is applied to a set of data to construct an estimate. Let us briefly consider two simple estimators for our example.

Estimator 1. Suppose that our American English speaker has been exposed to n transi- tive sentences of the variety, and m of them have been realized in the passive voice in eligible clauses. A natural estimate of the binomial parameter π would be m/n. Because m/n is the relative frequency of the passive voice, this is known as the relative frequency estimate (RFE; see Section 2.11.1). In addition to being intuitive, we will see in Section 4.3.1 that the RFE can be derived from deep and general principles of optimality in estimation procedures. However, RFE also has weaknesses. For instance, it makes no use of the speaker’s knowledge of her native English variety. In addition, when n is small, the RFE is unreliable: imagine, for example, trying to estimate π from only two or three sentences from the new variety.

Estimator 2. Our speaker presumably knows the probability of a passive in American English; call this probability q. An extremely simple estimate of π would be to ignore all new evidence and set π = q, regardless of how much data she has on the new variety. Although this option may not be as intuitive as Estimator 1, it has certain advantages: it is extremely reliable and, if the new variety is not too different from American English, reasonably accu- rate as well. On the other hand, once the speaker has had considerable exposure to the new variety, this approach will almost certainly be inferior to relative frequency estimation. (See Exercise to be included with this chapter.)

In light of this example, Section 4.2 describes how to assess the quality of an estimator in conceptually intuitive yet mathematically precise terms. In Section 4.3, we cover fre- quentist approaches to parameter estimation, which involve procedures for constructing point estimates of parameters. In particular we focus on maximum-likelihood estimation and close variants, which for multinomial data turns out to be equivalent to Estimator 1 above.In Section 4.4, we cover Bayesian approaches to parameter estimation, which involve placing probability distributions over the range of possible parameter values. The Bayesian estimation technique we will cover can be thought of as intermediate between Estimators 1 and 2.

4.2 Desirable properties for estimators

In this section we briefly cover three key properties of any estimator, and discuss the desir- ability of these properties.

(^1) By this probability we implicitly conditionalize on the use of a transitive verb that is eligible for pas- sivization, excluding intransitives and also unpassivizable verbs such as weigh.

only the first n/2. But why is this the case? Regardless of how many utterances she uses with Estimator 1, her estimate will be unbiased (think about this carefully if you are not immediately convinced). But our intuitions suggest that an estimate using less data is less reliable: it is likely to vary more dramatically due to pure freaks of chance. It is useful to quantify this notion of reliability using a natural statistical metric: the variance of the estimator, Var(θˆ) (Section 4.2.3). All else being equal, an estimator with smaller variance is preferable to one with greater variance. This idea, combined with a bit more simple algebra, quantitatively explains the intuition that more data is better for Estimator 1:

Var(ˆπ) = Var

(m

n

n^2

Var(m) (From scaling a random variable, Section 3.3.3)

Since m is binomially distributed, and the variance of the binomial distribution is nπ(1 − π) (Section 3.4), so we have

Var(ˆπ) =

π(1 − π) n

So variance is inversely proportional to the sample size n, which means that relative frequency estimation is more reliable when used with larger samples, consistent with intuition. It is almost always the case that each of bias and variance comes at the cost of the other. This leads to what is sometimes called bias-variance tradeoff: one’s choice of estimator may depend on the relative importance of expected accuracy versus reliability in the task at hand. The bias-variance tradeoff is very clear in our example. Estimator 1 is unbiased, but has variance that can be quite high when samples size n is small. Estimator 2 is biased, but it has zero variance. Which of the two estimators is preferable is likely to depend on the sample size. If our speaker anticipates that she will have very few examples of transitive sentences in the new English variety to go on, and also anticipates that the new variety will not be hugely different from American English, she may well prefer (and with good reason) the small bias of Estimator 2 to the large variance of Estimator 1. The lower-variance of two estimators is called the more efficient estimator, and the efficiency of one estimator η 1 relative to another estimator η 2 is the ratio of their variances, Var(ˆθη 1 )/Var(θˆη 2 ).

4.3 Frequentist parameter estimation and prediction

We have just covered a simple example of parameter estimation and discussed key proper- ties of estimators, but the estimators we covered were (while intuitive) given no theoretical underpinning. In the remainder of this chapter, we will cover a few major mathematically mo- tivated estimation techniques of general utility. This section covers frequentist estimation techniques. In frequentist statistics, an estimator gives a point estimate for the parameter(s)

of interest, and estimators are preferred or dispreferred on the basis of their general behavior, notably with respect to the properties of consistency, bias, and variance discussed in Sec- tion 4.2. We start with the most widely-used estimation technique, maximum-likelihood estimation.

4.3.1 Maximum Likelihood Estimation

We encountered the notion of the likelihood in Chapter 2, a basic measure of the quality of a set of predictions with respect to observed data. In the context of parameter estimation, the likelihood is naturally viewed as a function of the parameters θ to be estimated, and is defined as in Equation (2.29)—the joint probability of a set of observations, conditioned on a choice for θ—repeated here:

Lik(θ; y) ≡ P (y|θ) (4.2)

Since good predictions are better, a natural approach to parameter estimation is to choose the set of parameter values that yields the best predictions—that is, the parameter that maximizes the likelihood of the observed data. This value is called the maximum likelihood estimate (MLE), defined formally as:^2

θ^ ˆM LE^ def = arg max θ

Lik(θ; y) (4.3)

In nearly all cases, the MLE is consistent (Cramer, 1946), and gives intuitive results. In many common cases, it is also unbiased. For estimation of multinomial probabilities, the MLE also turns out to be the relative-frequency estimate. Figure 4.2 visualizes an example of this. The MLE is also an intuitive and unbiased estimator for the means of normal and Poisson distributions.

Likelihood as function of data or model parameters?

In Equation (4.2) I defined the likelihood as a function first and foremost of the parameters of one’s model. I did so as

4.3.2 Limitations of the MLE: variance

As intuitive and general-purpose as it may be, the MLE has several important limitations, hence there is more to statistics than maximum-likelihood. Although the MLE for multi- nomial distributions is unbiased, its variance is problematic for estimating parameters that determine probabilities of events with low expected counts. This can be a major problem

(^2) The expression arg maxx f (x) is defined as “the value of x that yields the maximum value for the ex- pression f (x).” It can be read as “arg-max over x of f (x).”

novel linguistic expression currently in use today, such as the increasingly familiar phrase the boss of me, as in:^3

(3) “You’re too cheeky,” said Astor, sticking out his tongue. “You’re not the boss of me.” (Tool, 1949, cited in Language Log by Benjamin Zimmer, 18 October 2007)

The only direct evidence for such expressions is, of course, attestations in written or recorded spoken language. Suppose that the linguist had collected 60 attestations of the expression, the oldest of which was recored 120 years ago. From a probabilistic point of view, this problem involves choosing a probabilistic model whose generated observations are n attestation dates y of the linguistic expression, and one of whose parameters is the earliest time at which the expression is coined, or t 0. When the problem is framed this way, the linguist’s problem is to devise a procedure for constructing a parameter estimate ˆt 0 from observations. For expository purposes, let us oversimplify and use the uniform distribution as a model of how attestation dates are generated.^4 Since the innovation is still in use today (time tnow), the parameters of the uniform distribution are [t 0 , tnow] and the only parameter that needs to be estimated is tnow. Let us arrange our attestation dates in chronological order so that the earliest date is y 1. What is the maximum-likelihood estimate ˆt 0? For a given choice of t 0 , a given date yi either falls in the interval [t 0 , tnow] or it does not. From the definition of the uniform distribution (Section 2.7.1) we have:

P (yi|t 0 , tnow) =

tnow −t 0 t^0 ≤^ yi^ ≤^ tnow 0 otherwise

Due to independence, the likelihood for the interval boundaries is Lik(t 0 ) =

i P^ (yi|t^0 , tnow). This means that for any choice of interval boundaries, if at least one date lies before t 0 , the entire likelihood is zero! Hence the likelihood is non-zero only for interval boundaries containing all dates. For such boundaries, the likelihood is

Lik(t 0 ) =

∏^ n

i=

tnow − t 0

(tnow − t 0 )n^

This likelihood grows larger as tnow − t 0 grows smaller, so it will be maximized when the interval length tnow − t 0 is as short as possible—namely, when t 0 is set to the earliest attested

(^3) This phrase has been the topic of intermittent discussion on the Language Log blog since 2007. (^4) This is a dramatic oversimplification, as it is well known that linguistic innovations prominent enough for us to notice today often followed an S-shaped trajectory of usage frequency (Bailey, 1973; Cavall-Sforza and Feldman, 1981; Kroch, 1989; Wang and Minett, 2005). However, the general issue of bias in maximum-likelihood estimation present in the oversimplified uniform-distribution model here also carries over to more complex models of the diffusion of linguistic innovations.

y 1 yn

tnow − y 1

t 0 tnow

tnow − t 0

y 2 , · · · , yn Figure 4.3: The bias of the MLE for uniform distributions

0.65 0 20 40 60 80 100

sample size

Average MLE interval length as a proportion of true interval length 0 20 40 60 80 100

sample size

Average MLE interval length as a proportion of true interval length

Figure 4.4: Bias of the MLE (left) and the bias-corrected estimator (right), shown numerically using 500 simulations for each value of n.

date y 1. This fact is illustrated in Figure 4.3: the tighter the posited interval between t 0 and tnow, the greater the resulting likelihood. You probably have the intuition that this estimate of the contact interval duration is conservative: certainly the novel form appeared in English no later than t 0 , but it seems rather unlikely that the first use in the language was also the first attested use!^5 This intuition is correct, and its mathematical realization is that the MLE for interval boundaries of a uniform distribution is biased. Figure 4.4 visualizes this bias in terms of average interval length (over a number of samples) as a function of sample size. For any finite sample size, the MLE is biased to underestimate true interval length, although this bias decreases as sample size increases (as well it should, because the MLE is a consistent estimator). Fortunately, the size of the MLE’s bias can be quantified analytically: the expected ML-estimated interval size is (^) nn+1 times the true interval size.Therefore, if we adjust the MLE by multiplying it by n+1 n , we arrive at an unbiased estimator for interval length. The correctness of this adjustment is confirmed by the right-hand plot in Figure 4.4. In the case of our historical linguist with three recovered documents, we achieve the estimate

(^5) The intuition may be different if the first attested use was by an author who is known to have introduced a large number of novel expressions into the language which subsequently gained in popularity. This type of situation would point to a need for a more sophisticated probabilistic model of innovation, diffusion, and attestation.

I θ Y

Figure 4.7: The structure of a simple Bayesian model. Observable data Y and prior beliefs I are conditionally independent given the model parameters.

S^2 =

N

N − 1

σˆ^2 M LE (4.9)

N − 1

i

(yi − ˆμ)^2 (4.10)

This is the most frequently used estimate of the underlying variance of a normal distribution from a sample. In R, for example, the function var(), which is used to obtain sample variance, computes S^2 rather than ˆσM LE. An example of estimating normal densities is shown in Figure 4.6, using F3 formants from 15 native English-speaking children on the vowel [æ]. The MLE density estimate is a slightly narrower curve than the bias-adjusted estimate.

4.4 Bayesian parameter estimation and density esti-

mation

In frequentist statistics as we have discussed thus far, one uses observed data to construct a point estimate for each model parameter. The MLE and bias-adjusted version of the MLE are examples of this. In Bayesian statistics, on the other hand, parameter estimation involves placing a probability distribution over model parameters. In fact, there is no concep- tual difference between parameter estimation (inferences about θ) and prediction or density estimation (inferences about future y) in Bayesian statistics.

4.4.1 Anatomy of inference in a simple Bayesian model

A simple Bayesian model has three components. Observable data are generated as random variables y in some model from a model family with parameters θ. Prior to observing a particular set of data, however, we already have beliefs/expectations about the possible model parameters θ; we call these beliefs I. These beliefs affect y only through the mediation of the model parameters—that is, y and I are conditionally independent given θ (see Section 2.4.2). This situation is illustrated in Figure 6.1, which has a formal interpretation as a graphical model (Appendix C). In the Bayesian framework, both parameter estimation and density estimation simply involve the application of Bayes’ rule (Equation (2.5)). For example, parameter estimation means calculating the probability distribution over θ given observed data y and our prior beliefs I. We can use Bayes rule to write this distribution as follows:

P (θ|y, I) =

P (y|θ, I)P (θ|I) P (y|I)

Likelihood for θ ︷ ︸︸ ︷ P (y|θ)

Prior over θ ︷ ︸︸ ︷ P (θ|I) P (y|I) ︸ ︷︷ ︸ Likelihood marginalized over θ

(because y ⊥ I | θ) (4.12)

The numerator in Equation (4.12) is composed of two quantities. The first term, P (y|θ), should be familiar from Section 2.11.5: it is the likelihood of the parameters θ for the data y. As in much of frequentist statistics, the likelihood plays a central role in parameter estimation in Bayesian statistics. However, there is also a second term, P (θ|I), the prior distribution over θ given only I. The complete quantity (4.12) is the posterior distribution over θ. It is important to realize that the terms “prior” and “posterior” in no way imply any temporal ordering on the realization of different events. The only thing that P (θ|I) is “prior” to is the incorporation of the particular dataset y into inferences about θ. I can in principle incorporate all sorts of knowledge, including other data sources, scientific intuitions, or—in the context of language acquisition—innate biases. Finally, the denominator is simply the marginal likelihood P (y|I) =

θ P^ (y|θ)P^ (θ|I)^ dθ^ (it is the model parameters^ θ^ that are being marginalized over; see Section 3.2). The data likelihood is often the most difficult term to calculate, but in many cases its calculation can be ignored or circumvented because we can accomplish everything we need by computing posterior distributions up to a normalizing constant (Section 2.8; we will see an new example of this in the next section). Since Bayesian inference involves placing probability distributions on model parameters, it becomes useful to work with probability distributions that are specialized for this purpose. Before we move on to our first simple example of Bayesian parameter and density estimation, we’ll now introduce one of the simplest (and most easily interpretable) such probability distributions: the beta distribution.

4.4.2 The beta distribution

The beta distribution is important in Bayesian statistics involving binomial distributions. It has two parameters α 1 , α 2 and is defined as follows:

P (π|α 1 , α 2 ) =

B(α 1 , α 2 )

πα^1 −^1 (1 − π)α^2 −^1 (0 ≤ π ≤ 1 , α 1 > 0 , α 2 > 0) (4.13)

where the beta function B(α 1 , α 2 ) (Section B.1) serves as a normalizing constant:

B(α 1 , α 2 ) =

0

πα^1 −^1 (1 − π)α^2 −^1 dπ (4.14)

given transitive sentence will be in the passive voice. For Bayesian statistics, we must first specify the beliefs I that characterize the prior distribution P (θ|I) to be held before any data from the new English variety is incorporated. In principle, we could use any proper probability distribution on the interval [0, 1] for this purpose, but here we will use the beta distribution (Section 4.4.2). In our case, specifying prior knowledge I amounts to choosing beta distribution parameters α 1 and α 2. Once we have determined the prior distribution, we are in a position to use a set of observations y to do parameter estimation. Suppose that the observations y that our speaker has observed are comprised of n total transitive sentences, m of which are passivized. Let us simply instantiate Equation (4.12) for our particular problem:

P (π|y, α 1 , α 2 ) =

P (y|π)P (π|α 1 , α 2 ) P (y|α 1 , α 2 )

The first thing to notice here is that the denominator, P (y|α 1 , α 2 ), is not a function of π. That means that it is a normalizing constant (Section 2.8). As noted in Section 4.4, we can often do everything we need without computing the normalizing constant, here we ignore the denominator by re-expressing Equation (4.15) in terms of proportionality:

P (π|y, α 1 , α 2 ) ∝ P (y|π)P (π|α 1 , α 2 )

From what we know about the binomial distribution, the likelihood is P (y|π) =

(n m

πm(1 − π)n−m, and from what we know about the beta distribution, the prior is P (π|α 1 , α 2 ) = 1 B(α 1 ,α 2 ) π

α 1 − (^1) (1 − π)α 2 − (^1). Neither (n m

nor B(α 1 , α 2 ) is a function of π, so we can also ignore them, giving us

P (π|y, α 1 , α 2 ) ∝

Likelihood ︷ ︸︸ ︷ πm(1 − π)n−m

︷ Prior︸︸ ︷ πα^1 −^1 (1 − π)α^2 −^1 ∝ πm+α^1 −^1 (1 − π)n−m+α^2 −^1 (4.16)

Now we can crucially notice that the posterior distribution on π itself has the form of a beta distribution (Equation (4.13)), with parameters α 1 + m and α 2 + n − m. This fact that the posterior has the same functional form as the prior is called conjugacy; the beta distribution is said to be conjugate to the binomial distribution. Due to conjugacy, we can circumvent the work of directly calculating the normalizing constant for Equation (4.16), and recover it from what we know about beta distributions. This gives us a normalizing constant of B(α 1 + m, α 2 + n − m). Now let us see how our American English speaker might apply Bayesian inference to estimating the probability of passivization in the new English variety. A reasonable prior distribution might involve assuming that the new variety could be somewhat like American English. Approximately 8% of spoken American English sentences with simple transitive

0.0 0.2 0.4 0.6 0.8 1.

0

2

4

6

8

p

P(p)

0

1

Likelihood(p)

Prior Likelihood (n=7) Posterior (n=7) Likelihood (n=21) Posterior (n=21)

Figure 4.9: Prior, likelihood, and posterior distributions over π. Note that the likelihood has been rescaled to the scale of the prior and posterior; the original scale of the likelihood is shown on the axis on the right.

verbs are passives (Roland et al., 2007), hence our speaker might choose α 1 and α 2 such that the mode of P (π|α 1 , α 2 ) is near 0.08. A beta distribution has a mode if α 1 , α 2 > 1, in which case the mode is (^) α 1 α+^1 α− 21 − 2 , so a reasonable choice might be α 1 = 3, α 2 = 24, which

puts the mode of the prior distribution at 252 = 0.08.^6 Now suppose that our speaker is exposed to n = 7 transitive verbs in the new variety, and two are passivized (m = 2). The posterior distribution will then be beta-distributed with α 1 = 3 + 2 = 5, α 2 = 24 + 5 = 29. Figure 4.9 shows the prior distribution, likelihood, and posterior distribution for this case, and also for the case where the speaker has been exposed to three times as much data in similar proportions (n = 21, m = 6). In the n = 7, because the speaker has seen relatively little data, the prior distribution is considerably more peaked than the likelihood, and the posterior distribution is fairly close to the prior. However, as our speaker sees more and more data, the likelihood becomes increasingly peaked, and will eventually dominate in the behavior of the posterior (See Exercise to be included with this chapter). In many cases it is useful to summarize the posterior distribution into a point estimate of the model parameters. Two commonly used such point estimates are the mode (which

(^6) Compare with Section 4.3.1—the binomial likelihood function has the same shape as a beta distribution!

This expression can be reduced to

P (r|k, I, y) =

k r

)∏r− 1 i=0 (α^1 +^ m^ +^ i)^

∏k−r− 1 ∏^ i=0^ (α^2 +^ n^ −^ m^ +^ i) k− 1 i=0 (α^1 +^ α^2 +^ n^ +^ i)^

k r

B(α 1 + m + r, α 2 + n − m + k − r) B(α 1 + m, α 2 + n − m)

which is an instance of what is known as the beta-binomial model. The expression may seem formidable, but experimenting with specific values for k and r reveals that it is simpler than it may seem. For a single trial (k = 1), for example, this expression reduces to P (r = 1|k, I, y) = (^) α 1 α+^1 +α 2 m+n , which is exactly what would be obtained by using the posterior mean. For two trials (k = 2), we would have P (r = 1|k, I, y) = 2 (^) (α 1 +(αα^12 ++mn)()(αα^21 ++nα− 2 +mn)+1) , which is slightly less than what would be obtained by using the posterior mean.^7 This probability mass lost from the r = 1 outcome is redistributed into the more extreme r = 0 and r = 2 outcomes. For k > 1 trials in general, the beta-binomial model leads to density estimates of greater variance—also called dispersion in the modeling context—than for the binomial model using posterior mean. This is illustrated in Figure 4.10. The reason for this greater dispersion is that different future trials are only conditionally independent given a fixed choice of the binomial parameter π. Because there is residual uncertainty about this parameter, successes on different future trials are positively correlated in the Bayesian prediction despite the fact that they are conditionally independent given the underlying model parameter (see also Section 2.4.2 and Exercise 2.2). This is an important property of a wide variety of models which involve marginalization over intermediate variables (in this case the binomial parameter); we will return to this in Chapter 8 and later in the book.

4.5 Computing approximate Bayesian inferences with

sampling techniques

In the example of Bayesian inference given in Section 4.4.3, we were able to express both (i) the posterior probability over the binomial parameter π, and (ii) the probability distri- bution over new observations as the closed-form expressions^8 shown in Equations (4.16) (^7) With the posterior mean, the term (α 1 + α 2 + n + 1) in the denominator would be replaced by another instance of (α 1 + α 2 + n), giving us

P (r = 1|k, πˆ) = (α^1 +^ m)(α^2 +^ n^ −^ m) (α 1 + α 2 + n)^2 (4.20)

(^8) A closed-form expression is one that can be written exactly as a combination of a finite number of “well-known” functions (such as polynomials, logarithms, exponentials, and so forth).

0 10 20 30 40 50

k

P(k passives out of 50 trials | y, I)

Binomial Beta−Binomial

Figure 4.10: The beta-binomial model has greater dispersion than the binomial model. Re- sults shown for α 1 + m = 5, α 2 + n − m = 29.

and (4.20) respectively. We were able to do this due to the conjugacy of the beta distri- bution to the binomial distribution. However, it will sometimes be the case that we want to perform Bayesian inferences but don’t have conjugate distributions to work with. As a simple example, let us turn back to a case of inferring the ordering preference of an English binomial, such as {radio, television}. The words in this particular binomial differ in length (quantified as, for example, number of syllables), and numerous authors have suggested that a short-before-long metrical constraint is one determinant of ordering preferences for En- glish binomials (Cooper and Ross, 1975; Pinker and Birdsong, 1979, inter alia). Our prior knowledge therefore inclines us to expect a preference for the ordering radio and television (abbreviated as r) more strongly than a preference for the ordering television and radio (t), but we may be relatively agnostic as to the particular strength of the ordering preference. A natural probabilistic model here would be the binomial distribution with success parameter π, and a natural prior might be one which is uniform within each of the ranges 0 ≤ π ≤ 0. 5 and 0. 5 < π < 1, but twice as large in the latter range as in the former range. This would be the following prior:

p(π = x) =

2 3 0 ≤^ x^ ≤^0.^5 4 3 0.^5 < x^ ≤^1 0 otherwise

which is a step function, illustrated in Figure 4.11a. In such cases, there are typically no closed-form expressions for the posterior or predic- tive distributions given arbitrary observed data y. However, these distributions can very

pA ~ dunif(0,0.5) pB ~ dunif(0.5,1) i ~ dbern(2/3) p <- (1 - i) * pA + i * pB

is a way of encoding the step-function prior of Equation (4.21). The first two lines say that there are two random variables, pA and pB, drawn from uniform distributions on [0, 0 .5] and [0. 5 , 1] respectively. The next two lines say that the success parameter p is equal to pA 23 of the time, and is equal to pB otherwise. These four lines together encode the prior of Equation (4.21). Finally, the last two lines say that there are two more random variables parameterized by p: a single token (prediction1) and the number of r outcomes in ten more tokens (prediction2). There are several incarnations of BUGS, but here we focus on a newer incarnation, JAGS, that is open-source and cross-platform. JAGS can interface with R through the R library rjags.^9 Below is a demonstration of how we can use BUGS through R to estimate the posteriors above with samples.

> ls() > rm(i,p) > set.seed(45) > # first, set up observed data > response <- c(rep(1,5),rep(0,5)) > # now compile the BUGS model > m <- jags.model("../jags_examples/asymm_binomial_prior/asymm_binomial_prior.bug",data > # initial period of running the model to get it converged > update(m,1000) > # Now get samples > res <- coda.samples(m, c("p","prediction1","prediction2"), thin = 20, n.iter=5000) > # posterior predictions not completely consistent due to sampling noise > print(apply(res[[1]],2,mean)) > posterior.mean <- apply(res[[1]],2,mean)

> plot(density(res[[1]][,1]),xlab=expression(pi),ylab=expression(paste("p(",pi,")")))

> # plot posterior predictive distribution 2 > preds2 <- table(res[[1]][,3]) > plot(preds2/sum(preds2),type='h',xlab="r",ylab="P(r|y)",lwd=4,ylim=c(0,0.25)) > posterior.mean.predicted.freqs <- dbinom(0:10,10,posterior.mean[1]) > x <- 0:10 + 0. > arrows(x, 0, x, posterior.mean.predicted.freqs,length=0,lty=2,lwd=4,col="magenta") > legend(0,0.25,c(expression(paste("Marginizing over ",pi)),"With posterior mean"),lty=

(^9) JAGS can be obtained freely at http://calvin.iarc.fr/~martyn/software/jags/, and rjags at http://cran.r-project.org/web/packages/rjags/index.html.

0.0 0.2 0.4 0.6 0.8 1.

0.^ 0.^ 1.^

π

P(π)

(a) Prior over π

0.0 0.2 0.4 0.6 0.8 1.

0

1

2

3

density.default(x = res[[1]][, 1])

π

p(π)

(b) Posterior over π

0.^ 0.^

0.^ 0.^

r

P(r|y)

0 1 2 3 4 5 6 7 8 9 10

Marginizing over With posterior mean π

(c) Posterior predictive dis- tribution for N = 10, marginalizing over π versus using posterior mean

Figure 4.11: A non-conjugate prior for the binomial distribution: prior distribution, posterior over π, and predictive distribution for next 10 outcomes

Two important notes on the use of sampling: first, immediately after compiling we specify a “burn-in” period of 1000 iterations to bring the Markov chain to a “steady state” with:^10

update(m,1000)

Second, there can be autocorrelation in the Markov chain: samples near to one another in time are non-independent of one another.^11 In order to minimize the bias in the estimated probability density, we’d like to minimize this autocorrelation. We can do this by sub-sampling or “thinning” the Markov chain, in this case taking only one out of every 20 samples from the chain as specified by the argument thin = 20 to coda.samples(). This reduces the autocorrelation to a minimal level. We can get a sense of how bad the autocorrelation is by taking an unthinned sample and computing the autocorrelation at a number of time lags:

> m <- jags.model("../jags_examples/asymm_binomial_prior/asymm_binomial_prior.bug",data > # initial period of running the model to get it converged > update(m,1000) > res <- coda.samples(m, c("p","prediction1","prediction2"), thin = 1, n.iter=5000) > autocorr(res,lags=c(1,5,10,20,50))

We see that the autocorrelation is quite problematic for an unthinned chain (lag 1), but it is much better at higher lags. Thinning the chain by taking every twentieth sample is more than sufficient to bring the autocorrelation down

(^10) For any given model there is no guarantee how many iterations are needed, but most of the models covered in this book are simple enough that on the order of thousands of iterations is enough. (^11) The autocorrelation of a sequence ~x for a time lag τ is simply the covariance between elements in the sequence that are τ steps apart, or Cov(xi, xi+τ ).