Harvard-MIT Division of Health Sciences and Technology
HST.508: Quantitative Genomics, Fall 2005
Instructors: Leonid Mirny, Robert Berwick, Alvin Kho, Isaac Kohane
Prepared by Professor Robert Berwick.

Notes on population genetics and evolution: “Cheat sheet” for review

1. Genetic drift

Terminology. Genetic drift is the stochastic fluctuation in allele frequency due to random sampling in a population. Polymorphism describes sites (nucleotide positions, etc.) variable within a species; divergence describes sites variable between species.

1.1 Wright-Fisher model.

The Wright-Fisher model describes the process of genetic drift within a finite population. The model assumes:
1. N diploid organisms (so, 2N gametes)
2. Monoecious reproduction with an infinite number of gametes (no sexual recombination)
3. Non-overlapping generations
4. Random mating
5. No mutation
6. No selection

The Wright-Fisher model assumes that the ancestors of the present generation are obtained by random sampling with replacement from the previous generation. Looking forward in time, consider the familiar starting point of classical population genetics: two alleles, A and a, segregating in the population. Let i be the number of copies of allele A, so that N − i is the number of copies of allele a. Thus the current frequency of A in the population is p = i/N, and the current frequency of a is 1 − p. (Here N counts gene copies; for a diploid population of N individuals, read 2N in place of N.) We assume that there is no difference in fitness between the two alleles, that the population is not subdivided, and that mutations do not occur. This gives the familiar formula for the probability that a gene with i copies in the present generation is found in j copies in the next generation:

$$P_{ij} = \binom{N}{j}\, p^{j} (1-p)^{N-j}, \qquad 0 \le j \le N$$

Let the current generation be generation zero and let $K_t$ represent the counts of allele A in future generations. The binomial equation above states that $K_1$ is binomially distributed with parameters N and p = i/N, given $K_0 = i$. From standard results in statistics, we know the mean and variance of $K_1$:

$$E[K_1] = Np = i$$
$$\mathrm{Var}[K_1] = Np(1-p)$$

So, the number of copies of A is expected to remain the same on average, but in fact may take any value from zero to N. A particular variant may become extinct (go to zero copies) or fix (go to N copies) in the population even in a single generation. Over time, the frequency of A will drift randomly according to the Markov chain with transition probabilities given by the above formula, and eventually one or the other allele will be lost from the population.
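The Markov chain described above can be sketched in a few lines of Python. This is only an illustrative simulation, not part of the original notes: the function names, the population size of 20 gene copies, the starting count of 5 and the number of replicates are all hypothetical choices. It also relies on the standard result (not derived in the text above) that a neutral allele fixes with probability equal to its starting frequency.

```python
import random

def wright_fisher_step(i, n_copies, rng):
    """One generation: draw n_copies gene copies with replacement.

    Each draw is allele A with probability p = i / n_copies, so the
    returned count is Binomial(n_copies, p).
    """
    p = i / n_copies
    return sum(1 for _ in range(n_copies) if rng.random() < p)

def run_until_absorbed(i0, n_copies, rng, max_generations=100_000):
    """Iterate the chain until A is lost (0 copies) or fixed (n_copies)."""
    i, generations = i0, 0
    while 0 < i < n_copies and generations < max_generations:
        i = wright_fisher_step(i, n_copies, rng)
        generations += 1
    return i, generations

# Hypothetical parameters chosen only for illustration.
rng = random.Random(1)
n_copies, i0, replicates = 20, 5, 2000
outcomes = [run_until_absorbed(i0, n_copies, rng)[0] for _ in range(replicates)]
fixation_fraction = sum(1 for k in outcomes if k == n_copies) / replicates
print(f"fraction of runs in which A fixed: {fixation_fraction:.3f} "
      f"(starting frequency {i0 / n_copies})")
```

Every run ends at 0 or N copies, which is the eventual loss of one allele that the text describes; the fraction of runs ending in fixation should sit near the starting frequency i/N.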

Perhaps the easiest way to see how the Wright-Fisher binomial sampling model works is through a biologically motivated example. Imagine that before dying each individual in the population produces a very large number of gametes. However, the population size is tightly controlled so that only N of these can be admitted into the next generation. The frequency of allele A in the gamete pool will be i/N, and because there are no fitness differences, the next generation is obtained by randomly choosing N alleles. The connection to the binomial distribution is clear: we perform N trials, each with p = i/N chance of success. Because the gamete pool is so large, we assume it is not depleted by this sampling, so the probability i/N is still the same for each trial. The distribution of the number of A alleles in the next generation is the binomial distribution with parameters (N, i/N) as expected.

The decay of heterozygosity.

Before we take up the backward, ancestral process for the Wright-Fisher model, we will look at the classical forward derivation. The heterozygosity of a population is defined to be the probability that two randomly sampled gene copies are different. For a randomly mating diploid population, this is equivalent to the chance that an individual is heterozygous at a locus. Let the current generation be generation zero, and let $p_0$ be the frequency of A now. The heterozygosity of the population now is equal to the binomial chance that one A (and one a) is chosen in two random draws:

$$H_0 = 2p_0(1-p_0)$$

Let the random variable $P_t$ represent the frequency of A in each future generation t. Then, as we have seen in earlier lectures, in the next generation the heterozygosity will have changed to $H_1 = 2P_1(1-P_1)$. However, $H_1$ will vary depending on the random realization of the process of genetic drift. On average, heterozygosity (variation) will be lost through drift. Using $E[P_1] = p_0$ and $\mathrm{Var}[P_1] = p_0(1-p_0)/(2N)$:

$$E[H_1] = E[2P_1(1-P_1)] = 2\left(E[P_1] - E[P_1]^2 - \mathrm{Var}[P_1]\right) = 2p_0(1-p_0)\left(1-\frac{1}{2N}\right) = H_0\left(1-\frac{1}{2N}\right)$$

In the haploid case, we replace 2N by N. After t generations, we have:

$$E[H_t] = H_0\left(1-\frac{1}{2N}\right)^{t} \approx H_0\, e^{-t/(2N)}$$

The approximation is valid for large N. Thus, as we’ve seen, in the Wright-Fisher model, heterozygosity decays at rate 1/N per generation, 1/(2N) if diploid.

When the population size fluctuates from generation to generation, taking the values $N_1, N_2, \ldots, N_t$, the decrease of heterozygosity over t generations defines an effective size $N_e$ through:

$$\left[1-\frac{1}{2N_e}\right]^{t} H_0 = \left[1-\frac{1}{2N_t}\right]\cdots\left[1-\frac{1}{2N_2}\right]\left[1-\frac{1}{2N_1}\right] H_0$$

Again approximating by a Taylor series, we have:

$$e^{-t/(2N_e)} = e^{-1/(2N_t)} \cdots e^{-1/(2N_1)}$$

or, taking logs of both sides:

$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_t} + \cdots + \frac{1}{N_2} + \frac{1}{N_1}\right)$$
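A small numerical check of the last relation may be useful. The sketch below multiplies the per-generation factors for a made-up sequence of population sizes containing one bottleneck, then compares the result with a constant population whose size is the harmonic mean. The sizes, the starting heterozygosity and the function names are assumptions chosen for illustration only.

```python
def expected_heterozygosity(h0, sizes):
    """E[H_t] after one generation at each diploid size in `sizes`.

    Each generation multiplies heterozygosity by (1 - 1/(2N)).
    """
    h = h0
    for n in sizes:
        h *= 1.0 - 1.0 / (2.0 * n)
    return h

def harmonic_mean(sizes):
    """Harmonic mean of the population sizes: t / sum(1/N_i)."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical history: four generations at N = 1000 and one bottleneck at N = 10.
sizes = [1000, 1000, 10, 1000, 1000]
h0 = 0.5

h_exact = expected_heterozygosity(h0, sizes)
n_e = harmonic_mean(sizes)
h_from_ne = expected_heterozygosity(h0, [n_e] * len(sizes))

print(f"harmonic-mean size: {n_e:.1f}")
print(f"E[H_t] from the actual sizes:    {h_exact:.4f}")
print(f"E[H_t] from constant size N_e:   {h_from_ne:.4f}")
```

The single generation at N = 10 pulls the harmonic mean far below the arithmetic mean of the sizes, and the constant-size population reproduces the loss of heterozygosity only approximately, because the relation rests on the Taylor approximation.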

Thus $N_e$ is the harmonic mean of the actual population sizes. Since the harmonic mean is dominated by the smallest terms, population bottlenecks, or brief reductions in actual population size, can have a strong influence on the effective population size and heterozygosity (read: variance). (This is called variance effective population size.) Effective population size is crucial to all the calculations because the mathematical results depend on the assumption that the Wright-Fisher idealization of binomial draws is being maintained.

1.2 Effective population size

More generally, the effective population size is thus the size of an idealized Wright-Fisher population that has the same magnitude of drift as the actual population. The effective population size is generally smaller than the census population size due to factors such as this one. Other cases may be dealt with as in the case of fluctuating population size, by figuring out what value of N would yield the same rate of loss in heterozygosity ‘as if’ the population were an ideal Wright-Fisher sample: that is, by calculating, in each particular instance, what the reduction in variance is from generation to generation, as we did above, and then back calculating what the value of N ‘should have been.’ Some examples include:

1. Unequal numbers of males and females. Imagine a zoo population with 20 males and 20 females. Due to the dominance hierarchy only one of the males actually breeds. What is the relevant population size that informs us about the strength of drift in this system? 40? 21? If $N_m$ is the number of breeding males (1 in this example) and $N_f$ the number of breeding females (20), then half of the genes in the offspring generation will derive from parent females and half from parent males.
2. Overlapping generations
3. Non-Poisson distribution of fecundity (i.e., different numbers of offspring)
4. Non-random mating, i.e., population structure in general
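For the zoo example in item 1, a short worked calculation may help. It uses the standard two-sex result, which is an assumption here because the text above does not derive it. Two gene copies in the offspring generation both come from males with probability 1/4, and then share a parental copy with probability $1/(2N_m)$, and likewise for females. Equating the total to $1/(2N_e)$ gives:

```latex
\frac{1}{2N_e} \;=\; \frac{1}{4}\cdot\frac{1}{2N_m} \;+\; \frac{1}{4}\cdot\frac{1}{2N_f}
\qquad\Longrightarrow\qquad
N_e \;=\; \frac{4\,N_m N_f}{N_m + N_f}

% Zoo example: one breeding male, twenty breeding females
N_e \;=\; \frac{4 \cdot 1 \cdot 20}{1 + 20} \;=\; \frac{80}{21} \;\approx\; 3.8
```

Under this assumption the answer is neither 40 nor 21: the single breeding male dominates, and the population drifts roughly as an ideal population of about four individuals would.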
2. Coalescent theory

Coalescent theory describes the genealogical relationships among individuals in a Wright-Fisher population.

Notation: Let $T_2$ be the time in generations until the most recent common ancestor (coalescence) of two genes (alleles, sequences,…) chosen at random from a population of size N (better and more correct: $N_e$). Also, from now on, whenever we use “N” we really mean $N_e$. We will also call these genes lineages. We assume in what follows that the sequences, etc. are drawn from a single species. (This is important for some of the statistical calculations testing for selection, below.)

In general, $T_i$ = the time during which there are i lineages (genes, alleles, sequences,…), that is, the waiting time until the next coalescence reduces them to i − 1. After coalescence, the two genes involved are identical. We are interested in the distribution of the ‘waiting times’ until each coalescence, as well as the variance of these times, and, further, the expected waiting time and the total waiting time until all lineages have collapsed into a single common ancestor. It turns out that all this can be described as a stochastic process with rather simple properties. Note that each coalescent event is independent of all others: the waiting times are independent.

2.1 Basic results.

Measured in discrete time, in a Wright-Fisher population of 2N gene copies (N diploid individuals), the distribution of waiting times until the collapse (coalescence, identity) of two sequences is geometric with the probability of success p (= coalescence) = 1/(2N) in any one generation, and so the probability of failure (= not coalescing) is 1 − p, or [1 − 1/(2N)]. (Note the close relation between this and the heterozygosity computation.) It is easy to see that the waiting times form a geometric distribution by considering the probability that up until time t a coalescent event has not occurred, as the product of t ‘not coalescing’ events, just as with the heterozygosity iteration. If we let $P(T_2 > t)$ denote the probability that two lineages have not coalesced, for times t = 0, 1, 2, …, then this is simply:

$$P(T_2 > t) = \left(1-\frac{1}{2N}\right)^{t}, \qquad t = 0, 1, 2, \ldots$$

so the probability that two lineages collapse at exactly the (t+1)th time step is:

$$P(T_2 = t+1) = P(T_2 > t) - P(T_2 > t+1) = \left(1-\frac{1}{2N}\right)^{t} - \left(1-\frac{1}{2N}\right)^{t}\left(1-\frac{1}{2N}\right) = \left(1-\frac{1}{2N}\right)^{t}\left(1 - 1 + \frac{1}{2N}\right) = \frac{1}{2N}\left(1-\frac{1}{2N}\right)^{t}$$

And this is clearly a geometric distribution.
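The geometric waiting time can be checked directly. The sketch below evaluates the probability just derived and also simulates the backward process one generation at a time; the population size of 50 diploid individuals, the seed and the function names are hypothetical choices made for illustration. The mean of a geometric distribution with success probability 1/(2N) is 2N, which is the value the simulated average should approach.

```python
import random

def pmf_t2(t, n_diploid):
    """P(T2 = t) for t = 1, 2, ...: no coalescence for t-1 generations, then one."""
    q = 1.0 / (2 * n_diploid)
    return q * (1.0 - q) ** (t - 1)

def sample_t2(n_diploid, rng):
    """Walk back one generation at a time until the two lineages coalesce."""
    q = 1.0 / (2 * n_diploid)
    t = 1
    while rng.random() >= q:  # with probability 1 - q the lineages stay distinct
        t += 1
    return t

# Hypothetical parameters for illustration.
n_diploid = 50
rng = random.Random(7)
draws = [sample_t2(n_diploid, rng) for _ in range(20_000)]
mean_t2 = sum(draws) / len(draws)
total_prob = sum(pmf_t2(t, n_diploid) for t in range(1, 5000))

print(f"simulated mean of T2: {mean_t2:.1f} generations (2N = {2 * n_diploid})")
print(f"sum of the probabilities over t: {total_prob:.6f}")
```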

2.1.1 A very, very intuitive picture.

We can gain a very intuitive picture of the same process by the following argument. We start by considering the coalescence time in a sample of two genes. Genes X and Y live in the present generation, and their common ancestor A lived t generations ago. Consequently, as we look backward from the present into the past, the two lines of descent remain distinct for t generations, at which time they coalesce into a single line of descent. In a given generation, the lines coalesce if the two genes in that generation are copies of a single parental gene in the generation before. Otherwise, the two lines remain distinct.

(ii) The expected time to coalescence from k to k − 1 lineages is:

$$E[T_k] = \frac{4N}{k(k-1)} = \frac{2N}{\binom{k}{2}}$$

So for example, if we have 3 sequences, the time to the first coalescence will be, on average:

$$E[T_3] = \frac{4N}{3(3-1)} = \frac{2N}{3} = \frac{1}{3}\,(2N)$$

This makes sense, since for the first coalescence, we have (3 choose 2) or 3 possible ways of collapsing 3 sequences together (1st and 2nd; 1st and 3rd; 2nd and 3rd): there are more cars in the intersection, so a higher chance that they will ‘collide’, and so a lower waiting time until they do coalesce (specifically, 1/3 of the average time when there are only 2 sequences). And so on: for four lineages (sequences), we initially have 4-choose-2 options to collapse, which gives an expected time to first collapse of 2N/6 = 1/6 (2N), etc.

(iii) The total length of all the branches in the genealogy tree, $E[T_{tot}]$, which is an important value that we’ll use to figure out the expected nucleotide diversity, may be computed as follows:

$$E[T_{tot}] = \sum_{i=2}^{n} i\,E[T_i] = \sum_{i=2}^{n} \frac{2iN}{\binom{i}{2}} = 4N\sum_{i=1}^{n-1}\frac{1}{i}$$

(iv) The time to coalescence of all n lineages (the so-called “time to most recent common ancestor,” MRCA), and so the total expected depth of the coalescent, can be found as follows.

$$E[T_{MRCA}] = \sum_{i=2}^{n} E[T_i] = 2N\sum_{i=2}^{n}\frac{2}{i(i-1)} = 2N\cdot 2\sum_{i=2}^{n}\left(\frac{1}{i-1}-\frac{1}{i}\right)$$

$$= 2N\cdot 2\left(1-\frac{1}{2}+\frac{1}{2}-\frac{1}{3}+\frac{1}{3}-\cdots-\frac{1}{n-1}+\frac{1}{n-1}-\frac{1}{n}\right) = 2N\cdot 2\left(1-\frac{1}{n}\right)$$

Note that this expected time is ‘about’ 4N, a bit less with a small factor dependent on the sample size n. Therefore, sampling an (n+1)st sequence adds only 2/n (in units of 2N) to the expected total branch length, which may already be a sizeable number. This has implications for the measurement of DNA sequence polymorphism, which we describe below. Further, the equation for MRCA means that in generational units of 2N, the time to MRCA is always very close to its asymptotic value of 2, even for moderate n. Thus, for all but the smallest samples, there will likely be a large number of coalescent events in the very recent history of the sample.
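Results (ii) to (iv) are easy to tabulate. The following sketch evaluates the expected interval lengths, the total branch length and the depth of the tree directly from the sums; the function names and the choice of N = 1000 are hypothetical, and times are in generations.

```python
from math import comb

def expected_interval(k, n_diploid):
    """E[T_k] = 2N / C(k, 2): expected time spent with k lineages."""
    return 2 * n_diploid / comb(k, 2)

def expected_total_length(n, n_diploid):
    """E[T_tot]: each interval T_k contributes k branches of that length."""
    return sum(k * expected_interval(k, n_diploid) for k in range(2, n + 1))

def expected_tmrca(n, n_diploid):
    """Expected depth of the tree: the sum of all interval lengths."""
    return sum(expected_interval(k, n_diploid) for k in range(2, n + 1))

n_diploid = 1000  # hypothetical effective population size
for n in (2, 3, 10, 50):
    print(f"n = {n:2d}: E[T_n] = {expected_interval(n, n_diploid):8.1f}, "
          f"depth = {expected_tmrca(n, n_diploid):7.1f}, "
          f"total length = {expected_total_length(n, n_diploid):8.1f}")
```

The depth column stays below 4N however large the sample, while the total length keeps growing slowly, like the harmonic sum.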

(v) Properties of the shape and size of the coalescent tree. Note that the full coalescent tree is dominated by the most ancient coalescent, of depth on average 2N. For a large sample, the tree collapses to just two lineages in expected time of about 2N, then collapses all the rest of the way, from 2 lineages to 1, in another expected time of 2N.

(vi) We can pass from the discrete, geometric distribution to its continuous analog as follows, following the Rice book. Since for 2N > 100 the Taylor series of $e^{-1/(2N)}$ is approximately equal to $1 - 1/(2N)$, we can rewrite the geometric distribution as an exponential distribution:

$$P(T_k > t) \approx \left(1-\binom{k}{2}\frac{1}{2N}\right)^{t} \approx e^{-\binom{k}{2}\frac{t}{2N}},$$

so that, as $N \to \infty$, $T_k$ has the exponential density

$$\frac{\binom{k}{2}}{2N}\; e^{-\binom{k}{2}\frac{t}{2N}}$$

If we rescale time in generational units of τ = t/(2N), so that one ‘clock tick’ is set to this value, then we can simplify the basic coalescent results in a much neater form, which will also let us get the variance in a useful form:

$$P(T_k > \tau) = e^{-\binom{k}{2}\tau}, \qquad E(T_k) = \binom{k}{2}^{-1}, \qquad \mathrm{var}(T_k) = \binom{k}{2}^{-2}$$

We see that 2N (where N is of course actually the effective population size) is the ‘natural unit’ for considering lineage coalescence.
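In rescaled time the whole tree can be simulated with one exponential draw per coalescent interval. The sketch below is a minimal illustration under that continuous-time approximation; the sample size, number of replicates, seed and function name are hypothetical. It compares the simulated depth and total branch length with the expectations $2(1 - 1/n)$ and $2\sum_{i=1}^{n-1} 1/i$, both in units of 2N generations.

```python
import random
from math import comb

def simulate_tree(n, rng):
    """Return (depth, total_length) of one coalescent tree, in units of 2N.

    While k lineages remain, the waiting time is exponential with rate C(k, 2).
    """
    depth = 0.0
    total_length = 0.0
    for k in range(n, 1, -1):
        wait = rng.expovariate(comb(k, 2))  # mean 1 / C(k, 2)
        depth += wait
        total_length += k * wait            # k branches span this interval
    return depth, total_length

# Hypothetical sample size and replicate count.
rng = random.Random(42)
n, replicates = 10, 20_000
trees = [simulate_tree(n, rng) for _ in range(replicates)]
mean_depth = sum(d for d, _ in trees) / replicates
mean_total = sum(length for _, length in trees) / replicates

expected_depth = 2 * (1 - 1 / n)
expected_total = 2 * sum(1 / i for i in range(1, n))
print(f"depth:        simulated {mean_depth:.3f}, expected {expected_depth:.3f}")
print(f"total length: simulated {mean_total:.3f}, expected {expected_total:.3f}")
```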

2.2 Adding mutations: The coalescent and the neutral theory

We now add mutations to the genealogical tree to get some actual results and tests. The idea is this: rather than ask, “for a given mutation parameter, what can we say about the ancestry of the sample?” we ask the more relevant question: “given this sample, what can we say about the population?” The key idea to adding mutations to the coalescent tree is that what we observe in terms of segregating sites are two superimposed, independent stochastic processes: one due to the lineages collapsing (which are n − 1 independent, exponentially/geometrically distributed waiting times) and the other due to the random, neutral mutations sprinkled on top of this lineage collapse pattern (which for large population sizes may be considered to be Poisson distributed).
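The way the two processes combine can be written out briefly. The sketch below assumes a neutral mutation rate μ per sequence per generation and the usual definition θ = 4Nμ; both are conventions assumed here, since they are not defined in this part of the notes. Conditional on the tree, the number of mutations is Poisson with mean μ times the total branch length, so taking expectations over trees gives:

```latex
S_n \mid T_{tot} \;\sim\; \mathrm{Poisson}(\mu\, T_{tot})

E[S_n] \;=\; \mu\, E[T_{tot}]
       \;=\; \mu \cdot 4N \sum_{i=1}^{n-1} \frac{1}{i}
       \;=\; \theta\, a_n ,
\qquad
\theta = 4N\mu , \quad a_n = \sum_{i=1}^{n-1} \frac{1}{i}
```

Under these assumptions the expected number of segregating sites grows only like the harmonic sum $a_n$ as the sample gets larger.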

An important point: our model of mutation here is traditionally called the infinite sites model. Note that in doing this computation about neutral mutations and their ultimate ‘effect’ in showing up as segregating sites, via sprinkling on the coalescent branches, we have made implicit use of an assumption: each mutation is at a different site in the sequence, so that each mutation produces a distinct, segregating ‘spot’ on the DNA sequence. Roughly, this is what permits us to equate the number of segregating sites to the simple multiplication of the neutral mutation rate times the expected total branch length of the tree. You might want to think through what would happen if we allowed multiple ‘hits’ at the same nucleotide position. If we assume that the mutation rate is, say, $10^{-6}$ to $10^{-8}$ per base pair per replication, and that sequences are of ‘average’ length (like what?) then this assumption does not seem too bad, so the infinite sites model seems OK for sequences.

2.3 Using the coalescent to test hypotheses about nucleotide diversity: Tajima’s D

Now we can actually construct a test of the neutral hypothesis, based on two estimators of θ. Another way we have of estimating θ is to just calculate the number of mutations separating individuals two at a time, and average over all pairs. This may be thought of as a sample average to estimate a population average, and is a common measure of nucleotide diversity. Denote by

$$S_{ij} = \text{number of mutations separating individuals } i \text{ and } j$$

Under the infinite sites assumption, we can calculate $S_{ij}$ from a sample by calculating the number of segregating sites between sequences i and j. If we average $S_{ij}$ over all pairs (i, j) in a sample of size n, this is called the average number of pairwise differences. We denote this by:

$$D_n = \frac{2}{n(n-1)}\sum_{i<j} S_{ij}$$

Note that we can think of individuals (i, j) as a sample of size 2, so:

$$E[S_{ij}] = E[S_2] = \theta$$

and so,

$$E[D_n] = \frac{2}{n(n-1)}\sum_{i<j} E[S_{ij}] = \theta$$

Thus, $D_n$ is another, unbiased estimator of θ, called $\hat\theta_T$. Tajima (1981) was the first to investigate its properties. He noticed that since $E[D_n] = E[\hat\theta_T] = \theta$ and $E[S_n] = a_n\theta$, so that $\hat\theta_W = S_n/a_n$ also has expectation θ (with $a_n$ as above, i.e., $a_n = \sum_{i=1}^{n-1} 1/i$), the expected value of the difference $\hat\theta_T - \hat\theta_W$ should be zero under the standard neutral model. Significant deviations from zero should cause the null model to be rejected (i.e., there is possibly positive selection). Specifically, Tajima (1989) proposed the test statistic:

$$D = \frac{\hat\theta_T - \hat\theta_W}{\sqrt{\widehat{\mathrm{Var}}\left[\hat\theta_T - \hat\theta_W\right]}}$$

The denominator of Tajima’s D is an attempt to normalize for the effect of sample size on the critical values. We have to estimate this denominator (hence the ‘hat’ on Var) from the data by using the formula:

$$\widehat{\mathrm{Var}}\left[\hat\theta_T - \hat\theta_W\right] = e_1 S + e_2 S(S-1)$$

where S is the number of segregating sites $S_n$ and

$$e_1 = \frac{1}{a_n}\left(\frac{n+1}{3(n-1)} - \frac{1}{a_n}\right), \qquad e_2 = \frac{1}{a_n^2 + b_n}\left(\frac{2(n^2+n+3)}{9n(n-1)} - \frac{n+2}{n\,a_n} + \frac{b_n}{a_n^2}\right)$$

where $b_n = \sum_{i=1}^{n-1} \frac{1}{i^2}$.

This looks formidably complicated, but it’s really not (though tricky to derive): the coefficients come from the computation of the variance of the difference between the two estimators, just as we derived the variance of $S_n$ above. To actually use this test, Tajima suggested that the distribution of D might be approximated by a certain form (not quite a normal distribution, but a beta distribution), and provided tables of critical values for the rejection of the standard neutral model. The upper (lower) critical value is the value above (below) which the observed value of the statistic is too unlikely to be explained by the null model. As with any statistical test, it is necessary to specify a significance level alpha, which represents the acceptable probability of rejecting the null model just by chance when it is true. Roughly, values of Tajima’s D are significant at the 5% level (alpha = 0.05) if they are either greater than two or less than negative two. However, D is not exactly beta-distributed and critical values are often determined using computer simulation. (This is an area of on-going research.) There are several other related tests that you will probably encounter that are based on the same idea (Fu and Li’s D* and F tests, e.g.).
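The statistic is simple to compute once the pieces are named. The sketch below follows the formulas for $\hat\theta_T$, $\hat\theta_W$, $e_1$ and $e_2$ given above; the function name, the representation of sequences as equal-length strings, and the two tiny example alignments are assumptions made for illustration. It does not attempt to assign significance, which would need Tajima’s tables or simulation.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of aligned, equal-length sequences (strings)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance estimate is zero for fewer than 4 sequences")
    # S: number of segregating sites (alignment columns with more than one state)
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("D is undefined when there are no segregating sites")

    # theta_T: average number of pairwise differences
    pairs = list(combinations(seqs, 2))
    theta_t = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # theta_W: segregating sites scaled by the harmonic sum a_n
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a_n

    e1 = (1.0 / a_n) * ((n + 1) / (3.0 * (n - 1)) - 1.0 / a_n)
    e2 = (1.0 / (a_n ** 2 + b_n)) * (
        2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
        - (n + 2) / (n * a_n)
        + b_n / a_n ** 2
    )
    return (theta_t - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical toy alignments (0 = ancestral state, 1 = derived state).
only_rare = ["1000", "0100", "0010", "0001"]   # every variant is a singleton
balanced = ["11", "11", "00", "00"]            # two equally common haplotypes
d_rare = tajimas_d(only_rare)
d_balanced = tajimas_d(balanced)
print(f"D with only rare variants:     {d_rare:.3f}")
print(f"D with intermediate frequency: {d_balanced:.3f}")
```

The first toy sample, made only of singletons, gives a negative value, and the second, with variants at intermediate frequency, gives a positive one.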

As far as how the D value responds to deviations from the neutral model, which is the most important thing, this can be understood in the following way. First, the sign of the test is determined only by the sign of the numerator, since the denominator is always positive. The D value becomes negative when there is an excess of either low-frequency (rare) or high-frequency polymorphisms and a deficiency of middle-frequency polymorphisms. This might be caused by positive selection, or, alternatively, expanding population size (note that the Tajima model assumes constant population size for the null hypothesis). Large positive values of D can result from population contraction, or the balancing selection of two alternative polymorphisms. The sensitivity to demographic parameters cannot be overstressed. (Below we turn to a test for selection that does not make any such demographic assumptions, the McDonald-Kreitman test; however, it is correspondingly less powerful.)