















This document covers sequence-structure asymmetry in biology, focusing on the relationship between sequence identity and structural compatibility. Topics include sequence capacity, flow between structures, and measuring protein fitness. It also discusses methods for estimating sequence capacity, such as telescoping ratios and umbrella sampling, and the use of energy functions to measure protein fitness.
















1
1DWR (Horse) vs. 2JHO (Sperm Whale)
Sequence Identity ~ 85%
2
1LH1 (Lupinus luteus) vs. 2JHO (Sperm Whale)
Sequence Identity ~ 20%
3
1LH1:_ 2/3 ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE
1MBC:_ 1/2 VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
1LH1:_ 60/61 LQAHAGKVFKLVYEAAIQLEVTGVVV--TDATLKNLGSVHVSKGVADAHFPVVKEAILKT
1MBC:_ 61/62 LKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHK---IPIKYLEFISEAIIHV
1LH1:_ 118/119 IKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
1MBC:_ 115/116 LHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG
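The identity figures quoted on these slides can be checked directly from an alignment. The following is a minimal sketch, assuming identity is counted as matching residues over gap-free alignment columns; other conventions divide by the shorter sequence length or by the full alignment length, so the number depends on the definition. The function name `percent_identity` is illustrative and not from the talk.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which both residues match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for x, y in aligned if x == y)
    return 100.0 * matches / len(aligned)


# First alignment block of 1LH1 vs. 1MBC, copied from the slide.
row_1lh1 = "ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE"
row_1mbc = "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED"
print(f"{percent_identity(row_1lh1, row_1mbc):.1f}% identity in this block")
```

Applying the same function to all three blocks concatenated would give an identity for the whole alignment under this particular convention.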
7
Detailed (whole PDB), efficiently computable and experimentally testable model (the set of PDB structures was argued to be complete)
Design of protein structures and protein switches
Zero order model of the evolution of protein sequences & structures (no selection due to function).
8
Shakhnovich
Dill
Wolynes
Thirumalai
Levitt
…
So far no global view of capacity (and thermodynamics) of the PDB, no flow.
9
Capacity shows weak correlation with the number of sequences that are found for a particular fold in the NR database (correlation coefficient 0.2)
Capacity correlates with mutation rates measured experimentally (J. D. Bloom, D. A. Drummond, F. H. Arnold, and C. O. Wilke. Structural determinants of the rate of protein evolution in yeast. Mol. Biol. Evol. 23:1751-1761, 2006; and C. Wilke, private communication).
13
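TE13 (Toby, Elber): energy of sequence S on structure X as a sum of single-site terms
$E(S, X) = \sum_{i=1}^{n} c_i$
[Figure: structure with sites labeled X, X, N, E, D, X, Q]
THOM2 (Meller, Elber): energy as a sum of pair terms that depend on $r_{ij}$
$E(S, X) = \sum_{i<j} c_{ij}(r_{ij})$
[Figure: structure with sites labeled A, R, N, E, D, C, Q]
Ergodic/well mixed model
NP complete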
14
[Figure: residue letters placed on the structure for two sequences (H, I, L, M, K, F, P and A, R, N, E, D, C, Q)]
Original (Native) Sequence: nat
Candidate Sequence: nat
15
In general, we would like to compute the function $N(E)$, the number of sequences with energy below $E$.
Specifically, we want to compute $N(E_{nat})$.
The number of all possible sequences is $20^n$, for proteins of length $n$. For small proteins, $n \approx 50$.
Random sampling of sequence space does not work, since $N(E_{nat}) / 20^n$ can be exponentially small.
Need a more sophisticated counting method.
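To make the scale concrete, the sketch below evaluates 20^n for n = 50 and the expected number of uniform random draws needed before one lands in a target set. The target-set size used here is a made-up illustration, not a value from the talk, and the function names are hypothetical.

```python
from fractions import Fraction


def sequence_space_size(n: int, alphabet: int = 20) -> int:
    """Number of sequences of length n over the given alphabet."""
    return alphabet ** n


def expected_draws_until_hit(target_count: int, n: int, alphabet: int = 20) -> Fraction:
    """Mean of a geometric distribution, 1/p, with p = target_count / alphabet**n."""
    return Fraction(sequence_space_size(n, alphabet), target_count)


total = sequence_space_size(50)
print(f"20^50 has {len(str(total))} decimal digits")

# Hypothetical target set of 10^30 sequences (illustrative number only).
draws = expected_draws_until_hit(10**30, 50)
print(f"expected uniform draws before the first hit: {float(draws):.3e}")
```

Even a target set that is astronomically large in absolute terms is hit too rarely for direct uniform sampling to give a usable estimate of its size.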
16
$$N(E) = N(E_{ref}) \times \frac{N(E_1)}{N(E_{ref})} \times \frac{N(E_2)}{N(E_1)} \times \cdots \times \frac{N(E)}{N(E_m)}$$
$E_{ref}$: Pre-selected reference energy
$N(E_{ref})$: Number of sequences below $E_{ref}$
$E_1 \ldots E_m$: Intermediate energy values used in the ratios above.
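The telescoping identity can be checked by brute force on a model small enough to enumerate. This is a minimal sketch, assuming a toy additive (single-site) energy with random coefficients over a 3-letter alphabet and taking N(E) as the number of sequences with energy at most E. The model, the level spacing, and all names are illustrative assumptions.

```python
import itertools
import math
import random


def count_below(energies, threshold):
    """N(E): number of enumerated sequences with energy <= threshold."""
    return sum(1 for e in energies if e <= threshold)


rng = random.Random(0)
n, q = 5, 3  # toy sequence length and alphabet size (assumed)
site_energy = [[rng.random() for _ in range(q)] for _ in range(n)]
energies = [
    sum(site_energy[i][a] for i, a in enumerate(seq))
    for seq in itertools.product(range(q), repeat=n)
]

e_min, e_max = min(energies), max(energies)
e_ref = e_max                              # reference level: every sequence counts
e_target = e_min + 0.3 * (e_max - e_min)   # the level E whose count we want
m = 4
# Intermediate levels E_1 > ... > E_m between E_ref and E.
levels = [e_ref - k * (e_ref - e_target) / (m + 1) for k in range(1, m + 1)]

# N(E_ref) times the chain of ratios N(lower) / N(upper).
chain = [e_ref] + levels + [e_target]
product = count_below(energies, e_ref)
for upper, lower in zip(chain, chain[1:]):
    product *= count_below(energies, lower) / count_below(energies, upper)

print(product, count_below(energies, e_target))
```

Here every ratio is computed exactly, so the product only confirms the algebra; the point of the decomposition is that each ratio is close enough to 1 to be estimated by sampling.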
19
[Figure: energy scale marking $E_{min}$, $E_{nat}$, and $E_{mean}$, with index $k$]
20
Given a structure $X$, compute $E_{mean}$ and $N(E_{mean})$.
Pick $E_1, \ldots, E_m$ s.t. $E_k > E_{k+1}$ and $(E_k - E_{k+1})$ is decreasing with $k$.
For $k = 1, \ldots, m$, run the Markov chain for $t$ steps.
Compute $l_t(k+1) / l_t(k)$ as the estimate of $N(E_{k+1}) / N(E_k)$.
$$N(E) = N(E_{mean}) \times \frac{N(E_1)}{N(E_{mean})} \times \frac{N(E_2)}{N(E_1)} \times \cdots \times \frac{N(E)}{N(E_m)}$$
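The procedure can be sketched on a toy model and compared with exact enumeration. The code below is a minimal sketch under several assumptions that are not from the talk: an additive single-site energy with random coefficients, a 4-letter alphabet, a single-site mutation chain that rejects moves above the current level (so its stationary distribution is uniform on the allowed set), and an unconstrained reference level for which N is exactly q^n, used here in place of E_mean and N(E_mean). All function names are hypothetical.

```python
import itertools
import math
import random


def total_energy(seq, site_energy):
    """Additive energy: one coefficient per (site, residue)."""
    return sum(site_energy[i][a] for i, a in enumerate(seq))


def estimate_ratio(site_energy, q, upper, lower, steps, burn_in, rng):
    """Estimate N(lower) / N(upper) with a chain uniform on {S : E(S) <= upper}."""
    n = len(site_energy)
    # Start at the minimum-energy sequence, which lies below every level.
    seq = [min(range(q), key=row.__getitem__) for row in site_energy]
    hits = 0
    for step in range(burn_in + steps):
        i, a = rng.randrange(n), rng.randrange(q)
        old = seq[i]
        seq[i] = a
        if total_energy(seq, site_energy) > upper:
            seq[i] = old  # reject moves that leave the allowed set
        if step >= burn_in and total_energy(seq, site_energy) <= lower:
            hits += 1
    return hits / steps


def estimate_count(site_energy, q, levels, steps=20000, burn_in=2000, seed=0):
    """N(levels[-1]) as q^n times a product of sampled ratios."""
    rng = random.Random(seed)
    estimate = float(q ** len(site_energy))  # assumed reference: no constraint
    upper = math.inf
    for lower in levels:
        estimate *= estimate_ratio(site_energy, q, upper, lower, steps, burn_in, rng)
        upper = lower
    return estimate


def exact_count(site_energy, q, threshold):
    """Brute-force N(threshold), feasible only for toy sizes."""
    n = len(site_energy)
    return sum(
        1
        for seq in itertools.product(range(q), repeat=n)
        if total_energy(seq, site_energy) <= threshold
    )


rng = random.Random(1)
n, q = 6, 4
site_energy = [[rng.random() for _ in range(q)] for _ in range(n)]
e_min = sum(min(row) for row in site_energy)
e_mean = sum(sum(row) / q for row in site_energy)
target = e_min + 0.5 * (e_mean - e_min)
# Decreasing levels from E_mean down to the target; the gaps shrink with k.
levels = [target + (e_mean - target) * f for f in (1.0, 0.5, 0.2, 0.0)]

estimate = estimate_count(site_energy, q, levels)
exact = exact_count(site_energy, q, levels[-1])
print(f"estimated N = {estimate:.1f}, exact N = {exact}")
```

The number of steps per level and the burn-in length are arbitrary choices for this toy size; a real application would need them tied to the mixing time of the chain.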
21
State space is connected: all states communicate via the minimum-energy state
Mixing time (the Markov chain is ergodic) is polynomial in sequence length.
Generalizes Morris-Sinclair algorithm (1999) for counting knapsack solutions to arbitrary alphabets
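The connectivity claim is easy to see for an additive single-site energy: replacing each residue in turn by the lowest-energy residue at that site never raises the energy, so any sequence below a threshold reaches the minimum-energy sequence without leaving the allowed set. The following is a minimal sketch under that additive-energy assumption; the random coefficients and the name `path_to_minimum` are illustrative.

```python
import random


def path_to_minimum(seq, site_energy):
    """Return (sequence, energy) pairs along single-site moves to the minimum."""
    seq = list(seq)
    energy = sum(site_energy[i][a] for i, a in enumerate(seq))
    path = [(tuple(seq), energy)]
    for i, row in enumerate(site_energy):
        best = min(range(len(row)), key=row.__getitem__)
        if seq[i] != best:
            seq[i] = best  # mutate site i to its lowest-energy residue
            energy = sum(site_energy[j][a] for j, a in enumerate(seq))
            path.append((tuple(seq), energy))
    return path


rng = random.Random(2)
n, q = 8, 20
site_energy = [[rng.random() for _ in range(q)] for _ in range(n)]
start = [rng.randrange(q) for _ in range(n)]
path = path_to_minimum(start, site_energy)
for seq, energy in path:
    print(seq, round(energy, 3))
```

Because any two allowed sequences can each walk down to the same minimum, they are joined by a path of at most 2n single-site moves that stays below the threshold.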
31
For a fold $X$, how do we count only sequences that are both compatible with $X$ and prefer $X$ to all other folds?
Given a structure $X$ and a set of competing structures $Y = \{Y_1, \ldots, Y_K\}$, we wish to estimate the function $C(E)$, which gives the size of the set
$$\{ S : E(S, X) < E \ \&\ E(S, X) < E(S, Y_j),\ j = 1, \ldots, K \}$$
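On a toy model this set can be counted by enumeration. The following is a minimal sketch, assuming additive single-site energies for X and for each competing structure, all drawn at random for illustration; real energy functions and a real set of competing structures would replace them. The names `capacity` and `random_table` are hypothetical.

```python
import itertools
import math
import random


def additive_energy(seq, table):
    """Energy of a sequence on one structure, one coefficient per (site, residue)."""
    return sum(table[i][a] for i, a in enumerate(seq))


def capacity(e_threshold, table_x, tables_y, q):
    """Count S with E(S, X) < e_threshold and E(S, X) < E(S, Y_j) for every j."""
    n = len(table_x)
    count = 0
    for seq in itertools.product(range(q), repeat=n):
        e_x = additive_energy(seq, table_x)
        if e_x < e_threshold and all(e_x < additive_energy(seq, t) for t in tables_y):
            count += 1
    return count


def random_table(n, q, rng):
    """Illustrative random energy coefficients in [0, 1)."""
    return [[rng.random() for _ in range(q)] for _ in range(n)]


rng = random.Random(3)
n, q, K = 5, 4, 3  # toy length, alphabet size, number of competing structures
table_x = random_table(n, q, rng)
tables_y = [random_table(n, q, rng) for _ in range(K)]

for e in (1.5, 2.5, 3.5, math.inf):
    print(e, capacity(e, table_x, tables_y, q))
```

With no competing structures the count reduces to the number of sequences below the threshold, so the second condition can only shrink the set.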
32
[Figure: energies on structures X and Y, marking $E^{(X)}_{min}$, $E^{(X)}_{nat}$, $E^{(X)}$, $E^{(Y)}_{min}$, $E^{(Y)}_{nat}$, $E^{(Y)}$, and the line $E^{(X)} = E^{(Y)}$]
33
[Figure: same energies and labels as on the previous slide]
34
[Figure: energy scale marking $E_{min}$, $E_k$, and $E_{k+1}$]
$$C(E_k) = N(E_k) \times f_{ret}(E_k)$$
38
[Figure: plot with labels involving $f_{ret}$, $E_{mean}$, $E_{nat}$, and a starred value]