Sequence-Structure Asymmetry: Understanding Protein Evolution and Capacity, Papers of Electrical and Electronics Engineering

An overview of sequence-structure asymmetry in biology, focusing on the relationship between sequence identity and structural compatibility. Topics include sequence capacity, flow between structures, and measuring protein fitness. The document also discusses methods for estimating sequence capacity, such as telescoping ratios and umbrella sampling, and the use of energy functions to measure protein fitness.

Typology: Papers

Pre 2010

Uploaded on 08/31/2009

1

Biological Overview: Sequence-Structure Asymmetry

1DWR (Horse) vs. 2JHO (Sperm Whale)
Sequence Identity ~ 85%

2

Biological Overview: Sequence-Structure Asymmetry

1LH1 (Lupinus luteus) vs. 2JHO (Sperm Whale)
Sequence Identity ~ 20%

3

Structures are better conserved than sequences during evolution
-> Homology modeling of structures
-> Protein design and evolution

1LH1:_ 2/3 ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE--VPQNNPE

1MBC:_ 1/2 VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

1LH1:_ 60/61 LQAHAGKVFKLVYEAAIQLEVTGVVV--TDATLKNLGSVHVSKGVADAHFPVVKEAILKT

1MBC:_ 61/62 LKKHGVTVLTALGAILKKK---GHHEAELKPLAQSHATKHK---IPIKYLEFISEAIIHV

1LH1:_ 118/119 IKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA

1MBC:_ 115/116 LHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELG

4

Biological Overview:

Sequence-Structure Asymmetry

7

Physical (stability based) network model for sequence capacity of structures & structural flips

• Detailed (whole PDB), efficiently computable and experimentally testable model (the set of PDB structures was argued to be complete)
• Design of protein structures and protein switches
• Zero order model of the evolution of protein sequences & structures (no selection due to function).

8

Related work on capacity of specific protein models

• Shakhnovich
• Dill
• Wolynes
• Thirumalai
• Levitt
• …
• So far no global view of capacity (and thermodynamics) of the PDB, no flow.

9

Is capacity relevant to biology?

• Capacity shows weak correlation with the number of sequences that are found for a particular fold in the NR database (correlation coefficient 0.2)
• Capacity correlates with mutation rates measured experimentally (J. D. Bloom, D. A. Drummond, F. H. Arnold, and C. O. Wilke. Structural determinants of the rate of protein evolution in yeast. Mol. Biol. Evol. 23:1751-1761, 2006; and C. Wilke, private communication).

10

Experimental tests

• Collaboration with Thomas Magliery on Lambda repressor - 160K mutants
• Protein flips - Bryan lab (flips are well known for RNA)

13

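Approximate Energy Functions

THOM2 (Meller, Elber): sum over single sites; ergodic/well mixed model
$E(S \to X) = \sum_{i=1}^{n} c_{\alpha_i}(X_i)$

TE13 (Tobi, Elber): sum over residue pairs; NP complete
$E(S \to X) = \sum_{i<j} c_{\alpha_i \alpha_j}(r_{ij})$

Here $\alpha_i$ is the amino acid of sequence $S$ at site $i$, $X_i$ is the structural environment of site $i$ in $X$, and $r_{ij}$ is the distance between sites $i$ and $j$.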

14

Specifying Sequence-Structure Fitness Criteria

Original (Native) Sequence: $S_{nat}$: HILKMFP
Candidate Sequence: $S$: ARNEDCQ

We say $S$ is fit for $X$ if $E(S \to X) \le E(S_{nat} \to X)$
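The fitness criterion can be illustrated with a minimal Python sketch. The energy table here is hypothetical: random per-site values stand in for a real potential, and `make_toy_site_energies`, `energy`, and `is_fit` are illustrative names that do not come from the slides.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def make_toy_site_energies(n_sites, seed=0):
    """Hypothetical energy table: one random value per (site, amino acid)."""
    rng = random.Random(seed)
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
            for _ in range(n_sites)]

def energy(seq, site_energies):
    """Toy site-additive E(S -> X): sum of the per-site terms."""
    return sum(site_energies[i][aa] for i, aa in enumerate(seq))

def is_fit(seq, native, site_energies):
    """S is fit for X if E(S -> X) <= E(S_nat -> X)."""
    return energy(seq, site_energies) <= energy(native, site_energies)

table = make_toy_site_energies(7)
native = "HILKMFP"      # native sequence from the slide
candidate = "ARNEDCQ"   # candidate sequence from the slide
print("candidate fit for X:", is_fit(candidate, native, table))
```

With a random table the printed answer carries no biological meaning; the point is only that fitness is a threshold test against the native energy.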

15

Estimating the Sequence Capacity of a Fold

• In general, we would like to compute the function $N(E)$: $N(E) = |\{S : E(S \to X) \le E\}|$
• Specifically, we want to compute $N(E_{nat})$.
• The number of all possible sequences is $20^n$, for proteins of length $n$. For small proteins, $n \approx 50$.
• Random sampling of sequence space does not work, since $N(E_{nat}) / 20^n$ can be exponentially small.
• Need a more sophisticated counting method.
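To make the scale concrete, the sketch below counts $N(E)$ exactly for a 3-residue toy and prints how fast the fraction of qualifying sequences shrinks as the threshold drops. The random energy table and the helper names are assumptions for illustration; enumeration is only possible because $20^3 = 8000$, whereas $20^{50}$ is about $1.1 \times 10^{65}$.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy_table(n_sites, seed=1):
    """Hypothetical site-additive energies; not a real threading potential."""
    rng = random.Random(seed)
    return [{aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
            for _ in range(n_sites)]

def energy(seq, table):
    return sum(table[i][aa] for i, aa in enumerate(seq))

def count_below(threshold, table):
    """Exact N(E) = |{S : E(S -> X) <= E}| by listing all 20^n sequences."""
    n = len(table)
    return sum(1 for seq in itertools.product(AMINO_ACIDS, repeat=n)
               if energy(seq, table) <= threshold)

table = toy_energy_table(3)
total = 20 ** 3
for threshold in (0.0, -1.0, -2.0, -2.5):
    n_e = count_below(threshold, table)
    print(f"E = {threshold:5.1f}  N(E) = {n_e:5d}  fraction = {n_e / total:.5f}")

# Size of sequence space for a small protein of length 50.
print(f"20^50 = {20 ** 50:.3e}")
```

A uniform random sampler needs on the order of one over that fraction draws to see a single hit, which is why plain sampling breaks down for realistic lengths.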

16

Estimating N(E)

Express $N(E)$, with $N(E_k) = |\{S : E(S \to X) < E_k\}|$, as telescoping ratios / umbrella sampling:

$N(E) = N(E_{ref}) \cdot \frac{N(E_1)}{N(E_{ref})} \cdot \frac{N(E_2)}{N(E_1)} \cdots \frac{N(E)}{N(E_m)}$

• $E_{ref}$: Pre-selected reference energy
• $N(E_{ref})$: Number of sequences below $E_{ref}$
• $E_1, \ldots, E_m$: Intermediate energy values used in the ratios above
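The telescoping identity can be checked numerically on a system small enough to enumerate. This is a sketch under stated assumptions: a reduced 4-letter alphabet, a random site-additive energy table, and exact counts in place of sampled ratios; all function names are illustrative.

```python
import itertools
import random

ALPHABET = "ACDE"  # reduced alphabet (assumption) so all 4^6 sequences can be listed

def make_table(n_sites, seed=2):
    """Hypothetical site-additive energy table."""
    rng = random.Random(seed)
    return [{a: rng.uniform(-1.0, 1.0) for a in ALPHABET} for _ in range(n_sites)]

def exact_count(threshold, table):
    """N(E) = |{S : E(S -> X) < E}| by full enumeration."""
    return sum(1 for seq in itertools.product(ALPHABET, repeat=len(table))
               if sum(table[i][a] for i, a in enumerate(seq)) < threshold)

def telescoping_product(levels, table):
    """N(E_ref) * N(E_1)/N(E_ref) * ... for levels [E_ref, E_1, ..., E_m, E]."""
    counts = [exact_count(level, table) for level in levels]
    product = float(counts[0])
    for previous, current in zip(counts, counts[1:]):
        product *= current / previous
    return product

table = make_table(6)
e_mean = sum(sum(site.values()) / len(site) for site in table)
e_min = sum(min(site.values()) for site in table)
# Descending ladder from E_ref = E_mean toward the target energy E.
levels = [e_mean - f * (e_mean - e_min) for f in (0.0, 0.25, 0.5, 0.75)]
print(telescoping_product(levels, table), exact_count(levels[-1], table))
```

Both printed numbers agree up to rounding. In practice each ratio would be estimated by sampling, and the ladder keeps every individual ratio far from zero.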

19

Choosing Intermediate $E_k$ Values

[Figure: labels $E_{mean}$, $E_{min}$, $E_{nat}$]

20

Algorithm Summary: N(E)

• Given a structure $X$, compute $E_{mean}$ and $N(E_{mean})$.
• Pick $E_1, \ldots, E_m$ s.t. $E_k > E_{k+1}$ and $(E_k - E_{k+1})$ is decreasing with $k$.
• For $k = 1, \ldots, m$, run the Markov chain for $t$ steps. Compute $l_t(k+1) / l_t(k)$ as an estimate of $N(E_{k+1}) / N(E_k)$.

$N(E) = N(E_{mean}) \cdot \frac{N(E_1)}{N(E_{mean})} \cdot \frac{N(E_2)}{N(E_1)} \cdots \frac{N(E)}{N(E_m)}$
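The steps above can be sketched in Python on a toy problem. Several choices here are assumptions rather than the authors' implementation: the energy table is random and site-additive, the alphabet has 4 letters, $N(E_{mean})$ comes from plain uniform sampling, the last ladder level doubles as the target energy, and each ratio is taken as the fraction of chain samples below $E_{k+1}$ while the chain is confined below $E_k$.

```python
import itertools
import random

ALPHABET = "ACDE"  # reduced alphabet (assumption) so the answer can be checked exactly

def make_table(n_sites, seed=3):
    """Hypothetical site-additive energy table standing in for a real potential."""
    rng = random.Random(seed)
    return [{a: rng.uniform(-1.0, 1.0) for a in ALPHABET} for _ in range(n_sites)]

def energy(seq, table):
    return sum(table[i][a] for i, a in enumerate(seq))

def chain_ratio(e_hi, e_lo, table, steps, rng, burn_in=1000):
    """Estimate N(e_lo)/N(e_hi) with a walk that is uniform on {S : E(S) < e_hi}."""
    seq = [min(site, key=site.get) for site in table]  # minimum-energy start
    e = energy(seq, table)
    hits = 0
    for step in range(burn_in + steps):
        i = rng.randrange(len(seq))
        new = rng.choice(ALPHABET)
        e_new = e - table[i][seq[i]] + table[i][new]
        if e_new < e_hi:  # accept only moves that stay below e_hi
            seq[i], e = new, e_new
        if step >= burn_in and e < e_lo:
            hits += 1
    return hits / steps

def estimate_count(table, fractions=(0.2, 0.35, 0.45, 0.5), steps=20000, seed=0):
    rng = random.Random(seed)
    n = len(table)
    e_mean = sum(sum(site.values()) / len(site) for site in table)
    e_min = sum(min(site.values()) for site in table)
    # N(E_mean) by uniform sampling: a large share of sequences lies below the mean.
    below = sum(energy([rng.choice(ALPHABET) for _ in range(n)], table) < e_mean
                for _ in range(steps))
    estimate = len(ALPHABET) ** n * below / steps
    # Ladder E_1 > ... > E_m with shrinking gaps between successive levels.
    levels = [e_mean] + [e_mean - f * (e_mean - e_min) for f in fractions]
    for e_hi, e_lo in zip(levels, levels[1:]):
        estimate *= chain_ratio(e_hi, e_lo, table, steps, rng)
    return estimate, levels[-1]

def exact_count(threshold, table):
    """Reference answer by enumeration, only possible for toy sizes."""
    return sum(1 for seq in itertools.product(ALPHABET, repeat=len(table))
               if energy(seq, table) < threshold)

table = make_table(8)
estimate, target = estimate_count(table)
print(f"estimated N(E) = {estimate:.0f}, exact N(E) = {exact_count(target, table)}")
```

Starting every chain at the minimum-energy sequence guarantees a valid state for each level, since that sequence lies below every threshold on the ladder.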

21

Counting With THOM2: Markov Chain Convergence

• State space is connected: all states communicate via the minimum-energy state
• Mixing time (the Markov chain is ergodic) is polynomial in sequence length.
• Generalizes Morris-Sinclair algorithm (1999) for counting knapsack solutions to arbitrary alphabets

22

Sequence capacity without competition

• Remain at a particular fold $X$ and perform counting for this single structure
• Compute $N(E)$, $\Omega(E) = dN/dE$, $S(E) = \log(\Omega(E))$
• The temperature of sequence selection is defined as: $T = (dS/dE)^{-1}$
• Compute for a representative set of PDB structures (~3000 folds)
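A rough numerical reading of these definitions: bin the energies to approximate $\Omega(E)$, take $S = \log \Omega$, and obtain $T$ from a central finite difference of $S$. The sketch assumes a toy site-additive table over a 4-letter alphabet and an arbitrary bin width; a real calculation would use sampled counts rather than enumeration.

```python
import itertools
import math
import random

ALPHABET = "ACDE"  # reduced alphabet (assumption) to allow enumeration

def make_table(n_sites, seed=4):
    """Hypothetical site-additive energy table."""
    rng = random.Random(seed)
    return [{a: rng.uniform(-1.0, 1.0) for a in ALPHABET} for _ in range(n_sites)]

def energy_histogram(table, bin_width):
    """counts[b] = number of sequences with energy in [b*w, (b+1)*w)."""
    counts = {}
    for seq in itertools.product(ALPHABET, repeat=len(table)):
        e = sum(table[i][a] for i, a in enumerate(seq))
        b = math.floor(e / bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts

def selection_temperature(counts, bin_width):
    """T = (dS/dE)^-1 with S = log(Omega) and Omega = counts / bin_width.

    Uses a central difference, so only bins with both neighbours get a value.
    Keys of the result are bin-centre energies.
    """
    temps = {}
    for b in counts:
        if b - 1 in counts and b + 1 in counts:
            s_hi = math.log(counts[b + 1] / bin_width)
            s_lo = math.log(counts[b - 1] / bin_width)
            slope = (s_hi - s_lo) / (2 * bin_width)
            temps[(b + 0.5) * bin_width] = math.inf if slope == 0 else 1.0 / slope
    return temps

table = make_table(8)
hist = energy_histogram(table, bin_width=0.5)
for e_center, t in sorted(selection_temperature(hist, 0.5).items()):
    print(f"E = {e_center:6.2f}  T = {t:8.3f}")
```

In this toy, bins above the peak of $\Omega(E)$ give negative $T$, because $S$ decreases with $E$ there; the low-energy side is the region relevant to selection.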

25

Counting without competition:

Different folds, same length

26

Sequence Capacity and Flow

27

Coarse description of fold connectivity

• Different temperatures for alternate folds suggest lack of connectivity.

$T = (dS/dE)^{-1}$

28

Temperature distribution for the

potential TE-

31

Differentiating Between Competing Folds

• For a fold $X$, how do we count only sequences that are both compatible with $X$ and prefer $X$ to all other folds?
• Given a structure $X$ and a set of competing structures $Y = \{Y_1, \ldots, Y_K\}$, we wish to estimate the function $C(E)$, which gives the size of the set

$\{S : E(S \to X) < E \;\&\; E(S \to X) < E(S \to Y_j),\ j = 1, \ldots, K\}$
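For a toy system the set above can be counted directly. In this sketch each fold, $X$ and every competitor $Y_j$, is represented by its own hypothetical random energy table over a 4-letter alphabet; the names and the brute-force approach are illustrative only.

```python
import itertools
import random

ALPHABET = "ACDE"  # reduced alphabet (assumption) to allow enumeration

def make_table(n_sites, seed):
    """Hypothetical site-additive energy table for one fold."""
    rng = random.Random(seed)
    return [{a: rng.uniform(-1.0, 1.0) for a in ALPHABET} for _ in range(n_sites)]

def energy(seq, table):
    return sum(table[i][a] for i, a in enumerate(seq))

def count_below(threshold, table_x):
    """N(E): sequences with E(S -> X) < E, ignoring other folds."""
    return sum(1 for seq in itertools.product(ALPHABET, repeat=len(table_x))
               if energy(seq, table_x) < threshold)

def count_with_competition(threshold, table_x, competitor_tables):
    """C(E): sequences below E on X that also score lower on X than on every Y_j."""
    count = 0
    for seq in itertools.product(ALPHABET, repeat=len(table_x)):
        e_x = energy(seq, table_x)
        if e_x < threshold and all(e_x < energy(seq, t) for t in competitor_tables):
            count += 1
    return count

table_x = make_table(6, seed=10)
competitors = [make_table(6, seed=s) for s in (11, 12, 13)]
e_mean = sum(sum(site.values()) / len(site) for site in table_x)
e_min = sum(min(site.values()) for site in table_x)
for f in (0.0, 0.25, 0.5, 0.75):
    threshold = e_mean - f * (e_mean - e_min)
    n_e = count_below(threshold, table_x)
    c_e = count_with_competition(threshold, table_x, competitors)
    print(f"E = {threshold:6.2f}  N(E) = {n_e:5d}  C(E) = {c_e:5d}")
```

By construction $C(E) \le N(E)$, and the two coincide when the competitor list is empty.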

32

Differentiating Between Competing Folds

[Figure: energy levels $E^{(X)}_{min}$, $E^{(X)}$, $E^{(X)}_{nat}$ for fold $X$ and $E^{(Y)}_{nat}$, $E^{(Y)}_{min}$, $E^{(Y)}$ for fold $Y$, with the condition $E^{(X)} = E^{(Y)}$]

34

Counting with Competition

[Figure: energy levels $E_{min}$, $E_k$, $E_{k+1}$]

$C(E_k) = N(E_k) \cdot f_{ret}(E_k)$
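Read this way, $f_{ret}(E_k) = C(E_k) / N(E_k)$ is the fraction of sequences below $E_k$ that still prefer $X$ to every competitor, so it can be estimated from the same kind of Markov chain samples used for $N(E)$. The sketch below is an assumption-laden toy: random site-additive tables for $X$ and three competitors, a 4-letter alphabet, and an exact enumeration included only to check the sampled value.

```python
import itertools
import random

ALPHABET = "ACDE"  # reduced alphabet (assumption) so the estimate can be checked

def make_table(n_sites, seed):
    """Hypothetical site-additive energy table for one fold."""
    rng = random.Random(seed)
    return [{a: rng.uniform(-1.0, 1.0) for a in ALPHABET} for _ in range(n_sites)]

def energy(seq, table):
    return sum(table[i][a] for i, a in enumerate(seq))

def prefers_x(seq, table_x, competitors):
    e_x = energy(seq, table_x)
    return all(e_x < energy(seq, t) for t in competitors)

def sampled_f_ret(threshold, table_x, competitors, steps=20000, burn_in=1000, seed=0):
    """Fraction of uniform samples from {S : E(S -> X) < E_k} that prefer X."""
    rng = random.Random(seed)
    seq = [min(site, key=site.get) for site in table_x]  # minimum-energy start
    e = energy(seq, table_x)
    retained = 0
    for step in range(burn_in + steps):
        i = rng.randrange(len(seq))
        new = rng.choice(ALPHABET)
        e_new = e - table_x[i][seq[i]] + table_x[i][new]
        if e_new < threshold:  # stay inside the set below E_k
            seq[i], e = new, e_new
        if step >= burn_in and prefers_x(seq, table_x, competitors):
            retained += 1
    return retained / steps

def exact_f_ret(threshold, table_x, competitors):
    """C(E_k) / N(E_k) by enumeration, for checking the toy only."""
    below = kept = 0
    for seq in itertools.product(ALPHABET, repeat=len(table_x)):
        if energy(seq, table_x) < threshold:
            below += 1
            kept += prefers_x(seq, table_x, competitors)
    return kept / below

table_x = make_table(6, seed=20)
competitors = [make_table(6, seed=s) for s in (21, 22, 23)]
e_k = sum(sum(site.values()) / len(site) for site in table_x)  # use E_mean as E_k
print(sampled_f_ret(e_k, table_x, competitors), exact_f_ret(e_k, table_x, competitors))
```

Multiplying the sampled fraction by an estimate of $N(E_k)$ then gives $C(E_k)$ without enumerating sequence space.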

37

Maximal Retention Energy $E^*$

$E^*$ is the first-encountered energy where $f_{ret}$ is maximum (mostly 1). In general, $E^* > E_{min}$.
Between $E^*$ and $E_{min}$ the protein evolves in structure and sequence spaces.
Below $E^*$ only the sequence evolves.
Native proteins are always found above $E^*$.

38

$E^*$ and Behavior of $f_{ret}(E)$

[Figure: $f_{ret}$ with $E_{mean}$, $E_{nat}$, and $E^*$ marked]

39

$E_{nat} - E^*$ and contact density

40

E

and f

ret

( E