Multiple Sequence Alignment and Analysis - Lab | BSC 5936, Lab Reports of Biology

Material Type: Lab; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Fall 2003;


Uploaded on 08/30/2009

Steve Thompson 9/18/03
Special Topics BSC4933/5936
Florida State University
The Department of Biological Science
www.bio.fsu.edu
Sept. 18, 2003
An Introduction to Bioinformatics
http://bio.fsu.edu/~stevet/BSC5936.html
http://bio.fsu.edu/~stevet/MultipleAlignment.pdf


Multiple Sequence Alignment & Analysis

Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

More data yields stronger analyses, if done carefully!
Mosaic ideas and evolutionary ‘importance.’

The power and sensitivity of sequence-based computational methods increase dramatically with the addition of more data, provided the analysis is done carefully; otherwise, more data can confound the issue. The patterns of conservation become clearer when the conserved portions of sequences are compared across a larger and larger dataset. Those areas most resistant to change are functionally the most important to the molecule. The basic assumption is that those portions of sequence of crucial functional value are most constrained against evolutionary change. They will not tolerate many mutations. It is not that mutations do not occur in these portions, just that most mutations in the region are lethal, so we never see them. Other areas of sequence are able to drift more readily, being less subject to evolutionary pressure. Therefore, sequences end up a mosaic of quickly and slowly changing regions over evolutionary time.

So what; why even bother?

Applications:
Probe, primer, and motif design;
Graphical illustrations;
Comparative homology inference;
Molecular evolutionary analysis.

All right, well, how do you do it?

OK, back to Multiple Sequence Alignment

Applicability? So what’s so great about multiple sequence alignments; why would anyone want to bother? They are:

very useful in the development of primers and probes as well as in motif discovery;
great for producing annotated, publication-quality graphics and illustrations;
invaluable in structure/function studies through homology inference; and
required for molecular evolutionary phylogenetic inference programs.

Alignments help with primer design and motif discovery by allowing you to visualize the most conserved regions. Any level of specificity can be achieved by picking areas of high variability in the overall dataset that correspond to areas of high conservation in subset datasets, which lets you differentiate between universal and specific probe sequences.
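As a rough illustration of that idea, the sketch below scores each alignment column by the fraction of sequences that share its most common symbol, once for the whole dataset and once for a subset. The function name, the scoring rule, and the four-sequence toy alignment are all assumptions made for illustration; real probe design would use a proper alignment and a more careful measure.

```python
from collections import Counter

def column_conservation(rows):
    """Fraction of sequences sharing the most common symbol in each column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("rows must all have the same aligned length")
    fractions = []
    for column in zip(*rows):
        # Count of the single most frequent symbol in this column
        top_count = Counter(column).most_common(1)[0][1]
        fractions.append(top_count / len(column))
    return fractions

# Toy alignment, made up for illustration (not real data)
everyone = ["ACGT", "ACGA", "ACCA", "ATCA"]
subset = everyone[2:]  # a hypothetical subgroup of interest

print(column_conservation(everyone))
print(column_conservation(subset))
```

In this toy case the third column is variable across the whole dataset but invariant within the subset, which is the kind of position a subset-specific probe would target.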

Graphics prepared from alignments can dramatically illustrate functional and structural

conservation. These can take many forms of all or portions of an alignment — shaded or

colored boxes or letters for each residue, cartoon representations of features, running line

graphs of overall similarity, overlays of attributes, various consensus representations — all can

be printed with high-resolution equipment, usually in color or gray tones.

Conserved regions of an alignment are functionally important. Structure is also conserved in

these crucial regions. In fact, recognizable structural conservation between true homologues

extends way beyond statistically significant sequence similarity. An oft-cited example is in the

serine protease superfamily. S. griseus protease A demonstrates remarkably little similarity

when compared to the rest of the superfamily (Expectation values E()≥ 10

in a typical search)

yet its three-dimensional structure clearly shows its allegiance to the serine proteases (Pearson,

W.R., personal communication). These principles are the premise of ‘homology modeling.’

Alignments are used to infer phylogeny. Based on the assertion of homologous positions,

programs such as those in PAUP* (Phylogenetic Analysis Using Parsimony [and other

methods]) and PHYLIP (PHYLogeny Inference Package) estimate the most reasonable

evolutionary tree for that alignment. This is a huge, complicated, and highly contentious field.

(See the Woods Hole Marine Biological Laboratory’s excellent summer course, the Workshop

on Molecular Evolution, at http://newfish.mbl.edu/Course/.) However, always remember that

regardless of algorithm used, parsimony, any distance method, maximum likelihood, or even

Bayesian Inference, all molecular sequence phylogenetic inference programs make the

absolute validity of your input alignment their first and most critical assumption.

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:
N-dimensional matrix....
$\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$

As seen in pairwise dynamic programming, looking at every possible position by sliding one sequence along every other sequence just will not work for alignment. Dynamic programming therefore reduces the pairwise problem back down to $N^2$ for two sequences of length $N$. But how do you work with more than just two sequences at a time? It becomes a much harder problem. You could painstakingly align all your sequences manually using some type of editor, and many people do just that, but some type of automated solution is desirable, at least as a starting point for manual alignment. However, solving the dynamic programming algorithm for more than just two sequences rapidly becomes intractable. Dynamic programming’s complexity, and hence its computational requirements, increases exponentially with the number of sequences in the dataset being compared ($\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$). Mathematically this is a matrix with as many dimensions as there are sequences, quite complex indeed. As we have seen, pairwise dynamic programming solves a two-dimensional matrix, and the complexity of the solution is equal to the length of the longest sequence squared. Well, a three-member standard dynamic programming sequence comparison would be a matrix with three axes, the length of the longest sequence cubed, and so forth. You can at least draw a three-dimensional matrix, but more than that becomes impossible to even visualize. It quickly boggles the mind!
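To get a feel for how quickly this grows, here is a small sketch that simply evaluates the formula above. The function name and the sequence length of 300 are arbitrary choices for illustration.

```python
def dp_matrix_cells(sequence_length, number_of_sequences):
    """Number of cells in a full dynamic programming matrix:
    [sequence length] raised to the number of sequences."""
    return sequence_length ** number_of_sequences

# Hypothetical dataset of sequences that are each 300 residues long
for n in (2, 3, 4, 5):
    print(n, "sequences:", dp_matrix_cells(300, n), "cells")
```

Going from two to three sequences of this length already takes the matrix from ninety thousand cells to twenty-seven million.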

Multiple Sequence Dynamic Programming

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.

How the Algorithm Works. Because the full solution is intractable, the most common implementations of automated multiple alignment modify dynamic programming by establishing a pairwise order in which to build the alignment. This modification is known as pairwise, progressive dynamic programming. Originally attributed to Feng and Doolittle (1987), this variation of the dynamic programming algorithm generates a global alignment, but restricts its search space at any one time to a local neighborhood of the full length of only two sequences. Consider a group of sequences. First, all are compared to each other, pairwise, using normal dynamic programming. This establishes an order for the set, most to least similar. Subgroups are clustered together similarly. Then the two most similar sequences are aligned using normal dynamic programming. Next, a consensus of the two is created and aligned to the third sequence using standard dynamic programming. Then a consensus of the first three sequences is created and aligned to the fourth most similar. This process continues until it has worked its way through all sequences and/or sets of clusters. The pairwise, progressive solution is implemented in several programs. Perhaps the most popular is Higgins’ and Thompson’s ClustalW (1994) and its multi-platform graphical user interface, ClustalX (Thompson, et al., 1997). ClustalX has versions available for most windowing operating systems: most UNIX flavors, Microsoft Windows, and Macintosh. The ClustalX homesite guarantees the latest version: ftp://ftp-igbmc.u-strasbg.fr/ClustalX/. The GCG program PileUp implements a very similar method within the Wisconsin Package.
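The first step of that procedure, scoring every pair and deriving an order in which to add sequences, can be sketched as follows. This is only a toy under stated assumptions: the match, mismatch, and gap values are arbitrary, the ordering is a simple greedy one, and the consensus and cluster-to-cluster alignment steps are left out entirely. It is not how ClustalW or PileUp are actually implemented.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b by standard pairwise dynamic
    programming (scoring values are illustrative assumptions)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def progressive_order(seqs):
    """Return sequence names in the order they would join the alignment:
    the most similar pair first, then whichever remaining sequence is
    most similar to one already chosen."""
    scores = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
              for pair in combinations(seqs, 2)}
    order = sorted(max(scores, key=scores.get))
    remaining = set(seqs) - set(order)
    while remaining:
        best = max(remaining,
                   key=lambda r: max(scores[frozenset((r, o))] for o in order))
        order.append(best)
        remaining.remove(best)
    return order

# Toy sequences, made up for illustration
toy = {"a": "GATTACA", "b": "GATTACA", "c": "GATTTCA", "d": "CCCCCCC"}
print(progressive_order(toy))
```

Real programs build a guide tree from the pairwise scores and align profiles rather than single sequences, but the pairwise-first ordering is the same basic idea.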

Web resources for pairwise, progressive multiple alignment:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

Biocomputing sites around the globe on the World Wide Web (WWW) provide access to multiple alignment resources. In general, Web resources for multiple alignment are neither as easy to use nor as powerful as performing multiple alignment locally, on either your own office machine or a local dedicated sequence analysis server. Some of the difficulty comes from limits in Web interface scripting and forms capabilities and from cut-and-paste errors, but some also comes from the unreliability of Internet connections in general. In spite of that warning, it is possible, and relatively easy, to take advantage of multiple sequence alignment resources available on the Internet through the WWW. However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll recognize that size very quickly when you’ve reached it!

One of the most comprehensive collections is at the Bielefeld University Virtual School of Natural Sciences BioComputing Division (VSNS-BCD) in Germany: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html. Another very good one is at the PBIL (Pôle Bio-Informatique Lyonnais) World Wide Web server in Lyon, France (http://pbil.univ-lyon1.fr/alignment.html), and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL EBI) in Hinxton, U.K., has a slick interface to ClustalW (http://www.ebi.ac.uk/clustalw/). In the U.S.A., the previously mentioned Baylor College of Medicine Search Launcher (http://searchlauncher.bcm.tmc.edu/) is also available.

Reliability and the Comparative Approach

explicit homologous correspondence;
manual adjustments based on knowledge, especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/html/.

Reliability?

To help ensure the reliability of sequence alignments, always use comparative approaches. A multiple sequence alignment is a hypothesis of evolutionary history. Ensure that you have prepared a good one; be sure that it makes sense. Think about it: a sequence alignment is a statement of positional homology. It establishes the explicit homologous correspondence of each individual sequence position, each column in the alignment. Therefore, devote considerable time and energy toward developing the most satisfying multiple sequence alignment possible. This includes adjusting alignments manually based on your knowledge of the biological system being studied.

Researchers have successfully used the conservation of covarying

sites in ribosomal and other structural RNA alignments to assist in

alignment refinement. That is, as one base in a stem structure

changes the corresponding Watson-Crick paired base will change in

a corresponding manner. This process has been used extensively

by the Ribosomal Database Project at the Center for Microbial

Ecology at Michigan State University to help guide the construction

of their rRNA alignments and structures. The WWW Uniform

Resource Locator (URL) is http://rdp.cme.msu.edu/html/.
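A covariation check of this kind can be sketched in a few lines. The helper names, the restriction to the four canonical Watson-Crick pairs (G-U wobble pairs are ignored here), and the toy RNA rows are assumptions for illustration only; this is not the Ribosomal Database Project’s actual procedure.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def paired_fraction(rows, i, j):
    """Fraction of sequences whose residues in columns i and j can pair."""
    pairs = [(row[i].upper(), row[j].upper()) for row in rows]
    return sum(p in WATSON_CRICK for p in pairs) / len(pairs)

def covaries(rows, i, j):
    """True if the two columns vary, yet every sequence keeps a
    Watson-Crick pair between them."""
    distinct_pairs = {(row[i].upper(), row[j].upper()) for row in rows}
    return len(distinct_pairs) > 1 and distinct_pairs <= WATSON_CRICK

# Toy aligned RNA rows, made up for illustration
stem = ["GAAAC", "CAAAG", "AAAAU"]
print(covaries(stem, 0, 4), paired_fraction(stem, 0, 4))
```

Columns that covary this way support the proposed alignment of a stem; a column pair that stops pairing after a manual edit is a hint that the edit moved something out of register.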

Structural & Functional correspondence in the Wisconsin Package’s SeqLab

Editing alignments is allowed and is to be encouraged. Specialized sequence editing software such as GCG’s SeqLab Editor helps achieve this, but any editor will do as long as the sequences end up properly formatted afterwards. After some automated solution has offered its best guess, go into the alignment and use your own brain to improve it. Use all available information and understanding to ensure that all columns are truly homologous. Look for conserved functional sites to help guide your judgment. Overall, things to look for include columns of strongly conserved residues such as tryptophans, cysteines, and histidines, and important structural amino acids such as prolines, tyrosines, and phenylalanines; make sure they all align. Be sure that the members of the hydrophobic substitution triumvirate (isoleucine, leucine, and valine, and to a lesser extent methionine) align; they all easily swap places. Ensure that all known enzymatic, regulatory, and structural elements align. The results of any subsequent analyses are absolutely dependent upon the alignment.

Beware of aligning apples and oranges [and grapefruit]!

Paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.

Be sure an alignment makes biological sense; align things that make sense to align! Beware of comparing ‘apples and oranges.’ If creating alignments for phylogenetic inference, either make paralogous comparisons (i.e., evolution via gene duplication) to ascertain gene phylogenies within one organism, or orthologous comparisons (within one ancestral locus) to ascertain gene phylogenies between organisms, which should imply organismal phylogenies. Try not to mix them up without complete data representation. Lots of confusion can arise, especially if you do not have all the data and/or if the nomenclature is contradictory; extremely misleading interpretations can result. Be wary of trying to align genomic sequences with cDNA when working with DNA; the introns will cause all sorts of headaches. Similarly, do not align mature and precursor proteins from the same organism and locus. It does not make evolutionary sense, as one is not evolved from the other; rather, one is the other. These are all easy mistakes to make; try your best to avoid them.

Mask out uncertain areas

I reiterate, the most important factor in inferring reliable phylogenies is the accuracy of the multiple sequence alignment. The interpretation of your results is utterly dependent on the quality of your input. In fact, many experts advise against using any parts of the sequence data that are at all questionable. Only analyze those portions that assuredly do align. If any portions of the alignment are in doubt, throw them out. This usually means trimming down or masking out the alignment’s terminal ends and may require internal trimming or masking as well. Biocomputing is always a delicate balance of signal against noise, and sometimes it can be quite the balancing act!

Remember the old adage “garbage in, garbage out!” Some general guidelines to remember include the following:

If the homology of a region is in doubt, then throw it out (or “mask” it, as can be done using SeqLab).
Avoid the most diverged parts of molecules; they are the greatest source of systematic error.
Do not include sequences that are more diverged than necessary for the analysis at hand.
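Outside of SeqLab, the same masking idea can be sketched by hand. The function below and its one-character-per-column mask string (“1” keeps a column, “0” drops it) are hypothetical conventions chosen for this example, not SeqLab’s mask format.

```python
def apply_mask(rows, mask):
    """Keep only the alignment columns whose mask character is '1'."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("mask must be as long as every aligned sequence")
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    return ["".join(row[i] for i in keep) for row in rows]

# Toy alignment, made up for illustration; the two middle columns
# are treated as too uncertain to analyze
alignment = ["ACGT-A", "A-GTTA"]
print(apply_mask(alignment, "110011"))
```

The original alignment is left untouched, so the mask can be revised later without losing any data.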

Complications cont.: Format hassles!

Specialized format conversion tools such as GCG’s ‘From’ and ‘To’ programs and PAUPSearch.
Don Gilbert’s public domain ReadSeq program.

One of the biggest problems in computational biology is that of molecular sequence data format. Each suite of programs to come along seems to require its own different sequence format. The major databases all have their own; Clustal has its own; even the database similarity searching program FastA has a sequence format associated with it. GCG Wisconsin Package sequence format exists in both single-sequence and Multiple Sequence Format (MSF) versions, and GCG’s SeqLab has its own format, called Rich Sequence Format (RSF), that contains sequence data along with reference and feature annotation. PAUP* has a required format called the NEXUS file, and PHYLIP has its own unique input data format requirements. The PAUP* interfaces in the GCG Wisconsin Package, PAUPSearch and PAUPDisplay, automatically generate their required NEXUS format directly from the GCG formatted files. Most systems are not nearly so helpful. Several different programs are available to convert formats back and forth between the required standards, but it all can get quite confusing. One available program, ReadSeq by Don Gilbert at Indiana University (1990), allows for back-and-forth conversion between several different formats. I would heartily recommend installing it on all of your computers. It comes as an old ‘tried-and-true’ C version or a new Java version with a graphical interface. I don’t have much experience with the Java version but have relied on the C version for many years.
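To show what a conversion involves, here is a minimal sketch that reads aligned sequences in FASTA format and writes them in the strict sequential PHYLIP layout (a count line, then a ten-character name field followed by the sequence). It is a hand-rolled illustration, not ReadSeq; the function names are made up, and it assumes every FASTA header has a name and that names stay unique after truncation to ten characters.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def to_phylip(records):
    """Write aligned records in sequential PHYLIP layout."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        lines.append(f"{name[:10]:<10}{seq}")  # pad or cut name to 10 chars
    return "\n".join(lines)

# Toy input, made up for illustration
fasta = ">seq1 demo\nACG-\nTT\n>seq2\nAC\nGTTT\n"
print(to_phylip(read_fasta(fasta)))
```

Even this small case shows where things go wrong in practice: name length limits, wrapped sequence lines, and the requirement that every sequence have the same aligned length.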

Still more complications: indels and missing data symbols (i.e. gaps) designation discrepancy headaches.

., -, ~, ?, N, or X ..... Help!

Alignment gaps are another problem. Different program suites may use different symbols to represent them. Most programs use hyphens, “-”; the Wisconsin Package uses periods, “.”. Furthermore, not all gaps in sequences should be interpreted as deletions. Interior gaps are probably okay to represent this way because, regardless of whether a deletion, an insertion, or a duplication event created the gap, logically they will be treated the same by the algorithms. These are indels. However, end gaps should not be represented as indels, because a lack of information beyond the length of a given sequence may not be due to a deletion or insertion event. It may have nothing to do with the particular stretch being analyzed at all. It may just not have been sequenced! These gaps are just placeholders for the sequence. Therefore, it is safest to manually edit an alignment to change leading and trailing gap symbols to “x”s, which mean “unknown amino acid,” to “n”s, which mean “unknown base,” or to “?”s, which are supported by many programs, but not all, and mean “unknown residue or indel.” This will ensure that the programs do not make incorrect assumptions about your sequences.
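That end-gap edit is easy to automate. The sketch below is a hypothetical helper, with the set of gap symbols and the default replacement chosen for illustration; it rewrites only leading and trailing gaps and leaves interior gaps alone.

```python
def mark_end_gaps(seq, gap_chars="-.~", unknown="?"):
    """Replace leading and trailing gap symbols with an 'unknown' symbol,
    leaving interior gaps (true indels) untouched."""
    core = seq.strip(gap_chars)
    if not core:
        # The row is nothing but gaps
        return unknown * len(seq)
    lead = len(seq) - len(seq.lstrip(gap_chars))
    trail = len(seq) - len(seq.rstrip(gap_chars))
    return unknown * lead + core + unknown * trail

# Toy aligned DNA row, made up for illustration
print(mark_end_gaps("--AC-GT..", unknown="n"))
```

Pass `unknown="x"` for protein alignments or `unknown="n"` for DNA, and check that the downstream program accepts the symbol you choose.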

The first GTP binding domain of EF-1α/Tu

A consensus isn’t necessarily the biologically “correct” combination.
A simple consensus throws much information away!
Therefore, motif definition.

G H V D H G K S

Based on experimental evidence, we know that the indicated region, bounded by the Glycine and Serine above, is essential. To build a consensus, just count up the various residues in each of those columns and assign the most common one to the consensus. Simple. But what about the fact that the middle Histidine isn’t always a Histidine? In this dataset, just as often it’s a Serine and sometimes it’s an Alanine. Other positions are also seen not to be invariant. And there are lots of other members of this gene family not represented here at all. A consensus isn’t necessarily the biologically “correct” combination. How do we include this other information? A simple consensus throws much of it away. Therefore, we need to adopt some sort of standardized ambiguity notation. The trick is to define a motif such that it minimizes false positives and maximizes true positives; i.e., it needs to be just discriminatory enough. The development of the exact motif is largely empirical; a pattern is made, tested against the database, then refined, over and over, although when experimental evidence is available, it is always incorporated. This is known as motif definition, and a scientist in Switzerland, Dr. Amos Bairoch, has done it for tons of sequences and catalogued the results in a compilation, the PROSITE Database of protein families and domains. It contains 1079 documentation entries that describe 1459 different motifs (Release 16.33, of 25-Jan-2001)! This sort of characteristic local sequence description is variously known as a motif, template, signature, pattern, and even fingerprint; don’t let the terminology confuse you. They can all be thought of as a one-dimensional description of some sort of consensus region of a sequence dataset.
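The counting procedure, and what it discards, can be seen in a short sketch. The three rows below are invented for illustration: the first is the consensus shown on the slide, and the other two are hypothetical variants, not real family members.

```python
from collections import Counter

def column_counts(rows):
    """Residue counts for every column of an alignment."""
    return [Counter(column) for column in zip(*rows)]

def simple_consensus(rows):
    """Most common residue in each column; everything else is discarded."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(rows))

# Toy rows, made up for illustration
rows = ["GHVDHGKS", "GHVDSGKS", "GHVDHGKT"]
print(simple_consensus(rows))
print(dict(column_counts(rows)[4]))  # the variation the consensus hides
```

The consensus string reports only the winner at each position, while the per-column counts still show that the fifth position is not invariant. Keeping that variability is what a motif’s ambiguity notation is for.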

The EF-1α/Tu P-Loop

Defined as: (A,G)x4GK(S,T).
A one-dimensional ‘regular-expression’ of a conserved site.
Not necessarily biologically meaningful.
Motifs are limited in their ability to discriminate a residue’s ‘importance.’

This site is known as the P-Loop and is defined as (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine. Exceptions are noted in the PROSITE documentation. This particular site has been very well researched, and many three-dimensional structures are available for it. It always has a beta/alpha/beta secondary structure conformation and is sometimes known as the “Rossmann Fold.” Here the site is shown in the Guanine Nucleotide-Binding Protein G(I), Alpha-1 Subunit (Adenylate Cyclase-Inhibiting) from Rattus norvegicus (common rat), GBI1_RAT, courtesy of ExPASy’s Swiss-3DImage collection:

ftp://ca.expasy.org/databases/swiss-3dimage/IMAGES/JPEG/S3D00521.jpg
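Because the motif is a one-dimensional pattern, it translates directly into a regular expression. The sketch below uses Python’s standard `re` module; the translation of (A,G)x4GK(S,T) into `[AG].{4}GK[ST]` follows the definition above, while the function name and the test sequences are made up for illustration.

```python
import re

# (A,G)x4GK(S,T): A or G, any four residues, G, K, then S or T
P_LOOP = re.compile(r"[AG].{4}GK[ST]")

def find_p_loop(protein):
    """Return (start, end, matched text) for the first P-Loop match,
    or None if the motif is absent."""
    match = P_LOOP.search(protein.upper())
    if match is None:
        return None
    return match.start(), match.end(), match.group()

# Toy sequences, made up for illustration
print(find_p_loop("MKTLLGHVDHGKSTLA"))
print(find_p_loop("MKTLLGHVDHGRSTLA"))
```

A pattern this short will also match unrelated proteins by chance, which is why a hit is a hint to follow up rather than proof of a nucleotide-binding site.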

OK, so motifs are one way to “capture” the information of an important portion of an alignment. However, motifs are not necessarily biologically meaningful, especially those associated with post-translational modifications such as phosphorylation sites, and they cannot convey any degree of the “importance” of the residues. For instance, in the P-Loop, is it better to have an Alanine or a Glycine in that first position, or doesn’t it matter? This lack of a sense of importance causes a loss of sensitivity. In order to convey the importance of each residue, a more “robust” method must be used.