















Material Type: Lab; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Fall 2003;
















Sept. 18, 2003
An Introduction to Bioinformatics
More data yields stronger analyses, if done carefully!
Mosaic ideas and evolutionary ‘importance.’
The power and sensitivity of sequence-based computational
methods increase dramatically with the addition of more data.
More data yields stronger analyses, if done carefully!
Otherwise, it can confound the issue. Patterns of
conservation become clearer as the conserved portions of
sequences are compared across a larger and larger dataset.
The basic assumption is that those portions of a sequence with
crucial functional value are the most constrained against
evolutionary change, so the areas most resistant to change are
taken to be functionally the most important to the molecule. They
will not tolerate many mutations. It is not that mutations do not
occur in these portions, just that most mutations there are lethal,
so we never see them. Other areas of sequence are able to drift more
readily, being less subject to evolutionary pressure. Therefore,
sequences end up as a mosaic of quickly and slowly changing
regions over evolutionary time.
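The mosaic idea can be made concrete by scoring each column of an alignment. The following is a minimal sketch, assuming a made-up toy alignment and a deliberately simple score (the fraction of sequences that share the most common residue in a column). The function name `column_conservation` and the data are illustrative only and do not come from any program mentioned in these notes.

```python
from collections import Counter

def column_conservation(alignment):
    """Return, per column, the fraction of sequences sharing the most common residue.

    Gap symbols are counted like any other character in this simple sketch.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

# Hypothetical toy alignment (made-up sequences, already aligned).
toy_alignment = [
    "GHVDHGKS",
    "GHVDSGKS",
    "GAIDAGKT",
    "GHVDHGKS",
]
scores = column_conservation(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(position, score)
```

Columns that score 1.0 are invariant in this toy dataset; lower scores mark the positions that are free to drift.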
Applicability?
Applicability? So what’s so great about multiple sequence alignments; why would anyone
want to bother? They are:
very useful in the development of primers and probes as well as in motif discovery;
great for producing annotated, publication quality, graphics and illustrations;
invaluable in structure/function studies through homology inference; and
required for molecular evolutionary phylogenetic inference programs.
Alignments help with primer design and motif discovery by allowing you to visualize the most
conserved regions. Any level of specificity can be achieved by picking areas of high variability
in the overall dataset that correspond to areas of high conservation in subset datasets to
differentiate between universal and specific probe sequences.
Graphics prepared from alignments can dramatically illustrate functional and structural
conservation. These can take many forms of all or portions of an alignment — shaded or
colored boxes or letters for each residue, cartoon representations of features, running line
graphs of overall similarity, overlays of attributes, various consensus representations — all can
be printed with high-resolution equipment, usually in color or gray tones.
Conserved regions of an alignment are functionally important. Structure is also conserved in
these crucial regions. In fact, recognizable structural conservation between true homologues
extends way beyond statistically significant sequence similarity. An oft-cited example is in the
serine protease superfamily. S. griseus protease A demonstrates remarkably little similarity
when compared to the rest of the superfamily (Expectation values E()≥ 10
in a typical search)
yet its three-dimensional structure clearly shows its allegiance to the serine proteases (Pearson,
W.R., personal communication). These principles are the premise of ‘homology modeling.’
Alignments are used to infer phylogeny. Based on the assertion of homologous positions,
programs such as those in PAUP* (Phylogenetic Analysis Using Parsimony [and other
methods]) and PHYLIP (PHYLogeny Inference Package) estimate the most reasonable
evolutionary tree for that alignment. This is a huge, complicated, and highly contentious field.
(See the Woods Hole Marine Biological Laboratory’s excellent summer course, the Workshop
on Molecular Evolution, at http://newfish.mbl.edu/Course/.) However, always remember that
regardless of algorithm used, parsimony, any distance method, maximum likelihood, or even
Bayesian Inference, all molecular sequence phylogenetic inference programs make the
absolute validity of your input alignment their first and most critical assumption.
As seen in pairwise dynamic programming, looking at every possible
position by sliding one sequence along every other sequence just will
not work for alignment. Therefore, dynamic programming reduces the
problem back down to $N^2$. But how do you work with more than just two
sequences at a time? It becomes a much harder problem. You could
painstakingly manually align all your sequences using some type of
editor, and many people do just that, but some type of an automated
solution is desirable, at least as a starting point to manual alignment.
However, solving the dynamic programming algorithm for more than just
two sequences rapidly becomes intractable. Dynamic programming’s
complexity, and hence its computational requirements, increases
exponentially with the number of sequences in the dataset being
compared ($\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$).
Mathematically this is an N-dimensional matrix, quite complex indeed.
As we have seen, pairwise dynamic programming solves a two-
dimensional matrix, and the complexity of the solution is equal to the
length of the longest sequence squared. Well, a three member standard
dynamic programming sequence comparison would be a matrix with
three axes, the length of the longest sequence cubed, and so forth. You
can at least draw a three-dimensional matrix, but more than that
becomes impossible to even visualize. It quickly boggles the mind!
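A little arithmetic shows how quickly this grows. The sketch below assumes a sequence length of 300 residues, a number chosen purely for illustration, and counts the cells of the full dynamic programming matrix using the formula from the text. The function name `dp_matrix_cells` is hypothetical.

```python
def dp_matrix_cells(sequence_length, number_of_sequences):
    """Cells in a full dynamic programming matrix: length ** number of sequences."""
    return sequence_length ** number_of_sequences

# Assumed sequence length of 300; only the growth pattern matters here.
for n in (2, 3, 5, 10):
    print(f"{n} sequences: {dp_matrix_cells(300, n):.3e} cells")
```

Two sequences need 90,000 cells and three need 27,000,000; each additional sequence multiplies the work by another factor of the sequence length.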
Therefore: pairwise, progressive dynamic programming restricts the
solution to the neighborhood of only two sequences at a time.
All sequences are compared, pairwise, and then each is aligned to its
most similar partner or group of partners. Each group of partners is then
aligned to finish the complete multiple sequence alignment.
How the Algorithm Works. Therefore, the most common
implementations of automated multiple alignment modify dynamic
programming by establishing a pairwise order in which to build the
alignment. This modification is known as pairwise, progressive dynamic
programming. Originally attributed to Feng and Doolittle (1987), this
variation of the dynamic programming algorithm generates a global
alignment, but restricts its search space at any one time to a local
neighborhood of the full length of only two sequences. Consider a
group of sequences. First all are compared to each other, pairwise,
using normal dynamic programming. This establishes an order for the
set, most to least similar. Subgroups are clustered together similarly.
Then take the top two most similar sequences and align them using
normal dynamic programming. Now create a consensus of the two and
align that consensus to the third sequence using standard dynamic
programming. Now create a consensus of the first three sequences
and align that to the fourth most similar. This process continues until it
has worked its way through all sequences and/or sets of clusters. The
pairwise, progressive solution is implemented in several programs.
Perhaps the most popular is Higgins’ and Thompson’s ClustalW (1994)
and its multi-platform, graphical user interface ClustalX (Thompson, et
al., 1997). ClustalX has versions available for most windowing
computing Operating Systems — most UNIX flavors, Microsoft
Windows, and Macintosh. The ClustalX homesite guarantees the latest
version: ftp://ftp-igbmc.u-strasbg.fr/ClustalX/. The GCG program PileUp
implements a very similar method within the Wisconsin Package.
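The ordering step described above can be sketched in a few lines. This is only an illustration of the idea, not the ClustalW or PileUp implementation: it assumes a toy scoring scheme (match +1, mismatch -1, linear gap penalty -1), made-up DNA sequences, and a simple rule for choosing the next sequence (highest score against any sequence already included). It stops at the join order and does not build consensus sequences or the final alignment. The names `global_score` and `progressive_order` are hypothetical.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

def progressive_order(sequences):
    """Return sequence names in the order they would join the growing alignment.

    Expects a dict of at least two named sequences.
    """
    # Step 1: compare all sequences to each other, pairwise.
    scores = {frozenset(pair): global_score(sequences[pair[0]], sequences[pair[1]])
              for pair in combinations(sequences, 2)}
    # Step 2: start with the most similar pair.
    first_pair = max(scores, key=scores.get)
    order = sorted(first_pair)
    remaining = set(sequences) - first_pair
    # Step 3: repeatedly add the sequence most similar to anything already included
    # (an assumed rule; real programs build a guide tree instead).
    while remaining:
        best = max(sorted(remaining),
                   key=lambda name: max(scores[frozenset((name, done))] for done in order))
        order.append(best)
        remaining.remove(best)
    return order

# Hypothetical toy sequences.
toy_sequences = {"a": "GATTACA", "b": "GATGACA", "c": "GCTTAGA", "d": "TTTTTTT"}
print(progressive_order(toy_sequences))
```

With these toy sequences, `a` and `b` differ at a single position and are paired first, `c` joins next, and the divergent `d` is added last.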
Web resources for pairwise, progressive multiple alignment
Biocomputing sites around the globe on the World Wide Web (WWW)
provide access to multiple alignment resources. In general, Web
resources for multiple alignment are neither as easy to use nor as powerful as
performing multiple alignment locally on either your own office machine or
on a local dedicated sequence analysis server. Some of the difficulty
comes from limits in Web interface scripting and forms capabilities, and
cut-and-paste errors, but also just the unreliability of Internet connections
in general. In spite of that warning, it is possible, and relatively easy to
take advantage of multiple sequence resources available on the Internet
through the WWW. However, problems with very large datasets and huge
multiple alignments make doing multiple sequence alignment on the Web
impractical after your dataset has reached a certain size. You’ll recognize
that size very quickly when you’ve reached it!
One of the most comprehensive collections is at the Bielefeld University
Virtual School of Natural Sciences BioComputing Division (VSNS-BCD) in
Germany: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html.
Another very good one is at
the PBIL (Pôle Bio-Informatique Lyonnais) World Wide Web server in
Lyon, France (http://pbil.univ-lyon1.fr/alignment.html) and the European
Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL
EBI) in Hinxton U.K. has a slick interface to ClustalW
(http://www.ebi.ac.uk/clustalw/). In the U.S.A. the previously mentioned
Baylor College of Medicine Search Launcher
(http://searchlauncher.bcm.tmc.edu/) is also available.
Reliability and the Comparative Approach
Reliability?
To help assure the reliability of sequence alignments always use
comparative approaches. A multiple sequence alignment is a
hypothesis of evolutionary history. Ensure that you have prepared a
good one — be sure that it makes sense. Think about it — a
sequence alignment is a statement of positional homology. It
establishes the explicit homologous correspondence of each
individual sequence position, each column in the alignment.
Therefore, devote considerable time and energy toward developing
the most satisfying multiple sequence alignment possible.
This includes adjusting alignments manually based on your
knowledge of the biological system being studied.
Researchers have successfully used the conservation of covarying
sites in ribosomal and other structural RNA alignments to assist in
alignment refinement. That is, as one base in a stem structure
changes the corresponding Watson-Crick paired base will change in
a corresponding manner. This process has been used extensively
by the Ribosomal Database Project at the Center for Microbial
Ecology at Michigan State University to help guide the construction
of their rRNA alignments and structures. The WWW Uniform
Resource Locator (URL) is http://rdp.cme.msu.edu/html/.
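The covariation idea can be checked mechanically for any two columns of an RNA alignment. The sketch below assumes a made-up toy alignment and counts only strict Watson-Crick pairs (G-U wobble pairs are deliberately ignored, which is a simplification). The function name `paired_fraction` is hypothetical, and this is not the Ribosomal Database Project's procedure.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def paired_fraction(alignment, col_i, col_j):
    """Fraction of sequences whose bases at two columns form a Watson-Crick pair."""
    pairs = [(seq[col_i].upper(), seq[col_j].upper()) for seq in alignment]
    return sum(pair in WATSON_CRICK for pair in pairs) / len(pairs)

# Hypothetical toy RNA alignment: the first and last columns change together.
toy_rna = [
    "GACGAAAGUC",
    "AACGAAAGUU",
    "CACGAAAGUG",
    "UACGAAAGUA",
]
print(paired_fraction(toy_rna, 0, 9))  # candidate stem positions
print(paired_fraction(toy_rna, 0, 4))  # unrelated positions, for contrast
```

Two columns that vary between sequences yet remain complementary in every sequence are good candidates for a base-paired stem, and an alignment edit that breaks such a relationship deserves a second look.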
Editing alignments is allowed and to be encouraged.
Specialized sequence editing software such as GCG’s SeqLab
Editor helps achieve this, but any editor will do as long as the
sequences end up properly formatted afterwards. After some
automated solution has offered its best guess, go into the
alignment and use your own brain to improve it. Use all
available information and understanding to ensure that all
columns are truly homologous. Look for conserved functional
sites to help guide your judgment. Overall, things to look for
include columns of strongly conserved residues such as
tryptophans, cysteines, and histidines, and important structural
amino acids such as prolines, tyrosines, and phenylalanines;
make sure they all align. Be sure that the members of the
hydrophobic substitution group (isoleucine, leucine, valine, and,
to a lesser extent, methionine) align with one another; they all
easily swap places.
Assure that all known enzymatic, regulatory, and structural
elements align. The results of any subsequent analyses are
absolutely dependent upon the alignment.
Beware of aligning apples and oranges [and grapefruit]!
Paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.
Be sure an alignment makes biological sense — align things that
make sense to align! Beware of comparing ‘apples and oranges.’
If creating alignments for phylogenetic inference, either make
paralogous comparisons (i.e. genes related by gene duplication) to
ascertain gene phylogenies within one organism, or orthologous
(i.e. descended from one ancestral locus) comparisons to ascertain gene
phylogenies between organisms, which should imply organismal
phylogenies. Try not to mix them up without complete data
representation. Lots of confusion can arise, especially if you do not
have all the data and/or if the nomenclature is contradictory;
extremely misleading interpretations can result. Be wary of trying
to align genomic sequences with cDNA when working with DNA;
the introns will cause all sorts of headaches. Similarly, do not align
mature and precursor proteins from the same organism and locus. It
does not make evolutionary sense, as one is not evolved from the
other, rather one is the other. These are all easy mistakes to
make; try your best to avoid them.
Mask out uncertain areas
I reiterate, the most important factor in inferring reliable phylogenies is the
accuracy of the multiple sequence alignment. The interpretation of your
results is utterly dependent on the quality of your input. In fact, many
experts advise against using any parts of the sequence data that are at all
questionable. Only analyze those portions that assuredly do align. If any
portions of the alignment are in doubt, throw them out. This usually
means trimming down or masking out the alignment’s terminal ends and
may require internal trimming or masking as well. Biocomputing is always
a delicate balance — signal against noise — and sometimes it can be
quite the balancing act!
Remember the old adage “garbage in — garbage out!” Some general
guidelines to remember include the following:
If the homology of a region is in doubt, then throw it out
(or “mask” it, as can be done using SeqLab).
Avoid the most diverged parts of molecules;
they are the greatest source of systematic error.
Do not include sequences that are more diverged than necessary for
the analysis at hand.
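Masking is ultimately a judgment call, but a crude automated first pass can help flag candidates. The sketch below is an assumption-laden illustration, not SeqLab's masking feature: it treats the fraction of gap characters in a column as a rough proxy for "in doubt" and drops columns above a threshold. The function name `drop_gappy_columns`, the 0.5 threshold, and the toy data are all hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Return the alignment with columns removed where gaps exceed the threshold."""
    kept = [column for column in zip(*alignment)
            if sum(ch in gap_chars for ch in column) / len(column) <= max_gap_fraction]
    if not kept:
        return [""] * len(alignment)
    # Transpose the surviving columns back into rows.
    return ["".join(residues) for residues in zip(*kept)]

# Hypothetical toy protein alignment with ragged ends.
toy_alignment = [
    "--MKTAY-",
    "-AMKSAYL",
    "--MRTAY-",
]
trimmed = drop_gappy_columns(toy_alignment)
print(trimmed)
```

Gap content alone says nothing about whether the remaining columns are truly homologous, so a pass like this supplements manual inspection rather than replacing it.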
Complications cont.: Format hassles!
Specialized format conversion tools such as GCG’s From’ and To’ programs and PAUPSearch.
Don Gilbert’s public domain ReadSeq program.
One of the biggest problems in computational biology is that of
molecular sequence data format. Each suite of programs to come
along seems to require its own different sequence format. The major
databases all have their own; Clustal has its own; even the database
similarity searching program FastA has a sequence format associated
with it. GCG Wisconsin Package sequence format exists both as single
and Multiple Sequence Format (MSF) and GCG’s SeqLab has its own
format called Rich Sequence Format (RSF) that contains both
sequence data and reference and feature annotation. PAUP* has a
required format called the NEXUS file and PHYLIP has its own unique
input data format requirements. The PAUP* interfaces in the GCG
Wisconsin Package, PAUPSearch and PAUPDisplay, automatically
generate their required NEXUS format directly from the GCG formatted
files. Most systems are not nearly so helpful. Several different
programs are available to convert formats back and forth between the
required standards, but it all can get quite confusing. One program
available, ReadSeq by Don Gilbert at Indiana University (1990), allows
for the back and forth conversion between several different formats. I
would heartily recommend installing it on all of your computers. It
comes as an old ‘tried-and-true’ C version or a new JAVA version with
a graphical interface. I don’t have much experience with the JAVA
version but have relied on the C version for many years.
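To give a feel for what a format conversion involves, here is a hand-rolled sketch that turns FASTA-formatted text into strict sequential PHYLIP format (a header line with the number of sequences and the alignment length, then each name padded to ten characters followed by its sequence). This is not ReadSeq and does not reproduce its behavior; the function names `parse_fasta` and `to_phylip` and the input data are hypothetical, and real files have many variations that this ignores.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the description line as the name.
            name = (line[1:].split() or ["unnamed"])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

def to_phylip(records):
    """Write aligned records in strict sequential PHYLIP format."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP requires sequences of equal (aligned) length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names are truncated or padded to exactly ten characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

# Hypothetical aligned input.
fasta_text = ">seq1 first toy sequence\nACGT\nAC\n>seq2\nAC-TAC\n"
records = parse_fasta(fasta_text)
phylip_text = to_phylip(records)
print(phylip_text)
```

Note the ten-character name limit: two sequences whose names share the first ten characters become indistinguishable after conversion, which is exactly the kind of hassle described above.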
Still more complications
., -, ~, ?, N, or X
Alignment gaps are another problem. Different program suites may use
different symbols to represent them. Most programs use hyphens, “-”;
the Wisconsin Package uses periods, “.”. Furthermore, not all gaps in
sequences should be interpreted as deletions. Interior gaps are probably
okay to represent this way, as regardless of whether a deletion, insertion
or a duplication event created the gap, logically they will be treated the
same by the algorithms. These are indels. However, end gaps should
not be represented as indels because a lack of information beyond the
length of a given sequence may not be due to a deletion or insertion
event. It may have nothing to do with the particular stretch being
analyzed at all. It may just not have been sequenced! These gaps are
just place holders for the sequence. Therefore, it is safest to manually
edit an alignment to change leading and trailing gap symbols to “x”’s
which mean “unknown amino acid,” or “n”’s which mean “unknown base,”
or “?”’s which is supported by many programs, but not all, and means
“unknown residue or indel.” This will assure that the programs do not
make incorrect assumptions about your sequences.
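The manual edit described above is easy to automate. The sketch below assumes that "-", "." and "~" are the gap symbols in use and replaces only the leading and trailing runs, leaving interior gaps (the indels) untouched. The function name `mark_end_gaps` is hypothetical; pass "x" for proteins, "n" for nucleotides, or "?" where the downstream program supports it.

```python
def mark_end_gaps(sequence, unknown="x", gap_chars="-.~"):
    """Replace leading and trailing gap symbols with an 'unknown' symbol."""
    without_lead = sequence.lstrip(gap_chars)
    core = without_lead.rstrip(gap_chars)
    lead = len(sequence) - len(without_lead)
    trail = len(without_lead) - len(core)
    # Interior gaps inside `core` are kept exactly as they were.
    return unknown * lead + core + unknown * trail

print(mark_end_gaps("--MK-TAY.."))             # protein: end gaps become x
print(mark_end_gaps("..ACG-T-", unknown="n"))  # DNA: end gaps become n
```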
The first GTP binding domain of EF 1α/Tu
A consensus isn’t necessarily the biologically “correct” combination.
A simple consensus throws much information away!
Therefore, motif definition.
G H V D H G K S
Based on experimental evidence, we know that the indicated region bounded by
the Glycine and Serine above is essential. Just count up the various residues
in those columns and assign the most common one to the consensus. Simple.
But what about the fact that the middle Histidine isn’t always a Histidine? In this
dataset, just as often it’s a Serine and sometimes it’s an Alanine. Other
positions are also seen not to be invariant. And there are lots of other members of
this gene family not being represented here at all. A consensus isn’t
necessarily the biologically “correct” combination. How do we include this other
information? A simple consensus throws much of it away. Therefore, we need
to adopt some sort of standardized ambiguity notation. The trick is to define a
motif such that it minimizes false positives and maximizes true positives; i.e. it
needs to be just discriminatory enough. The development of the exact motif is
largely empirical; a pattern is made, tested against the database, then refined,
over and over, although when experimental evidence is available, it is always
incorporated. This is known as motif definition and a scientist in Switzerland,
Dr. Amos Bairoch, has done it for tons of sequences and catalogued the results
in a compilation, the PROSITE Database of protein families and domains. It
contains 1079 documentation entries that describe 1459 different motifs
(Release 16.33, of 25-Jan-2001)! This sort of characteristic local sequence
description is variously known as motif, template, signature, pattern, and even
fingerprint; don’t let the terminology confuse you. They can all be thought of as
a one-dimensional description of some sort of consensus region of a sequence
dataset.
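The "count up the residues" procedure, and what it discards, can be seen in a few lines. This is a minimal sketch on a made-up four-sequence alignment (not the dataset shown on the slide); the function name `simple_consensus` is hypothetical.

```python
from collections import Counter

def simple_consensus(alignment):
    """Take the most common residue in each column as the consensus."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*alignment))

# Hypothetical toy alignment; the fifth column varies.
toy_alignment = [
    "GHVDHGKS",
    "GHVDHGKS",
    "GHVDSGKS",
    "GHVDAGKS",
]
consensus = simple_consensus(toy_alignment)
fifth_column = Counter(seq[4] for seq in toy_alignment)

print(consensus)
print(dict(fifth_column))
```

The consensus string reports only a Histidine at the fifth position, while the column counts show that half of the toy sequences carry something else there. An ambiguity notation is meant to retain exactly that information.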
The EF 1α/Tu P-Loop
Defined as: (A,G)x4GK(S,T).
A one-dimensional ‘regular-expression’ of a conserved site.
Not necessarily biologically meaningful.
Motifs are limited in their ability to discriminate a residue’s ‘importance.’
This site is known as the P-Loop and is defined as (A,G)x4GK(S,T), i.e. either
an Alanine or a Glycine, followed by four of anything, followed by an invariant
Glycine-Lysine pair, followed by either a Serine or a Threonine. Exceptions
are noted in the PROSITE documentation. This particular site has been very
well researched and many three-dimensional structures are available for it. It
always has a beta/alpha/beta secondary structure conformation and is
sometimes known as the “Rossmann Fold.” Here the site is shown in the
Guanine Nucleotide-Binding Protein G(I), Alpha-1 Subunit (Adenylate
Cyclase-Inhibiting) from Rattus norvegicus (common rat), GBI1_RAT,
courtesy ExPASy’s Swiss-3DImage collection:
ftp://ca.expasy.org/databases/swiss-3dimage/IMAGES/JPEG/S3D00521.jpg
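Because a motif of this kind is a one-dimensional pattern, it translates directly into a regular expression. The sketch below hand-translates (A,G)x4GK(S,T) into Python's `re` syntax as `[AG].{4}GK[ST]` and scans a protein fragment for it. The test sequence is used purely for illustration and the function name `find_motif` is hypothetical; this is not how the PROSITE tools themselves are implemented, and overlapping matches are not reported.

```python
import re

# (A,G)x4GK(S,T): A or G, any four residues, G, K, then S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")

def find_motif(sequence, pattern=P_LOOP):
    """Return (zero-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence.upper())]

# Illustrative protein fragment (an assumption for this example).
fragment = "MGKEKTHINIVVIGHVDSGKSTTTGHLIYK"
hits = find_motif(fragment)
print(hits)
```

Every position in the pattern is either required or free, with nothing in between, which is why a pattern like this cannot express how much a given residue matters.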
OK, so motifs are one way to “capture” the information of an important portion
of an alignment. However, motifs are not necessarily biologically
meaningful, especially those associated with post-translational modifications
such as phosphorylation sites, and they cannot convey any degree of the
“importance” of the residues. For instance, in the P-Loop, is it better to have
an Alanine or a Glycine in that first position or doesn’t it matter? This lack of
a sense of importance causes a loss of sensitivity. In order to convey the
importance of each residue, a more “robust” method must be used.