Multiple Sequence Alignment: Techniques, Tools, and Applications - Prof. Dietlind Gerloff, Study notes of Chemistry

An overview of multiple sequence alignment (MSA) and its importance in bioinformatics; methods such as dynamic programming, progressive alignment, and PSI-BLAST; and popular tools and approaches such as ClustalW, Tcoffee, and HMMs. It also covers conservation patterns and their significance.


Uploaded on 08/19/2009



Multiple Sequence Alignment

• Multiple sequence alignment is probably the single most important bioinformatics tool.
• Many applications require accurate MSAs
  • PSI-BLAST
  • Family and domain classification
  • Pattern identification
  • Structure prediction
    • secondary structure
    • fold recognition
  • Phylogeny
  • Full-genome alignments in browsers

Conservation Patterns

• Cys pairs - disulfide bonds
• His, Ser - catalytic sites
• Cys, His - metal binding sites
• Gly, Pro - ends of 2° structure elements, turns
• Lys, Arg, Asp, Glu - ligand binding
• Lys/Arg-Asp/Glu pairs - salt bridges
• Leu - coiled coils, leucine zippers
• Motifs, secondary structure, indels
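Conserved positions like those listed above can be picked out by scoring each alignment column. The following is a minimal Python sketch, not a method from the lecture: the function name and the toy alignment are made up for illustration, and the score is simply the frequency of the most common residue in each column.

```python
from collections import Counter

def column_conservation(alignment):
    """Return, per column, the most common residue and its frequency.

    Gaps ('-') lower the frequency but are never reported as the
    consensus residue.
    """
    n_seqs = len(alignment)
    result = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        if not counts:
            result.append(("-", 0.0))  # column made only of gaps
            continue
        residue, count = counts.most_common(1)[0]
        result.append((residue, count / n_seqs))
    return result

# Hypothetical toy alignment (made up for illustration).
toy = [
    "ACDKC",
    "ACEKC",
    "GC-RC",
]
conservation = column_conservation(toy)
for i, (res, frac) in enumerate(conservation, start=1):
    flag = "*" if frac == 1.0 else " "  # '*' marks fully conserved columns
    print(f"col {i}: {res} {frac:.2f} {flag}")
```

In this toy case the two Cys columns are fully conserved, which is the kind of signal that would prompt a closer look for a disulfide bond or metal-binding site.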

PSI-BLAST Alignments

• The goal of BLAST is rapid detection of related sequences by finding high-scoring local alignments. It doesn't necessarily find the optimal global or local alignment.
• Profiles throw away information for regions that are insertions relative to the query.
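To see why a query-anchored profile loses information, consider this simplified Python sketch. It is an assumption-laden illustration, not PSI-BLAST's actual profile construction (which also involves sequence weighting and pseudocounts): the function name and the toy sequences are hypothetical, and the "profile" is just per-column residue frequencies.

```python
from collections import Counter

def query_anchored_profile(query_row, other_rows):
    """Per-position residue frequencies for columns where the query has a residue.

    Columns where the query row has a gap ('-') are insertions relative
    to the query and are dropped entirely.
    """
    rows = [query_row] + list(other_rows)
    profile = []
    for column in zip(*rows):
        if column[0] == "-":
            continue  # insertion relative to the query: this column is lost
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile

# Hypothetical toy alignment; the first row is the query.
query = "MK--LV"
hits = ["MKGGLV", "MRGALI"]
profile = query_anchored_profile(query, hits)

print(len(profile))  # one profile position per query residue
print(profile[1])    # frequencies at the second query position
```

The two columns where the hits carry "GG" and "GA" never reach the profile, because the query has gaps there.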

Methods

• Dynamic Programming
  • Gives the optimal solution, but prohibitively slow
• Progressive
  • ClustalW
    • http://www.ebi.ac.uk/clustalw/index.html (most commonly used)
  • Tcoffee
    • http://igs-server.cnrs-mrs.fr/Tcoffee/ (a little better, but slower)
• Iterative
  • better than progressive methods, but slower
• Dialign
• HMMs
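As a reference point for the dynamic programming entry, here is a minimal Python sketch of global pairwise alignment (Needleman-Wunsch). The scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration; real tools use substitution matrices and affine gap penalties. The same recurrence extended to N sequences of length L needs on the order of L^N table cells, which is why exact dynamic programming is impractical for MSAs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```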

Input Formats

• FASTA format
  • Download from NCBI, ExPASy, EBI, Pfam …
• For Clustal programs, sequence names should be
  • Unique
  • 15 characters or fewer
  • Composed only of A-Z, a-z, 0-9 and _
    (Do not use #$%@|*!:;. or spaces)
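The naming rules above are easy to enforce before running an alignment. The sketch below is one possible approach in Python; the function name, the numeric-suffix scheme for resolving collisions, and the example names are assumptions made for illustration.

```python
import re

def clustal_safe_names(names, max_len=15):
    """Return new names, in order, that are unique, at most max_len
    characters long, and made only of A-Z, a-z, 0-9 and '_'."""
    used = set()
    safe = []
    for name in names:
        # Replace every disallowed character with '_' and truncate.
        base = re.sub(r"[^A-Za-z0-9_]", "_", name)[:max_len]
        candidate = base
        counter = 1
        while candidate in used:
            # Resolve collisions with a numeric suffix, keeping the length limit.
            suffix = f"_{counter}"
            candidate = base[: max_len - len(suffix)] + suffix
            counter += 1
        used.add(candidate)
        safe.append(candidate)
    return safe

# Hypothetical input names containing characters Clustal may not accept.
original = ["sp|P13795", "gi|31242623", "my protein #1", "my protein #1"]
renamed = clustal_safe_names(original)
for old, new in zip(original, renamed):
    print(f"{old!r} -> {new!r}")
```

Keeping the mapping from old to new names (as printed here) makes it possible to restore the original identifiers after the alignment.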

ClustalW Output

CLUSTAL W (1.82) Multiple Sequence Alignments

Sequence format is Pearson
Sequence 1: sp_P13795     206 aa
Sequence 2: gi_31242623   213 aa
Sequence 3: gi_3822409    195 aa
Sequence 4: gi_39593308   235 aa
Sequence 5: gi_32567202   207 aa

Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 57
Sequences (1:3) Aligned. Score: 59
Sequences (1:4) Aligned. Score: 52
Sequences (1:5) Aligned. Score: 51
Sequences (2:3) Aligned. Score: 77
Sequences (2:4) Aligned. Score: 53
Sequences (2:5) Aligned. Score: 54
Sequences (3:4) Aligned. Score: 60
Sequences (3:5) Aligned. Score: 61
Sequences (4:5) Aligned. Score: 87
Guide tree file created: [/ebi/extserv/old-work/clustalw-20040206-01234219.dnd]

Start of Multiple Alignment
There are 4 groups
Aligning...
Group 1: Sequences: 2 Score: 3818
Group 2: Sequences: 3 Score: 3429
Group 3: Sequences: 2 Score: 4233
Group 4: Sequences: 5 Score: 3386
Alignment Score 7423
CLUSTAL-Alignment file created [/ebi/extserv/old-work/clustalw-20040206-01234219.aln]

ClustalW Guide Tree

• The guide tree shows the distances between sequences obtained from the initial pairwise alignments.
• It determines the order in which sequences are added to the MSA.
• The guide tree is not a phylogenetic tree (it's just a rough estimate of similarity); however, a true phylogenetic tree can be generated after making an alignment.
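To make the idea of a guide tree concrete, the Python sketch below clusters the five sequences using the pairwise scores from the ClustalW output above. Two assumptions are made loudly here: the scores are treated as percent identities and converted to distances as 1 - score/100, and the clustering uses UPGMA (average linkage) because it is short to write, whereas ClustalW itself builds its guide tree by neighbor joining. The resulting tree is therefore only an illustration, not a reproduction of the .dnd file.

```python
def upgma(labels, dist):
    """Cluster by average linkage (UPGMA).

    dist maps frozenset({a, b}) to the distance between labels a and b.
    Returns (tree, merges): tree is a nested tuple, merges lists each
    joined pair with its average distance, in join order.
    """
    clusters = [(label, [label]) for label in labels]  # (subtree, members)
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi, mj = clusters[i][1], clusters[j][1]
                # Average distance over all cross-cluster member pairs.
                avg = sum(dist[frozenset((x, y))] for x in mi for y in mj)
                avg /= len(mi) * len(mj)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        subtree = (clusters[i][0], clusters[j][0])
        members = clusters[i][1] + clusters[j][1]
        merges.append((subtree, avg))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((subtree, members))
    return clusters[0][0], merges

# Pairwise scores copied from the ClustalW output above.
scores = {(1, 2): 57, (1, 3): 59, (1, 4): 52, (1, 5): 51, (2, 3): 77,
          (2, 4): 53, (2, 5): 54, (3, 4): 60, (3, 5): 61, (4, 5): 87}
# Assumption: score is percent identity, so distance = 1 - score / 100.
dist = {frozenset(pair): 1 - s / 100 for pair, s in scores.items()}

tree, merges = upgma([1, 2, 3, 4, 5], dist)
for subtree, d in merges:
    print(f"join {subtree} at average distance {d:.3f}")
print(tree)
```

The most similar pairs (4 and 5, then 2 and 3) are joined first, which mirrors how a progressive aligner starts with the easiest pairs and adds more distant sequences later.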

Progressive Alignment

• Greedy algorithm
  • Breaks the problem up into smaller problems
  • Finds the best solution to each small problem
  • Combines the solutions to get an answer to the whole problem
• Not necessarily the globally optimal answer
  • Doesn't use all information in solving sub-problems
  • Suboptimal answers for small problems may combine to give a better overall answer
• Gaps: once created, they stay part of the alignment for all remaining alignment steps
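The "once a gap, always a gap" behaviour can be sketched in a few lines of Python. This is a hypothetical helper, not ClustalW code: it assumes an earlier step has already decided where the new sequence forces extra gap columns, and only shows how those columns are added to every existing row while the older gaps stay untouched.

```python
def apply_new_gaps(rows, column_map):
    """Rebuild existing rows after a new alignment step.

    column_map[k] is the index of an existing alignment column, or None
    where this step opened a new gap column in the existing alignment.
    Existing gaps are copied unchanged; new ones are added to every row.
    """
    return ["".join("-" if c is None else row[c] for c in column_map)
            for row in rows]

# Hypothetical example: two already-aligned rows, then a third sequence
# whose alignment opens one new gap column before the final T.
existing = ["ACG-T", "AC-GT"]
new_row = "ACGGAT"
column_map = [0, 1, 2, 3, None, 4]

updated = apply_new_gaps(existing, column_map) + [new_row]
for row in updated:
    print(row)
```

The gaps placed when the first two rows were aligned are never revisited, even if the third sequence suggests a better arrangement; this is the main source of error in progressive methods.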

Aligned FASTA (A2M) Format

>SN29_RAT/142-196
PSSRLKEAINTSKDQESKYQASHPNLRRLHDAE---LDSVPASTV----NTEVY-----P
KNSSL---R-----A
>SN29_HUMAN/142-197
PNNRLKEAISTSKEQEAKYQASHPNLR-------KLDDTDPVPRGA---GSAMSTDA-YP
KNPHL---R-----A
>SN25_TORMA/95-148
PCNK----LKNFEAGGAYKKVWGNNQD------G-VVASQP-ARVMD-DREQMA-----M
SGGYI--RRI-TDDA
>O93578/11-59
PCNK----MKS-----GASKAWGNNQD------G-VVASQP-ARVVD-EREQMA-----I
SGGFI--RRV-TDDA
>SN25_DROME/98-149
PCNK----SQSFK---EDDGTWKGNDD------GKVVNNQP-QRVMD-DRNGM-----MA
QAGYI--GRI-TNDA

• Uppercase and '-' characters are alignment columns. There must be the same number of aligned characters in all sequences.
• Insertions that are not part of the alignment are indicated with lowercase and '.' characters. These are not read (i.e. they're for humans only).
• Benefits
  • Easily machine readable
  • Readable by most programs that read FASTA format

(Note: characters in lowercase, if there were any, would indicate that the alignment is uncertain at these positions)
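The rules above translate directly into a small reader. The Python sketch below is an illustrative assumption rather than a reference parser: it keeps only uppercase letters and '-' as alignment columns, drops lowercase and '.' characters, and checks that every sequence ends up with the same number of columns. The two records in the example are made up.

```python
def read_a2m(text):
    """Parse aligned FASTA (A2M) text into {name: aligned_sequence}.

    Lowercase letters and '.' (insertions outside the alignment columns)
    are dropped, so every returned sequence holds alignment columns only.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # name is the first word after '>'
            records[name] = []
        elif name is not None:
            # Sequences may wrap over several lines; collect the pieces.
            records[name].append(
                "".join(ch for ch in line if ch.isupper() or ch == "-"))
    aligned = {n: "".join(parts) for n, parts in records.items()}
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) > 1:
        raise ValueError(
            f"rows differ in number of alignment columns: {sorted(lengths)}")
    return aligned

# Hypothetical two-sequence example with one insertion position.
example = """>seqA/1-6
ACDe.FG
>seqB/1-5
AC-.gFG
"""
aligned = read_a2m(example)
for name, seq in aligned.items():
    print(name, seq)
```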

Graphical - Jalview

  • Postscript, PDF, HTML
  • Looks pretty and very visually informative
  • Completely useless for further computational analysis.

DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT

  • Jalview - Java alignment editor (http://www.jalview.org)
    • Available as an online applet or as an application
    • Makes nice pictures and allows interactive editing

e.g. Jalview, ClustalX (or others)

Sequence Logos

• Logos are another useful visualization of alignments that allows conserved positions to be easily picked out.
• Multiple tools are available on the web or for download:
  • http://weblogo.berkeley.edu
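The height of each stack in a sequence logo reflects the information content of that column. A minimal Python sketch of that calculation follows, assuming a 20-letter protein alphabet and ignoring gaps; logo tools may also apply a small-sample correction, which is omitted here. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_information(alignment, alphabet_size=20):
    """Information content (bits) per column:
    log2(alphabet_size) minus the Shannon entropy of the column."""
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            info.append(0.0)  # a column of gaps carries no information
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        info.append(max_bits - entropy)
    return info

# Hypothetical toy alignment: column 1 fully conserved, column 2 variable.
toy = ["CA", "CG", "CL", "CK"]
info = column_information(toy)
for i, bits in enumerate(info, start=1):
    print(f"col {i}: {bits:.2f} bits")
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column with four equally frequent residues scores 2 bits less.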

Tcoffee

• Makes a library of pair-wise global and several local alignments
• Tries to find a multiple alignment that has the best consensus with all alignments in the library
• Still a progressive algorithm
• Slower, but usually a bit better than ClustalW

Note: there are many other MSA programs around. Which is the "best" may depend on the specifics of your protein set, and there may be new "winners" that you can easily access. A good one to try, for example, is MUSCLE (developed by Robert Edgar).