Multiple Sequence Alignment
BME 110: CompBio Tools
Todd Lowe
April 20, 2006
Multiple Sequence Alignment
Multiple sequence alignment (MSA) is probably the single most important bioinformatics tool.
Many applications require accurate MSAs
PSIBLAST
Family and domain classification
Pattern identification
Structure prediction
• secondary structure
• fold recognition
Phylogeny
PSIBLAST Alignments
The goal of BLAST is rapid detection of high-scoring local alignments. It doesn’t necessarily find the optimal global or local alignment.
Profiles throw away information for regions that are insertions relative to the query.
Methods
Dynamic Programming
Gives the optimal solution, but is prohibitively slow: running time grows exponentially with the number of sequences
Progressive
ClustalW
http://www.ebi.ac.uk/clustalw/index.html
Tcoffee
http://igs-server.cnrs-mrs.fr/Tcoffee/
(a little more accurate than ClustalW, but slower)
Iterative
better than progressive methods, but slower
Dialign
HMMs
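To make the dynamic programming entry concrete, here is a minimal sketch of the two-sequence case (Needleman-Wunsch global alignment score). The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not parameters of any tool named above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

The table has one dimension per sequence. Two sequences of about 200 residues need roughly 200 x 200 cells, but five such sequences would need on the order of 10^11 cells, which is why the optimal method is not used for real MSAs.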
ClustalW Example
Input: 5 sequences detected by BLASTp using human SNAP-25 as a query
Default parameters, output order: input
>sp_P13795
MAEDADMRNELEEMQRRADQLADESLESTRRML
QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN
KLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN
LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG
>gi_
MPAAAPPAENGAAVPKTELQELQMKQQQVVDESLDSTRRML
ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN
KSASFKEDDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN
MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK
>gi_
MPTTAEPAQENGAPRSELQELQLKAGQVTDETLESTRRML
ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN
KSAPFKENEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN
VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM
>gi_
MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML
ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN
KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPTMGPQGGYITKITNDAREDEMDEN
IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK
>gi_
MSGDDDIPEGLEAINLKMNATTDDSLESTRRML
ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN
KTDDFEKTEFAKAWKKDDDGGVISDQPRITVGDSSMGPQGGYITKITNDAREDEMDEN
VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK
Input Formats
FASTA format
Download from NCBI, ExPASy, EBI, …
Sequence names should be
Unique
15 characters or less
Composed only of A-Z, a-z, 0-9 and _ (do not use #$%@|*!:;. or spaces)
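The three naming rules above are easy to check before submitting a file. A minimal sketch, assuming the names have already been pulled out of the FASTA header lines; `check_names` is a hypothetical helper, not part of ClustalW.

```python
import re

def check_names(names):
    """Return (name, problem) pairs; an empty list means every name passes."""
    problems = []
    seen = set()
    for name in names:
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
        if len(name) > 15:
            problems.append((name, "longer than 15 characters"))
        # Only letters, digits and underscore are allowed.
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            problems.append((name, "invalid characters"))
    return problems
```

For example, a raw NCBI-style header such as `gi|12345|ref` would be reported for invalid characters because of the `|`.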
ClustalW Guide Tree
The guide tree shows the distances between sequences obtained from the initial pairwise alignments.
This is the order in which sequences were added to the MSA.
IT IS NOT A PHYLOGENETIC TREE!!!
Progressive Alignment
Greedy algorithm
Breaks problem up into smaller problems
Finds best solution to each small problem
Combine solutions to get answer to whole problem
The result is not necessarily the globally optimal answer.
Doesn’t use all information in solving sub-problems.
Suboptimal answers for small problems may combine to give a better overall answer.
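The greedy "closest first" order can be sketched as follows. This is a simplified illustration using average-linkage clustering on a pairwise distance table. ClustalW itself builds its guide tree by neighbor-joining, so the function below is a stand-in under that stated assumption, and `merge_order` and its input layout are hypothetical.

```python
from itertools import combinations

def merge_order(names, dist):
    """Greedy average-linkage clustering; returns the merges in the order made.

    dist maps frozenset({name_a, name_b}) to the pairwise distance.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Greedy step: commit to the closest pair available right now.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges
```

Each merge is final. A gap placed when the first two sequences are aligned stays in place when later sequences are added, which is why the result is not guaranteed to be the global optimum.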
ClustalW Alignment
CLUSTAL W (1.82) multiple sequence alignment

sp_P  ---MAEDAD------------------------MRNELEEMQRRADQLADESLESTRRML 33
gi_   MPAAAPPAENG-------------------AAVPKTELQELQMKQQQVVDESLDSTRRML 41
gi_   MPTTAEPAQE--------------------NGAPRSELQELQLKAGQVTDETLESTRRML 40
gi_   MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML 60
gi_   MSGDDDIPEG---------------------------LEAINLKMNATTDDSLESTRRML 33

sp_P  QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN 93
gi_   ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN 101
gi_   ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN 100
gi_   ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 120
gi_   ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 93

sp_P  KLKSSDA---YKKAWGNNQDG-VVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN 149
gi_   KSASFKE---DDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN 158
gi_   KSAPFKE---NEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN 157
gi_   KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPT-MGPQGGYITKITNDAREDEMDEN 179
gi_   KTDDFEK-TEFAKAWKKDDDGGVISDQPRITVGDSS-MGPQGGYITKITNDAREDEMDEN 151

sp_P  LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG 206
gi_   MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK-- 213
gi_   VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM------------------- 195
gi_   IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 235
gi_   VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 207
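ClustalW marks columns where every sequence carries the same residue with `*`. A minimal sketch of that check, assuming the aligned rows are given as equal-length strings; `conserved_columns` is a hypothetical helper, and the `:` and `.` marks for similar residues are not handled.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where all sequences share one residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i
        for i, column in enumerate(zip(*aligned))
        # A column of identical gap characters does not count as conserved.
        if column[0] != "-" and len(set(column)) == 1
    ]
```

Applied to the last 12 columns of the first block above (`ADESLESTRRML`, `VDESLDSTRRML`, `TDETLESTRRML`, `TIESLESTRRML`, `TDDSLESTRRML`), it picks out the shared `L` and the `STRRML` run.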
Interleaved Formats
Most common output formats for MSAs are interleaved:
MSF, ASN, BLAST query-anchored formats
All sequences are stacked up, and chopped into blocks of ~60 residues
Easy for humans to read, but difficult to edit
Tools for converting formats are available on the web
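Converting an interleaved file back to one full-length row per sequence mostly means joining the blocks by name. A minimal sketch for ClustalW-style text, assuming unique names without spaces and consensus lines that start with whitespace; `deinterleave` is a hypothetical helper, not a replacement for a real format converter.

```python
def deinterleave(text):
    """Join the blocks of an interleaved alignment into one string per sequence."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header line, and consensus lines
        # (consensus lines begin with whitespace instead of a name).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, segment = fields[0], fields[1]   # a trailing residue count is ignored
        sequences[name] = sequences.get(name, "") + segment
    return sequences
```

Because rows are matched by name, this only works when the names follow the uniqueness rule for input files.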
Graphical - Jalview
Postscript, PDF, HTML
Looks pretty and very visually informative
Completely useless for further computational analysis. DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT
Jalview -- Java alignment editor (http://www.jalview.org)
Available as an online applet or as an application
Makes nice pictures and allows interactive editing
Sequence Logos
Logos are another useful visualization of alignments that allow conserved positions to be easily picked out.
Multiple tools available on the web or can be downloaded:
http://weblogo.berkeley.edu
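The height of a logo column reflects its information content in bits: the maximum possible entropy minus the observed entropy of the residues in that column. A minimal sketch of that calculation, assuming a 20-letter protein alphabet and ignoring gaps; the small-sample correction that logo tools usually apply is left out, and `column_information` is a hypothetical name.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy
```

A fully conserved protein column scores log2(20), about 4.32 bits, and the score falls as more residue types share the column.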
Other Uses of MSA Servers
ClustalW can refine an alignment
If sequences are aligned when submitted, this info is used.
Tcoffee can
Combine alignments
Evaluate alignment quality
Use structural information if available
Criteria for a Good MSA
Most methods align proteins on the basis of sequence similarity, but what we really want to know is:
Evolutionary similarity
Functional similarity
Structural similarity
If the sequences are closely related, these similarities are all equivalent. As sequences become more divergent, these similarities may not be equivalent.
There isn’t necessarily one ‘correct’ alignment for a family. An MSA doesn’t necessarily reflect structural or functional alignment.