Multiple Sequence Alignment - Computational Biology Tools | BME 110, Study notes of Chemistry

Material Type: Notes; Class: Computational Biology Tools; Subject: Biomolecular Engineering; University: University of California-Santa Cruz; Term: Unknown 2006;

Multiple Sequence Alignment
BME 110: CompBio Tools
Todd Lowe
April 20, 2006

Multiple Sequence Alignment

Multiple sequence alignment is probably the single most important bioinformatics tool.

Many applications require accurate MSAs

PSIBLAST

Family and domain classification

Pattern identification

Structure prediction

• secondary structure

• fold recognition

Phylogeny

PSIBLAST Alignments

The goal of BLAST is rapid detection of high-scoring local alignments. It doesn't necessarily find the optimal global or local alignment.

Profiles throw away information for regions that are insertions relative to the query.
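
To make the point about insertions concrete, here is a minimal sketch of what "query-anchored" means: every column in which the query has a gap is dropped, so residues that other sequences insert there are lost. The function name and the toy rows are illustrative assumptions, not PSIBLAST's actual implementation.

```python
def query_anchored(alignment, query_index=0, gap="-"):
    """Keep only the columns in which the query row has a residue."""
    query = alignment[query_index]
    keep = [i for i, ch in enumerate(query) if ch != gap]
    return ["".join(row[i] for i in keep) for row in alignment]

# Toy aligned rows (hypothetical data); the first row is the query.
rows = [
    "MK--LV",  # query
    "MKAALV",  # hit with a two-residue insertion relative to the query
    "M---LV",
]
anchored = query_anchored(rows)
print(anchored)
```

The "AA" insertion in the second row disappears from the anchored view, which is the information loss the slide refers to.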

Methods

Dynamic Programming

Gives the optimal solution, but is prohibitively slow for more than a few sequences

Progressive

ClustalW

http://www.ebi.ac.uk/clustalw/index.html

Tcoffee

http://igs-server.cnrs-mrs.fr/Tcoffee/

(a little better, but slower)

Iterative

better than progressive methods, but slower

Dialign

HMMs
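
A rough back-of-the-envelope sketch of why exact dynamic programming does not scale: aligning N sequences of length L fills an N-dimensional lattice of (L+1)^N cells, and each cell looks at up to 2^N - 1 predecessor cells. The function names and the length of 200 residues are assumptions chosen for illustration.

```python
def dp_cells(n_sequences, length):
    """Number of cells in the full dynamic-programming lattice."""
    return (length + 1) ** n_sequences

def dp_cell_updates(n_sequences, length):
    """Cells times the 2^N - 1 predecessor moves examined per cell."""
    return dp_cells(n_sequences, length) * (2 ** n_sequences - 1)

# Compare a pairwise alignment with 3 and 5 sequences of about 200 residues.
for n in (2, 3, 5):
    print(n, dp_cells(n, 200), dp_cell_updates(n, 200))
```

Going from two to five sequences of this length raises the cell count from about 4 x 10^4 to about 3 x 10^11, which is why the heuristic methods listed above are used in practice.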

ClustalW Example

Input: 5 sequences detected by BLASTp using human SNAP-25 as a query

Default parameters, output order: input

>sp_P13795
MAEDADMRNELEEMQRRADQLADESLESTRRML
QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN
KLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN
LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG
>gi_
MPAAAPPAENGAAVPKTELQELQMKQQQVVDESLDSTRRML
ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN
KSASFKEDDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN
MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK
>gi_
MPTTAEPAQENGAPRSELQELQLKAGQVTDETLESTRRML
ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN
KSAPFKENEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN
VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM
>gi_
MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML
ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN
KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPTMGPQGGYITKITNDAREDEMDEN
IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK
>gi_
MSGDDDIPEGLEAINLKMNATTDDSLESTRRML
ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN
KTDDFEKTEFAKAWKKDDDGGVISDQPRITVGDSSMGPQGGYITKITNDAREDEMDEN
VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK

Input Formats

FASTA format

Download from NCBI, ExPASy, EBI, …

Sequence names should be

Unique

15 characters or less

Composed only of A-Z, a-z, 0-9 and _ (do not use #$%@|*!:;. or spaces)
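
The naming rules above are easy to check before submitting a file. The following is a minimal sketch, assuming the three rules exactly as stated (unique, at most 15 characters, only letters, digits and underscore); `check_names`, `sanitize` and the example names other than sp_P13795 are hypothetical.

```python
import re

# Letters, digits and underscore only, 1 to 15 characters long.
NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")

def check_names(names):
    """Return (name, problem) pairs; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_RE.fullmatch(name):
            problems.append((name, "bad characters or length"))
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
    return problems

def sanitize(name):
    """Replace disallowed characters with '_' and truncate to 15 characters."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)[:15]

print(check_names(["sp_P13795", "gi|12345|x", "sp_P13795"]))
print(sanitize("gi|12345|ref|NP_0001.1"))
```

Note that truncating can make two different names identical, so it is worth running `check_names` again on the sanitized names.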

ClustalW Guide Tree

The guide tree shows the distances between sequences obtained from the initial pairwise alignments.

It gives the order in which sequences were added to the MSA.

IT IS NOT A PHYLOGENETIC TREE!!!
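
A small sketch of how a join order can fall out of pairwise distances. It uses simple average-linkage clustering on the fraction of mismatched positions, which is a simplification swapped in for illustration and not ClustalW's own tree-building procedure. The five strings are the first 20 columns of the second block of the SNAP-25 alignment from the ClustalW example; the names seq1 to seq5 are placeholders.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of mismatches among columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(1 for x, y in pairs if x != y) / len(pairs)

def join_order(seqs):
    """Greedy average-linkage clustering; returns the merges in order."""
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}

    def avg(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

seqs = {
    "seq1": "QLVEESKDAGIRTLVMLDEQ",
    "seq2": "ALCEESTEVGMRTIVMLDEQ",
    "seq3": "ALCEESKEAGIRTLVALDDQ",
    "seq4": "ALCEESKEAGIKTLVMLDDQ",
    "seq5": "ALCEESKEAGIKTLVMLDDQ",
}
merges = join_order(seqs)
for left, right in merges:
    print(left, "+", right)
```

The two identical fragments are joined first and the most divergent one last. The result is only a clustering by similarity used to order the alignment steps, which is why it should not be read as a phylogeny.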

Progressive Alignment

Greedy algorithm

Breaks problem up into smaller problems

Finds best solution to each small problem

Combine solutions to get answer to whole problem

The result is not necessarily the globally optimal answer.

Doesn't use all information in solving sub-problems.

Suboptimal answers for small problems may combine to give a better overall answer.

ClustalW Alignment

CLUSTAL W (1.82) multiple sequence alignment

sp_P    ---MAEDAD------------------------MRNELEEMQRRADQLADESLESTRRML 33
gi_     MPAAAPPAENG-------------------AAVPKTELQELQMKQQQVVDESLDSTRRML 41
gi_     MPTTAEPAQE--------------------NGAPRSELQELQLKAGQVTDETLESTRRML 40
gi_     MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML 60
gi_     MSGDDDIPEG---------------------------LEAINLKMNATTDDSLESTRRML 33

sp_P    QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN 93
gi_     ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN 101
gi_     ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN 100
gi_     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 120
gi_     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 93

sp_P    KLKSSDA---YKKAWGNNQDG-VVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN 149
gi_     KSASFKE---DDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN 158
gi_     KSAPFKE---NEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN 157
gi_     KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPT-MGPQGGYITKITNDAREDEMDEN 179
gi_     KTDDFEK-TEFAKAWKKDDDGGVISDQPRITVGDSS-MGPQGGYITKITNDAREDEMDEN 151

sp_P    LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG 206
gi_     MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK-- 213
gi_     VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM------------------- 195
gi_     IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 235
gi_     VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 207

Interleaved Formats

The most common output formats for MSAs are interleaved:

MSF, ASN, BLAST query-anchored formats

All sequences are stacked up and chopped into blocks of ~60 residues

Easy for humans to read, but difficult to edit

Tools for converting formats are available on the web
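
Converting an interleaved alignment back to one sequence per record is mostly a matter of concatenating the fragments that share a name. Below is a minimal sketch for Clustal-style text, assuming each sequence line is "name fragment [count]" and that conservation lines start with whitespace. The function names and the toy alignment are illustrative; established conversion tools handle many more cases.

```python
def deinterleave(clustal_text):
    """Join the per-block fragments of each sequence into one string."""
    seqs = {}
    for line in clustal_text.splitlines():
        # Skip blank lines, the CLUSTAL header, and conservation lines.
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, fragment = fields[0], fields[1]  # a trailing count is ignored
        seqs[name] = seqs.get(name, "") + fragment
    return seqs

def to_fasta(seqs, width=60):
    """Write aligned sequences as FASTA, wrapping lines at `width`."""
    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"

example = """CLUSTAL W (1.82) multiple sequence alignment

seqA      MK--LV 4
seqB      MKAALV 6
          **  **

seqA      GG 6
seqB      G- 7
          *
"""
aligned = deinterleave(example)
print(to_fasta(aligned))
```

The output keeps the gap characters, so it is an aligned FASTA file; stripping "-" from each sequence would recover the unaligned input.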

Graphical - Jalview

Postscript, PDF, HTML

Looks pretty and very visually informative

Completely useless for further computational analysis. DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT

Jalview -- Java alignment editor (http://www.jalview.org)

Available as an online applet or as an application

Makes nice pictures and allows interactive editing

Sequence Logos

Logos are another useful visualization of alignments that allow conserved positions to be easily picked out.

Multiple tools available on the web or can be downloaded:

http://weblogo.berkeley.edu
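
The height of a logo column reflects how conserved that position is. As a rough sketch, the information content of a column can be computed as log2(20) minus the Shannon entropy of its residue frequencies. This assumes a 20-letter protein alphabet, ignores gaps, and omits the small-sample correction that logo tools normally apply; the function name is hypothetical. The rows are the last nine columns of the first block of the SNAP-25 alignment.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus the entropy."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0  # an all-gap column carries no information
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

rows = [
    "SLESTRRML",
    "SLDSTRRML",
    "TLESTRRML",
    "SLESTRRML",
    "SLESTRRML",
]
bits = [column_information(col) for col in zip(*rows)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

Fully conserved columns reach the maximum of log2(20) bits, while the two columns with a single substitution score lower, which is what makes conserved positions stand out in a logo.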

Other Uses of MSA Servers

ClustalW can refine an alignment

If sequences are already aligned when submitted, this information is used.

Tcoffee can

Combine alignments

Evaluate alignment quality

Use structural information if available

Criteria for a Good MSA

Most methods align proteins on the basis of sequence similarity, but what we really want to know is:

Evolutionary similarity

Functional similarity

Structural similarity

If the sequences are closely related, these similarities are all equivalent. As sequences become more divergent, these similarities may not be equivalent.

There isn't necessarily one 'correct' alignment for a family. An MSA doesn't necessarily reflect structural or functional alignment.