Bioinformatics Algorithm, Databases and the Tools - Homework 3 | CMSC 423, Assignments of Computer Science

Material Type: Assignment; Class: BIOINFO ALGS, DB, TOOLS; Subject: Computer Science; University: University of Maryland; Term: Spring 2007;


Uploaded on 07/30/2009

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 14
Genome assembly

Administrivia

• CMSC423 forum on CS forums: http://forum.cs.umd.edu/
• Project questions?

Homework 3 answer

Shotgun sequencing

[Figure: original DNA → shearing → sequencing → assembly]

Repeats

AAAAAAAAAAAAAAAAAAAA

[Figure: many identical AAAAAA reads tiled under the repeat above; each read could be placed at several positions]

Typical contig coverage

[Figure: reads stacked under a contig; vertical axis: coverage, from 1 to 6]

Imagine raindrops on a sidewalk

Lander-Waterman statistics

L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L

E(#islands) = N e^{–cσ}
E(island size) = L((e^{cσ} – 1) / c + 1 – σ)

contig = island with 2 or more reads
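
As a rough numerical illustration of the Lander-Waterman statistics, the Python sketch below evaluates σ = 1 – T/L, E(#islands) = N e^{–cσ} and the expected island size. The function name `lander_waterman` and the example project (1 Mbp genome, 500 bp reads, 40 bp minimum overlap) are assumptions chosen for illustration, not values from the lecture.

```python
import math

def lander_waterman(G, L, N, T):
    """Return (coverage, expected #islands, expected island size in bases)."""
    c = N * L / G                # coverage
    sigma = 1 - T / L            # fraction of a read outside the minimum overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

# Hypothetical project parameters (assumed, not from the slides).
for N in (2_000, 8_000, 16_000):
    c, islands, size = lander_waterman(G=1_000_000, L=500, N=N, T=40)
    print(f"N={N:6d}  c={c:4.1f}  E(#islands)={islands:8.1f}  E(island size)={size:12.1f}")
```

With these assumed inputs, the expected number of islands falls and the expected island size grows as coverage increases, which is the raindrops-on-a-sidewalk picture: the more drops, the fewer dry gaps.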

All pairs alignment

  • Needed by the assembler
  • Try all pairs – must consider ~ n^2 pairs
  • Smarter solution: only n × coverage (e.g. 8) pairs are possible
      - Build a table of k-mers contained in sequences (single pass through the genome)
      - Generate the pairs from k-mer table (single pass through k-mer table)

[Figure: sequences labeled A to I, with a shared k-mer marked]
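
The k-mer table idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `candidate_pairs` and the toy reads are hypothetical, only the forward strand is considered, and the result is a set of candidate pairs that would still need a real alignment to confirm an overlap.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs (i, j), i < j, sharing at least one k-mer."""
    table = defaultdict(set)               # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):     # single pass through the sequences
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)

    pairs = set()
    for owners in table.values():          # single pass through the k-mer table
        pairs.update(combinations(sorted(owners), 2))
    return pairs

# Toy reads (assumed for illustration).
reads = ["ACCTGATGCC", "GATGCCAATT", "CAATTGCACT", "TTTTTTTTTT"]
print(sorted(candidate_pairs(reads, k=5)))   # [(0, 1), (1, 2)]
```

Only reads that share a k-mer are ever paired, so the last read is never compared with anything. A k-mer that occurs in many reads (a repeat) produces a large bucket, which is where this shortcut loses its advantage.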

Sequencing by hybridization

Probes: all possible k-mers (AAAA, AAAC, AAAG, AAAT, AACA, AACG, AACT, AAGA, ...)

Target: AACAGTAGCTAGATG

k-mers of the target (k = 4), in order of position:
AACA ACAG CAGT AGTA GTAG TAGC AGCT GCTA CTAG TAGA AGAT GATG
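
An idealized version of the experiment can be simulated in Python: enumerate every possible probe and keep those that occur in the target. The function names `all_probes` and `hybridizing_probes` are made up for this sketch, and it assumes a perfect, error-free array that reports presence or absence only (no counts, no positions).

```python
from itertools import product

def all_probes(k):
    """All 4**k possible k-mers over the DNA alphabet, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def hybridizing_probes(target, k):
    """Probes that occur somewhere in the target (an idealized SBH readout)."""
    present = {target[i:i + k] for i in range(len(target) - k + 1)}
    return [probe for probe in all_probes(k) if probe in present]

print(len(all_probes(4)))                          # 256
print(hybridizing_probes("AACAGTAGCTAGATG", 4))
```

For the example string, 12 of the 256 probes hybridize. The readout comes back in probe order, not in order of position, which is why recovering the original string is a separate problem.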

Assembling SBH data

Main entity: oligomer (overlap)
Relationship between oligomers: adjacency

ACCTGATGCCAATTGCACT...

CTGAT follows CCTGA (they share 4 nucleotides: CTGA)

Problem: given all the k-mers, find the original string

In assembly: fake the SBH experiment - break the reads into k-mers
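
A small Python sketch of both steps, breaking reads into k-mers and recording which k-mer can follow which, might look as follows. The helper names `kmers_from_reads` and `adjacency` and the two toy reads (substrings of the example string above) are assumptions for illustration.

```python
from collections import defaultdict

def kmers_from_reads(reads, k):
    """Break every read into its k-mers, mimicking an SBH experiment."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def adjacency(kmers):
    """Map each k-mer to the k-mers that can follow it (they share k-1 nucleotides)."""
    by_prefix = defaultdict(list)
    for km in kmers:
        by_prefix[km[:-1]].append(km)      # index k-mers by their first k-1 letters
    # b follows a when the last k-1 letters of a equal the first k-1 letters of b
    return {km: sorted(by_prefix.get(km[1:], [])) for km in kmers}

# Two overlapping toy reads taken from ACCTGATGCCAATTGCACT (assumed for illustration).
reads = ["ACCTGATG", "TGATGCCAA"]
graph = adjacency(kmers_from_reads(reads, k=5))
print(graph["CCTGA"])   # ['CTGAT']
```

Indexing by prefix avoids comparing every k-mer against every other one; each k-mer is looked up once by its (k-1)-letter suffix.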

Eulerian circuit

• Eulerian circuit: visit each edge (bridge) exactly once and come back to the start
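
One standard way to find such a circuit is Hierholzer's algorithm, which the slides do not name; it is used here only as an illustration. The sketch assumes a directed graph given as a dictionary of successor lists and assumes that a circuit exists (the graph is connected and every node has equal in-degree and out-degree). The function name and the toy graph are hypothetical.

```python
def eulerian_circuit(graph, start):
    """Hierholzer's algorithm on a directed graph {node: [successors]}.

    Assumes an Eulerian circuit exists; returns the nodes in visiting order.
    """
    remaining = {u: list(vs) for u, vs in graph.items()}   # edges not yet used
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())   # follow an unused edge out of u
        else:
            circuit.append(stack.pop())        # no edges left: u is finished
    return circuit[::-1]

# Toy graph with 5 edges (assumed for illustration).
graph = {"A": ["B"], "B": ["C", "D"], "C": ["A"], "D": ["B"]}
tour = eulerian_circuit(graph, "A")
print(tour)   # ['A', 'B', 'D', 'B', 'C', 'A']
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.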

ACCTAGATTGAGGTCG

ACCTAGATTGAGGTC CCTAGATTGAGGTCG