Microarray Data Analysis RNA Folding | CMSC 423, Study notes of Computer Science

Material Type: Notes; Class: BIOINFO ALGS, DB, TOOLS; Subject: Computer Science; University: University of Maryland; Term: Spring 2007;

Uploaded on 07/30/2009

CMSC423: Bioinformatic Algorithms, Databases and Tools

Lecture 21

Microarray data analysis, RNA folding

Hierarchical clustering

  • UPGMA (remember from phylogenetic trees?)
    • compute distance between genes (e.g. Euclidean distance of expression vectors)
    • join most similar genes
    • repeat
    • Key element: the distance between a gene and a cluster, or between two clusters, is the average of the pairwise distances between the genes of one cluster and the genes of the other
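
The UPGMA steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `average_linkage` and `upgma`, the gene names, and the expression values are all made up for the example, and the quadratic search for the closest pair is chosen for readability rather than speed.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(cluster_a, cluster_b, vectors):
    """Average pairwise distance between the genes of two clusters."""
    total = sum(dist(vectors[g], vectors[h]) for g in cluster_a for h in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def upgma(vectors):
    """Return the merges as (cluster_a, cluster_b, distance) tuples, in order."""
    clusters = [(name,) for name in vectors]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        merges.append((a, b, average_linkage(a, b, vectors)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical expression vectors (one value per experiment).
expression = {
    "geneA": (1.0, 2.0, 3.0),
    "geneB": (1.0, 2.0, 4.0),
    "geneC": (8.0, 8.0, 8.0),
    "geneD": (10.0, 8.0, 8.0),
}
merges = upgma(expression)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The order of the merges is the tree: early merges are the tightest groups, and the last merge joins everything into one cluster.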

k-means clustering

  • Split data into exactly k clusters
  • Basic algorithm:
    • Create k arbitrary clusters - pick k points as cluster centers and assign each other point to the closest center
    • Re-compute the center of each cluster
    • Re-assign points to clusters
    • Repeat
  • Another approach: pick a point at random and see if moving it to a different cluster will improve the quality of the overall solution. Repeat!
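
The basic algorithm above might look like the following in Python. Treat it as a sketch under stated assumptions: the initial centers are simply the first k points (the slides only say "arbitrary"), the loop stops when no point changes cluster, and the names `k_means`, `assign`, and `mean` as well as the sample points are invented for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centers):
    """For every point, the index of its closest center."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]


def k_means(points, k, max_iter=100):
    centers = list(points[:k])  # assumption: first k points are the initial centers
    labels = assign(points, centers)
    for _ in range(max_iter):
        # Re-compute the center of each cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean(members)
        # Re-assign points to clusters; stop once nothing moves.
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centers = k_means(points, k=2)
print(labels)
```

The result depends on the starting centers, so in practice the procedure is usually run several times from different starts.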

Other clustering methods

  • Principal component analysis
    • "rotate" cloud of points until clusters become obvious
    • essentially projection onto the appropriate plane or line
  • Self Organizing Maps
    • based on neural networks
  • Clustering of time-series data

Clustering of time-series data

[Figure: three panels labeled correlated, anti-correlated, and un-correlated]

http://www.cs.cmu.edu/~jernst/stem/
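
One common way to put a number on "correlated", "anti-correlated", and "un-correlated" profiles is the Pearson correlation coefficient. The slides do not say which measure is meant, so the choice of Pearson correlation here is an assumption, and the four time-series below are made-up values picked to show the three cases.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression levels of four genes at five time points.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]    # moves with the reference
falling = [5.0, 4.0, 3.0, 2.0, 1.0]    # moves against the reference
unrelated = [1.0, -1.0, 0.0, -1.0, 1.0]

r_rising = pearson(reference, rising)
r_falling = pearson(reference, falling)
r_unrelated = pearson(reference, unrelated)
print(r_rising, r_falling, r_unrelated)
```

Values near +1 indicate correlated profiles, values near -1 anti-correlated ones, and values near 0 no linear relationship. A clustering method could use 1 minus the correlation as its distance.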

Assessing significance

  • All clustering methods produce clusters EVEN IF NO CLUSTERS EXIST!!!
  • Need to associate a confidence that the clusters are real
  • Basic approach - bootstrapping
    • randomly shuffle data labels (e.g. disease/no disease, or time-point)
    • recompute clustering
    • count how often the initial clusters appear in random data
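
The label-shuffling idea above can be sketched as follows (shuffling labels and recomputing a statistic is usually called a permutation test). To keep the example short, "recompute clustering" is replaced by a much simpler stand-in: a hypothetical `separation` score that measures how far apart the two label groups are. The function names, the seed, the number of trials, and the data are all assumptions made for illustration.

```python
import random


def separation(values, labels):
    """Absolute difference between the mean values of the two label groups."""
    group_a = [v for v, lab in zip(values, labels) if lab == "disease"]
    group_b = [v for v, lab in zip(values, labels) if lab != "disease"]
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))


def shuffle_test(values, labels, trials=1000, seed=0):
    """Fraction of label shufflings that separate at least as well as the real labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = separation(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # randomly shuffle the data labels
        if separation(values, shuffled) >= observed:
            hits += 1
    return hits / trials


# Hypothetical expression of one gene in eight samples.
values = [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1]
labels = ["disease"] * 4 + ["healthy"] * 4
p_value = shuffle_test(values, labels)
print(p_value)
```

A small fraction means the observed grouping rarely arises by chance. A large fraction suggests the "clusters" could just as well be produced by random labels.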

RNA folding

  • Function of RNA molecules depends on how they fold, based on nucleotide base-pairing

Nussinov's algorithm

  • Assumes no pseudo-knots
  • Dynamic programming approach - maximize # of pairings
  • S - string of nucleotides representing the RNA molecule
  • Sub-problem - F[i,j] - score of folding just S[i..j]
  • Initial values: F[i-1,i] = F[i,i] = F[i, i+1] = 0

Nussinov's algorithm

[Figure: diagrams of the four cases, drawn on positions i, i+1, k, k+1, j-1, j]

F[i,j] is the maximum of:

  • I. F[i+1,j] (S[i] unpaired)
  • II. F[i,j-1] (S[j] unpaired)
  • III. F[i+1,j-1] + 1 if S[i] is complementary to S[j] (S[i] paired with S[j])
  • IV. max over k, i < k < j, of F[i,k] + F[k+1,j] (branch)
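
A compact Python sketch of the recurrence follows. It makes several assumptions beyond the slides: only Watson-Crick pairs (A-U, G-C) count as complementary, indices are 0-based, and the table is filled by increasing subsequence length j - i, which is one order that guarantees every cell on the right-hand side is already computed. The initial values follow the slides (F[i,i] = F[i,i+1] = 0), so adjacent nucleotides never pair. The function name `nussinov` and the test sequence are illustrative.

```python
# Assumption: only Watson-Crick pairs count as complementary.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(s):
    """Maximum number of non-crossing base pairs in the RNA string s."""
    n = len(s)
    # F[i][j] = best score for s[i..j]; cells with j - i < 2 stay at 0.
    F = [[0] * n for _ in range(n)]
    for span in range(2, n):          # span = j - i, shortest pieces first
        for i in range(n - span):
            j = i + span
            best = max(F[i + 1][j],   # case I:  s[i] unpaired
                       F[i][j - 1])   # case II: s[j] unpaired
            if (s[i], s[j]) in PAIRS:  # case III: s[i] paired with s[j]
                best = max(best, F[i + 1][j - 1] + 1)
            for k in range(i + 1, j):  # case IV: branch at k
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1] if n else 0


score = nussinov("GGGAAAUCC")
print(score)
```

The answer for the whole molecule is the top-right cell F[0][n-1]. Recovering the actual pairs would require a traceback through the table, which is omitted here.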

Questions

  • In what order do we fill the dynamic programming table?
  • How can we ensure that "loops" consist of at least k nucleotides?