Microarray Data Analysis Exercise: Clustering, Functional Enrichment, and Motif Finding

Instructions for an exercise that helps students become familiar with microarray data analysis using tools such as MeV, FuncAssociate, and AlignACE or MEME. The exercise involves clustering genes into 30 clusters using MeV, performing functional enrichment analysis on selected clusters using Gene Ontology terms, and finding motifs in the promoter sequences of the genes in those clusters using AlignACE or MEME.

Uploaded on 08/19/2009

Homework 3
Due: Thursday, Dec 6, 8:30pm
Turn in everything electronically. However, please do not send me extra material that I did not ask for. Read carefully below for what I am expecting.
Purpose
The purpose of this exercise is for you to get familiar with microarray data and some of the available tools for analyzing them.
Data and tools needed
(1) Download the gene expression data file cellcycle.txt from the course website (http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2007/hw3_files/cellcycle.txt).
The original data were obtained by two different research groups that used microarrays to study gene expression at different stages of the cell cycle in Saccharomyces cerevisiae (more commonly known as baker’s yeast). (If you have forgotten what a cell cycle is, just keep reading. You don’t need to know it to finish the homework.) One study was conducted by Cho et al. (Molecular Cell, July 1998) using Affymetrix microarrays, and included 17 time points covering approximately two cell cycles. The other study was conducted by Spellman et al. (Molecular Biology of the Cell, December 1998) using cDNA microarrays, and consisted of three subsets of experiments, each with 10-20 time points covering about two cell cycles. The combined data were downloaded from http://genome-www.stanford.edu/cellcycle/.
I have processed the data with the following procedure:
i. Genes with more than ten missing values were removed.
ii. Missing values for the remaining genes were replaced with uniformly distributed random numbers between -1 and 1.
iii. Genes were ranked according to the standard deviation of their expression vectors. The 3000 genes with the highest variability were selected, and the rest were discarded.
The processed data are stored as a 3000 x 73 matrix, where each row corresponds to a gene and each column corresponds to an experiment (a time point). Gene names and experiment names are shown as row and column headers, respectively. Each value is the log ratio between the expression level of a gene measured in yeast cells collected at a particular time point (if you want to know, the cells were “synchronized” so that they were in the same cell-cycle phase) and the level measured in a mixture of yeast cells collected at different time points (i.e., a sample consisting of cells from different cell-cycle phases).
(2) Download the MeV 4.0 software for microarray gene expression clustering from http://www.tm4.org/mev.html (or use another clustering tool if you like).
(3) Use the FuncAssociate web interface for Gene Ontology analysis (http://llama.med.harvard.edu/cgi/func/funcassociate). You can use other similar tools if you prefer.
(4) Use AlignACE or MEME for motif finding. Both tools have standalone versions and web interfaces. AlignACE is available at http://atlas.med.harvard.edu/. MEME can be accessed and downloaded at http://meme.sdsc.edu/meme/intro.html. I believe the web interfaces of both tools limit the input size, so if the number of sequences you provide exceeds the limit, you may have to download and install a standalone version on your local computer.
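The preprocessing steps i to iii described under item (1) can be sketched in Python as follows. This is only an illustration, not the script that produced cellcycle.txt: the function name preprocess, its parameters, the use of None to mark a missing value, and the tiny demo table are all assumptions made for the example. The handout does not say whether the sample or the population standard deviation was used; for vectors of equal length the ranking is the same either way.

```python
import random
import statistics


def preprocess(rows, max_missing=10, top_n=3000, seed=0):
    """Filter, impute, and select genes (hypothetical helper).

    rows maps a gene name to a list of floats, with None for a missing value.
    """
    rng = random.Random(seed)
    kept = {}
    for gene, values in rows.items():
        # Step i: drop genes with more than max_missing missing values.
        if sum(v is None for v in values) > max_missing:
            continue
        # Step ii: replace each missing value with a uniform draw from [-1, 1].
        kept[gene] = [rng.uniform(-1.0, 1.0) if v is None else v for v in values]
    # Step iii: rank genes by standard deviation and keep the top_n most variable.
    ranked = sorted(kept, key=lambda g: statistics.pstdev(kept[g]), reverse=True)
    return {g: kept[g] for g in ranked[:top_n]}


# Tiny made-up table; thresholds are scaled down so the effect is visible.
demo = {
    "geneA": [0.1, None, -0.2, 0.3],
    "geneB": [None, None, None, 1.0],
    "geneC": [2.0, -2.0, 2.0, -2.0],
}
result = preprocess(demo, max_missing=2, top_n=2)
print(list(result))  # ['geneC', 'geneA']
```

With the handout's settings the call would use max_missing=10 and top_n=3000 on the full table.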

Problem 1: Clustering of microarray data (15 points)

Use MeV or another tool to cluster the 3000 genes into 30 clusters. You can use k-means or any other clustering algorithm you like. Brief instructions for using MeV: after downloading the MeV package to your local disk and unzipping it (no installation is necessary), double-click the MeV program icon. You’ll see two windows: MultiExperiment Viewer and Multiple Array Viewer. You only need to interact with the latter. Follow the steps below to load your expression data into the program: click on the menu File → Load data → File Format Descriptions → Tab delimited. You can then navigate through the directories to find the expression data file. You’ll then see part of the matrix loaded into the window. It is important to click on the first value in the upper-left corner (the red box shown in the figure below). After clicking “load”, you’ll see an image of the expression matrix. The image may be too wide to fit on your screen. You can adjust this by selecting “display” → “select element size” → “10x10”, or by using a custom size such as 10x5.

You are now ready to analyze the data using MeV. In this exercise you’re asked to partition the genes into 30 clusters using k-means. You can do this by clicking on the menu “Analysis” and then selecting “Clustering” and “KMC”. Change the number of clusters to 30. After the clustering is done, the results will appear in the left panel. Double-click on “Analysis Results” and then “KMC-genes”, and you’ll see several tabs that provide alternative views of the clusters. You can visualize the expression levels of the genes within each cluster using the tab “Expression Images” or “Expression Graphs”, and the average expression level of each cluster using the tab “Centroid Graphs”. Centroid Graphs and Expression Graphs also allow you to look at all clusters simultaneously. Table Views displays the genes in each cluster. Right-click the right panel to save the information about which gene belongs to which cluster. The name you provide in the file save dialog will be used as a prefix. For example, if you provide “cluster” when asked for a name, you will end up with 30 files named cluster-1.txt, cluster-2.txt, etc. These files will be used for problems #2 and #3.
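If you prefer to script the clustering instead of using MeV, the idea behind k-means can be sketched as below. This is a plain Lloyd-style k-means with squared Euclidean distance written from scratch for illustration; it is not MeV's KMC implementation, and the function names kmeans and group_by_cluster, the seed, and the four-gene demo are assumptions. The file names follow the prefix convention described above.

```python
import random


def kmeans(vectors, k, iterations=50, seed=0):
    """Cluster equal-length numeric vectors into k groups (illustrative only)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def group_by_cluster(genes, labels, prefix):
    """Map file names such as prefix-1.txt to the gene names in that cluster."""
    files = {}
    for gene, lab in zip(genes, labels):
        files.setdefault(f"{prefix}-{lab + 1}.txt", []).append(gene)
    return files


# Made-up demo: two well-separated groups of "expression vectors".
genes = ["g1", "g2", "g3", "g4"]
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels, centroids = kmeans(vectors, k=2)
files = group_by_cluster(genes, labels, "cluster")
for name in sorted(files):
    print(name, files[name])
```

For the homework data the call would be kmeans(vectors, k=30) on the 3000 expression vectors. Because the result depends on the random initial centroids, different seeds can give different clusters.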

MOTIF 1 width = 15 sites = 20 llr = 241 E-value = 3.5e-

Motif 1 Description

Simplified        A  88991a
pos.-specific     C  12:1717:91:9:
probability       G  11111:1:119:98:
matrix            T  1:1:2:191:

         bits    2.
                 1.7 *
                 1.5 * * * * *
Information      1.2 ** * ******
content          1.0 *** * ******
(17.4 bits)      0.8 **** * ********
                 0.6 ***************
                 0.4 ***************
                 0.2 ***************
                 0.0 ---------------

Multilevel           AAAACACTCAGCGGT
consensus            C
sequence
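To see how a summary like the one above relates to the aligned sites, the following sketch computes per-column letter frequencies, a simple majority consensus, and the information content of a column. It assumes a uniform background over A, C, G and T (so a perfectly conserved column carries 2 bits); MEME uses its own background model and reporting conventions, so its numbers need not match these. The function names are hypothetical, and the five sites are the first five rows of the BLOCKS listing given later in this handout.

```python
import math
from collections import Counter


def column_profile(sites, alphabet="ACGT"):
    """Per-column letter frequencies for equal-length aligned sites."""
    width = len(sites[0])
    profile = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        profile.append({letter: counts[letter] / len(sites) for letter in alphabet})
    return profile


def information_content(column):
    """Bits in one column, assuming a uniform 4-letter background."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def consensus(profile):
    """Most frequent letter in each column."""
    return "".join(max(col, key=col.get) for col in profile)


sites = [
    "AAAACAGTCAGCGGT",
    "AAAACACTCATCGGT",
    "AAAACAGTCAGCGGC",
    "TAAATACTCAGCGGT",
    "AGAATACTCAGCGGT",
]
profile = column_profile(sites)
print(consensus(profile))  # AAAACACTCAGCGGT
print([round(information_content(col), 2) for col in profile])
```

Columns in which all five sites agree reach the full 2 bits, while mixed columns such as the first one score lower.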

Optional: (5 points) For better visualization, you may want to create a sequence logo for each motif using the WebLogo website (http://weblogo.berkeley.edu/logo.cgi). For AlignACE, you can extract the sequences from the output (remove the MAP scores, numbers, etc.). Copy and paste the sequences into the WebLogo interface and create a sequence logo with the default parameters. You can do the same with MEME output by extracting the sequences from the BLOCKS format (if you use the MEME website, there is an option “VIEW RAW”, which gives you the raw sequences and saves you some time). Below is a motif in BLOCKS format from a MEME output. To draw a sequence logo, remove everything except the sequences.

Motif 1 in BLOCKS format

BL MOTIF 1 width=15 seqs=20
11 (  80) AAAACAGTCAGCGGT 1
3  (  62) AAAACACTCATCGGT 1
14 (  75) AAAACAGTCAGCGGC 1
2  ( 144) TAAATACTCAGCGGT 1
1  ( 184) AGAATACTCAGCGGT 1
6  ( 112) AAAAAAATCAGCGGC 1
16 (  53) AAAACATTCAACGGT 1
9  (  70) AAAGCACTCAGCAGT 1
12 (  12) AAAACACACCGCGGT 1
13 (  28) ACAAAACTCAGCGAT 1
8  (  85) AAACCACTGAGCGGT 1
5  (   5) AAAACCCTCGGCGGT 1
10 (  46) AAAATACTCAGTGAT 1
18 (  29) ACGACACTCAGCTGT 1
20 ( 183) CAAGCACTTAGCGGT 1
17 (   5) AAAAGAATCAGCGCC 1
15 ( 154) GAAACACTCAGCGCA 1
19 ( 161) AATACAATCAGCTGC 1
7  (  23) ACAACACTCAGAGTC 1
4  (  16) CAAACACAAAACGGT 1
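Stripping a BLOCKS listing down to bare sequences can be done by hand, but a short script is less error-prone. The sketch below assumes that each site row has the form "name ( position) SEQUENCE weight", as in the listing above; the function name extract_sequences and the regular expression are illustrative and have not been checked against every MEME version. The demo text is an excerpt of three rows with a header adapted to match.

```python
import re

# One site row: sequence name, start position in parentheses, the site, a weight.
ROW = re.compile(r"^\s*(\S+)\s+\(\s*(\d+)\)\s+([A-Za-z]+)\s+\d+\s*$")


def extract_sequences(blocks_text):
    """Return the site sequences from BLOCKS-style text, in file order."""
    sequences = []
    for line in blocks_text.splitlines():
        match = ROW.match(line)
        if match:  # header and blank lines do not match and are skipped
            sequences.append(match.group(3))
    return sequences


excerpt = """BL MOTIF 1 width=15 seqs=3
11 (  80) AAAACAGTCAGCGGT 1
3  (  62) AAAACACTCATCGGT 1
14 (  75) AAAACAGTCAGCGGC 1
"""
sequences = extract_sequences(excerpt)
print("\n".join(sequences))
```

The printed lines, one sequence per line, can be pasted into the WebLogo input box.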

To turn in: The motifs in pure-text format or sequence logos as described above, and their statistical significance (MAP scores or E-values).

Bonus (5 points)

How much time did you spend? With whom did you discuss the homework, and what was the discussion about? Do you have any comments about the course and the homework? Which topics would you like to learn more about? Which topics did you not enjoy so much?