Microarray Analysis Exercise: Classification using Support Vector Machines, Assignments of Biology

An exercise on using support vector machines (SVM) for classification in microarray analysis. Students will use data from Golub et al. (1999) and preprocess it using R and Bioconductor packages. They will then apply SVM to the preprocessed data and evaluate the error rates for both the training and testing sets. The exercise aims to help students learn how to apply SVM to microarray data analysis.


Uploaded on 07/23/2009

Exercise: Microarray Analysis: Classification using Support Vector Machines

April 14, 2006

Due Date: April 27, 2006

Objective

  • Learn to apply SVM on microarray data analysis.

1 Pre-lab

In this exercise we explore the use of support vector machines (SVM) for classification in microarray analysis. We will use the data set presented in Golub et al. (1999), which is available in an online repository from the authors (http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html) and is included in an R data package, golubEsets, in Bioconductor. The expression data in this data set come from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays (HU6800 chip) containing probes for 6,817 human genes and ESTs. The chip actually contains 7,129 different probe sets; some of these map to the same genes and others are there for quality control purposes. The data comprise 47 samples of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 samples of AML, 72 samples in total. These samples are further divided into a training set (golubTrain) with 38 observations and a test set (golubTest) with 34 observations. The SVM solver comes from the package e1071 (you can download it from http://cran.r-project.org/src/contrib/Descriptions/e1071.html).

2 Data pre-processing
First you will need to load the following R and Bioconductor packages


library(golubEsets)
library(e1071)
library(Biobase)
library(genefilter)

Then we obtain the required expression data in the form of exprSets (golubTrain and golubTest) by using the data function.

data(golubTrain)
data(golubTest)

Apply the preliminary gene filter procedure on golubTrain as we did before.

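# Floor and ceiling the raw intensities
X <- exprs(golubTrain)
X[X<100] <- 100
X[X>16000] <- 16000

# Filter factory: keep a gene if max/min > r and max - min > d
mmfilt <- function(r=5, d=500, na.rm=TRUE) {
  function(x) {
    minval <- min(x, na.rm=na.rm)
    maxval <- max(x, na.rm=na.rm)
    (maxval/minval > r) && (maxval-minval > d)
  }
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun)

# Keep the selected genes and move to the log10 scale
X <- X[sub,]
X <- log10(X)
golubTrainSub <- golubTrain[sub,]
golubTrainSub@exprs <- X

# Class labels: ALL/AML, then refined by T-cell/B-cell status where available
Y <- golubTrainSub$ALL.AML
Y <- paste(golubTrain$ALL.AML, golubTrain$T.B.cell)
Y <- sub(" NA", "", Y)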

This is a non-specific filter: the genes are selected according to their variability, not according to their ability to classify any particular set of samples.
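
To see what such a filter does, here is a minimal base-R sketch on a small made-up matrix. The gene names and values are hypothetical, and plain apply is used in place of genefilter. It applies the same floor, ceiling and max/min rules and never looks at any class labels, which is what makes the filter non-specific.

```r
# Hypothetical expression matrix: 6 genes (rows) x 5 samples (columns).
X <- rbind(
  flat_low   = c(20, 35, 50, 80, 90),             # everything below the floor
  flat_high  = c(5000, 5100, 5200, 5050, 5150),   # large values, little variation
  variable_1 = c(120, 900, 4000, 150, 2500),      # passes both rules
  variable_2 = c(100, 300, 800, 16500, 50),       # passes both rules after clamping
  ratio_only = c(100, 150, 300, 550, 200),        # ratio 5.5 > 5, but range 450 < 500
  range_only = c(2000, 2600, 3000, 2200, 2400)    # range 1000 > 500, but ratio 1.5 < 5
)

# Same floor and ceiling as in the exercise.
X[X < 100] <- 100
X[X > 16000] <- 16000

# Keep a gene only if max/min > 5 AND max - min > 500.
keep <- apply(X, 1, function(x) {
  (max(x) / min(x) > 5) && (max(x) - min(x) > 500)
})

# Subset to the selected genes and move to the log10 scale.
X_filtered <- log10(X[keep, , drop = FALSE])

print(keep)
print(rownames(X_filtered))
```

Only the two genes that satisfy both the ratio rule and the range rule survive; a gene that satisfies just one of them is dropped.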

In order to make the test set comparable we must select the same set of genes and apply the same transformations to that data set.

Xt <- exprs(golubTest)
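
One possible way to carry this out is sketched below on synthetic matrices (not the Golub data). The names preprocess, train, test and keep are hypothetical; the point is only that the gene mask is computed once from the training set and then reused, unchanged, on the test set.

```r
# Hypothetical training and test matrices that share the same gene rows.
set.seed(42)
genes <- paste0("gene", 1:8)
train <- matrix(runif(8 * 6, 50, 20000), nrow = 8, dimnames = list(genes, NULL))
test  <- matrix(runif(8 * 4, 50, 20000), nrow = 8, dimnames = list(genes, NULL))

# The same floor and ceiling must be applied to both data sets.
preprocess <- function(m) {
  m[m < 100] <- 100
  m[m > 16000] <- 16000
  m
}

# The gene mask is derived from the training data only.
train <- preprocess(train)
keep <- apply(train, 1, function(x) {
  (max(x) / min(x) > 5) && (max(x) - min(x) > 500)
})

# Reuse the training mask on the test set; do not re-filter on test data.
train_sub <- log10(train[keep, , drop = FALSE])
test_sub  <- log10(preprocess(test)[keep, , drop = FALSE])

print(dim(train_sub))
print(dim(test_sub))
```

Re-running the filter on the test set would generally select a different set of genes, so a model trained on one set of rows could not be applied to the other.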

  • Question 4: As a second exercise, you could reverse the roles of the two data sets: the test set could be treated as the training data set, and the training data set could be treated as the test data set. What is the error rate for the training set? What is the average error rate for 10-fold cross-validation? What is the error rate for the testing set?

For more details about SVM in R, please refer to the e1071 package manual at http://cran.r-project.org/doc/packages/e1071.pdf
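
As a rough guide to the mechanics only, the following sketch computes the three kinds of error rate on synthetic two-class data, assuming the e1071 package is installed. The data, the linear kernel and the variable names are illustrative assumptions, not the settings required by the exercise. Note that svm expects samples in rows, so an expression matrix with genes in rows has to be transposed with t() first.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  library(e1071)
  set.seed(1)

  # Synthetic stand-in data: 40 samples (rows) x 20 features (columns).
  n <- 40
  p <- 20
  y <- factor(rep(c("ALL", "AML"), each = n / 2))
  x <- matrix(rnorm(n * p), nrow = n)
  # Shift the first five features of one class so the classes are separable.
  x[y == "AML", 1:5] <- x[y == "AML", 1:5] + 2

  # Hypothetical split: 14 samples per class for training, the rest for testing.
  train_idx <- c(1:14, 21:34)

  # cross = 10 asks svm to run 10-fold cross-validation on the training data.
  model <- svm(x[train_idx, ], y[train_idx], kernel = "linear", cross = 10)

  # Training error: misclassified fraction on the data used for fitting.
  train_err <- mean(predict(model, x[train_idx, ]) != y[train_idx])

  # Cross-validation error: tot.accuracy is reported as a percentage.
  cv_err <- 1 - model$tot.accuracy / 100

  # Test error: misclassified fraction on the held-out samples.
  test_err <- mean(predict(model, x[-train_idx, ]) != y[-train_idx])

  print(c(train = train_err, cv = cv_err, test = test_err))
}
```

Reversing the roles of the two data sets then amounts to swapping which rows are passed to svm and which are passed to predict.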

4 What do you need to submit?

  1. The complete source code in R (do include some comments).
  2. The answers to the questions in this exercise. You will need some R commands that do not appear in the text to answer the questions; get help from the reference manual or CRAN.

5 Acknowledgement

This exercise is adapted from the lab material in A short course on Computational and Statistical Aspects of Microarray Analysis, A. Antoniadis and R. Gentleman, May 2003, Milan