


An exercise on using support vector machines (SVM) for classification in microarray analysis. Students will use data from Golub et al. (1999) and preprocess it using R and Bioconductor packages. They will then apply an SVM to the preprocessed data and evaluate the error rates for both the training and test sets. The exercise aims to help students learn how to apply SVMs to microarray data analysis.



Due Date: April 27, 2006
Objective
In this exercise we explore the use of support vector machines (SVM) for classification in microarray analysis. We will use the data set presented in Golub et al. (1999), which is available in an online repository from the authors (http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html) and is included in an R data package, golubEsets, in Bioconductor. The expression data in this data set come from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays (HU6800 chip) containing probes for 6,817 human genes and ESTs. The chip actually contains 7, different probe sets; some of these map to the same genes and others are there for quality control purposes. The data comprise 47 samples of ALL (38 B-cell ALL and 9 T-cell ALL) and 25 samples of AML. These samples are further divided into a training set (golubTrain) with 38 observations and a test set (golubTest) with 34 observations. The SVM solver comes from the package e1071 (you can download it from http://cran.r-project.org/src/contrib/Descriptions/e1071.html).
First, load the following R and Bioconductor packages:
library(golubEsets)
library(e1071)
library(Biobase)
library(genefilter)
Then we obtain the required expression data in the form of exprSets (golubTrain and golubTest) by using the data function.
data(golubTrain)
data(golubTest)
Apply the preliminary gene filter procedure on golubTrain as we did before.
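# Floor and ceiling the raw expression values
X <- exprs(golubTrain)
X[X < 100] <- 100
X[X > 16000] <- 16000

# Keep genes with max/min > r and max - min > d
mmfilt <- function(r = 5, d = 500, na.rm = TRUE) {
  function(x) {
    minval <- min(x, na.rm = na.rm)
    maxval <- max(x, na.rm = na.rm)
    (maxval / minval > r) && (maxval - minval > d)
  }
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun)

# Subset to the selected genes and log-transform
X <- X[sub, ]
X <- log10(X)
golubTrainSub <- golubTrain[sub, ]
golubTrainSub@exprs <- X

# Class labels: ALL/AML combined with the T/B cell type where available
Y <- golubTrainSub$ALL.AML
Y <- paste(golubTrain$ALL.AML, golubTrain$T.B.cell)
Y <- sub(" NA", "", Y)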
This is a non-specific filter: the genes were selected according to their variability across samples, not with respect to their ability to classify any particular set of samples.
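To make the two filter criteria concrete, here is a small base-R sketch on a made-up matrix. The matrix `X_toy` and its row names are illustrative assumptions, not part of the Golub data; the thresholds (floor 100, ceiling 16000, ratio 5, difference 500) are the ones used in the code for this exercise.

```r
# Toy expression matrix: 5 genes (rows) x 3 samples (columns); values are invented.
X_toy <- rbind(
  flat       = c(120, 130, 125),       # almost no variation
  low        = c(20, 50, 90),          # everything below the floor of 100
  variable   = c(150, 900, 4000),      # large ratio and large range
  highflat   = c(15000, 15800, 20000), # large range but small ratio
  smallrange = c(100, 300, 550)        # ratio above 5 but range below 500
)

# Same floor and ceiling as in the exercise
X_toy[X_toy < 100] <- 100
X_toy[X_toy > 16000] <- 16000

# A gene is kept only if BOTH criteria hold: max/min > 5 and max - min > 500
keep <- apply(X_toy, 1, function(x) {
  (max(x) / min(x) > 5) && (max(x) - min(x) > 500)
})
print(keep)
```

Only the `variable` row passes both tests. Note that class labels play no role anywhere in this computation, which is what makes the filter non-specific.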
In order to make the test set comparable we must select the same set of genes and apply the same transformations to that data set.
Xt <- exprs(golubTest)
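One possible way to carry the training-set preprocessing over to the test set is sketched below on toy data. The helper name `apply_train_filter` and the toy inputs are assumptions made for illustration and do not come from the original exercise; with the real data, `Xt` would be `exprs(golubTest)` and `keep` would be the logical vector `sub` computed from golubTrain.

```r
# Hypothetical helper: apply the training-set floor/ceiling, gene selection,
# and log10 transform to a test matrix (genes in rows, samples in columns).
apply_train_filter <- function(Xt, keep, lower = 100, upper = 16000) {
  stopifnot(length(keep) == nrow(Xt))  # one flag per gene
  Xt[Xt < lower] <- lower
  Xt[Xt > upper] <- upper
  Xt <- Xt[keep, , drop = FALSE]       # reuse the genes chosen on the training set
  log10(Xt)
}

# Toy test matrix and a made-up selection vector
Xt_toy <- rbind(g1 = c(50, 200), g2 = c(1000, 20000), g3 = c(300, 400))
keep_toy <- c(TRUE, TRUE, FALSE)

Xt_sub <- apply_train_filter(Xt_toy, keep_toy)
print(Xt_sub)
```

The key design point is that the gene selection is not recomputed on the test set: the filter decisions made on the training data are reused, so both sets end up with the same rows in the same order.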
For more details about svm in R, please refer to the e1071 package manual at http://cran.r-project.org/doc/packages/e1071.pdf
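As a rough orientation before consulting the manual, the sketch below fits a classifier with `svm()` from e1071 on simulated data and computes training and test error rates. The simulated data, the number of features, the linear kernel, and `cost = 1` are all assumptions chosen for illustration, not settings prescribed by the exercise. With the real data, the inputs would be the filtered, log-transformed matrices transposed so that samples are in rows.

```r
library(e1071)

set.seed(1)
p <- 20  # number of simulated "genes" (illustrative)

# Simulate n samples with two classes; the first 5 features are shifted for AML
make_data <- function(n) {
  y <- factor(rep(c("ALL", "AML"), length.out = n), levels = c("ALL", "AML"))
  x <- matrix(rnorm(n * p), nrow = n)
  x[y == "AML", 1:5] <- x[y == "AML", 1:5] + 2
  list(x = x, y = y)
}
train <- make_data(38)  # same sizes as golubTrain and golubTest
test <- make_data(34)

# svm() expects samples in rows; a factor response gives classification
fit <- svm(train$x, train$y, kernel = "linear", cost = 1)

# Error rate = fraction of misclassified samples
train_err <- mean(predict(fit, train$x) != train$y)
test_err <- mean(predict(fit, test$x) != test$y)
cat("training error:", train_err, " test error:", test_err, "\n")
```

The training error is usually optimistic because the model has already seen those samples, which is why the error rate on the held-out test set is reported alongside it.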
4 What do you need to submit?
5 Acknowledgement
This exercise is adapted from the lab material in A Short Course on Computational and Statistical Aspects of Microarray Analysis, A. Antoniadis and R. Gentleman, May 2003, Milan.