







































An overview of the functions in an R package used for processing and analyzing mass spectrometry data. The functions cover data preprocessing, peak detection, mass adjustment, and biomarker matrix creation. The document also includes examples of how to use these functions and test their output.








































Version 1.
Date Feb 01 2007
Title Processing & Classification of Protein Mass Spectra (SELDI) Data
Author Jarek Tuszynski
Maintainer Jarek Tuszynski
Depends R (>= 2.0.0), PROcess, e1071, nnet, rpart, caTools, XML, digest, MASS
Description Functions for processing and classification of protein mass spectra (SELDI) data. Also includes support for mzXML Files.
License The caMassClass Software License, Version 1.0 (See COPYING file or “http://ncicb.nci.nih.gov/download/camassclasslicense.jsp”)
Repository CRAN
Date/Publication 2007-02-01 20:35:
caMassClass-package
msc.baseline.subtract
msc.biomarkers.fill
msc.biomarkers.read.csv & msc.biomarkers.write.csv
msc.classifier.run
msc.classifier.test
msc.copies.merge
msc.features.remove
msc.features.scale
msc.features.select
msc.mass.adjust
msc.mass.cut
msc.peaks.align
msc.peaks.clust
msc.peaks.find
msc.peaks.read.csv & msc.peaks.write.csv
msc.peaks.read.mzXML & msc.peaks.write.mzXML
msc.preprocess.run
msc.project.read
msc.project.run
msc.rawMS.read.csv
msc.rawMS.read.mzXML & msc.rawMS.write.mzXML
msc.sample.correlation
read.mzXML & write.mzXML
Index
caMassClass-package Processing and Classification of Protein Mass Spectra Data
Description
Functions for processing and classification of protein mass spectra data. Includes support for I/O in mzXML and CSV formats.
Details
Package: caMassClass
Version: 1.
Date: 2006-04-
Depends: R (>= 2.0.0), PROcess, e1071, nnet, rpart, caTools, XML, digest
License: The caMassClass Software License, Version 1.0 (See COPYING file or http://ncicb.nci.nih.gov/download/camassclasslicense.jsp)
URL: http://ncicb.nci.nih.gov/download/index.jsp
Index of Preprocessing Functions:
msc.project.read          Read and Manage a Batch of Protein Mass Spectra
msc.project.run           Read and Preprocess Protein Mass Spectra
msc.preprocess.run        Preprocessing Pipeline of Protein Mass Spectra
msc.baseline.subtract     Baseline Subtraction for Mass Spectra Data
msc.mass.cut              Remove Low Mass Portion of the Mass Spectra Data
msc.mass.adjust           Perform Normalization and Mass Drift Adjustment
msc.peaks.find            Find Peaks of Mass Spectra
msc.peaks.clust           Clusters Peaks of Mass Spectra (low level)
msc.peaks.align           Align Peaks of Mass Spectra into Biomarker Matrix
msc.biomarkers.fill       Fill Empty Spaces in Biomarker Matrix
msc.copies.merge          Merge Multiple Copies of Mass Spectra Samples
msc.sample.correlation    Sample Correlation (low level)
msc.baseline.subtract Baseline Subtraction for Mass Spectra Data
Arguments
X      Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store the M/Z mass of each row/feature.
...    Parameters to be passed to the bslnoff function from the PROcess library. See Details for an explanation of breaks, qntl, and bw. The Boolean parameter plot can be used to plot the results.
Details
Performs baseline subtraction for every sample in a batch of data, using the bslnoff function from the PROcess library. The bslnoff function splits each spectrum into a number (breaks) of exponentially growing regions. The baseline is calculated by applying quantile(...,probs=qntl) to each region and smoothing the results using the loess(..., span=bw, degree=1) function.
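The procedure described in Details can be illustrated with a small base-R sketch. It is not the bslnoff implementation from PROcess: the function name baseline_sketch, its default values, and the synthetic spectrum are assumptions made only for this illustration.

```r
# Illustrative sketch only; this is NOT the PROcess::bslnoff implementation.
# baseline_sketch() and its default arguments are hypothetical.
baseline_sketch <- function(mz, intensity, breaks = 8, qntl = 0, bw = 0.3) {
  n <- length(intensity)
  # Region edges grow exponentially: short regions at low mass, long ones at high mass.
  edges <- unique(round(exp(seq(0, log(n), length.out = breaks + 1))))
  edges[1] <- 0
  region <- cut(seq_len(n), breaks = edges, labels = FALSE)
  # One baseline level per region: a low quantile of the intensities inside it.
  level <- as.numeric(tapply(intensity, region, quantile, probs = qntl))
  step <- level[region]
  # Smooth the resulting step function with a local linear fit.
  fit <- loess(step ~ mz, span = bw, degree = 1)
  fit$fitted
}

# Synthetic spectrum: decaying drift plus two peaks (made-up data).
mz <- seq(1000, 10000, length.out = 500)
drift <- 50 * exp(-mz / 3000)
peaks <- 20 * exp(-((mz - 3000) / 50)^2) + 20 * exp(-((mz - 6000) / 50)^2)
intensity <- drift + peaks

bsl <- baseline_sketch(mz, intensity)
corrected <- intensity - bsl   # same size as the input, baseline removed
```

A low quantile (qntl close to 0) tracks the floor of the signal rather than its peaks, which is why the estimate can be subtracted without flattening the peaks themselves.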
Value
Data in the same format and size as input variable X but with the subtracted baseline.
Author(s)
Jarek Tuszynski (SAIC)
See Also
Examples
# load the "Data_IMAC.Rdata" file, creating it first if needed
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

# subtract the baseline and check the size of the change
Y = msc.baseline.subtract(X)
avr = mean(abs(X-Y))
cat("Data Size: ", dim(X), " average change :", avr, "\n")
stopifnot(avr<0.3)

# repeat on the test data shipped with the PROcess package
directory = system.file("Test", package = "PROcess")
X = msc.rawMS.read.csv(directory)
Y = msc.baseline.subtract(X, plot=TRUE)
avr = mean(abs(X-Y))
cat("Data Size: ", dim(X), " average change :", avr, "\n")
stopifnot(avr>7.5)
msc.biomarkers.fill Fill Empty Spaces in Biomarker Matrix
Description
Fill empty spaces (NA’s) in biomarker matrix created by msc.peaks.align
Usage
msc.biomarkers.fill( X, Bmrks, BinBounds, FillType=0.9)
Arguments
X            Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store the M/Z mass of each row.
Bmrks        Biomarker matrix containing one sample per column and one biomarker per row.
BinBounds    Position (mass) of the left-most and right-most peak in each bin.
FillType     How to fill empty spaces in the biomarker data?
Details
This function attempts to correct a problem that is a side effect of the msc.peaks.align function: numerous NA's in the biomarker data, which occur whenever a peak is found in only some of the samples. msc.peaks.align already removes the most problematic features using the SampFrac variable, but many NA's are likely to remain, and they can cause problems for some classification algorithms.
Value
Data in the same format and size as Bmrks
msc.biomarkers.read.csv & msc.biomarkers.write.csv
Usage
X = msc.biomarkers.read.csv(fname, mzXML.record=FALSE)
msc.biomarkers.write.csv(X, fname)
Arguments
fname           Either a character string naming a file or a connection.
X               Biomarker data in the form of a 2D matrix (nFeatures × nSamples) or a 3D array (nFeatures × nSamples × nCopies). Notice that this format is the transpose of the data in the CSV file.
mzXML.record    Should an mzXML record be created to store meta-data (input file names)?
Value
Function msc.biomarkers.read.csv returns the data that was read, in the format described for argument X above. If mzXML.record was set to TRUE, then an mzXML record with the input file names will be attached to X as the "mzXML" attribute. Function msc.biomarkers.write.csv does not return anything.
Author(s)
Jarek Tuszynski (SAIC)
See Also
msc.biomarkers.fill, msc.rawMS.read.csv
Examples
example("msc.peaks.align", verbose=FALSE) # create biomarkers data
X = Y$Bmrks # biomarkers data is stored in variable 'Y$Bmrks'
msc.biomarkers.write.csv(X, "biomarkers.csv")
Y = msc.biomarkers.read.csv("biomarkers.csv", mzXML.record=TRUE)
stopifnot( all(X==Y, na.rm=TRUE) )
mzXML = attr(Y,"mzXML")
strsplit(mzXML$parentFile, '\n') # show mzXML$parentFile record
file.remove("biomarkers.csv")
msc.classifier.run Train and Test Chosen Classifier.
Description
Common interface for training and testing several standard classifiers. Includes feature selection and feature scaling steps. Allows the user to specify that some test samples are multiple copies of the same sample and should therefore receive the same label.
Usage
msc.classifier.run( xtrain, ytrain, xtest, ret.prob=FALSE,
    RemCorrCol=0, KeepCol=0, prior=1, same.sample=NULL,
    ScaleType=c("none", "min-max", "avr-std", "med-mad"),
    method=c("svm", "nnet", "lda", "qda", "LogitBoost", "rpart"), ...)
Arguments
xtrain         A matrix or data frame with training data. Rows contain samples and columns contain features/variables.
ytrain         Class labels for the training data samples. A response vector with one label for each row/component of xtrain. Can be either a factor, string or a numeric vector.
xtest          A matrix or data frame with test data. Rows contain samples and columns contain features/variables.
ret.prob       If set to TRUE, then the a-posteriori probabilities for each class are returned as an attribute called "probabilities".
same.sample    Optional parameter specifying that some (or all) test samples have multiple copies, which should be used to predict a single label for all of them. Can be either a factor, string or a numeric vector, with unique values for different samples and identical values for copies of the same sample.
RemCorrCol     If non-zero, then some of the highly correlated columns are removed using the msc.features.remove function with ccMin=RemCorrCol.
KeepCol        If non-zero, then columns with low AUC are removed.
Examples
data(iris)
mask = sample.split(iris[,5], SplitRatio=1/4) # very few points to train
xtrain = iris[ mask,-5] # use output of sample.split to ...
xtest = iris[!mask,-5] # create train and test subsets
ytrain = iris[ mask, 5]
ytest = iris[!mask, 5]
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="svm") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="nnet") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="lda") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="qda") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="LogitBoost") )

a=table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="LogitBoost") )
stopifnot( sum(diag(a))==length(ytrain) )
msc.classifier.test Test a Classifier through Cross-validation
Description
Test a classifier through cross-validation. Common interface for cross-validation of several standard classifiers. Includes feature selection and feature scaling steps. Allows the user to specify that some test samples are multiple copies of the same sample and should therefore receive the same label.
Usage
msc.classifier.test( X, Y, iters=50, SplitRatio=2/3, verbose=FALSE,
    RemCorrCol=0, KeepCol=0, prior=1, same.sample=NULL,
    ScaleType=c("none", "min-max", "avr-std", "med-mad"),
    method=c("svm", "nnet", "lda", "qda", "LogitBoost", "rpart"), ...)
Arguments
X             A matrix or data frame with training/testing data. Rows contain samples and columns contain features/variables.
Y             Class labels for the training data samples. A response vector with one label for each row/component of X. Can be either a factor, string or a numeric vector. Labels with 'NA' value mark the test data set.
iters         Number of iterations. Each iteration consists of splitting the data into train and test sets, performing the classification and storing the results.
SplitRatio    Splitting ratio used to divide available data during cross-validation:
Details
This function follows standard cross-validation steps:
Value
Y       Predicted class labels. If there were any unknown samples in the input data, marked by NA's in input Y, then output Y will only hold predictions for those samples; otherwise predictions will be made for all samples.
Res     Holds the fraction of correct predictions during cross-validation for each iteration. mean(Res) will give you the average accuracy.
Tabl    Contingency table of predictions, comparing all the input labels to the output labels.
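The kind of loop that produces values like Res and Tabl can be sketched in base R as follows. This is not the code of msc.classifier.test: the nearest-centroid classifier is a hypothetical stand-in for the supported methods, and the plain random split is an assumption that differs from the stratified sample.split used elsewhere in this manual.

```r
# Hypothetical stand-in classifier (not part of caMassClass):
# assign every test row to the class with the nearest training centroid.
nearest_centroid <- function(xtrain, ytrain, xtest) {
  classes <- levels(ytrain)
  d <- sapply(classes, function(k) {
    centre <- colMeans(xtrain[ytrain == k, , drop = FALSE])
    rowSums(sweep(xtest, 2, centre)^2)        # squared distance to centroid k
  })
  factor(classes[max.col(-d, ties.method = "first")], levels = classes)
}

set.seed(1)
X <- as.matrix(iris[, 1:4])
Y <- iris[, 5]
iters <- 20
SplitRatio <- 2/3
nTrain <- round(SplitRatio * nrow(X))

Res  <- numeric(iters)                         # accuracy of each iteration
Tabl <- matrix(0, nlevels(Y), nlevels(Y),
               dimnames = list(true = levels(Y), predicted = levels(Y)))
for (i in seq_len(iters)) {
  train <- sample(nrow(X)) <= nTrain           # random train/test split
  pred  <- nearest_centroid(X[train, ], Y[train], X[!train, ])
  Res[i] <- mean(pred == Y[!train])            # fraction of correct predictions
  Tabl <- Tabl + unclass(table(Y[!train], pred))
}
mean(Res)                                      # average accuracy
```

Accumulating the contingency table over iterations shows which classes are confused with each other, which a single accuracy number hides.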
Author(s)
Jarek Tuszynski (SAIC)
msc.copies.merge Merge Multiple Copies of Mass Spectra Samples
Details
The quality of a sample is measured by calculating two variables for each copy of each sample: inner correlation (average correlation between multiple copies of the same sample) and outer correlation (average correlation between each sample and every other sample within the same copy). Inner correlation measures how similar the copies are to each other, and outer correlation measures how similar each copy is to everybody else. For example, in an experiment using SELDI technology to distinguish cancerous from non-cancerous samples, one can assume that most of the proteins present in both kinds of samples will be the same. In that case one expects high correlation between samples and even higher correlation between copies of the same sample.
If mergeType/4 (mergeType %/% 4) is
Option 2 is more suitable in case of data with a lot of copies, when we can afford dropping one copy. Option 1 is designed to patch the most serious problems with the data. There are also four merging options, if mergeType mod 4 (mergeType %% 4) is
In preparation for classification one can use multiple copies in several ways: option 2 above improves (one hopes) accuracy of each sample, while options 1 and 3 increase number of samples available during classification. So the choice is: do we want a lot of samples during classification or fewer, better samples? The best option of mergeType depends on kind of data.
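The inner and outer correlations used as quality measures in this section can be computed along the following lines. This is a sketch under stated assumptions, not the package's code: the function name copy_correlations and the synthetic 3D array are made up for the illustration, and at least two samples and two copies are assumed.

```r
# Sketch: inner and outer correlation for a [nFeatures x nSamples x nCopies] array.
# copy_correlations() is a hypothetical helper, not a caMassClass function.
copy_correlations <- function(X) {
  nS <- dim(X)[2]
  nC <- dim(X)[3]
  innerCor <- outerCor <- matrix(NA_real_, nS, nC)
  for (j in seq_len(nS)) {
    cc <- cor(X[, j, ])                        # copies of sample j against each other
    innerCor[j, ] <- (rowSums(cc) - 1) / (nC - 1)
  }
  for (k in seq_len(nC)) {
    cs <- cor(X[, , k])                        # all samples within copy k
    outerCor[, k] <- (rowSums(cs) - 1) / (nS - 1)
  }
  list(inner = innerCor, outer = outerCor)
}

# Synthetic data: a profile shared by all samples, a sample-specific part,
# and a small amount of noise that differs between copies.
set.seed(42)
nF <- 200; nS <- 5; nC <- 2
common <- rnorm(nF)
X <- array(0, c(nF, nS, nC))
for (j in seq_len(nS)) {
  own <- rnorm(nF)
  for (k in seq_len(nC)) X[, j, k] <- common + own + rnorm(nF, sd = 0.1)
}
cc <- copy_correlations(X)
cc$inner   # close to 1: copies of the same sample agree
cc$outer   # lower: different samples share only the common profile
```

A copy whose inner correlation falls well below its outer correlation resembles other samples more than its own duplicates, which is the kind of pattern a quality check would flag.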
Value
Returns a matrix containing features as rows and samples as columns, unless mergeType is 0, 4, or 8, in which case no merging is done and the data is returned in the same or a similar format as the input.
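Because mergeType packs two choices into one integer, it can help to decode a value by hand. The arithmetic below uses only the two operators named in Details; the variable names are illustrative.

```r
# Decode a mergeType value into its two components.
mergeType <- 1 + 2 + 4          # the value used in the Examples of this section

firstPart  <- mergeType %/% 4   # integer division selects the first group of options
secondPart <- mergeType %%  4   # remainder selects the merging option

# Values for which the remainder is zero are the "no merging" cases.
noMerging <- c(0, 4, 8) %% 4 == 0
```

For mergeType = 7 this gives 1 for the integer division and 3 for the remainder, while 0, 4 and 8 all leave a remainder of 0, matching the statement above that no merging is done for those values.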
Author(s)
Jarek Tuszynski (SAIC)
See Also
Examples
# load the "Data_IMAC.Rdata" file, creating it first if needed
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

Y = msc.copies.merge(X, 1+2+4)
colnames(Y)
stopifnot( dim(Y)==c(11883,60) )
msc.features.remove Remove Highly Correlated Features
Description
Remove Highly Correlated Features. The function checks neighbor features looking for highly correlated ones and removes one of them. Used in order to drop dimensionality of the data.
Usage
msc.features.remove(Data, Auc, ccMin=0.9, verbose=FALSE)
Arguments
Data       Data containing one sample per row and one feature per column.
Auc        A measure of usefulness of each column/feature, used to choose which one of two highly correlated columns to remove. Usually a measure of the discrimination power of each feature as measured by colAUC, Student's t-test or another method. See details.
ccMin      Minimum correlation coefficient of "highly correlated" columns.
verbose    Boolean flag that turns debugging printouts on.
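The idea of checking neighboring features and dropping the less useful member of a highly correlated pair can be sketched as below. This is one possible reading of the description, not the algorithm of msc.features.remove; the function name and the synthetic data are assumptions.

```r
# Sketch: walk through the columns, compare each one with the last kept column,
# and drop whichever of the two has the lower usefulness score.
# remove_correlated_neighbours() is a hypothetical helper.
remove_correlated_neighbours <- function(Data, Auc, ccMin = 0.9) {
  keep <- rep(TRUE, ncol(Data))
  last <- 1                                   # most recent column still kept
  for (j in seq_len(ncol(Data))[-1]) {
    if (cor(Data[, last], Data[, j]) >= ccMin) {
      if (Auc[j] > Auc[last]) {               # the new column is more useful
        keep[last] <- FALSE
        last <- j
      } else {
        keep[j] <- FALSE
      }
    } else {
      last <- j
    }
  }
  which(keep)                                 # indexes of the columns to keep
}

# Made-up data: columns 1-2 are near-duplicates, as are columns 3-4.
set.seed(7)
a <- rnorm(100)
b <- rnorm(100)
Data <- cbind(a, a + rnorm(100, sd = 0.01), b, b + rnorm(100, sd = 0.01))
Auc  <- c(0.6, 0.9, 0.7, 0.5)
kept <- remove_correlated_neighbours(Data, Auc)
```

Comparing only neighbors keeps the cost linear in the number of features, which matters for spectra with thousands of adjacent, strongly related M/Z bins.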
msc.features.scale
Arguments
xtrain    A matrix or data frame with train data. Rows contain samples and columns contain features/variables.
xtest     A matrix or data frame with test data. Rows contain samples and columns contain features/variables.
type      Following types are recognized
Details
Many classification algorithms perform better if input data is scaled beforehand. Some of them perform scaling internally (for example svm), but many don’t. For some it makes no difference (for example rpart or LogitBoost). In case xtrain contains NA values or infinities all non-finite numbers are omitted from scaling parameter calculations.
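A minimal sketch of train-derived scaling follows. The three type names are taken from the ScaleType argument shown elsewhere in this manual, but the formulas attached to them here (min and range, mean and standard deviation, median and MAD) are assumptions about what those names mean, and scale_sketch is a hypothetical function rather than msc.features.scale itself.

```r
# Sketch: estimate scaling parameters on the training data only,
# ignoring non-finite values, then apply them to both data sets.
scale_sketch <- function(xtrain, xtest,
                         type = c("min-max", "avr-std", "med-mad")) {
  type   <- match.arg(type)
  xtrain <- as.matrix(xtrain)
  xtest  <- as.matrix(xtest)
  # Column-wise statistic computed from finite training values only.
  fin <- function(f) apply(xtrain, 2, function(v) f(v[is.finite(v)]))
  centre <- switch(type,
                   "min-max" = fin(min),
                   "avr-std" = fin(mean),
                   "med-mad" = fin(median))
  spread <- switch(type,
                   "min-max" = fin(max) - fin(min),
                   "avr-std" = fin(sd),
                   "med-mad" = fin(mad))
  apply_scale <- function(x) sweep(sweep(x, 2, centre, "-"), 2, spread, "/")
  list(xtrain = apply_scale(xtrain), xtest = apply_scale(xtest))
}

xtr <- iris[seq(1, 150, by = 2), 1:4]
xte <- iris[seq(2, 150, by = 2), 1:4]
s1 <- scale_sketch(xtr, xte, "min-max")
s2 <- scale_sketch(xtr, xte, "avr-std")

# A missing value in the training data does not disturb the parameters.
xna <- xtr
xna[1, 1] <- NA
s3 <- scale_sketch(xna, xte, "avr-std")
```

Taking the parameters from xtrain and reusing them on xtest keeps the test data from influencing preprocessing, so test values may legitimately fall outside the range seen in training.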
Value
xtrain    A matrix or data frame with scaled train data.
xtest     A matrix or data frame with scaled test data.
Author(s)
Jarek Tuszynski (SAIC)
See Also
Used by msc.classifier.test and msc.features.select functions.
Examples
library(e1071)
data(iris)
mask = sample.split(iris[,5], SplitRatio=1/4) # very few points to train
xtrain = iris[ mask,-5] # use output of sample.split to ...
xtest = iris[!mask,-5] # create train and test subsets
ytrain = iris[ mask, 5]
ytest = iris[!mask, 5]
# classify using scaled data
x = msc.features.scale(xtrain, xtest)
model = svm(x$xtrain, ytrain, scale=FALSE)
print(a <- table(predict(model, x$xtest), ytest) )
# classify using unscaled data
model = svm(xtrain, ytrain, scale=FALSE)
print(b <- table(predict(model, xtest), ytest) )
# stopifnot( sum(diag(a)) ...   # remainder of this check is truncated in the source
msc.features.select Reduce Number of Features Prior to Classification
Description
Select subset of individual features that are potentially most useful for classification.
Usage
msc.features.select( x, y, RemCorrCol=0.98, KeepCol=0.6)
Arguments
x             A matrix or data frame with training data. Rows contain samples and columns contain features/variables.
y             Class labels for the training data samples. A response vector with one label for each row/component of x. Can be either a factor, string or a numeric vector.
RemCorrCol    If non-zero, then some of the highly correlated columns are removed using the msc.features.remove function with ccMin=RemCorrCol.
KeepCol       If non-zero, then columns with low AUC are removed.
Details
This function reduces number of features in the data prior to classification, using following steps:
This function finds a subset of individual features that are potentially most useful for classification, and each feature is rated individually. However, a set of two or more very poor individual features can often produce a superior classifier, so this function should be used with care. I found it very useful when classifying raw protein mass spectra (SELDI) data, for reducing the dimensionality of the data from 10 000's to 100's prior to classification, instead of peak-finding (see msc.peaks.find).
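Rating each feature individually by AUC and discarding the weak ones can be sketched in base R for the two-class case. The package documentation refers to colAUC for this kind of measurement; the rank-based helper below is a hypothetical substitute written only so the example has no dependencies, and the threshold and data are illustrative.

```r
# Two-class AUC of a single feature from the rank-sum statistic,
# folded so that 0.5 means "no discrimination" and 1 means "perfect".
auc_two_class <- function(v, y) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  r  <- rank(v)                                  # ties receive average ranks
  first <- y == levels(y)[1]
  n1 <- sum(first)
  n2 <- length(y) - n1
  auc <- (sum(r[first]) - n1 * (n1 + 1) / 2) / (n1 * n2)
  max(auc, 1 - auc)
}

# Keep the columns whose individual AUC reaches the threshold.
select_by_auc <- function(x, y, KeepCol = 0.6) {
  auc <- apply(as.matrix(x), 2, auc_two_class, y = y)
  which(auc >= KeepCol)                          # column indexes to be kept
}

# Two iris species plus an uninformative column (made-up for the example).
x <- cbind(as.matrix(iris[51:150, 1:4]), noise = rep(c(1, 2), 50))
y <- droplevels(iris$Species[51:150])
kept <- select_by_auc(x, y, KeepCol = 0.8)
```

As the caveat above notes, a per-feature rating cannot see features that are useful only in combination, so a threshold like this is best treated as a coarse first filter.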
Value
Vector of column indexes to be kept.
msc.mass.adjust Perform Normalization and Mass Drift Adjustment
Arguments
X           Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store the M/Z mass of each row.
scalePar    Controls scaling (normalization): 1 means that afterwards all samples will have the same mean, 2 means that afterwards all samples will have the same mean and median (default).
shiftPar    Controls mass adjustment. Shifting a sample has to improve the correlation by at least this amount to be considered. Designed to prevent shifts based on an "improvement" on the order of magnitude of machine accuracy. If set too large, it will turn off shifting. Default = 0.0005.
AvrSamp     Used to normalize the test set the same way the train set was normalized. The test set is processed using the AvrSamp array that was one of the outputs from the train-set mass adjustment. See examples.
shiftX      Matrix [nSamp × nCopy] - integer number of positions a sample should be shifted to the right (+) or left (-). Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.
scaleY      Matrix [nSamp × nCopy] - multiply each sample in order to normalize it. Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.
shiftY      Matrix [nSamp × nCopy] - subtract this number from the scaled sample (if matching medians). Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.
Details
Mass adjustment assumes that SELDI data has some error associated with inaccuracy of setting the starting point of time measurement (x-axis origin or zero M/Z value). We try to correct this error by allowing the samples to shift a few time-steps to the left or to the right, if that will help with cross-correlation with other samples. The function performs the following steps
The msc.mass.adjust function was split into two parts (one to calculate the parameters and one to apply them) in order to give users more flexibility and information about what is done to the data. This split allows inspection, plotting and/or modification of the shiftX, shiftY and scaleY parameters before the data is modified. For example, one can set shiftX to zero to perform normalization without mass adjustment, or set shiftY to zero and scaleY to one to perform mass adjustment without normalization. The three functions provided are:
Value
Functions msc.mass.adjust and msc.mass.adjust.apply return modified spectra in the same format and size as X. Function msc.mass.adjust.calc returns a list containing the following:
shiftX     Matrix [nSamp × nCopy] - integer number of positions a sample should be shifted to the right (+) or left (-).
scaleY     Matrix [nSamp × nCopy] - multiply each sample in order to normalize it.
shiftY     Matrix [nSamp × nCopy] - subtract this number from the scaled sample (if matching medians).
AvrSamp    Use the AvrSamp returned from the train-set mass adjustment to process the test set.
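The role of the integer shift and of the shiftPar threshold can be illustrated with a short sketch that tries a few shifts of one sample against a reference spectrum. This is not the algorithm used by msc.mass.adjust.calc: the helper names, the search range maxShift, and the synthetic reference are assumptions made for the example.

```r
# Shift a vector by k positions: right for k > 0, left for k < 0,
# padding with the edge value. Hypothetical helper.
shift_vec <- function(v, k) {
  n <- length(v)
  if (k > 0) {
    c(rep(v[1], k), v[seq_len(n - k)])
  } else if (k < 0) {
    c(v[(1 - k):n], rep(v[n], -k))
  } else {
    v
  }
}

# Choose the shift that maximizes correlation with the reference, but accept
# it only if it beats "no shift" by at least shiftPar.
best_shift <- function(sample, reference, maxShift = 3, shiftPar = 0.0005) {
  shifts <- -maxShift:maxShift
  cc <- sapply(shifts, function(k) cor(shift_vec(sample, k), reference))
  best <- which.max(cc)
  if (cc[best] - cc[shifts == 0] >= shiftPar) shifts[best] else 0L
}

# Made-up reference with two peaks, and a copy displaced 2 steps to the right.
i   <- seq_len(200)
ref <- exp(-((i - 80) / 5)^2) + 0.5 * exp(-((i - 140) / 5)^2)
displaced <- shift_vec(ref, 2)

best_shift(displaced, ref)   # shift that realigns the displaced copy
best_shift(ref, ref)         # already aligned, so no shift is applied
```

The threshold matters in the second call: every candidate shift of an already aligned sample can only tie or lose against the unshifted one, so requiring a minimum gain keeps numerical noise from moving the data.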
Author(s)
Jarek Tuszynski (SAIC)
References
A description of a more elaborate algorithm for a similar purpose can be found in Lin S., Haney R., Campa M., Fitzgerald M., Patz E.; "Characterizing phase variations in MALDI-TOF data and correcting them by peak alignment"; Cancer Informatics 2005: 1(1) 32-
See Also
Examples
# load the "Data_IMAC.Rdata" file, creating it first if needed
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

# calculate the adjustment parameters, then apply them
out = msc.mass.adjust.calc (X)
Y = msc.mass.adjust.apply(X, out$ShiftX, out$ScaleY, out$ShiftY)
stopifnot( mean(out$ShiftX)==-0.15, abs(mean(out$ScaleY)-0.98)<0.01 )

# compare sample means before and after the adjustment
Z = cbind(colMeans(X), colMeans(Y))
colnames(Z) = c("copy 1 before", "copy 2 before", "copy 1 after", "copy 2 after" )
cat("Sample means before and after:\n")
Z