Package ‘caMassClass’

April 17, 2009

Version 1.6

Date Feb 01 2007

Title Processing & Classification of Protein Mass Spectra (SELDI) Data

Author Jarek Tuszynski

Maintainer Jarek Tuszynski

Depends R (>= 2.0.0), PROcess, e1071, nnet, rpart, caTools, XML, digest, MASS

Description Functions for processing and classification of protein mass spectra (SELDI) data. Also includes support for mzXML Files.

License The caMassClass Software License, Version 1.0 (See COPYING file or “http://ncicb.nci.nih.gov/download/camassclasslicense.jsp”)

Repository CRAN

Date/Publication 2007-02-01 20:35:11

R topics documented:

caMassClass-package
msc.baseline.subtract
msc.biomarkers.fill
msc.biomarkers.read.csv & msc.biomarkers.write.csv
msc.classifier.run
msc.classifier.test
msc.copies.merge
msc.features.remove
msc.features.scale
msc.features.select
msc.mass.adjust
msc.mass.cut
msc.peaks.align
msc.peaks.clust
msc.peaks.find
msc.peaks.read.csv & msc.peaks.write.csv
msc.peaks.read.mzXML & msc.peaks.write.mzXML
msc.preprocess.run
msc.project.read
msc.project.run
msc.rawMS.read.csv
msc.rawMS.read.mzXML & msc.rawMS.write.mzXML
msc.sample.correlation
read.mzXML & write.mzXML

caMassClass-package Processing and Classification of Protein Mass Spectra Data

Description

Functions for processing and classification of protein mass spectra data. Includes support for I/O in mzXML and CSV formats.

Details

Package: caMassClass
Version: 1.
Date: 2006-04-
Depends: R (>= 2.0.0), PROcess, e1071, nnet, rpart, caTools, XML, digest
License: The caMassClass Software License, Version 1.0 (See COPYING file or http://ncicb.nci.nih.gov/download/camassclasslicense.jsp)
URL: http://ncicb.nci.nih.gov/download/index.jsp

Index of Preprocessing Functions:

msc.project.read          Read and Manage a Batch of Protein Mass Spectra
msc.project.run           Read and Preprocess Protein Mass Spectra
msc.preprocess.run        Preprocessing Pipeline of Protein Mass Spectra
msc.baseline.subtract     Baseline Subtraction for Mass Spectra Data
msc.mass.cut              Remove Low Mass Portion of the Mass Spectra Data
msc.mass.adjust           Perform Normalization and Mass Drift Adjustment
msc.peaks.find            Find Peaks of Mass Spectra
msc.peaks.clust           Clusters Peaks of Mass Spectra (low level)
msc.peaks.align           Align Peaks of Mass Spectra into Biomarker Matrix
msc.biomarkers.fill       Fill Empty Spaces in Biomarker Matrix
msc.copies.merge          Merge Multiple Copies of Mass Spectra Samples
msc.sample.correlation    Sample Correlation (low level)

msc.baseline.subtract

Arguments

X     Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store M/Z mass of each row/feature.
...   Parameters to be passed to bslnoff function from PROcess library. See details for explanation of breaks, qntl, and bw. Boolean parameter plot can be used to plot results.

Details

Perform baseline subtraction for every sample in a batch of data, using bslnoff function from PROcess library. The bslnoff function splits spectrum into breaks number of exponentially growing regions. Baseline is calculated by applying quantile(...,probs=qntl) to each region and smoothing the results using loess(..., span=bw, degree=1) function.
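
The idea described above (a quantile per exponentially growing region, followed by loess smoothing) can be illustrated with a rough base-R sketch. This is not the bslnoff code: the helper name baseline_sketch, its default values, and the choice to smooth a per-point step function are all assumptions made for illustration, and the spectrum is synthetic.

```r
# Hypothetical helper for illustration; NOT the bslnoff implementation.
baseline_sketch <- function(intensity, breaks = 10, qntl = 0.1, bw = 0.3) {
  n   <- length(intensity)
  idx <- seq_len(n)
  # Region edges in index space; region widths grow roughly exponentially.
  edges  <- unique(round(exp(seq(0, log(n + 1), length.out = breaks + 1))))
  region <- findInterval(idx, edges)
  # Per-region quantile, repeated for every point of that region.
  step <- ave(intensity, region,
              FUN = function(v) unname(quantile(v, probs = qntl)))
  # Smooth the resulting step function with a local linear fit.
  fit      <- loess(step ~ idx, span = bw, degree = 1)
  baseline <- predict(fit)
  list(baseline = baseline, corrected = intensity - baseline)
}

# Synthetic spectrum: decaying baseline plus two peaks (made-up data).
mz_index  <- 1:400
intensity <- 50 * exp(-mz_index / 100) +
  20 * dnorm(mz_index, 150, 3) + 30 * dnorm(mz_index, 300, 4)
out <- baseline_sketch(intensity)
cat("mean before:", mean(intensity), " mean after:", mean(out$corrected), "\n")
```

A low qntl keeps the estimate below the peaks, while bw trades smoothness against how closely the baseline follows local dips.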

Value

Data in the same format and size as input variable X but with the subtracted baseline.

Author(s)

Jarek Tuszynski (SAIC)

See Also

Part of msc.preprocess.run and msc.project.run pipelines.
Previous step in the pipeline was msc.project.read and msc.rawMS.read.csv
Next step in the pipeline is msc.mass.cut
This function uses bslnoff (from PROcess library) which is a single-spectrum baseline removal function implemented using loess function.
Function rmBaseline (from PROcess library) can read all CSV files in directory and remove their baselines.

Examples

# load "Data_IMAC.Rdata" file containing raw MS spectra 'X'
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

# run msc.baseline.subtract using 3D input
# this data had baseline removed already so little change is expected
Y = msc.baseline.subtract(X)
avr = mean(abs(X-Y))
cat("Data Size: ", dim(X), " average change :", avr, "\n")
stopifnot(avr<0.3)

# test on data provided in PROcess package (2D input)
# this is "raw" data, so large changes are expected
directory = system.file("Test", package = "PROcess")
X = msc.rawMS.read.csv(directory)
Y = msc.baseline.subtract(X, plot=TRUE)
avr = mean(abs(X-Y))
cat("Data Size: ", dim(X), " average change :", avr, "\n")
stopifnot(avr>7.5)

msc.biomarkers.fill Fill Empty Spaces in Biomarker Matrix

Description

Fill empty spaces (NA’s) in biomarker matrix created by msc.peaks.align

Usage

msc.biomarkers.fill( X, Bmrks, BinBounds, FillType=0.9)

Arguments

X          Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store M/Z mass of each row.
Bmrks      biomarker matrix containing one sample per column and one biomarker per row
BinBounds  position (mass) of left-most and right-most peak in each bin
FillType   how to fill empty spaces in biomarker data?

if 0<=FillType<=1 then fill spaces with quantile(probs=FillType). For example: if FillType=1/2 then the median will be used, if FillType=1 then the maximum value will be used, if FillType=0.9 then the maximum will be used after discarding 10% of "outliers"
if FillType<0 then empty spaces will not be filled and NA’s will remain
if FillType==2 then the X value closest to the center of the bin will be used
if FillType==3 empty spaces will be set to zero
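
The FillType rules can be sketched in base R as follows. The helper fill_value is hypothetical and is not part of the package; it assumes the raw intensities and masses of the rows of X that fall inside one bin have already been extracted, and it assumes the bin center is the midpoint of the two bounds.

```r
# Hypothetical helper: value used to fill one empty cell, given the raw
# intensities `x` and masses `mz` of the rows of X that fall inside the bin.
fill_value <- function(x, mz, bounds, FillType = 0.9) {
  if (FillType < 0)  return(NA_real_)                      # leave the NA
  if (FillType <= 1) return(unname(quantile(x, probs = FillType, na.rm = TRUE)))
  if (FillType == 2) return(x[which.min(abs(mz - mean(bounds)))])  # bin center
  if (FillType == 3) return(0)
  stop("unsupported FillType")
}

x      <- c(1, 4, 2, 8, 3)            # made-up intensities inside one bin
mz     <- c(100, 101, 102, 103, 104)  # made-up masses of those rows
bounds <- c(100, 104)
fill_value(x, mz, bounds, 1/2)  # median of x
fill_value(x, mz, bounds, 1)    # maximum of x
fill_value(x, mz, bounds, 2)    # value at the mass closest to the bin center
```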

Details

This function attempts to correct a problem which is a side-effect of the msc.peaks.align function: numerous NA’s in biomarker data, which appear whenever a peak was found in only some of the samples. msc.peaks.align already removes the most problematic features using the SampFrac variable, but a lot of NA’s likely remain, and they can cause problems for some classification algorithms.

Value

Data in the same format and size as Bmrks

msc.biomarkers.read.csv & msc.biomarkers.write.csv

Usage

X = msc.biomarkers.read.csv(fname, mzXML.record=FALSE) msc.biomarkers.write.csv(X, fname)

Arguments

fname         either a character string naming a file or a connection.
X             biomarker data in form of a 2D matrix (nFeatures × nSamples) or 3D array (nFeatures × nSamples × nCopies). Notice that this data is in a format which is a transpose of the data in the CSV file.
mzXML.record  should an mzXML record be created to store meta-data (input file names)?

Value

Function msc.biomarkers.read.csv returns peak information data frame. See argument X above. If mzXML.record was set to TRUE then an mzXML record with input file names will be attached to X as the "mzXML" attribute. Function msc.biomarkers.write.csv does not return anything.

Author(s)

Jarek Tuszynski (SAIC)

See Also

msc.biomarkers.fill, msc.rawMS.read.csv

Examples

example("msc.peaks.align", verbose=FALSE) # create biomarkers data
X = Y$Bmrks # biomarkers data is stored in variable 'Y$Bmrks'
msc.biomarkers.write.csv(X, "biomarkers.csv")
Y = msc.biomarkers.read.csv("biomarkers.csv", mzXML.record=TRUE)
stopifnot( all(X==Y, na.rm=TRUE) )
mzXML = attr(Y,"mzXML")
strsplit(mzXML$parentFile, '\n') # show mzXML$parentFile record
file.remove("biomarkers.csv")

msc.classifier.run Train and Test Chosen Classifier.

Description

Common interface for training and testing several standard classifiers. Includes feature selection and feature scaling steps. Allows specifying that some test samples are multiple copies of the same sample and should receive the same label.

Usage

msc.classifier.run( xtrain, ytrain, xtest, ret.prob=FALSE, RemCorrCol=0, KeepCol=0, prior=1, same.sample=NULL, ScaleType=c("none", "min-max", "avr-std", "med-mad"), method=c("svm", "nnet", "lda", "qda", "LogitBoost", "rpart"), ...)

Arguments

xtrain       A matrix or data frame with training data. Rows contain samples and columns contain features/variables
ytrain       Class labels for the training data samples. A response vector with one label for each row/component of xtrain. Can be either a factor, string or a numeric vector.
xtest        A matrix or data frame with test data. Rows contain samples and columns contain features/variables
ret.prob     if set to TRUE then the a posteriori probabilities for each class are returned as attribute called "probabilities".
same.sample  optional parameter which allows specifying that some (or all) test samples have multiple copies which should be used to predict a single label for all of them. Can be either a factor, string or a numeric vector, with unique values for different samples and identical values for copies of the same sample.
RemCorrCol   If non-zero then some of the highly correlated columns are removed using msc.features.remove function with ccMin=RemCorrCol.
KeepCol      If non-zero then columns with low AUC are removed.

if KeepCol smaller than 0.5 - do nothing
if KeepCol in between [0.5, 1] - keep columns with AUC bigger than KeepCol
if KeepCol bigger than one - keep top "KeepCol" number of columns

ScaleType    Optional parameter, if provided then the following types are recognized
"none" - no scaling is performed
"min-max" - data minimum is mapped to 0 and maximum is mapped to 1
"avr-std" - data is mapped to zero mean and unit variance
"med-mad" - data is mapped to zero median and unit mad (median absolute deviation)

prior        class weights. The following types are recognized
prior==1 - all samples in all classes have equal weight (default)
prior==2 - all classes have equal weight
prior is a vector - a named vector of weights for the different classes, used for asymmetric class sizes.

method       classifier to be used. The following ones are recognized (followed by some parameters that could be passed through ...):
"svm" - see svm from e1071 package. Possible parameters: cost, gamma
"nnet" - see nnet from nnet package. Possible parameters: size, decay, maxit

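One way to read the same.sample argument is as a grouping vector over test rows, where all rows of a group end up with one shared label. The sketch below uses a plain majority vote per group; that rule is an assumption for illustration (the package may combine copies differently), and vote_by_sample is a hypothetical helper name.

```r
# Hypothetical helper: give every copy of a sample the label predicted most
# often among its copies (ties go to the alphabetically first label).
vote_by_sample <- function(pred, same.sample) {
  pred   <- as.character(pred)
  winner <- tapply(pred, same.sample,
                   function(p) names(which.max(table(p))))
  as.vector(winner[as.character(same.sample)])
}

pred        <- c("a", "a", "b", "b", "b", "a")  # made-up per-copy predictions
same.sample <- c(1, 1, 1, 2, 2, 2)              # two samples, three copies each
vote_by_sample(pred, same.sample)
```
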
Examples

data(iris)
mask = sample.split(iris[,5], SplitRatio=1/4) # very few points to train
xtrain = iris[ mask,-5] # use output of sample.split to ...
xtest = iris[!mask,-5] # create train and test subsets
ytrain = iris[ mask, 5]
ytest = iris[!mask, 5]
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="svm") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="nnet") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="lda") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="qda") )
table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="LogitBoost") )

a=table(ytrain, msc.classifier.run(xtrain,ytrain,xtrain, method="LogitBoost") )
stopifnot( sum(diag(a))==length(ytrain) )

msc.classifier.test Test a Classifier through Cross-validation

Description

Test classifier through cross-validation. Common interface for cross-validation of several standard classifiers. Includes feature selection and feature scaling steps. Allows specifying that some test samples are multiple copies of the same sample and should receive the same label.

Usage

msc.classifier.test( X, Y, iters=50, SplitRatio=2/3, verbose=FALSE, RemCorrCol=0, KeepCol=0, prior=1, same.sample=NULL, ScaleType=c("none", "min-max", "avr-std", "med-mad"), method=c("svm", "nnet", "lda", "qda", "LogitBoost", "rpart"), ...)

Arguments

X           A matrix or data frame with training/testing data. Rows contain samples and columns contain features/variables
Y           Class labels for the training data samples. A response vector with one label for each row/component of X. Can be either a factor, string or a numeric vector. Labels with ’NA’ value signify test data-set.
iters       Number of iterations. Each iteration consists of splitting the data into train and test sets, performing the classification and storing results
SplitRatio  Splitting ratio used to divide available data during cross-validation:

if (0<=SplitRatio<1) then SplitRatio fraction of samples will be used for training and the rest for validation.
if (SplitRatio==1) leave-one-out cross-validation. All but one sample will be used for training, and validation will be done using a single sample per iteration.
if (SplitRatio>1) then SplitRatio number of samples will be used for training and the rest for validation.

RemCorrCol  See msc.classifier.run.
KeepCol     See msc.classifier.run.
ScaleType   See msc.classifier.run.
prior       See msc.classifier.run.
same.sample See msc.classifier.run.
method      See msc.classifier.run.
verbose     boolean flag turns debugging printouts on.
...         Additional parameters to be passed to classifiers. See method for suggestions.

Details

This function follows standard cross-validation steps:

Class labels Y are used to divide data X into train set (with known labels) and test set (labels set to NA and will be calculated)
For number of iterations repeat the following steps of cross-validation:
- split train data into temporary train and test sets using sample.split function from caTools package.
- train and test the chosen classifier using temporary train and test data sets and msc.classifier.run function
Calculate the overall performance of the classifier
Train the classifier using the whole train data set (all labeled samples)
Use this classifier to predict values of the whole test data set (all samples without label - NA.)
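
The repeated split-train-test loop above can be sketched in base R. Everything here is illustrative: n_train, nearest_centroid and cv_sketch are hypothetical names, the nearest-centroid rule merely stands in for msc.classifier.run, and a plain random split replaces the sample.split call from caTools.

```r
# Hypothetical helper: number of training samples implied by SplitRatio.
n_train <- function(SplitRatio, n) {
  if (SplitRatio < 1)  return(round(SplitRatio * n))  # fraction of the samples
  if (SplitRatio == 1) return(n - 1)                  # leave-one-out
  SplitRatio                                          # absolute number of samples
}

# Stand-in classifier (nearest class centroid), used only to keep the sketch
# self-contained.
nearest_centroid <- function(xtrain, ytrain, xtest) {
  classes <- sort(unique(as.character(ytrain)))
  centers <- do.call(rbind, lapply(classes, function(k)
    colMeans(xtrain[as.character(ytrain) == k, , drop = FALSE])))
  d <- sapply(seq_along(classes), function(k)
    rowSums(sweep(xtest, 2, centers[k, ])^2))
  d <- matrix(d, nrow = nrow(xtest))       # keep matrix shape for a single row
  classes[max.col(-d, ties.method = "first")]
}

cv_sketch <- function(X, Y, iters = 50, SplitRatio = 2/3) {
  X <- as.matrix(X); n <- nrow(X)
  Res <- numeric(iters)
  for (i in seq_len(iters)) {
    train  <- sample(n, n_train(SplitRatio, n))       # plain random split
    pred   <- nearest_centroid(X[train, , drop = FALSE], Y[train],
                               X[-train, , drop = FALSE])
    Res[i] <- mean(pred == as.character(Y[-train]))   # fraction correct
  }
  Res
}

set.seed(1)
Res <- cv_sketch(iris[, 1:4], iris[, 5], iters = 10)
mean(Res)   # average accuracy over the iterations
```

As with the Res value described below, the per-iteration accuracies are kept so that their spread, and not only their mean, can be inspected.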

Value

Y     Predicted class labels. If there were any unknown samples in input data, marked by NA’s in input Y, then output Y will only hold predictions for those samples, otherwise predictions will be made for all samples.
Res   Holds fraction of correct predictions during cross-validation for each iteration. mean(Res) will give you average accuracy.
Tabl  Contingency table of predictions showing all the input labels compared to output labels

Author(s)

Jarek Tuszynski (SAIC)

msc.copies.merge

Details

Quality of a sample is measured by calculating for each copy of each sample two variables: inner correlation (average correlation between multiple copies of the same sample) and outer correlation (average correlation between each sample and every other sample within the same copy). Inner correlation measures how similar copies are to each other and outer correlation measures how similar each copy is to everybody else. For example, in case of an experiment using SELDI technology to distinguish cancerous samples and non-cancerous samples, one can assume that most of the proteins present in both cancerous and non-cancerous samples will be the same. In that case one will expect high correlation between samples and even higher correlation between copies of the same sample. If mergeType/4 (mergeType %/% 4) is

0 - all copies are kept
1 - if inner correlation is smaller than outer correlation, or in other words, if a signature is more similar to other signatures than to other copies of the same signature, then there is some problem with that signature. In that case that bad signature can be replaced with the best copy of the signature.
2 - rate each copy of each sample using score=outer_correlation + inner_correlation measure. Delete worst copy.

Option 2 is more suitable in case of data with a lot of copies, when we can afford dropping one copy. Option 1 is designed to patch the most serious problems with the data. There are also four merging options, if mergeType mod 4 (mergeType %% 4) is

0 - no merging is done to the data and it is left as 3D array
1 - all copies are concatenated X = cbind(X[,,1], X[,,2], ..., X[,,nCopy]) so they seem as separate samples
2 - all copies are averaged X = (X[,,1] + X[,,2] + ... + X[,,nCopy])/nCopy
3 - all copies are first averaged and then concatenated with extra average copy X = cbind(X[,,1], X[,,2], ..., X[,,nCopy], Xavr)

In preparation for classification one can use multiple copies in several ways: option 2 above improves (one hopes) accuracy of each sample, while options 1 and 3 increase number of samples available during classification. So the choice is: do we want a lot of samples during classification or fewer, better samples? The best option of mergeType depends on kind of data.

0 if data has single copy.
1+2+4 will produce the largest number of samples since we will keep all the copies and an average of all the copies
2+8 will produce single most accurate sample from multiple copies (usually if more than 2 copies are present) since we will delete outliers before averaging all the copies
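
How a single mergeType number encodes both choices, and what the four merging options do to a 3D array, can be sketched in base R. The helpers decode_merge_type and merge_sketch are hypothetical; the sketch covers only the merging part (mergeType %% 4) and skips the correlation-based quality control entirely.

```r
# Split mergeType into its two parts (the names are illustrative).
decode_merge_type <- function(mergeType) {
  c(quality = mergeType %/% 4, merge = mergeType %% 4)
}

# Hypothetical helper covering only the four merging options.
merge_sketch <- function(X, mergeType) {
  stopifnot(length(dim(X)) == 3)
  # Column-major storage makes this equal to cbind(X[,,1], ..., X[,,nCopy]);
  # note that matrix() drops any dimnames.
  concat <- matrix(X, nrow = dim(X)[1])
  Xavr   <- apply(X, c(1, 2), mean)      # average over copies
  switch(mergeType %% 4 + 1, X, concat, Xavr, cbind(concat, Xavr))
}

X <- array(1:12, dim = c(2, 3, 2))       # 2 features, 3 samples, 2 copies
decode_merge_type(1 + 2 + 4)             # quality option 1, merging option 3
dim(merge_sketch(X, 1))                  # 2 x 6: copies side by side
dim(merge_sketch(X, 1 + 2 + 4))          # 2 x 9: copies plus their average
```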

Value

Return matrix containing features as rows and samples as columns, unless mergeType is 0,4, or 8 when no merging is done and data is returned in same or similar format as the input format.

Author(s)

Jarek Tuszynski (SAIC)

See Also

Part of msc.preprocess.run and msc.project.run pipelines.
Previous step in the pipeline was msc.mass.adjust or peak finding functions: msc.peaks.find, msc.peaks.align, and msc.biomarkers.fill
Next step in the pipeline is data classification msc.classifier.test
Uses msc.sample.correlation

Examples

# load "Data_IMAC.Rdata" file containing raw MS spectra 'X'
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

# run msc.copies.merge
Y = msc.copies.merge(X, 1+2+4)
colnames(Y)
stopifnot( dim(Y)==c(11883,60) )

msc.features.remove Remove Highly Correlated Features

Description

Remove Highly Correlated Features. The function checks neighbor features looking for highly correlated ones and removes one of them. Used in order to drop dimensionality of the data.

Usage

msc.features.remove(Data, Auc, ccMin=0.9, verbose=FALSE)

Arguments

Data     Data containing one sample per row and one feature per column.
Auc      A measure of usefulness of each column/feature, used to choose which one of two highly correlated columns to remove. Usually a measure of discrimination power of each feature as measured by colAUC, Student's t-test or other method. See details.
ccMin    Minimum correlation coefficient of "highly correlated" columns.
verbose  Boolean flag turns debugging printouts on.
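
The roles of Data, Auc and ccMin can be illustrated with a small base-R sketch. It is a guess at the general idea, not the package algorithm: the single left-to-right pass, the use of the absolute correlation, and the helper name remove_correlated_neighbors are all assumptions.

```r
# Hypothetical helper: walk over neighboring columns and, whenever two are
# correlated at ccMin or above, drop the one with the lower Auc.
remove_correlated_neighbors <- function(Data, Auc, ccMin = 0.9) {
  keep <- rep(TRUE, ncol(Data))
  last <- 1                                # most recent column still kept
  for (j in seq_len(ncol(Data))[-1]) {
    if (abs(cor(Data[, last], Data[, j])) >= ccMin) {
      if (Auc[j] > Auc[last]) {
        keep[last] <- FALSE                # newcomer is more useful
        last <- j
      } else {
        keep[j] <- FALSE
      }
    } else {
      last <- j
    }
  }
  which(keep)                              # indexes of the columns to keep
}

x1 <- 1:10
x2 <- 2 * x1 + rep(c(0.1, -0.1), 5)        # nearly a copy of x1
x3 <- c(5, 1, 8, 2, 9, 3, 7, 4, 10, 6)     # weakly related to x1 and x2
remove_correlated_neighbors(cbind(x1, x2, x3), Auc = c(0.6, 0.9, 0.7))
```

In this made-up example the first two columns are near-duplicates, so only the one with the higher Auc survives, along with the third column.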

msc.features.scale

Arguments

xtrain  A matrix or data frame with train data. Rows contain samples and columns contain features/variables
xtest   A matrix or data frame with test data. Rows contain samples and columns contain features/variables
type    Following types are recognized

"min-max" - data minimum is mapped to 0 and maximum is mapped to 1
"avr-std" - data is mapped to zero mean and unit variance
"med-mad" - data is mapped to zero median and unit mad (median absolute deviation)

Details

Many classification algorithms perform better if input data is scaled beforehand. Some of them perform scaling internally (for example svm), but many don’t. For some it makes no difference (for example rpart or LogitBoost). In case xtrain contains NA values or infinities all non-finite numbers are omitted from scaling parameter calculations.
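
A minimal base-R sketch of the three scaling types follows, assuming (as the xtrain/xtest pair suggests) that scaling parameters are estimated on the training data only and then applied to both sets. The helper scale_sketch is hypothetical, and a constant training column would lead to division by zero here.

```r
# Hypothetical helper illustrating the three scaling types.
scale_sketch <- function(xtrain, xtest,
                         type = c("min-max", "avr-std", "med-mad")) {
  type   <- match.arg(type)
  xtrain <- as.matrix(xtrain)
  xtest  <- as.matrix(xtest)
  finite <- function(v) v[is.finite(v)]   # drop NA, NaN, Inf before estimating
  center <- apply(xtrain, 2, function(v) switch(type,
    "min-max" = min(finite(v)),
    "avr-std" = mean(finite(v)),
    "med-mad" = median(finite(v))))
  spread <- apply(xtrain, 2, function(v) switch(type,
    "min-max" = diff(range(finite(v))),
    "avr-std" = sd(finite(v)),
    "med-mad" = mad(finite(v))))          # mad() includes the 1.4826 constant
  rescale <- function(x) sweep(sweep(x, 2, center, "-"), 2, spread, "/")
  list(xtrain = rescale(xtrain), xtest = rescale(xtest))
}

x <- scale_sketch(iris[1:100, 1:4], iris[101:150, 1:4], type = "min-max")
range(x$xtrain)   # training data spans 0 to 1 by construction
range(x$xtest)    # test data may fall outside that interval
```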

Value

xtrain  A matrix or data frame with scaled train data.
xtest   A matrix or data frame with scaled test data.

Author(s)

Jarek Tuszynski (SAIC)

See Also

Used by msc.classifier.test and msc.features.select functions.

Examples

library(e1071)
data(iris)
mask = sample.split(iris[,5], SplitRatio=1/4) # very few points to train
xtrain = iris[ mask,-5] # use output of sample.split to ...
xtest = iris[!mask,-5] # create train and test subsets
ytrain = iris[ mask, 5]
ytest = iris[!mask, 5]
x = msc.features.scale(xtrain, xtest)
model = svm(x$xtrain, ytrain, scale=FALSE)
print(a <- table(predict(model, x$xtest), ytest) )
model = svm(xtrain, ytrain, scale=FALSE)
print(b <- table(predict(model, xtest), ytest) )
stopifnot( sum(diag(a))

msc.features.select Reduce Number of Features Prior to Classification

Description

Select subset of individual features that are potentially most useful for classification.

Usage

msc.features.select( x, y, RemCorrCol=0.98, KeepCol=0.6)

Arguments

x           A matrix or data frame with training data. Rows contain samples and columns contain features/variables
y           Class labels for the training data samples. A response vector with one label for each row/component of x. Can be either a factor, string or a numeric vector.
RemCorrCol  If non-zero then some of the highly correlated columns are removed using msc.features.remove function with ccMin=RemCorrCol.
KeepCol     If non-zero then columns with low AUC are removed.

if KeepCol smaller than 0.5 - do nothing
if KeepCol in between [0.5, 1] - keep columns with AUC bigger than KeepCol
if KeepCol bigger than one - keep top KeepCol number of columns

Details

This function reduces number of features in the data prior to classification, using following steps:

calculate AUC measure for each feature using colAUC
remove some of the highly correlated neighboring columns using msc.features.remove function.
remove columns with low AUC

This function finds a subset of individual features that are potentially most useful for classification, and each feature is rated individually. However, often a set of two or more very poor individual features can produce a superior classifier. So, this function should be used with care. I found it very useful when classifying raw protein mass spectra (SELDI) data, for reducing dimensionality of the data from 10 000’s to 100’s prior to classification, instead of peak-finding (see msc.peaks.find).
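
The per-feature rating and the three KeepCol rules can be sketched in base R for a two-class problem. The helpers feature_auc and keep_columns are hypothetical; the AUC is computed from the rank-sum identity and folded to be at least 0.5, which is an assumption about the convention implied by the KeepCol thresholds. The correlated-column removal step is left out.

```r
# AUC of one numeric feature for a two-class label, via the rank-sum identity.
# The value is folded so that it is always >= 0.5 (assumed convention).
feature_auc <- function(x, y) {
  y <- as.factor(y)
  stopifnot(nlevels(y) == 2)
  pos <- y == levels(y)[2]
  n1  <- sum(pos)
  n0  <- sum(!pos)
  auc <- (sum(rank(x)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(auc, 1 - auc)
}

# Hypothetical helper: apply the three KeepCol rules to a vector of AUC values.
keep_columns <- function(auc, KeepCol) {
  if (KeepCol < 0.5) return(seq_along(auc))              # do nothing
  if (KeepCol <= 1)  return(unname(which(auc > KeepCol)))  # AUC threshold
  head(order(auc, decreasing = TRUE), KeepCol)           # top KeepCol columns
}

x <- cbind(a = c(1, 2, 3, 4, 5, 6),    # separates the classes perfectly
           b = c(1, 6, 2, 5, 3, 4),    # weak feature
           c = c(6, 5, 4, 3, 2, 1))    # perfect, but reversed
y   <- c(0, 0, 0, 1, 1, 1)
auc <- apply(x, 2, feature_auc, y = y)
keep_columns(auc, 0.7)                 # columns whose AUC exceeds 0.7
```

Because each column is scored on its own, a pair of features that is only useful jointly would be discarded by this kind of filter, which is the caveat raised above.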

Value

Vector of column indexes to be kept.

msc.mass.adjust

Arguments

X         Spectrum data either in matrix format [nFeatures × nSamples] or in 3D array format [nFeatures × nSamples × nCopies]. Row names (rownames(X)) store M/Z mass of each row.
scalePar  Controls scaling (normalization): 1 means that afterwards all samples will have the same mean, 2 means that afterwards all samples will have the same mean and median (default)
shiftPar  Controls mass adjustment. Shifting a sample has to improve correlation by at least that amount to be considered. Designed to prevent shifts based on "improvement" on the order of magnitude of machine accuracy. If set too large, it will turn off shifting. Default = 0.0005.
AvrSamp   Is used to normalize test set the same way train set was normalized. Test set is processed using AvrSamp array that was one of the outputs from train-set mass-adjustment. See examples.
shiftX    matrix [nSamp × nCopy] - integer number of positions a sample should be shifted to the right (+) or left (-). Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.
scaleY    matrix [nSamp × nCopy] - multiply each sample in order to normalize it. Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.
shiftY    matrix [nSamp × nCopy] - subtract this number from scaled sample (if matching medians). Output from msc.mass.adjust.calc and input to msc.mass.adjust.apply.

Details

Mass adjustment assumes that SELDI data has some error associated with inaccuracy of setting the starting point of time measurement (x-axis origin or zero M/Z value). We try to correct this error by allowing the samples to shift a few time-steps to the left or to the right, if that will help with cross-correlation with other samples. The function performs the following steps

normalize all samples in such a way as to make their means (and optionally medians) the same
if multiple copies exist then
- align multiple copies of each sample to each other
- temporarily merge multiple copies of each sample to create a "super-sample" vector with more features
align each sample to the mean of all samples
recalculate mean of all samples and repeat above step
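
Two of the ingredients above, equalizing sample means and choosing an integer shift that improves correlation with a reference by more than shiftPar, can be sketched in base R. The helpers normalize_means, shift_vec and best_shift are hypothetical, the search window max_shift and the edge padding are assumptions, and the handling of copies and the repeated re-averaging are omitted.

```r
# Scale every sample (column) so that all column means equal the overall mean.
normalize_means <- function(X) sweep(X, 2, mean(X) / colMeans(X), "*")

# Shift a vector by s positions (s > 0: right, s < 0: left), padding with the
# edge value so that the length is unchanged.
shift_vec <- function(v, s) {
  n <- length(v)
  if (s > 0)      c(rep(v[1], s), v[seq_len(n - s)])
  else if (s < 0) c(v[seq(1 - s, n)], rep(v[n], -s))
  else            v
}

# Pick the shift with the largest correlation gain over no shift; accept it
# only if the gain exceeds shiftPar, otherwise return 0.
best_shift <- function(sample, ref, max_shift = 5, shiftPar = 0.0005) {
  base   <- cor(sample, ref)
  shifts <- -max_shift:max_shift
  gain   <- sapply(shifts, function(s) cor(shift_vec(sample, s), ref)) - base
  if (max(gain) > shiftPar) shifts[which.max(gain)] else 0L
}

ref     <- dnorm(1:100, mean = 50, sd = 5)   # made-up reference peak
drifted <- shift_vec(ref, 3)                 # same peak, 3 steps to the right
best_shift(drifted, ref)                     # -3: move 3 steps back to the left
best_shift(ref, ref)                         # 0: no shift improves on identity
```

The shiftPar threshold is what keeps the second call at zero: a shift is only accepted when it buys a real gain in correlation.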

msc.mass.adjust function was split into two parts (one to calculate parameters and one to apply them) in order to give users more flexibility and information about what is done to the data. This split allows inspection, plotting and/or modification of shiftX, shiftY, scaleY parameters before data is modified. For example one can set shiftX to zero to perform normalization without mass adjustment or set shiftY to zero and scaleY to one to perform mass adjustment without normalization. The three functions provided are:

msc.mass.adjust.calc - calculates and returns all the normalization and mass drift adjustment parameters
msc.mass.adjust.apply - performs normalization and mass drift adjustment using precalculated parameters
msc.mass.adjust - simple interface version of the above 2 functions

Value

Functions msc.mass.adjust and msc.mass.adjust.apply return modified spectra in the same format and size as X. Function msc.mass.adjust.calc returns a list containing the following:

shiftX   matrix [nSamp × nCopy] - integer number of positions sample should be shifted to the right (+) or left (-)
scaleY   matrix [nSamp × nCopy] - multiply each sample in order to normalize it
shiftY   matrix [nSamp × nCopy] - subtract this number from scaled sample (if matching medians)
AvrSamp  Use AvrSamp returned from train-set mass-adjustment to process test-set

Author(s)

Jarek Tuszynski (SAIC)

References

Description of more elaborate algorithm for similar purpose can be found in Lin S., Haney R., Campa M., Fitzgerald M., Patz E.; "Characterizing phase variations in MALDI-TOF data and correcting them by peak alignment"; Cancer Informatics 2005: 1(1) 32-

See Also

Part of msc.preprocess.run and msc.project.run pipelines.
Previous step in the pipeline is msc.mass.cut
Next step in the pipeline is either msc.peaks.find or msc.copies.merge

Examples

# load "Data_IMAC.Rdata" file containing raw MS spectra 'X'
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")

# run on 3D input data using long syntax
out = msc.mass.adjust.calc (X)
Y = msc.mass.adjust.apply(X, out$ShiftX, out$ScaleY, out$ShiftY)
stopifnot( mean(out$ShiftX)==-0.15, abs(mean(out$ScaleY)-0.98)<0.01 )

# check what happened to means
Z = cbind(colMeans(X), colMeans(Y))
colnames(Z) = c("copy 1 before", "copy 2 before", "copy 1 after", "copy 2 after" )
cat("Sample means before and after:\n")
Z
