Gene Expression Analysis and Statistics: A Case Study
Bioinformatics: Problems and Solutions
Dr. John R. Stevens
Utah State University
CS 5890/6890
9 July 2007

Outline

• Background: ALL data
• Differential Expression
• Multiple Comparisons
• Report File
• Visual Summaries
• Annotation Test

Reference: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (2005), edited by Gentleman et al.

Here, focus on the main ideas

Microarray Technology: General Idea

• Use mRNA transcript abundance level as a measure of the level of “expression” for the corresponding gene
• Represent thousands of genes on an array
  • each spot on the array corresponds to a specific gene (a set of probes on the array represents a gene)
  • “apply” a prepared sample to the chip (array)
  • mRNA in the sample hybridizes to the array
  • fluoresces
  • look at spot intensities
  • intensity assumed proportional to mRNA transcript abundance level

Background: ALL data

• ALL = Acute Lymphoblastic Leukemia (Ritz lab)
• 12,625 genes
• 128 samples (arrays)
• phenotypic data on all 128 patients, including:
  • 95 B-cell cancer patients
  • 33 T-cell cancer patients

Reference: Chiaretti et al., Blood (2004) 103(7)

Array images and “quality”

(code chunk 1)

residual images from 2 of the 128 arrays

Boxplots and “Preprocessing”

(code chunk 2)

boxplots of log2 intensities from 20 of the 128 arrays
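The numbers behind such boxplots can be sketched in a few lines. The following is a minimal Python illustration, not the course's R code chunk: the function name `log2_five_number_summary` is hypothetical, and the intensities are simulated rather than taken from the ALL arrays.

```python
import numpy as np

def log2_five_number_summary(intensities):
    """Return min, Q1, median, Q3, max of log2 intensities for each array.

    `intensities` is a (probes x arrays) matrix of positive raw intensities.
    """
    log2_values = np.log2(np.asarray(intensities, dtype=float))
    # Rows of the result: 0th, 25th, 50th, 75th, 100th percentile.
    return np.percentile(log2_values, [0, 25, 50, 75, 100], axis=0)

# Simulated data (assumption, not the ALL arrays): 1000 probes on 4 arrays
# whose overall brightness differs, as raw arrays often do.
rng = np.random.default_rng(0)
array_shift = np.array([7.0, 7.5, 8.0, 7.2])
raw = 2.0 ** rng.normal(loc=array_shift, scale=1.0, size=(1000, 4))

summary = log2_five_number_summary(raw)
for j in range(summary.shape[1]):
    print(f"array {j}: median log2 intensity = {summary[2, j]:.2f}")
```

Arrays whose boxes sit at visibly different levels are one motivation for preprocessing all arrays together before comparing genes.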

Common preprocessing methods (each has strengths and weaknesses)

• MAS5
  • Affymetrix’s algorithm
• dChip
  • Li & Wong’s model (probe-level model)
• vsn
  • “Variance Stabilization”
• RMA
  • fits probe-level model to all arrays (assume probes have same “effect” on each array)
• GCRMA
  • RMA-based, with additional adjustment for position of G & C nucleotides in probe sequence

(more proposed all the time; active area of research)
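To make the RMA idea of a shared probe "effect" concrete, here is a minimal Python sketch of fitting an additive model, log2 intensity = array level + probe effect + residual, for one probe set by median polish. It covers only this summarization idea (no background correction or normalization), the function name `summarize_by_median_polish` is hypothetical, and the data are made up.

```python
import numpy as np

def summarize_by_median_polish(log2_pm, max_iter=10, tol=1e-8):
    """Fit log2_pm[i, j] = overall + probe[i] + array[j] + resid[i, j].

    `log2_pm` is a (probes x arrays) matrix for ONE probe set.
    Returns per-array expression values (overall + array) and residuals.
    """
    resid = np.array(log2_pm, dtype=float)
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    array = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep out row (probe) medians.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe += row_med
        shift = np.median(array)
        array -= shift
        overall += shift
        # Sweep out column (array) medians.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        array += col_med
        shift = np.median(probe)
        probe -= shift
        overall += shift
        if max(np.abs(row_med).max(), np.abs(col_med).max()) < tol:
            break
    return overall + array, resid

# Toy probe set (assumption): 5 probes, 3 arrays, one outlying probe value.
probe_effect = np.array([-1.0, 0.0, 0.5, 2.0, -0.5])
array_level = np.array([8.0, 8.5, 7.0])
log2_pm = probe_effect[:, None] + array_level[None, :]
log2_pm[3, 1] += 4.0  # a single aberrant spot

expression, resid = summarize_by_median_polish(log2_pm)
print("expression per array:", expression)
```

Because medians are used instead of means, a single aberrant probe value tends to end up in the residuals rather than distorting the array-level estimate.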

Each point represents a single gene’s expression level on one array after preprocessing all the arrays together. Notice similarities and differences.

(code chunk 3)


Simple / Naïve test of DE

• “Observe” gene expression levels under two conditions (after preprocessing)
• Calculate: average log fold change
  $Y_{ijk}$ = log expr. level of gene $k$ in replicate $j$ of “treatment” $i$
  $\bar{Y}_{ik}$ = ave. log expr. for gene $k$ in treatment $i$
  $LFC_k = \bar{Y}_{2k} - \bar{Y}_{1k}$ = ave. log fold change for gene $k$
• Make a cut-off: $R$
  Gene $k$ is “significant” if $LFC_k > R$
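The naïve rule can be sketched as follows. This is an illustrative Python fragment with made-up log2 expression values. The function name `naive_de_call`, the choice of treatment 2 minus treatment 1, and the comparison of the absolute LFC with the cut-off (so that down-regulated genes can also be flagged) are assumptions of this sketch.

```python
import numpy as np

def naive_de_call(y1, y2, cutoff):
    """Average log fold change per gene and a cut-off based call.

    y1, y2: (genes x replicates) log2 expression under treatments 1 and 2.
    Assumption: LFC is treatment 2 minus treatment 1, and |LFC| is compared
    with the cut-off R.
    """
    lfc = y2.mean(axis=1) - y1.mean(axis=1)
    return lfc, np.abs(lfc) > cutoff

# Made-up log2 expression values: 3 genes, 3 replicates per treatment.
y1 = np.array([[8.0, 8.2, 7.8],
               [6.0, 6.1, 5.9],
               [9.0, 5.0, 7.0]])
y2 = np.array([[8.1, 8.0, 7.9],
               [7.5, 7.6, 7.4],
               [9.5, 8.5, 9.0]])

lfc, flagged = naive_de_call(y1, y2, cutoff=1.0)
for k in range(len(lfc)):
    print(f"gene {k}: LFC = {lfc[k]:+.2f}, flagged = {bool(flagged[k])}")
```

Note that the third gene is flagged even though its replicates under treatment 1 are widely scattered; the rule never looks at that spread.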

What’s wrong with the naïve test?

• Summary interpretation okay:
  • LFC > 0 for “up-regulated” genes
  • LFC < 0 for “down-regulated” genes
• Ignores variability of estimate
  • cannot really test for “significance”
  • what if larger LFC have large variability?
    • then not necessarily significant

How to use this to “test” for DE?

• What is being tested?
  • Null: No change for gene $k$
• Under null, $t_k \sim t$ dist. with $n_k$ d.f. (expected distribution under repeat sampling)
• But what is needed to do this?
  • “Large” sample size
  • Estimate $\sigma_k$ = “pop. SD” for gene $k$
  • “parametric” assumption

(code chunk 3.5)
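A per-gene t statistic scales the log fold change by its estimated standard error. The sketch below is a minimal Python illustration using an equal-variance two-sample t statistic with $n_1 + n_2 - 2$ d.f. That particular form, the function name `pooled_t_statistic`, and the made-up data are assumptions; the course's own code chunk is in R and is not reproduced here.

```python
import numpy as np

def pooled_t_statistic(y1, y2):
    """Equal-variance two-sample t statistic for each gene (row).

    y1, y2: (genes x replicates) log2 expression under two treatments.
    Returns the t statistics and the degrees of freedom n1 + n2 - 2.
    """
    n1, n2 = y1.shape[1], y2.shape[1]
    diff = y2.mean(axis=1) - y1.mean(axis=1)
    var1 = y1.var(axis=1, ddof=1)
    var2 = y2.var(axis=1, ddof=1)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return diff / se, df

# Made-up log2 expression values: 2 genes, 3 replicates per treatment.
y1 = np.array([[6.0, 6.1, 5.9],    # gene A: tight replicates
               [9.0, 5.0, 7.0]])   # gene B: noisy replicates
y2 = np.array([[7.5, 7.6, 7.4],
               [9.5, 8.5, 9.0]])

t_stat, df = pooled_t_statistic(y1, y2)
lfc = y2.mean(axis=1) - y1.mean(axis=1)
for name, l, t in zip(["A", "B"], lfc, t_stat):
    print(f"gene {name}: LFC = {l:.2f}, t = {t:.2f} on {df} d.f.")
```

Gene B has the larger log fold change but the smaller t statistic, because its estimated variability is much larger.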

What if we don’t have a large enough sample size? (note: probably don’t)

Two main problems:

  1. Estimate $\sigma_k$ (especially for small sample size)
  2. Appropriate sampling distribution of test stat.

Basic solutions:

  1. To estimate $\sigma_k$: Pool information across genes
  2. For comparison:
     • use parametric assumption on “improved” test stat.
     • use non-parametric methods
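As one example of a non-parametric comparison, the following Python sketch runs an exact permutation test on the difference in means for a single gene. The function name `exact_permutation_pvalue` and the data are hypothetical, and a permutation test is only one of several non-parametric options.

```python
from itertools import combinations

import numpy as np

def exact_permutation_pvalue(y1, y2):
    """Two-sided exact permutation p-value for a difference in means.

    Enumerates every way of relabeling the pooled values into groups of
    the original sizes and counts relabelings at least as extreme as the
    observed difference.
    """
    pooled = np.concatenate([y1, y2])
    n1 = len(y1)
    observed = abs(np.mean(y2) - np.mean(y1))
    extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n1):
        in_group1 = np.zeros(len(pooled), dtype=bool)
        in_group1[list(idx)] = True
        diff = abs(pooled[~in_group1].mean() - pooled[in_group1].mean())
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Made-up log2 expression for one gene, 3 replicates per treatment.
y1 = np.array([6.0, 6.1, 5.9])
y2 = np.array([7.5, 7.6, 7.4])
print("permutation p-value:", exact_permutation_pvalue(y1, y2))
```

With three replicates per group there are only 20 possible relabelings, so even perfectly separated groups cannot give a two-sided p-value below 2/20 = 0.1. This is one way the small-sample problem shows up for non-parametric methods.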

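Assumptions in linear model (Smyth)

Obtain estimates $\hat{\beta}_{kw}$ and $s_k^2$.

For covariate $w$ and resid. d.f. $d_k = n - p$:
(k not necessary here)

$$\mathrm{Var}[\hat{\beta}_{kw}] = V_w \sigma_k^2$$

$$\hat{\beta}_{kw} \mid \beta_{kw}, \sigma_k^2 \sim N(\beta_{kw},\, V_w \sigma_k^2)$$

$$s_k^2 \mid \sigma_k^2 \sim \frac{\sigma_k^2}{d_k}\, \chi^2_{d_k}$$

Then

$$t_{kw} = \frac{\hat{\beta}_{kw}}{s_k \sqrt{V_w}} \sim t_{d_k}$$

This is a traditional statistical linear regression model.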

Hierarchical model to borrow information across genes (Smyth): eBayes

Assume prior distribution

$$\frac{1}{\sigma_k^2} \sim \frac{1}{d_0 s_0^2}\, \chi^2_{d_0}$$

($d_0$ and $s_0^2$ estimated from data, using all of the genes, by empirical Bayes methods)

Consider the posterior mean of $1/\sigma_k^2$ given $s_k^2$:

$$E\!\left[\frac{1}{\sigma_k^2} \,\Big|\, s_k^2\right] = \frac{1}{\tilde{s}_k^2}, \qquad \tilde{s}_k^2 = \frac{d_0 s_0^2 + d_k s_k^2}{d_0 + d_k}$$

Then the “moderated” t statistic

$$\tilde{t}_{kw} = \frac{\hat{\beta}_{kw}}{\tilde{s}_k \sqrt{V_w}} \sim t_{d_0 + d_k}$$

($d_0$ represents added information from using all genes)
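The shrinkage step can be sketched numerically. In the minimal Python illustration below, the prior values `d0` and `s0_sq` are supplied by hand as hypothetical numbers; in practice they would be estimated from all genes by empirical Bayes (for example by limma's eBayes in R), which this sketch does not attempt. The function name `moderated_t` and the per-gene inputs are also made up.

```python
import numpy as np

def moderated_t(beta_hat, s_sq, d, v, d0, s0_sq):
    """Moderated t statistic from per-gene estimates and prior values.

    beta_hat: per-gene coefficient estimates (e.g. log fold changes)
    s_sq:     per-gene residual variance estimates on d d.f.
    v:        unscaled variance factor, so Var(beta_hat) = v * sigma^2
    d0, s0_sq: prior d.f. and prior variance (assumed known here)
    """
    s_sq_tilde = (d0 * s0_sq + d * s_sq) / (d0 + d)  # shrunken variance
    t_mod = beta_hat / np.sqrt(s_sq_tilde * v)
    return t_mod, s_sq_tilde, d0 + d

# Made-up per-gene summaries for 3 genes, 3 vs 3 replicates.
beta_hat = np.array([0.0, 1.5, 2.0])
s_sq = np.array([0.025, 0.01, 2.125])
d = 4                      # residual d.f. = 3 + 3 - 2
v = 1.0 / 3 + 1.0 / 3      # variance factor for a two-group difference

t_mod, s_sq_tilde, df_total = moderated_t(beta_hat, s_sq, d, v,
                                          d0=4.0, s0_sq=0.05)
t_ord = beta_hat / np.sqrt(s_sq * v)
for k in range(3):
    print(f"gene {k}: ordinary t = {t_ord[k]:.2f}, "
          f"moderated t = {t_mod[k]:.2f} on {df_total:.0f} d.f.")
```

Each shrunken variance lies between the gene's own estimate and the prior value, so very small variance estimates no longer produce extreme t statistics on their own; setting `d0 = 0` recovers the ordinary t statistic.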