





































A case study on gene expression analysis and statistics using bioinformatics. It covers the background of gene expression analysis with microarray technology, data preprocessing, differential expression testing, and multiple comparison adjustment, and also includes visual summaries and an annotation test. The case study uses acute lymphoblastic leukemia (ALL) data with 12,625 genes and 128 samples, and focuses on the main ideas of the analysis.






































Outline
Background – ALL data
Differential Expression
Multiple Comparisons
Report File
Visual Summaries
Annotation Test
Reference: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (2005), edited by Gentleman et al.
Here, focus on the main ideas
Microarray Technology: General Idea
Use mRNA transcript abundance level as a measure of the level of “expression” for the corresponding gene
Represent thousands of genes on an array
each spot on the array corresponds to a specific gene (a set of probes on the array represents a gene)
“apply” a prepared sample to the chip (array)
mRNA in the sample hybridizes to the array
fluoresces
look at spot intensities
intensity assumed proportional to mRNA transcript abundance level
Background: ALL data
Acute Lymphoblastic Leukemia (Ritz lab)
12,625 genes
128 samples (arrays)
phenotypic data on all 128 patients, including:
95 B-cell cancer patients
33 T-cell cancer patients
Reference: Chiaretti et al., Blood (2004) 103(7)
Array images and “quality”
(code chunk 1)
residual images from 2 of the 128 arrays
Boxplots and “Preprocessing”
(code chunk 2)
boxplots of log2 intensities from 20 of the 128 arrays
Common preprocessing methods (each has strengths and weaknesses):
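A boxplot of log2 intensities is built from a five-number summary per array. The sketch below shows that computation in Python rather than the R/Bioconductor code the slides refer to; the intensity values and the function name `log2_five_number_summary` are made up for illustration.

```python
import math
import statistics

def log2_five_number_summary(intensities):
    """Return (min, Q1, median, Q3, max) of the log2-transformed intensities."""
    logged = sorted(math.log2(x) for x in intensities)
    q1, med, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return (logged[0], q1, med, q3, logged[-1])

# Hypothetical raw probe intensities for two arrays (illustrative values only).
arrays = {
    "array_1": [64, 128, 256, 512, 1024],
    "array_2": [128, 256, 512, 1024, 2048],
}

summaries = {name: log2_five_number_summary(vals) for name, vals in arrays.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Arrays whose boxes sit at visibly different levels, as these two do, are the kind of difference that preprocessing is meant to remove before genes are compared.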
MAS5
dChip
vsn
RMA
GCRMA
(more proposed all the time – active area of research)
Each point represents a single gene’s expression level on one array after preprocessing all the arrays together. Notice similarities and differences.
(code chunk 3)
Simple / Naïve test of DE
“Observe” gene expression levels under two conditions (after preprocessing)
Calculate: average log fold change
$Y_{ijk}$ = expr. level of gene $k$ in replicate $j$ of “treatment” $i$
$\bar{Y}_{i \cdot k}$ = ave. log expr. for gene $k$ in treatment $i$
$LFC_k = \bar{Y}_{1 \cdot k} - \bar{Y}_{2 \cdot k}$ = ave. log fold change for gene $k$
Make a cut-off: $R$
Gene $k$ is “significant” if $LFC_k > R$
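The naïve rule can be sketched in a few lines: average the log expression values within each condition, subtract, and compare with the cut-off R. The gene names, expression values, and R = 1 below are hypothetical. Taking the absolute value so that both up- and down-regulated genes are flagged is an assumption of this sketch, not something the slide states.

```python
from statistics import mean

def log_fold_change(group1, group2):
    """Average log2 expression in condition 1 minus the average in condition 2."""
    return mean(group1) - mean(group2)

# Hypothetical log2 expression values: gene -> (condition 1 replicates, condition 2 replicates).
expr = {
    "gene_a": ([9.0, 9.5, 8.5], [7.0, 7.5, 6.5]),
    "gene_b": ([6.0, 6.5, 5.5], [6.0, 6.0, 6.0]),
    "gene_c": ([5.0, 5.5, 4.5], [6.5, 7.0, 6.0]),
}
R = 1.0  # assumed cut-off on the log2 scale (a 2-fold change)

lfc = {gene: log_fold_change(y1, y2) for gene, (y1, y2) in expr.items()}

# Assumption: use |LFC| so that changes in either direction are flagged.
flagged = sorted(gene for gene, value in lfc.items() if abs(value) > R)

print(lfc)
print(flagged)
```

The rule uses only the group means, so the spread of the replicates plays no part in which genes are flagged.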
What’s wrong with the naïve test?
Summary interpretation okay:
LFC > 0 for “up-regulated” genes
LFC < 0 for “down-regulated” genes
Ignores variability of estimate
cannot really test for “significance”
what if larger LFC have large variability?
How to use this to “test” for DE?
What is being tested?
Null: No change for gene k
Under the null, $t_k \sim t$ dist. with $n$ d.f.
(expected distribution under repeat sampling)
But what is needed to do this?
“Large” sample size
Estimate $\sigma_k$ = “pop. SD” for gene $k$
“parametric” assumption
(code chunk 3.5)
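A per-gene t statistic scales the log fold change by its estimated standard error, so variability enters the test. The slides' exact statistic is not shown, so the sketch below assumes the classical pooled-variance two-sample t with n1 + n2 - 2 degrees of freedom. The data are hypothetical: two genes with the same LFC of 1 but very different replicate variability.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group1, group2):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled estimate of the gene's variance across both conditions.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    std_err = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(group1) - mean(group2)) / std_err, df

# Hypothetical log2 expression values; both genes have LFC = 1.
tight = pooled_t([8.9, 9.0, 9.1], [7.9, 8.0, 8.1])    # low variability
noisy = pooled_t([7.0, 9.0, 11.0], [6.0, 8.0, 10.0])  # high variability

print(tight)
print(noisy)
```

With only three replicates per group the variance estimate rests on 4 degrees of freedom, which is why the sample-size requirement matters.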
What if we don’t have a large enough sample size? (note: probably don’t)
Two main problems:
Estimate $\sigma_k$ (especially for small sample size)
Basic solutions:
$\sigma_k$: pool information across genes
use parametric assumption on “improved” test stat.
use non-parametric methods
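One non-parametric option is a permutation test, which replaces the assumed t distribution with the distribution of the statistic over all relabelings of the samples. The slides do not say which non-parametric method they have in mind, so this is only one possible illustration, with hypothetical data and the difference in means as the test statistic.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group1, group2):
    """Two-sided exact permutation p-value for the difference in group means."""
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    extreme = 0
    total = 0
    # Enumerate every way of assigning n1 of the pooled samples to group 1.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # A small tolerance guards against floating-point ties.
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical log2 expression values for one gene, 4 replicates per condition.
p = permutation_p_value([9.0, 9.5, 8.5, 9.2], [7.0, 7.5, 6.5, 7.2])
print(p)
```

With 4 replicates per condition there are only 70 relabelings, so the smallest attainable two-sided p-value is 2/70. Small sample sizes limit permutation tests too.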
(k not necessary here)
This is a traditional statistical linear regression model.
Hierarchical model to borrow information across genes (Smyth): eBayes
Assume prior distribution:
$$\frac{1}{\sigma_k^2} \sim \frac{1}{d_0 s_0^2}\,\chi^2_{d_0}$$
($d_0$ and $s_0^2$ estimated from data using empirical Bayes methods, using all of the genes)
Consider the posterior mean:
$$\tilde{\sigma}_k^2 = E\left[\sigma_k^2 \mid \hat{\sigma}_k^2\right] = \frac{d_0 s_0^2 + d_k \hat{\sigma}_k^2}{d_0 + d_k}$$
Then the “moderated” t statistic:
$$\tilde{t}_{wk} = \frac{\hat{\beta}_{wk}}{\tilde{\sigma}_k \sqrt{V_{ww}}} \sim t_{d_0 + d_k}$$
($d_0$ represents added information from using all genes)
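The moderated t statistic replaces a gene's own variance estimate with a posterior mean: a weighted average of the prior variance (weight d_0) and the gene's sample variance (weight d_k). It is then referred to a t distribution with d_0 + d_k degrees of freedom. The sketch below assumes the prior parameters are already known, and the values chosen for them are hypothetical. In practice they are estimated from all genes by empirical Bayes, which this example does not attempt.

```python
from math import sqrt

def posterior_variance(sigma2_hat_k, d_k, s2_0, d_0):
    """Posterior mean of the gene's variance: prior and sample variance, weighted by d.f."""
    return (d_0 * s2_0 + d_k * sigma2_hat_k) / (d_0 + d_k)

def moderated_t(beta_hat, sigma2_hat_k, d_k, v, s2_0, d_0):
    """Moderated t statistic and its degrees of freedom (d_0 + d_k)."""
    sigma2_tilde = posterior_variance(sigma2_hat_k, d_k, s2_0, d_0)
    return beta_hat / sqrt(sigma2_tilde * v), d_0 + d_k

# Assumed prior parameters (hypothetical; normally estimated from all genes).
s2_0, d_0 = 0.25, 4

# Hypothetical gene: effect estimate 1.0, very small sample variance, 3 replicates
# per group, so d_k = 4 and the unscaled variance factor is 1/3 + 1/3.
beta_hat, sigma2_hat_k, d_k, v = 1.0, 0.01, 4, 2.0 / 3.0

t_ordinary = beta_hat / sqrt(sigma2_hat_k * v)
t_mod, df_mod = moderated_t(beta_hat, sigma2_hat_k, d_k, v, s2_0, d_0)

print(t_ordinary, d_k)
print(t_mod, df_mod)
```

The gene's unusually small sample variance is pulled toward the prior value, so the moderated statistic is much less extreme than the ordinary one, while the degrees of freedom rise from 4 to 8.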