









Subject: Statistics, Professor: Ramon Diaz Uriarte, Degree: Biology, University: UAM
Type: Lecture notes










Dept. Biochemistry, Universidad Autónoma de Madrid
Instituto de Investigaciones Biomédicas “Alberto Sols” (UAM-CSIC)
Madrid, Spain
1 Data and graph basics
2 Study Cases
3 Files we will use
4 Reading data
  4.1 Spreadsheets and csv format
  4.2 Reading data in R commander
5 Exploring data
  5.1 Visualizing data
  5.2 Summarizing data
  5.3 Types of variables and representation of numerical variables
6 Managing data
  6.1 Stacking data and factors in R
  6.2 Subsetting and aggregating data
7 Summary
8 Exercises
In this lesson we will learn about the following concepts:
We will also learn how to perform basic data processing in R (R commander):
Disclaimer: This document is not, nor does the author pretend it to be, a substitute for the reference books cited in the syllabus. It is intended as a mere guide to the course contents. Moreover, it may (and surely does) contain errors.
Case 1. Suppose you are studying the role of the transcripts from the EFNA3 gene in cancer metastasis. As part of this project, you contact a collaborator who has developed a zebrafish model to assess the metastatic potential of cells. The experiment is quite straightforward: breast cancer cells expressing different transcripts from this gene (each cell line expresses one specific isoform) are labeled with a fluorescent dye and injected into the perivitelline space of several zebrafish embryos (see figure 2.1). Then, 2 days after injection, disseminated (i.e. metastatic) cells are visualized using confocal microscopy and the total number of disseminated cells per fish is recorded. Several fish are injected with each type of cell to determine the statistical significance of the results. After a few weeks your colleague sends you an email with the results (see figure 2.2).
Accompanying figure 2.2 is the following legend: “Systemic dissemination of cancer cells is enhanced by expression of different EFNA3 constructs: MDA-MB-231 cells infected with lentivirus designed to induce expression of EFNA3, NC1, NC1s, NC2, NC2s or empty control (pLOC) were labeled with DiI dye and approximately 100-500 cells were injected per embryo in the perivitelline space of two days old transgenic Tg(fli1a:EGFP) zebrafish embryos. A) Fluorescent micrographs of the posterior tail showing distal dissemination of tumor cells (red) and their close association with the blood vessels of the embryo (green) B) Graph depicting the number of disseminated tumor cells. n = 17-41. * p<0.05; ** p<0.01 *** p<0.001”.
You are absolutely delighted with these results, as they clearly show that cells expressing your gene of interest lead to an increased number of distant metastases. Moreover, the results are statistically significant. Superb! Now, on closer inspection, there are a few issues with this figure and its caption. Can you spot them? If not, ask yourself:
1. Bar graphs are probably the most common way of representing data in biomedical sciences but, is it the best way to represent this kind of data? Why?
2. Is any information missing from the figure legend? What do the bars represent? And the error bars? Why is this important?
3. How were the reported p-values obtained? Does it really matter?
Fig. 2.1: Zebrafish experiment design
Fig. 2.2: (Typical) Representation of experiment results
4.2 Reading data in R commander 4 READING DATA
You are done. If you open the file with any text editor you’ll see rows of numbers, each row corresponding to one in the original Excel file, delimited by commas. The comma-separated values keep the same order as the columns in the original file. Note that we can export data in this format using any delimiter instead of commas, which can be useful in some cases^4. In this regard, tabs are commonly used as delimiters in text files instead of commas^5.
Fig. 4.1: Reading data into R commander
(^3) Also, if you have several “sheets” in your Excel document, only the active one will be exported to csv.
(^4) For example, in Spain the comma is used as the decimal separator, so in this case it is not a good idea to use commas to delimit data values that contain decimals.
(^5) This variation of the csv format is sometimes referred to as “tab separated values” or “tsv”, but text files with the extension .csv often use tabs.
(^6) By default the name is set to “Dataset”.
(^7) If you are not sure about the delimiter that was used, just open the file with any text editor (or word processor) and look at its content (remember that csv files are just text files).
Fig. 4.2: R-code: Reading data from file
# Read the comma-delimited file; the first row holds the column names
ZF_data <- read.table("Zebrafish_data.csv", sep = ",", header = TRUE, na.strings = "NA", dec = ".")
Now the data has been loaded into the computer memory and is accessible to R commander, so we can view and edit it by clicking on the appropriate buttons (“View data set” and “Edit data set”, located right below the main menu of the R commander main window, see figure 6.4). Note that the name of the data set will appear in the “Active data set” area and you can see the data by clicking on “View data set” (see figure 6.4). Of course, we could have performed the same operation, that is, loading the data from “Zebrafish_data.csv” into memory, using R commands, as shown in figure 4.2. Note that, in this case, the data is stored in the variable ZF_data.
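The effect of the delimiter and the decimal mark on reading a file can be sketched as follows. This is only an illustration under stated assumptions: the table `toy` and the temporary files are made up for the example and are not part of the course files.

```r
# Write a tiny made-up table using three common conventions,
# then read each file back with matching sep/dec settings.
toy <- data.frame(pLoC = c(1.5, 2.0, NA), EFNA3 = c(7.25, 9.0, 12.5))

comma_file <- tempfile(fileext = ".csv")  # comma delimiter, dot decimals
semi_file  <- tempfile(fileext = ".csv")  # semicolon delimiter, comma decimals
tab_file   <- tempfile(fileext = ".tsv")  # tab delimiter, dot decimals

write.table(toy, comma_file, sep = ",",  dec = ".", row.names = FALSE, quote = FALSE)
write.table(toy, semi_file,  sep = ";",  dec = ",", row.names = FALSE, quote = FALSE)
write.table(toy, tab_file,   sep = "\t", dec = ".", row.names = FALSE, quote = FALSE)

from_comma <- read.table(comma_file, sep = ",",  dec = ".", header = TRUE, na.strings = "NA")
from_semi  <- read.table(semi_file,  sep = ";",  dec = ",", header = TRUE, na.strings = "NA")
from_tab   <- read.table(tab_file,   sep = "\t", dec = ".", header = TRUE, na.strings = "NA")

# The three data frames should describe the same table
str(from_comma)
```

If `sep` or `dec` does not match the file, R will typically read everything into a single column or treat the numbers as text, so it is worth checking the result with `str()`.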
Fig. 5.1: Histograms to visualize data distribution
Let’s take a peek at the data. Usually the best way to get familiar with the raw data is to plot it. Choose “Graphs” in the menu bar and look at all the graph options. For example, it is always useful to explore how data values are distributed. Histograms are a useful way to visualize the distribution of numerical data. Go to Graphs–>Histograms and select the group of data to be represented (pLoC, EFNA3,...). Then choose “Options” to decorate the graph with appropriate labels (see figure 5.1). The resulting histogram represents the number of animals (y axis) that present a given number of metastases (x axis). Please also try strip charts^8 from the “Graphs” menu (figure 5.1). In this plot each dot represents a single fish, so it shows the number of metastases recorded in each individual fish. If you repeat this process for the control (pLoC) and experimental groups (EFNA3, NC1,...), you’ll see that fish injected with cells expressing EFNA3 tend to have a higher number of distant metastases.
(^8) sometimes termed dot plots
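The same two plots can be produced from the R console. The following is a minimal sketch using simulated counts, since the course file is not reproduced here; the data frame `toy` and its values are assumptions made only for illustration.

```r
set.seed(1)
# Simulated numbers of disseminated cells per fish (NOT the course data)
toy <- data.frame(
  pLoC  = rpois(30, lambda = 3),
  EFNA3 = rpois(30, lambda = 8)
)

# Histogram: x axis = disseminated cells per fish, y axis = number of fish
h <- hist(toy$EFNA3,
          main = "EFNA3",
          xlab = "Disseminated cells per fish",
          ylab = "Number of fish")

# Strip chart: one dot per fish; jitter separates overlapping points
stripchart(toy, method = "jitter", vertical = TRUE, pch = 16,
           ylab = "Disseminated cells per fish")
```

`hist()` also returns the bin counts (here stored in `h`), which can be inspected if you want the numbers behind the bars.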
5 EXPLORING DATA 5.2 Summarizing data
Other summary statistics^11 commonly used as error bars are the SEM and confidence intervals (CI). It should be noted, however, that SEM and CI measure the confidence in the determination of the mean rather than merely describing the data spread. Thus, they are used in inferential statistics^12. We will see much more about SEM and CI during the course. Table 1 provides a summary of measures of center and spread, and Cumming et al. [2007] provides an excellent review of the use of error bars in experimental biology.
It is important to note that, although some of these statistics try to describe the same thing (e.g. center of the data distribution), their numerical value may not be the same (and usually is not the same).
Because figure symbols and error bars can represent many different things, they are meaningless, or even misleading, if the figure legend does not state what kind they are. Thus, always describe in the figure legend the meaning of all the symbols included in the figure. Specifically, when representing measures of center and/or spread/error you must always clearly state what they are.
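To make the difference between these error bars concrete, the sketch below computes SD, SEM and a 95% CI for one made-up group of values. The CI shown is the usual t-based interval, which assumes approximately normal data; the numbers are invented for the example.

```r
# Made-up measurements for a single group (NOT the course data)
x <- c(2, 4, 4, 5, 7, 9, 11)

n      <- length(x)
center <- mean(x)
sd_x   <- sd(x)                  # describes the spread of the data
sem    <- sd_x / sqrt(n)         # uncertainty in the estimate of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value
ci95   <- center + c(-1, 1) * t_crit * sem

round(c(mean = center, SD = sd_x, SEM = sem,
        CI_low = ci95[1], CI_high = ci95[2]), 2)
```

Note that SD stays roughly the same as more values are collected, whereas SEM and the width of the CI shrink as n grows. This is why a legend must say which one the error bars show.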
Fig. 5.3: Getting summary statistics
To see the summary statistics for the zebrafish dataset, go to “Summaries” under “Statistics” in the main menu and then select “Active data set” (figure 5.3, up to step 1).
For each group you will get the minimum and maximum values (that is, the range), the values of the 1st and 3rd quartiles (their difference is the interquartile range, IQR), the mean and the median (figure 5.4).
Alternatively, you can get a customized summary from the “Numerical summaries” option under “Summaries” (see 2a in figure 5.3). Make sure you select all the groups (2b in figure 5.3) and then mark all the statistics you want displayed (see 2c in figure 5.3), including the SEM (see 2d in figure 5.3). Note that you can choose the quantiles to be calculated^13 (figure 5.5).
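A customized summary of this kind can also be sketched in plain R. The example below uses a small made-up wide table (two columns of unequal length padded with NA, as in a spreadsheet); the column names mimic the zebrafish groups but the values are invented.

```r
# Made-up wide table; NA pads the shorter group (NOT the course data)
toy <- data.frame(
  pLoC = c(1, 2, 2, 3, 3, 4, 4, 5, NA, NA),
  NC1s = c(1, 2, 2, 3, 3, 4, 4, 25, 33, 41)
)

summary(toy)  # min, quartiles, median, mean and NA count per column

# Customized summary: pick the statistics and quantiles yourself
custom <- sapply(toy, function(v) {
  v <- v[!is.na(v)]              # drop the padding
  c(n      = length(v),
    mean   = mean(v),
    median = median(v),
    sd     = sd(v),
    sem    = sd(v) / sqrt(length(v)),
    IQR    = IQR(v),
    quantile(v, probs = c(0.1, 0.5, 0.9)))
})
round(custom, 2)
```

In this toy table the three large values in the second column pull its mean well above its median, while the row labelled 50% always coincides with the median.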
Please take a moment to explore the summary tables. Although the mean and median values within each group are different, they do not differ much except in the case of NC1s. The distribution of values in the NC1s group is right skewed (see figure 5.2) and, as a consequence, the mean is displaced towards higher values. Thus, the use of the mean in figure 2.2, together with the lack of direct information about data spread, misleads the reader into thinking that the number of disseminated cells in NC1s is far above the control (pLoC) and, in fact, higher than in any other group. However, a closer analysis reveals that the variability (data spread) in this group (see for example IQR or SD in figure 5.5) was much higher than in the rest of the groups and, in fact, the first and second quartiles are very similar between NC1s and pLoC, suggesting that the increased dissemination was observed only in some animals. In contrast, EFNA3 seems to induce a generalized increase in the number of distant metastases in all animals, albeit with different penetrance (see figure 5.2 and summary statistics in figure 5.5).
(^11) A summary statistic is a single number summarizing a large amount of data.
(^12) Inferential statistics makes inferences about populations using data drawn from the population.
(^13) Remember that quantile 0.5 corresponds to the 2nd quartile, which is the median.
Fig. 5.4: Data summaries
summary(ZF_data)
Fig. 5.5: Customized data summary
5.3 Types of variables and representation of numerical variables 5 EXPLORING DATA
As we have seen above, and in spite of being widely used (and abused) in cellular and molecular biology publications, bar plots are not the most convenient way to represent numerical variables. A much more appropriate graphical representation is the box plot, where the distribution of the data is clearly visible.
Box plot. To build a box plot, the numerical data are sorted in ascending order so that the first quartile (Q1, 25th percentile), the median (i.e. second quartile or 50th percentile) and the third quartile (Q3, 75th percentile) can be calculated. First, a dark line is drawn at the median. Next, a rectangle is drawn from Q1 to Q3; it represents the middle 50% of the data. Finally, error bars or “whiskers” are drawn out to the most extreme values that lie within 1.5 times the IQR of Q1 or Q3^a (i.e. the most extreme values contained within the interval [Q1 - 1.5·IQR, Q3 + 1.5·IQR]). Any value outside this interval is shown as a dot, which makes potential outliers easier to spot.
(^a) Actually this is just one common definition of whiskers (Tukey), but there are others. In the Spear definition, whiskers extend to the minimum and maximum values; in the Altman definition, whiskers extend to the 5th and 95th percentiles.
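The construction described above (Tukey whiskers) can be followed step by step in R. This is a minimal sketch on made-up values. Note that R's `boxplot()` computes the box edges as "hinges", which can differ slightly from the quartiles returned by `quantile()`, although for this toy vector both flag the same point.

```r
# Made-up values with one suspiciously large observation
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 30)

q   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median, Q3
iqr <- q[3] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

inside   <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)   # most extreme values within the fences
outliers <- x[x < lower_fence | x > upper_fence]

whiskers
outliers

boxplot(x, ylab = "Value")  # draws the box, whiskers and outlying dot
```

The values in `outliers` are only candidates for closer inspection: a dot in a box plot marks an unusual value, not necessarily an erroneous one.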
Now, let’s represent the zebrafish data using box plots^14 (at least the three groups in figure 5.2)^15. Compare the graphs with the numbers in the summary statistics (figures 5.4 and 5.5). As you’ll see, this representation is much more informative than the bar plot in figure 2.2.
Box plots are an extremely efficient way to represent numerical data, but unfortunately they are seldom used in biomedical publications. One reason is that standard spreadsheet tools, such as MS Excel, do not produce this kind of graph. To overcome this problem and promote the use of this kind
(^14) Box plots are also known as box-and-whisker plots.
(^15) To do it, just follow the steps shown in figure 5.1 but choose “Boxplot” in the “Graphs” dropdown menu.
6.2 Subsetting and aggregating data 6 MANAGING DATA
more disseminated cells in fish X than in Y or that the average number of disseminated cells per fish is 234). After stacking the data into a single column, the values from each condition appear as a contiguous set adjacent to the others. Thus, to keep track of where (which condition) each data value comes from, we add another column (another variable or attribute) that records the experimental condition. This second variable is categorical (it defines a qualitative, non-measurable attribute of each case). In R, categorical variables are called “factors”. As mentioned before, the possible values of a categorical variable (or factor) are called levels. In this case, the variable has six levels (“pLoC”, “EFNA3”, “NC1”, “NC1s”, “NC2” and “NC2s”).
Fig. 6.1: Stacking data
Stacking columns is very easy: just go to Data–>Active dataset–>Stack variables. Make sure you select all columns and provide a name for the new dataset (by default “StackedData”). Note that a new dataset is created and set as the active dataset (see figure 6.1). By clicking on “View dataset” you will see that “StackedData” has two columns: the first contains the stacked values of distant metastases and the second contains the conditions (see figure 6.1). Remember that this second column is a categorical variable or “factor”.
The resulting table is called a data matrix and it is the standard way to record and organize data. In a data matrix each row corresponds to a unique case and each column corresponds to a variable.
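The move from a wide spreadsheet layout to a data matrix can be sketched with base R's `stack()`. The table `wide`, the new column names and the values are assumptions made for this example only.

```r
# Wide layout: one column per condition, padded with NA (made-up values)
wide <- data.frame(
  pLoC  = c(1, 3, 2, NA),
  EFNA3 = c(8, 6, 9, 7)
)

long <- stack(wide)                      # columns are named "values" and "ind"
names(long) <- c("metastases", "condition")
long <- long[!is.na(long$metastases), ]  # drop the padding rows

str(long)                # one numerical variable and one factor
levels(long$condition)   # the levels of the categorical variable
table(long$condition)    # number of cases (fish) per condition
```

Each row of `long` is now one case (one fish), and the factor column records which condition it belongs to.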
Now that we have our data organized in a data matrix, we can plot all experimental conditions at once. For example, generate a “stripchart” or “boxplot” from the “StackedData” data matrix. Note that there is a single (numerical) variable to plot (called “variable”); select “Plot by factor” and choose the factor (in this case there is a single categorical variable, i.e. a single factor, termed “factor”). With this information the software will partition the numerical data according to the different levels of the categorical variable and draw a box plot for each subset. The R code that stacks the columns and generates a box plot of the distant metastases in each experimental condition is shown in figure 6.2.
Compare this graph with the one in figure 2.2. In particular, note that the box plot is a much more informative representation of the values in the NC1s group.
Exercise. Generate a strip chart of the zebrafish data and compare it to figure 6.2. Which type of representation is more informative?^18
Let’s focus now on the second case (see figure 2.3). Save the data from the file “qPCR_data.xls” as a comma separated text file (“qPCR_data.csv”) and load it into R commander. This data set contains the results of all four experiments (figure 2.3) arranged in three columns: Experiment, Treatment and EFNA3_level_AU. Note that this is the “correct” data arrangement.
(^18) In this case, both types of representation could be used. Strip charts are in general preferred over box plots when a small number of numerical values is represented. If we have, for example, only 3 values, it would not make much sense to represent quartiles. On the other hand, when a large number of values is represented, box plots tend to provide a much cleaner visualization of the data distribution.
Fig. 6.2: R-code: Stacking data
# Stack the six condition columns into one column of values plus a factor
StackedData <- stack(ZF_data[, c("EFNA3", "NC1", "NC1s", "NC2", "NC2s", "pLoC")])
names(StackedData) <- c("variable", "factor")
# Draw one box per experimental condition
boxplot(variable ~ factor, ylab = "Number of distant metastases", data = StackedData)
Exercise. How many variables does this data matrix contain?^19 what type are they?^20 Indicate the levels of the categorical variables if any^21.
(^19) Three, one per column.
(^20) The first two columns (“Experiment” and “Treatment”) represent categorical variables (factors) and the last one, “EFNA3_level_AU”, is a numerical variable.
(^21) “Experiment” has four levels: “qpcr0809”, “qpcr0109”, “qpcr2508” and “qpcr2506”; “Treatment” has two levels: “vehicle” and “drugX”.
Fig. 6.5: R-code: Subsetting data
qPCRdata <- read.table("qPCR_data.csv", sep = ",", header = TRUE)
# Keep only the rows that belong to one experiment
FirstExp <- subset(qPCRdata, Experiment == "qpcr250614")
# Strip chart of the mRNA level for each treatment
with(FirstExp, stripchart(EFNA3_level_AU ~ Treatment, method = "jitter", vertical = TRUE,
                          ylab = "EFNA3 mRNA level (AU)", xlab = "Treatment"))
[Figure: strip chart of EFNA3 mRNA level (AU), y axis from 40 to 140, by Treatment (drugX, vehicle)]
The R code to produce this graph is shown in figure 6.5. This data looks nice: tightly packed replicates and a large difference between the two groups (much larger than the difference between values within each group). The difference between groups will definitely be significant (we’ll see how to calculate it in the next lesson).
mainly used when our focus is the shape of the distribution. Strip charts and box plots are very useful to compare two or more sub groups of a single numerical variable. Finally, when the number of values in each group is small, it doesn’t make much sense to represent quartiles, etc. Thus for groups of just a few values, as is our case here, we will preferentially use strip charts.
Fig. 6.6: Biological Replicates from ref. Vaux et al. [2012]
Now, what is the meaning of these replicates? How generalizable is this result? The replicates in this case are just independent measurements of the same RNA samples (those coming from experiment 2506). Thus their variance represents the technical variability of our RT-qPCR procedure (pipetting errors, non-uniform measurement of the PCR plate by the optical device,...), but it does not capture the biological variability because we measured RNA coming from a single sample. For example, if the basal level of the EFNA3 mRNA were variable (sensitive to culture conditions, culture age, cell confluence,...), this variability, rather than the treatment, could explain the observed difference. It could also happen that the response to treatment is non-uniform between individuals (remember that HUVEC cells are primary cells from a human donor, see figure 2.3) and, just by chance, we could have bumped into a donor whose endothelial cells show an extreme response to the drug. For all these reasons, we need several independent biological replicates (in addition to technical replicates) of the experiment. It is important to randomize all possible sources of variability across the set of independent experiments. The importance of replication is clearly and concisely explained in Vaux et al. [2012]; see also figure 6.6, extracted from it.
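One way to keep technical and biological replication apart is to aggregate the technical replicates first, so that each independent experiment contributes a single value per treatment. The sketch below uses `aggregate()` on a made-up table. The column names follow the qPCR code above, but the experiment labels (`expA`, `expB`) and all values are hypothetical.

```r
# Made-up qPCR table: 2 experiments x 2 treatments x 3 technical replicates
qpcr <- data.frame(
  Experiment     = rep(c("expA", "expB"), each = 6),
  Treatment      = rep(rep(c("vehicle", "drugX"), each = 3), times = 2),
  EFNA3_level_AU = c(50, 52, 51, 120, 118, 122,
                     60, 61, 59,  90,  92,  91)
)

# One value per experiment and treatment: the mean of its technical replicates
per_exp <- aggregate(EFNA3_level_AU ~ Experiment + Treatment,
                     data = qpcr, FUN = mean)
per_exp

# The biological "n" of each treatment is the number of independent experiments
table(per_exp$Treatment)
```

In this toy table each treatment has 6 measurements but only 2 independent experiments, so the "n" to report for the comparison between treatments would be 2, not 6.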
(b) perform a basic descriptive analysis of the raw data including summaries and graphs.
(a) Numerical variables
  i. Representation of a single numerical variable:
    A. Histogram. Commonly used to represent the shape of the distribution of a numerical variable.
    B. Box plot (box-and-whisker plot). Useful and efficient way of representing the center and spread of a set of values and of comparing subsets of numerical values corresponding to different groups.
    C. Strip chart (dot plot). Same use as box plots. It is preferred over box plots when a small number of data values is represented.
  ii. Representation of two numerical variables:
    A. Scatter plot (see lessons 2 and 3)
(b) Categorical variables
  i. Representation of a single categorical variable:
    A. Bar plot. Note that, although bar plots are frequently used in bioscience publications to represent numerical data, this kind of graph is intended to represent the number of cases in each level of a categorical variable. We’ll see much more about the representation of categorical variables in lesson 4.
    B. Pie chart (see lesson 4)
  ii. Representation of more than one categorical variable:
    A. Segmented bar plots (see lesson 4)
    B. Mosaic plots (see lesson 4)
Please complete the following exercises and answer the corresponding questions in the “Self-evaluation” test in the Moodle page of the course.
Exercise 1. You are measuring the volume of tumors in a set of mice treated with a new drug and writing the values down on a piece of paper. The file “tumor_vol.csv” contains the data. Now you measure a new tumor and its size is 0.051 mm^3. If by mistake you skip the first decimal digit and enter it as 0.51 instead of 0.051, how will the mean be affected? And the median? Import the data and edit it to generate two new data sets, one including the value 0.51 and another including the value 0.051. Calculate summary statistics and represent the two data sets using box plots and strip charts. In the case of the data set containing the incorrect measurement, which statistic (mean or median) best describes the data (i.e. is closer to the statistic calculated from the data set with the correct measurement)?
Exercise 2. Match each of the elements in a box plot with their meaning.
(^24) Although we have not seen many of these graphical representations yet, we include them in this summary for completeness of the explanation and for your reference.
Element | Definition
Dark line within box | Standard deviation
Dots | Most extreme value within 1.5 times the IQR from the mean
“Whiskers” | Mean
Length of the box | Most extreme value within 1.5 times the IQR from the median
 | 50th percentile
 | Any value farther than 1.5 times the IQR from the mean
 | Erroneous data values
Exercise 3. Could you match histograms A-C to box plots d1-d3 (figure 8.1)? How does the value of the mean compare to the median in each group? And between groups? How does the IQR compare between groups?
Fig. 8.1: Exercise 2 figure
Exercise 4. To analyze the effect of diet on the level of a lipid metabolite (lipid X), 6 mice were randomly divided into two groups and the animals in each group were fed regular chow or a high fat diet (HFD) for 10 weeks. Then the animals were euthanized and the level of the lipid metabolite was determined in four independent 1 mg samples from the liver and the gastrocnemius muscle of each mouse. The file “metabolite_diet.xls” contains the amount of lipid (in µg/mg tissue) in each sample. Reorder the data to generate a data matrix (do it manually by copying and pasting data sets) and save it as a comma-separated text file. How many rows does the resulting data matrix contain? What type is each variable? How many cases does it contain? Keeping in mind that with this experiment we wanted to know whether the HFD increases the content of lipid X in the liver (and/or the muscle), what is the “n” (independent biological replicates) of each treatment group? Make a plot of the results.
Geoff Cumming, Fiona Fidler, and David L Vaux. Error bars in experimental biology. The Journal of Cell Biology, 177(1):7–11, Apr 2007. ISSN 0021-9525. doi: 10.1083/jcb.200611141. URL http://jcb.rupress.org/cgi/content/long/177/1/7.
Michaela Spitzer, Jan Wildenhain, Juri Rappsilber, and Mike Tyers. BoxPlotR: a web tool for generation of box plots. Nature Methods, 11(2):121–2, Feb 2014. ISSN 1548-7105. doi: 10.1038/nmeth.2811. URL http://www.ncbi.nlm.nih.gov/pubmed/24481215.
David L Vaux, Fiona Fidler, and Geoff Cumming. Replicates and repeats–what is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Reports, 13(4):291–296, Apr 2012. ISSN 1469-3178. doi: 10.1038/embor.2012.36. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3321166&tool=pmcentrez&rendertype=abstract.