






An introduction to R, a free software package for statistical analysis, data visualization, and algebraic computation. It covers the basics of R syntax, data types (vectors and matrices), and functions for data manipulation and statistical analysis. The tutorial also includes examples and exercises.







R Tutorial (http://www.public.iastate.edu/~gdancik/stat430)

R is a freely available software package used for statistical analysis, data visualization, and algebraic (matrix) computation. It runs on Unix, Windows, and Mac operating systems. R is a command-based (functional) language with many built-in objects and functions. Users can also define their own objects and functions, and many specialized packages are available (http://cran.r-project.org/src/contrib/PACKAGES.html).

For more background, downloads, and a more thorough user manual, see http://cran.r-project.org/

Note: On certain platforms, R will not recognize opening and closing (curly) quotation marks (“ and ”), only the generic straight quotation marks (' and "). If any of the commands gives an error when copied and pasted into R, try typing the quotation marks manually into R, or use the text version of this file.

R can be used like a calculator

    5 + 9
    4 / 7 + (100 - 2) / 5
    sqrt(16)
    exp(8)

The assignment operator is the '=' sign; '<-' can also be used

    a = 3
    x = 4
    x**a    # same as x^a; returns x raised to the power a

The workspace is defined as all objects and user-defined functions in the current environment.
The command ls() returns the names of all elements in the environment.
The command rm(a, x) can be used to remove the two elements that we created above from the workspace.

Getting Help (?):

    ?ls
    ?matrix

Comments: The # sign is used to denote a comment (the same as in Perl).

Data types:

vectors: these are one-dimensional (1 row of numbers, characters, etc.)

    v = 1:
    v[2]         # returns the 2nd element of the vector
    length(v)    # returns the number of elements in the vector
    v = c('a', 'b', 'c')
    v = c(1, 2, 5)
    v = seq(1, 10, by = 2)
    v = rep(10, 6)

matrices: these are two-dimensional, with all elements of the same type

    v = 1:
    m = matrix(v, nrow = 3, ncol = 5, byrow = TRUE)    # creates a 3 (row) x 5 (column) matrix
    m = matrix(v, 3, byrow = TRUE)                     # does the same thing
    dim(m)       # returns the number of rows and columns of matrix m
    dim(m)[1]    # the number of rows
    dim(m)[2]    # the number of columns

We can access elements of the matrix m using m[rows, columns], where rows and columns are the rows and columns of interest:

    m[1:2, 2:3]     # returns rows 1 and 2 and columns 2 and 3
    m[rows, ]       # returns the specified rows (and all columns)
    m[, columns]    # returns the specified columns (and all rows)

Note: if only 1 row or column is specified, then a vector will be returned.

Can you change the element in the 3rd column and the 4th row to 0?

Matrix arithmetic

    m + 3    # adds 3 to each element of m
    m * 5    # multiplies all elements in m by 5
    m1 + m

Suppose we want to evaluate a function on all rows or columns of a matrix. We can easily do this with the apply function. The function mean() returns the mean value of the elements in the object passed into it.

    apply(m, 1, mean)    # returns the mean of each row
    apply(m, 2, mean)    # returns the mean of each column
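A short sketch of the indexing and apply() calls above on a concrete matrix. The 3 x 5 matrix built from 1:15 is an assumption chosen for illustration, and the element set to 0 (row 3, column 4) is likewise just an example of assigning into a matrix.

```r
# Assumed example matrix: the integers 1..15 filled row by row (3 rows, 5 columns)
m <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)

sub  <- m[1:2, 2:3]   # rows 1-2, columns 2-3: still a matrix
row2 <- m[2, ]        # a single row comes back as a plain vector

m2 <- m
m2[3, 4] <- 0         # assign to one element (row 3, column 4)

row.means <- apply(m, 1, mean)   # one mean per row
col.means <- apply(m, 2, mean)   # one mean per column

print(sub)
print(row.means)
print(col.means)
```

For means specifically, the base R functions rowMeans(m) and colMeans(m) give the same results as the two apply() calls.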
Data input

A list of commands in a file can be read using source(file.name)

    source('http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt')

Reading in a file

    data = read.table('http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt', sep = ',', header = TRUE)

data.frames

Data frames are objects that combine features (particularly element access methods) of matrices and lists.
The columns of 'data' are 'name', 'age', 'sex', 'height', and 'weight'. This can be determined using

    colnames(data)
    data$name
    data$age
    summary(data)

Suppose we want to change the heading of 'sex' to 'gender'. We can rename all of the columns using

    colnames(data) = new.names
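To rename only one column, one possible approach is to assign into the matching position of colnames(). The small data frame below is made up for illustration: it only mimics the column names of BigClass.txt, and the values are not taken from that file.

```r
# Hypothetical stand-in for the BigClass data (values invented for illustration)
data <- data.frame(
  name   = c("ANN", "BOB", "CAL"),
  age    = c(12, 15, 17),
  sex    = c("F", "M", "M"),
  height = c(59, 65, 70),
  weight = c(95, 120, 150)
)

# Replace just the 'sex' heading, leaving the other column names untouched
colnames(data)[colnames(data) == "sex"] <- "gender"

print(colnames(data))
```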
o data[index, ] - note that we need to include the ',' after 'index'. Why is this?
o The previous two steps may also be combined: data[data$age > 15, ]
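The two-step version and the combined version can be compared on a small made-up data frame (the values are assumptions, not the BigClass data). With the trailing comma, the logical vector selects rows; without it, R would treat the index as a column selector.

```r
# Hypothetical data (invented values)
data <- data.frame(name = c("ANN", "BOB", "CAL", "DEE"),
                   age  = c(12, 16, 17, 15))

index <- data$age > 15              # step 1: logical vector, one entry per row
older <- data[index, ]              # step 2: keep the rows where index is TRUE

combined <- data[data$age > 15, ]   # both steps in one expression

print(index)
print(older)
```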
We calculate a p-value using

$$p = \frac{\#\{\,T_i \ge T^*\,\}}{n}$$

where T* is the observed test statistic. This approach uses the classical definition of probability (theorem 1.3.4 of the notes), where A = {T_i ≥ T*}, Ω = {all possible ways of assigning 22 individuals to be male and 18 individuals to be female from 40 individuals}, and

$$P(A) = \frac{|A|}{|\Omega|}$$

Most of the time, it is not practical to calculate T_i for all permutations of the data. However, we can make use of the empirical definition of probability, that is

$$P(A) = \lim_{n \to \infty} \frac{n_A}{n}$$

where n_A is the number of times A occurs in n trials. The text refers to the use of empirical probabilities to carry out permutation tests as approximate randomization. Let's do this in R:

    K = 5000                                   ## the number of samples to obtain
    heights = rbind(matrix(hm), matrix(hf))
    T.perm = matrix(0, K)                      ## matrix of test statistics (not named T, which is R's shorthand for TRUE)
    t.star = abs(mean(hm) - mean(hf))
    for (i in 1:K) {
      heights = sample(heights)                ## randomly permute the order of heights
      xm = mean(heights[1:22])
      xf = mean(heights[23:40])
      T.perm[i] = abs(xm - xf)
    }
    hist(T.perm, main = "distribution of test statistic")
    abline(v = t.star, col = "red")
    p = sum(T.perm >= t.star) / length(T.perm)
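Since hm and hf are not defined in this part of the tutorial, the sketch below simulates them (22 male and 18 female heights drawn from assumed normal distributions) so that the approximate randomization test can be run end to end. The function name perm.test and all simulated values are illustrative assumptions, not the tutorial's data.

```r
set.seed(1)                          # arbitrary seed, for reproducibility
hm <- rnorm(22, mean = 69, sd = 3)   # assumed male heights (simulated)
hf <- rnorm(18, mean = 64, sd = 3)   # assumed female heights (simulated)

# Approximate randomization test for a difference in group means
perm.test <- function(x, y, K = 5000) {
  pooled <- c(x, y)
  nx <- length(x)
  t.star <- abs(mean(x) - mean(y))          # observed test statistic
  stats <- numeric(K)
  for (i in seq_len(K)) {
    shuffled <- sample(pooled)              # random relabeling of the groups
    stats[i] <- abs(mean(shuffled[1:nx]) - mean(shuffled[-(1:nx)]))
  }
  list(t.star = t.star, p = sum(stats >= t.star) / K)
}

res <- perm.test(hm, hf)
print(res)
```

Wrapping the loop in a function keeps the permuted copies local, so the original hm and hf are never overwritten.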
Writing your own functions (and loops)

    f = function(x1, x2 = 0) {
      return(x1 + x2)
    }

The return statement above is optional: a function returns the value of the last expression it evaluates. We may also want a function that simply 'does something'. For example, suppose we have a matrix m, and we want to plot a line that corresponds to each row of the matrix.

    m = matrix(1:15, ncol = 5, byrow = TRUE)
    plotLines = function(m, ...) {
      lower = min(m)
      upper = max(m)
      for (i in 1:dim(m)[1]) {
        plot(m[i, ], ylim = c(lower, upper), type = 'l', ...)
        par(new = TRUE)    # draw the next plot on top of the current one
      }
    }

There are many options for graphical parameters; see '?plot' and '?par' for examples.

R also allows while loops:

    i =
    while (i < 10) {
      print(i)
      i = i + 1
    }

Within a loop you may use break or next statements (similar to last and next in Perl).
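A brief sketch of default arguments and of break and next inside a loop. The function sum.odd and its inputs are made up for illustration.

```r
# Default argument: x2 is 0 unless the caller supplies it
f <- function(x1, x2 = 0) {
  x1 + x2                        # the value of the last expression is returned
}

# Sum the odd numbers below a limit, using next and break
sum.odd <- function(limit) {
  total <- 0
  i <- 0
  while (TRUE) {
    i <- i + 1
    if (i >= limit) break        # leave the loop entirely
    if (i %% 2 == 0) next        # skip even numbers and start the next iteration
    total <- total + i
  }
  total
}

print(f(3))          # x2 defaults to 0
print(f(3, 4))
print(sum.odd(10))   # 1 + 3 + 5 + 7 + 9
```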
    # Let's visualize the standard normal density
    x = seq(-5, 5, by = 0.1)
    plot(x, dnorm(x), type = 'l')

    # We can generate 1000 observations from the standard normal distribution
    z = rnorm(1000)
    hist(z)

    # Let Z ~ N(0,1). Then
    pnorm(1.645)    # returns P(Z < 1.645)
    qnorm(.95)      # returns the value z for which P(Z < z) = 0.95
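pnorm() and qnorm() are inverses of each other, and a simulated sample gives an empirical check on both. This is a minimal sketch; the seed and the sample size are arbitrary choices.

```r
set.seed(42)                   # arbitrary seed, for reproducibility

z95  <- qnorm(0.95)            # the 95th percentile of N(0,1), about 1.645
back <- pnorm(z95)             # pnorm undoes qnorm

z <- rnorm(100000)             # simulated standard normal observations
empirical <- mean(z < z95)     # fraction of draws below z95, close to 0.95

print(c(z95 = z95, back = back, empirical = empirical))
```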
    flips = runif(100)
    flips[flips < 0.5] = 'H'
    flips[flips != 'H'] = 'T'
    flips = as.factor(flips)
    summary(flips)
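An alternative sketch: sample() can draw the labels directly, which avoids storing numbers and characters in the same vector. This is a different technique from the runif() approach above, and the seed is an arbitrary choice.

```r
set.seed(7)                                                # arbitrary seed
flips <- sample(c("H", "T"), size = 100, replace = TRUE)   # equal probabilities by default
counts <- table(flips)                                     # tally of each outcome
print(counts)
```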
    countHT.C = function(x) {
      numH = 0
      numT = 0
      x = toupper(x)    ## convert to all uppercase
      for (i in 1:length(x)) {
        if (x[i] == 'H') numH = numH + 1
        if (x[i] == 'T') numT = numT + 1
      }
      print(paste("Heads: ", numH))
      print(paste("Tails: ", numT))
    }
    countHT.R = function(x) {
      x = toupper(x)
      numH = length(x[x == 'H'])
      numT = length(x[x == 'T'])
      print(paste("Heads: ", numH))
      print(paste("Tails: ", numT))
    }
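Another common vectorized way to count matches is sum() on a logical vector, since TRUE counts as 1. The helper name count.flips and the input below are illustrative assumptions; unlike the two functions above, it returns the counts instead of printing them.

```r
# Return the counts as a named vector (hypothetical helper)
count.flips <- function(x) {
  x <- toupper(x)                                    # accept 'h' and 't' as well
  c(heads = sum(x == "H"), tails = sum(x == "T"))    # TRUE counts as 1
}

result <- count.flips(c("h", "T", "H", "t", "H"))
print(result)
```

Returning a value rather than printing makes the result easy to reuse, for example in result[["heads"]] / sum(result).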
Recall the Central Limit Theorem: If X_i is a random variable with mean μ and variance σ^2, the sample mean of a random sample of size n has the approximate distribution

$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

This is exactly true if X is normally distributed, and is approximately true for other distributions of X (provided the variance is finite), with the approximation being more accurate as n increases. Let's visualize this using R:
    CLT <- function(r = 5000, dsn = rnorm, ...) {
      par(mfrow = c(2, 2))
      hist(dsn(r, ...), xlab = "x", main = "original dsn")
      n.sizes = c(10, 30, 500)
      for (n in n.sizes) {
        m = matrix(0, r)
        for (i in 1:r) {
          m[i] = mean(dsn(n, ...))
        }
        title = paste("n = ", n)
        hist(m, main = title, xlab = "xbar")
      }
    }
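A numerical companion to the histograms, under the assumption that X is exponential with rate 1 (so μ = 1 and σ = 1): the standard deviation of the simulated sample means should be close to σ/√n. The helper name sim.means, the seed, and the choice of distribution are all illustrative.

```r
set.seed(123)                        # arbitrary seed

# Simulate r sample means, each computed from n draws of dsn (hypothetical helper)
sim.means <- function(n, r = 5000, dsn = rnorm, ...) {
  xbar <- numeric(r)
  for (i in seq_len(r)) {
    xbar[i] <- mean(dsn(n, ...))     # extra arguments are passed on to dsn
  }
  xbar
}

xbar10 <- sim.means(10, dsn = rexp, rate = 1)

for (n in c(10, 30, 500)) {
  xbar <- sim.means(n, dsn = rexp, rate = 1)
  cat("n =", n,
      " mean of xbar =", round(mean(xbar), 3),
      " sd of xbar =", round(sd(xbar), 3),
      " sigma/sqrt(n) =", round(1 / sqrt(n), 3), "\n")
}
```

With the CLT function above defined, a call such as CLT(dsn = rexp, rate = 1) draws the same comparison graphically.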