R Programming Tutorial: Data Analysis and Matrix Computation with R, Study Guides, Projects, Research of Statistics

An introduction to r, a free software package for statistical analysis, data visualization, and algebraic computation. It covers the basics of r syntax, data types (vectors and matrices), and functions for data manipulation and statistical analysis. The tutorial also includes examples and exercises.

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 09/02/2009

koofers-user-hqm
koofers-user-hqm 🇺🇸

10 documents

1 / 11

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
R Tutorial ( http://www.public.iastate.edu/~gdancik/stat430 )
R is a freely available software package used for statistical analysis, data visualization,
and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating
systems. R is a command-based (functional) language with many objects and functions
built-in. Users can also define their own objects and functions, and many specialized
packages are also available (http://cran.r-project.org/src/contrib/PACKAGES.html)
For more background, downloads, and a more thorough user-manual:
http://cran.r-project.org/
Note: On certain platforms, R will not recognize the opening and closing quotation marks
(“ and ”) found throughout this file, but will recognize the generic quotation marks (‘). If
any of the commands gives an error when copied and pasted into R, try typing in the
quotation marks manually into R, or using the text version of this file.
R can be used like a calculator
5 + 9
4 / 7 + (100-2) / 5
sqrt(16)
exp(8)
The assignment operator is the ‘=’ sign; ‘<-’ can also be used
a = 3
x = 4
x**a or x^a returns xa
The workspace is defined as all objects and user-defined functions in the current
environment. The command ls() returns a list of all elements in the environment
The command rm(a, x) can be used to remove the two elements from the workspace that
we created above
Getting Help (?):
?ls
?matrix
Comments:
The # sign is used to denote a comment (the same is in perl)
Data types:
vectors – these are 1 dimensional (1 row of numbers, characters, etc.)
v= 1:15
# type the name of the object, in this case v, to view it
pf3
pf4
pf5
pf8
pf9
pfa

Partial preview of the text

Download R Programming Tutorial: Data Analysis and Matrix Computation with R and more Study Guides, Projects, Research Statistics in PDF only on Docsity!

R Tutorial (http://www.public.iastate.edu/~gdancik/stat430) R is a freely available software package used for statistical analysis, data visualization, and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating systems. R is a command-based (functional) language with many objects and functions built-in. Users can also define their own objects and functions, and many specialized packages are also available (http://cran.r-project.org/src/contrib/PACKAGES.html) For more background, downloads, and a more thorough user-manual: http://cran.r-project.org/ Note: On certain platforms, R will not recognize the opening and closing quotation marks (“ and ”) found throughout this file, but will recognize the generic quotation marks (‘). If any of the commands gives an error when copied and pasted into R, try typing in the quotation marks manually into R, or using the text version of this file. R can be used like a calculator 5 + 9 4 / 7 + (100-2) / 5 sqrt(16) exp(8) The assignment operator is the ‘=’ sign; ‘<-’ can also be used a = 3 x = 4 x**a or x^a returns xa The workspace is defined as all objects and user-defined functions in the current environment. The command ls() returns a list of all elements in the environment The command rm(a, x) can be used to remove the two elements from the workspace that we created above Getting Help (?): ?ls ?matrix Comments: The # sign is used to denote a comment (the same is in perl) Data types: vectors – these are 1 dimensional (1 row of numbers, characters, etc.) v= 1:

type the name of the object, in this case v, to view it

v[2] #returns the 2nd^ element of the vector length(v) #returns the number of elements in the vector v = c(‘a’, ‘b’, ‘c’) v = c(1,2,5) v = seq(1,10,by=2) v = rep(10,6) matrices – these are two-dimensional, with all elements of the same type v = 1: m = matrix(v, nrow = 3,ncol = 5,byrow = T) #creates a 3 (row) x 5 (column) matrix m = matrix(v, 3, byrow = T) # does the same thing

can you create the matrix below:

1 1 1 2 2 2 3 3 3 4 4 4 dim(m) # returns the number of rows and columns of matrix m dim(m)[1] #the number of rows dim(m)[2] #the number of columns we can access elements of the matrix m using m[ rows , columns ], where rows and columns are the rows and columns of interest m[1:2,2:3] returns rows 1 and 2 and columns 2 and 3 m[ rows , ] returns the specified rows (and all columns) m[, columns ] returns the specified columns (and all rows) Note: if only 1 row or column is specified, then a vector will be returned Can you change the element in the 3rd^ column and the 4th^ row to 0? Matrix arithmetic m + 3 # adds 3 to each element of m m * 5 # multiplies all elements in m by 5

for 2 matrices m1 and m2 of equal dimension, add corresponding elements

m1 + m Suppose we want to evaluate a function on all rows or columns of a matrix. We can easily do this with the apply function The function mean () returns the mean value of the elements in the object passed into it. apply(m,1,mean) #returns the mean of each row apply(m,2,mean) #returns the mean of each column

Data input A list of commands in a file can be read using source( file.name ) source(‘http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt’) Reading in a file data = read.table(‘http://www.public.iastate.edu/~gdancik/summer2007/files/ BigClass.txt‘, sep = ‘,’, header = T) data.frames Data frames are objects that combine features (particularly element access methods) of matrices and lists The columns of ‘data’ are ‘name’, ‘age’, ‘sex’, ‘height’, and ‘weight’ This can be determined using colnames(data) data$name data$age summary(data) Suppose we want to change the heading of ‘sex’ to ‘gender’ We can rename all of the columns using colnames(data) = new.names

  • Can you create a vector of the column names we want?
  • Can you change ‘sex’ to ‘gender’?
  • Can you rename the column names? Alternately, we could have used colnames(data)[3] = ‘gender’ Another data type is the logical data type (TRUE or FALSE; or alternatively T and F) 5 > 3 5 > 9 Logical operators (e.g. to compare two numbers): >, <, >=, <=, ==, != v = 1: index = v > 5 # for each element of v, check if that element > 5 v[index] # returns the elements of v that are > 5
  • In the big class data set, retrieve a list of students greater than 15 years old o index = data$age > 15

o data[index,] - note that we need to include the ‘,’ after ‘index’. Why is this? o The previous two steps may also be combined: data[data$age > 15,]

  • Other examples: o data[data$gender == ‘M’,] o data[data$age == 12,] o data[data$age == 12]$height ## the heights of all 12 year olds o data[data$age == 12, 4] Relationship between two variables: To reduce typing, let x = data$height and y = data$weight plot(x,y, xlab = ‘height’, ylab = ‘weight’, main = ‘scatterplot of height and weight’) cor(x,y) returns the correlation between x and y Linear models: A linear model (for one input variable) has the form: y = b 0 + b 1 x , where y is referred to as the response variable and x is an input variable. fit1 = lm(y~x) # fits a linear model of the above form summary(fit1) Estimates of b 0 and b 1 are the first and second elements of fit1$coeff, respectively. These can also be found through summary(fit1) or simply by printing the fit object If we plug in our estimates of b 0 and b 1 into the original equation, we can predict a person’s weight (y) from their known height (x). Doing this for our known weights, we get a list of fitted values, fit1$fitted. plot(x,y, xlab = ‘height’, ylab = ‘weight’, main = ‘scatterplot of height and weight’) lines(x, fit1$fitted, col = ‘red’) Now let us consider the model y = b 0 + b 1 x 1 + b 2 x 2 , where y = data$weight x 1 = data$age x 2 = data$height

We calculate a p-value using n ofT s T p i^

'  *

 where T* is the observed test statistic. This approach uses the classical definition of probability (theoreom 1.3.4 of notes), where A = Ti ≥ T*, Ω = {all possible ways of assigning 22 individuals to be male and 18 all possible ways of assigning 22 individuals to be male and 18 individuals to be female from 40 individuals}, where   A P ( A ) Most of the time, it is not practical to calculate Ti for all permutations of the data. However, we can make use the empirical definition of probability, that is n A P ( A )lim n  n The text refers to the use of empirical probabilities to carry out permutation tests as an approximate randomization. Let’s do this in R: K = 5000 ## the number of samples to obtain heights = rbind(matrix(hm), matrix(hf)) T = matrix(0,K) ## matrix of test statistics t.star = abs(mean(hm) – mean(hf)) for (i in 1:K) {all possible ways of assigning 22 individuals to be male and 18 heights = sample(heights) ## randomly permute the order of heights xm = mean(heights[1:22]) xf = mean(heights[23:40]) T[i] = abs(xm – xf) } hist(T, main = “distribution of test statistic”) abline(v = t.star, col = “red”) p = sum(T >= t.star) / length(T)

Writing your own functions (and loops) f = function(x1, x2 = 0) {all possible ways of assigning 22 individuals to be male and 18 return (x1 + x2) } The return statement above is optional. We may want to have a function that simply ‘does something’. For example, suppose we have a matrix m, and we want to plot a line that corresponds to each row of the matrix. m = matrix(1:15,ncol=5,byrow=TRUE) plotLines = function(m, ...) {all possible ways of assigning 22 individuals to be male and 18

this is a comment

lower = min(m) upper = max(m) for (i in 1:dim(m)[1]) {all possible ways of assigning 22 individuals to be male and 18 plot(m[i,], ylim = c(lower, upper), type = ‘l’, ...) par(new=TRUE) } } There are many options for graphical parameters, see ‘?plot’ and ‘?par’, for examples. R also allows while loops: i = while (i < 10) {all possible ways of assigning 22 individuals to be male and 18 print(i) i = i + 1 } Within a loop you may use break or next statements, similar to Perl.

#Let’s visualize the standard normal density x = seq(-5,5, by=0.1) plot(x, dnorm(x), type = ‘l’) #We can generate 1000 observations from the standard normal distribution z = rnorm(1000) hist(z) #Let Z ~ N(0,1). Then pnorm(1.645) # returns P(Z < 1.645) qnorm(.95) # returns the value z, for which (P(Z < z) = 0.

Simulate flipping a coin 100 times, where P(H) = P(T) = ½

flips = runif(100) flips[ flips < 0.5 ] = ‘H’ flips[ flips != ‘H’] = ‘T’ flips = as.factor(flips) summary(flips)

suppose we did not know about the summary function, and had to write our

own function to count the number of heads / tails

a C-style function

countHT.C = function(x) {all possible ways of assigning 22 individuals to be male and 18 numH = 0 numT = 0 x = toupper(x) ## convert to all uppercase for (i in 1:length(x)) {all possible ways of assigning 22 individuals to be male and 18 if (x[i] == ‘H’) numH = numH + 1 if (x[i] == ‘T’) numT = numT+ 1 } print(paste(“Heads: ”, numH)) print(paste(“Tails: ”, numT)) }

an R style function

countHT.R = function(x) {all possible ways of assigning 22 individuals to be male and 18 x = toupper(x) numH = length( x[ x == ‘H’] ) numT = length( x[ x == ‘T’] ) print(paste(“Heads: ”, numH)) print(paste(“Tails: ”, numT)) }

Recall the Central Limit Theoreom: If Xi is a random variable with mean μ and variance σ^2 , the sample mean of a random sample of size n has the approximate distribution         n X (^) n N 2 ~ ,   This is exactly true if X is normally distributed, and is approximately true for all other distributions of X, with the approximation being more accurate as n increases. Let’s visualize this using R:

the dots (...) are used as additional arguments to dsn

CLT <- function(r = 5000, dsn = rnorm, ...) {

divide the window into 4 panels

par(mfrow = c(2,2)) hist(dsn(r, ...), xlab = "x", main = "original dsn") n.sizes = c(10, 30, 500) for (n in n.sizes) { m = matrix(0, r) for (i in 1:r) { m[i] = mean(dsn(n, ...)) } title = paste("n = ", n) hist(m, main = title, xlab = "xbar") } }