R Programming Tutorial: Data Analysis and Matrix Computation with R, Study Guides, Projects, Research of Statistics

Iowa State University (ISU)Statistics

An introduction to R, a free software package for statistical analysis, data visualization, and algebraic computation. It covers the basics of R syntax, data types (vectors and matrices), and functions for data manipulation and statistical analysis. The tutorial also includes examples and exercises.

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 09/02/2009

koofers-user-hqm 🇺🇸

R Tutorial (http://www.public.iastate.edu/~gdancik/stat430)

R is a freely available software package used for statistical analysis, data visualization, and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating systems. R is a command-based (functional) language with many built-in objects and functions. Users can also define their own objects and functions, and many specialized packages are also available (http://cran.r-project.org/src/contrib/PACKAGES.html).

For more background, downloads, and a more thorough user manual:
http://cran.r-project.org/

Note: On certain platforms, R will not recognize opening and closing (curly) quotation marks (“ and ”), which may appear in copies of this file; it recognizes only straight quotation marks (' and "). If any of the commands gives an error when copied and pasted into R, try retyping the quotation marks manually in R, or use the text version of this file.

R can be used like a calculator
5 + 9
4 / 7 + (100-2) / 5
sqrt(16)
exp(8)

The assignment operator is the '=' sign; '<-' can also be used
a = 3
x = 4
x**a or x^a returns x raised to the power a (here 4^3 = 64)

The workspace is defined as all objects and user-defined functions in the current environment. The command ls() returns the names of all elements in the environment.
The command rm(a, x) can be used to remove the two elements from the workspace that we created above.
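The steps above can be tried together in one short session. This is only a minimal sketch; the names power and a.still.there are introduced here for illustration and are not part of the tutorial.

```r
# Assignment: '=' and '<-' behave the same at the top level
a <- 3
x = 4
power <- x^a          # x**a gives the same result
print(power)

# List the workspace, then remove the two objects created above
print(ls())
rm(a, x)

# exists() reports whether a name is still defined in the current environment
a.still.there <- exists("a", inherits = FALSE)
print(a.still.there)
```

After rm(a, x), typing a at the prompt would give an "object not found" error, which is why the sketch checks with exists() instead.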

Getting Help (?):
?ls
?matrix

Comments:
The # sign is used to denote a comment (the same as in Perl)

Data types:
vectors – these are 1 dimensional (1 row of numbers, characters, etc.)
v = 1:15
# type the name of the object, in this case v, to view it


v[2] # returns the 2nd element of the vector
length(v) # returns the number of elements in the vector
v = c('a', 'b', 'c')
v = c(1,2,5)
v = seq(1,10,by=2)
v = rep(10,6)

matrices – these are two-dimensional, with all elements of the same type
v = 1:15
m = matrix(v, nrow = 3, ncol = 5, byrow = T) # creates a 3 (row) x 5 (column) matrix
m = matrix(v, 3, byrow = T) # does the same thing

# Can you create the matrix below?
  1 1 1
  2 2 2
  3 3 3
  4 4 4

dim(m) # returns the number of rows and columns of matrix m
dim(m)[1] # the number of rows
dim(m)[2] # the number of columns

We can access elements of the matrix m using m[rows, columns], where rows and columns are the rows and columns of interest
m[1:2,2:3] returns rows 1 and 2 and columns 2 and 3
m[rows, ] returns the specified rows (and all columns)
m[, columns] returns the specified columns (and all rows)
Note: if only 1 row or column is specified, then a vector will be returned

# Can you change the element in the 3rd column and the 4th row to 0?

Matrix arithmetic
m + 3 # adds 3 to each element of m
m * 5 # multiplies all elements in m by 5

# for 2 matrices m1 and m2 of equal dimension, add corresponding elements
m1 + m2

Suppose we want to evaluate a function on all rows or columns of a matrix. We can easily do this with the apply function.
The function mean() returns the mean value of the elements in the object passed into it.
apply(m,1,mean) # returns the mean of each row
apply(m,2,mean) # returns the mean of each column
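As a small illustration of the indexing rules and of apply, the sketch below uses a 3 x 5 matrix filled by row with 1 to 15. The object names (sub, row2, row2.mat, col.range) and the drop = FALSE variant are additions made here for illustration.

```r
m <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)

# Indexing: m[rows, columns]
sub      <- m[1:2, 2:3]            # 2 x 2 matrix (rows 1-2, columns 2-3)
row2     <- m[2, ]                 # a single row comes back as a plain vector
row2.mat <- m[2, , drop = FALSE]   # drop = FALSE keeps it a 1 x 5 matrix

# Elementwise arithmetic on two matrices of equal dimension
m1 <- m
m2 <- m * 5
total <- m1 + m2

# apply: MARGIN = 1 works over rows, MARGIN = 2 over columns
row.means <- apply(m, 1, mean)
col.means <- apply(m, 2, mean)
col.range <- apply(m, 2, function(col) max(col) - min(col))

print(row.means)
print(col.means)
```

Any function of a vector can be passed to apply, including an anonymous one as in col.range. For means specifically, the built-in rowMeans(m) and colMeans(m) give the same values.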

Data input
A list of commands in a file can be read using source(file.name)
source('http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt')

Reading in a file
data = read.table('http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt', sep = ',', header = T)

data.frames
Data frames are objects that combine features (particularly element access methods) of matrices and lists.
The columns of 'data' are 'name', 'age', 'sex', 'height', and 'weight'.
This can be determined using colnames(data)
data$name
data$age
summary(data)

Suppose we want to change the heading of 'sex' to 'gender'.
We can rename all of the columns using colnames(data) = new.names

# Can you create a vector of the column names we want?
# Can you change 'sex' to 'gender'?
# Can you rename the column names?
Alternately, we could have used colnames(data)[3] = 'gender'

Another data type is the logical data type (TRUE or FALSE; or alternatively T and F)
5 > 3
5 > 9
Comparison operators (e.g. to compare two numbers), which return logical values: >, <, >=, <=, ==, !=
v = 1:15 # e.g. the same vector as before
index = v > 5 # for each element of v, check if that element > 5
v[index] # returns the elements of v that are > 5

In the big class data set, retrieve a list of students older than 15
o index = data$age > 15
o data[index,] - note that we need to include the ',' after 'index'. Why is this?
o The previous two steps may also be combined: data[data$age > 15,]
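The role of the comma can be seen on a small data frame. The sketch below does not use the BigClass file; the four rows are invented stand-in values, and the names older and cols are introduced only for this example.

```r
# Hypothetical stand-in for the class data (values invented for illustration)
data <- data.frame(
  name   = c("Ann", "Bob", "Cy", "Dee"),
  age    = c(12, 16, 17, 12),
  height = c(59, 66, 68, 61),
  stringsAsFactors = FALSE
)

index <- data$age > 15        # one TRUE/FALSE per row
older <- data[index, ]        # with the comma: selects rows, keeps all columns

# Without a comma, a single index selects COLUMNS of a data frame
cols <- data[c(TRUE, FALSE, TRUE)]

print(older)
print(names(cols))
```

With the comma, the logical vector is matched against rows; without it, R treats the data frame like a list of columns, which is rarely what is wanted when filtering.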

Other examples:
o data[data$gender == 'M',]
o data[data$age == 12,]
o data[data$age == 12,]$height ## the heights of all 12 year olds
o data[data$age == 12, 4]

Relationship between two variables:
To reduce typing, let x = data$height and y = data$weight
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
cor(x,y) returns the correlation between x and y

Linear models:
A linear model (for one input variable) has the form y = b_0 + b_1 x, where y is referred to as the response variable and x is an input variable.
fit1 = lm(y~x) # fits a linear model of the above form
summary(fit1)
Estimates of b_0 and b_1 are the first and second elements of fit1$coeff, respectively. These can also be found through summary(fit1) or simply by printing the fit object.
If we plug our estimates of b_0 and b_1 into the original equation, we can predict a person's weight (y) from their known height (x). Doing this for our known heights, we get a list of fitted values, fit1$fitted.
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
lines(x, fit1$fitted, col = 'red')

Now let us consider the model y = b_0 + b_1 x_1 + b_2 x_2, where
y = data$weight
x_1 = data$age
x_2 = data$height
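One way a two-input model of this form might be fit is sketched below. It runs on simulated data rather than the class data set: the ages, heights, weights and the coefficients used to generate them are all invented for illustration, and the names fit2 and fitted.manual are assumptions of this sketch.

```r
set.seed(1)
n  <- 40
x1 <- sample(12:17, n, replace = TRUE)              # simulated ages
x2 <- 50 + 2 * (x1 - 12) + rnorm(n, sd = 3)         # simulated heights
y  <- -100 + 2 * x1 + 3 * x2 + rnorm(n, sd = 5)     # simulated weights

# Additional input variables are added to the right-hand side with '+'
fit2 <- lm(y ~ x1 + x2)
b <- coef(fit2)            # estimates of b_0, b_1, b_2, in that order
print(b)

# Plugging the estimates back into the equation reproduces the fitted values
fitted.manual <- b[1] + b[2] * x1 + b[3] * x2
```

As in the one-variable case, summary(fit2) reports the same estimates together with their standard errors.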

We calculate a p-value using

\[ p = \frac{\#\{\, i : T_i \ge T^* \,\}}{n} \]

where T* is the observed test statistic, T_i is the test statistic for the i-th permutation of the data, and n is the number of permutations. This approach uses the classical definition of probability (theorem 1.3.4 of notes), where A = {T_i ≥ T*}, Ω = {all possible ways of assigning 22 individuals to be male and 18 individuals to be female from 40 individuals}, and

\[ P(A) = \frac{|A|}{|\Omega|} \]

Most of the time, it is not practical to calculate T_i for all permutations of the data. However, we can make use of the empirical definition of probability, that is

\[ P(A) = \lim_{n \to \infty} \frac{n_A}{n} \]

where n_A is the number of times A occurs in n trials. The text refers to the use of empirical probabilities to carry out permutation tests as an approximate randomization. Let's do this in R:

K = 5000 ## the number of samples to obtain
heights = rbind(matrix(hm), matrix(hf))
T = matrix(0,K) ## matrix of test statistics (note: this masks the shorthand T for TRUE)
t.star = abs(mean(hm) - mean(hf))
for (i in 1:K) {
  heights = sample(heights) ## randomly permute the order of heights
  xm = mean(heights[1:22])
  xf = mean(heights[23:40])
  T[i] = abs(xm - xf)
}
hist(T, main = "distribution of test statistic")
abline(v = t.star, col = "red")
p = sum(T >= t.star) / length(T)
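The loop above depends on the vectors hm and hf being defined already. A self-contained variant is sketched below under stated assumptions: the male and female heights are simulated (the means and standard deviation are invented), and replicate() is swapped in for the explicit loop as one possible alternative, not the tutorial's own method.

```r
set.seed(42)
hm <- rnorm(22, mean = 64, sd = 3)   # hypothetical heights of 22 males
hf <- rnorm(18, mean = 61, sd = 3)   # hypothetical heights of 18 females

K <- 5000                            # number of random permutations
pooled <- c(hm, hf)
t.star <- abs(mean(hm) - mean(hf))   # observed test statistic

# Each replicate shuffles the labels and recomputes the statistic
perm.stats <- replicate(K, {
  shuffled <- sample(pooled)
  abs(mean(shuffled[1:22]) - mean(shuffled[23:40]))
})

# Empirical p-value: proportion of permuted statistics at least as large as t.star
p <- mean(perm.stats >= t.star)
print(p)
```

Because the permutations are sampled rather than enumerated, p changes slightly from run to run unless the seed is fixed.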

Writing your own functions (and loops)
f = function(x1, x2 = 0) {
  return (x1 + x2)
}
The return statement above is optional (a function returns the value of its last evaluated expression). We may want to have a function that simply 'does something'. For example, suppose we have a matrix m, and we want to plot a line that corresponds to each row of the matrix.
m = matrix(1:15,ncol=5,byrow=TRUE)
plotLines = function(m, ...) {
  # this is a comment
  lower = min(m)
  upper = max(m)
  for (i in 1:dim(m)[1]) {
    plot(m[i,], ylim = c(lower, upper), type = 'l', ...)
    par(new=TRUE)
  }
}
There are many options for graphical parameters; see '?plot' and '?par' for examples.

R also allows while loops:
i = 0 # starting value chosen for illustration
while (i < 10) {
  print(i)
  i = i + 1
}
Within a loop you may use break or next statements, similar to last and next in Perl.
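Default arguments and the break and next statements can be seen in a short sketch. The function f is the one defined above, written without return; the vector odds and the loop bounds are invented for this example.

```r
# x2 has a default value, so it may be omitted in a call
f <- function(x1, x2 = 0) {
  x1 + x2              # the last evaluated expression is the return value
}
r1 <- f(3)             # x2 falls back to 0
r2 <- f(3, 4)          # arguments matched by position
r3 <- f(x2 = 1, x1 = 5)  # arguments matched by name, in any order

# Collect the odd numbers from 1 to 10
odds <- c()
i <- 0
while (TRUE) {
  i <- i + 1
  if (i > 10) break        # leave the loop entirely
  if (i %% 2 == 0) next    # skip the rest of this iteration
  odds <- c(odds, i)
}
print(odds)
```

Here break ends the loop once i passes 10, while next jumps straight to the following iteration for even values of i.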

# Let's visualize the standard normal density
x = seq(-5,5, by=0.1)
plot(x, dnorm(x), type = 'l')
# We can generate 1000 observations from the standard normal distribution
z = rnorm(1000)
hist(z)
# Let Z ~ N(0,1). Then
pnorm(1.645) # returns P(Z < 1.645)
qnorm(.95) # returns the value z for which P(Z < z) = 0.95
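The functions pnorm and qnorm are inverses of each other, which the following sketch checks numerically. The seed, the sample size of 100000, and the variable names are arbitrary choices made for this example.

```r
p <- pnorm(1.645)          # P(Z < 1.645), roughly 0.95
z <- qnorm(0.95)           # the 95th percentile, roughly 1.645
round.trip <- qnorm(pnorm(1.2))   # applying both recovers the input

# The same probability estimated from simulated draws
set.seed(7)
draws <- rnorm(100000)
p.hat <- mean(draws < 1.645)   # proportion of draws below 1.645

print(c(p = p, z = z, p.hat = p.hat))
```

The same d/p/q/r naming pattern applies to other distributions in R, for example dexp, pexp, qexp and rexp for the exponential.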

# Simulate flipping a coin 100 times, where P(H) = P(T) = ½
flips = runif(100)
flips[ flips < 0.5 ] = 'H'
flips[ flips != 'H'] = 'T'
flips = as.factor(flips)
summary(flips)

# suppose we did not know about the summary function, and had to write our
# own function to count the number of heads / tails

# a C-style function
countHT.C = function(x) {
  numH = 0
  numT = 0
  x = toupper(x) ## convert to all uppercase
  for (i in 1:length(x)) {
    if (x[i] == 'H') numH = numH + 1
    if (x[i] == 'T') numT = numT + 1
  }
  print(paste("Heads: ", numH))
  print(paste("Tails: ", numT))
}

# an R style function
countHT.R = function(x) {
  x = toupper(x)
  numH = length( x[ x == 'H'] )
  numT = length( x[ x == 'T'] )
  print(paste("Heads: ", numH))
  print(paste("Tails: ", numT))
}
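For comparison, the same counts can be obtained without writing a function at all. The sketch below is an alternative, not the tutorial's approach: it draws the flips with sample() instead of runif() and counts them with table() and sum(); the seed and variable names are arbitrary.

```r
set.seed(3)
# Draw 100 flips directly from the two outcomes, each with probability 1/2
flips <- sample(c("H", "T"), size = 100, replace = TRUE)

counts <- table(flips)        # tabulates every distinct value
numH <- sum(flips == "H")     # TRUE counts as 1, so the sum is the number of heads
numT <- sum(flips == "T")

print(counts)
```

Summing a logical vector is the idiom behind the R-style function above: x == 'H' is evaluated for the whole vector at once rather than element by element.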

Recall the Central Limit Theorem: If X_i is a random variable with mean μ and variance σ^2, the sample mean of a random sample of size n has the approximate distribution

\[ \bar{X}_n \sim N\left( \mu, \frac{\sigma^2}{n} \right) \]

This is exactly true if X is normally distributed, and is approximately true for other distributions of X with finite variance, with the approximation being more accurate as n increases. Let's visualize this using R:

# the dots (...) are used as additional arguments to dsn
CLT <- function(r = 5000, dsn = rnorm, ...) {
  # divide the window into 4 panels
  par(mfrow = c(2,2))
  hist(dsn(r, ...), xlab = "x", main = "original dsn")
  n.sizes = c(10, 30, 500)
  for (n in n.sizes) {
    m = matrix(0, r)
    for (i in 1:r) {
      m[i] = mean(dsn(n, ...))
    }
    title = paste("n = ", n)
    hist(m, main = title, xlab = "xbar")
  }
}
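The theorem can also be checked numerically rather than visually. The sketch below assumes an exponential distribution with rate 1 (so μ = 1 and σ^2 = 1) and compares the mean and variance of many sample means with μ and σ^2 / n. The choice of distribution, the seed, and the name xbar are assumptions made for illustration.

```r
set.seed(10)
n <- 30       # size of each random sample
r <- 5000     # number of sample means to generate

# Exp(1) is clearly non-normal, with mean 1 and variance 1
xbar <- replicate(r, mean(rexp(n, rate = 1)))

mean.xbar <- mean(xbar)   # should be close to mu = 1
var.xbar  <- var(xbar)    # should be close to sigma^2 / n = 1/30

print(c(mean = mean.xbar, var = var.xbar, expected.var = 1 / n))
```

With the CLT function defined above, a call such as CLT(dsn = rexp, rate = 1) would draw the matching histograms; the rate argument is passed through the dots to rexp.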