R Programming: Statistical Computing and Graphics, Study Guides, Projects, Research of Computer Science

An introduction to R, a free software package for statistical analysis, data visualization, and matrix computation. It covers the basics of R programming, including data types (vectors and matrices), creating and manipulating objects, and using built-in functions. The document also touches upon data input, statistical analysis, and writing custom functions.

Uploaded on 09/02/2009

koofers-user-j4g
Programming in R: Statistical Computing and Graphics
R is a freely available software package used for statistical analysis, data visualization,
and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating
systems. R is a command-based language with many objects and functions built-in.
Users can also define their own objects and functions, and many specialized packages are
also available.
For more background, downloads, and a more thorough user-manual see:
http://cran.r-project.org/
Note: R does not recognize the typographic opening and closing quotation marks
(‘ and ’) that appear in this file; it recognizes only the plain quotation marks (' and ").
If any of the commands gives an error when copied and pasted into R, try retyping the
quotation marks manually in R, or use a text version of this file.
R can be used like a calculator
5 + 9
4 / 7 + (100-2) / 5
sqrt(16)
exp(8)
The assignment operator is the ‘=’ sign; ‘<-’ can also be used
a = 3
x = 4
x**a or x^a returns x raised to the power a (here 4^3 = 64)
The workspace is defined as all objects and user-defined functions in the current
environment. The command ls() returns the names of all objects in the environment.
The command rm(a, x) can be used to remove from the workspace the two objects that
we created above.
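As a small illustration of the commands above, the sketch below assigns two objects, uses them, and removes them again. The names power and gone are hypothetical helpers added only so the results can be inspected.

```r
a <- 3                     # '<-' and '=' both assign at the top level
x <- 4
power <- x^a               # 64; x**a gives the same result
ls()                       # names of the objects in the workspace, including "a" and "x"
rm(a, x)                   # remove both objects from the workspace
gone <- !exists("a", inherits = FALSE)   # TRUE once a has been removed
```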
Getting Help (?):
?ls
?matrix
Comments:
The # sign is used to denote a comment (the same as in Perl)

Data types:
vectors – these are 1-dimensional (1 row of numbers, characters, etc.)
v = 1:
# type the name of the object, in this case v, to view it
v[2] # returns the 2nd element of the vector
length(v) # returns the number of elements in the vector
v = c('a', 'b', 'c')
v = c(1,2,5)
v = seq(1,10,by=2)
v = rep(10,6)
matrices – these are 2-dimensional (rows and columns), and all elements must be of the same type
v = 1:15
m = matrix(v, nrow = 3, ncol = 5, byrow = T) # creates a 3 (row) x 5 (column) matrix
m = matrix(v, 3, byrow = T) # does the same thing
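To see what the constructors above produce, here is a minimal sketch. The object names are illustrative, and byrow = TRUE is spelled out in place of the abbreviation T.

```r
# Vector constructors used above
v1 <- c(1, 2, 5)            # explicit elements
v2 <- seq(1, 10, by = 2)    # 1 3 5 7 9
v3 <- rep(10, 6)            # six copies of 10

# Filling a 3 x 5 matrix by row versus by column (the default)
by_row <- matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
by_col <- matrix(1:15, nrow = 3, ncol = 5)

by_row[1, ]   # 1 2 3 4 5
by_col[1, ]   # 1 4 7 10 13
```

Without byrow = TRUE, R fills a matrix column by column, which is why the first rows differ.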

# can you create the matrix below:
1 1 1
2 2 2
3 3 3
4 4 4
dim(m) # returns the number of rows and columns of matrix m
dim(m)[1] # the number of rows
dim(m)[2] # the number of columns
We can access elements of the matrix m using m[rows, columns], where rows and columns are the rows and columns of interest
m[1:2,2:3] # returns rows 1 and 2 and columns 2 and 3
m[rows, ] # returns the specified rows (and all columns)
m[, columns] # returns the specified columns (and all rows)
Note: if only 1 row or column is specified, then a vector will be returned
Can you change the element in the 3rd column and the 4th row to 0?
Matrix arithmetic
m + 3 # adds 3 to each element of m
m * 5 # multiplies all elements in m by 5

# for 2 matrices m1 and m2 of equal dimension, add corresponding elements
m1 + m2
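The indexing and arithmetic rules above can be tried on a small matrix. This is one possible sketch, not the only way to build the objects; the matrices m, m1, and m2 here are made up for illustration.

```r
# A 4 x 3 matrix whose i-th row is filled with the value i
m <- matrix(rep(1:4, each = 3), nrow = 4, byrow = TRUE)

sub  <- m[1:2, 2:3]   # rows 1-2 and columns 2-3: still a matrix
row1 <- m[1, ]        # a single row comes back as a plain vector
m[4, 3] <- 0          # set the element in row 4, column 3 to 0

m1 <- matrix(1:6, nrow = 2)
m2 <- matrix(6:1, nrow = 2)
total <- m1 + m2      # element-wise sum; every entry is 7
```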

Data input
A list of commands in a file can be read using source(file.name)
source('http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt')
Reading in a file
data = read.table('http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt', sep = ',', header = T)
data.frames
Data frames are objects that combine features (particularly element access methods) of matrices and lists
The columns of ‘data’ are ‘name’, ‘age’, ‘sex’, ‘height’, and ‘weight’
This can be determined using colnames(data)
data$name
data$age
summary(data)
Suppose we want to change the heading of ‘sex’ to ‘gender’
We can rename all of the columns using colnames(data) = new.names

  • Can you create a vector of the column names we want?
  • Can you change ‘sex’ to ‘gender’?
  • Can you rename the column names?
Alternately, we could have used
colnames(data)[3] = 'gender'
Another data type is the logical data type (TRUE or FALSE; or alternatively T and F)
5 > 3
5 > 9
Logical operators (e.g. to compare two numbers): >, <, >=, <=, ==, !=
v = 1:
index = v > 5 # for each element of v, check if that element > 5
v[index] # returns the elements of v that are > 5
  • In the big class data set, retrieve a list of students greater than 15 years old
      o index = data$age > 15
      o data[index,] - note that we need to include the ‘,’ after ‘index’. Why is this?
      o The previous two steps may also be combined: data[data$age > 15,]
  • Other examples:
      o data[data$gender == 'M',]
      o data[data$age == 12,]
      o data[data$age == 12,]$height
      o data[data$age == 12, 4]
Relationship between two variables:
To reduce future typing, first enter:
x = data$height
y = data$weight
cor(x,y) # returns the correlation between x and y
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
Linear models:
A linear model (for one input variable) has the form y = b0 + b1*x, where y is referred to as the response variable and x is an input variable.
fit1 = lm(y~x) # fits a linear model of the above form
summary(fit1)
Estimates of b0 and b1 are the first and second elements of fit1$coeff, respectively. These can also be found through summary(fit1) or simply by printing the fit1 object.
If we plug our estimates of b0 and b1 into the original equation, we can predict a person’s weight (y) from their known height (x). Doing this for the heights in our data, we get a list of fitted values, fit1$fitted.
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
lines(x, fit1$fitted, col = 'red')
Now let us consider the model y = b0 + b1*x1 + b2*x2, where
y = data$weight
x1 = data$age
x2 = data$height
fit2 = lm(y ~ x1 + x2)
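Because the data file above may not be reachable, the sketch below rebuilds the same steps on a small stand-in data frame. All values are invented for illustration, and only the column names follow the text. It also checks that plugging the estimated coefficients into y = b0 + b1*x reproduces the fitted values.

```r
# Hypothetical stand-in for the BigClass data (values invented for illustration)
data <- data.frame(
  name   = c("Ann", "Bob", "Cal", "Dee", "Eve", "Fay"),
  age    = c(12, 13, 14, 15, 16, 17),
  sex    = c("F", "M", "M", "F", "F", "F"),
  height = c(56, 59, 61, 63, 65, 66),
  weight = c(80, 95, 105, 112, 125, 130)
)
colnames(data)[3] <- "gender"           # rename a single column

older <- data[data$age > 15, ]          # the empty slot after the comma keeps all columns
h12   <- data[data$age == 12, "height"] # same value as data[data$age == 12, ]$height

x <- data$height
y <- data$weight
fit1 <- lm(y ~ x)
b <- coef(fit1)                         # b[1] is the intercept b0, b[2] is the slope b1
manual <- b[1] + b[2] * x               # plug the estimates into y = b0 + b1*x
same <- all.equal(unname(manual), unname(fitted(fit1)))  # TRUE

# Two input variables, written with column names and the data argument
fit2 <- lm(weight ~ age + height, data = data)
```

Passing data = data to lm() is an alternative to copying columns into x1 and x2 first; both forms fit the same model.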

m = matrix(1:15,ncol=5,byrow=T)
plotLines = function(m, ...) {
  # this is a comment
  lower = min(m)
  upper = max(m)
  for (i in 1:dim(m)[1]) {
    plot(m[i,], ylim = c(lower, upper), type = 'l', ...)
    par(new=T)
  }
}
R also allows while loops:
i =
while (i < 10) {
  print(i)
  i = i + 1
}
Within a loop you may use break or next statements, similar to Perl.
Conditional statements
is5 = function(x) {
  if (x == 5) {
    print('x is equal to 5')
  } else {
    print('x is not equal to 5')
  }
}
Note: R has no separate elseif keyword (such as Perl’s elsif); to test more than two cases you must use nested if…else statements.
Saving and Loading R objects
First let’s check our current working directory. This is the directory in which files will be saved, or from which R attempts to read if only a file name is specified. In order to get and set the working directory, use the functions ‘getwd’ and ‘setwd’.
It is recommended that you change the working directory now….
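A minimal sketch of checking and changing the working directory follows. tempdir() is used here only as a location that is safe to write to; in practice you would pass the path of your own project folder.

```r
old <- getwd()        # remember the current working directory
setwd(tempdir())      # switch to the session's temporary directory (example location only)
getwd()               # shows the new working directory
setwd(old)            # switch back when finished
```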

# save the current workspace in the current working directory
save.image(file = 'file.RData')
save(x, file = 'x.RData') # can be used to save a subset of objects in the workspace
# will load in the specified workspace or R object (note: objects in the workspace you are
# loading in will overwrite any currently defined objects with the same name)
load('file.RData')
# save the matrix m as a text file
write(t(m), ncolumns = ncol(m), file = 'm.txt') # this can later be read in using the read.table command.
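The commands above can be checked with a round trip: save an object, remove it, load it back, then write a matrix to text and read it again. This sketch writes into tempdir(), and the file paths are hypothetical, so nothing in your own working directory is touched.

```r
m <- matrix(1:15, ncol = 5, byrow = TRUE)
x <- c(1, 2, 5)

rdata_file <- file.path(tempdir(), "x.RData")   # hypothetical file locations
text_file  <- file.path(tempdir(), "m.txt")

save(x, file = rdata_file)    # save a single object
rm(x)
load(rdata_file)              # x is defined again

# write() fills the file column by column, so the matrix is transposed first
write(t(m), ncolumns = ncol(m), file = text_file)
m_back <- as.matrix(read.table(text_file))      # read.table returns a data frame
```

The transpose is what makes each line of m.txt correspond to one row of m.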

Probability distributions
R can handle all common probability distributions, including the normal and (continuous) uniform distribution. For the normal distribution (standard normal by default), ‘dnorm’ gives the density, ‘pnorm’ gives the distribution function, ‘qnorm’ gives the quantile function, and ‘rnorm’ generates random deviates. Other probability functions work similarly (e.g., dunif, punif, etc. for the uniform distribution).
# We can visualize the standard normal density
x = seq(-5,5, by=0.1)
plot(x, dnorm(x), type = 'l')
# We can generate 1000 observations from the standard normal distribution
z = rnorm(1000)
hist(z)
# Let Z ~ N(0,1). Then
pnorm(1.645) # returns P(Z < 1.645)
qnorm(.95) # returns the value z for which P(Z < z) = 0.95
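The relationship between the distribution function and the quantile function described above can be verified numerically. The sketch below is illustrative; the mean and standard deviation in the last line are arbitrary values chosen for the example.

```r
p <- pnorm(1.645)              # P(Z < 1.645), roughly 0.95
z <- qnorm(0.95)               # the 95th percentile, roughly 1.645
round_trip <- pnorm(qnorm(0.95))   # qnorm undoes pnorm, giving 0.95 back

# The same d/p/q/r naming pattern for the uniform distribution on [0, 1]
punif(0.3)                     # P(U < 0.3) = 0.3
dunif(0.5)                     # the density is 1 everywhere on [0, 1]

# A normal distribution with mean 10 and standard deviation 2 (arbitrary values)
shifted <- pnorm(12, mean = 10, sd = 2)   # equals pnorm(1)
```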

# Simulate flipping a coin 100 times, where P(H) = P(T) = ½
flips = runif(100)
flips[ flips < 0.5 ] = 'H'
flips[ flips != 'H' ] = 'T'
flips = as.factor(flips)
summary(flips)
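An alternative sketch of the same simulation uses sample() instead of runif(). This is not the approach shown above, only another way to draw the flips; the seed is arbitrary and is set just so the run can be repeated.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
flips <- sample(c("H", "T"), size = 100, replace = TRUE)   # equal probabilities by default
counts <- table(flips)                         # number of heads and tails
prop_heads <- mean(flips == "H")               # should be near 0.5
```

A biased coin could be simulated by adding the prob argument, for example prob = c(0.7, 0.3).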

# count number of nucleotides in a sequence x
countN1 = function(x) {
  numA = 0
  numG = 0
  numC = 0
  numT = 0

  1. A researcher has identified a genetic structure that she believes is conserved throughout the genome. In order to determine the probability that this structure arose by chance, she generates many random sequences of the same length, with marginal probabilities for each nucleotide based on their empirical probabilities.
  2. A researcher is studying promoter regions that are rich in guanine, and, from a list of candidate promoters, wants to look at all sequences where guanine content is greater than 30%.
  3. Generating random sequences –
     a. Create a function that generates a single random nucleotide X where P(X = “G”) = 0.30, P(X = “A”) = 0.20, P(X = “C”) = 0.25, and P(X = “T”) = 0.25. Hint: You may want to use the runif() function to do this.
     b. Using the function you have created in (a), create another function that generates a random nucleotide sequence of length n.
     c. Generate a random nucleotide sequence of length 100 using the sample() function, where the probability of each nucleotide is given in (a). Hint: type ‘?sample’ for more information.
  4. Sequence analysis –
     a. Load the object ‘sequences’ using the following command:
        load(url('http://www.public.iastate.edu/~gdancik/summer2007/files/sequences.RData'))
        to get a data.frame of DNA sequences, which has the name ‘sequences’. Each column contains a 40-base nucleotide sequence. For example, sequences[,1] will return the first sequence (as a factor).
     b. Since the columns of sequences are factors, summary(sequences) will tell you the number of each nucleotide in each column. However, suppose that we did not know this. Modify the countN1 or countN2 functions to take a single sequence, and return a vector of 4 elements that corresponds to the number of A’s, G’s, C’s, and T’s in the sequence. (Note: you will need to remove the toupper() function, since we are now working with factors, and not characters.)
     c. Use the apply function to return a 4 x 10 matrix with the number of A’s, G’s, C’s, and T’s in each of the 10 sequences.
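One possible way to approach parts 3(c), 4(b), and 4(c) is sketched below. It is not the author's solution: randomSeq, countN, and seqs are hypothetical names, and seqs is a randomly generated stand-in for the ‘sequences’ object, built so the example runs without the download.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
nucleotides <- c("G", "A", "C", "T")
probs <- c(0.30, 0.20, 0.25, 0.25)             # probabilities from part 3(a)

# Random nucleotide sequence of length n, drawn with sample()
randomSeq <- function(n) {
  sample(nucleotides, size = n, replace = TRUE, prob = probs)
}
seq100 <- randomSeq(100)

# Stand-in for 'sequences': 10 columns, each a 40-base sequence stored as a factor
seqs <- as.data.frame(replicate(10, randomSeq(40)), stringsAsFactors = TRUE)

# Count the A's, G's, C's, and T's in a single sequence
countN <- function(s) {
  s <- as.character(s)                         # works for factors and character vectors
  c(A = sum(s == "A"), G = sum(s == "G"), C = sum(s == "C"), T = sum(s == "T"))
}

# Apply countN to every column: one row per nucleotide, one column per sequence
counts <- apply(seqs, 2, countN)
```

Each column of counts should sum to 40, the length of each sequence, which gives a quick sanity check on the counting function.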