Download S-Plus/R Quantitative Programming: Objects, Data Manipulation, and Graphics - Prof. A. Joh and more Study notes Statistics in PDF only on Docsity!
Week 12+ Class Activities
File: week-12PLUS-05dec05.doc
Directory (hp/compaq): C:\baileraj\Classes\Fall 2005\sta402\handouts TOPICS IN STATISTICAL PROGRAMMING (varies)
- Introduction to quantitative programming in S-Plus/R
- objects – vectors, lists, matrices, dataframes
- reading data [ scan , read.table, sas.get ]
- summarizing data sets [ mean, var, summary, table ]
- graphical displays [ plot, pairs , trellis displays]
- writing functions References: Venables WW and Ripley BD. 1997. Modern Applied Statistics with S- Plus. Springer-Verlag. Venables WW and Ripley BD. 2000. S Programming. Springer-Verlag. Spector P. 1994. An introduction to S and S-Plus. Duxbury Press Krause A and Olson M. 1997. The Basics and S and S-Plus. Springer. S-Plus for Windows Documentation http://216.211.131.2/support/doc_splus_win.asp Getting Started with S-Plus for Windows (GUI stuff) http://216.211.131.2/support/splus60win/tutorial.pdf http://www.insightful.com/support/doc_splus_win.asp R Project downloads, FAQs and manuals http://www.r-project.org/ Past SAS activities and how these might relate to S-Plus/R ideas week SAS ideas S-Plus/R ideas
1 BASIC CONCEPTS
*Review basic concepts of statistical computing and research data management
- Introduce SAS data sets
- Review the form of SAS Statements and SAS names
- Introduce SAS procedures
- Review the structure of SAS programs
- Describe SAS data libraries and what they can contain
- Show documenting SAS programs using comments
- Illustrate running SAS programs and basic debugging Introduce language elements Objects in S-Plus/R (scalars, vectors, matrices, dataframes, lists) Object-oriented functionality 2 USING SAS PROCEDURES
- Introduce the idea of SAS system options
- Briefly review statements that can be used with most procedures (BY, WHERE, TITLE, FOOTNOTE, LABEL, FORMAT)
- PROC CONTENTS for describing a data set
- PROC PRINT for listing the observations in a data set
- PROC CHART and PROC PLOT for producing low resolution graphs
- PROC FREQ for one-way frequency tables and n-way cross- tabulations
- PROC UNIVARIATE for descriptive statistics and distributional information
- PROC MEANS for descriptive statistics
- PROC SORT for sorting a data set
- SAS documentation and the online help system
- options()
- titles build into most displays
- names(dataframe)
- low resolution graphs via printer()
- table() – frequency tables
- summary()
- means(), var(), quantile(), etc.
- sort(), index()
- ?command 3 REPORT WRITING
- Introduce the Output Delivery System (ODS) for customizing procedure output
- PROC TABULATE for producing nicely-formatted tables No ODS analog although can save graphics in a host of different formats 4 AN INTRODUCTION TO STATISTICAL MODELING
- PROC REG for linear modeling (a very basic introduction)
- PROC GLM for anova models lm (linear models), glm (generalized linear models) 5 HIGH-RESOLUTION GRAPHICS AND FORMATS
- Introduce concepts related to high-resolution graphs
- PROC GCHART and PROC GPLOT for producing high- resolution graphs
- SAS-supplied formats and PROC FORMAT for user-defined formats plot, pairs, boxplot,
11 SAS/IML Programming
- Basic matrix concepts: rows, columns, scalars
- matrix operators
- subscripting
- matrix functions
- creating matrices from data sets and vice versa
- sample applications matrix manip. part of the language Overview and abbreviated history (from Krause/Olson and R project page) S is a quantitative programming environment developed at AT&T Bell Labs
- 1976 start
- 1981 rewrite in C and port to unix
- 1985 function concept introduced
- 1987 S-Plus, a supported and extended version of S, started
- 1989 S-Plus for DOS
- 1993 S-Plus for Windows
- 1997 GUI for S-Plus for Windows
- ca 1997 R development team working (after R started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University of Auckland R is a language and environment for statistical computing and graphics.
- GNU project which is similar to the S language, different implementation of S
- provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs out of the box on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux). It also compiles and runs on Windows 9x/NT/2000 and MacOS.
GUI vs. Command Line S-Plus has a Graphical User Interface available (GUI) -- see http://216.211.131.2/support/splus60win/tutorial.pdf for more information We focus on command line interactions with S-Plus/R Getting help
- menu
- > help() [where “>” is the default command prompt in S/R]
- > ?quantile > help(quantile) # for help on specific command > ?”^” > help(“^”) # for help on operator Syntax Assignment (via “<-“ or “_”) TOPICS IN STATISTICAL PROGRAMMING (varies)
- writing functions Functions/Operators
- HELP .. ?fun.name help(fun.name) args(fun.name) Object-oriented: classes and Methods (into S in 1992) Classes = type of object under consideration Methods = what to do with object Most commonly used Methods: print, plot, summary (others: predict, fitted, resid, coef) Methods implementation Methodname.classname
"model.matrix.lm" "multicomp.lm" "plot.lm" "predict.lm" "print.lm" splus splus nlme3 splus splus "print.summary.lm" "proj.lm" "qqnorm.lm" "residuals.lm" "ssType3.lm" stat splus menu menu menu "step.lm" "summary.lm" "tabPlot.lm" "tabPredict.lm" "tabSummary.lm" Operators
- (unary minus) %/% integer divide (7 %/% 2 = 3) %% modulo (7 %% 2 = 1) %*% matrix multiplication %o% outer product : sequence Logical operators == != < <= > >= & = elementwise AND vec1 <- c(2,4,8) vec2 <- c(3,9, 27) vec1 <=4 & vec2 < (T T F) AND (T F F) = (T F F) | = elementwise OR vec1 <- c(2,4,8) vec2 <- c(3,9, 27) vec1 <=4 | vec2 < (T T F) OR (T F F) = (T T F) && = Control AND (mode(vec1)==”numeric”) && (min(vec1)>0)) T
|| = Control OR ! = unary not !(vec1 <=4 | vec2 <9) (F F T) Precedence
{} > () > $ > [ [[ > ^ > - > : > %*% %/% %% >
*/ > +- > <. <= >= == != >! > [& && | ||] >
<<- > <- _ ->
Example with functions x1 <- c(5:1, 1:5) x [1] 5 4 3 2 1 1 2 3 4 5 x1== [1] T F F F F F F F F T y1<- c(letters[5:1],LETTERS[1:5]) y [1] "e" "d" "c" "b" "a" "A" "B" "C" "D" "E" y1[x1==5] (1:length(x1))[x1==5] [1] 1 10 seq(along=x1)[x1==5] [1] 1 10 x2<-c(x1,NA) x [1] 5 4 3 2 1 1 2 3 4 5 NA x2[!is.na(x2)] [1] 5 4 3 2 1 1 2 3 4 5 sort(x1) [1] 1 1 2 2 3 3 4 4 5 5 order(x1) [1] 5 6 4 7 3 8 2 9 1 10
[7] 40.679650 44.390966 49.811534 52.
plot(xx,yy) myfit <- lm(yy ~ xx) myfit Call: lm(formula = yy ~ xx) Coefficients: (Intercept) xx 2.658308 5. Degrees of freedom: 10 total; 8 residual Residual standard error: 1. > names(myfit) [1] "coefficients" "residuals" "fitted.values" [4] "effects" "R" "rank" [7] "assign" "df.residual" "contrasts" [10] "terms" "call" > pmin pmax # returns vector of max/min of each position of 2 vecs APPLY Functions apply(obj, dim, function) lapply(list, function) sapply(list,function) tapply(object,indices,function) state.tmp <- as.data.frame(state.x77) attach(state.tmp) inc <- cut(Income,breaks=quantile(Income),include.lowest=T) inc [1] 1 4 3 1 4 4 4 3 4 2 4 2 4 2 3 3 1 1 1 4 3 3 3 1 2 2 2 4 2 4 1 4 1 4 3 1 3 [38] 2 3 1 2 1 2 2 1 3 4 1 2 3 attr(, "levels"): [1] "3098.00+ thru 3992.75" "3992.75+ thru 4519.00" "4519.00+ thru 4813.50" [4] "4813.50+ thru 6315.00" attr(, "class"): [1] "category" tapply(Murder,inc,mean) 3098.00+ thru 3992.75 3992.75+ thru 4519.00 4519.00+ thru 4813. 9.707692 6.191667 5. 4813.50+ thru 6315.
unlist(lapply(split(Murder,inc),mean))
3098.00+ thru 3992.75 3992.75+ thru 4519.00 4519.00+ thru 4813. 9.707692 6.191667 5. 4813.50+ thru 6315.
By functions by(nitrofen$total,nitrofen$conc,function(x) {sqrt(var(x))}) INDICES: x x 3.
INDICES: x x 3.
INDICES: x x 2.
INDICES: x x 5.
INDICES: x x 3. by(nitrofen$total,nitrofen$conc,function(x) {sqrt(var(x))}) INDICES: x x 3.
INDICES: x x 3.
INDICES: x x 2.
INDICES: x x 5.
INDICES: x x 3. lapply(split(nitrofen$total,nitrofen$conc),summary) $"0": Min. 1st Qu. Median Mean 3rd Qu. Max. "24.00" "30.25" "32.50" "31.40" "33.75" "36.00" $"80": Min. 1st Qu. Median Mean 3rd Qu. Max. "26.0" "29.5" "32.5" "31.5" "33.0" "36.0"
*** writing output** cat(object,”\n”) cat(“Another plot?”) ans<- readline() if (ans==”Y”) | (ans==”y”) { expression } *** couple of tricks**
- eval(parse(text= character-with-command )) cmd="objects()" cmd [1] "objects()" eval(parse(text=cmd)) [1] ".Last.fixed" ".Last.value" ".Random.seed" "Temp" [5] "ans" "cmd" "imat" "inc" [9] "intensity" "last.dump" "my.df" "my.lst" [13] "myfunc" "new.mat" "nitro.linreg" "nitrofen" [17] "ord.int" "par.old" "pi.est" "pi.est.fcn" [21] "pi.est.fcn2" "plot.any.func" "plot.any.func2" "state.tmp" [25] "tmp" "trt" "x" "x1" [29] "x2" "x3" "x4" "x5" [33] "xmat1" "xmat1.df" "xmat2" "xmat3" [37] "xmat4" "y" "y1" "y2"
- deparse(substitute(object-name))
- constructs character variable with value “object-name”
- paste to build commands xmat C1 C2 C R1 1 4 7 R2 2 5 8 R3 3 6 9 dimnames(xmat1)<- list(paste("Row",1:3,sep=""),paste("Col",1:3,sep="")) xmat Col1 Col2 Col Row1 1 4 7 Row2 2 5 8 Row3 3 6 9
example: MC estimate of PI pi.est.fcn2 <- function(nsims=1400,do.plot=T) { xsim <- runif(nsims) ysim <- runif(nsims) icheck <- as.numeric( ysim <= sqrt(1-xsim^2) ) points.under <- sum(icheck)/nsims if (do.plot==T) { plot(xsim,ysim,type=”n”,pty=”s”) text(xsim,ysim,icheck,col=ifelse(icheck==1,1,2)) xp <- sort(xsim) lines(xp,sqrt(1-xp^2),lwd=3) } piest <- 4points.under se.piest <- 4sqrt(points.under(1-points.under)/nsims) CL <- piest + c(-1,1)qnorm(.975)se.piest list(est.pi = piest, se.est.pi = se.piest, conf.int = CL) } pi.est.fcn2(nsims=4000) example: MC estimate of PI plot.any.func <- function(func,x.low,x.high, ltype=”l”,mtitle=””) { xp <- seq(from=x.low,to=x.high,length=1000) yp <- func(xp) plot(xp,yp,type=ltype, ylab=deparse(substitute(func)), main=mtitle) } plot.any.func(cos,-2pi,2pi,mtitle=”Cosine over -2pi to 2*pi”)
t.based.ci<- function(xxx) { mean(xxx) + c(-1,1)qt(.975,length(xxx)-1) sqrt(var(xxx)/length(xxx)) } one.sample.ci.sim <- function(Nsims=1400,Nobs=15,mu=0) { xmat<- matrix(rnorm(Nsims*Nobs),nrow=Nsims,ncol=Nobs) xres<- t(apply(xmat,1,t.based.ci)) check<- (xres[,1]<=mu)&(mu<=xres[,2]) cover <- sum(check)/Nsims cat(“Normal population:\n Number of observations = “,Nobs, ”\n Number of simulated experiments = “,Nsims, ”\n Coverage Probability = “,cover,”\n\n”) } for (iNobs in c(15,30,45)) { one.sample.ci.sim(Nsims=1400,Nobs=iNobs) } Normal population: Number of observations = 15 Number of simulated experiments = 1400 Coverage Probability = 0. Normal population: Number of observations = 30 Number of simulated experiments = 1400 Coverage Probability = 0. Normal population: Number of observations = 45 Number of simulated experiments = 1400 Coverage Probability = 0.
LAB EXERCISES
- Reading data into R Manatee.lst <- scan(what=list(year=0,nboats=0,ndead=0)) 77 447 13. 78 460 21. 79 481 24. 80 498 16. 81 513 24. 82 512 20. 83 526 15.
make a data frame with the Manatee data
Manatee.df <- as.data.frame(Manatee.lst) Manatee.df year nboats ndead 1 77 447 13 2 78 460 21 3 79 481 24 4 80 498 16 5 81 513 24 6 82 512 20 7 83 526 15 8 84 559 34 9 85 585 33 10 86 614 33 11 87 645 39 12 88 675 43 13 89 711 50 14 90 719 47
make a matrix with the Manatee data
Manatee.mat <- as.matrix(Manatee.df) pairs(Manatee.mat)
make a plot or two
plot(Manatee.df$nboats,Manatee.df$ndead)
how about multiple plots on a page
par(mfrow=c(2,2))
fit linear regression
man.fit <- lm(ndead~nboats, data=Manatee.df)
plot 1
plot(ndead~nboats, data=Manatee.df, xlab=”Number of boats registered (1000s)”, ylab=”Number of Manatees killed”, main=”Plot of Manatee Deaths vs. boats registered\n(years 1977- 1990)” )
pairs(state.x91[,c(1,2,3,6,7,8)]) # what does this do? faces(state.x91,labels=state.abb) stars(state.x91,labels=state.abb)
- Generate a data set with x = rep(c(0,80,160,240), mu.x = exp(log(30) + .008x - .00015x*x) and Y~Poisson(mu.x). a) Plot randomly generated Y (use rpois) vs. x. b) Generate the linear regression fit of sqrt(y) = b0 + b1 X + b2 X^2 (sqrt transformation is the variance stabilizing transformation for Poisson data). c) Superimpose this fit on a scatterplot of the data.
- Modify the plotting function to pass the number of x values you wish to plot. Note that you can also add the ability to pass other non-specified arguments via an ellipsis (...) in the argument list.
plot.any.func2 <- function(func,x.low,x.high, ltype=”l”,mtitle=””,...) { xp <- seq(from=x.low,to=x.high,length=100) yp <- func(xp) plot(xp,yp,type=ltype, ylab=deparse(substitute(func)), main=mtitle,...) }
check out how other parameters are passed in the function
calls below
plot.any.func2(cos,0,3pi,ltype="l",lwd=13) plot.any.func2(cos,0,3pi,ltype=”p”,cex=3) plot.any.func2(cos,0,3*pi,ltype=”p”,pch=6)
- Modify the t-based CI function (one.sample.ci.sim) to study samples from uniform and exponential samples.
- Import “Roberts Data.xls” (an Excel spreadsheet) into a data frame. Examine variables in this data frame. Generate summary statistics for each numeric variable in this data set. Plot X=Treatment..Hours. Y=Liver.SOD
- Import “Manatee” SAS data set into a data frame. Plot number of manatees deaths versus number of boats registered. Generate a linear regression fit relating number of deaths to number of boats registered. Superimpose this fit on the manatee plot. You may want to look at “lm”, “plot”, “lines”, “abline” and “coef” functions.
Q: How do I delete a bunch of similarly named files?
Suppose you have a number of files with “tmp” as part of the name... xtmp<- ytmp <- 111 ztmp <- “more junk” objects(pattern=”tmp”) rm(list=objects(pattern="tmp"))
Q: How do I manipulate multiple graphs?