Analyzing a subset of data • Creating data frames with ..., Schemes and Mind Maps of Printing

Washington State University (WSU or Wazzu)Printing

Typology: Schemes and Mind Maps

2022/2023

Uploaded on 03/01/2023

techy 🇺🇸

meat.r: Explanation of code

Goals of code:

• Analyzing a subset of data
• Creating data frames with specified X values
• Calculating confidence and prediction intervals
• Lists and matrices
• Only printing a few observations
• Overlaying predicted values
• ANOVA lack of fit test

Analyzing a subset of the data: subset() or subset=

R provides a couple of ways to restrict an analysis to a subset of the data. If you want to do many analyses or plots with a subset, I find it easier to create a second data frame with the desired subset of data. If you want to do one analysis, it can be easier to specify that subset in the analysis.

To create a new data frame with a subset of values: There are two ways: selecting rows, or using subset(). I illustrate subset(). The first argument is a data frame; the second argument is a logical expression that specifies the rows to keep. The example, subset(meat, time <= 6), starts with all rows of the meat data frame and keeps only those where time is less than or equal to 6. The logical expression is evaluated in the data frame, so you don’t need to write meat$time to specify where to find the time variable.

To specify the subset as part of an analysis, add subset= followed by a logical expression specifying the rows to keep. Again, the expression is evaluated in the data frame, so you don’t need to write subset=(meat$time <= 6). My practice is to put the logical expression inside () just so R doesn’t get confused.
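As a rough sketch of the two approaches, the code below builds a small stand-in for the meat data frame. The numbers are made up for illustration and are not the course data; the names meat6, fit1 and fit2 are hypothetical, and the use of log(time) as the predictor is an assumption.

```r
# Made-up stand-in for the meat data (values are illustrative only)
meat <- data.frame(
  time = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8),
  ph   = c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
)

# Approach 1: create a second data frame and reuse it for several analyses
meat6 <- subset(meat, time <= 6)
fit1 <- lm(ph ~ log(time), data = meat6)

# Approach 2: restrict a single analysis with subset=
fit2 <- lm(ph ~ log(time), data = meat, subset = (time <= 6))

# Both fits use the same rows, so the coefficients should agree
all.equal(coef(fit1), coef(fit2))
```

Approach 1 keeps the subset around under its own name, which is convenient when several plots and models share it; approach 2 avoids creating an extra object.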

Logical operators: R provides an extensive list of logical operators. These include

Symbol   meaning                 notes
==       equals                  requires two equals signs
<        less than
>        greater than
<=       less than or equal
>=       greater than or equal
!=       not equal
%in%     in                      followed by a vector of values, e.g. c(1, 3, 6);
                                 true if the first argument matches any of the second
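A few of these operators applied to a short vector may make the table more concrete. The vector time and the name keep are made up for this illustration.

```r
# Illustrative vector (not the course data)
time <- c(1, 2, 4, 6, 8)

time == 4                     # TRUE only where the value equals 4
time != 4                     # the opposite of the line above
time <= 4                     # TRUE for 1, 2 and 4

# %in% checks each element of time against the set on the right
keep <- time %in% c(1, 3, 6)  # TRUE for 1 and 6; no element equals 3
time[keep]                    # the values that matched
```

Each comparison returns a logical vector the same length as time, which is what subset() and subset= use to pick rows.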

Creating data frames with specified X values: data.frame()

If we want to predict Y for X values not in the data set used for analysis, we need to create a data frame with those values. This must be a data frame (not a column vector) and the column name must match the name of the X variable in the fitted model. The first step is to create a vector (column of values) that is used to create the data frame.

The three uses of data.frame show three ways to do similar things. They do create slightly different sets of prediction values. You only need one of these.

c(1,1.5,2,2.5,3,3.5,4,4.5,5): The c() function concatenates (hence the c) values to make a vector. c() can be used anywhere you need to create or specify a vector of values. The values are separated by commas.

seq(1, 8, 0.25): The seq() function generates a sequence of values. The first two arguments are the starting and ending values. The third is the step. If the step is omitted, e.g., seq(1,8), a step of 1 is assumed.

The 1:8 in the third use of data.frame() is a shortcut way to generate a sequence of integers from the starting to ending value.

The code then computes the logtime variable, as we’ve done before.
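The three constructions might look like the sketch below. The data frame names new1, new2 and new3 are hypothetical, and computing logtime with the natural log via log() is an assumption about how the variable was defined earlier.

```r
# Three ways to build a data frame of X values for prediction
new1 <- data.frame(time = c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5))  # explicit values
new2 <- data.frame(time = seq(1, 8, 0.25))                      # 1 to 8 in steps of 0.25
new3 <- data.frame(time = 1:8)                                  # integers 1 through 8

# Add the transformed predictor (assumed here to be the natural log)
new2$logtime <- log(new2$time)

# The three versions differ in how many prediction points they contain
c(nrow(new1), nrow(new2), nrow(new3))
```

The column is called time in each case so that it matches the variable name a fitted model would look for.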

Lists and matrices:

R provides many ways to store information. So far, we’ve met scalars (e.g., 5) and data frames, and we’ve just met vectors (e.g., c(1, 2, 4, 8)). A matrix has rows and columns, just like a data frame, but the entire matrix is either numeric or character. A data frame can have some columns that contain numbers and other columns that contain character strings. If a data frame with both numbers and character strings was converted to a matrix, it would become a matrix of character strings and all the numbers would be converted to character strings. The as.matrix() function converts another type of object into a matrix. The code involving test, test2, and testm demonstrates the difference between a data frame and a matrix.
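The conversion behaviour can be checked with two tiny data frames. This is only a sketch: the objects num_df, mixed_df, m_num and m_mixed are made up here and are not the test, test2 and testm objects from meat.r.

```r
# An all-numeric data frame and one that mixes numbers and strings
num_df   <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
mixed_df <- data.frame(x = 1:3, label = c("a", "b", "c"))

m_num   <- as.matrix(num_df)    # stays numeric
m_mixed <- as.matrix(mixed_df)  # everything becomes character

is.numeric(m_num)
is.character(m_mixed)
m_mixed[, "x"]                  # the numbers are now character strings
```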

You can access individual rows or columns by subscripting the matrix. Subscripts go in single square brackets. A matrix has two subscripts, which are separated by a comma. The first subscript indicates rows, the second indicates columns. A negative number omits the specified row(s) or column(s). If a subscript is omitted, e.g., [,1], all rows or columns are selected. So [,1] will extract the first column (and all rows). You can also subscript vectors (only one index, so no comma) and data frames (rows and columns). If you omit the comma, the matrix is subscripted as if it were one long vector composed of stacked columns. You probably want to specify a comma.
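The subscripting rules can be tried on a small matrix. The objects m and v below are made up for illustration.

```r
# matrix() fills column by column, so the columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

m[1, 2]    # row 1, column 2: the value 3
m[, 1]     # all rows of column 1: 1 2
m[-1, ]    # omit row 1, keep all columns: 2 4 6
m[4]       # no comma: 4th element of the stacked columns, which is 4

# Vectors take a single subscript
v <- c(10, 20, 30)
v[2]       # the value 20
v[-1]      # everything except the first element
```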

The names() function extracts the column names from a data frame. If you try names() on a matrix, the result is nothing (NULL). The dimnames() function extracts row and column names from a matrix. You can also use dimnames() on a data frame; the row names are generated automatically by R. When you run the code, you’ll see that the default for a matrix is to have column names but no row names.
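A short sketch of how names() and dimnames() behave on the two kinds of object; df and m are illustrative names, and m is created with as.matrix() as described above.

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
m  <- as.matrix(df)

names(df)      # column names of the data frame
names(m)       # NULL for a matrix

dimnames(m)    # a list: row names first (NULL here), then column names
dimnames(df)   # row names "1", "2", "3" were generated automatically
```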

ues. Adding level= with a number between 0 and 1 changes the coverage to the specified value. Changing to interval='prediction' gives you prediction intervals instead of confidence intervals.

The output from predict() is a vector when all you request are predictions; it is a list when you request standard errors. When you add se.fit=T, the output includes $fit with the vector of predicted (fitted) values and $se.fit with the vector of standard errors.

When you request an interval, the output is a matrix, with the 2nd and 3rd columns being the endpoints of the requested interval. If you request both the standard error and an interval, you get a list with the $fit component as a matrix.

You can extract the pieces you want by indexing or subscripting the list or matrix as described above.
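The different output shapes from predict() can be seen in a sketch like the one below. The meat values are made up for illustration (they are not the course data), and the object names fit, new and p1 through p5 are hypothetical.

```r
# Made-up stand-in for the meat data (values are illustrative only)
meat <- data.frame(
  time = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8),
  ph   = c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
)
meat$logtime <- log(meat$time)
fit <- lm(ph ~ logtime, data = meat)

# X values to predict at; the column name must match the model term
new <- data.frame(time = 1:8)
new$logtime <- log(new$time)

p1 <- predict(fit, newdata = new)                          # vector of predictions
p2 <- predict(fit, newdata = new, se.fit = TRUE)           # list with $fit and $se.fit
p3 <- predict(fit, newdata = new, interval = "confidence",
              level = 0.90)                                # matrix: fit, lwr, upr
p4 <- predict(fit, newdata = new, interval = "prediction") # prediction intervals
p5 <- predict(fit, newdata = new, se.fit = TRUE,
              interval = "confidence")                     # list; $fit is a matrix

p2$se.fit      # extract the standard errors from the list
p3[, 2:3]      # extract the interval endpoints from the matrix
```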

Only printing a few observations: head()

Sometimes, I want to print a few observations in a data set to get variable names (e.g., after reading a data set) or check for gross errors. The head() function prints the first six rows of a data frame or matrix, or the first six elements of a vector. tail() prints the last six.
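For example, with a made-up stand-in for the meat data (illustrative values only):

```r
# Made-up stand-in for the meat data (values are illustrative only)
meat <- data.frame(
  time = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8),
  ph   = c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
)

head(meat)       # first six rows
tail(meat)       # last six rows
head(meat, 3)    # a second argument changes the number of rows shown
head(meat$ph)    # on a vector, the first six elements
```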

Overlaying observations and predicted values: plot() followed by lines()

We have previously used plot() to plot data. You can follow plot() with commands that add to the plot. The lines() function draws lines by connecting the dots, so it usually helps to have the observations in sorted order. The arguments to lines() are the X and Y information. Additional arguments modify the appearance of the line. The first lines() command adds the fitted line, drawn as a solid black line. The second and third add the lower and upper prediction limits as dashed lines (lty=2, where lty is “line type”). The default line type is lty=1.

The legend() function adds a legend to the plot. The first argument is the location. All the rest can be in any order and specify what to put on the plot. I prefer legends without boxes; bty='n' (box type) suppresses the box. I want legends for the two line types, so I specify two line types in lty= and my legend as a vector of character strings in legend=. If you used pch= and legend=, you would label points in the legend. The help file for legend gives you a long list of options.
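Putting plot(), lines() and legend() together might look like the following sketch. The data values are made up, the object names are hypothetical, and the legend position and label text are choices made for this illustration rather than those in meat.r.

```r
# Made-up stand-in for the meat data (values are illustrative only)
meat <- data.frame(
  time = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8),
  ph   = c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
)
meat$logtime <- log(meat$time)
fit <- lm(ph ~ logtime, data = meat)

# Sorted X values, so lines() connects the points left to right
new <- data.frame(time = seq(1, 8, 0.25))
new$logtime <- log(new$time)
pred <- predict(fit, newdata = new, interval = "prediction")

# Observations first; ylim leaves room for the prediction limits
plot(meat$time, meat$ph, xlab = "time", ylab = "ph",
     ylim = range(pred, meat$ph))
lines(new$time, pred[, "fit"])             # fitted line, solid (lty = 1)
lines(new$time, pred[, "lwr"], lty = 2)    # lower prediction limit, dashed
lines(new$time, pred[, "upr"], lty = 2)    # upper prediction limit, dashed

legend("topright", bty = "n", lty = c(1, 2),
       legend = c("Fitted line", "Prediction limits"))
```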

ANOVA lack of fit test: anova()

To construct the ANOVA lack of fit test, we need to create a copy of the X variable that is a factor variable. time.f is that factor version of the time variable. We will use this variable to indicate groups, each with its own mean, so it doesn’t matter whether the groups are defined by time or logtime.

We can compare the fit of the regression and the fit of the ANOVA model in two different ways.

• Fit both models, then use anova() to compare the two fits. That is done by providing two models to anova(). If you wanted to sequentially compare more than two models (e.g., a simple linear regression, a quadratic regression, then an ANOVA fit), you provide three arguments to anova().
• Fit one model with both terms, e.g., lm(ph ~ logtime + time.f), with a + separating the two terms. Passing the output from that fit to anova() gives you the sequential change in fit as each term is added to the model.

Note: If the first variable in the model is the factor variable, R refuses to fit the second term. If you specify two models to anova() and the factor variable is first, the changes in df and SS are both negative. Fix by reversing the order of the variables in the lm() formula or the order of the models in anova().
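Both routes to the lack of fit test can be sketched as follows. The data values are made up for illustration (they are not the course data), and the object names fit.reg, fit.aov, lof1 and lof2 are hypothetical.

```r
# Made-up stand-in for the meat data: two observations at each of five times
meat <- data.frame(
  time = c(1, 1, 2, 2, 4, 4, 6, 6, 8, 8),
  ph   = c(7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36)
)
meat$logtime <- log(meat$time)
meat$time.f  <- factor(meat$time)   # factor copy of the X variable

# Way 1: fit both models, regression first, then compare them
fit.reg <- lm(ph ~ logtime, data = meat)
fit.aov <- lm(ph ~ time.f, data = meat)
lof1 <- anova(fit.reg, fit.aov)

# Way 2: one model with both terms, continuous variable first
lof2 <- anova(lm(ph ~ logtime + time.f, data = meat))

lof1   # the second row holds the lack of fit F test
lof2   # the time.f row holds the same test
```

With the regression model or the continuous term listed first, the change in df and SS is positive and the two tables report the same F statistic for lack of fit.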
