Modifying and combining Data Sets - Computing in Statistics - Notes | STAT 517, Study notes of Statistics

Material Type: Notes; Professor: Hitchcock; Class: COMPUTING IN STATISTICS; Subject: Statistics; University: University of South Carolina - Columbia; Term: Fall 2009;

Typology: Study notes

Pre 2010

Uploaded on 10/01/2009

koofers-user-s1y
koofers-user-s1y 🇺🇸

4.5

(2)

10 documents

1 / 14

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
STAT 517: Delwiche/Slaughter Chapter 6 Hitchcock
Chapter 6: Modifying and Combining Data Sets
The SET statement is a powerful statement in the DATA step.
Its main use is to read in a previously created SAS data set (either in WORK or
another library) which can be modified and saved as a new data set.
DATA newdatasetname;
SET olddatasetname;
...
run;
University of South Carolina Page 1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe

Partial preview of the text

Download Modifying and combining Data Sets - Computing in Statistics - Notes | STAT 517 and more Study notes Statistics in PDF only on Docsity!

Chapter 6: Modifying and Combining Data Sets

• The SET statement is a powerful statement in the DATA step.

• Its main use is to read in a previously created SAS data set (either in WORK or

another library) which can be modified and saved as a new data set.

DATA newdatasetname;

SET olddatasetname;

run;

• We could stack (not concatenate) multiple data sets by listing several data sets in

the SET statement.

• If one data set contains variable(s) not included in the other data set(s), the observa-

tions from the other sets will have missing values for those variables in the combined data set.

• If input data sets are sorted by a specific variable, stacking them may not preserve

the sorting.

• To preserve the sorting, we can interleave the data sets with a BY statement (but

we must sort all data sets first).

Merging Summary Statistics and Data

• Often we want to merge summary statistics (either statistics for entire data set or

often for groups within the data set) with the observations themselves.

• First calculate summary statistics using PROC MEANS (after sorting if necessary)

• Output the summary statistics to another data set with an OUTPUT statement.

• Give the statistics meaningful names in this output data set.

• Use a MERGE statement to combine the original data with the OUTPUT data from

PROC MEANS.

Merging Summary Statistics and Data

• Once summary stats are merged with original data, we can calculate:

  1. centered data observations
  2. standardized data observations
  3. data expressed as a percentage of group sums

• This is done by transforming data through functions involving the summary statistics.

• This can be tricky in R , too (I have to use match() twice to make this work).

• Variables read from the summary data set with the first SET statement are retained

with all observations.

• This is a general trick for merging one (or a few) observations with many, where no

common variable exists.

• The UPDATE statement is similar to MERGE, but is typically used when a data set

changes over time–new variables are added, values of variables are changed for old observations, etc. (See pg. 184-185.)

Data Set Options

We have already seen many of these options incorporated into our earlier examples.

• System options specified in Options statement (affect SAS operation, often for-

matting)

• Statement options affect the running of a step.

• NOPRINT is often used when you use a PROC to create an output data set.

Example:

NOPRINT option in PROC MEANS

NOWINDOWS option in PROC REPORT

DATA =... option in any procedure

Creating several data sets with the OUTPUT statement

• A single DATA step can create several SAS data sets (this is a trick I don’t use

nearly enough).

• The DATA line must give multiple data set names:

DATA set1 set2 set3;

• The OUTPUT statement is often used with IF-THEN statements or within a DO

loop. Example:

IF... THEN OUTPUT set1;

ELSE OUTPUT set2;

• The OUTPUT statement can also be used to create several observations from one.

• It transforms “wide” data sets into “long” data sets.

• It is often used with repeated-measures data (several values observed for each in-

dividual)

• OUTPUT is also useful for generating function values.

• Used in a DO loop, OUTPUT will tell SAS to create an observation at each iteration

of the DO loop.

Automatic Variables in SAS

• During the DATA step, SAS creates temporary “automatic” variables. These are not

typically saved as part of the data set, but they can be used in the DATA step.

• N keeps track of the number of times SAS has looped through the DATA step

(i.e., the number of observations that have been read).

• It may be different from “obs #” if data has been “subsetted”

• The automatic variable ERROR is binary: 1 if observation has an error, 0 if no

error.

FIRST.groupvariable →= 1 for the first observation with a new value for

“groupvariable”, = 0 otherwise

LAST.groupvariable →= 1 for the last observation with a new value for

“groupvariable”, = 0 otherwise

• They can be useful for picking out the highest or lowest values for each level of

“groupvariable”; sort by the “groupvariable”, then use a subsetting IF along with a

BY statement to save only the first or last occurences for each level of “groupvari-

able”.