Download Modifying and combining Data Sets - Computing in Statistics - Notes | STAT 517 and more Study notes Statistics in PDF only on Docsity!
Chapter 6: Modifying and Combining Data Sets
• The SET statement is a powerful statement in the DATA step.
• Its main use is to read in a previously created SAS data set (either in WORK or
another library) which can be modified and saved as a new data set.
DATA newdatasetname;
SET olddatasetname;
run;
• We could stack (not concatenate) multiple data sets by listing several data sets in
the SET statement.
• If one data set contains variable(s) not included in the other data set(s), the observa-
tions from the other sets will have missing values for those variables in the combined data set.
• If input data sets are sorted by a specific variable, stacking them may not preserve
the sorting.
• To preserve the sorting, we can interleave the data sets with a BY statement (but
we must sort all data sets first).
Merging Summary Statistics and Data
• Often we want to merge summary statistics (either statistics for entire data set or
often for groups within the data set) with the observations themselves.
• First calculate summary statistics using PROC MEANS (after sorting if necessary)
• Output the summary statistics to another data set with an OUTPUT statement.
• Give the statistics meaningful names in this output data set.
• Use a MERGE statement to combine the original data with the OUTPUT data from
PROC MEANS.
Merging Summary Statistics and Data
• Once summary stats are merged with original data, we can calculate:
- centered data observations
- standardized data observations
- data expressed as a percentage of group sums
• This is done by transforming data through functions involving the summary statistics.
• This can be tricky in R , too (I have to use match() twice to make this work).
• Variables read from the summary data set with the first SET statement are retained
with all observations.
• This is a general trick for merging one (or a few) observations with many, where no
common variable exists.
• The UPDATE statement is similar to MERGE, but is typically used when a data set
changes over time–new variables are added, values of variables are changed for old observations, etc. (See pg. 184-185.)
Data Set Options
We have already seen many of these options incorporated into our earlier examples.
• System options specified in Options statement (affect SAS operation, often for-
matting)
• Statement options affect the running of a step.
• NOPRINT is often used when you use a PROC to create an output data set.
Example:
NOPRINT option in PROC MEANS
NOWINDOWS option in PROC REPORT
DATA =... option in any procedure
Creating several data sets with the OUTPUT statement
• A single DATA step can create several SAS data sets (this is a trick I don’t use
nearly enough).
• The DATA line must give multiple data set names:
DATA set1 set2 set3;
• The OUTPUT statement is often used with IF-THEN statements or within a DO
loop. Example:
IF... THEN OUTPUT set1;
ELSE OUTPUT set2;
• The OUTPUT statement can also be used to create several observations from one.
• It transforms “wide” data sets into “long” data sets.
• It is often used with repeated-measures data (several values observed for each in-
dividual)
• OUTPUT is also useful for generating function values.
• Used in a DO loop, OUTPUT will tell SAS to create an observation at each iteration
of the DO loop.
Automatic Variables in SAS
• During the DATA step, SAS creates temporary “automatic” variables. These are not
typically saved as part of the data set, but they can be used in the DATA step.
• N keeps track of the number of times SAS has looped through the DATA step
(i.e., the number of observations that have been read).
• It may be different from “obs #” if data has been “subsetted”
• The automatic variable ERROR is binary: 1 if observation has an error, 0 if no
error.
FIRST.groupvariable →= 1 for the first observation with a new value for
“groupvariable”, = 0 otherwise
LAST.groupvariable →= 1 for the last observation with a new value for
“groupvariable”, = 0 otherwise
• They can be useful for picking out the highest or lowest values for each level of
“groupvariable”; sort by the “groupvariable”, then use a subsetting IF along with a
BY statement to save only the first or last occurences for each level of “groupvari-
able”.