Download SAS Programming: Recoding Missing Values, Creating Observations, and Merging Data - Prof. and more Exams Statistics in PDF only on Docsity!
Week 07/08 Class Activities 1
File: week-07-08-05oct05.doc
Directory (hp/compaq):
C:\baileraj\Classes\Fall 2005\sta402\handouts
See also: Chapter-07-26may05.doc (*.pdf) on Blackboard
based on:
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2004\sta402\handouts\week-07-08-
13oct04.doc
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2003\sta402\handouts\week7-
08oct03.doc
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2003\sta402\handouts\week8-
15oct03.doc
SAS PROGRAMMING
* Arrays
* DO groups
* Statements: RETAIN, RENAME, LABEL, FORMAT, SUM
* Using formats in DATA steps
* Conditional execution
* More on missing values
Additional Ref: Cody, R. and Pass, R. (1995) SAS
Programming by Example. SAS
Institute Inc., Cary, NC. – Chapters 7 (“arrays”), 8 (“retain”), 5 (“SAS functions”)
ARRAYS
* look to use if writing the same set of code multiple times
* “arrays” can contain lists of variables
* “arrays” also good for restructuring data sets
Common example 1a: Recoding a set of variables^2
Suppose you have a data set “old_data” containing
Variables: a_var, b_var, var3, var4, var
(all numeric with missing values coded as -999)
Recode -999 as missing=.
data old_data;
input a_var b_var var3 var4 var5 @@;
datalines;
run;
data recode_ex; set old_data;
array all[5] a_var b_var var3 var4 var5;
do ii=1 to 5;
if all[ii] = -999 then all[ii]=.;
end;
drop ii;
/* can use either [], {}, () to reference array elements */
options nocenter nodate;
proc print;
run;
Obs a_var b_var var3 var4 var 1 1 2 3 4 5 2 6 7. 8 9 3 10 11 12. 14
/* alternative to get SAS to count array size &
drop ii; 4
proc print;
title ‘Recode 3: Using NUMERIC to select elements’;
run;
Recode 3: Using NUMERIC to select elements Obs char_var a_var b_var var3 var4 var 1 a 1 2 3 4 5 2 b 6 7. 8 9
3 c 10 11 12. 14
Recoding both numeric and character values using arrays
Data D5;
input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;
ARRAY num_array{*} NUMERIC;
ARRAY char_array{*} CHARACTER;
/* recode the numeric variables */
DO inum = 1 to dim(num_array);
if num_array{inum}=-999 then num_array{inum}=.;
END;
/* recode the character variables */
Do ichar = 1 to dim(char_array);
if char_array{ichar}="-999" then char_array{ichar}=" ";
END;
drop inum ichar;
datalines;
MrSmith -999 6 6 5 5 5 4 3
-999 F 7 -999 4 4 3 -999 2
proc print;
title "Recoding missing values using Arrays using DO loop";
run;
Recoding missing values using Arrays using DO loop
Obs name sex t1 t2 t3 t4 t5 time6 time_
1 MrSmith 6 6 5 5 5 4 3
2 F 7. 4 4 3. 2 5
C ommon example 1b: Recoding a set of variables (with array initialized)
data temps;
array tempF(4) tempF1-tempF4 (32,50,68,86);
array tempC(4) tempC1-tempC4;
do itemp = 1 to 4;
tempC(itemp) = 5/9*(tempF(itemp)-32);
end;
drop itemp;
proc print;
run;
temp temp temp temp temp temp temp temp
Obs F1 F2 F3 F4 C1 C2 C3 C
C ommon example 2: Creating multiple observations from a single observation
data one;
1-x4;
input x1 x2 x3 x4;
datalines;
data two; set one;
array xx[4] x1-x4;
do time=1 to 4;
x=xx[time];
output;
end;
drop x
run;
ta one; set multi; 7
[4] x1-x4;
lues kept from previous observation;
if FIRST.id=1 then do ii=1 to 4;
ized to missing;
=heart_rate;
t;
oc print;
se multiple records to one record’;
ltiple records to one record
da
by id;
array xx
retain x1-x4; * va
xx[ii]=.; * elements initial
end;
xx[time]
if LAST.id=1 then outpu
keep id x1-x4;
run;
pr
title ‘Conden
run;
Condense mu Obs id x1 x2 x3 x 1 1 60 62 64 68 2 2 80 84 90 9
Alternative using PROC TRANSPOSE
proc transpose data=multi out=one_tran(keep=id hr1-hr4)
prefix=HR; by id;
var heart_rate;
proc print data=one_tran;
title alternative using PROC TRANSPOSE;
run;
alternative using PROC TRANSPOSE HR
ommon example 4: Inputting values in ARRAY variable list
Obs id HR1 HR2 HR 1 1 60 62 64 68 2 2 80 84 90 98
C
7} ADL1-ADL7;
or
Data D2;
*} ADL1-ADL7;
or
Data D3;
*} t1 t2 t3 t4 t5 time6 time_7;
more complicated example: Randomization test for testing equality of 2 populations
Data D1;
ARRAY ADL{
input ADL1-ADL7;
datalines;
ARRAY ADL{
input ADL1-ADL7;
datalines;
ARRAY ADL{
input t1 t2 t3 t4 t5 time6 time_7;
datalines;
A
use PLAN to generate a set of indices for the randomization test
and then use TRANSPOSE to package the output
/* nitrofen data
concentrations 0 and 160 will be used to illustrate a randomization test
libname class 'D:\baileraj\Classes\Fall 2003\sta402\data';
data test; set class.nitrofen;
if conc= 0 | conc= 160 ;
proc ttest ;
title NITROFEN: t-test of ( 0 , 160 ) concentrations;
class conc;
var total;
run ;
proc print data=obs_test; 10
title ‘Randomization test: observed data’;
run;
Randomiza tion test: observed data _
x x x x x x x x x x x t x x x x x x x x x x x x x x x x x x x x y
N
A
O M
b E x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 2 p s _ 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 e
1 total 27 32 34 33 36 34 33 30 24 31 29 29 23 27 30 31 30 26 29 29 O */
proc plan ;
fa torsc test= 4000 ordered in= 20 ;
output out=d_permut;
run ;
proc transpose data=d_permut prefix=in out=out_permut(keep=in1-in20); by
test;
run ;
proc print data=out_permut;
run ;
data null; set obs_test;
file 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7-perm.data';
put type xx1-xx20;
run ;
data null; set out_permut;
type = 'P'; * permutation data;
file 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7-perm.data'
mod; /* mod option adds lines to existing file */
put type in1-in20;
run ;
/* week7-perm.data ...
O 27 32 34 33 36 34 33 30 24 31 29 29 23 27 30 31 30 26 29 29
P 8 14 4 11 3 2 12 1 6 13 17 9 15 16 5 19 20 7 10 18
P 12 2 8 10 13 7 9 16 4 19 15 3 5 14 17 1 20 11 6 18
P 18 17 13 14 5 8 19 16 3 12 11 9 10 7 2 20 4 6 1 15
P 6 12 4 20 19 16 11 5 15 18 1 8 3 13 17 14 10 9 7 2
P 8 17 4 19 2 11 1 7 6 3 9 13 20 14 12 18 15 10 5 16
P 11 7 17 6 18 13 3 12 8 10 19 16 2 20 4 5 15 1 9 14
P 17 11 4 7 20 6 9 16 1 2 14 12 5 18 10 8 15 13 3 19
data perm_data;
array both{ 20 } x1-x10 y1-y10; /* array for observed values */
array ins{ 20 } in1-in20; /* index array */
array perms{ 20 } xp1-xp10 yp1-yp10; /* array for permuted values */
infile 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7- 11
perm.data';
input type $ @;
if type='O' then do;
input x1-x10 y1-y10;
obs_diff = mean(of x1-x10) - mean(of y1-y10);
retain obs_diff x1-x10 y1-y10;
end;
else do;
input in1-in20;
do ii = 1 to 20 ;
perms{ii} = both{ ins[ii] };
end;
perm_diff = mean(of xp1-xp10) - mean(of yp1-yp10);
perm_ge = (perm_diff >= obs_diff); * 1-tailed;
perm_2tail = (abs(perm_diff) >= abs(obs_diff)); * 2-tailed;
keep obs_diff perm_diff perm_ge perm_2tail;
* keep in1-in20 xp1-xp10 yp1-yp10 obs_diff perm_diff perm_ge;
output;
end;
proc print ;
run ;
proc freq data=perm_data;
title ‘NITROFEN: randomization test -> upper tail P-value’;
table perm_ge perm_2tail;
run;
NI TROFEN: randomization test -> upper tail P-value
e FREQ Procedure Cumulative Cumulative
< - - - P(upper) = 0.
Cumulative Cumulative
< - - - P(2 tail) = 0.
nd more complicated example: Randomization test 2 for spatial randomness
Th
perm_ge Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 3927 98.18 3927 98. 1 73 1.83 4000 100.
perm_2tail Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 3847 96.18 3847 96. 1 153 3.83 4000 100.
A 2
gnnobs = sumnnobs/4; * observed average NN distance;
talines;
oc print;
ta mccsr1; set plot1;
array nnobs nnobs1-nnobs4;
Generate a large number of CSR plots with 4 trees */
initialize counters of nn avg dist le or ge than observed;
isim = 1 to 1000;
do ii = 1 to 4;
ni(0);
Find NN distance for the simulated trees */
sumnncsr = 0;
00; * initialize;
do j=1 to 4;
m(i)-xsim(j))2 + (ysim(i)-ysim(j))2 );
sumnncsr=sumnncsr+nncsr(i);
end;
av
da
pr
da
array xobs xobs1-xobs4;
array yobs yobs1-yobs4;
array xsim xsim1-xsim4;
array ysim ysim1-ysim4;
array nncsr nncsr1-nncsr4;
/* CSR = completely spatially random */
numle = 0; numge = 0;
do
xsim(ii) = ranu
ysim(ii) = ranuni(0);
end;
do i=1 to 4;
nncsr(i) = 1
d=sqrt( (xsi
if (d<nncsr(i)) and (d>0) then nncsr(i)=d;
* output; * debugging;
end;
end;
avgnncsr = sumnncsr/4; 14
late counts of patterns consistent with regularity/aggreg.
ile = (avgnncsr <= avgnnobs);
numle = numle + ile;
drop i j ii xobs1-xobs4 yobs1-yobs4 nnobs1-nnobs
d; * if the isim - simulation loop;
proc print;
oc freq;
ige;
The FREQ Procedure
Cumulative Cumulative
Cumulative Cumulative
Accumu
ige = (avgnncsr >= avgnnobs);
numge = numge + ige;
sumnnobs sumnncsr;
output;
en
pr
table ile
run;
ile Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 46 4.60 46 4. 1 954 95.40 1000 100.
ige Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 954 95.40 954 95. 1 46 4.60 1000 100.
RETAIN
* tough to perform calculations across observations
alue to a variable but to remember a
* SAS normally initializes each variable to missing
* RETAIN instructs system not to assign a missing v
different value
data retain_demo1;
input dobs time x;
example: find the average weight by subject using
DATA step programming
/* STEP 1: read in the data file */
data diet;
input id @3 date mmddyy8. weight;
format date mmddyy8.;
datalines;
proc print;
title ‘diet data’;
run;
diet data Obs id date weight 1 1 10/01/92 155 2 1 10/08/92 158 3 1 10/15/92 158 4 1 10/22/92 158 5 2 09/02/92 200 6 2 09/09/92 198 7 2 09/16/92 196
8 2 09/23/92 202
data diet2; set diet;
if n= 1 then total=weight;
else if id = lag(id) then total=total+weight;
else if id NE lag(id) then total = weight; 17
proc print;
run;
Obs id date weight total 1 1 10/01/92 155 155 2 1 10/08/92 158 158 3 1 10/15/92 158. 4 1 10/22/92 158. 5 2 09/02/92 200 200 6 2 09/09/92 198. 7 2 09/16/92 196. 8 2 09/23/92 202.
/* STEP 2:
accumulate the total weight measurements for an individual ID
data diet3; set diet;
retain total 0 ;
if id = lag(id) then total=total+weight;
else if id NE lag(id) then total = weight;
proc print ;
run ;
Obs id date weight total 1 1 10/01/92 155 155 2 1 10/08/92 158 313 3 1 10/15/92 158 471 4 1 10/22/92 158 629 5 2 09/02/92 200 200 6 2 09/09/92 198 398 7 2 09/16/92 196 594 8 2 09/23/92 202 796
STEP 3:
Calculate the average weight and output the desired data set
proc sort data=diet3; by id;
data diet4; set diet3; by id;
if LAST.id; * special variable LAST.id=1 if last value in BY;
wt_avg = total/4;
keep id wt_avg;
proc print ;
run ;
Obs id wt_avg 1 1 157.
data test;
input id xstart xstop;
datalines;
proc print;
run;
data test2; set test; by id;
array start{9} start1-start9;
array stop{9} stop1-stop9;
array times{9} times1-times9;
retain count 0;
retain start1-start9 stop1-stop9 times1-times9;
if FIRST.id=1 then do; * initialize count and arrays with new ID;
count = 0;
do ii=1 to 9;
start{ii} = .;
stop{ii} = .;
times{ii} = .;
end;
end;
count = count + 1;
start{count} = xstart;
stop{count} = xstop;
times{count} = xstop - xstart;
if LAST.id=1 then output; * output results if last obs for ID;
drop xstart xstop ii;
run;
data test3; set test2;
total_time = sum(of times1-times9);
run;
proc print;
run;
material from
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2003\sta402\handouts\week8-
15oct03.doc
COMBINING AND MANAGING SAS DATA SETS
* SET statement for concatenation and interleaving 20
* MERGE statement for joining observations
* UPDATE statement for updating a master file (maybe)
* Special variables: IN, END, FIRST, and LAST
* Creating multiple data sets in one DATA step
* Reshaping data sets
* Managing data sets using PROC COPY and PROC DATASETS
* Transporting data sets between hosts
Reference:
Delwiche LD and Slaughter SJ. 1996. The Little SAS Book: A Primer , 2nd edition. SAS
Institute. Cary, NC – Chapter 5.
Additional Ref: Cody, R. and Pass, R. (1995) SAS
Programming by Example. SAS
Institute Inc., Cary, NC. – Chapters 3 (“set/merge/update”)
Temporary versus Permanent SAS data sets (Delwiche and Slaughter – Ch. 2.9)
* if you use a data set more than once, it may be more efficient to save it as a permanent
data set.
* SAS data set names all have two levels – the first level is its LIBREF (SAS data library
referenced) and the second is the MEMBER name that identifies the data set within the
library.
* LIBREF points to a particular location – often a physical location (e.g. disk) or a
logical notation (e.g. directory).
/* example */
libname class 'D:\baileraj\Classes\Fall 2003\sta402\data’;
data nitrofen_A; set class.nitrofen;
brood=1; count=brood1; conc=conc; output;
brood=2; count=brood2; conc=conc; output;
brood=3; count=brood3; conc=conc; output;
keep brood count conc;
* in the example above, two data sets are defined/referenced -
LIBREF = “class” and MEMBER = “nitrofen”