SAS Programming: Recoding Missing Values, Creating Observations, and Merging Data - Prof. , Exams of Statistics

Sas code examples demonstrating various data manipulation techniques, including recoding missing values using arrays, creating multiple observations from a single observation, and merging multiple data sets. These examples can be useful for students and researchers working with sas data analysis.

Typology: Exams

Pre 2010

Uploaded on 08/18/2009

koofers-user-c70
koofers-user-c70 🇺🇸

9 documents

1 / 37

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Week 07/08 Class Activities 1
File: week-07-08-05oct05.doc
Directory (hp/compaq):
C:\baileraj\Classes\Fall 2005\sta402\handouts
See also: Chapter-07-26may05.doc (*.pdf) on Blackboard
based on:
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2004\sta402\handouts\week-07-08-
13oct04.doc
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2003\sta402\handouts\week7-
08oct03.doc
&
C:\Documents and Settings\John Bailer\My
Documents\baileraj\Classes\Fall 2003\sta402\handouts\week8-
15oct03.doc
SAS PROGRAMMING
* Arrays
* DO groups
* Statements: RETAIN, RENAME, LABEL, FORMAT, SUM
* Using formats in DATA steps
* Conditional execution
* More on missing values
Additional Ref: Cody, R. and Pass, R. (1995) SAS® Programming by Example. SAS
Institute Inc., Cary, NC. – Chapters 7 (“arrays”), 8 (“retain”), 5 (“SAS functions”)
ARRAYS
* look to use if writing the same set of code multiple times
* “arrays” can contain lists of variables
* “arrays” also good for restructuring data sets
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25

Partial preview of the text

Download SAS Programming: Recoding Missing Values, Creating Observations, and Merging Data - Prof. and more Exams Statistics in PDF only on Docsity!

Week 07/08 Class Activities 1

File: week-07-08-05oct05.doc

Directory (hp/compaq):

C:\baileraj\Classes\Fall 2005\sta402\handouts

See also: Chapter-07-26may05.doc (*.pdf) on Blackboard

based on:

C:\Documents and Settings\John Bailer\My

Documents\baileraj\Classes\Fall 2004\sta402\handouts\week-07-08-

13oct04.doc

C:\Documents and Settings\John Bailer\My

Documents\baileraj\Classes\Fall 2003\sta402\handouts\week7-

08oct03.doc

C:\Documents and Settings\John Bailer\My

Documents\baileraj\Classes\Fall 2003\sta402\handouts\week8-

15oct03.doc

SAS PROGRAMMING

* Arrays

* DO groups

* Statements: RETAIN, RENAME, LABEL, FORMAT, SUM

* Using formats in DATA steps

* Conditional execution

* More on missing values

Additional Ref: Cody, R. and Pass, R. (1995) SAS

Programming by Example. SAS

Institute Inc., Cary, NC. – Chapters 7 (“arrays”), 8 (“retain”), 5 (“SAS functions”)

ARRAYS

* look to use if writing the same set of code multiple times

* “arrays” can contain lists of variables

* “arrays” also good for restructuring data sets

Common example 1a: Recoding a set of variables^2

Suppose you have a data set “old_data” containing

Variables: a_var, b_var, var3, var4, var

(all numeric with missing values coded as -999)

Recode -999 as missing=.

data old_data;

input a_var b_var var3 var4 var5 @@;

datalines;

run;

data recode_ex; set old_data;

array all[5] a_var b_var var3 var4 var5;

do ii=1 to 5;

if all[ii] = -999 then all[ii]=.;

end;

drop ii;

/* can use either [], {}, () to reference array elements */

options nocenter nodate;

proc print;

run;

Obs a_var b_var var3 var4 var 1 1 2 3 4 5 2 6 7. 8 9 3 10 11 12. 14

/* alternative to get SAS to count array size &

drop ii; 4

proc print;

title ‘Recode 3: Using NUMERIC to select elements’;

run;

Recode 3: Using NUMERIC to select elements Obs char_var a_var b_var var3 var4 var 1 a 1 2 3 4 5 2 b 6 7. 8 9

3 c 10 11 12. 14

Recoding both numeric and character values using arrays

Data D5;

input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;

ARRAY num_array{*} NUMERIC;

ARRAY char_array{*} CHARACTER;

/* recode the numeric variables */

DO inum = 1 to dim(num_array);

if num_array{inum}=-999 then num_array{inum}=.;

END;

/* recode the character variables */

Do ichar = 1 to dim(char_array);

if char_array{ichar}="-999" then char_array{ichar}=" ";

END;

drop inum ichar;

datalines;

MrSmith -999 6 6 5 5 5 4 3

-999 F 7 -999 4 4 3 -999 2

proc print;

title "Recoding missing values using Arrays using DO loop";

run;

Recoding missing values using Arrays using DO loop

Obs name sex t1 t2 t3 t4 t5 time6 time_

1 MrSmith 6 6 5 5 5 4 3

2 F 7. 4 4 3. 2 5

C ommon example 1b: Recoding a set of variables (with array initialized)

data temps;

array tempF(4) tempF1-tempF4 (32,50,68,86);

array tempC(4) tempC1-tempC4;

do itemp = 1 to 4;

tempC(itemp) = 5/9*(tempF(itemp)-32);

end;

drop itemp;

proc print;

run;

temp temp temp temp temp temp temp temp

Obs F1 F2 F3 F4 C1 C2 C3 C

C ommon example 2: Creating multiple observations from a single observation

data one;

1-x4;

input x1 x2 x3 x4;

datalines;

data two; set one;

array xx[4] x1-x4;

do time=1 to 4;

x=xx[time];

output;

end;

drop x

run;

ta one; set multi; 7

[4] x1-x4;

lues kept from previous observation;

if FIRST.id=1 then do ii=1 to 4;

ized to missing;

=heart_rate;

t;

oc print;

se multiple records to one record’;

ltiple records to one record

da

by id;

array xx

retain x1-x4; * va

xx[ii]=.; * elements initial

end;

xx[time]

if LAST.id=1 then outpu

keep id x1-x4;

run;

pr

title ‘Conden

run;

Condense mu Obs id x1 x2 x3 x 1 1 60 62 64 68 2 2 80 84 90 9

Alternative using PROC TRANSPOSE

proc transpose data=multi out=one_tran(keep=id hr1-hr4)

prefix=HR; by id;

var heart_rate;

proc print data=one_tran;

title alternative using PROC TRANSPOSE;

run;

alternative using PROC TRANSPOSE HR

ommon example 4: Inputting values in ARRAY variable list

Obs id HR1 HR2 HR 1 1 60 62 64 68 2 2 80 84 90 98

C

7} ADL1-ADL7;

or

Data D2;

*} ADL1-ADL7;

or

Data D3;

*} t1 t2 t3 t4 t5 time6 time_7;

more complicated example: Randomization test for testing equality of 2 populations

Data D1;

ARRAY ADL{

input ADL1-ADL7;

datalines;

ARRAY ADL{

input ADL1-ADL7;

datalines;

ARRAY ADL{

input t1 t2 t3 t4 t5 time6 time_7;

datalines;

A

use PLAN to generate a set of indices for the randomization test

and then use TRANSPOSE to package the output

/* nitrofen data

concentrations 0 and 160 will be used to illustrate a randomization test

libname class 'D:\baileraj\Classes\Fall 2003\sta402\data';

data test; set class.nitrofen;

if conc= 0 | conc= 160 ;

proc ttest ;

title NITROFEN: t-test of ( 0 , 160 ) concentrations;

class conc;

var total;

run ;

proc print data=obs_test; 10

title ‘Randomization test: observed data’;

run;

Randomiza tion test: observed data _

x x x x x x x x x x x t x x x x x x x x x x x x x x x x x x x x y

N

A

O M

b E x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 2 p s _ 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 e

1 total 27 32 34 33 36 34 33 30 24 31 29 29 23 27 30 31 30 26 29 29 O */

proc plan ;

fa torsc test= 4000 ordered in= 20 ;

output out=d_permut;

run ;

proc transpose data=d_permut prefix=in out=out_permut(keep=in1-in20); by

test;

run ;

proc print data=out_permut;

run ;

data null; set obs_test;

file 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7-perm.data';

put type xx1-xx20;

run ;

data null; set out_permut;

type = 'P'; * permutation data;

file 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7-perm.data'

mod; /* mod option adds lines to existing file */

put type in1-in20;

run ;

/* week7-perm.data ...

O 27 32 34 33 36 34 33 30 24 31 29 29 23 27 30 31 30 26 29 29

P 8 14 4 11 3 2 12 1 6 13 17 9 15 16 5 19 20 7 10 18

P 12 2 8 10 13 7 9 16 4 19 15 3 5 14 17 1 20 11 6 18

P 18 17 13 14 5 8 19 16 3 12 11 9 10 7 2 20 4 6 1 15

P 6 12 4 20 19 16 11 5 15 18 1 8 3 13 17 14 10 9 7 2

P 8 17 4 19 2 11 1 7 6 3 9 13 20 14 12 18 15 10 5 16

P 11 7 17 6 18 13 3 12 8 10 19 16 2 20 4 5 15 1 9 14

P 17 11 4 7 20 6 9 16 1 2 14 12 5 18 10 8 15 13 3 19

data perm_data;

array both{ 20 } x1-x10 y1-y10; /* array for observed values */

array ins{ 20 } in1-in20; /* index array */

array perms{ 20 } xp1-xp10 yp1-yp10; /* array for permuted values */

infile 'D:\baileraj\Classes\Fall 2003\sta402\SAS-programs\week7- 11

perm.data';

input type $ @;

if type='O' then do;

input x1-x10 y1-y10;

obs_diff = mean(of x1-x10) - mean(of y1-y10);

retain obs_diff x1-x10 y1-y10;

end;

else do;

input in1-in20;

do ii = 1 to 20 ;

perms{ii} = both{ ins[ii] };

end;

perm_diff = mean(of xp1-xp10) - mean(of yp1-yp10);

perm_ge = (perm_diff >= obs_diff); * 1-tailed;

perm_2tail = (abs(perm_diff) >= abs(obs_diff)); * 2-tailed;

keep obs_diff perm_diff perm_ge perm_2tail;

* keep in1-in20 xp1-xp10 yp1-yp10 obs_diff perm_diff perm_ge;

output;

end;

proc print ;

run ;

proc freq data=perm_data;

title ‘NITROFEN: randomization test -> upper tail P-value’;

table perm_ge perm_2tail;

run;

NI TROFEN: randomization test -> upper tail P-value

e FREQ Procedure Cumulative Cumulative

< - - - P(upper) = 0.

Cumulative Cumulative

< - - - P(2 tail) = 0.

nd more complicated example: Randomization test 2 for spatial randomness

Th

perm_ge Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 3927 98.18 3927 98. 1 73 1.83 4000 100.

perm_2tail Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 3847 96.18 3847 96. 1 153 3.83 4000 100.

A 2

gnnobs = sumnnobs/4; * observed average NN distance;

talines;

oc print;

ta mccsr1; set plot1;

array nnobs nnobs1-nnobs4;

Generate a large number of CSR plots with 4 trees */

initialize counters of nn avg dist le or ge than observed;

isim = 1 to 1000;

do ii = 1 to 4;

ni(0);

Find NN distance for the simulated trees */

sumnncsr = 0;

00; * initialize;

do j=1 to 4;

m(i)-xsim(j))2 + (ysim(i)-ysim(j))2 );

sumnncsr=sumnncsr+nncsr(i);

end;

av

da

pr

da

array xobs xobs1-xobs4;

array yobs yobs1-yobs4;

array xsim xsim1-xsim4;

array ysim ysim1-ysim4;

array nncsr nncsr1-nncsr4;

/* CSR = completely spatially random */

numle = 0; numge = 0;

do

xsim(ii) = ranu

ysim(ii) = ranuni(0);

end;

do i=1 to 4;

nncsr(i) = 1

d=sqrt( (xsi

if (d<nncsr(i)) and (d>0) then nncsr(i)=d;

* output; * debugging;

end;

end;

avgnncsr = sumnncsr/4; 14

late counts of patterns consistent with regularity/aggreg.

ile = (avgnncsr <= avgnnobs);

numle = numle + ile;

drop i j ii xobs1-xobs4 yobs1-yobs4 nnobs1-nnobs

d; * if the isim - simulation loop;

proc print;

oc freq;

ige;

The FREQ Procedure

Cumulative Cumulative

Cumulative Cumulative

Accumu

ige = (avgnncsr >= avgnnobs);

numge = numge + ige;

sumnnobs sumnncsr;

output;

en

pr

table ile

run;

ile Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 46 4.60 46 4. 1 954 95.40 1000 100.

ige Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 0 954 95.40 954 95. 1 46 4.60 1000 100.

RETAIN

* tough to perform calculations across observations

alue to a variable but to remember a

* SAS normally initializes each variable to missing

* RETAIN instructs system not to assign a missing v

different value

data retain_demo1;

input dobs time x;

example: find the average weight by subject using

DATA step programming

/* STEP 1: read in the data file */

data diet;

input id @3 date mmddyy8. weight;

format date mmddyy8.;

datalines;

proc print;

title ‘diet data’;

run;

diet data Obs id date weight 1 1 10/01/92 155 2 1 10/08/92 158 3 1 10/15/92 158 4 1 10/22/92 158 5 2 09/02/92 200 6 2 09/09/92 198 7 2 09/16/92 196

8 2 09/23/92 202

data diet2; set diet;

if n= 1 then total=weight;

else if id = lag(id) then total=total+weight;

else if id NE lag(id) then total = weight; 17

proc print;

run;

Obs id date weight total 1 1 10/01/92 155 155 2 1 10/08/92 158 158 3 1 10/15/92 158. 4 1 10/22/92 158. 5 2 09/02/92 200 200 6 2 09/09/92 198. 7 2 09/16/92 196. 8 2 09/23/92 202.

/* STEP 2:

accumulate the total weight measurements for an individual ID

data diet3; set diet;

retain total 0 ;

if id = lag(id) then total=total+weight;

else if id NE lag(id) then total = weight;

proc print ;

run ;

Obs id date weight total 1 1 10/01/92 155 155 2 1 10/08/92 158 313 3 1 10/15/92 158 471 4 1 10/22/92 158 629 5 2 09/02/92 200 200 6 2 09/09/92 198 398 7 2 09/16/92 196 594 8 2 09/23/92 202 796

STEP 3:

Calculate the average weight and output the desired data set

proc sort data=diet3; by id;

data diet4; set diet3; by id;

if LAST.id; * special variable LAST.id=1 if last value in BY;

wt_avg = total/4;

keep id wt_avg;

proc print ;

run ;

Obs id wt_avg 1 1 157.

data test;

input id xstart xstop;

datalines;

proc print;

run;

data test2; set test; by id;

array start{9} start1-start9;

array stop{9} stop1-stop9;

array times{9} times1-times9;

retain count 0;

retain start1-start9 stop1-stop9 times1-times9;

if FIRST.id=1 then do; * initialize count and arrays with new ID;

count = 0;

do ii=1 to 9;

start{ii} = .;

stop{ii} = .;

times{ii} = .;

end;

end;

count = count + 1;

start{count} = xstart;

stop{count} = xstop;

times{count} = xstop - xstart;

if LAST.id=1 then output; * output results if last obs for ID;

drop xstart xstop ii;

run;

data test3; set test2;

total_time = sum(of times1-times9);

run;

proc print;

run;

material from

C:\Documents and Settings\John Bailer\My

Documents\baileraj\Classes\Fall 2003\sta402\handouts\week8-

15oct03.doc

COMBINING AND MANAGING SAS DATA SETS

* SET statement for concatenation and interleaving 20

* MERGE statement for joining observations

* UPDATE statement for updating a master file (maybe)

* Special variables: IN, END, FIRST, and LAST

* Creating multiple data sets in one DATA step

* Reshaping data sets

* Managing data sets using PROC COPY and PROC DATASETS

* Transporting data sets between hosts

Reference:

Delwiche LD and Slaughter SJ. 1996. The Little SAS Book: A Primer , 2nd edition. SAS

Institute. Cary, NC – Chapter 5.

Additional Ref: Cody, R. and Pass, R. (1995) SAS

Programming by Example. SAS

Institute Inc., Cary, NC. – Chapters 3 (“set/merge/update”)

Temporary versus Permanent SAS data sets (Delwiche and Slaughter – Ch. 2.9)

* if you use a data set more than once, it may be more efficient to save it as a permanent

data set.

* SAS data set names all have two levels – the first level is its LIBREF (SAS data library

referenced) and the second is the MEMBER name that identifies the data set within the

library.

* LIBREF points to a particular location – often a physical location (e.g. disk) or a

logical notation (e.g. directory).

/* example */

libname class 'D:\baileraj\Classes\Fall 2003\sta402\data’;

data nitrofen_A; set class.nitrofen;

brood=1; count=brood1; conc=conc; output;

brood=2; count=brood2; conc=conc; output;

brood=3; count=brood3; conc=conc; output;

keep brood count conc;

* in the example above, two data sets are defined/referenced -

LIBREF = “class” and MEMBER = “nitrofen”