SAS Programming: Arrays and Missing Values - Prof. A. John Bailer, Exams of Statistics

A part of a statistical programming in sas book by bailer. It covers the use of arrays to define and input values, changing missing value codes, and recoding missing values in numeric and character variables. It also includes examples of expanding a single record into multiple records and simulating plots under csr.

Typology: Exams

Pre 2010

Uploaded on 08/19/2009

koofers-user-gc5
koofers-user-gc5 🇺🇸

10 documents

1 / 16

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Statistical Programming in SAS Bailer
Week 09-10 [17+ Nov.] Class Activities – ARRAYS and RETAIN
File: week-09-10-Arrays-16nov08.doc
Directory: \\Muserver2\USERS\B\\baileraj\Classes\sta402\handouts
Chapter 8 - Programming in a DATA step
8.1. Storage bins for collections of values - ARRAYS
8.1.1 Defining values in ARRAY variable list directly
8.1.2 Inputting values in ARRAY variable list
8.1.3 Changing missing value codes for numeric variables to “.”
8.1.4 Recoding missing values for all numeric and character variables
8.1.5 Creating multiple observations from a single record
8.2 Case Study 1: Monte Carlo P-value for test of spatial randomness [omit]
8.3 Remembering variable values across observations – RETAIN
8.3.1 Processing multiple observations for an individual
8.4 Case Study 2: Randomization test for the equality of two populations [omit]
Summary
References
Exercises
8.1. Storage bins for collections of values - ARRAYS
Array = structure that is often used to contain a collection of
variables.
Basic syntax for defining an array: ARRAY arrayname{number
of array elements} variable-list. (Aside: {}, [ ] and ( ) can all be
used in array definitions.)
8.1.1 Defining values in ARRAY variable list directly
Display 8.1: Converting temperature scales using arrays
data temps;
array tempF(4) tempF1-tempF4 (32,50,68,86);
array tempC(4) tempC1-tempC4;
do itemp = 1 to 4;
tempC(itemp) = 5/9*(tempF(itemp)-32);
end;
drop itemp;
proc print;
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download SAS Programming: Arrays and Missing Values - Prof. A. John Bailer and more Exams Statistics in PDF only on Docsity!

Week 09-10 [17+ Nov.] Class Activities – ARRAYS and RETAIN

File: week-09-10-Arrays-16nov08.doc

Directory: \Muserver2\USERS\B\baileraj\Classes\sta402\handouts

Chapter 8 - Programming in a DATA step

8.1. Storage bins for collections of values - ARRAYS 8.1.1 Defining values in ARRAY variable list directly 8.1.2 Inputting values in ARRAY variable list 8.1.3 Changing missing value codes for numeric variables to “.” 8.1.4 Recoding missing values for all numeric and character variables 8.1.5 Creating multiple observations from a single record 8.2 Case Study 1: Monte Carlo P-value for test of spatial randomness [omit] 8.3 Remembering variable values across observations – RETAIN 8.3.1 Processing multiple observations for an individual 8.4 Case Study 2: Randomization test for the equality of two populations [omit] Summary References Exercises

8.1. Storage bins for collections of values - ARRAYS

Array = structure that is often used to contain a collection of

variables.

Basic syntax for defining an array: ARRAY arrayname { number

of array elements } variable-list. (Aside: {}, [ ] and ( ) can all be

used in array definitions.)

8.1.1 Defining values in ARRAY variable list directly

Display 8.1: Converting temperature scales using arrays

data temps; array tempF(4) tempF1-tempF4 (32,50,68,86); array tempC(4) tempC1-tempC4;

do itemp = 1 to 4; tempC(itemp) = 5/9*(tempF(itemp)-32); end; drop itemp;

proc print;

run;

The printout from Display 8.1 includes the following output

line:

temp temp temp temp temp temp temp temp Obs F1 F2 F3 F4 C1 C2 C3 C 1 32 50 68 86 0 10 20 30

8.1.2 Example 8.2: Inputting values in ARRAY variable list

Reading variable values into an array from datalines or an

external file.

To illustrate, suppose the number of activities of daily living

(ADL) that each resident of a nursing home could conduct was

measured during 7 consecutive quarters.

ADL[1] ADL[2] ADL[3] ADL[4] ADL[5] ADL[6] ADL[7]

Data D1; ARRAY ADL{7} ADL1-ADL7; input ADL1-ADL7; datalines; 6 6 5 5 5 4 3 ;

Can also have the size of the array defined by counting the

elements in the array by using the asterisk in “{*}.”

Data D2; ARRAY ADL{*} ADL1-ADL7; input ADL1-ADL7; datalines; 6 6 5 5 5 4 3 ;

Can define an array in terms of a collection of variable names.

This is illustrated in the code snippet:

Data D3; ARRAY ADL{*} t1 t2 t3 t4 t5 time6 time_7; input t1 t2 t3 t4 t5 time6 time_7;

8.1.4 Recoding missing values for all numeric and character

variables

Display 8.3: Recoding missing values in numeric and character

variables using arrays

Data D5; input name $ sex $ t1 t2 t3 t4 t5 time6 time_7;

ARRAY num_array{} NUMERIC; ARRAY char_array{} CHARACTER;

/* recode the numeric variables */ DO inum = 1 to dim(num_array); if num_array{inum}=-999 then num_array{inum}=.; END;

/* recode the character variables */ Do ichar = 1 to dim(char_array); if char_array{ichar}="-999" then char_array{ichar}=" "; END;

drop inum ichar; datalines; MrSmith -999 6 6 5 5 5 4 3 -999 F 7 -999 4 4 3 -999 2 ; proc print; title "Recoding missing values using Arrays using DO loop"; run;

This program produces the following output:

Recoding missing values using Arrays using DO loop Obs name sex t1 t2 t3 t4 t5 time6 time_ 1 MrSmith 6 6 5 5 5 4 3 2 F 7. 4 4 3. 2

8.1.5 Creating multiple observations from a single record

Different procedures in SAS may expect data to have different

structures.

Some of the multivariate procedures expect all of the variables

to be defined in a single record.

For mixed models applied to longitudinal data, different

measurements are expected to appear in different records.

It is important to be able to program the expansion of a single

record to multiple records or the condensing of multiple records

into a single record. We focus on the first of these tasks here.

Suppose we have data in the following form:

Name Sex t1 t2 t3 t4 t5 time6 time_ Smith M 6 6 5 5 5 4 3 Jones F 7 5 4 4 3 2 1 Fisher M 5 5 5 3 2 2 1 that we wish to transform into

Name Sex Time ADL Smith M 1 6 Smith M 2 6 Smith M 3 5 Smith M 4 5 Smith M 5 5 Smith M 6 4 Smith M 7 3 Jones F 1 7 Jones F 2 5 Jones F 3 4 Jones F 4 4 Jones F 5 3 Jones F 6 2 Jones F 7 1 Fisher M 1 5 Fisher M 2 5 Fisher M 3 5 Fisher M 4 3 Fisher M 5 2 Fisher M 6 2 Fisher M 7 1

Display 8.4: Expanding a single record into multiple records

  1. Repeat steps 3 and 4 a large number of times.
  2. The Monte Carlo P-value is the proportional of generated samples that were more extreme than observed. In the case of testing for clustering, this involves testing to see if the average nearest neighbor distance from simulated data is SMALLER than observed in the original configuration. The SAS program implementing this analysis is given in Display 8.5.

Display 8.5: Calculating a Monte Carlo P-value for testing average nearest neighbor distances /* enter the observed data into data arrays */ data plot_obs; array xobs(4) xobs1-xobs4; array yobs(4) yobs1-yobs4; do idat = 1 to 4; input xobs(idat) yobs(idat) @@; end; datalines; .25 .65 .25 .75 .35 .65 .35. ;

/* Calculate the observed average nearest neighbor distance */ data plot1; set plot_obs; array xobs(4) xobs1-xobs4; array yobs(4) yobs1-yobs4; array nnobs(4) nnobs1-nnobs4;

/* Determine the observed NN distance and average */ sumnnobs = 0; do i=1 to 4; * find NN distance for each point ; nnobs(i) = 100; * initialize distances to be large;

do j=1 to 4; * compare the ith point to all others; d=sqrt( (xobs(i)-xobs(j))2 + (yobs(i)-yobs(j))2 ); if (d<nnobs(i)) and (d>0) then nnobs(i)=d; end; /* Note: Each point is also compared to itself - thus, d= omitted from updated of nearest neighbor distance. */ sumnnobs=sumnnobs+nnobs(i); end;

avgnnobs = sumnnobs/4; * observed average NN distance;

/* Simulate plots under CSR */ data mccsr1; set plot1; array xobs xobs1-xobs4; * observed data; array yobs yobs1-yobs4; array xsim xsim1-xsim4; * CSR simulated data;

array ysim ysim1-ysim4;

array nnobs nnobs1-nnobs4; * observed NN distances; array nncsr nncsr1-nncsr4; * simulated NN distances;

/* Generate a large number of CSR plots with 4 trees / / CSR = completely spatially random */

  • initialize counters of nn avg dist le or ge than observed; numle = 0;
  • numge = 0; * commented out since focus on clustering here;

nsim = 4000; * user-specified option; seed1 = 12345; * if =0, then uses time since midnight;

do isim = 1 to nsim;

do ii = 1 to 4; xsim(ii) = ranuni(seed1); ysim(ii) = ranuni(seed1); end;

/* Find NN distance for the simulated trees */ sumnncsr = 0; do i=1 to 4; nncsr(i) = 100; * initialize;

do j=1 to 4; d=sqrt( (xsim(i)-xsim(j))2 + (ysim(i)-ysim(j))2 ); if (d<nncsr(i)) and (d>0) then nncsr(i)=d; end;

sumnncsr=sumnncsr+nncsr(i); end; avgnncsr = sumnncsr/4;

/* Accumulate counts of patterns consistent with clustering */ ile = (avgnncsr <= avgnnobs); numle = numle + ile;

  • ige = (avgnncsr >= avgnnobs);
  • numge = numge + ige;

drop i j ii xobs1-xobs4 yobs1-yobs4 nnobs1-nnobs sumnnobs sumnncsr; output; end; * of the isim - simulation loop;

data test; input id xstart xstop; datalines; 1 15 25 2 10 12 2 18 22 3 6 12 3 14 15 3 17 23 ;

proc print; run;

data test2; set test; by id; **** Comment 1; array start{9} start1-start9; array stop{9} stop1-stop9; array times{9} times1-times9;

retain count 0; **** Comment 2; retain start1-start9 stop1-stop9 times1-times9;

  • initialize count and arrays with new ID; if FIRST.id=1 then do; **** Comment 3; count = 0; do ii=1 to 9; start{ii} = .; stop{ii} = .; times{ii} = .; end; end;

count = count + 1; **** Comment 4; start{count} = xstart; stop{count} = xstop; times{count} = xstop - xstart;

if LAST.id=1 then do; **** Comment 5; first_time = times(1); total_time = sum(of times1-times9); keep id count first_time total_time; output; * output results if last obs for ID; end; run;

proc print; run;

The processing of these observations produces the following

first_ total_ Obs id count time time

1 1 1 10 10 2 2 2 2 6 3 3 3 6 13

**** Comment 1: The contents of the original data set test are imported into a

new data set test2 by the variable id. This creates two internal variables,

FIRST.ID and LAST.ID. The FIRST.ID =1 if the record corresponds to the

first record for a particular id and =0 for all other records. The LAST.ID =1 if the

record corresponds to the last record for a particular id and =0 for all other records.

For the first individual (id=1), FIRST.ID = LAST.ID =1 since there is only one

record for this individual. For the third individual (id=3), FIRST.ID =1 and

LAST.ID =0 for the data line “3 6 12”, FIRST.ID =0 and LAST.ID =0 for the

data line 3 14 15 , and FIRST.ID =0 and LAST.ID =1 for the data line 3 17

**** Comment 2: The two lines starting with RETAIN define the counter of the

number of enrollments for an individual and arrays for enrollment (START),

disenrollment (STOP) and time spent in a particular enrollment (TIMES).

RETAIN is used to make sure these values don’t disappear while other lines for the

same individual are being processed. The ARRAYS are set up to have at most 9

separate enrollment times.

**** Comment 3: When it is the first record for an individual, we initialize the

counter and the array elements. Note that the array elements are initially set to

missing since most individuals are expected to have fewer than this maximum.

**** Comment 4: Load the arrays with information from the current record and

calculate the time enrolled for this entry.

**** Comment 5: When it is the last record for an individual, we accumulate the

total time (note and missing values are ignored by the sum function), define the

length of the first stay, and output the results.

8.4 Case Study 8.2: Randomization test for the equality of two populations

[OMIT]

Randomization tests (Edgington 1995, Good 2001) can also be used to obtain P-values.

We next examine the results from a t-test comparing the mean responses in the control and 160 μg/l conditions. From this section of code and the accompanying results, we see that the two group t-test resulted in a P-value of 0.035 (assuming equal variances).

proc ttest; title NITROFEN: t-test of (0, 160) concentrations; class conc; var total; run;

Display 8.7: Output from PROC TTEST

NITROFEN: t-test of (0, 160) concentrations The TTEST Procedure Statistics Lower CL Upper CL Lower CL Upper CL conc N Mean Mean Mean SD SD SD Std Err

0 10 28.827 31.4 33.973 2.4737 3.5963 6.5654 1. 160 10 26.612 28.3 29.988 1.6229 2.3594 4.3073 0. Diff (1-2) 0.2424 3.1 5.9576 2.2981 3.0414 4.4977 1.

T-Tests Variable Method Variances DF t Value Pr > |t| total Pooled Equal 18 2.28 0. total Satterthwaite Unequal 15.5 2.28 0.

Equality of Variances Variable Method Num DF Den DF F Value Pr > F total Folded F 9 9 2.32 0.

PROC PLAN to generate a collection of permutations of the indices 1 through 20. The 10 observations corresponding to the first 10 indices in the permuted set are assigned for group 1 for a particular permutation, while the remaining observations are assigned to group 2. These permutations are read into a SAS data set for additional processing.

TRANSPOSE procedure is used to place each permuted set of data onto a single line.

Display 8.8: Constructing a randomization test of two population mean equality proc transpose data=test prefix=xx out=tran_out; var total; run; data obs_test; set tran_out; type = ‘O’; * identifies original data; run; proc print data=obs_test; title ‘Randomization test: observed data’;

run;

proc plan; factors test=4000 ordered in=20; output out=d_permut; run;

/*

PLAN used to generate a set of indices for the randomization test and then use TRANSPOSE to package the output

*/

proc transpose data=d_permut prefix=in out=out_permut(keep=in1-in20); by test; run; proc print data=out_permut; run;

data null; set obs_test; file 'D:\baileraj\Classes\Fall 2008\sta402\SAS- programs\week7-perm.data'; put type xx1-xx20; run;

data null; set out_permut; type = 'P'; * permutation data; file 'D:\baileraj\Classes\Fall 2008\sta402\SAS- programs\week7-perm.data' mod; /* mod option adds lines to existing file */ put type in1-in20; run;

/* week7-perm.data ... O 27 32 34 33 36 34 33 30 24 31 29 29 23 27 30 31 30 26 29 29 P 8 14 4 11 3 2 12 1 6 13 17 9 15 16 5 19 20 7 10 18 P 12 2 8 10 13 7 9 16 4 19 15 3 5 14 17 1 20 11 6 18 P 18 17 13 14 5 8 19 16 3 12 11 9 10 7 2 20 4 6 1 15

... */

data perm_data; array both{ 20 } x1-x10 y1-y10; /* array for observed values / array ins{ 20 } in1-in20; / index array */ array perms{ 20 } xp1-xp10 yp1-yp10;

< - - - P(upper) = 0.

Cumulative Cumulative perm_2tail Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

0 3847 96.18 3847 96. 1 153 3.83 4000 100. < - - - P(2 tail) = 0.

Thus we see that the two-tailed P-value from the randomization procedure, 0.0383, is quite similar to the P-value obtained from the t-test.

References

Cody, R. and Pass, R. (1995) SAS®^ Programming by Example. SAS Institute Inc., Cary, NC. – Chapters 7 (“arrays”), 8 (“retain”), 5 (“SAS functions”)

Edgington, E. (1995) Randomization tests. M. Dekker, New York.

Good, P.I. (2001) Resampling Methods: a practical guide to data analysis. Birkhauser, Boston.

Ripley (1981) Spatial Statistics. John Wiley & Sons. NY.

Waller, L.A. and Gotway-Crawford, C.(2004) Applied spatial statistics for public health data. John Wiley & Sons: NY.