Formatting, basic DATA step manipulations and programming | STA 402, Assignments of Statistics

Material Type: Assignment; Professor: Bailer; Class: Statistical Programming; Subject: Statistics; University: Miami University-Oxford; Term: Fall (First Sem) 2008;

Statistical Programming in SAS Bailer
Week 08-09 [31+ Oct.] Class Activities
File: week-08-09-DATA-step-prog-30oct08.doc
Directory: \\Muserver2\USERS\B\baileraj\Classes\sta402\handouts
Formatting, basic DATA step manipulations and programming
7. Formatting, data recoding and basic DATA step manipulations
1. Internal representations and output displays
2. Date and time formats
3. Recoding and transforming variables in a DATA step
4. Ordering how tasks are done – precedence of operations
5. What goes and what stays in a data set – DROP, KEEP, IF, WHERE, OUTPUT
6. Structured thinking about writing programs – pseudo-code and modules
7. CASE STUDY 7.1: Is the two-sample t-test robust to heterogeneous variances?
8. CASE STUDY 7.2: Monte Carlo integration to estimate Pr(0<Z<1.645) for Z~N(0,1)
9. CASE STUDY 7.3: Simple percentile-based bootstrap
10. Throw out your tables of statistical distributions – CDF, PDF, QUANTILE
11. Generating variables using random number generators – RAND
Exercises

7.1. Internal representations and output displays

• getting data with formatted values into a SAS data set (via “informats”): how to process data and store an input variable

• displaying data values (“formats”): how to display values of a particular variable

Data can be crudely classified into four types: character, numeric, date and time.

(One might argue that only two types are necessary, since date and time data are essentially numeric, but it is convenient to make this distinction when discussing formats.)

Common formats (character):

Characters: $w.
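As a minimal sketch of the informat/format distinction (the variable names and literal values here are illustrative, not taken from the handout): an informat controls how raw text is converted into a stored value, while a format only changes how the stored value is displayed.

```sas
data _null_;
  /* informat: convert the text '1,277,695' into the number 1277695 */
  amount = input('1,277,695', comma9.);
  put amount=;              /* stored value, default display */
  put amount comma9.;       /* same stored value, displayed with commas */

  /* $w. format: display the first w characters of a character value */
  species = 'diplodocus';
  put species $4.;          /* writes dipl */
  put species $10.;         /* writes diplodocus */
run;
```

The stored value of species is unchanged by the $4. format; only the display is shortened.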

Display 7.1: Reading character and numeric variables with implicit and explicit formats

data mrexample;
  * Lunneborg (1994) - body weight brain example;
  input species $ bodywt brainwt @@;
  datalines;
beaver 1.35 8.10 cow 465.00 423.00 wolf 36.33 119.50 goat 27.66 115.00
guipig 1.04 5.50 diplodocus 11700.00 50.00 asielephant 2547.00 4603.00 donkey 187.10 419.00
horse 521.00 655.00 potarmonkey 10.00 115.00 cat 3.30 25.60 giraffe 529.000 680.00
gorilla 207.00 406.00 human 62.00 1320.00 afrelephant 6654.00 5712.00 triceratops 9400.00 70.00
rhemonkey 6.80 179.00 kangaroo 35.00 56.00 hamster 0.12 1.00 mouse 0.023 0.40
rabbit 2.50 12.10 sheep 55.50 175.00 jaguar 100.00 157.00 chimp 52.16 440.00
brachiosaurus 87000.00 154.50 rat 0.28 1.90 mole 0.122 3.00 pig 192.00 180
;

data mrexample2;
  length species $ 15;
  input species bodywt brainwt @@;
  datalines;
beaver 1.35 8.10 cow 465.00 423.00 wolf 36.33 119.50 goat 27.66 115.00
guipig 1.04 5.50 diplodocus 11700.00 50.00 asielephant 2547.00 4603.00 donkey 187.10 419.00
horse 521.00 655.00 potarmonkey 10.00 115.00 cat 3.30 25.60 giraffe 529.000 680.00
gorilla 207.00 406.00 human 62.00 1320.00 afrelephant 6654.00 5712.00 triceratops 9400.00 70.00
rhemonkey 6.80 179.00 kangaroo 35.00 56.00 hamster 0.12 1.00 mouse 0.023 0.40
rabbit 2.50 12.10 sheep 55.50 175.00 jaguar 100.00 157.00 chimp 52.16 440.00
brachiosaurus 87000.00 154.50 rat 0.28 1.90 mole 0.122 3.00 pig 192.00 180
;

Results from the two DATA steps (last eight observations). With the default character length of 8 in mrexample, brachiosaurus is stored as brachios; the LENGTH statement in mrexample2 keeps the full name.

data mrexample;                        data mrexample2; length species $ 15;
input species $ bodywt brainwt @@;     input species bodywt brainwt @@;

species      bodywt   brainwt          species           bodywt   brainwt
rabbit         2.50      12.1          rabbit              2.50      12.1
sheep         55.50     175.0          sheep              55.50     175.0
jaguar       100.00     157.0          jaguar            100.00     157.0
chimp         52.16     440.0          chimp              52.16     440.0
brachios   87000.00     154.5          brachiosaurus   87000.00     154.5
rat            0.28       1.9          rat                 0.28       1.9
mole           0.12       3.0          mole                0.12       3.0
pig          192.00     180.0          pig               192.00     180.0

Common formats (numeric):

w. (w.d)   BESTw.   COMMAw.d   Ew.   DOLLARw.d

Display 7.3: DATA step displays of numeric formats

data numeric_format_show;
  /* one test value displayed with several numeric formats */
  test_num = 1277695.384;
  put 'BEST6. / BEST9. / BEST12.';
  put test_num BEST6.;
  put test_num BEST9.;
  put test_num BEST12.;
  put '-------------------------------';
  put 'COMMA9. / COMMA12.1 / COMMA13.3';
  put test_num COMMA9.;
  put test_num COMMA12.1;
  put test_num COMMA13.3;
  put '-------------------------------';
  put 'E7.';
  put test_num E7.;
  put '-------------------------------';
  put '8. / 12.1 / 13.3';
  put test_num 8.;
  put test_num 12.1;
  put test_num 13.3;
  put '-------------------------------';
  put 'DOLLAR9. / DOLLAR12.2';
  put test_num DOLLAR9.;
  put test_num DOLLAR12.2;
run;

Display 7.4: Output from SAS LOG with examples using numeric formats put to the log

BEST6. / BEST9. / BEST12.
1.28E6
1277695.4
1277695.384
-------------------------------
COMMA9. / COMMA12.1 / COMMA13.3
1,277,695
1,277,695.4
1,277,695.384
-------------------------------
E7.
1.3E+06
-------------------------------
8. / 12.1 / 13.3
1277695
1277695.4
1277695.384
-------------------------------
DOLLAR9. / DOLLAR12.2
$1277695
$1277695.38

Example: User-defined format for numeric variable

Display 7.5: Constructing a user-defined format for a numeric variable

data toyexample;
  input literacy @@;
  literacy_too = literacy;
  datalines;
-99 25.55 53 53.5 73.7 83 99.9 107 .
;

proc format;
  value literacyfmt 0-53    = 'First quartile'
                    53<-76  = 'Second quartile'
                    76<-90  = 'Third quartile'
                    90<-100 = 'Fourth quartile'
                    .       = 'Missing'
                    OTHER   = 'Invalid';
run;

data toyexample2;
  set toyexample;
  format literacy literacyfmt.;
run;

ods rtf;
proc print;
run;
proc means;
  var literacy literacy_too;
run;
ods rtf close;

Display 7.6: Output from printing dataset with a user-defined variable

Obs   literacy           literacy_too
1     Invalid                  -99.00
2     First quartile            25.55
3     First quartile            53.00
4     Second quartile           53.50
5     Second quartile           73.70
6     Third quartile            83.00
7     Fourth quartile           99.90
8     Invalid                  107.00
9     Missing                       .

Display 7.7: Output from PROC MEANS (columns: Variable, N, Mean, Std Dev, Minimum, Maximum; rows: literacy, literacy_too; values not reproduced here)
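A user-defined format can also be applied inside a DATA step with the PUT function to create a new character variable, leaving the numeric variable untouched. The following is a minimal sketch, not the handout's code; the variable name literacy_cat and the four test values are illustrative assumptions.

```sas
proc format;
  value literacyfmt 0-53    = 'First quartile'
                    53<-76  = 'Second quartile'
                    76<-90  = 'Third quartile'
                    90<-100 = 'Fourth quartile'
                    .       = 'Missing'
                    other   = 'Invalid';
run;

data _null_;
  input literacy @@;
  /* PUT function returns the formatted value as a character string */
  literacy_cat = put(literacy, literacyfmt.);
  put literacy= literacy_cat=;
  datalines;
53 53.5 107 .
;
```

The range 0-53 includes 53, while 53<-76 excludes it, so 53 and 53.5 fall in different categories.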

7.2. Character, numeric, time and date formats

A large number of formats can be used to read and display dates and times.

Dates might be recorded as 30jun10, 30jun2010, 063010 or even 30.06.10. Each requires a different informat to read it into SAS (see Display 7.8).

Display 7.8: Formats for reading common date-values into SAS variables

data;
  input @1 indate1 date7. @9 indate2 date9.
        @19 indate3 mmddyy6. @26 indate4 ddmmyy8.;
  datalines;
30jun10 30jun2010 063010 30.06.10
;
ods rtf;
proc print;
run;
ods rtf close;

Why are 30jun10, 30jun2010, 063010 and 30.06.10 all displayed as 18443?

Display 7.9: Result of reading dates into SAS variables

Obs   indate1   indate2   indate3   indate4
1       18443     18443     18443     18443

Dates are stored in SAS as the number of days from a selected point in time.

  • default origin of time for SAS is January 1, 1960.
  • Thus, June 30, 2010 is 18443 days away from January 1, 1960.
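The day count can be checked directly in a DATA step. This is a small sketch with illustrative variable names (d1, d2, days_since_origin), using a date constant and the MDY function.

```sas
data _null_;
  d1 = '30jun2010'd;                      /* date constant */
  d2 = mdy(6, 30, 2010);                  /* same date built from month, day, year */
  days_since_origin = d1 - '01jan1960'd;  /* the origin itself is stored as 0 */
  put d1= d2= days_since_origin=;         /* all three equal 18443 */
  put d1 date9.;                          /* writes 30JUN2010 */
run;
```

Because dates are ordinary numbers, differences between two dates give elapsed days without any special functions.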

Display 7.12: Various Date formats illustrated

data date_format_show;
  start = 0;
  put start date9.;
  today = 17700;      * days since Jan 1, 1960;
  put '-------------------------------';
  put 'DATE7. / DATE9.';
  put today date7.;
  put today date9.;
  put '-------------------------------';
  put 'DAY2. / DAY7.';
  put today day2.;
  put today day7.;
  put '-------------------------------';
  put 'EURDFDD8.';
  put today eurdfdd8.;
  put '-------------------------------';
  put 'MMDDYY8. / MMDDYY6.';
  put today mmddyy8.;
  put today mmddyy6.;
  put '-------------------------------';
  put 'WEEKDATE15. / WEEKDATE29.';
  put today weekdate15.;
  put today weekdate29.;
  put '-------------------------------';
  put 'WORDDATE12. / WORDDATE18.';
  put today worddate12.;
  put today worddate18.;
run;

Display 7.13: Results from SAS LOG of displaying date using different formats

01JAN1960
-------------------------------
DATE7. / DATE9.
17JUN08
17JUN2008
-------------------------------
DAY2. / DAY7.
17
17
-------------------------------
EURDFDD8.
17.06.08
-------------------------------
MMDDYY8. / MMDDYY6.
06/17/08
061708
-------------------------------
WEEKDATE15. / WEEKDATE29.
Tue, Jun 17, 08
Tuesday, June 17, 2008
-------------------------------
WORDDATE12. / WORDDATE18.
Jun 17, 2008
June 17, 2008

Time also needs to be stored as an elapsed number relative to some reference.

  • In SAS, time is recorded as the number of seconds that have elapsed since midnight.
  • Display 7.14 includes code to define three variables using time and date constants (the “t” and “d” following the quoted strings define these values as times and dates), and then displays the internal storage value and a formatted version of this value.

Display 7.14: Time and date formats illustrated

data;
  time_date_origin = 0;
  nowtime = '09:00't;
  today = '17jun2008'd;
  put time_date_origin @20 time_date_origin datetime13.;
  put nowtime @20 nowtime time9.;
  put today @20 today date9.;
run;

As we see in Display 7.15, 9:00 corresponds to 9 × 60 × 60 = 32400 seconds since midnight, and 17 June 2008 corresponds to 17700 days since January 1, 1960.

Display 7.15: The origin of time and dates in SAS and other formatted values

0                  01JAN60:00:00
32400              9:00:00
17700              17JUN2008
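Times and dates can be combined into a datetime value, which SAS stores as seconds since midnight on January 1, 1960. The sketch below uses the HMS and DHMS functions; the variable names stamp and check are illustrative.

```sas
data _null_;
  nowtime = hms(9, 0, 0);                    /* seconds since midnight */
  put nowtime=;                              /* nowtime=32400 */

  /* datetime: seconds since midnight, January 1, 1960 */
  stamp = dhms('17jun2008'd, 9, 0, 0);
  check = '17jun2008'd * 24*60*60 + nowtime; /* days converted to seconds, plus time of day */
  put stamp= check=;                         /* the two values agree */
  put stamp datetime19.;                     /* 17JUN2008:09:00:00 */
run;
```

A date format applied to a datetime value (or the reverse) gives nonsense, since the two are measured in different units.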

7.3. Recoding and transforming variables in a DATA step

Display 7.18: Data transformations before fitting a polynomial regression model

data nitrofen;
  infile '\\Muserver2\USERS\B\BAILERAJ\public.www\classes\sta402\data\ch2-dat.txt'
         firstobs=16 expandtabs missover pad;
  input @17 conc 3. @49 total 2.;
  sqrt_total = sqrt(total);    * transformed response variable;
  cconc = conc - 157;          * construct mean-centered concentration;
  cconc2 = cconc*cconc;        * quadratic term;
run;

ods rtf BODYTITLE;
ods graphics on;
proc reg data=nitrofen;
  model sqrt_total = cconc cconc2;    * fit the polynomial reg. model;
run;
ods graphics off;
ods rtf close;

Display 7.19: ODS statgraphics output from a polynomial regression
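Display 7.18 hard-codes the centering constant 157. One alternative, sketched here under assumptions, is to compute the mean and pass it to the DATA step through a macro variable. The data set nitrofen_toy, its total values and the macro variable mean_conc are hypothetical stand-ins; the five concentrations were chosen so that their mean is 157.

```sas
data nitrofen_toy;                 /* hypothetical stand-in for the nitrofen data */
  input conc total @@;
  datalines;
0 27 80 33 160 29 235 23 310 6
;

proc sql noprint;
  select mean(conc) into :mean_conc from nitrofen_toy;   /* store the mean */
quit;

data nitrofen_toy2;
  set nitrofen_toy;
  sqrt_total = sqrt(total);        /* transformed response */
  cconc  = conc - &mean_conc;      /* mean-centered concentration */
  cconc2 = cconc*cconc;            /* quadratic term */
run;

proc print data=nitrofen_toy2;
run;
```

Computing the constant avoids having to edit the program if observations are added or removed.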

Boolean tricks to recode …

Option 1: IF-THEN

* define the indicator of dinosaurs;
idino = 0;
if (species="diplodoc" or species="tricerat" or species="brachios") then idino=1;

Option 2: Boolean evaluation

idino = (species="diplodoc" or species="tricerat" or species="brachios");

Display 7.20: Defining indicator variables for different treatments to fit an anova model

options nodate formdlim="-";
data meat;
  input condition $ logcount @@;
  iPlastic = (condition = "Plastic");
  iVacuum  = (condition = "Vacuum");
  iMixed   = (condition = "Mixed");
  iCO2     = (condition = "Co2");
  datalines;
Plastic 7.66 Plastic 6.98 Plastic 7.
Vacuum 5.26 Vacuum 5.44 Vacuum 5.80
Mixed 7.41 Mixed 7.33 Mixed 7.
Co2 3.51 Co2 2.91 Co2 3.
;

title "bacteria growth under 4 packaging conditions";
ODS RTF bodytitle;
proc print;
  title "Print to check indicator variable construction";
run;
proc reg data=meat;
  title "Regression with indicator variables: alt. to one-way anova model";
  model logcount = iPlastic iVacuum iMixed;
run;
proc glm data=meat;
  title "One-way anova model";
  class condition;
  model logcount = condition;
run;
ods rtf close;

Display 7.21: OUTPUT from PROC PRINT to confirm indicator variable coding

condition   logcount   iPlastic   iVacuum   iMixed   iCO2
Plastic         7.66          1         0        0      0
Plastic         6.98          1         0        0      0
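When the list of values is long, the Boolean evaluation can be written more compactly with the IN operator. This is a sketch rather than the handout's code; the data set dino_check and its six species are illustrative, and a LENGTH statement is used so that the full species names are stored.

```sas
data dino_check;
  length species $ 15;
  input species @@;
  /* Boolean evaluation with IN: 1 if species is in the list, 0 otherwise */
  idino = (species in ('diplodocus', 'triceratops', 'brachiosaurus'));
  datalines;
cow diplodocus human triceratops mouse brachiosaurus
;

proc print data=dino_check;
run;
```

As with any character comparison, the match is case-sensitive and depends on the stored length of the variable.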

Example: Comparing coding when constructing categories (or why you test with missing values)

Display 7.22: Constructing indicator variables for different categories of a numeric variable

data toyexample;
  input literacy @@;
  cat_literacy1 = 1*(0<literacy<=53) + 2*(53<literacy<=76)
                + 3*(76<literacy<=90) + 4*(90<literacy<=100);
  cat_literacy2 = 1*(literacy<=53) + 2*(53<literacy<=76)
                + 3*(76<literacy<=90) + 4*(90<literacy<=100);
  if ( (literacy NE .) AND (0<=literacy<=100) ) then
    cat_literacy3 = 1*(literacy<=53) + 2*(53<literacy<=76)
                  + 3*(76<literacy<=90) + 4*(90<literacy<=100);
  if ( (literacy EQ .) OR (100<literacy) OR (literacy<0) ) then cat_literacy4=.;
  else if (literacy <=53) then cat_literacy4=1;
  else if (literacy <=76) then cat_literacy4=2;
  else if (literacy <=90) then cat_literacy4=3;
  else cat_literacy4=4;
  datalines;
-99 25.55 73.7 83 99.9 107 .
;
ods rtf;
proc print;
run;
ods rtf close;

Display 7.23: Output from PROC PRINT for the different indicators

Obs   literacy   cat_literacy1   cat_literacy2   cat_literacy3   cat_literacy4
1       -99.00               0               1               .               .
2        25.55               1               1               1               1
3        73.70               2               2               2               2
4        83.00               3               3               3               3
5        99.90               4               4               4               4
6       107.00               0               0               .               .
7            .               0               1               .               .
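The last row of Display 7.23 follows from the way SAS compares missing values: a numeric missing value is treated as smaller than any number, so a test such as literacy<=53 is true when literacy is missing. A minimal sketch, with illustrative variable names:

```sas
data _null_;
  literacy = .;
  below   = (literacy <= 53);        /* 1: missing sorts below any number */
  inrange = (0 <= literacy <= 100);  /* 0: fails the lower bound */
  is_miss = missing(literacy);       /* 1: explicit test for missing */
  put below= inrange= is_miss=;      /* below=1 inrange=0 is_miss=1 */
run;
```

This is why cat_literacy2 assigns category 1 to the missing observation while cat_literacy3 and cat_literacy4 leave it missing.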

7.4. Ordering how tasks are done – precedence of operations

Display 7.24: DATA block illustration for order of operations comparisons

data preced_test;
  x1a = 3*2**2;   x1b = (3*2)**2;
  x2a = 3-2/2;    x2b = (3-2)/2;
  x3a = -2**2;    x3b = (-2)**2;
  put '-------------------------';
  put '| Order of operations   |';
  put '|     illustrated       |';
  put '-------------------------';
  put '3*2**2 = ' x1a;
  put '(3*2)**2 = ' x1b;
  put '3-2/2 = ' x2a;
  put '(3-2)/2 = ' x2b;
  put '-2**2 = ' x3a;
  put '(-2)**2 = ' x3b;
run;

Display 7.25: Output from the SAS LOG for the order of operations illustration

-------------------------
| Order of operations   |
|     illustrated       |
-------------------------
3*2**2 = 12
(3*2)**2 = 36
3-2/2 = 2
(3-2)/2 = 0.5
-2**2 = -4
(-2)**2 = 4
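One further case, added as a sketch beyond Displays 7.24 and 7.25: exponentiation and the prefix minus sign belong to the same precedence group in SAS, and operators in that group are evaluated from right to left. The variable names a through d are illustrative.

```sas
data _null_;
  a = 2**3**2;      /* right to left: 2**(3**2) = 2**9 = 512 */
  b = (2**3)**2;    /* 8**2 = 64 */
  c = -3**2;        /* 3**2 first, then the minus sign: -9 */
  d = (-3)**2;      /* 9 */
  put a= b= c= d=;
run;
```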

  • Ultimately, there is a simple moral to this cautionary programming tale: use PARENTHESES when concerned that operations need to be conducted in a specific order.
  • Pseudo-coding is a strategy for writing out in words or sentence fragments what should occur in a large programming task.

7.7. CASE STUDY 7.1: Is the two-sample t-test robust to heterogeneous variances?

The pooled-variance two-sample t-test of

\[ H_0: \mu_1 = \mu_2, \]

with test statistic (T.S.)

\[ t_{stat} = \frac{\bar{Y}_1 - \bar{Y}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \quad \text{where} \quad s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}, \]

assumes that the observations from the two populations are independent and that the two populations are normally distributed with a common variance.
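The test statistic can be computed directly in a DATA step from summary statistics. The sketch below uses made-up values for the sample sizes, means and standard deviations; only the formulas come from the definition above.

```sas
data _null_;
  /* hypothetical summary statistics for the two samples */
  n1 = 10; ybar1 = 5.2; s1 = 1.1;
  n2 = 12; ybar2 = 4.1; s2 = 1.4;

  df    = n1 + n2 - 2;
  sp2   = ((n1-1)*s1**2 + (n2-1)*s2**2) / df;          /* pooled variance */
  tstat = (ybar1 - ybar2) / sqrt(sp2*(1/n1 + 1/n2));   /* pooled-variance t statistic */
  pval  = 2*(1 - probt(abs(tstat), df));               /* two-sided p-value */
  put sp2= tstat= pval=;
run;
```

The parentheses around the numerator and around sp2*(1/n1 + 1/n2) matter here; without them the order of operations would give a different quantity.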

  • In this case study, we conduct a Monte Carlo simulation to investigate the impact of unequal variances on tests of mean equality when using the pooled-variance t-test. In particular, we investigate the operating characteristics of this test (Type I error rates, power).

As we implement a study of the impact of heterogeneous variances on the pooled-variance t-test, let’s think about some of the tasks that need to be implemented to conduct this study. We need to:

  • specify the conditions to be generated
  • generate data sets reflecting these conditions
  • calculate the test statistic
  • accumulate results over numerous simulated data sets

The program may

  • have input parameters (the conditions studied)
  • produce output (a summary of the results over repeated simulated experiments)
  • generate data
  • calculate a test statistic
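To make the "generate data" and "calculate a test statistic" tasks concrete, here is a sketch of a single simulated experiment. It is not the handout's program: the data set name sim_one, the seed and all condition values are assumptions chosen for illustration, and PROC TTEST stands in for the test-statistic calculation.

```sas
data sim_one;
  call streaminit(12345);           /* hypothetical seed, for reproducibility */
  n1 = 10;   n2 = 10;               /* hypothetical sample sizes */
  mu_1 = 0;  mu_2 = 0;              /* equal means: the null hypothesis is true */
  sig_1 = 1; sig_2 = 4;             /* unequal standard deviations */
  do i = 1 to n1;
    group = 1; y = rand('NORMAL', mu_1, sig_1); output;
  end;
  do i = 1 to n2;
    group = 2; y = rand('NORMAL', mu_2, sig_2); output;
  end;
  keep group y;
run;

proc ttest data=sim_one;            /* reports the pooled-variance t-test */
  class group;
  var y;
run;
```

A full simulation would repeat this many times and record how often the pooled test rejects.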

STEP 1: start with a program that includes only comments.

Display 7.30: Pseudo-code/comments for the t-test simulation

/*
  Problem:  Explore whether t-test really is robust to violations of
            the equal variance assumption
  Strategy: See if the t-test operates at the nominal Type I error rate
            when the equal variance assumption is violated
*/
/* specify the conditions to be generated */
/* generate data sets reflecting these conditions */
/* calculate the test statistic */
/* accumulate results over numerous simulated data sets */

STEP 2: identify variables that define simulation conditions

need to specify …

  • number of simulated experiments (Nsims)
  • sample sizes (N1, N2)
  • population means (mu_1, mu_2)
  • standard deviations (sig_1, sig_2)
  • seed (myseed) for the random number generator so that we can generate the same sequence for our testing.

Display 7.31: Specifying the simulation conditions for the t-test simulation

/* Problem: Explore whether t-test really is robust to violations of the equal variance assumption Strategy: See if the t-test operates at the nominal Type I error rate when the equal variance assumption is violated