SAS Programming: Transforming Data Sets and Using Functions, Assignments of Statistics

Information on transforming SAS data sets using DATA steps, creating variables, and working with functions. Topics include statements such as DATA, SET, OUTPUT, RETURN, WHERE, IF, DROP, KEEP, and LENGTH, along with SAS functions and operators. The document also covers working with date values and missing values.

Uploaded on 08/18/2009

koofers-user-4hm
Week 06/07 Class Activities
File: week-06-07-07oct07.doc
Directory (hp/compaq):
C:\baileraj\Classes\Fall 2007\sta402\handouts
Directory: \\Muserver2\USERS\B\baileraj\Classes\sta402\handouts
Week 6 TRANSFORMING SAS DATA SETS
Creating SAS data sets with DATA steps: flow of execution, including the program data vector
Creating variables in DATA steps with assignment statements
Statements: DATA, SET, OUTPUT, RETURN, WHERE, IF, DROP, KEEP, LENGTH
Subsetting observations and variables
Using SAS functions and operators
Working with SAS date values (also time and date-time)
Introduction to missing values
Week 7 SAS PROGRAMMING
Declarative vs. executable statements
Statements: RETAIN, RENAME, LABEL, FORMAT, SUM
Using formats in DATA steps
Conditional execution
DO groups
Arrays
More on missing values
FORMATTING and data recoding
Additional Ref: Cody, R. and Pass, R. (1995) SAS® Programming by Example. SAS
Institute Inc., Cary, NC.
Formats
Informats
Internal formats
General form of formats:
Character: $formatw.
(e.g. $w. = standard character data; $HEXw. = convert to hexadecimal)

Numeric: formatw.d

(e.g. BESTw. = SAS System chooses; COMMAw.d; DOLLARw.d; Ew. = scientific notation; w.d)

Date: formatw.

(e.g. DATEw. = ddmmmyy or ddmmmyyyy; DATETIMEw.d = ddmmmyy:hh:mm:ss.ss; DAYw. = day of month; EURDFDDw. = dd.mm.yy; JULIANw. = Julian date; MMDDYYw.; TIMEw.d = hh:mm:ss.ss; WEEKDATEw. = day-name, month-name dd, yy or yyyy; WORDDATEw. = month-name dd, yyyy)

Comments:

$ = character

w = total width

d = number of decimal places

example: illustrating the formats above

data char_format_show;

/* character formatting illustrated first */

charstring = "Hello there";

put charstring $11.;

put charstring $15.;

put charstring $5.;

run;

yields the following output on the SAS log

Hello there
Hello there
Hello

data numeric_format_show;
/* numeric formatting illustrated */
test_num = 1277695.384;
put 'BEST6. / BEST9. / BEST12.';
put test_num BEST6.;
put test_num BEST9.;
put test_num BEST12.;
put '-------------------------------';
put 'COMMA9. / COMMA12.1 / COMMA13.3';
put test_num COMMA9.;
put test_num COMMA12.1;
put test_num COMMA13.3;
put '-------------------------------';
put 'E7.';
put test_num E7.;
put '-------------------------------';

put today weekdate29.;
put '-------------------------------';
put 'WORDDATE12. / WORDDATE18.';
put today worddate12.;
put today worddate18.;

run;

yields the following output on the SAS log

01JAN

DATE7. / DATE9.
29SEP03
29SEP2003

DAY2. / DAY7.
29
29

EURDFDD8.
29.09.03

MMDDYY8. / MMDDYY6.
09/29/03
092903

WEEKDATE15. / WEEKDATE29.
Mon, Sep 29, 03
Monday, September 29, 2003

WORDDATE12. / WORDDATE18.
Sep 29, 2003
September 29, 2003

data time_format_show;
start = 0;
time_test = 1380442000;
put start DATETIME13.;
put time_test DATETIME17.;
run;

yields the following output on the SAS log

01JAN60:00:00
29SEP03:08:06:40

INFORMAT – input data according to a particular format

Suppose your data was in the following format …

1234567890123456789012345678901234567890 [column guides]

data test;
input @1 date MMDDYY10. @21 time TIME8. @31 money DOLLAR10.2;

datalines;

*ODS RTF file='D:\baileraj\Classes\Fall 2007\sta402\SAS-programs\week6-prt1.rtf';
ODS RTF file="\\Muserver2\USERS\B\BAILERAJ\public.www\classes\sta402\examples\week06-prt1.rtf";

proc print;
title 'print of date and time w/o formatting - internal SAS representation';
var date time money;
run;

proc print;
title 'print of date and time w/ formatting';
var date time money;
format date MMDDYY10. time TIME8. money DOLLAR10.2;
run;

ODS RTF CLOSE;

Obs  date   time   money
1    0      3600   100.
2    15977  35399  12693.

Obs  date        time     money
1    01/01/1960  1:00:00  $100.
2    09/29/2003  9:49:59  $12,693.

INFORMATS used to process input variable values can also be user-defined: run PROC FORMAT before the DATA step, using an INVALUE statement instead of a VALUE statement.

/* example 8 from Cody and Pass (1995) */

/* set up informats for valid ranges of variables */

proc format;

invalue sbpfmt 40-300 = _SAME_
OTHER = .;

invalue dbpfmt 10-150 = _SAME_
OTHER = .;

run;

data demo;

input @1 ID $3. @4 SBP sbpfmt3. @7 DBP dbpfmt3.;

label loggnp = 'Per capita Gross National Product (log10-transformed)';
label ienglish = 'Indicator variable that primary language is English';

proc format;
value Mlifefmt LOW-54   = ' First quartile'
               54<-63   = 'Second quartile'
               63<-68   = ' Third quartile'
               68<-HIGH = 'Fourth quartile';
value Wlifefmt LOW-56   = ' First quartile'
               56<-67   = 'Second quartile'
               67<-73   = ' Third quartile'
               73<-HIGH = 'Fourth quartile';
value Literfmt LOW-53   = ' First quartile'
               53<-76   = 'Second quartile'
               76<-90   = ' Third quartile'
               90<-HIGH = 'Fourth quartile';
value catlit 1 = 'First quartile'
             2 = 'Second quartile'
             3 = 'Third quartile'
             4 = 'Fourth quartile';

data country2; set country;

/* recoding option 1 */

/* - least attractive alternative */

if 0 LE liter LE 53 then categ_lit1 = 1;

if 53 LT liter LE 76 then categ_lit1 = 2;

if 76 LT liter LE 90 then categ_lit1 = 3;

if 90 LT liter then categ_lit1 = 4;

/* recoding option 2 */

if 0 LE liter LE 53 then categ_lit2 = 1;

else if 53 LT liter LE 76 then categ_lit2 = 2;

else if 76 LT liter LE 90 then categ_lit2 = 3;

else if 90 LT liter then categ_lit2 = 4;

/* recoding option 3 */

if 0 <= liter & liter <= 53 then categ_lit3 = 1;

else if 53 < liter & liter <= 76 then categ_lit3 = 2;

else if 76 < liter & liter <= 90 then categ_lit3 = 3;

else if 90 < liter then categ_lit3 = 4;

/* recoding option 4 */

if liter GE 0 AND liter LE 53 then categ_lit4 = 1;

else if liter GT 53 AND liter LE 76 then categ_lit4 = 2;

else if liter GT 76 AND liter LE 90 then categ_lit4 = 3;

else if liter GT 90 then categ_lit4 = 4;

/* recoding option 5 */

/* - may be more efficient than if-then-else */

select;

when (0 <= liter <= 53) categ_lit5=1;

when (53< liter <= 76) categ_lit5=2;

when (76< liter <= 90) categ_lit5=3;

when (90< liter) categ_lit5=4;

when (liter=.) categ_lit5=.;

end;

/* recoding option 6 */

categ_lit6 = 1*(0<liter<=53) + 2*(53<liter<=76) + 3*(76<liter<=90) + 4*(90<liter);

if liter=. then categ_lit6=.; * make sure missing=. not 0;

/* recoding option 7 */

/* - creates character variable with the formatted levels as

values */

categ_lit7 = put(liter,literfmt.);

run;

*ODS RTF file='D:\baileraj\Classes\Fall 2007\sta402\SAS-programs\week6-freq1.rtf';
ODS RTF file="\\Muserver2\USERS\B\BAILERAJ\public.www\classes\sta402\examples\week06-freq1.rtf";

proc freq;

table categ_lit1-categ_lit7;

run;

ODS RTF CLOSE;

categ_lit1  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1           20         25.97    20                    25.97
2           19         24.68    39                    50.65
3           19         24.68    58                    75.32
4           19         24.68    77                    100.00

Frequency Missing = 2

categ_lit6  Frequency  Percent  Cumulative Frequency  Cumulative Percent
1           20         25.97    20                    25.97
2           19         24.68    39                    50.65
3           19         24.68    58                    75.32
4           19         24.68    77                    100.00

Frequency Missing = 2

categ_lit7       Frequency  Percent  Cumulative Frequency  Cumulative Percent
.                2          2.53     2                     2.53
 First quartile  20         25.32    22                    27.85
 Third quartile  19         24.05    41                    51.90
Fourth quartile  19         24.05    60                    75.95
Second quartile  19         24.05    79                    100.00
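Each recoding option above stores a new variable. As a minimal sketch of another route, a format can be attached at analysis time so that PROC FREQ groups LITER by its formatted values. This assumes the COUNTRY data set and the LITERFMT format from this handout are available in the session.

```sas
/* Sketch: group LITER by quartile without creating a recoded variable. */
/* Assumes data set COUNTRY and format LITERFMT. exist as defined above. */
proc freq data=country;
  tables liter;
  format liter literfmt.;   /* PROC FREQ tabulates the formatted values */
run;
```

The stored values of liter are unchanged; only the grouping used in the table comes from the format.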

TRANSFORMING SAS DATA SETS

* Creating SAS data sets with DATA steps: flow of execution, including the program data vector

LIBNAME pointer 'directory-containing-SAS-data-sets';

(see example below)
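A short sketch of the LIBNAME pattern follows. The directory name is hypothetical; the libref CLASS and the data set NITROFEN are the names used later in this handout.

```sas
/* 'C:\my-sas-data' is a hypothetical directory used only for illustration */
libname class 'C:\my-sas-data';

/* a permanent data set is referenced by a two-level name: libref.dataset */
proc contents data=class.nitrofen;
run;
```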

* Creating variables in DATA steps with assignment statements

lots of examples...

sqrt_total = sqrt(total);

conc2 = conc**2;

Iplastic = (condition="Plastic");

categ_lit6 = 1*(0<liter<=53) + 2*(53<liter<=76) + 3*(76<liter<=90) + 4*(90<liter);

if liter=. then categ_lit6=.; * make sure missing=. not 0;

Order of Operations/Precedence of operations …

  1. ** (exponentiation first)
  2. * / (multiplication and division second)
  3. + - (addition and subtraction third)
  4. – etc.

data preced_test;
x1a = 3*2**2;
x1b = (3*2)**2;
x2a = 3-2/2;
x2b = (3-2)/2;
x3a = -2**2;
x3b = (-2)**2;
put '-------------------------';
put '| Order of operations |';
put '| illustrated |';
put '-------------------------';
put ' 3*2**2 = ' x1a;
put '(3*2)**2 = ' x1b;
put ' 3-2/2 = ' x2a;
put ' (3-2)/2 = ' x2b;
put ' -2**2 = ' x3a;
put ' (-2)**2 = ' x3b;
run;

from the SAS LOG

-------------------------
| Order of operations |
| illustrated |
-------------------------
3*2**2 = 12
(3*2)**2 = 36
3-2/2 = 2
(3-2)/2 = 0.5
-2**2 = -4
(-2)**2 = 4

MORAL: Use PARENTHESES whenever operations need to be carried out in a specific order!

* Statements: DATA, SET, OUTPUT, RETURN, WHERE, IF, DROP, KEEP, LENGTH

DATA = begin new data block

SET = place contents of one (or more) data set(s) into new data set. Concatenates data sets if more than one data set is named in the SET statement.

OUTPUT = writes an observation to an output data set
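The three statements can be seen together in a small sketch. All data set and variable names here (squares, more_squares, all_squares, i, sq) are made up for illustration.

```sas
/* An explicit OUTPUT inside a DO loop writes one observation per pass */
data squares;
  do i = 1 to 5;
    sq = i**2;
    output;
  end;
run;

data more_squares;
  do i = 6 to 8;
    sq = i**2;
    output;
  end;
run;

/* Naming two data sets on SET concatenates them: 5 + 3 = 8 observations */
data all_squares;
  set squares more_squares;
run;
```

Once a DATA step contains an explicit OUTPUT statement, the automatic output at the end of the step is no longer performed, so each step above writes exactly the observations produced inside its loop.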

ODS RTF file="\\Muserver2\USERS\B\BAILERAJ\public.www\classes\sta402\SAS-programs\week-06-tab1.rtf";

proc print; run;

proc tabulate;
class conc brood;
var count;
table conc*brood, count*(min q1 median q3 max);
run;
ODS RTF CLOSE;

[PROC TABULATE output: rows are Nitrofen concentration by brood; columns are Min, Q1, Median, Q3, Max of count]

* Subsetting observations and variables

data nitrofen2; set class.nitrofen;

* select all observations with all but highest concentration;

if conc<310;

data nitrofen3; set class.nitrofen;

where conc<310;

run;

From the SAS LOG file

NOTE: There were 50 observations read from the data set CLASS.NITROFEN.
NOTE: The data set WORK.NITROFEN2 has 40 observations and 7 variables.
NOTE: DATA statement used:
      real time 0.01 seconds
      cpu time 0.01 seconds

1143 data nitrofen3; set class.nitrofen;
1144 where conc<310;
1145 run;

NOTE: There were 40 observations read from the data set CLASS.NITROFEN.
      WHERE conc<310;
NOTE: The data set WORK.NITROFEN3 has 40 observations and 7 variables.
NOTE: DATA statement used:
      real time 0.66 seconds
      cpu time 0.03 seconds
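The two logs differ in the number of observations read: the subsetting IF reads all 50 observations and then discards those that fail the condition, while WHERE screens observations before they are brought into the DATA step, so only 40 are read. A sketch of the same WHERE selection written as a data set option follows; nitrofen4 is a hypothetical name.

```sas
/* WHERE= as a data set option on the input data set; */
/* the selection matches the WHERE statement version above */
data nitrofen4;
  set class.nitrofen(where=(conc<310));
run;
```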

* Using SAS functions and operators

* Working with SAS date values (also time and date-time) – DISCUSSED ABOVE

* Introduction to missing values– DISCUSSED ABOVE

EXAMPLE: Using SAS data step to do Monte Carlo Integration

Problem: Estimate PI using Monte Carlo Integration

Strategy: Equation of a circle with radius=1: x^2 + y^2 = 1

which can be written y = sqrt(1-x^2)

Area of this circle = PI

Area of this circle in the first quadrant = PI/4

Generate Ux ~ Uniform(0,1) and Uy ~ Uniform(0,1)

Check to see if Uy <= sqrt(1-Ux^2)

The proportion of generated points for which this condition is true is an estimate of PI/4.

data MCint;

/* initialize seed */

seed1 = 12345;

"\\Muserver2\USERS\B\BAILERAJ\public.www\classes\sta402\examples\week06-MC-fig.rtf";

* ODS RTF file='D:\baileraj\Classes\Fall 2007\sta402\SAS-programs\week6-MC-fig.rtf';
proc gplot data=MCint;
plot Uy*Ux=Under;
run;
ODS RTF CLOSE;
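The body of the MCint DATA step is not fully shown in this copy, so the following is only a sketch of the strategy described above, not the original code. The names Ux, Uy and Under follow the PROC GPLOT call; MCint_sketch, nsim, nunder, isim and pi_hat are hypothetical, and RAND with CALL STREAMINIT is used here in place of whatever generator the original used with seed1.

```sas
data MCint_sketch;
  call streaminit(12345);            /* fix the random number stream */
  nsim = 10000;                      /* number of simulated points (assumed) */
  nunder = 0;
  do isim = 1 to nsim;
    Ux = rand('UNIFORM');
    Uy = rand('UNIFORM');
    Under = (Uy <= sqrt(1 - Ux**2)); /* 1 if the point lies under the arc */
    nunder = nunder + Under;
    pi_hat = 4*nunder/isim;          /* running estimate of PI */
    output;
  end;
  keep isim Ux Uy Under pi_hat;
run;

/* print the estimate after the final simulated point */
proc print data=MCint_sketch(firstobs=10000);
  var isim pi_hat;
run;
```

Multiplying the proportion of points under the arc by 4 converts the estimate of PI/4 into an estimate of PI.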

EXAMPLE: Using SAS DATA programming to do a small MC simulation

/* Problem: Explore whether the t-test really is robust to
   violations of the equal variance assumption

   Strategy: See if the t-test operates at the nominal
   Type I error rate when the equal variance
   assumption is violated

   Test case: n1=n2=10
   Population 1: N(0,1)
   Population 2: N(0,4)
*/

data twogroup;

array x{10} x1-x10;
array y{10} y1-y10;

do isim = 1 to 10000;

/* generate samples X~N(0,1) Y~N(0,4) - normal case */
do isample = 1 to 10;
x{isample} = rannor(0);
y{isample} = 2*rannor(0);
end;

/* calculate the t-statistic */
xbar = mean(of x1-x10);
ybar = mean(of y1-y10);

xvar = var(of x1-x10);
yvar = var(of y1-y10);

s2p = (9*xvar + 9*yvar)/18;

tstat = (xbar-ybar)/sqrt(s2p*(2/10));
Pvalue = 2*(1-probt(abs(tstat),18));
Reject05 = (Pvalue <= 0.05);

keep xbar ybar xvar yvar s2p tstat Pvalue Reject05;
output;
end; * end of the simulation loop;
run;

/* proc print; run; */

proc freq; table Reject05; run;

Reject05  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0         9443       94.43    9443                  94.43
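For reference, the quantities computed inside the simulation loop correspond to the usual pooled-variance two-sample t-test with n1 = n2 = 10. Written out in the notation of the code (xvar, yvar, s2p, tstat, Pvalue), where F_18 denotes the cumulative distribution function of a t distribution with 18 degrees of freedom (probt in the code):

```latex
\begin{aligned}
s_p^2 &= \frac{(n_1-1)\,s_x^2 + (n_2-1)\,s_y^2}{n_1+n_2-2}
       = \frac{9\,s_x^2 + 9\,s_y^2}{18},\\[4pt]
t &= \frac{\bar{x}-\bar{y}}{\sqrt{s_p^2\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)}}
   = \frac{\bar{x}-\bar{y}}{\sqrt{s_p^2\,(2/10)}},\\[4pt]
p &= 2\,\bigl(1 - F_{18}(|t|)\bigr).
\end{aligned}
```

Both populations have mean 0, so the null hypothesis is true in every simulated data set, and the proportion of simulations with Reject05 = 1 estimates the actual Type I error rate.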

proc means data=gen_exp;

var time;

output out=m_out;

run;

data mle_exact;
set m_out;
if _STAT_='MEAN';
lambda_MLE = 1/TIME;

proc print data=mle_exact; run;
/*
Obs  _TYPE_  _FREQ_  _STAT_  time     lambda_MLE
1    0       25      MEAN    0.95268  1.
*/
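The assignment lambda_MLE = 1/TIME uses the closed-form maximum likelihood estimate for exponential data. A short derivation, assuming the times t_1, ..., t_n are independent with density λ exp(-λ t), which matches the log-likelihood term log(lambda) - lambda*time used in the NLMIXED code of this handout:

```latex
\begin{aligned}
\ell(\lambda) &= \sum_{i=1}^{n}\bigl(\log\lambda - \lambda t_i\bigr)
               = n\log\lambda - \lambda\sum_{i=1}^{n} t_i,\\[4pt]
\frac{d\ell}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} t_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} t_i} = \frac{1}{\bar{t}}.
\end{aligned}
```

With the sample mean 0.95268 from the PROC MEANS output, this gives an estimate of about 1.05, in line with the NLMIXED estimate of 1.0497 reported in this handout.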

proc nlin method=dud data=gen_exp;

parameter lambda=0.25;

negloglin = -log(lambda) + lambda*time;

if negloglin<0 then negloglin = 1e-6;

model dummy = sqrt(negloglin);

run;
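The PROC NLIN call above is a device for getting a least squares routine to minimize a negative log-likelihood. A sketch of the reasoning, assuming the variable dummy equals 0 for every observation (the data step that creates it is not shown here, so this is an assumption):

```latex
\begin{aligned}
\text{SSE}(\lambda)
 &= \sum_{i=1}^{n}\Bigl(\text{dummy}_i - \sqrt{-\log\lambda + \lambda t_i}\Bigr)^2\\
 &= \sum_{i=1}^{n}\bigl(-\log\lambda + \lambda t_i\bigr)
  = -\ell(\lambda) \qquad \text{when } \text{dummy}_i = 0 .
\end{aligned}
```

Minimizing the sum of squares is then the same as maximizing the log-likelihood, provided every term under the square root is nonnegative. The guard that resets a negative negloglin to 1e-6 changes the objective for those observations, which may be one reason the NLIN estimate differs slightly from the closed-form and NLMIXED estimates.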

Iterative Phase

Iter  lambda  Sum of Squares
0     0.2500  40.
1     1.1829  24.
2     0.9727  23.
3     1.0355  23.
4     1.0341  23.
5     1.0341  23.
6     1.0343  23.
7     1.0344  23.
8     1.0344  23.
9     1.0344  23.
10    1.0344  23.
11    1.0344  23.
12    1.0344  23.
13    1.0344  23.

NOTE: Convergence criterion met but a note in the log indicates a possible problem with the model.

Estimation Summary

Method                 Gauss-Newton
Iterations             13
Subiterations          5
Average Subiterations  0.
R                      8.253E-
PPC                    8.915E-
RPC(lambda)            3.603E-
Object                 5.51E-
Objective              23.
Observations Read      25
Observations Used      25
Observations Missing   0

NOTE: An intercept was not specified for this model.

Source             DF  Sum of Squares  Mean Square  F Value  Approx Pr > F
Model              1   -23.7907        -23.7907     -24.     .
Error              24  23.7907         0.
Uncorrected Total  25  0

Parameter  Estimate  Approx Std Error  Approximate 95% Confidence Limits
lambda     1.0344    0.000228          1.0339  1.

/* alternative code using NLMIXED where likelihood is directly entered */
/* added: 6 Oct 04 */

proc nlmixed data=gen_exp;
parms lambda=0.25;
ll = log(lambda) - lambda*time;
model time ~ general(ll); * could also use gamma(lambda,1) in model;
run;

Specifications

Data Set                             WORK.GEN_EXP
Dependent Variable                   time
Distribution for Dependent Variable  General
Optimization Technique               Dual Quasi-Newton
Integration Method                   None

Dimensions

Observations Used      25
Observations Not Used  0
Total Observations     25
Parameters             1

Parameters

lambda  NegLogLike
0.25    40.

Iteration History

Iter  Calls  NegLogLike  Diff      MaxGrad   Slope
1     2      25.0547568  15.55684  9.516393  -380.
2     4      23.8231484  1.231608  1.308295  -0.
3     5      23.7908217  0.032327  0.359401  -0.
4     6      23.7880391  0.002783  0.01845   -0.
5     7      23.7880316  7.492E-6  0.000274  -0.
6     9      23.7880316  1.655E-9  1.785E-8  -3.31E-

NOTE: GCONV convergence criterion satisfied.

Fit Statistics

-2 Log Likelihood         47.
AIC (smaller is better)   49.
AICC (smaller is better)  49.
BIC (smaller is better)   50.

Parameter Estimates

Parameter  Estimate  Standard Error  DF  t Value  Pr > |t|  Alpha  Lower   Upper   Gradient
lambda     1.0497    0.2099          25  5.00     <.0001    0.05   0.6173  1.4820  1.785E-

EXAMPLE: Using SAS DATA programming to do a percentile bootstrap CI