SAS Procedure for Univariate and Bivariate Analysis: An Example, Exams of Statistics

An example of using the SAS procedures UNIVARIATE and FREQ for univariate and bivariate analysis of data. It covers options such as CHISQ, EXPECTED, DEVIATION, and MEASURES for testing independence and association between variables. The example includes creating contingency tables and calculating measures of association such as tau-c.

Typology: Exams

Pre 2010

Uploaded on 09/02/2009

koofers-user-m96

PROC UNIVARIATE < options > ;
BY variables ;
CLASS variable-1 <(v-options)> < variable-2 <(v-options)> >
...< / KEYLEVEL= value1 | ( value1 value2 ) >;
VAR variables ;
ID variables ;
OUTPUT < OUT=SAS-data-set >
< keyword1=names...keywordk=names > < percentile-options >;
Proc Options: DATA=, NOPRINT, PLOT, FREQ, NORMAL, PCTLDEF=, VARDEF=,
ALPHA=, CIBASIC (TYPE= ALPHA=), MU0=, TRIM= (TYPE= ALPHA=)
PCTLDEF = 1, 2, 3, 4, or 5 (5 methods of computing percentiles)
VARDEF = DF, N, WEIGHT, WDF (divisor for computing variance)
NORMAL computes the Shapiro-Wilk statistic W if n ≤ 2000
or the Kolmogorov-Smirnov statistic D if n > 2000
TYPE= LOWER, UPPER, TWOSIDED (specify type of confidence intervals)
TRIM= list of integers or fractions specifying amount of trimming
ALPHA=.05 (for both CIBASIC and TRIM)
Class Options: The v-options are MISSING or ORDER=
ORDER = FREQ, DATA, INTERNAL, FORMATTED
specifies how levels of the variable are ordered in the output
Keywords for OUTPUT:
N, NMISS, NOBS, MEAN, SUM, STD, VAR, SKEWNESS, KURTOSIS, SUMWGT,
MAX, MIN, RANGE, Q3, MEDIAN, Q1, QRANGE, P1, P5, P10, P90, P95, P99,
MODE, SIGNRANK, NORMAL, PCTLNAME= , PCTLPTS=, PCTLPRE=
Example:
proc univariate data=survey;
class county;
var acreage rainfall;
output out=new mean=ave1 ave2 var= v1 v2;
run;
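
Several of the procedure options listed above can be combined in a single call. The following is a minimal sketch, assuming the SURVEY data set and the ACREAGE variable from the example; the output data set name PCTLS, the prefix ACRE_, the trimming fraction, and the percentile points are illustrative choices, not part of the original notes.

```sas
/* Sketch only: SURVEY and ACREAGE come from the example above;      */
/* PCTLS, ACRE_, the 10% trimming and the percentile points are      */
/* hypothetical choices made for illustration.                       */
proc univariate data=survey normal
                cibasic(type=twosided alpha=.05)
                trim=.1 (type=twosided alpha=.05);
   var acreage;
   /* request the 2.5th and 97.5th percentiles in the output data set */
   output out=pctls pctlpts=2.5 97.5 pctlpre=acre_;
run;
```

Here NORMAL requests the tests of normality, CIBASIC adds confidence limits for the mean, standard deviation, and variance, and TRIM= requests a trimmed mean with its own confidence interval.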

PROC FREQ < options > ;
BY variables ;
TABLES requests < / options > ;
EXACT statistic-options < / computation-options > ;
TEST options ;
OUTPUT < OUT=SAS-data-set > options ;
WEIGHT variable < / options > ;

Proc Options: DATA=, ORDER=, FORMCHAR(1,2,7)=, NOPRINT

ORDER = FREQ, DATA, INTERNAL, FORMATTED
specifies the order in which variable levels appear in the output
FORMCHAR(1,2,7)='|-+'

TABLES Options: MISSING, LIST, OUT=
CHISQ, EXPECTED, DEVIATION, CELLCHI2, CUMCOL, MISSPRINT, SPARSE, ALL
NOFREQ, NOPERCENT, NOROW, NOCOL, NOCUM, NOPRINT
BINOMIAL, TESTF=, TESTP=
MEASURES, CL, ALPHA=, AGREE

TABLES Requests:

tables A
tables A*B
tables A*(B C)
tables (A B)*(C D)
tables (A B C)*D
tables A--C
tables (A--C)*D

data Color;
input Eyes $ Hair $ Count @@;
label Eyes ='Eye Color' Hair ='Hair Color';
datalines;
blue fair 23 blue red 7 blue medium 24 blue dark 11
green fair 19 green red 7 green medium 18 green dark 14
brown fair 34 brown red 5 brown medium 41 brown dark 40
brown black 3
;
proc freq data=Color order=data;
weight Count;
tables Hair / nocum testp=(30 12 30 25 3);
title 'Hair Color of European Children';
run;
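
The TABLES options listed earlier also apply to two-way requests. The following is a minimal sketch, assuming the Color data set created above; the particular combination of options is an illustrative choice and does not appear in the original notes.

```sas
/* Sketch only: cross-tabulates Eyes by Hair from the Color data set. */
/* The option list is a hypothetical selection for illustration.      */
proc freq data=Color order=data;
   weight Count;
   /* CHISQ: chi-square tests; EXPECTED, DEVIATION, CELLCHI2: per-cell */
   /* expected counts, observed minus expected, and chi-square parts   */
   tables Eyes*Hair / chisq expected deviation cellchi2
                      norow nocol nopercent;
   title 'Eye Color by Hair Color';
run;
```

NOROW, NOCOL, and NOPERCENT suppress the percentages so that each cell shows only the frequency and the requested diagnostics.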

Application 2: Two-way Frequency Table with I populations and J categories

  • Testing homogeneity of several multinomial populations.
  • Random samples are taken from each population and then classified by a categorical variable.
  • The populations are usually defined by levels of variables such as gender, age group, state, etc.
  • Levels of the categorical variable form the J categories of the multinomial populations.
  • Null hypothesis of homogeneity of populations is

$H_0 : p_{1j} = p_{2j} = \cdots = p_{Ij}, \quad j = 1, 2, \ldots, J$

Example:

  • For example, suppose samples are drawn from two populations (say, males and females or persons below and above the age of 40) and they are grouped into three categories (say, according to three levels of support for a certain local bond issue).
  • Suppose that the multinomial probabilities for each population are as given in the following table:

                   Groups
Populations    p11   p12   p13
               p21   p22   p23

  • Null hypothesis of homogeneity of populations is

$H_0 : p_{11} = p_{21},\ p_{12} = p_{22},\ p_{13} = p_{23}$
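
The homogeneity hypothesis is usually tested with the Pearson chi-square statistic, which is what the CHISQ option of PROC FREQ reports. As a sketch, with notation that is assumed here rather than taken from the notes: $n_{ij}$ is the observed count for population $i$ and category $j$, $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column totals, and $n$ is the overall total.

```latex
\[
  X^2 \;=\; \sum_{i=1}^{I} \sum_{j=1}^{J}
            \frac{\bigl(n_{ij} - \hat{E}_{ij}\bigr)^2}{\hat{E}_{ij}},
  \qquad
  \hat{E}_{ij} \;=\; \frac{n_{i\cdot}\, n_{\cdot j}}{n},
  \qquad
  \text{df} \;=\; (I-1)(J-1).
\]
```

For the two-population, three-category example above, this gives $(2-1)(3-1) = 2$ degrees of freedom.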

Application 3: Contingency tables constructed by multiple classification of a single random sample.

  1. Observations in a sample may be cross-classified by categorical variables.
       - ordinal values: age, income, temperature
       - nominal values: gender, marital status, region
  2. These variables form natural subsets or strata of the data.
  3. These may also be generated from other quantitative variables such as population or income.
       - 3 income groups (say “low,” “middle,” or “high”)
       - according as income is < $30,000, between $30,000 and $70,000, or > $70,000

Measures of Association: The following three measures are suitable for measuring the strength of the dependency between nominal variables but are also applicable to ordinal variables. All three are derived from the Pearson chi-square statistic.

  • phi coefficient: the range is $0 \le \phi \le \min\{\sqrt{r-1}, \sqrt{c-1}\}$.

  • contingency coefficient C
    1. value of C is zero if there is no association
    2. value that is less than 1 even with perfect dependence
    3. value is dependent on the size of the table
    4. a maximum value of $\sqrt{(r-1)/r}$ for an r × r table

  • Cramer’s V is a normed measure, so its value is between 0 and 1
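
For reference, the three chi-square-based measures can be written as follows. This is a sketch using assumed notation: $X^2$ is the Pearson chi-square statistic, $n$ the total sample size, and $r$ and $c$ the numbers of rows and columns.

```latex
\[
  \phi = \sqrt{\frac{X^2}{n}},
  \qquad
  C = \sqrt{\frac{X^2}{X^2 + n}},
  \qquad
  V = \sqrt{\frac{X^2}{n \,\min(r-1,\; c-1)}}.
\]
```

Because $X^2$ can be at most $n \min(r-1, c-1)$, dividing by that quantity is what confines $V$ to the interval from 0 to 1, while $C$ stays strictly below 1.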

Other Measures of Association: Many of these statistical measures also require the assignment of a dependent variable and an independent variable, as the goal is to predict the rank (category) of an individual on the dependent variable given that the individual belongs to a certain category of the independent variable.

To calculate the following measures, each pair of observations must be classified as concordant or discordant:

  • the pair (12, 2.7) and (15, 3.1) are concordant
  • the pair (12, 2.7) and (10, 3.1) are discordant
  • Gamma: a normed measure based on the numbers of concordant and discordant pairs.
    1. if there are no discordant pairs, Gamma is +1: perfect positive association
    2. if there are no concordant pairs, Gamma is −1: perfect negative association
    3. values between −1 and +1 measure the strength of negative or positive association
    4. if the numbers of discordant and concordant pairs are equal, Gamma is zero: the rank of the independent variable cannot be used to predict the rank of the dependent variable
  • Kendall’s tau-b: the ratio of the difference between the numbers of concordant and discordant pairs to the total number of pairs. It is scaled to be between −1 and +1 when there are no ties, but not otherwise.
  • Somers’ D: an ordinal measure; adjusts for ties so that the value of the statistic lies between −1 and +1. Two values of this statistic are computed:
    1. when the row variable is considered the independent variable (Somers’ D C|R)
    2. when the column variable is considered the independent variable (Somers’ D R|C)
  • Asymmetric lambda λ(R|C): a nominal measure; interpreted as the proportional reduction in error (PRE) in predicting the dependent (row) variable given the independent (column) variable. The range is 0 ≤ λ(R|C) ≤ 1; values around 0.3 are considered high.
  • Asymmetric lambda λ(C|R): similarly interpreted.
  • Stuart’s tau-c: both variables on an ordinal scale; adjusts for table size and corrects for ties. Its range is −1 ≤ τc ≤ 1.
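
The pair-based measures can be made concrete with a little notation. As a sketch, with symbols assumed here rather than taken from the notes: a pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is concordant when $(x_i - x_j)(y_i - y_j) > 0$ and discordant when the product is negative; $N_c$ and $N_d$ count the concordant and discordant pairs, $n$ is the sample size, and $m = \min(r, c)$ for an $r \times c$ table.

```latex
\[
  \gamma \;=\; \frac{N_c - N_d}{N_c + N_d},
  \qquad
  \tau_c \;=\; \frac{2m\,(N_c - N_d)}{n^2\,(m - 1)}.
\]
```

For the pairs quoted above, $(12 - 15)(2.7 - 3.1) > 0$ gives a concordant pair and $(12 - 10)(2.7 - 3.1) < 0$ gives a discordant one.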

Example:

In the following example, factor B could be a factor such as Marital Status and factor A could be a response to a question with three possibilities. Consider the following table of frequencies:

         b1    b2    b3   Total
a1        8    16    31     55
a2        9    18    74    101
a3       34    23    17     74
Total    51    57   122    230

In this case, the column variable is called the independent variable with categories being classes, groups, or strata.

data ex8;
input A $ B $ count @@;
datalines;
a1 b1 8 a1 b2 16 a1 b3 31
a2 b1 9 a2 b2 18 a2 b3 74
a3 b1 34 a3 b2 23 a3 b3 17
;
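
A sketch of how these counts might be analyzed, assuming the ex8 data set created above. The option list is an illustrative choice built from the TABLES options described earlier, not necessarily the call used in the original notes.

```sas
/* Sketch only: A is the row variable and B the column variable,    */
/* matching the frequency table above. COUNT holds the cell counts. */
proc freq data=ex8;
   weight count;
   /* CHISQ: tests of independence; EXPECTED and DEVIATION: cell     */
   /* diagnostics; MEASURES: gamma, tau-b, tau-c, Somers' D, lambda  */
   tables A*B / chisq expected deviation measures;
   title 'Association between factors A and B';
run;
```

Adding CL to the TABLES options would also request confidence limits for the measures of association.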

PLOT Statement Options: HAXIS=, VAXIS=
HAXIS=5 10 15 20 25 30 35
HAXIS=5 to 35 by 5
HAXIS=by 5
HAXIS='Kansas' 'Missouri' 'Iowa' 'Illinois' 'Nebraska'
HAXIS='01MAY07'd to '01DEC07'd by month

HPOS=, VPOS=, HREF=, VREF=, HREFCHAR=, VREFCHAR=, BOX, OVERLAY,
CONTOUR <=number-of-levels>

PLACEMENT=(expression(s))
PLACEMENT=(H=0, V=0, S=CENTER, L=1)
PLACEMENT=(H=0 1)
PLACEMENT=(H=0 1 -1 * V=1 -1)
PLACEMENT=((s=right left : h=1 -1)(v=1 -1 * h=1 -1))
SPLIT='split-character'

PENALTIES< (index-list) >=penalty-list
PENALTIES(1)=
PENALTIES(15 to 19)=2 3 4 10 15 25

Example:

proc plot data=mylib.fueldat;
plot fuel*roads;
plot fuel*roads='+';
plot fuel*roads='*' $ st;
plot fuel*roads='*' $ st / haxis=0 to 20 by 2
     placement=((s=right left : h=1 -1)(v=1 -1 * h=1 -1));
title 'Output from Sample Plot Statements';
run;

PROC CHART < option(s) >;
BLOCK variable(s) < / option(s)>;
BY variables;
HBAR variable(s) < / option(s)>;
PIE variable(s) < / option(s)>;
STAR variable(s) < / option(s)>;
VBAR variable(s) < / option(s)>;

Proc Options: DATA=, FORMCHAR <(position(s))>='formatting-character(s)'

Options for: VBAR, HBAR, BLOCK, PIE, STAR
MISSING, DISCRETE,
TYPE= FREQ, PCT, CFREQ, CPCT, SUM, MEAN
SUMVAR = variable
MIDPOINTS = values
FREQ = variable
AXIS = value

Options for: VBAR, HBAR, BLOCK only
GROUP = variable
SUBGROUP = variable
LEVELS = n

Defaults for TYPE=
If TYPE= is omitted, the default is TYPE=FREQ, except when the SUMVAR= option is specified, in which case the default is TYPE=SUM.

Example:

proc chart data=mylib.fueldat;
vbar fuel;
vbar fuel / midpoints=300 to 1000 by 100 type=percent;
vbar incomgrp / discrete sumvar=fuel;
vbar incomgrp / discrete sumvar=fuel type=mean subgroup=taxgrp;
vbar incomgrp / discrete sumvar=fuel type=mean group=taxgrp;
format incomgrp ing.;
title 'Illustrating HBAR statement in PROC CHART';
run;
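
The same options carry over to horizontal bar charts. The following is a minimal sketch, assuming the mylib.fueldat data set and the ing. format used in the example above (the format must already have been defined with PROC FORMAT); the title text is a hypothetical choice.

```sas
/* Sketch only: horizontal bars of mean FUEL for each income group. */
/* Assumes the user-defined format ING. is already available.       */
proc chart data=mylib.fueldat;
   hbar incomgrp / discrete sumvar=fuel type=mean;
   format incomgrp ing.;
   title 'Mean fuel by income group';
run;
```

If TYPE=MEAN were dropped here, the SUMVAR= option would make TYPE=SUM the default, as described above.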

Proc TABULATE Examples:

In the following examples, tables are presented that contain demographic information extracted from a data set containing the following variables:

Variable   Description
REGION     code for region of the country
CITYSIZE   code for relative population size (S=small, M=medium, L=large)
POP        urban population

Each observation contains data for one city.

Most applications also use CLASS and VAR statements in addition to the PROC TABULATE statement.

PROC TABULATE;
TITLE 'REGION, POP';
CLASS REGION;
VAR POP;

In the examples below, the required table format is specified with a TABLE statement, and the output produced by different TABLE statements is sketched:

Example 1: TABLE REGION,POP;

                 POP
                 SUM
REGION
NC        4650000.00
NE        6666000.00
SO        6864000.00
WE        8376000.00

Example 2: TABLE REGION,CITYSIZE*POP*SUM ;

                         CITYSIZE
                 L              M              S
                POP            POP            POP
                SUM            SUM            SUM
REGION
NC        3750000.00      750000.00      150000.00
NE        5022000.00     1422000.00      222000.00
SO        4488000.00     2088000.00      288000.00
WE        5592000.00     2592000.00      192000.00

Example 3: TABLE REGION*CITYSIZE,POP*SUM ;

                            POP
                            SUM
REGION   CITYSIZE
NC       L           3750000.00
         M            750000.00
         S            150000.00
NE       L           5022000.00
         M           1422000.00
         S            222000.00
SO       L           4488000.00
         M           2088000.00
         S            288000.00
WE       L           5592000.00
         M           2592000.00
         S            192000.00
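
Putting the pieces together, a complete step that produces a table like Example 3 might look as follows. This is a sketch only: the data set name CITIES and the title text are hypothetical, and both REGION and CITYSIZE are listed in the CLASS statement because both are used as classification variables in the TABLE statement.

```sas
/* Sketch only: CITIES is an assumed data set with one observation */
/* per city and the variables REGION, CITYSIZE and POP.            */
proc tabulate data=cities;
   class region citysize;
   var pop;
   /* rows: CITYSIZE nested within REGION; column: sum of POP */
   table region*citysize, pop*sum;
   title 'POP by REGION and CITYSIZE';
run;
```

Swapping the TABLE statement for `table region, citysize*pop*sum;` would give the layout of Example 2 instead.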

Example 5: TABLE PRODUCT, REGION CITYSIZE, SALETYPE*(QUANTITY AMOUNT) ;

PRODUCT A
                           SALETYPE
                    R                       W
            QUANTITY     AMOUNT     QUANTITY     AMOUNT
              SUM         SUM         SUM         SUM
REGION
NC           1250.00    31250.00     1250.00    25000.00
NE           1600.00    40000.00     1600.00    32000.00
SO           1880.00    47000.00     1880.00    37600.00
WE           1840.00    46000.00     1840.00    36800.00
CITYSIZE
L            3190.00    79750.00     3190.00    63800.00
M            2440.00    61000.00     2440.00    48800.00
S             940.00    23500.00      940.00    18800.00

PRODUCT A
                           SALETYPE
                    R                       W
            QUANTITY     AMOUNT     QUANTITY     AMOUNT
              SUM         SUM         SUM         SUM
REGION
NC           1295.00    32375.00     1295.00    25900.00