Categorical Data Analysis Using the SAS System, Study Notes in Statistics

Categorical Data Analysis Using the SAS System - Maura E. Stokes, Charles S. Davis, Gary G. Koch

Type: Study notes

2012

Shared on 14/04/2012

Categorical
Data Analysis Using
The SAS® System
2nd Edition
Maura E. Stokes
Charles S. Davis
Gary G. Koch
SAS Publishing




Preface to the Second Edition

This second edition contains several new topics and includes numerous updates to reflect Version 8 of the SAS System. Chapter 15, “Generalized Estimating Equations,” is a new chapter that discusses the use of the GEE method, particularly as a tool for analyzing repeated measurements data. The book includes several comparisons of analyses using the GEE method, weighted least squares, and conditional logistic regression; the use of subject-specific models versus population-averaged models is discussed. Chapter 15 also describes the use of GEE methods for some univariate response situations.

Chapter 12, “Poisson Regression,” is a new chapter on Poisson regression. Previously, this topic was described in the chapter on time-to-event categorical data. The methodology is illustrated with several examples.

Chapters on the analysis of tables now include much more material on the use of exact tests of association, particularly Chapter 2, “The 2 × 2 Table,” and Chapter 5, “The s × r Table.”

Exact logistic regression using the LOGISTIC procedure is discussed in Chapter 8, “Logistic Regression I: Dichotomous Response.” Chapter 8 also describes the use of the CLASS statement in PROC LOGISTIC, and all of the examples in the various chapters using PROC LOGISTIC have been updated to take advantage of the new CLASS statement. Chapter 10, “Conditional Logistic Regression,” has been largely revised to put more emphasis on the stratified data setting.

In addition, miscellaneous revisions and additions appear throughout the book.

Computing Details

Writing a book for software that is constantly changing is not straightforward. This second edition is targeted for Version 8 of the SAS System and takes advantage of many of the features of that release. The examples were executed with the 8.1 release on the HP UNIX platform, but most of the output can be reproduced using Version 8.0 with the following changes for Release 8.1:

• PROC LOGISTIC adds exact logistic regression.

• PROC GENMOD models, by default, the probability of the lowest ordered response variable levels. (The default has been changed from previous releases to make it consistent with other procedures.)

To make things a little more complicated, the authors used an output template for the LOGISTIC procedure that will become the default in Release 8.2. The main difference is that the label for the chi-square statistic in the parameter estimates table is “Wald Chi-Square” in Release 8.2 (which was the label used in Version 6).

Acknowledgments

The second edition proved to be a substantial undertaking. We are thankful for getting a lot of help along the way.

We would like to thank Ozkan Zengin for his assistance in bringing this book up to date in a number of ways, including adaptation to a new publishing system and running and checking all of the examples. Dan Spitzner provided careful proofing.

Numerous colleagues contributed to this book with their conversations, reviews, and suggestions, and we are very grateful for their time and effort. We thank Bob Derr, Diane Catellier, Gordon Johnston, Lisa LaVange, John Preisser, David Schlotzhauer, Todd Schwartz, and Donna Watts.

And, of course, we remain thankful to those persons who helped to launch the first edition with their sundry feedback. They include Sonia Davis, William Duckworth II, Suzanne Edwards, Stuart Gansky, Greg Goodwin, Wendy Greene, Duane Hayes, Allison Kinkead, Antonio Pedroso-de-Lima, Annette Sanders, Catherine Tangen, Lisa Tomasko, and Greg Weier.

We also thank our many readers who found the book useful and encouraged its continuing life in a second edition.

Virginia Clark edited this book.

Ginny Matsey designed the cover.

Tim Arnold provided documentation programming support.


Chapter 1

Introduction

1.1 Overview

Data analysts often encounter response measures that are categorical in nature; their outcomes reflect categories of information rather than the usual interval scale. Frequently, categorical data are presented in tabular form, known as contingency tables. Categorical data analysis is concerned with the analysis of categorical response measures, regardless of whether any accompanying explanatory variables are also categorical or are continuous. This book discusses hypothesis testing strategies for the assessment of association in contingency tables and sets of contingency tables. It also discusses various modeling strategies available for describing the nature of the association between a categorical response measure and a set of explanatory variables.

An important consideration in determining the appropriate analysis of categorical variables is their scale of measurement. Section 1.2 describes the various scales and illustrates them with data sets used in later chapters. Another important consideration is the sampling framework that produced the data; it determines the possible analyses and the possible inferences. Section 1.3 describes the typical sampling frameworks and their ramifications. Section 1.4 introduces the various analysis strategies discussed in this book and describes how they relate to one another. It also discusses the target populations generally assumed for each type of analysis and what types of inferences you are able to make to them. Section 1.5 reviews how the SAS System handles contingency tables and other forms of categorical data. Finally, Section 1.6 provides a guide to the material in the book for various types of readers, including indications of the difficulty level of the chapters.

1.2 Scale of Measurement

The scale of measurement of a categorical response variable is a key element in choosing an appropriate analysis strategy. By taking advantage of the methodologies available for the particular scale of measurement, you can choose a well-targeted strategy. If you do not take the scale of measurement into account, you may choose an inappropriate strategy that could lead to erroneous conclusions. Recognizing the scale of measurement and using it properly are very important in categorical data analysis.


Table 1.2. Arthritis Data

                           Improvement
Sex      Treatment    Marked   Some   None   Total
Female   Active         16       5      6      27
Female   Placebo         6       7     19      32
Male     Active          5       2      7      14
Male     Placebo         1       0     10      11

Note that categorical response variables can often be managed in different ways. You could combine the Marked and Some columns in Table 1.2 to produce a dichotomous outcome: No Improvement versus Improvement. Grouping categories is often done during an analysis if the resulting dichotomous response is also of interest.
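The grouping described above can be sketched in a few lines. This is an illustration in Python rather than SAS, and the names `arthritis`, `collapse`, and `dichotomous` are assumptions introduced here, not identifiers from the book. The counts are those of Table 1.2.

```python
# Table 1.2 (arthritis data): counts of Marked, Some, and None improvement.
arthritis = {
    ("Female", "Active"):  (16, 5, 6),
    ("Female", "Placebo"): (6, 7, 19),
    ("Male", "Active"):    (5, 2, 7),
    ("Male", "Placebo"):   (1, 0, 10),
}


def collapse(counts):
    """Merge the Marked and Some columns into one Improvement category."""
    marked, some, none = counts
    return {"Improvement": marked + some, "No Improvement": none}


# One dichotomous row per Sex x Treatment group.
dichotomous = {group: collapse(counts) for group, counts in arthritis.items()}

for (sex, treatment), row in dichotomous.items():
    print(sex, treatment, row)
```

The row totals are unchanged by the grouping; only the number of response categories drops from three to two.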

If you have more than two outcome categories, and there is no inherent ordering to the categories, you have a nominal measurement scale. Which of four candidates did you vote for in the town council election? Do you prefer the beach, mountains, or lake for a vacation? There is no underlying scale for such outcomes and no apparent way in which to order them.

Consider Table 1.3, which is analyzed in Chapter 5, “The s × r Table.” Residents in one town were asked their political party affiliation and their neighborhood. Researchers were interested in the association between political affiliation and neighborhood. Unlike ordinal response levels, the classifications Bayside, Highland, Longview, and Sheffeld lie on no conceivable underlying scale. However, you can still assess whether there is association in the table, which is done in Chapter 5.

Table 1.3. Distribution of Parties in Neighborhoods

                            Neighborhood
Party          Bayside   Highland   Longview   Sheffeld
Democrat         221       160        360        140
Independent      200       291        160        311
Republican       208       106        316         97

Categorical response variables sometimes contain discrete counts. Instead of falling into categories that are labeled (yes, no) or (low, medium, high), the outcomes are numbers themselves. Was the litter size 1, 2, 3, 4, or 5 members? Did the house contain 1, 2, 3, or 4 air conditioners? While the usual strategy would be to analyze the mean count, the assumptions required for the standard linear model for continuous data are often not met with discrete counts that have small range; the counts are not distributed normally and may not have homogeneous variance.

For example, researchers examining respiratory disease in children visited children in different regions two times and determined whether they showed symptoms of respiratory illness. The response measure was whether the children exhibited symptoms in 0, 1, or 2 periods. Table 1.4 contains these data, which are analyzed in Chapter 13, “Weighted Least Squares.”


Table 1.4. Colds in Children

                        Periods with Colds
Sex      Residence      0      1      2    Total
Female   Rural         45     64     71     180
Female   Urban         80    104    116     300
Male     Rural         84    124     82     290
Male     Urban        106    117     87     310

The table represents a cross-classification of gender, residence, and number of periods with colds. The analysis is concerned with modeling the mean number of periods with colds as a function of gender and residence.
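To make the response function concrete, the observed mean number of periods with colds can be computed for each group in Table 1.4. The sketch below is in Python rather than SAS and only reproduces the raw group means; it does not fit the model of Chapter 13. The names `colds` and `mean_periods` are assumptions made for the illustration.

```python
# Table 1.4: number of children with colds in 0, 1, or 2 periods.
colds = {
    ("Female", "Rural"): (45, 64, 71),
    ("Female", "Urban"): (80, 104, 116),
    ("Male", "Rural"):   (84, 124, 82),
    ("Male", "Urban"):   (106, 117, 87),
}


def mean_periods(counts):
    """Mean of the scores 0, 1, 2 weighted by the observed counts."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts)) / total


means = {group: mean_periods(counts) for group, counts in colds.items()}

for (sex, residence), value in means.items():
    print(f"{sex:6s} {residence:5s} mean periods with colds = {value:.3f}")
```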

Finally, another type of response variable in categorical data analysis is one that represents survival times. With survival data, you are tracking the number of patients with certain outcomes (possibly death) over time. Often, the times of the condition are grouped together so that the response variable represents the number of patients who fail during a specific time interval. Such data are called grouped survival times. For example, the data displayed in Table 1.5 are from Chapter 17, “Categorized Time-to-Event Data.” A clinical condition is treated with an active drug for some patients and with a placebo for others. The response categories are whether there are recurrences, no recurrences, or whether the patients withdrew from the study. The entries correspond to the time intervals 0–1 years, 1–2 years, and 2–3 years, which make up the rows of the table.

Table 1.5. Life Table Format for Clinical Condition Data

Controls
Interval     No Recurrences   Recurrences   Withdrawals   At Risk
0–1 Years          50              15             9           74
1–2 Years          30              13             7           50
2–3 Years          17               7             6           30

Active
Interval     No Recurrences   Recurrences   Withdrawals   At Risk
0–1 Years          69              12             9           90
1–2 Years          59               7             3           69
2–3 Years          45              10             4           59
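The life table layout has a simple internal structure that can be checked directly: within each interval the three outcome counts add up to the number at risk, and the patients with no recurrence are the ones at risk in the next interval. The following Python sketch (not SAS; `life_table` and `consistent` are names assumed for this illustration) verifies both properties for the counts in Table 1.5.

```python
# Table 1.5: (no recurrences, recurrences, withdrawals, at risk) per interval.
life_table = {
    "Controls": [(50, 15, 9, 74), (30, 13, 7, 50), (17, 7, 6, 30)],
    "Active":   [(69, 12, 9, 90), (59, 7, 3, 69), (45, 10, 4, 59)],
}


def consistent(rows):
    """Check row sums and the carry-over of patients between intervals."""
    for i, (no_rec, rec, withdrawn, at_risk) in enumerate(rows):
        if no_rec + rec + withdrawn != at_risk:
            return False
        # Patients without recurrence enter the next interval at risk.
        if i + 1 < len(rows) and rows[i + 1][3] != no_rec:
            return False
    return True


for group, rows in life_table.items():
    print(group, "consistent:", consistent(rows))
```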

1.3 Sampling Frameworks

Categorical data arise from different sampling frameworks. The nature of the sampling framework determines the assumptions that can be made for the statistical analyses and in turn influences the type of analysis that can be applied. The sampling framework also determines the type of inference that is possible. Study populations are limited to target populations, those populations to which inferences can be made, by assumptions justified by the sampling framework.

Generally, data fall into one of three sampling frameworks: historical data, experimental data, and sample survey data. Historical data are observational data, which means that the


historical data, you often want to control for other explanatory variables that may have influenced the observed outcomes.

1.4.1 Randomization Methods

Table 1.1, the respiratory outcomes data, contains information obtained as part of a randomized allocation process. The hypothesis of interest is whether there is an association between treatment and outcome. For these data, the randomization is accomplished by the study design.

Table 1.6 contains data from a similar study. The main difference is that the study was conducted in two medical centers. The hypothesis of interest is whether there is an association between treatment and outcome, controlling for any effect of center.

Table 1.6. Respiratory Improvement

Center   Treatment    Yes     No    Total
1        Test          29     16      45
1        Placebo       14     31      45
         Total         43     47      90
2        Test          37      8      45
2        Placebo       24     21      45
         Total         61     29      90

Chapter 2, “The 2 × 2 Table,” is primarily concerned with the association in 2 × 2 tables; in addition, it discusses measures of association, that is, statistics designed to evaluate the strength of the association. Chapter 3, “Sets of 2 × 2 Tables,” discusses the investigation of association in sets of 2 × 2 tables. When the table of interest has more than two rows and two columns, the analysis is further complicated by the consideration of scale of measurement. Chapter 4, “Sets of 2 × r and s × 2 Tables,” considers the assessment of association in sets of tables where the rows (columns) have more than two levels.
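As a small illustration of a measure of association, the odds ratio and the difference in improvement proportions can be computed separately for each center in Table 1.6. This is a Python sketch rather than SAS, it treats each center on its own, and it is not the stratified analysis the later chapters develop. The names `respiratory`, `odds_ratio`, and `proportion_difference` are assumptions.

```python
# Table 1.6: (improved, not improved) counts by center and treatment.
respiratory = {
    1: {"Test": (29, 16), "Placebo": (14, 31)},
    2: {"Test": (37, 8),  "Placebo": (24, 21)},
}


def odds_ratio(table):
    """Cross-product ratio of a 2 x 2 table: odds(Test) / odds(Placebo)."""
    a, b = table["Test"]
    c, d = table["Placebo"]
    return (a * d) / (b * c)


def proportion_difference(table):
    """Proportion improved on Test minus proportion improved on Placebo."""
    a, b = table["Test"]
    c, d = table["Placebo"]
    return a / (a + b) - c / (c + d)


odds_ratios = {center: odds_ratio(t) for center, t in respiratory.items()}
differences = {center: proportion_difference(t) for center, t in respiratory.items()}

for center in respiratory:
    print(f"Center {center}: odds ratio = {odds_ratios[center]:.2f}, "
          f"difference in proportions = {differences[center]:.3f}")
```

An odds ratio above 1 indicates higher odds of improvement on the test treatment than on placebo within that center.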

Chapter 5 describes the assessment of association in the general s × r table, and Chapter 6, “Sets of s × r Tables,” describes the assessment of association in sets of s × r tables. The investigation of association in tables and sets of tables is further discussed in Chapter 7, “Nonparametric Methods,” which discusses traditional nonparametric tests that have counterparts among the strategies for analyzing contingency tables.

Another consideration in data analysis is whether you have enough data to support the asymptotic theory required for many tests. Often, you may have an overall table sample size that is too small or a number of zero or small cell counts that make the asymptotic assumptions questionable. Recently, exact methods have been developed for a number of association statistics that permit you to address the same hypotheses for these types of data. The above-mentioned chapters illustrate the use of exact methods for many situations.
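One well-known exact method for a 2 × 2 table is Fisher's exact test, which uses the hypergeometric distribution of a cell count given fixed margins instead of a large-sample approximation. The sketch below is in Python rather than SAS. The choice of test and of data is an assumption made for illustration: the table is the male stratum of Table 1.2 with Marked and Some combined (7 improved and 7 not on Active, 1 improved and 10 not on Placebo), and the source does not say that this particular table is analyzed this way.

```python
from math import comb


def hypergeom_pmf(x, row1, row2, col1):
    """P(first cell = x) when row and column totals are held fixed."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)


def fisher_one_sided(a, b, c, d):
    """One-sided exact p-value: P(first cell >= a) under no association."""
    row1, row2, col1 = a + b, c + d, a + c
    upper = min(row1, col1)  # largest value the first cell can take
    return sum(hypergeom_pmf(x, row1, row2, col1) for x in range(a, upper + 1))


# Males in Table 1.2 with Marked and Some merged into "improved".
p_value = fisher_one_sided(7, 7, 1, 10)
print(f"one-sided exact p-value = {p_value:.4f}")
```

The p-value is an exact sum of table probabilities, so no minimum cell count is needed for it to be valid.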

1.4.2 Modeling Strategies

Often, you are interested in describing the variation of your response variable in your data with a statistical model. In the continuous data setting, you frequently fit a model to the expected mean response. However, with categorical outcomes, there are a variety of response functions that you can model. Depending on the response function that you choose, you can use weighted least squares or maximum likelihood methods to estimate the model parameters.

Perhaps the most common response function modeled for categorical data is the logit. If you have a dichotomous response and represent the proportion of those subjects with an event (versus no event) outcome as p, then the logit can be written

$$\log\left(\frac{p}{1-p}\right)$$

Logistic regression is a modeling strategy that relates the logit to a set of explanatory variables with a linear model. One of its benefits is that estimates of odds ratios, important measures of association, can be obtained from the parameter estimates. Maximum likelihood estimation is used to provide those estimates.
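The link between logits and odds ratios can be seen in the simplest case. With a single two-level explanatory variable, the maximum likelihood estimate of its coefficient equals the difference between the two observed logits, and exponentiating it gives the odds ratio. The Python sketch below (not SAS) uses the Center 1 counts from Table 1.6; the names `logit`, `inverse_logit`, `beta`, and `odds_ratio_center1` are assumptions for the illustration.

```python
from math import exp, log


def logit(p):
    """Log odds of a proportion p, with 0 < p < 1."""
    return log(p / (1 - p))


def inverse_logit(x):
    """Map a logit back to a proportion."""
    return 1 / (1 + exp(-x))


# Center 1 of Table 1.6: proportion improved in each treatment group.
p_test = 29 / 45
p_placebo = 14 / 45

# Coefficient for treatment in a one-predictor logistic model.
beta = logit(p_test) - logit(p_placebo)
odds_ratio_center1 = exp(beta)

print(f"beta = {beta:.4f}, odds ratio = {odds_ratio_center1:.4f}")
```

With more explanatory variables the estimates no longer have this closed form and are found iteratively, but the interpretation of an exponentiated coefficient as an odds ratio carries over.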

Chapter 8, “Logistic Regression I: Dichotomous Response,” discusses logistic regression for a dichotomous outcome variable. Chapter 9, “Logistic Regression II: Polytomous Response,” discusses logistic regression for the situation where there are more than two outcomes for the response variable. Logits called generalized logits can be analyzed when the outcomes are nominal. And logits called cumulative logits can be analyzed when the outcomes are ordinal. Chapter 10, “Conditional Logistic Regression,” describes a specialized form of logistic regression that is appropriate when the data are highly stratified or arise from matched case-control studies. Chapter 8 and Chapter 10 describe the use of exact conditional logistic regression for those situations where you have limited or sparse data, and the asymptotic requirements for the usual maximum likelihood approach are not met.
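The two kinds of logits for a polytomous response can be illustrated with one row of Table 1.2 (Female, Active: 16 Marked, 5 Some, 6 None). In this Python sketch (not SAS), generalized logits compare each category with a reference category, and cumulative logits compare the response at or below a cut point with the response above it. The choice of the last category as reference and the direction of accumulation are assumptions; software defaults may differ.

```python
from math import log

# Female, Active row of Table 1.2: Marked, Some, None.
counts = (16, 5, 6)
total = sum(counts)
probs = [n / total for n in counts]

# Generalized logits: each category versus the reference (here the last, None).
generalized = [log(p / probs[-1]) for p in probs[:-1]]

# Cumulative logits: P(response at or before cut) versus P(response after cut).
cumulative = []
running = 0.0
for p in probs[:-1]:
    running += p
    cumulative.append(log(running / (1 - running)))

print("generalized logits:", [round(g, 4) for g in generalized])
print("cumulative logits: ", [round(c, 4) for c in cumulative])
```

Generalized logits make no use of the category order, while cumulative logits depend on it, which is why the former suit nominal outcomes and the latter ordinal ones.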

In logistic regression, the objective is to predict a response outcome from a set of explanatory variables. However, sometimes you simply want to describe the structure of association in a set of variables for which there are no obvious outcome or predictor variables. This occurs frequently for sociological studies. The loglinear model is a traditional modeling strategy for categorical data and is appropriate for describing the association in such a set of variables. It is closely related to logistic regression, and the parameters in a loglinear model are also estimated with maximum likelihood estimation. Chapter 16, “Loglinear Models,” discusses the loglinear model, including several typical applications.

Some application areas have features that led to the development of special statistical techniques. One of these areas for categorical data is bioassay analysis. Bioassay is the process of determining the potency or strength of a reagent or stimulus based on the response it elicits in biological organisms. Logistic regression is a technique often applied in bioassay analysis, where its parameters take on specific meaning. Chapter 11, “Quantal Bioassay Analysis,” discusses the use of categorical data methods for quantal bioassay.

Poisson regression is a modeling strategy that is suitable for discrete counts, and it is discussed in Chapter 12, “Poisson Regression.” Most often the log of the count is used as the response function so the model used is a loglinear one.
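In symbols, a loglinear model for counts can be written as follows. The notation is an assumption introduced here for illustration: $\mu_i$ is the expected count for observation $i$, $\mathbf{x}_i$ its vector of explanatory variables, and $\alpha$ and $\boldsymbol{\beta}$ the parameters.

```latex
\log(\mu_i) = \alpha + \mathbf{x}_i'\boldsymbol{\beta}
\qquad\Longleftrightarrow\qquad
\mu_i = \exp\left(\alpha + \mathbf{x}_i'\boldsymbol{\beta}\right)
```

Because the model is linear on the log scale, a one-unit increase in an explanatory variable multiplies the expected count by the exponential of its coefficient.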

Besides the logit and log counts, other useful response functions that can be modeled include proportions, means, and measures of association. Weighted least squares