




























































































Categorical Data Analysis Using the SAS System - Maura E. Stokes, Charles S. Davis, Gary G. Koch





























































































Preface to the Second Edition
This second edition contains several new topics and includes numerous updates to reflect Version 8 of the SAS System. Chapter 15, “Generalized Estimating Equations,” is a new chapter that discusses the use of the GEE method, particularly as a tool for analyzing repeated measurements data. The book includes several comparisons of analyses using the GEE method, weighted least squares, and conditional logistic regression; the use of subject-specific models versus population-averaged models is discussed. Chapter 15 also describes the use of GEE methods for some univariate response situations.
Chapter 12, “Poisson Regression,” is a new chapter on Poisson regression. Previously, this topic was described in the chapter on time-to-event categorical data. The methodology is illustrated with several examples.
Chapters on the analysis of tables now include much more material on the use of exact tests of association, particularly Chapter 2, “The 2 × 2 Table,” and Chapter 5, “The s × r Table.”
Exact logistic regression using the LOGISTIC procedure is discussed in Chapter 8, “Logistic Regression I: Dichotomous Response.” Chapter 8 also describes the use of the CLASS statement in PROC LOGISTIC, and all of the examples in the various chapters using PROC LOGISTIC have been updated to take advantage of the new CLASS statement. Chapter 10, “Conditional Logistic Regression,” has been largely revised to put more emphasis on the stratified data setting.
In addition, miscellaneous revisions and additions appear throughout the book.
Writing a book for software that is constantly changing is not straightforward. This second edition is targeted for Version 8 of the SAS System and takes advantage of many of the features of that release. The examples were executed with the 8.1 release on the HP UNIX platform, but most of the output can be reproduced using Version 8.0 with the following changes for Release 8.1:
PROC LOGISTIC adds exact logistic regression.
PROC GENMOD models, by default, the probability of the lowest ordered response variable levels. (The default has been changed from previous releases to make it consistent with other procedures.)
To make things a little more complicated, the authors used an output template for the LOGISTIC procedure that will become the default in Release 8.2. The main difference is that the label for the chi-square statistic in the parameter estimates table is “Wald Chi-Square” in Release 8.2 (which was the label used in Version 6).
Acknowledgments
The second edition proved to be a substantial undertaking. We are thankful for getting a lot of help along the way.
We would like to thank Ozkan Zengin for his assistance in bringing this book up to date in a number of ways, including adaptation to a new publishing system and running and checking all of the examples. Dan Spitzner provided careful proofing.
Numerous colleagues contributed to this book with their conversations, reviews, and suggestions, and we are very grateful for their time and effort. We thank Bob Derr, Diane Catellier, Gordon Johnston, Lisa LaVange, John Preisser, David Schlotzhauer, Todd Schwartz, and Donna Watts.
And, of course, we remain thankful to those persons who helped to launch the first edition with their sundry feedback. They include Sonia Davis, William Duckworth II, Suzanne Edwards, Stuart Gansky, Greg Goodwin, Wendy Greene, Duane Hayes, Allison Kinkead, Antonio Pedroso-de-Lima, Annette Sanders, Catherine Tangen, Lisa Tomasko, and Greg Weier.
We also thank our many readers who found the book useful and encouraged its continuing life in a second edition.
Virginia Clark edited this book.
Ginny Matsey designed the cover.
Tim Arnold provided documentation programming support.
Introduction
Data analysts often encounter response measures that are categorical in nature; their outcomes reflect categories of information rather than the usual interval scale. Frequently, categorical data are presented in tabular form, known as contingency tables. Categorical data analysis is concerned with the analysis of categorical response measures, regardless of whether any accompanying explanatory variables are also categorical or are continuous. This book discusses hypothesis testing strategies for the assessment of association in contingency tables and sets of contingency tables. It also discusses various modeling strategies available for describing the nature of the association between a categorical response measure and a set of explanatory variables.
An important consideration in determining the appropriate analysis of categorical variables is their scale of measurement. Section 1.2 describes the various scales and illustrates them with data sets used in later chapters. Another important consideration is the sampling framework that produced the data; it determines the possible analyses and the possible inferences. Section 1.3 describes the typical sampling frameworks and their ramifications. Section 1.4 introduces the various analysis strategies discussed in this book and describes how they relate to one another. It also discusses the target populations generally assumed for each type of analysis and what types of inferences you are able to make to them. Section 1.5 reviews how the SAS System handles contingency tables and other forms of categorical data. Finally, Section 1.6 provides a guide to the material in the book for various types of readers, including indications of the difficulty level of the chapters.
1.2 Scale of Measurement
The scale of measurement of a categorical response variable is a key element in choosing an appropriate analysis strategy. By taking advantage of the methodologies available for the particular scale of measurement, you can choose a well-targeted strategy. If you do not take the scale of measurement into account, you may choose an inappropriate strategy that could lead to erroneous conclusions. Recognizing the scale of measurement and using it properly are very important in categorical data analysis.
Table 1.2. Arthritis Data
                      Improvement
Sex     Treatment  Marked  Some  None  Total
Female  Active         16     5     6     27
Female  Placebo         6     7    19     32
Male    Active          5     2     7     14
Male    Placebo         1     0    10     11
Note that categorical response variables can often be managed in different ways. You could combine the Marked and Some columns in Table 1.2 to produce a dichotomous outcome: No Improvement versus Improvement. Grouping categories is often done during an analysis if the resulting dichotomous response is also of interest.
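The grouping described above can be sketched in a few lines of Python. This is only an illustration of collapsing Marked and Some into one category, using the counts from Table 1.2; the names `arthritis` and `collapse_improvement` are hypothetical and are not part of the SAS System.

```python
# Counts from Table 1.2: (sex, treatment) -> (Marked, Some, None)
arthritis = {
    ("Female", "Active"): (16, 5, 6),
    ("Female", "Placebo"): (6, 7, 19),
    ("Male", "Active"): (5, 2, 7),
    ("Male", "Placebo"): (1, 0, 10),
}


def collapse_improvement(table):
    """Merge the Marked and Some columns into a single Improvement category."""
    return {
        group: {"Improvement": marked + some, "No Improvement": none}
        for group, (marked, some, none) in table.items()
    }


dichotomous = collapse_improvement(arthritis)
for group, counts in dichotomous.items():
    print(group, counts)
```

The row totals are unchanged by the grouping; only the number of response levels drops from three to two.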
If you have more than two outcome categories, and there is no inherent ordering to the categories, you have a nominal measurement scale. Which of four candidates did you vote for in the town council election? Do you prefer the beach, mountains, or lake for a vacation? There is no underlying scale for such outcomes and no apparent way in which to order them.
Consider Table 1.3, which is analyzed in Chapter 5, “The s × r Table.” Residents in one town were asked their political party affiliation and their neighborhood. Researchers were interested in the association between political affiliation and neighborhood. Unlike ordinal response levels, the classifications Bayside, Highland, Longview, and Sheffeld lie on no conceivable underlying scale. However, you can still assess whether there is association in the table, which is done in Chapter 5.
Table 1.3. Distribution of Parties in Neighborhoods
                         Neighborhood
Party        Bayside  Highland  Longview  Sheffeld
Democrat         221       160       360       140
Independent      200       291       160       311
Republican       208       106       316        97
Categorical response variables sometimes contain discrete counts. Instead of falling into categories that are labeled (yes, no) or (low, medium, high), the outcomes are numbers themselves. Was the litter size 1, 2, 3, 4, or 5 members? Did the house contain 1, 2, 3, or 4 air conditioners? While the usual strategy would be to analyze the mean count, the assumptions required for the standard linear model for continuous data are often not met with discrete counts that have small range; the counts are not distributed normally and may not have homogeneous variance.
For example, researchers examining respiratory disease in children visited children in different regions two times and determined whether they showed symptoms of respiratory illness. The response measure was whether the children exhibited symptoms in 0, 1, or 2 periods. Table 1.4 contains these data, which are analyzed in Chapter 13, “Weighted Least Squares.”
Table 1.4. Colds in Children
                    Periods with Colds
Sex     Residence    0    1    2  Total
Female  Rural       45   64   71    180
Female  Urban       80  104  116    300
Male    Rural       84  124   82    290
Male    Urban      106  117   87    310
The table represents a cross-classification of gender, residence, and number of periods with colds. The analysis is concerned with modeling mean colds as a function of gender and residence.
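As a rough illustration of the response function involved, the observed mean number of periods with colds can be computed for each gender by residence group directly from Table 1.4. The sketch below is plain Python and only produces descriptive means; it is not the weighted least squares analysis of Chapter 13, and the names `colds` and `mean_periods` are hypothetical.

```python
# Counts from Table 1.4: (sex, residence) -> children with 0, 1, 2 periods with colds
colds = {
    ("Female", "Rural"): (45, 64, 71),
    ("Female", "Urban"): (80, 104, 116),
    ("Male", "Rural"): (84, 124, 82),
    ("Male", "Urban"): (106, 117, 87),
}


def mean_periods(counts):
    """Mean number of periods with colds; index k of counts is the number of periods."""
    total = sum(counts)
    return sum(k * n for k, n in enumerate(counts)) / total


means = {group: mean_periods(counts) for group, counts in colds.items()}
for group, value in means.items():
    print(group, round(value, 3))
```

Each mean is the weighted average of the scores 0, 1, and 2, with the cell counts as weights.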
Finally, another type of response variable in categorical data analysis is one that represents survival times. With survival data, you are tracking the number of patients with certain outcomes (possibly death) over time. Often, the times of the condition are grouped together so that the response variable represents the number of patients who fail during a specific time interval. Such data are called grouped survival times. For example, the data displayed in Table 1.5 are from Chapter 17, “Categorized Time-to-Event Data.” A clinical condition is treated with an active drug for some patients and with a placebo for others. The response categories are whether there are recurrences, no recurrences, or whether the patients withdrew from the study. The entries correspond to the time intervals 0–1 years, 1–2 years, and 2–3 years, which make up the rows of the table.
Table 1.5. Life Table Format for Clinical Condition Data
Controls
Interval   No Recurrences  Recurrences  Withdrawals  At Risk
0–1 Years              50           15            9       74
1–2 Years              30           13            7       50
2–3 Years              17            7            6       30

Active
Interval   No Recurrences  Recurrences  Withdrawals  At Risk
0–1 Years              69           12            9       90
1–2 Years              59            7            3       69
2–3 Years              45           10            4       59
Categorical data arise from different sampling frameworks. The nature of the sampling framework determines the assumptions that can be made for the statistical analyses and in turn influences the type of analysis that can be applied. The sampling framework also determines the type of inference that is possible. Study populations are limited to target populations, those populations to which inferences can be made, by assumptions justified by the sampling framework.
Generally, data fall into one of three sampling frameworks: historical data, experimental data, and sample survey data. Historical data are observational data, which means that the
historical data, you often want to control for other explanatory variables that may have influenced the observed outcomes.
Table 1.1, the respiratory outcomes data, contains information obtained as part of a randomized allocation process. The hypothesis of interest is whether there is an association between treatment and outcome. For these data, the randomization is accomplished by the study design.
Table 1.6 contains data from a similar study. The main difference is that the study was conducted in two medical centers. The hypothesis of association is whether there is an association between treatment and outcome, controlling for any effect of center.
Table 1.6. Respiratory Improvement
Center  Treatment  Yes  No  Total
1       Test        29  16     45
1       Placebo     14  31     45
        Total       43  47     90
2       Test        37   8     45
2       Placebo     24  21     45
        Total       61  29     90
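One informal way to look at the data in Table 1.6 before any formal test is to compute the odds ratio for improvement (Test versus Placebo) separately within each center. The Python sketch below does only that; it is a descriptive check under the usual cross-product definition of the odds ratio, not the stratified analysis developed later in the book, and the names `centers` and `odds_ratio` are hypothetical.

```python
# Counts from Table 1.6: center -> treatment -> (Yes, No) for respiratory improvement
centers = {
    1: {"Test": (29, 16), "Placebo": (14, 31)},
    2: {"Test": (37, 8), "Placebo": (24, 21)},
}


def odds_ratio(table):
    """Cross-product ratio (a*d)/(b*c) for a 2 x 2 table of Test/Placebo by Yes/No."""
    a, b = table["Test"]
    c, d = table["Placebo"]
    return (a * d) / (b * c)


per_center = {center: odds_ratio(table) for center, table in centers.items()}
for center, value in per_center.items():
    print(f"Center {center}: odds ratio = {value:.3f}")
```

In these data the two center-specific odds ratios are both close to 4, which suggests a similar treatment association in each center.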
Chapter 2, “The 2 × 2 Table,” is primarily concerned with the association in 2 × 2 tables; in addition, it discusses measures of association, that is, statistics designed to evaluate the strength of the association. Chapter 3, “Sets of 2 × 2 Tables,” discusses the investigation of association in sets of 2 × 2 tables. When the table of interest has more than two rows and two columns, the analysis is further complicated by the consideration of scale of measurement. Chapter 4, “Sets of 2 × r and s × 2 Tables,” considers the assessment of association in sets of tables where the rows (columns) have more than two levels.
Chapter 5 describes the assessment of association in the general s × r table, and Chapter 6, “Sets of s × r Tables,” describes the assessment of association in sets of s × r tables. The investigation of association in tables and sets of tables is further discussed in Chapter 7, “Nonparametric Methods,” which discusses traditional nonparametric tests that have counterparts among the strategies for analyzing contingency tables.
Another consideration in data analysis is whether you have enough data to support the asymptotic theory required for many tests. Often, you may have an overall table sample size that is too small or a number of zero or small cell counts that make the asymptotic assumptions questionable. Recently, exact methods have been developed for a number of association statistics that permit you to address the same hypotheses for these types of data. The above-mentioned chapters illustrate the use of exact methods for many situations.
Often, you are interested in describing the variation of your response variable in your data with a statistical model. In the continuous data setting, you frequently fit a model to the expected mean response. However, with categorical outcomes, there are a variety of response functions that you can model. Depending on the response function that you choose, you can use weighted least squares or maximum likelihood methods to estimate the model parameters.
Perhaps the most common response function modeled for categorical data is the logit. If you have a dichotomous response and represent the proportion of those subjects with an event (versus no event) outcome as p, then the logit can be written
$$\log\left(\frac{p}{1-p}\right)$$
Logistic regression is a modeling strategy that relates the logit to a set of explanatory variables with a linear model. One of its benefits is that estimates of odds ratios, important measures of association, can be obtained from the parameter estimates. Maximum likelihood estimation is used to provide those estimates.
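The link between logits and odds ratios can be checked numerically. The following Python sketch, which uses the Center 1 counts from Table 1.6 as example proportions, shows that the difference between two logits is a log odds ratio. The helper names `logit` and `inverse_logit` are hypothetical, and nothing here reproduces PROC LOGISTIC output.

```python
import math


def logit(p):
    """Log odds of a proportion p, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


# Observed proportions improving in Center 1 of Table 1.6
p_test = 29 / 45
p_placebo = 14 / 45

# The difference of two logits is the log of the odds ratio
log_odds_ratio = logit(p_test) - logit(p_placebo)
odds_ratio = math.exp(log_odds_ratio)
print(round(odds_ratio, 3))
```

In a logistic model with a single treatment indicator, this difference of logits is what the treatment parameter estimates, which is why exponentiating a parameter estimate yields an odds ratio.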
Chapter 8, “Logistic Regression I: Dichotomous Response,” discusses logistic regression for a dichotomous outcome variable. Chapter 9, “Logistic Regression II: Polytomous Response,” discusses logistic regression for the situation where there are more than two outcomes for the response variable. Logits called generalized logits can be analyzed when the outcomes are nominal. And logits called cumulative logits can be analyzed when the outcomes are ordinal. Chapter 10, “Conditional Logistic Regression,” describes a specialized form of logistic regression that is appropriate when the data are highly stratified or arise from matched case-control studies. Chapter 8 and Chapter 10 describe the use of exact conditional logistic regression for those situations where you have limited or sparse data, and the asymptotic requirements for the usual maximum likelihood approach are not met.
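To make the two kinds of logits concrete, the sketch below computes observed generalized and cumulative logits in Python for one row of Table 1.2 (females on active treatment). It assumes the common definitions: generalized logits compare each category with a reference category, and cumulative logits compare the first j ordered categories with the remaining ones. The choice of None as the reference category and the function names are assumptions made for illustration.

```python
import math

# Female, Active row of Table 1.2, ordered from most to least improvement
levels = ["Marked", "Some", "None"]
counts = {"Marked": 16, "Some": 5, "None": 6}


def generalized_logits(counts, reference):
    """log(n_j / n_reference) for every category j other than the reference."""
    return {
        level: math.log(n / counts[reference])
        for level, n in counts.items()
        if level != reference
    }


def cumulative_logits(counts, ordered_levels):
    """log(P(Y <= j) / P(Y > j)) for each cut point j of the ordered categories."""
    total = sum(counts[level] for level in ordered_levels)
    logits = []
    running = 0
    for level in ordered_levels[:-1]:
        running += counts[level]
        logits.append(math.log(running / (total - running)))
    return logits


gen = generalized_logits(counts, reference="None")
cum = cumulative_logits(counts, levels)
print(gen)
print(cum)
```

With three response levels there are two logits of each kind. The generalized logits ignore the ordering of the categories, while the cumulative logits depend on it.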
In logistic regression, the objective is to predict a response outcome from a set of explanatory variables. However, sometimes you simply want to describe the structure of association in a set of variables for which there are no obvious outcome or predictor variables. This occurs frequently for sociological studies. The loglinear model is a traditional modeling strategy for categorical data and is appropriate for describing the association in such a set of variables. It is closely related to logistic regression, and the parameters in a loglinear model are also estimated with maximum likelihood estimation. Chapter 16, “Loglinear Models,” discusses the loglinear model, including several typical applications.
Some application areas have features that led to the development of special statistical techniques. One of these areas for categorical data is bioassay analysis. Bioassay is the process of determining the potency or strength of a reagent or stimulus based on the response it elicits in biological organisms. Logistic regression is a technique often applied in bioassay analysis, where its parameters take on specific meaning. Chapter 11, “Quantal Bioassay Analysis,” discusses the use of categorical data methods for quantal bioassay.
Poisson regression is a modeling strategy that is suitable for discrete counts, and it is discussed in Chapter 12, “Poisson Regression.” Most often the log of the count is used as the response function so the model used is a loglinear one.
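In symbols, a loglinear model of this kind is usually written as follows. The notation is an assumption introduced here for illustration: $\mu_i$ is the expected count for observation $i$, $\mathbf{x}_i$ is its vector of explanatory variables, and $\boldsymbol{\beta}$ is the parameter vector.

```latex
\log(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}
\quad\Longleftrightarrow\quad
\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})
```

Because the model is linear on the log scale, exponentiating a parameter gives a multiplicative effect on the expected count.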
Besides the logit and log counts, other useful response functions that can be modeled include proportions, means, and measures of association. Weighted least squares