SOFTWARE SUPPLEMENT FOR CATEGORICAL DATA ANALYSIS
This supplement contains information about software for categorical
data analysis and is intended to supplement the material in the second
editions of Categorical Data Analysis (Wiley, 2002), referred to below
as CDA, and An Introduction to Categorical Data Analysis (Wiley,
2007), referred to below as ICDA, by Alan Agresti.
SAS
For the CDA text, see Appendix A for discussion, and go to CDA SAS examples for illustrations
of SAS for data sets in that book. See also the references of SAS
publications in that Appendix. For the ICDA text, go to ICDA
SAS examples.
For other illustrations of various analyses for examples in CDA and
in ICDA, see the useful site set up by
the UCLA
Statistical Computing Center. One procedure not discussed in the
appendix of my text
is SURVEYLOGISTIC
for fitting binary and multiple-category logistic models by the
method of pseudo maximum likelihood, incorporating the sample design
into the analysis. Starting in version 9.2, Bayesian analyses for
generalized linear models are available with PROC GENMOD. See
Bayesian
GLMs.
R and S-Plus
R is free software maintained and regularly updated by a wide variety
of volunteers. It is an open-source implementation of the S
programming language, and many S-Plus functions also work in R. For
instance, the discussion below about various S functions for
categorical data methods also applies to R. For details, see the R
web site, which includes a
link to manuals, such as "An Introduction to R", and to the archives
in the Comprehensive R Archive Network (CRAN).
Dr. Laura Thompson has prepared an excellent, detailed manual (over
250 pages!) on the use of S-Plus and R to conduct the analyses shown
in CDA. You can get a copy of this
at Laura
Thompson S manual for CDA. If you are using ICDA instead of CDA, you
can find examples of any method covered in ICDA by doing a "find"
search through this manual. Thanks
very much to Dr. Thompson for providing this very helpful
resource.
For ICDA, a very useful resource is the website
of Chris Bilder,
where the link to R has examples of the use of R for most chapters of
the text. The link to Schedule at Bilder's website for Statistics 875
at the University of Nebraska has notes for a course on this topic
following the ICDA text as well as R code and output embedded within
the notes. Thanks to Dr. Bilder for this outstanding
resource.
An excellent source about R functions for various basic types of
categorical data analyses is material prepared
by Brett
Presnell R for CDA. That site has details for many of the examples
in ICDA, prepared for an introductory course on this topic at the
University of Florida. His
website Brett
Presnell CDA course has the notes for that course. Brett has improved
some of my own course notes and added R code and output.
Dr. Pat Altham at
Cambridge also has a site that is a good source of examples for
S-Plus and R.
For texts that contain examples of the use of S-Plus for various
categorical data methods, see "Modern Applied Statistics With S-Plus,"
3rd ed., by W. N. Venables and B. D. Ripley (Springer, 1999),
"Analyzing Medical Data Using S-PLUS" by B. Everitt and
S. Rabe-Hesketh (Springer, 2001), and "Regression Modeling Strategies"
by F. E. Harrell (Springer, 2001).
A useful site for learning R for those already familiar with SAS or
SPSS is R for
SAS and SPSS users, by Robert Muenchen.
Some of the useful R functions for categorical data analysis are:
- dbinom() and dpois() for binomial and Poisson probabilities; e.g.,
  dbinom(6, 10, 0.5) for outcome 6 in 10 trials with parameter 0.5.
- prop.test() for a test and score CI for a binomial proportion; e.g.,
  prop.test(6, 10, p = 0.5); note that the default uses a continuity
  correction, which can be turned off with correct=FALSE.
- chisq.test() for the chi-squared test
- fisher.test() for Fisher's exact test
- mantelhaen.test() for the Cochran-Mantel-Haenszel test
- glm() for generalized linear models
- mcnemar.test() for McNemar's test for matched pairs
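As a quick sketch, the functions above can be run directly in base R; the counts below are hypothetical, made up only to illustrate the calls:

```r
# Hypothetical data throughout; all functions are in base R's stats package.

dbinom(6, 10, 0.5)    # P(X = 6) for X ~ Binomial(n = 10, p = 0.5)
dpois(2, 3.5)         # P(X = 2) for X ~ Poisson(mean 3.5)

# Score test and CI for a binomial proportion; correct = FALSE turns off
# the default continuity correction:
prop.test(6, 10, p = 0.5, correct = FALSE)

# Chi-squared and Fisher's exact tests for a hypothetical 2x2 table:
tab <- matrix(c(21, 9, 15, 20), nrow = 2)
chisq.test(tab)
fisher.test(tab)

# McNemar's test for a hypothetical matched-pairs table:
mcnemar.test(matrix(c(25, 5, 12, 30), nrow = 2))
```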
GLM: The usual sorts of generalized linear models can be fitted with
the glm() function. That function handles most of the models in the
CDA and ICDA texts. It can be used for such things as logistic
regression, Poisson regression, and loglinear models. Specialized
functions exist for particular methods, such as the loglin() function
to fit loglinear models using iterative proportional fitting.
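For instance, here is a hedged sketch (with simulated data, not data from the texts) of glm() fitting a logistic regression and a Poisson loglinear model of independence for a 2x2 table:

```r
# Simulated data for illustration only.
set.seed(1)

# Logistic regression: binary response, one quantitative predictor.
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + x))
fit.logit <- glm(y ~ x, family = binomial)
summary(fit.logit)

# Poisson loglinear model of independence for a 2x2 table: main effects
# only; the residual deviance is the G^2 statistic for independence.
counts <- c(30, 10, 20, 40)
row <- factor(c(1, 1, 2, 2))
col <- factor(c(1, 2, 1, 2))
fit.ind <- glm(counts ~ row + col, family = poisson)
deviance(fit.ind)
```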
MULTINOMIAL MODELS: The glm() function cannot handle multinomial
models, but specialized functions have been written by various users.
To fit baseline-category logit models, one can use the multinom()
function from the library nnet that has been provided by Venables and
Ripley to do various calculations by neural nets (see, e.g., p. 230 of
Venables and Ripley, 3rd ed.). To fit the proportional odds model for
ordinal responses, one can use the polr() function (proportional odds
logistic regression) in the MASS library (based on programs in the
text by Venables and Ripley; see p. 231 of their 3rd edition for
polr), and the function lrm() in Frank
Harrell's Design
S-plus library (see also Harrell's text "Regression Modeling
Strategies" mentioned above for discussion of fitting this model and
the continuation-ratio logit model). See also the VGAM package and
its vglm() function, developed
by Thomas Yee at
Auckland, New Zealand, which can fit a wide variety of other models,
including adjacent-categories models, continuation-ratio models,
Goodman's RC association model, and bivariate logistic and probit
models for bivariate binary responses.
(There are examples of many of these in a file I have put at the
ordinal
website for my book, "Analysis of Ordinal Categorical Data.")
Also, see the Presnell website mentioned above for an example of
fitting a model for baseline-category logits. See volume 14, issue 3,
of Journal of Statistical
Software for an R package by K. Imai and D. A. van Dyk for Bayesian
fitting of the cumulative probit model.
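As a hedged sketch of two of the functions mentioned above, multinom() from nnet and polr() from MASS (both packages ship with standard R installations; the data below are simulated, not from the texts):

```r
library(nnet)   # multinom(): baseline-category logit models
library(MASS)   # polr():     proportional odds (cumulative logit) model

# Simulated three-category ordinal response, for illustration only.
set.seed(2)
x <- rnorm(200)
y <- cut(x + rlogis(200), breaks = c(-Inf, -1, 1, Inf),
         labels = c("low", "medium", "high"), ordered_result = TRUE)

fit.bcl <- multinom(y ~ x, trace = FALSE)   # baseline-category logits
fit.po  <- polr(y ~ x)                      # proportional odds model
coef(fit.po)   # single slope, common to both cumulative logits
```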
GEE: The S archive at Statlib
contains a function gee() for analyses using generalized estimating
equations.
GLMM: Generalized linear mixed models (GLMMs) can be fitted with the
penalized quasi-likelihood method using the glmmPQL() function
developed by Brian Ripley for the MASS library. The function
GLMMgibbs() on CRAN employs a fully Bayesian approach with Gibbs
sampling. The R package MCMCglmm fits them using Markov Chain Monte
Carlo methods. The glmm() function in the repeated library fits
them with quadrature methods.
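A hedged sketch of glmmPQL() from MASS (which in turn uses nlme) for a logistic-normal random-intercept model; the clustered data are simulated, and keep in mind that PQL estimates are approximate:

```r
library(MASS)   # glmmPQL(); attaches nlme when first used

# Simulated clustered binary data: 30 clusters of 8 observations each,
# with a normal random intercept per cluster.
set.seed(3)
id <- factor(rep(1:30, each = 8))
u  <- rep(rnorm(30), each = 8)
x  <- rnorm(240)
y  <- rbinom(240, 1, plogis(-0.3 + 0.8 * x + u))
dat <- data.frame(y, x, id)

fit.pql <- glmmPQL(y ~ x, random = ~ 1 | id, family = binomial,
                   data = dat, verbose = FALSE)
fixef(fit.pql)   # approximate fixed-effect estimates
```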
GENERALIZED LOGLINEAR MODELS: As mentioned in the CDA text appendix, Joseph Lang at the
University of Iowa (e-mail jblang@stat.uiowa.edu) has powerful R
functions for fitting generalized loglinear models and mean response
models by ML. The former class includes many of the standard marginal
models of interest for repeated measurement. At his home page, he
currently has these available in an R function, mph.fit(), that can fit
these and other "multinomial-Poisson homogeneous models for
contingency tables" described in a 2004 paper by Lang in the
Annals of Statistics (vol. 32, pp. 340-383).
BRADLEY-TERRY: Prof. David Firth at the University of Warwick has
prepared an R package (BradleyTerry), available at CRAN, that is designed to fit
the Bradley-Terry model and versions of it whereby ability scores are
described by a linear predictor (see also volume 12, issue 01 at Journal of Statistical
Software). For an overview of what this package can do, you can
also go to
Dr. Firth's web site.
ITEM RESPONSE THEORY MODELS: Dimitris Rizopoulos from Leuven, Belgium
has prepared a package `ltm' for Item Response Theory analyses. This
package can fit the Rasch model, the two-parameter logistic model,
Birnbaum's three-parameter model, the latent trait model with up to
two latent variables, and Samejima's graded response model. See
Latent IRT.
MULTIPLICATIVE MODELS (nonlinear in parameters): The gnm
add-on package for R, developed by David Firth and Heather Turner at
the Univ. of Warwick, can fit multiplicative models such as Goodman's
RC model for two-way contingency tables and Anderson's stereotype
model for ordinal multinomial responses.
CORRESPONDENCE ANALYSIS: Nenadic and Greenacre have developed
the ca package for correspondence analysis. For details, see
Correspondence
Analysis with R.
CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS IN 2x2 AND 2xc TABLES: For a
binomial proportion and for parameters comparing two binomial
proportions such as the difference of proportions, relative risk, and
odds ratio, a good general-purpose method for constructing confidence
intervals is to invert the score test. Such intervals are not
available in the standard software packages. Here are R functions for
confidence intervals for a proportion, R functions
for confidence
intervals comparing two proportions with independent samples, and
R functions
for confidence
intervals comparing two proportions with dependent samples. These
sites also contain R functions for some "exact" small-sample intervals
that guarantee at least the nominal coverage probability (such as the
Clopper-Pearson and Blaker confidence intervals for a proportion) and
adjustments of the Wald interval. Most of these were written by my
former graduate student, Yongyi Min, who also prepared the Bayesian
intervals mentioned below. The confidence intervals for a proportion
include the mid-P adaptation of the Clopper-Pearson interval (written
by Anna Gottard, Univ. of Firenze). Please cite this site if you use
one of these R functions for confidence intervals for association
parameters. We believe these functions are dependable, but no
guarantees or support are available, so use them at your own risk.
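Two of the intervals just mentioned are also available in base R: the Wilson score interval from prop.test() with the continuity correction turned off, and the Clopper-Pearson interval from binom.test(). A sketch for 6 successes in 10 trials:

```r
# 95% confidence intervals for a binomial proportion, 6 successes in 10 trials.
prop.test(6, 10, correct = FALSE)$conf.int   # score (Wilson) interval
binom.test(6, 10)$conf.int                   # "exact" Clopper-Pearson interval

# The simple Wald interval, for comparison (its coverage can be poor):
p <- 6 / 10
p + c(-1, 1) * qnorm(0.975) * sqrt(p * (1 - p) / 10)
```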
Ralph Scherer at the Institute for Biometry in Hannover, Germany, has
prepared a package on CRAN incorporating many of these confidence
interval functions for proportions and comparisons of proportions. It
can be downloaded
at Scherer
PropCIs package.
Euijung Ryu (a former PhD student of mine who is now at Mayo Clinic)
has prepared R functions for various confidence intervals for the
ordinal measure [P(Y1 > Y2) + (1/2)P(Y1 = Y2)] that is useful for
comparing two multinomial distributions on an ordinal scale. Here is
a pdf file
of CIs
for ordinal effect measure, including simple methods as well as
score and profile likelihood intervals (which require using Joe Lang's
mph.fit function). Euijung has also prepared R functions
for
multiple comparisons of proportions with independent samples using
simultaneous confidence intervals for the difference of proportions or
the odds ratio, based on the studentized-range inversion of score
tests proposed by Agresti, Bini, Bertaccini, and Ryu in the journal
Biometrics, 2008.
BAYESIAN INFERENCE: For surveys of Bayesian inference using R, see
Survey by
J. H. Park and
Survey by Jim Albert. The
latter is a website for the text "Bayesian Computation with R" by Jim
Albert and shows examples of some categorical data analyses, such as
Bayesian inference for a 2x2 table, a Bayesian test of independence in
a contingency table, and probit regression.
Yongyi Min has prepared some R functions
for Bayesian
confidence intervals for 2x2 tables using independent beta priors
for two binomial parameters, for the difference of proportions, odds
ratio, and relative risk. (These are evaluated and compared to score
confidence intervals in a 2005 article in the journal Biometrics by
Agresti and Min.)
LATENT CLASS: Steve Buyske at Rutgers has prepared a library for
fitting latent class
models with the EM algorithm.
NEG BINOMIAL: The S archive
at Statlib contains a negbin() function for negative binomial
regression.
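In current R, negative binomial regression is more commonly fitted with the glm.nb() function in the MASS package; here is a hedged sketch with simulated overdispersed counts:

```r
library(MASS)   # glm.nb(): ML fitting of the negative binomial model

# Simulated counts with overdispersion (true dispersion parameter 2).
set.seed(4)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 2)
fit.nb <- glm.nb(y ~ x)
c(coef(fit.nb), theta = fit.nb$theta)   # slope estimates and dispersion
```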
HIGHER-ORDER ASYMPTOTICS: Alessandra Brazzale has prepared functions
for a variety of higher-order
asymptotic analyses, including approximate conditional analysis
for logistic and loglinear models. For information about her
hoa package, see also
hoa.
Here are some very lightly annotated examples of S-Plus sessions I
have conducted with some of the examples in CDA. I have not checked
these in a while, and there is no guarantee that all of them are
correct!
- Chi-squared and loglinear analyses of 2x3 table on gender and
political party affiliation (Table 3.11)
- Loglinear and logit models for 2x2x2 table of death
penalty (Table 2.6)
- Linear-by-linear association model and row effects model and
column effects model for 4x4 table of opinions about birth control
and premarital sex (Table 9.3)
Stata
For information about Stata (including its use for complex methods
such as generalized linear mixed models and GEE), see "A Handbook of
Statistical Analyses Using Stata," 4th ed., by S. Rabe-Hesketh and
B. Everitt, CRC Press, 2006. For examples of categorical data
analyses for many data sets in the first edition of my text "An
Introduction to Categorical Data Analysis", see the useful site set
up by
the UCLA
Statistical Computing Center. Information about various programs is
available at Stata. A listing
of the extensive selection of categorical data methods available in
Stata is also given in Table 3 of the article by R. A. Oster in the
August 2002 issue of The American Statistician (pp. 243-244); the
main focus of that article is on methods for small-sample exact
analysis.
The tabulate program in Stata can generate many measures of
association and their standard errors. See
Stata help
for tabulate. Generalized linear models, such as logistic
regression and loglinear models, can be fitted with the glm
program. See Stata
help for glm.
In Stata, the ologit program fits cumulative logit models and the
oprobit program fits cumulative probit models. See
Stata help for
ologit. A program omodel is available from the Stata website for
fitting these models and testing the assumption of the same effects
for each cumulative probability (i.e., the proportional odds
assumption for cumulative logit models). Other ways to fit cumulative
link models are with the Stata module OGLM. Continuation-ratio logit
models can be fitted with the ocratio module.
See Stata
ocratio search.
For information about using GEE in Stata,
see Horton
article
and Stata GEE
search.
The GLLAMM module for Stata (see www.gllamm.org) can fit a very wide
variety of models, including logit and cumulative logit models with
random effects. For details, see Stata
gllamm search and Chapter 5 of "Multilevel and Longitudinal
Modeling Using Stata" by S. Rabe-Hesketh and A. Skrondal (Stata
Press, 2005).
SPSS (version 19)
CONTINGENCY TABLES:
The DESCRIPTIVE STATISTICS option on the ANALYZE menu has a suboption
called CROSSTABS, which provides several methods for contingency
tables. After identifying the row and column variables in CROSSTABS,
clicking on STATISTICS provides a wide variety of options, including
the chi-squared test and measures of association. The output lists
the Pearson statistic, its degrees of freedom, and its P-value
(labeled Asymp. Sig.). If any expected frequency in a 2x2 table is
less than 5, SPSS reports Fisher's exact test. It can also be requested by
clicking on Exact in the CROSSTABS dialog box and selecting the exact
test. SPSS also has an advanced module for small-sample inference
(called SPSS Exact Tests) that provides exact P-values for various
tests in CROSSTABS and NPAR TESTS procedures. For instance, the Exact
Tests module provides exact tests of independence for r x c
contingency tables with nominal or ordinal classifications. See the
publication "SPSS Exact Tests for Windows."
In CROSSTABS, clicking on CELLS provides options for displaying
observed and expected frequencies, as well as the standardized
residuals, labeled as "Adjusted standardized". Clicking on STATISTICS
in CROSSTABS provides options of a wide variety of statistics other
than chi-squared, including gamma and Kendall's tau-b. The output
shows the measures and their standard errors (labeled
Asymp. Std. Error), which you can use to construct confidence
intervals. It also provides a test statistic for testing that the
true measure equals zero, which is the ratio of the estimate to its
standard error. This test uses a simpler standard error that only
applies under independence and is inappropriate for confidence
intervals. One option in the list of statistics, labeled Risk,
provides as output the odds ratio and its confidence interval.
Suppose you enter the data as cell counts for the various combinations
of the two variables, rather than as responses on the two variables
for individual subjects; for instance, perhaps you call COUNT the
variable that contains these counts. Then select the WEIGHT CASES
option on the DATA menu in the Data Editor window and instruct SPSS
to weight cases by COUNT.
GLMs and LOGISTIC REGRESSION:
To fit generalized linear models, on the ANALYZE menu select the
GENERALIZED LINEAR MODELS option and the GENERALIZED LINEAR MODELS
suboption. Select the Dependent Variable and then the Distribution
and Link Function. Click on the Predictors tab at the top of the
dialog box and then enter quantitative variables as Covariates and
categorical variables as Factors. Click on the Model tab at the top
of the dialog box and enter these variables as main effects, and
construct any interactions that you want in the model. Click on OK to
run the model.
To fit logistic regression models, on the ANALYZE menu select the
REGRESSION option and the BINARY LOGISTIC suboption. In the LOGISTIC
REGRESSION dialog box, identify the binary response (dependent)
variable and the explanatory predictors (covariates). Highlight
variables in the source list and click on a*b to create an
interaction term. Identify the explanatory variables that are
categorical and for which you want dummy variables by clicking on
Categorical and declaring such a covariate to be a Categorical
Covariate in the LOGISTIC REGRESSION: DEFINE CATEGORICAL VARIABLES
dialog box. Highlight the categorical covariate and under Change
Contrast you will see several options for setting up dummy variables.
The Simple contrast constructs them as in this text, in which
the final category is the baseline.
In the LOGISTIC REGRESSION dialog box, click on Method for
stepwise model selection procedures, such as backward elimination.
Click on Save to save predicted probabilities, measures of
influence such as leverage values and DFBETAS, and standardized
residuals. Click on Options to open a dialog box that contains
an option to construct confidence intervals for exponentiated
parameters.
Another way to fit logistic regression models is with the GENERALIZED
LINEAR MODELS option and suboption on the ANALYZE menu. You pick the
binomial distribution and logit link function. It is also possible
there to enter the data as the number of successes out of a certain
number of trials, which is useful when the data are in contingency
table form. One can also fit such models using the LOGLINEAR option
with the LOGIT suboption in the ANALYZE menu. One identifies the
dependent variable, selects categorical predictors as factors, and
selects quantitative predictors as cell covariates. The default fit
is the saturated model for the factors, without including any
covariates. To change this, click on Model and select a Custom model,
entering the predictors and relevant interactions as terms in a
customized (unsaturated) model. Clicking on Options, one can also
display standardized residuals (called adjusted residuals) for model
fits. This approach is well suited for logit models with categorical
predictors, since standard output includes observed and expected
frequencies. When the data file contains the data as cell counts,
such as binomial numbers of successes and failures, one weights each
cell by the cell count using the WEIGHT CASES option in the DATA
menu.
MULTINOMIAL RESPONSES and LOGLINEAR MODELS:
SPSS can also fit logistic models for categorical response variables
having several response categories. On the ANALYZE menu, choose the
REGRESSION option and then the ORDINAL suboption for a cumulative
logit model. Select the MULTINOMIAL LOGISTIC suboption for a
baseline-category logit model. In the latter, click on
Statistics and check Likelihood-ratio tests under Parameters to
obtain results of likelihood-ratio tests for the effects of the
predictors.
For loglinear models, one uses the LOGLINEAR option with GENERAL
suboption in the ANALYZE menu. One enters the factors for the model.
The default is the saturated model, so click on Model and select
a Custom model. Enter the factors as terms in a customized
(unsaturated) model and then select additional interaction effects.
Click on Options to show options for displaying observed and
expected frequencies and adjusted residuals. When the data file
contains the data as cell counts for the various combinations of
factors rather than as responses listed for individual subjects,
weight each cell by the cell count using the WEIGHT CASES option in
the DATA menu.
CLUSTERED DATA:
For GEE methods, on the ANALYZE menu choose the GENERALIZED LINEAR
MODELS option and the GENERALIZED ESTIMATING EQUATIONS suboption.
You can then select structure for the working correlation matrix
and identify the between-subject and within-subject variables.
For random effects models, on the ANALYZE menu choose the MIXED
MODELS option and the GENERALIZED LINEAR suboption.
GLIM
See the first edition of "Categorical Data Analysis" (1990) for several
GLIM examples, as well as the 2005 text by Aitkin, Francis,
and Hinde on "Statistical Modeling in GLIM4" (Oxford) and Jim Lindsey's
1989 text on "The Analysis of Categorical Data Using GLIM"
(Springer-Verlag). See
Statlib for an archive of GLIM macros. Also,
Rory Wolfe has prepared macros for cumulative link models.
StatXact and LogXact
StatXact (Cytel Software,
Cambridge MA) provides exact analysis for categorical data methods and
some nonparametric methods. Among its procedures are small-sample
confidence intervals for differences and ratios of proportions and for
odds ratios, and Fisher's exact test and its generalizations for IxJ
tables. It also can conduct exact tests of conditional independence
and of equality of odds ratios in 2x2xK tables, and exact confidence
intervals for the common odds ratio in several 2x2 tables. StatXact
uses Monte Carlo methods to approximate exact P-values and confidence
intervals when a data set is too large for exact inference to be
computationally feasible. Its companion LogXact performs exact
conditional logistic regression. The President of Cytel Software is
Dr. Cyrus Mehta, who has been one of the most active researchers in
the past 20 years in advancing the development of algorithms for
conducting small-sample inference for categorical data. For a brief
survey of the capability of these packages, see the article by
R. A. Oster in the August 2002 issue of The American Statistician
(pp. 243-244). The manuals of these programs are good sources of
detailed explanations about the small-sample methods.
Others
SuperMix
distributed by Scientific Software International provides ML fitting
of generalized linear mixed models, including count responses, nominal
responses, and ordinal responses using cumulative links including the
cumulative logit, cumulative probit, and cumulative complementary
log-log. This program is based on software developed over the years
by Donald Hedeker and Robert Gibbons, who have also done considerable
research on mixed models. For multilevel models, the program is
reportedly much faster than PROC MIXED or PROC NLMIXED in SAS and
makes it possible to fit relatively complex models using ML rather
than approximations such as penalized quasi-likelihood (communication
from Robert Gibbons).
Latent
Gold is the website for the Latent Gold program (marketed by
Statistical Innovations of Belmont, MA) for fitting a wide variety of
finite mixture models such as latent class models (i.e. the latent
variable is categorical). It can handle binary, nominal, ordinal, and
count response variables and can include random effects that are
treated in a nonparametric method rather than assumed to have a normal
distribution.
SUDAAN provides analyses for
categorical and continuous data from stratified multi-stage cluster
designs. It has a facility (the MULTILOG procedure) for GEE analyses
of marginal models for nominal and ordinal responses.
See SUDAAN
GEE.
Robert
Newcombe at the University of Wales in Cardiff provides an Excel
spreadsheet for forming various confidence intervals for a proportion
and for comparing two proportions with independent or with matched
samples. His website also has SPSS and Minitab macros for doing this.
Berger and
Boos have software for the Berger-Boos test and other
small-sample unconditional tests for 2x2 tables.
Pesarin
and Salmaso give a variety of permutation analyses for categorical
and continuous variables, including some multivariate analyses, using
a SAS macro constructed by Luigi Salmaso at the University of Padova.
For a survey of software for implementing the GEE method, see the
article by Horton and Lipsitz in The American Statistician, 1999,
vol. 53, pp. 160-169.
Copyright © 2008, Alan Agresti, Department of Statistics,
University of Florida.