SOFTWARE SUPPLEMENT FOR CATEGORICAL DATA ANALYSIS

This supplement contains information about software for categorical data analysis and is intended to supplement the material in the second editions of Categorical Data Analysis (Wiley, 2002), referred to below as CDA, and An Introduction to Categorical Data Analysis (Wiley, 2007), referred to below as ICDA, by Alan Agresti.

SAS

For the CDA text, see Appendix A for discussion, and go to CDA SAS examples for illustrations of SAS for data sets in that book. See also the references of SAS publications in that Appendix. For the ICDA text, go to ICDA SAS examples.

For other examples of various analyses for some examples in CDA and in ICDA, see the useful site set up by the UCLA Statistical Computing Center. One procedure not discussed in the appendix of my text is SURVEYLOGISTIC for fitting binary and multiple-category logistic models by the method of pseudo maximum likelihood, incorporating the sample design into the analysis. Starting in version 9.2, Bayesian analyses for generalized linear models are available with PROC GENMOD. See Bayesian GLMs.

R and S-Plus

R is free software maintained and regularly updated by a wide variety of volunteers. It is an open-source implementation of the S programming language, and many S-Plus functions also work in R. For instance, the discussion below about various S functions for categorical data methods also applies to R. For details, see the R web site. This includes a link to manuals, such as "An Introduction to R", and to the archives in the Comprehensive R Archive Network (CRAN).

Dr. Laura Thompson has prepared an excellent, detailed manual (over 250 pages) on the use of S-Plus and R to conduct the analyses shown in CDA. You can get a copy of this at Laura Thompson S manual for CDA. If you are using ICDA instead of CDA, you can get an example of any type of method of interest to you covered in ICDA by doing a "find" search through this manual. Thanks very much to Dr. Thompson for providing this very helpful resource.

For ICDA, a very useful resource is the website of Chris Bilder, where the link to R has examples of the use of R for most chapters of the text. The link to Schedule at Bilder's website for Statistics 875 at the University of Nebraska has notes for a course on this topic following the ICDA text, as well as R code and output embedded within the notes. Thanks to Dr. Bilder for this outstanding resource.

An excellent source about R functions for various basic types of categorical data analyses is the material prepared by Brett Presnell, R for CDA. This site has details (for an introductory course on this topic at the University of Florida) for many of the examples in ICDA. Also, his website Brett Presnell CDA course has notes for a course on this topic at the University of Florida. Brett has improved some of my own course notes and added R code and output.

Dr. Pat Altham at Cambridge also has a site that is a good source of examples for S-Plus and R.

For texts that contain examples of the use of S-Plus for various categorical data methods, see "Modern Applied Statistics With S-Plus," 3rd ed., by W. N. Venables and B. D. Ripley (Springer, 1999), "Analyzing Medical Data Using S-PLUS" by B. Everitt and S. Rabe-Hesketh (Springer, 2001), and "Regression Modeling Strategies" by F. E. Harrell (Springer, 2001).

A useful site for learning R for those already familiar with SAS or SPSS is R for SAS and SPSS users, by Robert Muenchen.

Some of the useful R functions for categorical data analysis are:

dbinom() and dpois() for binomial and Poisson probabilities; e.g., dbinom(6,10,.5) for outcome 6 in 10 trials with parameter .5.

prop.test() for a test and score CI for a binomial proportion; e.g., prop.test(6,10,p=.5), but note that the default uses a continuity correction; this can be turned off with correct=FALSE.

chisq.test() for the chi-squared test

fisher.test() for Fisher's exact test

mantelhaen.test() for the Cochran-Mantel-Haenszel test

glm() for generalized linear models

mcnemar.test() for matched pairs
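As a brief illustration of the functions listed above, the following R session is a minimal sketch. The tables tab and matched contain hypothetical counts invented for illustration only; UCBAdmissions is a 2x2x6 table supplied with R.

```r
# Binomial and Poisson probabilities
dbinom(6, 10, 0.5)   # P(Y = 6) for 10 trials with parameter 0.5
dpois(2, 1.5)        # P(Y = 2) for a Poisson mean of 1.5

# Score test and score CI for a proportion, without continuity correction
prop.test(6, 10, p = 0.5, correct = FALSE)

# Hypothetical 2x2 table of counts (rows = group, columns = response)
tab <- matrix(c(12, 5, 7, 14), nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              response = c("yes", "no")))
chisq.test(tab, correct = FALSE)   # Pearson chi-squared test
fisher.test(tab)                   # Fisher's exact test

# Hypothetical matched-pairs table (rows = first response, columns = second)
matched <- matrix(c(40, 15, 6, 39), nrow = 2, byrow = TRUE,
                  dimnames = list(first = c("yes", "no"),
                                  second = c("yes", "no")))
mcnemar.test(matched)              # McNemar test (continuity-corrected by default)

# Cochran-Mantel-Haenszel test for a 2x2xK table supplied with R
mantelhaen.test(UCBAdmissions)
```

Like prop.test(), both chisq.test() and mcnemar.test() apply a continuity correction to 2x2 tables by default, which correct = FALSE turns off.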

GLM: The usual sorts of generalized linear models can be fitted with the glm() function. That function handles most of the models in the CDA and ICDA texts. It can be used for such things as logistic regression, Poisson regression, and loglinear models. Specialized functions exist for particular methods, such as the loglin() function to fit loglinear models using iterative proportional fitting.
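The following sketch shows how glm() and loglin() might be used for a logistic regression with grouped data and for a loglinear model of independence. All data values and variable names (dose, n, successes, counts, A, B) are hypothetical and chosen only for illustration.

```r
# Hypothetical grouped binary data: number of successes out of n at each dose
dose      <- c(1, 2, 3, 4, 5)
n         <- c(20, 20, 20, 20, 20)
successes <- c(2, 5, 9, 14, 18)

# Logistic regression; the response is cbind(successes, failures)
fit.logit <- glm(cbind(successes, n - successes) ~ dose,
                 family = binomial(link = "logit"))
summary(fit.logit)

# Hypothetical 2x3 contingency table stored as one row per cell
counts <- data.frame(
  A     = factor(rep(c("a1", "a2"), each = 3)),
  B     = factor(rep(c("b1", "b2", "b3"), times = 2)),
  count = c(20, 30, 50, 25, 35, 40))

# Loglinear model of independence, fitted as a Poisson GLM
fit.ind <- glm(count ~ A + B, family = poisson, data = counts)
deviance(fit.ind)      # likelihood-ratio statistic for independence
df.residual(fit.ind)   # residual df, here (2 - 1)(3 - 1) = 2

# The same model fitted by iterative proportional fitting
tab     <- xtabs(count ~ A + B, data = counts)
fit.ipf <- loglin(tab, margin = list(1, 2), fit = TRUE, print = FALSE)
fit.ipf$lrt            # should agree with deviance(fit.ind)
```

The Poisson GLM and the loglin() fit are two routes to the same ML fit; glm() also gives standard errors for the model parameters, which loglin() does not report.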

MULTINOMIAL MODELS: The glm() function cannot handle multinomial models, but specialized functions have been written by various users. To fit baseline-category logit models, one can use the multinom() function from the library nnet that has been provided by Venables and Ripley to do various calculations by neural nets (see, e.g., p. 230 of Venables and Ripley, 3rd ed.). To fit the proportional odds model for ordinal responses, one can use the polr() function (proportional odds logistic regression) in the MASS library (based on programs in the text by Venables and Ripley; see p. 231 of their 3rd edition for polr), and the function lrm() in Frank Harrell's Design S-Plus library (see also Harrell's text "Regression Modeling Strategies" mentioned above for discussion of fitting this model and the continuation-ratio logit model). See also the VGAM package and vglm() function developed by Thomas Yee at Auckland, New Zealand, which can also fit a wide variety of other models, including adjacent-categories models, continuation-ratio models, Goodman's RC association model, and bivariate logistic and probit models for bivariate binary responses. (There are examples of the use of many of these in a file I have put at the ordinal website for my book, "Analysis of Ordinal Categorical Data.") Also, see the Presnell website mentioned above for an example of fitting a model for baseline-category logits. See volume 14, issue 3, of Journal of Statistical Software for an R package by K. Imai and D. A. van Dyk for Bayesian fitting of the cumulative probit model.
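A minimal sketch of multinom() and polr() follows. It uses the housing data frame supplied with the MASS library (satisfaction Sat with ordered levels Low, Medium, High, cross-classified by Infl, Type, and Cont, with cell counts in Freq) rather than a data set from CDA or ICDA.

```r
library(MASS)   # polr() and the housing data
library(nnet)   # multinom()

# Baseline-category logit model; the first level of Sat is the baseline
fit.bcl <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                    data = housing, trace = FALSE)
summary(fit.bcl)

# Proportional odds (cumulative logit) model
fit.po <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
               data = housing, Hess = TRUE)
summary(fit.po)
```

Note that polr() parameterizes the model as logit[P(Y <= j)] = zeta_j - (linear predictor), so the signs of its effect estimates are the reverse of those from software that adds the linear predictor to the intercepts.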

GEE: The S archive at Statlib contains a function gee() for analyses using generalized estimating equations.

GLMM: Generalized linear mixed models (GLMMs) can be fitted with the penalized quasi-likelihood method using the glmmPQL() function developed by Brian Ripley for the MASS library. The function GLMMgibbs() on CRAN employs a fully Bayesian approach with Gibbs sampling. The R package MCMCglmm fits them using Markov Chain Monte Carlo methods. The glmm() function in the repeated library fits them with quadrature methods.
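As a sketch of the penalized quasi-likelihood approach, the following fits a logistic-normal random intercept model with glmmPQL(). It uses the bacteria data frame supplied with the MASS library (binary response y, treatment trt, and repeated measurements over week on each subject ID), not a data set from CDA or ICDA.

```r
library(MASS)   # glmmPQL() and the bacteria data; glmmPQL() relies on nlme

# Random intercept for each subject, logit link
fit.pql <- glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,
                   family = binomial, data = bacteria, verbose = FALSE)
summary(fit.pql)
```

Because PQL is an approximation that can perform poorly for binary responses with few observations per cluster, it is worth comparing its estimates with those from a quadrature-based or Bayesian fit when one is available.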

GENERALIZED LOGLINEAR MODELS: As mentioned in the CDA text appendix, Joseph Lang at the University of Iowa (e-mail jblang@stat.uiowa.edu) has powerful R functions for fitting generalized loglinear models and mean response models by ML. The former class includes many of the standard marginal models of interest for repeated measurement. At his home page, he currently has these available in an R function, mph.fit(), that can fit these and other "multinomial-Poisson homogeneous models for contingency tables" described in a 2004 paper by Lang in the Annals of Statistics (vol. 32, pp. 340-383).

BRADLEY-TERRY: Prof. David Firth at the University of Warwick has prepared an R package (BradleyTerry) available at CRAN that is designed to fit the Bradley-Terry model and versions of it whereby ability scores are described by a linear predictor (see also volume 12, issue 01 at Journal of Statistical Software). For an overview of what this package can do, you can also go to Dr. Firth's web site.

ITEM RESPONSE THEORY MODELS: Dimitris Rizopoulos from Leuven, Belgium has prepared a package, ltm, for Item Response Theory analyses. This package can fit the Rasch model, the two-parameter logistic model, Birnbaum's three-parameter model, the latent trait model with up to two latent variables, and Samejima's graded response model. See Latent IRT.

MULTIPLICATIVE MODELS (nonlinear in parameters): The gnm add-on package for R, developed by David Firth and Heather Turner at the Univ. of Warwick, can fit multiplicative models such as Goodman's RC model for two-way contingency tables and Anderson's stereotype model for ordinal multinomial responses.

CORRESPONDENCE ANALYSIS: Nenadic and Greenacre have developed the ca package for correspondence analysis. For details, see Correspondence Analysis with R.

CONFIDENCE INTERVALS FOR ASSOCIATION PARAMETERS IN 2x2 AND 2xc TABLES: For a binomial proportion and for parameters comparing two binomial proportions such as the difference of proportions, relative risk, and odds ratio, a good general-purpose method for constructing confidence intervals is to invert the score test. Such intervals are not available in the standard software packages. Here are R functions for confidence intervals for a proportion, R functions for confidence intervals comparing two proportions with independent samples, and R functions for confidence intervals comparing two proportions with dependent samples. These sites also contain R functions for some "exact" small-sample intervals that guarantee at least the nominal coverage probability (such as the Clopper-Pearson and Blaker confidence intervals for a proportion) and adjustments of the Wald interval. Most of these were written by my former graduate student, Yongyi Min, who also prepared the Bayesian intervals mentioned below. The confidence intervals for a proportion include the mid-P adaptation of the Clopper-Pearson interval (written by Anna Gottard, Univ. of Firenze). Please quote this site if you use one of these R functions for confidence intervals for association parameters. We believe these functions are dependable, but no guarantees or support are available, so use them at your own risk.
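To make the idea of inverting the score test concrete for the simplest case, a single binomial proportion, here is a minimal sketch. The function name score.ci is hypothetical and is not one of the functions at the sites mentioned above; it solves |phat - pi| / sqrt(pi(1 - pi)/n) = z for pi, which gives the closed-form score (Wilson) interval.

```r
# Score (Wilson) confidence interval for a binomial proportion,
# obtained by inverting the score test; y successes in n trials
score.ci <- function(y, n, conf.level = 0.95) {
  z    <- qnorm(1 - (1 - conf.level) / 2)
  phat <- y / n
  mid  <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = mid - half, upper = mid + half)
}

score.ci(6, 10)

# Cross-check against prop.test() without the continuity correction
prop.test(6, 10, correct = FALSE)$conf.int
```

For the two-sample parameters (difference of proportions, relative risk, odds ratio) the score interval has no comparably simple closed form and requires iterative computation, which is what the functions referred to above provide.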

Ralph Scherer at the Institute for Biometry in Hannover, Germany, has prepared a package on CRAN incorporating many of these confidence interval functions for proportions and comparisons of proportions. It can be downloaded at Scherer PropCIs package.

Euijung Ryu (a former PhD student of mine who is now at Mayo Clinic) has prepared R functions for various confidence intervals for the ordinal measure [P(Y1 > Y2) + (1/2)P(Y1 = Y2)] that is useful for comparing two multinomial distributions on an ordinal scale. Here is a pdf file of CIs for ordinal effect measure, including simple methods as well as score and profile likelihood intervals (which require using Joe Lang's mph.fit function). Euijung has also prepared R functions for multiple comparisons of proportions with independent samples using simultaneous confidence intervals for the difference of proportions or the odds ratio, based on the studentized-range inversion of score tests proposed by Agresti, Bini, Bertaccini, and Ryu in the journal Biometrics, 2008.
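The point estimate of the ordinal measure itself is simple to compute from two rows of counts. The following is only a sketch with hypothetical counts and assumes independent multinomial samples; it is not Euijung Ryu's code and does not produce the confidence intervals described above.

```r
# Hypothetical counts on a 3-category ordinal scale for two groups
y1 <- c(10, 20, 30)   # group 1
y2 <- c(30, 20, 10)   # group 2

p1 <- y1 / sum(y1)    # sample proportions, group 1
p2 <- y2 / sum(y2)    # sample proportions, group 2

# joint[i, j] estimates P(Y1 = i) P(Y2 = j) under independent samples
joint <- outer(p1, p2)

# Estimate of P(Y1 > Y2) + (1/2) P(Y1 = Y2)
theta <- sum(joint[lower.tri(joint)]) + 0.5 * sum(diag(joint))
theta
```

A value of 0.5 corresponds to no tendency for one group to have higher responses than the other; here theta = 13/18, reflecting the higher responses in group 1.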

BAYESIAN INFERENCE: For surveys of Bayesian inference using R, see Survey by J. H. Park and Survey by Jim Albert. The latter is a website for the text "Bayesian Computation with R" by Jim Albert and shows examples of some categorical data analyses, such as Bayesian inference for a 2x2 table, a Bayesian test of independence in a contingency table, and probit regression.

Yongyi Min has prepared some R functions for Bayesian confidence intervals for 2x2 tables using independent beta priors for two binomial parameters, for the difference of proportions, odds ratio, and relative risk. (These are evaluated and compared to score confidence intervals in a 2005 article in the journal Biometrics by Agresti and Min.)
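The following sketch shows the basic idea with simulation. With independent beta priors, the posterior distributions of the two binomial parameters are also independent beta distributions, so posterior intervals for functions of them can be approximated by sampling. The data, the uniform priors, and the use of equal-tail intervals from simulated draws are all assumptions made here for illustration; Yongyi Min's functions may compute the intervals differently.

```r
set.seed(1)

# Hypothetical data: y successes in n trials for each group
y1 <- 7; n1 <- 10
y2 <- 3; n2 <- 10

# Independent beta(a, b) priors; a = b = 1 gives uniform priors
a <- 1; b <- 1

# Draws from the two independent beta posterior distributions
B   <- 100000
pi1 <- rbeta(B, y1 + a, n1 - y1 + b)
pi2 <- rbeta(B, y2 + a, n2 - y2 + b)

# Equal-tail 95% posterior intervals
ci.diff <- quantile(pi1 - pi2, c(0.025, 0.975))   # difference of proportions
ci.rr   <- quantile(pi1 / pi2, c(0.025, 0.975))   # relative risk
ci.or   <- quantile((pi1 / (1 - pi1)) / (pi2 / (1 - pi2)),
                    c(0.025, 0.975))              # odds ratio
ci.diff; ci.rr; ci.or
```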

LATENT CLASS: Steve Buyske at Rutgers has prepared a library for fitting latent class models with the EM algorithm.

NEG BINOMIAL: The S archive at Statlib contains a negbin() function for negative binomial regression.
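The negbin() function from Statlib is not illustrated here. As an alternative, the MASS library mentioned above contains glm.nb() for negative binomial regression; the following sketch uses the quine data frame supplied with MASS (days absent from school, with four explanatory factors), not a data set from CDA or ICDA.

```r
library(MASS)   # glm.nb() and the quine data

# Negative binomial regression with log link and main effects only
fit.nb <- glm.nb(Days ~ Sex + Age + Eth + Lrn, data = quine)
summary(fit.nb)

# Estimated theta, where Var(Y) = mu + mu^2 / theta
fit.nb$theta
```

In this parameterization, large values of theta correspond to little overdispersion relative to the Poisson model.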

HIGHER-ORDER ASYMPTOTICS: Alessandra Brazzale has prepared functions for a variety of higher-order asymptotic analyses, including approximate conditional analysis for logistic and loglinear models. For information about her hoa package, see also hoa.

Here are some very lightly annotated examples of S-Plus sessions I have conducted with some of the examples in CDA. I have not checked these in a while, and there is no guarantee that all of them are correct!

Stata

For information about Stata (including its use for complex methods such as generalized linear mixed models and GEE), see "A Handbook of Statistical Analyses Using Stata," 4th ed., by S. Rabe-Hesketh and B. Everitt, CRC Press, 2006. For examples of categorical data analyses for many data sets in the first edition of my text "An Introduction to Categorical Data Analysis", see the useful site set up by the UCLA Statistical Computing Center. Information about various programs is available at Stata. A listing of the extensive selection of categorical data methods available in Stata is also given in Table 3 of the article by R. A. Oster in the August 2002 issue of The American Statistician (pp. 243-244); the main focus of that article is on methods for small-sample exact analysis.

The tabulate program in Stata can generate many measures of association and their standard errors. See Stata help for tabulate. Generalized linear models, such as logistic regression and loglinear models, can be fitted with the glm program. See Stata help for glm.

In Stata, the ologit program fits cumulative logit models and the oprobit program fits cumulative probit models. See Stata help for ologit. A program omodel is available from the Stata website for fitting these models and testing the assumption of the same effects for each cumulative probability (i.e., the proportional odds assumption for cumulative logit models). Cumulative link models can also be fitted with the Stata module OGLM. Continuation-ratio logit models can be fitted with the ocratio module. See Stata ocratio search.

For information about using GEE in Stata, see Horton article and Stata GEE search.

The GLLAMM module for Stata (see www.gllamm.org) can fit a very wide variety of models, including logit and cumulative logit models with random effects. For details, see Stata gllamm search and Chapter 5 of "Multilevel and Longitudinal Modeling Using Stata" by S. Rabe-Hesketh and A. Skrondal (Stata Press, 2005).

SPSS (version 19)

CONTINGENCY TABLES:

The DESCRIPTIVE STATISTICS option on the ANALYZE menu has a suboption called CROSSTABS, which provides several methods for contingency tables. After identifying the row and column variables in CROSSTABS, clicking on STATISTICS provides a wide variety of options, including the chi-squared test and measures of association. The output lists the Pearson statistic, its degrees of freedom, and its P-value (labeled Asymp. Sig.). If any expected frequencies in a 2x2 table are less than 5, the output also shows results of Fisher's exact test. That test can also be requested by clicking on Exact in the CROSSTABS dialog box and selecting the exact test. SPSS also has an advanced module for small-sample inference (called SPSS Exact Tests) that provides exact P-values for various tests in CROSSTABS and NPAR TESTS procedures. For instance, the Exact Tests module provides exact tests of independence for r x c contingency tables with nominal or ordinal classifications. See the publication "SPSS Exact Tests for Windows."

In CROSSTABS, clicking on CELLS provides options for displaying observed and expected frequencies, as well as the standardized residuals, labeled as "Adjusted standardized". Clicking on STATISTICS in CROSSTABS provides options of a wide variety of statistics other than chi-squared, including gamma and Kendall's tau-b. The output shows the measures and their standard errors (labeled Asymp. Std. Error), which you can use to construct confidence intervals. It also provides a test statistic for testing that the true measure equals zero, which is the ratio of the estimate to its standard error. This test uses a simpler standard error that only applies under independence and is inappropriate for confidence intervals. One option in the list of statistics, labeled Risk, provides as output the odds ratio and its confidence interval.
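For readers who want to check such output by hand, the following R sketch computes a sample odds ratio and a confidence interval from a 2x2 table. The counts are hypothetical, and the interval shown is the Wald interval formed on the log scale; that this matches the interval in the SPSS Risk output is an assumption, not something verified here.

```r
# Hypothetical 2x2 table of counts
tab <- matrix(c(12, 5, 7, 14), nrow = 2, byrow = TRUE)

# Sample odds ratio
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Asymptotic standard error of the log odds ratio
se <- sqrt(sum(1 / tab))

# 95% Wald interval: estimate plus and minus z times SE on the log scale,
# then exponentiate the endpoints
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)
or; ci
```

The same recipe, estimate plus and minus 1.96 standard errors, applies to gamma or Kendall's tau-b using the Asymp. Std. Error values, without the log transformation.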

Suppose you enter the data as cell counts for the various combinations of the two variables, rather than as responses on the two variables for individual subjects; for instance, perhaps you call COUNT the variable that contains these counts. Then, select the WEIGHT CASES option on the DATA menu in the Data Editor window and instruct SPSS to weight cases by COUNT.

GLMs and LOGISTIC REGRESSION:

To fit generalized linear models, on the ANALYZE menu select the GENERALIZED LINEAR MODELS option and the GENERALIZED LINEAR MODELS suboption. Select the Dependent Variable and then the Distribution and Link Function. Click on the Predictors tab at the top of the dialog box and then enter quantitative variables as Covariates and categorical variables as Factors. Click on the Model tab at the top of the dialog box and enter these variables as main effects, and construct any interactions that you want in the model. Click on OK to run the model.

To fit logistic regression models, on the ANALYZE menu select the REGRESSION option and the BINARY LOGISTIC suboption. In the LOGISTIC REGRESSION dialog box, identify the binary response (dependent) variable and the explanatory predictors (covariates). Highlight variables in the source list and click on a*b to create an interaction term. Identify the explanatory variables that are categorical and for which you want dummy variables by clicking on Categorical and declaring such a covariate to be a Categorical Covariate in the LOGISTIC REGRESSION: DEFINE CATEGORICAL VARIABLES dialog box. Highlight the categorical covariate and under Change Contrast you will see several options for setting up dummy variables. The Simple contrast constructs them as in this text, in which the final category is the baseline.

In the LOGISTIC REGRESSION dialog box, click on Method for stepwise model selection procedures, such as backward elimination. Click on Save to save predicted probabilities, measures of influence such as leverage values and DFBETAS, and standardized residuals. Click on Options to open a dialog box that contains an option to construct confidence intervals for exponentiated parameters.

Another way to fit logistic regression models is with the GENERALIZED LINEAR MODELS option and suboption on the ANALYZE menu. You pick the binomial distribution and logit link function. It is also possible there to enter the data as the number of successes out of a certain number of trials, which is useful when the data are in contingency table form. One can also fit such models using the LOGLINEAR option with the LOGIT suboption in the ANALYZE menu. One identifies the dependent variable, selects categorical predictors as factors, and selects quantitative predictors as cell covariates. The default fit is the saturated model for the factors, without including any covariates. To change this, click on Model and select a Custom model, entering the predictors and relevant interactions as terms in a customized (unsaturated) model. Clicking on Options, one can also display standardized residuals (called adjusted residuals) for model fits. This approach is well suited for logit models with categorical predictors, since standard output includes observed and expected frequencies. When the data file contains the data as cell counts, such as binomial numbers of successes and failures, one weights each cell by the cell count using the WEIGHT CASES option in the DATA menu.

MULTINOMIAL RESPONSES and LOGLINEAR MODELS:

SPSS can also fit logistic models for categorical response variables having several response categories. On the ANALYZE menu, choose the REGRESSION option and then the ORDINAL suboption for a cumulative logit model. Select the MULTINOMIAL LOGISTIC suboption for a baseline-category logit model. In the latter, click on Statistics and check Likelihood-ratio tests under Parameters to obtain results of likelihood-ratio tests for the effects of the predictors.

For loglinear models, one uses the LOGLINEAR option with GENERAL suboption in the ANALYZE menu. One enters the factors for the model. The default is the saturated model, so click on Model and select a Custom model. Enter the factors as terms in a customized (unsaturated) model and then select additional interaction effects. Click on Options to show options for displaying observed and expected frequencies and adjusted residuals. When the data file contains the data as cell counts for the various combinations of factors rather than as responses listed for individual subjects, weight each cell by the cell count using the WEIGHT CASES option in the DATA menu.

CLUSTERED DATA:

For GEE methods, on the ANALYZE menu choose the GENERALIZED LINEAR MODELS option and the GENERALIZED ESTIMATING EQUATIONS suboption. You can then select the structure for the working correlation matrix and identify the between-subject and within-subject variables. For random effects models, on the ANALYZE menu choose the MIXED MODELS option and the GENERALIZED LINEAR suboption.

GLIM

See the first edition of "Categorical Data Analysis" (1990) for several GLIM examples, as well as the 2005 text by Aitkin, Francis, and Hinde on "Statistical Modeling in GLIM4" (Oxford) and Jim Lindsey's 1989 text on "The Analysis of Categorical Data Using GLIM" (Springer-Verlag). See Statlib for an archive of GLIM macros. Also, Rory Wolfe has prepared macros for cumulative link models.

StatXact and LogXact

StatXact (Cytel Software, Cambridge MA) provides exact analysis for categorical data methods and some nonparametric methods. Among its procedures are small-sample confidence intervals for differences and ratios of proportions and for odds ratios, and Fisher's exact test and its generalizations for IxJ tables. It also can conduct exact tests of conditional independence and of equality of odds ratios in 2x2xK tables, and exact confidence intervals for the common odds ratio in several 2x2 tables. StatXact uses Monte Carlo methods to approximate exact P-values and confidence intervals when a data set is too large for exact inference to be computationally feasible. Its companion LogXact performs exact conditional logistic regression. The President of Cytel Software is Dr. Cyrus Mehta, who has been one of the most active researchers in the past 20 years in advancing the development of algorithms for conducting small-sample inference for categorical data. For a brief survey of the capability of these packages, see the article by R. A. Oster in the August 2002 issue of The American Statistician (pp. 243-244). The manuals of these programs are good sources of detailed explanations about the small-sample methods.

Others

SuperMix, distributed by Scientific Software International, provides ML fitting of generalized linear mixed models, including count responses, nominal responses, and ordinal responses using cumulative links including the cumulative logit, cumulative probit, and cumulative complementary log-log. This program is based on software developed over the years by Donald Hedeker and Robert Gibbons, who have also done considerable research on mixed models. For multilevel models, the program is supposed to be much faster than PROC MIXED or PROC NLMIXED in SAS and to make it possible to fit relatively complex models using ML rather than approximations such as penalized quasi-likelihood (communication from Robert Gibbons).

Latent Gold is the website for the Latent Gold program (marketed by Statistical Innovations of Belmont, MA) for fitting a wide variety of finite mixture models such as latent class models (i.e., the latent variable is categorical). It can handle binary, nominal, ordinal, and count response variables and can include random effects that are treated in a nonparametric manner rather than assumed to have a normal distribution.

SUDAAN provides analyses for categorical and continuous data from stratified multi-stage cluster designs. It has a facility (the MULTILOG procedure) for GEE analyses of marginal models for nominal and ordinal responses. See SUDAAN GEE.

Robert Newcombe at the University of Wales in Cardiff provides an Excel spreadsheet for forming various confidence intervals for a proportion and for comparing two proportions with independent or with matched samples. His website also has SPSS and Minitab macros for doing this.

Berger and Boos have software for the Berger-Boos test and other small-sample unconditional tests for 2-by-2 tables.

Pesarin and Salmaso give a variety of permutation analyses for categorical and continuous variables, including some multivariate analyses, using a SAS macro constructed by Luigi Salmaso at the University of Padova.

For a survey of software for implementing the GEE method, see the article by Horton and Lipsitz in The American Statistician, 1999, vol. 53, pp. 160-169.


Copyright © 2008, Alan Agresti, Department of Statistics, University of Florida.