zeroinfl {pscl}    R Documentation

Zero-inflated Count Data Regression

Description

Fit zero-inflated regression models for count data via maximum likelihood.

Usage

zeroinfl(formula, data, subset, na.action, weights, offset,
  dist = c("poisson", "negbin", "geometric"),
  link = c("logit", "probit", "cloglog", "cauchit", "log"),
  control = zeroinfl.control(...),
  model = TRUE, y = TRUE, x = FALSE, ...)

Arguments

formula symbolic description of the model, see details.
data, subset, na.action arguments controlling formula processing via model.frame.
weights optional numeric vector of weights.
offset optional numeric vector with an a priori known component to be included in the linear predictor of the count model.
dist character specification of count model family (a log link is always used).
link character specification of link function in the binary zero-inflation model (a binomial family is always used).
control a list of control arguments specified via zeroinfl.control.
model, y, x logicals. If TRUE the corresponding components of the fit (model frame, response, model matrix) are returned.
... arguments passed to zeroinfl.control in the default setup.

Details

Zero-inflated count models are two-component mixture models combining a point mass at zero with a proper count distribution. Thus, there are two sources of zeros: zeros may come from either the point mass or the count component. Usually the count model is a Poisson or negative binomial regression (with log link). The geometric distribution is a special case of the negative binomial with size parameter equal to 1. For modeling the unobserved state (zero vs. count), a binary model is used: in the simplest case with an intercept only, but potentially containing regressors. For this zero-inflation model, a binomial model with different links can be used, typically logit or probit.
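
The mixture structure described above can be sketched in a few lines of base R. This is an illustration only: the helper dzip and the parameter values are assumptions made for the example and are not part of pscl.

```r
## Zero-inflated Poisson probability mass function (hypothetical helper,
## not a pscl function): pz is the point-mass probability, lambda the
## mean of the Poisson count component.
dzip <- function(y, lambda, pz) {
  pz * (y == 0) + (1 - pz) * dpois(y, lambda)
}

lambda <- 2    # assumed Poisson mean
pz <- 0.3      # assumed probability of the point mass at zero

## The two sources of zeros: point mass plus zeros from the count component
c(point_mass  = pz,
  count_zeros = (1 - pz) * dpois(0, lambda),
  total       = dzip(0, lambda, pz))

## The mixture is a proper distribution with mean (1 - pz) * lambda
y <- 0:100
total_prob <- sum(dzip(y, lambda, pz))
mix_mean <- sum(y * dzip(y, lambda, pz))
c(total_prob = total_prob, mean = mix_mean)
```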

The formula mainly describes the count data model, i.e., y ~ x1 + x2 specifies a count data regression where all zero counts have the same probability of belonging to the zero component. This is equivalent to the model y ~ x1 + x2 | 1, making it more explicit that the zero-inflation model only has an intercept. Additionally, further regressors can be added to the zero-inflation model so that not all zeros have the same probability of belonging to the point mass component or to the count component. A typical formula is y ~ x1 + x2 | z1 + z2. The regressors in the zero and the count component can be overlapping (or identical).

All parameters are estimated by maximum likelihood using optim, with control options set in zeroinfl.control. Starting values can be supplied, estimated by the EM (expectation maximization) algorithm, or by glm.fit (the default). The latter corresponds to the first iteration of the EM algorithm and initializes the unobserved state as y > 0, i.e., all zeros are assigned to the point mass component and only the non-zero counts to the count component. Standard errors are derived numerically using the Hessian matrix returned by optim. See zeroinfl.control for details.
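
The estimation strategy can be illustrated with a minimal base R sketch for an intercept-only zero-inflated Poisson model. It is not the code used inside zeroinfl: the simulated data, the function nll, the naive starting values c(0, 0) and the choice of BFGS are all assumptions made for the example.

```r
set.seed(1)
n <- 500
## Simulate intercept-only ZIP data (assumed truth: pz = 0.3, lambda = 2)
z <- rbinom(n, 1, 0.3)               # 1 = observation from the point mass
y <- ifelse(z == 1, 0, rpois(n, 2))

## Negative log-likelihood; par = c(log(lambda), logit(pz))
nll <- function(par) {
  lambda <- exp(par[1])
  pz <- plogis(par[2])
  dens <- pz * (y == 0) + (1 - pz) * dpois(y, lambda)
  -sum(log(dens))
}

## Minimize the negative log-likelihood and request a numerical Hessian
fit <- optim(c(0, 0), nll, method = "BFGS", hessian = TRUE)

## Standard errors from the inverse Hessian (on the log / logit scale)
est <- fit$par
se <- sqrt(diag(solve(fit$hessian)))
cbind(estimate = est, std.error = se)

## Back-transform to the natural parameter scale
c(lambda = exp(est[1]), pz = plogis(est[2]))
```

Because the Hessian belongs to the negative log-likelihood, its inverse directly estimates the covariance matrix of the parameters on the scale used for optimization.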

The returned fitted model object is of class "zeroinfl" and is similar to fitted "glm" objects. For elements such as "coefficients" or "terms" a list is returned with elements for the zero and count component, respectively. For details see below.

A set of standard extractor functions for fitted model objects is available for objects of class "zeroinfl", including methods for the generic functions print, summary, coef, vcov, logLik, residuals, predict, fitted, terms, model.matrix. See predict.zeroinfl for more details on all methods.
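
A short sketch of how some of these extractors might be used follows. It assumes a reasonably recent pscl installation; the model formula is chosen only for illustration, and the model and type arguments are those documented in predict.zeroinfl.

```r
library("pscl")
data("bioChemists", package = "pscl")

## Illustrative model: two count regressors, one zero-inflation regressor
fm <- zeroinfl(art ~ fem + ment | ment, data = bioChemists)

coef(fm)                    # coefficients of both components
coef(fm, model = "zero")    # zero-inflation component only
sqrt(diag(vcov(fm)))        # standard errors from the covariance matrix
logLik(fm)                  # maximized log-likelihood

## Predicted means, and predicted probabilities of the point mass
head(predict(fm, type = "response"))
head(predict(fm, type = "zero"))
```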

Value

An object of class "zeroinfl", i.e., a list with components including

coefficients a list with elements "count" and "zero" containing the coefficients from the respective models,
residuals a vector of raw residuals (observed - fitted),
fitted.values a vector of fitted means,
optim a list with the output from the optim call for minimizing the negative log-likelihood,
control the control arguments passed to the optim call,
start the starting values for the parameters passed to the optim call,
weights the case weights used,
offset the offset vector used (if any),
n number of observations,
df.null residual degrees of freedom for the null model (= n - 2),
df.residual residual degrees of freedom for fitted model,
terms a list with elements "count", "zero" and "full" containing the terms objects for the respective models,
theta estimate of the additional theta parameter of the negative binomial model (if a negative binomial regression is used),
SE.logtheta standard error for log(theta),
loglik log-likelihood of the fitted model,
vcov covariance matrix of all coefficients in the model (derived from the Hessian of the optim output),
dist character string describing the count distribution used,
link character string describing the link of the zero-inflation model,
linkinv the inverse link function corresponding to link,
converged logical indicating successful convergence of optim,
call the original function call,
formula the original formula,
levels levels of the categorical regressors,
contrasts a list with elements "count" and "zero" containing the contrasts corresponding to levels from the respective models,
model the full model frame (if model = TRUE),
y the response count vector (if y = TRUE),
x a list with elements "count" and "zero" containing the model matrices from the respective models (if x = TRUE),
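
Since the returned object is a list, the components above can also be accessed directly. A brief sketch, assuming pscl is installed and using an illustrative model:

```r
library("pscl")
data("bioChemists", package = "pscl")

## Illustrative zero-inflated Poisson fit
fm <- zeroinfl(art ~ fem + ment | ment, data = bioChemists)

fm$coefficients$count   # coefficients of the count model
fm$coefficients$zero    # coefficients of the zero-inflation model
fm$loglik               # log-likelihood of the fitted model
fm$converged            # TRUE if optim converged
c(n = fm$n, df.residual = fm$df.residual)
```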

Author(s)

Achim Zeileis <Achim.Zeileis@R-project.org>

References

Cameron, A. Colin and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. New York: Cambridge University Press.

Cameron, A. Colin and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Lambert, Diane. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics. 34(1):1-14.

Zeileis, Achim, Christian Kleiber and Simon Jackman. 2007. “Regression Models for Count Data in R.” Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Research Report Series. http://epub.wu-wien.ac.at/.

See Also

zeroinfl.control, glm, glm.fit, glm.nb, hurdle

Examples

## from Long (1997)
data("bioChemists", package = "pscl")

## without inflation
## ("art ~ ." is "art ~ fem + mar + kid5 + phd + ment")
fm_pois <- glm(art ~ ., data = bioChemists, family = poisson)
fm_qpois <- glm(art ~ ., data = bioChemists, family = quasipoisson)
## glm.nb is provided by package MASS
fm_nb <- MASS::glm.nb(art ~ ., data = bioChemists)

## with simple inflation
## (no regressors for 0 component)
fm_zip <- zeroinfl(art ~ ., data = bioChemists)
fm_zinb <- zeroinfl(art ~ ., data = bioChemists, dist = "negbin", EM = TRUE)

## inflation with regressors (choose starting values by EM)
## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment")
fm_zip2 <- zeroinfl(art ~ . | ., data = bioChemists, EM = TRUE)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin", EM = TRUE)

[Package pscl version 0.92 Index]