In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It may be used to predict a binary response from a binary predictor, and more generally to predict the outcome of a categorical dependent variable based on one or more predictor variables; that is, it is used in estimating the parameters of a qualitative response model.
The probabilities describing the possible outcomes of a single trial are modeled, as
a function of the explanatory variables, using a logistic function. Frequently "logistic
regression" is used to refer specifically to the problem in which the dependent variable
is binary—that is, the number of available categories is two—while problems with more
than two categories are referred to as multinomial logistic regression or, if the multiple categories
are ordered, as ordered logistic regression. Logistic regression measures the relationship
between a categorical dependent variable and one or more independent variables, which are
usually continuous, by using probability scores as the predicted values of the dependent variable.
As such it treats the same set of problems as does probit regression using similar techniques.
Fields and examples of applications
Logistic regression was put forth in the 1940s
as an alternative to Fisher's 1936 classification method, linear discriminant analysis. It is
used extensively in numerous disciplines, including the medical and social science fields.
For example, the Trauma and Injury Severity Score, which is widely used to predict mortality
in injured patients, was originally developed by Boyd et al. using logistic regression.
Logistic regression might be used to predict whether a patient has a given disease, based
on observed characteristics of the patient. Another example might be to predict whether
an American voter will vote Democratic or Republican, based on age, income, gender,
race, state of residence, votes in previous elections, etc. The technique can also be
used in engineering, especially for predicting the probability of failure of a given process,
system or product. It is also used in marketing applications such as prediction of a customer's
propensity to purchase a product or cease a subscription, etc. In economics it can be
used to predict the likelihood of a person's choosing to be in the labor force, and a business
application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional
random fields, an extension of logistic regression to sequential data, are used in natural language
processing.
Basics
Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals
with situations in which the observed outcome for a dependent variable can have only two
possible types. Multinomial logistic regression deals with situations where the outcome can
have three or more possible types. In binary logistic regression, the outcome is usually
coded as "0" or "1", as this leads to the most straightforward interpretation. If a
particular observed outcome for the dependent variable is the noteworthy possible outcome,
it is usually coded as "1" and the contrary outcome as "0". Logistic regression is used
to predict the odds of being a case based on the values of the independent variables.
The odds are defined as the probability that a particular outcome is a case divided by
the probability that it is a noncase. Like other forms of regression analysis, logistic
regression makes use of one or more predictor variables that may be either continuous or
categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting
binary outcomes of the dependent variable rather than continuous outcomes. Given this
difference, it is necessary that logistic regression take the natural logarithm of the
odds of the dependent variable being a case to create a continuous criterion as a transformed
version of the dependent variable. Thus the logit transformation is referred to as the
link function in logistic regression—although the dependent variable in logistic regression
is binomial, the logit is the continuous criterion upon which linear regression is conducted.
The logit of success is then fit to the predictors using linear regression analysis. The predicted
value of the logit is converted back into predicted odds via the inverse of the natural
logarithm, namely the exponential function. Therefore, although the observed dependent
variable in logistic regression is a zero-or-one variable, the logistic regression estimates
the odds, as a continuous variable, that the dependent variable is a success. In some applications
the odds are all that is needed. In others, a specific yes-or-no prediction is needed
for whether the dependent variable is or is not a case; this categorical prediction can
be based on the computed odds of a success, with predicted odds above some chosen cut-off
value being translated into a prediction of a success.
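As a concrete illustration of this chain from log odds to a categorical prediction, here is a minimal Python sketch; the coefficient values and the cut-off are invented for illustration and are not taken from the text:

import math

# Hypothetical fitted coefficients, assumed purely for illustration.
intercept = -1.5
slope = 0.8

def predict(x, cutoff_odds=1.0):
    """Turn a predictor value into log odds, odds, a probability, and a 0/1 call."""
    log_odds = intercept + slope * x              # the logit: ln(p / (1 - p))
    odds = math.exp(log_odds)                     # inverse of the natural logarithm
    probability = odds / (1.0 + odds)             # equivalently 1 / (1 + exp(-log_odds))
    prediction = 1 if odds > cutoff_odds else 0   # categorical call from a chosen cut-off
    return log_odds, odds, probability, prediction

print(predict(2.0))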
Logistic function, odds ratio, and logit
An explanation of logistic regression begins with an explanation of the logistic function,
which always takes on values between zero and one:
F(t) = 1 / (1 + e^(-t)).
Viewing t as a linear function of an explanatory variable x, with t = β0 + β1·x, the logistic function can be written as:
F(x) = 1 / (1 + e^(-(β0 + β1·x))).
This will be interpreted as the probability of the dependent variable equalling a "success" or "case" rather than a failure or non-case. We also define the inverse of the logistic function, the logit:
g(F(x)) = ln( F(x) / (1 - F(x)) ) = β0 + β1·x,
and equivalently:
F(x) / (1 - F(x)) = e^(β0 + β1·x).
A graph of the logistic function is shown in Figure 1. The input is the value of β0 + β1·x and the output is F(x). The logistic function is useful because it can take an input with any value
from negative infinity to positive infinity, whereas the output is confined to values between
0 and 1 and hence is interpretable as a probability. In the above equations, g(F(x)) refers to the logit function of some given linear combination of the predictors, ln denotes the natural logarithm, F(x) is the probability that the dependent variable equals a case, β0 is the intercept from the linear regression equation, β1·x is the regression coefficient multiplied by some value of the predictor, and base e denotes the exponential function. The formula for F(x) illustrates that the probability
of the dependent variable equaling a case is equal to the value of the logistic function
of the linear regression expression. This is important in that it shows that the value
of the linear regression expression can vary from negative to positive infinity and yet,
after transformation, the resulting expression for the probability ranges between 0 and 1.
The equation for g(F(x)) illustrates that the logit is equivalent to the linear regression expression.
Likewise, the next equation illustrates that the odds of the dependent variable equaling
a case is equivalent to the exponential function of the linear regression expression. This
illustrates how the logit serves as a link function between the probability and the linear
regression expression. Given that the logit ranges between negative infinity and positive
infinity, it provides an adequate criterion upon which to conduct linear regression and
the logit is easily converted back into the odds.
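To make these relationships concrete, the following minimal Python sketch (the coefficient values are assumed purely for illustration) evaluates the logistic function, the odds, and the logit, and shows that the logit recovers the linear regression expression:

import math

beta0, beta1 = -1.0, 0.5   # assumed intercept and coefficient, for illustration only
x = 3.0                    # an arbitrary value of the explanatory variable

linear = beta0 + beta1 * x                 # the linear regression expression
p = 1.0 / (1.0 + math.exp(-linear))        # logistic function: probability of a "case"
odds = p / (1.0 - p)                       # odds of being a case
logit = math.log(odds)                     # the logit (log odds); equals `linear` up to rounding

print(p, odds, logit)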
Multiple explanatory variables
If there are multiple explanatory variables, then the above expression β0 + β1·x can be revised to β0 + β1·x1 + β2·x2 + ... + βm·xm. Then, when this is used in the equation relating
the logged odds of a success to the values of the predictors, the linear regression will
be a multiple regression with m explanators; the parameters βj for all j = 0, 1, 2, ..., m
are all estimated.
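In code, the multi-predictor case simply replaces the single term with a sum over all predictors; a minimal sketch, with the coefficient vector and inputs assumed for illustration only:

import math

beta = [-1.0, 0.5, 0.25, -0.75]   # assumed coefficients β0, β1, β2, β3
x = [1.0, 2.0, 0.0, 4.0]          # the leading 1.0 pairs with the intercept β0

linear = sum(b * xi for b, xi in zip(beta, x))   # β0 + β1·x1 + β2·x2 + β3·x3
p = 1.0 / (1.0 + math.exp(-linear))              # probability of a case
print(linear, p)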
Model fitting
Estimation
Maximum likelihood estimation
The regression coefficients are usually estimated using maximum likelihood estimation. Unlike
linear regression with normally distributed residuals, it is not possible to find a closed-form
expression for the coefficient values that maximize the likelihood function, so an iterative
process must be used instead, for example Newton's method. This process begins with
a tentative solution, revises it slightly to see if it can be improved, and repeats
this revision until improvement is minute, at which point the process is said to have
converged.
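A rough sketch of this iterative fitting, using Newton's method on the log-likelihood, might look like the following; it is written from scratch purely for illustration, and a real analysis would normally rely on an established statistics package:

import numpy as np

def fit_logistic(X, y, iters=25, tol=1e-8):
    """Maximum likelihood estimation via Newton's method (illustrative sketch)."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend a column of 1s for the intercept
    beta = np.zeros(X.shape[1])                 # tentative solution: all coefficients zero
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities under current fit
        gradient = X.T @ (y - p)                # score vector of the log-likelihood
        W = p * (1.0 - p)                       # weights from the current fit
        hessian = X.T @ (X * W[:, None])        # negative Hessian of the log-likelihood
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step                      # revise the tentative solution slightly
        if np.max(np.abs(step)) < tol:          # improvement is minute: converged
            break
    return beta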
In some instances the model may not reach convergence. When a model does not converge, this indicates that the coefficients are not
meaningful because the iterative process was unable to find appropriate solutions. A failure
to converge may occur for a number of reasons: having a large proportion of predictors to
cases, multicollinearity, sparseness, or complete separation.
Having a large proportion of variables to cases results in an overly conservative Wald
statistic and can lead to nonconvergence. Multicollinearity refers to unacceptably high
correlations between predictors. As multicollinearity increases, coefficients remain unbiased but
standard errors increase and the likelihood of model convergence decreases. To detect
multicollinearity amongst the predictors, one can conduct a linear regression analysis
with the predictors of interest for the sole purpose of examining the tolerance statistic
used to assess whether multicollinearity is unacceptably high.
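For instance, the tolerance of each predictor can be computed as one minus the R-squared from regressing that predictor on the remaining predictors. The helper below is a hypothetical sketch of that check; the common rule of thumb that tolerances near zero signal trouble is an assumption here, not something specified in the text:

import numpy as np

def tolerances(X):
    """Tolerance (1 - R^2) of each column of X regressed on the other columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 - r_squared)              # values near 0 flag high multicollinearity
    return out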
Sparseness in the data refers to having a large proportion of empty cells. Zero cell
counts are particularly problematic with categorical predictors. With continuous predictors, the
model can infer values for the zero cell counts, but this is not the case with categorical
predictors. The reason the model will not converge with zero cell counts for categorical
predictors is that the natural logarithm of zero is an undefined value, so final solutions
to the model cannot be reached. To remedy this problem, researchers may collapse categories
in a theoretically meaningful way or may consider adding a constant to all cells.
Another numerical problem that may lead to a lack of convergence is complete separation,
which refers to the instance in which the predictors perfectly predict the criterion
– all cases are accurately classified. In such instances, one should reexamine the data,
as there is likely some kind of error. Although not a precise number, as a general
rule of thumb, logistic regression models require a minimum of 10 events per explanatory
variable.
Minimum chi-squared estimator for grouped data
While individual data will have a dependent
variable with a value of zero or one for every observation, with grouped data one observation
is on a group of people who all share the same characteristics; in this case the researcher
observes the proportion of people in the group for whom the response variable falls into
one category or the other. If this proportion is neither zero nor one for any group, the
minimum chi-squared estimator involves using weighted least squares to estimate a linear
model in which the dependent variable is the logit of the proportion: that is, the log
of the ratio of the fraction in one group to the fraction in the other group.
Evaluating goodness of fit
Goodness of fit in linear regression models
is generally measured using the R2. Since this has no direct analog in logistic regression,
various methods including the following can be used instead.
Deviance and likelihood ratio tests
In linear regression analysis, one is concerned
with partitioning variance via the sum of squares calculations – variance in the criterion
is essentially divided into variance accounted for by the predictors and residual variance.
In logistic regression analysis, deviance is used in lieu of sum of squares calculations.
Deviance is analogous to the sum of squares calculations in linear regression and is a
measure of the lack of fit to the data in a logistic regression model. Deviance is calculated
by comparing a given model with the saturated model – a model with a theoretically perfect
fit. This computation is called the likelihood-ratio test:
D = -2 ln( (likelihood of the fitted model) / (likelihood of the saturated model) ).
In the above equation D represents the deviance and ln represents the natural logarithm. Because the likelihood ratio is less than one, its natural logarithm is negative; multiplying by negative two therefore produces a positive value with an approximate chi-squared
distribution. Smaller values indicate better fit as the fitted model deviates less from
the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square
values indicate very little unexplained variance and thus, good model fit. Conversely, a significant
chi-square value indicates that a significant amount of the variance is unexplained.
Two measures of deviance are particularly important in logistic regression: null deviance
and model deviance. The null deviance represents the difference between a model with only the
intercept and the saturated model. And, the model deviance represents the difference between
a model with at least one predictor and the saturated model. In this respect, the null
model provides a baseline upon which to compare predictor models. Given that deviance is a
measure of the difference between a given model and the saturated model, smaller values
indicate better fit. Therefore, to assess the contribution of a predictor or set of
predictors, one can subtract the model deviance from the null deviance and assess the difference
on a chi-square distribution with degrees of freedom equal to the difference in the number
of parameters estimated. Let
D_null = -2 ln( (likelihood of the null model) / (likelihood of the saturated model) ) and
D_model = -2 ln( (likelihood of the fitted model) / (likelihood of the saturated model) ).
Then
D_null - D_model = -2 ln( (likelihood of the null model) / (likelihood of the fitted model) ).
If the model deviance is significantly smaller than the null deviance then one can conclude
that the predictor or set of predictors significantly improved model fit. This is analogous to the
F-test used in linear regression analysis to assess the significance of prediction.
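This comparison of null and model deviance can be sketched in code as follows; the log-likelihood values are assumed purely for illustration, the saturated-model term cancels out of the difference, and scipy's chi-squared survival function supplies the tail probability:

from scipy.stats import chi2

# Assumed log-likelihoods, purely for illustration
loglik_null = -120.4    # intercept-only (null) model
loglik_model = -101.7   # model with k additional predictors
k = 3                   # difference in number of parameters estimated

lr_statistic = -2.0 * (loglik_null - loglik_model)   # D_null - D_model
p_value = chi2.sf(lr_statistic, df=k)                # upper tail of the chi-squared distribution
print(lr_statistic, p_value)

A small p-value here corresponds to the significant chi-square value described above, indicating that the predictors significantly improve model fit.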