
  • In statistics, logistic regression, or logit regression, is a type of probabilistic statistical

  • classification model. It can be used to predict a binary response from a binary predictor,

  • and more generally to predict the outcome of a categorical dependent variable based on one or more predictor

  • variables. That is, it is used in estimating the parameters of a qualitative response model.

  • The probabilities describing the possible outcomes of a single trial are modeled, as

  • a function of the explanatory variables, using a logistic function. Frequently "logistic

  • regression" is used to refer specifically to the problem in which the dependent variable

  • is binary, that is, the number of available categories is two, while problems with more

  • than two categories are referred to as multinomial logistic regression or, if the multiple categories

  • are ordered, as ordered logistic regression. Logistic regression measures the relationship

  • between a categorical dependent variable and one or more independent variables, which are

  • usually continuous, by using probability scores as the predicted values of the dependent variable.

  • As such it treats the same set of problems as does probit regression using similar techniques.

  • Fields and examples of applications

  • Logistic regression was put forth in the 1940s

  • as an alternative to Fisher's 1936 classification method, linear discriminant analysis. It is

  • used extensively in numerous disciplines, including the medical and social science fields.

  • For example, the Trauma and Injury Severity Score, which is widely used to predict mortality

  • in injured patients, was originally developed by Boyd et al. using logistic regression.

  • Logistic regression might be used to predict whether a patient has a given disease, based

  • on observed characteristics of the patient. Another example might be to predict whether

  • an American voter will vote Democratic or Republican, based on age, income, gender,

  • race, state of residence, votes in previous elections, etc. The technique can also be

  • used in engineering, especially for predicting the probability of failure of a given process,

  • system or product. It is also used in marketing applications such as prediction of a customer's

  • propensity to purchase a product or cease a subscription, etc. In economics it can be

  • used to predict the likelihood of a person's choosing to be in the labor force, and a business

  • application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional

  • random fields, an extension of logistic regression to sequential data, are used in natural language

  • processing.

  • Basics

  • Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals

  • with situations in which the observed outcome for a dependent variable can have only two

  • possible types. Multinomial logistic regression deals with situations where the outcome can

  • have three or more possible types. In binary logistic regression, the outcome is usually

  • coded as "0" or "1", as this leads to the most straightforward interpretation. If a

  • particular observed outcome for the dependent variable is the noteworthy possible outcome,

  • it is usually coded as "1" and the contrary outcome as "0". Logistic regression is used

  • to predict the odds of being a case based on the values of the independent variables.

  • The odds are defined as the probability that a particular outcome is a case divided by

  • the probability that it is a noncase. Like other forms of regression analysis, logistic

  • regression makes use of one or more predictor variables that may be either continuous or

  • categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting

  • binary outcomes of the dependent variable rather than continuous outcomes. Given this

  • difference, it is necessary that logistic regression take the natural logarithm of the

  • odds of the dependent variable being a case to create a continuous criterion as a transformed

  • version of the dependent variable. Thus the logit transformation is referred to as the

  • link function in logistic regression: although the dependent variable in logistic regression

  • is binomial, the logit is the continuous criterion upon which linear regression is conducted.

  • The logit of success is then fit to the predictors using linear regression analysis. The predicted

  • value of the logit is converted back into predicted odds via the inverse of the natural

  • logarithm, namely the exponential function. Therefore, although the observed dependent

  • variable in logistic regression is a zero-or-one variable, the logistic regression estimates

  • the odds, as a continuous variable, that the dependent variable is a success. In some applications

  • the odds are all that is needed. In others, a specific yes-or-no prediction is needed

  • for whether the dependent variable is or is not a case; this categorical prediction can

  • be based on the computed odds of a success, with predicted odds above some chosen cut-off

  • value being translated into a prediction of a success.
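The odds-to-prediction step described above can be sketched as follows; the cutoff of 1.0 on the odds scale (equivalently 0.5 on the probability scale) is an illustrative choice, not a rule from the text:

```python
import math

def probability_from_logit(logit_value):
    # Inverse of the logit: the logistic function.
    return 1.0 / (1.0 + math.exp(-logit_value))

def odds_from_logit(logit_value):
    # Predicted odds are the exponential of the predicted logit
    # (the inverse of the natural logarithm).
    return math.exp(logit_value)

def classify(logit_value, odds_cutoff=1.0):
    # Predicted odds above the chosen cutoff translate into a
    # categorical prediction of a success.
    return 1 if odds_from_logit(logit_value) > odds_cutoff else 0

# A logit of 0 corresponds to odds of 1 and a probability of 0.5.
assert classify(0.7) == 1     # odds = e^0.7, greater than 1
assert classify(-0.7) == 0
```

In applications where the odds themselves are all that is needed, only `odds_from_logit` is used and the cutoff step is skipped.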

  • Logistic function, odds ratio, and logit

  • An explanation of logistic regression begins with an explanation of the logistic function,

  • which always takes on values between zero and one:

  • F(t) = 1 / (1 + e^(−t))

  • and viewing t as a linear function of an explanatory variable x, the logistic function can be written

  • as:

  • F(x) = 1 / (1 + e^(−(β0 + β1x)))

  • This will be interpreted as the probability of the dependent variable equalling a "success"

  • or "case" rather than a failure or non-case. We also define the inverse of the logistic

  • function, the logit:

  • g(F(x)) = ln(F(x) / (1 − F(x))) = β0 + β1x

  • and equivalently:

  • F(x) / (1 − F(x)) = e^(β0 + β1x)

  • A graph of the logistic function is shown in Figure 1. The input is the value of t and

  • the output is F(t). The logistic function is useful because it can take an input with any value

  • from negative infinity to positive infinity, whereas the output is confined to values between

  • 0 and 1 and hence is interpretable as a probability. In the above equations, g refers to the logit

  • function of some given linear combination of the predictors, ln denotes the natural logarithm,

  • F(x) is the probability that the dependent variable equals a case, β0 is the intercept from the linear

  • regression equation, β1x is the regression coefficient multiplied by some value of the predictor,

  • and base e denotes the exponential function. The formula for F(x) illustrates that the probability

  • of the dependent variable equaling a case is equal to the value of the logistic function

  • of the linear regression expression. This is important in that it shows that the value

  • of the linear regression expression can vary from negative to positive infinity and yet,

  • after transformation, the resulting expression for the probability ranges between 0 and 1.

  • The equation for g(F(x)) illustrates that the logit is equivalent to the linear regression expression.

  • Likewise, the next equation illustrates that the odds of the dependent variable equaling

  • a case is equivalent to the exponential function of the linear regression expression. This

  • illustrates how the logit serves as a link function between the probability and the linear

  • regression expression. Given that the logit ranges between negative infinity and positive

  • infinity, it provides an adequate criterion upon which to conduct linear regression and

  • the logit is easily converted back into the odds.
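The function pair described above can be written out directly; the coefficient values in the second half are illustrative, not taken from the text:

```python
import math

def logistic(t):
    # Maps any real input to an output strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Log-odds: the inverse of the logistic function, defined on (0, 1).
    return math.log(p / (1.0 - p))

# The two functions are inverses of each other.
for t in (-3.0, 0.0, 2.5):
    assert abs(logit(logistic(t)) - t) < 1e-9

# With a linear predictor b0 + b1*x (illustrative coefficients), the
# odds equal the exponential of the linear regression expression.
b0, b1, x = -1.0, 0.5, 3.0
p = logistic(b0 + b1 * x)
assert abs(p / (1.0 - p) - math.exp(b0 + b1 * x)) < 1e-9
```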

  • Multiple explanatory variables

  • If there are multiple explanatory variables,

  • then the above expression β0 + β1x can be revised to β0 + β1x1 + β2x2 + ... + βmxm.

  • Then when this is used in the equation relating

  • the logged odds of a success to the values of the predictors, the linear regression will

  • be a multiple regression with m explanators; the parameters βj for all j = 0, 1, 2, ..., m

  • are all estimated.

  • Model fitting

  • Estimation

  • Maximum likelihood estimation

  • The regression coefficients are usually estimated using maximum likelihood estimation. Unlike

  • linear regression with normally distributed residuals, it is not possible to find a closed-form

  • expression for the coefficient values that maximizes the likelihood function, so an iterative

  • process must be used instead, for example Newton's method. This process begins with

  • a tentative solution, revises it slightly to see if it can be improved, and repeats

  • this revision until improvement is minute, at which point the process is said to have

  • converged. In some instances the model may not reach

  • convergence. When a model does not converge this indicates that the coefficients are not

  • meaningful because the iterative process was unable to find appropriate solutions. A failure

  • to converge may occur for a number of reasons: having a large proportion of predictors to

  • cases, multicollinearity, sparseness, or complete separation.
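The iterative scheme described above can be sketched for a single predictor. This is a bare Newton's method with no safeguards, so it will fail on completely separated data; the data set at the bottom is made up for illustration:

```python
import math

def fit_logistic(xs, ys, iters=25, tol=1e-10):
    """Newton's method for one-predictor logistic regression.
    Returns (b0, b1) maximizing the likelihood; a minimal sketch,
    not a production fitter."""
    b0 = b1 = 0.0                       # tentative starting solution
    for _ in range(iters):
        # Gradient (g) and Hessian entries (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system H * step = g and revise the solution.
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) < tol and abs(s1) < tol:
            break                       # improvement is minute: converged
    return b0, b1

# Tiny illustrative data set (not separable, so the fit converges).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   1,   0,   1,   1,   0,   1]
b0, b1 = fit_logistic(xs, ys)
```

At convergence the gradient of the log-likelihood is (numerically) zero, which is how the fit can be checked.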

  • Having a large proportion of variables to cases results in an overly conservative Wald

  • statistic and can lead to nonconvergence. Multicollinearity refers to unacceptably high

  • correlations between predictors. As multicollinearity increases, coefficients remain unbiased but

  • standard errors increase and the likelihood of model convergence decreases. To detect

  • multicollinearity amongst the predictors, one can conduct a linear regression analysis

  • with the predictors of interest for the sole purpose of examining the tolerance statistic

  • used to assess whether multicollinearity is unacceptably high.
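For the simplest case of two predictors, the tolerance statistic reduces to 1 − r²; a sketch, with made-up predictor values:

```python
def tolerance(target, other):
    """Tolerance of one predictor given a single other predictor:
    1 - R^2 from regressing `target` on `other`. Values near 0
    signal unacceptably high multicollinearity."""
    n = len(target)
    mt = sum(target) / n
    mo = sum(other) / n
    cov = sum((t - mt) * (o - mo) for t, o in zip(target, other))
    vt = sum((t - mt) ** 2 for t in target)
    vo = sum((o - mo) ** 2 for o in other)
    r2 = cov * cov / (vt * vo)          # squared correlation
    return 1.0 - r2

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0]          # nearly collinear with x1
t = tolerance(x1, x2)                   # close to 0: a warning sign
```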

  • Sparseness in the data refers to having a large proportion of empty cells. Zero cell

  • counts are particularly problematic with categorical predictors. With continuous predictors, the

  • model can infer values for the zero cell counts, but this is not the case with categorical

  • predictors. The reason the model will not converge with zero cell counts for categorical

  • predictors is because the natural logarithm of zero is an undefined value, so final solutions

  • to the model cannot be reached. To remedy this problem, researchers may collapse categories

  • in a theoretically meaningful way or may consider adding a constant to all cells.

  • Another numerical problem that may lead to a lack of convergence is complete separation,

  • which refers to the instance in which the predictors perfectly predict the criterion:

  • all cases are accurately classified. In such instances, one should reexamine the data,

  • as there is likely some kind of error. Although not a precise number, as a general

  • rule of thumb, logistic regression models require a minimum of 10 events per explanatory

  • variable.

  • Minimum chi-squared estimator for grouped data

  • While individual data will have a dependent

  • variable with a value of zero or one for every observation, with grouped data one observation

  • is on a group of people who all share the same characteristics; in this case the researcher

  • observes the proportion of people in the group for whom the response variable falls into

  • one category or the other. If this proportion is neither zero nor one for any group, the

  • minimum chi-squared estimator involves using weighted least squares to estimate a linear

  • model in which the dependent variable is the logit of the proportion: that is, the log

  • of the ratio of the fraction in one group to the fraction in the other group.
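A sketch of the estimator for a single explanatory variable, using the usual weights n·p·(1 − p); the group data at the bottom are invented for illustration:

```python
import math

def min_chi_squared(xs, props, ns):
    """Minimum chi-squared estimation for grouped data: weighted least
    squares of the logit of each group's proportion on the group's x.
    Assumes every proportion is strictly between zero and one."""
    ws = [n * p * (1.0 - p) for p, n in zip(props, ns)]
    zs = [math.log(p / (1.0 - p)) for p in props]   # logit of proportion
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    mz = sum(w * z for w, z in zip(ws, zs)) / sw
    b1 = (sum(w * (x - mx) * (z - mz) for w, x, z in zip(ws, xs, zs))
          / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    b0 = mz - b1 * mx
    return b0, b1

# Groups at x = 1..4 with observed success proportions and group sizes.
b0, b1 = min_chi_squared([1, 2, 3, 4], [0.2, 0.35, 0.6, 0.8], [50, 40, 60, 50])
```

When the observed proportions follow a logistic curve exactly, the weighted least squares fit recovers the underlying coefficients exactly.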

  • Evaluating goodness of fit

  • Goodness of fit in linear regression models

  • is generally measured using the R2. Since this has no direct analog in logistic regression,

  • various methods including the following can be used instead.

  • Deviance and likelihood ratio tests

  • In linear regression analysis, one is concerned

  • with partitioning variance via the sum of squares calculations: variance in the criterion

  • is essentially divided into variance accounted for by the predictors and residual variance.

  • In logistic regression analysis, deviance is used in lieu of sum of squares calculations.

  • Deviance is analogous to the sum of squares calculations in linear regression and is a

  • measure of the lack of fit to the data in a logistic regression model. Deviance is calculated

  • by comparing a given model with the saturated model – a model with a theoretically perfect

  • fit. This computation is called the likelihood-ratio test:

  • D = −2 ln(likelihood of the fitted model / likelihood of the saturated model)

  • In the above equation D represents the deviance and ln represents the natural logarithm. The

  • likelihood ratio is smaller than one, so its natural logarithm is negative; multiplying

  • by negative two produces a positive value with an approximate chi-squared

  • distribution. Smaller values indicate better fit as the fitted model deviates less from

  • the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square

  • values indicate very little unexplained variance and thus, good model fit. Conversely, a significant

  • chi-square value indicates that a significant amount of the variance is unexplained.

  • Two measures of deviance are particularly important in logistic regression: null deviance

  • and model deviance. The null deviance represents the difference between a model with only the

  • intercept and the saturated model. And, the model deviance represents the difference between

  • a model with at least one predictor and the saturated model. In this respect, the null

  • model provides a baseline upon which to compare predictor models. Given that deviance is a

  • measure of the difference between a given model and the saturated model, smaller values

  • indicate better fit. Therefore, to assess the contribution of a predictor or set of

  • predictors, one can subtract the model deviance from the null deviance and assess the difference

  • on a chi-square distribution with degrees of freedom equal to the difference in the number

  • of parameters estimated. Let D_null denote the null deviance and D_model the model deviance:

  • D_null = −2 ln(likelihood of the null model / likelihood of the saturated model)

  • D_model = −2 ln(likelihood of the fitted model / likelihood of the saturated model)

  • Then the test statistic is the difference:

  • D_null − D_model

  • If the model deviance is significantly smaller than the null deviance then one can conclude

  • that the predictor or set of predictors significantly improved model fit. This is analogous to the

  • F-test used in linear regression analysis to assess the significance of prediction.
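A minimal sketch of the deviance comparison, using made-up outcomes and fitted probabilities; for 0/1 data the saturated model's log-likelihood is 0, so each deviance is just −2 times the model's own log-likelihood:

```python
import math

def log_likelihood(ps, ys):
    # Bernoulli log-likelihood of predicted probabilities for 0/1 outcomes.
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

def deviance(ps, ys):
    # For 0/1 data the saturated model's log-likelihood is 0,
    # so D = -2 * log-likelihood of the model in question.
    return -2.0 * log_likelihood(ps, ys)

ys = [0, 0, 1, 0, 1, 1, 1, 1]

# Null model: every fitted probability is the overall success rate.
p_null = sum(ys) / len(ys)
d_null = deviance([p_null] * len(ys), ys)

# Hypothetical fitted probabilities from a one-predictor model.
fitted = [0.1, 0.2, 0.45, 0.4, 0.6, 0.7, 0.8, 0.9]
d_model = deviance(fitted, ys)

# Assessed on a chi-square distribution with df = difference in the
# number of parameters estimated (here 1).
lr_statistic = d_null - d_model
```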

  • Pseudo-R2s

  • In linear regression the squared multiple

  • correlation, R2 is used to assess goodness of fit as it represents the proportion of

  • variance in the criterion that is explained by the predictors. In logistic regression

  • analysis, there is no agreed upon analogous measure, but there are several competing measures

  • each with limitations. Three of the most commonly used indices are examined here, beginning

  • with the likelihood ratio R2, R2L:

  • R2L = (D_null − D_model) / D_null

  • This is the most analogous index to the squared multiple correlation in linear regression.

  • It represents the proportional reduction in the deviance wherein the deviance is treated

  • as a measure of variation analogous but not identical to the variance in linear regression

  • analysis. One limitation of the likelihood ratio R2 is that it is not monotonically related

  • to the odds ratio, meaning that it does not necessarily increase as the odds ratio increases

  • and does not necessarily decrease as the odds ratio decreases.

  • The Cox and Snell R2 is an alternative index of goodness of fit related to the R2 value

  • from linear regression. The Cox and Snell index is problematic as its maximum value

  • is .75, when the variance is at its maximum. The Nagelkerke R2 provides a correction to

  • the Cox and Snell R2 so that the maximum value is equal to one. Nevertheless, the Cox and

  • Snell and likelihood ratio R2s show greater agreement with each other than either does

  • with the Nagelkerke R2. Of course, this might not be the case for values exceeding .75 as

  • the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred

  • to the alternatives as it is most analogous to R2 in linear regression, is independent

  • of the base rate and varies between 0 and 1.
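Under the standard formulas (with the saturated log-likelihood taken as 0), the three indices can be computed from the two deviances and the sample size; the deviance values below are illustrative:

```python
import math

def pseudo_r2s(d_null, d_model, n):
    """Three pseudo-R^2 indices from the null deviance, the model
    deviance, and the sample size n; a sketch of the standard formulas."""
    r2_l = (d_null - d_model) / d_null              # likelihood ratio R^2
    r2_cs = 1.0 - math.exp((d_model - d_null) / n)  # Cox and Snell
    r2_n = r2_cs / (1.0 - math.exp(-d_null / n))    # Nagelkerke correction
    return r2_l, r2_cs, r2_n

# Illustrative deviances for a sample of n = 8 observations.
r2_l, r2_cs, r2_n = pseudo_r2s(d_null=10.59, d_model=5.67, n=8)
```

The Nagelkerke value is the Cox and Snell value rescaled so that its maximum is one, which is why it is always at least as large.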

  • A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices

  • of fit are referred to as pseudo R2 is because they do not represent the proportionate reduction

  • in error as the R2 in linear regression does. Linear regression assumes homoscedasticity,

  • that the error variance is the same for all values of the criterion. Logistic regression

  • will always be heteroscedastic: the error variances differ for each value of the predicted

  • score. For each value of the predicted score there would be a different value of the proportionate

  • reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction

  • in error in a universal sense in logistic regression.

  • Hosmer-Lemeshow test

  • The Hosmer-Lemeshow test uses a test statistic

  • that asymptotically follows a chi-squared distribution to assess whether or not the observed event

  • rates match expected event rates in subgroups of the model population.
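A rough sketch of the statistic; real implementations typically use 10 groups ("deciles of risk"), and 4 are used here only to keep the example small:

```python
def hosmer_lemeshow(ps, ys, groups=4):
    """Hosmer-Lemeshow statistic: observed vs. expected event counts
    in groups of roughly equal size, ordered by predicted probability.
    Compared against a chi-squared distribution with groups - 2 df."""
    pairs = sorted(zip(ps, ys))
    k = len(pairs) // groups
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * k:(g + 1) * k] if g < groups - 1 else pairs[g * k:]
        n = len(chunk)
        observed = sum(y for _, y in chunk)    # observed event count
        expected = sum(p for p, _ in chunk)    # expected event count
        pbar = expected / n
        stat += (observed - expected) ** 2 / (n * pbar * (1.0 - pbar))
    return stat

# Made-up predicted probabilities and observed outcomes.
stat = hosmer_lemeshow([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
                       [0,   0,   0,   1,   0,   1,   1,   1])
```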

  • Evaluating binary classification performance

  • If the estimated probabilities are to be used

  • to classify each observation of independent variable values as predicting the category

  • that the dependent variable is found in, the various methods below for judging the model's

  • suitability in out-of-sample forecasting can also be used on the data that were used for

  • estimation: accuracy, precision, recall, specificity and negative predictive value.
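As an illustration, with made-up predictions and outcomes, all five measures can be computed from the four confusion counts:

```python
def confusion_counts(predicted, actual):
    # True positives, true negatives, false positives, false negatives.
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, tn, fp, fn

def metrics(predicted, actual):
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),   # positive predictive value
        "recall":      tp / (tp + fn),   # sensitivity
        "specificity": tn / (tn + fp),
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Made-up 0/1 predictions against observed outcomes.
m = metrics(predicted=[1, 0, 1, 1, 0, 0, 1, 0],
            actual=[1, 0, 0, 1, 0, 1, 1, 0])
```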

  • In each of these evaluative methods, an aspect of the model's effectiveness in assigning

  • instances to the correct categories is measured.

  • Coefficients

  • After fitting the model, it is likely that researchers will want to examine the contribution

  • of individual predictors. To do so, they will want to examine the regression coefficients.

  • In linear regression, the regression coefficients represent the change in the criterion for

  • each unit change in the predictor. In logistic regression, however, the regression coefficients

  • represent the change in the logit for each unit change in the predictor. Given that the

  • logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential

  • function of the regression coefficient, the odds ratio. In linear regression, the significance

  • of a regression coefficient is assessed by computing a t-test. In logistic regression,

  • there are several different tests designed to assess the significance of an individual

  • predictor, most notably the likelihood ratio test and the Wald statistic.
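The odds-ratio interpretation mentioned above can be checked numerically; the coefficient value here is hypothetical:

```python
import math

# Hypothetical fitted coefficient: each one-unit increase in the
# predictor multiplies the odds by exp(b1).
b1 = 0.405
odds_ratio = math.exp(b1)    # roughly 1.5

def odds(b0, b1, x):
    # Odds implied by the linear predictor b0 + b1*x.
    return math.exp(b0 + b1 * x)

# The ratio of odds at x + 1 to odds at x is the odds ratio,
# whatever the intercept and whatever the value of x.
assert abs(odds(-2.0, b1, 4.0) / odds(-2.0, b1, 3.0) - odds_ratio) < 1e-9
```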

  • Likelihood ratio test

  • The likelihood-ratio test discussed above

  • to assess model fit is also the recommended procedure to assess the contribution of individual

  • "predictors" to a given model. In the case of a single predictor model, one simply compares

  • the deviance of the predictor model with that of the null model on a chi-square distribution

  • with a single degree of freedom. If the predictor model has a significantly smaller deviance,

  • then one can conclude that there is a significant association between the "predictor" and the

  • outcome. Although some common statistical packages do provide likelihood ratio test

  • statistics, without this computationally intensive test it would be more difficult to assess

  • the contribution of individual predictors in the multiple logistic regression case.

  • To assess the contribution of individual predictors one can enter the predictors hierarchically,

  • comparing each new model with the previous to determine the contribution of each predictor.
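The hierarchical comparison of nested models can be sketched with hypothetical log-likelihoods; the 3.84 cutoff is the chi-squared critical value for df = 1 at alpha = .05:

```python
def lr_test_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic for nested models, compared against a
    chi-squared distribution with df equal to the number of parameters
    added.  A sketch working from log-likelihoods directly."""
    return -2.0 * (loglik_reduced - loglik_full)

# Hypothetical log-likelihoods as predictors are entered hierarchically.
ll_null      = -50.0   # intercept only
ll_one_pred  = -42.0   # after adding the first predictor
ll_two_preds = -41.5   # after adding a second predictor

step1 = lr_test_statistic(ll_null, ll_one_pred)       # 16.0, df = 1
step2 = lr_test_statistic(ll_one_pred, ll_two_preds)  # 1.0, df = 1

# Step 1 exceeds the 3.84 critical value, so the first predictor
# significantly improves fit; step 2 does not, so the second adds little.
```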

  • (There is considerable debate among statisticians regarding the appropriateness of so-called

  • "stepwise" procedures. They do not preserve the nominal statistical properties and can

  • be very misleading.[1])

  • Wald statistic

  • Alternatively, when assessing the contribution of individual predictors in a given model,

  • one may examine the significance of the Wald statistic. The Wald statistic, analogous to

  • the t-test in linear regression, is used to assess the significance of coefficients. The

  • Wald statistic is the ratio of the square of the regression coefficient to the square