In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It is used to predict the outcome of a categorical dependent variable, such as a binary response, based on one or more predictor variables. That is, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory variables, using a logistic function. Frequently "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary (that is, the number of available categories is two), while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually continuous, by using probability scores as the predicted values of the dependent variable. As such it treats the same set of problems as does probit regression using similar techniques.

Fields and examples of applications

Logistic regression was put forth in the 1940s as an alternative to Fisher's 1936 classification method, linear discriminant analysis. It is used extensively in numerous disciplines, including the medical and social science fields. For example, the Trauma and Injury Severity Score, which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Logistic regression might be used to predict whether a patient has a given disease, based on observed characteristics of the patient.
Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, gender, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product. It is also used in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription. In economics it can be used to predict the likelihood of a person's choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.

Basics

Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types. Multinomial logistic regression deals with situations where the outcome can have three or more possible types. In binary logistic regression, the outcome is usually coded as "0" or "1", as this leads to the most straightforward interpretation. If a particular observed outcome for the dependent variable is the noteworthy possible outcome (a "success" or "case"), it is usually coded as "1" and the contrary outcome (a "failure" or "noncase") as "0". Logistic regression is used to predict the odds of being a case based on the values of the independent variables. The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase.

Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes of the dependent variable rather than continuous outcomes.
Given this difference, logistic regression takes the natural logarithm of the odds of the dependent variable being a case (the logit) to create a continuous criterion as a transformed version of the dependent variable. Thus the logit transformation is referred to as the link function in logistic regression: although the dependent variable in logistic regression is binomial, the logit is the continuous criterion upon which linear regression is conducted. The logit of success is then modeled as a linear function of the predictors. The predicted value of the logit is converted back into predicted odds via the inverse of the natural logarithm, namely the exponential function. Therefore, although the observed dependent variable in logistic regression is a zero-or-one variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a success. In some applications the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a case; this categorical prediction can be based on the computed odds of a success, with predicted odds above some chosen cut-off value being translated into a prediction of a success.

Logistic function, odds ratio, and logit

An explanation of logistic regression begins with an explanation of the logistic function, which always takes on values between zero and one:

$$F(t) = \frac{e^{t}}{e^{t} + 1} = \frac{1}{1 + e^{-t}}$$

Viewing $t$ as a linear function of an explanatory variable $x$, that is $t = \beta_0 + \beta_1 x$, the logistic function can be written as:

$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

This will be interpreted as the probability of the dependent variable equalling a "success" or "case" rather than a failure or non-case. We also define the inverse of the logistic function, the logit:

$$g(x) = \ln \frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x$$

and equivalently:

$$\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}$$

A graph of the logistic function is shown in Figure 1. The input is the value of $\beta_0 + \beta_1 x$ and the output is $F(x)$.
The logistic function is useful because it can take an input with any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1 and hence is interpretable as a probability. In the above equations, $g(x)$ refers to the logit function of some given linear combination of the predictors, $\ln$ denotes the natural logarithm, $F(x)$ is the probability that the dependent variable equals a case, $\beta_0$ is the intercept from the linear regression equation, $\beta_1 x$ is the regression coefficient multiplied by some value of the predictor, and base $e$ denotes the exponential function.

The formula for $F(x)$ illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability ranges between 0 and 1. The equation for $g(x)$ illustrates that the logit is equivalent to the linear regression expression. Likewise, the next equation illustrates that the odds of the dependent variable equaling a case are equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative infinity and positive infinity, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.

Multiple explanatory variables

If there are multiple explanatory variables, then the above expression $\beta_0 + \beta_1 x$ can be revised to

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$$

Then when this is used in the equation relating the logged odds of a success to the values of the predictors, the linear regression will be a multiple regression with $m$ explanators; the parameters $\beta_j$ for all $j = 0, 1, 2, \ldots, m$ are all estimated.
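As a rough illustration of the relationships described in this section, the following Python sketch evaluates the logistic function, its inverse (the logit), and the odds for a linear combination of several predictors. The function names and the coefficient values are made up for the example; they are not taken from any fitted model.

```python
import math


def logistic(t):
    """Logistic function: maps any real t to a value strictly between 0 and 1."""
    # Two algebraically equal forms are used so that exp() never overflows.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logit(p):
    """Inverse of the logistic function: the natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def linear_predictor(betas, xs):
    """Intercept plus the sum of coefficient * predictor value.

    betas = [b0, b1, ..., bm], xs = [x1, ..., xm].
    """
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))


# Hypothetical coefficients (intercept, then two predictors) and one observation.
betas = [-1.5, 0.8, 0.3]
xs = [2.0, 1.0]

t = linear_predictor(betas, xs)   # value of the linear expression
p = logistic(t)                   # probability of a "case"
odds = p / (1.0 - p)              # equals exp(t)

print(f"linear predictor = {t:.4f}, probability = {p:.4f}, odds = {odds:.4f}")
```

The logit of the resulting probability recovers the linear predictor, and the odds equal its exponential, which is the link-function relationship the text describes.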
Model fitting

Estimation

Maximum likelihood estimation

The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead, for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until improvement is minute, at which point the process is said to have converged.

In some instances the model may not reach convergence. When a model does not converge, this indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.

Having a large ratio of variables to cases results in an overly conservative Wald statistic and can lead to nonconvergence.

Multicollinearity refers to unacceptably high correlations between predictors. As multicollinearity increases, coefficients remain unbiased but standard errors increase and the likelihood of model convergence decreases. To detect multicollinearity amongst the predictors, one can conduct a linear regression analysis with the predictors of interest for the sole purpose of examining the tolerance statistic used to assess whether multicollinearity is unacceptably high.

Sparseness in the data refers to having a large proportion of empty cells (cells with zero counts). Zero cell counts are particularly problematic with categorical predictors. With continuous predictors, the model can infer values for the zero cell counts, but this is not the case with categorical predictors.
The model will not converge with zero cell counts for categorical predictors because the natural logarithm of zero is undefined, so final solutions to the model cannot be reached. To remedy this problem, researchers may collapse categories in a theoretically meaningful way or may consider adding a constant to all cells.

Another numerical problem that may lead to a lack of convergence is complete separation, which refers to the instance in which the predictors perfectly predict the criterion, so that all cases are accurately classified. In such instances, one should reexamine the data, as there is likely some kind of error.

As a general rule of thumb, though not a precise requirement, logistic regression models need a minimum of 10 events per explanatory variable.

Minimum chi-squared estimator for grouped data

While individual data will have a dependent variable with a value of zero or one for every observation, with grouped data one observation is on a group of people who all share the same characteristics; in this case the researcher observes the proportion of people in the group for whom the response variable falls into one category or the other. If this proportion is neither zero nor one for any group, the minimum chi-squared estimator involves using weighted least squares to estimate a linear model in which the dependent variable is the logit of the proportion: that is, the log of the ratio of the fraction in one category to the fraction in the other category.

Evaluating goodness of fit

Goodness of fit in linear regression models is generally measured using $R^2$. Since this has no direct analog in logistic regression, various methods including the following can be used instead.

Deviance and likelihood ratio tests

In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations: variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance.
In logistic regression analysis, deviance is used in lieu of sum of squares calculations. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. Deviance is calculated by comparing a given model with the saturated model, a model with a theoretically perfect fit. This computation is called the likelihood-ratio test:

$$D = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}$$

In the above equation $D$ represents the deviance and $\ln$ represents the natural logarithm. The likelihood ratio is at most one, so its logarithm is zero or negative; multiplying it by negative two produces a non-negative value with an approximate chi-squared distribution. Smaller values indicate better fit, as the fitted model deviates less from the saturated model. When assessed upon a chi-squared distribution, nonsignificant chi-squared values indicate very little unexplained variance and thus good model fit. Conversely, a significant chi-squared value indicates that a significant amount of the variance is unexplained.

Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model. In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Therefore, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters estimated.
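The comparison of null and model deviance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ten observations are invented, the model has a single predictor, the coefficients are found with a bare Newton's method (one possible iterative maximum likelihood scheme, with no safeguards against non-convergence), and the data are ungrouped zero-or-one outcomes, for which the saturated model has likelihood one so that the deviance reduces to minus two times the log-likelihood. The chi-squared tail probability uses a closed form that is valid only for one degree of freedom.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the model logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def score(b0, b1, xs, ys):
    """Gradient of the log-likelihood with respect to (b0, b1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        r = y - sigmoid(b0 + b1 * x)
        g0 += r
        g1 += r * x
    return g0, g1


def fit_newton(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood estimates of (b0, b1) by Newton's method."""
    b0 = b1 = 0.0  # tentative starting solution
    for _ in range(max_iter):
        g0, g1 = score(b0, b1, xs, ys)
        # Entries of the 2x2 information matrix X^T W X with W = p(1 - p).
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1
    raise RuntimeError("Newton's method did not converge")


# Invented data: one continuous predictor, binary outcome, not perfectly separated.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_newton(xs, ys)

# Null model: intercept only, so every fitted probability is the observed event rate.
n = len(ys)
events = sum(ys)
y_bar = events / n
ll_null = events * math.log(y_bar) + (n - events) * math.log(1.0 - y_bar)

null_deviance = -2.0 * ll_null
model_deviance = -2.0 * log_likelihood(b0, b1, xs, ys)

# Likelihood-ratio statistic, referred to a chi-squared distribution with 1 df.
lr_statistic = null_deviance - model_deviance
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))  # 1-df upper tail only

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"null deviance = {null_deviance:.4f}, model deviance = {model_deviance:.4f}")
print(f"LR statistic = {lr_statistic:.4f}, p-value = {p_value:.4f}")
```

With more than one added predictor the degrees of freedom exceed one, and a general chi-squared tail function (for example from a statistics library) would be needed in place of the `erfc` shortcut.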
Let

$$D_{\text{null}} = -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the saturated model}}, \qquad D_{\text{model}} = -2 \ln \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}$$

Then

$$D_{\text{null}} - D_{\text{model}} = -2 \ln \frac{\text{likelihood of the null model}}{\text{likelihood of the fitted model}}$$

If the model deviance is significantly smaller than the null deviance then one can conclude that the predictor or set of predictors significantly improved model fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.

Pseudo-R2s

In linear regression the squared multiple correlation, $R^2$, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors. In logistic regression analysis, there is no agreed upon analogous measure, but there are several competing measures, each with limitations. Three of the most commonly used indices are examined on this page, beginning with the likelihood ratio $R^2$, $R^2_L$:

$$R^2_L = \frac{D_{\text{null}} - D_{\text{model}}}{D_{\text{null}}}$$

This is the most analogous index to the squared multiple correlation in linear regression. It represents the proportional reduction in the deviance, wherein the deviance is treated as a measure of variation analogous but not identical to the variance in linear regression analysis. One limitation of the likelihood ratio $R^2$ is that it is not monotonically related to the odds ratio, meaning that it does not necessarily increase as the odds ratio increases and does not necessarily decrease as the odds ratio decreases.

The Cox and Snell $R^2$ is an alternative index of goodness of fit related to the $R^2$ value from linear regression. The Cox and Snell index is problematic as its maximum value is .75, reached when the variance of the outcome is at its maximum. The Nagelkerke $R^2$ provides a correction to the Cox and Snell $R^2$ so that the maximum value is equal to one. Nevertheless, the Cox and Snell and likelihood ratio $R^2$s show greater agreement with each other than either does with the Nagelkerke $R^2$. Of course, this might not be the case for values exceeding .75, as the Cox and Snell index is capped at this value.
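To make the three indices concrete, here is a small Python sketch that computes them from a null deviance, a model deviance, and the sample size. The formulas for the Cox and Snell and Nagelkerke indices are not spelled out in the text; the ones below are the usual textbook definitions rewritten in terms of deviances, and they assume ungrouped zero-or-one outcomes so that each deviance equals minus two times the corresponding log-likelihood. The function names and the model deviance of 9.0 are invented for the example.

```python
import math


def likelihood_ratio_r2(null_deviance, model_deviance):
    """Proportional reduction in deviance relative to the null model."""
    return (null_deviance - model_deviance) / null_deviance


def cox_snell_r2(null_deviance, model_deviance, n):
    """1 - (L_null / L_model)^(2/n), expressed through the deviances."""
    return 1.0 - math.exp((model_deviance - null_deviance) / n)


def cox_snell_max(null_deviance, n):
    """Upper bound of the Cox and Snell index, reached by a perfectly fitting model."""
    return 1.0 - math.exp(-null_deviance / n)


def nagelkerke_r2(null_deviance, model_deviance, n):
    """Cox and Snell index rescaled so that its maximum is one."""
    return cox_snell_r2(null_deviance, model_deviance, n) / cox_snell_max(null_deviance, n)


# Hypothetical example: 10 observations, half of them events, so the null
# deviance is -2 * 10 * ln(0.5); the model deviance is an invented value.
n = 10
null_deviance = 20.0 * math.log(2.0)
model_deviance = 9.0

print(f"likelihood ratio R2 = {likelihood_ratio_r2(null_deviance, model_deviance):.4f}")
print(f"Cox and Snell R2    = {cox_snell_r2(null_deviance, model_deviance, n):.4f}")
print(f"Cox and Snell max   = {cox_snell_max(null_deviance, n):.4f}")
print(f"Nagelkerke R2       = {nagelkerke_r2(null_deviance, model_deviance, n):.4f}")
```

With equally frequent outcomes, as in this example, the upper bound of the Cox and Snell index works out to exactly .75, which is the cap the text refers to; with unequal outcome frequencies the bound is lower.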
The likelihood ratio $R^2$ is often preferred to the alternatives as it is most analogous to $R^2$ in linear regression, is independent of the base rate, and varies between 0 and 1.

A word of caution is in order when interpreting pseudo-$R^2$ statistics. These indices of fit are referred to as pseudo-$R^2$ because they do not represent the proportionate reduction in error as the $R^2$ in linear regression does. Linear regression assumes homoscedasticity, that is, that the error variance is the same for all values of the criterion. Logistic regression will always be heteroscedastic: the error variances differ for each value of the predicted score. For each value of the predicted score there would be a different value of the proportionate reduction in error. Therefore, it is inappropriate to think of $R^2$ as a proportionate reduction in error in a universal sense in logistic regression.

Hosmer–Lemeshow test

The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a chi-squared distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population.

Evaluating binary classification performance

If the estimated probabilities are to be used to classify each observation into one of the categories of the dependent variable, the methods used for judging the model's suitability in out-of-sample forecasting can also be applied to the data that were used for estimation: accuracy, precision, recall, specificity and negative predictive value. Each of these evaluative methods measures an aspect of the model's effectiveness in assigning instances to the correct categories.

Coefficients

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients.
In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor. In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus instead on the exponential function of the regression coefficient, the odds ratio. In linear regression, the significance of a regression coefficient is assessed by computing a t-test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.

Likelihood ratio test

The likelihood-ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual predictors to a given model. In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-squared distribution with a single degree of freedom. If the predictor model has a significantly smaller deviance, then one can conclude that there is a significant association between the predictor and the outcome. Although some common statistical packages do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case. To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous one to determine the contribution of each predictor. There is considerable debate among statisticians regarding the appropriateness of so-called "stepwise" procedures.
They do not preserve the nominal statistical properties and can be very misleading.[1]

Wald statistic

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square