
Goodness of Fit Measures in Logistic Regression

One question that obviously arises when we are looking at regression models is that of the overall fit of the model. As in ordinary linear regression, we can compute measures of model fit for logistic regression models.

There are, in fact, a number of different measures of goodness of fit for logistic regression models. The first and most straightforward measure is simply to look at the extent to which the model accurately predicts the dependent variable: in other words, how accurate the model is at predicting whether or not a pupil in this sample is likely to report having literature in their home. This is calculated by comparing each student's predicted group (as either possessing or not possessing literature), based on the two independent variables in our model (gender and mother's education), with their actual group membership as given by the data (that is, whether students have actually said they possess or don't possess literature in the home). SPSS gives us this output in the box labelled 'Classification Table'. In this example, the table is as follows:

Observed                                 Predicted: Possessions literature Q17h     Percentage Correct
                                         Tick           No Tick
Step 1   Possessions literature Q17h
           Tick                          1707           3212                        34.7
           No Tick                       1211           6663                        84.6
         Overall Percentage                                                         65.4

The overall percentage is given as 65.4%. This means that 65.4% of students have been accurately classified as either possessing or not possessing literature on the basis of our two-variable model. There is no absolute cut-off point which tells us whether or not this represents good fit, but obviously 100% would represent perfect fit, in that all students would be correctly classified on the basis of our model. This extreme situation is highly unlikely to occur in practice. Rather, the key question is the extent to which our model is better able to predict group membership (do students fall into the possessing-literature or the not-possessing-literature group?) than a model without any of our independent variables. An indication of this is given to us by the initial classification table at the start of the SPSS output:

Observed                                 Predicted: Possessions literature Q17h     Percentage Correct
                                         Tick           No Tick
Step 0   Possessions literature Q17h
           Tick                          0              4919                        .0
           No Tick                       0              7874                        100.0
         Overall Percentage                                                         61.5

In this null model, all cases have been assigned to the modal category, 'No Tick', which means not possessing literature in the home. As can be seen, this gives us an overall accuracy of 61.5%, meaning that 61.5% of students would be correctly classified if we merely assumed that they belonged to the largest group, not possessing literature in the home. To say that our model is worthwhile, we need to do better than this. If we look at the previous table, we can see that our classification accuracy was 65.4%. Adding mother's education and gender has thus increased the likelihood of a correct prediction of possession of literature in the home, but not by much. This would suggest that this is not a particularly accurate model. We can also see in that table that prediction of not possessing literature is more accurate than prediction of possessing literature. It will usually be the case that prediction is more accurate for the larger group than for the smaller one.
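To make this comparison concrete, the following short Python sketch (purely illustrative, and not part of the SPSS output) reproduces the two overall percentages from the counts in the classification tables above:

```python
# Counts from the Step 1 classification table above
tick_pred_tick, tick_pred_no = 1707, 3212    # observed 'Tick' row
no_pred_tick, no_pred_no = 1211, 6663        # observed 'No Tick' row

total = tick_pred_tick + tick_pred_no + no_pred_tick + no_pred_no   # 12793 pupils

# Null (Step 0) model: every pupil is predicted to be in the largest group, 'No Tick'
null_accuracy = (no_pred_tick + no_pred_no) / total     # 7874 / 12793, about 61.5%

# Two-variable model: correct predictions are the diagonal cells of the table
model_accuracy = (tick_pred_tick + no_pred_no) / total  # 8370 / 12793, about 65.4%

print(round(100 * null_accuracy, 1), round(100 * model_accuracy, 1))
```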

While intuitively easy to understand, this measure of fit is also somewhat limited. It does not give us any measure of significance and is not easily comparable to measures of fit in linear regression. Because of these limitations, a number of other measures of model fit have been developed for logistic regression.

Hosmer-Lemeshow Test

One of these measures is the Hosmer-Lemeshow test of goodness of fit. This is a Chi Square-based test that indicates how well the model fits the data, in the sense of how closely the model's predictions match what we actually observe (much as in log-linear modelling). If the chi-square goodness-of-fit statistic is not significant, then the model has adequate fit. By the same token, if the test is significant, the model does not adequately fit the data. Hosmer and Lemeshow's (H-L) goodness-of-fit test divides subjects into deciles based on their predicted probabilities, then computes a chi-square statistic from the observed and expected frequencies in each decile. A probability (p) value is then computed from the chi-square distribution to test the fit of the logistic model. If the p value for the H-L test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show nonsignificance on the goodness-of-fit test, indicating model prediction that is not significantly different from observed values.
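As a rough illustration of these mechanics (a simplified sketch of the idea, not the exact SPSS implementation; ties among predicted probabilities and the handling of very small groups are glossed over), the test could be computed in Python along the following lines, given observed 0/1 outcomes and predicted probabilities:

```python
from scipy import stats  # for the chi-square p value

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow test: sort cases by predicted probability, split them
    into (roughly) equal-sized groups, then compare observed and expected
    event counts in each group with a chi-square statistic."""
    data = sorted(zip(p, y))                       # order cases by predicted probability
    size = len(data) / groups
    chi2 = 0.0
    for g in range(groups):
        chunk = data[int(g * size):int((g + 1) * size)]
        n_g = len(chunk)
        observed = sum(yi for _, yi in chunk)      # observed events in this group
        expected = sum(pi for pi, _ in chunk)      # expected events = sum of probabilities
        # contributions from both the event and the non-event cell of the group
        chi2 += (observed - expected) ** 2 / expected
        chi2 += ((n_g - observed) - (n_g - expected)) ** 2 / (n_g - expected)
    df = groups - 2                                # degrees of freedom: groups minus 2
    return chi2, df, stats.chi2.sf(chi2, df)
```

(Note that for this example SPSS reports 6 degrees of freedom rather than 8, i.e. it formed 8 groups rather than 10, presumably because two categorical predictors yield only a limited number of distinct predicted probabilities.)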

A disadvantage of this goodness-of-fit measure is that it is a significance test, with all the limitations this entails. Like other significance tests, it only tells us whether the model fits or not, and does not tell us anything about the extent of the fit. Similarly, like other significance tests, it is strongly influenced by the sample size (sample size and effect size both determine significance), and in large samples, such as the PISA dataset we are using here, a very small difference will lead to significance. As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant. Small sample sizes are also problematic, however: as this is a Chi Square-based test, we cannot have too many groups (more than 10% of them) with expected frequencies of less than five.

The Hosmer-Lemeshow test of goodness of fit is not part of the SPSS logistic regression output by default. To get this output, we need to go into 'Options' and tick the Hosmer-Lemeshow goodness-of-fit box. In our example, this gives us the following output:

Step   Chi-square   df   Sig.
1      142.032      6    .000

Our test is therefore significant, suggesting the model does not fit the data. However, as we have a sample size of over 13,000, even very small divergences of the model from the data would be flagged up and produce significance. With samples of this size it is hard to find models that are both parsimonious (i.e. that use the minimum number of independent variables to explain the dependent variable) and fit the data, so other fit indices might be more appropriate.

In ordinary linear regression, our primary measure of model fit was R², an indicator of the percentage of variance in the dependent variable explained by the model. It would be useful to have a similar measure for logistic regression. However, R² is only appropriate to linear regression, with its continuous dependent variable. To get around this problem, a number of statisticians have developed so-called 'pseudo R²' measures that aim to mimic R² for logistic regression models. In contrast to the actual R², these are approximations, and there are a number of different pseudo R² measures, each taking a different conceptual approach to what R² means.

The most commonly used measures interpret R² as representing the improvement of the model we are using (in our case, the two-variable model) over the null model with no independent variables (also called predictors). Other approaches are based on viewing R² as explained variance.

In the first category are the two pseudo R² measures reported by SPSS, Cox and Snell's and Nagelkerke's. Cox and Snell's R² is based on calculating the proportion of unexplained variation that is reduced by adding variables to the model.

The formula for Cox and Snell's pseudo R² is:
$$R^2 = 1 - \left(\frac{L_{null}}{L_{k}}\right)^{\frac{2}{n}}$$

where $L_{null}$ is the likelihood of the empty (null) model, $L_{k}$ is the likelihood of the model with the independent variables, and $n$ is the sample size. SPSS reports these likelihoods as $-2LL$ values (that is, $-2$ times the log-likelihood), in which case the same quantity can be written as $R^2 = 1 - \exp\left[\left(-2LL_{k} - (-2LL_{null})\right)/n\right]$.

There is a major problem with Cox and Snell's pseudo R², however, which is that its maximum can be (and usually is) less than 1.0, making it difficult to interpret. That maximum is $1 - (L_{null})^{\frac{2}{n}}$. That is why Nagelkerke developed a modified version of Cox and Snell's measure that ranges from 0 to 1. To achieve this, Nagelkerke's R² divides Cox and Snell's R² by its maximum, giving us the formula:
$$R^2 = \frac{1 - \left(\frac{L_{null}}{L_{k}}\right)^{\frac{2}{n}}}{1 - \left(L_{null}\right)^{\frac{2}{n}}}$$

Nagelkerke's pseudo R² will therefore normally be higher than the Cox and Snell measure. Both of these pseudo R² measures will tend to be lower than traditional ordinary least squares R² values.
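To make these two formulas concrete, here is a minimal Python sketch (an illustration, not something SPSS produces) that computes both measures from the $-2LL$ values of the null and fitted models; the example figures at the bottom are invented purely for demonstration:

```python
import math

def cox_snell_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell's pseudo R-squared, 1 - (L_null / L_k)^(2/n),
    computed from the -2 log-likelihood values that SPSS reports."""
    return 1 - math.exp((neg2ll_model - neg2ll_null) / n)

def nagelkerke_r2(neg2ll_null, neg2ll_model, n):
    """Nagelkerke's pseudo R-squared: Cox and Snell's measure divided
    by its maximum possible value, 1 - (L_null)^(2/n)."""
    max_cox_snell = 1 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(neg2ll_null, neg2ll_model, n) / max_cox_snell

# Invented example: null-model -2LL of 1200, fitted-model -2LL of 1100,
# and 1000 cases.
print(cox_snell_r2(1200, 1100, 1000))   # about 0.095
print(nagelkerke_r2(1200, 1100, 1000))  # about 0.136
```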


Efron's Pseudo R-squared

An example of an approach that views R² as explained variance is Efron's pseudo R². This measure takes the model residuals (the differences between the observed 0/1 outcomes and the predicted probabilities), squares and sums them, divides by the total variability in the dependent variable, and subtracts the result from 1:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{\pi}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

This R-squared is equal to the squared correlation between the predicted values and actual values. 
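As a small illustration (again not part of the SPSS output), Efron's measure could be computed from a vector of observed 0/1 outcomes and a vector of predicted probabilities as follows; the example data are invented:

```python
def efron_r2(y, p):
    """Efron's pseudo R-squared: 1 minus the sum of squared residuals
    (observed 0/1 outcome minus predicted probability) divided by the
    total sum of squares around the mean of the outcome."""
    y_bar = sum(y) / len(y)
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# Invented example: six pupils, observed possession (1 = tick) and
# predicted probabilities from some fitted logistic model.
y = [1, 0, 1, 1, 0, 0]
p = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2]
print(efron_r2(y, p))   # about 0.61
```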

Other pseudo R² measures, such as McFadden's and McKelvey and Zavoina's, also exist, but we will not discuss them all in detail here. However, the existence of multiple measures, as opposed to the single R² we had in ordinary linear regression, points to the fact that these are approximations of R² which are inexact and to some extent disputable, and it is important to remember that they will give us somewhat different numbers.



Pseudo R-Squared in SPSS

In SPSS we get only two pseudo R² measures in the output, Cox and Snell's and Nagelkerke's. These are given in the box labelled 'Model Summary':

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      16327.952(a)        .055                   .074

As we can see, Nagelkerke's measure gives us a higher value than Cox and Snell's, as we would expect. We also said earlier that Nagelkerke's measure is a correction of Cox and Snell's that allows the measure to use the full 0 to 1 range, so we will prefer to use this one.
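As a rough check on these figures (an illustration only, not something SPSS itself reports), we can plug the fitted-model $-2LL$ of 16327.952 from the model summary into the formulas given above. The null-model $-2LL$ is not printed in the output shown here, but for an intercept-only logistic model it can be computed from the marginal counts in the Step 0 classification table (4919 ticks and 7874 no ticks); small rounding differences from the SPSS values are to be expected:

```python
import math

n_tick, n_no_tick = 4919, 7874      # counts from the Step 0 classification table
n = n_tick + n_no_tick              # 12793 cases in the analysis

# -2LL of the intercept-only (null) model, from the observed proportions
p_tick = n_tick / n
neg2ll_null = -2 * (n_tick * math.log(p_tick) + n_no_tick * math.log(1 - p_tick))

neg2ll_model = 16327.952            # -2LL reported in the Model Summary box

cox_snell = 1 - math.exp((neg2ll_model - neg2ll_null) / n)
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(cox_snell, 3), round(nagelkerke, 3))   # roughly .055 and .074
```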

Whichever of the measures we use, however, we can see that the fit of the model is poor. As .07 is close to 0, our model is not a great improvement over the null model with no predictors.

Summarising what these three measures tell us about the fit of our model: the Hosmer and Lemeshow goodness-of-fit Chi Square test indicates that our model does not fit the data. However, this measure is sensitive to sample size, and in our large sample few parsimonious models would fit. Nevertheless, poor fit is also indicated by the other measures. Accuracy of prediction has improved over the null model, but only by about 4 percentage points. Nagelkerke's pseudo R² is only .07, again (think of an analogous R² in ordinary linear regression) indicating poor fit. So, even though both our predictors are significant, they are weak predictors of possessing literature in the home.

Task 1

Let’s now try and improve our model. We will add two more independent variables that may be related to whether or not pupils say they possess literature in the home. Firstly, we would like to know whether there is any difference between the three countries in our sample, Finland, Scotland and Flanders. Secondly, it might be the case that achievement in English would lead to a greater awareness of literature in the home, so we shall include English reading test scores in the model as well. The two variables we already have (gender and mother’s education) will remain in our model. [See Module 4 (on Multiple regression with nominal independent variables) for how to create the dummy variables for Finnish and Flemish, if you don’t have these in the version of the dataset that you are using.]

Task 2

We said earlier that, as the coefficients are unstandardised, we cannot directly compare the strength of these variables with each other. What can we do to see whether country or reading score has the strongest relationship with the dependent variable?