When things are not normal: Generalized Linear Models

Levi John Wolf

An impatient patron

Returning to our pub data from R11:

waiting pub long_wait drinktype
5.90 The Berkeley TRUE Beer
8.82 The Berkeley TRUE Beer
4.52 The Berkeley TRUE Beer
6.66 The Berkeley TRUE Cocktail
1.01 The Berkeley TRUE Cocktail
1.30 The Berkeley FALSE Beer
0.63 The Gallimaufry TRUE Cocktail
3.31 The Gallimaufry TRUE Cocktail
0.83 The Gallimaufry FALSE Beer
2.23 The Gallimaufry TRUE Beer
0.71 The Gallimaufry FALSE Beer
1.68 The Gallimaufry TRUE Beer
0.41 The Gallimaufry TRUE Beer
15.69 The Milk Thistle TRUE Beer
2.70 The Milk Thistle FALSE Beer
1.86 The Milk Thistle FALSE Cocktail
4.81 The Milk Thistle TRUE Beer

Let’s say that we were trying to predict what your friend thought was a “long wait.” Eyeballing the table above, it looks like the average wait time is different in different pubs, as we saw before. More to the point, though, long waits in The Gallimaufry might be short waits at the Milk Thistle1 (like, your friend thought a half-minute wait at the Galley was too long, but a 2.5 minute wait at the Milk Thistle was OK. What a demanding patron!). So, there’s more than just waiting time or pub location: we need both to make this prediction correctly. The absolute length of the wait is only part of the picture. Your friend clearly values the ambiance of a place when deciding whether a wait is long or not, since a “long” wait at the Gallimaufry is much shorter than a “short” wait at the Milk Thistle!

You can see this in the table below. There, we split the data by pub & whether the wait was “long,” and then look at the number of observations in each group. For example, your friend only had one short wait in the Berkeley, while they had two short waits in the Gallimaufry. Their two long waits in the Milk Thistle lasted about 10 minutes on average, but their average long wait in the Berkeley was only about 5 minutes. Thus, you can see that your friend cares about the pub they’re in.

knitr::kable(times %>% group_by(pub, long_wait) %>% 
               summarize(count = n(), mean_wait_time = mean(waiting)))
pub long_wait count mean_wait_time
The Berkeley FALSE 1 1.2967804
The Berkeley TRUE 5 5.3806650
The Gallimaufry FALSE 2 0.7700659
The Gallimaufry TRUE 5 1.6525299
The Milk Thistle FALSE 2 2.2799322
The Milk Thistle TRUE 2 10.2509857

A simple approach for binary predictions

You could simply try predicting the probability a wait was long using a linear model2 (technically, this model is called a linear probability model, since you are using a linear regression to predict probabilities for a binary outcome, but there’s nothing inherently different about the model itself, only \(y\)). This puts your binary variable as the \(y\), and can include any \(X\) values as predictors:

wait_model2 = lm(long_wait ~ 0 + pub + waiting, data=times)
knitr::kable(broom::tidy(wait_model2))
term estimate std.error statistic p.value
pubThe Berkeley 0.4984616 0.2300168 2.1670659 0.0493913
pubThe Gallimaufry 0.6145088 0.1672170 3.6749188 0.0028009
pubThe Milk Thistle 0.0535921 0.2926292 0.1831399 0.8575137
waiting 0.0712490 0.0320067 2.2260648 0.0443211

This model “sees” \(y\) as a bunch of either zeros or ones, and tries to fit a line that passes through them. Graphically, this model describes the following lines of best fit3 (if you’re surprised that there are lines, and not just one line, that’s OK. It’s the “next step” from our model of pub waiting times from above. This model is: \[y_i = \begin{cases} \text{if at berkeley:} & \alpha_b \\ \text{if at gallimaufry:} & \alpha_g \\ \text{if at milk thistle:} & \alpha_{m} \end{cases} + \text{wait time}*\beta + e_i\] Each \(\alpha\) is a different intercept that represents the baseline for that group, just like we used the average wait in the pub before. But, here, we are also using the actual wait time to improve our predictions.) OK, this is definitely useful, since we can see that your friend allows for longer waits at the Milk Thistle: the intercept for the Milk Thistle is way below the other two pubs, which suggests that the “baseline” probability of a long wait at the Milk Thistle is lower, even controlling for waiting time! For comparison, look at the plot: a wait of about 5 minutes has around a 40% chance of being a “too long” wait at the Milk Thistle, but has a nearly 90% chance if you’re at the Berkeley, and is basically 100% for the Gallimaufry!
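The plot itself isn’t reproduced in this text, but a minimal sketch of it (assuming ggplot2 is loaded, and using an illustrative lpm_predictions vector that mirrors the plotting code we use later) would look something like this:

lpm_predictions = predict(wait_model2)  # fitted values from the linear probability model
ggplot(times, 
       aes(x=waiting, 
           y=as.numeric(long_wait), 
           group=pub, 
           color=pub)) +
  geom_point() +
  geom_line(aes(y=lpm_predictions)) +  # one straight "line of best fit" per pub
  labs(x='waiting time', y='long wait')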

This strategy, predicting a binary outcome just using a linear regression, has a few issues, however. First, look more closely at those lines: you know that the residuals are going to have some odd structure: \(y\) can only be zero (if the wait is not too long) or one (if the wait is too long). This means that the residual, \(y - \hat{y}\), will be \(0-\hat{y}\) or \(1 - \hat{y}\), since \(y_i\) is only ever zero or one. For any given prediction, then, the residual can only take one of two values, so the residuals will almost never look like the normally-distributed errors that linear regression assumes. This is not what we want to see for linear regression, so we have to be very careful about interpreting this model.

You can also see that the predictions can go a little bit funky… If your friend waits for seven minutes at the Galli, there is a probability greater than 1 that the wait will be too long:

scenario = data.frame(pub='The Gallimaufry', waiting=7)
predict(wait_model2, scenario)
##        1 
## 1.113252

Having a probability over 1 means that there’s more than a 100 percent chance of something happening. That’s against the law, and I don’t want you to get busted for having contraband probabilities lying around! So, let’s try something a little better than this.

A generalized linear model

So, to deal with this, we use a generalized linear model, or GLM. With a GLM, we still focus on our conditional expectation \(\mathbf{E}[y|x]\), but the distribution for \(y\) no longer has to be normal. What is required is a link between what we’re predicting and some other, latent scale where a standard linear model does make sense. So, returning to our expectations, remember that all regression is focused on \(\mathbf{E}[y|x]\). But what is the expected value of a binary variable like \(y\)? Well… it’s just the probability that \(y_i\) is one, which we’ll write as \(\pi_i\).

In our linear model predicting long waits, we used a gaussian or normal model for \(\mathbf{E}[y|x]\), but this model could never actually produce a zero or a one: at best, the Gaussian model gives us the probability a wait was long (that is, \(\pi_i\)). So, we really were predicting \(\pi_i\) and, since we only ever observe the \(1\) or the \(0\), the line of best fit represents our best guess for what those probabilities would be.

Another model for \(y_i\) that actually generates a binary outcome might use the Bernoulli distribution, which is just the “coin flip” distribution. Instead of the Normal model: \[ y_i \sim \mathcal{N}(\mu, \sigma^2)\] we use the Bernoulli model: \[ y_i \sim \text{Bernoulli}(\pi_i)\] The Bernoulli distribution still has our nice property from before, where \(\mathbf{E}[y_i]=\pi_i\), which is great. But we’re still stuck: in the Normal model, we knew that we could use \(\alpha + X\beta\) to model the mean, \(\mu\), since everything was continuous. But \(\pi_i\) is a probability: it has to stay between zero and one, while \(\alpha + X\beta\) can be any number.
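As a small aside (this uses base R’s rbinom(), since there is no separate bernoulli() function, a point we come back to below), a Bernoulli variable really does only generate zeros and ones:

# a Bernoulli draw is a binomial with size = 1: the result is always 0 or 1
rbinom(n = 10, size = 1, prob = 1/3)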

So, we transform the probability using the logit transformation: \[\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)\] Breaking this down, \(\frac{\pi_i}{1 - \pi_i}\) is something called the odds of \(y_i\) happening. You may have heard this before: if a horse has two-to-one (spelled: \(2:1\)) odds of winning a race, it means that the bettors believe it is 2 times more likely that the horse loses than that it wins. A horse that has 15:1 odds is a very unlikely winner4 (hence the saying “long odds,” meaning that you have some incredibly-long-number to one chance of succeeding).

Mathematically, odds can vary between zero and infinity: when \(\pi_i\) is very small, the odds are very small (well below 1). In our example before, the horse with two-to-one odds has \(\pi=.3\bar{3}\), since that value of \(\pi\) is the only solution to \(\frac{\pi_i}{1-\pi_i} = \frac{1}{2}\). Our horse with 15:1 odds has \(\pi = 1/16\), since \(\frac{\frac{1}{16}}{1 - \frac{1}{16}} = \frac{1}{15}\).

This may seem indirect, but bear with me. On the right, you can see a plot of the odds versus the probability for those odds. You can see that the odds are below 1 for probabilities less than \(.5\), and become increasingly large (indeed, going to infinity) as the probabilities approach 1.

Something very cool happens when we take the logarithm of the odds. This is shown on the right. When \(\pi_i < .5\), the odds are between \(0\) and \(1\); when \(\pi_i>.5\), the odds are between \(1\) and infinity. The logarithm of the odds is therefore negative when \(\pi_i < .5\) and positive when \(\pi_i > .5\). This means that you get a kind of “S”-like curve that gets really steep as \(\pi_i\) gets close to its boundaries. This “stretches” the log odds space out as \(\pi_i\) gets really small or really large; small changes in \(\pi_i\) in this area create large changes in log odds. This is kind of intuitive: a change from 100000:1 odds to 1000:1 odds still means the event is extremely unlikely, and the change in probability is not that large in absolute terms. But that’s actually a huge change in terms of the log odds: they go from \(\log(1/100001)=\)-11.513 to \(\log(1/1001)=\)-6.909… almost a halving of the log odds, while the absolute change in probability is very small.5 (That’s more than the change in log odds when moving from \(\pi_i = .25\) to \(\pi_i = .75\), which is about 2.2!) In sum, this is a pretty good way to model how probabilities work: we want to magnify the changes that represent large relative changes, but we don’t want to squash the changes that happen in the middle, for things that have around 1:1 odds.
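To make these numbers concrete, here is a small sketch in base R (the logit() helper is defined just for illustration; base R’s qlogis() computes the same thing):

# the log odds ("logit") of a probability
logit = function(p) log(p / (1 - p))

logit(1/3)               # 2:1 odds against: log(1/2), about -0.69
logit(1/16)              # 15:1 odds against: log(1/15), about -2.71
logit(1/100001)          # about -11.5
logit(1/1001)            # about -6.9
logit(.75) - logit(.25)  # about 2.2, the change across the "middle" probabilities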

The “logistic” regression

If we do this, then our model can be stated just like a linear model: \[ \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha + x_i\beta + e_i\] This is basically three transformations: from binary to probability, from probability to odds, and then from odds to log odds. We run a standard linear regression on the log odds scale, and then transform it back to the probability scale. The transformation from log odds back to the probability scale is called the logistic function, hence why this approach is called logistic regression. You can see each step of this process below, where the three different “steps” in the transformation are plotted against \(X\). Moving from our linear model for outcomes on the log odds scale (left) to an exponential function in terms of the odds (middle), and then to a logistic function in terms of the probabilities (right), you can see how the line gets ‘bent’ by the transformations:

This is nice! As \(\alpha + x_i\beta\) gets really small, you get \(\pi_i\) that stays around zero! As \(\alpha + x_i\beta\) gets large, you get probabilities that stay really close to 1! This means that we never get predicted probabilities larger than 1 or smaller than zero like we did with our earlier model. Further, we have some control over the way the curve moves: in this case, the \(\beta\) parameter governs how “steep” that central part of the curve is, and the \(\alpha\) parameter moves the curve left and right.
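You can check this for yourself: base R’s plogis() is the logistic (inverse logit) function, and no matter how extreme \(\alpha + x_i\beta\) gets, the result stays between zero and one:

# the logistic function squashes any real number into the (0, 1) interval
plogis(c(-100, -5, 0, 5, 100))  # roughly 0, 0.0067, 0.5, 0.9933, 1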

Fortunately, it’s pretty simple to fit a model like this in R. We can use the glm() command and use the exact same formula we did before. However, we need to tell R that the family of this GLM is binomial(), not gaussian(), which would simply return the same thing as our earlier lm() fitted directly to the binary data.

wait_model3 = glm(long_wait ~ 1 + pub + waiting, data=times, family=binomial())
knitr::kable(broom::tidy(wait_model3))
term estimate std.error statistic p.value
(Intercept) -2.250639 2.123210 -1.0600173 0.2891367
pubThe Gallimaufry 1.134360 1.785956 0.6351552 0.5253272
pubThe Milk Thistle -5.207691 4.236907 -1.2291256 0.2190247
waiting 1.957421 1.353776 1.4458977 0.1482059

Two things might surprise you here:

  1. We use binomial() in the R command instead of bernoulli() because the binomial distribution includes the bernoulli distribution as a special case. R only provides binomial(), since that’s less work than providing both binomial() and bernoulli().
  2. We’re now fitting an intercept, 1 + pub, rather than the intercept-less model from above, 0 + pub. The exact reason for this will become clear in the next course block, but for now, notice that (a) the “Berkeley” effect is the (Intercept), and (b) the pubThe Gallimaufry effect represents the difference between The Berkeley and The Gallimaufry.6 (Check the ARM §4.5, pg. 66, Using discrete rather than continuous predictors, for more info.)
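As a quick sanity check (a sketch reusing the same kind of scenario data frame as before), we can revisit the seven-minute wait at the Gallimaufry; with the binomial family, the prediction is now a legal probability:

scenario = data.frame(pub='The Gallimaufry', waiting=7)
predict(wait_model3, scenario, type='response')
# the result is extremely close to 1, but it can never go above 1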

Now, we can plot the results in the exact same manner as before, but instead of straight lines, we get these s-shaped logistic curves:

# get fitted probabilities on the 0-1 ("response") scale
predicted_probabilities = predict(wait_model3, type='response')
ggplot(times, 
       aes(x=waiting, 
           y=as.numeric(long_wait), 
           group=pub, 
           color=pub)) +
  geom_point() +
  geom_line(aes(y=predicted_probabilities)) +  # one fitted curve per pub
  labs(x='waiting time', y='long wait')

Or, smoothing out our plot a little bit more with faked data, we can see the curvature really well7 (step through each line in R on your own and take a look at the final scenario dataframe to see the result):

# make fake waiting times from 0 to 15 minutes, every half minute
fake_minutes = seq(0,15,.5) 
# get the unique pubs in our data
pubs = unique(times$pub)
# make a scenario dataframe where we:
scenario = data.frame(waiting = rep(fake_minutes, length(pubs)), # 1. repeat the fake_minutes for each pub
                      pub = rep(pubs, each=length(fake_minutes)) # 2. tile the pubs over each set of fake minutes
                        )
scenario['prediction'] = predict(wait_model3, scenario, type='response')
ggplot(times, aes(x=waiting, y=as.numeric(long_wait), color=pub, group=pub)) + 
  geom_point() + 
  geom_line(data=scenario, aes(y=prediction))

In this simple model, \(\alpha\) controls the location of the curve and \(\beta\) controls the speed at which you go from \(0\) to \(1\). Here, since our model fits one \(\beta\) for all the pubs (i.e. the effect of an extra minute is the same everywhere) but a different \(\alpha\) for each place (i.e. your friend gives the Milk Thistle a break, but is very harsh on the Gallimaufry), we get three curves that rise at the same rate, but are located in different places.
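To see those two knobs in isolation, here is a toy sketch with made-up \(\alpha\) and \(\beta\) values (none of these numbers come from the fitted model; they only illustrate the shapes):

# made-up alpha and beta values, purely for illustration
x = seq(0, 15, .1)
plot(x, plogis(-2 + 1*x), type='l', xlab='waiting time', ylab='probability')  # crosses .5 at x = 2
lines(x, plogis(-6 + 1*x), lty=2)  # smaller alpha, same beta: same steepness, shifted right (crosses .5 at x = 6)
lines(x, plogis(-6 + 3*x), lty=3)  # larger beta: a steeper rise (crosses .5 at x = 2)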

Assessing accuracy

Like in the linear probability model, the prediction for each observation is a probability (or, a log odds) that the observation is a \(1\). However, you still get to decide the threshold at which you consider a probability high or low. For example, if you say that probabilities higher than \(.5\) are considered a “long wait,” then you have to classify the observations yourself:

predictions = predicted_probabilities > .5

You can see a cross-tabulation of the accuracy of your classification, called a confusion matrix, once you build this prediction:

table(predictions, times$long_wait)
##            
## predictions FALSE TRUE
##       FALSE     2    2
##       TRUE      3   10

This confusion matrix contains the “accurate” predictions on the diagonal and the inaccurate predictions on the off-diagonal. The rows reflect the predicted class, and the columns represent the true class. The Wikipedia article provides a very cogent overview of the ways you can summarize this matrix. But, the main kinds of statements we might want to make involve either the raw prediction accuracies or various kinds of scores that describe how effective our classifier is.

For accuracy, we can make a few statements. Five observations were not long waits, and twelve were long waits8 (sums over the columns), while four observations were classified as short waits and thirteen observations were classified as long waits9 (sums over the rows). We can also see that twelve predictions were made correctly, and five were incorrect10 (sums along and off the diagonal). This means that our classifier’s accuracy, or percent of total correct predictions, is about 71%11 (twelve correct predictions out of seventeen). Our false positive rate is the percent of all FALSE observations that our model thinks are TRUE, which comes to 60%12 (the number of TRUE predictions in the FALSE column, divided by the total in the FALSE column). All of these are different views of how the model is performing.
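If you would rather let R do this arithmetic, a minimal sketch (reusing the predictions vector from above; the conf name is just illustrative) looks like this:

conf = table(predictions, times$long_wait)
sum(diag(conf)) / sum(conf)                  # accuracy: (2 + 10) / 17, about 0.71
conf["TRUE", "FALSE"] / sum(conf[, "FALSE"]) # false positive rate: 3 / 5 = 0.6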

We can change this matrix by deciding on a different threshold, too. Let’s look at the confusion matrix when the threshold is \(.7\):

table(predicted_probabilities > .7, times$long_wait)
##        
##         FALSE TRUE
##   FALSE     5    3
##   TRUE      0    9

Our false positive rate is now zero (\(\frac{0}{5} = 0\)), and our accuracy is now about 82% (\(\frac{5 + 9}{17} \approx .82\))! But this comes at a cost: our earlier threshold had a false negative rate of about 17% (2 of the 12 truly long waits were predicted to be short), while this new threshold misses 3 of them, a false negative rate of 25%.
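Following the same pattern, you could compute the false negative rate at both thresholds directly (again, the object names here are just illustrative):

conf_50 = table(predicted_probabilities > .5, times$long_wait)
conf_70 = table(predicted_probabilities > .7, times$long_wait)
conf_50["FALSE", "TRUE"] / sum(conf_50[, "TRUE"]) # 2 / 12, about 0.17
conf_70["FALSE", "TRUE"] / sum(conf_70[, "TRUE"]) # 3 / 12, 0.25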

If this were, say… a cancer screening where TRUE meant “has cancer,” we would definitely want to make sure that everyone who has cancer is correctly predicted as having cancer. So, while it may seem acceptable to improve the accuracy and false positive rate by increasing our threshold, it’s not always the case that raw accuracy is what matters. This is why there are tons of different scores that summarize the confusion matrix; they each capture something different about the performance of your classifier relative to some objective you care about. Commonly used scores, such as the F1 score or the “Area Under the Receiver Operating Characteristic curve” (AUC-ROC), balance the four types of prediction in the confusion matrix. Generally, the overall accuracy is fine if the classes are roughly the same size and you care about both directions of mis-prediction equally.