How to: Binomial regression models in R

We’ve been doing a lot of work recently with multinomial logit choice models; that is, trying to predict how choices are made between multiple options. It’s a fairly new topic for me so I’m trying to get the hang of both the theory and the practical bits. But learning multinomial modelling before binomial modelling (the choice between two options) is like trying to run before you can walk. (Normal linear regression would be crawling.)

Binomial models are easy to do in R. Just feed your independent and response variables into the glm function and specify the “binomial” regression family.

# Create some data
n <- 500
x1 <- runif(n,0,100)
x2 <- runif(n,0,100)
y <- (x2 - x1 + rnorm(n,sd=20)) < 0
 
# Fit a binomial regression model
model <- glm(y ~ x1 + x2, family="binomial")

Without getting into the theory, this model estimates the logit z as a linear function of the independent variables.
  z = \beta_0 + \beta_1 x_1 + \ldots + \beta_i x_i

This value can then be used to calculate the probability of a given outcome via the logistic function:
  f(z) = \frac{1}{1 + e^{-z}}

Normally with a regression model in R, you can simply predict new values using the predict function. The problem with a binomial model is that the model estimates the probability of success or failure. So, for a given set of data points, if the probability of success was 0.5, you would expect the predict function to give TRUE half the time and FALSE the other half.

This is kind of what happens in R but it doesn’t predict these TRUE/FALSE outcomes directly, so you need to know how to convert the prediction outcome into the actual binary response variable. If you use the default predict arguments, the model gives you the z values. One (wrong) approach is to interpret z less than 0 as success or failure depending on how you’ve coded your model. But what you really want to do is estimate the probability, then draw a random number and see if that’s above or below the threshold value. R’s predict function can do this but it will take a couple lines of code:

# Create some new data
n2 <- 100
new.df <- data.frame(x1 = runif(n2, 0, 100), x2 = runif(n2,0,100))
 
# Predict the probabilities
# predict(model, new.df) will just give you the logit (z) values
# We want the probabilities so use the "response" type
probs <- predict(model, new.df, "response")
 
# Draw some random values between 0 and 1
draws <- runif(n2)
 
# Check if draw value is less than threshold probability
results <- (draws < probs)

Created by Pretty R at inside-R.org

So if you want to predict the results again, simply rerun the last two lines. New random draws will mean new outcomes.

To illustrate all this, I’ve put together a quick ggplot picture to show the original data (small points) and the modelled results (larger points). The colours represent the success/failure outcomes. (ggplot doesn’t want to add the legends for some reason.)

Binomial logistic regression

Example of binomial logistic regression.

Next time I’ll look at the more complicated multinomial case.

7 thoughts on “How to: Binomial regression models in R

  1. John

    Hi James,

    Thanks for the post. Do you have any references for this approach to draw a random number in order to determine the binary response?

  2. Natalia

    Hello, I have a doubt about binomial models: I did a binomial glm based on an experiment, and my response variable es female bird agression; females in my experimental group were agressive and females in the control group were not. when I run the model, theres some mistake, because all the 0′s are on the control group and all the 1′s are in the experimental group….I think it is a quasicomplete separation problem…..what can I do to run a binomial glm with data like this??

    thanx a lot

  3. Dalton Hance

    I’ve been playing around with a binomial regression model (though I’m still not clear on how this is different from logistic regression, if at all). I just discovered that the output of ‘predict(model, new.df, “response”) ‘ yields the odds not the probability. Check it out yourself. The output of ‘predict(model, new.df, “response”) ‘ is the same as if you were to simply take the exponent of the logit (log-odds) values that is: ‘predict(model, new.df) = exp( predict(model, new.df, “response”) )’. This is not the probability. Probability is equal to (odds/1+odds). So you’ve got to take one extra step to get the probability. My issue is dealing with getting a CI in terms of probability not odds. Any help out there?

  4. James Keirstead Post author

    I’m not sure I follow your comment Dalton. Here’s the relevant section of ?predict.glm:

    type: the type of prediction required. The default is on the scale
    of the linear predictors; the alternative “response” is on
    the scale of the response variable. Thus for a default
    binomial model the default predictions are of log-odds
    (probabilities on logit scale) and type = “response” gives
    the predicted probabilities. The “terms” option returns a
    matrix giving the fitted values of each term in the model
    formula on the linear predictor scale.

  5. Pingback: Propensity Modelling | GH Powell, D.I.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>