Positive coefficient regression in R

I’m currently working on a paper about simulating urban demand for electricity and gas at 5-minute resolution. To do this, I have a simple regression model that tries to explain observed consumption from local population figures and simulated levels of activity demand (e.g. minutes spent at work, leisure, etc.). The data set looks like this, where each of the letters is an activity code:

> head(data)
  zone  pop     elec       gas  A    B   I    J  L M     O   R    S     W
1    0 7412 46221768 124613714  0    0   0    0  0 0     0   0    0     0
2    4 7428 37345875 100944002 60 3060 120 1020  0 0  4900 390  510 28635
3    7 7464 20914281  64109628  0 1155 510  255  0 0 11000 225 2475 29580
4   10 7412 46221768 124613714  0  680   0  390  0 0  5145   0    0  9300
5   14 7128 69233086  36611811  0 1335 105  210 60 0  5970   0 2520 14910
6   17 7608 40783190  59343776  0  150   0  150  0 0  1500   0    0   555


I then performed a basic regression using lm, removing the intercept (the “- 1”) as I want the population coefficient to serve a similar baseline role in this analysis:

lm.elec <- lm(elec/365 ~ pop + A + B + I + J + L + M + O + R + S + W - 1, data)


Using Andrew Gelman’s helpful arm package, I can get a quick overview of the result as shown below. It doesn’t look too bad at first, but then I noticed all sorts of negative coefficients for the activity levels. This makes no sense: when individuals perform activities such as going to work, going to school, or shopping, we expect that their demand for electricity should go up, not down.

> display(lm.elec)
lm(formula = elec/365 ~ pop + A + B + I + J + L + M + O + R + 
    S + W - 1, data = data)
    coef.est coef.se 
pop    13.99     1.35
A     177.07    83.87
B     -63.90    27.80
I      -0.80    21.05
J      91.27    34.19
L   -1075.76   391.53
M      -5.57    73.50
O      55.47     9.90
R    -336.14   178.98
S      28.12     3.84
W      -8.19     3.16
n = 391, k = 11
residual sd = 159994.17, R-Squared = 0.65


Since a linear regression is essentially an optimization problem (minimizing the sum of squared residuals), my immediate thought was: can I just constrain the coefficients so that they are all non-negative? This would mean that some activities might have no significant effect on consumption, but at least they couldn’t have a negative one. And it turns out that yes, you can do this using the nnls package and its nnls function.

The nnls function is not quite as user-friendly as lm, so the first thing you have to do is define your input variable matrix and output vector manually. For example:

A <- as.matrix(data[,c("pop","A","B","I","J","L","M","O","R","S","W")])
b.elec <- data$elec/365

The analysis can then be run as:

library(nnls)
nnls.elec <- nnls(A, b.elec)
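To see the behaviour in isolation, here is a minimal toy sketch on synthetic data (the design matrix and coefficients below are made up for illustration; it assumes the nnls package is installed): when the true coefficients are non-negative, nnls never returns a negative estimate.

```r
library(nnls)

set.seed(42)
X <- matrix(runif(100 * 3), ncol = 3)      # toy design matrix, 100 obs x 3 vars
beta <- c(2, 0, 5)                          # true non-negative coefficients
y <- as.vector(X %*% beta) + rnorm(100, sd = 0.1)

fit <- nnls(X, y)
coef(fit)                                   # all estimates are >= 0
```

Note that nnls expects a numeric matrix and a plain vector, which is why the post builds A and b.elec by hand rather than using a formula interface.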

Similarly, we can’t use predict to generate our results and instead have to perform the matrix multiplication manually. This can be done as shown below.

coef.elec <- coef(nnls.elec)
pred.elec.nnls <- as.vector(coef.elec%*%t(A))
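The matrix algebra here is just “design matrix times coefficient vector”; a tiny base-R stand-in (with made-up numbers, not the consumption data) shows the form used above and an equivalent way of writing it without the transpose:

```r
# Toy stand-ins for the design matrix and the fitted coefficients
A.toy    <- matrix(c(1, 2,
                     3, 4), nrow = 2, byrow = TRUE)
coef.toy <- c(10, 1)

pred1 <- as.vector(coef.toy %*% t(A.toy))  # the form used in the post
pred2 <- as.vector(A.toy %*% coef.toy)     # equivalent, no transpose needed
# both give c(12, 34)
```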

This method also doesn’t give you an R² value per se, but you can estimate one with the following dummy regression:

> lm.elec.dummy <- lm(b.elec~pred.elec.nnls - 1)
> display(lm.elec.dummy)
lm(formula = b.elec ~ pred.elec.nnls - 1)
               coef.est coef.se
pred.elec.nnls 1.00     0.04   
n = 391, k = 1
residual sd = 164861.13, R-Squared = 0.62
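Alternatively, the same quantity can be computed directly from the predictions without fitting a dummy model. A base-R sketch (note that for models without an intercept, lm’s R² compares residuals against the raw, uncentred sum of squares, which is what the function below does):

```r
# No-intercept R-squared: 1 - RSS/TSS, with TSS left uncentred
rsq_no_intercept <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum(observed^2)
}

# e.g. rsq_no_intercept(b.elec, pred.elec.nnls)
```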

As you can see, the R² value is slightly lower in this case than in the standard lm model, but the coefficients now make much more sense to interpret. This can be seen clearly in the following graph, where demands for electricity and gas rise during the day as expected, rather than sinking as in the basic case.

Simulated profiles for electricity and gas, comparing lm and nnls regressions


So there you go. If you ever need to run a regression and ensure that all the coefficients are greater than or equal to zero, nnls is your friend.

12 thoughts on “Positive coefficient regression in R”

  1. Erolgo

    Er… it’s OK, isn’t it, to have negative coefficients on some activities? Time spent in that activity means less time spent in another, more costly, activity, and hence represents a saving.
    Not looking at the specific activities in your list, you’d imagine that activities such as “turning off lights” or “picnicking by a river” could conceivably have a lower-than-average electricity demand, and thus have a negative coefficient.
    Or have I missed the point again?

  2. James Keirstead Post author

    Hi Eric

    Yes, sorry, I should have clarified. All of the activities represent things like “work”, “education” or “shopping”, which you would expect to have a positive impact. Also, the figure shows a commercial zone, i.e. where the activities are taking place, and not a domestic zone, where demand might go down because people have left to do something else.


  3. Sam

    Great post; this is very helpful. I do this type of stuff frequently to blend models and find that using only models that have positive coefficients helps prevent overfitting.

  4. Mike

    Long shot, since this is an old post, but a question for you: how did you get nnls to work?! I’ve been trying to follow your example but I’m getting issues with nnls. It basically gives the error:

    in .Fortran(“nnls”, A = as.numeric(A), MDA = as.integer(MDA), M = as.integer(M), :
    “nnls” not available for .Fortran() for package “nnls”

    which I can’t make heads or tails out of… Any help would be much appreciated!

  5. Olu

    Hi James,
    It seems nnls does not give you an intercept. Do you have any idea how to phrase the problem so that the intercept is non-negative as well?

  6. James Keirstead Post author

    You can add an intercept easily by appending a column of ones to your A matrix, e.g. cbind(A, 1). The last coefficient returned by nnls will then represent the model intercept.
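    In other words, something like this sketch (variable names are illustrative; it assumes the nnls package and a design matrix A and response vector b as above):

```r
library(nnls)

# Toy data with a known intercept of 3 (for illustration only)
set.seed(1)
X <- matrix(runif(50 * 2), ncol = 2)
y <- as.vector(X %*% c(1, 2)) + 3

# Append a constant column so the fit includes a (non-negative) intercept
X1 <- cbind(X, intercept = 1)
fit <- nnls(X1, y)
coef(fit)[ncol(X1)]   # the last coefficient is the intercept
```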

  7. Marcin

    In a similar problem, how can we get an estimate of the contribution of each of the factors in %? And perhaps also the error (or, say, the % unexplained by the model)?

    Many thanks, Marcin

  8. James Keirstead Post author

    If I understand your question correctly, you want to perform a sum of squares analysis. In this case, you probably want to do something like this:

    nnls.mod <- nnls(A, b)
    ESS <- sum((nnls.mod$fitted - mean(b))^2)
    RSS <- sum(nnls.mod$residuals^2)

    The percent of variance explained by the model is therefore ESS/(ESS+RSS); this assumes you have an intercept (constant column of 1s) in your independent matrix A. There's more detail on Wikipedia.

  9. Marcin

    Thanks a lot, James. That is super helpful!
    By % in my question I meant that in the end I would like to have a contribution of each of the input variables summing up to 1 (100%). Or better yet, contribution of each of them in % and the unexplained/error in % as well all of which summing to 1.
    When doing what you proposed I get ESS/(ESS+RSS) of 0.96 (i.e. 96%), so it is pretty good, but the sum of all the coefficients is greater than 1. Is this expected? How should I interpret this so that I get the answer I wanted to begin with?
    > A1 nnls.mod ESS RSS ESS/(ESS+RSS)
    [1] 0.9601099
    [1] 1.011154

    P.S. In case it matters: Sum of b is 1 and sum of all factors is 1 as well. It is implied that the observed variable b is formed by addition of all factors in different quantity.
    > head(A1)
    Intercept fact.1 fact.2 fact.3 fact.4 fact.5 fact.6
    A 1 0.011098326 6.827082e-04 0.02217231 0.0365 0.014941548 0.0017
    B 1 0.009149341 6.191072e-04 0.01787168 0.0309 0.008960918 0.0028
    C 1 0.001490070 9.927896e-05 0.00213834 0.0183 0.002207846 0.0005
    D 1 0.006233885 3.238914e-04 0.01626515 0.0243 0.009206905 0.0019
    E 1 0.001801068 2.634810e-04 0.02400262 0.0097 0.011671022 0.0013
    F 1 0.002580909 2.698660e-04 0.01216030 0.0054 0.007292089 0.0012

    Many thanks, Marcin

  10. Marcin

    Somehow this was missing from my question above. Sorry:
    #A1 <- as.matrix(cbind(Intercept=1, A))
    #nnls.mod <- nnls(A1, b)
    #ESS <- sum((nnls.mod$fitted - mean(b))^2)
    #RSS <- sum(nnls.mod$residuals^2)
    # [1] 0.9601099

Comments are closed.