One of the historic difficulties of doing research on urban energy systems has been the limited availability of data at sufficiently detailed spatial resolutions. Without this data, you might end up relying on aggregate information about the built environment, building occupants, and local geography that doesn't apply to the specifics of a particular neighbourhood or street.
Fortunately things are gradually improving and the UK government has a major initiative in this area: NEED, the National Energy Efficiency Data Framework. The project website has recently been updated with a series of data sets, mainly consisting of cross-tabulated statistics comparing insulation, dwelling size, income, dwelling age, and other factors.
I haven't had a chance to do anything meaningful with this data, but I thought a good starting point would be to write some functions to extract the necessary information from the provided Excel spreadsheets. So the analysis is purely exploratory, examining electricity and gas assumption measured at the level of individual English local authorities and grouped by dwelling floor area. The full code for parsing the data and making the plots is available at the bottom of the post, but here’s the main result.
Energy consumption by floor area for English local authorities
As you would expect, both the mean and the variance of gas consumption increase with larger dwellings. In electricity consumption, the effect is less noticeable with less distinction between consumption levels in small dwellings.
Let's hope that DECC makes more NEED data available quickly.
So the London Olympics have been pretty exciting, right? All those rare sports a person only seems to watch once every four years, right alongside familiar favourites being performed at the absolute top level. To top it all off, watching this year’s Games have made me realize that I’ve been making big mistakes in my running technique for years.
A bit of backstory first. Growing up, my main sport was downhill skiing with a bit of running and golf on the side. The running was rarely competitive but after moving to the UK, the skiing and golf dropped of the radar pretty quickly and I started running in earnest. I’ve now run a few 10k races and two half-marathons with decent times in the 40 minute and 1:35 range for each event respectively. Unfortunately settling into my first year of lecturing took a big chunk out of my schedule and I’m only now getting back on the road after a year-long hiatus.
Which brings us to the Olympics. I was blown away by both Mo Farah‘s 10km victory and David Rudisha‘s world record in the 800m, and soon found myself looking up running technique videos on YouTube. This one in particular was a real eye-opener.
Having never been filmed running before, I had no idea whether I was heel-striking or not. My gut feeling was that I probably was but I headed out this afternoon for a short run to find out for sure.
After doing about a kilometer I was pretty confident that, although it wasn’t too extreme, I probably did have a heel strike. The picture below shows the difference between the two and once you’re aware of the distinction, you become attuned to the way your foot hits the ground quite quickly.
Forefoot strike (left) versus heel strike (right)
There were several good tips on the web for how to encourage forefoot striking, including hopping in place (like jumping rope) and using a slight body lean (from the ankles, not hips) while running to encourage the foot to land under your hips, rather than way out in front. So, after pausing for stretch, I tentatively leaned forward and absolutely took off.
I cannot believe the difference this made to my running – it feels like you’re flying along. That horrible braking sensation of jarring down on the leading leg is gone and you are propelled down the street with a spring in your step. Marvellous! Obviously it’s only one run, and I need to gradually build up the mileage with this new technique to ensure that I don’t hurt any previously untested muscles. What’s more, controlling pace is a bit tricky and you have to get used to the faster cadence.
But, for today’s short run, the results speak for themselves. The chart below compares the distribution of all the runs I’ve done between 3.3 and 4 km over the past three years, including a period where I was recovering from injury. These aren’t pace runs or anything, just normal daily runs, and if anything a bit slow, as I usually use this distance either early in a training cycle or as recovery runs. Still, I wouldn’t say that today’s effort wasn’t particularly different from these previous runs and the extra pace of the forefoot strike is obvious. I’ll just keep practicing and see how things go!
Comparison of heel strike and forefoot strike on short runs
I’m currently working my way through Rasmussen and Williams’s book on Gaussian processes. It’s another one of those topics that seems to crop up a lot these days, particularly around control strategies for energy systems, and thought I should be able to at least perform basic analyses with this method.
While the book is sensibly laid-out and pretty comprehensive in its choice of topics, it is also a very hard read. My linear algebra may be rusty but I’ve heard some mathematicians describe the conventions used in the book as “an affront to notation”. So just be aware that if you try to work through the book, you will need to be patient. It’s not a cookbook that clearly spells out how to do everything step-by-step.
That said, I have now worked through the basics of Gaussian process regression as described in Chapter 2 and I want to share my code with you here. As always, I’m doing this in R and if you search CRAN, you will find a specific package for Gaussian process regression: gptk. The implementation shown below is much slower than the gptk functions, but by doing things manually I hope you will find it easier to understand what’s actually going on. The full code is given below and is available Github. (PS anyone know how to embed only a few lines from a gist?)
Step 1: Generating functions
With a standard univariate statistical distribution, we draw single values. By contrast, a Gaussian process can be thought of as a distribution of functions. So the first thing we need to do is set up some code that enables us to generate these functions. The code at the bottom shows how to do this and hopefully it is pretty self-explanatory. The result is basically the same as Figure 2.2(a) in Rasmussen and Williams, although with a different random seed and plotting settings.
Example of functions from a Gaussian process
Step 2: Fitting the process to noise-free data
Now let’s assume that we have a number of fixed data points. In other words, our Gaussian process is again generating lots of different functions but we know that each draw must pass through some given points. For now, we will assume that these points are perfectly known. In the code, I’ve tried to use variable names that match the notation in the book. In the resulting plot, which corresponds to Figure 2.2(b) in Rasmussen and Williams, we can see the explicit samples from the process, along with the mean function in red, and the constraining data points.
Example of Gaussian process trained on noise-free data
Step 3: Fitting the process to noisy data
The next extension is to assume that the constraining data points are not perfectly known. Instead we assume that they have some amount of normally-distributed noise associated with them. This case is discussed on page 16 of the book, although an explicit plot isn’t shown. The code and resulting plot is shown below; again, we include the individual sampled functions, the mean function, and the data points (this time with error bars to signify 95% confidence intervals).
Example of Gaussian process trained on noisy data
Next steps
Hopefully that will give you a starting point for implementating Gaussian process regression in R. There are several further steps that could be taken now including:
Changing the squared exponential covariance function to include the signal and noise variance parameters, in addition to the length scale shown here.
Speed up the code by using the Cholesky decomposition, as described in Algorithm 2.1 on page 19.
Try to implement the same regression using the gptk package. Sadly the documentation is also quite sparse here, but if you look in the source files at the various demo* files, you should be able to figure out what’s going on.
I’ve been working through Gelman et al.’s otherwise excellent Bayesian Data Analysis and it’s going reasonably well. My statistics is a little bit rusty so it’s taken time to work through all of the exercises and really understand what’s going on. But I say “otherwise excellent” because yesterday I spent ages trying to figure out a problem, only to discover that the data published in the book don’t correspond to the text discussion.
The troublemaker is the SAT problem in section 5.5. The authors give values for two variables and for schools, rounded to integer values. Using the formulas given in the text, I then calculated estimates for the complete pooling statistics and came up with a posterior treatment effect, , of 7.7 and a variance of 16.6. However the text says these values should be 7.9 and 17.4 respectively.
I puzzled over this for quite a while, thinking that maybe I’d missed a prior/posterior distinction somewhere and my estimates were supposed to be subtlely shifted. But no, when I went and checked the original data source, I found that the values were reported to 4 sig figs. Repeating the calculation with the new values gives the expected results. Grr…
Here’s the data and R-code if anyone’s interested.
### Load SAT data from the Gelman et al book and the original Rubin paperdf<-data.frame(school=LETTERS[1:8],
book.y=c(28,8,-3,7,-1,1,18,12),
rubin.y=c(28.39,7.94,-2.75,6.82,-0.64,0.63,18.01,12.16),
book.sigma=c(15,10,16,11,9,11,10,18),
rubin.sigma=c(14.9,10.2,16.3,11,9.4,11.4,10.4,17.6))### Rearrange data into a handy formdf<- melt(df,id="school")
vals <- colsplit(df$variable,"\\.",c("source","metric"))df<-cbind(df,vals)df<-df[,-2]df<- cast(df,school+source~ metric)### Calculate the summary statistics
ddply(df,.(source),summarize,y=sum(y/sigma^2)/sum(1/sigma^2),sig=1/sum(1/sigma^2))
EDIT: I’ve updated this code to work with distributions requiring more than two parameters. See this Gist for the improved code.
When doing Monte Carlo simulation, it’s important to pick your parameter values efficiently especially if your model is computationally expensive to run. If the model takes two days to run, and a parameter ranges from 0 to 10, it doesn’t make much sense to run it once at and again at if hasn’t been explored at all.
To get around this problem, one can use quasi-random low-discrepancy sequences which are designed to fill a parameter space efficiently. The R package, randtoolbox, provides implementations of common sequences, like the Halton or Sobol’, but the process involves a couple of steps that beg to be automated. The general process is:
Generate a n x p matrix of uniformly distributed quasi-random values, where n is the number of simulations you wish to run and p is the number of parameters.
For each column of the matrix, convert the quasi-random value to the parameter’s actual distribution by inverting the cdf curve. So if you have a uniformly distributed value of 0.5, and you want to convert it to a normal distribution with mean you would get a parameter value of for use in your simulation.
Run your simulation with these parameter values, and analyse the results
I’ve written a little R function to make this process easier. You simply pass it the number of simulations you want to run, and a list describing each parameter, and it will return the Monte Carlo sample as a data frame. At the moment, it’s pretty rudimentry, and each parameter is described by a name, a distribution name (matching the R abbreviations, e.g. “unif” for the uniform distribution, “norm” for the normal), and two parameters to describe the distribution.
# Generate a Monte Carlo sample
generateMCSample <-function(n, vals){# Packages to generate quasi-random sequences# and rearrange the datarequire(randtoolbox)require(plyr)# Generate a Sobol' sequence
sob <- sobol(n,length(vals))# Fill a matrix with the values# inverted from uniform values to# distributions of choice
samp <-matrix(rep(0,n*(length(vals)+1)),nrow=n)
samp[,1]<-1:n
for(i in1:length(vals)){
l <- vals[[i]]dist<- l$dist
params <- l$params
samp[,i+1]<-eval(call(paste("q",dist,sep=""),sob[,i],params[1],params[2]))}# Convert matrix to data frame and label
samp <-as.data.frame(samp)names(samp)<-c("n",laply(vals,function(l) l$var))return(samp)}
Here’s a simple example to show how it can be used.
n <-1000# number of simulations to run# List described the distribution of each variable
vals <-list(list(var="Uniform",dist="unif",
params=c(0,1)),list(var="Normal",dist="norm",
params=c(0,1)),list(var="Weibull",dist="weibull",
params=c(2,1)))# Generate the sample
samp <- generateMCSample(n,vals)# Plot with ggplot2library(ggplot2)
samp.mt <- melt(samp,id="n")
gg <-ggplot(samp.mt,aes(x=value))+
geom_histogram(binwidth=0.1)+
theme_bw()+
facet_wrap(~variable,ncol=3,scale="free")
Histogram of three parameters in a Monte Carlo sample
Any suggestions on how to improve this function so that it has a more generic description of a distribution would be appreciated (e.g. for distributions with n!=2 parameters).
When I first started using R, one of the things that attracted me was its claim to be an object-oriented programming (OOP) language. Coming from a Java background, I was used to designing software with OOP concepts like encapsulation and inheritance but, when I turned my hand to R, I quickly realized that “object-oriented” meant something subtlely different.
For those who are interested, the technical detail is explained at length in this paper and this blog (with examples). But what I want to do here is quickly illustrate how you can implement the common get/set method structure in R for a slightly different purpose.
In “traditional” OOP, you might have an object like a Shape with some attribute, say its area. In Java, this attribute would be accessed and altered using get and set methods, as in:
Shape s = new Shape();
s.setArea(10); // set the attribute value
int area = s.getArea(); // get the attribute value
Using this pattern offers advantages like protecting access to the value of the area field and ensuring that only valid values are set. While there is somedisagreement about whether or not mutator methods represent good practice, every OOP coder must have had at least one occasion where a getter/setter was needed.
Which brings us back to R. I recently had a problem where I wanted to use something like a getter/setter in order to access a global variable consistently, from both inside and outside functions. Since variable scoping in R isn’t very intuitive, I wasn’t sure how to do this at first and so I thought the get/set paradim might be helpful, even though the value I was getting and setting wasn’t really associated with an object.
Here’s what I ended up using. The trick is to create an explicit environment that stores the variable’s value behind the scenes, so that the user doesn’t have to worry about scope.
# Declare an explicit environment to hold the variable
e1 <-new.env()# Sets the value of the variable
setArea <-function(value){assign("area", value, env=e1)}# Gets the value of the variable
getArea <-function(){return(get("area", e1))}
Again, this isn’t really object-oriented programming as I’m not manipulating an object (other than the environment e1) but the get/set pattern has an OOP pedigree. So if you’re coming from a Java or C++ background and are having trouble figuring out variable scoping, the above code might be useful for you. However if you want to do proper object-oriented getting and setting, I highly recommend John Myles White’s example of R object-oriented polymorphism in get/set methods.