EDIT: A new and improved version of this code, fixing the issues mentioned in the comments, is available on Github.
The internet seems abuzz this week with the “discovery” of a long-lost Edward Tufte plot type: the slopegraph. The full story is explained over at Charlie Park’s website and you can catch some of the debate about whether this is actually a novel plot type at Andrew Gelman’s blog, Metafilter or on Twitter.
But whatever you want to call them, there’s no denying that these plots are an elegant and concise way to display trends in grouped data. In this post, I’ll show you an attempt I’ve made at creating a slopegraph in R using ggplot. Here’s the original figure from Tufte’s book, Beautiful Evidence, to give you an idea of what we’re aiming for.
First, we need to get some data. The basic structure of a data set for this kind of plot is a long table with one column each for the group, the measurement year, and the measurement value. In Tufte’s plot, the groups are various types of cancer, and the measure is the estimated survival rate (in %) taken at five-year intervals. You can find the raw data here, but I’ve already arranged it in the necessary format for you in the Github project.
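To make the long format concrete, here is a minimal sketch of such a table. The column names (group, year, value) and the numbers are placeholders I've made up for illustration; they are not the survival figures from Tufte's plot or the file in the Github project.

```r
# Hypothetical long-format table: one row per group per measurement year.
# The values are invented for illustration only.
survival <- data.frame(
  group = rep(c("Type A", "Type B", "Type C"), each = 4),
  year  = rep(c(5, 10, 15, 20), times = 3),
  value = c(90, 85, 84, 80,
            62, 60, 59, 55,
            61, 50, 45, 40)
)

# Inspect the structure: three columns, twelve rows
str(survival)
```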
A slopegraph is intended to “compare changes over time for a list of nouns located on an ordinal or interval scale”. It does this by plotting the relative rank of each group at a given point in time, the measured value, and a line indicating the change in value between two measurement points. At first glance, this should be a fairly easy plotting task, especially with ggplot2, and the naive code might look something like this:
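What follows is only a sketch of what such a naive attempt could look like, written against current ggplot2 rather than copied from the Github project. The data frame and its values are made up for illustration, and the column names are assumptions.

```r
library(ggplot2)

# Hypothetical data, invented for illustration; note the near-identical
# values for "Type B" and "Type C", which is where labels will overlap.
df <- data.frame(
  group = rep(c("Type A", "Type B", "Type C"), each = 2),
  year  = rep(c(5, 10), times = 3),
  value = c(90, 85, 62, 60, 61, 59)
)

# Naive version: plot every label at its raw value
p <- ggplot(df, aes(x = year, y = value, group = group)) +
  geom_line(colour = "grey60") +
  geom_text(aes(label = value), size = 3) +
  # group names to the left of the first year only
  geom_text(data = subset(df, year == min(year)),
            aes(label = group), hjust = 1, nudge_x = -0.5, size = 3)
```

Printing p draws the plot; nothing here keeps neighbouring labels apart.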
However, this produces all sorts of nasty collisions, as there may be many labels plotted at similar values. Tufte solves these sorts of problems by creating his figures largely by hand (well, with Illustrator). This gives full control over the typography, spacing and so on, but it’s not very reproducible.
With R, we can automate the plotting process, although our aesthetic tweaks can only be approximate. What I’ve done is to add an offset between each group at a given year to ensure that a certain amount of minimum spacing is maintained. The algorithm isn’t perfect, as a value of x in a given year may be positioned higher than x + 1 in an adjacent year depending on the characteristics of each year’s data. (For a particularly ugly example, look at the liver and intrahepatic bile duct group in the plot below.) Tufte’s plot also has this characteristic, but he arranges the lines so that they never cross: clearer to read but potentially confusing when making horizontal comparisons.
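One way to implement this kind of offset is sketched below. It is not necessarily the same as the code in the Github project: the helper name spread_positions, the min_space parameter, and the data are all hypothetical. Within each year it sorts the values and pushes each one up until it sits at least min_space above its lower neighbour.

```r
# Hypothetical helper: enforce a minimum gap between plotted positions.
# `min_space` is expressed in the same units as `value`.
spread_positions <- function(value, min_space = 2) {
  ord <- order(value)          # work from the lowest value upwards
  pos <- value[ord]
  for (i in seq_along(pos)[-1]) {
    # keep the raw value if there is room, otherwise push it up
    pos[i] <- max(pos[i], pos[i - 1] + min_space)
  }
  out <- numeric(length(value))
  out[ord] <- pos              # restore the original row order
  out
}

# Hypothetical data, invented for illustration
df <- data.frame(
  group = rep(c("Type A", "Type B", "Type C"), each = 2),
  year  = rep(c(5, 10), times = 3),
  value = c(90, 85, 62, 60, 61, 59)
)

# ave() applies the helper separately within each year
df$ypos <- ave(df$value, df$year, FUN = spread_positions)
```

Because each year is adjusted independently, the same raw value can end up at different heights in different years, which is exactly the imperfection described above.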
Once the data set is prepared, we need to plot it. The first thing to do here is to create a new ggplot theme which strips away most of the plot decoration. I did this by typing theme_bw at the command line, copying the resulting output, and modifying it to my needs. We can then make the plot using standard ggplot commands. The plot consists of (in plotting order):
- Lines linking each group
- White points to provide a background for each number
- Text labels for each numeric value
- Text labels for the groups
- A scale adjustment to ensure that the labels appear properly
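The layers listed above can be sketched roughly as follows. This assumes a current version of ggplot2, where a theme can be adjusted with theme() instead of editing the source of theme_bw; the object names (theme_slopegraph, first_year), the ypos column, and the data are hypothetical and made up for illustration.

```r
library(ggplot2)

# Hypothetical data with pre-computed label positions (`ypos`);
# all values are invented for illustration.
df <- data.frame(
  group = rep(c("Type A", "Type B", "Type C"), each = 2),
  year  = rep(c(5, 10), times = 3),
  value = c(90, 85, 62, 60, 61, 59),
  ypos  = c(90, 85, 63, 61, 61, 59)
)

# Stripped-down theme: start from theme_bw() and blank out the decoration
theme_slopegraph <- theme_bw() +
  theme(
    panel.grid   = element_blank(),
    panel.border = element_blank(),
    axis.title   = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks   = element_blank()
  )

first_year <- subset(df, year == min(year))

p <- ggplot(df, aes(x = year, y = ypos, group = group)) +
  geom_line(colour = "grey60") +                    # lines linking each group
  geom_point(colour = "white", size = 8) +          # white background points
  geom_text(aes(label = value), size = 3) +         # numeric value labels
  geom_text(data = first_year, aes(label = group),  # group labels on the left
            hjust = 1, nudge_x = -1, size = 3) +
  scale_x_continuous(breaks = unique(df$year),      # widen the scale for labels
                     limits = c(min(df$year) - 4, max(df$year) + 1)) +
  theme_slopegraph
```

The point sizes, nudges and scale limits are guesses that will need tuning for real data and for the size of the output device.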
The final result is below. It’s kind of a big plot, so click to see it at full size.
I’ve put the code and data up on Github so feel free to fork it and tweak things to your liking.