Slopegraphs in R

EDIT: A new and improved version of this code, fixing the issues mentioned in the comments, is available on Github.

The internet seems abuzz this week with the “discovery” of a long-lost Edward Tufte plot type: the slopegraph. The full story is explained over at Charlie Park’s website and you can catch some of the debate about whether this is actually a novel plot type at Andrew Gelman’s blog, Metafilter or on Twitter.

But whatever you want to call it, there’s no denying that these plots are an elegant and concise way to display trends in grouped data. In this post, I’ll show you an attempt I’ve made at creating a slopegraph in R using ggplot. Here’s the original figure from Tufte’s book, Beautiful Evidence, to give you an idea of what we’re aiming for.

Edward Tufte's slopegraph of cancer survival rates


First, we need to get some data. The basic structure of a data set for this kind of plot is a long table with one column each for the group, the measurement year, and the measured value. In Tufte’s plot, the groups are various types of cancer, and the measure is the estimated survival rate (in %) taken at 5-year intervals. You can find the raw data here, but I’ve already arranged it in the necessary format for you in the Github project.
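As a rough illustration of that long format, here is a minimal sketch of such a table in base R. The column names `group`, `year` and `value` are chosen to match the plotting call used later in this post; the group names and numbers are invented placeholders, not Tufte's data, and the object names `survival`, `wide` and `long` are hypothetical.

```r
# Hypothetical long-format table: one row per group per measurement year.
# The values are made up for illustration; they are not the cancer data.
survival <- data.frame(
  group = rep(c("Group A", "Group B", "Group C"), each = 4),
  year  = rep(c(5, 10, 15, 20), times = 3),
  value = c(96, 95, 94, 93,
            62, 61, 60, 58,
            61, 55, 52, 50),
  stringsAsFactors = FALSE
)

# If the data start out wide (one column per year), base R's reshape()
# can convert them to the long layout above.
wide <- data.frame(group = c("Group A", "Group B"),
                   y5 = c(96, 62), y10 = c(95, 61))
long <- reshape(wide, direction = "long",
                varying = c("y5", "y10"), v.names = "value",
                timevar = "year", times = c(5, 10), idvar = "group")
```

Each row of `survival` is a single observation, which is the shape ggplot2 expects when mapping `year` to x, `value` to y and `group` to the grouping aesthetic.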

A slopegraph is intended to “compare changes over time for a list of nouns located on an ordinal or interval scale”. It does this by plotting the relative rank of each group at a given point in time, the measured value, and a line indicating the change in value between two measurement points. At first glance, this should be a fairly easy plotting task, especially with ggplot2, and the naive code might look something like this:

ggplot(data, aes(x=year,y=value,group=group)) + geom_line()


However, this produces all sorts of nasty collisions, as there may be many labels plotted at similar values. Tufte solves these sorts of problems by creating his figures largely by hand (well, with Illustrator). This gives full control over the typography, spacing and so on, but it’s not very reproducible.

With R, we can automate the plotting process, although our aesthetic tweaks can only be approximate. What I’ve done is to add an offset between each group at a given year to ensure that a certain amount of minimum spacing is maintained. The algorithm isn’t perfect, as a value of x in a given year may be positioned higher than x + 1 in an adjacent year depending on the characteristics of each year’s data. (For a particularly ugly example, look at the liver and intrahepatic bile duct group in the plot below.) Tufte’s plot also has this characteristic, but he arranges the lines so that they never cross: clearer to read but potentially confusing when making horizontal comparisons.
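The spacing idea can be sketched roughly as follows. This is one simple way to enforce a minimum gap and not necessarily the exact algorithm in the Github code: within each year, values are sorted and any label sitting too close to the one below it is pushed up. The function name `space_values`, the `min_space` parameter and the example data are all assumptions made for illustration.

```r
# Push values apart within one year so that neighbouring labels sit at
# least `min_space` apart. `min_space` is a hypothetical tuning parameter.
space_values <- function(value, min_space = 2) {
  ord  <- order(value)          # process from the lowest value upwards
  ypos <- value[ord]
  for (i in seq_along(ypos)[-1]) {
    # Keep the true value unless it sits too close to the label below it
    ypos[i] <- max(ypos[i], ypos[i - 1] + min_space)
  }
  out <- numeric(length(value))
  out[ord] <- ypos              # restore the original row order
  out
}

# Apply the spacing separately within each year of an invented data set
df <- data.frame(
  group = rep(c("A", "B", "C"), times = 2),
  year  = rep(c(5, 10), each = 3),
  value = c(62, 61, 96, 55, 61, 95)
)
df$ypos <- ave(df$value, df$year, FUN = space_values)
```

In this toy data, group A's 62 in year 5 is nudged up to 63 to clear the 61 below it, while nothing moves in year 10. Because each year is spaced independently, equal values can end up at different heights in different columns, which is the imperfection described above.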

Once the data set is prepared, we need to plot it. The first thing to do here is to create a new ggplot theme which strips away most of the plot decoration. I did this by typing theme_bw at the command line, copying the resulting output, and modifying it to my needs. We can then make the plot using standard ggplot commands. The plot consists of (in plotting order):

  1. Lines linking each group
  2. White points to provide a background for each number
  3. Text labels for each numeric value
  4. Text labels for the groups
  5. A scale adjustment to ensure that the labels appear properly
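The layering above might be sketched like this. It is a simplified stand-in rather than the code from the Github project: it assumes a recent ggplot2 (3.3.0 or later, for `expansion()`), builds the bare theme by adding `theme()` overrides to `theme_bw()` instead of editing its printed source, and makes room for the labels by widening the x scale. The data, the `ypos` column and the names `theme_bare` and `first_year` are invented for the example.

```r
library(ggplot2)

# Invented data; `ypos` would normally come from a spacing step
df <- data.frame(
  group = rep(c("A", "B", "C"), times = 2),
  year  = rep(c(5, 10), each = 3),
  value = c(62, 61, 96, 55, 61, 95),
  ypos  = c(63, 61, 96, 55, 61, 95)
)

# A stripped-down theme: start from theme_bw() and blank out decoration
theme_bare <- theme_bw() +
  theme(panel.grid   = element_blank(),
        panel.border = element_blank(),
        axis.ticks   = element_blank(),
        axis.title   = element_blank(),
        axis.text.y  = element_blank())

# Group names are only drawn next to the first column
first_year <- subset(df, year == min(year))

p <- ggplot(df, aes(x = factor(year), y = ypos, group = group)) +
  geom_line(colour = "grey60") +                      # 1. lines per group
  geom_point(colour = "white", size = 8) +            # 2. white backing
  geom_text(aes(label = value), size = 3) +           # 3. numeric values
  geom_text(data = first_year, aes(label = group),    # 4. group names
            hjust = 1, nudge_x = -0.15, size = 3) +
  scale_x_discrete(expand = expansion(mult = c(0.4, 0.1))) +  # 5. label room
  theme_bare
```

Printing `p` draws the plot. The layers are added in the same order as the list, so the white points sit on top of the lines and underneath the numbers.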

The final result is below. It’s kind of a big plot, so click for a bigger version.

R version of slopegraph for cancer survival rates


I’ve put the code and data up on Github so feel free to fork it and tweak things to your liking.

18 thoughts on “Slopegraphs in R”

  1. Richie

    I really like this, but it’s a bit confusing the way the same number can appear at different heights. (This is particularly obvious in “Thyroid” near the top, where 95 appears higher than 96.)

    You could stick to using ranks, rather than actual values, similar to a parallel coordinates plot, or keep your original plot with the collisions, but have most of the lines greyed out, and just a few lines of interest highlighted.

  2. James Keirstead Post author

    Thanks Richie. The Tufte version has the different heights problem too (though less severe than in my version), so I think I like the idea of highlighting lines of interest best. It also has the advantage of increasing clarity when displaying large data sets. I can’t find the link right now but I remember someone doing a nice interactive version of this.

    Just noticed too that the ranks are only correct in the first column of the Tufte plot. Not so easy if you want to find out which cancer had the highest/lowest survival rates in any other year.

  3. Bob Muenchen

    Nice work! I think there’s still a bit of tweaking to do. In Tufte’s version, the final Thyroid value 95 appears even or slightly higher than the first 96 value due to (I think) an optical illusion caused by the lines dipping to the 3rd value, 94. If you lay a ruler across the string “Thyroid 96 96”, the final 95 is slightly below it. But on your version, that approach shows the 95 is actually above the 96. Even the first two 96s are not level. But you’re really close & I love the plot! It has been years since I read that book & I had forgotten all about it.

    Cheers,
    Bob

  4. Pingback: More on Presidential Rhetoric « YourMorals.Org Moral Psychology Blog

  5. Gaurav Kapoor

    Thanks for providing the plot and the code. I am trying to use it for plotting, but I am getting an error:
    Error in structure(list(axis.line = theme_blank(), axis.text.x = theme_text(family = base_family, :
    could not find function "unit"

    Would you know what’s causing it? Thank you.

  6. Stephen Kinsella

    Hi James,
    Thanks very much for creating this R code, I’ve used it several times and it has worked perfectly. I’m an economist, and would like to display the changes in Eurozone countries’ current accounts as percentages of GDP from 1997 to 2011. The slopegraph code isn’t working, I *think* because the code can’t handle negative numbers for values. Is this the case? If so I’ll work around.

  7. som

    hi james,
    quite a bit late in this post but i was wondering how to use different colours/line type for each entry. that might help to understand the present condition of the diseases compared to the beginning.

  8. James Keirstead Post author

    I don’t plan to add such a feature directly into my code, but you can easily add it yourself. Line 201 of slopegraph.r controls the plotting of the lines; you could just add a new aesthetic, e.g. aes(group=group, colour=group), to do this.

  9. Tom

    James,

    Awesome work; thank you for sharing your code! I have been looking for a good way to plot a small data set, and slopegraphs finally came to mind. Your code made this easy.

    After downloading the latest version on Github (December 2013), the first run threw a couple of errors. plyr and ggplot2 need to be loaded, and were not.

    I also would like to suggest plotting the labels on both the left and the right side of the plot. I accomplished this by modifying slopegraph.r:

    In theme_slopegraph(), don’t plot the y axis text:
    axis.text.y = element_blank(), # axis.text.y = element_text(size=rel(0.8)),

    and in plot_slopegraph(), replace scale_y_continuous() with
    + geom_text(data = subset(df, x == df$x[length(df$x)]),
    aes(x = factor(x), label = sprintf(" %s", group)),
    size = rel(2.8), hjust = 0) +
    geom_text(data = subset(df, x == df$x[1]),
    aes(x = factor(x), label = sprintf("%s ", group)),
    size = rel(2.8), hjust = 1)

    The above code was adapted from the slopegraph implementation of Bob Rudis at http://rud.is/b/2013/01/11/slopegraphs-in-r/

    Lastly, I have a data set with two x-axis factors where the Tufte method doesn’t correctly order the right-side data (the right side retains the rank order of the left side). The “spaced” method works correctly. I haven’t figured out, yet, why this isn’t plotting correctly, but I’d be happy to share the data if you’re interested.

    Thanks, again.

  10. Pingback: Slopegraphs | rCharts –> maybe finance versions

  11. David Montgomery

    Hey, this looks great. I am, however, having a few difficulties.

    1. I keep getting this when I try to run the code: “Error: Zero breaks in scale for y/ymin/ymax/yend/yintercept/ymin_final/ymax_final”
    2. A possible source of problem is that my spreadsheet has some missing data — I’m trying to slopegraph entries over four different years, but a few entries don’t exist in all the years. Will this cause problems? If so, is there a way around this?
    3. I am interested in creating a slopegraph in a slightly different setup than you have. Rather than display the values at each step, I’d rather track the categories. So if entry “Smith” is in rank 1 in the first year, rank 5 in the second year and rank 10 in the third year, I’d like the graph to show “Smith” as the label for each of Smith’s positions, rather than “1,” “5” and “10.” Is it possible to tweak your code to do this?

    I am unfortunately a relative novice at R — I’ve used it to accomplish a few tasks, usually by reading and altering tutorials. If you’re able to assist me in this I’d be very grateful. I realize this is a complicated series of questions…

    Thanks.
