Tag Archives: tufte

Worst infographic of 2013?

I know it’s only January, but this infographic from the Carbon Trust is a doozy.

Are businesses sleep-walking into a resource crunch?  Good luck answering that with this figure.

While the overall research on whether businesses are prepared for future resource scarcity is interesting, this figure falls down in all sorts of ways.

  • First we’ve got an hourglass metaphor wrecked by the fact that “now” (i.e. the pinch point in the glass) is actually 3–5 years in the future and the past sand includes “up to three years” in the future. If you’re going to use this kind of visual metaphor, at least give it some respect.
  • Next there are the percentages, which appear to represent a vertical distance, not the volume of sand or the width of the hourglass. Maybe that’s a pedantic point, but the amount of ink on the yellow 15% is not even close to the amount on the green 19%. Tufte may be an obsessive, but he’s got a point when he talks about the lie factor of a plot.
  • And finally, there’s a strange color scheme in which green goes from dark to light to dark again. So is dark green a good thing or not?

Slopegraphs in R

The internet seems abuzz this week with the “discovery” of a long-lost Edward Tufte plot type: the slopegraph. The full story is explained over at Charlie Park’s website and you can catch some of the debate about whether this is actually a novel plot type at Andrew Gelman’s blog, Metafilter or on Twitter.

But whatever you want to call them, there’s no denying that these plots are an elegant and concise way to display trends in grouped data. In this post, I’ll show you an attempt I’ve made at creating a slopegraph in R using ggplot. Here’s the original figure from Tufte’s book, Beautiful Evidence, to give you an idea of what we’re aiming for.

Edward Tufte's slopegraph of cancer survival rates

First, we need to get some data. The basic structure of a data set for this kind of plot is a long table with one column each for the group, the measurement year, and the measurement value. In Tufte’s plot, the groups are various types of cancer, and the measure is the estimated survival rate (in %) taken at 5-year intervals. You can find the raw data here but I’ve already arranged it in the necessary format for you in the GitHub project.
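To make that long format concrete, here is a minimal sketch of such a table. The group names and values are made-up placeholders, not the cancer survival figures, and the column names simply follow the naive ggplot call used later in the post.

```r
# Placeholder data in long format: one row per group per measurement year.
# These values are invented for illustration; they are NOT Tufte's figures.
survival <- data.frame(
  group = rep(c("Group A", "Group B"), each = 4),
  year  = rep(c(5, 10, 15, 20), times = 2),
  value = c(90, 85, 80, 75, 60, 50, 45, 40)
)

str(survival)
```

Each line in the finished slopegraph corresponds to the rows sharing one value of group.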

A slopegraph is intended to “compare changes over time for a list of nouns located on an ordinal or interval scale”. It does this by plotting the relative rank of each group at a given point in time, the measured value, and a line indicating the change in value between two measurement points. At first glance, this should be a fairly easy plotting task especially with ggplot2 and the naive code might look something like this:

ggplot(data, aes(x=year,y=value,group=group)) + geom_line()

However, this produces all sorts of nasty collisions, as there may be many labels plotted at similar values. Tufte solves these sorts of problems by creating his figures largely by hand (well, with Illustrator). This gives full control over the typography, spacing and so on, but it’s not very reproducible.

With R, we can automate the plotting process although our aesthetic tweaks can only be approximate. What I’ve done is to add an offset between each group at a given year to ensure that a certain amount of minimum spacing is maintained. The algorithm isn’t perfect: because the spacing depends on the characteristics of each year’s data, a value of x in a given year may be positioned higher than x + 1 in an adjacent year. (For a particularly ugly example, look at the liver and intrahepatic bile duct group in the plot below.) Tufte’s plot also has this characteristic, but he arranges the lines so that they never cross: clearer to read but potentially confusing when making horizontal comparisons.
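One way such an offset could work is sketched below. This is an illustration of the general idea, not the code from the Github project: the function name space_out and the min_gap parameter are hypothetical, and the pass only ever pushes labels upwards.

```r
# Sketch of a minimum-spacing pass for one year's values.
# `space_out` and `min_gap` are hypothetical names used for illustration.
space_out <- function(values, min_gap = 2) {
  ord <- order(values)        # work from the lowest value to the highest
  pos <- values[ord]
  for (i in seq_along(pos)[-1]) {
    # Push a label up if it sits closer than min_gap to the one below it
    pos[i] <- max(pos[i], pos[i - 1] + min_gap)
  }
  pos[order(ord)]             # return positions in the original order
}

# Made-up values in which three labels would otherwise collide
space_out(c(10, 11, 30, 10.5), min_gap = 2)
```

Applied year by year, for example with ave(d$value, d$year, FUN = space_out) on a long-format data frame d, each year is spaced independently. That independence is what allows the same value to land at different heights in adjacent years.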

Once the data set is prepared, we need to plot it. The first thing to do here is to create a new ggplot theme which strips away most of the plot decoration. I did this by typing theme_bw at the command line, copying the resulting output, and modifying it to my needs. We can then make the plot using standard ggplot commands. The plot consists of (in plotting order):

  1. Lines linking each group
  2. White points to provide a background for each number
  3. Text labels for each numeric value
  4. Text labels for the groups
  5. A scale adjustment to ensure that the labels appear properly
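The five layers above might be assembled roughly as follows. This is a sketch under several assumptions: it uses a current ggplot2 release with theme() rather than a hand-edited copy of theme_bw, the data are made-up placeholders, and ypos stands in for the offset-adjusted label position.

```r
library(ggplot2)

# Placeholder long-format data; values are invented for illustration.
d <- data.frame(
  group = rep(c("Group A", "Group B", "Group C"), each = 2),
  year  = rep(c(5, 10), times = 3),
  value = c(90, 84, 62, 70, 30, 22)
)
d$ypos <- d$value  # assumption: would hold the spaced-out positions

p <- ggplot(d, aes(x = year, y = ypos, group = group)) +
  geom_line(colour = "grey60") +                    # 1. lines linking each group
  geom_point(colour = "white", size = 8) +          # 2. white backing for numbers
  geom_text(aes(label = value), size = 3) +         # 3. numeric value labels
  geom_text(data = d[d$year == min(d$year), ],      # 4. group labels on the left
            aes(label = group), hjust = 1, nudge_x = -0.6, size = 3) +
  scale_x_continuous(breaks = unique(d$year),       # 5. extra room for labels
                     limits = c(min(d$year) - 3, max(d$year) + 1)) +
  theme_bw() +
  theme(panel.grid = element_blank(),               # strip the plot decoration
        panel.border = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank())
```

Printing p draws the plot. The layer order matters because the white points have to be drawn over the lines but under the numbers.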

The final result is below. It’s kind of a big plot, so click for a bigger version.

R version of slopegraph for cancer survival rates

I’ve put the code and data up on Github so feel free to fork it and tweak things to your liking.

Lie factors and airline seating

I’ve recently been reading Edward Tufte’s Visual Display of Quantitative Information. It introduces several interesting metrics for the analysis of data graphics such as the data-ink ratio and my personal favourite, the lie factor.

The lie factor is defined as the size of the effect shown in a graphic divided by the effect shown in the data. A great example of this comes from FlyZoom’s website, which describes the difference between standard and premium seating. Seats in standard class are set at a 31″ pitch while those in premium are at 36″ pitch. However this is the graphic they use to illustrate it:

Zoom seat pitch

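Measuring the figure, the 31″ pitched seats just happen to be 31 pixels apart (i.e. back of one seat to the back of the next). However, the 36″ seats are 43 pixels apart. Since both the data and the graphic start from a baseline of 31, the relative changes share a denominator and the lie factor reduces to (43-31)/(36-31) = 2.4.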
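For the general case, where the graphic and the data don’t share a baseline, the calculation can be sketched in a few lines of R. The function names are my own for illustration; the effect size is taken as the change relative to the starting value, as in Tufte’s worked examples.

```r
# Relative size of an effect: the change as a fraction of the starting value
effect_size <- function(first, second) abs(second - first) / first

# Lie factor = effect shown in the graphic / effect present in the data
lie_factor <- function(graphic_first, graphic_second, data_first, data_second) {
  effect_size(graphic_first, graphic_second) /
    effect_size(data_first, data_second)
}

# Seat pitch: 31 px -> 43 px in the graphic, 31 in -> 36 in in the data
lie_factor(31, 43, 31, 36)  # 2.4
```

A lie factor of 1 means the graphic represents the data faithfully; anything well above 1 exaggerates the effect.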

Actual airline seating

Doesn’t look quite so comfortable now, does it?