Friday, January 7, 2011

Rules of Thumb for Visualization

If you're going to make a visualization of any data (baseball or otherwise), here are a few recommendations to make it more effective.

#1 Know Your Story

Every graph should tell a story. If it doesn't, then why are you making it? What story are you trying to tell? What do you want the reader to get out of it? If you aren't clear about what the reader should take away from the graph, it's usually time to think about what you're trying to express before you try to graph it. So for this example, I am going to compare the three career home run leaders of all time.

#2 Pick an Appropriate Graph Type

There are three main graph types:
- Bar Graphs
- Line Graphs
- Pie Charts

Bar Graphs are a good default. 98% of the time, if you're comparing things, a bar graph is a good place to start. When you use bar graphs, there is one major rule that you have to follow: always start the value axis at 0. We judge the categories by the height of the bars, so if the axis doesn't start from 0, the heights no longer match the values and the comparison gets warped. Here's an example:

Wow! Bonds DESTROYED Ruth's record, didn't he? Only he didn't. Here's the same graph with the axis starting at 0:

Tells quite a different story, doesn't it?
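To put a number on how much a truncated axis warps things, here is a minimal Python sketch (not how the graphs in this post were made; those came from Excel). It uses the career totals for Bonds, Aaron, and Ruth. The axis start of 700 is my assumption for illustration, and the function names are made up.

```python
# Career home run totals for the three all-time leaders.
CAREER_HR = {"Bonds": 762, "Aaron": 755, "Ruth": 714}


def drawn_height(value, axis_min=0):
    """Height of a bar as drawn when the value axis starts at axis_min."""
    if value < axis_min:
        raise ValueError("value falls below the start of the axis")
    return value - axis_min


def apparent_ratio(a, b, axis_min=0):
    """How many times taller bar a looks than bar b on that axis."""
    return drawn_height(a, axis_min) / drawn_height(b, axis_min)


bonds, ruth = CAREER_HR["Bonds"], CAREER_HR["Ruth"]
print(f"axis from 0:   {apparent_ratio(bonds, ruth):.2f}x")                # 1.07x
print(f"axis from 700: {apparent_ratio(bonds, ruth, axis_min=700):.2f}x")  # 4.43x
```

On an axis that starts at 0, Bonds's bar is about 7% taller than Ruth's, which is the real difference. Start the axis at 700 and the same bar looks more than four times as tall.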

One of the times you don't want to use bar graphs is when you're talking about something that happened over time. For instance, if you are going to look at how many home runs each of those players had by age, a line graph is a better bet:

If I do it with a bar graph, it gets messy, and harder to read:

Pie charts are the oddball of the group. Humans don't judge areas of circles too well, so pie charts are really only good if you're dealing with 2-3 categories, so that it's easy to eyeball. And the percentages always have to add up to 100%. For instance, if we want to look at how often a player strikes out, walks, or hits the ball, a pie graph is a mighty fine choice:
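The arithmetic behind a pie like that is simple: every plate appearance has to land in exactly one slice. Here is a rough Python sketch, assuming you have raw counts for one player. The counts below are made up for illustration (they are not any real player's numbers), and the slice names are my own.

```python
def shares(counts):
    """Convert raw counts into percentages of the whole."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("need at least one plate appearance")
    return {name: 100 * n / total for name, n in counts.items()}


# Hypothetical counts for one player, made up for illustration.
plate_appearances = 600
strikeouts, walks = 90, 120
counts = {
    "Strikeouts": strikeouts,
    "Walks": walks,
    # Everything else goes in one slice so the pie covers every plate appearance.
    "Everything else": plate_appearances - strikeouts - walks,
}

for name, pct in shares(counts).items():
    print(f"{name}: {pct:.1f}%")
```

Lumping the leftovers into a single slice is what keeps the total at 100% without adding more categories than a reader can eyeball.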

#3 Don't Overdo It

Sometimes you have a lot of data, and you get tempted to try to put it all on one graph. Let's say you want to show how many home runs each player got along with how many plate appearances he had each season. You could throw them on the same graph like this:

Now the problem is that the message gets muddied. What exactly are you trying to say with that added data? If you want to show the difference in PA/HR by age, then maybe it's a better bet to make a second graph that focuses just on that:

More graphs aren't necessarily a bad thing if they help you tell your story better.
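If you go the second-graph route, the PA/HR number has to be worked out before you can plot it. A minimal Python sketch, assuming you have plate appearances and home runs for each season; the seasons below are made up for illustration, and `pa_per_hr` is just a name I picked.

```python
# Hypothetical seasons for one player: (age, plate appearances, home runs).
# The numbers are made up for illustration.
seasons = [(25, 600, 30), (26, 650, 26), (27, 580, 0)]


def pa_per_hr(pa, hr):
    """Plate appearances per home run, or None for a season with no home runs."""
    return pa / hr if hr else None


for age, pa, hr in seasons:
    rate = pa_per_hr(pa, hr)
    label = f"{rate:.1f}" if rate is not None else "n/a"
    print(age, label)
```

Lower is better here, since it means fewer plate appearances between home runs. A season without a home run has no defined rate, so the sketch leaves a gap rather than inventing a value.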

#4 Focus the Message

When you create a chart in Excel, the default settings really make it ugly. Look at the first graph I created using Excel's defaults:

Here are some of the problems:
  • The background is too dark
  • The gridlines are too strong
  • The axis is messed up (it doesn't start from zero!)
  • The title and legend duplicate information
  • The player names are tiny
  • The bar graphs all have shadows for some reason

What is the story of this graph? It shows how many home runs the three home run leaders have. So we want to focus on those three blue bars, who hit them, and what they mean. So I eliminated the background of the graph, fixed the axis, lightened the gridlines, gave it a proper title, and made the names of each player more visible:

#5 Add Color (where appropriate)

As you may have noticed, most of my graphs use the same color. When we look at a graph and see a lot of color, we assume that the color means something. So color is another tool to help us tell the story. If I made the three bars in the career home run graphs different colors, our brains would tell us "Hey, the bars must be different colors for a reason!" and would spend a little time trying to figure out what the color is telling us.

When there's a reason to add color (for instance, in the pie graph), it's a good idea to be consistent and to make sure that the color adds to the story rather than distracting from it. I used red for strikeouts because they are "bad", blue for walks because they are "good", and grey for the rest because they are neutral (and not part of the story, but need to be there to make sure the pies add up to 100%).
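One way to stay consistent is to write the color choices down once and reuse them for every pie. Here is a small Python sketch of that idea. The colors are the ones described above; the category names and the `slice_colors` helper are my own assumptions.

```python
# Colors follow the post: red for the "bad" outcome, blue for the "good"
# one, grey for anything that is not part of the story.
STORY_COLORS = {"Strikeouts": "red", "Walks": "blue"}
NEUTRAL = "grey"


def slice_colors(categories):
    """Pick one color per category; unlisted categories stay neutral."""
    return [STORY_COLORS.get(name, NEUTRAL) for name in categories]


print(slice_colors(["Strikeouts", "Walks", "Everything else"]))
# ['red', 'blue', 'grey']
```

Because anything not named in the mapping falls back to grey, a new category never picks up a color that the reader would have to decode.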

#6 Resist the Urge to Pretty-up the Graph with Chart Junk

A lot of the graphs we see on a daily basis are prettied up. People add 3D effects, drop shadows, gradients, etc. But those things don't typically add to the story we're trying to tell. They just "look cool" so people want to use them. Adding chart junk to your graphs is like putting a spoiler on a Honda Civic -- maybe it looks like it should go faster, but it's really just weighing the car down.

For instance, let's say I get the urge to pretty-up my pie graphs. What was pretty obvious before suddenly becomes a lot less obvious:

It's a lot harder to compare the slices in 3D, because 3D graphs distort data. We aren't so good at adjusting for perspective in our heads, so we lose a lot of our ability to analyze the data. And the gradients add absolutely nothing but headaches, since we lose the center of the grey slice, making it even harder to figure out the angle (and with the angle, the size of the slices).

These are just the basics, with a really simple example. The above suggestions are just that -- suggestions. The most important thing is to keep the message at the center of the graph. Find what works best with your audience, and displays the data best, and you'll do fine.

All of the above examples were created using Microsoft Excel. Graphs from Excel can be copy-pasted into a vector graphics editor like Adobe Illustrator or Inkscape for better control over how they look.


  1. I had a post on this issue as well a while back. One suggestion I think is the most important is to label your axes and be sure to tell me what I'm looking at. I've seen some awesome visualizations where I don't know what it is showing me.

    The bar/line graphs are straightforward given your title(s) above, so those aren't a huge worry. But I always find it a useful exercise to assume the reader knows absolutely nothing about the picture I'm showing them, other than what I explicitly say.

    Thanks for the post. These are always instructive. Now, if I could only get my undergrads to understand this stuff.......

  2. You're right -- Axis labels are very important if they aren't immediately obvious. It's a delicate balance, but as you said, assuming the viewer knows absolutely nothing is a good place to start.

  3. Good post. My only issue is your description of pie charts as a "mighty fine" choice. A pie chart is okay for a qualitative comparison, but not a more detailed quantitative comparison. And you've created the only thing worse than a pie chart, according to Edward Tufte: several of them. In your pies, I can tell that Aaron had about the same pctg of K as BB, and Ruth and Bonds had about the same pctg of K. If you'd provided a bar chart, with one category per player, one series per quantity compared, and the value axis in percent, we could not only see the data qualitatively, we could actually compare the values with some precision.

    Logging into Beyond the Box Score with my Facebook account failed, so I couldn't reply there. I also wanted to take issue with the comment that truncating a bar chart is okay (it isn't; our perception of the bar lengths overrules the axis labels) and with the comment claiming that pies are good for any number of categories (his example pies have so much decoration that they are nearly useless).

  4. Jon, first off, love your blog and have learned tons from it (but clearly not enough!). Yeah, I probably screwed up that graph, and there are better ways to do it. I'll be sure to rectify it the next time I put something up.