## Tuesday, January 11, 2011

### Rules of Thumb for Complex Visualization

Last time I discussed some simple graphing techniques that make visualizations easier to grasp. This time I want to look at some slightly more advanced techniques for graphing more complicated data.

Basic graphs usually tackle only a few pieces of information:

Bar Graphs:

#1: Players
#2: Home Runs

Pie Graphs:

#1: Players
#2: Plate Appearance Outcomes

Both of these graphs are simple. The players are constants (they are just labels), so they don't need any special treatment, and we're really only graphing a single variable in each.

Line Graphs:

#1: Players
#2: Home Runs
#3: Over Time

This one is slightly more complicated, but we are all so used to seeing graphs over time that most of us wouldn't find it hard to read. Time varies, but only in a predictable way (we all know when tomorrow will be).

Often we have to deal with more complicated data sets. Instead of graphing a single unpredictable variable, what if we have to graph two of them?

To give a simple baseball example, let's say you want to see how many wins each team got in 2010, and how much money they paid in salaries.

There are two main ways to do it. You can give each point its own color and explain the colors in a legend, or you can use the same color for every point and label the points directly on the chart.

Here are examples:

Looks like confetti, doesn't it?

With this much data, there are simply too many points to give each one its own distinguishable color, so if the point is to show each team, it's probably better to go with labels. Note that the labels are small and a lighter color. If I made them black, they would draw the eye away from the data and toward the labels, which would make it harder to see the trends.

So when is color good? Let's say you wanted to make this chart to show the differences between AL and NL teams. That would be great for color.

The benefit of adding color here is that it tells a story. The AL has a much wider gap between the top teams in salary and the bottom teams. Boston and New York are just way ahead of the group, and the top NL teams in payroll are between #2 and #3 in the AL. Clearly there's a gap between the leagues.

You can adjust the colors to tell whatever story you want. Let's say we just want to focus on the AL East and how absurdly unbalanced they are.

I just made the AL East jump out by making them really dark, and all the other teams grey. This tells the story of how incredibly well Toronto and Tampa did considering their financial disadvantage. It also shows the huge gap between the "Haves" and the "Have Nots" in the game's most competitive division.
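If you build your charts in code, the "dark for the story, grey for everything else" trick comes down to a tiny lookup. Here is a minimal sketch in plain Python. The function name `point_color` and the two hex colors are my own choices for illustration, not anything taken from the charts above.

```python
# The five AL East clubs; every other team is treated as background.
AL_EAST = {"Yankees", "Red Sox", "Rays", "Blue Jays", "Orioles"}

HIGHLIGHT = "#1a1a1a"   # near-black for the teams the story is about (assumed value)
BACKGROUND = "#c8c8c8"  # light grey for everyone else (assumed value)


def point_color(team, focus=AL_EAST):
    """Return a dark color for teams in the focus group, grey otherwise."""
    return HIGHLIGHT if team in focus else BACKGROUND


teams = ["Yankees", "Rays", "Twins", "Phillies"]
colors = [point_color(t) for t in teams]
print(colors)
```

The resulting list can be handed to whatever plotting tool you use as the per-point color. Passing a different `focus` set tells a different story with the same data.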

But let's say we want to make this even more complicated. Let's say we want to not only show data for one year, but show data for 3 years. If we throw all that data on a single graph, we get a giant mess.

Too many labels, no real trending, just a giant blotch of stuff. It becomes pretty hard to tell what is what. So we need to really focus on what we want to say rather than just dumping data on a graph.

For instance, if we want to look at AL vs. NL payroll and wins from 2008-2010, we can do that. But time doesn't really come into the picture if we do it this way. It just looks like a giant mess of dots, and doesn't help us see any trends.

One way we can make things a bit clearer is by changing our colors around. The more recent the year, the more solid we make the color -- that way we can see trending visually over time.
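One way to get "more recent means more solid" in code is to map each year to an opacity. This is only a sketch: the function name `year_alpha` and the 0.3 floor for the oldest year are assumptions I picked for illustration, so adjust them until the chart reads well.

```python
def year_alpha(year, first_year, last_year, min_alpha=0.3):
    """Map a year to an opacity: min_alpha for the oldest year, 1.0 for the newest."""
    if last_year == first_year:
        return 1.0  # only one year of data, so draw it fully solid
    # Fraction of the way from the oldest year (0.0) to the newest (1.0).
    fraction = (year - first_year) / (last_year - first_year)
    return min_alpha + (1.0 - min_alpha) * fraction


# Opacity for each season in the 2008-2010 range.
alphas = {year: year_alpha(year, 2008, 2010) for year in (2008, 2009, 2010)}
for year, alpha in alphas.items():
    print(year, round(alpha, 2))
```

Keeping the floor well above zero matters. If the oldest year fades out completely, you have removed the data rather than de-emphasized it.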

For another example, if we just want to see where the AL East is flying around to in the grand scheme of things, we can color them in, connect the dots, and give an idea of how the division is trending.

That shows us a bit more. Baltimore and Toronto are cutting salaries but have improved their win totals in 2010. Tampa Bay is spending more and more each year. The Yankees are essentially treading water. And the Red Sox made a huge jump in salary for 2010.
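Connecting the dots means grouping the points by team and putting each team's seasons in order before drawing a line through them. Here is a rough sketch of that step. The team names and all of the payroll and win numbers below are invented placeholders, not real figures, and `trajectories` is just a name I made up.

```python
from collections import defaultdict

# Placeholder records: (team, year, payroll in $ millions, wins).
# These numbers are invented for illustration, not real figures.
records = [
    ("Team A", 2010, 95.0, 88),
    ("Team A", 2008, 80.0, 81),
    ("Team B", 2009, 60.0, 84),
    ("Team A", 2009, 90.0, 85),
    ("Team B", 2008, 45.0, 97),
    ("Team B", 2010, 72.0, 96),
]


def trajectories(rows):
    """Group rows by team and return each team's (payroll, wins) path in year order."""
    by_team = defaultdict(list)
    for team, year, payroll, wins in rows:
        by_team[team].append((year, payroll, wins))
    # Sorting the tuples orders each team's points by year (the first element).
    return {
        team: [(payroll, wins) for _, payroll, wins in sorted(points)]
        for team, points in by_team.items()
    }


paths = trajectories(records)
for team, path in paths.items():
    print(team, path)
```

Each path is a list of (payroll, wins) points in season order, so drawing a line through them shows which way a team is heading.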

The other teams are all grey, and they aren't color coded by year. That is a judgment call. Is that information important to your story? In fact, if you only care about the AL East teams, you could even remove all of the excess dots and show just the AL East information. This shows how the division is working internally without cluttering it up with all those extra details.

Another alternative is to make them even less visible so that the AL East data stands out more, but the other information is still out there.

This all comes back to the first rule of graphing: what story are you trying to tell? You should leave in information that is important to your story, and take out anything that doesn't matter. Personally I think that showing Toronto as a more-or-less middle of the pack team, Baltimore as among the perennial losers, and Tampa Bay as the thrifty winners is useful, so I like the last graph best. That tells the story of the AL East best in my mind. But always decide for yourselves.

The basic lesson here is that you can use color to add extra information to a graph, but if you have too much information jumbled together, no amount of color will save you.

Think about your message, make sure that the message is the most obvious thing in the graph, and play around until you find something that works well for you.

If you want to encode more data than this, you need to move into the world of interactive graphing, like Gapminder.org.

But that's a story for another day...

As a side note, I will be on vacation in the US for the next couple of weeks, so there won't be many blog posts here. I'll try to check in regularly on Twitter and e-mail and the like, and I have two scheduled columns going up at Fangraphs each Friday.