変化球　Henkakyuu: 2011

Friday, February 4, 2011

Top 500 Position Players by WAR

My series on the top 500 position players by WAR is finished and up on FanGraphs. If you missed it and it interests you, please take a minute to check it out.

Thursday, January 27, 2011

A Correction for Jon Peltier

The way the web works is funny. I have been reading Jon Peltier's blog for several years now, and much of what I know about Excel and about the principles of design comes from his blog (and ones like it). I was shocked when he commented on my post about basic principles of data visualization for baseball. I was possibly more shocked when I read his comment and realized that he was quite right: I'd made a charting faux pas.

Here is the chart he takes issue with:

Jon doesn't like pie charts much. In the data visualization community, there's been a lot of back and forth over whether pie charts are useful. Since they are used everywhere, people are familiar with them, so I look on them a bit more favorably than Jon does, but that wasn't the essence of his complaint.

You should never use multiple pie charts if you're going to compare them. It's just not easy to compare the size of slices across multiple pie charts.

So instead, Jon recommends using bar charts, like this:

I have to admit it looks a lot better, and it is a lot easier to compare the various players this way. I still think pie charts are okay if you keep them simple, but I didn't follow my own advice and keep them simple enough. This is probably closer to the proper simplicity:

Tuesday, January 11, 2011

Rules of Thumb for Complex Visualization

Last time I discussed some simple graphing techniques that make visualizations easier to grasp. This time I want to discuss some slightly more advanced techniques for graphing more complicated data.

Basic graphs usually tackle only a few pieces of information:

Bar Graph:

#1: Players
#2: Home Runs

Pie Graphs:

#1: Players
#2: Plate Appearance Outcomes

Both of these graphs were simple. The players are constants, so they don't really need much special treatment (they are just labels), so we're really only graphing a single variable for each.

Line Graphs:

#1: Players
#2: Home Runs
#3: Over Time

This one is slightly more complicated, but we are all so used to seeing graphs over time that most of us wouldn't find it hard to read. Time varies, but only in a predictable way (we all know when tomorrow will be).

Lots of times we have to deal with more complicated data sets. Instead of just graphing a single unpredictable variable, what if we have to graph two of them?

To give a simple baseball example, let's say you want to see how many wins each team got in 2010, and how much money they paid in salaries.

There are two main ways you can do it. You can label the various points with colors and leave a legend, or you can label the points on the chart and use the same color.

Here are examples:

Looks like confetti, doesn't it?

With this much data, there are just too many teams to give each one its own color, so if the point is to show each team, it's probably better to go with labels. Note that the labels are a lighter color and pretty small -- if I made them black, they would draw the eye away from the data and toward the labels, and that would make it harder to see the trends.

So when is color good? Let's say you wanted to make this chart to show the differences between AL and NL teams. That would be great for color.

The benefit of adding color here is that it tells a story. The AL has a much wider gap between the top teams in salary and the bottom teams. Boston and New York are just way ahead of the group, and the top NL teams in payroll are between #2 and #3 in the AL. Clearly there's a gap between the leagues.

You can adjust the colors to tell whatever story you want. Let's say we just want to focus on the AL East and how absurdly unbalanced they are.

I just made the AL East jump out by making them really dark, and all the other teams grey. This tells the story of how incredibly well Toronto and Tampa did considering their financial disadvantage. It also shows the huge gap between the "Haves" and the "Have Nots" in the game's most competitive division.
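If you build charts in code rather than Excel, the same highlighting trick comes down to picking one of two colors per point. Here is a minimal Python sketch; the hex colors and the `point_colors` helper are my own illustrative choices, not what was used for the chart above.

```python
# The five AL East clubs in 2010.
AL_EAST = {"Yankees", "Red Sox", "Rays", "Blue Jays", "Orioles"}

HIGHLIGHT = "#1f3a5f"  # dark color for the teams the story is about (assumed value)
MUTED = "#c8c8c8"      # light grey for everyone else (assumed value)


def point_colors(teams, highlighted=AL_EAST):
    """Return one color per team: dark if highlighted, grey otherwise."""
    return [HIGHLIGHT if team in highlighted else MUTED for team in teams]


# A few teams, in the same order as the data points they label.
teams = ["Yankees", "Twins", "Rays", "Padres"]
colors = point_colors(teams)
print(list(zip(teams, colors)))
```

The resulting list lines up one-to-one with the data points, so it can be handed to whatever per-point color option your charting tool offers.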

But let's say we want to make this even more complicated. Let's say we want to not only show data for one year, but show data for 3 years. If we throw all that data on a single graph, we get a giant mess.

Too many labels, no real trending, just a giant blotch of stuff. It becomes pretty hard to tell what is what. So we need to really focus on what we want to say rather than just dumping data on a graph.

For instance, if we want to take a look at AL vs. NL from 2008-2010 in payroll and wins, we can do that. But time doesn't really come into the picture if we do it this way. It just looks like a giant mess of dots, and doesn't help us see trends.

One way we can make things a bit clearer is by changing our colors around. The more recent the year, the more solid we make the color -- that way we can see trending visually over time.
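One simple way to turn "more recent means more solid" into numbers is a straight-line mapping from year to opacity. This is only a sketch: the `year_opacity` helper and the 0.3 floor for the oldest year are assumptions for illustration, not the values used in the chart here.

```python
def year_opacity(year, first_year, last_year, faintest=0.3):
    """Map a year to an opacity: `faintest` for first_year, 1.0 for last_year."""
    if last_year == first_year:
        return 1.0  # only one year of data, so draw it fully solid
    position = (year - first_year) / (last_year - first_year)  # 0.0 to 1.0
    return faintest + (1.0 - faintest) * position


for year in (2008, 2009, 2010):
    print(year, round(year_opacity(year, 2008, 2010), 2))
```

Keeping the floor well above zero means the oldest season is still visible, just quieter than the newest one.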

For another example, if we just want to see where the AL East is flying around to in the grand scheme of things, we can color them in, connect the dots, and give an idea of how the division is trending.

That shows us a bit more. Baltimore and Toronto are cutting salaries but have improved their win totals in 2010. Tampa Bay is spending more and more each year. The Yankees are essentially treading water. And the Red Sox made a huge jump in salary for 2010.

The other teams are all grey, and they aren't color coded by year. That is a judgment call. Is that information important to your story? In fact, if you only care about the AL East teams, you could even remove all of the excess dots and show just the AL East information. This shows how the division is working internally without cluttering it up with all those extra details.

Another alternative is to make them even less visible so that the AL East data stands out more, but the other information is still out there.

This all comes back to the first rule of graphing: what story are you trying to tell? You should leave in information that is important to your story, and take out anything that doesn't matter. Personally I think that showing Toronto as a more-or-less middle of the pack team, Baltimore as among the perennial losers, and Tampa Bay as the thrifty winners is useful, so I like the last graph best. That tells the story of the AL East best in my mind. But always decide for yourselves.

The basic lesson here is that you can use color to add extra information to a graph, but if you have too much information jumbled together, no amount of color will save you.

Think about your message, make sure that the message is the most obvious thing in the graph, and play around until you find something that works well for you.

If you want to encode more data than this, you need to move into the world of interactive graphing like Gapminder.org.

But that's a story for another day...

References:

Win Data from Baseball Reference
2010 Salary Data from CBS Sports
2008-2009 Salary Data from About.com

As a side note, I will be on vacation in the US for the next couple of weeks, so there won't be many blog posts here. I'll try to check in regularly on Twitter, e-mail, and the like, and I have two scheduled columns going up at FanGraphs, one each Friday.

Friday, January 7, 2011

Rules of Thumb for Visualization

If you're going to make a visualization of any data (baseball or otherwise), here are a few recommendations to make it more effective.

#1 Know Your Story

Every graph should tell a story. If it doesn't, then why are you making it? What story are you trying to tell? What do you want the reader to get out of it? If you aren't clear about what the reader should take away from the graph, it's usually time to think about what you're trying to express before trying to make a graph of it. So for this example, I am going to compare the three career home run leaders of all time.

#2 Pick an Appropriate Graph Type

There are three main graph types:
- Bar Graphs
- Line Graphs
- Pie Charts

Bar Graphs are a good default. 98% of the time, if you're comparing things, a bar graph is a good place to start. When you use bar graphs, there is one major rule that you have to follow: always start the main axis at 0. We judge the categories by the height of the bar, so if you don't start from 0, it warps the data. Here's an example:

Wow! Bonds DESTROYED Ruth's record, didn't he? Only he didn't. Here's the same graph with the axis starting at 0:

Tells quite a different story, doesn't it?
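To put a number on the distortion, here is a small Python sketch. The career totals are the well-known figures (762, 755, and 714); the axis starting point of 700 is an assumption for illustration, not necessarily the value Excel picked in the chart above.

```python
# Career home run totals for the three all-time leaders.
CAREER_HR = {"Barry Bonds": 762, "Hank Aaron": 755, "Babe Ruth": 714}


def drawn_heights(values, axis_min):
    """Return the height each bar is drawn at when the axis starts at axis_min."""
    return {name: value - axis_min for name, value in values.items()}


def apparent_ratio(values, a, b, axis_min):
    """How many times taller bar `a` looks than bar `b` for a given axis start."""
    heights = drawn_heights(values, axis_min)
    return heights[a] / heights[b]


# An axis starting at 700 is an assumed value, chosen only for illustration.
truncated = apparent_ratio(CAREER_HR, "Barry Bonds", "Babe Ruth", axis_min=700)
honest = apparent_ratio(CAREER_HR, "Barry Bonds", "Babe Ruth", axis_min=0)

print(f"axis from 700: Bonds's bar looks {truncated:.2f}x as tall as Ruth's")
print(f"axis from 0:   Bonds's bar looks {honest:.2f}x as tall as Ruth's")
```

With the axis cut off at 700, Bonds's bar comes out more than four times the height of Ruth's, even though the real gap is under 7%.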

One of the times you don't want to use bar graphs is when you're talking about something that happened over time. For instance, if you are going to look at how many home runs each of those players had by age, a line graph is a better bet:

If I do it with a bar graph, it gets messy, and harder to read:

Pie charts are the oddball of the group. Humans don't judge areas of circles too well, so pie charts are really only good if you're dealing with 2-3 categories, so that it's easy to eyeball. And the percentages always have to add up to 100%. For instance, if we want to look at how often a player strikes out, walks, or hits the ball, a pie graph is a mighty fine choice:
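The arithmetic behind a pie like that is simple enough to sketch in Python. The season line below is hypothetical, and `outcome_shares` is just an illustrative helper; the point is that lumping everything else into a third slice is what makes the percentages total 100%.

```python
def outcome_shares(strikeouts, walks, plate_appearances):
    """Split plate appearances into three pie slices, as percentages."""
    if strikeouts + walks > plate_appearances:
        raise ValueError("strikeouts + walks exceed plate appearances")
    other = plate_appearances - strikeouts - walks  # every remaining outcome
    counts = {"Strikeout": strikeouts, "Walk": walks, "Other": other}
    return {label: 100 * n / plate_appearances for label, n in counts.items()}


# Hypothetical season line, for illustration only.
shares = outcome_shares(strikeouts=100, walks=80, plate_appearances=600)
for label, pct in shares.items():
    print(f"{label:<10}{pct:5.1f}%")
```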

#3 Don't Overdo It

Sometimes you have a lot of data, and you get tempted to try to put it all on one graph. Let's say you want to show how many home runs each player got along with how many plate appearances he had each season. You could throw them on the same graph like this:

Now the problem is that the message gets muddied. What exactly are you trying to say with that added data? If you want to show the difference in PA/HR by age, then maybe it's a better bet to make a second graph that focuses just on that:

More graphs aren't necessarily a bad thing if they help you get your message across better.
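Deriving that second series is a one-line calculation per season. Here is a minimal Python sketch with made-up season rows (they are not any real player's numbers), and a `pa_per_hr` helper that is only illustrative:

```python
# Hypothetical (age, plate appearances, home runs) rows for one player.
SEASONS = [(25, 600, 30), (26, 650, 40), (27, 500, 0)]


def pa_per_hr(seasons):
    """Return {age: plate appearances per home run}, skipping homerless seasons."""
    return {age: pa / hr for age, pa, hr in seasons if hr > 0}


rates = pa_per_hr(SEASONS)
for age, rate in sorted(rates.items()):
    print(f"age {age}: one home run every {rate:.2f} PA")
```

Seasons with zero home runs are dropped rather than plotted, since the ratio is undefined there.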

#4 Focus the Message

When you create a chart in Excel, the default settings really make it ugly. Look at the first graph I created using Excel's defaults:

Here are some of the problems:

- The background is too dark
- The gridlines are too strong
- The axis is messed up (it doesn't start from zero!)
- The title and legend duplicate information
- The player names are tiny
- The bar graphs all have shadows for some reason

What is the story of this graph? It shows how many home runs the three home run leaders have. So we want to focus on those three blue bars, who hit them, and what they mean. So I eliminated the background of the graph, fixed the axis, lightened the grid lines, gave it a proper title, and made the names of each player more visible:

#5 Add Color (where appropriate)

If you'll notice, most of my graphs use the same color. When we look at a graph and see a lot of color, we assume that the color means something. So color is another tool to help us tell the story. If I made the three bars in the career home run graphs different colors, our brain would tell us "Hey, the bars must be a different color for a reason!" and it would spend a little time trying to figure out what the color is telling us.

When there's a reason to add color (for instance, in the pie graph) it's a good idea to be consistent and to make sure that the color adds to the story rather than distracting from it. I used red for strikeouts because they are "bad", blue for walks because they are "good", and grey for the rest because they are neutral (and not part of the story, but need to be there to make sure the pies add up to 100%).

#6 Resist the Urge to Pretty-up the Graph with Chart Junk

A lot of the graphs we see on a daily basis are prettied up. People add 3D effects, drop shadows, gradients, etc. But those things don't typically add to the story we're trying to tell. They just "look cool" so people want to use them. Adding chart junk to your graphs is like putting a spoiler on a Honda Civic -- maybe it looks like it should go faster, but it's really just weighing the car down.

For instance, let's say I get the urge to pretty-up my pie graphs. What was pretty obvious before suddenly becomes a lot less obvious:

It's a lot harder to compare the slices in 3D, because 3D graphs distort data. We aren't so good at adjusting perspective in our head, so we lose a lot of ability to analyze the data. And the gradients add absolutely nothing but headaches since we lose the center of the grey slice making it even harder to figure out the angle (and with the angle, the size of the slices).

These are just the basics, with a really simple example. The above suggestions are just that -- suggestions. The most important thing is to keep the message at the center of the graph. Find what works best with your audience, and displays the data best, and you'll do fine.

All of the above examples were created using Microsoft Excel. Graphs from Excel can be copy-pasted into a vector graphics editor like Adobe Illustrator or Inkscape for better control over how they look.

Tuesday, January 4, 2011

A Glimpse at Pitcher WAR

Happy New Year!

There has been a lot of hubbub by Adam Darowski over at Beyond the Box Score on Hall of Fame Wins Above Replacement, and how to better get an idea of peak value and career value for people who weight them differently.

You can see his work here and here (it is awesome and interactive).

Personally I believe a lot in the power of the eye and what it sees when it takes a look. Design a graphic using your brain, and let your eyes soak it in and let your gut come to a conclusion about the data. While I love Adam's graphs, I don't know if "Weighted WAR" is the way to go -- it is quite arbitrary, and doesn't really measure peak performance as much as it measures performance over a certain level. A player who has an incredible 10 WAR season at age 22 and a 10 WAR season at age 32 will look like they have the same peak (8 WAR above MVP) as a player who has 7 WAR from 10 straight seasons.
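To see why a threshold-based tally blurs those two careers, here is a minimal Python sketch. The 6 WAR "MVP" cutoff is my inference from the 8 WAR figure above (two 10 WAR seasons, each 4 over the line); it is an assumption for illustration, not Adam's actual formula, and `war_above` is a made-up helper.

```python
MVP_LEVEL = 6.0  # assumed cutoff, inferred from the "8 WAR above MVP" figure


def war_above(seasons, threshold):
    """Sum the WAR each season contributes above the threshold."""
    return sum(max(0.0, war - threshold) for war in seasons)


two_spikes = [10.0, 10.0]  # 10 WAR seasons at ages 22 and 32
steady_run = [7.0] * 10    # 7 WAR in each of ten straight seasons

print("two spikes:", war_above(two_spikes, MVP_LEVEL))
print("steady run:", war_above(steady_run, MVP_LEVEL))
```

Under that assumed cutoff the two totals land in the same neighborhood, and nothing in the tally says whether the big seasons came back to back or a decade apart.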

So I thought to myself, "How can we make that distinction?" I diddled with a lot of data, and I came up with this:

(Click for a larger version)

The graph includes all the players in the top 50 for career pitching WAR, as well as any player in either the Hall of Fame or the Hall of Merit.

Let your eyes come to your own conclusions, but here is what I noticed:

- Early careers were a lot shorter than they were later on
- The best pitchers jump right out -- Young, Nichols, Johnson
- Pitchers in the modern era are starting their careers a lot later, and ending them earlier than they did in the Expansion Era

If you have any suggestions, improvements, etc., let me know. I will be doing a version for batters on FanGraphs in the near future.

References:

Data collected from Baseball Reference via Baseball Projection
Raw Data in a Google Spreadsheet

Created using Excel for initial formatting, and Adobe Illustrator for prettying up. Licensed under Creative Commons Attribution, Non-Commercial License.