Saturday, December 25, 2010

Merry Christmas

I spent far more time than I ever intended on a reply to Jinaz's Hall of Fame WAR Comparison post over at Beyond the Boxscore. But what came out was interesting, so I'm sticking it up here and explaining it.

Back in the day, I made my Cumulative WAR by Age Graph in response to the same graphs that Jinaz used. Basically, the goal was to show the same data in a slightly different way, to tell a slightly different story. While the "nth best" season works great for normal progressions, it doesn't work so well when a player has an odd career (like one marred by injuries, or one interrupted by war). In cases like that, the cumulative WAR by age tells a much different story. Look at Bonds vs. Aaron for example.
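To make the difference between the two views concrete, here is a minimal Python sketch. The season list is invented for illustration (it is not any real player's record) and has a three-year gap of the kind a war or injury would leave. The function names are just labels I picked for the example.

```python
from itertools import accumulate

# Hypothetical (age, WAR) seasons for an invented player with a gap at ages 24-26.
seasons = [(21, 4.0), (22, 7.5), (23, 8.0), (27, 9.0), (28, 6.5), (29, 3.0)]

def nth_best(seasons):
    """WAR values sorted from best season to worst (the 'nth best' view)."""
    return sorted((war for _, war in seasons), reverse=True)

def cumulative_by_age(seasons):
    """Running WAR total at each age, carrying the total through missed years."""
    war_by_age = dict(seasons)
    ages = range(min(war_by_age), max(war_by_age) + 1)
    totals = accumulate(war_by_age.get(age, 0.0) for age in ages)
    return list(zip(ages, totals))

print(nth_best(seasons))
print(cumulative_by_age(seasons))
```

The sorted view hides the gap entirely, while the cumulative view shows the total sitting flat from age 24 through 26.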

Well this year, Adam Darowski introduced a tremendous Interactive 2011 Hall of Fame Ballot. So lots of people want to look at other careers in a similar way -- by tremendous (6+ WAR), good (3+ WAR), and better than nothing (0+ WAR) seasons. Sky Kalkman came up with a version combining Jinaz's line graph with Adam's radar graph.

Something about it struck me as a bit off, and counter-intuitive. I couldn't figure out why exactly, so in trying to explain it, I worked out a solution of my own:

Seasons with negative WAR have been removed. There are some good points and bad points about this graph. The big negative is that you can only compare two players. On the plus side, it shows not only how many 6+ WAR seasons a player had, but also when those seasons came and how far over 6 WAR they went. That can help you get another dimension in the data -- did the player have a long peak of consecutive seasons over 6 WAR? Did they have a tremendous fluke of a 10 WAR season and sink back to 5+ after that?

I don't know how much use this will actually have to most people, but it is a different way to compare players, so I figured I'd put it up.

Image is licensed under a Creative Commons Attribution, Non-Commercial License. Original graph was created in Excel and then prettied up in Adobe Illustrator.

Wednesday, December 22, 2010

Free Agent Signings by Year

In his "The Two Markets" post on Fangraphs, Dave Cameron talked about how the trade market and the free agent market relate to each other, and whether the decrease in trade returns may be related to the increase in free agent spending.

I wondered if there was any indicator of what is going on with the free agent market, so I decided to grab some data and see if I could figure anything out. To be honest, I didn't know what (if anything) I would find, and just figured I'd dive into the data and see what there is to see. As a quick note, I took international signings out of the pool -- they really muck things up (players from Cuba come over pretty young and throw off averages since they are complete outliers). I also ignored minor league deals.

First thing I looked at was the average age of free agents. My thought was, "Maybe the recent trend of signing players before their arbitration runs out has reduced the pool of young free agents, and increased the free agent age -- that would reduce supply of younger talented players, and increase costs of the remaining free agents."

Well, the data didn't exactly bear that out. It looks like the average age of free agents has had a consistent downward trend. So I thought, "Well, maybe that's because guys like Julio Franco refused to retire and drove up the average age. Maybe we should be looking at the average age of free agents who were signed!"

Again, we see the same trend; there's almost no difference. So it looks like the talent pool, for some reason, has been getting younger in MLB from 2006 to today. It looked to be time to take it from a different angle.

I thought about contract length next. How many contracts of each type are they handing out? And what age are players in each bracket? Are they giving more 1-year contracts to younger players who had a bad year, injury concerns, or something to prove?

The darker the color, the more recent the year (2006-2010). As you can see, for the most part, contracts are going to younger players pretty much across the board, regardless of contract length, with longer contracts being given to progressively younger players. That makes sense. So I looked at the average age at which a contract ends, to get an idea of when teams want to cut ties rather than the age at which they want to sign players:

Lighter colors are shorter contracts (from 1 to 5+ years). A couple of things I noticed here are that the younger the player at signing (and the longer the contract), the older he seems to be when the contract ends. I wonder if this is a premium on getting premier talent signed long-term -- you are almost required to keep the player longer than you would otherwise like to. We don't see the same pattern in 2-year or 3-year contracts, but we see it in 4-year contracts for 2008 (Derek Lowe, 41 and Ryan Dempster, 37), and we see it in 5+ year contracts from 2008 through 2010.

But unfortunately, I'm still nowhere near answering my original question. Why the heck are we seeing such a decline in free agent ages?

I looked at how many years of contracts teams were giving to free agents, and how many players they gave them to. The closer the two lines get, the shorter each contract will be. If they're on top of each other, it would mean that they are only handing out 1-year contracts.
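The relationship between the two lines boils down to one division, sketched here in Python. The yearly totals are placeholders I made up to show the arithmetic, not the values from my spreadsheet, and `average_length` is just a name for the example.

```python
# Hypothetical yearly totals: (contracts signed, total contract years handed out).
# These numbers are placeholders, not the values from the spreadsheet.
totals = {2006: (100, 180), 2007: (90, 150), 2008: (95, 130)}

def average_length(players, years):
    """Average contract length in years; 1.0 means only 1-year deals were signed."""
    if players == 0:
        return 0.0  # avoid dividing by zero in a year with no signings
    return years / players

for season, (players, years) in sorted(totals.items()):
    print(season, round(average_length(players, years), 2))
```

When the two lines sit on top of each other, years equals players and the average comes out to exactly 1.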

So through all this, I never really figured it out. What I can say is that it looks like teams are getting a lot more conservative about big free agent signings, and that something around 2007-2008 seemed to cause it. It could be that the huge number of contracts signed in 2006 flooded teams with players who are still under contract, and we will see a small rebound as those players go back on the market, leaving more openings on each team. Or it could be a new understanding of aging, causing teams to be more conservative about signing aging players. Or it could be something I'm not seeing.

I'm sorry that this is so long and drawn out, but I think there's something hidden in the data trying to jump out. I fiddled around with it a dozen ways over many hours, and spent quite a bit of time thinking of different ways to tackle the data, but whatever is hidden didn't jump out at me.

I think that it may be hidden in data I haven't put in the spreadsheet. Perhaps the poor free agent signings of 2006 (Soriano, Zito, Carlos Lee, Juan Pierre, etc.) made people think twice about spending big bucks on the free agent market. Or maybe the financial crisis that started in 2007-2008 had an impact on team finances. Or maybe there just weren't good players on the market. Or...

If you can figure it out, please, let me know.

Note: For reasons unknown, Google doesn't do the formulas right. If you download as an Excel file, then you should be able to see a lot fewer #DIV/0 errors. Graphics were created in Excel and then imported into Adobe Illustrator. There's nothing special about them, but if you do want to use them, feel free under a Creative Commons Attribution, Non-commercial license.

Monday, December 20, 2010

Writing at Fangraphs

Sorry for the lack of updates over the past week. I have several things I am working on, but I am intentionally not posting them here because I have been added to the writing staff at Fangraphs. I still plan to post some things here, but I will be focusing a lot more on writing for the bigger audience there.

Some things I am working on:
  • Finishing the graphical game summary
  • Using the Japanese NPB stats I posted to do some analysis of how the league differs from MLB
  • Creating a "Baseball Basics" presentation geared toward non-sabermetric fans, explaining why they should care about how baseball works

Tuesday, December 14, 2010

A Work in Progress

I am rethinking the way we look at game summaries. Most sites have a play by play, box score, and even a WPA graph for the game. Isn't there a better way to get an idea of how the game flowed without sifting through the text?

I figure there must be, so I'm trying to come up with something, but it's a process. And it's nowhere near done, but I may as well show what I have to get feedback, or at least to put it out there in case someone else has use for it in its current form.

Pardon the lack of polish, it was created solely in Excel (turned to PNG in Illustrator):
(Click for a larger version)

What I want to do is to give a nice graphical review of what happened on the field, coupled with the run expectancy for each part. Each base-out state at the start of the at-bat is shown by the graphic. Red is no outs, yellow one out, blue two outs. To the left of each base graphic is a white dot showing the expected runs (0 to 3) before the at-bat, and to the right is the expected runs (0 to 3) after the at-bat. If the at-bat ended the inning, the dot is black (at zero). Each player's contribution is the gap between the left and the right.

But there's still a lot of stuff missing:
  • Color coding the base-out states is ugly. I want to change the background color to correspond to the number of outs (white for 0, light grey for 1, grey for 2, and black for 3 outs)
  • I need to show when a player scored runs in his at-bats. For instance, the 3-run HR by Sardinha in the 4th inning makes it look like he dropped the run expectancy (1st and 2nd with 2 outs to bases empty 2 outs). In reality he scored 3 runs minus that drop in expectancy
  • I need to figure out a better way of dealing with stolen bases. There is one in the 9th, but there's no way of knowing that it happened, or who did it. That's going to be a challenge (stupid non-discrete events in an otherwise discrete game!)

My ultimate goal is to be able to give a nice simple way to tally up the expected runs and scored runs in the middle (simple addition across), and a nice simple way to tally up the run expectancy added by each player (simple addition downward). I'd be much happier if it also looked nice, but I'll focus on getting the hard work of making it useful first.
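The per-player tally described above can be sketched in a few lines of Python: a batter's contribution is the run expectancy after the at-bat, minus the run expectancy before it, plus any runs that scored. The run expectancy numbers below are illustrative placeholders, not values from a real table, and the state labels and function name are made up for the example.

```python
# Illustrative run expectancy values keyed by (runners, outs); placeholders only,
# not taken from an actual run expectancy table.
RUN_EXPECTANCY = {
    ("12-", 2): 0.45,   # runners on 1st and 2nd, two outs
    ("---", 2): 0.10,   # bases empty, two outs
}

def runs_added(before, after, runs_scored):
    """Run expectancy change across an at-bat, credited with any runs that scored.

    `after` is None when the at-bat ended the inning (expectancy drops to zero).
    """
    re_before = RUN_EXPECTANCY[before]
    re_after = 0.0 if after is None else RUN_EXPECTANCY[after]
    return re_after - re_before + runs_scored

# A 3-run homer with two on and two out: expectancy falls, but 3 runs scored.
homer = runs_added(("12-", 2), ("---", 2), runs_scored=3)
# An inning-ending out in the same spot.
out = runs_added(("12-", 2), None, runs_scored=0)
print(round(homer, 2), round(out, 2))
```

With the runs folded in, the homer shows up as a large positive number instead of a drop, which is the piece the current graphic is missing.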

If anyone has any input, please let me know. And if there are any Excel geniuses out there who know how I can use a custom marker for XY charts with a transparent background (transparent turns black when you add it to the chart), I'd be eternally grateful.

Image is licensed under Creative Commons attribution, non-commercial license. Feel free to use it as you'd like, and if you'd like the file I've used for it, just ask.

Sunday, December 12, 2010

Nippon Pro Baseball Data (2005-2010)

I have spent a bit of time putting together the batting and pitching stats from Japanese pro baseball from 2005-2010. I've uploaded it to Google Documents so that you can use it as well. The English names are pretty much all wrong (I guess there are a few correct ones in there), because there are over 1,000 different last names alone, and translating them would have taken a long time. However, the names should be usable for doing stats, and searching Google for any player's name by copy-pasting the characters should get you a site with the proper English name.

Also, Japan tracks "Hold Points" which are holds + relief wins. Useless stat, but it's in there.

All data was retrieved from the NPB homepage.

Please feel free to use it however you'd like, and I will start delving into it myself in the near future. You can access the Google spreadsheet here.

2011 Red Sox Salary Commitments and Projected Wins

Beyond the Box Score asked its readers to make graphs showing a team's salary commitments and projected wins in a way that you can understand at a glance. Unfortunately, that's a lot of information for one graph, so I cheated by making an entire sheet covering all the information in several graphs:

(Click for a larger image, or download as a PDF)

The top graph shows the Wins Above Replacement projected for the team. The projections were taken from Fangraphs and fiddled with a bit. I am not passing judgment on the actual projections (in other words, take them with a grain or six of salt), just putting down what was projected. Replacement level was set at 50 wins (which is why the Y-axis starts at 50).

The second graph shows projected salaries for each player, taken from Cot's Contracts. For players who have under 3 years of service time, I had them paid 110% of the league minimum, which was $400,000 in 2010 and which I projected to increase by $10,000 a year. For players with more than 3 years but less than 6 (the 3 arbitration years), I had them paid 40% of their previous year's value the first year, 60% of their value the second, and 80% of their value the third (estimates taken from tangotiger and/or Fangraphs).
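For anyone who wants to reuse the salary rules, here is a rough Python sketch of them. I did the real thing in Excel, so the function names are made up for this example. It assumes service time comes in whole years, and it leaves the estimate of a player's dollar value to the caller.

```python
LEAGUE_MIN_2010 = 400_000      # league minimum in 2010
MIN_RAISE_PER_YEAR = 10_000    # projected yearly increase in the minimum
ARB_SHARES = {3: 0.40, 4: 0.60, 5: 0.80}  # share of value paid in each arbitration year

def league_minimum(season):
    """Projected league minimum salary for a season from 2010 onward."""
    return LEAGUE_MIN_2010 + MIN_RAISE_PER_YEAR * (season - 2010)

def projected_salary(season, service_years, player_value):
    """Estimated salary for a player without a contract on the books.

    service_years: whole years of service time entering the season (assumption).
    player_value: dollar value the salary is scaled from in arbitration years
                  (how that value is estimated is left to the caller).
    """
    if service_years < 3:
        return 1.10 * league_minimum(season)
    if service_years in ARB_SHARES:
        return ARB_SHARES[service_years] * player_value
    raise ValueError("player is past arbitration; use the actual contract")

print(round(projected_salary(2011, 1, 0)))
print(round(projected_salary(2012, 4, 10_000_000)))
```

Players past their arbitration years raise an error on purpose, since their salaries come straight from the contract data rather than from a rule.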

The bottom portion shows the legend for the mess of colors, and a table of the data for the players projected to be on the roster for 2011. Since there are only 21 players, 4 more will be added, but since I don't know who they are or how much they will be paid, I just left the spots blank.

There are also thick black lines on the first two graphs. For budget, I took the 2011 salaries, and calculated an 8% increase every year to create a projected budget (thick line in the second graph). Using that projected budget, and the projected cost of a win ($4.75 million in 2011, 8% inflation every year after that), I calculated how many wins the remaining budget minus salary costs could buy on the free agent market, and put it as a thick line on the top half of the graph.
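The two thick lines come from a short calculation, sketched below in Python. The payroll and commitment figures in the example call are hypothetical round numbers, not the Red Sox totals, and clamping a negative remainder to zero is my own choice for the sketch.

```python
GROWTH = 0.08                 # yearly growth applied to both budget and win cost
COST_PER_WIN_2011 = 4.75e6    # dollars per free agent win in 2011

def grow(base_2011, season, rate=GROWTH):
    """Value in `season` after compounding yearly from its 2011 level."""
    return base_2011 * (1 + rate) ** (season - 2011)

def wins_affordable(budget_2011, committed_salary, season):
    """Free agent wins the uncommitted part of the budget could buy.

    A negative remainder is clamped to zero (an assumption of this sketch).
    """
    remaining = grow(budget_2011, season) - committed_salary
    return max(remaining, 0.0) / grow(COST_PER_WIN_2011, season)

# Hypothetical figures: a $160M payroll in 2011 with $100M committed for 2013.
print(round(wins_affordable(160e6, 100e6, 2013), 1))
```

Because the budget and the cost of a win grow at the same rate, the number of wins the budget can buy only rises as salary commitments come off the books.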

So what?

Well, looking at the graph a bit, a few things popped out at me. The first is the importance of cost-controlled players. Despite the Red Sox having a lot of talent projecting to an unreasonable number of wins (114 in 2011 -- again, grain of salt), if they don't get more cost-controlled players who can make a contribution at low cost, they will drop down to an average team in 5 years.

The second thing I noticed was the Red Sox's long-term commitments to homegrown players like Youkilis, Pedroia, and Lester. Their contributions are quite cheap for the number of wins they are projected for. All three deals were signed before free agency, so the players accepted a discount for job security. That helps a lot.

The third thing I noticed was the amount of starting pitching they have signed. Lester, Beckett, and Lackey all have 4 years left under contract. Buchholz has 5 years left of service time (I think, but I've never been too bright). Even Matsuzaka is still under contract for 2 more years, and Tim Wakefield the knuckleballer is also there for another year when/if someone gets injured.

Anyway, this is just an exercise, and I would love to do it for all 30 teams, but I'm afraid it would be a full-time job. Feel free to use the format if you want, and I will be happy to support you if you need it.


The graph was initially made in Excel, then imported into Adobe Illustrator. Licensed under Creative Commons, Attribution, Non-Commercial License (save for the player photos which are copyrighted).

Tuesday, December 7, 2010

Ted Williams Rolling wOBA per 150 Games

I wanted to try a new take on seeing the progression of players. Rather than looking at wOBA by year and injuries/playing time separately, I wanted to add them all into one graph so you can see both how they've performed, and how much they've performed.

My first attempt uses Ted Williams. The red line is his career wOBA, the blue line is the league average wOBA over that period, and the black line shows how he performed over the past 150 games. As the line fades further from black, that means he played fewer of the past 150 games.
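Here is a rough Python sketch of how a line like the black one can be computed. I built the graph in Excel, so this is an illustration rather than the exact method. It assumes one record per team game, holding the numerator and denominator of the player's wOBA for that game, and it treats a game with no plate appearances as a game missed.

```python
from collections import deque

WINDOW = 150  # team games per window, as in the graph

def rolling_woba(games, window=WINDOW):
    """Yield (wOBA, share of games played) over each trailing window of team games.

    games: one (woba_numerator, woba_denominator) pair per team game, with
           (0.0, 0) for games the player missed.
    """
    recent = deque(maxlen=window)   # oldest game drops out automatically
    for game in games:
        recent.append(game)
        if len(recent) < window:
            continue                # wait until a full window is available
        num = sum(n for n, _ in recent)
        den = sum(d for _, d in recent)
        played = sum(1 for _, d in recent if d > 0)
        woba = num / den if den else None   # None: no plate appearances in window
        yield woba, played / window

# Tiny example with a 2-game window so the output is easy to follow.
sample = [(1.6, 4), (0.0, 0), (2.0, 4), (0.8, 4)]
for woba, share in rolling_woba(sample, window=2):
    print(woba, share)
```

The second number in each pair is what drives the fade: 1.0 would be drawn in solid black, and lower shares in lighter shades.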

What does this tell us? Well, the darker the line, the more confident we should be that his performance is based on skill, and not just random fluctuations.

In the future I want to use this type of graph to compare multiple players and see what their performance and injury history looks like to give a quick glance evaluation between similar free agents, or trade candidates.

This graph was created in Excel and then edited in Adobe Illustrator. It is licensed under a Creative Commons Attribution, Non-Commercial license.

Monday, December 6, 2010

wOBA by Ball-Strike Count

I am a big fan of graphs and baseball. Fangraphs made me excited because putting complex data into reasonably easy to understand graphs helps open up sabermetrics to more fans. I'm a big fan of statistical analysis, but after a while, a table full of numbers just starts running together and stops making sense. That's what makes graphs such an effective tool.

I've dabbled in graphs myself. When people were creating the WAR graphs to compare hall of famers, I made a sample graph showing cumulative WAR by age on Tom Tango's Book Blog:

(click for a larger image)

Of course, soon after Fangraphs came out with a far better looking one, saving me the headache of figuring out how to automate it.

Here is my latest foray into the world of graphs, looking at wOBA by count:

(click for a larger image)

Let me explain the mess you see above. The horizontal X-axis shows the number of pitches. The first pitch is all the way to the left, and a full count is all the way to the right. The vertical Y-axis shows the wOBA for all at-bats that go through that count.

Since all at-bats go through the first pitch, the wOBA there is .330 (league average). The higher on the graph, the more likely a hitter is to do something good. As you can see, the best count for hitters is 3-0, and the worst count is 0-2. On 3-0 the average hitter is better than 2001-2002 Barry Bonds, and on 0-2 they're batting more like Adam Wainwright in 2010.
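The idea of "all at-bats that go through a count" can be sketched in Python like this. The pitch codes (B, S, F, X) and the per-outcome wOBA values in the example are a made-up encoding for illustration, and taking a plain average per count is a simplification of how wOBA is normally calculated.

```python
from collections import defaultdict

def counts_passed_through(pitches):
    """Ball-strike counts a plate appearance reached, starting at (0, 0).

    pitches: string of 'B' (ball), 'S' (strike), 'F' (foul), 'X' (ball in play).
    """
    balls = strikes = 0
    seen = [(0, 0)]
    for pitch in pitches:
        if pitch == "B":
            balls += 1
        elif pitch == "S" or (pitch == "F" and strikes < 2):
            strikes += 1
        else:
            continue            # two-strike foul or ball in play: count unchanged
        if balls == 4 or strikes == 3:
            break               # walk or strikeout ends the plate appearance
        seen.append((balls, strikes))
    return seen

def woba_by_count(plate_appearances):
    """Average wOBA value of every plate appearance that passed through each count."""
    total = defaultdict(float)
    seen_pa = defaultdict(int)
    for pitches, woba_value in plate_appearances:
        for count in counts_passed_through(pitches):
            total[count] += woba_value
            seen_pa[count] += 1
    return {count: total[count] / seen_pa[count] for count in total}

# Hypothetical plate appearances: (pitch sequence, wOBA value of the outcome).
sample = [("BBBB", 0.7), ("SSS", 0.0), ("BX", 0.9), ("SFFX", 0.0)]
print(woba_by_count(sample))
```

Every plate appearance is counted at 0-0, which is why the first point on the graph has to land on the overall average.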

The size of each count (by area) is the number of times that count occurred. There were 936,848 PA in my sample, so the first pitch is the biggest. There were only 47,488 3-0 counts, so that is the smallest. Each count is also a graph in and of itself, showing what happened at that count.
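Sizing by area rather than by width means the radius has to scale with the square root of the frequency. A minimal Python sketch of that, using the two totals quoted above (the function name and the choice of a maximum radius of 1 are mine):

```python
import math

TOTAL_PA = 936_848     # plate appearances in the sample (every one starts at 0-0)
COUNT_3_0 = 47_488     # plate appearances that reached 3-0

def bubble_radius(occurrences, max_radius=1.0, max_occurrences=TOTAL_PA):
    """Radius that makes bubble area proportional to how often a count occurred."""
    return max_radius * math.sqrt(occurrences / max_occurrences)

print(round(bubble_radius(COUNT_3_0), 3))
```

The 3-0 bubble ends up with roughly 5% of the area of the first-pitch bubble, matching its share of plate appearances.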

Blue is ball, red is strike, and gray means the play ended. As you can see, with 2 strikes another strike ends the at-bat, so there are only balls and ended at-bats.

So What?


I made this graph for my own use. It is a nice easy-reference tool to track what's happening on each pitch. I can follow along and see if a batter's chances went up or down, and how likely the at-bat is to end on each pitch (really roughly).

Ideally I would make one for each team, so that you can get one for your own team and use it when you're watching games, or even for each player so that you can compare and contrast Vladimir Guerrero with Kevin Youkilis, or the Twins and the Yankees, etc. And there's a good chance that there are things that you can think of to use this graph for, so please let me know what they are in the comments.


The graph was initially made in Excel to get the bubble positions and sizes, then imported into Adobe Illustrator to add the pie graphs, connecting lines, etc. Images are licensed under Creative Commons, Attribution, Non-Commercial License.

Run Expectancy by Base-Out State (2010 MLB)

How many runs does a team score on average when there is nobody on and no outs?

The answer to that question is available in run expectancy charts, like this one provided by Baseball Prospectus. But to most people, that's just a mess of numbers, and it's difficult to follow.

Here is the same information in a chart form:

(click for a larger version)

The red base diagrams show the runs expected with zero outs, orange with one out, and blue with two outs. The first column is with the bases empty. Moving to the right, it shows the runs expected with 1 runner on, 2 runners on, or the bases loaded. As you can see, the more people on base, the more runs you are expected to score, and the fewer outs, the more runs you are expected to score.

Image was edited using Adobe Illustrator. Vector data is available on request. This image is available under a Creative Commons Attribution, Non-Commercial License.