Introducing the New Negro Leagues Database

It’s been over five years since we originally launched the Negro Leagues Database. Over that time, there have been significant additions to the database, in terms of new seasons and statistics. But the website and the presentation of these statistics have largely remained the same. In May of 2015, I overhauled the Major League part of The Baseball Gauge, and I’ve wanted to do the same with the Negro Leagues section. Today, we re-launch the award-winning Negro Leagues Database. Here are some of the new features:

Per 162 games

One of the biggest issues with Negro Leagues statistics is that they are incomplete. We don’t have box scores for every game and we currently do not have data for every season and league. Because of this, it’s tough to compare Buck Leonard’s 62 career home runs to Cristóbal Torriente’s 70, the same way we compare Harmon Killebrew (573) to Andre Dawson (438).

To help fix this issue, I’ve included “per 162 games” rates on player and season/career leaderboard pages. Here we’ll see that Buck Leonard averaged 26 home runs per 162 games, while Torriente averaged 11.
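As a rough illustration of the idea, a counting stat can be scaled to a 162-game rate by dividing by games played and multiplying by 162. This is only a sketch: the function name `per_162` and the sample numbers are hypothetical, and the database may apply additional handling that is not described here.

```python
def per_162(total: float, games: int) -> float:
    """Scale a counting stat (e.g. home runs) to a 162-game rate."""
    if games <= 0:
        raise ValueError("games must be positive")
    return total * 162 / games


# Hypothetical player: 10 home runs in 81 recorded games
print(per_162(10, 81))
```

Because the rate depends only on the games we have data for, two players with very different amounts of surviving box scores can be compared on the same scale.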

Similarity scores

Comparing raw stats from Negro Leagues to Major Leagues is far from perfect. It doesn’t account for league quality, park factors or era. Having said that, we have similarity scores on all player pages, to see which Major Leaguer had the most similar career. Because of the issue described above, “per 162 games” statistics are used instead of career totals. There is also the ability to only compare to Hall of Famers or active players.

The similarity score tool shows us that Oscar Charleston’s most similar Major Leaguer was Rogers Hornsby
Charleston vs Hornsby

Defensive Regression Analysis

These fielding statistics have been available on the Major League site for a few years now and they are finally included in The Negro Leagues Database. Defensive Regression Analysis, created by Michael Humphreys, takes basic fielding statistics and estimates how many runs a player has saved (or allowed) compared to average.

Defensive Regression Analysis shows us that Dick Seay, while a lightweight with the bat (career 51 OPS+), saved 67 runs at second base in the season we have fielding data.

New Wins Above Replacement

The calculation for Wins Above Replacement now matches the Major League site. It uses Base Runs for offense, Defensive Regression Analysis for fielding, and runs allowed (with an adjustment for fielding) for pitching. The replacement level has been set at .294 to be consistent with Baseball-Reference and Fangraphs.

There is also Wins Above Average and Wins Above Greatness if you prefer a different baseline. As with the previous version of the website, Win Shares and Win Shares Above Bench are included.

The career leaderboard per 162 games contains many familiar names:
WAR per 162

Roster pages

These are available on team, year, franchise, and all-time pages. They contain vitals, uniform numbers, and birth/death information.

Data Coverage

These pages give the user an idea of which statistics we have and which we are missing.

New Logo

We have a beautiful new logo, which was kindly provided by Gary Cieradkowski, creator of the Infinite Baseball Card Set and author of The League of Outsider Baseball.

Finally, we have all the features that were previously available on The Negro Leagues Database as well as the Major League version of The Baseball Gauge.

Posted in Announcements, General, Historical, Site Additions, Statistical Analysis | 1 Comment

How Close Have the Cubs come to Winning the World Series?

I don’t know if you’ve heard, but the Cubs are in the World Series. But what you probably don’t know is that they haven’t won it since 1908. OK, we’re all well aware and will be reminded of it a thousand times over the next week and a half. We’ll see highlights of the ball rolling through Leon Durham’s legs, a fan trying to catch a foul ball, Alex Gonzalez botching a potential double play ball, images of black cats and goats, and a lot of photos of Tinker, Evers & Chance.

So if you’ll indulge me, let’s see how close the Cubs have come to winning the World Series in each season since 1903. For teams that reached the postseason, I’ll use their highest series win probability in their final series. For teams that failed to advance to the playoffs, I’ll use the highest pennant win probability they reached during the regular season.

Cubs Highest Championship Probability by Season

If you’re familiar with Chicago Cubs history, you know they were very good during the late 1920s and 1930s. In fact, they had the second-highest winning percentage of any team from 1929 to 1939. They just happened to run into Connie Mack’s A’s, Ruth and Gehrig’s Yankees, Greenberg and Cochrane’s Tigers, and Gehrig and DiMaggio’s Yankees.

They also came out on top of the National League in both seasons that were most affected by World Wars, 1918 and 1945. These were seasons that saw rosters severely impacted, with some of the game’s best players going into the service or defense industry. The following seasons, when players returned to their teams, the Cubs fell to 21 games behind in 1919 and 14 1/2 games in 1946.

Then, from 1946 until 1983, the Cubs came no closer to winning the World Series than a 23.5% chance. The high point was 1969, when on August 13th, they were up 9 games in the National League East and had a 94.2% probability of winning the division. As we all know, the Amazin’ Mets came from behind to win the division and the World Series. But even if the Cubs had won the division, they would have still had to go through Hank Aaron’s Braves and the 109-win Orioles.

The Cubs have made the postseason six times since 1984, advancing to the NLCS on three different occasions. And just like the Cubs would have had to get past the Tigers (1984), A’s (1989) and Yankees (2003), this year’s Cubs will have to get past the Indians. The only difference is that the 2016 team will actually get a chance to do it.

Finally, if you prefer the data in a table, here is the year-by-year data, sorted by highest World Series win probability.

Year Probability Postseason
1945 77.4% World Series
1935 67.4% World Series
1932 57.8% World Series
1929 54.8% World Series
1938 53.4% World Series
1918 52.2% World Series
1910 50.9% World Series
2016 50.0%
2003 48.9%
1984 45.6% NLCS
1937 43.0%
1927 41.0%
1930 40.1%
1936 33.4%
1989 29.3% NLCS
1911 26.4%
2015 26.4% NLCS
1969 23.5%
1915 21.7%
1967 20.2%
1917 20.2%
1977 19.3%
1973 19.3%
1934 18.5%
1947 16.5%
1924 16.3%
1909 15.9%
1933 15.6%
2008 15.1% NLDS
1928 13.4%
1970 13.3%
1913 13.2%
2007 13.0% NLDS
1985 12.7%
1998 12.5% NLDS
1978 12.4%
1920 12.4%
1926 12.3%
1914 12.0%
1922 11.6%
1958 10.9%
1952 10.8%
1912 10.7%
2001 10.6%
1975 10.6%
1923 10.5%
1931 10.3%
1939 10.0%
1955 9.8%
2004 9.8%
1963 9.7%
1916 9.7%
1987 9.4%
1925 9.4%
1951 9.2%
1946 9.1%
1919 9.1%
1921 8.9%
1950 8.8%
1953 8.4%
1941 7.9%
1959 7.8%
1940 7.3%
1980 7.3%
1995 7.2%
1949 7.2%
1965 7.2%
1954 7.1%
1948 7.1%
1942 7.0%
1961 7.0%
1960 7.0%
1956 6.8%
1944 6.7%
1943 6.6%
1988 6.5%
1991 6.5%
1972 6.3%
1968 6.2%
1979 6.1%
1957 5.9%
1990 5.9%
2009 5.8%
1964 5.8%
1976 5.7%
1999 5.2%
1981 5.2%
1974 5.1%
1962 5.0%
1966 4.9%
1982 4.7%
1992 4.7%
1971 4.6%
1996 4.5%
1986 4.4%
1983 4.2%
2005 4.1%
2006 4.0%
2013 3.8%
1993 3.6%
1994 3.5%
1997 3.4%
2014 3.4%
2000 3.3%
2011 3.2%
2010 3.2%
2002 3.0%
Posted in General, Historical, Statistical Analysis | Leave a comment

Top 10 Plays of 2016 (Link)

Yesterday, I wrote about the top 10 plays of the regular season, using championship win probability added at The Hardball Times. Check it out!

Posted in Announcements, General, Statistical Analysis | Leave a comment

Park Neutralized Stats

One thing that gives people like me headaches is having to deal with the fact that all ballparks are different. Don’t get me wrong, the uniqueness of the sport is part of what makes baseball so special. But it’s an imperfect science when attempting to compare players from different teams.

The other day, my friend and baseball statistic guru Ryan Spaeder was asked which park was the toughest on hitters, and this was his response:

I thought it was perfect. Obviously, Coors Field is easily the most “hitter friendly” ballpark, but it is almost a no-win situation for a hitter. The Rockies have zero Hall of Famers in their 24 years of existence and their two best candidates (Larry Walker and Todd Helton) are unlikely to get in any time soon. The argument is their stats are inflated due to Coors Field, although it is impossible to say by exactly how much. We have park neutralized metrics such as OPS+ and wRC+ which include a park adjustment, and while I use them regularly, I admit they aren’t perfect.

The common practice in adjusting for ballpark is to take the ballpark factor, which estimates how a park influences run scoring compared to league average, and apply it to each hitter. However, all players receive the same adjustment, no matter if they bat left-handed or right-handed, or if they are fly ball, ground ball, pull or spray hitters, etc. The assumption that all types of hitters should be treated the same is what I’m attempting to correct.

So while I know I’m not completely fixing the problem here, I’m offering an alternative. What I have are park-adjusted career totals based on home and road splits. My first attempt was to just multiply road stats by two, but that would remove ALL of a player’s games in their home park and turn ballpark advantages into disadvantages and vice versa.

Instead, I decided to include home stats, but only at the same rate that a player would visit other parks. For example, Babe Ruth played in an era when his league had eight teams. That means that if he played in all ballparks an equal amount of time, he’d play in his home ballpark 12.5% (1/8) of his games. The other 87.5% would come from his road stats. A player in 2016 would play in their home park 6.67% (1/15) of the time.

Example

The formula is simple. For Babe Ruth’s home runs, we take his home HR/PA (347 / 5150 = .0674) and divide that by the number of teams in his league (.0674 / 8 = .00842). Next we take his road HR/PA (367 / 5473 = .0671) and multiply that by (7 / 8 = .875), which is the number of opponents in the league divided by the number of teams in the league (.0671 * .875 = .0587). Next, we add those two numbers together to get his new HR rate (.00842 + .0587 = .06712). To get his final career HR total, we multiply his rate by his career plate appearances (.06712 * 10623 = 713 HR). Surprisingly, he actually loses a HR, even though he played most of his career with a short right field at Yankee Stadium. We’ll see, however, that many players have a bigger difference in their adjusted career totals.
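The steps above can be sketched in a few lines of code. This is a minimal sketch rather than the exact implementation behind the spreadsheet: the function name `neutralized_total` is made up for illustration, and it assumes a single league size for the whole career (true for Ruth’s eight-team league, but a simplification for players whose league expanded).

```python
def neutralized_total(home_stat: float, home_pa: int,
                      road_stat: float, road_pa: int,
                      teams: int) -> float:
    """Weight the home rate by 1/teams and the road rate by (teams-1)/teams,
    then scale the blended rate back up to career plate appearances."""
    home_rate = home_stat / home_pa          # e.g. HR per PA at home
    road_rate = road_stat / road_pa          # e.g. HR per PA on the road
    home_weight = 1 / teams                  # share of games in the home park
    road_weight = (teams - 1) / teams        # share of games in all other parks
    blended_rate = home_rate * home_weight + road_rate * road_weight
    return blended_rate * (home_pa + road_pa)


# Babe Ruth's home runs, using the splits quoted above
ruth_hr = neutralized_total(347, 5150, 367, 5473, teams=8)
print(round(ruth_hr))  # 713
```

The same function works for any counting stat (singles, doubles, walks), which is how a park can take away home runs while giving back other hits.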

500 HR
After neutralizing the stats, there are no new members of the 500 HR club, although we lose six players. The biggest drop belongs to Mel Ott, who loses over 20% of his career total. Ott played his entire career at the Polo Grounds, where the right field foul pole was just 258 feet from home plate. During his career, left-handed batters hit about 80% more HR there than they did at the other National League parks.
What is interesting about Mel Ott and The Polo Grounds is that while it allowed Ott to hit far more home runs, it came at the expense of other hits. So we take away 104 HR, but also credit him 83 singles, 87 doubles, and 21 triples. Overall, his production was increased at home, but not by as much as his home run total would indicate.

The second biggest drop is a bit surprising in Frank Thomas, who lost 90 home runs. Comiskey Park does favor HR hitters, but it’s far from the most drastic ballpark. Still, over his career, Thomas hit a HR in 6.2% of his plate appearances at home and 4.1% on the road.

David Ortiz is not only the biggest gainer on the list, but also among all players. He went from 525 HR to 589, and we’ll see that Fenway Park wreaks havoc on these neutralized stats.
HR Change

The common theme with both of these lists is that those who saw their totals increase played in parks that suppressed home runs, while those who saw their totals decrease played in parks that favored them.

Now let’s look at what is probably the second most popular career batting list, the “3,000 Hit Club”. (Note: Since we only have home/away splits going back to 1913, any player who began their career before this season is not included. Thus, no Cobb, Wagner or Speaker).
3000 Hits
Just as with the “500 HR Club”, the “3000 Hit Club” only lost players. While some players see an increase in their production and some see a decrease from these neutralized stats, as a whole, players will lose some production. This is due to home field advantage and because the majority of these neutralized stats are influenced by road stats. This may partly explain why there are no new members of either club.

Already, we see some of Fenway Park’s impact with David Ortiz’s increase in his home run total and both Yaz and Boggs seeing big decreases in their hit totals. Maybe the most telling is the list of players that saw the biggest decrease in their doubles.
Nine of the top 10 played most or all of their career at Fenway Park. You just don’t see this type of thing in other sports.

Players with Big Increases in Production

A fun thought experiment is to imagine how their careers would have turned out had Joe DiMaggio and Ted Williams been traded for each other, with DiMaggio taking advantage of the Green Monster and Williams facing a short RF porch. Instead, DiMaggio had to contend with 451 feet to left-center field at Yankee Stadium. We estimate that in a neutral park, he would have hit 45 more home runs and increased his overall production by 31 points of OPS.

Rick Wilkins may have been the player farthest from your mind when you started reading, yet here he is. What is most amazing about him is just how drastic his home/road splits were when he spent the majority of his career at a hitter’s park in Wrigley Field. For his career, Wilkins hit .216/.298/.350 at home and .272/.366/.471 on the road. Who knows? Maybe he just got an incredible night’s sleep in hotel beds.

It’s no secret that AT&T Park favors pitchers, and what makes Buster Posey so special is that his raw stats are impressive, even before a park adjustment. But if we estimate what they would look like at more favorable parks, it becomes even more obvious that he’s on an early path to the Hall of Fame.

In the Willie Davis comment in his New Historical Baseball Abstract, Bill James describes a method for converting a player’s stats from one run environment to another. This has come to be known as the “Willie Davis Method”; it is currently used on this site and is the basis for Baseball-Reference’s neutralized stats (with some additional adjustments). The problem with this method is that it treats all batting events the same, adjusting each of them by the same proportion. As we have seen with Fenway Park, this is not the case. Anyway, Bill James introduced his method in Davis’s player comment because he spent much of his career at a horrible hitter’s park. As we can see from this neutralization method, Davis’s stats improve, and his +29 triples and +210 total bases are the most of any player in history.

As if Mike Piazza didn’t already have the most impressive statistics for any catcher in baseball history, they get even better after they are neutralized. In fact, every single home ballpark during Piazza’s career had a park factor below 1.

Players with Big Decreases in Production

Chuck Klein spent much of his career in the Baker Bowl, which was 280 ft to right field and 300 ft to right-center field. So it’s easy to see why he hit 63% of his home runs at home. When we neutralize his stats, his overall numbers are much less impressive, especially given the hitter’s era in which he played.

As we saw earlier, Wade Boggs takes a big hit, with his OPS dropping 63 points. This is similar to other Red Sox players, such as Bobby Doerr (-.084), Rico Petrocelli (-.069), Dom DiMaggio (-.059), Jim Rice (-.053) and Carl Yastrzemski (-.051).

Barry Larkin is an interesting case. He has a 21 point drop in OPS while his hit total increases by 29. The biggest change was losing 130 walks, decreasing his walk percentage from 10.4% to 8.9%. In fact, Riverfront Stadium increased walks for right-handed batters in every season of Larkin’s career. This is just a reminder that park factors are not limited to balls in play.

As mentioned above, playing at Coors Field can be a no-win situation. Obviously, Larry Walker’s stats have received a boost. But if we neutralize them, he still compares well to these two Hall of Famers.
Throw in 9 Gold Gloves and +94 fielding runs, and this should quell any fears you may have about his Coors-inflated stats.

CarGo loses 102 points in OPS, which is the most of anyone with at least 2000 career plate appearances. This may indicate that his style of play is more affected by Coors Field. It’s also possible that he has a tougher time adjusting to the different approaches opposing pitchers take on the road, as Eno Sarris suggests. This may point to a flaw in the neutralization method, especially for extreme ballparks like Coors Field.

Change in Type of Production

Hank Aaron played his career at two parks, Milwaukee County Stadium and Atlanta-Fulton County Stadium. Milwaukee favored pitchers in terms of the long ball, but Atlanta was known as “The Launching Pad” and had a big effect on home runs. Naturally, he sees a drop in his home run total, but also an increase in singles, doubles and triples. The overall level of production didn’t change much after neutralization, just the shape of it.

Jay Buhner’s neutralized OPS is nearly identical to his actual OPS, but his peripherals see some changes. His singles and home runs increase, but his singles, doubles and walks decrease. This is just another example that a ballpark can change how a game is played while having very little impact on the run environment.

Flaws in the System

The unbalanced schedule and interleague play mean that not all teams visit the same ballpark an equal number of times. This means that Rockies players will visit pitcher’s parks such as Petco, AT&T and Dodger Stadium more often than teams in other divisions. It’s possible this can be corrected by equalizing how often a team sees each road park. However, this would complicate the process and I’m not completely comfortable with it.

As mentioned in the Carlos Gonzalez comment, it is possible that some players’ away stats are affected due to different approaches taken by pitchers based on the ballpark. I suspect this is only the case in the very extreme and unique parks. It is something to keep an eye on.

This method only uses career statistics, which contain large samples when dealing with home/road splits. A single season may not offer a big enough sample to completely trust, especially with part-time players.

Conclusion

This method is admittedly imperfect, but it does fix the problem of applying the same run adjustment to all players. If anything, it is an alternative to other methods of neutralizing ballparks. I’m open to any suggestions on improving this method and I may publish a pitching neutralization shortly.

For those interested, I have included a spreadsheet with neutralized stats that can be viewed here. It includes all players with at least 1000 PA who began their career after 1912.

Posted in General, Historical, Statistical Analysis | Leave a comment

Gauging the First Half

Instead of making a post about mid-season awards, which we are sure to see a few of during the All-Star break, I figured I’d try something different. Let’s take a look at how individual plays affect a team’s postseason probabilities.

Top Plays

Earlier in the season, I added a page that shows the top plays of the season in terms of win probability added. While placing a value on the importance of the individual game is interesting, we can take it one step further and look at how each play impacts a team’s playoff probability. A big hit in a game between two teams that are not in contention will have little to no effect. But a walk-off home run in a game between teams tied for the lead in a division will have a much greater impact. So let’s take a look at the biggest plays of the first half.

This list is sorted by championship win probability added (cWPA). Just as in-game win probability added shows the change in win probability in terms of percentage points, cWPA shows the change in World Series win probability. Your first thought upon seeing the cWPA values is probably how small they are. In fact, every play this season has a cWPA of less than 1 percentage point. This shows just how little impact even the most important play of the first three months of the season has on a team’s chances of winning the World Series. Another way to look at these numbers is to multiply them by 8, to see the change in probability of being one of the final 8 playoff teams.
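To make the two readings of the number concrete, here is a small sketch. The function names are hypothetical and the before/after probabilities are invented for illustration; only the 0.71 value and the multiply-by-8 rule come from this post.

```python
def cwpa(ws_prob_before: float, ws_prob_after: float) -> float:
    """Change in World Series win probability, in percentage points."""
    return ws_prob_after - ws_prob_before


def final_eight_change(cwpa_points: float, playoff_teams: int = 8) -> float:
    """Rescale cWPA to the change in probability of being one of the
    final 8 playoff teams, using the multiply-by-8 rule of thumb."""
    return cwpa_points * playoff_teams


# Hypothetical probabilities (in percentage points) around a single play
print(round(cwpa(3.00, 3.71), 2))

# The top play of the first half, 0.71 cWPA, on the final-8 scale
print(round(final_eight_change(0.71), 2))
```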

1) Leonys Martin’s walk-off HR (0.71 cWPA)


With two outs and a runner on 2nd in the bottom of the ninth and his team down by a run, Martin fell behind in the count 1-2 on three Ryan Madson changeups. On the fourth pitch, Madson went changeup again and Martin deposited it into the right field bleachers. The walk-off increased the Mariners’ in-game win probability by 86 percentage points, but more importantly, it increased the Mariners’ probability of winning the World Series by 0.71 percentage points.
On a side note, this play is also 40th on our list, as it decreased the A’s World Series win probability by 0.38 percentage points. The change in percentage points is bigger for the Mariners since the game was of more importance, as they were ahead in the division by 1.5 games, while Oakland was 7 games back.

2) Salvador Perez’s go-ahead 2-R HR (0.63 cWPA)


In the bottom of the 8th inning with two outs, Bryan Shaw was looking to send the game to the 9th with his team up by a run. He was facing Salvador Perez, who was 1 for 12 in his career vs Shaw. But on the 1st pitch, Perez gave the fans in left field a souvenir and his team the lead. This play decreased the Indians’ World Series win probability by 0.63 percentage points. You can actually see the moment when Bryan Shaw realizes that pitcher vs batter stats are too small a sample to trust.
This play is also 6th on our list, as it increased the Royals’ World Series win probability by 0.59 percentage points.

3) Ian Desmond’s go-ahead 2-R HR (0.62 cWPA)


The next two plays on this list are from the same crazy game in Oakland. The A’s were one out away from victory with Ryan Madson on the mound, when Ian Desmond gave Texas the lead with a 2-run HR off a changeup. This play increased the Rangers’ chances of winning the World Series by 0.62 percentage points. As Desmond rounded the bases, Rangers announcer Tom Grieve noted that Madson threw one too many changeups, which seems to be a recurring theme here.
This play is also 24th on this list, as it decreased Oakland’s World Series win probability by 0.41 percentage points.

4) Khris Davis’s walk-off grand slam (0.61 cWPA)


The next half-inning, Texas closer Shawn Tolleson intentionally walked Josh Reddick to load the bases with one out. Next, Danny Valencia flew out to shallow right, which brought up Khris Davis, who had already hit two home runs in the game. Davis then ended the game on a walk-off grand slam, which left Adrian Beltre wondering “what the hell just happened”.
From the A’s perspective, this play is 27th on the list, as it increased their World Series win probability by 0.40 percentage points.

5) Yasiel Puig’s walk-off single and error by Michael Taylor (0.60 cWPA)


Puig’s single would have put runners on 1st and 2nd with one out in the inning, but it was Taylor’s gaffe (and Puig’s hustle) that allowed both runners to score. This was the culmination of Michael Taylor’s horrendous game, where he also struck out in all five of his at bats. If you look closely enough, you can see him calculating the cWPA in his head.

Here are the rest of our top 25 plays of the first half:

Rk Date Play Team VS cWPA
6 6/14 Salvador Perez HR KC CLE +0.59
7 5/21 Matt Wieters HR BAL LAA +0.58
8 7/08 Luis Valbuena HR HOU OAK +0.56
9 6/22 Yasiel Puig little league HR WAS LAD -0.56
10 6/12 Jayson Werth Single WAS PHI +0.56
11 5/14 Albert Pujols HR SEA LAA -0.52
12 6/05 Matt Wieters 1B BAL NYY +0.47
13 7/07 Troy Tulowitzki 1B TOR DET +0.47
14 4/12 Geovany Soto HR OAK LAA +0.45
15 5/20 Melvin Upton HR LAD SD -0.44
16 5/10 Ryan Rua HR TEX CHW +0.43
17 4/12 Geovany Soto HR LAA OAK -0.43
18 4/08 Starling Marte grand slam PIT CIN +0.42
19 4/08 Starling Marte grand slam CIN PIT -0.42
20 6/24 Adam Lind HR STL SEA -0.42
21 5/21 Jayson Werth GIDP WAS MIA -0.42
22 5/28 Drew Butera 2B CHW KC -0.41
23 6/23 Adonis Garcia HR NYM ATL -0.41
24 5/17 Ian Desmond HR OAK TEX -0.41
25 6/11 Prince Fielder HR SEA TEX -0.41

Most Critical Moments

We can measure the importance that a particular play has on a game by using leverage index (LI), but this is limited to the situation in the game and it treats all games the same. Just as with WPA and cWPA above, we can take this one step further and measure the importance of the game by including the game’s championship leverage index (CLI). This number shows the importance of the game for each team, where the average game equals 1. If a win or a loss has a significant effect on the team’s playoff probability, the CLI will be greater than 1. By multiplying the LI and CLI, we can measure the importance that a play has on a team’s playoff probability. We’ll call this number pCLI (for championship leverage index by play). This number can be read as “how many times more important this situation was compared to the average play on opening day”.
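The multiplication described above is simple enough to sketch directly. The `Situation` type, the `pcli` function name, and the sample LI and CLI values below are all hypothetical and chosen only to show the ranking; they are not taken from the site’s data.

```python
from typing import NamedTuple


class Situation(NamedTuple):
    description: str
    li: float   # in-game leverage index of the plate appearance
    cli: float  # championship leverage index of the game (average game = 1)


def pcli(li: float, cli: float) -> float:
    """Championship leverage index by play: in-game leverage times game leverage."""
    return li * cli


# Hypothetical situations, for illustration only
situations = [
    Situation("average play in an average game", 1.0, 1.0),
    Situation("tense at bat in a meaningless game", 8.0, 0.5),
    Situation("tense at bat in an important game", 6.0, 2.0),
]

# Rank from most to least critical
ranked = sorted(situations, key=lambda s: pcli(s.li, s.cli), reverse=True)
for s in ranked:
    print(f"{pcli(s.li, s.cli):5.1f}  {s.description}")
```

Note how the at bat with the highest in-game leverage does not rank first once the importance of the game itself is taken into account.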

Below are the top 20 most critical situations of the first half. As with the list above, some plays will appear twice, since they were important to BOTH teams’ playoff chances.

Rk Date Team Inning Outs Runners Score pCLI Outcome
1 5/17 TEX Bot 9 2 Loaded 5 – 4 15.4 Khris Davis grand slam
2 6/12 WAS Bot 9 2 Loaded 4 – 3 14.2 Jayson Werth 1B
3 6/05 WAS Bot 9 2 Loaded 10 – 9 14.1 Ivan de Jesus fly out
4 6/24 TOR Top 9 2 Loaded 2 – 3 12.7 Michael Saunders pop out
5 6/11 SEA Bot 11 2 1st & 2nd 2 – 1 12.5 Kyle Seager fly out
6 5/17 TEX Bot 9 1 Loaded 5 – 4 12.2 Danny Valencia fly out
7 7/07 TOR Bot 8 2 Loaded 4 – 3 12.1 Troy Tulowitzki 1B
8 6/11 TEX Bot 11 2 1st & 2nd 2 – 1 12.0 Kyle Seager fly out
9 6/11 SEA Bot 10 2 Loaded 1 – 1 11.8 Ketel Marte fly out
10 5/06 BOS Top 9 2 Loaded 2 – 3 11.7 Hanley Ramirez strikeout
11 6/10 SFN Bot 9 2 1st & 2nd 3 – 2 11.5 Brandon Crawford strikeout
12 7/06 HOU Top 9 2 Loaded 8 – 9 11.5 Dae-Ho Lee strikeout
13 6/10 LAN Bot 9 2 1st & 2nd 3 – 2 11.5 Brandon Crawford strikeout
14 6/11 TEX Bot 10 2 Loaded 1 – 1 11.3 Ketel Marte fly out
15 6/05 WAS Bot 9 1 Loaded 10 – 9 11.1 Zack Cozart strikeout
16 6/05 BAL Bot 8 2 Loaded 1 – 0 10.9 Matt Wieters 1B
17 6/05 BOS Bot 9 2 1st & 2nd 5 – 4 10.9 Marco Hernandez strikeout
18 6/30 NYN Top 9 2 Loaded 3 – 4 10.8 Javier Baez pop out
19 6/18 TOR Top 9 1 Loaded 2 – 4 10.6 Josh Donaldson GIDP
20 6/21 BAL Bot 8 2 Loaded 7 – 6 10.6 Adam Jones ground out

If we revisit these lists at the end of the season, there is a good chance it will be dominated by second half plays. The reason for this is, just as the most important plays happen in the later innings of the game, the most important games occur near the end of the season. However, 2016 may be different since 5 of the 6 division leaders currently have at least a 5 game lead, which may lead to less enjoyable divisional races. For the sake of exciting plays and games, let’s hope some of these leads shorten.

Posted in General, Statistical Analysis | 5 Comments

Reunions

A lot of attention is being paid to the 30th anniversary of the 1986 New York Mets, especially with their reunion this weekend. In addition to being one of the best teams in baseball history, they are one of the most interesting. While I look forward to seeing the team on the field together once again, it will be on a somber note since Gary Carter, who succumbed to brain cancer four years ago, won’t be there to join them. Carter is the team’s lone Hall of Famer and is the only member of the 1986 Mets who is no longer with us.

This got me thinking about how rare it is to have every member of a team survive multiple decades after they last took the field. So if you can get past the morbidity of this article, let’s take a look at which teams could have a reunion with all of their players still around to take the field.

Year Team # of Players Record
1978 San Diego Padres 38 84-78
1978 Seattle Mariners 34 56-104
1979 San Diego Padres 35 68-93
1980 Milwaukee Brewers 33 86-76
1980 Texas Rangers 41 76-85
1982 Milwaukee Brewers 33 95-67
1982 Texas Rangers 40 64-98
1983 Boston Red Sox 31 78-84
1983 Milwaukee Brewers 36 87-75
1983 Toronto Blue Jays 33 89-73

1978 San Diego Padres and Seattle Mariners
The earliest teams with all of their players currently living are the 1978 Padres and Mariners. The Mariners were pretty dreadful, losing 104 games in their 2nd season of existence. The Padres (84-78, 11 GB), on the other hand, were just starting to acquire big names, thanks to the advent of free agency and Ray Kroc’s deep pockets. A reunion of the 1978 Padres would include Hall of Famers Dave Winfield, Rollie Fingers and Gaylord Perry. Not to mention Gene Tenace, Randy Jones and Oscar Gamble, to name a few. But while the 1978 Padres had a number of big names, teams usually hold reunions for pennant or World Series winners.

Earliest Pennant winners with all currently living players

Year Team # of Players WS Win
1982 Milwaukee Brewers 33 N
1992 Atlanta Braves 41 N
1992 Toronto Blue Jays 40 Y
1993 Toronto Blue Jays 38 Y
1995 Cleveland Indians 41 N
1996 Atlanta Braves 42 N
1997 Cleveland Indians 46 N
1997 Florida Marlins 43 Y
1999 Atlanta Braves 44 N

Brewers manager Harvey Kuenn passed away in 1988, but every one of his “Wallbangers” has survived from the 1982 team. 2017 will be the 35th anniversary of their American League Pennant and it will be great to have every member be able to attend a possible reunion.

Posted in General, Historical | 1 Comment

MLB.tv Game Changer (formerly Dashboard)

I can’t tell you how many times I’ve been watching a game on MLB.tv and all of a sudden, my Twitter feed blows up when something big happens, like Giancarlo Stanton hitting another 450+ ft home run. But I missed it because I didn’t know that he was at bat. So I decided to make something that will allow me to customize my baseball viewing experience. Something that will allow me to see as much of the baseball I want to see as possible. I got the idea from using Dan Brooks’ (of brooksbaseball.net) MLB.tv RedZone, which switches between the games with the highest leverage index. This allowed me to see the potential of using MLB’s gameday data.

Enter the MLB.tv Dashboard, which allows you to customize what you want to watch the most, and automatically switches between games based on your priorities. Here are some of the things you can customize:

  • Batters: Want to see Bryce Harper, Mike Trout, Giancarlo Stanton, or even Bartolo Colon at the plate? Add them to your list, and the application will switch to their game when they come to the plate.
  • Pitchers: Don’t want to miss Jake Arrieta or Clayton Kershaw pitch? This will switch to their game while they are on the mound and go to another game on your list while their team is batting.
  • Baserunners: It doesn’t get much more exciting than when Billy Hamilton is on the bases. If the runner of your choosing is on 1st or 2nd base with the next base open, the application will make sure you see it.
  • Fantasy Teams: The three settings above allow you to add all of the players on your fantasy team to track their progress.
  • Teams: Let’s say you’re a huge Braves fan and that’s mostly what you want to see. But you also want to see each of Manny Machado’s at bats. Put Machado at #1 and Atlanta at #2 and you’ll get to watch your Braves games, but the application switches to Orioles games when Machado comes to the plate. It will switch back when his at bat is over.
  • Leverage Index: If you don’t want to miss a tense moment in any game, set the LI to something like “>= 3.0” and it will switch to any game that meets that criterion.
  • No-hitters: Don’t want to miss a potential no-hitter, but don’t think it’s important until after the 7th inning? You can set that as a priority.
  • Vin Scully: There are only a few more months left in which we get to appreciate the greatest announcer in baseball history. This setting will switch to Dodger games when they are at home, or on the road in San Diego, Anaheim and San Francisco.
  • Position Players Pitching: Who doesn’t love to see a left fielder pitch in a blowout or in the 18th inning? This setting will switch to a game if a non-pitcher is on the mound.
  • Extra Innings: Self-explanatory. This setting will switch to games that are in extra innings.
  • Replay Challenge/Review: Let’s say you are a masochist and you want to see all replay challenges. This will switch to those situations.

Now suppose none of your priority items are met, or your teams are on commercial break. In that event, the application will switch to the game with the current highest leverage index. It will keep switching between these games until any of your priority criteria are met.
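The switching behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the application's actual code: the `GameState` fields, the `pick_game` function and the way rules are expressed as simple functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class GameState:
    """Hypothetical snapshot of one live game (field names are assumptions)."""
    game_id: str
    teams: Tuple[str, str]
    batter: str
    leverage_index: float
    in_commercial: bool = False


# A priority rule is any function that says whether a game matches it.
Rule = Callable[[GameState], bool]


def pick_game(games: List[GameState],
              priorities: List[Rule],
              ignored_teams: Iterable[str] = ()) -> Optional[GameState]:
    """Return the game to show: the first match for the highest-ranked rule,
    otherwise the watchable game with the highest leverage index."""
    ignored = set(ignored_teams)
    watchable = [g for g in games
                 if not g.in_commercial and not (set(g.teams) & ignored)]
    if not watchable:
        return None
    # Rules are checked in the user's priority order (#1 first).
    for rule in priorities:
        for game in watchable:
            if rule(game):
                return game
    # Nothing on the priority list is happening: fall back to leverage index.
    return max(watchable, key=lambda g: g.leverage_index)
```

With Machado at #1 and Atlanta at #2, the priority list would be `[lambda g: g.batter == "Manny Machado", lambda g: "ATL" in g.teams]`. The Braves game is chosen until Machado comes to the plate, and the highest leverage game is chosen whenever neither rule matches.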

In addition to the above priority items, you can also avoid watching any teams of your choosing. If you’re blacked out from seeing a certain team, you can add them to your ignore list and the application will avoid changing to their games since you won’t be able to watch them.

We all know MLB.tv has a delay from the actual live game to when you see it on your screen. You can adjust the delay timer setting if the games are switching too early or late. However, I recommend not changing this setting too much since it can severely alter the experience.

Finally, you can set whether or not you want the application to wait for the current at bat to finish before switching to a higher priority game, or if you want it to change immediately. The default setting is to wait. Changing the setting to switch immediately will allow you to not miss higher priority situations, but beware that you could see a lot of changing at inopportune times.

If you choose to use this, I hope you enjoy it. Let me know what you think.

Posted in Announcements, General, Site Additions | 7 Comments

Game Star Ratings

I’ve added a star rating to each game. It measures the “enjoyability” of a game based on a few different factors. There are many elements to a game that can make it enjoyable to the unbiased fan. I’ve tried to include the most important of these.

The rating system ranges from 0 stars to 5 stars and goes in increments of 0.25 stars. The average game will be around 2.5 stars.

Leverage Index (aLI)
The first and most important element is leverage index, which measures the importance of each situation in the game. The more crucial a moment in a game is to the outcome, the higher the leverage index will be. A leverage index of “1” is average. A leverage index of “3” is equal to three times as important as the average play. FanGraphs has a primer on LI for those interested.

In the game rating formula, I use average leverage index over the course of the entire game. I could have chosen to just go with the top X plays in a game, or the number of plays over a certain threshold, but I felt the average over the course of the game is best suited to gauge the intensity of the entire game.

Win Expectancy Change (WE+/-)
Next is change in win expectancy per play. Suppose an RBI single increases a team’s win expectancy from 55% to 65%. That would obviously be an increase of 10 percentage points. I calculate the average absolute value of WE change over the course of the entire game and use that number for my rating formula.

The average play in the average game will have a win expectancy change of about 3.3 percentage points. Bigger and more exciting plays will increase this number, while plays in blowout games will do the opposite.
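As a rough sketch of that calculation, the function below averages the absolute play-to-play change in win expectancy. The input format (a list of one team's win expectancy in percent, before the first play and after every play) and the function name are assumptions made for the example.

```python
def average_we_change(win_expectancies):
    """Average absolute win expectancy change per play, in percentage points.

    win_expectancies: one team's win expectancy (in percent) before the first
    play and after each play, in order. This input format is an assumption.
    """
    changes = [abs(after - before)
               for before, after in zip(win_expectancies, win_expectancies[1:])]
    if not changes:
        raise ValueError("need at least two win expectancy values")
    return sum(changes) / len(changes)
```

For the RBI single above, `average_we_change([55, 65])` gives 10.0. Because absolute values are used, a play that swings the game toward either team counts the same.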

Leverage index and win expectancy change very likely have a high correlation, which would cause these elements to be “double counted”. I have taken that into consideration and am fine with it since they are the most important factors in gauging a game’s intensity.

Championship Leverage Index (CLI)
Championship leverage index is similar to in-game leverage index (above), except that it gauges the importance of a single game rather than a single play. Game importance is measured by how much a team’s probability of winning the World Series differs between a win and a loss.

A CLI of 1 is the baseline and is equal to the importance of the average game on opening day. In the 2nd Wild Card era (2012-present), the average game on opening day can change a team’s chances of winning the World Series by 0.59 percentage points.

The CLI used in this ratings formula is the average of the two teams’ CLIs for this game.

Examples: A team that is already eliminated has a 0% chance of winning the World Series. A win will not increase their chances, so their CLI will be 0. The same goes for a team that has already clinched their division. A division title ensures that a team is 1 of the 8 teams in the postseason tournament, meaning they have a 12.5% chance of winning the World Series. A win or a loss after clinching the division will not change this number. But a one-game playoff for the division (game 163) is a “win or go home” scenario and will have a CLI of around 21, since it is 21 times more important than the average game on opening day.
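The examples above are consistent with a simple reading of CLI: the swing in World Series probability between a win and a loss, divided by the 0.59 point opening day swing. The sketch below assumes that reading. The function names are made up for the example, and the actual calculation may differ in its details.

```python
OPENING_DAY_SWING = 0.59  # percentage points, 2nd Wild Card era (from the text)


def championship_leverage(ws_pct_if_win, ws_pct_if_loss,
                          baseline=OPENING_DAY_SWING):
    """One team's CLI: World Series probability swing relative to the baseline.

    Both probabilities are in percent. Dividing by the opening day swing is an
    assumption inferred from the examples in the text.
    """
    return (ws_pct_if_win - ws_pct_if_loss) / baseline


def game_cli(home_cli, away_cli):
    """CLI used for the game rating: the average of the two teams' CLIs."""
    return (home_cli + away_cli) / 2
```

A team that has clinched sits at 12.5% whether it wins or loses, so `championship_leverage(12.5, 12.5)` is 0. In game 163 the winner moves to 12.5% and the loser to 0%, so `championship_leverage(12.5, 0.0)` is about 21.2, in line with the "around 21" quoted above.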

Comeback (CB)
The final element is comeback, which is defined as the highest win expectancy the losing team reached during the game. A comeback can range anywhere from 50 percentage points to 100 percentage points. A comeback of 100 percentage points means that the losing team had a 100% chance of winning, but still managed to lose the game. A comeback of 50 percentage points means the losing team was never able to increase their win expectancy above the 50% level at the beginning of the game and likely means the game was never much in doubt.
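In code, the comeback value is just the peak of the losing team's win expectancy. The sketch below assumes the game is stored as a list of the home team's win expectancy in percent, starting at 50 before the first play; that format and the function name are assumptions made for the example.

```python
def comeback(home_we_by_play, home_team_won):
    """Highest win expectancy (in percent) reached by the team that lost.

    home_we_by_play: the home team's win expectancy before the first play and
    after each play. The visitors' value is taken as 100 minus the home value.
    """
    if home_team_won:
        loser_we = [100 - we for we in home_we_by_play]
    else:
        loser_we = list(home_we_by_play)
    return max(loser_we)
```

A home team that never gets above its starting 50% and loses, as in `comeback([50, 40, 15, 30, 0], False)`, scores 50, the minimum possible value.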

Formula and Weights
Each of the four elements (aLI, WE+/-, CLI, CB) is individually compared to a large sample of games and ranked as a percentile. These percentiles are then weighted and combined to create the star rating. The weights are:
aLI = 1.5
WE+/- = 1.5
CLI = 1
CB = 1

Example: A game has an average leverage index of 1.25, an average win expectancy change of 4.5 percentage points, a championship leverage index of 1.55, and an 85% comeback. Their percentiles and weights are:
aLI = 70 * 1.5
WE+/- = 82 * 1.5
CLI = 92 * 1
CB = 90 * 1

Their sum is 410. This number is divided by 25 and rounded to the nearest whole number, and that result is divided by 4 to give the star rating. This game would be a 4 star game (410 / 25 = 16.4, which rounds to 16, and 16 / 4 = 4).
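The arithmetic can be written as a short Python function. The weights and the divide-by-25, round, divide-by-4 steps come straight from the description above. The function name, the dictionary keys and the choice to round exact halves upward are assumptions, since the text does not say how ties are rounded.

```python
import math

# Weights from the list above.
WEIGHTS = {"aLI": 1.5, "WE+/-": 1.5, "CLI": 1.0, "CB": 1.0}


def star_rating(percentiles):
    """Combine the four percentiles (0-100) into a 0-5 star rating."""
    weighted_sum = sum(WEIGHTS[name] * percentiles[name] for name in WEIGHTS)
    # The maximum weighted sum is 500, so dividing by 25 gives 0-20 quarter stars.
    quarter_stars = math.floor(weighted_sum / 25 + 0.5)  # ties round up (assumption)
    return quarter_stars / 4


example = {"aLI": 70, "WE+/-": 82, "CLI": 92, "CB": 90}
print(star_rating(example))  # 4.0
```

Since the weights add up to 5, a game in the 100th percentile of every element gets exactly 5 stars, and a game in the 50th percentile of every element gets 2.5.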

Elements of a game not currently included in star rating system

Individual Game Performances and Milestones
A player hitting 4 HR in a game is exciting and uncommon and makes each of the at bats more important. A pitcher taking a no-hitter or perfect game late into the game has the same effect. These types of elements are currently not included, but are “on the table” for future versions.

Star Players
One could argue that having more superstar players in a game makes it more enjoyable. This rating system does not take the players’ superstar status or skill level into account.

Special Games
While Derek Jeter’s final home game was exciting in its own right, I would argue that it was even more enjoyable since it was his final game at Yankee Stadium. This rating system doesn’t take these rare situations into account.

The Home Crowd’s Enjoyment
As mentioned above, this star rating measures the enjoyment for the unbiased fan. The home crowd may have a different definition of an enjoyable game based on whether their team wins, but this system makes no such distinction.

Posted in Announcements, General, Statistical Analysis | 5 Comments

2016 Retrosheet Database

One of the best days of every offseason for a baseball nerd is Retrosheet’s annual end-of-season release day. It’s the day one of the best sites on the internet releases the play-by-play data from the previous season. If you’re like me, you download it immediately and go to town. But one thing I’ve always wanted was the ability to access the data during the current season.

So this past offseason, I designed a way to take MLB’s Gameday data and convert it into a Chadwick-style Retrosheet database. The database (.csv files) will be available and updated daily* in the downloads section. I’m making it available mainly because I know there are others out there, like me, who are interested in having an in-season play-by-play database. But also because I’d like to have more than one set of eyes on it, to help iron out the kinks and catch any errors.

Error Checking
I run a few processes to check for errors and to validate the data. But there is still the possibility that errors will come up from time to time. I’d like to make this a forum for error reporting, for those who are interested in helping.

Daily Download
Just as with this website, I intend to have updates available daily. Usually, the site is updated in the morning. But with a full-time job and two toddlers at home, there can sometimes be a delay.

Missing columns
There are a few columns in the events table that I have left blank:
“EVENT_TX”: It turns out that it is a huge process to replicate this. While I believe the “EVENT_TX” column is helpful in quickly identifying the play, I don’t use it in my queries and felt it wasn’t worth the hassle. The same goes for the “BAT_PLAY_TX” and the “RUNX_PLAY_TX” columns.

“BATTEDBALL_LOC_TX”: Gameday does include hit locations for all balls in play, but I have yet to dive into this data. If there is someone who has experience with this data and is willing to assist in converting Gameday’s x and y coordinates to Project Scoresheet location codes, please let me know.

“UMP_ID”: These columns for the six umpires are currently left blank.

“GWRBI_BAT_ID”: This is left blank because game-winning RBIs are no longer officially recorded.

IDs for players making their Major League debut
Since these players have yet to be assigned official IDs by Retrosheet, I just give them the next available ID for their name. For example, if a John Smith were to make a debut, he would be assigned ID “smitj005”, since 005 would be next in line.
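A minimal sketch of that ID assignment is shown below. It assumes the usual ID shape seen in the example (first four letters of the last name, first initial, three-digit sequence number) and a plain list of IDs already in use. The function name is made up, and short or punctuated last names are not handled.

```python
def next_debut_id(last_name, first_name, existing_ids):
    """Return the next unused ID for a debuting player's name.

    Simplified sketch: takes the first four characters of the last name as-is,
    so short names or names with punctuation would need extra handling.
    """
    prefix = (last_name[:4] + first_name[:1]).lower()
    # Collect the sequence numbers already used for this name prefix.
    used = [int(pid[len(prefix):]) for pid in existing_ids
            if pid.startswith(prefix) and pid[len(prefix):].isdigit()]
    return f"{prefix}{max(used, default=0) + 1:03d}"


taken = ["smitj001", "smitj002", "smitj003", "smitj004", "smitd001"]
print(next_debut_id("Smith", "John", taken))  # smitj005
```

These temporary IDs may not match the ones Retrosheet eventually assigns, so anyone joining this data to the official release should expect to remap debut players.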

Building a Retrosheet Database
For those who are interested in using the data, but lack experience, David Temple at TechGraphs recently created a helpful two-part tutorial.

Donations
If you find this data useful and have some disposable income, please consider donating. I do not get paid for my work on this website and while it is my passion to work with baseball data, it does take a lot of time and money (server costs) to keep it up. I’d like to also suggest donating to David Smith and the Retrosheet team.





Posted in Announcements, General, Site Additions | Leave a comment

2011 Royals Farm System

The Royals are on the brink of winning their first World Series in 30 years, leading the Mets three games to one. While it’s certainly not over (5 of 43 teams have come back from being down 3-1 in the World Series), I figured it was as good a time as any to write about the Royals’ 2011 farm system.

Compilation of Royals Top Prospect Rankings
Prospect Pos FG BA BP Sickels Avg
Mike Moustakas 3B 1 3 1 1 1.5
Eric Hosmer 1B 2 1 3 2 2.0
Wil Myers OF 3 2 4 3 3.0
John Lamb LHP 5 4 2 6 4.3
Mike Montgomery LHP 4 5 5 5 4.8
Danny Duffy LHP 7 7 7 4 6.3
Chris Dwyer LHP 9 8 6 9 8.0
Christian Colon SS 8 6 8 11 8.3
Jeremy Jeffress RHP 11 8 9.5
Brett Eibner OF 12 10 14 10 11.5
Tim Collins LHP 13 13 10 16 13.0
Aaron Crow RHP 14 9 16 14 13.3
Tim Melville RHP 15 14 15 14.7
Johnny Giavotella 2B 21 18 9 12 15.0
Yordano Ventura RHP 10 12 13 27 15.5
Cheslor Cuthbert 3B 17 15 12 19 15.8
Louis Coleman RHP 20 19 17 13 17.3
Jason Adam RHP 16 11 15 28 17.5
Robinson Yambati RHP 19 16 11 26 18.0
Salvador Perez C 18 17 20 18 18.3
Patrick Keating RHP 22 22 17 20.3
Will Smith LHP 19 25 22.0
Derrick Robinson OF 23 26 18 22.3
Jarrod Dyson OF 26 20 23.0
Kevin Chapman LHP 23 23.0
David Lough OF 24 25 22 23.7
Jeff Bianchi SS 30 21 21 24.0
Orlando Calixte SS 24 24.0
Clint Robinson 1B 28 20 24.0
Buddy Baumann LHP 24 24.0
Noah Arguelles LHP 25 25.0
Humberto Arteaga SS 27 23 25.0
Henry Barrera RHP 27 27.0
Crawford Simmons LHP 28 28.0
Lucas May C 29 29.0
Elisaul Pimentel RHP 29 29.0
Kelvin Herrera RHP 30 30.0
Greg Holland RHP

Below is a list of the top ten farm systems in 2011, using five different rankings from the industry. The first three were a consensus among most of the publications, with Kansas City topping the list in each one.

Compilation of Farm System Rankings
Rank Team BA BP Law Sickels THT Avg
1 Royals 1 1 1 1 1 1.0
2 Rays 2 2 2 2 2 2.0
3 Braves 3 3 3 4 3 3.2
4 Blue Jays 4 5 4 5 4 4.4
5 Yankees 5 4 9 6 14 7.6
6 Reds 6 9 8 7 11 8.2
7 Indians 7 7 17 3 9 8.6
8 Angels 15 6 6 8 8 8.6
9 Phillies 10 8 5 11 12 9.2
10 Twins 12 15 7 9 6 9.8
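The “Avg” column in both tables appears to be a plain mean of whichever rankings are available, so a prospect who shows up on only two lists is averaged over those two. A small sketch, assuming missing rankings are stored as `None` (the function name is made up for the example):

```python
def composite_rank(ranks):
    """Mean of the available rankings; None marks a list that left the entry off."""
    available = [r for r in ranks if r is not None]
    if not available:
        return None  # unranked everywhere, like the last row of the prospect table
    return sum(available) / len(available)


# Yankees farm system: BA, BP, Law, Sickels, THT
print(composite_rank([5, 4, 9, 6, 14]))      # 7.6
# A prospect ranked 11th and 8th on two lists and unranked on the other two.
# Which lists omitted him is not recoverable here, so the positions are arbitrary.
print(composite_rank([11, 8, None, None]))   # 9.5
```

One side effect of this approach is that a player ranked on a single list can end up ahead of players ranked on all of them, which is worth keeping in mind when reading the bottom of the prospect table.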

Rankings alone won’t do the praise for this system justice, so here are some of the comments coming from those who ranked them:

Kevin Goldstein at Baseball Prospectus:

This is not just the best minor-league system in baseball, it’s the best by a wide margin. The more I wrote about these prospects, the more trouble I had figuring out any way for things to go wrong. Another winning record could occur as early as 2012, but more importantly, the team should return to annual playoff contention shortly thereafter.

Keith Law at ESPN:

The phrase “Mission Accomplished” has acquired an ironic connotation of late, but if anyone could use the phrase earnestly to describe his own efforts, it would be [Dayton] Moore, as the Royals have arms coming out of their ears.

That’s particularly impressive when you consider that Kansas City’s top two prospects are bats, and there are some solid position player prospects further down in the system.

Jim Callis at Baseball America:

The Royals set a record by placing nine players on our Top 100 Prospects list, starting with three of the very best hitting prospects in the minors in 1B Eric Hosmer, 3B Mike Moustakas and OF Wil Myers. They also have an enviable collection of lefthanders, led by John Lamb, Mike Montgomery, Danny Duffy and Chris Dwyer.

John Sickels at minorleagueball.com:

What can you say? This is one hell of a farm system. While the young pitching gets a large amount of attention, and deservedly so, the Royals also have three of the most elite young bats in baseball in the Moustakas/Hosmer/Myers troika.

Matt Hagen at The Hardball Times:

Kansas City has a plethora of top-end impact talent and loads of depth throughout. The best system in baseball and a reason to follow America’s pastime for long-suffering Royal fans.

It’s difficult to find one negative comment about this farm system. It contained high impact talent AND depth at almost all positions. Years of futility earned the Royals multiple early round draft picks. From 2005 to 2010, they had a top five pick on five different occasions. Additionally, Kansas City was active on the international front, signing players out of the Dominican Republic (Kelvin Herrera & Yordano Ventura), Venezuela (Salvador Perez), and even Nicaragua (Cheslor Cuthbert).

In 2011, Doug Gray at minorleagueball.com assigned a monetary value to every prospect in baseball based on John Sickels’ grading system. He estimated the Kansas City farm system to be worth $243 million while the next best team (Tampa Bay) was worth $184 million. Here is a graph Doug provided, showing all farm systems in 2011:
2011 Farm System dollar values

But having a top farm system has never guaranteed success. Scott McKinney at Royals Review studied prospect success and failure rates and determined that 70% of Baseball America Top 100 prospects are failures. In August of this year, Alex Speier wrote in the Boston Globe:

Remarkably, none of the last 14 organizations to be designated with the top farm system by Baseball America has won a World Series since receiving that accolade. The last team to hoist a championship trophy following a top farm system ranking was the 2005 White Sox, four years after they’d been named the top farm system in 2001.

Of course, this will change if the Royals can win just one of the next three World Series games.

So what does a team with the best farm system need to do to reach the next step?
First, they need to continue to develop these players, as none of them are finished products.
The Royals did just that, as a good number of their prospects reached the big league level. Of course, not all of them have reached their potential, but that is expected.

Next, they need to surround this core with complementary players.
Whether it is from outside the organization through free agency or via trade, Dayton Moore may have done his best work in this aspect. He parlayed Wil Myers, Jake Odorizzi & Mike Montgomery into James Shields and more importantly, Wade Davis. He also signed Chris Young, Edinson Volquez, Kendrys Morales & Ryan Madson to fill out the roster. Finally, at the trade deadline this year, Moore traded prospects to acquire the final pieces to the puzzle in Johnny Cueto and Ben Zobrist.

Finally, they need luck.
Because sometimes no matter how hard you try, things just don’t go as planned. On the other hand, there’s always a chance to find that diamond in the rough that you weren’t expecting. As Branch Rickey said, “Luck is the residue of design.” It’s hard to tell how much of the Royals’ success is luck, but by putting the organization in the best situation possible, they have been in position to capitalize on many breaks.

We can now look back at the 2011 farm system to see where it ranks among the 30 teams in terms of wins above replacement. Granted, it is far too early to make a final judgement on these systems as most of the players are still beginning their careers. Below is a table showing how many wins above replacement each team’s 2011 farm system has produced, along with their winning percentages in subsequent seasons. The top prospect is the player with the most WAR in that farm system.

# Teams WAR 2011 2012 2013 2014 2015 Top Prospect
1 DBacks 100.4 .580 .500 .500 .395 .488 Paul Goldschmidt
2 Royals 91.4 .438 .444 .531 .549 .586 Salvador Perez
3 Angels 88.2 .531 .549 .481 .605 .525 Mike Trout
4 Braves 85.3 .549 .580 .593 .488 .414 Andrelton Simmons
5 Cardinals 74.9 .556 .543 .599 .556 .617 Matt Carpenter
6 Rays 70.6 .562 .556 .564 .475 .494 Desmond Jennings
7 Reds 70.3 .488 .599 .556 .469 .395 Todd Frazier
8 Indians 63.9 .494 .420 .568 .525 .503 Jason Kipnis
9 Pirates 61.6 .444 .488 .580 .543 .605 Starling Marte
10 Nationals 60.6 .497 .605 .531 .593 .512 Bryce Harper
11 Blue Jays 59.2 .500 .451 .457 .512 .574 Brett Lawrie
12 Mets 58.1 .475 .457 .457 .488 .556 Matt Harvey
13 White Sox 56.8 .488 .525 .389 .451 .469 Chris Sale
14 Astros 56.3 .346 .340 .315 .432 .531 Jose Altuve
15 Mariners 50.7 .414 .463 .438 .537 .469 Kyle Seager
16 Athletics 49.8 .457 .580 .593 .543 .420 Josh Donaldson
17 Twins 46.8 .389 .407 .407 .432 .512 Brian Dozier
18 Dodgers 42.7 .509 .531 .568 .580 .568 Kenley Jansen
19 Yankees 42.2 .599 .586 .525 .519 .537 Jose Quintana
20 Padres 42.2 .438 .469 .469 .475 .457 Anthony Rizzo
21 Rockies 41.5 .451 .395 .457 .407 .420 Nolan Arenado
22 Orioles 40.5 .426 .574 .525 .593 .500 Manny Machado
23 Red Sox 40.0 .556 .426 .599 .438 .481 Josh Reddick
24 Marlins 38.1 .444 .426 .383 .475 .438 Christian Yelich
25 Giants 36.5 .531 .580 .469 .543 .519 Brandon Crawford
26 Cubs 33.4 .438 .377 .407 .451 .599 Welington Castillo
27 Brewers 26.7 .593 .512 .457 .506 .420 Mike Fiers
28 Tigers 24.4 .586 .543 .574 .556 .460 Drew Smyly
29 Phillies 23.6 .630 .500 .451 .451 .389 Jarred Cosart
30 Rangers 23.4 .593 .574 .558 .414 .543 Pedro Strop

*Note: I have only included players with positive career WAR in these totals.

Surprisingly, the Diamondbacks have accumulated the most WAR of any team from the 2011 prospect class. However, Kansas City is not far behind in second place. Whether or not this farm system turns out to be among the all-time greats remains to be seen. But if they end up winning the World Series in the next few nights, it will be impossible not to deem it a success.

Thanks to Hawkins DuBois for helping out with prospect lists.

Posted in General, Historical, Statistical Analysis | Leave a comment