Measuring a Pitcher’s Ace Factor
Each pitcher’s season, represented by a blue diamond, was measured by a calculation that didn’t consider any of the usual suspects typically used in measuring a pitcher’s effectiveness, such as ERA, strikeouts, WHIP, shutouts, or even ground ball percentages, contact rates, or FIP. Yet, when you see the names associated with the outliers on the right, you’ll agree that this plot captures the most dominant pitching performances over the past 61 years. I’m sure you’re familiar with the work of Greg Maddux in 1995 (5th) and Bob Gibson in 1968 (4th). (Can you guess the top three?)
The algorithm that generated this plot is a stepping stone for an alternative way of grading a starting pitcher’s body of work. It’s a tool versatile enough to bolster Cy Young award arguments, compare pitching staffs from different generations, and serve as a benchmark for assessing a pitcher’s dominance over his entire career.
First, some background.
There’s a rift in baseball’s grading system.
The W-L record, a long-time staple of pitching performance numbers, is now being ignored like some unwanted guest. With a saber’d up BBWAA awarding three of the last four Cy Young trophies to a 16-, a 15-, and a 13-game winner, a growing number of fans would hardly mind if the W-L record just upped and left the stat sheet.
Long serving as a “dummy’s guide” for gauging a pitcher’s success, the premise behind the W-L record was simple: assign credit or blame to one and only one pitcher who was most responsible for the team’s winning or losing that day. It was baseball’s way of holding a pitching staff accountable for every game.
Eventually, we figured out that the “pitcher of record” scheme is a lousy way to stamp a grade on the guy on the mound. For one, the game’s binary outcome (a win or a loss) is caused by a network of circumstances that are often beyond a pitcher’s control. Sure, a starting pitcher owns a significant slice of the responsibility pie; he is the guy holding the baseball at the start of each and every play. But ultimately, it’s the play of his teammates that skews a pitcher’s W-L record into one of the most misleading indicators in all sports. Poor defense can turn a pitching gem into a heartbreaking loss, just as infectious team hitting can save a pitcher’s butt and give him a W on days when his stuff was more worthy of the L word.
When Felix Hernandez took home the 2010 AL Cy Young award with a journeyman-like 13-12 record, you would have thought that the W-L record had just been served its eviction notice. The announcement triggered eruptions of sabermetric celebrations and mock eulogies, with stat fanatics declaring, “Our work is done here.”
Were new-age statistics such as FIP, WAR, and tERA finally hitting mainstream?
Not quite. Tom Tango, a think-tank hub for many of the game’s recent statistical advancements, cautioned that although the Hernandez-led Cy Young rankings seem to show that voters did indeed “devalue” the W-L record, it was the traditional ‘baseball card’ numbers that stood tall and proud. Hernandez’s ERA and strikeouts, just to name two, were simply too dominant to ignore.
At least for now, sabermetrics still sits just outside the perimeter of consideration for baseball’s major awards. So we’re not quite sure how this pageantry for pitchers will play out in future Novembers.
The problem is this: Without putting trust in a bottom-line assessment like the W-L record, evaluating the work of certain pitchers seems to require a statistical thesis just to back an argument. Just ask Bert Blyleven, who arguably needed a cult of sabermetric support to make his 2011 induction even remotely possible.
Not that there’s a shortage of alternatives for measuring pitching effectiveness. The raw counting numbers (IP, ER, H, K, BB) and their ratios (ERA, K/9, K/BB, WHIP) are a great first look (for many, it’s the last look). Sabermetrics kicks in when the raw data is massaged into a more useful context. Adjusted ERA (also known as ERA+), a standard at B-R.com, takes a pitcher’s normal ERA and compensates for ballparks and league average ERA. The FIP methodology (Fielding Independent Pitching), a Tango creation found on a pitcher’s dashboard at fangraphs.com, defines a more authentic ERA in the spirit of Voros McCracken’s revolutionary DIPS theory, which states that what happens once a ball is hit in play is largely out of the hands of the pitcher. Pitching dissection hits full throttle with the availability of batted-ball data, which gives us GB/LD/FB percentages, and pitch-by-pitch data, the roots of swing and contact rates relative to the strike zone.
These tools are excellent indicators of future performance. Their ability to put a number on a pitcher’s tendencies helps generate virtual scouting reports.
But nowhere in this toolbox is there a stat that measures the bottom line performance of each outing with the finality of a report card grade…something like what the W-L record was used for.
When the chatter that followed Hernandez’s Cy Young win became more about the demise of the W-L record than about the incredible season Hernandez turned in, I have to admit that I empathized with Roy Halladay’s stance on the issue. “Sometimes the run support isn’t there, but you sometimes just find ways to win games,” said Halladay, tossing a subtle scoff at Hernandez’s 13-12 record. “I think the guys that are winning and helping their teams deserve a strong look, regardless of how good Felix’s numbers are.”
Not that I believed that a pitcher’s wins and losses are his most important numbers. Far from it; we know how misleading they can be. I just thought it’s the best statistic we’ve got that holds the pitcher accountable in terms of his ability to do what’s most important every time he steps to the mound: win the game. The Cy Young award shouldn’t just go to the pitcher with the most curvaceous set of peripherals. That’s too much like selecting the best golfer based on driving accuracy and greens in regulation without emphasis on who won the darn tournaments.
Value metrics such as Win Shares and WAR come close. But as best I can see, their calculations bundle a pitcher’s game-by-game numbers into an algorithmic stew to get a final grade. Without maintaining any type of hierarchy between outings, there’s no assessment of a starter’s job from one game to the next.
That became my motivation, to keep a pitcher’s performance quantized into outings, and to grade him on three basic performance measures:
(1) his ability to work deep into games to preserve the bullpen,
(2) his ability to work effectively enough to keep the opposition off the scoreboard so as to give his team the best chance to win that day,
(3) his ability to show up for work every five days.
That’s basically it. To me, this is what you want your ace to do. As much as I believe that K/BB and GB%/FB% are revealing and useful ratios, I don’t care about those numbers when judging strictly if an ace did his job. A starter doesn’t have to strike out a single batter to lead his team to victory, and he can still toss a shutout while walking four or more batters. I didn’t want the grade to be diluted with any tangential measures that aren’t completely necessary for success.
In short, I wanted a measure that shows how well a starting pitcher consistently put his team in position to win every fifth day.
You may have already figured out where this is heading. There’s a well-established way of calculating a team’s chances of winning a game at any time during that game. It’s called Win Expectancy (WE), and I’ve made it the workhorse behind a new process for grading starting pitchers that I call Ace Factor.
WE is part entertainment and part insight. On one hand, it’s kind of fun to track the dynamic swings of a team’s win probability as a game progresses. But it’s also enlightening to find out the plays that had the most impact on a game’s outcome, and the players most responsible. For what I was trying to do with Ace Factor, it fit like the last piece of the puzzle.
In applying WE to a pitcher’s body of work, I’m only interested in his team’s chances of winning the moment he left the game, but with a twist. To isolate the pitcher’s performance from other statistical distractions (which is the whole point of this process), I didn’t consider what his team did at the plate, nor did I consider what happened in the game any time after the pitcher was lifted.
Scoring environment is also factored in because of its impact on the likelihood of winning throughout a game: runs are harder to come by in lower scoring environments, whereas leads are tougher to hold in higher scoring environments.
At the core of the algorithm is an extremely flexible and useful WE spreadsheet put together by Dave Studeman and other baseball researchers which calculates WE for all types of scenarios while taking scoring environment into account.
Now that you have an idea of the motivations, let’s get to the process. My first step was to calculate and assign a WE grade for every game between 1950 and 2010.¹
A quick example: Felix Hernandez was lifted after 7.1 innings on August 25, 2010, leaving runners on first and second with two earned runs charged against him. The outing was graded with a WE score of 0.797 (out of a perfect score of 1.000, or 100%, which occurs only at the instant a team wins the game). Stating this another way, Hernandez pitched to a statistical performance level that says he left his team with a 79.7% chance of winning the game upon his exit (remember, we’re not factoring in his team’s actual performance). For a detailed explanation of this example, go here.
With a WE grade assigned to every game over the 61-year period, there are three ways we can go here, depending on how we want to compare the seasons of starting pitchers across history.
One way is to take into account any significant time a pitcher may have missed in a particular season, i.e. factor in his number of starts. This would fulfill performance measure #3 while completing our Ace Factor calculation. But we’re going to address this later.
For now, we’re going to deal with the raw WE scores without adjusting for starts, except for setting some minimum number (similar to how a minimum of 162 innings is required to win an ERA championship).
Although the WE numbers from the spreadsheet are already compensated for scoring environment, there’s one more adjustment that needs to be done that would allow us to legitimately compare seasons across history, and that’s to take into account league performance. There are two valid and equally informative ways to do this.
One way is to gauge how “special” each pitcher’s performance was relative to his competition, a maneuver which also compensates for other unknown biases in a pitcher’s environment from one season to the next. To do this, each pitcher’s set of WE scores for a given season was converted to a single “Z score,” a statistical tool that helps indicate how rare or common a measure is compared to a ‘population’ of measures. Since this is a “seasonal” type of adjustment, we’re going to call this our “seasonal WE-based Z score,” or sWEZ for short. The sWEZ number represents how many standard deviations above or below the mean a pitcher performed relative to his league for a given season.
(To see the details of how sWEZ was calculated, go here.)
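For readers who want a feel for the arithmetic, here is a minimal Python sketch of one plausible way to compute a seasonal Z score. It is not the exact sWEZ recipe (that lives in the linked details); it simply assumes the score is the pitcher’s average WE grade minus the league’s average WE grade, divided by the standard deviation of the league’s WE grades that season. The function name `seasonal_z` and all of the sample grades are made up for illustration.

```python
from statistics import mean, pstdev


def seasonal_z(pitcher_we, league_we):
    """Z score of a pitcher's average WE grade against the league's grades.

    Assumption: the league's per-start WE grades for one season are the
    'population'. The real sWEZ calculation may differ.
    """
    league_mean = mean(league_we)
    league_sd = pstdev(league_we)  # population standard deviation
    if league_sd == 0:
        raise ValueError("league WE grades have no spread")
    return (mean(pitcher_we) - league_mean) / league_sd


# Made-up WE grades, one per start (not real data)
league = [0.20, 0.35, 0.50, 0.65, 0.80]
ace = [0.80, 0.75, 0.85]
print(round(seasonal_z(ace, league), 3))
```

A positive result means the pitcher graded out above his league that season, a negative one below it. Swapping in a pool of grades from all seasons as the population would be the analogous starting point for a historical score.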
Another way of comparing seasonal performances across history is to gauge how special each pitcher’s performance was relative not to his competition but to performances across generations. This type of adjustment is called “historical WE-based Z score,” or hWEZ for short. What’s the difference between sWEZ and hWEZ? Pedro Martinez in 2000, Greg Maddux in 1995, and Sandy Koufax in 1966 are all examples of pitchers performing well above the competition. sWEZ will tell us which pitcher out-performed his peers by the widest margin. hWEZ, on the other hand, will tell us how these performances compare to each other. Two different but equally interesting and relevant concepts.
For the rest of this post, I’m going to focus just on sWEZ, leaving hWEZ for another day.
Now on to the fun part.
With an sWEZ score calculated and tagged to each pitcher’s season, a ranking of the most dominant seasonal performances for starting pitchers was created. This ranking is shown graphically below.
Each pitcher’s seasonal sWEZ score is plotted as a blue diamond. With the horizontal axis representing sWEZ, the most dominating performances appear on the right and the least dominating performances appear on the left. Outlier scores are tagged with the pitcher’s name, season, sWEZ score, and number of starts.
The top five performances are highlighted in red on the right. Greg Maddux in 1994 (1.005) and Pedro Martinez in 2000 (1.003) are the only pitchers to perform at least “one full standard deviation” above the league average.
Make a mental note of Warren Spahn, the only Hall of Famer on the left side of the chart (highlighted in red), who probably hung on two seasons too long in ’64 and ’65 at ages 43 and 44. When we compare pitching careers, Spahn’s prime numbers will come up huge, and the irony of his presence among the weakest performances will ring as loud as a cowbell.
You can see that there are no seasons listed below the 25-start threshold. Without this threshold, the rankings would be clobbered with meaningless data from the shorter seasons. Of course, we’re also putting 25-start seasons on the same level of comparison as seasons of many more starts without penalty or compensation. This issue is the essence and motivation behind adjusting these scores according to the number of starts, which we’ll do later on.
In this process of taking a pitcher’s outing and carving away real-life portions of it such as run support and the actual outcome of the game, the win-loss concept doesn’t seem to apply. But I wanted to use a decision framework similar to the W-L record here to get a better idea of how a pitcher’s quality of work was distributed outing by outing, something that tells us at a glance how often he was dominant, or how many clunkers he pitched. So I created thresholds based on each game’s WE score to assign decisions I called effective wins (eWIN), effective losses (eLOSS), and effective no decisions (eND).
The chart below shows the thresholds used for these decisions. These selections were based on historical WE grades with the idea that, given that every game has a win and a loss, it made sense to balance the eWIN/eLOSS thresholds across the distribution of WE scores. Using the historical WE mean score as a baseline, the eWIN and eLOSS thresholds were assigned +/-0.200 relative to the mean (or about 0.65 standard deviations), with an eND assigned to any WE grade in between. The effective dominant wins (eDW) threshold was set to one full standard deviation above the mean, and the effective quality starts (eQS) threshold was set right at the mean.
This decision-based grading system provides a better sense of how the quality of each pitcher’s outings was distributed, giving our listing of dominant performances a much more interesting shape.
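To make the threshold logic concrete, here is a rough Python sketch of how a single WE grade could be turned into a decision. The +/-0.200 offsets and the eDW/eQS rules come straight from the description above, but the historical mean and standard deviation used as defaults (`hist_mean=0.500`, `hist_sd=0.308`) are placeholder guesses, not the actual historical values, and treating scores that land exactly on a threshold as inclusive is also an assumption. The function name `grade_start` is hypothetical.

```python
def grade_start(we, hist_mean=0.500, hist_sd=0.308):
    """Return (decision, is_dominant_win, is_quality_start) for one WE grade.

    hist_mean and hist_sd are placeholder values, not the historical figures.
    Scores exactly on a threshold are counted as meeting it (an assumption).
    """
    if we >= hist_mean + 0.200:
        decision = "eWIN"
    elif we <= hist_mean - 0.200:
        decision = "eLOSS"
    else:
        decision = "eND"
    is_edw = we >= hist_mean + hist_sd  # one full standard deviation above the mean
    is_eqs = we >= hist_mean            # at or above the historical mean
    return decision, is_edw, is_eqs


# Three made-up grades: a gem, a so-so outing, and a clunker
for score in (0.85, 0.55, 0.25):
    print(score, grade_start(score))
```

Because the eDW line sits higher than the eWIN line, every dominant win in this sketch is also an eWIN, and every eWIN is also a quality start.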
An eLOSS could be thought of as “throwing a clunker,” which makes it incredible that Bob Gibson in 1968 is the only starter to go an entire season without tossing a single one.
The cool part about going down this labor-intensive path of formulating a theory, applying an algorithm, then coming up with a list like this is that there’s not a lot one can argue about with these results. Just about every one of these seasons has been praised for its dominance and historical relevance.
The last column of the sWEZ ranking shows that 21 of the 30 pitchers on the list won the Cy Young Award for that particular season. Curiously, Gibson didn’t receive a single CYA vote for his 10th-ranked ’69 season. His 20-13 record that year apparently couldn’t compete with Tom Seaver’s 25-7 record (23 of 24 Cy Young votes) or Phil Niekro’s 23-13 record (1 vote) at a time in the game when nothing else much mattered aside from wins and losses when it came to the Cy Young award vote.
Note that King Felix is listed twice for his 2009 (25th) and 2010 (16th) seasons. Hernandez finished second to Zack Greinke for the Cy Young in 2009; Greinke’s season earned 14th place on the sWEZ list.
Interestingly, the top seasons ranked by ERA+ over the same time period show a strong correlation with the sWEZ rankings; there are eight player/seasons in the top ten of both lists. The top ten ERA+ seasons since 1950 are shown below (sWEZ rank shown in parentheses):
- Pedro Martinez, 2000 (2)
- Greg Maddux, 1994 (1)
- Greg Maddux, 1995 (5)
- Bob Gibson, 1968 (4)
- Pedro Martinez, 1999 (3)
- Dwight Gooden, 1985 (6)
- Roger Clemens, 2005 (20)
- Roger Clemens, 1997 (7)
- Pedro Martinez, 1997 (13)
- Kevin Brown, 1996 (9)
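As a quick sanity check on the overlap claim, the few lines of Python below count how many of these ten ERA+ seasons also carry an sWEZ rank of ten or better. The ranks are copied from the list above; the variable names are just for illustration.

```python
# sWEZ rank of each top-ten ERA+ season, in ERA+ order (copied from the list)
swez_rank = {
    ("Pedro Martinez", 2000): 2,
    ("Greg Maddux", 1994): 1,
    ("Greg Maddux", 1995): 5,
    ("Bob Gibson", 1968): 4,
    ("Pedro Martinez", 1999): 3,
    ("Dwight Gooden", 1985): 6,
    ("Roger Clemens", 2005): 20,
    ("Roger Clemens", 1997): 7,
    ("Pedro Martinez", 1997): 13,
    ("Kevin Brown", 1996): 9,
}

# Seasons that make the top ten of both lists
shared = [season for season, rank in swez_rank.items() if rank <= 10]
print(len(shared))  # 8
```

The two seasons that miss the cut are Clemens in 2005 and Martinez in 1997.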
The only reason I could think of why Clemens’ 2005 season ranked so much lower in sWEZ (#20) than in ERA+ (#7) is that with Houston’s strong bullpen that year (Lidge, Qualls, Wheeler), Clemens rarely made it beyond the seventh inning, almost by rule (only four times in 32 starts), leaving opportunities for higher WE grades out on the field. ERA+ has no such bias, except for a minimum number of innings required to make the list. By contrast, the Clemens of 1997 pitched into the eighth and ninth innings frequently (19 of 34 starts) and wound up with similar sWEZ (#7) and ERA+ (#8) rankings.
Here is the same database sorted by eWINs:
Aided by his domination in an era when pitchers threw every fourth day, Sandy Koufax appears in three of the top five spots for most eWINs for his ’66, ’63, and ’65 seasons. He also accrued the most eDWs (effective dominating wins) in a season with 29 in both his ’65 and ’66 seasons.
The decision numbers give you a sense for how inconsistent Mickey Lolich was in 1971 with his 30 eWINs (ranked 5th) to go with 12 eLOSSes and just 3 eNDs. The 12 eLOSSes obviously hurt his overall sWEZ (0.361), the lowest on the chart.
You may have noticed that Pedro Martinez, who appeared a whopping four times in the first list of most dominant sWEZ grades, has vanished here. Blame it on his number of starts those seasons (29, 29, 31, and 29), relatively low for an ace. Let’s do him a favor and sort the database by eWIN% (eWIN/STARTS):
Pedro ranks 3rd and 14th here, but you might be starting to realize that once we start factoring number of starts, his standing in this analysis is going to take a bit of a hit.
Outside of Maddux’s strike-season eWIN% of 84.0, Gibson’s “closet” 1969 season heads the list at 82.9%, three slots higher than his more famous ’68 season.
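The eWIN% arithmetic is simple enough to sketch in a couple of lines of Python. The example uses Lolich’s 1971 decision line quoted earlier (30 eWINs, 12 eLOSSes, 3 eNDs) and assumes that every start receives exactly one of the three decisions, so the start total is just their sum. The function name `ewin_pct` is made up for illustration.

```python
def ewin_pct(ewins, elosses, ends):
    """eWIN% = eWIN / STARTS, expressed as a percentage.

    Assumes each start gets exactly one decision (eWIN, eLOSS, or eND).
    """
    starts = ewins + elosses + ends
    return 100.0 * ewins / starts


# Mickey Lolich, 1971: 30 eWINs, 12 eLOSSes, 3 eNDs
print(round(ewin_pct(30, 12, 3), 1))  # 66.7
```

Dividing by starts is what lets a 29-start season compete with a 40-start season on this list, which is exactly why Pedro reappears.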
Win Expectancy, generally used for quantifying the impact that individual plays have on the outcome of games, has been applied here as a tool for grading starting pitchers. Assigning WE scores to each pitcher’s outing allows us to make an assessment of a pitcher’s season in how well he worked deep into games while keeping the opposition off the scoreboard. With a pitcher’s WE scores for a season bundled into an sWEZ or hWEZ grade, a pitcher’s seasonal performance can be measured relative to his peers or across multiple generations.
But we’re still only two-thirds of the way toward capturing a pitcher’s Ace Factor and fulfilling all of the stated goals of how to measure a starting pitcher’s “ace-ness.”
In my next post, we’ll take a pitcher’s reliability into account, that is, his ability to take his turn in the rotation over an entire season. We’ll see that Ace Factor is versatile enough to help us compare the Phillies’ “Quad Aces” of 2011 to the Atlanta Braves staffs of the 1990s and the Cleveland Indians staffs of the 1950s. It will show why Warren Spahn, Kevin Brown, and Dave Stieb were embarrassingly undersold when it came to the Cy Young vote. We’ll even take a stab at Hall of Fame worthiness, and give yet another statistical argument showing why Bert Blyleven’s recent HOF induction was long overdue.
And our alternative grading system for starting pitchers will be complete.
- John Cappello
To see more of John’s baseball research and postings, go to www.baseballengineer.com.
¹ 1950 was the earliest season provided by Retrosheet.org that had the appropriate play-by-play data needed for this analysis.
Special thanks to…
…Dave Studeman, who along with Tom Tango has been a force in bringing WE, WP and WPA to the baseball forefront in recent years. Dave’s thoughtful and insightful reactions to my methodology helped shape my presentation, such as his suggestion to add a “historical Z score” to the WEZ grades, and to balance the decision thresholds about the historical WE grades.
…Retrosheet.org, especially the BEVENT utility, for making play-by-play data easier for software to understand.
…Baseball-reference.com, for making the history of baseball statistics easier for humans to understand.
…Pete Palmer, who did a lot of unsung work on Win Probability a generation ago, for the vote of confidence in my data.