Deviating from the Past
Every baseball ranking, even those based on statistics, is biased in some way. It’s what makes practically every list debatable.
For instance, take a list as fundamental as “hit leaders.” As a counting stat, a player’s hit total is impacted by his place in the batting order as well as his lineup’s offensive strength, two factors that have almost nothing to do with a batter’s talent for hitting a baseball. This type of ranking is also biased toward free swingers, who tend to put more balls in play simply because they’d rather hack than take a walk, which doesn’t necessarily make them better hitters than those who are more selective.
So we’re careful in interpreting such lists until we know exactly what’s being measured. Okay, maybe we didn’t do this so much a generation ago, but if sabermetrics has taught us anything, it’s that numbers without context are like lyrics without the sheet music.
In my last post, I presented an alternative way of grading a starting pitcher’s performance using Win Expectancy theory. After stamping each pitcher’s outing over a 61-year period (1950–2010) with a Win Expectancy score, I compared each pitcher’s seasonal performance to his competition using a measure called sWEZ (“seasonal WE-based Z score”). Here’s a re-print of the top 30 sWEZ scores from 1950 through 2010.
This ranking shows what appears to be a bias toward baseball’s more recent generation, with 11 of the top 15 (73%) coming from the last two decades. One might be tempted to suggest that the sWEZ scores for guys like Greg Maddux and Pedro Martinez are inflated, that for whatever reason it was ‘easier’ for them to statistically dominate their competition compared to the likes of Sandy Koufax, Tom Seaver, or Whitey Ford. (Not to mention the Phillies fans who are wondering how Steve Carlton’s 1972 season could be missing from any ranking of dominant pitching seasons.)
Let’s take a look at exactly what we’re comparing these performances with using the line chart below, which shows how the league average WE scores have fluctuated over time.
As you can see, the WE scores gradually declined about 5% over the 61-year time period. But this doesn’t directly clarify any alleged bias in the sWEZ ranking. Remember, sWEZ is theoretically immune to these year-to-year fluctuations because a pitcher’s seasonal grade is judged only against the performances of that particular season.
Before going on with addressing these sWEZ issues, let’s do something I mentioned I would do in that last post, and that’s to compare each pitcher’s performance to a historical baseline instead of his seasonal competition.
Using the same methodology that generated sWEZ, each starter’s seasonal performance was compared to historical WE averages to come up with an hWEZ score (“historical WE-based Z score”), with the intent being to produce a more direct comparison of pitcher performances across generations, like a “Gibson vs. Pedro” type of matchup.
Here, the historical WE average is superimposed on the seasonal WE scores line chart as a dotted line representing a value of 0.584.
This is our historical baseline for the hWEZ scores. The chart is telling us visually that the earlier generations out-performed the historical average, whereas the later generations underperformed. With this information, you might be able to guess how the hWEZ scores will turn out compared to the sWEZ ranking.
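For readers who want to see the idea in code, here is a minimal Python sketch of a Z score measured against a fixed historical baseline rather than a single season's average. The 0.584 baseline is the value from the chart; everything else is an assumption made for illustration: the function name `hwez` is hypothetical, the spread value is a placeholder, and the post does not say which standard deviation the real hWEZ calculation divides by.

```python
HISTORICAL_WE_MEAN = 0.584  # historical baseline quoted in the post


def hwez(pitcher_we, historical_sd, baseline=HISTORICAL_WE_MEAN):
    """Z score of a seasonal WE grade against a fixed historical baseline.

    historical_sd is an ASSUMED spread of WE scores across all seasons;
    the post does not state which standard deviation hWEZ actually uses.
    """
    if historical_sd <= 0:
        raise ValueError("historical_sd must be positive")
    return (pitcher_we - baseline) / historical_sd


# Placeholder inputs, not real pitcher data.
ASSUMED_SD = 0.05
example_score = hwez(0.684, ASSUMED_SD)
```

Because the baseline never moves, the same WE grade earns the same score in any era, so a season from a year when the league as a whole ran above 0.584 starts with a built-in advantage.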
Here is that hWEZ ranking, a list of the top 40 most dominant performances over the same 61-year time period according to hWEZ. (I extended the list to include the “curiously missing” 1972 season of Steve Carlton as well as the 1978 season of Ron Guidry.)
What we see here is that pitchers from the earlier generations made huge jumps relative to the sWEZ scores. The first three decades (’50s, ’60s, ’70s) are now well-represented, making up 68% (27/40) of this list, up from 30% (12/40) of the sWEZ top 40. The second column labeled “MOVE” indicates how each player’s hWEZ rank compares with his sWEZ rank. Steve Carlton’s ’72 season (+70) and Koufax’s ’63 season (+65) made huge jumps. Meanwhile, every recent-gen player dropped, including King Felix’s long fall from 16th to 39th place for his 2010 season.
About that Carlton season. Some of you (as I did) expected this season to be near the top of any ranking of “dominant pitching performances in history.” Yes, Carlton won 27 games for a team that won just 59. He was unhittable much of the season, accumulating 310 strikeouts. But from an outing-to-outing standpoint, he wasn’t as consistently dominant as those above him; this is what this ranking tells us. Those inept Phils did manage to score 6.5 runs per game (actually, per 27 outs) in 18 Carlton games, allowing Lefty to cruise to a 14-1 record in those contests, so there was a little bit of a run support factor behind that 27-10 record. Also, remember that sWEZ measures performance relative to the league that particular season. In 1972, there were plenty of aces keeping Carlton company at the top, such as Gaylord Perry (24-16, 1.92 ERA, 234 K, 170 ERA+), Gary Nolan (15-5, 1.99 ERA, 162 ERA+), Catfish Hunter (21-7, 2.04 ERA, 140 ERA+), Jim Palmer (21-10, 2.07 ERA, 150 ERA+), and Don Sutton (19-9, 2.08 ERA, 162 ERA+), plus 34 other pitchers with qualified ERAs under 3.00. By comparison, Pedro Martinez (2000) and Greg Maddux (1994) found themselves on an island compared to peer performances in their greatest seasons.
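The MOVE column is simple rank arithmetic. As a small sketch, assuming MOVE is the sWEZ rank minus the hWEZ rank (an assumption that matches the signs quoted here, where climbers are positive and fallers negative), with a hypothetical helper name:

```python
def move(swez_rank, hwez_rank):
    """Places gained (positive) or lost (negative) going from sWEZ to hWEZ.

    ASSUMPTION: MOVE = sWEZ rank - hWEZ rank, inferred from the signs
    quoted in the post rather than from a stated definition.
    """
    return swez_rank - hwez_rank


# Felix Hernandez, 2010: 16th in sWEZ, 39th in hWEZ.
felix_2010 = move(swez_rank=16, hwez_rank=39)
```

Under that definition the Hernandez season works out to a drop of 23 places.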
As with any ranking, context is important here. In this case, that context is the understanding of how the falling WE scores over time have impacted the hWEZ rankings, making the early-gen pitchers look more dominant than the late-gen pitchers.
To see why the WE scores have fallen, we need only go right to the source: the WE algorithm. Recall that a WE grade rewards the starting pitcher who pitches deep into games. That was the motivation behind the algorithm as I explained in my last post, and that’s the way Win Probability theory works. The closer a pitcher is to recording the 27th out for his team, the greater the chances of winning.
Guess what else changed over time? The chart below shows how the average number of innings pitched per game started (IP/GS) for each season gradually dropped over the same time period.
Does the slope look familiar? It should. Statistically, it correlates extremely well with the falling WE scores, with a correlation coefficient of 0.93 for NL scores and 0.89 for AL scores. Correlation may not imply causation, but logic tells us that the falling WE scores were most likely due to the trend of starters throwing fewer innings.
Take the 2005 season of Roger Clemens, which ranked 20th in sWEZ and 51st in hWEZ. Clemens averaged 19.8 outs per game that year, about 6 2/3 innings per start. He worked beyond the seventh inning only four times out of 32 outings that season, as Houston relied heavily on a superior bullpen despite Clemens’ dominance. Then you’ve got Bob Gibson, who averaged 26.9 outs per game in 1968 with the aid of four extra-inning efforts. While Clemens, like so many of his contemporaries, gave way to his bullpen while leaving higher WE scores out on the field, Gibson shouldered the workload all by himself.
The evolution of the relief specialist seems like our smoking gun here. But it’s not the only culprit. With multiyear contracts growing into the range of tens of millions of dollars, a pitcher’s elbow and shoulder joints are now regarded as delicate hubs of intertwined muscle and tissue that should be handled with extreme care, which has driven baseball management to employ a pitch-count curfew for practically every arm in the game.
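For anyone who wants to run the same kind of check, here is a minimal sketch of a Pearson correlation between two seasonal series. I am assuming Pearson's r is the statistic in question; the function name is hypothetical, and the two short lists are made-up placeholder numbers, not the real league averages.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# PLACEHOLDER values for illustration only, one entry per season.
ip_per_gs = [6.8, 6.6, 6.4, 6.1, 5.9]
league_we = [0.600, 0.595, 0.585, 0.578, 0.570]
r = pearson_r(ip_per_gs, league_we)
```

Feeding in the actual per-season IP/GS and league WE averages, one league at a time, would be the way to reproduce the NL and AL figures.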
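A quick note on the arithmetic, since innings notation trips people up: outs convert to innings by dividing by three, while box-score notation writes leftover outs as a digit after the point, so 6.2 in a box score means six innings plus two outs. A small sketch with hypothetical helper names:

```python
def outs_to_decimal_innings(outs):
    """Convert outs recorded to innings as an ordinary decimal number."""
    return outs / 3


def box_score_innings(whole_outs):
    """Format a whole number of outs as 'innings.outs', e.g. 20 -> '6.2'."""
    innings, leftover_outs = divmod(whole_outs, 3)
    return f"{innings}.{leftover_outs}"


clemens_2005 = outs_to_decimal_innings(19.8)  # average outs per start
gibson_1968 = outs_to_decimal_innings(26.9)
```

Gibson's 1968 average comes out just shy of nine full innings per start, which is only possible because of those extra-inning outings.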
So we have an explanation for why the WE scores gradually fell over time. But where does that leave our hWEZ ranking, which is obviously biased by this phenomenon?
Some of you might be ready to toss hWEZ into the trash thinking that there’s no point in comparing Bob Gibson and Roger Clemens if the tool doesn’t act as an ‘equal opportunity’ calculation. Or maybe your sabermetric reaction is to compensate, that is, to adjust the WE algorithm to account for the declining IP/GS numbers in an attempt to “level the playing field.” That would be a bit presumptuous. Considering the Win Expectancy makeup, any compensation like this would be like playing with the rules of nature. Are we really supposed to wave a statistical wand over the performance of a new-gen pitcher to grade his performance on a different scale of “winning” just because pitchers are lifted earlier these days?
I don’t think so, because I believe the falling WE scores are trying to tell us that the value of the starting pitcher, in terms of winning baseball games, has also fallen over time.
Stated another way, starting pitchers aren’t the aces they used to be, and there’s nothing wrong with a chart telling us that statistically.
Back in the ’50s and ’60s, starters were expected to finish what they started, at a time when the complete game was a stat worth tracking. In effect, they were doing the job of the starter, long relief, setup man, and closer. From 1950 through 1969 (the first 20 years of the time period we’ve been focusing on), individual pitchers threw 20 or more complete games in a season 72 times. That’s all ancient history now. Do you know how many times that happened over the past twenty seasons?
The complete game has become an anomaly. Even Roy Halladay has never reached double-digit complete games in his career.
The point is, it’s still historically relevant to compare pitching performances across generations in this manner, and hWEZ is a sufficient tool to do just that. You just have to understand what you’re looking at and accept how a starting pitcher’s role has changed.
Now back to sWEZ (comparing a pitcher’s performance to his peers), which is supposed to be immune to any seasonal variations of WE scores.
Out of the 14,557 pitcher-seasons analyzed and ranked by sWEZ, 30 of the top 100 seasons are represented by just six pitchers, all from the last two decades: seven from Roger Clemens, six each from Randy Johnson and Greg Maddux, five from Pedro Martinez, and three each from Johan Santana and Kevin Brown.
To paraphrase our earlier question: Was there a bias that gave these new-gen pitchers some type of statistical edge with sWEZ?
One obvious difference over time has been rotation size. Somewhere along the line, the four-man rotation expanded to five. But any argument supporting how the five-man rotation might have impacted sWEZ scores evaporates when you consider that the entire league was faced with the same evolution at the same time. I don’t see an obvious avenue for an outlier to separate himself more from the pack just because of the four-man rotation.
But there is a clue in the way those “packs of performances” are distributed. The final arithmetic operation performed to calculate a pitcher’s sWEZ score is to divide by the standard deviation (SD) of the league WE scores for a particular season, which allows us to tell how many standard deviations a player performed above or below the league average. A smaller (“tighter”) distribution indicates that the league scores were generally bunched closer to the league average, making outlier performances rarer and that much more “special.” A larger (“looser”) distribution indicates that the league scores were spread out further from league average, making outlier performances more common. So in dividing by the standard deviation we are effectively grading how special a pitcher’s performance was relative to his league that season.
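To show how the spread changes the grade, here is a minimal sketch of the Z-score step described above: subtract the league average, then divide by the league standard deviation. The function name and the two toy leagues are hypothetical, and using the population standard deviation (`statistics.pstdev`) is my assumption, since nothing here specifies population versus sample SD.

```python
import statistics


def swez(pitcher_we, league_we_scores):
    """Standard deviations above or below the league's average WE score.

    ASSUMPTION: population standard deviation (pstdev); a sample SD
    (statistics.stdev) would give slightly smaller scores.
    """
    league_mean = statistics.mean(league_we_scores)
    league_sd = statistics.pstdev(league_we_scores)
    if league_sd == 0:
        raise ValueError("league scores have no spread")
    return (pitcher_we - league_mean) / league_sd


# Toy leagues with the same average but different spreads (made-up numbers).
tight_league = [0.56, 0.58, 0.60]
loose_league = [0.48, 0.58, 0.68]
in_tight = swez(0.70, tight_league)
in_loose = swez(0.70, loose_league)
```

The same 0.70 grade scores far higher against the tightly bunched league than against the loosely spread one, which is exactly the "special" effect at work.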
Here is how the seasonal standard deviations changed over time:
But that doesn’t really help tell us why. Two possibilities that could explain extreme outlier performance of recent generations are the scoring environment and league expansion.
Greater league offense could have played a role simply because of the math. It’s easier for stud performances to shine during high run-scoring environments due to the higher “ceiling” of bad performance levels (think Coors Field football scores) relative to the very best a pitcher can do, which, whether in 1968 or 1998, has always been to keep the opposition off the board entirely. Despite this widening of the performance gap, you still have to credit the pitcher who excels under the same conditions that caused other pitchers’ stats to balloon.
As far as league expansion goes, if you believe that major league quality gets diluted with triple-A talent after expansion (a hotly debated issue, especially during the Steroid Era), in theory this is a classic mechanism for superior performances to stand out even more.
I only pose these factors as possible influences. I don’t see enough evidence to allow these theories to over-shadow the greatness of the commanding seasons themselves. One thing we know for sure from looking at the top five of both sWEZ and hWEZ rankings: Gibson’s 1968 season, Maddux’s 1994 season, and Pedro’s 2000 season are three of the most dominant performances ever, no matter how we slice and dice the data.
With that in mind, I present to you a final ranking based simply on combining the sWEZ and hWEZ scores equally into a new WEZ grade. (WEZ = 0.5*sWEZ + 0.5*hWEZ)
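In code, the blend and the resulting ranking might look like the sketch below. The equal weighting is the formula given above; the function names, the tuple layout, and the three sample seasons are placeholders made up for illustration, not real scores.

```python
def wez(swez_score, hwez_score, weight=0.5):
    """Blend the two Z scores; weight=0.5 reproduces the equal split."""
    return weight * swez_score + (1 - weight) * hwez_score


def rank_by_wez(seasons):
    """Sort (label, sWEZ, hWEZ) tuples from highest to lowest WEZ."""
    scored = [(label, wez(s, h)) for label, s, h in seasons]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# PLACEHOLDER seasons and scores, for illustration only.
sample = [
    ("Season A", 3.0, 2.0),
    ("Season B", 2.0, 4.0),
    ("Season C", 1.0, 1.0),
]
ranking = rank_by_wez(sample)
```

Leaving `weight` as a parameter makes it easy to see how the list shifts if you trust one of the two views more than the other.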
As a whole, I’m okay with calling this list a collection of the most dominating pitching seasons in baseball from 1950 through 2010. Whenever I sit down to look at it, I’ll do the same thing I do when looking at any other baseball ranking, and that’s to consider the circumstances by keeping in mind the underlying, biasing forces at play.
For this case, the primary context is to recognize that a starting pitcher’s contribution to winning has taken a fall over the past 61 years.
-- John Cappello
To see more of John’s baseball research and postings, go to www.baseballengineer.com.