Baseball Digest Daily

Pick me! Pick me!

by Matt Mitchell

A lesson in selective sampling at a time when it seems to be most prevalent

As the regular season ends, award season begins. You may have read my previous posts about the MVP and the Cy Young awards, and this post grew out of the comments on the latter. Today is a reminder about the dangers of selective sampling, which may sound fancy until you realize that, as a baseball fan, you are inundated with it almost daily. Please keep in mind I am not the first to address this. Tangotiger harps on it a lot over at The Book blog.

What is a selective sample? Well, take any one of these statements:

  • 18-for-25 in his last 6 games against the Rangers
  • Batting .346 since the All-Star break
  • 2.54 ERA in his last 5 starts

These are all selective samples. We usually create them when we try to isolate a specific portion of our data (in this case, baseball statistics) using a specific set of criteria. Just about anything can be used, but baseball stats typically see samples based on location (home/road, ballpark), time (month, before/after the All-Star break, day/night), teams and even specific players. These are the little nuggets of information that play-by-play and color commentators rely on to make a broadcast more interesting.

But not all information is equal. The problem with using selective samples to draw conclusions is that they aren’t necessarily a valid indicator of what may happen. Often, the samples are too small (e.g., 15 ABs or 5 games) to support any legitimate conclusion beyond saying someone was “lucky” or “unlucky,” or simply reporting the factual record of previous events.
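To put a rough number on "too small," here is a minimal sketch that compares the uncertainty around the same batting average measured over 15 ABs and over 540 ABs. It assumes every at-bat is an independent trial with a fixed chance of a hit and uses the textbook normal approximation, which is crude at 15 ABs; the function name and the sample figures are made up for illustration.

```python
import math

def batting_average_interval(hits, at_bats, z=1.96):
    """Approximate 95% interval for a hitter's underlying hit rate.

    Assumption: each at-bat is an independent trial with the same
    probability of a hit (a simplification of real baseball).
    """
    p = hits / at_bats
    standard_error = math.sqrt(p * (1 - p) / at_bats)
    low = max(0.0, p - z * standard_error)
    high = min(1.0, p + z * standard_error)
    return low, high

# Same .333 average, very different amounts of evidence (hypothetical numbers).
small = batting_average_interval(5, 15)      # a 15-AB sample
large = batting_average_interval(180, 540)   # roughly a full season

print(f"15 ABs:  {small[0]:.3f} to {small[1]:.3f}")
print(f"540 ABs: {large[0]:.3f} to {large[1]:.3f}")
```

Under these assumptions the 15-AB interval is several times wider than the full-season one, so a .333 stretch over 15 ABs is consistent with anything from a terrible hitter to an all-time great.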

Notice I chose the really small samples for that last comment. What about larger samples, like what we see in the MVP, Cy Young, and Hall of Fame debates? Again, be wary of what is said. While a larger sample is better suited to be a legitimate indicator of something, it is not necessarily an accurate one. A well-known example of this is the “Jim Rice between 1975 and 1986” argument. I will not discuss the merits of this particular one, but if one is looking at whether someone should be in the Hall of Fame, the analysis should go beyond merely looking at the years when Rice was probably at his best. The same goes for the argument for Carlos Delgado as 2008 NL MVP, where many people cite his “clutchiness” in the second half, a very select sample of plate appearances. (BTW, if he wins, the Morneau List will become the Delgado List.)
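The danger of picking the window after the fact can be sketched with a simulation. The code below invents a season for a hypothetical hitter whose underlying rate is .270 on every at-bat, then goes looking for his best 25-AB stretch. The .270 rate, the 550 ABs, the seed, and the function name are all assumptions chosen for illustration, not real data.

```python
import random

def best_window_average(outcomes, window):
    """Highest batting average over any run of `window` consecutive at-bats.

    `outcomes` is a list of 1 (hit) and 0 (out).
    """
    hits = sum(outcomes[:window])
    best = hits
    # Slide the window one at-bat at a time, updating the hit count.
    for i in range(window, len(outcomes)):
        hits += outcomes[i] - outcomes[i - window]
        best = max(best, hits)
    return best / window

# Hypothetical hitter: the same .270 chance of a hit in every at-bat.
rng = random.Random(2008)
TRUE_RATE = 0.270
season = [1 if rng.random() < TRUE_RATE else 0 for _ in range(550)]

season_avg = sum(season) / len(season)
hot_streak = best_window_average(season, 25)

print(f"Full season:        {season_avg:.3f}")
print(f"Best 25-AB stretch: {hot_streak:.3f}")
```

Because the best stretch is chosen from among hundreds of candidates, it can never come out below the full-season average here, even though this simulated hitter's true ability never changes. A split selected because it looks good tells you more about the selection than about the player.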

Take note when the talking heads are spouting off about who should win and flinging statistics based on selective samples around like proselytizers fling chapter-verse pairings. It could be detrimental to your ability to see truth.
