Baseball Digest Daily

Statistics 101: Distributions

by Matt Mitchell

Continuing our look at statistical concepts in sabermetrics

Two weeks ago, we looked at simple descriptive statistics. Last week we looked at probability. This week it’s time for the probability distribution.

Now, I could expose you to the mathematical nuts and bolts, like I did the last couple of weeks. However, the formulas are confusing to the untrained eye, and the mathematically savvy can seek them out on their own. What I’ll do here is explain what a probability distribution is, then describe some of the common ones using this language called English.

To define what a probability distribution is, let’s play one of my favorite games where we define each word, and combine the definitions in a logical manner. Thus,

  • Probability - the chance of an event occurring
  • Distribution - the arrangement of objects over a dimension or dimensions such as space, time, or data sample

Thus, a probability distribution is the arrangement of probability over the possible values of whatever we are measuring: it tells us how likely each outcome is.
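To make that definition concrete, here is a minimal sketch in Python. It lists the chance of each possible hit total in a four-at-bat game. The .300 hitter, the four at-bats, and the assumption that at-bats are independent of one another are all illustrative choices of mine, not data from any real player. The formula is the binomial one described further down.

```python
from math import comb


def hits_distribution(at_bats: int, p_hit: float) -> dict[int, float]:
    """Probability of each possible hit total, assuming independent at-bats."""
    return {
        k: comb(at_bats, k) * p_hit**k * (1 - p_hit) ** (at_bats - k)
        for k in range(at_bats + 1)
    }


# Hypothetical .300 hitter getting four at-bats in a game.
dist = hits_distribution(at_bats=4, p_hit=0.300)

for hits, prob in dist.items():
    print(f"{hits} hits: {prob:.4f}")

# The probabilities are spread over every possible outcome, so they sum to 1.
print(f"total: {sum(dist.values()):.4f}")
```

Each line of output is one slice of the probability, and the slices together account for everything that can happen. That spread is the distribution.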

So, why does this matter? Well, the way a statistic is distributed tells us about the data in a different way than the mean and standard deviation do. This meaning can reflect the nature of what is being measured, such as whether it is a rare event happening many times in a short time span, and it can also show us what the reasonable range of expectation is for a given statistic.

Now that you hopefully understand why these are important, here are some common distributions, what they denote, and where you’ll see them in sabermetrics:

  • Uniform distribution - the simplest distribution, this treats everything as having equal odds across its range. Obviously, the real world doesn’t work like this, but a lot of random number generators will use this distribution. The random number generators power things such as Baseball Prospectus’ Playoff Odds Report. So, while you won’t see it, it does work behind the scenes.
  • Binomial distribution - measures the likelihood of a given number of successes in a known number of yes/no situations. Those estimating Chipper Jones’ odds of hitting .400 right now are using this distribution, where the number of situations is his expected number of remaining at bats and the number of successes is how many hits he would need to achieve that milestone.
  • Normal distribution - the famed “bell curve” distribution, and the most commonly used. Many statistics produced by regression are approximately normal, and we know that people do lots of regressing within sabermetrics. Just go scan the Inside the Book blog for 5 minutes and tell me if you don’t run into a baseball stats post that talks about regressing to the mean (regression is another concept for another week).
  • Poisson distribution - this one models the probability of a given number of events occurring within a certain time interval. It differs from the binomial in that it places no upper limit on the count: it is the limiting case of the binomial as the number of trials grows without bound while the expected number of events stays fixed. Thus it can be used in cases where we don’t know the number of trials or want the theoretical limit of the binomial calculation.
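For the curious, here is a rough sketch of the .400 calculation described in the binomial entry, plus a quick check of how the Poisson relates to the binomial. Every number in it is hypothetical: the hits and at bats so far, the remaining at bats, and the assumed "true" batting average are placeholders, not Chipper Jones’ actual figures. It also assumes every at bat is an independent yes/no trial with the same chance of a hit, which real baseball only approximates.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Chance of exactly k successes in n independent yes/no trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k_min: int, n: int, p: float) -> float:
    """Chance of k_min or more successes in n trials (binomial upper tail)."""
    return sum(binomial_pmf(k, n, p) for k in range(max(k_min, 0), n + 1))


def poisson_pmf(k: int, lam: float) -> float:
    """Chance of exactly k events when lam events are expected on average."""
    return exp(-lam) * lam**k / factorial(k)


# Hypothetical season so far (NOT real data).
hits_so_far, ab_so_far = 90, 220
ab_remaining = 300
true_avg = 0.330  # assumed chance of a hit in any one remaining at bat

# Hits needed for a final average of at least .400 (= 2/5).
# Integer arithmetic avoids floating-point rounding at the boundary.
total_ab = ab_so_far + ab_remaining
hits_needed = (2 * total_ab + 4) // 5 - hits_so_far

p_400 = prob_at_least(hits_needed, ab_remaining, true_avg)
print(f"needs {hits_needed} hits in {ab_remaining} at bats")
print(f"chance of finishing at .400 or better: {p_400:.6f}")

# Poisson as the limit of the binomial: many trials, small chance per trial,
# same expected count (10000 * 0.0003 = 3 events).
print(f"binomial: {binomial_pmf(2, 10_000, 0.0003):.5f}")
print(f"poisson:  {poisson_pmf(2, 3.0):.5f}")
```

The last two lines should come out nearly identical, which is the sense in which the Poisson is the theoretical limit of the binomial calculation.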

There are many others, but I’ll leave those for you to investigate on your own. If you feel compelled, you can mention or ask about other distributions in a comment.
The Sabermetric Soapbox will be on vacation in Chicago next week. A pilgrimage to 35th and Shields is most definitely part of the plans.
