Statistics 101: Means, Variance, and Standard Deviation
A new series looking at statistical concepts and techniques used in sabermetrics
After writing last week’s post, where I mused about the application of nonparametric statistics to sabermetrics, I realized I probably talked over the heads of 95% of the people who would read it. I also have this sense that too many people in America don’t understand statistics, and thus are extremely prone to misinterpreting data and the results reported from that data. This series of articles will be my attempt to make sense of statistics for the mathematically disinclined by showcasing how it has been or can be used in sabermetric analysis. The mathematically inclined will hopefully also be edified by some future articles exploring more complex topics.
This week we’ll keep it simple and look at three calculations that are descriptive statistics, or values that describe the set of numbers in question. Let’s look at them one at a time, and I promise you that any math I use will be something you saw no later than high school algebra.
You all are probably familiar with the concept of a mean. If the term throws you off, maybe this term won’t: average. The mean of a set of numbers is simply its average. However, there are differences between population mean and sample mean. Sample mean is what you get when you take a selection of events and average out their values. Population mean is what the average would be over every single event in that universe.
To show the difference between these, we’ll use the classic example of rolling a pair of dice. Say you’re playing a game, like Strat-O-Matic, and your first 10 rolls have the following values: 2, 7, 9, 5, 8, 6, 11, 3, 8, 6. The sample mean here is the sum of these values divided by 10, which is 6.5. However, we know that the population mean over an infinite number of rolls of 2 dice should be exactly 7. How? Because we know every way the dice can land: there are 36 equally likely combinations, and their totals average out to 7. This population mean is also what we refer to as the expected value.
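For anyone who would rather check the dice numbers than take my word for it, here is a minimal Python sketch. The variable names are my own, and it assumes two fair six-sided dice.

```python
from fractions import Fraction
from itertools import product

# The ten rolls from the example above.
rolls = [2, 7, 9, 5, 8, 6, 11, 3, 8, 6]
sample_mean = sum(rolls) / len(rolls)

# Every way two fair six-sided dice can land (36 equally likely outcomes).
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
population_mean = Fraction(sum(outcomes), len(outcomes))

print(sample_mean)      # 6.5
print(population_mean)  # 7
```

The sample mean depends on which ten rolls you happened to get; the population mean does not.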

It is easy to calculate the sample mean in baseball statistics. Take Hank Aaron’s career, and you’ll see he averaged about 32.8 HR per season in his 23 years playing baseball. Every BA, OBP, SLG, or other rate stat is a sample mean. What about population mean? Well… we really don’t know what Hank Aaron was supposed to hit; we only have a record of what he, and everyone else who has recorded baseball statistics, did.
Why does the absence of a true population mean matter? It affects how we calculate variance, and by extension, standard deviation. Variance is simply a way of measuring how much a set of numbers deviates from the expected value. Since we are dealing with a data sample, this means we have to use an approximation. Since we don’t have a population mean for a true expected value, we can substitute the sample mean and compare how much the numbers which make up that sample mean deviate from it.
Since sabermetrics uses a finite number of records, we can calculate a sample variance using the following equation:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$

The lower case Greek letter sigma squared is stat-speak for “sample variance”, the “N” is the number of records, the chunk in parentheses is the difference between an individual value and the sample mean (which we square here), and the capital sigma just means we do the math in the parentheses and add them together. Make sense? I hope so, because that’s the best I can do given the limitations of this space.
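If the symbols are hard to follow, the same steps can be sketched in a few lines of Python. This uses the ten dice rolls from earlier rather than real baseball data, and the variable names are just my own labels.

```python
import math

rolls = [2, 7, 9, 5, 8, 6, 11, 3, 8, 6]
n = len(rolls)                      # the "N" in the equation
mean = sum(rolls) / n               # the sample mean

# The chunk in parentheses, squared, for each individual value.
squared_deviations = [(x - mean) ** 2 for x in rolls]

# The capital sigma: add them up, then divide by N.
variance = sum(squared_deviations) / n
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 6.65
print(round(std_dev, 2))   # 2.58
```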
Back to our Hank Aaron HR example: He averaged 32.8 home runs per year, so that’s our sample mean. You can go back to his Baseball Reference page and look at his yearly HR totals, which will make up the other part of the operations in parentheses. The number of records here is 23, equal to the number of seasons he played. So, with this we can calculate our sample variance, which is 125.05. Our sample standard deviation is calculated simply from the sample variance, as it is the square root. In Mr. Aaron’s case, this value is 11.18.
While these estimators for variance and the standard deviation are good, they are biased. Bias in this case means that, on average, they come out a little too small: we measured each value against the sample mean, which sits closer to the sample’s own numbers than the unknown population mean does. To correct the variance, we multiply it by the coefficient (or numeric value) of N/(N-1), with N still acting as the sample size. Thus, the unbiased estimator (with x_i an individual value and x̄ the sample mean) looks like this:

$$s^2 = \frac{N}{N-1}\,\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$
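To see the two versions side by side, here is a rough sketch using Python’s standard `statistics` module, where `pvariance` divides by N and `variance` divides by N-1. The season totals below are my own transcription of Aaron’s yearly home runs and should be treated as an assumption; check them against his Baseball Reference page before relying on them.

```python
import statistics

# Hank Aaron's season HR totals, 1954-1976.
# ASSUMPTION: transcribed by hand; verify against Baseball Reference.
aaron_hr = [13, 27, 26, 44, 30, 39, 40, 34, 45, 44, 24, 32,
            44, 39, 29, 44, 38, 47, 34, 40, 20, 12, 10]

n = len(aaron_hr)
mean = sum(aaron_hr) / n

divide_by_n = statistics.pvariance(aaron_hr)          # sum of squares / N
divide_by_n_minus_1 = statistics.variance(aaron_hr)   # sum of squares / (N-1)

print(n, round(mean, 1))
print(round(divide_by_n, 2), round(divide_by_n ** 0.5, 2))
print(round(divide_by_n_minus_1, 2), round(divide_by_n_minus_1 ** 0.5, 2))

# The correction factor ties the two together.
print(abs(divide_by_n * n / (n - 1) - divide_by_n_minus_1) < 1e-9)
```

If these totals are right, dividing by N-1 gives about 125.06 and a standard deviation of about 11.18, which lines up with the figures I quoted for Aaron earlier, while dividing by N gives about 119.62 and 10.94.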
NOTE: I have spared you many of the gory theoretical details, but those interested can explore Wolfram MathWorld for more.
Of course, calculation is good, but knowing what these calculations tell us is what makes them meaningful. Most people understand that the mean is what you can reasonably expect to happen. The variance and the standard deviation are ways of describing how spread out the numbers are. The bigger the variance, and thus the bigger the standard deviation, the more difference there is between the average and a player’s year-to-year numbers. We typically use the standard deviation because it shows how consistent or inconsistent a statistic is. Obviously, a standard deviation closer to zero means there is more consistency. This is key when calculating new metrics that try to demonstrate skill in baseball (like FIP) over those that do not (like ERA).
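As a quick illustration of the consistency point, here is a small sketch with two made-up players. Both sets of season home run totals are hypothetical, invented only so that the averages match while the spreads do not.

```python
import statistics

# HYPOTHETICAL season HR totals, invented for illustration only.
steady_player = [30, 32, 31, 29, 33]
streaky_player = [15, 48, 22, 45, 25]

for name, seasons in [("steady", steady_player), ("streaky", streaky_player)]:
    mean = statistics.mean(seasons)
    spread = statistics.stdev(seasons)  # divides by N-1 under the hood
    print(name, mean, round(spread, 2))
```

Both players average 31 home runs a year, but the standard deviation is what tells you that only one of them is likely to land near 31 in any given season.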
These three statistics are prevalent in just about every aspect of sabermetrics, but are usually buried behind the scenes of written analysis. They help give a picture of the data being analyzed. We’ll discuss that picture in more depth next week.
Next week: Distributions




