Statistics 101 - Regression
The last entry in this little series on statistical topics
OK, time for the elephant in the stat head’s room: linear regression. What is it really? Why do people like it so darn much? Why does it even matter to sabermetrics?
Regression is a very broad term in statistics, as we discovered last time out. So why is this week’s lesson “the big one”? Because linear regression is possibly the most used analytical technique in the entire field. What’s my proof? Go find a syllabus for an undergraduate business statistics course at any fine university in America (or, since I feel like being nice, click here). They cover linear regression. Mathematics and statistics students often take a course focused solely on linear regression. So yeah, it’s kind of a big deal.
Let’s answer my questions at the top one at a time.
To remind us of the definition of regression (and use the words of someone much smarter than myself), regression is used to study the relationship between measurable variables (e.g. HR and RBI). Linear regression, then, is the special group of these relationships that can be characterized by a line. (Thanks to Sanford Weisberg, who wrote Applied Linear Regression.)
Maybe words aren’t your thing and you need a picture. Here’s your image (you’ll have to scroll to page 5 for the first image, but a big thanks to Jim Albert for the paper and to him and Jay Bennett for writing one of the best baseball stat books ever!). What you see are a bunch of data points and then a line that generalizes the trend of the data. This is why non-statistical people like linear regression: it produces something easily understandable. You can see the relationship between the two variables, one independent (age) and one dependent (Average Linear Weights), and draw many valid ideas from it.
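If you'd rather see the line get drawn than take my word for it, here's a minimal sketch of fitting one by ordinary least squares, using the HR and RBI pairing from earlier. The numbers are invented for illustration (they are not from Albert's paper or any real player), and the function name fit_line is my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up season totals, for illustration only.
home_runs = [10, 20, 30, 40, 50]   # independent variable
rbi = [40, 70, 85, 110, 145]       # dependent variable

intercept, slope = fit_line(home_runs, rbi)
predicted_rbi = intercept + slope * 25  # prediction for a 25-HR season
print(intercept, slope, predicted_rbi)
```

The slope is the part people care about: it says roughly how many extra RBI go along with each extra home run in this toy data set.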
After looking at that paper, hopefully you understand a bit better why this is so important not just to sabermetrics, but to statistics in general. It is undoubtedly one of the most potent tools in the shed. As you may have noticed, however, it only relates two variables to each other, age and ALW. But what happens when we try to look at the relationship among many variables? We still have a single dependent variable, but now it is modeled as a combination of several independent variables (this is called multiple regression). An example of such an equation can be seen in Nate Silver’s QERA metric on Baseball Prospectus.
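To make the "one dependent variable, several independent variables" idea concrete, here's a rough sketch of a regression with two predictors, solved through the normal equations. To be clear, this is not the QERA formula: the predictors, the data, and the helper names solve and fit_multiple are all made up, and I built the outcomes from y = 1 + 2*x1 + 3*x2 so you can check that the fit recovers those coefficients.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least squares for y = b0 + b1*x1 + b2*x2 + ... via (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: each row is (x1, x2); outcomes follow y = 1 + 2*x1 + 3*x2 exactly.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
outcomes = [9, 8, 19, 18, 29]

coefficients = fit_multiple(predictors, outcomes)
print(coefficients)  # intercept first, then one coefficient per predictor
```

Real data will never sit exactly on the fitted surface like this, and for anything serious you'd reach for a statistics package rather than hand-rolled elimination, but the mechanics are the same.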
There’s another flavor of regression that works similarly to linear regression, except that it uses a binary dependent variable, such as “Is the player in or out of the Hall of Fame?” We call this logistic regression, and it predicts the odds (or, equivalently, the probability) of such an occurrence. It is used far less often than linear regression.
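For the curious, here's a bare-bones sketch of logistic regression with one predictor, fit by plain gradient ascent on the log-likelihood (real software typically uses faster methods). Everything here is hypothetical: the 1-to-8 "career score", the in/out labels, and the names sigmoid and fit_logistic are mine, not real Hall of Fame data.

```python
import math


def sigmoid(z):
    """Turn any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, rate=0.1, steps=20000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed outcome minus predicted probability
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


# Invented data: a made-up career score and whether the player was inducted (1) or not (0).
career_score = [1, 2, 3, 4, 5, 6, 7, 8]
inducted = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(career_score, inducted)
p_low = sigmoid(b0 + b1 * 1)    # estimated chance of induction at score 1
p_high = sigmoid(b0 + b1 * 8)   # estimated chance of induction at score 8
odds_high = p_high / (1.0 - p_high)
print(p_low, p_high, odds_high)
```

Notice the output is a probability (and from it, odds), not a yes or no. Where you draw the in-or-out cutoff is a separate decision from fitting the model.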
If you’re interested in the nuts and bolts of how this works, ask below and I’ll explain. Thanks for reading, and I’ll be back atop the Soapbox next week to muse about baseball again.









