Truer Park Factors
Baseball is the only sport where the exact size of the playing field is not standardized. Even if every field had the same dimensions, outcomes would still be affected by things such as elevation. Acknowledging these differences, there have been many attempts to measure how much each ballpark can change the outcome of each play.
Normally park effects have been measured on a yearly basis, with each year’s values required to average out to 1.00. If one league were full of smaller parks (Great American Ballpark, Citizens
In 1976, Fenway Park ranked 3rd in the
If a park hasn’t changed, then its TRUE (absolute) factors shouldn’t change. A park’s factors can therefore be calculated over the entire length of time that the park has kept a certain configuration. However, a team’s factors may change from year to year even if its home park hasn’t changed, if any of the road parks change, because the team’s yearly factors are a weighted mean of its home park and all the road parks it played in.
Some other studies have used only the past three years of data, with some of those giving more weight to the most recent seasons. Rather than attempting to measure all the ballparks together and weighting them according to their playing schedules, this method uses a larger sample size in the hope of averaging out the effects of the road parks.
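As a rough illustration of why a team's yearly factor can move when only a road park changes, here is a minimal sketch of the weighted mean. The function name and the choice to weight by games played in each park are assumptions made for the example; the weighting could just as well be by plate appearances.

```python
def team_year_factor(home_factor, home_games, road_parks):
    """Weighted mean of park factors for one team-season.

    road_parks is a list of (park_factor, games_played_there) pairs.
    Weighting by games is an assumption made for this sketch.
    """
    total_games = home_games + sum(games for _, games in road_parks)
    weighted_sum = home_factor * home_games
    weighted_sum += sum(factor * games for factor, games in road_parks)
    return weighted_sum / total_games


# Same home park in both seasons; only one road park changes.
before = team_year_factor(1.10, 81, [(0.95, 41), (1.00, 40)])
after = team_year_factor(1.10, 81, [(1.05, 41), (1.00, 40)])
```

Here `after` comes out higher than `before` even though the home park's own factor of 1.10 never changed.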
When the results are sorted, it is obvious that most of the extreme high and low factors for each component belong to park versions that lasted only one or two seasons, because the sample size was simply not large enough to give an accurate estimate of the true factor. To get truer factors, during each pass I added a league average number of plate appearances for each component to both the home and road sums just before calculating the ratios. The number of plate appearances was based on the formula (mean(x)^2)*(1-mean(x)/sd(x), taken from “The Book” by Tango, Lichtman & Dolphin.
To test how the variance of the measurements improves with larger sample sizes, I studied the National League from 1985 to 1991, a period in which there were no changes in either ballpark or schedule. The factors for each component for each ballpark were computed for the entire seven-year period. Then, for each single season, the difference was calculated between that sample and the seven-year “true value.” This was repeated for periods of two and three consecutive seasons. The results in the table are expressed as the root mean square error of the sample vs. the true value.
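The effect of adding league-average plate appearances can be sketched as follows. This is only an illustration. The formula as printed above has unbalanced parentheses, so the helper `ballast_pa` assumes the common binomial regression-to-the-mean form, mean*(1-mean)/sd^2, which may not be exactly what was used in the study. All function and parameter names are hypothetical.

```python
def ballast_pa(league_rate, sd_of_rate):
    """Number of league-average PA to add (ASSUMED form: p*(1-p)/sd^2)."""
    return league_rate * (1.0 - league_rate) / sd_of_rate ** 2


def regressed_factor(home_events, home_pa, road_events, road_pa,
                     league_rate, ballast):
    """Home/road rate ratio after adding `ballast` PA at the league rate
    to both the home and the road sums."""
    home_rate = (home_events + ballast * league_rate) / (home_pa + ballast)
    road_rate = (road_events + ballast * league_rate) / (road_pa + ballast)
    return home_rate / road_rate


# Small sample: 30 HR in 1000 PA at home, 20 HR in 1000 PA on the road.
raw = (30 / 1000) / (20 / 1000)
n = ballast_pa(0.03, 0.005)
shrunk = regressed_factor(30, 1000, 20, 1000, league_rate=0.03, ballast=n)
```

The added plate appearances pull the factor toward 1.00, and the pull is strongest when the real sample is small, which is the behavior wanted for park versions that lasted only a season or two.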
| YR | SDT | XBH | SI | DO | TR | HR | BB | SO |
|----|-----|-----|----|----|----|----|----|----|
| 1 | .039 | .083 | .044 | .091 | .292 | .149 | .069 | .044 |
| 2 | .023 | .057 | .025 | .060 | .207 | .085 | .054 | .030 |
| 3 | .018 | .046 | .020 | .045 | .161 | .060 | .041 | .023 |
Except for triples and home runs, even one season gives a value accurate to about one decimal place. When the totals are split by bat hand, however, the sample size drops, especially for left-handed batters, so that three seasons of left-handed batters would be needed to equal the accuracy of one season of all batters.
In my study, right-handed and left-handed batters were run separately, in addition to a run for all batters. For symmetric parks, the RHB and LHB factors should in theory be equal. In those cases, using the combined values increases the sample size and the accuracy of the factors.
In order to calculate these multi-year factors, I combined the existing KJOK database with Retrosheet play-by-play data. I linked the Retrosheet Events table to the Retrosheet Games table using the GameID, and then linked the Games table to KJOK’s Park Configurations table using ParkID and Year. I added a version field to the Configurations table, numbered starting at 1 and incremented every time a change was made in a park’s listed configuration.
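The error measure in the table can be sketched in a few lines. The function names are hypothetical. Treating the two- and three-season periods as overlapping consecutive windows is an assumption, since the text does not say how the periods were chosen.

```python
import math


def rmse(sample_factors, true_factors):
    """Root mean square error between sample factors and 'true' factors."""
    squared = [(s - t) ** 2 for s, t in zip(sample_factors, true_factors)]
    return math.sqrt(sum(squared) / len(squared))


def consecutive_windows(seasons, length):
    """All runs of `length` consecutive seasons (overlapping; an assumption)."""
    return [seasons[i:i + length] for i in range(len(seasons) - length + 1)]


seasons = list(range(1985, 1992))          # 1985 through 1991
two_year = consecutive_windows(seasons, 2)
three_year = consecutive_windows(seasons, 3)
```

Each window's factor would be compared with the seven-year value for the same park and component, and `rmse` applied across all parks and windows of a given length.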
The first query was grouped by season, home team, road team, site and bat hand, and I calculated the batting statistics for each combination of these. In this way, for example, I could return the combined batting stats for the Pirates and Phillies each season, in Pittsburgh (Forbes Field, Three Rivers Stadium, PNC Park) and in Philadelphia (Shibe Park, Veterans Stadium, Citizens Bank Park), and then for each combination of ballparks. In this example, PIT07v02 1975-2000 overlaps PHI12v02 1971-2003 for the years 1975-2000. Summing those 26 seasons gives the head-to-head comparison of those two parks.
On my first pass through, as in the traditional manner, the sum of all the home stats is divided by the sum of all the road stats to get the factor for each component. This assumes that all road parks are equal, but once a set of factors has been calculated, this is known not to be true. On my second and subsequent passes, the stats in each road park are normalized by the factors calculated in the previous pass, and then the home sums are again divided by the road sums, so that the calculations converge to a properly weighted solution. I decided to stop after three passes, as the extra code for more would not add substantially to the solutions.
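Finding the seasons that two park versions have in common is a simple interval intersection. A minimal sketch, using the version labels and year ranges from the example above (the function name is hypothetical):

```python
def overlap_seasons(version_a, version_b):
    """Seasons shared by two park versions, each given as (first_year, last_year)."""
    start = max(version_a[0], version_b[0])
    end = min(version_a[1], version_b[1])
    return list(range(start, end + 1)) if start <= end else []


pit07v02 = (1975, 2000)
phi12v02 = (1971, 2003)
shared = overlap_seasons(pit07v02, phi12v02)
```

For these two versions `shared` runs from 1975 through 2000, which is the 26 seasons summed for the head-to-head comparison.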
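The multi-pass idea can be sketched as below. This is a simplified illustration under stated assumptions, not the queries actually used. The data layout (a dict keyed by site park and visiting team's park) and the function name are hypothetical, and a single component such as HR per plate appearance stands in for the full set.

```python
def park_factors(games, passes=3):
    """Iteratively estimate park factors for one component.

    games maps (site_park, visitor_park) -> (events, plate_appearances),
    totals for both teams in games between the two clubs played at site_park.
    """
    parks = sorted({site for site, _ in games})
    # Starting every factor at 1.0 makes the first pass the traditional
    # home/road ratio, which treats all road parks as equal.
    factors = {park: 1.0 for park in parks}
    for _ in range(passes):
        updated = {}
        for park in parks:
            home = [v for (site, _), v in games.items() if site == park]
            road = [(site, v) for (site, vis), v in games.items() if vis == park]
            home_rate = sum(ev for ev, _ in home) / sum(pa for _, pa in home)
            # Normalize each road park's events by its previous-pass factor.
            road_events = sum(ev / factors[site] for site, (ev, _) in road)
            road_rate = road_events / sum(pa for _, (_, pa) in road)
            updated[park] = home_rate / road_rate
        factors = updated
    return factors


# Toy league: true multipliers 1.2, 1.0, 0.8 on a base rate of 0.03.
true_mult = {"A": 1.2, "B": 1.0, "C": 0.8}
toy_games = {(s, v): (0.03 * true_mult[s] * 1000, 1000)
             for s in true_mult for v in true_mult if s != v}
first_pass = park_factors(toy_games, passes=1)
third_pass = park_factors(toy_games, passes=3)
```

In this toy league the first pass overstates the gap between parks A and C, and by the third pass the ratio of their factors is closer to the true 1.5. The successive passes overshoot in alternating directions while closing in, which is consistent with stopping after a small number of passes.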
In this study, a new version of the ballpark was created whenever anything changed that created a noticeable difference in the factors. However, some changes may not affect all the components. If artificial turf is installed, there will be a change in SI, DO & TR, but it wouldn’t affect HR. Future improvements might include independent versions of the same ballpark for: 1. SI, DO & TR; 2. HR; 3. BB & SO.
I believe these calculations give us a “TRUER” park factor than has previously been available. Please refer to the spreadsheet ML Park Factors 1954-2007 for the complete results of these calculations.





10 June 2008 11:44
Brian,
Good work and a great idea!
One thing to keep in mind about the annual park factor calculations is the fact that they, in a way, capture the weather, which is variable from year to year. Think about Wrigley as a prime example for this. A really warm year in Chicago will probably lead to more HR there because the wind out of the south will blow more balls out, but a colder year usually means that the wind is coming off the Lake or from Wisconsin, keeping balls in the yard and depressing HR. This variability shouldn’t be overlooked.
10 June 2008 11:52
My Retrosheet-based database includes temperature, wind speed and wind direction where the information is available.
I have run preliminary queries, and Wrigley does appear to be the most sensitive to wind direction, with the home run rate when the wind is blowing out to center field almost twice what it is when the wind is blowing in.
Greg Rybarczyk’s Home Run Tracker uses wind speed and direction, along with other factors, to normalize every home run for distance and to see how many different parks the ball would have left.
The queries I built for this study allow calculating an individual player’s normalized stats by taking each player’s stats in Wrigley, normalized with Wrigley’s factors, then his stats in Three Rivers, normalized with Three Rivers’ factors, and so on through all the parks, and then summing them for the season line. It would be possible to break down a park’s factors into weather groups, and then see how each player did there when that weather was in effect.
Of course, this would make the sample sizes smaller and create larger variances away from the “true” value, but at Wrigley everything else has remained the same for so many years that specialized weather factors would probably work there.
10 June 2008 15:27
I was just using homers at Wrigley as an example for broader inquiry, which you started to do. The breakdown of park factors by weather groups would be interesting to look at, provided there is a significant sample and the groupings make sense.
10 June 2008 22:55
In case you’re interested, cricket also has non-standardised grounds (both shape and size). Distances from the centre of the playing field to the boundary range from 65 to 90 yards.