One of the best days of every offseason for a baseball nerd is Retrosheet’s annual end-of-season release day. It’s the day one of the best sites on the internet releases the play-by-play data from the previous season. If you’re like me, you download it immediately and go to town. But one thing I’ve always wanted was the ability to access the data during the current season.
So this past offseason, I designed a way to take MLB’s Gameday data and convert it into a Chadwick-style Retrosheet database. The database (.csv files) will be available and updated daily* in the downloads section. I’m making it available mainly because I know there are others out there who, like me, are interested in having an in-season play-by-play database. I’d also like to have more than one set of eyes on it, to help iron out the kinks and catch any errors.
I run a few processes to check for errors and to validate the data. But there is still the possibility that errors will come up from time to time. I’d like to make this a forum for error reporting, for those who are interested in helping.
Just as with this website, I intend to have updates available daily. Usually, the site is updated in the morning. But with a full-time job and two toddlers at home, there can sometimes be a delay.
There are a few columns in the events table that I have left blank:
“EVENT_TX”: It turns out that it is a huge process to replicate this. While I believe the “EVENT_TX” column is helpful in quickly identifying the play, I don’t use it in my queries and felt it wasn’t worth the hassle. The same goes for the “BAT_PLAY_TX” and the “RUNX_PLAY_TX” columns.
“BATTEDBALL_LOC_TX”: Gameday does include hit locations for all balls in play, but I have yet to dive into this data. If anyone has experience with this data and is willing to help convert Gameday’s x and y coordinates to Project Scoresheet location codes, please let me know.
“UMP_ID”: These columns for the six umpires are currently left blank.
“GWRBI_BAT_ID”: This is left blank because game-winning RBIs are no longer officially recorded.
IDs for players making their Major League debut
Since these players have yet to be assigned official IDs by Retrosheet, I just give them the next available ID for their name. For example, if a John Smith were to make a debut, he would be assigned ID “smitj005”, since 005 would be next in line.
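For anyone who wants to reproduce or sanity-check these placeholder IDs, here is a minimal Python sketch of the “next available ID” rule. It is not the actual code behind the database. The function name next_debut_id is hypothetical, and the sketch assumes an ID is the first four letters of the last name, the first initial, and a three-digit counter, with short last names padded with “-” and the next number taken as one more than the highest already in use.

```python
import re

def next_debut_id(first_name, last_name, existing_ids):
    """Return a placeholder Retrosheet-style ID for a debuting player.

    Assumed format: 4 letters of last name + first initial + 3-digit number.
    """
    # Keep only letters so apostrophes, spaces, and periods are dropped.
    last = re.sub(r"[^a-z]", "", last_name.lower())
    first = re.sub(r"[^a-z]", "", first_name.lower())

    # Assumption: last names shorter than four letters are padded with "-".
    prefix = last[:4].ljust(4, "-") + first[:1]

    # Collect the numbers already taken for this name prefix.
    used = [
        int(pid[5:])
        for pid in existing_ids
        if pid[:5] == prefix and pid[5:].isdigit()
    ]

    # Assumption: "next available" means one past the highest number in use.
    return f"{prefix}{max(used, default=0) + 1:03d}"
```

With existing IDs smitj001 through smitj004, next_debut_id("John", "Smith", ...) gives “smitj005”, matching the example above. Once Retrosheet assigns the official ID, the placeholder may need to be remapped.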
Building a Retrosheet Database
For those who are interested in using the data but lack experience, David Temple at TechGraphs recently created a helpful two-part tutorial.
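If you just want to poke at the data before working through a full tutorial, here is a rough Python sketch that loads a CSV into SQLite using only the standard library. It is not taken from the tutorial. The function name load_events and the table name “events” are hypothetical, and the sketch assumes the first row of the file holds the column names; if your copy has no header row, you would need to supply the names yourself.

```python
import csv
import sqlite3

def load_events(lines, db_path=":memory:", table="events"):
    """Load CSV text lines into a SQLite table and return the connection.

    `lines` is any iterable of text lines, such as an open file.
    Assumption: the first line is a header row of column names.
    """
    conn = sqlite3.connect(db_path)
    reader = csv.reader(lines)
    header = next(reader)

    # Store every column as TEXT; blank columns simply become empty strings.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')

    # One "?" placeholder per column, filled in from each remaining row.
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()
    return conn
```

Usage would look something like this, where events.csv stands in for whatever the downloaded file is actually called:

```python
with open("events.csv", newline="", encoding="utf-8") as f:
    conn = load_events(f, db_path="retrosheet.db")
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```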
If you find this data useful and have some disposable income, please consider donating. I do not get paid for my work on this website, and while it is my passion to work with baseball data, it does take a lot of time and money (server costs) to keep it up. I’d also suggest donating to David Smith and the Retrosheet team.