This is a repository for PostgreSQL schemas and queries for baseball analytics.The
lahman/ddl folder contains the schema for the Lahman database, and the
retrosheet/ddl folder contains the schema for a Retrosheet database.
Please feel free to submit a pull request to add to the repository and reportany issues or improvements that can be made to the code.
How Do I Get the Data?
Package: Lahman Type: Package Version: 7.0-1 Date: 2019-05-01 License: GPL version 2 or newer LazyLoad: yes LazyData: yes The main form of this database is a relational database in Microsoft Access format. The design follows these general principles: Each player is assigned a unique code (playerID). Welcome to Lahman Baseball Database project! This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2012. As an R package, it offers a variety of interesting challenges and opportunities for data processing and visualization in R.
The Lahman database is maintained by Sean Lahman and can be found on his website. Download the comma-delimited version.To populate your PostgreSQL Lahman database use the loader sql files in the lahman folder, andsimply replace
path to with the full path to csv files.
The Retrosheet database is similar to the one Jeff Zimmerman provides at Baseball Heat Maps, but instead of MySQL, this is a PostgreSQL database. The ddl folder contains the database schema
schema.sql, the sql scripts to load the parsed Retrosheet data into the database:
copy_misc.sql, two csv files which contains all the players and ball parks, and lastly a script
indices.sql which creates indexes on the different tables.
How Do I Get the Data?
This is now really easy as Chadwick is available via the Mac OSX package manager, Homebrew.
Download the Data from Retrosheet
First creating a directory to hold all the Retrosheet data is probably best. I set mine up to contain both
unparsed directories.Then we will move into the unparsed directory.
All of the following shell commands were inspired by the shell retrosheet_getscript created by wa-rufio.
Parse the Data
Finally use a for loop to parse through the
subs for each year. Make sure you are still in the
Apn tester free for mac. Download and install the application which is however automatic for play store as long as the download button was clicked. Wait for the installation process to finish.
This repo stores the PostgreSQL schema for the 2016 Version of Lahman's Baseball Database, originally published for MySQL.
I have uploaded the CSV files as a courtesy here because the GIDP column in the PitchingPost table is set up to be an integer, but the raw CSV data hosted on Sean Lahman's website tabulated some GIDP values as floats (e.g. 0.0, 1.0, 2.0). Also fixed were some data irregularities noted here and here, but not all of them--the majority are in the Master.csv file (please see below Importing section for more details), related to last names like 'O'*', and remain unfixed (for now).
- Download the files in this repo
- Create a database in PostgreSQL
- Import the schema file into the database via the command line ($ psql [database_name] < lahman_2016_schema.sql)
- Update the lahman_2016_import.csv file to include the filepath of the .csv files in the csv folder.
- Import the CSV files into the database via the command line ($ psql [database_name] < lahman_2016_import.sql)
You can find the source of the CSV files hosted on Sean Lahman's website. There are also ongoing updates to the files made by Ted Turocy on Github here.
As mentioned in the background, there is some erroneous data owing to an issue with the script used to generate the data. I will be keeping an eye on the repo and updating the data as fixes come in.
add_id.py Python script that will add a column 'id' with numbers incrementing 1 through n total rows in the included CSV files. This script is helpful if you need to modify the schema in this repo to read like so:
Rather than in its current form:
Instructions for the script are included in comments in the file.
Kris Eberwein and Brett Nycum for their very resourceful repos I discovered when searching for PostgreSQL implementations of Lahman's Baseball Database.
Copyright Notice & Limited Use License
Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/
Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.
Player performance data for 1871 through 2016 is based on the Lahman Baseball Database, version 2017-02-26, which is Copyright (C) 1996-2017 by Sean Lahman.
The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet.
This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.