There is a common belief that databases are best only in certain situations, namely when the data is large, when the data is complex, or when there’s enough time to properly design a schema in order to contain the data. Since being introduced to PostgreSQL two years ago, I’ve come to rely on it in various projects and found that contrary to the conventional wisdom, a database is not only useful when working with small datasets, sometimes it’s the only way to ensure the analysis is both accurate and repeatable, for data of any size.
It’s important to point out, I’m not building production systems powering applications. My data work is almost strictly for doing analysis, which allows me to be a little fast and loose with usual standards of database design and usage. Almost every project that comes my way involves data in various states of cleanliness and almost always in Excel workbooks. While Excel is a great tool for working with data, I always load the data into a PostgreSQL instance running on AWS. The reasons are simple:
Small datasets are easy to load into a database. A few hundred rows can be added with a Python script, the psycopg package, and a FOR loop that generates the necessary INSERT statements. For larger datasets, I’ll format the data and then use the PSQL \COPY command to load the data from the command line. While it may not be best practice, for fields with data I’m likely not going to join on, I’ll use the generic “text” datatype. This saves the time of ensuring all values in a column match the declared type (like characters in fields that should be numbers). Otherwise I try to use numbers and VARCHARs where appropriate for fields I want to perform calculations on or use as identifiers, respectively.
Joining data is reliable and consistent. I rarely work with a single self-contained dataset. There’s generally another dataset (or numerous datasets) that I have to combine together to get the result I’m looking for. I’ve found that even a project that starts out as something simple ends up requiring the kind of reliable and consistent join capability that is the hallmark of relational databases, especially if I’m working with any kind of spatial data. Once the data is loaded, I can easily link tables together to get the results I’m looking for. I can do the same with the Excel VLOOKUP function, but a table join is far more powerful, especially when I need to aggregate data or perform some other function across rows.
Actions are documentable and repeatable. I’ve worked with Excel for a long time and no one can say it isn’t flexible, but that flexibility can be it’s downfall, whether it’s being used by Wall Street financier or a state regulator. A VLOOKUP function or formula can easily be mistyped, not to mention other problems that result in data errors. With a database, I can save my SQL and repeat the steps when necessary to recover the data. This is especially useful when I’m linking multiple datasets using multiple key values. If I need to make a change either to the process or a set of values, I can make the changes and rerun the query.
Data is harder to change (especially accidentally). One of the beauties of ACID (Atomicity, Consistency, Isolation, Durability) is the durability. SELECT statements only query data, they don’t alter it, allowing me to manipulate data in any way I’d like without changing it. Altering the underlying data is a far more intentional process than with a spreadsheet where a simple keystroke can delete or otherwise change the data in a cell. When using an UPDATE statement, all matching rows are changed, keeping the data consistent.
These points were driven home to me recently when I worked on a visualization project. The data was provided to me in several Excel worksheets. The analysis had already been done and my task was to simply visualize the data. To combine the data, which was no more than a few hundred rows, I loaded the data into a database and began working with the data. It soon became clear that not only were the necessary data points missing (which I later got), the data provided had a number of calculation errors. These were simple mistakes, but taken together, they revealed some issues with the data analysis that needed to be resolved before the visualization was created and the final report released. Even without the errors, creating a map of the data, which involved ZIP Code Tabulation Areas (ZCTAs), required joining the data to their respective geometries, all of which was simple to do in PostgreSQL using PostGIS, but couldn’t be done easily (or reliably) in Excel.
Ultimately, the project was a success and taught me the value of using a database. I’ll be sharing these lessons at PGConf US 2015 during my talk, “Sure You Don’t Want to Use a Database for That?: Using PostgreSQL for Projects Big and Small.“