I’ve had to take a break from my MTA Subway Fares series to focus on other things like my second semester of graduate school, summer internship opportunities, and applying to a new graduate program. As part of the last item, I’ve been tasked with answering a data challenge using the NYC Open Data Portal. I think every data scientist who’s used the portal has a love-hate relationship with it. As one of the first in the country, it’s an important milestone in the battle for local governments to open their data stores to the public, but it’s definitely not the ideal platform for data access. The interface is slow to execute queries and creating views from the datasets is a time-intensive process, especially for large data sets like the 311 Service Request data set I’m using for my answer to the data challenge. This data set is a trove of interesting data points on the concerns of average New Yorkers. My intention is to look at trends in service requests based on neighborhood demographics compiled by the US Census. This requires filtering the data for those events with easily mapped locations (I only have a week to do the project so no fancy address to lat/long conversions) and then doing a spatial join with the census tracts. From my MTA Subway Fares project, I already had census tract data for New York City. What I didn’t have was a good working knowledge of MySQL, which is essential for working with a data set of this size. Here’s what I’ve learned:
- Always start small. Validate the methods and workflow with small samples of the data. Working with large datasets create a number of challenges intrinsic to being large that obscure the underlying issues with the data. Solve the data issues first then deal with the scaling issues.
- When it comes to working with large data files, Unix command line tools are simple and amazing. I’ve fallen in love with “sort” and “split.” They are like the hammer and screwdriver you keep handy in the kitchen drawer because you’re always using them. If you don’t know how to use them, learn. Right now.
- Create your database schema early, but don’t get too attached to it. It’ll change, but you won’t know what you need until you commit it to the database and try to upload data. I’m deleting tables constantly because of datatype mismatches that corrupt the input. It’s better to load, test, delete and start over than load all the data and then find all the problems (which also goes back to starting small). Of course, my hope is to reduce the number of load, test, delete, restart cycles as I become more proficient with MySQL.
- MySQL Workbench blows. I use “blow” in the complete technical sense of the term. Navicat works much better, especially for database newbies like me.
- Careful the options you choose when uploading files. “Append/Update” sounds like a good option when you’re reading data from multiple files into the same table (which Navicat makes very easy, unlike Workbench). I’m sure I didn’t replicate any values, but I’m not sure. I thought I’d be safe since I’m reading from multiple files. I was wrong and am watching my upload times suffer as a result (like hours instead of minutes).
- As a corollary to the above, use the UNIX tools to do the heavy data lifting. They are light and simple with minimal overhead, especially important if you have limited memory for processing data.
- API’s can really suck. I find myself continually banging my head against them hoping data will fall like manna from heaven. It never happens no matter how many ways I tweak the REST call. It’s bad enough wadding through API documentation for a specific service. It’s even worse when the only documentation available for the API is the generic service API and not how it’s specifically implemented in the particular data portal you’re trying to access (I’m looking at you Socrata).
- Don’t beat your head against anything. I’m a big fan of persistence, but there comes a time when you have to stop trying to knock down a wall and look for a door. When I find myself engrossed in a fruitless task (like making an API respond the way I want it to), I know it’s time to walk away, take a break, enjoy the view out my window, get a coffee, check my email, get on Facebook, kick a cat*, yell at a child*, or otherwise release the frustration and come back with a refreshingly fresh perspective on the problem.
These are some of the lessons I learned on this project, which brings up the most important lesson of all: you don’t know it until you’ve done it. If you want to learn MySQL or any other database system, grab a hunk of data and stuff it in the nearest database you can find. With the free usage tier on Amazon Web Services, there’s no excuse for not trying to homestead your own little piece of the data landscape and make it your own. *Note: No cats or children were harmed in the writing of this post.