This past fall, I attempted to use US Census demographic data to predict fare-type usage on the New York City Subway system using a number of different machine learning algorithms. The goal was to see whether the fare type used at a particular subway station could be predicted from demographic information for the area surrounding the station.
As my first data mining project, it succeeded as a learning experience, but neither the data gathered nor the models generated produced any actionable conclusions. I’m going to share the lessons learned in a series of blog posts that look at the data I used (MTA fare data and US Census data), the methods I used to clean the data into a usable form, and the results of the project. My hope is to get feedback from others on my data and methods. I also hope this information proves useful to others embarking on similar projects.
I don’t know how many posts this series will span, but I look forward to sharing the work I’ve done on this project.
Update: I feel it’s necessary to include some specifics about the period of analysis and about what I tried to accomplish with this project. I used data from the 2010 US Census and New York City Transit subway fare data for the period 17 October 2011 through 19 October 2012. As every New Yorker will remember for some time to come, Hurricane Sandy hit New York City on 29 October 2012, causing a great deal of damage to the New York City Subway system, not to mention to many parts of the city and the surrounding areas. For that reason, I ended my analysis before that event and the resulting massive disruptions to the subway system.
I believe a successful model that uses demographic information to predict fare usage would be useful in marketing MTA fare products to demographics that may be unaware of their fare product choices, or in increasing the uptake of fare products likely to appeal to a particular demographic, particularly as areas of New York City experience demographic changes in the near and long term. Because the MTA is a public benefit corporation, this shouldn’t be so much about increasing revenue as about ensuring all New Yorkers are served by their public transit system. I hope this project supports the goal of data analysis in the public interest, even if its primary purpose is to help me learn basic data mining techniques (and get a good grade in my class).