I recently responded to a Stack Overflow comment arguing that opaque black box algorithms were the only way to go because everything else had “failed.” I completely disagreed, and in that disagreement I came to understand one of the defining differences between the academic and practical points of view on data science: how each measures effectiveness, and consequently failure.
Academic research in data science often seems to be about finding the optimal algorithm for a machine learning task. In this environment, even minor improvements in performance are cause for celebration, or at least publication credit. The resulting complexity of the solution isn’t a concern, as the target audience is trained and experienced data scientists undeterred by complex formulas or technical terms. If anything, greater complexity can be seen as a benefit, since it raises the barriers to entry for any would-be competitors in the field. I’m not saying academics are obscure for obscurity’s sake, but there are benefits to complexity and few incentives for simplicity in academia.
In searching for ever better results, academics can often choose the data they attempt to model, gaining superior results by subtly controlling the data environment in which they work. While academic research often informs and can greatly improve the work of data scientists outside academia, practical machine learning and data mining applications rarely have the luxury of cherry-picking the data they model.
In private industry, a model that provides any gain over current conditions can be relevant. If an average stock trader makes correct choices 75% of the time but a model can predict the correct choice 85% of the time, the company gains, even with a level of accuracy well below what would be considered optimal in an academic setting. There is also more tolerance for an opaque modeling process. A data scientist could cast rat bones in a pool of pig blood while praying to a golden calf and so long as they provided highly accurate results and increased profit (despite their dubious connection with data science), most managers would be content to leave them alone. Black boxes don’t scare private companies so long as what comes out of the box makes them money.
But data science in the public interest, for example making cities more efficient, livable, and better functioning, is different from both of these applications. The immediate payoff of good data science is more effective application of scarce resources to meet a public agency’s statutory or moral obligation to its citizenry. While the tools and techniques may be similar, the process of modeling data must be more transparent than in either the academic or private industry setting. Public data is compelled from citizens in a way that requires greater stewardship and transparency. While I would argue that much of the data collected by private companies is for all intents and purposes compelled from consumers, government bodies must take their responsibility to citizens more seriously, particularly in an open society such as ours.
Important also is the political culture of the US. From our founding, we’ve been concerned with the arbitrary exercise of state power. A black box approach to make regulatory and policy decisions plays into our deep-seated fears of overreaching government powers and a monolithic bureaucracy. As such, the data science methods employed by public agencies and government bodies must not only be more transparent but readily explainable to a public prone to view such approaches with tacit suspicion if not outright hostility. As data scientists, we can’t ask an elected official to go in front of constituents and tell them their building is being condemned because “an algorithm you can’t understand told me to.”
This doesn’t mean we’re left with flint and iron while our cousins in private industry and academia get gunpowder and steel, but it means we have more factors to consider when applying an algorithm to a particular data problem. If I’m building a model to help better deploy police during crisis events, I’d better make sure that model can be explained in a townhall meeting by a police commissioner or city council member with little to no training in computer science in front of concerned citizens worried their community has been unfairly targeted. If I’m building a model to improve an internal business process in the financial services department full of statisticians and accountants, I can dig deeper into the toolbox for something more opaque but more powerful in answering their particular need. If I’ve been able to build a relationship of trust with managers, their employees, and their constituents, then I can try something more advanced, but only after I’ve addressed each stakeholder’s concerns.
I have to accept that my model based on a decision tree will be much less accurate than one using an artificial neural network, but the 30-40% lift over current practices without a model saves the agency I’m working for hundreds of thousands of dollars and hundreds of man-hours, while being explainable to inexperienced managers and a skeptical public.
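The explainability I’m describing can be made concrete. Below is a minimal sketch of what a transparent, decision-tree-style model looks like in practice: a hand-written set of rules with hypothetical feature names and thresholds (not a real deployment model). The point is that every prediction corresponds to a rule that can be read aloud in plain English at a town hall meeting, which is exactly what an opaque model cannot offer.

```python
# A minimal sketch of an explainable, tree-style model with hypothetical
# features and thresholds. Every prediction maps to a plain-English rule.

def deploy_extra_crews(calls_last_hour: int, event_scheduled: bool) -> bool:
    """Decide whether to deploy extra crews, using readable rules."""
    if calls_last_hour > 50:
        # Rule 1: "Calls exceeded 50 in the last hour."
        return True
    if event_scheduled and calls_last_hour > 20:
        # Rule 2: "A public event is scheduled and calls exceeded 20."
        return True
    # Default: "Neither condition was met, so staffing stays normal."
    return False

print(deploy_extra_crews(60, False))  # True  (rule 1)
print(deploy_extra_crews(25, True))   # True  (rule 2)
print(deploy_extra_crews(10, False))  # False (default)
```

A learned decision tree from a library like scikit-learn can be flattened into exactly this kind of if/else structure, which is why it can be defended to a skeptical public in a way a neural network’s weight matrices cannot.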
This is the kind of data science we need for cities of the 21st century: neither perfect nor optimal, but necessary. That is why I’m glad to be joining the inaugural class in Applied Urban Science and Informatics at the new Center for Urban Science and Progress at NYU. I look forward to discussing the topics raised in this blog post as I go through the program and out into my future career as an urban data scientist.