I’m not a statistician by training, but as a data scientist I rely heavily on statistical methods, and I’ve tried to learn as much as I can. Often this helps me understand the reasonable limits and possibilities of what can be inferred from a given dataset. For instance, understanding how the data in a set is distributed helps me find meaningful ways to describe it. For data that is normally distributed, the mean captures the typical value within the dataset. It’s thus truly a representative value for the dataset as a whole (along with other important statistical measures). For data that’s bimodal (see example below) or doesn’t follow any well-behaved distribution, the mean is meaningless because it isn’t representative of the dataset, and some other way of describing the data must be found.
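As a quick illustration with invented numbers: imagine visit durations at a gym that cluster around two modes, short 20-minute sessions and long 80-minute ones. The mean lands between the two clusters, at a value almost no visitor actually has.

```python
import statistics

# Hypothetical, invented data: visit durations in minutes, clustered
# around two modes (~20 min and ~80 min) -- a bimodal distribution.
durations = [18, 20, 21, 19, 22, 78, 80, 82, 79, 81]

mean = statistics.mean(durations)
print(f"mean = {mean:.0f} minutes")  # the mean falls between the two modes
```

Here the mean comes out to 50 minutes, a duration that describes neither group of visitors; reporting the two modes (or the full distribution) says far more about the data.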
Another important area of statistics is sampling. Obviously, having a full enumeration of all possible data points is preferable to only having some subset of the data. When trying to determine the demographic characteristics of all incoming university freshmen, no one would ask the registrar to provide only a fraction of the profiles for analysis.
However, it’s not always possible or feasible to enumerate the entire population. The US Census Bureau is constitutionally mandated to enumerate the entire population every 10 years. The 2010 Census cost over $13 billion, much of it to fund the swell of workers going door to door to record the responses of over 300 million Americans. Despite the richness of this data for government (over $400 billion in funds is distributed each year based on the Census, not to mention Congressional redistricting) and for businesses (which use it in marketing campaigns and in deciding where to open new stores), this is a large and expensive effort that couldn’t reasonably be conducted more often. The American Community Survey (ACS) helps fill the 10-year gap by sampling 3 million households a year, weighting the results, and aggregating over multiple years to produce a reasonable approximation of the population.
The theory and practice of sampling have been well developed over the past 100 years, particularly in how statisticians calculate the sampling error, or how well the sample describes the population. This can be difficult to work with at times, particularly for small subsets of the population that are hard to capture in a random sample. With modern data collection through cellphones, tracking cookies, and other forms of data exhaust, many believe we no longer need surveys or other measurement techniques: merely collect the data and analyze the results.
The problem with this approach is that while surveys target the individual, the data exhaust targets the devices generating the data, which serve only as proxies for individuals. For example, if I want to measure traffic flowing on I-95 through New Jersey, I could set up a series of monitoring stations that count the number of cars passing by, their type, and their direction. If I have counters at each of these sites continuously throughout the study period, I’m enumerating the entire population of drivers along that stretch of roadway. Instead of having these stations monitored 24 hours a day, I could collect data at random times and locations to estimate the traffic flows over time, saving time and money at the cost of a sampling error that I can calculate and account for.
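The random-sampling trade-off above can be sketched in a few lines. This is a toy simulation with assumed numbers (a single station, a 720-hour study period, a made-up rush-hour pattern), not real traffic data; the point is that the standard error of the mean quantifies how far the sampled estimate is likely to sit from the full enumeration.

```python
import random
import statistics

random.seed(42)

# Assumed, simulated truth: hourly car counts over 30 days (720 hours),
# with heavier traffic during morning and evening rush hours.
true_counts = [200 + 150 * ((h % 24) in range(7, 10) or (h % 24) in range(16, 19))
               for h in range(720)]
true_mean = statistics.mean(true_counts)  # what full enumeration would give

# Instead of monitoring all 720 hours, observe 60 hours chosen at random.
sample = random.sample(true_counts, 60)
est_mean = statistics.mean(sample)

# Standard error of the mean: the calculable sampling error.
se = statistics.stdev(sample) / (len(sample) ** 0.5)

print(f"true mean/hour: {true_mean:.1f}, estimate: {est_mean:.1f} ± {se:.1f}")
```

Because the 60 hours are a random sample of all 720, the estimate is unbiased and the ± term tells us how much to trust it; that guarantee is exactly what disappears in the proxy-based approach below.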
I could also do the same thing using smart phone data to track the movement of drivers on the roadways. Leaving aside the technical and legal challenges of collecting cell phone data on a large scale, you can already see that there is a problem: not everyone has a smart phone. I’m introducing a systematic bias towards those with smart phones in my sample. If I’m only getting data from some of the carriers, I’m further biasing towards users on those networks. If not all of them consent to having their data collected or don’t have their GPS turned on, I’m further biasing my sample and severely constraining the analysis that comes from this work.
Were I to use smart phones in this study, I would effectively be using a technological proxy, namely GPS-enabled smart phones on networks I can collect data from, in place of the thing I’m actually trying to measure, namely drivers on the road. I think we too often get so caught up in the possibilities of these technological proxies (and the seeming ease with which the data can be collected and processed) that we miss the reality that we’re not always measuring what we think we’re measuring.
In many cases, this is a distinction without a difference. If we want to understand the activity of visitors to a webpage, we can easily review the log data. The actions of their web browsers are perfect proxies for their activities. But trying to infer trends in the larger population from webpage activity introduces a bias towards those who have Internet access (in 2011, only 71.7% of households had Internet access at home). Obviously this doesn’t mean these proxies can’t be put to good use. The Google Flu Trends analysis is a great example of using data collected from online users to predict activity in the offline world. But it’s important to recognize that replacing samples with proxies shifts the problem from one of sampling error (a well-understood phenomenon that’s relatively easy to account for) to one of systematic bias (a phenomenon that’s far harder to detect and correct for, and that leads to all kinds of issues).
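That shift from sampling error to systematic bias can be made concrete with a toy simulation. All the numbers here are assumptions invented for illustration (a 60% smartphone-ownership rate, and owners who happen to drive faster): the proxy "sample" is enormous, yet its error never shrinks, while a modest random sample of the whole population lands near the truth.

```python
import random
import statistics

random.seed(0)

# Assumed population of 100,000 drivers. For illustration, smartphone
# ownership (60%) is correlated with driving speed -- owners average
# 68 mph, non-owners 62 mph. These numbers are invented.
population = []
for _ in range(100_000):
    has_smartphone = random.random() < 0.6
    speed = random.gauss(68 if has_smartphone else 62, 5)
    population.append((has_smartphone, speed))

true_mean = statistics.mean(s for _, s in population)

# Proxy-based measurement: every smartphone owner. Huge n, biased frame.
proxy_mean = statistics.mean(s for owns, s in population if owns)

# Small simple random sample of the whole population: sampling error only.
srs_mean = statistics.mean(random.sample([s for _, s in population], 500))

print(f"true: {true_mean:.1f}  proxy: {proxy_mean:.1f}  random n=500: {srs_mean:.1f}")
```

The proxy estimate overshoots by a couple of mph no matter how many devices report in, because the error is baked into who is in the frame; the random sample's error, by contrast, is quantifiable and shrinks with sample size.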
Data doesn’t lie, but the inability or unwillingness to comprehend systematic bias in data collection can easily lead to dangerously incorrect results. Only by understanding data, from the point it’s collected until it lands on your hard drive or cloud storage and all points in between, can we as data scientists properly employ the powerful statistical and computational tools at our disposal to further understanding through data, without looking like hucksters selling big data as the snake oil of the 21st century.