Who can we trust in the big-data future?
"Big data" is the jargon du jour, the tech world's one-size-fits-all (so long as it's triple XL) answer to solving the world's most intractable problems. The term is commonly used to describe the art and science of analysing massive amounts of information to detect patterns, glean insights, and predict answers to complex questions.
The past week's revelations about data surveillance by US government agencies have concentrated on its extent, oversight and privacy implications. But for the evangelists of big data, there is no problem - from stopping terrorists to ending poverty to saving the planet - too big for it to solve.
"The benefits to society will be myriad, as big data becomes part of the solution to pressing global problems like addressing climate change, eradicating disease, and fostering good governance and economic development," crow Viktor Mayer-Schonberger and Kenneth Cukier in the modestly titled Big Data: A Revolution that Will Transform How We Live, Work, and Think.
So long as there are enough numbers to crunch - whether it's data from your iPhone, supermarket purchases, online dating profile or, say, the anonymous health records of an entire country - the insights to be gleaned from computationally decoding this raw data are, we are told, innumerable. In May, just weeks before privacy fears were stoked, Barack Obama's administration jumped with both feet on the bandwagon, releasing a "ground-breaking" trove of "previously inaccessible or unmanageable data" to entrepreneurs, researchers and the public.
"One of the things we're doing to fuel more private-sector innovation and discovery is to make vast amounts of America's data open and easy to access for the first time in history. And talented entrepreneurs are doing some pretty amazing things with it," Obama said at the time.
But is big data really all it's cracked up to be? Can we trust that so many ones and zeros will illuminate the hidden world of human behaviour? Foreign Policy goes behind the numbers to examine some of the assumptions, biases and blind spots.
"With enough data, the numbers speak for themselves"
Not a chance. The promoters of big data would like us to believe that behind the lines of code and vast databases lie objective and universal insights into patterns of human behaviour, be it consumer spending, criminal or terrorist acts, healthy habits or employee productivity. But many big-data evangelists avoid taking a hard look at the weaknesses.
Numbers can't speak for themselves, and data sets - no matter their scale - are still objects of human design. The tools of big-data science, such as the Apache Hadoop software framework, do not immunise us from skews, gaps and faulty assumptions.
Those factors are particularly significant when big data tries to reflect the social world we live in, yet we can often be fooled into thinking that the results are somehow more objective than human opinions.
Biases and blind spots exist in big data as much as they do in individual perceptions and experiences. Yet there is a problematic belief that bigger data is always better data and that correlation is as good as causation.
For example, social media is a popular source for big-data analysis, and there's certainly a lot of information to be mined there. Twitter data, we are told, informs us that people are happier when they are further from home and saddest on Thursday nights. But there are many reasons to ask questions about what this data really reflects.
For starters, we know from the Pew Research Center that only 16 per cent of online adults in the United States use Twitter, and they are by no means a representative sample - they skew younger and more urban than the general population.
Further, we know that many Twitter accounts are not people at all: they are "bots" (automated response programs), fake accounts, or "cyborgs" - human-controlled accounts assisted by bots. Recent estimates suggest there could be as many as 20 million fake accounts. So even before we get into the methodological minefield of how to assess sentiment on Twitter, we should ask whether those emotions are being expressed by people or merely by automated algorithms.
But even if you're convinced that the vast majority of tweeters are real flesh-and-blood people, there's still the problem of working out what their tweets actually mean. For example, to determine which players in the 2013 Australian Open were the "most positively referenced" on social media, IBM conducted a large-scale analysis of tweets about the players via its Social Sentiment Index. The results determined that Victoria Azarenka was top of the list. But many of those mentions of Azarenka on Twitter were critical of her controversial use of medical timeouts. So did Twitter love her or hate her? It's difficult to trust that IBM's algorithms got it right.
Once we get past the dirty-data problem, we can consider the ways in which algorithms themselves are biased. News aggregator sites that use your personal preferences and click history to funnel in the latest stories on topics of interest also come with their own baked-in assumptions - for example, assuming that frequency equals importance or that the most popular news stories shared on your social network must also be interesting to you. As an algorithm filters through masses of data, it is applying rules about how the world will appear - rules that average users will never get to see, but that powerfully shape their perceptions.
Some computer scientists are moving to address these concerns. Ed Felten, a Princeton University professor and former chief technologist at the US Federal Trade Commission, recently announced an initiative to test algorithms for bias, especially those that the US government relies on to assess the status of individuals, such as the infamous "no-fly" list that the FBI and Transportation Security Administration compile from the numerous big-data resources at the government's disposal and use as part of their airport security regimes.
"Big data will make our cities smarter and more efficient"
Up to a point. Big data can provide valuable insights to help improve our cities, but it can only take us so far. Because not all data is created or even collected equally, there are "signal problems" in big-data sets - dark zones or shadows where some citizens and communities are overlooked or under-represented. So big-data approaches to city planning depend heavily on city officials understanding both the data and its limits.
For example, Boston's Street Bump app, which collects smartphone data from drivers going over potholes, is a clever way to gather information at low cost, and more apps like it are emerging. But if cities begin to rely on data that only comes from citizens with smartphones, it's a self-selecting sample - it will necessarily have less data from those neighbourhoods with fewer smartphone owners, which typically include older and less affluent populations.
While Boston public servants made concerted efforts to address these potential data gaps, less conscientious officials may miss them and end up misallocating resources in ways that further entrench social inequities. One need only look to the 2012 Google Flu Trends miscalculations, which significantly overestimated annual flu rates, to realise the impact that relying on faulty big data could have on public services and public policy.
The same is true for "open government" initiatives that post public-sector data online, such as Data.gov and the White House's Open Government Initiative. More open data will not, by itself, improve government functions such as transparency or accountability unless there are mechanisms that allow the public to engage with their institutions - and unless government has the capacity to interpret the data and respond with adequate resources. None of that is easy. In fact, there just aren't many skilled data scientists around yet. Universities are currently scrambling to define the field, write curriculums and meet demand.
Human rights groups are also looking to use big data to help understand conflicts and crises. But here too there are questions about the quality of both the data and the analysis. The MacArthur Foundation recently awarded an 18-month, $US175,000 grant to Carnegie Mellon University's Center for Human Rights Science to investigate how big-data analytics are changing human rights fact-finding, such as through the development of "credibility tests" to sort alleged human rights violations posted to sites like CrisisMappers, YouTube, Ushahidi and Facebook.
The director of the Center, Jay D. Aronson, notes that there are "serious questions emerging about the use of data and the responsibilities of academics and human rights organisations to its sources. In many cases, it is unclear whether the safety and security of the people reporting the incidents is enhanced or threatened by these new technologies."
"Big data doesn't discriminate between social groups"
Hardly. Another promise of big data's alleged objectivity is that there will be less discrimination against minority groups because raw data is somehow immune to social bias, allowing analysis to be conducted at a mass level and thus avoiding group-based discrimination. Yet big data is often deployed for exactly this purpose - to segregate individuals into groups - because of its ability to make claims about how groups behave differently. For example, a recent paper points to how scientists are allowing their assumptions about race to shape their big-data genomics research.
As Alistair Croll writes, the potential for big data to be used for price discrimination (charging different customers different prices for the same goods or services) raises serious civil rights concerns. Under the rubric of "personalisation", big data can be used to isolate specific social groups and treat them differently, something that laws often prohibit businesses or humans from doing explicitly. Companies can choose to show online ads for a credit card offer only to people whose household income or credit history makes them most attractive to banks, leaving everyone else completely unaware that the offer exists. Google even has a patent to dynamically price content: if your past buying history indicates you are more likely to pay top dollar for shoes, your starting price the next time you shop for footwear online might be considerably higher.
Now employers are trying to apply big data to human resources, assessing how to make employees more productive by analysing their every click and tap. Employees may have no idea how much data is being gathered about them or how it is being used.
Discrimination can also take on other demographic dimensions. For example, The New York Times reported that US retailer Target started compiling analytic profiles of its customers years ago; it now has so much data on purchasing trends that it can predict under certain circumstances if a woman is pregnant with an 87 per cent confidence rate, simply based on her shopping history. While the Target statistician in the article emphasises how this will help the company improve its marketing to expectant parents, one can also imagine such determinations being used to discriminate in other ways that might have serious ramifications for social equality and, of course, privacy.
And recently, a big-data study from Cambridge University of 58,000 Facebook "likes" was used to predict very sensitive personal information about users, such as sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parents' marital status, age and gender.
As journalist Tom Foremski observes of the study: "Easy access to such highly sensitive information could be used by employers, landlords, government agencies, educational institutes and private organisations, in ways that discriminate [against] and punish individuals. And there's no way [to] fight it."
Finally, consider the implications in the context of law enforcement. US police are turning to "predictive policing" models of big data in the hope they will shine investigative light on unsolved cases and even help prevent future crimes. However, focusing police activity on particular big data-detected "hot spots" runs the danger of reinforcing stigmatised social groups as likely criminals and institutionalising differential policing as a standard practice.
As one police chief has written, although predictive policing algorithms explicitly avoid categories such as race or gender, the practical result of using such systems without sensitivity to differential impact can be "a recipe for deteriorating community relations between police and the community, a perceived lack of procedural justice, accusations of racial profiling, and a threat to police legitimacy".
"Big data is anonymous, so it doesn't invade our privacy"
Flat-out wrong. While many big-data providers do their best to de-identify individuals from human-subject data sets, the risk of re-identification is very real. Mobile phone data, en masse, may seem fairly anonymous, but a recent study of a data set of 1.5 million mobile users in Europe showed that just four points of reference were enough to individually identify 95 per cent of people. There is a uniqueness to the way that people make their way through cities, the researchers observed, and given how much can be inferred from the large number of publicly available data sets, this makes privacy a "growing concern".
But big data's privacy problem goes far beyond standard re-identification risks. Medical data that is sold to analytics firms, for example, could potentially be traced back to individual patients. There is also a lot of chatter about personalised medicine - the hope that drugs and other therapies will be so individually targeted that they heal a person's body as if they were made from that person's very own DNA.
It's a wonderful prospect in terms of improving the power of medical science, but it is fundamentally reliant on personal identification at the cellular and genetic level, with high risks if that information is used inappropriately or leaked. Despite the rapid growth of personal health data-collecting apps such as RunKeeper and Nike+, however, the practical use of big data to improve healthcare delivery is still more aspiration than reality.
Other kinds of intimate information are being collected by big-data energy initiatives, such as the federal government's Smart Grid, Smart City trials in New South Wales. Smart grid initiatives look to improve the efficiency of energy distribution to our homes and businesses by analysing enormous data sets of consumer energy usage.
The projects have great promise but also come with great privacy risks. They can predict not only how much energy we need and when we need it, but also minute-by-minute information on where we are in our homes and what we are doing. This can include knowing when we are in the shower, when our dinner guests leave for the night and when we turn off the lights to go to sleep.
Of course, such highly personal big-data sets are a prime target for hackers or leakers.
"Big data is the future of science"
Partly true, but it has some growing up to do. Big data offers new roads for science, without a doubt. We need only look to the discovery of the Higgs boson, the result of the largest grid-computing project in history, in which CERN used the Hadoop Distributed File System to manage all the data. But unless we recognise and address some of big data's inherent weaknesses, we may make major public policy and business decisions based on incorrect assumptions.
To address this, data scientists are starting to collaborate with social scientists, who have a long history of critically engaging with data: assessing sources, the methods of data collection and the ethics of use. Over time, this means finding new ways to combine big-data approaches with small-data studies. This goes well beyond advertising and marketing approaches like focus groups or A/B testing (in which two versions of a design or outcome are shown to users in order to see which variation proves more effective). Rather, new hybrid methods can ask questions about why people do things, beyond just tallying up how often something occurs. That means drawing on sociological analysis and deep ethnographic insight as well as information retrieval and machine learning.
Technology companies recognised early that social scientists could give them greater insight into how and why people engage with their products, such as when Xerox PARC hired pioneering anthropologist Lucy Suchman. The next stage will be a richer collaboration between computer scientists, statisticians and social scientists of many stripes - not just to test the findings of each other's work, but to ask fundamentally different kinds of questions with greater rigour.
Given the immense amount of information collected about us every day, we must decide sooner rather than later whom we can trust with that information, and for what purpose. We can't escape the fact that data is never neutral and that it's difficult to make anonymous. But we can draw on expertise across different fields in order to better recognise biases, gaps and assumptions, and to rise to the new challenges to privacy and fairness. Foreign Policy
Kate Crawford is principal researcher at Microsoft Research, a visiting professor at the MIT Center for Civic Media and an associate professor at the University of New South Wales.