I have decided to take a break this week from the ‘leap-blogging’ approach I’ve taken up until now (I have just coined that term – it means to blog disjointedly about whichever topic one feels inclined towards). Sticking with the theme of my last post, I’ve been spending a lot of time thinking about some more of the issues around big data (or ‘Big Data’ if you like capitals and inverted commas); the way I see it at the moment, there are a few very useful intersections to be considered.
You can never have too much of a good thing, or so the saying goes. I’m not so sure – by the time you’ve read this, hopefully you’ll have a sense of the problem I’m going to call ‘information overload’. I’m going to draw on two separate examples but, as you will see, there are parallels between them.
Last time I discussed the sheer scale of what is referred to as big data. I also indicated that the primary driving force behind the accumulation and analysis of such data is industry – big data is big business. Taking this as a point of departure, we can acknowledge a key truth about big data: making sense of it is difficult, costly and time-consuming. But of course big data is not an issue restricted only to the world of consumer analytics and marketing. Social scientists are beginning to tread gingerly on the foothills of the mountain that is big data. And, much like a real mountain, tackling it demands the right tools.
At a recent event hosted by the British Sociological Association (BSA), some of the teething problems in the relationship between social science and big data were laid out. I won’t go into detail but I’d recommend reviewing the conference hashtag ‘bigdatabl’ on Twitter and checking out the synopsis of the day from Paola Tubaro as well as some neat visual summaries from Sam Martin; below, her 'influencermoth' diagram shows which Twitterers used the conference hashtag the most and who had the greatest social reach. Appropriately enough, this provides an example of some effective uses to which big (social media) data can be put.
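For a flavour of how straightforward the mechanics of such a summary can be, here is a minimal sketch in Python. The filename and columns – a hypothetical ‘bigdatabl_tweets.csv’ with ‘user’ and ‘followers’ fields – are my own assumptions for illustration, not the actual data behind Sam’s diagrams:

```python
# Sketch: who used a conference hashtag most, and whose tweets had the widest reach.
# Assumes a hypothetical CSV export of hashtag tweets with 'user' and 'followers' columns.
import pandas as pd

tweets = pd.read_csv("bigdatabl_tweets.csv")  # hypothetical export

# Most frequent users of the hashtag
counts = tweets["user"].value_counts()

# A crude proxy for 'social reach': total follower-impressions per user
reach = tweets.groupby("user")["followers"].sum().sort_values(ascending=False)

print(counts.head(10))
print(reach.head(10))
```

Follower counts are only a rough proxy for ‘reach’, of course – a small reminder that even the simplest summaries embed analytical choices.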
As I see it, the issue in this regard for the social sciences comes back to that first truth: the complexity of analysing big data in meaningful ways, which is inseparable from the question of scale. Analysis of these datasets requires the significant technical expertise of computer scientists and programmers. I’ve already suggested that tackling the mountain of big data requires the right tools – when you have to craft these for yourself it is an entirely different prospect. For the time being at least, there is simply too much for social science alone to engage with fully. Speaking as a social scientist, I would say that both the analytical tools we possess and the training we receive are currently lacking.
However, I believe the questions being asked and the new directions that sociological inquiry is taking are making headway against the current. Social media seem a good place to start, as projects such as Cardiff University’s COSMOS illustrate. Social media allow for new angles on big social questions – crime, education, health, inequality – because people are engaging with these issues themselves in new forums. The next step is to gradually pull more big datasets into the purview of sociological analysis – and it does have to be gradual. There is too much to attempt to take on at once. We have to learn how to deal with big data in one form (let’s say social media) and then apply those lessons to new forms of big data. This will not only be instructive in how the social sciences can engage with these sorts of data in themselves; I would also anticipate it yielding better results when it comes to tackling social questions across a number of big datasets.

As I have already suggested, there are effective methods that can be carried out relatively easily, but they may not provide much deep insight. I experienced this first hand at a hackathon-type event, surrounded by computer scientists who were able to grab numerous open-access datasets and create various mash-ups of them, such as crime reports and food hygiene ratings of local eateries overlaid on a map of Cardiff. Part of the problem for me, however, was that they could do this faster than I could work out what question we were actually trying to explore.
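To illustrate just how quick that kind of mash-up can be, here is a minimal sketch in Python. The two CSV files and their columns are hypothetical stand-ins for the open datasets being pulled down on the day, and folium is simply one convenient mapping library:

```python
# Sketch of a hackathon-style mash-up: two open datasets (hypothetical CSV
# extracts) overlaid as coloured markers on a single map of Cardiff.
import pandas as pd
import folium

crimes = pd.read_csv("cardiff_crimes.csv")     # hypothetical: lat, lon, category
ratings = pd.read_csv("cardiff_hygiene.csv")   # hypothetical: lat, lon, name, rating

m = folium.Map(location=[51.4816, -3.1791], zoom_start=13)  # central Cardiff

for _, row in crimes.iterrows():
    folium.Marker([row["lat"], row["lon"]],
                  popup=row["category"],
                  icon=folium.Icon(color="red")).add_to(m)

for _, row in ratings.iterrows():
    folium.Marker([row["lat"], row["lon"]],
                  popup=f'{row["name"]}: {row["rating"]}/5',
                  icon=folium.Icon(color="green")).add_to(m)

m.save("cardiff_mashup.html")  # open in a browser to explore the overlay
```

A dozen lines and you have an interactive map – which is exactly the point: the technical step is trivial compared with formulating the question it is supposed to answer.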
All this is what led me to think about a problem of ‘information overload’. There is only so much data that can be meaningfully explored at once. Linked to this is the methodological issue that a large proportion of big data, in whatever form it takes, is junk. Just because social media, for instance, provide us with a wealth of real-time social interactional data does not mean it is all useful. Separating the wheat from the chaff is a time-consuming but necessary task.
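To make that chaff-filtering concrete, here is a deliberately crude sketch. The fields and thresholds are illustrative assumptions rather than a recommended pipeline, but they show why the filtering itself is the easy part – deciding what counts as ‘wheat’ is the real work:

```python
# Sketch: crude 'wheat from chaff' filtering of social media posts.
# The fields and thresholds below are illustrative assumptions only.

def is_wheat(tweet: dict) -> bool:
    text = tweet.get("text", "")
    if tweet.get("is_retweet"):   # drop retweets: duplicated content
        return False
    if len(text) < 20:            # drop near-empty posts
        return False
    if text.count("http") > 2:    # drop likely link-spam
        return False
    return True

tweets = [
    {"text": "RT spam", "is_retweet": True},
    {"text": "A longer reflection on big data and social science methods...",
     "is_retweet": False},
]
wheat = [t for t in tweets if is_wheat(t)]
print(len(wheat), "of", len(tweets), "kept")
```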
I’m going to switch tracks at this point. As soon as I had started thinking about the problem of ‘all these data’ it occurred to me that a similar problem arises in my own research area of anti-surveillance (for want of a better term).
A short while ago I read an article by Professor Laura Huey that outlined the problems in engendering mass resistance against surveillance practices. First, there is no consensus on what the organising issue is – what should we try to coalesce around? What do we brand this a campaign for, or against? Second, assuming we decide the issue is ‘pro-privacy’ or ‘anti-surveillance’, what is privacy? What is surveillance? The former, particularly, is amorphous; there is too much there to try to convey meaningfully, especially given the context-dependent nature of ‘privacy’. If you want to get motivated about something, to make a difference, to force a change in the way things happen, you need a good handle on what exactly it is that needs addressing. People also need to feel that their actions will have consequences, or else why bother?
The Snowden revelations – or ‘The NSA Files’ as the news media have begun to brand them – contemporise this problem. Clearly, there are questions to be answered about the activities of our intelligence agencies, particularly the apparent lack of oversight of their substantial data retention and data sharing practices. Snowden’s leaked information is of course crucial to fostering this debate, and I don’t doubt that the steady drip-feed approach The Guardian journalists have taken is the best course of action. It is a different tactic to that employed, for example, when WikiLeaks released the war diaries or the diplomatic cables en masse. But even so, I still see the releases as potentially obfuscating the debate that needs to occur. Definitely a Catch-22.
Take a moment to consider what you know about The NSA Files (oh, see, I’m doing it too now – it’s catchy). If you believe there are problems that need addressing, where do you begin? First there was PRISM, the revelation with which the story broke. Then the alleged tapping of submarine cables by GCHQ. The sharing of raw communications data between the NSA and the Israeli SIGINT agency. Targeted attacks on the Tor network and other attempts to defeat encryption. Not forgetting the bugging of Angela Merkel’s phone or the collection of metadata from 60 million Spanish telephone calls. This is a great example of information overload – I’m not sure where I would begin if I wanted to formulate a coherent challenge to this surveillance regime. But as I say, the information is valuable nevertheless, so it is likely a question of balance. Much like social science engaging with big data, it needs to be a process of prioritising, filtering and, most importantly, understanding.
There is one last parallel I want to draw out. The big data revolution and the surveillance capabilities of the NSA and GCHQ go hand in hand – we have seen this from the outset. Some of the major big data players, such as Google, are either complicit in the agencies’ activities or have themselves been targeted by the same efforts. Whichever is the case (both, in all probability), big data and the surveillance state intersect in a number of ways – but we know this already.
What struck me is that big data presents as much of a mountain to climb for the intelligence agencies as it does for the social sciences. But while the former have the means at their disposal to collect virtually all of it, they face the same problems of complexity and scale. How do you find the needle in such a massive haystack? I’ll reiterate what I said above – most of the data collected is useless, or at least of no interest to the NSA and GCHQ. Take a look at page 15 of Dr Ian Brown’s witness statement to the ECHR in the current case being brought against the UK – all of you BitTorrent users can breathe a sigh of relief.
The intelligence questions do still need to be addressed, and considering them in this way is helpful. Such data collection/retention practices are a logical response to the big data revolution. But unless you can address the ‘how’ of big data, the ‘why’ is fairly irrelevant. Climbing the mountain ‘because it is there’ doesn’t address the difficulty of doing so; doing it because you can doesn’t solve the problems of big data.