PRISM, Snowden and the NSA: when Big Data becomes Too Much Data

A lot has already been written about Edward Snowden’s claims regarding the NSA and its PRISM program. I don’t intend to get into the politics of this issue or contemplate what Snowden’s motivations might be – this has been covered at length elsewhere. My interest, at this point, lies in the technical implications of what Snowden has apparently revealed. Would the kind of intercept database he has posited actually be possible to create? And assuming it could, what realistic analysis would it be capable of delivering?

Snowden has, regrettably, been very vague with the technical details thus far. While he may have bamboozled the Guardian’s Glenn Greenwald (a lawyer) into believing he is some kind of “master on computers”, that much has not been apparent to me from the level (lack) of detail displayed in his interviews and recent Q&A session. Techies, for example, don’t tend to respond to questions on encryption as if there were only one kind. A friend of mine said it would have been more fun if the questioner had asked him whether NSA intercepts were immune to ROT13 encryption, just to gauge Snowden’s reaction. I have also never known anyone who knows one end of a database from the other to use a clunky expression like “raw query access”. Direct query access, sure. Raw data, yep. But “raw query access”? It all begins to sound a little like a management consultant who’s been in a few meetings with techies but hasn’t quite grasped what they were talking about.

From the time Snowden’s claims were first published by the Guardian over a week ago to the Q&A session which took place yesterday, the precise details of what PRISM really is have shifted. It is no longer clear whether he is claiming that the NSA can grab data at will, as and where they see a need in individual cases, or whether PRISM is some kind of gargantuan surveillance system tracking the online communications of everyone around the globe. Given that Snowden has repeatedly referred to PRISM as “suspicionless surveillance”, this would at the very least suggest some kind of program that is indiscriminate in its data gathering. To be indiscriminate, it would need to be harvesting data on everyone rather than just those people whom the NSA have marked as potential threats. For surely, if someone is a potential threat, surveilling them is no longer “suspicionless”. Snowden has also been keen to stress how “you’re being watched”. Again, this points to a system of intrusive data collection that is active rather than merely reactive.

So, for the sake of argument, let’s take Snowden at his word. PRISM is a suspicionless surveillance program that monitors the online activities of everyone across the globe indiscriminately. To be even remotely useful to the NSA, it would need to track e-mails, tweets and social media interactions. First, some statistics:

  1. Facebook is growing by 500 terabytes per day. [2012 figures]
  2. Twitter is growing by 8 terabytes per day. [2010 figures]
  3. Google processes 24 petabytes per day. [2012 figures]

That last statistic is worth pausing over. 1 petabyte is one thousand terabytes, which is one million gigabytes. To date, the largest single array ever built was by IBM in 2012, and it was 120PB in size. So, that’s enough for 5 days’ worth of Google data. What these figures don’t include is the amount of redundancy that must necessarily be built into any decent storage solution. Smaller companies would tend to opt for something like RAID5, which stripes data and parity information across multiple disks, allowing a single disk to go kaput without suffering data loss. Companies on the scale of Google or Facebook go much further, typically replicating every block of data three or more times across machines and data centres – and thus their storage needs will be even greater than the statistics above would suggest.
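The arithmetic here is simple enough to check directly. The following sketch uses only the figures quoted above; the 3x replication factor is an assumption for illustration (roughly what large-scale distributed filesystems are reported to use), not a known NSA or Google number:

```python
# Back-of-envelope check on the storage figures quoted above.
# Inputs are the article's own numbers; the 3x replication factor
# is an illustrative assumption.

PB = 1000  # terabytes per petabyte (decimal units)

google_pb_per_day = 24        # petabytes/day (2012 figure)
facebook_tb_per_day = 500     # terabytes/day (2012 figure)
twitter_tb_per_day = 8        # terabytes/day (2010 figure)

ibm_array_pb = 120            # largest single array, per the article

# How long would the record-setting IBM array hold Google's intake alone?
days_of_google = ibm_array_pb / google_pb_per_day
print(f"IBM 120PB array holds {days_of_google:.0f} days of Google data")

# Raw intake across all three services, per year, in petabytes:
daily_pb = google_pb_per_day + (facebook_tb_per_day + twitter_tb_per_day) / PB
yearly_pb = daily_pb * 365
print(f"Raw intake: ~{yearly_pb:,.0f} PB/year")

# With an assumed 3x replication factor for redundancy:
replication = 3
print(f"With {replication}x replication: ~{yearly_pb * replication:,.0f} PB/year")
```

Even on these conservative inputs, a year of raw intake runs to thousands of petabytes – dozens of record-setting IBM arrays before redundancy is even considered.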

This much storage has to be physically located somewhere. Google’s server farms are not small. And bear in mind that in the figures quoted, we’re just sticking with Google, Facebook and Twitter. What about Yahoo!, Microsoft and the full gamut of less popular but equally necessary (from a surveillance point of view) e-mail service providers? Bear in mind also that Snowden has, as of yesterday, made it clear that he is talking about content, not simply records of interaction. He claims the NSA can go right down to the attachments that have been included on e-mails. That level of data has got to go somewhere. And believe me, the costs would far exceed the paltry $20m PRISM budget cited in one of Snowden’s PowerPoint decks.
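To put a rough number on that last claim: the sketch below assumes a deliberately conservative, hypothetical hardware cost of $100,000 per petabyte (ignoring power, cooling, networking, staff and buildings entirely) against the daily intake figures already quoted. Every input apart from the $20m budget figure is an assumption:

```python
# A deliberately rough cost sketch. The cost-per-petabyte figure is a
# hypothetical, conservative assumption; the intake figures are the
# article's own; the $20m is the budget cited in the leaked deck.

COST_PER_PB_USD = 100_000    # assumed raw hardware cost per petabyte
daily_intake_pb = 24.5       # Google + Facebook + Twitter combined, roughly
replication = 3              # assumed redundancy factor

yearly_cost = daily_intake_pb * 365 * replication * COST_PER_PB_USD
prism_budget = 20_000_000

print(f"Yearly storage hardware alone: ~${yearly_cost / 1e6:,.0f}m")
print(f"Cited PRISM budget:            ${prism_budget / 1e6:,.0f}m")
print(f"Shortfall factor:              ~{yearly_cost / prism_budget:,.0f}x")
```

Even with every assumption tilted in PRISM’s favour, the hardware bill alone comes out two orders of magnitude above the cited budget – before a single engineer is paid.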

Let’s build on our assumptions. Let’s say the NSA has managed, somehow, to construct a server farm filled with IBM-level arrays managed by engineers of the same calibre and quality as Google. And they’ve done so without anyone noticing. It’s a stretch, but it’s just about possible – assuming vastly greater sums of money than have been quoted. Now let’s look at how it could be harnessed. The NSA has somehow managed to amass all of this data – raw, unstructured and growing rapidly by the minute – so what are they going to do with it?

The instinctive reaction would be that since they have all of the data at their disposal, they can do pretty much anything. But that is to misunderstand how data analysis works. To be sure, MapReduce, Hadoop and other technologies do allow querying across vast datasets – but they do so on the basis of trends and pattern spotting. Google Analytics is able to track the ebb and flow of popular search terms, and Twitter can show you what is trending right now in real time – but this is done by enormous feats of aggregation. What it can’t tell you very easily is who is tweeting what right now, who is searching for what right now and who is talking to whom right now at the level of individuals.

All of that data is identified by an IP address, but an IP address is not a person. Linking an actual person to an IP address can be phenomenally difficult. IP addresses are tied to network locations (unless they’re being bounced, which is easy work for any hacker worth their black hat), not to people. Now admittedly, with the help of Internet Service Providers (ISPs), you can work out who owns an internet account and therefore who is likely to be behind a given IP address – but the data itself will not tell you that. In other words, you need to know who you’re stalking first before any of the data will mean anything. That doesn’t sit well with the idea of suspicionless surveillance, in which the entire premise is that all data is tracked indiscriminately. On top of all of that, you’d have to have a way of knowing when one person is operating behind multiple IP addresses (their home PC, their smartphone and any open wi-fi networks they connect to) and then map it all together into a complete picture of an individual’s online activities. It should be apparent by now that this is next to impossible.

Even allowing for things like logging in to social media accounts, which can then be identified with a particular person (cookie tracking, similar to the way in which real-time ad platforms work), you are facing an almost insurmountable task of picking out all of this data from an ever increasing, unstructured web of information. To adapt the cliché, it would be like looking for a needle in a haystack, except that the needle has been broken into many tiny fragments and every few minutes the farmer comes along and piles on more hay.
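The asymmetry described above can be shown in a toy sketch: aggregate trends fall straight out of the data, but tying activity to a person needs information the intercepted traffic simply doesn’t contain. The sample events and the ISP subscriber table here are entirely invented for illustration:

```python
# Toy illustration: trend-spotting versus identifying individuals.
# The events and the ISP lookup table are invented for illustration.

from collections import Counter

# Intercepted events: (source IP, search term) pairs.
events = [
    ("203.0.113.5", "weather"),
    ("198.51.100.7", "weather"),
    ("203.0.113.5", "football"),
    ("192.0.2.44", "weather"),
]

# Trend-spotting (the kind of job MapReduce/Hadoop excels at):
# a straightforward aggregation, no identity needed.
trending = Counter(term for _ip, term in events)
print(trending.most_common(1))  # [('weather', 3)]

# Identification: the data itself only yields an IP address.
# Resolving it to a person requires an external source -- here a
# hypothetical ISP subscriber table that would have to be obtained
# separately, from every ISP, for every address, and is incomplete
# by nature.
isp_records = {"203.0.113.5": "subscriber #8812"}

for ip, _term in events:
    print(ip, "->", isp_records.get(ip, "unknown"))
```

The first query works on the data alone; the second is only as good as the external lookup table, which is exactly the point – the identities are not in the intercepts.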

In all the talk of Big Data and the increasing reach of the state into our private affairs, it is all too easy to believe the claims that Snowden has made. It sounds reasonable at first: of course the security services are spying on us, of course they’re tracking every move we make on the internet. Yet, a closer inspection of what this would really mean in practice, taking into account the limits on money, resources, technology – even sheer physical space – starts to make it all look rather less credible.

Snowden may yet surprise us all and, in a grand reveal worthy of Agatha Christie herself, show how it was all done with mirrors. Maybe the NSA – with all the technological pedigree of any government-run body – really is streets ahead of private companies like Google and Apple.

But I remain to be convinced.
