Big data is fun to talk about. It's espionage, geekery, and your daily trip to 7-11 all rolled into one. Everyone is interested, everyone can relate, and everyone has questions. Here's one for you to consider: Will the state ever be able to both collect and analyze all of the relevant intelligence data produced by its population?
The easy answer is no. Words like “all” and “relevant” immediately obviate the question as posed. However, it is a mistake to dismiss the issue.
Consider a related question. Can a state count all of the relevant people in its population?
Again, the easy answer is no. Nevertheless, the state must try. The census is such an obviously valuable practice that it is practically coeval with the historical record. Anyone who argues that having more and better data doesn't foster better governance has a long and arduous path to tread.
Even advocates for limited government must admit that, in whatever capacity a government does act (however small or large), it ought to act on observed data (despite this line of argument ...).
So, the question must be altered a bit. Just how much data can the state collect? How much of that data is relevant to the proper exercise of its powers? Of the data collected, how much can be put to use in making policy?
Q1 -- How much data can the state collect?
The answer is a lot, and increasingly so; the limit depends on advances in technology rather than increases in staff. Currently, both data storage densities and transistor densities are growing exponentially. If we assume that data collection is more active than passive, then both of these rates of growth need to be taken into account, as either can act as a limiting factor in the collection and storage of data. At current levels of growth, we must consider exabyte levels of data storage in coming decades. To put this number in perspective, that’s at least a file the size of an average college library for every person in the United States. Collecting every word you say would produce a book only every few days, so it’s not inconceivable that, if we all were to wear mics, by 2050 the state should have no trouble at all collecting our every word.
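For readers who want to check the scale, here is a rough back-of-envelope sketch. Every constant in it is an assumption rather than a figure from this post: a population of 350 million, about 1 MB of plain text per book, a million-book library per person, and roughly 16,000 spoken words per day against an 80,000-word book.

```python
# Back-of-envelope storage arithmetic. Every constant below is an assumption.
EXABYTE = 10**18
TERABYTE = 10**12

US_POPULATION = 350_000_000   # assumption: rough US population
BYTES_PER_BOOK = 1_000_000    # assumption: about 1 MB of plain text per book
BOOKS_PER_PERSON = 1_000_000  # assumption: a million-book library per person


def library_bytes_per_person(books=BOOKS_PER_PERSON, bytes_per_book=BYTES_PER_BOOK):
    """Bytes needed to hold one person's library."""
    return books * bytes_per_book


def national_library_bytes(population=US_POPULATION):
    """Bytes needed to hold one such library for every person."""
    return population * library_bytes_per_person()


def days_to_speak_a_book(words_per_day=16_000, words_per_book=80_000):
    """Days of talking needed to fill one book (both defaults are assumptions)."""
    return words_per_book / words_per_day


if __name__ == "__main__":
    print("Per person:", library_bytes_per_person() / TERABYTE, "TB")
    print("Whole country:", national_library_bytes() / EXABYTE, "EB")
    print("Days of speech per book:", days_to_speak_a_book())
```

Under these assumptions each person's library is about a terabyte and the national total is in the hundreds of exabytes. Change the assumed book size and the totals move by orders of magnitude, so treat this as a sanity check and nothing more.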
Looking only at words, however, oversimplifies the problem of data collection and storage. Suppose we wanted to track your movements, taking a sample and a time stamp every second. We can call that about 2 billion samples of a few kilobytes each collected over the average lifespan, enough to fill a single large external hard drive for a personal computer -- a drop in the bucket. Still, these drops add up. What if we want a database indexing your position relative to every other person’s at each sample? That’s the famous handshake problem, only in 2 billion separate rooms each containing the whole population of the whole country. How many bytes for that? Maybe a yottabyte. We’d have to build millions of exabyte data centers just to store all the relative positions. Obviously storing that type of data instead of calculating it on demand would be stupid, but the point is that the data-creating power of a large population is unthinkably large, especially when we want to compare the data sets generated by individuals in a population. Our very ability to conceive new relationships between any previously collected data suggests that in any case, whatever data we have on hand, we could have infinitely more, some of which is likely useful. Which brings us to:
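The handshake arithmetic can be sketched in a few lines. The population (350 million), the 80-year lifespan, the 2 KB per position sample, and the single byte per stored pair are all assumptions chosen for illustration; the function names are hypothetical.

```python
# Handshake-problem arithmetic for position tracking. All inputs are assumptions.
TERABYTE = 10**12
YOTTABYTE = 10**24
SECONDS_PER_YEAR = 365 * 24 * 3600


def lifetime_samples(years=80):
    """One position sample per second over an assumed lifespan."""
    return years * SECONDS_PER_YEAR


def pair_count(n):
    """Handshake problem: number of distinct pairs among n people."""
    return n * (n - 1) // 2


def individual_track_bytes(bytes_per_sample=2_000, years=80):
    """Storage for one person's own track (assumed 2 KB per sample)."""
    return lifetime_samples(years) * bytes_per_sample


def pairwise_track_bytes(population=350_000_000, bytes_per_pair=1, years=80):
    """Storage for every pair's relative position at every sample."""
    return pair_count(population) * lifetime_samples(years) * bytes_per_pair


if __name__ == "__main__":
    print("Samples per lifetime:", lifetime_samples())
    print("One person's track:", individual_track_bytes() / TERABYTE, "TB")
    print("All pairs, all samples:", pairwise_track_bytes() / YOTTABYTE, "YB")
```

With these inputs a single person's track is a few terabytes, while the pairwise table lands at or beyond the yottabyte scale even at one byte per pair. The exact figure matters less than the gap between the two.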
Q2 -- How much of that data is relevant?
Clearly, this is a matter of opinion, not fact, but it is necessary to constrain the problem somehow in order for it to be meaningful. Let’s go back to our million-book library for every person. What books ought we to collect in that library? Every number ever dialed? Every word ever spoken or written, every purchase made, every flight reserved, every website visited…it would be a fantastic collection of reference books. Government aside, it would make online dating far less risky (and probably less appealing).
Barring an asteroid strike, nearby supernova, or other similarly devastating calamity, modern technology will continue to allow for more "relevant" data to be collected about you than you are likely comfortable with, which means we ought to move on to:
Q3 -- How much can the state really act on?
First we need to ask: how much of your personal reference library can government intelligence staff actually look at in your lifetime (and theirs)? A person dedicated to the task (and reasonably skilled) can read a long book in a single day. So a good government snoop ought to be able to read almost ten thousand of your million books before retirement. He won’t have time to act on this information, but he’ll remember a lot of it, because the human brain can store that much information without breaking a sweat (cerebrally speaking). That means the government will have to hire 100 well-trained Russian (or of some other equally myth-encrusted provenance) spies with good memories to fully digest the information collected on each American. That’s 35 billion spies, or approximately 4 times the current population of Earth. Unfortunately, 400% literacy rates are hard to come by, even in Russia, even in mythical KGB Russia.
So, the government will have to content itself with looking up just a few facts about you when it really needs them. If the government really only wants about ten books’ worth of information on you, then each spy can handle 1,000 citizens, which means employing only about 350,000 spies. This is a lot of spies, but it’s a feasible number. Some estimates suggest that the NSA already employs at least one third that number (although they’re not all spies).
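The staffing arithmetic above can be reproduced with a short sketch. The reading rate of one book per day comes from the text; the 250 working days per year, the 40-year career, and the 350 million population are assumptions picked to match the essay's round numbers, and the function names are hypothetical.

```python
# Staffing arithmetic for the "spies per citizen" estimate.
# Career length, working days, and population are assumptions.


def books_read_in_career(books_per_day=1, working_days_per_year=250, career_years=40):
    """Books one reader can get through before retirement."""
    return books_per_day * working_days_per_year * career_years


def spies_needed(population, books_per_citizen, books_per_spy):
    """Readers needed to cover every citizen's file, rounded up."""
    total_books = population * books_per_citizen
    return -(-total_books // books_per_spy)  # ceiling division on integers


if __name__ == "__main__":
    population = 350_000_000  # assumption: rough US population
    capacity = books_read_in_career()
    print("Books per career:", capacity)
    print("Full million-book files:", spies_needed(population, 1_000_000, capacity))
    print("Ten-book files:", spies_needed(population, 10, capacity))
```

Shrinking each citizen's file from a million books to ten cuts the headcount from 35 billion to 350,000, which is why the staffing term dominates the estimate.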
So what’s the point of this discussion?
The point is, right now, the data imply that the limiting factor in data mining is not the technology, but the HR department. There is no threshold of data collection at which the government can reasonably say, “We have achieved 100% predictive power; we needn’t collect any more.” Nor can the government ever hire enough staff to act on all the data it collects. Probably, the human race doesn’t even have access to enough resources to build a race of police robots to act on all the data. So the issue here is not really an issue of technology; it’s an old-fashioned issue of budget and staffing.
Fantastic search algorithms can drastically reduce the amount of staff required, but search algorithms are based on the fundamental assumption that you only need some of the information on file some of the time. Yes, they help you find Waldo right away; however, they also allow you to avoid looking at all the other beachgoers. In fact, the search algorithm argument may simply aid the NSA in ignoring you.
So, do we need to freak out about exabyte data centers in Utah? No. Should we keep an eye on how many spies our government employs? Yes. Will that be easy? No. They are spies, after all.