The information fire hydrant

“Come, let us build ourselves a city, and a tower whose top is in the heavens.”
Genesis 11:4, The Tower of Babel
“There's certainly a certain degree of uncertainty about, of that we can be quite sure.”
Rowan1 Atkinson, Sir Marcus Browning MP

As well as being a mathematician, Lewis Fry Richardson was a Quaker and a pacifist. He chose to be a conscientious objector during the First World War, and while this meant he could not work directly in academia, he nonetheless continued his studies at its fringes. As well as creating models which could actually predict weather patterns, he focused much of his attention on the mathematical principles behind conflict, on and off the battlefield. He summarised his findings in a single volume, entitled Statistics of Deadly Quarrels, published just as Europe and the world plunged into war once again. Perhaps it was this unfolding tragedy that pushed the pacifistic Richardson back to his studies: one area in particular intrigued him, namely the nature of border disputes, of which in Europe there were plenty. As he attempted to create models, however, he found it challenging to determine the length of a border — indeed, the closer he looked at an individual border, the longer it became. Think about it: zoomed out, the edges of a geographical feature look relatively simple, but as you zoom in you find ever more detail to trace around, and the measurement grows accordingly, until matters become quite absurd. “Sea coasts provide an apt illustration,” he wrote2 as he watched his models collapse in a heap. “An embarrassing doubt arose as to whether actual frontiers were so intricate as to invalidate an otherwise promising theory.”

The discomfiting nature of the phenomenon, which became known as the coastline paradox, was picked up by fractal pioneer Benoit Mandelbrot in 1967. In his paper3 ‘How Long Is the Coast of Britain?’ he wrote, “Geographical curves can be considered as superpositions of features of widely scattered characteristic size; as ever finer features are taken account of, the measured total length increases, and there is usually no clearcut gap between the realm of geography and details with which geography need not be concerned.” In other words, it wasn’t only the measured distance that was in question: the phenomenon cast doubt on what the geographical features themselves actually meant. Was a rocky outcrop part of the coastline or not? How about a large boulder? Or a grain of sand?
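Mandelbrot expressed the effect as a simple scaling relation: measured with a ruler of length G, a coastline's apparent length behaves roughly as L(G) = M · G^(1−D), where D is greater than 1 for a fractal curve. The sketch below is a minimal illustration of that relation; the exponent of roughly 1.25 is the figure usually quoted for the west coast of Britain, and the constant M is chosen purely for illustration.

```python
# Richardson/Mandelbrot scaling relation for measured coastline length:
#   L(G) = M * G**(1 - D)
# where G is the ruler length and D > 1 for a fractal coastline.
# D ~ 1.25 is the figure usually quoted for the west coast of Britain;
# M is an arbitrary constant chosen only for illustration.

def measured_length(ruler_km, D=1.25, M=8900.0):
    """Apparent coastline length (km) when measured with a ruler of ruler_km."""
    return M * ruler_km ** (1 - D)

for ruler_km in (200, 100, 50, 10, 1, 0.1):
    print(f"ruler {ruler_km:>6} km -> measured length ~ {measured_length(ruler_km):,.0f} km")

# The smaller the ruler, the longer the coast appears, with no upper limit.
```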

This same phenomenon is fundamental to our understanding of what we have come to call data, in all of its complexity. Data can be created by anything that can generate computer bits, which these days means even the most lowly of computer chips. Anything can be converted to a digital representation by capturing some key information, digitising it into data points and transporting it from one place to another in a generally accepted binary format. Whenever we write a message or make use of a sensor, we are adding to the mother of all analogue-to-digital converters. Digital cameras, voice recorders, computer keyboards, home sensors, sports watches and, well, you name it: all can and do generate data in increasing quantities. Not only are the devices proliferating in volume and type, but we then use computers to process, transform and analyse the data — which only generates more.
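To make ‘digitising’ a little more concrete, the sketch below samples a continuous signal at regular intervals and quantises each sample into a small integer, which is the essence of any analogue-to-digital converter. The signal, sample rate and bit depth are arbitrary choices for illustration, not taken from any particular device.

```python
import math

# A toy analogue-to-digital conversion: sample a 'continuous' signal at fixed
# intervals, then quantise each sample to a limited number of discrete levels.
# The signal, sample rate and bit depth are arbitrary illustrative choices.

def analogue_signal(t):
    """The analogue source: a 5 Hz sine wave swinging between -1.0 and 1.0."""
    return math.sin(2 * math.pi * 5 * t)

sample_rate = 100        # samples per second
bit_depth = 4            # 4 bits -> 16 quantisation levels
levels = 2 ** bit_depth

samples = []
for n in range(20):                                   # capture 0.2 seconds of signal
    value = analogue_signal(n / sample_rate)          # read the analogue value
    code = round((value + 1.0) / 2.0 * (levels - 1))  # map [-1, 1] onto 0..15
    samples.append(code)

print(samples)   # the signal, reduced to a stream of small integers: data points
```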

As a consequence, we are creating data far faster than we know what to do with it. Consider: at the turn of the millennium, 75% of all the information in the world was still in analogue format, stored as books, videotapes and images. According to a study conducted in 2007, however4, 94% of all information in the world was digital — the total amount of stored information was measured at 295 Exabytes (billions of Gigabytes). This enormous growth in information shows no sign of abating. By 2010 the figure had crossed the Zettabyte (thousand Exabyte) barrier, and by 2020, it is estimated, this figure will have increased fifty-fold.
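For a sense of what those units mean, and taking the study's figures at face value, the arithmetic looks like this (a rough sketch, nothing more):

```python
# Back-of-the-envelope unit arithmetic, using the figures quoted in the text.
GIGABYTE = 10 ** 9
EXABYTE = 10 ** 18       # a billion Gigabytes
ZETTABYTE = 10 ** 21     # a thousand Exabytes

stored_2007 = 295 * EXABYTE
print(stored_2007 / GIGABYTE)       # 2.95e+11: some 295 billion Gigabytes in 2007
print(stored_2007 / ZETTABYTE)      # 0.295: just under a third of a Zettabyte
print(50 * ZETTABYTE / EXABYTE)     # 50000.0: a fifty-fold jump on one Zettabyte, in Exabytes
```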

As so often, the simplest concepts have the broadest impact: no restriction has been placed on what data can be about, within the bounds of philosophical reason. The information pile grows as we can (and do) broadcast our every movement, every purchase and every interaction with our mobile devices and on social networks. Every search, every ‘like’, every journey, photo and video is logged, stored and rendered immediately accessible using computational techniques that would have been infeasible just a few years ago. Today, YouTube users upload an hour of video every second, and watch over 3 billion hours of video a month; over 140 million tweets are sent every day, on average – or a billion per week.

It’s not just us — retailers and other industries are generating staggering amounts of data as well. Supermarket giant Wal-Mart handles over a million customer transactions every hour. Banks are little more than transaction processors, with each chip card payment we make leaving a trail of zeroes and ones, all of which are processed. Internet service providers and, indeed, governments are capturing every packet we send and receive, copying it for posterity and, rightly or wrongly, future analysis. Companies of all shapes and sizes are accumulating unprecedented quantities of information about their customers, products and markets. And science is one of the worst culprits: the ALICE experiment at CERN’s Large Hadron Collider generates data at a rate of 1.2 Gigabytes per second. Per second!

Our ability to create data is increasing in direct relation to our ability to build ever more sensitive digitisation mechanisms. The first commercially available digital cameras, for example, could capture images of up to a million pixels, whereas today it is not uncommon to have 20 or even 40 ‘megapixels’ as standard. In a parallel to Richardson’s coastline paradox, it seems that the better we get at collecting data, the more data we get. Marketers have the notion of a ‘customer profile’, for example: at a high level, this could consist of your name and address, your age, perhaps whether you are married, and so on. But more detail can be added, in principle improving the understanding of who you are. Trouble is, nobody knows where to stop — is your blood type relevant, or whether you have siblings? Such questions are a challenge not only to companies who would love to know more about you, but also (as we shall see) because of the privacy concerns they raise.
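In data terms, a customer profile is just an open-ended record. The sketch below, with entirely invented fields, shows how easily attributes can keep being bolted on, with no obvious place to stop:

```python
# A hypothetical customer profile: an open-ended record to which ever more
# attributes can be added, with no natural stopping point. All fields invented.
profile = {
    "name": "Jane Example",
    "address": "1 Sample Street",
    "age": 42,
    "married": True,
}

# More detail could, in principle, improve the picture... but where does it end?
profile.update({
    "blood_type": "O+",            # relevant? to whom?
    "siblings": 2,
    "favourite_brand_of_tea": "unknown",
})

print(len(profile), "attributes and counting")
```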

Industry pundits have, in characteristic style, labelled the challenges caused by creating so much data as ‘Big Data’ (as in, “We have a data problem. And it’s big.”). It’s not just data volumes that are the problem, they say, but also the rate at which new data is created (the ‘velocity’) and the speed at which data changes (or ‘variance’). Data is also sensitive to quality issues (‘validity’) — indeed, it’s a running joke that customer data used by utilities companies is so poor, the organisations are self-regulating — and it has a sell-by date, that is, a point beyond which it is no longer useful except historically. When we create information from data, we often experience a best-before time limit, beyond which it no longer makes sense to be informed. This is as true for the screen taps that make up a WhatsApp message as it is for a complex medical diagnosis.
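The ‘best-before’ idea is straightforward to make concrete: attach a timestamp and a time-to-live to each data point, and treat anything older as historical rather than current. A minimal sketch, with an arbitrary one-hour time-to-live chosen purely for illustration:

```python
from datetime import datetime, timedelta, timezone

# A minimal 'best-before' check: a data point carries a timestamp and a
# time-to-live, after which it is only of historical interest.
# The one-hour TTL is an arbitrary illustrative choice.

def is_still_useful(created_at, ttl=timedelta(hours=1)):
    return datetime.now(timezone.utc) - created_at < ttl

message_time = datetime.now(timezone.utc) - timedelta(minutes=90)
print(is_still_useful(message_time))   # False: past its best-before time
```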

All of these criteria make it incredibly difficult to keep up with the data we are generating. Indeed, our ability to process data will, mathematically, always lag behind our ability to create it. And it’s not just the raw data we need to worry about. Computer systems don’t help themselves, as they have a habit of creating duplicates, or whole new versions, of data sets. Efforts have been made to reduce this duplication, but it often exists for architectural reasons — you need to take a snapshot of live data so that you can analyse it. It’s a good job we have enough space to store it all, or do we? To dip back into history, data storage devices have, until recently, remained one of the most archaic parts of the computer architecture, reliant as they have been upon spinning disks of magnetic material. IBM shipped the first disk drives in 1956 — these RAMAC drives could store a then-impressive four million bytes of information across their fifty disk platters, but had to be used in clean environments so that dust didn’t mess up their function. It wasn’t until 1973 that IBM released5 a drive, codenamed Winchester, that incorporated read/write heads in a sealed, removable enclosure.
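The snapshot pattern mentioned above is simple in outline: copy the live data at a point in time and run the analysis against the copy, so that ongoing updates neither slow it down nor skew the result. A toy illustration, with invented data:

```python
import copy

# The snapshot pattern in miniature: analyse a point-in-time copy of live data
# so that ongoing updates neither slow down nor skew the analysis.
live_orders = [{"id": 1, "total": 25.0}, {"id": 2, "total": 40.0}]

snapshot = copy.deepcopy(live_orders)          # duplicate the data set...
live_orders.append({"id": 3, "total": 99.0})   # ...the live data keeps changing...

revenue_at_snapshot = sum(order["total"] for order in snapshot)
print(revenue_at_snapshot)                     # ...but the analysis sees a stable view: 65.0
```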

Despite their smaller size, modern hard disks have not changed a great deal since this original, sealed design was first proposed. Hard drive capacity increased by 50 million times between 1956 and 2013, but even this is significantly behind the curve when compared to processor speeds, leading pundits such as analyst firm IDC to go to the surprising length of suggesting that the world would “run out of storage” (funnily enough, it hasn’t). In principle, the gap could close with the advent of solid state storage — the same stuff that is a familiar element of the SD cards we use in digital cameras and USB sticks. Solid State Drives (SSDs) are currently more expensive, byte for byte, than spinning disks but (thanks to Moore’s Law) the gap is closing. What has taken solid state storage so long? It’s all to do with transistor counts. Processing a bit of information requires a single transistor, whereas storing the same bit of information for any length of time requires six transistors. But as SSDs become more widely available, their prices also fall, meaning that some kind of parity with processors starts to appear. SSDs may eventually replace spinning disks, but even if they do, the challenge of coping with the data we create will persist. This issue is writ large in the Internet of Things — which, as we have seen, is shorthand for the propensity of Moore’s Law to spawn smaller, cheaper, lower-power devices that can generate even more data. Should we add sensors to our garage doors and vacuum cleaners, hospital beds and vehicles, we will inevitably increase the amount of information we create. Networking company Cisco estimates6 that the ‘Internet of Everything’ will cause a fourfold increase in data volumes in the five years from 2013, to reach over 400 Zettabytes (a Zettabyte being 10^21 bytes).
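Taking the figures quoted above at face value (one transistor to process a bit, six to store one, and a Zettabyte as 10^21 bytes), the arithmetic stacks up quickly; the sketch below simply spells out those numbers:

```python
# Rough arithmetic using the figures quoted in the text: one transistor to
# process a bit, six transistors to store a bit in solid state memory.
TRANSISTORS_PER_STORED_BIT = 6
gigabyte_in_bits = 8 * 10 ** 9

print(TRANSISTORS_PER_STORED_BIT * gigabyte_in_bits)   # 48,000,000,000 transistors per Gigabyte stored

# And the Cisco estimate, for scale: 400 Zettabytes, a Zettabyte being 10**21 bytes.
print(400 * 10 ** 21)                                  # 4e+23 bytes
```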

In technology’s defence, data management has long since moved beyond simply storing data on disk, loading it into memory and accessing it via programs. It was back in the late 1950’s that computer companies started to realise a knock-on effect of all their innovation — the notion of obsolescence. The IBM 407 series of accounting machines7, introduced ten years before, could do little more than read punched cards and tabulate reports on the data they contained; while the 407’s successor, the 1401, was a much more powerful computer (and based entirely on new-fangled transistors), technicians needed some way of getting the data from familiar stacks of cards into the core storage of the 1401 for processing. The answer was FARGO8 — the Fourteen-o-one Automatic Report Generation Operation program, which essentially turned the 407 into a data entry device for the 1401.

The notion of creating data stores and using them to generate reports became a mainstay of commercial computing. As the processing capabilities of computers became more powerful, the reports could in turn become more complicated. IBM’s own language for writing reports was RPG, the Report Program Generator. While it was originally launched in 1961, RPG is still in use today, making it one of the most resilient programming languages of the information age. IBM wasn’t the only game in town: while it took the lion’s share of the hardware market, it wasn’t long before a variety of technology companies, commercial businesses (notably American Airlines with its SABRE booking system) and smaller computer services companies started to write programs of their own. Notable were the efforts of Charles Bachman, who developed what he termed the Integrated Data Store while working at General Electric in 1963. IDS was the primary input to the Conference/Committee on Data Systems Languages’ efforts to standardise how data stores should be accessed; by 1968 the term database had been adopted.

And then, in 1969, dogged by the US government, IBM chose to break the link between hardware and software sales, opening the door to competition9 from a still-nascent software industry. All the same, it was another IBM luminary, this time the Englishman Edgar Codd, who proposed a different model for databases, based on tables and the relationships between data items. By the 1980’s this relational database model, and the Structured Query Language (SQL) used to access it, had become the mechanism du choix, and it remained so for several decades, for all but mainframe software, where (despite a number of competitors appearing over the years) IBM models still dominated.
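Codd's idea, data held in tables that are linked by shared keys and queried declaratively, is easy to show in miniature. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in for a relational database; the tables and rows are invented for illustration.

```python
import sqlite3

# A miniature relational database, using Python's built-in sqlite3 module as a
# stand-in. Two tables, related by a shared customer id, queried with SQL.
# Table names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Edgar')")
conn.execute("INSERT INTO orders VALUES (1, 1, 12.50), (2, 1, 7.25), (3, 2, 99.00)")

# A declarative query: total spend per customer, via the relationship between the tables.
rows = conn.execute("""
    SELECT customers.name, SUM(orders.total)
    FROM customers JOIN orders ON orders.customer_id = customers.id
    GROUP BY customers.name
""").fetchall()
print(rows)   # e.g. [('Ada', 19.75), ('Edgar', 99.0)]
conn.close()
```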

Of course it’s more complicated than that — database types proliferated across computers of every shape and size. But even as data management technologies evolved, technology’s propensity to generate ever more data refused to abate. As volumes of data started to get out of hand once again in the 1990’s, attention turned to the idea of data warehouses — data stores that could take a snapshot of data and hold it somewhere else, so that it could be interrogated, the data analysed and the results used to generate ever more complex reports. For a while it looked like the analytical challenge had been addressed. But then, with the arrival of the Web, quickly followed by e-commerce, social networks, online video and the rest, new mechanisms were required yet again as even SQL-based databases proved inadequate to keep up with the explosion of data that resulted. Not least, the issue of how to search the ever-increasing volumes of web pages was becoming ever more pressing. In response, in 200310 Yahoo! colleagues Doug Cutting and Mike Cafarella developed a tool called Nutch, based around an indexing mechanism from Google, called MapReduce, itself “a framework for processing embarrassingly parallel problems across huge datasets.” The pair quickly realised that the mechanism could be used to analyse the kinds of data more traditionally associated with relational databases, and created a specific tool for the job. Doug named it Hadoop11, after his son’s toy elephant.
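MapReduce's appeal lies in how little the programmer has to supply: a map function that turns each input record into key/value pairs, and a reduce function that combines the values collected for each key, with the framework handling distribution across machines. Below is a single-machine sketch of the classic word-count example, showing only the shape of the model, not Hadoop itself:

```python
from collections import defaultdict

# The shape of MapReduce on a single machine: a map step that emits key/value
# pairs, a shuffle that groups values by key, and a reduce step that combines
# them. In a real framework these phases run in parallel across many machines.

def map_phase(document):
    for word in document.split():
        yield word.lower(), 1                 # emit (word, 1) for every word seen

def reduce_phase(word, counts):
    return word, sum(counts)                  # combine all the counts for one word

documents = ["the quick brown fox", "the lazy dog", "the fox again"]

grouped = defaultdict(list)                   # shuffle: group emitted values by key
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

results = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'again': 1}
```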

Hadoop marked a complete breakthrough in how large volumes of data could be stored and queried. In 2009 the software managed12 to sort and index a petabyte of data in 16 hours, and 2015 was to be the year of ‘Hadooponomics13’ (allegedly14). The project inspired many others to create non-relational data management platforms. MongoDB, Redis, Apache Spark and Amazon Redshift are all clever and innovative variations on a general trend, which is to create vast data stores that can be interrogated and analysed at incredible speed.

Even with such breakthroughs, our ability to store and manage data remains behind the curve of our capability to create it. Indeed, the original strategists behind the ill-fated Tower of Babel might not have felt completely out of place in present-day, large-scale attempts to deal with information. And so it will continue — it makes logical sense that we will carry on generating as much information as we can, and then we will insist on storing it. Medicine, business, advertising, farming, manufacturing… all of these domains and more are accumulating increasingly large quantities of data. But even if we can’t deal with it all, we can do increasingly clever things with the data we have. Each day, the Law of Diminishing Thresholds ensures that a new set of problems, both very old and very new, moves from insoluble to solvable.

Doing so requires not just data processing, storage, management and reporting, but programs that push the processing capabilities of computers to their absolute limits. Enter: the algorithm.