This evening I attended a seminar hosted by the University of Toronto entitled “Material Evidence: is a picture really worth a thousand words?” The panel comprised Prof. Patricia McCarney, Director of the Global City Indicators Facility (GCIF); media artist Ben Rubin; and Charles M. Blow, visual op-ed columnist for the New York Times. The broad topic under discussion was data visualisation: how best to make sense of vast, complex data and convey it in a form that can be readily understood and consumed. In essence, the transformation of mere data into real information.
Prof. McCarney presented some of the challenges the GCIF faced when they first embarked on consolidating metrics across multiple cities. Despite the comparative wealth of national statistics, data at the city level has historically been very weak. Given that 70% of the world’s GDP comes from cities, and that the number of people living in cities is set to rise dramatically over the next decade, understanding the key indicators within cities – and how they compare across cities – is ever more important. In their first pilot, the GCIF discovered that among just nine cities, more than a thousand indicators were being captured – but that only two of those indicators were reliably comparable. This is a problem that crops up again and again in the world of statistical analysis. Even within a single country, the measure used for something like assessing whether the crime rate is rising or falling can change over time, making comparisons with past datasets all but meaningless.

Since that early pilot, the GCIF have made commendable progress, standardising key indicators across more than 250 cities worldwide. This now allows them to compare cities on a whole range of metrics and identify those that are surging ahead – and those that are lagging behind. But as Prof. McCarney admitted during the Q&A, questions like “which is the safest city in the world?” and “which is the best city to live in?” are actually quite mundane compared with the sorts of insight that can be garnered from the data the GCIF are collecting. By comparing datasets with apparently little connection, they have found that a change in temperature of just a couple of degrees can affect the crime rate on a single block within a given city. The data can also be used to create clever demographic representations showing how less well-known collections of municipalities, like the ‘Golden Horseshoe’ around Toronto, are directly comparable to more famous ones like the San Francisco Bay Area in terms of educated population, infrastructure and distribution of housing. This is precisely the kind of information that can make all the difference to a trade mission overseas.

Of course, despite the progress the GCIF have made, there is no escaping the “it depends” answer to certain questions. Is Tokyo the largest city in the world? Well, it depends on what definition you use for ‘Tokyo’. Intriguingly, if you take Japan’s own definition of what constitutes the city limits, Tokyo is by no means the largest. As with so much in the world of data, asking the right question and being clear on your definitions can make all the difference to the answer.
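To make that cross-dataset idea concrete, here is a minimal sketch of the kind of join-and-correlate step that might surface a temperature/crime relationship. The file and column names are entirely my own assumptions for illustration – this is not the GCIF’s actual data or pipeline:

```python
import pandas as pd

# Hypothetical inputs: daily mean temperatures, and daily crime counts per city block
temps = pd.read_csv("daily_temperature.csv")   # columns: date, temp_c
crimes = pd.read_csv("block_crimes.csv")       # columns: date, block_id, incidents

# Join the two apparently unrelated datasets on their shared date column,
# then measure, block by block, how closely crime tracks temperature
merged = crimes.merge(temps, on="date")
correlation = (
    merged.groupby("block_id")
          .apply(lambda g: g["incidents"].corr(g["temp_c"]))
          .sort_values(ascending=False)
)

print(correlation.head())  # the blocks where crime most closely follows temperature
```

Correlation here is only a starting point, of course: it says nothing about causation, only where to look next.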
Ben Rubin’s take on the subject of ‘data visualisation’ was a real exercise in lateral thinking. The traditional epitome of data visualisation is the infographic: take some pertinent data points and turn them into a picture that conveys the data as information in an easily digestible format. Rubin, as the artist, decided to turn the idea on its head: what data can we discover by examining the visual nature of things? His first example was a strangely beautiful work in which he took satellite photos of roughly a hundred airports from around the world, abstracted them into simple line diagrams (where each line depicted either a runway or a major taxiway) and then compared the resulting pictograms with one another, sorting the airports by how similar they looked in this form. Is there much to be learnt from this in terms of practical value? Almost certainly not, and Rubin was the first to admit it; but the art of finding patterns in otherwise unrelated objects is nevertheless quite beguiling.

Another installation piece Rubin put together involved taking the diplomatic cables published by Wikileaks and pulling out only the six-letter words. He then flashed these on a screen using fluorescent tubes, in the order they appeared within the cables. Instead of the random jumble of words one might imagine, the results were quite revealing: “Brutal … Demand … Solved … ” Rubin had attempted to create data out of something purely visual – and in an unexpected way, it seemed to have worked.
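Part of the piece’s charm is how simple the extraction step is. A minimal Python sketch of the idea (the filename is a hypothetical stand-in for the source texts, not Rubin’s actual process):

```python
import re

def six_letter_words(text):
    """Return every exactly-six-letter word, in order of appearance."""
    # \b word boundaries ensure we match whole words of exactly six letters
    return re.findall(r"\b[A-Za-z]{6}\b", text)

# 'cables.txt' is a placeholder for whatever source text you feed in
with open("cables.txt", encoding="utf-8") as f:
    for word in six_letter_words(f.read()):
        print(word.upper())   # e.g. BRUTAL ... DEMAND ... SOLVED ...
```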
Charles Blow gave a tour de force performance at the lectern, detailing his time as graphics editor for both the New York Times and National Geographic, before returning to the NYT in his current role as visual op-ed columnist. The mantra of “become an expert by the afternoon” underpinned much of his early work at the Times: he had to find out all there was to know on a given subject in a short space of time in order to put together a diagram, graphic, chart or map that conveyed a huge amount of information in a simple, consumable way.

He revealed all kinds of interesting considerations that go into a graphic for a publication like the NYT, including the fact that human beings are great at judging the distance between two objects but quite hopeless at judging volume. This is why line and bar charts are so popular – and why we increasingly experience ‘form fatigue’. Sometimes, however, the data cannot readily be conveyed as the distance between two points – for example, the number of SARS cases across the world, which ranges from one to thousands. In such cases, volume is a better means of conveying the information – but to make up for the human brain’s shortfall in dealing with volume visually, the actual number, along with some form of colour coding, is often displayed as well. On the subject of colour, the New York Times only ever uses colours gathered from real-world objects in the city of New York. This is why they have many shades of yellow, brass and limestone – but no red (only a dark maroon, the colour of New York bricks). Another lesson learnt: if you want a chart conveying something relatively dull to stand more chance of being studied, ensure the axes don’t quite meet. This causes a subconscious frustration in the reader, because the brain is wired to want two converging lines to intersect!

While his time at National Geographic was much more about getting the details precisely correct (at a much slower pace), on returning to the NYT as visual op-ed columnist Blow has come to value the single data point in a sea of information that can tell a story all by itself. An entire column can be written on the back of a chart depicting infant mortality rates over time across OECD countries (the USA fares very poorly). For Blow, even if you can disagree with a subjective interpretation of the data, what you cannot dismiss is the data itself, provided it comes from a reliable source.
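Going back to the “axes that don’t quite meet” trick: it is easy to try at home. Here is a small matplotlib sketch of the effect – my own illustration, not anything the NYT actually uses:

```python
import matplotlib.pyplot as plt
import numpy as np

# Toy data: the point is the axis treatment, not the numbers
rng = np.random.default_rng(42)
x = np.arange(10)
y = rng.integers(10, 50, size=10)

fig, ax = plt.subplots()
ax.bar(x, y, color="#9b8b5a")  # a limestone-ish shade, in the NYT spirit

# Push the left and bottom spines outward so the two axes never touch,
# leaving the small gap that (per Blow) subtly draws the eye in
ax.spines["left"].set_position(("outward", 10))
ax.spines["bottom"].set_position(("outward", 10))
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

plt.show()
```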
The discussion that followed the presentations touched on two broad points: the ethics of big data, and what the best way of finding an answer to a question really is. To illustrate the ethical dilemma, Prof. McCarney cited the example of IBM’s work in Istanbul, where they obtained the mobile phone records of all Vodafone customers in that city in order to map out the journeys made by individuals across a 24-hour period. On the plus side, this information could be used to make drastic improvements to infrastructure, traffic light phasing, roadworks planning and more. But is it right to take that data – in spite of the benefits it could offer – without consent? Would people feel comfortable knowing that companies like IBM can now track their every move? Rubin made the point that, right now, we willingly give up our personal information in exchange for things like cool apps and fun websites, largely because we know what we’re getting in return but don’t really know – or perhaps even care – what we’re giving up. Blow believes we’ve already crossed the threshold of being able to do anything about it: the genie is out of the bottle as far as data privacy is concerned, and we can’t go back. He expressed concern that in the future, only young people too scared to use Instagram will be able to run for President – because only they will have avoided having their personal information (and embarrassing photos) shared with the world.
An interesting divide among the panellists emerged over the second issue. Blow was a strong advocate of the Siri approach: where can I get ribs tonight? Answer: this specific restaurant, at this address. Single question, single answer. Congress has just passed a new tax bill; what does that mean for me in particular? Big data can answer that question for everyone, but each person generally only wants to know the answer for themselves. Rubin, meanwhile, championed the Google view: for a single question, I can get a thousand possible answers, and then I’ll choose the one I think is best for me. Having access to a wealth of data empowers the consumer of that information to make an informed decision. Rubin also identified the risk of letting apps like Siri make up our minds for us: how do we know the single answer we’re given isn’t just the one with the most money behind it? In a way, this issue has its parallel in the world of software: do we take the easy route and buy the heavily marketed, pre-packaged option that will do most of what we want but limits our ability to customise it; or do we go with the open source version that may take more work to get what we want, but which ensures we remain in control?
My own view is that in the rush to take advantage of the benefits big data clearly affords, we should keep two simple principles in mind. First, more data does not mean more information. Without the right tools or resources, increasing your data volumes will only increase the difficulty of finding the answers you need. Second, more diverse data is often better than simply more data. Data analysis is at its best when it compares datasets that intuitively have little in common but which in practice – like Prof. McCarney’s temperature vs. crime rate data – deliver surprising results. So, as per Blow, we don’t want to have to swim through an ocean of data to find a single answer, particularly if that answer could be found much closer to the shore. But equally, Rubin is right: if we want the best possible answer, we first need access to all the possibilities.