What do we mean by “data”?
Some technical terms are so ubiquitous and (apparently) unambiguous that they almost become a transparent fluid: always used but never much reflected upon. Interestingly enough, the word “data”1 is such a term. It denotes an abstract, weightless and unidentified mass of numbers (mostly digitally encoded), with a potent influence on our lives. It is also considered a rich source of insight that is worth tapping. But what are the origins of the word “data” – and what are its implications?
Now, most people have heard that data is the plural form of datum, the past participle of the Latin word dare, “to give”. Data, then, is what is given. So far so good. Of course, Latin was a subtle language, and the semantic field covered by the word “dare” is rendered today by a variety of words. But most of it comes down to the broader meaning of “giving, handing out, devoting something”.
Medieval manuscript copy of Euclid’s “Data.” The book was originally written in Greek in the 3rd century BC. Its title “Dedomena” literally translates to the Latin “Data.” This page shows the first proposition. Source: Bibliothèque nationale de France.
One of the earliest scientific uses of the word apparently occurred in a lesser-known work by the Greek mathematician Euclid. In his book “Data”, he collected a number of geometrical propositions in which he shows that if one geometrical object is given, another one is also given, i.e. can be determined. In his commentary with the great subtitle “The Importance of Being Given”, the Danish mathematician Christian Marinus Taisbak wrote:
In the Data Euclid proves deductively that if some items are given, some other items are also given… When I started to translate the Data, I found it very longwinded that a certain phrase kept popping up time and again, several times in every proposition: if this item is given, that item is also given. I decided to cancel all those alsos… But then I discovered that I was leaving out an essential feature of the “Data”: the Givens hang together in chains, the purpose of any proposition being to produce more links to them.2
In the wider context of early Greek mathematics and philosophy, Euclid’s collection of “Data” deals with the question of what can be known and is a tool for mathematical problem-solving. Given is what is proposed, and what is therefore beyond argument.
If we follow the term a little further towards the present, we can observe two entirely separate strands evolving. First of all, we can see the origin of the English word “date” (as in calendar) or the German “Datum” forming from a sort of time stamp formula which was regularly included in official medieval documents. This formula always began with the word “Datum” (in the sense of: “this document was given, created, delivered”), followed by the date (and sometimes the place) at which it was issued. From this habit, the word “date” has evolved as the designation of a specific moment in time.
Latin time “stamps” on medieval documents began with the word “Datum”, abbreviated “Dat.”, followed by date and place. Source: Lichtbildarchiv Älterer Urkunden Marburg
This calendar-related use of “datum”, however, is now separated from the semantic world of digital or scientific “data”. This latter evolved from a different strand – one which refers back to Euclid: the word re-appeared in the English language in scientific texts in the mid-17th century, first in mathematics and theology. Here again, data was something that is given (either as part of the mathematical proposition or by the word of God) and therefore didn’t need to be discussed. Historian Daniel Rosenberg observed how the term shifted meaning as it was used in scientific contexts over the 18th century:
At the beginning of the century, “data” was especially used to refer either to principles accepted as the basis of argument or to facts gleaned from scripture that were unavailable to questioning. By the end of the century, the term was most commonly used to refer to facts in evidence determined by experiment, experience or collection.3
It is in this sense, as the results of empirical research and observation, that “data” found their way into the emerging sciences of statistics and national economy through the 19th century. By 1900, the term “data” as the result of statistical observation and the common ground from which to draw conclusions was well established. This went along with the notion that statisticians and scientists should observe clear rules when producing this data, to make sure it provides a reliable source of information for further study. For instance, the Statistical Society of London wrote as early as 1838 – when discussing whether statistics should confine itself to the collection of data, or whether it should progress to drawing scientific conclusions from the collected material:
It is not, however, true that the statist rejects all deductions, or that statistics consist merely of columns of figures; it is simply required that all conclusions shall be drawn from well-attested data, and shall admit of mathematical demonstration.4
In 1939, Willard Cope Brinton published an extremely detailed and rich overview of graphical methods for presenting information under the title “Graphic Presentation”. Throughout the book, the word “data” is used much the same way as we know it today (save for its digital encoding) – as a set of structured measurements derived from scientific observation or statistical methods. Source: Archive.org
With this firmly established notion of “data” as the result of scientific experiment and observation, we enter the 20th century – and soon the computer revolution. We should note that much of the evolution of digital technology happened in the English-speaking world, which is where the term “data” had been established as outlined above. “Data” began to be used for digitally encoded information, which can be stored and processed by computers. And as such, it begins to exist the moment it is recorded by the machine.
And this is a crucial twist. All through the evolution of statistics through the 19th century, data was generated by humans, and the scientific methodology of measuring and recording data had been a constant topic of debate. This is not trivial, as the question of how data is generated also answers the question of whether and how it is capable of delivering a “true” (or at least “approximated”) representation of reality. The notion that data begins to exist when it is recorded by the machine completely obscures the role that human decisions play in its creation. Who decided which data to record, who programmed the cookie, who built the sensor? And more broadly – what is the specific relationship of any digital data set to reality?
The election of Donald Trump has brought about a deep debate around election polls’ methodologies. This in turn has fuelled the ongoing research within the data visualisation community about how best to communicate differing statistical results, gaps in data sets, or uncertain predictions. Here, the New York Times discussed strongly diverging polling results during the 2016 race.
Daniel Rosenberg has coined a wonderful phrase to point out this semantic crack in the word “data”: “When a fact is proven false, it ceases to be a fact. False data is data nonetheless.”5 Luckily, we are witnessing how the discourse on this ominous thing called “data” is constantly evolving with new research. The role of humans as the generators of data is increasingly recognized and scrutinized. And that is a good thing. After all, data is man-made. And, I should hope, it always will be.
Sandra Rendgen is the author of two bestselling books on data visualisation and infographics, both published by Taschen. Currently she is preparing a book on the work of Charles-Joseph Minard, one of the most important forefathers of modern information visualisation. It will be published by Princeton Architectural Press on November 6, 2018.
1The English-speaking world is still divided as to whether “data” should be treated as a singular or plural form – which continues to irritate German native speakers. Without wanting to comment on the issue, I opt here for the singular form.
2Christian Marinus Taisbak, Dedomena: Euclid’s Data, Or, The Importance of Being Given, Copenhagen 2003, p. 13f.
3Daniel Rosenberg, “Data before the Fact”, in: Gitelman, Lisa (ed.): “Raw Data” is an Oxymoron. Cambridge/Mass.: MIT Press, 2013, p. 33.
4Journal of the Statistical Society, London, May 1838 (Introduction). Quoted from Block, Maurice: Traité théorique et pratique de Statistique. Paris: 1886, p. 97.
5Daniel Rosenberg, “Data before the Fact”, in: Gitelman, Lisa (ed.): “Raw Data” is an Oxymoron. Cambridge/Mass.: MIT Press, 2013, p. 18.