Thursday, November 20, 2014

big data and their epistemological challenge


springer |  It is estimated that humanity accumulated 180 EB of data between the invention of writing and 2006. Between 2006 and 2011, the total grew ten times and reached 1,600 EB. This figure is now expected to grow fourfold approximately every 3 years. Every day, enough new data are being generated to fill all US libraries eight times over. As a result, there is much talk about “big data”. This special issue on “Evolution, Genetic Engineering and Human Enhancement”, for example, would have been inconceivable in an age of “small data”, simply because genetics is one of the data-greediest sciences around. This is why, in the USA, the National Institutes of Health (NIH) and the National Science Foundation (NSF) have identified big data as a programme focus. One of the main NSF–NIH interagency initiatives addresses the need for core techniques and technologies for advancing big data science and engineering (see NSF-12-499).
Despite the importance of the phenomenon, it is unclear what exactly the term “big data” means and hence refers to. The aforementioned document specifies that: “The phrase ‘big data’ in this solicitation refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.” You do not need to be an analytic philosopher to find this both obscure and vague. Wikipedia, for once, is also unhelpful. Not because the relevant entry is unreliable, but because it reports the common definition, which is unsatisfactory: “data sets so large and complex that they become awkward to work with using on-hand database management tools”. Apart from the circular problem of defining “big” with “large”, the definition suggests that data are too big or large only in relation to our current computational power. This is misleading. Of course, “big”, as many other terms, is a relational predicate: a pair of shoes is too big for you, but fine for me. It is also trivial to acknowledge that we tend to evaluate things non-relationally, in this case as absolutely big, whenever the frame of reference is obvious enough to be left implicit. A horse is a big animal, no matter what whales may think. Yet, these two simple points may give the impression that there is no real trouble with “big data” being a loosely defined term referring to the fact that our current computers cannot handle so many gazillions of data efficiently. And this is where two confusions seem to creep in. First, that the epistemological problem with big data is that there is too much of them (the ethical problem concerns how we use them; see below). And second, that the technological solution to the epistemological problem is more and better techniques and technologies, which will “shrink” big data back to a manageable size. The epistemological problem is different, and it requires an equally epistemological solution.