What on earth is Big Data, you ask? To put it simply: Big Data is data too large to be handled by conventional data-processing tools: web login records, the pages of a major website, or millions of Facebook statuses. Analyzing all of this data the old way would take far too long; the sheer volume and velocity of data flowing into a normal system make it nearly impossible to extract anything valuable. If we put Big Data into a system built to handle it, whatever that volume is, we can see patterns and changes over time. Like distant reading, then, Big Data systems help us analyze large bodies of material, only on a far larger scale.
The main difficulty with Big Data is how to collect it and present usable information to your audience. If the dataset is enormous and growing every minute, how does one compile it into manageable visualizations so that the audience can see what you are actually doing?
Consider the Syllabus Finder, a Big Data tool featured in Daniel Cohen's article "From Babel to Knowledge: Data Mining Large Digital Collections," which has so far cataloged over 600,000 syllabi. The Syllabus Finder looks for certain keywords that tend to appear in syllabi, such as 'week,' 'readings,' 'winter,' 'fall,' and of course 'syllabus.' If we simply searched Google, we would not find nearly as many syllabi. Even so, the Syllabus Finder is not a perfect application. Answering history's central questions, the when and where and why, is, Cohen argues, far harder to compute with Big Data: we are not simply classifying documents but asking computers to process the natural language in which we frame our questions. For instance, to answer the question "Why did Charles Lindbergh fly to Paris?" Cohen uses H-Bot, a Big Data tool that relies on an "open source natural language processing package and associated dictionaries" to pull out the key terms "Charles Lindbergh" and "Paris." [Cohen]
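To make the idea concrete, here is a minimal sketch, in Python, of the kind of keyword scoring the Syllabus Finder performs. The keyword list comes from Cohen's description; the scoring logic and threshold are my own illustration, not his actual code.

```python
import re

# Keywords Cohen says tend to appear in syllabi.
SYLLABUS_KEYWORDS = {"week", "readings", "winter", "fall", "syllabus"}

def syllabus_score(text: str) -> int:
    """Count how many characteristic syllabus keywords appear in a document."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(SYLLABUS_KEYWORDS & words)

def looks_like_syllabus(text: str, threshold: int = 3) -> bool:
    """Flag a document as a likely syllabus if enough keywords co-occur.
    The threshold is invented here purely for illustration."""
    return syllabus_score(text) >= threshold

page = "HIST 301, Fall semester. Week 1 readings: ... Full syllabus below."
print(looks_like_syllabus(page))  # True: 'fall', 'week', 'readings', 'syllabus'
```

A plain Google search matches pages one query at a time; a classifier like this can sweep an entire collection and pool everything that scores like a syllabus, which is why the Syllabus Finder turns up documents Google alone would not.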
H-Bot can also be used to find 'pure' articles rather than simple encyclopedia entries, making it an interesting tool for historians who want to locate articles they could not otherwise find on Google. The main lesson Cohen would like us to take away from these two tools, the Syllabus Finder and H-Bot, is that APIs (server-to-server software connections that need no human involvement) should be more readily available from digital collections, so that historians can 'stitch together' these mega-search tools and answer our questions faster and more easily. [Cohen]
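Here is a sketch of the kind of "stitching together" Cohen has in mind: sending one query to several collections' APIs and pooling the answers. The endpoint URLs below are hypothetical placeholders, as is the assumption that each replies with a JSON list; any real collection would document its own interface.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical search endpoints for two digital collections.
ENDPOINTS = [
    "https://collection-a.example.org/api/search",
    "https://collection-b.example.org/api/search",
]

def search_all(query: str) -> list[dict]:
    """Send the same query to every collection and merge the results."""
    results = []
    for base in ENDPOINTS:
        url = base + "?" + urlencode({"q": query})
        with urlopen(url) as response:
            results.extend(json.load(response))  # assumes a JSON list reply
    return results

hits = search_all("Charles Lindbergh Paris")
```

The point of Cohen's argument is that once collections expose machine-readable interfaces like these, a historian can build a mega-search tool in a few dozen lines rather than scraping each website by hand.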
However, these examples of Cohen's give us only one part of how to use Big Data. We have collected our data with these tools, but now the question is: how do we present it? Do we use maps, tables, GIS, mind maps and trees, or some combination of these? The possibilities are overwhelming.
Dr. Edward Tufte, a statistician and graphic designer at Yale who has been described as 'the Leonardo da Vinci of data,' has set out some sensible principles of visual design to help those who normally work only with data and text. His main principles, as outlined in a presentation by Liz Marai, are simple and straightforward: don't dumb down your data; clarify data sets by adding detail through annotations and footnotes rather than by simplifying them; and above all, show the data: "graphic design is intelligence made visible."
Tufte also cautions against 'chartjunk,' decorative elements that provide no data and cause confusion, and against exaggerating the size of data on a graph or visualization, which undermines graphical integrity. Many devices can deceive the viewer, whether exaggeration of size or misplaced emphasis on certain sets of data, and Tufte's principles show clearly that this artful deceit can be avoided. To increase data comprehension, Tufte also suggests that legends, pie charts, and certain kinds of maps should be avoided for the confusion they cause and their lack of data density. (See this article for an illustration of how easy it is to deceive the reader with visualizations that misrepresent the data.) Finally, Tufte recommends that if a simple table can show the data accurately, a graphic should not be used at all, quoting Ad Reinhardt: "if a picture does not say a thousand words, to hell with it." With a bit of common sense and a little design theory, we can produce high-density visualizations of even the most complex data.
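As a small illustration of putting this advice into practice, the matplotlib sketch below strips the chartjunk (box lines, legend) and labels the data directly. The dataset is invented purely for illustration; this is one way to apply the principles, not Tufte's own example.

```python
import matplotlib.pyplot as plt

# Invented counts, for illustration only.
years = [1990, 1995, 2000, 2005, 2010]
documents = [120, 480, 2100, 9500, 41000]

fig, ax = plt.subplots()
ax.plot(years, documents, marker="o", color="black")

# Remove chartjunk: no decorative box lines around the plot.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Label the data directly instead of relying on a legend.
ax.annotate("documents online", xy=(years[-1], documents[-1]),
            xytext=(-90, 8), textcoords="offset points")

ax.set_xlabel("Year")
ax.set_ylabel("Documents")
plt.show()
```

Nothing here is clever; that is the point. Every mark left on the figure carries data, which is exactly what Tufte means by letting the design serve the information.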
So how does all of this relate to us as historians? Does it mean that we can be replaced by data systems that now compile data and texts more objectively than human beings can? In the open-source book The Historian's Macroscope: Big Digital History, Shawn Graham, Ian Milligan, and Scott Weingart argue that although these digital tools are becoming ever more advanced, digital history and its practices do not show us 'truths' but 'offer us a way to interpret and understand' these 'traces of the past.' [Graham] Our pool of data may have grown from a handful of specific examples to huge, empirically based sets drawn from computers, but our job of transforming facts, data, and text into narratives has not changed. We can use these new data-mining tools to interpret history better. How fast we move toward more objective interpretation with accurate visualizations depends not on the technology we use but on us as historians.
Cohen, Daniel. "Essays on History and New Media." Roy Rosenzweig Center for History and New Media. http://chnm.gmu.edu/essays-on-history-new-media/essays/?essayid=40 (accessed November 21, 2013).
Graham, Shawn, Ian Milligan, and Scott Weingart. "Big Data and the Historian." The Historian's Macroscope (working title). Under contract with Imperial College Press. Open draft version, Autumn 2013. http://themacroscope.org
Sacco, Nick. "Exploring the Past." Exploring the Past. http://pastexplore.wordpress.com/tag/big-data/ (accessed November 21, 2013).