Data, Data, Data

Did someone say Data?

What on earth is Big Data, you say? To put it simply, Big Data is data that is too big for normal data-processing systems to handle, such as web login records, the webpages of a major website, or, say, millions of Facebook statuses. Analyzing all this data the old way would take far too long, and the sheer volume and velocity of data coming into a normal system would make it nearly impossible to analyze this valuable information. If we put Big Data into a system that can handle it, whatever the volume, we can see patterns and changes over time. So, like distant reading, Big Data systems can help us analyze large volumes of data, only on a larger scale.

The main difficulty with Big Data is how to collect usable information and present it to your audience. If the dataset is so large and growing every minute, how does someone compile it into easily manageable visualizations so that the audience can see what you are in fact doing?

Consider the Big Data tool Syllabus Finder, featured in Daniel Cohen’s article From Babel to Knowledge: Data Mining Large Digital Collections, which has so far cataloged over 600,000 syllabi. Syllabus Finder is set up to look for keywords that commonly appear in syllabi, such as ‘week, readings, winter, fall, and of course syllabus’. If we used only the search engine Google, we would not find as many syllabi as Syllabus Finder does. Even so, Syllabus Finder is not a ‘perfect’ application. Cohen argues that answering history’s main questions (when, where, and why) is far harder to compute with Big Data. The task is no longer simply to ‘classify documents’ but also to process the natural language in which we frame our questions, which is harder for computers to handle. For instance, Cohen takes the question “Why did Charles Lindbergh fly to Paris?” and runs it through H-Bot, a Big Data tool that uses an “open source natural language processing package and associated dictionaries to pull out the key terms ‘Charles Lindbergh’ and ‘Paris.’” [Cohen]
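The keyword matching that Syllabus Finder relies on can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions, not Cohen's actual implementation: the function names, the threshold of three keywords, and the sample texts are all hypothetical, and only the keyword list comes from the description above.

```python
import re

# Keywords mentioned above as typical of syllabi.
SYLLABUS_KEYWORDS = {"week", "readings", "winter", "fall", "syllabus"}


def syllabus_score(text: str) -> int:
    """Count how many distinct syllabus keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(SYLLABUS_KEYWORDS & words)


def looks_like_syllabus(text: str, threshold: int = 3) -> bool:
    """Guess whether a document is a syllabus.

    The threshold is an assumption chosen for illustration only.
    """
    return syllabus_score(text) >= threshold


if __name__ == "__main__":
    page = "History 101 Syllabus, Fall 2013. Week 1 readings: Cohen."
    recipe = "Mix the flour and sugar, then bake for one hour."
    print(looks_like_syllabus(page))
    print(looks_like_syllabus(recipe))
```

A real tool would need far more than this (word variants, page structure, scale), which is part of why even Syllabus Finder is not ‘perfect’.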

Because it finds ‘pure’ articles instead of simple encyclopedia entries, H-Bot is an interesting tool for historians who want to find articles they could not otherwise find on Google. The main lesson Cohen would like us to take away from these two tools, Syllabus Finder and H-Bot, is that APIs (server-to-server software ‘connections’ that need no human involvement) should be more readily available from digital collections so that historians ‘can stitch together’ these mega-search-tools and find answers to our questions faster and more easily. [Cohen]

However, the examples Cohen uses give us only one part of how to use Big Data. We have collected our data using these tools, but now the question is how to present it. Do we use maps, tables, GIS, mind maps and trees, or some combination of these? The possibilities are overwhelming.

Dr. Edward Tufte, a statistician and graphic designer at Yale who has been described as “the Leonardo da Vinci of data,” has set out some sensible principles of visual design to help those who normally work only with text and data to present it graphically. His main principles, as outlined in a presentation by Liz Marai, are simple and straightforward: don’t dumb down your data; clarify data sets by adding detail through annotations and footnotes rather than by simplifying them; and above all, show the data, because “graphic design is intelligence made visible.”

Tufte also cautions against “chartjunk,” or “decorative elements that provide no data and cause confusion,” and against exaggerating the size of data in a graph or visualization, which undermines graphical integrity. Many things can deceive the viewer, whether exaggerated size or undue emphasis on certain sets of data, and Tufte’s principles show clearly how this artful deceit can be avoided. To increase data comprehension, Tufte suggests avoiding legends on maps, pie charts, and certain kinds of maps, because they are confusing and lack data density. (See this article for an illustration of how easy it is to deceive the reader using visualizations that misrepresent the data.) Finally, Tufte recommends that if a simple table can show the data clearly and accurately, a graphic should not be used instead, and he quotes Ad Reinhardt: “if a picture does not say a 1000 words, to hell with it.” Using a bit of common sense and a little design theory, we can create high-density visualizations of even the most complex data.

So how does all of this relate to us as historians? Does this mean that we can be replaced by data systems that now compile data and texts more objectively than human beings can? In the open source book The Historian’s Macroscope: Big Digital History, Shawn Graham, Ian Milligan, and Scott Weingart argue that although these digital tools are becoming increasingly advanced, digital history and its practices do not show us ‘truths’ but instead offer us a way to interpret and understand these ‘traces of the past’. [Graham] Our pool of data may have grown from a handful of specific examples to huge, empirically based sets compiled by computers, but our job of transforming facts, data, and text into narratives has not changed. We can use these new data mining tools to interpret history better. How fast we move towards a more objective interpretation with accurate visualizations is not up to the technology we use; it is up to us as historians.



Cohen, Daniel. “Essays on History and New Media.” Roy Rosenzweig Center for History and New Media RSS. (accessed November 21, 2013).

Graham, Shawn, Ian Milligan, and Scott Weingart. “Big Data and the Historian.” The Historian’s Macroscope – working title. Under contract with Imperial College Press. Open Draft Version, Autumn 2013.

“Tufte’s Design Principles.” (accessed November 22, 2013).

Sacco, Nick. “Exploring the Past.” Exploring the Past. (accessed November 21, 2013).


Museums, Collections and Community Involvement: Canada History Forum 2013

On Tuesday, I was able to watch the Canadian History Forum via livestream from the luxury of my kitchen table. Over lunch I saw the video presentations for the 2013 Young Citizens Video Award. Although two were in French (this poor American only understands some Spanish), from what I saw they were all very well researched and fantastic examples of how the next generation is using technology in education.

The History Forum itself focused on the theme “Is technology changing our history?” I watched the keynote presentation by Kate Hennessy on the virtual collection about the aboriginal people who lived near Fort Anderson in the 1860s and whose artifacts were collected by Hudson’s Bay Company trader Roderick Macfarlane and later acquired by the Smithsonian Institution in Washington, DC. The purpose of creating the virtual collection was to establish community connections with these artifacts and to offer a more modern curatorial alternative to the way they are labeled and tagged in the current Smithsonian catalog, where many of the items have little description attached. With the help of the aboriginal community from which these items originated, the narratives behind many of these artifacts were retold, and the artifacts were cataloged and labeled appropriately. A digitized collection such as this gives the community ‘a place to tell the complex stories of these objects’, says Hennessy. Her argument was that if words and labels are power, then we must change the language in which we describe these items in order to deconstruct the imperialistic language that Macfarlane used when he cataloged the collection in the 19th century.

Over the course of the project, which ran from 2009 to 2012, the team conducted public outreach in the community. Besides the surveys, interviews, and collaboration with Parks Canada, Hennessy and her team put together a documentary called A Case of Access, which documented the process of cataloging and rephotographing many of the objects, often with a member of the community so that each item would have contextual meaning and a sense of scale next to a human being. In this documentary, which is included on the site, the team presented annotations for the artifacts alongside the video so that people watching can see the items in detail.

This is a fantastic innovation and a very clever way of engaging an audience further in the collection itself. An annotated documentary such as the one shown at this keynote could be a valuable asset to museums and archives, showcasing parts of their collections in short videos with the descriptions, close-up views, and tags that would otherwise be seen only by sifting through the virtual collection or visiting in person. I hope that many museums see this new software and these video editing programs as opportunities for broadening education programs and public outreach.
Another advantage of this online collection is that the newer, more modern descriptions and labels do not erase the current (though out-of-date) Smithsonian catalog but coexist with it.

At the end of her presentation, Hennessy said: “we can draw connections using these new technologies, but it is really people not technology that change history. We can make decisions about how to use the technology and how that technology shapes us.”

Overall, I found the keynote presentation a fascinating example of how historians are collaborating with the public and using technology to shape the current interpretation of history for the better.