John Bradley (King's College London), Silk Purses and Sow's Ears: In what ways can structured data deal with historical sources?
Joaquim de Carvalho (Universidade de Coimbra), Combining source oriented and person oriented data models in prosopographical database design
Paul Ell (Centre for Data Digitisation and Analysis, Queen's University Belfast), Humanities Geographical Information Systems: texts, images, maps
This paper discusses the developing use of Geographical Information System technology in the humanities. It examines early projects the Centre for Data Digitisation and Analysis (CDDA) were involved in, creating the first incarnation of humanities GIS projects termed ‘Historical GIS’ with a focus on census and other statistical data together with administrative unit boundaries. These projects resulted in the ability to produce choropleth maps of a range of socio-economic data and to compare these data over time.
Arguably historical GIS had a limited impact. Such systems are complex and costly to create and require specialised skills to use. Moreover, the vast majority of humanities scholars are not interested in statistical data but in text. Work has been done to encompass a wider range of content into historical GIS including multimedia materials such as photographs, historical maps, gazetteers, travellers' tales and more. Attempts have also been made to make these resources more accessible through the use of online interfaces rather than bespoke GIS software such as ArcGIS. An exemplar involving CDDA will be reviewed.
The challenge remains, however, to put the humanity into humanities GIS. This involves far closer interaction between GIS and texts, and a greater focus not on producing maps but on using the spatial functionality of GIS to draw sources together by place. The paper discusses new work at CDDA, at Lancaster University and at UC Berkeley to make this happen and transform a niche methodology into a core element of humanities research practice.
Luís Espinha da Silveira (IHC, FCSH-Universidade Nova de Lisboa), GIS and Historical Research: promises, achievements and pitfalls
It is usually agreed that GIS began to be introduced into historical research in the mid-1990s (Gregory & Ell, 2007, 15-16; Knowles, 2002, XI). The description of the atmosphere of the 1998 and 1999 Social Science History Association (SSHA) sessions on historical GIS refers to people’s “passionate engagement with methodology” and to participants’ excitement, caused by the sense that they were “making something new by using new tools” (Knowles, 2000, 5-6). As had happened with quantitative history, historical GIS would be able to open up historical scholarship, inspire new creativity, challenge old assumptions, and promote the exploitation and understanding of new kinds of historical evidence (Knowles, 2000, 17-18). Some years later Knowles (2008, 267), although acknowledging the great progress that had been made, also recognized that GIS’s “promise, however, is far from fully realized”. Even so, it seems that the enthusiasm of the 1990s has not vanished, and a very interesting, recently published book proposes to “advance an even more radical conception of GIS that will reorient, and perhaps revolutionize, humanities scholarship” (Bodenhamer, Corrigan & Harris, 2010, IX).
In this paper I will not question the need to develop the conceptual framework of the field of historical GIS, the importance of addressing the specific problems that historical information poses to GIS, the developments regarding textual information, the interest in exploring new forms of visualization and the possibilities opened up by the Internet to create an infrastructure to support research. At the same time, I will emphasize the importance for historians who are aware of the role of space in historical explanation to continue to address relevant historical questions and to integrate GIS tools in historical research methodologies, combining quantitative and qualitative approaches, paying attention to space, time and scale. I will illustrate this point with some examples taken from our research on Portuguese History.
Finally, I will call attention to some pitfalls created by the use of GIS technologies: the costs in terms of time or money; the fascination with the technology, which easily diverts you from research to data collection and dissemination; and the lack of academic recognition of new forms of publication. I will exemplify the first issue with a supranational project on population in the Iberian Peninsula.
Malte Rehbein (Universität Würzburg), Text Encoding: a historian's perspective
“Before they can be studied with the aid of machines, texts must be encoded in a machine-readable form” (Michael Sperberg-McQueen) and encoding is understood as “the process by which information from a source is converted into symbols to be communicated” (Wikipedia).
The focus of text encoding initiatives and /the/ Text Encoding Initiative (TEI), providing a standard for electronic texts in the humanities, has mostly been on two core disciplines, namely literary and linguistic studies, and projects that used computers to study texts have quite often been based on the digitization of printed material. But text encoding for historical research is about more than just text; it is about encoding (textual) culture and hence concerns many aspects of our work as historians (at least when dealing with primary sources).
This presentation illustrates how machine readable information can be created and encoded and what role this may play in order to support our understanding of the past. Encoding is not restricted to text. The various examples of mostly on-going historical research discussed here, spanning a range from early medieval biblical commentaries up to letters from the WWII front, will demonstrate that text encoding from a historian's perspective must be understood holistically, including not only text, but also text carriers, context, implicit and explicit content, inter- and paratexts, text production processes, and text usage.
Rita Marquilhas (Centro de Linguística - Universidade de Lisboa), The automatic research of digital editions
Databases containing transcriptions of historical sources benefit from the adoption of XML-TEI mark-up. On the one hand, such a methodology allows for the electronic online edition of the text, since XML is a machine-readable format and all browsers can process it. On the other hand, the adoption of the TEI conventions for the textual mark-up gives the corresponding databases the design of a well-tested standard within scholarly editing.
The advantages just mentioned (readability by computer programs and standardization according to scholarly criteria) have meant that one such marked-up database has recently been subjected to further processing, allowing for several empirical advances in a multidisciplinary domain.
The database in question is a corpus of Portuguese private letters written between the 16th and the 20th century (projects CARDS and FLY of the Linguistics Centre of Lisbon University, CLUL). In September 2011, the database contained c. 2,400 letters. A quick presentation of the mark-up we use in those projects will take place.
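As a minimal sketch of what machine readability makes possible, the Python fragment below parses a small TEI-style letter with the standard library and pulls out its salutation formulae. The sample letter, the function name extract_salutations and the choice of the TEI elements opener, closer and salute are illustrative assumptions; this is not the actual CARDS or FLY mark-up, nor the Perl tooling used in those projects.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is only a local alias for queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample letter (hypothetical content, for illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="letter">
        <opener><salute>Meu querido pai</salute></opener>
        <p>Recebi a sua carta.</p>
        <closer><salute>Seu filho obediente</salute></closer>
      </div>
    </body>
  </text>
</TEI>"""


def extract_salutations(xml_text):
    """Return the text of every <salute> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        "".join(element.itertext()).strip()
        for element in root.iterfind(".//tei:salute", TEI_NS)
    ]


salutations = extract_salutations(SAMPLE)
print(salutations)
```

Because the formulae are tagged rather than buried in running text, isolating them is a one-line query rather than a pattern-matching exercise.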
The main experiments we have made on the corpus thus assembled were the following:
1. Using information extraction procedures by means of Perl scripts expressly made by computer engineers, it was possible to isolate the textual parts of the letters that contained polite formal language and to account for the relevance of their semantics to politeness theory.
2. Using lexical statistics software (both the branded WordSmith Tools and the freely available AntConc) it was possible to compare the keywords of the letters’ text with the ones in two larger reference Portuguese corpora. One contained oral utterances recorded in Dialectology campaigns (Cordial); the other contained texts of several genres written in Contemporary Portuguese (CRPC). The keyness of the letters corpus lexicon (i.e., the most relevant lexicon vis-à-vis the one in Cordial and in CRPC) revealed that the letter genre is indeed much closer to spoken styles than to written ones.
3. Adapting for the Portuguese language a set of statistical spelling normalization tools originally designed for English (VARD2), we are obtaining some promising results in our attempt to normalize automatically the palaeographic (highly variant) transcriptions we have.
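The keyword comparison in the second experiment can be sketched as follows. The abstract does not say which keyness statistic was used; the fragment assumes the log-likelihood measure, one of the statistics commonly offered by corpus tools, and runs it on invented toy token lists rather than on the letters, Cordial or CRPC. The function names log_likelihood and keywords are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one word.

    a, b: frequency of the word in the study and reference corpus.
    c, d: total token counts of the study and reference corpus.
    """
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    # A zero frequency contributes nothing (0 * log 0 is treated as 0).
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keywords(study_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Toy data, assumed for illustration only.
letters = "saudades minha mãe saudades de casa".split()
reference = "o governo de lisboa publicou o decreto de lei".split()
ranked = keywords(letters, reference)
print(ranked)
```

A word that is equally frequent, relative to corpus size, in both corpora scores zero; the further its relative frequency in the letters exceeds that in the reference corpus, the higher it ranks.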
Melissa Terras (University College London), Exploring the potential of Digital Humanities with the Transcribe Bentham project
There is much confusion around what Digital Humanities actually is, with many practitioners in the field debating the use of the term, and the activities which constitute Digital Humanities. The aim of this paper is to present a bottom-up approach to defining the scope and potential of Digital Humanities by presenting a single project, Transcribe Bentham. By focussing on one individual project and the various ways in which technologies have allowed research that would otherwise have proved impossible, Transcribe Bentham allows us to understand the benefits and potentials in using digital techniques within the humanities.
Transcribe Bentham is a one year, Arts and Humanities Research Council funded project running from April 2010 until April 2011, housed under the auspices of the Bentham Project at UCL (http://www.ucl.ac.uk/Bentham-Project/). The Bentham Project aims to produce new editions of the scholarship of Jeremy Bentham (1748-1832), the English jurist, philosopher, and legal and social reformer. He is well known for his advocacy of utilitarianism and animal rights, but is perhaps most famous for his work on the “panopticon”: a type of prison in which wardens can observe (-opticon) all (pan-) prisoners without the incarcerated being able to tell whether or not they are being watched. Twenty volumes of Bentham’s correspondence have so far been published by the Bentham Project, plus various collections of his work on jurisprudence and legal matters. However, UCL Library Services holds 60,000 folios of Bentham’s manuscripts, and there is much more work to be done to make his writings more accessible, and to provide transcripts of the materials therein.
Transcribe Bentham has tested the feasibility of outsourcing the work of manuscript transcription to members of the public, aiming to digitise 12,500 Bentham folios, and, through a wiki-based interface, allowing transcribers access to images of unpublished manuscripts, in order to create a TEI-encoded transcript for checking by UCL experts. Approved transcripts have been stored and preserved, with the manuscript images, in UCL's public Digital Collections repository, making innovative use of traditional Library material.
The Transcribe Bentham project is now in its reporting phase, after six months of active promotion of the wiki-based transcription tool. This paper will present an overview of the project, demonstrating how digitisation, online presence, text encoding, transcription, crowdsourcing, and online outreach can benefit those working with humanities data.
Daniel Gomes (Portuguese Web Archive – Fundação para a Computação Científica Nacional), Web Archiving
The web is the primary means of communication in developed societies. All kinds of information that describe our recent times are published on the web. As everyone can publish online, it becomes possible to analyze events through various first-person points of view that provide different perspectives, and not just the official descriptions issued by dominant forces. Thus, the web is a valuable resource for contemporary historical research. However, its information is extremely ephemeral. Several research studies have shown that only a small amount of information remains available on the web for longer than 1 year.
Web archiving aims to acquire, preserve and provide access to historical information published on the web. In November 2011, there were 52 web archiving initiatives worldwide, which include services that enable any person to create their own historical collections. Web archiving also has an important sociological impact because common citizens are publishing information with a personal meaning on the web, without having any preservation concerns, such as saving their pictures on a disk. In the future, web archives will be the only source of personal memories for many people.
There are tools, such as browser add-ons, that facilitate historical research over web archives. However, most of them require that the users know the exact address (URL) where the needed information was published in the past. The Portuguese Web Archive provides a full-text search service over 1 billion contents archived from 1996 to 2011 (available at www.archive.pt), as well as other tools for historical research over archived web collections.
Peter Doorn (Data Archiving and Networked Services, Nederland), Computational history among e-science, digital humanities and research infrastructures: accomplishments and challenges
This presentation will focus on the following subjects: first I will briefly introduce DANS; after that I will place the developments in computational history in the context of the developments in e-Science and the digital humanities. Over the years we see a gradual increase in the scale of projects, partly brought about by computation itself and the specialization it requires. Therefore we can see increased attention to digital data and research infrastructures, both at the national and at the European level.
DANS is an institute of the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Dutch Research Funding Organisation (NWO) and was founded in 2005 (www.dans.knaw.nl). It builds on the work of predecessors, the first of which dates back to 1964 (Steinmetz Foundation and Archive for the social sciences). The Netherlands Historical Data Archive (NHDA) was created in 1989, inspired by the needs of historians and the creation of numerous historical databases, which needed to be archived and kept accessible for later use. The central task of DANS is to provide permanent access to digital data in the humanities and social sciences, although we recently started to gradually expand our services to other domains as well.
DANS maintains a digital archive with substantial data collections in history, social sciences, and archaeology. We also carry out data projects in collaboration with research communities and partner organizations. Moreover, we give advice and support, for example we developed a Data Seal of Approval (see: http://www.datasealofapproval.org/), aiming at quality control of data and repositories, and maintain a Persistent Identifier Infrastructure based on the URN (see: http://www.persid.org/index.html).
In short, DANS promotes permanent access to digital research data; it encourages scientific researchers to archive and reuse data by means of our online archiving system EASY; we provide access, through www.narcis.nl, to thousands of scientific datasets, e-publications and other research information in the Netherlands; moreover, DANS provides training and advice, and we perform research into archiving of and access to digital information.
History and computing as e-Science:
It makes sense to place the developments of computational history in the past decade in the context of e-science, which was defined back in 2001 as “Science increasingly done through distributed global collaborations enabled by the Internet, using very large data collections, tera-scale computing resources and high performance visualisation.” (UK Department of Trade and Industry; Research Council e-Science Core Programme). Jim Gray, Tony Hey and others spoke of a “fourth paradigm” in science, characterized by high data intensiveness. Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. The speed at which any given scientific discipline advances will depend on how well its researchers collaborate with one another, and with technologists, in areas of e-Science such as databases, workflow management, visualization, and cloud-computing technologies.
Although the scale of humanities research, including the work of historians, is much smaller than that in astronomy or particle physics, most specialists agree that the tendencies and needs of e-science and e-humanities are basically similar. Humanities computing was defined by Willard McCarty in 1999 as “an academic field concerned with the application of computing tools to arts and humanities data or to their use in the creation of these data.” Terms such as computational humanities, digital humanities and e-humanities are now also in use, and essentially denote similar things (with nuance I do not intend to get into).
Since the 1990s many people have come up with definitions for or descriptions of computing in historical research (this list can be easily expanded):
• Charles Harvey: historical computing must be concerned with the creation of models of the past or representations of past realities.
• Matthew Woollard: History and computing is not only about historical research, but also about historical resource creation.
• George Welling: Historical Informatics (computational history) is a new field of interdisciplinary specialization dealing with pragmatic and conceptual issues related to the use of information and communication technologies in the teaching, research and public communication of history.
• Lawrence McCrank (2002): Historical information science integrates equally the subject matter of a historical field of investigation, quantified social science and linguistic research methodologies, computer science and technology, and information science, which is focused on historical information sources, structures, and communications.
• Boonstra, Breure, Doorn (2004): Historical information science is the discipline that deals with specific information problems in historical research and in the sources that are used for historical research, and tries to solve these information problems in a generic way with the help of computing tools.
In a study on the “Past, Present and Future of Historical Information Science” I published together with Onno Boonstra and Leen Breure, we distinguished four categories of information problems in historical research, which we ordered on what we called the “life cycle of historical information”: information problems of historical sources (representation); of relationships between sources (harmonization, linkage); of historical analysis (qualitative and quantitative); of the presentation of sources or analysis (visualization, edition). The PDF of the book can be found here: http://www.dans.knaw.nl/content/categorieen/publicaties/past-present-and-future-historical-information-science.
Back in 2004, we were a bit wary of the developments of history and computing in the preceding few years. It seemed as if the exciting and formative years of historical computing (roughly the period 1985-2000) were over. Many mainstream historians were just happy to be able to use the computer for text processing, web browsing and emailing.
Probably a degree of specialisation did occur: you simply could not expect every historian to be a programmer, as Le Roy Ladurie once said. The scale of historical research had to go up to get beyond the basic level of computing techniques. Collaboration with professional IT specialists was necessary, and I think we are gradually working in that direction.
The increase of the scale of digital history projects:
In my presentation I will mention a few examples of big projects we were involved in, and in which computing scientists and historians did work together: the digitization of the Dutch censuses and the project “Life Courses in Context” (the first project in the humanities in the Netherlands to receive an investment grant of a few million Euros; see www.volkstellingen.nl); the project “Climate of the World Oceans”, in which historians, computing scientists and climatologists worked together to retrieve weather observations from historical ships’ logs (www.knmi.nl/cliwoc/); the collaboratory on institutions for collective action (http://www.collective-action.info/); the collaboratory ‘Clio Infrastructure’, building and connecting global data hubs on world inequality, the increasing divergence between rich and poor countries (www.clio-infra.eu); the projects “Telling witnesses” and “Veteran tapes”, in which many hundreds of qualitative interviews have been collected and analysed as “oral histories” of the Second World War and other conflicts (http://getuigenverhalen.nl/ and http://www.watveteranenvertellen.nl); and the project Medieval Memoria Online (MeMO), which aims to help scholars in carrying out research into memoria during the period up to the Reformation (c. 1580) in the area that is the present-day country of the Netherlands (http://memo.hum.uu.nl/). In all these projects, historical researchers and computing experts (and often specialists from other disciplines as well) from several institutes worked or are working together.
The need for research infrastructures:
It is vital that these projects rest on a solid foundation, not only during the course of the project, but also afterwards. If no infrastructure exists that can guarantee sustainability after the project is finished, the results are in danger of disappearing soon after the project’s end, and the investment and effort will be lost. This is exactly why digital infrastructures are necessary: to support and maintain the collaborative efforts. The services developed in the projects need to be sustainable, and they can only be maintained efficiently if they are generic and re-usable. This is why a few years ago, not just in the natural and life sciences, but also in the humanities and social sciences, initiatives were taken to set up infrastructures to support and sustain the investments made in large (and small) projects. The European Strategy Forum on Research Infrastructures (ESFRI) formulated a first “Roadmap” for the creation of such infrastructures (http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri). DARIAH, the emerging Digital Research Infrastructure for the Arts and Humanities, is one of the two infrastructures proposed on the ESFRI Roadmap for the humanities, including history (www.dariah.eu). DARIAH aims to “link and provide access to distributed digital source materials of many kinds”. In the field of linguistics, CLARIN has been set up: Common Language Resources and Technology Infrastructure (www.clarin.eu), and there are also examples in the social sciences. In several countries, among which the Netherlands, it is proposed that CLARIN and DARIAH will work closely together or even merge.
The digitization of cultural heritage material, including archival sources, is of great importance for historians and other humanities researchers. In this field, too, we see the creation of large-scale infrastructures. Europeana enables people to explore the digital resources of Europe's museums, libraries, archives and audio-visual collections (www.europeana.eu). It promotes discovery and networking opportunities in a multilingual space where users can engage, share in and be inspired by the rich diversity of Europe's cultural and scientific heritage. The breadth of the endeavor is at the same time its limitation for researchers: although millions of heritage objects can be “explored”, the content and descriptions are oriented towards consumption by a general audience, not towards the analytical use of specialists. The European Holocaust Research Infrastructure, which is supported by DARIAH in solving the technological challenges of bringing together virtual resources from dispersed archives, is a good example of an infrastructure on the interface of heritage and historical research.
The intention of the organisers of the Lisbon workshop on Digital Methods and Tools for Historical Research is to discuss the implications of using digital technologies in the production and dissemination of knowledge in History.
Two of the implications I have highlighted are the increase in the scale of digital history projects and the need for research infrastructures to sustain the results of digital projects. Multidisciplinary and international collaboration is inevitable for professional results. Computational history is in this sense comparable to (or simply part of) data-driven e-Science.
This conclusion is independent of the type of methodology we look at: whether it is relational databases, geographic information systems, (text) encoding or digitization and preservation of digital memory. Such methodologies rarely stand alone in a digital project, and are rather phases in the cycle that many digital projects go through: after digitization comes the encoding (for textual sources) or the structuring in databases. Analysis is the next phase, for which GISes are very useful when the data has a geospatial component that can be visualised. At the end of the cycle, proper measures need to be taken in order to keep the results accessible for the future.