Reading 35,000 Books: UCD Contagion Project and the British Library Digital Corpus

How do you set about finding specific references and thematic associations in the massive digital resource represented by the British Library Digital Corpus?

Gerardine Meaney and Derek Greene

Gerardine Meaney, Professor of Cultural Theory in the UCD School of English, Drama and Film, and Derek Greene, Assistant Professor at the School of Computer Science, together with the British Library Labs team and with support from the Irish Research Council, organised a free workshop and roundtable at the British Library on February 20, 2019.

In the run-up to the event, Gerardine and Derek wrote a guest post for the British Library’s blog about their reasons for organising it and the associated research they are currently engaged in at UCD.


How do you set about finding specific references and thematic associations in the massive digital resource represented by the British Library Nineteenth Century Book Corpus, originally digitised through a collaboration with Microsoft?

The Contagion, Biopolitics and Cultural Memory project at UCD Dublin set out to illuminate culturally and historically specific understandings of disease and contagion that appear within the fiction in the corpus. To do so, the project team extracted over 35,000 unique volumes out of a total of 65,000 in English and built a searchable interface of 12.3 million individual pages of text, which can be filtered and sorted using the corpus metadata (e.g. author, title, year). The interface incorporates an index of the topical catalogue of volumes used by the British Library from 1823 to 1985 (the Alston Index). Using a combination of optical character recognition (OCR) and manual annotation, we have extracted data from the top two levels of the index, covering over 98% of the English-language texts in the corpus. So for the first time it is possible to reliably identify and extract fiction, drama, history, topography and other categories from the corpus.
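As a rough illustration of this kind of metadata filtering, the sketch below selects and sorts volume records by index category and year range. It is not the project's code: the `Volume` record, the `filter_volumes` function, the category labels and the sample titles are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    """Hypothetical volume record; the field names are assumptions."""
    title: str
    author: str
    year: int
    category: str  # stand-in for a top-level index heading


def filter_volumes(volumes, category=None, year_from=None, year_to=None):
    """Keep volumes matching every criterion that was supplied,
    then sort them by year and title."""
    selected = [
        v for v in volumes
        if (category is None or v.category == category)
        and (year_from is None or v.year >= year_from)
        and (year_to is None or v.year <= year_to)
    ]
    return sorted(selected, key=lambda v: (v.year, v.title))


# Placeholder records, not real entries from the corpus.
SAMPLE = [
    Volume("Sample Novel Two", "Author C", 1866, "Fiction"),
    Volume("Sample Travel Book", "Author B", 1847, "Topography"),
    Volume("Sample Novel One", "Author A", 1852, "Fiction"),
    Volume("Sample Chronicle", "Author D", 1839, "History"),
]

fiction_1850_1869 = filter_volumes(
    SAMPLE, category="Fiction", year_from=1850, year_to=1869
)
for volume in fiction_1850_1869:
    print(volume.year, volume.title)
```

Leaving a criterion as `None` means it is simply not applied, so the same function covers a search by category alone, by date alone, or by both.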

To allow researchers to further filter the corpus to identify texts from niche topic areas, the interface supports the semi-automatic creation of word lexicons, built upon modern “word embedding” natural language processing methods. By combining the resulting lexicons with existing corpus metadata and the data extracted from the digitised version of the Alston Index, researchers can efficiently create and export small topical sub-corpora for subsequent close reading.
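One common way to build a lexicon semi-automatically from word embeddings is to start from a few seed words, rank the rest of the vocabulary by cosine similarity to the seeds, and let the researcher accept or reject the suggestions. The sketch below shows that general idea only. It does not describe how the interface itself is implemented, and the three-dimensional vectors, the `suggest_terms` function and its thresholds are made up for illustration (real embeddings would be trained on the corpus and have hundreds of dimensions).

```python
import math

# Toy vectors invented for illustration; not trained on any corpus.
EMBEDDINGS = {
    "fever": (0.9, 0.1, 0.0),
    "contagion": (0.9, 0.2, 0.0),
    "cholera": (0.8, 0.2, 0.1),
    "infection": (0.85, 0.15, 0.05),
    "garden": (0.0, 0.9, 0.1),
    "carriage": (0.1, 0.1, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_terms(seeds, embeddings, top_n=3, min_similarity=0.8):
    """Suggest candidate lexicon terms closest to the centroid of the seeds.

    The suggestions are candidates for a researcher to review,
    not a finished lexicon.
    """
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vectors:
        return []
    dims = len(seed_vectors[0])
    centroid = tuple(
        sum(vec[i] for vec in seed_vectors) / len(seed_vectors)
        for i in range(dims)
    )
    scored = [
        (cosine(centroid, vec), word)
        for word, vec in embeddings.items()
        if word not in seeds
    ]
    scored = [(score, word) for score, word in scored if score >= min_similarity]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


suggestions = suggest_terms(["fever", "contagion"], EMBEDDINGS)
print(suggestions)
```

With these toy vectors the disease-related words are suggested and the unrelated ones fall below the similarity threshold. In practice the accepted terms would then be used as search terms alongside the metadata filters.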

The Contagion project team is currently using information retrieval and word embeddings to identify texts for close reading. This combination allows us to track key trends pertaining to illness and contagion in the corpus, and interpret these findings with particular reference to current and historical debates surrounding biopolitics, medical culture and migration. Clusters of associations between contagion, poverty and morality are identifiable within the corpus. However, to date our research indicates that Victorians were more worried about religious contamination from migrants and minorities than they were about contagious diseases.

A key feature of the project is the intersection of methodologies and concepts from English literature, automated text mining, and medical humanities. This involves using data analytics as a mode of interpretation, not a substitute for it: a way of engaging with the extent and complexity of cultural production in the nineteenth century. Cultural data resists giving definitive yes or no answers to the questions put to it by researchers, but the more cultural data we analyse, the better we can map the processes of cultural change and continuity, in all their complexity. The process of tracking themes, topics, and associations enabled by the new interface offers an opportunity to work with and far beyond the existing canon of nineteenth-century fiction, itself radically expanded by the last 20 years of scholarship. The identification within the corpus of a very large collection of three-volume novels indicates that the popular novel is very well represented, for example, while the ability to identify and extract ‘Collected Works’ indicates which writers their contemporaries expected to remain central to the tradition of fiction.

On February 20, 2019, the free ‘Reading 35,000 Books’ workshop and roundtable will present the project’s work to date, and will also include discussion by scholars of nineteenth-century literature and the British Library Labs of the future development and use of the new searchable interface, including exporting topical sub-corpora for further research.

The event was supported by the Irish Research Council.