Boutique Big Data: reintegrating close and distant reading of 19th-Century newspapers
2016-12-21T14:02:24Z (GMT) by
From their earliest incarnations in the seventeenth century, through their Georgian expansion into provincial and colonial markets, to their late-Victorian transformation into New Journalism, British newspapers have relied upon scissors-and-paste journalism to meet consumer demands for the latest political intelligence and diverting content. This practice, wherein one newspaper extracted or wholly duplicated content from another, is well known to scholars of the periodical press, but in-depth analysis is hindered by the lack of formal records relating to the reprinting process. Although anecdotes abound, attributions were rarely and inconsistently given and, with no legal requirement to recompense the original author, formal records of where material was obtained were unnecessary. Even if such records had existed, the number of titles that relied upon reprinted material makes systematic analysis impossible; for many periodicals, only a few issues, let alone business records, survive. However, mass digitisation of these periodicals, in both photographic and machine-readable form, offers historians a new opportunity to rediscover the mechanics of nineteenth-century reprinting. By undertaking multi-modal and multi-scalar analyses of digitised periodicals, we can begin to reconstruct the precise journeys these texts took from their first appearance to their multiple ends. Before the advent of the telegraph, individual texts were disseminated manually, through postal and private correspondence routes, over sea and land. This allowed for the relatively slow spread of texts across communication networks, as well as their adaptation, truncation and expansion at various stages. In a manner similar to modern internet memes, blogs and online news content, texts underwent evolutionary changes with each reprinting. 
These could be minute, such as the correction of spelling errors or the application of house style, or significant, through selective reordering and truncation to alter the overall meaning of the text. While identifying meme families, or collections of related texts, can help us understand what made particular texts popular, or viral, it is only by tracing the specific trajectories and pathways of these texts that the causes and consequences of evolutionary changes can be understood. Doing so requires us to approach these texts on multiple scales. First, by mining extremely large corpora, derived from several independent collections, we are able to identify a statistically sufficient portion of the historical network. Then, by carefully analysing the chronology of these reprints and the discrepancies between them, hypotheses regarding institutional and industry standards can be posited and tested against the wider corpus. These efforts can be further buttressed by utilising manual transcriptions found in the personal archives of researchers using historical newspapers, such as the Scissors and Paste Database (www.scissorsandpaste.net). These transcriptions, far more accurate than the majority of datasets derived from optical character recognition, greatly improve the mining of the corpora, yielding a more complete initial network to analyse, and help to offset the skewing effect of the ‘offline penumbra’. This poster will explore the possibilities of large-scale reprint identification within and across digitised collections, using a combination of Lou Bloomfield’s Copyfind and project-specific code to identify matches between individual articles or full pages of text in both manual (perfect) and OCR (messy) transcriptions. Exemplar collections include the British Library's 19th-Century Newspapers digital collection, with planned expansions into the digital collections of the National Library of Wales (Welsh Newspapers Online) and of Australia (Trove). 
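The kind of matching described above, between a clean manual transcription and a noisy OCR transcription, can be illustrated with a minimal sketch. It is not Copyfind's algorithm or the project's own code: the function names `shingles` and `overlap`, the four-word window and the sample sentences are all assumptions made for illustration. The sketch breaks each text into overlapping runs of words and reports the share of runs the two texts have in common.

```python
import re


def shingles(text, n=4):
    """Return the set of overlapping n-word runs in a text (hypothetical helper)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=4):
    """Share of n-word runs in common, relative to the shorter text."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


# Invented sample texts: a clean transcription and an OCR copy with one misread word.
manual = ("The steamer arrived at Liverpool on Tuesday with the latest "
          "intelligence from New York")
ocr = ("The steamer arrived at Liverpool on Tnesday with the latest "
       "intelligence from New York")

score = overlap(manual, ocr)
print(f"shared four-word runs: {score:.2f}")
```

A single misread word removes every run that contains it, which is one reason a shorter window, or a character-level comparison, may suit messy OCR better than a long exact-match window.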
The poster will also demonstrate the means by which reprint branching can be mapped using chronology and character clustering, and will compare the relative precision of manual and computer-aided techniques. Finally, it will explore the nature of multi-scalar analysis and how we might best reintegrate ‘boutique’ periodical research, such as the author’s Scissors and Paste Database, into large-scale text-mining projects.
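One way to picture the mapping of reprint branches from chronology and character-level similarity is the sketch below. It is an illustration under stated assumptions, not the method presented on the poster: the function name `likely_parents`, the newspaper titles, the dates and the texts are all invented, and Python's standard `difflib.SequenceMatcher` stands in for whatever character clustering the project uses. Each version is linked to the most similar version printed on an earlier date.

```python
from datetime import date
from difflib import SequenceMatcher


def likely_parents(versions):
    """Map each title to the most similar earlier-dated version (hypothetical helper).

    versions: list of (title, publication date, text) tuples.
    Returns a dict of title -> parent title, or None for the earliest version.
    """
    ordered = sorted(versions, key=lambda v: v[1])
    parents = {}
    for title, when, text in ordered:
        # Only versions printed strictly earlier can be a source.
        earlier = [v for v in ordered if v[1] < when]
        if not earlier:
            parents[title] = None
            continue
        best = max(earlier,
                   key=lambda v: SequenceMatcher(None, v[2], text).ratio())
        parents[title] = best[0]
    return parents


# Invented example: four printings of one paragraph.
versions = [
    ("Gazette", date(1840, 1, 1),
     "The ship arrived safely at port after a long voyage."),
    ("Herald", date(1840, 1, 3),
     "The ship arrived safely at port after a long voyage, we are glad to report."),
    ("Mercury", date(1840, 1, 5),
     "The ship arrived safly at port."),
    ("Courier", date(1840, 1, 8),
     "The ship arrived safly at port yesterday."),
]

for title, parent in likely_parents(versions).items():
    print(title, "<-", parent)
```

In this invented case the misspelling shared by the two latest printings places them on the same branch. A link of this kind is only a hypothesis to be checked by close reading, since the true source may not survive in any digitised collection.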