Data Week has drawn to a close all too fast, and we now have our prototype online!
As readers will see, we have uploaded a small corpus of works into our text-matching tool. If you click on the title of a work, it will bring you to a page that displays the entire text, and highlights areas where it matches that of other uploaded works. In addition to highlighting text matches, the tool displays them alongside one another, and provides links so that readers can view the full text of the work that shares text in common with the one under study.
If this experience has taught us one thing, it’s that we needed more time to create a tool that can be used with so many different digitized texts, for so many different purposes. It was fun to see what could be done in a week, but the compressed timeline gave us little wiggle room for reflecting on how we wanted the tool to work (both in terms of how the code was written and in terms of the interface design) and even less wiggle room for experimentation.
If we were given, say, another week or two, we would love to improve the tool’s matching capabilities and make the interface more elegant and intuitive. We would also love to upload a much larger corpus to the tool and play around with that, and/or figure out whether (and if so, how) we could integrate it into an existing platform for viewing the Medical Heritage Library, so it would be easier for everyone to use.
We’re midway through Data Week, and our group is still in the thick of trying to figure out how to detect plagiarism—and other forms of similarity and difference between texts—through the UK Medical Heritage Library. We’re hoping that this will be worked out soon, so we’ll have some time to develop a way of visualising what we find in the UK MHL.
One of the challenges we’ve encountered is figuring out how to detect identical phrases and chunks of text, rather than similarity between all of the words in two books. The latter process will suggest, misleadingly, that two works on, say, syphilis, are virtually identical. Not too helpful for research purposes.
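To make the distinction concrete, here is a minimal Python sketch of the two approaches. It is not the code behind our tool: the function names, the five-word phrase length, and the simple tokenizer are all assumptions chosen for illustration. The first measure compares the vocabularies of two texts (the approach that makes any two books on the same subject look alike), while the second only reports runs of consecutive words that appear verbatim in both.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vocabulary_overlap(text_a, text_b):
    """Jaccard similarity of the two vocabularies (word order is ignored)."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shingles(tokens, n):
    """Return the set of all runs of n consecutive words."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_phrases(text_a, text_b, n=5):
    """Return the n-word phrases that appear verbatim in both texts.

    n=5 is an illustrative assumption, not a tuned value.
    """
    common = shingles(tokenize(text_a), n) & shingles(tokenize(text_b), n)
    return sorted(" ".join(phrase) for phrase in common)
```

Two texts that use the same words in a different order will score highly on `vocabulary_overlap` but return nothing from `shared_phrases`, whereas a copied passage shows up as a cluster of overlapping phrases. A longer phrase length cuts down on matches from stock expressions at the cost of missing short borrowings.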
Another challenge has been figuring out the best source for extracting our data from the UK MHL. We can download full text from the Wellcome Library website or from the Internet Archive. The text from the Wellcome Library comes with metadata that we think could be useful, but the Internet Archive offers us a wider range of texts to choose from and compare. We’re giving text from the Internet Archive a go right now, with the dataset we have prepared, and have another one in the works in case we end up using text from the Wellcome site.
Thankfully, one of the books we’ve been using as a key text for testing out matching capabilities, Manhood: The Causes of its Premature Decline with Directions for its Perfect Restoration, is fun to work with!
Listen to Sarah and Dan talking to Tom Crane about their aims for the project and the issues they are facing:
Inspired by projects and tools like Viral Texts, the Wilde Trials Text App, and Kaleida, our group is trying to track the ways in which producers of works on sexual health copied each other during the nineteenth century, using the newly launched UK Medical Heritage Library. We’re hoping that, in doing so, we can find out more about how knowledge about sexual health circulated in the nineteenth century—and how people thought about that knowledge and who had ownership of it.
We started Monday off by talking about how to approach this problem — the UK MHL is huge, comprising over 66,000 books, which makes the task of comparing its texts pretty daunting! We agreed that working with two smaller corpuses of texts taken from the UK MHL, one of which is made up of works that we already know copied from one another, might be easier in the short run.
We also talked a lot about how we would present information about text matches, and looked at a number of different kinds of visualizations — network charts, trees, side-by-side comparison windows, and other possibilities. After lunch, Sarah started to look more closely at visualization possibilities, while Dan focused on figuring out the best way to search for text matches between different works.
In between, we’ve had great chats about different nineteenth-century terms for sexual problems (“nervous debility,” anyone?), funny beginnings to medical books, and surprising similarities between the books we’re looking at and tiny cameras. Looking forward to Day 2!
Written by Sarah Bull, a Cambridge historian interested in plagiarism within sexual health manuals, who is looking at books within the UK MHL collection and working alongside Dan Williams.