Ticehurst – What Next?

It’s been a week since the Data Week finished, and whilst it’s still fresh in our minds I thought I’d jot down a few thoughts on possible next steps.

Firstly: thanks to one of the tools we built, we remarkably managed to annotate all the case books with page numbers and types, so it would be great to incorporate this data back into the main Wellcome Library interface.

If you view Volume 38 of the Case Books in the Wellcome Library viewer interface right now, you’ll see that there’s a “jump to page” feature at the top, but that it doesn’t work for this book because the data isn’t available to the player. There’s also no table of contents, a feature usually available for scanned books, but not for archival material.

To demonstrate how this could be improved, we generated an IIIF manifest for each of the case books (see an example for Volume 38) – a JSON data file in the IIIF metadata format that the player can use. If you take this URL and plug it into the Universal Viewer demo you can see how it improves the interface: there’s now a table of contents, and the page-switcher works. (One slight bug is that the pages are labelled as “page 12 – 13 of Spine”, as “Spine” is the label of the last image.)
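To give a flavour of what the player consumes, here’s a minimal sketch of an IIIF Presentation 2.0 manifest built in Python. The identifiers, labels and dimensions are placeholders for illustration – the real Wellcome manifests are much richer – but the overall shape (canvases carrying page labels, and `structures` ranges providing the table of contents) is what drives the features described above.

```python
import json

# Minimal sketch of an IIIF Presentation 2.0 manifest.
# All URLs and labels here are hypothetical placeholders.
manifest = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/vol38/manifest",
    "@type": "sc:Manifest",
    "label": "Case Book Volume 38",
    "sequences": [{
        "@type": "sc:Sequence",
        "canvases": [
            {
                "@id": f"https://example.org/iiif/vol38/canvas/{i}",
                "@type": "sc:Canvas",
                # The canvas label is what the viewer's "jump to page"
                # feature displays.
                "label": label,
                "height": 2000,
                "width": 3000,
            }
            for i, label in enumerate(["Front cover", "Page 1 - 2", "Spine"])
        ],
    }],
    # "structures" (sc:Range entries) are what give the viewer
    # its table of contents.
    "structures": [{
        "@id": "https://example.org/iiif/vol38/range/0",
        "@type": "sc:Range",
        "label": "Patient entries",
        "canvases": ["https://example.org/iiif/vol38/canvas/1"],
    }],
}

print(json.dumps(manifest, indent=2))
```

A viewer pointed at a URL serving this JSON would show the canvas labels in its page-switcher and the range labels in its table of contents.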

[Screenshot: Screen Shot 2016-11-11 at 12.51.14.png]

There’s room to improve the viewer further: currently each image shows two pages (left side and right side) together, but for reading purposes it would probably be better to only show one side at a time (so that it can be larger, making the handwriting easier to read). This isn’t something the viewer currently supports (I don’t think?) but would be a useful improvement.

At the moment, there isn’t an easy way to save the new metadata about the case books back into the Wellcome Library systems – but this is something that can be investigated further.

The other metadata we’ve generated annotates which pages are about which patients. This could perhaps be imported back into the current main library systems using “person as subject” fields, but only at the book level, not the page-range level. (Although it’d still be better than nothing.)

Ultimately, given our goal was to enable researchers to follow the stories of individual patients, that might require a specialised interface which is more detailed than the generic library book finder & reader interface. Whether this should be bespoke to the requirements of the Ticehurst archive, or whether it might apply more generically across archival material, is an open question.

Finally, a thought on how all this metadata might be generated. One of the reasons given for none of this metadata already existing (not even page-number annotations) is that it is so time-consuming to create, especially in the context of mass-digitisation projects. I think we’ve partly answered this by showing that specialised tools can make the process a lot faster, and by demonstrating the usefulness of the results, but there will always be a trade-off between quality and quantity of metadata. Personally, I’d adjust this balance slightly by ensuring that some basic metadata (like page numbers) is always captured at the time of scanning. However, full indexing of the content might be best done later.

One approach, which we discussed a few times over the week, is to open up metadata annotation so that it’s not done just by library staff, but can also be contributed to by researchers using the material. After all, it’s quite possible that researchers are already selectively transcribing the handwritten material that they’re interested in, for use in their own publications or research, and so it makes sense to ask them if they’d mind contributing it back to the library so that future researchers don’t have to do the same job all over again.


Day 1 – Ticehurst

Hello. Our project for the week is to explore the archives of Ticehurst House Hospital, a private lunatic asylum which opened in 1792.

In particular, we’d like to try and make it easier to follow the individual stories of the patients who resided there, sometimes for many years. To do this currently involves a lot of work, as the patient records are mostly ordered chronologically, not by patient.

As part of the preparatory work, we looked at some of the existing research that has been done on these archives. One very useful resource is an index of patient names, which was painstakingly compiled by researcher Charlotte MacKenzie in the 1980s. Wellcome Library had a copy of this list (actually two copies, one organised by admission date, the other by patient surname), but only in printed form.

So our first task was to digitise the list. With limited time, we did this initially ourselves by scanning one of the multiple copies of the list using a photocopier – not the way that digitisation usually works here at Wellcome. (It has also since been put through the Internet Archive digitisation workflow.)

The result of this scanning was a set of 36 images. Digital, but still not that useful. To turn them into text we put the images through an OCR process, where a computer algorithm tries to detect the actual typewritten characters.

This worked pretty well – probably around 95% accurate. The 5% of errors were mostly caused by extra marks on the pages, such as the holes from the spiral-binding being detected as nonsense characters. There were also some handwritten notes on the pages, either adding corrections, or references to which file in the archives the data came from (useful!). The OCR software seems pretty terrible with handwriting, even though this is fairly neat.

So there was quite a bit of manual correction to do, plus re-formatting the text into tabular form where the OCR had flattened the columns. (I suspect there’s a way to configure OCR software to better scan tabular data, but I have no idea how to do that.)

At around 10 mins per page to do the OCR tidy-up, the total process took about 6 hours! Good job we started this before the week began.

With the index now formatted as plain text, most of Day 1 was spent importing this into a mini database (after first transforming the text files into CSV via some quick regular expressions).
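The text-to-CSV step can be sketched roughly like this. The sample lines and column names below are invented for illustration (the real index layout may differ), but the idea is the same: treat runs of two or more spaces as column separators and emit CSV rows.

```python
import csv
import io
import re

# Hypothetical sample of the OCR'd index text: columns were separated
# by runs of spaces on the typewritten page.
raw = """SMITH, John        12 Mar 1850    4 Jun 1852
JONES, Mary        3 Jan 1861     15 Feb 1863"""

# Split each line on runs of 2+ spaces to recover the columns.
rows = [re.split(r"\s{2,}", line.strip()) for line in raw.splitlines()]

# Write out as CSV (to a string buffer here; a file in practice).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "admitted", "discharged"])  # assumed column names
writer.writerows(rows)
print(buf.getvalue())
```

Single spaces (e.g. inside “SMITH, John”) survive the split, which is why the separator has to be two-or-more spaces rather than any whitespace.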

One thing to note is that the index is a list of patient stays (from admission to discharge), and many patients returned multiple times. So where the patient name is exactly the same, we’re grouping the information about the stays under a single patient record. It’s quite possible though that there were two patients with the exact same name which we’ve incorrectly grouped together, or that sometimes the same patient is recorded with a slightly different name in two places. Hopefully we’ll spot these errors later.
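The grouping logic amounts to keying stays by the exact name string, something like this sketch (the records are invented for illustration):

```python
from collections import defaultdict

# Hypothetical stay records: (patient name, admission year, discharge year).
stays = [
    ("SMITH, John", 1850, 1852),
    ("JONES, Mary", 1861, 1863),
    ("SMITH, John", 1870, 1871),  # same exact name -> same patient record
]

# Group stays under a single patient record per exact name.
patients = defaultdict(list)
for name, admitted, discharged in stays:
    patients[name].append((admitted, discharged))
```

Exact string matching is what makes the two failure modes in the text possible: two different people with identical names collapse into one record, while one person spelt two ways splits into two.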

For now, you can browse the list of patients on a simple website we’ve built.

[Screenshot: Screen Shot 2016-11-01 at 10.33.45.png]