Hello. Our project for the week is to explore the archives of Ticehurst House Hospital, a private lunatic asylum which opened in 1792.
In particular, we’d like to try and make it easier to follow the individual stories of the patients who resided there, sometimes for many years. To do this currently involves a lot of work, as the patient records are mostly ordered chronologically, not by patient.
As part of the preparatory work, we looked at some of the existing research that has been done on these archives. One very useful resource is an index of patient names, which was painstakingly compiled by researcher Charlotte MacKenzie in the 1980s. Wellcome Library had a copy of this list (actually two copies, one organised by admission date, the other by patient surname), but only in printed form.
So our first task was to digitise the list. With limited time, we did this initially ourselves by scanning one of the multiple copies of the list using a photocopier – not the way that digitisation usually works here at Wellcome. (It has also since been put through the Internet Archive digitisation workflow.)
The result of this scanning was a set of 36 images. Digital, but still not that useful. To turn them into text we put the images through an OCR process, where a computer algorithm tries to detect the actual typewritten characters.
This worked pretty well. Like, probably 95% accurate. The 5% of errors were mostly caused by extra marks on the pages, such as the holes from the spiral-binding being detected as nonsense characters. There were also some handwritten notes on the pages, either adding corrections, or references to which file in the archives the data came from (useful!). The OCR software seems pretty terrible with handwriting, even though this is fairly neat.
So there was quite a bit of manual correction to do, plus re-formatting the text back into tabular form where the OCR had guessed at the columns. (I suspect there’s a way to configure OCR software to better scan tabular data, but I have no idea how to do that.)
At around 10 mins per page to do the OCR tidy-up, the total process took about 6 hours! Good job we started this before the week began.
With the index now formatted as plain text, most of Day 1 was spent importing this into a mini database (after first transforming the text files into CSV via some quick regular expressions).
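For anyone curious what "quick regular expressions" might look like, here is a rough sketch in Python rather than our actual script. It assumes a made-up line layout (name, admission date, discharge date, separated by runs of two or more spaces), and the names and dates in the sample are invented for illustration, not taken from the index. The function name `text_to_csv` and the column headings are likewise hypothetical.

```python
import csv
import io
import re

# Assumption: columns in the tidied text are separated by two or more spaces.
# The real index layout may well differ.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_csv(text):
    """Convert tidied OCR text to CSV; return (csv_text, skipped_lines)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "admitted", "discharged"])  # hypothetical headings
    skipped = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = COLUMN_GAP.split(line)
        if len(fields) != 3:
            # Not the expected shape: keep it aside for manual checking
            skipped.append(line)
            continue
        writer.writerow(fields)
    return out.getvalue(), skipped


# Invented sample data, purely for illustration
sample = """\
Smith, John    1850-03-12    1851-06-04
Doe, Jane      1861-01-20    1861-09-02
o o o
"""

csv_text, skipped = text_to_csv(sample)
print(csv_text)
print("Needs a manual look:", skipped)
```

Splitting on wide gaps rather than single spaces keeps names like "Smith, John" in one field, and the csv module takes care of quoting the comma. Lines that don't fit (stray OCR marks, say) are set aside rather than silently dropped.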
One thing to note is that the index is a list of patient stays (from admission to discharge), and many patients returned multiple times. So where the patient name is exactly the same, we’re grouping the information about the stays under a single patient record. It’s quite possible though that there were two patients with the exact same name which we’ve incorrectly grouped together, or that sometimes the same patient is recorded with a slightly different name in two places. Hopefully we’ll spot these errors later.
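The grouping itself is simple enough to sketch. The Python below is an illustration rather than our database code: it groups stays by exact name, then uses the standard library's `difflib.SequenceMatcher` to list pairs of names that are suspiciously similar, which is one possible way to spot the "slightly different name" problem. The function names, the 0.85 threshold and the sample stays are all assumptions made up for this example. Note that it can't help with the opposite problem of two different people sharing one name.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def group_stays(stays):
    """Group (name, admitted, discharged) rows under one record per exact name."""
    patients = defaultdict(list)
    for name, admitted, discharged in stays:
        patients[name].append((admitted, discharged))
    return dict(patients)


def similar_names(names, threshold=0.85):
    """Return pairs of distinct names whose similarity ratio meets the threshold.

    The threshold is an arbitrary starting point, not a tested value.
    """
    pairs = []
    for a, b in combinations(sorted(names), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Invented sample stays, purely for illustration
stays = [
    ("Smith, John", "1850-03-12", "1851-06-04"),
    ("Doe, Jane", "1861-01-20", "1861-09-02"),
    ("Smith, John", "1855-07-01", "1855-11-30"),
    ("Smyth, John", "1858-02-14", "1858-05-09"),
]

patients = group_stays(stays)
for name, patient_stays in patients.items():
    print(name, len(patient_stays))

# Candidate pairs for a human to review, not automatic merges
print(similar_names(patients))
```

Comparing every pair of names is fine for an index of this size, though it would get slow on a much longer list. Anything it flags still needs a person to check against the archive before merging.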
For now, you can browse the list of patients on a simple website we’ve built.