Day 3 – Drink and health

November 2, 2016, hewalkerblog

Or, ‘how we wrote Elasticsearch queries.’ It’s good to have these different skills and forms of expertise in the team, not least because we have to explain to each other what we’re doing – and why it’s worth doing. Nat has been wrestling with the data all day today, while Rioghnach has been doing two essential jobs – 1. providing the project with some kind of documentation and reflection, and 2. checking some of our searches on the London’s Pulse and JISC Medical Heritage Library sites. I’ve been trying to work out whether the search terms we came up with are too focused (meaning we are only going to find the sorts of ideas about alcohol we started off with) or too broad (meaning we get a broad selection but also a whole load of red herrings.)

I suggested there might be three kinds of results:

Discussions of alcohol we were expecting. Medical Officers returned the number of deaths from cirrhosis in many reports, for example, because that was their job.
Discussions of alcohol we weren’t expecting. There’s much more on checking samples of beer and other drinks for adulteration than I was expecting, and that’s rather different to worrying about cirrhosis, because Officers were trying to make sure that drinkers got all that lovely alcohol they’d paid for and not (for example) arsenic poisoning.
Hits that aren’t actually talking about alcohol at all, despite one or more terms that look like they are.

We cut out a few of the terms, partly because of this last point, but also because Nat’s wizardry began to show us some actual patterns (for the test sample, at least). There were a few OCR issues, too, where terms were absent from one search though we knew they were there somewhere. We ended up asking Elasticsearch to run a two-step search, first looking for key terms that must refer to drink, and then searching for other terms. We then thought we might wait until all the data sets were ready before tweaking the searches any further (including one code-named ‘all the drinks.’)
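For readers who want to picture the two-step search, here is a minimal sketch of how it could be expressed as a single Elasticsearch `bool` query, built as a plain Python dictionary. It is not the query Nat actually wrote: the field name `text`, the function name and the example terms are all assumptions made for illustration.

```python
import json


def build_two_step_query(anchor_terms, other_terms, field="text"):
    """Build a bool query body: anchor terms gate the results, other terms are then matched.

    The field name and both term lists are hypothetical placeholders.
    """
    return {
        "query": {
            "bool": {
                # Step 1: keep only documents containing at least one
                # term that must refer to drink.
                "filter": [
                    {
                        "bool": {
                            "should": [{"match_phrase": {field: t}} for t in anchor_terms],
                            "minimum_should_match": 1,
                        }
                    }
                ],
                # Step 2: within those documents, require at least one
                # of the broader terms.
                "must": [
                    {
                        "bool": {
                            "should": [{"match_phrase": {field: t}} for t in other_terms],
                            "minimum_should_match": 1,
                        }
                    }
                ],
            }
        }
    }


# Example terms only; the project's real term lists are not reproduced here.
query = build_two_step_query(["alcohol", "beer"], ["adulteration", "cirrhosis"])
print(json.dumps(query, indent=2))
```

Putting the drink terms in `filter` means they decide which documents are eligible without affecting the relevance score, so the ranking reflects only the second set of terms.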

So we have decided to look at a set of about 3,000 MOH reports, 50 cookery books and household manuals, and 14 handbooks for medical examiners, covering the period between 1842 and 1930. We hope to have this ready to go tomorrow, and then we need to settle some lingering questions about how we present the results. Are we interested in the distant reading of these three data sets, so that we can quickly identify relationships between terms and different kinds of documents? Or in using these broadly-conceived maps to dive into the more complicated contexts in which these words surfaced, roaming free range through the documents?

Team Drink and Health

Day 3 – UK MHL

November 2, 2016; November 4, 2016; Sarah Bull

We’re midway through Data Week, and our group is still in the thick of trying to figure out how to detect plagiarism—and other forms of similarity and difference between texts—through the UK Medical Heritage Library. We’re hoping that this will be worked out soon, so we’ll have some time to develop a way of visualising what we find in the UK MHL.

One of the challenges we’ve encountered is figuring out how to detect identical phrases and chunks of text, rather than similarity between all of the words in two books. The latter process will suggest, misleadingly, that two works on, say, syphilis, are virtually identical. Not too helpful for research purposes.
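To make the distinction concrete, here is a small sketch that contrasts shared word sequences with overall vocabulary overlap. It is only an illustration of the idea, not the method our group settled on: the function names, the five-word window and the three sample sentences are all invented for the example.

```python
import re


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(text, n=5):
    """Return the set of n-word sequences that occur in the text."""
    tokens = words(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_phrases(text_a, text_b, n=5):
    """Return the n-word phrases that appear verbatim in both texts."""
    common = word_ngrams(text_a, n) & word_ngrams(text_b, n)
    return sorted(" ".join(gram) for gram in common)


def vocabulary_overlap(text_a, text_b):
    """Jaccard similarity of the two vocabularies (the misleading measure)."""
    a, b = set(words(text_a)), set(words(text_b))
    return len(a & b) / len(a | b) if a | b else 0.0


# Made-up sentences: A and B share a topic, while C copies a run of words from A.
book_a = "The treatment of syphilis requires mercury given in small doses over many weeks."
book_b = "Mercury is required in the treatment of syphilis, but doses given must be small."
book_c = "As others note, mercury given in small doses over many weeks is the usual course."

print(vocabulary_overlap(book_a, book_b))  # high, although nothing is copied
print(shared_phrases(book_a, book_b))      # no five-word run in common
print(shared_phrases(book_a, book_c))      # the copied run shows up here
```

The window size is a trade-off: short windows pick up stock medical phrases, while long ones miss copying that has been lightly reworded.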

Another challenge has been figuring out the best source for extracting our data from the UK MHL. We can download full text from the Wellcome Library website or from the Internet Archive. The text from the Wellcome Library comes with metadata that we think could be useful, but the Internet Archive offers us a wider range of texts to choose from and compare. We’re giving text from the Internet Archive a go right now, with the dataset we have prepared, and have another one in the works for if we end up using text from the Wellcome site.

Thankfully, one of the books we’ve been using as a key text for testing out matching capabilities, Manhood: The Causes of its Premature Decline with Directions for its Perfect Restoration, is fun to work with!

Day 3 – Chemist and Druggist

November 2, 2016, hewalkerblog

Up to this point, our project – to develop ways of searching the enormous digital resource that is The Chemist and Druggist journal to present meaningful and visually stimulating results to researchers – has been a three-pronged attack: how to carry out a comprehensive search across the whole data set (that doesn’t take 3 hours to run), spearheaded by David; how to present the resulting adverts and images in a coherent and stimulating way, spearheaded by Olivia; and how to interpret the results to answer research questions and set them in historical context, spearheaded by Briony. As the technical side of the work focusses on improved ways of running searches, David jokingly described what we’ve been up to as “artisan production” and he has a point – the manual, labour- and time-intensive nature of what we’ve been able to achieve up to this point is clearly very restrictive. But we feel optimistic that we’re edging closer to a better way to bring the mass of material into focus.

Using our best existing search results, Olivia has created a timeline which displays each page that contains the search term “asthma”. For the first time, we can see our results visually, which is an exciting development. However, it also made it clear that the original search had not been as comprehensive as we had thought – back to the drawing board to refine the process. The visualisation has also thrown up a major challenge in the sheer quantity of material. The way the pages cascade into the structure as they load is certainly impressive, but it raises lots of questions about how best the presentation can ultimately be de-cluttered to allow effective use of the material. Our approach has been to organise the results by individual year and then issue date, which has resulted in some obvious patterning of similar adverts run over consecutive issues and prompted interesting ideas for research questions about subtle design changes and updated content. However, visually the mass of pages will need much more work, and that inevitably means returning to the problems of searching such an enormous data set, so that the results can be filtered into something useful.

Our initial aim was that a timeline approach would allow clear presentation of trends over time, but we also want to be able to carve up the results further to provide answers to research questions that are more refined than the overarching “what medicines were advertised to treat asthma in the Chemist and Druggist between 1859 and 2010?” Developing themes and filters to allow exploration of questions such as “what additional products were advertised by the same companies that made asthma remedies?” or “are there common active ingredients in the medicines across time?” would obviously be enormously helpful.

And we also want to allow users of the resource to take advantage of glimpses of other research avenues that they might grasp, so the context is all-important. Hence, for example, our decision that showing a cropped asthma advert is not as valuable as keeping it situated within its full journal page with its neighbouring adverts. If the end result for a researcher was that they were distracted by an adjacent advert for, say, ballroom floor polish (!) and pursued this through the journal, this would be an equally satisfying result for our project. While grappling with technical and visualisation issues, our motivation is still to make the richness of the journal’s content more accessible.

Day 2 – Ticehurst

November 2, 2016, hewalkerblog

Continuing our project of making the archives of Ticehurst House Hospital more accessible, Frankie updated our website on day two.

The archives hold a lot of information scattered through different series, e.g. admissions certificates, case notes, discharge records and bills. We wanted to bring all the separate material together for each patient. This is not the easiest thing to do in the paper or online archives, so we felt it would be useful to consolidate all this information by patient. The somewhat tedious but ultimately satisfying task of plucking out all the digitized images of the relevant information was my task for the day.
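As a rough picture of what consolidating by patient means in data terms, the sketch below regroups a flat list of image references, organised the way the archive is (by series), into one record per patient. The patient labels, image references and function name are placeholders invented for the example, not real Ticehurst identifiers.

```python
from collections import defaultdict

# Hypothetical index rows: (patient, series, image reference).
# The image references are placeholders, not real archive identifiers.
rows = [
    ("Patient A", "case notes", "image-0002"),
    ("Patient B", "admissions certificates", "image-0003"),
    ("Patient A", "admissions certificates", "image-0001"),
    ("Patient A", "bills", "image-0004"),
]


def consolidate_by_patient(rows):
    """Return {patient: {series: [image references]}} from series-ordered rows."""
    patients = defaultdict(lambda: defaultdict(list))
    for patient, series, image_ref in rows:
        patients[patient][series].append(image_ref)
    # Convert the nested defaultdicts to plain dicts for easier handling later.
    return {patient: dict(series) for patient, series in patients.items()}


for patient, series in consolidate_by_patient(rows).items():
    print(patient)
    for name, image_refs in sorted(series.items()):
        print("  ", name, image_refs)
```

A structure like this maps fairly directly onto a web page per patient, with one section per series.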

Richard had already identified 5 interesting-looking patients, and useful pointers such as record numbers and dates for each, that put us on the right track. Then it was a case of trawling the archives online to identify the relevant images. As these are handwritten papers and not printed books there’s no easy text search available. It’s often a question of just opening a set of papers online and scanning through until you find the person you want. Scanning through handwritten papers, even in nice Victorian copperplate, takes some time whilst you’re trying to get your eye in.

By the end of the day we’d managed to get a lot of image references for our patients and were using them to pull in the digitized images to our web pages. Our first patient to get the image treatment is Lady Maria Beauclerk. You can now see images of her admissions certificate, case notes (including some interesting doodles), her death note and her bills. Lady Maria was admitted to Ticehurst in 1851 and died over 20 years later from epilepsy whilst still an inmate at Ticehurst.

Richard has been researching more about these particular patients so we can learn more about them than what’s in the Ticehurst papers alone. We’ve even got retrospective diagnoses for three patients from Trevor Turner’s thesis ‘A diagnostic analysis of the casebooks of Ticehurst House Asylum, 1845-1890’.

Next up: trying to find a way to semi-automate finding and recording the digital images for these records to display on many more patients’ pages.

Written by Natalie Pollecutt, library systems officer