Welcome to Data Week

What happens if you put a bunch of data wranglers and a group of historians together in a room with a load of digital data for a week? This is one of the questions we hope to answer during Wellcome Data Week.

Over the past few years we’ve been busy turning our unique historical collections into a digital library – digital items and counting. Now we’re thinking about how the rest of the world can discover and use all this fantastic content.

During Data Week five small groups, consisting of data developers, researchers and Library staff, will be set loose on a selection of our digital collections for five uninterrupted days of exploration, analysis, play and experimentation.

Each group will start with a research topic proposed by the historian in their group. Together with the rest of the team they will look for ways to investigate their topic in a particular data set. That’s it. There are no other instructions. Each group is free to decide how they work, what they want to do and what their final outcomes will be.

It’s a rare treat for researchers and developers alike to have a space to just ‘play’ with data for a week without interruption. And what’s in it for Wellcome Collection? Well, a few things – we hope!

  • We get to test the robustness of our data, and find out how easy it is to download, manipulate and display.
  • We get to answer the question I started with: what does happen when developers and researchers come together for a week?
  • We hope to share some inspiring ideas and prototypes showing just what can be done with our digital collections.

Ticehurst – What Next?

It’s been a week since Data Week finished, and whilst it’s still fresh in our minds I thought I’d jot down a few thoughts on possible next steps.

Firstly, given that we managed, remarkably, to annotate all the case books with page numbers and types, thanks to one of the tools we built, it would be great to incorporate this data back into the main Wellcome Library interface.

If you view Volume 38 of the Case Books in the Wellcome Library viewer interface right now, you’ll see that there’s a “jump to page” feature at the top, but that it doesn’t work for this book because the data isn’t available to the player. There’s also no table of contents, a feature usually available for scanned books, but not for archival material.

To demonstrate how this could be improved, we generated a IIIF manifest for each of the case books (see an example for Volume 38), which is a data file (in JSON format) in the IIIF metadata format that the player can use. If you take this URL and plug it into the Universal Viewer demo you can see how it improves the interface: there’s now a table of contents, and the page-switcher works. (One slight bug is that the pages are labelled as “page 12 – 13 of Spine”, as “Spine” is the label of the last image.)
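For anyone curious what such a manifest involves, here is a minimal sketch in Python. It is not the code we used during the week. It loosely follows the IIIF Presentation API 2.x vocabulary (a manifest holding a sequence of canvases, plus ranges for the table of contents), and the function name, the shape of the `pages` input and the example URLs are all assumptions made for illustration.

```python
import json


def build_manifest(base_url, label, pages):
    """Build a minimal IIIF Presentation 2.x style manifest as a dict.

    `pages` is a hypothetical list of dicts, one per image, with the keys
    "label" (e.g. "12 - 13"), "type" (e.g. "Index"), "image_url",
    "width" and "height".
    """
    canvases = []
    for i, page in enumerate(pages):
        canvas_id = f"{base_url}/canvas/{i}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": page["label"],  # this is what "jump to page" can use
            "width": page["width"],
            "height": page["height"],
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": page["image_url"],
                    "@type": "dctypes:Image",
                    "width": page["width"],
                    "height": page["height"],
                },
            }],
        })

    # Group consecutive images of the same page type into ranges,
    # which a viewer can show as a table of contents.
    structures = []
    for canvas, page in zip(canvases, pages):
        if structures and structures[-1]["label"] == page["type"]:
            structures[-1]["canvases"].append(canvas["@id"])
        else:
            structures.append({
                "@id": f"{base_url}/range/{len(structures)}",
                "@type": "sc:Range",
                "label": page["type"],
                "canvases": [canvas["@id"]],
            })

    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{base_url}/manifest",
        "@type": "sc:Manifest",
        "label": label,
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
        "structures": structures,
    }


def manifest_json(base_url, label, pages):
    """Serialise the manifest so it can be saved and served to a viewer."""
    return json.dumps(build_manifest(base_url, label, pages), indent=2)
```

The page labels and page types are exactly the two things the annotation tool captures, which is why that small amount of metadata is enough to switch on both the page-switcher and the table of contents.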

There’s room to improve the viewer further: currently each image shows two pages (left side and right side) together, but for reading purposes it would probably be better to only show one side at a time (so that it can be larger, making the handwriting easier to read). This isn’t something the viewer currently supports (I don’t think?) but would be a useful improvement.

At the moment, there isn’t an easy way to save the new metadata about the case books back into the Wellcome Library systems – but this is something that can be investigated further.

The other metadata we’ve generated annotates which pages are about which patients. This could perhaps be imported back into the current main library systems using “person as subject” fields, but only at the book level, not the page-range level. (Although it’d still be better than nothing.)

Ultimately, given our goal was to enable researchers to follow the stories of individual patients, that might require a specialised interface which is more detailed than the generic library book finder & reader interface. Whether this should be bespoke to the requirements of the Ticehurst archive, or whether it might apply more generically across archival material, is an open question.

Finally, a thought on how all this metadata might be generated. One of the reasons given for none of this metadata already existing (not even page-number annotations) is that it is so time-consuming to create, especially in the context of mass-digitisation projects. I think we’ve partly answered this by showing both that specialised tools can make the process a lot faster and that the results are useful, but there will always be a trade-off between quality and quantity of metadata. Personally, I’d adjust this balance slightly by ensuring that some basic metadata (like page numbers) is always captured at the time of scanning. However, full indexing of the content might be best done later.

One approach, which we discussed a few times over the week, is to open up metadata annotation so that it’s not done just by library staff, but can also be contributed to by researchers using the material. After all, it’s quite possible that researchers are already selectively transcribing the handwritten material that they’re interested in, for use in their own publications or research, and so it makes sense to ask them if they’d mind contributing it back to the library so that future researchers don’t have to do the same job all over again.

Day 5 – Chemist and Druggist

At the end of 5 days, perhaps inevitably, we are just getting our teeth into the meat of the issues! As we planned yesterday, Alex has run a number of searches within our existing subset of pages that contain the word “inhaler.” Although not without its own problems, the results have enabled Olivia and David to investigate what more refined searches might look like. What would a search looking for pages containing “asthma” + “inhaler” look like? What about finding illustrations only that contain those terms? Would you be able to look for a brand name such as “Potter’s Asthma Cure” or “Ventolin” within these results? The resulting timeline visualisations show how advanced searches and filters might work, and show masses of potential to prompt further interesting research questions.

David has also created a timeline that shows, through density of coloured dots plotting the results, the frequency of results occurring in each issue of the journal.
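The counting behind a timeline like this can be sketched in a few lines of Python. This is only an illustration, not David’s code: it assumes each search hit arrives as a hypothetical `(issue_date, page_number)` pair with the issue date as an ISO-style string, and it draws a crude text version of the dot density.

```python
from collections import Counter


def hits_per_issue(hits):
    """Count search hits per journal issue.

    `hits` is an iterable of (issue_date, page_number) pairs; the
    format is an assumption made for this sketch.
    """
    return Counter(issue_date for issue_date, _page in hits)


def text_timeline(counts, dot="*"):
    """Render one line per issue, with one dot per hit in that issue."""
    lines = []
    for issue_date in sorted(counts):  # ISO date strings sort chronologically
        lines.append(f"{issue_date} {dot * counts[issue_date]}")
    return "\n".join(lines)
```

Feeding the result of `hits_per_issue` into `text_timeline` gives one row per issue, so busy issues stand out at a glance, which is the same idea the coloured dots convey.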

Spending time today experimenting with more of this user interface confirms that our ambitions are justifiable, but not currently attainable. Hopefully, if the time and resources are found in the future to spend on the back-end search functionality of The Chemist and Druggist, it will be a much more accessible and useful resource for a wide range of historians. So, we haven’t reached our holy grail this week, but it has been really enjoyable trying.

Day 5 – Ticehurst

Day 5 of Data Week really concentrated our minds as we knew we didn’t have long left. It turned out that one good way to get more done was to have more people working on our project. We put out a call amongst Wellcome Library staff to help us for half an hour or so using our Image annotator tool (described on Day 4). The task of transcribing page numbers is pretty straightforward, and quick, so it was easy enough to explain to our helpers and get them doing a bit of page numbering. It was really gratifying to receive help from across the Wellcome Library (Frankie coined this ‘staffsourcing’), and our thanks go to Danny, Tania, Lalita, Hannah, Chloe, Juulia, Philippa and Jonathan.

Frankie did yet more tool optimisation, this time for our second tool for transcribing the case book indexes. We can now delete items that have been added, rather than Frankie having to amend the database, which makes for more streamlined working. The case book pages now also have tables of contents showing patients and which pages their case notes are on:

[Image: a case book page with its table of contents of patients and page numbers]

We’d talked about doing some data visualisations on our first day of Data Week and did want to get some done even though we were running out of time. Frankie pulled out the case books’ spine images to make a clickable display which looks like books on a shelf:

[Image: case book spines displayed like books on a shelf]

Although this isn’t how the case books look in the archive I like to think that they may have looked like this whilst in use in the 19th century.

Frankie also did a useful visualisation with the stay dates of the patients. Stays are plotted for each patient, from 1792 for the first stay to 1989 for the last. These are colour-coded showing their status upon leaving, e.g. ‘cured’ or ‘not improved’. Frankie also gave the patients list clickable column headings so you can sort by name alphabetically, length of stay, and stay dates:

[Image: patient stays plotted over time, colour-coded by status on leaving]
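The sorting behind those clickable headings is simple to sketch. The following Python is an illustration only, not Frankie’s implementation: the `Stay` record, its field names and the column names are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Stay:
    """One patient stay; the fields are assumed for this sketch."""
    name: str
    admitted: date
    discharged: date
    status: str  # status on leaving, e.g. "cured" or "not improved"

    @property
    def length_days(self):
        return (self.discharged - self.admitted).days


# One sort key per clickable column heading.
SORT_KEYS = {
    "name": lambda stay: stay.name.lower(),
    "length": lambda stay: stay.length_days,
    "admitted": lambda stay: stay.admitted,
}


def sort_stays(stays, column, descending=False):
    """Return the stays ordered by the chosen column."""
    return sorted(stays, key=SORT_KEYS[column], reverse=descending)
```

Clicking a heading a second time would simply call `sort_stays` again with `descending=True`.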

And that brought us to the end of Data Week. I think we’ve all found it very interesting and gratifying to try out so many new things in just a week. If you get the right mix of subject and technical knowledge around a table a lot of good work can be achieved very quickly. Another Wellcome Data Week would definitely get my vote.

Written by Natalie Pollecutt

Day 5 – UK MHL

Data Week has drawn to a close all too fast, and we now have our prototype online!

As readers will see, we have uploaded a small corpus of works into our text-matching tool. If you click on the title of a work, it will bring you to a page that displays the entire text, and highlights areas where it matches that of other uploaded works. In addition to highlighting text matches, the tool displays them alongside one another, and provides links so that readers can view the full text of the work that shares text in common with the one under study.
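To give a flavour of what text matching involves, here is a small Python sketch. It is not the prototype’s code and the prototype may well work differently: this version simply compares two texts word by word using the standard library’s `difflib.SequenceMatcher` and keeps the shared runs that are long enough to be interesting. The function name and the `min_words` threshold are assumptions.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_words=5):
    """Find runs of words that appear in both texts.

    Returns a list of (start_in_a, start_in_b, passage) tuples, where
    the start positions are word offsets. Matching is exact and
    case-sensitive, so OCR errors or spelling variants will break a run.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False stops difflib from ignoring very common words
    # in long texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            passage = " ".join(words_a[block.a:block.a + block.size])
            passages.append((block.a, block.b, passage))
    return passages
```

The word offsets are what an interface would need in order to highlight the matching region in each work and link the two together. Comparing every pair of works this way gets slow quickly, which is one reason a larger corpus would need a more careful design.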

If this experience has taught us one thing, it’s that we needed more time to create a tool that can be used with so many different digitized texts, for so many different purposes. It was fun to see what could be done in a week, but the compressed timeline gave us little wiggle room for reflecting on how we wanted the tool to work (both in terms of how the code was written and in terms of the interface design) and even less wiggle room for experimentation.

If we were given, say, another week or two, we would love to improve the tool’s matching capabilities and make the interface more elegant and intuitive. We would also love to upload a much larger corpus to the tool and play around with that, and/or figure out whether (and if so, how) we could integrate it into an existing platform for viewing the Medical Heritage Library, so it would be easier for everyone to use.

Day 4 – Ticehurst

After manually pulling together the records for our five selected patients (which took a couple of days), we’ve spent the remainder of our time investigating ways to speed up doing this work for the remaining 1,641 patients.

To try and achieve this, we’ve built two tools. The first is an Image annotator. This lets us perform the simple yet crucial task of identifying what type of page each image contains (from Front Cover through the Index to the Back Cover).

The interface we built for this is fairly simple. It shows a full image, which helps us identify the type of page it is, e.g. cover, title page, index page, content, etc. It also shows zoomed-in views of the top-left and top-right corners of the image, which is where the page numbers usually appear, making our reading and transcribing quicker:

[Image: the Image annotator interface]
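Working out which regions to zoom into is just a bit of arithmetic. Here is a hedged Python sketch: the function name and the default percentages are guesses for illustration, not the values the tool uses. The boxes use the (left, upper, right, lower) convention, which is the form that image libraries such as Pillow accept for cropping.

```python
def corner_boxes(width, height, pct_w=25, pct_h=15):
    """Return crop boxes for the top-left and top-right corners.

    `pct_w` and `pct_h` are the box size as a percentage of the image
    width and height (assumed values). Integer arithmetic keeps the
    results exact.
    """
    box_w = width * pct_w // 100
    box_h = height * pct_h // 100
    top_left = (0, 0, box_w, box_h)
    top_right = (width - box_w, 0, width, box_h)
    return top_left, top_right
```

Because each image shows a two-page spread, the two boxes correspond to the left-hand and right-hand page numbers respectively.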

Frankie added some nifty optimisations to this tool to help us speed through the work. When index pages are indicated with a letter, e.g. ‘a’, ‘b’, the ‘Index’ page type gets automagically selected. When a number is indicated, as above, the page type is filled with ‘Content’. Sequences of letters or numbers can be zipped through just by tapping the space bar. This fills in the page numbers, the page type is dealt with as previously described, and you just have to keep an eye on it to make sure the images and numbers don’t get out of alignment. I was amazed at how much quicker we can progress with these small but crucial improvements.
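A rough sketch of that auto-fill logic in Python might look like the following. It is an approximation written after the fact, not the tool’s code; the function names are made up, and it handles one page label at a time even though each image actually shows two pages.

```python
def infer_page_type(label):
    """Guess the page type from what was typed in the page-number box."""
    if label.isdigit():
        return "Content"
    if len(label) == 1 and label.isascii() and label.isalpha():
        return "Index"
    return None  # anything else is left for a person to choose


def next_label(label):
    """Suggest the label that follows, as a tap of the space bar would."""
    if label.isdigit():
        return str(int(label) + 1)
    if len(label) == 1 and label.isascii() and label.isalpha():
        if label.lower() == "z":
            return None  # end of the alphabet: nothing sensible to suggest
        return chr(ord(label) + 1)
    return None
```

The human stays in the loop: the suggestion only saves typing, and you still need to check each image to catch skipped or misnumbered pages.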

The second tool of the day built on the output of the first. As we’ve now identified which images are the index pages, we want to transcribe the names from the index pages and assign images to them. Doing that will start to fill up the patient pages with images of their records. The tool shows an image of the index page to be transcribed along with fields to transcribe into. A very helpful dropdown list of all the names with their associated dates makes it easy to find the person – if you can read the handwriting on the page! With the person picked, you add the page number from the index page:

 

[Image: the index transcription tool]

Saving this name information along with the page number makes the person’s case notes appear on their patient page:

[Image: a patient page showing case note images]
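The way the two tools feed each other can be sketched as a pair of lookups. This Python is illustrative only: the class, its method names and the identifiers are all assumptions, and the real tools store this in a database rather than in memory.

```python
from collections import defaultdict


class CaseBookIndex:
    """Links patients to page images via transcribed index entries."""

    def __init__(self):
        # Output of the first tool: (volume, page_number) -> image id
        self.page_images = {}
        # Output of the second tool: patient id -> [(volume, page_number)]
        self.patient_pages = defaultdict(list)

    def annotate_image(self, volume, page_number, image_id):
        self.page_images[(volume, page_number)] = image_id

    def add_index_entry(self, patient_id, volume, page_number):
        key = (volume, page_number)
        if key not in self.patient_pages[patient_id]:
            self.patient_pages[patient_id].append(key)

    def remove_index_entry(self, patient_id, volume, page_number):
        """Undo a mistaken entry without editing the data by hand."""
        key = (volume, page_number)
        if key in self.patient_pages.get(patient_id, []):
            self.patient_pages[patient_id].remove(key)

    def images_for(self, patient_id):
        """Images to show on a patient page, in volume and page order."""
        keys = sorted(self.patient_pages.get(patient_id, []))
        return [self.page_images[k] for k in keys if k in self.page_images]
```

An index entry only produces an image once the matching page has been numbered in the first tool, which is why the page-numbering work had to come first.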

Another nice feature completed on day 4 was adding brief biographies of our five selected patients to their patient pages. You can see these by selecting the top five highlighted names on the patients page.

Frankie & Natalie

Day 4 – Chemist and Druggist

Confession: we spent 3 minutes today watching a stop motion version of Paddington Bear dancing to ‘Singin’ in the Rain’ (https://www.youtube.com/watch?v=kHg6QjhvsCM). Light relief? Yes, but it came about from a discussion of issues facing our project – honest.

Bear with me (excuse the pun) while I explain: ‘Singin’ in the Rain’ shows the impact of movies going from silent to sound in the late 1920s, a major technological development which we now take entirely for granted. Our parallel thought was that today we take for granted our ability to perform complex searches across massive amounts of data. The digitised material exists, and we expect to be able to find what we want from it. However, the root of our problems on the penultimate day of Data Week is that just because The Chemist and Druggist is fully digitised does not automatically mean that it is fully searchable. Far from it.

So, having grappled with searches for four days, we have decided to concentrate fully on the front end on our final day. Our approach is to pretend that the complete search functionality is running perfectly, and play with what we would be able to present to a researcher. So we’re faking this with a series of searches carried out on the full run of C&D: firstly to create a subset containing the word “inhaler”, then subsequent searches to split these results into adverts and/or articles, another search looking for the brand name “Ventolin”, and a final look at presenting results by the decade in which they appear. It isn’t feasible to carry these out in ‘real time’ at the moment, so we’re probably going to produce an animation that pretends we can. A stop motion animation if you like – thank you Paddington!
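For the technically minded, the chain of searches we are faking could be sketched like this in Python. It assumes something we do not actually have yet: that every page is already available as a tidy record with its text, its year and whether it is an advert or an article. The record keys and function names are invented for the sketch.

```python
from collections import defaultdict


def containing(pages, term):
    """Keep only the pages whose text mentions the term (ignoring case)."""
    term = term.lower()
    return [page for page in pages if term in page["text"].lower()]


def split_by_kind(pages):
    """Split results into groups such as "advert" and "article"."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["kind"]].append(page)
    return dict(groups)


def by_decade(pages):
    """Group results by decade (e.g. 1972 -> 1970), in date order."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["year"] // 10 * 10].append(page)
    return dict(sorted(groups.items()))
```

Each step narrows or regroups the output of the one before, so `by_decade(containing(containing(pages, "inhaler"), "Ventolin"))` would give the brand-name results decade by decade. The hard part is not this logic but getting clean, classified page records out of the digitised journal in the first place.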

Written by Briony Hudson