Many Wellcome works are printed books whose illustrations, figures, tables and so on are identified by OCR during digitisation.
Many of those images will have nearby text that is likely to be a caption describing the image.
The IIIF representation offers a quick way of getting pixels for the region of the page occupied by the image, and could be a way of finding nearby text.
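Getting those pixels is just string construction, because the IIIF Image API encodes the region directly in the URL path (`{region}/{size}/{rotation}/{quality}.{format}`). A minimal sketch; the dlcs.io service base used in the usage comment comes from the examples below:

```javascript
// Build a IIIF Image API URL for a rectangular region of an image.
// `serviceBase` is the image service id, e.g.
// https://dlcs.io/iiif-img/wellcome/1/994b1b9f-0757-4b62-9063-dfc17c9b9513
function buildRegionUrl(serviceBase, x, y, w, h) {
  // region = x,y,w,h in full-size pixel coordinates;
  // full size, no rotation, default quality, jpg format
  return `${serviceBase}/${x},${y},${w},${h}/full/0/default.jpg`;
}
```

Called with the coordinates of an OCR-identified region, this reproduces the example URLs further down.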
Example:
- Given an identifier b28047345
- I can get an IIIF representation https://wellcomelibrary.org/iiif/b28047345/manifest
- This will either be a manifest, as in this case, or a collection of manifests, when the identifier is a multi-volume work.
- The manifest has a sequence of canvases. Each canvas links to the raw METS-ALTO data, and also (via the `otherContent` property) to an annotation list derived from that data.
- This demo uses the annotations: https://tomcrane.github.io/wellcome-today/annodump.html?manifest=https://wellcomelibrary.org/iiif/b28047345/manifest&page=8
- Here's a typical annotation list: https://wellcomelibrary.org/iiif/b28047345/contentAsText/83
- As well as line-by-line annotations (you can toggle these in the UI via the checkbox) it also includes annotations that highlight regions identified during OCR.
- So we can do this: https://dlcs.io/iiif-img/wellcome/1/994b1b9f-0757-4b62-9063-dfc17c9b9513/299,3521,2183,70/full/0/default.jpg
- and this: https://dlcs.io/iiif-img/wellcome/1/994b1b9f-0757-4b62-9063-dfc17c9b9513/663,1532,1476,836/full/0/default.jpg
- ...just by looping through the linked annotations.
- The non-text annotations are grouped at the end of the annotation list for a page, so adjacency in the annotation list is not enough. The distance of a text line from the image could be measured, though, and the closest one chosen.
- A more sophisticated transformation of the METS-ALTO could identify the caption directly, especially if the OCR software supports this. Alternatively, an image-and-text-processing step could run after the basic OCR, using a trained method to identify an image caption. If you want the published representation of the object to include the caption, the image-identifying annotation can carry this textual body as well.
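The "measure the distance and choose the closest" step above can be as simple as comparing box centres. A sketch, assuming the annotations have already been parsed into plain `{x, y, w, h, text}` boxes (those field names are my own, not taken from the annotation list JSON):

```javascript
// Euclidean distance between the centres of two boxes.
function centreDistance(a, b) {
  const dx = (a.x + a.w / 2) - (b.x + b.w / 2);
  const dy = (a.y + a.h / 2) - (b.y + b.h / 2);
  return Math.sqrt(dx * dx + dy * dy);
}

// Pick the text-line box whose centre is closest to the image box.
function closestLine(imageBox, textLines) {
  let best = null;
  let bestDist = Infinity;
  for (const line of textLines) {
    const d = centreDistance(imageBox, line);
    if (d < bestDist) {
      bestDist = d;
      best = line;
    }
  }
  return best;
}
```

A centroid comparison is crude (a caption is usually *below* the image, so weighting vertical proximity under the image might do better), but it is enough to shortlist candidate captions.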
Helpers:
https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L161
https://github.com/tomcrane/wellcome-today/blob/gh-pages/script/annodump.js#L172
(apologies for my 90s-style JavaScript).
A more sophisticated approach could learn what text is likely to be a caption, but even without that you can use these techniques to get the pixels of the image (at whatever scale you need for a machine-learning task, via the IIIF parameters) and the text lines that are candidate captions.
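For the machine-learning case, the scaling happens server-side via the IIIF size parameter: in Image API 2.x, `!w,h` means "best fit within w×h, preserving aspect ratio". A sketch extending the region-URL pattern from the examples above:

```javascript
// Build a IIIF Image API URL for a region, scaled to fit within
// maxW x maxH (the 2.x "!w,h" best-fit size syntax preserves aspect ratio).
function buildScaledRegionUrl(serviceBase, x, y, w, h, maxW, maxH) {
  return `${serviceBase}/${x},${y},${w},${h}/!${maxW},${maxH}/0/default.jpg`;
}
```

So one loop over the linked annotations can emit, for each OCR-identified image, a fixed-size training image plus its nearby text lines.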