Monday, June 18, 2012

BHL and text-mining: some ideas

Some quick notes on possibilities for text-mining BHL (in rough order of priority). Any text-mining would have to be robust to OCR errors. I've created a group of OCR-related papers on Mendeley: "OCR - Optical Character Recognition", listed under Computer and Information Science.

Improve finding taxonomic names in text in the face of OCR errors

There is some published research on OCR errors that could be used to develop a tool to improve our ability to index OCR text. The outcome would be improved search in BHL (and other archives). (I've touched on some of these issues earlier.) One approach that looks interesting is anagram hashing (see Reynaert, 2008), which may be a cheap way to support approximate string matching in OCR text.

Reynaert, M. (2008). Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. Lecture Notes in Computer Science, 4919:617-630. doi:10.1007/978-3-540-78135-6_53
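To make the anagram hashing idea concrete, here is a rough sketch in Python. It is not Reynaert's implementation. It assumes the key for a string is the sum of its character code values raised to the fifth power, so that strings with the same letters in any order share a key, and a single-character OCR error can be undone by simple arithmetic on the key. The function names, the alphabet, and the edit-distance cut-off are all my own choices for illustration.

```python
from collections import defaultdict


def anagram_key(text: str) -> int:
    """Order-independent key: sum of code points raised to the fifth power."""
    return sum(ord(ch) ** 5 for ch in text.lower())


def build_index(names):
    """Map each anagram key to the set of lexicon names that share it."""
    index = defaultdict(set)
    for name in names:
        index[anagram_key(name)].add(name)
    return index


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidates(token, index, alphabet="abcdefghijklmnopqrstuvwxyz ", max_dist=2):
    """Find lexicon names within one character change (or a reordering) of token."""
    token_l = token.lower()
    key = anagram_key(token_l)
    found = set(index.get(key, ()))  # same letters, possibly transposed
    present = set(token_l)
    for old in present:
        # The OCR text has a spurious character: remove it from the key.
        found |= index.get(key - ord(old) ** 5, set())
        for new in alphabet:
            if new != old:
                # One character was misread as another: swap it in the key.
                found |= index.get(key - ord(old) ** 5 + ord(new) ** 5, set())
    for new in alphabet:
        # The OCR text dropped a character: add it to the key.
        found |= index.get(key + ord(new) ** 5, set())
    # Keys ignore character order, so confirm each hit with edit distance.
    return sorted(n for n in found if edit_distance(token_l, n.lower()) <= max_dist)


index = build_index(["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"])
print(candidates("Homo sapicns", index))  # ['Homo sapiens']
```

The appeal is that each lookup is a handful of dictionary probes rather than a comparison against every name in the lexicon. The final edit-distance check is there because an anagram key says nothing about character order.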


Recognition and extraction of literature cited

Given an article, extract all the references it cites. There's a fair amount of literature on automated citation extraction, but again we need to do this in the face of OCR errors and enormous variability in citation styles. The outputs could help build citation indexes, and also serve as data for the "bibliography of life". The citations could also be used to help locate further articles in BHL (e.g., using BioStor's OpenURL resolver).
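As a sketch of that last step, once a citation has been parsed into fields it can be turned into an OpenURL-style query. The code below uses the common OpenURL 0.1 key names (genre, title, atitle, volume, spage, epage, date). The dictionary field names and the resolver address are placeholders I've made up for illustration, and I haven't checked which keys any particular resolver accepts.

```python
from urllib.parse import urlencode

# Hypothetical parsed-citation field -> OpenURL 0.1 key.
FIELD_TO_KEY = {
    "journal": "title",
    "article_title": "atitle",
    "volume": "volume",
    "start_page": "spage",
    "end_page": "epage",
    "year": "date",
}


def openurl_query(base_url: str, citation: dict) -> str:
    """Build a resolver URL from whatever citation fields were extracted."""
    params = {"genre": "article"}
    for field, key in FIELD_TO_KEY.items():
        value = citation.get(field)
        if value:  # skip fields the citation parser could not recover
            params[key] = str(value)
    return base_url + "?" + urlencode(params)


# Placeholder resolver address; substitute the real resolver's URL.
url = openurl_query(
    "https://resolver.example.org/openurl",
    {"journal": "BMC Bioinformatics", "volume": 12, "start_page": 187, "year": 2011},
)
print(url)
```

Missing fields are simply left out, which matters here because OCR'd citations will often yield only a journal, volume, and starting page.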


Improved extraction of named entities (e.g., museum specimen codes) and localities (e.g., latitudes and longitudes, place names)

This would enable better geographic searches, and help start to link literature to museum specimen databases.
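A first pass at this could be as simple as a pair of regular expressions. The sketch below is only that: the patterns are my own guesses at common forms (degrees and minutes with a hemisphere letter, and an uppercase institutional acronym followed by a catalogue number), and real BHL text will need many more variants plus some tolerance for OCR damage.

```python
import re

# Degrees, optional minutes and seconds, then a hemisphere letter.
COORD = re.compile(r"""
    (?P<deg>\d{1,3})\s*[°ºo]\s*                   # degree sign, or an OCR stand-in
    (?:(?P<min>\d{1,2})\s*['’′]\s*)?              # optional minutes
    (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["”″]\s*)?    # optional seconds
    (?P<hem>[NSEW])\b
""", re.VERBOSE)

# Assumed shape of a specimen code: 3-6 capitals, then a number such as 1947.2.26.89.
SPECIMEN = re.compile(r"\b([A-Z]{3,6})\s+(\d+(?:[.\-]\d+)*)\b")


def find_coordinates(text):
    """Return (matched text, signed decimal degrees) for each coordinate found."""
    results = []
    for m in COORD.finditer(text):
        value = int(m["deg"]) + int(m["min"] or 0) / 60 + float(m["sec"] or 0) / 3600
        if m["hem"] in "SW":  # south and west are negative
            value = -value
        results.append((m.group(0), value))
    return results


def find_specimen_codes(text):
    """Return (acronym, catalogue number) pairs."""
    return SPECIMEN.findall(text)


sample = "Holotype USNM 123456, collected at 12°30'S, 45°15'E."
print(find_coordinates(sample))
print(find_specimen_codes(sample))
```

Patterns this loose will produce false positives (any capitalised abbreviation followed by a number looks like a specimen code), so the matches would need checking against a list of known institutional acronyms or a gazetteer.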

Automated recognition of articles within scanned volumes

My own approach has focussed on locating articles from citation metadata: given an article's title, journal, volume, and pagination, find the corresponding pages in BHL:

Page, R. D. (2011). Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library. BMC Bioinformatics, 12(1), 187. doi:10.1186/1471-2105-12-187

An alternative is to infer articles from just the scanned pages. There has been some limited work on this in the context of BHL:

Lu, X., Kahle, B., Wang, J. Z., & Giles, C. L. (2008). A metadata generation system for scanned scientific volumes. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’08 (p. 167). Association for Computing Machinery (ACM). doi:10.1145/1378889.1378918

The NLM has some cool stuff on automatically labelling the parts of a document; see "Automated Labeling in Document Images" and "Ground truth data for document image analysis". See also "Distance Measures for Layout-Based Document Image Retrieval".

Other links
I should also note that there's a relevant question on Stack Overflow about OCR correction, which links to tools like OCRSpell:

Taghva, K., & Stofsky, E. (2001). OCRSpell: an interactive spelling correction system for OCR errors in text. International Journal on Document Analysis and Recognition, 3(3), 125–137. doi:10.1007/PL00013558

Code is on GitHub.