Monday, August 10, 2015

Possible project: mapping authors to Wikipedia entries using lists of published works

220px Boulenger George 1858 1937One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI: 0000 0001 0888 841X. In case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority Control" section).

How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:

  1. Search Wikipedia for names that matched the author name of interest
  2. If one or more matches are found, grab the text of the Wikipedia pages, extract any literature cited (e.g., in the {cite} tag), get the bibliographic identifiers, and see if they match any in our search results.
  3. If we get one or more hits, then it's likely that the Wikipedia page is about the author of the papers we are looking at, and so we link to it.
  4. Once we have a link to Wikipedia, extract any external identifier for that person, such as ISNI or ORCID.
For this to work, it requires that the Wikipedia page cites works by the author in a way that we can harvest, and uses identifiers that we can match to those in the relevant database (e.g., BioStor, Crossef, etc.). We might also have to look at Wikipedia pages in multiple languages, given that English-language Wikipedia may be lacking information on scholars from non-English speaking countries (this will be a significant issue for many early taxonomists).

Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, certainly little in how their published works are listed (the section heading, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.

What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.