Friday, March 18, 2016

The Plant List, GBIF, and the primary literature

TL;DR; The Plant List is now in GBIF http://doi.org/10.15468/btkum2.

Readers of this blog may recall that I've had a somewhat jaundiced view of The Plant List. The first version was released with a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license, which allowed copying so long as you didn't create a derived work (The Plant List: nice data, shame it's not open). This is frankly about the silliest possible license for a data set as, from my perspective, the whole reason for releasing data is so that it can be combined and enhanced with other data.

The second release (version 1.1) dropped an explicit CC license in favour of almost the reverse position (!). You can't copy the list "as is" without permission, but you can make derivative works "without prior written permission from us" (see Terms of Use for The Plant List). Progress, of a sort.

So, for the last week I've been working on getting a version of The Plant List into GBIF, and I've finally managed to achieve this. There isn't a single place you can grab the whole plant list, so you have to scrape the web site for CSV files, then glue them together. I could argue that converting the data into a Darwin Core Archive is a derived work, but in case this seems not derivative enough (of course, nobody seems ready to define just what "derived" actually means) I started to augment the list of names by adding bibliographic identifiers. I've long argued (see e.g. Surfacing the deep data of taxonomy) that a fundamental limitation of existing taxonomic databases is that they don't explicitly link to the primary literature. This is why I built BioNames, and why I've been working to link the "micro citations" in IPNI to identifiers such as DOIs, JSTOR links, BioStor URLs and BHL page links (see project on github). So, I've added about 120,000 DOIs and JSTOR links to names in the plant list. This is a subset of the links I've found for IPNI, but for this first release I've tried to keep things simple. I've also made the link between a Plant List name and its DOI or JSTOR record via the IPNI identifier for the name, and the Plant List has omitted quite a few IPNI ids for reasons which aren't clear.
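The glue-and-link step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the GBIF dataset: the column names (`ID`, `Genus`, `Species`, `IPNI id`), the row ids, and the IPNI identifier are made-up placeholders, and only the DOI comes from this post.

```python
import csv
import io

# Stand-ins for two scraped per-family CSV files.
# Column names and ids are hypothetical; the real Plant List files may differ.
FAMILY_A = """ID,Genus,Species,IPNI id
row-1,Haniffia,albiflora,ipni-example-1
"""
FAMILY_B = """ID,Genus,Species,IPNI id
row-2,Meiogyne,,
"""

# Hypothetical lookup from IPNI id to a publication link (DOI or JSTOR URL).
LINKS_BY_IPNI = {
    "ipni-example-1": "http://doi.org/10.1111/j.1756-1051.2000.tb00745.x",
}


def read_rows(csv_texts):
    """Glue several CSV files into one list of row dictionaries."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows


def add_publication_links(rows, links_by_ipni):
    """Attach a link to each row whose IPNI id has one; otherwise leave it blank."""
    enriched = []
    for row in rows:
        ipni_id = (row.get("IPNI id") or "").strip()
        enriched.append({**row, "publication": links_by_ipni.get(ipni_id, "")})
    return enriched


rows = read_rows([FAMILY_A, FAMILY_B])
enriched = add_publication_links(rows, LINKS_BY_IPNI)
for row in enriched:
    print(row["Genus"], row["Species"], row["publication"] or "(no link)")
```

Because the join key is the IPNI id, any row where that id is missing simply ends up without a link, which is why the omitted ids matter.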

The Plant List version I've created is available in GBIF (http://doi.org/10.15468/btkum2 and http://www.gbif.org/dataset/d9a4eedb-e985-4456-ad46-3df8472e00e8). Having another list of plant names will be a useful addition to the checklists that GBIF already has, even if the Plant List is already somewhat out of date.

DOIs

One feature of the enhanced Plant List in GBIF is that for a subset of names (currently about 10%) there are direct links to the original publication of that name. For example, the record for Haniffia albiflora in the Plant List has a fairly cryptic bibliographic citation, Nordic J. Bot. 20: 287 2000, and no link to that publication. In the version I've uploaded to GBIF, the record for Haniffia albiflora gives the full citation. More importantly, the Publisher record link is the DOI http://doi.org/10.1111/j.1756-1051.2000.tb00745.x, so clicking on it takes you to the original description of this species.

There is a lot of plant taxonomic literature available in JSTOR, sadly most of it (along with specimen images) behind a paywall (see Why are botanists locking away their data in JSTOR Plant Science?). Some of the links from GBIF take you to JSTOR.

The DOI landscape is evolving, and there are now multiple DOI registration agencies minting DOIs for scientific papers. CrossRef provides easily the best services for discovery and metadata harvesting; other agencies often have no equivalent, which makes it hard to discover DOIs for their papers. I've spent some time getting this information for Chinese and Taiwanese articles, e.g. http://dx.doi.org/10.6165/tai.1985.30.5 and http://dx.doi.org/10.3969/j.issn.2095-0845.2005.04.002, to give two examples of articles that are now linked to from the corresponding species page in GBIF.
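Since the DOIs in this post turn up in several forms (http://doi.org/..., http://dx.doi.org/..., doi:...), a first practical step is to reduce them to a bare DOI and group them by prefix. The sketch below assumes only those forms need handling, and the function names are my own. Note that the prefix identifies the registrant, not the registration agency, so working out which agency minted a DOI would still need a separate lookup.

```python
import re

# Resolver and scheme prefixes seen in this post; other forms are left untouched.
_RESOLVER = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalise_doi(value):
    """Strip a resolver URL or 'doi:' prefix and lowercase the result."""
    return _RESOLVER.sub("", value.strip()).lower()


def registrant_prefix(doi):
    """Return the part of the DOI before the first slash, e.g. '10.1111'."""
    return normalise_doi(doi).split("/", 1)[0]


def group_by_prefix(dois):
    """Group normalised DOIs by their registrant prefix."""
    groups = {}
    for doi in dois:
        groups.setdefault(registrant_prefix(doi), []).append(normalise_doi(doi))
    return groups


examples = [
    "http://doi.org/10.1111/j.1756-1051.2000.tb00745.x",
    "http://dx.doi.org/10.6165/tai.1985.30.5",
    "http://dx.doi.org/10.3969/j.issn.2095-0845.2005.04.002",
]
groups = group_by_prefix(examples)
for prefix, members in sorted(groups.items()):
    print(prefix, members)
```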

It's all about the links

To reiterate, I believe that one of the key challenges facing biodiversity informatics is cross linking between disparate types of data and sources of information. At the moment most of our data resides in disconnected silos. The links I'm adding to plant names are a small step, but they can lead to all sorts of possibilities. For example, users of GBIF can click on a link and see the original paper. If GBIF doesn't have a map for the species discussed in that paper, it's likely that the paper has some relevant information (e.g., the type locality). If users click on the links, that is going to drive more traffic to the original literature, thus increasing its visibility. Furthermore, now that we have a taxon identifier (from GBIF) linked to a bibliographic identifier, we can go in the opposite direction. Earlier I proposed a Javascript bookmarklet as a way to augment the information on a web page (see Rethinking annotating biodiversity data). We could have a popup on an article web page that tells the user about the taxa mentioned in that paper. If GBIF has a map for those taxa, we can immediately place that paper in a geospatial context (e.g., Africa). This is barely scratching the surface of what is possible once we start breaking out of silos and sharing deeply linked data.

Monday, March 14, 2016

Towards a biodiversity knowledge graph

TL;DR; In order to build a usable biodiversity knowledge graph we should adopt JSON-LD for biodiversity data, develop reconciliation services to match entities to identifiers, and use a mixture of document and graph databases to store and query the data. To bootstrap this project we can create wrappers around each biodiversity data provider, and a central cache that is both a document store and a simple graph database. The power of this approach should be showcased by applications that use the central cache to tackle specific problems, such as augmenting existing data.
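
To make the JSON-LD idea a little more concrete, here is a rough sketch of one record and a toy "graph" built from it. The choice of Darwin Core and schema.org terms is my assumption for illustration, the taxon identifier is a made-up placeholder, and `to_triples` is a hand-rolled flattening for this one shape of record, not a real JSON-LD processor.

```python
import json

# A JSON-LD style record linking a name to the paper that published it.
# The "@id" for the taxon is a hypothetical placeholder, not a real GBIF id.
record = {
    "@context": {
        "dwc": "http://rs.tdwg.org/dwc/terms/",
        "schema": "http://schema.org/",
    },
    "@id": "urn:example:taxon:haniffia-albiflora",
    "dwc:scientificName": "Haniffia albiflora",
    "dwc:namePublishedInID": {
        "@id": "http://doi.org/10.1111/j.1756-1051.2000.tb00745.x",
        "@type": "schema:ScholarlyArticle",
    },
}


def expand(term, context):
    """Expand a 'prefix:local' term using the context; leave other strings alone."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term


def to_triples(node, context):
    """Flatten a nested node into (subject, predicate, object) tuples."""
    subject = node["@id"]
    triples = []
    for key, value in node.items():
        if key.startswith("@"):
            if key == "@type":
                triples.append((subject, "rdf:type", expand(value, context)))
            continue
        predicate = expand(key, context)
        if isinstance(value, dict):
            triples.append((subject, predicate, value["@id"]))
            triples.extend(to_triples(value, context))
        else:
            triples.append((subject, predicate, value))
    return triples


def subjects_linked_to(triples, obj):
    """Follow links backwards: which subjects point at this object?"""
    return [s for s, _, o in triples if o == obj]


document = json.dumps(record, indent=2)  # what a document store would hold
triples = to_triples(record, record["@context"])  # what a graph store would hold
doi = "http://doi.org/10.1111/j.1756-1051.2000.tb00745.x"
print(subjects_linked_to(triples, doi))
```

The same record serves both stores: the serialised document goes into the document database as-is, while the triples let us ask the reverse question of which taxa a given paper is linked to.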

I’ve thrown together some notes on building a biodiversity knowledge graph, and in the interests of making it interactive it's in the form of a web page: http://bionames.org/~rpage/towards-knowledge-graph/. There are buttons to click that display live data, and I hope to add more examples as I flesh out the ideas. I'm hoping to have a fully-functioning live demo that can be used to explore the notion of a knowledge graph and demonstrate what we can do with it. It will be pretty obvious that this is all a bit crude, but my goal is to try and sketch out a fully-functioning system that can create and query the graph, and support interesting applications.

Thursday, March 03, 2016

Cisco Pit Stop: Digitising the Natural History Museum’s collections

Last week (25-26 February) I was in London for the Cisco Pit Stop event. Thursday evening was at the Natural History Museum where I gave a talk extolling the virtues of linking stuff together:

My slides are here:

Friday we assembled at the Digital Catapult Centre, which, as Sandy Knapp notes, has some amazing views from its 9th floor.

A group of experts (loosely defined, at least, if they include me in that category) and small businesses (with backgrounds in digitisation, text-mining, publishing, etc.) got together to try and come up with workable ideas where we could marry issues in digitisation and subsequent use of that data with tools and markets. A fascinating experience, although I'm not yet sure what the outcome will be. But it's always useful talking (and listening) to people with very different backgrounds and notions of what matters (and what is possible).

iSpecies meets TreeBASE

I'm continuing to play with the new version of iSpecies, seeing just how far one can get by simply grabbing JSON from various sources and mashing them up. Since the Open Tree of Life is pretty unresolved ("OMG it's full of stars") I've started to grab trees from TreeBASE and add those. Sadly TreeBASE is showing its age and doesn't have a JSON API, so I had to break my rule of only using HTML and Javascript in iSpecies and write some PHP wrappers to talk to TreeBASE. Now, when you search for a genus or species you may see a list of studies from TreeBASE, and a popup menu where you can select a tree to view.

Below is an example (searching for the plant genus Fitzalania).

[Screenshot: iSpecies TreeBASE]

This example shows one reason phylogenies are useful. Although GBIF (which supplies the data for the map) recognises Fitzalania, a recent study in TreeBASE shows that recognising Fitzalania renders Meiogyne paraphyletic, and so moves Fitzalania into Meiogyne. Hence GBIF's taxonomy is somewhat behind the current state of knowledge about these plants.

The paper merging these two genera (doi:10.1600/036364414x680825) also shows up in the CrossRef results. Unfortunately TreeBASE doesn't have the DOI for the paper, so linking these two results (the TreeBASE study and the corresponding paper) will require some work. This is another reason why I'm playing with iSpecies: I want to see how many identifiers we can uncover to connect results from different sources, and how many cross links we need to add before it all comes together in a nice linked graph of data.