Friday, May 31, 2013

BioNames now live - Report on project

BioNames (http://bionames.org) is live. Getting to this point was supported by funding from EOL as part of their Computable Data Challenge. The award from EOL is paying for Ryan Schenk to work on the interface and overall design of the web site, and over the last few weeks we've been working increasingly frantically to get things ready. "Ready" is a relative concept. From my perspective the project is far from finished: there is a mound of data (millions of names, hundreds of thousands of publications) that is still being cleaned, cross-linked, and ultimately visualised. But the EOL funding came with a deadline and adult supervision (aka Cyndy Parr), so it was a great incentive to get something functional out the door.

What is BioNames?

Elsewhere I've argued that biodiversity informatics is fundamentally about linking stuff together, and BioNames tackles the link between a name and its publication. Ultimately I want each taxon name to be linked to its original description, and that description to have a digital identifier (such as a DOI). It's a small step, but building those links, coupled with (where possible) bringing those publications together in one place, provides a platform to potentially do some cool stuff (more on this later). Since about 2009 I've been working on building a database of these links, and have been documenting progress (or its lack) along the way (e.g., search iPhylo for "itaxon").

Here are some screen shots (and links so you can see for yourself). It's a very early stage release, but you'll get the idea.

GBIF classification of Rousettus with Ryan's awesome taxon name timeline.

Viewing a paper: "A Tarzan yell for conservation: a new chameleon, Calumma tarzan sp. n., proposed as a flagship species for the creation of new nature reserves in Madagascar".

Coverage of articles in a journal (Proceedings of the Entomological Society of Washington).

What got built?

There is a bunch of code and documentation online:


There is also a Darwin Core Archive format dump, which was *cough* fun to create.

There have been progress reports on this blog (search for BioNames). You can also see what we got up to in the github logs.

What didn't happen

The original proposal (http://dx.doi.org/10.6084/m9.figshare.92091) was, of course, a tad ambitious, and a number of things haven't made it into this release. Phylogenies are the biggest casualty, but they are close (see Viewing phylogenies on the web: Javascript conversion of Newick tree to SVG for experiments on visualisation). It just wasn't possible to get them ready in time for the May 31 deadline. But this is on the to do list.

What's next

Now that there is a functioning web site there are several directions to explore. There is a lot of data cleaning to do, many missing references to add, taxon names to map to GBIF and NCBI, and more. I've completely glossed over the issue of reconciling author names: it's clear that the same author can appear multiple times because of variations in how their name has been recorded in various databases. There are various ways to tackle this; the most interesting is to use tools like Mendeley or ORCID to enable people to "claim" their identity.

Now that there is a mapping between the NCBI taxonomy and taxon names linked to literature, it would be great to add phylogenetic data to BioNames (which was part of the original plan). One way is by importing PhyLoTA, another is by adding support for BLAST searches that generate trees. For example, for a given taxon we could create a list of suitable sequences (e.g., DNA barcodes) and enable users to generate BLAST trees to get a sense of what the taxon is related to (and, in many cases, how much genetic differentiation there is within that taxon).

Given that BioNames has a lot of full text (from BioStor as well as numerous sources of PDFs and scans) there is huge scope for data mining. Obvious things to do are extract taxon names, geographic localities, and specimen codes (using tools I already have for BioStor). Then there is the challenge of extracting lists of literature cited and building citation networks. A small proportion of the taxonomic literature exists in XML (e.g., articles in PLoS, ZooKeys, and various SciELO journals), which makes this task a lot easier. Given that many of the cited papers will already be in BioNames, we could build a taxonomic literature reader that enabled you to treat the literature in BioNames as one big interlinked, browsable archive. I'm posting a list of ideas on Trello.

Monday, May 27, 2013

Multiple DOIs for the same article issued by different publishers

I've stumbled on a case where two different publishers have issued different DOIs for the same article. In this case, Springer and J-Stage both publish the Japanese Journal of Ichthyology (ISSN 0021-5090). The following article:
Randall, J. E., & McCarthy, L. J. (1989). Solea stanalandi, a new sole from the Persian Gulf. Japanese Journal of Ichthyology, 36(2), 196–199. doi:10.1007/BF02914322

is published by Springer with the DOI http://dx.doi.org/10.1007/BF02914322, and this DOI is registered with CrossRef. J-Stage publish the same article, with the DOI (http://dx.doi.org/10.11369/jji1950.36.196). This DOI is not registered with CrossRef. I haven't been able to find an easy way to discover the DOI registration agency for a DOI (surely there should be a simple service that tells me this?).

This illustrates a problem with the success of DOIs and the existence of multiple registration agencies. When there was essentially a single agency for publications (CrossRef) it was relatively easy to ensure that DOIs for publications were unique. Now that there are multiple DOI registration agencies it is possible for conflicts to arise. We might expect this to be rare; after all, surely there's only one publisher for an article? However, the publishing landscape is more complicated than that, with articles being served by multiple publishers, and archiving projects like JSTOR and BHL having content that overlaps with that of existing publishers. Messy (sigh).
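One way to spot this kind of clash is to ignore the DOIs altogether and group records by bibliographic metadata. Here is a minimal sketch, assuming each record carries an ISSN, volume, and first page; the record layout and the function names `article_key` and `find_duplicate_dois` are made up for illustration, and real metadata would need far more cleaning than this.

```python
from collections import defaultdict


def article_key(issn, volume, first_page):
    """Build a publisher-independent key from bibliographic metadata."""
    return (issn.strip().upper(), str(volume).strip(), str(first_page).strip())


def find_duplicate_dois(records):
    """Group DOIs by article key; return only articles with more than one DOI."""
    by_article = defaultdict(set)
    for rec in records:
        key = article_key(rec["issn"], rec["volume"], rec["first_page"])
        # DOIs are case-insensitive, so compare them in lower case.
        by_article[key].add(rec["doi"].lower())
    return {key: sorted(dois) for key, dois in by_article.items() if len(dois) > 1}


# The two DOIs discussed above, keyed by journal ISSN, volume, and first page.
records = [
    {"doi": "10.1007/BF02914322", "issn": "0021-5090", "volume": 36, "first_page": 196},
    {"doi": "10.11369/jji1950.36.196", "issn": "0021-5090", "volume": 36, "first_page": 196},
]
duplicates = find_duplicate_dois(records)
```

For the Randall and McCarthy paper both DOIs end up under the same key, which is exactly the conflict we'd want flagged.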

Sunday, May 26, 2013

BioNames update - deadlines

This week promises to be *cough* interesting. The deadline for the BioNames project is the end of the week (May 31st), and all I have right now is a blank web page (gulp). Behind the scenes data is being cleaned, mushed, and reconciled, and CSS, Javascript, and HTML are being wrangled. It's going to be tight, but hey, what could possibly go wrong...?

Thursday, May 23, 2013

DOIs for specimens are here, but we're not quite there yet


I've been banging on about having citable, persistent identifiers for specimens, so was suitably impressed when Derek Sikes posted a comment on iPhylo that Arctos already does this. For example, here is a DOI for a specimen: http://dx.doi.org/10.7299/X7VQ32SJ.


So, we're all done, right? Not quite. DOIs by themselves don't get us where we (OK, where I think we) want to be. The DOI identifies a specimen, which is great (see discussion on iDigBio: You are putting identifiers on the wrong thing for why this matters). We can also get machine-readable metadata using the DOI (by using the URL http://data.datacite.org/10.7299/X7VQ32SJ). The metadata is limited (ideally we'd want something like Darwin Core), but it is a start. It's not clear how we get from the DOI to Darwin Core.
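As a rough sketch of asking for that machine-readable metadata from code, the snippet below builds an HTTP request that uses DOI content negotiation. Note the assumptions: it goes through the doi.org resolver with an `Accept` header rather than the data.datacite.org URL mentioned above, it assumes the DOI's registration agency honours the DataCite XML media type, and `metadata_request` is just a name I've invented.

```python
import urllib.parse
import urllib.request


def metadata_request(doi, media_type="application/vnd.datacite.datacite+xml"):
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": media_type})


# The Arctos specimen DOI from the example above.
request = metadata_request("10.7299/X7VQ32SJ")

# Nothing goes over the network until the request is opened, e.g.:
# with urllib.request.urlopen(request) as response:
#     xml = response.read().decode("utf-8")
```

Whatever comes back would still be DataCite's metadata schema, not Darwin Core, so the gap described above remains.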

There are at least two issues that remain to be tackled. The first is that we now have a bunch of identifiers for the same thing, e.g.:

Most of these identifiers don't know about each other (for example, GBIF doesn't know about the DOI, nor does Arctos link to GBIF). So we have disconnected pieces of information about the same thing.

The second issue is how do we discover a specimen DOI? CrossRef supports services where you can take a bibliographic citation, e.g. "Phylogeny and biogeography of ice crawlers (Insecta: Grylloblattodea) based on six molecular loci: designating conservation status for Grylloblattodea species", and get back a DOI (in this case, http://dx.doi.org/10.1016/j.ympev.2006.04.013). This makes it possible for publishers to take lists of literature cited in authors' manuscripts and quickly add DOIs to those citations. We don't have an equivalent service for specimens, which is going to make our task of linking specimens to sequences and the literature something of a challenge.
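To give a flavour of the citation-to-DOI lookup that exists for publications, here is a small sketch built around the Crossref REST API's `query.bibliographic` parameter. The helper names are hypothetical, the code only builds the URL and reads a parsed JSON response (it makes no network call), and the sample response is hand-written to mimic the shape of the API's output rather than copied from a real reply.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def citation_query_url(citation, rows=1):
    """Build a Crossref search URL for a free-text bibliographic citation."""
    return CROSSREF_WORKS + "?" + urlencode({"query.bibliographic": citation, "rows": rows})


def best_doi(response):
    """Return the DOI of the top-ranked hit in a parsed response, or None."""
    items = response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


url = citation_query_url(
    "Phylogeny and biogeography of ice crawlers (Insecta: Grylloblattodea) "
    "based on six molecular loci"
)

# Hand-written stand-in for the JSON the service would return for this query.
sample_response = {"message": {"items": [{"DOI": "10.1016/j.ympev.2006.04.013"}]}}
doi = best_doi(sample_response)
```

The top hit is only a best guess, so a real pipeline would want to check the match score or compare titles before trusting it. Nothing comparable exists for specimens, which is the point.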

We are making progress, but there is some way to go. Identifiers are only part of the solution; we also need services.

Thursday, May 16, 2013

The impact of museum collections: one collection ≈ one Nobel Prize

Ideas on measuring the "impact" of a natural history collection have been bubbling along, as reflected in recent comments on iPhylo, and some offline discussions I've been having with David Blackburn and Alan Resetar.

My focus has been at the specimen level, with a view to motivating the adoption of persistent specimen-level identifiers so that we can track citations of specimens over time (e.g., in publications and databases such as GenBank). Not only does this provide a measure of the "impact" of a collection, it helps with provenance. If we sequence a specimen that is subsequently assigned to a different taxon and we have a way of tracking that specimen via its identifier, then we can transmit that new identification to other consumers of data based on that specimen. For example, we could automatically notify GenBank that what we thought was an x is actually a y.

So I made a simple "league table" of museum collections based on specimens cited in BioStor. There are all sorts of issues with this approach. Once you rank collections, people may use that to argue some can be axed and more resources funnelled into others. A more positive approach would be to identify collections that are underused, and try to figure out why. And in the same way that taxonomic papers may have a long citation life, specimens may sit in a museum for a long time before being cited (for example, when eventually recognised as a new species doi:10.1016/j.cub.2012.10.029). So, metrics can be a double-edged sword.

Citing specimens is a useful metric, but not all citations are equal, and not all citations are immediate. A specimen that yields DNA sequences that are published in, say, Nature, arguably has more weight than a specimen listed in a rarely cited paper. Likewise, subsequent citations of a paper that cites a specimen should confer more weight on the value of that specimen. Elsewhere (doi:10.1093/bib/bbn022, preprint here: hdl:10101/npre.2008.1760.1) I've argued for a Google PageRank-style way to measure the impact of a specimen that takes into account papers and other objects derived from a specimen (e.g., images, sequences).
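To make the PageRank idea a bit more concrete, here is a toy sketch. It is a plain textbook power-iteration PageRank, not the method from the paper cited above, and the graph is entirely made up: each key is a citing or derived object (a paper, a sequence) and its list holds the things it cites or is derived from.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {source: [targets]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared among all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: specimen_1 is cited by a paper that is itself cited twice,
# and also has a sequence derived from it; specimen_2 is cited by one uncited paper.
links = {
    "paper_A": ["specimen_1"],
    "paper_B": ["paper_A"],
    "paper_C": ["paper_A"],
    "sequence_X": ["specimen_1"],
    "paper_D": ["specimen_2"],
}
scores = pagerank(links)
```

In this toy graph `specimen_1` scores higher than `specimen_2`, because weight flows to it from a well-cited paper and from a derived sequence rather than from a single uncited paper.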

Meanwhile, Morgan Jackson alerted me to a quicker way to get a measure of the impact of the collection.

The "short note" Morgan refers to is by Kevin Winker and Jack J. Withrow:
Winker, K., & Withrow, J. J. (2013). Natural history: Small collections make a big impact. Nature, 493(7433), 480–480. doi:10.1038/493480b

They constructed a Google Scholar profile and collected papers that cite the University of Alaska Museum's bird collection (see here for full details). The h-score of this collection of papers is 42, which Winker and Withrow note is "equivalent to an average Nobel laureate in physics". Here's the graph of citations over time:

[Chart: citations over time]
It's a neat trick, if a little time consuming. But one advantage it has is that it puts collections on a similar footing to individual researchers. You could imagine asking the question "how much money would you spend supporting a researcher at this level?" How does this compare to the resources actually being spent?
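For anyone wanting to try this on their own collection, the h-score itself is easy to compute once you have the citation counts for the papers: it is the largest h such that h papers each have at least h citations. A minimal sketch follows; the function name and the example counts are mine, not data from the Winker and Withrow note.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Made-up citation counts for five papers citing a collection.
example = h_index([10, 8, 5, 4, 3])
```

The time-consuming part is assembling the list of papers that cite the collection, which is where specimen-level identifiers would help.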

One thing I hope will emerge from discussions like this is a desire to make specimens first-class citizens of the web, with stable identifiers that enable them to be cited in the same way we cite papers and, increasingly, data sets.

Thursday, May 02, 2013

GBIF data quality: visualising Mesibov's millipedes


Bob Mesibov (who has been a guest author on this blog) recently published a paper on data quality in ZooKeys:

Mesibov, R. (2013). A specialist’s audit of aggregated occurrence records. ZooKeys, 293(0), 1–18. doi:10.3897/zookeys.293.5111

In this paper Bob documents some significant discrepancies between data in his Millipedes of Australia (MoA) database and the equivalent data in the Atlas of Living Australia and GBIF (disclosure, I was a reviewer of the paper, and also sit on GBIF's science committee). This paper spawned a thread on TAXACOM, and also came up at the GBIF meeting I was at earlier this week.

One thing lacking from the discussion is a clear sense of just how big the discrepancies between GBIF and MoA data are, so I grabbed the data provided by Bob (http://dx.doi.org/10.3897/zookeys.293.5111.app) and extracted the records where GBIF and MoA disagreed. I converted these to GeoJSON and threw them on Google Maps:

[Map: MoA and GBIF localities for the records where the two disagree]

You can see a live version here http://bl.ocks.org/rdmpage/raw/5501293/ (it can take a little while for the map to appear). I've connected the MoA and GBIF localities for the same occurrence by a straight line, and the MoA records are encircled by an estimate of their uncertainty (for many records the circle is invisible at this scale).
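The conversion to GeoJSON can be sketched roughly as follows. This is not the script behind the live map: the function names, the record layout, and the example coordinates are invented for illustration, and I've added a great-circle (haversine) distance as a property so the size of each displacement can be read off directly.

```python
import json
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def discrepancy_feature(record_id, moa, gbif):
    """Line joining the MoA and GBIF localities; inputs are (lat, lon) pairs."""
    return {
        "type": "Feature",
        # GeoJSON positions are [longitude, latitude], the reverse of (lat, lon).
        "geometry": {
            "type": "LineString",
            "coordinates": [[moa[1], moa[0]], [gbif[1], gbif[0]]],
        },
        "properties": {
            "id": record_id,
            "distance_km": round(haversine_km(moa[0], moa[1], gbif[0], gbif[1]), 1),
        },
    }


def to_feature_collection(pairs):
    """Wrap (id, moa, gbif) triples in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": [discrepancy_feature(*p) for p in pairs]}


# Made-up record: the same occurrence placed at two different points.
pairs = [("hypothetical-record-1", (-41.5, 146.0), (-42.5, 147.0))]
geojson_text = json.dumps(to_feature_collection(pairs), indent=2)
```

Sorting the features by `distance_km` would be a quick way to separate the spectacular discrepancies from the small displacements.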

There are some fairly spectacular discrepancies, and a lot of relatively small scale displacements of records. Does this matter? The answer to this question will depend on what people want to do with the data. You may regard the discrepancies as serious (certainly it's interesting that there are so many differences between the two data sets), or minor given the geographic scale. But visualising them at least makes it possible to form a judgement.