Friday, January 24, 2014

NCBI taxonomy database now shows type material

Scott Federhen told me about a nice new feature in GenBank that he's described in a piece for NCBI News. The NCBI taxonomy database now shows a list of type material (where known), and the GenBank sequence database "knows" about types. Here's the summary:

The naming, classification and identification of organisms traditionally relies on the concept of type material, which defines the representative examples ("name-bearing") of a species. For larger organisms, the type material is often a preserved specimen in a museum drawer, but the type concept also extends to type bacterial strains as cultures deposited in a culture collection. Of course, modern taxonomy also relies on molecular sequence information to define species. In many cases, sequence information is available for type specimens and strains. Accordingly, the NCBI has started to curate type material from the Taxonomy database, and are using this data to label sequences from type specimens or strains in the sequence databases. The figure below shows type material as it appears in the NCBI taxonomy entry and a sequence record for the recently described African monkey species, Cercopithecus lomamiensis.
Cercopithecus

You can query for sequences from type using the query "sequence from type"[filter]. This could lead to some nice automated tools. If you had a bunch of distinct clusters of sequences that were all labelled with the same species name, and one cluster includes a sequence from the type specimen, then the other clusters are candidates for being described under new names.
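As a rough sketch of what such a tool might look like, here is some Python that builds an NCBI E-utilities ESearch URL for the type filter, and then flags clusters that lack a type sequence. The function names, the restriction to a single organism, and the idea that clusters arrive as plain lists of accession numbers are all my assumptions; how you build the clusters in the first place is left open.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def type_sequence_query_url(organism, db="nuccore", retmax=100):
    """Build an ESearch URL for sequences from type material of one organism.

    Combining the filter with an [Organism] term is an assumption about how
    you would want to narrow the search; the filter itself is the one quoted
    in the post.
    """
    term = f'"{organism}"[Organism] AND "sequence from type"[filter]'
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


def candidate_new_name_clusters(clusters, type_accessions):
    """Split clusters that share a species name into two groups.

    clusters: list of lists of accession numbers (hypothetical input format).
    type_accessions: accessions known to come from type material.
    Returns (anchored, candidates): clusters containing a type sequence, and
    clusters that do not and so might need a different name.
    """
    type_accessions = set(type_accessions)
    anchored, candidates = [], []
    for cluster in clusters:
        if type_accessions.intersection(cluster):
            anchored.append(cluster)
        else:
            candidates.append(cluster)
    # If no cluster contains a type sequence we cannot say which cluster
    # "owns" the name, so nothing is flagged.
    if not anchored:
        return [], []
    return anchored, candidates
```

For example, with made-up accessions, `candidate_new_name_clusters([["ACC1", "ACC2"], ["ACC3"]], {"ACC2"})` keeps the first cluster as the one anchored to the type and flags the second as a candidate.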

VertNet starts issue tracking using GitHub

VertNet has announced that they have implemented issue tracking using GitHub. This is a really interesting development, as figuring out how to capture and make use of annotations in biodiversity databases is a problem that's attracting a lot of attention. VertNet have decided to use GitHub to handle annotations, but in a way that hides most of GitHub from users (developers tend to love things like GitHub; regular folks, not so much. See The Two Cultures of Computing).

The VertNet blog has a detailed walkthrough of how it works. I've made some comments on that blog, but I'll repeat them here.

At the moment the VertNet interface doesn't show any evidence of issue tracking (there's a link to add an issue, but you can't see if there are any issues). For example, visiting an example record, CUMV Amphibian 1766, I don't see any evidence on that page that there is an issue for this record (there is one, see https://github.com/cumv-vertnet/cumv-amph/issues/1). I think it's important that people see evidence of interaction (that way you might encourage others to participate). This would also enable people to gauge how active collection managers are in resolving issues ("gee, they fixed this problem in a couple of days, cool").

Likewise, it would be nice to have a collection-level summary in the portal. For example, looking at CUMV Amphibian 1766 I'm not able to click through to a page for CUMV Amphibians (why can't I do this at the moment - there needs to be a way for me to get to the collection from a record) to see how many issues there are for the whole collection, and how fast they are being closed.
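To make the idea of a collection-level summary concrete, here is a minimal sketch. It assumes each collection has its own GitHub repository (as with cumv-vertnet/cumv-amph) and that the issues have already been fetched as JSON from the GitHub REST API; the function names and the choice of median days-to-close as the "how fast" measure are mine, not anything VertNet has described.

```python
from datetime import datetime
from statistics import median


def issues_url(owner, repo):
    """URL listing open and closed issues for one collection's repository."""
    return f"https://api.github.com/repos/{owner}/{repo}/issues?state=all&per_page=100"


def _parse(timestamp):
    # GitHub returns ISO 8601 timestamps in UTC, e.g. "2014-01-24T10:00:00Z".
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def summarise_issues(issues):
    """Summarise a list of issue dicts as returned by the GitHub issues API.

    Uses the "state", "created_at" and "closed_at" fields. Pull requests also
    appear in that listing and carry a "pull_request" key, so they are skipped.
    """
    issues = [i for i in issues if "pull_request" not in i]
    open_count = sum(1 for i in issues if i["state"] == "open")
    closed = [i for i in issues if i["state"] == "closed" and i.get("closed_at")]
    days_to_close = [
        (_parse(i["closed_at"]) - _parse(i["created_at"])).total_seconds() / 86400
        for i in closed
    ]
    return {
        "open": open_count,
        "closed": len(closed),
        "median_days_to_close": median(days_to_close) if days_to_close else None,
    }
```

A portal could call `summarise_issues` once per collection and show the three numbers next to the collection name, which would be enough to give a feel for how responsive the curators are.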

I think the approach VertNet are using has a lot of potential, although it sidesteps some of the most compelling features of GitHub, namely forking and merging code and other documents. I can't, for example, take a record, edit it, and have those edits merged into the data. It's still a fairly passive "hey, there's a problem here", which means that the burden is still on curators to fix the issue. This raises the whole question of what to do with user-supplied edits. There's a nice paper regarding validating user input into Freebase that is relevant here, see "Trust, but Verify: Predicting Contribution Quality for Knowledge Base Construction and Curation" (http://dx.doi.org/10.1145/2556195.2556227 [not live yet], PDF here).

Wednesday, January 15, 2014

What I'll be working on in 2014: knowledge graphs and Google forests

More for my own benefit than anything else I've decided to list some of the things I plan to work on this year. If nothing else, it may make sobering reading this time next year.

A knowledge graph for biodiversity


Google's introduction of the "knowledge graph" gives us a happy phrase to use when talking about linking stuff together. It doesn't come with all the baggage of the "semantic web", or the ambiguity of "knowledge base". The diagram below is my mental model of the biodiversity knowledge graph (this comes from http://dx.doi.org/10.7717/peerj.190, but I sketched most of this for my Elsevier Challenge entry in 2008, see http://dx.doi.org/10.1038/npre.2008.2579.1).

Fig 1 1x

Parts of this knowledge graph are familiar: articles are published in journals, and have authors. Articles cite other articles (represented by a loop in the diagram below). The topology of this graph gives us citation counts (number of times an article has been cited), impact factor (citations for articles in a given journal), and author-based measures such as the H-index (a function of the distribution of citations for each article you have authored). Beyond simple metrics this graph also gives us the means to track the provenance of an idea (by following the citation trail).

Publication
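To show how these metrics fall out of the graph, here is a small sketch that treats citations as a list of (citing, cited) pairs and authorship as a mapping from author to articles. Both input formats and all the function names are assumptions made for illustration.

```python
from collections import Counter


def citation_counts(citations):
    """Count incoming citations per article.

    citations: iterable of (citing_article, cited_article) edges.
    """
    return Counter(cited for _citing, cited in citations)


def h_index(counts):
    """Largest h such that h articles each have at least h citations."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, cited_by in enumerate(ranked, start=1):
        if cited_by >= rank:
            h = rank
        else:
            break
    return h


def author_h_index(author, authorship, citations):
    """h-index for one author.

    authorship: dict mapping author -> list of article ids (hypothetical format).
    Articles with no incoming citations count as zero.
    """
    counts = citation_counts(citations)
    return h_index(counts.get(article, 0) for article in authorship.get(author, []))
```

Impact factor would be the same kind of calculation, grouping the counts by journal rather than by author (and restricting them to a time window).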

The next step is to grow this graph to include the other things we care about (e.g., taxa, taxon names, specimens, sequences, phylogenies, localities, etc.).

BioNames


I spent a good deal of last year building BioNames (for background see my blog posts or read the paper in PeerJ http://dx.doi.org/10.7717/peerj.190). BioNames represents a small corner of the biodiversity knowledge graph, namely taxonomic names and their associated publications (with added chocolatey goodness of links to taxon concepts and phylogenies). In 2014 I'll continue to clean this data (I seem to be forever cleaning data). So far BioNames is restricted to animal names, but now that the plant folks have relaxed their previously restrictive licensing of plant data (see post on TAXACOM) I'm looking at adding the million or so plant names (once I've linked as many as possible to digital identifiers for the corresponding publications).

Spatial indexing


Now that I've become more involved in GBIF I'm spending more time thinking about spatial indexing, and our ability to find biodiversity data on a map. There's a great Google ad that appeared on UK TV late last year. In it, Julian Bayliss recounts the use of Google Earth to discover virgin rainforest (the "Google forest") on Mount Mabu in Mozambique.



It's a great story, but I keep looking at this and wondering "how did we know that we didn't know anything about Mount Mabu?" In other words, can we go to any part of the world and see what we know about that area? GBIF goes a little way there with its specimen distribution maps, which give some idea of what is now known from Mount Mabu (although the map layers used by GBIF are terrible compared to what Google offers).

Mabu

But I want to be able to see all the specimens now known from this region (including the new species that have been discovered, e.g. see http://dx.doi.org/10.1007/s12225-011-9277-9 and http://dx.doi.org/10.1080/21564574.2010.516275). Why can't I have a list of publications relevant to this area (e.g., species descriptions, range extensions, ecological studies, conservation reports)? What about DNA sequences from material in this region (e.g., from organismal samples, DNA barcodes, metagenomics, etc.)? If GBIF is to truly be a "Global Biodiversity Information Facility" then I want it to be able to provide me with a lot more information than it currently does. The challenge is how to enable that to happen.
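The "what do we know about this area?" question is, at its simplest, a bounding-box query. Here is a minimal sketch that builds such a query against the GBIF occurrence search API, using range values for decimalLatitude and decimalLongitude. The function name is mine, and the box around Mount Mabu is only approximate (I'm assuming the mountain sits at roughly 16.3°S, 36.4°E).

```python
from urllib.parse import urlencode

# GBIF occurrence search endpoint.
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def occurrences_in_box_url(min_lat, max_lat, min_lon, max_lon, limit=20):
    """Build an occurrence search URL for a latitude/longitude bounding box.

    Ranges are passed as "min,max" values for decimalLatitude and
    decimalLongitude; hasCoordinate restricts results to georeferenced records.
    """
    params = {
        "decimalLatitude": f"{min_lat},{max_lat}",
        "decimalLongitude": f"{min_lon},{max_lon}",
        "hasCoordinate": "true",
        "limit": limit,
    }
    return f"{GBIF_SEARCH}?{urlencode(params)}"


# Approximate box around Mount Mabu (an assumption, not a surveyed boundary).
MABU_URL = occurrences_in_box_url(-16.4, -16.2, 36.3, 36.5)
```

This only covers occurrences, of course. Publications and sequences would need their own spatial indexes before the same box could be used to ask for everything known from the region.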

Thursday, January 09, 2014

Annotating GBIF: some thoughts

Given that it's the start of a new year, and I have a short window before teaching kicks off in earnest (and I have to revise my phyloinformatics course) I'm playing with a few GBIF-related ideas. One topic which comes up a lot is annotating and correcting errors. There has been some work in this area [1][2] but it strikes me as somewhat complicated. I'm wondering whether we couldn't try and keep things simple.

From my perspective there are a bunch of problems to tackle. The first is that occurrence data that ends up in GBIF may be incorrect, and it would be nice if GBIF users could (at the very least) flag those errors, and even better fix them if they have the relevant information. For example, it may be clear that a frog apparently in the middle of the ocean is there because latitude and longitude were swapped, and this could be easily fixed.
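The swapped-coordinates case is simple enough to sketch. The check below assumes we have some rough bounding box for where the taxon is expected to occur (where that box comes from is left open), and asks whether a record that falls outside it would fall inside if latitude and longitude were exchanged. The function names and the three return labels are hypothetical.

```python
def in_box(lat, lon, box):
    """box is (min_lat, max_lat, min_lon, max_lon) in decimal degrees."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def check_swapped(lat, lon, expected_box):
    """Classify a record as "ok", "swapped" or "unknown".

    "swapped" means the record is outside the expected box as given, but
    inside it once latitude and longitude are exchanged. This is only a
    suggestion to show to a user or curator, not an automatic fix.
    """
    if in_box(lat, lon, expected_box):
        return "ok"
    # A longitude can only be a mistaken latitude if it lies within +/-90.
    if abs(lon) <= 90 and in_box(lon, lat, expected_box):
        return "swapped"
    return "unknown"
```

With an expected box of `(-27, -10, 30, 41)`, a record at latitude 36.4, longitude -16.3 is flagged as "swapped", because exchanging the two values puts it back inside the box.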

Another issue is that data on an occurrence may not be restricted to a single source. It's tempting to think, for example, that the museum housing a specimen has the authoritative data on that specimen, but this need not be the case. Sometimes museums either lack (or decide not to make available) data such as geographic coordinates, but this information is available from other sources (such as the primary literature, or GenBank, see e.g. Linking GBIF and GenBank). Speaking of GenBank, there is a lot of basic biodiversity data in GenBank (such as georeferenced voucher specimens) and it would be great to add that data to GBIF. One issue, however, is that some of the voucher specimens in GenBank will already be in GBIF, potentially creating duplicate records. Ideally each specimen would be represented just once in GBIF, but for a bunch of reasons this is tricky to do (for a start, few specimens have globally unique identifiers, see DOIs for specimens are here, but we're not quite there yet), hence GBIF has duplicate specimen records. So, we are going to have to live with multiple records for the "same" thing.

Lastly there is the ongoing bugbear that URLs for GBIF occurrences are not stable. This is frustrating in the extreme because it defeats any attempt to link these occurrences to other data (e.g., DNA sequences, the Biodiversity Heritage Library, etc.). If the URLs regularly break then there is little incentive to go to the trouble of creating links between different databases, and biodiversity data will remain in separate silos.

So, we have three issues: user edits and corrections of data hosted by GBIF, multiple sources of data on the same occurrence, and lack of persistent links to occurrences.

If we accept that the reality is we will always have duplicates, then the challenge becomes how to deal with them. Let's imagine that we have multiple instances of data on the same occurrence, and that we have some way of clustering those records together (e.g., using the specimen code, the Darwin Core Triple, additional taxonomic information, etc.). Given that we have multiple records we may have multiple values for the same item, such as locality, taxon name, geo-coordinates, etc. One way to reconcile these is to use an approach developed for handling bibliographic metadata derived from citations, as described in [3] (PDF here). If you are building a bibliographic database from lists of literature cited, you need to cluster the citations that are sufficiently similar to be likely to be the same reference. You might also want to combine those records to yield a best estimate of the metadata for the actual reference (in other words, one author might have cited the article with an abbreviated journal name, another author might have cited only the first page, etc., but all might agree on the volume the article occurs in). Councill et al. use Bayesian belief networks to derive an estimate of the correct metadata.
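The clustering step might look something like the sketch below, which groups records on a normalised Darwin Core Triple (institutionCode, collectionCode, catalogNumber). The upper-casing and whitespace trimming are my assumptions about what "the same triple" should mean; real data would almost certainly need fuzzier matching, plus the extra taxonomic checks mentioned above.

```python
from collections import defaultdict


def darwin_core_triple(record):
    """Return a normalised "institution:collection:catalogue" key, or None.

    record: dict using the Darwin Core terms institutionCode, collectionCode
    and catalogNumber. Records missing any of the three get no key.
    """
    parts = (
        record.get("institutionCode"),
        record.get("collectionCode"),
        record.get("catalogNumber"),
    )
    if not all(parts):
        return None
    return ":".join(str(part).strip().upper() for part in parts)


def cluster_records(records):
    """Group records that share a triple.

    Returns (clusters, unclustered): a dict mapping triple -> list of records,
    and a list of records that could not be keyed.
    """
    clusters = defaultdict(list)
    unclustered = []
    for record in records:
        key = darwin_core_triple(record)
        if key is None:
            unclustered.append(record)
        else:
            clusters[key].append(record)
    return dict(clusters), unclustered
```

Each cluster then holds every version of the "same" occurrence (the museum record, a GenBank voucher, a user edit), which is what the reconciliation step needs as input.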

What is nice about this approach is that you retain all the original data, and you can weight each source by some measure of its reliability (i.e., the "prior"). Hence, we could weight a user's edits based on some measure, such as the acceptance of other edits they've made or, say, their authority (a user who is the author of a taxonomic revision of a group might know quite a bit about the specimens belonging to those taxa). If a user edits a GBIF record (say, by adding latitude and longitude values) we could add that as a "new" record, linked to the original, and containing just the edited values (we could also enable the user to confirm that other values are correct).

So, what do we show regular users of GBIF if we have multiple records for the same occurrence? In effect we compute a "consensus" based on the multiple records, taking into account the prior probabilities that each source is reliable. What about the museums (or other "providers")? Well, they can grab all the other records (e.g., the user edits, the GenBank information, etc.) and use them to update their records, if they so choose. If they do so, next time GBIF harvests their data, the GBIF version of that data is updated, and we can recompute the new "consensus". It would be nice to have some way of recording whether the other edits/records were accepted, so we can gauge the reliability of those sources (a user whose edits are consistently accepted gets "up voted"). The provider could explicitly tell GBIF which edits it accepted, or we could infer them by comparing the new and old versions.
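Here is a deliberately crude sketch of the consensus step. It is not the Bayesian belief network of Councill et al.; it is a plain weighted vote, in which each record contributes its source's reliability to every value it asserts, and the heaviest value wins for each field. The record layout, the default weight, and the smoothed acceptance rate used for "up voting" are all assumptions.

```python
from collections import defaultdict


def consensus(records, reliability, default_weight=0.5):
    """Weighted vote over the records in one cluster.

    records: list of {"source": str, "values": {field: value}} (hypothetical
    layout); a user edit is just another record holding only the edited fields.
    reliability: dict mapping source -> weight between 0 and 1.
    Ties go to the value that was seen first.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for record in records:
        weight = reliability.get(record["source"], default_weight)
        for field, value in record["values"].items():
            if value is not None:  # a missing value is not a vote
                votes[field][value] += weight
    return {field: max(tally, key=tally.get) for field, tally in votes.items()}


def updated_reliability(accepted, total):
    """Smoothed fraction of a source's edits that the provider accepted.

    Adding 1 to the numerator and 2 to the denominator means a source with
    no history starts at 0.5 rather than at 0 or 1.
    """
    return (accepted + 1) / (total + 2)
```

Note that nothing is overwritten: the original records stay as they are, and the consensus is recomputed whenever a record is added or a source's weight changes.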

To retain a version history we'd want to keep the new and old provider records. This could be done using timestamps - any record has a creation date, and an expiry date. By default the expiry date is far in the future, but if a record is replaced its expiry date is set to that time, and it is ignored when indexing the data.
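A minimal sketch of that timestamp scheme, assuming an in-memory list stands in for whatever database would really hold the records (the class and method names are made up):

```python
from dataclasses import dataclass
from datetime import datetime

# "Far in the future" expiry for records that have not been replaced.
FAR_FUTURE = datetime.max


@dataclass
class RecordVersion:
    record_id: str
    data: dict
    created: datetime
    expires: datetime = FAR_FUTURE


class VersionedStore:
    """Keeps every version of every record; nothing is ever deleted."""

    def __init__(self):
        self.versions = []

    def put(self, record_id, data, now):
        """Add a new version, expiring any current version of the same record."""
        for version in self.versions:
            if version.record_id == record_id and version.expires > now:
                version.expires = now
        self.versions.append(RecordVersion(record_id, data, now))

    def active(self, now):
        """Versions to index at time `now`: created, and not yet expired."""
        return [v for v in self.versions if v.created <= now < v.expires]

    def history(self, record_id):
        """All versions of one record, oldest first."""
        return [v for v in self.versions if v.record_id == record_id]
```

Because expired versions are kept, `active` can also be asked about a past date, which gives you the state of the index as it was at that time.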

How does this relate to duplicates? Well, GBIF has a habit of deleting whole sets of data if it indexes data from a provider and that provider has done something foolish, such as change the fields GBIF uses to identify the record (another reason why globally unique identifiers for specimens can't come soon enough). Instead of deleting the old records (and breaking any links to those records) GBIF could simply set their expiry date but keep them hanging around. They would not be used to create consensus records for an occurrence, but if someone used a link that had a now deleted occurrence id they could be redirected to the current cluster that corresponds to that old id, and hence the links would be maintained (albeit pointing to possibly edited data).
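The redirect could be as simple as never throwing away the mapping from an occurrence id to its cluster. In this sketch (all names hypothetical) an expired id still resolves, but is reported as a redirect rather than as a current record, and it no longer counts as a member of the cluster for consensus purposes.

```python
class OccurrenceIndex:
    """Maps occurrence ids to clusters, and remembers expired ids."""

    def __init__(self):
        self.cluster_of = {}  # occurrence id -> cluster id; entries are never removed
        self.expired = set()  # ids whose records have been expired

    def add(self, occurrence_id, cluster_id):
        self.cluster_of[occurrence_id] = cluster_id

    def expire(self, occurrence_id):
        """Mark a record as expired instead of deleting it."""
        if occurrence_id in self.cluster_of:
            self.expired.add(occurrence_id)

    def members(self, cluster_id):
        """Ids that still contribute to the consensus for a cluster."""
        return [
            occ
            for occ, cluster in self.cluster_of.items()
            if cluster == cluster_id and occ not in self.expired
        ]

    def resolve(self, occurrence_id):
        """Return (status, cluster_id), or None if the id was never known.

        status is "current" for a live record and "redirect" for an expired
        one, so a web front end could answer the latter with an HTTP redirect
        to the cluster's page.
        """
        cluster_id = self.cluster_of.get(occurrence_id)
        if cluster_id is None:
            return None
        status = "redirect" if occurrence_id in self.expired else "current"
        return status, cluster_id
```

So if a provider changes its identifiers and is re-indexed under new ids, the old ids are expired rather than deleted, the new ids join the same cluster, and any link using an old id still lands somewhere sensible.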

This is still a bit half-baked, but I think the challenge GBIF faces is how to make the best of messy data which may lack a single definitive source. The ability for users to correct GBIF-hosted data would be a big step forward, as would the addition of data from GenBank and the primary literature (the latter has the advantage that in many cases it will presumably have been scrutinised by experts). The trick is to make this simple enough that there is a realistic chance of it being implemented.

References



[1] Wang, Z., Dong, H., Kelly, M., Macklin, J. A., Morris, P. J., & Morris, R. A. (2009). Filtered-Push: A Map-Reduce Platform for Collaborative Taxonomic Data Management. 2009 WRI World Congress on Computer Science and Information Engineering (pp. 731–735). Institute of Electrical and Electronics Engineers. doi:10.1109/CSIE.2009.948

[2] Morris, R. A., Dou, L., Hanken, J., Kelly, M., Lowery, D. B., Ludäscher, B., Macklin, J. A., et al. (2013). Semantic Annotation of Mutable Data. (I. N. Sarkar, Ed.) PLoS ONE, 8(11), e76093. doi:10.1371/journal.pone.0076093

[3] Councill, I. G., Li, H., Zhuang, Z., Debnath, S., Bolelli, L., Lee, W. C., Sivasubramaniam, A., et al. (2006). Learning metadata from the evidence in an on-line citation matching scheme. Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’06 (p. 276). Association for Computing Machinery. doi:10.1145/1141753.1141817