Friday, June 30, 2017

Response to To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data

Nico Franz and Beckett W. Sterner recently published a preprint entitled "To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data" on bioRxiv http://dx.doi.org/10.1101/157214

Below is the abstract:

Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors "at the source". We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies - frequently called "backbones" - they generate, and which are in effect novel classification theories that operate at the core of data-structuring process. The Darwin Core standard for sharing occurrence records plays an underappreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e., unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.

Below I respond to some specific points that annoyed me about this article; at the end I try to sketch out a more constructive response. Let me stress that although I am the current Chair of the GBIF Science Committee, the views expressed here are entirely my own.

Trust and social relations

Trust is a complex and context-sensitive concept...First, trust is a dependence relation between a person or organization and another person or organization. The first agent depends on the second one to do something important for it. An individual molecular phylogeneticist, for example, may rely on GenBank (Clark et al. 2016) to maintain an up-to-date collection of DNA sequences, because developing such a resource on her own would be cost prohibitive and redundant. Second, a relation of dependence is elevated to being one of trust when the first agent cannot control or validate the second agent's actions. This might be because the first agent lacks the knowledge or skills to perform the relevant task, or because it would be too costly to check.

Trust is indeed complex. I found this part of the article to be fascinating, but incomplete. The social network GBIF operates in is much larger than simply taxonomic experts and GBIF: there are relationships with data providers, other initiatives, a broad user community, government agencies that approve its continued funding, and so on. Some of the decisions GBIF makes need to be seen in this broader context.

For example, the article challenges GBIF for responding to errors in the data by saying that these should be "corrected at source". This is a political statement, given that data providers are anxious not to cede complete control of their data to aggregators. Hence the model in which GBIF users see errors, those errors get passed back to the source (the mechanisms for this are mostly non-existent), the source fixes them, and then the aggregator re-harvests. This model makes assumptions about whether sources are either willing or able to fix these errors that I think are not really true. But the point is that this is less about not taking responsibility and more about avoiding treading on toes by taking too much responsibility. Personally I think GBIF should take responsibility for fixing a lot of these errors, because it is GBIF whose reputation suffers (as demonstrated by Franz and Sterner's article).

Scalability

A third step is to refrain from defending backbones as the only pragmatic option for aggregators (Franz 2016). The default argument points to the vast scale of global aggregation while suggesting that only backbones can operate at that scale now. The argument appears valid on the surface, i.e., the scale is immense and resources are limited. Yet using scale as an obstacle it is only effective if experts were immediately (and unreasonably) demanding a fully functional, all data-encompassing alternative. If on the other hand experts are looking for token actions towards changing the social model, then an aggregator's pursuit of smaller-scale solutions is more important than succeeding with the 'moonshot'.

Scalability is everything. GBIF is heading towards a billion occurrence records and several million taxa (particularly as more and more taxa from DNA barcoding are added). I'm not saying that tractability trounces trust, but it is a major consideration. Anybody advocating a change has got to think about how these changes will work at scale.

I'm conscious that this argument could easily be used to swat away any suggestion ("nice idea, but won't scale") and hence be a reason to avoid change. I myself often wish GBIF would do things differently, and run into this problem. One way around it is to make use of the fact that GBIF has some really good APIs, so if you want GBIF to do something different you can build a proof of concept to show what could be done. If that is sufficiently compelling, then the case for trying to scale it up is going to be much easier to make.
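To give a flavour of how little code a first proof of concept needs, here is a minimal Python sketch that asks GBIF how many occurrence records it holds for a name. It assumes the public v1 REST endpoints under `https://api.gbif.org/v1` (name matching via `species/match`, counting via `occurrence/search` with `limit=0`); the function names are my own inventions for illustration, and there is no error handling or paging.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public GBIF v1 REST API base URL.
GBIF_API = "https://api.gbif.org/v1"


def species_match_url(name):
    """Build the URL that asks GBIF to match a scientific name to its backbone."""
    return f"{GBIF_API}/species/match?{urlencode({'name': name})}"


def occurrence_count_url(taxon_key):
    """Build an occurrence search URL; limit=0 asks for the count without records."""
    return f"{GBIF_API}/occurrence/search?{urlencode({'taxonKey': taxon_key, 'limit': 0})}"


def fetch_json(url):
    """Fetch a URL and decode the JSON response (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def occurrence_count(name):
    """Return the number of occurrences GBIF reports for a name, or None if unmatched."""
    match = fetch_json(species_match_url(name))
    taxon_key = match.get("usageKey")  # absent when the name cannot be matched
    if taxon_key is None:
        return None
    return fetch_json(occurrence_count_url(taxon_key))["count"]
```

Calling `occurrence_count("Aedes aegypti")` would make two HTTP requests. Anything more interesting (alternative classifications, error flagging, and so on) would be layered on top of calls like these.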

Multiple classifications

As a social model, the notion of backbones (Bisby 2000) was misguided from the beginning. They disenfranchise systematists who are by necessity consensus-breakers, and distort the coherence of biodiversity data packages that reflect regionally endorsed taxonomic views. Henceforth, backbone-based designs should be regarded as an impediment to trustworthy aggregation, to be replaced as quickly and comprehensively as possible. We realize that just saying this will not make backbones disappear. However, accepting this conclusion counts as a step towards regaining accountability.

This strikes me as hyperbole. "They disenfranchise systematists who are by necessity consensus-breakers". Really? Having backbones in no way prevents people doing systematic research, challenging existing classifications, or developing new ones (which, if they are any good, will become the new consensus).

We suggest that aggregators must either author these classification theories in the same ways that experts author systematic monographs, or stop generating and imposing them onto incoming data sources. The former strategy is likely more viable in the short term, but the latter is the best long-term model for accrediting individual expert contributions. Instead of creating hierarchies they would rather not 'own' anyway, aggregators would merely provide services and incentives for ingesting, citing, and aligning expert-sourced taxonomies (Franz et al. 2016a).

Backbones are authored in the sense that they are the product of people and code. GBIF's is pretty transparent (code and some data on github, complete with a list of problems). Playing Devil's advocate, maybe the problem here is the notion of authorship. If you read a paper with hundreds of authors, why does that give you any greater sense of accountability? Is each author going to accept responsibility for (or be able to talk cogently about) every aspect of that paper? If aggregators such as GBIF and GenBank didn't provide a single, simple way to taxonomically browse the data I'd expect it would be the first thing users would complain about. There are multiple communities GBIF must support, including users who care not at all about the details of classification and phylogeny.

Having said that, obviously these backbone classifications are often problematic and typically lag behind current phylogenetic research. And I accept that they can impose a certain view on how you can query data. GenBank for a long time did not recognise the Ecdysozoa (nematodes plus arthropods) despite the evidence for that group being almost entirely molecular. Some of my research has been inspired by the problem of customising a backbone classification to better reflect more modern views (doi:10.1186/1471-2105-6-208).

If handling multiple classifications is an obstacle to people using or contributing data to GBIF, then that is clearly something that deserves attention. I'm a little sceptical, in that I think this is similar to the issue of being able to look at multiple versions of a document or GenBank sequence. Everyone says it's important to have, but I suspect very few people ever use that functionality. But a way forward might be to construct a meaningful example (in other words a live demo, not a diagram with a few plant varieties).

Ways forward

We view this diagnosis as a call to action for both the systematics and the aggregator communities to reengage with each other. For instance, the leadership constellation and informatics research agenda of entities such as GBIF or Biodiversity Information Standards (TDWG 2017) should strongly coincide with the mission to promote early-stage systematist careers. That this is not the case now is unfortunate for aggregators, who are thereby losing credibility. It is also a failure of the systematics community to advocate effectively for its role in the biodiversity informatics domain. Shifting the power balance back to experts is therefore a shared interest.

Having vented, let me step back a little and try to extract what I think the key issue is here. Issues such as error correction, backbones, and multiple classifications are important, but I guess the real issue here is the relationship between experts such as taxonomists and systematists, and large-scale aggregators (note that GBIF serves a community that is bigger than just these researchers). Franz and Sterner write:

...aggregators also systematically compromise established conventions of sharing and recognizing taxonomic work. Taxonomic experts play a critical role in licensing the formation of high-quality biodiversity data packages. Systems of accountability that undermine or downplay this role are bound to lower both expert participation and trust in the aggregation process.

I think this is perhaps the key point. Currently aggregation tends to aggregate data and not provenance. Pretty much every taxonomic name has at one point or other been published by somebody. For various reasons (including the crappy way most nomenclature databases cite the scientific literature) by the time these names are assembled into a classification by GBIF the names have virtually no connection to the primary literature, which also means that who contributed the research that led to that name being minted (and the research itself) is lost. Arguably GBIF is missing an opportunity to make taxonomic and phylogenetic research more visible and discoverable (I'd argue this is a better approach than Quixotic efforts to get all biologists to always cite the primary taxonomic literature).

Franz and Sterner's article is a well-argued and sophisticated assessment of a relationship that isn't working the way it could. But to talk in terms of "power balance" strikes me as miscasting the debate. Would it not be better to try and think about aligning goals (assuming that is possible)? What do experts want to achieve? What do they need to achieve those goals? Is it things such as access to specimens, data, literature, sequences? Visibility for their research? Demonstrable impact? Credit? What are the impediments? What, if anything, can GBIF and other aggregators do to help? In what way can facilitating the work of experts help GBIF?

In my own "early-stage systematist career" I had a conversation with Mark Hafner about the Louisiana State University Museum providing tissue samples for molecular sequencing, essentially a "project in a box". Although Mark was complaining about the lack of credit for this (a familiar theme), the thing which struck me was how wonderful it would be to have such a service - here's everything you need to do your work, go do some science. What if GBIF could do the same? Are you interested in this taxonomic group? Well, here's the complete sum of what we know so far. Specimens, literature, DNA sequences, taxonomic names, the works. Wouldn't that be useful?

Franz and Sterner call for "both the systematics and the aggregator communities to reengage with each other". I would echo this. I think that the sometimes dysfunctional relationship between experts and aggregators is partly due to the failure to build a community of researchers around GBIF and its activities. The focus of GBIF's relationship with the scientific community has been to have a committee of advisers, which is a rather traditional and limited approach ("you're a scientist, tell us what scientists want"). GBIF might be better served if it provided a forum for researchers to interact with GBIF, data providers, and each other.

I started this blog (iPhylo) years ago to vent my frustrations about TreeBASE. At the time I was fond of a quote from a philosopher of science that I was reading, to the effect that we only criticise those things that we care about. I take Franz and Sterner's article to indicate that they care about GBIF quite a bit ;). I'm looking forward to more critical discussion about how we can reconcile the needs of experts and aggregators as we seek to make global biodiversity data both open and useful.

Friday, June 16, 2017

GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse

GBIF is running its Ebbe Nielsen Challenge for the third successive year. This year the title is Liberating species records from open data repositories for scientific discovery and reuse. To quote from the Challenge background on Devpost:

This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.

To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.

However, the practices adopted by most repositories, funders and journals do not yet encourage the use of standardized formats. This approach significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse of data implied if not stated by many open data policies falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.

In essence, the 2017 Challenge is to develop tools to discover these biodiversity-relevant datasets, and make them available to GBIF. In other words, we want tools to enable us to do this:

[Image: Goal]

As an example of the impact that external data can have on GBIF, last year I wrote a blog post (The Zika virus, GBIF, and the missing mosquitoes) describing how I took published data (doi:10.1038/sdata.2015.35) from the Dryad repository and added it to GBIF. The effect was dramatic:

Before: [image: 1651430]

After: [image: 1651430 updated]

This is just one example. I suspect that there is a lot of biodiversity data gathering digital dust sitting in repositories that could be more widely reused if we just had the tools to discover it, and convert it into a form that GBIF can use. Prove me right, and win cash prizes! Details at https://gbif2017.devpost.com.
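To make the "convert it into a form that GBIF can use" step a little more concrete, here is a minimal Python sketch that renames the columns of a deposited table to Darwin Core terms (`scientificName`, `eventDate`, `decimalLatitude`, `decimalLongitude`). The input column names, the `COLUMN_TO_DWC` mapping, and the sample rows are all made up for illustration; a real dataset would need its own mapping, plus validation of names, dates, and coordinates, none of which is attempted here.

```python
import csv
import io

# Hypothetical mapping from a repository table's column names to Darwin Core terms.
COLUMN_TO_DWC = {
    "species": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}


def to_darwin_core(rows, column_map=COLUMN_TO_DWC):
    """Yield one Darwin Core style record per input row that has a scientific name."""
    for row in rows:
        # Keep only mapped, non-empty columns, renamed to their Darwin Core terms.
        record = {
            dwc_term: row[column].strip()
            for column, dwc_term in column_map.items()
            if row.get(column)
        }
        # A row without a name cannot become an occurrence record, so skip it.
        if "scientificName" in record:
            yield record


# Made-up sample table standing in for a file downloaded from a data repository.
sample = (
    "species,date,lat,lon,collector\n"
    "Aedes aegypti,2014-03-02,-3.1,-60.0,anon\n"
    ",2014-03-03,1.0,2.0,anon\n"
)
records = list(to_darwin_core(csv.DictReader(io.StringIO(sample))))
```

Here `records` holds a single dictionary keyed by Darwin Core terms, because the second sample row has no species name and is dropped. The hard parts of the Challenge (discovering the datasets and working out the mapping automatically) are exactly what this sketch leaves out.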