Monday, December 11, 2017

Towards a digital natural history museum

These notes are the result of a few events I've been involved in the last couple of months, including TDWG 2017 in Ottawa, a thesis defence in Paris, and a meeting of the Science Advisory Board of the Natural History Museum in London. For my own benefit if no one else's, I want to sketch out some (less than coherent) ideas for how a natural history museum becomes truly digital.

Background

The digital world poses several challenges for a museum. In terms of volume of biodiversity data, museums are already well behind two major trends, observations from citizen science and genomics. The majority of records in GBIF are observations, and genomics databases are growing exponentially, through older initiatives such as barcoding, and newer methods such as environmental genomics. While natural history collections contain an estimated 109 specimens or "lots" [1], less than a few percent of that has been digitised, and it is not obvious that massive progress in increasing this percentage will be made any time soon.

Furthermore, for citizen science and genomics it is not only the amount of data but the network effects that are possible with that data that make it so powerful. Network effects arise when the value of something increases as more people use it (the classic example is the telephone network). In the case of citizen science, apart from the obvious social network that can form around a particular taxon (e.g., birds), there are network effects from having a large number of identified observations. iNaturalist is using machine learning to suggest identifications of photos taken by members. The more members join and add photos and identifications, the more reliable the machine identifications become, which in turn makes it more desirable to join the network. Genomics data also shows network effects. In effect, a DNA sequence is useless without other sequences to compare it with (it is no accident that the paper describing BLAST is one of the most highly cited in biology). The more sequences a genomics database has the more useful it is.

For museums the explosion of citizen science and genomics begs the question "is there any museum data that can show similar network effects"? We should also ask whether there will be an order of magnitude increase in digitisation of specimens in the near future. If not, then one could argue that museums are going to struggle to remain digitally relevant if they remain minority biodiversity data providers. Being part of organisations such as GBIF certainly helps, but GBIF doesn't (yet) offer much in the way of network effects.

Users

We could divide the users of museums into three distinct (but overlapping) communities. These are:

  1. Scientists
  2. Visitors
  3. Staff

Scientists make use of research and data generated by the museum. If the museum doesn't support science (both inside and outside the museum) then the rationale for the collections (and associated staff) evaporates. Hence, digitisation must support scientific research.

Visitors in this sense means both physical and online visitors. Online visitors will have a purely digital experience, but in person visitors can have both physical and digital experiences.

In many ways the most neglected category is the museum staff. Perhaps best way to make progress towards a digital museum is having the staff committed to that vision, and this means digitisation should wherever possible make their work easier. In many organisations going digital means a difficult transition period of digitising material, dealing with crappy software that makes their lives worse, and a lack of obvious tangible benefits (digitisation for digitisation's sake). Hence outcomes that deliver benefits to people doing actual work should be prioritised. This is another way of saying that museums need to operate as "platforms", the best way to ensure that external scientists will use the museums digital services is if the research of the museum's own staff depends on those services.

Some things to do

For each idea I sketch a "vision", some ways to get there, what I think the current reality is (and, let's be honest, what I expect it to still be like in 10 years time).


Vision: Anyone with an image of an organism can get a answer to the question "what is this?"

Task: Image the collection in 2D and 3D. Computers can now "see", and can accomplish tasks such as identify species and traits (such as the presence of disease [2]) from images. This ability is based on machine learning from large numbers of images. The museum could contribute to this by imaging as many specimens as possible. For example, a library of butterfly photos could significantly increase the accuracy of identifications by tools such as iNaturalist. Creating 3D models of specimens could generate vast numbers of training images [3] to further improve the accuracy of identifications. The museum could aim to provide identifications for the majority of species likely to be encountered/photographed by its users and other citizen scientists.

Reality: Imaging is unlikely to be driven by identification and machine learning, beiggest use is to provide eye-catching images for museum publicity.

Who can help: iNaturalist has experience with machine learning. More and more of research is appearing on image recognition, deep learning, and species identification.


Vision: Anyone with a DNA sequence can get a answer to the question "what is this?"

Task: DNA sequence the collection, focussing first on specimens that (a) have been identified and (b) represent taxonomic groups that are dominated by "dark taxa" in GenBank. Many sequences being added to GenBank are unidentified and hence unnamed. These will only become named (and hence potentially connected to more information) if we have sequences from identified material of those species (or close relatives). Often discussions of sequences focus on doing the type specimens. While this satisfies the desire to pin a name to a sequence in the most rigorous way, it doesn't focus on what users need - an answer to "what is this?" The number of identified specimens will far exceed the number of type specimens, and many types will not be easily sequenced. Sequencing identified specimens puts the greatest amount of museum-based information into sequence space. This will become even more relevant as citizen science starts to expand to include DNA sequences (e.g., using tools like MinION).

Reality: Lack of clarity over what taxa to prioritise, emphasis on type specimens, concerns over whether DNA barcoding is out of date compared to other techniques (ignoring importance of global standardisation as a way to make data maximally useful) will all contribute to a piecemeal approach.

Who can help: Explore initiatives such as the Planetary Biodiversity Mission.


Vision: A physical visitor to the museum has a digital experience deeply informed by the museum's knowledge

Task: The physical walls of the museum are not barriers separating displays from science but rather interfaces to that knowledge. Any specimen on display is linked to what we know about it. If there is a fossil on a wall, we can instantly see the drawings made of that specimen in various publications, 3D scans to interact with, information about the species, the people who did the work (whether historical figures or current staff), and external media (e.g., BBC programs).

Reality: Piecemeal, short-lived gimmicky experiments (such as virtual reality), no clear attempt to link to knowledge that visitors can learn from or create themselves. Augmented reality is arguably more interesting, but without connections to knowledge it is a gimmick.

Who could help: Many of the links between specimens, species, and people full into the domain of Wikipedia and Wikidata, hence lots of opportunities for working with GLAM Wiki community.


Vision: A museum researcher can access all published information about a species, specimen, or locality via a single web site.

Task: All books and journals in the museum library that are not available online should be digitised. This should focus on materials post 1923 as pre-1923 is being done by BHL. The initial goal is to provide its researchers with the best possible access to knowledge, the secondary goal is to open that up to the rest of the world. All digitised content should be available to researchers within the museum using a model similar to the Haithi Trust which manages content scanned by Google Books. The museum aggressively pursues permission to open as much of the digitised content up as it can, starting with its own books and journals. But it scans first, sorts out permissions later. For many uses, full access isn't necessarily needed, at least for discovery. For example, by indexing text for scientific names, specimen codes, and localities, researchers could quickly discover if a text is relevant, even if ultimately direct physically access is the only possibility for reading it.

Reality: Piecemeal digitisation hampered by the chilling effects of copyright, combined with limited resources means the bulk of our scientific knowledge is hard to access. A lack of ambition means incremental digitisation, with most taxonomic research remaining inaccessible, and new research constrained by needing access to legacy works in physical form.

Who could help: Consider models such as Hathi, work with BHL and publishers to open up more content, and text mining researchers to help maximise use even for content that can't be opened up straight away.


Vision: The museum as a "connection machine" to augment knowledge

Task: While a museum can't compete in terms of digital volume, it can compete for richness and depth of linking. Given a user with a specimen, an image, a name, a place, how can the museum use its extensive knowledge base to augment that user's experience? By placing the thing in a broader context (based on links derived from image -> identity tools, sequence -> identity tools, names to entities e.g., species, people and places, and links between those entites) the museum can enhance our experience of that thing.

Reality: The goal of having everything linked together into a knowledge graph is often talked about, but generally fails to happen, partly because things rapidly descend into discussions about technology (most of which sucks), and squabbling over identifiers and vocabularies. There is also a lack of clear drivers, other than "wouldn't it be cool?". Hence expect regular calls to link things together (e.g., Let’s rise up to unite taxonomy and technology), demos and proof of concept tools, but little concrete progress.

Who can help: The Wikidata community, initiatives such as (some of these are no longer alive but useful to investigate) Big Data Europe, BBC Things. The BBC's defunct Wildlife Finder is an example of what can be achieved with fairly simple technology.

Summary

The fundamental challenge the museum faces is that it is analogue in an increasingly digital world. It cannot be, nor should it be, completely digital. For one thing it can't compete, for another its physical collection, physical space, and human expertise are all aspects that make a museum unique. But it needs to engage with visitors that are digitally literate, it needs to integrate with the burgeoning digital knowledge being generated by both citizens and scientists, and it needs to provide its own researchers with the best possible access to the museum's knowledge. Above all, it needs to have a clear vision of what "being digital means".

References

1. Ariño, A. H. (2010). Approaches to estimating the universe of natural history collections data. Biodiversity Informatics, 7(2). https://doi.org/10.17161/bi.v7i2.3991

2. Ramcharan, A., Baranowski, K., McCloskey, P., Ahmed, B., Legg, J., & Hughes, D. P. (2017). Deep Learning for Image-Based Cassava Disease Detection. Frontiers in Plant Science, 8. https://doi.org/10.3389/fpls.2017.01852

3. Xingchao Peng, Baochen Sun, Karim Ali, Kate Saenko (2014) Learning Deep Object Detectors from 3D Models. https://arxiv.org/abs/1412.7122

Tuesday, December 05, 2017

Blue Planet II, the BBC, and the Semantic Web: a tale of lessons forgotten and opportunities lost

David Attenborough’s latest homage to biodiversity, Blue Planet II is, as always, visually magnificent. Much of its impact derives from the new views of life afforded by technological advances in cameras, drones, diving gear, and submersibles. One might hope that the supporting information online reflected the equivalent technological advances made in describing and sharing information. Sadly, this is not the case. Instead the BBC offers a web site with a video clips and a poster... a $%@£ poster.

Oceans poster feat

This is a huge missed opportunity. Where do people go to learn more about the organisms featured in an episode? How do we discover related content on the BBC and elsewhere? How do we discover the science underpinning each episode that has been so exquisitely filmed and edited?

Perhaps the lack of an online resource reflects a lack of resources, or expertise? Yet one look at the series (and the "Into the blue" epilogues) tells us that resources are hardly limiting. Furthermore, the BBC has previously constructed rich, informative web sites to support natural history programming. The now deprecated BBC Nature Wildlife site had an extensive series of web pages for the organisms featured in BBC programmes, with links to individual clips. For each organism the corresponding web page listed key traits such as behaviours, habitats, and geographic distribution, and each of these traits had its own web page list all organisms with those traits (see, for example the page for Steller's Sea Eagle).

Screenshot 2017 12 05 13 12 02

Underlying all this information was a simple vocabulary (the Wildlife Ontology), and the entire corpus is also available in RDF: in other words, the BBC used Semantic Web technologies to structure this information. To get this data you simply append ".rdf" to the URL for a web page. For example, below is the RDF for Steller's Sea Eagle. It is not pretty, but it is a great example of machine-readable data which enables all sorts of interesting things to be built.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:dc="http://purl.org/dc/terms/"
xmlns:dctypes="http://purl.org/dc/dcmitype/"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:po="http://purl.org/ontology/po/"
xmlns:wo="http://purl.org/ontology/wo/">
<rdf:Description rdf:about="/nature/species/Steller's_Sea_Eagle">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
<rdfs:seeAlso rdf:resource="/nature/species"/>
</rdf:Description>
<wo:Species rdf:about="/nature/life/Steller's_Sea_Eagle#species">
<rdfs:label>Steller's sea eagle</rdfs:label>
<wo:name rdf:resource="http://www.bbc.co.uk/nature/species/Steller's_Sea_Eagle#name"/>
<foaf:depiction rdf:resource="http://ichef.bbci.co.uk/naturelibrary/images/ic/640x360/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg"/>
<dc:description>Steller’s sea eagles are native to eastern Russia, inhabiting coastal cliffs and estuaries where they can easily access good fishing territories. They feed primarily on salmon, which they catch by swooping from perches located by the water's edge. Pairs are monogamous and hatch an average of two chicks each season, although crows and martens commonly take both eggs and young birds from the nest. During winter a small number of birds remain in Russia to tough it out, but the majority fly south to Japan.</dc:description>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/Steller's_Sea_Eagle"/>
<wo:adaptation rdf:resource="/nature/adaptations/Altricial#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Animal_migration#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Carnivore#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Flight#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Hearing_(sense)#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Monogamous_pairing_in_animals#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Oviparity#adaptation"/>
<wo:adaptation rdf:resource="/nature/adaptations/Parental_investment#adaptation"/>
<wo:livesIn rdf:resource="/nature/habitats/Coast#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Estuary#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Marsh#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/River#habitat"/>
<wo:livesIn rdf:resource="/nature/habitats/Swamp#habitat"/>
<wo:genus rdf:resource="/nature/life/Sea_eagle#genus"/>
<wo:family rdf:resource="/nature/life/Accipitridae#family"/>
<wo:order rdf:resource="/nature/life/Falconiformes#order"/>
<wo:class rdf:resource="/nature/life/Bird#class"/>
<wo:phylum rdf:resource="/nature/life/Chordate#phylum"/>
<wo:kingdom rdf:resource="/nature/life/Animal#kingdom"/>
</wo:Species>
<wo:TaxonName rdf:about="/nature/species/Steller's_Sea_Eagle#name">
<rdfs:label>Haliaeetus pelagicus</rdfs:label>
<wo:commonName>Steller's sea eagle</wo:commonName>
<wo:scientificName>pelagicuspelagicus</wo:scientificName>
<wo:kingdomName>animalia</wo:kingdomName>
<wo:phylumName>Chordata</wo:phylumName>
<wo:className>Aves</wo:className>
<wo:orderName>Falconiformes</wo:orderName>
<wo:familyName>Accipitridae</wo:familyName>
<wo:genusName>Haliaeetus</wo:genusName>
<wo:speciesName>pelagicus</wo:speciesName>
</wo:TaxonName>
<foaf:Image rdf:about="http://ichef.bbci.co.uk/naturelibrary/images/ic/640x360/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg">
<foaf:depicts rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
<foaf:thumbnail rdf:resource="http://ichef.bbci.co.uk/naturelibrary/images/ic/83x104/s/st/stellers_sea_eagle/stellers_sea_eagle_1.jpg"/>
</foaf:Image>
<po:Clip rdf:about="http://www.bbc.co.uk/programmes/p00dhn1t#programme">
<dc:title>Lunch on the wing</dc:title>
<po:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</po:Clip>
<po:Clip rdf:about="http://www.bbc.co.uk/programmes/p00382f5#programme">
<dc:title>Steller's sea eagle</dc:title>
<po:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</po:Clip>
<dctypes:Sound rdf:about="http://downloads.bbc.co.uk/earth/naturelibrary/assets/s/st/stellers_sea_eagle/5015017.mp3">
<dc:title>Calls from Steller's and white-tailed sea eagles</dc:title>
<dc:subject rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</dctypes:Sound>
<foaf:Document rdf:about="http://en.wikipedia.org/wiki/Steller's_Sea_Eagle">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://animaldiversity.ummz.umich.edu/site/accounts/information/Haliaeetus_pelagicus.html">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.arkive.org/stellers-sea-eagle/haliaeetus-pelagicus/">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.birdlife.org/datazone/species/index.html?action=SpcHTMDetails.asp&sid=3366&m=0">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.flickr.com/search/show/?q=steller+sea+eagle&s=int">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.iucnredlist.org/details/144342/0">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<foaf:Document rdf:about="http://www.natural-research.org/index.php?cID=169">
<foaf:primaryTopic rdf:resource="/nature/species/Steller's_Sea_Eagle#species"/>
</foaf:Document>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Altricial#adaptation">
<rdfs:label>Helpless young</rdfs:label>
</wo:ReproductionStrategy>
<wo:SurvivalStrategy rdf:about="/nature/adaptations/Animal_migration#adaptation">
<rdfs:label>Migration</rdfs:label>
</wo:SurvivalStrategy>
<wo:FeedingHabit rdf:about="/nature/adaptations/Carnivore#adaptation">
<rdfs:label>Carnivorous</rdfs:label>
</wo:FeedingHabit>
<wo:LocomotionAdaptation rdf:about="/nature/adaptations/Flight#adaptation">
<rdfs:label>Adapted to flying</rdfs:label>
</wo:LocomotionAdaptation>
<wo:CommunicationAdaptation rdf:about="/nature/adaptations/Hearing_(sense)#adaptation">
<rdfs:label>Acoustic communication</rdfs:label>
</wo:CommunicationAdaptation>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Monogamous_pairing_in_animals#adaptation">
<rdfs:label>Monogamous</rdfs:label>
</wo:ReproductionStrategy>
<wo:ReproductionStrategy rdf:about="/nature/adaptations/Oviparity#adaptation">
<rdfs:label>Egg layer</rdfs:label>
</wo:ReproductionStrategy>
<wo:LifeCycle rdf:about="/nature/adaptations/Parental_investment#adaptation">
<rdfs:label>Parental investment</rdfs:label>
</wo:LifeCycle>
<wo:TerrestrialHabitat rdf:about="/nature/habitats/Coast#habitat">
<rdfs:label>Coastal</rdfs:label>
</wo:TerrestrialHabitat>
<wo:MarineHabitat rdf:about="/nature/habitats/Estuary#habitat">
<rdfs:label>Estuaries</rdfs:label>
</wo:MarineHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/Marsh#habitat">
<rdfs:label>Marsh</rdfs:label>
</wo:FreshwaterHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/River#habitat">
<rdfs:label>Rivers and streams</rdfs:label>
</wo:FreshwaterHabitat>
<wo:FreshwaterHabitat rdf:about="/nature/habitats/Swamp#habitat">
<rdfs:label>Swamp</rdfs:label>
</wo:FreshwaterHabitat>
<wo:Genus rdf:about="/nature/genus/Sea_eagle#genus">
<rdfs:label>Haliaeetus</rdfs:label>
<wo:species rdf:resource="/nature/life/Steller's_Sea_Eagle#species"/>
<wo:species rdf:resource="/nature/life/African_Fish_Eagle#species"/>
<wo:species rdf:resource="/nature/life/White-tailed_Eagle#species"/>
</wo:Genus>
<wo:Family rdf:about="/nature/family/Accipitridae#family">
<rdfs:label>Accipitridae</rdfs:label>
</wo:Family>
<wo:Order rdf:about="/nature/order/Falconiformes#order">
<rdfs:label>Falconiformes</rdfs:label>
</wo:Order>
<wo:Class rdf:about="/nature/class/Bird#class">
<rdfs:label>Aves</rdfs:label>
</wo:Class>
<wo:Phylum rdf:about="/nature/phylum/Chordate#phylum">
<rdfs:label>Chordata</rdfs:label>
</wo:Phylum>
<wo:Kingdom rdf:about="/nature/kingdom/Animal#kingdom">
<rdfs:label>animalia</rdfs:label>
</wo:Kingdom>
</rdf:RDF>

For some reason, this web site is now deprecated. As an exercise I grabbed the RDF from the web site, did a little cleaning, and merged it together resulting in a set of around 94,500 triples (statements of the form “subject”, “predicate”, “object”). For example, this triple says that Steller's Sea Eagle is monogamous.

[/nature/life/Steller's_Sea_Eagle#species,
wo:adaptation,
/nature/adaptations/Monogamous_pairing_in_animals#adaptation]

One reason the Semantic Web has struggled to gain widespread adoption is the long list of things you need to get to the point where it is usable. You need data consistently structured using the same vocabulary. You need identifiers that everyone agrees on (or at least can map their own identifiers too). And you need a triple store, which is essentially a graph database, a technology that is still unfamiliar to many. But in this case the BBC has done a lot of the hard work by cleverly minting identifiers based on Wikipedia URLs (”slugs”), and developing a vocabulary to express relationships between organisms, traits, and habitats. All that’s needed is a way to query this data. Rather than use a triple store (most of which are not much fun to install or maintain) I’ve used the delightfully simple approach of employing a Hexastore. Hexastores provide fast querying of graphs by indexing all six permutations of the subject, predicates, object triple (hence “hexa”). The approach is sufficiently simple that for moderately sized databases we can implement it in Javascript and run it in a web browser.

As a demonstration, I created a very crude hexastore-based version of the BBC pages (https://rdmpage.github.io/bbc-wildlife/www/.

Screenshot 2017 12 05 13 13 51

Once you load the page there are no further server requests, other than fetching images. Every query is “live” but takes place in the browser. You can click on the image for a species and get some textural information, as well as images representing traits of that organism. Click on a trait and you discover what organisms share those traits. This example is trivial, but surprisingly rich. I’ve found it fascinating to simply bounce around the images discovering unexpected facts about different species. There’s lots of potential for serendipitous discovery, as well as an enhanced appreciation for just how rich the BBC’s content is. If the Encyclopedia of Life were this engaging I’d be it’s biggest fan.

The question then, is why a similar approach was not taken for Blue Planet II? It can’t be a lack of resources, this series has amazing production values. And yet a wonderful opportunity has been missed. Why not build on the existing work and create an interactive resource that encourages people to explore more deeply and learn more? Much of the existing data could be used, as well as adding all the new species and behaviours we see on our TV screens. Blue Planet also highlights the impacts humans are having on the marine environment, these could be added as categories as well to show wat organisms are susceptible to different impacted (e.g., plastics).

That the BBC thinks a poster is an adequate for of engagement in the digital age speaks of a corporation that, in spite of many triumphs in the digital sphere (e.g., iPlayer) has not fully grasped the role the web can play in making its content more widely useful and relevant, beyond enthralling viewers on a Sunday evening. It also seems oblivious to the fact that it already knows how to deliver rich, informative online content (as evidenced by the now deprecated Wildlife application). So please, BBC, can we have a resource that enables us to learn more about the organisms and habitats that are the subjects of the grandeur and beauty we see on our TV screens?

Follow up

Below is some of the discussion this post generated on Twitter.

Friday, November 10, 2017

Exploring images in the Biodiversity Literature Repository

A post by on the Plaza blog Expanded access to images in the Biodiversity Literature Repository has prompted me to write up a little toy I created earlier this week.

The Biodiversity Literature Repository (BLR) is a repository of taxonomic papers hosted by Zenodo. Where possible Plazi have extracted individual images and added those to the BLR, even if the article itself is not open access. The justification for being able to do this is presented here: DOI:10.1101/087015. I'm not entirely convinced by their argument (see Copyright and the Use of Images as Biodiversity Data) but rather than rehash that argument I decide dit would be much more fun to get a sense of what is in the BLR. I built a tool to scrape data from Zenodo and store it in CouchDB, put a simple search engine on top (using the search functionality in Cloudant) to search within the figure captions, and wrote some code to use a cloud-based image server to generate thumbnails for the images in Zenodo (some of which are quite big). The tool is hosted at Heroku, you can try it out here: https://zenodo-blr-interface.herokuapp.com/.

Screenshot 2017 11 10 11 03 30

This is not going to win any design awards, I'm simply trying to get a feel for what imagery BLR has. My initial reactions was "wow!". There's a rich range of images, including phylogenies, type specimens, habitats, and more. Searching by museum codes, e.g. NHMUK is a quick way to discover images of specimens from various collections.

Screenshot 2017 11 10 11 22 05

Based on this experiment there are at least two things I think would be fun to do.

Adding more images

BLR already has a lot of images, but the biodiversity literature is huge, and there's a wealth of imagery elsewhere, including journals not in BLR, and of course the Biodiversity Heritage Library (BHL). Extracting images from articles in BHL would potentially add a vast number of additional images.

Machine learning

Machine learning is hot right now, and anyone using iNaturalist is probably aware of their use of computer vision to suggest identifications for images you upload. It would be fascinating to apply machine learning to images in the BLR. Even basic things such as determining whether an image is a photo or a drawing, how many specimens are included, what the specimen orientation is, what part of the organism is being displayed, is the image a map (and of what country) would be useful. There's huge scope here for doing something interesting with these images.

The toy I created is very basic, and merely scratches the surface of what could be done (Plazi have also created their own tool, see http://github.com/punkish/zenodeo). But spending a few minutes browsing the images is well worthwhile, and if nothing else is a reminder of both how diverse life is, and how active taxonomists are in trying to discover and describe that diversity.

Friday, October 06, 2017

Notes on finding georeferenced sequences in GenBank

Notes on how many georeferenced DNA sequences there are in GenBank, and how many could potentially be georeferenced.

BCT	Bacterial sequences
PRI	Primate sequences
ROD	Rodent sequences
MAM	Other mammalian sequences
VRT	Other vertebrate sequences
INV	Invertebrate sequences
PLN	Plant and Fungal sequences
VRL	Viral sequences
PHG	Phage sequences
RNA	Structural RNA sequences
SYN	Synthetic and chimeric sequ
UNA	Unannotated sequences

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
?db=nucleotide nucleotides
&term=ddbj embl genbank with limits[filt]
NOT transcriptome[All Fields] ignore transcriptome data
NOT mRNA[filt] ignore mRNA data
NOT TSA[All Fields] ignore TSA
NOT scaffold[All Fields] ignore scaffold
AND src lat lon[prop] include records that have source feature "lat_lon"
AND 2010/01/01:2010/12/31[pdat] from this date range
AND gbdiv_pri[PROP] restrict search to PRI division (primates)
AND srcdb_genbank[PROP] Need this if we query by division, see NBK49540

Numbers of nucleotide sequences that have latitude and longitudes in GenBank for each year.

DatePRIRODMAMVRTINVPLN
2010/01/01412725529551926927174
2011/01/013711204816017657784947968
2012/01/01658034214216968406027314
2013/01/01297349761107647041123435
2014/01/011529044761145986807614018
2015/01/0117452719831784336353835501
2016/01/0158261512631489875789322813
2017/01/0193817581017107127506628180

Numbers of nucleotide sequences that don't have latitude and longitudes in GenBank for each year but do have the country field and hence could be georeferenced.

DatePRIRODMAMVRTINVPLN
2010/01/01666026545534326666257756692
2011/01/01399832666210337177401598664
2012/01/015377559072835533286945103379
2013/01/011092848058013663736971995817
2014/01/019727349267515991377816135372
2015/01/0189226774139646057885867167337
2016/01/0164303384108606223895711145111
2017/01/0111474352049124115991219109747

Wednesday, October 04, 2017

TDWG 2017: thoughts on day 3

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics.

After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations. Javier provided some pithy observations on standards, particularly the fate of official versus unofficial "community" standards (the community standards tend to be simpler, easier to use, and hence win out), and the potentially stifling effects standards can have on innovation, especially if conforming to standards becomes the goal rather than merely a feature.

The session Using Big Data Techniques to Cross Dataset Boundaries - Integration and Analysis of Multiple Datasets demonstrated the great range of things people want to do with data, but made little progress on integration. It still strikes me as bizarre that we haven't made much progress on minting and reusing identifiers for the same entities that we keep referring too. Channeling Steve Balmer:

Identifiers, identifiers, identifiers, identifiers

It's also striking to compare Javier de la Torre's work with Carto where there is a clear customer-driven focus (we need these tools to deliver this to users so that they can do what they want to do) versus the much less focussed approach of our community. Many of the things we aspire to won't happen until we identify some clear benefits for actual users. There's a tendency to build stuff for our own purposes (e.g., pretty much everything I do) or build stuff that we think people might/should want, but very little building stuff that people actually need.

TDWG also has something of an institutional memory problem. Franck Michel gave an elegant talk entitled A Reference Thesaurus for Biodiversity on the Web of Linked Data which discussed how the Muséum national d'Histoire naturelle's taxonomic database could be modelled in RDF (see for example http://taxref.mnhn.fr/lod/taxon/60878/10.0). There's a more detailed description of this work here:

This browser does not support PDFs. Please download the PDF to view it: Download PDF.

What struck me was how similar this was to the now deprecated TDWG LSID vocabulary, still used my most of the major taxonomic name databases (the nomenclatures). This is an instance where TDWG had a nice, workable solution, it lapsed into oblivion, only to be subsequently reinvented. This isn't to take anything away from Frank's work, which has a thorough discussion of the issues, and has a nice way to handle the the difference between asserting that two taxa are the same (owl:equivalentClass) and that a taxon/name hybrid (which is what many databases serve up because they don't distinguish between names and taxa) and a taxon might be the same (linking via the name they both share).

The fate of the RDF served by the nomenclators for the last decade illustrates a point I keep returning too (see also EOL Traitbank JSON-LD is broken). We tend to generate data and standards because it's the right thing to do, rather than because there's actually a demonstrable need for that data and those standards.

Bitcoin, biodiversity, and micropayments for open data

I gave a "wild ideas" talk at TDWG17 suggesting that the biodiversity community use Bitcoin to make micropayments to use data.

The argument runs like this:

  1. We like open data because it's free and it makes it easy to innovate, but we struggle to (a) get it funded and (b) it's hard to demonstrate value (hence pleas for credit/attribution, and begging for funding).
  2. The alternative of closed data, such as paying a subscription to access a database limits access and hence use and innovation, but generates an income to support the database, and the value of the database is easy to measure (it's how much money it generates).
  3. What if we have a "third model" where we pay small amounts of money to access data (micropayments)?

Micropayments as a way to pay creators is an old idea (it was part of Ted Nelson's Xanadu vision). Now that we have cryptocurrencies such as Bitcoin, micropayments are feasible. So we could imagine something like this:

  1. Access to raw datasets is free (you get what you pay for)
  2. Access to cleaned data comes at a cost (you are paying someone else to do the hard, tedious work of making the data usable)
  3. Micropayments are made using Bitcoin
  4. To help generate funds any spare computational capacity in the biodiversity community is used to mine Bitcoins

After the talk Dmitry Mozzherin sent me a link to Steem, and then this article about Steemit appeared in my Twitter stream:

Clearly this is an idea that has been bubbling around for a while. I think there is scope for thinking about ways to combine a degree of openness (we don't want to cripple access and innovation) with a way to fund that openness (nobody seems interested in giving us money to be open).

Tuesday, October 03, 2017

TDWG 2017: thoughts on day 1

Some random notes on the first day of TDWG 2017. First off, great organisation with the first usable conference calendar app that I've seen (https://tdwg2017.sched.com).

I gave the day's keynote address in the morning (slides below).

It was something of a stream of consciousness brain dump, and tried to cover a lot of (maybe too much) stuff. Among the topics I covered were Holly Bik's appeal for better links between genomic and taxonomic data, my iSpecies tool, some snarky comments on the Semantic Web (and an assertion that the reason that GenBank succeeded was due more to network effects than journals requiring authors to submit sequences there), a brief discussion of Wikidata (including using d3sparql to display classifications, see here), and the use of Hexastore to query data from BBC Wildlife. I also talked about Ted Nelson, Xanadu, using hypothes.is to annotate scientific papers (see Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday), social factors in building knowledge graphs (touching on ORCID and some of the work by Nico Franz discussed here), and ended with some cautionary comments on the potential misuse of metrics based on knowledge graphs (using "league tables" of cited specimens, see GBIF specimens in BioStor: who are the top ten museums with citable specimens?).

TDWG is a great opportunity to find out what is going on in biodiversity informatics, and also to get a sense of where the problems are. For example, sitting through the Financial Models for Sustaining Biodiversity Informatics Products session you couldn't help being struck by (a) the number of different projects all essentially managing specimen data, and (b) the struggle they all face to obtain funding. If this was a commercial market there would be some pretty drastic consolidation happening. It also highlights the difficulty of providing services to a community that doesn't have much money.

I was also struck by Andrew Bentley's talk Interoperability, Attribution, and Value in the Web of Natural History Museum Data. In a series of slides Andrew outlined what he felt collections needed from aggregators, researchers, and publishers, e.g.:

Chatting to Andrew at the evening event at the Canadian Museum of Nature, I think there's a lot of potential for developing tools to provide collections with data on the use and impact of their collections. Text mining the biodiversity literature on a massive scale to extract (a) mentions of collections (e.g., their institutional acronyms) and (b) citations of specimens could generate metrics that would be helpful to collections. There's a great opportunity here for BHL to generate immediate value for natural history collections (many of which are also contributors to BHL).

Also had a chance to talk to Jorrit Poelen who works on Global Biotic Interactions (GloBI). He made some interesting comparisons between Hexastores (which I'd touched on in my keynote) and Linked Data Fragments.

The final session I attended was Towards robust interoperability in multi-omic approaches to biodiversity monitoring. The overwhelming impression was that there is a huge amount of genomic data, much of which does not easily fit into the classic, Linnean view of the world that characterises, say, GBIF. For most of the sequences we don't know what they are, and that might not be the most interesting question anyway (more interesting might be "what do they do?"). The extent to which these data can be shoehorned into GBIF is not clear to me, although doing so may result in some healthy rethinking of the scope of GBIF itself.

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

Bob mesibov The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in DarwinCore downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as beetle look-up table or tables, but in many other cases their GBIF Backbone Taxonomy mirrors the Catalogue of Life.

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

Generalised

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted as in ALA "null|null|null|null"

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.

Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).

Journal pbio 2002231 g001

It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made any times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down.. There's clearly a need for a simply way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more form other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well know species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name. This works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search, say, using NCBI taxon ids, and retrieve literature linked to sequences for those taxa

It would also be useful to package those up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers had been online from early on they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new idea. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometime we needed up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schema, and then ontologies and RDF vocabularies. Meantime much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need to topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Sunday, August 20, 2017

Notes on displaying big trees using Google Maps/Leaflet

Notes to self on web map-style tree viewers. The basic idea is to use Google Maps or Leaflet to display a tree. Hence we need to compute tiles. One approach is to use a database that supports spatial queries to store the x,y coordinates of the tree. When we draw a tile we compute the coordinates of that tile, based on position and zoom level, do a spatial query to extract all lines that intersect with the rectangle for that tile, and draw those.

A nice example of this is Lifemap (see also De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624).

It occurs to me that for trees that aren't too big we could do this without an external database. For example, what if we used a Javascript implementation of an R-tree, such as imbcmdth/RTree or its fork leaflet-extras/RTree. So, we could compute the coordinates of the nodes in the tree in "geographic" space, store the bounding box for each line/arc in an R-tree, then query that R-tree for lines that intersect with the bounding box of the relevant tile. We could use a clipping algorithm to only draw the bits of the lines that cross the tile itself.

Web maps, at least in my experience, make trips to a tile server to fetch a tile, we would want instead to call a routine within our web page, because all the data would be loaded into that page. So we'd need to modify the tile creating code.

The ultimate goal would be to have a single page web app that accepts a Newick-style tree and converts into a browsable, zoomable visualisation.

Tuesday, July 04, 2017

Apple's Knowledge Navigator concept video (1987): we are still a long way from this vision

I've been viewing Apple's Knowledge Navigator concept video from 1987 and it's striking how much of this we have today, and yet how far away we are from the complete vision. For some background on this promotional video see The Making of Knowledge Navigator. The computer scientist Alan Kay provided some advice to the makers (who put the video together for a presentation by then Apple CEO John Sculley). Kay is a true visionary, he's currently working on children, computers, and education, motivated by the realisation that, like the printing press before it, computing will change the way people think, and how children learn using computers could have a profound impact on our future.

The Knowledge Navigator video looked futuristic when it came out, but now we have ubiquitous touch interfaces, video chat, and can talk to computers (albeit not with the level of sophistication shown in the video). But there are a couple of things in the concept video that are in many ways even more impressive.

Early on, our professor is trying to track down a paper, and he can't quite remember the name of the author. His visual assistant (a more sophisticated version of Siri) finds it, which of itself isn't too exciting (Google supports searching for things when you don't quite know what it is you're looking for). What is more impressive is that the professor can access and play with the data in the paper, and compare the predictions made in that paper with more recent data.

This requires that we have access to the data and models from a published paper, and a way to easily add new data and redo the analyses. This is related to "reproducible science" doi:10.1038/s41562-016-0021 and the notion of "executable papers" doi:10.1016/j.procs.2011.04.074, but goes beyond that because we don't just reproduce the results in the original paper, we can add to them. And it's all seamless and effortless. Anyone who has tried to get adta from a paper and do something with it will recognise that we are a long way from this.

The second interesting example is when our professor is chatting online with a colleague about deforestation in South America, and she sends him her graphical model of the spread of the Sahara. They then view these side-by-side. Note that this is not two separate videos, the simulations merge together and their timeline syncs so that they play together simultaneously. The parameters of the simulations can also be changed on the fly.

This ability to collaborate in real time in the same space with both data and analysis is something that we don't really have, at least I'm not aware of it. Yes, we can work together on editing a Google Document, but throwing together two data sets or visualisations and have them align themselves automatically is pretty cool.

While some aspects of the Knowledge Navigator video look quaint, it's striking that the actual core of the video - a researcher redoing an analysis published by another researcher, or collaborating with a colleague with different but related data is still something we haven't been able to achieve yet (for some related work on collaboratively viewing evolutionary trees see "Interactive Tree Comparison for Co-located Collaborative Information Visualization" doi:10.1109/TVCG.2007.70568). In this respect the Knowledge Navigator is still a vision of the future.

Friday, June 30, 2017

Response to To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data

Nico Franz and Beckett W. Sterner recently published a preprint entitled "To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data" on bioRxiv http://dx.doi.org/10.1101/157214 Below is the abstract:

Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors "at the source". We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies ≠ frequently called "backbones" ≠ they generate, and which are in effect novel classification theories that operate at the core of data-structuring process. The Darwin Core standard for sharing occurrence records plays an underappreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e., unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.

Below I respond to some specific points that annoyed me about this article, at the end I try and sketch out a more constructive response. Let me stress that although I am the current Chair of the GBIF Science Committee, the views expressed here are entirely my own.

Trust and social relations

Trust is a complex and context-sensitive concept...First, trust is a dependence relation between a person or organization and another person or organization. The first agent depends on the second one to do something important for it. An individual molecular phylogeneticist, for example, may rely on GenBank (Clark et al. 2016) to maintain an up-to-date collection of DNA sequences, because developing such a resource on her own would be cost prohibitive and redundant. Second, a relation of dependence is elevated to being one of trust when the first agent cannot control or validate the second agent's actions. This might be because the first agent lacks the knowledge or skills to perform the relevant task, or because it would be too costly to check.

Trust is indeed complex. I found this part of the article to be fascinating, but incomplete. The social network GBIF operates in is much larger than simply taxonomic experts and GBIF, there are relationships with data providers, other initiatives, a broad user community, government agencies that approve it's continued funding, and so on. Some of the decisions GBIF makes need to be seen in this broader context.

For example, the article challenges GBIF for responding to errors in the data by saying that these should be "corrected at source". This a political statement, given that data providers are anxious not to ceed complete control of their data to aggregators. Hence the model that GBIF users see errors, those errors get passed back to source (the mechanisms for tis is mostly non-existent), the source fixes it, then the aggregator re-harvests. This model makes assumptions about whether sources are either willing or able to fix these errors that I think are not really true. But the point is this is less about not taking responsibility, but instead avoiding treading on toes by taking too much responsibility. Personally I think should take responsibility for fixing a lot of these errors, because it is GBIF whose reputation suffers (as demonstrated by Franz and Sterner's article).

Scalability

A third step is to refrain from defending backbones as the only pragmatic option for aggregators (Franz 2016). The default argument points to the vast scale of global aggregation while suggesting that only backbones can operate at that scale now. The argument appears valid on the surface, i.e., the scale is immense and resources are limited. Yet using scale as an obstacle it is only effective if experts were immediately (and unreasonably) demanding a fully functional, all data-encompassing alternative. If on the other hand experts are looking for token actions towards changing the social model, then an aggregator's pursuit of smaller-scale solutions is more important than succeeding with the 'moonshot'.

Scalability is everything. GBIF is heading towards a billion occurrence records and several million taxa (particularly as more and more taxa from DNA-barcoding taxa are added). I'm not saying that tractability trounces trust, but it is a major consideration. Anybody advocating a change has got to think about how these changes will work at scale.

I'm conscious that this argument could easily be used to swat away any suggestion ("nice idea, but won't scale") and hence be a reason to avoid change. I myself often wish GBIF would do things differently, and run into this problem. One way around it is to make use of the fact that GBIF has some really good APIs, so if you want GBIF to do something different you can build a proof of concept to show what could be done. If that is sufficiently compelling, then the case for trying to scale it up is going to be much easier to make.

Multiple classifications

As a social model, the notion of backbones (Bisby 2000) was misguided from the beginning. They disenfranchise systematists who are by necessity consensus-breakers, and distort the coherence of biodiversity data packages that reflect regionally endorsed taxonomic views. Henceforth, backbone-based designs should be regarded as an impediment to trustworthy aggregation, to be replaced as quickly and comprehensively as possible. We realize that just saying this will not make backbones disappear. However, accepting this conclusion counts as a step towards regaining accountability.

This strikes me as hyperbole. "They disenfranchise systematists who are by necessity consensus-breakers". Really? Having backbones in no way prevents people doing systematic research, challenging existing classifications, or developing new ones (which, if they are any good, will become the new consensus).

We suggest that aggregators must either author these classification theories in the same ways that experts author systematic monographs, or stop generating and imposing them onto incoming data sources. The former strategy is likely more viable in the short term, but the latter is the best long-term model for accrediting individual expert contributions. Instead of creating hierarchies they would rather not 'own' anyway, aggregators would merely provide services and incentives for ingesting, citing, and aligning expert-sourced taxonomies (Franz et al. 2016a).

Backbones are authored in the sense that they are the product of people and code. GBIF's is pretty transparent (code and some data on github, complete with a list of problems). Playing Devil's advocate, maybe the problem here is the notion of authorship. If you read a paper with 100's of authors, why does that give you any greater sense of accountabily? Is each author going to accept responsibility for (or being to talk cogently about) every aspect of that paper? If aggregators such as GBIF and Genbank didn't provide a single, simple way to taxonomically browse the data I'd expect it would be the first thing users would complain about. There are multiple communities GBIF must support, including users who care not at all about the details of classification and phylogeny.

Having said that, obviously these backbone classifications are often problematic and typically lag behind current phylogenetic research. And I accept that they can impose a certain view on how you can query data. GenBank for a long time did not recognise the Ecdysozoa (nematodes plus arthropods) despite the evidence for that group being almost entirely molecular. Some of my research has been inspired by the problem of customising a backbone classification to better more modern views (doi:10.1186/1471-2105-6-208).

If handling multiple classifications is an obstacle to people using or contributing data to GBIF, then that is clearly something that deserves attention. I'm a little sceptical, in that I think this is similar to the issue of being able to look at multiple versions of a document or GenBank sequence. Everyone says it's important to have, I suspect very few people ever use that functionality. But a way forward might be to construct a meaningful example (in other words an live demo, not a diagram with a few plant varieties).

Ways forward

We view this diagnosis as a call to action for both the systematics and the aggregator communities to reengage with each other. For instance, the leadership constellation and informatics research agenda of entities such as GBIF or Biodiversity Information Standards (TDWG 2017) should strongly coincide with the mission to promote early-stage systematist careers. That this is not the case now is unfortunate for aggregators, who are thereby losing credibility. It is also a failure of the systematics community to advocate effectively for its role in the biodiversity informatics domain. Shifting the power balance back to experts is therefore a shared interest.

Having vented, let me step back a little and try and extract what I think the key issue is here. Issues such as error correction, backbones, multiple classifications are important, but I guess the real issue here is the relationship between experts such as taxonomists and systematists, and large-scale aggregators (note that GBIF serves a community that is bigger than just these researchers). Franz and Sterner write:

...aggregators also systematically compromise established conventions of sharing and recognizing taxonomic work. Taxonomic experts play a critical role in licensing the formation of high-quality biodiversity data packages. Systems of accountability that undermine or downplay this role are bound to lower both expert participation and trust in the aggregation process.

I think this is perhaps the key point. Currently aggregation tends to aggregate data and not provenance. Pretty much every taxonomic name has at one point or other been published by somebody. For various reasons (including the crappy way most nomenclature databases cite the scientific literature) by the time these names are assembled into a classification by GBIF the names have virtually no connection to the primary literature, which also means that who contributed the research that led to that name being minted (and the research itself) is lost. Arguably GBIF is missing an opportunity to make taxonomic and phylogenetic research more visible and discoverable (I'd argue this is a better approach than Quixotic efforts to get all biologists to always cite the primary taxonomic literature).

Franz and Sterner's article is a well-argued and sophisticated assessment of a relationship that isn't working the way it could. But to talk in terms of "power balance" strikes me as miscasting the debate. Would it not be better to try and think about aligning goals (assuming that is possible). What do experts want to achieve? What do they need to achieve those goals? Is it things such as access to specimens, data, literature, sequences? Visibility for their research? Demonstrable impact? Credit? What are the impediments? What, if anything, can GBIF and other aggregators do to help? In what way can facilitating the work of experts help GBIF?

In my own "early-stage systematist career" I had a conversation with Mark Hafner about the Louisiana State University Museum providing tissue samples for molecular sequencing, essentially a "project in a box". Although Mark was complaining about the lack credit for this (a familiar theme) the thing which struck me was how wonderful it would be to have such a service - here's everything you need to do your work, go do some science. What if GBIF could do the same? Are you interested in this taxonomic group, well here's the complete sum of what we know so far. Specimens, literature, DNA sequences, taxonomic names, the works. Wouldn't that be useful?

Franz and Sterner call for "both the systematics and the aggregator communities to reengage with each other". I would echo this. I think that the sometimes dysfunctional relationship between experts and aggregators is partly due to the failure to build a community of researchers around GBIF and its activities. The focus of GBIF's relationship with the scientific community has been to have a committee of advisers, which is a rather traditional and limited approach ("you're a scientist, tell us what scientists want"). It might be better served if it provided a forum for researchers to interact with GBIF, data providers, and each other.

I stated this blog (iPhylo) years ago to vent my frustrations about TreeBASE. At the time I was fond of a quote from a philosopher of science that I was reading, to the effect that we only criticise those things that we care about. I take Franz and Sterner's article to indicate that they care about GBIF quite a bit ;). I'm looking forward to more critical discussion about how we can reconcile the needs of experts and aggregators as we seek to make global biodiversity data both open and useful.

Friday, June 16, 2017

GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse

Ebbe v5 300

GBIF is running its Ebbe Nielsen Challenge for the third successive year. This year the title is Liberating species records from open data repositories for scientific discovery and reuse. To quote from the Challenge background on Devpost:

This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.

To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.

However, the practices adopted by most repositories, funders and journals do not yet encourage the use of standardized formats. This approach significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse of data implied if not stated by many open data policies falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.

In essence, the 2017 Challenge is to develop tools to discover these biodiversity-relevant datasets, and make them available to GBIF. In other words, we want tools to enable us to do this:

Goal

As an example of the impact that external data can have on GBIF, last year I wrote a blog post (The Zika virus, GBIF, and the missing mosquitoes) describing how I took published data (doi:10.1038/sdata.2015.35) from the Dryad repository and added it to GBIF. The effect was dramatic:

Before

1651430

After

1651430 updated

This is just one example. I suspect that there is a lot of biodiversity data gathering digital dust sitting in repositories that could be more widely reused if we just had the tools to discover it, and convert it into a form that GBIF can use. Prove me right, and win cash prizes! Details at https://gbif2017.devpost.com.

Wednesday, May 31, 2017

Programming with Glitch: microservices and serverless computing

LgbNpkq 400x400Yes, this post is indeed an attempt to fit as many buzzwords that I don't really understand into the title. I've been playing around with Glitch, which is a delightful project from Fog Creek (makers of Trello and co-creators of Stack Overflow).

On first glance Glitch looks weirdly retro, and it took a little while for me to get the hang of things. Bit it's fun and very powerful. Basically it's a place where you can start creating web apps in your browser, and each app is automatically hosted online. If you see an app that you like you can see the source code (just like you can see HTML using "view source" in your browser). if you want to hack on the code you can simply create a copy and it's yours to play with (this is called "remixing", like forking on GitHub). Your copy gets a cute name (possibly annoyingly cute) and away you go.

If you're a developer, then at this point you're probably wondering what is actually happening under the hood. Each Glitch app is a node.js app, which means you're programming in Javascript (you can just use HTML and client side Javascript if you want to avoid node.js). I'm very new to node.js, so Glitch has been a fun way to experiment.

There are two things which make Glitch very powerful. The first is the "remix" feature. Don't know where to start? Find an app that looks like it might do something you want to do, remix it, and hack away. The code is edited online, and the editor works very well. It also checks your code for Javascript errors as you type, which is helpful (usually).

The second great feature is that you get built in hosting for free. As soon as you remix an app you have a functioning web site. Remixing is very like forking in GitHub, and if you're running node.js on your local machine then the benefits of Glitch might not seem obvious. But hosting is often a pain, either you need to set up your own servers, or use a hosting service. Glitch takes care of this for you, so your app is instantly available for others to use.

So, what can you do with Glitch? There's some great examples on the Glitch site, but I want to show an almost trivial example. I've created an app called "enchanting-bongo" https://enchanting-bongo.glitch.me (yes, the name is a bit irritating) that does one simple thing. You give it a DOI for an article and enchanting-bongo tells you whether any of the authors of that work have an ORCID. For example, try the DOI 10.3897/zookeys.555.6173. Why did I write this? I'm interested in ways to link people to the work that they've done, especially work that ends up being aggregated in large-scale biodiversity databases like GBIF (see Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact).

Screenshot 2017 05 31 17 09 32

The app does one thing. It takes the DOI and calls the ORCID API to see if anyone has claimed authorship of the paper with that DOI. You can use the app with a web browser, or you can use an HTTP client and call the API (e.g., https://enchanting-bongo.glitch.me/search?q=10.3897%2Fzookeys.555.6173).

Glitch is an example of servers computing, where you don't have to worry about physical servers or the software infrastructure that runs on them (e.g., the web server itself), you just write code. Like any buzzword, there is some pushback, see for example What Is “Serverless”? An Alternative Take, but for a fascinating essay I recommend Why the fuss about serverless?. But the notion that I can simply hack away on some code and have an instantly available web app is very attractive.

The other buzzword is "microservices". I'm forever needing to do tasks such as find a DOI for a paper, match a "microcitation" to the enclosing article, locate a specimen in GBIF based on catalogue number in a paper, parse some text into structured data, such as a reference, geographic coordinates, etc. These are tools that I need in lots of contexts, and I've written software to do this on my machine, often as part of larger projects. "Microservices" is the idea that instead of large, monolithic apps we write a series of minimal tools that typically do one thing, and do it well. We then chain the together to do various tasks. Having small tools means that we can treat each problem independently, and if the tools communicate over the web (HTTP) then it doesn't matter what programming language we use. I've started thinking more and more about adopting this model and developing a bunch of small services to perform many of the tasks I need. Hosting these services then becomes in issue, I have web servers in my office but they are a pain to maintain (my university is forever insisting that I upgrade their software), so cloud-based hosting seems the obvious way forward. Free-hosting looks ideal, so Glitch is looking very attractive.

So, I'm hoping to experiment more with this approach. One thing I might do is create a series of services very like enchanting-bongo, have a simple web interface and an API that the web interface calls. That way users can play with it in their web browser, then call the service via the API if it does something useful. As a more sophisticated example of a service, I'm working on tools to parse Wikispecies reference strings, and link specimen codes to records in GBIF.

One reason I'm enthusiastic about Glitch is that it is fun!. Some of the best shifts in technology that I've made have been because a tool made something easy and fun to do. For example, CouchDB made working with structured data fun, and that was a revelation (databases, fun, surely not). Fun is a much neglected characteristic of the tools we use.