Friday, September 30, 2016

Guest post: It's 2016 and your data aren't UTF-8 encoded?

Bob mesibov The following is a guest post by Bob Mesibov.

According to w3techs, seven out of every eight websites in the Alexa top 10 million are UTF-8 encoded. This is good news for us screenscrapers, because it means that when we scrape data into a UTF-8 encoded document, the chances are good that all the characters will be correctly encoded and displayed.

It's not quite good news for two reasons.

In the first place, one out of eight websites is encoded with some feeble default like ISO-8859-1, which supports even fewer characters than the closely related windows-1252. Those sites will lose some widely-used punctuation when read as UTF-8, unless the webpage has been carefully composed with the HTML equivalents of those characters. You're usually safe (but see below) with big online sources like Atlas of Living Australia (ALA), APNI, CoL, EoL, GBIF, IPNI, IRMNG, NCBI Taxonomy, The Plant List and WoRMS, because these declare a UTF-8 charset in a meta tag in webpage heads. (IPNI's home page is actually in ISO-8859-1, but its search results are served as UTF-8 encoded XML.)

But a second problem is that just because a webpage declares itself to be UTF-8, that doesn't mean every character on the page sings from the Unicode songbook. Very odd characters may have been pulled from a database and written onto the page as-is. In ALA I recently found an ancient rune — the High Octet Preset control character (HOP, hex 81):

http://biocache.ala.org.au/occurrences/6191ca90-873b-44f8-848d-befc29ad7513 http://biocache.ala.org.au/occurrences/5077df1f-b70a-465b-b22b-c8587a9fb626

HOP replaces ü on these pages and is invisible in your browser, but a screenscrape will capture the HOP and put SchHOPrhoff in your UTF-8 document.

Another example of ALA's fidelity to its sources is its coding of the degree symbol, which is a single-byte character (hex b0) in windows-1252, e.g. in Excel spreadsheets, but a two-byte character (hex c2 b0) in Unicode. In this record, for example:

http://biocache.ala.org.au/occurrences/5e3a2e05-1e80-4e1c-9394-ed6b37441b20

the lat/lon was supplied (says ALA) as 37°56'9.10"S 145° 0'43.74"E. Or was it? The lat/lon could have started out as 37°56'9.10"S 145°0'43.74"E in UTF-8. Somewhere along the line the lat/lon was converted to windows-1252 and the ° characters were generated, resulting in geospatial gibberish.

When a program fails to understand a character's encoding, it usually replaces the mystery character with a ?. A question mark is a perfectly valid character in commonly used encodings, which means the interpretation failure gets propagated through all future re-uses of the text, both on the Web and in data dumps. For example,

http://biocache.ala.org.au/occurrences/dfbbc42d-a422-47a2-9c1d-3d8e137687e4

gives N?crophores for Nécrophores. The history of that particular character failure has been lost downstream, as is the case for myriads of other question marks in online biodiversity data.

In my experience, the situation is much worse in data dumps from online sources. It's a challenge to find a dump without question marks acting as replacement characters. Many of these question marks appear in author and place names. Taxonomists with eastern European names seem to fare particularly badly, sometimes with more than one character variant appearing in the same record, as in the Australian Faunal Directory (AFD) offering of Wêgrzynowicz, W?grzynowicz and Węgrzynowicz for the same coleopterist. Question marks also frequently replace punctuation, such as n-dashes, smart quotes and apostrophes (e.g. O?Brien (CoL) and L?Échange and d?Urville (AFD)).

Character encoding issues create major headaches for data users. It would be a great service to biodiversity informatics if data managers compiled their data in UTF-8 encoding or took the time to convert to UTF-8 and fix any resulting errors before publishing to the Web or uploading to aggregators.

This may be a big ask, given that at least one data manager I've talked to had no idea how characters were encoded in the institution's database. But as ALA's Miles Nicholls wrote back in 2009, "Note that data should always be shared using UTF-8 as the character encoding". Biodiversity informatics is a global discipline and UTF-8 is the global standard for encoding.

Readers needing some background on character encoding will find this and especially this helpful, and a very useful tool to check for encoding problems in small blocks of text is here.

GBIF 2016 Ebbe Nielsen Challenge entries

The GBIF 2016 Ebbe Nielsen Challenge has received 15 submissions. You can view them here: Screenshot 2016 09 30 14 01 05 Unlike last year where the topic was completely open, for the second challenge we've narrowed the focus to "Analysing and addressing gaps and biases in primary biodiversity data". As with last year, judging is limited to the jury (of which I'm a member), however anyone interested in biodiversity informatics can browse the submissions. Although you can't leave comments directly on the submissions within the GBIF Challenge pages, each submission also appears on the portfolio page of the person/organisation that created the entry, so you can leave comments there (follow the link at the bottom of the page for each submission to see it on the portfolio page).

Wednesday, September 07, 2016

Guest post: Absorbing task or deranged quest: an attempt to track all genus names ever published

YtNkVT2U This guest post by Tony Rees describes his quest to track all genus names ever published (plus a subset of the species…).

A “holy grail” for biodiversity informatics is a suitably quality controlled, human- and machine-queryable list of all the world’s species, preferably arranged in a suitable taxonomic hierarchy such as kingdom-phylum-class-order-family-genus or other. To make it truly comprehensive we need fossils as well as extant taxa (dinosaurs as well as dinoflagellates) and to cover all groups from viruses to vertebrates (possibly prions as well, which are, well, virus-like). Linnaeus had some pretty good attempts in his day, and in the internet age the challenge has been taken up by a succession of projects such as the “NODC Taxonomic Code” (a precursor to ITIS, the Integrated Taxonomic Information System - currently 722,000 scientific names), the Species 2000 consortium, and the combined ITIS+SP2000 product “Catalogue of Life”, now in its 16th annual edition, with current holdings of 1,635,250 living and 5,719 extinct valid (“accepted”) species, plus an additional 1,460,644 synonyms (information from http://www.catalogueoflife.org/annual-checklist/2016/info/ac). This looks pretty good until one realises that as well as the estimated “target” of 1.9 million valid extant species there are probably a further 200,000-300,000 described fossils, all with maybe as many synonyms again, making a grand total of at least 5 million published species names to acquire into a central “quality assured” system, a task which will take some time yet.

Ten years ago, in 2006, the author participated in a regular meeting of the steering committee for OBIS, the Ocean Biogeographic Information System which, like GBIF, aggregates species distribution data (for marine species in this context) from multiple providers into a single central search point. OBIS was using the Catalogue of Life (CoL) as its “taxonomic backbone” (method for organising its data holdings) and, again like GBIF, had come up against the problem of what to do with names not recognised in the then-latest edition of CoL, which was at the time less than 50% complete (information on 884,552 species). A solution occurred to me that since genus names are maybe only 10% as numerous as species names, and every species name includes its containing genus as the first portion of its binomial name (thanks, Linnaeus!), an all-genera index might be a tractable task (within a reasonable time frame) where an all-species index was not, and still be useful for allocating incoming “not previously seen” species names to an appropriate position in the taxonomic hierarchy. OBIS, in particular, also wished to know if species (or more exactly, their parent genera) were marine (to be displayed) or nonmarine (hide), similar with extant versus fossil taxa. Sensing a challenge, I offered to produce such a list, in my mind estimating that it might require 3 months full-time, or the equivalent 6 months in part time effort to complete and supply back to OBIS for their use.

To cut a long story short… the project, which I christened the Interim Register of Marine and Nonmarine Genera or IRMNG (originally at CSIRO in Australia, now hosted on its own domain “www.irmng.org” and located at VLIZ in Belgium) has successfully acquired over 480,000 published genus names, including valid names, synonyms and a subset of published misspellings, all allocated to families (most) or higher ranks (remainder) in an internally coherent taxonomic structure, most with marine/nonmarine and extant/fossil flags, all with the source from which I acquired them, sources for the flags, and more; also for perhaps 50% of genera, lists of associated species from wherever it has been convenient to acquire them (Catalogue of Life 2006 being a major source, but many others also used). My estimated 6 months has turned into 10 years and counting, but I do figure that the bulk of the basic “names acquisition” has been done for all groups (my estimate: over 95% complete) and it is rare (although not completely unknown) for me to come across genus names not yet held, at least for the period 1753-2014 which is the present coverage of IRMNG; present effort is therefore concentrated on correcting internal errors and inconsistencies, and upgrading the taxonomic placement (to family) for the around 100,000 names where this is not yet held (also establishing the valid name/synonym status of a similar number of presently “unresolved” generic names).

With the move of the system from Australia to VLIZ, completed within the last couple of months, there is the facility to utilise all of the software and features presently developed at VLIZ that currently runs WoRMS, the World Register of Marine Species and its many associated subsidiary databases, as well as (potentially) look at forming a distributed editing network for IRMNG in the future, as already is the case for WoRMS, presuming that others are see a value in maintaining IRMNG as a useful resource e.g. for taxonomic name resolution, detection of potential homonyms both within and across kingdoms, and generally acting as a hierarchical view of “all life” to at least genus level. A recently implemented addition to IRMNG is to hold ION identifiers (also used in BioNames), for the subset of names where ION holds the original publication details, enabling “deep links” to both ION and BioNames wherein the original publication can often be displayed, as previously described elsewhere in this Blog. Similar identifiers for plants are not yet held in the system but could be, (for example Index Fungorum identifiers for fungi), for cases where the potential linked system adds value in giving, for example, original publication details and onward links to the primary literature.

All in all I feel that the exercise has been of value not only to OBIS (the original “client”) but also to other informatics systems such as GBIF, Encyclopedia of Life, Atlas of Living Australia, Open Tree of Life and others who have all taken advantage of IRMNG data to add to their systems, either for the marine/nonmarine and extant/fossil flags or as an addition to their primary taxonomic backbones, or both. In addition it has allowed myself, the founding IRMNG compiler, to “scratch the taxonomic itch” and finally flesh out what is meant by statements that a certain group contains x families or y genera, and what these actually might be. Finally, many users of the system via its web interface have commented over time on how useful it is to be able to input “any” name, known or unknown, with a good chance that IRMNG can tell them something about the genus (or genus possible options, in the case of homonyms) as well as the species, in many cases, as well as discriminate extant from fossil taxon names, something not yet offered to any significant extent by the current Catalogue of Life.

Readers of iPhylo are encouraged to try IRMNG as a “taxonomic name resolution service” by visiting www.irmng.org and of course, welcome to contact me with suggestions of missed names (concentrating at genus level at the present time) or any other ideas for improvement (which I can then submit for consideration to the team at VLIZ who now provide the technical support for the system).