Friday, August 29, 2008

Turning Japanese: EUC-JP, UTF-8, and percent-encoding

In case I forget how to do this, and as an example of how easy it is to get sucked into a black hole of programming micro-details, I spent an hour or more trying to figure out how to handle Japanese characters.

I'm building a database of publications linked to taxonomic names, and I'm interested in linking to electronic versions of those publications. CrossRef and JSTOR provide a lot of references, as does BHL (once they get an OpenURL resolver in place), but there are numerous other sources to be harvested. One is CiNii, the Japanese National Institute of Informatics Scholarly and Academic Information Navigator, which has an OpenURL resolver. For example, I can query CiNii for an article using this URL
http://ci.nii.ac.jp/openurl/query?ctx_ver=Z39.88-2004&url_ver=Z39.88-2004&ctx_enc=info%3aofi%2fenc%3aUTF-8&rft.date=2003&rft.volume=58&rft.spage=1&rft.epage=6&rft.jtitle=Entomological%20Review%20of%20Japan.
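
A query like the one above can be assembled from its parts rather than typed by hand. The following is a minimal Python sketch, assuming the resolver accepts exactly the parameters shown in the example URL; the function name build_cinii_openurl is made up for illustration.

```python
from urllib.parse import urlencode, quote

# Base URL taken from the example query above.
CINII_OPENURL = "http://ci.nii.ac.jp/openurl/query"


def build_cinii_openurl(jtitle, date, volume, spage, epage):
    """Return an OpenURL (Z39.88-2004) query URL for one article."""
    params = {
        "ctx_ver": "Z39.88-2004",
        "url_ver": "Z39.88-2004",
        "ctx_enc": "info:ofi/enc:UTF-8",
        "rft.date": date,
        "rft.volume": volume,
        "rft.spage": spage,
        "rft.epage": epage,
        "rft.jtitle": jtitle,
    }
    # quote_via=quote encodes spaces as %20 (as in the example) rather than "+".
    return CINII_OPENURL + "?" + urlencode(params, quote_via=quote)


print(build_cinii_openurl("Entomological Review of Japan", 2003, 58, 1, 6))
```

The percent-escapes come out in upper case (%3A rather than %3a), which is equivalent as far as URL decoding is concerned.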

If I want to harvest bibliographic metadata, I can parse the resulting HTML. I could follow the links to formats such as BibTex, but there's enough information in the link itself. For example, there's a link to the BibTex format that looks like this:

http://ci.nii.ac.jp/openurl/servlet/createData?type=bib
&ca=@article
&au=%B7%A6%CC%DA+%B4%B4%C9%D7
&title=%A5%AB%A5%DF%A5%AD%A5%EA%A5%E0%A5%B7%B2%CAPidonia%C2%B0%A4%CE%BF%B7%B0%A1%C2%B0%A4%CB%A4%C4%A4%A4%A4%C6
&jtitle=%BA%AB%EA%B5%D5%DC%C9%BE%CF%C0+%3D+The+entomological+review+of+Japan
&year=20030430
&vol=00058
&num=00001
&spage=1-6
&id=10011061577
&lang=jp
&issn=02869810
&publish=%C6%FC%CB%DC%B9%C3%C3%EE%B3%D8%B2%F1
&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F
Note the percent-encoded fields, such as %B7%A6%CC%DA+%B4%B4%C9%D7. This string represents the author's name, 窪木 幹夫. It took me a little while to figure out how to convert %B7%A6%CC%DA+%B4%B4%C9%D7 to 窪木 幹夫. Eventually I discovered this table, which shows that there are a number of ways to represent Japanese characters, including JIS, SJIS, and EUC-JP. Given that C9D7 = 夫, the string is EUC-JP encoded. What I want is UTF-8. After some fussing, it turns out that all I need to do (in PHP) is:

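// Decode the percent-encoding; the resulting bytes are EUC-JP.
// Note that rawurldecode leaves "+" alone (urldecode would turn it into a space).
$decoded_str = rawurldecode($str);
if (mb_detect_encoding($decoded_str) != 'ASCII')
{
    // Non-ASCII content: convert the EUC-JP bytes to UTF-8.
    $decoded_str = mb_convert_encoding($decoded_str, 'UTF-8', 'EUC-JP');
}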
rawurldecode decodes the percent-encoding to EUC-JP, then mb_convert_encoding gives me UTF-8.
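
For comparison, here is a rough Python equivalent of the same two steps (percent-decode, then interpret the bytes as EUC-JP). It is a sketch rather than a port of the PHP: the function names are hypothetical, and the link is a shortened copy of the one above. Python strings are Unicode, so the UTF-8 conversion happens when the string is encoded for output.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit


def decode_eucjp_field(value):
    """Percent-decode a form-encoded value whose bytes are EUC-JP."""
    # unquote_plus also turns "+" into a space; plain ASCII text such as
    # "Pidonia" passes through unchanged.
    return unquote_plus(value, encoding="euc_jp")


def parse_cinii_link(url):
    """Split a createData link into a dict of decoded fields."""
    query = urlsplit(url).query
    fields = parse_qs(query, encoding="euc_jp")
    # parse_qs returns a list per key; each key appears once in these links.
    return {key: values[0] for key, values in fields.items()}


# Shortened version of the BibTeX link shown above.
link = (
    "http://ci.nii.ac.jp/openurl/servlet/createData?type=bib"
    "&au=%B7%A6%CC%DA+%B4%B4%C9%D7&year=20030430&vol=00058&spage=1-6"
)
record = parse_cinii_link(link)
# Print the author's name as UTF-8 bytes, ready to store or emit.
print(record["au"].encode("utf-8"), record["year"])
```

Passing the encoding to the URL functions avoids the separate ASCII check, since fields with no Japanese characters decode to themselves.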
As an example, here is the above reference displayed by the bioGUID OpenURL resolver. A small victory, but it is nice to display the Japanese title. The English title of this article is "A New Subgenus of the Genus Pidonia MULSANT (Coleoptera: Cerambycidae)". It's perhaps the major triumph of Linnean taxonomy that even though I can't read a word of Japanese, I know the paper is about Pidonia.

Tuesday, August 26, 2008

Perceptive Pixel Taxonomy Demo



Found this while Googling. Demo by Perceptive Pixel of browsing the ITIS classification using their multi-touch technology. I want one...

Vince Smith wins 2008 Ebbe Nielsen Prize

As spotted by dechronization, GBIF has made public that Vince Smith has won the 2008 Ebbe Nielsen Prize. The award "recognises a researcher who is combining biosystematics and biodiversity informatics research in an exciting and novel way."

For Vince the award brings kudos, recognition, and €30,000 (just a little less than the fortune implied by dechronization ;) ). For me, it's an opportunity for unseemly basking in reflected glory (Vince is a former PhD student of mine, and also spent a Wellcome Trust Fellowship in my lab in the heady days when I cared about lice). If you haven't seen it, check out Vince's blog, and the Scratchpads.

Saturday, August 23, 2008

Reasons text mining will fail. I. UTM Grid References and GenBank accession numbers

OMG. Playing with extracting identifiers from text, I have a regular expression for GenBank accession numbers that looks something like this:
(A[A-Z])[0-9]{6} | (U)[0-9]{5} | (D[A-Z])[0-9]{6} | (E[A-Z])[0-9]{6} | (NC_)[0-9]{6}
OK, it won't get everything, but what is more worrying are the things it will pick up that aren't GenBank accession numbers.

For example, I ran Robert Mesibov's 2005 paper "The millipede genus Lissodesmus Chamberlin, 1920 (Diplopoda: Polydesmida: Dalodesmidae) from Tasmania and Victoria, with descriptions of a new genus and 24 new species" [PDF here] through a script, and out came loads of GenBank accession numbers ... which is a worry as there aren't any sequences in this paper.

Turns out, Mesibov uses UTM grid references to describe localities, and these look just like GenBank accession numbers. There is a nice web site here which describes how UTM grid references are determined in Tasmania (from which the image below is taken).

Not all the "accession numbers" in Mesibov (2005) exist in GenBank, but some do. For example, grid reference DQ402119 (41°26'31''S 146°17'02''E) is also a sequence, DQ402119, and, you guessed it, it's not from a millipede. So, I need to be a little bit careful in extracting identifiers from text.
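
The false positive is easy to reproduce. The sketch below is in Python and uses a tidied version of the rough pattern quoted earlier (it is deliberately incomplete, not the full GenBank accession grammar); the function name is made up for illustration.

```python
import re

# The alternatives mirror the rough pattern quoted above: two letters plus
# six digits for a few prefixes, "U" plus five digits, and "NC_" plus six.
ACCESSION_RE = re.compile(
    r"\b(?:A[A-Z][0-9]{6}|U[0-9]{5}|D[A-Z][0-9]{6}|E[A-Z][0-9]{6}|NC_[0-9]{6})\b"
)


def find_accession_candidates(text):
    """Return every substring that looks like a GenBank accession number."""
    return ACCESSION_RE.findall(text)


# A locality string of the kind used in the millipede paper.
locality = "grid reference DQ402119 (41°26'31''S 146°17'02''E)"
print(find_accession_candidates(locality))
```

The grid reference comes back as a candidate because nothing in the pattern separates the two kinds of string; telling them apart would need something extra, such as looking at the surrounding text or checking what the matching GenBank record actually describes.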

Thursday, August 21, 2008

Elsevier Grand Challenge


Elsevier recently announced the 10 semi-finalists for their Grand Challenge. To my consternation, I'm one of them. I wrote a proposal entitled "Towards realising Darwin’s dream: setting the trees free" (I have uploaded a copy to Nature Precedings; it should be available shortly, see doi:10.1038/npre.2008.2217.1). The "setting the trees free" is a reference to my oft-expressed view that much of our knowledge of evolutionary history is locked up in the pages of Molecular Phylogenetics and Evolution.

Of course, writing a proposal is one thing, making something useful is quite another. I envision something along the lines of this, but *cough* better. Meantime, the other semi-finalists look scarily good.

Wednesday, August 20, 2008

NCBI visualisations I - GenBank Timemap

Time for some fun. In between some tedious text mining I've been meaning to explore some visualisations of NCBI. Here's the first, inspired by Jörn Clausen's wonderful Live Earthquake Mashup (thanks to Donat Agosti for telling me about this). What I've done is take all the frog sequences in GenBank that are georeferenced, add the date those GenBank records were created, generate a KML file, and use Nick Rabinowitz's timemap to plot the KML. The result is here:



By dragging the time line you can see collections of sequences and where the frog samples came from. Clicking on a marker on the Google Map displays a link to the GenBank record. It's all pretty crude, but fun to play with. What I'm toying with is trying to do something like this for new taxa, i.e., a timemap showing where and when new species are described. Sort of a live biodiversity map like the earthquake mashup, albeit not quite so rapidly moving.
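
The KML step can be sketched in a few lines of Python. This is an illustration only, assuming the timeline reads the standard KML TimeStamp element; the function name and the two sample records are made-up placeholders, not real GenBank data.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def q(tag):
    """Qualify a tag name with the KML namespace."""
    return "{%s}%s" % (KML_NS, tag)


def records_to_kml(records):
    """Build a KML document with one time-stamped Placemark per record."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(q("kml"))
    doc = ET.SubElement(kml, q("Document"))
    for rec in records:
        placemark = ET.SubElement(doc, q("Placemark"))
        ET.SubElement(placemark, q("name")).text = rec["accession"]
        stamp = ET.SubElement(placemark, q("TimeStamp"))
        ET.SubElement(stamp, q("when")).text = rec["created"]
        point = ET.SubElement(placemark, q("Point"))
        # KML coordinates are longitude,latitude (in that order).
        ET.SubElement(point, q("coordinates")).text = "%s,%s" % (rec["lon"], rec["lat"])
    return ET.tostring(kml, encoding="unicode")


# Made-up placeholder records (hypothetical names, dates and positions).
sample = [
    {"accession": "EXAMPLE1", "created": "2005-03-14", "lat": -41.44, "lon": 146.28},
    {"accession": "EXAMPLE2", "created": "2007-11-02", "lat": 10.5, "lon": -84.0},
]
print(records_to_kml(sample))
```

The same structure would serve for the new-taxa idea: swap the record creation date for the date of description and the sample locality for the type locality.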

ZooKeys, DOIs, Open Access, and RSS, but why?


ZooKeys (ISSN 1313-2970) is a new journal for the rapid publication of taxonomic names, rather like Zootaxa. On first glance it has some nice features, such as being Open Access (using the Creative Commons Attribution license), DOIs, and RSS feeds -- although these don't validate, partly due to an error at the bottom of the feeds:
<b>Warning</b>:  Cannot modify header information - headers already sent by (output started 
at /home/pensofto/public_html/zookeys/cache/t_compile/%%C2^C2D^C2D18A7A%%rss.tpl.php:5)
in <b>34</b><br />
So, something to fix there.

The RSS feeds are reasonably informative, although they don't include the DOI, which somewhat defeats the point of having them. DOIs need to be first class citizens in taxonomic literature.

But these are technical matters; the real question is why? Why create a new journal when Zootaxa is pumping out new taxonomic papers at an astonishing rate? Why not combine forces (DOIs and RSS for Zootaxa, yay!)? There is an editorial doi:10.3897/zookeys.1.11 that is rather coy about this. Yes, Open Access is a Good Thing™, but Zootaxa has some Open Access articles. Why dilute the effort to transform zoological taxonomy by creating a new journal?

Monday, August 18, 2008

DOIs, the good news and the bad news

The good news is that the merger of Blackwell's digital content with Wiley's has not affected the DOIs, which is exactly as you'd expect, and is a nice demonstration of the power of identifiers that use indirection (although there was a time when Wiley was offline).

For example, the article identified by doi:10.1111/j.1095-8312.2003.00274.x had the URL
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1095-8312.2003.00274.x and now has the URL http://www3.interscience.wiley.com/journal/118833573/abstract. The DOI, of course, hasn't changed, so anybody linking to the paper via the DOI (for example, in a blog) won't be affected.
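
The indirection is simple to see from the linking side: the citing page only ever stores the DOI and hands it to a resolver, which knows the publisher's current URL. A minimal Python sketch follows, assuming the public doi.org proxy (at the time of writing the equivalent address was dx.doi.org); the function name is made up for illustration.

```python
from urllib.parse import quote

# Assumption: the public DOI proxy at doi.org.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn 'doi:10.xxxx/suffix' (or a bare DOI) into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep "/" literal; percent-encode anything else unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.1111/j.1095-8312.2003.00274.x"))
```

Nothing in the resulting link mentions Blackwell or Wiley, which is why the merger did not break it.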

Naturally, not everything is rosy. The Canadian Journal of Zoology has managed to break just about all their DOIs. Surely there must come a time when CrossRef starts automatically checking DOIs and alerting publishers when they are broken?

Friday, August 15, 2008

Freebase Parallax

Freebase Parallax is a very cool interface to Freebase.


Freebase Parallax: A new way to browse and explore data from David Huynh on Vimeo.

DBpedia, and integrating taxonomy with the rest of the linked data world


While biodiversity informatics putters along, generating loads of globally unique identifiers that nobody else uses, perhaps it's time to take a look at the bigger picture. DBpedia is an effort to extract data from Wikipedia and make it available as linked data. At the heart of this effort is the use of HTTP URIs to identify resources, and reusing those URIs. Hence, for many concepts DBpedia URIs are the default option.

Interestingly, in addition to taxa, Wikipedia has pages on prominent (and not so prominent) taxonomists, such as Thomas Say and Henri Milne-Edwards. When it comes to assigning GUIDs to people, DBpedia URIs would be an obvious choice. For example, http://dbpedia.org/resource/Henri_Milne-Edwards is the URI for Henri Milne-Edwards.

This approach has several advantages. For one, it embeds taxonomic authorities in the broader ocean of linked data. It also makes use of Wikipedia to provide biographical details on taxonomic authorities (many of whom are sufficiently noteworthy to appear in Wikipedia). Until we start linking to other data sources, taxonomic data will remain in its own little ghetto.
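
Minting such an identifier needs no registry of our own. The Python sketch below derives a DBpedia URI from a Wikipedia page title; it assumes a simple title with no characters that need escaping and that the title matches the Wikipedia page name exactly. The function name is hypothetical.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def dbpedia_uri(wikipedia_title):
    """Return the DBpedia resource URI for a (simple) Wikipedia page title."""
    # Wikipedia page names use underscores where the title has spaces.
    return DBPEDIA_RESOURCE + wikipedia_title.strip().replace(" ", "_")


for person in ("Henri Milne-Edwards", "Thomas Say"):
    print(person, "->", dbpedia_uri(person))
```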

Thursday, August 14, 2008

BNCOD 2008 Workshop


The proceedings of the BNCOD 2008 Workshop on "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge" are online. This workshop was held in conjunction with the 25th British National Conference on Databases (BNCOD 2008) at Cardiff, Wales. The papers make interesting reading.

Exploring International Plant Names Index (IPNI) Data using Visualisation by Nicola Nicolson [PDF]
This paper describes visualisation as a means to explore data from the International Plant Names Index (IPNI). Several visualisations are used to display large volumes of data and to help data standardisation efforts. These have potential uses in data mining and in the exploration of taxon concepts.
Nicky explores some visualisations of the IPNI plant name database. Unfortunately only one of these (arguably the least exciting one) is shown in the PDF. The visualisations of citation history using Timeline, and social networks using prefuse are mentioned, but not shown.

Scratchpads: getting biodiversity online, redefining publication by Vince Smith et al. [PDF]
Taxonomists have been slow to adopt the web as a medium for building research communities. Yet, web-based communities hold great potential for accelerating the pace of taxonomic research. Here we describe a social networking application (Scratchpads) that enables communities of biodiversity researchers to manage and publish their data online. In the first year of operation 466 registered users comprising 53 separate communities have collectively generated 110,000 pages within their Scratchpads. Our approach challenges the traditional model of scholarly communication and may serve as a model to other research disciplines beyond biodiversity science.
This is a short note describing Scratchpads, which are built using the Drupal content management system (CMS). Scratchpads provide a simple way for taxonomists to get their content online. Based in large measure on the success of scratchpads, EOL will use Drupal as the basis of their "Lifedesks". There are numerous scratchpads online, although the amount and quality of content is, um, variable.

Managing Biodiversity Knowledge in the Encyclopedia of Life by Jen Schopf et al. [PDF]
The Encyclopedia of Life is currently working with hundreds of Content Providers to create 1.8 million aggregated species pages, consisting of tens of millions of data objects, in the next ten years. This article gives an overview of our current data management and Content Provider interactions.
This is a short note on EOL itself. I've given my views on EOL's progress (or, rather, lack thereof) elsewhere (here, here and here). The first author on this paper has left the project, and at least one of the other authors is leaving. It seems EOL has yet to find its feet (it certainly has no idea of how to use blogs).


Distributed Systems and Automated Biodiversity Informatics: Genomic Analysis and Geographic Visualization of Disease Evolution by Andrew Hill and Robert Guralnick [doi:10.1007/978-3-540-70504-8_28]
A core mission in biodiversity informatics is to build a computing infrastructure for rapid, real-time analysis of biodiversity information. We have created the information technology to mine, analyze, interpret and visualize how diseases are evolving across the globe. The system rapidly collects the newest and most complete data on dangerous strains of viruses that are able to infect human and animal populations. Following completion, the system will also test whether positions in the genome are under positive selection or purifying selection, a useful feature to monitor functional genomic characteristics such as, drug resistance, host specificity, and transmissibility. Our system’s persistent monitoring and reporting of the distribution of dangerous and novel viral strains will allow for better threat forecasting. This information system allows for greatly increased efficiency in tracking the evolution of disease threats.
This paper was one of two contributions chosen to be published in the BNCOD 2008 proceedings ("Sharing Data, Information and Knowledge", doi:10.1007/978-3-540-70504-8, ISBN 978-3-540-70503-1). Rob Guralnick has put a free version online (see his comment below). It describes the very cool system being developed to provide near real time visualisation of disease spread and evolution, and builds on some earlier work published in Systematic Biology (doi:10.1080/10635150701266848).

LSID Deployment in the Catalogue of Life by Ewen Orme et al. [PDF]
In this paper we describe a GBIF/TDWG-funded project in which LSIDs have been deployed in the Catalogue of Life’s Annual and Dynamic Checklist products as a means of identifying species and higher taxa in these large species catalogues. We look at the technical infrastructure requirements and topology for the LSID resolution process and characteristics of the RDF (Resource Description Framework) metadata returned by the resolver. Such characteristics include the use of concepts and relationships taken from the TDWG (Taxonomic Database Working Group) ontology and how a given taxon LSID relates to others including those issued by database providers and those above and below it in the taxonomic tree. Finally we evaluate the project and LSID usage in general. We also look to the future when the CoL LSID infrastructure will have to deal changing taxonomic information, annually in the case of the Annual Checklist and possibly much more frequently in the case of the Dynamic Checklist.

Although I was an early adopter of LSIDs (in my now defunct Taxonomic Search Engine doi:10.1186/1471-2105-6-48 and the very-much alive LSID Tester, doi:10.1186/1751-0473-3-2), I have some reservations about them. The Catalogue of Life uses UUIDs to generate the LSID identifier, which makes for rather ugly looking LSIDs, as David Shorthouse has complained. For example, the LSID for Pinnotheres pisum is urn:lsid:catalogueoflife.org:taxon:ef0ae064-29c1-102b-9a4a-00304854f820:ac2008 (gack). Why these ugly UUIDs? Well, one advantage is that they can be generated in a distributed fashion and remain unique. This would make sense for a project like the Catalogue of Life, which aggregates names from a range of contributors, but in actual fact all the LSIDs at present are of the form "xxxxxxxx-29c1-102b-9a4a-00304854f820", indicating that they are being generated centrally (by MySQL's UUID function, in this case).
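
The shared suffix can be checked directly. This Python sketch splits the LSID into its parts and inspects the UUID; the parse_lsid helper and the Lsid tuple are made up for illustration, and the layout assumed is urn:lsid:authority:namespace:object with an optional revision.

```python
import uuid
from collections import namedtuple

Lsid = namedtuple("Lsid", "authority namespace object_id revision")


def parse_lsid(lsid):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = lsid.split(":")
    prefix = [p.lower() for p in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError("not an LSID: " + lsid)
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "ef0ae064-29c1-102b-9a4a-00304854f820:ac2008"
)
u = uuid.UUID(lsid.object_id)
# Version 1 UUIDs are time-based: the trailing 12 hex digits are the node
# field, which stays constant for a single generating machine.
print(u.version, format(u.node, "012x"))
```

A version of 1 together with a single node value across the whole catalogue is consistent with the identifiers being minted in one place rather than by the individual contributing databases.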

Ironically, when I was talking to Frank Bisby earlier this year, he implied that LSIDs would change with each release if the information about a name changed, thus failing to solve the existing, fundamental design flaw in the Catalogue of Life, namely the lack of stable identifiers! So, at first glance we are stuck with hideous-looking identifiers that may be unstable. Hmmm...

Workflow Systems for Biodiversity Researchers: Existing Problems and Potential Solutions by Russel McIver et al. [PDF]
In this paper we discuss the potential that scientific workflow systems have to support biodiversity researchers in achieving their goals. This potential comes through their ability to harness distributed resources and set up complex, multi-stage experiments. However, there remain concerns over the usability of existing workflow systems and research still needs to be done to help match the functionality of the software to the needs of its users. We discuss some of the existing concerns regarding workflow systems and propose three potential interfaces intended to improve workflow usability. We also outline the software architecture that we have adopted, which is designed to make our proposed workflow interface software interoperable across key workflow systems.
Not sure what to make of this paper. Workflows seem to generate an awful lot of publications, and few tools that people actually use.


Visualisation to Aid Biodiversity Studies through Accurate Taxonomic Reconciliation by Martin Graham et al. [doi:10.1007/978-3-540-70504-8_29]
All aspects of organismal biology rely on the accurate identification of specimens described and observed. This is particularly important for ecological surveys of biodiversity, where organisms must be identified and labelled, both for the purposes of the original research, but also to allow reinterpretation or reuse of collected data by subsequent research projects. Yet it is now clear that biological names in isolation are unsuitable as unique identifiers for organisms. Much modern research in ecology is based on the integration (and re-use) of multiple datasets which are inherently complex, reflecting any of the many spatial and temporal environmental factors and organismal interactions that contribute to a given ecosystem. We describe visualization tools that aid in the process of building concept relations between related classifications and then in understanding the effects of using these relations to match across sets of classifications.
This is the second contribution published in the conference proceedings, but there is also a free version available here from the project's blog. The paper describes TaxVis, a project developing visualisation techniques for comparing multiple taxonomic hierarchies.

The paper discusses taxonomic concepts and the difficulty of establishing what a taxonomist meant when they used a particular name. As much as I understand the argument, I can't shake the feeling that obsessing about taxonomic concepts is ultimately a dead end. It won't scale, and in an age of DNA barcoding, it becomes less and less relevant.

Releasing the content of taxonomic papers: solutions to access and data mining by Chris Lyal and Anna Weitzman [PDF]
Taxonomic information is key to all studies of biodiversity. Taxonomic literature contains vast quantities of that information, but it is under-utilised because it is difficult to access, especially by those in biodiverse countries and non-taxonomists. A number of initiatives are making this literature available on the Web as images or even as unstructured text, but while that improves accessibility, there is more that needs to be done to assist users in locating the publication; locating the relevant part of the publication (article, chapter etc) and locating the text or data required within the relevant part of the publication. Taxonomic information is highly structured and automated scripts can be used to mark-up or parse data from it into atomised pieces that may be searched and repurposed as needed. We have developed a schema, taXMLit that allows for mark-up of taxonomic literature in this way. We have also developed a prototype system, INOTAXA that uses literature marked up in taXMLit for sophisticated data discovery.
This is a nice overview of the challenge of extracting information from legacy literature. There are numerous challenges facing this work, including tasks that are trivial for people, such as determining when an article starts and ends, but which are challenging for computers (see Lu et al. doi:10.1145/1378889.1378918, free copy here -- there is a job related to this question available now). Related efforts are the TaxonX markup being used by Plazi. My own view is that for legacy literature heavy markup is probably overkill; decent text mining will be enough. The real challenge is to stop the rot at source, and enable new taxonomic publications to be marked up as part of the authoring and publishing process.

An architecture to approach distributed biodiversity pollinators relational information into centralized portals based on biodiversity protocols by Pablo Salvanha et al. [PDF]
The present biodiversity distributed solution using DiGIR / TAPIR protocols and the Darwincore2 schema has been very valuable in the centralized portals, which that can provide distributed information in a very quickly way. Using the same concept this paper presents an architecture based on the case study of pollinators to bring the centralization of the relational information to those portals. This architecture is based on a technological structure to facilitate the implementation and extraction from the providers of that relational information, and proposes a model to make this information reliable to be used with the present specimens information on the portal database.
This is a short note on extending DarwinCore to include information about pollination relationships. The wisdom of doing this has been questioned (see Roger Hyam's comment on the proposal).

A Pan-European Species-directories Infrastructure (PESI) by Charles Hussey and Yde de Jong [PDF]
This communication introduces the rationale and aims of a new Europe-wide biodiversity informatics project. PESI defines and coordinates strategies to enhance the quality and reliability of European biodiversity information by integrating the infrastructural components of four major community networks on taxonomic indexing, namely those of marine life, terrestrial plants, fungi and animals, into a joint work programme. This will include functional knowledge networks of both taxonomic experts and regional focal points, which will collaborate on the establishment of standardised and authoritative taxonomic (meta-) data. In addition PESI will coordinate the integration and synchronisation of the European taxonomic information systems into a joint e-infrastructure and the creation of a common user-interface disseminating the pan-European checklists and associated user-services results.
This paper describes PESI, yet another mega-science project in biodiversity, complete with acronyms, work packages, and vacuous, buzzword-compliant statements. Just what the discipline needs...

Tuesday, August 12, 2008

Dinosaurs and the Cretaceous Terrestrial Revolution

Shameless plug. One of my former PhD students, Katie Davis, is second author on "Dinosaurs and the Cretaceous Terrestrial Revolution" (doi:10.1098/rspb.2008.0715), which came out recently in Proceedings of the Royal Society. The abstract:
The observed diversity of dinosaurs reached its highest peak during the mid- and Late Cretaceous, the 50 Myr that preceded their extinction, and yet this explosion of dinosaur diversity may be explained largely by sampling bias. It has long been debated whether dinosaurs were part of the Cretaceous Terrestrial Revolution (KTR), from 125–80 Myr ago, when flowering plants, herbivorous and social insects, squamates, birds and mammals all underwent a rapid expansion. Although an apparent explosion of dinosaur diversity occurred in the mid-Cretaceous, coinciding with the emergence of new groups (e.g. neoceratopsians, ankylosaurid ankylosaurs, hadrosaurids and pachycephalosaurs), results from the first quantitative study of diversification applied to a new super tree of dinosaurs show that this apparent burst in dinosaurian diversity in the last 18 Myr of the Cretaceous is a sampling artefact. Indeed, major diversification shifts occurred largely in the first one-third of the group’s history. Despite the appearance of new clades of medium to large herbivores and carnivores later in dinosaur history, these new originations do not correspond to significant diversification shifts. Instead, the overall geometry of the Cretaceous part of the dinosaur tree does not depart from the null hypothesis of an equal rates model of lineage branching. Furthermore, we conclude that dinosaurs did not experience a progressive decline at the end of the Cretaceous, nor was their evolution driven directly by the KTR.
Now, if we could just get the bird supertree paper out the door...

Thursday, August 07, 2008

Spida of Love

Systematics makes The Colbert Report.



The paper describing Aptostichus stephencolberti by Jason Bond and Amy Stockman has been published in Systematic Biology doi:10.1080/10635150802302443. Jason also described Aptostichus angelinajolieae in the same paper, but I guess she is otherwise engaged.