Tuesday, June 10, 2008

Catalogue of Life as a treemap

I have an "on again/off again" relationship with treemaps. Lately, I've been taking another look, partly inspired by Björn Engdahl's MSc thesis Ordered and Unordered Treemap Algorithms and Their Applications on Handheld Devices. He describes a simple treemap algorithm which he calls Split Layout. It has two nice properties: it gives good aspect ratios (most cells in the treemap are approximately square), and it keeps the cells in roughly their original order. This latter property is important, as one thing I find distracting with tree diagrams is when the order of the objects in the tree keeps changing.

I also have an "on again/off again" relationship with the Catalogue of Life, which is potentially very useful, but seems determined to undermine this with some poor design decisions. But, I finally bit the bullet and extracted a complete classification from the 2008 edition of the Catalogue of Life. I downloaded an ISO image, burnt a CD, installed it on a Windows box (gack), grabbed the MySQL database files, and put those on my MacBook Air. Using some tools I developed for working with the NCBI taxonomy, I wanted to extract the tree from the taxa table, only to discover that this table isn't a tree. Not all the taxa in the table are flagged is_accepted_name, and if you remove those that aren't, then the remaining taxa don't form a tree. It's clear that some taxa were orphaned when the table was created. For example, Enteromorpha flexuosa is not an accepted name, and is flagged as such in the taxa table, yet it has four child taxa that are accepted (Enteromorpha flexuosa subsp. linziformis, Enteromorpha flexuosa subsp. biflagellata, Enteromorpha flexuosa subsp. pilifera, and Enteromorpha flexuosa forma submarina). These taxa are orphaned in the tree. Eventually I gave up trying to extract the tree using SQL, and had to traverse the entire structure starting at the root node. This extracts a tree, at the cost of the orphans. It appears that Catalogue of Life haven't checked whether their classification is, in fact, a tree (OK, technically it is a forest as it is a set of disjoint trees comprising the eight kingdoms CoL recognises, but I make it a tree by rooting it on a node called "life").
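To make the root-down traversal concrete, here is a minimal sketch in Python (not the tools I actually used). It assumes each row of the taxa table can be read as a tuple of taxon id, parent id, name and accepted flag. Only the is_accepted_name flag comes from the database as described above; the function name extract_tree, the parent column and the toy ids are all made up for illustration.

```python
from collections import defaultdict, deque

def extract_tree(rows, root_id):
    """Walk down from root_id, following accepted names only.

    rows: iterable of (taxon_id, parent_id, name, is_accepted) tuples.
    The column layout is an assumption, not the real CoL schema.
    Returns (tree, orphans): tree maps each reached node to its accepted
    children; orphans lists accepted taxa the walk never reached.
    """
    children = defaultdict(list)
    accepted = {}
    for taxon_id, parent_id, _name, is_accepted in rows:
        children[parent_id].append(taxon_id)
        accepted[taxon_id] = bool(is_accepted)

    tree = {}
    reached = {root_id}
    queue = deque([root_id])
    while queue:
        node = queue.popleft()
        # Keep accepted children only, and never visit a node twice.
        kids = [c for c in children.get(node, [])
                if accepted[c] and c not in reached]
        tree[node] = kids
        reached.update(kids)
        queue.extend(kids)

    orphans = sorted(t for t, ok in accepted.items()
                     if ok and t not in reached)
    return tree, orphans

# Toy rows with invented ids: an accepted subspecies hangs off a
# name that is not accepted, so the walk from the root never sees it.
rows = [
    (1, 0, "Plantae", 1),
    (2, 1, "Enteromorpha flexuosa", 0),
    (3, 2, "Enteromorpha flexuosa subsp. pilifera", 1),
]
tree, orphans = extract_tree(rows, root_id=0)
```

Here node 0 plays the role of the artificial "life" root. The subspecies ends up in orphans because its only route to the root passes through a name that is not accepted, which is the same situation as the Enteromorpha example.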

After much anguish, I have a tree. I then coded up Engdahl's algorithm, based on the pseudocode he provides on p. 31 of his thesis (I think there's a bug in his code as he doesn't deal with the case when the cell being partitioned is taller than it is wide, but this was easy to fix). One thing I was keen to do is just use HTML, no SVG or Flash. Here's an example of the treemap, showing the eight kingdoms. The area of each taxon is proportional to log10(n + 1), where n is the number of terminal taxa (i.e., species or below) in that taxon (the number of terminals is shown in each cell). The log scale was chosen to avoid mega-diverse groups crowding out the smaller taxa.


Animalia 892,966

Archaea 281

Bacteria 9,588

Chromista 6,855

Fungi 33,017

Plantae 206,843

Protozoa 6,435

Viruses 1,906
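For readers who want to experiment, here is a rough Python sketch of a split-style layout using the log10(n + 1) weights and the kingdom counts above. It is my reading of the general idea (cut the ordered list into two parts of roughly equal weight, cut the rectangle along its longer side in the same proportion, recurse), not a transcription of Engdahl's pseudocode and not the PHP behind the live version. The function name split_layout and the 400 by 300 canvas are arbitrary choices.

```python
import math

def split_layout(items, x, y, w, h):
    """Lay out ordered (name, weight) pairs inside the rectangle x, y, w, h.

    Returns a list of (name, x, y, w, h) cells in the original order.
    """
    if not items:
        return []
    if len(items) == 1:
        return [(items[0][0], x, y, w, h)]

    total = sum(wt for _, wt in items)
    # Pick the cut that brings the first part closest to half the weight.
    running, best_i, best_diff = 0.0, 1, float("inf")
    for i in range(1, len(items)):
        running += items[i - 1][1]
        diff = abs(total / 2 - running)
        if diff < best_diff:
            best_i, best_diff = i, diff
    first, second = items[:best_i], items[best_i:]
    frac = sum(wt for _, wt in first) / total if total > 0 else 0.5

    if w >= h:
        # Wide cell: cut with a vertical line.
        w1 = w * frac
        return (split_layout(first, x, y, w1, h)
                + split_layout(second, x + w1, y, w - w1, h))
    # Tall cell: cut with a horizontal line.
    h1 = h * frac
    return (split_layout(first, x, y, w, h1)
            + split_layout(second, x, y + h1, w, h - h1))

# Terminal-taxon counts for the eight kingdoms, as listed above.
kingdoms = [
    ("Animalia", 892966), ("Archaea", 281), ("Bacteria", 9588),
    ("Chromista", 6855), ("Fungi", 33017), ("Plantae", 206843),
    ("Protozoa", 6435), ("Viruses", 1906),
]
weighted = [(name, math.log10(n + 1)) for name, n in kingdoms]
cells = split_layout(weighted, 0, 0, 400, 300)
for name, cx, cy, cw, ch in cells:
    print(f"{name:10s} x={cx:6.1f} y={cy:6.1f} w={cw:6.1f} h={ch:6.1f}")
```

The tall-cell branch is the case I had to add myself. Without it, every cut runs in the same direction and tall cells degenerate into thin strips.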



The live version is here. It's a bit crude (to go back up the tree just use your browser's back button), but it's simple, and it's HTML. The underlying code is PHP, but it would be quite easy to convert this to JavaScript to make a simple drop-in widget. In addition to Björn Engdahl's algorithm, and the Catalogue of Life data, I should acknowledge Samson's code for generating colour gradients.

There are all sorts of things that could be done to improve this. One approach would be to include exemplar pictures of the taxa in each cell, to help navigate in unfamiliar taxa. Denise Green and Rebecca Shapley's Teaching with a visual tree of life report has some examples of this idea (see their p. 86), and Marcos Weskamp (author of the very cool newsmap) has done a mockup for EOL using Flash.

As to the treemap idea itself, there are some fun things which could be done with it. I'm not convinced that it is great for navigation. However, it is probably very useful for showing changes over time. For example, imagine making the State of Observed Species report dynamic. Take the uBio RSS feed for new names, classify the new names, then colour the treemap cells by the number of new names (in a sense, this is a taxonomic version of newsmap).
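As a rough illustration of the colouring step, here is a small Python sketch that maps a count of new names onto a colour and emits one treemap cell as an absolutely positioned HTML div. The linear white-to-red ramp is a stand-in I chose for simplicity (it is not Samson's gradient code), and the function names, the example count and the cell geometry are all hypothetical.

```python
from html import escape

def gradient_colour(count, max_count, low=(255, 255, 255), high=(204, 0, 0)):
    """Linearly interpolate between two RGB colours and return a hex string."""
    if max_count <= 0:
        t = 0.0
    else:
        t = min(max(count / max_count, 0.0), 1.0)  # clamp to the range 0..1
    rgb = [round(a + (b - a) * t) for a, b in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)

def cell_html(name, x, y, w, h, colour):
    """Render one treemap cell as an absolutely positioned div."""
    style = (f"position:absolute;left:{x:.0f}px;top:{y:.0f}px;"
             f"width:{w:.0f}px;height:{h:.0f}px;background:{colour};")
    return f'<div style="{style}">{escape(name)}</div>'

# Hypothetical example: a taxon with 12 new names, where the busiest
# taxon in the feed has 40.
print(cell_html("Fungi", 0, 0, 120, 90, gradient_colour(12, 40)))
```

A taxon with no new names stays white, and the taxon with the most new names gets the full red, so the busiest parts of the classification stand out at a glance.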

3 comments:

Javier de la Torre said...

Hi,

I have also recently been playing with taxonomic trees and visualization. See the following post for an example of using iPhone UI ideas together with images from Google.

http://biodivertido.blogspot.com/2008/06/taxonomic-browser-in-flex.html

There is a nice component for creating treemaps with Flex. I also tried it and agree with you that, apart from being cool, I don't find it that useful.

Great blog by the way!

Anonymous said...

I had become interested in treemaps a couple of years ago, and hoped someone would find a realistically usable way of implementing a phylogeny using this metaphor.

Rod's example is clearly a step in the right direction...

Андрей said...

Hello to all!
I was also playing with the Catalogue of Life. I'd like to build a hierarchical tree down to genus level. The author writes: "... After much anguish, I have a tree." Maybe someone could help me get the data from the CoL or build the tree. I crashed trying to get a table from the CoL SQL. Thanks.