Friday, August 29, 2008

Turning Japanese: EUC-JP, UTF-8, and percent-encoding

In case I forget how to do this, and as an example of how easy it is to get sucked into a black hole of programming micro-details, I spent an hour or more trying to figure out how to handle Japanese characters.

I'm building a database of publications linked to taxonomic names, and I'm interested in linking to electronic versions of those publications. CrossRef and JSTOR provide a lot of references, as does BHL (once they get an OpenURL resolver in place), but there are numerous other sources to be harvested. One is CiNii, the Japanese National Institute of Informatics Scholarly and Academic Information Navigator, which has an OpenURL resolver. For example, I can query CiNii for an article using this URL:
http://ci.nii.ac.jp/openurl/query?ctx_ver=Z39.88-2004&url_ver=Z39.88-2004&ctx_enc=info%3aofi%2fenc%3aUTF-8&rft.date=2003&rft.volume=58&rft.spage=1&rft.epage=6&rft.jtitle=Entomological%20Review%20of%20Japan.
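
A query like this can be assembled from the article metadata rather than typed by hand. What follows is only a minimal sketch in PHP: the function name cinii_openurl and its array argument are hypothetical, the parameter names are simply copied from the URL above, and the journal title is written without a trailing full stop, which is my assumption about the intended query.

```php
<?php
// Hypothetical helper (not part of any CiNii API): builds an OpenURL
// query from article metadata, using the parameter names seen in the
// example URL above.
function cinii_openurl(array $rft): string
{
    $base = 'http://ci.nii.ac.jp/openurl/query';

    // Fixed context parameters, copied from the example URL.
    $params = [
        'ctx_ver' => 'Z39.88-2004',
        'url_ver' => 'Z39.88-2004',
        'ctx_enc' => 'info:ofi/enc:UTF-8',
    ];

    // Referent metadata goes in with an "rft." prefix.
    foreach ($rft as $key => $value) {
        $params['rft.' . $key] = $value;
    }

    // PHP_QUERY_RFC3986 encodes spaces as %20 rather than '+'.
    return $base . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo cinii_openurl([
    'date'   => 2003,
    'volume' => 58,
    'spage'  => 1,
    'epage'  => 6,
    'jtitle' => 'Entomological Review of Japan',
]), "\n";
```

PHP writes the hex digits of percent-escapes in upper case (%3A rather than %3a), which is equivalent to the lower-case form in the URL above.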

If I want to harvest bibliographic metadata, I can parse the resulting HTML. I could follow the links to formats such as BibTeX, but there's enough information in the link itself. For example, there's a link to the BibTeX format that looks like this:

http://ci.nii.ac.jp/openurl/servlet/createData?type=bib
&ca=@article
&au=%B7%A6%CC%DA+%B4%B4%C9%D7
&title=%A5%AB%A5%DF%A5%AD%A5%EA%A5%E0%A5%B7%B2%CAPidonia%C2%B0%A4%CE%BF%B7%B0%A1%C2%B0%A4%CB%A4%C4%A4%A4%A4%C6
&jtitle=%BA%AB%EA%B5%D5%DC%C9%BE%CF%C0+%3D+The+entomological+review+of+Japan
&year=20030430
&vol=00058
&num=00001
&spage=1-6
&id=10011061577
&lang=jp
&issn=02869810
&publish=%C6%FC%CB%DC%B9%C3%C3%EE%B3%D8%B2%F1
&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F
Note the percent-encoded fields, such as %B7%A6%CC%DA+%B4%B4%C9%D7. This string represents the author's name, 窪木 幹夫. It took me a little while to figure out how to convert %B7%A6%CC%DA+%B4%B4%C9%D7 to 窪木 幹夫. Eventually I discovered this table, which shows that there are a number of ways to represent Japanese characters, including JIS, SJIS, and EUC-JP. Given that C9D7 = 夫, the string is EUC-JP encoded. What I want is UTF-8. After some fussing, it turns out that all I need to do (in PHP) is:

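// $str holds one percent-encoded field, e.g. '%B7%A6%CC%DA+%B4%B4%C9%D7'.
// urldecode() undoes the percent-encoding and also turns '+' into a space
// (rawurldecode() would leave the '+' in place), giving raw EUC-JP bytes.
$decoded_str = urldecode($str);
// Plain ASCII values need no conversion.
if (mb_detect_encoding($decoded_str) != 'ASCII')
{
    $decoded_str = mb_convert_encoding($decoded_str, 'UTF-8', 'EUC-JP');
}
urldecode decodes the percent-encoding to EUC-JP (and converts the + between the two parts of the name to a space), then mb_convert_encoding gives me UTF-8.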
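
The same two steps can be applied to every field of the link in one pass. The following is a rough sketch rather than the code I actually run: decode_cinii_link is a hypothetical name, the example link is abbreviated to a few of the fields shown above, and it assumes every value in the link is EUC-JP (plain ASCII values pass through the conversion unchanged).

```php
<?php
// Hypothetical helper: splits the query string of a CiNii createData
// link into fields and converts each value from EUC-JP to UTF-8.
function decode_cinii_link(string $url): array
{
    // Everything after the first '?' is the query string.
    $query = explode('?', $url, 2)[1] ?? '';

    $fields = [];
    foreach (explode('&', $query) as $pair) {
        if ($pair === '') {
            continue;
        }
        [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');

        // urldecode() undoes the percent-encoding and, unlike
        // rawurldecode(), also turns '+' into a space.
        $bytes = urldecode($value);

        // Assumption: values are EUC-JP, as worked out above.
        $fields[urldecode($name)] = mb_convert_encoding($bytes, 'UTF-8', 'EUC-JP');
    }
    return $fields;
}

// Abbreviated version of the link shown earlier.
$link = 'http://ci.nii.ac.jp/openurl/servlet/createData?type=bib'
      . '&ca=@article'
      . '&au=%B7%A6%CC%DA+%B4%B4%C9%D7'
      . '&year=20030430'
      . '&perm_link=http%3A%2F%2Fci.nii.ac.jp%2Fnaid%2F10011061577%2F';

$fields = decode_cinii_link($link);
echo $fields['au'], "\n";
echo $fields['perm_link'], "\n";
```

Run against the abbreviated link, this should print the author as 窪木 幹夫, matching the name given above, followed by the decoded permanent link. It needs PHP 7.1 or later and the mbstring extension.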
As an example, here is the above reference displayed by the bioGUID OpenURL resolver. A small victory, but it is nice to display the Japanese title. The English title of this article is "A New Subgenus of the Genus Pidonia MULSANT (Coleoptera: Cerambycidae)". It's perhaps the major triumph of Linnean taxonomy that even though I can't read a word of Japanese, I know the paper is about Pidonia.
