Fabrice P. Laussy's Web

Substitution of accented characters in Perl when Unicode gets in the way

From laussy.org's Blog about Hacks.
Published: Not published.

In a script like doi2bib, you'd want to have a simple

$bibkey =~ tr/àáâãäåèéêëìíîïòóôõöùúûüçñ/aaaaaaeeeeiiiiooooouuuucn/;

to get rid of all the pesky accents; but with Unicode, that doesn't work so well.

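This is because the original string holds raw UTF-8 bytes rather than decoded characters: ñ, for instance, is the two bytes c3 b1. Since tr/// maps one unit to one unit, and each unit here is a byte, every accented letter takes up two slots in the search list, which then no longer lines up with the replacement list. For example, on pdf = {sci/lópezcarreño25a}, you get: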

  pdf = {sci/lanpezcarreano25a},
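
To see the mismatch directly, here is a minimal sketch (not taken from doi2bib) that compares the length of the same text as raw bytes and as decoded characters. The bytes are written out explicitly so the result does not depend on how the script file is saved; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "lópez" as raw UTF-8 bytes: the ó is the two bytes c3 b3.
my $bytes = "l\xC3\xB3pez";

# The same text as a character string: the ó is now a single character.
my $chars = decode_utf8($bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
# prints: bytes: 6, characters: 5

# On the decoded string, tr/// sees one ó and maps it to one o.
(my $ascii = $chars) =~ tr/\x{F3}/o/;
print "$ascii\n";
# prints: lopez
```

So tr/// is not broken; it is only being fed bytes where it needs characters.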

The substitution

$bibkey =~ s/ó/o/g;

works, because the two bytes of ó are matched as one sequence, but it is not scalable (you need one line per character). Nightmare!

A way out, which I implemented in v°0.8.0 of doi2bib, is to combine decode_utf8 with NFKD normalization, which handles all characters that decompose into a base letter plus accents, like ä, ü, ç, å, etc., and is fairly straightforward. In the preamble, add:

use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFKD);

and at the time of sanitizing your string:

$bibkey = decode_utf8($bibkey);  # interpret bytes as UTF-8
$bibkey = NFKD($bibkey);
$bibkey =~ s/\p{NonspacingMark}//g;  # remove diacritics
$bibkey =~ s/[^\x00-\x7F]//g;        # safety: strip any remaining non-ASCII

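What is done here is

  1. decode_utf8 turns the raw UTF-8 bytes into a string of characters, so that "ó" is one character instead of two bytes.
  2. NFKD decomposes "ó" → "o" + combining acute accent, "ñ" → "n" + combining tilde, etc.
  3. The first regex then strips all combining marks, leaving plain ASCII "o" and "n".
  4. The second regex removes whatever is still not ASCII, as a safety net.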

and this works.
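
For reference, the steps above can be wrapped into a single function. This is only a sketch under my own assumptions: the name strip_accents is hypothetical and is not how doi2bib organizes it, the input is assumed to be raw UTF-8 bytes (not an already decoded string), and \p{Mn} is the short name for the nonspacing-mark property.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: takes raw UTF-8 bytes, returns a plain-ASCII string.
sub strip_accents {
    my ($bytes) = @_;
    my $s = decode_utf8($bytes);   # bytes -> characters
    $s = NFKD($s);                 # "ó" -> "o" + combining acute accent
    $s =~ s/\p{Mn}//g;             # drop the combining (nonspacing) marks
    $s =~ s/[^\x00-\x7F]//g;       # safety: drop anything still non-ASCII
    return $s;
}

# "sci/lópezcarreño25a", written as explicit bytes so that the result
# does not depend on the encoding of this file.
my $clean = strip_accents("sci/l\xC3\xB3pezcarre\xC3\xB1o25a");
print "$clean\n";
# prints: sci/lopezcarreno25a
```

One thing to keep in mind: letters that have no decomposition, such as ø or ß, are not turned into o or ss; they are simply deleted by the final safety line.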