Recursively removing special and accented characters in filenames


Don't use non-ASCII characters in filenames. This rule has exceptions. I often use Unicode characters in data files that are for me only (when I say me I essentially mean my computer) and short-lived, e.g., some computation result that I don't really mind losing, which will happen sooner or later when straying away from ASCII.

For precious data that you are archiving, like photos, which at some point you will want to copy from one device or filesystem to another, or to email or zip, the filenames should never ever contain special characters or accents. Not even a space " ", which is particularly nasty. A typical bad example is:

Tolède - 1.jpg

The - is acceptable, as is the number. Sooner or later, the rest will pop up as something like:

Tol�de

or not pop up at all and will be lost, or destroyed, etc., when you copy, transfer or merely access your file.

You will find stuff on the internet that addresses this problem once it has spread to thousands of files (fixing one by hand is rarely an issue, but sometimes is). However, most scripts simply get rid of the offending characters. Ideally you'd like to keep as much ASCII information as possible, that is, turn "Tolède - 1.jpg" into "Tolede_-_1.jpg" rather than "Tolde-1.jpg".
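To make the difference between transliterating and deleting concrete, here is a minimal Python sketch. It is not the sanitize script itself: the function name `to_ascii_name` and the small `EXTRA` table are made up for illustration, and it assumes that Unicode decomposition (NFKD) is good enough for most accented Latin letters.

```python
import unicodedata

# Hypothetical extras for letters that have no Unicode decomposition.
# Extend to taste.
EXTRA = {"ß": "ss", "æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ø": "o", "Ø": "O"}


def to_ascii_name(name: str) -> str:
    """Transliterate a filename to ASCII, keeping as much information as possible."""
    out = []
    for ch in name:
        if ch in EXTRA:
            out.append(EXTRA[ch])
        elif ch.isspace():
            out.append("_")
        else:
            # NFKD splits "è" into "e" plus a combining accent;
            # keeping only the ASCII part leaves the bare "e".
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(out)


print(to_ascii_name("Tolède - 1.jpg"))  # Tolede_-_1.jpg
```

Characters with no ASCII counterpart and no entry in the table are still dropped silently, so this only narrows the problem rather than solving it for every script.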

I would have assumed it's part of my rights as an internet citizen to readily find scores of good scripts to do just that. After much research, I was left unsatisfied. It's not really a big deal to do it anyway, so here it is.

I propose the sanitize script to do just that. It can recursively visit subdirectories if you so wish and will fix the nasty stuff it finds all the way down the tree. It only deals with filenames, though, so if your directory names are faulty as well, it won't fix them. It should, however, still be able to descend into them and fix the files they contain.

Its output will be something like:

Pierre carr�e.jpg --> Pierre_carree.jpg
Puy-de-D�me_03 - 1.jpg --> Puy-de-Dome_03_-_1.jpg
Puy-de-D�me_03 - 2.jpg --> Puy-de-Dome_03_-_2.jpg
Tol�de - 1.jpg --> Tolede_-_1.jpg
Tol�de - 2.jpg --> Tolede_-_2.jpg

telling you which files it has fixed, being quiet otherwise. It's easy to alter it to behave as you think it should.
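For readers who want to see the shape of such a tool, the behaviour described above (optional recursion, files renamed, directories left alone, output only for fixed files) can be sketched in Python as follows. This is an illustration under assumptions, not the sanitize script: `ascii_name` is a deliberately crude stand-in for a real transliteration table, and `sanitize_tree` with its `recursive` and `dry_run` parameters is a hypothetical name and interface.

```python
import os
import unicodedata


def ascii_name(name):
    """Crude stand-in for a transliteration table: strip accents, spaces to '_'."""
    decomposed = unicodedata.normalize("NFKD", name)
    kept = "".join(c for c in decomposed if ord(c) < 128)
    return kept.replace(" ", "_")


def sanitize_tree(root, recursive=True, dry_run=False):
    """Rename offending files under root; directories are visited, never renamed."""
    renamed = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not recursive:
            dirnames.clear()  # stops os.walk from descending any further
        for old in sorted(filenames):
            new = ascii_name(old)
            if new == old:
                continue  # nothing to fix: stay quiet
            src = os.path.join(dirpath, old)
            dst = os.path.join(dirpath, new)
            if not new or os.path.exists(dst):
                # Never overwrite an existing file or produce an empty name.
                print(f"skipped: {src}")
                continue
            print(f"{old} --> {new}")
            if not dry_run:
                os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

Calling `sanitize_tree(".", dry_run=True)` first prints what would be renamed without touching anything, which is a cheap precaution given that the real thing moves files around.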

Please follow the link (it's here again) for all the useful documentation. Beware that although it's been working for me, it has not been extensively tested, and it moves files around, so use with caution. Also, its transliteration table can be extended. If you need to see the octal code for the fucked-up character, use:

ls -b
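If you would rather get those codes from a script, for instance to decide what to add to a transliteration table, something along these lines should do. It is a rough approximation of what `ls -b` shows, not a reimplementation: the function names `octal_escape` and `list_escaped` are invented here, and unlike `ls` this sketch escapes every non-ASCII byte regardless of locale.

```python
import os


def octal_escape(raw: bytes) -> str:
    """Show printable ASCII as-is and every other byte as a \\ooo octal escape."""
    parts = []
    for b in raw:
        if b == 0x5C:
            parts.append("\\\\")  # double the backslash to keep output unambiguous
        elif 0x20 <= b <= 0x7E:
            parts.append(chr(b))
        else:
            parts.append(f"\\{b:03o}")
    return "".join(parts)


def list_escaped(directory: str = ".") -> None:
    """Print the raw names found in a directory, one per line, with escapes."""
    # Passing bytes to os.listdir returns the names as raw bytes, undecoded.
    for raw in sorted(os.listdir(os.fsencode(directory))):
        print(octal_escape(raw))
```

The same "è" comes out as `\303\250` when the name is stored as UTF-8 and as `\350` when it is Latin-1, which tells you which encoding mangled the file in the first place.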