
Revision as of 20:51, 23 October 2011

Sanitize

sanitize is a bash script that replaces special and accented characters in a filename with their closest ASCII equivalents.

See a blog post for a discussion of the necessity and merit of this task.

Usage

Files in a given directory

Save the source below as a file named sanitize, make it executable and, in the directory containing the files to be fixed, run:

for f in *; do ./sanitize "$f"; done

As provided, the script only prints the proposed renames. If you are happy with the output, uncomment the mv line to apply them. If required, extend the transliteration table with rules of the form:

s@XXX@YYY@g 

where XXX will be replaced by YYY, e.g.,

s@æ@ae@g 
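
To see what a rule does before touching any file, a few rules can be tried on a single name. The sketch below is only an illustration, assuming names encoded in UTF-8; the helper name apply_rules and the sample filename are made up for this example and are not part of sanitize.

```shell
# apply_rules is a hypothetical helper, not part of sanitize.
# It runs a three-rule excerpt of the transliteration table on one name.
apply_rules() {
    printf '%s\n' "$1" | sed '
s@ @_@g
s@æ@ae@g
s@é@e@g
'
}

apply_rules "Cæsar été.txt"   # prints: Caesar_ete.txt
```

Rules are applied in the order in which they appear, each one to the result of the previous one.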

Octal codes are also possible (e.g., \o351). Use "ls -b" to find out which codes appear in a filename.
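
The output of ls -b depends on the ls implementation and on the locale, so another way to get the codes is to dump the bytes of the name with od. This is a minimal sketch, assuming a UTF-8 environment; octal_bytes is a hypothetical helper name, not something sanitize provides.

```shell
# octal_bytes is a hypothetical helper: print each byte of a string in octal.
octal_bytes() {
    printf '%s' "$1" | od -An -t o1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

octal_bytes "é"   # in UTF-8, é is two bytes: 303 251
```

The single-byte codes already in the table, such as \o351, correspond to the Latin-1 encoding of the same letters, where é is the single byte 351.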

All files in all subdirectories

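Also create the run-sanitize file (source below) and make it executable. The command below changes into each directory before calling ./run-sanitize, which in turn calls ./sanitize, so both scripts must be present in every directory visited. Then run, instead of the loop above: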

find . -type d -exec sh -c "cd \"{}\" && ./run-sanitize \"*\""  \;
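
A variant that needs only one copy of sanitize, called by absolute path, could look like the following. It is a sketch under assumptions, not the author's method: sanitize_tree is a hypothetical wrapper, and it relies on sanitize being given a bare filename from within the directory of the file, which is how the script below builds its mv line.

```shell
# sanitize_tree is a hypothetical wrapper, not part of the original scripts.
# $1: path to the sanitize script (must be executable)
# $2: top directory to walk
sanitize_tree() {
    # Resolve the script to an absolute path so it can be called after cd.
    script=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    find "$2" -type f -exec sh -c '
        script=$1; shift
        for path in "$@"; do
            # Run inside the directory of the file and pass the bare name.
            ( cd "$(dirname "$path")" && "$script" "$(basename "$path")" )
        done
    ' sh "$script" {} +
}

# Example (preview only while the mv line is commented out):
#   sanitize_tree ./sanitize .
```

Passing the paths as arguments to sh -c, rather than embedding {} in the command string, keeps names containing quotes or spaces intact.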

Directories

To sanitize the names of directories as well as of files, you can use the following trick (to put in the sanitize script, in place of the mv line):

mkdir -p "`dirname "$sanitized"`"
cp "$1" "$sanitized"

This recreates the directory tree, with names sanitized according to your transliteration table, and copies the (also sanitized) files into it. If you are happy with the result, you can then delete the original structure; the script does not do this itself, for safety.
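
As a self-contained illustration of the copy step, the sketch below sanitizes a whole relative path and copies the file to it. The helper name copy_sanitized is hypothetical, and only the space rule is applied, standing in for the full transliteration table.

```shell
# copy_sanitized is a hypothetical helper illustrating the copy step.
# Only the space rule is applied here; the full table would go in its place.
copy_sanitized() {
    src=$1
    dest=$(printf '%s\n' "$src" | sed 's@ @_@g')
    # Nothing to do when the path is already clean.
    [ "$src" = "$dest" ] && return 0
    mkdir -p "$(dirname "$dest")"
    cp "$src" "$dest"
}

# Example: copy_sanitized "my dir/a b.txt" creates my_dir/a_b.txt
```

The original file and directory are left in place, so the result can be checked before anything is deleted.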

Source

sanitize

#!/bin/bash
#  ____              _ _   _         
# / ___|  __ _ _ __ (_) |_(_)_______ 
# \___ \ / _` | '_ \| | __| |_  / _ \
#  ___) | (_| | | | | | |_| |/ /  __/
# |____/ \__,_|_| |_|_|\__|_/___\___|
#                                    
# sanitize v0.1
# FP Laussy -- fabrice.laussy@gmail.com
# http://laussy.org
# Sun Jun  5 17:20:50 CEST 2011
# (building on TeX+ :)
#
# This script removes special characters in filenames
# according to a transliteration table given below.
# 
# Usage:
# Caution: this is potentially harmful!
# Use only if you know what you are doing.
#
# To use in the files within the same directory:
#
#    for f in *; do ./sanitize "$f"; done
#
# To go recursively through subdirectories:
#
#    find . -type d -exec sh -c "cd \"{}\" && ./run-sanitize \"*\""  \;
#
# where run-sanitize is provided separately.  (it's essentially the
# command above put in a script).

sanitized=`printf '%s\n' "$1" | sed ' 
/^%/d 
#begin transliteration table: 
s@ @_@g
s@Á@A@g 
s@Æ@AE@g 
s@Ê@E@g 
s@É@E@g 
s@Ë@E@g 
s@Ì@I@g 
s@Ý@Y@g 
s@Ù@U@g 
s@Ú@U@g 
s@Ñ@N@g
s@\o323@O@g
s@à@a@g 
s@æ@ae@g 
s@á@a@g 
s@ê@e@g 
s@é@e@g 
s@è@e@g 
s@ë@e@g 
s@ì@i@g 
s@ñ@n@g
s@ó@o@g
s@ú@u@g 
s@\o350@e@g 
s@\o351@e@g
s@\o353@e@g
s@\o364@o@g
s@\o363@o@g
s@\o361@n@g
s@\[@(@g
s@\]@)@g
#end transliteration table 
'`

if [[ "$1" != "$sanitized" ]]
then
echo "$1 --> $sanitized"
#mv "`pwd`/$1" "`pwd`/$sanitized"
fi
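
Two different names can map to the same sanitized name (for example é.txt and è.txt both become e.txt), in which case a plain mv would overwrite one of the files. One possible guard is sketched below; rename_if_free is a hypothetical helper that is not part of sanitize v0.1, and it would be called in place of the mv line.

```shell
# rename_if_free is a hypothetical helper, not part of sanitize v0.1.
# It renames $1 to $2 only when $2 does not already exist.
rename_if_free() {
    if [ -e "$2" ]; then
        echo "skipped: $2 already exists" >&2
        return 1
    fi
    mv -- "$1" "$2"
}

# Example: rename_if_free "$1" "$sanitized"
```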

run-sanitize

To be used for propagating through subdirectories.

#!/bin/bash
#echo `pwd`
for f in $*; do
    ./sanitize "$f"
done

History

* 5 June 2011: First version.