contrib/subversion/subversion/libsvn_subr/utf8proc/README

   1
   2 Please read the LICENSE file, which ships with this software.
   3
   4
   5 *** QUICK START ***
   6
   7 To build the C library, run "make c-library"; to build the Ruby library,
   8 run "make ruby-library"; and to build the PostgreSQL extension, run
   9 "make pgsql-library".
  10
  11 For Ruby, you can also create a gem-file by running "make ruby-gem".
  12
  13 "make all" can be used to build everything, but both ruby and PostgreSQL
  14 installations are required in this case.
  15
  16
  17 *** GENERAL INFORMATION ***
  18
  19 The C library is found in this directory after successful compilation,
  20 as "libutf8proc.a" and "libutf8proc.so". The Ruby library consists of
  21 the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
  22 subdirectory "ruby/". If you chose to create a gem-file, it is placed in the
  23 "ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
  24 and resides in the "pgsql/" directory.
  25
  26 Both the Ruby library and the PostgreSQL extension are built as stand-alone
  27 libraries and therefore do not depend on the dynamic version of the
  28 C library files, but this behaviour might change in future releases.
  29
  30 The supported Unicode version is 5.0.0.
  31 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
  32       version 5.0.0 was not yet available at the time of implementation.
  33
  34 For Unicode normalizations, the following options have to be used:
  35 Normalization Form C:  STABLE, COMPOSE
  36 Normalization Form D:  STABLE, DECOMPOSE
  37 Normalization Form KC: STABLE, COMPOSE, COMPAT
  38 Normalization Form KD: STABLE, DECOMPOSE, COMPAT
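
To make the four forms concrete, here is a small Ruby sketch. It does not call
utf8proc; it uses Ruby's built-in String#unicode_normalize as a stand-in, on
the assumption that both produce the standard Unicode normalization forms for
this input. The helper name "normalization_forms" is made up for illustration.

```ruby
# Stand-in sketch: uses core Ruby (String#unicode_normalize), NOT utf8proc.
# The comments name the option set this README lists for each form.

# Returns all four normalization forms of +str+ as a Hash.
def normalization_forms(str)
  {
    nfc:  str.unicode_normalize(:nfc),   # STABLE, COMPOSE
    nfd:  str.unicode_normalize(:nfd),   # STABLE, DECOMPOSE
    nfkc: str.unicode_normalize(:nfkc),  # STABLE, COMPOSE, COMPAT
    nfkd: str.unicode_normalize(:nfkd)   # STABLE, DECOMPOSE, COMPAT
  }
end

# Input: "fi" ligature (U+FB01), "e", combining acute accent (U+0301).
normalization_forms("\uFB01e\u0301").each do |name, s|
  puts "#{name}: #{s.codepoints.map { |c| format('U+%04X', c) }.join(' ')}"
end
# nfc:  U+FB01 U+00E9
# nfd:  U+FB01 U+0065 U+0301
# nfkc: U+0066 U+0069 U+00E9
# nfkd: U+0066 U+0069 U+0065 U+0301
```

The COMPAT forms replace the ligature by the plain letters "f" and "i", while
the non-COMPAT forms leave it untouched.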
  39
  40
  41 *** C LIBRARY ***
  42
  43 The documentation for the C library is found in the utf8proc.h header file.
  44 "utf8proc_map" is most likely the function you will use for mapping UTF-8
  45 strings, unless you want to allocate memory yourself.
  46
  47
  48 *** RUBY API ***
  49
  50 The Ruby library adds the methods "utf8map" and "utf8map!" to the String
  51 class, and the method "utf8" to the Integer class.
  52
  53 The String#utf8map method does the same as the "utf8proc_map" C function.
  54 Options for the mapping procedure are passed as symbols, e.g.:
  55 "Hello".utf8map(:casefold) => "hello"
  56
  57 The descriptions of all options are found in the C header file
  58 "utf8proc.h". Note that the corresponding symbols in Ruby are all
  59 lowercase.
  60
  61 String#utf8map! is the destructive variant: the string itself is replaced
  62 by the result.
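
The difference between a copying and a destructive method can be sketched
without utf8proc installed. The example below uses core Ruby's
String#downcase(:fold) and String#downcase!(:fold) as stand-ins for
utf8map(:casefold) and utf8map!(:casefold); the helper names are hypothetical,
and the built-in folding is only assumed to agree with utf8proc for simple
input such as "Hello".

```ruby
# Stand-in sketch: core Ruby case folding, NOT the utf8proc methods.

# Non-destructive: returns a folded copy and leaves +str+ unchanged.
def fold_copy(str)
  str.downcase(:fold)
end

# Destructive: folds +str+ in place and returns the same object.
def fold_in_place(str)
  str.downcase!(:fold)  # returns nil when nothing changed, so ignore it
  str
end

original = "Hello".dup        # dup yields a mutable string
copy = fold_copy(original)
puts copy                     # hello
puts original                 # Hello (unchanged)

fold_in_place(original)
puts original                 # hello (replaced by the result)
```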
  63
  64 There are shortcuts for the 4 normalization forms specified by Unicode:
  65 String#utf8nfd,  String#utf8nfd!,
  66 String#utf8nfc,  String#utf8nfc!,
  67 String#utf8nfkd, String#utf8nfkd!,
  68 String#utf8nfkc, String#utf8nfkc!
  69
  70 The method Integer#utf8 returns a UTF-8 string containing the Unicode
  71 character for the given code point.
  72 0x000A.utf8 => "\n"
  73 0x2028.utf8 => "\342\200\250"
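
The byte values in the second example can be reproduced with core Ruby alone.
The sketch below is not Integer#utf8 itself; it uses Array#pack with the "U"
directive as a stand-in, and both helper names are made up for illustration.

```ruby
# Stand-in sketch: core Ruby (Array#pack "U"), NOT utf8proc's Integer#utf8.

# Encodes a single Unicode code point as a UTF-8 string.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

# Renders every byte of +str+ as a backslash followed by three octal digits.
def octal_escape(str)
  str.bytes.map { |b| format("\\%03o", b) }.join
end

puts codepoint_to_utf8(0x000A).inspect        # "\n"
puts octal_escape(codepoint_to_utf8(0x2028))  # \342\200\250
```

U+2028 (LINE SEPARATOR) needs three bytes in UTF-8: 0xE2 0x80 0xA8, which is
342 200 250 in octal.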
  74
  75
  76 *** POSTGRESQL API ***
  77
  78 For PostgreSQL, two SQL functions are supplied: "unifold" and "unistrip".
  79 They can be used to fold index fields so that string comparisons make
  80 more sense, e.g.
  81 where "bathtub" == "bath<soft hyphen>tub"
  82 or "Hello World" == "hello world".
  83
  84 CREATE TABLE people (
  85   id    serial8 primary key,
  86   name  text,
  87   CHECK (unifold(name) NOTNULL)
  88 );
  89 CREATE INDEX name_idx ON people (unifold(name));
  90 SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
  91
  92 The function "unistrip" removes character marks like accents or diaeresis,
  93 while "unifold" keeps them.
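
The kind of folding described above can be approximated in a few lines of
Ruby. This is only a rough sketch: "fold_sketch" and "strip_sketch" are
hypothetical helpers built on core Ruby, and they are not claimed to use the
same option set as the real "unifold" and "unistrip" SQL functions.

```ruby
# Rough approximation in core Ruby, NOT the utf8proc PostgreSQL functions.

SOFT_HYPHEN = "\u00AD"

# Drops soft hyphens, folds case and composes the result (marks are kept).
def fold_sketch(str)
  str.gsub(SOFT_HYPHEN, "").downcase(:fold).unicode_normalize(:nfc)
end

# Like fold_sketch, but also removes combining marks (category Mn):
# decompose first so that accents become separate code points.
def strip_sketch(str)
  fold_sketch(str).unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

puts fold_sketch("bath\u00ADtub") == fold_sketch("bathtub")   # true
puts fold_sketch("Hello World")                               # hello world
puts strip_sketch("Zo\u00EB")                                 # zoe
```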
  94
  95 NOTICE: The output of these functions can change between releases, as
  96         utf8proc does not follow a versioning stability policy. You have to
  97         rebuild your database indices if you upgrade to a newer version
  98         of utf8proc.
  99
 100
 101 *** TODO ***
 102
 103 - detect stable code points and process segments independently in order to
 104   save memory
 105 - do a quick check before normalizing strings to optimize speed
 106 - support stream processing
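
As an illustration of the "quick check" item only, the idea could look like
the following in Ruby. This is not how utf8proc does or will implement it; it
relies on core Ruby's String#ascii_only? and String#unicode_normalized?, and
the helper name is hypothetical.

```ruby
# Illustration of the quick-check idea in core Ruby, NOT utf8proc code.

# Returns +str+ itself when it is already in NFC, a normalized copy otherwise.
def nfc_with_quick_check(str)
  return str if str.ascii_only?                # pure ASCII is always in NFC
  return str if str.unicode_normalized?(:nfc)  # already normalized
  str.unicode_normalize(:nfc)
end

plain = "abc"
puts nfc_with_quick_check(plain).equal?(plain)        # true (no copy made)
puts nfc_with_quick_check("e\u0301").codepoints.size  # 1
```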
 107
 108
 109 *** CONTACT ***
 110
 111 If you find any bugs or experience difficulties in compiling this software,
 112 please contact us:
 113
 114 Project page: http://www.public-software-group.org/utf8proc
 115
 116