Please read the LICENSE file, which ships with this software.

For compilation of the C library call "make c-library", for compilation of
the ruby library call "make ruby-library" and for compilation of the
PostgreSQL extension call "make pgsql-library".

For ruby you can also create a gem-file by calling "make ruby-gem".

"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.

*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
the files "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem-file, it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT
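
To make the four forms concrete, here is a minimal sketch. It uses ruby's
built-in String#unicode_normalize rather than utf8proc itself, so that it
runs without this library installed; that substitution is an assumption of
the example. The forms are defined by the Unicode standard, so the results
should correspond to what the option combinations above produce.

```ruby
# Sketch only: uses ruby's standard String#unicode_normalize, not utf8proc.
composed   = "\u00E9"    # "e with acute" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"    # LATIN SMALL LIGATURE FI (a compatibility character)

# Form C composes, Form D decomposes.
nfc = decomposed.unicode_normalize(:nfc)
nfd = composed.unicode_normalize(:nfd)

# The compatibility forms (KC, KD) additionally replace compatibility characters.
nfkc = ligature.unicode_normalize(:nfkc)
nfkd = (ligature + composed).unicode_normalize(:nfkd)

show = ->(s) { s.codepoints.map { |c| format("U+%04X", c) }.join(" ") }

puts show.call(nfc)    # U+00E9
puts show.call(nfd)    # U+0065 U+0301
puts show.call(nfkc)   # U+0066 U+0069
puts show.call(nfkd)   # U+0066 U+0069 U+0065 U+0301
```

Note that Form C and Form D leave the ligature untouched; only the forms
using COMPAT replace it by the two letters "f" and "i".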

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point:
  0x2028.utf8 => "\342\200\250"
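
The escaped string in the example above is octal notation for the three
bytes that encode U+2028 in UTF-8. The following sketch shows where those
bytes come from. It uses ruby's built-in Array#pack instead of Integer#utf8,
which is an assumption made so that it runs without this library.

```ruby
# Sketch only: built-in ruby, not the utf8proc extension.
code_point = 0x2028                  # LINE SEPARATOR
encoded = [code_point].pack("U")     # UTF-8 string holding that one character

bytes = encoded.bytes                # the individual bytes of the encoding
octal = bytes.map { |b| format("\\%o", b) }.join
hex   = bytes.map { |b| format("%02X", b) }.join(" ")

puts octal   # \342\200\250
puts hex     # E2 80 A8
```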

*** POSTGRESQL API ***

For PostgreSQL, two SQL functions named "unifold" and "unistrip" are
supplied. These functions can be used to prepare index fields so that they
are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
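
The difference between folding and stripping can be illustrated in plain
ruby. The sketch below is only a rough approximation and not the algorithm
the extension uses: the helper names "fold_sketch" and "strip_sketch" are
hypothetical, and the steps chosen (compatibility normalization, case
folding, removal of soft hyphens, removal of nonspacing marks) are
assumptions based on the description above.

```ruby
# Sketch only: plain ruby approximation, NOT the extension's actual algorithm.

# Hypothetical helper: compatibility-normalize, case-fold, drop soft hyphens.
def fold_sketch(text)
  text.unicode_normalize(:nfkc).downcase(:fold).delete("\u00AD")
end

# Hypothetical helper: like fold_sketch, but also removes combining marks.
def strip_sketch(text)
  decomposed = fold_sketch(text).unicode_normalize(:nfd)
  decomposed.gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

puts fold_sketch("bath\u00ADtub") == fold_sketch("bathtub")     # true
puts fold_sketch("Hello World") == fold_sketch("hello world")   # true
puts fold_sketch("Zo\u00EB") == fold_sketch("zoe")              # false
puts strip_sketch("Zo\u00EB") == strip_sketch("zoe")            # true
```

In this sketch, folding keeps the diaeresis, so the two spellings still
differ; stripping removes it, so they compare equal.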

NOTICE: The output of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.

- detect stable code points and process segments independently in order to
- do a quick check before normalizing strings to optimize speed
- support stream processing

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc