1 \input texinfo @c -*- texinfo -*-
3 @setfilename gperf.info
4 @settitle Perfect Hash Function Generator
5 @c @setchapternewpage odd
8 @c some day we should @include version.texi instead of defining
9 @c these values at hand.
10 @set UPDATED 31 March 2007
13 @c ---------------------
15 @c remove the black boxes generated in the GPL appendix.
18 @c Merge functions into the concept index
22 @dircategory Programming Tools
24 * Gperf: (gperf). Perfect Hash Function Generator.
28 This file documents the features of the GNU Perfect Hash Function
29 Generator @value{VERSION}.
31 Copyright @copyright{} 1989-2006 Free Software Foundation, Inc.
33 Permission is granted to make and distribute verbatim copies of this
34 manual provided the copyright notice and this permission notice are
35 preserved on all copies.
38 Permission is granted to process this file through TeX and print the
39 results, provided the printed document carries a copying permission
40 notice identical to this one except for the removal of this paragraph
41 (this paragraph not being relevant to the printed manual).
45 Permission is granted to copy and distribute modified versions of this
46 manual under the conditions for verbatim copying, provided also that the
47 section entitled ``GNU General Public License'' is included exactly as
48 in the original, and provided that the entire resulting derived work is
49 distributed under the terms of a permission notice identical to this
52 Permission is granted to copy and distribute translations of this manual
53 into another language, under the above conditions for modified versions,
54 except that the section entitled ``GNU General Public License'' and this
55 permission notice may be included in translations approved by the Free
56 Software Foundation instead of in the original English.
61 @title User's Guide to @code{gperf} @value{VERSION}
62 @subtitle The GNU Perfect Hash Function Generator
63 @subtitle Edition @value{EDITION}, @value{UPDATED}
64 @author Douglas C. Schmidt
68 @vskip 0pt plus 1filll
69 Copyright @copyright{} 1989-2007 Free Software Foundation, Inc.
72 Permission is granted to make and distribute verbatim copies of
73 this manual provided the copyright notice and this permission notice
74 are preserved on all copies.
76 Permission is granted to copy and distribute modified versions of this
77 manual under the conditions for verbatim copying, provided also that the
78 section entitled ``GNU General Public License'' is included
79 exactly as in the original, and provided that the entire resulting
80 derived work is distributed under the terms of a permission notice
81 identical to this one.
83 Permission is granted to copy and distribute translations of this manual
84 into another language, under the above conditions for modified versions,
85 except that the section entitled ``GNU General Public License'' may be
86 included in a translation approved by the author instead of in the
91 @node Top, Copying, (dir), (dir)
94 This manual documents the GNU @code{gperf} perfect hash function generator
95 utility, focusing on its features and how to use them, and how to report
99 * Copying:: GNU @code{gperf} General Public License says
100 how you can copy and share @code{gperf}.
101 * Contributors:: People who have contributed to @code{gperf}.
102 * Motivation:: The purpose of @code{gperf}.
103 * Search Structures:: Static search structures and GNU @code{gperf}
104 * Description:: High-level discussion of how GPERF functions.
105 * Options:: A description of options to the program.
106 * Bugs:: Known bugs and limitations with GPERF.
107 * Projects:: Things still left to do.
108 * Bibliography:: Material Referenced in this Report.
112 @detailmenu --- The Detailed Node Listing ---
114 High-Level Description of GNU @code{gperf}
116 * Input Format:: Input Format to @code{gperf}
117 * Output Format:: Output Format for Generated C Code with @code{gperf}
118 * Binary Strings:: Use of NUL bytes
120 Input Format to @code{gperf}
122 * Declarations:: Declarations.
123 * Keywords:: Format for Keyword Entries.
124 * Functions:: Including Additional C Functions.
125 * Controls for GNU indent:: Where to place directives for GNU @code{indent}.
129 * User-supplied Struct:: Specifying keywords with attributes.
130 * Gperf Declarations:: Embedding command line options in the input.
131 * C Code Inclusion:: Including C declarations and definitions.
133 Invoking @code{gperf}
135 * Input Details:: Options that affect Interpretation of the Input File
136 * Output Language:: Specifying the Language for the Output Code
137 * Output Details:: Fine tuning Details in the Output Code
138 * Algorithmic Details:: Changing the Algorithms employed by @code{gperf}
139 * Verbosity:: Informative Output
146 @node Copying, Contributors, Top, Top
147 @unnumbered GNU GENERAL PUBLIC LICENSE
150 @node Contributors, Motivation, Copying, Top
151 @unnumbered Contributors to GNU @code{gperf} Utility
156 The GNU @code{gperf} perfect hash function generator utility was
157 written in GNU C++ by Douglas C. Schmidt. The general
158 idea for the perfect hash function generator was inspired by Keith
159 Bostic's algorithm written in C, and distributed to net.sources around
160 1984. The current program is a heavily modified, enhanced, and extended
161 implementation of Keith's basic idea, created at the University of
162 California, Irvine. Bugs, patches, and suggestions should be reported
163 to @code{<bug-gnu-gperf@@gnu.org>}.
166 Special thanks is extended to Michael Tiemann and Doug Lea, for
167 providing a useful compiler, and for giving me a forum to exhibit my
170 In addition, Adam de Boor and Nels Olson provided many tips and insights
171 that greatly helped improve the quality and functionality of @code{gperf}.
174 Bruno Haible enhanced and optimized the search algorithm. He also rewrote
175 the input routines and the output routines for better reliability, and
179 @node Motivation, Search Structures, Contributors, Top
180 @chapter Introduction
182 @code{gperf} is a perfect hash function generator written in C++. It
183 transforms an @var{n} element user-specified keyword set @var{W} into a
184 perfect hash function @var{F}. @var{F} uniquely maps keywords in
185 @var{W} onto the range 0..@var{k}, where @var{k} >= @var{n-1}. If @var{k}
186 = @var{n-1} then @var{F} is a @emph{minimal} perfect hash function.
187 @code{gperf} generates a 0..@var{k} element static lookup table and a
188 pair of C functions. These functions determine whether a given
189 character string @var{s} occurs in @var{W}, using at most one probe into
192 @code{gperf} currently generates the reserved keyword recognizer for
193 lexical analyzers in several production and research compilers and
194 language processing tools, including GNU C, GNU C++, GNU Java, GNU Pascal,
195 GNU Modula 3, and GNU indent. Complete C++ source code for @code{gperf} is
196 available from @code{http://ftp.gnu.org/pub/gnu/gperf/}.
197 A paper describing @code{gperf}'s design and implementation in greater
198 detail is available in the Second USENIX C++ Conference proceedings
199 or from @code{http://www.cs.wustl.edu/~schmidt/resume.html}.
201 @node Search Structures, Description, Motivation, Top
202 @chapter Static search structures and GNU @code{gperf}
203 @cindex Static search structure
205 A @dfn{static search structure} is an Abstract Data Type with certain
206 fundamental operations, e.g., @emph{initialize}, @emph{insert},
207 and @emph{retrieve}. Conceptually, all insertions occur before any
208 retrievals. In practice, @code{gperf} generates a @emph{static} array
209 containing search set keywords and any associated attributes specified
210 by the user. Thus, there is essentially no execution-time cost for the
211 insertions. It is a useful data structure for representing @emph{static
212 search sets}. Static search sets occur frequently in software system
213 applications. Typical static search sets include compiler reserved
214 words, assembler instruction opcodes, and built-in shell interpreter
215 commands. Search set members, called @dfn{keywords}, are inserted into
216 the structure only once, usually during program initialization, and are
217 not generally modified at run-time.
219 Numerous static search structure implementations exist, e.g.,
220 arrays, linked lists, binary search trees, digital search tries, and
221 hash tables. Different approaches offer trade-offs between space
222 utilization and search time efficiency. For example, an @var{n} element
223 sorted array is space efficient, though the average-case time
224 complexity for retrieval operations using binary search is
225 proportional to log @var{n}. Conversely, hash table implementations
226 often locate a table entry in constant time, but typically impose
227 additional memory overhead and exhibit poor worst case performance.
229 @cindex Minimal perfect hash functions
230 @emph{Minimal perfect hash functions} provide an optimal solution for a
231 particular class of static search sets. A minimal perfect hash
232 function is defined by two properties:
236 It allows keyword recognition in a static search set using at most
237 @emph{one} probe into the hash table. This represents the ``perfect''
240 The actual memory allocated to store the keywords is precisely large
241 enough for the keyword set, and @emph{no larger}. This is the
242 ``minimal'' property.
245 For most applications it is far easier to generate @emph{perfect} hash
246 functions than @emph{minimal perfect} hash functions. Moreover,
247 non-minimal perfect hash functions frequently execute faster than
248 minimal ones in practice. This phenomena occurs since searching a
249 sparse keyword table increases the probability of locating a ``null''
250 entry, thereby reducing string comparisons. @code{gperf}'s default
251 behavior generates @emph{near-minimal} perfect hash functions for
252 keyword sets. However, @code{gperf} provides many options that permit
253 user control over the degree of minimality and perfection.
255 Static search sets often exhibit relative stability over time. For
256 example, Ada's 63 reserved words have remained constant for nearly a
257 decade. It is therefore frequently worthwhile to expend concerted
258 effort building an optimal search structure @emph{once}, if it
259 subsequently receives heavy use multiple times. @code{gperf} removes
260 the drudgery associated with constructing time- and space-efficient
261 search structures by hand. It has proven a useful and practical tool
262 for serious programming projects. Output from @code{gperf} is currently
263 used in several production and research compilers, including GNU C, GNU
264 C++, GNU Java, GNU Pascal, and GNU Modula 3. The latter two compilers are
265 not yet part of the official GNU distribution. Each compiler utilizes
266 @code{gperf} to automatically generate static search structures that
267 efficiently identify their respective reserved keywords.
269 @node Description, Options, Search Structures, Top
270 @chapter High-Level Description of GNU @code{gperf}
273 * Input Format:: Input Format to @code{gperf}
274 * Output Format:: Output Format for Generated C Code with @code{gperf}
275 * Binary Strings:: Use of NUL bytes
278 The perfect hash function generator @code{gperf} reads a set of
279 ``keywords'' from an input file (or from the standard input by
280 default). It attempts to derive a perfect hashing function that
281 recognizes a member of the @dfn{static keyword set} with at most a
282 single probe into the lookup table. If @code{gperf} succeeds in
283 generating such a function it produces a pair of C source code routines
284 that perform hashing and table lookup recognition. All generated C code
285 is directed to the standard output. Command-line options described
286 below allow you to modify the input and output format to @code{gperf}.
288 By default, @code{gperf} attempts to produce time-efficient code, with
289 less emphasis on efficient space utilization. However, several options
290 exist that permit trading-off execution time for storage space and vice
291 versa. In particular, expanding the generated table size produces a
292 sparse search structure, generally yielding faster searches.
293 Conversely, you can direct @code{gperf} to utilize a C @code{switch}
294 statement scheme that minimizes data space storage size. Furthermore,
295 using a C @code{switch} may actually speed up the keyword retrieval time
296 somewhat. Actual results depend on your C compiler, of course.
298 In general, @code{gperf} assigns values to the bytes it is using
299 for hashing until some set of values gives each keyword a unique value.
300 A helpful heuristic is that the larger the hash value range, the easier
301 it is for @code{gperf} to find and generate a perfect hash function.
302 Experimentation is the key to getting the most from @code{gperf}.
304 @node Input Format, Output Format, Description, Description
305 @section Input Format to @code{gperf}
307 @cindex Declaration section
308 @cindex Keywords section
309 @cindex Functions section
310 You can control the input file format by varying certain command-line
311 arguments, in particular the @samp{-t} option. The input's appearance
312 is similar to GNU utilities @code{flex} and @code{bison} (or UNIX
313 utilities @code{lex} and @code{yacc}). Here's an outline of the general
326 @emph{Unlike} @code{flex} or @code{bison}, the declarations section and
327 the functions section are optional. The following sections describe the
328 input format for each section.
331 * Declarations:: Declarations.
332 * Keywords:: Format for Keyword Entries.
333 * Functions:: Including Additional C Functions.
334 * Controls for GNU indent:: Where to place directives for GNU @code{indent}.
337 It is possible to omit the declaration section entirely, if the @samp{-t}
338 option is not given. In this case the input file begins directly with the
339 first keyword line, e.g.:
351 @node Declarations, Keywords, Input Format, Input Format
352 @subsection Declarations
354 The keyword input file optionally contains a section for including
355 arbitrary C declarations and definitions, @code{gperf} declarations that
356 act like command-line options, as well as for providing a user-supplied
360 * User-supplied Struct:: Specifying keywords with attributes.
361 * Gperf Declarations:: Embedding command line options in the input.
362 * C Code Inclusion:: Including C declarations and definitions.
365 @node User-supplied Struct, Gperf Declarations, Declarations, Declarations
366 @subsubsection User-supplied @code{struct}
368 If the @samp{-t} option (or, equivalently, the @samp{%struct-type} declaration)
369 @emph{is} enabled, you @emph{must} provide a C @code{struct} as the last
370 component in the declaration section from the input file. The first
371 field in this struct must be of type @code{char *} or @code{const char *}
372 if the @samp{-P} option is not given, or of type @code{int} if the option
373 @samp{-P} (or, equivalently, the @samp{%pic} declaration) is enabled.
374 This first field must be called @samp{name}, although it is possible to modify
375 its name with the @samp{-K} option (or, equivalently, the
376 @samp{%define slot-name} declaration) described below.
378 Here is a simple example, using months of the year and their attributes as
383 struct month @{ char *name; int number; int days; int leap_days; @};
401 Separating the @code{struct} declaration from the list of keywords and
402 other fields are a pair of consecutive percent signs, @samp{%%},
403 appearing left justified in the first column, as in the UNIX utility
406 If the @code{struct} has already been declared in an include file, it can
407 be mentioned in an abbreviated form, like this:
418 @node Gperf Declarations, C Code Inclusion, User-supplied Struct, Declarations
419 @subsubsection Gperf Declarations
421 The declaration section can contain @code{gperf} declarations. They
422 influence the way @code{gperf} works, like command line options do.
423 In fact, every such declaration is equivalent to a command line option.
424 There are three forms of declarations:
428 Declarations without argument, like @samp{%compare-lengths}.
431 Declarations with an argument, like @samp{%switch=@var{count}}.
434 Declarations of names of entities in the output file, like
435 @samp{%define lookup-function-name @var{name}}.
438 When a declaration is given both in the input file and as a command line
439 option, the command-line option's value prevails.
441 The following @code{gperf} declarations are available.
444 @item %delimiters=@var{delimiter-list}
445 @cindex @samp{%delimiters}
446 Allows you to provide a string containing delimiters used to
447 separate keywords from their attributes. The default is ",". This
448 option is essential if you want to use keywords that have embedded
452 @cindex @samp{%struct-type}
453 Allows you to include a @code{struct} type declaration for generated
454 code; see above for an example.
457 @cindex @samp{%ignore-case}
458 Consider upper and lower case ASCII characters as equivalent. The string
459 comparison will use a case insignificant character comparison. Note that
460 locale dependent case mappings are ignored.
462 @item %language=@var{language-name}
463 @cindex @samp{%language}
464 Instructs @code{gperf} to generate code in the language specified by the
465 option's argument. Languages handled are currently:
469 Old-style K&R C. This language is understood by old-style C compilers and
470 ANSI C compilers, but ANSI C compilers may flag warnings (or even errors)
471 because of lacking @samp{const}.
474 Common C. This language is understood by ANSI C compilers, and also by
475 old-style C compilers, provided that you @code{#define const} to empty
476 for compilers which don't know about this keyword.
479 ANSI C. This language is understood by ANSI C compilers and C++ compilers.
482 C++. This language is understood by C++ compilers.
487 @item %define slot-name @var{name}
488 @cindex @samp{%define slot-name}
489 This declaration is only useful when option @samp{-t} (or, equivalently, the
490 @samp{%struct-type} declaration) has been given.
491 By default, the program assumes the structure component identifier for
492 the keyword is @samp{name}. This option allows an arbitrary choice of
493 identifier for this component, although it still must occur as the first
494 field in your supplied @code{struct}.
496 @item %define initializer-suffix @var{initializers}
497 @cindex @samp{%define initializer-suffix}
498 This declaration is only useful when option @samp{-t} (or, equivalently, the
499 @samp{%struct-type} declaration) has been given.
500 It permits to specify initializers for the structure members following
501 @var{slot-name} in empty hash table entries. The list of initializers
502 should start with a comma. By default, the emitted code will
503 zero-initialize structure members following @var{slot-name}.
505 @item %define hash-function-name @var{name}
506 @cindex @samp{%define hash-function-name}
507 Allows you to specify the name for the generated hash function. Default
508 name is @samp{hash}. This option permits the use of two hash tables in
511 @item %define lookup-function-name @var{name}
512 @cindex @samp{%define lookup-function-name}
513 Allows you to specify the name for the generated lookup function.
514 Default name is @samp{in_word_set}. This option permits multiple
515 generated hash functions to be used in the same application.
517 @item %define class-name @var{name}
518 @cindex @samp{%define class-name}
519 This option is only useful when option @samp{-L C++} (or, equivalently,
520 the @samp{%language=C++} declaration) has been given. It
521 allows you to specify the name of generated C++ class. Default name is
526 This option specifies that all strings that will be passed as arguments
527 to the generated hash function and the generated lookup function will
528 solely consist of 7-bit ASCII characters (bytes in the range 0..127).
529 (Note that the ANSI C functions @code{isalnum} and @code{isgraph} do
530 @emph{not} guarantee that a byte is in this range. Only an explicit
531 test like @samp{c >= 'A' && c <= 'Z'} guarantees this.)
533 @item %compare-lengths
534 @cindex @samp{%compare-lengths}
535 Compare keyword lengths before trying a string comparison. This option
536 is mandatory for binary comparisons (@pxref{Binary Strings}). It also might
537 cut down on the number of string comparisons made during the lookup, since
538 keywords with different lengths are never compared via @code{strcmp}.
539 However, using @samp{%compare-lengths} might greatly increase the size of the
540 generated C code if the lookup table range is large (which implies that
541 the switch option @samp{-S} or @samp{%switch} is not enabled), since the length
542 table contains as many elements as there are entries in the lookup table.
544 @item %compare-strncmp
545 @cindex @samp{%compare-strncmp}
546 Generates C code that uses the @code{strncmp} function to perform
547 string comparisons. The default action is to use @code{strcmp}.
549 @item %readonly-tables
550 @cindex @samp{%readonly-tables}
551 Makes the contents of all generated lookup tables constant, i.e.,
552 ``readonly''. Many compilers can generate more efficient code for this
553 by putting the tables in readonly memory.
557 Define constant values using an enum local to the lookup function rather
558 than with #defines. This also means that different lookup functions can
559 reside in the same file. Thanks to James Clark @code{<jjc@@ai.mit.edu>}.
562 @cindex @samp{%includes}
563 Include the necessary system include file, @code{<string.h>}, at the
564 beginning of the code. By default, this is not done; the user must
565 include this header file himself to allow compilation of the code.
568 @cindex @samp{%global-table}
569 Generate the static table of keywords as a static global variable,
570 rather than hiding it inside of the lookup function (which is the
575 Optimize the generated table for inclusion in shared libraries. This
576 reduces the startup time of programs using a shared library containing
577 the generated code. If the @samp{%struct-type} declaration (or,
578 equivalently, the option @samp{-t}) is also given, the first field of the
579 user-defined struct must be of type @samp{int}, not @samp{char *}, because
580 it will contain offsets into the string pool instead of actual strings.
581 To convert such an offset to a string, you can use the expression
582 @samp{stringpool + @var{o}}, where @var{o} is the offset. The string pool
583 name can be changed through the @samp{%define string-pool-name} declaration.
585 @item %define string-pool-name @var{name}
586 @cindex @samp{%define string-pool-name}
587 Allows you to specify the name of the generated string pool created by
588 the declaration @samp{%pic} (or, equivalently, the option @samp{-P}).
589 The default name is @samp{stringpool}. This declaration permits the use of
590 two hash tables in the same file, with @samp{%pic} and even when the
591 @samp{%global-table} declaration (or, equivalently, the option @samp{-G})
595 @cindex @samp{%null-strings}
596 Use NULL strings instead of empty strings for empty keyword table entries.
597 This reduces the startup time of programs using a shared library containing
598 the generated code (but not as much as the declaration @samp{%pic}), at the
599 expense of one more test-and-branch instruction at run time.
601 @item %define word-array-name @var{name}
602 @cindex @samp{%define word-array-name}
603 Allows you to specify the name for the generated array containing the
604 hash table. Default name is @samp{wordlist}. This option permits the
605 use of two hash tables in the same file, even when the option @samp{-G}
606 (or, equivalently, the @samp{%global-table} declaration) is given.
608 @item %define length-table-name @var{name}
609 @cindex @samp{%define length-table-name}
610 Allows you to specify the name for the generated array containing the
611 length table. Default name is @samp{lengthtable}. This option permits the
612 use of two length tables in the same file, even when the option @samp{-G}
613 (or, equivalently, the @samp{%global-table} declaration) is given.
615 @item %switch=@var{count}
616 @cindex @samp{%switch}
617 Causes the generated C code to use a @code{switch} statement scheme,
618 rather than an array lookup table. This can lead to a reduction in both
619 time and space requirements for some input files. The argument to this
620 option determines how many @code{switch} statements are generated. A
621 value of 1 generates 1 @code{switch} containing all the elements, a
622 value of 2 generates 2 tables with 1/2 the elements in each
623 @code{switch}, etc. This is useful since many C compilers cannot
624 correctly generate code for large @code{switch} statements. This option
625 was inspired in part by Keith Bostic's original C program.
627 @item %omit-struct-type
628 @cindex @samp{%omit-struct-type}
629 Prevents the transfer of the type declaration to the output file. Use
630 this option if the type is already defined elsewhere.
633 @node C Code Inclusion, , Gperf Declarations, Declarations
634 @subsubsection C Code Inclusion
638 Using a syntax similar to GNU utilities @code{flex} and @code{bison}, it
639 is possible to directly include C source text and comments verbatim into
640 the generated output file. This is accomplished by enclosing the region
641 inside left-justified surrounding @samp{%@{}, @samp{%@}} pairs. Here is
642 an input fragment based on the previous example that illustrates this
649 /* This section of code is inserted directly into the output. */
650 int return_month_days (struct month *months, int is_leap_year);
652 struct month @{ char *name; int number; int days; int leap_days; @};
661 @node Keywords, Functions, Declarations, Input Format
662 @subsection Format for Keyword Entries
664 The second input file format section contains lines of keywords and any
665 associated attributes you might supply. A line beginning with @samp{#}
666 in the first column is considered a comment. Everything following the
667 @samp{#} is ignored, up to and including the following newline. A line
668 beginning with @samp{%} in the first column is an option declaration and
669 must not occur within the keywords section.
671 The first field of each non-comment line is always the keyword itself. It
672 can be given in two ways: as a simple name, i.e., without surrounding
673 string quotation marks, or as a string enclosed in double-quotes, in
674 C syntax, possibly with backslash escapes like @code{\"} or @code{\234}
675 or @code{\xa8}. In either case, it must start right at the beginning
676 of the line, without leading whitespace.
677 In this context, a ``field'' is considered to extend up to, but
678 not include, the first blank, comma, or newline. Here is a simple
679 example taken from a partial list of C reserved words:
683 # These are a few C reserved words, see the c.gperf file
684 # for a complete list of ANSI C reserved words.
697 Note that unlike @code{flex} or @code{bison} the first @samp{%%} marker
698 may be elided if the declaration section is empty.
700 Additional fields may optionally follow the leading keyword. Fields
701 should be separated by commas, and terminate at the end of line. What
702 these fields mean is entirely up to you; they are used to initialize the
703 elements of the user-defined @code{struct} provided by you in the
704 declaration section. If the @samp{-t} option (or, equivalently, the
705 @samp{%struct-type} declaration) is @emph{not} enabled
706 these fields are simply ignored. All previous examples except the last
707 one contain keyword attributes.
709 @node Functions, Controls for GNU indent, Keywords, Input Format
710 @subsection Including Additional C Functions
712 The optional third section also corresponds closely with conventions
713 found in @code{flex} and @code{bison}. All text in this section,
714 starting at the final @samp{%%} and extending to the end of the input
715 file, is included verbatim into the generated output file. Naturally,
716 it is your responsibility to ensure that the code contained in this
719 @node Controls for GNU indent, , Functions, Input Format
720 @subsection Where to place directives for GNU @code{indent}.
722 If you want to invoke GNU @code{indent} on a @code{gperf} input file,
723 you will see that GNU @code{indent} doesn't understand the @samp{%%},
724 @samp{%@{} and @samp{%@}} directives that control @code{gperf}'s
725 interpretation of the input file. Therefore you have to insert some
726 directives for GNU @code{indent}. More precisely, assuming the most
727 general input file structure
744 you would insert @samp{*INDENT-OFF*} and @samp{*INDENT-ON*} comments
765 @node Output Format, Binary Strings, Input Format, Description
766 @section Output Format for Generated C Code with @code{gperf}
769 Several options control how the generated C code appears on the standard
770 output. Two C functions are generated. They are called @code{hash} and
771 @code{in_word_set}, although you may modify their names with a command-line
772 option. Both functions require two arguments, a string, @code{char *}
773 @var{str}, and a length parameter, @code{int} @var{len}. Their default
774 function prototypes are as follows:
776 @deftypefun {unsigned int} hash (const char * @var{str}, unsigned int @var{len})
777 By default, the generated @code{hash} function returns an integer value
778 created by adding @var{len} to several user-specified @var{str} byte
779 positions indexed into an @dfn{associated values} table stored in a
780 local static array. The associated values table is constructed
781 internally by @code{gperf} and later output as a static local C array
782 called @samp{hash_table}. The relevant selected positions (i.e. indices
783 into @var{str}) are specified via the @samp{-k} option when running
784 @code{gperf}, as detailed in the @emph{Options} section below (@pxref{Options}).
787 @deftypefun {} in_word_set (const char * @var{str}, unsigned int @var{len})
788 If @var{str} is in the keyword set, returns a pointer to that
789 keyword. More exactly, if the option @samp{-t} (or, equivalently, the
790 @samp{%struct-type} declaration) was given, it returns
791 a pointer to the matching keyword's structure. Otherwise it returns
795 If the option @samp{-c} (or, equivalently, the @samp{%compare-strncmp}
796 declaration) is not used, @var{str} must be a NUL terminated
797 string of exactly length @var{len}. If @samp{-c} (or, equivalently, the
798 @samp{%compare-strncmp} declaration) is used, @var{str} must
799 simply be an array of @var{len} bytes and does not need to be NUL
802 The code generated for these two functions is affected by the following
808 Make use of the user-defined @code{struct}.
810 @item -S @var{total-switch-statements}
811 @itemx --switch=@var{total-switch-statements}
812 @cindex @code{switch}
813 Generate 1 or more C @code{switch} statement rather than use a large,
814 (and potentially sparse) static array. Although the exact time and
815 space savings of this approach vary according to your C compiler's
816 degree of optimization, this method often results in smaller and faster
820 If the @samp{-t} and @samp{-S} options (or, equivalently, the
821 @samp{%struct-type} and @samp{%switch} declarations) are omitted, the default
823 is to generate a @code{char *} array containing the keywords, together with
824 additional empty strings used for padding the array. By experimenting
825 with the various input and output options, and timing the resulting C
826 code, you can determine the best option choices for different keyword
829 @node Binary Strings, , Output Format, Description
830 @section Use of NUL bytes
833 By default, the code generated by @code{gperf} operates on zero
834 terminated strings, the usual representation of strings in C. This means
835 that the keywords in the input file must not contain NUL bytes,
836 and the @var{str} argument passed to @code{hash} or @code{in_word_set}
837 must be NUL terminated and have exactly length @var{len}.
839 If option @samp{-c} (or, equivalently, the @samp{%compare-strncmp}
840 declaration) is used, then the @var{str} argument does not need
841 to be NUL terminated. The code generated by @code{gperf} will only
842 access the first @var{len}, not @var{len+1}, bytes starting at @var{str}.
843 However, the keywords in the input file still must not contain NUL
846 If option @samp{-l} (or, equivalently, the @samp{%compare-lengths}
847 declaration) is used, then the hash table performs binary
848 comparison. The keywords in the input file may contain NUL bytes,
849 written in string syntax as @code{\000} or @code{\x00}, and the code
850 generated by @code{gperf} will treat NUL like any other byte.
851 Also, in this case the @samp{-c} option (or, equivalently, the
852 @samp{%compare-strncmp} declaration) is ignored.
854 @node Options, Bugs, Description, Top
855 @chapter Invoking @code{gperf}
857 There are @emph{many} options to @code{gperf}. They were added to make
858 the program more convenient for use with real applications. ``On-line''
859 help is readily available via the @samp{--help} option. Here is the
860 complete list of options.
863 * Output File:: Specifying the Location of the Output File
864 * Input Details:: Options that affect Interpretation of the Input File
865 * Output Language:: Specifying the Language for the Output Code
866 * Output Details:: Fine tuning Details in the Output Code
867 * Algorithmic Details:: Changing the Algorithms employed by @code{gperf}
868 * Verbosity:: Informative Output
871 @node Output File, Input Details, Options, Options
872 @section Specifying the Location of the Output File
875 @item --output-file=@var{file}
876 Allows you to specify the name of the file to which the output is written to.
879 The results are written to standard output if no output file is specified
880 or if it is @samp{-}.
882 @node Input Details, Output Language, Output File, Options
883 @section Options that affect Interpretation of the Input File
885 These options are also available as declarations in the input file
886 (@pxref{Gperf Declarations}).
889 @item -e @var{keyword-delimiter-list}
890 @itemx --delimiters=@var{keyword-delimiter-list}
892 Allows you to provide a string containing delimiters used to
893 separate keywords from their attributes. The default is ",". This
894 option is essential if you want to use keywords that have embedded
895 commas or newlines. One useful trick is to use -e'TAB', where TAB is
896 the literal tab character.
900 Allows you to include a @code{struct} type declaration for generated
901 code. Any text before a pair of consecutive @samp{%%} is considered
902 part of the type declaration. Keywords and additional fields may follow
903 this, one group of fields per line. A set of examples for generating
904 perfect hash tables and functions for Ada, C, C++, Pascal, Modula 2,
905 Modula 3 and JavaScript reserved words are distributed with this release.
908 Consider upper and lower case ASCII characters as equivalent. The string
909 comparison will use a case insignificant character comparison. Note that
910 locale dependent case mappings are ignored. This option is therefore not
911 suitable if a properly internationalized or locale aware case mapping
912 should be used. (For example, in a Turkish locale, the upper case equivalent
913 of the lowercase ASCII letter @samp{i} is the non-ASCII character
914 @samp{capital i with dot above}.) For this case, it is better to apply
915 an uppercase or lowercase conversion on the string before passing it to
916 the @code{gperf} generated function.
919 @node Output Language, Output Details, Input Details, Options
920 @section Options to specify the Language for the Output Code
922 These options are also available as declarations in the input file
923 (@pxref{Gperf Declarations}).
926 @item -L @var{generated-language-name}
927 @itemx --language=@var{generated-language-name}
928 Instructs @code{gperf} to generate code in the language specified by the
929 option's argument. Languages handled are currently:
933 Old-style K&R C. This language is understood by old-style C compilers and
934 ANSI C compilers, but ANSI C compilers may flag warnings (or even errors)
935 because of lacking @samp{const}.
938 Common C. This language is understood by ANSI C compilers, and also by
939 old-style C compilers, provided that you @code{#define const} to empty
940 for compilers which don't know about this keyword.
943 ANSI C. This language is understood by ANSI C compilers and C++ compilers.
946 C++. This language is understood by C++ compilers.
952 This option is supported for compatibility with previous releases of
953 @code{gperf}. It does not do anything.
956 This option is supported for compatibility with previous releases of
957 @code{gperf}. It does not do anything.
960 @node Output Details, Algorithmic Details, Output Language, Options
961 @section Options for fine tuning Details in the Output Code
963 Most of these options are also available as declarations in the input file
964 (@pxref{Gperf Declarations}).
967 @item -K @var{slot-name}
968 @itemx --slot-name=@var{slot-name}
970 This option is only useful when option @samp{-t} (or, equivalently, the
971 @samp{%struct-type} declaration) has been given.
972 By default, the program assumes the structure component identifier for
973 the keyword is @samp{name}. This option allows an arbitrary choice of
974 identifier for this component, although it still must occur as the first
975 field in your supplied @code{struct}.
977 @item -F @var{initializers}
978 @itemx --initializer-suffix=@var{initializers}
980 This option is only useful when option @samp{-t} (or, equivalently, the
981 @samp{%struct-type} declaration) has been given.
982 It permits to specify initializers for the structure members following
983 @var{slot-name} in empty hash table entries. The list of initializers
984 should start with a comma. By default, the emitted code will
985 zero-initialize structure members following @var{slot-name}.
987 @item -H @var{hash-function-name}
988 @itemx --hash-function-name=@var{hash-function-name}
989 Allows you to specify the name for the generated hash function. Default
990 name is @samp{hash}. This option permits the use of two hash tables in
993 @item -N @var{lookup-function-name}
994 @itemx --lookup-function-name=@var{lookup-function-name}
995 Allows you to specify the name for the generated lookup function.
996 Default name is @samp{in_word_set}. This option permits multiple
997 generated hash functions to be used in the same application.
999 @item -Z @var{class-name}
1000 @itemx --class-name=@var{class-name}
1002 This option is only useful when option @samp{-L C++} (or, equivalently,
1003 the @samp{%language=C++} declaration) has been given. It
1004 allows you to specify the name of generated C++ class. Default name is
1005 @code{Perfect_Hash}.
1009 This option specifies that all strings that will be passed as arguments
1010 to the generated hash function and the generated lookup function will
1011 solely consist of 7-bit ASCII characters (bytes in the range 0..127).
1012 (Note that the ANSI C functions @code{isalnum} and @code{isgraph} do
1013 @emph{not} guarantee that a byte is in this range. Only an explicit
1014 test like @samp{c >= 'A' && c <= 'Z'} guarantees this.) This was the
1015 default in versions of @code{gperf} earlier than 2.7; now the default is
1016 to support 8-bit and multibyte characters.
1019 @itemx --compare-lengths
1020 Compare keyword lengths before trying a string comparison. This option
1021 is mandatory for binary comparisons (@pxref{Binary Strings}). It also might
1022 cut down on the number of string comparisons made during the lookup, since
1023 keywords with different lengths are never compared via @code{strcmp}.
1024 However, using @samp{-l} might greatly increase the size of the
1025 generated C code if the lookup table range is large (which implies that
1026 the switch option @samp{-S} or @samp{%switch} is not enabled), since the length
1027 table contains as many elements as there are entries in the lookup table.
1030 @itemx --compare-strncmp
1031 Generates C code that uses the @code{strncmp} function to perform
1032 string comparisons. The default action is to use @code{strcmp}.
1035 @itemx --readonly-tables
1036 Makes the contents of all generated lookup tables constant, i.e.,
1037 ``readonly''. Many compilers can generate more efficient code for this
1038 by putting the tables in readonly memory.
1042 Define constant values using an enum local to the lookup function rather
1043 than with #defines. This also means that different lookup functions can
1044 reside in the same file. Thanks to James Clark @code{<jjc@@ai.mit.edu>}.
1048 Include the necessary system include file, @code{<string.h>}, at the
1049 beginning of the code. By default, this is not done; the user must
1050 include this header file himself to allow compilation of the code.
1053 @itemx --global-table
1054 Generate the static table of keywords as a static global variable,
1055 rather than hiding it inside of the lookup function (which is the
1060 Optimize the generated table for inclusion in shared libraries. This
1061 reduces the startup time of programs using a shared library containing
1062 the generated code. If the option @samp{-t} (or, equivalently, the
1063 @samp{%struct-type} declaration) is also given, the first field of the
1064 user-defined struct must be of type @samp{int}, not @samp{char *}, because
1065 it will contain offsets into the string pool instead of actual strings.
1066 To convert such an offset to a string, you can use the expression
1067 @samp{stringpool + @var{o}}, where @var{o} is the offset. The string pool
1068 name can be changed through the option @samp{--string-pool-name}.
1070 @item -Q @var{string-pool-name}
1071 @itemx --string-pool-name=@var{string-pool-name}
1072 Allows you to specify the name of the generated string pool created by
1073 option @samp{-P}. The default name is @samp{stringpool}. This option
1074 permits the use of two hash tables in the same file, with @samp{-P} and
1075 even when the option @samp{-G} (or, equivalently, the @samp{%global-table}
1076 declaration) is given.
1078 @item --null-strings
1079 Use NULL strings instead of empty strings for empty keyword table entries.
1080 This reduces the startup time of programs using a shared library containing
1081 the generated code (but not as much as option @samp{-P}), at the expense
1082 of one more test-and-branch instruction at run time.
1084 @item -W @var{hash-table-array-name}
1085 @itemx --word-array-name=@var{hash-table-array-name}
1087 Allows you to specify the name for the generated array containing the
1088 hash table. Default name is @samp{wordlist}. This option permits the
1089 use of two hash tables in the same file, even when the option @samp{-G}
1090 (or, equivalently, the @samp{%global-table} declaration) is given.
1092 @itemx --length-table-name=@var{length-table-array-name}
1094 Allows you to specify the name for the generated array containing the
1095 length table. Default name is @samp{lengthtable}. This option permits the
1096 use of two length tables in the same file, even when the option @samp{-G}
1097 (or, equivalently, the @samp{%global-table} declaration) is given.
1099 @item -S @var{total-switch-statements}
1100 @itemx --switch=@var{total-switch-statements}
1101 @cindex @code{switch}
1102 Causes the generated C code to use a @code{switch} statement scheme,
1103 rather than an array lookup table. This can lead to a reduction in both
1104 time and space requirements for some input files. The argument to this
1105 option determines how many @code{switch} statements are generated. A
1106 value of 1 generates 1 @code{switch} containing all the elements, a
1107 value of 2 generates 2 tables with 1/2 the elements in each
1108 @code{switch}, etc. This is useful since many C compilers cannot
1109 correctly generate code for large @code{switch} statements. This option
1110 was inspired in part by Keith Bostic's original C program.
1113 @itemx --omit-struct-type
1114 Prevents the transfer of the type declaration to the output file. Use
1115 this option if the type is already defined elsewhere.
1118 This option is supported for compatibility with previous releases of
1119 @code{gperf}. It does not do anything.
1122 @node Algorithmic Details, Verbosity, Output Details, Options
1123 @section Options for changing the Algorithms employed by @code{gperf}
1126 @item -k @var{selected-byte-positions}
1127 @itemx --key-positions=@var{selected-byte-positions}
1128 Allows selection of the byte positions used in the keywords'
1129 hash function. The allowable choices range between 1-255, inclusive.
1130 The positions are separated by commas, e.g., @samp{-k 9,4,13,14};
1131 ranges may be used, e.g., @samp{-k 2-7}; and positions may occur
1132 in any order. Furthermore, the wildcard '*' causes the generated
1133 hash function to consider @strong{all} byte positions in each keyword,
1134 whereas '$' instructs the hash function to use the ``final byte''
1135 of a keyword (this is the only way to use a byte position greater than
1138 For instance, the option @samp{-k 1,2,4,6-10,'$'} generates a hash
1139 function that considers positions 1,2,4,6,7,8,9,10, plus the last
1140 byte in each keyword (which may be at a different position for each
1141 keyword, obviously). Keywords
1142 with length less than the indicated byte positions work properly, since
1143 selected byte positions exceeding the keyword length are simply not
1144 referenced in the hash function.
1146 This option is not normally needed since version 2.8 of @code{gperf};
1147 the default byte positions are computed depending on the keyword set,
1148 through a search that minimizes the number of byte positions.
1153 Handle keywords whose selected byte sets hash to duplicate values.
1154 Duplicate hash values can occur if a set of keywords has the same names, but
1155 possesses different attributes, or if the selected byte positions are not well
1156 chosen. With the -D option @code{gperf} treats all these keywords as
1157 part of an equivalence class and generates a perfect hash function with
1158 multiple comparisons for duplicate keywords. It is up to you to completely
1159 disambiguate the keywords by modifying the generated C code. However,
1160 @code{gperf} helps you out by organizing the output.
1162 Using this option usually means that the generated hash function is no
1163 longer perfect. On the other hand, it permits @code{gperf} to work on
1164 keyword sets that it otherwise could not handle.
1166 @item -m @var{iterations}
1167 @itemx --multiple-iterations=@var{iterations}
1168 Perform multiple choices of the @samp{-i} and @samp{-j} values, and
1169 choose the best results. This increases the running time by a factor of
1170 @var{iterations} but does a good job minimizing the generated table size.
1172 @item -i @var{initial-value}
1173 @itemx --initial-asso=@var{initial-value}
1174 Provides an initial @var{value} for the associate values array. Default
1175 is 0. Increasing the initial value helps inflate the final table size,
1176 possibly leading to more time efficient keyword lookups. Note that this
1177 option is not particularly useful when @samp{-S} (or, equivalently,
1178 @samp{%switch}) is used. Also,
1179 @samp{-i} is overridden when the @samp{-r} option is used.
1181 @item -j @var{jump-value}
1182 @itemx --jump=@var{jump-value}
1184 Affects the ``jump value'', i.e., how far to advance the associated
1185 byte value upon collisions. @var{Jump-value} is rounded up to an
1186 odd number, the default is 5. If the @var{jump-value} is 0 @code{gperf}
1187 jumps by random amounts.
1191 Instructs the generator not to include the length of a keyword when
1192 computing its hash value. This may save a few assembly instructions in
1193 the generated lookup table.
1197 Utilizes randomness to initialize the associated values table. This
1198 frequently generates solutions faster than using deterministic
1199 initialization (which starts all associated values at 0). Furthermore,
1200 using the randomization option generally increases the size of the
1203 @item -s @var{size-multiple}
1204 @itemx --size-multiple=@var{size-multiple}
1205 Affects the size of the generated hash table. The numeric argument for
1206 this option indicates ``how many times larger or smaller'' the maximum
1207 associated value range should be, in relationship to the number of keywords.
1208 It can be written as an integer, a floating-point number or a fraction.
1209 For example, a value of 3 means ``allow the maximum associated value to be
1210 about 3 times larger than the number of input keywords''.
1211 Conversely, a value of 1/3 means ``allow the maximum associated value to
1212 be about 3 times smaller than the number of input keywords''. Values
1213 smaller than 1 are useful for limiting the overall size of the generated hash
1214 table, though the option @samp{-m} is better at this purpose.
1216 If `generate switch' option @samp{-S} (or, equivalently, @samp{%switch}) is
1217 @emph{not} enabled, the maximum
1218 associated value influences the static array table size, and a larger
1219 table should decrease the time required for an unsuccessful search, at
1220 the expense of extra table space.
1222 The default value is 1, thus the default maximum associated value about
1223 the same size as the number of keywords (for efficiency, the maximum
1224 associated value is always rounded up to a power of 2). The actual
1225 table size may vary somewhat, since this technique is essentially a
1229 @node Verbosity, , Algorithmic Details, Options
1230 @section Informative Output
1235 Prints a short summary on the meaning of each program option. Aborts
1236 further program execution.
1240 Prints out the current version number.
1244 Enables the debugging option. This produces verbose diagnostics to
1245 ``standard error'' when @code{gperf} is executing. It is useful both for
1246 maintaining the program and for determining whether a given set of
1247 options is actually speeding up the search for a solution. Some useful
1248 information is dumped at the end of the program when the @samp{-d}
1252 @node Bugs, Projects, Options, Top
1253 @chapter Known Bugs and Limitations with @code{gperf}
1255 The following are some limitations with the current release of
1260 The @code{gperf} utility is tuned to execute quickly, and works quickly
1261 for small to medium size data sets (around 1000 keywords). It is
1262 extremely useful for maintaining perfect hash functions for compiler
1263 keyword sets. Several recent enhancements now enable @code{gperf} to
1264 work efficiently on much larger keyword sets (over 15,000 keywords).
1265 When processing large keyword sets it helps greatly to have over 8 megs
1269 The size of the generate static keyword array can get @emph{extremely}
1270 large if the input keyword file is large or if the keywords are quite
1271 similar. This tends to slow down the compilation of the generated C
1272 code, and @emph{greatly} inflates the object code size. If this
1273 situation occurs, consider using the @samp{-S} option to reduce data
1274 size, potentially increasing keyword recognition time a negligible
1275 amount. Since many C compilers cannot correctly generate code for
1276 large switch statements it is important to qualify the @var{-S} option
1277 with an appropriate numerical argument that controls the number of
1278 switch statements generated.
1281 The maximum number of selected byte positions has an
1282 arbitrary limit of 255. This restriction should be removed, and if
1283 anyone considers this a problem write me and let me know so I can remove
1287 @node Projects, Bibliography, Bugs, Top
1288 @chapter Things Still Left to Do
1290 It should be ``relatively'' easy to replace the current perfect hash
1291 function algorithm with a more exhaustive approach; the perfect hash
1292 module is essential independent from other program modules. Additional
1293 worthwhile improvements include:
1297 Another useful extension involves modifying the program to generate
1298 ``minimal'' perfect hash functions (under certain circumstances, the
1299 current version can be rather extravagant in the generated table size).
1300 This is mostly of theoretical interest, since a sparse table
1301 often produces faster lookups, and use of the @samp{-S} @code{switch}
1302 option can minimize the data size, at the expense of slightly longer
1303 lookups (note that the gcc compiler generally produces good code for
1304 @code{switch} statements, reducing the need for more complex schemes).
1307 In addition to improving the algorithm, it would also be useful to
1308 generate an Ada package as the code output, in addition to the current
1314 @node Bibliography, Concept Index, Projects, Top
1315 @chapter Bibliography
1317 [1] Chang, C.C.: @i{A Scheme for Constructing Ordered Minimal Perfect
1318 Hashing Functions} Information Sciences 39(1986), 187-195.
1320 [2] Cichelli, Richard J. @i{Author's Response to ``On Cichelli's Minimal Perfect Hash
1321 Functions Method''} Communications of the ACM, 23, 12(December 1980), 729.
1323 [3] Cichelli, Richard J. @i{Minimal Perfect Hash Functions Made Simple}
1324 Communications of the ACM, 23, 1(January 1980), 17-19.
1326 [4] Cook, C. R. and Oldehoeft, R.R. @i{A Letter Oriented Minimal
1327 Perfect Hashing Function} SIGPLAN Notices, 17, 9(September 1982), 18-27.
1329 [5] Cormack, G. V. and Horspool, R. N. S. and Kaiserwerth, M.
1330 @i{Practical Perfect Hashing} Computer Journal, 28, 1(January 1985), 54-58.
1332 [6] Jaeschke, G. @i{Reciprocal Hashing: A Method for Generating Minimal
1333 Perfect Hashing Functions} Communications of the ACM, 24, 12(December
1336 [7] Jaeschke, G. and Osterburg, G. @i{On Cichelli's Minimal Perfect
1337 Hash Functions Method} Communications of the ACM, 23, 12(December 1980),
1340 [8] Sager, Thomas J. @i{A Polynomial Time Generator for Minimal Perfect
1341 Hash Functions} Communications of the ACM, 28, 5(December 1985), 523-532
1343 [9] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator}
1344 Second USENIX C++ Conference Proceedings, April 1990.
1346 [10] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator}
1347 C++ Report, SIGS 10 10 (November/December 1998).
1349 [11] Sebesta, R.W. and Taylor, M.A. @i{Minimal Perfect Hash Functions
1350 for Reserved Word Lists} SIGPLAN Notices, 20, 12(September 1985), 47-53.
1352 [12] Sprugnoli, R. @i{Perfect Hashing Functions: A Single Probe
1353 Retrieving Method for Static Sets} Communications of the ACM, 20
1354 11(November 1977), 841-850.
1356 [13] Stallman, Richard M. @i{Using and Porting GNU CC} Free Software Foundation,
1359 [14] Stroustrup, Bjarne @i{The C++ Programming Language.} Addison-Wesley, 1986.
1361 [15] Tiemann, Michael D. @i{User's Guide to GNU C++} Free Software
1364 @node Concept Index, , Bibliography, Top
1365 @unnumbered Concept Index