contrib/bzip2/manual.texi

   1 \input texinfo  @c                                  -*- Texinfo -*-
   2 @setfilename bzip2.info
   3
   4 @ignore
   5 This file documents bzip2 version 1.0.2, and associated library
   6 libbzip2, written by Julian Seward (jseward@acm.org).
   7
   8 Copyright (C) 1996-2002 Julian R Seward
   9
  10 Permission is granted to make and distribute verbatim copies of
  11 this manual provided the copyright notice and this permission notice
  12 are preserved on all copies.
  13
  14 Permission is granted to copy and distribute translations of this manual
  15 into another language, under the above conditions for verbatim copies.
  16 @end ignore
  17
  18 @ifinfo
  19 @format
  20 START-INFO-DIR-ENTRY
  21 * Bzip2: (bzip2).               A program and library for data compression.
  22 END-INFO-DIR-ENTRY
  23 @end format
  24
  25 @end ifinfo
  26
  27 @iftex
  28 @c @finalout
  29 @settitle bzip2 and libbzip2
  30 @titlepage
  31 @title bzip2 and libbzip2
  32 @subtitle a program and library for data compression
  33 @subtitle copyright (C) 1996-2002 Julian Seward
  34 @subtitle version 1.0.2 of 30 December 2001
  35 @author Julian Seward
  36
  37 @end titlepage
  38
  39 @parindent 0mm
  40 @parskip 2mm
  41
  42 @end iftex
  43 @node Top,,, (dir)
  44
  45 The following text is the License for this software.  You should
  46 find it identical to that contained in the file LICENSE in the
  47 source distribution.
  48
  49 @bf{------------------ START OF THE LICENSE ------------------}
  50
  51 This program, @code{bzip2},
  52 and associated library @code{libbzip2}, are
  53 Copyright (C) 1996-2002 Julian R Seward.  All rights reserved.
  54
  55 Redistribution and use in source and binary forms, with or without
  56 modification, are permitted provided that the following conditions
  57 are met:
  58 @itemize @bullet
  59 @item
  60    Redistributions of source code must retain the above copyright
  61    notice, this list of conditions and the following disclaimer.
  62 @item
  63    The origin of this software must not be misrepresented; you must
  64    not claim that you wrote the original software.  If you use this
  65    software in a product, an acknowledgment in the product
  66    documentation would be appreciated but is not required.
  67 @item
  68    Altered source versions must be plainly marked as such, and must
  69    not be misrepresented as being the original software.
  70 @item
  71    The name of the author may not be used to endorse or promote
  72    products derived from this software without specific prior written
  73    permission.
  74 @end itemize
  75 THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
  76 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  77 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  78 ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
  79 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  80 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
  81 GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  82 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
  83 WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
  84 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
  85 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  86
  87 Julian Seward, Cambridge, UK.
  88
  89 @code{jseward@@acm.org}
  90
  91 @code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001.
  92
  93 @bf{------------------ END OF THE LICENSE ------------------}
  94
  95 Web sites:
  96
  97 @code{http://sources.redhat.com/bzip2}
  98
  99 @code{http://www.cacheprof.org}
 100
 101 PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
 102 algorithms.  However, I do not have the resources available to carry out
 103 a full patent search.  Therefore I cannot give any guarantee of the
 104 above statement.
 105
 106
 107
 108
 109
 110
 111
 112 @chapter Introduction
 113
 114 @code{bzip2}  compresses  files  using the Burrows-Wheeler
 115 block-sorting text compression algorithm,  and  Huffman  coding.
 116 Compression  is  generally  considerably  better than that
 117 achieved by more conventional LZ77/LZ78-based compressors,
 118 and  approaches  the performance of the PPM family of statistical compressors.
 119
 120 @code{bzip2} is built on top of @code{libbzip2}, a flexible library
 121 for handling compressed data in the @code{bzip2} format.  This manual
 122 describes both how to use the program and
 123 how to work with the library interface.  Most of the
 124 manual is devoted to this library, not the program,
 125 which is good news if your interest is only in the program.
 126
 127 Chapter 2 describes how to use @code{bzip2}; this is the only part
 128 you need to read if you just want to know how to operate the program.
 129 Chapter 3 describes the programming interfaces in detail, and
 130 Chapter 4 records some miscellaneous notes which I thought
 131 ought to be recorded somewhere.
 132
 133
 134 @chapter How to use @code{bzip2}
 135
 136 This chapter contains a copy of the @code{bzip2} man page,
 137 and nothing else.
 138
 139 @quotation
 140
 141 @unnumberedsubsubsec NAME
 142 @itemize
 143 @item @code{bzip2}, @code{bunzip2}
 144 - a block-sorting file compressor, v1.0.2
 145 @item @code{bzcat}
 146 - decompresses files to stdout
 147 @item @code{bzip2recover}
 148 - recovers data from damaged bzip2 files
 149 @end itemize
 150
 151 @unnumberedsubsubsec SYNOPSIS
 152 @itemize
 153 @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ...  ]
 154 @item @code{bunzip2} [ -fkvsVL ] [ filenames ...  ]
 155 @item @code{bzcat} [ -s ] [ filenames ...  ]
 156 @item @code{bzip2recover} filename
 157 @end itemize
 158
 159 @unnumberedsubsubsec DESCRIPTION
 160
 161 @code{bzip2} compresses files using the Burrows-Wheeler block sorting
 162 text compression algorithm, and Huffman coding.  Compression is
 163 generally considerably better than that achieved by more conventional
 164 LZ77/LZ78-based compressors, and approaches the performance of the PPM
 165 family of statistical compressors.
 166
 167 The command-line options are deliberately very similar to those of GNU
 168 @code{gzip}, but they are not identical.
 169
 170 @code{bzip2} expects a list of file names to accompany the command-line
 171 flags.  Each file is replaced by a compressed version of itself, with
 172 the name @code{original_name.bz2}.  Each compressed file has the same
 173 modification date, permissions, and, when possible, ownership as the
 174 corresponding original, so that these properties can be correctly
 175 restored at decompression time.  File name handling is naive in the
 176 sense that there is no mechanism for preserving original file names,
 177 permissions, ownerships or dates in filesystems which lack these
 178 concepts, or have serious file name length restrictions, such as MS-DOS.
 179
 180 @code{bzip2} and @code{bunzip2} will by default not overwrite existing
 181 files.  If you want this to happen, specify the @code{-f} flag.
 182
 183 If no file names are specified, @code{bzip2} compresses from standard
 184 input to standard output.  In this case, @code{bzip2} will decline to
 185 write compressed output to a terminal, as this would be entirely
 186 incomprehensible and therefore pointless.
 187
 188 @code{bunzip2} (or @code{bzip2 -d}) decompresses all
 189 specified files.  Files which were not created by @code{bzip2}
 190 will be detected and ignored, and a warning issued.
 191 @code{bzip2} attempts to guess the filename for the decompressed file
 192 from that of the compressed file as follows:
 193 @itemize
 194 @item @code{filename.bz2 } becomes @code{filename}
 195 @item @code{filename.bz  } becomes @code{filename}
 196 @item @code{filename.tbz2} becomes @code{filename.tar}
 197 @item @code{filename.tbz } becomes @code{filename.tar}
 198 @item @code{anyothername } becomes @code{anyothername.out}
 199 @end itemize
 200 If the file does not end in one of the recognised endings,
 201 @code{.bz2}, @code{.bz},
 202 @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot
 203 guess the name of the original file, and uses the original name
 204 with @code{.out} appended.
 205
 206 As with compression, supplying no
 207 filenames causes decompression from standard input to standard output.
 208
 209 @code{bunzip2} will correctly decompress a file which is the
 210 concatenation of two or more compressed files.  The result is the
 211 concatenation of the corresponding uncompressed files.  Integrity
 212 testing (@code{-t}) of concatenated compressed files is also supported.
 213
 214 You can also compress or decompress files to the standard output by
 215 giving the @code{-c} flag.  Multiple files may be compressed and
 216 decompressed like this.  The resulting outputs are fed sequentially to
 217 stdout.  Compression of multiple files in this manner generates a stream
 218 containing multiple compressed file representations.  Such a stream
 219 can be decompressed correctly only by @code{bzip2} version 0.9.0 or
 220 later.  Earlier versions of @code{bzip2} will stop after decompressing
 221 the first file in the stream.
 222
 223 @code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to
 224 the standard output.
 225
 226 @code{bzip2} will read arguments from the environment variables
 227 @code{BZIP2} and @code{BZIP}, in that order, and will process them
 228 before any arguments read from the command line.  This gives a
 229 convenient way to supply default arguments.
 230
 231 Compression is always performed, even if the compressed file is slightly
 232 larger than the original.  Files of less than about one hundred bytes
 233 tend to get larger, since the compression mechanism has a constant
 234 overhead in the region of 50 bytes.  Random data (including the output
 235 of most file compressors) is coded at about 8.05 bits per byte, giving
 236 an expansion of around 0.5%.
 237
 238 As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to
 239 make sure that the decompressed version of a file is identical to the
 240 original.  This guards against corruption of the compressed data, and
 241 against undetected bugs in @code{bzip2} (hopefully very unlikely).  The
 242 chances of data corruption going undetected is microscopic, about one
 243 chance in four billion for each file processed.  Be aware, though, that
 244 the check occurs upon decompression, so it can only tell you that
 245 something is wrong.  It can't help you recover the original uncompressed
 246 data.  You can use @code{bzip2recover} to try to recover data from
 247 damaged files.
 248
 249 Return values: 0 for a normal exit, 1 for environmental problems (file
 250 not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt
 251 compressed file, 3 for an internal consistency error (eg, bug) which
 252 caused @code{bzip2} to panic.
 253
 254
 255 @unnumberedsubsubsec OPTIONS
 256 @table @code
 257 @item -c  --stdout
 258 Compress or decompress to standard output.
 259 @item -d  --decompress
 260 Force decompression.  @code{bzip2}, @code{bunzip2} and @code{bzcat} are
 261 really the same program, and the decision about what actions to take is
 262 done on the basis of which name is used.  This flag overrides that
 263 mechanism, and forces bzip2 to decompress.
 264 @item -z --compress
 265 The complement to @code{-d}: forces compression, regardless of the
 266 invokation name.
 267 @item -t --test
 268 Check integrity of the specified file(s), but don't decompress them.
 269 This really performs a trial decompression and throws away the result.
 270 @item -f --force
 271 Force overwrite of output files.  Normally, @code{bzip2} will not overwrite
 272 existing output files.  Also forces @code{bzip2} to break hard links
 273 to files, which it otherwise wouldn't do.
 274
 275 @code{bzip2} normally declines to decompress files which don't have the
 276 correct magic header bytes.  If forced (@code{-f}), however, it will
 277 pass such files through unmodified.  This is how GNU @code{gzip}
 278 behaves.
 279 @item -k --keep
 280 Keep (don't delete) input files during compression
 281 or decompression.
 282 @item -s --small
 283 Reduce memory usage, for compression, decompression and testing.  Files
 284 are decompressed and tested using a modified algorithm which only
 285 requires 2.5 bytes per block byte.  This means any file can be
 286 decompressed in 2300k of memory, albeit at about half the normal speed.
 287
 288 During compression, @code{-s} selects a block size of 200k, which limits
 289 memory use to around the same figure, at the expense of your compression
 290 ratio.  In short, if your machine is low on memory (8 megabytes or
 291 less), use -s for everything.  See MEMORY MANAGEMENT below.
 292 @item -q --quiet
 293 Suppress non-essential warning messages.  Messages pertaining to
 294 I/O errors and other critical events will not be suppressed.
 295 @item -v --verbose
 296 Verbose mode -- show the compression ratio for each file processed.
 297 Further @code{-v}'s increase the verbosity level, spewing out lots of
 298 information which is primarily of interest for diagnostic purposes.
 299 @item -L --license -V --version
 300 Display the software version, license terms and conditions.
 301 @item -1 (or --fast) to -9 (or --best)
 302 Set the block size to 100 k, 200 k ..  900 k when compressing.  Has no
 303 effect when decompressing.  See MEMORY MANAGEMENT below.
 304 The @code{--fast} and @code{--best} aliases are primarily for GNU
 305 @code{gzip} compatibility.  In particular, @code{--fast} doesn't make
 306 things significantly faster.  And @code{--best} merely selects the
 307 default behaviour.
 308 @item --
 309 Treats all subsequent arguments as file names, even if they start
 310 with a dash.  This is so you can handle files with names beginning
 311 with a dash, for example: @code{bzip2 -- -myfilename}.
 312 @item --repetitive-fast
 313 @item --repetitive-best
 314 These flags are redundant in versions 0.9.5 and above.  They provided
 315 some coarse control over the behaviour of the sorting algorithm in
 316 earlier versions, which was sometimes useful.  0.9.5 and above have an
 317 improved algorithm which renders these flags irrelevant.
 318 @end table
 319
 320
 321 @unnumberedsubsubsec MEMORY MANAGEMENT
 322
 323 @code{bzip2} compresses large files in blocks.  The block size affects
 324 both the compression ratio achieved, and the amount of memory needed for
 325 compression and decompression.  The flags @code{-1} through @code{-9}
 326 specify the block size to be 100,000 bytes through 900,000 bytes (the
 327 default) respectively.  At decompression time, the block size used for
 328 compression is read from the header of the compressed file, and
 329 @code{bunzip2} then allocates itself just enough memory to decompress
 330 the file.  Since block sizes are stored in compressed files, it follows
 331 that the flags @code{-1} to @code{-9} are irrelevant to and so ignored
 332 during decompression.
 333
 334 Compression and decompression requirements, in bytes, can be estimated
 335 as:
 336 @example
 337      Compression:   400k + ( 8 x block size )
 338
 339      Decompression: 100k + ( 4 x block size ), or
 340                     100k + ( 2.5 x block size )
 341 @end example
 342 Larger block sizes give rapidly diminishing marginal returns.  Most of
 343 the compression comes from the first two or three hundred k of block
 344 size, a fact worth bearing in mind when using @code{bzip2} on small machines.
 345 It is also important to appreciate that the decompression memory
 346 requirement is set at compression time by the choice of block size.
 347
 348 For files compressed with the default 900k block size, @code{bunzip2}
 349 will require about 3700 kbytes to decompress.  To support decompression
 350 of any file on a 4 megabyte machine, @code{bunzip2} has an option to
 351 decompress using approximately half this amount of memory, about 2300
 352 kbytes.  Decompression speed is also halved, so you should use this
 353 option only where necessary.  The relevant flag is @code{-s}.
 354
 355 In general, try and use the largest block size memory constraints allow,
 356 since that maximises the compression achieved.  Compression and
 357 decompression speed are virtually unaffected by block size.
 358
 359 Another significant point applies to files which fit in a single block
 360 -- that means most files you'd encounter using a large block size.  The
 361 amount of real memory touched is proportional to the size of the file,
 362 since the file is smaller than a block.  For example, compressing a file
 363 20,000 bytes long with the flag @code{-9} will cause the compressor to
 364 allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
 365 kbytes of it.  Similarly, the decompressor will allocate 3700k but only
 366 touch 100k + 20000 * 4 = 180 kbytes.
 367
 368 Here is a table which summarises the maximum memory usage for different
 369 block sizes.  Also recorded is the total compressed size for 14 files of
 370 the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
 371 column gives some feel for how compression varies with block size.
 372 These figures tend to understate the advantage of larger block sizes for
 373 larger files, since the Corpus is dominated by smaller files.
 374 @example
 375           Compress   Decompress   Decompress   Corpus
 376    Flag     usage      usage       -s usage     Size
 377
 378     -1      1200k       500k         350k      914704
 379     -2      2000k       900k         600k      877703
 380     -3      2800k      1300k         850k      860338
 381     -4      3600k      1700k        1100k      846899
 382     -5      4400k      2100k        1350k      845160
 383     -6      5200k      2500k        1600k      838626
 384     -7      6100k      2900k        1850k      834096
 385     -8      6800k      3300k        2100k      828642
 386     -9      7600k      3700k        2350k      828642
 387 @end example
 388
 389 @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES
 390
 391 @code{bzip2} compresses files in blocks, usually 900kbytes long.  Each
 392 block is handled independently.  If a media or transmission error causes
 393 a multi-block @code{.bz2} file to become damaged, it may be possible to
 394 recover data from the undamaged blocks in the file.
 395
 396 The compressed representation of each block is delimited by a 48-bit
 397 pattern, which makes it possible to find the block boundaries with
 398 reasonable certainty.  Each block also carries its own 32-bit CRC, so
 399 damaged blocks can be distinguished from undamaged ones.
 400
 401 @code{bzip2recover} is a simple program whose purpose is to search for
 402 blocks in @code{.bz2} files, and write each block out into its own
 403 @code{.bz2} file.  You can then use @code{bzip2 -t} to test the
 404 integrity of the resulting files, and decompress those which are
 405 undamaged.
 406
 407 @code{bzip2recover}
 408 takes a single argument, the name of the damaged file, and writes a
 409 number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc,
 410 containing the extracted blocks.  The output filenames are designed so
 411 that the use of wildcards in subsequent processing -- for example,
 412 @code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in
 413 the correct order.
 414
 415 @code{bzip2recover} should be of most use dealing with large @code{.bz2}
 416 files, as these will contain many blocks.  It is clearly futile to use
 417 it on damaged single-block files, since a damaged block cannot be
 418 recovered.  If you wish to minimise any potential data loss through
 419 media or transmission errors, you might consider compressing with a
 420 smaller block size.
 421
 422
 423 @unnumberedsubsubsec PERFORMANCE NOTES
 424
 425 The sorting phase of compression gathers together similar strings in the
 426 file.  Because of this, files containing very long runs of repeated
 427 symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
 428 compress more slowly than normal.  Versions 0.9.5 and above fare much
 429 better than previous versions in this respect.  The ratio between
 430 worst-case and average-case compression time is in the region of 10:1.
 431 For previous versions, this figure was more like 100:1.  You can use the
 432 @code{-vvvv} option to monitor progress in great detail, if you want.
 433
 434 Decompression speed is unaffected by these phenomena.
 435
 436 @code{bzip2} usually allocates several megabytes of memory to operate
 437 in, and then charges all over it in a fairly random fashion.  This means
 438 that performance, both for compressing and decompressing, is largely
 439 determined by the speed at which your machine can service cache misses.
 440 Because of this, small changes to the code to reduce the miss rate have
 441 been observed to give disproportionately large performance improvements.
 442 I imagine @code{bzip2} will perform best on machines with very large
 443 caches.
 444
 445
 446 @unnumberedsubsubsec CAVEATS
 447
 448 I/O error messages are not as helpful as they could be.  @code{bzip2}
 449 tries hard to detect I/O errors and exit cleanly, but the details of
 450 what the problem is sometimes seem rather misleading.
 451
 452 This manual page pertains to version 1.0.2 of @code{bzip2}.  Compressed
 453 data created by this version is entirely forwards and backwards
 454 compatible with the previous public releases, versions 0.1pl2, 0.9.0,
 455 0.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and
 456 above can correctly decompress multiple concatenated compressed files.
 457 0.1pl2 cannot do this; it will stop after decompressing just the first
 458 file in the stream.
 459
 460 @code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit
 461 integers to represent bit positions in compressed files, so it could not
 462 handle compressed files more than 512 megabytes long.  Version 1.0.2 and
 463 above uses 64-bit ints on some platforms which support them (GNU
 464 supported targets, and Windows).  To establish whether or not
 465 @code{bzip2recover} was built with such a limitation, run it without
 466 arguments.  In any event you can build yourself an unlimited version if
 467 you can recompile it with @code{MaybeUInt64} set to be an unsigned
 468 64-bit integer.
 469
 470
 471
 472 @unnumberedsubsubsec AUTHOR
 473 Julian Seward, @code{jseward@@acm.org}.
 474
 475 @code{http://sources.redhat.com/bzip2}
 476
 477 The ideas embodied in @code{bzip2} are due to (at least) the following
 478 people: Michael Burrows and David Wheeler (for the block sorting
 479 transformation), David Wheeler (again, for the Huffman coder), Peter
 480 Fenwick (for the structured coding model in the original @code{bzip},
 481 and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
 482 (for the arithmetic coder in the original @code{bzip}).  I am much
 483 indebted for their help, support and advice.  See the manual in the
 484 source distribution for pointers to sources of documentation.  Christian
 485 von Roques encouraged me to look for faster sorting algorithms, so as to
 486 speed up compression.  Bela Lubkin encouraged me to improve the
 487 worst-case compression performance.  The @code{bz*} scripts are derived
 488 from those of GNU @code{gzip}.  Many people sent patches, helped with
 489 portability problems, lent machines, gave advice and were generally
 490 helpful.
 491
 492 @end quotation
 493
 494
 495
 496
 497 @chapter Programming with @code{libbzip2}
 498
 499 This chapter describes the programming interface to @code{libbzip2}.
 500
 501 For general background information, particularly about memory
 502 use and performance aspects, you'd be well advised to read Chapter 2
 503 as well.
 504
 505 @section Top-level structure
 506
 507 @code{libbzip2} is a flexible library for compressing and decompressing
 508 data in the @code{bzip2} data format.  Although packaged as a single
 509 entity, it helps to regard the library as three separate parts: the low
 510 level interface, and the high level interface, and some utility
 511 functions.
 512
 513 The structure of @code{libbzip2}'s interfaces is similar to
 514 that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
 515 library.
 516
 517 All externally visible symbols have names beginning @code{BZ2_}.
 518 This is new in version 1.0.  The intention is to minimise pollution
 519 of the namespaces of library clients.
 520
 521 @subsection Low-level summary
 522
 523 This interface provides services for compressing and decompressing
 524 data in memory.  There's no provision for dealing with files, streams
 525 or any other I/O mechanisms, just straight memory-to-memory work.
 526 In fact, this part of the library can be compiled without inclusion
 527 of @code{stdio.h}, which may be helpful for embedded applications.
 528
 529 The low-level part of the library has no global variables and
 530 is therefore thread-safe.
 531
 532 Six routines make up the low level interface:
 533 @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd}
 534 for compression,
 535 and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress}
 536 and @code{BZ2_bzDecompressEnd} for decompression.
 537 The @code{*Init} functions allocate
 538 memory for compression/decompression and do other
 539 initialisations, whilst the @code{*End} functions close down operations
 540 and release memory.
 541
 542 The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}.
 543 These compress and decompress data from a user-supplied input buffer
 544 to a user-supplied output buffer.  These buffers can be any size;
 545 arbitrary quantities of data are handled by making repeated calls
 546 to these functions.  This is a flexible mechanism allowing a
 547 consumer-pull style of activity, or producer-push, or a mixture of
 548 both.
 549
 550
 551
 552 @subsection High-level summary
 553
 554 This interface provides some handy wrappers around the low-level
 555 interface to facilitate reading and writing @code{bzip2} format
 556 files (@code{.bz2} files).  The routines provide hooks to facilitate
 557 reading files in which the @code{bzip2} data stream is embedded
 558 within some larger-scale file structure, or where there are
 559 multiple @code{bzip2} data streams concatenated end-to-end.
 560
 561 For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead},
 562 @code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied.  For
 563 writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and
 564 @code{BZ2_bzWriteFinish} are available.
 565
 566 As with the low-level library, no global variables are used
 567 so the library is per se thread-safe.  However, if I/O errors
 568 occur whilst reading or writing the underlying compressed files,
 569 you may have to consult @code{errno} to determine the cause of
 570 the error.  In that case, you'd need a C library which correctly
 571 supports @code{errno} in a multithreaded environment.
 572
 573 To make the library a little simpler and more portable,
 574 @code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file
 575 handles (@code{FILE*}s) which have previously been opened for reading or
 576 writing respectively.  That avoids portability problems associated with
 577 file operations and file attributes, whilst not being much of an
 578 imposition on the programmer.
 579
 580
 581
 582 @subsection Utility functions summary
 583 For very simple needs, @code{BZ2_bzBuffToBuffCompress} and
 584 @code{BZ2_bzBuffToBuffDecompress} are provided.  These compress
 585 data in memory from one buffer to another buffer in a single
 586 function call.  You should assess whether these functions
 587 fulfill your memory-to-memory compression/decompression
 588 requirements before investing effort in understanding the more
 589 general but more complex low-level interface.
 590
 591 Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
 592 @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
 593 give better @code{zlib} compatibility.  These functions are
 594 @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
 595 @code{BZ2_bzclose},
 596 @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.  You may find these functions
 597 more convenient for simple file reading and writing, than those in the
 598 high-level interface.  These functions are not (yet) officially part of
 599 the library, and are minimally documented here.  If they break, you
 600 get to keep all the pieces.  I hope to document them properly when time
 601 permits.
 602
 603 Yoshioka also contributed modifications to allow the library to be
 604 built as a Windows DLL.
 605
 606
 607 @section Error handling
 608
 609 The library is designed to recover cleanly in all situations, including
 610 the worst-case situation of decompressing random data.  I'm not
 611 100% sure that it can always do this, so you might want to add
 612 a signal handler to catch segmentation violations during decompression
 613 if you are feeling especially paranoid.  I would be interested in
 614 hearing more about the robustness of the library to corrupted
 615 compressed data.
 616
 617 Version 1.0 is much more robust in this respect than
 618 0.9.0 or 0.9.5.  Investigations with Checker (a tool for
 619 detecting problems with memory management, similar to Purify)
 620 indicate that, at least for the few files I tested, all single-bit
 621 errors in the decompressed data are caught properly, with no
 622 segmentation faults, no reads of uninitialised data and no
 623 out of range reads or writes.  So it's certainly much improved,
 624 although I wouldn't claim it to be totally bombproof.
 625
 626 The file @code{bzlib.h} contains all definitions needed to use
 627 the library.  In particular, you should definitely not include
 628 @code{bzlib_private.h}.
 629
 630 In @code{bzlib.h}, the various return values are defined.  The following
 631 list is not intended as an exhaustive description of the circumstances
 632 in which a given value may be returned -- those descriptions are given
 633 later.  Rather, it is intended to convey the rough meaning of each
 634 return value.  The first five actions are normal and not intended to
 635 denote an error situation.
 636 @table @code
 637 @item BZ_OK
 638 The requested action was completed successfully.
 639 @item BZ_RUN_OK
 640 @itemx BZ_FLUSH_OK
 641 @itemx BZ_FINISH_OK
 642 In @code{BZ2_bzCompress}, the requested flush/finish/nothing-special action
 643 was completed successfully.
 644 @item BZ_STREAM_END
 645 Compression of data was completed, or the logical stream end was
 646 detected during decompression.
 647 @end table
 648
 649 The following return values indicate an error of some kind.
 650 @table @code
 651 @item BZ_CONFIG_ERROR
 652 Indicates that the library has been improperly compiled on your
 653 platform -- a major configuration error.  Specifically, it means
 654 that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)}
 655 are not 1, 2 and 4 respectively, as they should be.  Note that the
 656 library should still work properly on 64-bit platforms which follow
 657 the LP64 programming model -- that is, where @code{sizeof(long)}
 658 and @code{sizeof(void*)} are 8.  Under LP64, @code{sizeof(int)} is
 659 still 4, so @code{libbzip2}, which doesn't use the @code{long} type,
 660 is OK.
 661 @item BZ_SEQUENCE_ERROR
 662 When using the library, it is important to call the functions in the
 663 correct sequence and with data structures (buffers etc) in the correct
 664 states.  @code{libbzip2} checks as much as it can to ensure this is
 665 happening, and returns @code{BZ_SEQUENCE_ERROR} if not.  Code which
 666 complies precisely with the function semantics, as detailed below,
 667 should never receive this value; such an event denotes buggy code
 668 which you should investigate.
 669 @item BZ_PARAM_ERROR
 670 Returned when a parameter to a function call is out of range
 671 or otherwise manifestly incorrect.  As with @code{BZ_SEQUENCE_ERROR},
 672 this denotes a bug in the client code.  The distinction between
 673 @code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
 674 making.
 675 @item BZ_MEM_ERROR
 676 Returned when a request to allocate memory failed.  Note that the
 677 quantity of memory needed to decompress a stream cannot be determined
 678 until the stream's header has been read.  So @code{BZ2_bzDecompress} and
 679 @code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of
 680 the compressed data has been read.  The same is not true for
 681 compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} have
 682 successfully completed, @code{BZ_MEM_ERROR} cannot occur.
 683 @item BZ_DATA_ERROR
 684 Returned when a data integrity error is detected during decompression.
 685 Most importantly, this means when stored and computed CRCs for the
 686 data do not match.  This value is also returned upon detection of any
 687 other anomaly in the compressed data.
 688 @item BZ_DATA_ERROR_MAGIC
 689 As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
 690 know when the compressed stream does not start with the correct
 691 magic bytes (@code{'B' 'Z' 'h'}).
 692 @item BZ_IO_ERROR
 693 Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error
 694 reading or writing in the compressed file, and by @code{BZ2_bzReadOpen}
 695 and @code{BZ2_bzWriteOpen} for attempts to use a file for which the
 696 error indicator (viz, @code{ferror(f)}) is set.
 697 On receipt of @code{BZ_IO_ERROR}, the caller should consult
 698 @code{errno} and/or @code{perror} to acquire operating-system
 699 specific information about the problem.
 700 @item BZ_UNEXPECTED_EOF
 701 Returned by @code{BZ2_bzRead} when the compressed file finishes
 702 before the logical end of stream is detected.
 703 @item BZ_OUTBUFF_FULL
 704 Returned by @code{BZ2_bzBuffToBuffCompress} and
 705 @code{BZ2_bzBuffToBuffDecompress} to indicate that the output data
 706 will not fit into the output buffer provided.
 707 @end table
 708
 709
 710
 711 @section Low-level interface
 712
 713 @subsection @code{BZ2_bzCompressInit}
 714 @example
 715 typedef
 716    struct @{
 717       char *next_in;
 718       unsigned int avail_in;
 719       unsigned int total_in_lo32;
 720       unsigned int total_in_hi32;
 721
 722       char *next_out;
 723       unsigned int avail_out;
 724       unsigned int total_out_lo32;
 725       unsigned int total_out_hi32;
 726
 727       void *state;
 728
 729       void *(*bzalloc)(void *,int,int);
 730       void (*bzfree)(void *,void *);
 731       void *opaque;
 732    @}
 733    bz_stream;
 734
 735 int BZ2_bzCompressInit ( bz_stream *strm,
 736                          int blockSize100k,
 737                          int verbosity,
 738                          int workFactor );
 739
 740 @end example
 741
 742 Prepares for compression.  The @code{bz_stream} structure
 743 holds all data pertaining to the compression activity.
 744 A @code{bz_stream} structure should be allocated and initialised
 745 prior to the call.
 746 The fields of @code{bz_stream}
 747 comprise the entirety of the user-visible data.  @code{state}
 748 is a pointer to the private data structures required for compression.
 749
 750 Custom memory allocators are supported, via fields @code{bzalloc},
 751 @code{bzfree},
 752 and @code{opaque}.  The value
 753 @code{opaque} is passed to as the first argument to
 754 all calls to @code{bzalloc} and @code{bzfree}, but is
 755 otherwise ignored by the library.
 756 The call @code{bzalloc ( opaque, n, m )} is expected to return a
 757 pointer @code{p} to
 758 @code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
 759 should free
 760 that memory.
 761
 762 If you don't want to use a custom memory allocator, set @code{bzalloc},
 763 @code{bzfree} and
 764 @code{opaque} to @code{NULL},
 765 and the library will then use the standard @code{malloc}/@code{free}
 766 routines.
 767
 768 Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc},
 769 @code{bzfree} and @code{opaque} should
 770 be filled appropriately, as just described.  Upon return, the internal
 771 state will have been allocated and initialised, and @code{total_in_lo32},
 772 @code{total_in_hi32}, @code{total_out_lo32} and
 773 @code{total_out_hi32} will have been set to zero.
 774 These four fields are used by the library
 775 to inform the caller of the total amount of data passed into and out of
 776 the library, respectively.  You should not try to change them.
 777 As of version 1.0, 64-bit counts are maintained, even on 32-bit
 778 platforms, using the @code{_hi32} fields to store the upper 32 bits
 779 of the count.  So, for example, the total amount of data in
 780 is @code{(total_in_hi32 << 32) + total_in_lo32}.
 781
 782 Parameter @code{blockSize100k} specifies the block size to be used for
 783 compression.  It should be a value between 1 and 9 inclusive, and the
 784 actual block size used is 100000 x this figure.  9 gives the best
 785 compression but takes most memory.
 786
 787 Parameter @code{verbosity} should be set to a number between 0 and 4
 788 inclusive.  0 is silent, and greater numbers give increasingly verbose
 789 monitoring/debugging output.  If the library has been compiled with
 790 @code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
 791 setting.
 792
 793 Parameter @code{workFactor} controls how the compression phase behaves
 794 when presented with worst case, highly repetitive, input data.  If
 795 compression runs into difficulties caused by repetitive data, the
 796 library switches from the standard sorting algorithm to a fallback
 797 algorithm.  The fallback is slower than the standard algorithm by
 798 perhaps a factor of three, but always behaves reasonably, no matter how
 799 bad the input.
 800
 801 Lower values of @code{workFactor} reduce the amount of effort the
 802 standard algorithm will expend before resorting to the fallback.  You
 803 should set this parameter carefully; too low, and many inputs will be
 804 handled by the fallback algorithm and so compress rather slowly, too
 805 high, and your average-to-worst case compression times can become very
 806 large.  The default value of 30 gives reasonable behaviour over a wide
 807 range of circumstances.
 808
 809 Allowable values range from 0 to 250 inclusive.  0 is a special case,
 810 equivalent to using the default value of 30.
 811
 812 Note that the compressed output generated is the same regardless of
 813 whether or not the fallback algorithm is used.
 814
 815 Be aware also that this parameter may disappear entirely in future
 816 versions of the library.  In principle it should be possible to devise a
 817 good way to automatically choose which algorithm to use.  Such a
 818 mechanism would render the parameter obsolete.
 819
 820 Possible return values:
 821 @display
 822       @code{BZ_CONFIG_ERROR}
 823          if the library has been mis-compiled
 824       @code{BZ_PARAM_ERROR}
 825          if @code{strm} is @code{NULL}
 826          or @code{blockSize} < 1 or @code{blockSize} > 9
 827          or @code{verbosity} < 0 or @code{verbosity} > 4
 828          or @code{workFactor} < 0 or @code{workFactor} > 250
 829       @code{BZ_MEM_ERROR}
 830          if not enough memory is available
 831       @code{BZ_OK}
 832          otherwise
 833 @end display
 834 Allowable next actions:
 835 @display
 836       @code{BZ2_bzCompress}
 837          if @code{BZ_OK} is returned
 838       no specific action needed in case of error
 839 @end display
 840
 841 @subsection @code{BZ2_bzCompress}
 842 @example
 843    int BZ2_bzCompress ( bz_stream *strm, int action );
 844 @end example
 845 Provides more input and/or output buffer space for the library.  The
 846 caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to
 847 transfer data between them.
 848
 849 Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at
 850 the data to be compressed, and @code{avail_in} should indicate how many
 851 bytes the library may read.  @code{BZ2_bzCompress} updates @code{next_in},
 852 @code{avail_in} and @code{total_in} to reflect the number of bytes it
 853 has read.
 854
 855 Similarly, @code{next_out} should point to a buffer in which the
 856 compressed data is to be placed, with @code{avail_out} indicating how
 857 much output space is available.  @code{BZ2_bzCompress} updates
 858 @code{next_out}, @code{avail_out} and @code{total_out} to reflect the
 859 number of bytes output.
 860
 861 You may provide and remove as little or as much data as you like on each
 862 call of @code{BZ2_bzCompress}.  In the limit, it is acceptable to supply and
 863 remove data one byte at a time, although this would be terribly
 864 inefficient.  You should always ensure that at least one byte of output
 865 space is available at each call.
 866
 867 A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the
 868 compressed stream.
 869
 870 Conceptually, a compressed stream can be in one of four states: IDLE,
 871 RUNNING, FLUSHING and FINISHING.  Before initialisation
 872 (@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a
 873 stream is regarded as IDLE.
 874
 875 Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the
 876 RUNNING state.  Subsequent calls to @code{BZ2_bzCompress} should pass
 877 @code{BZ_RUN} as the requested action; other actions are illegal and
 878 will result in @code{BZ_SEQUENCE_ERROR}.
 879
 880 At some point, the calling program will have provided all the input data
 881 it wants to.  It will then want to finish up -- in effect, asking the
 882 library to process any data it might have buffered internally.  In this
 883 state, @code{BZ2_bzCompress} will no longer attempt to read data from
 884 @code{next_in}, but it will want to write data to @code{next_out}.
 885 Because the output buffer supplied by the user can be arbitrarily small,
 886 the finishing-up operation cannot necessarily be done with a single call
 887 of @code{BZ2_bzCompress}.
 888
 889 Instead, the calling program passes @code{BZ_FINISH} as an action to
 890 @code{BZ2_bzCompress}.  This changes the stream's state to FINISHING.  Any
 891 remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
 892 transferred to the output buffer.  To do this, @code{BZ2_bzCompress} must be
 893 called repeatedly until all the output has been consumed.  At that
 894 point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's
 895 state is set back to IDLE.  @code{BZ2_bzCompressEnd} should then be
 896 called.
 897
 898 Just to make sure the calling program does not cheat, the library makes
 899 a note of @code{avail_in} at the time of the first call to
 900 @code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
 901 time the program has announced its intention to not supply any more
 902 input).  By comparing this value with that of @code{avail_in} over
 903 subsequent calls to @code{BZ2_bzCompress}, the library can detect any
 904 attempts to slip in more data to compress.  Any calls for which this is
 905 detected will return @code{BZ_SEQUENCE_ERROR}.  This indicates a
 906 programming mistake which should be corrected.
 907
 908 Instead of asking to finish, the calling program may ask
 909 @code{BZ2_bzCompress} to take all the remaining input, compress it and
 910 terminate the current (Burrows-Wheeler) compression block.  This could
 911 be useful for error control purposes.  The mechanism is analogous to
 912 that for finishing: call @code{BZ2_bzCompress} with an action of
 913 @code{BZ_FLUSH}, remove output data, and persist with the
 914 @code{BZ_FLUSH} action until the value @code{BZ_RUN} is returned.  As
 915 with finishing, @code{BZ2_bzCompress} detects any attempt to provide more
 916 input data once the flush has begun.
 917
 918 Once the flush is complete, the stream returns to the normal RUNNING
 919 state.
 920
 921 This all sounds pretty complex, but isn't really.  Here's a table
 922 which shows which actions are allowable in each state, what action
 923 will be taken, what the next state is, and what the non-error return
 924 values are.  Note that you can't explicitly ask what state the
 925 stream is in, but nor do you need to -- it can be inferred from the
 926 values returned by @code{BZ2_bzCompress}.
 927 @display
 928 IDLE/@code{any}
 929       Illegal.  IDLE state only exists after @code{BZ2_bzCompressEnd} or
 930       before @code{BZ2_bzCompressInit}.
 931       Return value = @code{BZ_SEQUENCE_ERROR}
 932
 933 RUNNING/@code{BZ_RUN}
 934       Compress from @code{next_in} to @code{next_out} as much as possible.
 935       Next state = RUNNING
 936       Return value = @code{BZ_RUN_OK}
 937
 938 RUNNING/@code{BZ_FLUSH}
 939       Remember current value of @code{next_in}.  Compress from @code{next_in}
 940       to @code{next_out} as much as possible, but do not accept any more input.
 941       Next state = FLUSHING
 942       Return value = @code{BZ_FLUSH_OK}
 943
 944 RUNNING/@code{BZ_FINISH}
 945       Remember current value of @code{next_in}.  Compress from @code{next_in}
 946       to @code{next_out} as much as possible, but do not accept any more input.
 947       Next state = FINISHING
 948       Return value = @code{BZ_FINISH_OK}
 949
 950 FLUSHING/@code{BZ_FLUSH}
 951       Compress from @code{next_in} to @code{next_out} as much as possible,
 952       but do not accept any more input.
 953       If all the existing input has been used up and all compressed
 954       output has been removed
 955          Next state = RUNNING; Return value = @code{BZ_RUN_OK}
 956       else
 957          Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}
 958
 959 FLUSHING/other
 960       Illegal.
 961       Return value = @code{BZ_SEQUENCE_ERROR}
 962
 963 FINISHING/@code{BZ_FINISH}
 964       Compress from @code{next_in} to @code{next_out} as much as possible,
 965       but to not accept any more input.
 966       If all the existing input has been used up and all compressed
 967       output has been removed
 968          Next state = IDLE; Return value = @code{BZ_STREAM_END}
 969       else
 970          Next state = FINISHING; Return value = @code{BZ_FINISHING}
 971
 972 FINISHING/other
 973       Illegal.
 974       Return value = @code{BZ_SEQUENCE_ERROR}
 975 @end display
 976
 977 That still looks complicated?  Well, fair enough.  The usual sequence
 978 of calls for compressing a load of data is:
 979 @itemize @bullet
 980 @item Get started with @code{BZ2_bzCompressInit}.
 981 @item Shovel data in and shlurp out its compressed form using zero or more
 982 calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}.
 983 @item Finish up.
 984 Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH},
 985 copying out the compressed output, until @code{BZ_STREAM_END} is returned.
 986 @item Close up and go home.  Call @code{BZ2_bzCompressEnd}.
 987 @end itemize
 988 If the data you want to compress fits into your input buffer all
 989 at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and
 990 just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls.
 991
 992 All required memory is allocated by @code{BZ2_bzCompressInit}.  The
 993 compression library can accept any data at all (obviously).  So you
 994 shouldn't get any error return values from the @code{BZ2_bzCompress} calls.
 995 If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
 996 your programming.
 997
 998 Trivial other possible return values:
 999 @display
1000       @code{BZ_PARAM_ERROR}
1001          if @code{strm} is @code{NULL}, or @code{strm->s} is @code{NULL}
1002 @end display
1003
1004 @subsection @code{BZ2_bzCompressEnd}
1005 @example
1006 int BZ2_bzCompressEnd ( bz_stream *strm );
1007 @end example
1008 Releases all memory associated with a compression stream.
1009
1010 Possible return values:
1011 @display
1012    @code{BZ_PARAM_ERROR}    if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1013    @code{BZ_OK}    otherwise
1014 @end display
1015
1016
1017 @subsection @code{BZ2_bzDecompressInit}
1018 @example
1019 int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
1020 @end example
1021 Prepares for decompression.  As with @code{BZ2_bzCompressInit}, a
1022 @code{bz_stream} record should be allocated and initialised before the
1023 call.  Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
1024 set if a custom memory allocator is required, or made @code{NULL} for
1025 the normal @code{malloc}/@code{free} routines.  Upon return, the internal
1026 state will have been initialised, and @code{total_in} and
1027 @code{total_out} will be zero.
1028
1029 For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}.
1030
1031 If @code{small} is nonzero, the library will use an alternative
1032 decompression algorithm which uses less memory but at the cost of
1033 decompressing more slowly (roughly speaking, half the speed, but the
1034 maximum memory requirement drops to around 2300k).  See Chapter 2 for
1035 more information on memory management.
1036
1037 Note that the amount of memory needed to decompress
1038 a stream cannot be determined until the stream's header has been read,
1039 so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent
1040 @code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}.
1041
1042 Possible return values:
1043 @display
1044       @code{BZ_CONFIG_ERROR}
1045          if the library has been mis-compiled
1046       @code{BZ_PARAM_ERROR}
1047          if @code{(small != 0 && small != 1)}
1048          or @code{(verbosity < 0 || verbosity > 4)}
1049       @code{BZ_MEM_ERROR}
1050          if insufficient memory is available
1051 @end display
1052
1053 Allowable next actions:
1054 @display
1055       @code{BZ2_bzDecompress}
1056          if @code{BZ_OK} was returned
1057       no specific action required in case of error
1058 @end display
1059
1060
1061
1062 @subsection @code{BZ2_bzDecompress}
1063 @example
1064 int BZ2_bzDecompress ( bz_stream *strm );
1065 @end example
1066 Provides more input and/out output buffer space for the library.  The
1067 caller maintains input and output buffers, and uses @code{BZ2_bzDecompress}
1068 to transfer data between them.
1069
1070 Before each call to @code{BZ2_bzDecompress}, @code{next_in}
1071 should point at the compressed data,
1072 and @code{avail_in} should indicate how many bytes the library
1073 may read.  @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in}
1074 and @code{total_in}
1075 to reflect the number of bytes it has read.
1076
1077 Similarly, @code{next_out} should point to a buffer in which the uncompressed
1078 output is to be placed, with @code{avail_out} indicating how much output space
1079 is available.  @code{BZ2_bzCompress} updates @code{next_out},
1080 @code{avail_out} and @code{total_out} to reflect
1081 the number of bytes output.
1082
1083 You may provide and remove as little or as much data as you like on
1084 each call of @code{BZ2_bzDecompress}.
1085 In the limit, it is acceptable to
1086 supply and remove data one byte at a time, although this would be
1087 terribly inefficient.  You should always ensure that at least one
1088 byte of output space is available at each call.
1089
1090 Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}.
1091
1092 You should provide input and remove output as described above, and
1093 repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is
1094 returned.  Appearance of @code{BZ_STREAM_END} denotes that
1095 @code{BZ2_bzDecompress} has detected the logical end of the compressed
1096 stream.  @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until
1097 all output data has been placed into the output buffer, so once
1098 @code{BZ_STREAM_END} appears, you are guaranteed to have available all
1099 the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be
1100 called.
1101
1102 If case of an error return value, you should call @code{BZ2_bzDecompressEnd}
1103 to clean up and release memory.
1104
1105 Possible return values:
1106 @display
1107       @code{BZ_PARAM_ERROR}
1108          if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1109          or @code{strm->avail_out < 1}
1110       @code{BZ_DATA_ERROR}
1111          if a data integrity error is detected in the compressed stream
1112       @code{BZ_DATA_ERROR_MAGIC}
1113          if the compressed stream doesn't begin with the right magic bytes
1114       @code{BZ_MEM_ERROR}
1115          if there wasn't enough memory available
1116       @code{BZ_STREAM_END}
1117          if the logical end of the data stream was detected and all
1118          output in has been consumed, eg @code{s->avail_out > 0}
1119       @code{BZ_OK}
1120          otherwise
1121 @end display
1122 Allowable next actions:
1123 @display
1124       @code{BZ2_bzDecompress}
1125          if @code{BZ_OK} was returned
1126       @code{BZ2_bzDecompressEnd}
1127          otherwise
1128 @end display
1129
1130
1131 @subsection @code{BZ2_bzDecompressEnd}
1132 @example
1133 int BZ2_bzDecompressEnd ( bz_stream *strm );
1134 @end example
1135 Releases all memory associated with a decompression stream.
1136
1137 Possible return values:
1138 @display
1139       @code{BZ_PARAM_ERROR}
1140          if @code{strm} is @code{NULL} or @code{strm->s} is @code{NULL}
1141       @code{BZ_OK}
1142          otherwise
1143 @end display
1144
1145 Allowable next actions:
1146 @display
1147       None.
1148 @end display
1149
1150
1151 @section High-level interface
1152
1153 This interface provides functions for reading and writing
1154 @code{bzip2} format files.  First, some general points.
1155
1156 @itemize @bullet
1157 @item All of the functions take an @code{int*} first argument,
1158   @code{bzerror}.
1159   After each call, @code{bzerror} should be consulted first to determine
1160   the outcome of the call.  If @code{bzerror} is @code{BZ_OK},
1161   the call completed
1162   successfully, and only then should the return value of the function
1163   (if any) be consulted.  If @code{bzerror} is @code{BZ_IO_ERROR},
1164   there was an error
1165   reading/writing the underlying compressed file, and you should
1166   then consult @code{errno}/@code{perror} to determine the
1167   cause of the difficulty.
1168   @code{bzerror} may also be set to various other values; precise details are
1169   given on a per-function basis below.
1170 @item If @code{bzerror} indicates an error
1171   (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1172   you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose},
1173   depending on whether you are attempting to read or to write)
1174   to free up all resources associated
1175   with the stream.  Once an error has been indicated, behaviour of all calls
1176   except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined.
1177   The implication is that (1) @code{bzerror} should
1178   be checked after each call, and (2) if @code{bzerror} indicates an error,
1179   @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up.
1180 @item The @code{FILE*} arguments passed to
1181    @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen}
1182   should be set to binary mode.
1183   Most Unix systems will do this by default, but other platforms,
1184   including Windows and Mac, will not.  If you omit this, you may
1185   encounter problems when moving code to new platforms.
1186 @item Memory allocation requests are handled by
1187   @code{malloc}/@code{free}.
1188   At present
1189   there is no facility for user-defined memory allocators in the file I/O
1190   functions (could easily be added, though).
1191 @end itemize
1192
1193
1194
1195 @subsection @code{BZ2_bzReadOpen}
1196 @example
1197    typedef void BZFILE;
1198
1199    BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f,
1200                             int small, int verbosity,
1201                             void *unused, int nUnused );
1202 @end example
1203 Prepare to read compressed data from file handle @code{f}.  @code{f}
1204 should refer to a file which has been opened for reading, and for which
1205 the error indicator (@code{ferror(f)})is not set.  If @code{small} is 1,
1206 the library will try to decompress using less memory, at the expense of
1207 speed.
1208
1209 For reasons explained below, @code{BZ2_bzRead} will decompress the
1210 @code{nUnused} bytes starting at @code{unused}, before starting to read
1211 from the file @code{f}.  At most @code{BZ_MAX_UNUSED} bytes may be
1212 supplied like this.  If this facility is not required, you should pass
1213 @code{NULL} and @code{0} for @code{unused} and n@code{Unused}
1214 respectively.
1215
1216 For the meaning of parameters @code{small} and @code{verbosity},
1217 see @code{BZ2_bzDecompressInit}.
1218
1219 The amount of memory needed to decompress a file cannot be determined
1220 until the file's header has been read.  So it is possible that
1221 @code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1222 @code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}.
1223
1224 Possible assignments to @code{bzerror}:
1225 @display
1226       @code{BZ_CONFIG_ERROR}
1227          if the library has been mis-compiled
1228       @code{BZ_PARAM_ERROR}
1229          if @code{f} is @code{NULL}
1230          or @code{small} is neither @code{0} nor @code{1}
1231          or @code{(unused == NULL && nUnused != 0)}
1232          or @code{(unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED))}
1233       @code{BZ_IO_ERROR}
1234          if @code{ferror(f)} is nonzero
1235       @code{BZ_MEM_ERROR}
1236          if insufficient memory is available
1237       @code{BZ_OK}
1238          otherwise.
1239 @end display
1240
1241 Possible return values:
1242 @display
1243       Pointer to an abstract @code{BZFILE}
1244          if @code{bzerror} is @code{BZ_OK}
1245       @code{NULL}
1246          otherwise
1247 @end display
1248
1249 Allowable next actions:
1250 @display
1251       @code{BZ2_bzRead}
1252          if @code{bzerror} is @code{BZ_OK}
1253       @code{BZ2_bzClose}
1254          otherwise
1255 @end display
1256
1257
1258 @subsection @code{BZ2_bzRead}
1259 @example
1260    int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
1261 @end example
1262 Reads up to @code{len} (uncompressed) bytes from the compressed file
1263 @code{b} into
1264 the buffer @code{buf}.  If the read was successful,
1265 @code{bzerror} is set to @code{BZ_OK}
1266 and the number of bytes read is returned.  If the logical end-of-stream
1267 was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
1268 and the number
1269 of bytes read is returned.  All other @code{bzerror} values denote an error.
1270
1271 @code{BZ2_bzRead} will supply @code{len} bytes,
1272 unless the logical stream end is detected
1273 or an error occurs.  Because of this, it is possible to detect the
1274 stream end by observing when the number of bytes returned is
1275 less than the number
1276 requested.  Nevertheless, this is regarded as inadvisable; you should
1277 instead check @code{bzerror} after every call and watch out for
1278 @code{BZ_STREAM_END}.
1279
1280 Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks
1281 of size @code{BZ_MAX_UNUSED} bytes
1282 before decompressing it.  If the file contains more bytes than strictly
1283 needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly
1284 read some of the trailing data before signalling @code{BZ_SEQUENCE_END}.
1285 To collect the read but unused data once @code{BZ_SEQUENCE_END} has
1286 appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}.
1287
1288 Possible assignments to @code{bzerror}:
1289 @display
1290       @code{BZ_PARAM_ERROR}
1291          if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1292       @code{BZ_SEQUENCE_ERROR}
1293          if @code{b} was opened with @code{BZ2_bzWriteOpen}
1294       @code{BZ_IO_ERROR}
1295          if there is an error reading from the compressed file
1296       @code{BZ_UNEXPECTED_EOF}
1297          if the compressed file ended before the logical end-of-stream was detected
1298       @code{BZ_DATA_ERROR}
1299          if a data integrity error was detected in the compressed stream
1300       @code{BZ_DATA_ERROR_MAGIC}
1301          if the stream does not begin with the requisite header bytes (ie, is not
1302          a @code{bzip2} data file).  This is really a special case of @code{BZ_DATA_ERROR}.
1303       @code{BZ_MEM_ERROR}
1304          if insufficient memory was available
1305       @code{BZ_STREAM_END}
1306          if the logical end of stream was detected.
1307       @code{BZ_OK}
1308          otherwise.
1309 @end display
1310
1311 Possible return values:
1312 @display
1313       number of bytes read
1314          if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
1315       undefined
1316          otherwise
1317 @end display
1318
1319 Allowable next actions:
1320 @display
1321       collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose}
1322          if @code{bzerror} is @code{BZ_OK}
1323       collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused}
1324          if @code{bzerror} is @code{BZ_SEQUENCE_END}
1325       @code{BZ2_bzReadClose}
1326          otherwise
1327 @end display
1328
1329
1330
1331 @subsection @code{BZ2_bzReadGetUnused}
1332 @example
1333    void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b,
1334                               void** unused, int* nUnused );
1335 @end example
1336 Returns data which was read from the compressed file but was not needed
1337 to get to the logical end-of-stream.  @code{*unused} is set to the address
1338 of the data, and @code{*nUnused} to the number of bytes.  @code{*nUnused} will
1339 be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1340
1341 This function may only be called once @code{BZ2_bzRead} has signalled
1342 @code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}.
1343
1344 Possible assignments to @code{bzerror}:
1345 @display
1346       @code{BZ_PARAM_ERROR}
1347          if @code{b} is @code{NULL}
1348          or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1349       @code{BZ_SEQUENCE_ERROR}
1350          if @code{BZ_STREAM_END} has not been signalled
1351          or if @code{b} was opened with @code{BZ2_bzWriteOpen}
1352      @code{BZ_OK}
1353          otherwise
1354 @end display
1355
1356 Allowable next actions:
1357 @display
1358       @code{BZ2_bzReadClose}
1359 @end display
1360
1361
1362 @subsection @code{BZ2_bzReadClose}
1363 @example
1364    void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
1365 @end example
1366 Releases all memory pertaining to the compressed file @code{b}.
1367 @code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file
1368 handle, so you should do that yourself if appropriate.
1369 @code{BZ2_bzReadClose} should be called to clean up after all error
1370 situations.
1371
1372 Possible assignments to @code{bzerror}:
1373 @display
1374       @code{BZ_SEQUENCE_ERROR}
1375          if @code{b} was opened with @code{BZ2_bzOpenWrite}
1376       @code{BZ_OK}
1377          otherwise
1378 @end display
1379
1380 Allowable next actions:
1381 @display
1382       none
1383 @end display
1384
1385
1386
1387 @subsection @code{BZ2_bzWriteOpen}
1388 @example
1389    BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f,
1390                              int blockSize100k, int verbosity,
1391                              int workFactor );
1392 @end example
1393 Prepare to write compressed data to file handle @code{f}.
1394 @code{f} should refer to
1395 a file which has been opened for writing, and for which the error
1396 indicator (@code{ferror(f)})is not set.
1397
1398 For the meaning of parameters @code{blockSize100k},
1399 @code{verbosity} and @code{workFactor}, see
1400 @* @code{BZ2_bzCompressInit}.
1401
1402 All required memory is allocated at this stage, so if the call
1403 completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1404 subsequent call to @code{BZ2_bzWrite}.
1405
1406 Possible assignments to @code{bzerror}:
1407 @display
1408       @code{BZ_CONFIG_ERROR}
1409          if the library has been mis-compiled
1410       @code{BZ_PARAM_ERROR}
1411          if @code{f} is @code{NULL}
1412          or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1413       @code{BZ_IO_ERROR}
1414          if @code{ferror(f)} is nonzero
1415       @code{BZ_MEM_ERROR}
1416          if insufficient memory is available
1417       @code{BZ_OK}
1418          otherwise
1419 @end display
1420
1421 Possible return values:
1422 @display
1423       Pointer to an abstract @code{BZFILE}
1424          if @code{bzerror} is @code{BZ_OK}
1425       @code{NULL}
1426          otherwise
1427 @end display
1428
1429 Allowable next actions:
1430 @display
1431       @code{BZ2_bzWrite}
1432          if @code{bzerror} is @code{BZ_OK}
1433          (you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless)
1434       @code{BZ2_bzWriteClose}
1435          otherwise
1436 @end display
1437
1438
1439
1440 @subsection @code{BZ2_bzWrite}
1441 @example
1442    void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1443 @end example
1444 Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1445 compressed and written to the file.
1446
1447 Possible assignments to @code{bzerror}:
1448 @display
1449       @code{BZ_PARAM_ERROR}
1450          if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1451       @code{BZ_SEQUENCE_ERROR}
1452          if b was opened with @code{BZ2_bzReadOpen}
1453       @code{BZ_IO_ERROR}
1454          if there is an error writing the compressed file.
1455       @code{BZ_OK}
1456          otherwise
1457 @end display
1458
1459
1460
1461
1462 @subsection @code{BZ2_bzWriteClose}
1463 @example
1464    void BZ2_bzWriteClose ( int *bzerror, BZFILE* f,
1465                            int abandon,
1466                            unsigned int* nbytes_in,
1467                            unsigned int* nbytes_out );
1468
1469    void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* f,
1470                              int abandon,
1471                              unsigned int* nbytes_in_lo32,
1472                              unsigned int* nbytes_in_hi32,
1473                              unsigned int* nbytes_out_lo32,
1474                              unsigned int* nbytes_out_hi32 );
1475 @end example
1476
1477 Compresses and flushes to the compressed file all data so far supplied
1478 by @code{BZ2_bzWrite}.  The logical end-of-stream markers are also written, so
1479 subsequent calls to @code{BZ2_bzWrite} are illegal.  All memory associated
1480 with the compressed file @code{b} is released.
1481 @code{fflush} is called on the
1482 compressed file, but it is not @code{fclose}'d.
1483
1484 If @code{BZ2_bzWriteClose} is called to clean up after an error, the only
1485 action is to release the memory.  The library records the error codes
1486 issued by previous calls, so this situation will be detected
1487 automatically.  There is no attempt to complete the compression
1488 operation, nor to @code{fflush} the compressed file.  You can force this
1489 behaviour to happen even in the case of no error, by passing a nonzero
1490 value to @code{abandon}.
1491
1492 If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
1493 total volume of uncompressed data handled.  Similarly, @code{nbytes_out}
1494 will be set to the total volume of compressed data written.  For
1495 compatibility with older versions of the library, @code{BZ2_bzWriteClose}
1496 only yields the lower 32 bits of these counts.  Use
1497 @code{BZ2_bzWriteClose64} if you want the full 64 bit counts.  These
1498 two functions are otherwise absolutely identical.
1499
1500
1501 Possible assignments to @code{bzerror}:
1502 @display
1503       @code{BZ_SEQUENCE_ERROR}
1504          if @code{b} was opened with @code{BZ2_bzReadOpen}
1505       @code{BZ_IO_ERROR}
1506          if there is an error writing the compressed file
1507       @code{BZ_OK}
1508          otherwise
1509 @end display
1510
1511 @subsection Handling embedded compressed data streams
1512
1513 The high-level library facilitates use of
1514 @code{bzip2} data streams which form some part of a surrounding, larger
1515 data stream.
1516 @itemize @bullet
1517 @item For writing, the library takes an open file handle, writes
1518 compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1519 The calling application can write its own data before and after the
1520 compressed data stream, using that same file handle.
1521 @item Reading is more complex, and the facilities are not as general
1522 as they could be since generality is hard to reconcile with efficiency.
1523 @code{BZ2_bzRead} reads from the compressed file in blocks of size
1524 @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
1525 the logical end of compressed stream.
1526 To recover this data once decompression has
1527 ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead}
1528 (the one returning @code{BZ_STREAM_END}) but before calling
1529 @code{BZ2_bzReadClose}.
1530 @end itemize
1531
1532 This mechanism makes it easy to decompress multiple @code{bzip2}
1533 streams placed end-to-end.  As the end of one stream, when @code{BZ2_bzRead}
1534 returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the
1535 unused data (copy it into your own buffer somewhere).
1536 That data forms the start of the next compressed stream.
1537 To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again,
1538 feeding in the unused data via the @code{unused}/@code{nUnused}
1539 parameters.
1540 Keep doing this until @code{BZ_STREAM_END} return coincides with the
1541 physical end of file (@code{feof(f)}).  In this situation
1542 @code{BZ2_bzReadGetUnused}
1543 will of course return no data.
1544
1545 This should give some feel for how the high-level interface can be used.
1546 If you require extra flexibility, you'll have to bite the bullet and get
1547 to grips with the low-level interface.
1548
1549 @subsection Standard file-reading/writing code
1550 Here's how you'd write data to a compressed file:
1551 @example @code
1552 FILE*   f;
1553 BZFILE* b;
1554 int     nBuf;
1555 char    buf[ /* whatever size you like */ ];
1556 int     bzerror;
1557 int     nWritten;
1558
1559 f = fopen ( "myfile.bz2", "w" );
1560 if (!f) @{
1561    /* handle error */
1562 @}
1563 b = BZ2_bzWriteOpen ( &bzerror, f, 9 );
1564 if (bzerror != BZ_OK) @{
1565    BZ2_bzWriteClose ( b );
1566    /* handle error */
1567 @}
1568
1569 while ( /* condition */ ) @{
1570    /* get data to write into buf, and set nBuf appropriately */
1571    nWritten = BZ2_bzWrite ( &bzerror, b, buf, nBuf );
1572    if (bzerror == BZ_IO_ERROR) @{
1573       BZ2_bzWriteClose ( &bzerror, b );
1574       /* handle error */
1575    @}
1576 @}
1577
1578 BZ2_bzWriteClose ( &bzerror, b );
1579 if (bzerror == BZ_IO_ERROR) @{
1580    /* handle error */
1581 @}
1582 @end example
1583 And to read from a compressed file:
1584 @example
1585 FILE*   f;
1586 BZFILE* b;
1587 int     nBuf;
1588 char    buf[ /* whatever size you like */ ];
1589 int     bzerror;
1590 int     nWritten;
1591
1592 f = fopen ( "myfile.bz2", "r" );
1593 if (!f) @{
1594    /* handle error */
1595 @}
1596 b = BZ2_bzReadOpen ( &bzerror, f, 0, NULL, 0 );
1597 if (bzerror != BZ_OK) @{
1598    BZ2_bzReadClose ( &bzerror, b );
1599    /* handle error */
1600 @}
1601
1602 bzerror = BZ_OK;
1603 while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
1604    nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
1605    if (bzerror == BZ_OK) @{
1606       /* do something with buf[0 .. nBuf-1] */
1607    @}
1608 @}
1609 if (bzerror != BZ_STREAM_END) @{
1610    BZ2_bzReadClose ( &bzerror, b );
1611    /* handle error */
1612 @} else @{
1613    BZ2_bzReadClose ( &bzerror );
1614 @}
1615 @end example
1616
1617
1618
1619 @section Utility functions
1620 @subsection @code{BZ2_bzBuffToBuffCompress}
1621 @example
1622    int BZ2_bzBuffToBuffCompress( char*         dest,
1623                                  unsigned int* destLen,
1624                                  char*         source,
1625                                  unsigned int  sourceLen,
1626                                  int           blockSize100k,
1627                                  int           verbosity,
1628                                  int           workFactor );
1629 @end example
1630 Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1631 into the destination buffer, @code{dest[0 .. *destLen-1]}.
1632 If the destination buffer is big enough, @code{*destLen} is
1633 set to the size of the compressed data, and @code{BZ_OK} is
1634 returned.  If the compressed data won't fit, @code{*destLen}
1635 is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1636
1637 Compression in this manner is a one-shot event, done with a single call
1638 to this function.  The resulting compressed data is a complete
1639 @code{bzip2} format data stream.  There is no mechanism for making
1640 additional calls to provide extra input data.  If you want that kind of
1641 mechanism, use the low-level interface.
1642
1643 For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1644 and @code{workFactor}, @* see @code{BZ2_bzCompressInit}.
1645
1646 To guarantee that the compressed data will fit in its buffer, allocate
1647 an output buffer of size 1% larger than the uncompressed data, plus
1648 six hundred extra bytes.
1649
1650 @code{BZ2_bzBuffToBuffDecompress} will not write data at or
1651 beyond @code{dest[*destLen]}, even in case of buffer overflow.
1652
1653 Possible return values:
1654 @display
1655       @code{BZ_CONFIG_ERROR}
1656          if the library has been mis-compiled
1657       @code{BZ_PARAM_ERROR}
1658          if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1659          or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1660          or @code{verbosity < 0} or @code{verbosity > 4}
1661          or @code{workFactor < 0} or @code{workFactor > 250}
1662       @code{BZ_MEM_ERROR}
1663          if insufficient memory is available
1664       @code{BZ_OUTBUFF_FULL}
1665          if the size of the compressed data exceeds @code{*destLen}
1666       @code{BZ_OK}
1667          otherwise
1668 @end display
1669
1670
1671
1672 @subsection @code{BZ2_bzBuffToBuffDecompress}
1673 @example
1674    int BZ2_bzBuffToBuffDecompress ( char*         dest,
1675                                     unsigned int* destLen,
1676                                     char*         source,
1677                                     unsigned int  sourceLen,
1678                                     int           small,
1679                                     int           verbosity );
1680 @end example
1681 Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1682 into the destination buffer, @code{dest[0 .. *destLen-1]}.
1683 If the destination buffer is big enough, @code{*destLen} is
1684 set to the size of the uncompressed data, and @code{BZ_OK} is
1685 returned.  If the compressed data won't fit, @code{*destLen}
1686 is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1687
1688 @code{source} is assumed to hold a complete @code{bzip2} format
1689 data stream.  @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress
1690 the entirety of the stream into the output buffer.
1691
1692 For the meaning of parameters @code{small} and @code{verbosity},
1693 see @code{BZ2_bzDecompressInit}.
1694
1695 Because the compression ratio of the compressed data cannot be known in
1696 advance, there is no easy way to guarantee that the output buffer will
1697 be big enough.  You may of course make arrangements in your code to
1698 record the size of the uncompressed data, but such a mechanism is beyond
1699 the scope of this library.
1700
1701 @code{BZ2_bzBuffToBuffDecompress} will not write data at or
1702 beyond @code{dest[*destLen]}, even in case of buffer overflow.
1703
1704 Possible return values:
1705 @display
1706       @code{BZ_CONFIG_ERROR}
1707          if the library has been mis-compiled
1708       @code{BZ_PARAM_ERROR}
1709          if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1710          or @code{small != 0 && small != 1}
1711          or @code{verbosity < 0} or @code{verbosity > 4}
1712       @code{BZ_MEM_ERROR}
1713          if insufficient memory is available
1714       @code{BZ_OUTBUFF_FULL}
1715          if the size of the compressed data exceeds @code{*destLen}
1716       @code{BZ_DATA_ERROR}
1717          if a data integrity error was detected in the compressed data
1718       @code{BZ_DATA_ERROR_MAGIC}
1719          if the compressed data doesn't begin with the right magic bytes
1720       @code{BZ_UNEXPECTED_EOF}
1721          if the compressed data ends unexpectedly
1722       @code{BZ_OK}
1723          otherwise
1724 @end display
1725
1726
1727
1728 @section @code{zlib} compatibility functions
1729 Yoshioka Tsuneo has contributed some functions to
1730 give better @code{zlib} compatibility.  These functions are
1731 @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
1732 @code{BZ2_bzclose},
1733 @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.
1734 These functions are not (yet) officially part of
1735 the library.  If they break, you get to keep all the pieces.
1736 Nevertheless, I think they work ok.
1737 @example
1738 typedef void BZFILE;
1739
1740 const char * BZ2_bzlibVersion ( void );
1741 @end example
1742 Returns a string indicating the library version.
1743 @example
1744 BZFILE * BZ2_bzopen  ( const char *path, const char *mode );
1745 BZFILE * BZ2_bzdopen ( int        fd,    const char *mode );
1746 @end example
1747 Opens a @code{.bz2} file for reading or writing, using either its name
1748 or a pre-existing file descriptor.
1749 Analogous to @code{fopen} and @code{fdopen}.
1750 @example
1751 int BZ2_bzread  ( BZFILE* b, void* buf, int len );
1752 int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
1753 @end example
1754 Reads/writes data from/to a previously opened @code{BZFILE}.
1755 Analogous to @code{fread} and @code{fwrite}.
1756 @example
1757 int  BZ2_bzflush ( BZFILE* b );
1758 void BZ2_bzclose ( BZFILE* b );
1759 @end example
1760 Flushes/closes a @code{BZFILE}.  @code{BZ2_bzflush} doesn't actually do
1761 anything.  Analogous to @code{fflush} and @code{fclose}.
1762
1763 @example
1764 const char * BZ2_bzerror ( BZFILE *b, int *errnum )
1765 @end example
1766 Returns a string describing the more recent error status of
1767 @code{b}, and also sets @code{*errnum} to its numerical value.
1768
1769
1770 @section Using the library in a @code{stdio}-free environment
1771
1772 @subsection Getting rid of @code{stdio}
1773
1774 In a deeply embedded application, you might want to use just
1775 the memory-to-memory functions.  You can do this conveniently
1776 by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1777 defined.  Doing this gives you a library containing only the following
1778 eight functions:
1779
1780 @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @*
1781 @code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @*
1782 @code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress}
1783
1784 When compiled like this, all functions will ignore @code{verbosity}
1785 settings.
1786
1787 @subsection Critical error handling
1788 @code{libbzip2} contains a number of internal assertion checks which
1789 should, needless to say, never be activated.  Nevertheless, if an
1790 assertion should fail, behaviour depends on whether or not the library
1791 was compiled with @code{BZ_NO_STDIO} set.
1792
1793 For a normal compile, an assertion failure yields the message
1794 @example
1795    bzip2/libbzip2: internal error number N.
1796    This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001.
1797    Please report it to me at: jseward@@acm.org.  If this happened
1798    when you were using some program which uses libbzip2 as a
1799    component, you should also report this bug to the author(s)
1800    of that program.  Please make an effort to report this bug;
1801    timely and accurate bug reports eventually lead to higher
1802    quality software.  Thanks.  Julian Seward, 30 December 2001.
1803 @end example
1804 where @code{N} is some error code number.  If @code{N == 1007}, it also
1805 prints some extra text advising the reader that unreliable memory is
1806 often associated with internal error 1007.  (This is a
1807 frequently-observed-phenomenon with versions 1.0.0/1.0.1).
1808
1809 @code{exit(3)} is then called.
1810
1811 For a @code{stdio}-free library, assertion failures result
1812 in a call to a function declared as:
1813 @example
1814    extern void bz_internal_error ( int errcode );
1815 @end example
1816 The relevant code is passed as a parameter.  You should supply
1817 such a function.
1818
1819 In either case, once an assertion failure has occurred, any
1820 @code{bz_stream} records involved can be regarded as invalid.
1821 You should not attempt to resume normal operation with them.
1822
1823 You may, of course, change critical error handling to suit
1824 your needs.  As I said above, critical errors indicate bugs
1825 in the library and should not occur.  All "normal" error
1826 situations are indicated via error return codes from functions,
1827 and can be recovered from.
1828
1829
1830 @section Making a Windows DLL
1831 Everything related to Windows has been contributed by Yoshioka Tsuneo
1832 @* (@code{QWF00133@@niftyserve.or.jp} /
1833 @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1834 him (but perhaps Cc: me, @code{jseward@@acm.org}).
1835
1836 My vague understanding of what to do is: using Visual C++ 5.0,
1837 open the project file @code{libbz2.dsp}, and build.  That's all.
1838
1839 If you can't
1840 open the project file for some reason, make a new one, naming these files:
1841 @code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1842 @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1843 @code{randtable.c} and @code{libbz2.def}.  You will also need
1844 to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1845
1846 If you don't use VC++, you may need to define the proprocessor symbol
1847 @code{_WIN32}.
1848
1849 Finally, @code{dlltest.c} is a sample program using the DLL.  It has a
1850 project file, @code{dlltest.dsp}.
1851
1852 If you just want a makefile for Visual C, have a look at
1853 @code{makefile.msc}.
1854
1855 Be aware that if you compile @code{bzip2} itself on Win32, you must set
1856 @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file
1857 @code{bzip2.c}, before compiling.  Otherwise the resulting binary won't
1858 work correctly.
1859
1860 I haven't tried any of this stuff myself, but it all looks plausible.
1861
1862
1863
1864 @chapter Miscellanea
1865
1866 These are just some random thoughts of mine.  Your mileage may
1867 vary.
1868
1869 @section Limitations of the compressed file format
1870 @code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0}
1871 use exactly the same file format as the previous
1872 version, @code{bzip2-0.1}.  This decision was made in the interests of
1873 stability.  Creating yet another incompatible compressed file format
1874 would create further confusion and disruption for users.
1875
1876 Nevertheless, this is not a painless decision.  Development
1877 work since the release of @code{bzip2-0.1} in August 1997
1878 has shown complexities in the file format which slow down
1879 decompression and, in retrospect, are unnecessary.  These are:
1880 @itemize @bullet
1881 @item The run-length encoder, which is the first of the
1882       compression transformations, is entirely irrelevant.
1883       The original purpose was to protect the sorting algorithm
1884       from the very worst case input: a string of repeated
1885       symbols.  But algorithm steps Q6a and Q6b in the original
1886       Burrows-Wheeler technical report (SRC-124) show how
1887       repeats can be handled without difficulty in block
1888       sorting.
1889 @item The randomisation mechanism doesn't really need to be
1890       there.  Udi Manber and Gene Myers published a suffix
1891       array construction algorithm a few years back, which
1892       can be employed to sort any block, no matter how
1893       repetitive, in O(N log N) time.  Subsequent work by
1894       Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1895       algorithm which usually outperforms the Manber-Myers
1896       algorithm.
1897
1898       I could have changed to Sadakane's algorithm, but I find
1899       it to be slower than @code{bzip2}'s existing algorithm for
1900       most inputs, and the randomisation mechanism protects
1901       adequately against bad cases.  I didn't think it was
1902       a good tradeoff to make.  Partly this is due to the fact
1903       that I was not flooded with email complaints about
1904       @code{bzip2-0.1}'s performance on repetitive data, so
1905       perhaps it isn't a problem for real inputs.
1906
1907       Probably the best long-term solution,
1908       and the one I have incorporated into 0.9.5 and above,
1909       is to use the existing sorting
1910       algorithm initially, and fall back to a O(N (log N)^2)
1911       algorithm if the standard algorithm gets into difficulties.
1912 @item The compressed file format was never designed to be
1913       handled by a library, and I have had to jump though
1914       some hoops to produce an efficient implementation of
1915       decompression.  It's a bit hairy.  Try passing
1916       @code{decompress.c} through the C preprocessor
1917       and you'll see what I mean.  Much of this complexity
1918       could have been avoided if the compressed size of
1919       each block of data was recorded in the data stream.
1920 @item An Adler-32 checksum, rather than a CRC32 checksum,
1921       would be faster to compute.
1922 @end itemize
1923 It would be fair to say that the @code{bzip2} format was frozen
1924 before I properly and fully understood the performance
1925 consequences of doing so.
1926
1927 Improvements which I was able to incorporate into
1928 0.9.0, despite using the same file format, are:
1929 @itemize @bullet
1930 @item Single array implementation of the inverse BWT.  This
1931       significantly speeds up decompression, presumably
1932       because it reduces the number of cache misses.
1933 @item Faster inverse MTF transform for large MTF values.  The
1934       new implementation is based on the notion of sliding blocks
1935       of values.
1936 @item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1937       and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1938       Duh!  Well, you live and learn.
1939
1940 @end itemize
1941 Further ahead, it would be nice
1942 to be able to do random access into files.  This will
1943 require some careful design of compressed file formats.
1944
1945
1946
1947 @section Portability issues
1948 After some consideration, I have decided not to use
1949 GNU @code{autoconf} to configure 0.9.5 or 1.0.
1950
1951 @code{autoconf}, admirable and wonderful though it is,
1952 mainly assists with portability problems between Unix-like
1953 platforms.  But @code{bzip2} doesn't have much in the way
1954 of portability problems on Unix; most of the difficulties appear
1955 when porting to the Mac, or to Microsoft's operating systems.
1956 @code{autoconf} doesn't help in those cases, and brings in a
1957 whole load of new complexity.
1958
1959 Most people should be able to compile the library and program
1960 under Unix straight out-of-the-box, so to speak, especially
1961 if you have a version of GNU C available.
1962
1963 There are a couple of @code{__inline__} directives in the code.  GNU C
1964 (@code{gcc}) should be able to handle them.  If you're not using
1965 GNU C, your C compiler shouldn't see them at all.
1966 If your compiler does, for some reason, see them and doesn't
1967 like them, just @code{#define} @code{__inline__} to be @code{/* */}.  One
1968 easy way to do this is to compile with the flag @code{-D__inline__=},
1969 which should be understood by most Unix compilers.
1970
1971 If you still have difficulties, try compiling with the macro
1972 @code{BZ_STRICT_ANSI} defined.  This should enable you to build the
1973 library in a strictly ANSI compliant environment.  Building the program
1974 itself like this is dangerous and not supported, since you remove
1975 @code{bzip2}'s checks against compressing directories, symbolic links,
1976 devices, and other not-really-a-file entities.  This could cause
1977 filesystem corruption!
1978
1979 One other thing: if you create a @code{bzip2} binary for public
1980 distribution, please try and link it statically (@code{gcc -s}).  This
1981 avoids all sorts of library-version issues that others may encounter
1982 later on.
1983
1984 If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and
1985 @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling.
1986 Otherwise the resulting binary won't work correctly.
1987
1988
1989
1990 @section Reporting bugs
1991 I tried pretty hard to make sure @code{bzip2} is
1992 bug free, both by design and by testing.  Hopefully
1993 you'll never need to read this section for real.
1994
1995 Nevertheless, if @code{bzip2} dies with a segmentation
1996 fault, a bus error or an internal assertion failure, it
1997 will ask you to email me a bug report.  Experience with
1998 version 0.1 shows that almost all these problems can
1999 be traced to either compiler bugs or hardware problems.
2000 @itemize @bullet
2001 @item
2002 Recompile the program with no optimisation, and see if it
2003 works.  And/or try a different compiler.
2004 I heard all sorts of stories about various flavours
2005 of GNU C (and other compilers) generating bad code for
2006 @code{bzip2}, and I've run across two such examples myself.
2007
2008 2.7.X versions of GNU C are known to generate bad code from
2009 time to time, at high optimisation levels.
2010 If you get problems, try using the flags
2011 @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
2012 You should specifically @emph{not} use @code{-funroll-loops}.
2013
2014 You may notice that the Makefile runs six tests as part of
2015 the build process.  If the program passes all of these, it's
2016 a pretty good (but not 100%) indication that the compiler has
2017 done its job correctly.
2018 @item
2019 If @code{bzip2} crashes randomly, and the crashes are not
2020 repeatable, you may have a flaky memory subsystem.  @code{bzip2}
2021 really hammers your memory hierarchy, and if it's a bit marginal,
2022 you may get these problems.  Ditto if your disk or I/O subsystem
2023 is slowly failing.  Yup, this really does happen.
2024
2025 Try using a different machine of the same type, and see if
2026 you can repeat the problem.
2027 @item This isn't really a bug, but ... If @code{bzip2} tells
2028 you your file is corrupted on decompression, and you
2029 obtained the file via FTP, there is a possibility that you
2030 forgot to tell FTP to do a binary mode transfer.  That absolutely
2031 will cause the file to be non-decompressible.  You'll have to transfer
2032 it again.
2033 @end itemize
2034
2035 If you've incorporated @code{libbzip2} into your own program
2036 and are getting problems, please, please, please, check that the
2037 parameters you are passing in calls to the library, are
2038 correct, and in accordance with what the documentation says
2039 is allowable.  I have tried to make the library robust against
2040 such problems, but I'm sure I haven't succeeded.
2041
2042 Finally, if the above comments don't help, you'll have to send
2043 me a bug report.  Now, it's just amazing how many people will
2044 send me a bug report saying something like
2045 @display
2046    bzip2 crashed with segmentation fault on my machine
2047 @end display
2048 and absolutely nothing else.  Needless to say, a such a report
2049 is @emph{totally, utterly, completely and comprehensively 100% useless;
2050 a waste of your time, my time, and net bandwidth}.
2051 With no details at all, there's no way I can possibly begin
2052 to figure out what the problem is.
2053
2054 The rules of the game are: facts, facts, facts.  Don't omit
2055 them because "oh, they won't be relevant".  At the bare
2056 minimum:
2057 @display
2058    Machine type.  Operating system version.
2059    Exact version of @code{bzip2} (do @code{bzip2 -V}).
2060    Exact version of the compiler used.
2061    Flags passed to the compiler.
2062 @end display
2063 However, the most important single thing that will help me is
2064 the file that you were trying to compress or decompress at the
2065 time the problem happened.  Without that, my ability to do anything
2066 more than speculate about the cause, is limited.
2067
2068 Please remember that I connect to the Internet with a modem, so
2069 you should contact me before mailing me huge files.
2070
2071
2072 @section Did you get the right package?
2073
2074 @code{bzip2} is a resource hog.  It soaks up large amounts of CPU cycles
2075 and memory.  Also, it gives very large latencies.  In the worst case, you
2076 can feed many megabytes of uncompressed data into the library before
2077 getting any compressed output, so this probably rules out applications
2078 requiring interactive behaviour.
2079
2080 These aren't faults of my implementation, I hope, but more
2081 an intrinsic property of the Burrows-Wheeler transform (unfortunately).
2082 Maybe this isn't what you want.
2083
2084 If you want a compressor and/or library which is faster, uses less
2085 memory but gets pretty good compression, and has minimal latency,
2086 consider Jean-loup
2087 Gailly's and Mark Adler's work, @code{zlib-1.1.3} and
2088 @code{gzip-1.2.4}.  Look for them at
2089
2090 @code{http://www.zlib.org} and
2091 @code{http://www.gzip.org} respectively.
2092
2093 For something faster and lighter still, you might try Markus F X J
2094 Oberhumer's @code{LZO} real-time compression/decompression library, at
2095 @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
2096
2097 If you want to use the @code{bzip2} algorithms to compress small blocks
2098 of data, 64k bytes or smaller, for example on an on-the-fly disk
2099 compressor, you'd be well advised not to use this library.  Instead,
2100 I've made a special library tuned for that kind of use.  It's part of
2101 @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
2102 @code{ext2} filesystem.  Look at
2103 @code{http://www.netspace.net.au/~reiter/e2compr}.
2104
2105
2106
2107 @section Testing
2108
2109 A record of the tests I've done.
2110
2111 First, some data sets:
2112 @itemize @bullet
2113 @item B: a directory containing 6001 files, one for every length in the
2114       range 0 to 6000 bytes.  The files contain random lowercase
2115       letters.  18.7 megabytes.
2116 @item H: my home directory tree.  Documents, source code, mail files,
2117       compressed data.  H contains B, and also a directory of
2118       files designed as boundary cases for the sorting; mostly very
2119       repetitive, nasty files.  565 megabytes.
2120 @item A: directory tree holding various applications built from source:
2121       @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc.
2122       2200 megabytes.
2123 @end itemize
2124 The tests conducted are as follows.  Each test means compressing
2125 (a copy of) each file in the data set, decompressing it and
2126 comparing it against the original.
2127
2128 First, a bunch of tests with block sizes and internal buffer
2129 sizes set very small,
2130 to detect any problems with the
2131 blocking and buffering mechanisms.
2132 This required modifying the source code so as to try to
2133 break it.
2134 @enumerate
2135 @item Data set H, with
2136       buffer size of 1 byte, and block size of 23 bytes.
2137 @item Data set B, buffer sizes 1 byte, block size 1 byte.
2138 @item As (2) but small-mode decompression.
2139 @item As (2) with block size 2 bytes.
2140 @item As (2) with block size 3 bytes.
2141 @item As (2) with block size 4 bytes.
2142 @item As (2) with block size 5 bytes.
2143 @item As (2) with block size 6 bytes and small-mode decompression.
2144 @item H with buffer size of 1 byte, but normal block
2145       size (up to 900000 bytes).
2146 @end enumerate
2147 Then some tests with unmodified source code.
2148 @enumerate
2149 @item H, all settings normal.
2150 @item As (1), with small-mode decompress.
2151 @item H, compress with flag @code{-1}.
2152 @item H, compress with flag @code{-s}, decompress with flag @code{-s}.
2153 @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
2154       @code{bzip2-0.9.5} decompressing, all settings normal.
2155 @item Backwards compatibility:  H, @code{bzip2-0.9.5} compressing,
2156       @code{bzip2-0.1pl2} decompressing, all settings normal.
2157 @item Bigger tests: A, all settings normal.
2158 @item As (7), using the fallback (Sadakane-like) sorting algorithm.
2159 @item As (8), compress with flag @code{-1}, decompress with flag
2160       @code{-s}.
2161 @item H, using the fallback sorting algorithm.
2162 @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing,
2163       @code{bzip2-0.9.5} decompressing, all settings normal.
2164 @item Backwards compatibility:  A, @code{bzip2-0.9.5} compressing,
2165       @code{bzip2-0.1pl2} decompressing, all settings normal.
2166 @item Misc test: about 400 megabytes of @code{.tar} files with
2167       @code{bzip2} compiled with Checker (a memory access error
2168        detector, like Purify).
2169 @item Misc tests to make sure it builds and runs ok on non-Linux/x86
2170       platforms.
2171 @end enumerate
2172 These tests were conducted on a 225 MHz IDT WinChip machine, running
2173 Linux 2.0.36.  They represent nearly a week of continuous computation.
2174 All tests completed successfully.
2175
2176
2177 @section Further reading
2178 @code{bzip2} is not research work, in the sense that it doesn't present
2179 any new ideas.  Rather, it's an engineering exercise based on existing
2180 ideas.
2181
2182 Four documents describe essentially all the ideas behind @code{bzip2}:
2183 @example
2184 Michael Burrows and D. J. Wheeler:
2185   "A block-sorting lossless data compression algorithm"
2186    10th May 1994.
2187    Digital SRC Research Report 124.
2188    ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
2189    If you have trouble finding it, try searching at the
2190    New Zealand Digital Library, http://www.nzdl.org.
2191
2192 Daniel S. Hirschberg and Debra A. LeLewer
2193   "Efficient Decoding of Prefix Codes"
2194    Communications of the ACM, April 1990, Vol 33, Number 4.
2195    You might be able to get an electronic copy of this
2196       from the ACM Digital Library.
2197
2198 David J. Wheeler
2199    Program bred3.c and accompanying document bred3.ps.
2200    This contains the idea behind the multi-table Huffman
2201    coding scheme.
2202    ftp://ftp.cl.cam.ac.uk/users/djw3/
2203
2204 Jon L. Bentley and Robert Sedgewick
2205   "Fast Algorithms for Sorting and Searching Strings"
2206    Available from Sedgewick's web page,
2207    www.cs.princeton.edu/~rs
2208 @end example
2209 The following paper gives valuable additional insights into the
2210 algorithm, but is not immediately the basis of any code
2211 used in bzip2.
2212 @example
2213 Peter Fenwick:
2214    Block Sorting Text Compression
2215    Proceedings of the 19th Australasian Computer Science Conference,
2216      Melbourne, Australia.  Jan 31 - Feb 2, 1996.
2217    ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
2218 @end example
2219 Kunihiko Sadakane's sorting algorithm, mentioned above,
2220 is available from:
2221 @example
2222 http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
2223 @end example
2224 The Manber-Myers suffix array construction
2225 algorithm is described in a paper
2226 available from:
2227 @example
2228 http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
2229 @end example
2230 Finally, the following paper documents some recent investigations
2231 I made into the performance of sorting algorithms:
2232 @example
2233 Julian Seward:
2234    On the Performance of BWT Sorting Algorithms
2235    Proceedings of the IEEE Data Compression Conference 2000
2236      Snowbird, Utah.  28-30 March 2000.
2237 @end example
2238
2239
2240 @contents
2241
2242 @bye
2243