1 \input texinfo @c -*- Texinfo -*-
2 @setfilename bzip2.info
5 This file documents bzip2 version 1.0.2, and associated library
6 libbzip2, written by Julian Seward (jseward@acm.org).
8 Copyright (C) 1996-2002 Julian R Seward
10 Permission is granted to make and distribute verbatim copies of
11 this manual provided the copyright notice and this permission notice
12 are preserved on all copies.
14 Permission is granted to copy and distribute translations of this manual
15 into another language, under the above conditions for verbatim copies.
21 * Bzip2: (bzip2). A program and library for data compression.
29 @settitle bzip2 and libbzip2
31 @title bzip2 and libbzip2
32 @subtitle a program and library for data compression
33 @subtitle copyright (C) 1996-2002 Julian Seward
34 @subtitle version 1.0.2 of 30 December 2001
45 The following text is the License for this software. You should
46 find it identical to that contained in the file LICENSE in the
49 @bf{------------------ START OF THE LICENSE ------------------}
51 This program, @code{bzip2},
52 and associated library @code{libbzip2}, are
53 Copyright (C) 1996-2002 Julian R Seward. All rights reserved.
55 Redistribution and use in source and binary forms, with or without
56 modification, are permitted provided that the following conditions
60 Redistributions of source code must retain the above copyright
61 notice, this list of conditions and the following disclaimer.
63 The origin of this software must not be misrepresented; you must
64 not claim that you wrote the original software. If you use this
65 software in a product, an acknowledgment in the product
66 documentation would be appreciated but is not required.
68 Altered source versions must be plainly marked as such, and must
69 not be misrepresented as being the original software.
71 The name of the author may not be used to endorse or promote
72 products derived from this software without specific prior written
75 THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS
76 OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
77 WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
78 ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY
79 DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
80 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
81 GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
82 INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
83 WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
84 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
85 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
87 Julian Seward, Cambridge, UK.
89 @code{jseward@@acm.org}
91 @code{bzip2}/@code{libbzip2} version 1.0.2 of 30 December 2001.
93 @bf{------------------ END OF THE LICENSE ------------------}
97 @code{http://sources.redhat.com/bzip2}
99 @code{http://www.cacheprof.org}
101 PATENTS: To the best of my knowledge, @code{bzip2} does not use any patented
102 algorithms. However, I do not have the resources available to carry out
103 a full patent search. Therefore I cannot give any guarantee of the
112 @chapter Introduction
114 @code{bzip2} compresses files using the Burrows-Wheeler
115 block-sorting text compression algorithm, and Huffman coding.
116 Compression is generally considerably better than that
117 achieved by more conventional LZ77/LZ78-based compressors,
118 and approaches the performance of the PPM family of statistical compressors.
120 @code{bzip2} is built on top of @code{libbzip2}, a flexible library
121 for handling compressed data in the @code{bzip2} format. This manual
122 describes both how to use the program and
123 how to work with the library interface. Most of the
124 manual is devoted to this library, not the program,
125 which is good news if your interest is only in the program.
127 Chapter 2 describes how to use @code{bzip2}; this is the only part
128 you need to read if you just want to know how to operate the program.
129 Chapter 3 describes the programming interfaces in detail, and
130 Chapter 4 records some miscellaneous notes which I thought
131 ought to be recorded somewhere.
134 @chapter How to use @code{bzip2}
136 This chapter contains a copy of the @code{bzip2} man page,
141 @unnumberedsubsubsec NAME
143 @item @code{bzip2}, @code{bunzip2}
144 - a block-sorting file compressor, v1.0.2
146 - decompresses files to stdout
147 @item @code{bzip2recover}
148 - recovers data from damaged bzip2 files
151 @unnumberedsubsubsec SYNOPSIS
153 @item @code{bzip2} [ -cdfkqstvzVL123456789 ] [ filenames ... ]
154 @item @code{bunzip2} [ -fkvsVL ] [ filenames ... ]
155 @item @code{bzcat} [ -s ] [ filenames ... ]
156 @item @code{bzip2recover} filename
159 @unnumberedsubsubsec DESCRIPTION
161 @code{bzip2} compresses files using the Burrows-Wheeler block sorting
162 text compression algorithm, and Huffman coding. Compression is
163 generally considerably better than that achieved by more conventional
164 LZ77/LZ78-based compressors, and approaches the performance of the PPM
165 family of statistical compressors.
167 The command-line options are deliberately very similar to those of GNU
168 @code{gzip}, but they are not identical.
170 @code{bzip2} expects a list of file names to accompany the command-line
171 flags. Each file is replaced by a compressed version of itself, with
172 the name @code{original_name.bz2}. Each compressed file has the same
173 modification date, permissions, and, when possible, ownership as the
174 corresponding original, so that these properties can be correctly
175 restored at decompression time. File name handling is naive in the
176 sense that there is no mechanism for preserving original file names,
177 permissions, ownerships or dates in filesystems which lack these
178 concepts, or have serious file name length restrictions, such as MS-DOS.
180 @code{bzip2} and @code{bunzip2} will by default not overwrite existing
181 files. If you want this to happen, specify the @code{-f} flag.
183 If no file names are specified, @code{bzip2} compresses from standard
184 input to standard output. In this case, @code{bzip2} will decline to
185 write compressed output to a terminal, as this would be entirely
186 incomprehensible and therefore pointless.
188 @code{bunzip2} (or @code{bzip2 -d}) decompresses all
189 specified files. Files which were not created by @code{bzip2}
190 will be detected and ignored, and a warning issued.
191 @code{bzip2} attempts to guess the filename for the decompressed file
192 from that of the compressed file as follows:
194 @item @code{filename.bz2 } becomes @code{filename}
195 @item @code{filename.bz } becomes @code{filename}
196 @item @code{filename.tbz2} becomes @code{filename.tar}
197 @item @code{filename.tbz } becomes @code{filename.tar}
198 @item @code{anyothername } becomes @code{anyothername.out}
200 If the file does not end in one of the recognised endings,
201 @code{.bz2}, @code{.bz},
202 @code{.tbz2} or @code{.tbz}, @code{bzip2} complains that it cannot
203 guess the name of the original file, and uses the original name
204 with @code{.out} appended.
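
The suffix rules above can be sketched in C roughly as follows. This is
an illustration only, not the code @code{bzip2} itself uses; the names
@code{suffix_map} and @code{guess_output_name} are hypothetical, and the
sketch assumes that a name consisting of nothing but a suffix is treated
as unrecognised.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Suffix rules from the list above: {compressed suffix, replacement}. */
static const char *const suffix_map[][2] = {
    { ".bz2",  ""     },
    { ".bz",   ""     },
    { ".tbz2", ".tar" },
    { ".tbz",  ".tar" },
};

/* Writes the guessed output name for `in` into `out` (capacity `cap`).
   Returns 1 if a recognised suffix was replaced, 0 if ".out" was
   appended, and -1 if `out` is too small to hold the result. */
int guess_output_name(const char *in, char *out, size_t cap)
{
    size_t in_len = strlen(in);
    size_t i;
    int n;

    assert(out != NULL && cap > 0);
    for (i = 0; i < sizeof suffix_map / sizeof suffix_map[0]; i++) {
        const char *suf = suffix_map[i][0];
        size_t suf_len = strlen(suf);

        /* Require a non-empty stem in front of the suffix. */
        if (in_len > suf_len && strcmp(in + in_len - suf_len, suf) == 0) {
            n = snprintf(out, cap, "%.*s%s",
                         (int)(in_len - suf_len), in, suffix_map[i][1]);
            return (n >= 0 && (size_t)n < cap) ? 1 : -1;
        }
    }
    /* Unrecognised ending: fall back to appending ".out". */
    n = snprintf(out, cap, "%s.out", in);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

A caller would pass the compressed file name and a buffer, and could use
the return value to decide whether to print the "cannot guess the
original name" complaint.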
206 As with compression, supplying no
207 filenames causes decompression from standard input to standard output.
209 @code{bunzip2} will correctly decompress a file which is the
210 concatenation of two or more compressed files. The result is the
211 concatenation of the corresponding uncompressed files. Integrity
212 testing (@code{-t}) of concatenated compressed files is also supported.
214 You can also compress or decompress files to the standard output by
215 giving the @code{-c} flag. Multiple files may be compressed and
216 decompressed like this. The resulting outputs are fed sequentially to
217 stdout. Compression of multiple files in this manner generates a stream
218 containing multiple compressed file representations. Such a stream
219 can be decompressed correctly only by @code{bzip2} version 0.9.0 or
220 later. Earlier versions of @code{bzip2} will stop after decompressing
221 the first file in the stream.
@code{bzcat} (or @code{bzip2 -dc}) decompresses all specified files to
the standard output.
226 @code{bzip2} will read arguments from the environment variables
227 @code{BZIP2} and @code{BZIP}, in that order, and will process them
228 before any arguments read from the command line. This gives a
229 convenient way to supply default arguments.
231 Compression is always performed, even if the compressed file is slightly
232 larger than the original. Files of less than about one hundred bytes
233 tend to get larger, since the compression mechanism has a constant
234 overhead in the region of 50 bytes. Random data (including the output
235 of most file compressors) is coded at about 8.05 bits per byte, giving
236 an expansion of around 0.5%.
238 As a self-check for your protection, @code{bzip2} uses 32-bit CRCs to
239 make sure that the decompressed version of a file is identical to the
240 original. This guards against corruption of the compressed data, and
against undetected bugs in @code{bzip2} (hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed. Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you recover the original uncompressed
data. You can use @code{bzip2recover} to try to recover data from
damaged files.
Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, and 3 for an internal consistency error (e.g. a bug) which
caused @code{bzip2} to panic.
255 @unnumberedsubsubsec OPTIONS
258 Compress or decompress to standard output.
259 @item -d --decompress
260 Force decompression. @code{bzip2}, @code{bunzip2} and @code{bzcat} are
261 really the same program, and the decision about what actions to take is
262 done on the basis of which name is used. This flag overrides that
263 mechanism, and forces bzip2 to decompress.
265 The complement to @code{-d}: forces compression, regardless of the
268 Check integrity of the specified file(s), but don't decompress them.
269 This really performs a trial decompression and throws away the result.
271 Force overwrite of output files. Normally, @code{bzip2} will not overwrite
272 existing output files. Also forces @code{bzip2} to break hard links
273 to files, which it otherwise wouldn't do.
275 @code{bzip2} normally declines to decompress files which don't have the
276 correct magic header bytes. If forced (@code{-f}), however, it will
pass such files through unmodified. This is how GNU @code{gzip}
behaves.
Keep (don't delete) input files during compression
or decompression.
283 Reduce memory usage, for compression, decompression and testing. Files
284 are decompressed and tested using a modified algorithm which only
285 requires 2.5 bytes per block byte. This means any file can be
286 decompressed in 2300k of memory, albeit at about half the normal speed.
288 During compression, @code{-s} selects a block size of 200k, which limits
289 memory use to around the same figure, at the expense of your compression
290 ratio. In short, if your machine is low on memory (8 megabytes or
291 less), use -s for everything. See MEMORY MANAGEMENT below.
293 Suppress non-essential warning messages. Messages pertaining to
294 I/O errors and other critical events will not be suppressed.
296 Verbose mode -- show the compression ratio for each file processed.
297 Further @code{-v}'s increase the verbosity level, spewing out lots of
298 information which is primarily of interest for diagnostic purposes.
299 @item -L --license -V --version
300 Display the software version, license terms and conditions.
301 @item -1 (or --fast) to -9 (or --best)
302 Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
303 effect when decompressing. See MEMORY MANAGEMENT below.
304 The @code{--fast} and @code{--best} aliases are primarily for GNU
305 @code{gzip} compatibility. In particular, @code{--fast} doesn't make
things significantly faster. And @code{--best} merely selects the
default behaviour.
309 Treats all subsequent arguments as file names, even if they start
310 with a dash. This is so you can handle files with names beginning
311 with a dash, for example: @code{bzip2 -- -myfilename}.
312 @item --repetitive-fast
313 @item --repetitive-best
314 These flags are redundant in versions 0.9.5 and above. They provided
315 some coarse control over the behaviour of the sorting algorithm in
316 earlier versions, which was sometimes useful. 0.9.5 and above have an
317 improved algorithm which renders these flags irrelevant.
321 @unnumberedsubsubsec MEMORY MANAGEMENT
323 @code{bzip2} compresses large files in blocks. The block size affects
324 both the compression ratio achieved, and the amount of memory needed for
325 compression and decompression. The flags @code{-1} through @code{-9}
326 specify the block size to be 100,000 bytes through 900,000 bytes (the
327 default) respectively. At decompression time, the block size used for
328 compression is read from the header of the compressed file, and
329 @code{bunzip2} then allocates itself just enough memory to decompress
330 the file. Since block sizes are stored in compressed files, it follows
331 that the flags @code{-1} to @code{-9} are irrelevant to and so ignored
332 during decompression.
Compression and decompression requirements, in bytes, can be estimated
as:
@example
Compression:   400k + ( 8 x block size )

Decompression: 100k + ( 4 x block size ), or
               100k + ( 2.5 x block size )
@end example
342 Larger block sizes give rapidly diminishing marginal returns. Most of
343 the compression comes from the first two or three hundred k of block
344 size, a fact worth bearing in mind when using @code{bzip2} on small machines.
345 It is also important to appreciate that the decompression memory
346 requirement is set at compression time by the choice of block size.
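
The estimates above translate directly into integer arithmetic. The
following C fragment is a minimal sketch, not code taken from
@code{bzip2}; the names @code{bz_mem_estimate} and
@code{estimate_memory} are invented for illustration, and every result
is expressed in the same k units as the formulas.

```c
#include <assert.h>

/* Estimated memory use, in k, for one block-size flag (-1 .. -9). */
struct bz_mem_estimate {
    int compress_k;          /* 400k + 8   x block size */
    int decompress_k;        /* 100k + 4   x block size */
    int decompress_small_k;  /* 100k + 2.5 x block size (the -s mode) */
};

struct bz_mem_estimate estimate_memory(int level)
{
    struct bz_mem_estimate e;
    int block_k;

    assert(level >= 1 && level <= 9);
    block_k = level * 100;                 /* -1 is 100k ... -9 is 900k */
    e.compress_k = 400 + 8 * block_k;
    e.decompress_k = 100 + 4 * block_k;
    e.decompress_small_k = 100 + (5 * block_k) / 2;
    return e;
}
```

For example, @code{estimate_memory(9)} yields 7600, 3700 and 2350, the
figures quoted for the default block size.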
348 For files compressed with the default 900k block size, @code{bunzip2}
349 will require about 3700 kbytes to decompress. To support decompression
350 of any file on a 4 megabyte machine, @code{bunzip2} has an option to
351 decompress using approximately half this amount of memory, about 2300
352 kbytes. Decompression speed is also halved, so you should use this
353 option only where necessary. The relevant flag is @code{-s}.
In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.
359 Another significant point applies to files which fit in a single block
360 -- that means most files you'd encounter using a large block size. The
361 amount of real memory touched is proportional to the size of the file,
362 since the file is smaller than a block. For example, compressing a file
363 20,000 bytes long with the flag @code{-9} will cause the compressor to
364 allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
365 kbytes of it. Similarly, the decompressor will allocate 3700k but only
366 touch 100k + 20000 * 4 = 180 kbytes.
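
The worked example above can be reproduced with a few lines of C. This
is only a sketch under the assumption that k means 1000 bytes, as the
arithmetic in the example implies; the function names are hypothetical.

```c
#include <assert.h>

/* Block size in bytes for flags -1 .. -9 (100,000 bytes per step). */
static unsigned long block_bytes(int level)
{
    assert(level >= 1 && level <= 9);
    return 100000UL * (unsigned long)level;
}

/* Portion of a block actually occupied by a file of the given size. */
static unsigned long occupied_bytes(unsigned long file_bytes, int level)
{
    unsigned long block = block_bytes(level);
    return file_bytes < block ? file_bytes : block;
}

/* Memory touched while compressing: 400k + 8 bytes per occupied byte. */
unsigned long touched_compress_bytes(unsigned long file_bytes, int level)
{
    return 400000UL + 8UL * occupied_bytes(file_bytes, level);
}

/* Memory touched while decompressing: 100k + 4 bytes per occupied byte. */
unsigned long touched_decompress_bytes(unsigned long file_bytes, int level)
{
    return 100000UL + 4UL * occupied_bytes(file_bytes, level);
}
```

With a 20,000 byte file and @code{-9}, the two functions return 560000
and 180000 bytes respectively, matching the figures in the text.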
368 Here is a table which summarises the maximum memory usage for different
369 block sizes. Also recorded is the total compressed size for 14 files of
370 the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
371 column gives some feel for how compression varies with block size.
372 These figures tend to understate the advantage of larger block sizes for
373 larger files, since the Corpus is dominated by smaller files.
@example
           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642
@end example
389 @unnumberedsubsubsec RECOVERING DATA FROM DAMAGED FILES
391 @code{bzip2} compresses files in blocks, usually 900kbytes long. Each
392 block is handled independently. If a media or transmission error causes
393 a multi-block @code{.bz2} file to become damaged, it may be possible to
394 recover data from the undamaged blocks in the file.
396 The compressed representation of each block is delimited by a 48-bit
397 pattern, which makes it possible to find the block boundaries with
398 reasonable certainty. Each block also carries its own 32-bit CRC, so
399 damaged blocks can be distinguished from undamaged ones.
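
To give a feel for how such a delimiter search might work, here is a
small C sketch that scans a buffer bit by bit for a 48-bit pattern. It
is not the @code{bzip2recover} source. The constant
@code{BLOCK_MAGIC} (0x314159265359) is not stated in this manual; it is
the block header value as I understand it from the @code{bzip2}
sources, so treat it as an assumption. The function name
@code{find_block_magic} is hypothetical. The scan works at bit
granularity because blocks are not assumed to start on byte boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC UINT64_C(0x314159265359)  /* assumed 48-bit delimiter */
#define MASK48      UINT64_C(0xFFFFFFFFFFFF)

/* Returns the bit offset (0 = most significant bit of buf[0]) at which
   the first occurrence of BLOCK_MAGIC starts, or -1 if there is none. */
long long find_block_magic(const unsigned char *buf, size_t len)
{
    uint64_t window = 0;   /* the most recent 48 bits read */
    size_t bits_seen = 0;
    size_t i;
    int bit;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--) {
            window = ((window << 1) | ((buf[i] >> bit) & 1u)) & MASK48;
            bits_seen++;
            /* Only compare once a full 48-bit window has been read. */
            if (bits_seen >= 48 && window == BLOCK_MAGIC)
                return (long long)(bits_seen - 48);
        }
    }
    return -1;
}
```

A recovery tool would repeat this search from each match onwards, then
check the 32-bit CRC of every candidate block to weed out damaged ones.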
401 @code{bzip2recover} is a simple program whose purpose is to search for
402 blocks in @code{.bz2} files, and write each block out into its own
403 @code{.bz2} file. You can then use @code{bzip2 -t} to test the
integrity of the resulting files, and decompress those which are
undamaged.

@code{bzip2recover}
takes a single argument, the name of the damaged file, and writes a
number of files @code{rec00001file.bz2}, @code{rec00002file.bz2}, etc.,
containing the extracted blocks. The output filenames are designed so
that the use of wildcards in subsequent processing -- for example,
@code{bzip2 -dc rec*file.bz2 > recovered_data} -- processes the files in
the correct order.
415 @code{bzip2recover} should be of most use dealing with large @code{.bz2}
416 files, as these will contain many blocks. It is clearly futile to use
417 it on damaged single-block files, since a damaged block cannot be
418 recovered. If you wish to minimise any potential data loss through
media or transmission errors, you might consider compressing with a
smaller block size.
423 @unnumberedsubsubsec PERFORMANCE NOTES
425 The sorting phase of compression gathers together similar strings in the
426 file. Because of this, files containing very long runs of repeated
427 symbols, like "aabaabaabaab ..." (repeated several hundred times) may
428 compress more slowly than normal. Versions 0.9.5 and above fare much
429 better than previous versions in this respect. The ratio between
430 worst-case and average-case compression time is in the region of 10:1.
431 For previous versions, this figure was more like 100:1. You can use the
432 @code{-vvvv} option to monitor progress in great detail, if you want.
434 Decompression speed is unaffected by these phenomena.
436 @code{bzip2} usually allocates several megabytes of memory to operate
437 in, and then charges all over it in a fairly random fashion. This means
438 that performance, both for compressing and decompressing, is largely
439 determined by the speed at which your machine can service cache misses.
440 Because of this, small changes to the code to reduce the miss rate have
441 been observed to give disproportionately large performance improvements.
I imagine @code{bzip2} will perform best on machines with very large
caches.
446 @unnumberedsubsubsec CAVEATS
448 I/O error messages are not as helpful as they could be. @code{bzip2}
449 tries hard to detect I/O errors and exit cleanly, but the details of
450 what the problem is sometimes seem rather misleading.
452 This manual page pertains to version 1.0.2 of @code{bzip2}. Compressed
453 data created by this version is entirely forwards and backwards
454 compatible with the previous public releases, versions 0.1pl2, 0.9.0,
455 0.9.5, 1.0.0 and 1.0.1, but with the following exception: 0.9.0 and
456 above can correctly decompress multiple concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
@code{bzip2recover} versions prior to this one, 1.0.2, used 32-bit
integers to represent bit positions in compressed files, so they could not
handle compressed files more than 512 megabytes long. Versions 1.0.2 and
above use 64-bit ints on some platforms which support them (GNU
supported targets, and Windows). To establish whether or not
@code{bzip2recover} was built with such a limitation, run it without
arguments. In any event you can build yourself an unlimited version if
you can recompile it with @code{MaybeUInt64} set to be an unsigned
64-bit integer.
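
The 512 megabyte figure follows from the width of the bit-position
counter: a 32-bit counter can address 2^32 bit positions, and there are
8 bits per byte. A tiny C sketch of that arithmetic, with a
hypothetical function name, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Largest file size, in bytes, whose bit positions all fit in an
   unsigned counter that is `counter_bits` bits wide. */
uint64_t max_bytes_for_bit_counter(unsigned counter_bits)
{
    assert(counter_bits >= 3 && counter_bits < 64);
    return (UINT64_C(1) << counter_bits) / 8;
}
```

Calling it with 32 gives 536870912 bytes, that is 512 x 1024 x 1024.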
472 @unnumberedsubsubsec AUTHOR
473 Julian Seward, @code{jseward@@acm.org}.
475 @code{http://sources.redhat.com/bzip2}
477 The ideas embodied in @code{bzip2} are due to (at least) the following
478 people: Michael Burrows and David Wheeler (for the block sorting
479 transformation), David Wheeler (again, for the Huffman coder), Peter
480 Fenwick (for the structured coding model in the original @code{bzip},
481 and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
482 (for the arithmetic coder in the original @code{bzip}). I am much
483 indebted for their help, support and advice. See the manual in the
484 source distribution for pointers to sources of documentation. Christian
485 von Roques encouraged me to look for faster sorting algorithms, so as to
486 speed up compression. Bela Lubkin encouraged me to improve the
487 worst-case compression performance. The @code{bz*} scripts are derived
488 from those of GNU @code{gzip}. Many people sent patches, helped with
portability problems, lent machines, gave advice and were generally
helpful.
497 @chapter Programming with @code{libbzip2}
499 This chapter describes the programming interface to @code{libbzip2}.
501 For general background information, particularly about memory
use and performance aspects, you'd be well advised to read Chapter 2
as well.
505 @section Top-level structure
507 @code{libbzip2} is a flexible library for compressing and decompressing
508 data in the @code{bzip2} data format. Although packaged as a single
entity, it helps to regard the library as three separate parts: the
low-level interface, the high-level interface, and some utility
functions.
513 The structure of @code{libbzip2}'s interfaces is similar to
that of Jean-loup Gailly's and Mark Adler's excellent @code{zlib}
library.
517 All externally visible symbols have names beginning @code{BZ2_}.
518 This is new in version 1.0. The intention is to minimise pollution
519 of the namespaces of library clients.
521 @subsection Low-level summary
523 This interface provides services for compressing and decompressing
524 data in memory. There's no provision for dealing with files, streams
525 or any other I/O mechanisms, just straight memory-to-memory work.
526 In fact, this part of the library can be compiled without inclusion
527 of @code{stdio.h}, which may be helpful for embedded applications.
529 The low-level part of the library has no global variables and
530 is therefore thread-safe.
Six routines make up the low-level interface:
@code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, and @* @code{BZ2_bzCompressEnd}
are for compression,
and a corresponding trio @code{BZ2_bzDecompressInit}, @* @code{BZ2_bzDecompress}
and @code{BZ2_bzDecompressEnd} for decompression.
537 The @code{*Init} functions allocate
538 memory for compression/decompression and do other
initialisations, whilst the @code{*End} functions close down operations
and release memory.
542 The real work is done by @code{BZ2_bzCompress} and @code{BZ2_bzDecompress}.
543 These compress and decompress data from a user-supplied input buffer
544 to a user-supplied output buffer. These buffers can be any size;
545 arbitrary quantities of data are handled by making repeated calls
546 to these functions. This is a flexible mechanism allowing a
consumer-pull style of activity, or producer-push, or a mixture of
both.
552 @subsection High-level summary
554 This interface provides some handy wrappers around the low-level
555 interface to facilitate reading and writing @code{bzip2} format
556 files (@code{.bz2} files). The routines provide hooks to facilitate
557 reading files in which the @code{bzip2} data stream is embedded
558 within some larger-scale file structure, or where there are
559 multiple @code{bzip2} data streams concatenated end-to-end.
561 For reading files, @code{BZ2_bzReadOpen}, @code{BZ2_bzRead},
562 @code{BZ2_bzReadClose} and @* @code{BZ2_bzReadGetUnused} are supplied. For
writing files, @code{BZ2_bzWriteOpen}, @code{BZ2_bzWrite} and
@code{BZ2_bzWriteClose} are available.
566 As with the low-level library, no global variables are used
567 so the library is per se thread-safe. However, if I/O errors
568 occur whilst reading or writing the underlying compressed files,
569 you may have to consult @code{errno} to determine the cause of
570 the error. In that case, you'd need a C library which correctly
571 supports @code{errno} in a multithreaded environment.
573 To make the library a little simpler and more portable,
574 @code{BZ2_bzReadOpen} and @code{BZ2_bzWriteOpen} require you to pass them file
575 handles (@code{FILE*}s) which have previously been opened for reading or
576 writing respectively. That avoids portability problems associated with
577 file operations and file attributes, whilst not being much of an
578 imposition on the programmer.
582 @subsection Utility functions summary
583 For very simple needs, @code{BZ2_bzBuffToBuffCompress} and
584 @code{BZ2_bzBuffToBuffDecompress} are provided. These compress
585 data in memory from one buffer to another buffer in a single
586 function call. You should assess whether these functions
587 fulfill your memory-to-memory compression/decompression
588 requirements before investing effort in understanding the more
589 general but more complex low-level interface.
591 Yoshioka Tsuneo (@code{QWF00133@@niftyserve.or.jp} /
592 @code{tsuneo-y@@is.aist-nara.ac.jp}) has contributed some functions to
593 give better @code{zlib} compatibility. These functions are
594 @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
596 @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}. You may find these functions
597 more convenient for simple file reading and writing, than those in the
598 high-level interface. These functions are not (yet) officially part of
599 the library, and are minimally documented here. If they break, you
get to keep all the pieces. I hope to document them properly when time
permits.
603 Yoshioka also contributed modifications to allow the library to be
604 built as a Windows DLL.
607 @section Error handling
609 The library is designed to recover cleanly in all situations, including
610 the worst-case situation of decompressing random data. I'm not
611 100% sure that it can always do this, so you might want to add
612 a signal handler to catch segmentation violations during decompression
613 if you are feeling especially paranoid. I would be interested in
hearing more about the robustness of the library to corrupted
compressed data.
617 Version 1.0 is much more robust in this respect than
618 0.9.0 or 0.9.5. Investigations with Checker (a tool for
619 detecting problems with memory management, similar to Purify)
620 indicate that, at least for the few files I tested, all single-bit
621 errors in the decompressed data are caught properly, with no
622 segmentation faults, no reads of uninitialised data and no
623 out of range reads or writes. So it's certainly much improved,
624 although I wouldn't claim it to be totally bombproof.
626 The file @code{bzlib.h} contains all definitions needed to use
627 the library. In particular, you should definitely not include
628 @code{bzlib_private.h}.
630 In @code{bzlib.h}, the various return values are defined. The following
631 list is not intended as an exhaustive description of the circumstances
632 in which a given value may be returned -- those descriptions are given
later. Rather, it is intended to convey the rough meaning of each
return value. The first five values are normal and not intended to
denote an error situation.
638 The requested action was completed successfully.
642 In @code{BZ2_bzCompress}, the requested flush/finish/nothing-special action
643 was completed successfully.
645 Compression of data was completed, or the logical stream end was
646 detected during decompression.
649 The following return values indicate an error of some kind.
651 @item BZ_CONFIG_ERROR
652 Indicates that the library has been improperly compiled on your
653 platform -- a major configuration error. Specifically, it means
654 that @code{sizeof(char)}, @code{sizeof(short)} and @code{sizeof(int)}
655 are not 1, 2 and 4 respectively, as they should be. Note that the
656 library should still work properly on 64-bit platforms which follow
657 the LP64 programming model -- that is, where @code{sizeof(long)}
658 and @code{sizeof(void*)} are 8. Under LP64, @code{sizeof(int)} is
still 4, so @code{libbzip2}, which doesn't use the @code{long} type,
is OK.
661 @item BZ_SEQUENCE_ERROR
662 When using the library, it is important to call the functions in the
663 correct sequence and with data structures (buffers etc) in the correct
664 states. @code{libbzip2} checks as much as it can to ensure this is
665 happening, and returns @code{BZ_SEQUENCE_ERROR} if not. Code which
666 complies precisely with the function semantics, as detailed below,
667 should never receive this value; such an event denotes buggy code
668 which you should investigate.
670 Returned when a parameter to a function call is out of range
671 or otherwise manifestly incorrect. As with @code{BZ_SEQUENCE_ERROR},
672 this denotes a bug in the client code. The distinction between
@code{BZ_PARAM_ERROR} and @code{BZ_SEQUENCE_ERROR} is a bit hazy, but still worth
making.
676 Returned when a request to allocate memory failed. Note that the
677 quantity of memory needed to decompress a stream cannot be determined
678 until the stream's header has been read. So @code{BZ2_bzDecompress} and
679 @code{BZ2_bzRead} may return @code{BZ_MEM_ERROR} even though some of
680 the compressed data has been read. The same is not true for
681 compression; once @code{BZ2_bzCompressInit} or @code{BZ2_bzWriteOpen} have
682 successfully completed, @code{BZ_MEM_ERROR} cannot occur.
684 Returned when a data integrity error is detected during decompression.
685 Most importantly, this means when stored and computed CRCs for the
686 data do not match. This value is also returned upon detection of any
687 other anomaly in the compressed data.
688 @item BZ_DATA_ERROR_MAGIC
689 As a special case of @code{BZ_DATA_ERROR}, it is sometimes useful to
690 know when the compressed stream does not start with the correct
691 magic bytes (@code{'B' 'Z' 'h'}).
693 Returned by @code{BZ2_bzRead} and @code{BZ2_bzWrite} when there is an error
694 reading or writing in the compressed file, and by @code{BZ2_bzReadOpen}
695 and @code{BZ2_bzWriteOpen} for attempts to use a file for which the
696 error indicator (viz, @code{ferror(f)}) is set.
697 On receipt of @code{BZ_IO_ERROR}, the caller should consult
698 @code{errno} and/or @code{perror} to acquire operating-system
699 specific information about the problem.
700 @item BZ_UNEXPECTED_EOF
701 Returned by @code{BZ2_bzRead} when the compressed file finishes
702 before the logical end of stream is detected.
703 @item BZ_OUTBUFF_FULL
704 Returned by @code{BZ2_bzBuffToBuffCompress} and
705 @code{BZ2_bzBuffToBuffDecompress} to indicate that the output data
706 will not fit into the output buffer provided.
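
One way a client might act on these return values is to sort them into
a few broad classes first. The C sketch below does that. It is not part
of @code{libbzip2}: the @code{rc_*} names are hypothetical, and the
numeric values are those I believe @code{bzlib.h} defines. Real code
should include @code{bzlib.h} rather than rely on the fallback
definitions shown here.

```c
#include <assert.h>

/* Fallback definitions, assumed to mirror bzlib.h; skipped when the
   real header has already been included. */
#ifndef BZ_OK
#define BZ_OK                 0
#define BZ_RUN_OK             1
#define BZ_FLUSH_OK           2
#define BZ_FINISH_OK          3
#define BZ_STREAM_END         4
#define BZ_SEQUENCE_ERROR    (-1)
#define BZ_PARAM_ERROR       (-2)
#define BZ_MEM_ERROR         (-3)
#define BZ_DATA_ERROR        (-4)
#define BZ_DATA_ERROR_MAGIC  (-5)
#define BZ_IO_ERROR          (-6)
#define BZ_UNEXPECTED_EOF    (-7)
#define BZ_OUTBUFF_FULL      (-8)
#define BZ_CONFIG_ERROR      (-9)
#endif

enum rc_class {
    RC_NORMAL, RC_CLIENT_BUG, RC_NO_MEMORY, RC_BAD_DATA,
    RC_IO, RC_BUFFER_TOO_SMALL, RC_BAD_BUILD, RC_UNKNOWN
};

/* Maps a libbzip2 return value to a coarse class. */
enum rc_class rc_classify(int rc)
{
    switch (rc) {
    case BZ_OK: case BZ_RUN_OK: case BZ_FLUSH_OK:
    case BZ_FINISH_OK: case BZ_STREAM_END:
        return RC_NORMAL;
    case BZ_SEQUENCE_ERROR: case BZ_PARAM_ERROR:
        return RC_CLIENT_BUG;          /* bug in the calling code */
    case BZ_MEM_ERROR:
        return RC_NO_MEMORY;
    case BZ_DATA_ERROR: case BZ_DATA_ERROR_MAGIC: case BZ_UNEXPECTED_EOF:
        return RC_BAD_DATA;            /* corrupt or truncated input */
    case BZ_IO_ERROR:
        return RC_IO;                  /* consult errno for details */
    case BZ_OUTBUFF_FULL:
        return RC_BUFFER_TOO_SMALL;
    case BZ_CONFIG_ERROR:
        return RC_BAD_BUILD;
    default:
        return RC_UNKNOWN;
    }
}

/* Debug-build guard: trips if rc signals a bug in the calling code. */
void rc_assert_no_client_bug(int rc)
{
    assert(rc_classify(rc) != RC_CLIENT_BUG);
}
```

Grouping the values this way keeps the calling code short: one branch
for retrying with a bigger buffer, one for reporting bad input, and so
on.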
711 @section Low-level interface
713 @subsection @code{BZ2_bzCompressInit}
@example
typedef struct @{
   char *next_in;
   unsigned int avail_in;
   unsigned int total_in_lo32;
   unsigned int total_in_hi32;
   char *next_out;
   unsigned int avail_out;
   unsigned int total_out_lo32;
   unsigned int total_out_hi32;
   void *state;
   void *(*bzalloc)(void *,int,int);
   void (*bzfree)(void *,void *);
   void *opaque;
@} bz_stream;
int BZ2_bzCompressInit ( bz_stream *strm, int blockSize100k,
                         int verbosity, int workFactor );
@end example
742 Prepares for compression. The @code{bz_stream} structure
743 holds all data pertaining to the compression activity.
A @code{bz_stream} structure should be allocated and initialised
prior to the call.
746 The fields of @code{bz_stream}
747 comprise the entirety of the user-visible data. @code{state}
748 is a pointer to the private data structures required for compression.
Custom memory allocators are supported, via fields @code{bzalloc},
@code{bzfree},
and @code{opaque}. The value
@code{opaque} is passed as the first argument to
all calls to @code{bzalloc} and @code{bzfree}, but is
otherwise ignored by the library.
The call @code{bzalloc ( opaque, n, m )} is expected to return a
pointer @code{p} to
@code{n * m} bytes of memory, and @code{bzfree ( opaque, p )}
should free
that memory.
762 If you don't want to use a custom memory allocator, set @code{bzalloc},
764 @code{opaque} to @code{NULL},
765 and the library will then use the standard @code{malloc}/@code{free}
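
As an illustration of the allocator hooks, here is a minimal sketch of a
pair of callbacks that count allocations. The names @code{alloc_stats},
@code{counting_alloc} and @code{counting_free} are hypothetical and not
part of the library; only the two function signatures come from the
@code{bz_stream} fields.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping record, reached through the opaque pointer. */
typedef struct {
    int  live_blocks;   /* blocks allocated and not yet freed */
    long total_bytes;   /* total bytes ever requested */
} alloc_stats;

/* Matches the bzalloc signature: return a pointer to n * m bytes. */
static void *counting_alloc(void *opaque, int n, int m)
{
    alloc_stats *st = (alloc_stats *)opaque;
    void *p;

    if (n < 0 || m < 0) return NULL;
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL) {
        st->live_blocks += 1;
        st->total_bytes += (long)n * (long)m;
    }
    return p;
}

/* Matches the bzfree signature: release a block obtained above. */
static void counting_free(void *opaque, void *p)
{
    alloc_stats *st = (alloc_stats *)opaque;

    if (p != NULL) {
        st->live_blocks -= 1;
        free(p);
    }
}

/* Wiring it up, before the init call:
 *     strm.bzalloc = counting_alloc;
 *     strm.bzfree  = counting_free;
 *     strm.opaque  = &stats;
 */
```

If @code{live_blocks} is nonzero after the stream has been ended, some
memory was presumably not released.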
768 Before calling @code{BZ2_bzCompressInit}, fields @code{bzalloc},
769 @code{bzfree} and @code{opaque} should
770 be filled appropriately, as just described. Upon return, the internal
771 state will have been allocated and initialised, and @code{total_in_lo32},
772 @code{total_in_hi32}, @code{total_out_lo32} and
773 @code{total_out_hi32} will have been set to zero.
774 These four fields are used by the library
775 to inform the caller of the total amount of data passed into and out of
776 the library, respectively. You should not try to change them.
777 As of version 1.0, 64-bit counts are maintained, even on 32-bit
778 platforms, using the @code{_hi32} fields to store the upper 32 bits
of the count. So, for example, the total amount of data in
is @code{(total_in_hi32 << 32) + total_in_lo32}, evaluated in a
64-bit integer type.
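
The following sketch shows one way to combine the two halves in C. The
helper name @code{bz_count64} is an assumption made for illustration. The
cast matters because shifting a 32-bit @code{unsigned int} by 32 bits is
undefined behaviour in C.

```c
#include <stdint.h>

/* Combine the two 32-bit halves of a libbzip2 byte counter.
   The cast to a 64-bit type must happen before the shift. */
static uint64_t bz_count64(unsigned int hi32, unsigned int lo32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

For example, @code{bz_count64 ( strm.total_in_hi32, strm.total_in_lo32 )}
gives the number of input bytes consumed so far.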
782 Parameter @code{blockSize100k} specifies the block size to be used for
783 compression. It should be a value between 1 and 9 inclusive, and the
784 actual block size used is 100000 x this figure. 9 gives the best
785 compression but takes most memory.
787 Parameter @code{verbosity} should be set to a number between 0 and 4
788 inclusive. 0 is silent, and greater numbers give increasingly verbose
789 monitoring/debugging output. If the library has been compiled with
@code{-DBZ_NO_STDIO}, no such output will appear for any verbosity
setting.
793 Parameter @code{workFactor} controls how the compression phase behaves
794 when presented with worst case, highly repetitive, input data. If
795 compression runs into difficulties caused by repetitive data, the
796 library switches from the standard sorting algorithm to a fallback
797 algorithm. The fallback is slower than the standard algorithm by
perhaps a factor of three, but always behaves reasonably, no matter how
bad the input.
801 Lower values of @code{workFactor} reduce the amount of effort the
standard algorithm will expend before resorting to the fallback. You
should set this parameter carefully: too low, and many inputs will be
handled by the fallback algorithm and so compress rather slowly; too
high, and your average-to-worst case compression times can become very
large. The default value of 30 gives reasonable behaviour over a wide
807 range of circumstances.
809 Allowable values range from 0 to 250 inclusive. 0 is a special case,
810 equivalent to using the default value of 30.
812 Note that the compressed output generated is the same regardless of
813 whether or not the fallback algorithm is used.
815 Be aware also that this parameter may disappear entirely in future
816 versions of the library. In principle it should be possible to devise a
817 good way to automatically choose which algorithm to use. Such a
818 mechanism would render the parameter obsolete.
820 Possible return values:
822 @code{BZ_CONFIG_ERROR}
823 if the library has been mis-compiled
824 @code{BZ_PARAM_ERROR}
825 if @code{strm} is @code{NULL}
or @code{blockSize100k} < 1 or @code{blockSize100k} > 9
or @code{verbosity} < 0 or @code{verbosity} > 4
or @code{workFactor} < 0 or @code{workFactor} > 250
@code{BZ_MEM_ERROR}
if not enough memory is available
@code{BZ_OK}
otherwise
834 Allowable next actions:
836 @code{BZ2_bzCompress}
837 if @code{BZ_OK} is returned
838 no specific action needed in case of error
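
The documented parameter ranges can be checked before the call. The
sketch below restates them as plain C. All three helper names are
hypothetical, and the checks mirror this manual's description rather
than the library's source.

```c
/* Returns 1 when the three tuning parameters lie in the documented
   ranges, 0 when the init call would be expected to reject them. */
static int compress_params_ok(int blockSize100k, int verbosity, int workFactor)
{
    if (blockSize100k < 1 || blockSize100k > 9) return 0;
    if (verbosity < 0 || verbosity > 4) return 0;
    if (workFactor < 0 || workFactor > 250) return 0;
    return 1;
}

/* Work factor that takes effect: 0 selects the default of 30. */
static int effective_work_factor(int workFactor)
{
    return workFactor == 0 ? 30 : workFactor;
}

/* Nominal block size in bytes for a given blockSize100k. */
static long block_size_bytes(int blockSize100k)
{
    return 100000L * blockSize100k;
}
```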
841 @subsection @code{BZ2_bzCompress}
843 int BZ2_bzCompress ( bz_stream *strm, int action );
845 Provides more input and/or output buffer space for the library. The
846 caller maintains input and output buffers, and calls @code{BZ2_bzCompress} to
847 transfer data between them.
849 Before each call to @code{BZ2_bzCompress}, @code{next_in} should point at
850 the data to be compressed, and @code{avail_in} should indicate how many
bytes the library may read. @code{BZ2_bzCompress} updates @code{next_in},
@code{avail_in} and @code{total_in_lo32}/@code{total_in_hi32} to reflect
the number of bytes it has read.
855 Similarly, @code{next_out} should point to a buffer in which the
856 compressed data is to be placed, with @code{avail_out} indicating how
857 much output space is available. @code{BZ2_bzCompress} updates
@code{next_out}, @code{avail_out} and @code{total_out_lo32}/@code{total_out_hi32} to reflect the
859 number of bytes output.
861 You may provide and remove as little or as much data as you like on each
862 call of @code{BZ2_bzCompress}. In the limit, it is acceptable to supply and
863 remove data one byte at a time, although this would be terribly
864 inefficient. You should always ensure that at least one byte of output
865 space is available at each call.
A second purpose of @code{BZ2_bzCompress} is to request a change of mode of the
compressed stream.
870 Conceptually, a compressed stream can be in one of four states: IDLE,
871 RUNNING, FLUSHING and FINISHING. Before initialisation
872 (@code{BZ2_bzCompressInit}) and after termination (@code{BZ2_bzCompressEnd}), a
873 stream is regarded as IDLE.
875 Upon initialisation (@code{BZ2_bzCompressInit}), the stream is placed in the
876 RUNNING state. Subsequent calls to @code{BZ2_bzCompress} should pass
877 @code{BZ_RUN} as the requested action; other actions are illegal and
878 will result in @code{BZ_SEQUENCE_ERROR}.
880 At some point, the calling program will have provided all the input data
881 it wants to. It will then want to finish up -- in effect, asking the
882 library to process any data it might have buffered internally. In this
883 state, @code{BZ2_bzCompress} will no longer attempt to read data from
884 @code{next_in}, but it will want to write data to @code{next_out}.
885 Because the output buffer supplied by the user can be arbitrarily small,
886 the finishing-up operation cannot necessarily be done with a single call
887 of @code{BZ2_bzCompress}.
889 Instead, the calling program passes @code{BZ_FINISH} as an action to
890 @code{BZ2_bzCompress}. This changes the stream's state to FINISHING. Any
891 remaining input (ie, @code{next_in[0 .. avail_in-1]}) is compressed and
892 transferred to the output buffer. To do this, @code{BZ2_bzCompress} must be
893 called repeatedly until all the output has been consumed. At that
894 point, @code{BZ2_bzCompress} returns @code{BZ_STREAM_END}, and the stream's
state is set back to IDLE. @code{BZ2_bzCompressEnd} should then be
called.
898 Just to make sure the calling program does not cheat, the library makes
899 a note of @code{avail_in} at the time of the first call to
900 @code{BZ2_bzCompress} which has @code{BZ_FINISH} as an action (ie, at the
901 time the program has announced its intention to not supply any more
902 input). By comparing this value with that of @code{avail_in} over
903 subsequent calls to @code{BZ2_bzCompress}, the library can detect any
904 attempts to slip in more data to compress. Any calls for which this is
905 detected will return @code{BZ_SEQUENCE_ERROR}. This indicates a
906 programming mistake which should be corrected.
908 Instead of asking to finish, the calling program may ask
909 @code{BZ2_bzCompress} to take all the remaining input, compress it and
910 terminate the current (Burrows-Wheeler) compression block. This could
911 be useful for error control purposes. The mechanism is analogous to
912 that for finishing: call @code{BZ2_bzCompress} with an action of
913 @code{BZ_FLUSH}, remove output data, and persist with the
@code{BZ_FLUSH} action until the value @code{BZ_RUN_OK} is returned. As
915 with finishing, @code{BZ2_bzCompress} detects any attempt to provide more
916 input data once the flush has begun.
Once the flush is complete, the stream returns to the normal RUNNING
state.
921 This all sounds pretty complex, but isn't really. Here's a table
922 which shows which actions are allowable in each state, what action
923 will be taken, what the next state is, and what the non-error return
924 values are. Note that you can't explicitly ask what state the
925 stream is in, but nor do you need to -- it can be inferred from the
926 values returned by @code{BZ2_bzCompress}.
@display
IDLE/@code{any}
      Illegal.  IDLE state only exists after @code{BZ2_bzCompressEnd} or
      before @code{BZ2_bzCompressInit}.
      Return value = @code{BZ_SEQUENCE_ERROR}

RUNNING/@code{BZ_RUN}
      Compress from @code{next_in} to @code{next_out} as much as possible.
      Next state = RUNNING
      Return value = @code{BZ_RUN_OK}

RUNNING/@code{BZ_FLUSH}
      Remember current value of @code{avail_in}.  Compress from @code{next_in}
      to @code{next_out} as much as possible, but do not accept any more input.
      Next state = FLUSHING
      Return value = @code{BZ_FLUSH_OK}

RUNNING/@code{BZ_FINISH}
      Remember current value of @code{avail_in}.  Compress from @code{next_in}
      to @code{next_out} as much as possible, but do not accept any more input.
      Next state = FINISHING
      Return value = @code{BZ_FINISH_OK}

FLUSHING/@code{BZ_FLUSH}
      Compress from @code{next_in} to @code{next_out} as much as possible,
      but do not accept any more input.
      If all the existing input has been used up and all compressed
      output has been removed
         Next state = RUNNING; Return value = @code{BZ_RUN_OK}
      else
         Next state = FLUSHING; Return value = @code{BZ_FLUSH_OK}

FLUSHING/other
      Illegal.
      Return value = @code{BZ_SEQUENCE_ERROR}

FINISHING/@code{BZ_FINISH}
      Compress from @code{next_in} to @code{next_out} as much as possible,
      but do not accept any more input.
      If all the existing input has been used up and all compressed
      output has been removed
         Next state = IDLE; Return value = @code{BZ_STREAM_END}
      else
         Next state = FINISHING; Return value = @code{BZ_FINISH_OK}

FINISHING/other
      Illegal.
      Return value = @code{BZ_SEQUENCE_ERROR}
@end display
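
The table above can be restated as a small transition function. This is
only a model for reasoning about call sequences, not the library's code:
every type and function name in it is hypothetical, and it assumes that
the non-final return value in the FINISHING state is @code{BZ_FINISH_OK}.
It follows the table literally, one row per call.

```c
/* Stream states, caller actions and results, named after the table. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } model_state;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } model_action;
typedef enum { RES_RUN_OK, RES_FLUSH_OK, RES_FINISH_OK,
               RES_STREAM_END, RES_SEQUENCE_ERROR } model_result;

/* Apply one action.  `drained` says whether all pending input has been
   used up and all compressed output removed; it only matters in the
   FLUSHING and FINISHING states.  *st is left unchanged on error. */
static model_result model_step(model_state *st, model_action act, int drained)
{
    switch (*st) {
    case ST_RUNNING:
        if (act == ACT_RUN)   { return RES_RUN_OK; }
        if (act == ACT_FLUSH) { *st = ST_FLUSHING; return RES_FLUSH_OK; }
        *st = ST_FINISHING;
        return RES_FINISH_OK;
    case ST_FLUSHING:
        if (act != ACT_FLUSH) return RES_SEQUENCE_ERROR;
        if (drained) { *st = ST_RUNNING; return RES_RUN_OK; }
        return RES_FLUSH_OK;
    case ST_FINISHING:
        if (act != ACT_FINISH) return RES_SEQUENCE_ERROR;
        if (drained) { *st = ST_IDLE; return RES_STREAM_END; }
        return RES_FINISH_OK;
    case ST_IDLE:
    default:
        return RES_SEQUENCE_ERROR;   /* IDLE accepts no action */
    }
}
```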
977 That still looks complicated? Well, fair enough. The usual sequence
978 of calls for compressing a load of data is:
980 @item Get started with @code{BZ2_bzCompressInit}.
981 @item Shovel data in and shlurp out its compressed form using zero or more
982 calls of @code{BZ2_bzCompress} with action = @code{BZ_RUN}.
@item Finish up.
Repeatedly call @code{BZ2_bzCompress} with action = @code{BZ_FINISH},
985 copying out the compressed output, until @code{BZ_STREAM_END} is returned.
986 @item Close up and go home. Call @code{BZ2_bzCompressEnd}.
988 If the data you want to compress fits into your input buffer all
989 at once, you can skip the calls of @code{BZ2_bzCompress ( ..., BZ_RUN )} and
990 just do the @code{BZ2_bzCompress ( ..., BZ_FINISH )} calls.
992 All required memory is allocated by @code{BZ2_bzCompressInit}. The
993 compression library can accept any data at all (obviously). So you
994 shouldn't get any error return values from the @code{BZ2_bzCompress} calls.
If you do, they will be @code{BZ_SEQUENCE_ERROR}, and indicate a bug in
your programming.
998 Trivial other possible return values:
1000 @code{BZ_PARAM_ERROR}
if @code{strm} is @code{NULL}, or @code{strm->state} is @code{NULL}
1004 @subsection @code{BZ2_bzCompressEnd}
1006 int BZ2_bzCompressEnd ( bz_stream *strm );
1008 Releases all memory associated with a compression stream.
1010 Possible return values:
@code{BZ_PARAM_ERROR} if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
1013 @code{BZ_OK} otherwise
1017 @subsection @code{BZ2_bzDecompressInit}
1019 int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
1021 Prepares for decompression. As with @code{BZ2_bzCompressInit}, a
1022 @code{bz_stream} record should be allocated and initialised before the
1023 call. Fields @code{bzalloc}, @code{bzfree} and @code{opaque} should be
1024 set if a custom memory allocator is required, or made @code{NULL} for
1025 the normal @code{malloc}/@code{free} routines. Upon return, the internal
state will have been initialised, and @code{total_in_lo32},
@code{total_in_hi32}, @code{total_out_lo32} and @code{total_out_hi32}
will be zero.
1029 For the meaning of parameter @code{verbosity}, see @code{BZ2_bzCompressInit}.
1031 If @code{small} is nonzero, the library will use an alternative
1032 decompression algorithm which uses less memory but at the cost of
1033 decompressing more slowly (roughly speaking, half the speed, but the
1034 maximum memory requirement drops to around 2300k). See Chapter 2 for
1035 more information on memory management.
1037 Note that the amount of memory needed to decompress
1038 a stream cannot be determined until the stream's header has been read,
1039 so even if @code{BZ2_bzDecompressInit} succeeds, a subsequent
1040 @code{BZ2_bzDecompress} could fail with @code{BZ_MEM_ERROR}.
1042 Possible return values:
1044 @code{BZ_CONFIG_ERROR}
1045 if the library has been mis-compiled
1046 @code{BZ_PARAM_ERROR}
1047 if @code{(small != 0 && small != 1)}
1048 or @code{(verbosity < 0 || verbosity > 4)}
@code{BZ_MEM_ERROR}
if insufficient memory is available
1053 Allowable next actions:
1055 @code{BZ2_bzDecompress}
1056 if @code{BZ_OK} was returned
1057 no specific action required in case of error
1062 @subsection @code{BZ2_bzDecompress}
1064 int BZ2_bzDecompress ( bz_stream *strm );
Provides more input and/or output buffer space for the library. The
1067 caller maintains input and output buffers, and uses @code{BZ2_bzDecompress}
1068 to transfer data between them.
1070 Before each call to @code{BZ2_bzDecompress}, @code{next_in}
1071 should point at the compressed data,
1072 and @code{avail_in} should indicate how many bytes the library
may read. @code{BZ2_bzDecompress} updates @code{next_in}, @code{avail_in}
and @code{total_in_lo32}/@code{total_in_hi32}
to reflect the number of bytes it has read.
1077 Similarly, @code{next_out} should point to a buffer in which the uncompressed
1078 output is to be placed, with @code{avail_out} indicating how much output space
is available. @code{BZ2_bzDecompress} updates @code{next_out},
@code{avail_out} and @code{total_out_lo32}/@code{total_out_hi32} to reflect
1081 the number of bytes output.
1083 You may provide and remove as little or as much data as you like on
1084 each call of @code{BZ2_bzDecompress}.
1085 In the limit, it is acceptable to
1086 supply and remove data one byte at a time, although this would be
1087 terribly inefficient. You should always ensure that at least one
1088 byte of output space is available at each call.
1090 Use of @code{BZ2_bzDecompress} is simpler than @code{BZ2_bzCompress}.
1092 You should provide input and remove output as described above, and
1093 repeatedly call @code{BZ2_bzDecompress} until @code{BZ_STREAM_END} is
1094 returned. Appearance of @code{BZ_STREAM_END} denotes that
1095 @code{BZ2_bzDecompress} has detected the logical end of the compressed
1096 stream. @code{BZ2_bzDecompress} will not produce @code{BZ_STREAM_END} until
1097 all output data has been placed into the output buffer, so once
1098 @code{BZ_STREAM_END} appears, you are guaranteed to have available all
the decompressed output, and @code{BZ2_bzDecompressEnd} can safely be
called.

In case of an error return value, you should call @code{BZ2_bzDecompressEnd}
to clean up and release memory.
1105 Possible return values:
1107 @code{BZ_PARAM_ERROR}
if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
or @code{strm->avail_out < 1}
@code{BZ_DATA_ERROR}
if a data integrity error is detected in the compressed stream
@code{BZ_DATA_ERROR_MAGIC}
if the compressed stream doesn't begin with the right magic bytes
@code{BZ_MEM_ERROR}
if there wasn't enough memory available
@code{BZ_STREAM_END}
if the logical end of the data stream was detected and all
output has been consumed, eg @code{strm->avail_out > 0}
@code{BZ_OK}
otherwise
1122 Allowable next actions:
1124 @code{BZ2_bzDecompress}
1125 if @code{BZ_OK} was returned
@code{BZ2_bzDecompressEnd}
otherwise
1131 @subsection @code{BZ2_bzDecompressEnd}
1133 int BZ2_bzDecompressEnd ( bz_stream *strm );
1135 Releases all memory associated with a decompression stream.
1137 Possible return values:
1139 @code{BZ_PARAM_ERROR}
if @code{strm} is @code{NULL} or @code{strm->state} is @code{NULL}
@code{BZ_OK}
otherwise
Allowable next actions:
None.
1151 @section High-level interface
1153 This interface provides functions for reading and writing
1154 @code{bzip2} format files. First, some general points.
@item All of the functions take an @code{int*} first argument,
@code{bzerror}.
After each call, @code{bzerror} should be consulted first to determine
the outcome of the call. If @code{bzerror} is @code{BZ_OK},
the call completed
successfully, and only then should the return value of the function
(if any) be consulted. If @code{bzerror} is @code{BZ_IO_ERROR},
there was an error
reading/writing the underlying compressed file, and you should
then consult @code{errno}/@code{perror} to determine the
cause of the difficulty.
1168 @code{bzerror} may also be set to various other values; precise details are
1169 given on a per-function basis below.
1170 @item If @code{bzerror} indicates an error
1171 (ie, anything except @code{BZ_OK} and @code{BZ_STREAM_END}),
1172 you should immediately call @code{BZ2_bzReadClose} (or @code{BZ2_bzWriteClose},
1173 depending on whether you are attempting to read or to write)
1174 to free up all resources associated
1175 with the stream. Once an error has been indicated, behaviour of all calls
1176 except @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) is undefined.
1177 The implication is that (1) @code{bzerror} should
1178 be checked after each call, and (2) if @code{bzerror} indicates an error,
1179 @code{BZ2_bzReadClose} (@code{BZ2_bzWriteClose}) should then be called to clean up.
1180 @item The @code{FILE*} arguments passed to
1181 @code{BZ2_bzReadOpen}/@code{BZ2_bzWriteOpen}
1182 should be set to binary mode.
1183 Most Unix systems will do this by default, but other platforms,
1184 including Windows and Mac, will not. If you omit this, you may
1185 encounter problems when moving code to new platforms.
@item Memory allocation requests are handled by
@code{malloc}/@code{free}.
At present
there is no facility for user-defined memory allocators in the file I/O
functions (could easily be added, though).
1195 @subsection @code{BZ2_bzReadOpen}
1197 typedef void BZFILE;
1199 BZFILE *BZ2_bzReadOpen ( int *bzerror, FILE *f,
                         int verbosity, int small,
1201 void *unused, int nUnused );
1203 Prepare to read compressed data from file handle @code{f}. @code{f}
1204 should refer to a file which has been opened for reading, and for which
the error indicator (@code{ferror(f)}) is not set. If @code{small} is 1,
the library will try to decompress using less memory, at the expense of
speed.
1209 For reasons explained below, @code{BZ2_bzRead} will decompress the
1210 @code{nUnused} bytes starting at @code{unused}, before starting to read
1211 from the file @code{f}. At most @code{BZ_MAX_UNUSED} bytes may be
1212 supplied like this. If this facility is not required, you should pass
@code{NULL} and @code{0} for @code{unused} and @code{nUnused}
respectively.
1216 For the meaning of parameters @code{small} and @code{verbosity},
1217 see @code{BZ2_bzDecompressInit}.
1219 The amount of memory needed to decompress a file cannot be determined
1220 until the file's header has been read. So it is possible that
1221 @code{BZ2_bzReadOpen} returns @code{BZ_OK} but a subsequent call of
1222 @code{BZ2_bzRead} will return @code{BZ_MEM_ERROR}.
1224 Possible assignments to @code{bzerror}:
1226 @code{BZ_CONFIG_ERROR}
1227 if the library has been mis-compiled
1228 @code{BZ_PARAM_ERROR}
1229 if @code{f} is @code{NULL}
1230 or @code{small} is neither @code{0} nor @code{1}
1231 or @code{(unused == NULL && nUnused != 0)}
or @code{(unused != NULL && (nUnused < 0 || nUnused > BZ_MAX_UNUSED))}
@code{BZ_IO_ERROR}
if @code{ferror(f)} is nonzero
@code{BZ_MEM_ERROR}
if insufficient memory is available
@code{BZ_OK}
otherwise
1241 Possible return values:
1243 Pointer to an abstract @code{BZFILE}
if @code{bzerror} is @code{BZ_OK}
@code{NULL}
otherwise

Allowable next actions:
@code{BZ2_bzRead}
if @code{bzerror} is @code{BZ_OK}
@code{BZ2_bzReadClose}
otherwise
1258 @subsection @code{BZ2_bzRead}
1260 int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
Reads up to @code{len} (uncompressed) bytes from the compressed file
@code{b} into
the buffer @code{buf}. If the read was successful,
@code{bzerror} is set to @code{BZ_OK}
and the number of bytes read is returned. If the logical end-of-stream
was detected, @code{bzerror} will be set to @code{BZ_STREAM_END},
and the number
of bytes read is returned. All other @code{bzerror} values denote an error.
1271 @code{BZ2_bzRead} will supply @code{len} bytes,
1272 unless the logical stream end is detected
1273 or an error occurs. Because of this, it is possible to detect the
1274 stream end by observing when the number of bytes returned is
1275 less than the number
1276 requested. Nevertheless, this is regarded as inadvisable; you should
1277 instead check @code{bzerror} after every call and watch out for
1278 @code{BZ_STREAM_END}.
1280 Internally, @code{BZ2_bzRead} copies data from the compressed file in chunks
1281 of size @code{BZ_MAX_UNUSED} bytes
1282 before decompressing it. If the file contains more bytes than strictly
1283 needed to reach the logical end-of-stream, @code{BZ2_bzRead} will almost certainly
read some of the trailing data before signalling @code{BZ_STREAM_END}.
To collect the read but unused data once @code{BZ_STREAM_END} has
1286 appeared, call @code{BZ2_bzReadGetUnused} immediately before @code{BZ2_bzReadClose}.
1288 Possible assignments to @code{bzerror}:
1290 @code{BZ_PARAM_ERROR}
1291 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1292 @code{BZ_SEQUENCE_ERROR}
1293 if @code{b} was opened with @code{BZ2_bzWriteOpen}
@code{BZ_IO_ERROR}
if there is an error reading from the compressed file
@code{BZ_UNEXPECTED_EOF}
if the compressed file ended before the logical end-of-stream was detected
@code{BZ_DATA_ERROR}
if a data integrity error was detected in the compressed stream
@code{BZ_DATA_ERROR_MAGIC}
if the stream does not begin with the requisite header bytes (ie, is not
a @code{bzip2} data file). This is really a special case of @code{BZ_DATA_ERROR}.
@code{BZ_MEM_ERROR}
if insufficient memory was available
@code{BZ_STREAM_END}
if the logical end of stream was detected.
@code{BZ_OK}
otherwise.
1311 Possible return values:
1313 number of bytes read
if @code{bzerror} is @code{BZ_OK} or @code{BZ_STREAM_END}
undefined
otherwise
1319 Allowable next actions:
1321 collect data from @code{buf}, then @code{BZ2_bzRead} or @code{BZ2_bzReadClose}
1322 if @code{bzerror} is @code{BZ_OK}
1323 collect data from @code{buf}, then @code{BZ2_bzReadClose} or @code{BZ2_bzReadGetUnused}
if @code{bzerror} is @code{BZ_STREAM_END}
@code{BZ2_bzReadClose}
otherwise
1331 @subsection @code{BZ2_bzReadGetUnused}
1333 void BZ2_bzReadGetUnused ( int* bzerror, BZFILE *b,
1334 void** unused, int* nUnused );
1336 Returns data which was read from the compressed file but was not needed
1337 to get to the logical end-of-stream. @code{*unused} is set to the address
1338 of the data, and @code{*nUnused} to the number of bytes. @code{*nUnused} will
1339 be set to a value between @code{0} and @code{BZ_MAX_UNUSED} inclusive.
1341 This function may only be called once @code{BZ2_bzRead} has signalled
1342 @code{BZ_STREAM_END} but before @code{BZ2_bzReadClose}.
1344 Possible assignments to @code{bzerror}:
1346 @code{BZ_PARAM_ERROR}
1347 if @code{b} is @code{NULL}
1348 or @code{unused} is @code{NULL} or @code{nUnused} is @code{NULL}
1349 @code{BZ_SEQUENCE_ERROR}
1350 if @code{BZ_STREAM_END} has not been signalled
or if @code{b} was opened with @code{BZ2_bzWriteOpen}
@code{BZ_OK}
otherwise
1356 Allowable next actions:
1358 @code{BZ2_bzReadClose}
1362 @subsection @code{BZ2_bzReadClose}
1364 void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
1366 Releases all memory pertaining to the compressed file @code{b}.
1367 @code{BZ2_bzReadClose} does not call @code{fclose} on the underlying file
1368 handle, so you should do that yourself if appropriate.
@code{BZ2_bzReadClose} should be called to clean up after all error
situations.
1372 Possible assignments to @code{bzerror}:
1374 @code{BZ_SEQUENCE_ERROR}
if @code{b} was opened with @code{BZ2_bzWriteOpen}
@code{BZ_OK}
otherwise
Allowable next actions:
None.
1387 @subsection @code{BZ2_bzWriteOpen}
BZFILE *BZ2_bzWriteOpen ( int *bzerror, FILE *f,
                          int blockSize100k, int verbosity,
                          int workFactor );
1393 Prepare to write compressed data to file handle @code{f}.
1394 @code{f} should refer to
1395 a file which has been opened for writing, and for which the error
indicator (@code{ferror(f)}) is not set.
1398 For the meaning of parameters @code{blockSize100k},
1399 @code{verbosity} and @code{workFactor}, see
1400 @* @code{BZ2_bzCompressInit}.
1402 All required memory is allocated at this stage, so if the call
1403 completes successfully, @code{BZ_MEM_ERROR} cannot be signalled by a
1404 subsequent call to @code{BZ2_bzWrite}.
1406 Possible assignments to @code{bzerror}:
1408 @code{BZ_CONFIG_ERROR}
1409 if the library has been mis-compiled
1410 @code{BZ_PARAM_ERROR}
1411 if @code{f} is @code{NULL}
1412 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
@code{BZ_IO_ERROR}
if @code{ferror(f)} is nonzero
@code{BZ_MEM_ERROR}
if insufficient memory is available
@code{BZ_OK}
otherwise
1421 Possible return values:
1423 Pointer to an abstract @code{BZFILE}
if @code{bzerror} is @code{BZ_OK}
@code{NULL}
otherwise

Allowable next actions:
@code{BZ2_bzWrite}
if @code{bzerror} is @code{BZ_OK}
(you could go directly to @code{BZ2_bzWriteClose}, but this would be pretty pointless)
@code{BZ2_bzWriteClose}
otherwise
1440 @subsection @code{BZ2_bzWrite}
1442 void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
1444 Absorbs @code{len} bytes from the buffer @code{buf}, eventually to be
1445 compressed and written to the file.
1447 Possible assignments to @code{bzerror}:
1449 @code{BZ_PARAM_ERROR}
1450 if @code{b} is @code{NULL} or @code{buf} is @code{NULL} or @code{len < 0}
1451 @code{BZ_SEQUENCE_ERROR}
if @code{b} was opened with @code{BZ2_bzReadOpen}
@code{BZ_IO_ERROR}
if there is an error writing the compressed file.
@code{BZ_OK}
otherwise
1462 @subsection @code{BZ2_bzWriteClose}
void BZ2_bzWriteClose ( int *bzerror, BZFILE* b,
                        int abandon,
                        unsigned int* nbytes_in,
                        unsigned int* nbytes_out );

void BZ2_bzWriteClose64 ( int *bzerror, BZFILE* b,
                          int abandon,
                          unsigned int* nbytes_in_lo32,
                          unsigned int* nbytes_in_hi32,
                          unsigned int* nbytes_out_lo32,
                          unsigned int* nbytes_out_hi32 );
1477 Compresses and flushes to the compressed file all data so far supplied
1478 by @code{BZ2_bzWrite}. The logical end-of-stream markers are also written, so
1479 subsequent calls to @code{BZ2_bzWrite} are illegal. All memory associated
1480 with the compressed file @code{b} is released.
1481 @code{fflush} is called on the
1482 compressed file, but it is not @code{fclose}'d.
1484 If @code{BZ2_bzWriteClose} is called to clean up after an error, the only
1485 action is to release the memory. The library records the error codes
1486 issued by previous calls, so this situation will be detected
1487 automatically. There is no attempt to complete the compression
1488 operation, nor to @code{fflush} the compressed file. You can force this
1489 behaviour to happen even in the case of no error, by passing a nonzero
1490 value to @code{abandon}.
1492 If @code{nbytes_in} is non-null, @code{*nbytes_in} will be set to be the
1493 total volume of uncompressed data handled. Similarly, @code{nbytes_out}
1494 will be set to the total volume of compressed data written. For
1495 compatibility with older versions of the library, @code{BZ2_bzWriteClose}
1496 only yields the lower 32 bits of these counts. Use
1497 @code{BZ2_bzWriteClose64} if you want the full 64 bit counts. These
1498 two functions are otherwise absolutely identical.
1501 Possible assignments to @code{bzerror}:
1503 @code{BZ_SEQUENCE_ERROR}
1504 if @code{b} was opened with @code{BZ2_bzReadOpen}
@code{BZ_IO_ERROR}
if there is an error writing the compressed file
@code{BZ_OK}
otherwise
1511 @subsection Handling embedded compressed data streams
1513 The high-level library facilitates use of
@code{bzip2} data streams which form some part of a surrounding, larger
data stream.
1517 @item For writing, the library takes an open file handle, writes
1518 compressed data to it, @code{fflush}es it but does not @code{fclose} it.
1519 The calling application can write its own data before and after the
1520 compressed data stream, using that same file handle.
1521 @item Reading is more complex, and the facilities are not as general
1522 as they could be since generality is hard to reconcile with efficiency.
1523 @code{BZ2_bzRead} reads from the compressed file in blocks of size
1524 @code{BZ_MAX_UNUSED} bytes, and in doing so probably will overshoot
1525 the logical end of compressed stream.
1526 To recover this data once decompression has
1527 ended, call @code{BZ2_bzReadGetUnused} after the last call of @code{BZ2_bzRead}
1528 (the one returning @code{BZ_STREAM_END}) but before calling
1529 @code{BZ2_bzReadClose}.
1532 This mechanism makes it easy to decompress multiple @code{bzip2}
streams placed end-to-end. At the end of one stream, when @code{BZ2_bzRead}
returns @code{BZ_STREAM_END}, call @code{BZ2_bzReadGetUnused} to collect the
unused data (copy it into your own buffer somewhere).
That data forms the start of the next compressed stream.
To start uncompressing that next stream, call @code{BZ2_bzReadOpen} again,
feeding in the unused data via the @code{unused}/@code{nUnused}
parameters.
1540 Keep doing this until @code{BZ_STREAM_END} return coincides with the
1541 physical end of file (@code{feof(f)}). In this situation
1542 @code{BZ2_bzReadGetUnused}
1543 will of course return no data.
1545 This should give some feel for how the high-level interface can be used.
1546 If you require extra flexibility, you'll have to bite the bullet and get
1547 to grips with the low-level interface.
1549 @subsection Standard file-reading/writing code
1550 Here's how you'd write data to a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;
f = fopen ( "myfile.bz2", "wb" );   /* binary mode */
if (!f) @{ /* handle error */ @}
b = BZ2_bzWriteOpen ( &bzerror, f, 9, 0, 30 );
if (bzerror != BZ_OK) @{
   BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
   /* handle error */
@}
while ( /* condition */ ) @{
   /* get data to write into buf, and set nBuf appropriately */
   BZ2_bzWrite ( &bzerror, b, buf, nBuf );
   if (bzerror == BZ_IO_ERROR) @{
      BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
      /* handle error */
   @}
@}
BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
if (bzerror == BZ_IO_ERROR) @{ /* handle error */ @}
@end example
1583 And to read from a compressed file:
@example
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;
f = fopen ( "myfile.bz2", "rb" );   /* binary mode */
if (!f) @{ /* handle error */ @}
b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );
if (bzerror != BZ_OK) @{
   BZ2_bzReadClose ( &bzerror, b );
   /* handle error */
@}
while (bzerror == BZ_OK && /* arbitrary other conditions */) @{
   nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
   if (bzerror == BZ_OK || bzerror == BZ_STREAM_END) @{
      /* do something with buf[0 .. nBuf-1] */
   @}
@}
if (bzerror != BZ_STREAM_END) @{ /* handle error */ @}
BZ2_bzReadClose ( &bzerror, b );
@end example
1619 @section Utility functions
1620 @subsection @code{BZ2_bzBuffToBuffCompress}
int BZ2_bzBuffToBuffCompress( char* dest, unsigned int* destLen,
                              char* source, unsigned int sourceLen,
                              int blockSize100k, int verbosity,
                              int workFactor );
1630 Attempts to compress the data in @code{source[0 .. sourceLen-1]}
1631 into the destination buffer, @code{dest[0 .. *destLen-1]}.
1632 If the destination buffer is big enough, @code{*destLen} is
1633 set to the size of the compressed data, and @code{BZ_OK} is
1634 returned. If the compressed data won't fit, @code{*destLen}
1635 is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1637 Compression in this manner is a one-shot event, done with a single call
1638 to this function. The resulting compressed data is a complete
1639 @code{bzip2} format data stream. There is no mechanism for making
1640 additional calls to provide extra input data. If you want that kind of
1641 mechanism, use the low-level interface.
1643 For the meaning of parameters @code{blockSize100k}, @code{verbosity}
1644 and @code{workFactor}, @* see @code{BZ2_bzCompressInit}.
1646 To guarantee that the compressed data will fit in its buffer, allocate
1647 an output buffer of size 1% larger than the uncompressed data, plus
1648 six hundred extra bytes.
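
The sizing rule above can be sketched as a small helper. The name
@code{bz_compress_bound} is hypothetical and not part of @code{libbzip2};
the sketch rounds the one percent up, and reports failure rather than
wrapping around if the result does not fit in an @code{unsigned int}.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of libbzip2: computes the output
   buffer size suggested above (source length + 1% + 600 bytes).
   Returns 1 and stores the size in *bound, or returns 0 if the
   result would not fit in an unsigned int. */
int bz_compress_bound(unsigned int sourceLen, unsigned int *bound)
{
    unsigned int onePercent;

    assert(bound != NULL);

    /* 1% of the input, rounded up so the bound is never too small */
    onePercent = sourceLen / 100 + (sourceLen % 100 != 0);

    /* refuse to wrap around for very large inputs */
    if (sourceLen > UINT_MAX - onePercent ||
        sourceLen + onePercent > UINT_MAX - 600)
        return 0;

    *bound = sourceLen + onePercent + 600;
    return 1;
}
```

A caller would allocate @code{dest} with the computed size and store the
same value in @code{*destLen} before calling
@code{BZ2_bzBuffToBuffCompress}.
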
1650 @code{BZ2_bzBuffToBuffCompress} will not write data at or
1651 beyond @code{dest[*destLen]}, even in case of buffer overflow.
1653 Possible return values:
1655 @code{BZ_CONFIG_ERROR}
1656 if the library has been mis-compiled
1657 @code{BZ_PARAM_ERROR}
1658 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1659 or @code{blockSize100k < 1} or @code{blockSize100k > 9}
1660 or @code{verbosity < 0} or @code{verbosity > 4}
1661 or @code{workFactor < 0} or @code{workFactor > 250}
1663 if insufficient memory is available
1664 @code{BZ_OUTBUFF_FULL}
1665 if the size of the compressed data exceeds @code{*destLen}
1672 @subsection @code{BZ2_bzBuffToBuffDecompress}
1674 int BZ2_bzBuffToBuffDecompress ( char* dest,
1675 unsigned int* destLen,
1677 unsigned int sourceLen,
1681 Attempts to decompress the data in @code{source[0 .. sourceLen-1]}
1682 into the destination buffer, @code{dest[0 .. *destLen-1]}.
1683 If the destination buffer is big enough, @code{*destLen} is
1684 set to the size of the uncompressed data, and @code{BZ_OK} is
1685 returned. If the uncompressed data won't fit, @code{*destLen}
1686 is unchanged, and @code{BZ_OUTBUFF_FULL} is returned.
1688 @code{source} is assumed to hold a complete @code{bzip2} format
1689 data stream. @* @code{BZ2_bzBuffToBuffDecompress} tries to decompress
1690 the entirety of the stream into the output buffer.
1692 For the meaning of parameters @code{small} and @code{verbosity},
1693 see @code{BZ2_bzDecompressInit}.
1695 Because the compression ratio of the compressed data cannot be known in
1696 advance, there is no easy way to guarantee that the output buffer will
1697 be big enough. You may of course make arrangements in your code to
1698 record the size of the uncompressed data, but such a mechanism is beyond
1699 the scope of this library.
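
One possible arrangement, shown purely as a sketch, is to store the
uncompressed length in a small header of your own in front of the
compressed stream. The four-byte little-endian layout and the names
@code{put_uncompressed_len} and @code{get_uncompressed_len} are
assumptions made for illustration; they are not part of @code{libbzip2}
or of the @code{bzip2} file format. The sketch also assumes
@code{unsigned int} is at least 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application-level framing: a 4-byte little-endian
   length that the application writes before the compressed data. */
#define LEN_PREFIX_BYTES 4

/* Store len in hdr[0..3], least significant byte first. */
void put_uncompressed_len(unsigned char *hdr, unsigned int len)
{
    assert(hdr != NULL);
    hdr[0] = (unsigned char)(len & 0xff);
    hdr[1] = (unsigned char)((len >> 8) & 0xff);
    hdr[2] = (unsigned char)((len >> 16) & 0xff);
    hdr[3] = (unsigned char)((len >> 24) & 0xff);
}

/* Read back the length written by put_uncompressed_len. */
unsigned int get_uncompressed_len(const unsigned char *hdr)
{
    assert(hdr != NULL);
    return (unsigned int)hdr[0]
         | ((unsigned int)hdr[1] << 8)
         | ((unsigned int)hdr[2] << 16)
         | ((unsigned int)hdr[3] << 24);
}
```

On the reading side, the recovered length can be used to allocate
@code{dest} and to initialise @code{*destLen}; the compressed stream
itself then starts @code{LEN_PREFIX_BYTES} bytes into the buffer.
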
1701 @code{BZ2_bzBuffToBuffDecompress} will not write data at or
1702 beyond @code{dest[*destLen]}, even in case of buffer overflow.
1704 Possible return values:
1706 @code{BZ_CONFIG_ERROR}
1707 if the library has been mis-compiled
1708 @code{BZ_PARAM_ERROR}
1709 if @code{dest} is @code{NULL} or @code{destLen} is @code{NULL}
1710 or @code{small != 0 && small != 1}
1711 or @code{verbosity < 0} or @code{verbosity > 4}
1713 if insufficient memory is available
1714 @code{BZ_OUTBUFF_FULL}
1715 if the size of the uncompressed data exceeds @code{*destLen}
1716 @code{BZ_DATA_ERROR}
1717 if a data integrity error was detected in the compressed data
1718 @code{BZ_DATA_ERROR_MAGIC}
1719 if the compressed data doesn't begin with the right magic bytes
1720 @code{BZ_UNEXPECTED_EOF}
1721 if the compressed data ends unexpectedly
1728 @section @code{zlib} compatibility functions
1729 Yoshioka Tsuneo has contributed some functions to
1730 give better @code{zlib} compatibility. These functions are
1731 @code{BZ2_bzopen}, @code{BZ2_bzread}, @code{BZ2_bzwrite}, @code{BZ2_bzflush},
1733 @code{BZ2_bzerror} and @code{BZ2_bzlibVersion}.
1734 These functions are not (yet) officially part of
1735 the library. If they break, you get to keep all the pieces.
1736 Nevertheless, I think they work ok.
1738 typedef void BZFILE;
1740 const char * BZ2_bzlibVersion ( void );
1742 Returns a string indicating the library version.
1744 BZFILE * BZ2_bzopen ( const char *path, const char *mode );
1745 BZFILE * BZ2_bzdopen ( int fd, const char *mode );
1747 Opens a @code{.bz2} file for reading or writing, using either its name
1748 or a pre-existing file descriptor.
1749 Analogous to @code{fopen} and @code{fdopen}.
1751 int BZ2_bzread ( BZFILE* b, void* buf, int len );
1752 int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
1754 Reads/writes data from/to a previously opened @code{BZFILE}.
1755 Analogous to @code{fread} and @code{fwrite}.
1757 int BZ2_bzflush ( BZFILE* b );
1758 void BZ2_bzclose ( BZFILE* b );
1760 Flushes/closes a @code{BZFILE}. @code{BZ2_bzflush} doesn't actually do
1761 anything. Analogous to @code{fflush} and @code{fclose}.
1764 const char * BZ2_bzerror ( BZFILE *b, int *errnum );
1766 Returns a string describing the most recent error status of
1767 @code{b}, and also sets @code{*errnum} to its numerical value.
1770 @section Using the library in a @code{stdio}-free environment
1772 @subsection Getting rid of @code{stdio}
1774 In a deeply embedded application, you might want to use just
1775 the memory-to-memory functions. You can do this conveniently
1776 by compiling the library with preprocessor symbol @code{BZ_NO_STDIO}
1777 defined. Doing this gives you a library containing only the following
1780 @code{BZ2_bzCompressInit}, @code{BZ2_bzCompress}, @code{BZ2_bzCompressEnd} @*
1781 @code{BZ2_bzDecompressInit}, @code{BZ2_bzDecompress}, @code{BZ2_bzDecompressEnd} @*
1782 @code{BZ2_bzBuffToBuffCompress}, @code{BZ2_bzBuffToBuffDecompress}
1784 When compiled like this, all functions will ignore @code{verbosity}
1787 @subsection Critical error handling
1788 @code{libbzip2} contains a number of internal assertion checks which
1789 should, needless to say, never be activated. Nevertheless, if an
1790 assertion should fail, behaviour depends on whether or not the library
1791 was compiled with @code{BZ_NO_STDIO} set.
1793 For a normal compile, an assertion failure yields the message
1795 bzip2/libbzip2: internal error number N.
1796 This is a bug in bzip2/libbzip2, 1.0.2, 30-Dec-2001.
1797 Please report it to me at: jseward@@acm.org. If this happened
1798 when you were using some program which uses libbzip2 as a
1799 component, you should also report this bug to the author(s)
1800 of that program. Please make an effort to report this bug;
1801 timely and accurate bug reports eventually lead to higher
1802 quality software. Thanks. Julian Seward, 30 December 2001.
1804 where @code{N} is some error code number. If @code{N == 1007}, it also
1805 prints some extra text advising the reader that unreliable memory is
1806 often associated with internal error 1007. (This is a
1807 frequently observed phenomenon with versions 1.0.0/1.0.1).
1809 @code{exit(3)} is then called.
1811 For a @code{stdio}-free library, assertion failures result
1812 in a call to a function declared as:
1814 extern void bz_internal_error ( int errcode );
1816 The relevant code is passed as a parameter. You should supply
1819 In either case, once an assertion failure has occurred, any
1820 @code{bz_stream} records involved can be regarded as invalid.
1821 You should not attempt to resume normal operation with them.
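
As a sketch of what such a handler might look like in a
@code{stdio}-free program, the following version records the error
code and jumps back to a recovery point set up by the caller, so that
control never returns into the library. Only the signature of
@code{bz_internal_error} comes from the declaration above; the names
@code{run_guarded}, @code{simulate_failure} and @code{do_nothing} are
hypothetical, and the use of @code{setjmp}/@code{longjmp} is one
possible design, not something the library requires. After a non-zero
result, every @code{bz_stream} that was in use must be discarded.

```c
#include <setjmp.h>

static jmp_buf bz_panic_point;      /* where to resume after a failure */
static volatile int bz_panic_code;  /* last assertion code, 0 if none  */

/* Called by a BZ_NO_STDIO build of the library on assertion failure.
   This sketch never returns to the caller inside the library. */
void bz_internal_error(int errcode)
{
    bz_panic_code = errcode;
    longjmp(bz_panic_point, 1);
}

/* Hypothetical wrapper: runs work() and returns 0 on success, or the
   assertion code if bz_internal_error was called while it ran. */
int run_guarded(void (*work)(void))
{
    bz_panic_code = 0;
    if (setjmp(bz_panic_point) != 0)
        return bz_panic_code;
    work();
    return 0;
}

/* Stand-ins for real work, used only to exercise the wrapper. */
void simulate_failure(void) { bz_internal_error(1007); }
void do_nothing(void) { }
```
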
1823 You may, of course, change critical error handling to suit
1824 your needs. As I said above, critical errors indicate bugs
1825 in the library and should not occur. All "normal" error
1826 situations are indicated via error return codes from functions,
1827 and can be recovered from.
1830 @section Making a Windows DLL
1831 Everything related to Windows has been contributed by Yoshioka Tsuneo
1832 @* (@code{QWF00133@@niftyserve.or.jp} /
1833 @code{tsuneo-y@@is.aist-nara.ac.jp}), so you should send your queries to
1834 him (but perhaps Cc: me, @code{jseward@@acm.org}).
1836 My vague understanding of what to do is: using Visual C++ 5.0,
1837 open the project file @code{libbz2.dsp}, and build. That's all.
1840 open the project file for some reason, make a new one, naming these files:
1841 @code{blocksort.c}, @code{bzlib.c}, @code{compress.c},
1842 @code{crctable.c}, @code{decompress.c}, @code{huffman.c}, @*
1843 @code{randtable.c} and @code{libbz2.def}. You will also need
1844 to name the header files @code{bzlib.h} and @code{bzlib_private.h}.
1846 If you don't use VC++, you may need to define the preprocessor symbol
1849 Finally, @code{dlltest.c} is a sample program using the DLL. It has a
1850 project file, @code{dlltest.dsp}.
1852 If you just want a makefile for Visual C, have a look at
1853 @code{makefile.msc}.
1855 Be aware that if you compile @code{bzip2} itself on Win32, you must set
1856 @code{BZ_UNIX} to 0 and @code{BZ_LCCWIN32} to 1, in the file
1857 @code{bzip2.c}, before compiling. Otherwise the resulting binary won't
1860 I haven't tried any of this stuff myself, but it all looks plausible.
1864 @chapter Miscellanea
1866 These are just some random thoughts of mine. Your mileage may
1869 @section Limitations of the compressed file format
1870 @code{bzip2-1.0}, @code{0.9.5} and @code{0.9.0}
1871 use exactly the same file format as the previous
1872 version, @code{bzip2-0.1}. This decision was made in the interests of
1873 stability. Creating yet another incompatible compressed file format
1874 would create further confusion and disruption for users.
1876 Nevertheless, this is not a painless decision. Development
1877 work since the release of @code{bzip2-0.1} in August 1997
1878 has shown complexities in the file format which slow down
1879 decompression and, in retrospect, are unnecessary. These are:
1881 @item The run-length encoder, which is the first of the
1882 compression transformations, is entirely irrelevant.
1883 The original purpose was to protect the sorting algorithm
1884 from the very worst case input: a string of repeated
1885 symbols. But algorithm steps Q6a and Q6b in the original
1886 Burrows-Wheeler technical report (SRC-124) show how
1887 repeats can be handled without difficulty in block
1889 @item The randomisation mechanism doesn't really need to be
1890 there. Udi Manber and Gene Myers published a suffix
1891 array construction algorithm a few years back, which
1892 can be employed to sort any block, no matter how
1893 repetitive, in O(N log N) time. Subsequent work by
1894 Kunihiko Sadakane has produced a derivative O(N (log N)^2)
1895 algorithm which usually outperforms the Manber-Myers
1898 I could have changed to Sadakane's algorithm, but I find
1899 it to be slower than @code{bzip2}'s existing algorithm for
1900 most inputs, and the randomisation mechanism protects
1901 adequately against bad cases. I didn't think it was
1902 a good tradeoff to make. Partly this is due to the fact
1903 that I was not flooded with email complaints about
1904 @code{bzip2-0.1}'s performance on repetitive data, so
1905 perhaps it isn't a problem for real inputs.
1907 Probably the best long-term solution,
1908 and the one I have incorporated into 0.9.5 and above,
1909 is to use the existing sorting
1910 algorithm initially, and fall back to an O(N (log N)^2)
1911 algorithm if the standard algorithm gets into difficulties.
1912 @item The compressed file format was never designed to be
1913 handled by a library, and I have had to jump through
1914 some hoops to produce an efficient implementation of
1915 decompression. It's a bit hairy. Try passing
1916 @code{decompress.c} through the C preprocessor
1917 and you'll see what I mean. Much of this complexity
1918 could have been avoided if the compressed size of
1919 each block of data was recorded in the data stream.
1920 @item An Adler-32 checksum, rather than a CRC32 checksum,
1921 would be faster to compute.
1923 It would be fair to say that the @code{bzip2} format was frozen
1924 before I properly and fully understood the performance
1925 consequences of doing so.
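
For reference, the Adler-32 checksum mentioned in the list above can be
sketched in a few lines. It keeps two running sums modulo 65521 and
needs no lookup table. The function name @code{adler32_simple} is
hypothetical, and this unoptimised form takes a modulo on every byte;
practical implementations usually defer the modulo across many bytes.
@code{bzip2} itself continues to use CRC32.

```c
#include <assert.h>
#include <stddef.h>

#define ADLER_MOD 65521UL   /* largest prime below 2^16 */

/* Straightforward Adler-32 over data[0 .. len-1]. */
unsigned long adler32_simple(const unsigned char *data, size_t len)
{
    unsigned long a = 1;    /* sum of the bytes, plus one   */
    unsigned long b = 0;    /* sum of the successive a's    */
    size_t i;

    assert(data != NULL || len == 0);

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```
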
1927 Improvements which I was able to incorporate into
1928 0.9.0, despite using the same file format, are:
1930 @item Single array implementation of the inverse BWT. This
1931 significantly speeds up decompression, presumably
1932 because it reduces the number of cache misses.
1933 @item Faster inverse MTF transform for large MTF values. The
1934 new implementation is based on the notion of sliding blocks
1936 @item @code{bzip2-0.9.0} now reads and writes files with @code{fread}
1937 and @code{fwrite}; version 0.1 used @code{putc} and @code{getc}.
1938 Duh! Well, you live and learn.
1941 Further ahead, it would be nice
1942 to be able to do random access into files. This will
1943 require some careful design of compressed file formats.
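
Returning to the single-array inverse BWT listed among the 0.9.0
improvements: the idea can be sketched as follows. Each entry of one
array holds a byte of the transformed block in its low 8 bits and a
link to the next entry in its upper bits, so the decoding loop touches
only one array. This is an illustrative sketch and not the code in
@code{decompress.c}; the function name and parameter layout are
assumptions, and blocks are limited to fewer than 2^24 bytes so that
the link fits in 24 bits.

```c
#include <assert.h>
#include <stddef.h>

/* last[0 .. n-1] is the BWT output (last column), origPtr is the row
   of the original string in the sorted rotation matrix, tt is scratch
   space of n entries, out receives the n original bytes. */
void inverse_bwt(const unsigned char *last, size_t n, size_t origPtr,
                 unsigned long *tt, unsigned char *out)
{
    size_t cftab[256] = {0};
    size_t i, c, sum = 0;
    unsigned long tPos;

    if (n == 0) return;
    assert(n < (1UL << 24));
    assert(origPtr < n);

    /* count each byte value and seed the low 8 bits of tt */
    for (i = 0; i < n; i++) {
        cftab[last[i]]++;
        tt[i] = last[i];
    }

    /* turn counts into starting positions in the sorted first column */
    for (c = 0; c < 256; c++) {
        size_t count = cftab[c];
        cftab[c] = sum;
        sum += count;
    }

    /* pack the link for each first-column slot into the upper bits */
    for (i = 0; i < n; i++)
        tt[cftab[last[i]]++] |= (unsigned long)i << 8;

    /* follow the links, emitting one byte per step */
    tPos = tt[origPtr] >> 8;
    for (i = 0; i < n; i++) {
        tPos = tt[tPos];
        out[i] = (unsigned char)(tPos & 0xff);
        tPos >>= 8;
    }
}
```

For example, the rotations of @code{banana} sort to give the last column
@code{nnbaaa} with the original string in row 3, and the sketch recovers
@code{banana} from that input.
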
1947 @section Portability issues
1948 After some consideration, I have decided not to use
1949 GNU @code{autoconf} to configure 0.9.5 or 1.0.
1951 @code{autoconf}, admirable and wonderful though it is,
1952 mainly assists with portability problems between Unix-like
1953 platforms. But @code{bzip2} doesn't have much in the way
1954 of portability problems on Unix; most of the difficulties appear
1955 when porting to the Mac, or to Microsoft's operating systems.
1956 @code{autoconf} doesn't help in those cases, and brings in a
1957 whole load of new complexity.
1959 Most people should be able to compile the library and program
1960 under Unix straight out-of-the-box, so to speak, especially
1961 if you have a version of GNU C available.
1963 There are a couple of @code{__inline__} directives in the code. GNU C
1964 (@code{gcc}) should be able to handle them. If you're not using
1965 GNU C, your C compiler shouldn't see them at all.
1966 If your compiler does, for some reason, see them and doesn't
1967 like them, just @code{#define} @code{__inline__} to be @code{/* */}. One
1968 easy way to do this is to compile with the flag @code{-D__inline__=},
1969 which should be understood by most Unix compilers.
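
The same effect can be had inside a source file, as in the sketch below.
The function @code{twice} is a hypothetical stand-in used only to show
the keyword in place; the guard on @code{__GNUC__} assumes that
compilers defining it accept @code{__inline__} directly.

```c
/* If the compiler is not GNU C (or compatible), make __inline__
   expand to nothing so the declarations below still compile. */
#ifndef __GNUC__
#define __inline__ /* */
#endif

/* Hypothetical helper, present only to show the keyword in use. */
static __inline__ int twice(int x)
{
    return 2 * x;
}
```
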
1971 If you still have difficulties, try compiling with the macro
1972 @code{BZ_STRICT_ANSI} defined. This should enable you to build the
1973 library in a strictly ANSI compliant environment. Building the program
1974 itself like this is dangerous and not supported, since you remove
1975 @code{bzip2}'s checks against compressing directories, symbolic links,
1976 devices, and other not-really-a-file entities. This could cause
1977 filesystem corruption!
1979 One other thing: if you create a @code{bzip2} binary for public
1980 distribution, please try and link it statically (@code{gcc -static}). This
1981 avoids all sorts of library-version issues that others may encounter
1984 If you build @code{bzip2} on Win32, you must set @code{BZ_UNIX} to 0 and
1985 @code{BZ_LCCWIN32} to 1, in the file @code{bzip2.c}, before compiling.
1986 Otherwise the resulting binary won't work correctly.
1990 @section Reporting bugs
1991 I tried pretty hard to make sure @code{bzip2} is
1992 bug free, both by design and by testing. Hopefully
1993 you'll never need to read this section for real.
1995 Nevertheless, if @code{bzip2} dies with a segmentation
1996 fault, a bus error or an internal assertion failure, it
1997 will ask you to email me a bug report. Experience with
1998 version 0.1 shows that almost all these problems can
1999 be traced to either compiler bugs or hardware problems.
2002 Recompile the program with no optimisation, and see if it
2003 works. And/or try a different compiler.
2004 I heard all sorts of stories about various flavours
2005 of GNU C (and other compilers) generating bad code for
2006 @code{bzip2}, and I've run across two such examples myself.
2008 2.7.X versions of GNU C are known to generate bad code from
2009 time to time, at high optimisation levels.
2010 If you get problems, try using the flags
2011 @code{-O2} @code{-fomit-frame-pointer} @code{-fno-strength-reduce}.
2012 You should specifically @emph{not} use @code{-funroll-loops}.
2014 You may notice that the Makefile runs six tests as part of
2015 the build process. If the program passes all of these, it's
2016 a pretty good (but not 100%) indication that the compiler has
2017 done its job correctly.
2019 If @code{bzip2} crashes randomly, and the crashes are not
2020 repeatable, you may have a flaky memory subsystem. @code{bzip2}
2021 really hammers your memory hierarchy, and if it's a bit marginal,
2022 you may get these problems. Ditto if your disk or I/O subsystem
2023 is slowly failing. Yup, this really does happen.
2025 Try using a different machine of the same type, and see if
2026 you can repeat the problem.
2027 @item This isn't really a bug, but ... If @code{bzip2} tells
2028 you your file is corrupted on decompression, and you
2029 obtained the file via FTP, there is a possibility that you
2030 forgot to tell FTP to do a binary mode transfer. That absolutely
2031 will cause the file to be non-decompressible. You'll have to transfer
2035 If you've incorporated @code{libbzip2} into your own program
2036 and are getting problems, please, please, please, check that the
2037 parameters you are passing in calls to the library are
2038 correct, and in accordance with what the documentation says
2039 is allowable. I have tried to make the library robust against
2040 such problems, but I'm sure I haven't succeeded.
2042 Finally, if the above comments don't help, you'll have to send
2043 me a bug report. Now, it's just amazing how many people will
2044 send me a bug report saying something like
2046 bzip2 crashed with segmentation fault on my machine
2048 and absolutely nothing else. Needless to say, such a report
2049 is @emph{totally, utterly, completely and comprehensively 100% useless;
2050 a waste of your time, my time, and net bandwidth}.
2051 With no details at all, there's no way I can possibly begin
2052 to figure out what the problem is.
2054 The rules of the game are: facts, facts, facts. Don't omit
2055 them because "oh, they won't be relevant". At the bare
2058 Machine type. Operating system version.
2059 Exact version of @code{bzip2} (do @code{bzip2 -V}).
2060 Exact version of the compiler used.
2061 Flags passed to the compiler.
2063 However, the most important single thing that will help me is
2064 the file that you were trying to compress or decompress at the
2065 time the problem happened. Without that, my ability to do anything
2066 more than speculate about the cause is limited.
2068 Please remember that I connect to the Internet with a modem, so
2069 you should contact me before mailing me huge files.
2072 @section Did you get the right package?
2074 @code{bzip2} is a resource hog. It soaks up large amounts of CPU cycles
2075 and memory. Also, it gives very large latencies. In the worst case, you
2076 can feed many megabytes of uncompressed data into the library before
2077 getting any compressed output, so this probably rules out applications
2078 requiring interactive behaviour.
2080 These aren't faults of my implementation, I hope, but more
2081 an intrinsic property of the Burrows-Wheeler transform (unfortunately).
2082 Maybe this isn't what you want.
2084 If you want a compressor and/or library which is faster, uses less
2085 memory but gets pretty good compression, and has minimal latency,
2087 Gailly's and Mark Adler's work, @code{zlib-1.1.3} and
2088 @code{gzip-1.2.4}. Look for them at
2090 @code{http://www.zlib.org} and
2091 @code{http://www.gzip.org} respectively.
2093 For something faster and lighter still, you might try Markus F X J
2094 Oberhumer's @code{LZO} real-time compression/decompression library, at
2095 @* @code{http://wildsau.idv.uni-linz.ac.at/mfx/lzo.html}.
2097 If you want to use the @code{bzip2} algorithms to compress small blocks
2098 of data, 64k bytes or smaller, for example on an on-the-fly disk
2099 compressor, you'd be well advised not to use this library. Instead,
2100 I've made a special library tuned for that kind of use. It's part of
2101 @code{e2compr-0.40}, an on-the-fly disk compressor for the Linux
2102 @code{ext2} filesystem. Look at
2103 @code{http://www.netspace.net.au/~reiter/e2compr}.
2109 A record of the tests I've done.
2111 First, some data sets:
2113 @item B: a directory containing 6001 files, one for every length in the
2114 range 0 to 6000 bytes. The files contain random lowercase
2115 letters. 18.7 megabytes.
2116 @item H: my home directory tree. Documents, source code, mail files,
2117 compressed data. H contains B, and also a directory of
2118 files designed as boundary cases for the sorting; mostly very
2119 repetitive, nasty files. 565 megabytes.
2120 @item A: directory tree holding various applications built from source:
2121 @code{egcs}, @code{gcc-2.8.1}, KDE, GTK, Octave, etc.
2124 The tests conducted are as follows. Each test means compressing
2125 (a copy of) each file in the data set, decompressing it and
2126 comparing it against the original.
2128 First, a bunch of tests with block sizes and internal buffer
2129 sizes set very small,
2130 to detect any problems with the
2131 blocking and buffering mechanisms.
2132 This required modifying the source code so as to try to
2135 @item Data set H, with
2136 buffer size of 1 byte, and block size of 23 bytes.
2137 @item Data set B, buffer sizes 1 byte, block size 1 byte.
2138 @item As (2) but small-mode decompression.
2139 @item As (2) with block size 2 bytes.
2140 @item As (2) with block size 3 bytes.
2141 @item As (2) with block size 4 bytes.
2142 @item As (2) with block size 5 bytes.
2143 @item As (2) with block size 6 bytes and small-mode decompression.
2144 @item H with buffer size of 1 byte, but normal block
2145 size (up to 900000 bytes).
2147 Then some tests with unmodified source code.
2149 @item H, all settings normal.
2150 @item As (1), with small-mode decompress.
2151 @item H, compress with flag @code{-1}.
2152 @item H, compress with flag @code{-s}, decompress with flag @code{-s}.
2153 @item Forwards compatibility: H, @code{bzip2-0.1pl2} compressing,
2154 @code{bzip2-0.9.5} decompressing, all settings normal.
2155 @item Backwards compatibility: H, @code{bzip2-0.9.5} compressing,
2156 @code{bzip2-0.1pl2} decompressing, all settings normal.
2157 @item Bigger tests: A, all settings normal.
2158 @item As (7), using the fallback (Sadakane-like) sorting algorithm.
2159 @item As (8), compress with flag @code{-1}, decompress with flag
2161 @item H, using the fallback sorting algorithm.
2162 @item Forwards compatibility: A, @code{bzip2-0.1pl2} compressing,
2163 @code{bzip2-0.9.5} decompressing, all settings normal.
2164 @item Backwards compatibility: A, @code{bzip2-0.9.5} compressing,
2165 @code{bzip2-0.1pl2} decompressing, all settings normal.
2166 @item Misc test: about 400 megabytes of @code{.tar} files with
2167 @code{bzip2} compiled with Checker (a memory access error
2168 detector, like Purify).
2169 @item Misc tests to make sure it builds and runs ok on non-Linux/x86
2172 These tests were conducted on a 225 MHz IDT WinChip machine, running
2173 Linux 2.0.36. They represent nearly a week of continuous computation.
2174 All tests completed successfully.
2177 @section Further reading
2178 @code{bzip2} is not research work, in the sense that it doesn't present
2179 any new ideas. Rather, it's an engineering exercise based on existing
2182 Four documents describe essentially all the ideas behind @code{bzip2}:
2184 Michael Burrows and D. J. Wheeler:
2185 "A block-sorting lossless data compression algorithm"
2187 Digital SRC Research Report 124.
2188 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
2189 If you have trouble finding it, try searching at the
2190 New Zealand Digital Library, http://www.nzdl.org.
2192 Daniel S. Hirschberg and Debra A. LeLewer
2193 "Efficient Decoding of Prefix Codes"
2194 Communications of the ACM, April 1990, Vol 33, Number 4.
2195 You might be able to get an electronic copy of this
2196 from the ACM Digital Library.
2199 Program bred3.c and accompanying document bred3.ps.
2200 This contains the idea behind the multi-table Huffman
2202 ftp://ftp.cl.cam.ac.uk/users/djw3/
2204 Jon L. Bentley and Robert Sedgewick
2205 "Fast Algorithms for Sorting and Searching Strings"
2206 Available from Sedgewick's web page,
2207 www.cs.princeton.edu/~rs
2209 The following paper gives valuable additional insights into the
2210 algorithm, but is not immediately the basis of any code
2214 Block Sorting Text Compression
2215 Proceedings of the 19th Australasian Computer Science Conference,
2216 Melbourne, Australia. Jan 31 - Feb 2, 1996.
2217 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
2219 Kunihiko Sadakane's sorting algorithm, mentioned above,
2222 http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
2224 The Manber-Myers suffix array construction
2225 algorithm is described in a paper
2228 http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
2230 Finally, the following paper documents some recent investigations
2231 I made into the performance of sorting algorithms:
2234 On the Performance of BWT Sorting Algorithms
2235 Proceedings of the IEEE Data Compression Conference 2000
2236 Snowbird, Utah. 28-30 March 2000.