rfc1951.txt

Version: - FREEBSD - FREEBSD-13-STABLE - FREEBSD-13-0 - FREEBSD-12-STABLE - FREEBSD-12-0 - FREEBSD-11-STABLE - FREEBSD-11-0 - FREEBSD-10-STABLE - FREEBSD-10-0 - FREEBSD-9-STABLE - FREEBSD-9-0 - FREEBSD-8-STABLE - FREEBSD-8-0 - FREEBSD-7-STABLE - FREEBSD-7-0 - FREEBSD-6-STABLE - FREEBSD-6-0 - FREEBSD-5-STABLE - FREEBSD-5-0 - FREEBSD-4-STABLE - FREEBSD-3-STABLE - FREEBSD22 - l41 - OPENBSD - linux-2.6 - MK84 - PLAN9 - xnu-8792
SearchContext: - none - 3 - 10
    1 
    2 
    3 
    4 
    5 
    6 
    7 Network Working Group                                         P. Deutsch
    8 Request for Comments: 1951                           Aladdin Enterprises
    9 Category: Informational                                         May 1996
   10 
   11 
   12         DEFLATE Compressed Data Format Specification version 1.3
   13 
   14 Status of This Memo
   15 
   16    This memo provides information for the Internet community.  This memo
   17    does not specify an Internet standard of any kind.  Distribution of
   18    this memo is unlimited.
   19 
   20 IESG Note:
   21 
   22    The IESG takes no position on the validity of any Intellectual
   23    Property Rights statements contained in this document.
   24 
   25 Notices
   26 
   27    Copyright (c) 1996 L. Peter Deutsch
   28 
   29    Permission is granted to copy and distribute this document for any
   30    purpose and without charge, including translations into other
   31    languages and incorporation into compilations, provided that the
   32    copyright notice and this notice are preserved, and that any
   33    substantive changes or deletions from the original are clearly
   34    marked.
   35 
   36    A pointer to the latest version of this and related documentation in
   37    HTML format can be found at the URL
   38    <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
   39 
   40 Abstract
   41 
   42    This specification defines a lossless compressed data format that
   43    compresses data using a combination of the LZ77 algorithm and Huffman
   44    coding, with efficiency comparable to the best currently available
   45    general-purpose compression methods.  The data can be produced or
   46    consumed, even for an arbitrarily long sequentially presented input
   47    data stream, using only an a priori bounded amount of intermediate
   48    storage.  The format can be implemented readily in a manner not
   49    covered by patents.
   50 
   51 
   52 
   53 
   54 
   55 
   56 
   57 
   58 Deutsch                      Informational                      [Page 1]
   59 
   60 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
   61 
   62 
   63 Table of Contents
   64 
   65    1. Introduction ................................................... 2
   66       1.1. Purpose ................................................... 2
   67       1.2. Intended audience ......................................... 3
   68       1.3. Scope ..................................................... 3
   69       1.4. Compliance ................................................ 3
   70       1.5.  Definitions of terms and conventions used ................ 3
   71       1.6. Changes from previous versions ............................ 4
   72    2. Compressed representation overview ............................. 4
   73    3. Detailed specification ......................................... 5
   74       3.1. Overall conventions ....................................... 5
   75           3.1.1. Packing into bytes .................................. 5
   76       3.2. Compressed block format ................................... 6
   77           3.2.1. Synopsis of prefix and Huffman coding ............... 6
   78           3.2.2. Use of Huffman coding in the "deflate" format ....... 7
   79           3.2.3. Details of block format ............................. 9
   80           3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
   81           3.2.5. Compressed blocks (length and distance codes) ...... 11
   82           3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
   83           3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
   84       3.3. Compliance ............................................... 14
   85    4. Compression algorithm details ................................. 14
   86    5. References .................................................... 16
   87    6. Security Considerations ....................................... 16
   88    7. Source code ................................................... 16
   89    8. Acknowledgements .............................................. 16
   90    9. Author's Address .............................................. 17
   91 
   92 1. Introduction
   93 
   94    1.1. Purpose
   95 
   96       The purpose of this specification is to define a lossless
   97       compressed data format that:
   98           * Is independent of CPU type, operating system, file system,
   99             and character set, and hence can be used for interchange;
  100           * Can be produced or consumed, even for an arbitrarily long
  101             sequentially presented input data stream, using only an a
  102             priori bounded amount of intermediate storage, and hence
  103             can be used in data communications or similar structures
  104             such as Unix filters;
  105           * Compresses data with efficiency comparable to the best
  106             currently available general-purpose compression methods,
  107             and in particular considerably better than the "compress"
  108             program;
  109           * Can be implemented readily in a manner not covered by
  110             patents, and hence can be practiced freely;
  111 
  112 
  113 
  114 Deutsch                      Informational                      [Page 2]
  115 
  116 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  117 
  118 
  119           * Is compatible with the file format produced by the current
  120             widely used gzip utility, in that conforming decompressors
  121             will be able to read data produced by the existing gzip
  122             compressor.
  123 
  124       The data format defined by this specification does not attempt to:
  125 
  126           * Allow random access to compressed data;
  127           * Compress specialized data (e.g., raster graphics) as well
  128             as the best currently available specialized algorithms.
  129 
  130       A simple counting argument shows that no lossless compression
  131       algorithm can compress every possible input data set.  For the
  132       format defined here, the worst case expansion is 5 bytes per 32K-
  133       byte block, i.e., a size increase of 0.015% for large data sets.
  134       English text usually compresses by a factor of 2.5 to 3;
  135       executable files usually compress somewhat less; graphical data
  136       such as raster images may compress much more.
  137 
  138    1.2. Intended audience
  139 
  140       This specification is intended for use by implementors of software
  141       to compress data into "deflate" format and/or decompress data from
  142       "deflate" format.
  143 
  144       The text of the specification assumes a basic background in
  145       programming at the level of bits and other primitive data
  146       representations.  Familiarity with the technique of Huffman coding
  147       is helpful but not required.
  148 
  149    1.3. Scope
  150 
  151       The specification specifies a method for representing a sequence
  152       of bytes as a (usually shorter) sequence of bits, and a method for
  153       packing the latter bit sequence into bytes.
  154 
  155    1.4. Compliance
  156 
  157       Unless otherwise indicated below, a compliant decompressor must be
  158       able to accept and decompress any data set that conforms to all
  159       the specifications presented here; a compliant compressor must
  160       produce data sets that conform to all the specifications presented
  161       here.
  162 
  163    1.5.  Definitions of terms and conventions used
  164 
  165       Byte: 8 bits stored or transmitted as a unit (same as an octet).
  166       For this specification, a byte is exactly 8 bits, even on machines
  167 
  168 
  169 
  170 Deutsch                      Informational                      [Page 3]
  171 
  172 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  173 
  174 
  175       which store a character on a number of bits different from eight.
  176       See below, for the numbering of bits within a byte.
  177 
  178       String: a sequence of arbitrary bytes.
  179 
  180    1.6. Changes from previous versions
  181 
  182       There have been no technical changes to the deflate format since
  183       version 1.1 of this specification.  In version 1.2, some
  184       terminology was changed.  Version 1.3 is a conversion of the
  185       specification to RFC style.
  186 
  187 2. Compressed representation overview
  188 
  189    A compressed data set consists of a series of blocks, corresponding
  190    to successive blocks of input data.  The block sizes are arbitrary,
  191    except that non-compressible blocks are limited to 65,535 bytes.
  192 
  193    Each block is compressed using a combination of the LZ77 algorithm
  194    and Huffman coding. The Huffman trees for each block are independent
  195    of those for previous or subsequent blocks; the LZ77 algorithm may
  196    use a reference to a duplicated string occurring in a previous block,
  197    up to 32K input bytes before.
  198 
  199    Each block consists of two parts: a pair of Huffman code trees that
  200    describe the representation of the compressed data part, and a
  201    compressed data part.  (The Huffman trees themselves are compressed
  202    using Huffman encoding.)  The compressed data consists of a series of
  203    elements of two types: literal bytes (of strings that have not been
  204    detected as duplicated within the previous 32K input bytes), and
  205    pointers to duplicated strings, where a pointer is represented as a
  206    pair <length, backward distance>.  The representation used in the
  207    "deflate" format limits distances to 32K bytes and lengths to 258
  208    bytes, but does not limit the size of a block, except for
  209    uncompressible blocks, which are limited as noted above.
  210 
  211    Each type of value (literals, distances, and lengths) in the
  212    compressed data is represented using a Huffman code, using one code
  213    tree for literals and lengths and a separate code tree for distances.
  214    The code trees for each block appear in a compact form just before
  215    the compressed data for that block.
  216 
  217 
  218 
  219 
  220 
  221 
  222 
  223 
  224 
  225 
  226 Deutsch                      Informational                      [Page 4]
  227 
  228 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  229 
  230 
  231 3. Detailed specification
  232 
  233    3.1. Overall conventions In the diagrams below, a box like this:
  234 
  235          +---+
  236          |   | <-- the vertical bars might be missing
  237          +---+
  238 
  239       represents one byte; a box like this:
  240 
  241          +==============+
  242          |              |
  243          +==============+
  244 
  245       represents a variable number of bytes.
  246 
  247       Bytes stored within a computer do not have a "bit order", since
  248       they are always treated as a unit.  However, a byte considered as
  249       an integer between 0 and 255 does have a most- and least-
  250       significant bit, and since we write numbers with the most-
  251       significant digit on the left, we also write bytes with the most-
  252       significant bit on the left.  In the diagrams below, we number the
  253       bits of a byte so that bit 0 is the least-significant bit, i.e.,
  254       the bits are numbered:
  255 
  256          +--------+
  257          |76543210|
  258          +--------+
  259 
  260       Within a computer, a number may occupy multiple bytes.  All
  261       multi-byte numbers in the format described here are stored with
  262       the least-significant byte first (at the lower memory address).
  263       For example, the decimal number 520 is stored as:
  264 
  265              0        1
  266          +--------+--------+
  267          |00001000|00000010|
  268          +--------+--------+
  269           ^        ^
  270           |        |
  271           |        + more significant byte = 2 x 256
  272           + less significant byte = 8
  273 
  274       3.1.1. Packing into bytes
  275 
  276          This document does not address the issue of the order in which
  277          bits of a byte are transmitted on a bit-sequential medium,
  278          since the final data format described here is byte- rather than
  279 
  280 
  281 
  282 Deutsch                      Informational                      [Page 5]
  283 
  284 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  285 
  286 
  287          bit-oriented.  However, we describe the compressed block format
  288          in below, as a sequence of data elements of various bit
  289          lengths, not a sequence of bytes.  We must therefore specify
  290          how to pack these data elements into bytes to form the final
  291          compressed byte sequence:
  292 
  293              * Data elements are packed into bytes in order of
  294                increasing bit number within the byte, i.e., starting
  295                with the least-significant bit of the byte.
  296              * Data elements other than Huffman codes are packed
  297                starting with the least-significant bit of the data
  298                element.
  299              * Huffman codes are packed starting with the most-
  300                significant bit of the code.
  301 
  302          In other words, if one were to print out the compressed data as
  303          a sequence of bytes, starting with the first byte at the
  304          *right* margin and proceeding to the *left*, with the most-
  305          significant bit of each byte on the left as usual, one would be
  306          able to parse the result from right to left, with fixed-width
  307          elements in the correct MSB-to-LSB order and Huffman codes in
  308          bit-reversed order (i.e., with the first bit of the code in the
  309          relative LSB position).
  310 
  311    3.2. Compressed block format
  312 
  313       3.2.1. Synopsis of prefix and Huffman coding
  314 
  315          Prefix coding represents symbols from an a priori known
  316          alphabet by bit sequences (codes), one code for each symbol, in
  317          a manner such that different symbols may be represented by bit
  318          sequences of different lengths, but a parser can always parse
  319          an encoded string unambiguously symbol-by-symbol.
  320 
  321          We define a prefix code in terms of a binary tree in which the
  322          two edges descending from each non-leaf node are labeled 0 and
  323          1 and in which the leaf nodes correspond one-for-one with (are
  324          labeled with) the symbols of the alphabet; then the code for a
  325          symbol is the sequence of 0's and 1's on the edges leading from
  326          the root to the leaf labeled with that symbol.  For example:
  327 
  328 
  329 
  330 
  331 
  332 
  333 
  334 
  335 
  336 
  337 
  338 Deutsch                      Informational                      [Page 6]
  339 
  340 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  341 
  342 
  343                           /\              Symbol    Code
  344                          0  1             ------    ----
  345                         /    \                A      00
  346                        /\     B               B       1
  347                       0  1                    C     011
  348                      /    \                   D     010
  349                     A     /\
  350                          0  1
  351                         /    \
  352                        D      C
  353 
  354          A parser can decode the next symbol from an encoded input
  355          stream by walking down the tree from the root, at each step
  356          choosing the edge corresponding to the next input bit.
  357 
  358          Given an alphabet with known symbol frequencies, the Huffman
  359          algorithm allows the construction of an optimal prefix code
  360          (one which represents strings with those symbol frequencies
  361          using the fewest bits of any possible prefix codes for that
  362          alphabet).  Such a code is called a Huffman code.  (See
  363          reference [1] in Chapter 5, references for additional
  364          information on Huffman codes.)
  365 
  366          Note that in the "deflate" format, the Huffman codes for the
  367          various alphabets must not exceed certain maximum code lengths.
  368          This constraint complicates the algorithm for computing code
  369          lengths from symbol frequencies.  Again, see Chapter 5,
  370          references for details.
  371 
  372       3.2.2. Use of Huffman coding in the "deflate" format
  373 
  374          The Huffman codes used for each alphabet in the "deflate"
  375          format have two additional rules:
  376 
  377              * All codes of a given bit length have lexicographically
  378                consecutive values, in the same order as the symbols
  379                they represent;
  380 
  381              * Shorter codes lexicographically precede longer codes.
  382 
  383 
  384 
  385 
  386 
  387 
  388 
  389 
  390 
  391 
  392 
  393 
  394 Deutsch                      Informational                      [Page 7]
  395 
  396 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  397 
  398 
  399          We could recode the example above to follow this rule as
  400          follows, assuming that the order of the alphabet is ABCD:
  401 
  402             Symbol  Code
  403             ------  ----
  404             A       10
  405             B       0
  406             C       110
  407             D       111
  408 
  409          I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
  410          lexicographically consecutive.
  411 
  412          Given this rule, we can define the Huffman code for an alphabet
  413          just by giving the bit lengths of the codes for each symbol of
  414          the alphabet in order; this is sufficient to determine the
  415          actual codes.  In our example, the code is completely defined
  416          by the sequence of bit lengths (2, 1, 3, 3).  The following
  417          algorithm generates the codes as integers, intended to be read
  418          from most- to least-significant bit.  The code lengths are
  419          initially in tree[I].Len; the codes are produced in
  420          tree[I].Code.
  421 
  422          1)  Count the number of codes for each code length.  Let
  423              bl_count[N] be the number of codes of length N, N >= 1.
  424 
  425          2)  Find the numerical value of the smallest code for each
  426              code length:
  427 
  428                 code = 0;
  429                 bl_count[0] = 0;
  430                 for (bits = 1; bits <= MAX_BITS; bits++) {
  431                     code = (code + bl_count[bits-1]) << 1;
  432                     next_code[bits] = code;
  433                 }
  434 
  435          3)  Assign numerical values to all codes, using consecutive
  436              values for all codes of the same length with the base
  437              values determined at step 2. Codes that are never used
  438              (which have a bit length of zero) must not be assigned a
  439              value.
  440 
  441                 for (n = 0;  n <= max_code; n++) {
  442                     len = tree[n].Len;
  443                     if (len != 0) {
  444                         tree[n].Code = next_code[len];
  445                         next_code[len]++;
  446                     }
  447 
  448 
  449 
  450 Deutsch                      Informational                      [Page 8]
  451 
  452 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  453 
  454 
  455                 }
  456 
  457          Example:
  458 
  459          Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
  460          3, 2, 4, 4).  After step 1, we have:
  461 
  462             N      bl_count[N]
  463             -      -----------
  464             2      1
  465             3      5
  466             4      2
  467 
  468          Step 2 computes the following next_code values:
  469 
  470             N      next_code[N]
  471             -      ------------
  472             1      0
  473             2      0
  474             3      2
  475             4      14
  476 
  477          Step 3 produces the following code values:
  478 
  479             Symbol Length   Code
  480             ------ ------   ----
  481             A       3        010
  482             B       3        011
  483             C       3        100
  484             D       3        101
  485             E       3        110
  486             F       2         00
  487             G       4       1110
  488             H       4       1111
  489 
  490       3.2.3. Details of block format
  491 
  492          Each block of compressed data begins with 3 header bits
  493          containing the following data:
  494 
  495             first bit       BFINAL
  496             next 2 bits     BTYPE
  497 
  498          Note that the header bits do not necessarily begin on a byte
  499          boundary, since a block does not necessarily occupy an integral
  500          number of bytes.
  501 
  502 
  503 
  504 
  505 
  506 Deutsch                      Informational                      [Page 9]
  507 
  508 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  509 
  510 
  511          BFINAL is set if and only if this is the last block of the data
  512          set.
  513 
  514          BTYPE specifies how the data are compressed, as follows:
  515 
  516             00 - no compression
  517             01 - compressed with fixed Huffman codes
  518             10 - compressed with dynamic Huffman codes
  519             11 - reserved (error)
  520 
  521          The only difference between the two compressed cases is how the
  522          Huffman codes for the literal/length and distance alphabets are
  523          defined.
  524 
  525          In all cases, the decoding algorithm for the actual data is as
  526          follows:
  527 
  528             do
  529                read block header from input stream.
  530                if stored with no compression
  531                   skip any remaining bits in current partially
  532                      processed byte
  533                   read LEN and NLEN (see next section)
  534                   copy LEN bytes of data to output
  535                otherwise
  536                   if compressed with dynamic Huffman codes
  537                      read representation of code trees (see
  538                         subsection below)
  539                   loop (until end of block code recognized)
  540                      decode literal/length value from input stream
  541                      if value < 256
  542                         copy value (literal byte) to output stream
  543                      otherwise
  544                         if value = end of block (256)
  545                            break from loop
  546                         otherwise (value = 257..285)
  547                            decode distance from input stream
  548 
  549                            move backwards distance bytes in the output
  550                            stream, and copy length bytes from this
  551                            position to the output stream.
  552                   end loop
  553             while not last block
  554 
  555          Note that a duplicated string reference may refer to a string
  556          in a previous block; i.e., the backward distance may cross one
  557          or more block boundaries.  However a distance cannot refer past
  558          the beginning of the output stream.  (An application using a
  559 
  560 
  561 
  562 Deutsch                      Informational                     [Page 10]
  563 
  564 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  565 
  566 
  567          preset dictionary might discard part of the output stream; a
  568          distance can refer to that part of the output stream anyway)
  569          Note also that the referenced string may overlap the current
  570          position; for example, if the last 2 bytes decoded have values
  571          X and Y, a string reference with <length = 5, distance = 2>
  572          adds X,Y,X,Y,X to the output stream.
  573 
  574          We now specify each compression method in turn.
  575 
  576       3.2.4. Non-compressed blocks (BTYPE=00)
  577 
  578          Any bits of input up to the next byte boundary are ignored.
  579          The rest of the block consists of the following information:
  580 
  581               0   1   2   3   4...
  582             +---+---+---+---+================================+
  583             |  LEN  | NLEN  |... LEN bytes of literal data...|
  584             +---+---+---+---+================================+
  585 
  586          LEN is the number of data bytes in the block.  NLEN is the
  587          one's complement of LEN.
  588 
  589       3.2.5. Compressed blocks (length and distance codes)
  590 
  591          As noted above, encoded data blocks in the "deflate" format
  592          consist of sequences of symbols drawn from three conceptually
  593          distinct alphabets: either literal bytes, from the alphabet of
  594          byte values (0..255), or <length, backward distance> pairs,
  595          where the length is drawn from (3..258) and the distance is
  596          drawn from (1..32,768).  In fact, the literal and length
  597          alphabets are merged into a single alphabet (0..285), where
  598          values 0..255 represent literal bytes, the value 256 indicates
  599          end-of-block, and values 257..285 represent length codes
  600          (possibly in conjunction with extra bits following the symbol
  601          code) as follows:
  602 
  603 
  604 
  605 
  606 
  607 
  608 
  609 
  610 
  611 
  612 
  613 
  614 
  615 
  616 
  617 
  618 Deutsch                      Informational                     [Page 11]
  619 
  620 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  621 
  622 
  623                  Extra               Extra               Extra
  624             Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
  625             ---- ---- ------     ---- ---- -------   ---- ---- -------
  626              257   0     3       267   1   15,16     277   4   67-82
  627              258   0     4       268   1   17,18     278   4   83-98
  628              259   0     5       269   2   19-22     279   4   99-114
  629              260   0     6       270   2   23-26     280   4  115-130
  630              261   0     7       271   2   27-30     281   5  131-162
  631              262   0     8       272   2   31-34     282   5  163-194
  632              263   0     9       273   3   35-42     283   5  195-226
  633              264   0    10       274   3   43-50     284   5  227-257
  634              265   1  11,12      275   3   51-58     285   0    258
  635              266   1  13,14      276   3   59-66
  636 
  637          The extra bits should be interpreted as a machine integer
  638          stored with the most-significant bit first, e.g., bits 1110
  639          represent the value 14.
  640 
  641                   Extra           Extra               Extra
  642              Code Bits Dist  Code Bits   Dist     Code Bits Distance
  643              ---- ---- ----  ---- ----  ------    ---- ---- --------
  644                0   0    1     10   4     33-48    20    9   1025-1536
  645                1   0    2     11   4     49-64    21    9   1537-2048
  646                2   0    3     12   5     65-96    22   10   2049-3072
  647                3   0    4     13   5     97-128   23   10   3073-4096
  648                4   1   5,6    14   6    129-192   24   11   4097-6144
  649                5   1   7,8    15   6    193-256   25   11   6145-8192
  650                6   2   9-12   16   7    257-384   26   12  8193-12288
  651                7   2  13-16   17   7    385-512   27   12 12289-16384
  652                8   3  17-24   18   8    513-768   28   13 16385-24576
  653                9   3  25-32   19   8   769-1024   29   13 24577-32768
  654 
  655       3.2.6. Compression with fixed Huffman codes (BTYPE=01)
  656 
  657          The Huffman codes for the two alphabets are fixed, and are not
  658          represented explicitly in the data.  The Huffman code lengths
  659          for the literal/length alphabet are:
  660 
  661                    Lit Value    Bits        Codes
  662                    ---------    ----        -----
  663                      0 - 143     8          00110000 through
  664                                             10111111
  665                    144 - 255     9          110010000 through
  666                                             111111111
  667                    256 - 279     7          0000000 through
  668                                             0010111
  669                    280 - 287     8          11000000 through
  670                                             11000111
  671 
  672 
  673 
  674 Deutsch                      Informational                     [Page 12]
  675 
  676 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  677 
  678 
  679          The code lengths are sufficient to generate the actual codes,
  680          as described above; we show the codes in the table for added
  681          clarity.  Literal/length values 286-287 will never actually
  682          occur in the compressed data, but participate in the code
  683          construction.
  684 
  685          Distance codes 0-31 are represented by (fixed-length) 5-bit
  686          codes, with possible additional bits as shown in the table
  687          shown in Paragraph 3.2.5, above.  Note that distance codes 30-
  688          31 will never actually occur in the compressed data.
  689 
  690       3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
  691 
  692          The Huffman codes for the two alphabets appear in the block
  693          immediately after the header bits and before the actual
  694          compressed data, first the literal/length code and then the
  695          distance code.  Each code is defined by a sequence of code
  696          lengths, as discussed in Paragraph 3.2.2, above.  For even
  697          greater compactness, the code length sequences themselves are
  698          compressed using a Huffman code.  The alphabet for code lengths
  699          is as follows:
  700 
  701                0 - 15: Represent code lengths of 0 - 15
  702                    16: Copy the previous code length 3 - 6 times.
  703                        The next 2 bits indicate repeat length
  704                              (0 = 3, ... , 3 = 6)
  705                           Example:  Codes 8, 16 (+2 bits 11),
  706                                     16 (+2 bits 10) will expand to
  707                                     12 code lengths of 8 (1 + 6 + 5)
  708                    17: Repeat a code length of 0 for 3 - 10 times.
  709                        (3 bits of length)
  710                    18: Repeat a code length of 0 for 11 - 138 times
  711                        (7 bits of length)
  712 
  713          A code length of 0 indicates that the corresponding symbol in
  714          the literal/length or distance alphabet will not occur in the
  715          block, and should not participate in the Huffman code
  716          construction algorithm given earlier.  If only one distance
  717          code is used, it is encoded using one bit, not zero bits; in
  718          this case there is a single code length of one, with one unused
  719          code.  One distance code of zero bits means that there are no
  720          distance codes used at all (the data is all literals).
  721 
  722          We can now define the format of the block:
  723 
  724                5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
  725                5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
  726                4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)
  727 
  728 
  729 
  730 Deutsch                      Informational                     [Page 13]
  731 
  732 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  733 
  734 
  735                (HCLEN + 4) x 3 bits: code lengths for the code length
  736                   alphabet given just above, in the order: 16, 17, 18,
  737                   0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
  738 
  739                   These code lengths are interpreted as 3-bit integers
  740                   (0-7); as above, a code length of 0 means the
  741                   corresponding symbol (literal/length or distance code
  742                   length) is not used.
  743 
  744                HLIT + 257 code lengths for the literal/length alphabet,
  745                   encoded using the code length Huffman code
  746 
  747                HDIST + 1 code lengths for the distance alphabet,
  748                   encoded using the code length Huffman code
  749 
  750                The actual compressed data of the block,
  751                   encoded using the literal/length and distance Huffman
  752                   codes
  753 
  754                The literal/length symbol 256 (end of data),
  755                   encoded using the literal/length Huffman code
  756 
  757          The code length repeat codes can cross from HLIT + 257 to the
  758          HDIST + 1 code lengths.  In other words, all code lengths form
  759          a single sequence of HLIT + HDIST + 258 values.
  760 
  761    3.3. Compliance
  762 
  763       A compressor may limit further the ranges of values specified in
  764       the previous section and still be compliant; for example, it may
  765       limit the range of backward pointers to some value smaller than
  766       32K.  Similarly, a compressor may limit the size of blocks so that
  767       a compressible block fits in memory.
  768 
  769       A compliant decompressor must accept the full range of possible
  770       values defined in the previous section, and must accept blocks of
  771       arbitrary size.
  772 
  773 4. Compression algorithm details
  774 
  775    While it is the intent of this document to define the "deflate"
  776    compressed data format without reference to any particular
  777    compression algorithm, the format is related to the compressed
  778    formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
  779    since many variations of LZ77 are patented, it is strongly
  780    recommended that the implementor of a compressor follow the general
  781    algorithm presented here, which is known not to be patented per se.
  782    The material in this section is not part of the definition of the
  783 
  784 
  785 
  786 Deutsch                      Informational                     [Page 14]
  787 
  788 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  789 
  790 
  791    specification per se, and a compressor need not follow it in order to
  792    be compliant.
  793 
  794    The compressor terminates a block when it determines that starting a
  795    new block with fresh trees would be useful, or when the block size
  796    fills up the compressor's block buffer.
  797 
  798    The compressor uses a chained hash table to find duplicated strings,
  799    using a hash function that operates on 3-byte sequences.  At any
  800    given point during compression, let XYZ be the next 3 input bytes to
  801    be examined (not necessarily all different, of course).  First, the
  802    compressor examines the hash chain for XYZ.  If the chain is empty,
  803    the compressor simply writes out X as a literal byte and advances one
  804    byte in the input.  If the hash chain is not empty, indicating that
  805    the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
  806    same hash function value) has occurred recently, the compressor
  807    compares all strings on the XYZ hash chain with the actual input data
  808    sequence starting at the current point, and selects the longest
  809    match.
  810 
  811    The compressor searches the hash chains starting with the most recent
  812    strings, to favor small distances and thus take advantage of the
  813    Huffman encoding.  The hash chains are singly linked. There are no
  814    deletions from the hash chains; the algorithm simply discards matches
  815    that are too old.  To avoid a worst-case situation, very long hash
  816    chains are arbitrarily truncated at a certain length, determined by a
  817    run-time parameter.
  818 
  819    To improve overall compression, the compressor optionally defers the
  820    selection of matches ("lazy matching"): after a match of length N has
  821    been found, the compressor searches for a longer match starting at
  822    the next input byte.  If it finds a longer match, it truncates the
  823    previous match to a length of one (thus producing a single literal
  824    byte) and then emits the longer match.  Otherwise, it emits the
  825    original match, and, as described above, advances N bytes before
  826    continuing.
  827 
  828    Run-time parameters also control this "lazy match" procedure.  If
  829    compression ratio is most important, the compressor attempts a
  830    complete second search regardless of the length of the first match.
  831    In the normal case, if the current match is "long enough", the
  832    compressor reduces the search for a longer match, thus speeding up
  833    the process.  If speed is most important, the compressor inserts new
  834    strings in the hash table only when no match was found, or when the
  835    match is not "too long".  This degrades the compression ratio but
  836    saves time since there are both fewer insertions and fewer searches.
  837 
  838 
  839 
  840 
  841 
  842 Deutsch                      Informational                     [Page 15]
  843 
  844 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  845 
  846 
  847 5. References
  848 
  849    [1] Huffman, D. A., "A Method for the Construction of Minimum
  850        Redundancy Codes", Proceedings of the Institute of Radio
  851        Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
  852 
  853    [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
  854        Compression", IEEE Transactions on Information Theory, Vol. 23,
  855        No. 3, pp. 337-343.
  856 
  857    [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
  858        available in ftp://ftp.uu.net/pub/archiving/zip/doc/
  859 
  860    [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
  861        available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
  862 
  863    [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
  864        encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
  865 
  866    [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
  867        Comm. ACM, 33,4, April 1990, pp. 449-459.
  868 
  869 6. Security Considerations
  870 
  871    Any data compression method involves the reduction of redundancy in
  872    the data.  Consequently, any corruption of the data is likely to have
  873    severe effects and be difficult to correct.  Uncompressed text, on
  874    the other hand, will probably still be readable despite the presence
  875    of some corrupted bytes.
  876 
  877    It is recommended that systems using this data format provide some
  878    means of validating the integrity of the compressed data.  See
  879    reference [3], for example.
  880 
  881 7. Source code
  882 
  883    Source code for a C language implementation of a "deflate" compliant
  884    compressor and decompressor is available within the zlib package at
  885    ftp://ftp.uu.net/pub/archiving/zip/zlib/.
  886 
  887 8. Acknowledgements
  888 
  889    Trademarks cited in this document are the property of their
  890    respective owners.
  891 
  892    Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark
  893    Adler wrote the related software described in this specification.
  894    Glenn Randers-Pehrson converted this document to RFC and HTML format.
  895 
  896 
  897 
  898 Deutsch                      Informational                     [Page 16]
  899 
  900 RFC 1951      DEFLATE Compressed Data Format Specification      May 1996
  901 
  902 
  903 9. Author's Address
  904 
  905    L. Peter Deutsch
  906    Aladdin Enterprises
  907    203 Santa Margarita Ave.
  908    Menlo Park, CA 94025
  909 
  910    Phone: (415) 322-0103 (AM only)
  911    FAX:   (415) 322-1734
  912    EMail: <ghost@aladdin.com>
  913 
  914    Questions about the technical content of this specification can be
  915    sent by email to:
  916 
  917    Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
  918    Mark Adler <madler@alumni.caltech.edu>
  919 
  920    Editorial comments on this specification can be sent by email to:
  921 
  922    L. Peter Deutsch <ghost@aladdin.com> and
  923    Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
  924 
  925 
  926 
  927 
  928 
  929 
  930 
  931 
  932 
  933 
  934 
  935 
  936 
  937 
  938 
  939 
  940 
  941 
  942 
  943 
  944 
  945 
  946 
  947 
  948 
  949 
  950 
  951 
  952 
  953 
  954 Deutsch                      Informational                     [Page 17]
  955
Cache object: dfc3bde541578ec861321a07c50e7f3a
[ source navigation ] [ diff markup ] [ identifier search ] [ freetext search ] [ file search ] [ list types ] [ track identifier ]
This page is part of the FreeBSD/Linux Linux Kernel Cross-Reference, and was automatically generated using a modified version of the LXR engine.
FreeBSD/Linux Kernel Cross Reference sys/contrib/zlib/doc/rfc1951.txt

FreeBSD/Linux Kernel Cross Reference
sys/contrib/zlib/doc/rfc1951.txt