ZLIB + Deflate file format
This writing is a quick reference for those who would like to implement the inflate algorithm on their own or would like to make their compressors produce a bit stream in deflate format.
Overall ZLIB format
Bits are numbered from lowest to highest starting from zero. So 0 is the lowest order bit, 7 is the highest order one.
Size in bytes | Name | Description |
---|---|---|
1 | compression method and window size (CMF) |
This is essentially a magic number. It's 0x78 most often.
|
1 | checksum and flags (FLG) |
|
4 (optional) | preset dictionary checksum | If the bit 5 in FLG is set, this will be the Alder32 checksum of the data that's initially fed to the decompressor without consuming input. So the compressed data can refer to it. This can be used, for example, to decompress a stream that consists of multiple parts of ZLIB compressed streams. When bit 5 in FLG is not set, this field is not present. This checksum is in big-endian byte order (most significant byte first). |
varies | compressed data | The compressed data itself. We will describe the format of this later. |
4 | checksum | This is the Alder32 checksum at the end of the stream, used to validate the integrity of the decompressed data. Preset dictionaries are not considered in this checksum. Later we will describe how to calculate it. This checksum is in big-endian byte order (most significant byte first). |
Deflate compression format
Deflate is a frequently used compression method in implementations of GZIP, ZIP and others. This section discusses the format of this compression format. The deflate algorithm uses a combination of LZ algorithm and Huffman coding.
Before moving forward we should understand how is the data encoded.
The compressed data is a bit stream on the lowest level. Bytes of the compressed stream is consumed one after the other. Bits in these bytes are consumed by consuming the lowest order bit first and the highest order bit last.
The data contains bit fields. Bits in these fields are in little-endian bit order. So the least significant bit will be read first and the most significant one is read last.
Overview of the LZ compressed data
The point of the LZ algorithm is the elimination of duplicate sequences from the input. This means we don't write the same thing twice, instead we write a reference where the is sequence is found. If the sequence is long this can save space.
Let Δ(X, Y) mean "Go back X places and copy Y bytes from there to the output". This allows the decompressor copy sequences to the output that's previously occurred in the output. Consider the following string: ABCDABCDEFGHABCDEFG. At the 5th position we can see that the sequence ABC repeats, so we can replace it by the copy code: ABCDΔ(4,3)DEFGHABCDEFG. Then we can see the sequence ABCDEFG also repeats so instead of writing it down again we can replace it with a copy command too: ABCDΔ(4,3)DEFGHΔ(8,7). And these copy codes can shorten the data greatly when very long sequences repeat.
The sequence and its next occurrence may overlap. Consider the following: ABCABCABCABCABC. This can be reduced to: ABCΔ(3,12). So we tell the decompressor go back 3 positions and copy 12 bytes from there. There are not enough bytes in the output to copy 12 bytes right away, but as the decompressor executes the copy the output is being written so the decompressor will copy these previously written bytes again. And this allows the very efficient storage of repeated data. But this should be a note for implementers: you shouldn't use memory copying functions to copy because that would break this behavior.
Encoding of the LZ compressed data in Deflate
The deflate format uses two alphabets to encode LZ compressed data. One is the literal/length alphabet and the other is the distance alphabet. The literal/length alphabet contains the symbols for literals. And the length part of the copy command. The distance alphabet is a supplementary alphabet used to determine the distance.
When we say symbol we refer to the element of the alphabet. When we say code we refer to the bit pattern that represent the symbol.
The symbols of the literal/length alphabet means the following:
- 0..255: Literal bytes, just copy the value to the output.
- 256: The end of block marker. This denotes the end of the compressed data.
- 257..285: Copy command. This also encodes the length part of a copy command. After each symbol the specified number of extra bits must be read and the value must be added to length base to get the actual length for the copy command. When the number of extra bits to read is 0 then the value is exact an no extra bits is read. The length bases and the corresponding extra bits can be seen in the following table:
Code | Length base | Extra bits |
---|---|---|
257 | 3 | 0 |
258 | 4 | 0 |
259 | 5 | 0 |
260 | 6 | 0 |
261 | 7 | 0 |
262 | 8 | 0 |
263 | 9 | 0 |
264 | 10 | 0 |
265 | 11 | 1 |
266 | 13 | 1 |
Code | Length base | Extra bits |
---|---|---|
267 | 15 | 1 |
268 | 17 | 1 |
269 | 19 | 2 |
270 | 23 | 2 |
271 | 27 | 2 |
272 | 31 | 2 |
273 | 35 | 3 |
274 | 43 | 3 |
275 | 51 | 3 |
276 | 59 | 3 |
Code | Length base | Extra bits |
---|---|---|
277 | 67 | 4 |
278 | 83 | 4 |
279 | 99 | 4 |
280 | 115 | 4 |
281 | 131 | 5 |
282 | 163 | 5 |
283 | 195 | 5 |
284 | 227 | 5 |
285 | 258 | 0 |
Although these values indeed have a mathematical formula, it's easier to shove them into an array instead of calculating it.
After we know the length we need to read the distance using the distance alphabet. Each symbol in the distance alphabet specifies a distance base and a number of extra bits to read to be added to the distance base just like with lengths.
Code | Length base | Extra bits |
---|---|---|
0 | 1 | 0 |
1 | 2 | 0 |
2 | 3 | 0 |
3 | 4 | 0 |
4 | 5 | 1 |
5 | 7 | 1 |
6 | 9 | 2 |
7 | 13 | 2 |
8 | 17 | 3 |
9 | 25 | 3 |
Code | Length base | Extra bits |
---|---|---|
10 | 33 | 4 |
11 | 49 | 4 |
12 | 65 | 5 |
13 | 97 | 5 |
14 | 129 | 6 |
15 | 193 | 6 |
16 | 257 | 7 |
17 | 385 | 7 |
18 | 513 | 8 |
19 | 769 | 8 |
Code | Length base | Extra bits |
---|---|---|
20 | 1025 | 9 |
21 | 1537 | 9 |
22 | 2049 | 10 |
23 | 3073 | 10 |
24 | 4097 | 11 |
25 | 6145 | 11 |
26 | 8193 | 12 |
27 | 12289 | 12 |
28 | 16385 | 13 |
29 | 24577 | 13 |
So for example the Δ(13,12) copy command is encoded by first having the symbol 265 (length base 11), then an extra bit read that will contain 1 to increase it to 12, followed by the distance symbol of 7 from which gives the distance base 13 followed by reading 2 extra bits of 0 to keep it 13.
Overview of Huffman coding
The LZ algorithm already reduced the size of the data. But it can be reduced further by assigning shorter code to symbols that appear often and assign longer codes to symbols that appear rarely. This is what's the Huffman coding is about.
The longest allowed code length in the deflate format is 15.
The shortest one is 1.
The code length 0 reserved to mark the symbols that doesn't occur in the decompressed data at all.
The deflate doesn't store the Huffman codes themselves instead it stores only the lengths of the code, and calculates the actual code from them.
To determine the Huffman codes we start from an array of lengths.
Consider the following list of lengths: [2, 0, 3, 4, 4, 0, 2, 4, 4, 0, 4, 5, 5]
.
These are the code lengths corresponding to 13 symbols indexed from 0 to 12.
We can see the symbols 1, 5 and 9 are unused because their code length is 0.
The first step is counting the occurrences of each code length and populate an array with the counts.
In our example this is: [0, 0, 2, 1, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
.
Let's call this array $F$.
And refer to the $i$th element of it as $F_i$.
So for example $F_3 = 1$, because there are 1 elements with length 3.
The 0 length is reserved for unused symbols, so there are no code with 0 length, so $F_0 = 0$ by definition.
The second step is determining the initial codes corresponding to each code length.
So another array of 16 elements is created, let's call it $N$.
Let's call the $i$th element of it as $N_i$.
The formula to calculate this array is: $N_i = 2 \left( N_{i-1} + F_{i-1}\right)$.
The $N_0$ is set 0 by definition.
So the generation should start from $i = 1$.
This array will be in our example: [0, 0, 0, 4, 10, 30, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
The third step is computing the actual codes.
Let's denote the $i$th symbol's code with $S_i$.
The operation to calculate the code is: $S_i = N_{F_i}; N_{F_i} := N_{F_i} + 1$.
So the $N$ values increase after each assignment as we are doing progress.
Using this step we can obtain the array of Huffman codes for each symbol: [0, 0, 4, 10, 11, 0, 1, 12, 13, 0, 14, 30, 31]
.
If we express them in binary, considering the code lengths, we have these codes: [00, 0, 100, 1010, 1011, 0, 01, 1100, 1101, 0, 1110, 11110, 11111]
.
A these codes will represent the symbols.
Indexes are shown to see better: [00 -> 0, 0, 100 -> 2, 1010 -> 3, 1011 -> 4, 0, 01 -> 6, 1100 -> 7, 1101 -> 8, 0, 1110 -> 10, 11110 -> 11, 11111 -> 12]
Here is an example Python code used to calculate the Huffman codes.
"""Huffman stuff"""
# Example LENGTHS
LENGTHS = [2, 0, 3, 4, 4, 0, 2, 4, 4, 0, 4, 5, 5]
# Maximum Huffman code length is 15
MAX_HUFF = 16
def genhuff(lengths):
"Generates Huffman codes"
# Generate zeroed arrays.
n_lengths = [0] * MAX_HUFF
next_codes = [0] * MAX_HUFF
huff_codes = [0] * len(lengths)
# Count lengths
for length in lengths:
n_lengths[length] += 1
print n_lengths
# Calculate next code
n_lengths[0] = 0
for i in range(1, len(n_lengths)):
next_codes[i] = 2*(next_codes[i - 1] + n_lengths[i - 1])
print next_codes
# Assign Huffman codes
for i in range(0, len(huff_codes)):
if lengths[i] == 0:
continue
huff_codes[i] = next_codes[lengths[i]]
next_codes[lengths[i]] += 1
# Print it
print huff_codes
print '[{}]'.format(','.join(bin(x)[2:].zfill(y) for x, y in zip(huff_codes, lengths)))
return
genhuff(LENGTHS)
These codes are stored in big-endian bit order in the file. For example the symbol 4 has the code 1011. This means we read the bits 1, 0, 1, 1 in this order, not vice versa unlike any other bit fields. The reason for this is that Huffman code is prefix code. This means no code word is a prefix of other code word. In example above, there is no symbol which has the code 1, 10 or 101. So any sequence of bits will unambiguously map to the symbols denoted by the codes (usually). Without needing to specify length before or an end symbol after each code. Consider the bit sequence 10101010101011111000101000010110111101010, which I randomly typed in. This represent these code words: 1010 1010 1010 11111 00 01 01 00 00 1011 01 1110 1010. Which encode: 3, 3, 3, 12, 0, 6, 6, 0, 0, 4, 6, 10, 4.
Code length alphabet
We only store the lengths of the literal/length and the distance alphabets. This means 315 lengths at most. This is a lot. It can be shortened by encoding with another alphabet: the code length alphabet. It has 19 symbols.
- 0..15: literal code length
- 16: Repeat previous code 3..6 times. Read the next 2 bits and add 3 to it to get the amount to repeat.
- 17: Repeat code length of 0 3..10 times. Read the next 3 bits and add 3 to it.
- 18: Repeat code length of 0 11..138 times. Read the next 7 bits and add 11 to it.
So basically having the sequence, 3, 4, 16 (extra bits: 11), 17 (extra bits: 100) means the code length sequence. 3, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0.
And even this encoding is further compressed using Huffman code which reduces its size even further.
Layout of the compressed data
The compressed data is a series of blocks. Each block can be arbitrary long. The format of the block is the following:
Size in bits | Name | Description |
---|---|---|
1 | final block flag (BFINAL) | When this bit is set it indicates that this block is the last block in the compressed data. |
2 | block type (BTYPE) |
The type of block:
|
varies | the actual block data | The layout of this data depends on the type of the block. |
Layout of stored blocks
When BTYPE is 0 that means the block is a stored. When this happens the decompressor discards the remaining bits of the current byte it's consuming and resumes reading the streams byte-by-byte. The format of the stored block is:
Size in bytes | Name | Description |
---|---|---|
2 | length of the data (LEN) | 2 byte length of the block. The length is stored in little-endian order. And this means the block length is limited to 65536 bytes. |
2 | bitwise negation of LEN (NLEN) | This is a sort of check word. Its bits must be set the opposite way than LEN. |
LEN | block data | The stored block data. It should be just appended to the output. |
After the block is read. The bit based processing resumes starting from the next byte after the block.
Layout of the compressed blocks
The layout of the compressed blocks is the following:
Size in bits | Name | Description |
---|---|---|
5 | number of symbols in the literal length alphabet (NLIT - 257) | There are NLIT symbols in the literal/length alphabet. Note the value of this field is 257 less than the actual number. |
5 | number of symbols in the distance alphabet (NDIST - 1) | There are NDIST symbols in the distance alphabet. Note the value of this field is 1 less than the actual number. |
4 | number of symbols in the code length alphabet (NCLEN - 4) | There are NCLEN symbols in the code length alphabet. Note the value of this field is 4 less than the actual number. |
3*NCLEN | lengths of the prefix codes used to encode the symbols in the code length alphabet |
These are 3 bit long bit fields that describe the prefix code lengths used for the code length alphabet. Using these values the Huffman tree for the code length alphabet can be built. The index order in this table is weird. The length codes are assigned in the following order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. If NCLEN is smaller than the maximum possible 19, the unset fields are set zero. |
varies | lengths of the codes for the literal/length alphabet and the distance alphabet | There are NLIT + NDIST number of elements in this list encoded by the code length alphabet and whose symbols are encoded using the prefix code described by previous row. First the prefix code lengths for the the literal/length alphabet, then the prefix code lengths for the distance alphabet. |
varies | the compressed data itself | After processing the lengths and building the Huffman tree for the literal/length alphabet and the distance alphabet, the compressed data follows. It's a series of commands that describe how to generate the uncompressed file. |
Layout of compressed blocks using a fixed Huffman tree
In this case the block contains the compressed data only. The code lengths for both the literal/length alphabet and the distance alphabet is hard coded. All codes of the distance alphabet is 5 bits long. For the literal/length alphabet the code lengths vary in the following intervals: 8 bit long for the symbols between 0..143, 9 bits between 144..255, 7 bits between 256..279, 8 bits between 280..287. (286 and 287 won't appear within the compressed data, but takes part in the code generation.)
Alder32 checksum
The alder32 checksum is computed the following way: Let's have two integer variables $A$ and $B$. Initialitze them as: $A = 1$, $B = 0$. Let's have $C$ to denote the next byte in the stream. For each byte in the file going from start to end update the two variables the following way: $A := (A + C) \text{ mod } 65521$, $B := (A + B) \text{ mod } 65521$. Once all bytes processed, the checksum will be: $65536 B + A$.