Zip Files: History, Explanation, and Implementation



I have long wondered how data compression works, including in Zip files. One day I decided to satisfy my curiosity: to learn how compression works and to write my own Zip program. The implementation turned into an exciting programming exercise: there is great pleasure in building a well-oiled machine that takes data apart, packs its bits into a more efficient representation, and then puts everything back together again. I hope you will enjoy reading about it too.

This article explains in detail how Zip files and their compression scheme work: LZ77 compression, the Huffman algorithm, the Deflate algorithm, and more. It covers the history of the technology and walks through a reasonably efficient implementation written from scratch in C. The source code is here: hwzip-1.0.zip.

I am very grateful to Ange Albertini, Gynvael Coldwind, Fabian Giesen, Jonas Skeppstedt (web), Primiano Tucci, and Nico Weber, who provided valuable feedback on drafts of this article.

Contents



History


PKZip


In the eighties and early nineties, before the Internet became widespread, computer enthusiasts used dial-up modems to connect over the telephone network to Bulletin Board Systems (BBSes). A BBS was an interactive computer system that let users send messages, play games, and share files. To get online, you only needed a computer, a modem, and the phone number of a good BBS. The numbers were published in computer magazines and on other BBSes.

An important tool for distributing files was the archiver. It stores one or more files in a single archive file, which is more convenient for storing or transmitting information. Ideally, the archive also compresses the files to save space and transmission time. In the BBS days, a popular archiver was Arc, written by Tom Henderson of System Enhancement Associates (SEA), a small company he founded with his brother-in-law.

In the late 1980s, programmer Phil Katz released his own version of Arc, PKArc. It was compatible with SEA's Arc, but ran faster thanks to routines written in assembly language, and it added a new compression method. The program became popular, and Katz quit his job and founded PKWare to focus on its further development. According to legend, most of the work took place at his mother's kitchen table in Glendale, Wisconsin.


Photo of Phil Katz from an article in the Milwaukee Sentinel, September 19, 1994.

SEA, however, was not happy with Katz's initiative. The company accused him of trademark and copyright infringement. The litigation, and the accompanying controversy on BBSes and in the PC world, became known as the Arc Wars. In the end, the dispute was settled in SEA's favor.

Abandoning Arc, Katz created a new archive format in 1989, which he called Zip and made freely available to the public.

Katz's program for creating such files was called PKZip, and it soon spread throughout the worlds of BBSes and PCs.

One aspect that most likely contributed to the success of the Zip format is that PKZip shipped with documentation, the Application Note, which explained in detail how the format works. This allowed others to study the format and create programs that generate, extract, or otherwise interact with Zip files.

Zip is a lossless compression format: after decompression, the data is exactly the same as before compression. The algorithm looks for redundancy in the source data and represents the information more efficiently. This differs from lossy compression, used in formats such as JPEG and MP3, where some of the information that is less noticeable to the human eye or ear is discarded during compression.

PKZip was distributed as shareware: it could be freely used and copied, but the author encouraged users to "register" the program. For $47, you got printed instructions, premium support, and an enhanced version of the application.


One of the key releases of PKZip was version 2.04c, published on December 28, 1992 (version 2.04g followed shortly after). It used the Deflate compression algorithm by default, and this release determined how compression in Zip files would develop from then on (article devoted to the release).


Since then, the Zip format has been adopted by many other file formats. For example, Java archives (.jar), Android application packages (.apk), and Microsoft Office .docx files all use the Zip format. Many formats and protocols reuse its compression algorithm, Deflate. For example, this web page was probably delivered to your browser as gzip, a format that uses Deflate compression.

Phil Katz died in 2000. PKWare still exists and maintains the Zip format, although the company now focuses mainly on data protection software.

Info-zip and zlib


Soon after PKZip's release in 1989, other programs for unpacking Zip files began to appear. For example, an unzip program that could extract Zip files on Unix systems. In March 1990, a mailing list called Info-ZIP was created.

The Info-ZIP group released the free, open-source programs unzip and zip, used to extract and create Zip files. The code was ported to many systems, and these are still the standard Zip programs on Unix systems. All of this helped increase the popularity of Zip files further.

Eventually, the Info-ZIP code that performed Deflate compression and decompression was extracted into a separate library, zlib, written by Jean-loup Gailly (compression) and Mark Adler (decompression).


Jean-loup Gailly (left) and Mark Adler (right) at the 2009 USENIX STUG Award .

One reason for creating the library was to make Deflate compression convenient to use in other applications and formats, for example the new gzip and PNG. These new formats were intended to replace Compress and GIF, which used the patent-encumbered LZW algorithm.

As part of creating these formats, Peter Deutsch wrote a specification for Deflate and published it as Internet RFC 1951 in May 1996. It turned out to be a more accessible description than the original PKZip Application Note.

Today zlib is used everywhere. It is probably responsible for compressing this page on the web server and decompressing it in your browser, and most Zip files are compressed and decompressed using zlib.

WinZip


Many of those who never used PKZip itself used WinZip instead. PC users moved from DOS to Windows, and from PKZip to WinZip.

It all started with a project by programmer Nico Mak, who wrote software for OS/2 at Mansfield Software Group in Storrs-Mansfield, Connecticut. Mak used Presentation Manager, the graphical user interface of OS/2, and was frustrated that he had to switch from the file manager to DOS commands every time he wanted to create Zip files.

Mak wrote a simple GUI program that worked with Zip files directly in Presentation Manager, named it PMZip, and released it as shareware in 1990.

OS/2 did not succeed, and Microsoft Windows took over the PC world. In 1991, Mak decided to learn to write Windows programs, and his first project was to port his Zip application to the new OS. WinZip 1.00 was released in April 1991. It was distributed as shareware with a 21-day trial period and a $29 registration fee. It looked like this:


Early versions of WinZip used PKZip under the hood. But starting with version 5.0 in 1993, it used code from Info-ZIP to process Zip files directly. The user interface also gradually evolved.


WinZip 6.3 under Windows 3.11 for Workgroups.

WinZip was one of the most popular shareware programs of the 1990s. But it eventually lost relevance as operating systems gained built-in support for Zip files. Windows has handled them as "compressed folders" since 2001 (Windows XP), using the DynaZip library.

The company was originally called Nico Mak Computing. In 2000 it was renamed WinZip Computing, and around that time Mak left it. In 2005 the company was sold to Vector Capital, and it eventually ended up owned by Corel, which still offers WinZip as a product.

Lempel-Ziv Compression (LZ77)


Zip compression consists of two main ingredients: Lempel-Ziv compression and Huffman code.

One way to compress text is to create a list of common words or phrases and replace occurrences of those words in the text with references to the list. For example, the long word "compression" in the source text could be represented as #1234, where 1234 is the word's position in the list. This is called dictionary compression.

But for general-purpose compression, this method has several drawbacks. First, what exactly should go into the dictionary? The source data can be in different languages; it may not even be human-readable text. And if the dictionary is not agreed upon in advance between compression and decompression, it has to be stored and transmitted along with the compressed data, which reduces the benefit of compression.

An elegant solution to this problem is to use the source data itself as the dictionary. In their 1977 paper A Universal Algorithm for Sequential Data Compression, Jacob Ziv and Abraham Lempel (then at the Technion) proposed a compression scheme in which the source data is represented as a sequence of triplets:

(distance, length, next character)

where distance and length form a back reference to a sequence of characters to copy from an earlier position in the generated output, and next character is the next character to emit.


Abraham Lempel and Jacob Ziv.

Consider the following lines:

It was the best of times,
it was the worst of times,

In the second line, the sequence "t was the w" can be represented as (26, 10, w), since it is recreated by copying 10 characters starting 26 characters back, followed by the letter "w". For characters that have not appeared before, zero-length back references are used. For example, the initial "I" can be represented as (0, 0, I).

This scheme is called Lempel-Ziv compression, or LZ77 compression. However, practical implementations of the algorithm usually do not use the next-character part of the triplet. Instead, literal characters are emitted individually, and (distance, length) pairs are used for back references (this variant is called LZSS compression). How literals and back references are encoded is a separate question; we will get to it when we examine the Deflate algorithm.

This text:

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

Can be compressed into this:

It was the best of times,
i(26,10)wor(27,24)age(25,4)wisdom(26,20)
foolishnes(57,14)epoch(33,4)belief(28,22)incredulity
(33,13)season(34,4)Light(28,23)Dark(120,17)
spring(31,4)hope(231,14)inter(27,4)despair,
we had everyth(57,4)before us(29,9)no(26,20)
we(12,3)all go(29,4)direct to Heaven
(36,28)(139,3)(83,3)(138,3)way

One important property of back references is that they can overlap. This happens when the length is greater than the distance. For example:

Fa-la-la-la-la

Can be compressed to:

Fa-la(3,9)

It may seem strange, but the method works: once the first three bytes "-la" have been copied, copying continues using the freshly generated bytes.

This is effectively a form of run-length encoding, where part of the data is copied repeatedly to produce the desired length.

An interactive example of applying Lempel-Ziv compression to song lyrics can be found in Colin Morris's article Are Pop Lyrics Getting More Repetitive?.

The following is an example of copying back references in C. Note that because of possible overlap, we cannot use memcpy or memmove.

/* Output the (dist,len) backref at dst_pos in dst. */
static inline void lz77_output_backref(uint8_t *dst, size_t dst_pos,
                                       size_t dist, size_t len)
{
        size_t i;

        assert(dist <= dst_pos && "cannot reference before beginning of dst");

        for (i = 0; i < len; i++) {
                dst[dst_pos] = dst[dst_pos - dist];
                dst_pos++;
        }
}

Generating literals is easy, but for completeness we will use a helper function:

/* Output lit at dst_pos in dst. */
static inline void lz77_output_lit(uint8_t *dst, size_t dst_pos, uint8_t lit)
{
        dst[dst_pos] = lit;
}

Note that the caller of these functions must ensure that dst has enough space for the generated data and that the back reference does not refer to a position before the start of the buffer.

The hard part is not emitting back references during decompression, but finding them in the first place when compressing the source data. This can be done in different ways, but we will use an approach based on the hash tables from zlib, which is suggested in RFC 1951.

We will use a hash table storing the positions of three-character prefixes that have occurred earlier in the string (shorter back references bring no benefit). Deflate allows back references within the previous 32,768 characters; this range is called the window. The window enables streaming compression: the input is processed a little at a time, as long as the window of the most recent bytes is kept in memory. However, our implementation assumes that all of the input is available and processes it in one go. This lets us focus on the compression itself rather than the bookkeeping needed for stream processing.

We will use two arrays: head contains, for each hash value of a three-character prefix, the most recent input position with that prefix, and prev contains, for each position, the previous position with the same hash value. In effect, head[h] is the head of a linked list of prefix positions with hash h, and prev[x] gives the element preceding x in that list.

#define LZ_WND_SIZE 32768
#define LZ_MAX_LEN  258

#define HASH_SIZE 15
#define NO_POS    SIZE_MAX

/* Perform LZ77 compression on the len bytes in src. Returns false as soon as
   either of the callback functions returns false, otherwise returns true when
   all bytes have been processed. */
bool lz77_compress(const uint8_t *src, size_t len,
                   bool (*lit_callback)(uint8_t lit, void *aux),
                   bool (*backref_callback)(size_t dist, size_t len, void *aux),
                   void *aux)
{
        size_t head[1U << HASH_SIZE];
        size_t prev[LZ_WND_SIZE];

        uint16_t h;
        size_t i, j, dist;
        size_t match_len, match_pos;
        size_t prev_match_len, prev_match_pos;

        /* Initialize the hash table. */
        for (i = 0; i < sizeof(head) / sizeof(head[0]); i++) {
                head[i] = NO_POS;
        }

To insert a new string position into the hash table, prev is updated to point to the previous head, and then head itself is updated:

static void insert_hash(uint16_t hash, size_t pos, size_t *head, size_t *prev)
{
        prev[pos % LZ_WND_SIZE] = head[hash];
        head[hash] = pos;
}

Note the modulo operation when indexing into prev: we only care about positions that fall within the current window.

Instead of computing the hash value for each three-character prefix from scratch, we use a rolling hash that is continually updated so that its value reflects only the last three characters:

static uint16_t update_hash(uint16_t hash, uint8_t c)
{
        hash <<= 5;                     /* Shift out old bits. */
        hash ^= c;                      /* Include new bits. */
        hash &= (1U << HASH_SIZE) - 1;  /* Mask off excess bits. */

        return hash;
}

The hash table can then be used to efficiently search for previous matches, as shown below. Searching for matches is the most resource-intensive part of compression, so we will limit how deep into the list we search.

Tuning parameters such as the search depth in the prefix list and the lazy matching described below trades compression ratio for speed. The settings in our code are chosen to match zlib's maximum compression level.

/* Find the longest most recent string which matches the string starting
 * at src[pos]. The match must be strictly longer than prev_match_len and
 * shorter or equal to max_match_len. Returns the length of the match if found
 * and stores the match position in *match_pos, otherwise returns zero. */
static size_t find_match(const uint8_t *src, size_t pos, uint16_t hash,
                         size_t prev_match_len, size_t max_match_len,
                         const size_t *head, const size_t *prev,
                         size_t *match_pos)
{
        size_t max_match_steps = 4096;
        size_t i, l;
        bool found;

        if (prev_match_len == 0) {
                /* We want backrefs of length 3 or longer. */
                prev_match_len = 2;
        }

        if (prev_match_len >= max_match_len) {
                /* A longer match would be too long. */
                return 0;
        }

        if (prev_match_len >= 32) {
                /* Do not try too hard if there is already a good match. */
                max_match_steps /= 4;
        }

        found = false;
        i = head[hash];

        while (max_match_steps != 0) {
                if (i == NO_POS) {
                        /* No match. */
                        break;
                }

                assert(i < pos && "Matches should precede pos.");
                if (pos - i > LZ_WND_SIZE) {
                        /* The match is outside the window. */
                        break;
                }

                l = cmp(src, i, pos, prev_match_len, max_match_len);

                if (l != 0) {
                        assert(l > prev_match_len);
                        assert(l <= max_match_len);

                        found = true;
                        *match_pos = i;
                        prev_match_len = l;

                        if (l == max_match_len) {
                                /* A longer match is not possible. */
                                return l;
                        }
                }

                /* Look further back in the prefix list. */
                i = prev[i % LZ_WND_SIZE];
                max_match_steps--;
        }

        if (!found) {
                return 0;
        }

        return prev_match_len;
}

/* Compare the substrings starting at src[i] and src[j], and return the length
 * of the common prefix. The match must be strictly longer than prev_match_len
 * and shorter or equal to max_match_len. */
static size_t cmp(const uint8_t *src, size_t i, size_t j,
                  size_t prev_match_len, size_t max_match_len)
{
        size_t l;

        assert(prev_match_len < max_match_len);

        /* Check whether the first prev_match_len + 1 characters match. Do this
         * backwards for a higher chance of finding a mismatch quickly. */
        for (l = 0; l < prev_match_len + 1; l++) {
                if (src[i + prev_match_len - l] !=
                    src[j + prev_match_len - l]) {
                        return 0;
                }
        }

        assert(l == prev_match_len + 1);

        /* Now check how long the full match is. */
        for (; l < max_match_len; l++) {
                if (src[i + l] != src[j + l]) {
                        break;
                }
        }

        assert(l > prev_match_len);
        assert(l <= max_match_len);
        assert(memcmp(&src[i], &src[j], l) == 0);

        return l;
}

The lz77_compress function can now be completed with the following code, which searches for previous matches:

       /* h is the hash of the three-byte prefix starting at position i. */
        h = 0;
        if (len >= 2) {
                h = update_hash(h, src[0]);
                h = update_hash(h, src[1]);
        }

        prev_match_len = 0;
        prev_match_pos = 0;

        for (i = 0; i + 2 < len; i++) {
                h = update_hash(h, src[i + 2]);

                /* Search for a match using the hash table. */
                match_len = find_match(src, i, h, prev_match_len,
                                       min(LZ_MAX_LEN, len - i), head, prev,
                                       &match_pos);

                /* Insert the current hash for future searches. */
                insert_hash(h, i, head, prev);

                /* If the previous match is at least as good as the current. */
                if (prev_match_len != 0 && prev_match_len >= match_len) {
                        /* Output the previous match. */
                        dist = (i - 1) - prev_match_pos;
                        if (!backref_callback(dist, prev_match_len, aux)) {
                                return false;
                        }
                        /* Move past the match. */
                        for (j = i + 1; j < min((i - 1) + prev_match_len,
                                                len - 2); j++) {
                                h = update_hash(h, src[j + 2]);
                                insert_hash(h, j, head, prev);
                        }
                        i = (i - 1) + prev_match_len - 1;
                        prev_match_len = 0;
                        continue;
                }

                /* If no match (and no previous match), output literal. */
                if (match_len == 0) {
                        assert(prev_match_len == 0);
                        if (!lit_callback(src[i], aux)) {
                                return false;
                        }
                        continue;
                }

                /* Otherwise the current match is better than the previous. */

                if (prev_match_len != 0) {
                        /* Output a literal instead of the previous match. */
                        if (!lit_callback(src[i - 1], aux)) {
                                return false;
                        }
                }

                /* Defer this match and see if the next is even better. */
                prev_match_len = match_len;
                prev_match_pos = match_pos;
        }

        /* Output any previous match. */
        if (prev_match_len != 0) {
                dist = (i - 1) - prev_match_pos;
                if (!backref_callback(dist, prev_match_len, aux)) {
                        return false;
                }
                i = (i - 1) + prev_match_len;
        }

        /* Output any remaining literals. */
        for (; i < len; i++) {
                if (!lit_callback(src[i], aux)) {
                        return false;
                }
        }

        return true;
}

This code searches for the longest back reference that can be emitted at the current position. But before emitting it, the program checks whether an even longer match can be found at the next position. In zlib, this is called lazy matching. It is still a greedy algorithm: it picks the longest match, even when a shorter match now would enable an even longer match later and achieve stronger overall compression.

Lempel-Ziv compression can be tuned for either speed or ratio. Zopfli spends a lot of time searching for optimal back references in order to squeeze out a few extra percent of compression. This is useful for data that is compressed once and then served many times, for example static content on a web server. At the other end of the scale are compressors such as Snappy and LZ4, which only compare against the most recent 4-byte prefix and are very fast. This kind of compression is useful in databases and RPC systems, where the time spent compressing pays for itself in time saved sending the data over the network or to disk.

The idea of using the source data itself as the dictionary is very elegant, but a static dictionary can also help. Brotli is an LZ77-based algorithm, but it also uses a large static dictionary of strings that commonly occur on the web.

The LZ77 code can be viewed in lz77.h and lz77.c.

Huffman Code


The second ingredient of Zip compression is the Huffman code.

The term code in this context refers to a system for representing data in some other form. Here, we want a code that efficiently represents the literals and back references produced by the Lempel-Ziv algorithm.

Traditionally, English text is represented using the American Standard Code for Information Interchange (ASCII). This system assigns each character a number, which is usually stored in an 8-bit representation. Here are the ASCII codes for the uppercase letters of the English alphabet:

A  01000001    N  01001110
B  01000010    O  01001111
C  01000011    P  01010000
D  01000100    Q  01010001
E  01000101    R  01010010
F  01000110    S  01010011
G  01000111    T  01010100
H  01001000    U  01010101
I  01001001    V  01010110
J  01001010    W  01010111
K  01001011    X  01011000
L  01001100    Y  01011001
M  01001101    Z  01011010

One byte per character is a convenient way to store text. It makes it easy to access or modify parts of the text, and it is always clear how many bytes are needed to store N characters, or how many characters are stored in N bytes. However, it is not the most efficient representation in terms of space. For example, in English text the letter E is the most common and Z the least common. So it is more space-efficient to use a shorter bit representation for E and a longer one for Z, rather than giving every character the same number of bits.

A code that assigns codewords of different lengths to different source characters is called a variable-length code. The most famous example is Morse code, in which each character is encoded as dots and dashes, originally transmitted by telegraph as short and long pulses:

A  • -        N  - •
B  - • • •    O  - - -
C  - • - •    P  • - - •
D  - • •      Q  - - • -
E  •          R  • - •
F  • • - •    S  • • •
G  - - •      T  -
H  • • • •    U  • • -
I  • •        V  • • • -
J  • - - -    W  • - -
K  - • -      X  - • • -
L  • - • •    Y  - • - -
M  - -        Z  - - • •

One drawback of Morse code is that one codeword can be a prefix of another. For example, • • - • does not have a unique decoding: it could be F or ER. This is resolved with pauses (three dots long) between letters during transmission. However, it would be better if codewords simply could not be prefixes of other codewords. Such a code is called prefix-free. Fixed-length codes like ASCII are prefix-free because all codewords have the same length. But variable-length codes can be prefix-free too. Phone numbers are generally prefix-free. Before the emergency number 112 was introduced in Sweden, all phone numbers beginning with 112 had to be changed, and in the USA no phone number starts with 911.

To minimize the size of an encoded message, it is best to use a prefix-free code in which frequently occurring characters have shorter codewords. The optimal code is the one that produces the shortest possible result: the sum of the codeword lengths, weighted by their frequency of occurrence, is as small as possible. Such a code is called a minimum-redundancy prefix-free code, or a Huffman code, after the inventor of an efficient algorithm for generating them.

Huffman Algorithm


While working on his doctorate in electrical engineering at MIT, David Huffman took a course on information theory taught by Robert Fano. According to legend, Fano let his students choose between taking the final exam and writing a term paper. Huffman chose the latter, and was given the topic of finding minimum-redundancy prefix-free codes. It is said that he did not know Fano himself was working on the same problem at the time (the Shannon-Fano algorithm was the best-known method back then). Huffman's work was published in 1952 under the title A Method for the Construction of Minimum-Redundancy Codes, and his algorithm has been widely used ever since.


David Huffman. Press photo from UC Santa Cruz.

The Huffman algorithm creates a minimum-redundancy prefix-free code for a set of characters given their frequencies of occurrence. The algorithm repeatedly selects the two characters least likely to occur in the source data, say X and Y, and replaces them with a composite character meaning "X or Y". The frequency of the composite character is the sum of the two original frequencies. The codewords for X and Y become whatever codeword is assigned to the composite character "X or Y", followed by 0 or 1 to tell the originals apart. When the input is reduced to a single character, the algorithm stops (video explanation).

Here is an example of the algorithm working on a small character set:

Symbol  Frequency
A       6
B       4
C       2
D       3

First iteration of processing:


The two rarest symbols, C and D, are removed from the set and replaced by a composite symbol whose frequency is the sum of the frequencies of C and D:


Now the rarest symbols are B and a composite symbol with a frequency of 5. They are removed from the set and replaced with a composite symbol with a frequency of 9:


Finally, A and a composite symbol with a frequency of 9 are combined into a new symbol with a frequency of 15:


The whole set was reduced to one character, processing is complete.

The algorithm has created a structure called a Huffman tree. The input characters are leaves, and the higher a character's frequency, the higher up in the tree it sits. Starting from the root, codewords can be generated for the characters by appending 0 or 1 when going left or right, respectively. The result:

Symbol  Codeword
A       0
B       10
C       110
D       111

No codeword is a prefix of any other. The more frequent a symbol, the shorter its codeword.

The tree can also be used for decoding: starting from the root, we go left for 0 and right for 1 until we reach a character. For example, the string 010100 decodes to ABBA.

Note that the length of each codeword equals the depth of the corresponding tree node. As we will see in the next part, we do not need an actual tree to assign codewords: it is enough to know the lengths of the codewords themselves. Thus, the output of our implementation of the Huffman algorithm will be the codeword lengths.

To store the character set and efficiently find the lowest frequencies, we will use the binary heap data structure. In particular, we want a min-heap, since the minimum value should be at the top.

/* Swap the 32-bit values pointed to by a and b. */
static void swap32(uint32_t *a, uint32_t *b)
{
        uint32_t tmp;

        tmp = *a;
        *a = *b;
        *b = tmp;
}

/* Move element i in the n-element heap down to restore the minheap property. */
static void minheap_down(uint32_t *heap, size_t n, size_t i)
{
        size_t left, right, min;

        assert(i >= 1 && i <= n && "i must be inside the heap");

        /* While the ith element has at least one child. */
        while (i * 2 <= n) {
                left = i * 2;
                right = i * 2 + 1;

                /* Find the child with lowest value. */
                min = left;
                if (right <= n && heap[right] < heap[left]) {
                        min = right;
                }

                /* Move i down if it is larger. */
                if (heap[min] < heap[i]) {
                        swap32(&heap[min], &heap[i]);
                        i = min;
                } else {
                        break;
                }
        }
}

/* Establish minheap property for heap[1..n]. */
static void minheap_heapify(uint32_t *heap, size_t n)
{
        size_t i;

        /* Floyd's algorithm. */
        for (i = n / 2; i >= 1; i--) {
                minheap_down(heap, n, i);
        }
}

To track the frequencies of n characters, we will use a heap of n elements. Also, each time a composite symbol is created, we want to "link" the two source symbols to it. Therefore, each symbol will also have a "link element".

To store the n-element heap and the n link elements, we will use an array of n * 2 + 1 elements. When two symbols in the heap are replaced by one, we will use the freed-up second element to store the link to the new symbol. This approach is based on the implementation in Managing Gigabytes by Witten, Moffat, and Bell.

In each heap node, we will use the 16 most significant bits to store the symbol frequency, and the 16 least significant bits to store the index of the symbol's link element. Because the frequency sits in the high bits, comparing two heap elements as plain 32-bit values orders them by frequency.

Because of this representation, we need to make sure that symbol frequencies always fit in 16 bits. After the algorithm completes, the final composite symbol carries the combined frequency of all symbols, so that sum must fit in 16 bits. Our Deflate implementation ensures this by processing at most 65,535 symbols at a time.

Symbols with zero frequency will receive codewords of zero length and will not participate in the code.

If a codeword would exceed the specified maximum depth, we will "flatten" the frequency distribution by imposing a frequency cap and try again (yes, with a goto). There are more sophisticated ways to do depth-limited Huffman coding, but this one is simple and effective.

#define MAX_HUFFMAN_SYMBOLS 288      /* Deflate uses max 288 symbols. */

/* Construct a Huffman code for n symbols with the frequencies in freq, and
 * codeword length limited to max_len. The sum of the frequencies must be <=
 * UINT16_MAX. max_len must be large enough that a code is always possible,
 * i.e. 2 ** max_len >= n. Symbols with zero frequency are not part of the code
 * and get length zero. Outputs the codeword lengths in lengths[0..n-1]. */
static void compute_huffman_lengths(const uint16_t *freqs, size_t n,
                                    uint8_t max_len, uint8_t *lengths)
{
        uint32_t nodes[MAX_HUFFMAN_SYMBOLS * 2 + 1], p, q;
        uint16_t freq;
        size_t i, h, l;
        uint16_t freq_cap = UINT16_MAX;

#ifndef NDEBUG
        uint32_t freq_sum = 0;
        for (i = 0; i < n; i++) {
                freq_sum += freqs[i];
        }
        assert(freq_sum <= UINT16_MAX && "Frequency sum too large!");
#endif

        assert(n <= MAX_HUFFMAN_SYMBOLS);
        assert((1U << max_len) >= n && "max_len must be large enough");

try_again:
        /* Initialize the heap. h is the heap size. */
        h = 0;
        for (i = 0; i < n; i++) {
                freq = freqs[i];

                if (freq == 0) {
                        continue; /* Ignore zero-frequency symbols. */
                }
                if (freq > freq_cap) {
                        freq = freq_cap; /* Enforce the frequency cap. */
                }

                /* High 16 bits: Symbol frequency.
                   Low 16 bits:  Symbol link element index. */
                h++;
                nodes[h] = ((uint32_t)freq << 16) | (uint32_t)(n + h);
        }
        minheap_heapify(nodes, h);

        /* Special case for less than two non-zero symbols. */
        if (h < 2) {
                for (i = 0; i < n; i++) {
                        lengths[i] = (freqs[i] == 0) ? 0 : 1;
                }
                return;
        }

        /* Build the Huffman tree. */
        while (h > 1) {
                /* Remove the lowest frequency node p from the heap. */
                p = nodes[1];
                nodes[1] = nodes[h--];
                minheap_down(nodes, h, 1);

                /* Get q, the next lowest frequency node. */
                q = nodes[1];

                /* Replace q with a new symbol with the combined frequencies of
                   p and q, and with the no longer used h+1'th node as the
                   link element. */
                nodes[1] = ((p & 0xffff0000) + (q & 0xffff0000))
                           | (uint32_t)(h + 1);

                /* Set the links of p and q to point to the link element of
                   the new node. */
                nodes[p & 0xffff] = nodes[q & 0xffff] = (uint32_t)(h + 1);

                /* Move the new symbol down to restore heap property. */
                minheap_down(nodes, h, 1);
        }

        /* Compute the codeword length for each symbol. */
        h = 0;
        for (i = 0; i < n; i++) {
                if (freqs[i] == 0) {
                        lengths[i] = 0;
                        continue;
                }
                h++;

                /* Link element for the i'th symbol. */
                p = nodes[n + h];

                /* Follow the links until we hit the root (link index 2). */
                l = 1;
                while (p != 2) {
                        l++;
                        p = nodes[p];
                }

                if (l > max_len) {
                        /* Lower freq_cap to flatten the distribution. */
                        assert(freq_cap != 1 && "Cannot lower freq_cap!");
                        freq_cap /= 2;
                        goto try_again;
                }

                assert(l <= UINT8_MAX);
                lengths[i] = (uint8_t)l;
        }
}

An elegant alternative to the binary heap is to store the symbols in two queues. The first contains the source symbols, sorted by frequency. When a composite symbol is created, it is appended to the second queue. This way, the symbol with the lowest frequency is always at the front of one of the two queues. The approach is described by Jan van Leeuwen in On the Construction of Huffman Trees (1976).

Huffman coding is optimal among prefix-free codes, but if we drop that restriction, more efficient methods exist: arithmetic coding and asymmetric numeral systems.

Canonical Huffman Codes


In the example above, we built a Huffman tree:


If we go from the root and use 0 for the left branch and 1 for the right, then we get the following codes:

Symbol  Codeword
A       0
B       10
C       110
D       111

The decision to use 0 for the left branch and 1 for the right one seems arbitrary. If we do the opposite, we get:

Symbol  Codeword
A       1
B       01
C       001
D       000

We can arbitrarily label the two branches leaving a node with zero and one (as long as the labels differ) and still get an equivalent code:


Symbol  Codeword
A       0
B       11
C       100
D       101

Although the Huffman algorithm provides the codeword lengths required for a minimum-redundancy prefix-free code, there are many ways to assign the individual codewords.

Given the codeword lengths computed by the Huffman algorithm, a canonical Huffman code assigns codewords to symbols in a specific way. This is useful because it is enough to store and transmit the codeword lengths with the compressed data: the decoder can recover the codewords from their lengths. Of course, one could store and transmit the symbol frequencies and run the Huffman algorithm in the decoder, but that would require more work and more storage on the decoder's side. Another very important property is that the structure of canonical codes enables efficient decoding.

The idea is to assign codewords to symbols sequentially, one at a time. The first codeword is 0. The next codeword of a given length is the previous one plus 1. The first codeword of length N is formed from the last codeword of length N-1 by adding one (otherwise it would not be a new codeword) and shifting one step to the left (to increase the length).

In Huffman-tree terms, codewords are assigned to the leaves sequentially, left to right, one level at a time, shifting left when moving down to the next level.

In our ABCD example, the Huffman algorithm produced codeword lengths of 1, 2, 3, and 3. The first codeword is 0. It is also the last codeword of length 1. For length 2, we take 0, add 1 to get the next codeword, and shift left to get 10. This is now the last codeword of length 2. For length 3, we add 1 and shift: 110. The next codeword of length 3 is obtained by adding 1 again: 111.

Symbol  Codeword
A       0
B       10
C       110
D       111

The implementation of the canonical code generator is shown below. Note that the Deflate algorithm expects codewords to be generated LSB-first, meaning the first bit of a codeword must be stored in its least significant bit. This means we have to reverse the bit order, for example using a lookup table.

#define MAX_HUFFMAN_BITS 15          /* Deflate uses max 15-bit codewords. */

static void compute_canonical_code(uint16_t *codewords, const uint8_t *lengths,
                                   size_t n)
{
        size_t i;
        uint16_t count[MAX_HUFFMAN_BITS + 1] = {0};
        uint16_t code[MAX_HUFFMAN_BITS + 1];
        int l;

        /* Count the number of codewords of each length. */
        for (i = 0; i < n; i++) {
                count[lengths[i]]++;
        }
        count[0] = 0; /* Ignore zero-length codes. */

        /* Compute the first codeword for each length. */
        code[0] = 0;
        for (l = 1; l <= MAX_HUFFMAN_BITS; l++) {
                code[l] = (uint16_t)((code[l - 1] + count[l - 1]) << 1);
        }

        /* Assign a codeword for each symbol. */
        for (i = 0; i < n; i++) {
                l = lengths[i];
                if (l == 0) {
                        continue;
                }

                codewords[i] = reverse16(code[l]++, l); /* Make it LSB-first. */
        }
}

/* Reverse the n least significant bits of x.
   The (16 - n) most significant bits of the result will be zero. */
static inline uint16_t reverse16(uint16_t x, int n)
{
        uint16_t lo, hi;
        uint16_t reversed;

        assert(n > 0);
        assert(n <= 16);

        lo = x & 0xff;
        hi = x >> 8;

        reversed = (uint16_t)((reverse8_tbl[lo] << 8) | reverse8_tbl[hi]);

        return reversed >> (16 - n);
}

Now put it all together and write the encoder initialization code:

typedef struct huffman_encoder_t huffman_encoder_t;
struct huffman_encoder_t {
        uint16_t codewords[MAX_HUFFMAN_SYMBOLS]; /* LSB-first codewords. */
        uint8_t lengths[MAX_HUFFMAN_SYMBOLS];    /* Codeword lengths. */
};

/* Initialize a Huffman encoder based on the n symbol frequencies. */
void huffman_encoder_init(huffman_encoder_t *e, const uint16_t *freqs, size_t n,
                          uint8_t max_codeword_len)
{
        assert(n <= MAX_HUFFMAN_SYMBOLS);
        assert(max_codeword_len <= MAX_HUFFMAN_BITS);

        compute_huffman_lengths(freqs, n, max_codeword_len, e->lengths);
        compute_canonical_code(e->codewords, e->lengths, n);
}

We also make a function to configure the encoder using the already calculated code lengths:

/* Initialize a Huffman encoder based on the n codeword lengths. */
void huffman_encoder_init2(huffman_encoder_t *e, const uint8_t *lengths,
                           size_t n)
{
        size_t i;

        for (i = 0; i < n; i++) {
                e->lengths[i] = lengths[i];
        }
        compute_canonical_code(e->codewords, e->lengths, n);
}

Efficient Huffman Decoding


The easiest way to decode Huffman codes is to traverse the tree from the root, reading one bit of input at a time to decide whether to take the left or right branch. When a leaf node is reached, we have the decoded symbol.

This method is often taught in universities and books. It is simple and elegant, but processing one bit at a time is too slow. It is much faster to decode using a lookup table. For the example above, where the maximum codeword length is three bits, the following table can be used:

Bits    Symbol  Codeword length
0 00    A       1
0 01    A       1
0 10    A       1
0 11    A       1
10 0    B       2
10 1    B       2
110     C       3
111     D       3

Although there are only four symbols, the table needs eight entries to cover all possible three-bit combinations. Symbols with codewords shorter than three bits get multiple entries. For example, the codeword 10 was "padded" as 10 0 and 10 1 to cover all three-bit combinations starting with 10.

To decode this way, we index into the table with the next three input bits and immediately find the corresponding symbol and its codeword length. The length matters because, even though we looked at three bits, we must advance the input stream by only as many bits as the codeword actually contains.

The lookup table method is very fast, but it has a drawback: the table size doubles with each additional bit of codeword length. That is, the table grows exponentially, and once it stops fitting in the processor cache, the method becomes slow.

Because of this, a lookup table is usually used only for codewords up to a certain length, with a different approach for longer codewords. Just as Huffman coding assigns shorter codewords to more frequent symbols, using a lookup table for the short codewords is in many cases an excellent optimization.

zlib uses several levels of lookup tables: if a codeword is too long for the first table, the lookup continues in a secondary table indexed by the remaining bits.

But there is another, very elegant method, based on the properties of canonical Huffman codes. It is described in On the Implementation of Minimum Redundancy Prefix Codes (Moffat and Turpin, 1997), and is also explained in Charles Bloom's The Lost Huffman Paper .

Take the codewords from the canonical version: 0, 10, 110, 111. We will track the first codeword of each length, as well as each symbol's position in the sequential assignment order, its "symbol index".

Codeword length  First codeword  First symbol index
1                0               1 (A)
2                10              2 (B)
3                110             3 (C)

Since codewords are assigned sequentially, once we know the number of bits we can use the table above to find the symbol those bits represent. For example, for the three-bit 111, we see that it is at an offset of one from the first codeword of that length (110). The first symbol index of that length is 3, and an offset of one gives us index 4. Another table maps the symbol index to the symbol:

sym_idx = d->first_symbol[len] + (bits - d->first_code[len]);
sym = d->syms[sym_idx];

A small optimization: instead of storing the first symbol index and the first codeword separately, we can store the first index minus the first codeword in the table:

sym_idx = d->offset_first_sym_idx[len] + bits;
sym = d->syms[sym_idx];

To determine how many bits to consider, we again use the sequential property of the code. In our example, all valid one-bit codewords are strictly less than 1, two-bit ones are strictly less than 11, and three-bit ones are less than 1000 (in fact, that holds for all three-bit values). In other words, a valid N-bit codeword must be strictly less than the first N-bit codeword plus the number of N-bit codewords. Moreover, we can shift these bounds left so that they are all three bits wide. Let's call these the sentinel bits for each codeword length:

Codeword length  Sentinel bits
1                100
2                110
3                1000

The sentinel for length 3 has overflowed into 4 bits, but that just means any three-bit value will pass.

We can take three bits of input and compare them against the sentinel bits to find out how long our codeword is. Once found, we shift the input bits so that only the correct number of them remains, and then look up the symbol index:

for (len = 1; len <= 3; len++) {
        if (bits < d->sentinel_bits[len]) {
                bits >>= 3 - len;  /* Get the len most significant bits. */
                sym_idx = d->offset_first_sym_idx[len] + bits;
                break;
        }
}

The process is linear in the number of bits in the codewords, but it uses space efficiently, each step requires only a load and a comparison, and since shorter codewords are more frequent, it performs well in many situations.

Full decoder code:

#define HUFFMAN_LOOKUP_TABLE_BITS 8  /* Seems a good trade-off. */

typedef struct huffman_decoder_t huffman_decoder_t;
struct huffman_decoder_t {
        /* Lookup table for fast decoding of short codewords. */
        struct {
                uint16_t sym : 9;  /* Wide enough to fit the max symbol nbr. */
                uint16_t len : 7;  /* 0 means no symbol. */
        } table[1U << HUFFMAN_LOOKUP_TABLE_BITS];

        /* "Sentinel bits" value for each codeword length. */
        uint16_t sentinel_bits[MAX_HUFFMAN_BITS + 1];

        /* First symbol index minus first codeword mod 2**16 for each length. */
        uint16_t offset_first_sym_idx[MAX_HUFFMAN_BITS + 1];

        /* Map from symbol index to symbol. */
        uint16_t syms[MAX_HUFFMAN_SYMBOLS];
#ifndef NDEBUG
        size_t num_syms;
#endif
};

/* Get the n least significant bits of x. */
static inline uint64_t lsb(uint64_t x, int n)
{
        assert(n >= 0 && n <= 63);
        return x & (((uint64_t)1 << n) - 1);
}

/* Use the decoder d to decode a symbol from the LSB-first zero-padded bits.
 * Returns the decoded symbol number or -1 if no symbol could be decoded.
 * *num_used_bits will be set to the number of bits used to decode the symbol,
 * or zero if no symbol could be decoded. */
static inline int huffman_decode(const huffman_decoder_t *d, uint16_t bits,
                                 size_t *num_used_bits)
{
        uint64_t lookup_bits;
        size_t l;
        size_t sym_idx;

        /* First try the lookup table. */
        lookup_bits = lsb(bits, HUFFMAN_LOOKUP_TABLE_BITS);
        assert(lookup_bits < sizeof(d->table) / sizeof(d->table[0]));
        if (d->table[lookup_bits].len != 0) {
                assert(d->table[lookup_bits].len <= HUFFMAN_LOOKUP_TABLE_BITS);
                assert(d->table[lookup_bits].sym < d->num_syms);

                *num_used_bits = d->table[lookup_bits].len;
                return d->table[lookup_bits].sym;
        }

        /* Then do canonical decoding with the bits in MSB-first order. */
        bits = reverse16(bits, MAX_HUFFMAN_BITS);
        for (l = HUFFMAN_LOOKUP_TABLE_BITS + 1; l <= MAX_HUFFMAN_BITS; l++) {
                if (bits < d->sentinel_bits[l]) {
                        bits >>= MAX_HUFFMAN_BITS - l;

                        sym_idx = (uint16_t)(d->offset_first_sym_idx[l] + bits);
                        assert(sym_idx < d->num_syms);

                        *num_used_bits = l;
                        return d->syms[sym_idx];
                }
        }

        *num_used_bits = 0;
        return -1;
}

To set up the decoder, we pre-compute the canonical codes, as in huffman_encoder_init, and fill in the various tables:

/* Initialize huffman decoder d for a code defined by the n codeword lengths.
   Returns false if the codeword lengths do not correspond to a valid prefix
   code. */
bool huffman_decoder_init(huffman_decoder_t *d, const uint8_t *lengths,
                          size_t n)
{
        size_t i;
        uint16_t count[MAX_HUFFMAN_BITS + 1] = {0};
        uint16_t code[MAX_HUFFMAN_BITS + 1];
        uint32_t s;
        uint16_t sym_idx[MAX_HUFFMAN_BITS + 1];
        int l;

#ifndef NDEBUG
        assert(n <= MAX_HUFFMAN_SYMBOLS);
        d->num_syms = n;
#endif

        /* Zero-initialize the lookup table. */
        for (i = 0; i < sizeof(d->table) / sizeof(d->table[0]); i++) {
                d->table[i].len = 0;
        }

        /* Count the number of codewords of each length. */
        for (i = 0; i < n; i++) {
                assert(lengths[i] <= MAX_HUFFMAN_BITS);
                count[lengths[i]]++;
        }
        count[0] = 0;  /* Ignore zero-length codewords. */

        /* Compute sentinel_bits and offset_first_sym_idx for each length. */
        code[0] = 0;
        sym_idx[0] = 0;
        for (l = 1; l <= MAX_HUFFMAN_BITS; l++) {
                /* First canonical codeword of this length. */
                code[l] = (uint16_t)((code[l - 1] + count[l - 1]) << 1);

                if (count[l] != 0 && code[l] + count[l] - 1 > (1U << l) - 1) {
                        /* The last codeword is longer than l bits. */
                        return false;
                }

                s = (uint32_t)((code[l] + count[l]) << (MAX_HUFFMAN_BITS - l));
                d->sentinel_bits[l] = (uint16_t)s;
                assert(d->sentinel_bits[l] == s && "No overflow.");

                sym_idx[l] = sym_idx[l - 1] + count[l - 1];
                d->offset_first_sym_idx[l] = sym_idx[l] - code[l];
        }

        /* Build mapping from index to symbol and populate the lookup table. */
        for (i = 0; i < n; i++) {
                l = lengths[i];
                if (l == 0) {
                        continue;
                }

                d->syms[sym_idx[l]] = (uint16_t)i;
                sym_idx[l]++;

                if (l <= HUFFMAN_LOOKUP_TABLE_BITS) {
                        table_insert(d, i, l, code[l]);
                        code[l]++;
                }
        }

        return true;
}

static void table_insert(huffman_decoder_t *d, size_t sym, int len,
                         uint16_t codeword)
{
        int pad_len;
        uint16_t padding, index;

        assert(len <= HUFFMAN_LOOKUP_TABLE_BITS);

        codeword = reverse16(codeword, len); /* Make it LSB-first. */
        pad_len = HUFFMAN_LOOKUP_TABLE_BITS - len;

        /* Pad the pad_len upper bits with all bit combinations. */
        for (padding = 0; padding < (1U << pad_len); padding++) {
                index = (uint16_t)(codeword | (padding << len));
                d->table[index].sym = (uint16_t)sym;
                d->table[index].len = (uint16_t)len;

                assert(d->table[index].sym == sym && "Fits in bitfield.");
                assert(d->table[index].len == len && "Fits in bitfield.");
        }
}

Deflate


The Deflate algorithm, introduced in PKZip 2.04c in 1993, is the standard compression method in modern Zip files. It is also used in gzip, PNG, and many other formats. It uses a combination of LZ77 compression and Huffman coding, which we will discuss and implement in this section.

Prior to Deflate, PKZip used the Shrink, Reduce, and Implode compression methods. Today they are rare, although they remained in use for some time after Deflate appeared, since they required less memory. We will not consider them here.

Bit streams


Deflate stores Huffman codewords in a bitstream according to the LSB-first principle. This means that the first bit of the stream is stored in the least significant bit of the first byte.

Consider the bitstream (read left to right) 1-0-0-1-1. Stored LSB-first, the byte's value becomes 0b00011001 (binary), or 0x19 (hexadecimal). It may seem that the stream is simply stored backwards (in a sense, it is), but the advantage is that it becomes easy to get the first N bits from a computer word: we simply mask off the N least significant bits.

These procedures are taken from bitstream.h :

/* Input bitstream. */
typedef struct istream_t istream_t;
struct istream_t {
        const uint8_t *src;  /* Source bytes. */
        const uint8_t *end;  /* Past-the-end byte of src. */
        size_t bitpos;       /* Position of the next bit to read. */
        size_t bitpos_end;   /* Position of past-the-end bit. */
};

/* Initialize an input stream to present the n bytes from src as an LSB-first
 * bitstream. */
static inline void istream_init(istream_t *is, const uint8_t *src, size_t n)
{
        is->src = src;
        is->end = src + n;
        is->bitpos = 0;
        is->bitpos_end = n * 8;
}

Our Huffman decoder needs to peek at the next bits in the stream (enough bits for the longest possible codeword), and then advance the stream by the number of bits consumed by the decoded symbol:

#define ISTREAM_MIN_BITS (64 - 7)

/* Get the next bits from the input stream. The number of bits returned is
 * between ISTREAM_MIN_BITS and 64, depending on the position in the stream, or
 * fewer if the end of stream is reached. The upper bits are zero-padded. */
static inline uint64_t istream_bits(const istream_t *is)
{
        const uint8_t *next;
        uint64_t bits;
        int i;

        next = is->src + (is->bitpos / 8);

        assert(next <= is->end && "Cannot read past end of stream.");

        if (is->end - next >= 8) {
                /* Common case: read 8 bytes in one go. */
                bits = read64le(next);
        } else {
                /* Read the available bytes and zero-pad. */
                bits = 0;
                for (i = 0; i < is->end - next; i++) {
                        bits |= (uint64_t)next[i] << (i * 8);
                }
        }

        return bits >> (is->bitpos % 8);
}

/* Advance n bits in the bitstream if possible. Returns false if that many bits
 * are not available in the stream. */
static inline bool istream_advance(istream_t *is, size_t n) {
        if (is->bitpos + n > is->bitpos_end) {
                return false;
        }

        is->bitpos += n;
        return true;
}

The point is that on 64-bit machines, istream_bits can usually be executed as a single load instruction plus some arithmetic, assuming the istream_t members are kept in registers. read64le is implemented in bits.h (modern compilers turn it into a single 64-bit little-endian load):

/* Read a 64-bit value from p in little-endian byte order. */
static inline uint64_t read64le(const uint8_t *p)
{
        /* The one true way, see
         * https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html */
        return ((uint64_t)p[0] << 0)  |
               ((uint64_t)p[1] << 8)  |
               ((uint64_t)p[2] << 16) |
               ((uint64_t)p[3] << 24) |
               ((uint64_t)p[4] << 32) |
               ((uint64_t)p[5] << 40) |
               ((uint64_t)p[6] << 48) |
               ((uint64_t)p[7] << 56);
}

We also need a function to advance the bitstream to the next byte boundary:

/* Round x up to the next multiple of m, which must be a power of 2. */
static inline size_t round_up(size_t x, size_t m)
{
        assert((m & (m - 1)) == 0 && "m must be a power of two");
        return (x + m - 1) & (size_t)(-m); /* Hacker's Delight (2nd), 3-1. */
}

/* Align the input stream to the next 8-bit boundary and return a pointer to
 * that byte, which may be the past-the-end-of-stream byte. */
static inline const uint8_t *istream_byte_align(istream_t *is)
{
        const uint8_t *byte;

        assert(is->bitpos <= is->bitpos_end && "Not past end of stream.");

        is->bitpos = round_up(is->bitpos, 8);
        byte = is->src + is->bitpos / 8;
        assert(byte <= is->end);

        return byte;
}

For the output bitstream, we write bits using a read-modify-write process. In the fast path, writing bits amounts to a 64-bit read, some bit manipulation, and a 64-bit write.

/* Output bitstream. */
typedef struct ostream_t ostream_t;
struct ostream_t {
        uint8_t *dst;
        uint8_t *end;
        size_t bitpos;
        size_t bitpos_end;
};

/* Initialize an output stream to write LSB-first bits into dst[0..n-1]. */
static inline void ostream_init(ostream_t *os, uint8_t *dst, size_t n)
{
        os->dst = dst;
        os->end = dst + n;
        os->bitpos = 0;
        os->bitpos_end = n * 8;
}

/* Get the current bit position in the stream. */
static inline size_t ostream_bit_pos(const ostream_t *os)
{
        return os->bitpos;
}

/* Return the number of bytes written to the output buffer. */
static inline size_t ostream_bytes_written(ostream_t *os)
{
        return round_up(os->bitpos, 8) / 8;
}

/* Write n bits to the output stream. Returns false if there is not enough room
 * at the destination. */
static inline bool ostream_write(ostream_t *os, uint64_t bits, size_t n)
{
        uint8_t *p;
        uint64_t x;
        int shift, i;

        assert(n <= 57);
        assert(bits <= ((uint64_t)1 << n) - 1 && "Must fit in n bits.");

        if (os->bitpos_end - os->bitpos < n) {
                /* Not enough room. */
                return false;
        }

        p = &os->dst[os->bitpos / 8];
        shift = os->bitpos % 8;

        if (os->end - p >= 8) {
                /* Common case: read and write 8 bytes in one go. */
                x = read64le(p);
                x = lsb(x, shift);
                x |= bits << shift;
                write64le(p, x);
        } else {
                /* Slow case: read/write as many bytes as are available. */
                x = 0;
                for (i = 0; i < os->end - p; i++) {
                        x |= (uint64_t)p[i] << (i * 8);
                }
                x = lsb(x, shift);
                x |= bits << shift;
                for (i = 0; i < os->end - p; i++) {
                        p[i] = (uint8_t)(x >> (i * 8));
                }
        }

        os->bitpos += n;

        return true;
}

/* Write a 64-bit value x to dst in little-endian byte order. */
static inline void write64le(uint8_t *dst, uint64_t x)
{
        dst[0] = (uint8_t)(x >> 0);
        dst[1] = (uint8_t)(x >> 8);
        dst[2] = (uint8_t)(x >> 16);
        dst[3] = (uint8_t)(x >> 24);
        dst[4] = (uint8_t)(x >> 32);
        dst[5] = (uint8_t)(x >> 40);
        dst[6] = (uint8_t)(x >> 48);
        dst[7] = (uint8_t)(x >> 56);
}

We also need to write whole bytes to the stream efficiently. One could, of course, perform repeated 8-bit writes, but using memcpy is much faster:

/* Align the bitstream to the next byte boundary, then write the n bytes from
   src to it. Returns false if there is not enough room in the stream. */
static inline bool ostream_write_bytes_aligned(ostream_t *os,
                                               const uint8_t *src,
                                               size_t n)
{
        if (os->bitpos_end - round_up(os->bitpos, 8) < n * 8) {
                return false;
        }

        os->bitpos = round_up(os->bitpos, 8);
        memcpy(&os->dst[os->bitpos / 8], src, n);
        os->bitpos += n * 8;

        return true;
}

Unpacking (Inflation)


Since the compression process is called Deflate, letting the air out of something, decompression is sometimes called Inflate. Studying this process first will also help us understand how the format works. You can see the code in the first part of deflate.h and deflate.c, bits.h, tables.h and tables.c (generated using generate_tables.c).

Data compressed using Deflate is stored as a series of blocks. Each block starts with a 3-bit header, in which the first (least significant) bit is set if this is the final block of the series, and the other two bits indicate the block type.


There are three types of blocks: uncompressed (0), compressed using fixed Huffman codes (1), and compressed using “dynamic” Huffman codes (2).

This code performs unpacking using auxiliary functions for different types of blocks, which we will implement later:

typedef enum {
        HWINF_OK,   /* Inflation was successful. */
        HWINF_FULL, /* Not enough room in the output buffer. */
        HWINF_ERR   /* Error in the input data. */
} inf_stat_t;

/* Decompress (inflate) the Deflate stream in src. The number of input bytes
   used, at most src_len, is stored in *src_used on success. Output is written
   to dst. The number of bytes written, at most dst_cap, is stored in *dst_used
   on success. src[0..src_len-1] and dst[0..dst_cap-1] must not overlap.
   Returns a status value as defined above. */
inf_stat_t hwinflate(const uint8_t *src, size_t src_len, size_t *src_used,
                     uint8_t *dst, size_t dst_cap, size_t *dst_used)
{
        istream_t is;
        size_t dst_pos;
        uint64_t bits;
        bool bfinal;
        inf_stat_t s;

        istream_init(&is, src, src_len);
        dst_pos = 0;

        do {
                /* Read the 3-bit block header. */
                bits = istream_bits(&is);
                if (!istream_advance(&is, 3)) {
                        return HWINF_ERR;
                }
                bfinal = bits & 1;
                bits >>= 1;

                switch (lsb(bits, 2)) {
                case 0: /* 00: No compression. */
                        s = inf_noncomp_block(&is, dst, dst_cap, &dst_pos);
                        break;
                case 1: /* 01: Compressed with fixed Huffman codes. */
                        s = inf_fixed_block(&is, dst, dst_cap, &dst_pos);
                        break;
                case 2: /* 10: Compressed with "dynamic" Huffman codes. */
                        s = inf_dyn_block(&is, dst, dst_cap, &dst_pos);
                        break;
                default: /* Invalid block type. */
                        return HWINF_ERR;
                }

                if (s != HWINF_OK) {
                        return s;
                }
        } while (!bfinal);

        *src_used = (size_t)(istream_byte_align(&is) - src);

        assert(dst_pos <= dst_cap);
        *dst_used = dst_pos;

        return HWINF_OK;
}

Uncompressed Deflate Blocks


These "stored" blocks are the simplest type. A stored block begins at the next 8-bit boundary of the bitstream with a 16-bit word (len) giving the length of the block, followed by another 16-bit word (nlen) which is the one's complement of len (all of its bits inverted). Presumably nlen acts as a simple checksum of len: if the file is corrupted, the values will probably no longer be complementary and the program can detect the error.


After len and nlen comes the uncompressed data. Since the block length is a 16-bit value, the data size is limited to 65,535 bytes.

static inf_stat_t inf_noncomp_block(istream_t *is, uint8_t *dst,
                                    size_t dst_cap, size_t *dst_pos)
{
        const uint8_t *p;
        uint16_t len, nlen;

        p = istream_byte_align(is);

        /* Read len and nlen (2 x 16 bits). */
        if (!istream_advance(is, 32)) {
                return HWINF_ERR; /* Not enough input. */
        }
        len  = read16le(p);
        nlen = read16le(p + 2);
        p += 4;

        if (nlen != (uint16_t)~len) {
                return HWINF_ERR;
        }

        if (!istream_advance(is, len * 8)) {
                return HWINF_ERR; /* Not enough input. */
        }

        if (dst_cap - *dst_pos < len) {
                return HWINF_FULL; /* Not enough room to output. */
        }

        memcpy(&dst[*dst_pos], p, len);
        *dst_pos += len;

        return HWINF_OK;
}

Deflate blocks using fixed Huffman codes


Compressed Deflate blocks use Huffman codes to represent a sequence of LZ77 literals and back references, terminated by an end-of-block marker. The litlen Huffman code is used for literals, back reference lengths, and the end-of-block marker. The dist code is used for back reference distances.


The litlen code encodes values in the range 0-285. Values 0-255 are used for literal bytes, 256 is the end-of-block marker, and 257-285 are used for back reference lengths.

Back references are 3-258 bytes long. A litlen value determines a base length, to which zero or more extra bits from the stream are added to get the full length according to the table below. For example, litlen value 269 means a base length of 19 and two extra bits; adding the two bits from the stream gives a final length between 19 and 22.

Litlen  Extra bits  Lengths
257     0           3
258     0           4
259     0           5
260     0           6
261     0           7
262     0           8
263     0           9
264     0           10
265     1           11-12
266     1           13-14
267     1           15-16
268     1           17-18
269     2           19-22
270     2           23-26
271     2           27-30
272     2           31-34
273     3           35-42
274     3           43-50
275     3           51-58
276     3           59-66
277     4           67-82
278     4           83-98
279     4           99-114
280     4           115-130
281     5           131-162
282     5           163-194
283     5           195-226
284     5           227-257
285     0           258

Note that litlen value 284 plus 5 extra bits could represent lengths from 227 to 258; however, the specification states that length 258, the maximum back reference length, must be represented using a separate litlen value. This presumably shortens the encoding in situations where the maximum length occurs frequently.

The decompressor uses a table to get the base length and number of extra bits for a litlen value (minus 257):

/* Table of litlen symbol values minus 257 with corresponding base length
   and number of extra bits. */
struct litlen_tbl_t {
        uint16_t base_len : 9;
        uint16_t ebits : 7;
};
const struct litlen_tbl_t litlen_tbl[29] = {
/* 257 */ { 3, 0 },
/* 258 */ { 4, 0 },

...

/* 284 */ { 227, 5 },
/* 285 */ { 258, 0 }
};

The fixed litlen Huffman code is a canonical code using the following codeword lengths (286-287 are not valid litlen values, but they participate in the code construction):

Litlen values  Codeword length
0-143          8
144-255        9
256-279        7
280-287        8

The decompressor stores these lengths in a table convenient for passing to huffman_decoder_init:

const uint8_t fixed_litlen_lengths[288] = {
/*   0 */ 8,
/*   1 */ 8,

...

/* 287 */ 8,
};

Back reference distances range from 1 to 32,768. They are encoded using a scheme similar to the length encoding scheme: the dist Huffman code encodes values from 0 to 29, each of which corresponds to a base distance, to which extra bits are added to obtain the final distance:

Dist  Extra bits  Distances
0     0           1
1     0           2
2     0           3
3     0           4
4     1           5-6
5     1           7-8
6     2           9-12
7     2           13-16
8     3           17-24
9     3           25-32
10    4           33-48
11    4           49-64
12    5           65-96
13    5           97-128
14    6           129-192
15    6           193-256
16    7           257-384
17    7           385-512
18    8           513-768
19    8           769-1024
20    9           1025-1536
21    9           1537-2048
22    10          2049-3072
23    10          3073-4096
24    11          4097-6144
25    11          6145-8192
26    12          8193-12288
27    12          12289-16384
28    13          16385-24576
29    13          24577-32768

The fixed dist Huffman code is canonical, and all of its codewords are 5 bits long. Since it is so simple, the decompressor stores the lengths in a table that can be used with huffman_decoder_init (dist values 30-31 are not valid; they participate in the code construction but actually have no effect):

const uint8_t fixed_dist_lengths[32] = {
/*  0 */ 5,
/*  1 */ 5,

...

/* 31 */ 5,
};

The code to decompress a Deflate block using fixed Huffman codes:

static inf_stat_t inf_fixed_block(istream_t *is, uint8_t *dst,
                                  size_t dst_cap, size_t *dst_pos)
{
        huffman_decoder_t litlen_dec, dist_dec;

        huffman_decoder_init(&litlen_dec, fixed_litlen_lengths,
                             sizeof(fixed_litlen_lengths) /
                             sizeof(fixed_litlen_lengths[0]));
        huffman_decoder_init(&dist_dec, fixed_dist_lengths,
                             sizeof(fixed_dist_lengths) /
                             sizeof(fixed_dist_lengths[0]));

        return inf_block(is, dst, dst_cap, dst_pos, &litlen_dec, &dist_dec);
}

#define LITLEN_EOB 256
#define LITLEN_MAX 285
#define LITLEN_TBL_OFFSET 257
#define MIN_LEN 3
#define MAX_LEN 258

#define DISTSYM_MAX 29
#define MIN_DISTANCE 1
#define MAX_DISTANCE 32768

static inf_stat_t inf_block(istream_t *is, uint8_t *dst, size_t dst_cap,
                            size_t *dst_pos,
                            const huffman_decoder_t *litlen_dec,
                            const huffman_decoder_t *dist_dec)
{
        uint64_t bits;
        size_t used, used_tot, dist, len;
        int litlen, distsym;
        uint16_t ebits;

        while (true) {
                /* Read a litlen symbol. */
                bits = istream_bits(is);
                litlen = huffman_decode(litlen_dec, (uint16_t)bits, &used);
                bits >>= used;
                used_tot = used;

                if (litlen < 0 || litlen > LITLEN_MAX) {
                        /* Failed to decode, or invalid symbol. */
                        return HWINF_ERR;
                } else if (litlen <= UINT8_MAX) {
                        /* Literal. */
                        if (!istream_advance(is, used_tot)) {
                                return HWINF_ERR;
                        }
                        if (*dst_pos == dst_cap) {
                                return HWINF_FULL;
                        }
                        lz77_output_lit(dst, (*dst_pos)++, (uint8_t)litlen);
                        continue;
                } else if (litlen == LITLEN_EOB) {
                        /* End of block. */
                        if (!istream_advance(is, used_tot)) {
                                return HWINF_ERR;
                        }
                        return HWINF_OK;
                }

                /* It is a back reference. Figure out the length. */
                assert(litlen >= LITLEN_TBL_OFFSET && litlen <= LITLEN_MAX);
                len   = litlen_tbl[litlen - LITLEN_TBL_OFFSET].base_len;
                ebits = litlen_tbl[litlen - LITLEN_TBL_OFFSET].ebits;
                if (ebits != 0) {
                        len += lsb(bits, ebits);
                        bits >>= ebits;
                        used_tot += ebits;
                }
                assert(len >= MIN_LEN && len <= MAX_LEN);

                /* Get the distance. */
                distsym = huffman_decode(dist_dec, (uint16_t)bits, &used);
                bits >>= used;
                used_tot += used;

                if (distsym < 0 || distsym > DISTSYM_MAX) {
                        /* Failed to decode, or invalid symbol. */
                        return HWINF_ERR;
                }
                dist  = dist_tbl[distsym].base_dist;
                ebits = dist_tbl[distsym].ebits;
                if (ebits != 0) {
                        dist += lsb(bits, ebits);
                        bits >>= ebits;
                        used_tot += ebits;
                }
                assert(dist >= MIN_DISTANCE && dist <= MAX_DISTANCE);

                assert(used_tot <= ISTREAM_MIN_BITS);
                if (!istream_advance(is, used_tot)) {
                        return HWINF_ERR;
                }

                /* Bounds check and output the backref. */
                if (dist > *dst_pos) {
                        return HWINF_ERR;
                }
                if (round_up(len, 8) <= dst_cap - *dst_pos) {
                        output_backref64(dst, *dst_pos, dist, len);
                } else if (len <= dst_cap - *dst_pos) {
                        lz77_output_backref(dst, *dst_pos, dist, len);
                } else {
                        return HWINF_FULL;
                }
                (*dst_pos) += len;
        }
}

Note the optimization: when there is enough room in the output buffer, we output back references using the function below, which copies 64 bits at a time. It is "sloppy" in the sense that it often copies a few extra bytes (up to the next multiple of eight), but it works much faster than lz77_output_backref because it needs fewer loop iterations and memory accesses. In fact, short back references are now handled in a single iteration, which is very good for branch prediction.

/* Output the (dist,len) backref at dst_pos in dst using 64-bit wide writes.
   There must be enough room for len bytes rounded to the next multiple of 8. */
static void output_backref64(uint8_t *dst, size_t dst_pos, size_t dist,
                             size_t len)
{
        size_t i;
        uint64_t tmp;

        assert(len > 0);
        assert(dist <= dst_pos && "cannot reference before beginning of dst");

        if (len > dist) {
                /* Self-overlapping backref; fall back to byte-by-byte copy. */
                lz77_output_backref(dst, dst_pos, dist, len);
                return;
        }

        i = 0;
        do {
                memcpy(&tmp, &dst[dst_pos - dist + i], 8);
                memcpy(&dst[dst_pos + i], &tmp, 8);
                i += 8;
        } while (i < len);
}

Deflate blocks using dynamic Huffman codes


Deflate blocks using dynamic Huffman codes work the same way as described above, but instead of the predefined litlen and dist codes, they use codes stored in the Deflate stream itself, at the beginning of the block. The name is perhaps unfortunate, since "dynamic Huffman codes" can also refer to codes that change during encoding, i.e. adaptive Huffman coding. The codes described here have nothing to do with that procedure; they are dynamic only in the sense that different blocks can use different codes.

Generating the dynamic litlen and dist codes is the most difficult part of the Deflate format. But once the codes have been generated, decompression proceeds the same way as in the previous part, using inf_block:

static inf_stat_t inf_dyn_block(istream_t *is, uint8_t *dst,
                                size_t dst_cap, size_t *dst_pos)
{
        inf_stat_t s;
        huffman_decoder_t litlen_dec, dist_dec;

        s = init_dyn_decoders(is, &litlen_dec, &dist_dec);
        if (s != HWINF_OK) {
                return s;
        }

        return inf_block(is, dst, dst_cap, dst_pos, &litlen_dec, &dist_dec);
}

The litlen and dist codes for dynamic Deflate blocks are stored as a series of codeword lengths. The lengths themselves are encoded using a third Huffman code, codelen. That code is in turn defined by codeword lengths ( codelen_lens) which are also stored in the block (did I mention this is complicated?).


At the beginning of a dynamic block are 14 bits that define how many litlen, dist, and codelen codeword lengths are to be read from the block:

#define MIN_CODELEN_LENS 4
#define MAX_CODELEN_LENS 19

#define MIN_LITLEN_LENS 257
#define MAX_LITLEN_LENS 288

#define MIN_DIST_LENS 1
#define MAX_DIST_LENS 32

#define CODELEN_MAX_LIT 15

#define CODELEN_COPY 16
#define CODELEN_COPY_MIN 3
#define CODELEN_COPY_MAX 6

#define CODELEN_ZEROS 17
#define CODELEN_ZEROS_MIN 3
#define CODELEN_ZEROS_MAX 10

#define CODELEN_ZEROS2 18
#define CODELEN_ZEROS2_MIN 11
#define CODELEN_ZEROS2_MAX 138

/* RFC 1951, 3.2.7 */
static const int codelen_lengths_order[MAX_CODELEN_LENS] =
{ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 };

static inf_stat_t init_dyn_decoders(istream_t *is,
                                    huffman_decoder_t *litlen_dec,
                                    huffman_decoder_t *dist_dec)
{
        uint64_t bits;
        size_t num_litlen_lens, num_dist_lens, num_codelen_lens;
        uint8_t codelen_lengths[MAX_CODELEN_LENS];
        uint8_t code_lengths[MAX_LITLEN_LENS + MAX_DIST_LENS];
        size_t i, n, used;
        int sym;
        huffman_decoder_t codelen_dec;

        bits = istream_bits(is);

        /* Number of litlen codeword lengths (5 bits + 257). */
        num_litlen_lens = lsb(bits, 5) + MIN_LITLEN_LENS;
        bits >>= 5;
        assert(num_litlen_lens <= MAX_LITLEN_LENS);

        /* Number of dist codeword lengths (5 bits + 1). */
        num_dist_lens = lsb(bits, 5) + MIN_DIST_LENS;
        bits >>= 5;
        assert(num_dist_lens <= MAX_DIST_LENS);

        /* Number of code length lengths (4 bits + 4). */
        num_codelen_lens = lsb(bits, 4) + MIN_CODELEN_LENS;
        bits >>= 4;
        assert(num_codelen_lens <= MAX_CODELEN_LENS);

        if (!istream_advance(is, 5 + 5 + 4)) {
                return HWINF_ERR;
        }

Next come the codeword lengths for the codelen code. These lengths are plain three-bit values, but written in the special order given by codelen_lengths_order. Although 19 lengths need to be defined, only num_codelen_lens of them are read from the stream; the rest are implicitly zero. The lengths are listed in a specific order so that zero lengths are more likely to fall at the end of the list and not have to be stored in the block.

       /* Read the codelen codeword lengths (3 bits each)
           and initialize the codelen decoder. */
        for (i = 0; i < num_codelen_lens; i++) {
                bits = istream_bits(is);
                codelen_lengths[codelen_lengths_order[i]] =
                        (uint8_t)lsb(bits, 3);
                if (!istream_advance(is, 3)) {
                        return HWINF_ERR;
                }
        }
        for (; i < MAX_CODELEN_LENS; i++) {
                codelen_lengths[codelen_lengths_order[i]] = 0;
        }
        if (!huffman_decoder_init(&codelen_dec, codelen_lengths,
                                  MAX_CODELEN_LENS)) {
                return HWINF_ERR;
        }

With the codelen decoder set up, we can read the litlen and dist codeword lengths from the stream.

       /* Read the litlen and dist codeword lengths. */
        i = 0;
        while (i < num_litlen_lens + num_dist_lens) {
                bits = istream_bits(is);
                sym = huffman_decode(&codelen_dec, (uint16_t)bits, &used);
                bits >>= used;
                if (!istream_advance(is, used)) {
                        return HWINF_ERR;
                }

                if (sym >= 0 && sym <= CODELEN_MAX_LIT) {
                        /* A literal codeword length. */
                        code_lengths[i++] = (uint8_t)sym;
                }

The symbols 16, 17 and 18 are not actual lengths; they indicate that the previous length should be repeated some number of times, or that a run of zero lengths follows:

               else if (sym == CODELEN_COPY) {
                        /* Copy the previous codeword length 3--6 times. */
                        if (i < 1) {
                                return HWINF_ERR; /* No previous length. */
                        }
                        n = lsb(bits, 2) + CODELEN_COPY_MIN; /* 2 bits + 3 */
                        if (!istream_advance(is, 2)) {
                                return HWINF_ERR;
                        }
                        assert(n >= CODELEN_COPY_MIN && n <= CODELEN_COPY_MAX);
                        if (i + n > num_litlen_lens + num_dist_lens) {
                                return HWINF_ERR;
                        }
                        while (n--) {
                                code_lengths[i] = code_lengths[i - 1];
                                i++;
                        }
                } else if (sym == CODELEN_ZEROS) {
                        /* 3--10 zeros. */
                        n = lsb(bits, 3) + CODELEN_ZEROS_MIN; /* 3 bits + 3 */
                        if (!istream_advance(is, 3)) {
                                return HWINF_ERR;
                        }
                        assert(n >= CODELEN_ZEROS_MIN &&
                               n <= CODELEN_ZEROS_MAX);
                        if (i + n > num_litlen_lens + num_dist_lens) {
                                return HWINF_ERR;
                        }
                        while (n--) {
                                code_lengths[i++] = 0;
                        }
                } else if (sym == CODELEN_ZEROS2) {
                        /* 11--138 zeros. */
                        n = lsb(bits, 7) + CODELEN_ZEROS2_MIN; /* 7 bits + 11 */
                        if (!istream_advance(is, 7)) {
                                return HWINF_ERR;
                        }
                        assert(n >= CODELEN_ZEROS2_MIN &&
                               n <= CODELEN_ZEROS2_MAX);
                        if (i + n > num_litlen_lens + num_dist_lens) {
                                return HWINF_ERR;
                        }
                        while (n--) {
                                code_lengths[i++] = 0;
                        }
                } else {
                        /* Invalid symbol. */
                        return HWINF_ERR;
                }
        }

Note that the litlen and dist lengths are read one after another into the code_lengths array. They cannot be read separately, because runs of code lengths can carry over from the last litlen lengths into the first dist lengths.

With the codeword lengths prepared, we can set up the Huffman decoders and return to the task of decoding literals and back references:

       if (!huffman_decoder_init(litlen_dec, &code_lengths[0],
                                  num_litlen_lens)) {
                return HWINF_ERR;
        }
        if (!huffman_decoder_init(dist_dec, &code_lengths[num_litlen_lens],
                                  num_dist_lens)) {
                return HWINF_ERR;
        }

        return HWINF_OK;
}

Compression (Deflation)


In the previous parts we built all the tools needed for Deflate compression: Lempel-Ziv compression, Huffman coding, bit streams, and descriptions of the three Deflate block types. In this part we put it all together to perform Deflate compression.

Lempel-Ziv compression parses the source data into a sequence of back references and literals. This sequence must be split up and encoded into Deflate blocks as described in the previous part. Choosing how to split it is often called blocking. On the one hand, each new block carries some overhead, whose size depends on the block type and its contents; fewer blocks means less overhead. On the other hand, the cost of starting a new block can pay off, for example if the characteristics of the data allow more efficient Huffman coding and reduce the total output size.

Blocking is a difficult optimization problem. Some compressors (for example, Zopfli ) try harder than others, but most simply use a greedy approach: emit a block as soon as it reaches a certain size.

Different types of blocks have their own size restrictions:

  • Uncompressed blocks can contain no more than 65,535 bytes.
  • Huffman fixed codes do not have a maximum size.
  • Blocks with dynamic Huffman codes generally have no maximum size either, but since our Huffman implementation uses 16-bit symbol frequencies, we are limited to 65,535 symbols.

To be able to use any block type freely, we limit the block size to 65,534 bytes:

/* The largest number of bytes that will fit in any kind of block is 65,534.
   It will fit in an uncompressed block (max 65,535 bytes) and a Huffman
   block with only literals (65,535 symbols including end-of-block marker). */
#define MAX_BLOCK_LEN_BYTES 65534

To keep track of the outgoing bitstream and the contents of the current block during compression, we use the following structure:

typedef struct deflate_state_t deflate_state_t;
struct deflate_state_t {
        ostream_t os;

        const uint8_t *block_src; /* First src byte in the block. */

        size_t block_len;       /* Number of symbols in the current block. */
        size_t block_len_bytes; /* Number of src bytes in the block. */

        /* Symbol frequencies for the current block. */
        uint16_t litlen_freqs[LITLEN_MAX + 1];
        uint16_t dist_freqs[DISTSYM_MAX + 1];

        struct {
                uint16_t distance;    /* Backref distance. */
                union {
                        uint16_t lit; /* Literal byte or end-of-block. */
                        uint16_t len; /* Backref length (distance != 0). */
                } u;
        } block[MAX_BLOCK_LEN_BYTES + 1];
};

static void reset_block(deflate_state_t *s)
{
        s->block_len = 0;
        s->block_len_bytes = 0;
        memset(s->litlen_freqs, 0, sizeof(s->litlen_freqs));
        memset(s->dist_freqs, 0, sizeof(s->dist_freqs));
}

To add the results of lz77_compress to the block, we use callback functions, and when the block reaches its maximum size we write it out to the bitstream:

static bool lit_callback(uint8_t lit, void *aux)
{
        deflate_state_t *s = aux;

        if (s->block_len_bytes + 1 > MAX_BLOCK_LEN_BYTES) {
                if (!write_block(s, false)) {
                        return false;
                }
                s->block_src += s->block_len_bytes;
                reset_block(s);
        }

        assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
        s->block[s->block_len  ].distance = 0;
        s->block[s->block_len++].u.lit = lit;
        s->block_len_bytes++;

        s->litlen_freqs[lit]++;

        return true;
}

static bool backref_callback(size_t dist, size_t len, void *aux)
{
        deflate_state_t *s = aux;

        if (s->block_len_bytes + len > MAX_BLOCK_LEN_BYTES) {
                if (!write_block(s, false)) {
                        return false;
                }
                s->block_src += s->block_len_bytes;
                reset_block(s);
        }

        assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
        s->block[s->block_len  ].distance = (uint16_t)dist;
        s->block[s->block_len++].u.len = (uint16_t)len;
        s->block_len_bytes += len;

        assert(len >= MIN_LEN && len <= MAX_LEN);
        assert(dist >= MIN_DISTANCE && dist <= MAX_DISTANCE);
        s->litlen_freqs[len2litlen[len]]++;
        s->dist_freqs[distance2dist[dist]]++;

        return true;
}

The most interesting part is writing the blocks. If the block is uncompressed, it is simple:

static bool write_uncomp_block(deflate_state_t *s, bool final)
{
        uint8_t len_nlen[4];

        /* Write the block header. */
        if (!ostream_write(&s->os, (0x0 << 1) | final, 3)) {
                return false;
        }

        len_nlen[0] = (uint8_t)(s->block_len_bytes >> 0);
        len_nlen[1] = (uint8_t)(s->block_len_bytes >> 8);
        len_nlen[2] = ~len_nlen[0];
        len_nlen[3] = ~len_nlen[1];

        if (!ostream_write_bytes_aligned(&s->os, len_nlen, sizeof(len_nlen))) {
                return false;
        }

        if (!ostream_write_bytes_aligned(&s->os, s->block_src,
                                         s->block_len_bytes)) {
                return false;
        }

        return true;
}

To write a static Huffman block, we first generate canonical codes from the fixed codeword lengths for the litlen and dist codes. Then we iterate over the block, writing the symbols with those codes:

static bool write_static_block(deflate_state_t *s, bool final)
{
        huffman_encoder_t litlen_enc, dist_enc;

        /* Write the block header. */
        if (!ostream_write(&s->os, (0x1 << 1) | final, 3)) {
                return false;
        }

        huffman_encoder_init2(&litlen_enc, fixed_litlen_lengths,
                              sizeof(fixed_litlen_lengths) /
                              sizeof(fixed_litlen_lengths[0]));
        huffman_encoder_init2(&dist_enc, fixed_dist_lengths,
                              sizeof(fixed_dist_lengths) /
                              sizeof(fixed_dist_lengths[0]));

        return write_huffman_block(s, &litlen_enc, &dist_enc);
}

static bool write_huffman_block(deflate_state_t *s,
                                const huffman_encoder_t *litlen_enc,
                                const huffman_encoder_t *dist_enc)
{
        size_t i, nbits;
        uint64_t distance, dist, len, litlen, bits, ebits;

        for (i = 0; i < s->block_len; i++) {
                if (s->block[i].distance == 0) {
                        /* Literal or EOB. */
                        litlen = s->block[i].u.lit;
                        assert(litlen <= LITLEN_EOB);
                        if (!ostream_write(&s->os,
                                           litlen_enc->codewords[litlen],
                                           litlen_enc->lengths[litlen])) {
                                return false;
                        }
                        continue;
                }

                /* Back reference length. */
                len = s->block[i].u.len;
                litlen = len2litlen[len];

                /* litlen bits */
                bits = litlen_enc->codewords[litlen];
                nbits = litlen_enc->lengths[litlen];

                /* ebits */
                ebits = len - litlen_tbl[litlen - LITLEN_TBL_OFFSET].base_len;
                bits |= ebits << nbits;
                nbits += litlen_tbl[litlen - LITLEN_TBL_OFFSET].ebits;

                /* Back reference distance. */
                distance = s->block[i].distance;
                dist = distance2dist[distance];

                /* dist bits */
                bits |= (uint64_t)dist_enc->codewords[dist] << nbits;
                nbits += dist_enc->lengths[dist];

                /* ebits */
                ebits = distance - dist_tbl[dist].base_dist;
                bits |= ebits << nbits;
                nbits += dist_tbl[dist].ebits;

                if (!ostream_write(&s->os, bits, nbits)) {
                        return false;
                }
        }

        return true;
}

Writing dynamic Huffman blocks is harder, of course, because they contain the tricky encoding of the litlen and dist codes. To represent that encoding, we use the following structure:

typedef struct codelen_sym_t codelen_sym_t;
struct codelen_sym_t {
        uint8_t sym;
        uint8_t count; /* For symbols 16, 17, 18. */
};

First we drop the trailing zeros from the litlen and dist codeword lengths, then copy the lengths into a single array for subsequent encoding. We cannot drop all the zeros: a Deflate block cannot be encoded without at least one dist code. Nor can there be fewer than 257 litlen codes; but since we always have the end-of-block marker, there will always be a non-zero code length for symbol 256.

/* Encode litlen_lens and dist_lens into encoded. *num_litlen_lens and
   *num_dist_lens will be set to the number of encoded litlen and dist lens,
   respectively. Returns the number of elements in encoded. */
static size_t encode_dist_litlen_lens(const uint8_t *litlen_lens,
                                      const uint8_t *dist_lens,
                                      codelen_sym_t *encoded,
                                      size_t *num_litlen_lens,
                                      size_t *num_dist_lens)
{
        size_t i, n;
        uint8_t lens[LITLEN_MAX + 1 + DISTSYM_MAX + 1];

        *num_litlen_lens = LITLEN_MAX + 1;
        *num_dist_lens = DISTSYM_MAX + 1;

        /* Drop trailing zero litlen lengths. */
        assert(litlen_lens[LITLEN_EOB] != 0 && "EOB len should be non-zero.");
        while (litlen_lens[*num_litlen_lens - 1] == 0) {
                (*num_litlen_lens)--;
        }
        assert(*num_litlen_lens >= MIN_LITLEN_LENS);

        /* Drop trailing zero dist lengths, keeping at least one. */
        while (dist_lens[*num_dist_lens - 1] == 0 && *num_dist_lens > 1) {
                (*num_dist_lens)--;
        }
        assert(*num_dist_lens >= MIN_DIST_LENS);

        /* Copy the lengths into a unified array. */
        n = 0;
        for (i = 0; i < *num_litlen_lens; i++) {
                lens[n++] = litlen_lens[i];
        }
        for (i = 0; i < *num_dist_lens; i++) {
                lens[n++] = dist_lens[i];
        }

        return encode_lens(lens, n, encoded);
}

With the code lengths combined into one array, we encode them, using special symbols for runs of identical code lengths:

/* Encode the n code lengths in lens into encoded, returning the number of
   elements in encoded. */
static size_t encode_lens(const uint8_t *lens, size_t n, codelen_sym_t *encoded)
{
        size_t i, j, num_encoded;
        uint8_t count;

        i = 0;
        num_encoded = 0;
        while (i < n) {
                if (lens[i] == 0) {
                        /* Scan past the end of this zero run (max 138). */
                        for (j = i; j < min(n, i + CODELEN_ZEROS2_MAX) &&
                                    lens[j] == 0; j++);
                        count = (uint8_t)(j - i);

                        if (count < CODELEN_ZEROS_MIN) {
                                /* Output a single zero. */
                                encoded[num_encoded++].sym = 0;
                                i++;
                                continue;
                        }

                        /* Output a repeated zero. */
                        if (count <= CODELEN_ZEROS_MAX) {
                                /* Repeated zero 3--10 times. */
                                assert(count >= CODELEN_ZEROS_MIN &&
                                       count <= CODELEN_ZEROS_MAX);
                                encoded[num_encoded].sym = CODELEN_ZEROS;
                                encoded[num_encoded++].count = count;
                        } else {
                                /* Repeated zero 11--138 times. */
                                assert(count >= CODELEN_ZEROS2_MIN &&
                                       count <= CODELEN_ZEROS2_MAX);
                                encoded[num_encoded].sym = CODELEN_ZEROS2;
                                encoded[num_encoded++].count = count;
                        }
                        i = j;
                        continue;
                }

                /* Output len. */
                encoded[num_encoded++].sym = lens[i++];

                /* Scan past the end of the run of this len (max 6). */
                for (j = i; j < min(n, i + CODELEN_COPY_MAX) &&
                            lens[j] == lens[i - 1]; j++);
                count = (uint8_t)(j - i);

                if (count >= CODELEN_COPY_MIN) {
                        /* Repeat last len 3--6 times. */
                        assert(count >= CODELEN_COPY_MIN &&
                               count <= CODELEN_COPY_MAX);
                        encoded[num_encoded].sym = CODELEN_COPY;
                        encoded[num_encoded++].count = count;
                        i = j;
                        continue;
                }
        }

        return num_encoded;
}

The symbols used in this encoding are themselves written with a Huffman code, the codelen code. The codeword lengths of the codelen code are written to the block in a special order, chosen so that zero lengths are likely to end up at the end. Here is a function that counts how many lengths need to be written:

static const int codelen_lengths_order[19] =
{ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 };

/* Count the number of significant (not trailing zeros) codelen lengths. */
size_t count_codelen_lens(const uint8_t *codelen_lens)
{
        size_t n = MAX_CODELEN_LENS;

        /* Drop trailing zero lengths. */
        while (codelen_lens[codelen_lengths_order[n - 1]] == 0) {
                n--;
        }

        /* The first 4 lengths in the order (16, 17, 18, 0) cannot be used to
           encode any non-zero lengths. Since there will always be at least
           one non-zero codeword length (for EOB), n will be >= 4. */
        assert(n >= MIN_CODELEN_LENS && n <= MAX_CODELEN_LENS);

        return n;
}

Suppose we have the litlen and dist codes, the run-length encoding of their codeword lengths, and the codelen code for those lengths. Now we can write a dynamic Huffman block:

static bool write_dynamic_block(deflate_state_t *s, bool final,
                                size_t num_litlen_lens, size_t num_dist_lens,
                                size_t num_codelen_lens,
                                const huffman_encoder_t *codelen_enc,
                                const codelen_sym_t *encoded_lens,
                                size_t num_encoded_lens,
                                const huffman_encoder_t *litlen_enc,
                                const huffman_encoder_t *dist_enc)
{
        size_t i;
        uint8_t codelen, sym;
        size_t nbits;
        uint64_t bits, hlit, hdist, hclen, count;

        /* Block header. */
        bits = (0x2 << 1) | final;
        nbits = 3;

        /* hlit (5 bits) */
        hlit = num_litlen_lens - MIN_LITLEN_LENS;
        bits |= hlit << nbits;
        nbits += 5;

        /* hdist (5 bits) */
        hdist = num_dist_lens - MIN_DIST_LENS;
        bits |= hdist << nbits;
        nbits += 5;

        /* hclen (4 bits) */
        hclen = num_codelen_lens - MIN_CODELEN_LENS;
        bits |= hclen << nbits;
        nbits += 4;

        if (!ostream_write(&s->os, bits, nbits)) {
                return false;
        }

        /* Codelen lengths. */
        for (i = 0; i < num_codelen_lens; i++) {
                codelen = codelen_enc->lengths[codelen_lengths_order[i]];
                if (!ostream_write(&s->os, codelen, 3)) {
                        return false;
                }
        }

        /* Litlen and dist code lengths. */
        for (i = 0; i < num_encoded_lens; i++) {
                sym = encoded_lens[i].sym;

                bits = codelen_enc->codewords[sym];
                nbits = codelen_enc->lengths[sym];

                count = encoded_lens[i].count;
                if (sym == CODELEN_COPY) { /* 2 ebits */
                        bits |= (count - CODELEN_COPY_MIN) << nbits;
                        nbits += 2;
                } else if (sym == CODELEN_ZEROS) { /* 3 ebits */
                        bits |= (count - CODELEN_ZEROS_MIN) << nbits;
                        nbits += 3;
                } else if (sym == CODELEN_ZEROS2) { /* 7 ebits */
                        bits |= (count - CODELEN_ZEROS2_MIN) << nbits;
                        nbits += 7;
                }

                if (!ostream_write(&s->os, bits, nbits)) {
                        return false;
                }
        }

        return write_huffman_block(s, litlen_enc, dist_enc);
}

For each block, we want to use the block type that requires the fewest bits. The length of an uncompressed block can be computed quickly:

/* Calculate the number of bits for an uncompressed block, including header. */
static size_t uncomp_block_len(const deflate_state_t *s)
{
        size_t bit_pos, padding;

        /* Bit position after writing the block header. */
        bit_pos = ostream_bit_pos(&s->os) + 3;
        padding = round_up(bit_pos, 8) - bit_pos;

        /* Header + padding + len/nlen + block contents. */
        return 3 + padding + 2 * 16 + s->block_len_bytes * 8;
}

For Huffman-encoded blocks, the body length can be computed from the litlen and dist symbol frequencies and codeword lengths:

/* Calculate the number of bits for a Huffman encoded block body. */
static size_t huffman_block_body_len(const deflate_state_t *s,
                                     const uint8_t *litlen_lens,
                                     const uint8_t *dist_lens)
{
        size_t i, freq, len;

        len = 0;

        for (i = 0; i <= LITLEN_MAX; i++) {
                freq = s->litlen_freqs[i];
                len += litlen_lens[i] * freq;

                if (i >= LITLEN_TBL_OFFSET) {
                        len += litlen_tbl[i - LITLEN_TBL_OFFSET].ebits * freq;
                }
        }

        for (i = 0; i <= DISTSYM_MAX; i++) {
                freq = s->dist_freqs[i];
                len += dist_lens[i] * freq;
                len += dist_tbl[i].ebits * freq;
        }

        return len;
}

The total length of a static block is the 3-bit header plus the body length. Computing the size of a dynamic block's header requires more work:

/* Calculate the number of bits for a dynamic Huffman block. */
static size_t dyn_block_len(const deflate_state_t *s, size_t num_codelen_lens,
                            const uint16_t *codelen_freqs,
                            const huffman_encoder_t *codelen_enc,
                            const huffman_encoder_t *litlen_enc,
                            const huffman_encoder_t *dist_enc)
{
        size_t len, i, freq;

        /* Block header. */
        len = 3;

        /* Nbr of litlen, dist, and codelen lengths. */
        len += 5 + 5 + 4;

        /* Codelen lengths. */
        len += 3 * num_codelen_lens;

        /* Codelen encoding. */
        for (i = 0; i < MAX_CODELEN_LENS; i++) {
                freq = codelen_freqs[i];
                len += codelen_enc->lengths[i] * freq;

                /* Extra bits. */
                if (i == CODELEN_COPY) {
                        len += 2 * freq;
                } else if (i == CODELEN_ZEROS) {
                        len += 3 * freq;
                } else if (i == CODELEN_ZEROS2) {
                        len += 7 * freq;
                }
        }

        return len + huffman_block_body_len(s, litlen_enc->lengths,
                                            dist_enc->lengths);
}

Now we can put everything together in the main function for writing blocks:

/* Write the current deflate block, marking it final if that parameter is true,
   returning false if there is not enough room in the output stream. */
static bool write_block(deflate_state_t *s, bool final)
{
        size_t old_bit_pos, uncomp_len, static_len, dynamic_len;
        huffman_encoder_t dyn_litlen_enc, dyn_dist_enc, codelen_enc;
        size_t num_encoded_lens, num_litlen_lens, num_dist_lens;
        codelen_sym_t encoded_lens[LITLEN_MAX + 1 + DISTSYM_MAX + 1];
        uint16_t codelen_freqs[MAX_CODELEN_LENS] = {0};
        size_t num_codelen_lens;
        size_t i;

        old_bit_pos = ostream_bit_pos(&s->os);

        /* Add the end-of-block marker in case we write a Huffman block. */
        assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
        assert(s->litlen_freqs[LITLEN_EOB] == 0);
        s->block[s->block_len  ].distance = 0;
        s->block[s->block_len++].u.lit = LITLEN_EOB;
        s->litlen_freqs[LITLEN_EOB] = 1;

        uncomp_len = uncomp_block_len(s);

        static_len = 3 + huffman_block_body_len(s, fixed_litlen_lengths,
                                                fixed_dist_lengths);

        /* Compute "dynamic" Huffman codes. */
        huffman_encoder_init(&dyn_litlen_enc, s->litlen_freqs,
                             LITLEN_MAX + 1, 15);
        huffman_encoder_init(&dyn_dist_enc, s->dist_freqs, DISTSYM_MAX + 1, 15);

        /* Encode the litlen and dist code lengths. */
        num_encoded_lens = encode_dist_litlen_lens(dyn_litlen_enc.lengths,
                                                   dyn_dist_enc.lengths,
                                                   encoded_lens,
                                                   &num_litlen_lens,
                                                   &num_dist_lens);

        /* Compute the codelen code. */
        for (i = 0; i < num_encoded_lens; i++) {
                codelen_freqs[encoded_lens[i].sym]++;
        }
        huffman_encoder_init(&codelen_enc, codelen_freqs, MAX_CODELEN_LENS, 7);
        num_codelen_lens = count_codelen_lens(codelen_enc.lengths);

        dynamic_len = dyn_block_len(s, num_codelen_lens, codelen_freqs,
                                    &codelen_enc, &dyn_litlen_enc,
                                    &dyn_dist_enc);

        if (uncomp_len <= dynamic_len && uncomp_len <= static_len) {
                if (!write_uncomp_block(s, final)) {
                        return false;
                }
                assert(ostream_bit_pos(&s->os) - old_bit_pos == uncomp_len);
        } else if (static_len <= dynamic_len) {
                if (!write_static_block(s, final)) {
                        return false;
                }
                assert(ostream_bit_pos(&s->os) - old_bit_pos == static_len);
        } else {
                if (!write_dynamic_block(s, final, num_litlen_lens,
                                         num_dist_lens, num_codelen_lens,
                                         &codelen_enc, encoded_lens,
                                         num_encoded_lens, &dyn_litlen_enc,
                                         &dyn_dist_enc)) {
                        return false;
                }
                assert(ostream_bit_pos(&s->os) - old_bit_pos == dynamic_len);
        }

        return true;
}

Finally, the driver of the whole compression process sets up the initial state, runs the Lempel-Ziv compression, and writes the resulting block:

/* Compress (deflate) the data in src into dst. The number of bytes output, at
   most dst_cap, is stored in *dst_used. Returns false if there is not enough
   room in dst. src and dst must not overlap. */
bool hwdeflate(const uint8_t *src, size_t src_len, uint8_t *dst,
               size_t dst_cap, size_t *dst_used)
{
        deflate_state_t s;

        ostream_init(&s.os, dst, dst_cap);
        reset_block(&s);
        s.block_src = src;

        if (!lz77_compress(src, src_len, &lit_callback,
                           &backref_callback, &s)) {
                return false;
        }

        if (!write_block(&s, true)) {
                return false;
        }

        /* The end of the final block should match the end of src. */
        assert(s.block_src + s.block_len_bytes == src + src_len);

        *dst_used = ostream_bytes_written(&s.os);

        return true;
}

Zip File Format


Above, we examined how the Deflate compression used in Zip files works. What about the file format itself? In this part, we will look at its structure and implementation in detail. The code is available in zip.h and zip.c .

Overview


The file format is described in the PKZip Application Note :

  1. Each file, or archive member, in a Zip file has a local file header with metadata about the member.
  2. The member's compressed data follows immediately after its local file header; the members are stored one after another in the file.
  3. At the end of the file is the central directory. It contains a central file header for each member, with the metadata and the offset of the member's local file header, followed by an end of central directory record. Together they serve as the index of the Zip file.


Each archive member is compressed and stored individually. This means that even if there is redundancy between files in the archive, it cannot be exploited to improve compression.

Placing the central directory at the end makes it possible to build an archive gradually. As each member is compressed, it is appended to the archive, and the index is written last, once all compressed sizes, and therefore all file offsets, are known. Adding files to an existing archive is also easy: a new member is placed after the last one, and the central directory is rewritten.

The ability to create archives gradually was especially important for distributing them across multiple floppy disks, or volumes. As compression proceeded, PKZip prompted the user to insert new diskettes, writing the central directory to the last one. To extract a multi-volume archive, PKZip first asked for the last diskette, in order to read the central directory, and then for whichever diskettes were needed to extract the requested files.

This may surprise you, but no rule prohibited having multiple files with the same name in an archive. This could cause great confusion when unpacking: if there are several files with the same name, which one should be extracted? In turn, it could lead to security problems. Because of the "Master Key" bug in Android (CVE-2013-4787 , slides and video from the report at Black Hat), an attacker could bypass the operating system's cryptographic signature checks when installing programs. Android programs are distributed as APK files, which are Zip files. As it turned out, if an APK contained several files with the same name, the signature verification code picked the last file with that name while the installer picked the first, so the verification was not actually performed. In other words, this small difference between two Zip libraries made it possible to bypass the operating system's entire security model.

Unlike most formats, Zip files are not required to start with a signature or magic number. In general, nothing says a Zip file has to begin in any particular way, which makes it easy to create files that are valid both as Zip and as some other format: polyglot files . For example, self-extracting Zip archives (such as pkz204g.exe ) are usually both executable and Zip files: the first part is executable code, followed by a Zip file (which the executable part unpacks). The OS can run it as an executable, while a Zip program will open it as a Zip file. This ability may be the reason a signature at the beginning of the file was never required.

Although such polyglot files are clever, they can lead to security problems, because they can trick programs that try to determine a file's contents, and allow malicious code to be smuggled inside files that appear to be of a different type. For example, exploits have used GIFAR files, which are simultaneously valid GIF images and Java archives (JAR, a kind of Zip file). For more information on this topic, see the article Abusing file formats (starting on page 18).

As we will see below, Zip files use 32-bit fields for offsets and sizes, limiting the archive and its members to four gigabytes. In version 4.5 of the Application Note, PKWare added format extensions that allow 64-bit offsets and sizes. Files using these extensions are in the Zip64 format, which we will not consider.

Data structures


End of central directory record


The end of central directory record (EOCDR) is usually the starting point for reading a Zip file. It contains the location and size of the central directory, as well as an optional comment about the whole archive.

In Zip files that spanned several diskettes, or volumes, the EOCDR also contained information about which disk it was on, which disk the central directory started on, and so on. This functionality is rarely used today, and the code in this article does not handle such files.

The EOCDR is identified by the signature 'P', 'K' followed by the bytes 5 and 6. After the signature comes the structure below, with all numbers stored little-endian:

/* End of Central Directory Record. */
struct eocdr {
        uint16_t disk_nbr;        /* Number of this disk. */
        uint16_t cd_start_disk;   /* Nbr. of disk with start of the CD. */
        uint16_t disk_cd_entries; /* Nbr. of CD entries on this disk. */
        uint16_t cd_entries;      /* Nbr. of Central Directory entries. */
        uint32_t cd_size;         /* Central Directory size in bytes. */
        uint32_t cd_offset;       /* Central Directory file offset. */
        uint16_t comment_len;     /* Archive comment length. */
        const uint8_t *comment;   /* Archive comment. */
};

The EOCDR must be located at the end of the file, but since it can be followed by a comment of arbitrary (16-bit) length, we may need to search for its exact position:

/* Read 16/32 bits little-endian and bump p forward afterwards. */
#define READ16(p) ((p) += 2, read16le((p) - 2))
#define READ32(p) ((p) += 4, read32le((p) - 4))

/* Size of the End of Central Directory Record, not including comment. */
#define EOCDR_BASE_SZ 22
#define EOCDR_SIGNATURE 0x06054b50  /* "PK\5\6" little-endian. */

static bool find_eocdr(struct eocdr *r, const uint8_t *src, size_t src_len)
{
        size_t comment_len;
        const uint8_t *p;
        uint32_t signature;

        for (comment_len = 0; comment_len <= UINT16_MAX; comment_len++) {
                if (src_len < EOCDR_BASE_SZ + comment_len) {
                        break;
                }

                p = &src[src_len - EOCDR_BASE_SZ - comment_len];
                signature = READ32(p);

                if (signature == EOCDR_SIGNATURE) {
                        r->disk_nbr = READ16(p);
                        r->cd_start_disk = READ16(p);
                        r->disk_cd_entries = READ16(p);
                        r->cd_entries = READ16(p);
                        r->cd_size = READ32(p);
                        r->cd_offset = READ32(p);
                        r->comment_len = READ16(p);
                        r->comment = p;
                        assert(p == &src[src_len - comment_len] &&
                               "All fields read.");

                        if (r->comment_len == comment_len) {
                                return true;
                        }
                }
        }

        return false;
}

Writing an EOCDR is straightforward. This function writes one and returns the number of bytes written:

/* Write 16/32 bits little-endian and bump p forward afterwards. */
#define WRITE16(p, x) (write16le((p), (x)), (p) += 2)
#define WRITE32(p, x) (write32le((p), (x)), (p) += 4)

static size_t write_eocdr(uint8_t *dst, const struct eocdr *r)
{
        uint8_t *p = dst;

        WRITE32(p, EOCDR_SIGNATURE);
        WRITE16(p, r->disk_nbr);
        WRITE16(p, r->cd_start_disk);
        WRITE16(p, r->disk_cd_entries);
        WRITE16(p, r->cd_entries);
        WRITE32(p, r->cd_size);
        WRITE32(p, r->cd_offset);
        WRITE16(p, r->comment_len);
        assert(p - dst == EOCDR_BASE_SZ);

        if (r->comment_len != 0) {
                memcpy(p, r->comment, r->comment_len);
                p += r->comment_len;
        }

        return (size_t)(p - dst);
}

Central file header


The central directory consists of central file headers written one after another, one for each archive member. Each header begins with the signature 'P', 'K', 1, 2, followed by this structure:

#define EXT_ATTR_DIR (1U << 4)
#define EXT_ATTR_ARC (1U << 5)

/* Central File Header (Central Directory Entry) */
struct cfh {
        uint16_t made_by_ver;    /* Version made by. */
        uint16_t extract_ver;    /* Version needed to extract. */
        uint16_t gp_flag;        /* General purpose bit flag. */
        uint16_t method;         /* Compression method. */
        uint16_t mod_time;       /* Modification time. */
        uint16_t mod_date;       /* Modification date. */
        uint32_t crc32;          /* CRC-32 checksum. */
        uint32_t comp_size;      /* Compressed size. */
        uint32_t uncomp_size;    /* Uncompressed size. */
        uint16_t name_len;       /* Filename length. */
        uint16_t extra_len;      /* Extra data length. */
        uint16_t comment_len;    /* Comment length. */
        uint16_t disk_nbr_start; /* Disk nbr. where file begins. */
        uint16_t int_attrs;      /* Internal file attributes. */
        uint32_t ext_attrs;      /* External file attributes. */
        uint32_t lfh_offset;     /* Local File Header offset. */
        const uint8_t *name;     /* Filename. */
        const uint8_t *extra;    /* Extra data. */
        const uint8_t *comment;  /* File comment. */
};

made_by_ver and extract_ver encode information about the OS and the version of the program that added this member, and the version needed to extract it. The most significant eight bits encode the operating system (for example, 0 means DOS, 3 Unix, 10 Windows NTFS), and the low eight bits encode the software version. We set both to decimal 20, meaning compatibility with PKZip 2.0.

gp_flag contains various flags. We are interested in:

  • Bit 0, which indicates whether the member is encrypted (we will not consider this);
  • Bits 1 and 2, which encode the Deflate compression level (0 for normal, 1 for maximum, 2 for fast, 3 for super fast).

method encodes the compression method. 0 means the data is not compressed, and 8 means Deflate is applied. Other values refer to old or newer algorithms, but almost all Zip files use one of these two.

mod_time and mod_date contain the file's modification time and date, encoded in MS-DOS format . With the code below, we can convert regular C time_t timestamps to and from the MS-DOS format:

/* Convert DOS date and time to time_t. */
static time_t dos2ctime(uint16_t dos_date, uint16_t dos_time)
{
        struct tm tm = {0};

        tm.tm_sec = (dos_time & 0x1f) * 2;  /* Bits 0--4:  Secs divided by 2. */
        tm.tm_min = (dos_time >> 5) & 0x3f; /* Bits 5--10: Minute. */
        tm.tm_hour = (dos_time >> 11);      /* Bits 11-15: Hour (0--23). */

        tm.tm_mday = (dos_date & 0x1f);          /* Bits 0--4: Day (1--31). */
        tm.tm_mon = ((dos_date >> 5) & 0xf) - 1; /* Bits 5--8: Month (1--12). */
        tm.tm_year = (dos_date >> 9) + 80;       /* Bits 9--15: Year-1980. */

        tm.tm_isdst = -1;

        return mktime(&tm);
}

/* Convert time_t to DOS date and time. */
static void ctime2dos(time_t t, uint16_t *dos_date, uint16_t *dos_time)
{
        struct tm *tm = localtime(&t);

        *dos_time = 0;
        *dos_time |= tm->tm_sec / 2;    /* Bits 0--4:  Second divided by two. */
        *dos_time |= tm->tm_min << 5;   /* Bits 5--10: Minute. */
        *dos_time |= tm->tm_hour << 11; /* Bits 11-15: Hour. */

        *dos_date = 0;
        *dos_date |= tm->tm_mday;             /* Bits 0--4:  Day (1--31). */
        *dos_date |= (tm->tm_mon + 1) << 5;   /* Bits 5--8:  Month (1--12). */
        *dos_date |= (tm->tm_year - 80) << 9; /* Bits 9--15: Year from 1980. */
}

The crc32 field contains the cyclic redundancy check value of the uncompressed data. It is used to verify data integrity after extraction. The implementation is in crc32.c .

comp_size and uncomp_size contain the compressed and uncompressed sizes of the member's file data. The next three fields contain the lengths of the filename, extra data, and comment that immediately follow the header. disk_nbr_start is intended for archives spanning multiple diskettes.

int_attrs and ext_attrs describe the internal and external attributes of the file. The internal attributes concern the file's contents; for example, the least significant bit indicates whether the file contains only text. The external attributes indicate whether the file is hidden, read-only, and so on. The contents of these fields depend on the OS, specifically on made_by_ver. On DOS, the low 8 bits contain the file attribute byte, which can be obtained from the Int 21/AX=4300h system call . For example, bit 4 means it is a directory, and bit 5 means the "archive" attribute is set (true for most files on DOS). As far as I understand, these bits are set similarly on other operating systems for compatibility. On Unix, the high 16 bits of this field contain the file mode bits returned by stat (2) in st_mode.

lfh_offset tells us where to find the local file header. name is the filename, and comment is an optional comment for this archive member (neither is null-terminated). extra may contain optional extra data such as Unix file ownership information, a more precise modification time, or Zip64 fields.

These functions read and write central file headers:

/* Size of a Central File Header, not including name, extra, and comment. */
#define CFH_BASE_SZ 46
#define CFH_SIGNATURE 0x02014b50 /* "PK\1\2" little-endian. */

static bool read_cfh(struct cfh *cfh, const uint8_t *src, size_t src_len,
                     size_t offset)
{
        const uint8_t *p;
        uint32_t signature;

        if (offset > src_len || src_len - offset < CFH_BASE_SZ) {
                return false;
        }

        p = &src[offset];
        signature = READ32(p);
        if (signature != CFH_SIGNATURE) {
                return false;
        }

        cfh->made_by_ver = READ16(p);
        cfh->extract_ver = READ16(p);
        cfh->gp_flag = READ16(p);
        cfh->method = READ16(p);
        cfh->mod_time = READ16(p);
        cfh->mod_date = READ16(p);
        cfh->crc32 = READ32(p);
        cfh->comp_size = READ32(p);
        cfh->uncomp_size = READ32(p);
        cfh->name_len = READ16(p);
        cfh->extra_len = READ16(p);
        cfh->comment_len = READ16(p);
        cfh->disk_nbr_start = READ16(p);
        cfh->int_attrs = READ16(p);
        cfh->ext_attrs = READ32(p);
        cfh->lfh_offset = READ32(p);
        cfh->name = p;
        cfh->extra = cfh->name + cfh->name_len;
        cfh->comment = cfh->extra + cfh->extra_len;
        assert(p == &src[offset + CFH_BASE_SZ] && "All fields read.");

        if (src_len - offset - CFH_BASE_SZ <
            cfh->name_len + cfh->extra_len + cfh->comment_len) {
                return false;
        }

        return true;
}

static size_t write_cfh(uint8_t *dst, const struct cfh *cfh)
{
        uint8_t *p = dst;

        WRITE32(p, CFH_SIGNATURE);
        WRITE16(p, cfh->made_by_ver);
        WRITE16(p, cfh->extract_ver);
        WRITE16(p, cfh->gp_flag);
        WRITE16(p, cfh->method);
        WRITE16(p, cfh->mod_time);
        WRITE16(p, cfh->mod_date);
        WRITE32(p, cfh->crc32);
        WRITE32(p, cfh->comp_size);
        WRITE32(p, cfh->uncomp_size);
        WRITE16(p, cfh->name_len);
        WRITE16(p, cfh->extra_len);
        WRITE16(p, cfh->comment_len);
        WRITE16(p, cfh->disk_nbr_start);
        WRITE16(p, cfh->int_attrs);
        WRITE32(p, cfh->ext_attrs);
        WRITE32(p, cfh->lfh_offset);
        assert(p - dst == CFH_BASE_SZ);

        if (cfh->name_len != 0) {
                memcpy(p, cfh->name, cfh->name_len);
                p += cfh->name_len;
        }

        if (cfh->extra_len != 0) {
                memcpy(p, cfh->extra, cfh->extra_len);
                p += cfh->extra_len;
        }

        if (cfh->comment_len != 0) {
                memcpy(p, cfh->comment, cfh->comment_len);
                p += cfh->comment_len;
        }

        return (size_t)(p - dst);
}

Local file header


The data of each archive member is preceded by a local file header, which repeats most of the information from the central header.

The duplication of data in the central and local headers was probably introduced so that PKZip would not have to keep the entire central directory in memory when unpacking. Instead, as each file is extracted, its name and other information can be read from the local header. In addition, local headers are useful for recovering files from Zip archives whose central directory is missing or damaged.

However, this redundancy is also a major source of ambiguity. For example, what happens if the file names in the central and local headers do not match? This often leads to bugs and security problems.

Not all information from the central header is duplicated. For example, the file attribute fields. Also, if the third least significant bit of gp_flag is set, the CRC-32, compressed size, and uncompressed size fields are set to zero, and that information can instead be found in a Data Descriptor block after the file data itself (we will not consider it). This makes it possible to write the local header before knowing the size of the member's file data or how much it will compress.

The local header begins with the signature 'P', 'K', 3, 4, followed by this structure:

/* Local File Header. */
struct lfh {
        uint16_t extract_ver;
        uint16_t gp_flag;
        uint16_t method;
        uint16_t mod_time;
        uint16_t mod_date;
        uint32_t crc32;
        uint32_t comp_size;
        uint32_t uncomp_size;
        uint16_t name_len;
        uint16_t extra_len;
        const uint8_t *name;
        const uint8_t *extra;
};

These functions read and write local headers, just like the other data structures:

/* Size of a Local File Header, not including name and extra. */
#define LFH_BASE_SZ 30
#define LFH_SIGNATURE 0x04034b50 /* "PK\3\4" little-endian. */

static bool read_lfh(struct lfh *lfh, const uint8_t *src, size_t src_len,
                     size_t offset)
{
        const uint8_t *p;
        uint32_t signature;

        if (offset > src_len || src_len - offset < LFH_BASE_SZ) {
                return false;
        }

        p = &src[offset];
        signature = READ32(p);
        if (signature != LFH_SIGNATURE) {
                return false;
        }

        lfh->extract_ver = READ16(p);
        lfh->gp_flag = READ16(p);
        lfh->method = READ16(p);
        lfh->mod_time = READ16(p);
        lfh->mod_date = READ16(p);
        lfh->crc32 = READ32(p);
        lfh->comp_size = READ32(p);
        lfh->uncomp_size = READ32(p);
        lfh->name_len = READ16(p);
        lfh->extra_len = READ16(p);
        lfh->name = p;
        lfh->extra = lfh->name + lfh->name_len;
        assert(p == &src[offset + LFH_BASE_SZ] && "All fields read.");

        if (src_len - offset - LFH_BASE_SZ < lfh->name_len + lfh->extra_len) {
                return false;
        }

        return true;
}

static size_t write_lfh(uint8_t *dst, const struct lfh *lfh)
{
        uint8_t *p = dst;

        WRITE32(p, LFH_SIGNATURE);
        WRITE16(p, lfh->extract_ver);
        WRITE16(p, lfh->gp_flag);
        WRITE16(p, lfh->method);
        WRITE16(p, lfh->mod_time);
        WRITE16(p, lfh->mod_date);
        WRITE32(p, lfh->crc32);
        WRITE32(p, lfh->comp_size);
        WRITE32(p, lfh->uncomp_size);
        WRITE16(p, lfh->name_len);
        WRITE16(p, lfh->extra_len);
        assert(p - dst == LFH_BASE_SZ);

        if (lfh->name_len != 0) {
                memcpy(p, lfh->name, lfh->name_len);
                p += lfh->name_len;
        }

        if (lfh->extra_len != 0) {
                memcpy(p, lfh->extra, lfh->extra_len);
                p += lfh->extra_len;
        }

        return (size_t)(p - dst);
}

Implementing Zip Reading


Using the functions above, we can read a Zip file into memory and obtain iterators for accessing the archive members:

typedef size_t zipiter_t; /* Zip archive member iterator. */

typedef struct zip_t zip_t;
struct zip_t {
        uint16_t num_members;    /* Number of members. */
        const uint8_t *comment;  /* Zip file comment (not terminated). */
        uint16_t comment_len;    /* Zip file comment length. */
        zipiter_t members_begin; /* Iterator to the first member. */
        zipiter_t members_end;   /* Iterator to the end of members. */

        const uint8_t *src;
        size_t src_len;
};

/* Initialize zip based on the source data. Returns true on success, or false
   if the data could not be parsed as a valid Zip file. */
bool zip_read(zip_t *zip, const uint8_t *src, size_t src_len)
{
        struct eocdr eocdr;
        struct cfh cfh;
        struct lfh lfh;
        size_t i, offset;
        const uint8_t *comp_data;

        zip->src = src;
        zip->src_len = src_len;

        if (!find_eocdr(&eocdr, src, src_len)) {
                return false;
        }

        if (eocdr.disk_nbr != 0 || eocdr.cd_start_disk != 0 ||
            eocdr.disk_cd_entries != eocdr.cd_entries) {
                return false; /* Cannot handle multi-volume archives. */
        }

        zip->num_members = eocdr.cd_entries;
        zip->comment = eocdr.comment;
        zip->comment_len = eocdr.comment_len;

        offset = eocdr.cd_offset;
        zip->members_begin = offset;

        /* Read the member info and do a few checks. */
        for (i = 0; i < eocdr.cd_entries; i++) {
                if (!read_cfh(&cfh, src, src_len, offset)) {
                        return false;
                }

                if (cfh.gp_flag & 1) {
                        return false; /* The member is encrypted. */
                }
                if (cfh.method != ZIP_STORED && cfh.method != ZIP_DEFLATED) {
                        return false; /* Unsupported compression method. */
                }
                if (cfh.method == ZIP_STORED &&
                    cfh.uncomp_size != cfh.comp_size) {
                        return false;
                }
                if (cfh.disk_nbr_start != 0) {
                        return false; /* Cannot handle multi-volume archives. */
                }
                if (memchr(cfh.name, '\0', cfh.name_len) != NULL) {
                        return false; /* Bad filename. */
                }

                if (!read_lfh(&lfh, src, src_len, cfh.lfh_offset)) {
                        return false;
                }

                comp_data = lfh.extra + lfh.extra_len;
                if (cfh.comp_size > src_len - (size_t)(comp_data - src)) {
                        return false; /* Member data does not fit in src. */
                }

                offset += CFH_BASE_SZ + cfh.name_len + cfh.extra_len +
                          cfh.comment_len;
        }

        zip->members_end = offset;

        return true;
}

As mentioned above, member iterators are simply offsets to Central File Headers, through which the member data can be accessed:

typedef enum { ZIP_STORED = 0, ZIP_DEFLATED = 8 } method_t;

typedef struct zipmemb_t zipmemb_t;
struct zipmemb_t {
        const uint8_t *name;      /* Member name (not null terminated). */
        uint16_t name_len;        /* Member name length. */
        time_t mtime;             /* Modification time. */
        uint32_t comp_size;       /* Compressed size. */
        const uint8_t *comp_data; /* Compressed data. */
        method_t method;          /* Compression method. */
        uint32_t uncomp_size;     /* Uncompressed size. */
        uint32_t crc32;           /* CRC-32 checksum. */
        const uint8_t *comment;   /* Comment (not null terminated). */
        uint16_t comment_len;     /* Comment length. */
        bool is_dir;              /* Whether this is a directory. */
        zipiter_t next;           /* Iterator to the next member. */
};

/* Get the Zip archive member through iterator it. */
zipmemb_t zip_member(const zip_t *zip, zipiter_t it)
{
        struct cfh cfh;
        struct lfh lfh;
        bool ok;
        zipmemb_t m;

        assert(it >= zip->members_begin && it < zip->members_end);

        ok = read_cfh(&cfh, zip->src, zip->src_len, it);
        assert(ok);

        ok = read_lfh(&lfh, zip->src, zip->src_len, cfh.lfh_offset);
        assert(ok);

        m.name = cfh.name;
        m.name_len = cfh.name_len;
        m.mtime = dos2ctime(cfh.mod_date, cfh.mod_time);
        m.comp_size = cfh.comp_size;
        m.comp_data = lfh.extra + lfh.extra_len;
        m.method = cfh.method;
        m.uncomp_size = cfh.uncomp_size;
        m.crc32 = cfh.crc32;
        m.comment = cfh.comment;
        m.comment_len = cfh.comment_len;
        m.is_dir = (cfh.ext_attrs & EXT_ATTR_DIR) != 0;

        m.next = it + CFH_BASE_SZ +
                 cfh.name_len + cfh.extra_len + cfh.comment_len;

        assert(m.next <= zip->members_end);

        return m;
}

Implementing Zip Writing


To write a Zip file into a memory buffer, we first need to know how much memory to allocate for it. Since we do not know how well the data will compress before we try, we compute an upper bound based on the sizes of the uncompressed members:

/* Compute an upper bound on the dst size required by zip_write() for an
 * archive with num_memb members with certain filenames, sizes, and archive
 * comment. Returns zero on error, e.g. if a filename is longer than 2^16-1, or
 * if the total file size is larger than 2^32-1. */
uint32_t zip_max_size(uint16_t num_memb, const char *const *filenames,
                      const uint32_t *file_sizes, const char *comment)
{
        size_t comment_len, name_len;
        uint64_t total;
        uint16_t i;

        comment_len = (comment == NULL ? 0 : strlen(comment));
        if (comment_len > UINT16_MAX) {
                return 0;
        }

        total = EOCDR_BASE_SZ + comment_len; /* EOCDR */

        for (i = 0; i < num_memb; i++) {
                assert(filenames[i] != NULL);
                name_len = strlen(filenames[i]);
                if (name_len > UINT16_MAX) {
                        return 0;
                }

                total += CFH_BASE_SZ + name_len; /* Central File Header */
                total += LFH_BASE_SZ + name_len; /* Local File Header */
                total += file_sizes[i];          /* Uncompressed data size. */
        }

        if (total > UINT32_MAX) {
                return 0;
        }

        return (uint32_t)total;
}

This function writes a Zip file, Deflate-compressing each member whenever that makes it smaller:

/* Write a Zip file containing num_memb members into dst, which must be large
   enough to hold the resulting data. Returns the number of bytes written, which
   is guaranteed to be less than or equal to the result of zip_max_size() when
   called with the corresponding arguments. comment shall be a null-terminated
   string or null. callback shall be null or point to a function which will
   get called after the compression of each member. */
uint32_t zip_write(uint8_t *dst, uint16_t num_memb,
                   const char *const *filenames,
                   const uint8_t *const *file_data,
                   const uint32_t *file_sizes,
                   const time_t *mtimes,
                   const char *comment,
                   void (*callback)(const char *filename, uint32_t size,
                                    uint32_t comp_size))
{
        uint16_t i;
        uint8_t *p;
        struct eocdr eocdr;
        struct cfh cfh;
        struct lfh lfh;
        bool ok;
        uint16_t name_len;
        uint8_t *data_dst;
        size_t comp_sz;
        uint32_t lfh_offset, cd_offset, eocdr_offset;

        p = dst;

        /* Write Local File Headers and deflated or stored data. */
        for (i = 0; i < num_memb; i++) {
                assert(filenames[i] != NULL);
                assert(strlen(filenames[i]) <= UINT16_MAX);
                name_len = (uint16_t)strlen(filenames[i]);

                data_dst = p + LFH_BASE_SZ + name_len;

                if (hwdeflate(file_data[i], file_sizes[i], data_dst,
                              file_sizes[i], &comp_sz) &&
                                comp_sz < file_sizes[i]) {
                        lfh.method = ZIP_DEFLATED;
                        assert(comp_sz <= UINT32_MAX);
                        lfh.comp_size = (uint32_t)comp_sz;
                } else {
                        memcpy(data_dst, file_data[i], file_sizes[i]);
                        lfh.method = ZIP_STORED;
                        lfh.comp_size = file_sizes[i];
                }

                if (callback != NULL) {
                        callback(filenames[i], file_sizes[i], lfh.comp_size);
                }

                lfh.extract_ver = (0 << 8) | 20; /* DOS | PKZIP 2.0 */
                lfh.gp_flag = (lfh.method == ZIP_DEFLATED ? (0x1 << 1) : 0x0);
                ctime2dos(mtimes[i], &lfh.mod_date, &lfh.mod_time);
                lfh.crc32 = crc32(file_data[i], file_sizes[i]);
                lfh.uncomp_size = file_sizes[i];
                lfh.name_len = name_len;
                lfh.extra_len = 0;
                lfh.name = (const uint8_t*)filenames[i];
                p += write_lfh(p, &lfh);
                p += lfh.comp_size;
        }

        assert(p - dst <= UINT32_MAX);
        cd_offset = (uint32_t)(p - dst);

        /* Write the Central Directory based on the Local File Headers. */
        lfh_offset = 0;
        for (i = 0; i < num_memb; i++) {
                ok = read_lfh(&lfh, dst, SIZE_MAX, lfh_offset);
                assert(ok);

                cfh.made_by_ver = lfh.extract_ver;
                cfh.extract_ver = lfh.extract_ver;
                cfh.gp_flag = lfh.gp_flag;
                cfh.method = lfh.method;
                cfh.mod_time = lfh.mod_time;
                cfh.mod_date = lfh.mod_date;
                cfh.crc32 = lfh.crc32;
                cfh.comp_size = lfh.comp_size;
                cfh.uncomp_size = lfh.uncomp_size;
                cfh.name_len = lfh.name_len;
                cfh.extra_len = 0;
                cfh.comment_len = 0;
                cfh.disk_nbr_start = 0;
                cfh.int_attrs = 0;
                cfh.ext_attrs = EXT_ATTR_ARC;
                cfh.lfh_offset = lfh_offset;
                cfh.name = lfh.name;
                p += write_cfh(p, &cfh);

                lfh_offset += LFH_BASE_SZ + lfh.name_len + lfh.comp_size;
        }

        assert(p - dst <= UINT32_MAX);
        eocdr_offset = (uint32_t)(p - dst);

        /* Write the End of Central Directory Record. */
        eocdr.disk_nbr = 0;
        eocdr.cd_start_disk = 0;
        eocdr.disk_cd_entries = num_memb;
        eocdr.cd_entries = num_memb;
        eocdr.cd_size = eocdr_offset - cd_offset;
        eocdr.cd_offset = cd_offset;
        eocdr.comment_len = (uint16_t)(comment == NULL ? 0 : strlen(comment));
        eocdr.comment = (const uint8_t*)comment;
        p += write_eocdr(p, &eocdr);

        assert(p - dst <= zip_max_size(num_memb, filenames, file_sizes,
                                       comment));

        return (uint32_t)(p - dst);
}

Hwzip


We now know how to read and write Zip files, and how to compress and decompress the data stored in them. Let's put all of these pieces together into a simple Zip program. The code is available in hwzip.c .

We will use a macro for simple error handling and helper functions for checked memory allocation:

#define PERROR_IF(cond, msg) if (cond) { perror(msg); exit(1); }

static void *xmalloc(size_t size)
{
        void *ptr = malloc(size);
        PERROR_IF(ptr == NULL, "malloc");
        return ptr;
}

static void *xrealloc(void *ptr, size_t size)
{
        ptr = realloc(ptr, size);
        PERROR_IF(ptr == NULL, "realloc");
        return ptr;
}

The other two functions are used to read and write files:

static uint8_t *read_file(const char *filename, size_t *file_sz)
{
        FILE *f;
        uint8_t *buf;
        size_t buf_cap;

        f = fopen(filename, "rb");
        PERROR_IF(f == NULL, "fopen");

        buf_cap = 4096;
        buf = xmalloc(buf_cap);

        *file_sz = 0;
        while (feof(f) == 0) {
                if (buf_cap - *file_sz == 0) {
                        buf_cap *= 2;
                        buf = xrealloc(buf, buf_cap);
                }

                *file_sz += fread(&buf[*file_sz], 1, buf_cap - *file_sz, f);
                PERROR_IF(ferror(f), "fread");
        }

        PERROR_IF(fclose(f) != 0, "fclose");
        return buf;
}

static void write_file(const char *filename, const uint8_t *data, size_t n)
{
        FILE *f;

        f = fopen(filename, "wb");
        PERROR_IF(f == NULL, "fopen");
        PERROR_IF(fwrite(data, 1, n, f) != n, "fwrite");
        PERROR_IF(fclose(f) != 0, "fclose");
}

Our Zip program can do three things: list the contents of a Zip file, extract it, and create new Zip files. Listing is the simplest:

static void list_zip(const char *filename)
{
        uint8_t *zip_data;
        size_t zip_sz;
        zip_t z;
        zipiter_t it;
        zipmemb_t m;

        printf("Listing ZIP archive: %s\n\n", filename);

        zip_data = read_file(filename, &zip_sz);

        if (!zip_read(&z, zip_data, zip_sz)) {
                printf("Failed to parse ZIP file!\n");
                exit(1);
        }

        if (z.comment_len != 0) {
                printf("%.*s\n\n", (int)z.comment_len, z.comment);
        }

        for (it = z.members_begin; it != z.members_end; it = m.next) {
                m = zip_member(&z, it);
                printf("%.*s\n", (int)m.name_len, m.name);
        }

        printf("\n");

        free(zip_data);
}

Extraction is a little more involved. We will use helper functions for null-terminating the file name (so it can be passed to fopen) and for inflating a member:

static char *terminate_str(const char *str, size_t n)
{
        char *p = xmalloc(n + 1);
        memcpy(p, str, n);
        p[n] = '\0';
        return p;
}

static uint8_t *inflate_member(const zipmemb_t *m)
{
        uint8_t *p;
        size_t src_used, dst_used;

        assert(m->method == ZIP_DEFLATED);

        p = xmalloc(m->uncomp_size);

        if (hwinflate(m->comp_data, m->comp_size, &src_used, p, m->uncomp_size,
                      &dst_used) != HWINF_OK) {
                free(p);
                return NULL;
        }

        if (src_used != m->comp_size || dst_used != m->uncomp_size) {
                free(p);
                return NULL;
        }

        return p;
}

Our program skips any archive member whose name contains a directory. This avoids so-called path traversal attacks , where a malicious archive tricks the program into writing files outside the directory chosen by the user. See the Info-ZIP FAQ for details.

static void extract_zip(const char *filename)
{
        uint8_t *zip_data;
        size_t zip_sz;
        zip_t z;
        zipiter_t it;
        zipmemb_t m;
        char *tname;
        uint8_t *inflated;
        const uint8_t *uncomp_data;

        printf("Extracting ZIP archive: %s\n\n", filename);

        zip_data = read_file(filename, &zip_sz);

        if (!zip_read(&z, zip_data, zip_sz)) {
                printf("Failed to read ZIP file!\n");
                exit(1);
        }

        if (z.comment_len != 0) {
                printf("%.*s\n\n", (int)z.comment_len, z.comment);
        }

        for (it = z.members_begin; it != z.members_end; it = m.next) {
                m = zip_member(&z, it);

                if (m.is_dir) {
                        printf(" (Skipping dir: %.*s)\n",
                               (int)m.name_len, m.name);
                        continue;
                }

                if (memchr(m.name, '/',  m.name_len) != NULL ||
                    memchr(m.name, '\\', m.name_len) != NULL) {
                        printf(" (Skipping file in dir: %.*s)\n",
                               (int)m.name_len, m.name);
                        continue;
                }

                assert(m.method == ZIP_STORED || m.method == ZIP_DEFLATED);
                printf(" %s: %.*s",
                       m.method == ZIP_STORED ? "Extracting" : " Inflating",
                       (int)m.name_len, m.name);
                fflush(stdout);

                if (m.method == ZIP_STORED) {
                        assert(m.uncomp_size == m.comp_size);
                        inflated = NULL;
                        uncomp_data = m.comp_data;
                } else {
                        inflated = inflate_member(&m);
                        if (inflated == NULL) {
                                printf("Error: inflation failed!\n");
                                exit(1);
                        }
                        uncomp_data = inflated;
                }

                if (crc32(uncomp_data, m.uncomp_size) != m.crc32) {
                        printf("Error: CRC-32 mismatch!\n");
                        exit(1);
                }

                tname = terminate_str((const char*)m.name, m.name_len);
                write_file(tname, uncomp_data, m.uncomp_size);
                printf("\n");

                free(inflated);
                free(tname);
        }

        printf("\n");
        free(zip_data);
}

To create a Zip archive, we read the input files and feed them to zip_write. Since the standard C library provides no way to get a file's modification time, we use the current time instead (fixing this is left as an exercise).

void zip_callback(const char *filename, uint32_t size, uint32_t comp_size)
{
        bool deflated = comp_size < size;

        printf(" %s: %s", deflated ? "Deflated" : "  Stored", filename);
        if (deflated) {
                printf(" (%u%%)", 100 - 100 * comp_size / size);
        }
        printf("\n");
}

static void create_zip(const char *zip_filename, const char *comment,
                       uint16_t n, const char *const *filenames)
{
        time_t mtime;
        time_t *mtimes;
        uint8_t **file_data;
        uint32_t *file_sizes;
        size_t file_size, zip_size;
        uint8_t *zip_data;
        uint16_t i;

        printf("Creating ZIP archive: %s\n\n", zip_filename);

        if (comment != NULL) {
                printf("%s\n\n", comment);
        }

        mtime = time(NULL);

        file_data = xmalloc(sizeof(*file_data) * n);
        file_sizes = xmalloc(sizeof(*file_sizes) * n);
        mtimes = xmalloc(sizeof(*mtimes) * n);

        for (i = 0; i < n; i++) {
                file_data[i] = read_file(filenames[i], &file_size);
                if (file_size >= UINT32_MAX) {
                        printf("%s is too large!\n", filenames[i]);
                        exit(1);
                }
                file_sizes[i] = (uint32_t)file_size;
                mtimes[i] = mtime;
        }

        zip_size = zip_max_size(n, filenames, file_sizes, comment);
        if (zip_size == 0) {
                printf("zip writing not possible");
                exit(1);
        }

        zip_data = xmalloc(zip_size);
        zip_size = zip_write(zip_data, n, filenames,
                             (const uint8_t *const *)file_data,
                             file_sizes, mtimes, comment, zip_callback);

        write_file(zip_filename, zip_data, zip_size);
        printf("\n");

        free(zip_data);
        for (i = 0; i < n; i++) {
                free(file_data[i]);
        }
        free(mtimes);
        free(file_sizes);
        free(file_data);
}

Finally, main checks the command-line arguments and decides what to do:

static void print_usage(const char *argv0)
{
        printf("Usage:\n\n");
        printf("  %s list <zipfile>\n", argv0);
        printf("  %s extract <zipfile>\n", argv0);
        printf("  %s create <zipfile> [-c <comment>] <files...>\n", argv0);
        printf("\n");
}

int main(int argc, char **argv) {
        printf("\n");
        printf("HWZIP " VERSION " -- A very simple ZIP program ");
        printf("from https://www.hanshq.net/zip.html\n");
        printf("\n");

        if (argc == 3 && strcmp(argv[1], "list") == 0) {
                list_zip(argv[2]);
        } else if (argc == 3 && strcmp(argv[1], "extract") == 0) {
                extract_zip(argv[2]);
        } else if (argc >= 3 && strcmp(argv[1], "create") == 0) {
                if (argc >= 5 && strcmp(argv[3], "-c") == 0) {
                        create_zip(argv[2], argv[4], (uint16_t)(argc - 5),
                                   (const char *const *)&argv[5]);
                } else {
                        create_zip(argv[2], NULL, (uint16_t)(argc - 3),
                                   (const char *const *)&argv[3]);
                }
        } else {
                print_usage(argv[0]);
                return 1;
        }

        return 0;
}

Build Instructions


The complete set of source files is available in hwzip-1.0.zip . To compile HWZip on Linux or Mac:

$ clang generate_tables.c && ./a.out > tables.c
$ clang -O3 -DNDEBUG -march=native -o hwzip crc32.c deflate.c huffman.c \
        hwzip.c lz77.c tables.c zip.c

On Windows, in a Visual Studio developer command prompt (if you don’t have Visual Studio, download the build tools ):

cl /TC generate_tables.c && generate_tables > tables.c
cl /O2 /DNDEBUG /MT /Fehwzip.exe /TC crc32.c deflate.c huffman.c hwzip.c ^
        lz77.c tables.c zip.c /link setargv.obj

(setargv.obj is linked in so that wildcard command-line arguments are expanded .)

Conclusion


It is fascinating how technology develops both quickly and slowly. The Zip format was created 30 years ago, based on technology from the fifties and seventies. And although much has changed since then, Zip files have essentially remained the same and are more widespread today than ever. I think it is useful to have a good understanding of how they work.

Exercises


  • Make HWZip record each file's modification time instead of the time the archive was created. Use stat (2) on Linux or Mac and GetFileTime on Windows. Or add a command-line flag that lets the user set a specific modification time.
  • Add support for extracting and creating gzip files. The gzip format is also based on Deflate (it stores a single compressed file together with a CRC-32 checksum) and is described in RFC 1952.
  • Reading a whole Zip file into memory can be a problem for large archives. Change HWZip to replace read_file with memory-mapped I/O: mmap (2) on Linux and Mac, or CreateFileMapping on Windows.
  • Add Zip64 support to HWZip so it can handle archives larger than 4 GiB. The format is described in appnote.txt.


