I have long wondered how data compression works, including in Zip files. One day I decided to satisfy my curiosity: to learn how compression works and to write my own Zip program. The implementation turned into an exciting programming exercise: there is great pleasure in building a well-oiled machine that takes data, translates its bits into a more efficient representation, and then puts everything back together. I hope you will enjoy reading about it too.

This article explains in great detail how Zip files and their compression scheme work: LZ77 compression, the Huffman algorithm, the Deflate algorithm, and more. It covers the history of the technology and walks through reasonably efficient implementations written from scratch in C. The source code is here: hwzip-1.0.zip.

I am very grateful to Ange Albertini, Gynvael Coldwind, Fabian Giesen, Jonas Skeppstedt (web), Primiano Tucci, and Nico Weber, who provided valuable feedback on drafts of this article.

Contents
Story
PKZip
In the eighties and early nineties, before the Internet became widespread, computer enthusiasts used dial-up modems to connect over the telephone network to Bulletin Board Systems (BBS). A BBS was an interactive computer system that let users send messages, play games, and share files. To get online, all you needed was a computer, a modem, and the phone number of a good BBS. Numbers were published in computer magazines and on other BBSes.

An important tool for easing file distribution was the archiver. It lets you store one or more files in a single archive file, which is more convenient for storing or transmitting information. Ideally, the archive also compresses the files to save space and transmission time. In the BBS days the popular archiver was Arc, written by Tom Henderson of System Enhancement Associates (SEA), a small company he founded with his brother-in-law.

In the late 1980s, programmer Phil Katz released his own version of Arc, PKArc. It was compatible with SEA's Arc but ran faster thanks to routines written in assembly language, and it used a new compression method. The program became popular, Katz quit his job, and he founded PKWare to focus on its further development. According to legend, most of the work took place at his mother's kitchen table in Glendale, Wisconsin.

Photo of Phil Katz from an article in the Milwaukee Sentinel, September 19, 1994.

However, SEA was not happy with Katz's initiative. The company accused him of trademark and copyright infringement. The litigation, and the controversy on the BBS network and in the PC world, became known as the Arc Wars. In the end, the dispute was settled in SEA's favor.

Abandoning Arc, in 1989 Katz created a new archive format, which he called Zip, and dedicated the format and the ".ZIP" file name extension to the public domain.
Katz's program for creating such files was called PKZip, and it soon spread throughout the BBS and PC world.

One aspect that most likely contributed to the success of the Zip format is that PKZip shipped with documentation, the Application Note, which explained in detail how the format works. This allowed others to study the format and create programs that generate, extract, or otherwise interact with Zip files.

Zip is a lossless compression format: after decompression, the data is exactly the same as before compression. The algorithm looks for redundancy in the source data and represents the information more efficiently. This differs from lossy compression, used in formats such as JPEG and MP3, where some of the information that is less perceptible to the human eye or ear is discarded during compression.

PKZip was distributed as shareware: it could be freely used and copied, but the author encouraged users to "register" the program. For $47 you got printed instructions, premium support, and an enhanced version of the application.

One of the key releases was PKZip 2.04c, published on December 28, 1992 (version 2.04g followed shortly after). It used the Deflate compression algorithm by default, and that release shaped the further development of compression in Zip files (see the article devoted to the release).

The Zip format has since been used in many other file formats. For example, Java archives (.jar), Android application packages (.apk), and Microsoft Office .docx files all use the Zip format. Many other formats and protocols use the same compression algorithm, Deflate. For instance, this web page was probably delivered to your browser as gzip, a format that uses Deflate compression.

Phil Katz died in 2000. PKWare still exists and maintains the Zip format, although the company now focuses mainly on data-security software.

Info-ZIP and zlib
Soon after PKZip's release in 1989, other programs for unpacking Zip files began to appear, for example an unzip program that could extract archives on Unix systems. In March 1990, a mailing list called Info-ZIP was created.

The Info-ZIP group released the free, open-source programs unzip and zip, used to extract and create Zip files. The code was ported to many systems, and these are still the standard Zip programs on Unix systems. This further boosted the popularity of Zip files.

Eventually, the Info-ZIP code that performed Deflate compression and decompression was extracted into a separate library, zlib, written by Jean-loup Gailly (compression) and Mark Adler (decompression).

Jean-loup Gailly (left) and Mark Adler (right) at the 2009 USENIX STUG Award.

One reason for creating the library was to make Deflate compression convenient to use in other applications and formats, for example the new gzip and PNG. These new formats were intended to replace Compress and GIF, which used the patent-encumbered LZW algorithm.

As part of creating these formats, Peter Deutsch wrote a Deflate specification and published it as Internet RFC 1951 in May 1996. It turned out to be a more accessible description than the original PKZip Application Note.

Today zlib is used everywhere. It is probably responsible for compressing this page on the web server and decompressing it in your browser, and most Zip files today are compressed and decompressed with zlib.

WinZip
Many of those who never used PKZip did use WinZip. As PC users moved from DOS to Windows, they moved from PKZip to WinZip.

It all started as a project of programmer Nico Mak, who wrote software for OS/2 at Mansfield Software Group in Storrs-Mansfield, Connecticut. Mak used Presentation Manager, the graphical user interface in OS/2, and was frustrated at having to switch from the file manager to DOS commands every time he wanted to create Zip files.

Mak wrote a simple GUI program that worked with Zip files directly in Presentation Manager, named it PMZip, and released it as shareware in the early 1990s.

OS/2 never took off, and the PC world went to Microsoft Windows instead. In 1991, Mak decided to learn how to write Windows programs, and his first project was to port his Zip application to the new OS. WinZip 1.00 was released in April 1991. It was distributed as shareware with a 21-day trial period and a $29 registration fee. It looked like this:

Early versions of WinZip used PKZip under the hood, but starting with version 5.0 in 1993 it used code from Info-ZIP to process Zip files directly. The user interface also evolved gradually.

WinZip 6.3 under Windows 3.11 for Workgroups.

WinZip was one of the most popular shareware programs of the 1990s, but it eventually lost relevance as operating systems gained built-in support for Zip files. Windows has handled them as "compressed folders" since 2001 (Windows XP), using the DynaZip library.

The company was originally called Nico Mak Computing. In 2000 it was renamed WinZip Computing, and Mak left it around that time. In 2005 the company was sold to Vector Capital, and it eventually ended up owned by Corel, which still releases WinZip as a product.

Lempel-Ziv Compression (LZ77)
Zip compression has two main ingredients: Lempel-Ziv compression and Huffman coding.

One way to compress text is to build a list of common words or phrases and replace occurrences of those words in the text with references to the dictionary. For example, the long word "compression" in the source text might be represented as #1234, where 1234 refers to the word's position in the list. This is called dictionary compression.

From the point of view of general-purpose compression, though, this method has several drawbacks. First, what exactly should go into the dictionary? The source data can be in different languages; it might not even be human-readable text. And if the dictionary is not agreed upon in advance between compression and decompression, it has to be stored and transmitted along with the compressed data, which reduces the benefit of compression.

An elegant solution to this problem is to use the source data itself as the dictionary. In the 1977 paper A Universal Algorithm for Sequential Data Compression, Jacob Ziv and Abraham Lempel (who worked at Technion) proposed a compression scheme in which the source data is represented as a sequence of triples:

(distance, length, next)

where distance and length form a back reference to a sequence of characters to copy from an earlier position in the text, and next is the next character in the output.

Abraham Lempel and Jacob Ziv.

Consider the following lines:

It was the best of times,
it was the worst of times,
In the second line, the sequence "t was the w" can be represented as (26, 10, w), since it is recreated by copying 10 characters starting 26 characters back, followed by the letter "w". For characters that have not occurred before, zero-length back references are used. For example, the initial "I" can be represented as (0, 0, I).

This scheme is called Lempel-Ziv compression, or LZ77 compression. In practical implementations of the algorithm, however, the next-character part of the triple is usually not used. Instead, literal characters are emitted individually, and (distance, length) pairs are used for back references (this variant is called LZSS compression). How literals and back references are encoded is a separate question; we will get to it below when we examine the Deflate algorithm.

This text:

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
can be compressed into this:

It was the best of times,
i(26,10)wor(27,24)age(25,4)wisdom(26,20)
foolishnes(57,14)epoch(33,4)belief(28,22)incredulity
(33,13)season(34,4)Light(28,23)Dark(120,17)
spring(31,4)hope(231,14)inter(27,4)despair,
we had everyth(57,4)before us(29,9)no(26,20)
we(12,3)all go(29,4)direct to Heaven
(36,28)(139,3)(83,3)(138,3)way
An important property of back references is that they can overlap. This happens when the length is greater than the distance. For example:

Fa-la-la-la-la

can be compressed to:

Fa-la(3,9)

This may seem strange, but the method works: once the first three "-la" bytes have been copied, copying continues using the newly generated bytes. This is effectively a form of run-length encoding, where part of the data is copied repeatedly to reach the desired length.

An interactive example of applying Lempel-Ziv compression to song lyrics can be found in Colin Morris's article Are Pop Lyrics Getting More Repetitive?.

Below is an example of copying a back reference in C. Note that because of possible overlap, we cannot use memcpy or memmove.
static inline void lz77_output_backref(uint8_t *dst, size_t dst_pos,
size_t dist, size_t len)
{
size_t i;
assert(dist <= dst_pos && "cannot reference before beginning of dst");
for (i = 0; i < len; i++) {
dst[dst_pos] = dst[dst_pos - dist];
dst_pos++;
}
}
Emitting literals is easy, but for completeness we will use a helper function:
static inline void lz77_output_lit(uint8_t *dst, size_t dst_pos, uint8_t lit)
{
dst[dst_pos] = lit;
}
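To see overlapping copies in action, here is a small self-contained sketch (the back-reference helper above is repeated so the example compiles on its own) that expands the literals "Fa-la" followed by the back reference (3, 9) from the earlier example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same byte-by-byte copy as lz77_output_backref above, repeated here so
   the sketch is self-contained. Copying one byte at a time makes
   overlapping references (len > dist) work: later bytes are copied from
   bytes generated earlier in the same call. */
static void output_backref(uint8_t *dst, size_t dst_pos, size_t dist,
                           size_t len)
{
        size_t i;

        assert(dist <= dst_pos && "cannot reference before beginning of dst");

        for (i = 0; i < len; i++) {
                dst[dst_pos] = dst[dst_pos - dist];
                dst_pos++;
        }
}

/* Expand the literals "Fa-la" followed by the back reference (3, 9)
   into dst (at least 15 bytes) and NUL-terminate the result. */
static void expand_fa_la(uint8_t *dst)
{
        memcpy(dst, "Fa-la", 5);      /* five literal bytes */
        output_backref(dst, 5, 3, 9); /* overlapping copy   */
        dst[14] = '\0';
}
```

The result is the 14-byte string "Fa-la-la-la-la", even though only three of the nine copied bytes existed when the copy started.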
Note that the caller of these output functions must ensure that dst has enough room for the generated data, and that a back reference does not reach before the start of the buffer.

The hard part is not emitting data from back references during decompression, but finding them in the first place when compressing the source data. This can be done in various ways; we will use a hash-table approach based on zlib, which is suggested in RFC 1951.

We will use a hash table over the three-character prefixes of positions previously seen in the data (shorter back references bring no benefit). Deflate allows back references within the previous 32,768 characters; this span is called the window. It enables streaming compression: the input is processed a little at a time, as long as the window of most recent bytes is kept in memory. However, our implementation assumes that all input is available and processes it in one go. This lets us focus on compression rather than on the bookkeeping that stream processing requires.

We will use two arrays: head contains, for each hash value of a three-character prefix, the most recent input position with that prefix, and prev contains, for each position, the previous position with the same hash value. In effect, head[h] is the head of a linked list of prefix positions with hash h, and prev[x] gives the element preceding x in that list.

#define LZ_WND_SIZE 32768
#define LZ_MAX_LEN 258
#define HASH_SIZE 15
#define NO_POS SIZE_MAX
bool lz77_compress(const uint8_t *src, size_t len,
bool (*lit_callback)(uint8_t lit, void *aux),
bool (*backref_callback)(size_t dist, size_t len, void *aux),
void *aux)
{
size_t head[1U << HASH_SIZE];
size_t prev[LZ_WND_SIZE];
uint16_t h;
size_t i, j, dist;
size_t match_len, match_pos;
size_t prev_match_len, prev_match_pos;
for (i = 0; i < sizeof(head) / sizeof(head[0]); i++) {
head[i] = NO_POS;
}
To insert a new string position into the hash table, prev is first updated to point to the current head, and then head is updated to the new position:

static void insert_hash(uint16_t hash, size_t pos, size_t *head, size_t *prev)
{
prev[pos % LZ_WND_SIZE] = head[hash];
head[hash] = pos;
}
Note the modulo operation when indexing into prev: we are only interested in positions that fall within the current window.

Instead of computing the hash value of each three-character prefix from scratch, we use a rolling hash and update it continuously so that only the last three characters are reflected in its value:

static uint16_t update_hash(uint16_t hash, uint8_t c)
{
hash <<= 5;
hash ^= c;
hash &= (1U << HASH_SIZE) - 1;
return hash;
}
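A handy property of this rolling hash is worth checking: since HASH_SIZE is 15 and each update shifts by 5 bits, after three updates every bit of the previous state has been shifted out and masked away, so the hash depends only on the last three bytes. A small self-contained sketch (repeating the constant and the update step) demonstrates this:

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE 15

/* Same rolling-hash step as update_hash above, repeated here so the
   sketch is self-contained. */
static uint16_t hash_step(uint16_t hash, uint8_t c)
{
        hash <<= 5;
        hash ^= c;
        hash &= (1U << HASH_SIZE) - 1;
        return hash;
}

/* Hash a three-byte prefix starting from an arbitrary previous state.
   After three 5-bit shifts, all 15 bits of the old state have been
   shifted past the mask, so the starting state does not matter. */
static uint16_t hash3(uint16_t state, const uint8_t *p)
{
        state = hash_step(state, p[0]);
        state = hash_step(state, p[1]);
        state = hash_step(state, p[2]);
        return state;
}
```

For example, hash3(0, "abc") equals hash3(0x7fff, "abc"): the initial state is irrelevant after three steps, which is why the compressor can keep a single running hash over the whole input.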
The hash table can then be used to efficiently search for previous matches, as shown below. Searching for matches is the most expensive operation during compression, so we limit how deep into the list we search.

Changing various parameters, such as the search depth in the prefix list and the lazy matching described below, trades compression ratio for speed. The settings in our code match the maximum compression level of zlib.
static size_t find_match(const uint8_t *src, size_t pos, uint16_t hash,
size_t prev_match_len, size_t max_match_len,
const size_t *head, const size_t *prev,
size_t *match_pos)
{
size_t max_match_steps = 4096;
size_t i, l;
bool found;
if (prev_match_len == 0) {
prev_match_len = 2;
}
if (prev_match_len >= max_match_len) {
return 0;
}
if (prev_match_len >= 32) {
max_match_steps /= 4;
}
found = false;
i = head[hash];
while (max_match_steps != 0) {
if (i == NO_POS) {
break;
}
assert(i < pos && "Matches should precede pos.");
if (pos - i > LZ_WND_SIZE) {
break;
}
l = cmp(src, i, pos, prev_match_len, max_match_len);
if (l != 0) {
assert(l > prev_match_len);
assert(l <= max_match_len);
found = true;
*match_pos = i;
prev_match_len = l;
if (l == max_match_len) {
return l;
}
}
i = prev[i % LZ_WND_SIZE];
max_match_steps--;
}
if (!found) {
return 0;
}
return prev_match_len;
}
static size_t cmp(const uint8_t *src, size_t i, size_t j,
size_t prev_match_len, size_t max_match_len)
{
size_t l;
assert(prev_match_len < max_match_len);
for (l = 0; l < prev_match_len + 1; l++) {
if (src[i + prev_match_len - l] !=
src[j + prev_match_len - l]) {
return 0;
}
}
assert(l == prev_match_len + 1);
for (; l < max_match_len; l++) {
if (src[i + l] != src[j + l]) {
break;
}
}
assert(l > prev_match_len);
assert(l <= max_match_len);
assert(memcmp(&src[i], &src[j], l) == 0);
return l;
}
We can now finish the lz77_compress function with this code that searches for previous matches:
h = 0;
if (len >= 2) {
h = update_hash(h, src[0]);
h = update_hash(h, src[1]);
}
prev_match_len = 0;
prev_match_pos = 0;
for (i = 0; i + 2 < len; i++) {
h = update_hash(h, src[i + 2]);
match_len = find_match(src, i, h, prev_match_len,
min(LZ_MAX_LEN, len - i), head, prev,
&match_pos);
insert_hash(h, i, head, prev);
if (prev_match_len != 0 && prev_match_len >= match_len) {
dist = (i - 1) - prev_match_pos;
if (!backref_callback(dist, prev_match_len, aux)) {
return false;
}
for (j = i + 1; j < min((i - 1) + prev_match_len,
len - 2); j++) {
h = update_hash(h, src[j + 2]);
insert_hash(h, j, head, prev);
}
i = (i - 1) + prev_match_len - 1;
prev_match_len = 0;
continue;
}
if (match_len == 0) {
assert(prev_match_len == 0);
if (!lit_callback(src[i], aux)) {
return false;
}
continue;
}
if (prev_match_len != 0) {
if (!lit_callback(src[i - 1], aux)) {
return false;
}
}
prev_match_len = match_len;
prev_match_pos = match_pos;
}
if (prev_match_len != 0) {
dist = (i - 1) - prev_match_pos;
if (!backref_callback(dist, prev_match_len, aux)) {
return false;
}
i = (i - 1) + prev_match_len;
}
for (; i < len; i++) {
if (!lit_callback(src[i], aux)) {
return false;
}
}
return true;
}
This code looks for the longest back reference that can be emitted at the current position. But before emitting it, the program checks whether an even longer match can be found at the next position. zlib calls this lazy match evaluation. It is still a greedy algorithm: it picks the longest match, even when taking a shorter match now would allow an even longer match later and achieve stronger compression.

Lempel-Ziv compression can be tuned for speed or for ratio. Zopfli spends a lot of time searching for optimal back references to squeeze out extra percentage points of compression. That is useful for data that is compressed once and served many times, such as static content on a web server. At the other end of the scale are compressors like Snappy and LZ4, which only check for matches against the most recent 4-byte prefix and are very fast. That kind of compression is useful in databases and RPC systems, where time spent compressing pays off in time saved sending the data over the network or to disk.

The idea of using the source data itself as the dictionary is very elegant, but a static dictionary can help too. Brotli is an LZ77-based algorithm, but it also uses a large static dictionary of strings that frequently occur on the web.

The LZ77 code can be viewed in lz77.h and lz77.c.

Huffman Code
The second ingredient of Zip compression is Huffman coding.

The term code in this context refers to a system for representing data in some other form. Here we are interested in a code that can efficiently represent the literals and back references produced by the Lempel-Ziv algorithm.

Traditionally, English text is represented using the American Standard Code for Information Interchange (ASCII). This system assigns each character a number, usually stored in an 8-bit representation. Here are the ASCII codes for the uppercase letters of the English alphabet:

One byte per character is a convenient way to store text. It makes it easy to access or modify parts of the text, and it is always clear how many bytes are needed to store N characters, or how many characters are stored in N bytes. However, it is not the most space-efficient way. For example, in English the letter E is used most often and Z least often. So, in terms of volume, it is more efficient to use a shorter bit representation for E and a longer one for Z, rather than assigning the same number of bits to every character.

A code that assigns encodings of different lengths to different source characters is called a variable-length code. The most famous example is Morse code, in which each character is encoded with dots and dashes, originally transmitted by telegraph as short and long pulses:

One drawback of Morse code is that one codeword can be a prefix of another. For example, • • − • does not have a unique decoding: it can be F or ER. This is resolved by pauses (three dots long) between letters during transmission. It would be better, though, if no codeword could be a prefix of another. Such a code is called prefix-free. Fixed-length ASCII is prefix-free because all codewords have the same length, but variable-length codes can be prefix-free too. Phone numbers are often prefix-free: before the emergency number 112 was introduced in Sweden, all phone numbers beginning with 112 had to be changed, and in the USA no phone number starts with 911.

To minimize the size of an encoded message, it is best to use a prefix-free code in which frequent characters have shorter codewords. The optimal code is the one that produces the shortest possible output: the sum of the codeword lengths weighted by their frequencies of occurrence is as small as possible. Such a code is called a minimum-redundancy prefix-free code, or a Huffman code, after the inventor of an efficient algorithm for generating such codes.

Huffman Algorithm
While working on his doctorate in electrical engineering at MIT, David Huffman took a course on information theory taught by Robert Fano. According to legend, Fano let his students choose between a final exam and a term paper. Huffman chose the latter and was given the topic of finding minimum-redundancy prefix-free codes. It is said that he did not know Fano himself was working on the same problem at the time (the best-known method back then was the Shannon-Fano algorithm). Huffman's work was published in 1952 as A Method for the Construction of Minimum-Redundancy Codes, and his algorithm has been widely used ever since.

David Huffman in a UC Santa Cruz press release.

The Huffman algorithm builds a minimum-redundancy prefix-free code for a set of characters and their frequencies of occurrence. It repeatedly selects the two characters least likely to occur in the source data, say X and Y, and replaces them with a composite character meaning "X or Y". The frequency of the composite character is the sum of the two source frequencies. The codewords for X and Y become whatever codeword is assigned to the composite character "X or Y", followed by 0 or 1 to distinguish between them. When the input is reduced to a single character, the algorithm stops (video explanation).

Here is an example of running the algorithm on a small character set:

First iteration:

The two rarest symbols, C and D, are removed from the set and replaced by a composite symbol whose frequency is the sum of the frequencies of C and D:

Now the rarest symbols are B and the composite symbol with frequency 5. They are removed from the set and replaced by a composite symbol with frequency 9:

Finally, A and the composite symbol with frequency 9 are combined into a new symbol with frequency 15:

The whole set has been reduced to one symbol, and processing is complete.

The algorithm has created a structure called a Huffman tree. Input characters are the leaves, and the more frequent a character, the higher up it sits. Starting from the root, codewords can be generated for the characters by appending 0 or 1 when moving left or right, respectively. The result looks like this:

No codeword is a prefix of any other, and the more frequent a symbol, the shorter its codeword.

The tree can also be used for decoding: start at the root and go left or right for a 0 or 1 bit until a character is reached. For example, the string 010100 decodes to ABBA.

Note that the length of each codeword equals the depth of the corresponding tree node. As we will see in the next section, we do not need an actual tree to assign codewords; it is enough to know the lengths of the codewords themselves. Thus, the output of our implementation of the Huffman algorithm will be the codeword lengths.

To store the character set and efficiently find the lowest frequencies, we will use the binary heap data structure, specifically a min-heap, since the minimum value should be on top.
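Before the heap-based implementation below, here is a naive, self-contained O(n²) sketch of the same pair-the-two-rarest loop. Run on frequencies consistent with the walkthrough above (A:6, B:4, C:2, D:3 — inferred from the merge sums 5, 9, and 15; any C/D split summing to 5 with C and D as the two rarest gives the same lengths), it reproduces the codeword lengths 1, 2, 3, 3:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAIVE_MAX 16

/* Compute Huffman codeword lengths by repeatedly merging the two least
   frequent active nodes. Illustrative only: O(n^2), no frequency
   capping, assumes 2 <= n <= NAIVE_MAX and all frequencies nonzero. */
static void huffman_lengths_naive(const unsigned *freqs, size_t n,
                                  unsigned *lengths)
{
        unsigned freq[2 * NAIVE_MAX];
        int parent[2 * NAIVE_MAX];
        int active[2 * NAIVE_MAX] = {0};
        size_t num = n, alive = n, i;

        for (i = 0; i < n; i++) {
                freq[i] = freqs[i];
                parent[i] = -1;
                active[i] = 1;
        }

        while (alive > 1) {
                size_t a = SIZE_MAX, b = SIZE_MAX;

                /* Find the two active nodes with the lowest frequencies. */
                for (i = 0; i < num; i++) {
                        if (!active[i]) {
                                continue;
                        }
                        if (a == SIZE_MAX || freq[i] < freq[a]) {
                                b = a;
                                a = i;
                        } else if (b == SIZE_MAX || freq[i] < freq[b]) {
                                b = i;
                        }
                }

                /* Replace them with a composite node. */
                freq[num] = freq[a] + freq[b];
                parent[num] = -1;
                active[num] = 1;
                active[a] = active[b] = 0;
                parent[a] = parent[b] = (int)num;
                num++;
                alive--;
        }

        /* A leaf's codeword length equals its depth in the tree. */
        for (i = 0; i < n; i++) {
                unsigned l = 0;
                int p = parent[i];

                while (p != -1) {
                        l++;
                        p = parent[p];
                }
                lengths[i] = l;
        }
}
```

The real implementation below does the same merging, but uses a min-heap to find the two smallest frequencies efficiently.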
static void swap32(uint32_t *a, uint32_t *b)
{
uint32_t tmp;
tmp = *a;
*a = *b;
*b = tmp;
}
static void minheap_down(uint32_t *heap, size_t n, size_t i)
{
size_t left, right, min;
assert(i >= 1 && i <= n && "i must be inside the heap");
while (i * 2 <= n) {
left = i * 2;
right = i * 2 + 1;
min = left;
if (right <= n && heap[right] < heap[left]) {
min = right;
}
if (heap[min] < heap[i]) {
swap32(&heap[min], &heap[i]);
i = min;
} else {
break;
}
}
}
static void minheap_heapify(uint32_t *heap, size_t n)
{
size_t i;
for (i = n / 2; i >= 1; i--) {
minheap_down(heap, n, i);
}
}
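A quick self-contained check of these 1-indexed heap helpers (repeated below for completeness): the heap occupies heap[1..n] with slot 0 unused, and after heapify the smallest value sits at index 1, so repeatedly extracting the minimum yields the values in ascending order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Identical to the helpers above; the heap occupies heap[1..n],
   heap[0] is unused. */
static void swap32_demo(uint32_t *a, uint32_t *b)
{
        uint32_t tmp = *a;

        *a = *b;
        *b = tmp;
}

static void heap_down_demo(uint32_t *heap, size_t n, size_t i)
{
        size_t left, right, min;

        while (i * 2 <= n) {
                left = i * 2;
                right = i * 2 + 1;
                min = left;
                if (right <= n && heap[right] < heap[left]) {
                        min = right;
                }
                if (heap[min] < heap[i]) {
                        swap32_demo(&heap[min], &heap[i]);
                        i = min;
                } else {
                        break;
                }
        }
}

static void heapify_demo(uint32_t *heap, size_t n)
{
        size_t i;

        for (i = n / 2; i >= 1; i--) {
                heap_down_demo(heap, n, i);
        }
}

/* Heapify, then repeatedly extract the minimum from index 1;
   out receives the values in ascending order. */
static void heap_sort_demo(uint32_t *heap, size_t n, uint32_t *out)
{
        size_t cur = n, k;

        heapify_demo(heap, cur);
        for (k = 0; k < n; k++) {
                out[k] = heap[1];      /* minimum is at index 1 */
                heap[1] = heap[cur];   /* move last element up  */
                cur--;
                if (cur >= 1) {
                        heap_down_demo(heap, cur, 1);
                }
        }
}
```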
To track the frequencies of n characters, we will use a heap of n elements. Also, each time a composite symbol is created, we want to "link" both source symbols to it, so each symbol will also have a "link element".

To store the n-element heap and the n link elements, we will use an array of n * 2 + 1 elements. When two symbols in the heap are replaced by one, we use the freed slot to store a link to the new symbol. This approach is based on the implementation in Managing Gigabytes by Witten, Moffat, and Bell.

In each heap node, we use the 16 most significant bits to store the symbol frequency and the 16 least significant bits to store the index of the symbol's link element. Because the frequency is in the high bits, the frequency difference determines the result of a 32-bit comparison between two heap elements.

Because of this representation, we need to ensure that the symbol frequencies always fit in 16 bits. When the algorithm finishes, the final composite symbol has the combined frequency of all symbols, so that sum must fit in 16 bits. Our Deflate implementation ensures this by processing at most 65,535 symbols at a time.

Symbols with zero frequency receive codewords of zero length and do not take part in the encoding.

If a codeword exceeds the specified maximum length, we "flatten" the frequency distribution by imposing a frequency cap and try again (yes, using goto). There are more sophisticated ways to do depth-limited Huffman coding, but this one is simple and effective.

#define MAX_HUFFMAN_SYMBOLS 288
static void compute_huffman_lengths(const uint16_t *freqs, size_t n,
uint8_t max_len, uint8_t *lengths)
{
uint32_t nodes[MAX_HUFFMAN_SYMBOLS * 2 + 1], p, q;
uint16_t freq;
size_t i, h, l;
uint16_t freq_cap = UINT16_MAX;
#ifndef NDEBUG
uint32_t freq_sum = 0;
for (i = 0; i < n; i++) {
freq_sum += freqs[i];
}
assert(freq_sum <= UINT16_MAX && "Frequency sum too large!");
#endif
assert(n <= MAX_HUFFMAN_SYMBOLS);
assert((1U << max_len) >= n && "max_len must be large enough");
try_again:
h = 0;
for (i = 0; i < n; i++) {
freq = freqs[i];
if (freq == 0) {
continue;
}
if (freq > freq_cap) {
freq = freq_cap;
}
h++;
nodes[h] = ((uint32_t)freq << 16) | (uint32_t)(n + h);
}
minheap_heapify(nodes, h);
if (h < 2) {
for (i = 0; i < n; i++) {
lengths[i] = (freqs[i] == 0) ? 0 : 1;
}
return;
}
while (h > 1) {
p = nodes[1];
nodes[1] = nodes[h--];
minheap_down(nodes, h, 1);
q = nodes[1];
nodes[1] = ((p & 0xffff0000) + (q & 0xffff0000))
| (uint32_t)(h + 1);
nodes[p & 0xffff] = nodes[q & 0xffff] = (uint32_t)(h + 1);
minheap_down(nodes, h, 1);
}
h = 0;
for (i = 0; i < n; i++) {
if (freqs[i] == 0) {
lengths[i] = 0;
continue;
}
h++;
p = nodes[n + h];
l = 1;
while (p != 2) {
l++;
p = nodes[p];
}
if (l > max_len) {
assert(freq_cap != 1 && "Cannot lower freq_cap!");
freq_cap /= 2;
goto try_again;
}
assert(l <= UINT8_MAX);
lengths[i] = (uint8_t)l;
}
}
An elegant alternative to the binary heap is to keep the symbols in two queues. The first contains the source symbols, sorted by frequency. When a composite symbol is created, it is appended to the second queue. This way, the symbol with the lowest frequency is always at the front of one of the two queues. The approach is described by Jan van Leeuwen in On the Construction of Huffman Trees (1976).

Huffman coding is optimal among prefix-free codes, but in other settings there are more efficient methods: arithmetic coding and asymmetric numeral systems.

Canonical Huffman codes
In the example above, we built a Huffman tree:

If we start from the root and use 0 for the left branch and 1 for the right, we get the following codes:

The decision to use 0 for the left branch and 1 for the right seems arbitrary. If we do the opposite, we get:

We can arbitrarily label the two branches leaving a node with zero and one (as long as the labels differ) and still get an equivalent code:

So although the Huffman algorithm gives the codeword lengths required for a minimum-redundancy prefix-free code, there are many ways to assign the individual codewords.

Given the codeword lengths computed by the Huffman algorithm, a canonical Huffman code assigns the codewords to the symbols in a specific way. This is useful because it means only the codeword lengths need to be stored and transmitted with the compressed data: the decoder can reconstruct the codewords from the lengths. One could, of course, store and transmit the symbol frequencies and rerun the Huffman algorithm in the decoder, but that would require more work and more storage on the decoder's side. Another very important property is that the structure of canonical codes makes efficient decoding possible.

The idea is to assign the codewords to the symbols sequentially, one at a time. The first codeword is 0. The next codeword of the same length is the previous one plus 1. The first codeword of length N is derived from the last codeword of length N-1 by adding one (to get a new codeword) and shifting left one step (to increase the length).

In Huffman-tree terms, codewords are assigned to the leaves sequentially from left to right, one level at a time, shifting left when moving down to the next level.

In our ABCD example, the Huffman algorithm assigned codeword lengths 1, 2, 3, and 3. The first codeword is 0, which is also the last codeword of length 1. For length 2, we add 1 to get 1 and shift left to get 10. This is now the last codeword of length 2. For length 3, we add 1 and shift: 110. The next codeword of length 3 is obtained by adding 1: 111.

The implementation of a canonical code generator is shown below. Note that the Deflate algorithm expects codewords to be emitted LSB-first (least significant bit first): the first bit of a codeword must be stored in the least significant bit. This means we need to reverse the bit order, which we do with a lookup table.

#define MAX_HUFFMAN_BITS 15
static void compute_canonical_code(uint16_t *codewords, const uint8_t *lengths,
size_t n)
{
size_t i;
uint16_t count[MAX_HUFFMAN_BITS + 1] = {0};
uint16_t code[MAX_HUFFMAN_BITS + 1];
int l;
for (i = 0; i < n; i++) {
count[lengths[i]]++;
}
count[0] = 0;
code[0] = 0;
for (l = 1; l <= MAX_HUFFMAN_BITS; l++) {
code[l] = (uint16_t)((code[l - 1] + count[l - 1]) << 1);
}
for (i = 0; i < n; i++) {
l = lengths[i];
if (l == 0) {
continue;
}
codewords[i] = reverse16(code[l]++, l);
}
}
static inline uint16_t reverse16(uint16_t x, int n)
{
uint16_t lo, hi;
uint16_t reversed;
assert(n > 0);
assert(n <= 16);
lo = x & 0xff;
hi = x >> 8;
reversed = (uint16_t)((reverse8_tbl[lo] << 8) | reverse8_tbl[hi]);
return reversed >> (16 - n);
}
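To make the assignment rule concrete, here is a minimal, self-contained sketch of canonical codeword assignment for the ABCD example, MSB-first and without the bit reversal that compute_canonical_code performs for Deflate (the helper name canonical_codes is ours, for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assign canonical codewords (MSB-first, no Deflate bit reversal)
   given the codeword lengths. */
static void canonical_codes(const uint8_t *lengths, size_t n,
                            uint16_t *codewords)
{
    uint16_t count[16] = {0};
    uint16_t code[16];
    size_t i;
    int l;

    for (i = 0; i < n; i++) {
        count[lengths[i]]++;
    }
    count[0] = 0;
    code[0] = 0;
    /* First codeword of each length: previous first code plus the
       number of codewords of that length, shifted left one step. */
    for (l = 1; l < 16; l++) {
        code[l] = (uint16_t)((code[l - 1] + count[l - 1]) << 1);
    }
    /* Hand out codewords sequentially within each length. */
    for (i = 0; i < n; i++) {
        if (lengths[i] != 0) {
            codewords[i] = code[lengths[i]]++;
        }
    }
}
```

For lengths 1, 2, 3, 3 this produces 0, 10, 110, and 111, matching the walkthrough above.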
Now let's put it all together and write the encoder initialization code:

typedef struct huffman_encoder_t huffman_encoder_t;
struct huffman_encoder_t {
uint16_t codewords[MAX_HUFFMAN_SYMBOLS];
uint8_t lengths[MAX_HUFFMAN_SYMBOLS];
};
void huffman_encoder_init(huffman_encoder_t *e, const uint16_t *freqs, size_t n,
uint8_t max_codeword_len)
{
assert(n <= MAX_HUFFMAN_SYMBOLS);
assert(max_codeword_len <= MAX_HUFFMAN_BITS);
compute_huffman_lengths(freqs, n, max_codeword_len, e->lengths);
compute_canonical_code(e->codewords, e->lengths, n);
}
We also make a function to configure the encoder using the already calculated code lengths:
void huffman_encoder_init2(huffman_encoder_t *e, const uint8_t *lengths,
size_t n)
{
size_t i;
for (i = 0; i < n; i++) {
e->lengths[i] = lengths[i];
}
compute_canonical_code(e->codewords, e->lengths, n);
}
Efficient Huffman Decoding
The simplest way to decode Huffman-coded data is to traverse the tree from the root, reading one input bit at a time to decide whether to take the left or right branch. When a leaf is reached, that is the decoded symbol. This method is often taught in universities and textbooks. It is simple and elegant, but processing one bit at a time is too slow. It is much faster to decode with a lookup table. For the example above, where the maximum codeword length is three bits, we can use the following table:

Although there are only four symbols, the table needs eight entries to cover all possible three-bit combinations. Symbols with codewords shorter than three bits get multiple entries. For example, the codeword 10 was "padded" to 100 and 101 to cover all three-bit combinations starting with 10. To decode, we index into the table with the next three input bits and immediately find the corresponding symbol and the length of its codeword. The length matters because, even though we looked at three bits, we must advance the input by only as many bits as the codeword actually uses.

The lookup table method is very fast, but it has a drawback: the table doubles in size with each additional bit of codeword length. That is, the table grows exponentially, and once it no longer fits in the processor cache, the method becomes slow. Because of this, a lookup table is usually used only for codewords up to a certain length, with a different approach for longer ones. Just as Huffman coding assigns shorter codewords to more frequent symbols, using a lookup table for the short codewords is an excellent optimization in the common case. zlib uses several levels of lookup tables: if a codeword is too long for the first table, the search continues in a secondary table indexed by the remaining bits.

But there is another, very elegant method, based on the properties of canonical Huffman codes. It is described in On the Implementation of Minimum Redundancy Prefix Codes (Moffat and Turpin, 1997), and is also explained in Charles Bloom's The Lost Huffman Paper.

Take the codewords from the canonical version: 0, 10, 110, 111. We track the first codeword of each length, as well as the position of each codeword in the overall sequence — its symbol index. Since codewords are assigned sequentially, once we know the input bits we can look up which symbol they represent. For example, the three-bit word 111 is at offset one from the first codeword of that length (110). The first symbol index for that length is 3, and an offset of one gives index 4. Another table maps the symbol index to the symbol:

sym_idx = d->first_symbol[len] + (bits - d->first_code[len]);
sym = d->syms[sym_idx];
A small optimization: instead of storing the first symbol index and the first codeword separately, we can store the first symbol index minus the first codeword in a single table:

sym_idx = d->offset_first_sym_idx[len] + bits;
sym = d->syms[sym_idx];
To determine how many bits to look at, we again use the sequential property of the code. In our example, all valid one-bit codewords are strictly less than 1, all two-bit codewords are strictly less than 11, and all three-bit codewords are less than 1000 (which is true of all three-bit values). In other words, a valid N-bit codeword must be strictly less than the first N-bit codeword plus the number of N-bit codewords. Moreover, we can shift these limits left so that they are all three bits wide. Let's call these the sentinel bits for each codeword length:

The sentinel for length 3 has overflowed into 4 bits, but that just means any three-bit word will match.

We can compare three bits of input against the sentinel bits to determine the length of our codeword. Once we know it, we shift the input bits to keep only the correct number of them, and then find the symbol index:

for (len = 1; len <= 3; len++) {
if (bits < d->sentinel_bits[len]) {
bits >>= 3 - len;
sym_idx = d->offset_first_sym_idx[len] + bits;
}
}
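The sentinel-bits scheme can be demonstrated end to end with the ABCD code. This is a sketch with the tables hard-coded for that example (the function name decode_abcd and the pre-combined offset table are ours for illustration); bits holds the next three input bits, MSB-first:

```c
#include <assert.h>
#include <stdint.h>

/* Decode one symbol of the canonical ABCD code (codewords 0, 10, 110,
   111) using sentinel bits, as in the Moffat-Turpin method. */
static int decode_abcd(uint16_t bits)
{
    /* First N-bit codeword plus the number of N-bit codewords,
       shifted left so every entry is 3 bits wide. */
    static const uint16_t sentinel_bits[4] = {
        0,
        0x4, /* len 1: (0 + 1) << 2 = 100  */
        0x6, /* len 2: (2 + 1) << 1 = 110  */
        0x8  /* len 3: (6 + 2)      = 1000 */
    };
    /* first_symbol[len] - first_code[len], pre-combined. */
    static const int offset_first_sym_idx[4] = { 0, 0, -1, -4 };
    static const int syms[4] = { 'A', 'B', 'C', 'D' };
    int len;

    for (len = 1; len <= 3; len++) {
        if (bits < sentinel_bits[len]) {
            bits >>= 3 - len; /* keep only the codeword's bits */
            return syms[offset_first_sym_idx[len] + bits];
        }
    }
    return -1; /* not a valid codeword */
}
```

Feeding it 000 yields A, 10x yields B, 110 yields C, and 111 yields D.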
This process takes time linear in the number of bits per codeword, but it uses space efficiently, requires only a load and a comparison at each step, and, since shorter codewords are more frequent, optimizes decoding in many cases.

The full decoder code:

#define HUFFMAN_LOOKUP_TABLE_BITS 8
typedef struct huffman_decoder_t huffman_decoder_t;
struct huffman_decoder_t {
struct {
uint16_t sym : 9;
uint16_t len : 7;
} table[1U << HUFFMAN_LOOKUP_TABLE_BITS];
uint16_t sentinel_bits[MAX_HUFFMAN_BITS + 1];
uint16_t offset_first_sym_idx[MAX_HUFFMAN_BITS + 1];
uint16_t syms[MAX_HUFFMAN_SYMBOLS];
#ifndef NDEBUG
size_t num_syms;
#endif
};
static inline uint64_t lsb(uint64_t x, int n)
{
assert(n >= 0 && n <= 63);
return x & (((uint64_t)1 << n) - 1);
}
static inline int huffman_decode(const huffman_decoder_t *d, uint16_t bits,
size_t *num_used_bits)
{
uint64_t lookup_bits;
size_t l;
size_t sym_idx;
lookup_bits = lsb(bits, HUFFMAN_LOOKUP_TABLE_BITS);
assert(lookup_bits < sizeof(d->table) / sizeof(d->table[0]));
if (d->table[lookup_bits].len != 0) {
assert(d->table[lookup_bits].len <= HUFFMAN_LOOKUP_TABLE_BITS);
assert(d->table[lookup_bits].sym < d->num_syms);
*num_used_bits = d->table[lookup_bits].len;
return d->table[lookup_bits].sym;
}
bits = reverse16(bits, MAX_HUFFMAN_BITS);
for (l = HUFFMAN_LOOKUP_TABLE_BITS + 1; l <= MAX_HUFFMAN_BITS; l++) {
if (bits < d->sentinel_bits[l]) {
bits >>= MAX_HUFFMAN_BITS - l;
sym_idx = (uint16_t)(d->offset_first_sym_idx[l] + bits);
assert(sym_idx < d->num_syms);
*num_used_bits = l;
return d->syms[sym_idx];
}
}
*num_used_bits = 0;
return -1;
}
To set up the decoder, we pre-compute the canonical code, as in huffman_encoder_init, and fill in the various tables:
bool huffman_decoder_init(huffman_decoder_t *d, const uint8_t *lengths,
size_t n)
{
size_t i;
uint16_t count[MAX_HUFFMAN_BITS + 1] = {0};
uint16_t code[MAX_HUFFMAN_BITS + 1];
uint32_t s;
uint16_t sym_idx[MAX_HUFFMAN_BITS + 1];
int l;
#ifndef NDEBUG
assert(n <= MAX_HUFFMAN_SYMBOLS);
d->num_syms = n;
#endif
for (i = 0; i < sizeof(d->table) / sizeof(d->table[0]); i++) {
d->table[i].len = 0;
}
for (i = 0; i < n; i++) {
assert(lengths[i] <= MAX_HUFFMAN_BITS);
count[lengths[i]]++;
}
count[0] = 0;
code[0] = 0;
sym_idx[0] = 0;
for (l = 1; l <= MAX_HUFFMAN_BITS; l++) {
code[l] = (uint16_t)((code[l - 1] + count[l - 1]) << 1);
if (count[l] != 0 && code[l] + count[l] - 1 > (1U << l) - 1) {
return false;
}
s = (uint32_t)((code[l] + count[l]) << (MAX_HUFFMAN_BITS - l));
d->sentinel_bits[l] = (uint16_t)s;
assert(d->sentinel_bits[l] == s && "No overflow.");
sym_idx[l] = sym_idx[l - 1] + count[l - 1];
d->offset_first_sym_idx[l] = sym_idx[l] - code[l];
}
for (i = 0; i < n; i++) {
l = lengths[i];
if (l == 0) {
continue;
}
d->syms[sym_idx[l]] = (uint16_t)i;
sym_idx[l]++;
if (l <= HUFFMAN_LOOKUP_TABLE_BITS) {
table_insert(d, i, l, code[l]);
code[l]++;
}
}
return true;
}
static void table_insert(huffman_decoder_t *d, size_t sym, int len,
uint16_t codeword)
{
int pad_len;
uint16_t padding, index;
assert(len <= HUFFMAN_LOOKUP_TABLE_BITS);
codeword = reverse16(codeword, len);
pad_len = HUFFMAN_LOOKUP_TABLE_BITS - len;
for (padding = 0; padding < (1U << pad_len); padding++) {
index = (uint16_t)(codeword | (padding << len));
d->table[index].sym = (uint16_t)sym;
d->table[index].len = (uint16_t)len;
assert(d->table[index].sym == sym && "Fits in bitfield.");
assert(d->table[index].len == len && "Fits in bitfield.");
}
}
Deflate
The Deflate algorithm, introduced in PKZip 2.04c in 1993, is the standard compression method in modern Zip files. It is also used in gzip, PNG, and many other formats. It uses a combination of LZ77 compression and Huffman coding, which we will discuss and implement in this section. Before Deflate, PKZip used the Shrink, Reduce, and Implode compression methods. They are rare today, although they remained in use for some time after Deflate was introduced, since they consumed less memory. We will not cover them here.

Bit streams
Deflate stores Huffman codewords in a bitstream LSB-first, meaning that the first bit of the stream is stored in the least significant bit of the first byte. Consider the bitstream (read left to right) 1-0-0-1-1. Stored LSB-first, the byte value becomes 0b00011001 (binary), or 0x19 (hexadecimal). It may seem that the stream is simply represented backwards (and in a sense it is), but the advantage is that it is easy to extract the first N bits from a computer word: we simply mask off the N least significant bits. These routines are from bitstream.h:
typedef struct istream_t istream_t;
struct istream_t {
const uint8_t *src;
const uint8_t *end;
size_t bitpos;
size_t bitpos_end;
};
static inline void istream_init(istream_t *is, const uint8_t *src, size_t n)
{
is->src = src;
is->end = src + n;
is->bitpos = 0;
is->bitpos_end = n * 8;
}
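The LSB-first convention described above can be checked with a tiny sketch (the helper name pack_lsb_first is ours; out must be zero-initialized by the caller):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a sequence of bits LSB-first into bytes: the first bit of the
   stream goes into the least significant bit of the first byte. */
static void pack_lsb_first(const int *bits, size_t n, uint8_t *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (bits[i]) {
            out[i / 8] |= (uint8_t)(1U << (i % 8));
        }
    }
}
```

Packing the sequence 1-0-0-1-1 from the text yields the byte 0x19.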
Our Huffman decoder needs to peek at the next bits of the stream (enough bits for the longest possible codeword) and then advance the stream by the number of bits the decoded symbol actually used:

#define ISTREAM_MIN_BITS (64 - 7)
static inline uint64_t istream_bits(const istream_t *is)
{
const uint8_t *next;
uint64_t bits;
int i;
next = is->src + (is->bitpos / 8);
assert(next <= is->end && "Cannot read past end of stream.");
if (is->end - next >= 8) {
bits = read64le(next);
} else {
bits = 0;
for (i = 0; i < is->end - next; i++) {
bits |= (uint64_t)next[i] << (i * 8);
}
}
return bits >> (is->bitpos % 8);
}
static inline bool istream_advance(istream_t *is, size_t n) {
if (is->bitpos + n > is->bitpos_end) {
return false;
}
is->bitpos += n;
return true;
}
The point is that on 64-bit machines, istream_bits can usually execute as a single load instruction and some arithmetic, assuming the istream_t members are kept in registers. read64le is implemented in bits.h (modern compilers turn it into a single 64-bit little-endian load):
static inline uint64_t read64le(const uint8_t *p)
{
return ((uint64_t)p[0] << 0) |
((uint64_t)p[1] << 8) |
((uint64_t)p[2] << 16) |
((uint64_t)p[3] << 24) |
((uint64_t)p[4] << 32) |
((uint64_t)p[5] << 40) |
((uint64_t)p[6] << 48) |
((uint64_t)p[7] << 56);
}
We also need a function to advance the bitstream to the next byte boundary:
static inline size_t round_up(size_t x, size_t m)
{
assert((m & (m - 1)) == 0 && "m must be a power of two");
return (x + m - 1) & (size_t)(-m);
}
static inline const uint8_t *istream_byte_align(istream_t *is)
{
const uint8_t *byte;
assert(is->bitpos <= is->bitpos_end && "Not past end of stream.");
is->bitpos = round_up(is->bitpos, 8);
byte = is->src + is->bitpos / 8;
assert(byte <= is->end);
return byte;
}
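The bit trick in round_up — add m-1, then clear the low bits by masking with -m (the two's complement of m) — is worth a quick self-check. This is a standalone copy of the function above:

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the nearest multiple of m, where m is a power of two.
   (x + m - 1) overshoots into the next multiple unless x is already
   one; masking with -m (all ones except the low log2(m) bits) then
   clears the remainder. */
static size_t round_up(size_t x, size_t m)
{
    assert((m & (m - 1)) == 0 && "m must be a power of two");
    return (x + m - 1) & (size_t)(-m);
}
```

For example, rounding bit positions up to a byte boundary uses m = 8.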
For the output bitstream, we write bits using a read-modify-write approach. In the fast path, writing some bits is a 64-bit read, a bit of bit manipulation, and a 64-bit write.
typedef struct ostream_t ostream_t;
struct ostream_t {
uint8_t *dst;
uint8_t *end;
size_t bitpos;
size_t bitpos_end;
};
static inline void ostream_init(ostream_t *os, uint8_t *dst, size_t n)
{
os->dst = dst;
os->end = dst + n;
os->bitpos = 0;
os->bitpos_end = n * 8;
}
static inline size_t ostream_bit_pos(const ostream_t *os)
{
return os->bitpos;
}
static inline size_t ostream_bytes_written(ostream_t *os)
{
return round_up(os->bitpos, 8) / 8;
}
static inline bool ostream_write(ostream_t *os, uint64_t bits, size_t n)
{
uint8_t *p;
uint64_t x;
int shift, i;
assert(n <= 57);
assert(bits <= ((uint64_t)1 << n) - 1 && "Must fit in n bits.");
if (os->bitpos_end - os->bitpos < n) {
return false;
}
p = &os->dst[os->bitpos / 8];
shift = os->bitpos % 8;
if (os->end - p >= 8) {
x = read64le(p);
x = lsb(x, shift);
x |= bits << shift;
write64le(p, x);
} else {
x = 0;
for (i = 0; i < os->end - p; i++) {
x |= (uint64_t)p[i] << (i * 8);
}
x = lsb(x, shift);
x |= bits << shift;
for (i = 0; i < os->end - p; i++) {
p[i] = (uint8_t)(x >> (i * 8));
}
}
os->bitpos += n;
return true;
}
static inline void write64le(uint8_t *dst, uint64_t x)
{
dst[0] = (uint8_t)(x >> 0);
dst[1] = (uint8_t)(x >> 8);
dst[2] = (uint8_t)(x >> 16);
dst[3] = (uint8_t)(x >> 24);
dst[4] = (uint8_t)(x >> 32);
dst[5] = (uint8_t)(x >> 40);
dst[6] = (uint8_t)(x >> 48);
dst[7] = (uint8_t)(x >> 56);
}
We also need to write bytes to the stream efficiently. We could perform repeated 8-bit writes, but using memcpy is much faster:
static inline bool ostream_write_bytes_aligned(ostream_t *os,
const uint8_t *src,
size_t n)
{
if (os->bitpos_end - round_up(os->bitpos, 8) < n * 8) {
return false;
}
os->bitpos = round_up(os->bitpos, 8);
memcpy(&os->dst[os->bitpos / 8], src, n);
os->bitpos += n * 8;
return true;
}
Unpacking (Inflation)
Since the compression algorithm is called Deflate (letting the air out of something), decompression is sometimes called Inflation. Studying this process first will teach us how the format works. The code is in the first part of deflate.h and deflate.c, bits.h, tables.h and tables.c (generated by generate_tables.c).

Data compressed with Deflate is stored as a series of blocks. Each block starts with a 3-bit header in which the first (least significant) bit is set if this is the final block of the series, and the other two bits give the block type. There are three block types: uncompressed (0), compressed with fixed Huffman codes (1), and compressed with "dynamic" Huffman codes (2). This code performs the decompression, using helper functions for the different block types that we will implement later:

typedef enum {
HWINF_OK,
HWINF_FULL,
HWINF_ERR
} inf_stat_t;
inf_stat_t hwinflate(const uint8_t *src, size_t src_len, size_t *src_used,
uint8_t *dst, size_t dst_cap, size_t *dst_used)
{
istream_t is;
size_t dst_pos;
uint64_t bits;
bool bfinal;
inf_stat_t s;
istream_init(&is, src, src_len);
dst_pos = 0;
do {
bits = istream_bits(&is);
if (!istream_advance(&is, 3)) {
return HWINF_ERR;
}
bfinal = bits & 1;
bits >>= 1;
switch (lsb(bits, 2)) {
case 0:
s = inf_noncomp_block(&is, dst, dst_cap, &dst_pos);
break;
case 1:
s = inf_fixed_block(&is, dst, dst_cap, &dst_pos);
break;
case 2:
s = inf_dyn_block(&is, dst, dst_cap, &dst_pos);
break;
default:
return HWINF_ERR;
}
if (s != HWINF_OK) {
return s;
}
} while (!bfinal);
*src_used = (size_t)(istream_byte_align(&is) - src);
assert(dst_pos <= dst_cap);
*dst_used = dst_pos;
return HWINF_OK;
}
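The 3-bit header parsing at the top of hwinflate can be illustrated in isolation. A minimal sketch (the function name parse_block_header is ours); bits holds LSB-first input, so the lowest bit is the final-block flag and the next two bits are the type:

```c
#include <stdint.h>

/* Split a 3-bit Deflate block header: lowest bit is the final-block
   flag (BFINAL), the next two bits are the block type (BTYPE). */
static void parse_block_header(uint64_t bits, int *bfinal, int *btype)
{
    *bfinal = (int)(bits & 1);
    *btype = (int)((bits >> 1) & 3);
}
```

For example, the header bits 0b101 mean a final block compressed with dynamic Huffman codes.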
Uncompressed Deflate Blocks
These "stored" blocks are the simplest type. A stored block begins, at the next 8-bit boundary of the bitstream, with a 16-bit word (len) giving the length of the block, followed by another 16-bit word (nlen) that is the ones' complement (all bits inverted) of len. The idea is that nlen acts as a simple checksum of len: if the file is corrupted, the values will probably not be complementary and the program can detect the error. After len and nlen comes the uncompressed data. Since the block length is a 16-bit value, the data is limited to 65,535 bytes.

static inf_stat_t inf_noncomp_block(istream_t *is, uint8_t *dst,
size_t dst_cap, size_t *dst_pos)
{
const uint8_t *p;
uint16_t len, nlen;
p = istream_byte_align(is);
if (!istream_advance(is, 32)) {
return HWINF_ERR;
}
len = read16le(p);
nlen = read16le(p + 2);
p += 4;
if (nlen != (uint16_t)~len) {
return HWINF_ERR;
}
if (!istream_advance(is, len * 8)) {
return HWINF_ERR;
}
if (dst_cap - *dst_pos < len) {
return HWINF_FULL;
}
memcpy(&dst[*dst_pos], p, len);
*dst_pos += len;
return HWINF_OK;
}
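The len/nlen relationship can be sketched on its own. The helper names below (write_stored_header, check_stored_header) are ours, for illustration; the layout matches the stored-block header the decompressor reads:

```c
#include <stdbool.h>
#include <stdint.h>

/* Write the len/nlen words of a stored-block header, little-endian. */
static void write_stored_header(uint16_t len, uint8_t hdr[4])
{
    uint16_t nlen = (uint16_t)~len; /* ones' complement of len */
    hdr[0] = (uint8_t)(len >> 0);
    hdr[1] = (uint8_t)(len >> 8);
    hdr[2] = (uint8_t)(nlen >> 0);
    hdr[3] = (uint8_t)(nlen >> 8);
}

/* Read the header back and verify it the way the decompressor does. */
static bool check_stored_header(const uint8_t hdr[4], uint16_t *len)
{
    uint16_t nlen;
    *len = (uint16_t)(hdr[0] | (hdr[1] << 8));
    nlen = (uint16_t)(hdr[2] | (hdr[3] << 8));
    return nlen == (uint16_t)~*len; /* corruption check */
}
```

Flipping any header byte makes the complement check fail, which is exactly how inf_noncomp_block detects damage.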
Deflate blocks using fixed Huffman codes
Compressed Deflate blocks use Huffman codes to represent a sequence of LZ77 literals and back references, terminated by an end-of-block marker. One Huffman code, litlen, is used for literals, back reference lengths, and the marker; a second code, dist, is used for back reference distances.

Litlen encodes values in the range 0-285. Values 0-255 are literal bytes, 256 is the end-of-block marker, and 257-285 encode back reference lengths. Back references are 3 to 258 bytes long. A litlen value determines a base length, to which zero or more extra bits from the stream are added to get the full length, according to the table below. For example, litlen value 269 means a base length of 19 and two extra bits; adding the two bits from the stream gives a final length between 19 and 22.

Note that litlen value 284 plus 5 extra bits could represent lengths 227 to 258; however, the specification requires that length 258, the maximum back reference length, be represented with a separate litlen value. This presumably shortens the encoding in situations where the maximum length is common.

The decompressor uses a table to get the base length and extra bits from a litlen value (minus 257):
struct litlen_tbl_t {
uint16_t base_len : 9;
uint16_t ebits : 7;
};
const struct litlen_tbl_t litlen_tbl[29] = {
{ 3, 0 },
{ 4, 0 },
...
{ 227, 5 },
{ 258, 0 }
};
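Rather than writing the whole table out, the base lengths and extra-bit counts can be generated from the pattern in the Deflate specification: symbols 257-264 take 0 extra bits, and after that each group of four symbols gains one extra bit, with each base being the previous base plus the previous length span. A sketch (the helper name litlen_tbl_gen is ours):

```c
/* Generate the litlen base lengths and extra-bit counts (symbols
   257..285, stored at indices 0..28) from the Deflate spec's pattern. */
static void litlen_tbl_gen(int base_len[29], int ebits[29])
{
    int sym, base = 3;

    for (sym = 257; sym <= 284; sym++) {
        /* 257-264: 0 extra bits; then one more bit per group of four. */
        ebits[sym - 257] = sym < 265 ? 0 : (sym - 261) / 4;
        base_len[sym - 257] = base;
        base += 1 << ebits[sym - 257]; /* next base skips this span */
    }
    /* 285 is special: the maximum length, 258, with no extra bits. */
    base_len[28] = 258;
    ebits[28] = 0;
}
```

This reproduces the example from the text: symbol 269 has base length 19 and two extra bits, covering lengths 19 through 22.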
The fixed litlen Huffman code is canonical and uses the following codeword lengths (286-287 are not valid litlen values, but they participate in the code generation):

The decompressor keeps these lengths in a table, convenient for passing to huffman_decoder_init:

const uint8_t fixed_litlen_lengths[288] = {
8,
8,
...
8,
};
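The elided table above follows the fixed ranges from the Deflate specification: symbols 0-143 get 8-bit codes, 144-255 get 9 bits, 256-279 get 7 bits, and 280-287 get 8 bits. A sketch generating it (the helper name fixed_litlen_gen is ours):

```c
#include <stdint.h>

/* Generate the fixed litlen code lengths from the ranges in the
   Deflate spec instead of spelling out all 288 entries. */
static void fixed_litlen_gen(uint8_t lengths[288])
{
    int i;
    for (i = 0; i <= 143; i++)   lengths[i] = 8;
    for (i = 144; i <= 255; i++) lengths[i] = 9;
    for (i = 256; i <= 279; i++) lengths[i] = 7; /* incl. end-of-block */
    for (i = 280; i <= 287; i++) lengths[i] = 8;
}
```

Note that the end-of-block marker (256) falls in the 7-bit range, giving it one of the shortest codes.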
Back reference distances range from 1 to 32,768. They are encoded with a scheme similar to the length encoding: the dist Huffman code encodes values from 0 to 29, each corresponding to a base distance, to which extra bits are added to obtain the final distance:

The fixed dist Huffman code is canonical, and all its codewords are 5 bits long. The decompressor simply stores the code lengths in a table ready for huffman_decoder_init (dist values 30-31 are not valid; they participate in the code generation but otherwise have no effect):

const uint8_t fixed_dist_lengths[32] = {
5,
5,
...
5,
};
The code for decompressing a Deflate block that uses fixed Huffman codes:

static inf_stat_t inf_fixed_block(istream_t *is, uint8_t *dst,
size_t dst_cap, size_t *dst_pos)
{
huffman_decoder_t litlen_dec, dist_dec;
huffman_decoder_init(&litlen_dec, fixed_litlen_lengths,
sizeof(fixed_litlen_lengths) /
sizeof(fixed_litlen_lengths[0]));
huffman_decoder_init(&dist_dec, fixed_dist_lengths,
sizeof(fixed_dist_lengths) /
sizeof(fixed_dist_lengths[0]));
return inf_block(is, dst, dst_cap, dst_pos, &litlen_dec, &dist_dec);
}
#define LITLEN_EOB 256
#define LITLEN_MAX 285
#define LITLEN_TBL_OFFSET 257
#define MIN_LEN 3
#define MAX_LEN 258
#define DISTSYM_MAX 29
#define MIN_DISTANCE 1
#define MAX_DISTANCE 32768
static inf_stat_t inf_block(istream_t *is, uint8_t *dst, size_t dst_cap,
size_t *dst_pos,
const huffman_decoder_t *litlen_dec,
const huffman_decoder_t *dist_dec)
{
uint64_t bits;
size_t used, used_tot, dist, len;
int litlen, distsym;
uint16_t ebits;
while (true) {
bits = istream_bits(is);
litlen = huffman_decode(litlen_dec, (uint16_t)bits, &used);
bits >>= used;
used_tot = used;
if (litlen < 0 || litlen > LITLEN_MAX) {
return HWINF_ERR;
} else if (litlen <= UINT8_MAX) {
if (!istream_advance(is, used_tot)) {
return HWINF_ERR;
}
if (*dst_pos == dst_cap) {
return HWINF_FULL;
}
lz77_output_lit(dst, (*dst_pos)++, (uint8_t)litlen);
continue;
} else if (litlen == LITLEN_EOB) {
if (!istream_advance(is, used_tot)) {
return HWINF_ERR;
}
return HWINF_OK;
}
assert(litlen >= LITLEN_TBL_OFFSET && litlen <= LITLEN_MAX);
len = litlen_tbl[litlen - LITLEN_TBL_OFFSET].base_len;
ebits = litlen_tbl[litlen - LITLEN_TBL_OFFSET].ebits;
if (ebits != 0) {
len += lsb(bits, ebits);
bits >>= ebits;
used_tot += ebits;
}
assert(len >= MIN_LEN && len <= MAX_LEN);
distsym = huffman_decode(dist_dec, (uint16_t)bits, &used);
bits >>= used;
used_tot += used;
if (distsym < 0 || distsym > DISTSYM_MAX) {
return HWINF_ERR;
}
dist = dist_tbl[distsym].base_dist;
ebits = dist_tbl[distsym].ebits;
if (ebits != 0) {
dist += lsb(bits, ebits);
bits >>= ebits;
used_tot += ebits;
}
assert(dist >= MIN_DISTANCE && dist <= MAX_DISTANCE);
assert(used_tot <= ISTREAM_MIN_BITS);
if (!istream_advance(is, used_tot)) {
return HWINF_ERR;
}
if (dist > *dst_pos) {
return HWINF_ERR;
}
if (round_up(len, 8) <= dst_cap - *dst_pos) {
output_backref64(dst, *dst_pos, dist, len);
} else if (len <= dst_cap - *dst_pos) {
lz77_output_backref(dst, *dst_pos, dist, len);
} else {
return HWINF_FULL;
}
(*dst_pos) += len;
}
}
Note the optimization: when there is enough room in the output buffer, we emit back references using the function below, which copies 64 bits at a time. It is "sloppy" in the sense that it often copies a few extra bytes (up to the next multiple of 8), but it is much faster than lz77_output_backref because it needs fewer loop iterations and memory accesses. In fact, short back references are now handled in a single iteration, which is very good for branch prediction.
static void output_backref64(uint8_t *dst, size_t dst_pos, size_t dist,
size_t len)
{
size_t i;
uint64_t tmp;
assert(len > 0);
assert(dist <= dst_pos && "cannot reference before beginning of dst");
if (len > dist) {
lz77_output_backref(dst, dst_pos, dist, len);
return;
}
i = 0;
do {
memcpy(&tmp, &dst[dst_pos - dist + i], 8);
memcpy(&dst[dst_pos + i], &tmp, 8);
i += 8;
} while (i < len);
}
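Why does output_backref64 fall back to byte-by-byte copying when len > dist? Because an overlapping back reference must see its own freshly written output. A sketch of the byte-at-a-time semantics (the function name output_backref is ours; lz77_output_backref in the article's earlier code behaves this way):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a back reference one byte at a time. Copying byte-by-byte is
   what makes overlapping references work: with dist 1 and len 5, each
   copied byte becomes the source of the next, repeating the previous
   byte five times. */
static void output_backref(uint8_t *dst, size_t dst_pos, size_t dist,
                           size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        dst[dst_pos + i] = dst[dst_pos + i - dist];
    }
}
```

For example, starting from "ab", a back reference with distance 2 and length 5 keeps copying over its own output and produces "abababa".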
Deflate blocks using dynamic Huffman codes
Deflate blocks with dynamic Huffman codes work just like the ones described above, except that instead of the predefined litlen and dist codes they use codes stored in the Deflate stream itself, at the start of the block. The name is perhaps unfortunate, since "dynamic Huffman coding" can also refer to codes that change during encoding, i.e. adaptive Huffman coding. The codes described here have nothing to do with that procedure; they are dynamic only in the sense that different blocks can use different codes.

Generating the dynamic litlen and dist codes is the most complicated part of the Deflate format. But once the codes are generated, decompression proceeds exactly as in the previous part, using inf_block:

static inf_stat_t inf_dyn_block(istream_t *is, uint8_t *dst,
size_t dst_cap, size_t *dst_pos)
{
inf_stat_t s;
huffman_decoder_t litlen_dec, dist_dec;
s = init_dyn_decoders(is, &litlen_dec, &dist_dec);
if (s != HWINF_OK) {
return s;
}
return inf_block(is, dst, dst_cap, dst_pos, &litlen_dec, &dist_dec);
}
The litlen and dist codes for dynamic Deflate blocks are stored as a series of codeword lengths. The lengths themselves are encoded with a third Huffman code, codelen. That code in turn is defined by codeword lengths (codelen_lens) which are also stored in the block (did I mention this is complicated?).

At the start of a dynamic block, 14 bits define how many litlen, dist, and codelen codeword lengths should be read from the block:

#define MIN_CODELEN_LENS 4
#define MAX_CODELEN_LENS 19
#define MIN_LITLEN_LENS 257
#define MAX_LITLEN_LENS 288
#define MIN_DIST_LENS 1
#define MAX_DIST_LENS 32
#define CODELEN_MAX_LIT 15
#define CODELEN_COPY 16
#define CODELEN_COPY_MIN 3
#define CODELEN_COPY_MAX 6
#define CODELEN_ZEROS 17
#define CODELEN_ZEROS_MIN 3
#define CODELEN_ZEROS_MAX 10
#define CODELEN_ZEROS2 18
#define CODELEN_ZEROS2_MIN 11
#define CODELEN_ZEROS2_MAX 138
static const int codelen_lengths_order[MAX_CODELEN_LENS] =
{ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 };
static inf_stat_t init_dyn_decoders(istream_t *is,
huffman_decoder_t *litlen_dec,
huffman_decoder_t *dist_dec)
{
uint64_t bits;
size_t num_litlen_lens, num_dist_lens, num_codelen_lens;
uint8_t codelen_lengths[MAX_CODELEN_LENS];
uint8_t code_lengths[MAX_LITLEN_LENS + MAX_DIST_LENS];
size_t i, n, used;
int sym;
huffman_decoder_t codelen_dec;
bits = istream_bits(is);
num_litlen_lens = lsb(bits, 5) + MIN_LITLEN_LENS;
bits >>= 5;
assert(num_litlen_lens <= MAX_LITLEN_LENS);
num_dist_lens = lsb(bits, 5) + MIN_DIST_LENS;
bits >>= 5;
assert(num_dist_lens <= MAX_DIST_LENS);
num_codelen_lens = lsb(bits, 4) + MIN_CODELEN_LENS;
bits >>= 4;
assert(num_codelen_lens <= MAX_CODELEN_LENS);
if (!istream_advance(is, 5 + 5 + 4)) {
return HWINF_ERR;
}
Next come the codeword lengths for the codelen code. These lengths are plain three-bit values, but written in the special order given by codelen_lengths_order. Although 19 lengths need to be defined, only num_codelen_lens of them are read from the stream; the rest are implicitly zero. The lengths are listed in a specific order so that zero lengths are more likely to fall at the end of the list and not have to be stored in the block.
for (i = 0; i < num_codelen_lens; i++) {
bits = istream_bits(is);
codelen_lengths[codelen_lengths_order[i]] =
(uint8_t)lsb(bits, 3);
if (!istream_advance(is, 3)) {
return HWINF_ERR;
}
}
for (; i < MAX_CODELEN_LENS; i++) {
codelen_lengths[codelen_lengths_order[i]] = 0;
}
if (!huffman_decoder_init(&codelen_dec, codelen_lengths,
MAX_CODELEN_LENS)) {
return HWINF_ERR;
}
With the codelen decoder set up, we can read the litlen and dist codeword lengths from the stream.
i = 0;
while (i < num_litlen_lens + num_dist_lens) {
bits = istream_bits(is);
sym = huffman_decode(&codelen_dec, (uint16_t)bits, &used);
bits >>= used;
if (!istream_advance(is, used)) {
return HWINF_ERR;
}
if (sym >= 0 && sym <= CODELEN_MAX_LIT) {
code_lengths[i++] = (uint8_t)sym;
}
The symbols 16, 17, and 18 are not actual lengths; they indicate that the previous length should be repeated some number of times, or that a run of zero lengths follows:

else if (sym == CODELEN_COPY) {
if (i < 1) {
return HWINF_ERR;
}
n = lsb(bits, 2) + CODELEN_COPY_MIN;
if (!istream_advance(is, 2)) {
return HWINF_ERR;
}
assert(n >= CODELEN_COPY_MIN && n <= CODELEN_COPY_MAX);
if (i + n > num_litlen_lens + num_dist_lens) {
return HWINF_ERR;
}
while (n--) {
code_lengths[i] = code_lengths[i - 1];
i++;
}
} else if (sym == CODELEN_ZEROS) {
n = lsb(bits, 3) + CODELEN_ZEROS_MIN;
if (!istream_advance(is, 3)) {
return HWINF_ERR;
}
assert(n >= CODELEN_ZEROS_MIN &&
n <= CODELEN_ZEROS_MAX);
if (i + n > num_litlen_lens + num_dist_lens) {
return HWINF_ERR;
}
while (n--) {
code_lengths[i++] = 0;
}
} else if (sym == CODELEN_ZEROS2) {
n = lsb(bits, 7) + CODELEN_ZEROS2_MIN;
if (!istream_advance(is, 7)) {
return HWINF_ERR;
}
assert(n >= CODELEN_ZEROS2_MIN &&
n <= CODELEN_ZEROS2_MAX);
if (i + n > num_litlen_lens + num_dist_lens) {
return HWINF_ERR;
}
while (n--) {
code_lengths[i++] = 0;
}
} else {
return HWINF_ERR;
}
}
Note that the litlen and dist lengths are read back-to-back into the same code_lengths array. They cannot be read separately, because runs of code lengths can carry over from the last litlen lengths into the first dist lengths.

With the codeword lengths in hand, we can set up the Huffman decoders and return to the task of decoding literals and back references:

if (!huffman_decoder_init(litlen_dec, &code_lengths[0],
. They cannot be read separately, because code length runs can be carried over from the last litlen lengths to the first dist lengths.Having prepared the codeword lengths, we can configure Huffman decoders and return to the task of decoding literals and backlinks: if (!huffman_decoder_init(litlen_dec, &code_lengths[0],
num_litlen_lens)) {
return HWINF_ERR;
}
if (!huffman_decoder_init(dist_dec, &code_lengths[num_litlen_lens],
num_dist_lens)) {
return HWINF_ERR;
}
return HWINF_OK;
}
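The run-length scheme for code lengths can be shown in miniature. A simplified sketch of the expansion loop in init_dyn_decoders, with the repeat counts passed in directly instead of being read as extra bits from the stream (the types and names here are ours, for illustration):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int sym; int n; } codelen_op_t;

/* Expand decoded codelen symbols into code lengths. Symbols 0-15 are
   literal lengths; 16 repeats the previous length n times; 17 and 18
   emit n zero lengths. Returns the number of lengths produced. */
static size_t expand_codelens(const codelen_op_t *ops, size_t num_ops,
                              uint8_t *out)
{
    size_t i = 0, j, k;

    for (k = 0; k < num_ops; k++) {
        if (ops[k].sym <= 15) {
            out[i++] = (uint8_t)ops[k].sym;  /* literal length */
        } else if (ops[k].sym == 16) {
            for (j = 0; j < (size_t)ops[k].n; j++) {
                out[i] = out[i - 1];         /* repeat previous */
                i++;
            }
        } else {                             /* 17 or 18 */
            for (j = 0; j < (size_t)ops[k].n; j++) {
                out[i++] = 0;                /* run of zeros */
            }
        }
    }
    return i;
}
```

For example, the sequence "length 8, repeat 3 times, 4 zeros, length 5" expands to nine code lengths.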
Compression (Deflation)
In the previous parts we built all the tools needed for Deflate compression: Lempel-Ziv, Huffman coding, bitstreams, and a description of the three Deflate block types. In this part we put them together to perform Deflate compression.

LZ77 compression parses the source data into a sequence of back references and literals. This sequence must be split up and encoded into Deflate blocks as described in the previous part. Choosing how to split it is often called block splitting. On the one hand, each new block carries some overhead, whose size depends on the block type and contents: fewer blocks, less overhead. On the other hand, the cost of starting a new block can pay off, for example if the characteristics of the data allow more efficient Huffman coding and reduce the total output.

Block splitting is a difficult optimization problem. Some compressors (for example, Zopfli) try harder than others, but most simply use a greedy approach: emit a block once it reaches a certain size.

Different block types have different size limits:

- Uncompressed blocks can hold at most 65,535 bytes.
- Blocks with fixed Huffman codes have no maximum size.
- Blocks with dynamic Huffman codes have no maximum size in general, but since our implementation of the Huffman algorithm uses 16-bit symbol frequencies, we are limited to 65,535 symbols.
To be free to use any block type, we limit the block size to 65,534 bytes:
#define MAX_BLOCK_LEN_BYTES 65534
To track the output bitstream and the contents of the current block during compression, we use this structure:

typedef struct deflate_state_t deflate_state_t;
struct deflate_state_t {
ostream_t os;
const uint8_t *block_src;
size_t block_len;
size_t block_len_bytes;
uint16_t litlen_freqs[LITLEN_MAX + 1];
uint16_t dist_freqs[DISTSYM_MAX + 1];
struct {
uint16_t distance;
union {
uint16_t lit;
uint16_t len;
} u;
} block[MAX_BLOCK_LEN_BYTES + 1];
};
static void reset_block(deflate_state_t *s)
{
s->block_len = 0;
s->block_len_bytes = 0;
memset(s->litlen_freqs, 0, sizeof(s->litlen_freqs));
memset(s->dist_freqs, 0, sizeof(s->dist_freqs));
}
We use callback functions to have lz77_compress add its results to the current block, and write the block to the bitstream when it reaches its maximum size:

static bool lit_callback(uint8_t lit, void *aux)
{
deflate_state_t *s = aux;
if (s->block_len_bytes + 1 > MAX_BLOCK_LEN_BYTES) {
if (!write_block(s, false)) {
return false;
}
s->block_src += s->block_len_bytes;
reset_block(s);
}
assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
s->block[s->block_len ].distance = 0;
s->block[s->block_len++].u.lit = lit;
s->block_len_bytes++;
s->litlen_freqs[lit]++;
return true;
}
static bool backref_callback(size_t dist, size_t len, void *aux)
{
deflate_state_t *s = aux;
if (s->block_len_bytes + len > MAX_BLOCK_LEN_BYTES) {
if (!write_block(s, false)) {
return false;
}
s->block_src += s->block_len_bytes;
reset_block(s);
}
assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
s->block[s->block_len ].distance = (uint16_t)dist;
s->block[s->block_len++].u.len = (uint16_t)len;
s->block_len_bytes += len;
assert(len >= MIN_LEN && len <= MAX_LEN);
assert(dist >= MIN_DISTANCE && dist <= MAX_DISTANCE);
s->litlen_freqs[len2litlen[len]]++;
s->dist_freqs[distance2dist[dist]]++;
return true;
}
The most interesting part is writing the blocks. If a block is stored uncompressed, it is straightforward:

static bool write_uncomp_block(deflate_state_t *s, bool final)
{
uint8_t len_nlen[4];
if (!ostream_write(&s->os, (0x0 << 1) | final, 3)) {
return false;
}
len_nlen[0] = (uint8_t)(s->block_len_bytes >> 0);
len_nlen[1] = (uint8_t)(s->block_len_bytes >> 8);
len_nlen[2] = ~len_nlen[0];
len_nlen[3] = ~len_nlen[1];
if (!ostream_write_bytes_aligned(&s->os, len_nlen, sizeof(len_nlen))) {
return false;
}
if (!ostream_write_bytes_aligned(&s->os, s->block_src,
s->block_len_bytes)) {
return false;
}
return true;
}
To write a static Huffman block, we first generate canonical codes from the fixed codeword lengths for the litlen and dist codes. Then we iterate over the block, emitting the symbols with those codes:

static bool write_static_block(deflate_state_t *s, bool final)
{
huffman_encoder_t litlen_enc, dist_enc;
if (!ostream_write(&s->os, (0x1 << 1) | final, 3)) {
return false;
}
huffman_encoder_init2(&litlen_enc, fixed_litlen_lengths,
sizeof(fixed_litlen_lengths) /
sizeof(fixed_litlen_lengths[0]));
huffman_encoder_init2(&dist_enc, fixed_dist_lengths,
sizeof(fixed_dist_lengths) /
sizeof(fixed_dist_lengths[0]));
return write_huffman_block(s, &litlen_enc, &dist_enc);
}
static bool write_huffman_block(deflate_state_t *s,
const huffman_encoder_t *litlen_enc,
const huffman_encoder_t *dist_enc)
{
size_t i, nbits;
uint64_t distance, dist, len, litlen, bits, ebits;
for (i = 0; i < s->block_len; i++) {
if (s->block[i].distance == 0) {
litlen = s->block[i].u.lit;
assert(litlen <= LITLEN_EOB);
if (!ostream_write(&s->os,
litlen_enc->codewords[litlen],
litlen_enc->lengths[litlen])) {
return false;
}
continue;
}
len = s->block[i].u.len;
litlen = len2litlen[len];
bits = litlen_enc->codewords[litlen];
nbits = litlen_enc->lengths[litlen];
ebits = len - litlen_tbl[litlen - LITLEN_TBL_OFFSET].base_len;
bits |= ebits << nbits;
nbits += litlen_tbl[litlen - LITLEN_TBL_OFFSET].ebits;
distance = s->block[i].distance;
dist = distance2dist[distance];
bits |= (uint64_t)dist_enc->codewords[dist] << nbits;
nbits += dist_enc->lengths[dist];
ebits = distance - dist_tbl[dist].base_dist;
bits |= ebits << nbits;
nbits += dist_tbl[dist].ebits;
if (!ostream_write(&s->os, bits, nbits)) {
return false;
}
}
return true;
}
Writing dynamic Huffman blocks is harder, of course, because they carry the tricky encoding of the litlen and dist codes themselves. To represent that encoding, we use this structure:

typedef struct codelen_sym_t codelen_sym_t;
struct codelen_sym_t {
uint8_t sym;
uint8_t count;
};
First we trim the trailing zero lengths from the litlen and dist codeword lengths, then copy them into one array for subsequent encoding. We cannot drop all the zeros: a Deflate block cannot be encoded with no dist codes at all. Nor is it possible to have fewer than 257 litlen codes, but since we always emit the end-of-block marker, there will always be a non-zero code length for symbol 256.
static size_t encode_dist_litlen_lens(const uint8_t *litlen_lens,
const uint8_t *dist_lens,
codelen_sym_t *encoded,
size_t *num_litlen_lens,
size_t *num_dist_lens)
{
size_t i, n;
uint8_t lens[LITLEN_MAX + 1 + DISTSYM_MAX + 1];
*num_litlen_lens = LITLEN_MAX + 1;
*num_dist_lens = DISTSYM_MAX + 1;
assert(litlen_lens[LITLEN_EOB] != 0 && "EOB len should be non-zero.");
while (litlen_lens[*num_litlen_lens - 1] == 0) {
(*num_litlen_lens)--;
}
assert(*num_litlen_lens >= MIN_LITLEN_LENS);
while (dist_lens[*num_dist_lens - 1] == 0 && *num_dist_lens > 1) {
(*num_dist_lens)--;
}
assert(*num_dist_lens >= MIN_DIST_LENS);
n = 0;
for (i = 0; i < *num_litlen_lens; i++) {
lens[n++] = litlen_lens[i];
}
for (i = 0; i < *num_dist_lens; i++) {
lens[n++] = dist_lens[i];
}
return encode_lens(lens, n, encoded);
}
With the code lengths gathered into one array, we encode them, using special symbols for runs of identical code lengths.
static size_t encode_lens(const uint8_t *lens, size_t n, codelen_sym_t *encoded)
{
size_t i, j, num_encoded;
uint8_t count;
i = 0;
num_encoded = 0;
while (i < n) {
if (lens[i] == 0) {
for (j = i; j < min(n, i + CODELEN_ZEROS2_MAX) &&
lens[j] == 0; j++);
count = (uint8_t)(j - i);
if (count < CODELEN_ZEROS_MIN) {
encoded[num_encoded++].sym = 0;
i++;
continue;
}
if (count <= CODELEN_ZEROS_MAX) {
assert(count >= CODELEN_ZEROS_MIN &&
count <= CODELEN_ZEROS_MAX);
encoded[num_encoded].sym = CODELEN_ZEROS;
encoded[num_encoded++].count = count;
} else {
assert(count >= CODELEN_ZEROS2_MIN &&
count <= CODELEN_ZEROS2_MAX);
encoded[num_encoded].sym = CODELEN_ZEROS2;
encoded[num_encoded++].count = count;
}
i = j;
continue;
}
encoded[num_encoded++].sym = lens[i++];
for (j = i; j < min(n, i + CODELEN_COPY_MAX) &&
lens[j] == lens[i - 1]; j++);
count = (uint8_t)(j - i);
if (count >= CODELEN_COPY_MIN) {
assert(count >= CODELEN_COPY_MIN &&
count <= CODELEN_COPY_MAX);
encoded[num_encoded].sym = CODELEN_COPY;
encoded[num_encoded++].count = count;
i = j;
continue;
}
}
return num_encoded;
}
The symbols used in this encoding are themselves written with a Huffman code, the codelen code. The codeword lengths of the codelen code are written to the block in a special order, chosen so that zero lengths are likely to end up at the end and can be omitted. Here is a function that computes how many of those lengths need to be written:

static const int codelen_lengths_order[19] =
{ 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15 };
size_t count_codelen_lens(const uint8_t *codelen_lens)
{
size_t n = MAX_CODELEN_LENS;
while (codelen_lens[codelen_lengths_order[n - 1]] == 0) {
n--;
}
assert(n >= MIN_CODELEN_LENS && n <= MAX_CODELEN_LENS);
return n;
}
Suppose we have already computed the litlen and dist codes, the run-length encoding of their codeword lengths, and the codelen code for those lengths. Now we can write a dynamic Huffman block:

static bool write_dynamic_block(deflate_state_t *s, bool final,
size_t num_litlen_lens, size_t num_dist_lens,
size_t num_codelen_lens,
const huffman_encoder_t *codelen_enc,
const codelen_sym_t *encoded_lens,
size_t num_encoded_lens,
const huffman_encoder_t *litlen_enc,
const huffman_encoder_t *dist_enc)
{
size_t i;
uint8_t codelen, sym;
size_t nbits;
uint64_t bits, hlit, hdist, hclen, count;
bits = (0x2 << 1) | final;
nbits = 3;
hlit = num_litlen_lens - MIN_LITLEN_LENS;
bits |= hlit << nbits;
nbits += 5;
hdist = num_dist_lens - MIN_DIST_LENS;
bits |= hdist << nbits;
nbits += 5;
hclen = num_codelen_lens - MIN_CODELEN_LENS;
bits |= hclen << nbits;
nbits += 4;
if (!ostream_write(&s->os, bits, nbits)) {
return false;
}
for (i = 0; i < num_codelen_lens; i++) {
codelen = codelen_enc->lengths[codelen_lengths_order[i]];
if (!ostream_write(&s->os, codelen, 3)) {
return false;
}
}
for (i = 0; i < num_encoded_lens; i++) {
sym = encoded_lens[i].sym;
bits = codelen_enc->codewords[sym];
nbits = codelen_enc->lengths[sym];
count = encoded_lens[i].count;
if (sym == CODELEN_COPY) {
bits |= (count - CODELEN_COPY_MIN) << nbits;
nbits += 2;
} else if (sym == CODELEN_ZEROS) {
bits |= (count - CODELEN_ZEROS_MIN) << nbits;
nbits += 3;
} else if (sym == CODELEN_ZEROS2) {
bits |= (count - CODELEN_ZEROS2_MIN) << nbits;
nbits += 7;
}
if (!ostream_write(&s->os, bits, nbits)) {
return false;
}
}
return write_huffman_block(s, litlen_enc, dist_enc);
}
For each block, we want to use whichever type requires the fewest bits. The length of an uncompressed block can be computed quickly:
static size_t uncomp_block_len(const deflate_state_t *s)
{
size_t bit_pos, padding;
bit_pos = ostream_bit_pos(&s->os) + 3;
padding = round_up(bit_pos, 8) - bit_pos;
return 3 + padding + 2 * 16 + s->block_len_bytes * 8;
}
For Huffman-encoded blocks, the body length can be computed from the litlen and dist symbol frequencies and codeword lengths:
static size_t huffman_block_body_len(const deflate_state_t *s,
const uint8_t *litlen_lens,
const uint8_t *dist_lens)
{
size_t i, freq, len;
len = 0;
for (i = 0; i <= LITLEN_MAX; i++) {
freq = s->litlen_freqs[i];
len += litlen_lens[i] * freq;
if (i >= LITLEN_TBL_OFFSET) {
len += litlen_tbl[i - LITLEN_TBL_OFFSET].ebits * freq;
}
}
for (i = 0; i <= DISTSYM_MAX; i++) {
freq = s->dist_freqs[i];
len += dist_lens[i] * freq;
len += dist_tbl[i].ebits * freq;
}
return len;
}
The total length of a static block is the 3 header bits plus the body length. Computing the header size of a dynamic block takes considerably more work:
static size_t dyn_block_len(const deflate_state_t *s, size_t num_codelen_lens,
const uint16_t *codelen_freqs,
const huffman_encoder_t *codelen_enc,
const huffman_encoder_t *litlen_enc,
const huffman_encoder_t *dist_enc)
{
size_t len, i, freq;
len = 3;
len += 5 + 5 + 4;
len += 3 * num_codelen_lens;
for (i = 0; i < MAX_CODELEN_LENS; i++) {
freq = codelen_freqs[i];
len += codelen_enc->lengths[i] * freq;
if (i == CODELEN_COPY) {
len += 2 * freq;
} else if (i == CODELEN_ZEROS) {
len += 3 * freq;
} else if (i == CODELEN_ZEROS2) {
len += 7 * freq;
}
}
return len + huffman_block_body_len(s, litlen_enc->lengths,
dist_enc->lengths);
}
Now we will put everything together and create the main function for writing blocks:
static bool write_block(deflate_state_t *s, bool final)
{
size_t old_bit_pos, uncomp_len, static_len, dynamic_len;
huffman_encoder_t dyn_litlen_enc, dyn_dist_enc, codelen_enc;
size_t num_encoded_lens, num_litlen_lens, num_dist_lens;
codelen_sym_t encoded_lens[LITLEN_MAX + 1 + DISTSYM_MAX + 1];
uint16_t codelen_freqs[MAX_CODELEN_LENS] = {0};
size_t num_codelen_lens;
size_t i;
old_bit_pos = ostream_bit_pos(&s->os);
assert(s->block_len < sizeof(s->block) / sizeof(s->block[0]));
assert(s->litlen_freqs[LITLEN_EOB] == 0);
s->block[s->block_len ].distance = 0;
s->block[s->block_len++].u.lit = LITLEN_EOB;
s->litlen_freqs[LITLEN_EOB] = 1;
uncomp_len = uncomp_block_len(s);
static_len = 3 + huffman_block_body_len(s, fixed_litlen_lengths,
fixed_dist_lengths);
huffman_encoder_init(&dyn_litlen_enc, s->litlen_freqs,
LITLEN_MAX + 1, 15);
huffman_encoder_init(&dyn_dist_enc, s->dist_freqs, DISTSYM_MAX + 1, 15);
num_encoded_lens = encode_dist_litlen_lens(dyn_litlen_enc.lengths,
dyn_dist_enc.lengths,
encoded_lens,
&num_litlen_lens,
&num_dist_lens);
for (i = 0; i < num_encoded_lens; i++) {
codelen_freqs[encoded_lens[i].sym]++;
}
huffman_encoder_init(&codelen_enc, codelen_freqs, MAX_CODELEN_LENS, 7);
num_codelen_lens = count_codelen_lens(codelen_enc.lengths);
dynamic_len = dyn_block_len(s, num_codelen_lens, codelen_freqs,
&codelen_enc, &dyn_litlen_enc,
&dyn_dist_enc);
if (uncomp_len <= dynamic_len && uncomp_len <= static_len) {
if (!write_uncomp_block(s, final)) {
return false;
}
assert(ostream_bit_pos(&s->os) - old_bit_pos == uncomp_len);
} else if (static_len <= dynamic_len) {
if (!write_static_block(s, final)) {
return false;
}
assert(ostream_bit_pos(&s->os) - old_bit_pos == static_len);
} else {
if (!write_dynamic_block(s, final, num_litlen_lens,
num_dist_lens, num_codelen_lens,
&codelen_enc, encoded_lens,
num_encoded_lens, &dyn_litlen_enc,
&dyn_dist_enc)) {
return false;
}
assert(ostream_bit_pos(&s->os) - old_bit_pos == dynamic_len);
}
return true;
}
Finally, the driver of the whole compression process sets up the initial state, runs the Lempel-Ziv compression, and writes the final block:
bool hwdeflate(const uint8_t *src, size_t src_len, uint8_t *dst,
size_t dst_cap, size_t *dst_used)
{
deflate_state_t s;
ostream_init(&s.os, dst, dst_cap);
reset_block(&s);
s.block_src = src;
if (!lz77_compress(src, src_len, &lit_callback,
&backref_callback, &s)) {
return false;
}
if (!write_block(&s, true)) {
return false;
}
assert(s.block_src + s.block_len_bytes == src + src_len);
*dst_used = ostream_bytes_written(&s.os);
return true;
}
Zip File Format
Above, we examined how the Deflate compression used in Zip files works. But what about the file format itself? In this part we look at its structure and implementation in detail. The code is available in zip.h and zip.c.

Overview
The file format is described in the PKZip Application Note:

- Each file, or archive member, in a Zip file has a local file header with metadata about the member, followed by the member's data.
- The central directory serves as the archive's index. It contains a central file header for each member, duplicating the metadata from the local header and adding the offset of that header in the Zip file.
- At the end of the file comes the end of central directory record, which holds the size and location of the central directory and marks the end of the archive.
Each archive member is compressed and stored individually. This means that even if there are matches between files in the archive, they will not be exploited to improve compression.

Placing the central directory at the end allows an archive to be built incrementally. As each member is compressed, it is appended to the archive, and the index is written last, once all the compressed sizes, and therefore the offsets of all the members, are known. Adding files to an existing archive is also easy: new members are placed after the last one, and the central directory is rewritten.

The ability to create archives gradually was especially important for distributing them across multiple floppy disks, or volumes. As compression proceeded, PKZip prompted the user to insert new diskettes, writing the central directory to the last one(s). To extract a multi-volume archive, PKZip would first ask for the last diskette in order to read the central directory, and then for whichever diskettes were needed to extract the requested files.

It may surprise you, but there was no rule prohibiting multiple members with the same name in an archive. This could cause great confusion when extracting: if there are several files with the same name, which one should be extracted? It could also lead to security problems. Because of the "Master Key" bug in Android (CVE-2013-4787; see the slides and video from the Black Hat talk), an attacker could bypass the operating system's cryptographic signature checks when installing programs. Android programs are distributed as APK files, which are Zip files. As it turned out, if an APK contained multiple files with the same name, the signature verification code picked the last file with that name while the installer picked the first, so the check was effectively skipped.
In other words, this small difference between two Zip libraries made it possible to bypass the operating system's entire security model.

Unlike most formats, Zip files do not have to begin with a signature or magic number. Nothing says a Zip file must start in any particular way, which makes it easy to create files that are simultaneously valid Zip files and valid files of another format: polyglot files. For example, self-extracting Zip archives (such as pkz204g.exe) are usually both executables and Zip files: the first part is executable code, followed by a Zip file (which the executable part extracts). The OS can run it as an executable, but a Zip program will open it as a Zip file. This feature may be the reason the format does not require a signature at the beginning of the file.

Although such polyglot files are clever, they can lead to security problems, because they can fool programs that try to determine a file's type, and they allow malicious code to be smuggled past scanners that only inspect one of the file types. For example, exploits have used GIFAR files, which are simultaneously valid GIF images and Java archives (JARs, a kind of Zip file). For more on this topic, see the article Abusing file formats (starting on page 18).

As we will see below, Zip files use 32-bit fields for offsets and sizes, which limits the size of the archive and of its members to four gigabytes. In Application Note 4.5, PKWare added format extensions that allow 64-bit offsets and sizes. Files using those extensions are in the Zip64 format, which we will not consider here.

Data structures
End of central directory record
The end of central directory record (EOCDR) is typically the starting point for reading a Zip file. It holds the location and size of the central directory, as well as an optional comment about the whole archive.

In Zip files that spanned several diskettes, or volumes, the EOCDR also recorded which disk is the current one, on which disk the central directory starts, and so on. This functionality is rarely used today, and the code in this article does not handle such files.

The EOCDR is identified by the signature 'P', 'K' followed by the bytes 5 and 6. After the signature comes the structure below, with numbers stored in little-endian byte order:
struct eocdr {
uint16_t disk_nbr;
uint16_t cd_start_disk;
uint16_t disk_cd_entries;
uint16_t cd_entries;
uint32_t cd_size;
uint32_t cd_offset;
uint16_t comment_len;
const uint8_t *comment;
};
The EOCDR must be located at the end of the file. But because it can be followed by a comment of arbitrary (16-bit) length, we may have to search for its exact position:
#define READ16(p) ((p) += 2, read16le((p) - 2))
#define READ32(p) ((p) += 4, read32le((p) - 4))
#define EOCDR_BASE_SZ 22
#define EOCDR_SIGNATURE 0x06054b50
static bool find_eocdr(struct eocdr *r, const uint8_t *src, size_t src_len)
{
size_t comment_len;
const uint8_t *p;
uint32_t signature;
for (comment_len = 0; comment_len <= UINT16_MAX; comment_len++) {
if (src_len < EOCDR_BASE_SZ + comment_len) {
break;
}
p = &src[src_len - EOCDR_BASE_SZ - comment_len];
signature = READ32(p);
if (signature == EOCDR_SIGNATURE) {
r->disk_nbr = READ16(p);
r->cd_start_disk = READ16(p);
r->disk_cd_entries = READ16(p);
r->cd_entries = READ16(p);
r->cd_size = READ32(p);
r->cd_offset = READ32(p);
r->comment_len = READ16(p);
r->comment = p;
assert(p == &src[src_len - comment_len] &&
"All fields read.");
if (r->comment_len == comment_len) {
return true;
}
}
}
return false;
}
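The READ16 and READ32 macros above rely on little-endian helpers read16le and read32le, which are defined elsewhere in the hwzip sources. A portable sketch that works regardless of host byte order might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t read16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t read32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

For example, the EOCDR signature bytes 'P', 'K', 5, 6 read back as 0x06054b50, matching EOCDR_SIGNATURE.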
Writing an EOCDR is straightforward. This function writes one and returns the number of bytes written:
#define WRITE16(p, x) (write16le((p), (x)), (p) += 2)
#define WRITE32(p, x) (write32le((p), (x)), (p) += 4)
static size_t write_eocdr(uint8_t *dst, const struct eocdr *r)
{
uint8_t *p = dst;
WRITE32(p, EOCDR_SIGNATURE);
WRITE16(p, r->disk_nbr);
WRITE16(p, r->cd_start_disk);
WRITE16(p, r->disk_cd_entries);
WRITE16(p, r->cd_entries);
WRITE32(p, r->cd_size);
WRITE32(p, r->cd_offset);
WRITE16(p, r->comment_len);
assert(p - dst == EOCDR_BASE_SZ);
if (r->comment_len != 0) {
memcpy(p, r->comment, r->comment_len);
p += r->comment_len;
}
return (size_t)(p - dst);
}
Central file header
The central directory consists of central file headers written one after another, one per archive member. Each header begins with the signature 'P', 'K', 1, 2, followed by this structure:

#define EXT_ATTR_DIR (1U << 4)
#define EXT_ATTR_ARC (1U << 5)
struct cfh {
uint16_t made_by_ver;
uint16_t extract_ver;
uint16_t gp_flag;
uint16_t method;
uint16_t mod_time;
uint16_t mod_date;
uint32_t crc32;
uint32_t comp_size;
uint32_t uncomp_size;
uint16_t name_len;
uint16_t extra_len;
uint16_t comment_len;
uint16_t disk_nbr_start;
uint16_t int_attrs;
uint32_t ext_attrs;
uint32_t lfh_offset;
const uint8_t *name;
const uint8_t *extra;
const uint8_t *comment;
};
made_by_ver and extract_ver encode which OS and program version were used to add the member, and which version is required to extract it. The high eight bits encode the operating system (for example, 0 means DOS, 3 means Unix, 10 means Windows NTFS), and the low eight bits encode the software version. We set both to decimal 20, meaning compatibility with PKZip 2.0.

gp_flag contains various flags. We care about:

- Bit 0, indicating whether the member is encrypted (we will not consider encryption);
- Bits 1 and 2, encoding the Deflate compression level (0 for normal, 1 for maximum, 2 for fast, 3 for super fast).

method encodes the compression method: 0 means the data is stored uncompressed, 8 means Deflate was applied. Other values refer to older or newer algorithms, but almost all Zip files use these two.

mod_time and mod_date hold the file's modification date and time, encoded in MS-DOS format. With the code below we convert regular C time_t timestamps to and from the MS-DOS format:
static time_t dos2ctime(uint16_t dos_date, uint16_t dos_time)
{
struct tm tm = {0};
tm.tm_sec = (dos_time & 0x1f) * 2;
tm.tm_min = (dos_time >> 5) & 0x3f;
tm.tm_hour = (dos_time >> 11);
tm.tm_mday = (dos_date & 0x1f);
tm.tm_mon = ((dos_date >> 5) & 0xf) - 1;
tm.tm_year = (dos_date >> 9) + 80;
tm.tm_isdst = -1;
return mktime(&tm);
}
static void ctime2dos(time_t t, uint16_t *dos_date, uint16_t *dos_time)
{
struct tm *tm = localtime(&t);
*dos_time = 0;
*dos_time |= tm->tm_sec / 2;
*dos_time |= tm->tm_min << 5;
*dos_time |= tm->tm_hour << 11;
*dos_date = 0;
*dos_date |= tm->tm_mday;
*dos_date |= (tm->tm_mon + 1) << 5;
*dos_date |= (tm->tm_year - 80) << 9;
}
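The bit layout used by ctime2dos can be illustrated with a standalone sketch: seconds are stored in two-second units and years are offset from 1980. The packing functions here are hypothetical helpers, not part of hwzip:

```c
#include <assert.h>
#include <stdint.h>

/* Pack hours, minutes and seconds into the MS-DOS time format:
   bits 0-4 hold seconds/2, bits 5-10 the minutes, bits 11-15 the hours. */
static uint16_t pack_dos_time(int hour, int min, int sec)
{
    return (uint16_t)((sec / 2) | (min << 5) | (hour << 11));
}

/* Pack a date into the MS-DOS date format: bits 0-4 hold the day,
   bits 5-8 the month, bits 9-15 the year minus 1980. */
static uint16_t pack_dos_date(int year, int month, int day)
{
    return (uint16_t)(day | (month << 5) | ((year - 1980) << 9));
}
```

Note the two-second granularity: 59 seconds packs the same as 58, which is why dos2ctime multiplies the stored value by two when decoding.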
The crc32 field holds a cyclic redundancy check value over the uncompressed data. It is used to verify data integrity after extraction. The implementation is in crc32.c.

comp_size and uncomp_size hold the compressed and uncompressed sizes of the member's file data. The next three fields hold the lengths of the name, extra data and comment that immediately follow the header. disk_nbr_start is intended for archives spanning multiple diskettes.

int_attrs and ext_attrs describe the internal and external attributes of the file. The internal attributes concern the file's contents; for example, the least significant bit indicates whether the file contains only text. The external attributes indicate whether the file is hidden, read-only, and so on. The contents of these fields depend on the OS, specifically on made_by_ver. On DOS, the low 8 bits hold the file attribute byte, as obtained from the Int 21/AX=4300h system call. For example, bit 4 means it is a directory and bit 5 means the "archive" attribute is set (true for most files on DOS). As far as I understand, these bits are set similarly on other operating systems for the sake of compatibility. On Unix, the high 16 bits of this field hold the file mode bits as returned by stat(2) in st_mode.

lfh_offset tells us where to find the local file header. name is the file name (not null-terminated), and comment is an optional comment for this archive member (also not null-terminated). extra can hold optional extra data such as Unix file ownership information, more precise modification times, or Zip64 fields.

This function is used to read central file headers:
#define CFH_BASE_SZ 46
#define CFH_SIGNATURE 0x02014b50
static bool read_cfh(struct cfh *cfh, const uint8_t *src, size_t src_len,
size_t offset)
{
const uint8_t *p;
uint32_t signature;
if (offset > src_len || src_len - offset < CFH_BASE_SZ) {
return false;
}
p = &src[offset];
signature = READ32(p);
if (signature != CFH_SIGNATURE) {
return false;
}
cfh->made_by_ver = READ16(p);
cfh->extract_ver = READ16(p);
cfh->gp_flag = READ16(p);
cfh->method = READ16(p);
cfh->mod_time = READ16(p);
cfh->mod_date = READ16(p);
cfh->crc32 = READ32(p);
cfh->comp_size = READ32(p);
cfh->uncomp_size = READ32(p);
cfh->name_len = READ16(p);
cfh->extra_len = READ16(p);
cfh->comment_len = READ16(p);
cfh->disk_nbr_start = READ16(p);
cfh->int_attrs = READ16(p);
cfh->ext_attrs = READ32(p);
cfh->lfh_offset = READ32(p);
cfh->name = p;
cfh->extra = cfh->name + cfh->name_len;
cfh->comment = cfh->extra + cfh->extra_len;
assert(p == &src[offset + CFH_BASE_SZ] && "All fields read.");
if (src_len - offset - CFH_BASE_SZ <
cfh->name_len + cfh->extra_len + cfh->comment_len) {
return false;
}
return true;
}
static size_t write_cfh(uint8_t *dst, const struct cfh *cfh)
{
uint8_t *p = dst;
WRITE32(p, CFH_SIGNATURE);
WRITE16(p, cfh->made_by_ver);
WRITE16(p, cfh->extract_ver);
WRITE16(p, cfh->gp_flag);
WRITE16(p, cfh->method);
WRITE16(p, cfh->mod_time);
WRITE16(p, cfh->mod_date);
WRITE32(p, cfh->crc32);
WRITE32(p, cfh->comp_size);
WRITE32(p, cfh->uncomp_size);
WRITE16(p, cfh->name_len);
WRITE16(p, cfh->extra_len);
WRITE16(p, cfh->comment_len);
WRITE16(p, cfh->disk_nbr_start);
WRITE16(p, cfh->int_attrs);
WRITE32(p, cfh->ext_attrs);
WRITE32(p, cfh->lfh_offset);
assert(p - dst == CFH_BASE_SZ);
if (cfh->name_len != 0) {
memcpy(p, cfh->name, cfh->name_len);
p += cfh->name_len;
}
if (cfh->extra_len != 0) {
memcpy(p, cfh->extra, cfh->extra_len);
p += cfh->extra_len;
}
if (cfh->comment_len != 0) {
memcpy(p, cfh->comment, cfh->comment_len);
p += cfh->comment_len;
}
return (size_t)(p - dst);
}
Local file header
The data of each archive member is preceded by a local file header, which repeats most of the information from the central file header.

The duplication of data in the central and local headers was probably introduced so that PKZip would not have to keep the entire central directory in memory while extracting. Instead, as each file is extracted, its name and other information can be read from its local header. Local headers are also useful for recovering files from Zip archives whose central directory is missing or damaged.

However, this redundancy is also a major source of ambiguity. For example, what happens if the file names in the central and local headers do not match? That sort of thing often leads to bugs and security problems.

Not all of the central header's information is duplicated. For example, the file attribute fields are not. Also, if the third least significant bit of gp_flag is set, then the CRC-32, compressed-size and uncompressed-size fields are set to zero, and the real values appear in a Data Descriptor block after the file data itself (we will not consider it). This makes it possible to write a local header before the member's size, or the size it will compress to, is known.

The local file header begins with the signature 'P', 'K', 3, 4, followed by this structure:
struct lfh {
uint16_t extract_ver;
uint16_t gp_flag;
uint16_t method;
uint16_t mod_time;
uint16_t mod_date;
uint32_t crc32;
uint32_t comp_size;
uint32_t uncomp_size;
uint16_t name_len;
uint16_t extra_len;
const uint8_t *name;
const uint8_t *extra;
};
These functions read and write local headers, just like those for the other data structures:
#define LFH_BASE_SZ 30
#define LFH_SIGNATURE 0x04034b50
static bool read_lfh(struct lfh *lfh, const uint8_t *src, size_t src_len,
size_t offset)
{
const uint8_t *p;
uint32_t signature;
if (offset > src_len || src_len - offset < LFH_BASE_SZ) {
return false;
}
p = &src[offset];
signature = READ32(p);
if (signature != LFH_SIGNATURE) {
return false;
}
lfh->extract_ver = READ16(p);
lfh->gp_flag = READ16(p);
lfh->method = READ16(p);
lfh->mod_time = READ16(p);
lfh->mod_date = READ16(p);
lfh->crc32 = READ32(p);
lfh->comp_size = READ32(p);
lfh->uncomp_size = READ32(p);
lfh->name_len = READ16(p);
lfh->extra_len = READ16(p);
lfh->name = p;
lfh->extra = lfh->name + lfh->name_len;
assert(p == &src[offset + LFH_BASE_SZ] && "All fields read.");
if (src_len - offset - LFH_BASE_SZ < lfh->name_len + lfh->extra_len) {
return false;
}
return true;
}
static size_t write_lfh(uint8_t *dst, const struct lfh *lfh)
{
uint8_t *p = dst;
WRITE32(p, LFH_SIGNATURE);
WRITE16(p, lfh->extract_ver);
WRITE16(p, lfh->gp_flag);
WRITE16(p, lfh->method);
WRITE16(p, lfh->mod_time);
WRITE16(p, lfh->mod_date);
WRITE32(p, lfh->crc32);
WRITE32(p, lfh->comp_size);
WRITE32(p, lfh->uncomp_size);
WRITE16(p, lfh->name_len);
WRITE16(p, lfh->extra_len);
assert(p - dst == LFH_BASE_SZ);
if (lfh->name_len != 0) {
memcpy(p, lfh->name, lfh->name_len);
p += lfh->name_len;
}
if (lfh->extra_len != 0) {
memcpy(p, lfh->extra, lfh->extra_len);
p += lfh->extra_len;
}
return (size_t)(p - dst);
}
Implementation of Zip Read
Using the functions above, we implement reading a Zip file into memory and obtain an iterator for accessing the archive members:

typedef size_t zipiter_t;
typedef struct zip_t zip_t;
struct zip_t {
uint16_t num_members;
const uint8_t *comment;
uint16_t comment_len;
zipiter_t members_begin;
zipiter_t members_end;
const uint8_t *src;
size_t src_len;
};
bool zip_read(zip_t *zip, const uint8_t *src, size_t src_len)
{
struct eocdr eocdr;
struct cfh cfh;
struct lfh lfh;
size_t i, offset;
const uint8_t *comp_data;
zip->src = src;
zip->src_len = src_len;
if (!find_eocdr(&eocdr, src, src_len)) {
return false;
}
if (eocdr.disk_nbr != 0 || eocdr.cd_start_disk != 0 ||
eocdr.disk_cd_entries != eocdr.cd_entries) {
return false;
}
zip->num_members = eocdr.cd_entries;
zip->comment = eocdr.comment;
zip->comment_len = eocdr.comment_len;
offset = eocdr.cd_offset;
zip->members_begin = offset;
for (i = 0; i < eocdr.cd_entries; i++) {
if (!read_cfh(&cfh, src, src_len, offset)) {
return false;
}
if (cfh.gp_flag & 1) {
return false;
}
if (cfh.method != ZIP_STORED && cfh.method != ZIP_DEFLATED) {
return false;
}
if (cfh.method == ZIP_STORED &&
cfh.uncomp_size != cfh.comp_size) {
return false;
}
if (cfh.disk_nbr_start != 0) {
return false;
}
if (memchr(cfh.name, '\0', cfh.name_len) != NULL) {
return false;
}
if (!read_lfh(&lfh, src, src_len, cfh.lfh_offset)) {
return false;
}
comp_data = lfh.extra + lfh.extra_len;
if (cfh.comp_size > src_len - (size_t)(comp_data - src)) {
return false;
}
offset += CFH_BASE_SZ + cfh.name_len + cfh.extra_len +
cfh.comment_len;
}
zip->members_end = offset;
return true;
}
As mentioned above, member iterators are simply offsets of central file headers, through which the member's data can be accessed:

typedef enum { ZIP_STORED = 0, ZIP_DEFLATED = 8 } method_t;
typedef struct zipmemb_t zipmemb_t;
struct zipmemb_t {
const uint8_t *name;
uint16_t name_len;
time_t mtime;
uint32_t comp_size;
const uint8_t *comp_data;
method_t method;
uint32_t uncomp_size;
uint32_t crc32;
const uint8_t *comment;
uint16_t comment_len;
bool is_dir;
zipiter_t next;
};
zipmemb_t zip_member(const zip_t *zip, zipiter_t it)
{
struct cfh cfh;
struct lfh lfh;
bool ok;
zipmemb_t m;
assert(it >= zip->members_begin && it < zip->members_end);
ok = read_cfh(&cfh, zip->src, zip->src_len, it);
assert(ok);
ok = read_lfh(&lfh, zip->src, zip->src_len, cfh.lfh_offset);
assert(ok);
m.name = cfh.name;
m.name_len = cfh.name_len;
m.mtime = dos2ctime(cfh.mod_date, cfh.mod_time);
m.comp_size = cfh.comp_size;
m.comp_data = lfh.extra + lfh.extra_len;
m.method = cfh.method;
m.uncomp_size = cfh.uncomp_size;
m.crc32 = cfh.crc32;
m.comment = cfh.comment;
m.comment_len = cfh.comment_len;
m.is_dir = (cfh.ext_attrs & EXT_ATTR_DIR) != 0;
m.next = it + CFH_BASE_SZ +
cfh.name_len + cfh.extra_len + cfh.comment_len;
assert(m.next <= zip->members_end);
return m;
}
Implementation of Zip Write
To write a Zip file into a memory buffer, we first need to know how much memory to allocate. Since we do not know how well the data will compress before we try, we compute an upper bound based on the uncompressed sizes of the members:
uint32_t zip_max_size(uint16_t num_memb, const char *const *filenames,
const uint32_t *file_sizes, const char *comment)
{
size_t comment_len, name_len;
uint64_t total;
uint16_t i;
comment_len = (comment == NULL ? 0 : strlen(comment));
if (comment_len > UINT16_MAX) {
return 0;
}
total = EOCDR_BASE_SZ + comment_len;
for (i = 0; i < num_memb; i++) {
assert(filenames[i] != NULL);
name_len = strlen(filenames[i]);
if (name_len > UINT16_MAX) {
return 0;
}
total += CFH_BASE_SZ + name_len;
total += LFH_BASE_SZ + name_len;
total += file_sizes[i];
}
if (total > UINT32_MAX) {
return 0;
}
return (uint32_t)total;
}
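To see where the bound comes from, consider a single member: the archive needs a 22-byte EOCDR, a 46-byte central file header and a 30-byte local file header (each followed by the name), plus at most the uncompressed data, since zip_write falls back to storing when compression does not help. A sketch of that arithmetic:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Base sizes from this article: EOCDR_BASE_SZ, CFH_BASE_SZ, LFH_BASE_SZ. */
enum { EOCDR_SZ = 22, CFH_SZ = 46, LFH_SZ = 30 };

/* Upper bound on the archive size for one member with no comments
   or extra fields, mirroring zip_max_size for num_memb == 1. */
static uint32_t max_size_one_member(const char *name, uint32_t file_size)
{
    uint32_t name_len = (uint32_t)strlen(name);

    return EOCDR_SZ + (CFH_SZ + name_len) + (LFH_SZ + name_len) + file_size;
}
```

A 100-byte file named "hello.txt" (a 9-character name) therefore needs at most 22 + 55 + 39 + 100 = 216 bytes of archive.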
This code writes the Zip file, compressing each member with Deflate, and storing it uncompressed if that does not make it smaller:
uint32_t zip_write(uint8_t *dst, uint16_t num_memb,
const char *const *filenames,
const uint8_t *const *file_data,
const uint32_t *file_sizes,
const time_t *mtimes,
const char *comment,
void (*callback)(const char *filename, uint32_t size,
uint32_t comp_size))
{
uint16_t i;
uint8_t *p;
struct eocdr eocdr;
struct cfh cfh;
struct lfh lfh;
bool ok;
uint16_t name_len;
uint8_t *data_dst;
size_t comp_sz;
uint32_t lfh_offset, cd_offset, eocdr_offset;
p = dst;
for (i = 0; i < num_memb; i++) {
assert(filenames[i] != NULL);
assert(strlen(filenames[i]) <= UINT16_MAX);
name_len = (uint16_t)strlen(filenames[i]);
data_dst = p + LFH_BASE_SZ + name_len;
if (hwdeflate(file_data[i], file_sizes[i], data_dst,
file_sizes[i], &comp_sz) &&
comp_sz < file_sizes[i]) {
lfh.method = ZIP_DEFLATED;
assert(comp_sz <= UINT32_MAX);
lfh.comp_size = (uint32_t)comp_sz;
} else {
memcpy(data_dst, file_data[i], file_sizes[i]);
lfh.method = ZIP_STORED;
lfh.comp_size = file_sizes[i];
}
if (callback != NULL) {
callback(filenames[i], file_sizes[i], lfh.comp_size);
}
lfh.extract_ver = (0 << 8) | 20;
lfh.gp_flag = (lfh.method == ZIP_DEFLATED ? (0x1 << 1) : 0x0);
ctime2dos(mtimes[i], &lfh.mod_date, &lfh.mod_time);
lfh.crc32 = crc32(file_data[i], file_sizes[i]);
lfh.uncomp_size = file_sizes[i];
lfh.name_len = name_len;
lfh.extra_len = 0;
lfh.name = (const uint8_t*)filenames[i];
p += write_lfh(p, &lfh);
p += lfh.comp_size;
}
assert(p - dst <= UINT32_MAX);
cd_offset = (uint32_t)(p - dst);
lfh_offset = 0;
for (i = 0; i < num_memb; i++) {
ok = read_lfh(&lfh, dst, SIZE_MAX, lfh_offset);
assert(ok);
cfh.made_by_ver = lfh.extract_ver;
cfh.extract_ver = lfh.extract_ver;
cfh.gp_flag = lfh.gp_flag;
cfh.method = lfh.method;
cfh.mod_time = lfh.mod_time;
cfh.mod_date = lfh.mod_date;
cfh.crc32 = lfh.crc32;
cfh.comp_size = lfh.comp_size;
cfh.uncomp_size = lfh.uncomp_size;
cfh.name_len = lfh.name_len;
cfh.extra_len = 0;
cfh.comment_len = 0;
cfh.disk_nbr_start = 0;
cfh.int_attrs = 0;
cfh.ext_attrs = EXT_ATTR_ARC;
cfh.lfh_offset = lfh_offset;
cfh.name = lfh.name;
p += write_cfh(p, &cfh);
lfh_offset += LFH_BASE_SZ + lfh.name_len + lfh.comp_size;
}
assert(p - dst <= UINT32_MAX);
eocdr_offset = (uint32_t)(p - dst);
eocdr.disk_nbr = 0;
eocdr.cd_start_disk = 0;
eocdr.disk_cd_entries = num_memb;
eocdr.cd_entries = num_memb;
eocdr.cd_size = eocdr_offset - cd_offset;
eocdr.cd_offset = cd_offset;
eocdr.comment_len = (uint16_t)(comment == NULL ? 0 : strlen(comment));
eocdr.comment = (const uint8_t*)comment;
p += write_eocdr(p, &eocdr);
assert(p - dst <= zip_max_size(num_memb, filenames, file_sizes,
comment));
return (uint32_t)(p - dst);
}
Hwzip
Now that we know how to read and write Zip files, and how to compress and decompress the data stored in them, let's write a simple Zip program that ties all of these pieces together. The code is available in hwzip.c. We use a macro for simple error handling and a couple of helper functions for checked memory allocation:
#define PERROR_IF(cond, msg) if (cond) { perror(msg); exit(1); }
static void *xmalloc(size_t size)
{
void *ptr = malloc(size);
PERROR_IF(ptr == NULL, "malloc");
return ptr;
}
static void *xrealloc(void *ptr, size_t size)
{
ptr = realloc(ptr, size);
PERROR_IF(ptr == NULL, "realloc");
return ptr;
}
Two more functions are used for reading and writing files:
static uint8_t *read_file(const char *filename, size_t *file_sz)
{
FILE *f;
uint8_t *buf;
size_t buf_cap;
f = fopen(filename, "rb");
PERROR_IF(f == NULL, "fopen");
buf_cap = 4096;
buf = xmalloc(buf_cap);
*file_sz = 0;
while (feof(f) == 0) {
if (buf_cap - *file_sz == 0) {
buf_cap *= 2;
buf = xrealloc(buf, buf_cap);
}
*file_sz += fread(&buf[*file_sz], 1, buf_cap - *file_sz, f);
PERROR_IF(ferror(f), "fread");
}
PERROR_IF(fclose(f) != 0, "fclose");
return buf;
}
static void write_file(const char *filename, const uint8_t *data, size_t n)
{
FILE *f;
f = fopen(filename, "wb");
PERROR_IF(f == NULL, "fopen");
PERROR_IF(fwrite(data, 1, n, f) != n, "fwrite");
PERROR_IF(fclose(f) != 0, "fclose");
}
Our Zip program can do three things: list the contents of a Zip file, extract one, and create one. Listing couldn't be simpler:
static void list_zip(const char *filename)
{
uint8_t *zip_data;
size_t zip_sz;
zip_t z;
zipiter_t it;
zipmemb_t m;
printf("Listing ZIP archive: %s\n\n", filename);
zip_data = read_file(filename, &zip_sz);
if (!zip_read(&z, zip_data, zip_sz)) {
printf("Failed to parse ZIP file!\n");
exit(1);
}
if (z.comment_len != 0) {
printf("%.*s\n\n", (int)z.comment_len, z.comment);
}
for (it = z.members_begin; it != z.members_end; it = m.next) {
m = zip_member(&z, it);
printf("%.*s\n", (int)m.name_len, m.name);
}
printf("\n");
free(zip_data);
}
Extraction is a little more involved. We use helper functions for null-terminating the file name (so it can be passed to fopen) and for decompression:
static char *terminate_str(const char *str, size_t n)
{
char *p = xmalloc(n + 1);
memcpy(p, str, n);
p[n] = '\0';
return p;
}
static uint8_t *inflate_member(const zipmemb_t *m)
{
uint8_t *p;
size_t src_used, dst_used;
assert(m->method == ZIP_DEFLATED);
p = xmalloc(m->uncomp_size);
if (hwinflate(m->comp_data, m->comp_size, &src_used, p, m->uncomp_size,
&dst_used) != HWINF_OK) {
free(p);
return NULL;
}
if (src_used != m->comp_size || dst_used != m->uncomp_size) {
free(p);
return NULL;
}
return p;
}
Our program skips any archive members whose names refer to directories. This avoids so-called path traversal attacks, where a malicious archive is used to write files outside the directory chosen by the user. See the Info-ZIP FAQ for details.
static void extract_zip(const char *filename)
{
uint8_t *zip_data;
size_t zip_sz;
zip_t z;
zipiter_t it;
zipmemb_t m;
char *tname;
uint8_t *inflated;
const uint8_t *uncomp_data;
printf("Extracting ZIP archive: %s\n\n", filename);
zip_data = read_file(filename, &zip_sz);
if (!zip_read(&z, zip_data, zip_sz)) {
printf("Failed to read ZIP file!\n");
exit(1);
}
if (z.comment_len != 0) {
printf("%.*s\n\n", (int)z.comment_len, z.comment);
}
for (it = z.members_begin; it != z.members_end; it = m.next) {
m = zip_member(&z, it);
if (m.is_dir) {
printf(" (Skipping dir: %.*s)\n",
(int)m.name_len, m.name);
continue;
}
if (memchr(m.name, '/', m.name_len) != NULL ||
memchr(m.name, '\\', m.name_len) != NULL) {
printf(" (Skipping file in dir: %.*s)\n",
(int)m.name_len, m.name);
continue;
}
assert(m.method == ZIP_STORED || m.method == ZIP_DEFLATED);
printf(" %s: %.*s",
m.method == ZIP_STORED ? "Extracting" : " Inflating",
(int)m.name_len, m.name);
fflush(stdout);
if (m.method == ZIP_STORED) {
assert(m.uncomp_size == m.comp_size);
inflated = NULL;
uncomp_data = m.comp_data;
} else {
inflated = inflate_member(&m);
if (inflated == NULL) {
printf("Error: inflation failed!\n");
exit(1);
}
uncomp_data = inflated;
}
if (crc32(uncomp_data, m.uncomp_size) != m.crc32) {
printf("Error: CRC-32 mismatch!\n");
exit(1);
}
tname = terminate_str((const char*)m.name, m.name_len);
write_file(tname, uncomp_data, m.uncomp_size);
printf("\n");
free(inflated);
free(tname);
}
printf("\n");
free(zip_data);
}
To create a Zip archive, we read the input files and feed them to zip_write
. Since the standard C library does not provide a way to get file modification times, we use the current time instead (fixing this is left as an exercise).
void zip_callback(const char *filename, uint32_t size, uint32_t comp_size)
{
bool deflated = comp_size < size;
printf(" %s: %s", deflated ? "Deflated" : " Stored", filename);
if (deflated) {
printf(" (%u%%)", 100 - 100 * comp_size / size);
}
printf("\n");
}
static void create_zip(const char *zip_filename, const char *comment,
uint16_t n, const char *const *filenames)
{
time_t mtime;
time_t *mtimes;
uint8_t **file_data;
uint32_t *file_sizes;
size_t file_size, zip_size;
uint8_t *zip_data;
uint16_t i;
printf("Creating ZIP archive: %s\n\n", zip_filename);
if (comment != NULL) {
printf("%s\n\n", comment);
}
mtime = time(NULL);
file_data = xmalloc(sizeof(*file_data) * n);
file_sizes = xmalloc(sizeof(*file_sizes) * n);
mtimes = xmalloc(sizeof(*mtimes) * n);
for (i = 0; i < n; i++) {
file_data[i] = read_file(filenames[i], &file_size);
if (file_size >= UINT32_MAX) {
printf("%s is too large!\n", filenames[i]);
exit(1);
}
file_sizes[i] = (uint32_t)file_size;
mtimes[i] = mtime;
}
zip_size = zip_max_size(n, filenames, file_sizes, comment);
if (zip_size == 0) {
printf("zip writing not possible\n");
exit(1);
}
zip_data = xmalloc(zip_size);
zip_size = zip_write(zip_data, n, filenames,
(const uint8_t *const *)file_data,
file_sizes, mtimes, comment, zip_callback);
write_file(zip_filename, zip_data, zip_size);
printf("\n");
free(zip_data);
for (i = 0; i < n; i++) {
free(file_data[i]);
}
free(mtimes);
free(file_sizes);
free(file_data);
}
Finally, main
checks the command-line arguments and decides what to do:
static void print_usage(const char *argv0)
{
printf("Usage:\n\n");
printf(" %s list <zipfile>\n", argv0);
printf(" %s extract <zipfile>\n", argv0);
printf(" %s create <zipfile> [-c <comment>] <files...>\n", argv0);
printf("\n");
}
int main(int argc, char **argv) {
printf("\n");
printf("HWZIP " VERSION " -- A very simple ZIP program ");
printf("from https://www.hanshq.net/zip.html\n");
printf("\n");
if (argc == 3 && strcmp(argv[1], "list") == 0) {
list_zip(argv[2]);
} else if (argc == 3 && strcmp(argv[1], "extract") == 0) {
extract_zip(argv[2]);
} else if (argc >= 3 && strcmp(argv[1], "create") == 0) {
if (argc >= 5 && strcmp(argv[3], "-c") == 0) {
create_zip(argv[2], argv[4], (uint16_t)(argc - 5),
(const char *const *)&argv[5]);
} else {
create_zip(argv[2], NULL, (uint16_t)(argc - 3),
(const char *const *)&argv[3]);
}
} else {
print_usage(argv[0]);
return 1;
}
return 0;
}
Build Instructions
A complete set of source files is available at hwzip-1.0.zip. To compile hwzip on Linux or Mac:
$ clang generate_tables.c && ./a.out > tables.c
$ clang -O3 -DNDEBUG -march=native -o hwzip crc32.c deflate.c huffman.c \
hwzip.c lz77.c tables.c zip.c
On Windows, in a Visual Studio developer command prompt (if you don't have Visual Studio, download the build tools):
cl /TC generate_tables.c && generate_tables > tables.c
cl /O2 /DNDEBUG /MT /Fehwzip.exe /TC crc32.c deflate.c huffman.c hwzip.c
lz77.c tables.c zip.c /link setargv.obj
(setargv.obj is linked in to expand wildcard command-line arguments.)
Conclusion
It is amazing how quickly, and yet how slowly, technology develops. The Zip format was created 30 years ago based on technology from the fifties and seventies. While much has changed since then, Zip files have essentially remained the same, and today they are more widespread than ever. I think it is useful to have a good understanding of how they work.
Exercises
- Make HWZip store each file's modification time rather than the time the archive was created. Use stat(2) on Linux or Mac and GetFileTime on Windows. Or add a command-line flag that lets the user set a specific modification time.
- Add support for gzip files. A gzip file stores a single Deflate-compressed member; the format is described in RFC 1952.
- Instead of having HWZip read the whole Zip file into memory with
read_file
, try memory-mapping it: use mmap(2) on Linux or Mac and CreateFileMapping on Windows.
- Add support for large archives to HWZip using the Zip64 extensions. See appnote.txt.