Pictures as boxes: what's inside? A talk at Yandex

Images and videos are "black boxes" with lots of interesting and mysterious things inside. But some formats let you look inside, change everything there, and see what happens.

Polina Gurtovaya from Evil Martians spoke at our Frontend conference in February. Through experiments, Polina figured out how to turn plain pictures into "effective images" backed by metrics. The tools that can do this for us are covered closer to the end of the talk. The result is a great excursion into the internals and working principles of various formats: from PNG and JPEG to AV1 and more exotic ones.


- Hello everyone. My name is Polina, I'm a frontend developer at Evil Martians.

You may know the Martians from our many open source projects. I'll tell you about one of them a little later. And I should probably add that we also develop products, not just churn out open source.



The materials for this talk are available via a link to a repository on GitHub.



Let's talk a little about optimizations. The problem with optimizations is that they turn out well when we understand what we are doing, and badly when we don't. When it comes to image optimization, unfortunately, things are really not great. We may not optimize images at all, and then two-megabyte monsters end up in production, and it's all very sad.

And if we do optimize, what exactly are we doing? We think: here is a picture, some mysterious black box, and the optimizer program does something to it, some kind of black shamanism. The quality of the optimization we get this way is rather dubious.



Let's look at an example. I have a cat in PNG format, and I decide it needs optimizing. What do I do? I create a WebP version and carefully put both images into a <picture> tag. Do you think I did well here or not? Why so few hands? I really did do well!

I did everything right, yet the WebP version came out two kilobytes larger than the original. That's not quite what I wanted.




Another optimization, attempt number two. I have a small container on the page and a big, big cat. I want to fit the big cat into the small container. What do I do? I resize it, because it's silly to push bytes over the network when my container is small. Of course, I take the device pixel ratio into account. Am I doing great here or not? I am! And look what I got.

I'm using the libvips library. It's very cool and popular, and out of my huge but light cat it made a small and very heavy cat. The cat grew 2.5 times in bytes while being resized down in pixels. Cool, right?



So, to keep this from happening to us, to understand how to optimize our images for our particular task, and in general to at least understand what's going on, let's look into the box and figure out what's inside.



Let's start with an interesting format: PNG. First, a little PNG is hiding somewhere around every site, so we have to understand them. Second, PNG is a lossless compression format. That means a guaranteed pixel-perfect match with the original, but, alas, nature limits us: we cannot compress below a certain bound.



A PNG is packed into a container, like any image format. One of the first things we have to tell the program reading it is what's inside. If you assume that decoders identify pictures by file extension, that's not the case.

A PNG declares that it is a PNG with the first eight bytes of its container; they literally say "PNG". Next, and again this is characteristic of any container, comes some layout of chunks. That is, the information is packed into chunks arranged in a particular way, and the container defines how. In PNG it looks like this: four bytes hold the chunk length and four bytes hold the chunk type. Which types there are, we'll discuss a little later.

If a chunk has a nonzero length, it carries a payload. There is also such a thing as a checksum: you use it to check whether anything got corrupted. Then come the next chunks.



Parsing a PNG file, and almost any other, is pretty easy. Take FileReader, a browser API. We read the file with FileReader, and as soon as it's read, we cut the file into chunks. I won't show the code of the split-into-chunks function here, but you can guess it's an intricate combination of if and for.
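
For the curious, here is a minimal sketch of such a parser (my reconstruction, not code from the talk), assuming the file has already been read into an ArrayBuffer via FileReader's readAsArrayBuffer:

```js
// Cut a PNG file into chunks: 8-byte signature, then a sequence of
// [4-byte length][4-byte type][payload][4-byte CRC] records.
function splitIntoChunks(buffer) {
  const view = new DataView(buffer);
  const chunks = [];
  let offset = 8; // skip the 8-byte PNG signature
  while (offset < view.byteLength) {
    const length = view.getUint32(offset); // payload length
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7)
    ); // chunk type, e.g. "IHDR" or "IDAT"
    const data = buffer.slice(offset + 8, offset + 8 + length);
    chunks.push({ type, length, data });
    offset += 8 + length + 4; // header + payload + CRC
  }
  return chunks;
}
```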




Okay, we've cut it up; let's see what we get. There are several types of chunks, and they are very characteristic of almost any format. The first is called IHDR. Then there are a number of chunks called IDAT. These names may seem a little strange to you, but we'll figure out what they are in a moment. And when it's all over, we see the end chunk, IEND.



Let's take a closer look inside the chunks. IHDR is a meta chunk, and almost any picture format has one. It may be called differently and arranged differently, but it's most likely there. Without it, your decoder, the thing that displays your PNGs and non-PNGs, cannot show you anything. What's in this chunk? Again, the contents are typical for most formats: the height and width. They are baked right into the file that comes to you. Then come flags specific to PNG: bitDepth, colorType and interlacing.
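
Continuing the sketch above (again my reconstruction, not code from the talk), these fields can be read straight out of the IHDR payload:

```js
// IHDR layout: width (4 bytes), height (4 bytes), then one byte each
// for bit depth, color type, compression, filter and interlace method.
function parseIHDR(chunk) {
  const view = new DataView(chunk.data);
  return {
    width: view.getUint32(0),
    height: view.getUint32(4),
    bitDepth: view.getUint8(8),   // bits per sample, typically 8
    colorType: view.getUint8(9),  // e.g. 6 = truecolor with alpha, 3 = palette
    interlace: view.getUint8(12), // 0 = none, 1 = Adam7
  };
}
```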



Before we talk about what these flags mean and why they matter so much to us, let's see how pixels are stored in a PNG. Pixels live inside chunks called IDAT. In the simple case, pixels are just a bunch of numbers packed into a chunk, and the chunk is compressed with the Deflate compression algorithm. Who has used Deflate? Okay, when was the last time you zipped something? Did you know that Deflate is what gzip uses? So I think many of you have.
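
As an aside, modern browsers can even inflate that data for you; a sketch (my addition, assuming a browser with DecompressionStream support):

```js
// The concatenated payloads of all IDAT chunks form one zlib stream
// of filtered pixel rows; "deflate" here means the zlib format.
async function inflateIDAT(chunks) {
  const idat = chunks
    .filter((c) => c.type === "IDAT")
    .map((c) => new Uint8Array(c.data));
  const stream = new Blob(idat)
    .stream()
    .pipeThrough(new DecompressionStream("deflate"));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}
```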

But PNG has another interesting thing, one used in a huge number of formats, probably in all of them. This gizmo is called predictive coding. The point is that our images are not random pixels. Whatever is drawn in the picture is interconnected: there are dark areas, bright areas, and so on.

We try to exploit this fact: instead of storing the pixel values in those blue cells, we try to predict the pixels from the previous ones. In PNG these predictions are very simple, and the chosen one is packed into the very first byte before each row of pixels. One possible prediction is: predict nothing and store everything as is. Another one says: let's store only the difference between the current pixel and the previous one.

If a row is all one color, you get all zeros, everything compresses perfectly, and that's very cool.
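
That difference-from-the-left prediction is PNG's "Sub" filter; a minimal sketch (my reconstruction, not code from the talk):

```js
// Replace each byte with its difference from the corresponding byte
// one pixel to the left; runs of identical pixels become zeros.
function subFilter(row, bytesPerPixel) {
  const filtered = new Uint8Array(row.length);
  for (let i = 0; i < row.length; i++) {
    const left = i >= bytesPerPixel ? row[i - bytesPerPixel] : 0;
    filtered[i] = (row[i] - left) & 0xff; // difference modulo 256
  }
  return filtered;
}
```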



Now let's talk about what a pixel actually is. In a PNG, a pixel is represented by several numbers. By manipulating how many numbers there are, you can compress your PNG very, very noticeably, up to three times.

What are the options? The first is truecolor with alpha. We have three channels, three colors, one number per color, plus a channel responsible for transparency.

The size of each of these numbers in bits is bitDepth, the same flag we saw in the IHDR chunk. The smaller your bitDepth, the smaller the file, but the fewer colors you can represent. A typical value is 8. How many colors is that? Eight bits per channel over three channels is 24 bits, which gives around 16.7 million colors.

Okay, the first optimization you can do is throw the alpha channel out of your PNG. That's a different colorType.

You can optimize even further and use just one number instead of four. The problem is that your PNG then has to be black and white.

If you still want just one number but want to keep the colors, that can be done too. What happens here? You take all the colors in your PNG and move them out into a separate chunk called a palette. Then the sample that represents a pixel inside the IDAT chunk simply stores an index into that palette. If you have a screenshot without an elaborate background, or some kind of drawing, this works just perfectly. It squeezes PNGs down, wow!

Another important thing to mention is interlacing. What is interlacing? It's when you deliver your PNG gradually. You have not one image but several, and each one is called a scan.



Inside the PNG, the pixels are reordered so that each scan pulls pixels from particular places in the image: one scan comes from certain positions, the next from others, and so on. It seems like a cool technique, much like progressive JPEG.

But this is what it looks like. I'm not sure you want your users to see this, although it may be useful for your task.

The second and more serious problem with interlaced PNG is that as soon as you interlace a PNG, it gets bigger. And not by a little: a six-kilobyte PNG will grow by a couple of kilobytes if you turn interlacing on. So think carefully about whether you want it or not.



We've only talked about PNG, but from this we can already draw important and useful conclusions. First conclusion: the size of your file, believe it or not, depends on what is drawn in it. A black square compresses better than a cat; I won't give any recommendations here. Second, and more important: the size of your file depends heavily on the encoder and on the parameters you pass to it.

If you want to see how badly encoders can behave, use the browser's. How is that done? Take a PNG file, draw it on a canvas, then save the canvas out as a PNG and compare the result with the original. Roughly, Chrome will grow your file 2.5 times, Firefox 1.6 times.
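
A minimal sketch of that experiment (my reconstruction, not code from the talk):

```js
// Re-encode an already-loaded <img> through a canvas and report the
// size of the browser-encoded PNG for comparison with the original.
async function reencodedSize(img) {
  const canvas = document.createElement("canvas");
  canvas.width = img.naturalWidth;
  canvas.height = img.naturalHeight;
  canvas.getContext("2d").drawImage(img, 0, 0);
  const blob = await new Promise((resolve) =>
    canvas.toBlob(resolve, "image/png")
  );
  return blob.size; // compare with the original file's byte size
}
```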

And it always depends on the format too; PNG is not the only choice. Let's understand why everything depends on the format and what other interesting options we have.



For that, we'll talk about the technology of the ancients: JPEG. You can't overstate the importance of JPEG, of course. JPEGs are everywhere. They're cool and good, and what's more, cats in JPEG are a fairly common story. But JPEG is a rather complicated thing, and it's complicated because JPEG is lossy compression. Moreover, JPEG is always lossy compression: JPEG at 100% quality still compresses with loss.

How do we get lossy compression? Very simply. We take a source, throw some data out of it, and then compress losslessly. That is, just one extra step.



Let's look at how we introduce losses into our JPEGs. So, you have a cat sized 32 by 32. To take the first lossy step, we need to change our channels. We usually talk about images in terms of RGB, but we perceive color in a rather intricate way. Our brain is generally a big problem, although here it helps us a lot to compress JPEG.

We perceive brightness very well. If you look closely, you'll notice that you distinguish details better in a black-and-white image. So we put this black-and-white image into a separate channel. It's called Y; strictly speaking, Y prime. We don't do anything with it, we just leave it as is.

There are two more channels responsible for color: Cb and Cr. With these channels we can already have a little fun. We apply a cool procedure to them called downsampling: we simply reduce the channel's resolution. For JPEG it's typical to halve it. So you effectively get three pictures: one original and two at half size. Hooray!
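
The split into Y, Cb and Cr is just a linear recombination of RGB; here is a sketch of the standard JFIF conversion for one pixel (my addition, not from the talk):

```js
// Convert one RGB pixel (channels 0..255) into YCbCr:
// Y carries brightness, Cb and Cr carry color differences.
function rgbToYCbCr(r, g, b) {
  const y  =  0.299 * r + 0.587 * g + 0.114 * b;
  const cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128;
  const cr =  0.5 * r - 0.4187 * g - 0.0813 * b + 128;
  return [y, cb, cr];
}
```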

What do we do next? We don't compress a JPEG as one whole file. We break it into blocks and then compress the blocks. Blocks in JPEG are 8 by 8; let's see what happens to them. We'll look only at the Y channel; with Cb and Cr everything is the same.



So, a block is not a picture but numbers. We need to create losses in our JPEG. The block is 8 by 8, 64 pixels; which one do we throw out? The one on the left, the one on the right, the one in the middle? Unclear. But there is cool math that lets us solve this problem.

This math is called, now please don't get nervous if anyone remembers a dreadful university past, the discrete cosine transform. With the help of this discrete cosine transform, you can convert the numbers in your block so that important and unimportant ones emerge among them.

The important part: after the transform, the important numbers end up in the upper-left part of the block, and the unimportant ones in the lower right.

Next you need to actually create the loss in your JPEG. That's also very easy to do. The trick is called quantization. Sorry if you feel like sleeping right now, but this is important, believe me. Quantization works quite simply. You take your block and a specially designed table, determined by your encoder program. You divide the numbers in your block by this table, element by element, using integer division. What do you get as a result?

Since the numbers in the lower-right part of the table are large, that part of the block becomes all zeros.
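
In code the whole trick is essentially one line; a minimal sketch (my reconstruction, not from the talk):

```js
// Quantize an 8x8 block of DCT coefficients with a quantization table,
// element by element. Both arguments are arrays of 64 numbers.
function quantize(coeffs, table) {
  // Large table values in the lower right turn small high-frequency
  // coefficients into zeros, which later compress extremely well.
  return coeffs.map((c, i) => Math.round(c / table[i]));
}
```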



And now your JPEG, your block, compresses perfectly. You get a small set of numbers, which you walk in an intricate zigzag; the zeros all clump together, and, hooray, our block is ready for compression. Then we just need to compress it with a lossless algorithm. JPEG uses Huffman coding, whatever that is.



How is this packed into a container? JPEG containers look a little odd; frankly, they scare me. You see the first two bytes, and they say this is most likely a JPEG. But you can't be sure yet.

Next you have to look for two meta chunks. Why two? Because JPEG is a very large family of different standards. What we call JPEG is, by the standard, called JFIF, a special extension of the JPEG standard. I won't go any further; in short, there are two meta chunks, just trust me. These meta chunks contain the width and height of your file and the JPEG version. Imagine, JPEG has versions! And also: is this a progressive JPEG? That's an important flag. It tells you how your blocks will be laid out further on.

If the JPEG is not progressive, what do you need to decode your blocks? The JPEG quality, that very table: the table you divide your blocks by is the quality. But a JPEG has two qualities. The first is responsible for the Y channel, the second for Cb and Cr, the ones that determine color. And since we put the quality tables into the file and squeezed everything with a lossless compression algorithm, we also need a special dictionary, the Huffman tables, to unpack it all.

Then come your blocks, and then your JPEG ends.



Okay, the progressive story. Everything is almost the same. At the very beginning you have a meta chunk. Then comes your quality in the form of 64 numbers plus 64 numbers. And then the very same blocks, only with the numbers distributed a little differently: first one part of the blocks, then another part, then another, and so on. As it receives these parts, the browser draws an approximation of your JPEG, because these numbers really are successive approximations of your file.



We're done with JPEG; you can exhale, all is well. Let's talk about an interesting thing called JPEG 2000. Does anyone here use JPEG 2000 in production? Okay, who has at least heard of it? And who of you has read "use modern formats" in Lighthouse?

JPEG 2000 is a cool, interesting format which, first, is more efficient than JPEG. Second, believe it or not, in some cases it's more efficient than WebP, which we'll talk about later.

It supports transparency and it can compress losslessly. Simply the perfect format. But unfortunately, it only works in Safari.

It's worth mentioning that JPEG 2000 is designed in a very intricate way and runs on cool math called the wavelet transform. If you're suddenly interested, google it, and we'll move on.



Now we suddenly need to talk about video. This whole talk is about images and image optimization, but video is very important here, and you'll see why in a moment. When we think of video, the first word that comes to mind is "codec". Video needs to be encoded somehow, and to show it, we need to decode it. If we decode a video stream, what do we get?

First of all, we get a set of frames. But don't think of these frames as the pictures in a GIF. That's all wrong. What the frames are depends heavily on the codec, but in the general case you can assume there are keyframes. From a keyframe you can get a cat out, meaning whatever picture that keyframe holds. And there are dependent frames. You can't get a cat out of a dependent frame, because a dependent frame stores not so much the image itself, if at all, as information about how the blocks of the previous frame or frames moved in this one. So you can't get a picture out of a dependent frame until you've decoded a bit more.

Everything we're going to talk about now is keyframes and intra-frame compression: how a picture is compressed inside a keyframe.

Let's look at an abstract codec in a vacuum and compare it with JPEG. For now it may seem pointless; everything will become clearer, trust me.



Once again we repeat the same things we did with JPEG. You take a picture, split it into channels, and downsample the channels. Same story here. Then you break the picture into blocks. But here the differences begin. First, the size of the blocks you break into depends on your codec, and these blocks can be very large: 8 by 8 for JPEG, but for video codecs it can be, for example, 128 by 128.

Next: if your picture has some very small details you want to give attention to, you can subdivide the blocks further, down to about 4 by 4. How the blocks are split, the partitioning algorithm, depends on the codec.

And lastly, the maximum block size is, again, specific to your codec. (To be precise with terminology: an encoder is part of a codec.) So far we're still similar to JPEG.



What is not like JPEG is predictive coding. We talked about it in the PNG part. Intra-frame video compression is so cool and efficient precisely because of it. What happens here?

We try to predict the pixels of each block from the previous ones. That is, we don't store pixels in raw form; we predict them. There are many prediction options, and within one codec we can use different prediction variants. Moreover, some of the fancier codecs have as many as 35 of these options, for example. How can this be done? Let's look at an example.

Here's your block. You say: I want to predict the pixels in it. You look to the left, you look up, and you remember what's to the left and above. Then you take all the pixel values you found, average them, fill the block with the result, and say: I made a prediction. If you guessed right, and by the way, in the little picture with the blue arrows you did guess right, then great, you don't need to do anything else. But if you didn't guess, you need to store the difference between what's actually there and what you predicted. That difference compresses much, much better than raw pixel values.
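
This averaging strategy is usually called DC prediction; a minimal sketch (my reconstruction, not code from the talk):

```js
// Predict a size x size block as the average of its already-decoded
// neighbors: the column to the left and the row above.
function dcPredict(left, above, size) {
  const neighbors = [...left, ...above];
  const avg = Math.round(
    neighbors.reduce((sum, v) => sum + v, 0) / neighbors.length
  );
  return new Array(size * size).fill(avg);
}

// Instead of raw pixels we then store pixel minus prediction,
// which is mostly zeros whenever the guess was good.
const residual = (pixels, prediction) =>
  pixels.map((p, i) => p - prediction[i]);
```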



Then everything is exactly the same as in JPEG: you transform the resulting block. The peculiarity of the various codecs is that you may use not the DCT (discrete cosine transform) but something else. What to use depends on the codec.



Then come the same tables again, but unlike JPEG you can use more than one table for your whole file: you can use several different tables for different blocks. Imagine you have a person against the sky. Since the sky is plain blue, you probably don't need high quality there: use one quality, one table, for the sky. And for the person, who has texture and clothing, use a different quality. It turns out cool and efficient.



The last thing is what JPEG doesn't have and what JPEG sorely lacks: the use of filters. When we compress hard, we get nasty artifacts. If you've ever compressed a JPEG down to low quality, you've seen JPEGs simply fall apart into nightmarish, terrible blocks. To get rid of these artifacts, video codecs use a special thing: they apply filters, and the edges of these blocks get smoothed out. The ancients had a technology for doing the same with JPEG: take your JPEG, compress it very, very hard, then blur it a little so that nothing is noticeable. This is roughly the same thing, only done at the codec level. Great.



Naturally, once we've tried all this and it's done, we need to compress the resulting blocks losslessly. We compress them, well done. The compression algorithm is similar to JPEG's, but still different. The thing to understand here is that lossless compression is bounded by a natural limit. We really want to get close to it, and the best way to get closer is to use an algorithm called arithmetic coding. There are also all sorts of variations; this again depends on the encoder, but let's just agree there's lossless compression and it's fine.



I've long wanted to call these abstract codecs in a vacuum by their proper names. A short historical excursion: what has happened over 20 years? I'm only talking about the video codecs that are at least somehow supported on the web. H.264 is the codec supported by everything and everyone, the default solution for all video. After a while, a few years later, the VP8 video codec appears.

Here wild wars begin, holy wars about which of these codecs is better. I googled for a very long time: there is no answer. Long scientific articles have been written about this, and if I now say they're about the same, a tomato will fly at me. But okay, on average they're the same. Then why do we need the second one?

The second one is needed because it's free. If you use H.264, in some circumstances you have to bring MPEG money. With VP8 you don't have to bring anyone money. That's good. Now: a VP8 keyframe is WebP. Indeed, why invent a new image format? We take the keyframe we tried so hard to compress, call the whole thing a new image format, and voila!

What happens next? A few years later, two more cool video codecs appear almost simultaneously, from MPEG and from Google: VP9 from Google, H.265 from MPEG. Alongside H.265 comes a new image standard called HEIF. Browsers don't support it, not a single one, but your Apple devices do. The HEIF standard is insanely interesting because it's an abstraction of this very idea: you can cram a keyframe from almost any codec into a HEIF container. So VP8 is not exactly a modern format, but HEIF is.

What happens next? Right now a very large organization that includes Mozilla and Google is building a video codec called AV1. The organization is called the Alliance for Open Media. The quality of AV1 video is far beyond everything that came before. It's free, it's royalty free, it's very cool. We have this nice HEIF container; all that's left is to shove an AV1 keyframe into it. And it's done. The new format that pushes an AV1 keyframe into a HEIF container is called AVIF. That's what awaits us in the future. Maybe someday we'll use it natively.

But we can use it right now. We just put a one-frame video on the page and say: voila, here's your picture.
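
A minimal sketch of that trick (my addition; the file name is a placeholder):

```js
// Use a muted, inline, single-frame video element in place of an image.
const video = document.createElement("video");
video.muted = true;
video.playsInline = true;
video.src = "cat-one-frame.mp4"; // a video containing a single AV1 keyframe
document.body.append(video);
```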



How is this done in WebP? WebP, as I said, is a VP8 keyframe packaged in a container called RIFF. The RIFF container has a header, and, believe it or not, it says this is WebP. Who would have doubted it. A PNG says it's PNG, and WebP says it's WebP.

But WebP has an interesting feature: a VP8 keyframe may lie inside, and that's what people usually call WebP. But it may also not be there. The thing is, WebP also supports lossless compression, and WebP lossless is a completely different format that has nothing to do with VP8, lossy compression and so on. So when someone tells you WebP is more efficient than something else, the first question to ask is: which WebP? Because if we're talking about lossless compression, there's a natural limit we can strive toward. Those claims of "60% more efficient than...", most likely refer not to lossless WebP but to lossy.

Okay, enough theory, we're sick of it; let's finally look at something.




Let's start with this. We take a photo shot on a professional camera and cut a 1000 by 1000 pixel piece out of it. This, incidentally, looks very cool on the projector: we start examining the small details. At the same time, we compress the piece so that we get exactly 15 kilobytes. See what happens. JPEG fell apart into blocks immediately. Indeed, low quality; we expected this. Here's what WebP looks like. It also fell into blocks, but the blocks are not as clearly visible. When you use the WebP encoder and drive it by hand, you can control the strength of the filter WebP applies, and if you crank that filter up harder, you can get rid of many of the block artifacts. So, purely theoretically, these blocks can also be removed.








And here is AV1. Let's just admire it silently. Look how cool it is. AV1 is supported in Firefox and in Chrome, so you can use AV1 video instead of a picture if you suddenly want to. Here's a spoiler; maybe I shouldn't have added it. A situation where PNG defeats WebP. Yes, PNG in this case is more efficient than WebP, and that's because I used lossy WebP. What did I do to the PNG? I switched it to indexed color mode, that is, I cut the palette down to, I believe, 16 colors. That's quite effective for a black-and-white picture. It turned out well; it shrank a lot. Lossy WebP of comparable quality came out larger. Lossless WebP, however, is more efficient than PNG, as expected. There we got a win.











To summarize: a well-shrunk PNG can defeat lossy compression formats, but it doesn't defeat lossless WebP. Sad, sad. Maybe a question is tormenting you: why do all this, don't we know SVG exists? I do know, but at some sizes PNG is more efficient. This picture comes out smaller than the SVG at sizes around 200 by 200; after that, SVG of course wins. Now let's look at Mike. This is Mike. His dimensions are 3000 by 3000 pixels. JPEG vs WebP. It would seem obvious that JPEG would win here. But in this case I got about a six percent win at roughly the same visual quality. That's a property of this photo and of how I prepared it. You can ask me afterwards how I did it.













Still, everything depends very much on the encoder parameters. If you try really hard and tune the encoder parameters in a particular way, JPEG starts to defeat WebP in size at the same visual quality. I'd like to conclude that cats compress better in JPEG, but no: this is just an example of how you can tune things whichever way you like. Here is very low quality. JPEG falls apart into blocks; it's especially visible on the projector: the dog's nose turned blue and became square. WebP doesn't suffer as much. Everything seems cool and good, but the thing is that at very, very low quality WebP yields a file about two or maybe three times the size of the JPEG. So here, too, you need to think about what quality you want.









This is the most honest comparison; this is how you should compare, because H.264 and WebP are similar. Who do you think won here? H.264. Though to be honest, the experiment wasn't entirely clean: by rights, the video frame should be prepared roughly the same way in both WebP and H.264. But with AV1 everything is absolutely clear: a thirty percent win at the same visual quality. Hooray! It's very important to understand what kind of picture you're serving and how a particular format responds to the content of that picture. Here the dog in WebP weighs 79 kilobytes at roughly 75% quality versus 56 kilobytes in JPEG. Why does this happen?











Because no video codec, no format at all, can compress noise properly. If your picture has a lot of sharp distortions, specks and such, you will most likely have compression problems. If you can take a different picture or remove that noise, remove it.

So, pictures are a complicated thing. Can they slow down your interface? An important and good question.



The answer: most likely not. Why? Because image decoding happens on a separate thread. But there is an exception: if you draw something on a canvas, keep in mind that image decoding will happen on the main thread, and the buttons may stop responding at that moment.
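
One way to sidestep that (my addition, assuming a browser with createImageBitmap support) is to decode the image asynchronously before touching the canvas:

```js
// createImageBitmap decodes the image off the critical path, so
// drawImage receives an already-decoded bitmap instead of forcing
// a decode on the main thread.
async function drawWithoutJank(canvas, blob) {
  const bitmap = await createImageBitmap(blob);
  canvas.getContext("2d").drawImage(bitmap, 0, 0);
}
```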



If you really want to see this for yourself, open Chrome DevTools, look for the corresponding rasterizer threads and the Image Decode event, and you'll find it.



If you're very, very curious, you can go to the tracing tab and see in detail what happens while an image is decoded.

Optimization tools


Now the most important thing: optimization tools. We now roughly know what we want; it remains to figure out how to do it.



The most important image optimization tool, strange as it sounds, is the designer. Only this wonderful person knows what problem you want the image to solve. We don't add images to pages in order to optimize them nicely; we add them to impress users. To keep the balance between the degree of optimization and the user experience, work with your designer; it helps a lot.


Link from the slide

The second tool is our Martian open source, which I promised to talk about. It's called imgproxy, and it solves basically all of our problems. On my projects I use only imgproxy; it can do almost everything I want.



How does it work? You have a wish regarding a picture: you want it at a certain size with certain optimizations. And somewhere far away you have a source picture at whatever resolution, maybe on a local machine, maybe somewhere on the user's side, anywhere at all. You just build a special URL and ask imgproxy to resize your picture. It's a service; it can live in the cloud or somewhere else. That is, you had a huge cat, you send a special URL to imgproxy, and it does everything you want on the fly.



If that doesn't sound clear, let's see what a request to imgproxy looks like. First, you specify where imgproxy is hosted. Second, if you don't want strangers aggressively exploiting your service, it's a good idea to sign the URL. You can skip that; it's just an extra layer of protection.

Then, if you want a resize, you pass the resize parameters right in the URL. If you want optimizations, same thing. All that's left is to pass the address of your original picture.
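
Put together, such a URL might look like this (a hypothetical example: the host, sizes and source URL are placeholders, and "insecure" stands in for a real signature; check the imgproxy docs for the exact option syntax):

```js
const url =
  "https://imgproxy.example.com" +
  "/insecure" +                           // or a real URL signature
  "/resize:fit:300:200" +                 // processing options
  "/plain/https://example.com/cat.png" +  // source image address
  "@webp";                                // requested output format
```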



If you want manual optimization, there's a huge set of tools. I won't describe them all now; everything is in the talk materials, which I'll share with you.



And here's the coolest and most useful part: all these images are not that complicated. I think I've managed to convey that. If you're interested, take your favorite programming language, probably JavaScript, though that's far from a given, and start taking it all apart.

If you want to do this in the browser, go ahead. You'll probably need a binding, most likely written in C++ or C. But what stops you from compiling it all to WebAssembly? There's a cool application called Squoosh that does exactly that. You can do it too; try it, it'll be cool. I really like it.

Thank you all for your attention. The materials for the talk are available at the link.
