Skeleton Animation Compression Guide


This article is a brief overview of how to implement a simple animation compression scheme, along with some related concepts. I am by no means an expert on the subject, but there is very little information available on it, and what exists is quite fragmented. If you want to read more in-depth articles on the topic, I recommend the following links:


Before we get started, it's worth giving a brief introduction to skeletal animation and some of its basic concepts.

Basics of Animation and Compression


Skeletal animation is a fairly simple topic, if you set aside skinning. We have a skeleton containing the transforms of a character's bones, and these bone transforms are stored hierarchically: each bone's transform is stored as a delta from its parent's transform. The terminology here is confusing, because in game engines "local" often means model/character space and "global" means world space, while in animation terminology "local" means the space of a bone's parent and "global" means either character space or world space, depending on whether the root bone moves; but let's not worry too much about that. The important thing is that bone transforms are stored locally, relative to their parents. This has many advantages, especially for blending: if two poses were blended in global space, the bone positions would be linearly interpolated, causing bones to stretch and shrink and the character to deform. With deltas, the blend runs from one difference to another, so if the delta transform for a bone is the same in both poses, the bone's length stays constant. I think the easiest (though not entirely accurate) way to picture it is this: blending deltas moves bone positions along "spherical" paths, while blending global transforms moves them along linear paths.

Skeletal animation is just an ordered list of keyframes at a (usually) constant frame rate, where each keyframe is a skeleton pose. If we want a pose between two keyframes, we sample both and blend between them, using the fractional time between them as the blend weight. The image below shows an animation authored at 30fps. The animation is 5 frames long, and we request the pose 0.052 s after the start. We therefore sample the pose at frame 1 and the pose at frame 2, and blend between them with a weight of approximately 56%.


An example of a 5-frame animation and a pose request at an intermediate frame time
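To make the arithmetic concrete, here is a minimal sketch of mapping a request time to the two keyframes to blend and the blend weight (the function name and values are illustrative, not from any particular engine):

```python
def sampling_params(time_s, frame_rate):
    # Position measured in frames, e.g. 0.052 s * 30 fps = 1.56
    frame_pos = time_s * frame_rate
    frame0 = int(frame_pos)      # keyframe just before the requested time
    frame1 = frame0 + 1          # keyframe just after it
    weight = frame_pos - frame0  # fraction of the way from frame0 to frame1
    return frame0, frame1, weight
```

For a 30fps animation, a request at 0.052 s lands between frames 1 and 2, about 56% of the way toward frame 2.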

Given the information above, and assuming memory were no concern, storing the poses sequentially would be the ideal way to store an animation, as shown below:


Simple animation data storage

Why is this ideal? Sampling any keyframe comes down to a single memcpy. Sampling an intermediate pose requires two memcpys and one blend operation. From a cache perspective, we memcpy two consecutive blocks of data, so after copying the first frame, the second frame is likely already in cache. You might say: wait, when blending we have to blend all the bones; what if most of them don't change between frames? Wouldn't it be better to store bones as tracks and blend only the transforms that changed? Well, doing that potentially introduces more cache misses when reading the individual tracks, and then you need to keep track of which transforms to blend, and so on... Blending may sound like a lot of work, but in essence it is one instruction applied to two blocks of memory that are already in cache. Moreover, blending code is relatively simple, often just a set of branchless SIMD instructions, and a modern processor will chew through it in no time.
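The blend step itself can be sketched as a single pass over two flat pose buffers (a simplified illustration; a real engine would nlerp/slerp the quaternion components and typically use SIMD):

```python
def blend_poses(pose_a, pose_b, weight):
    # Linear interpolation of two poses stored as flat float arrays.
    # Correct for translation and scale; rotation components would
    # actually need a normalized quaternion lerp (nlerp) or slerp.
    return [a + (b - a) * weight for a, b in zip(pose_a, pose_b)]
```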

The problem with this approach is that it uses an extremely large amount of memory, especially since in games the following is true for about 95% of the data:

  • Bones have a constant length
    • Characters in most games do not stretch their bones, so within a single animation the translation tracks are constant.
  • We usually don't scale bones
    • Scale is rarely used in game animation. It is used quite heavily in film and VFX, but very little in games; even when it is used, it is usually a uniform scale.
    • In fact, in most of my runtime animation code I took advantage of this and stored an entire bone transform in 8 floats: 4 for the rotation quaternion, 3 for translation, and 1 for uniform scale. This significantly reduces the runtime pose size and speeds up blending and copying.
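As a rough sketch of that 8-float layout (the type and names below are illustrative, not code from any engine):

```python
from dataclasses import dataclass

@dataclass
class BoneTransform:
    quat: tuple   # 4 floats: rotation quaternion (x, y, z, w)
    trans: tuple  # 3 floats: translation
    scale: float  # 1 float: uniform scale

FLOATS_PER_BONE = 8                     # 4 + 3 + 1
BYTES_PER_BONE = FLOATS_PER_BONE * 4    # 32 bytes at 4 bytes per float

def pose_size_bytes(num_bones):
    # Size of one uncompressed pose: e.g. 100 bones -> 3200 bytes (~3 KB)
    return num_bones * BYTES_PER_BONE
```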

With all of this in mind, if you look at the original data format you can see how inefficiently it uses memory: we duplicate each bone's translation and scale values even when they never change, and this quickly gets out of hand. Animators usually author animations at 30fps, and in AAA games a character typically has around 100 bones. At 8 floats per bone, that means about 3 KB per pose and roughly 94 KB per second of animation. These numbers add up fast and can easily blow the memory budget on some platforms.

So let's talk about compression. When trying to compress the data, there are several aspects to consider:

  • Compression ratio
    • How much we managed to reduce the memory footprint
  • Quality
    • How much information we lost from the source data
  • Speed
    • How quickly we can compress the data and, more importantly, decompress it at runtime

I care primarily about quality and speed, and less about memory. Moreover, I work with game animations, so I can take advantage of the fact that most of the data barely uses translation and scale; this lets us reduce the memory footprint without the quality loss caused by frame reduction and other lossy techniques.

It is also extremely important not to underestimate the performance impact of animation compression: on one of my previous projects, it slowed pose sampling down by about 35%, and there were quality problems as well.

When we start compressing animation data, there are two main areas to consider:

  • How much we can compress the individual pieces of data in a keyframe (quaternions, floats, etc.).
  • How we can compress the sequence of keyframes to remove redundant information.

Data Quantization


Almost all of this section boils down to one principle: quantize the data.

Quantization is a fancy way of saying that we convert a value from a continuous range to a discrete set of values.

Float Quantization


When quantizing float values, we aim to take a float and represent it as an integer using fewer bits. The trick is that the integer does not actually represent the source number, but a value in a discrete set mapped onto a continuous range. A very simple approach is usually used. To quantize a value, we first need the range of the original value; given that range, we normalize the value within it. The normalized value is then multiplied by the maximum integer representable at the desired output size in bits, i.e., for 16 bits we multiply by 65535. The result is rounded to the nearest integer and stored. This is illustrated in the image:


An example of quantizing a 32-bit float to an unsigned 16-bit integer

To recover the original value, we simply perform the operations in reverse. It is important to note that we need to store the original range somewhere; otherwise we cannot decode the quantized value. The number of bits in the quantized value determines the step size in the normalized range, and therefore the step size in the original range: a decoded value is always a multiple of this step size, which lets us easily compute the maximum error introduced by quantization and thereby choose the number of bits our application requires.
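The round trip described above can be sketched as follows (a minimal illustration of the idea, not the quant library's actual API):

```python
def quantize(value, range_min, range_max, num_bits=16):
    # Normalize into [0, 1], scale by the largest integer that fits in
    # num_bits (65535 for 16 bits), then round to the nearest step.
    max_int = (1 << num_bits) - 1
    normalized = (value - range_min) / (range_max - range_min)
    return round(normalized * max_int)

def dequantize(stored, range_min, range_max, num_bits=16):
    # Reverse the operations; the result is a multiple of the step size.
    max_int = (1 << num_bits) - 1
    return range_min + (stored / max_int) * (range_max - range_min)

def max_error(range_min, range_max, num_bits=16):
    # Half a step: the worst-case loss introduced by the rounding.
    return (range_max - range_min) / ((1 << num_bits) - 1) / 2.0
```

The round-trip error is bounded by half the step size, which is how you pick the bit count for a given precision requirement.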

I won't give source code examples here, because there is a fairly convenient and simple library for the basic quantization operations that is a good reference on the topic: https://github.com/r-lyeh-archived/quant (though I would advise against using its quaternion quantization function; more on that later).

Quaternion Compression


Quaternion compression is a well-studied topic, so I will not repeat what others have explained better. Here is a link to a post on snapshot compression that gives the best description of the subject: https://gafferongames.com/post/snapshot_compression/

However, I do have something to add. The bitsquid posts on quaternion compression suggest compressing a quaternion down to 32 bits, using roughly 10 bits per stored component. This is exactly what quant does, since it is based on the bitsquid posts. In my opinion, that compression is too aggressive: in my tests it caused noticeable jittering. Perhaps the authors used shallower character hierarchies, but once you multiply 15-plus quaternions down a chain, as in my example animations, the accumulated error becomes quite severe. In my opinion, the absolute minimum is 48 bits of precision per quaternion.
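To illustrate what a 48-bit budget buys, here is a rough "smallest three" sketch in the spirit of the posts linked above: 2 bits for the index of the largest component plus 15 bits for each of the other three. This is my own toy version, not code from quant or bitsquid:

```python
import math

RANGE = 1.0 / math.sqrt(2.0)  # the three smallest components lie in [-1/sqrt(2), 1/sqrt(2)]

def _quantize(v, bits):
    max_int = (1 << bits) - 1
    return round((v + RANGE) / (2.0 * RANGE) * max_int)

def _dequantize(q, bits):
    max_int = (1 << bits) - 1
    return (q / max_int) * (2.0 * RANGE) - RANGE

def compress_quat(q, bits=15):
    # Drop the largest component (recoverable from unit length); store its
    # index (2 bits) plus the other three quantized to `bits` bits each.
    largest = max(range(4), key=lambda i: abs(q[i]))
    sign = -1.0 if q[largest] < 0.0 else 1.0  # q and -q are the same rotation
    rest = [_quantize(sign * q[i], bits) for i in range(4) if i != largest]
    return largest, rest

def decompress_quat(largest, rest, bits=15):
    comps = [_dequantize(v, bits) for v in rest]
    dropped = math.sqrt(max(0.0, 1.0 - sum(c * c for c in comps)))
    comps.insert(largest, dropped)
    return tuple(comps)
```

The same scheme at ~10 bits per component gives the 32-bit encoding the bitsquid posts describe; widening the components is what buys back the precision.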

Size reduction from quantization


Before we look at the various compression methods and track layouts, let's see what compression we get by simply applying quantization to the original layout. We will use the same example as before (a 100-bone skeleton). Using 48 bits (3 × 16) per quaternion, 48 bits (3 × 16) for translation, and 16 bits for scale, each transform needs 14 bytes instead of 32, i.e., 43.75% of the original size. For 1 second of animation at 30fps, that takes us from about 94 KB down to about 41 KB.
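Redoing that arithmetic (just the article's numbers, nothing engine-specific):

```python
ROT_BITS, TRANS_BITS, SCALE_BITS = 48, 48, 16

bits_per_bone = ROT_BITS + TRANS_BITS + SCALE_BITS  # 112 bits
bytes_per_bone = bits_per_bone // 8                 # 14 bytes vs 32 uncompressed
ratio = bytes_per_bone / 32                         # 0.4375 of the original size

def seconds_to_kb(num_bones=100, fps=30):
    # Memory for one second of quantized animation: ~41 KB for 100 bones
    return num_bones * bytes_per_bone * fps / 1024
```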

That's not bad at all, and quantization is a relatively cheap operation, so it won't hurt decompression time much. This is a good starting point, and in some cases it will even be enough to fit animations within the resource budget while maintaining excellent quality and performance.

Track compression


This is where things get complicated, especially once developers start trying techniques such as keyframe reduction, curve fitting, and so on. It is also the stage at which we really start to degrade animation quality.

Almost all such approaches assume that each component of a bone's transform (rotation, translation, and scale) is stored as a separate track. We can therefore transpose the layout I showed earlier:


Storing bone data as tracks

Here we simply store all the tracks sequentially, but we could also group all the rotation, translation, and scale tracks together. The basic idea is that we move from storing data per pose to storing it per track.

Having done this, we can use other techniques to further reduce memory. The first is to start dropping frames. Note that this does not require the track format and can be applied to the previous layout as well. It works, but it loses the small movements in the animation, because we throw away most of the data. This technique was used heavily on the PS3, where we sometimes had to stoop to insanely low sampling rates, e.g., down to 7 frames per second (usually for less important animations). I have bad memories of it: as an animation programmer I clearly see the lost detail and expressiveness, but from a systems programmer's point of view the animation is "almost" the same, since the overall motion is preserved while we save a lot of memory.

Let's set that approach aside (in my opinion it is too destructive) and consider other options. Another popular approach is to build a curve for each track and perform key reduction on the curve, i.e., remove redundant keys. For game animations, the translation and scale tracks compress extremely well under this approach, sometimes down to a single key. This solution is non-destructive, but it requires decompression: every time we need a transform, we have to evaluate the curve, since we can no longer just index into the data. The situation can be improved somewhat if animations are only ever evaluated forward and you store per-bone sampler state for each animation (i.e., where on the curve to resume evaluation), but you pay for that with extra memory and a significant increase in code complexity. In modern animation systems, we often do not play animations from beginning to end: we frequently transition into new animations at arbitrary time offsets thanks to things like synchronized blending or phase matching; we often sample individual, non-consecutive poses for things like aim/look-at blending; and animations are often played in reverse. So I do not recommend this solution; it is simply not worth the hassle of the complexity and potential bugs.

There is also the idea of deleting not only identical keys on the curves but also keys within a similarity threshold; this washes the animation out, much like frame dropping, because the end result is the same in terms of data. Compression schemes are often built where these parameters are set per track, and animators endlessly fiddle with the values, trying to preserve quality and reduce size at the same time. It is a painful, stressful workflow, but it was necessary when working within the memory limits of older console generations. Fortunately, today we have larger memory budgets and don't need such horrors.

All of these aspects are covered in the Riot/Bitsquid and Nicholas posts (see the links at the beginning of this article), so I won't discuss them in detail. Instead, I'll tell you what I decided to do about track compression...

I... decided not to compress the tracks.

Before you start shouting, let me explain...

When I store data as tracks, I keep the rotation data for every frame. For translation and scale, during compression I detect whether each component is static, and if so I store only a single value for it in the track. That is, if a track moves along X but not along Y or Z, I store every X translation value, but only one Y value and one Z value.
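A minimal sketch of that static-component check during compression (the names and the tolerance are my own, purely illustrative):

```python
def compress_component(values, tolerance=1e-6):
    # One scalar component of a track, e.g. translation along X,
    # with one value per frame. If every frame holds (nearly) the
    # same value, keep a single value; otherwise keep all of them.
    first = values[0]
    if all(abs(v - first) <= tolerance for v in values):
        return {"static": True, "data": [first]}
    return {"static": False, "data": list(values)}
```

A track that moves only along X would then store all of its X values but a single Y and a single Z value.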

This is the case for most bones in about 95% of our animations, so in the end we can significantly reduce memory with absolutely no loss of quality. It does require some discipline on the content-creation (DCC) side: we don't want bones picking up tiny movements and scales during authoring, but the benefit is worth the extra care.

In our example animation there are only two dynamic translation tracks and no scale tracks. For 1 second of animation, the data then drops from 41 KB to 18.6 KB (i.e., to about 20% of the original size). It gets even better as animations grow longer: we only pay per frame for rotations and dynamic translations, while the cost of static tracks stays constant, so long animations save even more. And we suffer none of the quality loss caused by frame dropping.

With all this in mind, my final data layout looks like this:


An example of the compressed animation data layout (3 frames per track)

In addition, I store each bone's offset into the data block. This is needed because we sometimes want to sample data for a single bone without reading the whole pose, and it gives us fast direct access to that bone's track data.
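The per-bone offsets can be pictured as a simple prefix-sum lookup table (purely illustrative, not my engine's actual layout code):

```python
def build_bone_offsets(track_sizes):
    # track_sizes[i] = size in bytes of bone i's track data.
    # Returns the byte offset of each bone's data within the block,
    # so a single bone can be sampled without touching the rest.
    offsets = []
    pos = 0
    for size in track_sizes:
        offsets.append(pos)
        pos += size
    return offsets
```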

Besides the animation data, which is stored in a single memory block, I also keep per-track compression settings:


Example of per-track compression settings from my Kruger engine

These settings store everything I need to dequantize each track's values. They also record which tracks are static, so I know how to handle the compressed data when I hit a static track while sampling.

You may also notice that the quantization ranges are per track: during compression I record the minimum and maximum of each component (e.g., translation along X) of each track, so the data is quantized within that min/max range and retains maximum precision. I don't think global quantization ranges are workable at all without either destroying your data (when values fall outside the range) or introducing significant error.

In any case, that is the short summary of my humble attempt at animation compression: in the end, I barely compress anything.
