About mask registers

The AVX-512 instruction set introduced eight so-called mask registers [1], k0 [2] through k7. They can be used with most ALU operations and let you apply a mask to the elements of a vector, either zeroing or merging the unselected elements in the destination register [3]. This speeds up code that would need additional merge operations in AVX2 and earlier instruction sets.

If that is not enough to convert you to the cult of mask registers, here is an excerpt from the Wikipedia article that I hope will settle the matter:

Most AVX-512 instructions can take a mask operand that refers to one of the 8 mask registers (k0 to k7). When used as the mask of an operation, k0 behaves differently from the other mask registers: it acts as a hard-coded constant indicating that no mask is applied to the operation. In arithmetic and logical operations on masks, and when values are written to mask registers, k0 behaves like a normal working register. In most instructions, the mask determines which elements are written to the destination register. The behavior of the mask operand depends on a flag: if it is set, all unselected elements are zeroed (zeroing mode); if not, all unselected elements keep their previous values (merging mode). Merging mode has the same effect as the blend instructions.
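The difference between the two modes can be illustrated with a small software model. This is only a sketch of the semantics described in the quote, using plain Python lists instead of vector registers; the function name masked_add is made up for this example and does not correspond to any real intrinsic.

```python
def masked_add(dst, a, b, mask, zeroing):
    """Model a masked element-wise add over lists of equal length.

    Bit i of `mask` selects element i. Unselected elements are either
    cleared (zeroing mode) or left as they were in `dst` (merging mode).
    """
    out = []
    for i, (old, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)   # selected: write the new result
        elif zeroing:
            out.append(0)       # zeroing mode: clear the element
        else:
            out.append(old)     # merging mode: keep the old value
    return out


dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(masked_add(dst, a, b, 0b0101, zeroing=True))   # [11, 0, 33, 0]
print(masked_add(dst, a, b, 0b0101, zeroing=False))  # [11, 9, 33, 9]
```

Without mask registers, the merging case needs a separate blend step after the add, which is the extra work mentioned above.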

So mask registers [4] are an important addition, yet they get far less attention than, say, general purpose registers (eax, rsi and others) or SIMD registers (xmm0, ymm5, etc.). Intel presentations that show the sizes of microarchitectural resources do not mention mask registers:

As far as I know, the size of the physical register file (PRF) for mask registers has never been published either. Let's fix that.
I used a modified version of the tool for measuring the size of the reorder buffer (ROB), created and described by Henry Wong [5] (hereinafter simply Henry). With this tool he measured the sizes of documented and undocumented out-of-order execution structures in earlier architectures. If you have not read Henry's post, stop and go read it now. This article will wait.

Done? In case you are not, here is a summary of Henry's article:

ROB

A number of ballast instructions are inserted between two loads that miss in the cache [6]; how many are needed depends on which processor resource we want to measure. If there are not too many ballast instructions, both cache misses are handled in parallel, so their latencies overlap and the total execution time is approximately [7] that of a single cache miss.

However, if the number of ballast instructions exceeds a certain critical threshold, the corresponding resource is completely exhausted and allocation of instructions into the ROB stalls before the second cache-missing load is issued. The two misses can then no longer be handled in parallel, and the total time nearly doubles, which shows up on the graph as a sharp jump.

Finally, the test is written so that each ballast instruction uses exactly one unit of the resource being measured, so the position of the jump on the graph gives the total size of that resource. For example, ordinary general purpose instructions typically consume one physical register from the general purpose PRF and are therefore ideal for measuring the size of that resource.
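The idea behind the measurement can be captured in a toy model. The sketch below is not Henry's tool; it only assumes that each ballast instruction uses one unit of the resource and that the two misses overlap exactly when all ballast instructions fit. The miss latency of 300 cycles is an arbitrary illustrative value.

```python
def total_time(ballast_count, resource_size, miss_latency=300):
    """Toy model: time for two cache-missing loads separated by ballast.

    miss_latency is an arbitrary illustrative number of cycles.
    """
    if ballast_count <= resource_size:
        return miss_latency       # both misses are in flight together
    return 2 * miss_latency       # the second miss waits for the first


def find_jump(resource_size, max_ballast=400):
    """Return the largest ballast count that still runs in fast mode."""
    previous = total_time(0, resource_size)
    for n in range(1, max_ballast + 1):
        current = total_time(n, resource_size)
        if current > previous:
            return n - 1
        previous = current
    return None


print(find_jump(64))  # 64: the jump position equals the resource size
```

In this idealized model the jump is a perfect step; real hardware produces a less clean transition.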

Size of the physical register file of mask registers

In this test, we execute instructions that write to mask registers in order to find the size of the mask register PRF.

Let's start with a series of kaddd k1, k2, k3 instructions (16 ballast instructions shown):

mov    rcx,QWORD PTR [rcx]  ; first load that misses the cache
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
kaddd  k1,k2,k3
mov    rdx,QWORD PTR [rdx]  ; second load that misses the cache
lfence                      ; wait until all preceding instructions
                            ; have completed
;     16

Each kaddd instruction consumes one physical mask register. If the number of ballast instructions is less than or equal to the number of available physical mask registers, the cache misses are handled in parallel; otherwise they are handled one after the other. So at the switch from parallel to serial handling we should see a sharp jump on the graph, indicating an increase in execution time.
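To make the structure of such a test concrete, here is a small generator for the listing above. It is only an illustration: the helper name make_test_block is made up, and this is not how the robsize tool itself builds its tests.

```python
def make_test_block(ballast, count):
    """Build the text of one test block: load, `count` ballast lines, load, lfence."""
    lines = ["mov    rcx,QWORD PTR [rcx]"]   # first cache-missing load
    lines += [ballast] * count               # ballast under test
    lines += [
        "mov    rdx,QWORD PTR [rdx]",        # second cache-missing load
        "lfence",                            # wait for the block to complete
    ]
    return "\n".join(lines)


# Sweep the ballast count to locate the jump; here we just print one block.
print(make_test_block("kaddd  k1,k2,k3", 4))
```

The measurement then consists of timing such blocks for increasing values of count and plotting time against count.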

This is exactly what we are observing:

Let’s take a closer look at the rise:

As we can now see, the jump is not that sharp: with 130 to 134 ballast instructions, the execution time takes intermediate values between the fast and slow levels. Henry calls this behavior non-ideal; I observed it in many of these tests, though not all. The point is that the hardware implementation does not always allow a resource to be fully exhausted as it approaches the limit [8]: sometimes this succeeds, and sometimes it falls a few instructions short of the theoretical maximum.
So the point of interest is the last one on the rise at which execution is still faster than in slow mode. That point shows how many units of the resource were available to us, which means there are at least that many physical registers. As you can see, here it is at about 134.
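Picking that point can be automated. The sketch below assumes the measurements are available as (ballast count, time) pairs and that the slow-mode level is known; the helper name, the 5% tolerance and the sample values are all made up for illustration and are not measured data.

```python
def last_fast_point(samples, slow_level, tolerance=0.05):
    """Return the largest ballast count whose time is still clearly below slow mode.

    samples: iterable of (ballast_count, time) pairs.
    tolerance: fraction of slow_level treated as measurement noise (assumed).
    """
    threshold = slow_level * (1 - tolerance)
    fast = [count for count, time in samples if time < threshold]
    return max(fast) if fast else None


# Synthetic values shaped like a gradual rise, NOT real measurements.
synthetic = [(128, 300), (130, 320), (132, 400), (134, 520), (135, 600), (136, 600)]
print(last_fast_point(synthetic, slow_level=600))  # 134
```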

Thus, SKX has 134 physical registers available to hold speculative mask register values (values produced by speculative execution). Henry suggests that 8 more are used to hold the current architectural state of the eight mask registers, so the total size of the mask PRF can be estimated at 142. This is somewhat smaller than the general purpose register file (180) and the SIMD register file (168), but still fairly large (see the table of out-of-order resource sizes for other platforms).

This file is large enough that in practice we are unlikely to fill it: it is hard to imagine real code in which almost 60% [9] of the instructions write [10] to mask registers, which is how many it would take to exhaust this resource.
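The arithmetic behind these two figures is short enough to spell out. The sketch uses only numbers quoted in this article (134 speculative registers, 8 architectural registers, a ROB of 224 entries); the variable names are mine.

```python
speculative_regs = 134    # measured jump position
architectural_regs = 8    # k0..k7
rob_size = 224            # ROB size for this microarchitecture

total_prf = speculative_regs + architectural_regs
# Share of in-flight instructions that must write a mask register
# to run out of mask registers before running out of ROB entries.
share = speculative_regs / rob_size

print(total_prf)               # 142
print(round(share * 100, 1))   # 59.8
```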

Are these separate register files?

As you may have noticed, so far I have assumed that the mask register PRF is a separate file, distinct from the other register files. I believe this is very likely, given how mask registers work and the fact that they belong to a separate register renaming domain [11]. Another argument in favor is that the observed size of the mask register PRF does not match the size of either the general purpose or the SIMD register file. But we can actually test this!

This test is similar to the previous one, but now the kaddd instructions alternate with instructions that use either general purpose registers or SIMD registers. If mask registers share a register file with one of those, the jump in the graph should indicate the size of that PRF. If the register files are not shared, we will hit some other limit that matches neither register file size, for example the size of the ROB.
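The two hypotheses predict different jump positions, which can be written down as a small calculation. This is a simplified sketch: it treats the quoted file sizes as if every entry were available for ballast instructions, ignores architectural state, and uses a made-up function name.

```python
def predicted_jump(rob_size, mask_prf, other_prf, shared):
    """Predict the ballast count at which the first resource runs out.

    The ballast alternates one-for-one between kaddd and another instruction type.
    """
    if shared:
        # One combined file: every ballast instruction takes a register from it.
        return min(rob_size, other_prf)
    # Separate files: each file is hit by only every second ballast instruction.
    return min(rob_size, 2 * mask_prf, 2 * other_prf)


# Sizes quoted in this article: ROB 224, mask PRF 134, general purpose PRF 180.
print(predicted_jump(224, 134, 180, shared=True))    # 180
print(predicted_jump(224, 134, 180, shared=False))   # 224
```

A jump near the register file size would point to a shared file, while a jump near the ROB size would point to separate files.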

In test 29, kaddd instructions alternate with scalar add instructions:

mov    rcx,QWORD PTR [rcx]
add    ebx,ebx
kaddd  k1,k2,k3
add    esi,esi
kaddd  k1,k2,k3
add    ebx,ebx
kaddd  k1,k2,k3
add    esi,esi
kaddd  k1,k2,k3
add    ebx,ebx
kaddd  k1,k2,k3
add    esi,esi
kaddd  k1,k2,k3
add    ebx,ebx
kaddd  k1,k2,k3
mov    rdx,QWORD PTR [rdx]
lfence

We look at the graph:

As you can see, the number of ballast instructions at which the jump occurs is larger than the size of both the general purpose PRF and the mask register PRF. From this we conclude that mask registers are not held in the general purpose register file.

Then maybe they are included in the SIMD register file? After all, mask registers are more associated with SIMD commands than with general purpose commands.

To find out, we will use test 35, which is identical to test 29 except that the kaddd instructions alternate with vxorps instructions:

mov    rcx,QWORD PTR [rcx]
vxorps ymm0,ymm0,ymm1
kaddd  k1,k2,k3
vxorps ymm2,ymm2,ymm3
kaddd  k1,k2,k3
vxorps ymm4,ymm4,ymm5
kaddd  k1,k2,k3
vxorps ymm6,ymm6,ymm7
kaddd  k1,k2,k3
vxorps ymm0,ymm0,ymm1
kaddd  k1,k2,k3
vxorps ymm2,ymm2,ymm3
kaddd  k1,k2,k3
vxorps ymm4,ymm4,ymm5
kaddd  k1,k2,k3
mov    rdx,QWORD PTR [rdx]
lfence

Graph:

In this test, the same behavior is observed as in the previous one, so we conclude that the register files of mask registers and SIMD registers are also separated.

Unsolved Mystery

Nevertheless, in both tests the jump ends at about 212 instructions, while the ROB size for this microarchitecture is 224. Maybe this is just the same imprecise behavior near the limit that we saw earlier? Let's check by comparing these two tests with test 4, which uses only nop instructions as ballast: these should consume no resource other than ROB entries. Compare the graphs of test 4 (nop) and test 29 (alternating kaddd and scalar add):

In test 4, slow mode begins exactly at 224 (the images are vector graphics, so you can zoom in and check for yourself). So 212, where slow mode begins when mask registers alternate with general purpose or SIMD registers, is the limit of some other resource. In fact, we hit the same limit even when alternating general purpose and SIMD registers; compare test 4 and test 21 (which combines general purpose add instructions with SIMD vxorps instructions):

In his article, under the same heading ("Unsolved Mystery"), Henry describes the same effect, only more pronounced:

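On Sandy Bridge, AVX or SSE instructions interleaved with integer instructions seem to be limited to a window of about 147 instructions by something other than the ROB. Having tried other combinations (for example, varying the order and proportion of AVX and integer instructions, or adding NOPs to the mix), it seems that SSE/AVX and integer instructions both consume registers from some kind of shared pool, since the window is always limited to about 147 regardless of how many instructions of each type are used, as long as neither type exhausts its own PRF.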

For details, I refer you to Henry's article. We observe a similar but weaker effect: we manage to fill about 95% of the ROB (212 of 224 entries), but still cannot exhaust it completely. Perhaps that mysterious shared pool of registers is related to the mechanism for freeing them, for example the PRRT [12], a table that keeps track of which registers can be freed once an instruction retires.

Finally, let's talk about some more features of mask registers and check if the optimization mechanisms available to general purpose registers and SIMD registers are applicable to them.

Move elimination

So-called move elimination can be applied to general purpose and SIMD instructions. With this optimization, the register renamer avoids executing instructions that copy a value from one register to another, such as mov eax, edx or vmovdqu ymm1, ymm2. Instead, the destination register is "simply" [13] remapped to the source's physical register in the RAT, so no ALU is needed.
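The remapping can be pictured with a toy rename table. This is a deliberately minimal sketch with made-up names: it ignores reference counting, freeing of physical registers and every other detail of a real renamer, and only shows why an eliminated move consumes neither a new physical register nor an execution slot.

```python
class Renamer:
    """Toy register alias table (RAT): architectural name -> physical register id."""

    def __init__(self, arch_regs):
        # Each architectural register starts mapped to its own physical register.
        self.rat = {name: i for i, name in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.executed_uops = 0

    def write(self, dst):
        """An ordinary instruction: allocate a new physical register and execute."""
        self.rat[dst] = self.next_phys
        self.next_phys += 1
        self.executed_uops += 1

    def mov(self, dst, src, eliminate):
        """A register copy, either eliminated at rename or executed normally."""
        if eliminate:
            self.rat[dst] = self.rat[src]   # alias dst to src's physical register
        else:
            self.write(dst)                 # needs a new register and an ALU


r = Renamer(["eax", "edx"])
r.mov("eax", "edx", eliminate=True)
print(r.rat["eax"] == r.rat["edx"], r.executed_uops)  # True 0
```

In terms of the ballast test, an eliminated move would not consume a physical register, so the jump would not appear at the PRF size.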

Let's check whether move elimination applies to, say, kmov k1, k2. First, look at the graph of test 28, where the ballast instruction is kmovd k1, k2:

This graph looks exactly like the one for test 27, discussed earlier, which used kaddd. So it is reasonable to assume that physical registers are being consumed, unless we happened to exhaust some other resource used by move elimination that behaves the same way and has the same size [14].

uops.info provides additional confirmation: it reports that every variant of the kmov copy between mask registers takes one micro-operation executed on port p0. If the move were eliminated, we would see no activity on the ports.
From this I conclude that copies between mask registers [15] are not eliminated.

Dependency-breaking idioms

The best way to zero a general purpose register on x86 is the exclusive OR (xor) idiom: xor reg, reg. It works because XORing any value with itself yields zero. This instruction is shorter (takes fewer bytes) than the more obvious mov eax, 0, and also faster, because the processor recognizes it as a zeroing idiom and handles it at register renaming [16], without using an ALU or an execution port.

Moreover, this idiom breaks data dependencies: normally the result of xor reg1, reg2 depends on the values in reg1 and reg2, but in the special case where reg1 and reg2 are the same register there is no dependency, since the output is zero for any input value. All modern x86 processors recognize this [17] special case. The same holds for the SIMD versions of the xor idiom, namely vpxor for integers and vxorps and vxorpd for floating-point values.
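The special case can be expressed as a rule on the operands rather than on their values. The sketch below is an illustration with made-up names; the opcode set contains only the instructions named in the paragraph above, and a real processor's list of recognized idioms is longer.

```python
# Opcodes named in the text as recognized zeroing idioms (illustrative subset).
ZEROING_OPCODES = {"xor", "vpxor", "vxorps", "vxorpd"}


def input_dependencies(opcode, sources):
    """Return the set of registers the result actually depends on.

    sources: the two source register names of the instruction.
    """
    if opcode in ZEROING_OPCODES and len(sources) == 2 and sources[0] == sources[1]:
        return set()          # same register twice: result is always zero
    return set(sources)       # otherwise the result depends on both inputs


print(input_dependencies("xor", ("eax", "eax")))              # set()
print(sorted(input_dependencies("xor", ("eax", "ebx"))))      # ['eax', 'ebx']
```

Note that the check compares register names, not register contents: two different registers that happen to hold the same value do not qualify.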
Here a curious reader may ask: does this idiom also work with the corresponding kxor variants? For example, is kxorb k1, k1, k1 [18] treated as a zeroing idiom?
These are really two separate questions, because the benefit of a zeroing idiom has two components:

Execution with zero latency that bypasses the execution unit (execution elimination)
Dependency Elimination

We will deal with each question separately.

Execution elimination

So, can xor zeroing idioms such as kxorb k1, k1, k1 be handled at register renaming, without being sent to an execution unit?

No.

I do not even need to do anything myself to prove it: uops.info has already run this test and shows that this instruction executes with a latency of 1 cycle and takes one micro-operation on port p0. It follows that xor zeroing idioms are not eliminated for mask registers.

Dependency Elimination

But what if zeroing idioms with kxor still break data dependencies, even though they have to go through an execution unit?

Here uops.info cannot help us. kxor has a latency of 1 cycle and executes on a single port (p0), which creates an interesting (?) situation: a chain of kxor instructions runs at the same speed whether or not there are dependencies between them, because a throughput limit of 1 instruction per cycle produces the same bottleneck as a latency of 1 cycle!

No matter, we still have a couple of tricks up our sleeve. The following test will answer the question. We embed kxor in a chain of instructions in which each instruction depends on the previous one, and whose total latency is long enough to be the bottleneck. If kxor does not break the dependency, the total execution time equals the sum of the latencies of the instructions in the chain. If the dependency is broken, the chain splits into shorter sequences whose latencies overlap, and execution speed is then limited by some throughput limit (for example, port contention). A diagram would show this nicely, but drawing is not my strong suit.

All of these tests can be found in the uarch-bench benchmark, but I will give the key points below.

First, measure the latency of a round trip from a general purpose register to a mask register and back:

kmovb k0, eax
kmovb eax, k0
;   127

This pair of instructions executes [19] in 4 cycles. It is not known exactly how the time splits between them: 2 cycles each, or 1 cycle for one and 3 for the other [20]? For our purposes it does not matter, because we only care about the total round-trip latency. Notably, the throughput of this sequence is 1 cycle per pair, 4 times faster than the latency, because each instruction executes on its own port (p5 and p0, respectively). This means we can tell a latency effect apart from a throughput effect.

Next, we add to the chain a kxor instruction that is guaranteed not to be a zeroing idiom:

kmovb k0, eax
kxorb k0, k0, k1
kmovb eax, k0
;   127

Since we know that kxorb has a latency of 1 cycle, the total execution time should increase to 5 cycles, and this is exactly what the test shows (results of the first two tests):

** avx512 : AVX512 **
Test                                              Cycles    Nanos
mov GP <-> kreg round trip                          4.00     1.25
mov GP <-> kreg round trip + non-zeroing kxorb      5.00     1.57

And finally, the main test:

kmovb k0, eax
kxorb k0, k0, k0
kmovb eax, k0
;   127

This time we use kxorb in its zeroing form: kxorb k0, k0, k0. If the dependency on the value in k0 is broken, then kmovb eax, k0 no longer depends on the preceding kmovb k0, eax, the chain is split, and the total execution time should drop.

Drum roll ...

We get the same 5.0 cycles as in the previous example:

** avx512 : AVX512 **
Test                                              Cycles    Nanos
mov GP <-> kreg round trip                          4.00     1.25
mov GP <-> kreg round trip + non-zeroing kxorb      5.00     1.57
mov GP <-> kreg round trip + zeroing kxorb          5.00     1.57

The preliminary conclusion: the processor does not recognize zeroing idioms applied to mask registers.

Finally, one more test to check our reasoning: we replace kxor with a kmov from a general purpose register, which we know always breaks the dependency:

kmovb k0, eax
kmovb k0, ecx
kmovb eax, k0
;   127

The final results are below. The last test is much faster, only 2 cycles, and the bottleneck is port p5 (both kmov k, r32 instructions execute only on this port):

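** avx512 : AVX512 **
Test                                              Cycles    Nanos
mov GP <-> kreg round trip                          4.00     1.25
mov GP <-> kreg round trip + non-zeroing kxorb      5.00     1.57
mov GP <-> kreg round trip + zeroing kxorb          5.00     1.57
mov GP <-> kreg round trip + mov from GP            2.00     0.63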

It turns out that our assumption is correct.
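The reasoning behind all four tests fits into a small model: one loop iteration costs either the sum of the latencies along the dependency chain, or the busiest port's micro-operation count, whichever is larger. The sketch below is a simplification with made-up names. It assumes one micro-operation per instruction, treats the whole loop body as a single chain, and splits the 4-cycle round trip as 2 + 2 cycles, which is an assumption (only the sum is known).

```python
from collections import Counter


def cycles_per_iteration(chain):
    """chain: list of (latency, port, breaks_dependency), one tuple per instruction."""
    port_bound = max(Counter(port for _, port, _ in chain).values())
    if any(breaks for _, _, breaks in chain):
        latency_bound = 0   # nothing is carried from one iteration to the next
    else:
        latency_bound = sum(latency for latency, _, _ in chain)
    return max(latency_bound, port_bound)


K_FROM_GP = (2, "p5", False)   # kmovb k0, eax (assumed 2 of the 4 cycles)
GP_FROM_K = (2, "p0", False)   # kmovb eax, k0 (assumed 2 of the 4 cycles)

print(cycles_per_iteration([K_FROM_GP, GP_FROM_K]))                     # 4
print(cycles_per_iteration([K_FROM_GP, (1, "p0", False), GP_FROM_K]))   # 5
# What a dependency-breaking kxorb would give: bound by two uops on p0.
print(cycles_per_iteration([K_FROM_GP, (1, "p0", True), GP_FROM_K]))    # 2
# kmovb k0, ecx overwrites k0 and breaks the chain: bound by two uops on p5.
print(cycles_per_iteration([K_FROM_GP, (2, "p5", True), GP_FROM_K]))    # 2
```

The measured 5 cycles for kxorb k0, k0, k0 match the second case, not the third, which is consistent with the dependency not being broken.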

Reproducing the results

You can reproduce all the results in this article yourself by running the robsize executable on Linux or Windows (under WSL). The results are also available in the repository, along with the scripts for collecting and plotting them.

Conclusions

On SKX, mask registers live in a separate physical register file; 134 of its entries hold speculative values, and its total size is about 142 registers
This is comparable to the size of the other register files and of the ROB, and large enough that code using mask registers should not be limited by it
Copies between mask registers are not eliminated
Zeroing idioms are not recognized for mask registers [21]

k- kregs – -. , k – «» (m) «» (f).
( AVX-512 ), k0 – , , , . : k0 – , , , k, SIMD-, (, AVX-512). SIMD- k0 , .
, -, 0, , . , , , , - - .
« », kreg – ( k0, k1 ..), – «kreg» « » ( ).
H. Wong, Measuring Reorder Buffer Capacity, May 2013. [Online]. Available: blog.stuffedcow.net/2013/05/measuring-rob-capacity
100 300 . , - 2 50 100 , – 2,5 (, 2 5 ). TLB-/ .
«» . , , . . , 29 104, , – , 200. , ( ) – , - ( ), .
, , , (register alias table, RAT), . RAT , , , , . RAT , , .
60% 134 224, .. PRF ROB. , ROB 224 , , , [10] 60% , ROB. , - , 60% , ROB, .
, , . , (, SIMD- ), . [2]
, . , 2 2 SIMD- ( ), 4 .
PRRT stands for either Physical Register Reclaim Table or Post Retirement Reclaim Table.
– , « », .. , , . , . : , .
, ( 7) , PRF ROB.
, , , . , ( , , , – ).
RAT , RAT , , .
xor, , sub reg,reg sbb reg, reg. , reg 0 -1 ( ) . , reg, – , . .
, : kxorb k1, k1, k1 , kxorb k1, k2, k2.
All the tests in this section can be run with:
```
./uarch-bench.sh --test-name=avx512/*
```
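uops.info lists the latency of both kmov r32, k and kmov k, r32 as <= 3. The pair takes 4 cycles in total, so if one of the two instructions takes 1 cycle, the other takes 3.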
, xor, , , -, , . , : , .
