SHISHUA: the world's fastest pseudorandom number generator


Six months ago, I set out to create the best pseudorandom number generator (PRNG) with an unusual architecture. I expected the beginning to be easy and the task to grow harder gradually, and I wondered whether I could learn fast enough to cope with the hardest parts.

To my surprise, the difficulty did not grow linearly. Passing a chi-square test on bytes proved very hard! Later, passing the diehard tests was just as hard. I published my results at that stage to find out what other difficulties awaited me; at the time, the generator still failed PractRand.

Then it was very hard to pass the BigCrush test.

Then it was very hard to produce the 32 terabytes of data needed to pass PractRand. Speed became a problem: a design generating ten megabytes per second was not enough, since passing PractRand would take a month. And I must admit, passing that test at gigabytes per second was very hard.

Once you have climbed that high... you want to know whether you can reach the Pareto frontier: to build the world's fastest PRNG, one that also passes the most demanding statistical tests.

I have succeeded.

In a previous article, I described what I learned on the way to that goal. Here, I will explain how the final design works.

Purpose


Let's start with the obvious: speed depends on the platform. I focused on optimizing for the modern x86-64 architecture (Intel and AMD processors).

To compare performance, the classic cpb metric is used: the number of processor cycles spent generating one byte. It is computed and compared in all cryptographic papers. A slightly lower cpb, in software or hardware, can win a competition or earn adoption on websites around the world.

To improve cpb, you can:

  1. Generate more bytes with the same amount of work,
  2. Or do less work to generate the same number of bytes,
  3. Or parallelize the work.

We will do all of the above.

According to the first point, we need to produce more bits at each iteration.

I was afraid of being told: "if it doesn't output 32-bit numbers, it's not a real PRNG," or the same about 64-bit numbers. Or: "a PRNG must not be designed only for x86-64," as if instructions like POPCNT or registers like %xmm7 were forbidden.

However, PRNG design is engineering: for decades, generators have squeezed everything possible out of processors! When ROL appeared, they started relying on it. With the arrival of 64-bit processors, they started relying on %rax. Of course, such algorithms may run slower on ARM (although that remains to be seen), but 64-bit PRNGs were in active use long before Android began requiring 64-bit support in 2019!

In other words, this field evolves together with the hardware. And today, thanks to AVX2, Intel and AMD processors already support 256-bit operations. RC4 produced one byte at a time, drand48 could produce four bytes, pcg64 eight, and now we can generate 32 bytes at once.

Eight bytes make a 64-bit number, and most programming languages have a built-in type for that. But few languages provide a 16-byte type (a notable exception is __uint128_t in C), and fewer still have a 32-byte type (outside of compiler intrinsics).

So we part with the usual PRNG function prototype (this example is from Vigna's HWD benchmark):

static uint64_t next(void);

Instead, we make a generator that fills a buffer (this example is from my own benchmark):

void prng_gen(prng_state *s, __uint64_t buf[], __uint64_t size);

What are the disadvantages of this solution?

If your generator produces 32 bytes at a time, the consumer must supply an array whose size is a multiple of 32 (ideally, 32-byte aligned). You can work around this by filling an internal buffer, draining unread data from it, and refilling it as needed. Latency becomes less predictable, since some calls merely read the buffer, but on average everything stays the same.

Now we generate more bytes for the same amount of work. How do we parallelize it?

Parallelism


Processors offer an incredible set of parallelization tools at all levels.

First, there are SIMD instructions (Single Instruction, Multiple Data). AVX2, for example, performs four 64-bit additions at once, or eight 32-bit additions, and so on.

SIMD has been used in cryptography for about fifteen years. This parallelism is behind ChaCha20's incredible performance, and it powers most major primitives that do not rely on AES-NI. NORX and Gimli, for example, were designed with parallelism in mind.

Recently, interest in this topic has grown in the non-cryptographic PRNG community as well. In particular, existing primitives that were never designed for SIMD can serve as the basis for very fast PRNGs.

When Sebastiano Vigna promoted his xoshiro256++ design for the Julia standard library, he found that the outputs of eight concurrent, differently seeded PRNG instances can be interleaved very cheaply if each operation is performed across all instances simultaneously.

SIMD is just one of the processor's levels of parallelism. I recommend reading the previous article for a fuller picture, but here is a summary.

Processor pipelines let several instructions be in flight at different stages. If their order is arranged to reduce dependencies between stages, instruction processing speeds up.

Superscalar execution processes the computational parts of several instructions at the same time, provided they have no read-write dependencies. You can adapt the design to reduce the risk of stalls by writing values long before they are read.

Out-of-order execution lets the processor run instructions as their operands become ready, rather than in program order, even when earlier instructions are not yet done. Again, this requires the absence of read-write dependencies.

Now, on to the implementation!

Architecture


Let's consider a design called semi-SHISHUA. Where the name comes from will become apparent as you read.

The scheme looks like this:


Let's go through it line by line.

typedef struct prng_state {
  __m256i state[2];
  __m256i output;
  __m256i counter;
} prng_state;

The state is split into two halves, each fitting in an AVX2 register (256 bits). For speed, we keep the output next to the state, although it is not part of the state.

We also have a 64-bit counter, likewise kept in an AVX2 register to simplify the computation. The reason is a quirk of AVX2: general-purpose registers (%rax and friends) cannot be moved directly into SIMD registers with a MOV; the data must pass through RAM (most often the stack), which adds latency and costs two instructions (a MOV to the stack, a VMOV from the stack).

Now let's look at generation. We start by loading the state, then walk through the buffer, filling 32 bytes at each iteration.

inline void prng_gen(prng_state *s, __uint64_t buf[], __uint64_t size) {
  __m256i s0 = s->state[0], counter = s->counter,
          s1 = s->state[1],       o = s->output;
  for (__uint64_t i = 0; i < size; i += 4) {
    _mm256_storeu_si256((__m256i*)&buf[i], o);
    // …
  }
  s->state[0] = s0; s->counter = counter;
  s->state[1] = s1; s->output  = o;
}

Since the function is inlined, storing into the buffer right at the start lets the processor begin executing the instructions that depend on it immediately, through out-of-order execution.

Inside the loop, we perform three fast operations on the state:

  1. SHIft
  2. SHUffle
  3. Add

Hence the name SHISHUA!

Shift


u0 = _mm256_srli_epi64(s0, 1);              u1 = _mm256_srli_epi64(s1, 3);

Unfortunately, AVX2 has no rotation instruction. But I want bits from one position of a 64-bit number to mix with bits at other positions, and a shift is the best available way to achieve that. We shift by an odd number, so that each bit can visit all 64 positions rather than only half of them.

Shifting discards bits, which removes information from the state. That is bad, so losses must be minimized: the smallest odd shift amounts are 1 and 3. We use different amounts for the two halves to increase their divergence, which helps reduce the similarity of their self-correlation.

We shift to the right, because the rightmost bits have the least diffusion under addition: for example, the least significant bit of A + B is just the XOR of the least significant bits of A and B.

Shuffle


t0 = _mm256_permutevar8x32_epi32(s0, shu0); t1 = _mm256_permutevar8x32_epi32(s1, shu1);

We use a 32-bit shuffle because it has a different granularity from the 64-bit operations used everywhere else (it breaks the 64-bit alignment). Moreover, it is a cross-lane operation: most other AVX2 shuffles can only move bits within the left 128 bits if they start on the left, or within the right 128 bits if they start on the right.

The shuffle constants:

__m256i shu0 = _mm256_set_epi32(4, 3, 2, 1, 0, 7, 6, 5),
        shu1 = _mm256_set_epi32(2, 1, 0, 7, 6, 5, 4, 3);

For the shuffle to genuinely improve the result, it moves the weak (low-diffusion) 32-bit halves of the 64-bit additions into strong positions, where the next additions will enrich them.

The low 32-bit half of a 64-bit chunk never moves into the same 64-bit chunk as its high half. Keeping the two apart improves mixing.

In the end, each 32-bit part cycles through all positions: from A to B, from B to C, ... from H back to A.

You may have noticed that the simplest shuffles satisfying all these requirements are two of the 256-bit rotations (by 96 and 160 bits).

Addition


Now we add the 64-bit chunks of the two temporaries: the shifted one and the shuffled one.

s0 = _mm256_add_epi64(t0, u0);              s1 = _mm256_add_epi64(t1, u1);

Addition is the main source of diffusion: it combines bits into irreducible combinations of XOR and AND expressions spread across the 64-bit positions.

Storing the result of the addition back into the state preserves that diffusion.

Output function


Where do we get the output from?

It's simple: the structure gives us two independent state halves, s0 and s1, which never influence each other. XOR them together and we get a thoroughly mixed result.

To strengthen the independence of the values we XOR, we take partial results: the shifted part of one half and the shuffled part of the other.

o = _mm256_xor_si256(u0, t1);

This also mirrors the trick of reducing read-write dependencies between instructions on a superscalar processor: u0 and t1 are ready to read before s0 and s1 are.

Now for the counter, which we process at the start of each iteration. First we change the state, then we advance the counter:

s1 = _mm256_add_epi64(s1, counter);
counter = _mm256_add_epi64(counter, increment);

Why change the state first and update the counter second? Because s1 becomes available earlier, which reduces the chance that later instructions reading it stall in the pipeline. This ordering also avoids a direct read-write dependency on the counter.

We apply the counter to s1 rather than s0: both affect the output either way, but s1 loses more bits to its larger shift, so the counter helps it recover.

The counter is not there to pass PractRand. Its sole purpose is to guarantee a lower bound of 2^69 bytes = 512 exbibytes on the PRNG's period: the cycle only begins to repeat after more than a millennium of output at 10 GiB/s. That is unlikely to be too short for practical use in the coming centuries.

Increment:

__m256i increment = _mm256_set_epi64x(1, 3, 5, 7);

Odd numbers are chosen as increments because only increments coprime with the modulus traverse the full cycle of the integers modulo 2^64, and every odd number is coprime with a power of two. In other words, if you increment by 2 modulo 4, you get the sequence 0-2-0-2-..., which never reaches 1 or 3, while an odd increment passes through every integer.

For each of the four 64-bit numbers in the state we use a different odd increment, which separates them further and adds a little mixing. I picked the smallest odd numbers so they would not look like magic constants.

This is how state transition and output function work. How to initialize them?

Initialization


We initialize the state with the hexadecimal digits of φ, the golden ratio: the irrational number that is hardest to approximate with fractions.

static __uint64_t phi[8] = {
  0x9E3779B97F4A7C15, 0xF39CC0605CEDC834, 0x1082276BF3A27251, 0xF86C6A11D0C18E95,
  0x2767F0B153D27B7F, 0x0347045B5BF1827F, 0x01886F0928403002, 0xC1D64BA40F335E36,
};

We take a 256-bit seed. This is common practice in cryptography, and it does no harm to a non-cryptographic PRNG:

prng_state prng_init(SEEDTYPE seed[4]) {
  prng_state s;
  // …
  return s;
}

We do not want the seed to override an entire state half (s0 or s1); it should affect only half of each. That way we avoid weak seeds that could, by accident or by design, produce a known weak initial state.

Since half of each state register stays unchanged, we keep control over 128 bits of state: enough entropy to start from a strong position and maintain it.

s.state[0] = _mm256_set_epi64x(phi[3], phi[2] ^ seed[1], phi[1], phi[0] ^ seed[0]);
s.state[1] = _mm256_set_epi64x(phi[7], phi[6] ^ seed[3], phi[5], phi[4] ^ seed[2]);

Then we repeat the following sequence ROUNDS times:

  1. Run STEPS iterations of SHISHUA.
  2. Copy state half 1 into half 0, and the output into half 1.

for (char i = 0; i < ROUNDS; i++) {
  prng_gen(&s, buf, 4 * STEPS);
  s.state[0] = s.state[1];
  s.state[1] = s.output;
}

Feeding the output back into the state increases diffusion. The extra work, and the correlation between the state halves, do not matter here, because this sequence runs only once at initialization. Diffusion is all we care about during initialization.

After evaluating the effect on seed correlation, I chose 2 for STEPS and 10 for ROUNDS. I measured the correlation by counting the "unusual" and "suspicious" anomalies reported by PractRand's quality-assessment tools.

Performance


It is difficult to measure speed for several reasons:

  • Clock measurement may not be accurate enough.

I use the processor's RDTSC instruction, which counts cycles.

So that anyone can reproduce my results, I use a cloud virtual machine. This does not change how the benchmark results rank compared to local testing, you do not have to buy the same computer I have, and plenty of real PRNG workloads run in the cloud anyway.

I chose Google Cloud Platform's N2 (Intel) and N2D (AMD) instances. The advantage of GCP is that it offers servers with processors from both manufacturers. This article focuses on Intel, but the AMD results are of the same order.

To set the scene, let's start with the old RC4 cryptographic generator. With no way to parallelize the work, I got 7.5 cpb (cycles per generated byte).

Now let's run a very popular and fast MCG: Lehmer128, the simplest PRNG that passes BigCrush, shows 0.5 cpb. Wow, great!

Next, a recent design used for fast hash tables, wyrand: 0.41 cpb, a little better!

Some PRNGs fail the 32-terabyte PractRand test but run very fast. Xoshiro256+ only reaches 512 mebibytes, but it is very quick: 0.34 cpb.

Another recent design, RomuTrio, claims to be the fastest PRNG in the world: 0.31 cpb.

Okay, that's enough. What does semi-SHISHUA get?

0.14 cpb. Twice as fast as RomuTrio.


Cool. Now let's test the cryptographic generator ChaCha8. It reaches... 0.12 cpb.

Oh.

SIMD is real magic!

For the cryptographic community, this was no great surprise: ChaCha8 is extremely easy to parallelize. It is essentially a well-hashed counter in a diffused state.

And remember how the Julia team combined several instances of Vigna's design to build a fast SIMD-based PRNG? Their result with this technique (eight Xoshiro256+ instances): 0.09 cpb!

Technically, my laptop could affect the results: I am not sure why the Julia team's design is faster than ChaCha8 on GCP but slower when tested locally. On my machine, semi-SHISHUA runs faster than the Julia design but slower than ChaCha8.

All of these competitors had to be beaten.

You are probably asking by now why I have been calling the design above "semi"-SHISHUA. Because it turned out to be easy to double the speed by running two copies of it.

Similar to the Julia team's idea, we initialize the two PRNGs separately (four 256-bit state blocks in total) and interleave their outputs.

But with more state, we can output even more data by combining the four states in pairs:

o0 = _mm256_xor_si256(u0, t1);
o1 = _mm256_xor_si256(u2, t3);
o2 = _mm256_xor_si256(s0, s3);
o3 = _mm256_xor_si256(s2, s1);

That gave us SHISHUA, which clocks in at 0.06 cpb.

This is twice as fast as the previous fastest generator in the world to pass the 32-terabyte PractRand test. The results are shown on the graph.

I believe the design turned out competitive. It runs even faster on my laptop, 0.03 cpb, but I will stick to my benchmarking principles.

Hopefully my generator will stay on the world's fastest podium for at least a few more weeks (please, do try to beat it).

Quality


The generator properly passes BigCrush and the 32-terabyte PractRand test, with all four of its output streams.

One architectural drawback is irreversibility. You can see it by reducing to a 4-bit state with s0 = [a, b] and s1 = [c, d]. The shift gives [0, a] and [0, c]; the shuffle gives [b, c] and [d, a]. The new s0 equals [b, c] + [0, a] = [b⊕(a∧c), a⊕c], and the new s1 equals [d, a] + [0, c] = [d⊕(a∧c), a⊕c]. If a = ¬c, then a⊕c = 1 and a∧c = 0, hence s0 = [b, 1] and s1 = [d, 1]. That is, two different combinations of a and c lead to the same next state.

In our case, this is not a problem, because the 64-bit counter is also part of the state. That yields a minimal cycle of 2^71 bytes (128 bytes per state transition), which at 10 GiB/s would last seven thousand years. It balances out the lost states.

In addition, even with the irreversibility, the average state-transition cycle length is 2^((256+1)÷2) states. That gives an average cycle of 2^135 bytes (at 10 GiB/s, more than a trillion times the age of the universe). Although I think average cycle lengths are overrated: they tell us nothing about the generator's quality.

Here are the benchmark results:

Generator        Performance   Quality    Seed correlation
SHISHUA          0.06          > 32 TiB   > 256 GiB
xoshiro256+x8    0.09          1 KiB      0 KiB
ChaCha8          0.12          > 32 TiB?  > 32 TiB?
RomuTrio         0.31          > 32 TiB   1 KiB
xoshiro256+      0.34          512 MiB    1 KiB
wyrand           0.41          > 32 TiB   8 KiB
Lehmer128        0.44          > 32 TiB   1 KiB
RC4              7.48          1 TiB      1 KiB

  1. Performance: processor cycles spent per generated byte, measured on GCP N2 (Intel) and N2D (AMD) cloud machines; the ordering is the same on both.
  2. Quality: the amount of output at which the generator fails PractRand. A > sign means it never failed; a question mark means the result is unverified.
  3. Seed correlation: PractRand run on the interleaved bytes of eight streams seeded with 0, 1, 2, 4, 8, 16, 32, 64, using PractRand with folding 2 and expanded tests.



Further


Even though irreversibility is not a problem in our case, SHISHUA can still be improved.

In my opinion, the ideal PRNG has the following properties:

  1. A provably enormous minimal period, say 2^1024 bytes: even at 10 GiB/s, it would never come close to cycling. And what about ARM, with its 128-bit NEON registers? That remains to be explored.
  2. A stronger output function: SHISHUA builds its output with XOR alone, and that can likely be improved.
  3. Proven computational indistinguishability from randomness up to 2^128 operations (cryptographic security), which SHISHUA does not claim.
  4. State initialization with perfect diffusion: every bit of the seed affects every bit of the state with equal probability. I would like to verify this property for SHISHUA.

One of the problems holding back the development of PRNGs, and cryptography as a whole, is the lack of good general-purpose tools. I want a tool that gives me precise measurements immediately, so that I can compare designs on the fly.

PractRand is a great improvement over what came before, but:

  • It cannot grade high-quality generators against each other; all we can say is: "well, after 32 terabytes they show no anomalies..."
  • It takes weeks to run.

I hope that the situation will improve greatly soon.
