Cloudflare selects AMD processors for tenth-generation edge servers



More than a billion unique IP addresses pass through the Cloudflare network every day. The network serves more than 11 million HTTP requests per second and is within 100 ms of 95% of the Internet population. It spans 200 cities in more than 90 countries, and our engineering team has built an extremely fast and reliable infrastructure.

We are very proud of our work and are committed to helping make the Internet better and safer. Cloudflare's hardware engineers have a deep understanding of servers and their components, so that they can choose the best hardware and get the most performance out of it.

Our software stack handles compute-heavy workloads and depends heavily on CPU speed, so our engineers constantly optimize Cloudflare's performance and reliability at every level of the stack. On the server side, the simplest way to add computing power is to add CPU cores. The more cores a server holds, the more data it can process. This matters to us because our products and customers grow more diverse over time, and growing request volume demands more performance from our servers. To get it, we needed to increase core density, and that is exactly what we have done. Below are the processor details for the servers we have deployed since 2015, including core counts:

Generation        Gen 6                    Gen 7                    Gen 8                    Gen 9
Start of service  2015                     2016                     2017                     2018
CPU               Intel Xeon E5-2630 v3    Intel Xeon E5-2630 v4    Intel Xeon Silver 4116   Intel Xeon Platinum 6162
Physical cores    2 x 8                    2 x 10                   2 x 12                   2 x 24
TDP               2 x 85W                  2 x 85W                  2 x 85W                  2 x 150W
TDP per core      10.63W                   8.50W                    7.08W                    6.25W


In 2018, Gen 9 brought a big jump in the total number of cores per server. Our physical footprint shrank by 33% compared with Gen 8, which let us increase capacity and computing power per rack. We list Thermal Design Power (TDP) to show that our energy efficiency has also improved over time. This matters to us for two reasons: we want to emit less carbon into the atmosphere, and we want to make the best use of the power our data centers provide. But we know there is room to improve.

Our defining metric is requests per watt. We can increase requests per second by adding cores, but we have to stay within our power budget. We are constrained by the data centers' power infrastructure, which, together with the power distribution units we select, sets an upper limit for each server rack. Adding servers to a rack increases its power consumption. If we exceed the per-rack power limit and have to add new racks, operating costs rise sharply. So we need to increase computing power while staying within the same power envelope, which increases requests per watt, our key metric.
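The rack-level constraint can be sketched in a few lines of Python. This is only an illustration: the rack budget, per-server power draw and request rates below are hypothetical placeholders, not Cloudflare figures, and the function names are invented for the example.

```python
def servers_per_rack(rack_budget_w, server_power_w):
    """How many servers fit under the rack's power cap."""
    return int(rack_budget_w // server_power_w)


def rack_requests_per_watt(rack_budget_w, server_power_w, server_rps):
    """Requests per second delivered per watt of rack power budget."""
    count = servers_per_rack(rack_budget_w, server_power_w)
    return count * server_rps / rack_budget_w


# All numbers below are hypothetical placeholders, not measurements.
RACK_BUDGET_W = 8000
before = rack_requests_per_watt(RACK_BUDGET_W, server_power_w=400, server_rps=10_000)
after = rack_requests_per_watt(RACK_BUDGET_W, server_power_w=400, server_rps=12_500)

print(f"Servers per rack: {servers_per_rack(RACK_BUDGET_W, 400)}")
print(f"Requests per watt, slower server: {before:.2f}")
print(f"Requests per watt, faster server: {after:.2f}")
```

With the power cap fixed, the only way to raise the rack's requests per watt in this model is for each server to do more work at the same power draw, which is the argument the paragraph makes.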

As you might have guessed, we studied power consumption carefully at the design stage. The table above shows that there is no point in deploying a more power-hungry CPU if its TDP per core is higher than the current generation's, because that would hurt our requests-per-watt metric. We carefully evaluated the production-ready systems on the market for Generation X and made a decision. We are moving from our dual-socket, 48-core Intel Xeon Platinum 6162 design to a single-socket, 48-core AMD EPYC 7642.



Vendor              Intel                        AMD
CPU                 Xeon Platinum 6162           EPYC 7642
Microarchitecture   “Skylake”                    “Zen 2”
Codename            “Skylake SP”                 “Rome”
Process technology  14nm                         7nm
Cores               2 x 24                       48
Frequency           1.9 GHz                      2.4 GHz
L3 cache / socket   24 x 1.375MiB                16 x 16MiB
Memory / socket     6 channels, up to DDR4-2400  8 channels, up to DDR4-3200
TDP                 2 x 150W                     225W
PCIe / socket       48 lanes                     128 lanes
ISA                 x86-64                       x86-64


The specifications show that the AMD chip lets us keep the same number of cores while lowering TDP. Gen 9 had a TDP per core of 6.25 W; Gen X will have 4.69 W, a decrease of 25%. Given the higher frequency, and possibly the simpler single-socket design, we can expect the AMD chip to perform better in practice. We are running a range of tests and simulations to find out how much better.
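The TDP-per-core figures can be reproduced from the socket counts, core counts and TDP values in the two tables. The short Python sketch below does that arithmetic; the dictionary layout and function name are just one way to organize it.

```python
# (sockets, cores per socket, TDP per socket in watts), copied from the tables.
SERVERS = {
    "Gen 6": (2, 8, 85),
    "Gen 7": (2, 10, 85),
    "Gen 8": (2, 12, 85),
    "Gen 9": (2, 24, 150),
    "Gen X": (1, 48, 225),
}


def tdp_per_core(sockets, cores_per_socket, tdp_per_socket_w):
    """Total rated TDP of the server's CPUs divided by total physical cores."""
    return (sockets * tdp_per_socket_w) / (sockets * cores_per_socket)


per_core = {gen: tdp_per_core(*spec) for gen, spec in SERVERS.items()}
for gen, watts in per_core.items():
    print(f"{gen}: {watts:.2f} W per core")

# Relative drop in TDP per core from Gen 9 to Gen X.
drop = 1 - per_core["Gen X"] / per_core["Gen 9"]
print(f"Gen 9 -> Gen X: {drop:.0%} lower TDP per core")
```

Because both servers have 48 cores, the per-core drop is the same as the drop in total CPU TDP, from 300 W to 225 W.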

Note, however, that TDP is a simplified metric from the manufacturer's specifications, which we used in the early stages of server design and CPU selection. A quick Google search shows that AMD and Intel define TDP differently, which makes the specification unreliable. What we actually use for the final decision is the real power consumption of the CPU and, more importantly, of the whole server.

Ecosystem readiness


At the start of our search for the next processor, we looked at a wide range of CPUs from different manufacturers that suit our software stack and services (written in C, LuaJIT and Go). We have already described our set of benchmarking tools in detail in an earlier blog post. We used the same set here: it lets us evaluate a CPU's performance in a reasonable amount of time, after which our engineers can start tuning our software for a specific processor.

We tested a variety of processors with different core counts, socket counts and frequencies. Since this article explains why we settled on the AMD EPYC 7642, all of the charts here focus on how the AMD processor performs compared with the Intel Xeon Platinum 6162 from our 9th generation.

The results are measurements from a single server with each processor variant: two 24-core Intel processors in a dual-socket server, or one 48-core AMD EPYC processor in a single-socket server. In the BIOS, we set the parameters to match our production servers. This means 3.03 GHz for AMD and 2.5 GHz for Intel. As a very rough simplification, with the same number of cores we would expect AMD to perform about 21% better than Intel.
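The 21% expectation is simply the ratio of the two clock speeds. A minimal Python sketch of that estimate follows; it assumes throughput scales linearly with frequency at equal core counts, which ignores cache, memory and microarchitecture differences.

```python
def expected_speedup(freq_a_ghz, freq_b_ghz):
    """Naive estimate: with equal core counts, throughput scales with clock speed."""
    return freq_a_ghz / freq_b_ghz - 1


gain = expected_speedup(3.03, 2.5)  # AMD vs. Intel, frequencies from the text
print(f"Expected AMD advantage from clock speed alone: {gain:.0%}")
```

Benchmark results well above or below this figure point to effects other than clock speed.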

Cryptography






The results look promising for AMD. It performs 18% better on public-key cryptography. On symmetric-key cryptography it loses on the AES-128-GCM variants, but overall its performance is comparable.

Compression


On edge servers, we compress a lot of data to save bandwidth and speed up content delivery. We pass data through the zlib and brotli C libraries. All tests ran against the HTML of blog.cloudflare.com held in memory.
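To give a feel for this kind of test, here is a minimal Python sketch that times in-memory compression at several zlib levels. It is not the benchmark suite used for the results here: it uses Python's standard zlib binding rather than the C library directly, it leaves out brotli (which is not in the Python standard library), and the sample page is a hypothetical stand-in for real HTML.

```python
import time
import zlib


def compress_throughput(data, level, repeats=50):
    """Return (MiB/s, compression ratio) for zlib at the given level."""
    start = time.perf_counter()
    for _ in range(repeats):
        compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    mib_per_s = len(data) * repeats / elapsed / (1024 * 1024)
    return mib_per_s, len(data) / len(compressed)


# Hypothetical stand-in for an HTML page held in memory.
page = b"<html><body>" + b"<p>Hello from the edge.</p>" * 2000 + b"</body></html>"

for level in (1, 6, 9):
    speed, ratio = compress_throughput(page, level)
    print(f"zlib level {level}: {speed:.1f} MiB/s, ratio {ratio:.1f}x")
```

Running the same script on two machines gives a rough per-core comparison; a real benchmark would use representative content and pin the process to a core.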





AMD wins by an average of 29% with gzip. With brotli the results are even better at quality 7, which we use for dynamic compression. There is a sharp drop on the brotli-9 test, which we attribute to Brotli consuming a lot of memory and overflowing the cache. Even so, AMD wins by a wide margin.

Many of our services are written in Go. In the following graphs, we rerun the cryptography and compression tests in Go, along with regexp tests on 32 KB strings and tests of the strings library.

Go cryptography




Go compression






Go regexp






Go strings




AMD comes out ahead in every Go test except ECDSA P256 Sign, where it is 38% behind. That is odd, given that it was 24% better on the same test in C, and it is worth investigating. Overall, AMD's lead in Go is modest, but it still shows the better results.

LuaJIT


We use LuaJIT heavily in our stack. It is the glue that holds all the parts of Cloudflare together, and we are glad that AMD won here too.

Overall, the tests show that one EPYC 7642 performs better than two Xeon Platinum 6162s. AMD loses on a couple of tests, such as AES-128-GCM and Go OpenSSL ECDSA-P256 Sign, but wins on all the others, by 25% on average.

Workload simulation


After our quick benchmarks, we ran the servers through another set of simulations in which a synthetic load is applied to our edge software stack. Here we simulate scripted workloads with the various types of requests seen in production. Requests vary in data volume, protocol (HTTP or HTTPS), use of WAF or Workers, and many other variables. Below is a comparison of the throughput of the two CPUs for the request types we see most often.



The results in the chart are measured against the baseline of our Gen 9 machines with Intel processors, normalized to a value of 1.0 on the x axis. For example, for simple 10 KiB requests over HTTPS, AMD delivers 1.5 times as many requests per second as Intel. On average, AMD performed 34% better than Intel across these tests. Given that the TDP of a single AMD EPYC 7642 is 225 W, against 300 W for two Intel processors, AMD delivers up to 2 times the requests per watt of Intel!
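The requests-per-watt comparison follows from the throughput ratio and the two TDP figures. The Python sketch below works it through for both the 1.5x HTTPS case and the 34% average; it uses rated TDP as a stand-in for power draw, which is a simplification.

```python
def requests_per_watt_ratio(throughput_ratio, baseline_power_w, candidate_power_w):
    """Requests-per-watt of the candidate relative to the baseline."""
    return throughput_ratio * baseline_power_w / candidate_power_w


INTEL_TDP_W = 2 * 150  # two Xeon Platinum 6162 sockets
AMD_TDP_W = 225        # one EPYC 7642 socket

best_case = requests_per_watt_ratio(1.5, INTEL_TDP_W, AMD_TDP_W)
average = requests_per_watt_ratio(1.34, INTEL_TDP_W, AMD_TDP_W)

print(f"10 KiB HTTPS case: {best_case:.2f}x requests per watt")
print(f"Average of the tests: {average:.2f}x requests per watt")
```

On this TDP-based estimate, the 2x figure corresponds to the 1.5x HTTPS case, while the 34% average works out to a somewhat smaller ratio.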

At this point, we were clearly leaning towards a single-socket AMD EPYC 7642 as our CPU for Generation X. We were very keen to see how AMD EPYC servers behave in production, so we immediately shipped several servers to some of our data centers.

Real work


The first step, of course, was to get the servers ready for production. All the machines in our fleet run the same processes and services, which makes for a fair comparison of performance. Like most data centers, we have several generations of servers deployed, and we group our servers into clusters so that each cluster contains servers of roughly the same generation. In some setups, this can cause utilization curves to differ between clusters. Not in ours. Our engineers have optimized CPU utilization across all generations, so that whether a machine's CPU has 8 cores or 24, its CPU usage is generally the same as the rest.



The graph illustrates this point about similar utilization: there is no significant difference between CPU utilization on the AMD-based Gen X servers and on the Intel-based Gen 9 servers. This means that the test servers and the baseline servers are equally loaded. Good. This is exactly what we aim for in our servers, and we need it for a fair comparison. The two graphs below show the number of requests processed per CPU core and by all cores at the server level.


[Graphs: requests per CPU core; requests per server]

On average, AMD processes 23% more requests. Not bad at all! We have often written on our blog about ways to increase Gen 9 performance. Here we have the same number of cores, yet AMD does more work with less power. The core count and TDP in the specifications already suggest that AMD delivers more speed with greater energy efficiency.

But, as we already mentioned, TDP is not a standardized specification and is defined differently by each manufacturer, so let's look at actual power usage. By measuring server power consumption alongside requests per second, we got the following graph:



In requests per second per watt, the Gen X server with AMD processors is 28% more efficient. One might have expected more, given that AMD's TDP is 25% lower, but remember that TDP is an ambiguous figure. We saw that AMD's actual power consumption almost matches its rated TDP while running at a frequency well above base; that is not the case for Intel. This is another reason why TDP is not a reliable estimate of power consumption. The Intel CPUs in our Gen 9 servers are built into a multi-node system, while the AMD CPUs run in standard 1U form-factor servers. That comparison does not favor AMD, since multi-node servers should provide higher density with lower power consumption per node, yet AMD still beat Intel on power consumption per node.
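The gap between the TDP-based expectation and the measured result can be made concrete with a little arithmetic. The Python sketch below is a rough illustration only: it assumes the 23% throughput gain and the 28% efficiency gain were measured on the same servers over the same period, which the article does not state explicitly.

```python
THROUGHPUT_GAIN = 0.23       # AMD processes 23% more requests
EFFICIENCY_GAIN = 0.28       # measured requests-per-watt improvement
TDP_RATIO = 225 / (2 * 150)  # AMD TDP relative to Intel TDP, i.e. 25% lower

# What requests per watt would look like if power tracked TDP exactly.
expected_from_tdp = (1 + THROUGHPUT_GAIN) / TDP_RATIO

# Server power ratio implied by the two measured gains (assumption: same test).
implied_power_ratio = (1 + THROUGHPUT_GAIN) / (1 + EFFICIENCY_GAIN)

print(f"Expected from TDP alone: {expected_from_tdp:.2f}x requests per watt")
print(f"Measured: {1 + EFFICIENCY_GAIN:.2f}x requests per watt")
print(f"Implied Gen X server power relative to Gen 9: {implied_power_ratio:.2f}")
```

Under these assumptions, whole-server power differs far less between the two generations than the CPU TDP figures suggest, which is consistent with the warning that TDP is not a reliable estimate of power consumption.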

Across most comparisons of specifications, benchmark simulations and real-world performance, the 1P AMD EPYC 7642 configuration proved significantly better than the 2P Intel Xeon Platinum 6162. Under some conditions, AMD can perform 36% better, and we believe that by tuning hardware and software we can achieve that improvement consistently.

It turns out that AMD won.

Additional graphs show the average latency and p99 latency of NGINX over 24 hours. On average, processes ran 25% faster on AMD. At p99, it runs 20-50% faster depending on the time of day.
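For readers unfamiliar with the p99 metric, the Python sketch below shows one common way to compute it (the nearest-rank method) and to compare two sets of latency samples. The sample values are hypothetical, and defining "faster" as the fractional reduction in latency is an assumption; the article does not say how its percentages were computed.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_reduction(baseline, candidate):
    """Fraction by which the candidate latency is lower than the baseline."""
    return 1 - candidate / baseline


# Hypothetical latency samples in milliseconds, not real measurements.
gen9_ms = [10.0] * 98 + [40.0, 50.0]
genx_ms = [7.5] * 98 + [24.0, 30.0]

mean_gain = latency_reduction(sum(gen9_ms) / len(gen9_ms), sum(genx_ms) / len(genx_ms))
p99_gain = latency_reduction(percentile(gen9_ms, 99), percentile(genx_ms, 99))

print(f"Mean latency reduction: {mean_gain:.0%}")
print(f"p99 latency reduction: {p99_gain:.0%}")
```

The p99 value tracks the slowest 1% of requests, so it can move quite differently from the mean, which is why both are reported.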

Conclusion


Cloudflare’s hardware and performance engineers do a significant amount of testing and research to select the best server configuration for our customers. We like working here because we get to solve problems on this scale, and to help you solve yours with services such as serverless edge computing and a range of security solutions, including Magic Transit, Argo Tunnel and DDoS protection. All servers in the Cloudflare network are configured for reliable operation, and we always try to make each generation of servers better than the last. We believe the AMD EPYC 7642 is the answer to the question of which processor to choose for Gen X.

With Cloudflare Workers, developers deploy their applications on our network as it expands around the world. We are proud to let our customers concentrate on writing code while we take care of security and reliability in the cloud. And today we are even more pleased to announce that their work will be deployed on our Gen X servers running second-generation AMD EPYC processors.


EPYC 7642 processors, codenamed “Rome”

Using AMD's EPYC 7642, we were able to increase our performance and make it easier to expand the network to new cities. Rome was not built in a day, but soon it will be closer to many of you.

In the past couple of years, we have experimented with many x86 chips from Intel and AMD, as well as processors from ARM. We expect that in the future, these CPU manufacturers will work together with us so that we can together build an improved Internet.
