On development trends in processor architecture, or why I believe in Huawei's success in the server market

We live in interesting times. It seems to me that the next 2-3 years will determine where architecture development will go for the next decade. There are now several players in the server processor market representing completely different approaches to the technology. And this is very cool (I even find it hard to say which syllable the stress falls on in that last word :)).
But even 5-6 years ago it seemed that time had frozen and development had stopped, having run into various kinds of walls (the power wall, the scalability wall, etc.). I talked a little about it here. Moore's law was called into question, and particularly ardent theorists suggested introducing logarithmic corrections into it :) Intel's dominance in the server processor market seemed unshakable then: AMD offered no serious competition, NVidia GPGPUs looked like a purely niche product, and ARM's attempts to break into the server market were unsuccessful.

Everything changed with the growth of segments such as Machine Learning and Artificial Intelligence. GPGPUs turned out to be much better suited to convolution and matrix multiplication (especially at low precision) than CPUs. In addition, NVidia was able to grow its transistor counts even faster than Moore's law predicts. As a result, the world of server architectures has become bipolar. At one pole is the x86 CPU, a latency engine: an out-of-order machine built to hide latencies. Its undeniable advantage is excellent single-thread performance; its disadvantage is the huge area and transistor count per core. At the other pole is the GPGPU, a throughput engine: a large number of simple execution units (EUs). Here the trade-off is reversed: each element is small, which allows a great many of them to be placed on one chip, but single-thread performance leaves much to be desired...
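To make the two poles concrete, here is a minimal C sketch (my own illustration, not from the original article). The first loop carries a dependency from each iteration to the next, so it rewards exactly what a big out-of-order core provides; the second has fully independent iterations and maps naturally onto many simple EUs:

```c
#include <stddef.h>

/* Latency-engine territory: a loop-carried dependency chain.
   Each addition must wait for the previous one, so a deep
   out-of-order core with high single-thread performance wins. */
double running_sum(const double *a, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];              /* iteration i depends on i-1 */
    return acc;
}

/* Throughput-engine territory: every iteration is independent,
   so the work can be spread across many simple execution units. */
void scale(double *a, size_t n, double k) {
    for (size_t i = 0; i < n; i++)
        a[i] *= k;                /* no cross-iteration dependency */
}
```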

Alas, attempts to combine a high-power CPU and GPU in one package have not been successful so far, if only because such a chip would consume and dissipate too much power. Therefore, discrete solutions are now in fashion. I don't really believe in them, for two reasons. First, the architecture becomes more complex, acquiring new bottlenecks in the form of the bus connecting the CPU and GPU and a heterogeneous memory structure. The second difficulty is partly a consequence of the first: such solutions are much harder to program. Each of the existing accelerator programming models (CUDA from NVidia, DPC++ from Intel, OpenCL, OpenMP) has its own advantages and disadvantages, and none of them is either universal or dominant.
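To give a feel for what this programming burden looks like, here is a hedged sketch using one of the models listed above, OpenMP target offload (my example, not the article's; the function and setup are hypothetical). Note how the map() clauses make the CPU-GPU data movement, the very bottleneck mentioned above, explicit:

```c
#include <stddef.h>

/* Offload a vector addition to an attached accelerator using
   OpenMP target directives (OpenMP 4.5+). The map() clauses
   describe the traffic over the CPU-GPU bus: inputs are copied
   to the device, the result is copied back to the host. */
void vector_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```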

It seems to me that, from the point of view of architectural development, the right step was taken by AMD with its Rome processor. Thanks to the compactness of the core (compared to Intel's), more cores could be placed in one package. However, this alone is not enough: for such a solution to scale in performance, one must also take care of communication between the cores (the uncore) and of the quality of the parallel runtimes. AMD's engineers managed to solve both problems and achieve very competitive results on one of the most important industrial benchmarks, SPEC CPU. AMD's solution sits between the poles occupied by NVidia and Intel, but much closer to the latter: it is still the same "big core." The golden mean between the polar approaches, it seems to me, requires even more compact cores, and of the existing architectures, ARM has the best chance of occupying this niche.
So why ARM from Huawei specifically? I found the answer for myself: as the number of cores on a chip grows, the performance of a single core matters less and less (up to a certain limit), while communication between the cores and with memory matters more and more. And communication is an area where Huawei is a world leader. It is the uncore design (not just, and not even primarily, the core) that will determine the performance of the system. And here, I think, we have good chances.
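One way to make this intuition explicit (a back-of-the-envelope model of my own, not from the article) is an Amdahl-style speedup formula with a communication term:

$$S(N) = \frac{1}{s + \dfrac{1-s}{N} + c(N)}$$

where $s$ is the serial fraction of the work, $N$ is the number of cores, and $c(N)$ is the per-operation overhead of inter-core and memory communication, which grows with $N$. Once $c(N)$ dominates the denominator, neither more cores nor faster cores help much; only a better uncore does.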

However, ideal architectures exist only in a vacuum. In reality, one always has to reckon with the amount and structure of the software that already exists in the server market, and it has been written and optimized for x86 for years. It will take a lot of time and effort to make it more "friendly" to the ARM architecture. Huge work lies ahead, both on software tools (compilers, libraries, runtimes) and on application adaptation (application engineering). But I believe that the road will be mastered by the one who walks it.
