GPU Computing - Why, When, and How. Plus some tests

It has long been known that video cards can do more than run games: you can also use them to train a neural network, mine cryptocurrency, or run scientific calculations. How that came about you can read about here, but I wanted to touch on why the GPU may be of interest to an ordinary programmer (one not involved in GameDev), how to approach GPU development without spending a lot of time on it, how to decide whether to look in this direction at all, and to estimate roughly what profit you can get.



The article is based on my talk at HighLoad++. It mainly discusses technologies offered by NVIDIA. I have no intention of advertising any products; I only give them as examples, and something similar can almost certainly be found from competing manufacturers.

Why compute on the GPU?


Two processors can be compared by different criteria; the most popular are probably clock frequency, number of cores, cache size, and so on. But in the end, what we care about is how many operations a processor can perform per unit of time. What kind of operation that is remains a separate question, but a common metric is the number of floating-point operations per second, or FLOPS. And when we want to compare apples with oranges, in our case a GPU with a CPU, this metric comes in handy.

The graph below shows the growth of FLOPS over time for processors and video cards.


(Data collected from open sources; there is no data for 2019-2020, because the picture there is not quite as pretty, but GPUs still win.)

Well, it's tempting, isn't it? We shift all the calculations from the CPU to the GPU and get eight times the performance!

But, of course, not everything is so simple. You can't just take everything and move it to the GPU; why not, we'll discuss below.

GPU architecture and its comparison with the CPU


Here is a picture, familiar to many, of the CPU architecture and its basic elements:


CPU Core

What's so special? One core and a bunch of auxiliary blocks.

Now let's look at the GPU architecture:


GPU Core

A video card has many processing cores, usually several thousand, but they are combined into blocks (for NVIDIA video cards, usually 32 cores each) that share common elements, including registers. The architecture of a GPU core and its logic elements is much simpler than that of a CPU: there are no prefetchers, no branch predictors, and much more is missing.

Well, these are the key architectural differences between the CPU and the GPU, and, in fact, they impose restrictions or, conversely, open up possibilities for what we can compute efficiently on the GPU.

I have not yet mentioned one more important point: the video card and the processor usually do not share memory, so writing data to the video card and reading the result back are separate operations and may turn out to be the bottleneck in your system; a graph of transfer time versus data size is given later in the article.

GPU limitations and features


What limitations does this architecture impose on executable algorithms:

  • If we are computing on a GPU, we cannot allocate just a single core; a whole block of cores will be allocated (32 for NVIDIA).
  • All cores execute the same instructions but on different data (we will talk about this later); such calculations are called Single Instruction, Multiple Data, or SIMD (although NVIDIA introduces its own refinement of the term).
  • Due to the relatively simple set of logic blocks and shared registers, the GPU really does not like branching, or complex logic in algorithms in general.

What opportunities does it open:

  • Actually, the acceleration of those same SIMD calculations. The simplest example is the element-wise addition of two arrays; let's analyze it.

Reduction of classical algorithms to SIMD representation


Transformation


We have two arrays, A and B, and we want to add each element of array B to the corresponding element of array A. Below is an example in C, although I hope it will be clear even to those who do not know this language:

void func(float *A, float *B, int size)
{
   for (int i = 0; i < size; i++)
   {
       A[i] += B[i];
   }
}

A classical iteration over the elements in a loop, with linear execution time.

Now let's see how such code will look for the GPU:

__global__ void func(float *A, float *B, int size)
{
   int i = threadIdx.x;
   if (i < size)
      A[i] += B[i];
}

And here it gets interesting: the threadIdx variable has appeared, which we did not declare anywhere. It is provided by the system. Imagine that in the previous example the array consists of three elements and you want to process it in three parallel threads. To do this, you would need to add one more parameter, the index or thread number. That is exactly what the video card does for us, though it passes the index as a built-in variable and can work with several dimensions at once: x, y, z.

Another nuance: if you are going to launch a large number of parallel threads at once, the threads will have to be divided into blocks (an architectural feature of video cards). The maximum block size depends on the video card, and the index of the element we are computing will then need to be obtained as follows:

int i = blockIdx.x * blockDim.x + threadIdx.x; // blockIdx – block index, blockDim – block size (threads per block), threadIdx – thread index within the block

As a result, what we have is a lot of threads running in parallel that execute the same code but with different indices and, accordingly, different data, i.e. the same SIMD.

This is the simplest example, but if you want to work with the GPU, you need to bring your task to the same form. Unfortunately, this is not always possible, and in some cases it may become the subject of a doctoral dissertation; nevertheless, classical algorithms can still be brought to this form.
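To make the picture complete, here is a sketch of how such a kernel is launched from the host; the array size and names are arbitrary and error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void add(float *A, const float *B, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < size)
        A[i] += B[i];
}

int main()
{
    const int size = 1 << 20;                       // 1M elements (arbitrary)
    size_t bytes = size * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes);
    for (int i = 0; i < size; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (size + block - 1) / block;          // enough blocks to cover all elements
    add<<<grid, block>>>(dA, dB, size);

    cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);  // device -> host
    printf("A[0] = %f\n", hA[0]);                   // expect 3.0

    cudaFree(dA); cudaFree(dB); free(hA); free(hB);
    return 0;
}

Such a file is compiled with the NVIDIA compiler, e.g. nvcc add.cu -o add.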

Aggregation


Now let's see what aggregation looks like when reduced to a SIMD representation:
 

We have an array of n elements. In the first step, we launch n/2 threads, and each thread adds two elements, i.e. in one iteration we add together half of the elements of the array. Then, in a loop, we repeat the same thing for the newly obtained array until we add together the last two elements. As you can see, the smaller the array becomes, the fewer parallel threads we can launch, i.e. on a GPU it makes sense to aggregate arrays of sufficiently large size. Such an algorithm can be used to compute the sum of the elements (by the way, do not forget about possible overflow of the data type you are working with), or to find the maximum, the minimum, or a specific value.
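A minimal sketch of this scheme, assuming the array length is a power of two and the data is already in the card's memory (the function names are my own):

#include <cuda_runtime.h>

// One reduction step: thread i adds element i and element i + half,
// so after the launch the first `half` elements hold partial sums.
__global__ void reduce_step(float *data, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half)
        data[i] += data[i + half];
}

// Host-side loop: halve the active part of the array until one element is left.
// Assumes n is a power of two; the array contents are overwritten in the process.
float gpu_sum(float *d_data, int n)
{
    const int block = 256;
    for (int half = n / 2; half > 0; half /= 2)
    {
        int grid = (half + block - 1) / block;
        reduce_step<<<grid, block>>>(d_data, half);
    }
    float result;
    cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    return result;
}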

Sorting


But sorting already looks a lot more complicated.

The two most popular sorting algorithms on the GPU are:

  • Bitonic-sort
  • Radix-sort

Radix-sort, however, is used more often, and production-ready implementations can be found in some libraries. I will not analyze in detail how these algorithms work; those who are interested can find a description of radix-sort at https://www.codeproject.com/Articles/543451/Parallel-Radix-Sort-on-the-GPU-using-Cplusplus-AMP and https://stackoverflow.com/a/26229897

But the idea is that even such a non-linear algorithm as sorting can be reduced to a SIMD form.
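To give a feel for how such a sort maps onto SIMD-style threads, here is a sketch of a single compare-and-exchange step of bitonic sort; it assumes the array length is a power of two of at least 256 and fits into one launch, and the names are illustrative.

#include <cuda_runtime.h>

// One compare-and-exchange step of bitonic sort:
// j is the distance between compared elements, k defines the sort direction pattern.
__global__ void bitonic_step(float *data, int j, int k)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int partner = i ^ j;                 // the element this thread compares with
    if (partner > i)
    {
        bool ascending = ((i & k) == 0);
        if ((ascending  && data[i] > data[partner]) ||
            (!ascending && data[i] < data[partner]))
        {
            float tmp = data[i];                  // swap the pair into the right order
            data[i] = data[partner];
            data[partner] = tmp;
        }
    }
}

// Host-side driver: n must be a power of two (and a multiple of 256 here).
void bitonic_sort(float *d_data, int n)
{
    for (int k = 2; k <= n; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            bitonic_step<<<n / 256, 256>>>(d_data, j, k);
}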

And now, before we look at the real numbers that can be obtained from the GPU, let's figure out how to program for this miracle of technology.

Where to start


The two most common technologies that can be used for GPU programming are:

  • OpenCL
  • CUDA

OpenCL is a standard supported by most video card manufacturers, including on mobile devices; code written in OpenCL can also be run on the CPU.

You can use OpenCL from C/C++; there are bindings to other languages.

For OpenCL, I liked the book OpenCL in Action the most. It also describes various algorithms on the GPU, including bitonic-sort and radix-sort.

CUDA is NVIDIA's proprietary technology and SDK. You can write in C/C++ or use bindings to other languages.

Comparing OpenCL with CUDA is not quite correct, since one is a standard and the other is a whole SDK. Nevertheless, many people choose CUDA for video card development even though the technology is proprietary (albeit free) and works only on NVIDIA cards. There are several reasons for this:

  • A more convenient API
  • The code that runs on the GPU and the controlling (host) code can be written in the same language
  • More documentation, examples, etc.

Among its peculiarities: CUDA comes with its own compiler (nvcc), which can also compile standard C/C++ code.

The most comprehensive CUDA book I came across was Professional CUDA C Programming; although it is already a little outdated, it still discusses many technical nuances of programming for NVIDIA cards.

But what if I do not want to spend a couple of months reading these books, writing my own program for the video card, testing and debugging it, only to find out that all this is not for me?

As I said, there is a large number of libraries that hide the complexity of GPU development: XGBoost, cuBLAS, TensorFlow, PyTorch and others. We will consider the Thrust library, since it is less specialized than the others listed above, yet it implements basic algorithms such as sorting, searching and aggregation, and there is a good chance it can be applied to your tasks.

Thrust is a C++ library that aims to "replace" standard STL algorithms with algorithms running on the GPU. For example, sorting an array of numbers with this library on a video card looks like this:

thrust::host_vector<DataType> h_vec(size); // allocate a vector in host (RAM) memory
std::generate(h_vec.begin(), h_vec.end(), rand); // fill it with data
thrust::device_vector<DataType> d_vec = h_vec; // copy the data into the video card's memory
thrust::sort(d_vec.begin(), d_vec.end()); // sort on the GPU
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); // copy the sorted data back to RAM

(do not forget that the example must be compiled with the NVIDIA compiler)

As you can see, thrust::sort is very similar to the analogous algorithm from the STL. This library hides many difficulties, in particular the development of the subprogram (more precisely, the kernel) that will be executed on the video card, but at the same time it takes away flexibility. For example, if we want to sort several gigabytes of data, it would be logical to send one piece of data to the card and start sorting it, and while the sorting is in progress, send more data to the card. This approach is called latency hiding and allows more efficient use of the server card's resources, but, unfortunately, when we use high-level libraries, such possibilities remain hidden. Still, for prototyping and performance measurements they are just right; with Thrust, for instance, you can measure the overhead that data transfer introduces.
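For reference, here is what latency hiding looks like one level lower, with CUDA streams used to overlap transfers and computation. This is a minimal sketch under assumptions: process_chunk is a placeholder kernel standing in for the real work, the chunk count divides the data evenly, and h_buf must be pinned (page-locked) memory for the asynchronous copies to actually overlap.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-chunk work.
__global__ void process_chunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Process `total` floats in `numChunks` pieces, overlapping copies and compute:
// while chunk c is being computed, the next chunk is already being copied to the card.
void pipeline(float *h_buf, float *d_buf, int total, int numChunks)
{
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; s++)
        cudaStreamCreate(&streams[s]);

    int chunk = total / numChunks;            // assume it divides evenly
    int block = 256;
    int grid  = (chunk + block - 1) / block;

    for (int c = 0; c < numChunks; c++)
    {
        int s = c % kStreams;
        size_t off = (size_t)c * chunk;

        // copy chunk c to the card, compute on it, copy it back - all in stream s
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<grid, block, 0, streams[s]>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_buf + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();                  // wait for all streams to finish
    for (int s = 0; s < kStreams; s++)
        cudaStreamDestroy(streams[s]);
}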

I wrote a small benchmark using this library that runs several popular algorithms with different amounts of data on the GPU; let's see what the results are.

GPU Algorithm Results


To test the GPU, I took an instance in AWS with a Tesla K80 video card. This is not the most powerful server card to date (the most powerful is the Tesla V100), but it is the most affordable one and has on board:

  • 4992 CUDA cores
  • 24 GB of memory
  • 480 GB/s memory bandwidth

For the CPU tests, I took an instance with an Intel Xeon CPU E5-2686 v4 @ 2.30GHz processor.

Transformation



Transformation execution time on the GPU and CPU in ms

As you can see, a simple transformation of the array elements takes approximately the same time on the GPU as on the CPU. Why? Because the overhead of sending data to the card and back eats up the entire performance gain (we will talk about the overhead separately), and relatively few computations are done on the card. Also, do not forget that processors support SIMD instructions too, and in simple cases compilers can use them effectively.

Let's now see how efficiently aggregation is done on the GPU.

Aggregation



Aggregation execution time on GPU and CPU in ms

In the aggregation example we already see a significant performance increase as the data volume grows. It is also worth noting that we pump a large amount of data into the card's memory, but only a single aggregated value is taken back, i.e. the overhead of transferring data from the card to RAM is minimal.

Let's move on to the most interesting example - sorting.

Sorting



Sorting time on the GPU and CPU in ms

Despite the fact that we send the entire data array to the video card and back, sorting 800 MB of data on the GPU is approximately 25 times faster than on the processor.

Data transfer overhead


As can be seen from the transformation example, it is not always obvious whether the GPU will be effective even for tasks that parallelize well. The reason is the overhead of transferring data from the computer's RAM to the video card's memory (in game consoles, by the way, memory is shared between the CPU and GPU, so there is no need to transfer data). One of the characteristics of a video card is its memory bandwidth, which determines the card's theoretical throughput: for the Tesla K80 it is 480 GB/s, for the Tesla V100 it is already 900 GB/s. The throughput is also affected by the PCI Express version and by how you implement the transfer of data to the card; for example, it can be done in several parallel streams.
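If you want to estimate this overhead on your own hardware, a minimal sketch using CUDA events could look like the following; the 256 MB buffer size is an arbitrary choice, and pinned host memory is used to get closer to the card's peak bandwidth.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 256ull << 20;        // 256 MB test buffer (arbitrary)
    float *h, *d;
    cudaMallocHost(&h, bytes);                // pinned host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // host -> device transfer
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("HtoD: %.2f ms, %.2f GB/s\n", ms, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}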

Let's look at practical results obtained for the Tesla K80 card in the Amazon cloud:


Time to transfer data to the GPU, sort and transfer data back to RAM in ms

HtoD - transfer data to the video card

Execution - sort on the video card

DtoH - copy data from the video card to RAM


The first thing to note is that reading data from the video card is faster than writing data to it.

The second is that when working with a video card you can get latencies starting from about 350 microseconds, and this may already be enough for some low-latency applications.

The chart below shows the overhead for larger amounts of data:


Time to transfer data to the GPU, sort and transfer data back to RAM in ms

Server use


The most common question is: how does a gaming video card differ from a server one? Their characteristics are very similar, but prices differ significantly.


The main differences between a server card (NVIDIA) and a gaming card:

  • Manufacturer's warranty (the game card is not designed for server use)
  • Possible virtualization issues for a consumer graphics card
  • Availability of error correction mechanism on the server card
  • The number of parallel streams (not CUDA cores), or support for Hyper-Q, which allows you to work with the card from several CPU threads, for example, uploading data to the card from one thread and starting calculations from another

These are, perhaps, the main important differences that I found.

Multithreading


Now that we have figured out how to run the simplest algorithm on a video card and what results can be expected, the next logical question is how the video card behaves when processing several parallel requests. As an answer, I have two graphs of computations on a GPU and on processors with 4 and 32 cores:


The time taken to perform mathematical calculations on the GPU and CPU with matrices of 1000 x 60 in ms

This graph shows calculations with matrices of 1000 x 60 elements. The calculations are launched from several program threads, and for each CPU thread a separate stream is created on the GPU (the very Hyper-Q mentioned earlier is used).

As you can see, the processor copes with this load very well, while the latency of a single request on the GPU grows significantly as the number of parallel requests increases.


The time for performing mathematical calculations on the GPU and CPU with matrices 10,000 x 60 in ms.

The second graph shows the same calculations but with matrices 10 times larger, and under such a load the GPU behaves much better. These graphs are quite illustrative, and we can conclude that behavior under load depends on the nature of the load itself. A processor can also handle matrix calculations quite efficiently, but only up to a point. With a video card, it is characteristic that for a small computational load performance drops roughly linearly, while as the load and the number of parallel threads increase, the video card copes better.

It is difficult to predict how the GPU will behave in every situation, but, as you can see, under certain conditions a server card can process requests from several parallel streams quite efficiently.

We will discuss a few more questions that you may have if you still decide to use the GPU in your projects.

Resource limit


As we have already said, the two main resources of a video card are computing cores and memory.

For example, suppose we have several processes or containers that use the video card and we would like to be able to share the card between them. Unfortunately, there is no simple API for this. NVIDIA offers vGPU technology, but I did not find the Tesla K80 in the list of supported cards, and as far as I can tell from the description, the technology is focused more on virtual displays than on computation. Perhaps AMD offers something more suitable.

Therefore, if you plan to use the GPU in your projects, you should expect that the application will use the video card exclusively, or you will have to programmatically control the amount of allocated memory and the number of cores used for calculations.
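As an illustration of such application-level control, here is a minimal sketch that checks free card memory with cudaMemGetInfo and keeps allocations within a self-imposed budget; the one-quarter share is an arbitrary assumption, not an API-enforced limit.

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Check how much card memory is currently free before allocating,
    // e.g. to keep this process within an agreed share of the device.
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("GPU memory: %zu MB free of %zu MB total\n",
           freeBytes >> 20, totalBytes >> 20);

    // An application-level "budget" (our own convention, not enforced by CUDA):
    size_t budget = totalBytes / 4;           // use at most a quarter of the card
    size_t toAlloc = budget < freeBytes ? budget : freeBytes;

    void *buf = nullptr;
    if (cudaMalloc(&buf, toAlloc) == cudaSuccess)
        printf("Allocated %zu MB\n", toAlloc >> 20);

    cudaFree(buf);
    return 0;
}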

Containers and GPU


If you have figured out the resource limits, then the next logical question is: what if there are several video cards in the server?

Again, you can decide at the application level which GPU it will use.

Another, more convenient, way is Docker containers. You can use regular containers, but NVIDIA offers its NGC containers with optimized versions of various software, libraries and drivers. For a single container you can limit the number of GPUs used and their visibility to the container. The overhead of using containers is about 3%.

Work in a cluster


Another question: what if you want to perform a single task on multiple GPUs within one server or a cluster?

If you chose a library similar to Thrust or a lower-level solution, the task will have to be solved manually. High-level frameworks, for example for machine learning or neural networks, usually support using multiple cards out of the box.
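For the manual route, the basic pattern is to select each device with cudaSetDevice and give it its share of the data. Below is a minimal sketch under assumptions: the work kernel is a placeholder, the data divides evenly across the cards, and the per-device work is done sequentially here instead of in parallel host threads for brevity.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the per-card part of the job.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

// Split an array of `total` floats evenly across all visible GPUs.
void run_on_all_gpus(const float *host, int total)
{
    int devices = 0;
    cudaGetDeviceCount(&devices);
    int part = total / devices;               // assume it divides evenly

    for (int dev = 0; dev < devices; dev++)
    {
        cudaSetDevice(dev);                   // subsequent calls go to this card
        float *d;
        cudaMalloc(&d, part * sizeof(float));
        cudaMemcpy(d, host + dev * part, part * sizeof(float),
                   cudaMemcpyHostToDevice);

        int block = 256;
        int grid = (part + block - 1) / block;
        work<<<grid, block>>>(d, part);

        cudaDeviceSynchronize();              // in real code this would be asynchronous
        cudaFree(d);
    }
}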

In addition, I would like to note that NVIDIA, for example, offers an interface for direct data exchange between cards - NVLINK - which is significantly faster than PCI Express. And there is technology for direct access to the card's memory from other PCI Express devices - GPUDirect RDMA - including network devices.

Recommendations


If you are considering using the GPU in your projects, then the GPU is most likely suitable for you if:

  • Your task can be reduced to a SIMD form
  • It is possible to load most of the data onto the card before the calculations (cache it there)
  • The task involves intensive computation

You should also ask questions in advance:

  • How many parallel requests there will be
  • What latency you expect
  • Whether one card is enough for your load, or whether you need a server with several cards or a cluster of GPU servers

That's all, I hope that the material will be useful to you and help you make the right decision!

References


Benchmark and results on github - https://github.com/tishden/gpu_benchmark/tree/master/cuda

In addition to the topic, a recording of the report “GPU Databases - Architecture, Performance and Prospects for Use”

NVIDIA NGC Containers Webinar - http://bit.ly/2UmVIVt or http://bit.ly/2x4vJKF
