Rendering optimization for Mobile

Hello, dear readers, enthusiasts and professionals of graphics programming! This is the first in a series of articles on optimizing rendering for mobile devices: phones and tablets running iOS and Android. The series consists of three parts. The first part examines the tile-based GPU architecture that is common on mobile. The second goes over the main GPU families found in modern devices and considers their strengths and weaknesses. The third covers shader optimization.

So, let's get down to the first part.

Video cards for desktops and consoles evolved without significant restrictions on power consumption. With the advent of mobile GPUs, engineers had to deliver acceptable performance at resolutions comparable to desktop ones while consuming about two orders of magnitude less power.



The solution was a special architecture called Tile Based Rendering (TBR). To a programmer with PC experience, mobile development looks familiar at first: a similar API (OpenGL ES) and the same graphics pipeline structure. However, the tile architecture of mobile GPUs differs significantly from the Immediate Mode architecture used on PCs and consoles. Knowing the strengths and weaknesses of TBR will help you make the right decisions and get good performance on mobile.

Below is a simplified diagram of the classic graphics pipeline, which is now in its third decade of use on PCs and consoles.


At the geometry processing stage, vertex attributes are read from GPU video memory. After the transformations performed by the Vertex Shader, ready-to-render primitives are passed in their original order (FIFO) to the rasterizer, which breaks them into pixels. Each pixel then goes through fragment processing (Fragment Shader), and the resulting color values are written to the screen buffer, which also resides in video memory.

A feature of the traditional Immediate Mode architecture is that, while a single draw call is processed, Fragment Shader results are written to arbitrary parts of the screen buffer. Thus, any draw call may need access to the entire screen buffer. Working with such a large block of memory requires high bus bandwidth and entails high power consumption. Therefore, mobile GPUs took a different approach.

On the tile architecture typical of mobile GPUs, rendering goes to a small piece of memory that corresponds to a part of the screen: a tile. The small size of a tile (e.g. 16x16 pixels on Mali, 32x32 on PowerVR) allows it to be placed directly on the GPU chip, which makes access to it comparable in speed to access to shader core registers, i.e. very fast.
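
To get a feel for the numbers behind this argument, here is a rough back-of-the-envelope sketch. It assumes a 32-bit color buffer and a deliberately simplified traffic model (no caches, no compression, color writes only); the function names are made up for illustration and the overdraw factor is an assumption, not a measurement.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold a width x height block of pixels.
std::uint64_t bufferBytes(std::uint64_t width, std::uint64_t height,
                          std::uint64_t bytesPerPixel) {
    assert(width > 0 && height > 0 && bytesPerPixel > 0);
    return width * height * bytesPerPixel;
}

// Simplified model of color bytes written to external memory per second.
// writesPerPixel is the assumed number of times each pixel is written per frame:
//   immediate mode: roughly the overdraw factor (every shaded fragment goes out);
//   tile-based:     1 (fragments are combined on-chip, one Store per pixel).
std::uint64_t colorWriteBytesPerSecond(std::uint64_t width, std::uint64_t height,
                                       std::uint64_t bytesPerPixel,
                                       std::uint64_t writesPerPixel,
                                       std::uint64_t fps) {
    return bufferBytes(width, height, bytesPerPixel) * writesPerPixel * fps;
}
```

Under these assumptions, a 16x16 tile needs `bufferBytes(16, 16, 4)` = 1024 bytes of on-chip memory, while a 1920x1080 screen buffer needs 8,294,400 bytes. At 60 frames per second, an overdraw factor of 3 gives about 1.49 GB/s of color writes in the immediate-mode model, against about 0.5 GB/s when each pixel is stored once.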


However, since primitives can fall into arbitrary sections of the screen buffer, and the tile covers only a small part of it, an additional step in the graphics pipeline was required. The following is a simplified diagram of how the pipeline works with tile architecture.


After the vertices are processed and the primitives assembled, the primitives are not sent to the fragment pipeline but go to the so-called Tiler. There they are distributed among the tiles whose pixels they cover. Once this distribution is complete, which as a rule covers all draw calls directed to one Frame Buffer Object (FBO, aka Render Target), the tiles are rendered one after another. For each tile, the following sequence of actions is performed:

  1. Loading the old FBO contents from system memory (Load)
  2. Rendering the primitives that fall into this tile
  3. Storing the new FBO contents to system memory (Store)
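
The distribution step performed by the Tiler can be illustrated with a small CPU-side sketch. This is not how any particular GPU implements it: real hardware tests exact primitive coverage and uses compact per-tile lists, while this example bins by bounding box. `Rect` and `binPrimitives` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rect { int minX, minY, maxX, maxY; };  // inclusive pixel bounds of a primitive

// Returns, for each tile (row-major order), the indices of the primitives whose
// bounding box overlaps that tile. Submission order is preserved inside each list.
std::vector<std::vector<int>> binPrimitives(const std::vector<Rect>& prims,
                                            int fbWidth, int fbHeight, int tileSize) {
    assert(fbWidth > 0 && fbHeight > 0 && tileSize > 0);
    const int tilesX = (fbWidth + tileSize - 1) / tileSize;
    const int tilesY = (fbHeight + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(tilesX) * tilesY);
    for (int i = 0; i < static_cast<int>(prims.size()); ++i) {
        // Clamp the bounding box to the framebuffer.
        const int x0 = std::max(prims[i].minX, 0);
        const int y0 = std::max(prims[i].minY, 0);
        const int x1 = std::min(prims[i].maxX, fbWidth - 1);
        const int y1 = std::min(prims[i].maxY, fbHeight - 1);
        if (x0 > x1 || y0 > y1) continue;  // fully off-screen, lands in no tile
        for (int ty = y0 / tileSize; ty <= y1 / tileSize; ++ty)
            for (int tx = x0 / tileSize; tx <= x1 / tileSize; ++tx)
                bins[static_cast<std::size_t>(ty) * tilesX + tx].push_back(i);
    }
    return bins;
}
```

After binning, the GPU would walk the tiles one by one and perform Load, render of that tile's list, and Store for each. A primitive that covers many tiles appears in many lists, which is why scenes with a lot of geometry put pressure on the Tiler.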


Note that the Load operation can be thought of as drawing an extra full-screen texture without compression. Avoid it whenever possible, i.e. do not switch FBOs back and forth. If the entire contents of an FBO are cleared before rendering into it, the Load operation is not performed. However, for the driver to recognize this, the clear must meet certain criteria:

  1. Scissor Rect must be disabled.
  2. Writing to all color channels and alpha must be enabled.

To prevent the Load operation for the depth and stencil buffers, they also need to be cleared before rendering starts.
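
The criteria above can be turned into a debug-time check. The sketch below is a hypothetical helper, not part of any driver or of OpenGL ES. It assumes the engine fills `ClearState` from its own state cache or from queries such as `glIsEnabled(GL_SCISSOR_TEST)`, `glGetBooleanv(GL_COLOR_WRITEMASK, ...)` and `glGetBooleanv(GL_DEPTH_WRITEMASK, ...)`. Whether a given driver actually skips the Load remains driver-specific.

```cpp
#include <cassert>

// Snapshot of the GL state that affects glClear (hypothetical helper struct).
struct ClearState {
    bool scissorTestEnabled;
    bool colorMask[4];  // R, G, B, A write masks
    bool depthMask;     // depth write mask
};

// True when a glClear issued under this state overwrites every pixel and every
// channel, which is the condition described in the text for skipping the Load.
bool isFullClear(const ClearState& s) {
    if (s.scissorTestEnabled) return false;
    for (bool channelWritable : s.colorMask)
        if (!channelWritable) return false;
    return s.depthMask;
}

// Debug-build guard intended to be called right before the glClear that opens a pass.
inline void assertFullClear(const ClearState& s) { assert(isFullClear(s)); }
```

A check like this is cheap to keep in debug builds and catches the common case where a scissor rectangle or color mask left over from a previous pass silently turns a full clear into a partial one.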

It is also possible to avoid the Store operation for the depth / stencil buffers, since their contents are never displayed on the screen. Before swapping buffers (eglSwapBuffers), you can call glDiscardFramebufferEXT (OpenGL ES 2.0 with the EXT_discard_framebuffer extension) or glInvalidateFramebuffer (OpenGL ES 3.0):

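// OpenGL ES 2.0 + EXT_discard_framebuffer, with a user FBO bound
const GLenum attachments[] = {GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT};
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 2, attachments);

// OpenGL ES 3.0, with a user FBO bound
const GLenum attachments[] = {GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT};
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, attachments);
// When the default framebuffer is bound, pass GL_DEPTH and GL_STENCIL
// (GL_DEPTH_EXT and GL_STENCIL_EXT for the EXT call) instead.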

There are rendering scenarios in which depth / stencil buffers and MSAA buffers do not need to be placed in system memory at all. For example, if an FBO with a depth buffer is rendered in a single uninterrupted pass and depth information from the previous frame is not used, the depth buffer needs neither to be loaded into tile memory before rendering nor to be stored after it. System memory therefore does not have to be allocated for the depth buffer. Modern graphics APIs such as Vulkan and Metal let you set the memory mode explicitly for their FBO counterparts (MTLStorageModeMemoryless in Metal, VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT + VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT in Vulkan).

The implementation of MSAA on tile architectures deserves special mention. The high-resolution MSAA buffer never leaves tile memory, because the FBO is split into more tiles. For example, with 2x2 MSAA, a 16x16 tile is resolved to 8x8 pixels during the Store operation, so four times as many tiles have to be processed in total. In return, no additional memory is required for MSAA, and because rendering happens in fast tile memory, bandwidth is not a significant constraint. However, using MSAA on a tile architecture increases the load on the Tiler, which can hurt rendering performance in scenes with a lot of geometry.
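
The "four times more tiles" figure follows from simple arithmetic, sketched below. The helper names are hypothetical, and the model assumes that the on-chip tile storage stays fixed while each resolved pixel takes samplesX x samplesY samples, as in the 2x2 example above.

```cpp
#include <cassert>

// Number of tiles needed to cover a framebuffer; partial edge tiles count as whole tiles.
int tileCount(int fbWidth, int fbHeight, int tileWidth, int tileHeight) {
    assert(fbWidth > 0 && fbHeight > 0 && tileWidth > 0 && tileHeight > 0);
    const int tilesX = (fbWidth + tileWidth - 1) / tileWidth;
    const int tilesY = (fbHeight + tileHeight - 1) / tileHeight;
    return tilesX * tilesY;
}

// With samplesX x samplesY MSAA, the same tile storage covers fewer resolved pixels,
// so the effective tile shrinks by the sample grid in each dimension.
int tileCountWithMsaa(int fbWidth, int fbHeight, int tileSize, int samplesX, int samplesY) {
    assert(samplesX > 0 && samplesY > 0);
    assert(tileSize % samplesX == 0 && tileSize % samplesY == 0);
    return tileCount(fbWidth, fbHeight, tileSize / samplesX, tileSize / samplesY);
}
```

For a 1280x720 target with 16x16 tiles this gives 3600 tiles without MSAA and 14400 tiles with 2x2 MSAA. Each primitive then has to be binned into roughly four times as many tiles, which is where the extra Tiler load comes from.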

Summarizing the above, here is the recommended scheme for working with FBOs on a tile architecture:

// 1. Start of the frame: render to auxFBO
glBindFramebuffer(GL_FRAMEBUFFER, auxFBO);
glDisable(GL_SCISSOR_TEST);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);
// Full glClear so that the driver can skip the Load operation
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT |
        GL_STENCIL_BUFFER_BIT);

renderAuxFBO();

// Depth / stencil are no longer needed, so skip their Store
// (depth_and_stencil holds GL_DEPTH_ATTACHMENT and GL_STENCIL_ATTACHMENT)
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, depth_and_stencil);
// 2. Render to mainFBO
glBindFramebuffer(GL_FRAMEBUFFER, mainFBO);
glDisable(GL_SCISSOR_TEST);

glClear(...);
// Render mainFBO using the contents of auxFBO
renderMainFBO(auxFBO);

glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, depth_and_stencil);

If you switch to rendering auxFBO in the middle of building mainFBO, you can get unnecessary Load & Store operations, which can significantly increase frame time. In our practice, we saw rendering slow down even with idle FBO binds, i.e. binds with no actual rendering into the FBO. Due to the architecture of our engine, our old scheme looked like this:

// Bind mainFBO
glBindFramebuffer(GL_FRAMEBUFFER, mainFBO);
// No rendering here
glBindFramebuffer(GL_FRAMEBUFFER, auxFBO);
// Render auxFBO
renderAuxFBO();

glBindFramebuffer(GL_FRAMEBUFFER, mainFBO);
// Render mainFBO
renderMainFBO(auxFBO);

Despite the absence of GL calls after the first bind of mainFBO, on some devices we got extra Load & Store operations and worse performance.

To better understand the overhead of intermediate FBOs, we used a synthetic test to measure the time lost when switching full-screen FBOs. The table shows the time spent on the Store operation when the FBO is switched several times within one frame (the time of a single operation is given). The Load operation was absent thanks to glClear, i.e. the more favorable scenario was measured. The screen resolution of each device also affected the results, and it may or may not match the power of the installed GPU. Therefore, these figures give only a general idea of how expensive render target switches are on mobile GPUs of various generations.

GPU               milliseconds    GPU           milliseconds
Adreno 320        5.2             Adreno 512    0.74
PowerVR G6200     3.3             Adreno 615    0.7
Mali-400          3.2             Adreno 530    0.4
Mali-T720         1.9             Mali-G51      0.32
PowerVR SGX 544   1.4             Mali-T830     0.15
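
To put the figures in the table into perspective, the sketch below relates the cost of FBO switches to a frame budget. The 60 FPS budget and the 25% share used in the example are assumptions chosen for illustration, and the function names are hypothetical.

```cpp
#include <cassert>

// Fraction of the frame budget consumed by `switches` FBO switches,
// each costing `storeMs` milliseconds (a value from the table).
double budgetShare(double storeMs, int switches, double frameBudgetMs) {
    assert(storeMs >= 0.0 && switches >= 0 && frameBudgetMs > 0.0);
    return storeMs * switches / frameBudgetMs;
}

// Largest number of switches whose total cost stays within `maxShare` of the budget.
int maxSwitches(double storeMs, double frameBudgetMs, double maxShare) {
    assert(storeMs > 0.0 && frameBudgetMs > 0.0 && maxShare >= 0.0);
    return static_cast<int>(frameBudgetMs * maxShare / storeMs);
}
```

With a 60 FPS budget of about 16.7 ms, two switches on the Adreno 320 (5.2 ms each) already take about 62% of the frame. If switches may use at most a quarter of the budget, `maxSwitches` yields 0 for the Adreno 320, 5 for the Adreno 512 and 27 for the Mali-T830.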

Based on these data, we recommend no more than one or two FBO switches per frame, at least on older GPUs. If the game has a separate code path for low-end devices, it is best not to switch FBOs there at all. However, on low-end devices lowering the rendering resolution often becomes necessary. On Android, you can lower the rendering resolution without an intermediate FBO by calling SurfaceHolder.setFixedSize():

surfaceView.getHolder().setFixedSize(...)

This method will not work if the game renders through the application's main Surface (the typical scheme with NativeActivity). In that case, a lower resolution can be set by calling the native function ANativeWindow_setBuffersGeometry.

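#include <android/native_window_jni.h>  // ANativeWindow_fromSurface
#include <android/hardware_buffer.h>    // AHARDWAREBUFFER_FORMAT_*

JNIEXPORT void JNICALL Java_com_organization_app_AppNativeActivity_setBufferGeometry(JNIEnv *env, jclass clazz, jobject surface, jint width, jint height)
{
    ANativeWindow* window = ANativeWindow_fromSurface(env, surface);
    if (window == NULL) return;  // the Surface is not valid
    ANativeWindow_setBuffersGeometry(window, width, height, AHARDWAREBUFFER_FORMAT_R8G8B8X8_UNORM);
    ANativeWindow_release(window);  // drop the reference acquired by ANativeWindow_fromSurface
}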

In Java:

private static native void setBufferGeometry(Surface surface, int width, int height);
...
// In the SurfaceHolder.Callback implementation
@Override public void surfaceChanged(SurfaceHolder holder, int format, int width, int height)
{
     setBufferGeometry(holder.getSurface(), 768, 1366); /* ... */
...
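
The width and height passed to setBufferGeometry have to be chosen somehow. One possible approach, shown here only as a sketch, is to scale the native surface size by a factor and round each side to an even number. The `Size` struct, the function name and the even-rounding rule are assumptions of this example, not requirements of the Android API.

```cpp
#include <cassert>
#include <cmath>

struct Size { int width, height; };

// Scales a native surface size by `scale` (0 < scale <= 1) and rounds each side
// to the nearest even number, so the aspect ratio is approximately preserved.
Size scaledBufferSize(Size native, double scale) {
    assert(native.width > 0 && native.height > 0);
    assert(scale > 0.0 && scale <= 1.0);
    auto roundToEven = [](double v) {
        const int r = static_cast<int>(std::lround(v / 2.0)) * 2;
        return r < 2 ? 2 : r;  // never return a zero-sized side
    };
    return { roundToEven(native.width * scale), roundToEven(native.height * scale) };
}
```

For a 1080x1920 portrait surface, a factor of 0.75 yields 810x1440 and a factor of 0.5 applied to 1440x2560 yields 720x1280. The result would then be passed to setBufferGeometry or to SurfaceHolder.setFixedSize.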

Finally, here is a convenient ADB command for inspecting the surface buffers allocated on Android:

adb shell dumpsys surfaceflinger

Its output looks similar to the following and lets you estimate the memory consumed by surface buffers:


The screenshot shows the system allocating 3 buffers for triple buffering of the game's GLSurfaceView (highlighted in yellow), as well as 2 buffers for the main Surface (highlighted in red). When rendering goes through the main Surface, which is the default scheme with NativeActivity, the allocation of the additional buffers can be avoided.

That's all for now. In the following articles, we will classify mobile GPUs, as well as analyze methods for optimizing shaders for them.
