Rendering optimization for Mobile, part 2. The main families of modern mobile GPUs

Greetings, dear graphics programming enthusiasts and professionals! Welcome to the second part of our series of articles on rendering optimization for mobile. In this part, we will look at the main families of GPUs found in our players' mobile devices.


To begin, let's consider a few criteria by which mobile GPUs can be classified.

Unified or specialized shader cores


In the era of early mobile GPUs, before complex effects became widespread, there was a view that reduced-precision arithmetic was sufficient for fragment shaders. Indeed, typical display modes use 8 or even fewer bits per color channel. This view led to the use of specialized shader cores. Vertices were processed by cores optimized for matrix transformations at increased FP24/FP32 precision ( highp ), while pixels were handled by cores that work more efficiently at reduced FP16 precision ( mediump ); highp was not supported on them at all. At first glance, such specialization allows a more rational distribution of transistors on the chip. In practice, however, it complicates the development of complex effects, as well as the use of high-resolution textures. In addition, core specialization can lead to a vertex / fragment bottleneck : a situation where, because of an asymmetric load on the vertex and pixel cores, some of the cores sit "idle".
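
For reference, here is a minimal sketch (identifiers are illustrative, not taken from any particular engine) of how these precision qualifiers look in a GLSL ES 2.0 shader pair of that era:

// vertex shader: float defaults to highp (FP24/FP32), suiting matrix transforms
uniform highp mat4 u_mvp;
attribute highp vec4 a_position;
attribute mediump vec2 a_texc;
varying mediump vec2 v_texc;

void main()
{
    gl_Position = u_mvp * a_position;
    v_texc = a_texc;
}

// fragment shader: specialized pixel cores offer only reduced precision,
// so mediump (FP16) is declared as the default for float
precision mediump float;
uniform sampler2D u_sampler;
varying mediump vec2 v_texc;

void main()
{
    gl_FragColor = texture2D(u_sampler, v_texc);
}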


Therefore, modern architectures use unified cores. Such cores can take on vertex, pixel and other computational tasks depending on the load.


Vector (SIMD) or scalar instruction set


Shader instruction sets were designed in the same transistor-saving spirit that drove core specialization. Most transformations typical for 3D graphics operate on 4-component vectors, so early GPUs worked with exactly such operands. If the shader code contained heterogeneous scalar operations that the optimizer could not pack into vector operations, part of the computing power went unused. This can be illustrated as follows:


Take a shader that implements the common multiply-add operation: multiply two operands, then add a third. When compiled for a hypothetical vector architecture (Vector ISA = Vector Instruction Set Architecture), we get a single vector instruction vMADD that executes in 1 clock cycle. On a hypothetical scalar architecture, we get 4 scalar instructions which, thanks to an improved pipeline, also execute in 1 clock cycle. Now consider a more intricate shader that performs 2 such operations, but on 2-component operands.


On the vector architecture we now get 2 instructions that take 2 clock cycles to execute; nothing is done with the .zw components, and that processing power sits idle. On the scalar architecture, the same operations can be packed into 4 scalar sMADDs that still execute in 1 clock cycle. Thus, the scalar architecture achieves a higher density of calculations thanks to the improved pipeline. However, as will be shown below, vector ISAs are still relevant, so it makes sense to apply vectorization techniques to shader code: they improve performance on GPUs with a vector ISA and, as a rule, do not hurt performance on more modern scalar ISAs.
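
Here is a rough GLSL sketch of such manual vectorization. The helper and its arguments are purely illustrative, and real instruction selection is of course up to the compiler:

// Hypothetical helper computing two independent 2-component multiply-adds
// (a*b + c and d*e + f); it can live in either shader stage.
mediump vec4 two_mads(mediump vec2 a, mediump vec2 b, mediump vec2 c,
                      mediump vec2 d, mediump vec2 e, mediump vec2 f)
{
    // Scalar-minded form: on a vector ISA each line may compile to a
    // half-empty vMADD with the .zw lanes idle (2 cycles in total).
    //   mediump vec2 r0 = a * b + c;
    //   mediump vec2 r1 = d * e + f;

    // Vectorized form: both MADs packed into one full-width vec4 MAD.
    // A vector ISA can issue a single vMADD; a scalar ISA still spends
    // the same 4 sMADDs, so nothing is lost there.
    return vec4(a, d) * vec4(b, e) + vec4(c, f);
}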

With these characteristics in mind, let's look at the families of mobile GPUs common today, starting with the most widespread one. Many people will know that this means Mali GPUs from the British company ARM . ARM is not directly involved in chip production, offering intellectual property instead. Like other mobile GPUs, Mali is part of a System on Chip (SoC) , i.e. it shares memory and the bus with the CPU.

Mali Utgard


The first representatives of the Mali Utgard architecture appeared in 2008, and they are still around today. These GPUs are named according to the scheme Mali-4xx MPn, where xx is the model number and n is the number of fragment cores. Mali Utgard uses specialized shader cores, and every model comes with only one vertex core.

Other features of the Mali Utgard architecture:

  • OpenGL ES 2.0 
  • No highp support in fragment cores
  • Vector instruction set (it makes sense to vectorize calculations)

Despite the OpenGL ES specification, Mali Utgard drivers successfully compile fragment shaders that use highp precision (for example, when the default precision is set with precision highp float ), but mediump is what is actually used. It is therefore advisable to additionally test all shaders of a mobile game on such GPUs. According to data collected by Unity , at the end of 2019 Mali Utgard was running on the devices of about 10% of players. And if you set the appropriate filters on market.yandex.ru , you can see that more than 10 new phones with GPUs of this architecture were announced in 2019.


If you are ready to abandon this audience, it is enough to set the requirement for OpenGL ES 3.0 support in AndroidManifest.xml:

<uses-feature android:glEsVersion="0x00030000" android:required="true" />

Apart from Mali Utgard , there are currently no widespread mobile GPUs without OpenGL ES 3.0 support.

The use of high-resolution textures on Mali Utgard deserves special mention. The ten bits of mantissa that mediump provides are not enough for high-quality texturing with textures larger than 1024 on a side. However, even though Mali Utgard fragment cores only support mediump precision, you can get FP24 accuracy for texture coordinates by using a varying directly:

// vertex shader
varying highp vec2 v_texc;
void main()
{
    v_texc = …;
}

//  fragment shader
...
varying highp vec2 v_texc;
void main()
{
    gl_FragColor = texture2D(u_sampler, v_texc); // v_texc is used directly,
                                                 // so it keeps FP24 precision
}

As a bonus, on some architectures this approach allows texture contents to be prefetched before the fragment shader runs, which minimizes stalls while waiting for texture sampling results.

Mali Midgard


Mali Utgard was succeeded by the Mali Midgard architecture. There are several generations of this architecture, with names of the form Mali-T6xx , Mali-T7xx and Mali-T8xx . Despite being 8 years old, Mali Midgard can be called a modern architecture that supports most of the new features:

  • unified shader cores
  • OpenGL ES 3.2 (compute & geometry shaders, tessellation ...)

However, Mali Midgard retains a vector ISA . Given how widespread Mali Midgard is (about 25% of our audience), vectorizing computations remains worthwhile.

Another feature of Mali Midgard is the Forward Pixel Kill technology. Each pixel is computed by a separate thread of a fragment core. If during execution it becomes known that the resulting pixel will be covered by an opaque pixel of another primitive, the thread terminates early and the freed resources are used for other computations.

Mali Bifrost


The Bifrost architecture, which succeeded Midgard, stands out for its transition to a scalar ISA . Compared with its predecessor, the maximum number of cores has been increased (from 16 to 32), and an improved interface to the CPU supports coherent access to shared memory: changes made by the CPU and GPU become "visible" to each other immediately, despite the caches, which simplifies synchronization.

From unofficial sources


Many attempts have been made to reverse engineer Mali GPUs in order to create open-source drivers for Linux. The work of the dedicated people doing this lets you take a look at undocumented features of Mali GPUs. For example, the Panfrost project includes a disassembler for Mali Midgard / Bifrost , which you can use to get acquainted with the shader instruction set (there is no official public information on this topic).


Adreno


The second most common family of mobile GPUs is Adreno . These GPUs are part of the SoCs known under the Snapdragon brand from the American company Qualcomm . Snapdragon can be found in today's top-end smartphones from Samsung , Sony and others.

The current Adreno GPUs are the 3xx to 6xx series. All these series share the following features:

  • unified shader cores
  • Pseudo-TBR (large tiles stored in traditional dedicated GPU memory)
  • Automatic switching to Immediate Mode Rendering depending on the nature of the scene ( FlexRender )
  • Scalar instruction set

Support for OpenGL ES 3.1 was introduced starting with Adreno 4xx , and Vulkan and OpenGL ES 3.2 starting with Adreno 5xx .

Adreno tile-based rendering


Adreno GPUs have "traditional" dedicated GPU memory called GMEM , with sizes from 128 KB to 1536 KB. This allows a larger tile size compared to the architectures of other mobile GPU developers. On Adreno, the tile size is dynamic and depends on the color, depth and stencil buffer formats used. When working in immediate mode, rendering goes to system memory. There is a GL ES extension that allows you to specify the preferred mode: QCOM_binning_control . However, the latest recommendations from Qualcomm suggest relying entirely on the GPU driver, which determines the most suitable mode for the command buffer generated by the application.

When working in TBR mode Adreno makes 2 vertex passes:

  1. Binning pass: distributing primitives into bins ( bins , a synonym for tiles)
  2. A full vertex pass that renders only the primitives falling into the current bin

During the binning pass, Adreno only calculates vertex positions. Other attributes are not calculated, and the unneeded code is removed by the optimizer. The official documentation (9.2 Optimize vertex processing) recommends storing the vertex data needed to compute positions separately from the rest. This makes vertex data caching more efficient.
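
Below is an illustrative sketch (attribute and uniform names are made up) of how the binning pass effectively sees a vertex shader:

// Vertex shader as written by the application. During the binning pass
// Adreno only needs gl_Position, so the compiler strips everything that
// feeds the varyings; only a_position and u_mvp are actually read.
// Keeping a_position in its own buffer, separate from a_texc / a_normal,
// therefore makes vertex fetch during binning more cache-friendly.
attribute highp vec3 a_position;   // position stream (own VBO recommended)
attribute mediump vec2 a_texc;     // non-position attributes
attribute mediump vec3 a_normal;

uniform highp mat4 u_mvp;

varying mediump vec2 v_texc;
varying mediump vec3 v_normal;

void main()
{
    gl_Position = u_mvp * vec4(a_position, 1.0); // executed in both passes
    v_texc   = a_texc;    // removed by the optimizer in the binning pass
    v_normal = a_normal;  // removed by the optimizer in the binning pass
}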

Freedreno


Unlike ARM and Imagination Technologies, Qualcomm is reluctant to share details of the internal structure of its GPUs. However, thanks to the efforts of reverse engineer Rob Clark, much can be learned from the Freedreno project, an open-source Adreno driver for Linux.

Rob Clark, the author of Freedreno

PowerVR by Imagination Technologies


Imagination Technologies is a British fabless company famous for developing GPUs for Apple products. It played this role until the arrival of the iPhone 8 / X, which uses a GPU of Apple's own design. That said, the unchanged optimization recommendations for these chips, as well as Imagination's patent claims against Apple, suggest that Apple continued to build on the PowerVR architecture originally developed by Imagination. In early 2020, Apple returned to a licensing agreement with Imagination Technologies. Besides iOS / iPadOS devices, PowerVR GPUs are installed in a large number of Android smartphones and tablets.


Let's look at the PowerVR GPU families that can still be found among users.

PowerVR SGX


The first PowerVR SGX GPUs appeared back in 2009. There are several generations of this architecture: Series5, Series5XT and Series5XE. Apple used these GPUs right up to the iPad 4 / iPhone 5 / iPod touch 5. The following SGX features are worth noting:

  • unified shader cores
  • OpenGL ES 2.0
  • vector instruction set
  • support for 10-bit lowp precision in shaders
  • low performance of dependent texture reads

Let us dwell on some of them in more detail. 

Lowp precision


PowerVR SGX are the only still-relevant mobile GPUs with hardware lowp support. Newer PowerVR models, as well as all modern GPUs from other vendors, actually use mediump precision instead. Using lowp on PowerVR SGX allows a higher computation density (more operations per cycle). At the same time, the swizzle operation (permuting vector components) is, unlike with other precisions, not free for lowp . This, together with the narrow range of values lowp provides ([-2, 2]), limits its usefulness. Moreover, a poorly chosen lowp that produces artifacts on the SGX family will show no artifacts on all other GPUs, where mediump precision is actually used. For this reason, you should consider not using lowp in shaders at all.
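
For illustration, here is a minimal fragment-shader sketch (names are illustrative) of what lowp use on SGX looks like, including the swizzle caveat mentioned above:

precision mediump float;
uniform sampler2D u_sampler;
uniform lowp vec4 u_tint;          // tint values must stay within lowp's [-2, 2] range
varying mediump vec2 v_texc;

void main()
{
    lowp vec4 base = texture2D(u_sampler, v_texc);
    // Plain lowp arithmetic: higher computation density on SGX.
    gl_FragColor = base * u_tint;
    // Note: a swizzle on a lowp operand, e.g. base.bgra * u_tint,
    // would not be free on SGX, unlike with mediump/highp operands.
}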

Dependent texture reads


As you know, texture sampling operations are the slowest, because they have to wait for memory reads. In the case of a mobile SoC, that means system memory shared with the CPU. To reduce the number of accesses to slow memory, texture caches are used. To avoid stalls at the start of rasterization with a texture, it makes sense to cache the needed regions in advance. If the fragment shader uses a texture coordinate passed from the vertex shader without changes, the texture region to cache can be determined before the fragment shader runs. If the fragment shader changes the texture coordinate, or computes it from data read from another texture, this is not always possible, and fragment shader execution may slow down. PowerVR SGX GPUs suffer particularly badly in this scenario. Moreover, even permuting the components of the texture coordinate (a swizzle) counts as a dependent texture read . Here is an example shader program without a dependent texture read .

vertex program

attribute highp vec2 a_texc;
varying highp vec2 v_texc;

void main()
{
	gl_Position = …
	v_texc = a_texc;
}


fragment program

precision mediump float;
uniform sampler2D u_sampler;
varying highp vec2 v_texc;

void main()
{
	gl_FragColor = texture2D( u_sampler, v_texc ); // no dependent texture read
}

And in this case, there is a dependent texture read:

fragment program

precision mediump float;
uniform sampler2D u_sampler;
varying highp vec2 v_texc;

void main()
{
	gl_FragColor = texture2D( u_sampler, v_texc.yx ); // dependent texture read!
}
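
A common way to avoid the dependent read is to move the coordinate math into the vertex shader, so that the fragment shader again consumes the varying unchanged. A sketch with an illustrative u_offset uniform:

vertex program

attribute highp vec4 a_position;
attribute highp vec2 a_texc;
uniform highp mat4 u_mvp;
uniform highp vec2 u_offset;   // e.g. a UV scroll offset (illustrative)
varying highp vec2 v_texc;

void main()
{
	gl_Position = u_mvp * a_position;
	v_texc = a_texc + u_offset;   // coordinate math done per-vertex
}

fragment program

precision mediump float;
uniform sampler2D u_sampler;
varying highp vec2 v_texc;

void main()
{
	gl_FragColor = texture2D( u_sampler, v_texc ); // no dependent texture read
}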

PowerVR Rogue


PowerVR GPUs were further developed in the Rogue architecture. There are several generations of it, from Series6 to Series9. All PowerVR Rogue GPUs share these features:

  • unified shader cores
  • scalar instruction set architecture
  • support for OpenGL ES 3.0+ (up to 3.2, plus the Vulkan API on recent product lines)

PowerVR TBDR


Like all common mobile GPUs, PowerVR uses a tile pipeline. But unlike its competitors, Imagination went further and implemented deferred rasterization of primitives, which makes it possible to skip shading of invisible pixels regardless of the rendering order. This approach is called Tile Based Deferred Rendering , and the process of eliminating invisible pixels is called Hidden Surface Removal (HSR).


Hidden Surface Removal

It is recommended to draw opaque geometry before transparent geometry and not to use a Z prepass, which on PowerVR GPUs leads to unnecessary work in most scenarios. Note, however, that several overlapping transparent pixels in a row are all fully shaded in order to obtain the correct blended color. A transparent pixel can still be discarded if an opaque pixel is drawn over it afterwards.

The openness of Imagination Technologies


The creators of PowerVR have made more documentation openly available than other GPU developers. The architecture of the graphics pipeline is described in detail, as is the instruction set of the Rogue architecture. There is also a convenient tool, PVRShaderEditor , which instantly provides profiling information for a shader as well as its disassembled listing for Rogue.


Despite the limited presence of PowerVR GPUs among Android devices, it makes sense to study their architecture in order to program graphics competently for iOS.

Immediate mode mobile GPUs


We have examined the most common families of mobile GPUs. All of them use a tile-based rendering architecture. However, there are mobile GPUs that use the traditional immediate mode approach. Here are some of them:

  • NVIDIA (Tegra SoC)
  • The whole Intel family except the recent Gen 11
  • Vivante GCxxxx (+ Arcturus GC8000)

A distinguishing feature of mobile GPUs operating in immediate mode is the expensive FBO clear operation. Recall that on tile architectures a full-screen clear speeds up rendering by allowing the driver to skip loading the old contents into tile memory. On mobile immediate mode GPUs, a full-screen clear is a time-consuming operation, which, among other things, lets you "detect" such GPUs: if adding a clear slows rendering down instead of speeding it up, you are most likely dealing with an immediate mode GPU . And of course, let's not forget that on immediate mode GPUs switching the render target is a "conditionally free" operation.

Distribution of different families of mobile GPUs among our players


Here are the statistics on mobile GPUs collected from our players at the end of 2019:


Below, we break down the "Others" segment


Based on this data, let's look at how GPUs are distributed in terms of their main features.


Vector ALUs (arithmetic logic units) are becoming obsolete and being replaced with scalar ones. Today, the bulk of mobile GPUs with a vector instruction set are Mali Midgard parts, which can be considered average in performance. Since vectorization, as a rule, does not slow down execution on scalar ALUs, it is worth considering vectorization a still-relevant technique for optimizing mobile shaders.

Specialized shader cores are obsolete and are being replaced by unified ones; a vertex bottleneck on skeletal meshes is no longer a concern. Specialized cores are used only in the Mali-4xx (Utgard) family. Recall that these GPUs only support OpenGL ES 2.0 . They account for about 3.5% of our audience.

Finally, the vast majority of mobile GPUs use the tile-based approach. Immediate mode has become marginal and is quickly being squeezed out along with the GPUs that use it. The share of immediate mode GPUs among our players is about 0.7%.

Useful links:


Thank you for your attention! In the next article in the series, we will look at shader optimization techniques for mobile.
