Hierarchical Depth Buffer


Brief overview


A hierarchical depth buffer is a multi-level depth (Z) buffer used as an acceleration structure for depth queries. As with texture mip chains, the dimensions of each level are usually obtained by dividing the full resolution buffer's dimensions by successive powers of two. In this article, I describe two ways to generate a hierarchical depth buffer from a full resolution one.

First, I will show how to generate a full mip chain for a depth buffer in a way that preserves the accuracy of depth queries in texture coordinate (or NDC) space, even for depth buffer dimensions that are not powers of two. (I have come across code samples online that do not provide this guarantee, which makes accurate queries at higher mip levels difficult.)

Then, for cases where only a single downsampled level is needed, I will show how to generate that level with a single compute shader dispatch using atomic operations in workgroup shared memory. For my application, which needs only a 1/16 x 1/16 resolution level (mip level 4), the compute shader method is 2-3 times faster than the usual multi-pass mip chain downsampling approach.

Introduction


A hierarchical depth buffer (also called Hi-Z) is a technique widely used in 3D graphics. Applications include occlusion culling (on the CPU as well as on the GPU), screen-space reflections, volumetric fog, and much more.

In addition, Hi-Z is often implemented in GPUs as part of the rasterization pipeline. Fast Hi-Z lookups in on-chip caches allow entire tiles of fragments to be skipped when they are fully covered by previously rendered primitives.

The basic idea of Hi-Z is to accelerate depth queries by reading from lower resolution buffers. This is faster than reading from the full resolution depth buffer for two reasons:

  1. One texel (or just a few texels) of a lower resolution buffer can stand in for many texels of the high resolution buffer.
  2. A lower resolution buffer can be small enough to cache well, which greatly speeds up lookups (especially random-access ones).

The contents of the downsampled Hi-Z levels depend on how they will be used (whether the depth buffer is "inverted", what kinds of queries must be served). In general, a texel at a given Hi-Z level stores the min or max of all the texels it corresponds to at the previous level. Sometimes both the min and the max are stored. Simple averaged values (as commonly used in the mip levels of regular textures) are stored infrequently, because they are rarely useful for these kinds of queries.

Hi-Z buffers are most often queried with an early-out in mind, to avoid further processing and more precise lookups in the full resolution buffer. For example, if we store max values for a non-inverted depth buffer (in which greater depth values are farther away), we can quickly and definitively determine whether a given screen-space position is occluded: it is occluded if its Z coordinate is > the max value stored at some higher (i.e., lower resolution) level of the Hi-Z buffer.

Note the word "definitively": if the Z coordinate is <= the fetched max value, it is not known whether the position is occluded. In some applications this uncertainty calls for a follow-up lookup in the full resolution depth buffer; in others it does not (for example, when only wasted computation is at stake rather than rendering correctness).
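
To make this concrete, here is a minimal GLSL sketch of such a conservative occlusion test. This is not code from this article's application; u_hiZMaxBuffer and u_hiZLevel are assumed names, and the sampler is assumed to use nearest-neighbor filtering:

uniform sampler2D u_hiZMaxBuffer; // Hi-Z mip chain storing max depths
uniform int u_hiZLevel;           // lower-resolution mip level to test against

// Returns true only if 'ndc' is definitely occluded; false means "unknown"
// and may require a more precise test.
bool isDefinitelyOccluded(vec3 ndc) {
	vec2 texCoords = 0.5 * ndc.xy + vec2(0.5);
	// Assumes the default [0, 1] depth range.
	float positionDepth = 0.5 * ndc.z + 0.5;
	float maxDepth = textureLod(u_hiZMaxBuffer, texCoords, float(u_hiZLevel)).r;
	return positionDepth > maxDepth;
}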

My application: rendering particles in a compute shader


I ran into the need for Hi-Z when implementing compute shader particle rendering in the engine of my VR application PARTICULATE. Because this rendering technique does not use fixed-function rasterization, it has to perform its own depth test for every one-pixel-sized particle. And since the particles are not sorted in any way, depth buffer access is (in the worst case) nearly random.

Random-access lookups into a full-screen texture are a recipe for poor performance. To reduce the load, I first test depths against a downsampled depth buffer at 1/16 x 1/16 of the original resolution. This buffer stores min depth values, which lets the compute rendering shader skip the full resolution depth test for the vast majority of visible particles. (If a particle's depth is < the min depth stored in the low resolution buffer, we know it is definitely visible. If it is >= that min, we must check the full resolution depth buffer.)
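
A sketch of this early-out in GLSL (with illustrative names, not my engine's actual code; the divisor of 16 matches the 1/16 x 1/16 buffer):

uniform sampler2D u_minDepthBuffer;     // 1/16 x 1/16 buffer of min depths
uniform sampler2D u_fullResDepthBuffer; // full resolution depth buffer

bool isParticleVisible(ivec2 fullResTexelCoords, float particleDepth) {
	ivec2 reducedResTexelCoords = fullResTexelCoords / ivec2(16);
	float minDepth = texelFetch(u_minDepthBuffer, reducedResTexelCoords, 0).r;
	if (particleDepth < minDepth) {
		return true; // definitely in front of all geometry in this region
	}
	// Uncertain: fall back to the full resolution depth test.
	return particleDepth < texelFetch(u_fullResDepthBuffer, fullResTexelCoords, 0).r;
}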

This makes the depth test for visible particles a cheap operation in the common case. (For particles occluded by geometry it is more expensive, but that is acceptable, because occluded particles incur no rendering cost and therefore still require little computation overall.)

Performing the lookup first in the low resolution depth buffer (as described above) reduces particle rendering time by up to 35% compared to performing lookups only in the full resolution buffer. So for my application, Hi-Z is well worth it.

Now we will look at two techniques for generating a hierarchical depth buffer.

Technique 1: generating a complete mip chain


Many Hi-Z applications require a complete depth buffer mip chain. For example, when performing occlusion culling with Hi-Z, the bounding volume is projected into screen space, and the projected size is used to select the appropriate mip level (so that a fixed small number of texels is involved in each overlap test).
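
As an illustration of that selection step, here is a GLSL sketch (selectHiZLevel is a hypothetical helper; the exact rule varies by implementation, and this version picks the level at which the projected extent spans roughly one texel per axis):

// Select a Hi-Z mip level for a screen-space bounding rectangle given in
// full resolution pixel coordinates.
int selectHiZLevel(vec2 aabbMinPx, vec2 aabbMaxPx) {
	vec2 sizePx = aabbMaxPx - aabbMinPx;
	float maxExtent = max(sizePx.x, sizePx.y);
	// Each mip level halves the resolution, so level log2(extent) reduces the
	// rectangle to roughly one texel; clamp to level 0 for tiny extents.
	return int(ceil(log2(max(maxExtent, 1.0))));
}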

Generating a mip chain from a full resolution depth buffer is normally straightforward: for each texel at level N we take the max (or min, or both) of the corresponding 4 texels at the previously generated level N-1. We make successive passes (halving the dimensions each time) until we reach the final 1x1 mip level.

However, for depth buffers whose dimensions are not powers of two, things get more complicated. Since Hi-Z buffers are often built from standard screen resolutions (which are rarely powers of two), we need a robust solution to this problem.

Let's first decide what the value of each texel in the depth mip chain should mean. In this article we will assume the mip chain stores min values. Depth lookups must use nearest-neighbor filtering, because interpolated min values are useless to us and would break the hierarchical nature of the depth mip chain.

So, what exactly should the value of an individual texel at mip level N mean? It should be the min of all texels of the full resolution depth buffer that occupy the same region of (normalized) texture coordinate space.

In other words, if a given texture coordinate (in $[0,1]^2$) maps (under nearest-neighbor filtering) to a particular texel of the full resolution buffer, then that full resolution texel must be considered a candidate for the min computed for the texel that the same texture coordinate maps to at every higher mip level.

If this correspondence is guaranteed, we can be sure that a lookup at a higher mip level will never return a depth value greater than the value at the same texture coordinate in the full resolution buffer (level 0). For any given level N, this guarantee holds with respect to all levels below it (< N).

For even dimensions (and full resolution buffers whose dimensions are powers of two have even dimensions at every level until the very end, where they become 1), this is easy to achieve. In the one-dimensional case, for the texel with index $i$ at level N we take the texels at level N-1 with indices $2i$ and $2i+1$ and compute their min, i.e. $D_N[i] = \min(D_{N-1}[2i],\, D_{N-1}[2i+1])$. Texels (and therefore their texture coordinate extents) map directly in a 2-to-1 ratio, because each level is exactly half the size of the previous one.


An example of even level dimensions: 6 texels at this level reduce to 3 at the next level up. The texture coordinate extent of each of the three higher-level texels maps exactly onto two lower-level texels. (Dots are texel centers, and squares are texel extents in texture coordinate space under nearest-neighbor filtering.)

With odd level dimensions (and a full resolution buffer that is not a power of two will have at least one level with an odd dimension), things get harder. For a level N-1 of odd dimension $dim_{N-1}$, the dimension of the next level (N) is $dim_N = \lfloor dim_{N-1}/2 \rfloor$, i.e. $\ne dim_{N-1}/2$.

This means we no longer have a clean 2-to-1 mapping of level N-1 texels to level N texels. Instead, the texture coordinate extent of each texel at level N now covers the extent of 3 texels at level N-1.


An example of an odd level dimension: 7 texels at this level reduce to 3 texels at the next level up. The texture coordinate extents of the three higher-level texels each overlap the extents of three lower-level texels.

Hence $D_N[i] = \min(D_{N-1}[2i],\, D_{N-1}[2i+1],\, D_{N-1}[2i+2])$. For example, with $dim_{N-1} = 7$, texel $D_N[1]$ takes the min of $D_{N-1}[2]$, $D_{N-1}[3]$, and $D_{N-1}[4]$, while $D_N[2]$ takes the min of $D_{N-1}[4]$, $D_{N-1}[5]$, and $D_{N-1}[6]$; so one texel at level N-1 sometimes contributes to the min of 2 texels at level N. This is necessary to maintain the correspondence described above.

The description above was one-dimensional for simplicity. In two dimensions, if both dimensions of level N-1 are even, a 2x2 texel region at level N-1 maps to one texel at level N. If one of the dimensions is odd, a 2x3 or 3x2 region at level N-1 maps to one texel at level N. If both dimensions are odd, the "corner" texel must also be included, so a 3x3 region at level N-1 maps to one texel at level N.

Code example


The GLSL fragment shader code shown below implements the algorithm described above. It must be run once for each successive mip level, starting at level 1 (level 0 is the full resolution level).

uniform sampler2D u_depthBuffer;
uniform int u_previousLevel;
uniform ivec2 u_previousLevelDimensions;

void main() {
	ivec2 thisLevelTexelCoord = ivec2(gl_FragCoord);
	ivec2 previousLevelBaseTexelCoord = 2 * thisLevelTexelCoord;

	vec4 depthTexelValues;
	depthTexelValues.x = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord,
                                    u_previousLevel).r;
	depthTexelValues.y = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(1, 0),
                                    u_previousLevel).r;
	depthTexelValues.z = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(1, 1),
                                    u_previousLevel).r;
	depthTexelValues.w = texelFetch(u_depthBuffer,
                                    previousLevelBaseTexelCoord + ivec2(0, 1),
                                    u_previousLevel).r;

	float minDepth = min(min(depthTexelValues.x, depthTexelValues.y),
                         min(depthTexelValues.z, depthTexelValues.w));

    // Incorporate additional texels if the previous level's width or height (or both) 
    // are odd. 
	bool shouldIncludeExtraColumnFromPreviousLevel = ((u_previousLevelDimensions.x & 1) != 0);
	bool shouldIncludeExtraRowFromPreviousLevel = ((u_previousLevelDimensions.y & 1) != 0);
	if (shouldIncludeExtraColumnFromPreviousLevel) {
		vec2 extraColumnTexelValues;
		extraColumnTexelValues.x = texelFetch(u_depthBuffer,
                                              previousLevelBaseTexelCoord + ivec2(2, 0),
                                              u_previousLevel).r;
		extraColumnTexelValues.y = texelFetch(u_depthBuffer,
                                              previousLevelBaseTexelCoord + ivec2(2, 1),
                                              u_previousLevel).r;

		// In the case where the width and height are both odd, need to include the 
        // 'corner' value as well. 
		if (shouldIncludeExtraRowFromPreviousLevel) {
			float cornerTexelValue = texelFetch(u_depthBuffer,
                                                previousLevelBaseTexelCoord + ivec2(2, 2),
                                                u_previousLevel).r;
			minDepth = min(minDepth, cornerTexelValue);
		}
		minDepth = min(minDepth, min(extraColumnTexelValues.x, extraColumnTexelValues.y));
	}
	if (shouldIncludeExtraRowFromPreviousLevel) {
		vec2 extraRowTexelValues;
		extraRowTexelValues.x = texelFetch(u_depthBuffer,
                                           previousLevelBaseTexelCoord + ivec2(0, 2),
                                           u_previousLevel).r;
		extraRowTexelValues.y = texelFetch(u_depthBuffer,
                                           previousLevelBaseTexelCoord + ivec2(1, 2),
                                           u_previousLevel).r;
		minDepth = min(minDepth, min(extraRowTexelValues.x, extraRowTexelValues.y));
	}

	gl_FragDepth = minDepth;
}

Flaws in this code


First, for full resolution depth buffers in which one dimension is more than twice the other, the texelFetch calls may address coordinates outside u_depthBuffer. (In such cases, one dimension of the mip levels shrinks to 1 before the other does.) I used texelFetch (with integer coordinates) in this example to make what is happening as clear as possible, and I have not personally encountered such especially wide/tall depth buffers. If you do run into this problem, you can clamp the coordinates passed to texelFetch, or use texture with normalized texture coordinates (and a clamp-to-edge sampler). Considering the same texel more than once in these edge cases is harmless, because including a value in a min or max repeatedly does not change the result.
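
For instance, the clamping variant might look like this (a sketch reusing the uniforms from the shader above; fetchPreviousLevelDepth is a hypothetical helper):

// Clamp texel coordinates so fetches never address texels outside the
// previous mip level. Re-fetching an edge texel is harmless for min/max.
float fetchPreviousLevelDepth(ivec2 texelCoords) {
	ivec2 clampedCoords = clamp(texelCoords, ivec2(0), u_previousLevelDimensions - ivec2(1));
	return texelFetch(u_depthBuffer, clampedCoords, u_previousLevel).r;
}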

Second, although the first four texelFetch calls could be replaced with a single textureGather, that complicates things (because textureGather cannot specify a mip level), and I saw no speedup from using textureGather anyway.

Performance


I used the fragment shader above to generate two full mip chains, one for each of two depth buffers (one per eye) in my VR engine. In the test, the per-eye resolution was 1648x1776, which produces 10 additional downsampled mip levels (and therefore 10 passes). Generating the full chains for both eyes took 0.25 ms on an NVIDIA GTX 980 and 0.30 ms on an AMD R9 290.



[Figure: mip levels 4, 5, and 6 generated by the code above. Mip level 4 has dimensions 103x111 for the 1648x1776 full resolution buffer.]

An alternative approach to mip downsampling


The goal of the algorithm described above is to preserve the accuracy of depth queries in texture coordinate (or NDC) space. For completeness (and because Technique 2 below drops this guarantee), I want to demonstrate another method I have come across (for example, in this article).

Note that, like the previous one, this alternative method is designed for full resolution buffers whose dimensions are not powers of two (and of course it also works with power-of-two dimensions).

In this alternative method, when downsampling a level with an odd width (or height), instead of folding an extra column (or row) of texels from the previous (lower) level into every output texel, we do so only for the output texels with the maximum index (the "edge" texels). The only things that change in the fragment shader above are the assignments to shouldIncludeExtraColumnFromPreviousLevel and shouldIncludeExtraRowFromPreviousLevel:

// If the previous level's width is odd and this is the highest-indexed "edge" texel for 
// this level, incorporate the rightmost edge texels from the previous level. The same goes 
// for the height. 
bool shouldIncludeExtraColumnFromPreviousLevel =
    (previousLevelBaseTexelCoord.x == u_previousLevelDimensions.x - 3);
bool shouldIncludeExtraRowFromPreviousLevel =
    (previousLevelBaseTexelCoord.y == u_previousLevelDimensions.y - 3);

As a result, the highest-indexed "edge" texels become very "fat": every halving of an odd dimension makes them cover a proportionally larger interval of normalized texture coordinate space.

The drawback of this approach is that querying higher mip levels becomes more complicated. Instead of simply using normalized texture coordinates, we must first find the full resolution texel corresponding to those coordinates, and then map that texel's coordinates to the coordinates of the mip level being queried.

The code snippet below converts from NDC space $[-1,1]^2$ to texel coordinates at mip level higherMipLevel:

vec2 windowCoords = (0.5 * ndc.xy + vec2(0.5)) * textureSize(u_depthBuffer, 0);
// Account for texel centers being halfway between integers. 
ivec2 texelCoords = ivec2(round(windowCoords.xy - vec2(0.5)));
ivec2 higherMipLevelTexelCoords =
    min(texelCoords / (1 << higherMipLevel),
        textureSize(u_depthBuffer, higherMipLevel).xy - ivec2(1));

Technique 2: Generating a Single Hi-Z Level Using a Compute Shader


Generating a complete mip chain is fairly fast, but it bothered me a little that my application generates all those levels and uses only one of them (level 4). Besides removing this minor inefficiency, I also wanted to see how much faster things could get with a single compute shader dispatch that generates only the level I need. (Note that my application can stop at level 4 when using the multi-pass fragment shader solution, so that is what I use as the baseline for the timing comparison at the end of this section.)

In many Hi-Z applications only one downsampled level is needed, so I expect this situation to be common. I wrote the compute shader for my own specific requirements (generating the equivalent of mip level 4, at 1/16 x 1/16 of the original resolution), but similar code can be used to generate other levels.

A compute shader is well suited to this task because it can use workgroup shared memory to exchange data between threads. Each workgroup is responsible for one texel of the output (downsampled) buffer, and the workgroup's threads split the work of computing the min of the corresponding full resolution texels, sharing results through shared memory.

I tried two main compute shader approaches. In the first, each thread calls atomicMin on a single shared memory variable.

Note that since one cannot (without vendor-specific hardware extensions) perform atomic operations on non-integer values (and my depths are stored as float), a trick is needed here. Because non-negative IEEE 754 floating-point values keep their ordering when their bits are interpreted as unsigned integers, we can use floatBitsToUint to reinterpret-cast the float depth values to uint and call atomicMin, then apply uintBitsToFloat to the resulting min uint value.
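
In isolation, the trick looks like this (a sketch with hypothetical names; as in the full shader below, the shared variable must be initialized by one thread before any atomicMin calls):

shared uint s_minDepthBits; // initialize to floatBitsToUint(1.0) first

void accumulateMinDepth(float depth) {
	// Valid because non-negative IEEE 754 floats keep their ordering when
	// their bit patterns are compared as unsigned integers.
	atomicMin(s_minDepthBits, floatBitsToUint(depth));
}

// After a barrier, the final min is recovered with:
// float minDepth = uintBitsToFloat(s_minDepthBits);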

The most obvious atomicMin solution would be to use 16x16 thread workgroups in which each thread fetches one texel and then performs atomicMin against the shared memory value. I compared this approach with variants using smaller thread blocks (8x8, 4x8, 4x4, 2x4, 2x2), in which each thread fetches a region of texels and computes its own local min before calling atomicMin.

The fastest of all tested atomicMin variants on both NVIDIA and AMD was the one with 4x4 thread blocks (in which each thread itself fetches a 4x4 texel region). I don't entirely understand why this variant is fastest, but perhaps it reflects a tradeoff between atomic operation contention and independent per-thread computation. It is also worth noting that a 4x4 workgroup uses only 16 threads of a warp/wave (which could hold 32 or 64), which is interesting. The code example below implements this approach.

As an alternative to atomicMin, I tried a parallel reduction using the techniques from this widely cited NVIDIA presentation. (The basic idea is to use a shared memory array with the same size as the number of threads in the workgroup, plus $\log_2(n)$ passes that successively combine per-thread mins until the final min of the entire workgroup is obtained.)

I tried this solution with all the same workgroup sizes as with atomicMin. Even with all the optimizations described in the NVIDIA presentation, the parallel reduction solution was slightly slower (within about ten percent on both GPUs) than the atomicMin solution I settled on. And the atomicMin solution is considerably simpler in code terms.
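
For reference, here is a minimal sketch of such a shared memory min reduction (assuming a 16-thread workgroup, and without the warp-level optimizations described in the presentation; reduceWorkgroupMin is a hypothetical helper):

#define NUM_THREADS 16u // e.g. a 4x4 workgroup

shared float s_threadMins[NUM_THREADS];

float reduceWorkgroupMin(float threadLocalMin) {
	s_threadMins[gl_LocalInvocationIndex] = threadLocalMin;
	memoryBarrierShared();
	barrier();

	// log2(NUM_THREADS) passes, halving the number of active threads each pass.
	for (uint stride = NUM_THREADS / 2u; stride > 0u; stride /= 2u) {
		if (gl_LocalInvocationIndex < stride) {
			s_threadMins[gl_LocalInvocationIndex] =
			    min(s_threadMins[gl_LocalInvocationIndex],
			        s_threadMins[gl_LocalInvocationIndex + stride]);
		}
		memoryBarrierShared();
		barrier();
	}

	return s_threadMins[0]; // the workgroup-wide min
}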

Code example


With this method, it is easiest not to try to maintain the correspondence in normalized texture coordinate space between the downsampled texels and the full resolution ones. Instead, simply convert from full resolution texel coordinates to reduced resolution texel coordinates:

ivec2 reducedResTexelCoords = texelCoords / ivec2(downscalingFactor);

In my case (generating the equivalent of mip level 4), downscalingFactor is 16.

As mentioned above, the following GLSL compute shader implements the atomicMin solution with 4x4 workgroups, in which each thread fetches a 4x4 texel region from the full resolution buffer. The resulting min depth buffer is 1/16 x 1/16 the size of the full resolution buffer (with dimensions rounded up when the full resolution dimensions are not evenly divisible by 16).

uniform sampler2D u_inputDepthBuffer;
uniform restrict writeonly image2D u_outputDownsampledMinDepthBufferImage;
// The dimension in normalized texture coordinate space of a single texel in 
// u_inputDepthBuffer. 
uniform vec2 u_texelDimensions;

// Resulting image is 1/16th x 1/16th resolution, but we fetch 4x4 texels per thread, hence 
// the divisions by 4 here. 
layout(local_size_x = 16/4, local_size_y = 16/4, local_size_z = 1) in;

// This is stored as uint because atomicMin only works on integer types. Luckily 
// (non-negative) floats maintain their order when their bits are interpreted as uint (using 
// floatBitsToUint). 
shared uint s_workgroupMinDepthEncodedAsUint;

void main() {
	if (gl_LocalInvocationIndex == 0) {
        // Initialize to 1.0 (max depth) before performing atomicMin's. 
		s_workgroupMinDepthEncodedAsUint = floatBitsToUint(1.0);
	}

	memoryBarrierShared();
	barrier();

	// Fetch a 4x4 texel region per thread with 4 calls to textureGather. 'gatherCoords' 
    // are set up to be equidistant from the centers of the 4 texels being gathered (which 
    // puts them on integer values). In my tests textureGather was not faster than 
    // individually fetching each texel - I use it here only for conciseness. 
    // 
    // Note that in the case of the full-res depth buffer's dimensions not being evenly 
    // divisible by the downscaling factor (16), these textureGather's may try to fetch 
    // out-of-bounds coordinates - that's fine as long as the texture sampler is set to 
    // clamp-to-edge, as redundant values don't affect the resulting min. 

	uvec2 baseTexelCoords = 4 * gl_GlobalInvocationID.xy;
	vec2 gatherCoords1 = (baseTexelCoords + uvec2(1, 1)) * u_texelDimensions;
	vec2 gatherCoords2 = (baseTexelCoords + uvec2(3, 1)) * u_texelDimensions;
	vec2 gatherCoords3 = (baseTexelCoords + uvec2(1, 3)) * u_texelDimensions;
	vec2 gatherCoords4 = (baseTexelCoords + uvec2(3, 3)) * u_texelDimensions;

	vec4 gatheredTexelValues1 = textureGather(u_inputDepthBuffer, gatherCoords1);
	vec4 gatheredTexelValues2 = textureGather(u_inputDepthBuffer, gatherCoords2);
	vec4 gatheredTexelValues3 = textureGather(u_inputDepthBuffer, gatherCoords3);
	vec4 gatheredTexelValues4 = textureGather(u_inputDepthBuffer, gatherCoords4);

	// Now find the min across the 4x4 region fetched, and apply that to the workgroup min 
    // using atomicMin. 
	vec4 gatheredTexelMins = min(min(gatheredTexelValues1, gatheredTexelValues2),
                                 min(gatheredTexelValues3, gatheredTexelValues4));
	float finalMin = min(min(gatheredTexelMins.x, gatheredTexelMins.y),
                         min(gatheredTexelMins.z, gatheredTexelMins.w));
	atomicMin(s_workgroupMinDepthEncodedAsUint, floatBitsToUint(finalMin));

	memoryBarrierShared();
	barrier();

    // Thread 0 writes workgroup-wide min to image. 
	if (gl_LocalInvocationIndex == 0) {
		float workgroupMinDepth = uintBitsToFloat(s_workgroupMinDepthEncodedAsUint);

		imageStore(u_outputDownsampledMinDepthBufferImage,
		           ivec2(gl_WorkGroupID.xy),
                   // imageStore can only be passed vec4, but only a float is stored. 
				   vec4(workgroupMinDepth));
	}
}

Performance


I used the compute shader above to process full resolution depth buffers with the same dimensions as in the mip chain test (1648x1776 per eye). It runs in 0.12 ms on the NVIDIA GTX 980 and 0.08 ms on the AMD R9 290. Compared with the time to generate only mip levels 1-4 (0.22 ms on NVIDIA, 0.25 ms on AMD), the compute shader solution is 87% faster on the NVIDIA GPU and 197% faster on the AMD GPU.

In absolute terms the speedup is not huge, but every 0.1 ms matters, especially in VR :)
