3D graphics on the STM32F103

A short story about how to cram in what does not fit and display real-time three-dimensional graphics on a controller that has neither the speed nor the memory for it.

Back in 2017 (judging by the file modification dates), I decided to move from AVR controllers to the more powerful STM32. Naturally, the first controller was the widely advertised F103. Just as naturally, I rejected off-the-shelf development boards in favor of building my own from scratch to suit my requirements. Oddly enough, there were almost no blunders (except that UART1 should have been routed to a proper connector instead of being patched in with loose wires).

Compared with AVR, the chip's specifications are pretty decent: a 72 MHz clock (in practice it can be overclocked to 100 MHz or even more, but only at your own risk!), 20 kB of RAM and 64 kB of flash. Plus a ton of peripherals, where the main difficulty is not to be scared by the abundance and to realize that you do not need to work through all ten registers to get started; setting three bits in the right ones is enough. At least until you want something unusual.

When the first euphoria of owning such power had passed, I wanted to probe its limits. As a showy example I chose three-dimensional graphics with everything that comes with it: matrices, lighting, polygonal models and a Z-buffer, output to a 320x240 display on the ili9341 controller. The two most obvious problems are speed and memory. A 320x240 screen at 16 bits per pixel needs 150 kB per frame, but we only have 20 kB of RAM in total. And these 150 kB must be sent to the display at least 10 times per second, so the transfer rate has to be at least 1.5 MB/s, or 12 Mbit/s, which already looks like a significant load on the core.

Fortunately, this controller has a DMA (Direct Memory Access) module, which spares the core from shuffling bytes from one place to another. You can prepare a buffer, tell the module “here is your data buffer, get to work!”, and meanwhile prepare the data for the next transfer. Since the display can also accept data as a continuous stream, the following scheme emerges: allocate a front buffer, from which DMA transfers data to the display, a back buffer, into which rendering takes place, and a Z-buffer used for depth testing. Each buffer holds a single row (or column, it makes no difference) of the display. Instead of 150 kB we need only 1920 bytes (320 pixels per line * 3 buffers * 2 bytes per pixel), which fits in memory comfortably.

The second hack follows from the fact that the transformation matrices and vertex coordinates must not be recalculated for every row: the image would be distorted in the most bizarre ways, and it would be slow besides. Instead, the "external" calculations, that is, multiplying the transformation matrices and applying them to the vertices, are done once per frame, and the results are converted to an intermediate representation optimized for rendering into a 320x1 picture.
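
The memory arithmetic above is easy to check with a few lines of C. This is only a sketch: the names LCD_WIDTH, LCD_HEIGHT, line_buffers_t and the two helper functions are made up for illustration and are not taken from the project source, and the Z-buffer entry is assumed to be 16 bits wide to match the "2 bytes per pixel" in the estimate.

```c
#include <stdint.h>
#include <stddef.h>

enum { LCD_WIDTH = 320, LCD_HEIGHT = 240 };

/* The three one-row buffers described in the text (hypothetical layout). */
typedef struct {
    uint16_t front[LCD_WIDTH]; /* RGB565 row currently being sent by DMA */
    uint16_t back[LCD_WIDTH];  /* RGB565 row currently being rendered    */
    int16_t  depth[LCD_WIDTH]; /* Z value for each pixel of the back row */
} line_buffers_t;

/* Bytes a full 16-bit frame buffer would need. */
static size_t full_frame_bytes(void) {
    return (size_t)LCD_WIDTH * LCD_HEIGHT * sizeof(uint16_t);
}

/* Minimum bus throughput in bytes per second for a given frame rate. */
static size_t required_bytes_per_second(unsigned fps) {
    return full_frame_bytes() * fps;
}
```

Here full_frame_bytes() gives 153600 bytes (150 kB), sizeof(line_buffers_t) is 1920 bytes, and required_bytes_per_second(10) is 1536000 bytes per second, which is about 12 Mbit/s.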

For mischievous reasons, the library resembles OpenGL from the outside. As in the original OpenGL, rendering begins with building the transformation matrix: glLoadIdentity() resets the current matrix to the identity matrix, and it is followed by a set of transformations, glRotateXY(...) and glTranslate(...), each of which is multiplied into the current matrix. Since these calculations are carried out only once per frame, there are no special speed requirements, and plain floats will do, with no fixed-point contortions. The matrix itself is a float[4][4] array mapped onto a one-dimensional float[16] array. This trick is usually applied to dynamic arrays, but static arrays benefit a little from it too. Another standard hack: instead of constantly computing the sines and cosines that the rotation matrices are full of, compute them in advance and store them in a table. To do this, divide the full circle into 256 parts, compute the sine for each and put the values into the sin_table[] array. And anyone who remembers school can get the cosine from the sine. Note that the rotation functions take the angle not in radians but in fractions of a full turn, reduced to the range [0 ... 255]. There are also “honest” functions that convert an ordinary angle to these fractions under the hood.
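
A minimal sketch of the table and the matrix layout might look like this. Only sin_table[] is a name from the article; mat4_t, tab_sin, tab_cos, mat4_identity, mat4_mul and mat4_rotate_z are hypothetical helpers, not the library's real API. The article computes the table in advance; to stay free of the math library, this sketch fills it at start-up with a rotation recurrence instead, which is a substitution on my part.

```c
#include <stdint.h>

#define ANGLE_STEPS 256                /* one full turn = 256 steps */

/* The same 16 floats seen as a 4x4 grid or as a flat array. */
typedef union {
    float m[4][4];
    float v[16];
} mat4_t;

static float sin_table[ANGLE_STEPS];

/* Fill the table once: each entry is the previous one rotated by 1/256 turn. */
static void sin_table_init(void) {
    const double step_sin = 0.024541228522912288; /* sin(2*pi/256) */
    const double step_cos = 0.99969881869620425;  /* cos(2*pi/256) */
    double s = 0.0, c = 1.0;
    for (int i = 0; i < ANGLE_STEPS; i++) {
        sin_table[i] = (float)s;
        double next_s = s * step_cos + c * step_sin;
        c = c * step_cos - s * step_sin;
        s = next_s;
    }
}

/* A uint8_t angle wraps around by itself, so no range check is needed. */
static float tab_sin(uint8_t a) { return sin_table[a]; }
/* cos(x) = sin(x + quarter turn), and a quarter turn is 64 steps. */
static float tab_cos(uint8_t a) { return sin_table[(uint8_t)(a + 64)]; }

/* The flat view makes the identity matrix a single loop. */
static void mat4_identity(mat4_t *a) {
    for (int i = 0; i < 16; i++) a->v[i] = (i % 5 == 0) ? 1.0f : 0.0f;
}

/* out = a * b; a temporary allows out to alias a or b. */
static void mat4_mul(mat4_t *out, const mat4_t *a, const mat4_t *b) {
    mat4_t r;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float acc = 0.0f;
            for (int k = 0; k < 4; k++) acc += a->m[i][k] * b->m[k][j];
            r.m[i][j] = acc;
        }
    *out = r;
}

/* Multiply the current matrix by a rotation about Z; angle in 1/256 turns. */
static void mat4_rotate_z(mat4_t *current, uint8_t angle) {
    mat4_t r;
    float s = tab_sin(angle), c = tab_cos(angle);
    mat4_identity(&r);
    r.m[0][0] = c;  r.m[0][1] = -s;
    r.m[1][0] = s;  r.m[1][1] = c;
    mat4_mul(current, current, &r);
}
```

After sin_table_init(), a call such as mat4_rotate_z(&m, 64) applies a quarter turn to an identity matrix m. Because the angle is a single byte, values past a full turn wrap around without any explicit range reduction.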

When the matrix is ready, you can start drawing primitives. In general, three-dimensional graphics has three kinds of primitives: the point, the line and the triangle. But since we are interested in polygonal models, only the triangle deserves attention. It is "rendered" by the function glDrawTriangle() or glDrawTriangleV(). The word "rendered" is in quotation marks because nothing is actually drawn at this stage. We just multiply all the points of the primitive by the transformation matrix and then derive from them the analytical formulas of the edges, y = ky * x + by, which let us find the intersections of all three edges of the triangle with the current output line. One of the three intersections is discarded, since it lies not on the segment between the vertices but on its continuation. So to draw a frame, you just go through all the lines and, for each one, paint the region between the two intersection points.

But if this algorithm is applied head-on, each primitive will overwrite those drawn before it. The Z-coordinate (depth) has to be taken into account so that the triangles intersect properly. Instead of simply writing out each point, we compute its Z-coordinate and compare it with the value stored in the depth buffer, and then either output the point (updating the Z-buffer) or skip it. To calculate the Z-coordinate of every point on the line of interest, we use the same kind of straight-line formula, z = kz * y + bz, computed from the same two intersection points with the edges. As a result, the “semi-finished” triangle object, struct glTriangle, consists of the three X-coordinates of the vertices (there is no point in storing the Y- and Z-coordinates, since they will be calculated), the k and b coefficients of the lines and, for good measure, the color.

Here, unlike the transformation matrices, speed is critical, so fixed-point numbers are used. For the term b, the same precision as for the coordinates (2 bytes) is sufficient, but for the factor k, the more precision the better, so it gets 4 bytes. Still not a float, though, since working with integers is faster even at the same size.
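
The edge formula y = ky * x + by can be sketched in fixed point as follows. Everything here is an assumption made for illustration: edge_t and its helpers are not the fields of the real struct glTriangle, x is taken to be the index of the output line and y the position along it, and k is stored as Q16.16. Unlike the article, the sketch keeps b in 32 bits, because in this simplified form a steep edge can push the intercept beyond the 16-bit range.

```c
#include <stdint.h>

/* One triangle edge as the line y = k*x + b (hypothetical layout). */
typedef struct {
    int32_t k;            /* slope, Q16.16 fixed point                     */
    int32_t b;            /* intercept in whole pixels (2 bytes in the
                             article; wider here so it cannot overflow)    */
    int16_t x_min, x_max; /* range of output lines between the vertices    */
    uint8_t valid;        /* 0 when the edge runs along a single line      */
} edge_t;

/* Multiply a Q16.16 slope by an integer and drop the fractional part. */
static int32_t q16_mul_int(int32_t k, int32_t x) {
    return (int32_t)(((int64_t)k * x) / 65536);
}

/* Build the edge through (x0, y0) and (x1, y1), given in screen coordinates. */
static edge_t edge_from_points(int16_t x0, int16_t y0, int16_t x1, int16_t y1) {
    edge_t e = {0, 0, 0, 0, 0};
    if (x0 == x1) return e;                 /* no single crossing point */
    e.k = (int32_t)(((int64_t)(y1 - y0) * 65536) / (x1 - x0));
    e.b = y0 - q16_mul_int(e.k, x0);        /* makes the line pass through (x0, y0) */
    e.x_min = x0 < x1 ? x0 : x1;
    e.x_max = x0 < x1 ? x1 : x0;
    e.valid = 1;
    return e;
}

/* True if line x crosses the edge itself rather than its continuation. */
static int edge_covers(const edge_t *e, int16_t x) {
    return e->valid && x >= e->x_min && x <= e->x_max;
}

/* Position along line x where the edge crosses it. */
static int32_t edge_y(const edge_t *e, int16_t x) {
    return q16_mul_int(e->k, x) + e->b;
}
```

For a triangle, the two edges for which edge_covers() is true on a given line provide the two intersection points, and the third edge is the one that gets discarded. The same construction with depth in place of position gives the z = kz * y + bz line between those two points.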

So, by calling glDrawTriangle() a number of times, we have prepared an array of semi-finished triangles. In my implementation the triangles are emitted one at a time by explicit function calls. It would really be more logical to have an array of triangles referring to vertex addresses, but I decided not to complicate things. The rendering function is generated automatically anyway, and a generator does not care whether it fills in a constant array or writes three hundred identical calls.

It's time to turn the semi-finished triangles into a beautiful picture on the screen. This is what the glSwapBuffers() function is for. As described above, it goes through the lines of the display, finds the intersection points of each line with every triangle, and draws the segments with depth filtering. After each line is rendered, it has to be sent to the display. To do this, DMA is started and given the address of the line and its size. While DMA is working, you can switch to the other buffer and render the next line. The main thing is not to forget to wait for the end of the transfer if rendering happens to finish first.

To visualize the ratio of speeds, I turn a red LED on when rendering ends and off when the wait for DMA completes. The result is something like PWM, with the brightness depending on the waiting time. In theory, DMA interrupts could be used instead of a “dumb” wait, but at the time I did not know how to use them, and the algorithm would have become much more complicated. For a demo program, that is overkill.
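
The line loop with two buffers and a depth test can be sketched on a PC as below. This is not the real glSwapBuffers(): the hardware is replaced by stand-ins (sim_screen, dma_start, dma_wait, led_red are all hypothetical), each triangle is assumed to be already reduced to a span on the current line (span_t), depth is assumed to lie in 0..32767 with smaller values closer to the viewer, and the name swap_buffers_sketch is invented.

```c
#include <stdint.h>
#include <string.h>

enum { LINE_PIXELS = 320, SCREEN_LINES = 240 };

/* Host-side stand-ins for the hardware. */
static uint16_t sim_screen[SCREEN_LINES][LINE_PIXELS]; /* what the display got */
static int      sim_lines_sent;
static int      sim_led_red;

static void led_red(int on) { sim_led_red = on; }
/* On the chip this would start a DMA transfer and return immediately. */
static void dma_start(const uint16_t *line) {
    if (sim_lines_sent < SCREEN_LINES)
        memcpy(sim_screen[sim_lines_sent++], line, sizeof sim_screen[0]);
}
/* On the chip this would poll until the transfer has finished. */
static void dma_wait(void) { }

/* Footprint of one triangle on the current line. */
typedef struct {
    int16_t  first, last;     /* first and last pixel, inclusive */
    int16_t  z_first, z_last; /* depth at both ends              */
    uint16_t color;           /* RGB565                          */
} span_t;

/* Draw a span, keeping only pixels closer than the depth buffer holds. */
static void fill_span(uint16_t *line, int16_t *depth, const span_t *s) {
    int len = s->last - s->first;
    /* Depth step per pixel in Q16.16. */
    int32_t kz = len > 0 ? ((int32_t)(s->z_last - s->z_first) * 65536) / len : 0;
    int32_t z = (int32_t)s->z_first * 65536;
    for (int p = s->first; p <= s->last; p++, z += kz) {
        if (p < 0 || p >= LINE_PIXELS) continue;
        int16_t zi = (int16_t)(z / 65536);
        if (zi < depth[p]) { depth[p] = zi; line[p] = s->color; }
    }
}

typedef void (*render_line_fn)(int line, uint16_t *pixels, int16_t *depth);

static void swap_buffers_sketch(render_line_fn render) {
    static uint16_t pixels[2][LINE_PIXELS]; /* front and back line buffers */
    static int16_t  depth[LINE_PIXELS];
    int back = 0;
    for (int line = 0; line < SCREEN_LINES; line++) {
        for (int p = 0; p < LINE_PIXELS; p++) {
            pixels[back][p] = 0;
            depth[p] = INT16_MAX;
        }
        render(line, pixels[back], depth); /* overlaps the previous transfer */
        led_red(1);                        /* rendering done, now waiting    */
        dma_wait();                        /* previous line must be out      */
        led_red(0);
        dma_start(pixels[back]);           /* send the line just rendered    */
        back ^= 1;                         /* render into the other buffer   */
    }
    dma_wait();                            /* let the last line finish       */
}

/* Demo content: a near green span is drawn first, a far red one second. */
static void demo_render_line(int line, uint16_t *pixels, int16_t *depth) {
    (void)line;
    const span_t near_green = { 100, 300,  500,  500, 0x07E0 };
    const span_t far_red    = {  10, 200, 1000, 1000, 0xF800 };
    fill_span(pixels, depth, &near_green);
    fill_span(pixels, depth, &far_red);
}
```

Calling swap_buffers_sketch(demo_render_line) fills sim_screen with 240 identical lines. Where the two spans overlap, the green one stays visible even though the red one was drawn later, which is exactly what the depth test is for.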

The result of all this was a rotating picture of three intersecting planes of different colors, running at a fairly decent speed: the red LED is quite bright, which indicates a large performance margin in the core.

Well, if the core is idle, it should be given work. And we will load it with better models. Do not forget, though, that memory is still very limited, so the controller physically cannot handle too many polygons. A simple calculation showed that after subtracting the line buffers and the like, there is room for 378 triangles. As practice has shown, models from the old but interesting game Gothic are a perfect fit for this size. In fact, the models of a snake and a bloodfly were pulled from there (and, by the time of writing this article, a glocoor as well, the one showing off in the lead image), after which the controller ran out of flash memory. But game models are not intended for use on a microcontroller.

For instance, they contain animation, textures and the like, which are of no use to us and would not fit in memory. Fortunately, Blender can not only save them as *.obj, which is much easier to parse, but also reduce the number of polygons if necessary. Then a simple home-made program, obj2arr, splits the *.obj files into coordinates, from which a *.h file is generated for direct inclusion in the firmware.
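
The article does not show obj2arr itself, so the following is only a guess at what the parsing part of such a converter might look like. It relies on two facts about the OBJ text format: vertex lines start with "v" followed by three coordinates, and face lines start with "f" followed by 1-based vertex indices, optionally with "/texture/normal" suffixes. The names vec3_t, parse_vertex and parse_face are hypothetical, and relative (negative) indices are deliberately not supported.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { float x, y, z; } vec3_t;

/* Parse a "v x y z" line. Returns 1 on success, 0 for any other line
 * (including "vn" and "vt", where the number conversion fails). */
static int parse_vertex(const char *line, vec3_t *v) {
    return sscanf(line, "v %f %f %f", &v->x, &v->y, &v->z) == 3;
}

/* Parse the first three corners of an "f" line. Each corner may be written
 * as "a", "a/t" or "a/t/n"; only the vertex index a is kept, converted to
 * 0-based. Returns 1 on success. */
static int parse_face(const char *line, int idx[3]) {
    if (line[0] != 'f' || line[1] != ' ') return 0;
    const char *p = line + 2;
    for (int i = 0; i < 3; i++) {
        char *end;
        long n = strtol(p, &end, 10);       /* skips leading spaces itself */
        if (end == p || n < 1) return 0;    /* no number, or unsupported index */
        idx[i] = (int)n - 1;
        while (*end && *end != ' ') end++;  /* skip any "/t/n" suffix */
        p = end;
    }
    return 1;
}
```

A converter built on these two functions would read the file line by line, collect the vertices and faces into arrays, and print them as C initializers into the *.h file.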

But so far the models look like flat, oddly shaped blobs. This did not bother us on the test model, since every face was painted in its own color, but we are not going to assign a separate color to each polygon of a model. You can, of course, paint the fly in random colors, but it looks rather wild; I checked. Especially when the colors also change on every frame... Instead, we apply another drop of vector magic and add lighting.

In its most primitive form, the lighting calculation consists of computing the scalar (dot) product of the face normal and the direction to the light source, and then multiplying the face's “native” color by the result.
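
As a sketch, the calculation described above could look like this for RGB565 colors, the 16-bit pixel format of the ili9341. The names vec3_t, dot3, scale_rgb565 and lit_color are invented for the example. Both vectors are assumed to be of unit length already, and clamping the result to the range 0..1 (so that faces turned away from the light come out black) is my assumption rather than something the article states.

```c
#include <stdint.h>

typedef struct { float x, y, z; } vec3_t;

/* Scalar (dot) product of two vectors. */
static float dot3(vec3_t a, vec3_t b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Scale every channel of an RGB565 color by k, clamped to 0..1. */
static uint16_t scale_rgb565(uint16_t color, float k) {
    if (k < 0.0f) k = 0.0f;
    if (k > 1.0f) k = 1.0f;
    unsigned r = (unsigned)(((color >> 11) & 0x1F) * k); /* 5 bits of red   */
    unsigned g = (unsigned)(((color >> 5) & 0x3F) * k);  /* 6 bits of green */
    unsigned b = (unsigned)((color & 0x1F) * k);         /* 5 bits of blue  */
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Face color after lighting: base color times the cosine of the angle
 * between the unit normal and the unit direction to the light. */
static uint16_t lit_color(uint16_t base, vec3_t normal, vec3_t to_light) {
    return scale_rgb565(base, dot3(normal, to_light));
}
```

A face looking straight at the light keeps its full color, and a face at 60 degrees to it gets half the brightness. Since the color is constant across a face, this only has to be done once per triangle per frame.
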

We now have three models: two from the game and the test one we started with. To switch between them, we will use one of the two buttons soldered onto the board. While we are at it, we can add some monitoring of the processor load. One indicator already exists: the red LED tied to the DMA waiting time. The second, green, LED will be toggled on every frame update, so that the frame rate can be estimated. By eye, it came out at about 15 fps.

Overall I am satisfied with the result: it is nice to pull off something that is fundamentally impossible to solve head-on. Of course, there is still plenty to optimize and improve, but there is little point in doing so. Objectively, the controller is too weak for three-dimensional graphics, and the issue is not so much speed as RAM. But, as with any demoscene piece, the value of this project lies not in the result but in the process.

If anyone happens to be interested, the source code is available here.
