One autotest instead of 100 application launches, or how to save a QA engineer 20 years of life

Hello everyone, my name is Evgeny Demidenko. For the past few years I have been developing an automated game testing system at Pixonic. Today I'd like to share our experience developing, supporting, and using such a system on the War Robots project.

To begin with, let's look at what we actually automate with this system.

First of all, these are regression UI testing, core gameplay testing, and benchmark automation. Together, the three reduce the load on the QA department before releases, let us be more confident during large-scale and deep refactoring, and keep an up-to-date picture of the application's performance, both overall and in its individual parts. Another thing worth noting is automating routine work, for example testing one-off hypotheses.


A few numbers: more than 600 UI tests and about 100 core tests have been written for War Robots so far. On this project alone we have run our test scenarios about a million times, each run taking roughly 80 seconds. Checking the same scenarios manually would have taken at least five minutes each. On top of that, we have run more than 700 thousand benchmarks.

As for platforms, we use Android and iOS, with 12 devices in the park in total. Two programmers develop and support the system, and one QA engineer writes and analyzes tests.






As for the software stack, NUnit is at the base of it, though we use it not for unit tests but for integration and system tests. For core gameplay and build verification tests we use Unity's built-in solution, Unity Test Tools. For generating and analyzing reports after these tests we use Allure Test Report from Yandex, plus TeamCity as the continuous integration system for building the application, deploying servers, and running tests. Artifacts are stored in Nexus Repository and a PostgreSQL database.




How tests are created, run and analyzed


Suppose we want to write a simple test that checks the sound on/off icon in the game's settings window.
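In our setup such a test is written in C# with NUnit. The sketch below is purely illustrative: the UnityDriver client and its TapOn / MatchScreenshot methods are hypothetical names standing in for our real framework API, and the paths and port are made up.

```csharp
using NUnit.Framework;

[TestFixture]
public class SettingsSoundTests
{
    // Hypothetical client that talks to the game over the network.
    private UnityDriver _driver;

    [SetUp]
    public void SetUp()
    {
        _driver = new UnityDriver("127.0.0.1", 13000);        // device address and port are assumptions
        _driver.TapOn("UIRoot/HangarScreen/SettingsButton");  // open the settings window
    }

    [Test]
    public void SoundToggle_SwitchesIcon()
    {
        const string togglePath = "UIRoot/SettingsWindow/SoundToggle";

        _driver.TapOn(togglePath);
        Assert.IsTrue(_driver.MatchScreenshot("settings_sound_off"),
            "Sound-off icon does not match the reference screenshot");

        _driver.TapOn(togglePath);
        Assert.IsTrue(_driver.MatchScreenshot("settings_sound_on"),
            "Sound-on icon does not match the reference screenshot");
    }
}
```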

So, we write the test and commit it to a branch in our test repository. Then we select the tests we want to run and pick a build (or a specific commit from which the build will be assembled), launch the run, wait a while, and get the result.





In this case, 575 tests were run, 97% of which passed. Completing all of them took about three hours. For comparison, running the same tests manually would take at least 50 hours of continuous work.

So what happened to those 3% of the tests that failed?

We open a specific test and see a message saying that screenshot matching failed.



Then we open the screenshot taken on the device at that moment and see that the areas that do not match the reference are highlighted with red pixels. For comparison, here is the reference screenshot.





Naturally, after this the QA engineer must either file a bug that the build's behavior does not match the game design document, or update the reference screenshots because the design document has changed and these elements will no longer be in the game.


It looks cool. Why is all this necessary?


Some time ago on the War Robots project we needed to do a small refactoring: rewriting some of the weapon-firing code, in particular for machine guns.

During testing we found one interesting nuance: the machine guns' rate of fire depended directly on FPS. Detecting such a bug in manual testing would be unrealistic: first, because of how damage is calculated over the network in the project, and second, because War Robots is quite well optimized and at the time it ran on all devices at roughly the same 30 FPS. There were small deviations, of course, but not enough to notice the increased weapon damage during manual testing. That made us ask: how many more bugs like this do we already have, and how many more could appear during refactoring?

We did not want to reduce the number of tests; on the contrary, we wanted to increase it, since major updates and more content were planned. At the same time, we did not want to grow horizontally by hiring more QA engineers. Instead, we planned to grow vertically: reduce the routine work of the current team and make their lives easier during integration testing of new content.




What tools we use


When we first started automating tests, the first thing we looked at was Unity Integration Test Tools, which at that time was built into Unity. We wrote several UI and core tests with it, finished the refactoring we had started earlier, and were satisfied: the solution worked, which meant our assumptions were correct and we could move on. Its only drawback, but a very significant one for us, was that tests could not be run on mobile devices.

That is how we arrived at the idea of using the Appium framework. It is a fork of another well-known testing framework, Selenium, which in turn is perhaps the best-known framework for testing web applications; its main concept is working with UI elements: finding them, getting their coordinates, and sending input to them. Appium adopted this concept and, in addition to Selenium's existing web drivers, added iOS and Android drivers that use each platform's native test frameworks.

Since a Unity application has no native UI elements (there is just one native view into which the picture is rendered), we had to supplement Appium with our own UnityDriver, which lets us work with the scene hierarchy, get scene objects, and much more.
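On the game side, such a driver boils down to resolving objects in the scene and reporting where they are on screen. Here is a minimal sketch of that idea using only standard Unity APIs; it is not our actual UnityDriver code.

```csharp
using UnityEngine;

public static class SceneQuery
{
    // Returns the screen-space position of an element so the driver can tap it,
    // or null if the element is not currently active in the hierarchy.
    // GameObject.Find accepts a full hierarchy path ("Canvas/SettingsWindow/SoundToggle"),
    // which is exactly why tests end up depending on absolute paths in the scene.
    public static Vector2? GetScreenPosition(string absolutePath, Camera uiCamera)
    {
        GameObject go = GameObject.Find(absolutePath);   // finds active objects only
        if (go == null)
            return null;

        return RectTransformUtility.WorldToScreenPoint(uiCamera, go.transform.position);
    }
}
```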

By that point a QA engineer had joined the project, things got moving, and the number of test scenarios we were gradually automating began to grow significantly. We started running them on devices, and overall the workflow already looked roughly the way we wanted.

Later on, in addition to UI tests, more core tests and other tools based on our system started to appear. As a result we ran into issues with performance and stability on various devices, added support for several more devices and for parallel test runs, and eventually abandoned Appium in favor of our own framework.



The one problem that remained with us, and still does, is the UI hierarchy: if the hierarchy in a scene changes because of UI refactoring or other work on the scene, that change has to be reflected in the tests.

After further iterations and revisions, the architecture of the whole system came to look as follows.



We take the War Robots build and our tests, which live in a separate repository, add run parameters that configure the test launch for each particular case, and send it all to a TeamCity agent on a remote machine. The TeamCity agent launches our tests and passes them the War Robots build and the launch parameters; from then on, the tests run and "communicate" on their own with the devices connected to the TeamCity agent by cable: they install builds on them, launch them, execute particular scenarios, remove builds, restart the application, and so on.

Since the tests and the application itself run on physically different machines (a mobile phone and a Mac mini), we needed to implement communication between our framework, the War Robots API, and the Unity API. We added a small UDP server to the application that receives commands from the framework and talks to the application API and Unity through handlers.
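A stripped-down sketch of such an in-app server is shown below, assuming a plain-text command protocol; the real protocol, port, and handler set are not shown in the article, and CommandHandlers here is a hypothetical dispatcher.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;
using UnityEngine;

public class TestCommandServer : MonoBehaviour
{
    private const int Port = 13000;   // assumed port
    private UdpClient _udp;
    private Thread _listenThread;
    private volatile bool _running;

    private void Start()
    {
        _udp = new UdpClient(Port);
        _running = true;
        _listenThread = new Thread(Listen) { IsBackground = true };
        _listenThread.Start();
    }

    private void Listen()
    {
        var remote = new IPEndPoint(IPAddress.Any, 0);
        while (_running)
        {
            byte[] request;
            try
            {
                request = _udp.Receive(ref remote);   // blocks until a command arrives
            }
            catch (SocketException)
            {
                break;                                // socket closed on shutdown
            }

            string command = Encoding.UTF8.GetString(request);

            // CommandHandlers is a hypothetical dispatcher that calls into the game
            // and Unity APIs; a real implementation would marshal this to the main thread.
            string result = CommandHandlers.Execute(command);

            byte[] response = Encoding.UTF8.GetBytes(result);
            _udp.Send(response, response.Length, remote);
        }
    }

    private void OnDestroy()
    {
        _running = false;
        _udp.Close();
    }
}
```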



The main task of our framework is to orchestrate the tests: correct preparation and teardown, and device management, in particular parallelization to speed things up, picking the right devices and screenshots, and communicating with the build. After the tests finish, the framework has to save all the generated artifacts and produce a report.


Tips for choosing devices


Separately, I want to say a few words about choosing devices for testing.

Pay considerable attention to USB hubs. If you run benchmarks on your devices, especially Android devices, they will drain their batteries, so the hubs must supply enough power for the devices you use. There is another subtle detail: some powered hubs cut power after a surge and only turn it back on when a button is physically pressed. We have hubs like that, and it is very inconvenient.

If you want to run regression UI testing and logic tests on devices, do not take a mix of different devices. Take identical ones, ideally the most powerful you can afford: you will lose less time to slow devices, they are more convenient to work with, and the application will behave the same on all of them.

A separate topic is cloud device farms. We do not use them yet, although we have researched what they are, how much they cost, and how to run our tests on them; so far our in-house device park is enough to cover our needs.


Test Reporting


After the tests complete, we generate an Allure report that includes all the artifacts created during the run.

The main workhorse for analyzing what happened and identifying why a test failed is the logs. First of all, we collect them from our framework, which reports the state of the scenario and what happened in it. We split the logs into a system log (more detailed) and a log for QA (more compact and easier to analyze). We also collect system logs from the devices (for example, logcat) and logs from the Unity application.

When a test fails, we also take a screenshot to see what was happening on the device at the moment of failure, record a video to see what led up to it, and try to collect as much information about the device's state as we can, such as pings to our servers and ifconfig output, to understand whether the device even has an IP address. It may be surprising, but if you launch the application manually 50 times, everything will be fine; run it 50 thousand times automatically and you will find that the device can lose its Internet connection, and without this data it is unclear whether there was a connection before and after the failure.

We also collect the list of running processes, battery level, temperature, and generally everything we can reach.




What screenshots and videos are good for


Some time ago our QA engineer suggested that, in addition to taking screenshots on failure, we compare screenshots against reference images stored in our repository at certain points in the tests. His idea was to save time on the number of test runs and reduce the size of the code base: a single test could check both the logic and the visuals. From a unit-testing standpoint this is not quite correct, since one test should not verify several hypotheses, but it was a deliberate step: we know how to analyze the results correctly, so we went ahead and added this functionality.

At first we considered adding image-matching libraries, but realized that matching images of different resolutions is not very reliable, so we settled on devices with the same resolution and simply compare images pixel by pixel with a certain threshold.
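A minimal sketch of that comparison, assuming both screenshots are readable Texture2D objects of the same resolution (how we actually load and store the references is not shown here):

```csharp
using UnityEngine;

public static class ScreenshotComparer
{
    // Returns the fraction of pixels that differ by more than pixelTolerance
    // in any colour channel; the caller checks it against an allowed threshold.
    public static float DifferenceRatio(Texture2D actual, Texture2D reference, float pixelTolerance = 0.02f)
    {
        Color[] a = actual.GetPixels();
        Color[] b = reference.GetPixels();
        if (a.Length != b.Length)
            return 1f;   // different resolutions: treat as a full mismatch

        int mismatched = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (Mathf.Abs(a[i].r - b[i].r) > pixelTolerance ||
                Mathf.Abs(a[i].g - b[i].g) > pixelTolerance ||
                Mathf.Abs(a[i].b - b[i].b) > pixelTolerance)
            {
                mismatched++;
            }
        }
        return (float)mismatched / a.Length;
    }
}
```

A test then simply asserts that the returned ratio stays below an allowed threshold, for example one percent of the pixels.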



A very interesting side effect of screenshot matching is that if some process is hard to automate fully, we automate it as far as we can and then simply review the screenshots manually. That is exactly what we did with localization testing. We got a request to test the localization of our application and started looking at text-recognition libraries, but found them rather unreliable. In the end we wrote several scripts that "walk" through different screens and open various pop-ups, taking screenshots along the way. Before running such a script we change the locale on the device, run it, take the screenshots, then change the locale and run it again. All of this runs at night, so in the morning the QA engineer can look through around 500 screenshots and immediately spot any localization problems. Yes, the screenshots still have to be reviewed by eye, but this is much faster than manually walking through every screen on the device.
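As a rough sketch, such a locale walk could look like this; the screen list and the SetDeviceLocale / OpenScreen / SaveScreenshot calls on the hypothetical UnityDriver client are made-up names for illustration only.

```csharp
public class LocalizationScreenshots
{
    // Screens and pop-ups to photograph in every locale (names are made up).
    private static readonly string[] Screens =
    {
        "HangarScreen", "SettingsWindow", "ShopScreen", "BattleResultsPopup"
    };

    public void Capture(UnityDriver driver, string locale)
    {
        driver.SetDeviceLocale(locale);   // e.g. via adb on Android
        driver.RestartApplication();      // the game picks up the new locale on restart

        foreach (string screen in Screens)
        {
            driver.OpenScreen(screen);
            driver.SaveScreenshot($"{locale}_{screen}.png");
        }
    }
}
```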

Sometimes screenshots and logs are not enough: something strange happens on the devices, but since they are remote, you cannot walk over and see what is going on. It is also often unclear what happened just moments before a test failed. So we added video recording on the device: it starts with the test and is kept only if the test fails. Such videos make it very convenient to track down application crashes and freezes.




What else can our system do?


Some time ago we received a request from the QA department to develop a tool for collecting metrics during manual playtests.

What is it for?

It lets QA engineers, after a manual playtest, additionally analyze how FPS and memory consumption behaved in the application while looking at screenshots and videos of what was happening on the device.

The system we built works as follows. The QA engineer launches War Robots on the device, starts recording a playbench session (our analogue of GameBench), plays through the playtest, then taps "end playbench session". The generated report is saved to the repository, and the engineer can later open it on their workstation and see the FPS drops, the memory consumption, and what was happening on the device.
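A rough sketch of the kind of sampling such a session can do: record the frame rate and memory use once a second so the report can later show the drops. The sample structure and interval are assumptions, not our production code.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Profiling;

public class PlaybenchRecorder : MonoBehaviour
{
    public struct Sample
    {
        public float Timestamp;
        public float Fps;
        public long AllocatedBytes;
    }

    public readonly List<Sample> Samples = new List<Sample>();

    private int _frames;
    private float _elapsed;

    private void Update()
    {
        _frames++;
        _elapsed += Time.unscaledDeltaTime;

        if (_elapsed >= 1f)   // one sample per second
        {
            Samples.Add(new Sample
            {
                Timestamp = Time.realtimeSinceStartup,
                Fps = _frames / _elapsed,
                AllocatedBytes = Profiler.GetTotalAllocatedMemoryLong()
            });
            _frames = 0;
            _elapsed = 0f;
        }
    }
}
```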

We also automated benchmark launches on the War Robots project, essentially just wrapping the existing benchmarks in an automatic launcher. The result of a benchmark is usually a single number; in our case it is the average FPS over the benchmark. On top of the automatic launch we decided to also record a playbench session, so we get not just one figure describing how the benchmark went, but also enough information to analyze what was happening during it.

Pull request tests are also worth mentioning. This one is more of a help to the client development team than to QA engineers: for each pull request we run a so-called build verification test. These can run both on devices and in the Unity editor, which speeds up logic checks. We also run a set of core tests in separate branches where some element is being reworked or code is being refactored.




And other useful features


Finally, I want to dwell on a few interesting cases we have come across over the past few years.

One of the most interesting recent ones is benchmarks of battles between bots.

For Pixonic's new project, Dino Squad, a system was developed that lets a QA engineer run a playtest with bots, so they can test a hypothesis without waiting for colleagues. Our QA engineer, in turn, asked us to add the ability not only to play against bots, but also to let bots play against each other. So we simply launch the application and a bot starts playing against other bots. All the interaction is networked, against real servers; the only difference is that a computer plays instead of humans. All of this is wrapped in benchmarks and a playbench session with triggers for nightly launches. At night we start several bot-versus-bot battles, during which FPS and memory consumption are recorded, screenshots are taken, and video is captured. In the morning the QA engineer comes in and can see which playtests ran and what happened in them.

Texture leak checks are also worth mentioning. This is a kind of sub-analysis of memory usage: here we mainly check, for example, that garage textures are not used in battle. There should be no atlases in battle that belong to the garage, and after we exit a battle, the textures used in it should not remain in memory.
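A sketch of such a check, assuming the "foreign" textures can be recognized by a name prefix (how atlases are actually tagged in the project is not shown in the article):

```csharp
using System.Linq;
using NUnit.Framework;
using UnityEngine;

public static class TextureLeakCheck
{
    // Fails the test if any loaded texture carries the forbidden prefix,
    // e.g. garage atlases while we are in battle, or battle textures after leaving it.
    public static void AssertNoTexturesWithPrefix(string forbiddenPrefix)
    {
        // FindObjectsOfTypeAll returns assets loaded in memory, not only those
        // referenced by objects in the active scene, which is what we want here.
        string[] leaked = Resources.FindObjectsOfTypeAll<Texture2D>()
            .Where(t => t.name.StartsWith(forbiddenPrefix))
            .Select(t => t.name)
            .ToArray();

        Assert.IsEmpty(leaked, "Unexpected textures in memory: " + string.Join(", ", leaked));
    }
}
```

In a battle test this could be called as TextureLeakCheck.AssertNoTexturesWithPrefix("garage_atlas"), with the prefix being whatever naming convention the project uses.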

An interesting side effect of our system is that almost from the very start we have been tracking the application's loading time. In War Robots it grows slowly but constantly, because new content keeps being added and its quality keeps improving, but we can keep this parameter under control and always know where it stands.


Instead of a conclusion


To wrap up, I would like to point out the problems we know we have and would like to solve first.



The first and most painful is UI changes. Since we work with a black box, we embed nothing in the War Robots application except our server; that is, we test everything the same way a QA engineer would. But we still need to access elements in the scene, and we find them by absolute path. So when something changes in the scene, especially high in the hierarchy, we have to update a large number of tests. Unfortunately, there is nothing we can do about it right now; there are some solutions, but they bring additional problems of their own.

The second big problem is infrastructure. As I said, if you launch your application 50 times by hand, you will not notice most of the problems that surface when you run it 50 thousand times. Problems that are easy to solve manually, for example reinstalling a build or restoring the Internet connection, become a real pain in automation, because each of them has to be anticipated, handled correctly, and reported with a clear error message. In particular, we need to determine why tests failed: because of broken logic, an infrastructure problem, or something else. Low-end devices cause a lot of trouble: builds fail to install on them, the Internet drops, the devices freeze, crash, fail to turn on, discharge quickly, and so on.

We would also like to interact with native UI, but so far we have not had the chance: we know how to do it, but other feature requests keep us from getting to it.

And my personal wish is to comply with the standards that exist in the industry, but that is also in the plans for the future, maybe even this year.
