How to make a machine write tests from your code for you

We live in an imperfect world. Code is written by people, and people naturally make mistakes. That would be fine if errors were caught at the testing stage and never allowed to harm anyone. This is possible if you write tests - which, for some reason, people do not like to do. But perhaps there is hope: auto-generation of tests from already written code.

Julia Volkova wants to test this idea in practice and is trying to shift the creation of tests from code onto a machine, without additional instructions or contracts. At Moscow Python Conf++ Julia will talk about the discoveries this journey into the world of metaprogramming, AST, parsing and tokenization has brought, and what all of this has achieved in test auto-generation. In the meantime, I asked her where the idea of automating testing came from, what the prototype is based on, and what remains to be done.

Julia Volkova (xnuinside) is a Senior Python Developer at GridDynamics. In her free time she writes pet projects, which sometimes find application in real life. Testing legacy code over and over again, Julia noticed that many things could be done automatically. Of course, understanding the code and writing the “right” tests for it is sometimes too difficult even for a live, experienced developer. But automation can handle a lot of simple tests and prepare a code base that the developer can then modify at their discretion.

- Let's start with the patient himself: why do you think people don't write tests? Smart people say to write tests, but they still don't. Why is there such a problem?

- I think there are several reasons. First, most of us are lazy by nature. Few people genuinely like writing tests - waking up in the morning and saying: “I must start the day with 15 tests, otherwise everything will go wrong and my life will be a failure.” Natural laziness shows itself most often when you see that a method is not very interesting, its code is clear and primitive, but you still need to cover it with tests.
Few practice TDD, so besides writing the code you also have to spend time on a test.
The problem is that development never gets unlimited time. There are always deadline-bound product wishlists. In product teams, as a rule, everything was needed yesterday, because time is money. Managers feel that the faster we ship a feature, the more valuable and better our product will be. And it is not always obvious that test coverage and code quality directly affect the subsequent speed of adding features, maintaining the code, updating it, and so on.

We often blame everything on managers and say that they do not give us enough time, otherwise we would sit and write tests. In reality, this is not always the case. And it is not always the seasoned developers who insist on writing tests while the younger colleagues refuse.

I've been in IT for a long time, but directly involved in development only for the last 3-4 years. Before that, I worked mostly in managerial positions and saw all kinds of developers. There are many people who cannot be called inexperienced, because they have been writing code for 10 years, yet they believe that tests as such are not needed. Say, you don't need to cover code with unit tests, because there is a QA engineer whose job it is to catch bugs. And they don't consider that such an engineer cannot cover all cases with end-to-end tests.

- If you don't go to such extremes, who do you think should write the tests? Should it be the programmer himself, a junior, or, conversely, the strongest developer on the team?

- If we are talking about unit tests, it definitely should not be QA. These should be tests that are written, run and checked before commits, and attached to the pull request; in no case should another person write them later. For example, I, as a lazy non-junior developer, would simply assign juniors to write tests for primitive code. There are things for which it is enough to read the code at an intermediate level and write asserts; such work is quite suitable for juniors and will be useful for their development.

These are unit tests that simply cover the state of the code as it is. Such tests do not check how valid the function is with respect to the requirements of the task; they just make sure that the code does what it does, and does it correctly.

But to verify the validity of the code against business requirements, against business logic, you still need the person who implements these requirements. He must understand what he covers with tests and how. It is not clear, though, how that helps if the person misunderstood the problem in the first place, wrote a method that solves it incorrectly, and then wrote a correct test for this incorrect method.

- Can we say that the problem is that people have a poor idea of how the software development process works?

- This is very subjective. You imagine yourself as part of a cohort of developers who understand that tests are needed and why, and you consider that right and good. But there is a fairly large layer of developers who believe it is redundant. And, in a sense, managers are probably right in their own way when they say that tests do not need to cover all the code and manual testing on staging is enough.
It is not always correct to say that a person who does not like tests is an unskilled developer.
They have their own vision, and it is not for me to judge. I still often meet developers who have been writing code for 10 years and say that covering everything with unit tests is redundant: smoke testing and QA work are enough.

I, in turn, feel uncomfortable on a project that has no unit tests for functions. It is important to me that there are at least tests guaranteeing protection against human error, capable of catching an accidentally misplaced comma or a changed key name in a dict. But I don't like spending time on this, because I always want to do more “intelligent” tasks. That is why I am thinking about tools for automating the process of writing tests.

- Do you think it matters that Python is dynamically typed and does not check anything at the compilation stage? Could this be easier in other languages?

- I think it matters, and a lot. This is the eternal story about types, but with the advent of type annotations it has become easier to work with.

For example, in Python there may be chains of nested functions where the list expected at the end for some reason turns into a dictionary. Execution may never reach the final function, but in some if branch, in some exceptional case, it does, and then an error appears.
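A minimal sketch of that failure mode, with made-up function names: the dict sneaks in through a rarely taken branch, and the error only surfaces at runtime.

```python
def load_items(raw):
    # Usually returns a list, but one rare branch returns a dict instead.
    if isinstance(raw, dict) and "legacy" in raw:
        return {"items": raw["legacy"]}  # oops: dict instead of list
    return list(raw)

def first_item(items):
    # Written against a list; nothing checks this until the rare branch fires.
    return items[0]

first_item(load_items([1, 2, 3]))           # fine
first_item(load_items({"legacy": [1, 2]}))  # KeyError: 0 at runtime
```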

Of course, in a statically typed language this cannot happen in principle, because the error would occur already at the compilation stage. In this regard, Python indeed provides additional ways to shoot yourself in the foot (and in the head, and elsewhere). Especially if you work on large projects with branched logic, where data can flow in different shapes and different aggregations.

- What, then, about typing? Do you think typing should be used to the maximum or the minimum? What should the balance of typing in dynamic code be?

- This is again quite subjective. Many people came to Python precisely because there is no mandatory typing and everything is so flexible and convenient. You should not forget about this and weed out a huge layer of developers, including data scientists and analysts who also write code. Say, I, as a backend developer, am of course more comfortable when typing is everywhere. Ideally, with mypy running as well.

But in most projects I have participated in, this is not possible. Because the project also has data analysts who say that they write in Python precisely because they do not want to mess with types; that is what is convenient for them.
A large number of people believe that Python's advantage is precisely the absence of mandatory types.
You need to grow to a certain level to understand when and why this becomes a disadvantage. In some small Python scripts or small projects, I do not use types either, because I know that in a two-function script types are not particularly needed. That is something I quickly hacked together to pull something out of the database. But in larger projects, I try to add types everywhere to the maximum, if there is no resistance from other developers.
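A small illustration of the kind of annotation mypy can check (get_price is a made-up name; the dict[str, float] syntax assumes Python 3.9+):

```python
def get_price(prices: dict[str, float], key: str) -> float:
    # mypy verifies both the argument types and the return type.
    return prices[key]

get_price({"apple": 1.5}, "apple")  # fine at runtime and for mypy
# get_price(["apple"], "apple")     # mypy flags: incompatible type "list[str]"
```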

- I completely agree with you on this. It only remains to understand how to use types, because that is a separate murky topic.


- Some developers say it is enough to cover legacy code with smoke tests. Your approach is different - why?

- I won't say that my approach is better; it is just different. Covering your code with smoke tests is good when you can. My previous project was the quintessence of test-related pain. It was a data science platform of 8 microservices and 20 thousand lines of code. The platform receives a large amount of data and characteristics for vehicles, stations and cities, various parking lots and supply types, aggregates them, and creates a huge set of potential schedules for these vehicles around the world. The schedule takes into account a huge number of conditions, from where you can refuel the vehicle to where to make an intermediate stop.

There are many different methods in the system that may be used in one or two situations, which perhaps none of the clients will ever even recall. Writing smoke tests then effectively turns into writing tests for the entire system, taking into account all the functions and their combinations.

A smoke test should check that everything works end to end and nothing is fundamentally broken. A very primitive smoke test confirming that the system started and somehow works brings no benefit in our case. Say we checked that there is a connection to the database, something is starting, the UI is receiving something from the API. Then one step to the left, one step to the right - and nothing works. That is, the smoke test formally exists, but errors still fly in from production.

Unit tests, by contrast, worked just fine in this system: they clearly track that the functions have not changed and have not broken after some code change. Code differs, too: different projects and different tasks need different approaches to testing.

The idea I am currently working on can be called auto-generation of tests only conditionally. It is rather a developer tool. I want a tool that will write tests for me and run all the code that can be run without me.

Let me give an example. There is a small function that takes a dictionary and extracts some value by a key. This key is very important for the business, but from the point of view of the code it is a rather primitive operation: take a value from the dictionary, even a key nested several levels deep; check that it is there and is not null; transform it, or maybe just return the value. This is quite primitive code, especially from the point of view of the AST. I do not want to waste my time writing tests for it. I want the machine to do it for me.
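For illustration, a function of roughly that kind, with hypothetical names - primitive from the AST point of view, yet still in need of a test:

```python
def get_status(payload: dict) -> str:
    # Pull a nested key, check it is present and not None, return it.
    status = payload.get("order", {}).get("status")
    if status is None:
        raise ValueError("order.status is missing")
    return status.upper()
```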

This is precisely metaprogramming, with code as input and code as output. Say, a py-module that says: “Here, I have asserted for you that a raise happens under this condition, that valid values are returned in that situation, and that something else happens with such-and-such an argument.” That is, it effectively does the work where I myself would look at what is fed to the function's input and write it into a test.

I want the program to generate for me the minimum that it can run itself. But it should be a test file in which you can later change or extend something if desired. One you can commit to Git, run the tests from, and so on.

- How much can you rely on such auto-generated tests? What I mean is: how tied are they to a specific implementation, and how will they behave under normal changes in business logic or refactoring?

- The idea is to take the code in the form it is in now, and generate tests that are valid for it at this moment.

Of course, you could regenerate the tests every time, but that would not be correct, because then changes in the code's state would not be tracked. Accordingly, there is a test diff for this: tests are generated only for what has not been covered by tests before. Already created tests you have to maintain yourself.

Perhaps this is a little paranoid, but so far I doubt that with auto-generation you can guarantee you will not cover invalid code with valid tests by regenerating them. It is one thing when I generated tests in February 2019, and if you change the logic, you change the tests yourself, because you know what changes were made. You know why the tests failed, and you can correct them accordingly. It is quite another matter when you regenerate them every time. The tests will be valid, but only with respect to that changed state of the code.
I want to get a tool for the developer, not a gadget for inflating code coverage.

- What could the success metrics be? How do we understand that we generated the tests well?

- I will name what I pay attention to, without which, it seems to me, the tests make no sense. All the cases of code behavior described by the developer must be exercised in the tests. For example, if there is an if that does not return anything but writes a log, the test should trigger that log - people do not write warnings and prints for nothing. Accordingly, if somewhere an error is raised, the test needs to exercise it. If the raise suddenly disappears, that is, the logic of the code changes, the test should catch that too.

Similarly, if there are if-statements, each condition must be exercised by an assert. Then the test will be more or less close to the truth. And do not forget that all of this should actually run, not just report “success” in pytest with empty test bodies.
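A sketch of what such a generated test might look like for the hypothetical get_status function above (the orders module is made up; pytest.raises is real pytest):

```python
import pytest

from orders import get_status  # hypothetical module from the example above

def test_get_status():
    # One assert per behavior branch described in the code.
    assert get_status({"order": {"status": "paid"}}) == "PAID"
    with pytest.raises(ValueError):
        get_status({"order": {}})  # key present but status missing -> raise
    with pytest.raises(ValueError):
        get_status({})             # no "order" dict at all -> raise
```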

- Tell me how difficult this is to do technically. It sounds like a pretty hard task.

- Yes, this is a very difficult task, and it is probably this fact, along with a few other circumstances, that led me to talk about it in a report at Moscow Python Conf++. I want to raise this topic, get other people interested in it, and discuss solutions with them.

I have a feeling that no one has simply tried to do this, because the task is difficult. Otherwise there would be some artifacts on the net - code, descriptions, articles, or at least mentions that such a thing existed but was abandoned.

To understand how difficult this is, recall how the interpreter works. There are operations and statements in the code; the interpreter executes them - passed, failed - and produces the result. Then the developer manually adds new arguments, runs the interpreter again, and makes sure everything now succeeds. But when you try to generate tests for the code, you first need to walk the AST tree and understand what steps are required to get the result.
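A minimal sketch of such a walk using the standard ast module; the source string is a made-up example:

```python
import ast

source = """
def check(arg_1):
    if arg_1 == 1:
        raise ValueError("bad value")
    return arg_1 * 2
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    # The structural elements a test generator has to reason about.
    if isinstance(node, (ast.FunctionDef, ast.If, ast.Raise, ast.Return)):
        print(type(node).__name__, "at line", node.lineno)
```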

A function can have many groups of arguments, strategies for the arguments, and many results for these strategies. By strategies I mean that, say, there is if arg_1 == 1: raise error. This means there is some group with arg_1 == 1 for which the function always raises an error. But with an argument arg_1 > 2 the result of the function will be different, and a second group, a second strategy, is created.

Accordingly, we need to find and extract all such groups of arguments (if, of course, they exist) under which the function changes its behavior. And then follow the chain of actions: what happens inside the function with these arguments to get the final result.
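Continuing the sketch above, one naive way to find those groups is to collect the condition of every If node (ast.unparse assumes Python 3.9+); actually resolving the conditions into argument values is far more involved:

```python
import ast

source = (
    "def check(arg_1):\n"
    "    if arg_1 == 1:\n"
    "        raise ValueError('bad value')\n"
    "    return arg_1 * 2\n"
)

tree = ast.parse(source)
groups = [ast.unparse(node.test)  # condition text delimiting a group
          for node in ast.walk(tree)
          if isinstance(node, ast.If)]
print(groups)  # ['arg_1 == 1'] -- one group raises, the rest returns arg_1 * 2
```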

Moreover, do not forget that besides the arguments there are also actions inside the function: assigning variables, calling other functions. So we also get a graph of dependencies between methods, where in order to check some code you must first get the result of other code.
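A rough sketch of deriving that dependency graph from the AST, under the simplifying assumption that callees are plain names:

```python
import ast
from collections import defaultdict

source = """
def load():
    return [1, 2]

def total():
    return sum(load())
"""

calls = defaultdict(set)
for func in ast.walk(ast.parse(source)):
    if isinstance(func, ast.FunctionDef):
        for node in ast.walk(func):
            # Record direct calls: to test total(), load() must work first.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                calls[func.name].add(node.func.id)
print(dict(calls))  # {'total': {'load', 'sum'}}
```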

Accordingly, to generate tests, you must first extract all the necessary information from the AST tree, then generate arguments, parameters, and data for each strategy. With them, walk the entire chain of actions and obtain the result - and only then do we have a valid test with various asserts. This is a difficult task.

I do not think it will ever be possible to automatically cover 100% of all possible cases, for example, for the huge canvases of the Django source code. It is laborious but interesting. For now I am simply curious to see how far my patience and strength will take me.

- Are there examples from other languages and areas where something like this works?

- None that I know of. I think it is because writing a test is easier than building a special tool.
But I have a feeling that sooner or later we will automate what we are already doing well.
There is a large pool of developers who write unit tests well. We have enough competence in Python development to want to write a tool or library that does this for us. And then we will write more complex things, more complex tests.

There is some kind of test generation in Java, C, and .NET. But there, too, it is all rather property-based or contract-based. In C there is symbolic test generation, which seems to just look at the code and produce some tests on that basis. But that is such a different level of abstraction in the language itself that I am not sure it is a similar story.

If there were something very similar, then of course one could adopt something, take a peek.

- Do you think frameworks, or perhaps particular techniques for writing Python code, simplify or complicate the task of generating tests from the AST tree?

- It is hard to say whether, in this sense, simply importing some library differs much from using a specific framework. What can definitely complicate the work greatly is anything that changes the behavior of the code-interpretation process, for example, a C extension. How to deal with that, I do not know yet; so far, the use of favorite third-party packages in this problem runs into the need to resolve imports. Everything is simple with built-in packages, but with imports everything gets more complicated. Mypy has some ideas and implementations here, but I am not touching the story of importing third-party packages yet.

- Maybe some particular technique - lots of dynamics, the use of getattr, something like that? Or does that work fine?

- That just works perfectly fine, because getattr or manipulations with metaclasses are visible in the AST. Yes, they need to be resolved, and that adds some complexity. But they are tracked anyway.
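For instance, a getattr call shows up as an ordinary Call node that a quick ast.dump makes visible (the indent= parameter assumes Python 3.9+):

```python
import ast

tree = ast.parse("value = getattr(obj, 'name', None)")
print(ast.dump(tree.body[0].value, indent=2))
# Prints a Call node: func=Name(id='getattr'), with args Name(id='obj'),
# Constant(value='name'), Constant(value=None) -- all visible in the tree.
```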

- We have already said that auto-generated tests are primarily intended for people. How readable will they be? Will there be a lot of logic inside each test, in the asserts? What will the separation between code and data look like - how do you see it?

- Right now I try to add all kinds of mundane things to the tests from the start. Say, if it is some raise of an error, then not just the bare raise, but at least a comment: what kind of error it is and why it pops up, so that a person reading the test understands what actually happened and which argument leads to which error.

The asserts are so far combined in one method. That is, if there is a function and five states of it that we want to check, then for now all five asserts go inside one test function.

There was an idea to introduce naming conventions, for example: error tests end in error, log tests have something of their own. But I have postponed that for now, because producing the final shape of the tests - the actual text block with tests in code - is the cheapest operation of all. If the idea suddenly comes up that everything needs reformatting, it will be easy to do: the asserts are already assembled, you just need to choose a different look for the tests.

- Do you support unittest or pytest?

- Pytest, and simply because I do not want to spend a lot of energy on the output right now. Pytest is good because there are many plugins, decorators, and various modifiers for it that are easy to use.

Prettiness may matter both to the end user and to the developer, but it does not affect the development of the idea at all. If unittest needs to be supported, that can easily be added.

- How closely is this approach related to property-based tests?

- Right now, simple type-based mocks are used to generate arguments: if an int is needed, give a random int. But such strategies will be easy to rewrite later, for example, to start using hypothesis. For now I am not spending much time and effort on this, because I understand that I can later plug in third-party generators for values. Right now, it seems to me, this is not as important as the work with the AST.
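For comparison, a minimal hypothesis strategy that could replace a random-int mock (double_positive is a made-up function; @given and st.integers are real hypothesis API):

```python
from hypothesis import given, strategies as st

def double_positive(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2

@given(st.integers(min_value=0))
def test_double_positive(x):
    # hypothesis supplies many generated ints instead of one random mock value
    assert double_positive(x) == x * 2
```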

- Do you plan to support contract programming or handle it in some special way? Because it helps a lot in unit testing, property-based testing, and tests in general, for understanding business logic.

- If by contract programming we mean contracts in the code, then I steer away from this as much as possible. Because when you can use contract programming, you can describe the code with contracts and generate unit tests from them. And then my tool is not really needed.

Right now I try not to touch anything that modifies the code. Because, for example, in the outsourced projects where I faced the lack of tests - and, sadly, in my current company that was almost every project - it was nearly impossible to touch the code. That is, you could not make changes until you could guarantee that this decorator or contract would not change the entire functional behavior of the code.
If it is possible to edit the code, then contract tests are good.
But for now I proceed from the assumption that there is no such possibility. Otherwise, indeed, you could generate unit tests from contracts and, in essence, implement duplication of functionality.

- Now tell us about the next important point: how do you test the resulting tests, and to what extent can you guarantee that these tests really test something?

- Mutation testing has not been canceled, and in an ideal world it certainly should be used in this story. The idea is overall the same as if the test had been written by the developer manually. That is, everything available for testing tests can be fully applied.
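A toy, hand-rolled illustration of the mutation-testing idea, not any specific tool: flip an operator in the code under test and check that the test suite notices.

```python
def add(a, b):
    return a + b

def add_mutant(a, b):
    return a - b  # the mutation: '+' flipped to '-'

def passes(fn):
    return fn(2, 3) == 5  # the (generated) test

assert passes(add)             # original code passes the test
assert not passes(add_mutant)  # a useful test "kills" the mutant
```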

- Now let's discuss the Moscow Python Conf++ conference a bit. One of the developers of hypothesis, which we have mentioned several times, will speak there. What would you be interested in asking him?

- I would be interested to ask Zac where they want to take the project together with the maintainers: what to add, which direction to develop in. I know for sure that Zac now has a PR for test generation - they are working on it; more precisely, on adding decorators to existing unit tests.

I would like to discuss the ideas of automatic test generation in terms of how hypothesis and its contributors look at it. Surely people who work on tests at that level have some ideas, or maybe someone has already tried something.

- That is what we count on when preparing the conference program: that the reports set topics for discussion in which everyone can find new ideas and directions for development. Which reports will you go to?

- I wish I could split myself and go to all the reports at 12 o'clock. At that time there will be Zac Hatfield-Dodds, Andrey Svetlov with a report on asynchronous programming, and Vladimir Protasov on refactoring automation. I will go to one of the last two, and then run over to Zac at the end of his report (editor's note: a life hack worth adopting - listen to the new topic almost in full, then come for the end of the report and the questions to the speaker you want to talk to).

There should also be a very interesting report on data validation; that one interests me directly. And there are two more reports I would attend, but they all run in parallel with mine: a report by Vitaly Bragilevsky about typing and one by Christian Heimes about profiling. Unfortunately, there is no way I can get to them.

- Tell me a little more about the topic of your report: why are you doing this, what exactly are you doing, why are you speaking, and what do you expect from the talk?

- I want more tools for automating development processes and more collaborations around this. There is such activity, but against the background of constantly writing the same code, it seems to me there should be more of it.

As I said, there is no public experience of auto-generating tests in Python. It is unclear whether anyone has done this, and if so, why it did not take off. I do not know how relevant AST-based test generation will be for the community or how far it can go. Right now I am doing this because the process itself interests me: I enjoy digging through AST trees, learning more about how Python code works, and running into lots of nuances that are not obvious when you work with code at the top level. Working with AST trees brings a ton of sudden discoveries.

I want people to leave the report with ideas, for example, about how to automate something they use in their work. So that some of them stop writing the pieces of code they currently write every day, and start generating them or reducing the time spent writing them. I hope someone comes away with a new understanding of how to solve this problem.

- Where do you find the time to speak at conferences and write your own libraries? This question actually pops up constantly; many people complain that they do not have time for anything.

- First, about time. I am not a very convenient employee for many companies, in the sense that I do not do things that seem ineffective to me. I try to do things that are either genuinely interesting to me or that I can do effectively and correctly. If, for example, a manager wants me to fix some “bug” right now that is actually not a bug but a fresh customer wish, I will not sit down and change everything back, because I know the customer will return and ask why we did it.
I try not to do unnecessary work at work - not to do what will cost me time afterwards.
Say, if they ask me to deploy on Friday, I say: “Guys, I love you all very much, you are all great, but if you need to deploy something right now, please deploy it yourself, and I will go home. I can deploy it on Monday, and we can talk about why this situation arose in which you want to deploy now, on Friday.” It may be painful the first time to tell the customer or the managers this, but later people get used to it, learn, and stop asking you to do something very urgent on Friday night. They understand that nobody died last Friday when nothing was deployed, and nobody even lost money. I try not to do things that will harm me.

The same story with bugs: if there are many bugs that have to be fixed constantly, the question is why these bugs appear. We should not just fix them, but think about why there are so many, where they come from, and fight the root problem first. These are also always painful conversations, when a manager or customer says a feature urgently needs fixing in production. But you need to be able to say: if I touch this code now, then perhaps, apart from this feature, you will have no production at all, since the code is not covered by tests and you cannot add another if to it, because nobody remembers what the other six do.

Sometimes you need to overcome yourself and start talking. This is not always possible; you have to grow to a certain level of awareness that you are responsible for how much time you spend on which work.

That is probably why I have time: I try to optimize my working hours so that a task takes a defined number of hours. At the same time, I understand that a healthy schedule should include 1-2 hours for technical debt and some improvements.

I will not say that I work 8 hours without getting up. I would like to take a look at a developer who sits and writes code for 8 working hours straight. In my usual working day, about 2 hours go to all kinds of tests, code review, technical debt, “buzzing” over the code. About 3 hours go to solving current tasks, and an hour to communicating with managers. The remaining 2 hours are spread across various things: discussions with teams and assorted odds and ends.

There are things you are interested in doing - you do them, and when you have no strength left, they give you strength. I have a lot of different activities - this is probably called useful procrastination - when I do what interests me at the moment rather than what I have to do. If you learn to alternate between what is interesting and what is still necessary, it works out best. You simply do not waste yourself on forcing yourself to do what you do not want to.

There is no secret; you just need to do what you like, but without harming those around you or the project.

For the details of implementing test generation from Python code, as well as solutions to many other tasks of a Python developer, come to Moscow Python Conf++, which we have postponed to September 15.
