Plumber programmers, or the story of one leak and the difficulties of dealing with it

It was Tuesday, February 25th. The difficult release of Saturday, February 22nd, was already behind us. It seemed the worst was over and nothing foretold trouble. But everything changed in an instant when a monitoring alert arrived about a memory leak in the coordinator process of the access control service.

Where did it come from? The last major changes to the coordinator's code base had shipped in the previous release more than two months earlier, and since then nothing remarkable had happened to its memory usage. But, unfortunately, the monitoring graphs were adamant: the coordinator's memory had clearly started to leak somewhere, there was already a large puddle on the service-room floor, and that meant the plumbing team had a lot of work to do.



First, a small digression. Among other things, VLSI lets you track working hours and control access by face, fingerprint, or access card. To do this, VLSI communicates with endpoint controllers (locks, turnstiles, access terminals, and so on). A separate service talks to these devices. It is passive and interacts with the access control devices using their own protocols implemented over HTTP(S). It is built on the standard stack for services in our company: a PostgreSQL database and Python 3 for the business logic, extended with C/C++ methods from our platform.

A typical web service node consists of the following processes:

  • Monitor: the root process.
  • Coordinator: a child process of the monitor.
  • Worker processes.

The monitor is the first web service process. Its job is to call fork(), start the child coordinator process, and supervise it. The coordinator is the main process of the web service: it is the one that receives requests from external devices, sends responses, and balances the load. The coordinator dispatches a request to the worker processes for execution; they execute it, write the response to shared memory, and notify the coordinator that the task is complete and the result can be picked up.
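To make the division of responsibilities a little more concrete, here is a minimal, purely illustrative sketch of a monitor that forks and supervises a coordinator. The function names and the restart policy are assumptions made for the example; this is not the service's actual code.

    import os
    import sys
    import time

    def run_coordinator():
        # Placeholder for the coordinator: in the real service this is where
        # requests are accepted and dispatched to worker processes.
        print(f"coordinator started, pid={os.getpid()}")
        time.sleep(60)

    def run_monitor():
        # The monitor forks the coordinator and restarts it if it ever dies.
        while True:
            pid = os.fork()
            if pid == 0:                      # child: become the coordinator
                run_coordinator()
                sys.exit(0)
            _, status = os.waitpid(pid, 0)    # parent: supervise the child
            print(f"coordinator {pid} exited with status {status}, restarting")
            time.sleep(1)

    if __name__ == "__main__":
        run_monitor()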

Who is to blame and what to do?


So, the coordinator of the access control service differs from the coordinators of our company's other services by the presence of a web server. The other coordinators ran without leaks, so the problem had to be somewhere in our configuration. Worse, the new version had been running on the test stands for a long time, and nobody had noticed any memory problems there. We looked more closely and found that the memory leaked on only one of the stands, and even there inconsistently, which meant the problem would not be easy to reproduce.



What to do? How to find the cause? First of all, we took a memory dump and sent it to the platform team's specialists for analysis. The result: nothing in the dump, no cause, no hint of where to look next. We checked the changes in the coordinator's code since the previous version: what if we had made some terrible edits and simply hadn't noticed? But no: only a few comments had been added to the coordinator's code, and a couple of methods had moved to new files. In short, nothing criminal.

Then we turned to our colleagues, the developers of our service's core. They confidently denied any possibility of being involved in our misfortune, but offered to embed tracemalloc monitoring into the service. No sooner said than done: in the next hotfix we patch the service, quickly test it, and release it to production.
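The actual hotfix is not shown here, but the instrumentation was roughly of this kind. A hedged sketch, in which the snapshot depth, log file name and output format are assumptions for illustration:

    import time
    import tracemalloc

    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

    def dump_top_allocations(path="tracemalloc.log", limit=20):
        # Append the biggest Python-level allocation sites to a log file;
        # in the service this would be called periodically or every N requests.
        snapshot = tracemalloc.take_snapshot()
        stats = snapshot.statistics("lineno")
        with open(path, "a") as f:
            f.write(f"--- {time.strftime('%Y-%m-%d %H:%M:%S')} ---\n")
            for stat in stats[:limit]:
                f.write(f"{stat}\n")
            current, peak = tracemalloc.get_traced_memory()
            f.write(f"total traced: current={current / 2**20:.1f} MiB, "
                    f"peak={peak / 2**20:.1f} MiB\n")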

And what do we see? Now our memory is leaking not just quickly, but very quickly: the growth has become exponential. We write the first peak off to evil forces and the factors accompanying them, but a second peak a few hours after the first makes it clear that with the service behaving this badly, waiting for the next scheduled hotfix would take too long. So we collect the tracemalloc results and patch the service again, rolling back the monitoring changes to get back at least to linear growth.



It seemed that we now had tracemalloc's record of the memory allocated in Python; we would just look through it and find the culprit of the leak. No such luck: the collected data contained none of the 5.5 GB peaks we saw on the monitoring graphs. The maximum memory reported was only 250 MB, and tracemalloc itself consumed 130 MB of that. This is partly explainable: tracemalloc shows the dynamics of memory allocated in Python, but it knows nothing about allocations made in the C and C++ code that implements our platform. In the collected data we could not find anything interesting; memory was allocated in acceptable amounts to ordinary objects such as threads, strings, and dictionaries. Nothing suspicious. We then decided to strip everything superfluous from the data, leaving only total memory consumption and time, and visualize it. Although the visualization did not answer the questions "what is happening" and "why", it did show a correlation with the monitoring data, which meant we definitely had a problem somewhere and needed to keep looking for it.
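The article does not show how the data was reduced and plotted, so the following is only a sketch under assumptions: that snapshots were saved to files with Snapshot.dump(), and that matplotlib is available for the chart. The filters that exclude tracemalloc's own overhead follow the examples in the tracemalloc documentation.

    import glob
    import tracemalloc

    import matplotlib.pyplot as plt

    def total_mib(snapshot):
        # Total Python-level memory in a snapshot, excluding importlib noise
        # and tracemalloc's own bookkeeping.
        snapshot = snapshot.filter_traces((
            tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
            tracemalloc.Filter(False, tracemalloc.__file__),
        ))
        return sum(stat.size for stat in snapshot.statistics("filename")) / 2**20

    # Load the periodically dumped snapshots and keep only (index, total) pairs.
    totals = [total_mib(tracemalloc.Snapshot.load(path))
              for path in sorted(glob.glob("snapshots/*.dump"))]

    plt.plot(range(len(totals)), totals)
    plt.xlabel("snapshot #")
    plt.ylabel("traced memory, MiB")
    plt.show()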



By that time our plumbing team had run out of ideas about where to look for the leak. Fortunately, a little bird told us that one major change on the platform side had indeed happened: the Python version had changed from 3.4 to 3.7, and that is a huge field to search.

We decided to search the Internet for memory-leak problems related to Python 3.7, because surely someone had already run into this behavior. After all, Python 3.7 had been released quite a while ago; we had only switched to it with the current update. Fortunately, the answer to our question was found quickly, along with an issue and a pull request fixing the problem, and the cause itself lay in changes made by the Python developers.

What happened?

Starting with version 3.7, the behavior of the ThreadingMixIn class, from which our web server inherits to handle each request in a separate thread, has changed: all created threads are now stored in a list inside the class. Because of this change, the thread objects that handle device connections are not released after they finish, and Python's garbage collector cannot reclaim the memory of the spent threads. This is exactly what produced the linear growth of allocated memory, directly proportional to the number of requests to our server.

Here it is, the insidious code of the Python module with the big hole (on the left is the Python 3.5 code before the changes, on the right the 3.7 code after them):



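In essence, the change looks like this (paraphrased and abridged from the standard library's socketserver module, not an exact copy of either release):

    import threading

    # Python 3.5: the request-handling thread is started and then forgotten.
    def process_request(self, request, client_address):
        t = threading.Thread(target=self.process_request_thread,
                             args=(request, client_address))
        t.daemon = self.daemon_threads
        t.start()

    # Python 3.7: every non-daemon thread is also appended to self._threads,
    # which server_close() joins later. On a long-lived server this list, and
    # every finished thread object it references, only ever grows.
    def process_request(self, request, client_address):
        t = threading.Thread(target=self.process_request_thread,
                             args=(request, client_address))
        t.daemon = self.daemon_threads
        if not t.daemon:
            if self._threads is None:
                self._threads = []
            self._threads.append(t)
        t.start()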
Having found the cause, we eliminated the leak easily: in our subclass we changed the value of the flag that restores the old behavior, and that was it - victory! Threads are created just as before, without being written to the class attribute, and the monitoring graphs show a pleasant picture: the leak is gone!
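The subclass itself is not quoted here, so the snippet below is only a sketch: DeviceHTTPServer is an assumed stand-in for our server class, and which of the two flags was actually flipped depends on the exact 3.7.x release in use.

    from http.server import HTTPServer
    from socketserver import ThreadingMixIn

    class DeviceHTTPServer(ThreadingMixIn, HTTPServer):
        # If the running release has this flag, setting it to False switches
        # off the new per-request thread bookkeeping, i.e. the pre-3.7 behaviour.
        block_on_close = False
        # An alternative that works on any 3.7.x: daemon threads are never
        # added to the internal list either.
        # daemon_threads = True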



It is pleasant to write about all this after the victory. We are probably not the first to run into this problem after switching to Python 3.7, and most likely not the last. For ourselves, we drew the following conclusions:

  • Take a more serious approach to assessing the possible consequences of major changes, especially when other application solutions depend on us.
  • When the platform changes globally, for example when the Python version changes, check your own code for possible problems.
  • React to any suspicious change in the monitoring graphs of test services as well as production ones: despite the garbage collector, memory leaks happen in Python too.
  • Be careful with memory analysis tools such as tracemalloc; using them incorrectly can make things worse.
  • Be prepared for the fact that finding a memory leak takes patience, persistence, and a bit of detective work.

Finally, I would like to thank everyone who helped to cope with this emergency plumbing work and restore our service to its former health!
