See the architecture? I don't see it either, but it's there

About 150 people work in development at hh.ru today. We have many interesting teams, and each one makes a significant contribution. But in this article I will talk about only one of them.

~~Because I am its team lead.~~ And there are several reasons for this:

candidates often do not understand what we do;
sometimes even employees inside the company do not know, because our team has no product manager, no business area of its own and no list of services that we maintain;
our achievements most often remain in the shadows;
and finally, “if you want to understand something, try explaining it to someone” :)

So I will try to use clear examples to work out what our job actually consists of.

Let's start with the most general thing, our priorities. There are two of them:

systematic support for the reliability and fault tolerance of our platform. The word “systematic” matters here: it means that we do not fix individual performance defects, but develop general rules and patterns and embed them in frameworks, automated checks and so on, so that they work for everyone.
keeping development focused on business logic. That is, the less a developer has to think about reliability, architecture and so on, the better. Of course, ridding colleagues of such thoughts completely would be harmful, but keeping a reasonable balance makes sense.

From these priorities the main directions of our work follow:

0. Support and development of architecture

hh.ru serves 5-6k rps from users, reaching 10k at peak, and that load grows by an order of magnitude by the time it reaches the backends. Behind it are more than 1,500 instances running about 150 services in 3 data centers. So yes, first of all this means those sprawling diagrams with boxes, cylinders and arrows: who calls whom and what should live where. Of course, we don't draw the diagrams by hand: automation, logging and monitoring cover that need. But we did scare our students with things like this, for example:

We really are responsible for finding and eliminating bottlenecks and inflexible solutions in the architecture, and for evolving it as needs change.

Here is an example:

hh.ru has been running for many years, and at some point it seemed like a good idea to have a separate machine for running scheduled background tasks: you can give it more resources, and there are no races on it. But what did we end up with?

a single point of failure for all tasks
a unique configuration that can be reproduced only in prod
tasks whose logic assumes a dedicated run on a separate beefy machine and does not scale horizontally

When we realized this, we made sure that we had everything needed to move the cron tasks to general-purpose instances and filed a big task in the technical debt category. Now, whenever it is time to pay off debt, colleagues gradually eliminate this problem.
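To make the horizontal-scaling problem concrete, here is a minimal sketch of one common way to let a scheduled task fire on every general-purpose instance while only one of them actually executes it. It is not a description of how hh.ru solved this: the `task_runs` table, the `try_claim` function, the task name and the instance names are all hypothetical, and an in-memory SQLite database stands in for whatever shared store the instances can reach.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    # Stand-in for a shared database; the schema is purely illustrative.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_runs ("
        " task TEXT NOT NULL, slot TEXT NOT NULL, instance TEXT NOT NULL,"
        " PRIMARY KEY (task, slot))"
    )
    return conn


def try_claim(conn: sqlite3.Connection, task: str, slot: str, instance: str) -> bool:
    """Return True if this instance won the right to run `task` for `slot`."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_runs (task, slot, instance) VALUES (?, ?, ?)",
                (task, slot, instance),
            )
        return True
    except sqlite3.IntegrityError:
        # Another instance has already claimed this (task, slot) pair.
        return False


store = make_store()
# Three instances wake up for the same scheduled slot; only the first claim succeeds.
winners = [
    name
    for name in ("instance-a", "instance-b", "instance-c")
    if try_claim(store, "send_digest", "2019-10-01T03:00", name)
]
```

The primary key on `(task, slot)` does the coordination, so no instance is special and the loss of any single one no longer stops the schedule.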

1. Standardization of hacks

First of all, these are our frameworks and tools for rapid service development: nuts-and-bolts and frontik. In fact, jclient and many other libraries published on our github also grew out of the idea that it makes sense to pool the experience of operating various technologies. This lets us promote the constraints and the design and behavior patterns that we have worked out in production and consider the most suitable, understandable and reliable.

Besides such obvious examples of standardization, there are cases where it makes sense to generalize one-off solutions.

For example, at some point we started to regularly need to send messages to rabbitmq with at-least-once delivery. These tasks were solved again and again with hand-written queues in the database, and the DBAs told us again and again how VERY MUCH they “love” queues in the database, especially heavily loaded ones. In the end it became obvious that we needed a standard solution that would be acceptable to the DBAs, guarantee reliable delivery and be convenient for developers. That is how we came to write our library for integrating pgq and rabbitmq. There is now a good chance that we will use pgq together with kafka as well.
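The at-least-once guarantee behind a database-backed queue can be illustrated with a small sketch. It uses neither pgq nor a real rabbitmq client and is not the API of our library: the `vacancies` and `outbox` tables, the `save_vacancy` and `relay_once` functions and the `publish` callback are hypothetical names, and in-memory SQLite stands in for the database. The point is only the ordering: the message is stored in the same transaction as the business change, and it is removed from the queue only after the broker has accepted it.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vacancies (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    return conn


def save_vacancy(conn: sqlite3.Connection, vacancy_id: int) -> None:
    # The business change and the outgoing message commit (or roll back) together.
    with conn:
        conn.execute(
            "INSERT INTO vacancies (id, status) VALUES (?, 'new')", (vacancy_id,)
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (f"vacancy_created:{vacancy_id}",),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish pending messages; delete each one only after publish succeeded."""
    sent = 0
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for msg_id, payload in rows:
        try:
            publish(payload)
        except ConnectionError:
            break  # keep the message in the queue; the next pass retries it
        with conn:
            conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        sent += 1
    return sent


delivered = []
broker_up = False


def publish(payload: str) -> None:
    # Fake broker client used only for this demonstration.
    if not broker_up:
        raise ConnectionError("broker unavailable")
    delivered.append(payload)


db = make_db()
save_vacancy(db, 1)
first_pass = relay_once(db, publish)   # broker is down: the message stays queued
broker_up = True
second_pass = relay_once(db, publish)  # the same message is retried and delivered
```

If the process dies after `publish` returns but before the row is deleted, the message is sent again on the next pass. That is exactly what "at least once" means, so consumers have to tolerate duplicates.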

1.0. Bugs

Bugs can be global too. For example, at some point we found out that our python framework registers itself in consul from every worker process, and does so even before the application is ready to accept requests. Once this is fixed in the framework, the change gradually reaches all services as they update.

I talked about another global bug, related to jvm settings, on the demo stage at jpoint 2019.

And what do you do, for example, with a bug that shows up once a week on one of the instances and is cured by a restart, but that neither load tests nor synthetic tests reproduce?
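The intended behavior (register once, and only when the application can actually serve traffic) can be sketched as follows. This is not the code of our framework and it does not call the consul API: `ServiceRegistrar`, `worker_ready` and the `register` callback are hypothetical names, and "every worker has reported ready" is just one possible readiness rule.

```python
from typing import Callable, Set


class ServiceRegistrar:
    """Registers the service once, and only after every worker is ready."""

    def __init__(self, worker_count: int, register: Callable[[], None]) -> None:
        self._expected = worker_count
        self._ready: Set[int] = set()
        self._register = register
        self._registered = False

    def worker_ready(self, worker_id: int) -> None:
        # Called by the master process when a worker reports it can accept requests.
        self._ready.add(worker_id)
        if not self._registered and len(self._ready) == self._expected:
            self._register()
            self._registered = True


calls = []
registrar = ServiceRegistrar(worker_count=3, register=lambda: calls.append("register"))
registrar.worker_ready(0)
registrar.worker_ready(1)
calls_before_ready = list(calls)  # nothing registered while a worker is still starting
registrar.worker_ready(2)
registrar.worker_ready(2)         # a repeated signal must not register again
```

Keeping the registration call in one place, outside the workers, is what removes both symptoms at once: duplicate registrations and traffic arriving before the application is ready.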

For example, blocked threads in java services. In nuts-and-bolts:

"qtp1778300121-22" #22 prio=5 os_prio=0 cpu=797.67ms elapsed=11737.06s tid=0x00007f5890139000 nid=0x26 waiting for monitor entry [0x00007f58922c7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at ch.qos.logback.core.AppenderBase.doAppend(AppenderBase.java:63)
- waiting to lock <0x00000000e86acad0> (a ru.hh.nab.logging.HhSyslogAppender)
at ru.hh.nab.logging.HhMultiAppender.doAppend(HhMultiAppender.java:47)
at ru.hh.nab.logging.HhMultiAppender.doAppend(HhMultiAppender.java:21)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)

"qtp1778300121-22" #22 prio=5 os_prio=0 cpu=5718.81ms elapsed=7767.14s tid=0x00007f1537dba000 nid=0x24 waiting for monitor entry [0x00007f153d2b9000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(java.base@11.0.4/ConcurrentHashMap.java:1723)
- waiting to lock <0x00000000e976a668> (a java.util.concurrent.ConcurrentHashMap$Node)
at org.springframework.beans.factory.BeanFactoryUtils.transformedBeanName(BeanFactoryUtils.java:86)

jackson:

"qtp1778300121-23" #23 prio=5 os_prio=0 cpu=494.19ms elapsed=7234.32s tid=0x00007f6c01218800 nid=0x25 waiting for monitor entry [0x00007f6c07cfa000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.glassfish.jersey.jackson.internal.jackson.jaxrs.base.ProviderBase._endpointForWriting(ProviderBase.java:711)
- waiting to lock <0x00000000e9f94c38> (a org.glassfish.jersey.jackson.internal.jackson.jaxrs.util.LRUMap)
at org.glassfish.jersey.jackson.internal.jackson.jaxrs.base.ProviderBase.writeTo(ProviderBase.java:588)

Code Cache:


1.1. General solutions

Sometimes we manage to come up with a standard solution before something becomes a serious problem. Examples are the log processing task that our Vlad Senin talked about at the same jpoint 2019, and the task of managing timeouts in our http client.

The idea is that a reasonable timeout is better determined on the server side than on the client side: the server has data on how quickly each of its endpoints responds. At the moment our client supports a single timeout per service. But obviously not all endpoints of a service respond equally fast: some take longer, some are quicker. We would like to be able to use different timeouts. Otherwise, a situation like this one arises:

So far such situations appear only under load testing, but I want to solve them before they become a problem.
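As an illustration of the per-endpoint idea, here is a minimal sketch that derives a timeout for each endpoint from latencies observed on the server side and falls back to the single service-wide timeout when there is too little data. It is an assumption-laden example, not our http client: the function names, the 99th percentile, the 1.5 safety margin, the minimum sample count and all the numbers are made up.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile of `samples` for q in (0, 1]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def endpoint_timeouts(
    latencies_ms: Dict[str, List[float]],
    q: float = 0.99,
    margin: float = 1.5,
    min_samples: int = 5,
) -> Dict[str, float]:
    # Endpoints with too few observations are left out and use the default.
    return {
        endpoint: percentile(samples, q) * margin
        for endpoint, samples in latencies_ms.items()
        if len(samples) >= min_samples
    }


def timeout_for(endpoint: str, timeouts: Dict[str, float], service_default_ms: float) -> float:
    return timeouts.get(endpoint, service_default_ms)


# Hypothetical response times (ms) that a server has measured for its own endpoints.
observed = {
    "/search": [120, 150, 140, 900, 160, 130],
    "/status": [5, 6, 4, 5, 7, 5],
    "/rare": [300],
}
timeouts = endpoint_timeouts(observed)
search_timeout = timeout_for("/search", timeouts, service_default_ms=1000)
status_timeout = timeout_for("/status", timeouts, service_default_ms=1000)
rare_timeout = timeout_for("/rare", timeouts, service_default_ms=1000)
```

With one timeout per service, the fast `/status` endpoint would inherit the patience needed by the slow `/search` endpoint. Here each endpoint gets a limit close to its own observed behavior.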

1.2. Open issues

But not every problem comes down to a bottleneck or a complicated manual process. Next I will give a few examples of questions that also fall within our priorities but are much less well defined. So I will describe only the initial data, and we can discuss solutions in the comments if you like.

So, the first example: it is now becoming clear that we have a problem with integrating our services with one another. Integrating, say, a site endpoint into the API may take longer than developing it in the first place.

Another example of a similar problem, probably familiar to many, is splitting up a monolith. Everyone understands that a monolith overgrown with legacy code complicates development and operation. But who can say by how much? Is it worth sacrificing other technical debt tasks in favor of the split, when each individual piece of it carries vanishingly small value?

The scale of these and similar problems is such that solving them sometimes requires going far beyond purely technical matters and diving into completely new areas of the work process. On the one hand this is frightening, but on the other it gives incredible freedom in choosing solutions.

2. How we work

The story about the directions of our work would be incomplete without describing HOW we work with all this.

To begin with, here is what attracted me to “Architecture” and what motivates all of us: we really do work for quality.

And before anyone starts throwing stones at me, let me explain what I mean. I believe that no developer deliberately neglects quality. It comes down to technical debt: if we are talking about a piece of business logic that is not going to be reused, then the debt from a less-than-ideal solution will most likely grow slowly over time, if at all.

This lets you rein in your perfectionism a little: file a technical debt task and move on to the next iteration. But if we are talking about a framework or a global configuration tool that is used in hundreds of applications and locks in certain design or naming patterns, then the debt from a poor decision in it can grow fast enough to wipe out any gains at all. Of course, sometimes even the best solution reveals weaknesses once it is in use, but this does not happen often...

Toward the end, I would like to talk about the obstacles that still get in our way. Without this, the story about our work would be dishonest. So.

2.0. The difficulty of evaluating tasks

As I said above, we cannot estimate the benefit of every task. By how much will time to release decrease once a “boxed” solution for some function is available? Which of two problematic sections of code should be refactored first? To develop an adequate system for evaluating tasks, we met a couple of times a week for several months, but that is a topic for a separate post.

2.1. Collective unconscious

Coordinating anything among 150 people is not easy. Our highly decentralized organizational structure usually shows its best side, but for “Architecture” it is sometimes a serious obstacle. There are very few things on which we manage to reach an agreement, and even fewer agreements whose observance can then be monitored.

And all changes have to be rolled out smoothly. A service may go without updates for months, and then there is still the monolith... Well, that one is quite sad.

So, we've had our talk

I hope that my story has made it a little clearer what “Architecture” does at hh.ru. And if I have managed to spark your interest in our work, that is fantastic. What is more, a vacancy is open in our team right now. We will be very glad to have fresh ideas that help us achieve our victories, hidden from idle eyes but so important.

P.S. The original KDPV (cover image) turns out to be an illustration by this man. I hope he does not mind his images being used as a KDPV.
