How the Turbo pages content system works: diagrams, facts and a bit of history



According to TelecomDaily, almost 30% of mobile internet users in Russia run into problems loading sites every day. However, the reason may lie not only in uneven network coverage but also in pages that are simply too "heavy".

We cannot influence the quality of the connection, but helping webmasters simplify and lighten their site content - why not? That is how Turbo pages technology appeared at Yandex: everything needed for publication is handed over to our content system, which converts this data into lightweight, fast-loading materials.

How does this magic work? What path does the data travel before it becomes a full-fledged Turbo page? My name is Stas Makeev, I lead the development of Turbo pages technology. Let me try to explain everything.

But first, a brief overview so that you don't get lost when I start digging into the details.

A key advantage of the Turbo pages system is the fast conversion of data from its original form to the final one: articles on news sites are most in demand in the first minutes after publication, and product cards in online stores must be updated promptly and always reflect current availability. The second important property is reliability: the content system should be as stable as possible and able to survive the failure of individual servers or even entire data centers. And, of course, it was important not to put excessive load on the hosts of the partners connected to Turbo pages. In other words, when designing the service we had to find a balance between the speed of data processing and the number of requests we send.

Site owners have several ways to connect to the system:

  • feeds: YML for online stores, RSS for content sites;
  • API: the data is transferred to us directly (it arrives via Yandex.Webmaster);
  • Autoparser: Turbo pages are built automatically, no action from the webmaster is required.

The content system stores the results of its work in a special key-value storage (KV storage), where the key is the URL of the original page and the value holds the content of the Turbo page. As soon as a record lands in this KV storage, the new Turbo page immediately becomes available to search users, and in Yandex services the corresponding document gets a special rocket icon. To speed things up further, we also cache images and videos in our CDN.
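To make the storage model concrete, here is a minimal sketch in Python; the dict and the function names are illustrative stand-ins, not our production code:

```python
# A minimal sketch of the "URL -> Turbo page" idea described above; a plain
# dict stands in for the real KV storage, and the names are illustrative.
kv_storage: dict[str, str] = {}

def publish(original_url: str, turbo_content: str) -> None:
    # As soon as the record is written, the Turbo page is considered live.
    kv_storage[original_url] = turbo_content

def lookup(original_url: str) -> str | None:
    # Serving a Turbo page is a single key lookup.
    return kv_storage.get(original_url)

publish("https://example.com/news/42", "<turbo>...</turbo>")
print(lookup("https://example.com/news/42") is not None)   # True
```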

A very simplified overall scheme looks like this:



How it all began


The very first version of the content system was quite simple: every few minutes, on a schedule, the same program was launched on a server in Yandex's internal cloud. It consisted of several steps, and each step started only after the previous one had finished for all the feeds we knew about:

  • the list of RSS feeds was downloaded and the document parser was run;
  • a list of images was extracted from the parser output;
  • images not yet cached were uploaded to the CDN;
  • processed documents were written to the KV storage (roughly as in the sketch below).
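In very rough Python terms, that scheduled job did something like this; the helpers are trivial stand-ins I made up for illustration, not the original code:

```python
# A rough sketch of the first version: one scheduled job where every step
# waits for the previous one to finish for ALL feeds. Names are illustrative.

CDN: set[str] = set()            # image URLs already cached
KV: dict[str, str] = {}          # URL -> Turbo page content

def download_feed(url: str) -> str:
    return f"<rss>feed at {url}</rss>"     # stub for the real download

def parse_feed(feed_xml: str) -> list[dict]:
    # stub: the real parser extracts every document described in the feed
    return [{"url": "https://site/doc1", "images": ["https://site/img1.jpg"], "html": "..."}]

def run_iteration(feed_urls: list[str]) -> None:
    # Step 1: download every feed and parse documents out of it.
    documents = [doc for url in feed_urls for doc in parse_feed(download_feed(url))]
    # Step 2: collect image URLs from the parsed documents.
    images = [img for doc in documents for img in doc["images"]]
    # Step 3: upload only the images that are not in the CDN yet.
    for img in images:
        if img not in CDN:
            CDN.add(img)
    # Step 4: write processed documents to the KV storage.
    for doc in documents:
        KV[doc["url"]] = doc["html"]

run_iteration(["https://partner.example/rss"])
```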

This scheme worked perfectly while the system dealt with a few thousand fairly light RSS feeds from news agencies (in total, information about a little under 100,000 documents). But as the number of feeds grew, a problem quickly surfaced: each step took longer and longer, and a delay appeared between a new document showing up at the original source and it becoming available in Turbo mode.

We kept the situation under control with various tricks. The first thing we did was split the first step (downloading RSS feeds plus running the document parser) into a separate process: while one process was handling the images from the previous iteration, the other was already downloading feeds for the next one. After a while it became clear that the system was very hard to scale in this form. We needed something fundamentally new.

Processing RSS, API and YML in the new content system


The main problem of the old content system was that all the data was processed as one big batch: the pipeline did not move on to the next step until every document had passed through the previous one. To get rid of this, we decided to build a proper pipeline in which feeds and individual documents are processed as independently as possible. Each step became a separate service block (a "cube"); at the top level the scheme came out like this:



  • the first cube downloads RSS feeds and passes them on;
  • the second takes feeds one by one and parses their contents; its output is individual documents;
  • the third takes documents one at a time, processes images and videos, and writes everything to the KV storage (modeled in the toy sketch below).
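Here is a toy model of this pipelining in Python, with queues playing the role of transport between the cubes; the structure and names are my illustration, not the production services:

```python
# Three independent "cubes" connected by queues, each handling items one at a
# time instead of waiting for the whole batch. Everything here is illustrative.
import queue
import threading

feeds_q: "queue.Queue[str]" = queue.Queue()   # feed bodies waiting to be parsed
docs_q: "queue.Queue[dict]" = queue.Queue()   # individual documents
KV: dict[str, str] = {}

def downloader(feed_urls: list[str]) -> None:
    for url in feed_urls:                      # cube 1: download feeds, pass them on
        feeds_q.put(f"<rss>{url}</rss>")
    feeds_q.put(None)                          # signal "no more feeds"

def parser() -> None:
    while (feed := feeds_q.get()) is not None:  # cube 2: parse feeds into documents
        docs_q.put({"url": feed, "html": "..."})
    docs_q.put(None)

def media_and_store() -> None:
    while (doc := docs_q.get()) is not None:    # cube 3: handle media, write to KV
        KV[doc["url"]] = doc["html"]

threads = [threading.Thread(target=downloader, args=(["https://a/rss", "https://b/rss"],)),
           threading.Thread(target=parser),
           threading.Thread(target=media_and_store)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(KV), "documents written")
```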

The same feeds can be registered not only in Turbo but also in our other services, for example in News or in Market. If each of them downloaded the data on its own, the load on the webmaster's server would be several times higher than acceptable. What is the right way? Download the feed once and then distribute the content to all consuming services; that is exactly what Yandex's Robot (our crawler) does. We use it for downloading content: we take the list of RSS and YML feeds registered in Turbo from Yandex.Webmaster, pass it to the Robot, and subscribe to the download results.

We run the parser on the received data. Just in case, a reminder: an RSS feed is simply an XML file available at a static URL on the partner's host. It contains information about all updates on the site: which documents are new and which have changed. An ideal feed would contain only the most up-to-date information from the last few hours, no more than a hundred documents in a few hundred kilobytes.
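For illustration, here is roughly what such a feed looks like and how the parsing step turns it into separate documents; this is a simplified sketch using only the standard library, the real parser handles far more fields:

```python
# A minimal RSS feed and a tiny parser that extracts individual documents.
import xml.etree.ElementTree as ET

RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><link>https://example.com/news/1</link><title>First</title>
        <description>Full text of the first article</description></item>
  <item><link>https://example.com/news/2</link><title>Second</title>
        <description>Full text of the second article</description></item>
</channel></rss>"""

def parse_feed(feed_xml: str) -> list[dict]:
    root = ET.fromstring(feed_xml)
    return [{"url": item.findtext("link"),
             "title": item.findtext("title"),
             "body": item.findtext("description")}
            for item in root.iter("item")]

for doc in parse_feed(RSS):
    print(doc["url"], "->", doc["title"])
```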

Reality is harsher: sometimes documents sit in the feed for a very long time and never change. How do we avoid reprocessing them? We compute a hash of each document, store it in the database, and do nothing until the hash changes.
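The deduplication idea fits in a few lines; in this sketch a plain dict stands in for the database, and the particular hash function is my choice for illustration, not necessarily the one used in production:

```python
# Skip unchanged documents: hash the body and only reprocess when the stored
# hash differs.
import hashlib

seen_hashes: dict[str, str] = {}   # url -> hash of the last processed version

def needs_processing(url: str, body: str) -> bool:
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if seen_hashes.get(url) == digest:
        return False               # the document did not change, do nothing
    seen_hashes[url] = digest
    return True

print(needs_processing("https://example.com/news/1", "v1"))  # True  (new document)
print(needs_processing("https://example.com/news/1", "v1"))  # False (unchanged)
print(needs_processing("https://example.com/news/1", "v2"))  # True  (changed)
```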

From the content system's point of view, processing YML feeds and the API is practically no different from working with RSS: for YML we run a different parser, and the data sent through the API arrives directly from Yandex.Webmaster.



Image and video processing


The document that comes out of the parser is almost ready to be written to the KV storage. The only thing left before sending it is to process the images and videos: cache them in the CDN and replace the links in the document. Here we again turn to the Robot for help.

First of all, we check each image and video: is it already in the CDN? If everything is cached, we replace the links and send the updated document to the KV storage. Otherwise:

  • we send the list of missing URLs to the Robot so it can schedule and download them;
  • the document itself goes to temporary storage so that we can check it again after a while.

Meanwhile, another cube receives the download results, uploads the data to the CDN, and updates the database.

This scheme also solves another important problem, scheduling: the Robot understands how quickly data can be downloaded from different hosts and does not allow them to be overloaded.
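Putting the media step together, here is a sketch of the check-or-delay logic described above; all the storages are plain Python structures standing in for the real CDN, KV storage, Delay and the Robot's download queue:

```python
# If every image of a document is already in the CDN, rewrite the links and
# write the document to KV; otherwise order the download and put the document
# into Delay for a later retry. Names and structures are illustrative.

CDN: dict[str, str] = {}          # original image URL -> cached CDN URL
KV: dict[str, dict] = {}          # document URL -> finished Turbo page
DELAY: list[dict] = []            # documents waiting for their media
DOWNLOAD_QUEUE: list[str] = []    # image URLs handed over to the Robot

def process_document(doc: dict) -> None:
    missing = [img for img in doc["images"] if img not in CDN]
    if missing:
        DOWNLOAD_QUEUE.extend(missing)   # ask the Robot to fetch them
        DELAY.append(doc)                # re-check this document later
        return
    doc = dict(doc, images=[CDN[img] for img in doc["images"]])
    KV[doc["url"]] = doc                 # the Turbo page is now live

def on_robot_downloaded(img_url: str, cdn_url: str) -> None:
    CDN[img_url] = cdn_url               # another cube updates the database

def retry_delayed() -> None:
    pending, DELAY[:] = DELAY[:], []
    for doc in pending:
        process_document(doc)

process_document({"url": "https://site/doc", "images": ["https://site/a.jpg"]})
on_robot_downloaded("https://site/a.jpg", "https://cdn.example/a.jpg")
retry_delayed()
print("https://site/doc" in KV)          # True
```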



Typical path that a new document follows:

  • the document appears in the feed;
  • the Robot downloads the feed;
  • the parser discovers the new document and sends it on;
  • we check whether the document's images are already in the database; if not, we order their download and send the document to temporary storage (Delay). While it waits there, the Robot downloads the images, they are cached in the CDN, and the links appear in the database;
  • we check the document again, replace the image links with CDN links, and write it to the KV storage;
  • if some media is still missing, the document goes back to Delay.


There is one more way to connect to Turbo pages that requires no effort at all from the webmaster: the Autoparser. It builds Turbo pages from the source pages of a content site. In Yandex.Webmaster you can enable it, look at examples of finished pages, and configure ads and analytics.

The main difficulty the Autoparser faces is figuring out from the HTML markup which pieces of information should be used when building a Turbo page. To solve this, we have several offline processes that try to work out exactly how a specific host should be parsed. I will focus on the two main ones:

  • the first compares RSS feeds with the HTML pages of the same documents: for sites that have both, the feed already tells us where the title, the text and the images are, and we can see how those pieces appear in the HTML. On such examples we train a CatBoost model that learns to recognize content blocks directly from the markup, so it can then be applied to hosts that have no RSS feed at all (a toy sketch of this classification idea follows the list);
  • the second works at the level of an individual page and decides which of the recognized blocks (title, text, images, video and so on) end up in the final Turbo page.
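To give a feel for the first process, here is a toy CatBoost example; the features, labels and numbers are invented purely for illustration and have nothing to do with our real feature set or model:

```python
# Train a classifier to decide whether an HTML block is part of the main
# content. Features here are hypothetical: [text_length, link_density, dom_depth].
from catboost import CatBoostClassifier

X_train = [[1200, 0.02, 4],   # long text, few links      -> article body
           [  40, 0.90, 3],   # short, almost only links  -> navigation
           [ 900, 0.05, 5],
           [  25, 0.80, 2]]
y_train = [1, 0, 1, 0]        # 1 = content block, 0 = boilerplate

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X_train, y_train)

# Classify a new, long, low-link-density block.
print(model.predict([[1000, 0.03, 4]]))
```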

By the way, one more frequent obstacle: many sites prohibit robots from downloading images in robots.txt. Unfortunately, there is no way around this restriction, so the Autoparser is not available for such pages.
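Such a restriction is easy to check with the standard library; the user agent, paths and rules below are just examples:

```python
# Check whether robots.txt allows a crawler to download an image.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /images/"])   # example robots.txt content

print(rp.can_fetch("YandexBot", "https://example.com/images/photo.jpg"))   # False
print(rp.can_fetch("YandexBot", "https://example.com/news/article.html"))  # True
```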

As a result, the full scheme of the content system looks like this:



The system turned out to scale well: a significant amount of resources now goes into serving the database, the Autoparser and other components (the cube responsible for parsing RSS, YML and the API alone uses more than 300 CPU cores), and if the load grows, it will not be hard to add extra capacity.

Thank you for reading to the end! I hope that after this article there is more logic and less magic for you in how Turbo pages work (by the way, here is even more detail about Turbo pages). If something is still unclear, write in the comments, we are in touch.
