How we began to create product cards automatically

In my last article, I talked about how we learned to automatically match products by product names, that is, to understand, for example, that

Headset A4Tech Bloody G501 black

and

A4 G501, black (red) {Headphones with a microphone, 2.2m}

are the same product. This made it possible to automate everything related to prices and availability. In this article I will describe how we went further and automated the work with product characteristics and images.

This is a logical next step for the product, now that automatic matching of goods by name works reliably. We spun the automation project off into a separate company. Here I describe the project using one of our clients as an example and share some of the client's data with its permission. A short introduction comes first, followed by the algorithms we used.

Where we started

A year ago the starting point was as follows: a catalog of 120 thousand goods, of which 70 thousand were in stock. After we launched automatic matching and creation of goods (the subject of the previous article), the catalog quickly grew to slightly over a million goods, of which 600-650 thousand were in stock. But only 350 thousand items were available for retail purchase, because the rest had no characteristics. Depending on the category, a certain percentage of characteristics must be filled in before a product can go on sale, and in some categories a photo is required. Of the 350 thousand goods on sale, 120 thousand had no images. To clarify: goods that are in stock but not sold at retail can still be sold wholesale, where it is customary to send Excel files with names and prices. Still, a B2B portal with product cards is undoubtedly a plus. Such goods can also be listed on aggregators, which already have more detailed product information.

Images and specifications were filled in semi-automatically. Over the past year we acquired other customers, and I now know how this process is organized in a dozen or two relatively large stores; it is more or less standard. From time to time, the programmer who maintains the site writes parsers for sites in the store's subject area. Then the content department uses the admin panel to match the parsed goods to products in the catalog, and the source's categories to the catalog's categories. After that, for each property in each category, they configure correspondences, replacements, and unit conversions. Some product cards are filled in entirely by hand. A small share of the goods is handed to an external contractor. That is how it worked for our client.

About 6% of the catalog becomes obsolete each month. That is, to keep 500 thousand products on sale on the site, you need to create 30 thousand cards a month (500 thousand × 6%) just to hold the number steady.

We decided to approach the issue as systematically as possible and automate the process of creating product cards as much as possible.

Where we get the data

The first question that arises is where to get the source data, and how much we can get. We have two main sources:

vendor files and APIs
the data that we found on the Internet

The second source involves parsing sites. We built an infrastructure for creating parsers that lets you configure parsing of a site in about half an hour. About a hundred parsers have been built on this infrastructure. So far they have collected information on 16 million products, and downloaded 35 million images and documents (manuals, datasheets) and uploaded them to a CDN. We use Bunny CDN. When saving images, we compute their perceptual hashes so that we can quickly find duplicates later; more on this below. Ideally, a parser would collect product information from any site without hints or configuration, but that is a distant plan.

How we fill the product card with images

This is the simplest and most obvious part. We already have a mechanism for matching products by name, and it works quickly and accurately. What matters here is the false positive rate, which is about 0.1%. False negatives have practically no effect on the result, because it makes little difference whether we end up working with 10 sources or 11. Matching runs at 5-6 thousand products per second, so all of our 15 million products are matched against any catalog in less than an hour. We could stop here, since after matching we know all the images of each product. In practice, though, the same images are duplicated across sites in different combinations, so the right thing is to choose the best-quality picture from each set of similar ones and keep the rest as alternatives in case the product card is later edited by hand.

Similar images are identified in two stages. The first stage happens when a file is downloaded: we compute a perceptual hash for each image. We first remove the solid background and crop the margins, then apply DCT hashing to the resulting image. This algorithm was once described in an article on Habr.
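
As a rough illustration of DCT hashing, here is a minimal sketch in pure Python. It assumes the image has already been cleaned (background removed, margins cropped), converted to grayscale and downscaled to a small square matrix such as 32 x 32. The function name `dct_phash`, the 8 x 8 low-frequency block and the median threshold are assumptions borrowed from common perceptual hash implementations, not details confirmed in this article.

```python
import math
from statistics import median


def dct_phash(pixels, hash_size=8):
    """Return a perceptual hash (an int of hash_size**2 bits) for a square
    grayscale matrix given as a list of equally long rows."""
    n = len(pixels)
    # basis[u][x] = cos(pi * (2x + 1) * u / (2n)): the DCT-II cosine basis,
    # computed only for the lowest `hash_size` frequencies.
    basis = [
        [math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
        for u in range(hash_size)
    ]
    # Transform along the first axis:
    # rows[u][y] = sum over x of basis[u][x] * pixels[x][y]
    rows = [
        [sum(basis[u][x] * pixels[x][y] for x in range(n)) for y in range(n)]
        for u in range(hash_size)
    ]
    # Transform along the second axis. Normalization constants are omitted
    # for brevity, so these are unnormalized low-frequency coefficients.
    coeffs = [
        sum(rows[u][y] * basis[v][y] for y in range(n))
        for u in range(hash_size)
        for v in range(hash_size)
    ]
    # One bit per coefficient: 1 if it is above the median, else 0.
    threshold = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > threshold else 0)
    return bits
```

Two such hashes are compared by Hamming distance: the fewer bits differ, the more similar the images are considered to be.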

After matching, each product has a set of images with precomputed hashes. We need to split this set into clusters and then select the highest-quality image from each cluster. The size of such a set varies from a few images to several hundred; a typical value is 10-40. At this size an algorithm of quadratic complexity is acceptable, since the main operation, computing the Hamming distance, is very fast.

We divide the images into clusters as follows:

the first image from the set forms the first cluster
each next image is compared with the first image from each cluster
if the highest similarity coefficient is greater than the threshold value, add the image to the corresponding cluster
otherwise we create a new cluster from one image
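
The steps above can be sketched as follows. The 64-bit hash width, the similarity defined as the share of matching bits, and the 0.9 threshold are illustrative assumptions; the article does not state the actual values.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def similarity(a, b, bits=64):
    """Share of matching bits, from 0.0 to 1.0 (assumed similarity measure)."""
    return 1.0 - hamming(a, b) / bits


def cluster_by_hash(hashes, threshold=0.9):
    """Greedy clustering: each image is compared only with the first image
    of every existing cluster. Returns lists of indices into `hashes`."""
    clusters = []
    for i, h in enumerate(hashes):
        best_cluster, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(h, hashes[cluster[0]])
            if sim > best_sim:
                best_cluster, best_sim = cluster, sim
        if best_cluster is None:
            clusters.append([i])      # nothing similar enough: new cluster
        else:
            best_cluster.append(i)    # join the most similar cluster
    return clusters
```

Because every image is compared only with cluster representatives, the number of comparisons is at most quadratic in the size of the set, which is cheap for a few dozen images.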

After that, we take the highest-quality image from each cluster. To a first approximation, quality is determined by the size and resolution of the image. The larger both are, the better, but the ratio of size to resolution should stay within reasonable limits. In practice, this analysis is enough.
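
One possible reading of this rule is sketched below. It assumes that "size" means file size in bytes and "resolution" means pixel count, and it treats the ratio as bytes per pixel. The `ImageInfo` type and the bounds `min_bpp` and `max_bpp` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    url: str
    width: int
    height: int
    file_bytes: int


def pick_best(images, min_bpp=0.05, max_bpp=4.0):
    """Pick the largest image whose bytes-per-pixel ratio looks plausible."""

    def plausible(img):
        bpp = img.file_bytes / (img.width * img.height)
        return min_bpp <= bpp <= max_bpp

    # Fall back to all images if none passes the plausibility check.
    candidates = [img for img in images if plausible(img)] or list(images)
    # Prefer more pixels; break ties by file size.
    return max(candidates, key=lambda img: (img.width * img.height, img.file_bytes))
```

The ratio check is meant to reject, for example, a small picture that was upscaled and heavily compressed: it has many pixels but suspiciously few bytes per pixel.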

There is one more nuance. We found the pictures automatically, but for now a person's decision still takes precedence. If a person decides to change the order of images, they should be able to. At the same time, we do not like the idea of “freezing” the card after manual editing: we want to keep improving the algorithms, finding new images, and automatically improving the card in the future.

With this in mind, the following approach proved workable: all images that were changed manually (where a change means the widest possible set of actions: confirmation, exclusion, reordering) always stay in place. All other images can still be changed automatically in the future.
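
A minimal sketch of such a merge is shown below. The data layout (pairs of image id and a `manual` flag, plus a set of excluded ids) and the function name `refresh_images` are assumptions made for illustration, not the actual storage format.

```python
def refresh_images(current, candidates, excluded=frozenset()):
    """Rebuild the image list of a card after a new automatic run.

    current    -- list of (image_id, manual) pairs in display order
    candidates -- image ids proposed by the latest automatic run, best first
    excluded   -- image ids that a person explicitly removed from the card
    """
    pinned = {image_id for image_id, manual in current if manual}
    fresh = iter(
        [c for c in candidates if c not in pinned and c not in excluded]
    )
    result = []
    for image_id, manual in current:
        if manual:
            # A manual decision is never overwritten.
            result.append((image_id, True))
        else:
            # An automatic slot receives the next best fresh candidate.
            replacement = next(fresh, None)
            if replacement is not None:
                result.append((replacement, False))
    # Any remaining candidates are appended after the existing slots.
    result.extend((c, False) for c in fresh)
    return result
```

With this layout, a later and better automatic run can replace every automatic image while confirmed, reordered or excluded images keep the state a person gave them.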

How we fill the product card with properties

With properties, things are a little more complicated than with pictures. In different sources the same property may have different names, may be expressed in different units, or may have no explicit counterpart (for example, “HDMI: yes” versus “Interfaces: HDMI, USB”).

We started by defining property types and came up with the following:

yes/no
number + unit
single choice
multiple choice
string

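Below I use the terms “dimension” and “unit of measurement”; they mean different things. A dimension is the kind of physical quantity (power, length), while a unit of measurement is a specific scale for it (watt, kilowatt).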

Then we described the units of measurement and how to convert between them. There are already more than a hundred, from meters to decibels. They are divided into groups by dimension, and units within a group can be converted into one another. For example, horsepower can be converted to watts, but lumens cannot be converted to newton meters.
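
The idea of grouping units by dimension can be sketched with a small table. The table layout, the unit symbols and the `convert` function are illustrative assumptions; the horsepower factor used here is the mechanical horsepower (about 745.7 W).

```python
# unit symbol -> (dimension, factor to the base unit of that dimension)
UNITS = {
    "W": ("power", 1.0),
    "kW": ("power", 1000.0),
    "hp": ("power", 745.699872),   # mechanical horsepower
    "mm": ("length", 0.001),
    "m": ("length", 1.0),
    "lm": ("luminous flux", 1.0),
    "N*m": ("torque", 1.0),
}


def convert(value, from_unit, to_unit):
    """Convert between units of the same dimension, or raise ValueError."""
    from_dim, from_factor = UNITS[from_unit]
    to_dim, to_factor = UNITS[to_unit]
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_unit} ({from_dim}) to {to_unit} ({to_dim})")
    return value * from_factor / to_factor
```

Here `convert(1.8, "kW", "W")` gives 1800 watts, while `convert(1, "lm", "N*m")` raises an error because the dimensions differ.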

Next, we tried to break the process of filling the properties into independent steps so that each step could be improved and tested separately from the others. For each product from the catalog, we perform the following operations:

1. match the product to goods from the sources by name (this is a complex process, described in detail in the previous article)
2. form the list of property definitions to be filled
3. form the list of properties from all the matched goods
4. associate each definition from step 2 with a list of values from step 3
5. reduce that list to a single value

Let us look at the last two steps in more detail, using an example. Suppose we have a property definition like this:

Name: “Power”
Type: Number (6, 0)
Unit: Watt

and such a list of values (in fact, it is much larger):

Length (mm): 220
Length: 0.22 m
Power (kW): 1.8
Power: 1.8 kW
Rated Power: 1800 W
Power: 2 kW
Power: 1800

To begin with, we need to understand what values are worth considering in principle. A working but not final version looks like this:

Each of these points is implemented as an independent algorithm, so each can be improved separately from the others. For the synonyms point, we manually entered some of the most obvious pairs and put the rest off until better times. Later we may try to select synonyms automatically, using data from properties that are already filled. In addition, we can take only one value from each source, so every algorithm has a priority that determines the order in which the algorithms run.

After this operation, we will have the following set of values:

Power (kW): 1.8
Power: 1.8 kW
Rated Power: 1800 W
Power: 2 kW
Power: 1800

We know that this set must be reduced to a single number with a known dimension. Each property type has its own independent algorithm. For a number with a unit of measurement, we do the following:

First, for each value we try to determine the unit of measurement. If that fails, or if the unit's dimension does not match the dimension of the property, we ignore the value from then on. That leaves the following set of values:

Power (kW): 1.8
Power: 1.8 kW
Rated Power: 1800 W
Power: 2 kW

Then we try to convert each value into a number and bring it to the required unit of measurement. Here we try to handle all possible formats: a comma or a dot separating the integer and fractional parts, commas or spaces separating thousands, abbreviations such as “thous.”, and so on. The result is:

1800 W
1800 W
1800 W
2000 W
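
Both steps for the example above can be sketched like this. The regular expressions, the small `UNITS` table and the rule for telling a decimal comma from a thousands separator are simplifying assumptions; a production parser would need many more formats and units.

```python
import re

# unit symbol -> (dimension, factor to the base unit); a tiny assumed table
UNITS = {
    "W": ("power", 1.0),
    "kW": ("power", 1000.0),
    "mm": ("length", 0.001),
    "m": ("length", 1.0),
}

NAME_UNIT = re.compile(r"\(([^)]+)\)\s*$")                        # "Power (kW)"
VALUE_UNIT = re.compile(r"^\s*([\d\s.,]+?)\s*([^\d\s.,]+)?\s*$")  # "1.8 kW"


def parse_number(text):
    """Parse '1800', '1 800', '1,8', '1,800' or '1.800,5' into a float."""
    s = re.sub(r"\s", "", text)
    if "," in s and "." in s:
        # The separator that appears last is taken as the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        if re.fullmatch(r"\d{1,3}(,\d{3})+", s):
            s = s.replace(",", "")    # "1,800": comma separates thousands
        else:
            s = s.replace(",", ".")   # "1,8": decimal comma
    return float(s)


def normalize(name, raw, target_unit):
    """Return the value in `target_unit`, or None if it must be ignored."""
    match = VALUE_UNIT.match(raw)
    if not match:
        return None
    number_text, unit = match.group(1), match.group(2)
    if unit is None:
        # No unit in the value: look for one in the property name.
        in_name = NAME_UNIT.search(name)
        unit = in_name.group(1) if in_name else None
    if unit not in UNITS or UNITS[unit][0] != UNITS[target_unit][0]:
        return None                   # unknown unit or wrong dimension
    try:
        number = parse_number(number_text)
    except ValueError:
        return None
    return number * UNITS[unit][1] / UNITS[target_unit][1]
```

For the list above, `normalize("Power (kW)", "1.8", "W")` and `normalize("Power", "1.8 kW", "W")` both give 1800 watts, while `normalize("Power", "1800", "W")` returns `None` because no unit can be determined.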

This set must be reduced to a single value. If the set contains one value, we use it. If there are several values and they all coincide, even better. If there are several values and they differ, we take the most common one. If there is no single most common value, we do not fill the property automatically.
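
This reduction rule can be sketched in a few lines. The function name `reduce_values` is an assumption, and a tie between the most common values is treated as "no answer".

```python
from collections import Counter


def reduce_values(values):
    """Return the single most common value, or None if there is no clear winner."""
    if not values:
        return None
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None   # tie: leave the property for manual filling
    return ranked[0][0]
```

For the example above, `reduce_values([1800.0, 1800.0, 1800.0, 2000.0])` selects 1800 watts, while `reduce_values([1800.0, 2000.0])` returns `None`.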

That was an example for a numeric property. For other types the final steps differ and take their specifics into account.

As for manual editing, we treat properties exactly as we treat images. Anything that was changed or confirmed manually can no longer be changed automatically. Everything else can.

How we test the project

This project has a notable feature: its core technology consists entirely of optimization problems. Moreover, most of these problems have no exact solution, so we came up with our own criteria for them that take speed and accuracy into account. This is where tests are truly needed.

The architecture of the project is designed so that complex algorithms are broken into simple steps wherever possible. Each step has an optimization criterion and a set of tests. The emphasis is on accuracy, but speed matters too; at the very least, we check both after every change. Each step resembles a competitive programming problem, and we can compare results after each change.

Results

We will be able to evaluate the results more accurately after a couple of months in production; so far less than a week has passed. In that time we have added to the site about 110 thousand products that previously could not be sold for lack of properties or images.
The goal for the future is to eliminate manual work on filling the catalog entirely. As a first stage, a good result would be to reduce its volume by 70%. I think we will achieve this within several months.