Hello! My name is Vladimir Olokhtonov, I am a senior developer in the Avito automatic moderation team. In autumn 2019, we launched a search service for similar images based on the faiss library. It helps us understand that photos have already been seen in another ad, even if they are seriously distorted enough: blurred, cropped and the like. This is how we identify potentially fake publications.
I would like to talk about the problems that we encountered in the process of creating this service, and our approaches to solving them.
![](https://habrastorage.org/webt/cv/iu/hg/cviuhge0fxcrjnlmcklq20snwkm.png)
The article assumes that the reader is at least a little familiar with the topic of searching in multidimensional spaces, since hereinafter we will focus mainly on technical details. If this is not the case, I recommend reading the basic article on the Mail.ru blog first .
Statement of the problem and description of our system
The first thing I thought about when I was tasked with making a search system for pictures was requirements and limitations:
- The number of entries. We had about 150 million vectors at the start, now there are already more than 240 million.
- Search time limits. In our case, about 300 ms for 95 percentiles.
- Memory limit. It is necessary that the index is placed on ordinary servers, taking into account growth for the next 2 years.
- . Kubernetes, , , .
- , .
:
, — , . , .
, . , , .
. — eventually consistent . , .
Python3.7, PostgreSQL, — MinIO. faiss.
![](https://habrastorage.org/webt/49/h7/o-/49h7o--tc0rrmvsqvq4zdgduexa.png)
http- avio, aiohttp.
asyncio fork, , — , , multiprocessing.Pipe.
Product Quantization. 64 .
![image](https://habrastorage.org/webt/5w/lv/b_/5wlvb_dzumdhxccghilhll1cn3g.png)
Product Quantization .
Inverted File HNSW.
![](https://habrastorage.org/webt/8j/nh/yk/8jnhyk30nb3jxyacwku4eo-mm_w.png)
, faiss : IVF262144_HNSW32,PQ64. , Inverted File 262144 , HNSW 32 , 64 Product Quantization.
faiss .
, :
- 10 000 , 1 .
- 16 OpenMP-. , 16 , #pragma omp parallel for.
— . , , - .
, IVF262144_HNSW32,PQ64 80 :
:
int(faiss.get_mem_usage_kb() * 1024 / index.ntotal)
, 2Gb. nlist × pq.M × pq.ksub × float. 262144 × 64 × 256 × 4 ≈ 17G, pq.M — Product Quantization, pq.ksub — 256, .
. : , 2. , Inverted File, . — .
bytes per vector :
![](https://habrastorage.org/webt/pv/ov/v5/pvovv572e8pjv4geu72tgkbetm0.png)
, , , - , , . , .
OpenMP faiss, , , . , ThreadPoolExecutor (faiss GIL).
![](https://habrastorage.org/webt/au/on/cd/auoncdoln059pebxfoifr5uqzom.png)
, faiss , . -, (add, remove) - . -, OpenMP- , , .
RWLock, , . Python . OpenMP- faiss.omp_set_num_threads.
, query-per-second. 5.
, . , , issue.
faiss.
, 5 10 000 ( ). , : 800 .
20 150 rps 20 latency 500 throughput. : , .
, . , , . MinIO.
— fork Copy-on-Write . . — 80 .
, .
GPU
GPU , .
: SIFT1M, 128.
: 100 10 000 nprobe.
![](https://habrastorage.org/webt/tj/if/0x/tjif0x32bocxr-3es1vxrlttz20.png)
:
- GPU. GPU , , , , .
- Flat- GPU, ( 128).
- PQ64- GPU .
Faiss — open source , .
, , faiss . issues, .
, vearch JD.com. open source, .