Displaying search results and performance issues

One of the typical scenarios in applications we all know is searching for data by certain criteria and displaying the results in an easy-to-read form, often with additional options for sorting, grouping, and pagination. The task is trivial in theory, but many developers solving it make mistakes that later hurt performance. Let's consider various solutions to this problem and formulate recommendations for choosing the most effective implementation.


Paging Option #1


The simplest option that comes to mind is paging of the search results in its most classic form.


Suppose an application uses a relational database. In this case, to display information in this form, you will need to execute two SQL queries:

  • Get rows for the current page.
  • Count the total number of rows matching the search criteria - this is needed to display the page links.

Let's consider the first query using the AdventureWorks test database for MS SQL Server 2016 as an example. For this purpose, we will use the Sales.SalesOrderHeader table:

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

The above query returns the first 50 orders from the list sorted in descending order by date added - in other words, the last 50 orders.

It runs quickly on the test database, but let's look at the execution plan and the I/O statistics:


Table 'SalesOrderHeader'. Scan count 1, logical reads 698, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

You can get I/O statistics for each query by running the SET STATISTICS IO ON command in the session before executing it.

As you can see from the execution plan, the most resource-intensive part is sorting all the rows of the source table by the date added. And the problem is that the more rows the table contains, the "heavier" the sort becomes. In practice, such situations should be avoided, so let's add an index on the date column and see whether resource consumption changes:
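
A minimal sketch of such an index (the index name is my assumption; AdventureWorks does not ship with it out of the box):

CREATE INDEX IX_SalesOrderHeader_OrderDate -- hypothetical name
ON Sales.SalesOrderHeader (OrderDate);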


Table 'SalesOrderHeader'. Scan count 1, logical reads 165, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Obviously, things got much better. But have all the problems been solved? Let's change the query to search for orders where the total cost of goods exceeds $100:

SELECT * FROM Sales.SalesOrderHeader
WHERE SubTotal > 100
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY


Table 'SalesOrderHeader'. Scan count 1, logical reads 1081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

We have a funny situation: the query plan is only slightly worse than the previous one, but the actual number of logical reads is almost twice as high as with a full table scan. There is a way out: if we turn the existing index into a composite one and add the order subtotal as the second field, we again get 165 logical reads:

CREATE INDEX IX_SalesOrderHeader_OrderDate_SubTotal ON Sales.SalesOrderHeader(OrderDate, SubTotal);

This series of examples could be continued for a long time, but the two main points I want to make here are:

  • Adding any new criterion or sort order to the search query can significantly affect the speed of its execution.
  • But if we need to read only a portion of the data, rather than all the results matching the search conditions, there are many ways to optimize such a query.

Now let's move on to the second query mentioned at the very beginning - the one that counts the number of records matching the search criteria. Let's take the same example - finding orders costing more than $100:

SELECT COUNT(1) FROM Sales.SalesOrderHeader
WHERE SubTotal > 100

With the composite index described above in place, we get:


Table 'SalesOrderHeader'. Scan count 1, logical reads 698, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The fact that the query scans the entire index is not surprising: the SubTotal field is not in the first position of the index, so the query cannot seek on it. The problem is solved by adding a separate index on the SubTotal field, which brings the query down to only 48 logical reads.
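
A minimal sketch of such an index (again, the name is my assumption):

CREATE INDEX IX_SalesOrderHeader_SubTotal -- hypothetical name
ON Sales.SalesOrderHeader (SubTotal);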

A few more examples of count queries could be given, but the essence stays the same: fetching a portion of the data and counting the total number of results are two fundamentally different queries, and each requires its own optimization measures. In the general case, you will not find a combination of indexes that works equally well for both.

Accordingly, one of the important requirements to clarify when developing such a search solution is whether it is really important for the business to see the total number of objects found. It often turns out that it is not. And navigation to specific page numbers is, in my opinion, a solution with a very narrow scope, since most paging scenarios boil down to "go to the next page."

Paging Option #2


Suppose users don’t care about knowing the total number of objects found. Let's try to simplify the search page:


In fact, the only change is that there is no way to jump to a specific page number, so the table no longer needs to know how many pages there are. But a question arises: how does the table know whether there is data for the next page (in order to display the "Next" link correctly)?

The answer is very simple: read one record more from the database than is needed for display; the presence of this "extra" record tells you whether there is a next portion. This way, getting one page of data requires only one query, which significantly improves performance and simplifies support of this functionality. In my practice there was a case where dropping the total record count sped up the display of results by 4-5 times.
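
A minimal sketch of this trick for a page of 50 rows: fetch 51 rows, display the first 50, and show the "Next" link only if the 51st row arrived:

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 51 ROWS ONLY; -- 50 rows to display + 1 row to detect the next page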

For this approach there are several UI options: "previous" and "next" commands, as in the example above; a "load more" button that simply appends a new portion to the displayed results; and "infinite scrolling", which works on the same principle as "load more", except that the signal to fetch the next portion is the user scrolling to the end of the displayed results. Whatever the visual solution, the principle of fetching the data remains the same.

The nuances of implementing paging


All the query examples above use the "offset + count" approach, where the query itself specifies at which position in the sorted result to start and how many rows to return. First, let's consider how best to pass the parameters in this case. In practice, I have come across several ways:

  • The number of the requested page (pageIndex) and the page size (pageSize).
  • The number of the first record to return (startIndex) and the maximum number of records to return (count).
  • The number of the first record to return (startIndex) and the number of the last record to return (endIndex).

At first glance it might seem that this is so elementary that there is no difference. But that is not so - the most convenient and universal option is the second one (startIndex, count). There are several reasons for this:

  • The "+1 record" approach described above is awkward with pageIndex and pageSize. Suppose we display 50 records per page, so to detect the next page we need to read 51. With (startIndex, count) we request records 1 to 51, then 51 to 101, and so on - each portion starts where the previous one ended. If we instead pass 51 as pageSize, the second page returns records 52 to 102, the third 103 to 153, and so on - the "extra" record read for one page can never become the first record of the next one, and rows get skipped (see the sketch after this list).
  • The third option makes no sense at all, since to execute queries in most databases you will still need to pass the count, not the index of the last record. Subtracting startIndex from endIndex is an elementary arithmetic operation, but it is superfluous here.
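
A minimal parameterized sketch of the (startIndex, count) approach combined with the "+1 record" trick (the parameter names are my assumptions):

DECLARE @StartIndex INT = 51, -- 1-based number of the first record to return
        @Count      INT = 50;

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET (@StartIndex - 1) ROWS      -- OFFSET is zero-based
FETCH NEXT (@Count + 1) ROWS ONLY; -- the extra row signals that a next page exists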

Now let's describe the shortcomings of implementing paging through "offset + count":

  • Getting each next page will be more expensive and slower than the previous one, because the database will still need to go through all the records “from the beginning” according to the search and sort criteria, and then stop at the desired fragment.
  • Not all DBMSs support it.

There are alternatives, but they are also imperfect. The first of these approaches is called "keyset paging", or the "seek method", and works as follows: after receiving a portion, you remember the field values of the last record on the page and then use them to get the next portion. For example, suppose we executed the following query:

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

Suppose the order date in the last record turned out to be '2014-06-29'. Then, to get the next page, you can try this:

SELECT * FROM Sales.SalesOrderHeader
WHERE OrderDate < '2014-06-29'
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

The problem is that OrderDate is not a unique field, and the condition above is likely to skip some of the required rows. To make the query unambiguous, you need to add a unique field to the condition (suppose 75074 is the primary key value of the last record in the first portion):

SELECT * FROM Sales.SalesOrderHeader
WHERE (OrderDate = '2014-06-29' AND SalesOrderID < 75074)
   OR (OrderDate < '2014-06-29')
ORDER BY OrderDate DESC, SalesOrderID DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

This option will work correctly, but in the general case it will be difficult to optimize, since the condition contains the OR operator. If the primary key value grows together with OrderDate, the condition can be simplified by keeping only the filter on SalesOrderID. But if there is no strict correlation between the primary key values and the field the result is sorted by, this OR cannot be avoided in most DBMSs. The exception known to me is PostgreSQL, which fully supports tuple comparison, so the condition above can be written as WHERE (OrderDate, SalesOrderID) < ('2014-06-29', 75074). With a composite index on these two fields, such a query should be quite fast.
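
A minimal sketch of that PostgreSQL variant (assuming an equivalent table and a composite index on (OrderDate, SalesOrderID); LIMIT replaces OFFSET ... FETCH):

-- PostgreSQL: a row-value comparison replaces the OR condition
SELECT * FROM Sales.SalesOrderHeader
WHERE (OrderDate, SalesOrderID) < ('2014-06-29', 75074)
ORDER BY OrderDate DESC, SalesOrderID DESC
LIMIT 50;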

The second alternative approach can be found, for example, in the ElasticSearch scroll API or in Cosmos DB: in addition to the data, a query returns a special identifier with which you can get the next batch. If this identifier has an unlimited lifetime (as in Cosmos DB), this is a great way to implement paging with sequential transitions between pages (option #2 mentioned above). Possible disadvantages: not all DBMSs support it, and the next-batch identifier may have a limited lifetime, which is generally unsuitable for user interaction (as with the ElasticSearch scroll API).

Complex filtering


Let's complicate the task further. Suppose there is a requirement to implement so-called faceted search, familiar to everyone from online stores. The examples above based on the orders table are not very illustrative here, so let's switch to the Production.Product table of the AdventureWorks database:


What is the idea of faceted search? It is that for each filter value, the number of records matching that value is shown, taking into account the filters selected in all the other categories.

For example, if we select the Bikes category and the Black color in this example, the table will display only black bikes, and at the same time:

  • For each criterion of the “Categories” group, the number of products from this category in black will be shown.
  • For each criterion of the Colors group, the number of bikes of that color will be shown.

Here is an example of the output for these conditions:


If we additionally check the "Clothing" category, the table will also show the black clothing that is available. The counts in the "Colors" section will be recalculated for the new conditions, while nothing will change in the "Categories" section... I hope these examples are enough to understand how the familiar faceted search algorithm works.

Now imagine how this can be implemented on a relational database. Each group of criteria, such as Categories and Colors, requires a separate query:

SELECT pc.ProductCategoryID, pc.Name, COUNT(1) FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
  INNER JOIN Production.ProductCategory pc ON ps.ProductCategoryID = pc.ProductCategoryID
WHERE p.Color = 'Black'
GROUP BY pc.ProductCategoryID, pc.Name
ORDER BY COUNT(1) DESC


SELECT Color, COUNT(1) FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
WHERE ps.ProductCategoryID = 1 --Bikes
GROUP BY Color
ORDER BY COUNT(1) DESC


What is wrong with this solution? Very simple: it does not scale well. Each filter section requires a separate query to count the quantities, and these queries are not the cheapest. In online stores, some catalogs may have several dozen filter sections, which can be a serious performance problem.

Usually, after these statements, I am offered some solutions, namely:

  • Combine all the count queries into one. Technically this is possible using the UNION keyword, but it will not help performance much: the database will still have to execute each of the fragments from scratch (see the sketch after this list).
  • Count everything once and cache the results. This will not help much either, because the number next to each facet value depends on the filters selected in all the other sections. For example, with 10 filter sections of 5 values each, selecting a value in one section changes the counts in the other 9, so the counters would have to be cached per combination of selected filters, and the number of such combinations is enormous. On top of that, the data changes over time, so the cache constantly goes stale; caching can help for a handful of the most popular filter combinations, but not in the general case.
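
A minimal sketch of combining the two count queries from above via UNION ALL (the Facet and Value column aliases are my assumptions):

SELECT 'Category' AS Facet, pc.Name AS Value, COUNT(1) AS Cnt
FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
  INNER JOIN Production.ProductCategory pc ON ps.ProductCategoryID = pc.ProductCategoryID
WHERE p.Color = 'Black'
GROUP BY pc.Name
UNION ALL
SELECT 'Color', p.Color, COUNT(1)
FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
WHERE ps.ProductCategoryID = 1 -- Bikes
GROUP BY p.Color;

The result is one round trip to the server instead of two, but the amount of work the database does stays essentially the same.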

Fortunately, this problem has long had reasonably effective solutions that work predictably on large amounts of data. With any of these options, it makes sense to split the facet recalculation and the retrieval of the results page into two parallel calls to the server and to organize the user interface so that loading the facet data "does not interfere" with displaying the search results.

  • Rethink the requirements. Ask whether users really need an exact count next to every filter value - does "1425 items" help anyone more than an approximate figure or simply "many"? Counts that are approximate, slightly stale, or omitted altogether are far cheaper to produce, and for many businesses that precision is perfectly acceptable. It is worth clarifying this with the business before building an exact real-time implementation.
  • Use a search engine such as Solr, ElasticSearch, or Sphinx: computing facet counts quickly is what they are built for, and they do it "out of the box". The catch is that the search engine keeps its own copy of the data, optimized for searching, which must be kept in sync with the primary database: every change has to be propagated to the index, so the search engine effectively becomes a second storage that can lag behind the source of truth. Synchronization is the main engineering cost of this approach; patterns such as the transactional outbox help make it reliable (a sketch of an outbox table follows this list).
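
A minimal sketch of a transactional outbox (all table, column, and value names here are my assumptions, not part of AdventureWorks): the data change and the event describing it are committed in one transaction, and a separate background process later reads unprocessed events and pushes them to the search engine:

CREATE TABLE dbo.SearchIndexOutbox (
    OutboxID    BIGINT IDENTITY PRIMARY KEY,
    EntityType  NVARCHAR(50) NOT NULL,  -- e.g. 'Product'
    EntityID    INT          NOT NULL,
    ChangeType  NVARCHAR(10) NOT NULL,  -- 'INSERT' / 'UPDATE' / 'DELETE'
    CreatedAt   DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME(),
    ProcessedAt DATETIME2    NULL       -- set by the indexing process
);

BEGIN TRANSACTION;
    UPDATE Production.Product SET Color = 'Black' WHERE ProductID = 680;
    INSERT INTO dbo.SearchIndexOutbox (EntityType, EntityID, ChangeType)
    VALUES ('Product', 680, 'UPDATE');
COMMIT TRANSACTION;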


Conclusions

  1. Fetching a portion of data and counting the total number of results are two different queries, and each needs its own optimization. The "offset + count" and "keyset" paging approaches each have trade-offs:
    • "offset + count" is simple and universal, but every next page costs more to fetch than the previous one.
    • "keyset" paging is faster and more predictable on large tables, but it is harder to implement and does not allow jumping to an arbitrary page.
  2. If the business does not need the exact total and page-number navigation, use the simplified paging from option #2: "previous/next" links driven by the "+1 record" trick.
  3. If faceted search is required, the realistic options are:
    • Simplify the requirements: approximate, cached, or omitted counts.
    • Use a search engine such as Solr, ElasticSearch, or Sphinx, keeping in mind the cost of synchronizing its copy of the data with the primary database.
  4. Implementing faceted search directly on a relational database scales poorly; consider it only for small data volumes and few filter sections.
  5. SQL will let you express almost any query, but that does not mean the query will run fast: always check the execution plan and I/O statistics on realistic ("production-like") data volumes. A problem invisible on a small dataset can turn into a time bomb as the data grows.
