Online analytics in a microservice architecture: helping and hinting Postgres FDW

Microservice architecture, like everything in this world, has its pros and cons. Some processes become easier with it, others more complicated. And for the sake of speed of change and better scalability, sacrifices have to be made. One of them is that analytics gets harder. If in a monolith all operational analytics can be reduced to SQL queries against an analytical replica, then in a multi-service architecture each service has its own database, and it seems you cannot get by with a single query (or can you?). For those interested in how we solved the problem of operational analytics in our company, and how we learned to live with this solution, welcome.


My name is Pavel Sivash. At DomKlik I work on the team responsible for maintaining the analytical data warehouse. Conventionally, our work can be classified as data engineering, but in fact the range of tasks is much wider: the ETL/ELT standard for data engineering, support and adaptation of data analysis tools, and development of our own tools. In particular, for operational reporting we decided to "pretend" that we have a monolith and give analysts a single database containing all the data they need.

In general, we considered different options. It was possible to build a full-fledged warehouse; we even tried, but, to be honest, we could not reconcile fairly frequent changes in business logic with the rather slow process of building a warehouse and making changes to it (if someone has succeeded, write in the comments how). We could have told the analysts: "Guys, learn Python and go query the analytical replicas yourselves," but that is an additional requirement for staff recruitment, and it seemed this should be avoided if possible. So we decided to try FDW (Foreign Data Wrapper) technology: essentially, it is the familiar dblink standardized in SQL (SQL/MED), but with a much more convenient interface. We built a solution on top of it that eventually took root, and we settled on it. Its details are the subject of a separate article, or maybe more than one, because I want to talk about a lot: from synchronizing database schemas to access control and anonymization of personal data. Note also that this solution is not a substitute for real analytical databases and warehouses; it solves only one specific problem.

At the top level, it looks like this:


There is a PostgreSQL database where users can store their working data, and, most importantly, analytical replicas of all services are connected to this database through FDW. This makes it possible to write a query across several databases, no matter what they are: PostgreSQL, MySQL, MongoDB or something else (a file, an API; if there happens to be no suitable wrapper, you can write your own). Well, everything seems great! Are we done here?
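
For reference, attaching one such replica is straightforward; here is a minimal setup sketch using postgres_fdw, where all names (service_a_replica, analytics_ro, the host and credentials) are hypothetical examples:

-- A minimal postgres_fdw setup sketch; every name here is a made-up example.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the remote analytical replica as a foreign server.
CREATE SERVER service_a_replica
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'service-a-replica.internal', port '5432', dbname 'service_a');

-- Map a local role to credentials on the remote side.
CREATE USER MAPPING FOR analytics_ro
  SERVER service_a_replica
  OPTIONS (user 'readonly', password 'secret');

-- Bring the remote tables in as foreign tables in a local schema.
CREATE SCHEMA IF NOT EXISTS fdw_schema;
IMPORT FOREIGN SCHEMA public
  FROM SERVER service_a_replica
  INTO fdw_schema;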

If it all ended that quickly and simply, there would probably be no article.

It is important to clearly understand how Postgres processes queries to remote servers. It seems logical, but people often overlook it: Postgres splits a query into parts that are executed on the remote servers independently, collects the resulting data, and performs the final computations itself. Therefore the speed of a query depends heavily on how it is written. Also note: when data arrives from a remote server, it no longer has indexes; nothing there can help the planner, so only we can help and hint it. That is exactly what I would like to talk about in more detail.

A simple query and its plan


To show how Postgres executes a query against a table with 6 million rows on a remote server, let's look at a simple plan.

explain analyze verbose  
SELECT count(1)
FROM fdw_schema.table;

Aggregate  (cost=418383.23..418383.24 rows=1 width=8) (actual time=3857.198..3857.198 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..402376.14 rows=6402838 width=0) (actual time=4.874..3256.511 rows=6406868 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table
Planning time: 0.986 ms
Execution time: 3857.436 ms

The VERBOSE option lets you see the query that will be sent to the remote server, whose results we will receive for further processing (the Remote SQL line).

Let's go a little further and add several filters to our query: one on a boolean field, one on a timestamp falling within an interval, and one on jsonb.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=577487.69..577487.70 rows=1 width=8) (actual time=27473.818..27473.819 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..577469.21 rows=7390 width=0) (actual time=31.369..25372.466 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 5046843
        Remote SQL: SELECT created_dt, is_active, meta FROM fdw_schema.table
Planning time: 0.665 ms
Execution time: 27474.118 ms

Here lies the moment you need to pay attention to when writing queries. The filters were not transferred to the remote server, which means that to execute the query Postgres pulls all 6 million rows across, only to filter them locally afterwards (the Filter line) and perform the aggregation. The key to success is to write the query so that the filters are transferred to the remote machine, and we receive and aggregate only the rows we actually need.

That's some booleanshit


With boolean fields, everything is simple. In the original query, the problem was caused by the IS operator. If we replace it with =, we get the following result:

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table
WHERE is_active = True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=508010.14..508010.15 rows=1 width=8) (actual time=19064.314..19064.314 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..507988.44 rows=8679 width=0) (actual time=33.035..18951.278 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: ((("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 3567989
        Remote SQL: SELECT created_dt, meta FROM fdw_schema.table WHERE (is_active)
Planning time: 0.834 ms
Execution time: 19064.534 ms

As you can see, the filter was pushed to the remote server, and the execution time dropped from 27 to 19 seconds.

It is worth noting that the IS operator differs from the = operator in that it can work with the NULL value. This means that IS NOT True in a filter keeps both False and NULL rows, whereas != True keeps only False. Therefore, when replacing the IS NOT operator, two conditions joined by OR should be passed to the filter, for example, WHERE (col != True) OR (col IS NULL).
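
Applied to our example, a sketch of such a rewrite (with the other filters omitted) might look like this:

-- Keeps the same rows as "is_active IS NOT True" (both False and NULL),
-- but uses operators that stand a chance of being pushed down:
SELECT count(1)
FROM fdw_schema.table
WHERE (is_active != True) OR (is_active IS NULL);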

With booleans sorted out, let's move on. In the meantime, we return the boolean filter to its original form, so as to consider the effect of each other change independently.

timestamptz? Who knows


In general, you often have to experiment with how to write a query involving remote servers, and only then look for an explanation of why it behaves that way. Very little information about this can be found on the Internet. So, through experiments, we found that a filter on a fixed date flies off to the remote server with a bang, but when we want to set the date dynamically, with now() or CURRENT_DATE, this does not happen. In our example, we added a filter so that the created_dt column falls within exactly one month in the past (BETWEEN CURRENT_DATE - INTERVAL '7 month' AND CURRENT_DATE - INTERVAL '6 month'). So what did we do in this case?

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt < (SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta->>'source' = 'test';

Aggregate  (cost=306875.17..306875.18 rows=1 width=8) (actual time=4789.114..4789.115 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..306874.86 rows=105 width=0) (actual time=23.475..4681.419 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text))
        Rows Removed by Filter: 76934
        Remote SQL: SELECT is_active, meta FROM fdw_schema.table WHERE ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone))
Planning time: 0.703 ms
Execution time: 4789.379 ms

We prompted the planner to compute the date in a subquery in advance and pass the ready-made variable to the filter. And this hint gave an excellent result: the query became almost 6 times faster!

Again, it is important to be careful here: the data type in the subquery must be the same as that of the field we are filtering on, otherwise the planner will decide that, since the types differ, it must first fetch all the data and filter it locally.
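
To illustrate the type rule with a sketch of the filter alone: created_dt is a timestamptz, while date arithmetic in Postgres yields a plain timestamp, so without the cast the types no longer match:

-- Subquery yields timestamp without time zone: types differ from created_dt,
-- so the planner may fall back to fetching everything and filtering locally.
AND created_dt >= (SELECT CURRENT_DATE - INTERVAL '7 month')

-- Subquery yields timestamptz, matching created_dt: the filter can be pushed down.
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month')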

Now let's return the date filter to its original form.

Freddy vs. Jsonb


Boolean fields and dates have already sped up our query nicely, but one more data type remained. The battle with filtering on it, frankly, is not over yet, although there have been successes. So, here is how we managed to push the filter on a jsonb field to the remote server.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=245463.60..245463.61 rows=1 width=8) (actual time=6727.589..6727.590 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=1100.00..245459.90 rows=1478 width=0) (actual time=16.213..6634.794 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 619961
        Remote SQL: SELECT created_dt, is_active FROM fdw_schema.table WHERE ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.747 ms
Execution time: 6727.815 ms

Instead of the ->> extraction operator, you have to use the @> containment operator, which checks that one jsonb contains another: 7 seconds instead of the initial 27. So far this is the only successful way we have found to push jsonb filters to a remote server, but it is important to note one limitation: we use database version 9.6, though we plan to finish the final tests and move to version 12 by the end of April. Once we upgrade, we will write about how it affected things, because there are many changes we have high hopes for: jsonpath, the new CTE behavior, push down (available since version 10). I'd like to try it soon.

Finish him


We have checked how each change affects query speed individually. Now let's see what happens when all three filters are written correctly.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active = True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt < (SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=322041.51..322041.52 rows=1 width=8) (actual time=2278.867..2278.867 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..322041.41 rows=25 width=0) (actual time=8.597..2153.809 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table WHERE (is_active) AND ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone)) AND ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.820 ms
Execution time: 2279.087 ms

Yes, the query looks more complicated, and that is the price we have to pay, but the execution time is 2 seconds, more than 10 times faster! And we are talking about a simple query over a relatively small data set. On real queries we saw gains of up to several hundred times.

To summarize: if you use PostgreSQL with FDW, always check that all filters are sent to the remote server, and you will be happy... at least until you get to joins between tables from different servers. But that is a story for another article.

Thank you for your attention! I will be glad to hear your questions and comments, as well as stories about your own experience, in the comments.
