Making PostgreSQL count(*) Faster



It is a frequent complaint that count(*) in PostgreSQL is very slow.

In this article, I want to explore the options you have to get your result as fast as possible.

Why is count(*) so slow?


Most people have no trouble understanding that the following query will execute slowly:

SELECT count(*)
FROM /* complicated query */;

This is, after all, a complex query, and PostgreSQL must calculate the result before it knows how many rows it will contain.

But many people are shocked when they find out that the following query is slow:

SELECT count(*) FROM large_table;

But on second thought, all of the above still applies: PostgreSQL has to calculate the result set before it can count it. Since there is no "magic row count" stored in the table (as in MySQL's MyISAM), the only way to count the rows is to go through them.

Therefore, count(*) will normally perform a sequential scan of the table, which can be quite expensive.
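You can see this for yourself with EXPLAIN. A minimal sketch, assuming a table named large_table with no other optimization in place:

```sql
-- show the execution plan without running the query;
-- on an unoptimized table the plan will contain a "Seq Scan" node,
-- meaning every table block has to be read from disk or cache
EXPLAIN SELECT count(*) FROM large_table;
```

On recent PostgreSQL versions the plan may also show a parallel sequential scan, but the whole table is still read either way.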

Is the "*" in count(*) a problem?


The "*" in SELECT * FROM ... expands to all columns. Consequently, many people assume that using count(*) is inefficient and write count(id) or count(1) instead.

But the "*" in count(*) is quite different: it simply means "row" and is not expanded at all (in fact, count(*) is "an aggregate with zero arguments"). Writing count(1) or count(id) is actually slower than count(*), because the argument has to be checked for NULL (count, like most aggregates, ignores NULL arguments).

So you gain nothing by avoiding the "*".
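The semantic difference is easy to demonstrate. A small sketch, assuming a throwaway table t:

```sql
CREATE TABLE t (id integer);
INSERT INTO t VALUES (1), (2), (NULL);

SELECT count(*) FROM t;   -- 3: counts rows, regardless of column values
SELECT count(id) FROM t;  -- 2: NULL arguments are ignored by the aggregate
SELECT count(1) FROM t;   -- 3: the constant 1 is never NULL,
                          --    but the per-row NULL check still happens
```

So count(id) can even return a different result than count(*), in addition to being slower.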

Using an index-only scan


It is tempting to scan a small index rather than the whole table to count the number of rows. However, this is not so simple in PostgreSQL because of its multi-version concurrency control (MVCC) strategy. Each row version ("tuple") contains the information which database snapshots it is visible to. But this (redundant) information is not stored in the indexes. So it usually isn't enough to count the entries in an index: PostgreSQL has to visit the table entry ("heap tuple") to make sure an index entry is visible.

To mitigate this problem, PostgreSQL introduced the visibility map, a data structure that records whether all tuples in a table block are visible to everybody or not.
If most table blocks are all-visible, an index scan rarely needs to visit the heap tuples to determine visibility. Such an index scan is called an "index-only scan", and then it is often faster to scan the index to count the rows.

It is VACUUM that maintains the visibility map, so make sure that autovacuum runs on the table often enough if you want to use a small index to speed up count(*).
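Putting this together, the workflow might look as follows. A sketch, assuming the same hypothetical large_table with an integer column id:

```sql
-- any small index works; the planner can use it for count(*)
CREATE INDEX large_table_id_idx ON large_table (id);

-- update the visibility map so that heap visits become unnecessary
VACUUM large_table;

-- the plan should now show an "Index Only Scan" node instead of a "Seq Scan"
EXPLAIN SELECT count(*) FROM large_table;
```

If the table is modified heavily, the visibility map degrades again, and the planner may fall back to visiting the heap until the next VACUUM.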

Using a counter table


I wrote above that PostgreSQL does not store the number of rows in a table.

Maintaining such a row count would be a substantial overhead, since it would have to happen on every data modification, and it would not pay off. That would be a bad bargain. Moreover, since different queries can see different row versions, the counter would have to be versioned as well.

But there is nothing to keep you from implementing such a row counter yourself.
Suppose you want to keep track of the number of rows in mytable. You can do that as follows:

START TRANSACTION;
 
CREATE TABLE mytable_count(c bigint);
 
CREATE FUNCTION mytable_count() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   IF TG_OP = 'INSERT' THEN
      UPDATE mytable_count SET c = c + 1;
 
      RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
      UPDATE mytable_count SET c = c - 1;
 
      RETURN OLD;
   ELSE
      -- TG_OP = 'TRUNCATE': reset the counter
      UPDATE mytable_count SET c = 0;
 
      RETURN NULL;
   END IF;
END;$$;
 
CREATE CONSTRAINT TRIGGER mytable_count_mod
   AFTER INSERT OR DELETE ON mytable
   DEFERRABLE INITIALLY DEFERRED
   FOR EACH ROW EXECUTE PROCEDURE mytable_count();
 
-- TRUNCATE triggers must be FOR EACH STATEMENT
CREATE TRIGGER mytable_count_trunc AFTER TRUNCATE ON mytable
   FOR EACH STATEMENT EXECUTE PROCEDURE mytable_count();
 
-- initialize the counter table
INSERT INTO mytable_count
   SELECT count(*) FROM mytable;
 
COMMIT;
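Once this is in place, the counter tracks modifications automatically. A usage sketch, assuming mytable has an integer column id:

```sql
-- the triggers keep mytable_count in sync with every modification
INSERT INTO mytable (id) VALUES (42);
DELETE FROM mytable WHERE id = 42;

-- the fast replacement for SELECT count(*) FROM mytable:
-- a single-row read instead of a full table scan
SELECT c FROM mytable_count;
```

Note that a query reading mytable_count sees the counter value that matches its own snapshot, because the counter updates are just ordinary MVCC row versions.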

We do everything in a single transaction so that no data modifications from concurrent transactions can be "lost" due to a race condition.
This is guaranteed because the CREATE TRIGGER command locks the table in SHARE ROW EXCLUSIVE mode, which blocks all concurrent modifications.
The downside is that all concurrent data modifications have to wait until the SELECT count(*) is done.

This provides us with a really fast alternative to count(*), but at the price of slowing down all data modifications on the table. Using a deferred constraint trigger ensures that the lock on the row in mytable_count is held as briefly as possible, to improve concurrency.

Even though this counter table will receive a lot of updates, there is no danger of "table bloat", because these will all be HOT updates.

Do you really need count(*)?


Sometimes the best solution is to look for an alternative.

Often an approximation is good enough and you don't need the exact count. In that case you can use the estimate that PostgreSQL uses for query planning:

SELECT reltuples::bigint
FROM pg_catalog.pg_class
WHERE relname = 'mytable';

This value is updated by both autovacuum and autoanalyze, so it should never be much more than 10% off. You can reduce autovacuum_analyze_scale_factor for that table so that autoanalyze runs more often there.
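Tuning this is a per-table storage parameter. A sketch, assuming mytable is the table in question:

```sql
-- default autovacuum_analyze_scale_factor is 0.1 (10% of rows changed);
-- lowering it makes autoanalyze refresh reltuples more frequently
ALTER TABLE mytable SET (autovacuum_analyze_scale_factor = 0.005);

-- check when the table statistics were last refreshed
SELECT last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'mytable';
```

The trade-off is extra background ANALYZE activity, which is usually cheap compared to repeated full-table counts.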

Estimating the number of query results


So far, we have been exploring how to speed up the counting of table rows.

But sometimes you need to know how many rows the SELECT statement will return without actually executing the query.

Obviously, the only way to get an exact answer to this question is to execute the query. But if an estimate is good enough, you can use PostgreSQL's optimizer to get it.

The following simple function uses dynamic SQL and EXPLAIN to get the execution plan for the query passed as an argument, and returns the optimizer's row count estimate:

CREATE FUNCTION row_estimator(query text) RETURNS bigint
   LANGUAGE plpgsql AS
$$DECLARE
   plan jsonb;
BEGIN
   EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
 
   RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END;$$;
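Calling it is straightforward. A usage sketch, assuming a hypothetical mytable with an integer column id:

```sql
-- returns the planner's row estimate without executing the query;
-- the result depends on the table's current statistics, not the actual data
SELECT row_estimator('SELECT * FROM mytable WHERE id < 1000');
```

Since the estimate comes from table statistics, run ANALYZE first if the table has changed a lot since the statistics were last gathered.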

Do not use this function to process untrusted SQL statements, as it is inherently vulnerable to SQL injection.
