Optimization of a single query with GROUP BY in PostgreSQL

I should say up front that this article offers no universal advice for all occasions; it deals with optimizing only a small class of queries. Still, such queries come up in many projects.
Consider the following schema. We have two tables:
- content — the documents;
- content_keyword_ref — the keywords present in a document.

CREATE TABLE content
(
  id integer NOT NULL DEFAULT nextval('content_id_seq'::regclass),
  some_data character varying(1000) NOT NULL,
  CONSTRAINT content_pkey PRIMARY KEY (id)
);
CREATE TABLE content_keyword_ref
(
  keyword_id integer NOT NULL,
  content_id integer NOT NULL,
  CONSTRAINT content_keyword_ref_pkey PRIMARY KEY (keyword_id, content_id),
  CONSTRAINT content_keyword_ref_content_id_foreign FOREIGN KEY (content_id)
    REFERENCES content (id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE CASCADE,
  CONSTRAINT content_keyword_ref_keyword_id_foreign FOREIGN KEY (keyword_id)
    REFERENCES keywords (id) MATCH SIMPLE
    ON UPDATE NO ACTION ON DELETE CASCADE
);
CREATE INDEX content_keyword_ref_content_id_index
ON content_keyword_ref
USING btree
(content_id);
CREATE INDEX content_keyword_ref_keyword_id_index
ON content_keyword_ref
USING btree
(keyword_id);
CREATE INDEX content_keyword_ref_keyword_content_idx
ON content_keyword_ref
USING btree
(keyword_id, content_id);
The database contains about 2 million documents and about 15 million document-keyword links.
The task: select the documents that contain at least one keyword from a given list.
The classic approach
For this we would write roughly the following query (I will add EXPLAIN ANALYZE right away and show the plan):
EXPLAIN ANALYSE
SELECT c.id
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
GROUP BY c.id
LIMIT 1000
We need GROUP BY only so that the same document is not returned once for every matching keyword. Here is what we see in the query execution plan:
Limit  (cost=21454.94..34933.16 rows=1000 width=4) (actual time=6.777..199.735 rows=1000 loops=1)
  ->  Group  (cost=21454.94..100235.11 rows=5845 width=4) (actual time=6.775..199.641 rows=1000 loops=1)
        Group Key: c.id
        ->  Merge Join  (cost=21454.94..100220.49 rows=5845 width=4) (actual time=6.774..199.389 rows=1141 loops=1)
              Merge Cond: (c.id = r.content_id)
              ->  Index Only Scan using content_pkey on content c  (cost=0.43..73221.47 rows=2182736 width=4) (actual time=0.013..131.942 rows=1339506 loops=1)
                    Heap Fetches: 0
              ->  Sort  (cost=21454.51..21469.13 rows=5845 width=4) (actual time=6.662..6.792 rows=1141 loops=1)
                    Sort Key: r.content_id
                    Sort Method: quicksort  Memory: 143kB
                    ->  Bitmap Heap Scan on content_keyword_ref r  (cost=118.16..21088.82 rows=5845 width=4) (actual time=0.470..6.273 rows=2007 loops=1)
                          Recheck Cond: (keyword_id = ANY ('{4713,5951}'::integer[]))
                          Heap Blocks: exact=1781
                          ->  Bitmap Index Scan on content_keyword_ref_keyword_content_idx  (cost=0.00..116.70 rows=5845 width=0) (actual time=0.239..0.239 rows=2007 loops=1)
                                Index Cond: (keyword_id = ANY ('{4713,5951}'::integer[]))
Planning time: 0.277 ms
Execution time: 199.805 ms
We would get the same result using DISTINCT instead of GROUP BY:
EXPLAIN ANALYSE
SELECT DISTINCT c.id
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
LIMIT 1000
We get:
Limit  (cost=21454.94..34933.16 rows=1000 width=4) (actual time=2.824..187.619 rows=1000 loops=1)
  ->  Unique  (cost=21454.94..100235.11 rows=5845 width=4) (actual time=2.824..187.519 rows=1000 loops=1)
        ->  Merge Join  (cost=21454.94..100220.49 rows=5845 width=4) (actual time=2.823..187.351 rows=1141 loops=1)
              Merge Cond: (c.id = r.content_id)
              ->  Index Only Scan using content_pkey on content c  (cost=0.43..73221.47 rows=2182736 width=4) (actual time=0.011..120.481 rows=1339506 loops=1)
                    Heap Fetches: 0
              ->  Sort  (cost=21454.51..21469.13 rows=5845 width=4) (actual time=2.693..2.805 rows=1141 loops=1)
                    Sort Key: r.content_id
                    Sort Method: quicksort  Memory: 143kB
                    ->  Bitmap Heap Scan on content_keyword_ref r  (cost=118.16..21088.82 rows=5845 width=4) (actual time=0.463..2.321 rows=2007 loops=1)
                          Recheck Cond: (keyword_id = ANY ('{4713,5951}'::integer[]))
                          Heap Blocks: exact=1781
                          ->  Bitmap Index Scan on content_keyword_ref_keyword_content_idx  (cost=0.00..116.70 rows=5845 width=0) (actual time=0.235..0.235 rows=2007 loops=1)
                                Index Cond: (keyword_id = ANY ('{4713,5951}'::integer[]))
Planning time: 0.264 ms
Execution time: 187.727 ms
As you can see, the grouping leads to a sort and other overhead. On some data the run time reached several seconds!
What can we do?
Optimization
I ran out of ideas for speeding up this query with the existing schema, so let's try to redesign it. The content table stays as it is, but the keywords will now be stored in an array. To select data quickly by conditions on that array, we create a GiST index. Which operators for working with arrays are supported by indexes is listed in the PostgreSQL documentation.
CREATE TABLE document
(
content_id integer NOT NULL,
-- The array of keywords, replacing the content_keyword_ref table
keyword_ids integer[] NOT NULL
);
-- Our GiST index
CREATE INDEX document_keyword_ids_index ON document USING gist (keyword_ids gist__intbig_ops);
-- A less interesting supporting index
CREATE INDEX document_content_id_index
ON public.document
USING btree
(content_id);
-- Copy the existing data over
INSERT INTO document (content_id, keyword_ids)
SELECT c.id, ARRAY(
SELECT r.keyword_id
FROM content_keyword_ref r
WHERE r.content_id = c.id
)
FROM content c
GROUP BY c.id;
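A note from me rather than from the original listing: the gist__intbig_ops operator class comes from the intarray extension, so on a database where that extension is not yet installed it has to be enabled before the GiST index above can be created. Something like:
-- Assumed prerequisite: intarray supplies the gist__intbig_ops operator class
CREATE EXTENSION IF NOT EXISTS intarray;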
Now let's build a query that returns the same data as the variants above:
EXPLAIN ANALYZE
SELECT c.id
FROM content c
JOIN document d ON d.content_id = c.id
AND d.keyword_ids && ARRAY[4713, 5951]
LIMIT 1000
The plan we get:
Limit  (cost=387.80..7540.27 rows=1000 width=4) (actual time=8.799..12.935 rows=1000 loops=1)
  ->  Nested Loop  (cost=387.80..14177.77 rows=1928 width=4) (actual time=8.799..12.880 rows=1000 loops=1)
        ->  Bitmap Heap Scan on document d  (cost=387.37..6246.79 rows=1930 width=4) (actual time=8.786..10.599 rows=1000 loops=1)
              Recheck Cond: (keyword_ids && '{4713,5951}'::integer[])
              Rows Removed by Index Recheck: 107
              Heap Blocks: exact=1008
              ->  Bitmap Index Scan on document_keyword_ids_index  (cost=0.00..386.89 rows=1930 width=0) (actual time=8.560..8.560 rows=1977 loops=1)
                    Index Cond: (keyword_ids && '{4713,5951}'::integer[])
        ->  Index Only Scan using content_pkey on content c  (cost=0.43..4.10 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1000)
              Index Cond: (id = d.content_id)
              Heap Fetches: 0
Planning time: 0.184 ms
Execution time: 12.994 ms
The benefit is considerable. On this data the optimized version of the query runs about 14 times faster, while the query text stays just as clear. Let's see what other benefits we gain.
Bonus
Suppose the found documents need to be displayed page by page. How, in that case, do we count the total number of records in the result set with the "classic" schema? Here are a few options:
Count the number of records in a subquery with GROUP BY:
SELECT COUNT(1) FROM (
SELECT c.id
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
GROUP BY c.id
) t;
Count the number of records in a subquery with DISTINCT:
SELECT COUNT(1) FROM (
SELECT DISTINCT(c.id)
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
) t;
Count the number of records without a subquery, using COUNT(DISTINCT column):
SELECT COUNT(DISTINCT c.id)
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
Or even like this:
SELECT COUNT(1) OVER()
FROM content c
JOIN content_keyword_ref r ON r.content_id = c.id
AND r.keyword_id IN (4713, 5951)
GROUP BY c.id
LIMIT 1
The drawback of all these options is not only performance. Will the pagination module in your framework automatically use one of them? Laravel, for example, will not: it fetches all the records and counts them with count() already in PHP. So most likely you will have to override the method that calculates the number of records, so that the counting is done by the database instead of loading the whole result set.
Here is how the number of records is counted with the optimized schema:
SELECT COUNT(1)
FROM document d
WHERE d.keyword_ids && ARRAY[4713, 5951]
Much shorter and no problems with the pager.
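As a rough sketch of how a pager could use the new schema (the page size, offset, and ORDER BY column here are arbitrary illustrations of mine, not part of the original article), one query returns the total and another fetches the page:
-- Total number of matching documents, for the pager
SELECT COUNT(1)
FROM document d
WHERE d.keyword_ids && ARRAY[4713, 5951];

-- Second page of 20 documents, ordered by id
SELECT c.id, c.some_data
FROM content c
JOIN document d ON d.content_id = c.id
  AND d.keyword_ids && ARRAY[4713, 5951]
ORDER BY c.id
LIMIT 20 OFFSET 20;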
Another bonus
We selected documents that contain at least one of the given keywords. What if we want to select documents that contain all of the keywords of interest? In the classic schema such a query can be built like this:
SELECT c.id
FROM content c
JOIN content_keyword_ref r1 ON r1.content_id = c.id
AND r1.keyword_id = 5388
JOIN content_keyword_ref r2 ON r2.content_id = c.id
AND r2.keyword_id = 5951
LIMIT 1000
That is, we add as many JOINs as there are keywords we are searching for. If instead we filter records by the array, we can use the @> ("contains") operator, and the query looks much neater:
SELECT c.id
FROM content c
JOIN document d ON d.content_id = c.id
AND d.keyword_ids @> ARRAY[5388, 5951]
LIMIT 1000
And its execution plan is better.
In the PostgreSQL documentation linked above you can find descriptions of other useful operators supported by indexes.
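For illustration, a couple of sketches of mine with other operators from that page, assuming the intarray extension mentioned earlier is installed:
-- Documents whose keyword set is entirely contained in the given list
SELECT d.content_id
FROM document d
WHERE d.keyword_ids <@ ARRAY[4713, 5951, 5388]
LIMIT 1000;

-- intarray also offers a query_int type for boolean-style conditions:
-- here, documents with keyword 4713 and at least one of 5951 or 5388
SELECT d.content_id
FROM document d
WHERE d.keyword_ids @@ '4713 & (5951 | 5388)'::query_int
LIMIT 1000;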
Summary
I experimented with different data sets. In general the optimized variant gives a speedup of 2 to 10 times. I did, however, manage to find examples where counting the number of records in the result was 1.5-2 times slower with the "optimized" variant.
So the experiment as a whole can be called a success. But if you decide to apply these tricks, test their effectiveness on your own data before rolling the change out to production.