PostgreSQL: accelerating queries using intarray
Six years ago, when the elephant was still at version 8.0 and I was sitting firmly on MySQL, I often heard calls to switch databases. I remember the start was painful. But once I dared, I never regretted it, and I am unlikely ever to return to MySQL. There are plenty of pluses, but this post is not about that.
A task came along: to write something that should grow large in the future, à la Fotos or Hotline. And the real challenge for such platforms is the filters.
How is it done in typical "engines"? Right: options = filters. Clearly that is simple, but not always pleasant, especially when you need range filters (for example: diagonal 10"-11"). You also have to keep in mind that filters should be independent of the entity's parameters. In my version the filters are "tied to" the parameters. I am not sure how it is done on Hotline itself (I suspect their filters are attached directly to the goods), but that option demands a lot of manual attention. I will not go into the architectural details here, that is a topic for another post; we will work with a simplified version so as not to get lost.
The problem with these filters is building them: counting how many items remain for the selected (or not yet selected) filters.
So...
Create a table for products:
CREATE TABLE public.products (
    id SERIAL,
    title VARCHAR NOT NULL,
    CONSTRAINT products_pkey PRIMARY KEY(id)
);
Filter table:
CREATE TABLE public.filters (
    id SERIAL,
    title VARCHAR NOT NULL,
    CONSTRAINT filters_pkey PRIMARY KEY(id)
);
Table for the relations between filters and products:
CREATE TABLE public.products_ref_filters (
    id_product INTEGER NOT NULL,
    id_filter INTEGER NOT NULL,
    CONSTRAINT products_ref_filters_pkey PRIMARY KEY(id_product, id_filter),
    CONSTRAINT products_ref_filters_fk FOREIGN KEY (id_product)
        REFERENCES public.products(id)
        ON DELETE CASCADE
        ON UPDATE CASCADE,
    CONSTRAINT products_ref_filters_fk1 FOREIGN KEY (id_filter)
        REFERENCES public.filters(id)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);
So far we have the standard setup. Trivial.
Add a filters field to the products table:
ALTER TABLE public.products
    ADD COLUMN filters INTEGER[] DEFAULT ARRAY[]::INTEGER[] NOT NULL;
As you can see, this is an array of integers that duplicates the filter IDs. Yes, it is denormalization. Integrity is maintained by procedures hung on triggers. Insert:
CREATE OR REPLACE FUNCTION public.products_ref_filters__insert_tr ()
RETURNS trigger AS'
BEGIN
UPDATE products
SET filters = filters + NEW.id_filter --push element onto array
WHERE id = NEW.id_product;
RETURN NEW;
END;
'LANGUAGE 'plpgsql';
CREATE TRIGGER products_ref_filters__insert_tr
AFTER INSERT
ON public.products_ref_filters FOR EACH ROW
EXECUTE PROCEDURE public.products_ref_filters__insert_tr();
Removal:
CREATE OR REPLACE FUNCTION public.products_ref_filters__delete_tr ()
RETURNS trigger AS'
BEGIN
UPDATE products
SET filters = filters - OLD.id_filter --remove entries matching right argument from array
WHERE id = OLD.id_product;
RETURN OLD;
END;
'LANGUAGE 'plpgsql';
CREATE TRIGGER products_ref_filters__delete_tr
AFTER DELETE
ON public.products_ref_filters FOR EACH ROW
EXECUTE PROCEDURE public.products_ref_filters__delete_tr();
Update:
CREATE OR REPLACE FUNCTION public.products_ref_filters__update_tr ()
RETURNS trigger AS'
BEGIN
UPDATE products
SET filters = filters - OLD.id_filter
WHERE id = OLD.id_product;
UPDATE products
SET filters = filters + NEW.id_filter
WHERE id = NEW.id_product;
RETURN NEW;
END;
'LANGUAGE 'plpgsql';
CREATE TRIGGER products_ref_filters__update_tr
AFTER UPDATE
ON public.products_ref_filters FOR EACH ROW
EXECUTE PROCEDURE public.products_ref_filters__update_tr();
Truncate:
CREATE OR REPLACE FUNCTION public.products_ref_filters__truncate_tr ()
RETURNS trigger AS'
BEGIN
UPDATE products SET filters = ARRAY[]::INTEGER[];
RETURN NULL;
END;
'LANGUAGE 'plpgsql';
CREATE TRIGGER products_ref_filters__truncate_tr
AFTER TRUNCATE
ON public.products_ref_filters FOR EACH STATEMENT
EXECUTE PROCEDURE public.products_ref_filters__truncate_tr();
Now, whenever rows are inserted, updated, or deleted in the link table, the data is written into the filters array of the products table (the + and - array operators used in the triggers come from the intarray extension installed below). And we got rid of the JOIN: there is no longer any need to glue tables together in queries. This gives many advantages: queries are easier to build, they run faster, use less memory, and so on. But wasn't this article supposed to be about intarray? Yes.
Install the extension:
CREATE EXTENSION intarray;
After executing this command, the database gains functions, operators, and index operator classes for working with INTEGER arrays. Everything will now be much faster and simpler.
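To give a feel for what the extension adds, here are a few of its operators (my own quick examples, not from the original article):
SELECT ARRAY[1,2,3] + 4;             -- {1,2,3,4}: push an element onto the array
SELECT ARRAY[1,2,3,2] - 2;           -- {1,3}: remove all entries equal to 2
SELECT ARRAY[1,2,3] & ARRAY[2,3,4];  -- {2,3}: intersection
SELECT ARRAY[1,2,3] | ARRAY[3,4,5];  -- union of the two arrays
SELECT ARRAY[1,2,3] && ARRAY[3,5];   -- true: the arrays overlap
SELECT ARRAY[1,2,3] @> ARRAY[2,3];   -- true: the left array contains the right one
SELECT uniq(sort(ARRAY[3,1,2,3]));   -- {1,2,3}: sort and remove duplicates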
Let's fill our database: 10 000 filters, 100 000 products, 10 filters per product. In total the link table gets 1 000 000 rows. This block of queries took about 8 minutes to run for me.
INSERT INTO filters (title) SELECT 'filter_' || num FROM generate_series(1, 10000) as num;
INSERT INTO products (title) SELECT 'product_' || num FROM generate_series(1, 100000) as num;
DO $$
DECLARE
idp INTEGER;
BEGIN
FOR idp IN SELECT id FROM products
LOOP
INSERT INTO products_ref_filters
SELECT idp, f.id FROM filters f ORDER BY random() LIMIT 10;
END LOOP;
END$$;
Create an index on an array of numbers:
CREATE INDEX products_idx ON public.products
USING gin (filters public.gin__int_ops);
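To check that the planner actually picks this index up, you can look at the query plan (my own sanity check; the exact plan and cost numbers will differ between installations and data sets):
EXPLAIN SELECT * FROM products WHERE filters @> ARRAY[123];  -- 123 is an arbitrary filter id
-- expected shape: a Bitmap Index Scan on products_idx feeding a Bitmap Heap Scan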
That's it, our database is populated. Now let's see what we can do with it. First, let's find the most popular filters and remember their IDs.
SELECT id_filter
FROM products_ref_filters
GROUP BY id_filter
ORDER BY count(*) DESC
LIMIT 10
My result: 7267, 4889, 6364, 5376, 3556, 7292, 11188, 2643, 9005, 10235.
Let's find the products with a particular filter, say 7267:
-- normal JOIN
SELECT t1.* FROM products t1, products_ref_filters t2
WHERE t1.id = t2.id_product
AND t2.id_filter = 7267
-- 140 rows returned (execution time: 125 ms; total time: 141 ms)
-- SUB SELECT
SELECT * FROM products
WHERE id IN ( SELECT id_product FROM products_ref_filters WHERE id_filter = 7267 )
-- 140 rows returned (execution time: 125 ms; total time: 141 ms)
-- search through the array
SELECT * FROM products WHERE filters @> ARRAY[7267]
-- 140 rows returned (execution time: 0 ms; total time: 0 ms)
Products matching at least one of those filters:
-- JOIN
SELECT DISTINCT t1.* FROM products t1, products_ref_filters t2
WHERE t1.id = t2.id_product
AND t2.id_filter IN (7267,4889,6364,5376,3556,7292,11188,2643,9005,10235)
-- 1347 rows returned (execution time: 297 ms; total time: 297 ms)
-- SUB SELECT
SELECT * FROM products
WHERE id IN ( SELECT id_product FROM products_ref_filters WHERE id_filter IN (7267,4889,6364,5376,3556,7292,11188,2643,9005,10235) )
-- 1347 rows returned (execution time: 234 ms; total time: 250 ms)
-- intarray
SELECT * FROM products WHERE filters && ARRAY[7267,4889,6364,5376,3556,7292,11188,2643,9005,10235]
-- 1347 rows returned (execution time: 16 ms; total time: 16 ms)
I was too lazy to run the remaining queries, sorry. But when you need to find products matching several filters at once, this approach leaves the JOIN and the subselect far behind:
-- JOIN
-- skipped out of laziness, but here the gap is even bigger ;)
-- Products containing both filters
SELECT * FROM products WHERE filters @> ARRAY[9844,9957];
-- 1 rows returned (execution time: 0 ms; total time: 16 ms)
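And coming back to the problem stated at the very beginning, counting how many products remain for each filter: a minimal sketch of how such counts could be built on top of the array (my own query, not measured by the author; unnest() is available since PostgreSQL 8.4):
SELECT f.id_filter, count(*) AS cnt
FROM (
    -- explode the filters array of every product that matches the already selected filters
    SELECT unnest(filters) AS id_filter
    FROM products
    WHERE filters @> ARRAY[9844,9957]
) f
GROUP BY f.id_filter
ORDER BY cnt DESC;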
The official documentation is in the intarray section of the PostgreSQL manual.
Pay attention to the operators, they are simply a fairy tale! I read somewhere that there is even a patch that lets you declare foreign keys over array elements, with the database itself enforcing integrity, and that the developers plan to support this without any patches at all. That would be great. Back when I first found this solution for myself, my joy knew no bounds...
Seeing that there was little interest in the topic, I decided to add some tests on larger volumes anyway.
Let's try 10 000 000 products.
We change the query that fills the database with pseudo-random data.
-- filters: 10 000, as before
INSERT INTO filters (title) SELECT 'filter_' || num FROM generate_series(1, 10000) as num;
-- products: 10 000 000
INSERT INTO products (title) SELECT 'product_' || num FROM generate_series(1, 10000000) as num;
-- since integrity does not matter for these tests, we fill in only the IDs
DO $$
DECLARE
idp INTEGER;
BEGIN
FOR idp IN SELECT id FROM products
LOOP
UPDATE products
SET filters = (SELECT array_agg(id) FROM (SELECT id FROM filters OFFSET random()*1000 LIMIT 10) as foo)
WHERE id = idp;
END LOOP;
END$$;
-- this query runs for ~20 min
The size of the DB ~2.9 GB
The BTREE index:
CREATE INDEX products_idx ON public.products USING btree (filters);
-- execution time: 00:02:36
-- DB size ~ 3.8 Gb
SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Seq Scan on public.products (cost=0.00..357908.00 rows=10000 width=80) ...
-- 99798 rows returned (execution time: 5.594 sec; total time: 5.594 sec)
SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=487.94..32940.13 rows=9726 width=80)
-- 9940 rows returned (execution time: 46 ms; total time: 62 ms)
SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Seq Scan on public.products (cost=0.00..357908.00 rows=10000 width=80)
-- 189853 rows returned (execution time: 6.281 sec; total time: 6.296 sec)
GiST index:
CREATE INDEX products_idx ON public.products USING gist (filters public.gist__int_ops);
-- execution time: 00:26:55;
-- DB size ~ 4.5 Gb
SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=833.92..34097.44 rows=10000 width=80)
-- 99798 rows returned (execution time: 2.234 sec; total time: 2.234 sec)
SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=811.79..33263.99 rows=9726 width=80)
-- 9940 rows returned (execution time: 234 ms; total time: 234 ms)
SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=833.92..34097.44 rows=10000 width=80)
-- 189853 rows returned (execution time: 5.234 sec; total time: 5.234 sec)
GIN index:
CREATE INDEX products_idx ON public.products USING gin (filters public.gin__int_ops);
SELECT * FROM products WHERE filters @> '{842}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=97.50..33361.02 rows=10000 width=80)
-- 99798 rows returned (execution time: 2.204 sec; total time: 2.219 sec)
SELECT * FROM products WHERE filters = '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=211.37..32663.57 rows=9726 width=80)
-- 9940 rows returned (execution time: 297 ms; total time: 312 ms)
SELECT * FROM products WHERE filters && '{842,843,844,845,846,847,848,849,850,851}'::INTEGER[];
-- Bitmap Heap Scan on public.products (cost=213.50..33477.02 rows=10000 width=80)
-- 189853 rows returned (execution time: 4.500 sec; total time: 4.515 sec)
Yes, on a really huge database it is not a silver bullet. But given that in practice nobody searches across the whole table, and there is always an additional condition such as a category, it is still a good method. In any case, it is about a hundred times faster than the JOIN.
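For example, such a combined query could look roughly like this (id_category is a hypothetical column, not part of the schema built above):
SELECT *
FROM products
WHERE id_category = 42              -- a regular selective condition narrows the set first
  AND filters @> ARRAY[842,843];    -- the array check then only deals with that category's rows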