Explaining the inexplicable. Part 3
In preparation for the PG Day’16 conference, we continue to introduce you to interesting aspects of PostgreSQL. Today we bring you a translation of the third article in a series about explain.
In previous posts of this series I wrote about how to interpret a single line of explain analyze output and its structure, and described the basic data-retrieval operations (the nodes of the explain tree).
Today we move on to more complex operations.
Function Scan
Example:
$ explain analyze select * from generate_Series(1,10) i;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.00..10.00 rows=1000 width=4) (actual time=0.012..0.013 rows=10 loops=1)
Total runtime: 0.034 ms
(2 rows)
By and large, it is so simple that there is no real need to explain it. But since this operation will be used in later examples, I'll write a little about it anyway.
Function Scan is a very simple node. It runs a function that returns a set of records (a recordset), and that's all. It will not run functions like "lower()", only those that can potentially return multiple rows or columns. When the function returns rows, they are passed to the node one level above the Function Scan in the plan tree, or to the client if the Function Scan is the root node.
The only additional logic that can appear here is the ability to filter the returned rows, like here:
$ explain analyze select * from generate_Series(1,10) i where i < 3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series i (cost=0.00..12.50 rows=333 width=4) (actual time=0.012..0.014 rows=2 loops=1)
Filter: (i < 3)
Rows Removed by Filter: 8
Total runtime: 0.030 ms
(4 rows)
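To build intuition, here is a rough analogy in Python (illustrative only, not Postgres internals): a set-returning function behaves like a generator, and Function Scan streams its rows upward, optionally filtering them along the way.

def generate_series(start, stop):
    # a stand-in for the SQL generate_series(): yields rows one at a time
    i = start
    while i <= stop:
        yield i
        i += 1

rows = [i for i in generate_series(1, 10) if i < 3]   # Filter: (i < 3)
print(rows)   # [1, 2] - the other 8 rows are removed by the filter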
Sort
I think it's pretty easy to understand: Sort takes the given records and returns them sorted in a specified way.
Example:
$ explain analyze select * from pg_class order by relname;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Sort (cost=22.88..23.61 rows=292 width=203) (actual time=0.230..0.253 rows=295 loops=1)
Sort Key: relname
Sort Method: quicksort Memory: 103kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.048 rows=295 loops=1)
Total runtime: 0.326 ms
(5 rows)
Even though it is simple, there is some interesting logic inside. For starters, if the memory required for sorting would exceed work_mem, it switches to sorting on disk:
$ explain analyze select random() as x from generate_series(1,14000) i order by x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort (cost=62.33..64.83 rows=1000 width=0) (actual time=16.713..18.090 rows=14000 loops=1)
Sort Key: (random())
Sort Method: quicksort Memory: 998kB
-> Function Scan on generate_series i (cost=0.00..12.50 rows=1000 width=0) (actual time=2.036..4.533 rows=14000 loops=1)
Total runtime: 18.942 ms
(5 rows)
$ explain analyze select random() as x from generate_series(1,15000) i order by x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort (cost=62.33..64.83 rows=1000 width=0) (actual time=27.052..28.780 rows=15000 loops=1)
Sort Key: (random())
Sort Method: external merge Disk: 264kB
-> Function Scan on generate_series i (cost=0.00..12.50 rows=1000 width=0) (actual time=2.171..4.894 rows=15000 loops=1)
Total runtime: 29.767 ms
(5 rows)
Note the change of Sort Method in the example above.
In such cases Postgres uses temporary files, which are stored in the directory $PGDATA/base/pgsql_tmp/. Of course, they are removed as soon as they are no longer needed.
One additional property of Sort is that it can change its method of operation when it is called by a Limit operation, as you can see here:
$ explain analyze select * from pg_class order by relfilenode limit 5;
QUERY PLAN
Limit (cost=15.77..15.78 rows=5 width=203) (actual time=0.119..0.120 rows=5 loops=1)
-> Sort (cost=15.77..16.50 rows=292 width=203) (actual time=0.118..0.118 rows=5 loops=1)
Sort Key: relfilenode
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.005..0.047 rows=295 loops=1)
Total runtime: 0.161 ms
(6 rows)
Usually, to sort a dataset you need to process it in its entirety. But Postgres knows that if you only need a small number of rows, it does not have to sort the whole dataset; it is enough to get just the first values.
In Big O notation, a general sort has complexity O(m * log(m)), while Top-N has complexity O(m * log(n)), where m is the number of rows in the table and n is the number of rows returned. It is also important to know that this sorting method uses much less memory (after all, it does not need to keep the whole sorted dataset, just the top n rows), so it is less likely to resort to slow disk for temporary files.
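If it helps, here is a minimal Python sketch of the difference (illustrative only; inside Postgres the top-N heapsort is, as the Sort Method line says, a heap): a full sort keeps all m rows in memory, while a top-N heap keeps only about n of them at any moment.

import heapq
import random

rows = [random.random() for _ in range(14000)]

full_sort = sorted(rows)             # O(m log m): the whole dataset is held in memory
top_n = heapq.nsmallest(5, rows)     # O(m log n): only ~n rows are kept at a time

assert top_n == full_sort[:5]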
Limit
I have used limit repeatedly because it is very simple, but let's discuss it in detail. The Limit operation runs its sub-operation and returns only the first N rows of what the sub-operation returns. Usually it stops the sub-operation after that, but in some cases (for example, a pl/PgSQL function call) the sub-operation has already finished its work by the time it returns its first row.
A simple example:
$ explain analyze select * from pg_class;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.047 rows=295 loops=1)
Total runtime: 0.096 ms
(2 rows)
$ explain analyze select * from pg_class limit 2;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.07 rows=2 width=203) (actual time=0.009..0.010 rows=2 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=203) (actual time=0.008..0.009 rows=2 loops=1)
Total runtime: 0.045 ms
(3 rows)
As you can see, using limit in the second case caused the nested Seq Scan operation to finish its work immediately after finding two rows.
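The mechanics are easy to mimic in Python (an illustrative sketch, not Postgres code): the sub-operation is lazy, so pulling just N rows from it stops the work early.

from itertools import islice

def seq_scan():
    # a pretend Seq Scan over 295 rows, producing them lazily
    for i in range(1, 296):
        print("scanning row", i)
        yield i

limited = list(islice(seq_scan(), 2))   # only "scanning row 1" and "scanning row 2" are printed
print(limited)                          # [1, 2]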
HashAggregate
This operation is used mainly when you have GROUP BY and some aggregates, like sum(), avg(), min(), max() and others.
Example:
$ explain analyze select relkind, count(*) from pg_Class group by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12.38..12.42 rows=4 width=1) (actual time=0.223..0.224 rows=5 loops=1)
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=1) (actual time=0.008..0.053 rows=295 loops=1)
Total runtime: 0.273 ms
(3 rows)
HashAggregate does the following: for each row it receives, it finds the GROUP BY "key" (relkind in this case). Then it puts the row into a bucket of a hash (associative array, dictionary) identified by that key.
After all rows have been processed, it scans the hash and returns one row for each key value, performing the relevant calculations along the way (sum, min, avg, and so on).
It is important to understand that HashAggregate needs to scan all rows before it can return even a single one.
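A minimal Python sketch of the idea (illustrative only; the input rows here are made up): the hash is just a dictionary keyed by the GROUP BY value, and nothing can be emitted until every input row has been consumed.

from collections import defaultdict

rows = ["r", "r", "i", "v", "r"]          # pretend relkind values coming from a Seq Scan

buckets = defaultdict(int)
for relkind in rows:                      # phase 1: consume *all* input rows
    buckets[relkind] += 1                 # the aggregate here is count(*)

for relkind, count in buckets.items():    # phase 2: emit one row per key
    print(relkind, count)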
If you understand this, you probably already see a potential problem: what happens when there are millions of rows and the hash is too large to fit in memory? Here work_mem comes into play again. If the generated hash is too large, it "spills" to disk (again into $PGDATA/base/pgsql_tmp).
This means that if a plan contains both a HashAggregate and a Sort, we can use up to 2 * work_mem. Such a plan is easy to obtain:
$ explain analyze select relkind, count(*) from pg_Class group by relkind order by relkind;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Sort (cost=12.46..12.47 rows=4 width=1) (actual time=0.260..0.261 rows=5 loops=1)
Sort Key: relkind
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=12.38..12.42 rows=4 width=1) (actual time=0.221..0.222 rows=5 loops=1)
Total runtime: 0.312 ms
(6 rows)
In reality, a single query can use work_mem many times over, because work_mem is a per-operation limit. So, if a query uses 1000 HashAggregates and Sorts (and other operations that use work_mem), its total memory consumption can get very high.
Hash Join / Hash
Since we have just discussed HashAggregate, it is logical to move on to Hash Join.
This operation, unlike the previous ones, has two sub-operations. One of them is always a "Hash", and the other is something else.
As the name implies, Hash Join is used to join two recordsets. For example, like this:
$ explain analyze select * from pg_class c join pg_namespace n on c.relnamespace = n.oid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.14..16.07 rows=292 width=316) (actual time=0.036..0.343 rows=295 loops=1)
Hash Cond: (c.relnamespace = n.oid)
-> Seq Scan on pg_class c (cost=0.00..10.92 rows=292 width=203) (actual time=0.007..0.044 rows=295 loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=117) (actual time=0.012..0.012 rows=6 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on pg_namespace n (cost=0.00..1.06 rows=6 width=117) (actual time=0.004..0.005 rows=6 loops=1)
Total runtime: 0.462 ms
(7 rows)
It works as follows: first, Hash Join calls "Hash", which in turn calls something else (in our case, the Seq Scan on pg_namespace). Hash then builds, in memory (or on disk, depending on size), a hash/associative array/dictionary with the rows from that source, keyed by the value used to join the data (in our case, the oid column of pg_namespace).
Of course, there can be many rows for a given join key (not in this case, since I am joining on a primary key, but in general it is quite likely that you will have multiple rows for one hash key).
In Perl notation, the output of Hash would look something like this:
{
'123' => [ { data for row with OID = 123 }, ],
'256' => [ { data for row with OID = 256 }, ],
...
}
Then Hash Join runs the second sub-operation (the Seq Scan on pg_class in our case) and for each of its rows does the following:
- Checks whether the join key (pg_class.relnamespace in this case) exists in the hash returned by the Hash operation.
- If it does not, the row from the sub-operation is ignored (not returned).
- If it does, Hash Join takes the matching rows from the hash and, using the current row on the one hand and all the matching rows from the hash on the other, generates the output rows.
It is important to note that both sides of the join are executed only once (in our case both are seq scans), but the side invoked by the Hash operation must return all of its rows, which are stored in the hash, while the other side is processed row by row, and some of its rows are skipped if they have no match in the hash of the first side (I hope this sentence is clear despite the abundance of hashes).
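Here is the whole build-and-probe cycle as a Python sketch (illustrative only; the tables and oids are made up): one side is read completely into a dictionary, the other is probed row by row.

from collections import defaultdict

pg_namespace = [(11, "pg_catalog"), (2200, "public")]        # (oid, nspname)
pg_class = [("pg_class", 11), ("t1", 2200), ("t2", 2200)]    # (relname, relnamespace)

hashed = defaultdict(list)
for oid, nspname in pg_namespace:         # the Hash side: read *all* rows once
    hashed[oid].append(nspname)           # a key may hold several rows in general

for relname, relnamespace in pg_class:    # the other side: probe row by row
    for nspname in hashed.get(relnamespace, []):
        print(relname, nspname)           # rows whose key is absent from the hash are skipped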
Since both sub-operations can be operations of any type, they can be filters, index scans, or whatever else you can imagine.
The last thing worth mentioning about Hash Join/Hash is that the Hash operation, just like Sort and HashAggregate, will use up to work_mem of memory.
Nested Loop
Since we are talking about joins, we should also discuss Nested Loop. Example:
$ explain analyze select a.* from pg_class c join pg_attribute a on c.oid = a.attrelid where c.relname in ( 'pg_class', 'pg_namespace' );
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.28..52.32 rows=16 width=203) (actual time=0.057..0.134 rows=46 loops=1)
-> Seq Scan on pg_class c (cost=0.00..11.65 rows=2 width=4) (actual time=0.043..0.080 rows=2 loops=1)
Filter: (relname = ANY ('{pg_class,pg_namespace}'::name[]))
Rows Removed by Filter: 291
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..20.25 rows=8 width=203) (actual time=0.007..0.015 rows=23 loops=2)
Index Cond: (attrelid = c.oid)
Total runtime: 0.182 ms
This is a very interesting plan because it can run a given sub-operation repeatedly.
Like Hash Join, Nested Loop has two "descendants". First it runs "Seq Scan" (in our example, it runs the first node), and then, for each row it returns (only 2 rows in our example), it runs the second operation (the Index Scan on pg_attribute in this case).
You might notice that the Index Scan has "loops=2" in its actual run metadata. This means that the operation was run twice, and the other values (rows, time) are averages over all the runs.
Now let's look at this plan from explain.depesz.com. Note that the actual run time of each index scan there is only 0.002–0.003 ms, yet the total time spent in that node is 78.852 ms, because the index scan was executed over 26,000 times.
So the processing looks as follows:
- Nested Loop runs one side of the join once. Let's call it "A".
- For each row of "A", it runs the second operation (let's call it "B").
- If "B" returns no rows, the current row of "A" is ignored.
- If "B" returns rows, then for each of them Nested Loop returns a new row based on the current rows of A and B.
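In Python this loop might look like the following sketch (illustrative only; the data is made up): the inner lookup runs once per outer row, which is exactly what loops=2 reported above.

pg_class = [(1259, "pg_class"), (2615, "pg_namespace")]                 # side "A": run once
pg_attribute = {1259: ["relname", "relnamespace"], 2615: ["nspname"]}   # oid -> attribute names

for oid, relname in pg_class:                     # 2 rows, so the inner side runs twice
    for attname in pg_attribute.get(oid, []):     # side "B": one run per "A" row
        print(relname, attname)                   # one output row per matching (A, B) pair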
Merge Join
Another method of joining data is called Merge Join. It is used when the datasets being joined are sorted (or can be sorted at a small cost) by the join key.
I don't have a good example at hand, so I will create one artificially using subqueries that sort the data before joining:
$ explain analyze select * from
( select oid, * from pg_class order by oid) as c
join
( select * from pg_attribute a order by attrelid) as a
on c.oid = a.attrelid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
Merge Cond: (pg_class.oid = a.attrelid)
-> Sort (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
Sort Key: pg_class.oid
Sort Method: quicksort Memory: 102kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
-> Materialize (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
Total runtime: 4.009 ms
(9 rows)
Merge Join, like the other joins, runs two sub-operations (a Sort and a Materialize in this case). Since both return their data sorted, and the sort order is the same as that of the join operation, Pg can scan both datasets returned by the sub-operations simultaneously and simply check whether the identifiers match.
The procedure is as follows:
- if the join column on the right equals the join column on the left:
  - return a new joined row based on the current rows on the right and left;
  - get the next row on the right (or on the left, if there are no more rows on the right);
  - go back to step 1;
- if the join column on the right is "less than" the join column on the left:
  - get the next row on the right (if there are no more rows, finish processing);
  - go back to step 1;
- if the join column on the right is "greater than" the join column on the left:
  - get the next row on the left (if there are no more rows, finish processing);
  - go back to step 1.
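This walk over two sorted inputs is easy to sketch in Python (illustrative only; to keep it short, the inputs have no duplicate keys, while the real operation also has to handle those):

left = [(1, "a"), (3, "b"), (5, "c")]     # sorted by the join key
right = [(1, "x"), (2, "y"), (3, "z")]    # sorted by the join key

i = j = 0
while i < len(left) and j < len(right):
    if left[i][0] == right[j][0]:
        print(left[i], right[j])          # keys match: emit a joined row
        j += 1                            # take the next row on the right
    elif right[j][0] < left[i][0]:
        j += 1                            # right key is smaller: advance the right side
    else:
        i += 1                            # left key is smaller: advance the left side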
This is a very cool way of joining datasets, but it only works on sorted sources. Based on the current database of explain.depesz.com, there are:
- 44,721 plans containing a "Nested Loop";
- 34,305 plans with a "Hash Join";
- only 8,889 plans that use a "Merge Join".
Hash Join / Nested Loop / Merge Join modifiers
In all the examples above I have shown that a join operation returns a row only when it gets matching rows from both sides of the join.
But that is not always the case. We can have left, right and full outer joins (LEFT/RIGHT/FULL OUTER JOIN), as well as the so-called anti-joins.
In the case of left/right joins, the operation names change to:
- Hash Left Join
- Hash Right Join
- Merge Left Join
- Nested Loop Left Join
There is no Nested Loop Right Join, because Nested Loop always starts with the left side and uses it as the basis for the loop. So a join using RIGHT JOIN that is to be executed as a Nested Loop is internally transformed into a LEFT JOIN so that the Nested Loop operation can work.
In all these cases the logic is simple: we have two sides of a join – left and right. And when a side is mentioned in the operation name, the join returns a new row even if the other side has no matching rows.
This happens with queries like this:
select * from a left join b on ...
(or right join).
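A quick Python sketch of the left-join behavior (illustrative only; the data is made up): every left row survives, with a placeholder standing in for missing right-side data.

left = [(1, "a"), (2, "b")]
right = {1: "x"}                          # the right side keyed by the join column

for key, val in left:
    print(key, val, right.get(key))       # (2, "b", None) is still returned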
Everything else works the same as for Hash Join/Merge Join and Nested Loop; there is only a small change in the logic of producing the output rows.
There is also a Full Join version with the following operation names:
- Hash Full Join
- Merge Full Join
In this case, the join generates a new output row regardless of whether matching data is missing on either side (as long as data exists on at least one side). This is the case in:
select * from a full join b ...
All processing is the same as in the previous examples.
In addition, there are the so-called Anti Joins. Their operation names look like this:
- Hash Anti Join
- Merge Anti Join
- Nested Loop Anti Join
In these cases, the join emits a row only when no matching rows are found on the right side. This is useful when you do something like "WHERE NOT EXISTS ()" or "LEFT JOIN ... WHERE right_table.column IS NULL".
As in this example:
$ explain analyze select * from pg_class c where not exists (select * from pg_attribute a where a.attrelid = c.oid and a.attnum = 10);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=62.27..78.66 rows=250 width=203) (actual time=0.145..0.448 rows=251 loops=1)
Hash Cond: (c.oid = a.attrelid)
-> Seq Scan on pg_class c (cost=0.00..10.92 rows=292 width=207) (actual time=0.009..0.195 rows=293 loops=1)
-> Hash (cost=61.75..61.75 rows=42 width=4) (actual time=0.123..0.123 rows=42 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Index Only Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..61.75 rows=42 width=4) (actual time=0.021..0.109 rows=42 loops=1)
Index Cond: (attnum = 10)
Heap Fetches: 0
Total runtime: 0.521 ms
(9 rows)
Here Pg executed the right side (the Index Only Scan on pg_attribute), hashed it, and then ran the left side (the Seq Scan on pg_class), returning only those rows for which there was no entry in the Hash for the given pg_class.oid.
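The probing logic differs from a plain hash join in exactly one line, as this Python sketch shows (illustrative only; the oids are made up): a left row is emitted only when its key is absent from the hash.

hashed_attrelids = {1259, 2615}                      # keys from the hashed right side
pg_class = [(1259, "pg_class"), (9999, "no_attrs")]

for oid, relname in pg_class:                        # the left side, scanned row by row
    if oid not in hashed_attrelids:                  # emit only when there is *no* match
        print(relname)                               # -> only "no_attrs" is returned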
Materialize
This operation has already been demonstrated in the example for Merge Join, but it can also be useful in other cases.
psql has a variety of internal commands. One of them is \dTS, which lists all system data types. Internally, \dTS runs this query:
SELECT n.nspname as "Schema",
pg_catalog.format_type(t.oid, NULL) AS "Name",
pg_catalog.obj_description(t.oid, 'pg_type') as "Description"
FROM pg_catalog.pg_type t
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = t.typnamespace
WHERE (t.typrelid = 0 OR (SELECT c.relkind = 'c' FROM pg_catalog.pg_class c WHERE c.oid = t.typrelid))
AND NOT EXISTS(SELECT 1 FROM pg_catalog.pg_type el WHERE el.oid = t.typelem AND el.typarray = t.oid)
AND pg_catalog.pg_type_is_visible(t.oid)
ORDER BY 1, 2;
Its plan looks like this:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2783.00..2783.16 rows=65 width=68) (actual time=3.883..3.888 rows=87 loops=1)
Sort Key: n.nspname, (format_type(t.oid, NULL::integer))
Sort Method: quicksort Memory: 39kB
-> Nested Loop Left Join (cost=16.32..2781.04 rows=65 width=68) (actual time=0.601..3.657 rows=87 loops=1)
Join Filter: (n.oid = t.typnamespace)
Rows Removed by Join Filter: 435
-> Hash Anti Join (cost=16.32..2757.70 rows=65 width=8) (actual time=0.264..0.981 rows=87 loops=1)
Hash Cond: ((t.typelem = el.oid) AND (t.oid = el.typarray))
-> Seq Scan on pg_type t (cost=0.00..2740.26 rows=81 width=12) (actual time=0.012..0.662 rows=157 loops=1)
Filter: (pg_type_is_visible(oid) AND ((typrelid = 0::oid) OR (SubPlan 1)))
Rows Removed by Filter: 185
SubPlan 1
-> Index Scan using pg_class_oid_index on pg_class c (cost=0.15..8.17 rows=1 width=1) (actual time=0.002..0.002 rows=1 loops=98)
Index Cond: (oid = t.typrelid)
-> Hash (cost=11.33..11.33 rows=333 width=8) (actual time=0.241..0.241 rows=342 loops=1)
-> Seq Scan on pg_type el (cost=0.00..11.33 rows=333 width=8) (actual time=0.002..0.130 rows=342 loops=1)
-> Materialize (cost=0.00..1.09 rows=6 width=68) (actual time=0.000..0.001 rows=6 loops=87)
-> Seq Scan on pg_namespace n (cost=0.00..1.06 rows=6 width=68) (actual time=0.002..0.003 rows=6 loops=1)
Total runtime: 3.959 ms
For easier viewing, I have also uploaded this plan to explain.depesz.com.
Notice that operation #9 is a Materialize. Why?
Materialize is called by operation #2, the Nested Loop Left Join. We know that Nested Loop causes the corresponding sub-operation to be executed multiple times, 87 times in this case.
The right side of the join is a Seq Scan on pg_namespace. So, theoretically, Postgres would have to perform a sequential scan of pg_namespace 87 times. Given that a single sequential scan of this table takes 0.003 ms, we could expect the total time to be ~0.25 ms.
But Postgres is smarter than that. It realizes that it is cheaper to scan the table once and build an in-memory image of all its rows. Then, the next time, it does not need to scan the table again, check visibility information, or parse data pages. It just takes the data from memory.
Thanks to this, the total time for everything (reading the table once, preparing the in-memory image of the data, and scanning that image 87 times) was only 0.087 ms.
You might say: "OK, but why did the merge join earlier use Materialize? It was doing just one scan." Let's recall the plan:
$ explain analyze select * from
( select oid, * from pg_class order by oid) as c
join
( select * from pg_attribute a order by attrelid) as a
on c.oid = a.attrelid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=23.16..268.01 rows=2273 width=410) (actual time=0.658..3.779 rows=2274 loops=1)
Merge Cond: (pg_class.oid = a.attrelid)
-> Sort (cost=22.88..23.61 rows=292 width=207) (actual time=0.624..0.655 rows=293 loops=1)
Sort Key: pg_class.oid
Sort Method: quicksort Memory: 102kB
-> Seq Scan on pg_class (cost=0.00..10.92 rows=292 width=207) (actual time=0.011..0.211 rows=293 loops=1)
-> Materialize (cost=0.28..212.34 rows=2273 width=203) (actual time=0.028..1.264 rows=2274 loops=1)
-> Index Scan using pg_attribute_relid_attnum_index on pg_attribute a (cost=0.28..183.92 rows=2273 width=203) (actual time=0.015..0.752 rows=2274 loops=1)
Total runtime: 4.009 ms
(9 rows)
Yes, it was run only once. The problem is that a data source for Merge Join has to meet several criteria. Some are obvious (the data must be sorted), while others are less obvious, being more technical (the data must be scannable forwards and backwards).
Because of these not-so-obvious criteria, Postgres sometimes has to apply a Materialize to the data coming from a source (in our case, from the Index Scan), so that it has all the necessary capabilities when it comes to using the data.
In short, Materialize takes the data from its sub-operation and stores it in memory (or partially in memory) so that it can be used faster, or adds additional properties that the underlying operation does not provide.
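A Python sketch of the caching behavior (illustrative only, not Postgres internals): the expensive scan runs once, and the 87 subsequent passes just reread memory.

def seq_scan_pg_namespace():
    print("scanning the table")                    # the expensive part: pages, visibility checks
    return [(11, "pg_catalog"), (2200, "public")]

materialized = None
for _ in range(87):                                # the Nested Loop's 87 iterations
    if materialized is None:
        materialized = seq_scan_pg_namespace()     # "scanning the table" prints only once
    for row in materialized:                       # later passes just read the cached rows
        pass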
That's it for today. I thought I would wrap the topic up, but there are still many operations left to describe, so this series will have at least two more posts (the remaining operations, and statistics information).