Simple Postgres query has high cost to use index
I have a simple Postgres query that has an incredibly small startup cost with set enable_seqscan = off; but an incredibly high one with the setting on. The query is as follows:
select b.* from body as b order by b.orbital_eccentricity desc;
The explain is as follows:
set enable_seqscan = off;
explain select b.* from body as b order by b.orbital_eccentricity desc;
----
Index Scan Backward using body_orbital_eccentricity_idx on body b (cost=0.57..1912033501.86 rows=486411520 width=110)
====
set enable_seqscan = on;
explain select b.* from body as b order by b.orbital_eccentricity desc;
----
Gather Merge  (cost=97808148.33..145101459.15 rows=405342934 width=110)
  Workers Planned: 2
  ->  Sort  (cost=97807148.31..98313826.97 rows=202671467 width=110)
        Sort Key: orbital_eccentricity DESC
        ->  Parallel Seq Scan on body b  (cost=0.00..22738694.67 rows=202671467 width=110)
In this case, the seq scan plan's cost of ~98 million to ~145 million lies in between the index plan's cost of 0.57 to ~1.91 billion.
My questions are as follows:
- Why does Postgres choose to use a sequential scan when not given the hint not to do so?
- Why does the index-utilizing query have a cost ceiling that is so high? (1.91 billion)
table DDL:
-- public.body definition
CREATE TABLE public.body (
id64 int8 NOT NULL,
body_id int8 NULL,
"name" varchar NULL,
type_field varchar NULL,
orbital_period float8 NULL,
semi_major_axis float8 NULL,
orbital_eccentricity float8 NULL,
arg_of_periapsis float8 NULL,
mean_anomaly float8 NULL,
ascending_node float8 NULL,
update_time timestamp NULL,
system_id64 int8 NOT NULL,
CONSTRAINT body_pk PRIMARY KEY (id64)
);
CREATE INDEX body_orbital_eccentricity_idx ON public.body USING btree (orbital_eccentricity);
-- public.body foreign keys
ALTER TABLE public.body ADD CONSTRAINT body_system_fk FOREIGN KEY (system_id64) REFERENCES public."system"(id64);
asked Apr 1 at 19:51 by tuskiomi
2 Answers
Why does Postgres choose to use a sequential scan when not given the hint not to do so?
Because it thinks that that is the fastest way to compute the query result.
Why does the index-utilizing query have a cost ceiling that is so high? (1.91 billion)
Because it is an expensive operation.
To understand what an index scan of the whole table means, imagine a library with lots of books and a catalog. The library is the table, the books are the rows, and the catalog is the index. Now you are tasked with getting all the books in alphabetical author order. Incidentally, that is just how the catalog is ordered. You get the first index card from the catalog, figure out in which room and on which shelf the book sits, walk over and get the book. Then you repeat that procedure for every index card in the catalog. That is a pretty time-consuming procedure!
Compare that to the parallel plan with the sequential scan. In that case, you get two friends to help you. You divide the library rooms among yourself and your friends. Each one takes all the books from the shelves in his or her respective rooms as fast as possible, forming a pile on the floor. Then each sets about the time-consuming task of sorting his or her pile of books. Finally, your two friends carry their sorted books over to you in order, and you merge the three sorted stacks into a single one.
Which of the above scenarios is faster? I don't know, but both will take a long time. You see that reflected in the respective costs: both are high. PostgreSQL decides that the second strategy will be faster, probably because of the two friends that help, which reduces the piles to sort considerably. The first strategy does not lend itself to parallelism, because if the result should be sorted, the workers would end up waiting for each other a lot of the time.
I guess that your real question is how to make the query fast. Well, it is never going to be fast, and we have to settle for the fastest slow execution. So you should measure which of the strategies is actually faster. For that, use EXPLAIN (ANALYZE, BUFFERS, SETTINGS) with both queries.
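A minimal session for that comparison might look like this (keep in mind that ANALYZE actually runs the query, so each of these will take a long time on half a billion rows):

-- time the backward index scan
set enable_seqscan = off;
explain (analyze, buffers, settings)
select b.* from body as b order by b.orbital_eccentricity desc;

-- time the parallel seq scan + sort
set enable_seqscan = on;
explain (analyze, buffers, settings)
select b.* from body as b order by b.orbital_eccentricity desc;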
If the index scan actually turns out to be faster, it could be that you didn't tell PostgreSQL enough about your hardware. For example, PostgreSQL assumes that your database will be on spinning disks, where fetching individual books is particularly expensive. If you reduce random_page_cost to a value closer to 1, which more accurately reflects the characteristics of an SSD, the index scan will become cheaper, and PostgreSQL may prefer it.
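For example (1.1 is a common starting point for SSDs, not a measured figure; tune it to your storage):

-- make random page fetches nearly as cheap as sequential ones
set random_page_cost = 1.1;
explain select b.* from body as b order by b.orbital_eccentricity desc;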
Another question is whether you want to optimize for response time or for throughput. Parallel query is good for the former, but bad for the latter, because of the overhead incurred by a parallel query. If you optimize for throughput and set max_parallel_workers_per_gather to 0, the plan using the index scan will also become more attractive.
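For example, session-locally:

-- disable parallel query, then re-check which plan the planner picks
set max_parallel_workers_per_gather = 0;
explain select b.* from body as b order by b.orbital_eccentricity desc;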
Why does Postgres choose to use a sequential scan when not given the hint not to do so?
You asked for the whole table (no where clause). Reading it through the index just means an additional step, so Postgres skips it.
Why does the index-utilizing query have a cost ceiling that is so high? (1.91 billion)
That's the difference between reading the table directly and having to hop through the index, which points at arbitrarily scattered parts of the table. It's also not an apples-to-apples comparison, since you allowed parallel workers and only the seq scan plan decided to use them, even though parallel index scans are also available - the reasons why that's the case are a whole other topic, worth a separate thread. Here's the doc on parallel plans.
It's apparently cheaper to read the whole table and sort it than to use the order previously established by the index to read it already sorted. Note that the table may be bloated, messy and out of order, and the same goes for the index if it's not maintained carefully. The index scan has to deal with both messes, and they both slow it down. The seq scan cares little about the state of the table, and not at all about the index.
If you make that a covering index, freshly vacuum analyze and reindex it, or even cluster the table to make the physical write order follow the index, those plan estimates could change:
-- refresh statistics and clean up dead tuples
vacuum analyze public.body;
-- rebuild the table's indexes to shed bloat
reindex table public.body;
-- rewrite the table in index order (takes an exclusive lock!)
cluster verbose public.body using body_orbital_eccentricity_idx;
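For the covering-index idea, a sketch could look like the following (PostgreSQL 11+; the index name is made up). Since the query selects every column, the INCLUDE list has to carry all of them, which makes the index large - whether that trade-off pays off is something to measure, not assume:

-- hypothetical covering index: orbital_eccentricity stays the btree key,
-- the other columns ride along in the leaf pages, enabling index-only
-- scans once the table is well vacuumed
create index body_orbital_eccentricity_cov_idx
  on public.body using btree (orbital_eccentricity)
  include (id64, body_id, "name", type_field, orbital_period,
           semi_major_axis, arg_of_periapsis, mean_anomaly,
           ascending_node, update_time, system_id64);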
Your underlying hardware and config can affect this too. The index scan has to read two things off the disk: the index sequentially and the table randomly - this could be what's dragging it down. The seq scan has one thing less to read, just the table, but then it sorts it in memory - if there's not enough of it, the sorting process will need to spill to disk, writing some chunks and reading them back later. Your storage/memory type and config dictate how well each of these operations is handled.
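If EXPLAIN (ANALYZE, BUFFERS) shows the sort spilling (Sort Method: external merge, Disk: ...), raising work_mem is the relevant knob; the value below is only a placeholder to adjust for your hardware, and note that each parallel worker gets its own work_mem for its share of the sort:

-- allow larger in-memory sorts for this session
set work_mem = '256MB';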
advantageous to nearly every application to start returning results eagerly
You might want to open a thread for that, too. Throwing the whole 400M-row table at the client hoping it'll satisfy some future request ahead of time almost certainly costs you more on both sides than exchanging pages, especially since most clients cannot/do not start processing until they finish receiving the whole result set you requested - pages are complete, standalone sets, so paging lets you work around that.
I can recommend @Adrian Klaver's example (limit..offset, cursor, keyset), and here's another one with some more (limit..offset, cursor, keyset, clustered tid scan, bookmarks).
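As an illustration of the keyset variant (a sketch only: the page size, the id64 tie-breaker and the literal values are inventions of this example, and orbital_eccentricity is nullable, so NULL rows need explicit handling):

-- first page
select b.* from body as b
order by b.orbital_eccentricity desc nulls last, b.id64 desc
limit 100;

-- next page: feed back the last row's (orbital_eccentricity, id64);
-- the literals below are placeholders. Rows with a NULL eccentricity
-- fall out of the row comparison and would need a separate query.
select b.* from body as b
where (b.orbital_eccentricity, b.id64) < (0.9871, 123456789)
order by b.orbital_eccentricity desc nulls last, b.id64 desc
limit 100;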
Comments:
1) select b.* from body as b order by b.orbital_eccentricity desc, why would it use an index? You are asking to return everything from the table; a sequential scan would be the best way to do that. 2) You should use EXPLAIN ANALYZE to see the actual cost/time. – Adrian Klaver Commented Apr 1 at 20:01
… OPTION (FAST 100) hint. – Charlieface Commented Apr 2 at 1:11