Simple Postgres query has high cost to use index

I have a simple Postgres query whose plan shows an incredibly small startup cost with set enable_seqscan = off;, but an incredibly high startup cost with the setting on. The query is as follows: select b.* from body as b order by b.orbital_eccentricity desc;

The explain is as follows:

set enable_seqscan = off;
explain select b.* from body as b order by b.orbital_eccentricity desc;
----
Index Scan Backward using body_orbital_eccentricity_idx on body b  (cost=0.57..1912033501.86 rows=486411520 width=110)

====

set enable_seqscan = on;
explain select b.* from body as b order by b.orbital_eccentricity desc;
----
Gather Merge  (cost=97808148.33..145101459.15 rows=405342934 width=110)
  Workers Planned: 2
  ->  Sort  (cost=97807148.31..98313826.97 rows=202671467 width=110)
        Sort Key: orbital_eccentricity DESC
        ->  Parallel Seq Scan on body b  (cost=0.00..22738694.67 rows=202671467 width=110)

In this case, the sequential-scan plan's cost of ~98 million to ~145 million lies in between the index plan's cost bounds of 0.57 and ~1.91 billion.

My questions are as follows:

  • Why does Postgres choose a sequential scan when not given the hint to avoid one?
  • Why does the index-using plan have such a high cost ceiling (~1.91 billion)?

Table DDL:

-- public.body definition

CREATE TABLE public.body (
    id64 int8 NOT NULL,
    body_id int8 NULL,
    "name" varchar NULL,
    type_field varchar NULL,
    orbital_period float8 NULL,
    semi_major_axis float8 NULL,
    orbital_eccentricity float8 NULL,
    arg_of_periapsis float8 NULL,
    mean_anomaly float8 NULL,
    ascending_node float8 NULL,
    update_time timestamp NULL,
    system_id64 int8 NOT NULL,
    CONSTRAINT body_pk PRIMARY KEY (id64)
);
CREATE INDEX body_orbital_eccentricity_idx ON public.body USING btree (orbital_eccentricity);

-- public.body foreign keys

ALTER TABLE public.body ADD CONSTRAINT body_system_fk FOREIGN KEY (system_id64) REFERENCES public."system"(id64);

asked Apr 1 at 19:51 by tuskiomi
  • selecting 400kk rows... why? – Iłya Bursov Commented Apr 1 at 19:59
  • 1) Given select b.* from body as b order by b.orbital_eccentricity desc, why would it use an index? You are asking to return everything from the table; a sequential scan would be the best way to do that. 2) You should use EXPLAIN ANALYZE to see the actual cost/time. – Adrian Klaver Commented Apr 1 at 20:01
  • Costs are also related to your configuration. Change the configuration, and the costs will change. Whether the query will be faster or slower is a whole different question. For your current query, the index is useless. – Frank Heikens Commented Apr 1 at 21:05
  • Huh? A single query in a single connection is only going to have one network transaction. It will send the query and then stream the results back; an index is not going to change that, especially if you are asking for all the data. Seems you are looking for something like pagination, which is a different process. For a good overview of that process see cybertec-postgresql/en/… – Adrian Klaver Commented Apr 1 at 23:47
  • Optimizers generally go for overall query cost, not first-row speed. Except in SQL Server, where you can specify the OPTION (FAST 100) hint. – Charlieface Commented Apr 2 at 1:11

2 Answers

Why does Postgres choose a sequential scan when not given the hint to avoid one?

Because it thinks that that is the fastest way to compute the query result.

Why does the index-using plan have such a high cost ceiling (~1.91 billion)?

Because it is an expensive operation.

To understand what an index scan of the whole table means, imagine a library with lots of books and a catalog. The library is the table, the books are the rows, and the catalog is the index. Now you are tasked with getting all the books in alphabetical author order. Incidentally, that is just how the catalog is ordered. You get the first index card from the catalog, figure out in which room and on which shelf the book sits, walk over and get the book. Then you repeat that procedure for every index card in the catalog. That is a pretty time-consuming procedure!

Compare that to the parallel plan with the sequential scan. In that case, you get two friends to help you. You divide the library rooms among yourself and your friends. Each one takes all the books from the shelves in his or her respective rooms as fast as possible, forming a pile on the floor. Then each sets about the time-consuming task of sorting his or her pile of books. Finally, your two friends carry their sorted books over to you in order, and you merge the three sorted stacks into a single one.

Which of the above scenarios is faster? I don't know, but both will take a long time. You see that reflected in the respective costs: both are high. PostgreSQL decides that the second strategy will be faster, probably because of the two friends that help, which reduces the piles to sort considerably. The first strategy does not lend itself to parallelism, because if the result is to be sorted, the workers would end up waiting for each other a lot of the time.


I guess that your real question is how to make the query fast. Well, it is never going to be fast, and we have to settle for the fastest slow execution. So you should measure which of the strategies is actually faster. For that, use EXPLAIN (ANALYZE, BUFFERS, SETTINGS) with both queries.
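For example (run per session; note that ANALYZE actually executes the query, so on ~486 million rows each run will take a while):

-- time the parallel seq scan + sort plan
set enable_seqscan = on;
explain (analyze, buffers, settings) select b.* from body as b order by b.orbital_eccentricity desc;

-- time the backward index scan plan
set enable_seqscan = off;
explain (analyze, buffers, settings) select b.* from body as b order by b.orbital_eccentricity desc;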

If the index scan actually turns out to be faster, it could be that you didn't tell PostgreSQL enough about your hardware. For example, PostgreSQL assumes that your database will be on spinning disks, where fetching individual books is particularly expensive. If you reduce random_page_cost to a value closer to 1, which more accurately reflects the characteristics of an SSD, the index scan will become cheaper, and PostgreSQL may prefer it.
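For example (1.1 is a common starting point for SSDs, not a measured value; test per session before persisting):

set random_page_cost = 1.1;  -- down from the default of 4
explain select b.* from body as b order by b.orbital_eccentricity desc;

-- persist once validated (requires superuser or equivalent privileges):
alter system set random_page_cost = 1.1;
select pg_reload_conf();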

Another question is whether you want to optimize for response time or for throughput. Parallel query is good for the former, but bad for the latter, because of the overhead incurred by a parallel query. If you optimize for throughput and set max_parallel_workers_per_gather to 0, the plan using the index scan will also become more attractive.
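A quick way to test that trade-off, assuming you can spare a session for it:

set max_parallel_workers_per_gather = 0;  -- no parallel workers for this session
explain select b.* from body as b order by b.orbital_eccentricity desc;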

Why does Postgres choose a sequential scan when not given the hint to avoid one?

You asked for the whole table (no where clause). Reading it through the index just means an additional step, so Postgres skips it.

Why does the index-using plan have such a high cost ceiling (~1.91 billion)?

That's the difference between reading the table directly and having to hop through the index pointing at arbitrarily scattered parts of the table. It's also not apples-to-apples, since you allowed parallel workers and only the seq scan plan decided to use them, even though parallel index scans are also available - why that is the case is a whole topic of its own, worth a separate thread. Here's the doc on parallel plans.

It's apparently cheaper to read the whole table and sort it than to use the order previously established by the index to read it already sorted. Note that the table may be bloated, messy and out of order, and the same goes for the index if it's not maintained carefully. The index scan has to deal with both messes, and they both slow it down. The seq scan cares little about the state of the table, and not at all about the index.

If you make that a covering index, freshly vacuum analyze and reindex it or even cluster the table to make the physical write order follow the index, those plan estimates could change:

vacuum analyze public.body;  -- reclaim dead tuples, refresh planner statistics
reindex table public.body;   -- rebuild the table's indexes to remove bloat
cluster verbose public.body using body_orbital_eccentricity_idx;  -- rewrite the table in index order (takes an exclusive lock)
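For illustration, a covering index could look like the sketch below (the index name and INCLUDE list are hypothetical; covering all of body's columns would roughly duplicate the table on disk, so in practice you'd only include the columns a narrower query actually selects):

-- extra columns in the leaf pages let an index-only scan skip the heap
create index body_orbital_eccentricity_cov_idx
    on public.body using btree (orbital_eccentricity)
    include (id64, "name", orbital_period);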

Your underlying hardware and config can affect this too. The index scan has to read two things off the disk: the index sequentially and the table randomly - this could be what's dragging it down. The seq scan has one thing less to read, just the table, but then it sorts it in memory - if there's not enough of it, the sorting process will need to spill to disk, writing some chunks and then reading them back later. Your storage/memory type and config dictate how well each of these operations is handled.
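One way to check which case you're in, with an illustrative (not recommended-as-is) memory setting:

-- EXPLAIN ANALYZE reports either "Sort Method: quicksort" (in memory)
-- or "Sort Method: external merge  Disk: ..." (spilled)
explain (analyze, buffers) select b.* from body as b order by b.orbital_eccentricity desc;

set work_mem = '256MB';  -- per sort operation, per worker; size it to your RAM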

advantageous to nearly every application to start returning results eagerly

You might want to open a thread for that, too. Throwing the whole 400-million-row table at the client, hoping it'll satisfy some future request ahead of time, almost certainly costs you more on both sides than exchanging pages, especially since most clients cannot or do not start processing until they finish receiving the whole result set you requested - pages are complete, standalone sets, so paging lets you work around that.

I can recommend @Adrian Klaver's example (limit..offset, cursor, keyset), and here's another one with some more (limit..offset, cursor, keyset, clustered tid scan, bookmarks).
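A minimal keyset sketch against this table, assuming you break ties on id64 (my choice, not from the question) and noting that orbital_eccentricity is nullable, so real code has to decide where NULLs sort; for the second query to stay on an index you'd also want a composite index on (orbital_eccentricity, id64):

-- first page
select * from public.body
order by orbital_eccentricity desc nulls last, id64 desc
limit 100;

-- next page: feed back the last row's values (placeholders shown)
select * from public.body
where (orbital_eccentricity, id64) < (0.73, 4242)
order by orbital_eccentricity desc nulls last, id64 desc
limit 100;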
