Best way to get result count before LIMIT was applied

Date: 2021-07-19 22:58:52

When paging through data that comes from a DB, you need to know how many pages there will be to render the page jump controls.

Currently I do that by running the query twice, once wrapped in a count() to determine the total results, and a second time with a limit applied to get back just the results I need for the current page.

This seems inefficient. Is there a better way to determine how many results would have been returned before LIMIT was applied?

I am using PHP and Postgres.

5 Answers

#1


106  

Pure SQL

Things have changed since 2008. You can use a window function to get the full count and the limited result in one query. (Introduced with PostgreSQL 8.4 in 2009).

SELECT foo
     , count(*) OVER() AS full_count
FROM   bar
WHERE  <some condition>
ORDER  BY <some col>
LIMIT  <pagesize>
OFFSET <offset>

Note that this can be considerably more expensive than without the total count. All rows have to be counted, and a shortcut taking just the top rows from a matching index is not possible.
Doesn't matter much with small tables, matters with big tables.

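For a quick local sanity check of the one-query pattern, the same window function runs under SQLite (3.25+), which Python ships with; the table and data below are made up for illustration:

```python
import sqlite3

# Hypothetical table standing in for "bar"; SQLite 3.25+ supports
# the same count(*) OVER () window function used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo INTEGER)")
conn.executemany("INSERT INTO bar VALUES (?)", [(i,) for i in range(23)])

# One round trip returns both the page and the full count:
# 20 rows match foo >= 3, and every returned row carries that total.
page = conn.execute(
    "SELECT foo, count(*) OVER () AS full_count"
    "  FROM bar WHERE foo >= 3 ORDER BY foo LIMIT 5 OFFSET 10"
).fetchall()
print(page)  # [(13, 20), (14, 20), (15, 20), (16, 20), (17, 20)]
```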
Corner case: when OFFSET is at least as great as the number of rows from the base query, no row is returned. Possible alternative:

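The original alternative isn't reproduced here, but one plausible shape for it is a CTE that falls back to a count-only row when the requested page comes back empty (sketched against SQLite; the table and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo INTEGER)")
conn.executemany("INSERT INTO bar VALUES (?)", [(i,) for i in range(5)])

# With OFFSET past the end, the plain window-function query returns
# no rows, so the full count is lost along with them.
empty_page = conn.execute(
    "SELECT foo, count(*) OVER () AS full_count FROM bar"
    " ORDER BY foo LIMIT 10 OFFSET 100"
).fetchall()
print(empty_page)  # []

# Fallback: wrap the counted query in a CTE and emit a count-only
# row when the page came back empty.
rows = conn.execute("""
    WITH cte AS (
        SELECT foo, count(*) OVER () AS full_count
        FROM bar ORDER BY foo LIMIT 10 OFFSET 100
    )
    SELECT foo, full_count FROM cte
    UNION ALL
    SELECT NULL, (SELECT count(*) FROM bar)
    WHERE NOT EXISTS (SELECT 1 FROM cte)
""").fetchall()
print(rows)  # [(None, 5)]
```

The caller can then read the total from the sentinel row even when the page itself is empty.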
Consider the sequence of events:

  1. WHERE clause (and JOIN conditions, but not here) filter qualifying rows from the base table(s).

    (GROUP BY and aggregate functions would go here.)

  2. Window functions are applied considering all qualifying rows (depending on the OVER clause and the frame specification of the function). The simple count(*) OVER() is based on all rows.

  3. ORDER BY

    (DISTINCT or DISTINCT ON would go here.)

  4. LIMIT / OFFSET are applied based on the established order to select rows to return.

LIMIT / OFFSET becomes increasingly inefficient with a growing number of rows in the table. Consider alternative approaches if you need better performance:

Alternatives to get final count

There are completely different approaches to get the count of affected rows (not the count before OFFSET & LIMIT were applied). Postgres has internal bookkeeping of how many rows were affected by the last SQL command. Some clients can access that information or count rows themselves (like psql).

For instance, you can retrieve the number of affected rows in plpgsql immediately after executing an SQL command with:

GET DIAGNOSTICS integer_var = ROW_COUNT;

Details in the manual.

Or you can use pg_num_rows in PHP. Or similar functions in other clients.

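The same bookkeeping surfaces in most client APIs; in Python's DB-API, for instance, `cursor.rowcount` reports how many rows the last statement touched (a minimal sketch with an illustrative table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (foo INTEGER)")
conn.executemany("INSERT INTO bar VALUES (?)", [(i,) for i in range(4)])

# rowcount is the DB-API analogue of ROW_COUNT / pg_num_rows:
# the number of rows affected by the last statement.
cur = conn.execute("UPDATE bar SET foo = foo + 1 WHERE foo >= 2")
print(cur.rowcount)  # 2
```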
#2


4  

As I describe on my blog, MySQL has a feature called SQL_CALC_FOUND_ROWS. This removes the need to do the query twice, but it still needs to do the query in its entirety, even if the LIMIT clause would have allowed it to stop early.

As far as I know, there is no similar feature for PostgreSQL. One thing to watch out for when doing pagination (the most common thing LIMIT is used for, IMHO): doing an "OFFSET 1000 LIMIT 10" means that the DB has to fetch at least 1010 rows, even if it only gives you 10. A more performant way is to remember the value of the column you are ordering by for the last row of the previous page (the 1000th in this case) and rewrite the query like this: "... WHERE order_row > value_of_1000_th LIMIT 10". The advantage is that "order_row" is most probably indexed (if not, you've got a problem). The disadvantage is that if new elements are added between page views, this can get a little out of sync (but then again, it may not be observable by visitors and can be a big performance gain).

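The keyset ("seek") rewrite described above can be sketched like this, using SQLite and a made-up table for the sake of a runnable example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item-{i}",) for i in range(1, 26)])

def fetch_page(last_id, page_size=10):
    """Seek past the last-seen id instead of paying for OFFSET."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = fetch_page(last_id=0)
page2 = fetch_page(last_id=page1[-1][0])  # remember the last id shown
print(page2[0])  # (11, 'item-11')
```

Unlike OFFSET, the cost of the index seek stays flat as the page number grows, which is the gain the answer describes.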
#3


0  

You could mitigate the performance penalty by not running the COUNT() query every time. Cache the number of pages for, say, five minutes before the query is run again. Unless you're seeing a huge number of INSERTs, that should work just fine.

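One way to sketch that cache (the names and TTL are illustrative, not from the answer):

```python
import time

_count_cache = {}   # query key -> (fetched_at, count)

def cached_count(key, run_count_query, ttl=300):
    """Return the row count for `key`, re-running the COUNT() query
    only when the cached value is older than `ttl` seconds."""
    entry = _count_cache.get(key)
    now = time.time()
    if entry is None or now - entry[0] > ttl:
        entry = (now, run_count_query())
        _count_cache[key] = entry
    return entry[1]
```

Any INSERTs that land inside the TTL window simply aren't reflected until the entry expires, which is exactly the trade-off the answer accepts.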
#4


0  

Since Postgres already does a certain amount of caching, this type of method isn't as inefficient as it seems. It's definitely not doubling execution time. We have timers built into our DB layer, so I have seen the evidence.

#5


-1  

Seeing as you need to know the count for paging purposes, I'd suggest running the full query once, writing the data to disk as a server-side cache, then feeding that through your paging mechanism.

If you're running the COUNT query for the purpose of deciding whether to provide the data to the user or not (i.e. if there are > X records, give back an error), you need to stick with the COUNT approach.
