How to read all rows from a huge table?

Date: 2022-12-24 09:19:33

I have a problem with processing all rows from a database (PostgreSQL). I get an error: org.postgresql.util.PSQLException: Ran out of memory retrieving query results. I think I need to read the rows in small pieces, but my attempt doesn't work - it only reads 100 rows (code below). How can I do that?

    int i = 0;      
    Statement s = connection.createStatement();
    s.setMaxRows(100); // because of: org.postgresql.util.PSQLException: Ran out of memory retrieving query results.
    ResultSet rs = s.executeQuery("select * from " + tabName);      
    for (;;) {
        while (rs.next()) {
            i++;
            // do something...
        }
        if ((s.getMoreResults() == false) && (s.getUpdateCount() == -1)) {
            break;
        }           
    }

6 Answers

#1


35  

Use a CURSOR in PostgreSQL or let the JDBC driver handle this for you.

LIMIT and OFFSET will get slow when handling large datasets.

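For reference, here is a minimal sketch of the explicit CURSOR approach through plain JDBC. The cursor name, table name, and batch size below are placeholders, and the snippet assumes you can turn autocommit off, since a non-holdable cursor only lives inside a transaction:

    // Sketch: explicit server-side cursor, fetched in batches of 100 rows.
    // "mycursor" and "mytable" are placeholder names.
    conn.setAutoCommit(false); // a cursor needs an open transaction
    try (Statement st = conn.createStatement()) {
        st.execute("DECLARE mycursor CURSOR FOR SELECT * FROM mytable");
        boolean more = true;
        while (more) {
            try (ResultSet rs = st.executeQuery("FETCH FORWARD 100 FROM mycursor")) {
                more = false;
                while (rs.next()) {
                    more = true;
                    // process the current row...
                }
            }
        }
        st.execute("CLOSE mycursor");
    }
    conn.commit();

The answer below shows the simpler route of letting the driver manage the cursor for you via setFetchSize.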

#2


57  

The short version is: call stmt.setFetchSize(50); and conn.setAutoCommit(false); to avoid reading the entire ResultSet into memory.

Here's what the docs say:

Getting results based on a cursor

By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.

Note:

  • Cursor based ResultSets cannot be used in all situations. There are a number of restrictions which will make the driver silently fall back to fetching the whole ResultSet at once.

  • The connection to the server must be using the V3 protocol. This is the default for (and is only supported by) server versions 7.4 and later.

  • The Connection must not be in autocommit mode. The backend closes cursors at the end of transactions, so in autocommit mode the backend will have closed the cursor before anything can be fetched from it.

  • The Statement must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY. This is the default, so no code will need to be rewritten to take advantage of this, but it also means that you cannot scroll backwards or otherwise jump around in the ResultSet.

  • The query given must be a single statement, not multiple statements strung together with semicolons.

Example 5.2. Setting fetch size to turn cursors on and off.

Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();

// Turn use of the cursor on.
st.setFetchSize(50);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("a row was returned.");
}
rs.close();

// Turn the cursor off.
st.setFetchSize(0);
rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("many rows were returned.");
}
rs.close();

// Close the statement.
st.close();

#3


2  

I think your question is similar to this thread: JDBC Pagination, which contains solutions for your needs.

In particular, for PostgreSQL, you can use the LIMIT and OFFSET keywords in your request: http://www.petefreitag.com/item/451.cfm

PS: In Java code, I suggest you use PreparedStatement instead of simple Statements: http://download.oracle.com/javase/tutorial/jdbc/basics/prepared.html

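To illustrate both suggestions together, here is a rough sketch of LIMIT/OFFSET paging done with a PreparedStatement. The table name, ordering column, and page size are made up, and as noted above, large OFFSET values get slow on big tables because the skipped rows are still scanned:

    // Sketch: LIMIT/OFFSET paging with a PreparedStatement ("mytable" and "id" are placeholders).
    PreparedStatement ps = c.prepareStatement("select * from mytable order by id limit ? offset ?");
    int pageSize = 1000;
    for (int offset = 0; ; offset += pageSize) {
        ps.setInt(1, pageSize);
        ps.setInt(2, offset);
        int rowsInPage = 0;
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rowsInPage++;
                // process the current row...
            }
        }
        if (rowsInPage < pageSize) {
            break; // a short (or empty) page means we are done
        }
    }
    ps.close();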

#4


2  

So it turns out that the crux of the problem is that, by default, Postgres starts in "autoCommit" mode, and it needs/uses cursors to be able to "page" through data (ex: read the first 10K results, then the next batch, then the next), but cursors can only exist within a transaction. So the default is to always read all rows into RAM, and only then let your program start processing "the first result row, then the second" after everything has arrived, for two reasons: the connection is not in a transaction (so cursors don't work), and a fetch size hasn't been set.

The way the psql command-line tool achieves batched responses for queries (its FETCH_COUNT setting) is to "wrap" its select queries in a short-lived transaction (if a transaction isn't already open), so that cursors can work. You can do something like that with JDBC too:

  static void readLargeQueryInChunksJdbcWay(Connection conn, String originalQuery, int fetchCount, ConsumerWithException<ResultSet, SQLException> consumer) throws SQLException {
    boolean originalAutoCommit = conn.getAutoCommit();
    if (originalAutoCommit) {
      conn.setAutoCommit(false); // start temp transaction
    }
    try (Statement statement = conn.createStatement()) {
      statement.setFetchSize(fetchCount); // rows fetched per round trip to the server
      try (ResultSet rs = statement.executeQuery(originalQuery)) {
        while (rs.next()) {
          consumer.accept(rs); // or just do your work here
        }
      }
    } finally {
      if (originalAutoCommit) {
        conn.setAutoCommit(true); // reset it, also ends (commits) temp transaction
      }
    }
  }

  @FunctionalInterface
  public interface ConsumerWithException<T, E extends Exception> {
    void accept(T t) throws E;
  }

This gives the benefit of requiring less RAM and, in my results, it seemed to run faster overall, even if you don't need to save the RAM. Weird. It also gives the benefit that your processing of the first row "starts faster" (since it processes a page at a time).

And here's how to do it the "raw postgres cursor" way, along with full demo code, though in my experiments it seemed the JDBC way, above, was slightly faster for whatever reason.

Another option would be to have autoCommit mode off, everywhere, though you still have to always manually specify a fetchSize for each new Statement (or you can set a default fetch size in the URL string).

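As a sketch of that last option: pgjdbc's defaultRowFetchSize connection parameter sets the default fetch size for every Statement created on the connection. Host, database, credentials, and table name below are placeholders, and autocommit still has to be off for cursor-based fetching to kick in:

    // Sketch: default fetch size set once in the JDBC URL via defaultRowFetchSize.
    Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/mydb?defaultRowFetchSize=1000", "user", "secret");
    conn.setAutoCommit(false); // still required: cursors only work inside a transaction
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT * FROM mytable")) {
        while (rs.next()) {
            // rows arrive from the server in batches of 1000 behind the scenes
        }
    }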

#5


0  

I did it like below. Not the best way, I think, but it works :)

    // Keyset pagination: each query asks for the next batch of rows whose id is
    // greater than the last id seen, and setMaxRows(100) caps the batch size.
    Connection c = DriverManager.getConnection("jdbc:postgresql://....");
    PreparedStatement s = c.prepareStatement("select * from " + tabName + " where id > ? order by id");
    s.setMaxRows(100);
    int lastId = 0;
    for (;;) {
        s.setInt(1, lastId);
        ResultSet rs = s.executeQuery();

        int lastIdBefore = lastId;
        while (rs.next()) {
            lastId = Integer.parseInt(rs.getObject(1).toString());
            // ...
        }
        rs.close();

        // No new rows in this batch means we have reached the end of the table.
        if (lastIdBefore == lastId) {
            break;
        }
    }

#6


0  

At least in my case, the problem was on the client that tries to fetch the results.

I wanted to get a .csv with ALL the results.

I found the solution by using

psql -U postgres -d dbname  -c "COPY (SELECT * FROM T) TO STDOUT WITH DELIMITER ','"

(where dbname is the name of the db...) and redirecting the output to a file.

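If you need the same COPY trick from Java rather than from the psql client, pgjdbc exposes it through its CopyManager API. A minimal sketch, with a placeholder output file and table name, and using CSV format instead of the bare delimiter:

    // Sketch: COPY ... TO STDOUT from Java via pgjdbc's CopyManager
    // (needs org.postgresql.PGConnection and org.postgresql.copy.CopyManager from the driver).
    CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
    try (Writer out = Files.newBufferedWriter(Paths.get("result.csv"))) {
        long rows = copyManager.copyOut("COPY (SELECT * FROM t) TO STDOUT WITH (FORMAT csv)", out);
        System.out.println(rows + " rows written");
    }

Because the rows are streamed straight to the Writer, nothing is buffered in memory on the Java side.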
