Occasionally, when connecting to my Oracle database through ROracle and dbplyr, I will run a dplyr::collect operation that fetches more data than expected, and more than R can handle.
This can make R crash and is often a sign I should have filtered or aggregated data further before fetching.
It would be great to be able to check the size of the result before choosing to fetch it or not (without running the query twice).
Let's call collect2 the variation of collect that would allow this.
Expected behavior:
small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # works fine

big_t <- con %>% tbl("big_table") %>%
  filter_group_etc %>%
  collect2(n_max = 5e6) # Error: query returned 15,486,245 rows, n_max set to 5,000,000
Would this be possible?
I'm also open to a solution using ROracle / DBI without dplyr, e.g.:
dbGetQuery2(con, my_big_sql_query, n_max = 5e6) # Error: query returned 15,486,245 rows, n_max set to 5,000,000
EDIT:
See below a partial solution posted as an answer; it's not optimal because some time is wasted fetching data I have no use for.
4 Answers
#1
1
You can actually achieve your goal in one SQL query:
Add the row count (n) as an extra column to the data, using dplyr's mutate rather than summarise, and then set n < n_limit as a filter condition. The filter is evaluated in the database (as a WHERE clause over a window function, as the generated SQL below shows). If the row count is larger than the limit, then no data are collected; otherwise all data are collected. You may wish to drop the row count column at the end.
This approach should work on most databases. I have verified this using PostgreSQL and Oracle.
copy_to(dest = con, cars, "cars")  # upload the built-in cars data for the demo
df <- tbl(con, "cars")
n_limit <- 51
df %>% mutate(n = n()) %>% filter(n < n_limit) %>% collect()
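For reuse, the same idea can be wrapped in a small helper. Here is a minimal sketch (the name collect_if_under and the zero-row caveat are my additions, not part of the original answer):

library(dplyr)

# Collect only if the query's row count is under n_limit, in a single query.
# Assumes x is a dbplyr lazy table on a backend with window functions.
collect_if_under <- function(x, n_limit) {
  out <- x %>%
    mutate(n = n()) %>%
    filter(n < n_limit) %>%
    collect()
  # Caveat: an over-limit result and a genuinely empty result both come
  # back as zero rows, so the two cases cannot be told apart here.
  select(out, -n)
}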
However, this approach does not work on SQLite. To see why, you can check the SQL statement generated by the dplyr code:
df %>% mutate(n = n()) %>% filter(n < n_limit) %>% show_query()
<SQL>
SELECT *
FROM (SELECT "speed", "dist", COUNT(*) OVER () AS "n"
      FROM "cars") "rdipjouqeu"
WHERE ("n" < 51.0)
The SQL contains a window function (count(*) over ()), which SQLite does not support (at least in versions before 3.25, which added window functions). A possible workaround follows.
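If you need this on an older SQLite, one option is to compute the count in a separate aggregated subquery and join it back onto the data, avoiding window functions entirely. A sketch, assuming a dplyr/dbplyr version recent enough to translate cross_join() (this is not part of the original answer):

counts <- df %>% summarise(n = n())  # one-row lazy table holding the count
df %>%
  cross_join(counts) %>%             # attach the count to every row
  filter(n < n_limit) %>%
  collect()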
#2
6
This doesn't get around the problem you mention in the comments about spending the resources to run the query twice, but it does seem to work (at least against my MySQL database; I don't have an Oracle database to test it against):
collect2 <- function(query, limit = 20000) {
  # Count the rows server-side first (a cheap aggregate query)
  query_nrows <- query %>%
    ungroup() %>%
    summarize(n = n()) %>%
    collect() %>%
    pull('n')

  if (query_nrows <= limit) {
    collect(query)
  } else {
    warning("Query has ", query_nrows, " rows; limit is ", limit, ". Data will not be collected.")
  }
}
I don't see any way to test the number of rows in the result of a query without actually running the query. With this method, though, you always force the row count to be computed in the database first, and you refuse to collect if it's over 20,000 (or whatever your row limit is).
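Hypothetical usage, assuming con is an open connection and big_table exists:

big_t <- con %>%
  tbl("big_table") %>%
  collect2(limit = 5e6)
# warns and returns nothing if the query has more than 5e6 rows,
# otherwise collects the data as usual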
#3
3
I'll post a partial answer. The issue with this solution is that it will fetch the first n_max rows when the query returns more than n_max rows. Fetching takes time with my configuration, so I'd prefer to avoid this step.
On the other hand it will return the error that I requested, and the query won't have to be sent twice when n_rows < n_max.
dbplyr:::collect.tbl_sql is undocumented but has n and warn_incomplete parameters that seem to do roughly what we want. Unfortunately the feature is broken (which is probably why it's undocumented and not exported), but we'll be able to leverage it with a bit of work.
# formals(dbplyr:::collect.tbl_sql)
# $x
#
#
# $...
#
#
# $n
# [1] Inf
#
# $warn_incomplete
# [1] TRUE
My personal issue is with an Oracle database, but it's easier to reproduce with an SQLite database, and I believe the solution will work with any DBMS compatible with DBI.
Initialize and create fake data
This will create a file called Test.sqlite in your working directory.
library(dplyr)
library(dbplyr)
library(RSQLite)
library(DBI)

set.seed(1)
big_iris <- sample_n(iris, 50000, replace = TRUE)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "Test.sqlite")
# Upper-case the column names and replace "." with "_" (Sepal.Length -> SEPAL_LENGTH)
DBI::dbWriteTable(con, "BIG_IRIS", rename_all(big_iris, . %>% toupper %>% sub("\\.", "_", .)), overwrite = TRUE)
rm(big_iris)
The rabbit hole
Let's imagine my R can only handle 20,000 rows before crashing.
big_iris_filtered <- con %>% tbl("BIG_IRIS") %>% filter(SEPAL_LENGTH > 5.2) %>%
  collect()
nrow(big_iris_filtered) # [1] 35041 <- that's too much

big_iris_filtered <- con %>% tbl("BIG_IRIS") %>% filter(SEPAL_LENGTH > 5.2) %>%
  collect(n = 2e4, warn_incomplete = TRUE)
nrow(big_iris_filtered) # [1] 20000
It stops at 20,000, which is a start, but the warn_incomplete parameter doesn't seem to trigger anything.
It makes sense when we look at the code for collect:
dbplyr:::collect.tbl_sql
# function (x, ..., n = Inf, warn_incomplete = TRUE)
# {
#   assert_that(length(n) == 1, n > 0L)
#   if (n == Inf) {
#     n <- -1
#   }
#   else {
#     x <- head(x, n)
#   }
#   sql <- db_sql_render(x$src$con, x)
#   out <- db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
#   grouped_df(out, intersect(op_grps(x), names(out)))
# }
head here will call the method dbplyr:::head.tbl_lazy, which adds a LIMIT clause to the SQLite code (and would probably make it crash with Oracle, as head was not supported there by default last time I checked). You can debug and check the value of sql to see it, as illustrated just below.
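To see the LIMIT appear, assuming the setup above (the exact identifier quoting depends on the backend and dbplyr version):

con %>% tbl("BIG_IRIS") %>% head(5) %>% show_query()
# <SQL>
# SELECT *
# FROM `BIG_IRIS`
# LIMIT 5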
By adding this LIMIT clause, the (truncated) result set is always fetched completely, so warn_incomplete can never trigger the warning downstream (which it is supposed to do through dbplyr:::res_warn_incomplete).
So let's just remove that part:
collect2 <- function(x, ..., n = Inf, warn_incomplete = TRUE) {
  assertthat::assert_that(length(n) == 1, n > 0L)
  sql <- dbplyr:::db_sql_render(x$src$con, x)
  out <- dbplyr:::db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete)
  grouped_df(out, intersect(dbplyr:::op_grps(x), names(out)))
}
And test:
big_iris_filtered <- con %>% tbl("BIG_IRIS") %>% filter(SEPAL_LENGTH > 5.2) %>%
  collect2(n = 2e4, warn_incomplete = FALSE)
# no warning, as expected

big_iris_filtered <- con %>% tbl("BIG_IRIS") %>% filter(SEPAL_LENGTH > 5.2) %>%
  collect2(n = 2e4, warn_incomplete = TRUE)
# Warning message:
# Only first 20,000 results retrieved. Use n = Inf to retrieve all.
dim(big_iris_filtered)
# [1] 20000     5
It works!
I need one more edit, because I wanted it to stop rather than warn; see below.
Solution
collect2 <- function(x, ..., n = Inf, warn_incomplete = TRUE) {
  assertthat::assert_that(length(n) == 1, n > 0L)
  sql <- dbplyr:::db_sql_render(x$src$con, x)
  # Promote the "incomplete results" warning to an error
  out <- withCallingHandlers(
    dbplyr:::db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete),
    warning = function(w) stop(w)
  )
  grouped_df(out, intersect(dbplyr:::op_grps(x), names(out)))
}
big_iris_filtered <- con %>% tbl("BIG_IRIS") %>% filter(SEPAL_LENGTH > 5.2) %>%
  collect2(n = 2e4, warn_incomplete = TRUE)
# Error: Only first 20,000 results retrieved. Use n = Inf to retrieve all.
Good, so that's my partial solution. It's partial because I've fetched those rows before stopping: the query has run on the server AND the result has been fetched to the client up to the nth row before the error is triggered. With my configuration fetching is slow, and I'd rather avoid fetching for nothing.
Next challenge
The next challenge would be to stop it BEFORE anything is fetched. The code of db_collect shows how it currently happens; the issue is that the result set res is consumed by dbFetch, which runs the query and fetches the rows into R in one step. A chunked workaround is sketched after the code below.
dbplyr:::db_collect.DBIConnection
# function (con, sql, n = -1, warn_incomplete = TRUE, ...)
# {
#   res <- dbSendQuery(con, sql)
#   tryCatch({
#     out <- dbFetch(res, n = n)
#     if (warn_incomplete) {
#       res_warn_incomplete(res, "n = Inf")
#     }
#   }, finally = {
#     dbClearResult(res)
#   })
#   out
# }
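One way to at least fail fast without pulling the whole result set into R is to fetch in chunks through DBI and abort as soon as the budget is exceeded, so that at most n_max rows plus one chunk ever reach the client. A sketch (dbGetQuery2 is the hypothetical name from the question; the chunk_size parameter is my addition):

library(DBI)

dbGetQuery2 <- function(con, sql, n_max = 5e6, chunk_size = 1e5) {
  res <- dbSendQuery(con, sql)
  on.exit(dbClearResult(res))  # always release the result set
  chunks <- list()
  n_seen <- 0
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = chunk_size)
    n_seen <- n_seen + nrow(chunk)
    if (n_seen > n_max) {
      stop("query returned more than ", n_max, " rows")
    }
    chunks[[length(chunks) + 1]] <- chunk
  }
  do.call(rbind, chunks)
}

The server still executes the full query, but the client stops fetching early instead of materializing everything in R.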
#4
0
So, you can't check the size of the results without running the query.
Now the question is either to cache the results server-side and test for the size, or simply to put some "insurance" on the R side so that we never receive too many rows.
In the latter case, how about simply:
small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  head(n = 5e6) %>%
  collect()
If you get exactly 5e6 rows, then you probably overflowed; we can't distinguish an overflow from a result of exactly 5e6 rows, but that seems a small price to pay for a single execution in the DB. Set 5e6 to 5000001 if you're really worried, as sketched below. (And 5000000L or 5000001L would be better options, so that the DB sees them as integers.)
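A sketch of that n_max + 1 variant with an explicit check (table and pipeline names are the placeholders from the question):

n_max <- 5e6
small_t <- con %>% tbl("small_table") %>%
  filter_group_etc %>%
  head(n = n_max + 1) %>%  # fetch one extra row as an overflow sentinel
  collect()
if (nrow(small_t) > n_max) stop("query returned more than ", n_max, " rows")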
This doesn't work so well if you're worried about a slow connection, but if you're simply worried about overflowing memory in R, it's a cheap piece of insurance that puts no extra load on the server.