如何在hsqldb中对大型表执行高效的group by / sum聚合?

时间:2021-09-01 23:02:02

In my HSQL DB I have a large (>10M rows) cached table 'Products' where each Product has a 'groupId' and 'value' column. There is also an index on the groupId colum. In another cached table 'Groups' (also large, >1M rows) I need to store all aggregated values of all Products that belong to this group. So basically my query would look like this:

在我的HSQL DB中,我有一个大的(> 10M行)缓存表'Products',其中每个Product都有一个'groupId'和'value'列。 groupId列上还有一个索引。在另一个缓存表'Groups'(也是大的,> 1M行)中,我需要存储属于该组的所有产品的所有聚合值。所以基本上我的查询看起来像这样:

MERGE INTO Groups AS g USING (
    SELECT groupId, SUM(value) AS aggrValue
    FROM Products GROUP BY groupId
) AS sub
ON g.groupId = sub.groupId
WHEN MATCHED THEN UPDATE SET g.value = sub.aggrValue

However, this does not work for large tables, because as stated in the documentation under 'Temporary Memory Use During Operations', all rows affected by the update will be held in memory during the operation. I also tried to first write the aggregated values into a helper table with this query:

但是,这不适用于大型表,因为如“操作期间临时内存使用”下的文档中所述,受更新影响的所有行将在操作期间保留在内存中。我还尝试首先使用此查询将聚合值写入帮助程序表:

INSERT INTO HelperTable (groupId, value)
SELECT groupId, SUM(value)
FROM Products
GROUP BY groupId

but this also exceeds the available memory. I have already tried limiting the hsqldb.result_max_memory_rows setting to 100_000. What would be the preferred solution to aggregate the values and write them into the Groups table ? My main requirement is that the operation runs with constant memory regardless of the number of Products and Groups.

但这也超出了可用内存。我已经尝试将hsqldb.result_max_memory_rows设置限制为100_000。聚合值并将它们写入Groups表的首选解决方案是什么?我的主要要求是,无论产品和组的数量如何,操作都以恒定的内存运行。

1 个解决方案

#1


0  

By design, HSQLDB performs some operations such as GROUP BY entirely in memory. It also stores the transaction history in memory until commit.

根据设计,HSQLDB完全在内存中执行一些诸如GROUP BY之类的操作。它还将事务历史记录存储在内存中直到提交。

You can perform the operation in chunks, based on the groupId column. You need to commit after each chunk.

您可以基于groupId列以块的形式执行操作。你需要在每个块之后提交。

You can develop the update as a delta update. When new products are added, the sums of values for these product are added to the stored sums. Or when products are removed, the sums of values are subtracted from the stored sums.

您可以将更新开发为增量更新。添加新产品时,会将这些产品的值总和添加到存储的总和中。或者,当移除产品时,从存储的总和中减去值的总和。

#1


0  

By design, HSQLDB performs some operations such as GROUP BY entirely in memory. It also stores the transaction history in memory until commit.

根据设计,HSQLDB完全在内存中执行一些诸如GROUP BY之类的操作。它还将事务历史记录存储在内存中直到提交。

You can perform the operation in chunks, based on the groupId column. You need to commit after each chunk.

您可以基于groupId列以块的形式执行操作。你需要在每个块之后提交。

You can develop the update as a delta update. When new products are added, the sums of values for these product are added to the stored sums. Or when products are removed, the sums of values are subtracted from the stored sums.

您可以将更新开发为增量更新。添加新产品时,会将这些产品的值总和添加到存储的总和中。或者,当移除产品时,从存储的总和中减去值的总和。