I have two tables where the first is very large (>50M rows):
CREATE CACHED TABLE Alldistances (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
and a second that can be also quite large (>5M rows):
CREATE CACHED TABLE tempcach (
word1 VARCHAR(70),
word2 VARCHAR(70),
distance INTEGER,
distcount INTEGER
);
Both tables have indexes:
CREATE INDEX mulalldis ON Alldistances (word1, word2, distance);
CREATE INDEX multem ON tempcach (word1, word2, distance);
In my Java program I use prepared statements to fill/pre-organize the data in the tempcach table, and then I merge that table into Alldistances with:
MERGE INTO Alldistances alld USING (
SELECT word1,
word2,
distance,
distcount FROM tempcach
) AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
The tempcach table is then dropped or truncated and filled with new data. During the merge I get an OOM (OutOfMemoryError), which I guess is because the whole table is loaded into memory during the merge. So I will have to merge in batches, but can I do that in SQL, or should I do it in my Java program? Or is there a smarter way to avoid the OOM while merging?
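For reference, the fill step is just a batched prepared-statement insert, roughly like the sketch below (simplified; the PairDistance holder and the class/method names are only illustrative, the real values come from my text processing):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class TempcachFill {

    // Illustrative value holder; the real pairs and distances come from the
    // text-processing step.
    public record PairDistance(String word1, String word2, int distance, int count) {}

    // Batched INSERT into tempcach via a prepared statement; the MERGE shown
    // above is then executed once per filled batch.
    public static void fill(Connection conn, List<PairDistance> batch) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO tempcach (word1, word2, distance, distcount) VALUES (?, ?, ?, ?)")) {
            for (PairDistance d : batch) {
                ps.setString(1, d.word1());
                ps.setString(2, d.word2());
                ps.setInt(3, d.distance());
                ps.setInt(4, d.count());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}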
2 Answers
#1
It is possible to merge in chunks (batches) in SQL. You need to
- limit the number of rows from the temp table in each chunk
- delete those same rows
- repeat
The SELECT statement in the USING clause should use an ORDER BY and a LIMIT:
SELECT word1,
       word2,
       distance,
       distcount FROM tempcach
ORDER BY <primary key or unique columns>
LIMIT 1000
) AS src (
After the merge, the DELETE statement selects the same rows to remove:
DELETE FROM tempcach WHERE <primary key or unique columns> IN
(SELECT <primary key or unique columns> FROM tempcach
 ORDER BY <primary key or unique columns> LIMIT 1000)
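Putting the pieces together, a minimal driver loop from the Java side could look like the sketch below. It assumes HSQLDB-style LIMIT syntax, a chunk size of 1000, that (word1, word2, distance) uniquely identifies a tempcach row, and that multi-column IN predicates are supported; the class and method names are just for illustration.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ChunkedMerge {

    private static final int CHUNK = 1000;

    // The USING subquery is ordered and limited so the DELETE below removes
    // exactly the rows that were just merged. Replace the ORDER BY columns
    // with your real primary key / unique columns if (word1, word2, distance)
    // is not unique in tempcach.
    private static final String MERGE_SQL =
        "MERGE INTO Alldistances alld USING ("
        + " SELECT word1, word2, distance, distcount FROM tempcach"
        + " ORDER BY word1, word2, distance LIMIT " + CHUNK
        + ") AS src (newword1, newword2, newdistance, newcount)"
        + " ON (alld.word1 = src.newword1"
        + " AND alld.word2 = src.newword2"
        + " AND alld.distance = src.newdistance)"
        + " WHEN MATCHED THEN UPDATE SET alld.distcount = alld.distcount + src.newcount"
        + " WHEN NOT MATCHED THEN INSERT (word1, word2, distance, distcount)"
        + " VALUES (newword1, newword2, newdistance, newcount)";

    private static final String DELETE_SQL =
        "DELETE FROM tempcach WHERE (word1, word2, distance) IN ("
        + " SELECT word1, word2, distance FROM tempcach"
        + " ORDER BY word1, word2, distance LIMIT " + CHUNK + ")";

    public static void mergeInChunks(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            while (true) {
                st.executeUpdate(MERGE_SQL);                // merge one chunk
                int deleted = st.executeUpdate(DELETE_SQL); // remove that chunk from the temp table
                conn.commit();                              // keep each transaction small
                if (deleted < CHUNK) {
                    break;                                  // tempcach is now empty
                }
            }
        }
    }
}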
#2
First, just because this kind of thing annoys me: why are you selecting all the columns of the temporary table in a subselect? Why not the simpler SQL:
MERGE INTO Alldistances alld USING tempcach AS src (
newword1,
newword2,
newdistance,
newcount
) ON (
alld.word1 = src.newword1
AND alld.word2 = src.newword2
AND alld.distance = src.newdistance
) WHEN MATCHED THEN
UPDATE SET alld.distcount = alld.distcount+src.newcount
WHEN NOT MATCHED THEN
INSERT (
word1,
word2,
distance,
distcount
) VALUES (
newword1,
newword2,
newdistance,
newcount
);
To keep the database from loading the whole table into memory, you need indexes on both tables:
CREATE INDEX all_data ON Alldistances (word1, word2, distance);
CREATE INDEX tempcach_data ON tempcach (word1, word2, distance);