How to load data faster with Talend and SQL Server

Posted: 2021-11-02 10:20:00

I use Talend to load data into a SQL Server database.

It appears that the weakest point of my job is not the data processing, but the actual load into my database, which goes no faster than 17 rows/sec.

The funny thing is that I can launch 5 jobs at the same time, and they will all load at 17 rows/sec.

What could explain this slowness, and how can I improve the speed?

Thanks

New information:

The transfer speed between my desktop and the server is about 1 MByte/s.

My job commits every 10,000 rows.

I use SQL Server 2008 R2.

And the schema I use for my jobs is like this:

[screenshot: Talend job layout]

7 Answers

#1 (15 votes)

Database INSERT OR UPDATE methods are incredibly costly because the database cannot batch all of the commits to do them all at once and must apply them row by row (ACID transactions force this: if an insert were attempted and failed, all of the other records in the commit would also have to fail).

Instead, for large bulk operations it is always best to determine in advance whether a record should be inserted or updated, before passing the commit to the database, and then send two separate transactions to the database.

A typical job that needs this functionality would assemble the data to be INSERT OR UPDATEd and then query the database table for the existing primary keys. If the primary key already exists then you send the record as an UPDATE, otherwise it is an INSERT. The logic for this can easily be done in a tMap component.

[screenshot: example job layout]

In this job we have some data that we wish to INSERT OR UPDATE into a database table that contains some pre-existing data:

[screenshot: pre-existing data in the target table]

And we wish to add the following data to it:

[screenshot: new data to be loaded]

The job works by throwing the new data into a tHashOutput component so it can be used multiple times in the same job (it simply keeps it in memory, or for large volumes can cache it to disk).

Following on from this, one lot of the data is read out of a tHashInput component and directly into a tMap. Another tHashInput component is utilised to run a parameterised query against the table:

[screenshots: lookup component and parameterised query configuration]

You may find this guide to Talend and parameterised queries useful. From here the returned records (so only the ones already inside the database) are used as the lookup to the tMap.

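A minimal sketch of what that lookup query might look like, assuming the target table is called target_table with primary key id (both names, and the way the parameter is supplied, are illustrative):

-- Illustrative lookup: return the primary key only when it already exists,
-- so the tMap inner join can tell UPDATEs apart from INSERTs.
-- The '?' parameter is filled in per incoming row (e.g. from globalMap).
SELECT id
FROM   target_table
WHERE  id = ?;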

This is then configured as an INNER JOIN to find the records that need to be UPDATEd, with the rejects from the INNER JOIN being the records to INSERT:

[screenshot: tMap join configuration]

These outputs then simply flow to separate tMySQLOutput components to UPDATE or INSERT as necessary. And finally, when the main subjob is complete, we commit the changes.

#2 (4 votes)

I think that @ydaetskcoR's answer is perfect from a theoretical point of view (split the rows that need an INSERT from those that need an UPDATE) and gives you a working ETL solution, useful for small datasets (a few thousand rows).

Performing the lookup to decide whether a row has to be updated or not is costly in ETL, as all the data goes back and forth between the Talend machine and the DB server.

When you get to hundreds of thousands or even millions of records you have to move from ETL to ELT: you just load your data into a temporary (staging) table, as suggested by @Balazs Gunics, and then use SQL to manipulate it.

In this case, after loading your data (INSERT only = fast, and even faster using BULK LOAD components), you issue a LEFT OUTER JOIN between the staging table and the destination one to separate the rows that are already there (and need an UPDATE) from the others.

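As an aside on the bulk-load step: a bulk component essentially boils down to a server-side BULK INSERT. This is only a rough sketch; the file path, delimiters and table name are all illustrative:

-- Rough equivalent of a bulk-load step: load a staged flat file straight
-- into the staging table. Path, delimiters and TABLOCK are assumptions.
BULK INSERT staging
FROM 'C:\loads\staging_data.csv'
WITH (
    FIELDTERMINATOR = ';',
    ROWTERMINATOR   = '\n',
    TABLOCK          -- table lock allows minimally logged bulk loads
);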

This query will give you the rows you need to insert:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NULL

This other one gives you the rows you need to update:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NOT NULL

This will be orders of magnitude faster than ETL, BUT you will need to use SQL to operate on your data, while in ETL you can use Java because ALL the data is brought to the Talend server. So a common pattern is a first step on the local machine to pre-process the data in Java (to clean and validate it), and then to fire it up to the DB, where you use joins to load it the right way.

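As a sketch of the ELT step itself (table and column names are illustrative): since SQL Server 2008 supports MERGE, the same split can also be applied in a single statement instead of a separate UPDATE and INSERT:

-- Illustrative ELT step: apply the staged rows to the destination in one pass.
-- Table and column names are assumptions; adapt them to your schema.
MERGE destination AS d
USING staging AS s
    ON d.PK = s.PK
WHEN MATCHED THEN
    UPDATE SET d.col1 = s.col1,
               d.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (PK, col1, col2)
    VALUES (s.PK, s.col1, s.col2);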

Here are the ELT job screenshots.

[screenshot: ELT job layout]

#3 (0 votes)

Based on your note that inserts are orders of magnitude faster than updates (4000 vs 17 rows/sec), it looks like you need to look at your DB indexes. Adding an index that matches your update parameters could speed up your updates significantly. Of course, this index may slow your inserts a bit.

You can also look at the query execution plan for your update query to see if it is using any indexes. How do I obtain a Query Execution Plan?

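A minimal sketch of how to get that plan in SQL Server Management Studio, without actually executing the statement (the UPDATE below is purely illustrative):

-- Ask SQL Server to return the estimated plan as XML instead of running the
-- statement; run each part as its own batch. The UPDATE is illustrative.
SET SHOWPLAN_XML ON;
GO
UPDATE target_table
SET    some_column = 'new value'
WHERE  business_key = 'abc';
GO
SET SHOWPLAN_XML OFF;
GO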

#4 (0 votes)

You should use a staging table, where you insert the rows.

Based on this staging table, you do a DELETE query with t*SQLrow.

DELETE FROM target_table
WHERE target_table.id IN (SELECT id FROM staging_table);

So the rows you wanted to update no longer exist.

INSERT INTO target_table 
SELECT * FROM staging_table;

This will move all the new/modified rows.

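A small addition, as a sketch: running the two statements inside one transaction means a failure between the DELETE and the INSERT cannot leave the target table missing rows (the statements are copied from above; the transaction wrapper is the assumption):

-- Sketch: run the delete-then-insert as a single unit of work.
BEGIN TRANSACTION;

DELETE FROM target_table
WHERE target_table.id IN (SELECT id FROM staging_table);

INSERT INTO target_table
SELECT * FROM staging_table;

COMMIT TRANSACTION;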

#5 (0 votes)

I've found where this performance problem comes from.

I do an INSERT OR UPDATE; if I replace it with a simple INSERT, the speed goes up to 4000 rows/s.

Does it seem like an acceptable pace?

Anyway, I need my INSERT OR UPDATE, so I guess I'm stuck.

#6 (0 votes)

I was having the same issue loading data into a DB2 server. I too had the commit set at 10,000, but once I selected the option to batch (on the same component options screen) performance dramatically improved. When I moved the commit and batch size to 20,000 the job went from 5 hours to under 2 minutes.

#7 (0 votes)

I had the same problem and solved it by defining an index on the target table.

Usually, the target table has an id field which is its primary key and hence indexed, so all sorts of joins against it work just fine. But the update from a flat file matches on some other data fields, so each update statement has to do a full table scan.

The above also explains why it works fast with INSERT and becomes slow with INSERT OR UPDATE.

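A minimal sketch of the fix, assuming the INSERT OR UPDATE matches existing rows on a couple of data fields (the table and column names are illustrative):

-- Hypothetical: index the data fields the update uses to match existing rows,
-- so each update becomes an index seek instead of a full table scan.
CREATE NONCLUSTERED INDEX IX_target_match_fields
    ON target_table (customer_code, order_date);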
