Very low throughput when using JdbcIO on Google Dataflow

I'd like to load data into a Google Cloud SQL instance via Google Dataflow. Since I don't think there is a built-in sink for Cloud SQL, I decided to use org.apache.beam.sdk.io.jdbc.JdbcIO. However, the throughput into Cloud SQL is very low (about 6 records/sec).

I suspected that the Cloud SQL instance was underpowered, but upgrading it made no difference.

The Dataflow log contains many entries like the following:

Proposing dynamic split of work unit my-project;2017-06-27_02_58_19-14077185378147382467;6703504927792172410 at {"fractionConsumed":0.9669782519340515}

Rejecting split request because custom reader returned null residual source.

What is happening, and how can I improve the performance?

2 Answers

#1

It's resolved!

When building the connection string, add the parameter as shown below:

JdbcIO.DataSourceConfiguration.create("com.mysql.jdbc.Driver", "jdbc:mysql://google/mydatabase?cloudSqlInstance=myproject:region:instance-name&socketFactory=com.google.cloud.sql.mysql.SocketFactory&rewriteBatchedStatements=true")

After adding "rewriteBatchedStatements=true", it worked. Throughput improved to about 2000 records/sec!
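
For readers who assemble the URL in code, here is a minimal sketch of building the same connection string from its parts. The class `CloudSqlJdbcUrl` and its `build` method are hypothetical names for illustration, not part of Beam or the MySQL driver, and the sketch assumes the values need no URL escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: assembles the JDBC URL used in the answer above. */
class CloudSqlJdbcUrl {
    static String build(String database, String instanceConnectionName) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("cloudSqlInstance", instanceConnectionName);
        params.put("socketFactory", "com.google.cloud.sql.mysql.SocketFactory");
        // The setting that fixed the throughput problem in this thread.
        params.put("rewriteBatchedStatements", "true");

        StringJoiner query = new StringJoiner("&");
        params.forEach((key, value) -> query.add(key + "=" + value));
        return "jdbc:mysql://google/" + database + "?" + query;
    }

    public static void main(String[] args) {
        System.out.println(build("mydatabase", "myproject:region:instance-name"));
    }
}
```

The resulting string can be passed as the second argument of JdbcIO.DataSourceConfiguration.create, as in the line above.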

Note: this probably works only with MySQL, since rewriteBatchedStatements is a MySQL Connector/J setting.
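
As a rough illustration of why the setting helps: with it enabled, the MySQL driver can combine a JDBC batch of single-row INSERTs into one multi-row INSERT, so many rows travel in one round trip instead of one each. The sketch below only imitates that idea on plain strings. `BatchRewriteSketch`, `my_table`, and the column names are made up for the example, and this is not the driver's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Conceptual sketch only: mimics collapsing batched single-row INSERTs into one statement. */
class BatchRewriteSketch {
    /**
     * @param insertPrefix e.g. "INSERT INTO my_table (id, name)"
     * @param rowTuples    one already-formatted value tuple per batched row
     */
    static String rewrite(String insertPrefix, List<String> rowTuples) {
        if (rowTuples.isEmpty()) {
            throw new IllegalArgumentException("need at least one row");
        }
        // All rows share one statement, hence one round trip to the server.
        return insertPrefix + " VALUES " + String.join(", ", rowTuples);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("(1, 'a')", "(2, 'b')", "(3, 'c')");
        System.out.println(rewrite("INSERT INTO my_table (id, name)", rows));
    }
}
```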

#2

Rejecting split request because custom reader returned null residual source.

Whatever custom source you implemented doesn't appear to support dynamic rebalancing.

I suspected that the Cloud SQL instance was underpowered, but upgrading it made no difference.

Are you sure that throughput to Cloud SQL is the issue? Have you measured the performance of your source and confirmed that Cloud SQL, rather than the source, is the bottleneck?
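
One way to check this is to time the source and the sink separately. The following is a minimal sketch, assuming the source can be exposed as an Iterator and the sink as a Consumer; `ThroughputProbe` and its methods are hypothetical names, not a Dataflow or Beam API.

```java
import java.util.Iterator;
import java.util.function.Consumer;

/** Hypothetical timing harness for comparing source and sink throughput. */
class ThroughputProbe {
    /** Converts a record count and an elapsed time in nanoseconds to records per second. */
    static double recordsPerSecond(long records, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return records / (elapsedNanos / 1_000_000_000.0);
    }

    /** Drains the source into the sink and returns the observed records per second. */
    static <T> double measure(Iterator<T> source, Consumer<? super T> sink) {
        long count = 0;
        long start = System.nanoTime();
        while (source.hasNext()) {
            sink.accept(source.next());
            count++;
        }
        // Guard against a zero reading on very fast runs.
        long elapsed = Math.max(1, System.nanoTime() - start);
        return recordsPerSecond(count, elapsed);
    }
}
```

Running measure once with a no-op sink gives the source's rate on its own. If that rate is far above what the full pipeline achieves, the write side is the more likely bottleneck.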

I'd like to load data into a Google Cloud SQL instance via Google Dataflow

Generally, I wouldn't recommend this. Cloud SQL is a single-machine database, so I suspect you don't get much benefit, and it may even hurt performance, from using a horizontally scalable system like Dataflow. You should be able to ingest into Cloud SQL just as fast by loading the data from a single VM instance.
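
A single-VM loader along those lines could look roughly like the sketch below, which uses plain JDBC batching. The table `my_table`, its two columns, and the class `SingleVmLoader` are assumptions made for illustration. The caller supplies an open Connection, and a suitable batch size would need to be found by measurement.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical single-process loader using standard JDBC batching. */
class SingleVmLoader {
    /** Number of executeBatch calls needed for totalRows at the given batch size. */
    static int batchCount(int totalRows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (totalRows + batchSize - 1) / batchSize;
    }

    /** Inserts rows (id, name) in batches, committing after each batch. */
    static void load(Connection conn, List<String[]> rows, int batchSize) throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        String sql = "INSERT INTO my_table (id, name) VALUES (?, ?)"; // hypothetical table
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

With MySQL, the connection URL would still need rewriteBatchedStatements=true for the batches to be sent as multi-row statements.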
