Exactly-once from a Kafka source in Apache Beam

Time: 2021-04-21 15:34:57

Is it possible to do exactly-once processing from a Kafka source with KafkaIO in Beam? There is a flag called AUTO_COMMIT that can be set, but it seems to commit back to Kafka right after the consumer reads the data, rather than after the pipeline has finished processing the message.

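For reference, a minimal sketch of what setting that flag on the KafkaIO source looks like. Broker, topic, and group names are placeholders, and withConsumerConfigUpdates is the method name in recent Beam releases:

    import com.google.common.collect.ImmutableMap;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Kafka source with the consumer's own auto-commit enabled: offsets are
    // committed on the consumer's timer, not when the pipeline has finished
    // processing each message.
    KafkaIO.Read<String, String> source =
        KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")             // placeholder broker
            .withTopic("input-topic")                       // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withConsumerConfigUpdates(ImmutableMap.<String, Object>of(
                ConsumerConfig.GROUP_ID_CONFIG, "my-group",
                ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"));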

1 solution

#1

Yes. Beam runners like Dataflow and Flink store the processed offsets in internal state, so it is not related to 'AUTO_COMMIT' in the Kafka consumer config. The stored internal state is checkpointed atomically with processing (the actual details depend on the runner).
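Relatedly, if you do want offsets committed back to Kafka only once the runner has finalized a checkpoint covering the read (rather than on the consumer's timer), KafkaIO exposes commitOffsetsInFinalize(). A sketch, using the same placeholder names and imports as above:

    // Offsets are committed back to Kafka when the runner finalizes its
    // checkpoint, i.e. after the reads have been durably recorded. This
    // requires a group.id in the consumer config.
    KafkaIO.Read<String, String> source =
        KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("input-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withConsumerConfigUpdates(ImmutableMap.<String, Object>of(
                ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
            .commitOffsetsInFinalize();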


There are more options for achieving end-to-end exactly-once semantics (from the source, through the Beam application, to the sink). The KafkaIO source provides an option to read only committed records, and KafkaIO also supports an exactly-once sink.
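Both of those options live on KafkaIO itself. A sketch of each, with the same placeholder names as above; the shard count and sink group id passed to withEOS are illustrative, and the exactly-once sink needs a runner that supports it:

    import org.apache.kafka.common.serialization.StringSerializer;  // plus the imports above

    // Source side: only read records from committed Kafka transactions
    // (sets isolation.level=read_committed on the consumer).
    KafkaIO.Read<String, String> committedReads =
        KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("input-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withReadCommitted();

    // Sink side: exactly-once writes via Kafka transactions.
    KafkaIO.Write<String, String> eosSink =
        KafkaIO.<String, String>write()
            .withBootstrapServers("kafka:9092")
            .withTopic("output-topic")
            .withKeySerializer(StringSerializer.class)
            .withValueSerializer(StringSerializer.class)
            .withEOS(1, "eos-sink-group");  // numShards, sinkGroupId (illustrative)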


Some pipelines do set 'AUTO_COMMIT', mainly so that when a pipeline is restarted from scratch (as opposed to updated, which preserves internal state), it resumes roughly where the old pipeline left off. As you mentioned, this does not come with processing guarantees.
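A sketch of that pattern: keep the same (placeholder) group.id across restarts so a from-scratch run picks up near the last committed offsets. As noted, this is best-effort, not exactly-once:

    // Reusing the group.id means a freshly started pipeline resumes from the
    // last offsets the old consumer auto-committed, instead of replaying the
    // topic. Best-effort only: auto-committed offsets can run ahead of what
    // the old pipeline actually finished processing.
    KafkaIO.Read<String, String> resumedSource =
        KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("input-topic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withConsumerConfigUpdates(ImmutableMap.<String, Object>of(
                ConsumerConfig.GROUP_ID_CONFIG, "my-pipeline-group",  // same as the old run
                ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"));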
