I am trying to read from multiple kafka brokers using KafkaIO on apache beam. The default option for offset management is to the kafka partition itself (no longer using zookeper from kafka >0.9). With this setup, when i restart the job/pipeline, there is issue with duplicate and missing records.
我试图在apache beam上使用KafkaIO从多个kafka经纪人那里读取。偏移管理的默认选项是kafka分区本身(不再使用kafka中的zookeper> 0.9)。使用此设置,当我重新启动作业/管道时,存在重复和丢失记录的问题。
From what i read, the best way to handle this is to manage offset to external data stores. Is it possible to do this with current version of apache beam and KafkaIO? I am using 2.2.0 version right now.
根据我的阅读,处理此问题的最佳方法是管理外部数据存储的偏移量。是否可以使用当前版本的apache beam和KafkaIO执行此操作?我现在正在使用2.2.0版本。
And, after reading from kafka,i will write it to BigQuery. Is there a setup in KafkaIO where I can set the committed message only after i insert the message to BigQuery? I can only find auto commit setup right now.
并且,在读完kafka之后,我会将它写入BigQuery。在KafkaIO中是否有设置,我只能在将消息插入BigQuery后设置提交的消息?我现在只能找到自动提交设置。
1 个解决方案
#1
0
In Dataflow, you can update a job rather than restarting from scratch. The new job resumes from the last checkpointed state, ensuring exactly-once processing. This works for KafkaIO source as well. The auto-commit option in Kafka consumer configuration helps but it is not atomic with Dataflow internal state, which implies restarted job might have small fraction of duplicate or missing messages.
在Dataflow中,您可以更新作业,而不是从头开始重新启动。新作业从最后一个检查点状态恢复,确保完成一次处理。这也适用于KafkaIO源。 Kafka使用者配置中的自动提交选项有所帮助,但它不是Dataflow内部状态的原子,这意味着重新启动的作业可能只有一小部分重复或丢失的消息。
#1
0
In Dataflow, you can update a job rather than restarting from scratch. The new job resumes from the last checkpointed state, ensuring exactly-once processing. This works for KafkaIO source as well. The auto-commit option in Kafka consumer configuration helps but it is not atomic with Dataflow internal state, which implies restarted job might have small fraction of duplicate or missing messages.
在Dataflow中,您可以更新作业,而不是从头开始重新启动。新作业从最后一个检查点状态恢复,确保完成一次处理。这也适用于KafkaIO源。 Kafka使用者配置中的自动提交选项有所帮助,但它不是Dataflow内部状态的原子,这意味着重新启动的作业可能只有一小部分重复或丢失的消息。