I'm using Spark Streaming integrated with streaming-kafka.
My Kafka topic has 80 partitions, while each of my machines has 40 cores. I found that when the job is running, the Kafka consumer processes are deployed on only 2 machines (40*2=80), so the bandwidth usage on those 2 machines becomes very high.
I wonder: is there any way to control how the Kafka consumers are dispatched, in order to balance the bandwidth and memory usage across machines?
1 solution
#1
You can use this consumer from Spark-Packages.
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
This consumer has been running successfully in many production deployments, and it is the most reliable Receiver-based low-level consumer.
It gives more control over offset commits and Receiver fault tolerance. It also lets you control how many receivers you configure for your topic, which determines the parallelism.
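For illustration, here is a minimal sketch of the receiver-parallelism idea using the stock spark-streaming-kafka receiver (KafkaUtils.createStream) rather than the linked package: creating several receiver streams and unioning them makes Spark place the receivers on more executors, which spreads the network and memory load. The ZooKeeper quorum, topic name, group id, and receiver count below are placeholders; the linked dibbhatt consumer exposes its own launcher with a similar number-of-receivers setting.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object BalancedKafkaReceivers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("balanced-kafka-receivers")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder connection settings -- substitute your own cluster details.
    val zkQuorum = "zk1:2181,zk2:2181,zk3:2181"
    val groupId  = "my-consumer-group"
    // topic -> number of consumer threads inside each receiver
    val topicMap = Map("my-topic" -> 10)

    // Start 8 receivers instead of a single one. Spark schedules each
    // receiver on an executor, so more receivers spread the Kafka network
    // traffic and receive-side memory over more machines.
    val numReceivers = 8
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap,
        StorageLevel.MEMORY_AND_DISK_SER)
    }

    // Union the per-receiver streams into one DStream for downstream processing.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```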
Dibyendu