Controlling the files read by Spark Streaming

Time: 2023-01-16 20:49:03

I am using Spark to read text files from a folder and load them into Hive.

The Spark Streaming batch interval is 1 minute. In rare cases the source folder may contain 1000 large files.

How do I control Spark Streaming to limit the number of files the program reads? Currently my program reads all files generated in the last minute, but I want to cap the number of files it reads per batch.

I am using the textFileStream API.

    JavaDStream<String> lines = jssc.textFileStream("C:/Users/abcd/files/");
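For context, that call normally sits inside a streaming job like the minimal sketch below (the app name, master setting, and the `foreachRDD` action are illustrative placeholders, not from the question; a 1-minute batch interval matches the setup described above):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamJob {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder app name and local master, for illustration only
        SparkConf conf = new SparkConf().setAppName("file-stream").setMaster("local[2]");

        // 1-minute batch interval, matching the question
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

        // Monitors the folder and picks up files that appear after the job starts;
        // every file visible in a given batch window is read in that batch
        JavaDStream<String> lines = jssc.textFileStream("C:/Users/abcd/files/");
        lines.foreachRDD(rdd -> System.out.println("Lines in batch: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}
```

Note that `textFileStream` itself takes no rate or file-count parameter; each batch processes whatever files arrived in that interval, which is exactly the behavior the question wants to limit.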

Is there any way to control the file streaming rate?


2 solutions

#1

I am afraid not. Spark Streaming is time-driven. You could use Flink, which provides data-driven windows:

https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html#windows

#2


You could use "spark.streaming.backpressure.enabled" and "spark.streaming.backpressure.initialRate" to control the rate at which data is received.
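As a sketch, these properties can be passed at submit time; the rate value, class name, and jar name below are illustrative placeholders, not from the answer. One caveat worth verifying for your Spark version: backpressure throttles receiver-based and Kafka direct streams, so whether it has any effect on a file-based textFileStream is not guaranteed.

```shell
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=1000 \
  --class com.example.FileStreamJob \
  file-stream-job.jar
```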
