I use Spark Streaming to receive tweets from Twitter. I get many warnings that say:
replicated to only 0 peer(s) instead of 1 peers
What is this warning for?
My code is:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.twitter.TwitterUtils;
import twitter4j.auth.AuthorizationFactory;
import twitter4j.conf.ConfigurationBuilder;

SparkConf conf = new SparkConf().setAppName("Test");
JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(5));
sc.checkpoint("/home/arman/Desktop/checkpoint");

// Twitter OAuth credentials
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setOAuthConsumerKey("****************")
  .setOAuthConsumerSecret("**************")
  .setOAuthAccessToken("*********************")
  .setOAuthAccessTokenSecret("***************");

// Receive the live tweet stream
JavaReceiverInputDStream<twitter4j.Status> statuses =
    TwitterUtils.createStream(sc, AuthorizationFactory.getInstance(cb.build()));

// Extract hashtags and keep a running count across batches
JavaPairDStream<String, Long> hashtags = statuses.flatMapToPair(new GetHashtags());
JavaPairDStream<String, Long> hashtagsCount = hashtags.updateStateByKey(new UpdateReduce());
hashtagsCount.foreachRDD(new saveText(args[0], true));

sc.start();
sc.awaitTerminationOrTimeout(Long.parseLong(args[1]));
sc.stop();
1 Solution
#1
When reading data with Spark Streaming, incoming data blocks are replicated to at least one other node/worker for fault tolerance. Without replication, the runtime could read a piece of data from the stream and then fail, and that data would be lost for good: it has already been read and removed from the stream, and it is also gone on the worker side because of the failure.
Referring to the Spark documentation:
While a Spark Streaming driver program is running, the system receives data from various sources and divides it into batches. Each batch of data is treated as an RDD, that is, an immutable parallel collection of data. These input RDDs are saved in memory and replicated to two nodes for fault-tolerance.
The warning in your case means that the incoming stream data is not being replicated at all. The likely reason is that you are running the app with just one Spark worker instance, or in local mode. Try starting more Spark workers and see if the warning goes away.
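If you intentionally run with a single worker (for example during development), another option is to request a storage level with no replication, so the receiver stops trying to copy blocks to a peer. This is a sketch, not taken from the question, assuming the `TwitterUtils.createStream` overload in the `spark-streaming-twitter` module that accepts filter keywords and a `StorageLevel`:

```java
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.twitter.TwitterUtils;

// MEMORY_AND_DISK_SER keeps one serialized copy of each block
// (replication factor 1), so no peer replication is attempted.
JavaReceiverInputDStream<twitter4j.Status> statuses = TwitterUtils.createStream(
    sc,
    AuthorizationFactory.getInstance(cb.build()),
    new String[0],                       // no filter keywords
    StorageLevel.MEMORY_AND_DISK_SER());
```

Note this trades away fault tolerance: with no replicated copy, data buffered on a failed worker is lost, which is acceptable for experiments but not for production jobs.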