How to read a file saved by Spark from Java code

Time: 2021-02-12 15:20:49

I am new to Spark. I have a file TrainDataSpark.java in which I process some data, and at the end of it I save the Spark-processed data to a directory called Predictions with the code below:

predictions.saveAsTextFile("Predictions"); 

In the same TrainDataSpark.java, I add the following code right after the line above:

OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
final Path predictionFilePath = Paths.get("/Predictions/part-00000");
final Path outputHtml = Paths.get("/outputHtml.html");
ouputGenerator.getFormattedHtml(input,predictionFilePath,outputHtml);

And I am getting a NoSuchFileException for /Predictions/part-00000. I have tried all the paths I can think of, but it still fails. I think the Java code looks for the file on my local system rather than on the HDFS cluster. Is there a way to get the file path from the cluster so I can pass it on? Or is there a way to load my Predictions file locally instead of onto the cluster, so that the Java part runs without error?

2 solutions

#1



This can happen if you are running Spark on a cluster. Paths.get looks for the file in the local file system of each node separately, while the file actually lives on HDFS. You can probably load it with sc.textFile("hdfs:/Predictions") (or sc.textFile("Predictions")).

If, on the other hand, you'd like to save to the local file system, you're going to need to collect the RDD first and then save it using regular Java IO.

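A minimal sketch of both options, assuming a JavaSparkContext called sc; the class name, the local output path, and the "hdfs:/Predictions" URI are placeholders rather than anything from the question:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadPredictionsSketch {

    // Hypothetical helper; 'sc' comes from the surrounding Spark job.
    static void readBack(JavaSparkContext sc) throws Exception {
        // Option 1: read the saved output back through Spark, so the path is
        // resolved against HDFS rather than the local file system of one node.
        JavaRDD<String> predictions = sc.textFile("hdfs:/Predictions");

        // Option 2: collect to the driver and write with regular Java IO.
        // Only safe if the predictions fit comfortably in driver memory.
        List<String> lines = predictions.collect();
        Files.write(Paths.get("/tmp/predictions.txt"), lines); // local path on the driver
    }
}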

#2



I figured it out this way...

// Imports needed for this snippet:
// import java.net.URI;
// import org.apache.hadoop.conf.Configuration;
// import org.apache.hadoop.fs.FileSystem;
// import org.apache.hadoop.fs.FSDataInputStream;
// import org.apache.hadoop.fs.FSDataOutputStream;

// Fully qualified HDFS paths; pathToHDFS stands for your namenode address.
String predictionFilePath = "hdfs://pathToHDFS/user/username/Predictions/part-00000";
String outputHtml = "hdfs://pathToHDFS/user/username/outputHtml.html";

URI uriRead = URI.create(predictionFilePath);
URI uriOut = URI.create(outputHtml);

Configuration conf = new Configuration();

// Get FileSystem handles for the HDFS URIs instead of the local file system.
FileSystem fileRead = FileSystem.get(uriRead, conf);
FileSystem fileWrite = FileSystem.get(uriOut, conf);

// Open the prediction part file for reading and the HTML file for writing.
// Note: append() requires the target file to already exist (and appends to be
// enabled on the cluster); use create() instead when writing a new file.
FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut));

/* Java code that uses the stream objects to read and write */
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
ouputGenerator.getFormattedHtml(input, in, out);
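As a small follow-up (not part of the original answer): FSDataInputStream and FSDataOutputStream are Closeable, so the same call can be wrapped in try-with-resources to make sure the HDFS streams are closed even if getFormattedHtml throws:

try (FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
     FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut))) {
    ouputGenerator.getFormattedHtml(input, in, out);
}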
