I am new to Spark. I have a file TrainDataSpark.java in which I process some data, and at the end I save the Spark-processed data to a directory called Predictions with the code below:
predictions.saveAsTextFile("Predictions");
In the same TrainDataSpark.java I am adding the code below just after that line.
import java.nio.file.Path;
import java.nio.file.Paths;

OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
final Path predictionFilePath = Paths.get("/Predictions/part-00000");
final Path outputHtml = Paths.get("/outputHtml.html");
ouputGenerator.getFormattedHtml(input, predictionFilePath, outputHtml);
And I am getting a NoSuchFile exception for /Predictions/part-00000. I have tried all possible paths but it fails. I think the Java code searches for the file on my local system and not on the HDFS cluster. Is there a way to get the file path from the cluster so I can pass it further? Or is there a way to load my Predictions output to the local file system instead of the cluster so the Java part runs without error?
2 Answers
#1
This can happen if you are running Spark on a cluster. Paths.get looks for the file in the local file system of each node separately, while the file exists on HDFS. You can probably load the file using sc.textFile("hdfs:/Predictions") (or sc.textFile("Predictions")).
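For instance, a minimal sketch of reading the output back, assuming sc is the JavaSparkContext already used in TrainDataSpark.java:

import org.apache.spark.api.java.JavaRDD;

// Load the saved predictions back from HDFS as an RDD of lines; a relative
// path resolves against fs.defaultFS and the user's HDFS home directory.
JavaRDD<String> savedPredictions = sc.textFile("Predictions");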
If, on the other hand, you'd like to save to the local file system, you're going to need to collect the RDD first and save it using regular Java I/O.
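A minimal sketch of that collect-and-save route, assuming predictions is a JavaRDD<String> and /tmp/predictions.txt is just a placeholder local path:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Pull the whole RDD back to the driver; only safe when the result is small
// enough to fit in driver memory.
List<String> predictionLines = predictions.collect();
// Write the collected lines to a local file with plain Java I/O
// (the enclosing method must handle or declare IOException).
Files.write(Paths.get("/tmp/predictions.txt"), predictionLines);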
#2
I figured it out this way...
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;

// Fully qualified HDFS URIs for the prediction input and the HTML output
String predictionFilePath = "hdfs://pathToHDFS/user/username/Predictions/part-00000";
String outputHtml = "hdfs://pathToHDFS/user/username/outputHtml.html";
URI uriRead = URI.create(predictionFilePath);
URI uriOut = URI.create(outputHtml);

Configuration conf = new Configuration();
FileSystem fileRead = FileSystem.get(uriRead, conf);
FileSystem fileWrite = FileSystem.get(uriOut, conf);

// Open HDFS streams instead of java.nio.file paths
FSDataInputStream in = fileRead.open(new org.apache.hadoop.fs.Path(uriRead));
FSDataOutputStream out = fileWrite.append(new org.apache.hadoop.fs.Path(uriOut));

/* Java code that uses the stream objects to read and write */
OutputGeneratorOptimized ouputGenerator = new OutputGeneratorOptimized();
ouputGenerator.getFormattedHtml(input, in, out);
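One possible refinement, not part of the original answer: if this code runs in the Spark driver where a JavaSparkContext named sc is in scope, the driver's Hadoop configuration can be reused so fs.defaultFS is picked up and the hdfs://pathToHDFS prefix need not be hard-coded. Also, FileSystem.append expects the target file to already exist; FileSystem.create is the usual call for writing a brand-new file.

// Reuse the driver's Hadoop configuration (picks up fs.defaultFS etc.)
Configuration conf = sc.hadoopConfiguration();
// Create the output file instead of appending to one that may not exist yet
FSDataOutputStream out = fileWrite.create(new org.apache.hadoop.fs.Path(uriOut));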