Spark read.json throws java.io.IOException: Too many bytes before newline

Time: 2022-10-29 13:47:49

I am getting the following error when reading a large 6 GB single-line JSON file:

Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read JSON records that span multiple lines, so the entire 6 GB JSON file is on a single line:

jf = sqlContext.read.json("jlrn2.json")

Configuration:

spark.driver.memory 20g

1 solution

#1


2  

Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.
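
For reference, the number in the error message is exactly one more than Integer.MAX_VALUE. A quick check of the arithmetic, assuming "6gb" in the question means 6 GiB (the variable names are only for illustration):

```python
# Java's Integer.MAX_VALUE is 2**31 - 1.
INT_MAX = 2**31 - 1

# Byte count copied from the error message in the question.
reported = 2147483648

# Assumption: "6gb" in the question means 6 GiB.
file_size = 6 * 1024**3

print(reported == INT_MAX + 1)  # True
print(file_size > INT_MAX)      # True
```

So the line limit is hit long before the end of the file, no matter how much driver memory is configured.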

Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
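
If you want to check whether a (small) file already meets that requirement, something like the sketch below might help. It uses only Python's standard json module; the function name first_invalid_line is made up for this example, and skipping blank lines is a choice made here, not something taken from the Spark docs.

```python
import io
import json


def first_invalid_line(lines):
    """Return the 1-based number of the first line that is not a JSON object, or None."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignoring blank lines is a choice made for this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            return number
        if not isinstance(value, dict):
            return number
    return None


# One object per line: every line parses on its own.
json_lines = io.StringIO('{"id": 1}\n{"id": 2}\n')
# A pretty-printed array: the first line is just "[", which is not valid JSON.
pretty_array = io.StringIO('[\n  {"id": 1},\n  {"id": 2}\n]\n')

print(first_invalid_line(json_lines))    # None
print(first_invalid_line(pretty_array))  # 1
```

Note that iterating by line reads each line fully into memory, so this is only practical for files whose lines are reasonably short.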

So if your JSON document is in the form...

[
  { [record] },
  { [record] }
]

You'll want to change it to one record per line:

{ [record] }
{ [record] }
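
One possible way to do that conversion without loading the whole 6 GB file at once is to stream it in chunks and decode one record at a time. This is only a sketch: it assumes the file is a single top-level array whose elements are all JSON objects, the function name array_to_json_lines is made up here, and it is lenient about stray commas. It relies on json.JSONDecoder.raw_decode from the standard library.

```python
import io
import json


def array_to_json_lines(src, dst, chunk_size=1 << 20):
    """Copy the objects of a top-level JSON array in `src` to `dst`, one per line.

    `src` and `dst` are text-mode file objects. Returns the number of records written.
    """
    decoder = json.JSONDecoder()
    buf, pos = "", 0
    seen_open = False  # True once the opening '[' has been consumed
    count = 0

    def refill():
        # Drop the already-consumed prefix, then append the next chunk.
        nonlocal buf, pos
        chunk = src.read(chunk_size)
        if not chunk:
            raise ValueError("input ended before the closing ']'")
        buf = buf[pos:] + chunk
        pos = 0

    while True:
        # Skip whitespace between tokens.
        while pos < len(buf) and buf[pos] in " \t\r\n":
            pos += 1
        if pos == len(buf):
            refill()
            continue
        ch = buf[pos]
        if not seen_open:
            if ch != "[":
                raise ValueError("expected a top-level JSON array")
            seen_open = True
            pos += 1
        elif ch == "]":
            return count
        elif ch == ",":
            pos += 1
        elif ch == "{":
            try:
                record, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                refill()  # the object is probably split across chunks
                continue
            # json.dumps without indent never emits a newline inside a record.
            dst.write(json.dumps(record) + "\n")
            count += 1
        else:
            raise ValueError("expected a JSON object")


# Tiny in-memory demo; the small chunk size forces records to span chunks.
src = io.StringIO('[\n  {"id": 1, "tags": ["a", "b"]},\n  {"id": 2}\n]')
dst = io.StringIO()
print(array_to_json_lines(src, dst, chunk_size=8))  # 2
print(dst.getvalue(), end="")
```

For the file in the question you would open jlrn2.json for reading and some output file (say jlrn2.jsonl, a name picked for illustration) for writing, both in text mode, pass them to the function, and then point sqlContext.read.json at the output. A malformed record makes this sketch keep reading until the end of the input before it fails, so it is not a validator.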
