I've got this JSON file
{
"a": 1,
"b": 2
}
which was written with Python's json.dump method. Now I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json')
print df.show()
The print statement spits this out, though:
+---------------+
|_corrupt_record|
+---------------+
| {|
| "a": 1, |
| "b": 2|
| }|
+---------------+
Does anyone know what's going on, and why the file isn't being interpreted correctly?
4 Answers
#1
29
You need to have one JSON object per line in your input file; see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
If your JSON file looks like this, it will give you the expected DataFrame:
{ "a": 1, "b": 2 }
{ "a": 3, "b": 4 }
....
df.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
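One way to produce such a file from Python is to write one object per line with json.dumps, instead of dumping the whole list at once. A minimal stdlib-only sketch (the filename and records are illustrative, not from the question):

```python
import json

records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

# Write one JSON object per line (the "JSON Lines" layout Spark's
# reader expects by default) instead of json.dump(records, f),
# which would emit a single multi-line document.
with open("records.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line is now a complete, independently parseable JSON document.
with open("records.json") as f:
    lines = f.read().splitlines()
print(lines)  # ['{"a": 1, "b": 2}', '{"a": 3, "b": 4}']
```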
#2
11
If you want to leave your JSON file as it is (without stripping the newline characters \n), include the multiLine=True keyword argument:
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.read.json('my_file.json', multiLine=True)
print df.show()
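To see why this flag matters: json.dump writes a single document whose braces span several lines, so no individual line parses on its own, which is exactly what a line-at-a-time reader trips over. A stdlib-only sketch (no Spark required):

```python
import json

# Pretty-printed output, like the file in the question.
doc = json.dumps({"a": 1, "b": 2}, indent=4)

# The whole string is valid JSON...
assert json.loads(doc) == {"a": 1, "b": 2}

def parses_alone(s):
    """Return True if s is a complete JSON document by itself."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

# ...but every single line fails in isolation, so a line-oriented
# reader sees only corrupt records. multiLine=True tells Spark to
# parse the whole file as one document instead.
print([parses_alone(line) for line in doc.splitlines()])  # all False
```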
#3
2
Adding to @Bernhard's great answer:
import json

# original file was written with pretty-print inside a list
with open("pretty-printed.json") as jsonfile:
    js = json.load(jsonfile)

# write a new file with one JSON object per line
with open("flattened.json", 'a') as outfile:
    for d in js:
        json.dump(d, outfile)
        outfile.write('\n')
#4
0
In Spark 2.2+ you can read a multiline JSON file using the following command:

val dataframe = spark.read.option("multiline", true).json("filePath")

If there is one JSON object per line, then:

val dataframe = spark.read.json("filePath")