I'm trying to parse a large (~100 MB) JSON file using the ijson package, which lets me process the file efficiently. However, after writing code like this,
import ijson

with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == "name":
            print(value)
I found that the code parses only the first line and not the rest of the file!
Here is what a portion of my JSON file looks like:
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.012000}
{"name":"engine_speed","value":772,"timestamp":1364323939.027000}
{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}
{"name":"accelerator_pedal_position","value":0,"timestamp":1364323939.035000}
I think ijson parses only one JSON object.
Can someone please suggest how to work around this?
2 solutions
#1
3
Since the provided chunk looks like a set of lines, each containing an independent JSON document, it should be parsed line by line:
# each JSON document is small, so there is no need for iterative parsing
import json

with open(filename, 'r') as f:
    for line in f:
        data = json.loads(line)
        # data['name'], data['value'] and data['timestamp'] now
        # contain the corresponding values
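The line-by-line approach can be wrapped in a small generator that also tolerates blank lines. This is only a minimal sketch, assuming one JSON object per line as in the sample from the question; the name `iter_records` is hypothetical, and an in-memory `io.StringIO` stands in for the opened file.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed JSON object per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing empty line at end of file
            continue
        yield json.loads(line)


# Stand-in for open(filename): two lines copied from the sample in the question.
sample = io.StringIO(
    '{"name":"engine_speed","value":772,"timestamp":1364323939.027000}\n'
    '\n'
    '{"name":"vehicle_speed","value":0,"timestamp":1364323939.029000}\n'
)

names = [record["name"] for record in iter_records(sample)]
print(names)
```

Because the generator pulls one line at a time, only a single record is held in memory at once, so the same loop should work when `sample` is replaced by a real file object.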
#2
3
Unfortunately, the ijson library (v2.3 as of March 2018) does not support parsing multiple JSON objects. It can handle only one top-level object; if you attempt to parse a second one, you get the error "ijson.common.JSONError: Additional data". See the bug reports here:
- https://github.com/isagalaev/ijson/issues/40
- https://github.com/isagalaev/ijson/issues/42
- https://github.com/isagalaev/ijson/issues/67
- python: how do I parse a stream of json arrays
It's a big limitation. However, as long as each JSON object is followed by a line break (newline character), you can parse the objects independently, one line at a time, like this:
import io
import ijson

with open(filename, encoding="UTF-8") as json_file:
    cursor = 0  # running count of characters read so far
    for line_number, line in enumerate(json_file, start=1):
        print("Processing line", line_number, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        # "event" avoids shadowing the built-in type()
        for prefix, event, value in json_parser:
            print("prefix=", prefix, "type=", event, "value=", value)
        cursor += len(line)
You are still streaming the file rather than loading it entirely into memory, so this works on large JSON files. It also uses the line-streaming technique from "How to jump to a particular line in a huge text file?" and enumerate() from "Accessing the index in Python 'for' loops".
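The per-line approach depends on a newline after every object. If the objects were instead concatenated on one line, one possible alternative is the standard library's `json.JSONDecoder.raw_decode`, which parses a single value and reports where it ended. Note that this swaps ijson for the plain `json` module and works on a string already in memory, so it is a sketch of the idea rather than a streaming solution; `iter_concatenated` and the sample `chunk` are hypothetical.

```python
import json


def iter_concatenated(text):
    """Yield each top-level JSON value found back to back in text."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode rejects leading whitespace, so skip past it first
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the parsed value and the index just past it
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Two objects separated by a space instead of a newline (made-up sample).
chunk = '{"name":"engine_speed","value":772} {"name":"vehicle_speed","value":0}'
values = [obj["name"] for obj in iter_concatenated(chunk)]
print(values)
```

For a file like the one in the question, where every object sits on its own line, the line-by-line loop is simpler and keeps memory use low.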