I'm a bit of an idiot when it comes to programming and Python. I know there are a lot of explanations of this in previous questions, but I read all of them carefully and didn't find a solution.
I'm trying to read a JSON file that contains about 1 billion records like this:
334465|{"color":"33ef","age":"55","gender":"m"}
334477|{"color":"3444","age":"56","gender":"f"}
334477|{"color":"3999","age":"70","gender":"m"}
I tried hard to get past the 6-digit number at the beginning of each line, but I don't know how to read multiple JSON objects. Here is my code, but I can't work out why it isn't working:
import json
T = []
s = open('simple.json', 'r')
ss = s.read()
for line in ss:
    line = ss[7:]
    T.append(json.loads(line))
s.close()
And here is the error that I got:
ValueError: Extra Data: line 3 column 1 - line 5 column 48 (char 42 - 138)
Any suggestions would be very helpful!
4 Answers
#1
3
There are several problems with the logic of your code.
ss = s.read()
reads the entire file s into a single string. The next line
for line in ss:
iterates over each character in that string, one by one, so on each pass through the loop line is a single character. In
line = ss[7:]
you take the entire file contents apart from the first 7 characters (positions 0 through 6, inclusive) and replace the previous content of line with that. And then
T.append(json.loads(line))
attempts to parse that string as JSON and store the resulting object in the T list. Because the string still contains every remaining line of the file, json.loads finds more than one JSON object and raises the "Extra data" error.
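To see both effects in isolation, here is a small sketch that uses an in-memory string in place of the file. The sample text is just the first two lines from the question, and the variable names first_items and error_message are made up for the illustration.

```python
import json

# Two records from the question, held in a string instead of a file.
ss = (
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
)

# Iterating over a string yields single characters, not lines.
first_items = [ch for ch in ss][:3]
print(first_items)

# ss[7:] is everything after the first prefix, so it still holds two
# JSON objects; json.loads accepts only one and rejects the rest.
try:
    json.loads(ss[7:])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The first print shows three one-character strings, and the second shows an "Extra data" message like the one in the question.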
Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines; we can simply put the file handle into a for loop, which iterates over the file line by line.
We use a with statement to open the file, so that it gets closed automatically when we exit the with block, even if there's an IO error.
import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)
output
{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}
We can make this more compact by building the table list in a list comprehension:
import json

with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)
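With around a billion lines, holding every parsed object in one list may not fit in memory. One possible alternative, sketched here under that assumption, is a generator that yields one record at a time. iter_records is a hypothetical helper name, and str.partition is used instead of a fixed slice so the numeric ID can be kept and does not have to be exactly six digits.

```python
import io
import json

def iter_records(lines):
    """Yield (id, object) pairs from lines shaped like 'ID|{json}'."""
    for line in lines:
        # Split on the first '|' only, so a '|' inside the JSON is left alone.
        record_id, sep, payload = line.partition('|')
        if not sep:
            continue  # no '|' on this line, e.g. a blank line
        yield record_id, json.loads(payload)

# io.StringIO stands in for the open file; any iterable of lines works.
sample = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
)
records = list(iter_records(sample))
for record_id, row in records:
    print(record_id, row)
```

With the real file, passing the handle from open('simple.json') to iter_records and looping over the result directly (without list) processes one row at a time.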
#2
1
If you use Pandas you can simply write df = pd.read_json(f, lines=True). Note that this expects every line to be a bare JSON object, so the numeric prefix and the | would have to be stripped first.
As per the docs, lines=True means:
Read the file as a json object per line.
#3
0
You should use readlines() instead of read(), and wrap your JSON parsing in a try/except block. Your file probably contains blank or malformed lines, and those would cause an error; a trailing newline on its own is harmless, because json.loads ignores surrounding whitespace.
import json

s = open('simple.json', 'r')
for line in s.readlines():
    try:
        # Keep only the part after the last '|'
        j = line.split('|')[-1]
        json.loads(j)
    except ValueError:
        # You probably have bad JSON
        continue
s.close()
#4
0
Thank you so much! You guys are lifesavers! This is the code that I eventually came up with. I think it is a combination of all the answers!
import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        try:
            j = line.split('|')[-1]
            table.append(json.loads(j))
        except ValueError:
            # You probably have bad JSON
            continue

for row in table:
    print(row)
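Silently skipping bad lines makes it hard to tell how much data was dropped. A possible variation, sketched here with a hypothetical helper named load_rows, also records the numbers of the lines that failed to parse so they can be inspected afterwards. The sample input is invented for the illustration: one good line, one blank line, and one truncated line.

```python
import io
import json

def load_rows(lines):
    """Parse 'ID|{json}' lines; return (rows, bad_line_numbers)."""
    rows, bad = [], []
    for number, line in enumerate(lines, start=1):
        try:
            rows.append(json.loads(line.split('|')[-1]))
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError
            bad.append(number)
    return rows, bad

# io.StringIO stands in for the open file.
sample = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '\n'
    '334477|{"color":"3444","age":"56"\n'
)
rows, bad = load_rows(sample)
print(rows)
print(bad)
```

Here only the first line parses, and lines 2 and 3 end up in the bad list.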