Reading a JSON file with multiple objects in Python

I'm a bit of an idiot when it comes to programming and Python. I know there are a lot of explanations of this in previous questions, but I read all of them carefully and didn't find a solution.
I'm trying to read a JSON file that contains about 1 billion records like this:

334465|{"color":"33ef","age":"55","gender":"m"}
334477|{"color":"3444","age":"56","gender":"f"}
334477|{"color":"3999","age":"70","gender":"m"}

I have been trying hard to get past the 6-digit number at the beginning of each line, but I don't know how to read multiple JSON objects. Here is my code; I can't work out why it isn't working:

import json

T =[]
s = open('simple.json', 'r')
ss = s.read()
for line in ss:
    line = ss[7:]
    T.append(json.loads(line))
s.close()

And here is the error that I got:

ValueError: Extra Data: line 3 column 1 - line 5 column 48 (char 42 - 138)

Any suggestion would be very helpful for me!

4 Answers

#1

There are several problems with the logic of your code.

ss = s.read()

reads the entire file s into a single string. The next line

for line in ss:

iterates over each character in that string, one by one, so on each iteration line is a single character. In

    line = ss[7:]

you are getting the entire file contents apart from the first 7 characters (in positions 0 through 6, inclusive) and replacing the previous content of line with that. And then

T.append(json.loads(line))

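attempts to parse that string as JSON and store the resulting object in the T list. Because the string still contains every remaining line of the file after the first JSON object, json.loads finds more text after that object and raises the "Extra data" error.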
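To see both problems in isolation, here is a small sketch that uses an in-memory string holding two of the sample lines instead of the real file. The variable names first_items and error_message are only for illustration. It shows that looping over a string yields characters, and that json.loads rejects a string that has anything after the first JSON value.

```python
import json

# Two lines of the sample data, standing in for the result of s.read()
ss = ('334465|{"color":"33ef","age":"55","gender":"m"}\n'
      '334477|{"color":"3444","age":"56","gender":"f"}\n')

# Iterating over a string yields one character at a time, not one line
first_items = [ch for ch in ss][:3]
print(first_items)  # ['3', '3', '4']

# ss[7:] is everything after the first 7 characters: the first JSON
# object followed by all of the remaining lines
try:
    json.loads(ss[7:])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# The message reports extra data after the first object
print(error_message)
```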

Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines; we can simply put the file handle into a for loop, and that will iterate over the file line by line.

We use a with statement to open the file, so that it will get closed automatically when we exit the with block, or if there's an IO error.

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)

output

{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}

We can make this more compact by building the table list in a list comprehension:

import json

with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)
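Both versions above keep every parsed object in memory. With around a billion lines, as mentioned in the question, that may not fit, so it can help to process rows one at a time with a generator. This is only a sketch: iter_rows is a hypothetical name, the io.StringIO object stands in for the real file, and counting rows by gender is just an example of per-row work.

```python
import io
import json
from collections import Counter

def iter_rows(lines):
    """Yield one parsed JSON object per line, skipping the 7-character prefix."""
    for line in lines:
        yield json.loads(line[7:])

# Stand-in for the real file. With a file you would write:
#     with open('simple.json', 'r') as f:
#         for row in iter_rows(f): ...
f = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
    '334477|{"color":"3999","age":"70","gender":"m"}\n'
)

# Example of per-row work: count rows by gender without building a list
gender_counts = Counter(row['gender'] for row in iter_rows(f))
print(gender_counts)
```

Only one row is held at a time, so memory use stays roughly constant regardless of the file size.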

#2

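If you use pandas, you can simply write df = pd.read_json(f, lines=True). Note that this expects every line to be a complete JSON object, so the numeric prefix and the | shown in the question would have to be stripped first.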

According to the documentation, lines=True means:

Read the file as a json object per line.

#3

You should use readlines() instead of read(), and wrap your JSON parsing in a try/except block so that a malformed line does not abort the whole run. A trailing newline on each line is not a problem by itself, because json.loads ignores leading and trailing whitespace.

import json

with open('simple.json', 'r') as s:
    for line in s.readlines():
        try:
            # Take the text after the last '|' separator
            j = line.split('|')[-1]
            obj = json.loads(j)
        except ValueError:
            # You probably have bad JSON
            continue
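Silently skipping bad lines can hide problems in a large file. One possible refinement, sketched here on the assumption that you want to know which lines failed, is to record the line number of each failure. The sample text, including its two deliberately broken lines, is made up for illustration, and load_rows is a hypothetical helper name.

```python
import io
import json

def load_rows(lines):
    """Parse 'ID|{json}' lines; return (rows, line numbers that failed)."""
    rows, bad_lines = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            # Same approach as above: take the text after the last '|'
            payload = line.split('|')[-1]
            rows.append(json.loads(payload))
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError
            bad_lines.append(lineno)
    return rows, bad_lines

sample = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":\n'   # truncated JSON
    'no separator here\n'               # not JSON at all
    '334477|{"color":"3999","age":"70","gender":"m"}\n'
)

rows, bad_lines = load_rows(sample)
print(len(rows), bad_lines)
```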

#4

Thank you so much! You guys are life savers! This is the code that I eventually came up with. I think it is a combination of all the answers!

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        try:
            j = line.split('|')[-1]
            table.append(json.loads(j))
        except ValueError:
            # You probably have bad JSON
            continue

for row in table:
    print(row)
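One caveat with line.split('|')[-1]: it keeps only the text after the last |, so it would break if a JSON value ever contained a | character. A possible alternative, assuming the numeric prefix itself never contains a |, is to split on the first | only, which also lets you keep the number. In this sketch parse_line is a hypothetical helper, io.StringIO stands in for the open file, and the | inside the second sample value is invented to show the difference.

```python
import io
import json

def parse_line(line):
    """Split 'ID|{json}' on the first '|' and return (id, parsed object)."""
    record_id, _, payload = line.partition('|')
    # If the line has no '|', payload is empty and json.loads raises ValueError
    return record_id, json.loads(payload)

# Stand-in for the open file; the second line has a '|' inside a JSON value
f = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"34|44","age":"56","gender":"f"}\n'
)

# A list of (id, row) tuples rather than a dict, because IDs can repeat
table = [parse_line(line) for line in f]

for record_id, row in table:
    print(record_id, row)
```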
