Converting JSON to a Pandas DataFrame

Date: 2021-10-10 23:01:08

I am having some trouble with this. I am trying to load this JSON into a DataFrame. I suspect the problem is how I format the JSON when I write each tweet, but I have not been able to narrow it down. Any insight would be appreciated. Attached is my raw_tweets.json, and the second code block below shows how I write it, separating the tweets with commas, i.e. ','.join(...).

Here is the link to raw_tweets.json.

I get the following error:

    raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data

# JSON to DataFrame

class tweet2dframe(object):

    def __init__(self, text="", location=""):
        self.text = text
        self.location = location

    def getText(self):
        return self.text

    def getLocation(self):
        return self.location


# import json package to load json file
with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads(jsonFile.read())




tweets_list = [polls(i["location"], i["text"]) for i in polls_json['text']]

colNames = ("Text", "location")
dict_list = []


for i in tweets_list:
    dict_list.append(dict(zip(colNames , [i.getText(), i.getLocation()])))


tweets_df = pd.DataFrame(dict_list)
tweets_df.head()

The way I write my tweets to JSON:

saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(','.join(self.tweet_data))
saveFile.close()
exit()

1 solution

#1



raw_tweets.json does not contain valid JSON. It contains JSON objects separated by commas, so json.loads parses the first object and then raises "Extra data" when it finds more text after it. To make the whole text a valid JSON array, place brackets [...] around the contents:

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))

For example,

import pandas as pd
import json

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]
colNames = ("location", "text")
tweets_df = pd.DataFrame(tweets_list, columns=colNames)
print(tweets_df.head())

yields

        location                                               text
0           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
1  Pittsburgh PA  "The tuxedo was an invention of the Koch broth...
2           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
3           None  RT @webseriestoday: Democracy Now: Noam Chomsk...
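
To see why the original call fails and why the brackets help, here is a minimal sketch that uses two made-up tweet snippets in place of the real file contents. The sample strings are assumptions for illustration, not data taken from raw_tweets.json:

```python
import json

# Two hypothetical JSON objects joined by a comma, mimicking how the file was written.
snippets = ['{"text": "first", "user": {"location": null}}',
            '{"text": "second", "user": {"location": "Pittsburgh PA"}}']
raw = ','.join(snippets)

# Parsing the raw text fails: after the first object, the parser finds more text.
try:
    json.loads(raw)
    error_message = None
except json.JSONDecodeError as err:
    error_message = err.msg

# Wrapping the text in brackets turns it into one valid JSON array.
polls_json = json.loads('[{}]'.format(raw))

print(error_message)
print(len(polls_json))
```

The same wrapping is what the '[{}]'.format(jsonFile.read()) call above does with the file contents.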

Another, better way to fix the problem is to write valid JSON to raw_tweets.json in the first place. After all, if you want to send the file to someone else, you will make their life easier if the file contains valid JSON. We would need to see more of your code to suggest exactly how to fix it, but in general you would use json.dump to write a list of dicts as JSON to a file, instead of "manually" writing JSON snippets with saveFile.write(','.join(self.tweet_data)):

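import io
import json

tweets = []
for i in loop:  # "loop" stands in for however you iterate over incoming tweets
    tweets.append(tweet_dict)  # tweet_dict: the current tweet as a Python dict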
with io.open('raw_tweets.json', 'w', encoding='utf-8') as saveFile:
    json.dump(tweets, saveFile)
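
If self.tweet_data holds JSON strings (which the ','.join call suggests, though that is an assumption), one possible approach is to parse each string back into a dict and then dump the whole list at once. The names tweet_data and out_path and the sample strings below are illustrative only:

```python
import json
import os
import tempfile

# Assumed input: a list of JSON strings, one per tweet (hypothetical sample data).
tweet_data = ['{"text": "first", "user": {"location": null}}',
              '{"text": "second", "user": {"location": "Washington DC"}}']

# Parse each string into a dict so json.dump can write one valid JSON array.
tweets = [json.loads(s) for s in tweet_data]

# Write to a temporary directory here; in practice this would be raw_tweets.json.
out_path = os.path.join(tempfile.mkdtemp(), 'raw_tweets.json')
with open(out_path, 'w', encoding='utf-8') as save_file:
    json.dump(tweets, save_file)

# The file now loads without the bracket workaround.
with open(out_path, encoding='utf-8') as json_file:
    reloaded = json.load(json_file)
```

Writing the list in a single json.dump call means the file is always a complete JSON array, so readers never need to patch it before parsing.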

If raw_tweets.json contains valid JSON, you can load it into a Python list of dicts using:

with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.load(jsonFile)

The rest of the code, which loads the desired parts into a DataFrame, would remain the same.


How this line of code was constructed:

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]

In an interactive Python session, I inspected one dict in polls_json:

In [114]: import pandas as pd
In [115]: import json
In [116]: with open('raw_tweets.json', encoding="utf8") as jsonFile:
    polls_json = json.loads('[{}]'.format(jsonFile.read()))
In [117]: dct = polls_json[1]
In [118]: dct
Out[118]: 
{'contributors': None,
 'coordinates': None,
 ...
  'text': "Like the old Soviet leaders, Bernie refused to wear a tux at last night's black-tie dinner.",
  'truncated': False,
  'user': {'contributors_enabled': False,
  ...
   'location': 'Washington DC',}}

It is quite large, so I have omitted parts of it here to make the result more readable. Assuming I correctly guessed the text and location values you are looking for, we can see that given this dict, dct, we can access the desired text value using dct['text']. But the 'location' key is inside the nested dict, dct['user']. Therefore, we need to use dct['user']['location'] to extract the location value.
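
If some records might lack the 'user' dict or its 'location' key (an assumption; the record inspected above has both), a hedged variant of the lookup uses dict.get so that missing values become None instead of raising KeyError. The helper name extract_row and the sample records are hypothetical:

```python
def extract_row(dct):
    """Return (location, text) for one tweet dict, using None for missing values."""
    user = dct.get('user') or {}  # covers both a missing key and an explicit None
    return (user.get('location'), dct.get('text'))

# Made-up records: one complete, one without a 'user' entry.
sample = [
    {'text': 'has location', 'user': {'location': 'Washington DC'}},
    {'text': 'no user at all'},
]
rows = [extract_row(d) for d in sample]
print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame with columns=("location", "text") in the same way as tweets_list.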

By the way, Pandas provides a convenient function for reading JSON into a DataFrame, pd.read_json, but it relies on the JSON data being "flat". Because the data we want is in nested dicts, I used custom code, the list comprehension

tweets_list = [(dct['user']['location'], dct["text"]) for dct in polls_json]

to extract the values instead of pd.read_json.
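
As an alternative sketch, pandas 1.0 and later also provide pd.json_normalize, which flattens nested dicts into dotted column names such as user.location. This is not what the answer above used; the sample records below are made up for illustration:

```python
import pandas as pd

# Hypothetical records shaped like the tweet dicts inspected above.
polls_json = [
    {'text': 'first', 'user': {'location': None}},
    {'text': 'second', 'user': {'location': 'Pittsburgh PA'}},
]

# Nested keys become dotted column names, e.g. 'user.location'.
flat = pd.json_normalize(polls_json)

# Keep only the two columns of interest and give the nested one a shorter name.
tweets_df = flat[['user.location', 'text']].rename(columns={'user.location': 'location'})
print(tweets_df)
```

For real tweet dicts this produces many columns, so selecting the ones you need afterwards, as done here, keeps the DataFrame manageable.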
