I'm having trouble loading a JSON file in my Python editor so that I can run some analysis on the data in it.
The JSON file is in the following folder: 'C:\Users\Admin\JSON files\file1.JSON'
It contains the following tweet data (this is just one record, there are hundreds in there):
{
"created": "Fri Mar 13 18:09:33 GMT 2014",
"description": "Tweeting the latest Playstation news!",
"favourites_count": 4514,
"followers": 235,
"following": 1345,
"geo_lat": null,
"geo_long": null,
"hashtags": "",
"id": 2144411414,
"is_retweet": false,
"is_truncated": false,
"lang": "en",
"location": "",
"media_urls": "",
"mentions": "",
"name": "Playstation News",
"original_text": null,
"reply_status_id": 0,
"reply_user_id": 0,
"retweet_count": 4514,
"retweet_id": 0,
"score": 0.0,
"screen_name": "SevenPS4",
"source": "<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
"text": "tweetinfohere",
"timezone": "Amsterdam",
"url": null,
"urls": "http://bit.ly/1lcbBW6",
"user_created": "2013-05-19",
"user_id": 13313,
"utc_offset": 3600
}
I am using the following code to try and test this data:
import json
import pandas as pa
z = pa.read_json('C:\Users\Admin\JSON files\file1.JSON')
d = pa.DataFrame.from_dict([{k:v} for k,v in z.iteritems() if k in ["retweet_count", "user_id", "is_retweet"]])
print d.retweet_count.sum()
When I run this, it successfully reads the JSON file and then prints out a list of the retweet_count values like this:
0, 4514
1, 300
2, 450
3, 139
etc.
My questions: How do I actually sum all of the retweet_count/user_id values rather than just listing them as shown above?
How do I then divide this sum by the number of entries to get an average?
How can I take a sample of the JSON data rather than using all of it? (I thought it was d.iloc[:10], but that doesn't work.)
With the 'is_retweet' field in the JSON file, is it possible to count how many values are true and how many are false? That is, within the JSON file, I want the number of tweets that were retweets and the number that weren't.
Thanks in advance. I'm pretty new to this.
z.info()
gives:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 0 to 505
Data columns (total 31 columns):
created             506 non-null object
description         506 non-null object
favourites_count    506 non-null int64
followers           506 non-null int64
following           506 non-null int64
geo_lat             10 non-null float64
geo_long            10 non-null float64
hashtags            506 non-null object
id                  506 non-null int64
is_retweet          506 non-null bool
is_truncated        506 non-null bool
lang                506 non-null object
location            506 non-null object
media_urls          506 non-null object
mentions            506 non-null object
name                506 non-null object
original_text       172 non-null object
reply_status_id     506 non-null int64
reply_user_id       506 non-null int64
retweet_id          506 non-null int64
retweet_count       506 non-null int64
score               506 non-null int64
screen_name         506 non-null object
source              506 non-null object
status_count        506 non-null int64
text                506 non-null object
timezone            415 non-null object
url                 273 non-null object
urls                506 non-null object
user_created        506 non-null object
user_id             506 non-null int64
utc_offset          506 non-null int64
dtypes: bool(2), float64(2), int64(11), object(16)
Why does it show retweet_count and user_id as objects when I run d.info()?
1 solution
#1
d.retweet_count is a list of dictionaries for your retweet_counts, correct?
So to get the sum:
keys = d.retweet_count.keys()
sum = 0  # running total (note: this name shadows the built-in sum())
for items in keys:
    sum += d.retweet_count[items]
To get the average:
avg = float(sum) / len(keys)  # float() avoids integer division on Python 2
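As an alternative to looping over keys, pandas can compute the sum and the average directly on a numeric column. The following is only a minimal sketch: the small frame `z` is built by hand to stand in for whatever `read_json` returns, its `user_id` values other than the first are made up, and selecting columns with `z[[...]]` is a suggestion rather than what the question's code does. Selecting columns this way keeps their original dtypes, which may also be why `d.info()` reported objects for a frame assembled from a list of dictionaries.

```python
import pandas as pd

# Hand-built rows standing in for the frame returned by read_json
# (illustrative values only, not the real file contents)
z = pd.DataFrame({
    "retweet_count": [4514, 300, 450, 139],
    "user_id": [13313, 20001, 20002, 20003],
    "is_retweet": [False, True, False, True],
})

# Select the columns of interest directly from the loaded frame
d = z[["retweet_count", "user_id", "is_retweet"]]

total = d["retweet_count"].sum()      # sum of every value in the column
average = d["retweet_count"].mean()   # the sum divided by the number of rows

print(total, average)
```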
Now, to get a sample, just slice keys:
sample_keys = keys[0:10]
To get the mean of the sample:
sum = 0  # reset the running total before summing the sample
for items in sample_keys:
    sum += d.retweet_count[items]
avg = float(sum) / len(sample_keys)
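For the sampling and the true/false counting parts of the question, here is a rough sketch under the same assumptions: `d` is a hand-built frame with illustrative values, and `random_state=0` is chosen only to make the random draw repeatable. `head` takes the first n rows, `sample` draws rows at random, and `value_counts` tallies each distinct value in a column.

```python
import pandas as pd

# Hand-built frame standing in for the real data (illustrative values only)
d = pd.DataFrame({
    "retweet_count": [4514, 300, 450, 139, 10, 25],
    "is_retweet": [False, True, False, True, False, False],
})

first_rows = d.head(3)                        # first 3 rows, same as d.iloc[:3]
random_rows = d.sample(n=3, random_state=0)   # 3 rows drawn at random

# Mean of the sampled rows only
sample_mean = first_rows["retweet_count"].mean()

# Tally of True and False values in the boolean column
counts = d["is_retweet"].value_counts()
print(counts)

# True counts as 1 when summed, so the sum is the number of retweets
n_retweets = int(d["is_retweet"].sum())
n_not_retweets = len(d) - n_retweets
```

Using `sample` rather than `head` matters if the file is ordered in some way, for example by time, since the first rows may not be representative.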