I scraped a website with Scrapy and stored the data in a JSON file.
Link to the JSON file: https://drive.google.com/file/d/0B6JCr_BzSFMHLURsTGdORmlPX0E/view?usp=sharing
But the file isn't valid JSON, and loading it raises an error:
>>> import json
>>> with open("/root/code/itjuzi/itjuzi/investorinfo.json") as file:
...     data = json.load(file)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/root/anaconda2/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/root/anaconda2/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/root/anaconda2/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 3 column 2 - line 3697 column 2 (char 45 - 3661517)
Then I tried this:
with open('/root/code/itjuzi/itjuzi/investorinfo.json','rb') as f:
    data = f.readlines()
data = map(lambda x: x.decode('unicode_escape'), data)
>>> df = pd.DataFrame(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pd' is not defined
>>> import pandas as pd
>>> df = pd.DataFrame(data)
>>> print pd
<module 'pandas' from '/root/anaconda2/lib/python2.7/site-packages/pandas/__init__.pyc'>
>>> print df
[3697 rows x 1 columns]
Why does this return only one column?
How can I turn the file into valid JSON and read it correctly with pandas?
2 answers
#1
8
Try this:
import json
with open('data.json') as data_file:
    data = json.load(data_file)
Note that json.load reads the whole file into memory before parsing, so this works for files that fit in memory.
EDIT: Your data is not valid JSON. The file contains two top-level arrays, one after the other. Delete the following first 3 lines and it will validate:
[{
    "website": ["\u5341\u65b9\u521b\u6295"]
}]
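If editing the file by hand is not an option, the concatenated values can also be parsed one at a time. The following is a minimal sketch, assuming the file is a sequence of complete top-level JSON arrays; the helper name `parse_concatenated_json` and the sample keys `name` and `rounds` are hypothetical and not taken from the scraped data. It is written for Python 3.

```python
import json


def parse_concatenated_json(text):
    """Return every top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical sample with the same shape as the problem file:
# a small array immediately followed by a second array.
sample = r'''[{
"website": ["\u5341\u65b9\u521b\u6295"]
}][
{"name": "a", "rounds": 1},
{"name": "b", "rounds": 2}
]
'''

documents = parse_concatenated_json(sample)

# Flatten the top-level arrays into one list of record dicts.
records = [item for doc in documents for item in doc]
```

With the real file, `text` would come from reading the file's contents. A list of dicts like `records` can then be passed to `pd.DataFrame(records)`, which builds one column per key instead of a single column of raw lines.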
EDIT2 (since you need to access nested values in the JSON):
Once the data is loaded, you can access single values like this:
data["one"][0]["id"] # will return 'value'
data["two"]["id"] # will return 'value'
data["three"] # will return 'value'
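As a self-contained illustration of those access patterns, here is a small sketch. The document below is hypothetical: the keys `one`, `two`, `three` and `id` only mirror the lines above and are not taken from the scraped file.

```python
import json

# Hypothetical document built to match the access patterns shown above.
text = '{"one": [{"id": "value"}], "two": {"id": "value"}, "three": "value"}'
data = json.loads(text)

first = data["one"][0]["id"]   # list of objects: index first, then key
second = data["two"]["id"]     # nested object: key, then key
third = data["three"]          # plain top-level value

# dict.get returns a default instead of raising KeyError when a key
# is missing, which can help when scraped records are incomplete.
missing = data.get("four")
```

Lists are indexed by position and objects by key, so the mix of `[0]` and `["id"]` depends on how each level of the JSON is nested.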
#2
1
Try the following code. One thing to watch: json.load expects a file object, so text returned by file.read() has to go to json.loads instead:
>>> import json
>>> with open("/root/code/itjuzi/itjuzi/investorinfo.json") as file:
...     data = json.loads(file.read())
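The difference between `json.load` and `json.loads` can be sketched as follows. This is an illustration written for Python 3 that uses an in-memory `io.StringIO` object in place of a real file, with hypothetical sample content. Neither call on its own gets around the "Extra data" error for a file that holds more than one top-level value.

```python
import io
import json

text = '{"website": ["example"]}'  # hypothetical sample content

# json.load takes a file-like object that has a .read() method.
from_file = json.load(io.StringIO(text))

# json.loads takes the text itself.
from_text = json.loads(text)

# Passing a str to json.load fails because str has no .read() method.
try:
    json.load(text)
    load_error = None
except AttributeError as exc:
    load_error = exc

# Text with a second top-level value is rejected by json.loads as well.
try:
    json.loads(text + text)
    extra_data_error = None
except ValueError as exc:
    extra_data_error = exc
```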