I am able to fetch raw JavaScript data from a link and get it into a list, but I am unable to convert it to a pandas DataFrame.
import re
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get('http://www.sgx.com/JsonRead/JsonstData?qryId=RAll', headers=headers)
# page.text is the decoded response body; capture everything after "items:"
data = re.findall(r'items:(.*)', page.text)
print(data)
["[{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',R:'',I:'',M:'',LT:0,C:0,VL:0.000,BV:6099.000,B:'0.007',S:'0.009',SV:7278.300,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.009,P:0,BL:'100',P_:'X',V_:''},{ID:1,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:130.000,B:'0.022',S:'0.025',SV:100.000,O:0,H:0,L:0,V:0.000,SC:'2',PV:0.021,P:0,BL:'100',P_:'X',V_:''},{ID:2,N:'3Cnergy W200528',SIP:'',NC:'1E0W',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:0,B:'',S:'0.004',SV:50.000,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.002,P:0,BL:'100',P_:'X',V_:''}..}]}"
When I check type(data), it is shown as list.
However, when I call pd.DataFrame(data), the output is not a proper dataframe. Instead, it comes out in a weird format: 0 [{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',...
How can I get a neat and tidy dataframe?
4 solutions
#1
0
In this case, the data variable contains one string that looks like JSON but is not valid JSON. I checked the string, and you could use the code below to turn it into a valid JSON string. Note that it may fail on a different data string.
import json
import re
import pandas as pd

data = data[0][:-1]  # remove the last character: "}"
# a key-value pattern; note that the last key-value pair ends with "}" rather than ","
p = re.compile(r'(\w*?):(.*?)([,}])')

def repl(m):
    # m is a regular match object. m.group(1) is the key; m.group(2) is the value;
    # m.group(3) is the separator: "," in the middle of an object, "}" after the last pair.
    # Return the same key-value pair with double quotes around the key; everything else is kept.
    return '"' + m.group(1) + '":' + m.group(2) + m.group(3)

# The original string uses single quotes; replace them with double quotes, or you will get "\'" in the string.
# Then find every key-value pair and use repl to add double quotes around each key.
data = p.sub(repl, data.replace("'", '"'))
data = json.loads(data)
df = pd.DataFrame(data)
EDIT:
1. The last character of your string is "}", so we remove it first.
2. Valid JSON looks like {"key": value}, with the key in double quotes. Your string looks like {key: value}, without the quotes. We need to find every key-value pair and add double quotes around each key while leaving the key name and the value unchanged. The repl function does this: the pattern finds every key-value pair in the data, and each one is replaced by a new string in which the key is double-quoted.
I hope that explains it clearly.
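To see the key-quoting idea in isolation, here is a minimal sketch on a made-up two-record sample in the same shape as the question's data. The sample string, the quote_keys helper and the KEY pattern are illustrative names, not part of the answer above, and the pattern is a slight variant that only quotes a bare word sitting directly after "{" or ",". It assumes no value contains an apostrophe or a ",word:" sequence.

```python
import json
import re

# Hypothetical sample in the same shape as the question's data (two records only).
raw = "[{ID:0,N:'2ndChance W200123',SIP:'',B:'0.007',SV:7278.300},{ID:1,N:'3Cnergy',SIP:'',B:'0.022',SV:100.000}]"

# A bare key is a run of word characters that directly follows "{" or "," and ends with ":".
KEY = re.compile(r'([{,])(\w+):')


def quote_keys(text):
    """Convert {key:'value'} style text into JSON text.

    Assumption: values contain no apostrophes and no ",word:" sequences.
    """
    text = text.replace("'", '"')        # JSON strings need double quotes
    return KEY.sub(r'\1"\2":', text)     # keep the separator, wrap the key in quotes


records = json.loads(quote_keys(raw))
print(len(records), records[0]["N"], records[1]["SV"])
```

The resulting list of dicts is the shape that pd.DataFrame expects, so it can be passed to the constructor directly.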
#2
0
The problem is in how re.findall returns the data: data is a list containing one string. Check len(data); it should return 1. You need to process that string further. Another issue is that your dictionary keys are not strings, so they are treated as undefined variables. You need to turn them into strings, as shown below.
>>> new_list=[{'ID':0,'N':'2ndChance W200123','SIP':'','NC':'CDWW','R':'','I':'','M':'','LT':0,'C':0,'VL':0.000,'BV':6099.000,'B':'0.007','S':'0.009','SV':7278.300,'O':0,'H':0,'L':0,'V':0.000,'SC':'5','PV':0.009,'P':0,'BL':'100','P_':'X','V_':''}]
>>> d=pd.DataFrame(new_list)
>>> d
       B   BL      BV  C  H  I  ID  L  LT  M  ...     PV  P_  R      S  SC  SIP  \
0  0.007  100  6099.0  0  0      0  0   0     ...  0.009   X     0.009   5

       SV    V   VL  V_
0  7278.3  0.0  0.0

[1 rows x 24 columns]
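The point about unquoted keys can be checked with a small sketch. The two sample strings below are made up for illustration; ast.literal_eval is used here only as a safe way to show that Python reads ID and N as variable names unless they are quoted.

```python
import ast

# Made-up one-record samples: the first has bare keys, the second has quoted keys.
unquoted = "{ID:0,N:'2ndChance W200123'}"
quoted = "{'ID':0,'N':'2ndChance W200123'}"

try:
    ast.literal_eval(unquoted)
    unquoted_ok = True
except ValueError:
    # ID and N are parsed as variable names, which literal_eval refuses to evaluate.
    unquoted_ok = False

record = ast.literal_eval(quoted)
print(unquoted_ok, record)
```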
#3
0
In your case you are calling the DataFrame constructor on a list of size 1.
Pandas interprets pd.DataFrame(data) as 'make a dataframe from this one element, which is a string'.
You need to parse the string into JSON, or retrieve JSON directly via requests, and then use the DataFrame constructor.
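A minimal sketch of the difference, using made-up two-record data (the variable names are only for illustration): a one-element list holding a string gives a 1x1 frame, while a list of dicts gives one row per record and one column per key.

```python
import pandas as pd

# One-element list holding a string: pandas sees one row with one string cell.
as_string = ["[{ID:0,N:'2ndChance W200123'},{ID:1,N:'3Cnergy'}]"]
df_string = pd.DataFrame(as_string)

# List of dicts: pandas builds one row per dict and one column per key.
as_records = [{"ID": 0, "N": "2ndChance W200123"}, {"ID": 1, "N": "3Cnergy"}]
df_records = pd.DataFrame(as_records)

print(df_string.shape)   # (1, 1)
print(df_records.shape)  # (2, 2)
```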
There is probably a better, more robust way to do it, but you can just manipulate the string to coerce it into the right format, as below.
import ast
import pandas as pd

# take the single string in data and remove the extra list layer inside it
x = data[0].replace("[", "").replace("]", "")
# strip surrounding whitespace and the outer braces, then split into one string per dict
x = x.strip().strip("{}").split("},{")
# rebuild each dict string with quoted keys (breaks if a value contains "," or ":")
items = [("{" + y + "}").replace('{', '{"').replace(',', ',"').replace(':', '":') for y in x]
# evaluate the strings into dicts
dicts = [ast.literal_eval(item) for item in items]
# create the DataFrame
df = pd.DataFrame.from_records(dicts)
#4
0
This is probably the ugliest answer, but it seems to work. The response from that URL is a JavaScript object, so my thought was to use JSON.stringify to convert it into proper JSON, as shown here:
- https://repl.it/repls/ImmaculateGainsboroEmulators
One way to execute JavaScript's JSON.stringify from Python is to use Selenium's execute_script, as shown below:
In[2]: import json
  ...:
  ...: import pandas as pd
  ...: from selenium import webdriver
  ...: from selenium.webdriver.common.by import By
  ...:
  ...: # Set up headless Chrome
  ...: chrome_options = webdriver.ChromeOptions()
  ...: chrome_options.add_argument("--headless")
  ...: driver = webdriver.Chrome(options=chrome_options)
  ...:
  ...: # Get the response and let the browser serialize it as JSON
  ...: driver.get('http://www.sgx.com/JsonRead/JsonstData?qryId=RAll')
  ...: response = driver.find_element(By.XPATH, '/html/body/pre')
  ...: json_string = driver.execute_script(
  ...:     'return JSON.stringify({})'.format(response.text))
  ...: driver.quit()
  ...:
  ...: # Convert to a Python list of dicts
  ...: json_data = json.loads(json_string)['items']
  ...:
  ...: # Convert to DataFrame
  ...: df = pd.DataFrame(json_data)
In[3]: df.shape
Out[3]: (1049, 24)
In[4]: df.head()
Out[4]:
       B   BL      BV      C      H  I  ID      L     LT  M  ...     PV  P_  R  \
0  0.007  100  6099.0  0.000  0.000      0  0.000  0.000     ...  0.009   X
1  0.022  100     0.1  0.001  0.022      1  0.022  0.022  t  ...  0.021   X
2         100     0.0  0.000  0.000      2  0.000  0.000  t  ...  0.002   X
3  1.110  100    51.0  0.000  1.110      3  1.110  1.110  t  ...  1.110   X
4  0.065  100     0.1  0.000  0.000      4  0.000  0.000     ...  0.080   X

       S  SC  SIP      SV        V    VL  V_
0  0.009   5       7278.3      0.0   0.0
1  0.029   2         99.0   1097.8  49.9
2  0.004   5         50.0      0.0   0.0
3  1.120   A          6.9  68820.0  62.0
4  0.083   2          0.1      0.0   0.0

[5 rows x 24 columns]