Python: Unable to convert a list to a pandas DataFrame

Time: 2021-02-04 00:20:45

I am able to pull the raw JavaScript data from a link into a Python list, but I am unable to convert it to a pandas DataFrame.

import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get('http://www.sgx.com/JsonRead/JsonstData?qryId=RAll', headers=headers)

# page.text is the decoded response body, so the str pattern also works on Python 3
data = re.findall(r'items:(.*)', page.text)

print(data)
["[{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',R:'',I:'',M:'',LT:0,C:0,VL:0.000,BV:6099.000,B:'0.007',S:'0.009',SV:7278.300,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.009,P:0,BL:'100',P_:'X',V_:''},{ID:1,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:130.000,B:'0.022',S:'0.025',SV:100.000,O:0,H:0,L:0,V:0.000,SC:'2',PV:0.021,P:0,BL:'100',P_:'X',V_:''},{ID:2,N:'3Cnergy W200528',SIP:'',NC:'1E0W',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:0,B:'',S:'0.004',SV:50.000,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.002,P:0,BL:'100',P_:'X',V_:''}..}]}"

Checking type(data) shows that it is a list.

However, when I call pd.DataFrame(data), the output is not a proper DataFrame. Instead I get a single cell in a weird format: 0 [{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',...

How can I get a neat and tidy DataFrame?

4 Solutions

#1



In this case, the data variable contains a single string that should be valid JSON but is not. After inspecting the string, I found that the code below turns it into valid JSON. Note that it may fail on a different data string.

import json
import re

import pandas as pd

data = data[0][:-1]  # drop the last character, a stray "}"

# A key-value pattern. Note that the last pair of each object ends with "}" rather than ",".
p = re.compile(r'(\w*?):(.*?)([,}])')

def repl(m):
    # m is a match object: m.group(1) is the key, m.group(2) is the value, and
    # m.group(3) is the separator ("," in the middle of an object, "}" at its end).
    # Return the same pair with the key wrapped in double quotes; everything else is unchanged.
    return '"' + m.group(1) + '":' + m.group(2) + m.group(3)

# Replace the single quotes with double quotes first (JSON only accepts double-quoted strings),
# then use repl to add double quotes around every key.
data = p.sub(repl, data.replace("'", '"'))

data = json.loads(data)
df = pd.DataFrame(data)

EDIT:

1. The last character of your string is "}", so we remove it first.

2. Valid JSON looks like {"key": value}, with the key in double quotes. Your string looks like {key: value}, without them. We need to find every key-value pair and add double quotes around its key while leaving the value untouched. The repl function does this: the pattern finds each key-value pair in the data, and repl replaces it with a new string in which the key is double-quoted.

I hope that explains it clearly.
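To make the two steps concrete, here is a minimal sketch that applies them to a shortened sample. The sample string and the names `sample`, `pair`, and `quote_key` are made up for illustration; the pattern follows the spirit of the answer's key-value pattern and has the same limits (for example, a value that contains a comma followed later by a colon would be mangled).

```python
import json
import re

# Shortened, made-up sample shaped like the captured string,
# including the stray trailing "}".
sample = "[{ID:0,N:'2ndChance W200123',B:'0.007'},{ID:1,N:'3Cnergy',B:'0.022'}]}"

# key, then ":", then the value up to the next "," or "}"
pair = re.compile(r'(\w*?):(.*?)([,}])')

def quote_key(match):
    # Re-emit the pair with only the key wrapped in double quotes.
    return '"' + match.group(1) + '":' + match.group(2) + match.group(3)

trimmed = sample[:-1]                                   # step 1: drop the last "}"
json_text = pair.sub(quote_key, trimmed.replace("'", '"'))  # step 2: quote the keys
records = json.loads(json_text)

print(json_text)
print(records[1]["N"])
```

The resulting `records` is a plain list of dicts, which is the shape `pd.DataFrame` turns into one row per dict.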

#2



The problem is in what re.findall returns: data is a list containing one string. Check len(data); it should return 1. You need to process that string further. Another issue is that your dictionary keys are not strings, so Python treats them as undefined variables. You need to turn them into strings, as shown below.

>>> import pandas as pd
>>> new_list=[{'ID':0,'N':'2ndChance W200123','SIP':'','NC':'CDWW','R':'','I':'','M':'','LT':0,'C':0,'VL':0.000,'BV':6099.000,'B':'0.007','S':'0.009','SV':7278.300,'O':0,'H':0,'L':0,'V':0.000,'SC':'5','PV':0.009,'P':0,'BL':'100','P_':'X','V_':''}]
>>> d=pd.DataFrame(new_list)
>>> d
           B   BL      BV  C  H I  ID  L  LT M ...     PV P_  R      S  SC SIP  \
    0  0.007  100  6099.0  0  0     0  0   0   ...  0.009  X     0.009   5       

           SV    V   VL V_  
    0  7278.3  0.0  0.0     

    [1 rows x 24 columns]
>>> 
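Both points in this answer can be checked on a small example. The sketch below is only an illustration: the `content` string is a made-up stand-in for the real response body (only its `items:` part mirrors the question), and `ast.literal_eval` is used here as one way to show that bare keys are not accepted as Python literals.

```python
import ast
import re

# Made-up response body; only the "items:" part mirrors the question.
content = "{label:'demo', items:[{ID:0,N:'3Cnergy'}]}"

# With one capture group, re.findall returns a list of the captured strings.
data = re.findall(r'items:(.*)', content)
print(len(data))   # 1
print(data[0])     # the whole captured text, still a single string

# Bare keys are names, not literals, so literal_eval rejects them.
try:
    ast.literal_eval("{ID: 0, N: '3Cnergy'}")
    bare_keys_ok = True
except ValueError:
    bare_keys_ok = False

# Once the keys are quoted, the same text evaluates to a dict.
parsed = ast.literal_eval("{'ID': 0, 'N': '3Cnergy'}")
print(bare_keys_ok, parsed)
```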

#3



In your case you are calling the DataFrame constructor on a list of size 1.

pandas interprets pd.DataFrame(data) as 'make a DataFrame from this one element, which is a string'.
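The difference is easy to see on a tiny example. In this sketch the two inputs are made-up stand-ins: `one_string` mimics the list from the question, and `records` is what the data would look like after parsing.

```python
import pandas as pd

# A list holding one string: pandas builds one row and one column from it.
one_string = ["[{ID:0,N:'2ndChance W200123'},{ID:1,N:'3Cnergy'}]"]
df_wrong = pd.DataFrame(one_string)
print(df_wrong.shape)   # (1, 1)

# A list of dicts: one row per dict, one column per key.
records = [{"ID": 0, "N": "2ndChance W200123"}, {"ID": 1, "N": "3Cnergy"}]
df_right = pd.DataFrame(records)
print(df_right.shape)   # (2, 2)
```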

You need to parse the string into JSON, or retrieve the JSON directly via requests, and then use the DataFrame constructor.

There is probably a better, more robust way to do it, but you can manipulate the string to coerce it into the right format, as below.

import ast

import pandas as pd

# take the single string in data and remove the extra list layer from it
x = data[0].replace("[","").replace("]","")

# trim the ends and split into one string per dict
x = x[1:-3].split("},{")

# rebuild each dict string with its keys wrapped in double quotes
items = [("{" + y + "}").replace('{','{"').replace(',',',"').replace(':','":') for y in x]

# evaluate the strings into dicts
dicts = [ast.literal_eval(item) for item in items]

# create the DataFrame
df = pd.DataFrame.from_records(dicts)
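The plain replace calls above also rewrite commas and colons that sit inside quoted values. One possible way to be more robust, sketched here under stated assumptions, is to let a single regex consume each single-quoted string whole, so that only bare identifiers followed by a colon are treated as keys. The names `TOKEN` and `js_to_json` are hypothetical, the sample is made up, and the sketch assumes the values contain no backslash escapes or embedded single quotes.

```python
import json
import re

# Either a whole single-quoted string, or a bare identifier followed by ":".
TOKEN = re.compile(r"'([^']*)'|([A-Za-z_]\w*)\s*:")

def js_to_json(text):
    """Convert a JavaScript-style literal to JSON text (hypothetical helper)."""
    def fix(match):
        if match.group(2) is not None:
            # Bare key: wrap it in double quotes.
            return '"' + match.group(2) + '":'
        # Quoted value: re-emit it as a JSON string, untouched inside.
        return json.dumps(match.group(1))
    return TOKEN.sub(fix, text)

# Made-up sample with a comma and a colon inside a value.
sample = "[{ID:0,N:'Alpha, Inc: A',B:''},{ID:1,N:'Beta',B:'0.022'}]"
records = json.loads(js_to_json(sample))
print(records[0]["N"])
```

The resulting list of dicts can be passed to `pd.DataFrame` or `pd.DataFrame.from_records` as in the answer.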

#4



This is probably the ugliest answer, but it seems to work. The response from that URL is a JavaScript object, so my thought was to use JSON.stringify to convert it into proper JSON.

One way to execute JavaScript's JSON.stringify from Python is to use Selenium's execute_script, as shown below:

In[2]: import json
  ...: 
  ...: import pandas as pd
  ...: from selenium import webdriver
  ...: from selenium.webdriver.common.by import By
  ...: 
  ...: # Set up headless Chrome
  ...: chrome_options = webdriver.ChromeOptions()
  ...: chrome_options.add_argument("--headless")
  ...: driver = webdriver.Chrome(options=chrome_options)
  ...: 
  ...: # Get response and return it as JSON
  ...: driver.get('http://www.sgx.com/JsonRead/JsonstData?qryId=RAll')
  ...: response = driver.find_element(By.XPATH, '/html/body/pre')
  ...: json_string = driver.execute_script(
  ...:     'return JSON.stringify({})'.format(response.text))
  ...: driver.quit()
  ...: 
  ...: # Convert to Python dict
  ...: json_data = json.loads(json_string)['items']
  ...: 
  ...: # Convert to DataFrame
  ...: df = pd.DataFrame(json_data)
In[3]: df.shape
Out[3]: (1049, 24)
In[4]: df.head()
Out[4]: 
       B   BL      BV      C      H I  ID      L     LT  M ...     PV P_  R  \
0  0.007  100  6099.0  0.000  0.000     0  0.000  0.000    ...  0.009  X      
1  0.022  100     0.1  0.001  0.022     1  0.022  0.022  t ...  0.021  X      
2         100     0.0  0.000  0.000     2  0.000  0.000  t ...  0.002  X      
3  1.110  100    51.0  0.000  1.110     3  1.110  1.110  t ...  1.110  X      
4  0.065  100     0.1  0.000  0.000     4  0.000  0.000    ...  0.080  X      

       S  SC SIP      SV        V    VL V_  
0  0.009   5      7278.3      0.0   0.0     
1  0.029   2        99.0   1097.8  49.9     
2  0.004   5        50.0      0.0   0.0     
3  1.120   A         6.9  68820.0  62.0     
4  0.083   2         0.1      0.0   0.0     

[5 rows x 24 columns]
