How do I flatten some columns that contain JSON?

I have a DataFrame df that loads data from a database. Most of the columns hold JSON strings, and some hold lists of JSON objects. For example:

id     name     columnA                               columnB
1     John     {"dist": "600", "time": "0:12.10"}    [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]
2     Mike     {"dist": "600"}                       [{"pos": "1st", "value": "500"},{"pos": "2nd", "value": "300"},{"pos": "total", "value": "800"}]
...

As you can see, not all rows have the same number of elements in the JSON strings of a given column.

What I need to do is keep the normal columns, like id and name, as they are and flatten the JSON columns like so:

id    name   columnA.dist   columnA.time   columnB.pos.1st   columnB.pos.2nd   columnB.pos.3rd   columnB.pos.total
1     John   600            0:12.10        500               300               200               1000
2     Mike   600            NaN            500               300               NaN               800

I have tried using json_normalize like so:

from pandas.io.json import json_normalize
json_normalize(df)

But this fails with a KeyError. What is the correct way of doing this?

2 solutions

#1

Here's a solution that still uses json_normalize(), with custom functions that first convert each column into a format json_normalize() understands.

import ast
import pandas as pd

def only_dict(d):
    '''
    Convert the string representation of a dictionary to a Python dict
    '''
    return ast.literal_eval(d)

def list_of_dicts(ld):
    '''
    Convert the string representation of a list of {"pos": ..., "value": ...}
    dicts to a single dict mapping each pos to its value
    '''
    # Look the keys up by name instead of relying on the order of d.values()
    return {d['pos']: d['value'] for d in ast.literal_eval(ld)}

# pd.json_normalize needs pandas >= 1.0; older versions use
# `from pandas.io.json import json_normalize` instead
A = pd.json_normalize(df['columnA'].apply(only_dict).tolist()).add_prefix('columnA.')
B = pd.json_normalize(df['columnB'].apply(list_of_dicts).tolist()).add_prefix('columnB.pos.')

Finally, join the DataFrames on their common index to get:

df[['id', 'name']].join([A, B])

EDIT: As per the comment by @MartijnPieters, the recommended way of decoding the JSON strings is json.loads(), which is much faster than ast.literal_eval() if you know that the data source is JSON.
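
For reference, here is a minimal sketch of the same approach with json.loads(), assuming every cell holds valid JSON as in the sample rows of the question. The sample DataFrame and the helper name pos_to_value are made up for illustration, and pd.json_normalize assumes pandas 1.0 or later.

```python
import json
import pandas as pd

# Sample data mirroring the question (assumption: columns hold valid JSON strings).
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "3rd", "value": "200"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "1st", "value": "500"}, {"pos": "2nd", "value": "300"}, '
        '{"pos": "total", "value": "800"}]',
    ],
})

def pos_to_value(raw):
    # Hypothetical helper: turn the JSON list into one {pos: value} dict.
    return {d["pos"]: d["value"] for d in json.loads(raw)}

A = pd.json_normalize(df["columnA"].apply(json.loads).tolist()).add_prefix("columnA.")
B = pd.json_normalize(df["columnB"].apply(pos_to_value).tolist()).add_prefix("columnB.pos.")

# Keys missing from a row (time, 3rd) come out as NaN after the join.
result = df[["id", "name"]].join([A, B])
print(result)
```

Besides being faster, json.loads() also accepts JSON literals such as true, false and null, which ast.literal_eval() rejects.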

#2

Create a custom function to flatten columnB, then use pd.concat:

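import json
import pandas as pd

def flatten(js):
    # Parse the JSON string, then return the 'value' entries indexed by 'pos'
    return pd.DataFrame(json.loads(js)).set_index('pos')['value']

# Drop the raw JSON columns and put the expanded ones side by side
pd.concat([df.drop(['columnA', 'columnB'], axis=1),
           df.columnA.apply(json.loads).apply(pd.Series),
           df.columnB.apply(flatten)], axis=1)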
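
Note that the concat above yields bare column names such as dist and 1st. A rough sketch of a variant that produces the prefixed names asked for in the question is shown below. It assumes both columns hold JSON strings, uses a made-up sample DataFrame, and builds the intermediate frames directly from lists of dicts, since apply(pd.Series) tends to be slow on large frames.

```python
import json
import pandas as pd

# Sample rows modeled on the question (assumption: both columns are JSON strings).
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["John", "Mike"],
    "columnA": ['{"dist": "600", "time": "0:12.10"}', '{"dist": "600"}'],
    "columnB": [
        '[{"pos": "1st", "value": "500"}, {"pos": "total", "value": "1000"}]',
        '[{"pos": "total", "value": "800"}]',
    ],
})

# Parse each column once, then build the frames straight from the parsed objects.
a = pd.DataFrame(df["columnA"].map(json.loads).tolist(), index=df.index)
b = pd.DataFrame(
    [{d["pos"]: d["value"] for d in json.loads(raw)} for raw in df["columnB"]],
    index=df.index,
)

flat = pd.concat(
    [
        df.drop(columns=["columnA", "columnB"]),
        a.add_prefix("columnA."),
        b.add_prefix("columnB.pos."),
    ],
    axis=1,
)
print(flat)
```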
