pandas: Convert nested JSON to a flat table

I have a JSON object with the following structure:

{
    "a": "a_1",
    "b": "b_1",
    "c": [{
        "d": "d_1",
        "e": "e_1",
        "f": [],
        "g": "g_1",
        "h": "h_1"
    }, {
        "d": "d_2",
        "e": "e_2",
        "f": [],
        "g": "g_2",
        "h": "h_2"
    }, {
        "d": "d_3",
        "e": "e_3",
        "f": [{
            "i": "i_1",
            "j": "j_1",
            "k": "k_1",
            "l": "l_1",
            "m": []
        }, {
            "i": "i_2",
            "j": "j_2",
            "k": "k_2",
            "l": "l_2",
            "m": [{
                "n": "n_1",
                "o": "o_1",
                "p": "p_1",
                "q": "q_1"
            }]
        }],
        "g": "g_3",
        "h": "h_3"
    }]
}

I want to convert it into a pandas DataFrame of the following form:

How can I achieve that?

The following is my attempt, but the result is oriented completely differently.

Code:

from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

sample_object = { "a": "a_1", "b": "b_1", "c": [{ "d": "d_1", "e": "e_1", "f": [], "g": "g_1", "h": "h_1" }, { "d": "d_2", "e": "e_2", "f": [], "g": "g_2", "h": "h_2" }, { "d": "d_3", "e": "e_3", "f": [{ "i": "i_1", "j": "j_1", "k": "k_1", "l": "l_1", "m": [] }, { "i": "i_2", "j": "j_2", "k": "k_2", "l": "l_2", "m": [{ "n": "n_1", "o": "o_1", "p": "p_1", "q": "q_1" }] }], "g": "g_3", "h": "h_3" }] }
intermediate_json = flatten_json(sample_object)
flattened_df = json_normalize(intermediate_json)
transposed_df = flattened_df.T
print(transposed_df.to_string())

Output:

                 0
a              a_1
b              b_1
c_0_d          d_1
c_0_e          e_1
c_0_g          g_1
c_0_h          h_1
c_1_d          d_2
c_1_e          e_2
c_1_g          g_2
c_1_h          h_2
c_2_d          d_3
c_2_e          e_3
c_2_f_0_i      i_1
c_2_f_0_j      j_1
c_2_f_0_k      k_1
c_2_f_0_l      l_1
c_2_f_1_i      i_2
c_2_f_1_j      j_2
c_2_f_1_k      k_2
c_2_f_1_l      l_2
c_2_f_1_m_0_n  n_1
c_2_f_1_m_0_o  o_1
c_2_f_1_m_0_p  p_1
c_2_f_1_m_0_q  q_1
c_2_g          g_3
c_2_h          h_3

1 solution

#1

Before Reading

This does the job as presented in the question; if there are additional requirements, please let me know.
This can surely be improved; take it as one possible solution to your problem.
Please note that the key to solving your problem lies in looping through a nested dictionary, which can be done with recursive functions.
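
To illustrate what looping through a nested dictionary with a recursive function can look like, here is a minimal sketch. The helper name walk and the small sample data are assumptions made for this illustration only; they are not part of the solution below.

```python
def walk(node, path=()):
    """Yield (path, value) for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        # Recurse into each value, extending the path with its key
        for key, value in node.items():
            yield from walk(value, path + (key,))
    elif isinstance(node, list):
        # Recurse into each item, extending the path with its position
        for index, item in enumerate(node):
            yield from walk(item, path + (index,))
    else:
        # Anything else is treated as a leaf value
        yield path, node


# Hypothetical, shortened version of the JSON from the question
nested = {"a": "a_1", "c": [{"d": "d_1", "f": []}, {"d": "d_2", "f": [{"i": "i_1"}]}]}
leaves = list(walk(nested))
for path, value in leaves:
    print(path, value)
```

Empty lists produce no leaves at all, which is one reason the solution below has to track list positions separately.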

Solution

With _dict as your nested dictionary, you can use a recursive function and a few tricks to achieve your goal:

I first write a function iterate_dict that recursively reads your dictionary and stores the results in a new dict whose keys and values are the column names and contents of your final pd.DataFrame:

import numpy as np
import pandas as pd

def iterate_dict(_dict, _fdict, level=0):
    """Recursively collect column values from _dict into _fdict (modified in place)."""
    for k, v in _dict.items():
        # String value: append it to the column for this key
        if isinstance(v, str):
            # First time this key is seen: pad with -1 placeholders to shift the column
            if k not in _fdict:
                _fdict[k] = [-1] * (level - 1)
            _fdict[k].append(v)

        # List value: record the positions of its items, then recurse into them
        elif isinstance(v, list):
            if k not in _fdict:
                _fdict[k] = [-1] * level  # same padding trick
                _fdict[k].extend(range(len(v)))  # positions 0, 1, 2, ...
            elif len(v) > 0:
                # Continue numbering after the last recorded position
                _start = 0 if len(_fdict[k]) == 0 else int(_fdict[k][-1]) + 1
                _fdict[k].extend(range(_start, _start + len(v)))
            for _d in v:  # each list item is a dict: recurse one level deeper
                iterate_dict(_d, _fdict, level=level + 1)

Then another function, to_series, turns the values of each future column into a pd.Series, replacing the -1 placeholders with np.nan:

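def to_series(_fvalues):
    _fvalues = list(_fvalues)  # copy, so the list stored in _fdict is left untouched
    if _fvalues and _fvalues[0] == -1:
        _fvalues.insert(0, -1)  # trick: shift the column down by one more row
    return pd.Series(_fvalues).replace(-1, np.nan)  # -1 placeholders become NaN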
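
As a small illustration of the placeholder replacement, the sketch below builds one column by hand. The list named values is a made-up example of what a shifted column might contain; only pd.Series and Series.replace are used.

```python
import numpy as np
import pandas as pd

# Hypothetical column content: two -1 placeholders followed by real values
values = [-1, -1, "i_1", "i_2"]

# Replace the placeholders with NaN so the first two rows are empty
s = pd.Series(values).replace(-1, np.nan)
print(s)
```

The real values end up on rows 2 and 3, which is the shift the placeholders are meant to produce.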

Then use it like this:

_fdict = dict()  # will hold the content of the future columns
iterate_dict(_dict, _fdict)  # fills _fdict in place
print(_fdict)
{'a': ['a_1'],
 'b': ['b_1'],
 'c': [0, 1, 2],
 'd': ['d_1', 'd_2', 'd_3'],
 'e': ['e_1', 'e_2', 'e_3'],
 'f': [-1, 0, 1],
 'g': ['g_1', 'g_2', 'g_3'],
 'h': ['h_1', 'h_2', 'h_3'],
 'i': [-1, 'i_1', 'i_2'],
 'j': [-1, 'j_1', 'j_2'],
 'k': [-1, 'k_1', 'k_2'],
 'l': [-1, 'l_1', 'l_2'],
 'm': [-1, -1, 0],
 'n': [-1, -1, 'n_1'],
 'o': [-1, -1, 'o_1'],
 'p': [-1, -1, 'p_1'],
 'q': [-1, -1, 'q_1']}
# As the -1 placeholders show, the columns still need to be shifted; the custom to_series() function does that

Then create your pd.DataFrame:

df = pd.DataFrame({k: to_series(v) for k, v in _fdict.items()}).ffill()
# Forward-fill so that parent values are repeated on the rows of their children
print(df)
    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o  \
0  a_1  b_1  0.0  d_1  e_1  NaN  g_1  h_1  NaN  NaN  NaN  NaN  NaN  NaN  NaN   
1  a_1  b_1  1.0  d_2  e_2  NaN  g_2  h_2  NaN  NaN  NaN  NaN  NaN  NaN  NaN   
2  a_1  b_1  2.0  d_3  e_3  0.0  g_3  h_3  i_1  j_1  k_1  l_1  NaN  NaN  NaN   
3  a_1  b_1  2.0  d_3  e_3  1.0  g_3  h_3  i_2  j_2  k_2  l_2  0.0  n_1  o_1   

     p    q  
0  NaN  NaN  
1  NaN  NaN  
2  NaN  NaN  
3  p_1  q_1
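
The last step relies on two pandas behaviours: a DataFrame built from Series of unequal length aligns them on the index and pads the shorter ones with NaN, and ffill() then copies the last seen value downwards. The following minimal sketch shows both on a made-up two-column example; the names columns, df_demo and filled are assumptions for this illustration.

```python
import pandas as pd

# Two hypothetical columns of different length
columns = {
    "a": pd.Series(["a_1"]),
    "d": pd.Series(["d_1", "d_2", "d_3"]),
}

# Series are aligned on their index; "a" is padded with NaN on rows 1 and 2
df_demo = pd.DataFrame(columns)
print(df_demo)

# Forward-fill repeats "a_1" on every row
filled = df_demo.ffill()
print(filled)
```

Note that ffill() fills every column, so it would also repeat child values onto later rows if a child column ended earlier than its siblings.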
