pandas: Convert nested JSON to a flat table

Time: 2022-09-05 15:49:02

I have a JSON of the following structure:

{
    "a": "a_1",
    "b": "b_1",
    "c": [{
        "d": "d_1",
        "e": "e_1",
        "f": [],
        "g": "g_1",
        "h": "h_1"
    }, {
        "d": "d_2",
        "e": "e_2",
        "f": [],
        "g": "g_2",
        "h": "h_2"
    }, {
        "d": "d_3",
        "e": "e_3",
        "f": [{
            "i": "i_1",
            "j": "j_1",
            "k": "k_1",
            "l": "l_1",
            "m": []
        }, {
            "i": "i_2",
            "j": "j_2",
            "k": "k_2",
            "l": "l_2",
            "m": [{
                "n": "n_1",
                "o": "o_1",
                "p": "p_1",
                "q": "q_1"
            }]
        }],
        "g": "g_3",
        "h": "h_3"
    }]
}

And I want to convert it into a pandas DataFrame of the following form:

[Image: pandas: convert nested JSON to a flat table]

How can I achieve that?


The following is my attempt, but the result is oriented in a completely different direction.

Code:

from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

sample_object = { "a": "a_1", "b": "b_1", "c": [{ "d": "d_1", "e": "e_1", "f": [], "g": "g_1", "h": "h_1" }, { "d": "d_2", "e": "e_2", "f": [], "g": "g_2", "h": "h_2" }, { "d": "d_3", "e": "e_3", "f": [{ "i": "i_1", "j": "j_1", "k": "k_1", "l": "l_1", "m": [] }, { "i": "i_2", "j": "j_2", "k": "k_2", "l": "l_2", "m": [{ "n": "n_1", "o": "o_1", "p": "p_1", "q": "q_1" }] }], "g": "g_3", "h": "h_3" }] }
intermediate_json = flatten_json(sample_object)
flattened_df = json_normalize(intermediate_json)
transposed_df = flattened_df.T
print(transposed_df.to_string())

Output:

                 0
a              a_1
b              b_1
c_0_d          d_1
c_0_e          e_1
c_0_g          g_1
c_0_h          h_1
c_1_d          d_2
c_1_e          e_2
c_1_g          g_2
c_1_h          h_2
c_2_d          d_3
c_2_e          e_3
c_2_f_0_i      i_1
c_2_f_0_j      j_1
c_2_f_0_k      k_1
c_2_f_0_l      l_1
c_2_f_1_i      i_2
c_2_f_1_j      j_2
c_2_f_1_k      k_2
c_2_f_1_l      l_2
c_2_f_1_m_0_n  n_1
c_2_f_1_m_0_o  o_1
c_2_f_1_m_0_p  p_1
c_2_f_1_m_0_q  q_1
c_2_g          g_3
c_2_h          h_3

1 Solution

#1

Before Reading

  • This does the job as presented in the question; if you have additional requirements, please let me know.
  • This can surely be improved; take it as one possible solution to your problem.
  • Note that the key to solving your problem lies in looping through a nested dictionary, which can be done with recursive functions.
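The recursive idea in the last point can be sketched in isolation. This is only a minimal illustration, not the solution itself: the helper name walk and the small sample dictionary are assumptions made for the example. It visits every leaf of a nested dict/list structure and reports how many lists deep the leaf sits.

```python
def walk(node, depth=0, path=()):
    """Yield (depth, path, value) for every leaf (non-dict, non-list value)."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, depth, path + (key,))
    elif isinstance(node, list):
        # Each list item sits one nesting level deeper.
        for index, item in enumerate(node):
            yield from walk(item, depth + 1, path + (index,))
    else:
        yield depth, path, node


# Hypothetical, shortened version of the JSON from the question.
sample = {"a": "a_1", "c": [{"d": "d_1", "f": []}, {"d": "d_2", "f": [{"i": "i_1"}]}]}
leaves = list(walk(sample))
for depth, path, value in leaves:
    print(depth, path, value)
```

Empty lists produce no leaves here, which is why a flattening routine needs some placeholder for them if the rows are to stay aligned.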

Solution

With _dict as your nested dictionary, you can use a recursive function and a few tricks to achieve your goal.

I first write a function, iterate_dict, that recursively reads your dictionary and stores the results in a new dict whose keys and values are the column names and contents of your final pd.DataFrame:

def iterate_dict(_dict, _fdict, level=0):
    for k in _dict:  # Iterate over the keys of the dict

        # If the value is a string, record it in _fdict
        if isinstance(_dict[k], str):
            # First time this key is seen: initialize its list
            if k not in _fdict:
                _fdict[k] = [-1] * (level - 1)  # Trick to shift columns: pad with -1
            # Append the value
            _fdict[k].append(_dict[k])

        # If the value is a list
        if isinstance(_dict[k], list):
            if k not in _fdict:  # First time this key is seen: initialize
                _fdict[k] = [-1] * level  # Same padding trick as above
                # Extend with the item indices (0, 1, 2, ...)
                _fdict[k].extend(range(len(_dict[k])))
            else:
                if len(_dict[k]) > 0:
                    # Continue numbering after the last stored index (a trailing -1 pad gives 0)
                    _start = 0 if len(_fdict[k]) == 0 else (int(_fdict[k][-1]) + 1)
                    _fdict[k].extend(range(_start, _start + len(_dict[k])))
            # Recurse into each item of the list, one level deeper
            for _d in _dict[k]:
                iterate_dict(_d, _fdict, level=level + 1)

Then another function, to_series, turns the values of each future column into a pd.Series, replacing the -1 placeholders with np.nan:

import numpy as np
import pandas as pd

def to_series(_fvalues):
    if _fvalues[0] == -1:
        _fvalues.insert(0, -1)  # Trick to shift again (modifies the list in place)
    return pd.Series(_fvalues).replace(-1, np.nan)  # Replace the -1 placeholders with NaN
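As a side note, the effect of the replace step can be checked on its own. The sketch below is an illustration only: the variable names index_column and value_column are made up for the example, and the lists are shortened stand-ins for columns such as m and i.

```python
import numpy as np
import pandas as pd

# Integer column with -1 placeholders, similar to the 'm' list.
index_column = pd.Series([-1, -1, 0]).replace(-1, np.nan)

# Mixed column: a -1 placeholder followed by strings, similar to 'i'.
value_column = pd.Series([-1, "i_1", "i_2"]).replace(-1, np.nan)

print(index_column.dtype)     # a float dtype, because NaN is a float
print(value_column.tolist())
```

This is why the index columns show up as 0.0, 1.0, 2.0 rather than integers in the final table. It also means that a genuine value of -1 in the data would be replaced as well, so a different sentinel may be needed if -1 can occur.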

Then use it like this:

_fdict = dict()  # The future columns' content
iterate_dict(_dict, _fdict)  # Do the job
print(_fdict)
{'a': ['a_1'],
 'b': ['b_1'],
 'c': [0, 1, 2],
 'd': ['d_1', 'd_2', 'd_3'],
 'e': ['e_1', 'e_2', 'e_3'],
 'f': [-1, 0, 1],
 'g': ['g_1', 'g_2', 'g_3'],
 'h': ['h_1', 'h_2', 'h_3'],
 'i': [-1, 'i_1', 'i_2'],
 'j': [-1, 'j_1', 'j_2'],
 'k': [-1, 'k_1', 'k_2'],
 'l': [-1, 'l_1', 'l_2'],
 'm': [-1, -1, 0],
 'n': [-1, -1, 'n_1'],
 'o': [-1, -1, 'o_1'],
 'p': [-1, -1, 'p_1'],
 'q': [-1, -1, 'q_1']}
# Here you can see that a shift is required; use the custom to_series() function

Then create your pd.DataFrame:

df = pd.DataFrame({k: to_series(v) for k, v in _fdict.items()}).ffill()
# Don't forget the forward fill (ffill), so that parent values are repeated on child rows
print(df)
    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o  \
0  a_1  b_1  0.0  d_1  e_1  NaN  g_1  h_1  NaN  NaN  NaN  NaN  NaN  NaN  NaN   
1  a_1  b_1  1.0  d_2  e_2  NaN  g_2  h_2  NaN  NaN  NaN  NaN  NaN  NaN  NaN   
2  a_1  b_1  2.0  d_3  e_3  0.0  g_3  h_3  i_1  j_1  k_1  l_1  NaN  NaN  NaN   
3  a_1  b_1  2.0  d_3  e_3  1.0  g_3  h_3  i_2  j_2  k_2  l_2  0.0  n_1  o_1   

     p    q  
0  NaN  NaN  
1  NaN  NaN  
2  NaN  NaN  
3  p_1  q_1
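The last step relies on two pandas behaviours: Series of different lengths are aligned on their index when a DataFrame is built from them, and ffill copies the last valid value downwards. A minimal sketch, using a made-up three-column miniature of the data above (the names columns, raw and filled are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the answer's columns: Series of unequal length.
columns = {
    "a": pd.Series(["a_1"]),
    "d": pd.Series(["d_1", "d_2"]),
    "i": pd.Series([np.nan, np.nan, "i_1"]),
}

# Series are aligned on their index; the shorter ones are padded with NaN.
raw = pd.DataFrame(columns)

# ffill copies the last valid value downwards within each column.
filled = raw.ffill()
print(filled)
```

Leading NaN values, as in column i, are left untouched by ffill, which is what keeps the deeper columns empty on the rows that have no nested items.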
