在Python中展平Dicts或Lists的通用JSON列表

时间:2022-10-18 18:08:43

I have a set of arbitrary JSON data that has been parsed in Python to lists of dicts and lists of varying depth. I need to be able to 'flatten' this into a list of dicts. Example below:

我有一组任意JSON数据,已经在Python中解析为dicts列表和不同深度的列表。我需要能够将其“扁平化”为一系列词汇。示例如下:

Source Data Example 1

源数据示例1

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C'},
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 1

期望的结果示例1

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

This is easy enough for this simple example, but I don't always have this exact structure of list o f dicts, with one additional layer of list of dicts. In some cases, I may have additional nesting that needs to follow the same methodology. As a result, I think I will need recursion and I cannot seem to get this to work.

对于这个简单的例子来说这很容易,但是我并不总是有这个列表的精确结构,还有一层额外的dicts列表。在某些情况下,我可能需要遵循相同的方法进行额外的嵌套。结果,我想我需要递归,我似乎无法让它工作。

Proposed Methodology

拟议的方法论

1) For Each List of Dicts, prepend each key with a 'path' that provides the name of the parent key. In the example above, 'industry' was the key which contained a list of dicts, so each of the children dicts in the list have 'industry' added to them.

1)对于每个Dicts列表,在每个键前面加上一个提供父键名称的“路径”。在上面的例子中,“行业”是包含一系列dicts的关键,因此列表中的每个子项都添加了“行业”。

2) Add 'Parent' Items to Each Dict within List - in this case, the 'name' and 'industry' were the items in the top level list of dicts, and so the 'name' key/value was added to each of the items in 'industry'.

2)在列表中的每个词典中添加“父”项 - 在这种情况下,“名称”和“行业”是*别的词典列表中的项目,因此“名称”键/值被添加到每个词典中“工业”中的项目。

I can imagine some scenarios where you had multiple lists of dicts or even dicts of dicts in the 'Parent' items and applying each of these sub-trees to the children list of dicts would not work. As a result, I'll assume that the 'parent' items are always simple key/value pairs.

我可以想象一些场景,你在'Parent'项目中有多个dicts列表甚至dicts字符串,并且将这些子树中的每一个应用到子序列表中都不起作用。因此,我将假设“父”项始终是简单的键/值对。

One more example to try to illustrate the potential variabilities in data structure that need to be handled.

再举一个例子,试图说明需要处理的数据结构中的潜在变化。

Source Data Example 2

源数据示例2

[{u'industry': [
   {u'id': u'112', u'name': u'A'},
   {u'id': u'132', u'name': u'B'},
   {u'id': u'110', u'name': u'C', u'company': [
                            {u'id':'500', u'symbol':'X'},
                            {u'id':'502', u'symbol':'Y'},
                            {u'id':'504', u'symbol':'Z'},
                  ]
   },
   ],
  u'name': u'materials'},
 {u'industry': {u'id': u'210', u'name': u'A'},
  u'name': u'conglomerates'}
]

Desired Result Example 2

期望的结果示例2

[{u'name':u'materials', u'industry_id':u'112', u'industry_name':u'A'},
 {u'name':u'materials', u'industry_id':u'132', u'industry_name':u'B'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'500', u'company_symbol':'X'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'502', u'company_symbol':'Y'},
 {u'name':u'materials', u'industry_id':u'110', u'industry_name':u'C', 
                        u'company_id':'504', u'company_symbol':'Z'},
 {u'name':u'conglomerates', u'industry_id':u'210', u'industry_name':u'A'},
]

I have looked at several other examples and I can't seem to find one that works in these example cases.

我已经看了几个其他的例子,我似乎找不到一个适用于这些例子的例子。

Any suggestions or pointers? I've spent some time trying to build a recursive function to handle this with no luck after many hours...

有什么建议或指示?我花了一些时间试图建立一个递归函数来处理这个问题,经过几个小时没有运气...

UPDATED WITH ONE FAILED ATTEMPT

更新一次失败的尝试

def _flatten(sub_tree, flattened=[], path="", parent_dict={}, child_dict={}):
    if type(sub_tree) is list:
        for i in sub_tree:
            flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=path,
                                      parent_dict=parent_dict,
                                      child_dict=child_dict
                                      )
                            )
        return flattened
    elif type(sub_tree) is dict:
        lists = {}
        new_parent_dict = {}
        new_child_dict = {}
        for key, value in sub_tree.items():
            new_path = path + '_' + key
            if type(value) is dict:
                for key2, value2 in value.items():
                    new_path2 = new_path + '_' + key2
                    new_parent_dict[new_path2] = value2
            elif type(value) is unicode:
                new_parent_dict[key] = value
            elif type(value) is list:
                lists[new_path] = value
        new_parent_dict.update(parent_dict)
        for key, value in lists.items():
            for i in value:
                flattened.append(_flatten(i,
                                      flattened=flattened,
                                      path=key,
                                      parent_dict=new_parent_dict,
                                      )
            )
        return flattened

The result I get is a 231x231 matrix of 'None' objects - clearly I'm getting into trouble with the recursion running away.

我得到的结果是一个231x231的“无”对象矩阵 - 显然我遇到了递归逃跑的麻烦。

I've tried a few additional 'start from scratch' attempts and failed with a similar failure mode.

我尝试了一些额外的“从头开始”尝试,但在类似的故障模式下失败了。

1 个解决方案

#1


4  

Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject which will later require the recursion. The second, flatten, actually iterates of a list of objects, makes the recursive calls and takes care of reconstructing the final object for each iteration.

好的。我的解决方案有两个功能。第一个,splitObj,负责将对象拆分为平面数据和子列表或子对象,稍后需要递归。第二个,展平,实际上迭代一个对象列表,进行递归调用并负责为每次迭代重建最终对象。

def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub

Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:

请注意,这并不能完全产生您想要的输出。原因是你的输出实际上是不一致的。在第二个示例中,对于存在嵌套在行业中的公司的情况,嵌套在输出中不可见。相反,我的输出将生成industry_company_id和industry_company_symbol:

>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]

>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]

#1


4  

Alright. My solution comes with two functions. The first, splitObj, takes care of splitting an object into the flat data and the sublist or subobject which will later require the recursion. The second, flatten, actually iterates of a list of objects, makes the recursive calls and takes care of reconstructing the final object for each iteration.

好的。我的解决方案有两个功能。第一个,splitObj,负责将对象拆分为平面数据和子列表或子对象,稍后需要递归。第二个,展平,实际上迭代一个对象列表,进行递归调用并负责为每次迭代重建最终对象。

def splitObj (obj, prefix = None):
    '''
    Split the object, returning a 3-tuple with the flat object, optionally
    followed by the key for the subobjects and a list of those subobjects.
    '''
    # copy the object, optionally add the prefix before each key
    new = obj.copy() if prefix is None else { '{}_{}'.format(prefix, k): v for k, v in obj.items() }

    # try to find the key holding the subobject or a list of subobjects
    for k, v in new.items():
        # list of subobjects
        if isinstance(v, list):
            del new[k]
            return new, k, v
        # or just one subobject
        elif isinstance(v, dict):
            del new[k]
            return new, k, [v]
    return new, None, None

def flatten (data, prefix = None):
    '''
    Flatten the data, optionally with each key prefixed.
    '''
    # iterate all items
    for item in data:
        # split the object
        flat, key, subs = splitObj(item, prefix)

        # just return fully flat objects
        if key is None:
            yield flat
            continue

        # otherwise recursively flatten the subobjects
        for sub in flatten(subs, key):
            sub.update(flat)
            yield sub

Note that this does not exactly produce your desired output. The reason for this is that your output is actually inconsistent. In the second example, for the case where there are companies nested in the industries, the nesting isn’t visible in the output. So instead, my output will generate industry_company_id and industry_company_symbol:

请注意,这并不能完全产生您想要的输出。原因是你的输出实际上是不一致的。在第二个示例中,对于存在嵌套在行业中的公司的情况,嵌套在输出中不可见。相反,我的输出将生成industry_company_id和industry_company_symbol:

>>> ex1 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'id': u'110', u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]
>>> ex2 = [{u'industry': [{u'id': u'112', u'name': u'A'},
                          {u'id': u'132', u'name': u'B'},
                          {u'company': [{u'id': '500', u'symbol': 'X'},
                                        {u'id': '502', u'symbol': 'Y'},
                                        {u'id': '504', u'symbol': 'Z'}],
                           u'id': u'110',
                           u'name': u'C'}],
            u'name': u'materials'},
           {u'industry': {u'id': u'210', u'name': u'A'}, u'name': u'conglomerates'}]

>>> pprint(list(flatten(ex1)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_id': u'110', 'industry_name': u'C', u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]
>>> pprint(list(flatten(ex2)))
[{'industry_id': u'112', 'industry_name': u'A', u'name': u'materials'},
 {'industry_id': u'132', 'industry_name': u'B', u'name': u'materials'},
 {'industry_company_id': '500',
  'industry_company_symbol': 'X',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '502',
  'industry_company_symbol': 'Y',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_company_id': '504',
  'industry_company_symbol': 'Z',
  'industry_id': u'110',
  'industry_name': u'C',
  u'name': u'materials'},
 {'industry_id': u'210', 'industry_name': u'A', u'name': u'conglomerates'}]