How to build a JSON file with nested records from a flat data table?

Time: 2020-12-24 15:26:53

I'm looking for a Python technique to build a nested JSON file from a flat table in a pandas data frame. For example, how could a pandas data frame table such as:

   teamname  member firstname lastname  orgname         phone        mobile
0         1       0      John      Doe     Anon  916-555-1234
1         1       1      Jane      Doe     Anon  916-555-4321  916-555-7890
2         2       0    Mickey    Moose  Moosers  916-555-0000  916-555-1111
3         2       1     Minny    Moose  Moosers  916-555-2222

be taken and exported to a JSON that looks like:

{
    "teams": [
        {
            "teamname": "1",
            "members": [
                {
                    "firstname": "John",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-1234",
                    "mobile": ""
                },
                {
                    "firstname": "Jane",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-4321",
                    "mobile": "916-555-7890"
                }
            ]
        },
        {
            "teamname": "2",
            "members": [
                {
                    "firstname": "Mickey",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-0000",
                    "mobile": "916-555-1111"
                },
                {
                    "firstname": "Minny",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-2222",
                    "mobile": ""
                }
            ]
        }
    ]
}

I have tried doing this by creating a dict of dicts and dumping to JSON. This is my current code:

data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')
memberDictTuple = [] 

for index, row in data.iterrows():
    dataRow = row
    rowDict = dict(zip(columnList[2:], dataRow[2:]))

    teamRowDict = {columnList[0]:int(dataRow[0])}

    memberId = tuple(row[1:2])
    memberId = memberId[0]

    teamName = tuple(row[0:1])
    teamName = teamName[0]

    memberDict1 = {int(memberId):rowDict}
    memberDict2 = {int(teamName):memberDict1}

    memberDictTuple.append(memberDict2)

memberDictTuple = tuple(memberDictTuple)
formattedJson = json.dumps(memberDictTuple, indent = 4, sort_keys = True)
print(formattedJson)

This produces the following output. Each item is nested at the correct level under "teamname" 1 or 2, but records should be nested together if they have the same teamname. How can I fix this so that teamname 1 and teamname 2 each have 2 records nested within?

[
    {
        "1": {
            "0": {
                "email": "john.doe@wildlife.net", 
                "firstname": "John", 
                "lastname": "Doe", 
                "mobile": "none", 
                "orgname": "Anon", 
                "phone": "916-555-1234"
            }
        }
    }, 
    {
        "1": {
            "1": {
                "email": "jane.doe@wildlife.net", 
                "firstname": "Jane", 
                "lastname": "Doe", 
                "mobile": "916-555-7890", 
                "orgname": "Anon", 
                "phone": "916-555-4321"
            }
        }
    }, 
    {
        "2": {
            "0": {
                "email": "mickey.moose@wildlife.net", 
                "firstname": "Mickey", 
                "lastname": "Moose", 
                "mobile": "916-555-1111", 
                "orgname": "Moosers", 
                "phone": "916-555-0000"
            }
        }
    }, 
    {
        "2": {
            "1": {
                "email": "minny.moose@wildlife.net", 
                "firstname": "Minny", 
                "lastname": "Moose", 
                "mobile": "none", 
                "orgname": "Moosers", 
                "phone": "916-555-2222"
            }
        }
    }
]

2 solutions

#1


1  

This is a solution that works and creates the desired JSON format. First, I grouped my dataframe by the appropriate columns. Then, instead of creating a dictionary for each column heading/record pair (and losing the data order), I created them as lists of tuples and transformed each list into an OrderedDict. Another OrderedDict was created for the two columns that everything else was grouped by. Precise layering between lists and OrderedDicts was necessary for the JSON conversion to produce the correct format. Also note that when dumping to JSON, sort_keys must be set to False, or all your OrderedDicts will be rearranged into alphabetical order.

import pandas
import json
from collections import OrderedDict

inputExcel = 'E:\\teams.xlsx'
exportJson = 'E:\\teams.json'

data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

# This creates a tuple of column headings for later use matching them with column data
cols = []
columnList = list(data[0:])
for col in columnList:
    cols.append(str(col))
columnList = tuple(cols)

#This groups the dataframe by the 'teamname' and 'members' columns
grouped = data.groupby(['teamname', 'members']).first()

#This creates a reference to the index level of the groups
groupnames = data.groupby(["teamname", "members"]).grouper.levels
tm = (groupnames[0])

#Create a list to add team records to at the end of the first 'for' loop
teamsList = []

for teamN in tm:
    teamN = int(teamN)  #added this in to prevent TypeError: 1 is not JSON serializable
    tempList = []   #Create a temporary list to add each record to
    for index, row in grouped.iterrows():
        dataRow = row
        if index[0] == teamN:  #Select the record in each row of the grouped dataframe if its index matches the team number

            #In order to have the JSON records come out in the same order, I had to first create a list of tuples, then convert to an Ordered Dict
            rowDict = ([(columnList[2], dataRow[0]), (columnList[3], dataRow[1]), (columnList[4], dataRow[2]), (columnList[5], dataRow[3]), (columnList[6], dataRow[4]), (columnList[7], dataRow[5])])
            rowDict = OrderedDict(rowDict)
            tempList.append(rowDict)
    #Create another Ordered Dict to keep 'teamname' and the list of members from the temporary list sorted
    t = ([('teamname', str(teamN)), ('members', tempList)])
    t= OrderedDict(t)

    #Append the Ordered Dict to the empty list of teams created earlier
    ListX = t
    teamsList.append(ListX)


#Create a final dictionary with a single item: the list of teams
teams = {"teams":teamsList} 

#Dump to JSON format
formattedJson = json.dumps(teams, indent = 1, sort_keys = False) #sort_keys MUST be set to False, or all dictionaries will be alphabetized
formattedJson = formattedJson.replace("NaN", '"NULL"') #"NaN" is the NULL format in pandas dataframes - must be replaced with "NULL" to be a valid JSON file
print(formattedJson)

#Export to JSON file
#The 'with' block closes the file once the write has finished
with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)

print("\n\nExport to JSON Complete")
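
For comparison, the same teams/members layout can be sketched more compactly with groupby and to_dict. This is only a minimal sketch, not the code from the answer above. It assumes a recent pandas and Python 3.7+ (where plain dicts keep insertion order). The inline sample frame stands in for the Excel sheet, the column names follow the table in the question ("member" rather than "members"), and build_teams is a hypothetical helper name. Missing values are filled with empty strings to match the desired output in the question.

```python
import json

import pandas as pd

# Sample rows mirroring the table in the question (a stand-in for the Excel sheet).
df = pd.DataFrame(
    {
        "teamname": [1, 1, 2, 2],
        "member": [0, 1, 0, 1],
        "firstname": ["John", "Jane", "Mickey", "Minny"],
        "lastname": ["Doe", "Doe", "Moose", "Moose"],
        "orgname": ["Anon", "Anon", "Moosers", "Moosers"],
        "phone": ["916-555-1234", "916-555-4321", "916-555-0000", "916-555-2222"],
        "mobile": [None, "916-555-7890", "916-555-1111", None],
    }
)


def build_teams(frame):
    """Hypothetical helper: nest the member rows under their team."""
    # Replace missing values with "" so json.dumps never emits the non-standard NaN token.
    frame = frame.fillna("")
    teams = []
    # sort=False keeps the teams in the order they first appear in the table.
    for teamname, group in frame.groupby("teamname", sort=False):
        # Drop the two grouping columns and turn each remaining row into a dict.
        members = group.drop(columns=["teamname", "member"]).to_dict(orient="records")
        teams.append({"teamname": str(teamname), "members": members})
    return {"teams": teams}


result = build_teams(df)
print(json.dumps(result, indent=4))
```

Because the dicts keep the column order on Python 3.7+, OrderedDict is not needed in this sketch, but sort_keys should still be left at its default of False.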

#2


0  

With some input from @root I used a different tack and came up with the following code, which seems to get most of the way there:

import pandas
import json
from collections import defaultdict

inputExcel = 'E:\\teamsMM.xlsx'
exportJson = 'E:\\teamsMM.json'

data = pandas.read_excel(inputExcel, sheetname = 'SCAT Teams', encoding = 'utf8')

grouped = data.groupby(['teamname', 'members']).first()

results = defaultdict(lambda: defaultdict(dict))

for t in grouped.itertuples():
    for i, key in enumerate(t.Index):
        if i ==0:
            nested = results[key]
        elif i == len(t.Index) -1:
            nested[key] = t
        else:
            nested = nested[key]


formattedJson = json.dumps(results, indent = 4)

formattedJson = '{\n"teams": [\n' + formattedJson +'\n]\n }'

with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)

The resulting JSON file is this:

{
"teams": [
{
    "1": {
        "0": [
            [
                1, 
                0
            ], 
            "John", 
            "Doe", 
            "Anon", 
            "916-555-1234", 
            "none", 
            "john.doe@wildlife.net"
        ], 
        "1": [
            [
                1, 
                1
            ], 
            "Jane", 
            "Doe", 
            "Anon", 
            "916-555-4321", 
            "916-555-7890", 
            "jane.doe@wildlife.net"
        ]
    }, 
    "2": {
        "0": [
            [
                2, 
                0
            ], 
            "Mickey", 
            "Moose", 
            "Moosers", 
            "916-555-0000", 
            "916-555-1111", 
            "mickey.moose@wildlife.net"
        ], 
        "1": [
            [
                2, 
                1
            ], 
            "Minny", 
            "Moose", 
            "Moosers", 
            "916-555-2222", 
            "none", 
            "minny.moose@wildlife.net"
        ]
    }
}
]
 }

This format is very close to the desired end product. Remaining issues are: removing the redundant array [1, 0] that appears just above each firstname, and getting the headers for each nest to be "teamname": "1", "members": rather than "1": "0":

Also, I do not know why each record is being stripped of its heading on the conversion. For instance, why is the dictionary entry "firstname":"John" exported as just "John"?
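
A likely explanation for the missing headings is that itertuples() yields namedtuples rather than dicts, and the json module encodes any tuple, named or not, as an array. The sketch below reproduces this with the standard library only. Row is a hypothetical stand-in for the row type that pandas generates, and calling _asdict() on each row is one possible fix, not something taken from the code above.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one row produced by DataFrame.itertuples().
Row = namedtuple("Row", ["Index", "firstname", "lastname"])
row = Row(Index=(1, 0), firstname="John", lastname="Doe")

# json encodes a namedtuple like a plain tuple, so the field names are dropped.
as_array = json.dumps(row)

# _asdict() keeps the field names; removing "Index" drops the redundant [1, 0] pair.
record = row._asdict()
record.pop("Index")
as_object = json.dumps(record)

print(as_array)
print(as_object)
```

Storing the result of t._asdict() (minus the Index entry) instead of t in the loop above should therefore give one object per member, with the column headings as keys.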
