I have a program that reads JSON, flattens it, and dumps CSV:
import json
import unicodecsv as csv
import sys
import glob
import os
from flatten_json import flatten_json

def createcolumnheadings(cols):
    #create column headings
    columns = cols.keys()
    columns = list( set( columns ) )
    return columns

doOnce=True
path=os.chdir(sys.argv[1])
for f in glob.glob("smallR.txt"):
    fName=os.path.splitext(f)[0]
    out_file= open( 'csv/' + fName+'.csv', 'wb' )
    csv_w = csv.writer( out_file, delimiter="\t", encoding='utf-8' )
    with open(f, 'r') as handle:
        for line in handle:
            data = json.loads(line)
            flatdata =flatten_json(data)
            if doOnce:
                columns=createcolumnheadings(flatdata)
                columns.insert(0,'racism')
                csv_w.writerow( columns)
                doOnce=False
            flatdata['racism']= 0
            csv_w.writerow(flatdata.get(x, u'') for x in columns)
This works OK, with one problem: my program takes the column headings only from the first line of smallR.txt (plus the 'racism' column that it adds).
Some of the later lines in smallR.txt have different columns, so the output is not quite right (see small.csv).
Is there an easy way to adapt my program to handle new column headings found on later lines?
1 answer
#1
In that case you need to scan the whole file first, in order to get all the possible columns:
with open(f, 'r') as handle:
    data = [json.loads(line) for line in handle]
# Union of the keys from every record, sorted so the column order is stable
columns = ['racism'] + sorted({k for entry in data for k in entry.keys()})
csv_w.writerow(columns)
for entry in data:
    csv_w.writerow([entry.get(c, '') for c in columns])
This loads all the data into memory. If that is not acceptable, you can read the file twice: once to collect the columns, and once more to read each record and write it out:
with open(f, 'r') as handle:
    # First pass: collect the keys from every line
    columns = ['racism'] + sorted({k for line in handle for k in json.loads(line).keys()})
csv_w.writerow(columns)
with open(f, 'r') as handle:
    # Second pass: write one row per record
    for line in handle:
        entry = json.loads(line)
        csv_w.writerow([entry.get(c, '') for c in columns])
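The two-pass idea above can also be written with csv.DictWriter from the standard library, whose restval argument fills in columns that a record lacks. What follows is only a sketch under assumptions: it targets Python 3 and the built-in csv module rather than unicodecsv, the name jsonl_to_csv and its parameters are made up for illustration, each line is assumed to hold one JSON object, and columns come out in first-seen order. It does not flatten nested values.

```python
import csv
import json


def jsonl_to_csv(in_path, out_path, label_column="racism", label_value=0):
    """Convert a file with one JSON object per line to tab-separated CSV.

    Hypothetical helper: the input is read twice so that only one record
    is held in memory at a time.
    """
    # Pass 1: collect every key that appears on any line, in first-seen order.
    columns = [label_column]
    seen = {label_column}
    with open(in_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            for key in json.loads(line):
                if key not in seen:
                    seen.add(key)
                    columns.append(key)

    # Pass 2: write the rows; restval fills columns a record does not have.
    with open(in_path, "r", encoding="utf-8") as handle, \
            open(out_path, "w", encoding="utf-8", newline="") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=columns,
                                delimiter="\t", restval="")
        writer.writeheader()
        for line in handle:
            if not line.strip():
                continue
            entry = json.loads(line)
            entry[label_column] = label_value  # constant label, as in the question
            writer.writerow(entry)
    return columns
```

Something like jsonl_to_csv("smallR.txt", "csv/smallR.csv") would mirror the file names in the question, assuming the csv/ directory already exists.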
The flatten_json function definition is missing so I can only guess what it does.
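Since the flatten_json definition is not shown, here is one guess at what such a helper might do and how it would fit into the column scan. This is a sketch, not the asker's function: flatten_record and collect_columns are hypothetical names, nested keys are assumed to be joined with "_", and list items are assumed to be keyed by their index. Empty dicts and lists produce no columns in this version.

```python
import json


def flatten_record(value, prefix="", sep="_"):
    """Hypothetical stand-in for flatten_json: nested dicts and lists
    become a single-level dict with joined key names."""
    flat = {}
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Scalar: store it under the key path built so far.
        flat[prefix] = value
        return flat
    for key, child in items:
        child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(child, child_prefix, sep))
    return flat


def collect_columns(lines):
    """Return every flattened key found in an iterable of JSON lines,
    in first-seen order."""
    columns = []
    seen = set()
    for line in lines:
        for key in flatten_record(json.loads(line)):
            if key not in seen:
                seen.add(key)
                columns.append(key)
    return columns
```

With helpers like these, the header row could be built as ['racism'] + collect_columns(handle), and each row written from flatten_record(json.loads(line)) on the second pass.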