I have hundreds of large CSV files that I would like to merge into one. However not all CSV files contain all columns. I therefore need to merge based on column name, not column position.
Just to be clear: in the merged CSV, values should be empty for a cell coming from a line which did not have the column of that cell.
I cannot use the pandas module, because it makes me run out of memory.
Is there a module that can do that, or some easy code?
2 Answers
#1
11
The csv.DictReader and csv.DictWriter classes should work well (see the Python docs). Something like this:
import csv

inputs = ["in1.csv", "in2.csv"]  # etc.

# First determine the field names from the top line of each input file
# Comment 1 below
fieldnames = []
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.reader(f_in)
        headers = next(reader)
        for h in headers:
            if h not in fieldnames:
                fieldnames.append(h)

# Then copy the data
with open("out.csv", "w", newline="") as f_out:  # Comment 2 below
    writer = csv.DictWriter(f_out, fieldnames=fieldnames)
    writer.writeheader()
    for filename in inputs:
        with open(filename, "r", newline="") as f_in:
            reader = csv.DictReader(f_in)  # Uses the field names in this file
            for line in reader:
                # Comment 3 below
                writer.writerow(line)
Comments from above:
- You need to specify all the possible field names in advance to DictWriter, so you need to loop through all your CSV files twice: once to find all the headers, and once to read the data. There is no better solution, because all the headers need to be known before DictWriter can write the first line. This part would be more efficient using sets instead of lists (the in operator on a list is comparatively slow), but it won't make much difference for a few hundred headers. Sets would also lose the deterministic ordering of a list: your columns would come out in a different order each time you ran the code.
- The above code is for Python 3, where weird things happen in the CSV module without newline="". Remove this for Python 2.
- At this point, line is a dict with the field names as keys, and the column data as values. You can specify what to do with blank or unknown values in the DictReader and DictWriter constructors.
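To get the speed of a set without losing the deterministic column order, you can maintain a set alongside the list. A minimal sketch (merge_headers is a hypothetical helper, not part of the answer's code):

```python
def merge_headers(header_lists):
    """Merge header rows in first-seen order, using a set for O(1) lookups."""
    fieldnames = []
    seen = set()
    for headers in header_lists:
        for h in headers:
            if h not in seen:  # set lookup instead of slow list membership
                seen.add(h)
                fieldnames.append(h)
    return fieldnames

print(merge_headers([["a", "b"], ["b", "c"], ["c", "d"]]))  # ['a', 'b', 'c', 'd']
```

In the answer's code, you would pass each file's headers row into this helper instead of testing h not in fieldnames directly.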
This method should not run out of memory, because it never has the whole file loaded at once.
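The constructor options mentioned in comment 3 are restval (the value DictWriter writes for fields missing from a row; the default is an empty string, which is exactly what the question asks for) and extrasaction (what to do with unexpected keys). A small sketch, writing to an in-memory buffer for illustration; "N/A" is an arbitrary example value:

```python
import csv
import io

buf = io.StringIO()
# restval fills missing fields; extrasaction="ignore" drops unknown keys
# (the default, "raise", would throw a ValueError instead).
writer = csv.DictWriter(buf, fieldnames=["a", "b", "c"],
                        restval="N/A", extrasaction="ignore")
writer.writeheader()
writer.writerow({"a": 1, "d": 99})  # "d" is ignored; "b" and "c" become "N/A"
print(buf.getvalue())
```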
#2
1
For those of us using Python 2.7, this adds an extra linefeed between records in "out.csv". To resolve this, just change the file mode from "w" to "wb".