Disclaimer: I'm not a programmer, never was, and never learned algorithms, CS, etc. I just have to work with it.
My question is: I need to split a huge (over 4 GB) CSV file into smaller ones (which I then process with require 'win32ole'), based on the first field. In awk it's rather easy:
awk -F ',' '{myfile=$1 ; print $0 >> (myfile".csv")}' KNAGYFILE.csv
But with Ruby I did:
open('hugefile').each { |hline|
  accno = hline[0,12]                    # first field: 12-character account number
  nline = hline[13,10000].gsub(/;/,",")  # rest of the line, semicolons converted to commas
  accfile = File.open("#{accno.to_s}.csv", "a")
  accfile.puts nline
  accfile.close
}
Then I recognized that it's resource-inefficient (a file open and close for every line). I'm sure there's a better way to do it; could you explain how?
UPDATE: I just forgot to mention that the file is sorted on the first column. E.g., if this is hugefile:
012345678901,1,1,1,1,1,1
012345678901,1,2,1,1,1,1
012345678901,1,1,A,1,1,1
012345678901,1,1,1,1,A,A
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
Then I need two new files, named 012345678901.csv and A12345678901.csv.
2 Answers
#1 (2 votes)
Your awk solution will have to open the file just as many times, so I would think you'd get the same resource usage.
You can keep the file open until $1 changes:
prev = nil
accfile = nil
File.foreach('hugefile') do |hline|
  accno = hline[0,12]
  nline = hline[13,10000].gsub(/;/, ",")
  if prev != accno
    accfile.close if accfile   # close the previous account's file
    accfile = File.open("#{accno}.csv", "a")
    prev = accno
  end
  accfile.puts nline
end
accfile.close if accfile       # don't forget the last file
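Since the input is sorted on the first column (per the update), each account's lines form one contiguous block, so each output file is opened exactly once. As a rough sanity check after a run, a sketch like this could compare the number of distinct account prefixes against the number of .csv files produced (it assumes no unrelated .csv files sit in the working directory):
require 'set'

# Collect the distinct 12-character account prefixes without loading the whole file.
accounts = Set.new
File.foreach('hugefile') { |line| accounts << line[0,12] }

actual = Dir['*.csv'].size
puts "expected #{accounts.size} output files, found #{actual}"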
#2 (1 vote)
This should get around the multiple-open-write-close issue, although it might run into problems if the number of files becomes large; I can't say, as I've never had hundreds of files open for writing!
The first line is the important one: for each new key encountered it opens a new file and stores it against that key in the hash. The last line closes all the files opened.
files = Hash.new { |h, k| h[k] = File.open("#{k}.csv", 'w+') }
open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/,",")
end
files.each { |n, f| f.close }
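On the worry about having hundreds of files open for writing: if the number of distinct keys could exceed the OS limit on open file descriptors, one hedged variant (just a sketch, with an assumed cap of 100 handles) is to evict and close a handle once the hash reaches the cap, and reopen files in append mode as needed:
MAX_OPEN = 100  # assumed cap; tune to your OS's open-file limit

files = {}
File.foreach('hugefile') do |hline|
  key = hline[0,12]
  unless files[key]
    if files.size >= MAX_OPEN
      # Evict the oldest handle so we stay under the cap.
      _old_key, old_file = files.shift
      old_file.close
    end
    # Append mode so earlier writes to an evicted key's file are preserved;
    # note this means stale *.csv files from previous runs should be removed first.
    files[key] = File.open("#{key}.csv", 'a')
  end
  files[key].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)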