Splitting a huge file by content with Ruby

Date: 2022-02-06 15:45:08

Disclaimer: I'm not a programmer, never was, never learned algorithms, CS, etc. Just have to work with it.

My question is: I need to split a huge (over 4 GB) CSV file into smaller ones based on the first field (and then process them with require 'win32ole'). In awk it's rather easy:

awk -F ',' '{myfile=$1 ; print $0 >> (myfile".csv")}' KNAGYFILE.csv

But with Ruby I did:

open('hugefile').each { |hline|
    accno = hline[0,12]                             # first field: 12-character account number
    nline = hline[13,10000].gsub(/;/,",")           # rest of the line, semicolons turned into commas
    accfile = File.open("#{accno.to_s}.csv", "a")   # opened and closed again for every single line
    accfile.puts nline
    accfile.close
}

Then I realized that it's resource-inefficient (a file open and close for every single line). I'm sure there's a better way to do it; could you explain how?

UPDATE: I just forgot to mention that the file is sorted on the first column. E.g. if this is hugefile:

012345678901,1,1,1,1,1,1
012345678901,1,2,1,1,1,1
012345678901,1,1,A,1,1,1
012345678901,1,1,1,1,A,A
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1
A12345678901,1,1,1,1,1,1

Then I need two new files, named 012345678901.csv and A12345678901.csv.

2 solutions

#1

Your awk solution will have to open the file just as many times, so I would think you'd get the same resource usage.

You can keep the output file open until $1 changes:

prev = nil
accfile = nil                          # defined outside the block so the handle survives between lines
File.foreach('hugefile') do |hline|
  accno = hline[0,12]
  nline = hline[13,10000].gsub(/;/,",")
  if prev != accno
    accfile.close if accfile           # close the previous account's file, if any
    accfile = File.open("#{accno}.csv", "a")
    prev = accno
  end
  accfile.puts nline
end
accfile.close if accfile               # close the last file once the loop is done

#2

This should get around the multiple open/write/close issue, although it might run into problems if the number of files becomes large; I can't say, I've never had hundreds of files open for writing!

The first line is the important one: for each new key encountered it opens a new file and stores it against that key in the hash. The last line closes all the files opened.

# The hash's default block opens one output file per new key and remembers it
files = Hash.new { |h, k| h[k] = File.open("#{k}.csv", 'w+') }
open('hugefile').each do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/,",")   # write the row to its key's file
end
files.each { |n, f| f.close }                             # close every file that was opened
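
If the number of distinct keys ever approaches the operating system's limit on open file descriptors, one possible workaround is to cap how many handles stay open and reopen files in append mode when a key comes around again. This is only a rough sketch, not part of the original answer; MAX_OPEN is an assumed, tunable limit:

MAX_OPEN = 200   # assumed limit; tune to the OS's open-file ceiling

# Append mode ('a') lets a key that was closed earlier keep its old rows when
# it is reopened; this assumes the *.csv files don't exist from a previous run.
files = Hash.new do |h, k|
  if h.size >= MAX_OPEN
    h.each_value(&:close)   # flush and close everything opened so far
    h.clear
  end
  h[k] = File.open("#{k}.csv", "a")
end

File.foreach('hugefile') do |hline|
  files[hline[0,12]].puts hline[13,10000].gsub(/;/, ",")
end
files.each_value(&:close)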

