Ruby: how to split a file into multiple files of a given size

Time: 2022-09-03 02:47:43

I want to split a text file into multiple files, each containing no more than 5 MB. I know there are tools for this, but I need it for a project and want to do it in Ruby. Also, I would prefer to do this with File.open in block context if possible, but I fail miserably :o(

#!/usr/bin/env ruby

require 'pp'

MAX_BYTES = 5_000_000

file_num = 0
bytes    = 0

File.open("test.txt", 'r') do |data_in|
  File.open("#{file_num}.txt", 'w') do |data_out|
    data_in.each_line do |line|
      data_out.puts line

      bytes += line.length

      if bytes > MAX_BYTES
        bytes = 0
        file_num += 1
        # next file
      end
    end
  end
end

This works, but I don't think it is elegant. Also, I still wonder whether it can be done with File.open in block context only.

#!/usr/bin/env ruby

MAX_BYTES = 5_000_000

file_num = 0
bytes    = 0

File.open("test.txt", 'r') do |data_in|
  data_out = File.open("#{file_num}.txt", 'w')

  data_in.each_line do |line|
    # A closed File still responds to :write, so test closed? before reopening.
    data_out = File.open("#{file_num}.txt", 'w') if data_out.closed?
    data_out.puts line

    # Count bytes, not characters (they differ for multi-byte text).
    bytes += line.bytesize

    # The check runs after the write, so a file can exceed the limit by up to one line.
    if bytes > MAX_BYTES
      bytes = 0
      file_num += 1
      data_out.close
    end
  end

  data_out.close unless data_out.closed?
end

Cheers,

Martin

4 answers

#1


14  

[Updated] Wrote a short version without any helper variables and put everything in a method:

def chunker f_in, out_pref, chunksize = 1_073_741_824
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
        fh_out << fh_in.read(chunksize)
      end
    end
  end
end

chunker "inputfile.txt", "output_prefix"  # optional third argument: desired chunk size in bytes

Instead of looping over lines, you can use .read(length) and loop only until EOF, using the file cursor to number the output files.

This ensures that the chunk files are never bigger than the desired chunk size.

On the other hand, it pays no attention to line breaks (\n), so a line can be cut in the middle!

The numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which results in 5-digit numbers with leading zeros (the first file gets 00000, the next 00001, and so on).

This is only possible because .read(chunksize) is used, so the cursor advances by exactly chunksize for every output file except the last. In the second example below, it cannot be used!
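
To make the numbering scheme concrete, here is a small sketch of the file names it produces. chunk_name is a hypothetical helper introduced only for illustration (the answer builds the name inline), and the chunk size of 1_000 is an arbitrary value chosen to keep the numbers readable:

```ruby
# Hypothetical helper, not part of the answer: returns the output file name
# the chunker would pick for a given cursor position.
def chunk_name(out_pref, pos, chunksize)
  format("%s_%05d.txt", out_pref, pos / chunksize)
end

chunksize = 1_000
# Cursor positions as seen before each read of a 2_500-byte input file.
names = [0, 1_000, 2_000].map { |pos| chunk_name("output_prefix", pos, chunksize) }
puts names
```

Because every read before the last one advances the cursor by exactly chunksize, the integer division yields 0, 1, 2, ... without a separate counter.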

Update: Splitting with line break recognition

If you really need complete lines ending in \n, use this modified code snippet:

def chunker f_in, out_pref, chunksize = 1_073_741_824
  outfilenum = 1
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
        line = ""
        # File#size is in bytes, so compare against bytesize rather than length.
        # The size of the previous line serves as an estimate for the next one.
        while fh_out.size <= (chunksize-line.bytesize) && !fh_in.eof?
          line = fh_in.readline
          fh_out << line
        end
      end
      outfilenum += 1
    end
  end
end

I had to introduce a helper variable, line, to keep the chunk file size below the chunksize limit. Without this extended check you also get file sizes above the limit, because the while condition is only evaluated on the next iteration, when the line has already been written. Note that the check uses the size of the previously written line as an estimate for the next one, so a file can still exceed the limit when a line is longer than the one before it. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)

Unfortunately, a second EOF check is needed, because the last chunk will usually be smaller than chunksize.

Two helper variables are also needed: line is described above, and outfilenum is needed because the resulting file sizes usually do not match chunksize exactly, so the file number can no longer be derived from the cursor position.
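
If the limit has to hold strictly, one option is to read one line ahead and decide before writing whether it still fits. The following is only a sketch under that assumption: chunker_strict is a made-up name, not part of the answer above, and a single line longer than chunksize still ends up in a file of its own that exceeds the limit. It keeps both File.open calls in block context.

```ruby
# Sketch: split f_in into files of at most chunksize bytes without cutting
# lines. Returns the number of files written.
def chunker_strict(f_in, out_pref, chunksize)
  outfilenum = 1
  # Binary mode keeps the byte counts exact (no newline translation).
  File.open(f_in, "rb") do |fh_in|
    line = fh_in.gets # read one line ahead
    while line
      File.open("#{out_pref}_#{outfilenum}.txt", "wb") do |fh_out|
        written = 0
        # Always write at least one line per file, so an oversized line
        # cannot cause an endless loop.
        loop do
          fh_out << line
          written += line.bytesize
          line = fh_in.gets
          break if line.nil? || written + line.bytesize > chunksize
        end
      end
      outfilenum += 1
    end
  end
  outfilenum - 1
end
```

It would be called like the chunker above, for example chunker_strict("inputfile.txt", "output_prefix", 5_000_000).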

#2


11  

For files of any size, the split command-line tool will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:

system("split -C 1M -d test.txt ''")

The options are:

  • -C 1M: Put lines totalling no more than 1M (1 MiB) in each chunk
  • -d: Use decimal suffixes in the output filenames
  • test.txt: The name of the input file
  • '': Use a blank output file prefix

Unless you're on Windows, this is the way to go.
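
If the file names come from variables, passing the arguments to system as a list avoids shell quoting issues. This is only a sketch, assuming GNU split is installed (-C and -d may not be available in other split implementations); split_command and run_split are hypothetical helper names, and the "5MB" default (5,000,000 bytes in GNU split's size notation) is chosen here to match the limit in the question:

```ruby
# Hypothetical helper: builds the argument list for GNU split.
# limit uses GNU split's size notation ("1M" = 1 MiB, "5MB" = 5,000,000 bytes).
def split_command(input, prefix = "", limit = "5MB")
  ["split", "-C", limit, "-d", input, prefix]
end

# Runs split without going through a shell, so file names containing spaces
# or quotes need no escaping. Raises if split is missing or exits non-zero.
def run_split(input, prefix = "", limit = "5MB")
  cmd = split_command(input, prefix, limit)
  system(*cmd) or raise "split failed: #{cmd.inspect}"
end
```

Calling run_split("test.txt") would then do the same job as the one-liner above, with a 5 MB limit instead of 1M.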

#3


0  

Instead of opening your outfile inside the infile block, open the file and assign it to a variable. When you hit the file size limit, close the file and open a new one.
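
A rough sketch of what this could look like. The method name split_by_rotation and its parameters are made up for illustration, and the sketch checks the size before writing, so a file only exceeds max_bytes when a single line is longer than the limit:

```ruby
# Sketch: keep the output file in a plain variable and swap it for a new one
# whenever the next line would push the current file past max_bytes.
# Returns the number of files written.
def split_by_rotation(input, out_pref, max_bytes)
  file_num = 0
  bytes    = 0
  data_out = nil
  File.foreach(input) do |line|
    # Limit reached: close the current file; the next one is opened below.
    if data_out && bytes + line.bytesize > max_bytes
      data_out.close
      data_out = nil
      file_num += 1
    end
    if data_out.nil?
      data_out = File.open("#{out_pref}#{file_num}.txt", "w")
      bytes = 0
    end
    data_out.write(line)
    bytes += line.bytesize
  end
  data_out ? file_num + 1 : 0
ensure
  # Without a block, closing the last output file is our own job.
  data_out&.close
end
```

The trade-off compared with block context is visible in the ensure clause: the file handle has to be closed by hand, including when an error occurs.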

#4


0  

This code actually works. It's simple, and it buffers the lines in an array and writes each output file in one go, which should make it faster:

#!/usr/bin/env ruby
MAX_BYTES = 3500
data     = []  # lines buffered for the current output file
file_num = 0
bytes    = 0

filename = 'W:/IN/tangoZ.txt_100.TXT'
puts "File exists = #{File.exist?(filename)} #{filename}"
# Count lines without keeping a file handle open.
line_count = File.foreach(filename).count
file_size  = File.size(filename).to_f / 1_000_000
puts "Total lines=#{line_count}   size=#{file_size} MB"
puts ' '

File.foreach(filename) do |line|
  bytes += line.bytesize
  data << line

  # Flush the buffer once the limit is passed (a file may exceed it by up to one line).
  if bytes > MAX_BYTES
    bytes = 0
    file_num += 1
    File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
    data.clear
  end
end

# Write leftovers, if any.
unless data.empty?
  file_num += 1
  File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
end
