I want to split a txt file into multiple files, each no larger than 5 MB. I know there are tools for this, but I need it for a project and want to do it in Ruby. I would also prefer to do this with File.open in block context if possible, but I fail miserably :o(
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
  File.open("#{file_num}.txt", 'w') do |data_out|
    data_in.each_line do |line|
      data_out.puts line
      bytes += line.length
      if bytes > MAX_BYTES
        bytes = 0
        file_num += 1
        # next file
      end
    end
  end
end
This works, but I don't think it is elegant. I also still wonder whether it can be done with File.open in block context only.
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
  data_out = File.open("#{file_num}.txt", 'w')
  data_in.each_line do |line|
    # Reopen only after the previous output file has been closed
    # (a closed File still responds to :write, so check closed? instead)
    data_out = File.open("#{file_num}.txt", 'w') if data_out.closed?
    data_out.puts line
    # bytesize counts bytes; length counts characters
    bytes += line.bytesize
    if bytes > MAX_BYTES
      bytes = 0
      file_num += 1
      data_out.close
    end
  end
  data_out.close unless data_out.closed?
end
Cheers,
Martin
4 Answers
#1
14
[Updated] Wrote a short version without any helper variables and put everything in a method:
def chunker f_in, out_pref, chunksize = 1_073_741_824
  File.open(f_in, "r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{"%05d" % (fh_in.pos / chunksize)}.txt", "w") do |fh_out|
        fh_out << fh_in.read(chunksize)
      end
    end
  end
end

# The third argument (chunk size in bytes) is optional
chunker "inputfile.txt", "output_prefix"
Instead of a line loop you can use .read(length) and loop only on the EOF marker and the file cursor.
This guarantees that the chunk files are never larger than the desired chunk size.
On the other hand, it pays no attention to line breaks (\n), so a line can be cut in the middle!
Numbers for the chunk files are generated by integer division of the current file cursor position by chunksize, formatted with "%05d", which yields 5-digit numbers with leading zeros (00000, 00001, and so on).
This is only possible because .read(chunksize) is used, so every file except the last advances the cursor by exactly chunksize bytes. In the second example below, it could not be used!
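To make the numbering concrete, here is a minimal sketch of the file name calculation on its own. The helper name chunk_name is hypothetical and not part of the answer; it only isolates the "%05d" formatting and the integer division described above.

```ruby
# Hypothetical helper (not part of the answer above): builds the output
# name for the chunk that starts at byte offset `pos`.
def chunk_name(prefix, pos, chunksize)
  # Integer division: offsets below chunksize map to 0, the next block to 1, ...
  format("%s_%05d.txt", prefix, pos / chunksize)
end

puts chunk_name("output_prefix", 0, 5_000_000)          # first chunk
puts chunk_name("output_prefix", 10_000_000, 5_000_000) # third chunk
```

Because the cursor position is read before each .read call, the first file gets the number 00000.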
Update: Splitting with line break recognition
If you really need complete lines ending in \n, then use this modified code snippet:
def chunker f_in, out_pref, chunksize = 1_073_741_824
  outfilenum = 1
  File.open(f_in, "r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{outfilenum}.txt", "w") do |fh_out|
        line = ""
        # The size of the previous line serves as an estimate for the next one
        while fh_out.size <= (chunksize - line.bytesize) && !fh_in.eof?
          line = fh_in.readline
          fh_out << line
        end
      end
      outfilenum += 1
    end
  end
end
I had to introduce a helper variable line because I want to keep the chunk file size below the chunksize limit. Without this extended check you will also get file sizes above the limit, because the while condition is only evaluated again in the next iteration, when the line has already been written. Note that the check uses the previous line as an estimate for the next one, so a file can still exceed the limit when a line is longer than its predecessor. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)
Unfortunately a second EOF check is needed, because the last chunk will usually be smaller than chunksize.
Two helper variables are needed as well: line is described above, and outfilenum is needed because the resulting file sizes usually do not match chunksize exactly, so the file number can no longer be derived from the cursor position.
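If the limit has to hold strictly, one option is to read a line first and decide afterwards which file it belongs to. The following is only a sketch under that assumption, not the answer's method: the name chunk_by_lines, the pending variable, and the 5,000,000 byte default are all made up for illustration. A single line longer than chunksize still gets a file of its own and exceeds the limit.

```ruby
# Sketch: split `f_in` into files named "<out_pref>_<n>.txt" so that lines
# are never cut and no file exceeds `chunksize` bytes (except when a single
# line is itself longer than `chunksize`). Returns the names of the files.
def chunk_by_lines(f_in, out_pref, chunksize = 5_000_000)
  written = []
  File.open(f_in, "rb") do |fh_in|
    pending = fh_in.gets # line that was read but not written yet
    while pending
      name = "#{out_pref}_#{written.size + 1}.txt"
      File.open(name, "wb") do |fh_out|
        bytes = 0
        # Always write at least one line per file, then continue only
        # while the next line still fits.
        while pending && (bytes.zero? || bytes + pending.bytesize <= chunksize)
          fh_out << pending
          bytes += pending.bytesize
          pending = fh_in.gets
        end
      end
      written << name
    end
  end
  written
end
```

It would be called like the chunker above, for example chunk_by_lines("inputfile.txt", "output_prefix"). Counting bytes in a local variable also avoids asking the file system for the output size on every line.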
#2
11
For files of any size, split will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:
system("split -C 1M -d test.txt ''")
The options are:
- -C 1M: put lines totalling no more than 1M in each chunk
- -d: use decimal suffixes in the output filenames
- test.txt: the name of the input file
- '': use a blank output file prefix
Unless you're on Windows, this is the way to go. Note that -C is a GNU coreutils option and may be missing from other split implementations.
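As a sketch of how the split call could be wrapped for the 5 MB limit from the question, the version below passes the arguments as a list, so no shell is involved and file names with spaces need no quoting. The helper names split_command and split_file are hypothetical, and GNU split is assumed to be on the PATH.

```ruby
# Hypothetical helper: builds the argument list for GNU split.
# "5M" means 5 * 1024 * 1024 bytes; GNU split also accepts "5MB" for
# 5 * 1000 * 1000 bytes.
def split_command(input, prefix = "", max = "5M")
  ["split", "-C", max, "-d", input, prefix]
end

# Runs split without a shell. system returns nil when the command cannot
# be started and false when it exits with a non-zero status.
def split_file(input, prefix = "", max = "5M")
  ok = system(*split_command(input, prefix, max))
  raise "split failed for #{input}" unless ok
  true
end
```

Calling split_file("test.txt") would then behave like the one-liner above, with a 5M limit instead of 1M and an error raised if split fails.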
#3
0
Instead of opening your outfile inside the infile block, open the file and assign it to a variable. When you hit the file size limit, close the file and open a new one.
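A minimal sketch of that idea, assuming output files named 0.txt, 1.txt, and so on as in the question. The method name split_with_handle and the out_dir parameter are made up for illustration.

```ruby
# Sketch: keep the output file in a variable and swap it for a new one
# whenever the next line would push the current file over `max_bytes`.
# Returns the number of files written.
def split_with_handle(f_in, max_bytes = 5_000_000, out_dir = ".")
  file_num = 0
  bytes = 0
  data_out = File.open(File.join(out_dir, "#{file_num}.txt"), "w")
  begin
    File.foreach(f_in) do |line|
      # Roll over before writing, so a file never exceeds the limit
      # (unless a single line is longer than max_bytes).
      if bytes > 0 && bytes + line.bytesize > max_bytes
        data_out.close
        file_num += 1
        bytes = 0
        data_out = File.open(File.join(out_dir, "#{file_num}.txt"), "w")
      end
      data_out.write(line)
      bytes += line.bytesize
    end
  ensure
    # Close the last output file even if reading raises an error
    data_out.close unless data_out.closed?
  end
  file_num + 1
end
```

The ensure clause takes over the cleanup that the block form of File.open would otherwise provide. An empty input still produces an empty 0.txt.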
#4
0
This code actually works; it's simple, and it buffers the lines in an array, which makes it faster:
#!/usr/bin/env ruby
MAX_BYTES = 3500
MAX_LINES = 32
filename = 'W:/IN/tangoZ.txt_100.TXT'

data = []
line_num = 0
file_num = 0
bytes = 0

puts "File exists = #{File.exist?(filename)} #{filename}"
# File.foreach closes the file when it is done, so no handle is leaked
line_count = File.foreach(filename).count
file_size = File.size(filename).to_f / 1_048_576
puts "Total lines=#{line_count} size=#{file_size} MB"
puts ' '

File.foreach(filename) do |line|
  bytes += line.bytesize
  line_num += 1
  data << line
  if bytes > MAX_BYTES
  # alternative: if line_num > MAX_LINES
    # Limit reached: flush the buffered lines to the next numbered file
    bytes = 0
    file_num += 1
    File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
    data.clear
    line_num = 0
  end
end

# Write leftovers, if any
unless data.empty?
  file_num += 1
  File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
end