I'm writing this little HelloWorld as a follow-up to this, and the numbers do not add up:
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
  total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?
Note: I'm on Windows, and the file is encoded as type ANSI.
Edit: This produces the same results!
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
  total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
So, can anybody help now?
6 Solutions
#1
IO#gets works the same as if you were capturing input from the command line: each line comes back with its "\n" terminator attached. On Windows, though, a file opened in text mode ("r") has every \r\n pair translated to a single \n before #gets (or #each_byte) sees it, so each line is one byte shorter than it is on disk and the numbers are not going to match up with File.size.
See the relevant Pickaxe section.
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?), or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
  # Pack whole lines into chunks, starting a new chunk before one would reach 4,000.
  def self.to_chunks(path)
    chunks = [""]
    File.readlines(path).each do |line|
      line.chomp! # strips a trailing \n, \r or \r\n
      if chunks.last.size + line.size >= 4_000 # 4096?
        chunks.last.chomp! # remove last line terminator
        chunks << ""
      end
      chunks.last << line + "\n" # or whatever terminator you need
    end
    chunks
  end
end

if __FILE__ == $0
  require 'test/unit'

  PATH = __FILE__ # placeholder input (an assumption); point this at the file you want to chunk

  class TestFile < Test::Unit::TestCase
    def test_chunking
      chs = Chunkifier.to_chunks(PATH)
      chs.each do |chunk|
        assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
      end
    end
  end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that, whatever the OS is doing, the bytes at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write, rather than #print or #puts, for the output, as the latter have a tendency to deliver OS-specific newline sequences.
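For the fixed-size alternative mentioned earlier, here is a minimal sketch of reading with IO#read in binary mode. The method name fixed_size_chunks, the CHUNK_SIZE constant, and the throwaway sample file are my own assumptions for illustration, not anything from the original question.

```ruby
require "tempfile"

CHUNK_SIZE = 4096 # assumed chunk size; the thread leaves 4000 vs 4096 open

# Split the file at `path` into strings of at most `chunk_size` bytes.
def fixed_size_chunks(path, chunk_size = CHUNK_SIZE)
  chunks = []
  File.open(path, "rb") do |file|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = file.read(chunk_size))
      chunks << chunk
    end
  end
  chunks
end

# Demo on a throwaway file with Windows-style line endings.
sample = Tempfile.new("chunk_demo")
sample.binmode
sample.write("some text\r\n" * 1000)
sample.close

chunks = fixed_size_chunks(sample.path)
puts "#{chunks.size} chunks, #{chunks.sum(&:bytesize)} bytes in total"
puts "file size on disk: #{File.size(sample.path)}"
```

Because the file is opened with "rb", no newline translation happens, so the chunk sizes should add up to File.size. The price is that a chunk boundary can fall in the middle of a line.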
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:
class String
  def size_in_bytes
    self.unpack("C*").size
  end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.
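If you are on Ruby 1.9 or later, String#bytesize gives the same count without a monkey patch. A small sketch comparing the three approaches follows; the helper name byte_counts is made up for this example.

```ruby
# Count the bytes in `text` three different ways and return all three results.
def byte_counts(text)
  {
    unpack:    text.unpack("C*").size, # one array element per byte
    each_byte: text.each_byte.count,   # walks an enumerator over the bytes
    bytesize:  text.bytesize           # built in, no intermediate array
  }
end

sample = "na\u00EFve caf\u00E9\n" # "naïve café" plus a newline, UTF-8 encoded
byte_counts(sample).each do |how, count|
  puts "#{how}: #{count}"
end
```

All three should agree with each other; which one is fastest on a given machine is something to measure rather than assume.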
#2
You might try IO#each_byte, e.g.
total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
  file.each_byte { |b| total_bytes += 1 }
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"
That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
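A rough sketch of that byte-walking idea, under two assumptions of mine: the file is opened in binary mode so no newline translation happens, and byte 10 (\n) is treated as the end of a line, which covers both \n and \r\n endings. The name line_sizes_by_byte is hypothetical.

```ruby
require "tempfile"

LF = 10 # byte value of "\n"; a "\r\n" pair also ends on this byte

# Return the size in bytes of every line in the file, terminators included.
def line_sizes_by_byte(path)
  sizes = []
  current = 0
  File.open(path, "rb") do |file|
    file.each_byte do |byte|
      current += 1
      if byte == LF
        sizes << current
        current = 0
      end
    end
  end
  sizes << current if current > 0 # last line without a terminator
  sizes
end

sample = Tempfile.new("bytes_demo")
sample.binmode
sample.write("one\r\ntwo\r\nthree")
sample.close

sizes = line_sizes_by_byte(sample.path)
puts "per-line sizes: #{sizes.inspect}"
puts "sum: #{sizes.sum}, file size: #{File.size(sample.path)}"
```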
#3
You potentially have several overlapping issues here:
- Linefeed characters: \r\n vs. \n (as per your previous post). Also the EOF file character (^Z)?
- Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?
- Interaction of the $KCODE global variable (deprecated in Ruby 1.9; see String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?
- Your format string for #unpack. I think you want C* here if you really want to count bytes.
Note also the existence of IO#each_line (just so you can throw away the while and be a little more Ruby-idiomatic ;-)).
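To make the characters-versus-bytes point concrete, here is a small sketch for Ruby 1.9 or later. The helper name char_and_byte_counts is invented for the example, and I'm assuming Windows-1252 as a typical "ANSI" code page, which may not be the one your file uses.

```ruby
# Report how many characters and how many bytes a string holds.
def char_and_byte_counts(str)
  { characters: str.length, bytes: str.bytesize }
end

utf8 = "caf\u00E9"                  # "café" encoded as UTF-8
ansi = utf8.encode("Windows-1252")  # same text in an assumed "ANSI" code page

puts "UTF-8: #{char_and_byte_counts(utf8).inspect}"
puts "ANSI:  #{char_and_byte_counts(ansi).inspect}"

# U* decodes UTF-8 code points, C* yields raw bytes, so on the UTF-8
# string they disagree as soon as there is an accented character.
puts "unpack U*: #{utf8.unpack("U*").length}"
puts "unpack C*: #{utf8.unpack("C*").length}"
```

This is why "U*" is the wrong format string for a byte count: it counts code points, and it expects the input to be valid UTF-8 in the first place.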
#4
The issue is that when you save a text file on Windows, your line breaks are two characters (characters 13 and 10) and therefore 2 bytes; when you save it on Linux there is only 1 (character 10). However, Ruby reports both of these as a single character '\n' (character 10). What's worse is that if you're on Linux with a Windows file, Ruby will give you both characters.
So, if you know that your files always come from Windows text files and are processed on Windows, every time you get a newline character you can add 1 to your count. Otherwise it's a couple of conditionals and a little state machine.
BTW there's no EOF 'character'.
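One way to sidestep the translation altogether, rather than compensating for it, is to open the file in binary mode ("rb"). This is only a sketch: total_line_bytes is a name I made up, and the sample file is generated on the fly so the example runs anywhere.

```ruby
require "tempfile"

# Sum the byte size of every line, reading in binary mode so that
# "\r\n" reaches the program untranslated on every platform.
def total_line_bytes(path)
  total = 0
  File.open(path, "rb") do |file|
    file.each_line { |line| total += line.bytesize }
  end
  total
end

sample = Tempfile.new("crlf_demo")
sample.binmode
sample.write("one\r\ntwo\r\nthree")
sample.close

puts "Original size #{File.size(sample.path)}"
puts "Total bytes   #{total_line_bytes(sample.path)}"
```

With "rb" the line strings keep their \r\n, so the per-line sizes should add up to File.size on Windows and Linux alike.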
#5
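f = File.new("log.txt")
begin
  # readline raises EOFError at end of file, which ends the loop
  while (line = f.readline)
    line.chomp! # strip the line terminator in place
    puts line.length
  end
rescue EOFError
  f.close
end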
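The same idea can be written without the explicit EOFError handling by letting File.foreach open and close the file. A sketch, with line_lengths as a hypothetical helper name; note that it reports lengths without the terminators, so they will not sum to the file size.

```ruby
require "tempfile"

# Collect the length of each line with its terminator stripped.
# File.foreach closes the file itself once the lines are exhausted.
def line_lengths(path)
  File.foreach(path).map { |line| line.chomp.length }
end

sample = Tempfile.new("len_demo")
sample.binmode
sample.write("alpha\nbe\r\n\ngamma")
sample.close

puts line_lengths(sample.path).inspect
```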
#6
Here is a simple solution, presuming that the current file pointer is set to the start of a line in the read file:
last_pos = file.pos
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
file.seek(backup_dist, IO::SEEK_CUR)
In this example, "file" is the file from which you are reading. To do this in a loop:
last_pos = file.pos
while (next_line = file.gets)
  current_pos = file.pos
  line_size = current_pos - last_pos # bytes consumed by this line, terminator included
  last_pos = current_pos
  # no seek back here: rewinding inside the loop would re-read the same line
end
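The single-step version above can be wrapped in a helper that reports the size of the next line and then puts the file pointer back. This is a sketch under my own assumptions: the name peek_line_size is invented, and the file is opened in binary mode so that positions are plain byte offsets.

```ruby
require "tempfile"

# Report the size in bytes of the next line without consuming it.
# Returns nil at end of file. Assumes `file` is seekable and was opened
# in binary mode, so that positions are plain byte offsets.
def peek_line_size(file)
  start = file.pos
  line = file.gets
  return nil if line.nil?
  size = file.pos - start
  file.seek(start, IO::SEEK_SET) # back up to where the line began
  size
end

sample = Tempfile.new("peek_demo")
sample.binmode
sample.write("ab\r\ncd")
sample.close

File.open(sample.path, "rb") do |file|
  puts peek_line_size(file) # size of the first line, terminator included
  puts file.gets.inspect    # the same line is still there to be read
end
```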