I'm pretty new to programming, so be gentle. I'm trying to extract ISBN numbers from a library database .dat file. I have written code that works, but it only searches through about half of the 180 MB file. How can I adjust it to search the whole file? Or how can I write a program that will split the .dat file into manageable chunks?
edit: Here's my code:
export = File.new("resultsfinal.txt", "w+")
File.open("bibrec2.dat").each do |line|
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
6 Answers
#1
4
You should try catching exceptions to check whether the problem really is in the read block or not.
Just so you know, I have already written a script with roughly the same syntax to search a really big file (~8 GB) without any problem.
export = File.new("resultsfinal.txt", "w+")
File.open("bibrec2.dat").each do |line|
  begin
    line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
      export.puts x
    end
    line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
      export.puts x
    end
  rescue => e
    puts "Problem while adding the result: #{e.message}"
  end
end
#2
3
The main thing is to clean up and combine the regexes for a performance benefit. Also, you should always use the block syntax with files to ensure the file descriptors get closed properly. File#each doesn't load the whole file into memory; it reads one line at a time:
File.open("resultsfinal.txt", "w+") do |output|
  File.open("bibrec2.dat").each do |line|
    output.puts line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/)
  end
end
#3
2
export = File.new("resultsfinal.txt", "w+")
file = File.new("bibrec2.dat", "r")
while (line = file.gets)
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
file.close
export.close
#4
1
As to the performance issue, I can't see anything particularly worrying about the file size: 180 MB shouldn't pose any problems. What happens to memory use while your script is running?
I'm not sure, however, that your regular expressions are doing what you want. This, for example:
/[a]{1}[1234567890xX]{10}\W/
does (I think) this:
- matches one "a". Do you really want to match an "a"? In that case plain "a" would suffice, rather than "[a]{1}".
- matches exactly 10 of (digit or "x" or "X")
- matches a single "non-word" character, i.e. not a-z, A-Z, 0-9 or underscore
There are a couple of sample ISBN matchers here and here, although they seem to match something more like the format we see on the back cover of a book, and I'm guessing your input file has stripped out some of that formatting.
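To see concretely what those two patterns pick up, here is a quick sanity check against a made-up sample line; the record layout is an assumption for illustration, not taken from the question's actual .dat file:

```ruby
# Hypothetical sample record; real bibrec2.dat lines will differ.
sample = "title|a014027826X |a9780140278262|junk"

# ISBN-10 tagged with a leading "a", plus the trailing non-word character:
puts sample.scan(/a[\dxX]{10}\W/).inspect  # prints ["a014027826X "]

# ISBN-13 tagged with a leading "a":
puts sample.scan(/a[\dxX]{13}/).inspect    # prints ["a9780140278262"]
```

Note that the first pattern captures the trailing delimiter too, so you may want to strip it before writing the result out.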
#5
1
You can look into using File#truncate and IO#seek and employing a binary-search-type algorithm. #truncate may be destructive, so you should duplicate the file first (I know this is a hassle).
middle = File.new("my_huge_file.dat").size / 2
tmpfile = File.new("my_huge_file.dat", "r+").truncate(middle)
# run search algorithm on 'tmpfile'
File.open("my_huge_file.dat") do |huge_file|
  huge_file.seek(middle + 1)
  # run search algorithm from here
end
The code is highly untested, brittle, and incomplete, but I hope it gives you a platform to build off of.
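If duplicating and truncating feels risky, a non-destructive sketch is to read the file in fixed-size chunks with IO#read instead. The chunk size and sample contents below are invented so the example runs standalone; also note an ISBN straddling a chunk boundary would be missed unless you carry over the tail of each chunk:

```ruby
require "tempfile"

CHUNK_SIZE = 1024 * 1024  # 1 MB per read; tune to taste

isbns = []
# Tiny stand-in for the real 180 MB file; contents are made up for the demo.
Tempfile.create("bibrec") do |f|
  f.write("x" * 50 + "a9780140278262" + "y" * 50)
  f.rewind
  # Read the file one fixed-size chunk at a time instead of slurping it all.
  while (chunk = f.read(CHUNK_SIZE))
    isbns.concat(chunk.scan(/a[\dxX]{13}/))
  end
end
puts isbns.inspect  # prints ["a9780140278262"]
```

This keeps memory bounded regardless of file size, which addresses the original "manageable chunks" question without touching the source file.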
#6
-2
If you are programming on a modern operating system and the computer has enough memory (say 512 MB), Ruby should have no problem reading the entire file into memory.
Things typically get iffy at around a 2 GB working set on a typical 32-bit OS.
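Under that assumption, the slurp-everything version is short. The filenames match the question; the sample data written here is invented so the sketch runs standalone:

```ruby
# Create a tiny stand-in for bibrec2.dat so the example is self-contained.
File.write("bibrec2.dat", "rec1|a014027826X |rec2|a9780140278262|")

data  = File.read("bibrec2.dat")                 # whole file in memory
isbns = data.scan(/a[\dxX]{10}\W|a[\dxX]{13}/)   # both patterns in one pass
File.write("resultsfinal.txt", isbns.join("\n"))

puts isbns.inspect  # prints ["a014027826X ", "a9780140278262"]
```

With everything in one string, both patterns can be combined into a single scan, and there is no risk of a match being split across read boundaries.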