Ruby: What's an elegant way to pick a random line from a text file?

I've seen some really beautiful examples of Ruby and I'm trying to shift my thinking to be able to produce them instead of just admire them. Here's the best I could come up with for picking a random line out of a file:

def pick_random_line
  random_line = nil
  File.open("data.txt") do |file|
    file_lines = file.readlines()
    random_line = file_lines[Random.rand(0...file_lines.size())]
  end

  random_line
end

I feel like it's gotta be possible to do this in a shorter, more elegant way without storing the entire file's contents in memory. Is there?

7 solutions

#1

You can do it without storing anything except the current candidate for the random line.

def pick_random_line
  chosen_line = nil
  File.foreach("data.txt").each_with_index do |line, number|
    chosen_line = line if rand < 1.0/(number+1)
  end
  return chosen_line
end

So the first line is chosen with probability 1/1 = 1; the second line is chosen with probability 1/2, so half the time it keeps the first one and half the time it switches to the second.

Then the third line is chosen with probability 1/3 - so 1/3 of the time it picks it, and the other 2/3 of the time it keeps whichever one of the first two it picked. Since each of them had a 50% chance of being chosen as of line 2, they each wind up with a 1/3 chance of being chosen as of line 3.

And so on. After reading line N, each of lines 1 through N has an equal 1/N chance of being the current choice, and that holds all the way through the file (as long as the file isn't so huge that 1/(number of lines in file) is less than epsilon :)). You make only one pass through the file and never hold more than two lines in memory at once.
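
The "and so on" step can be written out. This is only a sketch of the argument, using notation that is not in the answer itself: N is the total number of lines and k is the index of one particular line. Line k ends up as the result if it is picked when read (probability 1/k) and then survives every later line j, each of which replaces it with probability 1/j:

```latex
P(\text{line } k \text{ is the final choice})
  = \frac{1}{k} \prod_{j=k+1}^{N} \left(1 - \frac{1}{j}\right)
  = \frac{1}{k} \prod_{j=k+1}^{N} \frac{j-1}{j}
  = \frac{1}{k} \cdot \frac{k}{N}
  = \frac{1}{N}
```

The product telescopes to k/N, so the result does not depend on k.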

EDIT: You're not going to get a really concise solution with this algorithm, but you can turn it into a one-liner if you want to:

def pick_random_line
  File.foreach("data.txt").each_with_index.reduce(nil) { |picked, (line, index)|
    rand < 1.0/(1+index) ? line : picked }
end
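
If you want to convince yourself that the picks really are uniform, a quick empirical check helps. The following is a minimal sketch, not part of the original answer: reservoir_pick and pick_counts are names made up here, the rng parameter is an assumption added so the run is repeatable, and a StringIO stands in for data.txt.

```ruby
require "stringio"

# Keeps the k-th line (1-based) with probability 1/k, as described above.
# `io` is anything that responds to each_line; `rng` is injectable so that
# runs can be seeded (both are illustrative parameters).
def reservoir_pick(io, rng: Random.new)
  chosen = nil
  io.each_line.with_index(1) do |line, k|
    chosen = line if rng.rand < 1.0 / k
  end
  chosen
end

# Runs the picker many times over a four-line in-memory "file"
# and counts how often each line wins.
def pick_counts(trials, rng: Random.new(42))
  text = "a\nb\nc\nd\n"
  counts = Hash.new(0)
  trials.times { counts[reservoir_pick(StringIO.new(text), rng: rng)] += 1 }
  counts
end

counts = pick_counts(20_000)
counts.sort.each { |line, n| puts "#{line.chomp}: #{n}" }
```

Each of the four counts should land near 5,000. An empty input returns nil, because no line ever replaces the initial value.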

#2

There is already a random entry selector built into the Ruby Array class: sample().

def pick_random_line
  File.readlines("data.txt").sample
end
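
Note that this still reads the whole file into memory, which the question wanted to avoid. For small files it is hard to beat, though. A slightly more general sketch, where the path and rng parameters and the name sample_line are additions for illustration only:

```ruby
require "tempfile"

# Loads every line into memory, then lets Array#sample choose one.
# chomp: true strips the trailing newline from each line.
# Passing a seeded Random via `random:` makes the choice repeatable.
def sample_line(path, rng: Random.new)
  File.readlines(path, chomp: true).sample(random: rng)
end

# Usage with a throwaway file.
Tempfile.create("lines") do |f|
  f.write("alpha\nbeta\ngamma\n")
  f.flush
  puts sample_line(f.path)
end
```

For an empty file, readlines returns an empty array and sample returns nil.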

#3

This function does exactly what you need.

It's not a one-liner. But it works with textfiles of any size (except zero size, maybe :).

def random_line(filename)
  blocksize, line = 1024, ""
  File.open(filename) do |file|
    initial_position = rand(File.size(filename)-1)+1 # random pointer position. Not a line number!
    pos = Array.new(2).fill( initial_position ) # array [prev_position, current_position]
    # Find beginning of current line
    begin
      pos.push([pos[1]-blocksize, 0].max).shift # calc new position
      file.pos = pos[1] # move pointer backward within file
      offset = (n = file.read(pos[0] - pos[1]).rindex(/\n/) ) ? n+1 : nil
    end until pos[1] == 0 || offset
    file.pos = pos[1] + offset.to_i
    # Collect line text till the end
    begin
      data = file.read(blocksize)
      line.concat((p = data.index(/\n/)) ? data[0,p.to_i] : data)
    end until file.eof? or p
  end
  line
end

Try it:

filename = "huge_text_file.txt"
100.times { puts random_line(filename).force_encoding("UTF-8") }

Negligible (imho) drawbacks:

the longer the line, the higher the chance it'll be picked.

doesn't take into account the "\r" line separator (Windows-specific). Use files with Unix-style line endings!
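
How negligible the first drawback is depends on how much your line lengths vary. The sketch below is not the answer's function. It is a simplified in-memory model of the same idea (pick a random offset, return the line containing it), with made-up helper names and ASCII-only text assumed so that characters and bytes coincide:

```ruby
# Returns the line (without its newline) that contains the given offset.
# A newline character counts as part of the line it terminates.
def line_at_offset(text, offset)
  start = offset.zero? ? 0 : (text.rindex("\n", offset - 1) || -1) + 1
  stop = text.index("\n", offset) || text.length
  text[start...stop]
end

# Picks random offsets and counts which line each one falls in.
def offset_pick_counts(text, trials, rng: Random.new(7))
  counts = Hash.new(0)
  trials.times { counts[line_at_offset(text, rng.rand(text.length))] += 1 }
  counts
end

# One short line (2 bytes with its newline) and one long line (10 bytes).
counts = offset_pick_counts("a\nbbbbbbbbb\n", 12_000)
counts.each { |line, n| puts "#{line}: #{n}" }
```

The long line covers 10 of the 12 offsets, so it should win about five times as often as the short one.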

#4

This is not much better than what you came up with, but at least it's shorter:

def pick_random_line
  lines = File.readlines("data.txt")
  lines[rand(lines.length)]
end

One thing you can do to make your code more Rubyish is to omit the parentheses. Use readlines and size instead of readlines() and size().

#5

A one-liner:

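def pick_random_line(file)
  # Shells out; $RANDOM needs a bash-compatible /bin/sh. $(...) replaces the
  # inner backticks, which would otherwise end Ruby's backtick literal early.
  `head -$((${RANDOM} % $(wc -l < #{file}) + 1)) #{file} | tail -1`
end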

If you protest that it's not Ruby, go find the talk from this year's Euruko titled "Ruby is unlike a Banana".

PS: Ignore SO's incorrect syntax highlighting.

#6

Here's a shorter version of Mark's excellent answer, though not as short as Dave's:

def pick_random_line number=0, chosen_line=""
  # number starts at 0 so the first line is kept with probability 1/1
  File.foreach("data.txt") {|line| chosen_line = line if rand < 1.0/(number+=1)}
  chosen_line
end

#7

Stat the file, pick a random number between zero and the size of the file, seek to that byte in the file. Scan until the next newline, then read and return the next line (assuming you're not at the end of the file).
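
A minimal sketch of that idea follows. The name random_line_by_seek, the rng parameter, and the wrap-around to the first line when the seek lands in the last line are all assumptions added here, since the description above leaves the end-of-file case open. Like any offset-based approach it is not uniform: a line is more likely to be returned when the line before it is long.

```ruby
require "tempfile"

# Seeks to a random byte, discards the remainder of the line it landed in,
# and returns the next full line. Wraps to the first line at end of file.
def random_line_by_seek(path, rng: Random.new)
  size = File.size(path)
  return nil if size.zero?

  File.open(path) do |file|
    file.seek(rng.rand(size))
    file.gets               # skip the (possibly partial) current line
    line = file.gets        # the next complete line, or nil at EOF
    if line.nil?
      file.rewind           # landed in the last line: wrap to the first
      line = file.gets
    end
    line
  end
end

# Usage with a throwaway file.
Tempfile.create("seek") do |f|
  f.write("first\nsecond\nthird\n")
  f.flush
  puts random_line_by_seek(f.path)
end
```

Only two short reads happen per call, so the cost does not grow with file size.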
