如何从文件中返回一个随机的行?面试问题

I am preparing for a phone interview. I came upon these questions on the internet. Can anyone tell me some good answers for these?

我正在准备电话面试。我在网上遇到了这些问题。有人能给我一些好的答案吗?

Suppose I give you a text file and ask you a to write a program that will return a random line from the file (all lines must have equal probability to be returned)

假设我给你一个文本文件，并要求你编写一个程序，该程序将从文件中返回一个随机的行(所有行必须具有相同的返回概率)
Same as part 1, except this time the entire text file cannot fit into main memory

与第1部分相同，只是这次整个文本文件不能装入主内存
Same as part 2, except now you have a stream instead of a file.

与第2部分相同，但是现在您有了一个流而不是一个文件。

Please help.

请帮助。

Ok...@Everyone, I really had a few ideas in my mintod before asking this...Seeing the relentless attack by my fellow SOers, I am posting my answers. Please feel free to attack them too...

好吧……@ everybody，在问这个问题之前，我真的有一些想法。看到我的同伴无情的攻击，我把我的答案张贴出来。请随意攻击他们……

1: Count the number of '\n' in the file. Generate a random number between 1 and the number and return the line after the number-1 '\n'.

1:计算文件中“\n”的数目。在1和数字之间生成一个随机数，并在数字1 '\n'之后返回行。

2: Bring the file into main memory part by part and follow the above procedure.

2:将文件部分导入主存，按以上步骤操作。

3: I dont have much idea about this and would appreciate any inputs.

他说:我对此不太了解，如果您有什么意见的话，我将不胜感激。

Its wonderful that you guys really give an inspiration to push forward.....

你们真是太棒了，你们真的给了我们前进的动力。

4 个解决方案

#1

Read all lines into an array, return a random line in the range of 1 and the amount of lines.

将所有行读入一个数组，返回1范围内的随机行和行数。
Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

最简单的:计算行数，随机选择行号，再次遍历文件并返回它。
You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

你只需要记住一行。每个新行有1/N的概率(N表示读的行)。

Pseudocode:

伪代码:
```
i = 1
chosen_line = ""
for line in lines:
    if random() < 1/i:    # random returns a uniform random number in [0,1)
        chosen_line = line
    i += 1
return chosen_line
```

Algorithm number 3 could be used for 1 and 2 too.

算法3也可以用于1和2。

#2

You can do this without having to read all the lines in memory, thus working well for huge files. Pseudocode:

您可以在不需要读取内存中的所有行的情况下完成这一操作，从而可以很好地处理大型文件。伪代码:

linenum := 0
ret := ''
while more lines to read:
    line := readline()
    linenum := linenum + 1
    r := uniform_random(0, linenum)
    if r < 1:
        ret := line

return ret

Proof: We begin by noting that we always save the first line in ret. If the file has one line, you are going to choose it, and you're done.

证明:我们首先注意到我们总是将第一行保存在ret中。如果文件有一行，您将选择它，然后就完成了。

For two-line file, ret will save the first line 100% of the times, and the second line will be saved in ret 50% of the time during the second iteration of the loop. Thus, each line has a probability of 0.5 of being selected.

对于两行文件，ret会将第一行保留100%的次数，而在循环的第二次迭代中，第二行保留50%的时间。因此，每一行都有被选中的概率为0.5。

Now, let's assume that this method works for files of ≤N lines. To prove that this means it works for N+1, in the (N+1)th iteration of the loop, there is a probability of 1/(N+1) that the last line will be selected (random(0, N+1) < 1 has that probability). Thus, the last line has 1/(N+1) probability of being selected. The probability of all other lines being selected is still going to be equal to each other, let's call this x. Then, N*x + 1/(N+1) == 1, which means that x = 1/(N+1).

现在,让我们假设这个方法适用于文件≤N行。为了证明它适用于N+1，在循环的(N+1)次迭代中，有一个概率是1/(N+1)最后一行被选中(随机(0,N+1) < 1有这个概率)。因此，最后一行有1/(N+1)被选中的概率。所有其它被选择的线的概率仍然是相等的，我们称它为x，然后N*x +1 /(N+1) = 1，这意味着x = 1/(N+1)

Proof by induction is complete.

通过归纳法证明是完整的。

Edit: Oops, didn't see the first answer's third method before replying. Still, I will keep this post here if only for the proof, and the opportunity for other people to correct it if there are any errors in it.

编辑:哎呦，回复之前没有看到第一个答案的第三种方法。不过，如果只是为了证明，我还是会把这篇文章放在这里，如果有错误，我也会给其他人改正的机会。

#3

Re 1: Use solution to 2

Re 1:使用解决方案2

Re 2: You would want to scan the whole file using a RandomAccessFile access to count the number of lines and (possibly) cache the file pointers for each start of line. Then you could choose one at random (I'm assuming this question is not about how to generate random numbers) and move back to that start point, read the line and return it. If you want it fast then make sure you are buffering the reads (raf is v slow otherwise).

Re 2:您可能希望使用RandomAccessFile访问来扫描整个文件，以计数行数，(可能)缓存每行开头的文件指针。然后你可以随机选择一个(我假设这个问题不是关于如何生成随机数)然后返回到起始点，读取一行并返回它。如果您想要它快，那么请确保您正在缓冲读(raf是v慢速的，否则)。

Re 3: If the stream doesn't fit in memory (i.e. you cannot cache the whole thing) and you don't know how many lines are in the stream without reading the whole stream (assuming you only get to read it once) then I cannot see a solution. I too wait for answers...

如果流不适合内存(例如，您不能缓存整个内容)，并且您不知道流中有多少行而不读取整个流(假设您只能读取一次)，那么我看不到解决方案。我也在等待答案……

#4

-1

#3: write the stream to a file on disk and use solution 2. Not the most efficient solution, of course, but very simple.

#3:将流写入磁盘上的文件并使用解决方案2。当然，这不是最有效的解决方案，但非常简单。

#1

Read all lines into an array, return a random line in the range of 1 and the amount of lines.

将所有行读入一个数组，返回1范围内的随机行和行数。
Most simple: Count the lines, choose a line number at random, go through the file a second time and return it.

最简单的:计算行数，随机选择行号，再次遍历文件并返回它。
You just have to remember one line. Each new line has a probability of 1/N (N being lines read).

你只需要记住一行。每个新行有1/N的概率(N表示读的行)。

Pseudocode:

伪代码:
```
i = 1
chosen_line = ""
for line in lines:
    if random() < 1/i:    # random returns a uniform random number in [0,1)
        chosen_line = line
    i += 1
return chosen_line
```