This is essentially a more constrained version of this question.
Suppose we have a very large text file, containing a large number of lines.
We need to choose a line at random from the file, with uniform probability, but there are constraints:
- Because this is a soft realtime application, we cannot iterate over the entire file. The choice should take a constant-ish amount of time.
- Because of memory constraints, the file cannot be cached.
- Because the file is permitted to change at runtime, the length of the file cannot be assumed to be a constant.
My first thought is to use an lstat() call to get the total file size in bytes. fseek() can then be used to directly access a random byte offset, getting something like O(1) access into a random part of the file.
The problem is that we can't then do something like read to the next newline and call it a day, because that would produce a distribution biased toward long lines.
My first thought at solving this issue is to read until the first "n" newlines (wrapping back to the file's beginning if required), and then choose a line with uniform probability from this smaller set. It is safe to assume the file's contents are randomly ordered, so this sub-sample should be uniform with respect to length, and, since its starting point was selected uniformly from all possible points, it should represent a uniform choice from the file as a whole. So, in pseudo-C, our algorithm looks something like:
lstat(filepath, &filestat);
fseek(file, (long)(filestat.st_size*drand48()), SEEK_SET);
char sample[n][BUFSIZ];
for(int i=0;i<n;i++)
    fgets(sample[i], BUFSIZ, file); //plus some stuff to deal with file wrap-around...
return sample[(int)(n*drand48())];
This doesn't seem like an especially elegant solution, and I'm not completely confident it will be uniform, so I'm wondering if there's a better way to do it. Any thoughts?
EDIT: On further consideration, I'm now pretty sure that my method is not uniform, since the starting point is more likely to land inside a longer line. Tricky!
3 Answers
#1
Select a random character from the file (via rand and seek as you noted). Now, instead of finding the associated newline, since that is biased as you noted, I would apply the following algorithm:
Is the character a newline character?
- yes: use the preceding line
- no: try again
I can't see how this could give anything but a uniform distribution of lines. The efficiency depends on the average length of a line. If your file has relatively short lines, this could be workable, though if the file really can't be precached even by the OS, you might pay a heavy price in physical disk seeks.
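A minimal C sketch of this rejection idea (my own illustration of the approach above, not code from the answer; pick_line is a hypothetical helper, and it assumes the file ends with a newline and that lines fit in the buffer):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/stat.h>

/* Keep seeking to random bytes until one is a '\n', then return the
   line that ends at that newline. */
int pick_line(FILE *f, off_t size, char *buf, int bufsize){
    for(;;){
        off_t pos = (off_t)(size * drand48());
        fseek(f, pos, SEEK_SET);
        if(fgetc(f) != '\n')
            continue; /* not a newline: reject and try again */
        /* scan backwards, byte by byte, to find the start of the
           preceding line (slow but simple for a sketch) */
        off_t start = pos;
        while(start > 0){
            fseek(f, start - 1, SEEK_SET);
            if(fgetc(f) == '\n') break;
            start--;
        }
        fseek(f, start, SEEK_SET);
        return fgets(buf, bufsize, f) != NULL;
    }
}

int main(){
    srand48(time(NULL));
    struct stat st;
    stat("/usr/share/dict/words", &st);
    FILE *f = fopen("/usr/share/dict/words", "r");
    char line[BUFSIZ];
    if(pick_line(f, st.st_size, line, sizeof line))
        fputs(line, stdout);
    fclose(f);
    return 0;
}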
#2
A solution was found that works surprisingly well. Documenting here for myself and others.
This example code does around 80,000 draws per second in practice, with a mean line length that matches that of the file to 4 significant digits on most runs. In contrast, I get around 250 draws per second using the method from the cross referenced question.
Essentially what it does is sample a random place in the file, and then discard it and draw again with probability inversely proportional to the line length. This cancels out the bias toward longer lines. On average, the method makes a number of draws equal to the average line length in the file before accepting one.
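To spell out why the rejection makes this uniform: the initial seek lands inside a given line of length L with probability L/N (where N is the file size in bytes), and the draw then survives the rejection step with probability 1/L, so every line is accepted with probability 1/N per attempt, independent of its length.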
Some notable drawbacks:
- Files with longer line lengths will produce more rejections per draw, making this much slower.
- Files with longer line lengths require a larger constant than 50 in the rdraw function, which appears to mean much longer seek times in practice if line lengths exhibit high variance. For instance, setting it to BUFSIZ on one file I tested with reduced draw speeds to around 10,000 draws per second. Still much faster than counting lines in the file, though.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/stat.h>

int rdraw(FILE* where, char *storage, size_t bytes){
    int offset = (int)(bytes*drand48());       /* random byte in the file */
    int initial_seek = offset>50?offset-50:0;  /* back up to catch the start of the line */
    fseek(where, initial_seek, SEEK_SET);
    int chars_read = 0;
    while(chars_read + initial_seek < offset){ /* read forward until we pass the target byte */
        fgets(storage,50,where);
        chars_read += strlen(storage);
    }
    return strlen(storage); /* length of the chunk containing the target byte */
}

int main(){
    srand48(time(NULL));
    struct stat blah;
    stat("/usr/share/dict/words", &blah);
    FILE *where = fopen("/usr/share/dict/words", "r");
    off_t bytes = blah.st_size;
    char b[BUFSIZ+1];
    int i;
    for(i=0;i<1000000; i++){
        /* rejection step: accept the draw with probability 1/line_length */
        while(drand48() > 1.0/(rdraw(where, b, bytes)));
    }
}
#3
If the file only changes at the end (i.e., more lines are appended), you can create an algorithm with uniform probability:
Preparation: Create an index file that contains the offset of every n-th line. Use a fixed-width format so that the position in the index file can be used to determine which record you have.
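A rough sketch of what this preparation could look like, assuming a hypothetical fixed-width format of 8-digit zero-padded offsets separated by commas (matching the example further down); build_index is an illustrative helper, not code from the answer:

#include <stdio.h>

#define N_PER_REC 100 /* index every 100th line, i.e. n = 100 */

/* Write the byte offset of every N_PER_REC-th line as a fixed-width
   record. Assumes every line fits in BUFSIZ. */
void build_index(const char *bigpath, const char *idxpath){
    FILE *big = fopen(bigpath, "r");
    FILE *idx = fopen(idxpath, "w");
    char line[BUFSIZ];
    long lineno = 0;
    long offset = 0;
    while(fgets(line, sizeof line, big)){
        if(lineno % N_PER_REC == 0)
            fprintf(idx, "%s%08ld", lineno ? "," : "", offset);
        lineno++;
        offset = ftell(big);
    }
    fclose(big);
    fclose(idx);
}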
1. Open the index file and read the last record. Use ftell to determine the record number.
2. Open the big file and fseek to the offset obtained in step 1.
3. Read the big file to the end, counting the number of newlines. You now have the total number of lines in the big file.
4. Generate a random number up to the number of lines obtained in step 3.
5. fseek to, and read, the appropriate record in the index file.
6. fseek to the appropriate offset in the large file. Skip a number of newlines equal to the remainder.
7. Read the line!
Example
Let's assume we chose n=100 and that the large file contains 367 lines.
Index file:
00000000,00004753,00009420,00016303
1. The index file has 4 records, so the large file contains at least 300 lines (100 * (4-1)). The last offset is 16303.
2. Open the large file and fseek to 16303.
3. Count the remaining number of lines (67).
4. Generate a random number in the range [0-366]. Let's say we got 112.
5. 112/100 = 1 with 12 as remainder. Read the index file record at position 1. We get the result 4753.
6. fseek to 4753 in the large file and then skip 11 (12-1) lines.
7. Read the 12th line.
Voila!
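Putting the lookup steps together, a rough sketch (same hypothetical index format as above; note that lines are 0-indexed here, so the remainder is skipped directly rather than remainder minus 1):

#include <stdio.h>
#include <stdlib.h>

#define REC_WIDTH 9   /* 8 digits plus a comma per index record */
#define N_PER_REC 100

long read_index_record(FILE *idx, long record_no){
    char rec[9];
    fseek(idx, record_no * REC_WIDTH, SEEK_SET);
    fread(rec, 1, 8, idx); /* read the 8 digits, ignore the separator */
    rec[8] = '\0';
    return atol(rec);
}

/* Steps 1-7 from above: draws one line uniformly into buf.
   Assumes every line fits in bufsize. */
int draw_line(FILE *idx, FILE *big, char *buf, int bufsize){
    /* Step 1: the last index record tells us where the indexed part
       ends (+1 because the final record has no trailing comma). */
    fseek(idx, 0, SEEK_END);
    long records = (ftell(idx) + 1) / REC_WIDTH;
    long last_offset = read_index_record(idx, records - 1);
    /* Steps 2-3: count the unindexed tail to get the total line count. */
    fseek(big, last_offset, SEEK_SET);
    long tail = 0;
    while(fgets(buf, bufsize, big)) tail++;
    long total = (records - 1) * N_PER_REC + tail;
    /* Step 4: choose a target line uniformly. */
    long target = (long)(total * drand48());
    /* Steps 5-7: jump via the index, skip the remainder, read the line. */
    fseek(big, read_index_record(idx, target / N_PER_REC), SEEK_SET);
    for(long skip = target % N_PER_REC; skip > 0; skip--)
        fgets(buf, bufsize, big);
    return fgets(buf, bufsize, big) != NULL;
}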
Edit:
I saw the comment about the target file changing. If the target file changes only rarely, then this may still be a viable approach. You would need to create a new index file before switching target files. You may also want to update the index file when the target file has grown by more than n rows.