I have a file with the following contents:
abc
def
high
lmn
...
...
There are more than 2 million lines in the file. I want to randomly sample lines from the file and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).
Related (Possibly Duplicate) Questions:
- Randomly Pick Lines From a File Without Slurping It With Unix
- How can I get exactly n random lines from a file with Perl?
5 Solutions
#1
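Assuming you want to output roughly 2.5% of all lines (50,000 out of 2,000,000), this would do. Each line is kept independently with probability 0.025, so the result is about 50K lines, not exactly 50K:
while (<$input>) { print if rand() < 0.025 }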
#2
Shell way:
sort -R file | head -n 50000
#3
From perlfaq5: "How do I select a random line from a file?"
Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
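The snippet above keeps only a single line, while the question asks for 50K. One possible generalization is to hold a reservoir of k lines instead of one (often called Algorithm R). The following is a minimal sketch and is not taken from perlfaq5; the sub name sample_lines and its arguments are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $k lines chosen uniformly at random
# from the filehandle $fh, reading the input exactly once.
sub sample_lines {
    my ($fh, $k) = @_;
    my @reservoir;
    my $seen = 0;    # number of lines read so far

    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            # The first $k lines fill the reservoir directly.
            push @reservoir, $line;
        }
        else {
            # Keep line number $seen with probability $k / $seen by
            # overwriting a uniformly chosen slot.
            my $j = int(rand($seen));
            $reservoir[$j] = $line if $j < $k;
        }
    }
    return @reservoir;
}
```

It could be called as `print sample_lines(\*STDIN, 50_000);` with the file on standard input. Memory use is proportional to k rather than to the file size, and the returned lines are in reservoir order, not in their original file order.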
You can use the File::Random module which provides a function for that algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
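As a rough illustration of the Tie::File approach, the sketch below ties the file read-only and indexes into it with a random subscript. The sub name random_line_tied is hypothetical; Tie::File and Fcntl ship with Perl:

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: return one randomly chosen line of $filename,
# without its trailing newline (Tie::File strips the record separator).
sub random_line_tied {
    my ($filename) = @_;

    # Open read-only so the file is never created or modified.
    tie(my @lines, 'Tie::File', $filename, mode => O_RDONLY)
        or die "Can't tie '$filename': $!\n";

    # Asking for the array size makes Tie::File scan the file to find
    # the record boundaries; the lines themselves are fetched lazily.
    my $line = $lines[ int(rand(@lines)) ];

    untie @lines;
    return $line;
}
```

This trades speed for convenience: the file does not have to be held in memory, but on a file with millions of lines the indexing pass is likely to be slower than a plain read loop.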
#4
Perl way:
Use CPAN. The File::RandomLine module does exactly what you need.
#5
If you need to extract an exact number of lines:
use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file (start at 0 so an empty file does not leave it undefined)
my $lines = 0;
my $buffer;
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++;
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;