What is a good/best way to count the number of characters, words, and lines of a text file using Perl (without using wc)?
10 Answers
#1
21
Here's the Perl code. Counting words can be somewhat subjective, but I just say a word is any string of characters that isn't whitespace.
open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}
print("lines=$lines words=$words chars=$chars\n");
#2
7
A variation on bmdhacks' answer that will probably produce better results is to use \s+ (or, even better, \W+) as the delimiter. Consider the string "The  quick  brown fox" (note the doubled spaces, in case it isn't obvious). Using a delimiter of a single whitespace character will give a word count of six, not four. So, try:
open(FILE, "<file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}
print("lines=$lines words=$words chars=$chars\n");
Using \W+ as the delimiter will stop punctuation (amongst other things) from counting as words.
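To make the difference concrete, here is a small self-contained sketch (mine, not part of any of the original answers) comparing the delimiter choices on that sample string:

#!/usr/bin/env perl
use strict;
use warnings;

my $line = "The  quick  brown fox.\n";   # the sample string, doubled spaces restored, plus a full stop

printf "/\\s/  -> %d fields\n", scalar( () = split /\s/,  $line );   # 6: doubled spaces create empty fields
printf "/\\s+/ -> %d fields\n", scalar( () = split /\s+/, $line );   # 4
printf "/\\W+/ -> %d fields\n", scalar( () = split /\W+/, $line );   # 4: the trailing '.' is a separator too

# Remaining gotcha: a line that starts with whitespace gives /\s+/ (and /\W+/)
# a leading empty field, inflating the count. split ' ' skips leading whitespace.
my $indented = "   The  quick  brown fox.\n";
printf "/\\s+/ on an indented line -> %d fields\n", scalar( () = split /\s+/, $indented );   # 5: leading empty field
printf "' '   on an indented line -> %d fields\n", scalar( () = split ' ',   $indented );   # 4

Note that split ' ' (a literal space string rather than a pattern) is Perl's special awk-like mode: it skips leading whitespace and splits on runs of whitespace, so it is usually the closest match to what wc considers a word.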
#4
3
Here. Try this Unicode-savvy version of the wc program.
- It skips non-file arguments (pipes, directories, sockets, etc).
- It assumes UTF-8 text.
- It counts any Unicode whitespace as a word separator.
- It also accepts alternate encodings if there is a .ENCODING at the end of the filename, like foo.cp1252, foo.latin1, foo.utf16, etc.
- It also works with files that have been compressed in a variety of formats.
- It gives counts of Paragraphs, Lines, Words, Graphemes, Characters, and Bytes.
- It understands all Unicode linebreak sequences.
- It warns about corrupted textfiles with linebreak errors.
Here’s an example of running it:
Paras Lines Words Graphs Chars Bytes File
2 2270 82249 504169 504333 528663 /tmp/ap
1 2404 11163 63164 63164 66336 /tmp/b3
uwc: missing linebreak at end of corrupted textfile /tmp/bad
1* 2* 4 19 19 19 /tmp/bad
1 14 52 273 273 293 /tmp/es
57 383 1369 11997 11997 12001 /tmp/funny
1 657068 3175429 31205970 31209138 32633834 /tmp/lw
1 1 4 27 27 27 /tmp/nf.cp1252
1 1 4 27 27 34 /tmp/nf.euc-jp
1 1 4 27 27 27 /tmp/nf.latin1
1 1 4 27 27 27 /tmp/nf.macroman
1 1 4 27 27 54 /tmp/nf.ucs2
1 1 4 27 27 56 /tmp/nf.utf16
1 1 4 27 27 54 /tmp/nf.utf16be
1 1 4 27 27 54 /tmp/nf.utf16le
1 1 4 27 27 112 /tmp/nf.utf32
1 1 4 27 27 108 /tmp/nf.utf32be
1 1 4 27 27 108 /tmp/nf.utf32le
1 1 4 27 27 39 /tmp/nf.utf7
1 1 4 27 27 31 /tmp/nf.utf8
1 26906 101528 635841 636026 661202 /tmp/o2
131 346 1370 9590 9590 4486 /tmp/perl5122delta.pod.gz
291 814 3941 25318 25318 9878 /tmp/perl51310delta.pod.bz2
1 2551 5345 132655 132655 133178 /tmp/tailsort-pl.utf8
1 89 334 1784 1784 2094 /tmp/til
1 4 18 88 88 106 /tmp/w
276 1736 5773 53782 53782 53804 /tmp/www
Here ya go:
#!/usr/bin/env perl
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################
use 5.10.0;
use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];
use Carp;
$SIG{__WARN__} = sub {
    confess("FATALIZED WARNING: @_") unless $^S;
};
$SIG{__DIE__} = sub {
    confess("UNCAUGHT EXCEPTION: @_") unless $^S;
};
$| = 1;
my $Errors = 0;
my $Headers = 0;
sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}
process_input(\&countem);
sub countem {
    my ($text, $file) = @_;
    local $_ = $text;   # the original used "my $_", which was removed in Perl 5.24; localize the global instead
    my (
        @paras,     @lines,     @words,
        $paracount, $linecount, $wordcount,
        $grafcount, $charcount, $bytecount,
    );
    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+ }x };
        yuck "error splitting words: $@" if $@;
        $linecount = eval { @lines = split m{ \R }x };
        yuck "error splitting lines: $@" if $@;
        $grafcount = 0;
        $grafcount++ while /\X/g;
        #$grafcount = eval { @lines = split m{ \R }x };
        yuck "error splitting lines: $@" if $@;
        $paracount = eval { @paras = split m{ \R{2,} }x };
        yuck "error splitting paras: $@" if $@;
        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfile $file");
            $linecount .= "*";
            $paracount .= "*";
        }
    }
    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    }
    my $mask = "%8s " x 6 . "%s\n";
    printf $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;
    printf $mask => map( { show_undef($_) }
                             $paracount, $linecount,
                             $wordcount, $grafcount,
                             $charcount, $bytecount,
                       ), $file;
}
sub show_undef {
    my $value = shift;
    return defined($value)
           ? $value
           : "undef";
}
END {
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}
sub process_input {
    my $function = shift();
    my $enc;
    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }
    unshift(@ARGV, "-") if @ARGV == 0;
FILE:
    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;
        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;
        no strict "refs";
        my $fh = $file; # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;
        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };
        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }
        $function->($whole_file, $file);
        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }
    } # foreach file
}
sub set_encoding(*$) {
    my ($handle, $path) = @_;
    my $enc_name = "utf8";
    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }
    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };
    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }
    yuck("set_encoding: $@");
    return undef;
}
sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z     => "zcat",
        z     => "gzcat",   # for uncompressing
        gz    => "gzcat",
        bz    => "bzcat",
        bz2   => "bzcat",
        bzip  => "bzcat",
        bzip2 => "bzcat",
        lzma  => "lzcat",
    );
    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        }
    }
    return $path;
}
#5
2
There is the Perl Power Tools project whose goal is to reconstruct all the Unix bin utilities, primarily for those on operating systems deprived of Unix. Yes, they did wc. The implementation is overkill, but it is POSIX compliant.
It gets a little ridiculous when you look at the GNU compliant implementation of true.
#6
1
Non-serious answer:
system("wc foo");
#7
1
Reading the file in fixed-size chunks may be more efficient than reading line by line. The wc binary does this.
#!/usr/bin/env perl
use strict;
use warnings;
use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        next;   # 'continue' is not a loop-control statement in Perl; 'next' is what's wanted here
    };
    my ($chars, $words, $lines) = (0, 0, 0);
    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += () = /\s+/g;            # count whitespace runs (word separators) in this block
        $words-- if $new_word && /\A\s/;  # a run split across two blocks is one separator, not two
        $lines += () = /\n/g;
        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    # a final word or line with no trailing whitespace/newline still counts
    $words++ if $chars && !$new_word;
    $lines++ if $chars && !$new_line;
    print "\t$lines\t$words\t$chars\t$file\n";
}
#8
1
To be able to count CHARS and not bytes, consider this (try it with Chinese or Cyrillic letters and a file saved as UTF-8):
use utf8;
my $file  = 'file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
    || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );        # decode the file as UTF-8
read $fh, my $txt, -s $file;   # -s is the size in bytes, which is always >= the number of characters
close $fh;

print length $txt, $/;         # character count

use bytes;                     # from here to the end of the file, length() counts bytes
print length $txt, $/;         # byte count
#9
1
I stumbled upon this while googling for a character count solution. Admittedly, I know next to nothing about perl so some of this may be off base, but here are my tweaks of newt's solution.
First, there is a built-in line count variable anyway, so I just used that. This is probably a bit more efficient, I guess. As it is, the character count includes newline characters, which is probably not what you want, so I chomped $_. Perl also complained about the way the split() was done (implicit split; see: Why does Perl complain "Use of implicit split to @_ is deprecated"?), so I tweaked that. My input files are UTF-8, so I opened them as such. That probably helps get the correct character count if the input file contains non-ASCII characters.
Here's the code:
open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";
my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";
#10
0
This may be helpful to Perl beginners. I tried to simulate the MS Word counting functionality and added one more feature that wc on Linux does not show:
- number of lines
- number of words
- number of characters with spaces
- number of characters without spaces (wc will not give this in its output, but Microsoft Word shows it)
Here is the URL: Counting words, characters and lines in a file
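The linked script itself is not reproduced in the answer, so here is a minimal sketch of my own (an assumption about what the linked page does, not the author's code) showing how those four counts could be produced for a UTF-8 file; like MS Word, it does not count newline characters:

#!/usr/bin/env perl
use strict;
use warnings;

my $file = shift // 'file.txt';   # hypothetical input file
open my $fh, '<:encoding(UTF-8)', $file or die "Could not open $file: $!";

my ($lines, $words, $chars_with_spaces, $chars_without_spaces) = (0, 0, 0, 0);
while (my $line = <$fh>) {
    $lines++;
    chomp $line;                                      # don't count the newline itself
    $chars_with_spaces    += length $line;            # characters, spaces included
    $chars_without_spaces += () = $line =~ /\S/g;     # characters, whitespace excluded
    $words                += () = $line =~ /\S+/g;    # runs of non-whitespace
}
close $fh;

print "lines=$lines words=$words ",
      "chars(with spaces)=$chars_with_spaces ",
      "chars(without spaces)=$chars_without_spaces\n";

Words here are runs of non-whitespace, the same rule used in answer #1.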