I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:
grep -e "[\x{00FF}-\x{FFFF}]" file.xml
But this returns every line in the file, regardless of whether the line contains a character in the range specified.
Do I have the syntax wrong or am I doing something else wrong? I've also tried:
egrep "[\x{00FF}-\x{FFFF}]" file.xml
(with both single and double quotes surrounding the pattern).
10 answers
#1
399
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This prints the line number of each matching line and highlights the non-ASCII characters in red.
On some systems, depending on your settings, the above will not work, so you can grep for the inverse instead:
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which is equivalent to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The man page also says that this is highly experimental and that grep -P may warn of unimplemented features.
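Since -P is not available in every grep, a wrapper can try it first and fall back to a plain bracket expression. The following is only a sketch: the name non_ascii_grep is made up for illustration, and the fallback assumes a grep that compares bracket ranges by byte value under LC_ALL=C.

```shell
# non_ascii_grep FILE...: print "line-number:line" for every line that
# contains a byte outside 0x00-0x7F. (Hypothetical helper name.)
non_ascii_grep() {
    if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
        # This grep accepts -P: use the hex-escape pattern from the answer.
        LC_ALL=C grep -n -P '[^\x00-\x7F]' "$@"
    else
        # No -P: build the bracket expression from the raw bytes 0x01 and 0x7F
        # (a NUL byte cannot be stored in a shell variable, so the range starts at 0x01).
        lo=$(printf '\001')
        hi=$(printf '\177')
        LC_ALL=C grep -n "[^${lo}-${hi}]" "$@"
    fi
}
```

It would be called as non_ascii_grep file.xml. Setting LC_ALL=C in both branches makes grep treat the input as single bytes rather than as characters of the current locale.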
#2
94
Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters.
So the first solution, for instance, would become:
grep --color='auto' -P -n '[^\x00-\x7F]' file.xml
(which basically greps for any character outside the ASCII range, hexadecimal \x00 through \x7F)
On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:
pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml
Any pros or cons that anyone can think of?
#3
65
The following works for me:
grep -P "[\x80-\xFF]" file.xml
Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching, as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.
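To see why a byte range is enough, it can help to look at the bytes of a multi-byte character. This sketch assumes UTF-8 input; hex_bytes is a made-up helper name, and the two bytes C3 A9 are the UTF-8 encoding of "é" written as octal escapes so the example does not depend on how the script itself is encoded.

```shell
# hex_bytes: dump stdin as lowercase hex, one byte per line, no offsets.
hex_bytes() {
    od -An -v -tx1 | tr -s ' ' '\n' | sed '/^$/d'
}

# U+00E9 is stored as two bytes in UTF-8, and both are >= 0x80.
printf '\303\251' | hex_bytes    # prints c3, then a9
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a byte-level pattern such as [\x80-\xFF] flags any line that contains one.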
#4
45
In Perl:
perl -ne 'print if /[[:^ascii:]]/' fileName > newFile
#5
32
The easy way is to define a non-ASCII character... as a character that is not an ASCII character.
LC_ALL=C grep '[^ -~]' file.xml
Add a tab after the ^ if necessary.
Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters; otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
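Typing a literal tab inside the brackets is awkward on the command line, so one option is to let printf produce it. A minimal sketch, with non_ascii_lines as a made-up helper name:

```shell
# non_ascii_lines FILE...: print "line-number:line" for every line containing
# a byte that is neither printable ASCII (space through ~) nor a tab.
non_ascii_lines() {
    tab=$(printf '\t')
    # LC_ALL=C makes the range " -~" mean the bytes 0x20 through 0x7E.
    LC_ALL=C grep -n "[^${tab} -~]" "$@"
}
```

Note that this pattern also flags control characters, so in a file with DOS line endings every line matches because of the trailing CR.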
#6
17
Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone for finding additional non-ASCII characters:
grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
Note: my computer's grep (a Mac) did not have the -P option, so I did brew install grep and started the call above with ggrep instead of grep.
#7
5
The following code works:
find /tmp | perl -ne 'print if /[^[:ascii:]]/'
Replace /tmp with the name of the directory you want to search through.
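The one-liner above filters the path names that find prints, so it reports files whose names contain non-ASCII characters. If the goal is to check file contents instead, something along these lines could work; files_with_non_ascii is a hypothetical helper name.

```shell
# files_with_non_ascii DIR: print the names of regular files under DIR whose
# contents include a byte that is neither printable ASCII nor a tab.
files_with_non_ascii() {
    tab=$(printf '\t')
    # LC_ALL=C is inherited by the grep processes that find starts.
    LC_ALL=C find "$1" -type f -exec grep -l "[^${tab} -~]" {} +
}
```

Here grep -l prints only the file name, which keeps the output readable when many lines in a file match.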
#8
1
Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:
cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'
For Unicode characters (like \u2212 in the example below) use this:
find . ... -exec perl -CA -e '$ARGV = @ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
#9
0
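It can be useful to know how to search for a single Unicode character. This command can help; you only need to know the character's code point, provided your shell supports \u escapes inside $'...' quoting (recent versions of bash do). Note that -v prints the lines that do not contain the character; drop it to print the lines that do.
grep -v $'\u200d'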
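Where the shell has no \u escape, the same search can be written with the character's UTF-8 bytes. This is a sketch under the assumption that the file is UTF-8 encoded; lines_with_zwj is a made-up name, and E2 80 8D (octal 342 200 215) is the UTF-8 encoding of U+200D.

```shell
# lines_with_zwj FILE...: print "line-number:line" for every line that
# contains U+200D (ZERO WIDTH JOINER), matched as its three UTF-8 bytes.
lines_with_zwj() {
    zwj=$(printf '\342\200\215')
    # -F treats the bytes as a fixed string; LC_ALL=C compares byte by byte.
    LC_ALL=C grep -n -F -e "$zwj" "$@"
}
```

Adding -v to the grep call would invert the result and list the lines without the character.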
#10
0
Searching for non-printable chars.
I agree with Harvey's comment above: it is often more useful to search for non-printable characters, and it is easy to think "non-ASCII" when you really should be thinking "non-printable". Harvey suggests using "[^\n -~]", adding \r for DOS text files. In hex notation that is "[^\x0A\x20-\x7E]", with \x0D added for CR.
Also, adding -c (print a count of matching lines) to grep is useful when searching for non-printable chars, as the matched strings can mess up the terminal.
I found that adding the ranges 0x00-0x08 and 0x0E-0x1F to the 0x80-0xFF range gives a useful pattern. This leaves out TAB, LF and CR, plus the less common VT and FF, so those characters are not flagged. So IMHO quite a useful (albeit crude) grep pattern is this one:
grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *
Breakdown:
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - non-ASCII bytes 128 - 255 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps
Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches
A practical example that uses find to grep all files under the current directory:
find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} +
You may wish to adjust the pattern at times, e.g. to allow the BS (0x08, backspace) char, which is used in some printable files, or to exclude VT (0x0B, vertical tab). The BEL (0x07) and ESC (0x1B) chars can also be deemed printable in some cases.
Non-printable ASCII chars
** marks control chars that are printable and that it is sometimes useful to exclude

Dec  Hex  Ctrl  Char description
  0  00   ^@    NULL
  1  01   ^A    START OF HEADING (SOH)
  2  02   ^B    START OF TEXT (STX)
  3  03   ^C    END OF TEXT (ETX)
  4  04   ^D    END OF TRANSMISSION (EOT)
  5  05   ^E    ENQUIRY (ENQ)
  6  06   ^F    ACKNOWLEDGE (ACK)
  7  07   ^G    BELL (BEL)
  8  08   ^H    BACKSPACE (BS)**
  9  09   ^I    HORIZONTAL TAB (HT)**
 10  0A   ^J    LINE FEED (LF)**
 11  0B   ^K    VERTICAL TAB (VT)**
 12  0C   ^L    FORM FEED (FF)**
 13  0D   ^M    CARRIAGE RETURN (CR)**
 14  0E   ^N    SHIFT OUT (SO)
 15  0F   ^O    SHIFT IN (SI)
 16  10   ^P    DATA LINK ESCAPE (DLE)
 17  11   ^Q    DEVICE CONTROL 1 (DC1)
 18  12   ^R    DEVICE CONTROL 2 (DC2)
 19  13   ^S    DEVICE CONTROL 3 (DC3)
 20  14   ^T    DEVICE CONTROL 4 (DC4)
 21  15   ^U    NEGATIVE ACKNOWLEDGEMENT (NAK)
 22  16   ^V    SYNCHRONIZE (SYN)
 23  17   ^W    END OF TRANSMISSION BLOCK (ETB)
 24  18   ^X    CANCEL (CAN)
 25  19   ^Y    END OF MEDIUM (EM)
 26  1A   ^Z    SUBSTITUTE (SUB)
 27  1B   ^[    ESCAPE (ESC)
 28  1C   ^\    FILE SEPARATOR (FS) RIGHT ARROW
 29  1D   ^]    GROUP SEPARATOR (GS) LEFT ARROW
 30  1E   ^^    RECORD SEPARATOR (RS) UP ARROW
 31  1F   ^_    UNIT SEPARATOR (US) DOWN ARROW
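For a grep without -P, POSIX character classes give nearly the same test. This is only a sketch, not the pattern from the answer: under LC_ALL=C, [:print:] covers 0x20-0x7E and [:space:] covers space, TAB, LF, VT, FF and CR, so the negated class matches 0x00-0x08, 0x0E-0x1F and 0x7F-0xFF (it additionally flags DEL, 0x7F). The name count_non_printable is made up for illustration.

```shell
# count_non_printable FILE...: print the number of lines in each file that
# contain a byte that is neither printable ASCII nor ASCII whitespace.
count_non_printable() {
    LC_ALL=C grep -c '[^[:print:][:space:]]' "$@"
}
```

With a single file this prints just the count; with several files grep prefixes each count with the file name.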