findstr or grep that autodetects character encoding (UTF-16)

Date: 2022-08-21 14:02:51

I want to do this:

 findstr /s /c:some-symbol *

or the grep equivalent

 grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even start with the byte-order mark (FF FE), so I'm not looking for heroic autodetection.

Any suggestions?


I'm referring to Windows Vista and XP.

7 answers

#1


Thanks for the suggestions. I was referring to Windows Vista and XP.

I also discovered this workaround, using free Sysinternals strings.exe:

C:\> strings -s -b dir_tree_to_search | grep regexp 

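Strings.exe extracts all of the strings it finds (it is meant for binaries, but works fine with text files too) and prepends each result with a filename and colon, so take that into account in the regexp (or use cut or another step in the pipeline). The -s makes it recurse into subdirectories, and -b was there just to suppress the banner message. (Newer Strings releases appear to use -nobanner or -q for the banner and treat -b as a byte count, so check strings -? on your version.)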
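To illustrate the "take that into account" step, here is a minimal Python sketch that matches only the extracted string and ignores the filename prefix. It assumes each output line has the shape `path: extracted string`, split at the first colon followed by a space; the function name `filter_strings_output` and the sample lines are made up for illustration, not real Strings output.

```python
import re

def filter_strings_output(lines, pattern):
    """Yield (filename, text) pairs whose text part matches pattern.

    Assumption: every line looks like 'path: extracted string', so the
    first ': ' separates the filename from the string. A drive prefix
    such as 'C:\\' is not affected because its colon is not followed
    by a space.
    """
    regex = re.compile(pattern)
    for line in lines:
        filename, sep, text = line.rstrip("\r\n").partition(": ")
        # Skip lines without a separator and lines where only the
        # filename (not the extracted text) contains the pattern.
        if sep and regex.search(text):
            yield filename, text

# Hypothetical sample lines in the assumed format.
sample = [
    r"C:\src\a.txt: some-symbol = 1",
    r"C:\src\some-symbol.txt: unrelated",
]
for name, text in filter_strings_output(sample, r"some-symbol"):
    print(f"{name}: {text}")
```

Only the first sample line is reported; the second has the pattern in its filename but not in its text, which is exactly the false positive a plain grep over the combined line would produce.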

Ultimately, I'm still kind of surprised that the flagship searching utilities GNU grep and findstr don't handle Unicode character encodings natively.

#2


On Windows, you can also use find.exe.

find /i /n "YourSearchString" *.*

The only problem is that this prints each file name followed by its matches. You can filter out the file-name lines by piping to findstr:

find /i /n "YourSearchString" *.* | findstr /i "YourSearchString"

#3


findstr /s /c:some-symbol *

can be replaced with the following character-encoding-aware command (inside a batch file, write %%f instead of %f):

for /r %f in (*) do @find /i /n "some-symbol" "%f"

#4


A workaround is to convert your UTF-16 file to ASCII or ANSI:

TYPE UTF-16.txt > ASCII.txt

Then you can use FINDSTR.

FINDSTR object ASCII.txt

#5


In later versions of Windows, UTF-16 is supported out of the box. If not, try changing the active code page with the chcp command.

In my case, findstr on its own failed on UTF-16 files, but it worked when the files were piped through type:

type *.* | findstr /s /c:some-symbol

#6


According to this blog article by Damon Cortesi, grep doesn't work with UTF-16 files, as you found out. However, it presents this workaround:

# GREP_FOR must be set to the search pattern beforehand.
find . -type f -exec file {} + | grep UTF-16 | cut -f1 -d: | while IFS= read -r f
do
    iconv -f UTF-16 -t UTF-8 "$f" | grep -iH --label="$f" -- "${GREP_FOR}"
done

This is obviously for Unix; I'm not sure what the equivalent on Windows would be. The author of that article also provides a shell script that does the above, which you can find on GitHub here.

This only greps files that are UTF-16. You'd also grep your ASCII files the normal way.
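For a single pass that covers both the UTF-16 files and the plain ones, and that also runs on Windows, something along these lines might work. It is only a sketch under a few assumptions: files are small enough to read whole, a file without a BOM is treated as Latin-1 so that ASCII content still matches, and the names `sniff_encoding`, `grep_bytes` and `search_tree` are invented here rather than taken from the blog article.

```python
import codecs
import os
import re

# Checked in this order because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(raw, default="latin-1"):
    """Return a codec name based on a leading BOM, else the default."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default

def grep_bytes(raw, regex):
    """Decode raw file contents and yield (line number, line) matches."""
    # The 'utf-16', 'utf-32' and 'utf-8-sig' codecs consume the BOM.
    text = raw.decode(sniff_encoding(raw), errors="replace")
    for lineno, line in enumerate(text.splitlines(), 1):
        if regex.search(line):
            yield lineno, line

def search_tree(root, pattern):
    """Walk root recursively and yield (path, line number, line)."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    raw = handle.read()
            except OSError:
                continue  # unreadable file: skip it
            for lineno, line in grep_bytes(raw, regex):
                yield path, lineno, line
```

Calling `search_tree(".", "some-symbol")` and printing each tuple gives roughly what `grep -Rn some-symbol .` would, with the BOM-marked files decoded first.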

#7


You didn't say which platform you want to do this on.

On Windows, you could use PowerGREP, which automatically detects Unicode files that start with a byte order mark. (There's also an option to auto-detect files without a BOM. The auto-detection is very reliable for UTF-8, but limited for UTF-16.)
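The difference in reliability between the two BOM-less cases can be seen in a small Python sketch. This is not how PowerGREP works internally (that is not documented here); it is just one common heuristic, and the function name `guess_encoding`, the thresholds and the cp1252 fallback are all assumptions chosen for illustration.

```python
def guess_encoding(raw, fallback="cp1252"):
    """Guess a codec name for bytes that carry no BOM."""
    sample = raw[:4096]
    if len(sample) >= 2:
        half = len(sample) // 2
        even_nuls = sample[0::2].count(b"\x00")
        odd_nuls = sample[1::2].count(b"\x00")
        # Mostly-Latin UTF-16 text has a zero high byte in nearly every
        # code unit. This must be tested before UTF-8, because such
        # data is also valid UTF-8 (NUL is a legal UTF-8 byte).
        if odd_nuls > 0.6 * half and even_nuls < 0.1 * half:
            return "utf-16-le"
        if even_nuls > 0.6 * half and odd_nuls < 0.1 * half:
            return "utf-16-be"
    try:
        # Strict UTF-8 decoding rejects most non-UTF-8 byte sequences,
        # so success is strong evidence.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback

print(guess_encoding("some-symbol".encode("utf-16-le")))
print(guess_encoding("héllo".encode("utf-8")))
```

The UTF-8 check is a real validation, whereas the UTF-16 check only counts zero bytes, so BOM-less UTF-16 text written mostly in a non-Latin script would slip past it. That is one way to read the "limited for UTF-16" caveat.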
