I'm trying to extract a word list from a Russian short story.
#!/bin/sh
export LC_ALL=ru_RU.utf8
sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq
However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!
$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г
In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a Putty terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).
What is a portable, reliable way to lowercase Unicode text in a pipe?
1 Answer
This is what I found at Wikipedia (without any reference, though):
Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
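To make the "single-byte" point concrete, here is a small sketch, assuming the script itself is saved as UTF-8. In UTF-8 the letter Г (U+0413) is stored as two bytes, so a byte-oriented tr never sees a single "uppercase character" to map. The helper name byte_count is made up for this illustration.

```shell
# Hypothetical helper (not from the original post): count the bytes in a string.
# Assumes the text reaches printf as UTF-8.
byte_count() {
    # wc -c counts bytes, not characters; tr -d strips any padding wc adds.
    printf '%s' "$1" | wc -c | tr -d ' '
}

# Compare a Latin letter with a Cyrillic one:
#   byte_count 'G'
#   byte_count 'Г'
```

The Latin letter comes out as one byte and the Cyrillic letter as two, which is consistent with tr passing Г through unchanged.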
Also, this is old but related.
As I mentioned in the comment, sed seems to work (GNU sed, at least):
$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
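Note that the one-liner above only lowercases the first run of uppercase letters on each line, which is enough for a single word. For a whole pipeline, a rough sketch might look like the following. It assumes GNU sed (the \L escape is a GNU extension) and an installed UTF-8 locale. The function names and the LOWER_LOCALE variable are invented for this example, and C.UTF-8 is only an assumed default, so substitute ru_RU.utf8 or whatever your system provides.

```shell
#!/bin/sh
# Sketch only: relies on GNU sed's \L extension and on a UTF-8 locale being installed.
# LOWER_LOCALE is a made-up variable for this example; C.UTF-8 is an assumed default.

# Lowercase every line read from stdin.
lower() {
    # .* matches the whole line, and \L& re-emits the match in lowercase.
    LC_ALL="${LOWER_LOCALE:-C.UTF-8}" sed 's/.*/\L&/'
}

# Split stdin into one lowercase word per line, sorted and de-duplicated.
# Punctuation stripping from the question's script is left out for brevity,
# and empty lines are not filtered.
wordlist() {
    tr -s '[:space:]' '\n' | lower | sort -u
}
```

With these functions sourced, something like `wordlist < story.txt` (story.txt being a placeholder name) would stand in for the tr step in the question. tr is still used here, but only to split on ASCII whitespace, which is safe at the byte level.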