I use curl to fetch a URL response; it's a JSON response, and it contains Unicode-escaped national characters like \u0144 (ń) and \u00f3 (ó).
How can I convert them to UTF-8, or any other encoding, to save into a file?
7 Answers
#1
26
I don't know which distribution you are using, but uni2ascii should be included.
$ sudo apt-get install uni2ascii
It only depends on libc6, so it's a lightweight solution (uni2ascii i386 4.18-2 is 55.0 kB on Ubuntu)!
Then to use it:
$ echo 'Character 1: \u0144, Character 2: \u00f3' | ascii2uni -a U -q
Character 1: ń, Character 2: ó
#2
29
Might be a bit ugly, but echo -e should do it:
echo -en "$(curl $URL)"
-e interprets escapes, -n suppresses the newline echo would normally add.
Note: The \u escape works in the bash builtin echo, but not in /usr/bin/echo.
As pointed out in the comments, this requires bash 4.2+, and 4.2.x has a bug handling 0x00ff/17 values (0x80-0xff).
#3
28
I found native2ascii from the JDK to be the best way to do it:
native2ascii -encoding UTF-8 -reverse src.txt dest.txt
A detailed description is here: http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/native2ascii.html
Update: No longer available since JDK 9: https://bugs.openjdk.java.net/browse/JDK-8074431
#4
18
Assuming the \u is always followed by exactly 4 hex digits:
#!/usr/bin/perl

use strict;
use warnings;

binmode(STDOUT, ':utf8');

while (<>) {
    s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;
    print;
}
The binmode puts standard output into UTF-8 mode. The s... command replaces each occurrence of \u followed by 4 hex digits with the corresponding character. The e suffix causes the replacement to be evaluated as an expression rather than treated as a string; the g says to replace all occurrences rather than just the first.
You can save the above to a file somewhere in your $PATH (don't forget the chmod +x). It filters standard input (or one or more files named on the command line) to standard output.
#5
9
Use /usr/bin/printf "\u0160ini\u010di Ho\u0161i - A\u017e sa skon\u010d\u00ed zima" to get a proper Unicode-to-UTF-8 conversion.
#6
8
Don't rely on regexes: JSON has some strange corner cases with \u escapes and non-BMP code points. (Specifically, JSON will encode one code point using two \u escapes.) If you assume 1 escape sequence translates to 1 code point, you're doomed on such text.
Using a full JSON parser from the language of your choice is considerably more robust:
$ echo '["foo bar \u0144\n"]' | python -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'
That's really just feeding the data to this short Python script:
import json
import sys

data = json.load(sys.stdin)
data = data[0]  # change this to find your string in the JSON
sys.stdout.write(data.encode('utf-8'))
You can save this as foo.py and call it as curl ... | foo.py
An example that will break most of the other attempts in this question is "\ud83d\udca3":
% printf '"\\ud83d\\udca3"' | python2 -c 'import json, sys; sys.stdout.write(json.load(sys.stdin)[0].encode("utf-8"))'; echo
💣
# echo will result in corrupt output:
% echo -e $(printf '"\\ud83d\\udca3"')
"������"
# native2ascii won't even try (this is correct for its intended use case, however, just not ours):
% printf '"\\ud83d\\udca3"' | native2ascii -encoding utf-8 -reverse
"\ud83d\udca3"
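Note that the one-liner above is Python 2: in Python 3, str.encode returns bytes, which sys.stdout.write won't accept on a text stream. A Python 3 sketch of the same idea, which still handles the surrogate-pair case correctly because the json module does the decoding:

```python
import json
import sys

# json.loads decodes \uXXXX escapes for us, including surrogate pairs
# such as "\ud83d\udca3" (a single non-BMP code point).
data = json.loads('["foo bar \\u0144\\n"]')
text = data[0]  # change this to find your string in the JSON
sys.stdout.write(text)  # Python 3 text streams encode on write
```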
#7
-1
Works on Windows, should work on *nix too. Uses Python 2.
#!/usr/bin/env python
from __future__ import unicode_literals
import sys
import json
import codecs

def unescape_json(fname_in, fname_out):
    with file(fname_in, 'rb') as fin:
        js = json.load(fin)
    with codecs.open(fname_out, 'wb', 'utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

def usage():
    print "Converts all \\uXXXX codes in json into utf-8"
    print "Usage: .py infile outfile"
    sys.exit(1)

def main():
    try:
        fname_in, fname_out = sys.argv[1:]
    except Exception:
        usage()
    unescape_json(fname_in, fname_out)
    print "Done."

if __name__ == '__main__':
    main()
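Since that script relies on Python 2-only constructs (the file() builtin, print statements), a hypothetical Python 3 port of the same approach might be:

```python
#!/usr/bin/env python3
import json
import sys

def unescape_json(fname_in, fname_out):
    # Parse the JSON, then re-serialize it with ensure_ascii=False so
    # \uXXXX escapes become literal UTF-8 characters in the output file.
    with open(fname_in, encoding='utf-8') as fin:
        js = json.load(fin)
    with open(fname_out, 'w', encoding='utf-8') as fout:
        json.dump(js, fout, ensure_ascii=False)

if __name__ == '__main__' and len(sys.argv) == 3:
    unescape_json(sys.argv[1], sys.argv[2])
    print("Done.")
```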