Unicode input with raw_input() in Python

I'm just starting to learn Python using LPTHW, and it's really good. I'm only a couple of days into my studies and have reached exercise 16, which looks like this:

# -*- coding: utf-8 -*-

from sys import argv

script, filename = argv

print "We're going to erase %r." % filename
print "If you don't want that, hit CTRL-C (^C)."
print "If you do want that, hit RETURN."

raw_input("?")

print "Opening the file..."
target = open(filename, 'w')

print "Truncating the file.  Goodbye!"
target.truncate()

print "Now I'm going to ask you for three lines."

line1 = raw_input("line 1: ")
line2 = raw_input("line 2: ")
line3 = raw_input("line 3: ")

print "I'm going to write these to the file."

target.write("%r\n%r\n%r\n" % (line1, line2, line3))

print "And finally, we close it."
target.close()

The problem is that I'm from a country with the letters "Å", "Ä" and "Ö" in the alphabet, but when I use these letters the output in the file (test.txt) looks something like this: u'hej' u'\xc5je' u'l\xe4get'

When I decode a string literal, I can do something like this: "hallå".decode("utf-8")

And it will print just fine

But I also want input from a user to be handled correctly, even when it contains unusual characters. I have tried different things that either do not work or give me errors when running, for example:

line1 = raw_input("line 1: ").decode("utf-8")

I tried to google my problem, but the answers I found were either not very straightforward or written for much more experienced users.

If someone would take some time to explain the encoding/decoding of Unicode characters in a beginner-friendly way and give me an example of how I can get it to work, I would really appreciate it.

If it helps, I am on Windows 10, running Python 2.7.10, and my system locale is set to Swedish.

4 Answers

#1

Here's a way to decode stdin. It generally works from the console, but IDEs sometimes replace the stdin object and don't always support the encoding attribute. I also modernized the code a bit, using with and io.open to handle encodings. Note that the file will be written in UTF-8, so open it with Notepad to see it correctly. Using type <filename> from the console will try to display the file with the console's stdout encoding.

#!python2
import sys
import io

script, filename = sys.argv

print "We're going to erase %s." % filename
print "If you don't want that, hit CTRL-C (^C)."
print "If you do want that, hit RETURN."

raw_input("?")

print "Now I'm going to ask you for three lines."

line1 = raw_input("line 1: ").decode(sys.stdin.encoding)
line2 = raw_input("line 2: ").decode(sys.stdin.encoding)
line3 = raw_input("line 3: ").decode(sys.stdin.encoding)

print "I'm going to write these to the file."

with io.open(filename, 'wt', encoding='utf8') as target:
    target.write(u"%s\n%s\n%s\n" % (line1, line2, line3))

#2

Your output indicates that raw_input() already accepts Å, ä just fine in your environment.

Either your code does not correspond to the output or your IDE is too helpful. raw_input() should return str type (bytes) but the output shows that you're saving text representations of unicode objects: u'hej' u'\xc5je' u'l\xe4get'.

The smallest code change that would produce your desired result is to use %s (save the string as is) instead of %r (save its ASCII printable representation, as returned by the repr() function) in the format string, as suggested in @chepner's answer.
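To make the difference concrete, here is a minimal sketch that formats the same Unicode string with %r and with %s. The sample value is taken from the question's output; the variable names are only for illustration. The code runs on Python 2.7 and Python 3, but the exact repr text differs between the two, so only the Python 2 result discussed here is described in the comments.

```python
# -*- coding: utf-8 -*-
# Sample value from the question: u'l\xe4get' is the word "läget".
line = u'l\xe4get'

# %r inserts repr(line): a debugging representation, quotes included.
# On Python 2.7 this is the ASCII-only text u'l\xe4get' (with the u prefix),
# which is what ended up in test.txt.
as_repr = u"%r" % line

# %s inserts the string itself, unchanged.
as_text = u"%s" % line

print(as_repr != line)   # True: the representation is not the original text
print(as_text == line)   # True: %s keeps the text as is
```

Note that the format string is itself a Unicode literal (u"..."), so the result stays Unicode text until it is written out.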

"If someone would take some time to explain the encoding/decoding of Unicode characters in a beginner-friendly way and give me an example of how I can get it to work, I would really appreciate it."

Unicode handling on Python 2 requires understanding which APIs return text and which return binary data. Some APIs use a mixture of both, such as ASCII-based network protocols.

Python 2 allows the str type to represent both human-readable text and binary data, which can create confusion. I recommend starting with Python 3, which is stricter about Unicode-related issues.

In general, while working with Unicode you should convert encoded text into Unicode on input as soon as possible (e.g., using .decode()) and convert Unicode text to bytes on output as late as possible. @Mark Tolonen's answer demonstrates this approach:

It uses .decode(sys.stdin.encoding) to decode bytes returned from raw_input() into Unicode text. If raw_input() already returns Unicode in your environment (to check, run print type(raw_input('input something'))), then you can omit the .decode() call.
io.open(..., encoding='utf-8').write(u'some text') converts Unicode text to bytes (it encodes the text using UTF-8).

This general approach is known as the Unicode sandwich.
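As a rough sketch of the sandwich, the example below decodes bytes on the way in, works only with Unicode text in the middle, and encodes to UTF-8 when writing. The helper names decode_input and save_lines are hypothetical, and cp1252 is an assumed input encoding chosen only for the demo; a real console may use a different one. The byte strings stand in for what raw_input() would return, so the example runs without user input on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def decode_input(raw, encoding):
    """Bottom slice: turn bytes into Unicode text as early as possible."""
    if isinstance(raw, bytes):       # on Python 2, bytes is the same type as str
        return raw.decode(encoding)
    return raw                       # already Unicode text, nothing to do


def save_lines(path, lines):
    """Top slice: encode to UTF-8 only at the moment of writing."""
    with io.open(path, 'w', encoding='utf-8') as target:
        for line in lines:
            target.write(line + u"\n")


# Stand-ins for bytes that raw_input() might return ("hej", "Åje", "läget").
# cp1252 is an assumption made for this demo only.
raw_lines = [b'hej', b'\xc5je', b'l\xe4get']

# Decode first, then work only with Unicode text (the filling).
lines = [decode_input(raw, 'cp1252') for raw in raw_lines]
shouted = [line.upper() for line in lines]

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
save_lines(path, shouted)

# Reading the file back with the same encoding returns the same text.
with io.open(path, encoding='utf-8') as source:
    round_trip = source.read()
```

Because the text is Unicode in the middle, operations such as upper() handle "ä" and "Å" correctly regardless of the byte encoding used on either side.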

.decode(sys.stdin.encoding) may fail. To support arbitrary Unicode input in the Windows console, install the win-unicode-console Python package.
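One case where the call fails is when sys.stdin.encoding is None, which happens on Python 2 when stdin is redirected from a file or pipe. The sketch below is one possible defensive helper, not part of any answer here: input_encoding and to_text are hypothetical names, and the fallback order (stdin encoding, then the locale's preferred encoding, then UTF-8) is an assumption. It also covers the case where an IDE already returns Unicode. It does not help with characters the console code page cannot represent; that is the problem win-unicode-console addresses.

```python
# -*- coding: utf-8 -*-
import locale
import sys


def input_encoding():
    """Best guess at the encoding of stdin, with fallbacks (an assumption)."""
    return (getattr(sys.stdin, 'encoding', None)
            or locale.getpreferredencoding()
            or 'utf-8')


def to_text(value, encoding=None):
    """Return Unicode text whether value is bytes or already Unicode."""
    if isinstance(value, bytes):     # Python 2 str, or Python 3 bytes
        return value.decode(encoding or input_encoding())
    return value                     # e.g. an IDE already returned unicode


# Demo with fixed values instead of live input:
sample = to_text(b'hall\xc3\xa5', 'utf-8')   # UTF-8 bytes for "hallå"
already_text = to_text(u'hall\xe5')          # Unicode passes through untouched
print(sample == already_text)                # True
```

In the exercise script on Python 2, this would be used as line1 = to_text(raw_input("line 1: ")).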

#3

You're writing a representation of the string, rather than the actual encoded Unicode string, to your file. Use

target.write("%s\n%s\n%s\n" % (line1, line2, line3))

instead.

#4

You can use this format:

f = open('file.txt', 'w')
s = u'\u221A'
f.write(s.encode('utf-8'))
f.close()

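In your script that would be line1 = raw_input("> ").encode('utf-8'), and likewise for line2 and line3. This assumes raw_input() returns unicode objects, as your output suggests; calling .encode() on a byte string that contains non-ASCII characters raises UnicodeDecodeError in Python 2.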
