Reading "raw" Unicode strings in Python

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I have not found an answer.

I have a mixed source document that contains HTML, XML, LaTeX and other text formats, and I am trying to convert it into a LaTeX-only format.

Therefore, I have used Python to match the different commands with regular expressions and replace them with the appropriate LaTeX commands. Everything has worked fine so far.

Now I am left with some "raw" Unicode characters, such as Greek letters. Unfortunately, there are just too many of them to replace by hand, so I am looking for a smart way to handle these too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read, e.g., pi written as a Greek letter?

A minimal example of the code I use is:

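import re

# Read the whole source document into memory
fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()

# Replace every match of the pattern with its LaTeX equivalent
new_stuff = re.sub('READ', 'REPLACE', stuff)

# Write the converted text to the output file
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()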

I am not sure whether this is important, but I am using Python 2.6 on Windows.

I would be really glad if someone could give me a hint, at least about where to find the relevant information or how this might work. Or tell me whether I am completely wrong and Python can't do this job ...

Many thanks in advance.
Cheers,
Britta

3 Answers

#1

You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding; rather, there are different encodings for storing Unicode characters (read this post by Joel).

The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
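
As a rough illustration of the codecs approach, the sketch below reads a file with an explicit encoding and swaps a few Greek letters for LaTeX commands. The names GREEK_TO_LATEX, replace_greek and convert_file are hypothetical, the three-entry mapping is only a placeholder for a fuller table, and UTF-8 is an assumed default that has to match the real file. The code is written to run on both Python 2.6+ and Python 3.

```python
import codecs

# Hypothetical, deliberately tiny mapping; extend it to cover your document.
# The trailing {} keeps a following letter from merging into the command name.
# Wrapping the result in math mode ($...$) is left to the caller.
GREEK_TO_LATEX = {
    u'\u03b1': u'\\alpha{}',  # Greek small letter alpha
    u'\u03c0': u'\\pi{}',     # Greek small letter pi
    u'\u03a9': u'\\Omega{}',  # Greek capital letter omega
}


def replace_greek(text):
    """Return text with every mapped Greek character replaced."""
    for char, command in GREEK_TO_LATEX.items():
        text = text.replace(char, command)
    return text


def convert_file(src_path, dst_path, encoding='utf-8'):
    """Decode src_path, replace Greek letters, write the result to dst_path."""
    with codecs.open(src_path, 'r', encoding=encoding) as src:
        text = src.read()  # a unicode string, not raw bytes
    with codecs.open(dst_path, 'w', encoding=encoding) as dst:
        dst.write(replace_greek(text))
```

Calling convert_file('SOURCE_DOCUMENT', 'LATEX_DOCUMENT') would mirror the minimal example from the question, with the decoding step made explicit.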

Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:

\usepackage[utf8]{inputenc}

(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)

#2

Please, first, read this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Then, come back and ask questions.

#3

You need to determine the encoding of the input document. Unicode defines over a million code points, but a file can only store 8-bit values (0-255), so Unicode text must be encoded into bytes in some way.
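
To make the point concrete, here is a small sketch showing the same character, Greek small letter pi (U+03C0), stored as different bytes under three encodings. The choice of encodings is only an illustration, and the variable names are made up for this example.

```python
pi = u'\u03c0'  # Greek small letter pi as a Unicode string

# The same character becomes different byte sequences depending on the encoding.
as_utf8 = pi.encode('utf-8')        # two bytes: 0xCF 0x80
as_utf16 = pi.encode('utf-16-le')   # two bytes: 0xC0 0x03
as_greek = pi.encode('iso-8859-7')  # one byte:  0xF0

# Decoding with the wrong codec does not fail here, it silently gives garbage.
garbled = as_utf8.decode('latin-1')

# Decoding with the right codec gives the original character back.
restored = as_utf8.decode('utf-8')
```

This is why a regular expression for pi never matches when the file is read as undecoded bytes or with the wrong codec: the program sees 0xCF 0x80 or some unrelated characters rather than U+03C0.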

If the document is XML, the encoding should be declared in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
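
A rough sketch of that lookup, assuming the declaration appears near the start of the file and is written in ASCII-compatible bytes. The function name sniff_declared_encoding and both patterns are hypothetical and far from a complete XML or HTML parser. For a mixed document like the one in the question, the result is at best a hint.

```python
import re

# Matches e.g. <?xml version="1.0" encoding="ISO-8859-7"?>
XML_DECL = re.compile(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')
# Matches e.g. charset=windows-1253 or charset="utf-8" inside a meta tag
HTML_CHARSET = re.compile(br'charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_declared_encoding(head, default='utf-8'):
    """Return the encoding declared in the raw bytes `head`, else `default`."""
    for pattern in (XML_DECL, HTML_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode('ascii').lower()
    return default
```

The bytes passed in would typically be the first kilobyte or so of the file, read in binary mode ('rb').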

If all else fails, open the document in an editor that lets you set the encoding (jEdit, for example). Try different encodings until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
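
The same trial-and-error can be sketched in Python itself: decode the raw bytes with each candidate encoding and inspect the results by eye. The helper name decode_candidates and the candidate list are assumptions for illustration. Note that a successful decode does not prove the encoding is right, only that the bytes are valid in it.

```python
def decode_candidates(raw, candidates):
    """Return (encoding, decoded text) pairs for every candidate that decodes."""
    results = []
    for name in candidates:
        try:
            results.append((name, raw.decode(name)))
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


# Example: a single byte 0xF0, which is pi in ISO-8859-7 but invalid UTF-8.
sample = b'\xf0'
attempts = decode_candidates(sample, ['utf-8', 'iso-8859-7', 'latin-1'])
```

Here UTF-8 is rejected outright, while ISO-8859-7 and Latin-1 both decode and yield different characters, so a human still has to pick the one that looks right.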
