First time working with this stuff. I've checked out all the other SO questions about internationalization / text encoding.
I'm doing the Scrapy tutorial and got stuck at the Extracting Data part: when I extract the data, the Hebrew text is displayed as a series of \uXXXX escapes instead.
You can check it out by scraping this page, for example:
scrapy shell "http://israblog.nana10.co.il/blogread.asp?blog=167524&blogcode=13348970"
hxs.select('//h2[@class="title"]/text()').extract()[0]
this will retrieve:
u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
(Unrelated:) if you try to print it in the console, you get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>
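(A workaround sketch of my own, not from the original post: the Windows console here uses cp437, which has no Hebrew characters, so `print` raises UnicodeEncodeError. Encoding with `errors='replace'` substitutes `?` for the unencodable characters instead of crashing:)

```python
# -*- coding: utf-8 -*-
# Workaround sketch (assumption, not from the original post): encode with
# errors='replace' so characters missing from the console codepage (cp437
# here) become '?' instead of raising UnicodeEncodeError.
a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
safe = a.encode('cp437', errors='replace').decode('cp437')
print(safe)  # every Hebrew letter becomes '?', but no traceback
```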
I tried setting the encoding through the settings and tried converting manually; basically, I feel like I've tried everything.
(I've already spent about 5 pomodoros trying to fix this!)
What can I do to get the Hebrew text that should be there: "מי אנס פוטנציאלי?"
(Disclaimer: I just picked the first blog and post I noticed on http://Israblog.co.il. I'm in no way related to the blog or its owner; I just used it as an example.)
2 Answers
#1
2
what can I do to get the hebrew text that should be there: "מי אנס פוטנציאלי?"
test.py:
# coding: utf-8
a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
b = 'מי אנס פוטנציאלי?'
print a
print b
Result:
vic@wic:~/projects/snippets$ python test.py
מי אנס פוטנציאלי?
מי אנס פוטנציאלי?
vic@wic:~/projects/snippets$
As you can see, they are the same. It's just a different representation of the same unicode string, so don't worry that it's not scraped correctly.
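You can convince yourself of the equivalence directly (my sketch; both literals build the exact same unicode string, and \uXXXX is only how repr() displays non-ASCII code points):

```python
# -*- coding: utf-8 -*-
# Quick check (sketch): the escaped and the literal form are the same string;
# \uXXXX is just repr()'s display of non-ASCII code points.
a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
b = u'מי אנס פוטנציאלי?'
print(a == b)  # True
```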
If you want to save it to a file:
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
>>> a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> a
u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> f = open('test.txt', 'w')
>>> f.write(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> f.write(a.encode('utf-8'))
>>> f.close()
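An alternative to calling `.encode()` by hand (my sketch, using the stdlib `codecs` module, which is not mentioned in the answer above): open the file with an encoding, and writes then accept unicode strings directly:

```python
# -*- coding: utf-8 -*-
# Alternative sketch (assumption): codecs.open returns a file object that
# encodes on write, so unicode can be written without a manual .encode().
import codecs

a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
with codecs.open('test.txt', 'w', encoding='utf-8') as f:
    f.write(a)  # no UnicodeEncodeError
```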
#2
0
Have you tried to see what you get when you store the information from the page as JSON or XML?
I had these problems with some characters on a few sites. In most cases, if you don't do anything with the retrieved data, it gets stored properly; but if you try to print it in the console you won't get a proper result, or it will raise an error unless you use repr:
print repr(data)
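For the JSON route mentioned above, a small sketch of mine: the `json` module escapes non-ASCII to \uXXXX by default, and `ensure_ascii=False` keeps the real characters in the output.

```python
# -*- coding: utf-8 -*-
# Sketch (assumption): json.dumps escapes non-ASCII to \uXXXX by default;
# ensure_ascii=False writes the Hebrew characters themselves.
import json

data = {'title': u'\u05de\u05d9 \u05d0\u05e0\u05e1'}
print(json.dumps(data))                      # {"title": "\u05de\u05d9 ..."}
print(json.dumps(data, ensure_ascii=False))  # {"title": "מי אנס"}
```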
I hope this helps, because I know the frustration of encoding problems.