How do I get a regular expression to recognize non-ASCII characters as letters?

Time: 2021-08-23 20:20:57

I'm extracting information from a webpage in Swedish. This page is using characters like: öäå.

My problem is that when I print the information the öäå are gone.

I'm extracting the information using Beautiful Soup. I think the problem is that I run a bunch of regular expressions on the strings I extract, e.g. location = re.sub(r'([^\w])+', '', location), to remove everything except the letters. Before this, I guess Beautiful Soup encoded the strings so that the öäå became something like /x02/, a hex value.

So if I'm correct, the regexes are removing the öäå, right? I mean, the only thing that should be left of the hex char after the regex is the x, but there is no x in place of the öäå on my page, so maybe this little theory is not correct. Either way, how do you solve this? When I later print the extracted information to my webpage I use self.response.out.write() in Google App Engine (I don't know if that helps in solving the problem).

EDIT: The encoding on the Swedish site is UTF-8, and the encoding on my site is also UTF-8. EDIT2: You can use ISO-8859-10 for Swedish, but according to Google Chrome the encoding is Unicode (UTF-8) on this specific site.

2 answers

#1


8  

Always work in unicode and only convert to an encoded representation when necessary.
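As a minimal sketch of that advice, written for Python 3 (an assumption; the question itself targets Python 2 on Google App Engine), the idea is to decode once when the bytes arrive, do all string work on text, and encode once when writing the response. The function name clean_location and the sample input are hypothetical, chosen only for illustration.

```python
import re


def clean_location(raw_bytes: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode at the input boundary, encode at the output."""
    text = raw_bytes.decode(encoding)       # bytes -> text, once, on the way in
    text = re.sub(r'[^\w]+', '', text)      # all regex work happens on text
    return text.encode(encoding)            # text -> bytes, once, on the way out


# Sample input standing in for bytes fetched from the Swedish page.
result = clean_location("Umeå, Sverige".encode("utf-8"))
print(result.decode("utf-8"))
```

Keeping the decode and encode steps at the edges means the regular expressions never see raw bytes, which is where non-ASCII letters tend to get lost.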

For this particular situation, you also need to use the re.U flag so \w matches unicode letters:

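# coding: utf-8
# Python 2 example
import re

# Decode the UTF-8 byte string into a unicode object first
location = "öäå".decode('utf-8')
# With re.U, \w also matches non-ASCII letters, so öäå are kept
location = re.sub(r'([^\w])+', '', location, flags=re.U)

print location  # prints öäå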
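For comparison, here is a rough Python 3 sketch of the same substitution (Python 3 is an assumption; the answer above is Python 2). In Python 3, str patterns are Unicode-aware by default, so re.U is not needed, and passing re.ASCII reproduces the symptom from the question. The sample string is made up for illustration. Note that \w also matches digits and the underscore, not only letters.

```python
import re

# Made-up sample standing in for a string extracted from the page.
location = "Malmö, Skåne län!"

# Default in Python 3: \w matches Unicode word characters, so öäå survive.
cleaned = re.sub(r'[^\w]+', '', location)

# re.ASCII restricts \w to [a-zA-Z0-9_], which strips öäå as in the question.
ascii_only = re.sub(r'[^\w]+', '', location, flags=re.ASCII)

print(cleaned)     # MalmöSkånelän
print(ascii_only)  # MalmSkneln
```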

#2


0  

It would help if you could dump the strings before and after each step.

Check your value of re.UNICODE first, see this
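One possible way to dump the strings at each step, sketched in Python 3 (an assumption, since the question uses Python 2): print the type and repr of the value after every transformation, so it is visible where bytes become text and where characters disappear. The helper name dump and the sample data are hypothetical.

```python
import re


def dump(label, value):
    """Hypothetical helper: show the type and repr of a value at one step."""
    line = "{}: {} {!r}".format(label, type(value).__name__, value)
    print(line)
    return line


# Made-up sample standing in for the bytes fetched from the page.
raw = "Göteborg, Västra Götaland".encode("utf-8")
dump("raw", raw)            # bytes: non-ASCII letters appear as \x.. escapes

text = raw.decode("utf-8")
dump("decoded", text)       # str: the letters are back

cleaned = re.sub(r'[^\w]+', '', text)
dump("cleaned", cleaned)    # str: punctuation and spaces removed

# The flag itself can be inspected as well.
print(repr(re.UNICODE))
```

If a non-ASCII letter goes missing between two dumps, the step in between is the one to look at.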
