I am having trouble with .isupper() when I have a UTF-8 encoded string. I have a lot of text files that I am converting to XML. While the text is highly variable, the format is static: words in all caps should be wrapped in <title> tags and everything else in <p> tags. It is considerably more complex than this, but this should be sufficient for my question.
My problem is that this is a UTF-8 file. This is a must, as there will be many non-English characters in the final output. This may be the time to provide a brief example:
inputText.txt
RÉSUMÉ
Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage. Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
DesiredOutput
<title>RÉSUMÉ</title>
<p>Bacon ipsum dolor sit amet strip steak t-bone chicken, irure ground round nostrud
aute pancetta ham hock incididunt aliqua. Dolore short loin ex chicken, chuck drumstick
ut hamburger ut andouille. In laborum eiusmod short loin, spare ribs enim ball tip sausage.
Tenderloin ut consequat flank. Tempor officia sirloin duis. In pancetta do, ut dolore t-bone
sint pork pariatur dolore chicken exercitation. Nostrud ribeye tail, ut ullamco venison
mollit pork chop proident consectetur fugiat reprehenderit officia ut tri-tip.
</p>
Sample Code
#!/usr/local/bin/python2.7
# yes this is an alt-install of python
import codecs
import sys
import re
from xml.dom.minidom import Document

def main():
    fn = sys.argv[1]
    input = codecs.open(fn, 'r', 'utf-8')
    output = codecs.open('desiredOut.xml', 'w', 'utf-8')
    doc = Document()
    doc = parseInput(input, doc)
    print>>output, doc.toprettyxml(indent=' ', encoding='UTF-8')

def parseInput(input, doc):
    tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines
    for i in range(len(tokens)):
        # THIS IS MY PROBLEM. .isupper() is never true.
        if str(tokens[i]).isupper():
            title = doc.createElement('title')
            tText = str(tokens[i]).strip('[\']')
            titleText = doc.createTextNode(tText.title())
            doc.appendChild(title)
            title.appendChild(titleText)
        else:
            p = doc.createElement('p')
            pText = str(tokens[i]).strip('[\']')
            paraText = doc.createTextNode(pText)
            doc.appendChild(p)
            p.appendChild(paraText)
    return doc

if __name__ == '__main__':
    main()
Ultimately it is pretty straightforward. I would accept critiques or suggestions on my code. Who wouldn't? In particular I am unhappy with str(tokens[i]); perhaps there is a better way to loop through a list of strings?
But the purpose of this question is to figure out the most efficient way to check if a UTF-8 string is capitalized. Perhaps I should look into crafting a regex for this.
Do note, I did not run this code and it may not run just right. I hand-picked the parts from working code and may have mistyped something. Alert me and I will correct it. Lastly, note I am not using lxml.
3 Solutions
#1
9
The primary reason that your published code fails (even with only ASCII characters!) is that re.split() (on Python 2.7, as used here) will not split on a zero-width match. r'\b' matches zero characters:
>>> re.split(r'\b', 'foo-BAR_baz')
['foo-BAR_baz']
>>> re.split(r'\W+', 'foo-BAR_baz')
['foo', 'BAR_baz']
>>> re.split(r'[\W_]+', 'foo-BAR_baz')
['foo', 'BAR', 'baz']
Also, you need flags=re.UNICODE to ensure that Unicode definitions of \b and \W etc. are used. And using str() where you did is at best unnecessary.
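To illustrate the splitting behaviour on non-ASCII text, here is a minimal sketch. It assumes the input has already been decoded to a unicode string; the sample string and the helper name split_words are made up for illustration. On Python 3, str patterns are Unicode-aware by default, so the flag should be redundant there but harmless.

```python
# -*- coding: utf-8 -*-
import re

def split_words(text):
    """Split on runs of non-word characters or underscores (hypothetical helper)."""
    # flags=re.UNICODE makes \W follow Unicode definitions on Python 2.7;
    # on Python 3 this is already the default for str patterns.
    parts = re.split(r'[\W_]+', text, flags=re.UNICODE)
    # Drop empty strings that appear when the text starts or ends with a separator.
    return [p for p in parts if p]

words = split_words(u'R\xc9SUM\xc9 foo-BAR_baz')
upper_flags = [w.isupper() for w in words]
print(words)
print(upper_flags)
```

Without the Unicode-aware definitions, the accented letters would likely be treated as separators on Python 2.7, splitting the word into fragments.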
So it wasn't really a Unicode problem per se at all. However some answerers tried to address it as a Unicode problem, with varying degrees of success ... here's my take on the Unicode problem:
The general solution to this kind of problem is to follow the standard bog-simple advice that applies to all text problems: Decode your input from bytestrings to unicode strings as early as possible. Do all processing in unicode. Encode your output unicode into byte strings as late as possible.
So: byte_string.decode('utf8').isupper() is the way to go. Hacks like byte_string.decode('ascii', 'ignore').isupper() are to be avoided; they can be all of (complicated, unneeded, failure-prone) -- see below.
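The decode-early, encode-late advice can be sketched roughly as follows. This is only an illustration under stated assumptions: the temporary file and its contents are invented so the example is self-contained, and io.open is used because it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Assumption: a throwaway UTF-8 sample file standing in for the real input.
path = os.path.join(tempfile.mkdtemp(), 'inputText.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'R\xc9SUM\xc9\n\nBacon ipsum dolor sit amet\n')

# Decode early: io.open with an encoding yields unicode text, not bytes.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]

# Process in unicode: isupper() sees real characters, not UTF-8 bytes.
upper_flags = [line.isupper() for line in lines]

# Encode late: go back to bytes only at the point of output.
encoded = u'\n'.join(lines).encode('utf-8')

print(upper_flags)
```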
Some code:
# coding: ascii
import unicodedata
tests = (
    (u'\u041c\u041e\u0421\u041a\u0412\u0410', True), # capital of Russia, all uppercase
    (u'R\xc9SUM\xc9', True), # RESUME with accents
    (u'R\xe9sum\xe9', False), # Resume with accents
    (u'R\xe9SUM\xe9', False), # ReSUMe with accents
    )
for ucode, expected in tests:
    print
    print 'unicode', repr(ucode)
    for uc in ucode:
        print 'U+%04X %s' % (ord(uc), unicodedata.name(uc))
    u8 = ucode.encode('utf8')
    print 'utf8', repr(u8)
    actual1 = u8.decode('utf8').isupper() # the natural way of doing it
    actual2 = u8.decode('ascii', 'ignore').isupper() # @jathanism
    print expected, actual1, actual2
Output from Python 2.7.1:
unicode u'\u041c\u041e\u0421\u041a\u0412\u0410'
U+041C CYRILLIC CAPITAL LETTER EM
U+041E CYRILLIC CAPITAL LETTER O
U+0421 CYRILLIC CAPITAL LETTER ES
U+041A CYRILLIC CAPITAL LETTER KA
U+0412 CYRILLIC CAPITAL LETTER VE
U+0410 CYRILLIC CAPITAL LETTER A
utf8 '\xd0\x9c\xd0\x9e\xd0\xa1\xd0\x9a\xd0\x92\xd0\x90'
True True False
unicode u'R\xc9SUM\xc9'
U+0052 LATIN CAPITAL LETTER R
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00C9 LATIN CAPITAL LETTER E WITH ACUTE
utf8 'R\xc3\x89SUM\xc3\x89'
True True True
unicode u'R\xe9sum\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0073 LATIN SMALL LETTER S
U+0075 LATIN SMALL LETTER U
U+006D LATIN SMALL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9sum\xc3\xa9'
False False False
unicode u'R\xe9SUM\xe9'
U+0052 LATIN CAPITAL LETTER R
U+00E9 LATIN SMALL LETTER E WITH ACUTE
U+0053 LATIN CAPITAL LETTER S
U+0055 LATIN CAPITAL LETTER U
U+004D LATIN CAPITAL LETTER M
U+00E9 LATIN SMALL LETTER E WITH ACUTE
utf8 'R\xc3\xa9SUM\xc3\xa9'
False False True
The only differences with Python 3.x are syntactical -- the principle (do all processing in unicode) remains the same.
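For comparison, a rough Python 3 rendering of the same test might look like the sketch below. It is an adaptation, not the original answerer's code: the per-character name listing is dropped for brevity and the results are collected in a list (a name introduced here) so they can be inspected.

```python
tests = (
    ('\u041c\u041e\u0421\u041a\u0412\u0410', True),  # capital of Russia, all uppercase
    ('R\xc9SUM\xc9', True),    # RESUME with accents
    ('R\xe9sum\xe9', False),   # Resume with accents
    ('R\xe9SUM\xe9', False),   # ReSUMe with accents
)

results = []
for text, expected in tests:
    u8 = text.encode('utf8')                          # bytes, as read from a file
    actual1 = u8.decode('utf8').isupper()             # decode properly, then test
    actual2 = u8.decode('ascii', 'ignore').isupper()  # the hack: drops non-ASCII bytes
    results.append((expected, actual1, actual2))
    print(ascii(text), expected, actual1, actual2)
```

The three columns should match the Python 2.7.1 output shown above: the proper decode agrees with the expected value in every case, while the 'ignore' hack disagrees on the first and last rows.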
#2
2
As one comment above illustrates, it is not true for every character that one of the checks islower() vs isupper() will always be true and the other false. Unified Han characters, for example, are considered "letters" but are not lowercase, not uppercase, and not titlecase.
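A small sketch of that point, using one arbitrarily chosen Han character and two made-up sample strings:

```python
han = u'\u4e2d'  # CJK UNIFIED IDEOGRAPH-4E2D, picked only as an example

# A letter, yet neither uppercase, lowercase, nor titlecase.
han_checks = (han.isalpha(), han.isupper(), han.islower(), han.istitle())
print(han_checks)

# isupper() only considers cased characters, and needs at least one of them.
mixed = u'R\xc9SUM\xc9 2011'
digits = u'2011'
print(mixed.isupper())                    # uncased digits and spaces are ignored
print(digits.isupper(), digits.islower()) # no cased characters at all
```

So a line consisting only of digits, punctuation, or caseless letters would fall into the "everything else" branch of an isupper() test.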
So your stated requirements, to treat upper- and lower-case text differently, should be clarified. I will assume the distinction is between upper-case letters and all other characters. Perhaps this is splitting hairs, but you ARE talking about non-English text here.
First, I do recommend using Unicode strings (the unicode() built-in) exclusively for the string processing portions of your code. Discipline your mind to think of the "regular" strings as byte-strings, because that's exactly what they are. All string literals not written u"like this" are byte-strings.
This line of code then:
tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n']
would become:
tokens = [re.split(u'\\b', unicode(line.strip(), 'UTF-8')) for line in input if line != '\n']
You would also test tokens[i].isupper() rather than str(tokens[i]).isupper(). Based on what you have posted, it seems likely that other portions of your code would need to be changed to work with character strings instead of byte-strings.
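As a hedged sketch of testing the tokens directly: each entry of tokens is itself a list returned by re.split(), so the check below is applied per word rather than to the list. The lines list is a stand-in for already-decoded file lines (with codecs.open(fn, 'r', 'utf-8'), as in the question, lines should arrive decoded, so no extra unicode() call is assumed here), and a separator pattern replaces r'\b' so that a split actually happens.

```python
import re

# Assumption: stand-in for lines read from a file opened with an encoding.
lines = [u'R\xc9SUM\xc9\n', u'\n', u'Bacon ipsum dolor sit amet\n']

tokens = [re.split(r'[\W_]+', line.strip(), flags=re.UNICODE)
          for line in lines if line != u'\n']  # skip blank lines

all_caps = []
for words in tokens:  # iterate over the list itself, no range(len(...)) needed
    # Test the text; str(words) would give the list's repr, brackets and quotes included.
    all_caps.append(all(w.isupper() for w in words))

print(all_caps)
```

Iterating with "for words in tokens" also addresses the question's concern about looping over a list of strings by index.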
#3
0
Simple solution. I think
tokens = [re.split(r'\b', line.strip()) for line in input if line != '\n'] #remove blank lines
becomes
tokens = [line.strip() for line in input if line != '\n']
Then I am able to go with no need for str() or unicode(), as far as I can tell.
if tokens[i].isupper(): #do stuff
The word token and the re.split on word boundaries are a legacy of when I was messing with nltk earlier this week. But ultimately I am processing lines, not tokens/words. This may change, but for now this seems to work. I will leave this question open for now, in the hope of alternative solutions and comments.
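Put together, the line-based approach might look roughly like this. It is a sketch under assumptions, not the asker's final code: build_doc and the 'doc' root element are hypothetical names (a DOM document accepts only one top-level element, so the <title> and <p> nodes are hung under a single root), and the input list stands in for decoded file lines.

```python
from xml.dom.minidom import Document

def build_doc(lines):
    """Wrap all-caps lines in <title> and everything else in <p> (hypothetical helper)."""
    doc = Document()
    root = doc.createElement('doc')  # assumed root name, not from the question
    doc.appendChild(root)
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        tag = 'title' if text.isupper() else 'p'
        node = doc.createElement(tag)
        node.appendChild(doc.createTextNode(text))
        root.appendChild(node)
    return doc

doc = build_doc([u'R\xc9SUM\xc9\n', u'\n', u'Bacon ipsum dolor sit amet\n'])
xml_text = doc.documentElement.toxml()
print(ascii(xml_text))
```

The question's code calls .title() on the heading text before writing it; that step is left out here so the output matches the DesiredOutput sample.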