How to convert UTF-8 fancy quotes to neutral quotes

Time: 2021-01-14 20:27:51

I'm writing a little Python script that parses Word docs and writes to a CSV file. However, some of the docs contain non-ASCII Unicode characters that my script can't process correctly.

Fancy quotes show up quite often (u'\u201c'). Is there a quick, easy, and smart way to replace them with neutral ASCII quotes, so I can just write line.encode('ascii') to the CSV file?

I have tried to find the left quote and replace it:

val = line.find(u'\u201c')
if val >= 0: line[val] = '"'

But to no avail:

TypeError: 'unicode' object does not support item assignment

Is what I've described a good strategy? Or should I just set up the CSV to support UTF-8 (though I'm not sure whether the application that will read the CSV accepts UTF-8)?

Thank you

2 solutions

#1


8  

You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.

from unidecode import unidecode
line = unidecode(line)

This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
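If installing a third-party package is not an option, a rough standard-library stand-in is a translation table passed to the string's translate method. This is only a sketch, not what Unidecode does internally: the PUNCTUATION_MAP table and the to_ascii_punctuation helper are hypothetical names, and the table covers just a handful of characters, so anything not listed will still make encode('ascii') raise an error.

```python
# Hypothetical mapping from code points to ASCII replacements.
# Extend it with whatever characters actually appear in your documents.
PUNCTUATION_MAP = {
    0x2018: u"'",    # left single quotation mark
    0x2019: u"'",    # right single quotation mark
    0x201C: u'"',    # left double quotation mark
    0x201D: u'"',    # right double quotation mark
    0x2013: u"-",    # en dash
    0x2014: u"--",   # em dash
    0x2026: u"...",  # horizontal ellipsis
}


def to_ascii_punctuation(text):
    """Return a new string with the mapped characters replaced."""
    return text.translate(PUNCTUATION_MAP)


line = u"\u201cHello\u201d \u2014 it\u2019s done\u2026"
clean = to_ascii_punctuation(line)

# Succeeds only because every non-ASCII character in this sample is in the map.
ascii_bytes = clean.encode("ascii")
```

Unlike Unidecode, this leaves accented letters and other scripts untouched, so it fits best when the only offenders are typographic punctuation.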

#2


5  

You can't assign to a position in a string: strings are immutable, so they can't be changed in place.

You can, however, just use the regex library, which might be the most flexible way to do this:

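import re
# re.sub returns a new string; line itself is left unchanged.
# This replaces only the left double quote (U+201C).
newline = re.sub(u'\u201c', u'"', line)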
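The call above only touches the left double quote. To cover both directions of single and double quotes in one pass, a character class combined with a replacement function is one option. The following is a sketch under assumptions: QUOTE_REPLACEMENTS, QUOTE_PATTERN, and neutralize_quotes are hypothetical names introduced here, not part of the original answer.

```python
import re

# Hypothetical table of fancy quotes and their neutral ASCII equivalents.
QUOTE_REPLACEMENTS = {
    u"\u2018": u"'",  # left single quotation mark
    u"\u2019": u"'",  # right single quotation mark
    u"\u201c": u'"',  # left double quotation mark
    u"\u201d": u'"',  # right double quotation mark
}

# One character class that matches any of the four quotes above.
QUOTE_PATTERN = re.compile(u"[\u2018\u2019\u201c\u201d]")


def neutralize_quotes(text):
    """Return a new string with each fancy quote swapped for its ASCII form."""
    return QUOTE_PATTERN.sub(lambda m: QUOTE_REPLACEMENTS[m.group(0)], text)


line = u"\u201cfancy\u201d and \u2018single\u2019"
newline = neutralize_quotes(line)
```

For a single character, line.replace(u'\u201c', u'"') does the same job without a regex. Like re.sub, it returns a new string and does not modify the original.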
