tl;dr version
I have a paragraph that might contain quotations (e.g. "blah blah", 'this one also', etc.). I need to replace these with LaTeX-style quotations (e.g. ``blah blah", `this also', etc.) using Python 3.0.
Background
I have lots of plain text files (more than 100). I need to build a single LaTeX document from their contents after doing a little text processing on them. I am using Python 3.0 for this. I have everything else working (escape characters, sections, etc.), but I am not able to get the quotation marks right.
I can find the pattern with a regex (as described here), but how do I replace it with the given pattern? I don't know how to use the re.sub() function in this case, because there might be multiple instances of quotes in my string. There is this question related to this, but how do I implement it in Python?
2 Solutions
#1
1
Design Considerations
- I've only considered the regular "double-quotes" and 'single-quotes'. There may be other quotation marks (see this question).
- LaTeX end-quotes are also single-quotes. We don't want to capture a LaTeX double end-quote (e.g. ``LaTeX double-quote'') and mistake it for a single quote (around nothing).
- Word contractions and ownership 's contain single quotes (e.g. don't, John's). These are characterised by alpha characters on both sides of the quote.
- Regular nouns (plural ownership) have single-quotes after the word (e.g. the actresses' roles).
Solution
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string  # Hack (see explanations)
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)
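Since the question worries about multiple quotes in one string, here is a small usage sketch. re.sub() replaces every non-overlapping match by default, so no loop is needed. The two functions are repeated so the snippet runs on its own; the wrapper texify_quotes and the sample sentence are my own additions for illustration, not part of the answer above.

```python
import re

def texify_single_quote(in_string):
    in_string = ' ' + in_string  # same leading-space hack as in the answer
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]

def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)

def texify_quotes(in_string):
    """Hypothetical convenience wrapper: run both passes on one string."""
    return texify_double_quote(texify_single_quote(in_string))

# re.sub() handles every pair in the string in a single call.
print(texify_double_quote('"a" and "b"'))
# -> ``a'' and ``b''

sample = """I'm a 'single' person and a "double" person."""
print(texify_quotes(sample))
# -> I'm a `single' person and a ``double'' person.
```

The apostrophe in I'm is left alone because it is not preceded by whitespace.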
Testing
with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:
        # Test for commutativity: the order of the two passes must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))
        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)
Input file (test.txt):
# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'
Output (output.txt):
# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'
(Note: the comment markers were prepended to stop the post's formatting from mangling the output!)
Explanations
We will break down this regex pattern, (?<=\s)'(?!')(.*?)':
- Summary: (?<=\s)'(?!') deals with the opening single-quote, whilst (.*?) deals with what's in the quotes.
- (?<=\s)' is a positive look-behind and only matches single-quotes that have whitespace (\s) preceding them. This is important to prevent matching contracted words such as can't (considerations 3, 4).
- '(?!') is a negative look-ahead and only matches single-quotes that are not followed by another single-quote (consideration 2).
- As mentioned in this answer, the pattern (.*?) captures what's in between the quotation marks, whilst \1 contains the capture.
- The "Hack" in_string = ' ' + in_string is there because the positive look-behind does not match a single quote at the very beginning of the line. Adding a space to every line (then removing it on return with slicing, return re.sub(...)[1:]) solves this problem!
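To see why the hack is needed, the sketch below runs the pattern without the leading space. It also shows one possible alternative that I am assuming here, not something from the answer: the negative look-behind (?<!\S) ("not preceded by a non-whitespace character") succeeds both after whitespace and at the start of the string, so the padding and slicing can be dropped. The names PATTERN_ORIGINAL, PATTERN_NO_HACK and demo are hypothetical.

```python
import re

# The answer's pattern: the opening quote must follow a whitespace character.
PATTERN_ORIGINAL = r"(?<=\s)'(?!')(.*?)'"
# Assumed alternative: the opening quote must NOT follow a non-whitespace
# character, which is also true at the very start of the string.
PATTERN_NO_HACK = r"(?<!\S)'(?!')(.*?)'"

def demo(pattern, text):
    """Apply the single-quote substitution with the given pattern."""
    return re.sub(pattern, r"`\1'", text)

# Without the leading-space hack the first quote is never matched.
print(demo(PATTERN_ORIGINAL, "'start' of line"))
# -> 'start' of line

# The negative look-behind handles the start of the string directly.
print(demo(PATTERN_NO_HACK, "'start' of line"))
# -> `start' of line

# Contractions are still skipped: the apostrophe follows a letter.
print(demo(PATTERN_NO_HACK, "I can't say 'no'"))
# -> I can't say `no'
```

Either form works; the hack keeps the pattern closer to the plain-English description above, while the negative look-behind avoids touching the string.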
#2
1
Regexes are great for some tasks, but they are still limited (read this for more info). Writing a parser for this task seems more prone to errors.
I created a simple function for this task and added comments. If there are still questions about the implementation, please ask.
The code (online version here):
the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''

def convert_quotes(txt, quote_type):
    # find all quotes
    quotes_pos = []
    idx = -1

    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace quote with ``
    new_txt = []
    last_pos = -1

    for i, pos in enumerate(quotes_pos):
        # ignore the odd quotes - we don't replace them
        if i % 2 == 1:
            continue
        new_txt += txt[last_pos+1:pos]
        new_txt += '``'
        last_pos = pos

    # append the last part of the string
    new_txt += txt[last_pos+1:]

    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))
prints out:
This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
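As the output shows, the function opens both kinds of quote with `` and leaves every closing quote untouched. If you want the pairs the question asked for (` and ' for single quotes, `` and '' for double quotes), a variant along the following lines may work. This is only a sketch of mine, not the answer's code: the name convert_quotes_latex and its open_tex/close_tex parameters are hypothetical, and it assumes the text contains no apostrophes, just like the original.

```python
def convert_quotes_latex(txt, quote_char, open_tex, close_tex):
    """Replace alternating quote_char occurrences with open_tex / close_tex."""
    parts = txt.split(quote_char)
    # n quote characters split the text into n + 1 parts,
    # so an even number of parts means an unmatched quote.
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_char)

    out = []
    for i, part in enumerate(parts):
        out.append(part)
        if i < len(parts) - 1:
            # even gaps open a quotation, odd gaps close it
            out.append(open_tex if i % 2 == 0 else close_tex)
    return ''.join(out)

text = 'This is my "test" String which has \'two\' quotes'
# Convert single quotes first: the double-quote pass emits '' which a
# later single-quote pass would otherwise mistake for quote characters.
text = convert_quotes_latex(text, "'", "`", "'")
text = convert_quotes_latex(text, '"', "``", "''")
print(text)
# -> This is my ``test'' String which has `two' quotes
```

Like the original, this raises ValueError on text with contractions such as don't, because every apostrophe counts as a quote character.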
Note: parsing nested quotes is ambiguous.
For example, the string "bob said: "alice said: hello"" is nested in natural language,
BUT the string "bob said: hi" and "alice said: hello" is not nested.
If this is your case, you might first want to convert these nested quotes into different quote characters, or use parentheses () to disambiguate the nested quotes.