How do I use python-docx to replace text in a Word document and save it?

Date: 2022-07-22 19:13:48

The oodocx module mentioned on the same page refers the user to an /examples folder that does not seem to exist.
I have read the documentation for python-docx 0.7.2, plus everything I could find on this site on the subject, so please believe that I have done my "homework".

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.

Task: Open an MS Word 2007+ document with a single line of text in it (to keep things simple) and replace any "key" word from Dictionary that occurs in that line of text with its dictionary value. Then close the document, keeping everything else the same.

Line of text (for example) “We shall linger in the chambers of the sea.”

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

I am not seeing anything in the documentation that allows me to do this. Maybe it is there but I don't get it, because not everything is spelled out at my level.

I have followed other suggestions on this site and tried to use an earlier version of the module (https://github.com/mikemaccana/python-docx), which is supposed to have "methods like replace, advReplace", as follows: I open the source code in the Python interpreter and add the following at the end (this is to avoid conflicts with the already installed version 0.7.2):

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None) 

Running this produces the following error message:

NameError: name 'coreprops' is not defined

Maybe I am trying to do something that cannot be done—but I would appreciate your help if I am missing something simple.

If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3

4 answers

#1


15  

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'

To search in Tables as well, you would need to use something like:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    ...

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
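
For the single-line case in the question, the paragraph loop can be combined with the dictionary roughly as follows. This is only a sketch: `replace_keys` and `replace_in_docx` are made-up helper names, not part of python-docx, and the approach assumes that losing character-level formatting in a changed paragraph is acceptable.

```python
def replace_keys(text, mapping):
    """Replace every key of `mapping` found in `text` with its value."""
    for old, new in mapping.items():
        # Plain substring match, so 'sea' would also hit 'seat'.
        text = text.replace(old, new)
    return text


def replace_in_docx(src_path, dst_path, mapping):
    """Apply replace_keys to each body paragraph and save the result."""
    # Imported here so that replace_keys stays usable without python-docx.
    from docx import Document

    document = Document(src_path)
    for paragraph in document.paragraphs:
        new_text = replace_keys(paragraph.text, mapping)
        if new_text != paragraph.text:
            # Assigning paragraph.text rewrites the whole paragraph and
            # drops bold, italic and other character-level formatting.
            paragraph.text = new_text
    document.save(dst_path)


line = "We shall linger in the chambers of the sea."
print(replace_keys(line, {"sea": "ocean"}))
# prints: We shall linger in the chambers of the ocean.
```

With python-docx installed, a call such as `replace_in_docx('Test.docx', 'Test.docx', Dictionary)` would overwrite the file in place; paragraphs that contain no key are left untouched, so their formatting survives.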

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

#2


7  

I needed something to replace regular expressions in docx files. I started from scanny's answer. To preserve styles I used the answer from "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this:

import re
from docx import Document

def docx_replace_regex(doc_obj, regex, replace):
    """Replace regex matches run by run in doc_obj (a document or a table cell)."""
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            # Work on runs (stretches of text with the same style) so that
            # character formatting is preserved.
            for run in p.runs:
                if regex.search(run.text):
                    run.text = regex.sub(replace, run.text)

    # Recurse into table cells, which may themselves contain nested tables.
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)

regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')

To iterate over a dictionary:

for word, replacement in dictionary.items():
    word_re = re.compile(word)
    docx_replace_regex(doc, word_re, replacement)
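
One thing to watch in the loop above is that each dictionary key is compiled as a regular expression, so a key containing characters like `.` or `(` would not be matched literally, and a key like `sea` would also match inside `seaside`. A possible way around both, assuming the keys are plain words that start and end with a letter or digit, is sketched below; `compile_word_patterns` is a hypothetical helper name.

```python
import re


def compile_word_patterns(mapping):
    """Return (compiled pattern, replacement) pairs, one per dictionary key."""
    pairs = []
    for word, replacement in mapping.items():
        # re.escape makes characters such as '.' or '(' literal;
        # \b restricts the match to whole words.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b")
        pairs.append((pattern, replacement))
    return pairs


text = "The sea, the seaside and the open sea."
for pattern, replacement in compile_word_patterns({"sea": "ocean"}):
    text = pattern.sub(replacement, text)
print(text)
# prints: The ocean, the seaside and the open ocean.
```

Each pair could then be passed on as `docx_replace_regex(doc, pattern, replacement)`. Note that the replacement string is still interpreted by `re.sub`, so a backslash in a dictionary value would need escaping as well.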

Note that docx_replace_regex replaces a match only if the whole match lies within a single run, that is, if the matched text has the same style throughout.

Also, if the text was edited after an earlier save, text with the same style may be split into separate runs. For example, if you open a document that contains the string "testabcd", change it to "test1abcd" and save, then even though the style is the same there are three separate runs, "test", "1" and "abcd"; in this case replacing "test1" won't work.

Word splits the text this way to keep track of changes to the document. To merge it back into one run, go to "Options", "Trust Center" in Word and, under "Privacy Options", untick "Store random numbers to improve combine accuracy", then save the document.
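
If changing the Word setting is not an option, the split-run case can also be handled in code by searching the joined text of all runs and writing the result back run by run. The following is a rough sketch of that idea, not something python-docx provides: `replace_across_runs` and `replace_in_paragraph` are made-up names, and the replacement simply takes on the formatting of the run in which the match starts.

```python
def replace_across_runs(run_texts, old, new):
    """Replace `old` with `new` even when `old` spans several runs.

    `run_texts` is the list of run strings of one paragraph; a new list of
    the same length is returned.
    """
    texts = list(run_texts)
    if not old:
        return texts
    start = 0
    while True:
        full = "".join(texts)
        pos = full.find(old, start)
        if pos == -1:
            return texts
        end = pos + len(old)
        offset = 0
        placed = False
        for i, text in enumerate(texts):
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            lo, hi = max(pos, run_start), min(end, run_end)
            if lo >= hi:
                continue  # this run does not overlap the match
            before = text[:lo - run_start]
            after = text[hi - run_start:]
            if not placed:
                # The first overlapping run receives the replacement.
                texts[i] = before + new + after
                placed = True
            else:
                # Later runs only lose their part of the match.
                texts[i] = before + after
        start = pos + len(new)  # continue after the inserted text


def replace_in_paragraph(paragraph, old, new):
    """Apply replace_across_runs to an object with a `.runs` list."""
    runs = paragraph.runs
    new_texts = replace_across_runs([run.text for run in runs], old, new)
    for run, text in zip(runs, new_texts):
        if run.text != text:
            run.text = text


print(replace_across_runs(["test", "1", "abcd"], "test1", "X"))
# prints: ['X', '', 'abcd']
```

With python-docx, `replace_in_paragraph` could be called for each paragraph in `document.paragraphs`; runs emptied by the replacement are left in place rather than removed.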

#3


0  

The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:

relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []

coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                       keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"

#4


0  

The Office Dev Centre has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would require porting): MS Dev Centre posting
