Python: extracting text from docx to txt by parsing word/document.xml

I would like to extract text from docx files into a simple txt file. I know this problem might seem easy or trivial (I hope it is), but I've looked through dozens of forum topics and spent hours trying to solve it myself, and found no solution...

I have borrowed the following code from Etienne's blog.

It works perfectly if I need the content with no formatting. But since my documents contain simple tables, I need those to keep their layout, simply by using tabs. So instead of this:

Name
Age
Wage
John
30
2000

This should appear:

Name      Age     Wage
John      30      2000

So that the columns don't run into each other, I prefer double tabs for longer lines. I have examined the XML structure a little and found that new rows in tables are indicated by tr, and columns by tc. I've tried to modify the code a thousand ways, but with no success... Although it doesn't really work, here is my attempt at a solution:

from lxml.html.defs import form_tags

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
ROW = WORD_NAMESPACE + 'tr'
COL = WORD_NAMESPACE + 'tc'


def get_docx_text(path):
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)
    paragraphs = []

    for item in tree.iter(ROW or COL or PARA):
        texts = []
        print(item)
        if item is ROW:
            texts.append('\n')
        elif item is COL:
            texts.append('\t\t')
        elif item is PARA:
            for node in item.iter(TEXT):
                if node.text:
                    texts.append(node.text)
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)

text_file = open("output.txt", "w")
text_file.write(get_docx_text('input.docx'))
text_file.close()

I'm not sure what the syntax should look like. The output is empty; a few attempts did produce something, but it was even worse than nothing.

I added print(item) just for checking. But instead of every ROW, COL and PARA item, it lists ROW items only. So it seems that the program ignores the or between the terms in the for loop. If it cannot find ROW, it does not try the two remaining options but skips straight to the next item. I also tried passing a list of the terms.

Inside the if/elif blocks, I think that e.g. if item is ROW should check whether 'item' and 'ROW' are identical (and they actually are).

2 solutions

#1

X or Y or Z evaluates to the first of the three values that is truthy. Non-empty strings are always truthy. So for item in tree.iter(ROW or COL or PARA) evaluates to for item in tree.iter(ROW), which is why you are getting only row elements inside your loop.
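
To see the short-circuit behaviour in isolation, here is a minimal sketch that uses the tag constants from the question; the variable names selected and fallback are only illustrative:

```python
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
ROW = WORD_NAMESPACE + 'tr'
COL = WORD_NAMESPACE + 'tc'
PARA = WORD_NAMESPACE + 'p'

# `or` returns its first truthy operand; it does not combine the operands.
selected = ROW or COL or PARA
print(selected == ROW)   # ROW is a non-empty string, so it wins

# A falsy first operand (here an empty string) is skipped.
fallback = '' or COL or PARA
print(fallback == COL)
```

So the expression passed to tree.iter() is just the single string ROW.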

The iter() method of an ElementTree element accepts only one tag name, so you should probably just iterate over the whole tree (this won't be a problem unless the document is big).
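
As a rough illustration, the sketch below compares iter(ROW) with a full walk filtered against a set of tags. The SAMPLE string is a made-up, heavily simplified fragment rather than a real document.xml, and WANTED is a hypothetical helper name:

```python
from xml.etree.ElementTree import XML

NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
ROW, COL, PARA = NS + 'tr', NS + 'tc', NS + 'p'
WANTED = {ROW, COL, PARA}  # hypothetical helper: the tags we care about

# Simplified stand-in for word/document.xml (assumption: one table, one cell).
SAMPLE = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '</w:body>'
)

tree = XML(SAMPLE)

# iter(tag) yields only elements with that one tag.
only_rows = [el.tag for el in tree.iter(ROW)]

# iter() without arguments walks everything in document order; filter afterwards.
matched = [el.tag for el in tree.iter() if el.tag in WANTED]

print(len(only_rows), len(matched))
```

Filtering with a set keeps a single pass over the tree while still matching several tags.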

is is not going to work here. It is an identity operator and returns True only if the compared objects are identical (i.e. both variables refer to the same Python object). In your if... elif... you are comparing a constant str (ROW, COL, PARA) with an Element object, so the two are never the same object and each comparison returns False.

Instead you should use something like if item.tag == ROW.
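
A small sketch of the difference, building a stand-alone Element by hand instead of parsing a document (the names identity_match and tag_match are illustrative):

```python
from xml.etree.ElementTree import Element

NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
ROW = NS + 'tr'

item = Element(ROW)  # an element whose tag is the ROW string

# Identity: an Element object is never the same object as a str.
identity_match = item is ROW

# Equality on the tag attribute: compares the tag name with the constant.
tag_match = item.tag == ROW

print(identity_match, tag_match)
```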

All of the above taken into account, you should rewrite your loop section like this:

for item in tree.iter():          # walk every element in the document
    texts = []
    print(item)                   # debugging aid, can be removed
    if item.tag == ROW:           # compare the tag name, not the object
        texts.append('\n')
    elif item.tag == COL:
        texts.append('\t\t')
    elif item.tag == PARA:
        for node in item.iter(TEXT):
            if node.text:
                texts.append(node.text)
    if texts:
        paragraphs.append(''.join(texts))

#2

The answer above won't work like you asked. This should work for documents containing only tables; some additional parsing with findall should help you isolate non-table data and make this work for a document with tables and other text:

TABLE = WORD_NAMESPACE + 'tbl'

# Inside get_docx_text(), use this loop instead of the original one:
    texts = []                      # collects every output fragment
    for item in tree.iter():
        #print(item.tag)
        if item.tag == TABLE:
            for row in item.iter(ROW):
                texts.append('\n')  # each row starts on a new line
                for col in row.iter(COL):
                    texts.append('\t')  # a tab before each cell
                    for ent in col.iter(TEXT):
                        if ent.text:
                            texts.append(ent.text)
    return ''.join(texts)
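
For documents that mix tables and ordinary paragraphs, one possible approach is to walk the direct children of the body in order and render each table row as a tab-separated line. The sketch below is only that, a sketch: the helpers element_text and table_text and the sep parameter are names introduced here, and it assumes a simple layout (tables sit directly under w:body, cells directly under rows, no nested or merged cells).

```python
import zipfile
from xml.etree.ElementTree import XML

NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
BODY, PARA, TEXT = NS + 'body', NS + 'p', NS + 't'
TABLE, ROW, COL = NS + 'tbl', NS + 'tr', NS + 'tc'


def element_text(element):
    """Concatenate the text of every w:t node below the element."""
    return ''.join(node.text or '' for node in element.iter(TEXT))


def table_text(table, sep='\t'):
    """Render a table as one line per row, with cells joined by sep."""
    lines = []
    for row in table.findall(ROW):            # direct child rows only
        cells = [element_text(cell) for cell in row.findall(COL)]
        lines.append(sep.join(cells))
    return '\n'.join(lines)


def get_docx_text(path, sep='\t'):
    """Extract paragraphs and tables from a .docx in document order."""
    with zipfile.ZipFile(path) as document:
        tree = XML(document.read('word/document.xml'))
    body = tree.find(BODY)                    # assumption: w:body is present
    blocks = []
    for child in body:                        # top-level paragraphs and tables
        if child.tag == TABLE:
            blocks.append(table_text(child, sep))
        elif child.tag == PARA:
            text = element_text(child)
            if text:                          # skip empty paragraphs
                blocks.append(text)
    return '\n\n'.join(blocks)
```

Calling get_docx_text('input.docx', sep='\t\t') and writing the result to output.txt would give the double-tab layout mentioned in the question. Joining cells with the separator, rather than emitting a tab before every cell, avoids a leading tab on each row.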
