Extracting GPS coordinates from .docx files using Python

Time: 2021-01-25 21:23:22

I have a tedious task that I need Python's help with. Please see this Word document.


I need to extract the text and GPS coordinates from each row. There are currently over 100 coordinates across 10 docx files. My "hefty" Python knowledge got me this far:

from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1  Category I/1  Category I.docx")
table = main_file.tables[1]  # this is the same for every document

data = []
keys = None

for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # the first row holds the column headers
    if i == 0:
        keys = tuple(text)
        continue

    row_data = tuple(text)
    data.append(row_data)

# reference ids look like C1-20701-17-1
regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]

listReference = filter(regexReference.match, colReference)

for i in listReference:
    print(i.encode('UTF-8'))

I can print the 16 reference IDs from column 2. Please guide me toward printing something like this for each row:

C1-20701-17-1

some site, some region

The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires 
some repair/maintenance works including electrical wiring and electrical 
lights and appliances like ceiling fans supplies. Detail specification of 
the works are attached

x = 91°38'28.2"E
y = 22°40'34.3"N

These XY locations and descriptions will be used afterwards to create KML files to attach to each document. I'd prefer a separate variable for each part of the section above (ref id, location, description, x and y) so that I can automate that step as well.
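For the KML step mentioned above, here is a minimal sketch of one possible approach, written for Python 3 and not taken from the original thread. It assumes every coordinate has the degrees-minutes-seconds shape shown in the sample (for example 22°40'34.3"N), and the helper names `dms_to_decimal`, `placemark` and `kml_document` are hypothetical. KML expects decimal degrees in longitude,latitude order, so x (the E value) goes first.

```python
import re
from xml.sax.saxutils import escape

# One DMS value such as 22°40'34.3"N: degrees, minutes, seconds, hemisphere.
DMS_RE = re.compile(r"""(\d+)°\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])""")


def dms_to_decimal(dms):
    """Convert a string like 91°38'28.2"E to signed decimal degrees."""
    match = DMS_RE.search(dms)
    if match is None:
        raise ValueError("not a DMS coordinate: %r" % dms)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and West are negative in decimal notation.
    return -value if hemisphere in "SW" else value


def placemark(ref_id, description, x, y):
    """Build one KML Placemark; x is the longitude (E/W), y the latitude (N/S)."""
    lon = dms_to_decimal(x)
    lat = dms_to_decimal(y)
    return (
        "<Placemark>"
        "<name>{}</name>"
        "<description>{}</description>"
        "<Point><coordinates>{:.6f},{:.6f}</coordinates></Point>"
        "</Placemark>"
    ).format(escape(ref_id), escape(description), lon, lat)


def kml_document(placemarks):
    """Wrap Placemark fragments in a minimal KML 2.2 document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + "".join(placemarks)
        + "</Document></kml>"
    )


# Example with the values from the sample row.
example = placemark(
    "C1-20701-17-1",
    "The existing CMC Office at Bariyodhala requires some repair works",
    "91°38'28.2\"E",
    "22°40'34.3\"N",
)
print(kml_document([example]))
```

Building the XML by hand keeps the sketch dependency-free; `escape` guards against `&` or `<` appearing in a description. For anything larger, a dedicated KML library would probably be more comfortable.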

demo docx


1 Answer

#1



I don't know whether this will work on files that follow a different pattern (p.s. I'm using Python 2.7.11):

# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

# Python 2 only: make utf8 the default codec so unicode cell text can be formatted and printed
reload(sys)
sys.setdefaultencoding('utf8')

# walk the current directory and process every .docx file found
for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is the same for every document

            data = []
            keys = None

            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)

                # the first row holds the column headers
                if i == 0:
                    keys = tuple(text)
                    continue

                row_data = tuple(text)
                data.append(row_data)

            regexReference = re.compile(r"(C.-[0-9-]+)")
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')

            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    # group 1 is the N (latitude) part, group 4 the E (longitude) part;
                    # the question asks for x = E and y = N
                    tmp['x'] = matchCoordinate.group(4)
                    tmp['y'] = matchCoordinate.group(1)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            # print only the rows that carry a reference id
            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 22°40'34.3"N
# x = 91°38'28.2"E
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
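
If all the documents use the same `22°40'34.3"N; 91°38'28.2"E` form as the demo row, a stricter pattern with named groups may be easier to maintain. The following is only a sketch for Python 3, not part of the script above; `PAIR_RE` and `split_row` are hypothetical names, and the pattern assumes latitude comes first and the two values are separated by a semicolon or comma.

```python
import re

# One DMS value: degrees, minutes, seconds (optionally decimal), hemisphere.
LAT = r"""\d{1,2}°\d{1,2}'\d{1,2}(?:\.\d+)?"[NS]"""
LON = r"""\d{1,3}°\d{1,2}'\d{1,2}(?:\.\d+)?"[EW]"""

# Latitude first, then ';' or ',' and optional spaces, then longitude.
PAIR_RE = re.compile("(?P<y>" + LAT + r")\s*[;,]\s*(?P<x>" + LON + ")")


def split_row(reference, location, description):
    """Return one record per table row; x and y stay None when no pair is found."""
    record = {
        "reference": reference,
        "location": location,
        "description": description,
        "x": None,
        "y": None,
    }
    match = PAIR_RE.search(description)
    if match:
        record["x"] = match.group("x")  # longitude (E/W)
        record["y"] = match.group("y")  # latitude (N/S)
    return record


row = split_row(
    "C1-20701-17-1",
    "some site, some region",
    "The existing CMC Office at Bariyodhala (22°40'34.3\"N; 91°38'28.2\"E) "
    "requires some repair/maintenance works",
)
print(row["x"], row["y"])
```

Rows whose coordinates are written differently simply come back with `x` and `y` set to `None`, which makes them easy to list and check by hand.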
