I have a tedious task to do, and I need some help from Python. Please see this Word document.
I need to extract the text and GPS coordinates from each row. There are currently over 100 coordinates across 10 docx files. My "hefty" Python knowledge got me this far:
from docx import Document
import re

main_file = Document("D:/DOCUMENTS/Google_Link/1 Category I/1 Category I.docx")
table = main_file.tables[1]  # this is same for every document
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        keys = tuple(text)
        continue
    row_data = tuple(text)
    data.append(row_data)

regexReference = re.compile(r"(C.-)\w+")
colReference = [item[1] for item in data]
listReference = filter(regexReference.match, colReference)
for i in listReference:
    print i.encode('UTF-8')
I can print the 16 reference IDs from column 2. Please guide me on how to print something like this for each row:
C1-20701-17-1
some site, some region
The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires
some repair/maintenance works including electrical wiring and electrical
lights and appliances like ceiling fans supplies. Detail specification of
the works are attached
x = 91°38'28.2"E
y = 22°40'34.3"N
These XY locations and descriptions will be used afterwards to create KML files, which will be attached to each document. I'd prefer a separate variable for each part of the section above (ref id, location, description, x and y) so that I can automate that step as well.
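For the KML step, a minimal sketch may help show what the x and y strings have to become. KML stores coordinates as decimal degrees in longitude,latitude order, so each degrees/minutes/seconds string needs converting first. The sketch is written for Python 3, and the names `DMS`, `dms_to_decimal` and `placemark` are hypothetical helpers, not part of any library. It assumes the coordinates always use the plain `°`, `'` and `"` symbols shown in the sample row.

```python
import re
from xml.sax.saxutils import escape

# Matches one DMS value such as 22°40'34.3"N (assumed notation).
DMS = re.compile(r"""(\d+)°(\d+)'(\d+(?:\.\d+)?)"([NSEW])""")


def dms_to_decimal(dms):
    """Convert a string like 91°38'28.2"E to signed decimal degrees."""
    match = DMS.search(dms)
    if match is None:
        raise ValueError("not a DMS coordinate: %r" % dms)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value


def placemark(ref_id, description, x, y):
    """Build one KML Placemark element; x is longitude, y is latitude."""
    return (
        "<Placemark>"
        "<name>{}</name>"
        "<description>{}</description>"
        "<Point><coordinates>{:.6f},{:.6f}</coordinates></Point>"
        "</Placemark>"
    ).format(escape(ref_id), escape(description),
             dms_to_decimal(x), dms_to_decimal(y))


print(placemark("C1-20701-17-1", "some site, some region",
                '91°38\'28.2"E', '22°40\'34.3"N'))
```

The Placemark elements would still need to be wrapped in a `<kml>` and `<Document>` element before being saved as a .kml file.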
Demo docx
1 solution
#1
I'm not sure this will still work if some of the files follow a different pattern (P.S. I'm using Python 2.7.11):
# -*- coding: utf-8 -*-
from docx import Document
import sys
import os
import re

reload(sys)
sys.setdefaultencoding('utf8')

for root, dirs, files in os.walk("."):
    for name in files:
        doc_file = os.path.join(root, name)
        if doc_file.endswith('docx'):
            main_file = Document(doc_file)
            table = main_file.tables[1]  # this is same for every document
            data = []
            keys = None
            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)
                if i == 0:
                    keys = tuple(text)
                    continue
                row_data = tuple(text)
                data.append(row_data)

            regexReference = re.compile("(C.-[0-9-]+)")
            regexCoordinate = re.compile(r'(N-(.{,12})([0-9]|\')|[0-9].{,12}N)[;, ]+(E-(.{,12})([0-9]|\')|[0-9].{,12}E)')
            result = []
            for item in data:
                tmp = dict()
                matchReference = regexReference.search(item[1])
                matchCoordinate = regexCoordinate.search(unicode(item[2]))
                if matchReference:
                    tmp['reference'] = matchReference.group()
                if matchCoordinate:
                    # x is the longitude (E part), y is the latitude (N part)
                    tmp['x'] = matchCoordinate.group(4)
                    tmp['y'] = matchCoordinate.group(1)
                tmp['description'] = unicode(item[2])
                tmp['location'] = unicode(item[3])
                result.append(tmp)

            # Print only the rows that carry a reference id
            for rs in result:
                if 'reference' in rs:
                    for k, v in rs.iteritems():
                        print('{} = {}'.format(k, v))
                    print

# Output:
# --------------------------------
# y = 22°40'34.3"N
# x = 91°38'28.2"E
# description = The existing CMC Office at Bariyodhala (22°40'34.3"N; 91°38'28.2"E) requires some repair/maintenance works including electrical wiring and electrical lights and appliances like ceiling fans supplies. Detail specification of the works are attached.
# reference = C1-20701-17-1
# location = xxxxx Site, c Region
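If one variable per part is preferred over a dict, the per-row logic can be pulled into a small function that returns a named tuple. This is only a sketch for Python 3, not a drop-in replacement for the script above: `Record`, `REFERENCE`, `COORDINATE` and `parse_row` are hypothetical names, the coordinate pattern is a simplified one that only handles the `22°40'34.3"N; 91°38'28.2"E` style, and the column order (reference in column 2, description in column 3, location in column 4) is assumed to be the same as in the script above. It works on plain tuples of cell text, so it can be tried without any docx file.

```python
import re
from collections import namedtuple

Record = namedtuple("Record", "ref_id location description x y")

REFERENCE = re.compile(r"C.-[0-9-]+")
# Simplified pattern: latitude (N/S) first, then longitude (E/W).
COORDINATE = re.compile(
    r"""(\d+°\d+'[\d.]+"[NS])\s*[;,]\s*(\d+°\d+'[\d.]+"[EW])"""
)


def parse_row(cells):
    """Turn one table row (a tuple of cell texts) into a Record, or None."""
    reference = REFERENCE.search(cells[1])
    if reference is None:
        return None  # header or filler row without a reference id
    coordinate = COORDINATE.search(cells[2])
    lat, lon = coordinate.groups() if coordinate else (None, None)
    # x is the longitude, y is the latitude, as in the requested output.
    return Record(reference.group(), cells[3], cells[2], x=lon, y=lat)


row = (
    "1",
    "C1-20701-17-1",
    "The existing CMC Office at Bariyodhala "
    "(22°40'34.3\"N; 91°38'28.2\"E) requires some repair/maintenance works.",
    "some site, some region",
)
record = parse_row(row)
print(record.ref_id)
print(record.location)
print("x =", record.x)
print("y =", record.y)
```

Each field is then available by name (`record.ref_id`, `record.x` and so on), and rows whose description has no recognizable coordinate still come back with `x` and `y` set to `None`, so they can be reviewed by hand.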