I want to process the data between specific tags in a .tcx file (XML format) with Python.
The file format is as follows:
<Track>
<Trackpoint>
<Time>2015-08-29T22:04:39.000Z</Time>
<Position>
<LatitudeDegrees>37.198049426078796</LatitudeDegrees>
<LongitudeDegrees>127.07204628735781</LongitudeDegrees>
</Position>
<AltitudeMeters>34.79999923706055</AltitudeMeters>
<DistanceMeters>7.309999942779541</DistanceMeters>
<HeartRateBpm>
<Value>102</Value>
</HeartRateBpm>
<Cadence>76</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>112</Watts>
</TPX>
</Extensions>
</Trackpoint>
....Lots of <Trackpoint> ... </Trackpoint>
</Track>
Eventually, I want to build a data table with columns such as 'Latitude, Altitude, ... Watts'.
First, I tried to make a list from the tagged data (e.g. <Watts> ... </Watts>) with BeautifulSoup, XPath, etc., but I'm new to these tools. How can I grab the data between tags in an XML file with Python?
3 Solutions
#1
2
You could use the lxml module together with XPath. lxml is good for parsing XML/HTML, traversing element trees, and returning element text and attributes. With XPath you can select particular elements, sets of elements, or attributes of elements. Using your example data:
content = '''
<Track>
<Trackpoint>
<Time>2015-08-29T22:04:39.000Z</Time>
<Position>
<LatitudeDegrees>37.198049426078796</LatitudeDegrees>
<LongitudeDegrees>127.07204628735781</LongitudeDegrees>
</Position>
<AltitudeMeters>34.79999923706055</AltitudeMeters>
<DistanceMeters>7.309999942779541</DistanceMeters>
<HeartRateBpm>
<Value>102</Value>
</HeartRateBpm>
<Cadence>76</Cadence>
<Extensions>
<TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
<Watts>112</Watts>
</TPX>
</Extensions>
</Trackpoint>
....Lots of <Trackpoint> ... </Trackpoint>
</Track>
'''
from lxml import etree
tree = etree.XML(content)
time = tree.xpath('Trackpoint/Time/text()')
print(time)
Output
['2015-08-29T22:04:39.000Z']
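One caveat when moving from the sample to a complete TCX file: such files normally declare a default namespace on the root element, and an unprefixed XPath like the one above then matches nothing. The following is a minimal sketch of passing a prefix map to lxml's xpath(). The trimmed-down document and the prefixes tcx and ext are assumptions made for illustration; only the namespace URIs have to match what the file declares.

```python
from lxml import etree

# Hypothetical, trimmed-down TCX document; a real file carries many more elements.
TCX = b'''<?xml version="1.0" encoding="UTF-8"?>
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Activities><Activity Sport="Biking"><Lap><Track>
    <Trackpoint>
      <Time>2015-08-29T22:04:39.000Z</Time>
      <Extensions>
        <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
          <Watts>112</Watts>
        </TPX>
      </Extensions>
    </Trackpoint>
  </Track></Lap></Activity></Activities>
</TrainingCenterDatabase>'''

# The prefixes are arbitrary labels chosen here; only the URIs must match the file.
NS = {
    "tcx": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2",
    "ext": "http://www.garmin.com/xmlschemas/ActivityExtension/v2",
}

tree = etree.fromstring(TCX)

# Every step of the path names the namespace its element lives in.
times = tree.xpath("//tcx:Trackpoint/tcx:Time/text()", namespaces=NS)
watts = tree.xpath(
    "//tcx:Trackpoint/tcx:Extensions/ext:TPX/ext:Watts/text()", namespaces=NS
)

# Without prefixes the same query finds nothing in a namespaced document.
unprefixed = tree.xpath("//Trackpoint/Time/text()")

print(times, watts, unprefixed)
```

If the queries come back empty on your own file, checking the xmlns attribute on the root element is a reasonable first step.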
#2
2
You can even use the lxml module to convert the XML to CSV (for later import into a dataframe, spreadsheet, or database table) by building one Python list per Trackpoint from several XPath queries.
Notice that the XPath for the last column, Watts, is longer than the others. Watts sits inside the TPX element, which declares its own default namespace through xmlns, and that namespace is not registered with a prefix in the sample XML, so the expression matches those elements by name() instead.
import csv
import os

import lxml.etree as ET

# SET DIRECTORY
cd = os.path.dirname(os.path.abspath(__file__))

# LOAD XML FILE
xmlfile = 'trackXML.xml'
dom = ET.parse(os.path.join(cd, xmlfile))

# DEFINING COLUMNS
columns = ['latitude', 'longitude', 'altitude', 'distance', 'watts']

# XPATH FOR EACH COLUMN, RELATIVE TO ONE <Trackpoint>
xpaths = [
    'Position/LatitudeDegrees/text()',
    'Position/LongitudeDegrees/text()',
    'AltitudeMeters/text()',
    'DistanceMeters/text()',
    "*[name()='Extensions']/*[name()='TPX']/*[name()='Watts']/text()",
]

datalines = []  # FOR FINAL OUTPUT

# OPEN CSV FILE (newline='' keeps the csv module from writing blank rows on Windows)
with open(os.path.join(cd, 'trackData.csv'), 'w', newline='') as m:
    writer = csv.writer(m)
    writer.writerow(columns)

    # Query each Trackpoint node directly. An indexed '//Trackpoint[j]' would
    # match the j-th Trackpoint of every parent and mix rows across tracks.
    for node in dom.xpath('//Trackpoint'):
        dataline = []  # FOR ONE-ROW CSV APPENDS
        for path in xpaths:
            # LOCATE NODE VALUE; USE '' WHEN THE ELEMENT IS MISSING
            result = node.xpath(path)
            dataline.append(result[0] if result else '')
        datalines.append(dataline)
        writer.writerow(dataline)

print(datalines)
In addition to the CSV file, the script prints the datalines list for the selected columns:
[['37.198049426078796', '127.07204628735781', '34.79999923706055', '7.309999942779541', '112']]
#3
0
The Python program https://github.com/cast42/vpower/blob/master/vpower.py iterates over the TCX file specified on the command line and adds a power field to every measurement of the cycling activity. It uses the lxml library for speed and because it handles namespaces. Earlier versions of the program used xml.etree.ElementTree, but I ran into problems with the namespaces.
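For readers who prefer to stay with the standard library, the namespace issue can usually be worked around, because ElementTree's find methods accept a prefix-to-URI mapping. The sketch below is not taken from vpower.py. It reuses the sample from the question, adds a second, made-up Trackpoint to show how missing values are handled, and uses an arbitrary prefix ext plus a hypothetical helper trackpoint_rows. The core elements in this sample carry no namespace; in a complete TCX file they typically do, and those paths would then need a prefix as well.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the second Trackpoint is made up to show missing fields.
SAMPLE = """<Track>
  <Trackpoint>
    <Time>2015-08-29T22:04:39.000Z</Time>
    <Position>
      <LatitudeDegrees>37.198049426078796</LatitudeDegrees>
      <LongitudeDegrees>127.07204628735781</LongitudeDegrees>
    </Position>
    <AltitudeMeters>34.79999923706055</AltitudeMeters>
    <Extensions>
      <TPX xmlns="http://www.garmin.com/xmlschemas/ActivityExtension/v2">
        <Watts>112</Watts>
      </TPX>
    </Extensions>
  </Trackpoint>
  <Trackpoint>
    <Time>2015-08-29T22:04:40.000Z</Time>
  </Trackpoint>
</Track>"""

# 'ext' is an arbitrary label; only the URI has to match the document.
NS = {"ext": "http://www.garmin.com/xmlschemas/ActivityExtension/v2"}

# Column name -> path relative to one <Trackpoint>.
FIELDS = {
    "time": "Time",
    "latitude": "Position/LatitudeDegrees",
    "longitude": "Position/LongitudeDegrees",
    "altitude": "AltitudeMeters",
    "watts": "Extensions/ext:TPX/ext:Watts",
}


def trackpoint_rows(root):
    """Return one dict per Trackpoint, with '' for fields that are absent."""
    rows = []
    for point in root.iter("Trackpoint"):
        rows.append(
            {
                column: point.findtext(path, default="", namespaces=NS)
                for column, path in FIELDS.items()
            }
        )
    return rows


rows = trackpoint_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

A list of dicts like this can be handed to csv.DictWriter or to a dataframe constructor to get the table the question asks for.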