I want to parse XML that contains a CDATA section in the following format:
<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>
Please help me find a solution.
3 answers
#1
4
This shouldn't be a problem, e.g. with lxml:
from lxml import etree

# The CDATA section is returned as ordinary element text.
xml_input = '<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>'
root = etree.fromstring(xml_input)
for s in root.xpath("//showtimes"):
    print(s.text)
... prints:
6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011
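If lxml is not installed, the standard library's xml.etree.ElementTree should treat the CDATA section the same way: its content simply becomes part of the element's text. A minimal sketch, assuming Python 3; the variable names (`xml_source`, `showtimes_text`) are illustrative and not from the original answer.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the question, split across lines for readability.
xml_source = (
    "<showtimes><![CDATA["
    "6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=18:50&perfd=03012011,"
    "9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=21:40&perfd=03012011"
    "]]> </showtimes>"
)

root = ET.fromstring(xml_source)
# The CDATA wrapper is gone after parsing; .text holds its content plus the
# space that follows "]]>" in the source, so strip the surrounding whitespace.
showtimes_text = root.text.strip()
print(showtimes_text)
```

The ampersands in the URLs need no escaping here because they sit inside the CDATA section.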
#2
1
I'm not sure what you are looking for, so here is an answer based on some wild assumptions.
PS: This solution needs lxml.
>>> s = """<showtimes><![CDATA[6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=18:50&perfd=03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446&language=2&movie_id=87050&perft=21:40&perfd=03012011]]> </showtimes>"""
>>> from lxml import etree
>>> import urlparse
>>> doc = etree.fromstring(s)
>>> _time, url = doc.text.split(',', 1)
>>> _time # Not sure if you want this
'6:50 PM'
>>> for key, value in urlparse.parse_qs(urlparse.urlsplit(url).query).items():
...     print key, value
...
perfd ['03012011,9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom', '03012011 ']
movie_id ['87050', '87050']
language ['2', '2']
perft ['18:50', '21:40']
afid ['rgncom']
house_id ['6446', '6446']
>>>
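The session above is Python 2 (`urlparse` module, `print` statement). Because the text is split on the first comma only, the second showtime ends up inside the first URL's `perfd` value. Below is a sketch for Python 3 that pairs each time with its own URL. It assumes the CDATA content always alternates time, URL, time, URL, and that no URL contains a comma. `parse_showtimes` is a hypothetical helper name, and the standard library's ElementTree is used here in place of lxml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit


def parse_showtimes(xml_source):
    """Return a list of (time, url, query_params) tuples from <showtimes>."""
    text = ET.fromstring(xml_source).text.strip()
    fields = text.split(",")
    # Assumption: fields alternate time, URL, time, URL, ...
    times, urls = fields[0::2], fields[1::2]
    return [(t, u, parse_qs(urlsplit(u).query)) for t, u in zip(times, urls)]


sample = (
    "<showtimes><![CDATA["
    "6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=18:50&perfd=03012011,"
    "9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=21:40&perfd=03012011"
    "]]> </showtimes>"
)

for show_time, url, params in parse_showtimes(sample):
    # parse_qs maps each key to a list of values, e.g. params["perft"].
    print(show_time, params["perft"], params["perfd"])
```

If a URL could itself contain commas, the alternating split would need a stricter rule, for example splitting only where a comma is followed by a time.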
#3
0
As far as I know, the standard Python SAX parser handles CDATA correctly, so you should be able to parse it.
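To illustrate the SAX route, here is a minimal sketch assuming Python 3 and the standard library's `xml.sax` module. `ShowtimesHandler` is a hypothetical class name, not something from the original answer. CDATA content reaches the handler through the ordinary `characters()` callback.

```python
import xml.sax


class ShowtimesHandler(xml.sax.ContentHandler):
    """Collects the character data found inside <showtimes> elements."""

    def __init__(self):
        super().__init__()
        self.showtimes = []
        self._chunks = None  # None while the parser is outside <showtimes>

    def startElement(self, name, attrs):
        if name == "showtimes":
            self._chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "showtimes":
            self.showtimes.append("".join(self._chunks).strip())
            self._chunks = None


sample = (
    "<showtimes><![CDATA["
    "6:50 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=18:50&perfd=03012011,"
    "9:40 PM,https://www.movietickets.com/purchase.asp?afid=rgncom&house_id=6446"
    "&language=2&movie_id=87050&perft=21:40&perfd=03012011"
    "]]> </showtimes>"
)

handler = ShowtimesHandler()
xml.sax.parseString(sample.encode("utf-8"), handler)
print(handler.showtimes[0])
```

Joining the chunks in `endElement` matters because a single text node is not guaranteed to arrive in one `characters()` call.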