I am using urllib and BeautifulSoup to parse an XML file in Django. I can't parse the content of a description tag that is wrapped in CDATA.
My XML:
<item>
<title>EU Confronting US Over Surveillance</title>
<description><![CDATA[Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries]]></description>
<guid>http://www.voanews.com/content/eu-confronting-us-over-surveillance/1778928.html</guid>
</item>
The description tag is inside the item tag. In views.py:
for i in soup.findAll('item'):
    print i.description.string
If the CDATA wrapper is not there, I can parse the contents of the description tag. I don't know how to parse this content when it is present. Please help me out. Also, how do I get the image inside the tag?
<description><img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/tony-abbott-visits-afghanistan-says-australias-war-is-over_291013013344_338x225.jpg' width='90' height='62'><p>"Australia's longest war" is ending and its defence forces mission in Afghanistan will be complete by 2013 end, Prime Minister Tony Abbott announced in a statement on Tuesday.</p></description>
1 solution
#1
CDATA sections can be accessed like this:
>>> import BeautifulSoup
>>> txt = '''<description><![CDATA[Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries]]></description>'''
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for cd in soup.findAll(text=True):
...     if isinstance(cd, BeautifulSoup.CData):
...         print 'CData value: %r' % cd
...
CData value: u'Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries'
>>>
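For comparison, here is a minimal sketch that reads the same description with the standard library instead of BeautifulSoup. This is a swapped-in technique, not the approach used in this answer: it assumes Python 3 and a well-formed item element, and the helper name description_text is made up for illustration. The XML parser behind xml.etree.ElementTree hands CDATA back as ordinary element text, so no isinstance check is needed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <item> from the question (description text trimmed).
ITEM_XML = """<item>
<title>EU Confronting US Over Surveillance</title>
<description><![CDATA[Voice of America is an international news and broadcast organization]]></description>
<guid>http://www.voanews.com/content/eu-confronting-us-over-surveillance/1778928.html</guid>
</item>"""


def description_text(item_xml):
    """Return the text of the first <description> child, CDATA included.

    Hypothetical helper: the name and signature are for illustration only.
    """
    item = ET.fromstring(item_xml)
    # findtext returns the element's text; a CDATA section arrives as plain text.
    return item.findtext("description")


if __name__ == "__main__":
    print(description_text(ITEM_XML))
```

This only works when the feed is well-formed XML; a lenient parser such as BeautifulSoup is still the better fit for broken markup.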
Here is an edit based on your comment that should help:
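from bs4 import BeautifulSoup, CData
import urllib

# Fetch the feed; the URL must include the http:// scheme.
source_txt = urllib.urlopen("http://voanews.com/api/epiqq")
# BeautifulSoup is imported as a class here, so call it directly.
soup = BeautifulSoup(source_txt.read())
# Walk every text node and keep only the CDATA sections.
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print 'CData value: %r' % cd
Things to note:
- The import statement. BeautifulSoup and CData are imported directly from bs4, so the constructor is called as BeautifulSoup(...), not BeautifulSoup.BeautifulSoup(...).
- The urlopen parameter. It needs the http:// prefix.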
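The question also asks how to get the image inside the description. The answer above does not cover that, so the following is only a rough sketch under some assumptions: Python 3, the description text has already been extracted as a string (for example from the CDATA value), and it holds an HTML fragment like the one in the question. The class ImgSrcCollector, the function image_sources, and the example.com URL are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(html_fragment):
    """Return the list of image URLs found in an HTML fragment (hypothetical helper)."""
    collector = ImgSrcCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    # Placeholder fragment shaped like the description in the question.
    fragment = "<img src='http://example.com/pic.jpg' width='90' height='62'><p>Some text.</p>"
    print(image_sources(fragment))
```

With BeautifulSoup the same idea would be to parse the description string a second time and look up the img tags, but that variant is not shown here.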