My code:
import re
import requests
from lxml import etree
url = 'http://weixin.sogou.com/gzhjs?openid=oIWsFt__d2wSBKMfQtkFfeVq_u8I&ext=2JjmXOu9jMsFW8Sh4E_XmC0DOkcPpGX18Zm8qPG7F0L5ffrupfFtkDqSOm47Bv9U'
r = requests.get(url)
items = r.json()['items']
- without encode('utf-8'):
etree.fromstring(items[0])
output:
ValueError
Traceback (most recent call last)
<ipython-input-69-cb8697498318> in <module>()
----> 1 etree.fromstring(items[0])
lxml.etree.pyx in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)()
parser.pxi in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102435)()
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
- with encode('utf-8'):
etree.fromstring(items[0].encode('utf-8'))
output:
File "<string>", line unknown
XMLSyntaxError: CData section not finished
鎶楀啺鎶㈤櫓鎹锋姤:闃冲寳I绾挎, line 1, column 281
I have no idea how to parse this XML.
1 solution
#1
As a workaround, you can remove the encoding attribute from the XML declaration before passing the string to etree.fromstring:
xml = re.sub(r'\bencoding="[-\w]+"', '', items[0], count=1)
root = etree.fromstring(xml)
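The workaround above can be tried without the network request. This is a minimal sketch, assuming a made-up `sample` string in place of `items[0]`; the `DOCUMENT` and `title` element names and the helper name `parse_without_declared_encoding` are hypothetical. Note that the regex only matches a double-quoted encoding value.

```python
import re
from lxml import etree

# Hypothetical stand-in for items[0]: a text string (not bytes) that
# carries its own encoding declaration.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[hello]]></title></DOCUMENT>')


def parse_without_declared_encoding(text):
    """Drop the first encoding="..." attribute, then parse the text string."""
    cleaned = re.sub(r'\bencoding="[-\w]+"', '', text, count=1)
    return etree.fromstring(cleaned)


root = parse_without_declared_encoding(sample)
print(root.tag)                # DOCUMENT
print(root.findtext('title'))  # hello
```

With the declaration's encoding attribute gone, lxml no longer has a reason to reject the text string, so no encode step is needed.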
UPDATE after seeing @Lea's comment in the question:
Specify parser with explicit encoding:
xml = r.json()['items'][0].encode('utf-8')
root = etree.fromstring(xml, parser=etree.XMLParser(encoding='utf-8'))
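The explicit-parser approach can be sketched in the same self-contained way. It assumes the situation in the question: the declaration names one encoding while the bytes handed to the parser are actually UTF-8. The `sample` string, its element names, and the helper name `parse_as_utf8` are hypothetical.

```python
from lxml import etree

# Hypothetical stand-in for items[0]: the declaration says gbk, but the
# text will be encoded to UTF-8 bytes before parsing.
sample = ('<?xml version="1.0" encoding="gbk"?>'
          '<DOCUMENT><title><![CDATA[标题]]></title></DOCUMENT>')


def parse_as_utf8(text):
    """Encode to UTF-8 bytes and tell the parser to treat them as UTF-8."""
    parser = etree.XMLParser(encoding='utf-8')  # overrides the declared encoding
    return etree.fromstring(text.encode('utf-8'), parser=parser)


root = parse_as_utf8(sample)
title = root.findtext('title')
```

Passing `encoding` to `XMLParser` overrides the document's own declaration, which is what should avoid the "CData section not finished" error that appears when UTF-8 bytes are decoded with the declared encoding.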