This question already has an answer here:
- Strip HTML from strings in Python (20 answers)
I have a text like this:
text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""
Using pure Python, with no external modules, I want to get this:
>>> print remove_tags(text)
Title A long text..... a link
I know I can do it with lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, for 2.6+.
How can I do that?
5 solutions
#1
114
Using a regex
Using a regex, you can strip everything inside <>:
import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
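To get the single-line output asked for in the question, the leftover newlines and extra spaces also have to be collapsed. Here is a minimal sketch of that, applied to the question's sample text. The names TAG_PATTERN and strip_tags_and_collapse are my own, not from the answer, and it is written for Python 3 (on 2.6+ only the print call differs):

```python
import re

# Same non-greedy pattern as in the answer above.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_tags_and_collapse(raw_html):
    """Remove tags, then squeeze runs of whitespace into single spaces."""
    without_tags = TAG_PATTERN.sub('', raw_html)
    # str.split() with no argument splits on any whitespace and drops empties.
    return ' '.join(without_tags.split())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

print(strip_tags_and_collapse(text))
```

Note that '.' does not match a newline by default, so a tag that spans several lines would slip through unless the pattern is compiled with re.DOTALL.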
Using BeautifulSoup
You could also use the additional BeautifulSoup package to extract all the raw text.
You will need to set a parser explicitly when calling BeautifulSoup. I recommend "lxml", as mentioned in other answers: it is much more robust than the default one, 'html.parser' (i.e. the one available without an additional install).
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text
But this approach still relies on external libraries, so I recommend the first solution.
#2
30
Python has several XML modules built in. The simplest one for the case where you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:
import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
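One caveat worth seeing in action: xml.etree is an XML parser, so it only accepts well-formed input. The sketch below (Python 3 syntax; the two sample strings are made up for illustration) shows it working on a well-formed fragment and raising ParseError on an unclosed <br>:

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # Same approach as above: parse as XML, then join all text nodes.
    return ''.join(ET.fromstring(text).itertext())

# Every tag is closed, so this parses as XML.
well_formed = '<div><h1>Title</h1><p>A long text</p></div>'
print(remove_tags(well_formed))

# Typical HTML with an unclosed <br> is not well-formed XML.
not_well_formed = '<div><p>Line one<br>Line two</p></div>'
try:
    remove_tags(not_well_formed)
except ET.ParseError as exc:
    print('not well-formed XML:', exc)
```

So this fits best when you control the markup or know it is XHTML-like.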
#3
22
Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
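To make the failure case concrete, here is a small sketch (Python 3 syntax) that runs the same pattern on an ordinary fragment and on the <a title=">"> case. The sample strings are only illustrations:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Ordinary markup: every tag is matched and removed.
plain = remove_tags('<p>Hello <b>world</b></p>')
print(plain)

# A '>' inside an attribute value ends the match too early,
# so part of the attribute is left behind in the output.
broken = remove_tags('<a title=">">link</a>')
print(broken)
```

The second call leaves the tail of the attribute in the result, because the pattern treats the first '>' it sees as the end of the tag.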
However, as lvc mentions, xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:
import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
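If the input is real-world HTML rather than well-formed XML, another standard-library option is the HTML parser module. This is a different technique from the two shown in this answer, sketched here for Python 3, where the module is html.parser (on Python 2 it is named HTMLParser). The class name TextExtractor is my own:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags.
        self.chunks.append(data)

def remove_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return ''.join(parser.chunks)

print(remove_tags('<p>Line one<br>Line two &amp; more</p>'))
print(remove_tags('<a title=">">link</a>'))
```

Unlike xml.etree, this tolerates unclosed tags, and unlike the regex it understands quoted attribute values.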
#4
4
There's a simple way to do this in any C-like language. The style is not Pythonic, but it works in pure Python:
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw
You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
PS - If you're interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!
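One thing to watch for in the function above: it toggles the quote flag on either quote character, so an apostrophe inside a double-quoted attribute value (for example title="it's") leaves the machine in the wrong state. A possible variant, offered as a sketch rather than the course's solution, remembers which quote character opened the value. The variable names in_tag and quote_char are my own:

```python
def remove_html_markup(s):
    in_tag = False
    quote_char = None  # quote character that opened the current attribute value
    out = []

    for c in s:
        if in_tag and quote_char is not None:
            # Inside a quoted attribute value: only the matching quote ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            # Plain text outside any tag, including quote characters.
            out.append(c)

    return ''.join(out)

print(remove_html_markup('<a title="it\'s">link</a>'))
print(remove_html_markup('"quoted" <b>text</b>'))
```

Collecting characters in a list and joining once at the end also avoids repeated string concatenation inside the loop.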
#5
-5
global temp
temp =''
s = ' '

def remove_strings(text):
    global temp
    if text == '':
        return temp
    start = text.find('<')
    end = text.find('>')
    if start == -1 and end == -1 :
        temp = temp + text
        return temp
    newstring = text[end+1:]
    fresh_start = newstring.find('<')
    if newstring[:fresh_start] != '':
        temp += s+newstring[:fresh_start]
    remove_strings(newstring[fresh_start:])
    return temp