Regex to extract all regular text from HTML in Python [duplicate]

Time: 2022-10-31 13:12:27

This question already has an answer here:

How do I extract everything that is not an HTML tag from a partial HTML text?

That is, if I have something like:

<div>Hello</div><h3><div>world</div></h3>

I want to extract ['Hello','world']

I thought about the Regex:

>[a-zA-Z0-9]+<

but it will not match special characters or Chinese or Hebrew characters, which I need.
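
To make the limitation concrete, here is a minimal sketch comparing the pattern above with a looser one. The extra `<p>` tags in the sample and the second pattern are assumptions added for illustration, not part of the original question.

```python
import re

# Sample input; the two <p> tags are hypothetical additions for illustration.
html = "<div>Hello</div><h3><div>world</div></h3><p>你好</p><p>שלום!</p>"

# The pattern from the question: ASCII letters and digits only.
ascii_only = re.findall(r">[a-zA-Z0-9]+<", html)

# Hypothetical alternative: capture any run of characters that contains
# neither "<" nor ">", so non-ASCII text and punctuation are kept.
any_text = re.findall(r">([^<>]+)<", html)

print(ascii_only)
print(any_text)
```

The first pattern skips the Chinese and Hebrew text entirely, while the second picks it up. The second is still fragile: a `>` inside an attribute value, a comment, or a script body would confuse it.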

3 Answers

#1


3  

You should look at something like the post "regular expression to extract text from HTML".

From that post:

You can't really parse HTML with regular expressions. It's too complex. REs won't handle some constructs correctly, and some markup that works in a browser as proper text might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
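
As a rough illustration of the "someone else's parser" route, here is a minimal sketch using `html.parser` from the Python standard library. The names `TextCollector` and `extract_text` are hypothetical, and skipping `<script>` and `<style>` bodies is an assumption about what counts as regular text.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the text nodes of an HTML fragment as a list of strings."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


print(extract_text("<div>Hello</div><h3><div>world</div></h3>"))
```

Because the parser works on Unicode strings, Chinese or Hebrew text needs no special handling. How well it copes with badly malformed input is something to check against your own data.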

#2


1  

As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your HTML.

from bs4 import BeautifulSoup
# Name the parser explicitly; bs4 warns when it has to guess one
clean_text = BeautifulSoup(html, "html.parser").get_text()

or

import nltk
# clean_html() works only in NLTK 2.x; NLTK 3 removed it and raises NotImplementedError
clean_text = nltk.clean_html(html)

Another option, thanks to GuillaumeA, is to use pyquery:

from pyquery import PyQuery
# .text() returns the text content; PyQuery(html) alone is a PyQuery object
clean_text = PyQuery(html).text()

It must be said that the HTML parsers mentioned above will do the job with varying levels of success if the HTML is not well formed, so you should experiment and see what works best for your input data.

#3


-1  

I am not familiar with Python, but the following regular expression can help you.

<\s*(\w+)[^/>]*>

where,

<: starting character

\s*: it may have whitespaces before tag name (ugly but possible).

(\w+): tag names can contain letters and digits (e.g. h1). \w also matches '_', but I guess that does no harm. If it bothers you, use ([a-zA-Z0-9]+) instead.

[^/>]*: any characters other than / and >, up to the closing >

>: closing >
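
For reference, a small Python sketch of what this pattern does on the question's sample. The broader `ANY_TAG` pattern and the split-based extraction are assumptions added for illustration, not part of this answer.

```python
import re

html = "<div>Hello</div><h3><div>world</div></h3>"

# The pattern from this answer: it captures the name of each opening tag.
OPEN_TAG = re.compile(r"<\s*(\w+)[^/>]*>")
opening_tags = OPEN_TAG.findall(html)

# Hypothetical broader pattern (an assumption, not from the answer):
# treat anything between "<" and ">" as a tag and split the input on it.
ANY_TAG = re.compile(r"<[^>]+>")
texts = [piece.strip() for piece in ANY_TAG.split(html) if piece.strip()]

print(opening_tags)
print(texts)
```

The answer's pattern matches opening tags only, since closing tags start with `/`. Getting the text out therefore needs something that also accounts for closing tags, as the split above does. It remains a naive approach and will misfire on comments, scripts, or `>` inside attribute values.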
