Remove all HTML tags from a web page

Time: 2023-01-09 15:29:36

I am doing some Bash shell scripting with curl. If my curl command returns any text, I know I have an error. The text returned by curl is usually HTML. I figured that if I could strip out all of the HTML tags, I could display the resulting text as an error message.

I was thinking of something like this:

sed -E 's/<.*?>//g' <<<$output_text

But I get sed: 1: "s/<.*?>//": RE error: repetition-operator operand invalid

If I replace *? with *, I don't get the error (and I don't get any text either). If I remove the global (g) flag, I get the same error.

This is on Mac OS X.

3 Solutions

#1


5  

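sed doesn't support non-greedy quantifiers such as *?. With a plain *, the pattern <.*> is greedy and matches from the first < to the last > on the line, which is probably why no text was left.

Try a negated character class instead:

sed 's/<[^>]*>//g' <<<"$output_text"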
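
To tie this back to the curl use case, here is a minimal sketch of how the negated-class expression could be wrapped. The function name strip_tags and the sample HTML are made up for illustration, and it only handles tags that open and close on the same line.

```shell
# strip_tags is a hypothetical helper name, not part of any tool.
# It removes every <...> run that starts and ends on the same line.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# Sample text standing in for whatever curl returned (made-up content).
output_text='<html><body><h1>404 Not Found</h1></body></html>'

# Any text at all means an error, so show it without the markup.
if [ -n "$output_text" ]; then
  printf 'Error: %s\n' "$(printf '%s\n' "$output_text" | strip_tags)"
fi
# prints: Error: 404 Not Found
```

Because [^>]* can never run past a >, each match stops at the nearest closing bracket, which is the effect the non-greedy *? was meant to give.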

#2


3  

Maybe a parser-based Perl solution?

perl -0777 -MHTML::Strip -nlE 'say HTML::Strip->new->parse($_)' file.html

You must first install the HTML::Strip module with the command cpan HTML::Strip.

Alternatively, you can use a standard OS X utility called textutil (see its man page):

textutil -convert txt file.html

This will produce file.txt with the HTML tags stripped. Or, to use it in a pipeline:

textutil -convert txt -stdin -stdout < file.html | some_command

Another alternative: some systems have the lynx text-only browser installed. You can use:

lynx -dump file.html #or
lynx -stdin -dump < file.html

But in your case, you can probably rely only on pure sed or awk solutions... IMHO.

But if you have Perl (and are only missing the HTML::Strip module), the following is still better than sed:

perl -0777 -pe 's/<.*?>//sg'

because it also removes tags that span several lines, which are common, such as:

<a
 href="#"
 class="some"
>link text</a>

#3


1  

Code for GNU sed:

sed '/</ {:k s/<[^>]*>//g; /</ {N; bk}}' file

This might fail; you would be better off using an HTML parsing tool.
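
For anyone wondering how that one-liner works, here is the same script split into -e fragments with a comment per step. This is only a sketch: the function name strip_tags_sed is made up, and the assumption that the split form is also accepted by non-GNU sed implementations has not been checked on every version.

```shell
# strip_tags_sed is a hypothetical name for the script above, split up.
# Steps, in order:
#   /</{        only act on lines that contain a "<"
#   :k          label marking the start of the loop
#   s/.../g     delete every complete <...> in the pattern space
#   /</{ N; bk  if a "<" is still left, append the next line and retry
strip_tags_sed() {
  sed -e '/</{' \
      -e ':k' \
      -e 's/<[^>]*>//g' \
      -e '/</{' \
      -e 'N' \
      -e 'bk' \
      -e '}' \
      -e '}'
}

# A tag spread over four lines is joined up and then removed.
printf '<a\n href="#"\n class="some"\n>link text</a>\n' | strip_tags_sed
# prints: link text
```

One way it can fail: a literal < in the text (for example "1 < 2") never finds its closing >, so the loop keeps appending lines until the end of the input.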
