Remove/replace HTML tags in bash

Date: 2021-01-20 15:29:44

I have a file with lines that contain:

    <li><b> Some Text:</b> More Text </li>

I want to remove the HTML tags and replace the </b> tag with a dash, so that the line becomes:

    Some Text:- More Text

I'm trying to use sed, but I can't find the right regex combination.

2 Answers

#1


14  

If you want to strip all HTML tags but replace only the </b> tag with a -, you can chain two simple sed commands with a pipe:

    cat your_file | sed 's|</b>|-|g' | sed 's|<[^>]*>||g' > stripped_file

This passes the file's contents to the first sed command, which replaces each </b> with a -. The output is then piped to a second sed, which replaces every remaining HTML tag with an empty string. The final output is saved to the new file stripped_file.
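To make the two stages visible, here is a minimal sketch that runs them one at a time on the sample line from the question. The variable names line, stage1 and stage2 are only for illustration, and the square brackets in the last printf are there just to show the surrounding spaces that remain.

```shell
# Sample line taken from the question.
line='<li><b> Some Text:</b> More Text </li>'

# Stage 1: replace each closing </b> tag with a dash.
stage1=$(printf '%s\n' "$line" | sed 's|</b>|-|g')
printf '%s\n' "$stage1"
# prints: <li><b> Some Text:- More Text </li>

# Stage 2: delete every remaining tag (anything from '<' to the next '>').
stage2=$(printf '%s\n' "$stage1" | sed 's|<[^>]*>||g')
printf '[%s]\n' "$stage2"
# prints: [ Some Text:- More Text ]
```

The order matters: if the generic tag-stripping expression ran first, the </b> tag would already be gone and there would be nothing left to turn into a dash.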

Using a method similar to the other answer from @Steve, you could also use sed's -e option to chain both expressions into a single (non-piped) command. By adding -i, you can also edit the original file in place, with no need for cat or a new file:

    sed -i -e 's|</b>|-|g' -e 's|<[^>]*>||g' your_file

This performs the same replacement as the piped command above, but this time it overwrites the contents of the input file directly. To save to a new file instead, remove the -i and append > stripped_file (or any file name you choose) to the end.
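Since -i overwrites the file, a more cautious variant keeps a backup. The sketch below attaches a suffix directly to -i, a form that both GNU sed and BSD/macOS sed accept (a bare -i with no suffix is GNU-style; BSD sed expects a suffix argument). The temporary directory, the demo_file variable and the .bak suffix are choices made for this example, not part of the original answer.

```shell
# Work in a throwaway directory so no real file is touched.
tmpdir=$(mktemp -d)
demo_file="$tmpdir/your_file"
printf '%s\n' '<li><b> Some Text:</b> More Text </li>' > "$demo_file"

# -i.bak edits the file in place and keeps the original as your_file.bak.
sed -i.bak -e 's|</b>|-|g' -e 's|<[^>]*>||g' "$demo_file"

# Read both files back to compare the edited text with the backup.
edited=$(cat "$demo_file")
backup=$(cat "$demo_file.bak")
printf 'edited: [%s]\n' "$edited"
printf 'backup: [%s]\n' "$backup"

# Clean up the throwaway directory.
rm -r "$tmpdir"
```

If the result looks wrong, the untouched original is still available in the .bak file.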

#2


0  

One way using GNU sed:

    sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g' file.txt
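This command uses / as the delimiter of the s command, so the slash inside </b> has to be escaped as \/. Choosing another delimiter such as | avoids the escape. A small sketch comparing the two spellings on the question's sample line (the variable names are illustrative only):

```shell
input='<li><b> Some Text:</b> More Text </li>'

# Slash delimiter: the '/' in </b> must be written as '\/'.
with_slash=$(printf '%s\n' "$input" | sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g')

# Pipe delimiter: no escaping needed.
with_pipe=$(printf '%s\n' "$input" | sed -e 's|</b>|-|g' -e 's|<[^>]*>||g')

# Both spellings should produce the same text.
if [ "$with_slash" = "$with_pipe" ]; then
  echo "identical output"
fi
```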

Example:

    echo "<li><b> Some Text:</b> More Text </li>" | sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g'

Result:

     Some Text:- More Text
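Note that this result still carries the spaces that surrounded the text inside the tags, while the output requested in the question has none. As a possible extension (not part of the original answer), two more expressions can trim leading and trailing whitespace. The clean variable and the brackets in printf are only there to make the result visible.

```shell
clean=$(echo "<li><b> Some Text:</b> More Text </li>" |
  sed -e 's/<\/b>/-/g' \
      -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//')

# Brackets make any leftover whitespace easy to spot.
printf '[%s]\n' "$clean"
# prints: [Some Text:- More Text]
```

The trimming expressions come last so that they act on the text left over after the tags are gone.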
