Help with sed regex: extracting text from a specific tag

Time: 2022-09-13 16:23:56

First time sed'er, so be gentle.

I have the following text file, 'test_file':

 <Tag1>not </Tag1><Tag2>working</Tag2>

I want to extract the text between the <Tag2> tags using a sed regex. There may be other occurrences of <Tag2>, and I would like to extract those as well.

So far I have this sed based regex:

cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'

which gives the output:

 not working

Anyone any idea how to get this working?

4 solutions

#1


4  

As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.

The problem with your attempt is that you aren't parsing the string properly.

cat test_file is good - it prints out the contents of the file to stdout.

grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.

sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.

You can try something like:

cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'

This will produce

working

but it will only work for one tag pair.
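
If several pairs can appear on one line, one possible workaround is to let grep split out each pair first and only then strip the tags. This is a minimal sketch, not a robust XML solution: it assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX), that the tags are not nested, and that each pair sits on a single line. The function name extract_tag2 is made up for illustration, and unlike the grep -i above it is case-sensitive.

```shell
# Hypothetical helper: print the text of every <Tag2>...</Tag2> pair, one per line.
# Assumes grep -o is available and that tags are not nested or split across lines.
extract_tag2() {
  # grep -o emits each matching pair on its own line; sed then strips the tags.
  grep -o '<Tag2>[^<]*</Tag2>' "$1" | sed 's/<[^>]*>//g'
}

# Demo on a throwaway file with two pairs on one line.
demo=$(mktemp)
printf '%s\n' ' <Tag1>not </Tag1><Tag2>working</Tag2><Tag2>twice</Tag2>' > "$demo"
extract_tag2 "$demo"
# prints:
# working
# twice
rm -f "$demo"
```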

#2


4  

For your nice, friendly example, you could use

sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file

but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
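
To make the "cruel" part concrete, here is a small sketch of two inputs where the command above quietly does the wrong thing. The wrapper name strip_around_tag2 and the sample lines are made up for illustration; the sed expressions are the ones from this answer.

```shell
# The two-expression sed from this answer, wrapped in a hypothetical function
# that reads from stdin.
strip_around_tag2() {
  sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!'
}

# Two pairs on one line: the greedy ^.*<Tag2> eats everything up to the LAST
# opening tag, so the first value is lost.
printf '%s\n' '<Tag2>first</Tag2><Tag1>x</Tag1><Tag2>second</Tag2>' | strip_around_tag2
# prints: second

# A line without any <Tag2>: neither expression matches, so the line is
# printed unchanged instead of being skipped.
printf '%s\n' '<Tag1>only</Tag1>' | strip_around_tag2
# prints: <Tag1>only</Tag1>
```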

#3


0  

You can use gawk, e.g.:

$ cat file
 <Tag1>not </Tag1><Tag2>working here</Tag2>
 <Tag1>not </Tag1><Tag2>
working

</Tag2>

$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here

working
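
The blank lines in that output come from the newlines inside the second <Tag2> element. A possible variant that trims them is sketched below. It relies on the same trick of using a multi-character RS, which gawk and mawk support but a strictly POSIX awk may not, so treat that as an assumption. The function name tag2_records is made up for illustration.

```shell
# Hypothetical helper: treat each </Tag2> as a record separator and print the
# trimmed text that follows the opening <Tag2> in every record.
# Assumes an awk with multi-character RS support (e.g. gawk or mawk).
tag2_records() {
  awk -v RS='</Tag2>' '
    /<Tag2>/ {
      s = substr($0, index($0, "<Tag2>") + 6)   # text after the opening tag
      gsub(/^[ \t\n]+/, "", s)                  # trim leading blanks and newlines
      gsub(/[ \t\n]+$/, "", s)                  # trim trailing blanks and newlines
      print s
    }' "$1"
}
```

Run against the sample file above, this should print "working here" and "working" on two consecutive lines, without the empty lines.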

#4


0  

awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'
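
For context on what this one-liner does: awk splits each line on the literal string Tag2, so the second field is the text between the two tag names plus the leftover ">" and "</", and sed then deletes every character that is not a letter. That also deletes spaces and digits inside the value. The sketch below contrasts it with a variant that removes only the leftover delimiters; both function names and the sample line are made up for illustration, and both still assume a single <Tag2> pair per line.

```shell
# The pipeline from this answer, reading stdin (hypothetical wrapper name).
field_letters_only() {
  awk -F"Tag2" '{print $2}' | sed 's/[^a-zA-Z]//g'
}

# Variant: strip only the leading ">" and the trailing "</" left by the split.
field_trimmed() {
  awk -F"Tag2" '{print $2}' | sed -e 's/^>//' -e 's/<\/$//'
}

line='<Tag1>not </Tag1><Tag2>working here 2</Tag2>'
printf '%s\n' "$line" | field_letters_only   # prints: workinghere
printf '%s\n' "$line" | field_trimmed        # prints: working here 2
```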
