How do you use regexes to find a repeated string and the value between its occurrences?

Time: 2021-11-20 19:25:00

How would you use regexes to find the value of a string that is repeated, and the data between its two occurrences? For example, take this piece of XML:

<tagName>Data between the tag</tagName>

What would be the correct regex to find these values? (Note that tagName could be anything).

I have found a way that works: find every tag name enclosed in a pair of < >, then, for each one, search from its opening tag to the end of the string for the closing </tagName>, and work out the data between them. However, this is extremely inefficient and complex. There must be an easier way!

EDIT: Please don't tell me to use XMLReader. I doubt I will ever use my custom class for reading XML; I am trying to learn the best way to do it (and the wrong ways) by attempting to make my own.

Thanks in advance.

5 Answers

#1


5  

You can use: <(\w+)>(.*?)<\/\1>

Group #1 is the tag, Group #2 is the content.
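
To make the two groups concrete, here is a minimal Python sketch of that pattern. The function name `find_simple_elements` and the `re.DOTALL` flag (so content may span lines) are assumptions added for illustration; the sketch only handles flat elements without attributes.

```python
import re

# Pattern from the answer: group 1 is the tag name, group 2 the content.
# \1 requires the closing tag to repeat whatever group 1 captured.
# re.DOTALL is an added assumption so that content may contain newlines.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def find_simple_elements(text):
    """Return (tag, content) pairs for flat, attribute-free elements."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(text)]


sample = "<tagName>Data between the tag</tagName><other>more</other>"
print(find_simple_elements(sample))
# [('tagName', 'Data between the tag'), ('other', 'more')]
```

A mismatched pair such as `<a>x</b>` produces no match, because the backreference insists on the same name in the closing tag.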

#2


3  

Using regular expressions to parse XML is a terrible error.

This is efficient (it doesn't parse the XML into a DOM) and simple enough:

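using System;
using System.IO;
using System.Xml;

string s = "<tagName>Data between the tag</tagName>";

using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
    xr.Read();  // advance to the <tagName> start tag
    Console.WriteLine(xr.ReadElementContentAsString());  // print the element's text content
}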

Edit:

Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:

Consider this fairly trivial test case:

<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>

There are two elements with a tag name of "a" in that XML. The inner one has one text-node child with a value of "text1", and the outer one has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't, because it's enclosed in a CDATA section.

You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.

And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.
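
The stack-based approach described above can be sketched roughly as follows, in Python. This is only an illustration under loud assumptions: tags have no attributes, there are no comments, processing instructions, or empty elements, and the names `TOKEN` and `direct_text` are made up for this example. The sample string is the test case written with a well-formed CDATA section.

```python
import re

# One token per match: a CDATA section, an end tag, or a start tag.
# Assumption: no attributes, comments, or self-closing tags (a sketch only).
TOKEN = re.compile(
    r"<!\[CDATA\[(?P<cdata>.*?)\]\]>|</(?P<end>\w+)>|<(?P<start>\w+)>",
    re.DOTALL,
)


def direct_text(xml):
    """Return (tag, text) pairs; text is the element's own text children."""
    stack = []   # open elements: [tag name, list of text pieces]
    closed = []  # finished elements, in the order their end tags appear
    pos = 0
    for m in TOKEN.finditer(xml):
        # Plain text before this token belongs to the innermost open element.
        if stack and m.start() > pos:
            stack[-1][1].append(xml[pos:m.start()])
        pos = m.end()
        if m.group("cdata") is not None:
            # CDATA content is literal text; nothing in it touches the stack.
            if stack:
                stack[-1][1].append(m.group("cdata"))
        elif m.group("start"):
            stack.append([m.group("start"), []])
        else:
            end = m.group("end")
            if not stack or stack[-1][0] != end:
                raise ValueError(f"unexpected end tag </{end}>")
            name, parts = stack.pop()
            closed.append((name, "".join(parts)))
    if stack:
        raise ValueError(f"unclosed element <{stack[-1][0]}>")
    return closed


sample = "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>"
print(direct_text(sample))
# [('b', '<a>text2</a>'), ('a', 'text1'), ('b', ''), ('a', 'text3')]
```

The regex here only splits the input into tokens; the nesting itself is tracked by the stack, which is the part a single pattern cannot do.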

#3


2  

You can use a backreference like \1 to refer to an earlier match:

@"<([^>]*)>(.*)</\1>"

The \1 will match what was captured by the first parenthesized group.
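
One thing worth checking with this pattern is the greedy `.*`: when the same tag appears more than once on a line, it runs to the last closing tag. The Python sketch below compares it with a lazy variant; the lazy pattern and the helper name `first_content` are additions for illustration, not part of the answer.

```python
import re

GREEDY = re.compile(r"<([^>]*)>(.*)</\1>")  # pattern from this answer
LAZY = re.compile(r"<([^>]*)>(.*?)</\1>")   # same pattern, lazy quantifier


def first_content(pattern, text):
    """Return group 2 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(2) if m else None


siblings = "<a>1</a><a>2</a>"
print(first_content(GREEDY, siblings))  # 1</a><a>2
print(first_content(LAZY, siblings))    # 1
```

On a single element such as `<tagName>Data between the tag</tagName>`, both patterns capture the same content.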

#4


0  

With Perl:

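my $tagName = 'tagName';  # the tag name to look for, without angle brackets
my $i = '<tagName>Data between the tag</tagName>';  # some line of XML
$i =~ /<$tagName>(.+)<\/$tagName>/;  # both tags must use the same variable, $tagName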

After a successful match, $1 holds the data you captured.

#5


0  

Going forward, if you get stuck, check out regexlib.com.

It's the first place I go when I get stuck on a regex.
