How do I strip a pattern or words from the end of a string, working backwards?

Time: 2021-02-11 16:55:49

I have a string like this:

<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>

I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.

I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:

<v1>aaa<b>bbb</b>ccc</v1>

I know I could 'do it right', but I do not want to do XML or HTML parsing for my purpose, which is to help myself visualize the XML representation of some classes.

Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:

If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match, but if the matches can overlap this may not give the correct result.

But str.rstrip is no good with words (it strips a set of characters, not a substring), and it won't do patterns either.

I looked at "Strip HTML from strings in Python", but I only wish to strip up to 3 tags.

What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Tokenize it (but then why not parse)? Or build static closing tags from the left-to-right matches?

Which strategy should I follow to strip the patterns from the end of the string?

4 solutions

#1


3  

The simplest would be to use old-fashioned string splitting and limit the number of splits:

in_str.split('>', 3)[-1].rsplit('<', 3)[0]

Demo:

>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'

str.split() and str.rsplit() with a limit split the string at most that many times, counting from the start or from the end respectively, and leave the remainder unsplit so you can select it.
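To make the two steps visible, here is a minimal sketch that wraps the one-liner in a helper. The name strip_n_tags and its n parameter are made up for illustration; it assumes, like the one-liner, that the string starts with n opening tags and ends with n closing tags.

```python
def strip_n_tags(in_str, n=3):
    """Drop the first n opening and last n closing tags using plain splits."""
    # split('>', n) cuts after the first n '>' characters; the last piece
    # is everything that follows the n-th opening tag.
    rest = in_str.split('>', n)[-1]
    # rsplit('<', n) cuts before the last n '<' characters; the first piece
    # is everything that precedes the n-th closing tag counted from the end.
    return rest.rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(in_str.split('>', 3))   # the intermediate pieces of the first step
print(strip_n_tags(in_str))   # <v1>aaa<b>bbb</b>ccc</v1>
```

Because only the characters '>' and '<' are counted, the helper never checks that the tags it removes actually pair up.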

#2


2  

You've already got practically the whole solution. re can't search backwards, but you can reverse the string yourself:

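import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first 3 opening tags, scanning left to right.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so that the closing tags come first.
in_str = in_str[::-1]
print(in_str)
# Strip the first 3 (reversed) closing tags, then restore the order.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
in_str = in_str[::-1]

print(in_str)
# <v1>aaa<b>bbb</b>ccc</v1>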

Note that the regex for the reversed string is itself written in reverse ('>' first, then the tag name, then '/<'); reversing the result a second time puts the string back in its original order.
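If reversing feels too ugly, a possible alternative is to collect every closing-tag match from left to right and cut out only the last few. This is a sketch, not part of the answer above: the function name strip_last_closing_tags is hypothetical, and the pattern assumes closing tags never contain '<' or '>'.

```python
import re

CLOSING_TAG = re.compile(r'</[^<>]+>')


def strip_last_closing_tags(s, n):
    """Remove the last n closing tags found in s, leaving other text alone."""
    if n <= 0:
        return s
    matches = list(CLOSING_TAG.finditer(s))
    # Delete from right to left so earlier match offsets stay valid.
    for m in reversed(matches[-n:]):
        s = s[:m.start()] + s[m.end():]
    return s


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
head_stripped = re.sub(r'<[^<>]+>', '', in_str, count=3)
print(strip_last_closing_tags(head_stripped, 3))  # <v1>aaa<b>bbb</b>ccc</v1>
```

Closing tags cannot overlap each other, so the overlap caveat quoted in the question does not apply to this particular pattern.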

Of course, as mentioned, this is way easier with a proper parser:

from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)
# ix is <foo>; three levels down is <v1>.
print(etree.tostring(ix[0][0][0], encoding='unicode'))
# <v1>aaa<b>bbb</b>ccc</v1>
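If installing lxml is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree. The helper name descend is made up for this example, and it assumes each wrapper element has exactly one child worth following.

```python
import xml.etree.ElementTree as ET


def descend(element, levels):
    """Follow the first child `levels` times and return the element reached."""
    for _ in range(levels):
        element = element[0]  # raises IndexError if there is no child
    return element


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
root = ET.fromstring(in_str)
inner = descend(root, 3)
print(ET.tostring(inner, encoding='unicode'))  # <v1>aaa<b>bbb</b>ccc</v1>
```

Passing encoding='unicode' makes tostring return a str rather than bytes.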

#3


1  

I would look into regular expressions and use a pattern that matches a tag as the separator for a split:

http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
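The answer gives no code, so the following is only one guess at what it might look like. re.split takes a maxsplit argument but has no right-to-left counterpart, so this sketch uses re.split for the head and an end-anchored re.sub for the tail. The function name strip_with_re is hypothetical, and the tail pattern assumes the last n closing tags are adjacent at the very end of the string.

```python
import re


def strip_with_re(in_str, n=3):
    """Drop the first n tags via re.split and the last n closing tags via re.sub."""
    if n < 1:
        return in_str  # maxsplit=0 would mean "no limit" in re.split
    # Everything after the n-th tag match; text before it is discarded.
    rest = re.split(r'<[^<>]+>', in_str, maxsplit=n)[-1]
    # Exactly n consecutive closing tags anchored at the end of the string.
    tail = r'(?:</[^<>]+>){%d}\Z' % n
    return re.sub(tail, '', rest)


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_with_re(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```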

#4


1  

Sorry, can't comment, but will give it as an answer.

in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>. You should just be aware of this.

To handle the counterexample I gave, you will have to track the state (or count) of the tags and check that you are matching the correct pairs.
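One way to track that state is a stack of open tag names. The sketch below is an illustration under stated assumptions, not a parser: the names TAG, strip_one_layer and strip_outer_tags are invented here, tags are assumed to be well nested, and comments, CDATA and '>' inside attribute values are not handled.

```python
import re

# Group 1 is '/' for a closing tag, group 2 is the tag name.
TAG = re.compile(r'<(/?)([^<>/\s]+)[^<>]*>')


def strip_one_layer(s):
    """Remove the outermost element's tags if it spans the whole string."""
    tokens = list(TAG.finditer(s))
    if not tokens or tokens[0].start() != 0 or tokens[0].group(1):
        raise ValueError('string does not start with an opening tag')
    stack = []
    for m in tokens:
        if m.group(0).endswith('/>'):
            continue  # self-closing tag: opens and closes at once
        if not m.group(1):
            stack.append(m.group(2))
        elif not stack or stack.pop() != m.group(2):
            raise ValueError('mismatched closing tag: ' + m.group(0))
        elif not stack:
            # The outermost element has just been closed.
            if m.end() != len(s):
                raise ValueError('outer element does not span the whole string')
            return s[tokens[0].end():m.start()]
    raise ValueError('unclosed tag')


def strip_outer_tags(s, n=3):
    """Peel n matching open/close pairs off both ends of s."""
    for _ in range(n):
        s = strip_one_layer(s)
    return s


good = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(good))  # <v1>aaa<b>bbb</b>ccc</v1>
```

For the counterexample ending in <another>test</another>, strip_outer_tags raises ValueError instead of returning a wrong substring, because </foo> closes the outer element before the end of the string.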
