I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could 'do it right' with a parser, but I do not want to do XML or HTML parsing for my purpose, which is to help me visualize the XML representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with a regex, i.e. right to left, because that seems to be unsupported:
If you mean, find the right-most match of several (similar to the rfind method of a string) then no, it is not directly supported. You could use re.findall() and choose the last match, but if the matches can overlap this may not give the correct result.
But .rstrip is no good with whole words (it strips a set of characters), and it won't do patterns either.
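A small illustration of that point, as a sketch: the argument to .rstrip is treated as a set of characters rather than a suffix, so it eats past the tag I wanted to remove. The variable names here are only for the demonstration.

```python
# The tail end of the example string.
tail = 'ccc</v1></k2></bar></foo>'

# rstrip removes any trailing run of the characters '<', '/', 'f', 'o', '>',
# so it also eats the '>' that closes </bar>.
stripped = tail.rstrip('</foo>')
print(stripped)  # ccc</v1></k2></bar
```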
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself, and more so because of the '<>'s)? Tokenize (but then why not parse)? Or create static closing tags based on the left-to-right match?
Which strategy to follow to strip the patterns from the end of the string?
4 solutions
#1
3
The simplest would be to use old-fashioned string splitting and limit the split:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end at most that many times, letting you select the unsplit remainder.
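The split/rsplit one-liner could be wrapped in a small helper so the tag count is a parameter. This is only a sketch; the name strip_outer and the parameter n are made up for illustration. Unlike re.sub, it discards everything before the n-th '>' and after the n-th-last '<', including any text sitting between the outer tags.

```python
def strip_outer(s, n=3):
    """Drop everything up to the n-th '>' and from the n-th-last '<' onward."""
    # split('>', n) cuts at most n times from the left; [-1] is the unsplit rest.
    # rsplit('<', n) cuts at most n times from the right; [0] is the unsplit rest.
    return s.split('>', n)[-1].rsplit('<', n)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer(in_str))                # <v1>aaa<b>bbb</b>ccc</v1>
print(strip_outer('<a><b>x</b></a>', 1))  # <b>x</b>
```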
#2
2
You've already got practically the whole solution. re can't search backwards, but you can:
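import re

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
# Strip the first 3 tags, left to right.
in_str = re.sub(r'<[^<>]+>', '', in_str, count=3)
# Reverse the string so the closing tags come first.
in_str = in_str[::-1]
# A reversed closing tag looks like '>oof/<'; strip the first 3 of those.
in_str = re.sub(r'>[^<>]+/<', '', in_str, count=3)
# Reverse back to the original orientation.
in_str = in_str[::-1]
print(in_str)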
<v1>aaa<b>bbb</b>ccc</v1>
Note that the regex is reversed as well, to match the reversed string; the result is then reversed back.
Of course, as mentioned, this is way easier with a proper parser:
from lxml import etree

in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
ix = etree.fromstring(in_str)
# ix is <foo>; ix[0][0][0] walks foo -> bar -> k2 -> v1.
print(etree.tostring(ix[0][0][0], encoding='unicode'))
<v1>aaa<b>bbb</b>ccc</v1>
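If lxml is not available, roughly the same thing can be sketched with the standard library's xml.etree.ElementTree instead (a substitution, not what the snippet above uses). The helper name descend and its levels parameter are hypothetical, and it assumes the input is well-formed XML in which each outer element has the next one as its first child.

```python
import xml.etree.ElementTree as ET


def descend(xml_text, levels=3):
    """Parse xml_text and serialize the element found `levels` first-children down."""
    elem = ET.fromstring(xml_text)
    for _ in range(levels):
        elem = elem[0]  # first child; raises IndexError if there is none
    # encoding='unicode' returns a str without an XML declaration.
    return ET.tostring(elem, encoding='unicode')


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(descend(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```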
#3
1
I would look into regular expressions and use such a pattern with re.split:
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
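One possible way to act on a split-based suggestion, as a sketch only: re.split with a capturing group keeps the tags in the result, so text lands on even indices and tags on odd ones, and the first and last n tags can be dropped by position. The names TAG and strip_outer_tags are invented for this example, and nothing here checks that the dropped tags are really opening or closing tags.

```python
import re

# The capturing group makes re.split keep the matched tags in the result list.
TAG = re.compile(r'(<[^<>]+>)')


def strip_outer_tags(s, n=3):
    """Remove the first n and the last n tags, keeping all text in between."""
    if n <= 0:
        return s
    parts = TAG.split(s)                     # even indices: text, odd indices: tags
    tag_positions = range(1, len(parts), 2)  # indices of the tags within parts
    to_drop = set(tag_positions[:n]) | set(tag_positions[-n:])
    return ''.join(p for i, p in enumerate(parts) if i not in to_drop)


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))  # <v1>aaa<b>bbb</b>ccc</v1>
```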
#4
1
Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
will work for the given example <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for <foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>. You should just be aware of this.
To handle the counterexample I gave, you will have to track the state (or count) of the tags and check that you are matching the correct pairs.
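A rough sketch of what such tracking might look like, using a stack of open tags. Everything here (the names TAG and strip_matched_tags, and the choice that a trailing sibling like <another> is left untouched) is an assumption for illustration; it also assumes simple markup without comments, CDATA or processing instructions.

```python
import re

# Group 1: optional leading '/', group 2: tag name; attributes are skipped over.
TAG = re.compile(r'<(/?)([^<>\s/]+)[^<>]*>')


def strip_matched_tags(s, n=3):
    """Remove the first n opening tags and the closing tags paired with them."""
    out, stack = [], []  # stack entries: (tag name, whether its tags are stripped)
    opened = 0           # number of opening tags seen so far
    pos = 0
    for m in TAG.finditer(s):
        out.append(s[pos:m.start()])   # keep the text between tags
        pos = m.end()
        closing, name = m.group(1), m.group(2)
        if m.group(0).endswith('/>'):  # self-closing tag: no pair to track
            out.append(m.group(0))
        elif not closing:
            strip = opened < n
            opened += 1
            stack.append((name, strip))
            if not strip:
                out.append(m.group(0))
        else:
            if not stack or stack[-1][0] != name:
                raise ValueError('unbalanced closing tag: ' + m.group(0))
            _, strip = stack.pop()
            if not strip:
                out.append(m.group(0))
    out.append(s[pos:])
    return ''.join(out)


s = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>'
print(strip_matched_tags(s))  # <v1>aaa<b>bbb</b>ccc</v1><another>test</another>
```

Because each closing tag is paired with its own opening tag, the three stripped closers are exactly </k2>, </bar> and </foo>, wherever they appear in the string.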