I know there are several similar questions, but I could not find one that covers this specific case.
I took a piece of code and tweaked it to my needs, but now I have found a bug in it that I can't fix.
Code:
$tag = 'namespace';
$match = Tags::get($f, $tag);
var_dump($match);
static function get($xml, $tag) { // http://*.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
    // bug case string(56) "<namespaces>
    // <namespace key="-2">Media</namespace>"
    $tag_ini = "<{$tag}[^\>]*?>";
    $tag_end = "<\\/{$tag}>";
    $tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';
    preg_match_all($tag_regex, $xml, $matches, PREG_OFFSET_CAPTURE);
    return $matches;
}
As you can see, there is a bug if the tag is nested. The match comes back as:
<namespaces> <namespace key="-2">Media</namespace>
when it should return 'Media', or even the outer '<namespaces>' and then the inner ones.
I tried changing the opening pattern to "<{$tag}[^\>|^\r\n ]*?>", adding ^\s+, changing the * to *?, and a few other things that at best ended up recognizing only the bugged case.
I also tried "<{$tag}[^{$tag}]*?>", which returns nothing; I suppose it cancels itself out.
I'm a newbie at regex. I can tell that the fix is to not let the match open a new tag of the same type. I could even use a hack answer for my use case, one that excludes a match if the inner text contains a line break.
Can anyone get the right syntax for this?
You can check an extract of the text here: http://pastebin.com/f2naN2S3
After the proposed change:
$tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>";
it does work for the example case, but not for this one:
<namespace key="0" />
<namespace key="1">Talk</namespace>
as it results in:
<namespace key="1">Talk"
I think that's because digits, letters and " are all considered to be inside the word boundary. How could I address that?
3 Answers
#1
1
The main problem is that you did not use a word boundary after the tag name in the opening tag, so namespace in the pattern could also match the namespaces tag, and many others.
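To make the effect of the word boundary concrete, here is a minimal sketch. The sample string is hypothetical, modelled on the bug case quoted in the question, and the variable names $loose and $strict are only for illustration.

```php
<?php
// Hypothetical sample input, modelled on the bug case from the question.
$xml = "<namespaces>\n<namespace key=\"-2\">Media</namespace>\n</namespaces>";
$tag = 'namespace';

// Without \b: after "<namespace", [^>]* is free to swallow the "s" of "<namespaces>".
$loose  = "/<{$tag}[^>]*>(.*?)<\\/{$tag}>/si";
// With \b: the tag name has to end right after "namespace".
$strict = "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}>/si";

preg_match($loose, $xml, $m1);
preg_match($strict, $xml, $m2);

var_dump($m1[1]); // starts at the outer tag, so it still contains the inner opening tag
var_dump($m2[1]); // string(5) "Media"
```

The first pattern starts matching at the outer namespaces tag, which is exactly the output reported in the question; the second one can only start at a tag whose name is exactly namespace.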
The subsequent issue is that the <${tag}\b[^>]*>(.*?)<\/${tag}> pattern would overfire if a self-closing namespace tag is followed by a "normal" paired open/close namespace tag: the match starts at the self-closing tag and runs up to the closing tag of the next one. So, you need to either use a negative lookbehind (?<!\/) before the > (see demo), or use a (?![^>]*\/>) negative lookahead after \b (see demo).
So, you can use
$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
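As a rough check of both variants, the sketch below runs the lookbehind and the lookahead version against a small input. The input is a hypothetical extract combining the lines quoted in the question, and $lookbehind / $lookahead are illustrative names only.

```php
<?php
$tag = 'namespace';

// Hypothetical extract: one self-closing tag between two paired ones.
$xml = "<namespaces>\n"
     . "<namespace key=\"-2\">Media</namespace>\n"
     . "<namespace key=\"0\" />\n"
     . "<namespace key=\"1\">Talk</namespace>\n"
     . "</namespaces>";

// Variant 1: the ">" that closes the opening tag must not be preceded by "/".
$lookbehind = "/<{$tag}\\b[^>]*(?<!\\/)>(.*?)<\\/{$tag}>/si";
// Variant 2: right after the tag name, refuse any tag that ends in "/>".
$lookahead  = "/<{$tag}\\b(?![^>]*\\/>)[^>]*>(.*?)<\\/{$tag}>/si";

preg_match_all($lookbehind, $xml, $a);
preg_match_all($lookahead, $xml, $b);

print_r($a[1]); // Media, Talk
print_r($b[1]); // Media, Talk
```

Both variants skip the self-closing tag on this input. For anything beyond simple, flat markup like this, an XML parser is likely to be more robust than a regex.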
#2
1
This is probably not the ideal answer, but I was messing with a regex generator:
<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11
$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';
$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))'; # Word 1
if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
$word1=$matches[1][0];
print "($word1) \n";
}
#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>
#3
0
This line is what I needed:
$tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
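One caveat, as far as I can tell: inside a character class, | and the second ^ are literal characters, so [^>|^\/>] simply means "any character except >, |, ^ and /". It skips self-closing tags because of the excluded /, but it would also reject an opening tag whose attributes contain a slash. A small sketch of that behaviour follows; the href attribute is invented purely for illustration.

```php
<?php
$tag = 'namespace';
$re  = "/<{$tag}\\b[^>|^\\/>]*>(.*?)<\\/{$tag}>/si";

// Hypothetical inputs; the href attribute does not appear in the original data.
$plain = "<namespace key=\"0\" />\n<namespace key=\"1\">Talk</namespace>";
$slash = "<namespace key=\"1\" href=\"a/b\">Talk</namespace>";

preg_match_all($re, $plain, $m1);
preg_match_all($re, $slash, $m2);

print_r($m1[1]); // the paired tag is found, the self-closing one is skipped
print_r($m2[1]); // empty: the "/" inside the attribute value blocks the match
```

If the data never has a slash, pipe or caret inside an opening tag, the line works as intended.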
Thank you very much, @Alison and @Wictor, for your help and directions.