I'm not very knowledgeable about regex patterns, and even after reading all the wikis and references I could find, I'm having trouble altering a pattern for word detection and highlighting.
I found a function in another answer that did everything I needed, but now I've found that it misses a few things.
The function is:
function ParserGlossario($texto, $termos) {
    $padrao = '\1<a href="#" class="termo">\2</a>\3';
    if (empty($termos)) {
        return $texto;
    }
    if (is_array($termos)) {
        $substituir = array();
        $com = array();
        foreach ($termos as $key => $value) {
            $key = $value;
            $value = $padrao;
            // $key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
            $key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
            $substituir[] = '|' . $key . '|ix';
            $com[] = empty($value) ? $padrao : $value;
        }
        return preg_replace($substituir, $com, $texto);
    } else {
        $termos = '([\s])(' . $termos . ')([\s])';
        return preg_replace('|'.$termos.'|i', $padrao, $texto);
    }
}
Some words are not being highlighted (the ones marked with red arrows):
And I don't know if it helps, but here is the array of "terms" that is used to search the text:
EDIT. The string being searched is just plain text:
Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos. abaxial
EDIT. Added PHP code fiddle
http://phpfiddle.org/main/code/079ad24318f554d9f2ba
Any help? I really don't know much about regexes...
2 Solutions
#1
1
Try
$key = '(^|\b)(' . $key . ')\b';
instead of
$key = '([\s])(' . $key . ')([\s\.\,\!\?\<])';
That should help. Your matches will still be in the second group, but there is no longer a third group, and the first group is now zero-width and need not be reinserted, so I believe this
$padrao = '\1<a href="#" class="termo">\2</a>\3';
is better written as
$padrao = '<a href="#" class="termo">$2</a>';
I forgot one thing (sorry): the new pattern contains a |, which clashes with the | delimiter, so change
$substituir[] = '|' . $key . '|ix';
to
$substituir[] = '#' . $key . '#ix';
I would also use a string
$com = empty($value) ? $padrao : $value;
instead of an array; the array is not needed in this case.
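Putting these suggestions together, a minimal sketch might look like the following. It is not the original ParserGlossario: the name highlightTerms is hypothetical, the terms are joined into a single alternation so that text already wrapped in a link is not scanned again, preg_quote() is added on the assumption that the terms are literal words, and the x flag is dropped so that spaces inside multi-word terms stay significant.

```php
<?php
// Sketch only: highlightTerms() is a hypothetical helper, not the asker's function.
function highlightTerms($text, array $terms)
{
    if (empty($terms)) {
        return $text;
    }
    $quoted = array();
    foreach ($terms as $term) {
        // Assumption: terms are literal words, so escape regex metacharacters
        // and the '#' delimiter.
        $quoted[] = preg_quote($term, '#');
    }
    // \b is zero-width, so neighbouring terms cannot use up each other's spaces.
    $pattern = '#\b(' . implode('|', $quoted) . ')\b#i';
    return preg_replace($pattern, '<a href="#" class="termo">$1</a>', $text);
}

$text = 'Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos. abaxial';
echo highlightTerms($text, array('abaxial', 'acaule')), PHP_EOL;
```

With the sample string from the question, all five occurrences are wrapped, including the ones at the very start and end of the string.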
#2
1
Let us look together at the value of $key, for example for the array element acaule.
([\s])(acaule)([\s\.\,\!\?\<])
- There are 3 marking groups defined by 3 pairs of ( ... ).
- The first marking group matches any whitespace character. If there is no whitespace character before the word, as for Abaxial at the beginning of the string, the word is ignored. Putting \s into a character class, i.e. within [ ... ], is not really needed here because \s is itself a character class. ([\s]) and (\s) are equivalent.
- The second marking group matches just the word from the array.
- The third marking group matches
  - either any whitespace character,
  - or a period,
  - or a comma,
  - or an exclamation mark,
  - or a question mark, i.e. the standard punctuation marks,
  - or a left angle bracket (from an HTML or XML tag).
  A semicolon or colon is not matched, and no other non-word character produces a positive match either. If none of those characters follows the word, as for abaxial at the end of the string, the search is negative.
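To see this effect in isolation, the pattern built for abaxial can be run against the sample string. This is only a quick check using the pattern as the question's function builds it; the variable names are made up for the example.

```php
<?php
$text = 'Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos. abaxial';

// Pattern as built by the question's function for the term "abaxial".
$pattern = '|([\s])(abaxial)([\s\.\,\!\?\<])|ix';

$count = preg_match_all($pattern, $text, $matches);

// Only an occurrence with whitespace before it and one of the allowed
// characters after it can match.
echo $count, PHP_EOL;
echo $matches[2][0], PHP_EOL;
```

Only the occurrence in the middle of the string is found; the ones at the start and at the end are skipped.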
By the way: ([\s.,!?<]) is equivalent to ([\s\.\,\!\?\<]), as only \ and ] (always) and ^ and - (depending on position) must be escaped with a backslash within a character class definition to be interpreted as literal characters. Well, [ should also be escaped with a backslash within [ ... ] for easier reading.
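The equivalence of the two character classes can be checked with a few sample characters. This is a small illustrative sketch; the variable names and the choice of samples are assumptions made for the example.

```php
<?php
$escaped   = '#[\s\.\,\!\?\<]#';
$unescaped = '#[\s.,!?<]#';

$samples = array(' ', '.', ',', '!', '?', '<', ';', ':', 'a');
foreach ($samples as $char) {
    // Print the character followed by the result of each pattern;
    // the two result columns should agree on every line.
    printf("%s %d %d\n", var_export($char, true),
        preg_match($escaped, $char), preg_match($unescaped, $char));
}
```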
So it is clear why Abaxial at the beginning of the string and abaxial at the end of the string are not matched.
But why is Acaule not matched?
Well, to the left of this word is acaule, which was matched with a space on its left and a space on its right, as required for a positive match. So the space to the right of acaule was already consumed by that positive match. Therefore, there is no whitespace character left in front of Acaule for the next match.
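The difference between a consumed delimiter and a zero-width word boundary can be shown on a shortened version of the sample string. This is a sketch for illustration only; the shortened string and the variable names are assumptions.

```php
<?php
$text = 'Abaxial Xxxxx acaule Acaule xxxxxx';

// Original style: the surrounding characters are part of the match,
// so the space after "acaule" is consumed and cannot start the next match.
$consumingCount = preg_match_all('#(\s)(acaule)([\s.,!?<])#i', $text, $consuming);

// \b is zero-width: nothing is consumed, so adjacent terms both match.
$boundaryCount = preg_match_all('#\b(acaule)\b#i', $text, $boundary);

echo $consumingCount, PHP_EOL;
echo $boundaryCount, PHP_EOL;
```

The first pattern finds one occurrence and the second finds two.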
There is \b, which means word boundary and does not match any character. It could be used together with \W*? instead of ([\s]) and ([\s\.\,\!\?\<]) to avoid matching substrings within a word.
Something like this would be possible:
$key = '(\W*?)(\b' . $key . '\b)(\W*?)';
\W*? means any non-word character 0 or more times, non-greedy.
\W? means any non-word character 0 or 1 times and could also be used in the first and third capturing groups if that is better for the replacement.
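As a rough check of the \W*? variant, the pattern can be combined with the question's original three-group replacement string. This is a sketch under assumptions: the two terms are hard-coded, the # delimiter is used, and the loop stands in for the question's function. Note that the trailing (\W*?) is lazy with nothing after it, so it always matches the empty string.

```php
<?php
$text = 'Abaxial Xxxxx acaule Acaule xxxxxx xxx; xxxxx xxx Abaxial esporos. abaxial';
$replacement = '\1<a href="#" class="termo">\2</a>\3';

$patterns = array();
foreach (array('abaxial', 'acaule') as $term) {
    // Groups 1 and 3 are put back by the replacement, so no text is lost.
    $patterns[] = '#(\W*?)(\b' . $term . '\b)(\W*?)#i';
}

$result = preg_replace($patterns, $replacement, $text);
echo $result, PHP_EOL;
```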
But which search string is right depends on what you want as the result of the replacement.
I don't have a PHP interpreter installed, so I can't try out what your PHP code does on replace, or see what you would like to get after the replacement on the provided example string.