How do I match a character that is not part of a pattern?

If I have a string like so:

&#263;; Joh&#263;; Smith <js@comms.com>; ;boom&#703;;woopwoop; ;

and I wish to match all the semicolons that are not part of an HTML entity, what regex technique can I use?

I got close a couple of times with a negative lookbehind, and my best attempt so far is the following:

(?<!&#.+?[^;]);

However this won't match all the semicolons required to take this victory home.

I'm using PHP.

I am considering replacing the HTML entities with a token first, then replacing the semicolons, and finally putting the entities back into the string.

This seems quite clunky and inelegant, so I'd rather do it with a regex, even if it gets a little unwieldy.

EDIT: @sln supplied a regex that will select nearly all entities, which, as he points out, should be the first step when trying to avoid something.

(?i)[%&](?:[a-z]+|(?:#(?:[0-9]+|x[0-9a-f]+)));

While the question is about how to select single characters except those found inside a given pattern, the context of the data I provided makes this a very useful regex to know and to attach to this question.

1 solution

#1

You may match and skip the entity and match the semi-colon in all other contexts:

$s = preg_replace('~&#\w+;(*SKIP)(*F)|;~', 'NEWTEXT', $s);

See the regex demo
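
As a quick check, here is a minimal sketch that runs the pattern above on the sample string from the question. The `|` replacement and the variable names are chosen only for this illustration, and the input assumes the entities appear in their literal `&#NNN;` form.

```php
<?php
// Sample input from the question, with the numeric entities written out.
$s = '&#263;; Joh&#263;; Smith <js@comms.com>; ;boom&#703;;woopwoop; ;';

// Entities are matched and then discarded by (*SKIP)(*F);
// every remaining ';' is replaced. '|' is an arbitrary replacement.
$result = preg_replace('~&#\w+;(*SKIP)(*F)|;~', '|', $s);

echo $result, PHP_EOL;
// &#263;| Joh&#263;| Smith <js@comms.com>| |boom&#703;|woopwoop| |
```

Each entity keeps its own terminating semicolon, while the semicolon that follows it is replaced.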

Details:

&#\w+; - a &#, followed by one or more word characters and a ;
(*SKIP)(*F) - two PCRE verbs that discard the text matched so far and fail the current match, so the engine resumes searching right after the skipped entity
| - or
; - a semicolon.
