Find a lowercase letter immediately followed by an uppercase letter

My text is as below:

<font size=+2 color=#F07500><b> [ba]</font></b>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>
<ul><li><font color =#F07500> Word word word.<br></font></li></ul>
<ul><li><font color =#0B610B> Word word word wordWord.<br></font></li></ul>
<ul><li><font color =#0B610B> WordWord.<br></font></li></ul>
<br><font color =#E41B17><b>UPPERCASE LETTERS</b></font> 
<ul><li><font color =#0B610B> Word word wordWord word.<br></font><br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>
<font color =#0B610B> Word word word.<br></font></li></ul><dd><font color =#F07500>     »» Word wordWord word.<br></font>

There is a lowercase letter immediately followed by an uppercase letter inside some of the <font color =#0B610B>...</font> elements. For example:

<font color =#0B610B> Word word wordWord word.<br></font>

I want to correct this error by splitting them as follows (i.e., adding a colon and a space between them):

<font color =#0B610B> Word word word: Word word.<br></font>

So far, I have been using:

(<font color =#0B610B\b[^>]*>)(.*?</font>)

to select each instance of <font color =#0B610B>...</font>, and it works fine, finding the instances one at a time.

But when I use:

(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)

it does find the error, but it selects everything from <font color =#0B610B> up to a later </font> on the same line, regardless of the other font-color tags in between, and so it replaces unwanted instances.

I want it to find and replace the error only inside each specific pair of tags, <font color =#0B610B>...</font>, rather than grabbing everything that starts at <font color =#0B610B> and ends at some later </font>.

Are there any regular expressions to solve this problem? Many thanks in advance.

1 Answer

#1

In general, regex is not a good idea for parsing HTML (if it's a once-off you might be OK).

I think this might be the reason your regex is not working. Can you give an example of a case in which your regex fails?

One case I can think of is when there is no match for [a-z][A-Z] within a matching <font>...</font> pair, but there is one in a neighbouring element. For example:

<font color =#0B610B>word word</font><font color =#000000>word wordWord</font>

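In this case, the only valid match runs from the first opening font tag through "word word", straight across the </font><font ...> boundary to the second "word word", with "Word</font>" as the rest of the string. So this is what the regex matches (since if it can match, it will!).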
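
The failure described above can be reproduced with a short Python sketch. It assumes Python's re module behaves like the asker's editor for this pattern, and it writes the example with the same "color =" spacing that the asker's regex expects; the variable names are only illustrative.

```python
import re

# The asker's second pattern, copied unchanged.
pattern = re.compile(
    r'(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)'
)

# The green element is clean; the glued words sit in the neighbouring element.
html = ('<font color =#0B610B>word word</font>'
        '<font color =#000000>word wordWord</font>')

match = pattern.search(html)

# Group 2 had to run past the first </font> to find a lowercase letter
# followed by an uppercase one, so it spans two elements.
crosses_boundary = '</font>' in match.group(2)
print(crosses_boundary)

# The replacement therefore edits the element that should be left alone.
result = pattern.sub(r'\1\2: \3', html)
print(result)
```

The lazy .*? does not stop at the closing tag; it only prefers the shortest match, and the shortest match that exists here happens to cross into the next element.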

I can think of a crude workaround, but I wouldn't recommend it unless this task is a once-off, because using regex for HTML is always prone to such errors. This regex is also pretty inefficient. Try (untested):

(<font color =#0B610B\b[^>]*>)(([^<]|<(?!/font))*?[a-z])([A-Z].*?</font>)

It says, "look for the <font> tag, followed by any run of characters that are each either something other than <, or a < that is not followed by /font, and then followed by the [a-z][A-Z] pair". So it tries to make sure that the match doesn't go over a </font> boundary.
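
As a rough check of the pattern above, here is a minimal Python sketch. It assumes Python's re module and a replacement string of \1\2: \4, which is not given in the answer: the extra inner group shifts the numbering, so the tail of the element is group 4 rather than group 3. The function name split_glued_words is hypothetical.

```python
import re

# The pattern from the answer, unchanged. Group 3 exists only to repeat the
# alternation, so the replacement below uses groups 1, 2 and 4.
FIX = re.compile(
    r'(<font color =#0B610B\b[^>]*>)'
    r'(([^<]|<(?!/font))*?[a-z])'
    r'([A-Z].*?</font>)'
)

def split_glued_words(html):
    """Insert ': ' between a lowercase and an uppercase letter inside a green font element."""
    return FIX.sub(r'\1\2: \4', html)

# Clean green element followed by an orange element with glued words.
clean_then_bad = ('<font color =#0B610B> Word word word.<br></font>'
                  '<font color =#F07500> Word wordWord word.<br></font>')

# Green element with glued words, as in the question.
bad_green = '<font color =#0B610B> Word word wordWord word.<br></font>'

untouched = split_glued_words(clean_then_bad)
fixed = split_glued_words(bad_green)
print(untouched == clean_then_bad)
print(fixed)
```

One limitation of this sketch: each match consumes the rest of its element, so a single pass fixes only the first glued pair per element. Running the substitution again picks up the next pair, if there is one.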
