My text is as follows:
<font size=+2 color=#F07500><b> [ba]</font></b>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>
<ul><li><font color =#F07500> Word word word.<br></font></li></ul>
<ul><li><font color =#0B610B> Word word word wordWord.<br></font></li></ul>
<ul><li><font color =#0B610B> WordWord.<br></font></li></ul>
<br><font color =#E41B17><b>UPPERCASE LETTERS</b></font>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font><br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>
<font color =#0B610B> Word word word.<br></font></li></ul><dd><font color =#F07500> »» Word wordWord word.<br></font>
In each of the <font color =#0B610B>...</font> elements, a lowercase letter is immediately followed by an uppercase letter. For example:
<font color =#0B610B> Word word wordWord word.<br></font>
I want to correct this error by splitting them as follows (i.e., adding a colon and a space between them):
<font color =#0B610B> Word word word: Word word.<br></font>
So far, I have been using:
(<font color =#0B610B\b[^>]*>)(.*?</font>)
to select each instance of <font color =#0B610B>...</font>, and it works fine, finding the instances one at a time.
But when I use:
(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)
it does find a match, but it selects everything from <font color =#0B610B> to a later </font> on the same line, regardless of the other font-color tags in between, and so it also replaces instances I do not want changed.
I want it to find and replace the error only inside each pair of these specific tags, <font color =#0B610B>...</font>, rather than grabbing everything that starts with <font color =#0B610B> and ends with some later </font>.
Are there any regular expressions to solve this problem? Many thanks in advance.
1 solution
#1
1
In general, regex is not a good idea for parsing HTML (if it's a once-off you might be OK).
I think this might be the reason your regex is not working. Can you give an example of a case in which your regex fails?
One case I can think of is when there is no [a-z][A-Z] match within a <font color=#0B610B></font> pair, but there is one in a neighbouring <font></font> pair. For example:
<font color=#0B610B>word word</font><font color=#000000>word wordWord</font>
In this case, the only possible match is <font color=#0B610B>word word</font><font color=#000000>word word followed by the rest of the string, Word</font>, and so this is what the regex matches (since if it can match, it will!). A lazy .*? only prefers the shortest match; it still stretches across tags when that is the only way for the whole pattern to succeed.
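To make the failure concrete, here is a minimal Python sketch that runs the pattern from the question against a two-element input. The input string is a made-up test case (it uses the question's `color =` spacing so that the pattern can match at all), and the variable names are only illustrative.

```python
import re

# The pattern from the question, unchanged.
naive = re.compile(r'(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)')

# Hypothetical input: a clean green element followed by an orange element
# that contains the lowercase/uppercase error.
html = ('<font color =#0B610B>word word</font>'
        '<font color =#F07500>word wordWord</font>')

m = naive.search(html)

# The lazy .*? in group 2 has to run past the first </font> to find any
# [a-z][A-Z] pair, so the match swallows both elements.
print(m.group(2))
print(m.group(3))
```

Group 2 ends up containing the first closing tag and the second opening tag, which is why a replacement built from these groups edits the wrong element.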
I can think of a crude workaround, but I wouldn't recommend it unless this task is a once-off, because using regex for HTML is always prone to such errors! This regex is also pretty inefficient. Try (untested):
(<font color =#0B610B\b[^>]*>)(([^<]|<(?!/font))*?[a-z])([A-Z].*?</font>)
It says: "look for the <font color =#0B610B> opening tag, followed by any run of characters, each of which is either not an angle bracket or is a < that is not followed by /font, and then by the [a-z][A-Z] pair". So it tries to make sure that the match doesn't go over a </font> boundary. Note that the extra inner parentheses shift the group numbering: the part starting at the capital letter is now group 4, not group 3.
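As a rough check of this workaround, the sketch below applies it with Python's `re.sub`. It differs from the regex above in one labeled way: the inner group is made non-capturing with `(?:...)`, so the tail stays in group 3. The function name `fix` and the test strings are assumptions for illustration, not part of the original question.

```python
import re

# Same idea as the workaround above; the inner alternation is non-capturing
# (an assumption made here to keep the groups numbered 1 to 3).
tempered = re.compile(
    r'(<font color =#0B610B\b[^>]*>)'   # group 1: the green opening tag
    r'((?:[^<]|<(?!/font))*?[a-z])'     # group 2: text up to the lowercase letter, never passing </font
    r'([A-Z].*?</font>)'                # group 3: from the capital letter to the closing tag
)

def fix(html: str) -> str:
    """Insert ': ' between the lowercase and uppercase letters in green elements."""
    return tempered.sub(r'\1\2: \3', html)

print(fix('<font color =#0B610B> Word word wordWord word.<br></font>'))

# The neighbouring-element case is now left alone.
crossing = ('<font color =#0B610B>word word</font>'
            '<font color =#F07500>word wordWord</font>')
print(fix(crossing) == crossing)
```

Each match consumes the element through its closing tag, so a single pass fixes at most one lowercase/uppercase pair per green element; if an element can contain several, the substitution would need to be repeated until the text stops changing.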