How to match a string using a regular expression

Time: 2021-03-15 21:39:44

I have a string which contains multiple occurrences of the "<p class=a> ... </p>" where ... is different text.

I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but it is not working. What would be the correct regex for this?

P.S. The same regex pattern works on iOS using NSRegularExpression but does not work on Android using Pattern.

To explain my problem further, I am doing the following:

Pattern regex3 = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = regex3.split(str);

The result array contains only 1 item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I have found the problem. It was actually the encoding of the file I was reading: the file was UTF-16 (Little Endian) encoded, and that was what caused the regex not to work. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
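
To illustrate why the encoding matters, here is a minimal sketch, assuming the bytes are decoded by hand with an explicit charset (the class name EncodingDemo and the helper matches are hypothetical, not from the original code). Decoding UTF-16LE bytes as UTF-8 leaves a NUL character after every ASCII character, so the literal text of the pattern never appears in the string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EncodingDemo {
    // Same idea as the pattern in the question, with both flags passed explicitly.
    static final Pattern P_TAG = Pattern.compile(
            "<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: decode the raw bytes with the given charset,
    // then report whether the pattern finds at least one paragraph.
    static boolean matches(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        return P_TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        // Simulate a UTF-16 (Little Endian) file without a byte order mark.
        byte[] raw = "<P Class=ENCC>Hai kawan-kawan.</P>"
                .getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: NUL characters are interleaved, so nothing matches.
        System.out.println(matches(raw, StandardCharsets.UTF_8));    // false
        // Right charset: the text is decoded correctly and the pattern matches.
        System.out.println(matches(raw, StandardCharsets.UTF_16LE)); // true
    }
}
```

Instead of converting the file to UTF-8, it should also be possible to keep the file as it is and pass the matching charset when reading it.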

4 solutions

#1


1  

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.

Also, . does not match newlines by default, and it's quite likely that your paragraphs do contain newlines.

Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
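
As a rough sketch of how that pattern could be used in Java: Pattern.split returns the text between matches, not the matches themselves, so to collect the paragraph contents a Matcher.find() loop over group(1) is probably closer to what is wanted. The class name ParagraphExtractor and the method extract are made up for illustration, and the class value encc is taken from the sample file in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphExtractor {
    // (?s): dot also matches newlines; (?i): case-insensitive matching.
    private static final Pattern P_TAG =
            Pattern.compile("(?si)<p class=encc>(.*?)</p>");

    // Hypothetical helper: returns the inner text of every matching <p> element.
    static List<String> extract(String input) {
        List<String> chunks = new ArrayList<>();
        Matcher m = P_TAG.matcher(input);
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 is the lazily matched content
        }
        return chunks;
    }

    public static void main(String[] args) {
        String sample =
                "<SYNC Start=2560>\n"
              + "  <P Class=ENCC><i>Kami Tidak Berniat</i><br/>\n"
              + "<i>Melukakan Hati Sesiapa.</i></P>\n"
              + "</SYNC>\n"
              + "<SYNC Start=66160>\n"
              + "  <P Class=ENCC>Hai kawan-kawan.</P>\n"
              + "</SYNC>";
        for (String chunk : extract(sample)) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```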

#2


2  

Parsing HTML with regular expressions is not really a good idea (reason here). What you should use is an HTML parser such as this.

That being said, your issue is most likely that the * operator is greedy. In your question you only say that it is not working, so I assume the pattern is matching from the first <p class=a> to the very last </p>. Making the regular expression non-greedy, like so: <p class=a>(.*?)</p> (notice the extra ? that makes the * operator non-greedy), should solve the problem (assuming that your problem is the one I have described).

That being said, I would really recommend you ditch the regular expression approach and use appropriate HTML Parsers.

#3


0  

The .* may also match <. You can try:

<p class=a>([^<]*)</p>
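
One caveat worth checking before relying on this pattern: [^<]* cannot cross any nested tag, so paragraphs like the ones in the question's sample (which contain <i> and <br/>) would not match at all. The sketch below compares it with the lazy version; the class name NegatedClassDemo and the helper firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String negated = "<p class=a>([^<]*)</p>";
        String lazy = "<p class=a>(.*?)</p>";

        String plain = "<p class=a>plain text</p>";
        String nested = "<p class=a><i>italic</i></p>";

        System.out.println(firstMatch(negated, plain));  // plain text
        System.out.println(firstMatch(negated, nested)); // null: [^<]* stops at <i>
        System.out.println(firstMatch(lazy, nested));    // <i>italic</i>
    }
}
```

The negated class does have one advantage: it matches newlines without needing the (?s) flag.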

#4


0  

I guess the problem is that your pattern is greedy. You should use this instead.

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

Your pattern ("<p class=a>(.*)</p>") will match this:

"<p class=a>first</p><p class=a>second</p>"

whereas "<p class=a>(.*?)</p>" stops at the first </p>, so its first match is only:

"<p class=a>first</p>"
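
The difference can be checked with a small sketch like the following, which collects every full match of each pattern (the class name GreedyVsLazy and the helper allMatches are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Hypothetical helper: returns every full match of the regex in the input.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the end and backtracks to the last </p>, one match.
        System.out.println(allMatches("<p class=a>(.*)</p>", input).size());  // 1
        // Lazy: .*? stops at the nearest </p>, so each paragraph matches separately.
        System.out.println(allMatches("<p class=a>(.*?)</p>", input).size()); // 2
    }
}
```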
