Need a regular expression string to preg_split records into an array

Time: 2021-08-30 21:40:22
$source="<p><b>Lal, Vaninm</b></p>
<p><b>Vice President &amp;</b></p>
<p><b>General Manager</b></p>
<p>Company 1 Inc.</p>
<p>PO Box 123456</p>
<p>salt Lake1, 00111-3333</p>
<p>111-111-111 / F: 111-111-111</p>
<p>info1@site1.com</p>
<p><b>Andrus, Reed </b></p>
<p><b>Manager</b></p>
<p>Company 2 Inc.</p>
<p>Monada, Suite 222</p>
<p>J , Lousiana 2222</p>
<p>222-222-222 / F: 222-222-222</p>
<p>info2@site2.com</p>
<p><b>Sharma, John L.</b></p>
<p><b>Senior Property Manager</b></p>
<p>Company 3  Ltd.</p>
<p>PO Box 3333</p>
<p>Grand Cinema, Layman Islands</p>
<p>FGB 333</p>
<p>333-333-333</p>
<p>info3@site3.com</p>
<p><b>Lucky, Philip S</b></p>
<p>Life Member</p>
<p>Company 4 Inc.</p>
<p>Battelsville, Oklahoma 74000</p>
<p>444-444-444</p>
<p><b>Berry, Richard B, RPA, CPM</b></p>";
$records = preg_split ("@\<p\>\<b\>(.*?)(\<p\>(.*)\</p\>\<p\>\<b\>)@s", $source); 
var_dump($records);

The array must contain four records. The data inside the tags is meaningless. I am new to regular expressions and tried the code above. Please suggest a regular expression for this. Thanks in advance.

I think <p><b> ....<p>...</p><p><b> identifies a record, but I can't work out the required expression.

1 solution

#1


With all the usual disclaimers about parsing HTML with regex, the following regex will correctly split your input.

Version 1: file with only newlines (Unix, OS X)

(?=(?<=^|((?<!</b>)</p>\n))<p><b>)

Version 2: file with carriage returns and newlines (Windows)

(?=(?<=^|((?<!</b>)</p>\r\n))<p><b>)

Therefore, if you were using the first, you could write:

$records = preg_split('~(?=(?<=^|((?<!</b>)</p>\n))<p><b>)~', $source);

Note that there are actually five records because of the last line:

<p><b>Berry, Richard B, RPA, CPM</b></p>";
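
If you want to try this quickly, here is a minimal sketch on a shortened sample (three records rather than the full input). The `split_records` helper and the `$sample` data are my own illustration, not part of the question. It assumes you are happy to normalise Windows line endings to `\n` first, so that Version 1 covers both cases. It also passes `PREG_SPLIT_NO_EMPTY`, because the pattern matches at the very start of the string as well, and I would expect that to leave an empty first element otherwise.

```php
<?php
// Hypothetical helper (not from the original post): split the markup into records.
function split_records(string $html): array
{
    // Assumption: converting "\r\n" to "\n" lets the Version 1 pattern handle both file types.
    $html = str_replace("\r\n", "\n", $html);

    // Zero-width split point: a <p><b> at the start of the string,
    // or one that follows a paragraph which does not end in </b>.
    $pattern = '~(?=(?<=^|((?<!</b>)</p>\n))<p><b>)~';

    // PREG_SPLIT_NO_EMPTY discards empty pieces; fall back to [] if preg_split fails.
    return preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Shortened sample in the same shape as the question's $source.
$sample = "<p><b>Lal, Vaninm</b></p>\n"
        . "<p><b>Vice President &amp;</b></p>\n"
        . "<p>Company 1 Inc.</p>\n"
        . "<p><b>Andrus, Reed </b></p>\n"
        . "<p><b>Manager</b></p>\n"
        . "<p>Company 2 Inc.</p>\n"
        . "<p><b>Berry, Richard B, RPA, CPM</b></p>";

$records = split_records($sample);
echo count($records), "\n"; // prints 3
```

The consecutive bold lines (name, then job title) stay in the same record because a `<p><b>` that follows `</b></p>` is rejected by the negative lookbehind.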

How does it work?

With a lookahead and a lookbehind. Together they form a "zero-width" match: it consumes no characters and just identifies a certain position.

  • The (?= lookahead asserts that the current position is followed by <p><b>...
  • ...as long as that <p><b> is preceded (lookbehind (?<= ) by either the beginning of the string ^ or a </p>\n that is itself not preceded by </b> (negative lookbehind (?<!</b>))
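
To see the "zero-width" part in action, this small sketch lists where the pattern matches in a tiny made-up string. The `$text` value is my own example, not the question's data. Each match is the empty string, so only the offset tells you anything, and that offset is where `preg_split` cuts.

```php
<?php
// Made-up two-record input: a bold line, a plain line, then another bold line.
$text = "<p><b>A</b></p>\n<p>x</p>\n<p><b>B</b></p>";
$pattern = '~(?=(?<=^|((?<!</b>)</p>\n))<p><b>)~';

// PREG_OFFSET_CAPTURE returns each match as [matched string, byte offset].
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$matched, $offset]) {
    // $matched is empty because lookarounds consume no characters.
    printf("offset %d, length %d, followed by %s\n",
        $offset, strlen($matched), substr($text, $offset, 6));
}
```

Both reported positions are immediately followed by `<p><b>`, and the `<p>x</p>` line in between produces no match.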

Enjoy!
