How do I parse HTML tags with regular expressions?

Date: 2021-11-25 21:46:41

I want to use regular expressions to parse the contents of the following HTML tag, which I retrieved through curl.

<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>

so that the output is "IND - 203/9 (49.4 Ovs)".

I have written the following code, but it is not working. Please help.

$one="<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>";
$five="~(?<=<span class='ui-allscores'>)[.]*(?=</br></span>)~";
preg_match_all($five,$one,$ui);
print_r($ui);

3 answers

#1


5  

Try this one:

$string = "<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>";

Dynamic span tag:

preg_match('/<span[^>]*>(.*?)<\/span>/si', $string, $matches);

Specific span tag:

preg_match("/<span class='ui-allscores'>(.*?)<\/span>/si", $string, $matches);

// Output
array (size=2)
  0 => string '<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>' (length=56)
  1 => string 'IND - 203/9 (49.4 Ovs)' (length=22)
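
As a rough, self-contained sketch of the approach above: preg_match returns only the first match, so preg_match_all may be more useful when the page contains several score spans. The second span and its value are invented purely for illustration; they are not part of the question.

```php
<?php
// Sample input. The second span is an invented value, added only to show multiple matches.
$html = "<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>"
      . "<span class='ui-allscores'>AUS - 150/3 (30.0 Ovs)</span>";

$pattern = "/<span class='ui-allscores'>(.*?)<\/span>/si";

// preg_match stops at the first match; group 1 holds the text between the tags.
$score = null;
if (preg_match($pattern, $html, $matches)) {
    $score = $matches[1];
}

// preg_match_all collects every match; $all[1] lists the captured texts.
preg_match_all($pattern, $html, $all);
$scores = $all[1];

echo $score, PHP_EOL;
print_r($scores);
```

Checking the return value of preg_match before reading $matches[1] avoids an undefined index when the span is missing from the page.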

#2


1  

If you simply want to remove the HTML tags, use PHP's built-in function strip_tags.

See also another answer on removing HTML tags: "Strip all HTML tags, except allowed".
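
A minimal sketch of strip_tags on the string from the question. The second call, with an invented <b> tag, only illustrates the optional allowed-tags argument.

```php
<?php
$html = "<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>";

// strip_tags removes all tags and keeps the text between them.
$text = strip_tags($html);

// The optional second argument lists tags to keep: <b> survives, <span> does not.
// The <b> markup here is an invented example, not from the question.
$kept = strip_tags("<span><b>IND</b> - 203/9</span>", '<b>');

echo $text, PHP_EOL;
echo $kept, PHP_EOL;
```

Note that strip_tags works on the whole string, so on a full page it returns all of the page's text, not just the score.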

#3


1  

The problem with your regex is the [.] part. It matches only a literal ".", because the dot is written inside a character class, where it loses its "any character" meaning. So just remove the square brackets.

$five="~(?<=<span class='ui-allscores'>).*(?=</br></span>)~";
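
The difference between [.] and a bare dot can be checked in isolation. This is only a small sketch with made-up one-character subjects:

```php
<?php
// Inside a character class the dot is an ordinary character.
$literal = preg_match('~^[.]$~', 'a'); // [.] does not match the letter "a"
$dotOnly = preg_match('~^[.]$~', '.'); // [.] matches a literal dot
// Outside a character class the dot matches any character except a newline.
$anyChar = preg_match('~^.$~', 'a');

var_dump($literal, $dotOnly, $anyChar);
```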

The next problem is the greediness of *: it matches as much as possible, so with several spans it runs on to the last closing tag. You can make it lazy by putting a ? after it.

$five="~(?<=<span class='ui-allscores'>).*?(?=</br></span>)~";
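
To see the effect of greediness, here is a sketch that runs both variants against two spans. Two assumptions: the second span is invented for illustration, and the </br> is dropped from the lookahead because the sample string in the question does not contain one, so a pattern that requires it would not match.

```php
<?php
// Two spans in one string. The second value is invented for illustration.
$html = "<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>"
      . "<span class='ui-allscores'>AUS - 150/3 (30.0 Ovs)</span>";

// Assumption: </br> is omitted from the lookahead, since the sample string has none.
$greedy = "~(?<=<span class='ui-allscores'>).*(?=</span>)~";
$lazy   = "~(?<=<span class='ui-allscores'>).*?(?=</span>)~";

preg_match($greedy, $html, $g);
preg_match($lazy, $html, $l);

echo $g[0], PHP_EOL; // greedy: runs on to the last </span>
echo $l[0], PHP_EOL; // lazy: stops at the first </span>
```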

But the overall point is: you should most probably use an HTML parser for this job!

See "How do you parse and process HTML/XML in PHP?"
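
One possible parser-based version, sketched with PHP's bundled DOMDocument and DOMXPath classes. It assumes the dom extension is enabled, and the XPath query is just one way to select the span.

```php
<?php
$html = "<span class='ui-allscores'>IND - 203/9 (49.4 Ovs)</span>";

// Collect libxml warnings internally instead of printing them;
// real-world markup often triggers some.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every span whose class attribute is exactly "ui-allscores".
$xpath = new DOMXPath($doc);
$scores = [];
foreach ($xpath->query("//span[@class='ui-allscores']") as $node) {
    $scores[] = trim($node->textContent);
}

print_r($scores);
```

Unlike the regex versions, this does not depend on the quoting style or attribute order in the tag, although the exact-match class test would need adjusting if the span carried more than one class.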
