Using regular expressions to extract the first image source from HTML code?

Time: 2021-10-28 08:59:39

I would like to know how this can be achieved.

Assume: There is a lot of HTML code containing tables, divs, images, etc.

Problem: How can I get matches for all occurrences? More specifically, how can I get the img tag source (src = ?)?

Example:

<img src="http://example.com/g.jpg" alt="" />

How can I print out http://example.com/g.jpg in this case? Assume that there are also other tags in the HTML code, as I mentioned, and possibly more than one image. Would it be possible to get an array of all image sources in the HTML code?

I know this can be achieved one way or another with regular expressions, but I can't get the hang of it.

Any help is greatly appreciated.

10 Solutions

#1


39  

While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of your document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).

What I recommend is that you use a DOM parser such as SimpleHTML, like this:

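function get_first_image($html) {
    require_once('SimpleHTML.class.php');

    // Parse the HTML string into a DOM object
    $post_html = str_get_html($html);

    // Index 0 asks for the first <img> element only
    $first_img = $post_html->find('img', 0);

    if($first_img !== null) {
        return $first_img->src;
    }

    // No image found
    return null;
}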

Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.

A regular expression could be devised to achieve the same goal, but it would be limited: it would force the alt attribute to come either after or before the src attribute, and overcoming this limitation would add more complexity to the regular expression.

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

  • The attribute or tag name is in capital letters and the i modifier is not used.
  • Quotes are not used around the src attribute value.
  • An attribute other than src uses the > character somewhere in its value.
  • Some other reason I have not foreseen.
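
The failure cases listed above can be reproduced with a short script. This is only a sketch: the helper first_src_by_regex and its simplified pattern are hypothetical and are not the exact expression shown earlier. Each case is tried without and with the i modifier.

```php
<?php
// Hypothetical helper for illustration: returns the first src found by a
// simplified quoted-attribute pattern, or null when nothing matches.
function first_src_by_regex(string $html, string $flags = ''): ?string
{
    $pattern = '/<img\s[^>]*?src\s*=\s*(["\'])(.*?)\1/' . $flags;
    return preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}

$cases = [
    'lowercase, quoted' => '<img src="a.jpg" alt="">',
    'uppercase tag'     => '<IMG SRC="b.jpg">',
    'unquoted src'      => '<img src=c.jpg>',
    '> inside alt'      => '<img alt="1 > 0" src="d.jpg">',
];

foreach ($cases as $label => $html) {
    $without_i = first_src_by_regex($html);      // case-sensitive
    $with_i    = first_src_by_regex($html, 'i'); // case-insensitive
    echo $label, ': ', var_export($without_i, true),
         ' / ', var_export($with_i, true), "\n";
}
```

Only the first case matches without the i modifier; the uppercase tag needs it, and the unquoted and "> inside alt" cases fail either way with this pattern.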

So again, simply don't use regular expressions to parse a DOM document.


EDIT: If you want all the images:

function get_images($html){
    require_once('SimpleHTML.class.php');

    $post_dom = str_get_html($html);

    // Without an index, find() returns every <img> element
    $img_tags = $post_dom->find('img');

    $images = array();

    foreach($img_tags as $image) {
        $images[] = $image->src;
    }

    return $images;
}

#2


12  

Use this; it is more effective:

// Inside a character class | is a literal pipe, so only the two quote characters are listed
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
}

Example:

$html = '
<ul>     
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value1.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value2.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value3.jpg" />
</ul>
<img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="res/upload.jpg" />
  <li><a target="_new" href="http://www.manfromuranus.com">Man from Uranus</a></li>       
  <li><a target="_new" href="http://www.thevichygovernment.com/">The Vichy Government</a></li>      
  <li><a target="_new" href="http://www.cambridgepoetry.org/">Cambridge Poetry</a></li>      
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value4.jpg" />
  <li><a href="http://www.verot.net/pretty/">Electronaut Records</a></li>      
  <img src="value5.jpg" />
  <li><a target="_new" href="http://www.catseye-crew.com">Catseye Productions</a></li>     
  <img width="190" height="197" border="0" align="right" alt="upload.jpg" title="upload.jpg" class="noborder" src="value6.jpg" />
';   
preg_match_all('/<img [^>]*src=["\']([^"\']+)/i', $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $value."<br>";
}

Output:

value1.jpg
value2.jpg
value3.jpg
res/upload.jpg
value4.jpg
value5.jpg
value6.jpg

#3


7  

This works for me:

preg_match('@<img.+src="(.*)".*>@Uims', $html, $matches);
$src = $matches[1];
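
If the HTML contains no matching img tag, $matches[1] is never set, so reading it raises a notice or warning. A minimal sketch that guards against this, reusing the same pattern; the function name first_image_src is hypothetical and only for illustration:

```php
<?php
// Hypothetical wrapper: returns the first src value, or null if no
// <img ... src="..."> is found by the pattern.
function first_image_src(string $html): ?string
{
    // U makes the quantifiers lazy, i ignores case, s lets . match newlines
    if (preg_match('@<img.+src="(.*)".*>@Uims', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$src = first_image_src('<p><img src="http://example.com/g.jpg" alt="" /></p>');
echo $src === null ? 'no image found' : $src, "\n";
```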

#4


5  

I assume all your src= values have double quotes (") around the URL:

<img[^>]+src=\"([^\"]+)\"

The other answers posted here make other assumptions about your code.

#5


2  

I agree with Andrew Moore. Using the DOM is much, much better. The HTML DOM images collection will return to you a reference to all image objects.

Let's say in your header you have,

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
</script>

and then in your body you have,

<script type="text/javascript">
  alert(getFirstImageSource());
</script>

This will return the first image source. You can also loop through all of them along these lines (in the head section):

    function getAllImageSources()
    {
        var returnString = "";
        for (var i = 0; i < document.images.length; i++)
        {
            returnString += document.images[i].src + "\n";
        }
        return returnString;
    }

(in body)

<script type="text/javascript">
  alert(getAllImageSources());
</script>

If you're using JavaScript to do this, remember that you can't call a function that reads the images collection directly from a script in your header. In other words, you can't do something like this,

<script type="text/javascript">
    function getFirstImageSource()
    {
        var img = document.images[0].src;
        return img;
    }
    var src = getFirstImageSource();  // bad: called before the body has been parsed

</script>

because this won't work. The img elements haven't been parsed when the header script executes, so document.images is still empty and reading document.images[0].src throws an error. Call the function from a script in the body, after the images, or from a window.onload handler instead.

Hopefully this can help in some way. If possible, I'd make use of the DOM. You'll find that a good deal of your work is already done for you.

#6


2  

I don't know if you MUST use regex to get your results. If not, you could try out SimpleXML and XPath, which would be much more reliable for your goal:

First, import the HTML into a DOMDocument object. If you get warnings about malformed markup, suppress them for this part (for example with libxml_use_internal_errors(true)) and be sure to restore the previous setting afterward:

 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");

Next, import the DOM into a SimpleXML object, like so:

 $xml = simplexml_import_dom($dom);

Now you can use a few methods to get all of your image elements (and their attributes) into an array. XPath is the one I prefer, because I've had better luck traversing the DOM with it:

 $images = $xml -> xpath('//img/@src');

This variable can now be treated like an array of your image URLs:

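 foreach($images as $image) {
    // Single-quoted strings do not interpolate variables, so concatenate instead;
    // $image is a SimpleXMLElement and is cast to a string to get the URL
    echo '<img src="' . htmlspecialchars((string) $image) . '" /><br />' . "\n";
  }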

Presto, all of your images, none of the fat.

Here's the non-annotated version of the above:


 $dom = new DOMDocument();
 $dom -> loadHTMLFile("filename.html");

 $xml = simplexml_import_dom($dom);

 $images = $xml -> xpath('//img/@src');

 foreach($images as $image) {
    echo '<img src="' . htmlspecialchars((string) $image) . '" /><br />' . "\n";
  }

#7


2  

I really think you cannot anticipate all the cases with one regular expression.

The best way is to use the DOM with the PHP5 class DOMDocument and xpath. It's the cleanest way to do what you want.

$dom = new DOMDocument();
$dom->loadHTML( $htmlContent );
$xml = simplexml_import_dom($dom);
$images = $xml -> xpath('//img/@src');
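
For completeness, here is a minimal self-contained sketch of the same idea using DOMXPath directly instead of SimpleXML. The function name get_image_sources and the sample markup are assumptions for illustration; libxml_use_internal_errors() is used only to keep malformed real-world HTML from emitting warnings.

```php
<?php
// Hypothetical helper: returns every img src in the markup as a plain string.
function get_image_sources(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $sources = [];
    foreach ($xpath->query('//img/@src') as $attribute) {
        $sources[] = $attribute->nodeValue;
    }
    return $sources;
}

$sources = get_image_sources('<div><img src="a.jpg" alt=""><p>x</p><img src="b.png"></div>');
$first = $sources[0] ?? null; // first image source, or null if there is none
print_r($sources);
```

Returning plain strings rather than node objects keeps the result easy to pass to other code; the first image is then just the first array element.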

#8


1  

You can try this:

// [^\"]+ stops at the closing quote; $matches[1] holds the captured src values
preg_match_all("/<img\s+src=\"([^\"]+)\"/i", $html, $matches);
foreach ($matches[1] as $key=>$value) {
    echo $key . ", " . $value . "<br>";
}

#9


1  

Since you're not worrying about validating the HTML, you might try using strip_tags() on the text first to clear out most of the cruft. Pass '<img>' as the allowed tag, otherwise the img tags are stripped too.

Then you can search for an expression like

"/\<img .+ \/\>/i"

The backslash before / is needed because / is the pattern delimiter; < and > are not special characters, so escaping them is harmless but unnecessary. .+ requires one or more characters inside the img tag. You can capture part of the expression by putting parentheses around it, e.g. (.+) captures the middle part of the img tag.

When you decide what part of the middle you wish specifically to capture, you can modify the (.+) to something more specific.
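
As a rough sketch of the two steps described above, assuming double-quoted src values: the function name image_sources_after_strip and the narrowed capture group are hypothetical choices for illustration, not part of the original suggestion.

```php
<?php
// Hypothetical helper: strip everything except <img> tags, then capture
// the double-quoted src value of each remaining tag.
function image_sources_after_strip(string $html): array
{
    // The second argument lists the tags that strip_tags() should keep
    $only_imgs = strip_tags($html, '<img>');

    // (.+) narrowed to the src value: any run of non-quote characters
    preg_match_all('/<img\s[^>]*src="([^"]+)"/i', $only_imgs, $matches);
    return $matches[1];
}

print_r(image_sources_after_strip(
    '<div><p>Intro</p><img src="a.jpg" alt="" /><span>text</span></div>'
));
```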

#10


0  

<?php    
/* PHP Simple HTML DOM Parser @ http://simplehtmldom.sourceforge.net */

require_once('simple_html_dom.php');

$html = file_get_html('http://example.com');
$image = $html->find('img')[0]->src;

echo "<img src='{$image}'/>"; // BOOM!

PHP Simple HTML DOM Parser will do the job in a few lines of code.
