Regular expression to extract the src attribute from img tags

I am trying to write a pattern for extracting the path for files found in img tags in HTML.

String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";

My Pattern:

src\\s*=\\s*\"(.+)\"

The problem is that my pattern also captures the border="0" part of the img tag.

What pattern would match only the URI path for this file, without including border="0"?

6 solutions

#1

Your pattern should be (unescaped):

src\s*=\s*"(.+?)"

The important part is the added question mark, which makes the quantifier lazy: the group matches as few characters as possible and so stops at the first closing quote.
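
To see the difference between the greedy and the lazy quantifier, here is a minimal Java sketch. The class name, the helper firstGroup and the shortened sample path are made up for illustration; only the two patterns come from this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazySrcDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the question's tag.
        String html = "<img src=\"file:/C:/pics/step 1.JPG\" border=\"0\" />";

        // Greedy: .+ runs to the last double quote in the input.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+)\"", html));
        // prints: file:/C:/pics/step 1.JPG" border="0

        // Lazy: .+? stops at the first closing double quote.
        System.out.println(firstGroup("src\\s*=\\s*\"(.+?)\"", html));
        // prints: file:/C:/pics/step 1.JPG
    }
}
```

Note that the backslashes and quotes are doubled or escaped only because the pattern sits inside a Java string literal.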

#2

Try this expression:

src\s*=\s*"([^"]+)"
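
Because [^"]+ can never cross a double quote, the capture ends at the closing quote without relying on a lazy quantifier. A small Java sketch of this pattern, collecting every match in a string, might look like the following; the class name, the method allSrc and the sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]+)\"");

    // Collects every double-quoted src value found in the input.
    static List<String> allSrc(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SRC.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up input with two tags, one of them with spaces around '='.
        String html = "<img src=\"a.png\" border=\"0\" /><img src = \"b c.jpg\" />";
        System.out.println(allSrc(html)); // prints: [a.png, b c.jpg]
    }
}
```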

#3

This one grabs the src only if it is inside an img tag, not when it appears anywhere else as plain text. It also allows for other attributes before or after the src attribute.

It also accepts either single (') or double (") quotes around the value.

\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>

So for PHP you would do:

preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";

For JavaScript you would do:

var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/);
alert(match[1]);
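
Since the question itself is in Java, a rough Java port of the same pattern could look like this. It is only a sketch: the class name, the method imgSrc and the sample inputs are made up here, and the unnecessary backslashes from the original pattern are dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAnchoredDemo {
    // Same pattern as above, escaped for a Java string literal.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img.+src=(?:\"|')(.+?)(?:\"|')(?:.+?)>");

    // Returns the src value of the first matching img tag, or null.
    static String imgSrc(String input) {
        Matcher m = IMG_SRC.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches: src is inside an img tag and followed by another attribute.
        System.out.println(imgSrc("<img src=\"file:/C:/pics/step 1.JPG\" border=\"0\" />"));
        // No match: src appears only in plain text, outside any img tag.
        System.out.println(imgSrc("Set src=\"x.png\" on the image"));
        // No match either: the pattern wants at least one character
        // between the closing quote and '>'.
        System.out.println(imgSrc("<img src=\"a.png\">"));
    }
}
```

The last call shows a limitation worth knowing about: because the trailing (?:.+?) needs at least one character, a tag that ends directly after the closing quote is not matched.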

Hopefully that helps.

#4

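You want to change the greediness of the capture group. Something like

src\\s*=\\s*\"(.+?)\"

By default a quantifier matches as much as possible; putting ? directly after it (as in .+?) makes it match as little as possible. Note that the ? has to go inside the group, right after the +: writing (.+)? only makes the whole group optional and leaves it greedy.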

#5

I am trying to write a pattern for extracting the path for files found in img tags in HTML.

Can we have an autoresponder for "Don't use regex to parse [X]HTML"?

The problem is that my pattern also captures the border="0" part of the img tag.

Not to mention any time 'src="' appears in plain text!

If you know in advance the exact format of the HTML you're going to parse (e.g. because you generated it yourself), you can get away with it. Otherwise, regex is entirely the wrong tool for the job.

#6

I'd like to expand on this topic: the src attribute is often unquoted, so a regex that accepts both quoted and unquoted src values is:
src\s*=\s*"?(.+?)["\s]
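
One caveat: the pattern above ends the capture at the first whitespace, so a quoted path that contains spaces (like the one in the question) is cut short. A possible alternative, sketched here in Java under my own assumptions, handles each quoting style in its own branch. The class name, the method src and the sample inputs are hypothetical and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedOrUnquotedDemo {
    // Group 1: double-quoted value, group 2: single-quoted value,
    // group 3: unquoted value (runs until whitespace, a quote or '>').
    private static final Pattern SRC = Pattern.compile(
            "src\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns the src value from whichever branch matched, or null.
    static String src(String input) {
        Matcher m = SRC.matcher(input);
        if (!m.find()) {
            return null;
        }
        for (int g = 1; g <= 3; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(src("<img src=logo.png border=0>"));  // prints: logo.png
        System.out.println(src("<img src=\"my pic.jpg\" />"));   // prints: my pic.jpg
        System.out.println(src("<img src='a.gif'>"));            // prints: a.gif
    }
}
```

Like every pattern in this thread, this still matches src= wherever it appears in the text, so it is only safe on HTML whose format you already know.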
