I need a regex to get the src attribute of an img tag

I have a string that reads, literally:

"lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested for Influenza A/H1N1 virus, commonly called swine flu, in Trinidad and Tobago. \r\nThe tests have all come back negative, Health Minister Jerry Narace said yesterday. \r\n\r\n"

I would like to get the URL between the 'quot;' strings, i.e.,

http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_medium

using a regex in .NET.

Any ideas?

4 solutions

#1

^\"lt;img\s+src\=quot;(.+)quot;

Given the following input:

"lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested for Influenza A/H1N1 virus, commonly called swine flu, in Trinidad and Tobago. \r\nThe tests have all come back negative, Health Minister Jerry Narace said yesterday. \r\n\r\n"

this regex returns the following:

http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_medium

which I believe is exactly what you required.
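
A minimal sketch of how this pattern might be applied in C#, assuming the string really does begin with a literal double quote (as the leading `^\"` in the pattern implies). The variable names are illustrative only, and the sample text after the img tag is shortened. Note that the URL is in capture group 1; the whole match also includes the `"lt;img src=quot;` prefix.

```csharp
using System;
using System.Text.RegularExpressions;

// Input from the question, truncated after the first sentence fragment.
string input = "\"lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested";

// Same pattern as above, written as a verbatim string ("" is a literal quote).
Match m = Regex.Match(input, @"^""lt;img\s+src\=quot;(.+)quot;");

// Group 1 holds the text between the two quot; markers.
string url = m.Success ? m.Groups[1].Value : string.Empty;
Console.WriteLine(url);
```

Because `(.+)` is greedy, it runs to the last `quot;` in the string. That is fine for this input, which contains only two, but a lazy `(.+?)` would be the safer choice if the trailing text could contain another `quot;`.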

Hope this helps, Ryan

#2

Regex r = new Regex("(?<=img src=&quot;).*?(?=&quot;)");

Should do the trick for you, assuming there aren't any ampersands hiding out there somewhere.

EDIT: After posting this answer, I noticed that the ampersands I had seen earlier in your string were no longer present.
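
Since it is unclear whether the real data carries the ampersands, one possible variation (not part of the original answer) is to make the ampersand optional with `&?`, so the same lookaround pattern handles both forms. A sketch, with illustrative variable names and the question's img tag in both variants:

```csharp
using System;
using System.Text.RegularExpressions;

// "&?" makes the ampersand optional, so "&quot;" and the stripped "quot;" both delimit the URL.
var srcPattern = new Regex(@"(?<=img src=&?quot;).*?(?=&?quot;)");

// Entity-encoded form (hypothetical: the question's tag with ampersands restored).
string encoded = "&lt;img src=&quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpg&amp;size=summary_medium&quot;&gt;";
// Form shown in the question, without ampersands.
string stripped = "lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;";

string fromEncoded = srcPattern.Match(encoded).Value;
string fromStripped = srcPattern.Match(stripped).Value;
Console.WriteLine(fromEncoded);
Console.WriteLine(fromStripped);
```

The extracted value is still encoded in both cases (`&amp;` or `amp;` remains inside the URL), so it would need decoding before being used as an address.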

#3

This regex should sort you out to grab the src content of just the IMG tags:

(?<=<img.*?src=\")[^\"]*(?=\".*?((/>)|(>.*</img>)))

It doesn't rely on where src sits within the tag. It does, however, require case-insensitive matching to be stable.

Patjbs's version will grab the src of all tags, which will cause instability if you're parsing HTML that contains linked-in external content, such as JavaScript, external div content, etc.

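using System.Text.RegularExpressions;
// In a verbatim (@"...") string a literal double quote is written as ""; IgnoreCase gives the case-insensitive matching the pattern needs.
string htmlString = @"<img id=""tagId"" src=""myTagSource.gif"" name=""imageName"" />";
string matchString = Regex.Match(htmlString, @"(?<=<img.*?src="")[^""]*(?="".*?((/>)|(>.*</img)))", RegexOptions.IgnoreCase).Value;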

matchString now equals "myTagSource.gif"

I notice that your input string is missing the & (ampersand) that marks escape sequences such as &quot;. Without it, there is no way to interpret those characters programmatically, short of forcing the logic to look for quot;, lt; and gt; explicitly. You would have to do a replace on the initial string to convert it into a string the regex can interpret.

So let's say you grab all these strings out of the page: you'd need to assume that every lt; becomes <, every gt; becomes >, and every quot; becomes ".

You also cannot assume that the data will always come back in this form; sometimes the string may contain other tag information (id, name, border info, etc.). So I think the ideal solution and the most maintainable one may diverge slightly here. The ideal way would be to do it in a single pass, but the more maintenance-friendly way may be to do it in two steps: first convert the input string to a standard HTML string, then extract the source data.
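
The two-step approach could be sketched roughly as follows. `RestoreMarkup` and `ExtractImgSrc` are hypothetical helper names, not anything from the question, and the extraction step uses a simpler capture-group pattern than the one above, because the question's img tag ends in a bare `>` with no `/>` or `</img>` for the lookahead to find. The blind replacement is also an assumption: it would mangle ordinary text that happens to contain `lt;`, `gt;`, `quot;` or `amp;` (for example "result;").

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Step 1: turn lt; gt; quot; amp; (with or without the leading &) back into < > " &.
static string RestoreMarkup(string raw)
{
    var map = new Dictionary<string, string>
    {
        ["lt"] = "<", ["gt"] = ">", ["quot"] = "\"", ["amp"] = "&"
    };
    return Regex.Replace(raw, @"&?(lt|gt|quot|amp);", m => map[m.Groups[1].Value]);
}

// Step 2: capture the double-quoted src value of the first img tag, wherever src sits in the tag.
static string ExtractImgSrc(string html)
{
    Match m = Regex.Match(html, @"<img\b[^>]*?\bsrc\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase);
    return m.Success ? m.Groups[1].Value : string.Empty;
}

// Input from the question, truncated after the first sentence fragment.
string raw = "lt;img src=quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpgamp;size=summary_mediumquot;gt;lt;pgt;Fifty-eight people have been tested";
string html = RestoreMarkup(raw);
Console.WriteLine(ExtractImgSrc(html));
```

Doing the replacement in a single `Regex.Replace` call avoids ordering problems that chained `string.Replace` calls can run into, and keeping the two steps separate means the extraction pattern stays readable as ordinary HTML.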

Alternatively, you could do it in a single pass by replacing the HTML characters in my pattern with their encoded equivalents (assuming they're using standard encoding but dropping the ampersand). It's not quite as readable, though, and is likely to confuse anyone maintaining the code:

(?<=lt;img.*?src=quot;).*?(?=quot;.*?((frasl;gt;)|(gt;.*lt;frasl;imggt;)))

Edit: If it turns out that they are using standard encoding and you just haven't included the & in your example, then you can simply use the first pattern I presented and run it against the decoded string:

string MatchValue = Regex.Match(HttpUtility.HtmlDecode(inputString), pattern).Value;

This will decode the string you get back from them into a standard string replacing the escaped characters with the correct characters and then run the same pattern.
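
A self-contained sketch of the decode-then-match idea, under a few assumptions: the input is the question's string with the ampersands restored (hypothetical), `System.Net.WebUtility.HtmlDecode` stands in for the `HttpUtility` call so that no reference to System.Web is needed, and only the lookbehind half of the first pattern is used, because this img tag ends in a bare `>` and the closing-tag lookahead would not find `/>` or `</img>`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical input: the question's string with the ampersands restored, truncated.
string inputString = "&lt;img src=&quot;http://www.news.gov.tt/thumbnail.php?file=Hon__Jerry_Narace_Minister__Of_Health_599152837.jpg&amp;size=summary_medium&quot;&gt;&lt;p&gt;Fifty-eight people have been tested";

// HtmlDecode turns &lt; &gt; &quot; &amp; back into < > " &.
string decoded = WebUtility.HtmlDecode(inputString);

// Match the text that follows src=" inside an img tag, up to the next double quote.
string src = Regex.Match(decoded, @"(?<=<img.*?src="")[^""]*", RegexOptions.IgnoreCase).Value;
Console.WriteLine(src);
```

Decoding first also turns `&amp;` inside the URL back into a plain `&`, so the extracted value is directly usable as an address.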

#4

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
