I have a web page. From it I want to find all the IMG tags and get the SRC of each one.
What regular expression would do this?
Some explanation:
I am scraping a web page. All the data is displayed correctly except the images. To solve this, my idea is to find each SRC and replace it, e.g. take
/images/header.jpg
and replace it with
www.*/images/header.jpg
5 Solutions
#1
You don't want a regular expression, you want a parser. From this question:
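using System;
using HtmlAgilityPack; // Html Agility Pack (NuGet package)

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.*.com");
        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return;
        foreach (var node in nodes)
        {
            // HtmlNode has no src property; read the attribute instead.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}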
#2
As pointed out, regular expressions are not the perfect solution, but you can usually build one that is good enough for the job. This is what I would use:
string newHtml = Regex.Replace(html,
@"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
m => "http://www.*.com" + m.Value);
It will match src attributes delimited by single or double quotes.
Of course, you would have to change the lambda/delegate to do your own replacing logic, but you get the idea :)
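To illustrate "your own replacing logic", here is a minimal Ruby sketch of the same idea that prefixes only root-relative paths and leaves absolute URLs alone. As far as I know, Ruby's regex engine rejects the variable-length lookbehind used above, so this version captures the pieces and rebuilds the match instead. BASE, IMG_SRC and absolutize_img_srcs are made-up names, and http://www.example.com is a placeholder for the real site.

```ruby
# Hypothetical base URL; substitute the site being scraped.
BASE = "http://www.example.com"

# Group 1: everything from "<img" up to and including "src=".
# Group 2: the opening quote. Group 3: the URL itself.
IMG_SRC = /(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2/im

def absolutize_img_srcs(html, base = BASE)
  html.gsub(IMG_SRC) do
    prefix, quote, url = $1, $2, $3
    # Only root-relative paths get the prefix; absolute and
    # protocol-relative ("//host/...") URLs are left untouched.
    url = base + url if url.start_with?("/") && !url.start_with?("//")
    "#{prefix}#{quote}#{url}#{quote}"
  end
end

puts absolutize_img_srcs('<img alt="x" src="/images/header.jpg">')
```

Like the original pattern, this still only handles quoted src values, so unquoted attributes pass through unchanged.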
#3
I have to agree with the parser-crowd on this one. In order of increasing input complexity, the hierarchy I choose from is:
- substrings;
- regexes; and
- parsers.
While regexes can handle much more complicated inputs than simple substring operations, they tend to barf pretty easily when faced with the really hairy input possibilities of free-form markup languages.
XML DOM parsers will be the easiest solution for this problem.
You can use regexes (and they'll work reasonably well if you restrict the input format, such as ensuring img tags don't cross line boundaries and so on), but the simplicity of a parser-based solution will blow regexes out of the water for multi-line, attributes-in-any-order DOM tags.
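A small Ruby sketch of what "restrict the input format" means in practice. The pattern and both sample strings are invented for illustration: the naive regex copes with the tidy tag but silently finds nothing once the tag spans lines, changes case, reorders attributes or switches quote style.

```ruby
# Naive pattern: src must be the first attribute, lowercase, double-quoted.
STRICT_IMG = /<img src="([^"]*)"/

# Returns every captured src value as a flat array of strings.
def find_srcs(html, pattern)
  html.scan(pattern).flatten
end

tidy  = '<img src="/images/header.jpg">'
messy = "<IMG\n  alt='logo'\n  SRC = '/images/logo.png'>"

p find_srcs(tidy, STRICT_IMG)   # => ["/images/header.jpg"]
p find_srcs(messy, STRICT_IMG)  # => []
```

Each of those variations can be patched into the regex one at a time, but a parser handles all of them without any special cases.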
#4
Remember that the source could be generated through JavaScript, so you may not be able to "just" do a regex replacement for img src.
Using Mechanize/Hpricot/Nokogiri in Ruby:
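require 'mechanize'
agent = Mechanize.new  # older releases exposed this class as WWW::Mechanize
page = agent.get('http://www.google.com')
# img[src] skips tags without a src, which would otherwise make the concatenation fail on nil
(page/"img[src]").each { |img| puts img['src'] = "http://www.yahoo.com" + img['src'] }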
And you are done!
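One caveat with plain string concatenation: it only gives a sensible result for root-relative paths such as /images/header.jpg. A hedged alternative is to let Ruby's standard uri library resolve each src against the URL of the page it came from. resolve_src is a made-up helper name and the example.com URLs are placeholders.

```ruby
require 'uri'

# Resolves an img src against the URL of the page that contained it.
# Returns nil for a missing, empty or unparseable src.
def resolve_src(page_url, src)
  return nil if src.nil? || src.empty?
  URI.join(page_url, src).to_s
rescue URI::InvalidURIError
  nil
end

page_url = "http://www.example.com/news/index.html"
puts resolve_src(page_url, "/images/header.jpg")  # http://www.example.com/images/header.jpg
puts resolve_src(page_url, "thumb.png")           # http://www.example.com/news/thumb.png
```

Absolute src values come back unchanged, so the helper can be applied to every image without checking first.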
#5
// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Gets the src from an IMG tag.
/// Assigns proper values to link and name if htmlTd matches the pattern.
/// Note: the pattern expects src to be the only attribute and a closing </img> tag.
/// </summary>
/// <param name="htmlTd">Html containing IMG tag</param>
/// <param name="link">Contains the src contents</param>
/// <param name="name">Contains img element content</param>
/// <returns>true if success, false otherwise</returns>
public static bool TryGetImgDetails(string htmlTd, out string link, out string name)
{
    link = null;
    name = null;
    // Unquoted values stop at whitespace or '>' so the closing bracket is never captured.
    string pattern = "<img\\s+src\\s*=\\s*(?:\"(?<link>[^\"]*)\"|(?<link>[^\\s>]+))\\s*>(?<name>.*?)\\s*</img>";
    // Match once, with the same options for the test and the extraction
    // (the original tested case-sensitively but extracted with IgnoreCase).
    Match m = Regex.Match(htmlTd, pattern, RegexOptions.IgnoreCase);
    if (!m.Success)
        return false;
    link = m.Groups["link"].Value;
    name = m.Groups["name"].Value;
    return true;
}