I'm trying to scrape a link from a website's source code, but part of the link changes every time I fetch the page.
Example of the markup:
<div align="center">
<a href="http://www10.site.com/d/the rest of the link">
<span class="button_upload green">
The next time I get the source code, the http://www10 prefix changes to http://www followed by some other number, for example http://www65.
How can I scrape the exact link, whatever the new number is?
Edit: Here's how I use the regex:
MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);
2 solutions
#1
1
You mentioned in the comments that you use regular expressions to parse the HTML document. That is the hardest way to do this (and generally not recommended). Try using an HTML parser such as HTML Agility Pack: http://html-agility-pack.net
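If you stay with a regular expression for now, the hard-coded `www10` in your pattern is what stops it from matching once the number changes. Here is a minimal sketch that accepts any digits after `www`, assuming the `site.com/d/` part from your sample stays fixed. `LinkScraper` and `FindLink` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkScraper
{
    // www\d+ accepts www10, www65, or any other number.
    // The host and the /d/ path are assumptions taken from the sample markup.
    private static readonly Regex LinkPattern = new Regex(
        "href=\"(?<url>http://www\\d+\\.site\\.com/d/[^\"]*)\"",
        RegexOptions.IgnoreCase);

    // Returns the first matching URL, or null when the page has no such link.
    public static string FindLink(string html)
    {
        Match m = LinkPattern.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```

With the text box from your edit, the call would be `LinkScraper.FindLink(textBox6.Text)`.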
For HTML Agility Pack: you install it as a NuGet package, and here is an example (adapted from the one posted on their website):
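// requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is a placeholder for your own method; it is not part of the library.
    att.Value = FixLink(att);
}
doc.Save("file.htm");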
It can also load HTML from a string (with doc.LoadHtml), not just from files. You use XPath to navigate the document and select what you want; CSS selectors are available through add-on packages.
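Applied to your markup, selecting the link by its structure means the changing number no longer matters. This is a minimal sketch, assuming the `<a>` you want is always the one wrapping the `button_upload` span as in your sample. `DownloadLinkFinder` and `FindDownloadLink` are hypothetical names.

```csharp
using HtmlAgilityPack;

public static class DownloadLinkFinder
{
    // Returns the href of the <a> that wraps the button_upload span,
    // or null when no such element exists in the page.
    public static string FindDownloadLink(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // Assumption: the target anchor contains a span whose class includes "button_upload".
        HtmlNode anchor = doc.DocumentNode.SelectSingleNode(
            "//a[@href][span[contains(@class, 'button_upload')]]");

        return anchor == null ? null : anchor.GetAttributeValue("href", string.Empty);
    }
}
```

Because the XPath keys on the span's class and not on the URL, it returns the link whether the host is www10 or www65.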
#2
0
How about a JS function like this, run when the page loads:
// jQuery is required!
var updateLinkUrl = function (num) {
    $.each($('.button_upload.green'), function (pos, el) {
        var orig = $(el).parent().prop("href");
        var newurl = orig.replace("www10", "www" + num);
        $(el).parent().prop("href", newurl);
    });
};
$(document).ready(function () { updateLinkUrl(65); });