How do I scrape variable data from source code?

Time: 2022-05-13 16:22:36

I'm trying to scrape a link from a website's source code. The link changes every time I fetch the page.

For example:

 <div align="center">
    <a href="http://www10.site.com/d/the rest of the link">
        <span class="button_upload green">

The next time I fetch the source code, http://www10 changes to http://www followed by some other number, such as http://www65.

How can I scrape the exact link, whatever the new number is?

Edit: Here's how I'm using a regex:

MatchCollection m1 = Regex.Matches(textBox6.Text, "(href=\"http://www10)(?<td_inner>.*?)(\">)", RegexOptions.Singleline);

2 solutions

#1


1  

You mentioned in the comments that you use regular expressions to parse the HTML document. That is the hardest way to do this, and it is generally not recommended. Try an HTML parser such as Html Agility Pack: http://html-agility-pack.net

For HTML Agility Pack: install it via NuGet packages; here is an example (posted on their website):

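using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        att.Value = FixLink(att); // FixLink is a placeholder for your own link-rewriting logic
    }
}
doc.Save("file.htm");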

It can also load HTML from a string (doc.LoadHtml), not just from files. You use XPath (or CSS selectors, via an add-on package) to navigate the document and select what you want.

#2


0  

How about a JS function like this, run when the page loads:

// jQuery is required!

var updateLinkUrl = function (num) { 
    $.each($('.button_upload.green'), function (pos, el) {
          var orig = $(el).parent().prop("href");
          var newurl = orig.replace("www10", "www" + num);
          $(el).parent().prop("href", newurl);
    });
};
$(document).ready(function () {  updateLinkUrl(65); });
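
The function above only rewrites links that contain the literal www10. If the goal is to read the link whatever number the host currently carries, one option is a regular expression with \d+ in place of the fixed number. What follows is a minimal sketch in plain JavaScript (no jQuery); extractLinks and the sample markup are hypothetical names made up for illustration, and it assumes the href is double-quoted as in the question's snippet.

```javascript
// Hypothetical helper: collect every href of the form
// http://www<digits>.site.com/d/... from a string of HTML.
// Assumes double-quoted href attributes, as in the question's markup.
function extractLinks(html) {
  // \d+ matches any run of digits, so www10, www65, ... all qualify.
  const pattern = /href="(http:\/\/www\d+\.site\.com\/d\/[^"]*)"/g;
  const links = [];
  for (const match of html.matchAll(pattern)) {
    links.push(match[1]); // capture group 1 is the URL without the quotes
  }
  return links;
}

// Sample markup modelled on the question; the path is made up.
const sample =
  '<div align="center">\n' +
  '  <a href="http://www65.site.com/d/abc123/file.zip">\n' +
  '    <span class="button_upload green"></span></a></div>';

console.log(extractLinks(sample));
```

The same idea should carry over to the C# pattern in the question: replacing www10 with www\d+ (written www\\d+ inside a regular C# string literal) lets it match any number.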
