Scraping HTML DOM elements using HtmlAgilityPack in ASP.NET

Time: 2021-05-09 01:23:34

I am scraping HTML DOM elements using HtmlAgilityPack in ASP.NET. Currently my code loads every href link it finds, which means it also follows sublinks of sublinks. I only need the URLs that belong to my own domain, and I don't know how to write code for that. Can anyone help me with this? Here is my code:

public void GetURL(string strGetURL)
{
    var getHtmlSource = new HtmlWeb();
    try
    {
        // Download and parse the page at the given URL.
        var document = getHtmlSource.Load(strGetURL);
        var aTags = document.DocumentNode.SelectNodes("//a");
        if (aTags != null)
        {
            outputurl.Text = string.Empty;
            int _count = 0;
            foreach (var aTag in aTags)
            {
                string strURLTmp = aTag.Attributes["href"].Value;
                if (_count != 0) // the first link on the page is skipped
                {
                    if (!CheckDuplicate(strURLTmp))
                    {
                        lstResults.Add(strURLTmp);
                        outputurl.Text += strURLTmp + "\n";
                        counter++;
                        // Recurse into every new link, regardless of its domain.
                        GetURL(strURLTmp);
                    }
                }
                _count++;
            }
        }
    }
    catch (Exception)
    {
        // The handler is not shown in the original post.
    }
}

1 solution

#1


If you mean that you want only the URLs that contain a specific domain, you can change the XPath to:

//a[contains(@href, 'your domain here')]
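
As a rough illustration of how that XPath plugs into the existing HAP calls, here is a minimal sketch. The markup string and the domain example.com are placeholders I made up for the demo; a real page would come from HtmlWeb.Load() as in the question.

```csharp
using System;
using HtmlAgilityPack;

class XPathFilterDemo
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        var html = "<a href='http://example.com/a'>in</a>" +
                   "<a href='http://other.org/b'>out</a>" +
                   "<a name='top'>no href</a>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Select only <a> elements whose href contains the domain text.
        var aTags = document.DocumentNode.SelectNodes("//a[contains(@href, 'example.com')]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (aTags != null)
        {
            foreach (var aTag in aTags)
            {
                Console.WriteLine(aTag.GetAttributeValue("href", ""));
            }
        }
    }
}
```

Note that contains() is a plain substring test, so it would also match a link on another site that merely mentions the domain somewhere in its URL.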

Or, if you prefer LINQ to XPath:

var aTags = document.DocumentNode.SelectNodes("//a"); 
if (aTags != null)
{
    ....
    var relevantLinks = aTags.Where(o => o.GetAttributeValue("href", "")
                                          .Contains("your domain here")
                                    );
    ....
}

GetAttributeValue() is a safer way to read an attribute value with HAP. aTag.Attributes["href"] returns null when the attribute is missing, so reading .Value from it throws a NullReferenceException; GetAttributeValue() instead returns its second parameter when the attribute is not found on the context node.
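
If a substring match turns out to be too loose, or the page uses relative links such as /about that do not contain the domain at all, one possible alternative is to resolve each href against the page URL and compare hosts. This is only a sketch using System.Uri and is not part of the answer above; LinkFilter and ResolveIfSameHost are hypothetical names, and the URLs are made-up examples.

```csharp
using System;

static class LinkFilter
{
    // Returns the absolute URL when href points at the same host as baseUrl;
    // returns null for other hosts, non-HTTP schemes, and empty values.
    public static string ResolveIfSameHost(string baseUrl, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;

        var baseUri = new Uri(baseUrl);
        Uri absolute;

        // Resolves relative links ("/about") against the page they appeared on.
        if (!Uri.TryCreate(baseUri, href, out absolute)) return null;

        // Skip mailto:, javascript: and similar non-page links.
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        bool sameHost = string.Equals(absolute.Host, baseUri.Host,
                                      StringComparison.OrdinalIgnoreCase);
        return sameHost ? absolute.AbsoluteUri : null;
    }
}

class LinkFilterDemo
{
    static void Main()
    {
        string page = "http://example.com/index.html";
        string[] hrefs = { "/about", "http://example.com/contact", "http://other.org/", "mailto:me@example.com", "" };

        foreach (var href in hrefs)
        {
            var url = LinkFilter.ResolveIfSameHost(page, href);
            Console.WriteLine("'{0}' -> {1}", href, url ?? "(skipped)");
        }
    }
}
```

Inside the question's loop, this would presumably be called as ResolveIfSameHost(strGetURL, aTag.GetAttributeValue("href", "")), recursing only when the result is not null.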
