How do I extract links from an HTML file with Perl?

Posted: 2022-06-01 18:23:39

I have some input that contains links, and I want to open those links. For instance, I have an HTML file and I want to find all the links in it and open their contents in an Excel spreadsheet.

4 solutions

#1


It sounds like you want the linktractor script from my HTML::SimpleLinkExtor module.
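
If you'd rather call the module directly than use the bundled script, here's a minimal sketch (the file name is a placeholder; adjust to taste):

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( 'page.html' );    # placeholder file; parse( $html_string ) also works

# href() returns every href attribute value the parser saw
print "$_\n" for $extor->href;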

You might also be interested in my webreaper script. I wrote that a long, long time ago to do something close to this same task. I don't really recommend it because other tools are much better now, but you can at least look at the code.

CPAN and Google are your friends. :)

Mojo::UserAgent is quite nice for this, too:

use Mojo::UserAgent;

print Mojo::UserAgent
    ->new
    ->get( $ARGV[0] )          # fetch the URL named on the command line
    ->res                      # its HTTP response
    ->dom->find( "a" )         # every <a> element in the document
    ->map( attr => "href" )    # pull out each href attribute
    ->join( "\n" );

#2


That sounds like a job for WWW::Mechanize. It provides a fairly high-level interface for fetching and examining web pages.
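
For example, a minimal sketch that prints every link on a page (the URL is a placeholder):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get( 'http://example.com/' );    # placeholder URL

# links() returns WWW::Mechanize::Link objects
print $_->url, "\n" for $mech->links;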

Once you've read the docs, I think you'll have a good idea how to go about it.

#3


There is also Web::Query:

#!/usr/bin/env perl 

use 5.10.0;

use strict;
use warnings;

use Web::Query;

# print the href of every <a> in the file or URL given as the first argument
say for wq( shift )->find('a')->attr('href');

Or, from the command line:

$ perl -MWeb::Query -E'say for wq(shift)->find("a")->attr("href")' \
       http://techblog.babyl.ca

#4


I've used URI::Find for this in the past (for when the file is not HTML).
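
A minimal sketch, assuming the file's contents have already been read into $text: URI::Find invokes a callback for each URI it spots in the text.

use URI::Find;

# the callback receives each URI; returning the original text leaves it unchanged
my $finder = URI::Find->new( sub {
    my ( $uri, $orig_text ) = @_;
    print "$uri\n";
    return $orig_text;
} );

$finder->find( \$text );    # $text is assumed to hold the file's contents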
