I have this input text:
<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body><table cellspacing="0" cellpadding="0" border="0" align="center" width="603"> <tbody><tr> <td><table cellspacing="0" cellpadding="0" border="0" width="603"> <tbody><tr> <td width="314"><img height="61" width="330" src="/Elearning_Platform/dp_templates/dp-template-images/awards-title.jpg" alt="" /></td> <td width="273"><img height="61" width="273" src="/Elearning_Platform/dp_templates/dp-template-images/awards.jpg" alt="" /></td> </tr> </tbody></table></td> </tr> <tr> <td><table cellspacing="0" cellpadding="0" border="0" align="center" width="603"> <tbody><tr> <td colspan="3"><img height="45" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/top-bar.gif" alt="" /></td> </tr> <tr> <td background="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" width="12"><img height="1" width="12" src="/Elearning_Platform/dp_templates/dp-template-images/left-bar-bg.gif" alt="" /></td> <td width="580"><p> what y all heard?</p><p>i'm shark oysters.</p> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p></td> <td background="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" width="11"><img height="1" width="11" src="/Elearning_Platform/dp_templates/dp-template-images/right-bar-bg.gif" alt="" /></td> </tr> <tr> <td colspan="3"><img height="31" width="603" src="/Elearning_Platform/dp_templates/dp-template-images/bottom-bar.gif" alt="" /></td> </tr> </tbody></table></td> </tr> </tbody></table> <p> </p></body></html>
As you can see, there are no newlines in this chunk of HTML. I need to find all the image links in it, copy the images out to a directory, and change each link in the text to something like ./images/file_name.
Currently, the Perl code that I'm using looks like this:
my ($old_src,$new_src,$folder_name);
foreach my $record (@readfile) {
    ## so the if else case for the url replacement block below will be correct
    $old_src = "";
    $new_src = "";
    if ($record =~ /\<img(.+)/){
        if($1=~/src=\"((\w|_|\\|-|\/|\.|:)+)\"/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            push (@images, $new_src);
            $folder_name = "images";
        }## end if
    }
    elsif($record =~ /background=\"(.+\.jpg)/){
        $old_src = $1;
        my @tmp = split(/\/Elearning/,$old_src);
        $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
        push (@images, $new_src);
        $folder_name = "images";
    }
    elsif($record=~/\<iframe(.+)/){
        if($1=~/src=\"((\w|_|\\|\?|=|-|\/|\.|:)+)\"/){
            $old_src = $1;
            my @tmp = split(/\/Elearning/,$old_src);
            $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
            ## remove the ?rand behind the html file name
            if($new_src=~/\?rand/){
                my ($fname,$rand) = split(/\?/,$new_src);
                $new_src = $fname;
                my ($fname,$rand) = split(/\?/,$old_src);
                $old_src = $fname."\\?".$rand;
            }
            print "old_src::$old_src\n"; ##s7test
            print "new_src::$new_src\n\n"; ##s7test
            push (@iframes, $new_src);
            $folder_name = "iframes";
        }## end if
    }## end if
    my $new_record = $record;
    if($old_src && $new_src){
        $new_record =~ s/$old_src/$new_src/ ;
        print "new_record:$new_record\n"; ##s7test
        my @tmp = split(/\//,$new_src);
        $new_record =~ s/$new_src/\.\\$folder_name\\$tmp[-1]/;
        ## print "new_record2:$new_record\n\n"; ##s7test
    }## end if
    print WRITEFILE $new_record;
} # foreach
This only handles HTML that contains newlines, where each line holds at most one link. I thought of simply looping the regex over the line, but then I would have to change the already-matched text to something else so that it is not matched again.
Do you have any idea if there is an elegant Perl way to do this? Maybe I'm just too dumb to see the obvious way of doing it. I also know that simply adding the global option doesn't work.
thanks. ~steve
3 solutions
#1
2
If you must avoid any additional module, like an HTML parser, you could try:
while ($string =~ m/(?:\<\s*(?:img|iframe)[^\>]+src\s*=\s*\"((?:\w|\\|\?|=|-|\/|\.|:)+)\"|background\s*=\s*\"([^\>\"]+\.jpg))/g) {
    # $1 holds an img/iframe src, $2 a background image
    my $old_src = defined $1 ? $1 : $2;
    my @tmp = split(/\/Elearning/, $old_src);
    my $new_src = "/media/www/vprimary/Elearning".$tmp[-1];
    if ($new_src =~ /\?rand/) {
        # remove ?rand and push into @iframes
        $new_src =~ s/\?.*//;
        push(@iframes, $new_src);
    }
    else {
        push(@images, $new_src);
    }
}
That way, you can apply this regex to the whole source (newlines included) and end up with more compact code. It also allows for extra whitespace between attributes and their values.
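To rewrite the links in place as well, which is what the question asks for, the same kind of global match can be turned into a substitution whose replacement is a code block (the /e modifier). The following is only a minimal sketch using core Perl: the helper name localize_sources is hypothetical, the ./images/ and ./iframes/ targets and the /media/www/vprimary prefix are taken from the question, and it assumes double-quoted attribute values. It shares every weakness of regex-based HTML handling.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites every img/iframe src and every background
# value in $html to ./images/<file> (or ./iframes/<file> for iframes).
# Returns the new HTML plus the server-side paths that need to be copied.
sub localize_sources {
    my ($html) = @_;
    my (@images, @iframes);

    $html =~ s{
        ( <\s*(img|iframe)\b [^>]*? \bsrc \s*=\s* " )  # captures 1 and 2: tag start, tag name
        ( [^"]+ )                                      # capture 3: the src value
      |
        ( \bbackground \s*=\s* " )                     # capture 4: background attribute
        ( [^"]+ )                                      # capture 5: its value
    }{
        # Copy the captures first: the inner regexes below reset them.
        my $is_iframe = defined $2 && lc($2) eq 'iframe';
        my $prefix    = defined $1 ? $1 : $4;
        my $old_src   = defined $3 ? $3 : $5;

        (my $path = $old_src) =~ s/\?.*\z//s;          # drop a ?rand=... suffix
        my ($file)      = $path =~ m{([^/]+)\z};
        my $replacement = $prefix . $old_src;          # default: leave unchanged

        if (defined $file) {
            my @tmp     = split m{/Elearning}, $path;
            my $new_src = "/media/www/vprimary/Elearning" . $tmp[-1];
            if ($is_iframe) { push @iframes, $new_src }
            else            { push @images,  $new_src }
            $replacement = $prefix . ($is_iframe ? "./iframes/" : "./images/") . $file;
        }
        $replacement;
    }gxei;

    return ($html, \@images, \@iframes);
}
```

Called as my ($new_html, $images, $iframes) = localize_sources($string); it handles any number of links on one line, because each match is replaced as it is found and never rescanned.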
#2
10
There are excellent HTML parsers for Perl; learn to use them and stick with them. HTML is complex: it allows > in attribute values, uses nesting heavily, and so on. Using regexes to parse it, beyond very simple tasks (or machine-generated markup), is prone to problems.
#3
4
I think you want my HTML::SimpleLinkExtor module:
use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $file );
my @imgs = $extor->img;
I'm not sure what exactly you're trying to do, but it surely sounds like one of the HTML parsing modules should do the trick if mine doesn't.