I need to extract a captcha image from a URL and recognise it with Tesseract. My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
The image URL is extracted correctly. The image contains the captcha.
My output is:
GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified
As you can see, the script extracts the image correctly, but Tesseract doesn't recognise anything in that PNG file. I have tried passing additional parameters such as -psm and -l to the tesseract shell command, but that gives nothing either.
UPDATE: After reading @Dave Cross's answer, I tried his suggestion.
In output I got:
http://perltest.adavice.com/captcha/1533141024.png
ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[]
200Captcha text not specified
Original image file not specified
Captcha text not specified
Why do I need the text from the PNG image? Maybe this additional information will help.
This is how $url looks in a browser. My goal is to create a query for this page in wim using Perl. For that I need to fill in the form fields with my $user, $pass and $txt (the text Tesseract recognised from the image), and send them with POST to $url (the last line of the code).
1 Answer
There are several strange things going on here. Any one of them could be causing your problems.
- Having -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix all the problems that reveals (I'd suggest adding use strict too, but you'd need to declare all of your variables).
- I'd recommend using LWP::Simple instead of shelling out to GET.
- Please don't use regexes to parse HTML. Use a real HTML parser instead. Web::Query is my current favourite.
- You then run GET again, using a variable called $txt that doesn't have a value. That's not going to work!
- $txt = 'cat ocr_result.txt' doesn't do what you think it does. You want backticks, not single quotes.
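To make the last point concrete, here is a minimal sketch of the difference between the two kinds of quotes. It assumes a Unix-like shell where echo is available; the variable names and the command are made up for illustration and are not from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single quotes: a plain string. Nothing is executed.
my $quoted = 'echo captcha';

# Backticks: the shell runs the command and its output is captured.
my $ran = `echo captcha`;
chomp $ran;    # remove the trailing newline that echo adds

print "single quotes: [$quoted]\n";
print "backticks:     [$ran]\n";
```

With single quotes, $txt in the original script would simply hold the text "cat ocr_result.txt" rather than the contents of the file.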
Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for accessing the image in your example and extracting the text from it.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple;
my $img_url = 'http://perltest.adavice.com/captcha/1533110309.png';
my $img_file = 'ocr_me.img';
getstore($img_url, $img_file);
my $txt = `tesseract $img_file stdout`;
say $txt;
Here's your actual error:
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
You ask tesseract to write its output to ocr_result.txt, but two lines later, you overwrite that file with the output of a failed call to GET. I'm not sure what you think that's going to do, but it will trash whatever output tesseract has already stored in that file.
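One way to avoid clobbering the file is to read what tesseract wrote instead of issuing another GET. The following is only a sketch: the helper name read_ocr_text is hypothetical, and it assumes the OCR text ends up in ocr_result.txt, which is the file the original script checks for. The character filter is copied from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the text file produced by the OCR step and
# return its contents with everything except captcha characters removed.
sub read_ocr_text {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                 # slurp mode: read the whole file in one go
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/[^A-Za-z0-9\-_.]+//g;   # same filter as the original script
    return $text;
}

# Usage once the OCR step has run (left commented out because it needs
# tesseract and the downloaded image):
# system('tesseract', 'ocr_me.img', 'ocr_result') == 0
#     or die "tesseract failed\n";
# my $txt = read_ocr_text('ocr_result.txt');
```

Reading the file with open also avoids shelling out to cat at all.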
Updated Update:
Here's my current version of the code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw[$ua get getstore];
use File::Basename;
###
my $user = 'xxxx'; #Enter your username here
my $pass = 'xxxx'; #Enter your password here
###
#Server settings
my $home = "http://perltest.adavice.com";
my $url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
my $html = get($url);
my $img;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
my $img_url = $home . $img;
my $img_file = 'ocr_me.img';
getstore($img_url, $img_file);
say $img_url;
say $img_file;
# Looks like tesseract adds two newlines to its output -
# so chomp() it twice!
chomp(my $txt = `tesseract ocr_me.img stdout`);
chomp($txt);
say "[$txt]";
$txt =~ s/\W+//g;
my $resp = $ua->post($url, {
    u => $user,
    p => $pass,
    file => basename($img),
    text => $txt,
});
print $resp->code;
print $resp->content;
I've changed a few things.
- Corrected $img_url from $url . $img to $home . $img (this is what was stopping it from getting the correct image).
- Switched to using LWP::Simple throughout (it's just easier).
- chomped (twice!) the output from tesseract to remove newlines.
- Used File::Basename to get the correct filename to pass in the final POST.
- Removed any non-word characters from $txt before POSTing it.
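The clean-up steps in that list can be tried in isolation. This is a small sketch using a made-up OCR string, not real tesseract output. Instead of calling chomp twice it strips all trailing whitespace with one substitution, which is a swapped-in alternative rather than what the script above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

# Made-up OCR output with two trailing newlines, for illustration only.
my $raw = "Xy 7-Q\n\n";

# Strip all trailing whitespace in one step (alternative to chomp twice).
(my $txt = $raw) =~ s/\s+\z//;

# Remove non-word characters: anything that is not a letter, digit or underscore.
$txt =~ s/\W+//g;

# basename() keeps only the last path component.
my $file = basename('/captcha/1533110309.png');

print "text=$txt file=$file\n";    # prints "text=Xy7Q file=1533110309.png"
```

Note that \W also removes "-" and ".", which the filter in the original question kept, so the two scripts can produce different text for the same image.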
It still doesn't quite work. It seems to hang waiting for a response from the server. But I'm afraid I've run out of time to help you.
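If anyone picks this up, a timeout and an explicit status check should at least turn the hang into a visible error. This is only a sketch: the helper name post_form and the 10-second value are assumptions, and it does not explain why the server fails to respond.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw($ua);    # $ua is the LWP::UserAgent object LWP::Simple uses

# Give up instead of waiting indefinitely. 10 seconds is an arbitrary choice.
$ua->timeout(10);

# Hypothetical helper: POST the form fields and return the response body,
# or die with the HTTP status line if the request did not succeed.
sub post_form {
    my ($url, $form) = @_;
    my $resp = $ua->post($url, $form);
    die 'POST failed: ' . $resp->status_line . "\n"
        unless $resp->is_success;
    return $resp->decoded_content;
}

# Usage with the variables from the script above (commented out because
# it needs network access and valid credentials):
# print post_form($url, { u => $user, p => $pass,
#                         file => basename($img), text => $txt });
```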