
Time: 2022-09-24 08:54:51

I need to extract a captcha image from a URL and recognise it with Tesseract. My code is:


#!/usr/bin/perl -X
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s) {
    $img = $1;
}
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

The script finds the image correctly. The image contains a captcha and looks like this:



My output is:


GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified

As you can see, the script finds the image correctly, but Tesseract doesn't see anything in that PNG file. I tried specifying additional parameters such as -psm and -l in the tesseract shell command, but that gave nothing either.


UPDATE: After reading @Dave Cross's answer, I tried his suggestion.

In the output I got:


Tesseract Open Source OCR Engine v3.02.02 with Leptonica
200Captcha text not specified
Original image file not specified
Captcha text not specified

Why do I need the text from the PNG image? Maybe this additional information will help you. Look at this:


This is how $url looks in a browser. My goal is to create a query for this page in wim using Perl. To do that, I need to fill in the form above with my $user, $pass and $txt (the text Tesseract recognised from the image), and send it with POST 'url' (the last line of the code).

1 Solution



Several strange things going on here. Any one of them could be causing your problems.


  1. Having -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix all the problems that reveals (I'd suggest adding use strict too, but you'd need to declare all of your variables).
  2. I'd recommend using LWP::Simple instead of shelling out to GET.
  3. Please don't use regexes to parse HTML. Use a real HTML parser instead. Web::Query is my current favourite.
  4. You then run GET again, using a variable called $txt that doesn't have a value. That's not going to work!
  5. $txt = 'cat ocr_result.txt' doesn't do what you think it does. You want backticks, not single quotes.
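
The last point is easy to check in isolation. What follows is a minimal sketch, not the asker's script: it uses echo hello as a stand-in command (an assumption, chosen so that it runs without any particular file being present) and shows that single quotes only store the command text, whereas backticks run the command and capture its output.

```perl
use strict;
use warnings;

# Single quotes: a plain string, nothing is executed.
my $literal = 'echo hello';

# Backticks: the command runs and its standard output is captured.
my $captured = `echo hello`;
chomp $captured;    # drop the trailing newline added by echo

print "single quotes: [$literal]\n";     # single quotes: [echo hello]
print "backticks:     [$captured]\n";    # backticks:     [hello]
```

With single quotes, $txt in the original script would hold the literal text "cat ocr_result.txt" rather than the contents of the file.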

Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for accessing the image in your example and extracting the text from it.



use strict;
use warnings;
use feature 'say';

use LWP::Simple;

my $img_url  = 'http://perltest.adavice.com/captcha/1533110309.png';
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

my $txt = `tesseract $img_file stdout`;

say $txt;

Here's your actual error:


system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";

You ask tesseract to write its output to ocr_result.txt, but two lines later, you overwrite that file with the output of a failed call to GET. I'm not sure what you think that's going to do, but it will trash whatever output tesseract has already stored in that file.
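
One way to avoid that, sketched here under some assumptions: let tesseract write ocr_result.txt and then read the file directly from Perl instead of touching it with another shell command. The helper name read_ocr_result is hypothetical, and the tesseract call is replaced by a stand-in that writes a sample file so the sketch runs without tesseract installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: slurp the text file that tesseract produced.
sub read_ocr_result {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                  # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    $text =~ s/\s+\z//;        # drop trailing newlines and spaces
    return $text;
}

# Stand-in for: system('tesseract', 'ocr_me.img', 'ocr_result');
# It writes made-up sample text where tesseract would write its result.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/ocr_result.txt";
open my $out, '>', $file or die "Cannot write $file: $!\n";
print {$out} "aB3dE\n\n";
close $out;

my $txt = read_ocr_result($file);
print "[$txt]\n";
```

Nothing between the tesseract step and the read writes to ocr_result.txt, so the recognised text survives.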


Updated Update:


Here's my current version of the code:


use strict;
use warnings;
use feature 'say';
use LWP::Simple qw[$ua get getstore];
use File::Basename;
my $user = 'xxxx'; #Enter your username here
my $pass = 'xxxx'; #Enter your password here
#Server settings
my $home = "http://perltest.adavice.com";
my $url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
my $html = get($url);
my $img;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s) {
    $img = $1;
}
my $img_url = $home . $img;
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

say $img_url;
say $img_file;

# Looks like tesseract adds two newlines to its output -
# so chomp() it twice!
chomp(my $txt = `tesseract ocr_me.img stdout`);
chomp($txt);

say "[$txt]";

$txt =~ s/\W+//g;

my $resp = $ua->post($url, {
  u    => $user,
  p    => $pass,
  file => basename($img),
  text => $txt,
});

print $resp->code;
print $resp->content;

I've changed a few things.


  1. Corrected $img_url from $url . $img to $home . $img (this is what was stopping it from getting the correct image).
  2. Switched to using LWP::Simple throughout (it's just easier).
  3. chomped (twice!) the output from tesseract to remove newlines.
  4. Used File::Basename to get the correct filename to pass in the final POST.
  5. Removed any non-word characters from $txt before POSTing it.
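
Here is a small sketch of the last three changes on their own. The sample values are assumptions for illustration: the image path is taken from the example URL, and the OCR string is invented, not real tesseract output.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Sample values; the OCR text is invented for illustration.
my $img = '/captcha/1533110309.png';
my $txt = "x7 Kq-9\n\n";

# Each chomp removes at most one trailing newline.
chomp $txt;
chomp $txt;

# Strip everything that is not a letter, digit or underscore.
$txt =~ s/\W+//g;

# Keep only the filename part of the image path.
my $file = basename($img);

print "file=$file\n";    # file=1533110309.png
print "text=$txt\n";     # text=x7Kq9
```

Since \W also matches newlines, the substitution alone would remove them; the chomp calls only matter if the raw text is printed first. Note too that \W strips hyphens and dots, which the character class in the original script kept.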

It still doesn't quite work. It seems to hang waiting for a response from the server. But I'm afraid I've run out of time to help you.




