Tesseract does not recognise the captcha in a PNG file containing English letters and digits

Date: 2022-09-24 08:54:51

I need to extract a captcha from a URL and recognise it with Tesseract. My code is:

#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

The image is parsed correctly. It contains a captcha and looks like this:

[image: captcha PNG]

My output is:

GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified

As you can see, the script parses the image correctly, but Tesseract didn't see anything in that PNG file. I tried specifying additional parameters such as -psm and -l in the tesseract shell command, but that also produced nothing.

UPDATE: After reading @Dave Cross's answer, I tried his suggestion.

The output I got:

http://perltest.adavice.com/captcha/1533141024.png
ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[]
200Captcha text not specified
Original image file not specified
Captcha text not specified

Why do I need the text from the PNG image? Maybe this additional information will help. Take a look: [image: the page at $url as shown in a browser]

This is how $url looks in a browser. My goal is to create the query for this page in wim using Perl. To do that I need to fill in the form fields shown above with $user, $pass and $txt (the text Tesseract recognises from the image), and send them with POST to $url (the last line of the code).

1 solution

#1


Several strange things are going on here. Any one of them could be causing your problems.

  1. Having -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix all the problems that reveals (I'd suggest adding use strict too, but you'd need to declare all of your variables).
  2. I'd recommend using LWP::Simple instead of shelling out to GET.
  3. Please don't use regexes to parse HTML. Use a real HTML parser instead. Web::Query is my current favourite.
  4. You then run GET again, using a variable called $txt that doesn't have a value. That's not going to work!
  5. $txt = 'cat ocr_result.txt' doesn't do what you think it does. You want backticks, not single quotes.
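
To illustrate the single-quotes point: a single-quoted string is just literal text, so no command runs. Backticks would run cat, but reading the file directly avoids the shell altogether. The sketch below is only an illustration; slurp_file is a hypothetical helper name, and a temporary file stands in for ocr_result.txt so the example runs anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into a string without shelling out to cat.
# (slurp_file is a hypothetical helper, not part of the original code.)
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                  # slurp mode: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Temporary stand-in for ocr_result.txt.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "AbC123\n";
close $tmp_fh;

my $literal = 'cat ocr_result.txt';    # single quotes: literal text, nothing is executed
my $content = slurp_file($tmp_path);   # the file's actual contents

print "single-quoted: [$literal]\n";
print "file contents: [$content]";
```

With the single-quoted version, the later substitution simply strips the spaces from the literal command text, which is why the server never received any captcha text.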

Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for accessing the image in your example and extracting the text from it.

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

use LWP::Simple;

my $img_url  = 'http://perltest.adavice.com/captcha/1533110309.png';
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

my $txt = `tesseract $img_file stdout`;

say $txt;
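
One thing the short script above does not do is check whether the download succeeded. getstore returns the HTTP status code, and LWP::Simple also exports is_success, so a failed fetch can be reported before tesseract is ever run. This is a minimal sketch; fetch_to_file is a hypothetical wrapper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Download $url into $file; die with the HTTP status if the fetch fails.
# (fetch_to_file is a hypothetical wrapper around LWP::Simple's getstore.)
sub fetch_to_file {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);   # getstore returns the HTTP status code
    die "Download of $url failed with status $status\n"
        unless is_success($status);
    return $status;
}

# Usage, with the URL and file name from the example above:
#   fetch_to_file('http://perltest.adavice.com/captcha/1533110309.png', 'ocr_me.img');
```

If the download fails, the script stops with the status code instead of handing tesseract an empty or missing file.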

Here's your actual error:

system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";

You ask tesseract to write its output to ocr_result.txt, but two lines later, you overwrite that file with the output of a failed call to GET. I'm not sure what you think that's going to do, but it will trash whatever output tesseract has already stored in that file.
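
If you want to keep the file-based approach, the fix is simply to read ocr_result.txt straight after tesseract has written it, with no GET in between. A rough sketch under that assumption follows; run_ocr is a hypothetical name, and the optional command argument exists only so the function can be tried without tesseract installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run an OCR command that writes "<out_base>.txt", then read that file back.
# @cmd defaults to tesseract; passing a different command is only meant for testing.
sub run_ocr {
    my ($img_file, $out_base, @cmd) = @_;
    @cmd = ('tesseract') unless @cmd;

    # List-form system bypasses the shell; tesseract appends ".txt" to the base name.
    system(@cmd, $img_file, $out_base) == 0
        or die "OCR command failed: $?\n";

    my $out_file = "$out_base.txt";
    open my $fh, '<', $out_file or die "Cannot open $out_file: $!\n";
    local $/;                  # slurp the whole file
    my $txt = <$fh>;
    close $fh;
    return $txt;
}

# Usage, with the file names from the question:
#   my $txt = run_ocr('ocr_me.img', 'ocr_result');
```

Nothing touches ocr_result.txt between the tesseract call and the read, so the recognised text survives.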

Updated Update:

Here's my current version of the code:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw[$ua get getstore];
use File::Basename;
###
my $user = 'xxxx'; #Enter your username here
my $pass = 'xxxx'; #Enter your password here
###
#Server settings
my $home = "http://perltest.adavice.com";
my $url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
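my $html = get($url);
# get() returns undef on failure, so stop early instead of matching against undef
die "Could not fetch the page\n" unless defined $html;
my $img;
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
die "<img> not found\n" unless defined $img;
my $img_url = $home . $img;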
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

say $img_url;
say $img_file;

# Looks like tesseract adds two newlines to its output -
# so chomp() it twice!
chomp(my $txt = `tesseract ocr_me.img stdout`);
chomp($txt);

say "[$txt]";

$txt =~ s/\W+//g;

my $resp = $ua->post($url, {
  u    => $user,
  p    => $pass,
  file => basename($img),
  text => $txt,
});

print $resp->code;
print $resp->content;

I've changed a few things.

  1. Corrected $img_url from $url . $img to $home . $img (this is what was stopping it from getting the correct image).
  2. Switched to using LWP::Simple throughout (it's just easier).
  3. chomped (twice!) the output from tesseract to remove newlines.
  4. Used File::Basename to get the correct filename to pass in the final POST.
  5. Removed any non-word characters from $txt before POSTing it.
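
The clean-up steps in that list can be tried in isolation. The sketch below uses a hypothetical helper, clean_captcha_text, and a made-up OCR string. Note that \W also matches newlines, so the substitution alone already removes them from the value that gets POSTed, and that \w keeps underscores as well as letters and digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Strip everything that is not a word character (letters, digits, underscore).
# (clean_captcha_text is a hypothetical helper name.)
sub clean_captcha_text {
    my ($raw) = @_;
    $raw = '' unless defined $raw;
    (my $clean = $raw) =~ s/\W+//g;   # removes spaces, newlines and punctuation
    return $clean;
}

# Made-up OCR output with a stray space and two trailing newlines.
print clean_captcha_text("Ab3 xY\n\n"), "\n";       # prints "Ab3xY"

# The image path from the page reduced to the bare file name.
print basename('/captcha/1533110309.png'), "\n";    # prints "1533110309.png"
```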

It still doesn't quite work. It seems to hang waiting for a response from the server. But I'm afraid I've run out of time to help you.
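
For anyone picking this up, one way to make the hang easier to diagnose (not a fix for whatever the server is doing) is to give the user agent a shorter timeout and print the status line on failure. LWP::UserAgent's default timeout is 180 seconds, which can look like a hang. The helper names below are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that gives up after $seconds instead of the 180-second default.
# (make_ua and describe_response are hypothetical helper names.)
sub make_ua {
    my ($seconds) = @_;
    return LWP::UserAgent->new(timeout => $seconds);
}

# Turn a response into something printable: the body on success, the status line otherwise.
sub describe_response {
    my ($resp) = @_;
    return $resp->is_success
        ? $resp->decoded_content
        : 'Request failed: ' . $resp->status_line;
}

# Usage, with the variables from the script above:
#   my $ua   = make_ua(15);
#   my $resp = $ua->post($url, { u => $user, p => $pass, file => basename($img), text => $txt });
#   print describe_response($resp), "\n";
```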
