Perl Image::OCR::Tesseract module on Windows

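Does anyone know of an elegant way to install the "Image::OCR::Tesseract" module on Windows? The module fails to install from CPAN on Windows because of a *NIX-only dependency called "LEOCHARRE::CLI". That dependency does not appear to be needed to run "Image::OCR::Tesseract" itself.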

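I managed to get it working by first manually installing the dependencies listed in Makefile.PL (except "LEOCHARRE::CLI") and then moving the module file into the correct directory structure under "C:\Perl\site\lib\Image\OCR". The last piece was changing the code that calls the ImageMagick and Tesseract executables from the command line, so that the program names are wrapped in quotes when the module invokes them.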
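To illustrate the quoting change, here is a minimal sketch. `quote_program` is a hypothetical helper, not part of the module, and the install path is only an assumed example. It shows why an executable path containing spaces has to be wrapped in double quotes before it is placed in a command string.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Image::OCR::Tesseract): wrap a program
# path in double quotes so that a command line keeps it as a single token
# even when the path contains spaces.
sub quote_program {
    my ($path) = @_;
    return $path if $path =~ /^".*"$/;    # already quoted, leave as-is
    return qq{"$path"};
}

# Assumed example path; a real install location may differ.
my $convert = quote_program('C:\Program Files\ImageMagick\convert.exe');
my $cmd     = join ' ', $convert, 'in.jpg', '-compress', 'none', 'out.tif';
print "$cmd\n";
```

Without the quotes, the shell would treat "C:\Program" as the program name and "Files\ImageMagick\convert.exe" as its first argument.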

This works, but on a production Windows system I would feel much better doing a PPM or CPAN install from a repo.

Never mind, I figured it out, although I can't decide which solution is better.

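To get the installer working on Windows through the traditional "perl Makefile.PL, make, make test, make install" sequence, you have to edit the Makefile.PL script, include the missing Windows install module (Devel::AssertOS::MSWin32), and patch AssertEXE.pm to use "File::Which" instead of the built-in "which" command, which Windows lacks. Even after all of that, "Image::OCR::Tesseract" still has to be modified to put quotes around the program names when it runs "convert" and "tesseract" on the command line.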
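As a rough illustration of what replacing the external "which" command with a lookup done in Perl involves, here is a sketch that uses only core modules. `find_program` is a hypothetical name; this is neither the File::Which implementation nor the actual AssertEXE.pm patch, just the general idea of walking PATH yourself.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: search the directories in PATH for a program.
# On Windows, also try the executable extensions listed in PATHEXT.
sub find_program {
    my ($name) = @_;
    my @exts = ('');
    if ($^O =~ /MSWin/) {
        push @exts, split /;/, ($ENV{PATHEXT} || '.COM;.EXE;.BAT;.CMD');
    }
    for my $dir (File::Spec->path) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile($dir, $name . $ext);
            return $candidate if -f $candidate && -x _;
        }
    }
    return;    # not found
}

for my $prog (qw(tesseract convert)) {
    my $found = find_program($prog);
    print "$prog: ", (defined $found ? $found : 'not found on PATH'), "\n";
}
```

Because the lookup never shells out, it behaves the same way whether or not the operating system ships a "which" command.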

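Given the number of steps needed to make the installer work on Windows, and the fact that the module does not build any linked binary components, I'd say the best option for installing the Tesseract module and getting it working on Windows is to first install the following binary packages: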

ImageMagick http://www.imagemagick.org/script/binary-releases.php

Tesseract http://code.google.com/p/tesseract-ocr/downloads/list

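Next, find your Perl module directory; on my system it is "C:\Perl\site\lib\". Create a folder named "Image" if you don't already have one. Then open the Image folder and create a folder named "OCR". Open the OCR folder. At this point your path should be "C:\Perl\site\lib\Image\OCR\". Create a new text file named "Tesseract.pm" and copy in the following contents...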

    package Image::OCR::Tesseract;
    use strict;
    use Carp;
    use Cwd;
    use String::ShellQuote 'shell_quote';
    use Exporter;
    use vars qw(@EXPORT_OK @ISA $VERSION $DEBUG $WHICH_TESSERACT $WHICH_CONVERT %EXPORT_TAGS @TRASH);
    @ISA = qw(Exporter);
    @EXPORT_OK = qw(get_ocr get_hocr _tesseract convert_8bpp_tif tesseract);
    $VERSION = sprintf "%d.%02d", q$Revision: 1.24 $ =~ /(\d+)/g;
    %EXPORT_TAGS = ( all => \@EXPORT_OK );

    BEGIN {
       use File::Which 'which';
       $WHICH_TESSERACT = which('tesseract');
       $WHICH_CONVERT   = which('convert');

       # check before quoting: a quoted empty path would still be a true value
       $WHICH_TESSERACT or die("Is tesseract installed? Cannot find bin path to tesseract.");
       $WHICH_CONVERT   or die("Is convert installed? Cannot find bin path to convert.");

       # on Windows, quote the program paths so install locations with spaces work
       if($^O=~m/MSWin/) {
          $WHICH_TESSERACT='"'.$WHICH_TESSERACT.'"';
          $WHICH_CONVERT='"'.$WHICH_CONVERT.'"';
       }
    }

    END {
       scalar @TRASH or return;
       if ( $DEBUG ){
          print STDERR "Debug on, these are trash files:\n".join("\n",@TRASH) ;
       }
       else {
          unlink @TRASH;
       }
    }

    sub DEBUG { Carp::cluck("Image::OCR::Tesseract::DEBUG() deprecated") }

    sub get_hocr {
       my ($abs_image,$abs_tmp_dir,$lang)= @_;
       -f $abs_image or croak("$abs_image is not a file on disk");
       my $hocr="hocr";

       if(defined $abs_tmp_dir){
          -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
          $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
          my $abs_copy = "$abs_tmp_dir/$1";

          # TODO, what if source and dest are same, i want it to die
          require File::Copy;
          File::Copy::copy($abs_image, $abs_copy)
             or die("cant make copy of $abs_image to $abs_copy, $!");

          # change the image to get ocr from to be the copy
          $abs_image = $abs_copy;

          # since it's a copy. erase that on exit
          push @TRASH, $abs_image;
       }

       my $tmp_tif = convert_8bpp_tif($abs_image);
       push @TRASH, $tmp_tif; # for later delete

       _tesseract($tmp_tif,$lang,$hocr) || '';
    }

    sub get_ocr {
       my ($abs_image,$abs_tmp_dir,$lang)= @_;
       -f $abs_image or croak("$abs_image is not a file on disk");

       if(defined $abs_tmp_dir){
          -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");
          $abs_image=~/([^\/]+)$/ or die("cant match filename in path arg '$abs_image'");
          my $abs_copy = "$abs_tmp_dir/$1";

          # TODO, what if source and dest are same, i want it to die
          require File::Copy;
          File::Copy::copy($abs_image, $abs_copy)
             or die("cant make copy of $abs_image to $abs_copy, $!");

          # change the image to get ocr from to be the copy
          $abs_image = $abs_copy;

          # since it's a copy. erase that on exit
          push @TRASH, $abs_image;
       }

       my $tmp_tif = convert_8bpp_tif($abs_image);
       push @TRASH, $tmp_tif; # for later delete

       _tesseract($tmp_tif,$lang) || '';
    }

    sub convert_8bpp_tif {
       my ($abs_img,$abs_out) = (shift,shift);
       defined $abs_img or die('missing image arg');

       $abs_out ||= $abs_img.'.tmp.'.time().(int rand(9000)).'.tif';

       my @arg = ( $WHICH_CONVERT, $abs_img, '-compress','none','+matte', $abs_out );
       #die (join(" ", @arg));

       system(@arg) == 0 or die("convert $abs_img error.. $?");

       $DEBUG and warn("made $abs_out 8bpp tiff.");
       $abs_out;
    }

    # people expect tesseract to automatically convert
    *tesseract = \&_tesseract;

    sub _tesseract {
       my ($abs_image,$lang,$hocr) = @_;
       defined $abs_image or croak('missing image path arg');

       $abs_image=~/\.tif+$/i
          or warn("Are you sure '$abs_image' is a tif image? This operation may fail.");

       #my @arg = (
       #   $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image),
       #   (defined $lang and ('-l', $lang) ), '2>/dev/null'
       #);

       my $cmd =
          ( sprintf '%s %s %s',
             $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image)
          ) .
          ( defined $lang ? " -l $lang" : '' ) .
          ( defined $hocr ? " hocr" : '' ) .
          " 2>/dev/null";
       $DEBUG and warn "command: $cmd";

       system($cmd); # hard to check ==0

       my $txt = $abs_image.($hocr?".html":".txt");
       unless( -f $txt ){
          Carp::cluck("no text output for image '$abs_image'. (No text file '$txt' found on disk)");
          return;
       }

       $DEBUG and warn "Found text file '$txt'";

       my $content = (_slurp($txt) || '');
       $DEBUG and warn("content length of text in '$txt' from image '$abs_image' is ". length $content );

       push @TRASH, $txt;

       $content;
    }

    sub _slurp {
       my $abs = shift;
       open(FILE,'<', $abs) or die("can't open file for reading '$abs', $!");
       local $/;
       my $txt = <FILE>;
       close FILE;
       $txt;
    }

    1;

    __END__

    #sub _force_imgtype {
    #   my $img = shift;
    #   my $type = shift;
    #   my $delete_original = shift;
    #   $delete_original ||=0;
    #
    #
    #   if($img=~/\.$type$/i){
    #      return $img;
    #   }
    #
    #   my $img_out= $img;
    #   $img_out=~s/\.\w{1,5}$/\.$type/ or die("cant get file ext for $img");
    #
    #
    #
    #}

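Save and close. If you had a command-line session open before installing the ImageMagick and Tesseract binaries, close it and open a new one so that the updated PATH is picked up. Test the module with the following script: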

    use Image::OCR::Tesseract;

    my $image = 'SomeImageFileThatContainsText.jpg';
    my $text = Image::OCR::Tesseract::get_ocr($image);
    print "Text...\n";
    print $text."\n";
    print "Normal Exit\n";
    exit;

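That's it. Messy, I know, but there is no good solution; the module's installer really needs to be updated to support Windows (and other) systems, even though the actual module code runs almost unmodified. Really, if Tesseract and ImageMagick were installed to paths without spaces, the "Image::OCR::Tesseract" module code would not need any changes at all, but this small tweak lets the supporting executables be installed anywhere, including their default locations.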
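For the manual placement step described earlier, the site library directory may not be "C:\Perl\site\lib\" on every system. The following is a small sketch, using only core modules, of how one might look it up and create the Image\OCR folders. `ensure_module_dir` is a hypothetical helper name, and the script only prints the target location rather than touching the real site library.

```perl
use strict;
use warnings;
use Config;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: create Image/OCR under the given library root
# (if missing) and return the full path where Tesseract.pm belongs.
sub ensure_module_dir {
    my ($lib_root) = @_;
    my $dir = File::Spec->catdir($lib_root, 'Image', 'OCR');
    make_path($dir) unless -d $dir;
    return File::Spec->catfile($dir, 'Tesseract.pm');
}

# $Config{sitelib} is the site library directory of the running perl.
# Only report the target here; nothing is created at this point.
my $target = File::Spec->catfile($Config{sitelib}, 'Image', 'OCR', 'Tesseract.pm');
print "Tesseract.pm would go to: $target\n";
```

Calling `ensure_module_dir($Config{sitelib})` from a session with write access to that directory would then create the folders in one step.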