OCR – Getting text from an image using tesseract 3.0 and imagemagick 6.6.5

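I am trying to build a shell script that lets me search for text in an image. Given the text, the script tries its best to get that text out of the image. I would like your input, because the script seems to work for most images, but not for those where the font color of the text is similar to the small area immediately surrounding it.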

    #!/bin/bash
    #
    # imt-ocr.sh is an ImageMagick/Tesseract OCR tool used for finding text in an image
    #
    # Arguments:
    #   1 -- image filename (with path)
    #   2 -- text to search in image (default to '')
    #   3 -- occurence of text (default to 1)
    # Usage:
    #   imt-ocr.sh [image_filename] [text_to_search] [occurence]
    #

    image=$1
    txt=$2
    occurence=$3

    # Default to 1
    if [ -z "$occurence" ]
    then
        occurence=1
    fi

    get_major_color ()
    # Returns the major color of an image with its hex value
    # Parameter: Image filename (with path)
    # Return format: Returns a string "hex_val_of_color major_color_name"
    {
        convert "$1" -format %c histogram:info: > x.txt
        awk '{print $1}' x.txt > x1.txt
        h=$(sort -n x1.txt | tail -1)
        color_info=$(grep "$h" x.txt | cut -d '#' -f2)
        rm -f x.txt x1.txt
        echo "$color_info"
    }

    invert_color ()
    # Inverts the color hex value
    # Parameter: Hex value to be inverted
    # Return format: Returns in hex
    {
        input_color_hex=$1        # Input color's hex value
        white_color_hex=FFFFFF    # White color's hex value
        inv_color_hex=$(printf '%06X' $((0x$white_color_hex - 0x$input_color_hex)))
        echo "$inv_color_hex"
    }

    print_result ()
    # Prints the OCR output of the last attempt and removes the temporary files
    {
        echo "IMT-OCR-LOG: Printing out the last text found on image"
        echo "IMT-OCR-LOG: ======================================================"
        cat OUT.txt
        echo "IMT-OCR-LOG: ======================================================"
        rm -f "$tmp_img" OUT.txt
    }

    start_scale=100
    end_scale=300
    increment_scale=100
    tmp_img=dst.tif
    attempt=1

    for ((scale=start_scale; scale <= end_scale; scale+=increment_scale, attempt++))
    do
        echo "IMT-OCR-LOG: Scaling image to $scale% in attempt #$attempt"
        convert "$image" -type Grayscale -scale "$scale%" "$tmp_img"
        tesseract "$tmp_img" OUT
        found_oc=$(grep -o "$txt" OUT.txt | wc -l)
        echo "IMT-OCR-LOG: Found $found_oc occurence(s) of text '$txt' in attempt #$attempt"

        if [ "$occurence" -le "$found_oc" ] && [ "$found_oc" -ne 0 ]
        then
            print_result
            exit 0    # text found (the original exited with 1 here)
        fi

        echo "IMT-OCR-LOG: Getting major color of image in attempt #$attempt"
        color_info=$(get_major_color "$image")
        true_color=$(echo "$color_info" | awk '{print $2}')
        true_val=$(echo "$color_info" | awk '{print $1}')
        echo "IMT-OCR-LOG: Major color of image is '$true_color' with hex value of $true_val in attempt #$attempt"

        # Blur the image
        echo "IMT-OCR-LOG: Bluring image in attempt #$attempt"
        convert "$tmp_img" -blur 1x65535 "$tmp_img"

        # Flip the color
        inverted_val=$(invert_color "$true_val")
        echo "IMT-OCR-LOG: Inverting the major color of image from 0x$true_val to 0x$inverted_val in attempt #$attempt"
        convert "$tmp_img" -fill "#$inverted_val" -opaque "#$true_val" "$tmp_img"

        # Sharpen the image
        echo "IMT-OCR-LOG: Sharpening image in attempt #$attempt"
        convert "$tmp_img" -sharpen 1x65535 "$tmp_img"

        # Find text
        tesseract "$tmp_img" OUT
        found_oc=$(grep -o "$txt" OUT.txt | wc -l)
        echo "IMT-OCR-LOG: Found $found_oc occurence(s) of text '$txt' in attempt #$attempt"

        if [ "$found_oc" -ne 0 ] && [ "$occurence" -le "$found_oc" ]
        then
            print_result
            exit 0    # text found (the original exited with 1 here)
        fi

        rm -f OUT.txt
    done

    rm -f "$tmp_img"
    exit 1    # text not found in any attempt
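One detail of the search step is worth noting: `grep -o "$txt"` treats the search text as a regular expression and is case-sensitive, so a search term containing `.` or differing only in case can be miscounted. Below is a minimal sketch of a counting helper, assuming literal and case-insensitive matching is what is wanted. The name `count_occurrences` is hypothetical and not part of the script above.

```shell
# count_occurrences FILE TEXT
# Prints how many times TEXT occurs in FILE, treating TEXT as a literal
# string (-F) and ignoring case (-i). Hypothetical helper for illustration.
count_occurrences() {
    local file="$1" text="$2"
    # An empty search text cannot match anything useful; report 0.
    [ -n "$text" ] || { echo 0; return; }
    # -o prints each match on its own line, so counting lines counts matches.
    grep -o -i -F -e "$text" "$file" | wc -l | tr -d ' '
}
```

In the script it could stand in for the two `grep -o ... | wc -l` pipelines, for example `found_oc=$(count_occurrences OUT.txt "$txt")`.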

Here is an example that shows the problem. Sample image (test.jpeg): http://img.zgserver.com/linux/03-Word-Collage-iPad.jpg

    [admin@ba-callgen image-magick-tesseract-processing]$ sh imt-ocr.sh test.jpeg Common
    IMT-OCR-LOG: Scaling image to 100% in attempt #1
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Common' in attempt #1
    IMT-OCR-LOG: Getting major color of image in attempt #1
    IMT-OCR-LOG: Major color of image is 'grey96' with hex value of F5F5F5 in attempt #1
    IMT-OCR-LOG: Bluring image in attempt #1
    IMT-OCR-LOG: Inverting the major color of image from 0xF5F5F5 to 0x0A0A0A in attempt #1
    IMT-OCR-LOG: Sharpening image in attempt #1
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Common' in attempt #1
    IMT-OCR-LOG: Scaling image to 200% in attempt #2
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 1 occurence(s) of text 'Common' in attempt #2
    IMT-OCR-LOG: Printing out the last text found on image
    IMT-OCR-LOG: ======================================================
    Settings M... Text Common words Exclude numbers word case Theme & Layuul Color theme Fnnl Word layout Clrien lalion 7301 Lrmclsc ape ‘OTC Ergl sw v.-ords > li( ` I):Jntc1'\:1r\qa ) Landon Spring > Hough Trad > H3'fJ|1d :-Ialf > HL
    IMT-OCR-LOG: ======================================================
    [admin@ba-callgen image-magick-tesseract-processing]$ sh imt-ocr.sh test.jpeg Portrait
    IMT-OCR-LOG: Scaling image to 100% in attempt #1
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #1
    IMT-OCR-LOG: Getting major color of image in attempt #1
    IMT-OCR-LOG: Major color of image is 'grey96' with hex value of F5F5F5 in attempt #1
    IMT-OCR-LOG: Bluring image in attempt #1
    IMT-OCR-LOG: Inverting the major color of image from 0xF5F5F5 to 0x0A0A0A in attempt #1
    IMT-OCR-LOG: Sharpening image in attempt #1
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #1
    IMT-OCR-LOG: Scaling image to 200% in attempt #2
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #2
    IMT-OCR-LOG: Getting major color of image in attempt #2
    IMT-OCR-LOG: Major color of image is 'grey96' with hex value of F5F5F5 in attempt #2
    IMT-OCR-LOG: Bluring image in attempt #2
    IMT-OCR-LOG: Inverting the major color of image from 0xF5F5F5 to 0x0A0A0A in attempt #2
    IMT-OCR-LOG: Sharpening image in attempt #2
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #2
    IMT-OCR-LOG: Scaling image to 300% in attempt #3
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #3
    IMT-OCR-LOG: Getting major color of image in attempt #3
    IMT-OCR-LOG: Major color of image is 'grey96' with hex value of F5F5F5 in attempt #3
    IMT-OCR-LOG: Bluring image in attempt #3
    IMT-OCR-LOG: Inverting the major color of image from 0xF5F5F5 to 0x0A0A0A in attempt #3
    IMT-OCR-LOG: Sharpening image in attempt #3
    Tesseract Open Source OCR Engine with Leptonica
    IMT-OCR-LOG: Found 0 occurence(s) of text 'Portrait' in attempt #3
    [admin@ba-callgen image-magick-tesseract-processing]$

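As you can see, I am able to find the text "Common" but not "Portrait". The reason is the font color of "Portrait". Any help improving this script would be appreciated.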

I am using CentOS 5.

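When manipulating the input image, don't artificially limit yourself to evaluating only one or two methods. At the moment you seem to be using only -blur and -sharpen.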

You should also consider using the following operations:

  • -contrast
  • -despeckle
  • -edge
  • -negate
  • -normalize
  • -posterize
  • -type grayscale
  • -monochrome
  • -gamma
  • -antialias / +antialias

Input image: [original screenshot]

Have a look at the result produced by this command:

    convert 03-Word-Collage-iPad.jpeg \
        -scale 1000% \
        -blur 1x65535 -blur 1x65535 -blur 1x65535 \
        -contrast \
        -normalize \
        -despeckle -despeckle \
        -type grayscale \
        -sharpen 1 \
        -posterize 3 \
        -negate \
        -gamma 100 \
        -compress zip \
        a.tif

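Output image: [output image]
(Sorry, when a TIFF is uploaded to this site it is automatically converted to PNG, so downloading the image above does not actually give you my TIFF, but you will still see my real result.)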

Note 1: I tested this with the following ImageMagick version:

    convert -version
    Version: ImageMagick 6.7.6-9 2012-05-12 Q16 http://www.imagemagick.org
    Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
    Features:

Note 2: Older or newer versions of ImageMagick may behave differently, especially when it comes to -posterize.
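Because the result depends on the ImageMagick version, it can help to record the version alongside each OCR run. Here is a small sketch that extracts the version token from a banner in the format shown in Note 1. The helper name `im_version` is hypothetical, and the sketch assumes the banner contains a line beginning with "Version: ImageMagick".

```shell
# im_version
# Reads `convert -version` style output on stdin and prints the version
# token (third field) of the "Version: ImageMagick ..." line, if present.
im_version() {
    awk '$1 == "Version:" && $2 == "ImageMagick" { print $3; exit }'
}
```

For example, `convert -version | im_version` could be logged at the start of a script so that differing results can later be traced back to the version in use.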

Here is Tesseract's OCR result for a.tif:

    tesseract a.tif OUT && cat OUT.txt
    Tesseract Open Source OCR Engine v3.01 with Leptonica
    Page 0
    Text
    Common words Remove English words >
    Exclude numbers
    Word case Don't change 1+
    Theme & Layout
    Color theme London Spring >
    Font Rough Trad >
    Word layout Half and Half >
    Orientation Landscape
    Q u -0 "H I

Update:

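I verified that the latest ImageMagick release, 6.7.9-0 (released yesterday), does not produce exactly the same result as the command and screenshot above (which used version 6.7.6-9). Here is the difference: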

[image: output of the same command with ImageMagick 6.7.9-0]

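In any case, I'm sure that if you tweak my command a little and play with the various parameters, you can get it to work for you whatever your ImageMagick version is.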
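The parameter tweaking can also be automated. The sketch below re-runs a shortened form of the answer's command (without -gamma and -compress) for a few -scale and -posterize values and reports the first combination where Tesseract's output contains the text. The function name `sweep_params` and the candidate values are assumptions chosen for illustration, not tested recommendations.

```shell
# sweep_params IMAGE TEXT
# Tries several -scale / -posterize combinations and prints the first one
# for which Tesseract's output contains TEXT. Returns 1 if none does.
sweep_params() {
    local image="$1" text="$2"
    local workdir scale levels
    workdir=$(mktemp -d) || return 1

    for scale in 200 400 1000; do
        for levels in 2 3 4; do
            convert "$image" -scale "${scale}%" -blur 1x65535 \
                -contrast -normalize -despeckle -type grayscale \
                -sharpen 1 -posterize "$levels" -negate \
                "$workdir/a.tif"
            # Drop stale output before Tesseract writes $workdir/out.txt.
            rm -f "$workdir/out.txt"
            tesseract "$workdir/a.tif" "$workdir/out" >/dev/null 2>&1
            if grep -q -F -e "$text" "$workdir/out.txt" 2>/dev/null; then
                echo "scale=${scale}% posterize=${levels}"
                rm -rf "$workdir"
                return 0
            fi
        done
    done

    rm -rf "$workdir"
    return 1
}
```

Running `sweep_params test.jpeg Portrait` either prints the first working combination or returns a non-zero status, in which case the candidate lists can be widened.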