Tesseract training data for Hebrew

May 18, 2011 | July 6, 2011 | Roi | Uncategorized

Now that my B.Sc. project is behind me, I can share the Tesseract training data I compiled for Hebrew.

Links:

training_fonts.zip – the training files I used
tesseract-2.00.heb.tar.gz – compiled for tesseract 2
heb.traineddata.gz – compiled for tesseract 3
heb.traineddata.zip – the same tesseract 3 data as a zip archive

24 thoughts to “Tesseract training data for Hebrew”

naim94a says:

July 6, 2011 at 12:38 am

The file “heb.traineddata.gz” is not a valid archive file. Please upload it again.

1. Roi says:
  
  July 6, 2011 at 7:53 pm
  
  It seems to extract fine on Linux but fails on Windows with WinRAR.
  I uploaded it as a zip file as well.
  
  1. zdenop says:
    
    October 30, 2011 at 2:53 pm
    
    $ gzip -d heb.traineddata.gz
    
    gzip: heb.traineddata.gz: not in gzip format
    
    $ file heb.traineddata.gz
    heb.traineddata.gz: PCX ver. 2.5 image data
    
    Tested on Mandriva Linux 2011, 64-bit.
    
    1. Roi says:
      
      October 30, 2011 at 3:30 pm
      
      Strange. I tested it on CentOS.
      
      roid@xxxx ~/tmp $ file heb.traineddata.gz
      heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from FAT filesystem (MS-DOS, OS/2, NT), last modified: Wed May 18 15:16:22 2011
      
      roid@xxxx ~/tmp $ gunzip heb.traineddata.gz
      -rw-r--r-- 1 roid ecryptfs 522575 Oct 30 15:30 heb.traineddata
      
      Use the zip one; it’s the same file.
      
      1. zdenop says:
        
        October 30, 2011 at 3:55 pm
        
        How did you create the gzip file?
        
        Roi says:
        
        October 30, 2011 at 4:19 pm
        
        Probably with some Windows utility. But if it worked on CentOS, it should have worked on other distros.
        Maybe your gzip is old? gzip 1.3.5
        
        Anyway, I ran gunzip and gzip on it again.
        You could also just use the zipped one;
        it’s the same file.
        
        roid@xxxx ~/tmp $ file heb.traineddata.gz
        heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from Unix, last modified: Wed May 18 15:17:18 2011, max compression
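
For anyone who hits the same “not in gzip format” error, one way to check a download before unpacking it is to look at its first two bytes: a gzip stream always starts with 0x1f 0x8b. The sketch below is not from the post. The function names are made up for illustration, and it only tells you whether the file on disk is gzip, not why a download ended up in a different format.

```python
import gzip
import shutil

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip_file(src, dst):
    """Decompress `src` into `dst`, refusing files that are not gzip."""
    if not looks_like_gzip(src):
        raise ValueError("%s does not start with the gzip magic bytes" % src)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Calling `gunzip_file("heb.traineddata.gz", "heb.traineddata")` either writes the unpacked file or raises `ValueError`, which is a clearer signal than a half-written output file.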
        
naim94a says:

July 12, 2011 at 10:18 pm

Thank you very much!!!!

heiji says:

August 27, 2011 at 4:23 pm

Thanks for this good article.

Shlomi says:

October 6, 2011 at 3:49 pm

Hi Roi,
Thanks for the post. For some reason, my Tesseract (version 3) crashes when working with your training data (-l heb).
I created my own heb training data; it does not crash, but the Hebrew comes out left to right.
Is there anything I can do about it?

1. Roi says:
  
  October 6, 2011 at 7:01 pm
  
  You need to reverse the string. Tesseract doesn’t do it for you.
  
  1. soli says:
    
    October 19, 2012 at 12:38 pm
    
    If we reverse the string, we would need to define all of the words that are in heb!
    Could you please show it in an example?
    
  2. soli says:
    
    October 19, 2012 at 12:43 pm
    
    Do you mean flipping the input image?
    
    1. Roi says:
      
      October 22, 2012 at 3:16 pm
      
      Just a normal string reverse.
      You should try it and see the output.
      
Shlomi Izikovich says:

October 23, 2011 at 6:13 pm

Does anyone know what the many @ signs in Tesseract’s output file are?

zdenop says:

October 30, 2011 at 3:58 pm

Do you plan to join Enrico Segre and his training effort (http://tesseract-ocr.googlecode.com/issues/attachment?aid=3980168041635874079&name=heb-rashi-stam4.tgz&token=63ae8889c508b9706b619f4ce8066685)?

Daniel says:

January 19, 2013 at 1:36 am

Hi Roi,
Thanks, but like the poster above, your Hebrew training data is also making Tesseract (version 3.02) crash for me.
I get the following error when I try to run Tesseract on a JPEG with your training data (heb.traineddata.gz):

num_edges_ > 0:Error:Assert failed:in file ..\..\dict\dawg.cpp, line 320

How can I fix this? Thanks in advance!

1. Roi says:
  
  January 20, 2013 at 12:59 pm
  
  You need to recompile the training data.
  
Daniel says:

January 25, 2013 at 1:13 am

Hi Roi,
Sorry, but I’m completely clueless. Can you please please be more specific? What commands should I run?

I tried searching for recompiling Tesseract training data, but no luck. Shouldn’t the .traineddata be already compiled? If not, do I have to recompile from the “training_fonts.zip” file you provided? I just don’t understand.

Please help,
Thanks!

1. Roi says:
  
  January 27, 2013 at 2:13 pm
  
  Yes, when I said recompile I meant redoing the training.
  The attached files were built with an old version, so you need to redo the training with Tesseract 3 and training_fonts.zip.
  
  Look here:
  http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
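
To give a rough idea of what “redo the training” involves, the sketch below lists the main Tesseract 3 training commands as they are described on that wiki page, to the best of my recollection. Treat it as an outline under assumptions: the file stems are hypothetical (the actual names inside training_fonts.zip may differ), a `font_properties` file is assumed to exist already, and Tesseract 3.02 adds a shape-clustering step and a `shapetable` file that are left out here. The functions only build the command lists and run nothing, so check each step against the wiki for your version first.

```python
def training_commands(lang, stems):
    """Build the argv lists for the feature-extraction and clustering steps.

    `stems` are base names such as "heb.myfont.exp0" (hypothetical), each with
    a matching .tif image and a corrected .box file in the working directory.
    """
    cmds = []
    for stem in stems:
        # Produce <stem>.tr feature files from each image/box pair.
        cmds.append(["tesseract", stem + ".tif", stem, "nobatch", "box.train"])
    boxes = [stem + ".box" for stem in stems]
    trs = [stem + ".tr" for stem in stems]
    # Collect the character set, then cluster the extracted features.
    cmds.append(["unicharset_extractor"] + boxes)
    cmds.append(["mftraining", "-F", "font_properties", "-U", "unicharset",
                 "-O", lang + ".unicharset"] + trs)
    cmds.append(["cntraining"] + trs)
    return cmds


def rename_plan(lang):
    """Output files that need the language prefix before they are combined."""
    return [(name, lang + "." + name)
            for name in ("inttemp", "normproto", "pffmtable")]


def combine_command(lang):
    """Final step: pack the prefixed files into <lang>.traineddata."""
    return ["combine_tessdata", lang + "."]
```

One possible way to use it: pass each list from `training_commands("heb", stems)` to `subprocess.check_call`, rename the files from `rename_plan("heb")` with `os.rename`, and then run `combine_command("heb")`.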
  
Aharon Varady says:

November 19, 2013 at 7:03 pm

Hi Roi,

Do you know anyone who has been training Tesseract to recognize Hebrew or Yiddish with niqqud? Those of us who are working on Hebrew Wikisource are hoping that an open source OCR solution is available. I have high hopes for Tesseract but don’t have a clear idea of how I can begin to train it to recognize printed texts of Hebrew with niqqud at different font sizes (something that Kobi Zamir’s HOCR software had a very hard time with).

1. Chaim says:
  
  February 23, 2014 at 1:31 am
  
  There seems to be a project dealing with this:
  http://www.cs.bgu.ac.il/~elhadad/hocr/
  
  1. Aharon Varady says:
    
    January 29, 2015 at 9:39 pm
    
    Thanks Chaim — I was aware of that work. For anyone interested, I’ve written something of a summary of development in open-source Hebrew OCR as of January 2015, here.
    
lior says:

September 17, 2015 at 4:50 pm

Are you familiar with an automatic tool for Tesseract 3.02 training?
I didn’t find one that works.

1. Roi says:
  
  September 18, 2015 at 10:57 am
  
  No, sorry. I just took what I did for Tesseract 2 and compiled it with Tesseract 3 without any modifications.
  