Now that my B.Sc. project is behind me, I can share the Tesseract training data I compiled for Hebrew.
Links:
training_fonts.zip – the training files I used
tesseract-2.00.heb.tar.gz – compiled for tesseract 2
heb.traineddata.gz – compiled for tesseract 3
heb.traineddata.zip
The file “heb.traineddata.gz” is not a valid archive file. Please upload it again.
It seems to extract fine on Linux but fails on Windows with WinRAR.
I uploaded it as a zip file.
$ gzip -d heb.traineddata.gz
gzip: heb.traineddata.gz: not in gzip format
$ file heb.traineddata.gz
heb.traineddata.gz: PCX ver. 2.5 image data
Tested on Mandriva Linux 2011, 64-bit.
Strange. I tested on CentOS:
roid@xxxx ~/tmp $ file heb.traineddata.gz
heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from FAT filesystem (MS-DOS, OS/2, NT), last modified: Wed May 18 15:16:22 2011
roid@xxxx ~/tmp $ gunzip heb.traineddata.gz
-rw-r--r-- 1 roid ecryptfs 522575 Oct 30 15:30 heb.traineddata
Use the zip one; it's the same.
How did you create the gzip file?
Probably with some Windows utility. But if it worked on CentOS, it should have worked on other distros.
Maybe your gzip is old? gzip 1.3.5
Anyway, I ran gunzip and then gzip on it again.
And you could just use the zipped one.
It's the same file.
roid@xxxx ~/tmp $ file heb.traineddata.gz
heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from Unix, last modified: Wed May 18 15:17:18 2011, max compression
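For anyone who wants to check their own download without relying on `file` or a particular gzip build, here is a minimal Python sketch. It only assumes the standard gzip format: every gzip stream starts with the two bytes 1f 8b. The function names `has_gzip_magic` and `describe_download` are made up for this example.

```python
import gzip
import zlib

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def has_gzip_magic(data: bytes) -> bool:
    """Return True if the given bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def describe_download(path: str) -> str:
    """Report whether the file at `path` is a readable gzip stream."""
    with open(path, "rb") as f:
        head = f.read(2)
    if not has_gzip_magic(head):
        return "not gzip: first bytes are " + head.hex()
    try:
        # Decompress the whole stream in chunks to catch truncation or corruption.
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        return "gzip header present but stream is damaged: " + str(exc)
    return "valid gzip stream"
```

Running `describe_download("heb.traineddata.gz")` on the downloaded file shows whether the bytes on disk are gzip at all, which helps separate a bad download from a problem with the extraction tool.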
Thank you very much!!!!
Thanks for this good article.
Hi Roi,
Thanks for the post. For some reason, my Tesseract (3) crashes when working with your training data (-l heb).
I created my own heb training data. It does not crash, but the Hebrew comes out left to right.
Anything I can do about it?
You need to reverse the string. Tesseract doesn’t do it for you.
If we reverse the string, do we need to define all of the words that are in Hebrew?
Would you please show it with an example?
You mean flipping the input image?
Just a normal string reverse.
You should try it and see the output.
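As a rough illustration of the string reverse suggested above, here is a minimal Python sketch. It assumes the OCR output arrives one line at a time in visual (left-to-right) order. A plain reverse would also flip numbers and Latin words, so this version flips those runs back. The names `visual_to_logical` and `fix_text` are hypothetical, and the handling of mixed text is a simplification, not a full bidirectional-text algorithm.

```python
import re

# Runs of ASCII letters/digits, optionally joined by simple punctuation
# (e.g. "3.02"), which should keep their left-to-right order.
_LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[.,:/-][A-Za-z0-9]+)*")


def visual_to_logical(line: str) -> str:
    """Reverse one line, then restore the order of Latin/number runs."""
    reversed_line = line[::-1]
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


def fix_text(text: str) -> str:
    """Apply the per-line reversal to a whole block of OCR output."""
    return "\n".join(visual_to_logical(line) for line in text.split("\n"))


if __name__ == "__main__":
    # Escapes are used so the example does not depend on how an editor
    # displays right-to-left text.
    ocr_line = "2011 \u05ea\u05e0\u05e9"
    print(fix_text(ocr_line))
```

No word list is needed for this: the reversal works on characters, not on a dictionary of Hebrew words.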
Does anyone know what the many @ signs in Tesseract's output file are?
Do you plan to join Enrico Segre and his training effort (http://tesseract-ocr.googlecode.com/issues/attachment?aid=3980168041635874079&name=heb-rashi-stam4.tgz&token=63ae8889c508b9706b619f4ce8066685)?
Hi Roi,
Thanks, but like the poster above, your Hebrew training data is also making Tesseract (version 3.02) crash for me.
I get the following error when I try to run Tesseract on a JPEG with your training data (heb.traineddata.gz):
num_edges_ > 0:Error:Assert failed:in file ..\..\dict\dawg.cpp, line 320
How can I fix this? Thanks in advance!
You need to recompile the training data.
Hi Roi,
Sorry, but I’m completely clueless. Can you please please be more specific? What commands should I run?
I tried searching for recompiling Tesseract training data, but no luck. Shouldn’t the .traineddata be already compiled? If not, do I have to recompile from the “training_fonts.zip” file you provided? I just don’t understand.
Please help,
Thanks!
Yes, when I said recompile I meant training.
The attached files were made with an old version, so you need to redo the training with Tesseract 3 and training_fonts.zip.
Look here:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
Hi Roi,
Do you know anyone who has been training Tesseract to recognize Hebrew or Yiddish with niqqud? Those of us who are working on Hebrew Wikisource are hoping that an open source OCR solution is available. I have high hopes for Tesseract but don’t have a clear idea of how I can begin to train it to recognize printed texts of Hebrew with niqqud at different font sizes (something that Kobi Zamir’s HOCR software had a very hard time with).
There seems to be a project dealing with this:
http://www.cs.bgu.ac.il/~elhadad/hocr/
Thanks Chaim — I was aware of that work. For anyone interested, I’ve written something of a summary of development in open-source Hebrew OCR as of January 2015, here.
Are you familiar with an automatic tool for Tesseract 3.02 training?
I didn’t find one that works.
No, sorry. I just took what I did for Tesseract 2 and compiled it with Tesseract 3 without any modifications.