24 thoughts to “Tesseract training data for Hebrew”

      1. $ gzip -d heb.traineddata.gz

        gzip: heb.traineddata.gz: not in gzip format

        $ file heb.traineddata.gz
        heb.traineddata.gz: PCX ver. 2.5 image data

        Tested on Mandriva Linux 2011, 64-bit.

        1. Strange. I tested on CentOS:

          roid@xxxx ~/tmp $ file heb.traineddata.gz
          heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from FAT filesystem (MS-DOS, OS/2, NT), last modified: Wed May 18 15:16:22 2011

          roid@xxxx ~/tmp $ gunzip heb.traineddata.gz
          -rw-r--r-- 1 roid ecryptfs 522575 Oct 30 15:30 heb.traineddata

          Use the zip one; it's the same.

            1. Probably from some Windows utility. But if it worked on CentOS, it should have worked on other distros.
              Maybe your gzip is old? gzip 1.3.5

              Anyway, I ran gunzip and gzip on it again.
              You could also just use the zipped one.
              It's the same file.

              roid@xxxx ~/tmp $ file heb.traineddata.gz
              heb.traineddata.gz: gzip compressed data, was "heb.traineddata", from Unix, last modified: Wed May 18 15:17:18 2011, max compression
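For anyone who hits the same "not in gzip format" message: a quick way to tell a damaged download from a gzip problem is to look at the first two bytes of the file. Every gzip stream starts with 0x1f 0x8b. Below is a minimal Python sketch of that check. The helper names `looks_like_gzip` and `gunzip` are made up for this illustration, and it only inspects the header, so a file can pass the check and still be truncated further in.

```python
import gzip
import shutil

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip(src, dst):
    """Decompress src into dst; raise ValueError if src is not gzip."""
    if not looks_like_gzip(src):
        raise ValueError(f"{src}: not in gzip format")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

If `looks_like_gzip("heb.traineddata.gz")` returns False, the file on disk is probably not what the server sent, and downloading it again (or taking the zip version) is likely a better bet than trying another gzip build.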

  1. Hi Roi,
    Thanks for the post. For some reason, my Tesseract (3) crashes when working with your training data (-l heb).
    I created my own heb training data; it does not crash, but the Hebrew comes out left to right.
    Is there anything I can do about it?
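One possible workaround, assuming the recognized text really is coming out in visual order (each line stored left to right as it appears on the page), is to reverse every line after OCR. The Python sketch below does that and then flips digit runs back, since numbers are written left to right even inside Hebrew text. The function name `visual_to_logical` is hypothetical, and the sketch does not handle Latin words or punctuation pairs, so treat it as a starting point and not a full bidi implementation.

```python
import re

# A run of ASCII digits, optionally joined by common separators (12.5, 18/05/2011).
_DIGIT_RUN = re.compile(r"[0-9]+(?:[.,:/-][0-9]+)*")


def visual_to_logical(line):
    """Reverse a visually ordered line, keeping numbers readable."""
    reversed_line = line[::-1]
    # The full reversal also flips numbers, so flip each digit run back.
    return _DIGIT_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)
```

It would be applied to each line of the OCR output file separately, with the trailing newline stripped first so it does not end up at the start of the line.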

  2. Hi Roi,
    Thanks, but like the poster above, your Hebrew training data is also making Tesseract (version 3.02) crash for me.
    I get the following error when I try to run Tesseract on a JPEG with your training data (heb.traineddata.gz):

    num_edges_ > 0:Error:Assert failed:in file ..\..\dict\dawg.cpp, line 320

    How can I fix this? Thanks in advance!

  3. Hi Roi,
    Sorry, but I’m completely clueless. Can you please please be more specific? What commands should I run?

    I tried searching for recompiling Tesseract training data, but no luck. Shouldn’t the .traineddata be already compiled? If not, do I have to recompile from the “training_fonts.zip” file you provided? I just don’t understand.

    Please help,
    Thanks!

  4. Hi Roi,

    Do you know anyone who has been training Tesseract to recognize Hebrew or Yiddish with niqqud? Those of us who are working on Hebrew Wikisource are hoping that an open source OCR solution is available. I have high hopes for Tesseract but don’t have a clear idea of how I can begin to train it to recognize printed texts of Hebrew with niqqud at different font sizes (something that Kobi Zamir’s HOCR software had a very hard time with).
