Tesseract is an open source OCR system currently developed by Google.

Today Tesseract is the only open source OCR system able to deliver accurate recognition results.

How We Use Tesseract

We use Tesseract as the internal OCR engine for ImgHog in our text reading solutions. Since a solution usually contains both preprocessing and postprocessing stages, all calls to Tesseract are wrapped in ImgHog algorithms.
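As a rough illustration of this wrapping, the sketch below chains a preprocessing step, an OCR engine call, and a postprocessing step. It is not ImgHog code: the names TextReadingPipeline, OcrEngine, and normalize_whitespace are hypothetical, and the engine is passed in as a plain callable so that a real Tesseract binding could be substituted.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any callable that takes image bytes and returns raw text.
# A real Tesseract binding would be adapted to this shape.
OcrEngine = Callable[[bytes], str]


@dataclass
class TextReadingPipeline:
    """Wraps an OCR engine between a preprocessing and a postprocessing stage."""

    preprocess: Callable[[bytes], bytes]   # e.g. denoising, binarization
    engine: OcrEngine                      # the wrapped OCR call
    postprocess: Callable[[str], str]      # e.g. cleanup, validation

    def read(self, image: bytes) -> str:
        cleaned = self.preprocess(image)
        raw_text = self.engine(cleaned)
        return self.postprocess(raw_text)


def normalize_whitespace(text: str) -> str:
    """Example postprocessing step: collapse runs of whitespace."""
    return " ".join(text.split())
```

Because the caller only ever sees read(), the stages around the engine can change without affecting code that uses the pipeline.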

After many years of development, we have not only deepened our knowledge of Tesseract's capabilities and architecture but also worked out our own systematic approach to Tesseract training and accuracy improvement.

Tesseract Training

One of the most important features of Tesseract is its full training capability. Out of the box, the current version of Tesseract is trained to recognize a few dozen fonts and languages. It can also be trained for almost any additional font in almost any language.

This also means that Tesseract can be trained to recognize almost any predefined set of symbols. This capability is very useful for CustomOCR solutions, given the nature of the text recognition problems they solve: they often require recognizing uncommon fonts, rare character sets, or known characters with unusual shapes.
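Once a custom model has been trained, it is selected at recognition time like any other language. The sketch below only builds the command line and does not run it. It assumes Tesseract 4 or later (where the page segmentation flag is spelled --psm); the helper build_tesseract_command, the model name "customfont", and the directory "/opt/models" are hypothetical placeholders.

```python
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    model: str = "eng",
    tessdata_dir: Optional[str] = None,
    psm: int = 7,
) -> List[str]:
    """Build an argument list for the tesseract command-line tool.

    model        -- name of a .traineddata file, without the extension
    tessdata_dir -- directory holding that file, if not the default one
    psm          -- page segmentation mode (7 treats the image as one text line)
    """
    # "stdout" as the output base makes tesseract print the recognized text.
    cmd = ["tesseract", image_path, "stdout", "-l", model, "--psm", str(psm)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical custom model "customfont" stored in /opt/models.
command = build_tesseract_command(
    "label.png", model="customfont", tessdata_dir="/opt/models"
)
```

The resulting list can be passed to subprocess.run on a machine where Tesseract and the trained model are installed.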

Character Classifier

Tesseract's greatest strength is its character classifier, the part that performs the actual character recognition. This is what makes Tesseract the most accurate of the open source OCR systems.