The talk will present the results of our thesis, in which we studied the stages of handwritten document processing and, primarily, recognition. Processing consists of pre-processing for image enhancement, followed by segmentation, which detects first text lines, then words and finally characters. At the recognition stage, a feature vector is extracted for every character found during segmentation so that it can be assigned to one of a set of predefined classes using supervised machine learning techniques. We studied several feature extraction techniques and developed a methodology that extracts features and classifies characters using a hierarchical scheme. Tested on well-known contemporary handwritten character databases, this methodology achieved recognition rates that are among the best reported in the literature. The methodology was also applied to handwritten digits, cursive handwritten words and characters extracted from historical documents, either handwritten or printed; the recognition rates in these experiments were also very high. Moreover, we proposed an algorithm based on unsupervised machine learning techniques for evaluating and eventually optimizing character segmentation. Finally, we developed a complete Optical Character Recognition (OCR) tool that integrates all the above stages in order to assist the recognition of either contemporary or historical documents, requiring neither a priori knowledge of the language or the fonts nor the existence of a standard database. This tool enables users to create their own character database and thus to convert document images (e.g. .tiff, .jpeg) to ASCII format.
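As a rough illustration of what "extracting a feature vector and classifying with a hierarchical scheme" can look like, the following minimal Python sketch may help. It is not the methodology of the thesis: the zoning (pixel density per grid cell) features, the nearest-prototype classifier, the 2x2 and 4x4 grid sizes, the grouping of classes and the toy 4x4 glyphs are all assumptions chosen for brevity. A coarse feature vector first selects a group of similar-looking classes, and a finer one then picks the class within that group.

```python
from math import dist


def zoning_features(image, rows, cols):
    """Fraction of foreground pixels in each cell of a rows x cols grid."""
    h, w = len(image), len(image[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features


def mean_vector(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def nearest(prototypes, vector):
    """Label of the prototype closest to vector (Euclidean distance)."""
    return min(prototypes, key=lambda label: dist(prototypes[label], vector))


def train(samples, groups, coarse=(2, 2), fine=(4, 4)):
    """Build one coarse prototype per group and one fine prototype per class.

    samples: {label: [image, ...]}, groups: {group_name: [label, ...]}
    (both structures are hypothetical, for illustration only).
    """
    coarse_protos = {
        group: mean_vector([zoning_features(img, *coarse)
                            for label in labels for img in samples[label]])
        for group, labels in groups.items()
    }
    fine_protos = {
        group: {label: mean_vector([zoning_features(img, *fine)
                                    for img in samples[label]])
                for label in labels}
        for group, labels in groups.items()
    }
    return coarse_protos, fine_protos


def classify(image, coarse_protos, fine_protos, coarse=(2, 2), fine=(4, 4)):
    """Stage 1: pick a group from coarse features. Stage 2: pick a class in it."""
    group = nearest(coarse_protos, zoning_features(image, *coarse))
    return nearest(fine_protos[group], zoning_features(image, *fine))


def parse(rows):
    """Turn strings of '0'/'1' into a binary image (list of lists of ints)."""
    return [[int(ch) for ch in row] for row in rows]


# Toy training set: one 4x4 glyph per class.
I = parse(["0100", "0100", "0100", "0100"])
L = parse(["0100", "0100", "0100", "0111"])
DASH = parse(["0000", "1111", "0000", "0000"])

samples = {"I": [I], "L": [L], "-": [DASH]}
groups = {"vertical": ["I", "L"], "horizontal": ["-"]}
coarse_protos, fine_protos = train(samples, groups)

# An "L" with its top pixel missing is still recognized as "L".
noisy_L = parse(["0000", "0100", "0100", "0111"])
print(classify(noisy_L, coarse_protos, fine_protos))  # prints: L
```

In this sketch the coarse stage only has to separate groups that are easy to tell apart, so the finer and more expensive features are compared against a small number of candidate classes.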
Talk slides in PDF [~1.7 MB]: http://www.iit.demokritos.gr/docs/seminars/bambakas-slides.pdf