
The Document Intelligence Centre of Excellence (DICE), part of the IIS division of the Institute of Informatics & Telecommunications (IIT), has been granted a patent by the United States Patent and Trademark Office (USPTO) for its innovative system titled “System and Method for Automatically Tagging Documents”.

The patent is held by IIT-affiliated researchers Eleftherios-Panagiotis Loukas and Eirini Spyropoulou. It introduces a method that enhances how deep learning models process and interpret numerical data, significantly improving their performance in tasks involving numerical content. The patented technology builds upon earlier research presented at ACL 2022 in the paper “FiNER: Financial Numeric Entity Recognition for XBRL Tagging”, and focuses on intelligent preprocessing of electronic documents.

Standard tokenisation used in transformer-based AI models often struggles with numbers. For example, a value like “1,234.56” is typically broken into separate pieces like “1”, “,”, “234”, “.”, and “56”. This fragmentation makes it harder for models to understand and use the full numeric value, which is especially problematic in tasks like financial document classification or tagging, where numeric context matters.

To solve this, the patented method replaces numbers in text with special tokens that represent the entire number as a single unit. These tokens, for example [NUM] or [XX.X], can capture the structure and approximate size of the number, making it easier for models to process accurately. By avoiding unnecessary fragmentation, the method helps models better understand numerical information, leading to improved performance in tasks that involve numbers.
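The replacement step can be illustrated with a minimal Python sketch. It is not the patented implementation: the regular expression, the function names `mask_numbers` and `to_shape`, and the choice to map every digit to `X` are assumptions made here for illustration only.

```python
import re

# Assumption: a number is a run of digits with optional thousands separators
# and an optional decimal part, e.g. 7, 53.2 or 1,234.56.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def to_shape(match):
    """Turn a matched number into a shape token by mapping each digit to X."""
    return "[" + re.sub(r"\d", "X", match.group(0)) + "]"


def mask_numbers(text, mode="shape"):
    """Replace every number in `text` with a single symbolic token.

    mode="num"   -> every number becomes the generic token [NUM]
    mode="shape" -> every number becomes a token that keeps its layout
    """
    if mode == "num":
        return NUMBER_RE.sub("[NUM]", text)
    return NUMBER_RE.sub(to_shape, text)


sentence = "Revenue rose 53.2% to 1,234.56 million."
print(mask_numbers(sentence, mode="shape"))
# Revenue rose [XX.X]% to [X,XXX.XX] million.
print(mask_numbers(sentence, mode="num"))
# Revenue rose [NUM]% to [NUM] million.
```

For a model to treat each token as one unit, the symbolic tokens would also need to be registered in the tokenizer's vocabulary; that step depends on the model library in use and is left out of this sketch.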

The system operates by receiving an electronic document, extracting its text, and replacing numeric or date values with symbolic representations. The processed text is then tokenized and analyzed by a deep learning module, which automatically assigns semantic tags to the appropriate tokens. The final output is a fully tagged version of the document, ready for downstream AI applications.
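The flow described above can be outlined in a short Python sketch. Everything in it is hypothetical: the `[DATE]` token, the ISO-only date pattern, the simple regex tokenizer, and above all `tag_tokens`, which is a rule-based placeholder standing in for the deep learning module and assigns a made-up `VALUE` label. Text extraction from the document is assumed to have already happened.

```python
import re

# Assumption: only ISO-style dates (YYYY-MM-DD) are recognised as dates.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumption: numbers are digit runs with optional separators and decimals.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
# A symbolic token in brackets, a word, or a single punctuation mark.
TOKEN_RE = re.compile(r"\[[^\]\s]+\]|\w+|[^\w\s]")


def replace_values(text):
    """Replace dates and numbers with symbolic representations."""
    text = DATE_RE.sub("[DATE]", text)  # dates first, so they are not split
    return NUMBER_RE.sub(
        lambda m: "[" + re.sub(r"\d", "X", m.group(0)) + "]", text
    )


def tokenize(text):
    """Split text into tokens, keeping each symbolic token in one piece."""
    return TOKEN_RE.findall(text)


def tag_tokens(tokens):
    """Placeholder for the deep learning tagger (hypothetical labels)."""
    return [
        (tok, "VALUE" if tok.startswith("[") and tok.endswith("]") else "O")
        for tok in tokens
    ]


def tag_document(extracted_text):
    """Run the steps in order: replace values, tokenize, tag."""
    return tag_tokens(tokenize(replace_values(extracted_text)))


for token, tag in tag_document("Net income was 1,234.56 on 2022-05-22."):
    print(token, tag)
```

In a real system the placeholder tagger would be a trained model that chooses among domain-specific tags; the sketch only shows how the stages connect.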

