Biomedical semantic indexing using dense word vectors in BioASQ

A. Kosmopoulos, I. Androutsopoulos & G. Paliouras
Methods: We examine the use of dense word vectors, also known as word embeddings, as an efficient method of dimensionality reduction that makes hierarchical text classification algorithms more scalable in biomedical semantic indexing, without being less effective than the usual bag-of-words representation. We consider several approaches for the transition from dense word vectors to dense vectors that represent entire texts, proposing the approach that we believe best fits this domain. We experiment with flat and hierarchically expanded K-nearest neighbor classifiers that employ dense vector representations of article abstracts, examining the effect of various parameters. We also present a high-precision system that can be combined with the Medical Text Indexer (MTI) system of the US National Library of Medicine (NLM) to improve its performance.

Results: Our experiments were performed on biomedical semantic indexing datasets of the BioASQ challenge. We show that dense word vectors can lead to very large dimensionality reduction (from millions of features to just 200) compared to the usual bag-of-words representation, significantly reducing the training and classification times of the classifiers without degrading their effectiveness. We also present experiments on the combination of our high-precision system with MTI, showing improvements in overall performance.

Conclusions: Dense word vectors (word embeddings) can lead to very large dimensionality reduction, making hierarchical text classification algorithms more scalable in biomedical semantic indexing, without degrading their effectiveness. K-nearest neighbor classifiers with dense vector representations of abstracts can help improve the performance of NLM's MTI semantic indexing system.
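The general idea of going from word vectors to a text vector and then to a K-nearest neighbor label assignment can be illustrated with a minimal sketch. This is not the authors' system: the abstract does not say which aggregation it proposes, so plain averaging (a centroid) is assumed here. The toy 3-dimensional embeddings, the label names, and the helper names `text_vector`, `cosine`, and `knn_labels` are all hypothetical, as is the `min_votes` threshold, which is only one possible way to trade recall for precision.

```python
import math
from collections import Counter

# Toy 3-dimensional embeddings (hypothetical values; the abstract reports 200 dimensions).
EMBEDDINGS = {
    "insulin": [0.9, 0.1, 0.0],
    "glucose": [0.8, 0.2, 0.1],
    "tumor": [0.1, 0.9, 0.0],
    "cancer": [0.0, 1.0, 0.1],
    "neuron": [0.1, 0.0, 0.9],
    "brain": [0.0, 0.1, 1.0],
}


def text_vector(tokens, embeddings):
    """Average the vectors of known tokens (assumed centroid aggregation)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no token of the text has an embedding
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_labels(query_vec, training, k=2, min_votes=1):
    """Return labels carried by at least min_votes of the k nearest neighbors."""
    ranked = sorted(training, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, labels in ranked[:k] for label in labels)
    return sorted(label for label, count in votes.items() if count >= min_votes)


# Hypothetical training abstracts (already tokenized) with multi-label annotations.
TRAINING_TEXTS = [
    (["insulin", "glucose"], {"Diabetes Mellitus"}),
    (["glucose", "insulin", "insulin"], {"Diabetes Mellitus", "Insulin"}),
    (["tumor", "cancer"], {"Neoplasms"}),
    (["neuron", "brain"], {"Brain"}),
]
TRAINING = [(text_vector(tokens, EMBEDDINGS), labels) for tokens, labels in TRAINING_TEXTS]

# Tokens without an embedding are simply skipped when building the text vector.
query = text_vector(["glucose", "insulin", "unknownword"], EMBEDDINGS)
predicted = knn_labels(query, TRAINING, k=2, min_votes=2)
print(predicted)
```

With `min_votes=2` only labels shared by both nearest neighbors are kept; lowering it to 1 returns every label seen among the neighbors, which would favor recall over precision.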
Software and Knowledge Engineering Laboratory (SKEL)
Publication Name: Journal Of Bio-Medical Semantics, Supplement On Bio-Medical Information Retrieval
