July 15, 2015

Large scale hierarchical classification with limited training instances per category

Kosmopoulos Aris

Abstract:

Hierarchies are becoming ever more popular for the organization of objects (documents, websites, etc.), particularly on the Web. Web directories are an example. Along with their widespread use comes the need for automated classification of new instances into the categories of the hierarchy. As the size of the hierarchy grows and the number of categories increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data, since the number of training documents for some categories can still be very limited. The first important issue in hierarchical classification is the evaluation of different classification algorithms, which is complicated by the hierarchical relations among the classes. Several evaluation measures have been proposed for hierarchical classification. They use the hierarchy in different ways, without, however, providing a unified view of the problem. This thesis studies the problem of evaluation in hierarchical classification by analysing and abstracting the key components of the existing performance measures. It also proposes two alternative generic views of hierarchical evaluation and introduces two corresponding novel measures.
Another issue that this thesis addresses is dimensionality reduction in large scale hierarchical classification with tree hierarchies. The most basic form of hierarchical classification is cascade classification, which greedily traverses the hierarchy from the root to a predicted leaf. In order to perform cascade classification, a classifier must be trained for each internal node of the hierarchy. In large scale problems, the number of features can be prohibitively large for the classifiers in the upper levels of the hierarchy. It is therefore desirable to reduce the dimensionality of the feature space at these levels. This thesis examines the computational feasibility of the most common dimensionality reduction method (Principal Component Analysis) for this problem, as well as the computational benefits that it provides for cascade classification and its effect on classification accuracy. A new hierarchical approach is also proposed, which extends cascade classification. Experimental results indicate that, using the same classification algorithm, this approach can achieve better results than traditional flat and cascade classification.
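The greedy root-to-leaf traversal of cascade classification can be illustrated with a minimal sketch. This is not the method evaluated in the thesis: the toy hierarchy, the training documents, the nearest-centroid scorer and the names `TREE`, `train_cascade` and `predict_cascade` are all assumptions chosen only to keep the example self-contained.

```python
from collections import defaultdict

# Hypothetical toy tree hierarchy: each internal node maps to its children.
TREE = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
}

def leaves_under(node):
    """Return all leaf categories in the subtree rooted at `node`."""
    if node not in TREE:
        return [node]
    return [leaf for child in TREE[node] for leaf in leaves_under(child)]

def train_cascade(docs):
    """Train one model per internal node (assumed here: nearest centroid).

    `docs` is a list of (tokens, leaf_label) pairs. At each internal node,
    every child gets a centroid averaged over the documents in its subtree.
    """
    models = {}
    for node, children in TREE.items():
        centroids = {}
        for child in children:
            members = set(leaves_under(child))
            counts = defaultdict(float)
            n = 0
            for tokens, label in docs:
                if label in members:
                    n += 1
                    for token in tokens:
                        counts[token] += 1.0
            centroids[child] = {t: c / n for t, c in counts.items()} if n else {}
        models[node] = centroids
    return models

def predict_cascade(models, tokens):
    """Greedily descend from the root, picking the best-scoring child each time."""
    node = "root"
    while node in models:  # stop once a leaf (a node without a model) is reached
        centroids = models[node]
        node = max(
            centroids,
            key=lambda child: sum(centroids[child].get(t, 0.0) for t in tokens),
        )
    return node

# Made-up training documents, one per leaf category.
DOCS = [
    (["quantum", "particle"], "physics"),
    (["cell", "gene"], "biology"),
    (["goal", "pitch"], "football"),
    (["racket", "serve"], "tennis"),
]
MODELS = train_cascade(DOCS)
```

With these toy documents, `predict_cascade(MODELS, ["gene", "cell"])` first selects `science` at the root and then `biology`. Because the traversal is greedy, a wrong decision at an upper node cannot be corrected further down.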
Finally, this thesis deals with large scale multi-label classification in directed acyclic graph (DAG) hierarchies, using a large biomedical dataset (BioASQ) as a benchmark. Although the DAG hierarchy, combined with the multi-label setting, introduces new challenges, the availability of the full text of the instances in this dataset facilitates the use of powerful dimensionality reduction or word representation techniques, such as deep learning (word2vec representation).
