We are pleased to announce the 3rd edition of the Pascal Large Scale Hierarchical Text Classification (LSHTC) and ECML/PKDD 2012 Discovery Challenge. The LSHTC Challenge is a hierarchical text classification competition, using large datasets. This year’s challenge focuses on interesting learning problems like multi-task and refinement learning.
Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories and Wikipedia are two examples of such hierarchies. Along with their widespread use comes the need for automated classification of new documents into the categories of the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, this is one of the rare situations where data sparsity remains an issue despite the vastness of available data: as more documents become available, more classes are also added to the hierarchy, and there is a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence between the classes poses both challenges and opportunities for the learning methods.
The challenge consists of 3 tracks, involving different category systems with different data properties and focusing on different learning and mining problems. The challenge is based on two large datasets: one created from the ODP web directory (dmoz) and one from Wikipedia. The datasets are multi-class, multi-label and hierarchical. The number of categories ranges roughly between 13,000 and 325,000, and the number of documents between 380,000 and 2,400,000.
Visit the 3rd Pascal Challenge website: http://lshtc.iit.demokritos.gr/