In many information repositories available over the web, data are progressively updated with new content that becomes accessible as a continuous stream of textual information. Available data are not only big and rich, but also dynamic, and the capability to deal with time becomes a crucial requirement for an effective analysis of this continuous flow of information.
In this talk, we envisage the exploratory analysis of textual data streams as a continuous bootstrapping process, where each bootstrapping cycle works on an incoming document chunk of the stream related to a fixed-size time window. We focus on presenting a combination of keyword similarity and clustering techniques to: (i) classify documents into fine-grained similarity clusters; (ii) aggregate similar clusters into larger document collections (i.e., topics) sharing a richer, more user-prominent keyword set; (iii) assimilate the newly extracted topics of the current bootstrapping cycle with the existing topics resulting from previous bootstrapping cycles, by linking similar topics of different time windows, to highlight topic trends and evolution. An analysis framework is also discussed for enabling topic-based exploration of the underlying textual data stream according to thematic and temporal perspectives. Finally, experimental results are presented based on a real data stream of newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
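To make the three steps above concrete, the following is a minimal sketch of one bootstrapping cycle, not the method presented in the talk. It assumes that each document is already reduced to a set of keywords, that Jaccard overlap stands in for keyword similarity, and that a greedy single-pass grouping stands in for the clustering technique. The function names, thresholds, and sample documents are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Group = Dict[str, object]  # {"docs": list of doc ids, "keywords": set of keywords}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword-set overlap in [0, 1]; an assumed stand-in for keyword similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs: Dict[str, Set[str]], threshold: float) -> List[Group]:
    """Step (i): greedily assign each document to the first similar-enough cluster."""
    clusters: List[Group] = []
    for doc_id, keywords in docs.items():
        for c in clusters:
            if jaccard(keywords, c["keywords"]) >= threshold:
                c["docs"].append(doc_id)
                c["keywords"] |= keywords
                break
        else:
            clusters.append({"docs": [doc_id], "keywords": set(keywords)})
    return clusters


def aggregate(clusters: List[Group], threshold: float) -> List[Group]:
    """Step (ii): merge similar clusters into topics with a richer keyword set."""
    topics: List[Group] = []
    for c in clusters:
        for t in topics:
            if jaccard(c["keywords"], t["keywords"]) >= threshold:
                t["docs"].extend(c["docs"])
                t["keywords"] |= c["keywords"]
                break
        else:
            topics.append({"docs": list(c["docs"]), "keywords": set(c["keywords"])})
    return topics


def link_topics(previous: List[Group], current: List[Group],
                threshold: float) -> List[Tuple[int, int]]:
    """Step (iii): link topics of two time windows whose keyword sets overlap."""
    return [
        (i, j)
        for i, p in enumerate(previous)
        for j, c in enumerate(current)
        if jaccard(p["keywords"], c["keywords"]) >= threshold
    ]


# Two hypothetical time windows of keyword-annotated documents.
window_1 = {
    "d1": {"election", "senate", "vote"},
    "d2": {"election", "vote", "ballot"},
    "d3": {"tariff", "trade"},
}
window_2 = {
    "d4": {"election", "vote", "campaign"},
    "d5": {"war", "treaty"},
}

topics_1 = aggregate(cluster(window_1, threshold=0.4), threshold=0.2)
topics_2 = aggregate(cluster(window_2, threshold=0.4), threshold=0.2)
links = link_topics(topics_1, topics_2, threshold=0.3)
print(links)
```

In this toy run, the election-related topic of the first window is linked to the election-related topic of the second, which is the kind of cross-window link that could be followed to trace a topic over time.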
Prof. Alfio Ferrara
Department of Computer Science, Università degli Studi di Milano
Alfio Ferrara is Associate Professor of Computer Science at the University of Milano, where he received his Ph.D. in Computer Science in 2005 for his work on ontology and instance matching. He currently teaches Database Systems and Information Retrieval.
He has published over 80 papers in refereed journals and international conferences on ontologies and semantic web, data analysis, and data integration. On these topics, he has also worked on national and international research projects. He is a member of the editorial board of the Program journal and a member of the participant steering committee of the Ontology Alignment Evaluation Initiative (OAEI), where he coordinates the instance matching track. His research interests include database and semi-structured data integration, Web-based information systems, data analysis, and knowledge representation and evolution.
Speaker home page: http://islab.di.unimi.it/ferrara