While Information Retrieval (IR) on the web is dominated by systems learning ranking functions from log data, standard ad hoc IR is largely dominated by probabilistic systems with few parameters to set, such as Okapi BM25, language models, or the Divergence from Randomness models. These models are based on several probabilistic distributions and assumptions that ease their deployment in practical situations. Although they are well founded from an information retrieval point of view (they satisfy heuristic retrieval constraints), the probability distributions they rely on generally yield a poor fit to empirical data. Thus, in the “model word frequency distributions to retrieve documents” approach, the first part (modeling word frequency distributions) is somewhat neglected with respect to the second part (retrieving documents) in most models.
In this presentation, we introduce a new IR model based on probability distributions that fit empirical data well while satisfying heuristic retrieval constraints. To do so, we first explore the links between heuristic retrieval constraints and word frequency distributions. After proposing an analytical view of heuristic retrieval constraints, we review empirical findings on word frequency distributions and the central role played by burstiness in this context. We then introduce the family of information-based IR models and develop, within this family, a new IR model based on the log-logistic distribution. We finally compare our model with existing ones on several standard collections.
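To make the idea concrete, here is a minimal sketch of what an information-based score built on a log-logistic distribution might look like. It sums, over query terms, the Shannon information -log P(T ≥ t) of the observed (length-normalized) term frequency under a per-term log-logistic distribution. The normalization formula, the choice of setting the scale parameter from document frequency, and the parameter `c` are all illustrative assumptions, not the paper's exact formulation:

```python
import math

def log_logistic_score(query_tf, doc_tf, doc_len, avg_doc_len,
                       doc_freq, n_docs, c=1.0):
    """Sketch of an information-based retrieval score (assumptions noted inline).

    query_tf: {term: frequency in the query}
    doc_tf:   {term: frequency in the document}
    doc_freq: {term: number of documents containing the term}
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        # Length-normalized term frequency (one common normalization; assumption).
        t = tf * math.log(1.0 + c * avg_doc_len / doc_len)
        # Log-logistic scale parameter derived from document frequency (assumption).
        lam = doc_freq[term] / n_docs
        # Survival function of this log-logistic form: P(T >= t) = lam / (t + lam),
        # so the Shannon information of observing t is -log of that quantity.
        info = -math.log(lam / (t + lam))
        score += qtf * info
    return score
```

Because the survival probability shrinks as the normalized frequency grows, the information (and hence the score contribution) increases with term frequency, and rare terms (small `lam`) are rewarded more, in line with the retrieval constraints discussed above.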