How do interval scales help us with better understanding IR evaluation measures?
Information Retrieval Journal (IF 2.5), Pub Date: 2019-09-04, DOI: 10.1007/s10791-019-09362-z
Marco Ferrante, Nicola Ferro, Eleonora Losiouk

Evaluation measures are the basis for quantifying the performance of IR systems, and the way their values can be processed in statistical analyses depends on the scales on which these measures are defined. For example, mean and variance should be computed only when relying on interval scales. In our previous work we defined a theory of IR evaluation measures, based on the representational theory of measurement, which allowed us to determine whether and when IR measures are interval scales. We found that common set-based retrieval measures (namely precision, recall, and F-measure) are always interval scales in the case of binary relevance, while this does not hold in the multi-graded relevance case. Among rank-based retrieval measures (namely AP, gRBP, DCG, and ERR), only gRBP is an interval scale, and only when we choose a specific value of the parameter p and define a specific total order among systems; none of the other measures is an interval scale. In this work, we build on our previous findings and carry out an extensive evaluation, based on standard TREC collections, to study how our theoretical findings affect the experimental ones. In particular, we conduct a correlation analysis to study the relationship among the above-mentioned state-of-the-art evaluation measures and their scales. We study how the scales of evaluation measures affect nonparametric and parametric statistical tests for multiple comparisons of IR system performance. Finally, we analyse how incomplete information and pool downsampling affect different scales and evaluation measures.
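
As a rough illustration of the quantities the abstract refers to, the following minimal Python sketch computes the set-based measures (precision, recall, F-measure) under binary relevance, and a simple rank correlation between the scores two measures assign to the same systems. It is not the authors' code. The function names `precision_recall_f1` and `kendall_tau_a` and the toy data are hypothetical, and Kendall's tau is used here only as one common choice for correlation analysis in IR evaluation; the abstract does not state which coefficient the study uses.

```python
from itertools import combinations


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 under binary relevance.

    `retrieved` and `relevant` are collections of document ids
    (hypothetical toy inputs, not TREC data).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    Tied pairs count as neither concordant nor discordant
    (a simplifying assumption of this sketch).
    """
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


if __name__ == "__main__":
    # One run retrieving four documents, three of which are relevant overall.
    print(precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"}))
    # Scores assigned by two measures to the same three systems.
    print(kendall_tau_a([0.1, 0.2, 0.3], [0.3, 0.2, 0.1]))
```

The sketch only shows how the measures and a ranking correlation are computed; whether averaging such values across topics is meaningful is exactly the interval-scale question the paper addresses.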

Updated: 2019-09-04