Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach
arXiv - CS - Multimedia Pub Date : 2020-03-25 , DOI: arxiv-2003.11300
Babak Naderi, Tobias Hossfeld, Matthias Hirth, Florian Metzger, Sebastian Möller, Rafael Zequeira Jiménez

The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. With crowdsourcing, in contrast, crowdworkers participate in a subjective online experiment using their own listening devices and in their own working environments. Despite these less controlled conditions, the increasing use of crowdsourcing micro-task platforms for quality assessment tasks has created a strong demand for standardized methods, resulting in ITU-T Rec. P.808. This work investigates the impact of the number of judgments on the reliability and validity of quality ratings collected through crowdsourcing-based speech quality assessment, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments were conducted on different platforms to evaluate the overall quality of three different speech datasets, using the Absolute Category Rating procedure. For each dataset, Mean Opinion Scores (MOS) are calculated using differing numbers of crowdsourcing judgments. The results are then compared to MOS values collected in a standard laboratory experiment, to assess the validity of the crowdsourcing approach as a function of the number of votes. In addition, the reliability of the average scores is analyzed by examining inter-rater reliability, the gain in certainty, and the confidence of the MOS. The results suggest a required number of votes per condition and allow its impact on validity and reliability to be modeled.
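As a rough illustration of the evaluation the abstract describes, the sketch below simulates how validity and MOS confidence change with the number of votes per condition: it computes per-condition MOS from the first n crowdsourcing votes, the Pearson correlation against laboratory MOS (validity), and the mean 95% confidence-interval half-width of the MOS (one view of the "confidence of the MOS"). All data, array shapes, and vote counts here are hypothetical placeholders, not values from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: per-condition crowdsourcing votes on the ACR scale (1-5)
# and laboratory MOS values for the same conditions. Purely illustrative.
n_conditions, max_votes = 30, 120
crowd_votes = rng.integers(1, 6, size=(n_conditions, max_votes))
lab_mos = crowd_votes.mean(axis=1) + rng.normal(0, 0.1, n_conditions)

def mos_ci_half_width(votes, confidence=0.95):
    """95% confidence-interval half-width of one condition's MOS (t-distribution)."""
    n = len(votes)
    return stats.t.ppf((1 + confidence) / 2, n - 1) * votes.std(ddof=1) / np.sqrt(n)

for n_votes in (5, 10, 20, 40, 80):
    sample = crowd_votes[:, :n_votes]              # first n votes per condition
    mos = sample.mean(axis=1)                      # MOS from n crowd votes
    r, _ = stats.pearsonr(mos, lab_mos)            # validity vs. laboratory MOS
    ci = np.mean([mos_ci_half_width(row) for row in sample])
    print(f"votes={n_votes:3d}  Pearson r={r:.3f}  mean 95% CI ±{ci:.3f}")
```

The paper additionally analyzes inter-rater reliability and gain in certainty; under the same caveats, the loop above could, for example, correlate MOS values from two disjoint halves of the votes as a simple split-half reliability check.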

Updated: 2020-10-01