The Accuracy of Measurements with Probability and Nonprobability Survey Samples: Replication and Extension
Public Opinion Quarterly (IF 2.9) Pub Date: 2018-01-01, DOI: 10.1093/poq/nfy038
Bo MacInnis, Jon A. Krosnick, Annabell S. Ho, Mu-Jung Cho

Many studies in various countries have found that telephone and internet surveys of probability samples yielded data that were more accurate than internet surveys of nonprobability samples, but some authors have challenged this conclusion. This paper describes a replication and an expanded comparison of data collected in the United States with a variety of probability and nonprobability sampling methods, using a set of 50 measures of 40 benchmark variables (a larger set than any used in the past) and assessing accuracy with a metric new to this literature: root mean squared error. Despite substantial drops in response rates since a prior comparison, the probability samples interviewed by telephone or the internet were the most accurate. Internet surveys of a probability sample combined with an opt-in sample were less accurate; least accurate were internet surveys of opt-in panel samples. These results were not altered by implementing poststratification using demographics.

Bo MacInnis is a lecturer in the Department of Communication at Stanford University, Stanford, CA, USA. Jon A. Krosnick is Frederic O. Glover Professor in Humanities and Social Sciences; professor of communication, political science, and psychology at Stanford University, Stanford, CA, USA; and university fellow at Resources for the Future, Washington, DC, USA. Annabell S. Ho is a doctoral candidate in the Department of Communication at Stanford University, Stanford, CA, USA. Mu-Jung Cho is a doctoral candidate in the Department of Communication at Stanford University, Stanford, CA, USA. The authors thank the survey firms that participated in this study and provided data for evaluation; the National Science Foundation [award number SES-1042938 to J.A.K.], which funded some of the data collection; and David Yeager, who provided valuable advice and assistance. *Address correspondence to Jon A. Krosnick, 432 McClatchy Hall, 450 Serra Mall, Stanford University, Stanford, CA 94305, USA; email: krosnick@stanford.edu.

Public Opinion Quarterly, Vol. 82, No. 4, Winter 2018, pp. 707–744. doi:10.1093/poq/nfy038. Advance Access publication October 31, 2018.

Inspired importantly by the insights of R. A. Fisher (1925), as described and applied early on by Neyman (1934) and others, probability sampling via random selection has been the gold standard for surveys in the United States for decades. The dominant mode of questionnaire administration has shifted over time from face-to-face interviewing to random-digit-dial telephone interviewing in the 1970s (for reviews, see Brick 2011) to self-administration via the internet (Couper 2011). Most internet surveys today are done with nonprobability samples of people who volunteer to complete questionnaires in exchange for cash or gifts and who were not selected randomly from the population of interest (Brick 2011). Often, stratification and quotas are used to maximize the resemblance of the participating respondents to the population of interest in terms of demographics. The prominence of nonprobability sampling methods today (Brick 2011; Callegaro et al. 2014) represents a return to the beginnings of survey research a century ago and to a method that was all but abandoned in serious work in the interim (e.g., Converse 1987; Berinsky 2006).
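The stratification and quota adjustments just described, like the demographic poststratification whose effects the paper evaluates, reweight respondents so that the sample's demographic composition matches the population's. A minimal sketch of cell-based poststratification follows, using hypothetical demographic cells and made-up counts and shares rather than anything from the study:

```python
# Cell-based poststratification: weight each demographic cell by the ratio
# of its population share to its sample share. Cells and numbers are
# hypothetical, for illustration only.

def poststratify(sample_counts, population_shares):
    """Return a weight per cell so the weighted sample matches the population."""
    n = sum(sample_counts.values())
    return {cell: population_shares[cell] / (count / n)
            for cell, count in sample_counts.items()}

# Hypothetical age-by-education cells.
sample_counts = {"18-34/no degree": 100, "18-34/degree": 250,
                 "35+/no degree": 150, "35+/degree": 500}
population_shares = {"18-34/no degree": 0.15, "18-34/degree": 0.12,
                     "35+/no degree": 0.38, "35+/degree": 0.35}
weights = poststratify(sample_counts, population_shares)

# Weighted estimate of a survey percentage (e.g., share holding a passport),
# given made-up per-cell rates: each respondent counts weights[cell] times.
cell_rates = {"18-34/no degree": 0.25, "18-34/degree": 0.55,
              "35+/no degree": 0.30, "35+/degree": 0.60}
total_weight = sum(sample_counts[c] * weights[c] for c in sample_counts)
weighted_pct = sum(sample_counts[c] * weights[c] * cell_rates[c]
                   for c in sample_counts) / total_weight
print(f"weighted estimate: {weighted_pct:.1%}")
```

Weighting of this kind can only correct the sample's composition on the variables used to form the cells; as the abstract notes, such demographic adjustment did not change the relative accuracy of the samples compared in the paper.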
The transition from quota sampling to probability sampling was spurred by quota sampling's failure to predict the 1948 election (Converse 1987, pp. 201–10) and by "new ground in theory and application" in probability sampling (Converse 1987, p. 204). But in recent years, numerous authors have argued that nonprobability sampling can produce veridical assessments and should be the tool of choice for scientists interested in minimizing research costs while maximizing data accuracy (e.g., Silver 2012; Ansolabehere and Rivers 2013; Ansolabehere and Schaffner 2014; Wang et al. 2015). Harking back to the early days, many contemporary observers share Moser and Stuart's (1953) view that "statisticians have too easily dismissed a technique which often gives good results and has the virtue of economy" (p. 388).

During the past 15 years, a series of studies has compared the accuracy of probability samples and nonprobability samples. Some of these studies showed that probability samples produced accurate measurements, while nonprobability samples were consistently less accurate, sometimes strikingly so. Such studies led the AAPOR Task Force on Online Panels to conclude that "nonprobability samples are generally less accurate than probability samples" (Baker et al. 2010). And the AAPOR Task Force on Nonprobability Sampling concluded: "Although nonprobability samples often have performed well in electoral polling, the evidence of their accuracy is less clear in other domains and in more complex surveys that measure many different phenomena" (Baker et al. 2013). However, that Task Force also said: "Sampling methods used with opt-in panels have evolved significantly over time and, as a result, research aimed at evaluating the validity of survey estimates from these sample sources should focus on sampling methods rather than the panels themselves. ... Research evaluations of older methods of nonprobability sampling from panels may have little relevance to the current methods being used" (Baker et al. 2013).

Some observers have claimed that since the Task Force report was written, response rates of probability-based telephone surveys have continued to decline (but see Marken 2018), making probability sample surveys no better than nonprobability sample surveys. This paper addresses these concerns by providing new evidence on the topic. We report evaluations of data collected with an array of methods in 2012. These evaluations assess whether probability sampling yielded more accurate measurements than did various types of nonprobability samples and whether accuracy changed during the years since 2004, when the last study of this kind was conducted (Yeager et al. 2011). Further, the present study supplements the work of Dutwin and Buskirk (2017) by evaluating a low-response-rate RDD telephone survey.

Comparing Probability and Nonprobability Sample Surveys

Studies have evaluated the accuracy of measurements from probability and nonprobability sample surveys by comparing statistics produced by the surveys with benchmarks that assess the same characteristics using highly reliable methods, such as government records (e.g., the State Department's record of the number of passports held by Americans) and federal surveys with very high response rates.
Such studies have found that nonprobability sample surveys yielded data that were less accurate than data collected from probability samples when measuring voting behavior (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Sturgis et al. 2016), health behavior (Yeager et al. 2011), consumption behavior (Szolnoki and Hoffmann 2013), sexual behaviors and attitudes (Erens et al. 2014), and demographics (Malhotra and Krosnick 2007; Chang and Krosnick 2009; Yeager et al. 2011; Szolnoki and Hoffmann 2013; Erens et al. 2014; Dutwin and Buskirk 2017). Furthermore, current methods of adjusting nonprobability sample data have done little or nothing to correct the inaccuracy of estimates from nonprobability samples (Yeager et al. 2011; see Tourangeau, Conrad, and Couper 2013 for a review). However, another set of recent papers, focused on pre-election polls, suggests that nonprobability samples yielded data that were as accurate as, or more accurate than, probability sample surveys (e.g., Ansolabehere and Rivers 2013; Wang et al. 2015). And the very low response rates attained by probability-based telephone surveys in recent years have led some to believe that the theoretical advantages of probability-based surveys no longer obtain. The research reported here adds evidence to the ongoing discussion of probability and nonprobability sample surveys.

Metrics to Assess Accuracy

Past studies have used a variety of metrics to assess the accuracy of measurements by comparing them to benchmarks (see Callegaro et al. 2014). The present study introduces a new metric to this set.

Some studies have characterized the accuracy of a single measurement. Malhotra and Krosnick (2007), for example, examined the absolute deviation of the percent of respondents choosing each response category in a survey from the percent of people in the population in that response category. Yeager et al. (2011) computed the absolute deviation of the percent of respondents choosing the modal response category in a survey from the percent of people in the population in that modal category. Walker, Pettit, and Rubinson (2009) and Gittelman et al. (2015) compared the percent of respondents choosing one response category (sometimes the modal category, sometimes a nonmodal category) in a survey to the percent of people in the population in that category (without explaining why the particular response category was chosen). Kennedy et al. (2016) computed the absolute deviation of the percent of respondents choosing one response category, or the combination of two response categories, in a survey from the percent of people in the population in that category or categories (without explaining why the particular category or categories was/were chosen or combined).

Other studies have examined multiple measurements in comparing the accuracy of probability and nonprobability surveys. Ansolabehere and Schaffner (2014) and Sturgis et al. (2016) computed …
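To make the metrics described above concrete, here is a minimal sketch, with made-up estimates and benchmarks, of the per-measure absolute deviation (in the modal-category form used by Yeager et al. 2011) and of root mean squared error across measures, the metric this paper adds to the literature:

```python
import math

# Made-up (survey estimate %, benchmark %) pairs for the modal response
# category of three hypothetical benchmark measures.
measures = [
    ("holds a passport",       42.0, 46.0),
    ("smoked in past 30 days", 18.5, 20.1),
    ("owns a home",            61.0, 65.2),
]

# Per-measure absolute deviation from the benchmark (percentage points).
abs_errors = [abs(est - bench) for _, est, bench in measures]

# Root mean squared error across all benchmark measures.
rmse = math.sqrt(sum((est - bench) ** 2 for _, est, bench in measures)
                 / len(measures))

for (name, est, bench), err in zip(measures, abs_errors):
    print(f"{name}: |{est} - {bench}| = {err:.1f} points")
print(f"average absolute error = {sum(abs_errors) / len(abs_errors):.2f} points")
print(f"RMSE = {rmse:.2f} points")
```

Because deviations are squared before averaging, RMSE penalizes a few large errors more heavily than an average of absolute deviations does.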
