
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

Published: 02 February 2023

Abstract

Alongside the huge volume of research on deep learning models in NLP in recent years, there has been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with more than 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “skills” that question answering/reading comprehension systems are supposed to acquire and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed both at practitioners looking for pointers to the wealth of existing data and at researchers working on new resources.

  191. [191] McCann Bryan, Keskar Nitish Shirish, Xiong Caiming, and Socher Richard. 2018. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730 [CS, STAT] (2018). http://arxiv.org/abs/1806.08730.Google ScholarGoogle Scholar
  192. [192] McCarthy John and Hayes Patrick. 1969. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence 4, Meltzer B. and Michie Donald (Eds.). Edinburgh University Press, 463502.Google ScholarGoogle Scholar
  193. [193] McCoy Tom, Min Junghyun, and Linzen Tal. 2019. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv:1911.02969 [CS] (2019). http://arxiv.org/abs/1911.02969.Google ScholarGoogle Scholar
  194. [194] McCoy Tom, Pavlick Ellie, and Linzen Tal. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 34283448. Google ScholarGoogle ScholarCross RefCross Ref
  195. [195] McNamara Danielle S. and Magliano Joe. 2009. Toward a comprehensive model of comprehension. In The Psychology of Learning and Motivation. Psychology of Learning and Motivation Series, Vol. 51. Academic Press, Cambridge, MA, 297384. Google ScholarGoogle ScholarCross RefCross Ref
  196. [196] Miao Shen-Yun, Liang Chao-Chun, and Su Keh-Yih. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 975984. https://www.aclweb.org/anthology/2020.acl-main.92.Google ScholarGoogle ScholarCross RefCross Ref
  197. [197] Mihaylov Todor, Clark Peter, Khot Tushar, and Sabharwal Ashish. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 23812391. http://aclweb.org/anthology/D18-1260.Google ScholarGoogle ScholarCross RefCross Ref
  198. [198] Min Sewon, Michael Julian, Hajishirzi Hannaneh, and Zettlemoyer Luke. 2020. AmbigQA: Answering ambiguous open-domain questions. arXiv:2004.10645 [CS] (2020). http://arxiv.org/abs/2004.10645.Google ScholarGoogle Scholar
  199. [199] Mirzaee Roshanak, Faghihi Hossein Rajaby, Ning Qiang, and Kordjamshidi Parisa. 2021. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’21). 45824598. Google ScholarGoogle ScholarCross RefCross Ref
  200. [200] Mishra Swaroop, Mitra Arindam, Varshney Neeraj, Sachdeva Bhavdeep, and Baral Chitta. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks. arXiv preprint arXiv:2005.08516 (2020). https://arxiv.org/abs/2005.08516.Google ScholarGoogle Scholar
  201. [201] Mitchell Margaret, Wu Simone, Zaldivar Andrew, Barnes Parker, Vasserman Lucy, Hutchinson Ben, Spitzer Elena, Raji Inioluwa Deborah, and Gebru Timnit. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*’19). ACM, New York, NY, 220229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  202. [202] Modi Ashutosh, Anikina Tatjana, Ostermann Simon, and Pinkal Manfred. 2016. InScript: Narrative texts annotated with script information. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16). 34853493. https://www.aclweb.org/anthology/L16-1555.Google ScholarGoogle Scholar
  203. [203] Möller Timo, Reina Anthony, Jayakumar Raghavan, and Pietsch Malte. 2020. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL’20. https://www.aclweb.org/anthology/2020.nlpcovid19-acl.18.Google ScholarGoogle Scholar
  204. [204] Mostafazadeh Nasrin, Roth Michael, Chambers Nathanael, and Louis Annie. 2017. LSDSem 2017 shared task: The story Cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential, and Discourse-Level Semantics. 4651. http://www.aclweb.org/anthology/W17-0900.Google ScholarGoogle ScholarCross RefCross Ref
  205. [205] Mozannar Hussein, Maamary Elie, Hajal Karl El, and Hajj Hazem. 2019. Neural Arabic question answering. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 108118. Google ScholarGoogle ScholarCross RefCross Ref
  206. [206] Mun Jonghwan, Seo Paul Hongsuck, Jung Ilchae, and Han Bohyung. 2017. MarioQA: Answering questions by watching gameplay videos. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). http://arxiv.org/abs/1612.01669.Google ScholarGoogle ScholarCross RefCross Ref
  207. [207] Nakov Preslav, Hoogeveen Doris, Màrquez Lluís, Moschitti Alessandro, Mubarak Hamdy, Baldwin Timothy, and Verspoor Karin. 2017. SemEval-2017 Task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval’17). 2748. http://www.aclweb.org/anthology/S17-2003.Google ScholarGoogle ScholarCross RefCross Ref
  208. [208] Nakov Preslav, Màrquez Lluís, Magdy Walid, Moschitti Alessandro, Glass Jim, and Randeree Bilal. 2015. SemEval-2015 Task 3: Answer selection in community question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval’15). 269281.Google ScholarGoogle Scholar
  209. [209] Nakov Preslav, Màrquez Lluís, Moschitti Alessandro, Magdy Walid, Mubarak Hamdy, Freihat abed Alhakim, Glass Jim, and Randeree Bilal. 2016. SemEval-2016 Task 3: Community question answering. 525545.Google ScholarGoogle Scholar
  210. [210] Nguyen Kiet, Nguyen Vu, Nguyen Anh, and Nguyen Ngan. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the International Conference on Learning Representations (ICLR’20). 25952605. Google ScholarGoogle ScholarCross RefCross Ref
  211. [211] Ning Qiang, Wu Hao, Han Rujun, Peng Nanyun, Gardner Matt, and Roth Dan. 2020. TORQUE: A reading comprehension dataset of temporal ordering questions. arXiv:2005.00242 [CS] (2020). http://arxiv.org/abs/2005.00242.Google ScholarGoogle Scholar
  212. [212] Omura Kazumasa, Kawahara Daisuke, and Kurohashi Sadao. 2020. A method for building a commonsense inference dataset based on basic events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 24502460. https://www.aclweb.org/anthology/2020.emnlp-main.192.Google ScholarGoogle ScholarCross RefCross Ref
  213. [213] Onishi Takeshi, Wang Hai, Bansal Mohit, Gimpel Kevin, and McAllester David. 2016. Who did what: A large-scale person-centered Cloze dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 22302235. Google ScholarGoogle ScholarCross RefCross Ref
  214. [214] Ostermann Simon, Modi Ashutosh, Roth Michael, Thater Stefan, and Pinkal Manfred. 2018. MCScript: A novel dataset for assessing machine comprehension using script knowledge. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1564.Google ScholarGoogle Scholar
  215. [215] Ostermann Simon, Roth Michael, Modi Ashutosh, Thater Stefan, and Pinkal Manfred. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation. 747757. Google ScholarGoogle ScholarCross RefCross Ref
  216. [216] Pampari Anusri, Raghavan Preethi, Liang Jennifer, and Peng Jian. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 23572368. http://aclweb.org/anthology/D18-1258.Google ScholarGoogle ScholarCross RefCross Ref
  217. [217] Pang Richard Yuanzhe, Parrish Alicia, Joshi Nitish, Nangia Nikita, Phang Jason, Chen Angelica, Padmakumar Vishakh, et al. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 53365358. Google ScholarGoogle ScholarCross RefCross Ref
  218. [218] Paperno Denis, Kruszewski Germán, Lazaridou Angeliki, Pham Quan Ngoc, Bernardi Raffaella, Pezzelle Sandro, Baroni Marco, Boleda Gemma, and Fernández Raquel. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031 [CS] (2016). http://arxiv.org/abs/1606.06031.Google ScholarGoogle Scholar
  219. [219] Pasupat Panupong and Liang Percy. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 5th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 14701480. Google ScholarGoogle ScholarCross RefCross Ref
  220. [220] Patel Alkesh, Bindal Akanksha, Kotek Hadas, Klein Christopher, and Williams Jason. 2020. Generating natural questions from images for multimodal assistants. arXiv:2012.03678 [CS] (2020). http://arxiv.org/abs/2012.03678.Google ScholarGoogle Scholar
  221. [221] Peñas Anselmo, Unger Christina, and Ngomo Axel-Cyrille Ngonga. 2014. Overview of CLEF question answering track 2014. In Information Access Evaluation. Multilinguality, Multimodality, and Interaction. Springer, Cham, Switzerland, 300306. Google ScholarGoogle ScholarCross RefCross Ref
  222. [222] Peñas Anselmo, Unger Christina, Paliouras Georgios, and Kakadiaris Ioannis. 2015. Overview of the CLEF question answering track 2015. In Experimental IR Meets Multilinguality, Multimodality, and Interaction.Lecture Notes in Computer Science, Vol. 9283. Springer, 539544.Google ScholarGoogle Scholar
  223. [223] Penha Gustavo, Balan Alexandru, and Hauff Claudia. 2019. Introducing MANtIS: A novel multi-domain information seeking dialogues dataset. arXiv:1912.04639 [CS] (2019). http://arxiv.org/abs/1912.04639.Google ScholarGoogle Scholar
  224. [224] Peskov Denis, Clarke Nancy, Krone Jason, Fodor Brigi, Zhang Yi, Youssef Adel, and Diab Mona. 2019. Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 45264536. Google ScholarGoogle ScholarCross RefCross Ref
  225. [225] Pfeiffer Jonas, Geigle Gregor, Kamath Aishwarya, Steitz Jan-Martin, Roth Stefan, Vulić Ivan, and Gurevych Iryna. 2022. xGQA: Cross-lingual visual question answering. In Findings of ACL’22. 24972511. Google ScholarGoogle ScholarCross RefCross Ref
  226. [226] Price Eric. 2014. The NIPS Experiment. Retrieved September 16, 2022 from http://blog.mrtz.org/2014/12/15/the-nips-experiment.html.Google ScholarGoogle Scholar
  227. [227] Pruthi Danish, Gupta Mansi, Dhingra Bhuwan, Neubig Graham, and Lipton Zachary C.. 2019. Learning to deceive with attention-based explanations. arXiv:1909.07913 [CS] (2019). http://arxiv.org/abs/1909.07913.Google ScholarGoogle Scholar
  228. [228] Qin Lianhui, Gupta Aditya, Upadhyay Shyam, He Luheng, Choi Yejin, and Faruqui Manaal. 2021. TIMEDIAL: Temporal commonsense reasoning in dialog. arXiv:2106.04571 [CS.CL] (2021).Google ScholarGoogle Scholar
  229. [229] Qiu Boyu, Chen Xu, Xu Jungang, and Sun Yingfei. 2019. A survey on neural machine reading comprehension. arXiv:1906.03824 [CS] (2019). http://arxiv.org/abs/1906.03824.Google ScholarGoogle Scholar
  230. [230] Qu Chen, Yang Liu, Croft W. Bruce, Trippas Johanne R., Zhang Yongfeng, and Qiu Minghui. 2018. Analyzing and characterizing user intent in information-seeking conversations. In Proceedings of the 41st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’18). ACM, New York, NY, 989992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  231. [231] Radlinski Filip, Balog Krisztian, Byrne Bill, and Krishnamoorthi Karthik. 2019. Coached conversational preference elicitation: A case study in understanding movie preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. https://research.google/pubs/pub48414/.Google ScholarGoogle Scholar
  232. [232] Raghavi Khyathi Chandu, Chinnakotla Manoj Kumar, and Shrivastava Manish. 2015. “Answer ka type kya he?”: Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web (WWW’15 Companion). ACM, New York, NY, 853858. Google ScholarGoogle ScholarDigital LibraryDigital Library
  233. [233] Rajani Nazneen Fatema, Krause Ben, Yin Wengpeng, Niu Tong, Socher Richard, and Xiong Caiming. 2020. Explaining and improving model behavior with k nearest neighbor representations. arXiv:2010.09030 [CS] (2020). http://arxiv.org/abs/2010.09030.Google ScholarGoogle Scholar
  234. [234] Rajpurkar Pranav, Jia Robin, and Liang Percy. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 784789. http://aclweb.org/anthology/P18-2124.Google ScholarGoogle ScholarCross RefCross Ref
  235. [235] Rajpurkar Pranav, Zhang Jian, Lopyrev Konstantin, and Liang Percy. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 23832392.Google ScholarGoogle ScholarCross RefCross Ref
  236. [236] Ramponi Alan and Plank Barbara. 2020. Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 68386855. Google ScholarGoogle ScholarCross RefCross Ref
  237. [237] Rashkin Hannah, Sap Maarten, Allaway Emily, Smith Noah A., and Choi Yejin. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 463473. Google ScholarGoogle ScholarCross RefCross Ref
  238. [238] Reddy Siva, Chen Danqi, and Manning Christopher D.. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7 (March 2019), 249266. Google ScholarGoogle ScholarCross RefCross Ref
  239. [239] Ribeiro Marco Tulio, Singh Sameer, and Guestrin Carlos. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, New York, NY, 11351144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  240. [240] Ribeiro Marco Tulio, Wu Tongshuang, Guestrin Carlos, and Singh Sameer. 2020. Beyond accuracy: Behavioral testing of NLP models with checklist. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 49024912. https://www.aclweb.org/anthology/2020.acl-main.442.Google ScholarGoogle ScholarCross RefCross Ref
  241. [241] Richardson Matthew, Burges Christopher J. C., and Renshaw Erin. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’13). 193203.Google ScholarGoogle Scholar
  242. [242] Rodriguez Pedro and Boyd-Graber Jordan. 2021. Evaluation paradigms in question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 96309642. Google ScholarGoogle ScholarCross RefCross Ref
  243. [243] Rodriguez Pedro, Crook Paul, Moon Seungwhan, and Wang Zhiguang. 2020. Information seeking in the spirit of learning: A dataset for conversational curiosity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 81538172. Google ScholarGoogle ScholarCross RefCross Ref
  244. [244] Rodriguez Pedro, Feng Shi, Iyyer Mohit, He He, and Boyd-Graber Jordan. 2021. Quizbowl: The case for incremental question answering. arXiv:1904.04792 [CS] (2021). Google ScholarGoogle ScholarCross RefCross Ref
  245. [245] Roemmele Melissa, Bejan Cosmin Adrian, and Gordon Andrew S.. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning. 6.Google ScholarGoogle Scholar
  246. [246] Rogers Anna. 2019. How the Transformers Broke NLP Leaderboards. Retrieved September 16, 2022 from https://hackingsemantics.xyz/2019/leaderboards/.Google ScholarGoogle Scholar
  247. [247] Rogers Anna. 2021. Changing the world by changing the data. In Proceedings of the Conference of the Association for Computational Linguistics (ACL’21). 21822194. https://aclanthology.org/2021.acl-long.170.Google ScholarGoogle ScholarCross RefCross Ref
  248. [248] Rogers Anna and Augenstein Isabelle. 2020. What can we do to improve peer review in NLP? In Findings of EMNLP’20. 12561262. https://www.aclweb.org/anthology/2020.findings-emnlp.112/.Google ScholarGoogle Scholar
  249. [249] Rogers Anna, Kovaleva Olga, Downey Matthew, and Rumshisky Anna. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 87228731. https://aaai.org/ojs/index.php/AAAI/article/view/6398.Google ScholarGoogle ScholarCross RefCross Ref
  250. [250] Roy Uma, Constant Noah, Al-Rfou Rami, Barua Aditya, Phillips Aaron, and Yang Yinfei. 2020. LAReQA: Language-agnostic answer retrieval from a multilingual pool. arXiv:2004.05484 [CS] (2020). http://arxiv.org/abs/2004.05484.Google ScholarGoogle Scholar
  251. [251] Ruder Sebastian and Avirup Si. 2021. Multi-domain multilingual question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21).Google ScholarGoogle ScholarCross RefCross Ref
  252. [252] Rudinger Rachel, Shwartz Vered, Hwang Jena D., Bhagavatula Chandra, Forbes Maxwell, Bras Ronan Le, Smith Noah A., and Choi Yejin. 2020. Thinking like a skeptic: Defeasible inference in natural language. In Findings of EMNLP’20. 46614675. Google ScholarGoogle ScholarCross RefCross Ref
  253. [253] Rychalska Barbara, Basaj Dominika, Wróblewska Anna, and Biecek Przemyslaw. 2018. Does it care what you asked? Understanding importance of verbs in deep learning QA system. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP. 322324. http://aclweb.org/anthology/W18-5436.Google ScholarGoogle ScholarCross RefCross Ref
  254. [254] Sachan Mrinmaya, Dubey Kumar, Xing Eric, and Richardson Matthew. 2015. Learning answer-entailing structures for machine comprehension. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 5th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 239249. Google ScholarGoogle ScholarCross RefCross Ref
  255. [255] Saeidi Marzieh, Bartolo Max, Lewis Patrick, Singh Sameer, Rocktäschel Tim, Sheldon Mike, Bouchard Guillaume, and Riedel Sebastian. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 20872097. Google ScholarGoogle ScholarCross RefCross Ref
  256. [256] Sakaguchi Keisuke, Bras Ronan Le, Bhagavatula Chandra, and Choi Yejin. 2019. WinoGrande: An adversarial Winograd Schema Challenge at scale. arXiv:1907.10641 [CS] (2019). http://arxiv.org/abs/1907.10641.Google ScholarGoogle Scholar
  257. [257] Sap Maarten, Rashkin Hannah, Chen Derek, Bras Ronan Le, and Choi Yejin. 2019. Social IQA: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 44534463. Google ScholarGoogle ScholarCross RefCross Ref
  258. [258] Schlegel Viktor, Nenadic Goran, and Batista-Navarro Riza. 2020. Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models. arXiv:2005.14709 [CS] (2020). http://arxiv.org/abs/2005.14709.Google ScholarGoogle Scholar
  259. [259] Schlegel Viktor, Valentino Marco, Freitas André, Nenadic Goran, and Batista-Navarro Riza. 2020. A framework for evaluation of machine reading comprehension gold standards. In Proceedings of the Language Resources and Evaluation Conference. http://arxiv.org/abs/2003.04642.Google ScholarGoogle Scholar
  260. [260] Serban Iulian Vlad, Lowe Ryan, Henderson Peter, Charlin Laurent, and Pineau Joelle. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv:1512.05742 [CS, STAT] (2015). http://arxiv.org/abs/1512.05742.Google ScholarGoogle Scholar
  261. [261] Shao Chih Chieh, Liu Trois, Lai Yuting, Tseng Yiying, and Tsai Sam. 2019. DRCD: A Chinese machine reading comprehension dataset. arXiv:1806.00920 [CS] (2019). http://arxiv.org/abs/1806.00920.Google ScholarGoogle Scholar
  262. [262] Shi Shuming, Wang Yuehui, Lin Chin-Yew, Liu Xiaojiang, and Rui Yong. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 11321142. Google ScholarGoogle ScholarCross RefCross Ref
  263. [263] Shibuki Hideyuki, Sakamoto Kotaro, Kano Yoshionobu, Mitamura Teruko, Ishioroshi Madoka, Itakura Kelly Y., Wang Di, Mori Tatsunori, and Kando Noriko. 2014. Overview of the NTCIR-11 QA-lab task. In Proceedings of the 11th NTCIR Conference. 518529. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings11/pdf/NTCIR/OVERVIEW/01-NTCIR11-OV-QALAB-ShibukiH.pdf.Google ScholarGoogle Scholar
  264. [264] Sinha Koustuv, Sodhani Shagun, Dong Jin, Pineau Joelle, and Hamilton William L.. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 44964505. Google ScholarGoogle ScholarCross RefCross Ref
  265. [265] Sitaram Sunayana, Chandu Khyathi Raghavi, Rallabandi Sai Krishna, and Black Alan W.. 2020. A survey of code-switched speech and language processing. arXiv:1904.00784 [CS, STAT] (2020). http://arxiv.org/abs/1904.00784.Google ScholarGoogle Scholar
  266. [266] Soleimani Amir, Monz Christof, and Worring Marcel. 2021. NLQuAD: A non-factoid long question answering data set. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL’21). 12451255. https://aclanthology.org/2021.eacl-main.106.Google ScholarGoogle ScholarCross RefCross Ref
  267. [267] Sugawara Saku and Aizawa Akiko. 2016. An analysis of prerequisite skills for reading comprehension. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods. 15. Google ScholarGoogle ScholarCross RefCross Ref
  268. [268] Sugawara Saku, Kido Yusuke, Yokono Hikaru, and Aizawa Akiko. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 806817. Google ScholarGoogle ScholarCross RefCross Ref
  269. [269] Sugawara Saku, Stenetorp Pontus, Inui Kentaro, and Aizawa Akiko. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). http://arxiv.org/abs/1911.09241.Google ScholarGoogle ScholarCross RefCross Ref
  270. [270] Suhr Alane, Lewis Mike, Yeh James, and Artzi Yoav. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 217223. Google ScholarGoogle ScholarCross RefCross Ref
  271. [271] Suhr Alane, Zhou Stephanie, Zhang Ally, Zhang Iris, Bai Huajun, and Artzi Yoav. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 64186428. Google ScholarGoogle ScholarCross RefCross Ref
  272. [272] Sun Haitian, Cohen William, and Salakhutdinov Ruslan. 2022. ConditionalQA: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 36273637. Google ScholarGoogle ScholarCross RefCross Ref
  273. [273] Sun Kai, Yu Dian, Chen Jianshu, Yu Dong, Choi Yejin, and Cardie Claire. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7 (April 2019), 217231. Google ScholarGoogle ScholarCross RefCross Ref
  274. [274] Sun Ningyuan, Yang Xuefeng, and Liu Yunfeng. 2020. TableQA: A large-scale Chinese Text-to-SQL dataset for table-aware SQL generation. arXiv:2006.06434 [CS] (2020). http://arxiv.org/abs/2006.06434.Google ScholarGoogle Scholar
  275. [275] Suster Simon and Daelemans Walter. 2018. CliCR: A dataset of clinical case reports for machine reading comprehension. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 15511563. Google ScholarGoogle ScholarCross RefCross Ref
  276. [276] Tafjord Oyvind, Clark Peter, Gardner Matt, Yih Wen-Tau, and Sabharwal Ashish. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19).Google ScholarGoogle ScholarDigital LibraryDigital Library
  277. [277] Tafjord Oyvind, Gardner Matt, Lin Kevin, and Clark Peter. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 59415946. Google ScholarGoogle ScholarCross RefCross Ref
  278. [278] Talmor Alon and Berant Jonathan. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 641651. Google ScholarGoogle ScholarCross RefCross Ref
  279. [279] Talmor Alon, Herzig Jonathan, Lourie Nicholas, and Berant Jonathan. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 41494158. https://www.aclweb.org/anthology/papers/N/N19/N19-1421/.Google ScholarGoogle Scholar
  280. [280] Talmor Alon, Yoran Ori, Catav Amnon, Lahav Dan, Wang Yizhong, Asai Akari, Ilharco Gabriel, Hajishirzi Hannaneh, and Berant Jonathan. 2021. MultimodalQA: Complex question answering over text, tables and images. In Proceedings of the 9th International Conference on Learning Representations (ICLR’21). 12. https://openreview.net/pdf/f3dad930cb55abce99a229e35cc131a2db791b66.pdf.Google ScholarGoogle Scholar
  281. [281] Tapaswi Makarand, Zhu Yukun, Stiefelhagen Rainer, Torralba Antonio, Urtasun Raquel, and Fidler Sanja. 2016. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  282. [282] Thakur Nandan, Reimers Nils, Rücklé Andreas, Srivastava Abhishek, and Gurevych Iryna. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the 35th Conference on Neural Information Processing Systems, Datasets, and Benchmarks Track. https://openreview.net/forum?id=wCu6T5xFjeJ.Google ScholarGoogle Scholar
  283. [283] Thomas Paul, McDuff Daniel, Czerwinski Mary, and Craswell Nick. 2017. MISC: A data set of information-seeking conversations. In Proceedings of the 1st International Workshop on Conversational Approaches to Information Retrieval (CAIR’17). https://www.microsoft.com/en-us/research/wp-content/uploads/2017/07/Thomas-etal-CAIR17.pdf.Google ScholarGoogle Scholar
  284. [284] Thomason Jesse, Gordon Daniel, and Bisk Yonatan. 2019. Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’19). 19771983. https://www.aclweb.org/anthology/papers/N/N19/N19-1197/.Google ScholarGoogle ScholarCross RefCross Ref
  285. [285] Thorne James, Vlachos Andreas, Christodoulopoulos Christos, and Mittal Arpit. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 809819. Google ScholarGoogle ScholarCross RefCross Ref
  286. [286] Trippas Johanne R., Spina Damiano, Cavedon Lawrence, Joho Hideo, and Sanderson Mark. 2018. Informing the design of spoken conversational search: Perspective paper. In Proceedings of the 2018 Conference on Human Information Interaction and Retrieval (CHIIR’18). ACM, New York, NY, 3241. Google ScholarGoogle ScholarDigital LibraryDigital Library
  287. [287] Trischler Adam, Wang Tong, Yuan Xingdi, Harris Justin, Sordoni Alessandro, Bachman Philip, and Suleman Kaheer. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP. 191200. Google ScholarGoogle ScholarCross RefCross Ref
  288. [288] Tsatsaronis George, Balikas Georgios, Malakasiotis Prodromos, Partalas Ioannis, Zschunke Matthias, Alvers Michael R., Weissenborn Dirk, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1 (April 2015), 138. Google ScholarGoogle ScholarCross RefCross Ref
  289. [289] Tseng Bo-Hsiang, Shen Sheng-Syun, Lee Hung-Yi, and Lee Lin-Shan. 2016. Towards machine comprehension of spoken content: Initial TOEFL listening comprehension test by machine. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (Interspeech’16). 27312735. Google ScholarGoogle ScholarCross RefCross Ref
  290. [290] Upadhyay Shyam and Chang Ming-Wei. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 494504. https://www.aclweb.org/anthology/E17-1047.Google ScholarGoogle ScholarCross RefCross Ref
  291. [291] Vakulenko Svitlana and Savenkov Vadim. 2017. TableQA: Question answering on tabular data. arXiv:1705.06504 [CS] (2017). http://arxiv.org/abs/1705.06504.Google ScholarGoogle Scholar
  292. [292] Lee Chris van der, Gatt Albert, Miltenburg Emiel van, Wubben Sander, and Krahmer Emiel. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation. 355368. Google ScholarGoogle ScholarCross RefCross Ref
  293. [293] Meer Elke van der, Beyer Reinhard, Heinze Bertram, and Badel Isolde. 2002. Temporal order relations in language comprehension. Journal of Experimental Psychology. Learning, Memory, and Cognition 28, 4 (July 2002), 770779.Google ScholarGoogle ScholarCross RefCross Ref
  294. [294] Dijk Teun A. van and Kintsch Walter. 1983. Strategies of Discourse Comprehension. Academic Press, New York, NY. P302 .D472 1983Google ScholarGoogle Scholar
  295. [295] Vilares David and Gómez-Rodríguez Carlos. 2019. HEAD-QA: A healthcare dataset for complex reasoning. arXiv:1906.04701 [CS] (2019). http://arxiv.org/abs/1906.04701.Google ScholarGoogle Scholar
  296. [296] Voorhees Ellen M. and Tice Dawn M.. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00). ACM, New York, NY, 200207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  297. [297] Wallace Eric and Boyd-Graber Jordan. 2018. Trick me if you can: Adversarial writing of trivia challenge questions. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW’18). 127133. http://aclweb.org/anthology/P18-3018.Google ScholarGoogle ScholarCross RefCross Ref
  298. [298] Wallace Eric, Feng Shi, Kandpal Nikhil, Gardner Matt, and Singh Sameer. 2019. Universal adversarial triggers for attacking and analyzing NLP. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’19) http://arxiv.org/abs/1908.07125.Google ScholarGoogle Scholar
  299. [299] Wang Bingning, Yao Ting, Zhang Qi, Xu Jingfang, and Wang Xiaochuan. 2020. ReCO: A large scale Chinese reading comprehension dataset on opinion. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 8. https://www.aaai.org/Papers/AAAI/2020GB/AAAI-WangB.2547.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  300. [300] Wang Jiexin, Jatowt Adam, Färber Michael, and Yoshikawa Masatoshi. 2021. Improving question answering for event-focused questions in temporal collections of news articles. Information Retrieval Journal 24, 1 (Feb. 2021), 2954. Google ScholarGoogle ScholarDigital LibraryDigital Library
  301. [301] Wang Jiexin, Jatowt Adam, and Yoshikawa Masatoshi. 2022. ArchivalQA: A large-scale benchmark dataset for open domain question answering over historical news collections. arXiv:2109.03438 [CS]. Google ScholarGoogle ScholarCross RefCross Ref
  302. [302] Wang Ping, Shi Tian, and Reddy Chandan K.. 2020. Text-to-SQL generation for question answering on electronic medical records. In Proceedings of the Web Conference 2020 (WWW’20). ACM, New York, NY, 350361. Google ScholarGoogle ScholarDigital LibraryDigital Library
  303. [303] Watarai Takuto and Tsuchiya Masatoshi. 2020. Developing dataset of Japanese slot filling quizzes designed for evaluation of machine reading comprehension. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 68956901. https://www.aclweb.org/anthology/2020.lrec-1.852.Google ScholarGoogle Scholar
  304. [304] Weissenborn Dirk, Minervini Pasquale, Augenstein Isabelle, Welbl Johannes, Rocktäschel Tim, Bošnjak Matko, Mitchell Jeff, et al. 2018. Jack the Reader—A machine reading framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 2530. Google ScholarGoogle ScholarCross RefCross Ref
  305. [305] Weston Jason, Bordes Antoine, Chopra Sumit, Rush Alexander M., Merriënboer Bart van, Joulin Armand, and Mikolov Tomas. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698 (2015).Google ScholarGoogle Scholar
  306. [306] White Michael, Chapman Graham, Cleese John, Idle Eric, Gilliam Terry, Jones Terry, Palin Michael, et al. 2001. Monty Python and the Holy Grail.Google ScholarGoogle Scholar
  307. [307] Wong Yuk Wah and Mooney Raymond. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. 439446. https://www.aclweb.org/anthology/N06-1056.Google ScholarGoogle ScholarDigital LibraryDigital Library
  308. [308] Wu Chien-Sheng, Madotto Andrea, Liu Wenhao, Fung Pascale, and Xiong Caiming. 2022. QAConv: Question answering on informative conversations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 53895411. Google ScholarGoogle ScholarCross RefCross Ref
  309. [309] Xiong Wenhan, Wu Jiawei, Wang Hong, Kulkarni Vivek, Yu Mo, Chang Shiyu, Guo Xiaoxiao, and Wang William Yang. 2019. TWEETQA: A social media focused question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 50205031. Google ScholarGoogle ScholarCross RefCross Ref
  310. [310] Xu Canwen, Pei Jiaxin, Wu Hongtao, Liu Yiyu, and Li Chenliang. 2020. MATINF: A jointly labeled large-scale dataset for classification, question answering and summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 35863596. https://www.aclweb.org/anthology/2020.acl-main.330.Google ScholarGoogle ScholarCross RefCross Ref
  311. [311] Xu Ying, Wang Dakuo, Yu Mo, Ritchie Daniel, Yao Bingsheng, Wu Tongshuang, Zhang Zheng, et al. 2022. Fantastic questions and where to find them: FairytaleQA—An authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 447460. Google ScholarGoogle ScholarCross RefCross Ref
  312. [312] Yang Yi, Yih Wen-Tau, and Meek Christopher. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 20132018. http://aclweb.org/anthology/D15-1237.Google ScholarGoogle ScholarCross RefCross Ref
  313. [313] Yang Zhengzhe and Choi Jinho D.. 2019. FriendsQA: Open-domain question answering on TV show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. 188197. Google ScholarGoogle ScholarCross RefCross Ref
  314. [314] Yang Zhilin, Qi Peng, Zhang Saizheng, Bengio Yoshua, Cohen William, Salakhutdinov Ruslan, and Manning Christopher D.. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Nautral Language Processing (EMNLP’18). 23692380. http://aclweb.org/anthology/D18-1259.Google ScholarGoogle ScholarCross RefCross Ref
  315. [315] Yatskar Mark. 2019. A qualitative comparison of CoQA, SQuAD 2.0, and QuAC. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 23182323. https://www.aclweb.org/anthology/papers/N/N19/N19-1241/.Google ScholarGoogle Scholar
  316. [316] Yin Fan, Shi Zhouxing, Hsieh Cho-Jui, and Chang Kai-Wei. 2021. On the faithfulness measurements for model interpretations. arXiv:2104.08782 [CS] (2021). http://arxiv.org/abs/2104.08782.Google ScholarGoogle Scholar
  317. [317] You Chenyu, Chen Nuo, Liu Fenglin, Yang Dongchao, and Zou Yuexian. 2020. Towards data distillation for end-to-end spoken conversational question answering. arXiv:2010.08923 [CS, EESS] (2020). http://arxiv.org/abs/2010.08923.Google ScholarGoogle Scholar
  318. [318] Yu Weihao, Jiang Zihang, Dong Yanfei, and Feng Jiashi. 2019. ReClor: A reading comprehension dataset requiring logical reasoning. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19). https://openreview.net/forum?id=HJgJtT4tvB.Google ScholarGoogle Scholar
  319. [319] Zellers Rowan, Bisk Yonatan, Schwartz Roy, and Choi Yejin. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 93104. http://aclweb.org/anthology/D18-1009.Google ScholarGoogle ScholarCross RefCross Ref
  320. [320] Zellers Rowan, Holtzman Ari, Bisk Yonatan, Farhadi Ali, and Choi Yejin. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Conference of the Association for Computational Linguistics (ACL’19). http://arxiv.org/abs/1905.07830.Google ScholarGoogle ScholarCross RefCross Ref
  321. [321] Zhang Michael and Choi Eunsol. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 73717387. Google ScholarGoogle ScholarCross RefCross Ref
  322. [322] Zhang Sheng, Liu Xiaodong, Liu Jingjing, Gao Jianfeng, Duh Kevin, and Durme Benjamin Van. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885 [cs] (oct2018). arXiv:1810.12885 [cs] http://arxiv.org/abs/1810.12885.Google ScholarGoogle Scholar
  323. [323] Zhang Yian, Warstadt Alex, Li Haau-Sing, and Bowman Samuel R.. 2020. When do you need billions of words of pretraining data? arXiv:2011.04946 [cs] (nov2020). arXiv:2011.04946 [cs] http://arxiv.org/abs/2011.04946.Google ScholarGoogle Scholar
  324. [324] Zhang Zhuosheng and Zhao Hai. 2018. One-shot learning for question-answering in gaokao history challenge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 449461. https://www.aclweb.org/anthology/C18-1038.Google ScholarGoogle Scholar
  325. [325] Zhao Tony Z., Wallace Eric, Feng Shi, Klein Dan, and Singh Sameer. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning (ICML’21). http://arxiv.org/abs/2102.09690.Google ScholarGoogle Scholar
  326. [326] Zhong Victor, Xiong Caiming, and Socher Richard. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 [CS] (2017). http://arxiv.org/abs/1709.00103.Google ScholarGoogle Scholar
  327. [327] Zhou Ben, Khashabi Daniel, Ning Qiang, and Roth Dan. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 33613367. Google ScholarGoogle ScholarCross RefCross Ref
  328. [328] Zhu Fengbin, Lei Wenqiang, Huang Youcheng, Wang Chao, Zhang Shuo, Lv Jiancheng, Feng Fuli, and Chua Tat-Seng. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL’21). 32773287. Google ScholarGoogle ScholarCross RefCross Ref
  329. [329] Zhu Fengbin, Lei Wenqiang, Wang Chao, Zheng Jianming, Poria Soujanya, and Chua Tat-Seng. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv:2101.00774 [CS] (2021). http://arxiv.org/abs/2101.00774.Google ScholarGoogle Scholar
  330. [330] Zhu Linchao, Xu Zhongwen, Yang Yi, and Hauptmann Alexander G.. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision 124, 3 (Sept. 2017), 409421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  331. [331] Zhu Ming, Ahuja Aman, Juan Da-Cheng, Wei Wei, and Reddy Chandan K.. 2020. Question answering with long multiple-span answers. In Findings of EMNLP’20. 38403849. Google ScholarGoogle ScholarCross RefCross Ref
  332. [332] Zwaan Rolf A.. 2016. Situation models, mental simulations, and abstract concepts in discourse comprehension. Psychonomic Bulletin & Review 23, 4 (Aug. 2016), 10281034. Google ScholarGoogle ScholarCross RefCross Ref
