
QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

Published: 02 February 2023

Abstract

Alongside the huge volume of research on deep learning models in NLP in recent years, there has been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with more than 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “skills” that question answering/reading comprehension systems are supposed to acquire and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed both at practitioners looking for pointers to the wealth of existing data and at researchers working on new resources.

  191. [191] McCann Bryan, Keskar Nitish Shirish, Xiong Caiming, and Socher Richard. 2018. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730 [CS, STAT] (2018). http://arxiv.org/abs/1806.08730.Google ScholarGoogle Scholar
  192. [192] McCarthy John and Hayes Patrick. 1969. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence 4, Meltzer B. and Michie Donald (Eds.). Edinburgh University Press, 463502.Google ScholarGoogle Scholar
  193. [193] McCoy Tom, Min Junghyun, and Linzen Tal. 2019. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. arXiv:1911.02969 [CS] (2019). http://arxiv.org/abs/1911.02969.Google ScholarGoogle Scholar
  194. [194] McCoy Tom, Pavlick Ellie, and Linzen Tal. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 34283448. Google ScholarGoogle ScholarCross RefCross Ref
  195. [195] McNamara Danielle S. and Magliano Joe. 2009. Toward a comprehensive model of comprehension. In The Psychology of Learning and Motivation. Psychology of Learning and Motivation Series, Vol. 51. Academic Press, Cambridge, MA, 297384. Google ScholarGoogle ScholarCross RefCross Ref
  196. [196] Miao Shen-Yun, Liang Chao-Chun, and Su Keh-Yih. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 975984. https://www.aclweb.org/anthology/2020.acl-main.92.Google ScholarGoogle ScholarCross RefCross Ref
  197. [197] Mihaylov Todor, Clark Peter, Khot Tushar, and Sabharwal Ashish. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 23812391. http://aclweb.org/anthology/D18-1260.Google ScholarGoogle ScholarCross RefCross Ref
  198. [198] Min Sewon, Michael Julian, Hajishirzi Hannaneh, and Zettlemoyer Luke. 2020. AmbigQA: Answering ambiguous open-domain questions. arXiv:2004.10645 [CS] (2020). http://arxiv.org/abs/2004.10645.Google ScholarGoogle Scholar
  199. [199] Mirzaee Roshanak, Faghihi Hossein Rajaby, Ning Qiang, and Kordjamshidi Parisa. 2021. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’21). 45824598. Google ScholarGoogle ScholarCross RefCross Ref
  200. [200] Mishra Swaroop, Mitra Arindam, Varshney Neeraj, Sachdeva Bhavdeep, and Baral Chitta. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks. arXiv preprint arXiv:2005.08516 (2020). https://arxiv.org/abs/2005.08516.Google ScholarGoogle Scholar
  201. [201] Mitchell Margaret, Wu Simone, Zaldivar Andrew, Barnes Parker, Vasserman Lucy, Hutchinson Ben, Spitzer Elena, Raji Inioluwa Deborah, and Gebru Timnit. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*’19). ACM, New York, NY, 220229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  202. [202] Modi Ashutosh, Anikina Tatjana, Ostermann Simon, and Pinkal Manfred. 2016. InScript: Narrative texts annotated with script information. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’16). 34853493. https://www.aclweb.org/anthology/L16-1555.Google ScholarGoogle Scholar
  203. [203] Möller Timo, Reina Anthony, Jayakumar Raghavan, and Pietsch Malte. 2020. COVID-QA: A question answering dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL’20. https://www.aclweb.org/anthology/2020.nlpcovid19-acl.18.Google ScholarGoogle Scholar
  204. [204] Mostafazadeh Nasrin, Roth Michael, Chambers Nathanael, and Louis Annie. 2017. LSDSem 2017 shared task: The story Cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential, and Discourse-Level Semantics. 4651. http://www.aclweb.org/anthology/W17-0900.Google ScholarGoogle ScholarCross RefCross Ref
  205. [205] Mozannar Hussein, Maamary Elie, Hajal Karl El, and Hajj Hazem. 2019. Neural Arabic question answering. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 108118. Google ScholarGoogle ScholarCross RefCross Ref
  206. [206] Mun Jonghwan, Seo Paul Hongsuck, Jung Ilchae, and Han Bohyung. 2017. MarioQA: Answering questions by watching gameplay videos. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). http://arxiv.org/abs/1612.01669.Google ScholarGoogle ScholarCross RefCross Ref
  207. [207] Nakov Preslav, Hoogeveen Doris, Màrquez Lluís, Moschitti Alessandro, Mubarak Hamdy, Baldwin Timothy, and Verspoor Karin. 2017. SemEval-2017 Task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval’17). 2748. http://www.aclweb.org/anthology/S17-2003.Google ScholarGoogle ScholarCross RefCross Ref
  208. [208] Nakov Preslav, Màrquez Lluís, Magdy Walid, Moschitti Alessandro, Glass Jim, and Randeree Bilal. 2015. SemEval-2015 Task 3: Answer selection in community question answering. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval’15). 269281.Google ScholarGoogle Scholar
  209. [209] Nakov Preslav, Màrquez Lluís, Moschitti Alessandro, Magdy Walid, Mubarak Hamdy, Freihat abed Alhakim, Glass Jim, and Randeree Bilal. 2016. SemEval-2016 Task 3: Community question answering. 525545.Google ScholarGoogle Scholar
  210. [210] Nguyen Kiet, Nguyen Vu, Nguyen Anh, and Nguyen Ngan. 2020. A Vietnamese dataset for evaluating machine reading comprehension. In Proceedings of the International Conference on Learning Representations (ICLR’20). 25952605. Google ScholarGoogle ScholarCross RefCross Ref
  211. [211] Ning Qiang, Wu Hao, Han Rujun, Peng Nanyun, Gardner Matt, and Roth Dan. 2020. TORQUE: A reading comprehension dataset of temporal ordering questions. arXiv:2005.00242 [CS] (2020). http://arxiv.org/abs/2005.00242.Google ScholarGoogle Scholar
  212. [212] Omura Kazumasa, Kawahara Daisuke, and Kurohashi Sadao. 2020. A method for building a commonsense inference dataset based on basic events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 24502460. https://www.aclweb.org/anthology/2020.emnlp-main.192.Google ScholarGoogle ScholarCross RefCross Ref
  213. [213] Onishi Takeshi, Wang Hai, Bansal Mohit, Gimpel Kevin, and McAllester David. 2016. Who did what: A large-scale person-centered Cloze dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 22302235. Google ScholarGoogle ScholarCross RefCross Ref
  214. [214] Ostermann Simon, Modi Ashutosh, Roth Michael, Thater Stefan, and Pinkal Manfred. 2018. MCScript: A novel dataset for assessing machine comprehension using script knowledge. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’18). https://www.aclweb.org/anthology/L18-1564.Google ScholarGoogle Scholar
  215. [215] Ostermann Simon, Roth Michael, Modi Ashutosh, Thater Stefan, and Pinkal Manfred. 2018. SemEval-2018 Task 11: Machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation. 747757. Google ScholarGoogle ScholarCross RefCross Ref
  216. [216] Pampari Anusri, Raghavan Preethi, Liang Jennifer, and Peng Jian. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 23572368. http://aclweb.org/anthology/D18-1258.Google ScholarGoogle ScholarCross RefCross Ref
  217. [217] Pang Richard Yuanzhe, Parrish Alicia, Joshi Nitish, Nangia Nikita, Phang Jason, Chen Angelica, Padmakumar Vishakh, et al. 2022. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 53365358. Google ScholarGoogle ScholarCross RefCross Ref
  218. [218] Paperno Denis, Kruszewski Germán, Lazaridou Angeliki, Pham Quan Ngoc, Bernardi Raffaella, Pezzelle Sandro, Baroni Marco, Boleda Gemma, and Fernández Raquel. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv:1606.06031 [CS] (2016). http://arxiv.org/abs/1606.06031.Google ScholarGoogle Scholar
  219. [219] Pasupat Panupong and Liang Percy. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 5th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 14701480. Google ScholarGoogle ScholarCross RefCross Ref
  220. [220] Patel Alkesh, Bindal Akanksha, Kotek Hadas, Klein Christopher, and Williams Jason. 2020. Generating natural questions from images for multimodal assistants. arXiv:2012.03678 [CS] (2020). http://arxiv.org/abs/2012.03678.Google ScholarGoogle Scholar
  221. [221] Peñas Anselmo, Unger Christina, and Ngomo Axel-Cyrille Ngonga. 2014. Overview of CLEF question answering track 2014. In Information Access Evaluation. Multilinguality, Multimodality, and Interaction. Springer, Cham, Switzerland, 300306. Google ScholarGoogle ScholarCross RefCross Ref
  222. [222] Peñas Anselmo, Unger Christina, Paliouras Georgios, and Kakadiaris Ioannis. 2015. Overview of the CLEF question answering track 2015. In Experimental IR Meets Multilinguality, Multimodality, and Interaction.Lecture Notes in Computer Science, Vol. 9283. Springer, 539544.Google ScholarGoogle Scholar
  223. [223] Penha Gustavo, Balan Alexandru, and Hauff Claudia. 2019. Introducing MANtIS: A novel multi-domain information seeking dialogues dataset. arXiv:1912.04639 [CS] (2019). http://arxiv.org/abs/1912.04639.Google ScholarGoogle Scholar
  224. [224] Peskov Denis, Clarke Nancy, Krone Jason, Fodor Brigi, Zhang Yi, Youssef Adel, and Diab Mona. 2019. Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 45264536. Google ScholarGoogle ScholarCross RefCross Ref
  225. [225] Pfeiffer Jonas, Geigle Gregor, Kamath Aishwarya, Steitz Jan-Martin, Roth Stefan, Vulić Ivan, and Gurevych Iryna. 2022. xGQA: Cross-lingual visual question answering. In Findings of ACL’22. 24972511. Google ScholarGoogle ScholarCross RefCross Ref
  226. [226] Price Eric. 2014. The NIPS Experiment. Retrieved September 16, 2022 from http://blog.mrtz.org/2014/12/15/the-nips-experiment.html.Google ScholarGoogle Scholar
  227. [227] Pruthi Danish, Gupta Mansi, Dhingra Bhuwan, Neubig Graham, and Lipton Zachary C.. 2019. Learning to deceive with attention-based explanations. arXiv:1909.07913 [CS] (2019). http://arxiv.org/abs/1909.07913.Google ScholarGoogle Scholar
  228. [228] Qin Lianhui, Gupta Aditya, Upadhyay Shyam, He Luheng, Choi Yejin, and Faruqui Manaal. 2021. TIMEDIAL: Temporal commonsense reasoning in dialog. arXiv:2106.04571 [CS.CL] (2021).Google ScholarGoogle Scholar
  229. [229] Qiu Boyu, Chen Xu, Xu Jungang, and Sun Yingfei. 2019. A survey on neural machine reading comprehension. arXiv:1906.03824 [CS] (2019). http://arxiv.org/abs/1906.03824.Google ScholarGoogle Scholar
  230. [230] Qu Chen, Yang Liu, Croft W. Bruce, Trippas Johanne R., Zhang Yongfeng, and Qiu Minghui. 2018. Analyzing and characterizing user intent in information-seeking conversations. In Proceedings of the 41st ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’18). ACM, New York, NY, 989992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  231. [231] Radlinski Filip, Balog Krisztian, Byrne Bill, and Krishnamoorthi Karthik. 2019. Coached conversational preference elicitation: A case study in understanding movie preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. https://research.google/pubs/pub48414/.Google ScholarGoogle Scholar
  232. [232] Raghavi Khyathi Chandu, Chinnakotla Manoj Kumar, and Shrivastava Manish. 2015. “Answer ka type kya he?”: Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web (WWW’15 Companion). ACM, New York, NY, 853858. Google ScholarGoogle ScholarDigital LibraryDigital Library
  233. [233] Rajani Nazneen Fatema, Krause Ben, Yin Wengpeng, Niu Tong, Socher Richard, and Xiong Caiming. 2020. Explaining and improving model behavior with k nearest neighbor representations. arXiv:2010.09030 [CS] (2020). http://arxiv.org/abs/2010.09030.Google ScholarGoogle Scholar
  234. [234] Rajpurkar Pranav, Jia Robin, and Liang Percy. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 784789. http://aclweb.org/anthology/P18-2124.Google ScholarGoogle ScholarCross RefCross Ref
  235. [235] Rajpurkar Pranav, Zhang Jian, Lopyrev Konstantin, and Liang Percy. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’16). 23832392.Google ScholarGoogle ScholarCross RefCross Ref
  236. [236] Ramponi Alan and Plank Barbara. 2020. Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the International Conference on Computational Linguistics (COLING’20). 68386855. Google ScholarGoogle ScholarCross RefCross Ref
  237. [237] Rashkin Hannah, Sap Maarten, Allaway Emily, Smith Noah A., and Choi Yejin. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 463473. Google ScholarGoogle ScholarCross RefCross Ref
  238. [238] Reddy Siva, Chen Danqi, and Manning Christopher D.. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7 (March 2019), 249266. Google ScholarGoogle ScholarCross RefCross Ref
  239. [239] Ribeiro Marco Tulio, Singh Sameer, and Guestrin Carlos. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, New York, NY, 11351144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  240. [240] Ribeiro Marco Tulio, Wu Tongshuang, Guestrin Carlos, and Singh Sameer. 2020. Beyond accuracy: Behavioral testing of NLP models with checklist. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 49024912. https://www.aclweb.org/anthology/2020.acl-main.442.Google ScholarGoogle ScholarCross RefCross Ref
  241. [241] Richardson Matthew, Burges Christopher J. C., and Renshaw Erin. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’13). 193203.Google ScholarGoogle Scholar
  242. [242] Rodriguez Pedro and Boyd-Graber Jordan. 2021. Evaluation paradigms in question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 96309642. Google ScholarGoogle ScholarCross RefCross Ref
  243. [243] Rodriguez Pedro, Crook Paul, Moon Seungwhan, and Wang Zhiguang. 2020. Information seeking in the spirit of learning: A dataset for conversational curiosity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 81538172. Google ScholarGoogle ScholarCross RefCross Ref
  244. [244] Rodriguez Pedro, Feng Shi, Iyyer Mohit, He He, and Boyd-Graber Jordan. 2021. Quizbowl: The case for incremental question answering. arXiv:1904.04792 [CS] (2021). Google ScholarGoogle ScholarCross RefCross Ref
  245. [245] Roemmele Melissa, Bejan Cosmin Adrian, and Gordon Andrew S.. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning. 6.Google ScholarGoogle Scholar
  246. [246] Rogers Anna. 2019. How the Transformers Broke NLP Leaderboards. Retrieved September 16, 2022 from https://hackingsemantics.xyz/2019/leaderboards/.Google ScholarGoogle Scholar
  247. [247] Rogers Anna. 2021. Changing the world by changing the data. In Proceedings of the Conference of the Association for Computational Linguistics (ACL’21). 21822194. https://aclanthology.org/2021.acl-long.170.Google ScholarGoogle ScholarCross RefCross Ref
  248. [248] Rogers Anna and Augenstein Isabelle. 2020. What can we do to improve peer review in NLP? In Findings of EMNLP’20. 12561262. https://www.aclweb.org/anthology/2020.findings-emnlp.112/.Google ScholarGoogle Scholar
  249. [249] Rogers Anna, Kovaleva Olga, Downey Matthew, and Rumshisky Anna. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 87228731. https://aaai.org/ojs/index.php/AAAI/article/view/6398.Google ScholarGoogle ScholarCross RefCross Ref
  250. [250] Roy Uma, Constant Noah, Al-Rfou Rami, Barua Aditya, Phillips Aaron, and Yang Yinfei. 2020. LAReQA: Language-agnostic answer retrieval from a multilingual pool. arXiv:2004.05484 [CS] (2020). http://arxiv.org/abs/2004.05484.Google ScholarGoogle Scholar
  251. [251] Ruder Sebastian and Avirup Si. 2021. Multi-domain multilingual question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21).Google ScholarGoogle ScholarCross RefCross Ref
  252. [252] Rudinger Rachel, Shwartz Vered, Hwang Jena D., Bhagavatula Chandra, Forbes Maxwell, Bras Ronan Le, Smith Noah A., and Choi Yejin. 2020. Thinking like a skeptic: Defeasible inference in natural language. In Findings of EMNLP’20. 46614675. Google ScholarGoogle ScholarCross RefCross Ref
  253. [253] Rychalska Barbara, Basaj Dominika, Wróblewska Anna, and Biecek Przemyslaw. 2018. Does it care what you asked? Understanding importance of verbs in deep learning QA system. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP. 322324. http://aclweb.org/anthology/W18-5436.Google ScholarGoogle ScholarCross RefCross Ref
  254. [254] Sachan Mrinmaya, Dubey Kumar, Xing Eric, and Richardson Matthew. 2015. Learning answer-entailing structures for machine comprehension. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 5th International Joint Conference on Natural Language Processing (ACL-IJCNLP’21). 239249. Google ScholarGoogle ScholarCross RefCross Ref
  255. [255] Saeidi Marzieh, Bartolo Max, Lewis Patrick, Singh Sameer, Rocktäschel Tim, Sheldon Mike, Bouchard Guillaume, and Riedel Sebastian. 2018. Interpretation of natural language rules in conversational machine reading. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 20872097. Google ScholarGoogle ScholarCross RefCross Ref
  256. [256] Sakaguchi Keisuke, Bras Ronan Le, Bhagavatula Chandra, and Choi Yejin. 2019. WinoGrande: An adversarial Winograd Schema Challenge at scale. arXiv:1907.10641 [CS] (2019). http://arxiv.org/abs/1907.10641.Google ScholarGoogle Scholar
  257. [257] Sap Maarten, Rashkin Hannah, Chen Derek, Bras Ronan Le, and Choi Yejin. 2019. Social IQA: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 44534463. Google ScholarGoogle ScholarCross RefCross Ref
  258. [258] Schlegel Viktor, Nenadic Goran, and Batista-Navarro Riza. 2020. Beyond leaderboards: A survey of methods for revealing weaknesses in natural language inference data and models. arXiv:2005.14709 [CS] (2020). http://arxiv.org/abs/2005.14709.Google ScholarGoogle Scholar
  259. [259] Schlegel Viktor, Valentino Marco, Freitas André, Nenadic Goran, and Batista-Navarro Riza. 2020. A framework for evaluation of machine reading comprehension gold standards. In Proceedings of the Language Resources and Evaluation Conference. http://arxiv.org/abs/2003.04642.Google ScholarGoogle Scholar
  260. [260] Serban Iulian Vlad, Lowe Ryan, Henderson Peter, Charlin Laurent, and Pineau Joelle. 2015. A survey of available corpora for building data-driven dialogue systems. arXiv:1512.05742 [CS, STAT] (2015). http://arxiv.org/abs/1512.05742.Google ScholarGoogle Scholar
  261. [261] Shao Chih Chieh, Liu Trois, Lai Yuting, Tseng Yiying, and Tsai Sam. 2019. DRCD: A Chinese machine reading comprehension dataset. arXiv:1806.00920 [CS] (2019). http://arxiv.org/abs/1806.00920.Google ScholarGoogle Scholar
  262. [262] Shi Shuming, Wang Yuehui, Lin Chin-Yew, Liu Xiaojiang, and Rui Yong. 2015. Automatically solving number word problems by semantic parsing and reasoning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 11321142. Google ScholarGoogle ScholarCross RefCross Ref
  263. [263] Shibuki Hideyuki, Sakamoto Kotaro, Kano Yoshionobu, Mitamura Teruko, Ishioroshi Madoka, Itakura Kelly Y., Wang Di, Mori Tatsunori, and Kando Noriko. 2014. Overview of the NTCIR-11 QA-lab task. In Proceedings of the 11th NTCIR Conference. 518529. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings11/pdf/NTCIR/OVERVIEW/01-NTCIR11-OV-QALAB-ShibukiH.pdf.Google ScholarGoogle Scholar
  264. [264] Sinha Koustuv, Sodhani Shagun, Dong Jin, Pineau Joelle, and Hamilton William L.. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 44964505. Google ScholarGoogle ScholarCross RefCross Ref
  265. [265] Sitaram Sunayana, Chandu Khyathi Raghavi, Rallabandi Sai Krishna, and Black Alan W.. 2020. A survey of code-switched speech and language processing. arXiv:1904.00784 [CS, STAT] (2020). http://arxiv.org/abs/1904.00784.Google ScholarGoogle Scholar
  266. [266] Soleimani Amir, Monz Christof, and Worring Marcel. 2021. NLQuAD: A non-factoid long question answering data set. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (EACL’21). 12451255. https://aclanthology.org/2021.eacl-main.106.Google ScholarGoogle ScholarCross RefCross Ref
  267. [267] Sugawara Saku and Aizawa Akiko. 2016. An analysis of prerequisite skills for reading comprehension. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods. 15. Google ScholarGoogle ScholarCross RefCross Ref
  268. [268] Sugawara Saku, Kido Yusuke, Yokono Hikaru, and Aizawa Akiko. 2017. Evaluation metrics for machine reading comprehension: Prerequisite skills and readability. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 806817. Google ScholarGoogle ScholarCross RefCross Ref
  269. [269] Sugawara Saku, Stenetorp Pontus, Inui Kentaro, and Aizawa Akiko. 2020. Assessing the benchmarking capacity of machine reading comprehension datasets. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). http://arxiv.org/abs/1911.09241.Google ScholarGoogle ScholarCross RefCross Ref
  270. [270] Suhr Alane, Lewis Mike, Yeh James, and Artzi Yoav. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 217223. Google ScholarGoogle ScholarCross RefCross Ref
  271. [271] Suhr Alane, Zhou Stephanie, Zhang Ally, Zhang Iris, Bai Huajun, and Artzi Yoav. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 64186428. Google ScholarGoogle ScholarCross RefCross Ref
  272. [272] Sun Haitian, Cohen William, and Salakhutdinov Ruslan. 2022. ConditionalQA: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 36273637. Google ScholarGoogle ScholarCross RefCross Ref
  273. [273] Sun Kai, Yu Dian, Chen Jianshu, Yu Dong, Choi Yejin, and Cardie Claire. 2019. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7 (April 2019), 217231. Google ScholarGoogle ScholarCross RefCross Ref
  274. [274] Sun Ningyuan, Yang Xuefeng, and Liu Yunfeng. 2020. TableQA: A large-scale Chinese Text-to-SQL dataset for table-aware SQL generation. arXiv:2006.06434 [CS] (2020). http://arxiv.org/abs/2006.06434.Google ScholarGoogle Scholar
  275. [275] Suster Simon and Daelemans Walter. 2018. CliCR: A dataset of clinical case reports for machine reading comprehension. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 15511563. Google ScholarGoogle ScholarCross RefCross Ref
  276. [276] Tafjord Oyvind, Clark Peter, Gardner Matt, Yih Wen-Tau, and Sabharwal Ashish. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI’19).Google ScholarGoogle ScholarDigital LibraryDigital Library
  277. [277] Tafjord Oyvind, Gardner Matt, Lin Kevin, and Clark Peter. 2019. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 59415946. Google ScholarGoogle ScholarCross RefCross Ref
  278. [278] Talmor Alon and Berant Jonathan. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 641651. Google ScholarGoogle ScholarCross RefCross Ref
  279. [279] Talmor Alon, Herzig Jonathan, Lourie Nicholas, and Berant Jonathan. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’22). 41494158. https://www.aclweb.org/anthology/papers/N/N19/N19-1421/.Google ScholarGoogle Scholar
  280. [280] Talmor Alon, Yoran Ori, Catav Amnon, Lahav Dan, Wang Yizhong, Asai Akari, Ilharco Gabriel, Hajishirzi Hannaneh, and Berant Jonathan. 2021. MultimodalQA: Complex question answering over text, tables and images. In Proceedings of the 9th International Conference on Learning Representations (ICLR’21). 12. https://openreview.net/pdf/f3dad930cb55abce99a229e35cc131a2db791b66.pdf.Google ScholarGoogle Scholar
  281. [281] Tapaswi Makarand, Zhu Yukun, Stiefelhagen Rainer, Torralba Antonio, Urtasun Raquel, and Fidler Sanja. 2016. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  282. [282] Thakur Nandan, Reimers Nils, Rücklé Andreas, Srivastava Abhishek, and Gurevych Iryna. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the 35th Conference on Neural Information Processing Systems, Datasets, and Benchmarks Track. https://openreview.net/forum?id=wCu6T5xFjeJ.Google ScholarGoogle Scholar
  283. [283] Thomas Paul, McDuff Daniel, Czerwinski Mary, and Craswell Nick. 2017. MISC: A data set of information-seeking conversations. In Proceedings of the 1st International Workshop on Conversational Approaches to Information Retrieval (CAIR’17). https://www.microsoft.com/en-us/research/wp-content/uploads/2017/07/Thomas-etal-CAIR17.pdf.Google ScholarGoogle Scholar
  284. [284] Thomason Jesse, Gordon Daniel, and Bisk Yonatan. 2019. Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’19). 19771983. https://www.aclweb.org/anthology/papers/N/N19/N19-1197/.Google ScholarGoogle ScholarCross RefCross Ref
  285. [285] Thorne James, Vlachos Andreas, Christodoulopoulos Christos, and Mittal Arpit. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 16th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18). 809819. Google ScholarGoogle ScholarCross RefCross Ref
  286. [286] Trippas Johanne R., Spina Damiano, Cavedon Lawrence, Joho Hideo, and Sanderson Mark. 2018. Informing the design of spoken conversational search: Perspective paper. In Proceedings of the 2018 Conference on Human Information Interaction and Retrieval (CHIIR’18). ACM, New York, NY, 3241. Google ScholarGoogle ScholarDigital LibraryDigital Library
  287. [287] Trischler Adam, Wang Tong, Yuan Xingdi, Harris Justin, Sordoni Alessandro, Bachman Philip, and Suleman Kaheer. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP. 191200. Google ScholarGoogle ScholarCross RefCross Ref
  288. [288] Tsatsaronis George, Balikas Georgios, Malakasiotis Prodromos, Partalas Ioannis, Zschunke Matthias, Alvers Michael R., Weissenborn Dirk, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1 (April 2015), 138. Google ScholarGoogle ScholarCross RefCross Ref
  289. [289] Tseng Bo-Hsiang, Shen Sheng-Syun, Lee Hung-Yi, and Lee Lin-Shan. 2016. Towards machine comprehension of spoken content: Initial TOEFL listening comprehension test by machine. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (Interspeech’16). 27312735. Google ScholarGoogle ScholarCross RefCross Ref
  290. [290] Upadhyay Shyam and Chang Ming-Wei. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL’17). 494504. https://www.aclweb.org/anthology/E17-1047.Google ScholarGoogle ScholarCross RefCross Ref
  291. [291] Vakulenko Svitlana and Savenkov Vadim. 2017. TableQA: Question answering on tabular data. arXiv:1705.06504 [CS] (2017). http://arxiv.org/abs/1705.06504.Google ScholarGoogle Scholar
  292. [292] Lee Chris van der, Gatt Albert, Miltenburg Emiel van, Wubben Sander, and Krahmer Emiel. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation. 355368. Google ScholarGoogle ScholarCross RefCross Ref
  293. [293] Meer Elke van der, Beyer Reinhard, Heinze Bertram, and Badel Isolde. 2002. Temporal order relations in language comprehension. Journal of Experimental Psychology. Learning, Memory, and Cognition 28, 4 (July 2002), 770779.Google ScholarGoogle ScholarCross RefCross Ref
  294. [294] Dijk Teun A. van and Kintsch Walter. 1983. Strategies of Discourse Comprehension. Academic Press, New York, NY. P302 .D472 1983Google ScholarGoogle Scholar
  295. [295] Vilares David and Gómez-Rodríguez Carlos. 2019. HEAD-QA: A healthcare dataset for complex reasoning. arXiv:1906.04701 [CS] (2019). http://arxiv.org/abs/1906.04701.Google ScholarGoogle Scholar
  296. [296] Voorhees Ellen M. and Tice Dawn M.. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00). ACM, New York, NY, 200207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  297. [297] Wallace Eric and Boyd-Graber Jordan. 2018. Trick me if you can: Adversarial writing of trivia challenge questions. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW’18). 127133. http://aclweb.org/anthology/P18-3018.Google ScholarGoogle ScholarCross RefCross Ref
  298. [298] Wallace Eric, Feng Shi, Kandpal Nikhil, Gardner Matt, and Singh Sameer. 2019. Universal adversarial triggers for attacking and analyzing NLP. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’19) http://arxiv.org/abs/1908.07125.Google ScholarGoogle Scholar
  299. [299] Wang Bingning, Yao Ting, Zhang Qi, Xu Jingfang, and Wang Xiaochuan. 2020. ReCO: A large scale Chinese reading comprehension dataset on opinion. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI’20). 8. https://www.aaai.org/Papers/AAAI/2020GB/AAAI-WangB.2547.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  300. [300] Wang Jiexin, Jatowt Adam, Färber Michael, and Yoshikawa Masatoshi. 2021. Improving question answering for event-focused questions in temporal collections of news articles. Information Retrieval Journal 24, 1 (Feb. 2021), 2954. Google ScholarGoogle ScholarDigital LibraryDigital Library
  301. [301] Wang Jiexin, Jatowt Adam, and Yoshikawa Masatoshi. 2022. ArchivalQA: A large-scale benchmark dataset for open domain question answering over historical news collections. arXiv:2109.03438 [CS]. Google ScholarGoogle ScholarCross RefCross Ref
  302. [302] Wang Ping, Shi Tian, and Reddy Chandan K.. 2020. Text-to-SQL generation for question answering on electronic medical records. In Proceedings of the Web Conference 2020 (WWW’20). ACM, New York, NY, 350361. Google ScholarGoogle ScholarDigital LibraryDigital Library
  303. [303] Watarai Takuto and Tsuchiya Masatoshi. 2020. Developing dataset of Japanese slot filling quizzes designed for evaluation of machine reading comprehension. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’20). 68956901. https://www.aclweb.org/anthology/2020.lrec-1.852.Google ScholarGoogle Scholar
  304. [304] Weissenborn Dirk, Minervini Pasquale, Augenstein Isabelle, Welbl Johannes, Rocktäschel Tim, Bošnjak Matko, Mitchell Jeff, et al. 2018. Jack the Reader—A machine reading framework. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 2530. Google ScholarGoogle ScholarCross RefCross Ref
  305. [305] Weston Jason, Bordes Antoine, Chopra Sumit, Rush Alexander M., Merriënboer Bart van, Joulin Armand, and Mikolov Tomas. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698 (2015).Google ScholarGoogle Scholar
  306. [306] White Michael, Chapman Graham, Cleese John, Idle Eric, Gilliam Terry, Jones Terry, Palin Michael, et al. 2001. Monty Python and the Holy Grail.Google ScholarGoogle Scholar
  307. [307] Wong Yuk Wah and Mooney Raymond. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. 439446. https://www.aclweb.org/anthology/N06-1056.Google ScholarGoogle ScholarDigital LibraryDigital Library
  308. [308] Wu Chien-Sheng, Madotto Andrea, Liu Wenhao, Fung Pascale, and Xiong Caiming. 2022. QAConv: Question answering on informative conversations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 53895411. Google ScholarGoogle ScholarCross RefCross Ref
  309. [309] Xiong Wenhan, Wu Jiawei, Wang Hong, Kulkarni Vivek, Yu Mo, Chang Shiyu, Guo Xiaoxiao, and Wang William Yang. 2019. TWEETQA: A social media focused question answering dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL’19). 50205031. Google ScholarGoogle ScholarCross RefCross Ref
  310. [310] Xu Canwen, Pei Jiaxin, Wu Hongtao, Liu Yiyu, and Li Chenliang. 2020. MATINF: A jointly labeled large-scale dataset for classification, question answering and summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20). 35863596. https://www.aclweb.org/anthology/2020.acl-main.330.Google ScholarGoogle ScholarCross RefCross Ref
  311. [311] Xu Ying, Wang Dakuo, Yu Mo, Ritchie Daniel, Yao Bingsheng, Wu Tongshuang, Zhang Zheng, et al. 2022. Fantastic questions and where to find them: FairytaleQA—An authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL’22). 447460. Google ScholarGoogle ScholarCross RefCross Ref
  312. [312] Yang Yi, Yih Wen-Tau, and Meek Christopher. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’15). 20132018. http://aclweb.org/anthology/D15-1237.Google ScholarGoogle ScholarCross RefCross Ref
  313. [313] Yang Zhengzhe and Choi Jinho D.. 2019. FriendsQA: Open-domain question answering on TV show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. 188197. Google ScholarGoogle ScholarCross RefCross Ref
  314. [314] Yang Zhilin, Qi Peng, Zhang Saizheng, Bengio Yoshua, Cohen William, Salakhutdinov Ruslan, and Manning Christopher D.. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the Conference on Empirical Methods in Nautral Language Processing (EMNLP’18). 23692380. http://aclweb.org/anthology/D18-1259.Google ScholarGoogle ScholarCross RefCross Ref
  315. [315] Yatskar Mark. 2019. A qualitative comparison of CoQA, SQuAD 2.0, and QuAC. In Proceedings of the 17th Annual Conference of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 23182323. https://www.aclweb.org/anthology/papers/N/N19/N19-1241/.Google ScholarGoogle Scholar
  316. [316] Yin Fan, Shi Zhouxing, Hsieh Cho-Jui, and Chang Kai-Wei. 2021. On the faithfulness measurements for model interpretations. arXiv:2104.08782 [CS] (2021). http://arxiv.org/abs/2104.08782.Google ScholarGoogle Scholar
  317. [317] You Chenyu, Chen Nuo, Liu Fenglin, Yang Dongchao, and Zou Yuexian. 2020. Towards data distillation for end-to-end spoken conversational question answering. arXiv:2010.08923 [CS, EESS] (2020). http://arxiv.org/abs/2010.08923.Google ScholarGoogle Scholar
  318. [318] Yu Weihao, Jiang Zihang, Dong Yanfei, and Feng Jiashi. 2019. ReClor: A reading comprehension dataset requiring logical reasoning. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19). https://openreview.net/forum?id=HJgJtT4tvB.Google ScholarGoogle Scholar
  319. [319] Zellers Rowan, Bisk Yonatan, Schwartz Roy, and Choi Yejin. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’18). 93104. http://aclweb.org/anthology/D18-1009.Google ScholarGoogle ScholarCross RefCross Ref
  320. [320] Zellers Rowan, Holtzman Ari, Bisk Yonatan, Farhadi Ali, and Choi Yejin. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Conference of the Association for Computational Linguistics (ACL’19). http://arxiv.org/abs/1905.07830.Google ScholarGoogle ScholarCross RefCross Ref
  321. [321] Zhang Michael and Choi Eunsol. 2021. SituatedQA: Incorporating extra-linguistic contexts into QA. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’21). 73717387. Google ScholarGoogle ScholarCross RefCross Ref
  322. [322] Zhang Sheng, Liu Xiaodong, Liu Jingjing, Gao Jianfeng, Duh Kevin, and Durme Benjamin Van. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885 [cs] (oct2018). arXiv:1810.12885 [cs] http://arxiv.org/abs/1810.12885.Google ScholarGoogle Scholar
  323. [323] Zhang Yian, Warstadt Alex, Li Haau-Sing, and Bowman Samuel R.. 2020. When do you need billions of words of pretraining data? arXiv:2011.04946 [cs] (nov2020). arXiv:2011.04946 [cs] http://arxiv.org/abs/2011.04946.Google ScholarGoogle Scholar
  324. [324] Zhang Zhuosheng and Zhao Hai. 2018. One-shot learning for question-answering in gaokao history challenge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). 449461. https://www.aclweb.org/anthology/C18-1038.Google ScholarGoogle Scholar
  325. [325] Zhao Tony Z., Wallace Eric, Feng Shi, Klein Dan, and Singh Sameer. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning (ICML’21). http://arxiv.org/abs/2102.09690.Google ScholarGoogle Scholar
  326. [326] Zhong Victor, Xiong Caiming, and Socher Richard. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 [CS] (2017). http://arxiv.org/abs/1709.00103.Google ScholarGoogle Scholar
  327. [327] Zhou Ben, Khashabi Daniel, Ning Qiang, and Roth Dan. 2019. “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP’19). 33613367. Google ScholarGoogle ScholarCross RefCross Ref
  328. [328] Zhu Fengbin, Lei Wenqiang, Huang Youcheng, Wang Chao, Zhang Shuo, Lv Jiancheng, Feng Fuli, and Chua Tat-Seng. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL’21). 32773287. Google ScholarGoogle ScholarCross RefCross Ref
  329. [329] Zhu Fengbin, Lei Wenqiang, Wang Chao, Zheng Jianming, Poria Soujanya, and Chua Tat-Seng. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv:2101.00774 [CS] (2021). http://arxiv.org/abs/2101.00774.Google ScholarGoogle Scholar
  330. [330] Zhu Linchao, Xu Zhongwen, Yang Yi, and Hauptmann Alexander G.. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision 124, 3 (Sept. 2017), 409421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  331. [331] Zhu Ming, Ahuja Aman, Juan Da-Cheng, Wei Wei, and Reddy Chandan K.. 2020. Question answering with long multiple-span answers. In Findings of EMNLP’20. 38403849. Google ScholarGoogle ScholarCross RefCross Ref
  332. [332] Zwaan Rolf A.. 2016. Situation models, mental simulations, and abstract concepts in discourse comprehension. Psychonomic Bulletin & Review 23, 4 (Aug. 2016), 10281034. Google ScholarGoogle ScholarCross RefCross Ref
