Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too

Hernández-Orallo, José

doi:10.1007/s11023-020-09549-0

General Article
Published: 04 November 2020

Volume 30, pages 533–562, (2020)

Minds and Machines

José Hernández-Orallo^1,2 (ORCID: orcid.org/0000-0001-9746-7632)

Abstract

In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GAN), build imitators under an adversarial setting that strongly resembles the Turing test (with the judge being a learnt discriminative model). The term “Turing learning” has been used for this kind of setting. On the other hand, AI benchmarks are suffering an adversarial situation too, with a ‘challenge-solve-and-replace’ evaluation dynamics whenever human performance is ‘imitated’. The particular AI community rushes to replace the old benchmark by a more challenging benchmark, one for which human performance would still be beyond AI. These two phenomena related to the Turing test are sufficiently distinctive, important and general for a detailed analysis. This is the main goal of this paper. After recognising the abyss that appears beyond superhuman performance, we build on Turing learning to identify two different evaluation schemas: Turing testing and adversarial testing. We revisit some of the key questions surrounding the Turing test, such as ‘understanding’, commonsense reasoning and extracting meaning from the world, and explore how the new testing paradigms should work to unmask the limitations of current and future AI. Finally, we discuss how behavioural similarity metrics could be used to create taxonomies for artificial and natural intelligence. Both testing schemas should complete a transition in which humans should give way to machines—not only as references to be imitated but also as judges—when pursuing and measuring machine intelligence.

Notes

https://www.forbes.com/sites/jenniferhicks/2015/09/20/beyond-the-turing-test/#e7206bf22411.
https://opinionator.blogs.nytimes.com/2015/02/23/outing-a-i-beyond-the-turing-test/?ref=opinion&_r=0.
https://www.templetonworldcharity.org/our-work/diverse-intelligences.
This separation is well-known in computer science, at least between solving and verifying. For instance, NP problems can be verified easily (in polynomial time), but unless P=NP, we know that solving these problems is much harder than verifying them. For the “cognitive-judge problem” we must distinguish producing, solving and verifying instances, and realise that any of the three can be harder than the others.
In some of the cases above, we are assuming that labelling requires human cognitive effort, such as the bird species example where a human must look at the images. But labelling could have been done in other ways, such as a DNA test.
In language models, ‘perplexity’ is a very common automatic metric, which basically measures how well the model anticipates the next words in a sentence and serves as a proxy for how well the model compresses the data. Compression has been connected with the Turing test and (machine) intelligence evaluation a few times (Dowe and Hajek 1997, 1998; Mahoney 1999; Dowe et al. 2011). Despite the correlation between perplexity and other evaluation metrics used by human judges, the latter are still used as ground truth to evaluate conversational agents (see, e.g., Adiwardana et al. 2020).
This was implemented using Colab over TensorFlow (https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb).
Bongard problems are pattern recognition puzzles, where the diagrams on the left have something in common (e.g., only containing convex polygons) that the diagrams on the right do not (e.g., containing concavities). Correctly telling where a new diagram belongs (left or right) is assumed to reveal an understanding of the underlying concept.
The Copycat project explored systems that could solve analogies such as “abc is to abd as ijk is to what?”, where giving the right answer should reveal the understanding of the mechanism that generated the strings.
IQ tests usually include abstract questions with diagrams or numbers. For instance, “What’s the odd one out of 40, 3, 20 and 80?” assumes understanding of a common pattern shared by three elements but not the fourth.
The C-test generated letter series using patterns whose algorithmic complexity and ‘unquestionability’ could be estimated from first principles. For instance, solving instances such as “Continue the series: abbcccdddde...” assumes understanding of the pattern that generates the series.
ARC is also inspired by algorithmic information theory, but the actual instances resemble pixelated versions of the Bongard problems, where there is a pattern that converts some images into others by applying some algorithmic transformation (e.g., filling the closed areas in the image, mirroring an image, etc.). Finding the pattern should indicate understanding of how the transformation works.
Taken from https://www.gwern.net/GPT-3.
This sonnet was also used by Turing in some of his examples about the imitation game (Turing 1950).
Taken from https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html.
These judges may have a particular training and developmental process, as child machine judges.
http://www.brain-score.org/.
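
The note on perplexity above can be made concrete with a short sketch. It assumes we already have the probability that a model assigned to each token that actually occurred; the function names `perplexity` and `bits_per_token` are illustrative and are not taken from the article or from any particular library. Perplexity is the exponential of the average negative log-probability, and its base-2 logarithm is the average number of bits an ideal coder driven by the model would spend per token, which is the sense in which it is a proxy for compression.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to the observed tokens.

    token_probs is a hypothetical input: one probability in (0, 1] per token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    # Average negative log-likelihood (cross-entropy in nats).
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def bits_per_token(token_probs):
    """Average code length in bits per token under the same model."""
    return math.log2(perplexity(token_probs))


if __name__ == "__main__":
    # A model that spreads its belief evenly over four options at every step
    # has a perplexity of about 4, i.e. about 2 bits per token.
    uniform = [0.25] * 8
    print(perplexity(uniform), bits_per_token(uniform))
```

A model that anticipates the next token better assigns it a higher probability, so both numbers fall; this is only the automatic side of the evaluation, and the note's point that human judgements remain the ground truth is unaffected.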

References

Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. (2020). Towards a human-like open-domain chatbot. arXiv:2001.09977.
Alvarado, N., Adams, S. S., Burbeck, S., & Latta, C. (2002). Beyond the Turing test: Performance metrics for evaluating a computer simulation of the human mind. In The 2nd international conference on development and learning, 2002 (pp. 147–152). IEEE.
Arel, I., & Livingston, S. (2009). Beyond the Turing test. Computer, 42(3), 90–91.
Armstrong, S., & Sotala, K. (2015). How we’re predicting AI–or failing to. In Beyond artificial intelligence (pp. 11–29). New York: Springer.
Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 224–232). JMLR.org.
Bhatnagar, S., et al. (2017). Mapping intelligence: Requirements and possibilities. In PTAI (pp. 117–135). New York: Springer.
Bongard, M. M. (1970). Pattern Recognition. New York: Spartan Books.
Borg, M., Johansen, S. S., Thomsen, D. L., & Kraus, M. (2012). Practical implementation of a graphics Turing test. In Advances in visual computing (pp. 305–313). New York: Springer.
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford: Oxford University Press.
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv:2005.14165.
Burkart, J. M., Schubiger, M. N., & van Schaik, C. P. (2017). The evolution of general intelligence. Behavioral and Brain Sciences, 40, e195.
Burr, C., & Cristianini, N. (2019). Can machines read our minds? Minds and Machines, 29(3), 461–494.
Campbell, M., Hoane, A. J., & Hsu, F. (2002). Deep Blue. Artificial Intelligence, 134(1–2), 57–83.
Chollet, F. (2019). The measure of intelligence. arXiv:1911.01547.
Cohen, P. R. (2005). If not Turing’s test, then what? AI Magazine, 26(4), 61.
Copeland, B. J. (2000). The Turing test. Minds and Machines, 10(4), 519–539.
Copeland, J., & Proudfoot, D. (2008). Turing’s test. A philosophical and historical guide. In R. Epstein, G. Roberts, G. Beber (Eds.), Parsing the Turing Test. Philosophical and Methodological Issues in the Quest for the Thinking Computer. New York: Springer.
Crosby, M., Beyret, B., Shanahan, M., Hernandez-Orallo, J., Cheke, L., & Halina, M. (2020). The animal-AI testbed and competition. Proceedings of Machine Learning Research, 123, 164–176.
Crosby, M., Beyret, B., Hernandez-Orallo, J., Cheke, L., Halina, M., & Shanahan, M. (2019). Translating from animal cognition to AI. NeurIPS workshop on biological and artificial reinforcement learning.
Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9), 92–103.
Dennett, D. C. (1971). Intentional systems. The Journal of Philosophy, 68, 87–106.
Dodge, S., & Karam, L. (2017). A study and comparison of human and deep learning recognition performance under visual distortions. In ICCCN (pp. 1–7). IEEE.
Dowe, D. L., & Hernández-Orallo, J. (2012). IQ tests are not for machines, yet. Intelligence, 40(2), 77–81.
Dowe, D. L., & Hernández-Orallo, J. (2014). How universal can an intelligence test be? Adaptive Behavior, 22(1), 51–69.
Dowe, D. L., Hernández-Orallo, J., & Das, P. K. (2011). Compression and intelligence: Social environments and communication. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds.), Artificial general intelligence (LNAI series, Vol. 6830, pp. 204–211). New York: Springer.
Dowe, D. L., & Hajek, A. R. (1997). A computational extension to the Turing test. In Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia. Also as Technical Report #97/322, Dept Computer Science, Monash University, Australia.
Dowe, D. L., & Hajek, A. R. (1998). A non-behavioural, computational extension to the Turing Test. In Intl. conf. on computational intelligence & multimedia applications (ICCIMA’98) (pp. 101–106). Gippsland, Australia.
Fabra-Boluda, R., Ferri, C., Martínez-Plumed, F., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2020). Family and prejudice: A behavioural taxonomy of machine learning techniques. In ECAI 2020, 24th European conference on artificial intelligence.
Flach, P. (2019). Performance evaluation in machine learning: The good, the bad, the ugly and the way forward. In AAAI.
Fostel, G. (1993). The Turing test is for the birds. ACM SIGART Bulletin, 4(1), 7–8.
French, R. M. (1990). Subcognition and the limits of the Turing test. Mind, 99(393), 53–65.
French, R. M. (2000). The Turing test: The first 50 years. Trends in Cognitive Sciences, 4(3), 115–122.
French, R. M. (2012). Moving beyond the Turing test. Communications of the ACM, 55(12), 74–77. https://doi.org/10.1145/2380656.2380674.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014a). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014b). Explaining and harnessing adversarial examples. arXiv:1412.6572.
Groß, R., Gu, Y., Li, W., & Gauci, M. (2017). Generalizing GANs: A Turing perspective. In Advances in neural information processing systems (pp. 6316–6326).
Gunning, D. (2018). Machine common sense concept paper. arXiv:1810.07528.
Harnad, S. (1992). The Turing test is not a trick: Turing indistinguishability is a scientific criterion. ACM SIGART Bulletin, 3(4), 9–10.
Hayes, P., & Ford, K. (1995). Turing test considered harmful. In International joint conference on artificial intelligence (IJCAI) (pp. 972–977).
Hernandez-Orallo, J. (2015). Stochastic tasks: Difficulty and Levin search. In J. Bieger, B. Goertzel, & A. Potapov (Eds.), Artificial general intelligence—8th international conference, AGI 2015, Berlin, Germany, July 22–25, 2015 (pp. 90–100). New York: Springer.
Hernández-Orallo, J. (2000). Beyond the Turing test. Journal of Logic, Language & Information, 9(4), 447–466.
Hernández-Orallo, J. (2001). On the computational measurement of intelligence factors (pp. 72–79). Gaithersburg: NIST Special Publication.
Hernández-Orallo, J. (2015). On environment difficulty and discriminating power. Autonomous Agents and Multi-Agent Systems, 29, 402–454.
Hernández-Orallo, J. (2017a). Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement. Artificial Intelligence Review, 48(3), 397–447.
Hernández-Orallo, J. (2017b). The measure of all minds: Evaluating natural and artificial intelligence. Cambridge: Cambridge University Press.
Hernández-Orallo, J. (2019a). Gazing into clever Hans machines. Nature Machine Intelligence, 1(4), 172–173.
Hernández-Orallo, J. (2019b). Unbridled mental power. Nature Physics, 15(1), 106.
Hernández-Orallo, J., & Dowe, D. L. (2010). Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18), 1508–1539.
Hernández-Orallo, J., & Dowe, D. L. (2013). On potential cognitive abilities in the machine kingdom. Minds and Machines, 23(2), 179–210.
Hernández-Orallo, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Insa-Cabrera, J. (2011). On more realistic environment distributions for defining, evaluating and developing intelligence. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds.), Artificial general intelligence (LNAI, Vol. 6830, pp. 82–91). New York: Springer.
Hernández-Orallo, J., Dowe, D. L., & Hernández-Lloreda, M. V. (2014). Universal psychometrics: Measuring cognitive abilities in the machine kingdom. Cognitive Systems Research, 27, 50–74.
Hernández-Orallo, J., Insa-Cabrera, J., Dowe, D. L., & Hibbard, B. (2012). Turing tests with Turing machines. Turing, 10, 140–156.
Hernández-Orallo, J., Martínez-Plumed, F., Schmid, U., Siebers, M., & Dowe, D. L. (2016). Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence, 230, 74–107.
Hernández-Orallo, J. (2015). C-tests revisited: Back and forth with complexity. In J. Bieger, B. Goertzel, & A. Potapov (Eds.), Artificial general intelligence: 8th international conference, AGI 2015, Berlin, Germany, July 22–25, 2015 (pp. 272–282). New York: Springer.
Hernández-Orallo, J. (2020). AI evaluation: On broken yardsticks and measurement scales. Evaluating AI Evaluation @ AAAI.
Hernández-Orallo, J., & Minaya-Collado, N. (1998). A formal definition of intelligence based on an intensional variant of Kolmogorov complexity. In Proc. intl symposium of engineering of intelligent systems (EIS’98) (pp. 146–163). ICSC Press.
Hernández-Orallo, J., & Vold, K. (2019). AI extenders: The ethical and societal implications of humans cognitively extended by AI. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 507–513).
Hernández-Orallo, J., Insa-Cabrera, J., Dowe, D. L., & Hibbard, B. (2012). Turing machines and recursive Turing Tests. In V. Muller, & A. Ayesh (Eds.), AISB/IACAP 2012 Symposium “Revisiting Turing and his Test”, The Society for the Study of Artificial Intelligence and the Simulation of Behaviour, pp. 28–33.
Hernández-Orallo, J., Baroni, M., Bieger, J., Chmait, N., Dowe, D. L., Hofmann, K., et al. (2017). A new AI evaluation cosmos: Ready to play the game? AI Magazine, 38(3), Fall 2017.
Hibbard, B. (2008). Adversarial sequence prediction. Frontiers in Artificial Intelligence and Applications, 171, 399.
Hibbard, B. (2011). Measuring agent intelligence via hierarchies of environments. In Artificial general intelligence (pp. 303–308). New York: Springer.
Hingston, P. (2009). The 2K BotPrize. In IEEE symposium on computational intelligence and games (CIG 2009) (p. 1). IEEE.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In Advances in neural information processing systems (pp. 3–10).
Hofstadter, D. R. (1980). Gödel, Escher, Bach. New York: Vintage Books.
Hofstadter, D. R., & Mitchell, M. (1994). The Copycat project: A model of mental fluidity and analogy-making. Norwood, NJ: Ablex Publishing.
Insa-Cabrera, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Hernández-Orallo, J. (2011a). Comparing humans and AI agents. In International conference on artificial general intelligence (pp. 122–132). New York: Springer.
Insa-Cabrera, J., Dowe, D. L., & Hernández-Orallo, J. (2011b). Evaluating a reinforcement learning algorithm with a general intelligence test. In J. Lozano, J. Gamez, & J. Moreno (Eds.), Current topics in artificial intelligence (CAEPIA 2011). LNAI Series 7023. New York: Springer.
Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020b). How can we know what language models know? Transactions of the Association for Computational Linguistics, 8, 423–438.
Jiang, M., Luketina, J., Nardelli, N., Minervini, P., Torr, P. H., Whiteson, S., & Rocktäschel, T. (2020a). Wordcraft: An environment for benchmarking commonsense agents. arXiv:2007.09185.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, https://www.cs.toronto.edu/~kriz/cifar.html.
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., & Aila, T. (2019). Improved precision and recall metric for assessing generative models. arXiv:1904.06991.
Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4), 391–444.
Leibo, J. Z., et al. (2018). Psychlab: A psychology laboratory for deep reinforcement learning agents. arXiv:1801.08116.
Levesque, H. J. (2017). Common sense, the Turing test, and the quest for real AI. New York: MIT Press.
Levesque, H., Davis, E., & Morgenstern, L. (2012). The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.
Li, W., Gauci, M., & Groß, R. (2016). Turing learning: A metric-free approach to inferring behavior and its application to swarms. Swarm Intelligence, 10(3), 211–243.
Li, W., Gauci, M., & Groß, R. (2013). A coevolutionary approach to learn animal behavior through controlled interaction. In Proceedings of the 15th annual conference on Genetic and evolutionary computation (pp. 223–230).
van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1), 5–20.
Mahoney, M. V. (1999). Text compression as a test for artificial intelligence. In Proceedings of the national conference on artificial intelligence (p. 970). AAAI.
Marcus, G., Rossi, F., & Veloso, M. (2016). Beyond the Turing test (special issue). AI Magazine, 37(1), 3–101.
Marcus, G. (2020). The next decade in AI: Four steps towards robust artificial intelligence. arXiv:2002.06177.
Marcus, G., Rossi, F., & Veloso, M. (2015). Beyond the Turing test. AAAI workshop, http://www.math.unipd.it/~frossi/BeyondTuring2015/.
Martinez-Plumed, F., & Hernandez-Orallo, J. (2018). Dual indicators to analyse AI benchmarks: Difficulty, discrimination, ability and generality. IEEE Transactions on Games, 12, 121–131.
Martínez-Plumed, F., Prudêncio, R. B., Martínez-Usó, A., & Hernández-Orallo, J. (2019). Item response theory in AI: Analysing machine learning classifiers at the instance level. Artificial Intelligence, 271, 18–42.
Martínez-Plumed, F., Gomez, E., & Hernández-Orallo, J. (2020). Tracking AI: The capability is (not) near. In European conference on artificial intelligence.
Masum, H., Christensen, S., & Oppacher, F. (2002). The Turing ratio: Metrics for open-ended tasks. In Conf. on genetic and evolutionary computation (pp. 973–980). Morgan Kaufmann.
McCarthy, J. (1983). Artificial intelligence needs more emphasis on basic research: President’s quarterly message. AI Magazine, 4(4), 5.
McDermott, D. (2007). Level-headed. Artificial Intelligence, 171(18), 1183–1186.
Mishra, A., Bhattacharyya, P., & Carl, M. (2013). Automatically predicting sentence translation difficulty. In ACL (pp. 346–351).
Mitchell, M. (2019). Artificial intelligence: A guide for thinking humans. UK: Penguin.
Moor, J. (2003). The Turing test: the elusive standard of artificial intelligence (Vol. 30). New York: Springer Science & Business Media.
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2019). Adversarial NLI: A new benchmark for natural language understanding. arXiv:1910.14599.
Nilsson, N. J. (2006). Human-level artificial intelligence? Be serious! AI Magazine, 26(4), 68.
Oppy, G., & Dowe, D. L. (2011). The Turing test. In E. N. Zalta (Ed.), Stanford encyclopedia of philosophy, Stanford University. http://plato.stanford.edu/entries/turing-test/.
Preston, B. (1991). AI, anthropocentrism, and the evolution of ‘intelligence’. Minds and Machines, 1(3), 259–277.
Proudfoot, D. (2011). Anthropomorphism and AI: Turing’s much misunderstood imitation game. Artificial Intelligence, 175(5), 950–957.
Proudfoot, D. (2017). The Turing test-from every angle. In J. Bowen, M. Sprevak, R. Wilson, & B. J. Copeland (Eds.), The Turing Guide. Oxford: Oxford University Press.
Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., et al. (2019). Machine behaviour. Nature, 568(7753), 477–486.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33), 7255–7269.
Rajpurkar, P., Jia, R., & Liang, P. (2018). Know what you don’t know: Unanswerable questions for SQuAD. arXiv:1806.03822.
Rozen, O., Shwartz, V., Aharoni, R., & Dagan, I. (2019). Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd conference on computational natural language learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, pp. 196–205.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sakaguchi, K., Bras, R. L., Bhagavatula, C., & Choi, Y. (2019). WinoGrande: An adversarial Winograd schema challenge at scale. arXiv:1907.10641.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
Saygin, A. P., Cicekli, I., & Akman, V. (2000). Turing test: 50 years later. Minds and Machines, 10(4), 463–518.
Schlangen, D. (2019). Language tasks and language games: On methodology in current natural language processing research. arXiv:1908.10747.
Schoenick, C., Clark, P., Tafjord, O., Turney, P., & Etzioni, O. (2017). Moving beyond the Turing test with the Allen AI science challenge. Communications of the ACM, 60(9), 60–64.
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., Kar, K., Bashivan, P., Prescott-Roy, J., Schmidt, K., Yamins, D. L. K., & DiCarlo, J. J. (2018). Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv preprint.
Schweizer, P. (1998). The truly total Turing test. Minds and Machines, 8(2), 263–272.
Sebeok, T. A., & Rosenthal, R. E. (1981). The clever Hans phenomenon: Communication with horses, whales, apes, and people. Annals of the NY Academy of Sciences, 364, 1–17.
Seber, G. A. F., & Salehi, M. M. (2013). Adaptive cluster sampling. In Adaptive sampling designs (pp. 11–26). New York: Springer.
Settles, B. (2009). Active learning. Tech. rep., synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool.
Shah, H., & Warwick, K. (2015). Human or machine? Communications of the ACM, 58(4), 8.
Shanahan, M. (2015). The technological singularity. New York: MIT Press.
Shoham, Y. (2017). Towards the AI index. AI Magazine, 38(4), 71–77.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815.
Sloman, A. (2014). Judging chatbots at Turing test. http://www.cs.bham.ac.uk/research/projects/cogaff/misc/turing-test-2014.html.
Stern, R., Sturtevant, N., Felner, A., Koenig, S., et al. (2019). Multi-agent pathfinding: Definitions, variants, and benchmarks. arXiv:1906.08291.
Sturm, B. L. (2014). A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 16(6), 1636–1644.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
Turing, A. (1952). Can automatic calculating machines be said to think? BBC. BBC Third Programme, 14 and 23 Jan. 1952, between M. H. A. Newman, A. M. T., Sir Geoffrey Jefferson and R. B. Braithwaite. Reprinted in Copeland, B. J. (ed.) The essential Turing (pp. 494–495). Oxford: Oxford University Press. http://www.turingarchive.org/browse.php/B/6.
Vale, C. D., & Weiss, D. J. (1975). A study of computer-administered stradaptive ability testing. Tech. rep., Minnesota Univ. Minneapolis Dept. of Psychology.
Van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., & Tsang, J. (2017). Hybrid reward architecture for reinforcement learning. In NIPS (pp. 5392–5402).
Vardi, M. Y. (2015). Human or machine? Response. Communications of the ACM, 58(4), 8–8.
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv:1708.04782.
von Ahn, L., Blum, M., & Langford, J. (2004). Telling humans and computers apart automatically. Communications of the ACM, 47(2), 56–60.
von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). RECAPTCHA: Human-based character recognition via web security measures. Science, 321(5895), 1465.
Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv:1905.00537.
Watt, S. (1996). Naive psychology and the inverted Turing test. Psycoloquy, 7(14), 463–518.
Weiss, D. J. (2011). Better data from better measurements using computerized adaptive testing. Journal of Methods and Measurement in the Social Sciences, 2(1), 1–27.
You, J. (2015). Beyond the Turing test. Science, 347(6218), 116–116.
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112(4), 1036–1040.
Zadeh, L. A. (2008). Toward human level machine intelligence-Is it achievable? The need for a paradigm shift. IEEE Computational Intelligence Magazine, 3(3), 11–22.
Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 conference on empirical methods in natural language processing (EMNLP).
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830.
Zhou, P., Khanna, R., Lin, B. Y., Ho, D., Ren, X., & Pujara, J. (2020). Can BERT reason? Logically equivalent probes for evaluating the inference capabilities of language models. arXiv:2005.00782.
Zillich, M. (2012). My robot is smarter than your robot. On the need for a total Turing test for robots. In V. Muller & A. Ayesh (Eds.), AISB/IACAP 2012 symposium “Revisiting Turing and his Test”, The Society for the Study of Artificial Intelligence and the Simulation of Behaviour, pp. 12–15.

Acknowledgements

I appreciate the reviewers’ comments, leading to new Sect. 5, among other modifications and insights in the final version. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, and also supported by the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32, and Generalitat Valenciana under PROMETEO/2019/098. Figure 1 was kindly generated on purpose by Fernando Martínez-Plumed.

Author information

Authors and Affiliations

Universitat Politècnica de València, Valencia, Spain
José Hernández-Orallo
Leverhulme Centre for the Future of Intelligence, Cambridge, UK
José Hernández-Orallo

Authors

José Hernández-Orallo

Corresponding author

Correspondence to José Hernández-Orallo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hernández-Orallo, J. Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too. Minds & Machines 30, 533–562 (2020). https://doi.org/10.1007/s11023-020-09549-0

Received: 25 March 2020
Accepted: 29 October 2020
Published: 04 November 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11023-020-09549-0
