Still no lie detector for language models: probing empirical and conceptual roadblocks

Abstract

We consider the questions of whether or not large language models (LLMs) have beliefs, and, if they do, how we might measure them. First, we consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided. We provide a more productive framing of questions surrounding the status of beliefs in LLMs, and highlight the empirical nature of the problem. With this lesson in hand, we evaluate two existing approaches for measuring the beliefs of LLMs, one due to Azaria and Mitchell (The internal state of an llm knows when its lying, 2023) and the other to Burns et al. (Discovering latent knowledge in language models without supervision, 2022). Moving from the armchair to the desk chair, we provide empirical results that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. We conclude by suggesting some concrete paths for future work.

Notes

  1. Diaconis and Skyrms (2018) give a concise and thoughtful introduction to both of these topics.

  2. See Lipton (2018) for a conceptual discussion of model interpretability.

  3. We provide code at https://github.com/balevinstein/Probes.

  4. For an in-depth conceptual overview of decoder-only transformer models, see Levinstein (2023).

  5. Intuitively, after the model is trained, these embeddings represent semantic and other information about the token, along with information about what has come before it in the sequence.

  6. In our discussion here we focus on the question of whether or not LLMs track and internally represent truth. This motivates our interest in the recent probing techniques. If this tracking of truth is action-guiding, then it makes sense to call it belief (we make this connection clear in the remainder of Sect. 3.1). There are other worries that LLM belief skeptics raise, mainly to do with communicative intent and reference (Bender & Koller, 2020; Bender et al., 2021; Shanahan, 2022). As the quotes above show, these concerns are often expressed together, even though we believe they are somewhat separate. Computer scientists and philosophers alike have recently pushed back against these other concerns. Mandelkern and Linzen (2023) use arguments from the externalist tradition in the philosophy of language to argue that LLMs can refer, in contrast to Bender and Koller’s (2020) concerns about grounding (see also Bender et al., 2021). Piantadosi and Hill (2022) also argue against grounding and reference concerns (notably those in Bender and Koller (2020)), but from a more internalist direction: they reject the claim that meaning requires reference and focus instead on the conceptual roles that various concepts play inside the LLM itself. They argue that we should search for meaning in LLMs by discovering how the internal states of LLMs relate to one another. Pavlick (2023) provides a nice high-level summary of both internalist and externalist arguments that LLMs might encode meaning.

  7. The canonical formalization of this idea in economics and statistics is Savage’s Foundations of Statistics (1972). Philosophers use Savage’s formulation, as well as Jeffrey’s in The Logic of Decision (1990).

  8. More precisely, utility is a numerical representation that captures how strongly an agent cares about outcomes.

  9. We are here ignoring nuances involving inner alignment (Hubinger et al., 2019).

  10. Accuracy-first epistemology models epistemic behavior—forming and updating beliefs and credences—as fundamentally about maximizing accuracy. Indeed, even in isolation from downstream practical effects of doxastic attitudes, we can derive many fundamental norms constraining beliefs and credences based on the pursuit of accuracy alone (Joyce, 1998; Easwaran, 2016; Dorst, 2019).

  11. Indeed, there is a rich interplay between the broad functionalist thesis and functionalism about machine states in particular. Early functionalists were explicitly concerned with computational theories of mind. See Putnam (1979).

  12. Shanahan argues further that one reason LLMs (especially bare-bones models) lack beliefs is that any creature with beliefs must be able to update those beliefs:

    Nothing can count as a belief about the world we share—in the largest sense of the term—unless it is against the backdrop of the ability to update beliefs appropriately in the light of evidence from that world, an essential aspect of the capacity to distinguish truth from falsehood. (p. 6, 2022)

    This strikes us as false. But even if it were true, there are at least two ways in which LLMs could update their beliefs if they have them. First, during training, LLMs learn via gradient descent. Second, even in a frozen model, in-context learning is possible. Although in-context updates are temporary, since nothing carries over to the next inference cycle, this still seems to us like a legitimate form of updating.

  13. Lieder and Griffiths (2020) make a similar point.

  14. The sentences in the dataset all ended with a period (i.e., full-stop) as the final token. We ran some initial tests to see if probes did better on the embedding for the period or for the penultimate token. We found it did not make much of a difference, so we did our full analysis using the embeddings for the penultimate tokens.
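    For concreteness, here is a minimal sketch of how such per-token embeddings can be extracted with the Hugging Face transformers library. The model name and layer index are illustrative placeholders rather than the exact configuration we used.

```python
# Sketch: extract the hidden-state embedding of a sentence's penultimate token.
# Model name and layer index are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder decoder-only model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

sentence = "The Earth orbits the Sun."  # final token is the period
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple with one tensor per layer,
# each of shape (batch, sequence_length, hidden_size).
layer = -4  # placeholder layer choice
penultimate_embedding = outputs.hidden_states[layer][0, -2, :]
```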

  15. The ‘6.7b’ refers to the number of parameters (i.e., 6.7 billion).

  16. In formal models of beliefs and credence, the main domain is usually an algebra over events. If we wish to identify doxastic attitudes in language models, then we should check that those attitudes behave roughly as expected over such an algebra. Such algebras are closed under negation, so checking behavior under negation is a well-motivated starting point.

  17. Azaria and Mitchell (2023) did an admirable job creating their datasets. Some of the statements were generated automatically using reliable tables of information; others were generated using ChatGPT and then manually curated. Nonetheless, there are some imperfect examples. For instance, in Scientific Facts, one finds sentences like “Humans have five senses: sight, smell, hearing, taste, and touch”, which is not unambiguously true.

  18. Harding makes these conditions precise in the language of information theory. Further development of the concept of representation in the context of probes strikes us as an important line of research for understanding the internal workings of deep learning models.

  19. Some readers may worry about a second degenerate solution. The probe could use the embeddings to find which of \(x_i^+\) and \(x_i^-\) contained a negation. It could map one of the embeddings to (approximately) 1 and the other to (approximately) 0 to achieve a low loss. Burns et al. (2022) avoid this solution by normalizing the embeddings for each class by subtracting the means and dividing by the standard deviations. However, as we’ll see below, for the datasets that we used, such normalization was ineffective for MLP-based probes, and the probes consistently found exactly this degenerate solution.
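    A minimal sketch of this normalization step (our own illustrative code, not Burns et al.’s implementation):

```python
import numpy as np

def normalize_per_class(pos_embeddings, neg_embeddings, eps=1e-8):
    """Normalize each class of embeddings separately by subtracting its mean and
    dividing by its standard deviation (per dimension), so a probe cannot simply
    read off which class an embedding came from via gross statistical differences."""
    pos = (pos_embeddings - pos_embeddings.mean(axis=0)) / (pos_embeddings.std(axis=0) + eps)
    neg = (neg_embeddings - neg_embeddings.mean(axis=0)) / (neg_embeddings.std(axis=0) + eps)
    return pos, neg

# pos_embeddings, neg_embeddings: arrays of shape (n_statements, hidden_size)
```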

  20. One way to see that \(L_\text{CCS}\) won’t incentivize a probe to learn the actual credences of the model is to observe that this loss function is not a strictly proper scoring rule (Gneiting & Raftery, 2007). However, use of a strictly proper scoring rule for training probes requires appeal to actual truth-values, which in turn requires supervised learning.
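    For reference, here is a sketch of the CCS objective for a single contrast pair, following Burns et al. (2022); p_pos and p_neg stand for the probe’s outputs on the embeddings of \(x_i^+\) and \(x_i^-\). Because the loss depends only on these two outputs and never consults actual truth-values, it cannot be a strictly proper scoring rule.

```python
def ccs_loss(p_pos, p_neg):
    """CCS loss for one contrast pair: a consistency term pushing the two outputs
    to sum to 1, plus a confidence term pushing the probe away from the trivial
    p = 0.5 solution."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence
```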

  21. A linear probe is one that applies linear weights to the embeddings (and perhaps adds a constant), followed by a sigmoid function to turn the result into a value between 0 and 1. Linear probes have an especially simple functional form, so intuitively, if a linear probe is successful, the relevant information is easy to extract from the embedding.
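    In code, the functional form is just a weighted sum plus a constant passed through a sigmoid (an illustrative sketch, not our experimental implementation):

```python
import numpy as np

def linear_probe(embedding, weights, bias):
    """Apply linear weights (and a constant) to an embedding, then squash the
    result into (0, 1) with a sigmoid."""
    z = np.dot(weights, embedding) + bias
    return 1.0 / (1.0 + np.exp(-z))
```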

  22. Thanks to Daniel Filan for pointing out that the linear probes could not separate the positive and negative datasets after normalization. Nonetheless, as our results show, they were able to find directions in activation space that achieved low loss but were uncorrelated with truth.

  23. These are both consequences of the fact that for any proposition A, \(\Pr(A) + \Pr(\lnot A) = 1\): take \(A := x \wedge y\), for example, and apply de Morgan’s laws.
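    Spelled out for the conjunctive case (an illustrative instance of the general fact; the disjunctive case is analogous via the other de Morgan law): \(\Pr(x \wedge y) + \Pr(\lnot(x \wedge y)) = 1\) and \(\lnot(x \wedge y) \equiv \lnot x \vee \lnot y\), so \(\Pr(x \wedge y) + \Pr(\lnot x \vee \lnot y) = 1\).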

  24. Burns et al. (2022) investigate other unsupervised approaches as well that appeal to principal component analysis and/or clustering (such as Bimodal Salience Search (p. 22)). We believe—with some changes—most of the conceptual issues for CCS apply to those as well.

  25. The conceptual problems plaguing both types of probing techniques would still exist. However, the thought is that if the representation of truth is especially perspicuous, the probes might land upon such a representation naturally. This is a highly empirical question.

  26. Latent variables are best understood in contrast to observable variables. Suppose, for example, that you are trying to predict the outcomes of a series of coin tosses. The observable variables in this context would be the actual outcomes: heads and tails. A latent variable would be an unobservable that you use to help make your predictions. For example, suppose you have different hypotheses about the bias of the coin and take the expected bias as your prediction for the probability of heads on the next toss. You are using your beliefs about the latent variable (the bias of the coin) to generate your beliefs about the observables.
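    A tiny numerical sketch of this idea (illustrative numbers only):

```python
# Hypotheses about the latent variable (the coin's bias toward heads),
# together with your credence in each hypothesis.
biases = [0.3, 0.5, 0.7]
credences = [0.2, 0.5, 0.3]

# The prediction for the observable (probability of heads on the next toss)
# is the expected bias under your credences.
prob_heads = sum(c * b for c, b in zip(credences, biases))  # 0.06 + 0.25 + 0.21 = 0.52
```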

  27. Using latent variables to compute probability distributions is commonplace in science and statistics (Everett, 2013). Though we do not have the space to do latent variable methods full justice, one reason for their popularity is that using distributions over latent variables to calculate a distribution over observable variables can have massive computational benefits (see, for example, chapter 16 of Goodfellow et al. (2016)). Thus, it would be fairly surprising if there weren’t a useful way to think of LLMs as using some kinds of latent variables in order to make predictions about the next token. Indeed, there is already some preliminary work on what sorts of latent variables LLMs might be working with (Xie et al., 2021; Jiang, 2023).

  28. This question is related to the classic theoretician’s dilemma in the philosophy of science (Hempel, 1958).

References

  • Alain, G., & Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.

  • Azaria, A., & Mitchell, T. (2023). The internal state of an LLM knows when it’s lying.

  • Beery, S., van Horn, G., & Perona, P. (2018). Recognition in terra incognita.

  • Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency (pp. 610–623).

  • Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 5185–5198).

  • Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering latent knowledge in language models without supervision.

  • Christiano, P., Xu, M., & Cotra, A. (2021). ARC’s first technical report: Eliciting latent knowledge.

  • Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions.

  • Cowie, C. (2014). In defence of instrumentalism about epistemic normativity. Synthese, 191(16), 4003–4017.

  • Diaconis, P., & Skyrms, B. (2018). Ten great ideas about chance. Princeton University Press.

  • Dorst, K. (2019). Lockeans maximize expected accuracy. Mind, 128(509), 175–211.

  • Easwaran, K. (2016). Dr. Truthlove or: How I learned to stop worrying and love Bayesian probabilities. Noûs, 50(4), 816–853.

  • Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., Righetti, L., & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv:2110.06674.

  • Everett, B. (2013). An introduction to latent variable models. Springer.

  • Meta Fundamental AI Research Diplomacy Team (FAIR), Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., Goff, A., Gray, J., Hu, H., et al. (2022). Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624), 1067–1074.

  • Gilboa, I., Minardi, S., Samuelson, L., & Schmeidler, D. (2020). States and contingencies: How to understand savage without anyone being hanged. Revue économique, 71(2), 365–385.

  • Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.

  • Godfrey-Smith, P. (1991). Signal, decision, action. The Journal of Philosophy, 88(12), 709–722.

  • Godfrey-Smith, P. (1998). Complexity and the function of mind in nature. Cambridge University Press.

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

  • Harding, J. (2023). Operationalising representation in natural language processing. arXiv:2306.08193.

  • Hempel, C. G. (1958). The theoretician’s dilemma: A study in the logic of theory construction. Minnesota Studies in the Philosophy of Science, 2, 173–226.

  • Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv:1906.01820.

  • Jeffrey, R. C. (1990). The logic of decision. University of Chicago Press.

  • Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.

  • Jiang, H. (2023). A latent space theory for emergent abilities in large language models. arXiv:2304.09960.

  • Joyce, J. M. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65(4), 575–603.

  • Khashabi, D., Min, S., Khot, T., Sabharwal, A., Tafjord, O., Clark, P., & Hajishirzi, H. (2020). UnifiedQA: Crossing format boundaries with a single QA system.

  • Levinstein, B. (2023). A conceptual guide to transformers.

  • Lieder, F., & Griffiths, T. L. (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43, e1.

  • Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3), 31–57.

  • Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142–150).

  • Mandelkern, M., & Linzen, T. (2023). Do language models refer? arXiv:2308.05576.

  • Millikan, R. G. (1995). White queen psychology and other essays for Alice. MIT Press.

  • Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.

  • Papineau, D. (1988). Reality and representation. Mind, 97(388).

  • Pavlick, E. (2023). Symbols and grounding in large language models. Philosophical Transactions of the Royal Society A, 381(2251), 20220041.

  • Piantadosi, S. T., & Hill, F. (2022). Meaning without reference in large language models. arXiv:2208.02957.

  • Putnam, H. (1979). Philosophical papers: Volume 2, mind, language and reality. Cambridge University Press.

  • Quine, W. V. (1969). Natural kinds. In Essays in honor of Carl G. Hempel: A tribute on the occasion of his sixty-fifth birthday (pp. 5–23). Springer.

  • Quine, W. V. O. (1960). Word and object. MIT Press.

  • Ramsey, F. P. (2016). Truth and probability. Readings in Formal Epistemology: Sourcebook (pp. 21–45).

  • Savage, L. J. (1972). The foundations of statistics. Courier Corporation.

  • Shanahan, M. (2022). Talking about large language models. arXiv:2212.03551.

  • Smead, R. (2015). The role of social interaction in the evolution of learning. The British Journal for the Philosophy of Science.

  • Smead, R. S. (2009). Social interaction and the evolution of learning rules. University of California.

  • Sober, E. (1994). The adaptive advantage of learning and a priori prejudice. Ethology and Sociobiology, 15(1), 55–56.

  • Stephens, C. L. (2001). When is it selectively advantageous to have true beliefs? Sandwiching the better safe than sorry argument. Philosophical Studies, 105, 161–189.

  • Stich, S. P. (1990). The fragmentation of reason: Preface to a pragmatic theory of cognitive evaluation. The MIT Press.

  • Street, S. (2009). Evolution and the normativity of epistemic reasons. Canadian Journal of Philosophy Supplementary, 35, 213–248.

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and efficient foundation language models.

  • Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211(4481), 453–458.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates Inc.

  • Wang, B., & Komatsuzaki, A. (2021). GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax

  • Xie, S. M., Raghunathan, A., Liang, P., & Ma, T. (2021). An explanation of in-context learning as implicit Bayesian inference. arXiv:2111.02080.

  • Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., & Zettlemoyer, L. (2022). OPT: Open pre-trained transformer language models.

Acknowledgments

Thanks to Amos Azaria, Nora Belrose, Dylan Bowman, Nick Cohen, Daniel Filan, Jacqueline Harding, Aydin Mohseni, Bruce Rushing, Murray Shanahan, Nate Sharadin, Julia Staffel, and audiences at UMass Amherst, the University of Rochester, and the Center for AI Safety for helpful comments and feedback. Special thanks to Amos Azaria and Tom Mitchell jointly for access to their code and datasets. We are grateful to the Center for AI Safety for use of their compute cluster. B.L. was partly supported by a Mellon New Directions Fellowship (number 1905-06835) and by Open Philanthropy. D.H. was partly supported by a Long-Term Future Fund grant.

Author information

Corresponding author

Correspondence to Benjamin A. Levinstein.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Levinstein, B.A., Herrmann, D.A. Still no lie detector for language models: probing empirical and conceptual roadblocks. Philos Stud (2024). https://doi.org/10.1007/s11098-023-02094-3

  • DOI: https://doi.org/10.1007/s11098-023-02094-3
