
Artificial Intelligence

Volume 314, January 2023, 103821

Regret minimization in online Bayesian persuasion: Handling adversarial receiver's types under full and partial feedback models

https://doi.org/10.1016/j.artint.2022.103821

Abstract

In Bayesian persuasion, an informed sender has to design a signaling scheme that discloses the right amount of information so as to influence the behavior of a self-interested receiver. This kind of strategic interaction is ubiquitous in real-world economic scenarios. However, the seminal model by Kamenica and Gentzkow makes some stringent assumptions that limit its applicability in practice. One of the most limiting assumptions is, arguably, that the sender is required to know the receiver's utility function to compute an optimal signaling scheme. We relax this assumption through an online learning framework in which the sender repeatedly faces a receiver whose type is unknown and chosen adversarially at each round from a finite set of possible types. We are interested in no-regret algorithms prescribing a signaling scheme at each round of the repeated interaction with performance close to that of the best-in-hindsight signaling scheme. First, we prove a hardness result on the per-round running time required to achieve no-α-regret for any α < 1. Then, we provide algorithms for the full and partial feedback models with regret bounds sublinear in the number of rounds and polynomial in the size of the instance. Finally, we show that, by relaxing the persuasiveness constraints on signaling schemes, it is possible to design an algorithm with a better running time and small regret.

Introduction

Bayesian persuasion was first introduced by Kamenica and Gentzkow [2] as the problem faced by an informed sender trying to influence the behavior of a self-interested receiver via the strategic provision of payoff-relevant information. The sender faces an information structure design problem which amounts to deciding ‘who gets to know what’ about some exogenous parameters collectively termed state of nature, whose value is drawn from a common prior distribution and observed by the sender only.

The Bayesian persuasion framework can effectively describe strategic interactions in which an agent (the receiver) has to make a decision by relying only on information revealed by another external entity (the sender). In real-world economic interactions, the objectives of the former are oftentimes not aligned with those of the latter. Therefore, a natural question is: how can the sender steer the receiver's behavior towards some target outcome by exploiting the asymmetry in the knowledge of the realized state of nature? Answers to that question already found application in domains such as auctions and online advertisement [3], [4], [5], [6], [7], voting [8], [9], [10], [11], traffic routing [12], [13], recommendation systems [14], security [15], [16], and product marketing [17], [18].

In the original model by Kamenica and Gentzkow [2], the sender and the receiver share a common prior distribution over a finite set of states of nature. The interaction between the sender and the receiver goes on as follows. First, the sender publicly commits to a signaling scheme, which is a randomized mapping from states of nature to signals being sent to the receiver. Then, the sender observes the realized state of nature, which is drawn according to the prior distribution, and she/he computes a signal that is sent to the receiver according to the signaling scheme. The receiver observes the signal, and updates her/his posterior distribution over the states of nature accordingly, through a classical Bayesian update. Finally, the receiver selects an action maximizing her/his expected utility under the current posterior distribution. The sender and the receiver obtain a payoff which is a function of the receiver's action, and of the realized state of nature. An optimal signaling scheme for the sender is such that it maximizes her/his expected utility.
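The single-round interaction described above can be sketched in a few lines of code. The toy prior, signaling scheme, and utility matrices below are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: 2 states of nature, 2 signals, 2 receiver actions.
prior = np.array([0.6, 0.4])             # common prior over states of nature
# signaling scheme phi[state, signal] = probability of sending `signal` in `state`
phi = np.array([[0.9, 0.1],
                [0.2, 0.8]])
# receiver utility u_r[state, action] and sender utility u_s[state, action]
u_r = np.array([[1.0, 0.0],
                [0.0, 1.0]])
u_s = np.array([[1.0, 0.2],
                [1.0, 0.2]])             # sender prefers action 0 in both states

state = rng.choice(2, p=prior)           # nature draws the state
signal = rng.choice(2, p=phi[state])     # sender draws a signal per the scheme

# Bayesian update: posterior over states given the observed signal
posterior = prior * phi[:, signal]
posterior /= posterior.sum()

# receiver best-responds to the posterior; the sender's expected payoff follows
action = int(np.argmax(posterior @ u_r))
sender_value = float(posterior @ u_s[:, action])
```

Committing to `phi` before observing the state is what gives the Bayesian update its meaning: the receiver can trust the likelihoods `phi[:, signal]` when forming the posterior.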

The model by Kamenica and Gentzkow [2] makes some important assumptions. First, the sender computes the signaling scheme at the ex ante stage, i.e., before observing the realized state of nature. This rules out any possibility for signaling through the choice of the signaling scheme. Second, the sender has full commitment power, which means that once the state of nature is realized, the signals are drawn according to the previously announced signaling scheme, and the receiver observes undistorted signal realizations. This allows the receiver to perform a Bayesian update of her/his prior beliefs without questioning the interim incentives of the sender. This assumption is reasonable in a number of practical settings (see, e.g., the arguments by Dughmi [19]). One particularly relevant argument to that effect is that reputation and credibility may be a key factor for the long term utility of the sender, as argued by Rayo and Segal [20].

In practical applications, the sender does not usually have a perfect knowledge of the receiver's utility function. For example, in buyer-seller interactions, the seller (i.e., the sender) does not have an exact knowledge of the buyer's utility function as a function of the product features. In this paper, we aim at relaxing the assumption that, in order to compute a ‘good’ signaling scheme, the sender must have a perfect knowledge of the receiver's objectives and preferences. We do that by proposing and studying an online learning framework in which the sender repeatedly faces a receiver whose type is unknown and chosen adversarially at each round from a finite set of possible types.

We deal with uncertainty about the receiver's type by framing the Bayesian persuasion problem in an online learning framework. The online Bayesian persuasion framework consists of a repeated Bayesian persuasion problem where, at each round, the receiver's type is adversarially chosen from a finite set of types. In particular, at each time t, the sender has to announce a signaling scheme ϕt, and she/he is confronted with a receiver with an unknown type that is chosen by an adversary. Then, a one-shot interaction between the sender and the receiver is played: the sender observes the realized state of nature at time t, and computes a signal according to ϕt. After observing such a signal, the receiver updates the shared prior distribution over the states of nature in a Bayesian fashion, and selects an action maximizing her/his expected utility under the resulting posterior distribution. Then, the receiver at round t leaves forever after having observed the realized utility for her/his action, which is a function of the action, the state of nature at t, and her/his type. Our goal is to design an online algorithm that recommends a signaling scheme at each round, guaranteeing an expected utility for the sender close to that of the best-in-hindsight signaling scheme. We study this problem under two models of feedback: in the full information model, the sender selects a signaling scheme and later observes the type of the best-responding receiver; in the partial information model, the sender observes the actions taken by the receiver, but does not explicitly observe the receiver's types.
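As a concrete illustration, the regret against the best-in-hindsight signaling scheme can be computed as follows. The finite menu of schemes, the utility matrix, and the sequences of types and plays are all hypothetical.

```python
import numpy as np

# Hypothetical setup: a finite menu of candidate signaling schemes and an
# adversarial sequence of receiver types; utility[scheme, type] is the sender's
# expected utility when that scheme faces a best-responding receiver of that type.
utility = np.array([[0.8, 0.2],
                    [0.5, 0.6],
                    [0.3, 0.9]])
types = [0, 1, 1, 0, 1]                  # adversarially chosen receiver types
played = [0, 0, 1, 2, 1]                 # schemes chosen by some online algorithm

alg_total = sum(utility[s, k] for s, k in zip(played, types))
best_hindsight = max(utility[s, types].sum() for s in range(utility.shape[0]))
regret = best_hindsight - alg_total      # cumulative regret over the T=5 rounds
```

A no-regret algorithm is one whose `regret`, divided by the number of rounds, vanishes as the horizon grows, for every adversarial choice of `types`.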

The online Bayesian persuasion framework provides a general tool that may be employed, with some adaptations, to describe various real-world problems. Two examples are the following:

    Signaling in repeated posted price auctions.

    In a posted price auction the seller (i.e., the sender) tries to sell an item to a buyer (i.e., the receiver) by proposing a take-it-or-leave-it price (see, e.g., [21], [22]). The buyer can choose whether to accept or decline the offer without the possibility of reverting her/his decision. Posted price auctions are one of the most frequent auction formats in e-commerce platforms [23]. In such settings, the utility of the buyer may depend on a state of nature describing the quality of the item being sold and/or some of its features. This information is known to the seller, but buyers cannot directly observe the item since the auction is carried out on the web. However, buyers can form a prior distribution over the states of nature by observing historical information publicly available on the web (for example, reviews of past purchases). In this scenario, the seller may be able to send signals to buyers in order to (partially) disclose information about the underlying state of the item (e.g., through the information displayed on the webpage). Moreover, in e-commerce platforms users arrive in an online fashion, and the volume of users is typically large enough to motivate regret-minimization approaches, in which the seller's goal is attaining the performance of the best fixed information disclosure policy in hindsight. Finally, let us also notice that, at each time step the quality of the item being sold is usually not correlated with the type of user accessing the platform at that time (this is the case, e.g., of accommodation websites and retail platforms).

    Signaling in network congestion games.

    Network congestion games model scenarios in which players choose a routing option to go from a source node to a target one, and their individual costs increase with the number of players choosing the same edges of the network. In such settings, if each player behaves selfishly, externalities may introduce inefficiencies in the final outcome of the game [24]. Bayesian network congestion games are an extension of such model in which the state of the network can be uncertain and not perfectly known to its users (e.g., drivers may not be aware of road works and accidents in a road network). In many settings, it is reasonable to assume the existence of third-party entities (e.g., road management companies, or platforms operating navigation systems) which may have access to the realized state of the network. Then, the third-party entity may want to disclose the right amount of information to each user of the network in order to mitigate the overall congestion. When a user requests information on the network (e.g., opening the navigation system and setting a destination), the third-party entity needs to decide which route to recommend. Types model different preferences that users may have over the route, for example, whether the user prefers to follow routes they already know, or whether they prefer to take a shorter but congested road with respect to a longer, less congested, route. The state of the network at each time a user request has to be served can be assumed to be independent of the user type. Moreover, it is easy for users to form a prior on the state of the network based on what they experienced in the past.

We remark that online Bayesian persuasion is a general framework, and that both scenarios would require one to address further difficulties with respect to our model (in the former the agent needs to select a price together with a signal, and in the latter there is a combinatorial number of possible route options). However, our model can serve as the foundation to address such problems, since it captures the key challenge of dealing with an unknown and adversarially selected sequence of receivers' types, which is a common feature of both settings.

First, we study the complexity of the online Bayesian persuasion problem. We provide a negative result that rules out, even in the full information setting, the possibility of designing a no-regret algorithm with polynomial per-round running time. The same hardness result holds when employing the notion of no-α-regret (in the additive sense) for any α < 1. Intuitively, an algorithm exhibits the no-α-regret property if its average regret approaches α after a sufficiently large (but polynomial in the size of the instance) number of rounds. Formally, we show that, for any α < 1, a no-α-regret algorithm for the online Bayesian persuasion problem requiring a per-round running time polynomial in the size of the instance cannot exist, unless NP ⊆ RP (see Theorem 3). In order to prove this negative result we show, as an intermediate step, that the problem of approximating an optimal signaling scheme is NP-hard even in the offline Bayesian persuasion problem, in which the sender knows the probability distribution according to which receiver's types are selected. By exploiting some techniques introduced by Roughgarden and Wang [25], it is possible to show that if the offline problem cannot be approximated, then neither can its corresponding online learning problem.

Then, we study whether it is possible to devise a no-regret algorithm for the online Bayesian persuasion problem by relaxing the (per-round) running time constraint. This is not a trivial problem even in the full information feedback setting since, at each round t, the sender has to choose a signaling scheme among an infinite number of alternatives. Moreover, the sender's utility depends on the receiver's best response, which yields an objective function that is neither linear nor convex (nor even continuous in the space of signaling schemes). In the full information feedback setting, we show how to construct an algorithm that guarantees a regret polynomial in the size of the problem instance, and sublinear in the number of rounds T with order O(T^{1/2}) (see Theorem 4). We do that by showing that it is enough for the sender to consider a finite set of posteriors as her/his action space, without compromising the quality of the solution. The finite set of posteriors has size growing exponentially in the number of states of nature (i.e., the source of the hardness result in Theorem 3 is the number of states of nature). Therefore, achieving no-regret in polynomial time is actually possible when the number of states of nature is fixed. In the partial information feedback setting, we develop an algorithm guaranteeing a regret polynomial in the size of the problem instance, and sublinear in T with order O(T^{4/5}) (see Theorem 7). In this case, the main idea is to use a full-information no-regret algorithm in combination with a mechanism to estimate the sender's utilities corresponding to signaling schemes different from the one recommended by the algorithm. In order to build such a mechanism, we extend some techniques by Balcan et al. [26] to a general setting in which only biased utility estimators are available, rather than unbiased ones as is the case in previous works.
This result is of general interest and can be generalized to any partial information setting, beyond online Bayesian persuasion.
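To illustrate the full-information approach, the following sketch runs the classical Hedge (multiplicative weights) algorithm over a finite set of candidate actions, which is the kind of reduction a finite posterior grid enables. The reward matrix and the adversary below are synthetic, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hedge over a finite action set, as one would run it once the sender's choices
# are reduced to a finite grid of posteriors. Rewards are illustrative.
n_posteriors, n_types, T = 8, 3, 200
reward = rng.random((n_posteriors, n_types))     # reward[action, type] in [0,1]
eta = np.sqrt(np.log(n_posteriors) / T)          # standard Hedge learning rate

weights = np.ones(n_posteriors)
total, history = 0.0, []
for _ in range(T):
    k = int(rng.integers(n_types))               # adversary picks a receiver type
    history.append(k)
    p = weights / weights.sum()
    total += float(p @ reward[:, k])             # expected reward of randomized play
    weights *= np.exp(eta * reward[:, k])        # full feedback: update every action

best_hindsight = max(reward[a, history].sum() for a in range(n_posteriors))
regret = best_hindsight - total                  # O(sqrt(T log N)) by Hedge's guarantee
```

The catch, as discussed above, is that in online Bayesian persuasion the number of grid points N grows exponentially in the number of states of nature, so the per-round update is not polynomial-time in general.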

Finally, we study whether it is possible to design an algorithm with a better running time, at the cost of relaxing the notion of best response according to which, at each round t, the receiver selects her/his action. Specifically, at each round t, the receiver is allowed to play an ϵ-best response to the current posterior distribution: the receiver can select an action providing her/him an expected utility which is at most ϵ less than the optimal value. Then, we define a suitable notion of regret by comparing the performance of the best-in-hindsight signaling scheme with the rewards obtained when facing ϵ-best-responding receivers and taking decisions according to the online algorithm. Under this definition of regret, we show that, for any ϵ > 0 and α > 0, there exists an algorithm with running time quasi-polynomial in the number of receiver's actions and polynomial in the other parameters of the instance, guaranteeing an α-regret sublinear in T with order O(T^{1/2}), whose dependence on the instance is quasi-polynomial in the number of actions and polynomial in the other parameters (Theorem 8). In doing so, we prove that there exists a set of signaling schemes of size quasi-polynomial in the number of actions and polynomial in the other parameters of the instance that, for each possible sequence of receiver's types, includes a signaling scheme nearly as good as the optimal one. This yields a polynomial-time no-α-regret algorithm when the number of receiver's actions is fixed.
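A minimal sketch of the ϵ-best-response notion used above: given a posterior, the receiver may pick any action whose expected utility is within ϵ of the maximum. The posterior and utility matrix below are hypothetical.

```python
import numpy as np

# Receiver with 3 actions facing a posterior over 2 states of nature.
posterior = np.array([0.7, 0.3])
u_r = np.array([[1.0, 0.55, 0.0],        # u_r[state, action], values in [0,1]
                [0.0, 0.90, 1.0]])

exp_u = posterior @ u_r                  # expected utility of each action
eps = 0.1
# every action within eps of the optimum is an admissible eps-best response
eps_best = np.flatnonzero(exp_u >= exp_u.max() - eps)
```

Here the exact best response is action 0 (expected utility 0.7), but action 1 (expected utility 0.655) is also an ϵ-best response for ϵ = 0.1; allowing such slack is what enables the smaller set of candidate signaling schemes in Theorem 8.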

Information design.  The Bayesian persuasion framework was introduced by Kamenica and Gentzkow [2], and its study from a computational perspective was initiated by Dughmi and Xu [27]. The single-round model has been extended to various settings such as, for example, games with multiple receivers under private (see, e.g., [17], [28], [29]) and public signaling (see, e.g., [12], [30], [29], [10], [31], [32]), settings in which receivers interact in an imperfect-information sequential game [33], network congestion games [34], and (ex ante and ex post) constrained problems [35]. Ely et al. [36] and Hörner and Skrzypacz [37] study information disclosure over time, but consider settings very different from our model (releasing news so as to optimize suspense and surprise, and selling information over time, respectively). More background on the ‘design of information structures’ in theoretical economics can be found in [38], [39], [40]. For a survey of algorithmic results on Bayesian persuasion see [19].

Online learning in Stackelberg games.  The closest line of research to ours is the one studying online learning problems in Stackelberg games. In these games, a leader commits to a probability distribution over a set of actions, and a follower plays an action maximizing her/his utility given the leader's commitment [41]. In this setting, Letchford et al. [42] and Blum et al. [43] study the problem of computing the best leader's strategy against an unknown follower using a polynomial number of best-response queries. Marecki et al. [44] study the problem with a single follower with type drawn from a Bayesian prior. Balcan et al. [26] study how to minimize the leader's regret in an online setting in which the follower's type is unknown and chosen adversarially from a finite set. Although the problem is conceptually similar to ours, the Bayesian persuasion framework presents a number of additional challenges: the solution to a Stackelberg game consists of a point in a finite-dimensional simplex, while the solution to a Bayesian persuasion problem is a probability distribution with potentially infinite support size. This probability distribution is subject to additional consistency constraints, which (under partial feedback) rule out the possibility of exploiting unbiased estimators of the sender's expected utility.

Online learning.  It is worth mentioning that known online learning algorithms for either the full or the partial feedback setting (see, e.g., [45]) do not provide any guarantee in the case of online Bayesian persuasion. Indeed, the regret bounds of those algorithms depend linearly or sublinearly on the number of actions, but the action space in Bayesian persuasion is infinite. A large body of previous works in other fields resolves the issue of dealing with an infinite action space by requiring specific assumptions (e.g., linear or convex utility functions [46], [47], [48], [49]). However, in the online Bayesian persuasion setting, these assumptions do not hold as the sender's utility depends on the receiver's best response, which yields a function that is not convex. Another approach would be to discretize the action space, which has been shown to be useful in several settings [50], [51]. However, such an approach is inapplicable in our problem, since the space of (direct) signaling schemes is defined by complex constraints that are not easy to deal with by using a naïve discretization procedure.

Robust approaches to Bayesian persuasion.  Recently, there has been a growing interest in relaxing some of the assumptions of the original model by Kamenica and Gentzkow [2]. For example, Hu and Weng [52], Kosterina [53], and Dworczak and Pavan [54] study robust approaches with respect to private information that the receiver may have. Zu et al. [55] study the case in which the sender has to be robust with respect to her/his ignorance of the receiver's prior, and has to persuade the receiver while estimating the receiver's prior on the fly. They present an online algorithm making persuasive action recommendations and guaranteeing cumulative regret in the order of O(√(T log T)) against the optimal mechanism with knowledge of the prior distribution. Finally, the motivation behind our work is similar to that of the recent work by Babichenko et al. [56]. In that paper, the authors study robustness of persuasion with respect to the sender's ignorance of the receiver's objectives in the offline setting. Their goal is minimizing the sender's regret over a single iteration, and the authors provide positive results for the case in which the sender knows the ordinal preferences of the receiver over states of nature. The authors study the case of a single receiver with a binary action space, and an arbitrary (unknown) utility function. We view the approach of Babichenko et al. [56] as complementary to ours.1

Incentivized exploration.  Our work is not the first one bridging online learning with Bayesian persuasion. Indeed, there is a substantial line of research that uses Bayesian persuasion tools in the context of incentivized exploration in multi-armed bandits (see the seminal works by Kremer et al. [57] and [58], and their follow-ups [59], [60], [61], [62]). This setting models situations where the learner (the sender) has to influence the behavior of an agent (the receiver) who takes actions in place of the former, through the provision of information. Similarly to our model, each round of the repeated interaction consists in a Bayesian persuasion problem involving a new agent. However, the two settings crucially differ in some aspects. First, in our model, the state of nature is chosen anew at each round, while, in the incentivized exploration setting, the state of nature is determined only once and persists through all rounds. Second, in our setting the receiver can be of multiple types and these are selected adversarially, while the incentivized exploration setting only considers a single agent (with [63] and [64] constituting two notable exceptions) and stochastic rewards. Finally, let us remark that incentivized exploration is a non-trivial task even when there is only one receiver's type, while ours is non-trivial only with multiple receiver's types. We refer the reader to Chapter 11 in the book by Slivkins [65] for further details on the topic.

The remainder of the paper is organized as follows. Section 2 introduces basic concepts and the notation employed throughout the paper, including some useful remarks on how to analyze the offline problem in the space of posterior distributions. Section 3 formally introduces the online Bayesian persuasion framework, and the relevant notions of regret. Section 4 describes the negative complexity results. Then, we study how to achieve sub-linear regret by relaxing the requirements on per-iteration running times. Section 5 studies this problem in the full information feedback setting, and Section 6 studies the problem in the partial information feedback setting. Section 7 describes how to build a PTAS when receivers are ϵ-best-responding agents. Finally, Section 8 summarizes the results and discusses future research directions. All the proofs omitted from the paper are in Appendix A, Appendix B, Appendix C and Appendix D.

Section snippets

Preliminaries

The receiver has a finite set of m actions A := {a_i}_{i=1}^m and a set of n possible types K := {k_i}_{i=1}^n. For each type k ∈ K, the receiver's payoff function is u^k : A × Θ → [0,1], where Θ := {θ_i}_{i=1}^d is a finite set of d states of nature. For notational convenience, we denote by u_θ^k(a) ∈ [0,1] the utility observed by the receiver of type k ∈ K when the realized state of nature is θ ∈ Θ and she/he plays action a ∈ A. The sender's utility when the state of nature is θ ∈ Θ is described by the function u_θ^s : A → [0,1].

As it is

The online Bayesian persuasion framework

We consider the following online setting. The sender plays a repeated game in which, at each round t ∈ [T], she/he commits to a signaling scheme ϕ_t, observes a state of nature θ_t ∼ μ, and she/he sends a signal s_t ∼ ϕ_t(θ_t) to the receiver. Then, a receiver of unknown type updates her/his prior distribution and selects an action a_t maximizing her/his expected reward (in the one-shot interaction at round t). We focus on the problem in which the

Hardness of sub-linear regret

Our first result is negative: for any α < 1, it is unlikely (i.e., technically, it is not the case unless NP ⊆ RP) that there exists a no-α-regret algorithm for the online Bayesian persuasion problem requiring a per-round running time polynomial in the size of the instance. In order to prove the result, we provide an intermediate step, showing that the problem of approximating an optimal signaling scheme is computationally intractable even in the offline Bayesian persuasion problem in which the

Full information feedback setting

The negative result of the previous section (Theorem 3) rules out the possibility of designing an algorithm which satisfies the no-regret property and requires a poly(t,n,m,d) per-round running time. A natural question is whether it is possible to devise a no-regret algorithm for the online Bayesian persuasion problem by relaxing the running-time constraint. This is not a trivial problem, since, at every round t, the sender has to choose a signaling scheme among an infinite number of

Partial information feedback setting

In this setting, at every round t, the sender can only observe the action a_t played by the receiver. Therefore, the sender has no information on the utility u^s(w, k_t) that she/he would have obtained by choosing any signaling scheme w ∈ W other than w_t. A naïve solution would be to follow the approach of the previous section. In particular, we have already proved that the sender can restrict to the set of signaling schemes W. Hence, we can employ any learning algorithm with bandit feedback

A no-α-regret algorithm for ϵ-persuasive signaling schemes

Theorem 3 shows that, for all α < 1, there is no polynomial-time algorithm for the online Bayesian persuasion problem that guarantees no-α-regret, unless NP ⊆ RP. This implies that achieving sublinear regret with a polynomial-time algorithm is unlikely. Sections 5 and 6 described no-regret algorithms for the full information and partial information feedback models requiring an exponential per-round running time. In this section, we focus on the following natural question: is it possible to design an

Discussion and future works

We proposed the online Bayesian persuasion framework as a natural extension of the original model by Kamenica and Gentzkow [2]. This is, to the best of our knowledge, the first work relaxing the assumption that the sender has a perfect knowledge of the receiver's utility function by casting the Bayesian persuasion problem into an online learning framework. Let us remark that our work provides a first fundamental step towards relaxing the stringent assumptions of the model by Kamenica and

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR “Algorithms, Games, and Digital Market” and by the Italian PNRR-PE 2022 Project FAIR “Future Artificial Intelligence Research”.

References (76)

  • F. Bacchiocchi et al.

Public signaling in Bayesian ad auctions

  • R. Alonso et al.

    Persuading voters

    Am. Econ. Rev.

    (2016)
  • M. Castiglioni et al.

    Persuading voters: it's easy to whisper, it's hard to speak loud

  • Y. Cheng et al.

    Mixture selection, mechanism design, and signaling

  • M. Castiglioni et al.

    Persuading voters in district-based elections

  • U. Bhaskar et al.

    Hardness results for signaling in Bayesian zero-sum and network routing games

  • S. Vasserman et al.

Implementing the wisdom of Waze

  • Y. Mansour et al.

    Bayesian exploration: incentivizing exploration in Bayesian games

  • Z. Rabinovich et al.

    Information disclosure as a means to security

  • H. Xu et al.

    Signaling in Bayesian Stackelberg games

  • Y. Babichenko et al.

    Algorithmic aspects of private Bayesian persuasion

  • O. Candogan

    Persuasion in networks: public signals and k-cores

  • S. Dughmi

    Algorithmic information structure design: a survey

    ACM SIGecom Exch.

    (2017)
  • L. Rayo et al.

    Optimal information disclosure

    J. Polit. Econ.

    (2010)
  • J. Correa et al.

    Posted price mechanisms for a random stream of customers

  • M. Babaioff et al.

    Dynamic pricing with limited supply

    ACM Trans. Econ. Comput.

    (2015)
  • L. Einav et al.

    Auctions versus posted prices in online markets

    J. Polit. Econ.

    (2018)
  • T. Roughgarden

    Intrinsic robustness of the price of anarchy

    J. ACM

    (2015)
  • T. Roughgarden et al.

    Minimizing regret with multiple reserves

  • M.F. Balcan et al.

    Commitment without regrets: online learning in Stackelberg security games

  • S. Dughmi et al.

    Algorithmic Bayesian persuasion

  • S. Dughmi et al.

    Algorithmic persuasion with no externalities

  • A. Rubinstein

    Honest signaling in zero-sum games is hard, and lying is even harder

  • H. Xu

    On the tractability of public persuasion with no externalities

  • M. Castiglioni et al.

    Public Bayesian persuasion: being almost optimal and almost persuasive

  • A. Celli et al.

    Private Bayesian persuasion with sequential games

  • M. Castiglioni et al.

    Signaling in Bayesian network congestion games: the subtle power of symmetry

    Proc. AAAI Conf. Artif. Intell.

    (2021)
  • Y. Babichenko et al.

    Bayesian persuasion under ex ante and ex post constraints

    Proc. AAAI Conf. Artif. Intell.

    (2021)

    A short version of this article appeared in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 [1].
