Characterizing limits and opportunities in speeding up Markov chain mixing

https://doi.org/10.1016/j.spa.2021.03.006

Abstract

A variety of paradigms have been proposed to speed up Markov chain mixing, ranging from non-backtracking random walks to simulated annealing and lifted Metropolis–Hastings. We provide a general characterization of the limits and opportunities of different approaches for designing fast mixing dynamics on graphs using the framework of “lifted Markov chains”. This common framework allows us to prove lower and upper bounds on the mixing behavior of these approaches, depending on a limited set of assumptions on the dynamics. We find that some approaches can speed up the mixing time to diameter time, or a time inversely proportional to the graph conductance, while others allow for no speedup at all.

Introduction

The importance of algorithms based on Markov chains is widely appreciated. In computer science, random walks and Markov chain Monte Carlo form the backbone of many randomized algorithms to solve tasks such as approximating the volume of convex bodies [21] or the permanent of a non-negative matrix [32], or to solve combinatorial optimization problems using simulated annealing methods [37]. In physics, Markov chain Monte Carlo is an indispensable tool for sampling and simulation of many-body systems. Some examples are the use of Glauber dynamics to simulate the Ising model [44] or the Metropolis–Hastings algorithm [27], [45] to sample from the Gibbs distribution.

In general, a Markov chain can be used to sample from a probability distribution π that is not directly available nor explicitly known. Instead, known properties of the target distribution are translated into a stochastic evolution that is engineered to converge, or mix, to an equilibrium distribution which coincides with the target one. In various contexts, such a Markov chain is easier to obtain or to implement than direct sampling from π. Under rather mild conditions, if this Markov chain is run from any starting position and stopped after a sufficient number of steps T, the resulting state will be approximately distributed according to the target equilibrium distribution π. Critical to these applications is how fast the stochastic process converges, and estimating this convergence speed or mixing time is often a difficult task [3], [40]. Approaches include estimating the spectral gap of the transition map [20], [39], or using advanced coupling and stopping time arguments [43].
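
For concreteness, a standard way to make this precise (this is the textbook definition from the Markov chain literature, not a quotation of the formal setup given later in Section 2) is the following: for a chain with T-step distribution Pr[X_T = j | X_0 = v] and stationary distribution π, the ϵ-mixing time is

τ(ϵ) = min{ T ≥ 0 : max_{v∈V} (1/2) Σ_{j∈V} | Pr[X_T = j | X_0 = v] − π(j) | ≤ ϵ },

i.e. the first time at which, whatever the starting node, the distribution of the state is ϵ-close to π in total variation distance.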

In order to decrease the number of required steps T, thus accelerating the convergence towards π, a wide range of approaches has been proposed. The following are some examples, which will be treated in more detail in Section 4. All approaches describe local dynamics over the node set of a graph, in which the dynamics can only move from a node to any of its neighboring nodes (i.e., nodes that have an edge to the present node).

  • Stopping rules: The simple Markov chain scheme has a deterministic stopping time T, i.e. the transitions specified by the Markov chain are run for a fixed number of steps T, upon which its state is returned. As an extension, one can choose the stopping time randomly, according to some predefined distribution or dependent on the nodes that have been visited [34], [40], [43]. Such a choice is formally described by a stopping rule. For instance, if the stopping time is uniformly distributed over some fixed time interval [0,T], then the output is called the Cesàro average. A more advanced stopping rule could say, e.g., go on until you have seen each node at least once (this relates to the Markov chain cover time [40]). By returning a sample obtained through a stopping rule, it is possible to converge faster to the target distribution π. More precisely, one specifies a stopping rule such that the distribution over nodes, conditioned on having stopped, is ϵ-close to π; and the mixing performance is measured by the expectation of the stopping time.

  • Non-backtracking random walks: Consider that we want a sample from π, which is the stationary distribution of a random walk on an undirected graph, moving from a given node to any neighbor with equal probability. When applying the random walk from a given starting node, there is a probability that the walker moves from node a to node b and then directly back to a; such a move is generally detrimental for spreading on the graph. A non-backtracking random walk therefore assigns a decreased probability α ≪ 1 to traversing the same edge twice in a row. That is, the probability of choosing at time t+1 the node one occupied at time t−1 is decreased to α ≪ 1 with respect to a uniform choice among the available neighbors, and the probability of choosing any of the other available neighbors uniformly is accordingly increased. This process has the same stationary distribution π over the nodes, and [4], [19], [22], [36] have shown that this approach generally speeds up mixing as compared to a simple random walk (a minimal sketch of such a walk, cast as a lifted chain, is given after this list).

  • Simulated annealing or slowly-varying Markov chains: Markov chains are sometimes used to find the minimum of a function g over the graph nodes. The target stationary distribution should thus be much larger at a minimum than at other places, and this can be achieved by choosing the probability of jumping from i to j larger than the probability of jumping from j to i if g(j)<g(i). If a jump towards higher values of g is assigned a tiny probability, then this Markov chain has a high probability of getting stuck for a long time in local minima that are not the global one. In contrast, if the influence of g on the jumping probabilities is made too weak, the stationary distribution is essentially spread over all nodes. As a remedy, a time-dependent sequence of Markov chains can be proposed, whose transition probabilities and stationary distributions converge gradually to the “irregular” goal distribution concentrated on the (global) minima, such that during the early steps of the sequence one can efficiently jump out of local minima. See for instance [37].

  • Gather-and-distribute strategies: This method was originally proposed in a consensus setting [24], where a given load must be spread as fast as possible over the nodes of a network in a distributed way. The gather-and-distribute strategy is a time-varying procedure, consisting of a sequence of two time intervals. In terms of Markov chain mixing, during the first interval, transition probabilities are chosen so as to move all the probability mass to a single predefined node. Thus after this first time interval, whatever the initial distribution, the Markov chain ends up in a single predefined situation. Knowing that the second time interval starts from this particular probability distribution, its transition probabilities are then designed to redistribute the probability according to the goal distribution. As an example, on a complete binary tree of depth D, one could choose the transition probabilities during the first D time steps so that all the probability mass moves onto the root node. From this known situation, common to all initial distributions, it is not hard to design transition probabilities over the next D time steps (e.g. using a “stochastic bridge”, see Section 5.2) in order to redistribute the probability mass towards the target stationary distribution. The mixing time of this approach is thus 2D on the binary tree example, exponentially improving over the simple random walk mixing time Ω(2^D).

  • Data-augmented and lifted Markov chains: Consider again the problem of obtaining a sample from some (indirectly specified) goal distribution π over the nodes. Sometimes this distribution is easier to obtain as the marginal of a distribution on a larger sample space. In this case, one can obtain a sample by using a Markov chain that evolves on an augmented state space, consisting of the original variable and some latent variables, and just discarding the value of the latent variable in the obtained sample, see e.g. [60], [62]. A related strategy is to use latent variables and augmented graphs not only for an easier specification of the target, but also to possibly accelerate the convergence thanks to memory or momentum effects, see e.g. [12], [15], [18], [19] and the next sections for details on Markov chains on lifted graphs.
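
As an illustration of how such approaches fit a lifted framework, the following Python sketch casts a non-backtracking-style walk (see the second bullet above) as a Markov chain on an enlarged state space whose states are directed edges (previous node, current node); the sample of interest is the marginal over the current node, with the latent “previous node” discarded. This is a minimal sketch under assumptions of our own choosing — the toy graph, the weight alpha and all function names are illustrative, not constructions taken from the paper.

```python
import random

# Illustrative sketch: a non-backtracking-style walk written as a lifted
# Markov chain whose states are directed edges (u, v). The parameter alpha
# is the (small) relative weight of backtracking; alpha = 1 recovers the
# simple random walk at the level of the node marginal.

def lifted_step(state, adj, alpha, rng):
    """One step of the lifted chain; state = (previous node, current node)."""
    u, v = state
    neighbors = adj[v]
    # Weight the move back to u by alpha, every other neighbor by 1.
    weights = [alpha if w == u else 1.0 for w in neighbors]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for w, wt in zip(neighbors, weights):
        acc += wt
        if r <= acc:
            return (v, w)
    return (v, neighbors[-1])

def sample_marginal(adj, alpha, t, rng):
    """Run the lifted chain for t steps and return only the node marginal."""
    v0 = rng.choice(list(adj))
    u0 = rng.choice(adj[v0])           # initialize the latent 'previous node'
    state = (u0, v0)
    for _ in range(t):
        state = lifted_step(state, adj, alpha, rng)
    return state[1]                    # discard the latent variable

if __name__ == "__main__":
    n = 8                              # a small cycle graph as a toy example
    adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    rng = random.Random(0)
    samples = [sample_marginal(adj, alpha=0.1, t=50, rng=rng) for _ in range(2000)]
    print([round(samples.count(i) / len(samples), 3) for i in range(n)])
    # the empirical node distribution is roughly uniform over the n nodes
```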

This list of seemingly distinct ideas provides a range of speedups over the use of a simple random walk. From the literature cited above, non-backtracking random walks are shown to provide (at least) a constant factor speedup, lifted Markov chains can provide up to a quadratic speedup, and gather-and-distribute strategies can provide exponential speedup or even more, on certain graphs.

Example 1.1 Mixing on a Cycle

For illustration we consider the toy example of sampling from the uniform distribution on a cycle graph. This example lies at the basis of several speedup ideas like [15], [18], [35]. The possible walker positions are the integers 0, 1, 2, …, n−1, and at each discrete time step, the walker can decide to stay put, to add 1 to its position or to subtract 1 from it (modulo n).

A standard random walk would add +1 or −1 to the current position, each with probability 1/2. After t ≫ 1 steps the standard deviation of the walker’s position from its original one is of order √t. As a consequence, it will take approximately n² steps for the random walk to converge to the uniform distribution.
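
As a quick numerical sanity check of this quadratic scaling (our own illustration, not code from the paper), one can evolve the walk’s distribution exactly and record when its total variation distance to uniform drops below 1/4; the small holding probability below is our own addition to avoid parity effects on even cycles.

```python
import numpy as np

# Exact evolution of the (slightly lazy) random walk distribution on the n-cycle,
# reporting the first time the total variation distance to uniform drops below 1/4.

def cycle_walk_matrix(n, hold=0.05):
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = hold                       # small holding probability (aperiodicity)
        P[i, (i + 1) % n] += (1 - hold) / 2
        P[i, (i - 1) % n] += (1 - hold) / 2
    return P

def mixing_time(P, eps=0.25):
    n = P.shape[0]
    mu = np.zeros(n); mu[0] = 1.0            # start from a single node
    uniform = np.ones(n) / n
    t = 0
    while 0.5 * np.abs(mu - uniform).sum() > eps:
        mu = mu @ P
        t += 1
    return t

if __name__ == "__main__":
    for n in [16, 32, 64]:
        t = mixing_time(cycle_walk_matrix(n))
        print(n, t, round(t / n**2, 3))       # the ratio t/n^2 stays roughly constant
```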

A stopping rule could say: always add +1 to your position, but at each time t ≥ 0 stop the process there with probability min(δ/(1−δt), 1) for δ = 1/n. Effectively this means that there is a probability δ to stop the process at each time t ∈ {0, 1, …, n−1}, and hence after at most n−1 time steps we have a uniform distribution. (Note that the time-dependence of the stopping rule is needed: a constant probability of deciding to stop after any t time steps would not achieve the same distribution, not even approximately.)
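
The following short check (our own, not code from the paper) confirms that the deterministic “+1” walk combined with the stopping probability min(δ/(1−δt), 1), δ = 1/n, stops at a position that is exactly uniformly distributed over the n nodes.

```python
import numpy as np

# Distribution of the stopping position for the "+1" walk on the n-cycle with
# time-dependent stopping probability min(delta / (1 - delta*t), 1), delta = 1/n.

def stopping_position_distribution(n):
    delta = 1.0 / n
    dist = np.zeros(n)
    not_stopped = 1.0                        # probability of not having stopped yet
    for t in range(n):
        p_stop = min(delta / (1.0 - delta * t), 1.0)
        position = t % n                     # after t deterministic "+1" moves from node 0
        dist[position] += not_stopped * p_stop
        not_stopped *= (1.0 - p_stop)
    return dist

if __name__ == "__main__":
    print(stopping_position_distribution(10))   # ten entries, each (numerically) equal to 0.1
```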

A non-backtracking walk, and also the lifted Markov chains introduced in [15], [18], would say: start with a given position v and a given sign s ∈ {+, −}. At each time step, add +1 or −1 to the current position according to the sign s, and flip s with a probability δ ≪ 1. In this way, the walker will preferably keep moving in the same direction, while occasionally turning around and then moving on in the other direction. This can be loosely viewed as a random walk with effective moves of order 1/δ taken over time intervals of order 1/δ, and therefore its long-term diffusive behavior rather involves a Gaussian of standard deviation (1/δ)·√(δt) = √(t/δ) ≫ √t. It is shown in [18] that for δ of order 1/n this walk has mixing time O(n).
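
To illustrate the claimed linear scaling numerically (again our own sketch, with a small holding probability added by us to avoid parity effects; the constructions in [15], [18] differ in their details), one can track the full lifted distribution over (position, sign) pairs and record when the node marginal is within total variation distance 1/4 of uniform.

```python
import numpy as np

# Lifted walk on the n-cycle: state = (position, sign). With probability `hold`
# the walker stays put; otherwise it moves by its sign and then flips the sign
# with probability delta. We evolve the exact lifted distribution and report the
# first time the node marginal is within total variation distance 1/4 of uniform.

def persistent_walk_mixing_time(n, delta, hold=0.1, eps=0.25):
    plus = np.zeros(n); minus = np.zeros(n)   # mass at node i with sign +1 / -1
    plus[0] = minus[0] = 0.5                  # latent sign initialized uniformly
    uniform = np.ones(n) / n
    t = 0
    while 0.5 * np.abs(plus + minus - uniform).sum() > eps:
        moved_plus = (1 - delta) * np.roll(plus, 1) + delta * np.roll(minus, -1)
        moved_minus = (1 - delta) * np.roll(minus, -1) + delta * np.roll(plus, 1)
        plus = hold * plus + (1 - hold) * moved_plus
        minus = hold * minus + (1 - hold) * moved_minus
        t += 1
    return t

if __name__ == "__main__":
    for n in [32, 64, 128]:
        t = persistent_walk_mixing_time(n, delta=1.0 / n)
        print(n, t)   # t scales roughly linearly in n, in contrast with the n^2 above
```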

A slowly varying Markov chain could start by deterministically adding +1 to the position at each step. This introduces a deterministic drift along the cycle, but it will not lead to a uniform distribution. To resolve this, the chain is slowly varied towards a standard random walk. Similarly to the lifted Markov chain, the state will have a tendency to explore more of the cycle thanks to the initial deterministic dynamics, but ultimately it will converge to the uniform distribution thanks to the final random walk dynamics.

A gather-and-distribute strategy could implement the following moves. During the first n steps, gather all probability mass on position 0. After this, the state of the system is perfectly known, and we can efficiently disperse it uniformly, independently of the initial distribution. These dynamics exactly map any initial distribution to the uniform distribution after O(n) steps (a concrete sketch of such time-varying dynamics is given below).
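
Below is a sketch of one concrete way to implement such a gather-and-distribute scheme on the n-cycle (our own construction following the description above, not code from the paper). The dynamics are time-varying but local — every transition keeps mass on a node or moves it to a neighbor — and they map any initial distribution exactly to the uniform one after 2(n−1) steps.

```python
import numpy as np

# Gather phase: every node i > 0 deterministically sends its mass to i - 1, so
# after n - 1 steps all mass sits on node 0. Distribute phase: at step t only
# node t acts, keeping a fraction 1/(n - t) of its mass and pushing the rest to
# node t + 1; this leaves exactly 1/n on each node visited along the way.

def gather_matrix(n):
    P = np.zeros((n, n))
    P[0, 0] = 1.0
    for i in range(1, n):
        P[i, i - 1] = 1.0
    return P

def distribute_matrix(n, t):
    P = np.eye(n)
    P[t, t] = 1.0 / (n - t)
    P[t, t + 1] = 1.0 - 1.0 / (n - t)
    return P

def gather_and_distribute(n, initial):
    mu = np.array(initial, dtype=float)
    G = gather_matrix(n)
    for _ in range(n - 1):                   # gather: all mass ends up on node 0
        mu = mu @ G
    for t in range(n - 1):                   # distribute: release mass along the cycle
        mu = mu @ distribute_matrix(n, t)
    return mu

if __name__ == "__main__":
    n = 6
    rng = np.random.default_rng(0)
    initial = rng.random(n); initial /= initial.sum()   # an arbitrary starting distribution
    print(gather_and_distribute(n, initial))             # every entry equals 1/n
```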

One must note that not all the above speed-up approaches build on the same prior knowledge of the graph and target distribution, e.g. gather-and-distribute appears to require more prior insight to design an efficient algorithm — at least in its most basic implementation. In the present paper, we set this point aside and instead ask the question: what speedup can we ultimately expect from each one of these approaches?

Indeed, to our knowledge, no classification or clear comparison of the achievable speedups across this variety of approaches is available, so it is not clear which ones are more promising to pursue towards more advanced (e.g. adaptive) designs, and which ones will quickly hit a hard limit.

In this paper, we show how the attainable speedups can be categorized on the basis of fundamental properties of the stochastic processes resulting from such algorithms, like invariance of the target distribution or the initialization of auxiliary variables. This allows one to quickly assess the potential of a technique before digging into its deeper details and possible variations. Such results add to the recently revived interest in irreversible and beyond-Metropolis–Hastings techniques, some interesting results of which are presented in [11], [12], [54], [55], [61]. Our analysis builds on the fact that a wide set of speedup approaches, including all the ones listed above, can be cast into the overarching framework of lifted Markov chains (LMCs). This allows us to use mixing time bounds in the LMC framework in order to derive bounds on the speedup achievable within each individual approach. The translation into an LMC is not unique, and therefore it is important to consider the best achievable performance under more abstract properties of LMC classes. The results are also of independent interest for the LMC literature itself, as they clarify how the mixing time bound depends precisely on the assumptions of the setting, and how some traditional assumptions relate to existing algorithmic approaches. Finally, one of the motivations for the present work is to pave the way towards a quantitative and fair comparison between lifted chains and quantum walks: although very different from an internal dynamics perspective, both these models rely on non-Markovian effects and share many similarities that can be made precise within the proposed framework [8].

The remainder of the paper can be summarized as follows. We first introduce the LMC model in Section 2, along with several particular LMC constructions which will be used in our discussion. The associated properties or constraints that we consider on an LMC are presented in Section 3. Section 4 is dedicated to translating some key classes of algorithms into LMCs and to discussing the above-defined properties for them. This will allow us to properly position our results with respect to existing ones and to highlight, among others, two aspects that play a key role: the ability to locally initialize the latent variables of the lifted chain, and whether or not we impose invariance of the target distribution during the whole evolution of the LMC. These two properties allow us to immediately identify some “extreme” scenarios, described in Section 5, where either lifting does not yield any advantage over a standard Markov chain, or it potentially allows for reaching the target in the trivial minimum time, corresponding to the diameter of the graph. For the latter, we explicitly provide an LMC that mixes in diameter time, building on stochastic bridges [23], [52] which can be efficiently set up provided one has full knowledge of the graph. This is still different from a practical construction based on local knowledge only, yet we recall that both in [15] and in our paper, the purpose is to investigate the ultimate potential of a method, not to design new practical algorithms. A summarized version of the results of Section 5 can be found in the conference proceedings [7].

In Section 6, we establish lower bounds on the mixing time in intermediate scenarios, i.e. when only one of the above constraints is imposed, together with constraints on the reducibility of the LMC, the mixing of the LMC towards its own lifted stationary distribution, and its ergodic flows. These bounds depend on the conductance of the graph, which provides a richer description of the graph topology than the diameter. In particular, we show that a conductance bound for the mixing time of lifted Markov chains holds under either of two seemingly unconnected constraints: (i) if we impose that the lifted chain mixes from any initial state in the entire lifted state space, i.e. without allowing one to choose the initial values of the latent variables (this essentially extends the scope of the result in [15]); or (ii) if we impose that the lifted dynamics keeps the target distribution invariant, i.e. when the system starts well-mixed it must stay so for all times. Conductance bounds are typically stricter than the diameter-time bound, yet examples show how they still allow the lifted chains to significantly outperform the best possible standard Markov chain. Furthermore, we show that the other constraints – i.e. obeying imposed ergodic flows, irreducibility of the LMC, and considering the mixing properties of the lifted distribution vs. the marginal on the original nodes – do not significantly modify the achievable mixing time.

In Section 7 we provide some further observations: we show that most of the bounds that we prove are tight up to log-factors, we further illustrate how particular properties can be deduced indirectly from our scenarios and bounds, and we discuss possible extensions of our results to other settings. To conclude, a summary of the results and a brief outlook on future developments are provided in Section 8.

Section snippets

Setting: mixing dynamics on graphs and their lifts

Consider a graph G=(V,E), with V a set of N nodes, which we label as V={1,…,N}, and E the set of directed edges, i.e. ordered pairs of nodes. Throughout the paper, graphs are assumed to be connected, and a real function on the node space V will be represented as a vector in ℝ^N. In particular, we denote by e_i the canonical basis vector, with all elements zero except its i-th element equal to one. The notation e_i will also be used more generally for canonical basis vectors whose dimension is clear

Mixing time and design scenarios

The overarching message of this paper is that the achievable mixing performance is critically dependent on some constraints and insensitive to some others, besides the locality associated to the graph G. We now specify the constraints and the emerging design scenarios considered here, in the LMC framework. It will appear that even the definition of mixing time depends on the imposed constraints.

LMCs for existing algorithms and their properties

Before getting to the results, we illustrate the different LMC constraints by showing how the speedup approaches mentioned in the introduction can be cast into the LMC framework. The resulting LMCs will be used as a basis to apply the bounds that we derive in the upcoming sections, in order to deduce mixing time properties for the algorithms themselves. Note that the translation of a given algorithm into an LMC is not unique. A trivial example is that one can always construct a new LMC by

Minimal and maximal acceleration of mixing: invariance and initialization, both or none

We start by identifying the scenarios for which the lift cannot provide any advantage in mixing time with respect to a simple Markov chain, and those that allow for the fastest (diameter time) mixing. Remarkably, the only constraints that are relevant to determine these “extreme” behaviors concern the capability of initializing the lift, and the invariance of π.

Results on conductance bounds

In this section we discuss conductance bounds for LMC scenarios. Section 6.1 discusses the known conductance bounds which, as we show in Section 6.2, lie between the diameter and random walk bounds derived in the previous section. In Section 6.3 we prove that the bound holds for a number of scenarios beyond what was known, and in Section 6.4 we put the relevance of ergodic flows into perspective in light of these results.

Tightness and complementary observations

In this section we present some further results and observations, related to the lower bounds on mixing time. The mixing time bounds for the scenarios of Section 5 are obviously tight, i.e. they can indeed be achieved by an appropriate LMC on any graph. In Section 7.1 we establish tightness as well for most of the scenarios involving a conductance bound; this builds rather directly on the result of [15]. In Section 7.2, we observe how some graph properties can also be directly derived from our

Summary and perspective

We show how a wide range of approaches to accelerate the mixing time of random walks can be cast as lifted Markov chains (LMCs) in different scenarios — see examples throughout the text and further ones in the appendix. We provide an extensive classification of these scenarios, and show that the limits and opportunities of acceleration approaches feature a subtle yet clear dependency on the scenario in which they can be cast. This allows us to put the specific results on LMCs, such as the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Some partial results of this work, regarding the role of initialization and invariance and of possible interest to the systems and control community, were presented in the conference paper [7].

References (62)

  • S. Apers et al., Bounding the convergence time of local probabilistic evolution
  • S. Apers et al., When does memory speed-up mixing?
  • S. Apers et al., Simulation of quantum walks and fast mixing with classical processes, Phys. Rev. A (2018)
  • T. Batu et al., Testing that distributions are close
  • F. Bénézit, A. Dimakis, P. Thiran, M. Vetterli, Gossip along the way: Order-optimal consensus through randomized path...
  • J. Bierkens, Non-reversible Metropolis–Hastings, Stat. Comput. (2016)
  • J. Bierkens et al., A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model, Ann. Appl. Probab. (2017)
  • S. Boyd et al., Fastest mixing Markov chain on graphs with symmetries, SIAM J. Optim. (2009)
  • S. Boyd et al., Fastest mixing Markov chain on a graph, SIAM Rev. (2004)
  • F. Chen et al., Lifting Markov chains to speed up mixing
  • D. Dervovic, For every quantum walk there is a (classical) lifted Markov chain with faster mixing time (2017)
  • P. Diaconis, Some things we’ve learned (about Markov chain Monte Carlo), Bernoulli (2013)
  • P. Diaconis et al., Analysis of a nonreversible Markov chain sampler, Ann. Appl. Probab. (2000)
  • P. Diaconis et al., On the spectral analysis of second-order Markov chains, Ann. Fac. Sci. Toulouse Math. (2013)
  • P. Diaconis et al., Geometric bounds for eigenvalues of Markov chains, Ann. Appl. Probab. (1991)
  • M. Dyer et al., A random polynomial-time algorithm for approximating the volume of convex bodies, J. ACM (1991)
  • R. Fitzner et al., Non-backtracking random walk, J. Stat. Phys. (2013)
  • T.T. Georgiou et al., Positive contraction mappings for classical and quantum Schrödinger systems, J. Math. Phys. (2015)
  • L. Georgopoulos, Definitive Consensus for Distributed Data Inference (2011)
  • B. Gerencsér et al., Improved mixing rates of directed cycles by added connection, J. Theoret. Probab. (2019)
  • W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika (1970)