1 Introduction

Load balancing algorithms play a crucial role in distributing jobs among multiple servers and have attracted strong renewed interest due to proliferation of large data centers and cloud computing. Well-known load balancing algorithms include, for instance, the Join-the-Shortest-Queue (JSQ), Join-the-Shortest-Queue-d (JSQ(d)) and Join-the-Idle-Queue (JIQ) policies. These policies have been extensively analyzed in an overarching framework called the supermarket model. This framework consists of a single dispatcher that immediately routes the arriving jobs to one of the N identical parallel servers according to the assignment policy under consideration. The servers are identical in the sense that each of them can handle any arriving job and all servers process jobs at the same rate. The JSQ policy assigns each arriving job to the server with the shortest queue length and has strong stochastic optimality properties among the class of policies without advance knowledge about the service requirements [5, 28]. The JSQ policy involves a significant communication burden, however, which may be prohibitive in large systems.

This scalability issue has spurred an interest in the JSQ(d) policy which assigns a job to the server with the shortest queue length among \(d\ge 2\) randomly selected servers. Mitzenmacher [18] and Vvedenskaya et al. [27] analyzed the JSQ(d) policy in an asymptotic regime where the total arrival rate and the number of servers grow proportionately larger. Substantial performance gains were established compared to purely random assignment, even for \(d=2\). Mukherjee et al. [19] show that the waiting time of an arriving job in fact vanishes when d tends to infinity as the number of servers grows large. A vanishing waiting time is also achieved by the JIQ policy which directs arriving jobs to an idle server or a randomly selected server if all servers are occupied [16]. The JIQ policy only has a constant communication overhead per job, but requires memory at the dispatcher. We refer to Van der Boor et al. [26] and Gamarnik et al. [8] for further details.

As mentioned above, a key feature of the supermarket framework is the exchangeability of the servers in the sense that any job can be handled equally well by any server, which is often not the case in practice. In the present paper, we will focus on a scenario where jobs or servers are not intrinsically different, but where particular servers might be better equipped to process certain jobs because of affinity or compatibility relations. We refer to the model as the affinity-scheduling model. Such affinity relations may, for example, arise due to geographical proximity in spatial settings, or data locality in content distribution or transaction processing applications.

More concretely, once a job arrives, there is a subset of the servers (referred to as the primary selection) that can process it at rate \(\mu _1\). However, it might be beneficial to assign this job to one of the remaining servers, if this allows an immediate start of the service. Obtaining service at a server that was not in the primary selection comes with a price; the service rate \(\mu _2\) will be smaller than \(\mu _1\). In a recent paper by Gardner et al. [9], a similar assignment policy is considered for a loosely related but different model where servers are heterogeneous. The jobs are statistically identical in this setting, i.e., a specific subset of servers is faster for all jobs. In Sect. 2, the assignment and scheduling policy of the affinity-scheduling model will be described in greater detail as well as the similarity and difference with the setup in [9].

One particular model instance is the neighborhood model, where the geographical proximity of the underlying network has been taken into account. Assume that the service system is built on top of a graph structure \(G_N\), then the primary selection for a job consists of the server where it arrives and the immediate neighbors of this server determined by \(G_N\). The neighborhood model extends the models constructed by Gast [10], Turner [25] and Mukherjee et al. [19]. In these settings, it is assumed that all nodes have equal arrival rates and jobs can only be forwarded to direct neighbors; it is not possible to redirect an arriving job to any other nodes. The model constructed by Yekkehkhany et al. [29] does allow for jobs to be redirected to higher-degree neighbors to be served at lower rates.

Another model instance that we will investigate is the combinatorial model. Let d be a fixed integer, then any selection of d servers could occur as a primary selection for an arriving job. In addition, the arrival rates will be equal among the server selections which strengthens the symmetric nature of the combinatorial model.

The lack of exchangeability among the servers makes the affinity-scheduling model complicated to analyze in general. The analytical techniques that are most commonly used in the context of the supermarket model, such as mean field limits and even standard coupling arguments, fundamentally rely on this exchangeability. These techniques can only be applied for the combinatorial model. For the general model, and in particular the neighborhood model, the investigation of load balancing issues is challenging and enters largely uncharted methodological territory.

We will establish a stochastic dominance result for the occupancy process of the general affinity-scheduling model, which will yield a sufficient stability constraint as an immediate by-product. Exploiting the coupling of this dominance result, we can derive two tighter dominance results for the neighborhood model, which will in particular hold if the underlying graph structure is rather dense. To the best of our knowledge, these are the first results that explicitly capture the impact of network structure on load balancing performance.

For the combinatorial model, we will conduct a fluid limit analysis. A trajectory of the fluid limit will converge to one of the possibly multiple fixed points, depending on the mutual relationships of the model parameters and the initial configuration of the system. When the fixed point is unique, we demonstrate that this provides a good approximation for the intractable stationary distribution in a finite server setting. When multiple fixed points occur, we observe the phenomenon of ‘tunneling’ described by [11]. The stochastic process will switch between multiple modes corresponding to the locally stable fixed points of the fluid limit.

The remainder of this paper is organized as follows: A detailed model description is provided in Sect. 2. Next, the main stochastic dominance result and its associated coupling are presented in Sect. 3, as well as two tighter results for specific instances of the neighborhood model. In Sect. 4, we present a fluid limit analysis of the affinity-scheduling policy for combinatorial models. The proofs of all results are deferred to Sect. 5. Finally, Sect. 6 provides concluding remarks and some directions for further research.

2 Model description

We now describe the affinity-scheduling model with N servers. Let \(\mathcal {P}(\{1,\dots ,N\})\) denote the power set of all servers. For a selection \(S\in \mathcal {P}(\{1,\dots ,N\})\), jobs arrive as a Poisson process with rate \(\lambda _S \ge 0\). For these jobs, the servers in S and \(S^c\) are called the primary and secondary servers, respectively. An arriving job can be assigned as a type-I job to a primary server or as a type-II job to a secondary server. Type-I jobs have independent and exponentially distributed service times with parameter \(\mu _1>0\) and are favored by the server over the type-II jobs. Type-II jobs have on average longer service times which are independent and exponentially distributed with parameter \(\mu _2\) (\(0<\mu _2 <\mu _1\)). It is important to note that the job type is not predetermined on arrival, but established by the assignment policy. The main idea behind our assignment policy is: ‘Assign a job to a server in the primary selection unless it might be beneficial to assign it to a secondary server even though the service time might be longer.’ The rationale for this is to reduce the waiting time of a job. More precisely, the assignment policy goes through the following three steps:

  1. 1.

    If there is at least one completely idle server in the primary selection S, then assign the arriving job as a type-I job to one of these servers.

  2. 2.

    Otherwise, if there is at least one completely idle server in the secondary selection \(S^c\), then assign the arriving job as a type-II job to one of these servers.

  3. 3.

    If there are no idle servers present, then assign the job as a type-I job to the primary server with the smallest number of type-I jobs. Ties are broken according to the number of type-II jobs, in favor of a lower number.

When the second step is omitted, our policy resembles a JSQ(|S|) policy with |S| the cardinality of the primary selection. However, the cardinality of the server selection is allowed to differ among arriving jobs in our model and the server selection S itself is not sampled uniformly at random as is the case in a JSQ(|S|) policy. Moreover, the second step can be interpreted as a JIQ policy on the set of secondary servers. Our affinity-scheduling policy thus shares similarities with both policies.

Denote the configuration of a server, i.e., the number of type-I and type-II jobs in its queue, by (ij), \(i\ge 0\) and \(j\in \{0,1\}\). As an illustrative example of the assignment policy, suppose, for a given arriving job, the primary and secondary servers have \(\{(1,0),(1,1),(1,1),(4,0) \}\) and \(\{(1,0),(1,1),(1,1),(3,1)\}\) as their configurations, respectively. Under the assignment policy, the third step will be applied and the primary server with configuration (1, 0) will receive an additional type-I job. Notice that under this strategy, an arriving job will never be assigned as a type-II job to a server that already has a job in its queue. This implies that if the initial configuration has at most one type-II job, the number of type-II jobs per server will never exceed one. Without essential loss of generality, we will assume that the initial configuration indeed satisfies this constraint. Furthermore, we assume that the routing decision time and transit times are negligible.

There is no lower bound imposed on the value of \(\mu _2\), which strengthens the fact that we focus primarily on the type-I jobs. In general, type-I jobs are the preferred type of jobs, which also manifests itself in the scheduling policy. Each server operates under a preemptive priority scheduling discipline in favor of the type-I jobs. Moreover, type-I jobs are served in order of arrival.

Let \(N_{n,j}(t)\) denote the number of type-j jobs at server \(n\in \{1,\dots ,N\}\) at time t. The configuration of server n is then given by \(\left( N_{n,\text {I}}(t),N_{n,\text {II}}(t)\right) \in \mathbb {N}\times \{0,1\}\) with state space \(\left( \mathbb {N}\times \{0,1\}\right) ^{N}\). The vector \(\left( N_{n,\text {I}}(t),N_{n,\text {II}}(t)\right) _{n}\) evolves as an irreducible, time-homogeneous Markov process. However, the length of this vector grows with the number of servers which is therefore not well suited for, for instance, a fluid limit analysis. Consequently, we introduce different variables that are more server centric and will be more convenient in proving stochastic dominance and analyzing the fluid limit. Define \(Q_{ij}(t)\) as the number of servers with i type-I jobs and j type-II jobs at time t, with \(i\ge 0\) and \(j\in \{0,1\}\). Then,

$$\begin{aligned} \overline{Q}_{ij}(t)\doteq \sum \limits _{k\ge i} Q_{kj}(t) \end{aligned}$$
(1)

denotes the number of servers with at least i type-I jobs and exactly j type-II jobs. We note that \(\overline{Q}^N_{00}(t)+\overline{Q}^N_{01}(t)=N\) by definition. Since the stochastic dominance result in the next section will focus on the type-I jobs, we also introduce the following variables:

$$\begin{aligned} \overline{Q}_{m+}(t) \doteq \sum _{i = m}^{\infty } \overline{Q}_i(t), \end{aligned}$$
(2)

where \(\overline{Q}_{i}(t)\) denotes the number of servers with at least i type-I jobs, i.e., \(\overline{Q}_{i}(t) = \overline{Q}_{i0}(t) + \overline{Q}_{i1}(t)\). It is important to note that these variables will no longer lead to a Markov process representation in the general settings mentioned in the introduction. This immediately limits the number of available techniques to analyze the performance.

Let \(\mathcal {S}\) denote the subset of \(\mathcal {P}(\{1,\dots ,N\})\) with strictly positive arrival rates. Besides the general setting where \(\mathcal {S}\) can be any subset of \(\mathcal {P}(\{1,\dots ,N\})\), we will also investigate some more restricted cases. In the neighborhood model on the graph topology \(G_N\), each node represents a server and the edges represent underlying relations between them. Then each selection in \(\mathcal {S}\) consists of a server and its neighbors determined by \(G_N\). In total, \(\mathcal {S}\) contains N different server selections and jobs arrive to each of them independently at a uniform rate \(\lambda >0\). This model instance captures the situation where a job’s physical arrival location plays a role in its service time at the various servers.

Let \(\mathcal {S}\) consist of all possible server selections of size d. The cardinality of \(\mathcal {S}\) is \(\left( {\begin{array}{c}N\\ d\end{array}}\right) \), and henceforth, we refer to this model as the combinatorial model. We assume a uniform arrival rate \(\nu \) per selection. We let \(\nu =\lambda N / \left( {\begin{array}{c}N\\ d\end{array}}\right) \) per selection so that the total rate is given by \(\lambda N\). Observe that the combinatorial model captures the situation where a selection of d servers is drawn uniformly at random as the primary selection for each job.

Remark 1

There are also instances of the affinity-scheduling model that are not captured by either the neighborhood model or the combinatorial model. On the one hand, there are also model instances with non-uniform arrival rates \((\lambda _S)_S\) per server selection. On the other hand, some model instances could be intermediate versions of the neighborhood model and the combinatorial model. As an example, suppose a job arrives to a primary selection that consists of the servers \(1,\dots ,5\) or an arbitrary selection of size two of the remaining servers. Then, \(\mathcal {S}\) consists of \(\{1,\dots ,5\}\) and all pairs of servers of \(6,\dots ,N\).

As mentioned in introduction, Gardner et al. focus on a model with heterogeneous servers which is loosely related to ours. The system under consideration consists of exactly \(k_F\) servers operating at rate \(\mu _1\) (fast servers) and \(k_S\) servers operating at rate \(\mu _2 < \mu _1\) (slow servers) [9]. For each arriving job, a primary selection of size \(d_F\) is uniformly selected from the fast servers and a secondary selection of size \(d_S\) is selected from the slow servers. Then, an allocation strategy similar to ours, referred to as JSQ(\(d_F\), \(d_S\)), is applied with the only difference that the second and third steps are not always applied even if the conditions are fulfilled. For instance, even if the condition in step two is fulfilled, with probability \(1{-}p_S\) the arriving job will still be forwarded to the (busy) fast server with the shortest queue length. Similarly, the third step is only applied with probability \(p_F\) once its condition is fulfilled and there are no idle severs in both selections. So, the major difference between the two models is the way heterogeneity between the servers manifests itself. Both models coincide in the very special case when the allocation strategy in [9] utilizes the model parameters \(d_F = k_F\), \(d_S = k_S\) and \(p_F = p_S = 1\) and the set of server selections \(\mathcal {S}\) in the affinity-scheduling model consists of only one set of servers that coincides with the fast servers in [9].

3 Stochastic dominance and coupling

In this section, we establish several stochastic dominance results for our affinity-scheduling strategy. We will construct a stochastic coupling that allows a comparison with various reference systems in terms of the ordered server states, and refer to this coupling as the restructure coupling, or shorter R-coupling. In contrast to the affinity system, the various reference systems all involve N exchangeable servers and are therefore far more amenable to (asymptotic) analysis, yielding tractable performance bounds. So, the coupling builds a bridge between a structured system on the one hand and an unstructured system on the other hand, hence the name of the coupling. From the previous section, we know that there is at most one type-II job present. Due to this boundedness of the number of type-II jobs, we will only count the number of type-I jobs in the affinity system. One reason for this is that only these jobs could possibly lead to unstable behavior.

3.1 General framework

Before we elaborate on the specific results, the general framework of the R-coupling is presented. The common features and differences of the specific generalizations of the coupling for different model instances are exposed.

While each of the N servers in the reference system processes jobs in a FCFS manner at rate \(\mu _1\), the various specific incarnations differ in the value of the normalized arrival rate per server \(\lambda _0\) and the policy for assigning jobs. The choice of the specific reference system is aligned with the properties of the affinity system in terms of the server selections \({\mathcal S}\) and the associated arrival rates \(\lambda _S\), \(S \in {\mathcal S}\). Loosely speaking, we obtain increasingly tight dominance results under increasingly restrictive symmetry and structural conditions on the server selections \({\mathcal S}\) and the associated arrival rates. The three specific variants for the reference system that we consider operate under either (i) a purely random assignment (RA) policy, (ii) a MJSQ(k) policy (as specified later) or (iii) a JSQ(k) policy (as described in the introduction). While the RA system provides exact upper bounds in terms of independent M/M/1 queues, the MJSQ(k) and JSQ(k) systems yield asymptotic upper bounds based on fluid limits.

The dominance results revolve around stochastic majorization properties in terms of the ordered server states. Specifically, define \(\overline{Q}_{m+}^{\mathrm {aff}}(t)\) and \(\overline{Q}_{m+}^{\mathrm {ref}}(t)\) as the variables in (2) for the affinity and reference system, respectively. We will establish results of the form

$$\begin{aligned} \{(\overline{Q}_{m+}^{\mathrm {aff}}(t))_{m \ge 1}\}_{t \ge 0} \le _{{\mathrm {st}}} \{(\overline{Q}_{m+}^{\mathrm {ref}}(t))_{m \ge 1}\}_{t \ge 0}. \end{aligned}$$

This majorization result indicates that the number of type-I jobs residing in the mth or higher-ordered queue position in the affinity system is stochastically bounded from above by the number of jobs residing in the mth or higher-ordered queue position in the reference system. In particular, taking \(m = 1\), this implies that the total number of type-I jobs in the affinity system is stochastically bounded from above by the total number of jobs in the reference system. As noted earlier, we know the exact distribution of the latter quantity in the RA system and have an asymptotic result for the MJSQ(k) and JSQ(k) systems.

In order to prove the stochastic majorization properties, we introduce the R-coupling to construct sample paths for the affinity and reference systems on a joint probability space for which the stated inequalities hold in a deterministic way [15, 22, 24]. The servers in both systems can be ordered in an ascending way according to number of (type-I) jobs in their queue. Let \(N_{[n],\text {I}}(t)\) and \(N_{[n]}(t)\) denote the number of type-I jobs in the affinity system and in the reference system at the nth ordered server at time t, respectively. For all three specific reference systems, the common proof concept is to ensure that under the coupling the two following key properties always hold with respect to the ordered server states as illustrated in Fig. 1. Let t indicate the moment that an event occurs in the coupled systems.

  1. (a)

    If \(N_{[n],\text {I}}(t) = N_{[n],\text {I}}(t^-)+1\), then \(N_{[\tilde{n}]}(t) = N_{[\tilde{n}]}(t^-)+1\) with \(\tilde{n}\in \{n,\dots ,N\}\).

    So, an arrival of a type-I job in the affinity system must give rise to an arrival at a higher-ordered server in the reference system.

  2. (b)

    If \(N_{[n]}(t) = N_{[n]}(t^-)-1\), then \(N_{[n],\text {I}}(t) =\max (N_{[n],\text {I}}(t^-)-1,0)\).

    So, a service completion in the reference system must force a service completion of a type-I job at the same ordered server in the affinity system (unless there is no type-I job at this server).

Fig. 1
figure 1

A schematic representation of the two ordered systems with \(N=6\) servers at time t. The affinity system, operating under the affinity-scheduling policy, consists of type-I (squares) and type-II (pentagons) jobs. The N exchangeable servers in the reference system make no distinction between the jobs

We can prove the following general lemma.

Lemma 1

(R-coupling) If a stochastic coupling between the affinity system and the reference system can be constructed such that (a) and (b) are satisfied, then \((\overline{Q}_{i}^{\mathrm {aff}}(t))_{i\ge 1}\) is majorized by \((\overline{Q}_{i}^{\mathrm {ref}}(t))_{i\ge 1}\) for \(t\ge 0\), that is,

$$\begin{aligned} \sum \limits _{i = m}^{ \infty }\overline{Q}_{i}^{\mathrm {aff}}(t) \le \sum \limits _{i = m}^{ \infty }\overline{Q}_{i}^{\mathrm {ref}}(t) \end{aligned}$$
(3)

for all \(m\ge 1\), provided that the initial configurations of both systems satisfy this inequality.

More precisely, the vector \((\overline{Q}_{i}^{\mathrm {ref}}(t))_{i\ge 1}\) weakly submajorizes the vector \((\overline{Q}_{i}^{\mathrm {aff}}(t))_{i\ge 1}\). If for \(m=1\) the inequality in (3) holds with equality, then this result implies that the affinity-scheduling policy disperses the type-I jobs more evenly among the servers than the considered assignment policy in the reference system. The proof of Lemma 1 is discussed in Sect. 5.1. In the remainder of this section, we will precisely describe the affinity coupling for each of the reference systems under consideration and verify that the properties (a) and (b) are satisfied. The coupling at service completion epochs to ensure property (b) as further detailed below is fairly standard and common across all three reference systems. The coupling at arrival epochs depends on the assignment policy under consideration in the reference system. Although the general framework is highlighted below, the precise realizations will be illustrated in the following subsections.

Coupling at arrival epochs In contrast to the service completions, the coupling at arrival epochs to guarantee property (a) is novel and highly specific to the reference system under consideration. Due to the lack of exchangeability among servers, the coupling at arrival epochs involves a further subtle complication that does not arise in constructing sample path comparisons in the context of the ordinary supermarket model. Even though we compare the evolution of the two systems in terms of the \(\overline{Q}_i\) variables as usual, these generally do not provide a Markovian state description for the affinity system as noted earlier in Sect. 2. In particular, the transitions at arrival epochs intricately depend on the server selections \({\mathcal S}\) and cannot be suitably represented in terms of the \(\overline{Q}_i\) variables.

Coupling at service completion epochs The coupling generates potential service completions at rate \(\mu _1 N\), but the aggregate service rate in either the affinity or the reference system might be lower than \(\mu _1N\) because of servers being idle or only working at rate \(\mu _2\) on type-II jobs. Let \(P_{\mathrm {aff}}\) and \(P_{\mathrm {ref}}\) be the sets of ordered positions of servers in the affinity and reference system, respectively, that are working on (type-I) jobs just before some time t at which a potential service completion occurs. Define P as the intersection \(P_{\mathrm {aff}} \cap P_{\mathrm {ref}}\) which equals \(P_{\mathrm {aff}}\) or \(P_{\mathrm {ref}}\) due to the ordering and the preemptive strategy of the affinity-scheduling policy. A random variable \(X_t\), drawn from a uniform distribution on [0, 1], decides which of the following actions is selected:

  1. (i)

    \(0\le X_t \le \frac{|P|}{N}\): Sample uniformly at random a position n from P; a departure will take place at time t in both systems at the nth ordered server.

  2. (ii)

    \(\frac{|P|}{N} < X_t \le \frac{|P_{\alpha }|}{N}\) where \(\alpha \) is ‘\(\mathrm {aff}\)’ or ‘\(\mathrm {ref}\)’: Sample uniformly at random a server position from \(P_{\alpha }\setminus P\); one job will be removed from the corresponding server in system \(\alpha \) at time t.

  3. (iii)

    \(X_t > \frac{\max \{|P_{\mathrm {aff}}|,|P_{\mathrm {ref}}|\}}{N}\): No real departure will occur among the type-I jobs in the affinity system or the jobs in the reference system.

We note that the total departure rate of type-I jobs from the affinity system is indeed given by \(\mu _1 |P_{\mathrm {aff}}|\), likewise for the reference system with a total departure rate of \(\mu _1 |P_{\mathrm {ref}}|\). The idea to work with intersections of the active server sets comes from [20, Section 4].

3.2 R-coupling with the general model

We now consider a general structure for the server selections \(\mathcal {S}\) and the corresponding arrival rates \(\{\lambda _S \mid S \in \mathcal {S}\}\) per server selection. The reference system will operate under the RA policy with arrival rate \(\lambda _0\) per server. Thus, \(\lambda _0<\mu _1\) is a sufficient stability condition for the reference system. So the purpose of this subsection is twofold: the affinity coupling is illustrated in the general setting of our affinity-scheduling policy in order to obtain a stochastic dominance result, and a stability condition is obtained as an immediate by-product.

The choice of \(\lambda _0\) is determined by the arrival rates per server selection in the affinity system, namely

$$\begin{aligned} \lambda _0\doteq \underset{p_{Sn}}{\min }\left\{ \underset{n}{\max } \left\{ \lambda _n^* = \sum _{S\in \mathcal {S}: n\in S}\lambda _{S}p_{Sn} ~\left| ~ \sum _{n\in S} p_{Sn} = 1 ~\text {with}~ p_{Sn} \ge 0, \forall n\in S \right. \right\} \right\} . \end{aligned}$$
(4)

In order to achieve a meaningful upper bound, we chose \(\lambda _0\) as small as possible. Since we only count the number of type-I jobs in the affinity system, \(\lambda _0\) must also be large enough to cope with ‘worst-case scenarios’ where no arriving job is assigned as a type-II job. Therefore, the optimization problem in (4) computes a solution that distributes the arriving jobs as type-I jobs among the servers as uniformly as possible. The variable \(p_{Sn}\) may be interpreted as the fraction of jobs with server selection S that are assigned to server \(n\in S\). With this interpretation in mind, it is easily seen that at least one server must handle an arrival rate of \(\lambda _0\) or larger in case jobs are only allowed to be executed as type-I jobs. Thus, \(\lambda _0<\mu _1\) is clearly a necessary stability condition for any policy in this case. The condition is sufficient as well, for instance for a simple static strategy that assigns a job with server selection S to server n with probability \(p_{Sn}\). However, the implementation of this policy would require full knowledge of the arrival rates \(\lambda _S\). We will establish that the condition is also sufficient for the stability of our affinity-scheduling strategy, which does not rely on any knowledge of the arrival rates \(\lambda _S\) at all.

We now specify the R-coupling for the reference system with the RA policy.

Coupling at arrival epochs The coupling generates potential arrival events at rate \(\lambda _0 N\). If a potential arrival occurs at time t, a position \(n^*\) from the set \(\{1,\dots ,N\}\) is selected uniformly at random. For brevity, we simply refer to the server at the ordered position \(n^*\) as server \(n^*\). An addition of a new job in the reference system will take place at this server \(n^*\). Since this position was randomly selected, the coupling strategy will give rise to an addition according to the RA policy in a system with arrival rate \(\lambda _0\) per server.

In order to determine whether an arrival event of a type-I job takes place in the affinity system and at which server this will happen, we follow the strategy described below. Two random variables, \(Y_{t,1}\) and \(Y_{t,2}\), are sampled from a uniform distribution on [0, 1] to take into account that the total arrival rate in the affinity system might be smaller than \(\lambda _0 N\) and to select a server selection S for an arriving job. To make the decisions, we rely on the variables \((p_{Sn}^*)_{S,n}\) that attain the minimum in (4). First, \(Y_{t,1}\) establishes if an arrival occurs to a primary selection containing server \(n^*\), which happens with probability \(\lambda _{n^*}^*/\lambda _0\). If an arrival will take place, then a server selection S containing \(n^*\) is selected as the primary selection with probability \(\lambda _{S}p_{Sn^*}^*/\lambda ^*_{n^*}\) for which \(Y_{t,2}\) is used. All remaining servers form the secondary selection. Note that the total arrival rate to a server selection S,

$$\begin{aligned} \lambda _0 N \sum _{n\in S} \frac{1}{N}\frac{\lambda ^*_n}{\lambda _0} \frac{\lambda _{S}p_{Sn}^*}{\lambda ^*_n}, \end{aligned}$$
(5)

will indeed be equal to \(\lambda _S\) in the affinity system as \(\sum _{n\in S} p_{Sn}^*=1\) by definition. So, this approach will coincide with the arrival process of the general model described in Sect. 2. Once these selections are set for an arriving job, the assignment policy is applied as defined in Sect. 2. Due to the general structure of \(\mathcal {S}\), it is not possible to determine the exact server at which a job is assigned in terms of the variables \((\overline{Q}_{ij})_{i,j}\). However, if the new job is assigned as a type-I job in the affinity system to one of the servers in S, it is known that the position of this server will be at most \(n^*\). Since the newly arrived job in the reference system is assigned to server \(n^*\), property (a) of the coupling is maintained.

We can state the following theorem that will lead to a sufficient stability condition. This theorem follows from the majorization result in Lemma 1 since the above-described coupling satisfies the general framework of the R-coupling.

Theorem 1

(General affinity-scheduling model) Let \(\lambda _0\), as defined in (4), be the arrival rate per server in the reference system operating under the RA policy. Then, for suitable initial conditions,

$$\begin{aligned} \{(\overline{Q}_{m+}^{\mathrm {aff}}(t))_{m \ge 1}\}_{t \ge 0} \le _{{\mathrm {st}}} \{(\overline{Q}_{m+}^{\mathrm {RA}}(t))_{m \ge 1}\}_{t \ge 0} \end{aligned}$$
(6)

holds for the general affinity-scheduling model with N servers.

Theorem 1 provides a stochastic upper bound for the total number of type-I jobs in the affinity system in terms of the number of jobs in a reference system with the RA policy by taking \(m=1\). Although this upper bound is sufficient to guarantee stochastic stability for \(\lambda _0<\mu _1\), we will develop tighter majorization results for particular instances of the neighborhood model in the next subsection.

It can be shown that the condition \(\lambda _0 < \mu _1\) indeed implies the necessary and sufficient stability condition for a broader class of flexible parallel server systems in Harrison and López [12] and Stolyar [23]. In the setting of the affinity-scheduling model without type-II jobs, i.e., \(\mu _2 = 0\), the condition \(\lambda _0 < \mu _1\) is also necessary for stability. This scenario in fact falls in the framework with partially accessible servers considered by Foss and Chernova [7]. While the corresponding stability condition derived in [7] has a somewhat different form, it can be shown that it is equivalent to ours using the max-flow min-cut theorem of Ford–Fulkerson [6]. This proof is discussed in Sect. 5.2.

Remark 2

(General applicability of the R-coupling) The scope of application of the R-coupling is much broader than the general affinity-scheduling model. The described method is powerful enough to handle any assignment policy for the type-II jobs as long as the type-I jobs are assigned to a server with the shortest queue length within the primary selection. This implies that the number of type-I jobs will still be dominated by the number of jobs in the reference system. Furthermore, if the assignment policy for the type-II jobs only allows a finite number of them in each queue, then stability is guaranteed once \(\lambda _0<\mu _1\). For instance, this method could also be applicable in a setting without type-II jobs as the general compatibility model. In this setting, arriving jobs can only be served at predetermined server selections.

3.3 Neighborhood model

We will further investigate our model on a graph topology \(G_N\) as described in Sect. 2. It is challenging to get a grip on the performance of an assignment policy that is applied in a network structure, and establishing stochastic dominance relations can give an initial insight into the theoretical behavior of load balancing algorithms in structured environments. It is mentioned in Sect. 2 that the arrival rate over all server selections established by the graph structure \(G_N\) is given by \(\lambda \), and, thus, Theorem 1 is still valid if we set \(\lambda _0 = \lambda \). However, we will make two different assumptions on the structure of the graph topology, and for each of them a tighter dominance result than Theorem 1 is obtained. The first scenario assumes that the minimum degree of \(G_N\) is sufficiently high and the second scenario entails regular graph topologies.

3.3.1 Minimum degree

The reference system with N exchangeable servers operates under a modified version of the JSQ assignment policy, namely MJSQ(k) [19]. In this setting, new jobs arrive at a total rate of \(\lambda N\) and are processed at a server according to a FCFS policy at rate \(\mu _1>\lambda \). An arriving job is assigned to the server with the \((k+1)\)th shortest queue length. A clear analogy can be seen if the system is initially completely empty; then, k servers will constantly remain idle. The system operates as if only \(N{-}k\) servers are present and applies a JSQ policy restricted to these servers. If N is sufficiently large compared to k, i.e., if

$$\begin{aligned} \lambda N < \mu _1 (N-k), \end{aligned}$$
(7)

then the MJSQ(k) policy is stochastically stable.

Suppose that the minimum degree of the graph \(G_N\) is at least \(N{-}k{-}1\), without any other structural assumptions. Letting N and k satisfy the relation in (7), we can describe a coupling between our neighborhood model with underlying topology \(G_N\) and the reference system with the MJSQ(k) policy. The coupling between both systems will fit the general framework of the affinity coupling. The coupling method for the arriving jobs will differ from the general setting in the previous subsection, and the coupling between the service completions follows the methodology explained in Sect. 3.1.

Coupling at arrival epochs For each of the neighborhood sets in \(\mathcal {S}\), there is a uniform arrival rate \(\lambda \) such that the total arrival rate in the affinity system is also given by \(\lambda N\). Assuming that an event in the coupled sample path is an arrival, it is always directed to the server at position \(k+1\) under the MJSQ(k) policy. For the neighborhood model, the primary selection S consists of a randomly selected server and its neighbors under the topology \(G_N\) and the secondary selection \(S^c\) contains all other servers. We do not know the exact ordered positions of the servers in the primary selection that is of size at least \(N-k\) in terms of the \(\overline{Q}_i\) variables. The worst-case scenario that could arise is a primary selection of size exactly \(N{-}k\) where the servers are the \(N{-}k\) highest ordered servers. Then, a type-I job is assigned to the server at position \(k+1\). All other scenarios where an arriving job is labeled as a type-I job in the affinity system will lead to an assignment that is at most at the \((k+1)\)th position. Hence, property (a) of the affinity coupling is satisfied.

Theorem 2 follows from the majorization result in Lemma 1 under the above-described coupling.

Theorem 2

(Neighborhood model with minimum degree \(\varvec{N}-\varvec{k}-\varvec{1}\)) Consider the neighborhood model with an underlying graph topology with minimum degree \(N{-}k{-}1\) and a reference system that operates under the MJSQ(k) policy. Then, for suitable initial conditions,

$$\begin{aligned} \{(\overline{Q}_{m+}^{\mathrm {aff}}(t))_{m \ge 1}\}_{t \ge 0} \le _{{\mathrm {st}}} \{(\overline{Q}_{m+}^{\mathrm{MJSQ}(k)}(t))_{m \ge 1}\}_{t \ge 0}. \end{aligned}$$
(8)

Once the reference system is stochastically stable, if condition (7) is fulfilled, we can give a meaningful upper bound on the total number of type-I jobs in the neighborhood model in terms of the total number of jobs under the MJSQ(k) policy. Note that the stability region in (7) for fixed values of N is smaller compared to the general setting in Sect. 3.2. However, when N is sufficiently large, the MJSQ(k) will outperform the RA policy, implying better performance results on the fluid level. So this upper bound will then be tighter compared to the result in Theorem 1.

Remark 3

Theorem 2 can be generalized for scenarios of the affinity-scheduling model where each server selection S has a size of at least \(N{-}k\) and a non-uniform arrival rate \(\lambda _S\).

3.3.2 Regular graph

As mentioned in introduction, JSQ(k) gives already substantial performance improvements for small values of k compared to the RA policy. With this in mind, we show that the number of type-I jobs under our affinity-scheduling policy on a d-regular graph is stochastically dominated by the total number of jobs under a JSQ(k) policy, when d and k satisfy the following relation:

$$\begin{aligned} \sum _{i=1}^{N-d-1} \left( {\begin{array}{c}N-i\\ k-1\end{array}}\right) \le \frac{d+1}{N} \left( {\begin{array}{c}N\\ k\end{array}}\right) . \end{aligned}$$
(9)

This condition manifests itself in the coupling construction between the arrival events in both systems such that feature (a) of the R-coupling is maintained. We will introduce a novel approach to represent or visualize all possible server selections in \(\mathcal {S}\) that an arriving job can choose from.

An arriving job in the reference system with JSQ(k) is assigned to the lowest positioned server among k randomly selected servers. In total, there are \(\left( {\begin{array}{c}N\\ k\end{array}}\right) \) server selections and each server belongs to \(\left( {\begin{array}{c}N-1\\ k-1\end{array}}\right) \) different server selections. Thus, the lowest positioned server of the system belongs to \(\left( {\begin{array}{c}N-1\\ k-1\end{array}}\right) \) different server selections; the second lowest server is part of precisely \(\left( {\begin{array}{c}N-2\\ k-1\end{array}}\right) \) different server selections without the lowest ordered server. One can continue this reasoning up to the \((N-k+1)\)th lowest ordered server; this server belongs to only one more server selection that is not yet observed at any of the lower-ordered servers. All higher-ordered servers cannot be part of an unobserved server selection. We will construct a step function from the positions \(\{1,\dots ,N\}\) to the interval [0, 1] based on the so-called block interpretation of the server selections. Assume that the servers are ordered from 1 to N and the lowest ordered position of the k selected servers is denoted by n. We represent this selection as a block from position n to N with height \(1/\left( {\begin{array}{c}N\\ k\end{array}}\right) \). This procedure can be repeated for each of the \(\left( {\begin{array}{c}N\\ k\end{array}}\right) \) possible selections, and these corresponding blocks can be stacked according to their length. This procedure is called the block interpretation of the server selections. The outer edges of these stacked blocks will give rise to the following step function:

$$\begin{aligned} f_{\mathrm {ref}} :\{1,\dots ,N\} \rightarrow [0,1] :x \mapsto \left\{ \begin{array}{ll} \frac{1}{\left( {\begin{array}{c}N\\ k\end{array}}\right) }\sum _{i=1}^{x} \left( {\begin{array}{c}N-i\\ k-1\end{array}}\right) , &{}\quad 1\le x\le N-k+1,\\ 1, &{}\quad N-k+1 < x \le N. \end{array}\right. \end{aligned}$$
(10)

A visualization is shown in Fig. 2.

Fig. 2
figure 2

A visualization of the step functions of the server selections in both systems. There are \(N=12\) servers, \(d=2\) (or \(d=7\)) and \(k=3\). The block interpretation of server selection \(S=\{8,9,11\}\) of the affinity system is represented with a gray block

We aim to construct a similar step function based on the possible primary server selections for the affinity system when the underlying graph topology is a d-regular graph. Jobs arrive at a total rate \(\lambda N\), and an arriving job selects uniformly at random a server selection S from \(\mathcal {S}\). By construction, \(\mathcal {S}\) contains N different primary server selections, each of size \(d+1\). Then, the lowest ordered server in the system belongs to \(d+1\) different server selections. However, it is not possible to count the number of additional server selections containing the second lowest ordered server without knowing the position of each of the servers, since we operate on a fixed graph structure. We construct a step function based on the worst-case scenario to maintain property (a) of the coupling where the lowest positioned server of each of the server selections is still at the highest possible position. The first jump occurs at the lowest positioned server, while all remaining jumps will occur at the highest possible positioned servers. This induces stronger correlations between the servers that have more type-I jobs. Notice that the worst-case ordering of the servers might not align with the real underlying d-regular structure, so that this approach might be too conservative. The step function of this worst-case scenario is given by

$$\begin{aligned}&f_{\mathrm {aff}} :\{1,\dots ,N\} \rightarrow [0,1]\nonumber \\&x \mapsto \left\{ \begin{array}{ll} \frac{d+1}{N}, &{}\quad 1\le x\le N-d - \lceil \frac{N}{d+1}\rceil +1,\\ 1-\frac{d+1}{N}(N-d-x), &{}\quad N-d - \lceil \frac{N}{d+1}\rceil +2 \le x \le N-d-1,\\ 1, &{}\quad N-d\le x \le N. \end{array}\right. \end{aligned}$$
(11)

An example of this step function is shown in Fig. 2.

Coupling at arrival epochs Let the total arrival rate be \(\lambda N\) in both systems. For an arriving job at time t, we determine the servers of interest using the inverse transform sampling method [4, Chapter 2]. First, we note that the functions \(f_{\mathrm {aff}}\) and \(f_{\mathrm {ref}}\) are cumulative distribution functions by construction. Second, the only server of interest of the server selection S in the affinity system or the server selection in the reference system is the lowest positioned server. So we sample a random variable \(X_t\) from a uniform distribution on [0, 1] and determine the two servers positions, \(n_{\mathrm {aff}}\) and \(n_{\mathrm {ref}}\), of interest of both systems. This procedure is visualized in Fig. 2. In the affinity system, a job can be assigned as a type-I job to the selected server, or it may be assigned to any other server as a type-II job.

So in order to guarantee feature (a) of the affinity coupling, it needs to be ensured that \(n_{\mathrm {aff}}\le n_{\mathrm {ref}}\), i.e., \(f_{\mathrm {aff}}(n)\ge f_{\mathrm {ref}}(n)\) for all positions n. We observe that the d-regular graph must be rather dense in order to obtain a tighter upper bound than provided by the RA policy. Once the degree d is at least N / 2, the step function \(f_{\mathrm {aff}}\) only makes two jumps, at positions 1 and \(N-d\) of sizes \((d+1)/N\) and \((N-d-1)/N\), respectively. Since the step function \(f_{\mathrm {ref}}\) is concave in its discrete points, we only need to ensure that \(f_{\mathrm {aff}}(N-d-1) \ge f_{\mathrm {ref}}(N-d-1)\) holds, so that the step function of the affinity system is above the step function of reference system. This results in condition (9) on the values of d and k. Due to the coupling construction, we can prove the following dominance result.

Theorem 3

(Neighborhood model with \(\varvec{d}\)-regular graph) Consider the neighborhood model with an underlying d-regular graph topology and a reference system operating under a JSQ(k) policy. If the model parameters d and k satisfy condition (9), then, for suitable initial conditions,

$$\begin{aligned} \{(\overline{Q}_{m+}^{\mathrm {aff}}(t))_{m \ge 1}\}_{t \ge 0} \le _{{\mathrm {st}}} \{(\overline{Q}_{m+}^{\mathrm{JSQ}(k)}(t))_{m \ge 1}\}_{t \ge 0}. \end{aligned}$$
(12)

Due to the coupling construction using the block interpretation of the server selections, the above-described coupling fits the general framework of the affinity coupling as stated in Lemma 1. Therefore, the result of Theorem 3 follows from this lemma. If \(\lambda <\mu _1\), the reference system is stochastically stable and provides a meaningful upper bound on the performance of the neighborhood model on a d-regular topology.

Table 1 Smallest possible value of d, \(d^*\), that satisfies condition (9) is listed for a system with \(N = 50\) servers and given values of k

In Table 1, we list \(d^*\), the minimum value of d as a function of k that guarantees the required dominance of the step functions for a system with \(N=50\) servers. Furthermore, Fig. 3 visualizes the behavior of \(d^*\) for fixed values of k and increasing graph sizes N. The minimum degree seems to grow linearly with N. For instance, it is a straightforward computation to show that

$$\begin{aligned} d^* = \left\lceil \frac{\sqrt{N^2+4(N-1)^2}-N}{2} \right\rceil \end{aligned}$$
(13)

when \(k=2\). This expression is indeed of linear order in N.

Fig. 3
figure 3

A visualization of the smallest possible value of d, \(d^*\), that satisfies condition (9) as function of N for fixed values of k

From Table 1 and Fig. 3, we observe that the graph structure must be rather dense in order to stochastically dominate the affinity-scheduling policy with a JSQ(k) policy even for small values of k. One can argue that the primary selection of our affinity-scheduling strategy must be much larger compared to the server selection under JSQ(k) in order to guarantee better performance. But we should keep in mind that the underlying graph structure is fixed and all possible server selections are predetermined, while the JSQ(k) strategy can be seen as a strategy on a complete graph where an arbitrary set of size k of the servers can be selected. This will affect the performance compared to a system with N exchangeable servers, which is intuitively clear.

Moreover, it is important to note that the obtained value of \(d^*\) might be too conservative. Our coupling method using the step functions requires a degree that is at least equal to N / 2 in order to upper bound by the strategy JSQ(1), i.e., the RA policy. On the other hand, we showed in the general result that the number of type-I jobs under any structural interpretation \(\mathcal {S}\) is stochastically dominated by the number of jobs under a random assignment strategy.

Remark 4

(Combinatorial model) Applying our affinity-scheduling strategy to the combinatorial model with N servers shows a lot of similarities with the JSQ(d) policy in the setting of N exchangeable servers. Namely, an arriving job is assigned to the server with the shortest queue length among d arbitrarily selected servers and sometimes the job can be directed to an idle server outside this selection. The coupling can be adjusted such that the number of type-I jobs at the nth ordered position under our affinity-scheduling policy is less than or equal to the number of jobs at the nth ordered position under a JSQ(d) policy. The relation between the two policies will become more apparent in Sect. 4 when fluid limit results are investigated.

Moreover, it can be shown that the combinatorial model is stochastically stable under a preemptive and a non-preemptive scheduling policy using a Foster–Lyapunov argument. This result is shown under the assumption that the number of type-II jobs at each server never exceeds one. The fact that stability is preserved under a preemptive policy in favor of the type-I jobs is no surprise due to the structure of the server selections \(\mathcal {S}\) and the resemblance of the first step in the assignment policy with the JSQ(d) policy. Under a non-preemptive policy, it is no longer intuitively clear, as any finite value of \(\mu _2\) is allowed and one could imagine a situation where all servers are processing a type-II job and type-I jobs start to accumulate behind these type-II jobs.

4 Fluid limit and fixed point analysis

As mentioned in the introduction, the affinity model in general lacks the exchangeability among the servers that underpins the use of mean field limits as the main analytical techniques in the supermarket model. Due to its inherent symmetry, the combinatorial model with uniform arrival rates for each of the server selections in \(\mathcal {S}\) as described in Sect. 2 is one of the exceptions. The variables \((Q_{ij}^N(t))_{i,j}\) will give rise to a Markov process representation in this case. The primary and secondary server selections for an arriving job are of sizes d and \(N{-}d\), respectively. In order to gain insight into the system performance, we introduce the fluid-scaled variables, i.e.,

$$\begin{aligned} \left( \frac{\overline{Q}^N_{ij}(t)}{N}\right) _{i,j}, \end{aligned}$$

and analyze a sequence of systems where the number of servers N tends to infinity. The (weak) limit that arises is referred to as the fluid limit and is denoted by \((\overline{q}_{ij}(t))_{i,j}\). When it is helpful to stress the proportion of servers with exactly i type-I jobs, instead of at least i type-I job, we consider the variables \((q_{ij}(t))_{i,j}\). It is clear that \(q_{ij}(t)\) is given by \(\overline{q}_{ij}(t){-}\overline{q}_{i+1,j}(t)\) for any i and j. Furthermore, we assume that \(\lambda <\mu _1\) to guarantee stochastic stability. Throughout this section, we will consider a system with \(\lambda =0.8\), \(\mu _1=1\) and \(\mu _2=0.5\) in the numerical and simulation experiments, unless specified otherwise.

4.1 Fluid limit

We now provide a characterization of the (deterministic) fluid limit in terms of a set of discontinuous differential equations. The t reference in the notation will be omitted, if the context allows this.

We introduce a reduced arrival rate \(\tilde{\lambda }\). A job will always be directed to an idle server if available, either as a type-I job or a type-II job, and idle servers are generated at rate \(\mu _1q_{10}+\mu _2q_{01}\). This implies that if \(\lambda \) is sufficiently high, i.e., \(\lambda > \mu _1q_{10}+\mu _2q_{01}\), only a fraction of the arriving jobs will start to queue in front of a server as type-I jobs on the fluid level. This fraction is given by \(\tilde{\lambda }/\lambda \), with

$$\begin{aligned} \tilde{\lambda }=\left( \lambda - \mu _1q_{10}-\mu _2q_{01}\right) ^+. \end{aligned}$$
(14)

Then,

$$\begin{aligned} {\left\{ \begin{array}{ll} \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{00} = \mu _2{q}_{01} - \lambda (1-q_{00})^{d} + \tilde{\lambda } \mathbb {1}\{q_{00} = 0\}, \\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{01} = -\mu _2{q}_{01} + \mathbb {1}\{q_{00}> 0\} \left[ \lambda (1-q_{00})^{d}\right] + \mathbb {1}\{q_{00} = 0\} \left[ \lambda - \tilde{\lambda }\right] ,\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{10} = -\mu _1{q}_{10} + \mathbb {1}\{q_{00} > 0\} \left[ \lambda \left( 1-(1-q_{00})^{d}\right) \right] ,\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{11} = -\mu _1{q}_{11} + \tilde{\lambda } \mathbb {1} \{q_{00} = 0\} \left[ \left( \overline{q}_{10}+\overline{q}_{01}\right) ^{d} -\left( \overline{q}_{10}+\overline{q}_{11}\right) ^{d}\right] ,\\ \text {for}~i\ge 2,\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{i0} = -\mu _1{q}_{i0} + \tilde{\lambda } \mathbb {1} \{q_{00} = 0\} \left[ \left( \overline{q}_{i-1,0}+\overline{q}_{i-1,1}\right) ^{d} - \left( \overline{q}_{i0}+\overline{q}_{i-1,1}\right) ^{d} \right] ,\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{i1} = -\mu _1\overline{q}_{i1} + \tilde{\lambda } \mathbb {1}\{q_{00} = 0\} \left[ \left( \overline{q}_{i0} +\overline{q}_{i-1,1}\right) ^{d} -\left( \overline{q}_{i0} +\overline{q}_{i1}\right) ^{d}\right] , \end{array}\right. } \end{aligned}$$
(15)

with \(\overline{q}_{00}+\overline{q}_{01}=1\).

Since the system operates under a preemptive priority policy, the structure of the departure rate in each of the equations in (15) is clear. For instance, in order to change the proportion \(\overline{q}_{11}\) due to a job completion, this job completion must take place at a server with configuration (1, 1). Exactly a fraction \(q_{11}\) of the servers has this configuration, and since these servers each work at rate \(\mu _1\), the total rate of change is given by \(-\mu _1{q}_{11}\).

Let us illustrate the representation of the arrival term for the derivative of \(\overline{q}_{11}\). Only an arrival of a type-I job at a server with configuration (0, 1) can contribute to the arrival term, and the probability that this configuration is the smallest among the d servers in the primary selection is given by \(\left( \overline{q}_{10}+\overline{q}_{01}\right) ^{d} -\left( \overline{q}_{10}+\overline{q}_{11}\right) ^{d}\). Ties are broken according to the presence of a type-II job, in favor of having no type-II jobs. Moreover, there should be no idle servers because otherwise an arriving job would be assigned here as a type-II job. Since type-I jobs arrive at a reduced rate \(\tilde{\lambda }\), the total rate of change is given by \(\tilde{\lambda } \mathbb {1}\{q_{00} = 0\} [ \left( \overline{q}_{10}+\overline{q}_{01}\right) ^{d} -\left( \overline{q}_{10}+\overline{q}_{11}\right) ^{d}]\).

The expressions for the arrival terms in the derivatives of (15) and the reduced arrival rate \(\tilde{\lambda }\) should be considered more carefully due to the discontinuity at \(q_{00}=0\). We will give a sketch of the derivation of this fluid limit in Sect. 5.3. This derivation relies on the martingale method for point processes and Markovian queueing settings outlined by Pang et al. [21] and Brémaud [3].

The fluid limit expression can be validated with simulations of the fluid-scaled stochastic process. Consider, for instance, Fig. 4, where the solution of the fluid limit (15) is presented together with a simulated trajectory of a reasonably large system. It can be observed that the simulated trajectory fluctuates closely around the numerical solution of the fluid limit, which supports the connection between the fluid limit and the behavior of the stochastic system in a many-server setting.

Fig. 4
figure 4

A comparison between a simulated trajectory of the fluid-scaled stochastic process with \(N=2000\) servers (thin line) and the numerical solution (thick line) of the fluid limit (15). The model parameters are given by \(d=25\), \(\lambda = 0.8\), \(\mu _1 = 1\) and \(\mu _2 = 0.5\)

4.2 Fixed points

To investigate the long-run behavior of the fluid limit (15), we are interested in its fixed points. It turns out that the mutual relationships between the model parameters d, \(\lambda \), \(\mu _1\) and \(\mu _2\) play a crucial role. In the remainder of this section, we investigate the setting where \(\lambda > \mu _2\), in order to compare one of the fixed points with the fixed point of a JSQ(d) policy with reduced load \(\tilde{\lambda }\).

Theorem 4

(Fixed points) When \(\lambda > \mu _2\) and \(d\ge 2\), the system of differential equations (15) always has the following fixed point:

$$\begin{aligned} {\left\{ \begin{array}{ll} \overline{q}_{i0}^* = 0, &{}\\ \overline{q}_{i1}^* = \left( \frac{\lambda -\mu _2}{\mu _1-\mu _2} \right) ^{\frac{d^i-1}{d-1}}, &{} i=0,1,2\dots . \end{array}\right. } \end{aligned}$$
(16)

Let \(d^*\doteq d^*(\lambda ,\mu _1,\mu _2)\) denote the minimum selection size that satisfies

$$\begin{aligned} d^* \lambda \left( \frac{1}{\mu _2} - \frac{1}{\mu _1} \right)> & {} 1, \nonumber \\ \left( 1- \frac{1}{d^*}\right) \frac{\mu _1}{\lambda }> & {} \left( d^*\lambda \left( \frac{1}{\mu _2} - \frac{1}{\mu _1}\right) \right) ^{\frac{1}{d^*-1}}. \end{aligned}$$
(17)

If \(d \ge d^*(\lambda ,\mu _1,\mu _2)\), then precisely two more fixed points exist. These fixed points are such that \(q_{00} + q_{01}+q_{10} = 1\) and \(q_{00}>0\).

The proof of this theorem is discussed in Sect. 5.3. It can be observed that there always exists a sufficiently large d value that satisfies both inequalities of condition (17) for given values of \(\lambda \), \(\mu _1\) and \(\mu _2\). This is trivial to see for the first inequality. The second inequality can be rewritten as

$$\begin{aligned} \left( 1-\frac{1}{d}\right) \left( \frac{1}{d \cdot a}\right) ^{\frac{1}{d-1}} > \frac{\lambda }{\mu _1} \end{aligned}$$
(18)

with \(a\doteq \lambda (\frac{1}{\mu _2}-\frac{1}{\mu _1})\). The left-hand side is increasing as a function of d with limit \(1> \lambda /\mu _1\). Table 2 gives the values of \(d^*\) satisfying (17) for a range of model parameters. It can be seen that the higher the load, the larger the size of the primary selections must be for multiple fixed points to persist. The additional fixed points have a strictly positive fraction of idle servers, and it is intuitively clear that the number of servers where a job can be processed at rate \(\mu _1\) must grow with the load in order for such fixed points to persist.

Table 2 For given \(\lambda \), \(\mu _1 = 1\) and \(\mu _2\), the minimum value \(d^*\) that satisfies condition (17) is listed

The long-term fraction of servers with at least i jobs under a JSQ(d) policy is given by

$$\begin{aligned} \overline{q}_i^* = \left( \frac{\tilde{\lambda }}{\mu _1}\right) ^{\frac{d^i-1}{d-1}} =\left( \frac{\lambda -\mu _2}{\mu _1-\mu _2}\right) ^{\frac{d^i-1}{d-1}}, \end{aligned}$$
(19)

for \(i\ge 0,\) with arrival rate \(\tilde{\lambda }\) and service rate \(\mu _1\) [18]. This shows a strong similarity with the fixed point (16) where two types of jobs are taken into account. Next we consider the case \(d=1\). When \(\lambda > \mu _2\), there still is a unique fixed point with \(q_{00} = 0\), given by

$$\begin{aligned} {\left\{ \begin{array}{ll} \overline{q}_{i0}^* = 0, &{}\\ \overline{q}_{i1}^* = \left( \frac{\lambda -\mu _2}{\mu _1-\mu _2}\right) ^i, &{}i=0,1,2,\dots . \end{array}\right. } \end{aligned}$$
(20)

This shows strong resemblance with the RA policy with load \(\rho = \tilde{\lambda }/\mu _1\). Allowing a primary selection of at least two servers leads to a super-exponential improvement compared to a primary selection of size one. On the other hand, there is no fixed point with \(\lambda \ge \mu _2\) and \(q_{00}>0\). Only if \(\lambda < \mu _2\) can we show that there is a unique fixed point with \(q_{00}>0\), namely

$$\begin{aligned} q_{00}^* = \frac{(\mu _2-\lambda )\mu _1}{(\mu _2-\lambda )\mu _1 + \lambda \mu _2}. \end{aligned}$$
(21)

4.3 Further analysis

We will conduct a further analysis of the fluid limit (15) where we distinguish between \(d < d^*\) and \(d \ge d^*\).

4.3.1 Sufficiently small primary selections

When d is sufficiently small in terms of the model parameters \(\lambda \), \(\mu _1\) and \(\mu _2\), i.e., \(d < d^*\), the fixed point (16) of the fluid limit (15) is unique. Numerical experiments suggest that this fixed point is a global attractor, i.e., the trajectories of the fluid limit will converge to this fixed point for every initial state of the system. As an example, we present Fig. 5, where the numerical solution of the fluid limit is visualized for ten randomly sampled initial configurations. We consider a system with the above-mentioned model parameters and a primary selection of size \(d=3\). As can be seen from the figure, all cumulative fractions \(\left( \overline{q}_{i0}\right) _{i\ge 0}\) tend to zero. The fractions \(\overline{q}_{01}\), \(\overline{q}_{11}\), \(\overline{q}_{21}\) and \(\overline{q}_{31}\) converge to 1, 0.6, 0.1296 and 0.0013, respectively.

Fig. 5
figure 5

Ten solution trajectories (thin lines) of the fluid limit (15) are plotted for randomly generated initial states, showing convergence of the cumulative distributions to the fixed point (thick dashed line). The model parameters are given by \(d = 3\), \(\lambda = 0.8\), \(\mu _1=1\) and \(\mu _2= 0.5\), and each of the servers has at most three type-I jobs in the initial states

In the previous section, we used the R-coupling to show stochastic stability and the existence of an (unknown) stationary distribution for \(\lambda <\mu _1\). Assuming global stability of the unique fixed point, Theorem 1 by Benaïm and Le Boudec [2] ensures that the large-N limit of the stationary distribution will converge to the fixed point. Moreover, from simulations it can be observed that the trajectories indeed converge to the unique fixed point of the fluid limit (15). As an example, consider a system with the above-mentioned model parameters. Figure 6 shows a simulated trajectory of the fluid-scaled variables for a system with \(N=2000\) servers that is initially completely empty. It can be seen that the trajectory converges to the fixed point \((q_{01},q_{11},q_{21}, \dots ) = (0.40,0.4704,0.1283,\dots )\), rounded at four decimals.

Fig. 6
figure 6

A comparison between the long-term behavior of a simulated trajectory for \(N= 2000\) servers (thin line) and the unique fixed point (thick dashed line) of the fluid limit (15). The model parameters are given by \(d = 3\), \(\lambda = 0.8\), \(\mu _1 = 1\) and \(\mu _2 = 0.5\)

The asymptotic approximation for the mean stationary queue length, excluding the job in service, suggested by the fixed point is given by

$$\begin{aligned} \mathbb {E}\left[ Q_{ \text {CM({ d})}}\right] = \sum \limits _{i\ge 1} i q_{i,1} =\sum \limits _{i\ge 1} \left( \frac{\lambda -\mu _2}{\mu _1-\mu _2}\right) ^{\frac{d^{i}-1}{d-1}}. \end{aligned}$$
(22)

Here CM(d) refers to the combinatorial model with a primary server selection of size d. It might be interesting to compare the performance of our affinity-scheduling policy with the performance of the JSQ(d) policy in an ordinary supermarket model with arrival rate \(\lambda \) and service rate \(\mu _1\) [18, 27]. The combinatorial model can be seen as an extension of the JSQ(d) policy with the additional feature that jobs can be assigned to a distant server if this allows the service to start immediately. The mean queue length under the JSQ(d) policy is given by

$$\begin{aligned} \mathbb {E}\left[ Q_{\text {JSQ({ d})}} \right] = \sum \limits _{i\ge 1} i q_{i+1} =\sum \limits _{i\ge 1} \left( \frac{\lambda }{\mu _1}\right) ^{\frac{d^{i+1}-1}{d-1}}. \end{aligned}$$
(23)

Furthermore, the exact mean queue length under the RA policy is given by

$$\begin{aligned} \mathbb {E}\left[ Q_{\mathrm {RA}}\right] = \left( 1-\frac{\lambda }{\mu _1}\right) \sum \limits _{i\ge 1} i \left( \frac{\lambda }{\mu _1}\right) ^{i+1} =\frac{(\lambda /\mu _1)^2}{1-\lambda /\mu _1}. \end{aligned}$$
(24)

Figure 7 presents a comparison of the number of waiting jobs as a function of \(\lambda \), with \(d=3\), \(\mu _1=1\) and \(\mu _2=0.5\). It is known that the mean queue length for the RA policy tends to infinity when the offered traffic grows to one. We see that the mean queue length in the combinatorial model is slightly larger than for the JSQ(d) policy. On the other hand, the variance of the queue length in the JSQ(d) model is almost twice as large compared to the combinatorial model. We conclude that the combinatorial model still performs well from a queue length perspective, even though each server has a type-II job and possibly multiple type-I jobs in its queue.

Fig. 7
figure 7

Comparison of the mean and variance of the queue length as a function of \(\lambda \) between three different models on the fluid level for \(d =3\), \(\mu _1 = 1\) and \(\mu _2 = 0.5\)

From the fixed point expression, it is not immediately visible that type-II jobs finish their service, since the fraction \(q_{00}\) is zero. However, an idle server will be filled instantly with an arriving job. A total fraction

$$\begin{aligned} \frac{\lambda -\tilde{\lambda }}{\lambda } = \frac{\mu _2}{\lambda }\frac{\mu _1-\lambda }{\mu _1-\mu _2} \end{aligned}$$
(25)

of the arriving jobs undergo this ‘immediate switch’: they are assigned as a type-II job to a server that just emptied its queue. This fraction decreases in \(\lambda \), and for example, for the above-mentioned model parameters, this leads to a fraction of 1 / 4. Furthermore, the type-II jobs will leave the system at the same rate \(\lambda -\tilde{\lambda }\) as they enter the system, since we study the system in equilibrium. Moreover, due to Little’s law we know that the expected waiting time of an arbitrary job is finite. Let W denote the waiting time, then \( \mathbb {E}\left[ Q_{\mathrm{CM}(d)}\right] =\lambda \mathbb {E}[W]. \) Since the expected queue length under our affinity-scheduling policy is finite, this results in a finite expected waiting time for an arbitrary job, so also for the type-II jobs.

Since each server operates under a preemptive scheduling policy, we can calculate the average waiting time of a type-I job using Little’s law. Let \(Q_{\mathrm {I}}\) denote the number of type-I jobs at a server. Then,

$$\begin{aligned} \mathbb {E}[Q_{\mathrm {I}}] = \sum \limits _{i\ge 1} i q_{i+1,1} =\sum \limits _{i\ge 1} \left( \frac{\lambda -\mu _2}{\mu _1-\mu _2}\right) ^{\frac{d^{i+1}-1}{d-1}}. \end{aligned}$$
(26)

Furthermore, the reduced arrival rate \(\tilde{\lambda }\) gives the arrival rate of type-I jobs on the fluid level. If \(W_\mathrm {I}\) represents the waiting time of a type-I job, then due to Little’s law \(\mathbb {E}[Q_{\mathrm {I}}] = \tilde{\lambda }\mathbb {E}[W_{\mathrm {I}}].\) Let \(Q_{\mathrm {II}}\) and \(W_{\mathrm {II}}\) have the same interpretation as above but for the type-II jobs. We condition on the type of job to obtain

$$\begin{aligned} \mathbb {E}[W] = \frac{\tilde{\lambda }}{\lambda } \mathbb {E}[W_{\mathrm {I}}] + \frac{\lambda - \tilde{\lambda }}{\lambda } \mathbb {E}[W_{\mathrm {II}}]. \end{aligned}$$
(27)

Because of Little’s law, this results in

$$\begin{aligned} \mathbb {E}[W_{\mathrm {II}}] = \frac{1}{\lambda -\tilde{\lambda }} \left( \mathbb {E}\left[ Q_{\text {CM({ d})}} \right] - \mathbb {E}[Q_{\mathrm {I}}]\right) = \frac{1}{\lambda -\tilde{\lambda }} \frac{\lambda -\mu _2}{\mu _1-\mu _2}. \end{aligned}$$
(28)

We can also immediately apply Little’s law to the type-II jobs. We know that they arrive at rate \(\lambda -\tilde{\lambda }\) and the mean waiting queue length is by definition given by

$$\begin{aligned} \mathbb {E}[Q_{\mathrm {II}}] = \sum \limits _{i \ge 1} q_{i1} = 1- q_{01} =\frac{\lambda -\mu _2}{\mu _1-\mu _2}. \end{aligned}$$
(29)

In Fig. 8, we compare the mean waiting time of a type-I or type-II job with the mean waiting time under the RA or the JSQ(d) policy. The mean waiting time of type-II jobs is fairly high, but still lower than the waiting time under the RA policy. We also observe that the mean waiting time of type-I jobs is significantly smaller than under a JSQ(d) policy. We conclude that our assignment policy leads to a reduction in the mean waiting time for a large group of arriving jobs at the expense of some other jobs that encounter longer waiting times. The uniqueness of the fixed point allows us to analyze the asymptotic stationary distribution of the model. On the other hand, we observe that the value of the size of the server selection d is too small to achieve a zero waiting time for an arriving job.

Fig. 8
figure 8

A comparison of the mean waiting times of the type-I and type-II jobs with the mean waiting times under the RA policy or a JSQ(d) policy as function of \(\lambda \) for \(d =3\), \(\mu _1 = 1\) and \(\mu _2 = 0.5\)

4.3.2 Sufficiently large primary selections

Assume that the primary selection has a sufficiently large size d for given model parameters in terms of the conditions (17), i.e., \(d\ge d^*\). From Theorem 4, we know that, in addition to the closed-form fixed point (16), there are two more fixed points with \(q_{00}+q_{01}+q_{10}=1\). We prove the following theorem using the indirect Lyapunov method.

Theorem 5

(Local (in)stability) Of the two additional fixed points mentioned in Theorem 4 with \(q_{00}+q_{01}+q_{10}=1\) when \(d\ge d^*\), one is locally stable and the other one is unstable.

The proof of Theorem 5 is given in Sect. 5.3. In the remainder of this subsection, we will provide a numerical illustration, where we consider a system with \(\lambda =0.8\), \(\mu _1=1\), \(\mu _2=0.5\) and \(d=25\) throughout. We observed similar qualitative behavior across many different scenarios, but only present results for the above parameter values because of space constraints. To get a better notion of the local stability, we present Fig. 9. For several initial values such that \(q_{00}+ q_{01}+q_{10} = 1\), the system of differential equations (15) is solved numerically. All trajectories with initial states indicated with large dots will converge to the locally stable fixed point from the previous theorem and a few of these trajectories are also visualized. All other initial states, indicated with small dots, will not converge to this locally stable fixed point. We see that these states have a large fraction of servers with a type-II job present and a small fraction of idle servers, since there is a smaller probability of selecting an idle server in the primary selection. So jobs will have a longer mean service time as a type-II job and jobs will start to accumulate.

Fig. 9
figure 9

An overview of the initial states with \(q_{00}+q_{01}+q_{10} = 1\), if the corresponding trajectories converge to the locally stable fixed point (triangle), then the initial states are indicated with a large dot, and otherwise, they are indicated with a small dot. We consider a system with \(d=25\), \(\lambda =0.8\), \(\mu _1=1\) and \(\mu _2=0.5\)

In total, this gives rise to two locally stable fixed points: the closed-form fixed point (16) where each server has a type-II job and possibly multiple type-I jobs, and the fixed point from Theorem 5 where at most one job is present at each server. In the remainder of this section, we will refer to these fixed points as the queueing fixed point and no-queueing fixed point, respectively. We do not formally prove this statement, but we will illustrate it with a representative example. For a system with the above-mentioned parameters, the two fixed points under consideration (non-cumulative fractions) are given by

$$\begin{aligned} (q_{00},q_{01},q_{10})= & {} (0.1966,0.0067,0.7967),\nonumber \\ (q_{01},q_{11})= & {} (0.4,0.6). \end{aligned}$$
(30)

Both fixed points are indicated in Fig. 10 with dotted and dashed lines, respectively. Furthermore, the graphs contain 20 trajectories starting from randomly sampled initial configurations with \(q_{00}+q_{01}+q_{10}+q_{11} = 1\); all these trajectories converge to one of the two fixed points. This implies that the convergence area presented in Fig. 9 to the no-queueing fixed point will in fact be larger. As can be seen, most of the trajectories will converge to the queueing fixed point. This phenomenon will be even more apparent if we allow initial states with more than two jobs.

Fig. 10
figure 10

Trajectories for 20 randomly sampled initial states are visualized, with initial points such that \(q_{00}+q_{01}+q_{10}+q_{11} = 1\). The trajectories converge to one of the two fixed points; the no-queueing and queueing fixed point are indicated with a thick dotted line and thick dashed line, respectively

The literature often describes systems with a unique global attractor as a fixed point of the fluid limit so that there is a direct connection between the stationary distribution in a many-server setting and this fixed point. However, the non-uniqueness of the fixed points does not imply that these two concepts are completely disjoint. For instance, Fig. 4 presents a comparison between the numerical solution of the fluid limit and a simulation with \(N=2000\) servers with the above-mentioned model parameters. The system is initially empty, and the simulated trajectory seems to converge to the no-queueing fixed point. We could present a similar figure, where in the initial configuration each server has one type-II job, in which case both the numerical solution and the simulation seem to tend to the queueing fixed point.

However, the stochastic process with a finite number of servers is an irreducible Markov process which implies that any state can be reached as long as the process is observed long enough and a unique equilibrium distribution must exist. Nevertheless, it can be observed that the residence time near each of the locally stable fixed points, which increases with N, is long before the process makes the transition to the other locally stable fixed point. Gibbens et al. [11] describe this concept of switching between multiple modes by ‘tunneling.’

Examples of models where metastability plays an important role in loss and communication networks can be found in [1, 11, 30]. More recent work by Martirosyan and Robert [17] considers an assignment policy closely related to the affinity-scheduling policy in a loss network setting, i.e., jobs can be redirected to distant servers with a penalty or can be omitted if none of the servers has enough spare capacity. Also in this setting, a fluid limit analysis reveals multiple locally stable fixed points.

The combinatorial model was introduced as a highly symmetric model instance of the more general affinity-scheduling model. This built-in symmetry allowed us to conduct a fluid limit analysis. However, simulation results indicate that the observed metastability is a more general phenomenon. The simulations were conducted for server systems with various sizes and various underlying graph structures, including 24-regular graphs and Erdös-Rényi random graphs with average degree 25. The remaining model parameters were kept as \(\lambda = 0.8\), \(\mu _1=1\) and \(\mu _2=0.5\), and the system was initially completely empty. Note that for both instances of the neighborhood model the (average) size of the primary selection is at least 25 as before, but the variability among the possible primary selections has been greatly reduced compared to the combinatorial model instances. We chose for an average degree of 25 instead of 24 for the Erdös-Rényi random graphs to ensure that most of the sizes of the primary selections indeed satisfy condition (17). The value of the corresponding no-queueing fixed point when \(d=26\) slightly changes to \((q_{00},q_{01},q_{10}) = (0.1974,0.0053,0.7973)\). There is no noticeable difference for the value of the queueing fixed point up to four decimals. As an illustrative example, we present Fig. 11. To obtain this figure, 300 simulation runs were conducted for Erdös–Rényi random graphs consisting of \(N=250\) servers. The long-term fractions \(q_{00}\), \(q_{10}\), \(q_{01}\) and \(q_{11}\) were monitored and presented in a histogram. The values of the two above-mentioned locally stable fixed points are indicated with dotted and dashed lines, respectively. It can be observed that the long-term fractions in this non-symmetric setting still converge to either of the two fixed points. For increasing values of N, the simulated long-term fractions will be even closer to the theoretical fixed points of the combinatorial model.

Fig. 11
figure 11

Long-term fractions \(q_{00}\), \(q_{10}\), \(q_{01}\) and \(q_{11}\) of 300 simulation runs on Erdös–Rényi random graphs with average degree 25 and \(N=250\) servers. Indicated with dotted and dashed lines are the theoretical locally stable fixed points for combinatorial model instances with model parameters \(d=26\), \(\lambda = 0.8\), \(\mu _1=1\) and \(\mu _2=0.5\)

5 Proofs

5.1 Proof of Lemma 1: R-coupling

Since the system configurations between two consecutive events remain unchanged, we will condition on the discrete event times and use forward induction.

Assume that (3) holds up to the time of the \((k{-}1)\)th event. We will argue that the majorization property still holds at time \(t_k\) of the kth event by making a distinction between arrival and departure epochs. But first we need a formal way to express the effect of these events in terms of \((\overline{Q}_{i}^{\mathrm {aff}}(t) )_{i\ge 1}\) and \((\overline{Q}_{i}^{\mathrm {ref}}(t) )_{i\ge 1}\). For instance, the server with the nth shortest queue length is selected for a departure. Due to the ordering, we know that there are at least \(N-n+1\) servers with the same number of jobs or more in their queues as the server at ordered position n. It might also be possible that the server at the ordered position \(n-1\) has the same number of jobs as the server at position n. Then, there is no notable difference in terms of the variables \(\overline{Q}_{i}\) whether a removal takes place at position \(n-1\) or at position n. Instead of removing from the server at position n and reordering the servers before computing \((\overline{Q}_{i}^{\mathrm {aff}}(t) )_{i\ge 1}\) and \((\overline{Q}_{i}^{\mathrm {ref}}(t) )_{i\ge 1}\), we can also immediately update these variables. The difference is subtle and valid because the proof does not rely on the present type-II jobs or on the actual servers but only on their relative positions. Therefore, we define two intermediate quantities:

$$\begin{aligned} I_{\mathrm {aff}}(n)&\doteq \max \{j: \overline{Q}_{j}^{\mathrm {aff}}\ge N-n+1\}, \nonumber \\ I_{\mathrm {ref}}(n)&\doteq \max \{j: \overline{Q}_{j}^{\mathrm {ref}}\ge N-n+1\}. \end{aligned}$$
(31)

For instance, in the affinity system in Fig. 1, \(I_{\mathrm {aff}}(3)\) is given by 1. Furthermore, only one job will be added or removed at a discrete time event. A new event at time \(t_k\) could only violate (3) if at time \(t_{k-1}\) (3) holds with equality, i.e.,

$$\begin{aligned} \sum \limits _{i = m}^{ \infty }\overline{Q}_{i}^{\mathrm {aff}}(t_{k-1}) =\sum \limits _{i = m}^{ \infty }\overline{Q}_{i}^{\mathrm {ref}}(t_{k-1}) \end{aligned}$$
(32)

with \(m \ge 1\). Therefore, we only focus on this setting in the induction step.

Arrival. At time \(t_k\), an arrival occurs, and first the nth ordered position is selected. The updated reference system looks as follows:

$$\begin{aligned} \overline{Q}_j^{\mathrm {ref}}(t_k) =\left\{ \begin{array}{ll} \overline{Q}_j^{\mathrm {ref}}(t_k^-)+1, &{}\quad \text {if}\; j = I_{\mathrm {ref}}(n)+1,\\ \overline{Q}_j^{\mathrm {ref}}(t_k^-), &{}\quad \text {otherwise.} \\ \end{array}\right. \end{aligned}$$
(33)

If the newly arrived job is assigned as a type-II job in the affinity system or no arrival takes place due to the coupling, (3) is trivially satisfied. We consider the setting where the job is assigned as a type-I job to a server at position \(n_{\mathrm {aff}}\le n\), such that

$$\begin{aligned} \overline{Q}_{j}^{\mathrm {aff}}(t_k) = \left\{ \begin{array}{ll} \overline{Q}_{j}^{\mathrm {aff}}(t_k^-)+1, &{}\quad \text {if}\; j =I_{\mathrm {aff}}(n_{\mathrm {aff}})+1,\\ \overline{Q}_{j}^{\mathrm {aff}}(t_k^-), &{}\quad \text {otherwise.} \\ \end{array}\right. \end{aligned}$$
(34)

Moreover, the left-hand side of (3) remains unchanged if \(m > I_{\mathrm {aff}}(n_{\mathrm {aff}})+1\) so that the order in (3) is preserved. Now, fix \(m \le I_{\mathrm {aff}}(n_{\mathrm {aff}})+1\). If we now show that also \(I_{\mathrm {ref}}(n) \ge m-1\), then (3) remains valid since both sides are raised by one. We use (32) and the induction hypothesis for \(m-1\) at time \(t_k^-\) to obtain

$$\begin{aligned} \overline{Q}_{m-1}^{\mathrm {aff}}(t_k^-)= & {} \sum \limits _{i = m-1}^{ \infty } \overline{Q}_{i}^{\mathrm {aff}}(t^-_k) - \sum \limits _{i = m}^{ \infty } \overline{Q}_{i}^{\mathrm {aff}}(t^-_k) \nonumber \\\le & {} \sum \limits _{i = m-1}^{ \infty }\overline{Q}_{i}^{\mathrm {ref}}(t^-_k) -\sum \limits _{i = m}^{ \infty }\overline{Q}_{i}^{\mathrm {ref}}(t^-_k) =\overline{Q}_{m-1}^{\mathrm {ref}}(t_k^-). \end{aligned}$$
(35)

Then, it follows that \(I_{\mathrm {aff}}(n_{\mathrm {aff}}) \ge m-1\) implies \(I_{\mathrm {ref}}(n) \ge m -1\), which concludes the derivation if the event at time \(t_k\) is an arrival.

Departure. If at time \(t_k\) a departure takes place, one of the following four scenarios will occur:

  1. 1.

    There is a job completion of a type-I job in the affinity system and of a job in the reference system.

  2. 2.

    There is only a departure of a job in the reference system.

  3. 3.

    There is only a departure of a type-I job in the affinity system.

  4. 4.

    There is no departure of a type-I job in the affinity system or a job in the reference system.

It is clear that we only need to investigate the first two scenarios.

Scenario 1. Let \(P_{\mathrm{aff}}\) and \(P_{\mathrm{ref}}\) be the sets of ordered positions of servers in the affinity and reference system, respectively, that are working on (type-I) jobs just before time \(t_k\). Define P as the intersection \(P_{\mathrm{aff}}\cap P_{\mathrm{ref}}\). Let \(n \in P\) be the position of the servers in both the affinity and the reference system from which a job will be removed. The updated states will be

$$\begin{aligned} \overline{Q}_{j}^{\mathrm {aff}}(t_k)= & {} \left\{ \begin{array}{ll} \overline{Q}_{j}^{\mathrm {aff}}(t_k^-)-1, &{}\quad \text {if}\; j = I_{\mathrm {aff}}(n),\\ \overline{Q}_{j}^{\mathrm {aff}}(t_k^-), &{}\quad \text {otherwise}, \\ \end{array}\right. \nonumber \\ \overline{Q}_{j}^{\mathrm {ref}}(t_k)= & {} \left\{ \begin{array}{ll} \overline{Q}_{j}^{\mathrm {ref}}(t_k^-)-1, &{}\quad \text {if}\; j = I_{\mathrm {ref}}(n),\\ \overline{Q}_{j}^{\mathrm {ref}}(t_k^-), &{}\quad \text {otherwise}. \\ \end{array}\right. \end{aligned}$$
(36)

We will focus on \(m\le I_{\mathrm {ref}}(n)\), since for \(m>I_{\mathrm {ref}}(n)\) (3) remains trivially valid. A similar argument as above will be used to show that \(I_{\mathrm {aff}}(n) \ge m\), so that both sides will be lowered by one compared to the event time \(t_{k-1}\). We use (32) and the induction hypothesis for \(m+1\) at time \(t_k^-\) to obtain \(\overline{Q}_{m}^{\mathrm {aff}}(t_k^-) \ge \overline{Q}_{m}^{\mathrm {ref}}(t_k^-)\). Then, it follows that \(I_{\mathrm {ref}}(n) \ge m\) implies \(I_{\mathrm {aff}}(n) \ge m\) which concludes the proof of scenario 1.

Scenario 2. Let \(n \in P_{\mathrm {ref}} \setminus P\) be the position where a job leaves the reference system. Then for all j

$$\begin{aligned} \overline{Q}_{j}^{\mathrm {aff}}(t_k)= & {} \overline{Q}_{j}^{\mathrm {aff}}(t_k^-), \nonumber \\ \overline{Q}_{j}^{\mathrm {ref}}(t_k)= & {} \left\{ \begin{array}{ll} \overline{Q}_{j}^{\mathrm {ref}}(t_k^-)-1, &{}\quad \text {if}\; j = I_{\mathrm {ref}}(n),\\ \overline{Q}_{j}^{\mathrm {ref}}(t_k^-), &{}\quad \text {otherwise}. \\ \end{array}\right. \end{aligned}$$
(37)

Again we focus on \(m \le I_{\mathrm {ref}}(n)\). Fixing m, we will show by contradiction that (32) cannot occur, so that (3) is preserved at time \(t_k\) since the right-hand side can be lowered by at most one. Assuming that (32) does hold and using the induction hypothesis on \(m+1\), we conclude that \(\overline{Q}_{m}^{\mathrm {aff}} (t_k^-)\ge \overline{Q}_{m}^{\mathrm {ref}}(t^-_k)\). Now,

$$\begin{aligned} \overline{Q}_{m}^{\mathrm {aff}}(t^-_k) \ge \overline{Q}_{m}^{\mathrm {ref}}(t^-_k) \ge \overline{Q}_{I_{\mathrm {ref}}(n)}^{\mathrm {ref}}(t^-_k) \ge N-n+1 \ge |P|+1, \end{aligned}$$
(38)

since \(N-|P_{\mathrm {ref}}| < n\le N-|P|\). This implies that \(\overline{Q}_{m}^{\mathrm {aff}}(t_k^-) > |P|\); however, there are only \(|P| = |P_{\mathrm {aff}}|\) servers working on a type-I job in the affinity system. This leads to a contradiction and concludes the proof of Lemma 1.

5.2 Proof: equivalent stability conditions

As mentioned in Sect. 3.2, our affinity-scheduling model without any type-II jobs is a special instance of the parallel server model with multiple job classes described by Foss and Chernova in [7]. Each job class has its own arrival rate and set of available servers with a server or class-dependent service time distribution. The JSQ policy is one of the various potential allocation policies. Our model meets this description if there is a job class for each of the different server selections \(S\in \mathcal {S}\) and a common service rate \(\mu _1\) for all the servers.

In the above-described special setting, the necessary and sufficient stability condition in Theorems 2.5 and 2.7 in [7] reduces to

$$\begin{aligned} \rho _0 = \max _{J\subseteq \{1,\dots ,N\}} \left\{ \frac{1}{|J|\mu _1} \sum \limits _{U\subseteq J} \lambda _U \right\} < 1. \end{aligned}$$
(39)

With \(\lambda _U\), the arrival rate to server set U if U is a server selection from \(\mathcal {S}\) and zero otherwise, this stability condition implies that the aggregate arrival rate to each set of servers may never exceed the total service capacity of this set. Our stability condition, on the other hand, guarantees this for each server individually by optimizing the way the arriving streams of jobs can be divided among the servers, i.e., \(\lambda _0 < \mu _1\) with \(\lambda _0\) as defined in (4). This constructive representation serves the purposes in the stochastic coupling argument in Sect. 3.2.

Proposition 1

The conditions \(\rho _0 \le 1\) and \(\lambda _0 \le \mu _1\) are equivalent to \(\rho _0\) and \(\lambda _0\) as defined in (39) and (4), respectively.

The proof of Proposition 1 consists of two parts: first a flow network is constructed given the model parameters under consideration; then, an argument using the max-flow min-cut theorem [6] is applied to conclude the equivalence.

From a given instance of the affinity-scheduling model, a network is constructed consisting of a source node (s), a sink node (t) and nodes for each server \(n\in \{1,\dots ,N\}\) and each server selection \(S\in \mathcal {S}\). There are directed edges from the source s to each server node n with capacity \(c\ge 0\); the value of c will be specified in the second part of the proof. Moreover, there are edges with infinite capacity from each server to the server selections that include this server. Finally, each server selection node S is connected to the sink node t with the arrival rate \(\lambda _S\) as its edge capacity. A visualization of the flow network construction is shown in Fig. 12.

Fig. 12
figure 12

a Graph model with four nodes and arrival rate \(\lambda _S\) for each of the server selections S in \(\mathcal {S} =\left\{ \{1\}, \{2,3\}, \{2,3,4\}, \{3,4\} \right\} \). b Corresponding flow network

Let F(c) and C(c) denote the maximum achievable flow and minimum cut that separates the source s from the sink t in the constructed network, respectively, for a given capacity c. Note that

$$\begin{aligned} F(c) \le \sum \limits _{S\in \mathcal {S}} \lambda _S \end{aligned}$$
(40)

since the edges (St), \(S\in \mathcal {S}\), form a cut, with equality once \(c\ge \lambda _0\) due to the definition of \(\lambda _0\). Furthermore, the edges connecting server nodes and server selection nodes will never be included in the minimum cut, and the minimum cut has a specific form. Namely, if \(J_c\) is a subset of \(\{1,\dots ,N\}\), then the cut consists of all edges (sn) with \(n\in J_c\) and the edges (St) with \(S \in \mathcal {S}\) such that S is not included in \(J_c\). Indeed, note that given the set \(J_c\), all edges (St) for which S is not included in \(J_c\) must be captured in order to have a cut, but any edges (St) for which \(S \subseteq J_c\) can be omitted while preserving the cut property. This implies that the minimum cut optimization problem can be rewritten as

$$\begin{aligned} \tilde{C}(c) = \min \limits _{J\subseteq \{1,\dots ,N\}} \left\{ c|J| + \sum \limits _{S \nsubseteq J} \lambda _S \right\} \end{aligned}$$
(41)

without changing the value of the objective function. An example of a cut with \(J = \{1,2,3\}\) is presented in Fig. 12b by dashed lines. Moreover, due to the max-flow min-cut theorem, it is known that

$$\begin{aligned} F(c) = C(c) = \tilde{C}(c). \end{aligned}$$
(42)

If \(\lambda _0 \le \mu _1\), then let \(\lambda _0 \le c \le \mu _1\). Using (40) and (41) together with (42) leads to

$$\begin{aligned}&\sum \limits _{S\in \mathcal {S}} \lambda _S \le c|J| + \sum \limits _{S \nsubseteq J} \lambda _S \nonumber \\&\quad \Leftrightarrow \sum \limits _{S\subseteq J} \lambda _S \le c|J| \end{aligned}$$
(43)

for all \(J\subseteq \{1,\dots ,N\}\). Hence, \(\rho _0 \le 1\).

If \(\rho _0 \le 1\), then the equivalence in (43) holds for all \(J \subseteq \{1,\dots ,N\}\) and \(c = \mu _1\). This implies in particular that

$$\begin{aligned} \sum \limits _{S\in \mathcal {S}} \lambda _S \le \tilde{C}(\mu _1) = F(\mu _1) \le \sum \limits _{S\in \mathcal {S}} \lambda _S. \end{aligned}$$
(44)

By construction, \(\lambda _0\) is the smallest possible value for c such that (40) holds with equality. So, \(\lambda _0 \le \mu _1\).

5.3 Proofs: fluid limit and fixed point analysis

5.3.1 Derivation of fluid limit (15)

First, consider the stochastic process with N servers and its corresponding flow conservation equations. Next, the martingale methods as outlined by Pang et al. [21] and Brémaud [3] are applied and the limit as N tends to infinity of the fluid-scaled process is studied. Then, (15) is obtained from the resulting system of integral equations.

Step 1: flow conservation equations Let \(p_{ij}^N(q,t)\) be the probability that an arriving job at time t is assigned to a server with i type-I jobs and j type-II jobs as a type-q job, with \(q \in \{\mathrm {I},\mathrm {II}\}\). As before, we will omit the time dependence t to ease the notation.

We can only assign a job as a type-I or type-II job to an idle server; assignments to servers with a higher configuration will always take place as a type-I job. The corresponding transition probabilities are given by

$$\begin{aligned} p_{00}^N(\mathrm {I}) = 1- \left( 1-\frac{Q_{00}^N}{N}\right) ^{d}, \end{aligned}$$
(45)

the probability that an idle server is present in the primary selection, and

$$\begin{aligned} p_{00}^N(\mathrm {II}) = \mathbb {1}\{Q_{00}^N >0 \} \left( 1-\frac{Q_{00}^N}{N}\right) ^{d}, \end{aligned}$$
(46)

the probability that the primary selection does not contain an idle server while they are present. As mentioned in the model description, the secondary selection contains all servers that are not in the primary selection. Hence, the indicator function \(\mathbb {1}\{Q_{00}^N >0 \}\) emerges in the probabilities.

An arriving job will be assigned as a type-I job to a server with configuration (i, 0), with \(i\ge 1\), if the minimum configuration in the primary selection is given by (i, 0) and when there are no completely idle servers that can be included in the secondary selection. The corresponding probability is given by the probability that the primary selection contains only servers with at least i type-I jobs minus the probability that all d servers have a configuration strictly higher than (i, 0). Thus, for \(i\ge 1\),

$$\begin{aligned} p_{i0}^N(\mathrm {I}) = \mathbb {1}\{Q^N_{00} = 0\} \left[ \left( \sum \limits _{k\ge i} \left[ \frac{Q^N_{k0}}{N} {+} \frac{Q^N_{k1}}{N} \right] \right) ^{d}{-}\left( \frac{Q^N_{i1}}{N} +\sum \limits _{k\ge i+1} \left[ \frac{Q^N_{k0}}{N} +\frac{Q^N_{k1}}{N}\right] \right) ^{d}\right] .\quad \end{aligned}$$
(47)

In a similar way, we obtain \(p_{i1}^N(\text {I})\), for \(i\ge 0\):

$$\begin{aligned} p_{i1}^N(\mathrm {I}) = \mathbb {1}\{Q^N_{00} = 0\} \left[ \left( \frac{Q^N_{i1}}{N}{ +}\sum \limits _{k\ge i+1} \left[ \frac{Q^N_{k0}}{N} {+} \frac{Q^N_{k1}}{N}\right] \right) ^{d} { -} \left( \sum \limits _{k\ge i+1} \left[ \frac{Q^N_{k0}}{N} +\frac{Q^N_{k1}}{N}\right] \right) ^{d}\right] .\quad \end{aligned}$$
(48)

Once these probabilities are set, the flow conservation equations can be constructed. The randomness in the stochastic model is caused by Poisson arrivals and exponentially distributed service times, so that the number of arrivals and service completions can be counted using Poisson processes with appropriately chosen rates. Define a set of independent Poisson processes with rate 1. Let \(P_{A_{00,q}}\) denote the Poisson counting process for the number of arriving type-q jobs at servers with configuration (0, 0), and \(P_{A_{ij}},i+j \ge 1,\) reflects the number of arriving jobs at servers with configuration (ij). Similarly, define the counting process of the service completions \(P_{S_{ij}},~i+j \ge 1\). Furthermore, if \(i \ge 1 \), the number of servers at time t with at least i type-I jobs and exactly j type-II jobs depends on its initial state \((\overline{Q}^N_{ij}(0))\), the number of service completions of jobs at servers with configuration (ij) and the number of arrivals at servers in configuration \((i-1,j)\) within the time interval [0, t). We obtain the following flow conservation equations for the stochastic model \((\overline{Q}^N_{ij})_{i,j}\) with N servers and total arrival rate \(\lambda N\). Let \(i\ge 2\):

$$\begin{aligned} \overline{Q}^N_{00}(t)= & {} \overline{Q}^N_{00}(0) + P_{S_{01}}\left( \mu _2 \int \limits _{0}^{t}Q_{01}^N(s)\,\hbox {d}s\right) \nonumber \\&- P_{A_{00}}\left( \lambda N \int \limits _{0}^{t}\left[ p_{00}^N(\mathrm {I},s) + p_{00}^N(\mathrm {II},s)\right] \,\hbox {d}s\right) ,\nonumber \\ \overline{Q}^N_{01}(t)= & {} \overline{Q}^N_{01}(0) - P_{S_{01}}\left( \mu _2 \int \limits _{0}^{t}Q_{01}^N(s)\,\hbox {d}s\right) + P_{A_{00,\mathrm {II}}}\left( \lambda N \int \limits _{0}^{t}p_{00}^N(\mathrm {II},s)\,\hbox {d}s\right) ,\nonumber \\ \overline{Q}^N_{10}(t)= & {} \overline{Q}^N_{10}(0) - P_{S_{10}}\left( \mu _1 \int \limits _{0}^{t}Q_{10}^N(s)\,\hbox {d}s\right) + P_{A_{00,\mathrm {I}}}\left( \lambda N \int \limits _{0}^{t}p_{00}^N(\mathrm {I},s)\,\hbox {d}s\right) ,\nonumber \\ \overline{Q}^N_{11}(t)= & {} \overline{Q}^N_{11}(0) - P_{S_{11}}\left( \mu _1 \int \limits _{0}^{t}Q_{11}^N(s)\,\hbox {d}s\right) + P_{A_{01}}\left( \lambda N \int \limits _{0}^{t}p_{01}^N(\mathrm {I},s)\,\hbox {d}s\right) ,\nonumber \\ \overline{Q}^N_{i0}(t)= & {} \overline{Q}^N_{i0}(0) - P_{S_{i0}}\left( \mu _1 \int \limits _{0}^{t}Q_{i0}^N(s)\,\hbox {d}s\right) + P_{A_{i-1,0}}\left( \lambda N \int \limits _{0}^{t}p_{i-1,0}^N(\mathrm {I},s)\,\hbox {d}s\right) ,\nonumber \\ \overline{Q}^N_{i1}(t)= & {} \overline{Q}^N_{i1}(0) - P_{S_{i1}}\left( \mu _1 \int \limits _{0}^{t}Q_{i1}^N(s)\,\hbox {d}s\right) + P_{A_{i-1,1}}\left( \lambda N \int \limits _{0}^{t}p_{i-1,1}^N(\mathrm {I},s)\,\hbox {d}s\right) .\nonumber \\ \end{aligned}$$
(49)

Due to the Poisson split property, we define \(P_{A_{00}}\) as the sum of the two processes \(P_{A_{00,\mathrm {I}}}\) and \(P_{A_{00,\mathrm {II}}}\).

Step 2: Fluid-scaled process Dividing both sides of the equations by N results in a fluid-scaled process. Further, because of the martingale results in [3] and [21], we can define noise terms \(e_{ij}(N)\) that tend to 0 as \(N\rightarrow \infty \) with \(i\ge 0\) and \(j\in \{0,1\}\). The fluid-scaled system can be rewritten as follows, for \(i\ge 2\):

$$\begin{aligned} \frac{\overline{Q}^N_{00}(t)}{N}= & {} \frac{\overline{Q}^N_{00}(0)}{N} + \mu _2 \int \limits _{0}^{t}\frac{{Q}_{01}^N(s)}{N} \,\hbox {d}s \nonumber \\&- \lambda \int \limits _{0}^{t}\left[ p_{00}^N(\mathrm {I},s) + p_{00}^N(\mathrm {II},s)\right] \,\hbox {d}s + e_{00}(N), \nonumber \\ \frac{\overline{Q}^N_{01}(t)}{N}= & {} \frac{\overline{Q}^N_{01}(0)}{N} - \mu _2 \int \limits _{0}^{t}\frac{{Q}_{01}^N(s)}{N} \,\hbox {d}s +\lambda \int \limits _{0}^{t}p_{00}^N(\mathrm {II},s)\,\hbox {d}s + e_{01}(N),\nonumber \\ \frac{\overline{Q}^N_{10}(t)}{N}= & {} \frac{\overline{Q}^N_{10}(0)}{N} - \mu _1 \int \limits _{0}^{t}\frac{{Q}_{10}^N(s)}{N} \,\hbox {d}s + \lambda \int \limits _{0}^{t}p_{00}^N(\mathrm {I},s)\,\hbox {d}s + e_{10}(N),\nonumber \\ \frac{\overline{Q}^N_{11}(t)}{N}= & {} \frac{\overline{Q}^N_{11}(0)}{N} - \mu _1 \int \limits _{0}^{t}\frac{{Q}_{11}^N(s)}{N} \,\hbox {d}s + \lambda \int \limits _{0}^{t}p_{01}^N(\mathrm {I},s)\,\hbox {d}s + e_{11}(N),\nonumber \\ \frac{\overline{Q}^N_{i0}(t)}{N}= & {} \frac{\overline{Q}^N_{i0}(0)}{N} - \mu _1 \int \limits _{0}^{t}\frac{{Q}_{i0}^N(s)}{N} \,\hbox {d}s + \lambda \int \limits _{0}^{t}p_{i-1,0}^N(\mathrm {I},s)\,\hbox {d}s + e_{i0}(N),\nonumber \\ \frac{\overline{Q}^N_{i1}(t)}{N}= & {} \frac{\overline{Q}^N_{i1}(0)}{N} - \mu _1 \int \limits _{0}^{t}\frac{{Q}_{i1}^N(s)}{N} \,\hbox {d}s + \lambda \int \limits _{0}^{t}p_{i-1,1}^N(\mathrm {I},s)\,\hbox {d}s + e_{i1}(N). \end{aligned}$$
(50)

Step 3: Toward fluid limits While making the transition from integral equations to differential equations with N tending to infinity, the representation of the departure terms in (15) is straightforward. The arrival terms in the differential equations, on the other hand, are not immediately obvious.

To illustrate the difficulty, assume there are among the N servers only a small number of idle servers. As the assignment policy describes, one of these servers will be selected by an arriving job. If the number of idle servers is small and the arrival rate is sufficiently high, rapid switches will occur in the indicator function \(\mathbb {1}\{Q^N_{00} = 0 \}\). A server that becomes idle due to a service completion will immediately be selected again by the arriving job. However, the fraction of idle servers (\(Q^N_{00}/N\)) will be more robust against these changes due to the fluid scaling.

In general, this phenomenon is called ‘separation of time scales’ as described by Hunt and Kurtz [14]. One observes the interaction of two processes. One process evolves very fast, namely the number of idle servers, while the second process evolves much slower, the occupancy fractions in this setting. In order to obtain the arrival terms of the fluid limit, we should be able to combine these processes. Focusing on the first arrival integral in (50), the question arises of how to handle the expression

$$\begin{aligned} \lim _{N\rightarrow \infty } \lambda \int \limits _{0}^{t}\left[ p_{00}^N(\mathrm {I},s) + p_{00}^N(\mathrm {II},s)\right] \,\hbox {d}s = \lim _{N\rightarrow \infty } \lambda \int \limits _{0}^{t}\mathbb {1}\{Q^N_{00}(s) > 0 \}\,\hbox {d}s. \end{aligned}$$
(51)

A similar problem is analyzed in [14], where one needs to take the limit of a integral of an indicator function. The existence of a measure \(\alpha \) is deduced such that

$$\begin{aligned} \lim _{N\rightarrow \infty } \lambda \int \limits _{0}^{t}\mathbb {1}\{Q^N_{00}(s) > 0 \}\,\hbox {d}s =\lambda \int \limits _{0}^{t}\alpha (s)\,\hbox {d}s. \end{aligned}$$
(52)

The existence of this function \(\alpha \), which does not need to be continuous, can be justified by the following reasoning: In a small time interval, say \([0,\delta t]\), the number of idle servers is a heavily fluctuating process, though the process describing the occupancy fractions is approximately constant. During this small interval, the number of idle servers can be considered as a birth-and-death process with ‘death’ rate \(\lambda \), since an arriving job causes a reduction in the number of idle servers. The ‘birth’ rate is determined by the occupancy fractions, i.e., the fraction of servers that is working on type-I or type-II jobs. Then, it is argued in [14] that

$$\begin{aligned} \frac{1}{\delta t}\int _{0}^{\delta t} \mathbb {1}\{Q^N_{00}(s) > 0 \}\,\hbox {d}s, \end{aligned}$$
(53)

after application of the ergodic theorem, converges to an invariant measure if N tends to infinity. This invariant measure will give rise to the function \(\alpha \). One already senses that the presence or absence of idle servers should be handled as two different cases. Therefore, we make a distinction between \(q_{00}\) strictly positive or equal to zero in the intuitive explanation of the structure of the fluid limit.

The case\(q_{00}>0\). When the number of idle servers is sufficiently large, each arriving job will be assigned to an idle server for sure. A fraction

$$\begin{aligned} \left( 1-q_{00} \right) ^{d} \end{aligned}$$
(54)

of the arriving jobs will be assigned as type-II jobs, which causes the changes in (15) for \(\overline{q}_{00}\), \(\overline{q}_{01}\) and \(\overline{q}_{10}\).

The case\(q_{00}=0\). Idle servers are generated at rate \(\mu _1 q_{10} + \mu _2 q_{01}\). Since d is finite, the probability that the primary selection would contain an idle server is negligible; each idle server will be provided with a type-II job when the arrival rate is high enough. If \(\tilde{\lambda } = (\lambda - \mu _1 q_{10} + \mu _2 q_{01})^+\) is strictly larger than zero, a fraction

$$\begin{aligned} \frac{\mu _1q_{10}+ \mu _2q_{01}}{\lambda } = \frac{\lambda - \tilde{\lambda }}{\lambda } \end{aligned}$$
(55)

of the stream of incoming jobs will immediately be redirected to the idle servers as a type-II job. The excess stream of incoming jobs (fraction \(\tilde{\lambda }/\lambda \)) will not observe any idle server and will start to form (type-I) queues in front of the servers of the primary selection according to a straightforward generalization of the transition probabilities mentioned in step 1.

This concludes the derivation of the fluid limit (15).

5.3.2 Proof of Theorem 4: fixed points

We will start with the proof of the closed-form fixed point and show that this is the only fixed point without idle servers on the fluid level, i.e., \(q_{00}=0\). Next, we will consider fixed points with \(q_{00}>0\).

Fixed points with\(q_{00} = 0\). The correctness of the expression in (16) can easily be confirmed by substitution into (15). The result can be established in two steps. First, we observe that the derivatives of \((\overline{q}_{i0})_i\) in (15) remain zero once \((\overline{q}_{i0}^*)_i\) equals zero. Then, we substitute \((\overline{q}_{i0}^*)_i = 0\) into the derivatives of \((\overline{q}_{i1})_i\). For \(i\ge 1\), we obtain

$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t} \overline{q}_{i1}^* = \mu _1(\overline{q}_{i+1,1}^* -\overline{q}_{i1}^*) + \tilde{\lambda } \left[ (\overline{q}_{i-1,1}^*)^{d} -(\overline{q}_{i1}^*)^{d}\right] = 0. \end{aligned}$$
(56)

These equations can be solved, and one obtains the fixed point as given in (16). Note the similarity between (56) and the fluid limit of a JSQ(d) policy with reduced arrival rate

$$\begin{aligned} \tilde{\lambda } = \lambda - \frac{\mu _1-\lambda }{\mu _1-\mu _2}\mu _2 = \lambda - \mu _2q_{01}^*, \end{aligned}$$
(57)

in a setting where each of the exchangeable servers works at rate \(\mu _1\) [18].

Second, this fixed point is unique under the condition that \(q_{00}\) equals zero. From Lemma 2 in [18], we know that the fixed point of the fluid limit in the JSQ(d) setting is unique when \(d\ge 2\). This implies that under the condition that all servers have a type-II job, i.e., \(\overline{q}_{i0}^* = 0\) for all i, uniqueness is guaranteed. Assume by contradiction that another fixed point exists without idle servers but with possibly a positive cumulative fraction \(\overline{q}_{i0}^*\) for some i. We focus on the differential equations of \((\overline{q}_{i0})_{i\ge 1}\) under this fixed point. From

$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t}\overline{q}_{10}^* = \mu _1(\overline{q}_{20}^*-\overline{q}_{10}^*)=0, \end{aligned}$$
(58)

we get that \(\overline{q}_{10}^* = \overline{q}_{20}^*\). Repeating this procedure for \(i = 2\),

$$\begin{aligned} \frac{\hbox {d}}{\hbox {d}t} \overline{q}_{20}^*= & {} \mu _1(\overline{q}_{30}^* -\overline{q}_{20}^*) + \tilde{\lambda } \left[ (\overline{q}_{10}^* -\overline{q}_{11}^*)^{d} -(\overline{q}_{20}^*-\overline{q}_{11}^*)^{d}\right] \nonumber \\= & {} \mu _1(\overline{q}_{30}^*-\overline{q}_{20}^*)=0, \end{aligned}$$
(59)

resulting in \(\overline{q}_{20}^* = \overline{q}_{30}^*\). By induction, we could show that \(\overline{q}_{i0}^* = \overline{q}_{i+1,0}^*\); for \(i\ge 1\), this leads to \(\overline{q}_{i0}^* = 0\) for \(i\ge 1\). This proves the uniqueness of the fixed point when \(q_{00}\) equals zero.

Fixed points with\(q_{00} > 0\). Under this setting, the fluid limit equations (15) simplify significantly:

$$\begin{aligned} {\left\{ \begin{array}{ll} \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{00}=\mu _2{q}_{01} - \lambda (1-q_{00})^{d}, \\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{01}=-\mu _2{q}_{01} + \lambda (1-q_{00})^{d},\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{10}= -\mu _1{q}_{10} + \lambda \left( 1-(1-q_{00})^{d}\right) ,\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{11}=-\mu _1 {q}_{11},\\ \text {for}~i\ge 2,\\ \tfrac{\hbox {d}}{\hbox {d}t}\overline{q}_{i0}=-\mu _1{q}_{i0},\\ \tfrac{\hbox {d}}{\hbox {d}t} \overline{q}_{i1}=-\mu _1{q}_{i1}. \end{array}\right. } \end{aligned}$$
(60)

For any fixed point, it should hold that \((\overline{q}_{i0}^*)_{i\ge 2} = 0\) and \((\overline{q}_{i1}^*)_{i\ge 1} = 0\). This implies that the only positive fractions are \(q_{00}\), \(q_{01}\) and \(q_{10}\). The resulting system of differential equations is given by

$$\begin{aligned} {\left\{ \begin{array}{ll} \tfrac{\hbox {d}}{\hbox {d}t} q_{00}=\mu _1q_{10} + \mu _2q_{01}-\lambda ,\\ \tfrac{\hbox {d}}{\hbox {d}t} q_{01}= -\mu _2q_{01} + \lambda (1-q_{00})^{d}, \\ \tfrac{\hbox {d}}{\hbox {d}t} q_{10}= -\mu _1q_{10} + \lambda \left( 1-(1-q_{00})^{d}\right) . \end{array}\right. } \end{aligned}$$
(61)

From the second and third equality, it is clear that once \(q_{00}^*\) is known, we know the entire fixed point:

$$\begin{aligned} \left\{ \begin{array}{l} q_{01}^* = \frac{\lambda }{\mu _2}(1-q_{00}^*)^{d},\\ q_{10}^* = \frac{\lambda }{\mu _1} \left( 1-(1-q_{00}^*)^{d}\right) . \end{array}\right. \end{aligned}$$
(62)

The system in (61) is linearly dependent. We use the fact that \(q_{00}\), \(q_{01}\) and \(q_{10}\) must sum up to one to determine \(q_{00}\). It must hold that

$$\begin{aligned} 1 = q_{00} + (1-q_{00})^{d} \left( \frac{\lambda }{\mu _2} -\frac{\lambda }{\mu _1} \right) + \frac{\lambda }{\mu _1}. \end{aligned}$$

Define \(x\doteq 1-q_{00}\). We are interested in the zero points of the polynomial f within [0, 1),  with

$$\begin{aligned} f(x) = x^{d} \left( \frac{\lambda }{\mu _2} - \frac{\lambda }{\mu _1} \right) -x + \frac{\lambda }{\mu _1}. \end{aligned}$$
(63)

We will evaluate the existence of the fixed points based on the behavior of f and its derivative,

$$\begin{aligned} f'(x) = d \left( \frac{\lambda }{\mu _2} - \frac{\lambda }{\mu _1} \right) x^{d-1} -1. \end{aligned}$$
(64)

Furthermore,

$$\begin{aligned} f(0)= & {} \frac{\lambda }{\mu _1}>0 ,\nonumber \\ f(1)= & {} \frac{\lambda }{\mu _2}-1 >0, \end{aligned}$$
(65)

and \(f'\) is monotone increasing on (0, 1) with

$$\begin{aligned} f'(0)= & {} -1<0, \nonumber \\ f'(1)= & {} d\lambda \left( \frac{1}{\mu _2}-\frac{1}{\mu _1}\right) -1. \end{aligned}$$
(66)

Since f is positive in both its endpoints and the derivative \(f'\) is monotone increasing, we need at least a vanishing derivative in (0, 1) in order to have a fixed point. This is guaranteed when \(f'(1) > 0\); this is the first condition from (17). We now know that f attains a local minimum at

$$\begin{aligned} \tilde{x} \doteq \left( \frac{1}{d} \frac{1}{\lambda \left( \frac{1}{\mu _2} -\frac{1}{\mu _1}\right) }\right) ^{\frac{1}{d-1}} \end{aligned}$$
(67)

and is strictly positive in its endpoints. If \(f(\tilde{x})\) is exactly zero, we have one fixed point, namely \(q_{00}^* =1-\tilde{x}\). But only in very special cases the second condition of (17) is satisfied with equality for an arbitrary choice of \(d_1\), \(\lambda \), \(\mu _1\) and \(\mu _2\). On the other hand, if \(f(\tilde{x})<0\), i.e., if also

$$\begin{aligned} \left( 1- \frac{1}{d}\right) \left( \frac{1}{d}\right) ^{\frac{1}{d-1}} >\frac{\lambda }{\mu _1} \left( \lambda \left( \frac{1}{\mu _2} - \frac{1}{\mu _1}\right) \right) ^{\frac{1}{d-1}} \end{aligned}$$
(68)

holds, then we have exactly two fixed points such that \(q_{00} +q_{01}+ q_{10} = 1\). There is one fixed point situated at each side of \(\tilde{x}\) in the interval (0, 1). This gives that for d large enough we can find two solutions of the reduced system of differential equations. It can be shown by contradiction that both fixed points are larger than \(\lambda /\mu _1\), so the corresponding fractions of idle servers are smaller than \(1-\lambda /\mu _1\).

For completeness, we mention that \(\lambda = \mu _2\) would imply that \(f(1) = 0\) and so the proportion of idle servers is zero, which violates the assumption that \(q_{00}>0\). Moreover, if \(\lambda < \mu _2\), then the polynomial f vanishes in the interval (0, 1). The monotone increasing property of the derivative of f leads to the fact that there exists a unique fixed point \(x^*\) in (0, 1). This results in a unique fixed point \((q_{00}^*, q_{01}^*,q_{10}^*)\) with \(q_{00}^*>0\).

This concludes the proof of Theorem 4.

5.3.3 Proof of Theorem 5: local (in)stability

We will prove local (in)stability using the indirect Lyapunov method based on the Hartman–Grobman theorem [13]. This theorem states that a system of differential equations behaves near its fixed points as its linearized version. The eigenvalues of the linearized system will define the local behavior of the system unless one of the eigenvalues has a real part equal to zero; then, the Hartman–Grobman theorem is inconclusive. If we were to immediately apply this theorem to one of the two fixed points of (61), we would obtain an eigenvalue exactly equal to zero, but one can resolve this issue since (61) is a redundant system. Since \(q_{00} + q_{01} + q_{10} = 1\), it is sufficient to know the instantaneous change of two variables. Each elimination will lead to the same two eigenvalues so we can remove, for instance, the third equation from (61):

$$\begin{aligned} {\left\{ \begin{array}{ll} \tfrac{\hbox {d}}{\hbox {d}t}q_{00} = \mu _1(1-q_{00}-q_{01})+ \mu _2q_{01} - \lambda , \\ \tfrac{\hbox {d}}{\hbox {d}t}q_{01} = -\mu _2q_{01} + \lambda (1-q_{00})^{d}. \end{array}\right. } \end{aligned}$$
(69)

Let \((q_{00}^*,q_{01}^*,q_{10}^*)\) denote a fixed point, then the matrix of the linearized system looks as follows near its fixed point:

$$\begin{aligned} \begin{bmatrix} -\mu _1&\quad \mu _2-\mu _1 \\ -\lambda d_1(1-q_{00}^*)^{d-1}&\quad -\mu _2 \end{bmatrix}. \end{aligned}$$
(70)

The corresponding eigenvalues are given by

$$\begin{aligned} \alpha _{\pm } = \frac{1}{2}\left[ -(\mu _1+\mu _2) \pm \sqrt{(\mu _1-\mu _2)^2+4\lambda d(\mu _1-\mu _2)(1-q_{00}^*)^{d-1}}\right] . \end{aligned}$$
(71)

Since \(\mu _1 > \mu _2\), the quantity under the root is always positive, so the square root is real. This implies, furthermore, that \(\alpha _- <0\). To determine the sign of \(\alpha _+\), we need to make a distinction between the two fixed points. From the proof of Theorem 4, we know that the two fixed points are on both sides of \(\tilde{x}\), with \(\tilde{x}\) as in (67). For

$$\begin{aligned} 1-q_{00}^* > \tilde{x} = \left( \frac{1}{d}\frac{\mu _1\mu _2}{\lambda (\mu _1-\mu _2)}\right) ^{\frac{1}{d-1}}, \end{aligned}$$
(72)

we have that

$$\begin{aligned} 2\alpha _+> & {} -(\mu _1+\mu _2) + \sqrt{(\mu _1-\mu _2)^2 +4\lambda d(\mu _1-\mu _2)\left( \frac{1}{d}\frac{\mu _1\mu _2}{\lambda (\mu _1-\mu _2)}\right) }\nonumber \\= & {} -(\mu _1+\mu _2) + \sqrt{(\mu _1+\mu _2)^2} \nonumber \\= & {} 0. \end{aligned}$$
(73)

This shows that the fixed point with the smallest fraction of idle servers is unstable.

When \(1-q_{00}^* < \tilde{x}\), it follows in a similar way that \(2\alpha _+ < 0\). This shows that the fixed point with the largest proportion of idle servers is locally stable. This concludes the proof of Theorem 5.

6 Conclusion and outlook

We investigated load balancing issues in a service system where particular servers are better equipped to process certain jobs due to affinity or compatibility relations. The general model in particular covers the setting with an underlying network topology \(G_N\), referred to as the neighborhood model. The analysis of the neighborhood model is severely complicated by the lack of exchangeability among the servers; a feature present in the supermarket modeling framework that allows mean field techniques. We constructed the novel R-coupling, or restructure coupling, to obtain stochastic performance bounds for the general model and more specific settings, for instance model instances where the underlying graph topology \(G_N\) has a specific minimum degree or is a d-regular graph.

Another instance of the general model, the combinatorial model, has enough inherent symmetry to conduct a fluid limit analysis. The fluid limit was stated in terms of a set of discontinuous differential equations, and its fixed point sensitively depends on the size d of the primary selection. When d is sufficiently small, a unique fixed point exists, but the associated waiting time does not vanish. When the primary selection is sufficiently large, a fixed point arises that does provide a zero waiting time. On the other hand, the above-mentioned fixed point still persists, giving rise to bistability issues.

As mentioned above, the stochastic upper bounds for the neighborhood model in terms of a supermarket model with a JSQ(d) policy require the degrees in the underlying graph to be relatively high compared to d. To some extent, this indicates that the performance may be poor in certain pathological cases even when the node degrees are fairly high. An interesting topic for further research would be to extend the R-coupling and possibly identify relevant structural conditions on the graph topology in order to sharpen these bounds.

For a fixed set of model parameters with a uniform arrival rate per server selection and the described affinity-scheduling policy, it seems plausible to expect an improvement in performance when the set of server selections \(\mathcal {S}\) grows bigger and becomes more diverse in some appropriate sense. This would imply that the performance of the combinatorial model provides a conservative estimate for the performance of, for instance, the neighborhood model with an underlying regular graph structure. Moreover, recall that a supermarket model with a JSQ(d) policy is equivalent to the combinatorial model with server selection of size d when jobs cannot be assigned as type-II jobs, which effectively occurs when \(\mu _2\) approaches zero. This suggests that, for sufficiently small \(\mu _2\), the supermarket model with a JSQ(d) policy provides stochastic lower bounds for our affinity-scheduling model with server selections of size at most d.

The bistability of the fluid limit of the combinatorial model for large values of d not only precludes any convergence statements for the stationary distribution, but also suggests that the assignment strategy could possibly be refined. In future work we intend to examine such refinements and establish that these eliminate the queueing fixed point and render the no-queueing fixed point globally stable.