
Preserving stabilization while practically bounding state space using incorruptible partially synchronized clocks


Abstract

Stabilization is a key dependability property for dealing with unanticipated transient faults, as it guarantees that even in the presence of such faults, the system will recover to states where it satisfies its specification. One of the desirable attributes of stabilization is the use of bounded space for each variable. In this paper, we present an algorithm that transforms a stabilizing program that uses variables with unbounded domains into a stabilizing program that uses bounded variables by using partially synchronized physical time. Specifically, our algorithm relies on a bounded clock drift \(\epsilon \) among processes and on message delivery that either delivers the message within time \(\delta \) or loses it. If we let \(\epsilon \) be as much as 100 s and \(\delta \) be as much as 1 h, this property is satisfied by any practical system. While non-stabilizing programs (that do not handle transient faults) can deal with unbounded variables by assigning large enough but bounded space, stabilizing programs, which need to deal with arbitrary transient faults, cannot do the same, since a transient fault may corrupt a variable to its maximum value. We show that our transformation algorithm is applicable to several problems including logical clocks, vector clocks, mutual exclusion, and diffusing computations. Moreover, our approach can also be used to bound the counters used in an earlier work by Katz and Perry for adding stabilization to a non-stabilizing program. By combining our algorithm with the work by Katz and Perry and by assuming incorruptible partially synchronized clocks, it would be possible to provide stabilization for a rich class of problems by assigning large enough but bounded space for variables.


Notes

  1. The results can also be extended to cases where physical clocks are eventually within \(\epsilon \) of each other. However, in this case, the total time for convergence would also include the time required to restore the clocks to be within \(\epsilon \) of each other. For the sake of simplicity, this issue is considered to be beyond the scope of the paper.

  2. The variable channel contains messages, and each message m in it is associated with a timestamp cl.m. There can be other details associated with a message, such as the id of the sender process and the id of the receiver process, but since the timestamp is the only information relevant to our algorithm, we refer to channel as a variable that contains message timestamps.

  3. For the origins of the constants 3 and 11, we refer the reader to the text at the beginning of Sect. 5.5.

References

  1. Alon, N., Attiya, H., Dolev, S., Dubois, S., Potop-Butucaru, M., Tixeuil, S.: Practically stabilizing SWMR atomic memory in message-passing systems. J. Comput. Syst. Sci. 81(4), 692–701 (2015). https://doi.org/10.1016/j.jcss.2014.11.014

  2. Arora, A., Gouda, M.G.: Distributed reset. IEEE Trans. Comput. 43(9), 1026–1038 (1994)

  3. Arora, A., Kulkarni, S., Demirbas, M.: Resettable vector clocks. In: Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 269–278. ACM, NY (2000). https://doi.org/10.1145/343477.343628

  4. Arora, A., Kulkarni, S.S.: Designing masking fault-tolerance via nonmasking fault-tolerance. IEEE Trans. Softw. Eng. 24(6), 435–450 (1998). https://doi.org/10.1109/32.689401

  5. Awerbuch, B., Patt-Shamir, B., Varghese, G.: Bounding the unbounded. In: Proceedings IEEE INFOCOM ’94, The Conference on Computer Communications, Thirteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Networking for Global Communications, Toronto, Ontario, Canada, June 12–16, 1994, pp. 776–783 (1994). https://doi.org/10.1109/INFCOM.1994.337661

  6. Blanchard, P., Dolev, S., Beauquier, J., Delaët, S.: Practically Self-stabilizing Paxos Replicated State-Machine, pp. 99–121. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09581-3_8

  7. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985). https://doi.org/10.1145/214451.214456

  8. Chandy, K.M., Misra, J.: Parallel Program Design—A Foundation. Addison-Wesley, Reading (1989)

  9. Dasgupta, A., Ghosh, S., Xiao, X.: Probabilistic fault-containment. In: Masuzawa, T., Tixeuil, S. (eds.) Stabilization, Safety, and Security of Distributed Systems, 9th International Symposium, 2007, Paris, France, November 14–16, 2007, Proceedings, Lecture Notes in Computer Science, vol. 4838, pp. 189–203. Springer, New York (2007). https://doi.org/10.1007/978-3-540-76627-8_16

  10. Dijkstra, E.W.: Self-stabilizing systems in spite of distributed control. Commun. ACM 17(11), 643–644 (1974)

  11. Dijkstra, E.W., Scholten, C.S.: Termination detection for diffusing computations. Inf. Process. Lett. 11(1), 1–4 (1980)

  12. Dolev, S., Georgiou, C., Marcoullis, I., Schiller, E.M.: Self-stabilizing virtual synchrony. In: Proceedings of the Stabilization, Safety, and Security of Distributed Systems—17th International Symposium, SSS 2015, Edmonton, AB, Canada, August 18–21, 2015, pp. 248–264 (2015). https://doi.org/10.1007/978-3-319-21741-3_17

  13. Fidge, C.J.: Timestamps in message-passing systems that preserve the partial ordering. In: Proceedings of the 11th Australian Computer Science Conference, vol. 10, no. 1, pp. 56–66 (1988)

  14. Fischer, M.J., Lynch, N.A., Paterson, M.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985). https://doi.org/10.1145/3149.214121

  15. Garcia-Luna-Aceves, J.J.: Loop-free routing using diffusing computations. IEEE/ACM Trans. Netw. 1(1), 130–141 (1993). https://doi.org/10.1109/90.222913

  16. Ghosh, S.: Distributed Systems: An Algorithmic Approach. Chapman & Hall, London (2014)

  17. Ghosh, S., Gupta, A., Herman, T., Pemmaraju, S.V.: Fault-containing self-stabilizing distributed protocols. Distrib. Comput. 20(1), 53–73 (2007). https://doi.org/10.1007/s00446-007-0032-2

  18. Katz, S., Perry, K.J.: Self-stabilizing extensions for message-passing systems. Distrib. Comput. 7(1), 17–26 (1993). https://doi.org/10.1007/BF02278852

  19. Kulkarni, S.S., Arora, A.: Multitolerance in distributed reset. Chicago J. Theor. Comput. Sci. (1998). http://cjtcs.cs.uchicago.edu/articles/1998/4/contents.html

  20. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7), 558–565 (1978). https://doi.org/10.1145/359545.359563

  21. Lamport, L., Lynch, N.A.: Distributed Computing: Models and Methods. MIT Press, Cambridge (1990)

  22. Lee, S., Muhammad, R.M., Kim, C.: A Leader Election Algorithm Within Candidates on Ad Hoc Mobile Networks, pp. 728–738. Springer, Berlin (2007)

  23. Mattern, F.: Virtual time and global states of distributed systems. In: Cosnard M (ed.) Parallel and Distributed Algorithms, pp. 215–226. North-Holland (1989)

  24. Valapil, V.T., Kulkarni, S.S.: Preserving stabilization while practically bounding state space. In: 13th European Dependable Computing Conference, EDCC 2017, Geneva, Switzerland, September 4–8, 2017, pp. 26–33 (2017). https://doi.org/10.1109/EDCC.2017.13

  25. Vasudevan, S., Kurose, J.F., Towsley, D.F.: Design and analysis of a leader election algorithm for mobile ad hoc networks. In: 12th IEEE International Conference on Network Protocols, Berlin, Germany, pp. 350–360. IEEE Computer Society (2004). https://doi.org/10.1109/ICNP.2004.1348124

  26. Yingchareonthawornchai, S., Kulkarni, S.S., Demirbas, M.: Analysis of bounds on hybrid vector clocks. In: OPODIS 2015, December 14–17, 2015, Rennes, France, pp. 34:1–34:17 (2015). https://doi.org/10.4230/LIPIcs.OPODIS.2015.34

Download references

Author information

Corresponding author

Correspondence to Vidhya Tekken Valapil.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is partially supported by NSF CNS 1329807, NSF CNS 1318678, and XPS 1533802. This is an extension of the previous work that appeared in the 13th European Dependable Computing Conference (EDCC), 2017.

Appendix

In this section, we present a detailed analysis of the effect on the bounds of the counters derived by our algorithm when clocks (of the processes and the global clock) differ from each other by more than one region. This section also includes a summary table of the notations used in this paper.

1.1 Proof of our claim on the effect of clocks differing by multiple regions

Our transformation algorithm was based on \(nReg=1\) in Definition 15. For a distributed system of n processes where the physical clocks of the processes are guaranteed to be synchronized within 0.1 s of each other and w.r.t. the global clock, to achieve \(nReg=1\) the designer could choose \(\mathcal {RS} = 0.1\) s.

In this section, we analyze the effect of varying nReg on the range or size of the counters. In other words, we would like to answer the question “What is the effect on the range of the counters (i.e., the bound on the counters determined by the transformation algorithm) if the clocks of processes are more than one region apart (w.r.t. each other and w.r.t. the global clock), i.e., \(nReg > 1\)?” Allowing \(nReg > 1\) could help the designer choose a smaller \(\mathcal {RS} \). For instance, in the example discussed above, if the regions identified by the physical clocks of the processes (w.r.t. each other and w.r.t. the global clock) are allowed to differ by at most 100 regions, i.e., \(nReg=100\), then the designer could choose \(\mathcal {RS} = 1\) ms.

Observe that when nReg changes, the variables \(max_{inc} \) and \(max_r\) vary accordingly. For example, although the system described above with \(nReg = 1\) and \(\mathcal {RS} = 0.1\) s can be equivalently modeled with \(nReg = 100\) and \(\mathcal {RS} = 1\) ms, the values of \(max_r\) and \(max_{inc} \) differ in the two settings. If the growth in the counters is distributed uniformly, and if \(max_{inc} =100\) in the first setting, then \(max_{inc} \) would be 1 in the second setting. Similarly, if \(max_r=1\) in the first setting, then \(max_r\) could be 100 in the second setting, i.e., with \(nReg=100\) and \(\mathcal {RS} = 1\) ms.
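
To make this rescaling concrete, the following minimal sketch (ours, not from the paper; the function name rescale is our own) reproduces the example above under the uniform-growth assumption:

```python
# Minimal sketch (not from the paper): how max_inc and max_r rescale when the
# same system is modeled with a smaller region size, assuming the growth in
# counters is distributed uniformly over time.

def rescale(max_inc, max_r, n_reg):
    """Given max_inc and max_r for nReg = 1, return the corresponding
    max_inc' and max_r' when the region size is shrunk by a factor n_reg."""
    return max_inc / n_reg, max_r * n_reg

# nReg = 1, RS = 0.1 s  versus  nReg = 100, RS = 1 ms:
print(rescale(max_inc=100, max_r=1, n_reg=100))  # -> (1.0, 100)
```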

1.1.1 Bound for multiple regions

If \(nReg = 1\), i.e., the clocks of any two processes differ from each other by at most one region and the clock of any process differs from the global clock by at most one region, we try to ensure that any free counter is in the range \([3\,r\,max_{inc} \,..\,3(r+1)max_{inc} + 2max_{inc}-1 ]\) and (derived from the range of the free counters) any dependent counter is in the range \([3(r-2-max_r)max_{inc} \,..\,3(r+1)max_{inc} + 2max_{inc}-1 ]\). The size of this range of dependent counters is \(max_{inc} (11 + 3max_r)\). Based on this size, each unbounded counter in the program is maintained in modulo B arithmetic. In other words, the values of the unbounded counters of the original program are bounded by B in the transformed program, where

$$\begin{aligned} B=3[max_{inc} (11 + 3max_r)] \end{aligned}$$
(1)
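
For concreteness, here is a small illustrative sketch of these ranges and of Eq. (1); the helper names are ours and this is not the paper's code:

```python
# Illustrative helpers (ours) for the nReg = 1 case described above.

def free_counter_range(r, max_inc):
    # [3*r*max_inc .. 3*(r+1)*max_inc + 2*max_inc - 1]
    return 3 * r * max_inc, 3 * (r + 1) * max_inc + 2 * max_inc - 1

def dependent_counter_range(r, max_inc, max_r):
    # [3*(r-2-max_r)*max_inc .. 3*(r+1)*max_inc + 2*max_inc - 1]
    return 3 * (r - 2 - max_r) * max_inc, 3 * (r + 1) * max_inc + 2 * max_inc - 1

def bound_single_region(max_inc, max_r):
    # Eq. (1): B = 3 * [max_inc * (11 + 3*max_r)]
    return 3 * max_inc * (11 + 3 * max_r)

lo, hi = dependent_counter_range(r=10, max_inc=100, max_r=1)
assert hi - lo + 1 == 100 * (11 + 3 * 1)          # size of the dependent range
print(bound_single_region(max_inc=100, max_r=1))  # -> 4200
```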

If \(nReg = 2\), i.e., the clocks of any two processes are allowed to differ from each other by at most 2 regions and the clock of any process is allowed to differ from the global clock by at most 2 regions, then we try to ensure that any free counter is in the range \([4\,r\,max_{inc} ' \,..\, 4(r+1)max_{inc} '+3max_{inc} '-1]\) and (derived from the range of the free counters) any dependent counter is in the range \([4(r-3-max_r')max_{inc} ' \,..\, 4(r+1)max_{inc} '+3max_{inc} '-1]\).

Recall that when nReg changes, the variables \(max_{inc} \) and \(max_r\) vary accordingly; their values in the new setting are denoted by \(max_{inc} '\) and \(max_r'\) in the above ranges. Now, generalizing the formulas presented above, i.e., if the clocks of any two processes are allowed to differ from each other by at most nReg regions and the clock of any process is allowed to differ from the global clock by at most nReg regions, we try to ensure that any free counter is in the range \([(nReg+2)\,r\,max_{inc} '\,..\,(nReg+2)(r+1)max_{inc} ' + (nReg+1)max_{inc} ' -1 ]\) and any dependent counter is in the range \([(nReg+2)(r-(nReg+1)-max_r')max_{inc} ' \,..\,(nReg+2)(r+1)max_{inc} ' + (nReg+1)max_{inc} ' -1 ]\). The size of this range of dependent counters is \(max_{inc} '(nReg^2 + 5(nReg+1)+max_r'(nReg+2))\). Based on this size, each unbounded counter in the program would be maintained in modulo B arithmetic, i.e., the values of the unbounded counters of the original program are bounded by B in the transformed program, where

$$\begin{aligned} B= 3[max_{inc} '(nReg^2 + 5(nReg+1)+max_r'(nReg+2))] \end{aligned}$$
(2)
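
The generalized bound can be written as a one-line helper; the following sketch (ours) also checks that Eq. (2) collapses to Eq. (1) when \(nReg = 1\):

```python
# Illustrative sketch (ours) of the generalized bound in Eq. (2).

def bound_multi_region(n_reg, max_inc_p, max_r_p):
    # B = 3 * [max_inc' * (nReg^2 + 5*(nReg + 1) + max_r'*(nReg + 2))]
    return 3 * max_inc_p * (n_reg ** 2 + 5 * (n_reg + 1) + max_r_p * (n_reg + 2))

# With nReg = 1 the general formula reduces to Eq. (1): 3*max_inc*(11 + 3*max_r).
assert bound_multi_region(1, max_inc_p=100, max_r_p=1) == 3 * 100 * (11 + 3 * 1)
```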

1.1.2 Analyzing bounds for counters when processes differ by multiple regions versus a single region

Here we analyze whether \(nReg > 1\) is beneficial over \(nReg = 1\), i.e., whether the bound (on the counters) identified when \(nReg>1\) is smaller than the bound identified when \(nReg=1\). Formally, if the bound identified with \(nReg > 1\) is denoted as new [Eq. (2) in Sect. 1.1.1] and the bound identified with \(nReg=1\) is denoted as old [Eq. (1) in Sect. 1.1.1], then we would like to check whether the following holds:

$$\begin{aligned}&old - new \ge 0\nonumber \\&\quad \left( 3[max_{inc} (11 + 3max_r)]\right) \nonumber \\&\qquad - \left( 3[max_{inc} '(nReg^2 + 5(nReg+1)+max_r'(nReg+2))]\right) \ge 0, \nonumber \\&\qquad (\textit{where } nReg > 1) \nonumber \\&\quad \left( max_{inc} (11 + 3max_r)\right) \nonumber \\&\qquad - \left( max_{inc} '(nReg^2 + 5(nReg+1)+max_r'(nReg+2))\right) \ge 0 \end{aligned}$$
(3)

If the growth in the counters is distributed uniformly across time, then when the region size becomes smaller (or larger) the bound on the growth in counters within a region becomes smaller (or larger, respectively). In other words, as nReg becomes larger, \(\mathcal {RS}\) (region size) becomes smaller and \(max_{inc} \) (the maximum growth in counters within a region) becomes smaller; and as nReg becomes smaller, \(\mathcal {RS}\) becomes larger and \(max_{inc} \) becomes larger. We apply this notion to the second half of the above inequality: under the uniform-growth assumption, \(max_{inc} ' = max_{inc} /nReg\). So the above inequality can be rewritten as,

$$\begin{aligned}&\left( max_{inc} (11 + 3max_r)\right) \nonumber \\&\quad - \left( \dfrac{max_{inc}}{nReg} \left( nReg^2 + 5(nReg+1) + max_r'(nReg+2)\right) \right) \ge 0 \end{aligned}$$
(4)

Also, recall from Sect. 5.3 that \(max_r\) stands for \(max(r_b +r_f)\). So as nReg increases, \(max_r\) increases proportionally, i.e., \(max_r' = nReg \cdot max_r\). So the above inequality becomes,

$$\begin{aligned}&\left( max_{inc} (11 + 3max_r)\right) \nonumber \\&\quad - \left( \dfrac{max_{inc}}{nReg} \left( nReg^2 + 5(nReg+1) + (nReg \cdot max_r)(nReg+2)\right) \right) \ge 0 \end{aligned}$$
(5)

Dividing the above inequality by \(max_{inc} \) (which is positive) and simplifying results in the following inequality:

$$\begin{aligned} (6-nReg-\dfrac{5}{nReg}+ max_r*(1-nReg)) \ge 0 \end{aligned}$$
(6)
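
The simplification from (5) to (6) amounts to dividing by \(max_{inc} \) and collecting terms; the following optional check (ours, assuming the sympy library is available) verifies that the left-hand side of (5) equals \(max_{inc} \) times the left-hand side of (6):

```python
# Symbolic check (ours) that inequality (5) reduces to inequality (6).
import sympy as sp

max_inc, max_r, nReg = sp.symbols('max_inc max_r nReg', positive=True)

lhs5 = max_inc * (11 + 3 * max_r) - (max_inc / nReg) * (
    nReg ** 2 + 5 * (nReg + 1) + (nReg * max_r) * (nReg + 2))
lhs6 = 6 - nReg - 5 / nReg + max_r * (1 - nReg)

# lhs5 == max_inc * lhs6, so (5) >= 0 iff (6) >= 0 (since max_inc > 0).
assert sp.simplify(lhs5 - max_inc * lhs6) == 0
```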

Here \(max_r>0\), and we will analyze the following two cases (i) \(max_r=1\), (ii) \(max_r>1\).

Substituting (i) in (6), we obtain \(1 \le nReg \le 2.5\). So, since we consider integer \(nReg > 1\), nReg has to be 2 for inequality (3) to be true. In other words, the bound on the counters when \(nReg>1\) is smaller than the bound obtained with \(nReg=1\) only for the case where \(nReg=2\). Observe that this is true only if the growth in the counters is distributed uniformly over time and if \(max_r=1\).

Substituting (ii) in (6), we obtain \(1 \le nReg \le 1.67\) (approximately, for \(max_r=2\); the range shrinks further as \(max_r\) grows). So we observe that if \(max_r>1\) then the only beneficial choice is \(nReg=1\), i.e., modeling the region size such that the clocks of any two processes (or the physical clock of any process and the global clock) differ by at most 1 region.
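
The two cases can also be checked numerically; the following short sketch (ours) evaluates the left-hand side of (6) for integer values of nReg greater than 1:

```python
# Numeric check (ours) of cases (i) and (ii): for which integer nReg > 1 does
# inequality (6) hold?

def lhs6(n_reg, max_r):
    return 6 - n_reg - 5 / n_reg + max_r * (1 - n_reg)

for max_r in (1, 2, 3):
    feasible = [n for n in range(2, 10) if lhs6(n, max_r) >= 0]
    print(max_r, feasible)
# max_r = 1 -> [2]   (only nReg = 2 improves on the nReg = 1 bound)
# max_r = 2 -> []    (no nReg > 1 is beneficial)
# max_r = 3 -> []
```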

1.2 Summary of notations

Generic variables

p : Program
\(V_p\) : Set of variables of program p
\(SV_p\) : Dynamic-sized equivalent of \(V_p\), i.e., a dynamically changing collection of only simple variables, obtained by unraveling complex variables of \(V_p\) into the simple variables contained in them
\(A_p\) : Set of actions of program p
s : State of program p
\(s_l\) : lth state in a computation of program p
guard : Condition involving variables in \(V_p\)
statement : Task involving update of a subset of variables in \(V_p\)
\(\rho \), \(\rho '\) : Computation prefixes
x : Variable in \(V_p\)
x(s) : Value of variable x in state s
fc : Free counter
\(fc(s_l)\) : Value of free counter fc in state \(s_l\)
w, a, d : Positive integers unless specified otherwise
dc : Dependent counter
S : Set of states
\(\mathcal {RS} \) : Region size
\(t_g\) : Abstract global time
\(t_j\) : Physical time at process j
\(\lfloor \frac{t_g}{\mathcal {RS}} \rfloor \) : Abstract global region
\(\lfloor \frac{t_j}{\mathcal {RS}} \rfloor \) : Region of process j
r : Region
\(r_b\), \(r_f\) : Used to characterize the life of a dependent counter in terms of regions
\(max_r\) : Maximum of \((r_b+r_f)\) of any dependent counter
\(max_{inc}\) : Maximum increase in any free counter within a global region
\(p'\) : Program obtained by applying our transformation algorithm to program p
B : \(3[max_{inc} (11 + 3max_r)]\), or 3 times the range of any dependent counter

Variables in Katz and Perry example

x, y : Round numbers
nr : Next round
cr : Current round
lr : Round number when the last real reset was performed
b : Boolean variable that identifies if the reset was real or fake

Variables in Lamport's logical clocks example

j, k : Processes
cl.j : Logical clock value of process j
m : Message
cl.m : Message timestamp, or logical clock value associated with message m
\(channel_{j,k}\) : Complex variable that contains the timestamps of messages in transit between process j and process k
v : Number of regions within which a message is guaranteed to be delivered at the receiver process

Variables in vector clocks example

vc.j : Vector clock maintained at process j
vc.j.k : Highest clock or counter value of process k that process j is aware of

Cite this article

Tekken Valapil, V., Kulkarni, S.S. Preserving stabilization while practically bounding state space using incorruptible partially synchronized clocks. Distrib. Comput. 33, 423–443 (2020). https://doi.org/10.1007/s00446-019-00365-z
