Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems

Bhat, Anand; Samii, Soheil; Rajkumar, Ragunathan

doi:10.1007/s11241-019-09339-7

Published: 06 September 2019

Real-Time Systems, Volume 55, pages 889–924 (2019)

Abstract

Due to the advent of active safety features and automated driving capabilities, the complexity of embedded computing systems within automobiles continues to increase. Such advanced driver assistance systems (ADAS) are inherently safety-critical and must tolerate failures in any subsystem. However, fault-tolerance in safety-critical systems has been traditionally supported by hardware replication, which is prohibitively expensive in terms of cost, weight, and size for the automotive market. Recent work has studied the use of software-based fault-tolerance techniques that utilize task-level hot and cold standbys to tolerate fail-stop processor and task failures. The benefit of using standbys is maximal when a task and any of its standbys obey the placement constraint of not being co-located on the same processor. We propose a new heuristic based on a “tiered” placement constraint, and show that our heuristic produces a better task assignment that saves at least one processor up to 40% of the time relative to the best known heuristic to date. We then introduce a task allocation algorithm that, for the first time to our knowledge, leverages the run-time attributes of cold standbys. Our empirical study finds that our heuristic uses no more than one additional processor in most cases relative to an optimal allocation that we construct for evaluation purposes using a creative technique. We also extend our heuristic to support mixed-criticality systems which allow for overload operation. We have designed and implemented our software fault-tolerance framework in AUTOSAR, an automotive industry standard. We use this implementation to provide an experimental evaluation of our task-level fault-tolerance features. Finally, we present an analysis of the worst-case behavior of our task recovery features.

Notes

SAE J3016: Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.
“Autosar.” http://www.autosar.org.
“IEEE802.1cb-frame replication and elimination for reliability.” http://www.ieee802.org/1/pages/802.1cb.html.
“Autosar.” http://www.autosar.org.
This is feasible with earliest deadline first (EDF) and rate-monotonic scheduling (RMS) with harmonic task sets.
Since creating an optimal allocation given an arbitrary taskset is NP-Hard to compute, we instead explicitly create a perfect solution that by definition represents an optimal allocation.
“Arccore.” http://www.arccore.com.

References

Avizienis A et al (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing
Balasubramanian J et al (2010) Middleware for resource-aware deployment and configuration of fault-tolerant real-time systems. In: RTAS ’10, pp 69–78
Bhat A, Aoki S, Rajkumar R (2018) Tools and methodologies for autonomous driving systems. In: Proceedings of the IEEE vol 106, pp 1700–1716
Bhat A, Samii S, Rajkumar RR (2018) Recovery time considerations in real-time systems employing software fault tolerance. In: 30th Euromicro Conference on Real-Time Systems (ECRTS 2018) (S. Altmeyer, ed.), vol. 106 of Leibniz International Proceedings in Informatics (LIPIcs), (Dagstuhl, Germany). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, pp 23:1–23:22
Bouyssounouse B, Sifakis J (2005) Tools for verification and validation. Springer, Berlin, pp 72–84
Chen J et al (2007) Real-time task replication for fault tolerance in identical multiprocessor systems. In: Proceedings of the 13th IEEE real time and embedded technology and applications symposium, RTAS ’07, pp 249–258
Cristian F (1991) Reaching agreement on processor-group membership in synchronous distributed systems. Distrib Comput 4(4):175–187
Davis RI, Burns A, Bril RJ, Lukkien JJ (2007) Controller area network (CAN) schedulability analysis: refuted, revisited and revised. Real-Time Syst 35:239–272
Felber PNP (2004) Experiences, strategies, and challenges in building fault-tolerant CORBA systems. IEEE Trans Comput. 53(5):497–511
Gopalakrishnan S, Caccamo M (2006) Task partitioning with replication upon heterogeneous multiprocessor systems. RTAS 06:199–207
Huang H, Gill C, Lu C (2012) Implementation and evaluation of mixed-criticality scheduling approaches for periodic tasks. In: 2012 IEEE 18th Real Time and Embedded Technology and Applications Symposium, pp 23–32
Johnson D (1973) Near optimal allocation algorithms. Ph.D. Dissertation, MIT, MA
Kim J et al (2010) R-BATCH: task partitioning for fault-tolerant multiprocessor real-time systems. In: CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010, pp 1872–1879
Kim J et al (2012) SAFER: system-level architecture for failure evasion in real-time applications. In: IEEE 33rd real-time systems symposium (RTSS), 2012
Klobedanz K et al (2013) Embedded systems: design, analysis and verification. In: Proceedings of the 4th IFIP TC 10, IESS 2013, Paderborn, Germany, June 17-19, 2013. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 238–249
Lakshmanan K, De Niz D, Rajkumar RR, Moreno G (2013) Overload provisioning in mixed-criticality cyber-physical systems. ACM Trans Embed Comput Syst 11:83:1–83:24
Lakshmanan K, Niz DD, Rajkumar R, Moreno G (2010) Resource allocation in distributed mixed-criticality cyber-physical systems. In: 2010 IEEE 30th International Conference on Distributed Computing Systems, pp 169–178
Leu K et al (2012) Generic reliability analysis for safety-critical FlexRay drive-by-wire systems. In: Connected Vehicles and Expo (ICCVE), 2012
Narasimhan P et al (2005) MEAD: support for real-time fault-tolerant CORBA. Concurr Comp-Pract E 17(12):1527–1545
Niz D, Lakshmanan K, Rajkumar R (2009) On the scheduling of mixed-criticality real-time task sets. In: 2009 30th IEEE Real-Time Systems Symposium, pp 291–300
Oh D, Baker T (1998) Utilization bounds for n-processor rate monotonic scheduling with static processor assignment. Real-Time Syst 15:183–192
Phillips M, Narayanan V, Aine S, Likhachev M (2015) Efficient search with an ensemble of heuristics. In: Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15. AAAI Press, pp 784–791
Pinello C et al (2008) Fault-tolerant distributed deployment of embedded control software. In: IEEE transactions on computer-aided design of integrated circuits and systems vol 27, pp 906–919
Pop T, Pop P, Eles P, Peng Z, Andrei A (2006) Timing analysis of the FlexRay communication protocol. In: 18th Euromicro conference on real-time systems (ECRTS’06), pp 11–216
Rajkumar R, Gagliardi M (1996) High availability in the real-time publisher/subscriber inter-process communication model. In: 17th IEEE Real-Time Systems Symposium, pp 136–141
Ramamritham K (1995) Allocation and scheduling of precedence-related periodic tasks. IEEE Trans Parallel Distrib Syst 6:412–420
Samii S (2015) Ethernet TSN as enabling technology for ADAS and automated driving systems. In: IEEE-SA Ethernet and IP at Automotive Technology Day, Oct 2015
Zhu P, Yang F, Tu G (2010) Fault-tolerant rate-monotonic compact-factor-driven scheduling in hard-real-time systems. Wuhan Univ J Nat Sci 15(3):217–221

Author information

Authors and Affiliations

Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Anand Bhat
General Motors R&D, Warren, MI, 48092, USA
Soheil Samii
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Ragunathan Rajkumar

Corresponding author

Correspondence to Anand Bhat.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

The BFD-P and R-BFD Heuristics

In this section, we provide a brief overview of the BFD-P and R-BFD heuristics (Kim et al. 2010).

The BFD-P algorithm proceeds as follows:

(1) Sort the tasks, including replicas, in decreasing order of utilization.
(2) Fit each task into the best-fit processor that obeys the placement constraint, i.e., no task may be co-located with its replica.
(3) Add a new processor if a task does not fit any bin.
(4) Iterate until no tasks remain.
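
The BFD-P steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Task` and `Processor` classes, the shared `group` identifier used to express the placement constraint, and the per-processor capacity of 1.0 are assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    name: str
    group: str    # assumed encoding: a task and its replicas share one group id
    util: float   # utilization of this task instance

@dataclass
class Processor:
    capacity: float = 1.0                      # assumed utilization bound
    tasks: list = field(default_factory=list)

    @property
    def load(self):
        return sum(t.util for t in self.tasks)

    def can_host(self, task):
        fits = self.load + task.util <= self.capacity + 1e-9
        no_sibling = all(t.group != task.group for t in self.tasks)
        return fits and no_sibling

def bfd_p(tasks, capacity=1.0):
    procs = []
    # Step 1: decreasing order of utilization, replicas included.
    for task in sorted(tasks, key=lambda t: t.util, reverse=True):
        # Step 2: best fit = the fullest processor that still satisfies
        # both the capacity bound and the placement constraint.
        candidates = [p for p in procs if p.can_host(task)]
        if candidates:
            best = max(candidates, key=lambda p: p.load)
        else:
            # Step 3: no existing bin fits, so open a new processor.
            best = Processor(capacity)
            procs.append(best)
        best.tasks.append(task)
    return procs                                # Step 4: all tasks placed

tasks = [
    Task("A", "A", 0.6), Task("A_r", "A", 0.6),
    Task("B", "B", 0.4), Task("B_r", "B", 0.4),
]
procs = bfd_p(tasks)
placement = [[t.name for t in p.tasks] for p in procs]
```

In this small example the two copies of each task end up on different processors, and both processors are filled to the assumed capacity.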

The R-BFD algorithm proceeds as follows:

(1) Sort the given tasks in decreasing order of utilization.
(2) Extract the primary tasks and allocate them first using the BFD-P heuristic.
(3) Allocate the replicas one by one, highest-order replicas first, i.e., opposite to the TPCD approach.
(4) Add a new processor if a task does not fit any bin.
(5) Iterate until no tasks remain.
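
A hedged Python sketch of the R-BFD steps follows. The `order` field (0 for a primary, k for the k-th replica), the reading of "highest order first" as descending `order`, the tie-break by decreasing utilization, and the capacity of 1.0 are assumptions made for illustration; the original description does not fix these details.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    group: str      # assumed: a primary and its replicas share one group id
    util: float
    order: int = 0  # assumed encoding: 0 = primary, k = k-th replica

def best_fit(procs, task, capacity):
    """Place task on the fullest processor that fits it and hosts no sibling."""
    def load(p):
        return sum(t.util for t in p)

    def ok(p):
        return (load(p) + task.util <= capacity + 1e-9
                and all(t.group != task.group for t in p))

    candidates = [p for p in procs if ok(p)]
    if candidates:
        target = max(candidates, key=load)
    else:
        target = []            # Step 4: open a new processor
        procs.append(target)
    target.append(task)

def r_bfd(tasks, capacity=1.0):
    procs = []
    # Step 1: decreasing order of utilization.
    by_util = sorted(tasks, key=lambda t: t.util, reverse=True)
    # Step 2: primaries first, using the best-fit placement of BFD-P.
    for t in by_util:
        if t.order == 0:
            best_fit(procs, t, capacity)
    # Step 3: replicas next, highest order first. The sort is stable, so
    # replicas of equal order keep their decreasing-utilization order.
    replicas = [t for t in by_util if t.order > 0]
    for t in sorted(replicas, key=lambda t: t.order, reverse=True):
        best_fit(procs, t, capacity)
    return procs               # Step 5: all tasks placed

tasks = [
    Task("A", "A", 0.5), Task("A1", "A", 0.5, 1), Task("A2", "A", 0.5, 2),
    Task("B", "B", 0.3), Task("B1", "B", 0.3, 1),
]
procs = r_bfd(tasks)
placement = [[t.name for t in p] for p in procs]
```

Here the primaries share the first processor, and each replica of A forces a new processor because of the placement constraint.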

1.1 TPCD primary redistribution

As highlighted in Sect. 4.1, the TPCD algorithm can produce an allocation in which one processor becomes dominant and runs only primaries. The allocation produced by the example in Sect. 4.1 illustrates this. Figure 22 shows that TPCD allocation with backup types annotated: the last processor runs all primaries and becomes the dominant processor. This is not ideal in a safety-critical system. This type of allocation arises because TPCD allocates tasks in tiers. The same allocation scheme also allows backups to be grouped together. If the backups of some of the primaries on the dominant processor run identical copies of the primary, their positions can be swapped with those of the primaries to produce a more balanced distribution of primaries and standbys, as shown in Fig. 23.
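
The swap described above can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the `Instance` record, the use of equal utilization plus the `hot` role as a stand-in for "identical copy", and the balance target of at most `ceil(primaries / processors)` primaries per processor. Because only identical copies are exchanged, the sketch swaps roles in place, which leaves every processor's utilization unchanged.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Instance:
    group: str   # a primary and its standbys share one group id (assumed)
    role: str    # "primary", "hot", or "cold"
    util: float

def primaries_on(proc):
    return sum(i.role == "primary" for i in proc)

def redistribute_primaries(alloc):
    """alloc is a list of processors, each a list of Instance objects."""
    total = sum(primaries_on(p) for p in alloc)
    target = ceil(total / len(alloc))          # assumed balance goal
    for proc in alloc:
        for inst in proc:
            if primaries_on(proc) <= target:
                break                          # this processor is balanced
            if inst.role != "primary":
                continue
            # Look for an identical hot standby on a processor that can
            # still take on another primary.
            for other in alloc:
                if other is proc or primaries_on(other) >= target:
                    continue
                twin = next((j for j in other
                             if j.group == inst.group and j.role == "hot"
                             and j.util == inst.util), None)
                if twin is not None:
                    inst.role, twin.role = "hot", "primary"
                    break
    return alloc

alloc = [
    [Instance("A", "hot", 0.3), Instance("B", "cold", 0.1)],
    [Instance("C", "hot", 0.2), Instance("D", "hot", 0.2)],
    [Instance("A", "primary", 0.3), Instance("B", "primary", 0.3),
     Instance("C", "primary", 0.2), Instance("D", "primary", 0.2)],
]
loads_before = [sum(i.util for i in p) for p in alloc]
redistribute_primaries(alloc)
loads_after = [sum(i.util for i in p) for p in alloc]
primaries_per_proc = [primaries_on(p) for p in alloc]
```

In this example the dominant third processor gives up the primary role for A and C, while B stays put because its only backup is a cold standby.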

1.2 Group maintenance

In this section, we describe some of the messages that are passed between the primary and its backups to maintain group membership. The primary acts as the leader of the group and has information about all its members. It shares this information with its group members so that, in the case of a primary failure, the remaining group members and the new leader can continue to maintain the group.

Since the task-to-node assignment is done beforehand, each task is pre-defined to run as a primary, a hot standby, or a cold standby. On startup, all nodes enter a group formation phase. In this phase, a primary periodically broadcasts a \(Group_{create}\) message along with its heartbeat. It then waits for a pre-configured interval \(T_{timeout_{pri}}\) (typically a multiple of its period \(T_{pri}\)) to receive any \(Group_{created}\) responses, which would indicate that a primary already exists. If no such response arrives in this interval, the primary leaves the startup phase and enters normal operation, where it starts producing application outputs, listening for heartbeats from its standbys, and producing heartbeats, state, and group information (i.e., the normal life cycle outlined in Sect. 3.2).

On startup, each standby periodically produces a \(Group_{join}\) message. On receiving the \(Group_{create}\) or \(Group_{created}\) message from the primary, it transitions out of the startup phase, starts its normal operation, and begins producing heartbeat messages. If a standby does not receive a \(Group_{create}\) message within \(T_{timeout_{sb}}\) time units, it declares a primary failure and transitions to the normal mode of operation as the new primary. Figure 17a represents the final stable state of the system after the groups are created and also shows the messages exchanged within the group during normal operation.

If a primary fails and is later restarted, it will broadcast a \(Group_{create}\) message along with its heartbeat. This time, the standby will have taken over as the new primary and it will respond to the \(Group_{create}\) message with a \(Group_{created}\) message indicating to the re-launched primary that it should run as a standby. Figure 17c represents the final stable state of the system after such a dynamic reconfiguration.
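
The group formation and rejoin behaviour described in this section can be sketched as a small tick-driven state machine in Python. This is an illustrative model only: the `Member` class, the `run` broadcast loop, and the integer tick timeouts are assumptions, heartbeats and state transfer are omitted, and the choice to have an active primary answer \(Group_{join}\) with \(Group_{created}\) is inferred from the text rather than stated in it.

```python
from dataclasses import dataclass

CREATE, CREATED, JOIN = "Group_create", "Group_created", "Group_join"

@dataclass
class Member:
    name: str
    configured_role: str    # "primary" or "standby", fixed at design time
    timeout: int            # T_timeout_pri or T_timeout_sb, in ticks (assumed unit)
    state: str = "startup"  # "startup", "primary", or "standby"
    started_at: int = 0

    def tick(self, now):
        """Return the message broadcast in this period, or None."""
        if self.state != "startup":
            return None     # normal operation; heartbeats are not modelled
        if now - self.started_at >= self.timeout:
            # A configured primary saw no Group_created, or a standby saw no
            # Group_create: either way this member now acts as the primary.
            self.state = "primary"
            return None
        return CREATE if self.configured_role == "primary" else JOIN

    def receive(self, msg):
        """Handle a broadcast from another member; return a reply or None."""
        if self.state == "primary" and msg in (CREATE, JOIN):
            return CREATED  # tell the sender that a primary already exists
        if self.state == "startup":
            if self.configured_role == "primary" and msg == CREATED:
                self.state = "standby"
            elif self.configured_role == "standby" and msg in (CREATE, CREATED):
                self.state = "standby"
        return None

def run(members, ticks, start=0):
    for now in range(start, start + ticks):
        for sender in members:
            msg = sender.tick(now)
            if msg is None:
                continue
            for other in members:
                if other is sender:
                    continue
                reply = other.receive(msg)
                if reply is not None:
                    sender.receive(reply)
    return {m.name: m.state for m in members}

# Cold start: the primary hears no Group_created and takes the lead.
cold_start = run([Member("P", "primary", 3), Member("S", "standby", 5)], ticks=4)

# The primary is absent, the standby times out and takes over, and the
# relaunched primary is then told to run as a standby.
s = Member("S", "standby", 5)
run([s], ticks=6)
p = Member("P", "primary", 3, started_at=6)
after_restart = run([s, p], ticks=4, start=6)
```

The second scenario mirrors the dynamic reconfiguration case: the relaunched primary receives \(Group_{created}\) in response to its first \(Group_{create}\) and settles as a standby.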

About this article

Cite this article

Bhat, A., Samii, S. & Rajkumar, R. Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems. Real-Time Syst 55, 889–924 (2019). https://doi.org/10.1007/s11241-019-09339-7

Issue Date: October 2019
