Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems

Real-Time Systems

Abstract

Due to the advent of active safety features and automated driving capabilities, the complexity of embedded computing systems within automobiles continues to increase. Such advanced driver assistance systems (ADAS) are inherently safety-critical and must tolerate failures in any subsystem. However, fault-tolerance in safety-critical systems has been traditionally supported by hardware replication, which is prohibitively expensive in terms of cost, weight, and size for the automotive market. Recent work has studied the use of software-based fault-tolerance techniques that utilize task-level hot and cold standbys to tolerate fail-stop processor and task failures. The benefit of using standbys is maximal when a task and any of its standbys obey the placement constraint of not being co-located on the same processor. We propose a new heuristic based on a “tiered” placement constraint, and show that our heuristic produces a better task assignment that saves at least one processor up to 40% of the time relative to the best known heuristic to date. We then introduce a task allocation algorithm that, for the first time to our knowledge, leverages the run-time attributes of cold standbys. Our empirical study finds that our heuristic uses no more than one additional processor in most cases relative to an optimal allocation that we construct for evaluation purposes using a creative technique. We also extend our heuristic to support mixed-criticality systems which allow for overload operation. We have designed and implemented our software fault-tolerance framework in AUTOSAR, an automotive industry standard. We use this implementation to provide an experimental evaluation of our task-level fault-tolerance features. Finally, we present an analysis of the worst-case behavior of our task recovery features.

Notes

  1. SAE J3016: Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.

  2. “Autosar.” http://www.autosar.org.

  3. “IEEE802.1cb-frame replication and elimination for reliability.” http://www.ieee802.org/1/pages/802.1cb.html.

  4. “Autosar.” http://www.autosar.org.

  5. This is feasible with earliest deadline first (EDF) and rate-monotonic scheduling (RMS) with harmonic task sets.

  6. Since creating an optimal allocation given an arbitrary taskset is NP-Hard to compute, we instead explicitly create a perfect solution that by definition represents an optimal allocation.

  7. “Arccore.” http://www.arccore.com.

References

  • Avizienis A et al (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing

  • Balasubramanian J et al. (2010) Middleware for resource-aware deployment and configuration of fault-tolerant real-time systems. In: RTAS ’10, pp 69–78

  • Bhat A, Aoki S, Rajkumar R (2018) Tools and methodologies for autonomous driving systems. In: Proceedings of the IEEE vol 106, pp 1700–1716

  • Bhat A, Samii S, Rajkumar RR (2018) Recovery time considerations in real-time systems employing software fault tolerance. In: 30th Euromicro Conference on Real-Time Systems (ECRTS 2018) (S. Altmeyer, ed.), vol. 106 of Leibniz International Proceedings in Informatics (LIPIcs), (Dagstuhl, Germany). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, pp 23:1–23:22

  • Bouyssounouse B, Sifakis J (2005) Tools for verification and validation. Springer, Berlin, pp 72–84

  • Chen J et al (2007) Real-time task replication for fault tolerance in identical multiprocessor systems. In: Proceedings of the 13th IEEE real time and embedded technology and applications symposium, RTAS ’07, pp 249–258

  • Cristian F (1991) Reaching agreement on processor-group membership in synchronous distributed systems. Distrib Comput 4(4):175–187

  • Davis RI, Burns A, Bril RJ, Lukkien JJ (2007) Controller area network (CAN) schedulability analysis: refuted, revisited and revised. Real-Time Syst 35:239–272

  • Felber PNP (2004) Experiences, strategies, and challenges in building fault-tolerant CORBA systems. IEEE Trans Comput. 53(5):497–511

  • Gopalakrishnan S, Caccamo M (2006) Task partitioning with replication upon heterogeneous multiprocessor systems. In: RTAS ’06, pp 199–207

  • Huang H, Gill C, Lu C (2012) Implementation and evaluation of mixed-criticality scheduling approaches for periodic tasks. In: 2012 IEEE 18th Real Time and Embedded Technology and Applications Symposium, pp 23–32

  • Johnson D (1973) Near optimal allocation algorithms. Ph.D. Dissertation, MIT, MA

  • Kim J et al (2010) R-BATCH: task partitioning for fault-tolerant multiprocessor real-time systems. In: CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010, pp 1872–1879

  • Kim J et al (2012) Safer: system-level architecture for failure evasion in real-time applications. In: IEEE 33rd real-time systems symposium (RTSS), 2012

  • Klobedanz K et al (2013) Embedded systems: design, analysis and verification. In: Proceedings of the 4th IFIP TC 10, IESS 2013, Paderborn, Germany, June 17-19, 2013. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 238–249

  • Lakshmanan K, De Niz D, Rajkumar RR, Moreno G (2013) Overload provisioning in mixed-criticality cyber-physical systems. ACM Trans Embed Comput Syst 11:83:1–83:24

  • Lakshmanan K, Niz DD, Rajkumar R, Moreno G (2010) Resource allocation in distributed mixed-criticality cyber-physical systems. In: 2010 IEEE 30th International Conference on Distributed Computing Systems, pp 169–178

  • Leu K et al (2012) Generic reliability analysis for safety-critical flexray drive-by-wire systems. In: Connected Vehicles and Expo (ICCVE), 2012

  • Narasimhan P et al (2005) MEAD: support for real-time fault-tolerant CORBA. Concurr Comp-Pract E 17(12):1527–1545

  • Niz D, Lakshmanan K, Rajkumar R (2009) On the scheduling of mixed-criticality real-time task sets. In: 2009 30th IEEE Real-Time Systems Symposium, pp 291–300

  • Oh D, Baker T (1998) Utilization bounds for n-processor rate monotonic scheduling with static processor assignment. Real-Time Syst 15:183–192

  • Phillips M, Narayanan V, Aine S, Likhachev M (2015) Efficient search with an ensemble of heuristics. In: Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15. AAAI Press, pp 784–791

  • Pinello C et al (2008) Fault-tolerant distributed deployment of embedded control software. In: IEEE transactions on computer-aided design of integrated circuits and systems vol 27, pp 906–919

  • Pop T, Pop P, Eles P, Peng Z, Andrei A (2006) Timing analysis of the flexray communication protocol. In: 18th Euromicro conference on real-time systems (ECRTS’06), pp 11–216

  • Rajkumar R, Gagliardi M (1996) High availability in the real-time publisher/subscriber inter-process communication model. In: 17th IEEE Real-Time Systems Symposium, pp 136–141

  • Ramamritham K (1995) Allocation and scheduling of precedence-related periodic tasks. IEEE Trans Parallel Distrib Syst 6:412–420

  • Samii S (2015) Ethernet TSN as enabling technology for ADAS and automated driving systems. In: IEEE-SA Ethernet and IP at Automotive Technology Day, Oct 2015

  • Zhu P, Yang F, Tu G (2010) Fault-tolerant rate-monotonic compact-factor-driven scheduling in hard-real-time systems. Wuhan Univ J Nat Sci 15(3):217–221

Author information

Corresponding author

Correspondence to Anand Bhat.

Appendix

The BFD-P and R-BFD Heuristics

In this section, we provide a brief overview of the BFD-P and R-BFD heuristics (Kim et al. 2010).

The BFD-P algorithm proceeds as follows:

  1. Sort all tasks, including replicas, in decreasing order of utilization.

  2. Fit each task into the best-fit processor that obeys the placement constraint, i.e., no task may be co-located with its replica.

  3. Add a new processor if a task does not fit on any existing processor.

  4. Iterate until no tasks remain.
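
The BFD-P steps above can be illustrated with a minimal Python sketch. It is not the implementation of Kim et al. (2010): the `Task` and `Processor` types, the `bfd_p` name, and the fit test (total utilization at most a capacity of 1.0 per processor) are assumptions made here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    group: str          # a task and its replicas share one group name (assumed convention)
    copy: int           # 0 = primary, 1.. = replicas (assumed convention)
    utilization: float


@dataclass
class Processor:
    capacity: float = 1.0   # assumed utilization bound per processor
    tasks: list = field(default_factory=list)

    @property
    def load(self):
        return sum(t.utilization for t in self.tasks)

    def can_host(self, task):
        # The task must fit and must not join a copy of the same group.
        fits = self.load + task.utilization <= self.capacity + 1e-9
        no_sibling = all(t.group != task.group for t in self.tasks)
        return fits and no_sibling


def bfd_p(tasks, capacity=1.0):
    """Best-fit decreasing with the placement constraint (illustrative sketch)."""
    processors = []
    # Step 1: decreasing order of utilization, replicas included.
    for task in sorted(tasks, key=lambda t: t.utilization, reverse=True):
        candidates = [p for p in processors if p.can_host(task)]
        if candidates:
            # Step 2: best fit = feasible processor with the least spare capacity.
            target = max(candidates, key=lambda p: p.load)
        else:
            # Step 3: open a new processor when nothing fits.
            target = Processor(capacity)
            processors.append(target)
        target.tasks.append(task)
    # Step 4: the loop ends when no tasks remain.
    return processors
```

For example, two tasks with utilizations 0.6 and 0.4, each with one replica, are packed onto two processors by this sketch, with no processor hosting both copies of the same task.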

The R-BFD algorithm proceeds as follows:

  1. Sort the given tasks in decreasing order of utilization.

  2. Extract the primary tasks and allocate them first using the BFD-P heuristic.

  3. Allocate the replicas one by one, highest-order replicas first, i.e., opposite to the TPCD approach.

  4. Add a new processor if a task does not fit on any existing processor.

  5. Iterate until no tasks remain.
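
A minimal Python sketch of the R-BFD steps follows. It is an illustration under stated assumptions rather than the original algorithm's code: tasks are modeled as hypothetical `(group, order, utilization)` tuples with order 0 for the primary, "highest order replicas first" is read as descending replica index, and a processor is assumed to accept tasks up to a total utilization of 1.0.

```python
def best_fit(processors, task, capacity=1.0):
    """Place one (group, order, utilization) task on the fullest feasible processor."""
    group, _, util = task
    feasible = [
        p for p in processors
        if sum(t[2] for t in p) + util <= capacity + 1e-9   # task fits
        and all(t[0] != group for t in p)                   # placement constraint
    ]
    if feasible:
        target = max(feasible, key=lambda p: sum(t[2] for t in p))
    else:
        target = []                 # open a new processor
        processors.append(target)
    target.append(task)


def r_bfd(tasks, capacity=1.0):
    """Allocate primaries first, then replicas (illustrative sketch)."""
    # Step 1: decreasing order of utilization.
    ordered = sorted(tasks, key=lambda t: t[2], reverse=True)
    primaries = [t for t in ordered if t[1] == 0]
    replicas = [t for t in ordered if t[1] > 0]

    processors = []
    # Step 2: primaries are placed first with best-fit decreasing.
    for task in primaries:
        best_fit(processors, task, capacity)
    # Step 3: replicas, highest order first (assumed reading); the stable sort
    # keeps decreasing utilization within each order.
    for task in sorted(replicas, key=lambda t: t[1], reverse=True):
        best_fit(processors, task, capacity)
    return processors
```

Each processor is represented as a plain list of task tuples, which keeps the sketch short; a real allocator would replace the utilization test with the schedulability test of the chosen scheduler.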

1.1 TPCD primary redistribution

As highlighted in Sect. 4.1, the TPCD algorithm can produce an allocation in which one processor becomes dominant and runs only primaries. The allocation produced by the example in Sect. 4.1 shows this. Figure 22 presents that TPCD allocation with backup types annotated. As can be seen, the last processor runs all primaries and becomes the dominant processor. This is not ideal in a safety-critical system. Such an allocation arises because TPCD allocates tasks in tiers. The same tiered scheme also allows backups to be grouped together. If the backups of some of the primaries on the dominant processor run identical copies of the primary, they can swap positions with those primaries to produce a more balanced distribution of primaries and standbys. This is highlighted in Fig. 23.

Fig. 22 TPCD solution to 2a with backup types highlighted (Color figure online)

Fig. 23 TPCD solution to 2a with primary redistribution (P primary, B backup; color online)
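
The role swap described in this section can be sketched in Python as follows. The text does not specify how swaps are selected, so the greedy rule used here (move a primary off the processor with the most primaries only when the backup's host would still hold fewer primaries afterwards) is an assumption, as are the `redistribute_primaries` name and the dictionary layout. The sketch also assumes that a backup running an identical copy has the same utilization as its primary, so a swap leaves processor loads unchanged.

```python
def redistribute_primaries(allocation, identical):
    """Swap primary/backup roles to spread primaries (illustrative sketch).

    allocation: {processor: {task_group: role}}, where role is "P" or "B".
    identical:  set of task groups whose backups run the same copy as the primary.
    """
    def primaries(proc):
        return sum(1 for role in allocation[proc].values() if role == "P")

    changed = True
    while changed:
        changed = False
        donor = max(allocation, key=primaries)   # current dominant processor
        for group, role in list(allocation[donor].items()):
            if role != "P" or group not in identical:
                continue
            # Processors that host a backup of this primary.
            hosts = [p for p in allocation
                     if p != donor and allocation[p].get(group) == "B"]
            if not hosts:
                continue
            target = min(hosts, key=primaries)
            # Swap only if the result is strictly more balanced.
            if primaries(target) + 1 < primaries(donor):
                allocation[donor][group] = "B"
                allocation[target][group] = "P"
                changed = True
                break
    return allocation
```

Every swap strictly reduces the imbalance between the two processors involved, so the loop terminates; each task group keeps exactly one primary throughout.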

1.2 Group maintenance

In this section, we describe some of the messages that are passed between the primary and its backups to maintain group membership. The primary acts as the leader of the group and has information about all its members. It shares this information with its group members so that, in the case of a primary failure, the remaining group members and the new leader can continue to maintain the group.

Since the task-to-node assignment is done beforehand, each task is pre-defined to run as a primary, a hot standby, or a cold standby. On startup, all nodes enter a group formation phase. In this phase, a primary periodically broadcasts a \(Group_{create}\) message along with its heartbeat. It then waits for a pre-configured interval \(T_{timeout_{pri}}\) (typically a multiple of its period \(T_{pri}\)) to receive any \(Group_{created}\) responses, which would indicate that a primary already exists. If no such response arrives in this interval, the primary leaves the startup phase and enters normal operation, in which it produces application outputs, listens for heartbeats from its standbys, and produces heartbeats, state, and group information (i.e., the normal life cycle outlined in Sect. 3.2).

On startup, each standby periodically produces a \(Group_{join}\) message. On receiving a \(Group_{create}\) or \(Group_{created}\) message from the primary, it transitions out of the startup phase, starts normal operation, and begins producing heartbeat messages. If a standby does not receive a \(Group_{create}\) message for \(T_{timeout_{sb}}\) time units, it declares a primary failure and transitions to the normal mode of operation as the new primary. Figure 17a represents the final stable state of the system after the groups are created and also shows the messages exchanged within the group during normal operation.

If a primary fails and is later restarted, it will broadcast a \(Group_{create}\) message along with its heartbeat. This time, the standby will have taken over as the new primary and it will respond to the \(Group_{create}\) message with a \(Group_{created}\) message indicating to the re-launched primary that it should run as a standby. Figure 17c represents the final stable state of the system after such a dynamic reconfiguration.
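
The group formation behavior described in this section can be summarized as a small state machine. The Python sketch below is an illustration only and does not reflect the AUTOSAR implementation: the `Member` class, the message strings, and the use of whole periods as the time unit are assumptions, and heartbeat monitoring during normal operation (Sect. 3.2) is left out. Messages are passed by direct method calls instead of a network.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


class Phase(Enum):
    STARTUP = "startup"
    NORMAL = "normal"


class Member:
    def __init__(self, name, role, timeout):
        self.name, self.role = name, role
        self.timeout = timeout      # startup timeout in periods (assumed unit)
        self.phase = Phase.STARTUP
        self.waited = 0

    def tick(self):
        """One period elapses; return the message broadcast in that period."""
        if self.phase is Phase.NORMAL:
            return "heartbeat"
        self.waited += 1
        if self.waited >= self.timeout:
            # Primary: no Group_created arrived, so no primary exists yet.
            # Standby: no Group_create arrived, so the primary is declared failed.
            self.role, self.phase = Role.PRIMARY, Phase.NORMAL
            return "heartbeat"
        return "Group_create" if self.role is Role.PRIMARY else "Group_join"

    def receive(self, message):
        """Handle one incoming message; return a reply or None."""
        if self.phase is Phase.STARTUP:
            if self.role is Role.PRIMARY and message == "Group_created":
                # Another primary already exists: run as a standby instead.
                self.role, self.phase = Role.STANDBY, Phase.NORMAL
            elif self.role is Role.STANDBY and message in ("Group_create", "Group_created"):
                self.phase = Phase.NORMAL
            return None
        if self.role is Role.PRIMARY and message == "Group_create":
            # An acting primary tells a (re)started primary that the group exists.
            return "Group_created"
        return None
```

In this sketch, a standby that times out and later receives a \(Group_{create}\) message from a restarted primary replies with \(Group_{created}\), which makes the restarted node continue as a standby, mirroring the reconfiguration described above.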

Cite this article

Bhat, A., Samii, S. & Rajkumar, R. Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems. Real-Time Syst 55, 889–924 (2019). https://doi.org/10.1007/s11241-019-09339-7

  • DOI: https://doi.org/10.1007/s11241-019-09339-7
