Dependable Multiprocessing (DM) ¹

John Samson, Gary Gardner, David Lupia
Honeywell Inc., Aerospace Systems
john.r.samson@honeywell.com

Minesh Patel, Paul Davis, Vikas Aggarwal
Tandel Systems LLC
mpatel@tandelsystems.com

Alan George
University of Florida
george@hcs.ufl.edu

Zbigniew Kalbarczyk
University of Illinois/Armored Computing, Inc.
kalbar@crhc.uiuc.edu

Rafi Some
Jet Propulsion Laboratory, California Institute of Technology
Raphael.Some@jpl.nasa.gov

Abstract

With the ever-increasing demand for higher bandwidth and processing capacity in today’s space exploration, space science, and defense missions, the ability to efficiently apply commercial-off-the-shelf (COTS) processors for on-board computing is now a critical need. In response to this need, NASA’s New Millennium Program office commissioned the development of Dependable Multiprocessor (DM) technology for use in payload and robotic missions. The technology is also applicable to a wide variety of DoD missions.

The Dependable Multiprocessor technology is a COTS-based, power-efficient, high-performance, highly dependable, fault-tolerant cluster computer. While current COTS high-performance processors exhibit adequate Total Integrated Dose (TID) performance to meet the requirements of the natural space radiation environment, Single Event Upsets (SEUs) caused by heavy ions and solar flares are, and will remain, a problem. Traditional approaches to mitigating the SEU problem involve fixed redundancy schemes such as Self Checking Pairs (SCP) or Triple Modular Redundancy (TMR). These techniques are effective in mitigating the effects of SEUs, but they come at a high price: 100% overhead for SCP and 200% overhead for TMR. That price is paid even when such a level of protection is not needed. In such cases, it would be beneficial to convert that overhead into useful mission processing capability. The idea behind the Dependable Multiprocessor (DM) is to make the processing system configurable so as to maximize the processing capability available to the mission.
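The overhead figures quoted above follow directly from the replica count. The sketch below is illustrative only and is not taken from the DM software: the function names (`scp_check`, `tmr_vote`, `redundancy_overhead`, `independent_streams`) are hypothetical, and real SCP and TMR are usually implemented in hardware rather than in application code.

```python
from collections import Counter


def scp_check(a, b):
    """Self-checking pair: two copies compute; a mismatch only detects an error."""
    return a == b


def tmr_vote(a, b, c):
    """Triple modular redundancy: return the majority value of three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def redundancy_overhead(replicas):
    """Extra processors relative to one unreplicated processor, in percent."""
    return (replicas - 1) * 100


def independent_streams(nodes, replicas):
    """Independent work streams when `nodes` processors are grouped in sets of `replicas`."""
    return nodes // replicas


if __name__ == "__main__":
    print(redundancy_overhead(2), redundancy_overhead(3))  # SCP, TMR
    print(tmr_vote(7, 7, 9))                               # one replica upset
    # Hypothetical six-node cluster: fixed TMR vs. no replication.
    print(independent_streams(6, 3), independent_streams(6, 1))
```

Under these assumptions, a six-node cluster locked into TMR delivers two independent work streams, whereas the same cluster running unreplicated delivers six. That difference is the overhead a configurable scheme could reclaim when full protection is not required.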

To satisfy this need, the Dependable Multiprocessor concept has been demonstrated and is currently being developed further as one of the flight experiments for NASA’s New Millennium Program (NMP) ST8 project. The objective of this NMP ST8 effort is to combine high-performance, fault-tolerant, COTS-based cluster processing with replication services, Algorithm-Based Fault Tolerance (ABFT), and fault-tolerant middleware in an architecture and software framework capable of supporting a wide variety of mission applications.
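As background on ABFT, the following is a minimal sketch of the classic checksum scheme for matrix multiplication. It is a textbook illustration, not the DM implementation, and all function names are hypothetical. It uses integer matrices so that checksums compare exactly; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def abft_locate(c_full):
    """Return (row, col) of a single corrupted data element, or None if consistent."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single corrupted data element")


def abft_correct(c_full):
    """Repair a single corrupted data element in place using its row checksum."""
    loc = abft_locate(c_full)
    if loc is not None:
        i, j = loc
        m = len(c_full[0]) - 1
        others = sum(v for k, v in enumerate(c_full[i][:m]) if k != j)
        c_full[i][j] = c_full[i][m] - others
    return loc


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] = 99             # simulate an upset in one result element
    print(abft_correct(c_full))   # location of the repaired element
    print([row[:2] for row in c_full[:2]])
```

The appeal of this approach is that the checksums are carried through the computation itself, so a single corrupted result element can be detected and repaired without running a second or third copy of the whole multiplication.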

The software architecture for this DM framework is depicted in Figure 1. Figure 1 shows two types of processing nodes: a rad-hard system controller, which can operate through any environment without upset, and a cluster of COTS high-performance processing nodes. A high-level API (Application Interface) and a high-level SAL (System Abstraction Layer) provide both application independence and platform independence, while allowing particular mission applications and platforms to take advantage of the fault tolerance services and reliable messaging offered by the fault-tolerant middleware layer. The DM hardware architecture is depicted in Figure 2.

The DM middleware software was designed to meet two objectives: 1) fault tolerant cluster management, and 2) software-enhanced SEU immunity of COTS-based processing platforms.

A paper describing the project, the hardware architecture, the software architecture, and the development status midway through the TRL5 (Technology Readiness Level 5) effort was presented at HPEC 2005. The DM project recently passed the TRL5 milestone, qualifying it for advancement to flight system development status. As part of the TRL5 technology validation demonstration, a comprehensive fault injection campaign was run, during which the DM system was subjected to thousands of fault injections. Data on coverage, detection and recovery latency, throughput, and fault tolerance performance were recorded and fed into predictive Reliability and Availability Models, demonstrating the effectiveness of the system. The paper will describe the experiments, the demonstrations, and the performance achieved to date. The paper will also describe the flight system hardware and software.
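To make the terms concrete, the sketch below shows how fault-injection counts and recovery times could feed a simple availability estimate. This abstract does not specify the DM Reliability and Availability Models, so the formulas here are generic textbook definitions (coverage as a ratio, steady-state availability as MTTF / (MTTF + MTTR)), and every name and number is a hypothetical placeholder rather than a DM result.

```python
def coverage(recovered, manifested):
    """Fraction of manifested faults that were detected and recovered."""
    if manifested <= 0:
        raise ValueError("need at least one manifested fault")
    return recovered / manifested


def steady_state_availability(mttf_hours, mttr_hours):
    """Textbook steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mean_repair_time(cov, recovery_hours, fallback_hours):
    """Average downtime per fault: covered faults recover quickly, uncovered
    faults fall back to a slower action (for example, a full restart)."""
    return cov * recovery_hours + (1.0 - cov) * fallback_hours


if __name__ == "__main__":
    # Placeholder inputs only; not measurements from the DM campaign.
    c = coverage(recovered=990, manifested=1000)
    mttr = mean_repair_time(c, recovery_hours=0.001, fallback_hours=0.1)
    print(c, steady_state_availability(mttf_hours=100.0, mttr_hours=mttr))
```

The point of the sketch is the dependency chain: measured coverage and recovery latency set the mean repair time, which in turn drives the predicted availability.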

This paper fits well with the theme of HPEC 2006 because, while the DM project is the current incarnation of the long-held desire to fly COTS in space, the current system is based on related technologies developed over the years on several NASA, DoD, and DARPA programs such as AOSP, AAOP, REE, ISCP, and Space Touchstone.

¹ The Dependable Multiprocessor project was formerly known as the Environmentally Adaptive Fault Tolerant Computer project.

Figure 1 – DM Software Architecture

Figure 2 – DM Hardware Architecture

* Mass Data Storage Unit, Custom Spacecraft I/O, etc.