Terminal adaptive guidance via reinforcement meta-learning: Applications to autonomous asteroid close-proximity operations

doi:10.1016/j.actaastro.2020.02.036

Acta Astronautica

Volume 171, June 2020, Pages 1-13

https://doi.org/10.1016/j.actaastro.2020.02.036

Highlights

• Adaptive guidance system optimized using Reinforcement Meta-Learning.
• System maps sensor output directly to actuator commands.
• System autonomously completes a landing maneuver with pinpoint accuracy.
• We test system in high fidelity 6-DOF simulator.
• Simulator models time-varying dynamics, actuator failure, and sensor distortion.

Abstract

Current practice for asteroid close proximity maneuvers requires extremely accurate characterization of the environmental dynamics and precise spacecraft positioning prior to the maneuver. This creates a delay of several months between the spacecraft's arrival and the ability to safely complete close proximity maneuvers. In this work we develop an adaptive integrated guidance, navigation, and control system that can complete these maneuvers in environments with unknown dynamics, with initial conditions spanning a large deployment region, and without a shape model of the asteroid. The system is implemented as a policy optimized using reinforcement meta-learning. The lander is equipped with an optical seeker that locks to either a terrain feature, reflected light from a targeting laser, or an active beacon, and the policy maps observations consisting of seeker angles and LIDAR range readings directly to engine thrust commands. The policy implements a recurrent network layer that allows the deployed policy to adapt in real time to both environmental forces acting on the agent and internal disturbances such as actuator failure and center of mass variation. We validate the guidance system through simulated landing maneuvers in a six degrees-of-freedom simulator. The simulator randomizes the asteroid's characteristics such as solar radiation pressure, density, spin rate, and nutation angle, requiring the guidance and control system to adapt to the environment. We also demonstrate robustness to actuator failure, sensor bias, and changes in the lander's center of mass and inertia tensor. Finally, we suggest a concept of operations for asteroid close proximity maneuvers that is compatible with the guidance system.

Introduction

Current practice for asteroid close proximity operations involves first making several passes past the asteroid in order to collect images and LIDAR data that allow the creation of a shape model of the asteroid [1]. In addition, these maneuvers can be used to estimate the environmental dynamics in the vicinity of the asteroid, which allows calculation of a burn that will put the spacecraft into a safe orbit [2]. Once in this safe orbit, statistical orbit determination techniques [3] augmented by optical navigation are used to create a model that can estimate the spacecraft's orbital state (position and velocity) as a function of time. This model is extremely accurate provided the spacecraft remains in the safe orbit. While the spacecraft is in the safe orbit, both the asteroid shape model and the model of the asteroid's dynamics are refined. Mission planners then use the orbit model, along with an estimate of the forces acting on the spacecraft, to plan an open loop maneuver that will bring the spacecraft from a given point in the safe orbit to some desired location and velocity. An example of such a trajectory is the OSIRIS-REx Touch and Go (TAG) sample collection maneuver [2]. If the dynamics model used to plan the open loop maneuver is not completely accurate, errors can accumulate over the trajectory, resulting in a large error ellipse for the contact position between the asteroid and spacecraft [4]. For this reason, multiple rehearsal maneuvers are typically planned and executed in order to ensure that the dynamics model is accurate in the TAG region of interest. To be clear, since the OSIRIS-REx CONOPS uses an open loop trajectory for the TAG maneuver, even if the landing site were tagged with a targeting laser, the spacecraft GNC system would not be capable of employing such a navigation aid to enhance landing precision or tolerate larger initial condition uncertainty. This limitation is due to constraints in the overall system design.
Importantly, a partial closed-loop approach had been initially considered to improve the overall accuracy. Berry et al. [4] devised an algorithm based on a second-order surface response method where two individual laser measurements, specifically executed to estimate range-to-go via limb detection and one single altitude, may be employed to correct the timing and magnitude of the intermediate maneuver, thus improving the accuracy. However, as a back-up plan, an autonomous navigation system called Natural Feature Tracking (NFT [5]) has been devised and implemented on the OSIRIS-REx spacecraft computer in an attempt to comply with the more stringent requirements for Bennu sampling (less than 5 m radius) due to the unexpected rough terrain observed on the asteroid surface [6]. Finally, note that current practice is not compatible with completely autonomous missions, where the spacecraft can conduct operations on and around an asteroid without human supervision. For example, the Hayabusa 2 mission has devised a hybrid ground/onboard-based navigation system to navigate the initial descent from a home (hovering) position [7,8]. Named GCP-NAV, the system autonomously controls the vertical descent velocity using the on-board LIDAR, whereas the horizontal position and velocity are determined on the ground by an operator that processes the navigation cameras' outputs [9]. Subsequently, a horizontal maneuver is planned on the ground and uploaded to the on-board computer to command the spacecraft descent along a predefined path. Note, however, that the last two phases of the descent, i.e. surface-relative descent and touch down, are fully autonomous and rely on the deployment of a target marker which is released from the main spacecraft at 100 m altitude. During such a phase, the spacecraft GNC system attempts to track the marker using both the optical navigation camera and the Flash Lamp (FLASH).

Now consider an adaptive guidance, navigation and control (GNC) system that after a short interaction with the environment can adapt to that environment's ground truth dynamics, only limited by the spacecraft's thrust capability. Such a system would allow a paradigm shift for mission design, as highlighted in Table 1, which compares current practice with what might be possible using the proposed system. Of course there are scientific reasons for characterizing an asteroid's environment as accurately as possible, but the proposed innovation gives mission flexibility. For example, a mission might involve visiting multiple asteroids and collecting samples, and the orbits of the asteroids might make it necessary to spend only a short time at each one. Or the mission goal might not be scientific at all, but rather to identify resource rich asteroids for future mining operations. For a given level of accuracy with respect to the environmental dynamics model, the ability to adapt in real time when the environment diverges from the model should provide a significant reduction in mission risk.

Coupling a suitable navigation system with a traditional closed loop guidance and control law can potentially improve maneuver accuracy. However, if the asteroid's environmental dynamics are not well characterized, accuracy will still be compromised due to errors stemming from both the dynamics model used in the state estimation algorithm and the potential inability of the guidance and control law to function optimally in an environment with unknown dynamics. Indeed, an optimal trajectory generated from an inaccurate dynamics model may be infeasible (impossible to track with a controller given control constraints) in the actual environment. Moreover, our initial research into this area [10] has shown that traditional closed loop guidance laws such as DR/DV [11] are not robust to actuator failure, unknown dynamics, and navigation system errors, whereas the proposed GNC system is. Finally, note that integration of the navigation system allows the system to quickly adapt to sensor bias.

It is worth noting that for the 3-DOF case, we have demonstrated landing on an asteroid using a policy that maps LIDAR altimeter readings directly to thrust commands [10], but the performance was suboptimal. We have also demonstrated an asteroid TAG maneuver in 3-DOF where a particle filter [12] uses an asteroid shape model to infer the spacecraft's state in the asteroid body-fixed reference frame, and an energy-optimal guidance law [11] maps this estimated state to a thrust command. Here performance was acceptable, but the GNC system required knowledge of the asteroid's environmental dynamics.

Recent work by others in the area of adaptive guidance algorithms includes Ref. [13], which demonstrates an adaptive control law for a UAV tracking a reference trajectory, where the adaptive controller adapts to external disturbances. One limitation is the linear dynamics model, which may not be accurate; another is that the frequency of the disturbance must be known. Ref. [14] develops a fault identification system for Mars entry phase control using a pre-trained neural network, with a fault controller implemented as a second Gaussian neural network. Importantly, the second network requires on-line parameter updates during the entry phase, which may not be possible to implement in real time on a flight computer. Moreover, the adaptation is limited to known actuator faults as identified by the first network. In Ref. [15], the authors develop an adaptive controller for spacecraft attitude control using reaction wheels. This approach is also limited to actuator faults, and the architecture does not adapt to either state estimation bias or environmental dynamics.

In this work we develop an adaptive and integrated GNC system applicable to asteroid close proximity maneuvers that allows a lander deployed by the spacecraft (see Section 2) to accurately and robustly land at a designated site. The system is optimized using reinforcement meta-learning (RL meta-learning), and implements a global policy over the region of state space defined by the deployment region and potential landing sites. The policy maps observations to actions, with the observations consisting of angles and range readings from the lander's seeker, changes in lander attitude since the start of the maneuver, and lander rotational velocity. The policy actions consist of on/off thrust commands to the lander's thrusters. In order to reduce mission risk, we present a concept of operations (CONOPS) where a hovering spacecraft tags the landing site with a targeting laser, providing an obvious target for the lander's seeker camera. However, future work will investigate the effectiveness of using terrain features as targets, and the use of surface beacons. In the RL framework, the seeker can be considered an attention mechanism, determining what object in the agent's field of regard the policy should target during the maneuver. In the case where we want to target a terrain feature rather than a tagged landing site, the landing site would be identified by the seeker, rather than the guidance policy. Both seeker design and laser aided guidance are mature technologies, with seekers being widely used in guided missiles [16], and laser aided guidance used in certain types of missiles and guided bombs. Reinforcement Learning (RL) has recently been successfully applied to landing guidance problems [17], [18], [19], [20].

Adaptability is achieved through RL meta-learning, where different environmental dynamics, sensor noise, actuator failure, and changes in the lander's center of mass and inertia tensor are treated as a range of partially observable Markov decision processes (POMDPs). In each POMDP, the policy's recurrent network hidden state will evolve differently over the course of an episode, capturing information regarding hidden variables that are useful in minimizing the cost function, i.e., external forces, changes in the lander's internal dynamics and sensor bias. By optimizing the policy over this range of POMDPs, the trained policy will be able to adapt to novel POMDPs encountered during deployment. Specifically, even though the policy's parameters are fixed after optimization, the policy's hidden state will evolve based on the current POMDP, thus adapting to the environment.
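The adaptation mechanism described above can be illustrated with a minimal sketch. It is not the authors' network: the layer sizes, the random (untrained) weights, and the toy one-dimensional dynamics are all assumptions made for illustration. It only shows that, with fixed parameters, the recurrent hidden state ends up in a different place when the unobserved disturbance differs.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HID_DIM, ACT_DIM = 2, 8, 2  # hypothetical sizes, not from the paper

# Fixed parameters: random stand-ins for weights that training would produce.
W_in = rng.normal(scale=0.5, size=(HID_DIM, OBS_DIM))
W_h = rng.normal(scale=0.5, size=(HID_DIM, HID_DIM))
W_out = rng.normal(scale=0.5, size=(ACT_DIM, HID_DIM))


def policy_step(obs, hidden):
    """One recurrent step: update the hidden state, emit on/off thrust commands."""
    hidden = np.tanh(W_in @ obs + W_h @ hidden)
    thrust_on = (W_out @ hidden) > 0.0
    return thrust_on, hidden


def rollout(disturbance, steps=20, dt=0.1):
    """Toy 1-D double integrator with a constant acceleration the policy never observes."""
    pos, vel = 1.0, 0.0
    hidden = np.zeros(HID_DIM)
    for _ in range(steps):
        obs = np.array([pos, vel])
        thrust_on, hidden = policy_step(obs, hidden)
        # Two opposing thrusters plus the hidden disturbance.
        accel = float(thrust_on[0]) - float(thrust_on[1]) + disturbance
        vel += accel * dt
        pos += vel * dt
    return hidden


hidden_nominal = rollout(disturbance=0.0)
hidden_perturbed = rollout(disturbance=0.5)
print(np.linalg.norm(hidden_nominal - hidden_perturbed))
```

In the meta-learning setting, optimization over many such environments is what would make the hidden state encode useful information about the disturbance; this sketch skips training entirely.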

The policy uses approximately 16,000 32-bit network parameters, and requires approximately 64 KB of memory. The policy takes approximately 1 ms to run the mapping between estimated state and thruster commands (four small matrix multiplications) on a 3 GHz processor. Since in this work the mapping is updated every 6 s, we do not see any issues with running this on the current generation of space-certified flight computers. A diagram illustrating how the policy interfaces with peripheral lander components is shown in Fig. 1.
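As a quick check of the figures quoted above, the arithmetic can be written out as follows. The variable names are illustrative, and the duty cycle refers only to the 3 GHz reference processor mentioned in the text, not to any particular flight computer.

```python
# Figures taken from the text above; names are illustrative.
N_PARAMS = 16_000          # approximate number of network parameters
BYTES_PER_PARAM = 32 // 8  # 32-bit parameters
INFERENCE_TIME_S = 1e-3    # approximate policy run time on a 3 GHz processor
UPDATE_PERIOD_S = 6.0      # the mapping is updated every 6 s

memory_bytes = N_PARAMS * BYTES_PER_PARAM
memory_kb = memory_bytes / 1000  # decimal kilobytes

# Fraction of each update period spent evaluating the policy.
duty_cycle = INFERENCE_TIME_S / UPDATE_PERIOD_S

print(f"memory: {memory_kb:.0f} KB, duty cycle: {duty_cycle:.2e}")
```

Under these numbers the policy occupies the reference processor for well under 0.1% of each update period, which leaves a large margin for a slower flight processor.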

One advantage of our proposed CONOPS and GN&C system as compared to current practice is that the environmental dynamics need not be accurately characterized prior to the maneuver, removing an element of mission risk. Compared to completely passive optical navigation approaches, our method has the additional advantage that it is insensitive to lighting conditions and does not rely on the asteroid having sufficient terrain diversity to enable navigation. Moreover, the system can adapt to sensor bias and actuator failure, further reducing mission risk. The downside is that fuel efficiency will be inferior to that of an optimal trajectory generated using knowledge of the environmental dynamics. It would be possible to improve the fuel efficiency by observing the movement of the target location from an inertial reference frame, and using this information to put the lander on a collision triangle heading with the target landing site. Instead of heading for the target site, the lander would head towards the point where the target site will be at the completion of the maneuver. In this approach, the agent would be rewarded for keeping the seeker angles at their value at the start of a maneuver, which will keep the lander on the collision triangle with the moving target, as described in more detail in Ref. [21].
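The collision-triangle idea in the preceding paragraph rewards the agent for holding the seeker angles at their initial values. A minimal sketch of such a reward term is given below; the function name, the Euclidean norm, and the `weight` parameter are assumptions for illustration, and the formulation actually used is the one described in Ref. [21].

```python
import numpy as np


def seeker_angle_reward(angles, angles_at_start, weight=1.0):
    """Penalize drift of the seeker angles from their values at maneuver start.

    Hypothetical shaping term: zero when the angles are unchanged, increasingly
    negative as they drift. Angle wrap-around is ignored for simplicity.
    """
    drift = np.asarray(angles, dtype=float) - np.asarray(angles_at_start, dtype=float)
    return -weight * float(np.linalg.norm(drift))


# Angles in radians (two seeker angles); values are made up for the example.
r_hold = seeker_angle_reward([0.10, -0.05], [0.10, -0.05])
r_drift = seeker_angle_reward([0.40, 0.35], [0.10, -0.05])
print(r_hold, r_drift)
```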

We demonstrate that the system can complete maneuvers from a large deployment region and without knowledge of the local environmental dynamics, and successfully adapt to sensor distortion, changes in the lander's center of mass and inertia tensor, and actuator failure. In this work, we will focus on a maneuver that begins approximately 1 km from the desired landing site, with a deployment region spanning approximately 1 cubic km. The goal is to reach a position within 1 m of a target location 10 m above the designated landing site, with velocity magnitude less than 10 cm/s, and negligible rotational velocity. What happens next will be mission specific. To illustrate a scenario, a hovering guidance and control system using a LIDAR altimeter could take over at that point, bringing the lander to an attitude consistent with the deployment of a robotic arm, and collect a sample, with the hovering controller compensating for the disturbance created by the arm pushing against the surface. Alternatively, the lander could release a rover from this altitude.
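The terminal requirements stated above can be expressed as a simple check. This is a sketch under assumptions: the text does not quantify "negligible rotational velocity", so `ROT_TOL_RADPS` is a placeholder, and measuring the 10 m offset along a local surface normal is also an assumption.

```python
import numpy as np

POS_TOL_M = 1.0       # within 1 m of the target location (from the text)
VEL_TOL_MPS = 0.10    # velocity magnitude below 10 cm/s (from the text)
HOVER_ALT_M = 10.0    # target location is 10 m above the landing site (from the text)
ROT_TOL_RADPS = 0.01  # ASSUMPTION: placeholder for "negligible" rotational velocity


def target_position(landing_site, surface_normal):
    """Point HOVER_ALT_M above the landing site along the (assumed) surface normal."""
    normal = np.asarray(surface_normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    return np.asarray(landing_site, dtype=float) + HOVER_ALT_M * normal


def terminal_conditions_met(pos, vel, omega, target):
    """True if position, speed, and rotation rate all satisfy the tolerances."""
    pos_err = np.linalg.norm(np.asarray(pos, dtype=float) - target)
    speed = np.linalg.norm(vel)
    rot_rate = np.linalg.norm(omega)
    return bool(pos_err <= POS_TOL_M and speed < VEL_TOL_MPS and rot_rate < ROT_TOL_RADPS)


# Made-up landing site 250 m from the asteroid center along z.
target = target_position([0.0, 0.0, 250.0], [0.0, 0.0, 1.0])
ok = terminal_conditions_met([0.5, 0.0, 260.0], [0.0, 0.0, 0.05], [0.0, 0.0, 0.0], target)
too_fast = terminal_conditions_met([0.5, 0.0, 260.0], [0.0, 0.0, 0.20], [0.0, 0.0, 0.0], target)
print(ok, too_fast)
```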

Section snippets

Concept of operations (CONOPS)

The GNC system described in this work uses a camera-based optical seeker. In order for the optical seeker to lock onto the desired landing site, the landing site must be appropriately marked. There are multiple methods that could be used to mark the landing site, including reflective markers dropped on the asteroid's surface by a hovering spacecraft [22]. We will propose two new methods for tagging the landing site using a targeting laser on board a hovering spacecraft. Once the landing site is

Lander configuration

The lander is modeled as a uniform density cube with height $h = 2$ m, width $w = 2$ m, and depth $d = 2$ m, with inertia matrix given in Eq. (1), where m is the lander's mass. The lander has a wet mass ranging from 450 to 500 kg. The thruster configuration is shown in Table 2, where x, y, and z are the body frame axes. Roll is about the x-axis, yaw is about the z-axis, and pitch is about the y-axis. When two thrusters on a given side of the cube are fired concurrently, they provide translational thrust
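Eq. (1) is not reproduced in this excerpt. As a hedged illustration, the textbook inertia matrix of a uniform-density cuboid about its center of mass can be computed as below. The function name and the assignment of dimensions to body axes are assumptions; for a 2 m cube the assignment does not affect the result.

```python
import numpy as np


def cuboid_inertia(mass, size_x, size_y, size_z):
    """Inertia matrix (kg m^2) of a uniform-density cuboid about its center of mass.

    Standard result: I_xx = m/12 * (size_y^2 + size_z^2), and cyclically for y and z.
    """
    return (mass / 12.0) * np.diag([
        size_y**2 + size_z**2,
        size_x**2 + size_z**2,
        size_x**2 + size_y**2,
    ])


# Bounds of the wet mass range quoted in the text, with 2 m sides.
inertia_500kg = cuboid_inertia(500.0, 2.0, 2.0, 2.0)
inertia_450kg = cuboid_inertia(450.0, 2.0, 2.0, 2.0)
print(np.diag(inertia_500kg), np.diag(inertia_450kg))
```

For a cube each diagonal entry reduces to $m s^2 / 6$ with side length $s$, so the principal moments scale linearly with mass across the 450 to 500 kg range.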

RL overview

In the RL framework, an agent learns through episodic interaction with an environment how to successfully complete a task by learning a policy that maps observations to actions. The environment initializes an episode by randomly generating a ground truth state, mapping this state to an observation, and passing the observation to the agent. These observations could be a corrupted version of the ground truth state (to model sensor noise) or could be raw sensor outputs such as Doppler radar
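The episodic interaction described above can be sketched with a toy environment. Everything here is hypothetical (the `ToyEnv` class, its one-dimensional dynamics, the noise level, the reward, and the hand-written policy); it only illustrates the loop in which the environment draws a random ground truth state, maps it to a noisy observation, and exchanges observations and actions with the agent.

```python
import numpy as np


class ToyEnv:
    """Hypothetical 1-D environment: state is (position, velocity)."""

    def __init__(self, noise_std=0.01, max_steps=50, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_std = noise_std
        self.max_steps = max_steps

    def _observe(self):
        # The observation is a noise-corrupted version of the ground truth state.
        return self.state + self.rng.normal(scale=self.noise_std, size=2)

    def reset(self):
        # Randomly generate the ground truth state that starts the episode.
        self.state = self.rng.uniform(-1.0, 1.0, size=2)
        self.t = 0
        return self._observe()

    def step(self, action, dt=0.1):
        pos, vel = self.state
        vel = vel + dt * action
        pos = pos + dt * vel
        self.state = np.array([pos, vel])
        self.t += 1
        reward = -abs(pos)  # toy reward: stay close to the origin
        done = self.t >= self.max_steps
        return self._observe(), reward, done


def run_episode(env, policy):
    """Roll out one episode and return the total reward and the step count."""
    obs = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


# Hand-written stand-in for a learned policy: push against position plus velocity.
total_reward, steps = run_episode(ToyEnv(), lambda obs: -float(np.sign(obs[0] + obs[1])))
print(total_reward, steps)
```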

Experiments

Once we have a chance to commercialize the technology, we will post the Python code that allows reproducing our results on our GitHub site; the repository will be indexed at github.com/Aerospace-AI/Aerospace-AI.github.io.

Implementation considerations

In this work we considered an ideal seeker that perfectly tracked the target from a stabilized (inertial) platform. When this guidance system is implemented on a small lander, miniaturization of the seeker hardware is critical. First, note that since the GNC system only requires changes in attitude that have accumulated from the start of a maneuver, rather than use a star tracker, we can measure the difference between a gyroscope stabilized reference frame and the lander body frame. This should

Conclusion

We formulated a particularly difficult problem: precision maneuvers around an asteroid with unknown dynamics, starting from a large range of initial condition uncertainty, accounting for actuator failure, center of mass variation, and sensor noise, and using raw sensor measurements. We created a high fidelity 6-DOF simulator that synthesized asteroid models with randomized parameters, where the asteroid is modeled as a uniform density ellipsoid that in general is not rotating about a principal

Declaration of competing interest

We have no competing interests.

References (37)

Y. Tsuda et al., System design of the Hayabusa 2–asteroid sample return mission to 1999 JU3, Acta Astronaut. (2013)
Y. Tsuda et al., Hayabusa2–sample return and kinetic impact mission to near-Earth asteroid Ryugu, Acta Astronaut. (2019)
T. Yamaguchi et al., Hayabusa2-Ryugu proximity operation planning and landing site selection, Acta Astronaut. (2018)
Y. Huang et al., Mars entry fault-tolerant control via neural network and structure adaptive model inversion, Adv. Space Res. (2019)
R. Gaskell et al., Characterizing and navigating small bodies with imaging data, Meteoritics Planet Sci. (2008)
B. Udrea et al., Sensitivity analysis of the touchdown footprint at (101955) 1999 RQ36
B. Schutz et al., Statistical Orbit Determination (2004)
K. Berry et al., OSIRIS-REx touch-and-go (TAG) mission design and analysis
D.A. Lorenz et al., Lessons learned from OSIRIS-REx autonomous navigation using natural feature tracking
D. DellaGiustina et al., Properties of rubble-pile asteroid (101955) Bennu from OSIRIS-REx imaging and thermal analysis, Nat. Astron. (2019)
B. Gaudet, R. Linares, Adaptive Guidance with Reinforcement Meta-Learning, arXiv preprint...
C. D'Souza et al., An optimal guidance law for planetary landing
S. Thrun et al., Probabilistic Robotics (2005)
N. Prabhakar et al., Trajectory-driven adaptive control of autonomous unmanned aerial vehicles with disturbance accommodation, J. Guid. Contr. Dynam. (2018)
Y. Han et al., Adaptive fault-tolerant control of spacecraft attitude dynamics with actuator failures, J. Guid. Contr. Dynam. (2015)
G.M. Siouris, Missile Guidance and Control Systems (2004)
R. Furfaro et al., Deep learning for autonomous lunar landing
R. Furfaro et al., A recurrent deep architecture for quasi-optimal feedback guidance in planetary landing

Cited by (57)

Robust interplanetary trajectory design under multiple uncertainties via meta-reinforcement learning
2024, Acta Astronautica
This paper focuses on the application of meta-reinforcement learning to the robust design of low-thrust interplanetary trajectories in the presence of multiple uncertainties. A closed-loop control policy is used to optimally steer the spacecraft to a final target state despite the considered perturbations. The control policy is approximated by a deep recurrent neural network, trained by policy-gradient reinforcement learning on a collection of environments featuring mixed sources of uncertainty, namely dynamic uncertainty and control execution errors. The recurrent network is able to build an internal representation of the distribution of environments, thus better adapting the control to the different stochastic scenarios. The results in terms of optimality, constraint handling, and robustness on a fuel-optimal low-thrust transfer between Earth and Mars are compared with those obtained via a traditional reinforcement learning approach based on a feed-forward neural network.
Recorded recurrent deep reinforcement learning guidance laws for intercepting endoatmospheric maneuvering missiles
2024, Defence Technology
This work proposes a recorded recurrent twin delayed deep deterministic (RRTD3) policy gradient algorithm to solve the challenge of constructing guidance laws for intercepting endoatmospheric maneuvering missiles with uncertainties and observation noise. The attack-defense engagement scenario is modeled as a partially observable Markov decision process (POMDP). Given the benefits of recurrent neural networks (RNNs) in processing sequence information, an RNN layer is incorporated into the agent’s policy network to alleviate the bottleneck of traditional deep reinforcement learning methods while dealing with POMDPs. The measurements from the interceptor’s seeker during each guidance cycle are combined into one sequence as the input to the policy network since the detection frequency of an interceptor is usually higher than its guidance frequency. During training, the hidden states of the RNN layer in the policy network are recorded to overcome the partially observable problem that this RNN layer causes inside the agent. The training curves show that the proposed RRTD3 successfully enhances data efficiency, training speed, and training stability. The test results confirm the advantages of the RRTD3-based guidance laws over some conventional guidance laws.
Reinforcement learning-based stable jump control method for asteroid-exploration quadruped robots
2023, Aerospace Science and Technology
Unlike the spherical gravitational field of planets and other large solar system bodies, the gravitational field of asteroids is irregular and weak. It is challenging for a planetary rover to obtain sufficient traction forces in this environment. However, this gravitational environment is suitable for legged robots with jumping ability, but it also imposes higher demands on control methods. Therefore, this study aimed to address the problem of jump control method for asteroid-exploration quadruped robots. As the robot jumps off the surface of an asteroid, it would fly for a certain amount of time because of the low gravitational acceleration. The prolonged flight phase underscores the significance of the robot's take off and attitude control. A model-free stable jumping control method was devised in this study. This method can satisfy the control requirements for takeoff, attitude adjustment, and soft landing by using end-to-end multi-agent reinforcement learning (MARL). MARL is more advantageous than single-agent reinforcement learning in dealing with composite motion control problems under similar observation conditions. A simulated training environment was established, incorporating models of the gravitational field, task partitioning for jumping, and design of reward functions, including jump trajectory planning. The efficacy of the proposed jump control method for a quadruped robot was successfully demonstrated in the gravitational environment of an irregular rod-shaped asteroid, 216 Kleopatra.
Integrated robust navigation and guidance for the kinetic impact of near-earth asteroids based on deep reinforcement learning
2023, Aerospace Science and Technology
The defense against near-Earth asteroids (NEAs) using kinetic impact is faced with various challenges, including limited maneuverability of the impactor, inaccurate dynamic models, poor observability of relative navigation, and control execution errors. To address these challenges, an integrated robust navigation and guidance method for the kinetic impact of NEAs based on deep reinforcement learning (DRL) is proposed in this paper, which can directly map angle measurements by a monocular camera to guidance maneuvers. Firstly, a discrete partial observable Markov decision process (POMDP) is modeled for the integrated navigation and guidance problem of NEA interception. To address the situation where impactors are usually only equipped with aiming cameras and only the line of sight (LOS) angle is measured, past and current LOS measurements are concatenated into a one-dimensional state observation vector to directly introduce the memory of historical state information. A shaping reward function design based on potential energy has also been proposed to solve the problem of sparse reward as the main goal. Subsequently, proximal policy optimization (PPO) is used to solve the established POMDP model, obtaining an integrated navigation and guidance policy that directly maps from the original output of the navigation sensor to the guidance command. The potentially hazard asteroid (PHA) Bennu is used as the target and a typical kinetic impact defense scenario is designed to simulate and verify the model and method proposed in this paper, considering the influence of multiple factors. Numerical simulation results show that the proposed method achieves an interception accuracy of 203.58 m (mean value). The proposed method abandons the traditional separate design of navigation and guidance algorithms, and its robustness has been tested and verified in a wide range of uncertain environments.
Optimal path planning of spacecraft fleet to asteroid detumbling utilizing deep neural networks and genetic algorithm
2023, Advances in Space Research
Following space discoveries, asteroids were found as rich sources of minerals and organic matter that can be exploited. In this paper, we present a precise, fast, and robust energy-optimal soft landing with a combined GA-Collocation method for a spacecraft fleet aimed at an irregular asteroid detumbling mission. It is assumed that the spacecraft fleet carried to the asteroid's equilibrium points introduced as the start locations of the mission. Here, the 433 Eros asteroid has been considered as the stony target known for its irregular elongated shape and nearly large dimension. We modeled the gravitational potential field by polyhedron as the most accurate one for such asymmetric objects. Hence, to decline the high burden computational of related equations in gravitational acceleration calculation, a Deep Neural Network with seven hidden layers and a relatively large dataset covering 6000 pairs has been developed. We utilized a Genetic Algorithm to guess systematically more optimal and reliable initial costates of some key parameters to raise the speed and accuracy of estimation. The acceptable efficiency and accuracy in optimal path prediction of the presented approach for a fleet of spacecraft approved throughout precise simulations. Eventually, the feasibility of the proposed approach is demonstrated through the corresponding results.
Densely rewarded reinforcement learning for robust low-thrust trajectory optimization
2023, Advances in Space Research
To overcome the time-consuming training caused by the sparse reward function in reinforcement learning, an efficient dense reward framework for robust low-thrust trajectory optimization is proposed. The dense reward functions are designed separately for the deterministic and considered uncertain scenarios, including state uncertainties, observation uncertainties and execution uncertainties. For the uncertainties, the dense reward function is designed to diminish the deviation with respect to the isochronous nominal state along the corresponding deterministic optimal trajectory at each step, rendering the reward function no longer sparse and suitable for complex multirevolution problems. In addition, a multistage reward function of the terminal constraints for the rendezvous missions is designed by incorporating some exponential acceleration terms, enabling significant improvement in training efficiency as the terminal errors become low. In addition, a dense reward function for the deterministic scenario is also proposed via the introduction of empirical forbidden zones and an exponential term. The effectiveness and efficiency of the proposed method is demonstrated in a simple Earth-Mars mission and a complex Earth-Venus multirevolution mission. The promising results verify the significant effect of the proposed method in speeding up the process of training an initial incapable agent to an ‘expert’ while guaranteeing or even improving the performance.
