Target tracking strategy using deep deterministic policy gradient
Introduction
The capability of the autonomous decision-making system of an unmanned combat air vehicle (UCAV) to perform agile maneuvers in complex dynamic environments has been favored by industry and has attracted much attention in control engineering over the past decades [1]. The purpose of unmanned aerial vehicle (UAV) technologies is to create a completely independent and intelligent decision support system, in which autonomy, navigation algorithms, target detection, and energy and information management are particularly emphasized [2], [3]. Moreover, when a UCAV is used for military purposes, it always operates in dense and threatening environments that require active trajectory generation and navigation control [4], [5].
Advances in the field of navigation, guidance and control systems have made significant contributions to the development of autonomous vehicles [6]. Traditionally, a guidance system is designed separately from a control system, and optimizing the path offline is the most common strategy [7]. Naeem et al. [8] reviewed some important guidance laws for UAVs, including Lyapunov-based guidance, proportional navigation guidance (PNG) and line-of-sight (LOS) guidance. Under a sensible guidance strategy, the UCAV moves toward the target by splitting the path into multiple waypoints visited in a certain order. The waypoints are fixed points in space whose placement can be optimized to find the best path guiding the UAV to the target area. Typical criteria for assessing the optimal path for UAVs relate to travel time, safety conditions, and energy consumption [9]. In a known scenario, path planning first models the mission environment and then optimizes the trajectory under a certain objective function using classical algorithms, including the artificial potential field (APF), the A* algorithm, rapidly-exploring random trees (RRT), heuristic evolutionary computation and so forth [10], [11], [12], [13]. These methods introduce different considerations such as trajectory smoothness, waypoint accessibility, map gridding and action discretization. However, when a UAV operates in dynamic spaces, online reactive path planning based on multi-sensor fusion consumes significant computational time for environmental assessment, so it is hard to meet real-time requirements.
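For concreteness, the vector-based idea behind the APF method can be summarized in a few lines. The following is a minimal sketch only; the gains, influence radius and step size are illustrative choices, not values taken from the cited works:

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=100.0, rho0=5.0, step=0.1):
    """One gradient-descent step on a classical artificial potential field.

    pos, goal: (3,) arrays; obstacles: list of (3,) arrays.
    k_att / k_rep: attractive/repulsive gains; rho0: obstacle influence radius.
    """
    # The attractive force pulls the vehicle straight toward the goal.
    force = k_att * (goal - pos)
    for obs in obstacles:
        diff = pos - obs
        rho = np.linalg.norm(diff)
        # The repulsive force acts only inside the obstacle's influence radius.
        if 0.0 < rho < rho0:
            force += k_rep * (1.0 / rho - 1.0 / rho0) / rho**3 * diff
    return pos + step * force / (np.linalg.norm(force) + 1e-8)
```

The well-known weakness of this scheme, local minima where attractive and repulsive forces cancel, is one reason the online reactive planning discussed above becomes expensive in cluttered, dynamic spaces.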
Target tracking, as a key part of the UCAV navigation task in cognitive electronic warfare (CEW), is the process linking target searching and striking. Precise target tracking tends to use motion planning methods for end-to-end decision making, while treating the physical constraints of maneuvering vehicles as optimization or analytical terms [14]. To better accommodate the low-level controllers used for navigation, there are two effective motion planning strategies for UCAVs: one is to record distinctive maneuvering patterns individually in a pattern library and then to schedule where and when these patterns are used [15], [16]; the other is to directly design a complete optimal/suboptimal control scheme under defined dynamic models, including fuzzy control [17], [18], neural network control [19], nonlinear dynamic inversion control [20], etc. In contrast, the former is low in complexity but struggles to cope with dynamic environmental changes, while the latter is highly real-time but often limited in performance due to the strong non-linearity of UCAV dynamic models.
Fortunately, the emergence of deep reinforcement learning (DRL) techniques makes it possible for UCAVs to play an effective role in CEW with little prior knowledge, while eliminating the derivation of non-linear dynamic models, thus promising to overcome the bottlenecks of traditional planning and control methods. Recently, various intelligent agents derived from DRL have achieved remarkable success in Atari 2600 games, Go, navigation support and other applications [21], [22], and we pay particular attention to how DRL solves different planning tasks in the field of navigation [23], [24], [25]. Considering discrete actions, Zhao et al. let a UCAV master high-winning-rate air-combat decisions using the deep Q-learning network (DQN) algorithm [24]. DRL-based path planning algorithms such as the value iteration network (VIN) and the three-dimensional path planning network (TDPP-Net) performed well in 2D and 3D space, respectively [26], [27]. For continuous actions, Zhang et al. enabled UAVs to navigate from arbitrary departure places to destinations using only local sensory information and GPS signals based on the deep deterministic policy gradient (DDPG) [28], but only simple expressions of state and action were designed. The authors in [29] designed and validated a Gazebo-based multi-functional reinforcement learning framework, which aims to solve the UAV landing task on a moving platform using a new DDPG algorithm. Unfortunately, the environments considered in these studies are either purely static or contain only a single moving threat, so neither the generalizability of the DRL network nor the robustness of the navigation control is guaranteed, especially for target tracking.
To bring the UCAV's motion closer to reality, DRL algorithms are commonly tested on simulation systems that use video images as input. Yu et al. implemented a DQN-based agent to avoid obstacles by learning the steering motion of a simulated car from raw video images [30]. However, there are two thorny issues when using images as the state input: (1) the performance of visual/infrared sensors is easily disturbed by climate and distance, making it difficult to generalize to unseen target states using convolutional neural networks; (2) to fully exploit the maneuverability of the UCAV for evasion and attack within certain physical constraints, the performance of the DRL algorithm and the fidelity of the simulator are extremely demanding.
Since DRL can enhance different control formulations without requiring a complete model of the UCAV's kinetics, it can greatly improve the adaptability of autonomous systems [25], [31]. For example, in conjunction with the UCAV's target searching, a linear control model is first established to express the state update process of the UCAV, then a complete or partial non-linear control model is formed by a DRL network, and finally the controller design is inferred from the generated strategy [32]. Furthermore, with interactive experience, model-free DRL algorithms are able to directly determine optimal control strategies while learning unknown dynamic models, making them easy to apply to the field of unmanned driving [33], [34], [35], including autonomous underwater vehicles' navigation control and mobile robots' simultaneous localization and mapping [32], [36]; yet none of these unmanned vehicles consider coupled motion in the vertical direction during training. Overall, guiding a UCAV to generate autonomous longitudinal maneuvers in 3D space is not easy, and evaluating the learned behavioral strategies of UCAVs in a unified coordinate system is harder still, both of which limit the development of DRL in specific combat scenarios.
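One schematic reading of this hybrid formulation, a known linear state update augmented by a learned non-linear correction, might look as follows; the matrices and the policy here are placeholders, not the model from [32]:

```python
import numpy as np

# Nominal linear state-update model x_{k+1} = A x_k + B u_k (placeholder
# dynamics: 3D position/velocity state, acceleration-like input).
A = np.eye(6)
B = np.vstack([np.zeros((3, 3)), 0.1 * np.eye(3)])

def hybrid_step(x, u, residual_policy):
    """Hybrid update: known linear part plus a learned non-linear correction.

    residual_policy(x, u) -> (6,) array stands in for the trained DRL network
    that captures the unmodelled non-linear dynamics.
    """
    return A @ x + B @ u + residual_policy(x, u)
```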
The above literature review identifies a large amount of DRL-related prior work on autonomous navigation and provides a solid research base for the UCAV target tracking task. To overcome the shortcoming of DRL's instability in continuous control models and to implement a UCAV learning system based on observe-orient-decide-act (OODA) loop theory in CEW [37], the contributions of this study include the following three points:
- (1) The effects of error-coupling between relative velocity and displacement on the reward shaping of DRL are discussed based on the objective function of vector-based navigation in the APF method;
- (2) By configuring the target tracking framework of the UCAV with DDPG [33], we develop an advanced underactuated controller from radar sensing to vectorial motion;
- (3) We establish a new target tracking simulator and a variety of threat scenarios to evaluate the tracking performance and the strategic essence of deep agents.
The structure of this paper is outlined as follows. Section 2 briefly introduces the technical background of DRL and the DDPG algorithm. In Section 3, we establish a new task model based on the versatile CEW (vCEW) framework developed previously and update the DRL framework for controlling the UCAV. Section 4 analyzes the error-coupling phenomenon inherent in the objective function of traditional target tracking, and the matching process between the DDPG algorithm and the Tracker environment is revealed in Section 5. Then, Section 6 provides a comprehensive discussion of the experimental arrangements and their results. Section 7 reports concluding remarks.
Section snippets
Deep reinforcement learning
One of the primary goals in the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional sensory input, and the combination of reinforcement learning (RL) and deep learning has made important progress toward this goal [21].
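As background for the algorithm used later, a condensed sketch of one DDPG update step is given below. This is PyTorch-style pseudocode of the standard algorithm, not code from the paper; `actor`, `critic` and their target copies are assumed to be modules with the usual signatures:

```python
import torch

def ddpg_update(batch, actor, critic, actor_t, critic_t, opt_a, opt_c,
                gamma=0.99, tau=0.005):
    """One DDPG update on a sampled minibatch (s, a, r, s', done)."""
    s, a, r, s2, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped Bellman target, with the
    # next action supplied by the target actor (deterministic policy).
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(s2, actor_t(s2))
    q_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    opt_c.zero_grad(); q_loss.backward(); opt_c.step()
    # Actor: ascend the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    a_loss = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()
    # Polyak-average the target networks for stable bootstrapping.
    with torch.no_grad():
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
```

The soft target updates and the replay-based minibatch are precisely what mitigate the instability of continuous-action DRL mentioned above.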
Tracker environment
The target tracking environment, Tracker, proposed in this work is based on our previous vCEW framework [31]. In the Tracker, a UCAV is used as a maneuverable combat agent. Its mission is to keep tracking the target in a given 3D space while avoiding all threats safely. Note that threats in the environment include both dynamic and static obstacles. For each timestep of the mission, the UCAV relies on radar sensors to sense and process threat state information, then takes a desired action under
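To make the interface concrete, a gym-style skeleton of such an environment could look like the sketch below. The dynamics, radar model, reward and termination rule here are simplified placeholders, not the paper's actual Tracker:

```python
import numpy as np

class TrackerSketch:
    """Gym-style skeleton of a target tracking environment (interface only)."""

    def reset(self):
        self.ucav = np.zeros(6)                       # UCAV position + velocity
        self.target = np.random.uniform(-1.0, 1.0, 6) # target position + velocity
        return self._observe()

    def _observe(self):
        # Stand-in for radar-sensed relative state of the target and threats.
        return self.target - self.ucav

    def step(self, action, dt=0.1):
        # `action` is a continuous vectorial command within physical limits.
        self.ucav[3:] += np.clip(action, -1.0, 1.0) * dt
        self.ucav[:3] += self.ucav[3:] * dt
        self.target[:3] += self.target[3:] * dt
        dist = np.linalg.norm(self.target[:3] - self.ucav[:3])
        reward = -dist            # closer tracking -> higher reward
        done = dist > 50.0        # e.g. target lost beyond radar range
        return self._observe(), reward, done, {}
```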
Error-coupling analysis
For the problem of moving target tracking, the vector-based navigation method under APF thinking can shape arbitrary vectorial actions and is therefore suitable for inspiring autonomous maneuvers of the UCAV [41], [42]. However, since the observational errors produced in Tracker can be lethal to both velocity and displacement estimation, the extent to which the objective function of target tracking is affected has to be assessed to ensure the effectiveness of vector-based navigation in DRL.
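The flavor of this coupling is easy to reproduce numerically: when relative velocity is estimated by differencing noisy position measurements, its error is amplified by roughly a factor of sqrt(2)/dt relative to the displacement error, so any objective mixing both terms inherits scale-mismatched noise. The following is a synthetic illustration of that effect only, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, sigma = 0.1, 0.5                                  # timestep, radar noise std

p_true = np.cumsum(np.full((100, 3), 0.2), axis=0)    # true target positions
p_meas = p_true + rng.normal(0.0, sigma, p_true.shape)  # noisy observations

# Velocity estimated by finite differences of noisy positions.
v_true = np.diff(p_true, axis=0) / dt
v_est = np.diff(p_meas, axis=0) / dt

print("displacement RMSE:", np.sqrt(((p_meas - p_true) ** 2).mean()))  # ~ sigma
print("velocity RMSE:   ", np.sqrt(((v_est - v_true) ** 2).mean()))   # ~ sqrt(2)*sigma/dt
```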
Adaptive matching between DRL and tracker
This section focuses on how to match the execution agent in a Tracker task to the input and output ports of the DDPG algorithm.
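A plausible sketch of such port matching, with assumed observation bounds and actuation limits (both hypothetical here), is:

```python
import numpy as np

def match_io(radar_obs, obs_low, obs_high, actor, act_limit):
    """Illustrative port matching between a Tracker agent and DDPG.

    Observations are normalized to [-1, 1] before entering the actor, and the
    actor's tanh-bounded output is rescaled to the UCAV's physical limits.
    """
    s = 2.0 * (radar_obs - obs_low) / (obs_high - obs_low) - 1.0
    a = actor(s)                        # assumed tanh output in [-1, 1]
    return act_limit * np.asarray(a)    # command within physical constraints
```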
Simulation experiments and numerical results
The reliability of motion planning in real environments is related to the environmental complexity: in obstacle-sparse spaces, the intelligence of deep agents trained by DRL cannot be highlighted, while in obstacle-dense environments, the performance of DRL algorithms is not sufficient to allow the agent to complete the tracking task. Thus, in this section, we arrange a variety of task scenarios to test the adaptability of DDPG. In addition, to analyze the performance of the DRL
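One simple way to parameterize such scenarios is by the counts of static and dynamic threats; the generator below is illustrative only and does not reflect the paper's actual test configurations:

```python
import numpy as np

def make_scenario(n_static, n_dynamic, space=100.0, seed=None):
    """Generate one test scenario of a given obstacle density."""
    rng = np.random.default_rng(seed)
    static = rng.uniform(0.0, space, (n_static, 3))       # fixed threat positions
    dynamic = np.concatenate([rng.uniform(0.0, space, (n_dynamic, 3)),
                              rng.uniform(-1.0, 1.0, (n_dynamic, 3))], axis=1)
    return static, dynamic                                # dynamic: pos + vel

# From obstacle-sparse to obstacle-dense test cases.
scenarios = [make_scenario(n, n // 2, seed=i) for i, n in enumerate((2, 8, 32))]
```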
Conclusion
This paper discussed the control method of combining DDPG with the UCAV's constrained motion model, which helps UCAVs realize autonomous agile maneuvers in CEW. By analyzing the objective function of traditional target tracking, we introduced the classical ideas of vector-based navigation into the reward shaping process of DRL and theoretically explored the role of error coupling. Then a DRL-based framework for UCAVs' continuous-action decision-making inspired by the stage
CRediT authorship contribution statement
Shixun You: Conceptualization, Methodology, Software. Ming Diao: Data curation, Investigation, Formal analysis. Lipeng Gao: Writing - original draft. Fulong Zhang: Visualization, Software. Huan Wang: Supervision, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (42)
- et al., Modelling of UAV formation flight using 3D potential field, Simul. Model. Pract. Theory (2008).
- et al., TDPP-Net: Achieving three-dimensional path planning via a deep neural network architecture, Neurocomputing (2019).
- et al., Adaptive low-level control of autonomous underwater vehicles using deep reinforcement learning, Robot. Auton. Syst. (2018).
- et al., A 3D collision avoidance strategy for UAV with physical constraints, Measurement (2016).
- et al., Defense Science Board Study on Unmanned Aerial Vehicles and Uninhabited Combat Aerial Vehicles (2004).
- et al., Safety, security, and rescue missions with an unmanned aerial vehicle (UAV), J. Intell. Robot. Syst. (2011).
- et al., Efficient visual odometry and mapping for unmanned aerial vehicle using ARM-based stereo vision pre-processing system.
- et al., Trends in electronic warfare, IETE Tech. Rev. (2003).
- et al., Optimal path planning for unmanned combat aerial vehicles to defeat radar tracking, J. Guid. Control Dyn. (2006).
- et al., A review of intelligent systems software for autonomous vehicles.
- Evolutionary algorithm based offline/online path planner for UAV navigation, IEEE Trans. Syst. Man Cybern. B.
- A review of guidance laws applicable to unmanned underwater vehicles, J. Inst. Navig.
- Path planning with multiple objectives, IEEE Robot. Autom. Mag.
- An obstacle-based rapidly-exploring random tree.
- Bidirectional A* search for time-dependent fast paths, J. Am. Chem. Soc.
- Classic and heuristic approaches in robot motion planning - a chronological review, World Acad. Sci. Eng. Technol.
- Motion planning and obstacle avoidance.
- Autonomous control of unmanned combat air vehicles: Design of a multimodal control and flight planning framework for agile maneuvering, IEEE Control Syst. Mag.
- Control-oriented physical input modelling for a helicopter UAV, J. Intell. Robot. Syst.
- Adaptive fuzzy control for non-triangular structural stochastic switched nonlinear systems with full state constraints, IEEE Trans. Fuzzy Syst.
- An expert 2DOF fractional order fuzzy PID controller for nonlinear systems, Neural Comput. Appl.
Shixun You received his bachelor’s degree from the Taiyuan University of Technology, China, in 2015, and he is pursuing his master’s degree and Ph.D. degree in Information and Communication Engineering from Harbin Engineering University, China. His research interests include artificial intelligence, evolutionary computation, cognitive jamming, and cooperative jamming.
Ming Diao received his bachelor’s, master’s, and Ph.D degrees from Harbin Engineering University, China. He is currently a Professor with the College of Information and Communication, Harbin Engineering University. He won four Ministerial and Provincial-Level Science and Technology Awards. He is a member of the China Society of Image and Graphics (CHN) and a Fellow of China Institute of Communications (CHN). His research interests include wideband signal processing, pattern recognition, machine learning, and telecommunication.
Lipeng Gao received his bachelor’s, master’s, and Ph.D degrees from Harbin Engineering University, China. He is currently a Professor with the College of Information and Communication, Harbin Engineering University. He won two Ministerial and Provincial-Level Science and Technology Awards. His research interests include wideband signal processing, information fusion, artificial intelligence, and cooperative jamming.
Fulong Zhang received his bachelor’s degree from Harbin Engineering University of China in 2003. He is currently a researcher and expert in the First Academy of China Aerospace Science and Industry Corporation. As the academic and technical leader in the field of electronic countermeasures in No.8511 Research Institute of CASIC, he is responsible for the overall design, testing and verification of multi-type electronic countermeasure equipment and comprehensive electromagnetic protection equipment. He has won one second prize of National Defense Science and Technology Progress Award, two third prizes, six national defense scientific and technological achievements, and eight authorized national defense patents.
Huan Wang received his bachelor’s and master’s degree from Harbin Engineering University of China in 2016 and 2019, respectively. He is currently an assistant engineer and assistant designer in No.8511 Research Institute, China Aerospace Science and Industry Corporation. His research interests include wideband signal processing, radar signal recognition, and cooperative jamming.