Abstract

The UAV pursuit-evasion strategy based on the Deep Deterministic Policy Gradient (DDPG) algorithm is a current research hotspot. However, the algorithm suffers from low sample-exploration efficiency. To address this problem, this paper uses imitation learning (IL) to improve the DDPG exploration strategy. A quasiproportional guidance control law is designed to generate effective learning samples, which are used to fill the initial experience pool of the DDPG algorithm. A UAV pursuit-evasion strategy based on DDPG and imitation learning (IL-DDPG) is proposed; the algorithm draws data from the experience pool for experience replay, which improves exploration efficiency in the initial stage of training and avoids excessive useless exploration during training. Simulation results show that the trained pursuit-UAV can flexibly adjust its flight speed and attitude to pursue the evasion-UAV quickly. The results also verify that the improved DDPG algorithm improves training efficiency more effectively than the basic DDPG algorithm.

1. Introduction

At present, UAVs are used in an increasingly wide range of applications, such as sensor networks [1], data security [2], smart network systems [3], intelligent transportation systems [4], automatic identification systems [5], target encirclement control [6], and pursuit-evasion confrontation [7]. The UAV pursuit-evasion confrontation is a game between two drones with conflicting interests: the pursuit-UAV tries to capture the evasion-UAV through a pursuit maneuver strategy, and the evasion-UAV tries to escape through an evasion maneuver strategy.

Methods for the UAV pursuit-evasion strategy include the differential game method [8], the expert system method [9], and the influence diagram decision method [10]. However, a common problem of these methods is that analytical solutions are difficult to obtain. The DDPG algorithm is a policy-based reinforcement learning (RL) method that can use a neural network (NN) for end-to-end learning. Research on the pursuit-evasion strategy based on the DDPG algorithm [11] is a current research hotspot.

Based on the DDPG algorithm, Zhang et al. [12] studied the cooperative pursuit of incoming targets by a UAV swarm and designed a guided reward function for specific pursuit tasks. Song et al. [13] designed a reward function considering the tracking error and trajectory stability for the landing trajectory tracking control problem of UAVs and then proposed a trajectory tracking control method based on the DDPG algorithm; the trained controller achieves higher accuracy than the traditional PID control method.

A problem of RL is that sample exploration is inefficient, which makes learning and training inefficient. In the early stage of RL training, a relatively large random noise is applied to the exploration strategy to improve exploration ability, but this also produces many inefficient samples (i.e., useless action explorations), resulting in small rewards at the initial training stage. Therefore, how to improve exploration ability, obtain efficient samples, and improve sample utilization is an urgent problem for RL training.

Expert experience and mixed decision-making techniques have been used to accelerate the training process of reinforcement learning. Wang [14] used an RL algorithm based on expert knowledge to solve the UAV path planning problem: the algorithm trained the UAV on multiple tasks with known environmental parameters and then transferred the learned knowledge to the training of new tasks to accelerate the training process. Wu [15] studied a UAV reactive obstacle avoidance algorithm based on transfer learning and deep reinforcement learning, which enables the UAV to respond quickly and efficiently to unfamiliar scenarios. To improve the motor skills of manipulators and the learning ability of unmanned driving, Lu [16] and Zuo [17] integrated expert experience from their respective fields into RL algorithms and designed RL algorithms for different tasks. Mu [18] studied a UAV cooperative formation maintenance and collision avoidance method based on the fusion of model knowledge and data training; a switching system based on consensus theory and a multiagent cooperative collision avoidance method were learned in advance before training, which improves the training efficiency of the UAV formation control method.

Inspired by these works, a UAV pursuit-evasion strategy based on DDPG and imitation learning (IL-DDPG) is proposed; the algorithm can avoid excessive useless exploration and converge more quickly. The main contributions of this paper are as follows: (1) a quasiproportional guidance control law is designed for the instructor to realize effective pursuit; this control law is used to generate effective learning samples for the pretraining of the IL-DDPG algorithm; (2) the exploration strategy of the DDPG algorithm is improved; in the pretraining stage, the instructor maneuver samples generated by the quasiproportional guidance control law are used as the data of the initial experience pool, and the algorithm draws data from the experience pool for experience replay, which improves exploration efficiency in the initial stage of training and avoids excessive useless exploration during training.

The rest of this paper is organized as follows. In Section 2, the system model and problem statement are presented. UAV pursuit strategy based on DDPG is presented in Section 3. In Section 4, UAV pursuit strategy based on IL-DDPG is proposed. Then, Section 5 provides the experimental results. Conclusions are given in Section 6.

2. Problem Description and Modeling

2.1. The Pursuit-Evasion Problem of UAV

In the pursuit-evasion problem, the pursuit-UAV must chase and capture the evasion-UAV, and the evasion-UAV must escape and stay away from the pursuit-UAV.

For this problem, a zero-sum differential game model with control constraints is established. The geometric model of pursuit-evasion is shown in Figure 1.

In Figure 1, $P$ represents the pursuit-UAV and $E$ represents the evasion-UAV; $v_P$ and $v_E$ are the speeds of the pursuit-UAV and the evasion-UAV, $\psi_P$ and $\psi_E$ are their heading angles, and $q$ is the angle of the Line of Sight (LOS), where the LOS is the ray from the pursuit-UAV to the evasion-UAV. The goal of the pursuit-UAV is to capture the target in the shortest time. The goal of the evasion-UAV is to stay away from the pursuit-UAV, either avoiding capture within the preset time or maximizing the time until capture. The standard differential game is described by (1) and (2) [19], where $d(t)$ is the distance between the two UAVs and $t_f$ is the moment when the pursuit-UAV captures the evasion-UAV. Equation (1) is the objective function of the pursuit-UAV, and (2) is the objective function of the evasion-UAV.
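
A common way to write this zero-sum payoff is given below for orientation; $u_P$ and $u_E$ denote the controls of the two UAVs, and the form shown is the standard textbook statement rather than a reproduction of (1) and (2).

```latex
J_P = \min_{u_P} t_f, \qquad J_E = \max_{u_E} t_f
```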

2.2. The Kinematic Model of UAV

The motion state equations of the UAVs are defined as follows, where $\omega$ represents the angular velocity of the UAVs and $a$ represents the acceleration of the UAVs.

The motion control variables of the UAVs are the acceleration and the angular velocity, where $v_{P\max}$ and $v_{E\max}$ are the maximum speeds of the UAVs and $\omega_{P\max}$ and $\omega_{E\max}$ are their maximum angular velocities. Here $\Delta t$ is the simulation time step, $R$ is the turning radius, $R_{\min}$ is the minimum turning radius, $\Delta\psi_{\max}$ is the maximum turning angle within $\Delta t$, and $n_{\max}$ is the maximum lateral overload. Therefore, the maximum angular velocity can be obtained.
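
To make the kinematics concrete, the following Python sketch advances one UAV by one simulation step using a standard planar (unicycle-style) model with saturated controls. The state layout, function name, and numerical limits are illustrative assumptions rather than the paper's exact equations and parameter values.

```python
import numpy as np

def step_uav(state, a_cmd, omega_cmd, dt=0.1,
             v_max=20.0, a_max=2.0, omega_max=0.5):
    """Advance one UAV by one simulation time step dt.

    state = [x, y, psi, v]: position (m), heading angle (rad), speed (m/s).
    a_cmd, omega_cmd: commanded acceleration and angular velocity,
    saturated to the maximum values as in the control constraints.
    All limits here are illustrative placeholders.
    """
    x, y, psi, v = state
    a = np.clip(a_cmd, -a_max, a_max)                   # acceleration limit
    omega = np.clip(omega_cmd, -omega_max, omega_max)   # turn-rate limit
    # standard planar (unicycle-style) kinematics
    x += v * np.cos(psi) * dt
    y += v * np.sin(psi) * dt
    psi += omega * dt
    v = np.clip(v + a * dt, 0.0, v_max)                 # speed limit
    return np.array([x, y, psi, v])
```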

The initial state of each UAV consists of its initial position, heading angle, and speed, which are initialized randomly within a set range.

If the distance between the evasion-UAV and the pursuit-UAV is within the capture range of the pursuit-UAV and does not increase, the capture is successful, as shown in (8); the capture range can be the detection range or the attack range of the UAV. Here $d_c$ is the capture range of the UAV, and the distance is the 2-norm of the two-dimensional relative position vector, calculated from $x_r(t)$ and $y_r(t)$, the instantaneous relative distances between the pursuit-UAV and the evasion-UAV along the $x$-axis and the $y$-axis at time $t$, respectively.
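
A minimal sketch of the capture test follows, assuming the relative distance is the Euclidean (2-norm) distance; the capture radius value is a placeholder.

```python
import numpy as np

def relative_distance(p_state, e_state):
    """2-norm of the relative position vector between the two UAVs."""
    xr = e_state[0] - p_state[0]
    yr = e_state[1] - p_state[1]
    return np.hypot(xr, yr)

def is_captured(d_now, d_prev, d_c=5.0):
    """Capture succeeds if the distance is within the capture range d_c
    and is not increasing, as described above."""
    return d_now <= d_c and d_now <= d_prev
```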

3. UAV Pursuit Strategy Based on DDPG

3.1. MDP Model

The MDP model can be divided into MDP state space model, MDP action space model, MDP state transition function, and MDP reward function.

3.1.1. MDP State Space Model

The UAV is assumed to carry on-board GPS equipment and gyroscopes to obtain its own position and speed, and an airborne radar to obtain the detected target's position and speed.

In order to increase the adaptability of the algorithm, the relative position is used to establish the state space model. Here $\varphi_P$ and $\varphi_E$ are the angles between the speed directions of the pursuit-UAV and the evasion-UAV and the LOS, respectively, $\varphi_{PE}$ is the angle between the speed directions of the two UAVs, $d$ is the distance between the two UAVs, $v_P$ is the speed of the pursuit-UAV, and $\Delta v$ is the speed difference between the two UAVs.
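
The following sketch assembles a 6-dimensional state vector consistent with the description above and with the Actor input size reported in Section 5.1; the component names and ordering are assumptions, and angle wrapping is omitted for brevity.

```python
import numpy as np

def build_state(p_state, e_state):
    """Assemble the 6-dimensional MDP state from the two UAV states.

    p_state, e_state = [x, y, psi, v]. The components follow the
    description above: angles of each UAV's velocity to the LOS,
    angle between the two velocities, relative distance, pursuer
    speed, and speed difference.
    """
    xr, yr = e_state[0] - p_state[0], e_state[1] - p_state[1]
    q = np.arctan2(yr, xr)                 # LOS angle
    d = np.hypot(xr, yr)                   # relative distance
    phi_p = p_state[2] - q                 # pursuer velocity vs. LOS
    phi_e = e_state[2] - q                 # evader velocity vs. LOS
    phi_pe = p_state[2] - e_state[2]       # angle between velocities
    dv = p_state[3] - e_state[3]           # speed difference
    return np.array([phi_p, phi_e, phi_pe, d, p_state[3], dv])
```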

3.1.2. MDP Action Space

The control input of each UAV is a two-dimensional vector, namely, the action space consisting of the acceleration and the angular velocity of the pursuit-UAV and the evasion-UAV; both control variables satisfy the constraints (4).

(1) MDP state transition function

The state transition function is determined by the UAV kinematic model: the control signals are integrated to obtain the next UAV states and the corresponding environment state.

3.1.3. MDP Reward Function

A combination of sparse reward and guided reward functions is designed, in which $r$ represents the total reward of the UAV, $r_d$ is the guided reward, $d_t$ and $d_{t-1}$ represent the distances between the pursuit-UAV and the evasion-UAV at time $t$ and time $t-1$, $k$ is the proportionality coefficient, $r_{\mathrm{far}}$ represents the sparse reward for the pursuit-UAV being too far away from the evasion-UAV, and $r_{\mathrm{win}}$ represents the sparse reward for the pursuit-UAV completing the task.

The guided reward $r_d$ is determined by the variation of the relative distance between the pursuit-UAV and the evasion-UAV: when the relative distance becomes smaller, the pursuit-UAV gets a positive reward; when the relative distance becomes larger, it gets a negative reward. The sparse reward $r_{\mathrm{far}}$ punishes the algorithm with a large constant penalty when the pursuit-UAV's action strategy is incorrect and its distance to the evasion-UAV exceeds the relative distance threshold. The sparse reward $r_{\mathrm{win}}$ rewards the algorithm with a large positive constant when the UAV completes the task.
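
A sketch of such a reward function is given below; the functional form of the guided term and all constants (proportionality k, distance threshold, penalty and bonus values) are illustrative placeholders.

```python
def reward(d_now, d_prev, captured, k=1.0,
           d_far=1000.0, R_FAR=-100.0, R_WIN=100.0):
    """Guided + sparse reward, following the description above.

    d_now, d_prev: relative distance at the current and previous step.
    captured: True if the capture condition is satisfied.
    The constants k, d_far, R_FAR, R_WIN are illustrative placeholders.
    """
    r = k * (d_prev - d_now)   # guided reward: closing the distance is rewarded
    if d_now > d_far:          # sparse penalty: too far from the evader
        r += R_FAR
    if captured:               # sparse reward: task completed
        r += R_WIN
    return r
```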

3.2. DDPG Algorithm

The core of reinforcement learning is that the agent obtains rewards by interacting with the environment and adjusts its strategy according to the magnitude of the rewards to optimize its decision-making. Deep reinforcement learning (DRL) combines the function approximation of deep learning (DL) with the decision-making optimization of reinforcement learning (RL). One of the most representative DRL algorithms is the deep Q-network (DQN) algorithm.

DQN uses two networks with the same structure but different parameters: one network generates the current value, and the other generates the target value. These two values are used to minimize the loss function, and the parameters of the current network are copied to the target network after a period of time. DQN uses experience replay to break the correlation of RL data and uses random sampling to extract data from the experience pool for training.

The Deep Deterministic Policy Gradient (DDPG) algorithm, which was developed from the core ideas of DQN, also uses the Actor-Critic dual-network mechanism and combines the advantages of value-function and policy-function methods.

The DDPG algorithm has four subnetworks, and the network structure of the algorithm is shown in Figure 2.

The Actor network and the Critic network each have a target network (TargetNet) and an evaluation network (EvalNet), so DDPG has a total of four subnetworks.

The Actor selects an action according to the action probability it outputs. The Critic_EvalNet evaluates the value of the current state and the action selected by the Actor, and the Critic_TargetNet evaluates the value of the next state and the action selected by the Actor_TargetNet for that state. The Actor then adjusts the probability of the action according to the Critic's evaluation of the action [20, 21]. $\theta^{Q}$, $\theta^{Q'}$, $\theta^{\mu}$, and $\theta^{\mu'}$ are the EvalNet and TargetNet parameters of the Critic and the Actor, respectively. The Actor and the Critic use different functions to train and update their parameters.

The Critic uses the mean square error loss function to update the parameters of the Critic_EvalNet through the gradient of the neural network, where $N$ is the minibatch sample size and $y_i$ is the target value.
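
For reference, the standard DDPG form of this loss and of the target value is as follows (a standard statement in the notation introduced above, not a reproduction of the original equations), where $\gamma$ is the discount factor and the primed networks are the TargetNets:

```latex
L(\theta^{Q}) = \frac{1}{N}\sum_{i=1}^{N}\Big(y_i - Q\big(s_i, a_i \mid \theta^{Q}\big)\Big)^{2},
\qquad
y_i = r_i + \gamma\, Q'\Big(s_{i+1},\, \mu'\big(s_{i+1} \mid \theta^{\mu'}\big) \,\Big|\, \theta^{Q'}\Big)
```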

The Actor uses the gradient of Equation (19) to update the parameters $\theta^{\mu}$ of the Actor_EvalNet.

Like DQN, the EvalNet trains its network parameters in real time, and the TargetNet parameters follow the EvalNet through soft updates. The advantage of the soft update is that it makes training more stable and easier to converge. The soft update is described by (20), where $\tau$ is the inertial update rate.
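
In the same notation, the policy gradient used by the Actor and the soft update are commonly written as follows; this is the standard DDPG form rather than a reproduction of the original equations:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i=1}^{N}
\nabla_{a} Q\big(s, a \mid \theta^{Q}\big)\big|_{s=s_i,\,a=\mu(s_i)}\;
\nabla_{\theta^{\mu}} \mu\big(s \mid \theta^{\mu}\big)\big|_{s=s_i},
\qquad
\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'
```

where $\tau \ll 1$ and the update is applied to both TargetNets.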

A major feature of DDPG is the use of action noise: adding random noise to the action generated by the Actor turns the deterministic decision into a stochastic process, which enhances the exploration of the algorithm. Commonly used random noises are Gaussian noise and Ornstein-Uhlenbeck (OU) noise.

OU noise, also called the OU process, is a stochastic process that explores a certain distance around the mean value in the positive or negative direction. This is conducive to exploring in one direction and can improve the efficiency of exploration and training for inertial systems.
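
A discretized OU process is straightforward to implement; the following sketch uses illustrative parameter values rather than those of the experiments.

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for action exploration.

    dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
    The parameters below are illustrative defaults, not the paper's values.
    """
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.1):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x
```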

The agent obtains the sample set for training the networks while interacting with the environment and stores these samples in the experience pool. During training, the agent selects minibatches of samples according to a random sampling strategy and trains the neural network parameters through experience replay.
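
A minimal experience pool with uniform random sampling can be sketched as follows.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience pool with uniform random sampling."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of the data
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```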

3.3. Training Process of DDPG Offline Algorithm

In our experiment, the control period is set equal to the simulation step. It should be noted that the subscripts of the states and actions represent the time step rather than the actual flight time; the actual flight time is $t\,\Delta t$. Algorithm 1 shows the flow of the training algorithm for the UAV pursuit strategy based on DDPG.

4. UAV Pursuit Strategy Based on IL-DDPG

4.1. Imitation Learning

Model-free and model-based reinforcement learning methods both learn, from scratch, a strategy that maximizes the cumulative return. For complex tasks, the agent has a huge search space and cannot frequently obtain meaningful rewards in the initial stage, which leads to a slow convergence rate of reinforcement learning.

IL means that the agent uses the decision data provided by experts to learn the best strategy [22]. It can be used to solve problems in which a reward cannot be specified. We can integrate IL with RL to accelerate strategy learning by providing effective samples through experts' demonstrations.

At present, scholars have verified the feasibility of this method. A deep Q-learning from demonstrations (DQfD) algorithm was proposed [23], which combines TD updates with supervised classification of the instructor's actions. The demonstrations are used to pretrain the Q network of the DQN and are also placed into the experience pool, and these expert data are used to accelerate the learning process on a large scale. DQfD was shown to have better initial performance than DQN.

On this basis, the DDPG algorithm is combined with the demonstrations in a similar way to construct the DDPGfD algorithm [24].

4.2. IL-DDPG Algorithm

DDPG based on imitation learning (IL-DDPG) is designed to solve the maneuver decision-making problem of UAV pursuit-evasion. The design of this algorithm mainly includes two aspects: the algorithm framework and the maneuver strategy of the instructor.

Algorithm 1: Training algorithm for UAV pursuit strategy based on DDPG
initialize experience pool D with memory size M
initialize the Eval networks of the Actor network and the Critic network
for episode = 1 to MaxEpisode do
 initialize the OU noise
 initialize the states of the pursuit-UAV and the evasion-UAV randomly within the set range,
 obtain the initial state of the simulation environment
 for t = 1 to MaxStep do
  select the action of the pursuit-UAV as the Actor output plus OU noise, followed by action constraint processing
  select the maneuver strategy for the evasion-UAV
  input the control signals into the UAV kinematic model, integrate to get the next UAV states, and calculate the environment state
  obtain the immediate reward from the environment
  store the experience sample (state, action, reward, next state) in D
  randomly sample from D to get a sample set of size BatchSize
  update the Eval network parameters of the Critic
  update the Eval network parameters of the Actor
  update the Target network parameters of the Critic network and the Actor network by (20)
  if the episode end condition is satisfied, break
 end for
end for

4.2.1. Framework of IL-DDPG Algorithm

Figure 3 shows the algorithm framework of IL-DDPG. In this framework, the instructor's strategy is used to generate a large amount of experience, which is stored in the experience pool in the initial stage, and these experiences are then used to train the networks by RL.

Figure 4 shows the process of UAV offline training and exploration. Before any interaction with the environment, IL-DDPG first trains only on the demonstrations; this is the pretraining process. A value function that satisfies the Bellman equation is used to imitate the instructor so that it can be updated with the TD error once the UAV starts interacting. The subsequent learning and training of IL-DDPG are consistent with the DDPG algorithm.
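
The pretraining data collection can be sketched as follows; `env` and `instructor_action` are hypothetical interfaces standing in for the simulation environment of Section 2 and the guidance law of Section 4.2.2. After this collection step, the networks are updated from the pool as in Algorithm 1 before the UAV begins its own exploration.

```python
def fill_pool_with_demonstrations(buffer, env, instructor_action,
                                  n_episodes=100, max_steps=400):
    """Pretraining data collection: roll out the instructor (guidance-law)
    strategy and store its transitions in the experience pool.

    `env` and `instructor_action` are hypothetical interfaces; the episode
    and step counts are illustrative placeholders.
    """
    for _ in range(n_episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = instructor_action(state)           # instructor maneuver
            next_state, reward, done = env.step(action)
            buffer.store(state, action, reward, next_state, done)
            state = next_state
            if done:
                break
```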

4.2.2. Instructor Confront Strategy

The main improvement of our algorithm is the design of the initial exploration strategy of DDPG, that is, the instructor confront strategy. Proportional guidance is one of the classical missile guidance methods and is often used for intercepting maneuvering targets. Therefore, the proportional guidance method can be used as our instructor strategy.

The pure proportional navigation method [25, 26] requires that, during the guidance process, the rotational angular velocity of the controlled object's velocity vector be proportional to the rotational angular velocity of the LOS; the core equation of the guidance relates these two angular rates through the scale factor $K$ and constitutes the ideal control relation describing the guidance method. Figure 5 shows the relative movement relationship of the pure proportional method.
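
Written in the notation of Section 2.1, the common textbook statement of this core relation is

```latex
\dot{\psi}_P = K\,\dot{q}
```

where $\dot{\psi}_P$ is the rotational rate of the pursuit-UAV's velocity vector, $\dot{q}$ is the rotational rate of the LOS, and $K$ is the scale factor.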

The disadvantage of pure proportional guidance is that the normal overload required to hit the target is directly related to the target speed at the hit point and to the UAV's attack mode, which makes the value of the scale factor $K$ difficult to select. The generalized proportional guidance method can be used to improve the characteristics of proportional guidance.

In the generalized method, the normal overload is selected according to the rotational angular velocity of the LOS, from which the normal overload required when the UAV hits the target can be obtained.

It can be seen that the required overload at the hit point has nothing to do with the UAV speed and attack direction.

Considering the characteristics of the UAV’s capture range, a quasiproportional guidance control law [27] is designed as shown in Figure 6. Compared with pure proportional guidance, it fully considers the difference between UAV guidance and missile guidance.

In Figure 6, the red circle is the effective capture range of the pursuit-UAV, and $r_c$ is the capture radius of the UAV. The figure also marks the relative velocity of the pursuit-UAV and the evasion-UAV and its direction angle, the two guiding boundary lines EA and EB together with their angles, the LOS angle $q$, and the angles between the LOS and the two boundary lines.

The quasiproportional guidance law guides the pursuit-UAV so that the evasion-UAV falls into its capture range. For this purpose, the relative velocity vector and its direction angle are controlled. In the guidance process, whether the target is approached along EA or along EB depends on the angular difference between the relative velocity direction and each boundary line: the boundary line with the smaller difference is selected as the guiding boundary.

If EA is chosen as the guidance boundary line, the corresponding quasiproportional guidance instruction can be derived, where $d$ is the distance between the pursuit-UAV and the evasion-UAV.
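
As an illustration of how an instructor command of this kind is computed in simulation, the following sketch implements a simplified proportional-navigation-style command driven by the LOS rate. It is a stand-in under stated assumptions (finite-difference LOS rate, hypothetical function name, navigation ratio, and limits), not the exact quasiproportional law with the boundary-line selection described above.

```python
import numpy as np

def pn_instructor_command(p_state, e_state, prev_q, dt=0.1, N=4.0,
                          a_max=2.0, omega_max=0.5):
    """Simplified PN-style instructor: turn rate proportional to LOS rate.

    p_state, e_state = [x, y, psi, v]; prev_q is the LOS angle at the
    previous step, used to approximate the LOS rate by finite differences.
    Returns (acceleration command, turn-rate command, new LOS angle).
    """
    xr, yr = e_state[0] - p_state[0], e_state[1] - p_state[1]
    q = np.arctan2(yr, xr)                                           # LOS angle
    q_dot = np.arctan2(np.sin(q - prev_q), np.cos(q - prev_q)) / dt  # LOS rate
    omega_cmd = np.clip(N * q_dot, -omega_max, omega_max)            # PN turn command
    a_cmd = a_max                                                    # close at maximum acceleration
    return a_cmd, omega_cmd, q
```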

The state transition of the maneuver decision control is obtained by substituting the guidance commands into the UAV kinematic model of Section 2.2.

5. Simulation Experiments

5.1. Experimental Settings

The simulation system is constructed based on Python, using the PyCharm Community 2020.2 and Anaconda3 platforms. The deep learning environment adopts TensorFlow 1.14.0. The computer is configured with an Intel CPU, a GTX 1660Ti GPU, and 16 GB of RAM.

The training parameters of algorithms in the experiment are shown in Table 1.

The experiment parameters of the UAV pursuit-evasion game simulation environment are shown in Table 2.

The evasion-UAV adopts the classic escape strategy [28].

All networks are multilayer feedforward neural networks with a single hidden layer. The number of neurons in each layer of the Target-Actor and Eval-Actor networks is [6, 128, 2]; their hidden layer uses ReLU as the activation function, and their output layer uses Tanh. The inputs of the Critic networks are the MDP state and the action generated by the Actor network, so the number of neurons in each layer of the Target-Critic and Eval-Critic networks is [8, 128, 1]; their activation functions are the same as those of the Actor networks. Training uses the ADAM optimizer.
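
A minimal sketch of these networks using tf.keras is given below. The layer sizes follow the description above; a linear Critic output is used here, which is the usual choice for Q-value regression, and all variable and function names are ours.

```python
import tensorflow as tf

def build_actor(state_dim=6, action_dim=2):
    """Actor: [6, 128, 2], ReLU hidden layer, Tanh output (scaled later
    to the acceleration and turn-rate limits)."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(state_dim,)),
        tf.keras.layers.Dense(action_dim, activation='tanh'),
    ])

def build_critic(state_dim=6, action_dim=2):
    """Critic: [8, 128, 1]; the input is the concatenated state and action."""
    s = tf.keras.Input(shape=(state_dim,))
    a = tf.keras.Input(shape=(action_dim,))
    x = tf.keras.layers.Concatenate()([s, a])
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    q = tf.keras.layers.Dense(1)(x)   # linear Q-value output (see note above)
    return tf.keras.Model(inputs=[s, a], outputs=q)
```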

5.2. Simulation Results
5.2.1. Instructor Confront Strategy

The pursuit-UAV uses only the designed quasiproportional guidance strategy. The evasion-UAV adopts two different strategies: uniform linear motion and the classic escape strategy. The speed of the evasion-UAV is set to 10 m/s.

As shown in Figure 7, when the evasion-UAV simply escapes in a straight line, the pursuit-UAV successfully captures it after adjusting its speed direction. In Figure 8, however, the evasion-UAV escapes successfully. This is because the pursuit-UAV that uses the quasiproportional guidance method as its pursuit guidance law needs time to adjust its heading, which creates an opportunity for the evasion-UAV to escape within the predetermined maximum time.

Although the quasiproportional guidance law cannot enable the pursuit-UAV to capture the evasion-UAV when the latter uses the classic escape strategy, it can, as an instructor strategy, guide the pursuit-UAV to explore good initial experience.

5.2.2. Comparison

The average reward is used to verify the convergence and effectiveness of the proposed algorithm; it is defined as the average value of the reward over the latest 100 episodes.
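
A small helper that computes this sliding average could look as follows.

```python
import numpy as np

def average_reward(episode_rewards, window=100):
    """Moving average of the episode rewards over the latest `window` episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    return np.array([rewards[max(0, i - window + 1):i + 1].mean()
                     for i in range(len(rewards))])
```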

With the same training parameters and the same experiment parameters, the average rewards of the trained results obtained by the IL-DDPG and DDPG algorithms are shown in Figure 9.

As shown in Figure 9, the IL-DDPG algorithm converges faster than the DDPG algorithm and is more stable.

In order to compare the trained results of the two algorithms under the same initial conditions, the trained policies are used to simulate the UAV pursuit-evasion process. The simulation results are shown in Figures 10 and 11.

It can be seen from Figures 10 and 11 that the policy trained by the IL-DDPG algorithm achieves a shorter capture time.

Furthermore, as shown in Figures 12 and 13, the pursuit-UAV using the IL-DDPG algorithm can adjust its speed and heading in time to capture the evasion-UAV, whether the evasion-UAV adopts the uniform linear motion strategy or the random motion strategy.

These experiments show that the UAV pursuit strategy based on the IL-DDPG algorithm generalizes well and that the trained UAV can successfully complete the pursuit task in the pursuit-evasion game.

In Figures 14 and 15, the velocity of the evasion-UAV is increased to 11 m/s and 12 m/s, respectively. It can be seen that the pursuit-UAV can still capture the evasion-UAV within the given time, which further verifies the generalization ability of the IL-DDPG algorithm.

6. Conclusion

The training algorithm of the UAV pursuit strategy based on the IL-DDPG algorithm introduces a quasiproportional guidance control law as the instructor strategy, which improves the exploration efficiency in the early stage of DDPG training and avoids excessive useless exploration. Simulation results show the effectiveness and generalization of the algorithm.

In future work, we will study how to effectively combine imitation learning with the multiexperience pool technique to further accelerate the training of the algorithm.

Data Availability

The numerical data used to support the findings of this study is included within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This research was funded in part by the Aviation Science Foundation of China under Grant 2020Z023053001 and Natural Science Foundation of Shaanxi Province (No. 2020JM-537).