Brief paper

Distributed inverse optimal control☆
Introduction
Inverse optimal control (IOC) seeks to learn the underlying objective function of an optimal control system from its optimal trajectories (Ng, Russell, et al., 2000). Its applications include imitation learning (Finn, Levine, & Abbeel, 2016), where robots learn objective functions from expert demonstrations; autonomous driving (Kuderer, Gulati, & Burgard, 2015), where human driving styles are transferred to vehicle controllers; and human–robot collaboration (Mainprice, Hayne, & Berenson, 2016), where the human control objective is learned for coordination.
In the IOC literature (Abbeel and Ng, 2004, Englert et al., 2017, Jin, Wang et al., 2020, Keshavarz et al., Molloy et al., 2018, Puydupin-Jamin et al., 2012, Ratliff et al., 2006, Ziebart et al., 2008), an unknown objective function is typically parameterized as a weighted sum of selected features (or basis functions). The problem of solving IOC is then reduced to estimating the weights of the selected features. One direction for solving IOC is a double-layer scheme, with the weight estimate updated in the outer layer and the corresponding optimal control problem solved in the inner layer. Methods developed under this idea include feature matching (Abbeel & Ng, 2004), where the weights are updated to reduce the difference in feature values between the demonstrations and the predicted trajectory; maximum margin (Ratliff et al., 2006), where the weights are solved by maximizing the margin between the objective values of the demonstrations and the predicted trajectory; maximum entropy (Ziebart et al., 2008), where the weights are updated to maximize the entropy of the trajectory probability distribution while matching the feature values of the demonstrations; and minimizing prediction loss (Jin, Wang et al., 2020), where the weights are solved by minimizing the distance between the predicted trajectory and the observed one. Recent research on IOC instead minimizes the violation, by the demonstrations, of optimality conditions such as the Karush–Kuhn–Tucker (KKT) conditions and Pontryagin's maximum principle, which has been shown to be more computationally efficient (Englert et al., 2017, Keshavarz et al., Molloy et al., 2018, Puydupin-Jamin et al., 2012).
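As a minimal illustration of the optimality-condition idea (not this paper's exact formulation), consider an unconstrained demonstrator minimizing a weighted sum of two quadratic features. Its stationarity condition is linear in the unknown weights, so the weights can be recovered, up to scale, from the kernel of the stacked feature-gradient matrix; the feature choices below are purely illustrative:

```python
import numpy as np

# True weights (to be recovered up to scale) and two quadratic
# features phi1(u) = (u - a)^2, phi2(u) = (u - b)^2 (illustrative)
w_true = np.array([2.0, 3.0])
a, b = 1.0, 4.0

# Demonstrator's optimal decision: minimizes w1*(u-a)^2 + w2*(u-b)^2
u_star = (w_true[0] * a + w_true[1] * b) / w_true.sum()

# Stationarity condition: w1*dphi1/du + w2*dphi2/du = 0 at u_star.
# Stacking the feature gradients gives a linear constraint G @ w = 0.
G = np.array([[2 * (u_star - a), 2 * (u_star - b)]])

# The weights lie in ker(G): take the right-singular vector
# associated with the smallest singular value.
_, _, Vt = np.linalg.svd(G, full_matrices=True)
w_hat = Vt[-1]
w_hat = w_hat / w_hat[0] * w_true[0]  # fix the scale ambiguity

print(np.allclose(w_hat, w_true))  # True
```

With more demonstrations (or longer horizons), each adds further rows to G, over-determining the direction of the weight vector.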
Despite the significant progress reviewed above, existing IOC techniques are mostly designed in a centralized way. This limits their applicability in practical situations where the complete observed trajectories exceed the memory/computation capability of any single central processor. A natural idea to address this is to employ a multi-agent system for IOC, in which each agent observes only trajectory segments. By a trajectory segment, we mean a portion of the trajectory within a certain interval of the overall time horizon, which may even be a single point of the trajectory. This challenging situation, where only segments rather than complete trajectories are available, also arises frequently from missing data, limited sensor capability, or occlusion. The authors of Bogert, Lin, Doshi, and Kulic (2016) initiated efforts to address this situation by modeling missing data using hidden variables when the missing portion is small (Bogert & Doshi, 2017). Very recently, the authors of Jin et al., 2019, Jin et al., 2021 introduced the concept of a recovery matrix for solving IOC when only trajectory segments are available; the segments are required to be consecutive and to satisfy a matrix rank condition.
This has motivated us to develop a distributed IOC method for multi-agent systems, in which each agent can only communicate with its nearby neighbors. We further suppose each agent can access only a few trajectory segments, which may not suffice for it to infer the objective function by itself. Our contribution has two aspects: first, we propose a way to evaluate whether a trajectory segment can contribute to IOC; if so, the segment imposes a linear constraint on the unknown objective-function weights, and we establish IOC identifiability from trajectory segments. Second, we develop a distributed algorithm that enables all agents to collaboratively solve for the weights exponentially fast by communicating with their neighbors. To our knowledge, this is the first distributed algorithm for solving IOC.
Notation: The column operator $\mathrm{col}\{\cdot\}$ stacks its arguments into a column. $x_{i:j}$ means a stack of vectors indexed from $i$ to $j$ ($i \le j$), i.e., $x_{i:j} = \mathrm{col}\{x_i, x_{i+1}, \dots, x_j\}$. $\mathbf{A}$ (bold-type) denotes a block matrix. Given a vector function $g(x)$ and a value $\bar{x}$, $\frac{\partial g}{\partial x}\big|_{\bar{x}}$ denotes the Jacobian matrix of $g$ with respect to $x$ evaluated at $\bar{x}$. The zero matrix/vector is $\mathbf{0}$, and the identity matrix is $I$, both with appropriate dimensions. $\mathbf{1}_n$ is the all-one vector of dimension $n$. $\ker A$ is the kernel of matrix $A$.
Section snippets
Problem statement
Consider an optimal control system with dynamics $x_{t+1} = f(x_t, u_t)$, where $f$ is differentiable; $x_t \in \mathbb{R}^n$ is the system state; $u_t \in \mathbb{R}^m$ is the input; and $t$ is the time step. A trajectory of states-inputs of the optimal control system over time horizon $T$ is defined by $\xi = \{x_{0:T}, u_{0:T-1}\}$, which results from minimizing an unknown objective function parameterized as a weighted sum of features: $J(w) = \sum_{t=0}^{T-1} w^{\top}\phi(x_t, u_t)$. Here, $\phi$ is a given feature
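Such a weighted-feature objective can be evaluated along a trajectory as follows; the quadratic features below (state deviation and control effort) are illustrative choices, not the paper's:

```python
import numpy as np

def objective(traj_x, traj_u, w, features):
    # J(w) = sum_t w^T phi(x_t, u_t): weighted sum of feature values
    return sum(w @ features(x, u) for x, u in zip(traj_x, traj_u))

# Illustrative quadratic features: state magnitude and control effort
features = lambda x, u: np.array([x @ x, u @ u])
xs = [np.array([1.0, 0.0]), np.array([0.5, 0.1])]
us = [np.array([0.2]), np.array([0.1])]

print(objective(xs, us, np.array([1.0, 0.5]), features))  # 1.285
```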
IOC identifiability from trajectory segments
In this section we introduce an index to evaluate whether or not a trajectory segment contributes to solving IOC, and then we establish IOC identifiability from trajectory segments.
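The paper's specific index is not reproduced in this snippet; the following generic sketch conveys the idea under the assumption that each segment contributes a linear constraint on the weights: a segment is informative (IOC-effective) if appending its constraint shrinks the kernel of the stacked constraint matrix, and the weights are identifiable up to scale exactly when that kernel becomes one-dimensional:

```python
import numpy as np

def kernel_dim(M, tol=1e-9):
    # dim ker(M) = number of columns minus rank(M)
    s = np.linalg.svd(M, compute_uv=False)
    return M.shape[1] - int(np.sum(s > tol))

def is_effective(stack, C):
    # A segment's constraint C is informative if appending it
    # strictly shrinks the kernel of the stacked constraint matrix.
    before = kernel_dim(stack) if stack.size else stack.shape[1]
    after = kernel_dim(np.vstack([stack, C])) if stack.size else kernel_dim(C)
    return after < before

# Three hypothetical segment constraints on a 3-dim weight vector
C1 = np.array([[1.0, -1.0, 0.0]])
C2 = np.array([[2.0, -2.0, 0.0]])   # redundant with C1: not effective
C3 = np.array([[0.0, 1.0, -1.0]])

stack = np.zeros((0, 3))
for C in (C1, C2, C3):
    if is_effective(stack, C):
        stack = np.vstack([stack, C])

# Identifiable (up to scale) exactly when ker(stack) is one-dimensional
print(kernel_dim(stack) == 1)  # True
```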
A distributed algorithm for IOC
We now consider a graph of agents, where each agent has access only to a set of trajectory segments as in (4). By Theorem 1, each IOC-effective segment available to an agent imposes a linear constraint on the unknown weight vector; stacking these constraints gives that agent a local linear equation in the weights. In particular, this local equation is vacuous if the agent holds no IOC-effective trajectory segment. Thus, each agent with trajectory segments knows only its own local equation. All that remains is to extend the distributed algorithms in Mou, Liu, and Morse (2015) and Zhou, Wang, Mou, and Anderson (2020) so that the agents cooperatively solve the group of linear equations in (27).
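In the spirit of the projection–consensus algorithm of Mou, Liu, and Morse (2015) (the following is a simplified sketch, not this paper's exact update), each agent repeatedly projects the average of its neighborhood's estimates onto the solution set of its own local equation; since the correction stays in the kernel of the local constraint matrix, each local equation is preserved at every iteration while the estimates agree exponentially fast:

```python
import numpy as np

# Global system A x = b, with each of three agents holding one row
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
b = np.array([3.0, 2.0, 2.0])
x_star = np.linalg.solve(A, b)  # unique solution, here [1, 1, 1]

# Undirected line graph 0 - 1 - 2; each neighbor set includes itself
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}

def ker_projector(a):
    # Orthogonal projector onto ker(a) for a single row a
    a = a.reshape(1, -1)
    return np.eye(a.shape[1]) - a.T @ a / (a @ a.T)

P = [ker_projector(A[i]) for i in range(3)]

# Initialize each local estimate as a solution of its own equation
# (least-norm solution of a_i^T x = b_i)
x = [A[i] * b[i] / (A[i] @ A[i]) for i in range(3)]

for _ in range(5000):
    avg = [np.mean([x[j] for j in neighbors[i]], axis=0) for i in range(3)]
    # Move toward the neighborhood average inside ker(A_i),
    # so a_i^T x_i = b_i holds at every iteration
    x = [x[i] + P[i] @ (avg[i] - x[i]) for i in range(3)]

print(all(np.allclose(xi, x_star, atol=1e-6) for xi in x))  # True
```

Note that the update `x_i + P_i (avg_i - x_i)` equals the orthogonal projection of the neighborhood average onto the affine set {x : a_i^T x = b_i}, which is the standard projected-consensus step for consistent linear equations over a connected graph.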
Simulations
We evaluate the proposed method on a simulated two-link robot arm moving in a vertical plane. The dynamics of the arm is $M(q)\ddot{q} + C(q,\dot{q})\dot{q} + g(q) = \tau$, where $q$ and $\tau$ are the joint angles and input torques, respectively (see Spong and Vidyasagar (2008, p. 209); all parameters are set to unit values). Define the system state $x = \mathrm{col}\{q, \dot{q}\}$ and control input $u = \tau$, and discretize the dynamics by the Euler method with time interval $\Delta$. The arm is controlled to minimize (3), here with
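A sketch of this simulation setup, using the standard two-link manipulator model (as in Spong and Vidyasagar) with unit parameters and an assumed Euler step size, is:

```python
import numpy as np

DT = 0.01  # Euler step (illustrative value, not the paper's)
M1 = M2 = L1 = LC1 = LC2 = I1 = I2 = G = 1.0  # unit parameters

def arm_step(x, u):
    """One Euler step of the two-link arm: x = [q1, q2, dq1, dq2], u = torques."""
    q1, q2, dq1, dq2 = x
    # Inertia matrix M(q)
    m11 = M1*LC1**2 + M2*(L1**2 + LC2**2 + 2*L1*LC2*np.cos(q2)) + I1 + I2
    m12 = M2*(LC2**2 + L1*LC2*np.cos(q2)) + I2
    m22 = M2*LC2**2 + I2
    Mq = np.array([[m11, m12], [m12, m22]])
    # Coriolis/centrifugal terms C(q, dq) @ dq
    h = -M2*L1*LC2*np.sin(q2)
    Cdq = np.array([h*dq2*dq1 + h*(dq1 + dq2)*dq2, -h*dq1*dq1])
    # Gravity vector g(q): the arm moves in a vertical plane
    gq = np.array([(M1*LC1 + M2*L1)*G*np.cos(q1) + M2*LC2*G*np.cos(q1 + q2),
                   M2*LC2*G*np.cos(q1 + q2)])
    ddq = np.linalg.solve(Mq, u - Cdq - gq)
    return x + DT * np.concatenate([[dq1, dq2], ddq])

# Passive swing under gravity for 100 steps (zero torque)
x = np.zeros(4)
for _ in range(100):
    x = arm_step(x, np.array([0.0, 0.0]))
print(x.shape)  # (4,)
```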
Conclusions
This paper has developed a distributed algorithm to solve inverse optimal control in multi-agent systems in which each agent has access only to trajectory segments. Although each agent may not be able to solve the IOC problem alone, it can coordinate with its nearby neighbors and cooperatively solve the IOC problem exponentially fast. Our future research includes applying the proposed distributed inverse optimal control to a robot's learning from a human's sparse demonstrations (Jin,
References (27)

- Molloy et al. (2018). Finite-horizon inverse optimal control for discrete-time nonlinear systems. Automatica.
- Abbeel, Pieter, & Ng, Andrew Y. (2004). Apprenticeship learning via inverse reinforcement learning. In International...
- Nonlinear programming. Journal of the Operational Research Society (1997).
- Bogert, Kenneth, & Doshi, Prashant (2017). Scaling expectation-maximization for inverse reinforcement learning to...
- Bogert, Kenneth, Lin, Jonathan Feng-Shun, Doshi, Prashant, & Kulic, Dana (2016). Expectation-maximization for inverse...
- Englert et al. (2017). Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. International Journal of Robotics Research.
- Finn, Chelsea, Levine, Sergey, & Abbeel, Pieter (2016). Guided cost learning: Deep inverse optimal control via policy...
- Jin et al. (2019). Inverse optimal control for multiphase cost functions. IEEE Transactions on Robotics.
- Jin et al. (2021). Inverse optimal control with incomplete observations. International Journal of Robotics Research.
- Learning from sparse demonstrations (2020).
- Learning from incremental directional corrections.
- Pontryagin differentiable programming: An end-to-end learning and control framework. Advances in Neural Information Processing Systems (NeurIPS).
Wanxin Jin is currently a fourth-year Ph.D. student at the School of Aeronautics and Astronautics, Purdue University. Prior to Purdue, he worked as a research assistant at the Technical University of Munich, Germany, from 2016 to 2017. Wanxin Jin received his B.S. degree and M.Sc. in Control Science and Engineering from Harbin Institute of Technology, China, in 2014 and 2016, respectively. His research interest spans control theory, machine learning, and optimization with applications to robotics and human–robot autonomy.
Shaoshuai Mou is an Assistant Professor in the School of Aeronautics and Astronautics at Purdue University, where he directs the Autonomous and Intelligent Multi-agent Systems (AIMS) Lab and co-directs Purdue's new Center for Innovation in Control, Optimization and Networks (ICON). Before joining Purdue, he received a Ph.D. in Electrical Engineering from Yale University in 2014 and worked as a postdoctoral researcher at MIT until 2015. His research interests include multi-agent autonomy and learning, distributed algorithms for control and optimization, human–machine teaming, resilience and cybersecurity, and experimental research involving autonomous air and ground vehicles.
☆ This research is supported by funding from Northrop Grumman Mission Systems' University Research Program on Research in Applications for Learning Machines Consortium (REALM), USA. The material in this paper was not presented at any conference. This paper was recommended for publication in revised form by Associate Editor Dimos V. Dimarogonas under the direction of Editor Christos G. Cassandras.