CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
arXiv - CS - Logic in Computer Science Pub Date : 2020-09-23 , DOI: arxiv-2009.11403
Koundinya Vajjha, Avraham Shinnar, Vasily Pestun, Barry Trager, Nathan Fulton

Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.
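To make the convergence argument concrete, here is a minimal unverified sketch (not the CertRL Coq development) of value iteration on an invented 2-state, 2-action MDP. The Bellman optimality operator T is a γ-contraction in the sup norm, so iterating T from any starting point converges to the unique fixed point V*, the optimal value function; the transition table `P` and reward table `R` below are purely illustrative.

```python
GAMMA = 0.9  # discount factor; the contraction modulus of T

# Hypothetical finite MDP for illustration only:
# P[s][a] = list of (next_state, probability), R[s][a] = immediate reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}

def bellman_update(V):
    """One application of the Bellman optimality operator T:
    (T V)(s) = max_a [ R(s,a) + GAMMA * E_{s'~P(s,a)} V(s') ]."""
    return [
        max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
            for a in P[s])
        for s in sorted(P)
    ]

def sup_dist(V1, V2):
    """Sup-norm distance, the metric in which T is a contraction."""
    return max(abs(a - b) for a, b in zip(V1, V2))

def value_iteration(tol=1e-10):
    """Iterate T to its fixed point; convergence is guaranteed because
    sup_dist(T V1, T V2) <= GAMMA * sup_dist(V1, V2)."""
    V = [0.0, 0.0]
    while True:
        V_next = bellman_update(V)
        if sup_dist(V, V_next) < tol:
            return V_next
        V = V_next
```

The contraction property is exactly what the paper's mechanized metric coinduction exploits: rather than reasoning about the limit directly, the proof reduces convergence to the single-step inequality above.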

Updated: 2020-09-25