A meta–reinforcement learning algorithm for traffic signal control to automatically switch different reward functions according to the saturation level of traffic flows
Computer-Aided Civil and Infrastructure Engineering (IF 9.6), Pub Date: 2022-09-30, DOI: 10.1111/mice.12924
Gyeongjun Kim, Jiwon Kang, Keemin Sohn

Reinforcement learning (RL) algorithms have been widely applied to traffic signal control problems. Traffic environments, however, are intrinsically nonstationary, which creates a convergence problem that RL algorithms struggle to overcome. An RL algorithm targets a Markov decision process (MDP), which can be solved only when both the transition and reward functions are stationary. Unfortunately, the environment for traffic signal control is not stationary, since the goal of traffic signal control varies with the congestion level. Under unsaturated traffic conditions, the objective of traffic signal control should be to minimize vehicle delay; when traffic flow is saturated, the objective must instead be to maximize throughput. A multiregime analysis is possible for varying conditions, but classifying the traffic regime is itself a complex task. The present study provides a meta-RL algorithm that embeds a latent vector to recognize the different contexts of an environment, so that traffic regimes are classified automatically and a customized reward is applied to each context. In simulation experiments, the proposed meta-RL algorithm succeeded in differentiating rewards according to the saturation level of traffic conditions.
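The regime-dependent reward described in the abstract can be illustrated with a minimal sketch. The paper's actual architecture (the latent-vector encoder and its training procedure) is not detailed here, so everything below is an assumption for illustration: a hypothetical linear-plus-sigmoid gate stands in for the learned latent context, and the reward interpolates between delay minimization (unsaturated regime) and throughput maximization (saturated regime).

```python
import numpy as np

def regime_gate(obs_history, w, b):
    """Hypothetical latent-context gate: maps a vector of recent traffic
    observations to a scalar in (0, 1) interpreted as the degree of
    saturation. The real paper learns this context; here it is a fixed
    linear score passed through a sigmoid, purely for illustration."""
    z = float(np.dot(w, obs_history) + b)   # latent context score
    return 1.0 / (1.0 + np.exp(-z))         # sigmoid -> saturation weight

def blended_reward(mean_delay, throughput, gate):
    """Reward that switches objective by regime: near gate=0 (unsaturated)
    it penalizes vehicle delay; near gate=1 (saturated) it rewards
    throughput. A hard switch would use a threshold instead of a blend."""
    return (1.0 - gate) * (-mean_delay) + gate * throughput
```

For example, with a gate near 0 the agent receives `-mean_delay` and learns to clear queues quickly, while a gate near 1 makes `throughput` dominate, matching the abstract's two objectives without an explicit hand-labeled regime classifier.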
