Learning to Schedule Network Resources Throughput and Delay Optimally Using Q+-Learning
IEEE/ACM Transactions on Networking ( IF 3.7 ) Pub Date : 2021-01-26 , DOI: 10.1109/tnet.2021.3051663
Jeongmin Bae, Joohyun Lee, Song Chong

As network architectures become more complex and user requirements more diverse, efficient network resource management becomes increasingly important. However, existing throughput-optimal scheduling algorithms such as the max-weight algorithm suffer from poor delay performance. In this paper, we present reinforcement learning-based network scheduling algorithms for a single-hop downlink scenario which achieve throughput-optimality and converge to minimal delay. To this end, we first formulate the network optimization problem as a Markov decision process (MDP) problem. Then, we introduce a new state-action value function called the $Q^{+}$-function and develop a reinforcement learning algorithm called $Q^{+}$-learning with UCB (Upper Confidence Bound) exploration, which guarantees small performance loss during the learning process. We also derive an upper bound on the sample complexity of our algorithm, which is tighter than the best known bound for Q-learning with UCB exploration by a factor of $\gamma^{2}$, where $\gamma$ is the discount factor of the MDP problem. Finally, via simulation, we verify that our algorithm achieves a delay reduction of up to 40.8% compared to the max-weight algorithm over various scenarios. We also show that $Q^{+}$-learning with UCB exploration converges to an $\epsilon$-optimal policy 10 times faster than Q-learning with UCB.
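As a concrete point of reference, below is a minimal, self-contained sketch of the max-weight baseline the abstract compares against: a single-hop downlink where, in every slot, the base station serves the queue maximizing the backlog-rate product. The number of users, arrival probabilities, channel rates, and simulation loop are illustrative assumptions, not parameters from the paper, and this sketch does not implement the proposed $Q^{+}$-learning algorithm.

```python
import random

# Illustrative sketch of the classic max-weight scheduler (the baseline
# named in the abstract), NOT the paper's Q+-learning algorithm.
# All constants below are hypothetical, chosen only for demonstration.

N_USERS = 4
ARRIVAL_PROB = [0.3, 0.25, 0.2, 0.15]   # assumed Bernoulli arrival rates
RATE_CHOICES = [0, 1, 2]                # assumed per-slot channel rates


def max_weight_schedule(queues, rates):
    """Return the index of the queue with the largest backlog * rate product."""
    return max(range(len(queues)), key=lambda i: queues[i] * rates[i])


def simulate(num_slots=10_000, seed=0):
    """Run the toy single-hop downlink and return the time-averaged total backlog."""
    rng = random.Random(seed)
    queues = [0] * N_USERS
    total_backlog = 0
    for _ in range(num_slots):
        # Draw random channel states for this slot.
        rates = [rng.choice(RATE_CHOICES) for _ in range(N_USERS)]
        # Max-weight decision: serve the user maximizing q_i * r_i.
        served = max_weight_schedule(queues, rates)
        queues[served] = max(0, queues[served] - rates[served])
        # New packet arrivals.
        for i in range(N_USERS):
            if rng.random() < ARRIVAL_PROB[i]:
                queues[i] += 1
        total_backlog += sum(queues)
    return total_backlog / num_slots


if __name__ == "__main__":
    print("average total backlog:", simulate())
```

By Little's law, the time-averaged backlog printed above is proportional to average delay, which is the metric on which the paper reports up to a 40.8% improvement over this max-weight baseline.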

Updated: 2021-01-26