Finding structure in multi-armed bandits
Cognitive Psychology (IF 2.6), Pub Date: 2020-06-01, DOI: 10.1016/j.cogpsych.2019.101261
Eric Schulz, Nicholas T. Franklin, Samuel J. Gershman

How do humans search for rewards? This question is commonly studied using multi-armed bandit tasks, which require participants to trade off exploration and exploitation. Standard multi-armed bandits assume that each option has an independent reward distribution. However, learning about options independently is unrealistic, since in the real world options often share an underlying structure. We study a class of structured bandit tasks, which we use to probe how generalization guides exploration. In a structured multi-armed bandit, options have a correlation structure dictated by a latent function. We focus on bandits in which rewards are linear functions of an option's spatial position. Across 5 experiments, we find evidence that participants utilize functional structure to guide their exploration, and also exhibit a learning-to-learn effect across rounds, becoming progressively faster at identifying the latent function. Our experiments rule out several heuristic explanations and show that the same findings obtain with non-linear functions. Comparing several models of learning and decision making, we find that the best model of human behavior in our tasks combines three computational mechanisms: (1) function learning, (2) clustering of reward distributions across rounds, and (3) uncertainty-guided exploration. Our results suggest that human reinforcement learning can utilize latent structure in sophisticated ways to improve efficiency.
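To make the interplay of these mechanisms concrete, the sketch below illustrates two of them in a linear structured bandit: conjugate Bayesian linear regression stands in for function learning, and an upper-confidence-bound (UCB) choice rule is one standard form of uncertainty-guided exploration. This is a minimal illustration, not the authors' implementation: it omits the clustering of reward distributions across rounds, and the prior, noise level, and exploration weight `beta` are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 8
positions = np.linspace(0, 1, n_arms)              # spatial position of each arm
X = np.column_stack([np.ones(n_arms), positions])  # design matrix [1, x]

true_w = np.array([2.0, 5.0])                      # latent linear reward function
noise_sd = 1.0

# Conjugate Bayesian linear regression: weights w ~ N(mu, Sigma)
mu = np.zeros(2)                                   # assumed broad prior
Sigma = np.eye(2) * 10.0
beta = 2.0                                         # assumed UCB exploration weight

for t in range(30):
    # Posterior predictive mean and variance for every arm
    pred_mean = X @ mu
    pred_var = np.einsum('ij,jk,ik->i', X, Sigma, X) + noise_sd**2

    # Uncertainty-guided choice: favor arms with high mean OR high uncertainty
    ucb = pred_mean + beta * np.sqrt(pred_var)
    arm = int(np.argmax(ucb))

    # Reward is a noisy linear function of the chosen arm's position
    reward = X[arm] @ true_w + rng.normal(0, noise_sd)

    # Standard conjugate update of the posterior over the latent function
    x = X[arm]
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma = np.linalg.inv(Sigma_inv + np.outer(x, x) / noise_sd**2)
    mu = Sigma @ (Sigma_inv @ mu + x * reward / noise_sd**2)

print("estimated weights:", mu, "true weights:", true_w)
```

Because the posterior variance shrinks as evidence accumulates, the UCB bonus initially drives sampling toward uncertain arms and then fades, leaving choices to the learned function, which is the qualitative pattern the abstract describes as generalization guiding exploration.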

Last updated: 2020-06-01