A comparison model of reinforcement-learning and win-stay-lose-shift decision-making processes: A tribute to W.K. Estes

https://doi.org/10.1016/j.jmp.2013.10.001

Highlights

  • Modifies a win-stay-lose-shift (WSLS) model using equations developed by W.K. Estes.

  • Develops a dual-process WSLS–Reinforcement Learning (RL) model of decision-making.

  • The WSLS–RL model accounts for behavior in three different decision-making tasks.

  • Supports Estes’ view of cognition consisting of multiple concurrent processes.

Abstract

W.K. Estes often championed an approach to model development whereby an existing model was augmented by the addition of one or more free parameters to account for additional psychological mechanisms. Following this same approach, we utilized Estes’ (1950) own augmented learning equations to improve the plausibility of a win-stay-lose-shift (WSLS) model that we have used in much of our recent work. We also improved the plausibility of a basic reinforcement-learning (RL) model by augmenting its assumptions. Estes also championed models that assumed a comparison between multiple concurrent cognitive processes. In line with this, we develop a WSLS–RL model that assumes that people have tendencies to stay with the same option or switch to a different option following trials with relatively good (“win”) or bad (“lose”) outcomes, and that the tendencies to stay or shift are adjusted based on the relative expected value of each option. Comparisons of simulations of the WSLS–RL model with data from three different decision-making experiments suggest that the WSLS–RL model provides a good account of decision-making behavior. Our results also support the assertion that human participants weigh both the overall valence of the previous trial’s outcome and the relative value of each option during decision-making.

Introduction

The influence of W.K. Estes’ work on the fields of Mathematical and Cognitive Psychology cannot be overstated. His pioneering work on verbal conditioning, which would later come to be known as probability learning, presaged work in reinforcement learning and reward-based decision-making that is extremely popular today. Central to Estes’ work was the goal of explaining behavior in mathematical terms that could be formally modeled. He viewed the development and application of mathematical models of psychological phenomena as “a critical step in moving from descriptions of phenomena in ordinary language to representations in a theoretical plane” (Estes, 2002).

Another central theme in Estes’ work was the notion of multiple concurrent processes in cognition (Estes, 1997, Estes, 2002, Estes and Da Polito, 1967, Maddox and Estes, 1996, Maddox and Estes, 1997). He discussed this idea in much of his work on decision-making, recognition, and category-learning and made several attempts to formally model learning and memory processes by assuming that a comparison was made between the output of multiple concurrent cognitive processes, and the output of this comparison was what ultimately led to a response. The notion of multiple concurrent processes is a perennial theme in experimental psychology, and Estes was among those who championed this approach (Sloman, 1996, Smith and Decoster, 2000, Wason and Evans, 1975).

Much of our own recent work has centered on comparing fits of two different types of models to decision-making data: associative-based Reinforcement Learning (RL) models, and heuristic, or rule-based Win-Stay-Lose-Shift (WSLS) models (Cooper et al., 2013, Worthy, Hawthorne et al., 2013, Worthy and Maddox, 2012, Worthy et al., 2012). RL models have perhaps been the most popular models of decision-making over the past several decades and have been used to describe behavior in a number of different decision-making tasks (Erev & Roth, 1998; Frank, Seeberger, & O’Reilly, 2004; Sutton & Barto, 1998; Yechiam & Busemeyer, 2005). WSLS models have also been popular for quite some time, but have typically only been applied to data from binary choice experiments (Goodnow & Pettigrew, 1955; Medin, 1972; Nowak & Sigmund, 1993; Otto, Taylor, & Markman, 2011; Steyvers, Lee, & Wagenmakers, 2009). Our recent work has demonstrated that WSLS models can often provide equally good or superior fits compared to RL models for data from a wide variety of decision-making tasks (Worthy, Hawthorne et al., 2013, Worthy and Maddox, 2012, Worthy et al., 2012).

In the current work we modify our WSLS model by utilizing equations first developed by Estes in his work modeling probability learning in the 1950s (Estes, 1957, Estes, 2002, Estes and Straughan, 1954). The modification significantly improves the fit of our WSLS model and allows it to assume that tendencies to stay following a win or shift following a loss change over time. We also test an augmented version of a basic RL model. The basic RL model assumes that participants track the recency-weighted average rewards they receive when they select each option to determine each option’s expected reward values. The recency-weighted averages, or expected reward values for each option, are then compared to determine the probability of selecting each option. The augmented version of the RL model allows for the additional assumptions that participants may assign reward credit to options that were chosen in the recent past, and that expected rewards for each option decay, or are “forgotten”, as they are selected less often. Recent work has demonstrated that adding these assumptions to the basic RL model can significantly improve the fit (Bogacz, McClure, Li, Cohen, & Montague, 2007; Daw, O’Doherty, Dayan, Seymour, & Dolan, 2006; Erev & Roth, 1998; Gureckis & Love, 2009; Howard-Jones, Bogacz, Yoo, Leonards, & Demetriou, 2010; Sutton & Barto, 1998).

We then combine the WSLS and RL models into a WSLS–RL Comparison model of decision-making inspired by Estes’ later attempts to develop models of cognition that assumed multiple concurrent processes. The combined dual process model assumes: (a) that people have tendencies to stay with the same option or shift to a different option following trials with good (relative win) or bad (relative loss) outcomes (based on the WSLS model’s assumptions), and (b) that the tendencies to stay or shift are adjusted based on the relative value of each option (based on the RL model’s assumptions). Thus, people have tendencies to stay or switch on the next trial based on the overall outcome valence of that trial relative to the previous trial, and these tendencies are adjusted based on the value of the reward they expect to receive from each choice option. The model assumes that people are more likely to stay on a ‘win’ trial or shift on a ‘loss’ trial (WSLS), and they are more likely to stay-with or shift-to options with higher expected values than options with lower expected values (RL). The benefit of fitting a dual-process comparison model is that we can evaluate whether participants consider both the valence of the last outcome (WSLS) and relative value of each option (RL) to make decisions on each trial, rather than just the valence or the relative value.

Thus, the approach we take here is to augment two models that have been very successful in describing decision-making behavior by adding additional mechanisms. This is a common approach that was championed by Estes (1994):

A standard, and very powerful, procedure that is available once we have a model that provides a good fit to a set of data is to augment the model by adding one additional mechanism or process of interest (often, but not necessarily, accomplished by adding one free parameter). … It is hard to overestimate the power of this technique for gaining evidence about mechanisms and processes that cannot be directly observed.

In the following sections we first present the RL and WSLS models used as components in the WSLS–RL dual-process model, including the modification to our previous instantiation of the WSLS model based on Estes’ early work in modeling probability learning (Estes, 1957, Estes and Straughan, 1954), and the modifications to the basic RL model that allow expected reward values to decay and reward credit to be given to options chosen in the recent past. We then fit the dual-process WSLS–RL model to the data from three experiments and evaluate the degree to which weight is given to the valence of the prior outcome (WSLS) versus the relative value of each option (RL). We also simulate the model using best-fitting parameter values from participants in our experiments and compare the observed behavior of participants to that predicted by the WSLS–RL model. Our analyses compare both the proportion of times participants, and the model, select the most advantageous option over the course of the experiment, and how frequently participants, and the model, ‘stay’ by picking the same option that was selected on the previous trial or ‘shift’ by picking a different option than the one chosen on the previous trial. This allows us to examine how the model accounts for both tendencies to select options with higher expected values and tendencies to stay with or shift to different options depending on the outcome of the previous trial.

In decision-making situations involving choice, RL models assume that people develop Expected Values (EVs) for each choice option that represent the reward (or punishment) they expect to receive following each choice. The probability of selecting option $a$ on trial $t$ is typically given by a Softmax rule, which provides an action-selection probability for each option based on its EV relative to the EVs of all $n$ options (Sutton & Barto, 1998): $P(a_t) = \frac{e^{\gamma \cdot EV_{a,t}}}{\sum_{j=1}^{n} e^{\gamma \cdot EV_{j,t}}}$. Here $\gamma$ is an exploitation parameter that determines the degree to which the option with the highest EV is chosen. As $\gamma$ approaches infinity the highest-valued option is chosen more often, and as $\gamma$ approaches 0 all options are chosen equally often.
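
As an illustration only, here is a minimal Python sketch of the softmax rule in Eq. (1); the function and variable names are our own and do not come from the original article:

```python
import numpy as np

def softmax_choice_probs(evs, gamma):
    """Softmax rule (Eq. (1)): probability of selecting each option given its EV.

    evs   : array of expected values, one entry per option
    gamma : exploitation parameter; larger values concentrate choice on the
            highest-valued option, while gamma = 0 yields equal probabilities.
    """
    z = gamma * np.asarray(evs, dtype=float)
    z -= z.max()                    # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Two options with EVs of 65 and 55 points
print(softmax_choice_probs([65, 55], gamma=0.1))   # favors the first option
print(softmax_choice_probs([65, 55], gamma=0.0))   # [0.5, 0.5]
```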

The basic RL model assumes that participants develop Expected Values (EVs) for each option that represent the rewards they expect to receive upon selecting each option. EVs for all options are initialized at $EV_{initial}$, a free parameter in the model, at the beginning of the task, and updated only for the chosen option, $i$, according to the following updating rule: $EV_{i,t+1} = EV_{i,t} + \alpha\,[r(t) - EV_{i,t}]$. Learning is modulated by a learning rate, or recency, parameter ($\alpha$), $0 \le \alpha \le 1$, that weighs the degree to which the model updates the EVs for each option based on the prediction error between the reward received, $r(t)$, and the current EV on trial $t$. As $\alpha$ approaches 1 greater weight is given to the most recent rewards in updating EVs, indicative of more active updating of EVs on each trial, and as $\alpha$ approaches 0 rewards are given less weight in updating EVs. When $\alpha = 0$ no learning takes place, and EVs are not updated throughout the experiment from their initial starting points. This model has been used in a number of previous studies to characterize choice behavior (e.g. Daw et al., 2006, Otto et al., 2010, Worthy et al., 2007, Yechiam and Busemeyer, 2005). The basic assumption behind RL models is that people probabilistically select options with higher EVs.
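
A corresponding sketch of the delta-rule update in Eq. (2), again with illustrative names of our own:

```python
def update_ev(ev_chosen, reward, alpha):
    """Basic delta-rule update (Eq. (2)), applied to the chosen option only.

    alpha : learning-rate / recency parameter, 0 <= alpha <= 1.
            alpha = 0 leaves EVs at their initial values; values near 1
            weight the most recent rewards heavily.
    """
    prediction_error = reward - ev_chosen
    return ev_chosen + alpha * prediction_error

# An EV of 50, a reward of 65, and alpha = 0.3 move the EV to 54.5
print(update_ev(50.0, 65.0, alpha=0.3))
```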

In the current work we augment the basic RL model in two ways. First, we allow the model to assume that eligibility traces for recent actions determine the degree to which the credit for the reward received on each trial is given to each option. Eligibility traces assert that participants remember which options they have chosen in the recent past, and that some of the credit from the reward received on each trial goes to options chosen on previous trials, rather than all of the credit going to the option that was just chosen (Bogacz et al., 2007; Gureckis & Love, 2009; Sutton & Barto, 1998). Each time an option is chosen the eligibility trace for that option, $i$, is incremented according to: $\lambda_{i,t} = \lambda_{i,t-1} + 1$. EVs for all $j$ options are then updated according to the following updating rule: $EV_{j,t+1} = EV_{j,t} + \alpha\,[r(t) - EV_{j,t}]\,\lambda_{j,t}$. On each trial, the eligibility trace, $\lambda_j$, for every option decays based on an eligibility trace decay parameter, $\zeta$ ($0 \le \zeta \le 1$): $\lambda_{j,t+1} = \lambda_{j,t}\,\zeta$. Eligibility traces are meant to assert that participants remember which actions they have recently selected, and in this way recent actions can be credited if they lead to increases in reward on future trials. Higher decay parameter ($\zeta$) values indicate less decay of memory traces for recent actions and more credit assignment to options that have been frequently selected in the recent past. The addition of eligibility traces to RL models has resulted in improved model performance in a variety of tasks (Bogacz et al., 2007; Gureckis & Love, 2009; Sutton & Barto, 1998).
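
The eligibility-trace mechanism can be sketched as follows (one reading of Eqs. (3)–(5) in Python; the names are ours):

```python
import numpy as np

def eligibility_step(evs, traces, chosen, reward, alpha, zeta):
    """One trial of the eligibility-trace updates (Eqs. (3)-(5)), as we read them.

    evs, traces : arrays with one entry per option
    chosen      : index of the option selected on this trial
    zeta        : trace-decay parameter, 0 <= zeta <= 1 (higher values mean
                  slower decay and more credit to recently chosen options)
    """
    traces = traces.copy()
    traces[chosen] += 1.0                          # Eq. (3): increment the chosen option's trace
    evs = evs + alpha * (reward - evs) * traces    # Eq. (4): update all options, weighted by trace
    traces = traces * zeta                         # Eq. (5): decay all traces
    return evs, traces

evs, traces = np.array([50.0, 50.0]), np.zeros(2)
evs, traces = eligibility_step(evs, traces, chosen=0, reward=65.0, alpha=0.3, zeta=0.9)
print(evs, traces)   # [54.5 50. ] [0.9 0. ]
```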

The augmented RL model also allows the model to assume that EVs for each option are forgotten, or decay, as they are selected less often (Howard-Jones et al., 2010). At the end of each trial the EVs for all $j$ options are updated according to: $EV_{j,t+1} = EV_{j,t}\,\lambda + (1 - \lambda)\,\varepsilon$, where $\lambda$ ($0 \le \lambda \le 1$) represents the rate of decay, with smaller values indicating faster decay, and $\varepsilon$ represents the value to which EVs converge if an option is not chosen. The addition of the decay parameter to the model has provided a better fit to the data in much recent work (e.g. Ahn et al., 2008, Erev and Roth, 1998, Howard-Jones et al., 2010, Worthy, Hawthorne et al., 2013). To summarize, on each trial the chosen option’s eligibility trace is incremented to make the chosen option more “eligible” for learning (Eq. (3)), EVs for all options are updated based on the reward received (Eq. (4)), then both eligibility traces and EVs for all options decay (Eqs. (5)–(6)), and finally the RL model’s probability of selecting each option is computed by comparing each option’s EV (Eq. (1)).
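
Putting the pieces together, a hedged sketch of one full trial of the augmented RL model, in the order summarized above (the function and its argument names are ours, not the article's):

```python
import numpy as np

def augmented_rl_trial(evs, traces, chosen, reward, alpha, zeta, lam, eps, gamma):
    """One trial of the augmented RL model as summarized in the text."""
    traces = traces.copy()
    traces[chosen] += 1.0                          # Eq. (3): increment the chosen option's trace
    evs = evs + alpha * (reward - evs) * traces    # Eq. (4): credit all options via their traces
    traces = traces * zeta                         # Eq. (5): eligibility traces decay
    evs = evs * lam + (1.0 - lam) * eps            # Eq. (6): EVs decay toward eps
    z = gamma * evs - np.max(gamma * evs)
    probs = np.exp(z) / np.exp(z).sum()            # Eq. (1): softmax over the decayed EVs
    return evs, traces, probs

evs, traces = np.full(2, 50.0), np.zeros(2)
evs, traces, probs = augmented_rl_trial(evs, traces, chosen=0, reward=65.0,
                                        alpha=0.3, zeta=0.9, lam=0.95, eps=0.0, gamma=0.1)
print(np.round(evs, 2), np.round(probs, 3))
```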

An alternative strategy to the RL strategy of probabilistically selecting options expected to provide larger rewards is a WSLS strategy (Nowak & Sigmund, 1993; Otto et al., 2011; Steyvers et al., 2009). WSLS is a rule-based strategy that has been shown to be commonly used in binary outcome choice tasks (e.g. Otto et al., 2011). Under this strategy, participants ‘stay’ by picking the same option on the next trial if they were rewarded, and ‘shift’ by picking the other option on the next trial if they were not rewarded.

This strategy can be modeled for data from binary outcome experiments like early work in probability learning (Estes & Straughan, 1954), but it can also be modeled for data from decision-making tasks where participants receive varying amounts of reward (or punishment) on each trial. In this more general form of the WSLS model participants “stay” by picking the same option on the next trial if the reward was equal to or larger than the reward received on the previous trial (a “win” trial), or “shift” by selecting the other option on the next trial if the reward received on the current trial was smaller than the reward received on the previous trial (a “lose” trial; Worthy & Maddox, 2012; Worthy et al., 2012).

The probabilities of staying following a “win” or shifting following a “loss” are free parameters in the model. In a two-alternative decision-making experiment the probability of staying with the same option, $a$, on the next trial ($t+1$) if the reward, $r$, received on the current trial is equal to or greater than the reward received on the previous trial is: $P(a_{t+1} \mid choice_t = a \text{ and } r(t) \ge r(t-1)) = P(stay|win)$. The probability of switching to another option following a “win” trial is $1 - P(stay|win)$.

The probability of shifting to the other option, $b$, on the next trial ($t+1$) if the reward, $r$, received on the current trial is less than the reward received on the previous trial is: $P(b_{t+1} \mid choice_t = a \text{ and } r(t) < r(t-1)) = P(shift|loss)$. The probability of staying with an option following a “loss” is $1 - P(shift|loss)$.
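
The two WSLS probabilities can be expressed in code as follows (a two-option sketch of the general-form definition above; the function name is ours):

```python
def wsls_choice_probs(chosen, reward, prev_reward, p_stay_win, p_shift_loss):
    """WSLS choice probabilities for the next trial in a two-option task.

    A 'win' is a reward at least as large as the previous trial's reward;
    a 'loss' is a smaller reward than on the previous trial.
    Returns [P(option 0), P(option 1)] for the next trial.
    """
    if reward >= prev_reward:            # "win" trial
        p_stay = p_stay_win
    else:                                # "loss" trial
        p_stay = 1.0 - p_shift_loss
    probs = [1.0 - p_stay, 1.0 - p_stay]
    probs[chosen] = p_stay
    return probs

# The reward dropped from 65 to 55 after choosing option 0, so a shift is likely
print(wsls_choice_probs(chosen=0, reward=55, prev_reward=65,
                        p_stay_win=0.8, p_shift_loss=0.7))   # [0.3, 0.7]
```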

We have fit this model to experimental data in several of our recent studies, and it often provides a better fit than RL models (Worthy, Hawthorne et al., 2013, Worthy and Maddox, 2012, Worthy et al., 2012). However, one shortcoming of the model is that it is not a learning model because the best-fitting values of P(stay|win) and P(shift|loss) are estimated over all trials, and these values do not change throughout the experiment. It is reasonable to assume that the probability of staying on a “win” trial or shifting on a “loss” trial does not remain static over the course of the experiment.

In the early 1950s Estes encountered a similar situation when extending his statistical model for simple associative learning (Estes, 1950). In this model the change in mean response probability on reinforced trials is given by: $p_{t+1} = p_t + \theta(1 - p_t)$. Here the probability of a response increases on the next trial if a reward occurs on trial $t$, and $\theta$ performs a function similar to that of the learning rate ($\alpha$) parameter in Eqs. (2) and (4). On unreinforced trials changes in mean response probability are given by: $p_{t+1} = (1 - \theta)\,p_t$. Here the probability of a response decreases on the next trial if a reward does not occur on trial $t$.
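
For concreteness, a brief sketch of Estes' linear-operator updates (our own illustrative code):

```python
def estes_update(p, theta, reinforced):
    """Estes' (1950) update of mean response probability.

    Reinforced trial:   p moves toward 1 by a proportion theta.
    Unreinforced trial: p moves toward 0 by a proportion theta.
    """
    if reinforced:
        return p + theta * (1.0 - p)
    return (1.0 - theta) * p

print(estes_update(0.5, 0.2, reinforced=True))    # 0.6
print(estes_update(0.5, 0.2, reinforced=False))   # 0.4
```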

We utilized a modified version of Eq. (8) to adjust $P(stay|win)$ and $P(shift|loss)$ on each trial based on whether the trial is a “win” or a “loss” trial. The modified WSLS model has six parameters: $P(stay|win)_{initial}$ and $P(shift|loss)_{initial}$, which represent the starting values of $P(stay|win)$ and $P(shift|loss)$; $P(stay|win)_{final}$ and $P(shift|loss)_{final}$, which represent the asymptotic ending values of $P(stay|win)$ and $P(shift|loss)$; and $\theta_{P(stay|win)}$ and $\theta_{P(shift|loss)}$, which determine how much $P(stay|win)$ and $P(shift|loss)$ change on each trial.

If $r(t) \ge r(t-1)$, then the trial is considered a “win” trial and the following equation, of the same form as Eq. (8), is used to adjust $P(stay|win)$: $P(stay|win)_{t+1} = P(stay|win)_t + \theta_{P(stay|win)}\,(P(stay|win)_{final} - P(stay|win)_t)$. If $r(t) < r(t-1)$, then the trial is considered a “loss” trial and $P(shift|loss)_{t+1} = P(shift|loss)_t + \theta_{P(shift|loss)}\,(P(shift|loss)_{final} - P(shift|loss)_t)$. Modifying the WSLS model by adding these equations allows the model to assume that participants’ tendencies to stay or shift on win and loss trials are modified throughout the experiment. This modification allows the model to assume learning, in that propensities to stay following a positive outcome or shift following a negative outcome are not required to remain static across all trials.
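
The trial-by-trial adjustment of the stay and shift tendencies can be sketched as follows (one reading of the modified WSLS rules above; the names are ours):

```python
def update_wsls_tendencies(p_stay_win, p_shift_loss, reward, prev_reward,
                           p_stay_win_final, theta_stay_win,
                           p_shift_loss_final, theta_shift_loss):
    """On a 'win' (reward >= previous reward) P(stay|win) moves toward its
    asymptote; on a 'loss' P(shift|loss) moves toward its asymptote."""
    if reward >= prev_reward:    # "win" trial
        p_stay_win += theta_stay_win * (p_stay_win_final - p_stay_win)
    else:                        # "loss" trial
        p_shift_loss += theta_shift_loss * (p_shift_loss_final - p_shift_loss)
    return p_stay_win, p_shift_loss

# A "loss" trial nudges P(shift|loss) from 0.60 toward its asymptote of 0.90
print(update_wsls_tendencies(0.80, 0.60, reward=55, prev_reward=65,
                             p_stay_win_final=0.95, theta_stay_win=0.1,
                             p_shift_loss_final=0.90, theta_shift_loss=0.1))
# (0.8, 0.63)
```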

RL and WSLS models can both capture behavior reasonably well in a variety of tasks. However, one possibility is that participants consider both the overall valence of the outcome on the previous trial (WSLS) and the relative value of each option (RL) to make decisions on each trial. The RL-based process provides information on the EV of each option relative to the EVs for all other options, while the WSLS-based process provides information on the participant’s general propensity to stay with the same option or shift to a different option depending on whether the outcome was an improvement or a decline compared to the outcome on the previous trial. Modeling either process alone may not adequately account for human decision-making behavior. It is likely that human decision-making behavior involves a consideration of both the relative value of each option (RL) and the trend in rewards from trial to trial (WSLS).

The WSLS–RL model combines these two assumptions by assuming that the probability of selecting each option is affected by both the valence of the prior outcome and the relative value of each option. This assumption is accounted for by the WSLS–RL model by adding the parameter $\kappa_{WSLS}$, which weighs the degree to which the WSLS model’s output is utilized in determining the probability of selecting each option $j$: $P(j_t) = P(j_t)_{WSLS}\,\kappa_{WSLS} + P(j_t)_{RL}\,(1 - \kappa_{WSLS})$. This method of comparing the output from two separate models is one suggested by Estes in some of his later work (Estes, 2002, Maddox and Estes, 1996). In sum, the dual-process comparison WSLS–RL model has 13 free parameters: 6 for the RL-based process, 6 for the WSLS-based process, and $\kappa_{WSLS}$, which weights the output of each process. The equations used for the WSLS–RL model are Eqs. (1), (3), (4), (5), (6), (7), (8), and (11)–(12).
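
The κ-weighted combination can be written in a few lines (an illustrative sketch, not the authors' code):

```python
def wsls_rl_choice_probs(p_wsls, p_rl, kappa_wsls):
    """Combine the two processes' choice probabilities for each option j:
        P(j) = kappa_WSLS * P_WSLS(j) + (1 - kappa_WSLS) * P_RL(j)
    kappa_WSLS near 1 means choice is driven by the WSLS process, near 0 by
    the RL process, and near 0.5 by both processes about equally.
    """
    return [kappa_wsls * w + (1.0 - kappa_wsls) * r for w, r in zip(p_wsls, p_rl)]

# WSLS suggests shifting (0.3, 0.7); RL favors option 0 (0.6, 0.4); kappa = 0.5
print(wsls_rl_choice_probs([0.3, 0.7], [0.6, 0.4], kappa_wsls=0.5))   # [0.45, 0.55]
```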

In the experiments presented below we fit the WSLS–RL Comparison model to data from three decision-making experiments that have quite different reward structures. We then examine the best-fitting parameters from fits to participants’ data in our experiments. If people consider both the valence of recent outcomes (WSLS) and the relative value of each option (RL) then parameter estimates for $\kappa_{WSLS}$ should be near 0.50. We also simulate the WSLS–RL model using best-fitting parameter estimates from participants in each experiment and compare the observed behavior in our experiments to that predicted by the model.

In Experiment 1 participants perform a binary-outcome decision-making task where they receive either three points or one point each time they select one of two options. One option provides the higher payoff (three points) 70% of the time and the other option provides the higher payoff only 30% of the time. In Experiment 2 participants perform a similar two-choice task where they earn points on each trial and attempt to maximize the cumulative points earned. In this task one option provides an average payoff of 65 points on each trial, while the other option provides an average payoff of only 55 points on each trial. There is a standard deviation of 10 points around the average payoff for each option, and thus the task requires learning which option is better despite a high degree of noise in the rewards given by each option.
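
For reference, the two choice-history independent reward structures can be simulated roughly as follows (option 0 is treated as the advantageous option here; this is our own sketch of the payoff schemes described above, not the experiment code):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp1_reward(option):
    """Experiment 1 (binary outcomes): the advantageous option pays 3 points
    with probability .70 (else 1 point); the other pays 3 points with
    probability .30."""
    p_high = 0.70 if option == 0 else 0.30
    return 3 if rng.random() < p_high else 1

def exp2_reward(option):
    """Experiment 2 (continuous outcomes): Gaussian payoffs with means of 65
    (advantageous) and 55 (disadvantageous) points and an SD of 10 points."""
    mean = 65.0 if option == 0 else 55.0
    return rng.normal(mean, 10.0)

print(exp1_reward(0), round(exp2_reward(1), 1))
```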

Experiments 1 and 2 have choice-history independent reward structures because the payoffs given on each trial are not influenced by the previous choices the participant has made. In Experiment 3 participants perform a choice-history dependent task where the payoffs are affected by the proportion of times participants have selected each option over the previous ten trials. One option, the Increasing option, causes future rewards for both options to increase, while the other option, the Decreasing option, causes future rewards for both options to decrease. The Increasing option is the optimal choice, but it always provides a smaller immediate reward compared to the Decreasing option. Thus, the Decreasing option initially appears more rewarding despite being disadvantageous in the long run. Choice-history dependent tasks like these have recently become popular in examining how people avoid immediately rewarding options in favor of options that maximize long-term cumulative reward (Bogacz et al., 2007, Gureckis and Love, 2009, Otto et al., 2010, Worthy et al., 2011).

Section snippets

Experiment 1

In Experiment 1 participants performed a two-choice binary outcome decision-making task where their goal was to maximize the cumulative points gained over the course of the experiment.

Experiment 2

In Experiment 2 participants performed a decision-making experiment that shared many similarities with Experiment 1. However, the rewards in this task were continuously valued, rather than binary. Fig. 2 plots the rewards given by the Advantageous and Disadvantageous options on each trial. As stated above, mean payoffs of 65 and 55 points were given for the Advantageous and Disadvantageous decks, respectively. There was a standard deviation of 10 points around each deck’s mean payoff.

Experiment 3

To provide a third test of the WSLS–RL model’s ability to account for participants’ decision-making behavior, we had participants perform a dynamic, choice-history dependent decision-making task where the rewards given by each option depended on the recent choices participants had made. As in Experiments 1 and 2, participants performed a two-choice decision-making task where they were asked to pick from one of two decks of cards and maximize the cumulative points gained throughout the task.

General discussion

In three decision-making experiments that had qualitatively different reward structures the WSLS–RL dual-process comparison model consistently provided a good account of participants’ decision-making behavior. The model was able to account for both how often participants selected the Advantageous option and how often participants stayed with the same option over consecutive trials. This supports the model’s assumption that participants consider both the valence of the most recent outcome and the relative value of each option when making decisions.

References (38)

  • I. Erev et al. (1998). Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review.

  • W.K. Estes (1950). Toward a statistical theory of learning. Psychological Review.

  • W.K. Estes (1957). Theory of learning with constant, variable, or contingent probabilities of reinforcement. Psychometrika.

  • W.K. Estes (1994). Classification and cognition.

  • W.K. Estes (1997). Processes of memory loss, recovery, and distortion. Psychological Review.

  • W.K. Estes (2002). Traps in the route to models of memory and decision. Psychonomic Bulletin & Review.

  • W.K. Estes et al. (1967). Independent variation of information storage and retrieval processes in paired-associate learning. Journal of Experimental Psychology.

  • W.K. Estes et al. (1954). Analysis of a verbal conditioning situation in terms of statistical learning theory. Journal of Experimental Psychology.

  • M.J. Frank et al. (2004). By carrot or by stick: reinforcement learning in Parkinsonism. Science.