In his research comparing the predictions of optimal foraging theory with laboratory findings on schedules of reinforcement, Stephen Lea (1979) showed that animals strongly prefer reinforcement that follows a shorter delay over a longer one, even when the density of reinforcement favors the longer delay. Delay of reinforcement has long been considered a primary determinant of the effectiveness of a reinforcer (e.g., Catania, 1979; Kimble, 1961). Historically, response acquisition has been viewed as negatively correlated with the interval between a response and reinforcement (Thorndike, 1911), although it also has been suggested that delay of reinforcement affects performance but has little effect on learning (Warden & Haas, 1927; see also Watson, 1917). Hull (1952) proposed that the discrepant findings on the effect of delay of reinforcement on learning could be explained by the degree to which conditioned reinforcers bridge the delay between the response and the reinforcer (see also Spence, 1947). With regard to discrete-trial discrimination learning, studies often have found that delay of reinforcement retards learning (Carlson & Wielkiewicz, 1972; Culbertson, 1970; Keesey, 1964).

Although the earlier research was conducted with rats in the context of discrete-trial procedures, similar results have been found with pigeons using operant procedures (Baum, 1973; Fantino, 1969; Mazur, 1987). In the context of operant conditioning, it appears that the effect of reinforcement delay depends on what the animal is doing during the delay (e.g., pecking at a lit key or not pecking at a lit key; see Lattal, 2010, for a review).

Delay discounting and the effect of a prior commitment

A clear effect of delay of reinforcement can be seen in the delay discounting effect found in many species, including pigeons, rats, and humans. In this task, animals choose between two alternatives: one provides a small amount of food after a short delay; the other provides a larger amount of food after a longer delay. For example, when pigeons were given a choice between eight pellets immediately or 16 pellets after a short delay, they became indifferent between the two alternatives when the 16-pellet alternative was delayed by as little as 10 s (Oliveira, Green, & Myerson, 2014). The relation between delay of reinforcement and magnitude of reinforcement is well described by the hyperbolic discounting function, in which V is the present value of a future reward, A is the amount of the reward, D is the delay of the reward, and k is the slope of the discounting function:

$$ V=\frac{A}{1+kD}. \tag{1} $$

An interesting characteristic of the hyperbolic delay discounting function can be seen in Fig. 1. When the delay to the smaller, sooner reward is short, it may have greater value than the larger, later reward; however, when the delay to the smaller, sooner reward is longer, the value of the larger, later reward may be greater.
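
Because this argument turns on Equation 1, it may help to see the crossover numerically. The sketch below is a minimal illustration, with arbitrary, assumed values for the amounts, delays, and k (none taken from the studies discussed here); it computes the present value of a smaller, sooner (SS) and a larger, later (LL) reward at two choice points and reproduces the reversal shown in Fig. 1:

```python
# Minimal sketch of Equation 1: V = A / (1 + k*D).
# All parameter values are arbitrary and for illustration only.

def present_value(amount, delay, k=0.5):
    """Hyperbolic present value of a reward of a given amount and delay."""
    return amount / (1 + k * delay)

SS_AMOUNT, SS_DELAY = 2, 2   # smaller, sooner reward
LL_AMOUNT, LL_DELAY = 4, 8   # larger, later reward

# Moving the choice point earlier adds the same delay to both rewards.
for added_delay in (0, 10):
    v_ss = present_value(SS_AMOUNT, SS_DELAY + added_delay)
    v_ll = present_value(LL_AMOUNT, LL_DELAY + added_delay)
    winner = "SS" if v_ss > v_ll else "LL"
    print(f"choice {added_delay} s earlier: "
          f"V(SS)={v_ss:.2f}, V(LL)={v_ll:.2f} -> prefer {winner}")
# At the late choice point SS wins (1.00 vs. 0.80); 10 s earlier LL wins (0.29 vs. 0.40).
```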

Fig. 1

Hyperbolic delay discounting functions representing the loss of value for each outcome as a function of the delay between choice time (CT) and either the smaller sooner reward (SS) or the larger later reward (LL). CT1 = choice at time 1; CT2 = choice at time 2. Note. At CT1 the value of SS is greater than the value of LL, but at CT2 the value of LL is greater than the value of SS

This relation is more intuitive if viewed as a function of Weber’s Law. Let us say that at Choice Time 1, one pellet delayed by 0.5 s is preferred to four pellets delayed by 5 s. The ratio of the sooner delay to the later delay would be 1 to 10; but if the choice is made at Choice Time 2, 10 s earlier, the delay to the sooner reward would be 10.5 s and the delay to the later reward would be 15 s, a ratio of 10.5 to 15, or close to 2 to 3. Thus, according to this account, the values of the two delays should be more similar, and the magnitude of reinforcement should play a larger role in choice. Rachlin and Green (1972) showed that requiring pigeons to make a prior “commitment” switched their preference from the smaller, sooner reward to the larger, later reward.
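
The same point can be made by plugging this example into Equation 1. The discounting rate k is not given in the example, so the value used below is an arbitrary assumption; with k = 2, the single pellet is preferred at Choice Time 1, but the four pellets are preferred at Choice Time 2:

```python
# Worked check of the text's example under Equation 1, with an assumed k = 2
# (the example specifies the amounts and delays but not k).

def present_value(amount, delay, k=2.0):
    return amount / (1 + k * delay)

for label, added in (("Choice Time 1", 0.0), ("Choice Time 2", 10.0)):
    v_one  = present_value(1, 0.5 + added)   # one pellet, 0.5-s delay
    v_four = present_value(4, 5.0 + added)   # four pellets, 5-s delay
    print(f"{label}: V(1 pellet)={v_one:.3f}, V(4 pellets)={v_four:.3f}")
# Choice Time 1: 0.500 vs. 0.364 -> the one-pellet reward is preferred.
# Choice Time 2: 0.045 vs. 0.129 -> the four-pellet reward is preferred.
```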

The positive effect of delayed reinforcement

Delay discounting is often associated with impulsivity (Odum, 2011) or a lack of self-control because, in general, a preference for the larger, later reward would result in a higher rate of reinforcement. It is easy to speculate, however, about how impulsivity might have been selected for in evolution. After all, nature is generally not able to “promise” a larger, later reinforcement. In nature, there is competition from other conspecifics, especially in animals that tend to forage socially, and for many species the adage “he who hesitates is lost” is likely to apply.

On the other hand, in our Western culture, we tend to value self-control because important rewards often require self-control (e.g., getting a college degree, withholding aggression, avoiding large credit card debt) and, at least in our modern environment, future rewards can be made more predictable. However, our hunter-gatherer ancestors may have needed to be somewhat impulsive as they foraged for food, especially when they encountered escaping prey.

With the delay discounting procedure, the effect of making a prior commitment (delaying reinforcement) follows from the nature of the hyperbolic discounting function. If the choice is made earlier, the value of the larger, later reward is predicted to cross over the value of the smaller, sooner reward, and such a crossover was found by Rachlin and Green (1972). Although in the case of prior commitment the added delay likely devalues both alternatives, because the delay function is hyperbolic, it devalues the smaller, sooner reinforcer faster than the larger, later one.

The principle of prior commitment might be applicable to other designs in which animals are given a choice between two different magnitudes or probabilities of reinforcement but fail to choose optimally. In what follows, I will describe the results of several experiments in which we have found that pigeons (and sometimes rats) have difficulty learning to choose the larger reinforcer when an objectively smaller reinforcer is available. The tasks are sometimes quite different, but what most of them have in common is a choice in which both alternatives are associated with reinforcement, but one is clearly a better outcome than the other.

The ephemeral reward task

First reported by Bshary and Grutter (2002), the ephemeral reward task involves presenting a subject with two alternatives (plates), each containing a small bit of food. If the subject chooses Alternative A, it gets the food from that plate and the trial is over. If the subject chooses Alternative B, it gets the food from that plate and can also eat the food on the other plate, Alternative A. Perhaps not surprisingly, the subjects, in this case wrasse (cleaner fish), learn to choose Alternative B quickly (within 100 trials). What is surprising is that several species of primates do not learn to choose Alternative B within 100 trials (Salwiczek et al., 2012). The authors suggest that the cleaner fish have a natural tendency to acquire such a task because they live on reefs and clean the mouths of larger fish, some of which also live on the reef, whereas others merely visit it. The authors propose that the cleaner fish must learn to swim out and service the visitors first because the visitors are transitory and will quickly leave (the visitors correspond to Alternative B), whereas the residents will remain and can be serviced later (the residents correspond to Alternative A). The primates have had no such experience, and thus they find the ephemeral reward task difficult.

The authors propose that the cleaner fish generalize from servicing large fish on or near the reef where they live to eating from plates provided to them in laboratory fish tanks. Such generalization seems quite remarkable. It must have seemed so as well to Pepperberg and Hartsfield (2014), who repeated the experiment with African grey parrots, whose natural ecology is quite similar to that of primates and is quite different from that of the wrasse. Yet, like the wrasse, the parrots acquired the ephemeral reward task quite easily.

Pepperberg and Hartsfield (2014) noted that both the fish and the parrots made their choice with the mouth or beak, whereas the primates chose with their hand. It is not clear why that would make a difference, but with two hands, perhaps there is some tendency for the primates to attempt to choose both alternatives, one with each hand.

We tested the one-beak versus two-hand hypothesis by conducting a similar experiment with pigeons with different-colored cues (Zentall, Case, & Luong, 2016). Surprisingly, not only did the pigeons fail to acquire the optimal two-reward choice, but they actually showed a significant preference for the suboptimal one-reward alternative.

Careful examination of the task identified a potential artifact. With this task, all trials involved a reinforced response to Alternative A: Choice of A ended the trial, and choice of B also ended with a response to A (to collect the second reward). Thus, initially, there would have been twice as many reinforcements associated with responses to the suboptimal Alternative A as with the optimal Alternative B. In a follow-up experiment, we replicated the result in an automated operant box in which pecking either of two colored lights resulted in reinforcement from the feeder below.
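
The size of this potential artifact is easy to tally. The sketch below is a hypothetical simulation (assuming, purely for illustration, a bird that initially chooses the two alternatives at random); it shows that Alternative A is paired with reinforcement on every trial, whereas Alternative B is paired with reinforcement on only about half of the trials:

```python
import random

# Hypothetical simulation of the reinforcement-count artifact in the
# ephemeral reward task, assuming an indifferent bird choosing at random.
random.seed(1)

reinforced_responses = {"A": 0, "B": 0}
for _ in range(1000):
    if random.random() < 0.5:        # chooses A: one reward at A, trial over
        reinforced_responses["A"] += 1
    else:                            # chooses B: one reward at B, then one at A
        reinforced_responses["B"] += 1
        reinforced_responses["A"] += 1

print(reinforced_responses)  # A is reinforced on every trial, B on about half
```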

To test the artifact hypothesis, we reran the experiment, but when the pigeon chose the optimal alternative, we replaced the color of the optimal alternative with a third color, C, during the time of reinforcement, and a response to C provided the second reinforcement. Thus, choice of the optimal alternative no longer ended with reinforcement associated with the suboptimal alternative, A. This group showed a significant reduction in choice of the suboptimal alternative, but these pigeons still failed to acquire a significant preference for the optimal alternative.

The inability to maximize reinforcement with the ephemeral reward task appears to be unrelated to generally held notions of animal intelligence (Bitterman, 1975) and not directly attributable to the natural ecology of the species tested: In spite of the task’s apparent simplicity, wrasse and parrots acquired it easily, whereas primates and pigeons did not. To test the generality of this failure, Zentall, Case, and Berry (2017b) repeated the original experiment with rats as subjects and showed that they, too, did not show a preference for the optimal alternative.

The failure of several species to acquire the optimal choice response with the ephemeral reward task brought to mind the results of delay discounting research, which also shows consistent suboptimal choice under a variety of conditions. Rachlin and Green (1972) found that choice of the smaller, sooner reward can be reduced by using a prior commitment procedure—that is, by making the pigeons choose at an earlier time, thereby eliminating the immediacy of the outcome of the suboptimal alternative.

To test this hypothesis, much as in the commitment procedure, Zentall, Case, and Berry (2017a) imposed a 20-s delay (using a fixed-interval 20-s schedule) between the pigeons’ choice and the first reinforcement. If the optimal alternative was chosen, following reinforcement the other alternative appeared, and a single peck provided a second reinforcement. With this commitment procedure, the pigeons developed a strong preference for the optimal alternative (see Fig. 2). To test the generality of this finding, Zentall et al. (2017b) tested the procedure with rats and found a similar result: Now the rats, too, developed a preference for the optimal alternative. Thus, imposing a delay between choice and reinforcement facilitated optimal choice with this task.

Fig. 2

Ephemeral reward task: Pigeons have a choice between two alternatives. If they choose suboptimally, Alternative A, they get rewarded and the trial is over. If they choose optimally, Alternative B, they get rewarded and they can also have the reward associated with A. Thus, A is worth one reward and B is worth two rewards. In the FR1 choice condition, a single peck is required to make the initial choice, and it is followed immediately by reinforcement. In the FI20s choice condition, the first peck after 20 s has elapsed since the initial choice is followed by reinforcement. After Zentall, Case, and Berry (2017a)

One might interpret the results of the original ephemeral reward task as an example of delay discounting, because in the ephemeral reward task the second reinforcement is somewhat delayed. To do so, however, one would have to argue that although the immediate outcome following choice is the same regardless of the choice, the short delay between the first reinforcement and the second (about 1 s; see Zentall et al., 2016, Experiment 1) is sufficient to make the second act like a larger (i.e., extra) later reinforcer. The delay discounting interpretation may have more difficulty accounting for the results of the one- versus two-pellet experiment presented in the following section.

Discrimination of large versus small reward

Several years ago, we attempted to train pigeons on a simultaneous discrimination in which choice of a blue light resulted in feeder access of 1.5 s, and choice of a simultaneously presented red light resulted in feeder access of 3.0 s. Surprisingly, we found that the pigeons had great difficulty learning the discrimination.

Recently, House, Peng, and Zentall (2020) returned to this magnitude-of-reinforcement discrimination. We hypothesized that the pigeons may have had trouble discriminating between the durations of reinforcement associated with each color because the immediate effect of choosing either color was exactly the same (access to the feeder). Perhaps the task would be easier if the choice were between one and two pellets of food, because the pigeons could see the difference in the magnitude of reinforcement.

In light of the results of the ephemeral reward task, however, we decided to include a group of pigeons for which the pellet outcomes were delayed by requiring 10 pecks on either color. The results were quite striking. The control group, for which a single peck was required to produce the outcome associated with each color, showed little sign of learning to choose the color associated with the two-pellet alternative, whereas the experimental group, for which 10 pecks were required, learned the discrimination (see Fig. 3). We hypothesized that the immediate reinforcement following choice made the two outcomes difficult to discriminate. It is also possible, however, that requiring 10 pecks allowed for better processing of the stimuli (see, e.g., Elsmore, 1971; Roberts, 1972) or, alternatively, resulted in contrast between the greater effort and the reinforcement that followed (a so-called justification-of-effort effect; Clement, Feltus, Kaiser, & Zentall, 2000; Friedrich, Clement, & Zentall, 2005; Friedrich & Zentall, 2004). Paradoxically, by delaying both outcomes, the pigeons learned to obtain a greater amount of food.

Fig. 3

The one-pellet versus two-pellet discrimination task: When a single peck was required and choice of one alternative immediately provided one pellet whereas choice of the other immediately provided two pellets, the pigeons showed little evidence of optimal choice. When 10 pecks were required on either alternative, however, the pigeons learned to choose the optimal two-pellet alternative. After House, Peng, and Zentall (2020)

Object permanence

Various forms of object permanence have been used to track the cognitive development of young children (Piaget, 1963). Object permanence is assumed to involve the understanding that when an object is placed into a container or behind an occluder, it continues to exist and can be found. In the simplest form of object permanence, visible displacement, in the presence of the subject, an object may be placed in one of two containers and the subject is free to recover the object. In the more difficult form of object permanence, invisible displacement, after the object is placed in one of the containers, the container with the object inside is moved, thus invisibly displacing the object. One way this has been tested with children is by placing a container at each end of a beam that can be rotated, and after placement of the object, the beam is rotated (Bai & Bertenthal, 1992). In the easier form of invisible displacement, the beam is rotated 90°, so the orientation of the beam is now different. In the harder version, the beam is rotated 180°, so the container with the object and the empty container exchange places (Barth & Call, 2006). Research with dogs (Miller, Gipson, Vaughan, Rayburn-Reeves, & Zentall, 2009) has found that they have no problem with the visible displacement, and they do very well with the 90° invisible displacement, but they do not appear to be able to follow the 180° rotation.

We have recently conducted a similar experiment with pigeons and found them to have great difficulty with the simplest, visible displacement form of the task, even after many sessions of training (Zentall & Raley, 2019). Observation of the pigeons suggested that they readily associated the sight and sound of grain being placed into one of the containers with food: They would flap their wings and become highly agitated, yet they chose the baited container at levels of accuracy only slightly above chance. Could the relative immediacy of reinforcement following baiting have caused the pigeons to choose impulsively? If so, could we get a better measure of object permanence involving visible displacement by imposing a brief delay between baiting the container and choice?

In a follow-up experiment with new birds, conducted the same way as the original experiment but with a 5-s delay between baiting and test, we found that the pigeons could learn which container was baited. But learning which container has the food does not demonstrate object permanence, because the pigeons could have learned to use the sight of the experimenter’s hand or the location of the sound of the grain falling into the container as a cue to choose that container. That is, the original test (the first few trials) is the only appropriate measure of object permanence because object permanence should occur without training.

Following acquisition of the visible displacement form of the task, we transferred the pigeons to the 90° invisible displacement, again with a 5-s delay between baiting and test. Importantly, the pigeons were highly accurate on the immediate transfer test. Finally, we transferred the pigeons to the 180° invisible displacement, and, surprisingly, they transferred at a high level of accuracy. Thus, inserting a 5-s delay between baiting the container and testing the birds for object permanence improved the performance of pigeons on the various forms of the object permanence task.

This finding is particularly interesting in that it is quite different from the examples previously discussed. In the earlier research, the choice was between good and better, such that there were no “incorrect” choices. The object permanence task, by contrast, is more like a typical simultaneous discrimination, with one stimulus associated with reinforcement and the other associated with the absence of reinforcement, yet the pigeons were not able to learn the discrimination without the inserted delay. This finding is consistent with the idea that the immediacy of reinforcement is responsible for the failure to choose accurately on this task.

The commitment procedure developed in the context of delay discounting was the motivation for the introduction of the delay in the various examples provided in the preceding sections. It is not likely, however, that the crossover in the assumed hyperbolic delay functions responsible for the success of the commitment procedure is involved in the effect of delay on acquisition of the one-pellet versus two-pellet task, the object permanence task, or even the ephemeral reward task. Instead, I propose that when choice leads to immediate reinforcement, it often leads to impulsive choice, and that the introduction of a delay between choice and reinforcement leads to better self-control.

In the next section, I will describe one more example of the effect of an added delay on the reduction in suboptimal choice. The task has to do with a form of suboptimal choice related to unskilled gambling behavior, such as buying lottery tickets or playing roulette or slot machines.

Suboptimal choice in a gambling task

For several years, we have been studying a task in which pigeons and rats show suboptimal choice, but in this task it is not just that the animals fail to learn to make the optimal choice; rather, they show a strong preference for the suboptimal choice (Stagner & Zentall, 2010; see also Mazur, 1989; Spetch, Belke, Barnet, Dunn, & Pierce, 1990). In the version of this task most similar to unskilled gambling by humans, pigeons are given a choice between two alternatives. One alternative provides, 20% of the time, a cue that signals they will get 10 pellets of food, but 80% of the time it provides a signal that no food will be coming. The optimal alternative provides a cue that signals that they will always get three pellets of food. Thus, their choice is between an average of two pellets of food and a sure three pellets of food (Zentall & Stagner, 2011). Surprisingly, the pigeons showed a strong preference for the two-pellet alternative.

Pigeons also show a similar suboptimal preference when the outcomes involve different probabilities of reinforcement (Stagner & Zentall, 2010). For example, they prefer an alternative that 20% of the time gives them a signal that they will always be fed, to an alternative that 100% of the time gives them a signal that they will get food 50% of the time. The research suggests that it is not the probability or magnitude of reinforcement associated with initial choice that determines the preference, but instead the probability or magnitude of reinforcement associated with the signals for reinforcement that follow. Interestingly, the signal for the absence of reinforcement that occurs on 80% of the choices of the suboptimal alternative fails to inhibit choice of that alternative.
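
The payoffs in these two versions reduce to simple expected values, computed below from the probabilities and magnitudes just described (this sketch shows only the arithmetic, not a model of the pigeons’ behavior):

```python
# Expected value per choice in the two versions of the suboptimal choice task,
# using the probabilities and magnitudes given in the text.

# Magnitude version (Zentall & Stagner, 2011):
subopt_pellets = 0.20 * 10 + 0.80 * 0   # 20% chance of a signal for 10 pellets
opt_pellets    = 1.00 * 3               # a sure signal for 3 pellets
print(f"magnitude version: {subopt_pellets:.1f} vs. {opt_pellets:.1f} pellets per choice")

# Probability version (Stagner & Zentall, 2010):
subopt_food = 0.20 * 1.0   # 20% chance of a signal for certain food
opt_food    = 1.00 * 0.5   # a certain signal for food 50% of the time
print(f"probability version: p(food) = {subopt_food:.2f} vs. {opt_food:.2f}")
```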

Consistent with this hypothesis, tests of the value of the signal for the absence of reinforcement using a combined cue test failed to show that it functions as a conditioned inhibitor (Laude, Stagner, & Zentall, 2014). In the combined cue test, the presumed inhibitory stimulus is presented in combination with a known excitatory stimulus, and the reduction in responding, relative to the excitatory stimulus by itself, is taken as a measure of inhibition (Rescorla, 1969).

Further support for the hypothesis that the probability of reinforcement associated with the signals for reinforcement that follow choice is responsible for suboptimal choice comes from a design in which 50% of the time one alternative provides a signal for reinforcement, whereas 100% of the time the other alternative provides a signal for reinforcement (Smith & Zentall, 2016). Consistent with the signal value hypothesis, as both signals for reinforcement are perfect predictors of reinforcement, the pigeons are indifferent between the two alternatives.

In a follow-up experiment that extended training to 75 sessions, we found that under similar conditions pigeons gradually developed a significant preference for the suboptimal alternative (Case & Zentall, 2018). This finding led us to propose that, in addition to the value of the signals for reinforcement, pigeons’ choice of the suboptimal alternative is also affected by the contrast between the expected value of reinforcement associated with the initial choice and the value of the signal for reinforcement that follows. Specifically, the pigeon should expect 50% reinforcement for choice of the suboptimal alternative, but when the cue for reinforcement occurs, it signals 100% reinforcement; hence, there is positive contrast. Given choice of the optimal alternative, on the other hand, the pigeon would expect 100% reinforcement, and the appearance of the signal for 100% reinforcement involves no contrast. Thus, even when the optimal alternative involves no uncertainty (100% reinforcement), pigeons develop a preference for the suboptimal alternative that provides reinforcement only half of the time.
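
One way to make this contrast account concrete is to compare the probability of reinforcement expected at the time of choice with the probability signaled by the cue that follows. Treating contrast as the simple difference between the two is my own illustrative assumption; the account itself does not commit to any particular arithmetic:

```python
# A minimal, assumed formalization of the positive-contrast account:
# contrast = p(signaled by the cue) - p(expected at the time of choice).
# The subtraction is an illustrative choice, not part of the account itself.

alternatives = {
    # name: (p expected at choice, p signaled by the reinforcement cue)
    "suboptimal (50% signaled)": (0.5, 1.0),
    "optimal (100% signaled)":   (1.0, 1.0),
}

for name, (expected, signaled) in alternatives.items():
    print(f"{name}: contrast = {signaled - expected:+.1f}")
# suboptimal: +0.5 (positive contrast); optimal: +0.0 (no contrast)
```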

In the suboptimal choice (gambling) task, there is already a delay between the initial choice alternative and reinforcement, so impulsive choice would not appear to be an issue. In most of the procedures used (e.g., Stagner & Zentall, 2010), however, the signal for reinforcement associated with the suboptimal alternative appears immediately following the choice response (typically one peck). Thus, one could view the suboptimal choice as sometimes providing an immediate conditioned reinforcer.

If inserting a delay between the initial choice and reinforcement facilitates the acquisition of optimal choice in the several procedures outlined in the preceding sections, could it also reduce suboptimal choice when applied to the delay between choice and the conditioned reinforcer in this gambling-like task? Zentall, Andrews, and Case (2017) tested this hypothesis using a design in which choice of the suboptimal alternative was followed by signaled reinforcement 25% of the time, whereas choice of the optimal alternative was followed by unsignaled reinforcement 75% of the time (see Fig. 4). For pigeons in the experimental group, choice of either alternative initiated a fixed-interval 20-s schedule, at the end of which the stimulus signaling reinforcement (or its absence) appeared. For pigeons in the control group (with trial duration equated), choice of either alternative led immediately to the scheduled signaling stimulus. Pigeons in the control group showed the typical strong preference for the suboptimal alternative, whereas those in the experimental group were relatively indifferent between the two alternatives (see Fig. 5). Although the delay did not eliminate suboptimal choice by the pigeons in the experimental group, it did result in a substantial reduction in suboptimal choice (see also McDevitt, Spetch, & Dunn, 1997).

Fig. 4

Design of the suboptimal choice (gambling) task. Pigeons in the experimental group made their choice by completing a fixed-interval 20-s schedule. Pigeons in the control group made their choice by completing a brief fixed-interval 1-s schedule. To attempt to equate the trial duration for both groups, pigeons in the control group started each trial by completing a fixed-interval 20-s schedule to the center key, whereas pigeons in the experimental group started each trial by completing a fixed-interval 1-s schedule to the center key. After Zentall, Andrews, and Case (2017)

Fig. 5

Suboptimal choice (gambling) task. Pigeons in the experimental group were required to complete a fixed-interval 20-s schedule following their choice to obtain reinforcement. Control pigeons made a single peck to obtain reinforcement. After Zentall, Andrews, and Case (2017)

Conclusions

The history of research on delay of reinforcement suggests that delay typically leads to a weaker association between a stimulus and the reinforcement that follows. The research described here suggests that, in learning that involves a simultaneous discrimination, under a variety of conditions, adding a delay between choice and the outcome of that choice (reinforcement or a conditioned reinforcer) can discourage animals from choosing a suboptimal alternative.

The prior commitment procedure developed by Rachlin and Green (1972) provided the impetus for a number of experiments exploring the effect of the insertion of a delay between a choice response and reinforcement. In the ephemeral reward task, the failure of rats and pigeons to learn to choose the alternative that provided them with two reinforcements rather than one was reminiscent of delay discounting and suggested that the immediacy of reinforcement may have been a factor. Inserting a delay between choice and the first reinforcement led to optimal choice by both species.

In a related task, pigeons had difficulty learning to choose a stimulus that provided them with two pellets of food rather than one when reinforcement following choice was immediate, but they learned readily when 10 pecks rather than one were required to make their choice.

In a somewhat different task, pigeons were not able to show object permanence in the simplest, visible displacement version of the task, and, surprisingly, they showed only minimal ability to learn by trial and error. Once a delay was inserted between baiting and choice, however, they not only learned to choose correctly when food was visibly displaced, but they also transferred that learning to a 90° invisible displacement and then to the more difficult 180° invisible displacement.

Finally, in a quite different task, pigeons were found to show a strong preference for an alternative that infrequently signaled a high-probability reinforcer over an alternative that always signaled a more frequent but lower-probability reinforcer. In this research, there was a clear delay to reinforcement following choice, but there was no delay between the choice and the conditioned reinforcer that followed. When the choice was followed by a delay prior to the appearance of the conditioned reinforcer, however, substantially less suboptimal choice was found.

It is proposed that, in a variety of discrimination tasks, learning to make the optimal response may be constrained by the immediacy of reinforcement, which can lead to impulsive choice. Although adding a delay between the choice response and reinforcement may appear counterintuitive, it can facilitate learning and improve performance on simultaneous discriminations in several contexts. Importantly, procedures that decrease the likelihood of impulsive choice, by definition, lead to what in humans would be considered better self-control.