Introduction

Spatially disconnected visual elements can appear to form a perceptual group, and much effort over the last century investigated how groups form. For example, Köhler (1929) proposed that any observer who “looks passively” at Fig. 1 will see two groups of patches. The example was used to argue against the possibility that observers group elements when they have previous experience of these elements behaving as a unit, e.g., we perceive a pencil as a unit because it behaves as a single unit when we use it. Since an observer looking at the image has not seen the left three patches behaving as a unit, Köhler argued that grouping was not learnt from experience. As a member of the Gestalt School, Köhler proposed that these groupings can be accounted for by some rule, e.g., grouping by proximity, that generalizes to other cases.

Fig. 1

A redrawing of Fig. 1 in Köhler (1929, p. 154)

Köhler’s argument includes the following assumption: for a given stimulus, perceptual groups are, in some sense, primitives. In other words, it is assumed that there is one way to group the six patches, and everyone will perceive them as forming a group of three patches on the left and another group of three patches on the right. A similar assumption is relied on for the demonstrative examples throughout Wertheimer’s (1923/1950) seminal paper on perceptual organization (although Wertheimer did identify some possible roles of experience; for discussion see Wagemans, 2018), and it continues to be presupposed in modern experimental work on grouping (Palmer and Beck, 2007; Trick & Enns, 1997; Vickery, 2008) and in examples of grouping rules given in textbooks (e.g., Palmer, 1999).

However, Köhler’s argument involves a subtle qualification: everyone who looks passively at Fig. 1 perceives two groups of patches. This statement does not rule out the possibility that perceived groups can be task-dependent, i.e., the way in which an observer groups stimulus elements may depend on the particular task at hand. Other Gestalt psychologists were less flexible about the role of experience in perceptual grouping, which was vigorously debated (Braly, 1933; Gottschaldt, 1926/1950; Koffka, 1935/1963; Moore, 1930; Wertheimer, 1923/1950). For example, experiments by Gottschaldt led him to conclude that experience has a negligible effect on perceived organization, e.g., repeatedly seeing a figure has little impact on whether an observer reports seeing this figure when it is embedded in a larger figure. His conclusion continued to be contested through the 1950s (for overviews, see Bevan, 1961; Bevan & Zener, 1952). This issue of whether past experience influences perceived groupings continues to be experimentally investigated (e.g., Kimchi & Hadad, 2002; Vickery & Jiang, 2009; Zemel et al., 2002; for a review see Peterson & Kimchi, 2013).

Our view is motivated by what is readily achieved in a computational model of visual perception. We recently (Kon & Francis, 2022) proposed a flexible model of visual perception that uses top-down control to modulate neural activity corresponding to perceptual groups. Here, we show that such top-down control can be applied in a way that helps observers segment and identify elements in a visual scene. Thus, we propose that observers develop task-dependent grouping strategies to promote performance. Importantly, this view suggests that perceptual groups will vary across tasks and stimuli.

This paper describes the first application of the new model to empirical data about grouping. We start by describing data both from Han et al. (1999a) and, given the small sample size of the original study, from two replication experiments. We then briefly describe the model and explain how it can be applied to the experimental task of Han et al. (1999a). Simulation results show that the model, with an appropriate grouping strategy, captures the observed pattern of data well. Additionally, we show that alternative grouping strategies do not perform as well, which helps explain why observers would apply the particular grouping strategy that we identify.

Han et al.’s (1999a) Aims and Task

Han et al. (1999a) investigated relationships among uniform connectedness and the grouping principles of proximity and shape similarity. Uniform connectedness is a characteristic of perceptual organization introduced by Palmer and Rock (1994) in which a connected region of uniform visual properties tends to be organized as a single perceptual unit. Han et al. use the example of circles that are joined by a line the same color as the circles’ borders (see Fig. 2, stimulus B). Han et al. (1999a) wanted to investigate the relative order of visual processing for uniform connectedness compared to classic Gestalt principles. Their Experiment 1 measured response times in a letter discrimination task where each letter was made up of small spatially separated circles (stimuli are shown in Fig. 2, top row). Han et al. (1999a) explored three main hypotheses by comparing mean reaction times to different stimulus conditions. First, due to previous work indicating that grouping by proximity occurs earlier than grouping by shape similarity (Ben-Av and Sagi, 1995; Han et al., 1999b), they expected (1) response times for stimuli in which target circles were grouped solely by proximity (Fig. 2, stimulus A) to be faster than those where target circles were grouped solely by shape similarity (Fig. 2, stimulus C). Based on Palmer and Rock’s (1994) theoretically motivated claim that uniform connectedness occurs prior to classic grouping principles, they also expected (2) mean response times for stimuli where target circles were connected by lines and grouped by proximity (Fig. 2, stimulus B) to be faster than those for stimuli in which target circles must be grouped solely by proximity (Fig. 2, stimulus A) and (3) response times for stimuli where target circles were connected by lines and grouped by shape similarity (Fig. 2, stimulus D) to be faster than those for stimuli in which target circles must be grouped solely by shape similarity (Fig. 2, stimulus C).

Fig. 2

Top row: A reproduction of the stimuli used in Experiment 1 of Han et al. (1999a) to investigate uniform connectedness and grouping principles, which were used in Experiment 1 reported here. Bottom row: Stimuli used in Experiment 2, which differ in that they had equidistant horizontal and vertical spacing between shapes. For each experiment, the task on each trial was to indicate, as quickly and accurately as possible, whether the circles formed an E or an H shape

The leftmost column of Fig. 3 shows the mean response time (top) and error rate (bottom) for each stimulus condition, averaged across the n = 17 observers. As expected, grouping by proximity was much faster than grouping by shape (compare conditions A and C). Likewise, when grouping was based on shape, uniform connectedness led to much faster responses and fewer errors (compare conditions C and D). However, contrary to expectations, uniform connectedness did not have much impact when grouping could be done by proximity (compare conditions A and B). The last result was somewhat surprising because Palmer and Rock’s theory indicates that uniform connectedness should dominate the grouping principles. We will return to this finding in the General Discussion section below.

Fig. 3

The top row shows mean response times, and the bottom row shows error rates for each condition. Mean response times are calculated only from correct responses. The first column shows results from Han et al.’s (1999a) experiment for the 80 ms ISI condition only. The middle two columns are results from two replication experiments (Experiments 1 and 2 reported here). The last column shows results from model simulations of Experiment 2 when the model implemented a particular grouping strategy (note the change in scale for response times). Error bars represent one standard error of the mean. Condition letters refer to the stimuli in Fig. 2

These conclusions were supported by three additional experiments in Han et al. (1999a) that varied the stimuli and task. Before attempting to model the experimental results, we wanted to be sure that the empirical findings were solid. We were concerned about the small sample sizes used by the original study, so we replicated key parts of Experiment 1 in Han et al. (1999a). As it turns out, our worries were largely unfounded; however, since future scientists may have similar concerns, we share our experimental details and findings.

Experiment 1: Direct Replication of Han et al. (1999a)

Our first experiment is a replication of Experiment 1 in Han et al. (1999a). One deviation is switching the contrast of the stimuli and background. The original experiment used black line figures on a white background, but for the model simulations it was better to use white line figures on a black background, so we used the same stimuli for the experiment. An additional change is the removal of variation in ISI between stimulus offset and mask onset. Han et al. (1999a) did not find an effect of ISI, so to simplify the experiment we only used ISI = 80 ms. A side effect of this choice is that the warning tone at the start of a trial in the original experiment no longer seemed to be necessary and so was omitted. Finally, although the original experiment counterbalanced responses for two unnamed keyboard keys, we used the E-key and H-key to register responses and did not counterbalance them between participants.

Method

Participants.

To identify an appropriate sample size, we first determined that we wanted to measure each mean response time with a precision that would have a standard error of 10 milliseconds. Response time standard deviations across observers for identification tend to be around 100 milliseconds (for the findings in Han et al., 1999, the standard deviations are around 106 milliseconds), so we planned for a sample size of around 100 observers because this would give us a standard error of \(100/\sqrt {100} = 10\). Due to excess sign-ups for the experiment, we ended up with a total of 125 observers. Participants were naïve undergraduates from Purdue University who took part in exchange for course credit. All participants provided informed consent in accordance with Purdue’s Institutional Review Board.
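The planning arithmetic above can be sketched in a few lines. This is only an illustration of the standard-error formula; the function names are hypothetical, and the value computed for 125 observers assumes the planning standard deviation of 100 ms rather than any standard deviation observed in the experiment.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of a mean, given the across-observer SD and sample size."""
    return sd / math.sqrt(n)


def required_n(sd: float, target_se: float) -> int:
    """Smallest sample size whose standard error does not exceed target_se."""
    return math.ceil((sd / target_se) ** 2)


# Planning values from the text: SD of about 100 ms, 100 observers.
planned_se = standard_error(100, 100)

# Same target precision, using the SD of about 106 ms mentioned in the text.
n_for_106 = required_n(106, 10)

# Assumption: SD stays at 100 ms with the 125 observers actually tested.
se_with_125 = standard_error(100, 125)

print(planned_se, n_for_106, round(se_with_125, 2))
```

Under these assumptions, the extra sign-ups can only tighten the precision relative to the planned 10 ms.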

Apparatus.

This study was conducted online using bespoke JavaScript and HTML code (Francis and Neath, 2015). Participants used a computer to take the experiment, and those who attempted to use a tablet or phone were prompted to switch to a laptop or desktop computer. Participants used the E-key and H-key on their keyboard to register responses for E and H target letters, respectively. We provide a local version of the code for the experiment at the Open Science Framework (https://osf.io/zrkue/).

Stimuli.

The stimuli used in the experiment are shown in Fig. 2, top row. Since the study was conducted online, a participant’s distance from the monitor and the monitor’s size are unknown, and thus the visual angle subtended by the stimuli is also unknown. To provide some sense of the size of stimulus elements, a hypothetical participant might use a laptop on a desk with the monitor 18 inches away. With a 13.3-inch (diagonal) monitor with a width of around 11.25 inches and resolution of 2560 × 1600 (making pixel density 227 pixels per inch), a circle diameter or square side would subtend approximately 0.5° of visual angle. The width of the middle row of circles, i.e., the distance from the leftmost circle to the rightmost circle in the center row, would subtend approximately 5.23°. The height of the leftmost column of circles would subtend approximately 6.72°.
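The conversion between on-screen size and visual angle for the hypothetical participant can be sketched as follows. The function names are illustrative, and the pixel count is derived from the stated angle, viewing distance, and pixel density; it is not a stimulus dimension reported in the text.

```python
import math

PPI = 227          # pixel density of the hypothetical monitor (from the text)
DISTANCE_IN = 18   # viewing distance in inches (from the text)


def visual_angle_deg(size_px: float, ppi: float = PPI,
                     distance_in: float = DISTANCE_IN) -> float:
    """Visual angle in degrees subtended by an extent of size_px pixels."""
    size_in = size_px / ppi
    return math.degrees(2 * math.atan(size_in / (2 * distance_in)))


def pixels_for_angle(angle_deg: float, ppi: float = PPI,
                     distance_in: float = DISTANCE_IN) -> float:
    """Inverse of visual_angle_deg: pixels needed to subtend angle_deg."""
    return 2 * distance_in * math.tan(math.radians(angle_deg) / 2) * ppi


# Derived, not reported: pixels that would subtend half a degree.
half_degree_px = pixels_for_angle(0.5)
print(round(half_degree_px, 1))
```

A participant sitting closer or using a lower-density display would see the same pixel extent at a larger visual angle, which is why the values in the text are only indicative.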

Procedure.

After reading instructions that explained the task, encouraged them to look at the fixation cross for the duration of each trial, and prompted them to make responses as quickly and as accurately as possible, participants scrolled to the bottom of the webpage, where the experiment took place. Figure 4 schematizes what occurred on a trial. A participant initiated the first trial by pressing the B-key on their keyboard. 2000 milliseconds after this key was pressed, the stimulus appeared and was shown for 160 ms. This was followed by a fixed inter-stimulus interval of 80 ms and then a random dot mask for 80 ms. After mask offset the screen was blank except for a fixation cross that remained onscreen for the full duration of each trial. The participant responded by pressing the E-key if the target letter was an E or the H-key if the target letter was an H. Participants were instructed to rest their left index finger on the E-key, right index finger on the H-key, and thumb on the B-key for the duration of the experiment. After completing 24 practice trials in which all conditions were experienced, the participant completed two blocks of experimental trials, with each block having 72 trials. Over the course of the experiment, all stimulus conditions were randomly interleaved. The total number of experimental trials was the same number of experimental trials that participants in the original experiment completed for each ISI condition. If the response was incorrect, too fast (if the response time was less than 100 ms), or too slow (if the response time was greater than 2000 ms), participants were given feedback at the end of a trial. After each trial, participants were prompted to press the B-key to initiate the next trial. Trials with reaction times lower than 100 ms or greater than 2000 ms were omitted from analysis. Reaction times were only based on correct trials.
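The exclusion and scoring rules at the end of the procedure can be sketched as below. The `Trial` record and `summarize` function are hypothetical names, not taken from the authors' analysis script, and computing the error rate over the retained trials only is an assumption; the text states only that out-of-range trials were omitted and that reaction times came from correct trials.

```python
from dataclasses import dataclass
from statistics import mean

MIN_RT_MS = 100    # responses faster than this are omitted
MAX_RT_MS = 2000   # responses slower than this are omitted


@dataclass
class Trial:
    condition: str   # "A", "B", "C", or "D"
    rt_ms: float
    correct: bool


def summarize(trials):
    """Return {condition: (mean correct RT, error rate)} after exclusions."""
    by_condition = {}
    for t in trials:
        if MIN_RT_MS <= t.rt_ms <= MAX_RT_MS:
            by_condition.setdefault(t.condition, []).append(t)

    summary = {}
    for condition, kept in by_condition.items():
        correct_rts = [t.rt_ms for t in kept if t.correct]
        mean_rt = mean(correct_rts) if correct_rts else None
        # Assumption: errors are counted over retained trials only.
        error_rate = sum(not t.correct for t in kept) / len(kept)
        summary[condition] = (mean_rt, error_rate)
    return summary


demo = [
    Trial("A", 500, True),
    Trial("A", 700, True),
    Trial("A", 50, True),    # too fast, omitted
    Trial("A", 900, False),  # retained, counts as an error
]
demo_summary = summarize(demo)
print(demo_summary)
```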

Fig. 4

Stimulus sequence for an experiment trial

Results and Discussion

Mean response times for correct trials and error rates for Experiment 1 are shown in the second column of Fig. 3 and in Table 1. We ran an ANOVA model in R (version 4.0.2; R Core Team, 2020) using the ez package (Lawrence, 2016). (The data and R script for the analysis can be found on the OSF at https://osf.io/zrkue/.)

Table 1 Experiment 1 Descriptive Statistics and Correlations

A repeated measures ANOVA showed condition type had a significant effect on response time, F(2.32,276.02) = 277.05, p < .001 (since Mauchly’s test indicated a violation of sphericity, ε = .77, Huynh-Feldt corrected results are reported). Planned contrasts indicated that response times were significantly lower for the condition with similarity grouping and uniform connectedness (condition D) compared to the condition with similarity grouping only (condition C), t(372) = 11.27, p < .001, and for the condition with proximity grouping only (condition A) compared to the condition with shape grouping only (condition C), t(372) = −24.28, p < .001. There was no significant difference in response times for the condition with proximity grouping only (condition A) compared to the condition with proximity grouping and uniform connectedness (condition B), t(372) = 0.23, p = .815.

Following Han et al., the error rate for each observer was transformed with an arcsine square-root function prior to analysis. Results for error rates mirrored those for response times. A repeated measures ANOVA with a Huynh-Feldt correction (ε = .96) showed condition type had a significant effect on error rates, F(2.88,342.51) = 126.61, p < .001. Planned contrasts indicated that error rates were significantly lower for the condition with similarity grouping and uniform connectedness (condition D) compared to the condition with similarity grouping only (condition C), t(372) = 11.21, p < .001, and for the condition with proximity grouping only (condition A) compared to the condition with shape grouping only (condition C), t(372) = −17.47, p < .001. There was no significant difference in error rates for the condition with proximity grouping only (condition A) compared to the condition with proximity grouping and uniform connectedness (condition B), t(372) = −1.25, p = .214.
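For readers unfamiliar with the arcsine square-root transform, a minimal sketch follows. It uses the plain form asin(√p); some treatments multiply the result by 2, and the text does not say which variant was applied, so the exact scaling here is an assumption.

```python
import math


def arcsine_sqrt(p: float) -> float:
    """Arcsine square-root transform of a proportion in [0, 1], in radians."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


# The transform stretches proportions near 0 and 1, where raw error rates
# have compressed variance, before they enter the ANOVA.
for p in (0.0, 0.02, 0.10, 0.50):
    print(p, round(arcsine_sqrt(p), 3))
```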

The results of this experiment largely replicate the pattern of results in Han et al. (1999a). Response times are longer for our experiment, which probably reflects differences in equipment, context, and training. Our results also do not show as big a difference between conditions C and D as in the original findings, but the pattern is still convincing.

Experiment 2: Slightly Modified Stimuli

Our Experiment 1 had stimuli with the same spacing as Han et al., where the vertical spacing of shapes is greater than the horizontal. As we started to run model simulations with these stimuli, we realized that the difference between the horizontal and vertical spacing might quantitatively affect the required top-down connection control settings in the model. The model could handle such stimuli, but we thought it would be prudent to empirically check whether the spacing difference had a measurable impact on behavior (this would guide model development). Thus, our second experiment was identical to the first except the vertical spacing was equal to the horizontal spacing (see Fig. 2, bottom row). We also felt that having equidistant elements was a better way to measure the relative impact of grouping principles.

Method

Participants.

We again planned to gather data so that we measured mean response time with a standard error of around 10 milliseconds. Experiment 1 above found standard deviations across participants around 113 milliseconds, so we again planned for approximately 100 participants. Due to excess sign-ups, we ended up with 120 naïve undergraduates from Purdue University taking part in the experiment in exchange for course credit. All participants provided informed consent in accordance with Purdue’s Institutional Review Board.

Apparatus, Stimuli, and Procedure.

All aspects were identical to those of Experiment 1 except for the vertical spacing of elements, which was the same as the horizontal spacing of elements in Experiment 1. Using the same hypothetical observer as for Experiment 1, the height of the leftmost column of circles would subtend approximately 6.1°.

Results and Discussion

Mean response times and error rates for Experiment 2 are shown in the third column of Fig. 3 and in Table 2, and they largely match the results from Experiment 1 and the results from Han et al. (1999a).

Table 2 Experiment 2 Descriptive Statistics and Correlations

A repeated measures ANOVA with a Huynh-Feldt correction (ε = .75) showed condition type had a significant effect on response time, F(2.26,269.29) = 321.41, p < .001. Planned contrasts indicated that response times were significantly faster for the condition with similarity grouping and uniform connectedness (condition D) compared to the condition with similarity grouping only (condition C), t(357) = 12.44, p < .001, and for the condition with proximity grouping only (condition A) compared to the condition with shape grouping only (condition C), t(357) = −26.13, p < .001. There was no significant difference in response times for the condition with proximity grouping only (condition A) compared to the condition with proximity grouping and uniform connectedness (condition B), t(357) = 0.41, p = .685.

Similar to Experiment 1, condition type had a significant effect on transformed error rates, F(3,357) = 107.08, p < .001. Planned contrasts indicated that error rates were significantly lower for the condition with similarity grouping and uniform connectedness (condition D) compared to the condition with similarity grouping only (condition C), t(357) = 7.60, p < .001, and for the condition with proximity grouping only (condition A) compared to the condition with shape grouping only (condition C), t(357) = −15.89, p < .001. There was no significant difference in error rates for the condition with proximity grouping only (condition A) compared to the condition with proximity grouping and uniform connectedness (condition B), t(357) = −1.30, p = .194.

Comparing the results of the two replication experiments indicates that having stimulus elements equidistant or not hardly affects the pattern of mean response times or error rates. One small difference is that the mean response times between these experiments differ by approximately 20 milliseconds for each condition (faster for the equidistant stimuli of Experiment 2).

Model Simulations

Simulated Grouping Strategy

We propose that human observers promote performance on a particular task and stimulus set by implementing a grouping strategy, which consists of separate connection and selection strategies that are chosen in tandem. The model is a version of the LAMINART cortical neural network (Grossberg and Raizada, 2000; Raizada & Grossberg, 2001). The version of the model used here includes the connection circuit introduced in Kon and Francis (2022), which we use to implement connection strategies, and the segmentation circuit introduced by Francis et al. (2017), which we use to implement selection strategies.

Connections in the Model

A connection strategy concerns which stimulus elements to connect. According to the model, these connections occur among orientation-sensitive complex cells in cortical area V2 and spread between active cells responding to stimulus edges. For example, the second and third columns in Fig. 5 show model V2 activity. As in prior work (Francis et al., 2017; Kon and Francis, 2022), this activity is color coded where red indicates that the vertically-tuned cell at a pixel is active, green horizontal, and blue diagonal. Notice that some of the V2 activity in Fig. 5 corresponds to oriented edges in the stimulus but other activity—the connections—is generated by the connection circuits (Grossberg & Mingolla, 1985a).

Fig. 5

Input to the model consists of images that could be shown to a participant, e.g., the left stimulus. To demonstrate how top-down control of the connection circuits can result in different patterns of connections, this stimulus was input in the model for four simulations, each of which had a different set of connection parameters, for 500 milliseconds of model time. The other images show the V2 neural activity of each simulation summed during 450–500 milliseconds. See the Appendix for details about the connection parameters used in each simulation. Numbers in some of the image titles correspond to grouping strategies discussed in the main text

The spread of connections can be altered by tuning the parameters of three circuits: (a) the spread controller circuit enables the spread of connections from detected edges, (b) the long controller circuit reduces the spread of connections from long edges, and (c) the short controller circuit reduces the spread of connections from short edges (for details about the connection circuits, see Kon & Francis, 2022). If these connection circuits are off, then no connections form. For example, if the stimulus shown in Fig. 5 is input to the model for 500 milliseconds and if the connection circuits are off, then the V2 activity only represents stimulus edges (Fig. 5, image labelled “No connections”).

By top-down control of connection parameters, the connection circuits can be tuned to produce various connection patterns (see the Appendix for details about the connection circuit parameter values used for the simulations reported here). For example, they can be tuned so that connections form only between nearby circles (Fig. 5, image labelled “Circles only”), between nearby circles and between nearby squares but not between circles and squares (Fig. 5, image labelled “Circles and squares”), or between all shapes (Fig. 5, image labelled “All connected”). These connected elements may be regarded as forming groups, so the model can group this set of stimulus elements in several ways via connections.

Building on claims made by Francis et al. (2017) and Kon and Francis (2022), we assume that observers use top-down control to tune the connection parameters in order to promote performance on a given task and stimulus set so that, e.g., target and/or distractor elements link together. We refer to this type of tuning as a connection strategy. To see how connections allow for faster identification of targets, we first require a sketch of how the segmentation circuit functions.

Selection in the Model

The model uses segmentation layers and a selection signal to perform a kind of figure-ground processing. Selected contours are transferred from a default image plane (called Segmentation Layer 0) to a separate image plane (Segmentation Layer 1) (Francis et al., 2017). Fig. 6 demonstrates this segmentation process as it unfolds in time. The bottom row shows how the contours corresponding to the letter formed by the connected circles are transferred to Segmentation Layer 1.

Fig. 6

Demonstration of how a selection signal selects and segments elements it covers (and elements that are connected to these elements) into a separate segmentation layer over time. Each image shows activity summed over 50 milliseconds of model time. Segmentation Layer 0 contains V2 neural activity, and the selection signal is applied to this layer (see Francis et al., 2017, for details)

In Fig. 6, a single selection signal is represented by a gray circle in Segmentation Layer 0, and it acts like an attentional spotlight. The selection signal has two key traits. First, contours “selected” by the signal, i.e., contours that are at the same location as the signal, are transferred to Segmentation Layer 1, as can be seen at time 250–300 ms in Fig. 6. Second, the signal itself spreads across contours that are connected to any selected contour. Thus, even though the selection signal in Fig. 6 is placed on a single circle, the contours of that circle are connected to its neighboring circles, so the signal spreads to those neighbors as well. The signal keeps spreading among connected contours so that, given enough time, it selects and segments all the elements that form the global shape.
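The spreading behavior can be illustrated with a deliberately simplified abstraction: treat stimulus elements as nodes, connections as edges, and let selection advance one link per step. This is a graph analogy only, not the neural dynamics of the segmentation circuit; the element names and the `spread_selection` function are hypothetical.

```python
from collections import deque


def spread_selection(connections, seeds):
    """Breadth-first spread of selection across connected elements.

    connections: dict mapping each element to the elements it connects to.
    seeds: elements directly covered by the selection signal.
    Returns {element: number of spreading steps before it is selected}.
    """
    reached = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbor in connections.get(current, ()):
            if neighbor not in reached:
                reached[neighbor] = reached[current] + 1
                queue.append(neighbor)
    return reached


# Three circles linked in a row; a square with no connections to them.
connections = {
    "circle_1": ["circle_2"],
    "circle_2": ["circle_1", "circle_3"],
    "circle_3": ["circle_2"],
    "square_1": [],
}
selected = spread_selection(connections, ["circle_1"])
print(selected)
```

In this toy example the unconnected square is never selected, which mirrors why the choice of connections determines what ends up in the separate segmentation layer.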

If only the target contours are segmented to the separate image plane, then identification is improved because the surrounding distractors do not interfere with identification. A selection strategy concerns the number, placement, size, and timing of a selection signal, and we propose that these properties are subject to top-down control. As the example in Fig. 6 shows, what gets selected depends very much on what elements form connections. Thus, it is the combination of a connection strategy and a selection strategy that defines a grouping strategy. A good grouping strategy promotes good identification, so we describe a plausible identification mechanism before describing a grouping strategy that is easy to implement and produces results that match the experimental data of Han et al. (1999a).

Model Evidence, Template Matching, and Stopping Rule

The observer’s task was to identify the global target letter, which could be modeled with a number of different mechanisms. We chose to use a pair of templates to calculate a model evidence score and to apply a stopping rule that gives an indication of confidence in the identity of the letter. After each 50 milliseconds of model time, a model evidence score was calculated from activity in Segmentation Layer 1. Neural responses within the areas covered by H and E templates, shown in Fig. 7, were summed to produce \(V_{H}\) and \(V_{E}\), respectively, and activity across the entire layer was summed to produce \(\sum V\). Model evidence is

$$ \frac{V_{E}}{\sum V}-\frac{V_{H}}{\sum V}. \tag{1} $$

Since the proportion of activity corresponding to an H is subtracted from that corresponding to an E, a model evidence score greater than zero is considered to be evidence for the letter being an E, while a score less than zero is evidence that the segmented responses form an H. Because it takes time for the selection signal to spread and for the selected boundary activity to be segmented into Segmentation Layer 1, the model often does not have enough information to make an accurate judgment about the letter type early in a trial. So, a stopping rule was implemented in the model as a measure of model confidence regarding its judgment about letter type. According to the stopping rule, if the model evidence score was greater than zero (or less than zero) for three consecutive 50-millisecond time steps, then the trial ended and the response corresponding to the final model evidence score was taken to be the response. Otherwise, the trial continued up to a maximum of 750 milliseconds after stimulus onset, at which point a letter was randomly chosen as a “guess.” This stopping rule was chosen based on pilot simulations that indicated the model was relatively fast and accurate with this stopping rule.
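The evidence score and stopping rule can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names are hypothetical, the template sums are passed in as plain numbers, the first score is assumed to arrive 50 ms after stimulus onset, and a layer with no activity is assumed to yield a score of zero (the text does not say how that case is handled).

```python
import random

STEP_MS = 50      # model time step
MAX_MS = 750      # trial deadline after stimulus onset
RUN_LENGTH = 3    # consecutive same-sign steps needed to respond


def model_evidence(v_e: float, v_h: float, v_total: float) -> float:
    """Proportion of layer activity under the E template minus that under H."""
    if v_total == 0:
        return 0.0  # assumption: no segmented activity means no evidence
    return v_e / v_total - v_h / v_total


def run_trial(evidence_per_step, rng=random):
    """Apply the stopping rule; return (response, response time in ms, guessed)."""
    run_sign, run_len, elapsed = 0, 0, 0
    for score in evidence_per_step:
        elapsed += STEP_MS
        if elapsed > MAX_MS:
            break
        sign = (score > 0) - (score < 0)
        if sign != 0 and sign == run_sign:
            run_len += 1
        else:
            run_sign, run_len = sign, (1 if sign != 0 else 0)
        if run_len == RUN_LENGTH:
            return ("E" if sign > 0 else "H"), elapsed, False
    # Deadline reached without a stable sign: guess with probability 0.5 each.
    return rng.choice("EH"), MAX_MS, True


# Two empty steps, then three consecutive positive scores.
print(run_trial([0.0, 0.0, 0.2, 0.3, 0.1]))
```

Requiring a run of same-sign scores, rather than a single score, trades a little speed for protection against early fluctuations while the selection signal is still spreading.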

Fig. 7

The location and size of the E template (left image) and H template (right image) are indicated by gray boxes, which are overlaid onto V2 cell activity output for stimulus A with a letter E target in order to give a sense of what they cover

Grouping Strategy 1

Now that we have summarized the main functions of the connection and segmentation circuits, we can identify a grouping strategy that could be used for this task and stimulus set, which we will call “Grouping Strategy 1.” Recall that a grouping strategy consists of a connection strategy and a selection strategy. A simple connection strategy would be to connect only target circles for each stimulus condition, which is shown in row 2 of Fig. 8. With this connection strategy, if a selection signal falls on part of any circle or their connections, then the entire letter will eventually be selected and segmented, which allows for easy identification of the target letter.

Fig. 8

Simulations that demonstrate the connection and selection signal size strategies of Grouping Strategy 1 for each stimulus condition (top row). For all conditions, the connection strategy (middle row) allows circles to connect to neighbors and nothing else. Conditions A and B use a selection strategy with a large selection signal (schematized in the bottom row over the stimulus), which promotes fast selection of the entire shape. Conditions C and D use a small selection signal to avoid selecting square distractors

Following ideas in Francis et al. (2017), a plausible selection strategy involves two selection signals placed at locations that help to quickly identify the global letter shape. Specifically, the left selection signal aimed for a location that was centered on the second circle from the left in the row with the fixation cross, and the right selection signal aimed for a location on the second circle from the right in the same row. Because it is unlikely that an observer would precisely place a selection signal at the same location across trials, noise was added to the x and y coordinates of the aimed-for location by adding a value that was randomly drawn (for each coordinate on each trial) from a normal distribution with a mean of zero and standard deviation of 10 (see Fig. 8, bottom row, for examples of selection signal locations with noise added). These locations were chosen because they are near the fixation cross and are likely to result in the selection and segmentation of the target letter.

An additional aspect of the selection strategy is that the size of the selection signals is varied across conditions. For conditions A and B, larger selection signals with a diameter of 67 pixels were used, while selection signals of a diameter of 19 pixels were used for conditions C and D. For conditions C and D, the small size was chosen because it could select only a target circle and not select any surrounding squares. For conditions A and B, larger selection signals were used because there was no risk of selecting a distractor square. Since it takes time for the selection signal to spread along edges/connections and segment them (e.g., Fig. 9), a larger selection signal will lead to faster segmentation of the target elements. We propose that an observer uses gist information from the image to know which selection signal size to use on a given trial. Each selection signal was initiated at 50 milliseconds after stimulus onset and remained at the same location until the end of the trial. The V2 activity selected by the pair of selection signals was segmented into a single layer (Segmentation Layer 1). Figure 9 provides some sample trials that used this selection strategy.
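The placement and sizing rules of this selection strategy can be sketched as below. The function name, the dictionary layout, and the example coordinates are hypothetical, and the positional noise is assumed to be in pixels because the text gives the standard deviation without units.

```python
import random

# Selection signal diameters by stimulus condition (from the text).
SIGNAL_DIAMETER_PX = {"A": 67, "B": 67, "C": 19, "D": 19}
POSITION_SD = 10        # SD of placement noise; pixel units are an assumption
SIGNAL_ONSET_MS = 50    # signals start 50 ms after stimulus onset


def place_signal(target_xy, condition, rng):
    """Return one selection signal aimed at target_xy with Gaussian jitter."""
    x, y = target_xy
    return {
        "x": x + rng.gauss(0, POSITION_SD),  # independent draw per coordinate
        "y": y + rng.gauss(0, POSITION_SD),
        "diameter": SIGNAL_DIAMETER_PX[condition],
        "onset_ms": SIGNAL_ONSET_MS,
    }


rng = random.Random(2022)
# Hypothetical aim points for the left and right signals on one trial.
left = place_signal((120, 200), "C", rng)
right = place_signal((280, 200), "C", rng)
print(left, right)
```

Because the two signals are jittered independently, a small signal can occasionally miss its target circle or, as in the error trial of Fig. 9, land where it also captures a distractor.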

Fig. 9

The first 250 milliseconds of two trials from the simulation that implemented Grouping Strategy 1. The letter was correctly identified in Example Trial 1. The letter was incorrectly identified in Example Trial 2 because some distractor squares were selected along with the target circles, and one of the squares fell inside the H-template, thereby triggering the wrong template

Simulation Stimuli, Method and Procedure

For each simulation, 120 simulated trials were run for each stimulus condition (i.e., 60 trials for each target letter for each condition). As depicted in Fig. 10, on each simulated trial the stimulus image was presented for 150 ms followed by a black image that served as a 100 ms ISI. Then a random dot mask was shown for 100 ms that was followed by a black image until enough model evidence accumulated for the model to indicate whether the letter was an E or H (see Fig. 11 for an example trial that shows model activity given a random dot mask). If there was not enough evidence for either letter by 750 ms after stimulus onset, then the trial terminated and a guess was made about the identity of the letter, i.e., each letter had a probability of 0.5 of being selected. The time it took on each trial for the model evidence score to go above (or below) zero and trigger the stopping rule provides a measure of response time that can be compared with that of human observers. Additionally, whether the model chose the correct letter on a given trial contributed to an error rate that can be compared with that of human observers.
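The trial structure and stopping rule just described can be summarized with a minimal Python sketch. It is a simplification under several assumptions: the evidence scores are supplied as a precomputed list (one value per 50 ms model time step) rather than produced by the network, the sign convention (positive favors E, negative favors H) is chosen only for illustration, and the function names are hypothetical.

```python
import random

TIME_STEP_MS = 50    # model time step
DEADLINE_MS = 750    # a guess is made if no decision has been reached by then

def frame_at(t_ms):
    """Name of the image shown at t_ms after stimulus onset."""
    if t_ms < 150:
        return "stimulus"   # 150 ms stimulus
    if t_ms < 250:
        return "blank"      # 100 ms ISI
    if t_ms < 350:
        return "mask"       # 100 ms random dot mask
    return "blank"          # black image until a response

def run_trial(evidence_by_step, rng=random):
    """Apply the stopping rule to a list of evidence scores.

    evidence_by_step[i] is the evidence score at the end of time step i.
    Assumed convention: a score above zero favors E, below zero favors H.
    Returns (response, response_time_ms, guessed)."""
    for step, score in enumerate(evidence_by_step):
        t_ms = (step + 1) * TIME_STEP_MS
        if t_ms > DEADLINE_MS:
            break
        if score > 0:
            return "E", t_ms, False
        if score < 0:
            return "H", t_ms, False
    # Not enough evidence by the deadline: guess, each letter with p = 0.5.
    return rng.choice(["E", "H"]), DEADLINE_MS, True
```

Response time is the time of the first step at which the score leaves zero, and the error rate follows from comparing the returned letter with the target letter over trials.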

Fig. 10

Each simulated trial emulates the structure of an experimental trial: the stimulus is followed by an ISI and a random dot mask. Note that the times differ slightly from those used in the experiment, e.g., the simulated ISI is 100 ms rather than 80 ms. This is due to the model having an (arbitrary) time step of 50 ms, which has been used in prior work (Francis et al., 2017; Kon and Francis, 2022). We opted to continue to use this time step duration here since this small difference in timing should not affect the pattern of results

Fig. 11

Part of an example trial that involved model activity from a random dot mask. On this trial the model took 400 ms to make a correct response

All simulations were programmed using Python2 scripts with the package NEST 2.14.0 (Peyser et al., 2017) for creating the cells and synapses and for managing network dynamics. A single cell type (iaf_psc_alpha, which is a leaky integrate-and-fire neuron model with alpha-function shaped synaptic currents) and synapse type (static_synapse) were used, and synapse weights were manually set to implement the various circuits (for orientation detectors and connection formation). Each trial takes approximately 11–17 minutes (real time) to run, depending on the time at which a trial was terminated due to a target letter being identified. Stimuli were made using a custom Python3 script with standard packages (numpy, random) and were written to .bmp files using the package ImageIO (Klein et al., 2018). As in the experiment, a different random dot mask was generated for each simulated trial. The simulations were run in batches on two computers in parallel to reduce overall run time. The computers were a 2019 MacBook Pro (32 GB RAM and 8 cores) and a 2018 Linux machine running Debian (16 GB RAM and 6 cores). Checks of the different computing systems indicate that they give the same results. (All code and stimuli can be found on the OSF at https://osf.io/zrkue/)

Model Results and Discussion

The grouping strategy summarized in Fig. 8, i.e., Grouping Strategy 1, was implemented in the model and produced the simulated results shown in the last column of Fig. 3. Recall that with this grouping strategy, circle elements connect to each other, but square elements do not connect with anything. The circle group is then selected/segmented to layer 1 and separated from unselected elements. Activity in layer 1 is then interpreted by the templates. Model results from the simulation of this experiment are well correlated with our Experiment 2 data (r = .994 for mean response times, r = .998 for error rates). Some discrepancies with the data are easily explained. Response times are much faster for the simulations compared with the experimental results, but this is expected because the model lacks a motor component, among other things. Additionally, model error rates for the simulated data are at zero for conditions A and B, but non-zero for the experimental results. We suspect that participants were performing near ceiling for these conditions and that their non-zero error rates are largely noise, e.g., accidentally pressing the wrong key, which does not happen in the model. Overall, the model using Grouping Strategy 1 produces results similar to those measured in Experiment 2.

Performance for Other Connection Strategies

It is encouraging that model performance with Grouping Strategy 1 seems to closely mimic human behavior. However, we do not mean to propose that this grouping strategy is the best because it matches human behavior. To justify why observers might use a given grouping strategy, we need some other motivation such as good performance or easy implementation. Grouping Strategy 1 seems fairly easy to implement: the connection strategy uses a fixed set of top-down control parameters for all conditions and the selection strategy involves consistent placement of selection signals with only variations in selection signal size for sets of conditions.

To investigate how well Grouping Strategy 1 does with regard to overall performance, we consider two alternative approaches that differ in their connection strategies. In Grouping Strategy 0, no stimulus elements were connected. In Grouping Strategy 2, circles connected with other circles and squares connected with other squares (yet circles and squares did not connect with each other). Simulation response times and error rates for all three strategies are shown in Fig. 12.
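At a purely descriptive level, the three connection strategies differ only in which pairs of element types are allowed to connect. The Python sketch below encodes that difference as a lookup table. It is an abstraction for illustration only: in the model, connections arise from connection circuits under top-down control parameters rather than from a table, and the names `CONNECTABLE_PAIRS` and `may_connect` are hypothetical.

```python
# Pairs of element types that may connect under each grouping strategy.
# No strategy connects circles with squares, so pair order does not matter.
CONNECTABLE_PAIRS = {
    0: set(),                                          # no connections
    1: {("circle", "circle")},                         # only target circles
    2: {("circle", "circle"), ("square", "square")},   # within each shape type
}

def may_connect(strategy, shape_a, shape_b):
    """Return True if two elements of the given shape types are allowed
    to connect under the given grouping strategy (0, 1, or 2)."""
    return (shape_a, shape_b) in CONNECTABLE_PAIRS[strategy]
```

The table makes the contrast between strategies explicit: Grouping Strategy 2 adds square-to-square connections to Grouping Strategy 1, so a single mistakenly selected square can bring other distractors with it.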

Fig. 12

A comparison of the performance for grouping strategies that differed in their connection strategies, which are shown in the images using condition C as an example. Error bars in the left plot represent standard error. The standard error is zero for several of the means for conditions A and B because the model consistently responds as quickly as possible

For conditions A and C, where the target circles are not joined by stimulus lines, the connections generated with Grouping Strategy 1 clearly lead to better performance than Grouping Strategy 0: response times and error rates for Grouping Strategy 0 were quite high compared to Grouping Strategy 1. Notice, however, that the lack of connections for condition A does not increase the response time as much as for condition C. This is due to the relatively large selection signal size used for condition A; the large selection signal size selects target objects relatively quickly, effectively grouping elements by selection, and, thus, the observer does not need to rely on the spread of connections before making a decision (see Fig. 9). However, this selection strategy comes at the cost of a high error rate since, if the target letter is an E, the selection signal is unlikely to cover the top or bottom row of target circles and, thus, the observer tends to incorrectly respond that the target letter is an H. An extremely large selection signal could be used in tandem with no connections for condition A, which would group all stimulus elements by selection and result in the same fast and accurate performance that Grouping Strategy 1 produces. However, the use of connections allows the observer to use a broader range of selection signal sizes and have fast, accurate performance for condition A. In sum, connections are not necessary to produce good performance for condition A if a very large selection signal is used, but connections do allow a broad range of selection signal size strategies to promote good task performance. Additionally, for condition C connections are needed to promote good performance because increasing the selection signal size results in the selection of surrounding squares, which interferes with template matching.

Given that connections need to be created for some trials and generally do not hurt performance, we suspect that observers elect to form connections between target elements in all conditions. Keeping a consistent grouping strategy also simplifies the task for observers, which probably reduces response times overall.

Adding connections among distractor squares (Grouping Strategy 2) makes performance worse for conditions C and D. Connecting only target circles in Grouping Strategy 1 has the advantage of not being too costly if a square is mistakenly selected. If this square is connected to other squares as in Grouping Strategy 2, then the observer will select and segment a large number of distractors, which interferes with the template calculations.

For conditions B and D, Grouping Strategy 0 (with no connections) leads to similar performance as Grouping Strategy 1, which is unsurprising given that their target circles are joined by lines in the stimulus. For these conditions the selection signal spreads across the target even without connections.

Overall, Grouping Strategy 1 leads to the best performance compared to the other strategies (assuming that the selection signal size strategy described in section 5.5 is used, which was the same in Grouping Strategies 0, 1 and 2). Thus, the observed empirical pattern of results is arguably due to observers implementing something similar to Grouping Strategy 1 because it is easy to do and because it does a good job on the task.

Performance for Different Selection Strategies

To assess the role of the selection strategy on performance, a simulation was conducted that implemented Grouping Strategy 3, which had the same connection strategy as Grouping Strategy 1 but the opposite selection size strategy. The strategies and results are summarized in Fig. 13.

Fig. 13

A comparison of the performance for grouping strategies that have different selection strategies, which differ in selection signal size, for each condition as shown in the images. As in Fig. 8, the selection signals are overlaid on the stimulus images to give a sense of their location and to present them separately from the connection strategy, even though in the model the selection signals are applied to V2 activity in Segmentation Layer 0. Error bars in the left plot represent standard error. Standard error is zero for conditions A and B with Grouping Strategy 1 because the model consistently responds as quickly as possible on every trial

Compared with Grouping Strategy 1, all mean response times for Grouping Strategy 3 are higher. For conditions A and B, this is due to the small selection signals missing a target circle or being at locations, e.g., the middle of the middle row, that required more time for the selection signal to spread across the target elements and, thus, took longer to reach a decision. Therefore, larger selection signals, like those implemented in Grouping Strategy 1, lead to better performance for conditions A and B because they rarely miss a target element and because they allow the selection signal to spread quickly, which leads to faster segmentation and identification of the target.

For conditions C and D, response times are slow for Grouping Strategy 3 because the larger selection signals segment more square distractors. For reasons similar to why Grouping Strategy 2 was slower than Grouping Strategy 1, the segmented distractors often interfere with the model evidence calculations that might favor identification of the target. In turn, the model often has to wait for selection signals to spread across the entire target before making a decision. Although Grouping Strategy 3 results in few errors, it comes at the cost of higher response times.

Thus, Grouping Strategy 1 results in better overall performance than Grouping Strategy 3. A smaller selection signal size reduces response time for conditions C and D yet hinders performance for conditions A and B. In turn, a selection strategy with small selection signals for conditions C and D and large selection signals for conditions A and B will result in good performance (assuming that the selection strategy is coupled with a connection strategy in which only circles connect).

Exploratory Analyses for Task-Set Switching

A reviewer of an earlier version of this manuscript suggested exploring the empirical data for evidence of “task-set switching” (Kiesel et al., 2010; Rogers & Monsell, 1995; Schneider & Logan, 2014). As described above, our model suggests that observers use different selection strategies depending on whether the target elements are embedded among distractors (conditions C and D) or not (conditions A and B). We suppose that observers select which strategy to use for a given trial by extracting gist information from the scene. Setting up a selection strategy might take some time, so responses might be a bit slower if the preceding trial used a different selection strategy. By looking at the response times for sequential trial pairs, we can check whether response times for the second trial are slower when the previous trial should involve a different selection strategy. The experiment was not designed to test for task-set switching effects (e.g., the numbers of repetition and switch trials are not necessarily balanced), and we do not have a theoretically motivated estimate of the size of such an effect. If we assume that the grouping strategy of the previous trial “carries over” to the next trial, we expect a small advantage for repetition trials because the modification to the grouping strategy only involves changing the size of the selection signal. If, on the other hand, we assume that the grouping strategy is effectively reset for each trial, then we would not expect an effect, as both repetition and switch trials require some time to set up the grouping strategy. For the sake of simplicity, the simulations reported above were run in accord with the second assumption; however, the model provides no reason to hold one assumption over the other. For these reasons, we consider this analysis to be an exploratory investigation of possible model properties rather than a test of model predictions.

Table 3 shows the mean response times for the second of a pair of consecutive trials in Experiment 1, based on whether the model predicts that observers use either the same (repeated) or a different (switched) selection strategy for the two trials. There is a small (around 6 ms) increase in response times when the previous trial would use a different model-predicted grouping strategy than the current trial. Although the effect is small, an ANOVA indicates a significant effect of grouping strategy sequence on response time, F(1,124) = 9.26, p = .003.
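The classification of trial pairs used in this analysis can be illustrated with a short Python sketch. It assumes only what is stated in the text, namely that conditions A and B call for large selection signals and conditions C and D for small ones. The function names are hypothetical, and details of the actual analysis that are not described here (for example, per-participant aggregation before the ANOVA, or any trial exclusions) are deliberately left out.

```python
from statistics import mean

# Model-predicted selection signal size for each stimulus condition.
SELECTION_SIZE = {"A": "large", "B": "large", "C": "small", "D": "small"}

def label_trial_pairs(conditions):
    """Label every trial after the first as 'repeated' or 'switched',
    depending on whether the preceding trial calls for the same
    model-predicted selection strategy."""
    labels = []
    for prev, curr in zip(conditions, conditions[1:]):
        same = SELECTION_SIZE[prev] == SELECTION_SIZE[curr]
        labels.append("repeated" if same else "switched")
    return labels

def mean_rt_by_sequence(conditions, rts_ms):
    """Mean response time of the second trial of each consecutive pair,
    split by sequence type. Returns None for a type with no trials."""
    grouped = {"repeated": [], "switched": []}
    for label, rt in zip(label_trial_pairs(conditions), rts_ms[1:]):
        grouped[label].append(rt)
    return {k: (mean(v) if v else None) for k, v in grouped.items()}
```

Applied to one observer's ordered list of conditions and response times, the difference between the two means gives a switch cost for that observer.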

Table 3 Descriptive Statistics for Task-Set Switching in Experiment 1

Table 4 shows the statistics corresponding to the same analysis for Experiment 2. Here, the effect of grouping strategy sequence among trial pairs is 7 ms and significant, F(1,119) = 10.75, p = .001.

Table 4 Descriptive Statistics for Task-Set Switching in Experiment 2

General Discussion

The close match between the experimental results and the simulated results with Grouping Strategy 1 provides support for our claim that human observers may be using this kind of grouping strategy. This grouping strategy consists of two key components. First, the same connection strategy was used for all conditions so that only target circles formed connections between themselves. This connection strategy allows a range of selection signal sizes to promote performance for conditions A and B. Second, a condition-dependent selection strategy with small selection signals for conditions with surrounding squares and large selection signals for conditions without surrounding squares produces fast letter identification with few errors. In our simulations, Grouping Strategy 1 not only best resembled the pattern of responses from the experiments but also produced the best overall performance.

Recall from Section 1 that Han et al. designed their experiment to explore three main hypotheses: (1) grouping by proximity occurs earlier than grouping by shape, (2) uniform connectedness occurs prior to grouping by proximity, and (3) uniform connectedness occurs prior to grouping by shape similarity.

Han et al. regard the difference in performance for conditions A and C as support for hypothesis (1) that grouping by proximity occurs earlier than grouping by shape. In the model, this difference in performance is largely due to the selection strategy, which is chosen to avoid interference from the distractors in condition C. While all conditions have the same connection strategy of connecting only target circles, the selection strategy involves using a larger selection signal for condition A compared to condition C. A large selection signal quickly segments elements even without spreading across connected elements. Due to the risk of selecting nearby square distractors in condition C, smaller selection signals are used, which comes at the cost of taking more time to segment enough target signal to identify the global letter. Errors in condition C typically occur when a selected square falls in the template of the other letter or if one of the small selection signals does not land on a circle due to misplacement, which can result in the observer having evidence in favor of the incorrect letter. For condition A, there is no risk of selecting a distractor and, thus, no such errors.

Han et al. took the difference in performance between conditions C and D as support for hypothesis (3) that uniform connectedness occurs before grouping by shape similarity. In the model, this difference in performance is due to the time it takes for connections to form. In condition D, the circles are already joined by physical lines, so the selection signal can spread across these lines as soon as the signal is initiated. For condition C, the target elements must be joined by connections before the selection signal can spread across them. The formation of connections takes time, which produces slower responses for condition C, compared to condition D. Errors occur for both conditions due to randomness in the placement of selection signals and inadvertent selection of distractors. Such errors occur more frequently for condition C because it takes time for connections to spread and, thus, the observer could have information indicating the incorrect letter for a longer period of time, thereby increasing the chance of making a fast but incorrect response. Thus, the empirical data does not necessarily indicate an order to processes; the model detects and manipulates edges, but no edges are more basic than (or necessarily prior to) other edges. Likewise, the model suggests that uniform connectedness is an emergent property of selection/segmentation rather than being due to a specialized mechanism that results in basic units that are then grouped at a later stage of processing.

Han et al. regard the similarity in performance for conditions A and B as support against hypothesis (2) that uniform connectedness occurs prior to grouping by proximity. In the model, this similarity in performance is due to the combination of a large selection signal for both conditions and a connection strategy where target elements are connected in all conditions. The similar performance for these conditions is quite robust in the model because the lack of distractors makes it easy to select target elements by using large selection signals provided that the target circles are connected, which, as argued above, is an easy connection strategy to implement for this stimulus set. There is a range of grouping strategies that all involve using relatively large selection signals to promote performance for condition A, and these strategies also perform well for condition B. Thus, rather than indicating that uniform connectedness does not occur prior to grouping by proximity, the similarity in performance for conditions A and B reflects the ability of the observer to use relatively large selection signals for these conditions, which results in fast, accurate identification of the target letter regardless of whether the circles are joined by lines (condition B) or not (condition A).

Overall, the results in Han et al. (1999a) support our hypothesis that observers use a grouping strategy, which involves both a connection strategy and a selection strategy, that promotes performance on a given task.

Conclusions

According to the model, a grouping strategy consists of a connection strategy and a selection strategy, both of which are subject to top-down control. A connection strategy is implemented by tuning the connection parameters to create connections among stimulus elements that will promote performance on a given task. A selection strategy, which concerns the number, placement, size, and timing of a selection signal, is chosen in conjunction with a connection strategy with the aim of promoting performance on a particular task by guiding selection signals to areas that will separate targets and distractors, thereby making the target(s) easier to identify.

For the Han et al. task, the simulation results quite closely resemble those of human observers when the model implements a grouping strategy that is easy to apply and promotes good performance. The model works with low-level mechanisms instead of relying on Gestalt grouping principles (Kon and Francis, 2022). We anticipate that many empirical measures of grouping reflect different grouping strategies that are implemented by observers to easily and efficiently complete specified tasks for that measure. Given that we claim grouping strategies are task-dependent, an implication of this hypothesis is that it will be uncommon for strategies to generalize from one situation to another. Instead, observers will create and implement novel strategies for each situation and task. From this perspective grouping is rarely the passive process suggested by Köhler (1929), and comparisons across conditions might need to consider how various grouping strategies could be brought to bear.

The model’s interpretation of the empirical data highlights that perceptual grouping is neither a well-defined concept nor a process that has a single mechanism or even a series of mechanisms. Rather, we propose that what is referred to as “perceptual grouping” reflects many different model behaviors that together achieve a given task. What is described as a type of grouping may involve many different mechanisms depending on the task and stimulus set. For example, in the model differences in grouping can occur through different connections between elements or by different selection approaches. For conditions C and D, the surrounding squares prevent the use of large selection signals, so small selection signals are used that spread along connected stimulus edges. In these conditions, it might be suggested that target circles were grouped by the formed connections. However, such connections were not required for conditions A and B. Here, large selection signals can be used to quickly segment the target elements. In these conditions, it might be suggested that grouping was done by selection rather than by the formed connections (although such connections do not hurt the process). The complexity of the involved mechanisms, which operate in parallel, makes it challenging to empirically isolate one mechanism from others, a point that has been apparent in the empirical literature for quite some time (Wagemans, 2018). However, in the model top-down control of these mechanisms can be directly manipulated, thereby allowing for better understanding of how these mechanisms contribute to performing the task at hand.

Importantly, although the top-down grouping strategy implemented by the model was chosen to promote performance on the task, the model mechanisms were not designed to emulate performance for the specific task of Han et al. (1999a). The selection/segmentation mechanisms utilized here were proposed to play an important role in “uncrowding” (Francis et al., 2017). Likewise, the connection circuits used here were originally proposed to explain the general flexibility of perceptual grouping (Kon and Francis, 2022). The application of the same mechanisms to the task of Han et al. (1999a) demonstrates how a few basic circuits can be surreptitiously combined to solve novel visual tasks.