Call control allows an organism to produce an acoustic signal irrespective of its own underlying emotional state. It is thus a prerequisite to “higher” abilities, such as call imitation, innovation and the use of arbitrary or deceptive calls, and therefore to speech. However, among primates, call control is presumed to be largely confined to humans (Seyfarth and Cheney 2008). Consequently, there is little agreement about its evolutionary precursors (Christiansen and Kirby 2003). Essentially two major models and lines of evidence have been proposed: speech evolved (1) as an extension of acoustic communication in non-human primates (e.g. Seyfarth et al. 1980; Slocombe and Zuberbühler 2005; Arnold and Zuberbühler 2006; Wich et al. 2009) or (2) from non-human primate gestural communication (e.g. Rizzolatti and Arbib 1998; Corballis 2003; Arbib et al. 2008). These models have been seen as mutually exclusive or as sequential accounts in which calls replace gestures (Brown et al. 1999); however, both face limitations concerning the emergence of call control in our evolutionary lineage. Did call control derive from an essentially emotional use of calls, or from an essentially voluntary use of gestures, such as that of non-human primates? The acoustic model needs to explain how a fundamentally close-ended acoustic system became open-ended (i.e. with a limitless number of elements, like speech). The gestural model needs to clarify the behaviors and respective functional advantages that allowed a shift (or “translation”) from an open-ended gestural system to an open-ended acoustic system.

Other important evolutionary models, such as those on syntax (e.g. Scott-Phillips and Kirby 2010), protolanguage (e.g. Mithen 2005), musilanguage (e.g. Brown et al. 1999), linguistic categories (e.g. Puglisi et al. 2008), increased breathing control (e.g. Maclarnon and Hewitt 2004) and iterated learning (e.g. Smith et al. 2003), some of which merge acoustic and gestural models, such as those on Motherese (e.g. Falk 2004) and frame/content (e.g. MacNeilage 1998), commonly begin with a hypothetical organism that is equipped a priori with call control, or overlook the behaviors that may have provided the functional advantages towards call control. We propose that recent orangutan (Pongo pygmaeus wurmbii) findings address the limitations of these models and reconcile them. The arguments supporting the above-mentioned models remain compatible with the view presented here.

We recently described (Hardus et al. 2009a) how and why wild orangutans use gestures to functionally alter the acoustic characteristics of a particular sound (sensu Lameira et al. 2010) emitted under disturbing contexts, the kiss squeak (Hardus et al. 2009b). By positioning a hand or holding leaves in front of their lips, wild orangutans lower the maximum frequency (i.e. that of highest dB) of the call while keeping its other parameters similar. Evidence suggests that kiss squeaks are under voluntary motor control in orangutans. When individuals produce these modified variants of the call, they sound as if their body size is bigger than it actually is, reinforcing this impression on a potential predator and potentially deterring it through functional deception.

Kiss squeaks with a hand and on leaves represent, to the best of our knowledge, the only example of instrumental gesture-calls (IGC) in non-human primates. IGC can be defined as gestures that modify oro-laryngeal acoustic production, with or without tools, such as finger-assisted whistling or brass-/woodwind-instrument playing. To achieve this acoustic modification, some sort of physical contact between hands/tools and lips, and possibly tongue, is critically required. Mere physical proximity is unlikely to modify a call considerably, as, for instance, when speaking loudly through funneled hands. These gestures are importantly distinct from gestures that produce an acoustic signal themselves, with or without tools, and that can be made during call production. Such acoustic gesture-calls have been reported in other ape species (Arcadi et al. 1998) and are possibly present in most non-human primate species, such as when making noisy displays during loud calls and/or alarm calling, by slapping the ground or strongly striking branches. Heuristically, gestures may be considered additive in acoustic gesture-calls, whereas gestures in IGC may be considered multiplicative.

IGC in hominids multiply the number of call-types comprising the acoustic repertoire in an extremely simple way: one call-type used in combination with different gestures produces new call-types. That is, an individual can augment its innate acoustic repertoire solely by means of an ability already present: gesture control. It is very likely that our ape/hominid ancestors would have exploited such a “new” repertoire when available, as a means to transmit more (graded) information, since cognitive abilities in non-human primates have been demonstrated to be richer and more advanced than their acoustic abilities (Seyfarth and Cheney 2010).
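The multiplicative effect can be made concrete with a small piece of arithmetic. The following is only an illustrative upper bound, not a result from the cited studies: the symbols C, G and R are introduced here for illustration, and the bound assumes that every gesture can be combined with every call-type and that each combination is acoustically distinct.

```latex
% Illustrative notation (assumed here, not taken from the cited studies):
%   C = number of innate call-types
%   G = number of call-modifying gestures
%   R_IGC = upper bound on the number of distinct call variants
% Each call-type can be produced unaided or with any one of the G gestures:
R_{\mathrm{IGC}} = C\,(G + 1)

% Kiss squeak as described above: one call-type (C = 1) and
% two modifying gestures, hand and leaves (G = 2):
R_{\mathrm{IGC}} = 1 \times (2 + 1) = 3
```

Under these assumptions, a single innate call-type yields three variants (unaided, with a hand, on leaves) without any change in oro-laryngeal control.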

We hypothesise that IGC, dating back to the hominid-pongid split (9–13 m.y.a.; Hobolth et al. 2011), may have provided the direct functional and neural sensory-motor basis for call control in an early human ancestor essentially lacking this ability; that is, they served as an exaptation for this ability. IGC are remarkable in that they bring the gestural and oro-laryngeal systems of motor control into close temporal, motivational, contextual, anatomical and functional association within the communication domain. Hand-assisted feeding, for instance, raises the same associations between gestural and oro-laryngeal systems of motor control, but in the foraging domain. IGC therefore necessarily involve the synchronous activation of multiple neural sensory-motor systems in the ape brain. In the ape cerebral cortex, such activations will mainly occur within regions homologous to the cortical homunculus (which comprises the primary motor cortex and plays a crucial role in general voluntary motor control) and between the cortical homunculus and other cortical systems involved in the domain of communication, such as those homologous to Broca’s and Wernicke’s areas (Taglialatela et al. 2011). Such synchronous activations may have provided a neural interface between the brain areas activated, through functional integration and clustering (Tononi et al. 1998a, b), enabling the sharing of abilities that were previously fundamentally restricted or segregated to particular areas. By means of cortical and neural plasticity (Lieberman 2002a), as in, for example, use-dependent functional reorganization of sensory cortices (Pantev et al. 1998), this interface would have set the basis for the establishment of enhanced and more resilient short- and long-distance circuits. Indeed, cortical and neural plasticity is at the basis of hemispheric asymmetries in key areas of the ape and human brain for communicative signaling (Hopkins and Nir 2010; Perani et al. 2011).

As the focus of voluntary control, the cortical homunculus would represent the main stage for these circuit modifications. The number of areas activated in this region and their mutual proximity would add up to form a momentary local hotspot of activations sufficient to ignite neighbouring areas over which there was previously little voluntary control. Namely, circuitry between the respiration, hand, face, lips, and tongue (somatotopic) locations would expand to include that of the larynx areas. These circuits would not necessarily need to be established de novo; instead, they would only need to build and expand modestly on previously existing ones. For instance, a rudimentary but functionally relevant interface between hand, respiration and laryngeal locations (and possibly lips and tongue) is already present in the ape brain, in that use of the right hand for gestures is significantly enhanced when the gestures are accompanied by a call (Hopkins and Cantero 2003). At the same time, pathways between the primary motor cortex and the nucleus ambiguus (site of the laryngeal motor neurons in the medulla oblongata), which are specifically interpreted as representing a crucial neural step in gaining call control (Fitch 2005; Brown et al. 2008), are found in apes but not in monkeys (Kuypers 1958), substantiating the view that a rudimentary interface is already present between systems.

In humans, neuroimaging studies support this evolutionary scenario. For instance, the (somatotopic) location of the larynx/phonation area (that with control over the intrinsic musculature of the larynx, underlying adduction/abduction and tensing/relaxing of the vocal folds) in the cortical homunculus is adjacent to the lips area and the expiratory area (Brown et al. 2008). This means that in humans, phonation, articulation and respiration are neurologically conjunct. Considering that orangutans have been experimentally demonstrated to exert apt voluntary motor control over lips and respiration (Wich et al. 2009; Lameira et al. in review), it is reasonable to view this conjunction as evolutionarily relevant in humans. While laryngeal musculature may operate in complex ways during (online) speech and other functions (Jürgens 2002; Ludlow 2005), the evolutionary genesis of call control theoretically commenced when the first rudimentary neural signal initiated in the primary motor cortex was transmitted successfully simply to set the larynx into position during air-flow. The view that flexibility in neural circuitry could have achieved this in our ancestors is supported by a phenomenon known in humans as motor equivalence, where speakers develop different motor strategies, i.e. use different musculatures of the larynx, to achieve the same voice outcome (Ludlow 2005). Accordingly, IGC could potentially explain why the area of representation of the intrinsic laryngeal muscles has seemingly migrated toward the labial area in humans (Brown et al. 2008). In addition, IGC are in concordance with the increasing literature corroborating that gestures and calls/speech are neurally co-processed (e.g. Rizzolatti and Arbib 1998; Bernardis and Gentilucci 2006; Xu et al. 2009).

At the same time, these bimodal behaviors represent cultural variants of orangutan behavior (e.g. van Schaik et al. 2003). Accordingly, enhanced neural connectivity would have also developed across brain systems in areas involved in processing social information, emotional valence and learning, such as the amygdala and the auditory cortex (Remedios et al. 2009). Thus, brain-language (Deacon 1998), biology-culture (Richerson and Boyd 2005) and music-language premises (Brown 1999) are concordant with the IGC hypothesis.

IGC present a parsimonious route to human-like neurophysiology, increased call control and repertoire size in the earliest stages of speech evolution, but one may question their relevance based on the phylogenetic distance between orangutans and humans. Three clarifications are required. Firstly, comparison between human, chimpanzee and orangutan genomes shows that some regions of the human genome more closely resemble the orangutan’s (Hobolth et al. 2011). Although this percentage is approximately 1%, a necessarily bigger percentage is equally similar between humans, chimpanzees and orangutans. While the broad genetic underpinnings of speech are not well understood beyond the FoxP2 gene (e.g. Enard et al. 2002), the relevance of genetic proximity within hominoids remains equivocal. Secondly, speech is a bio-cultural evolutionary phenomenon (Richerson and Boyd 2005), and therefore theories must encompass some degree of interaction between social and genetic mechanisms in the acquisition and transmission of communication signals. Orangutans and chimpanzees are the only apes to show extensive cultures in the wild (e.g. Whiten et al. 1999; van Schaik et al. 2003); thus, both species represent promising models. Thirdly, the description of IGC in orangutans but (so far) not in chimpanzees may constitute a methodological artifact. While cultural variants between populations have been investigated in wild chimpanzees, this record tends to focus on feeding behavior (Watson and Caldwell 2009). In contrast, researchers have investigated geographical variation in orangutans’ complete call repertoire (Hardus et al. 2009b). These conditions may have made IGC easier to describe in orangutans than in chimpanzees. There are nonetheless anecdotes suggesting that IGC may be part of the chimpanzee repertoire, such as the use of a hand in front of the mouth to muffle a call, as described by Jane Goodall (Deacon 1998).

This essay presents a new view on the earliest stages of speech evolution, based on orangutan IGC. It builds on the concept that enhanced linguistic ability cannot be totally differentiated from enhanced motor activity (Lieberman 2002b), and argues that IGC may have constituted speech exaptations, providing functional advantages in a human ancestor essentially lacking call control while allowing the emergence of the neural and communicative basis for subsequent selection favouring basic abilities for speech. This view provides future speech evolution models with a new, concrete model organism, similar to orangutans in (1) call control, (2) call repertoire size and (3) reliance on social learning.