1 Introduction

1.1 Word order and sentence comprehension

It has been well established in the sentence processing literature that various syntactic properties, such as word order and the type/length of syntactic dependency, have a major influence on the processing cost of a sentence (Aoshima et al. 2004; Fiebach et al. 2002; Frazier 1987; Nakano et al. 2002; Phillips et al. 2005; Phillips and Wagers 2007; Schlesewsky et al. 2000, among many others). For instance, in languages where the subject (S) precedes the object (O) in their canonical word order, which we call “SO-type languages,” it has been repeatedly observed that the native speakers need more time to comprehend sentences when they are presented in a non-canonical word order, such as the OS word order, where the object comes before the subject (Bornkessel et al. 2002; Erdocia et al. 2009; Hagiwara et al. 2007; Imamura and Koizumi 2008; Kim et al. 2009; Matzke et al. 2002; Mazuka et al. 2002; Sekerina 2003; Tamaoka et al. 2003). In the pair of sentences in (1), taken from Mazuka et al. (2002), native speakers of Japanese took more time to read the scrambled sentence in (1b) than the canonical word order sentence in (1a).

(1)
a. SO word order (canonical)
   Mariko-ga   otooto-o     yonda
   Mariko-nom  brother-acc  called
b. OS word order (scrambled, non-canonical)
   Otooto-o     Mariko-ga   yonda
   brother-acc  Mariko-nom  called
   ‘Mariko called her brother.’

Further, Bader and Meng (1999) presented stimulus sentences in German, those in (2), to the native speakers of German in a speeded grammaticality judgment task, where the participants were instructed to judge as quickly and accurately as possible whether or not the sentence was grammatically acceptable. The sentences shown in (2) involve a relative clause structure, in which the noun phrase die Eltern “the parents” is locally ambiguous between the subject or the object in the relative clause until the reader reads the clause-final auxiliary verb hat or haben. This auxiliary verb indicates whether the subject is singular or plural, which disambiguates the structure.

(2)
a. Subject gap in the relative clause
   Maria  erzählte  mir  von der Frau,    [CP  die  __  die Eltern   angerufen  hat]
   Maria  told      me   about the woman       who      the parents  phoned     has
   ‘Maria told me about the woman, who has phoned the parents.’
b. Object gap in the relative clause
   Maria  erzählte  mir  von der Frau,    [CP  die  die Eltern   __  angerufen  haben]
   Maria  told      me   about the woman       who  the parents      phoned     have
   ‘Maria told me about the woman, who the parents have phoned.’

Bader and Meng (1999) found that the participants in their experiment were slow to judge (2b) to be “acceptable,” compared to (2a). Furthermore, they were less accurate in making a judgment for (2b) than (2a). This showed that the native speakers of German have a bias in which the relative pronoun die “who” corresponds to the subject in the relative clause, so the first noun phrase they encounter in the relative clause is assumed to be the object. They have a subject-first/agent-first preference.

The observations from Mazuka et al. (2002) and others suggest that syntactic complexity is one major factor for determining the processing cost of a given sentence. When readers encounter a sequence of noun phrases in a non-canonical word order, they must map those phrases into a structure with an extra layer of projection to accommodate the scrambled noun phrase (Saito 1989; Saito and Fukui 1998; but see Miyagawa 2005 for a different approach). In this sense, the sentences that use the canonical word order have a simpler syntactic structure, hence they require a smaller amount of processing resources for comprehension (e.g., O’Grady 1997). Furthermore, the German examples in (2) suggest that sentences with a longer syntactic dependency are costly to process. The example in (2b) involves a dependency between the relative pronoun and the object position in the relative clause, which is longer than the dependency in (2a) (Gibson 2000; Grodner and Gibson 2005). In terms of dependency, it is possible to regard the processing cost in (1b) of the non-canonical word order as the cost for a filler-gap dependency. The displaced object functions as a filler that has to be encoded in the working memory until it is eventually interpreted at the gap-position; the storage and integration of a filler require a cognitive cost (Gibson 2000). Therefore, the presence of a filler-gap dependency and its length are factors that increase the processing cost of a given sentence (see Footnote 1).

There is another approach to the processing costs associated with the OS word order. It has been observed in many sentence production studies that a perceptual property in an event, often noted as saliency, has an effect on word order preferences. Saliency, or properties connected to conceptual accessibility or animacy, is related to the thematic roles in an event, and some have suggested that the agent-before-patient or animate-before-inanimate order is preferred (Bock 1982; Bock and Warren 1985; Branigan et al. 2008; McDonald et al. 1993; Tanaka et al. 2011). For instance, Tanaka et al. (2011) used a sentence recall task, and asked the participants to recall Japanese sentences such as those shown in (3).

(3)
a. SO word order
   Minato-de  ryoosi-ga      booto-o   hakonda
   port-at    fisherman-nom  boat-acc  carried
b. OS word order
   Minato-de  booto-o   ryoosi-ga      hakonda
   port-at    boat-acc  fisherman-nom  carried
   ‘At the port, the fisherman carried the boat.’

Tanaka et al. (2011) observed that, when the participants recalled the sentences, they were more likely to invert the word order and produce an SO word order such as (3a). One interpretation of this result is that, in addition to syntactic factors, native speakers of Japanese prefer to produce sentences in which the agent comes before the patient. Such an ordering preference may stem from an idea that the agent is more salient than the theme in an event, because the agent has more control over the event’s progress.

Similarly, Sauppe et al. (2013) observed the agent preference over the patient in their picture-description experiment using Tagalog, a language spoken in the Philippines. They presented participants with a picture depicting a transitive event, and measured eye gaze patterns while the participants produced a sentence. In the time-window of the first 600 ms after the presentation of the stimulus picture on the screen, participants looked at the agent in the picture more often and longer than the patient in the picture. Also, Hwang (2017) found the tendency among native speakers of Korean to place agent nouns in the sentence-initial position in a sentence-assembly task. Crucially, this effect was also present when animacy was controlled. These observations suggest that there is a universal tendency to place a salient element, such as an agent noun, before a less salient element. Similar ideas based on some notions related to human cognitive features and/or discourse features are found in MacWhinney (1977), Primus (1999), Kemmerer (2012), Cohn and Paczynski (2013), and Cohn et al. (2017), among others.

The above discussion should make clear that the word order preference of SO over OS (or agent-theme order over theme-agent order) is observed in a wide range of paradigms, but at the same time it should also be noted that data samples are quite limited in terms of linguistic diversity. Most of the languages that have been studied and found to have SO and/or agent-before-theme preferences are heavily biased toward socially, economically, and/or politically “rich” languages (see related discussions in Anand et al. 2011; Jaeger and Norcliffe 2009; Norcliffe et al. 2015). With this in mind, it is important to investigate a much wider range of languages, to examine to what extent the SO and agent-before-theme preferences are universal features of language. In particular, in terms of this word order preference, it is important to investigate those that we call “OS-type languages,” in which the object comes before the subject in their underlying basic word order.

Koizumi et al. (2014) conducted one of the few studies investigating the nature of the SO preference, using an OS-type language (see also Kiyama et al. 2013; Yasunaga et al. 2015). They examined the word order preference in Kaqchikel, a Mayan language spoken in Guatemala. In Kaqchikel, the canonical word order is VOS (Ajsivinac Sian et al. 2004, p. 162; García Mátzar and Rodríguez Guaján 1997, p. 333; Rodríguez Guaján 1994, p. 200; Tichoc Cumes et al. 2000, p. 195; see also England (1991) and Aissen (1992)), but the language also allows SVO word order. Like other Mayan languages, Kaqchikel is a head-marking language, in which the verb carries agreement markers of the dependent elements such as subject and object (with respect to person and number), in addition to tense/aspect markers. Sentences in (4) are a sample set of their target sentences; in an auditory semantic anomaly detection task, they measured participants’ response times in making a plausibility judgment. In the following examples, the verb shows the person and number agreement markers; one for the ergative NP (subject), and the other for the absolutive NP (object).

(4)
a. VOS word order (canonical)
   X-∅-u-chöy                 ri chäj        ri ajanel
   compl-erg.3sg-abs.3sg-cut  det pine.tree  det carpenter
b. SVO word order (the subject is fronted to the sentence-initial position)
   Ri ajanel      x-∅-u-chöy                 ri chäj
   det carpenter  compl-erg.3sg-abs.3sg-cut  det pine.tree
   ‘The carpenter cut the pine tree.’

Koizumi et al. (2014) observed that VOS sentences were responded to significantly faster than SVO sentences, by roughly 150 ms. These results suggest that the SO preference is not a universal feature of language. Instead, they suggest that syntactic properties of a given language greatly affect the processing cost in sentence comprehension, and note that it took more time for native speakers of Kaqchikel to process sentences with non-canonical word orders, like that in (4b). Yasunaga et al. (2015) also observed a similar effect in their event-related potential (ERP) study. Comparing the SVO and VOS sentences, the object in SVO word order sentences elicited a P600 effect, suggesting that SVO sentences are more costly than VOS sentences. These studies suggest that, both in SO- and OS-type languages, sentences that are syntactically more complex, due to the presence of a filler-gap dependency, require more processing resources than those sentences with no filler-gap dependency.

Building upon the findings from Koizumi et al. (2014), this paper reports on an experiment conducted in the Truku dialect of the Seediq language, which is spoken in Taiwan. Truku is an OS-type language, in which the canonical word order is claimed to be VOS (Aldridge 2004; Tsukida 2009). Truku is typologically different from Kaqchikel, and is not a head-marking language. As is demonstrated in the next section, verbs in Truku do not carry any person/number agreement markers of the dependent elements, but the verb is required to have one voice marker indicating which element in the event is treated as the subject. One might suggest a processing advantage in the verb-initial structure, due to the agreement markers on the verb. In fact, Sauppe (2016) conducted an experiment using the visual-world eye-tracking paradigm in Tagalog and found that, while listening to sentences, native speakers of Tagalog used verbal semantics to anticipate the upcoming referents and their thematic roles as soon as the verb was heard. Studying Truku allows us to examine the role of verbs whose morphological properties are widely different from those in a head-marking language such as Kaqchikel.

Furthermore, it is important to investigate the word order preferences in Truku, because such an inquiry allows us to determine to what extent the widely observed processing preferences, namely the SO and agent-before-theme preferences, are grounded in properties of the linguistic system or somewhat more general human cognitive properties (Koizumi et al. 2014; Kubo et al. 2015). Truku allows SVO word order, but it has also been claimed that SVO is a word order derived from the more basic VOS word order in this language (Aldridge 2004). In the following sections, we will see a more detailed discussion of the grammatical properties of Truku. Under the syntactic complexity hypothesis, that the syntactic complexity has an impact on the word order preferences for sentences (i.e., it is reflected in processing costs), the SVO word order should be more costly than the VOS order in Truku. By contrast, one could propose that, being in the privileged position, the subject is universally salient over objects. Given a typical transitive event, a subject is often associated with the agent role, while the object is associated with the theme role, which accounts for the subject preference. Under this universal saliency hypothesis, the SO word order preference should also be favored in Truku, irrespective of its basic word order. Of course, these two accounts of syntactic complexity and saliency are not in an exclusive relationship, and it is possible that both of them are found to have a certain effect on the word order preferences in Truku.

Finally, we should note that the SO word order preference discussed above is, nevertheless, often conflated with the “agent-before-theme preference” noted in the previous studies. Using various SO-type languages or even Kaqchikel, it has been difficult to tease apart the possible source of the saliency noted above. It is possible that being a subject is an important property to be counted as salient, but it is also possible that being an agent is important and the agent is taken to be salient. Then, we have a case in which the agent is more likely to be promoted to the subject, yielding the subject preference. As will be seen in greater detail below, Truku has a symmetrical voice system in which the agent and patient are equally likely to be promoted to the subject of the sentence (Foley 2008). Given this grammatical characteristic of Truku, along with other grammatical characteristics, it is possible to investigate how these two properties, being a subject and being an agent, interact in comprehending Truku sentences.

1.2 Truku grammar

Truku, a dialect of the Seediq language, is spoken in East Taiwan. Seediq is an indigenous language and the Seediq people are one of Taiwan’s 16 nationally-recognized tribes. Seediq belongs to the Atayal group of the Austronesian language family, and is spoken by approximately 20,000–30,000 people (Covell 2008; Eberhard et al. 2019). Some grammatical properties of Truku are introduced below, as these are relevant to the design of the current experiment.

Truku, like many other Austronesian languages, uses the symmetrical voice system (Aldridge 2004; Foley 2008; Riesberg 2014; Tsukida 2009). Verbs need to carry one of the three voice markers (agent voice, goal voice, or conveyance voice). This system is described as “symmetrical” because there is no default or unmarked voice among the three voice markers (cf., English and Japanese, where the active voice is clearly unmarked syntactically and morphologically, compared to the passive voice). The following examples show the basic voice patterns. The two example sentences in (5) basically represent the same event. The verb carries different voice markers, and the agent “the cook” is promoted to the subject in (5a), while the theme “this pineapple” is promoted to the subject in (5b). Truku also has a third type of voice marker, conveyance voice; in sentences with the verb in the conveyance voice, elements such as instrument phrases or benefactive NPs are promoted to the subject position. Examples of the conveyance voice are not shown here, because sentences of this voice type are not used in the current study. In the VOS word order, the subject is marked with ka, which is glossed as nom, following Tsukida (2009), but a slightly different terminology is used in other works (Aldridge, 2004; Holmer, 2005).

(5)
a. Agent voice, VOS word order
   q〈m〉nilis  kalat      niyi  ka   emphapuy
   av-peels   pineapple  this  nom  cook
b. Goal voice, VOS word order
   qnlis-an  emphapuy  ka   kalat      niyi
   peels-gv  cook      nom  pineapple  this
   ‘The cook peels this pineapple.’

Another relevant feature in Truku is the word order variation. As shown in (5), the language allows the VOS word order. It also allows the SVO word order, as shown in (6).

(6)
a. Agent voice, SVO word order
   emphapuy  o    q〈m〉nilis  kalat      niyi
   cook      top  av-peels   pineapple  this
b. Goal voice, SVO word order
   kalat      niyi  o    qnlis-an  emphapuy
   pineapple  this  top  peels-gv  cook
   ‘The cook peels this pineapple.’

In (6), the nominal phrase appearing along with the particle ka in (5) is now placed in the clause-initial position. When the subject is fronted, the particle ka can no longer appear, but another particle, o, appears after the fronted subject. It has been claimed that the VOS word order is the syntactically basic word order, and the SVO order is syntactically derived from the VOS word order (Aldridge 2004; Tsukida 2009). In this paper, we follow Aldridge (2004) and assume that the VOS word order is derived by predicate fronting. As for the SVO word order, Yano et al. (2019) have shown that the SVO order is derived from the VOS, and the subject is in a higher position, as evidenced by the observation that the fronted subject can leave its associated quantifier after VO (see the relevant discussion in Sportiche (1988)). Assuming that the subject fronting involves more functional categories in the CP domain (Rizzi 1997), the SVO structure is a syntactically more complex one.

The present study investigates the word order preference in Truku, manipulating voice and word order at the same time. We are particularly interested in whether SVO sentences produce a large processing cost. In Truku, SVO is a derived word order, and the syntactic complexity hypothesis predicts that SVO is more costly than VOS in Truku, because SVO involves a filler-gap dependency. By contrast, if subject is universally salient over object, and if saliency is a major factor responsible for the processing cost of the sentence, native speakers of Truku should process SVO sentences faster than VOS ones. Given that Truku is a symmetrical voice language, we can manipulate whether the agent role is assigned to the subject or object, which allows us to examine the interaction of voice and word order (saliency order) fully. Then, the saliency hypothesis predicts that the SVO word order is preferred in Agent Voice (AV) sentences, while the VOS is preferred in Goal Voice (GV) sentences in our auditory semantic anomaly detection experiment, which is introduced in the next section.
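The two sets of predictions can be worked through mechanically. The following is a minimal illustrative sketch, not part of the original study: it assumes only the role assignments described in this section (in agent voice the subject is the agent, in goal voice the subject is the theme), and the names `ROLE_OF`, `role_order`, and `predictions` are hypothetical.

```python
# Illustrative sketch (not from the paper): derive the linear order of
# thematic roles for each voice x word-order condition.

# Thematic role carried by subject (S) and object (O) under each voice,
# following the description of the symmetrical voice system in Sect. 1.2.
ROLE_OF = {
    "AV": {"S": "agent", "O": "theme"},  # agent voice: subject = agent
    "GV": {"S": "theme", "O": "agent"},  # goal voice: subject = theme
}


def role_order(voice, word_order):
    """Return the thematic roles in the order they are heard, skipping the verb."""
    return [ROLE_OF[voice][fn] for fn in word_order if fn != "V"]


def predictions(voice, word_order):
    """Summarize what each account says about one condition."""
    roles = role_order(voice, word_order)
    return {
        # Saliency account: agent-before-theme orders should be easier.
        "agent_first": roles[0] == "agent",
        # Syntactic complexity account: the derived SVO order should be costlier.
        "derived_order": word_order == "SVO",
    }


for voice in ("AV", "GV"):
    for order in ("VOS", "SVO"):
        print(voice, order, role_order(voice, order), predictions(voice, order))
```

Under these assumptions, the agent precedes the theme only in the AV-SVO and GV-VOS conditions, whereas the derived-order flag depends on word order alone, which is why crossing voice with word order can separate the two accounts.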

2 Experiment

2.1 Participants

Forty-two native speakers of the Truku dialect of Seediq (9 males and 33 females; M = 60.1 years, SD = 11.1) were paid to participate in the experiment; all signed a written informed consent document. All reported no hearing or other language-related disorders. They live in a village near Hualien City, Taiwan, and all of them also speak Chinese daily (see Footnote 2).

2.2 Materials

The materials for our experiment consisted of 48 sets of sentences. Each set crossed two types of the verb voice (AV and GV) and two word orders (VOS and SVO), making four conditions. Among the 48 sets of sentences, 24 sets used a proper name in the agent phrase, such as Rabay and Abis, which are commonly used names in Truku, according to our consultants. The other 24 sets used a common noun in the agent phrase, such as emphapuy “cook” or knsat “policeman.” In the agent voice condition, the verb had the agent voice marker, and the NP with the agent role appeared as the subject of the sentence. In the goal voice condition, however, the verb was marked with the goal voice marker, and the NP with the theme role appeared as the subject. As for the word order factor, we prepared the VOS word order, which had the subject at the end of the clause with the nominative marker ka. The SVO word order was also prepared, in which the subject was fronted to the sentence-initial position. Each set of target stimuli thus had four versions, and half of the target stimuli used a proper name, while the other half used a common noun. Those 192 sentences were distributed into four lists by crossing voice and word order in a Latin square design, so that no participant heard more than one version from each set. Therefore, three factors (voice of the verb, word order, and noun type) are within-subjects factors. A sample set of the target sentences of the common noun type is shown below.

(7)
a. Agent voice, VOS word order
   m-n-hapuy    begu  niyi  ka   empsapuh
   av-prf-cook  soup  this  nom  doctor
   ‘The doctor cooked this soup.’
b. Agent voice, SVO word order
   empsapuh  o    m-n-hapuy    begu  niyi
   doctor    top  av-prf-cook  soup  this
c. Goal voice, VOS word order
   n-puy-an     empsapuh  ka   begu  niyi
   prf-cook-gv  doctor    nom  soup  this
   ‘Lit. This soup was cooked by the doctor.’
d. Goal voice, SVO word order
   begu  niyi  o    n-puy-an     empsapuh
   soup  this  top  prf-cook-gv  doctor

In addition to the target sentences, 48 filler sentences were prepared. Those filler sentences were all semantically anomalous, such as #Simaw planted the bag; #the chopsticks steamed the vegetables, etc. We made filler sentences in this way because, as will be seen below, we used a semantic anomaly detection task, and all of the target sentences should be acceptable to the native speakers of Truku, while the filler sentences should be responded to as anomalous (see Footnote 3).

One male native speaker of Truku from the village read the sentences aloud and we recorded him. After the recording, the audio files were trimmed so that the total length of the sentences was closely matched with respect to the voice and word order factor. The mean length of sentences was about 3 s, and there was a significant difference according to the noun-type factor (F(1, 46) = 31.54, p < 0.01), indicating that the mean length of sentences with a proper name was shorter than that with a common noun (the proper name condition, 2589 ms (SD = 311); the common noun condition, 3077 ms (SD = 408)). There was no other significant main effect or interaction.
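The distribution of the 192 target sentences into four lists, described earlier in this section, can be illustrated with a short sketch. This is only one common way to build a Latin square assignment (rotating conditions across item sets). The paper does not state which rotation was used, so the function `build_lists`, the condition labels, and the numbering of sets (0 to 23 as proper-name sets, 24 to 47 as common-noun sets) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical condition labels: voice x word order.
CONDITIONS = ["AV-VOS", "AV-SVO", "GV-VOS", "GV-SVO"]


def build_lists(n_sets=48, conditions=CONDITIONS):
    """Rotate conditions across item sets so that each list contains
    exactly one version of every set (a Latin square assignment)."""
    n_lists = len(conditions)
    lists = []
    for k in range(n_lists):
        # In list k, set s appears in condition (s + k) mod 4.
        lists.append([(s, conditions[(s + k) % n_lists]) for s in range(n_sets)])
    return lists


lists = build_lists()
for k, lst in enumerate(lists):
    # Each list should hold the same number of items per condition.
    print("list", k, dict(Counter(cond for _, cond in lst)))
```

With 48 sets and four conditions, this rotation gives every list 12 items per condition, and every one of the 192 set-by-condition versions appears in exactly one list.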

2.3 Task and procedure

The participants were tested individually. They sat in front of a laptop computer and wore a headset in a quiet room, and were told to relax. They were instructed by an experimenter, who is a native speaker of Truku, to listen to the sentences through the headset, and to decide, as quickly and accurately as possible, whether the sentences they heard made sense or not (Caplan et al. 2008). Two keys on the keyboard (“J” and “F”) were assigned for the responses (‘yes’ and ‘no’). The participants were instructed to use both hands to make their responses. All target sentences should have been judged as “yes,” and all filler sentences should have been judged as “no.” The number of “yes” responses, then, was counterbalanced with the number of “no” responses. After the experimenter provided the instructions to the participant, there was a practice session with seven trials. In each trial, the participants briefly saw a small fixation cross in the middle of the screen for 1000 ms, and the sentence was presented auditorily through their headset. Stimulus sentences were presented in a randomized order for each participant. While the stimulus sentence was presented auditorily, a pair of the smiley face and non-smiley face icons was shown on the screen, which would help the participants press the response keys as they intended. In the practice session, the experimenter provided feedback to the participants’ responses, to ensure they understood the task and felt familiar with the procedure. This practice session was repeated if the participants wanted to practice more. During the experimental session, no feedback was given for wrong answers. The participant could make a judgment before the sentence ended, but they usually made their response after the end of the sentence. We recorded the responses (i.e., “yes” or “no” for the accuracy) and measured the response times (RTs) from the onset of the sentence. The whole experiment took about 15 min.

This task required participants to judge the semantic plausibility of each sentence. The underlying assumption of using RT is that the length of time it took to complete the task reflected sentence processing difficulties or complexities, such as sentences with relatively long syntactic dependencies (Grodner and Gibson 2005) or ones with a non-canonical word order (Kaiser and Trueswell 2004). Self-paced measurements have been widely used in various fields, and measuring response times of the full sentences may not allow us to draw strong conclusions regarding exactly which word or from what region of the sentence such processing difficulty has emerged. To assess the time-course of language comprehension, a word-by-word or phrase-by-phrase self-paced listening task can be set up as a more fine-grained measurement of processing efficiency (Ferreira et al. 1996). We, however, employed a whole-sentence measurement, due to concerns for the ecological validity (Kaiser 2013); that is, people do not hear a sentence segment by segment at their own pace in a natural setting. Instead, they have to process it as speakers produce their utterances. Because most, if not all, native speakers of Truku do not regularly read in Truku, we had to rely on a task measuring their auditory sentence comprehension, rather than reading.

2.4 Analysis

We removed data gathered from four participants from the analyses because they only used one hand to make their responses, and such responses are not reliable. The remaining data from 38 participants were analyzed further. As for the analyses of the response times, data from the trials in which the participants made an incorrect response were excluded. In those incorrect responses for the target sentences the participants answered “no,” and we removed those data from the response time analysis because such negative responses are known to be disproportionately long. We analyzed our remaining data using the lme4 package (Bates et al. 2015a, b) for the R software (R Core Team 2019). We used logistic mixed-effect models for the accuracy data (Jaeger 2008), as the dependent measure was categorical, and linear mixed-effect models for the RT data (Baayen et al. 2008). Following Barr et al. (2013), the model was initially fit with the maximal random effects structure, including random slopes for the repeated measures factors and random intercepts for participants and items. Three repeated measures factors, voice, word order, and noun type, were entered into the model as fixed factors. For the RT analyses, the mean correct response rate for each participant, and for each item, were entered as covariates. The trial order was also included initially, but was later eliminated because it did not significantly contribute to the improvement of the model’s performance. Converged models were evaluated, and the optimal model was selected using backward elimination. Once the optimal model was chosen, we eliminated the data points that were more than 2.5 SD away from the estimation by the model, and the model was re-fit to determine the final model. The model summary and p-values were obtained using the lmerTest package (Kuznetsova et al. 2017).

While we were preparing for the data analysis, we noticed that the RT patterns from the stimuli with a proper name and those with a common noun were largely different. Roughly, the sentences used in our experiment have three phrases/words: subject, verb, and object. Although the participants could sometimes detect anomalous meanings at the second phrase of the filler sentences, they usually needed to listen to the third phrase to decide whether the stimulus sentence made sense or not. It seemed that the onset time for the third phrase and its duration had some major influence on the response time, in addition to the different length of the total sentence. We then suspected that the length of the third phrase was one of the reasons why the RT patterns differed depending on the type of agent (a proper name or a common noun). Proper names used in the stimuli were relatively short, so the onset and the length of the third phrase differed across conditions. Three research assistants listened to the audio files and measured the length of the third phrases; they also measured the onset of the third phrase. We calculated the means of those measurements for each stimulus. Then, for the RT analyses, the onset time of the third phrase and the length of the third phrase were also entered into the model as covariates (see Footnote 4).
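The trimming step in the analysis (removing data points more than 2.5 SD away from the model's estimate before re-fitting) can be sketched as follows. The study itself fit its models with lme4 in R. This Python fragment only illustrates the exclusion criterion, under the assumption that "2.5 SD" refers to standardized residuals, and the function name `trim_by_residual` and the example values are hypothetical.

```python
from statistics import mean, stdev


def trim_by_residual(observed, fitted, cutoff=2.5):
    """Return the indices of data points whose residual lies within
    `cutoff` standard deviations of the mean residual.

    `fitted` stands in for the values predicted by an already-fitted model;
    no model is fitted here.
    """
    residuals = [o - f for o, f in zip(observed, fitted)]
    centre = mean(residuals)
    sd = stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - centre) <= cutoff * sd]


# Hypothetical response times (ms) and model-fitted values, for illustration only.
observed = [2950, 3050] * 9 + [3000, 9000]
fitted = [3000] * 20
kept = trim_by_residual(observed, fitted)
print(len(observed) - len(kept), "data point(s) removed before re-fitting")
```

In an actual analysis the retained indices would be used to subset the data, and the mixed-effects model would then be re-fit on that subset.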

2.5 Results

The overall correct response rate was 75%. The mean accuracy data by condition is shown in Fig. 1, and a summary of the statistical analyses of the response accuracy data is shown in Table 1.

Fig. 1 The mean response accuracy rates for each condition. The error bars indicate standard errors of the mean

Table 1 Summary of the statistical analysis on the response accuracy data

There was a significant main effect of word order, showing that the mean accuracy rate for the SVO condition was lower than that for the VOS condition. Also, a significant main effect of noun type showed that the common noun condition was more difficult than the proper name condition. There was a marginally significant three-way interaction among the three within-subject factors, and further pairwise comparisons indicated that there was a word order × noun type interaction only in the AV condition, but not in the GV condition. This interaction was driven by the contrast within the common noun condition, showing that the AV-SVO condition was more difficult than the AV-VOS condition. There was no such word order effect in the GV condition.

The overall mean response times are shown in Fig. 2, and a summary of the statistical analysis is shown in Table 2.

Fig. 2 The mean response time (ms) for each condition. The error bars indicate standard errors of the mean

Table 2 Summary of the statistical analysis on the response time data

Table 2 shows that the mean RT patterns for the proper name and the common noun conditions were different, as indicated by a significant main effect of noun type and a marginally significant three-way interaction among word order, voice, and noun type. In general, the mean RT for the common noun condition was longer than that for the proper name condition, which is consistent with the low accuracy for the common noun condition discussed above. Pairwise comparisons suggest that while there was no word order × voice interaction in the proper name condition, this interaction was significant in the common noun condition. Further comparisons show that, in the GV condition, responses in the VOS condition were significantly faster than those in the SVO condition, but no such difference was found in the AV condition.

3 Discussion

The mean response accuracy rates indicated that sentences with a common noun agent were more difficult than those with a proper name agent. As demonstrated below, this is also reflected in the RT data. The lower accuracy rates in the common noun condition indicated that the participants were more willing to reject the common noun target sentences. In the common noun target sentences, for instance, they had to decide whether “the carpenter looked for the eggs” was a likely event. We suspect that some participants thought that there was a more likely agent who would look for the eggs—a cook, for instance. By contrast, in the proper name target sentences, they may not have had to examine in too much detail whether a given agent was a plausible entity to initiate the action, because the agent was just a name. Of course, they still had to decide whether the eggs are something people look for in general, but the task demand seemed lower in the proper name condition.

We also observed the general pattern that sentences presented in the SVO word order were more difficult than those in the VOS word order. This contrast in the accuracy data was clearly visible in the common noun-AV conditions, suggesting that the derived SVO sentences were more costly to process than the VOS sentences.

The results of the RT measure suggest a few major patterns. First, the mean RT in the proper name condition was generally faster than that in the common noun condition. Second, there was no word order × voice interaction in the proper name condition, but those two factors did interact in the common noun condition. That is, RTs did not differ across conditions within the proper name condition, whereas in the common noun condition, the RT for the GV-VOS condition was faster than those for the other three conditions. With respect to the noun type contrast, we suggest the same account as discussed above: the RT for the proper name condition was faster because judging whether the sentence was semantically plausible imposed a lower processing demand than in the common noun condition. Further, the sentences with a proper name were shorter overall than those with a common noun, which is also likely to be responsible for the RT difference. It seems that, in the proper name condition, the participants responded very soon after the sentence finished, so the lack of a word order × voice interaction there might be due to the ease of comprehending those sentences.

As for the word order × voice interaction in the common noun condition, in Sect. 1, we introduced the syntactic complexity account and the universal saliency account, whose predictions are hard to tease apart when testing SO-type languages. In languages like Truku, the syntactic complexity account predicts that the SVO word order should take longer to process in general because it is derived from the more basic VOS word order. Assuming that the SVO word order involves a more complex syntactic structure (Aldridge 2004; Yano et al. 2019), it is, in general, more costly to process sentences in that word order. The saliency account, in contrast, suggests that sentences in which the agent comes before the theme are processed quickly. Among the conditions in this experiment, sentences in the AV-SVO and the GV-VOS conditions have the agent before the theme, and are therefore predicted to be processed more quickly.

To account for the patterns from the accuracy and the response time data together, we suggest that there is a general VOS preference in Truku. In the common noun conditions, a VOS advantage was found in the AV conditions in the accuracy data. A similar contrast was not seen in the response time data, but we propose that this reflects a speed-accuracy trade-off. The response time in the AV-VOS condition seemed no faster than that in the AV-SVO condition, but this could be due to the relatively high accuracy rate: participants may have responded more accurately by taking slightly longer to respond. So, possibly, the response time in AV-SVO would be faster and the accuracy rate would be slightly lower. Then, the contrast between the SVO and the VOS word orders would have been much clearer, showing the VOS advantage. With respect to the pattern in the GV conditions, in a similar way, we would expect a slightly higher accuracy rate for the GV-VOS condition if there is a general VOS advantage. We suppose that this “lower-than-expected” accuracy rate in the GV-VOS condition is related to the faster response time in this condition. In sum, although we have to rely on the speed-accuracy trade-off, it is plausible to hypothesize that Truku has a general VOS preference, because this can explain the combined pattern of accuracy and response time data.

These interpretations of the results suggest that the often-observed SO preference is not a universal feature of language (Koizumi and Kim 2016; Koizumi et al. 2014; Kiyama et al. 2013; Yasunaga et al. 2015; Yano et al. 2017, 2019). The SO preference has been observed repeatedly in the sentence processing literature, but most of the languages investigated have been SO-type languages. In those languages, the SO word order is the canonical/basic order, implying that it involves a syntactically less complex structure than the non-canonical OS word order. In languages like Truku, the SO word order necessarily involves a more complex structure, so there is no general preference for the SO word order.

There is an apparent alternative account for the common noun condition that employs an interaction between syntactic complexity and saliency (see the relevant illustration in Table 3). This “interaction” account would suggest that both syntactic complexity and saliency are needed in order to explain the full picture of the response time results. According to this account, the GV-SVO sentences took longer to process than the GV-VOS sentences because: (a) the SVO word order involves a more complex syntactic structure, and (b) the theme NP is the subject and is placed in the sentence-initial position in the SVO word order condition, so that it precedes the agent NP, which is in the object position. In other words, the GV-SVO sentence is a costly structure under both the syntactic complexity account and the saliency account. In the AV sentences, on the other hand, the AV-SVO sentence is costly to process because it has a more complex structure than the AV-VOS sentence. The AV-VOS sentence, however, is also costly to process because the agent NP is the subject and is placed at the end of the clause, so that the theme NP precedes the agent NP in this sentence type. Assuming that the costs arising from syntactic complexity and from saliency are of similar magnitude, it may be possible to claim that those two costs cancel each other out, which would explain why the reaction times in the two AV conditions did not differ significantly (see Footnote 5). This account may successfully explain the response time pattern, but it requires some extra assumptions with respect to the pattern found in the accuracy data, where there is a clear difference within the AV condition, but not within the GV condition.

Table 3 The relationship between the two accounts (the Universal Saliency and the Syntactic Complexity) with respect to the processing cost, and the four sentence patterns based on the voice and word order
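The cost assignments described for the interaction account can be sketched as a small additive model. This is only an illustration of the reasoning in the text, not an analysis from the study: the unit cost values, the function names, and the assumption that the two costs simply add are all hypothetical (the text only assumes that the two costs are of similar magnitude).

```python
from itertools import product

# Hypothetical unit costs (not estimated from the data); the text only
# assumes that the two costs are of similar magnitude.
COMPLEXITY_COST = 1.0  # incurred by the derived SVO word order
SALIENCY_COST = 1.0    # incurred when the theme precedes the agent


def agent_precedes_theme(voice: str, order: str) -> bool:
    """Return True if the agent NP comes before the theme NP.

    In AV the subject is the agent; in GV the subject is the theme.
    In SVO the subject is sentence-initial; in VOS it is sentence-final.
    """
    subject_is_agent = voice == "AV"
    subject_first = order == "SVO"
    return subject_is_agent == subject_first


def predicted_cost(voice: str, order: str) -> tuple[float, float, float]:
    """Return (complexity cost, saliency cost, summed cost) for a condition."""
    complexity = COMPLEXITY_COST if order == "SVO" else 0.0
    saliency = 0.0 if agent_precedes_theme(voice, order) else SALIENCY_COST
    return complexity, saliency, complexity + saliency


# Predicted costs for the four voice x word order conditions.
costs = {
    (voice, order): predicted_cost(voice, order)
    for voice, order in product(("AV", "GV"), ("VOS", "SVO"))
}

for (voice, order), (complexity, saliency, total) in costs.items():
    print(f"{voice}-{order}: complexity={complexity}, "
          f"saliency={saliency}, total={total}")
```

Under these assumed values, the two AV conditions receive the same summed cost, while GV-SVO is penalized by both factors and GV-VOS by neither, which mirrors the response time pattern that the interaction account is meant to capture.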

The results from the present study have some implications for the role of head-marking morphology in sentence processing. Recall that Koizumi et al. (2014) and their related work indicated that word order preference is largely determined by the syntactic properties of a given language. They examined Kaqchikel, a Mayan language, which shares a certain syntactic characteristic with Truku: both have VOS as the canonical word order and SVO as a derived word order. Although Kaqchikel and Truku are typologically different, it is noteworthy that both languages show the OS word order preference. This indicates that the OS word order preference is not limited to languages that belong to a particular language family, such as Mayan, but rather seems to be a property of languages whose canonical word order is VOS. Some might speculate that, because the verbs in Kaqchikel carry many agreement morphemes for their dependent nouns, the head-marking properties of Kaqchikel play a large role in the OS preference. Our current results suggest that having such a head-marking property is not a necessary condition for the OS word order preference, because Truku verbs do not have agreement markers analogous to those in Kaqchikel, yet the language still shows an OS word order preference.

As discussed in Sect. 2.3, we decided to employ a task that measures the response time at the end of the sentence rather than, for example, a word-by-word self-paced listening task. It is therefore difficult to pinpoint the stage at which the processing cost emerges during comprehension; nevertheless, we would predict that the processing cost arises at the sentence-initial NP, because it signals that the sentence is not in the VOS word order (see Yano et al. 2019).

Our results also suggest that properties like saliency often have an influence on sentence processing costs, but to a different extent in SO- versus OS-type languages. In quite a few production studies of SO-type languages, it has been found that cognitive properties such as saliency have an effect on word order selection (Branigan et al. 2008; Tanaka et al. 2011). However, our results suggest that saliency is not a major factor explaining the response pattern in Truku. Note that Truku is a language that has a rich voice system. It has been claimed that the goal voice, in which a theme argument is promoted to the subject, is no more marked than the agent voice in terms of frequency (Tsukida 2009). Such a distributional tendency with respect to the construction types is not common in SO-type languages (Japanese, English, etc.), where sentences in the passive voice are often morphologically more marked, and arguably less frequent (e.g., Roland et al. 2007). The lack of a clear indication of the agent-before-theme preference suggests a correlation between the voice property of a given language and the importance of the saliency factor (see Sauppe 2016 for a relevant discussion). Of course, the availability of the symmetrical voice system is not restricted to OS-type languages, and the overall picture is considerably more complicated.

4 Conclusion

Previous experimental results have shown that there is a processing bias whereby sentences are processed more quickly and easily when the subject appears before the object. We pointed out that this bias is widespread, but most of the data come from languages in which the subject precedes the object in their canonical word order. In such languages, the subject-before-object order is also often confounded with the agent argument preceding the theme argument. The question of whether the SO word order bias is based on the syntactic complexity of the sentence or on its saliency can be addressed by investigating Truku, whose canonical word order is VOS. The results showed that, with respect to the response time, there was, superficially, no word order effect in AV sentences with a common noun agent, but the SVO sentences were processed significantly more slowly than the VOS sentences in GV sentences. We argued, however, that there was a general VOS preference in Truku, further indicating that the syntactic complexity account is better suited for explaining the pattern found in both response accuracy and response time data.

In sum, our auditory comprehension experiment suggests that the often-observed SO preference in SO-type languages is not fully grounded in the universal properties of human cognition. In Truku, an OS-type language, the OS word order was preferred. We suggest that the lack of a saliency effect (or at least its weakness) in this language may relate to the symmetrical voice system, where promoting a theme argument to subject does not require a more marked structure, with respect to the verbal morphology and syntax, than promoting an agent argument. Finally, we should point out that investigating a typologically wide set of languages is not only interesting but also necessary for determining the nature of various preferences found in the psycholinguistic literature. It is not always easy to conduct research elsewhere in the way we do in our own institutions, but looking only at languages that are spoken and used within an easily accessible range invites unwanted “bias” in our minds.