1 Introduction

The question addressed in this study is whether two phonologically very similar prefixes of Indonesian are allomorphs or rather independent prefixes. According to the classical definition of allomorphy, variants of a morpheme which have the same underlying form, which share the same meaning, and which are in complementary distribution, are classified as allomorphs (Bloomfield 1933; Alber 2011). When two different affixes express roughly the same semantics, they are referred to not as allomorphs but as rival affixes (Aronoff and Anshen 2017). Conversely, when the same form signifies completely different semantic functions, as is the case for English -s (third person singular vs. plural inflection vs. third person genitive, …Plag et al. 2017), we have affix homonymy. Less clear-cut are cases where formatives are obviously similar in form as well as in meaning, without the form similarity being phonologically conditioned. For instance, Peters (2004) argued that English -er and -eer are allomorphs where the choice of -eer is semantically conditioned on the referent being from the semantic field of war. Baayen et al. (2013) discussed the Russian prefixes pere- and pre-, which are etymologically related but express subtly different semantics.

Endresen (2014) provides detailed discussion of the limitations of the classical definition of allomorphy. She points out that there are counterexamples where other parameters should be taken into account, such as subtle differences in meaning as exhibited by the Russian affix pairs s- vs. so- ‘together’, o- vs. ob- ‘around’, pere- vs. pre- ‘across’, vz- vs. voz- ‘up’, and vy- vs. iz- ‘out of’. The two Indonesian prefixes that are the subject of this study likewise raise the question of whether they are allomorphs, given their phonological similarity, or separate prefixes. As pointed out by Denistia (2018), Indonesian linguists mainly have described the two morphs as independent prefixes (Ramlan 2009; Sneddon et al. 2010), but there are also studies that take them to be allomorphs (Darwowidjojo 1983; Kridalaksana 2008). Since these two prefixes are similar in form, but not phonologically conditioned, and since they are similar, but not identical in meaning, the classical criteria for allomorphy are only approximately satisfied. Thus, the present study is a corpus-based investigation into what Endresen (2014) refers to as non-standard allomorphy. Specifically, we examine in detail the differences in the semantics of PE- and PEN-, the differences in their productivity, and the differences in the extent to which derived words with PE- and PEN- are input to further inflection. In our analyses, the paradigmatic relations between base words and derived words are especially informative.Footnote 1

In what follows, we first introduce some basic aspects of Indonesian verb morphology and deverbal nominalization. In the next section, we introduce the databases that inform our analyses. We then present our analyses and conclude with a discussion of the results obtained.

2 Indonesian verb morphology and deverbal nominalization

The morphology of Indonesian is characterized by productive processes of affix substitution. In this study, we are interested in two prefixes that create nouns from verbs through affix substitution, and which express a range of semantic functions (e.g. agent, instrument, patient; Sneddon et al. 2010, pp. 30–33). One prefix, henceforth PEN-, forms nouns from verbs with the prefix MEN- (e.g. penari ‘dancer’ – menari ‘to dance’). In what follows, for notational clarity, we write prefixes in upper case and their allomorphs as subscripts. PEN- and MEN- each have six allomorphs: PENpeng-, PENpen-, PENpem-, PENpe-, PENpeny-, PENpenge-, and MENmeng-, MENmen-, MENmem-, MENme-, MENmeny- and MENmenge-. Sukarno (2017), Ramlan (2009) and Sugerman (2016) summarized the phonological conditioning of these allomorphs as follows:

  • PENpeng-, MENmeng- occur with base words beginning with a vowel or a velar obstruent /g/, /k/, /h/, or /kh/,

  • PENpen-, MENmen- occur with base words beginning with an alveolar or palatal obstruent /d/, /t/, /c/, /j/, /sy/, or /z/,

  • PENpem-, MENmem- occur with base words beginning with a labial consonant /b/, /p/, or /f/,

  • PENpe-, MENme- occur with base words beginning with a nasal, a semivowel, or a liquid /m/, /n/, /ng/, /ny/, /w/, /j/, /r/, or /l/,

  • PENpeny-, MENmeny- occur with base words beginning with /s/, and

  • PENpenge-, MENmenge- occur with monosyllabic base words.

The nasal allomorphy of Indonesian MEN- and PEN- is an example of classical phonologically conditioned allomorphy.
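The conditioning rules listed above can be read as a lookup on the first sound of the base word. The following Python sketch is only an illustration of that reading, not the procedure used in this study. It works on spelling rather than on phonemes, and it makes three assumptions that are not in the text: the semivowel listed as /j/ is taken to be spelled y while the palatal obstruent /j/ is spelled j, a base counts as monosyllabic when it contains exactly one vowel letter, and the function names are hypothetical. It returns only the allomorph label and does not build the surface form of the derived word.

```python
VOWELS = "aeiou"

# Orthographic onsets mapped to PEN- allomorph labels.
# Digraphs come first so that "sy" wins over "s", "ng"/"ny" over "n", "kh" over "k".
ONSET_TO_ALLOMORPH = [
    ("kh", "peng-"), ("ng", "pe-"), ("ny", "pe-"), ("sy", "pen-"),
    ("g", "peng-"), ("k", "peng-"), ("h", "peng-"),
    ("d", "pen-"), ("t", "pen-"), ("c", "pen-"), ("j", "pen-"), ("z", "pen-"),
    ("b", "pem-"), ("p", "pem-"), ("f", "pem-"),
    # Assumption: the semivowel given as /j/ in the text is spelled "y".
    ("m", "pe-"), ("n", "pe-"), ("w", "pe-"), ("y", "pe-"), ("r", "pe-"), ("l", "pe-"),
    ("s", "peny-"),
]


def is_monosyllabic(base):
    """Crude heuristic (an assumption): exactly one vowel letter."""
    return sum(ch in VOWELS for ch in base) == 1


def pen_allomorph(base):
    """Return the PEN- allomorph label predicted for a base word."""
    base = base.lower()
    if not base:
        raise ValueError("empty base word")
    if is_monosyllabic(base):
        return "penge-"
    if base[0] in VOWELS:
        return "peng-"
    for onset, allomorph in ONSET_TO_ALLOMORPH:
        if base.startswith(onset):
            return allomorph
    raise ValueError("no rule for base word: " + base)


def men_allomorph(base):
    """MEN- allomorphs mirror the PEN- allomorphs with m- for p-."""
    return "m" + pen_allomorph(base)[1:]


for base in ["tari", "ajar", "las", "selam", "pukul", "lempar"]:
    print(base, pen_allomorph(base), men_allomorph(base))
```

The bases in the loop are taken from examples in this paper (penari, pengajar, pengelas, penyelam, memukul, melempar). Forms that deviate from the rules, such as penglihat, would need to be listed separately.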

A second prefix, henceforth PE-, forms nouns from verbs with the prefix BER-, again through affix substitution (e.g. petani ‘farmer’ – bertani ‘to farm’), see Ramlan (2009), Ermanto (2016), Sneddon et al. (2010), Putrayasa (2008), Darwowidjojo (1983), Benjamin (2009). BER- has BERbe- and BERbel- as infrequent allomorphs. BER- primarily creates verbs expressing reciprocity, reflexivity, or stativity (see Kridalaksana (2007), Ramlan (2009), Putrayasa (2008), Chaer (2008), Sneddon et al. (2010) for other meanings). BERbe- occurs with stems beginning with /r/ or with stems the first syllable of which ends with /r/, as in risiko ‘risk’, berisiko ‘to run the risk’ and kerja ‘work’, bekerja ‘to work’. BERbel- only occurs with the base word ajar ‘to teach’, belajar ‘to study’ (Sugerman 2016). If PE- is regarded as an allomorph of PEN-, its conditioning is not phonological, as for the allomorphs of PEN-, but morphological: PEN- is paradigmatically related to verbs with MEN- and PE- is paradigmatically related to verbs with BER-.Footnote 2

The base words for the verbs and their nominalizations can be verbs, nouns, and adjectives. There is no consistent difference in lexical meaning between simple base verbs and derived verbs (e.g. buru ‘to hunt’ – berburu ‘to hunt’), although the derived forms may show different syntactic and aspectual behaviour (e.g. buru ‘to hunt’ – memburu ‘to hunt continuously’) (Nuriah 2004). The simple verb is typically used in imperatives.

Verbs with MEN- can be extended with the suffixes -i and -kan. MEN- typically renders a verb explicitly transitive. The suffixes -i and -kan add a further argument, either a beneficiary or a causer, while often at the same time expressing intensification or iteration (Arka et al. 2009; Sutanto 2002; Tomasowa 2007; Kroeger 2007; Sneddon et al. 2010).

  (1) transitives and ditransitives
    (a) tulis ‘to write’, menulis ‘to write something’, menulisi ‘to write something on something’
    (b) tulis ‘to write’, menulis ‘to write something’, menuliskan ‘to write something on behalf of someone’

  (2) causatives
    (a) panas ‘hot’, memanas ‘to become hot’, memanasi ‘to heat up something’
    (b) panas ‘hot’, memanas ‘to become hot’, memanaskan ‘to apply heat to something’

  (3) transitives and beneficiaries
    (a) ajar ‘to teach’, mengajar ‘to teach something’, mengajari ‘to teach someone something’
    (b) ajar ‘to teach’, mengajar ‘to teach something’, mengajarkan ‘to teach something to someone’
    (c) kirim ‘to send’, mengirim ‘to send something’, mengirimi ‘to send something to someone’

  (4) iteration and intensification
    (a) lempar ‘to throw’, melempar ‘to throw something’, melempari ‘to throw something repeatedly at something’
    (b) pukul ‘to hit’, memukul ‘to hit something’, memukuli ‘to hit something hard over and over again’

Verbs with BER- do not combine with the -i suffix, but are found with -kan or -an to express possession (5, 6) and reciprocity (7, 8):

  (5) dasar ‘base’, berdasarkan ‘be grounded in’

  (6) alamat ‘address’, beralamatkan ‘to have an address’

  (7) gandeng ‘to hold hands’, bergandengan ‘to hold hands with each other’

  (8) cium ‘to kiss’, berciuman ‘to kiss each other’

Derived nouns with PEN- do not carry the -i or -kan suffixes, even though they may correspond to verbs with these suffixes. For instance, penerbang, ‘pilot’, is paradigmatically related to menerbangkan ‘to fly an aircraft’ rather than to the verb menerbangi, ‘to fly in something’, with the suffix -i marking location. Importantly, the verb menerbang does not exist but only the verbs terbang, ‘fly’, menerbangkan and menerbangi.

Occasionally, one finds both PEN- and PE-. There are 5 cases in which the form with PE- semantically refers to a profession and the form with PEN- does not, as listed in (9). There are also some cases in which the form with PEN- expresses agent, causer, or instrument and the form with PE- expresses patient or agent. In this case, 7 instances are attested in our database, as listed in (10).

  (9) PEN- and PE- formations that both express agents
    (a) tembak ‘to shoot’, penembak ‘someone who shoots’ and petembak ‘shooter’ (athlete)
    (b) tinju ‘to punch’, peninju ‘someone who punches’ and petinju ‘boxer’ (athlete)
    (c) terjun ‘to sky dive’, penerjun ‘someone who sky dives’ and peterjun ‘sky diver’ (athlete)
    (d) selam ‘to dive’, penyelam ‘someone who dives’ and peselam ‘diver’ (athlete)
    (e) dayung ‘to paddle’, pendayung ‘someone who paddles’ and pedayung ‘paddler’ (athlete)

  (10) PEN- and PE- formations expressing different semantic roles
    (a) ajar ‘to teach’, pengajar ‘teacher’ (agent) and pelajar ‘student’ (patient)
    (b) kasih ‘to love’, pengasih ‘lover’ (agent) and pekasih ‘love poison’ (instrument)
    (c) sakit ‘to be sick’, penyakit ‘disease’ (causer) and pesakit ‘a person with disease’ (patient)
    (d) sapa ‘to greet’, penyapa ‘a person who greets’ (agent) and pesapa ‘a person who is greeted’ (patient)
    (e) siar ‘to announce/to sail’, penyiar ‘radio announcer’ (agent) and pesiar ‘a cruise ship’ (instrument)
    (f) tanda ‘sign’, penanda ‘a sign’ (agent) and petanda ‘a hint’ (patient)
    (g) tempur ‘to combat’, penempur ‘armament’ (agent) and petempur ‘combatant’ (instrument)

We compiled a database containing 3090 words with PE- and PEN-. Since PEN- and PE- share the form pe-, the question arises of how to assign occurrences of the form pe- to either PEN- or PE-. For 235 out of 240 potentially ambiguous forms, inspection of the paradigmatic relation with the corresponding base verb, either a verb with MEN- or a verb with BER-, allows the noun to be assigned unambiguously to PEN- or PE-. Five words remain ambiguous: pewushu, ‘wushu athlete’, perindang, ‘provider of shadow’, pemagang, ‘probationer’, pemuda, ‘young male’, and pemudik, ‘homecomer’. The semantics of pewushu clarify that it belongs to PE-, the prefix that is used to denote professional athletes. The remaining 4 words are truly ambiguous, but given their semantics they most likely belong to the class of PEN- formations. For instance, perindang realises a causative reading, which, as we shall see below, is predominantly expressed by MEN-.
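The disambiguation step described above can be sketched as a lookup in a verb list. The snippet below is a hypothetical illustration rather than the script used to build the database: the function name and the tiny verb set are assumptions, and a real implementation would consult the full verb database and handle further allomorphs.

```python
def assign_prefix(base, verbs):
    """Assign a noun of the form pe- + base to PEN- or PE-.

    The decision rests on the paradigmatic relation with the verb:
    a verb with MENme- points to PEN-, a verb with BER- (or its
    allomorph BERbe-) points to PE-.
    """
    has_men = ("me" + base) in verbs
    has_ber = ("ber" + base) in verbs or ("be" + base) in verbs
    if has_men and not has_ber:
        return "PEN-"
    if has_ber and not has_men:
        return "PE-"
    return "ambiguous"


# Hypothetical toy verb list; the verbs themselves appear in this paper.
toy_verbs = {"melempar", "bertani", "bekerja"}

print(assign_prefix("lempar", toy_verbs))  # PEN-
print(assign_prefix("tani", toy_verbs))    # PE-
print(assign_prefix("wushu", toy_verbs))   # ambiguous
```

Forms that come out as ambiguous, because neither verb or both verbs are attested, are the ones that require inspection of their semantics.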

The goal of this paper is to clarify the morphological status of PE- and PEN-, allomorphs or separate prefixes, through a quantitative survey of their productivity, their paradigmatic relations with their base verbs, and the extent to which these derived nouns are input for further inflection. Indonesian inflection comprises several bound morphs: -ku, -mu, and -nya for first, second, and third person singular possessives or objects, ku- and kau- for first and second person subjects (Sneddon et al. 2010). In the Indonesian literature, these bound morphs are referred to as clitics, as they are phonologically reduced forms of free pronouns (Kridalaksana 2008). There are also suffixes that attach to verbs or nouns to express emphasis (-lah and -pun) or questioning (-kah). In what follows, we will refer to these morphs as inflectional, as they do not give rise to new onomasiological units but rather modify existing words much in the same way as adverbs modify verbs in English. Indonesian also has reduplication, which is used to express the plural for nouns and realizes various semantic functions on verbs and adjectives, including intensification and iteration (Sugerman 2016; Chaer 2008; Rafferty 2002; Dalrymple and Mofu 2012). Following Booij (1996), we distinguish between inherent and contextual inflection. Agreement marking on verbs (e.g. ku- and kau-) exemplifies contextual inflection, which is syntactically governed. Inherent inflection is more similar to word formation and hence in some languages can feed derivation and compounding. For instance, in Dutch, plural nouns can appear as left constituents in compounds (Schreuder et al. 1998).
Reduplication in Indonesian is inherent inflection: it is not governed by syntactic context (marah-marah ‘very angry’, anak-anak ‘children’, berhenti-berhenti ‘to stop repeatedly’), and can feed further inflection, as in memukul-mukuli, ‘to hit intensively over and over again’, which has as parse [[[meN + [pukul]N]V + pukul]V + i]V. We shall see below that derived words with PE- are more often input to these inflectional processes than derived words with PEN-. We will argue that the joint quantitative evidence justifies analysing PE- and PEN- as two distinct prefixes rather than as allomorphs. In the next section, we introduce the databases that we derived from a 36 million token corpus of written Indonesian (Goldhahn et al. 2012).

3 Materials

We created a database from the Indonesian corpus that is part of the Leipzig Corpora Collection at http://corpora2.informatik.uni-leipzig.de/download.html, accessed in April 2016. This corpus comprises a variety of written registers (the web, newspapers, Wikipedia) dating from the years 2008–2012 (Goldhahn et al. 2012). There are 112,025 different word types in this corpus, which occur in 2,759,800 sentences, for a total of 36,608,669 word tokens.

The words in the corpus were morphologically analyzed using the MorphInd parser, which has an overall accuracy of 84.6% (Larasati et al. 2011). The parser was run in single word mode, i.e., compounds were not parsed. Prior to running the parser, the 200 words with PE- or PEN- that contained a typo were corrected manually. The MorphInd parser’s results for PE- and PEN- were checked and corrected manually against the online version of Kamus Besar Bahasa Indonesia (hereafter called the dictionary), a comprehensive dictionary of Indonesian (http://kbbi.kemdikbud.go.id; accessed in June 2016), to verify the morphological status and semantics of the PE- and PEN- words. We made use of the fourth edition, published in 2012, which has more than 90,000 lemmas (Alwi 2012). The language it records is formal; it omits words that are considered slang or foreign. Where the dictionary and MorphInd are in conflict, we followed the dictionary. Where the dictionary does not provide information on the word category of the base, we followed the MorphInd parser. The precision of the parser for these words was 0.98 and its recall was 0.82, using the dictionary as the gold standard complemented with manual verification for out-of-vocabulary words.

Sample output of the parser is shown in Table 1: a morphological segmentation is provided where available, as well as a word category label. Table 1 shows that MorphInd identifies pemerintah and pemain correctly. However, it is not able to identify PE- in petugas and pekebun. In some cases, the base identified by the parser is incorrect. For instance, pengusut is formed from usut (to investigate) [PENpeng- + usut], but MorphInd identifies its base as kusut (tangled) [PENpeng- + kusut]. MorphInd also is not always able to accurately identify single syllable base words. In the above examples, this is illustrated by pengelas (welder) which derives from las (weld), [PENpenge- + las], and not kelas (classroom), [PENpeng- + kelas]. Therefore, the output of the parser was manually checked and corrected when necessary.
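For readers unfamiliar with precision and recall, the following sketch shows how the two measures are computed for a set of parser analyses against a gold standard. The function name and the toy sets are assumptions made for illustration. The four words come from the discussion of Table 1, but the resulting values describe this toy example only and are unrelated to the figures reported for MorphInd.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predicted items against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy gold standard: words that should be recognized as PE-/PEN- formations.
gold = {"pemerintah", "pemain", "petugas", "pekebun"}
# Toy parser output: the parser finds only two of them.
predicted = {"pemerintah", "pemain"}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 1.0 0.5
```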

Table 1 Examples of the output of the MorphInd parser

We processed the data using the R (version 3.3.2) programming language (R Team 2008) in R Studio (R Team 2015). The databases and the R scripts used to construct these databases are available online at http://bit.ly/PePeNProductivity. In what follows, we first present the database with Indonesian verbs, and then proceed to the database with derived nouns with PE- and PEN-.

3.1 The database of Indonesian verbs

Indonesian has deverbal morphology for active, passive, causative, and transitive semantics among others, see Table 2 for examples. From the corpus, we retrieved all verbs recognized by the MorphInd parser and brought these together in a database. The total number of types in the database is 26996. Table 3 illustrates that for each verb, we provide information on the derived word’s frequency in the corpus, the parse provided by MorphInd, the base word, the word category of the base, and the affix or affixes in the verb. When particles (e.g. -lah, -kah, -pun) or affixes (e.g. ku-, -ku, kau-, -mu, -nya) are found attached to a verb (Sneddon et al. 2010; Sugerman 2016), this form is listed with its own entry.Footnote 3

Table 2 Examples of simple and complex verbs in Indonesian, and affix combinations in complex verbs as attested in the corpus
Table 3 Examples of entries in the verb database

The database comprises 2489 simple verbs and 24507 affixed verbs (3665 verbs with suffixes, 11562 verbs with prefixes, and 9280 verbs with both prefix and suffix). We observed 27 verb constructions of which 13 are reported in the literature (Hidajat 2014; Fortin 2006; Sneddon et al. 2010; Benjamin 2009; Arka et al. 2009; Sudaryanto 1993; Kridalaksana 2007). In our corpus, there are 2 attested verb constructions (terke-/-an and terper-/-an) that are not productive (1 token and 8 tokens respectively). Table 2 lists the 25 productive constructions.

As our specific interest is in nouns with PE- and PEN-, we extracted from this database all the verbs that correspond to these nominalizations and that carry the prefix BER- or MEN-. To this new database, henceforth the MeBer Database, we added information on the frequency of the base words of these complex verbs, whether the verbal prefix is MEN- or BER- and also the allomorph of MEN- (see Table 4). Whereas all nominalizations with PEN- have a corresponding verb with MEN-, there is one simple verb, sohor ‘to be famous’, that has a corresponding nominalization with PE-, pesohor ‘a famous person’, without having a corresponding verb with BER-. This verb-noun pair is not in the MeBer Database, but in a separate database (SimpleWords) which also specifies the frequency of the base verb and the frequency of the derived noun (see Sneddon et al. (2010) for discussion of such exceptional pairs).

Table 4 Examples of entries in the MeBer database

All the data in the MeBer database were compiled computationally from the output of MorphInd and subsequently checked manually using the dictionary. In total, there are 8484 words with the MEN- prefix and 3582 words with the BER- prefix. These counts include forms with the suffixes -i, -kan or -an. To this database, we added some words such as beserta ‘to be together with’, belajar ‘to study’, beternak ‘to farm’, bekerja ‘to work’, and beterbangan ‘to fly randomly’ and their inflectional variants, forms which MorphInd did not recognize but that we happened to identify in the course of this study. The MorphInd parser also does not recognize verbs with the allomorph menge-. For the 18 nominalizations with PENpenge-, we manually searched for the occurrences of the corresponding verbs and added these together with their frequency counts to the MeBer database. Finally, a total of 297 verbs with MEN- and 14 verbs with BER- were not recognized by the parser, and were corrected manually on the basis of the dictionary.Footnote 4

3.2 The PePeN database

We brought together the PE- and PEN- words in a lexical database, henceforth the PePeN database. This database also includes the nouns with PE- that have a simple verb as the base. In this way, we obtained a total of 3090 words, 267 with PE-, 2818 with PEN-, and 4 words with the unproductive variant PER- (Benjamin 2009).Footnote 5 There are 34 words that the MorphInd parser did not analyze.

All derived words were annotated manually for semantic role (agent, instrument, causer, patient, and location), and checked (for at least one token) against both the dictionary and usage in the corpus. As in English, where -er nominalizations may express multiple semantic roles (Booij 2010; Booij and Lieber 2004) (e.g. printer, which has both an agent and instrument reading), Indonesian PE- and PEN- formations can have multiple interpretations (see Table 5). In this study, we did not distinguish between impersonal agentFootnote 6 and instrument. Although it is well known that PEN- creates agents, patients, and instruments (Sneddon et al. 2010), we observed a few cases of causer (e.g. penyakit ‘disease’) and location (e.g. penghujung ‘the end’) in our database. It is possible, even likely, that semantic roles are in use in the corpus without being registered in the database, as manual verification of all 579564 tokens with PE- or PEN- in the corpus was infeasible. Words with more than one semantic role have multiple entries in the database, with one row for each role (cf. Table 5). The frequencies listed in the rows of Table 5 are the overall frequencies of the words and are not broken down by semantics.

Table 5 Examples of semantic role

The PePeN Database thus provides the following information:

  1. Word frequency: the token frequency of the derived word in the corpus,

  2. Allomorph: the form of the PEN- prefix; where the allomorph does not follow the rules as given in Chaer (2008), Sneddon et al. (2010), e.g. penglihat ‘seer’ is expected to be pelihat, this is marked in the ‘notes’ column of the database as AllomorphDeviation,

  3. Base word,

  4. Word category of the base word,

  5. Base word frequency: the token frequency of the base word in the corpus,

  6. MorphInd output as illustrated in Table 1,

  7. Semantic role of the derived noun with respect to the base word (agent, instrument, patient, …),

  8. Morphological variation: reduplications, particles (e.g. -lah, -pun, per-) or affixes (e.g. -ku, -mu, -nya), if present,

  9. Typo: whether the form in the corpus had a spelling error (corrected in the database, frequency counts include the frequency of the corrected typos); when several spelling alternants are in use, this is indicated in the FreeVariance column of the database as illustrated in Table 7.

Entries of this database are listed in Table 6.

Table 6 Example entries in the PePeN database
Table 7 Example entries in the PePeN database illustrating spelling variants and typos (pemain is the second most frequent PEN- nominalization in the database)

4 Analysis

4.1 Productivity of PE- and PEN- derived nouns

The PE- and PEN- prefixes differ in their productivity. As shown in the upper panel of Table 8, PEN- occurs with more tokens, more types, and more hapax legomena compared to PE-. Further detail is provided by the lower panel of Table 8, which shows the numbers of tokens, types, and hapaxes for PEN- allomorphs and PE-.

Table 8 Counts of tokens, types, and hapaxes for PE- and PEN- (upper table) and for PE- and the six allomorphs of PEN- (lower table)

Figure 1 presents rank-frequency plots for PE- and PEN- (left panel), and for PE- and all allomorphs of PEN- (right panel), using logarithmic scales (Zipf 1935, 1949). The left panel clarifies that the highest ranked words with PEN- also exceed in frequency the highest ranked words with PE-. Nevertheless, the productivity index V1/N (Baayen 2009), the number of hapax legomena V1 divided by the number of tokens N, remains greater for PEN- (0.00118) than for PE- (0.00055). The right panel of Fig. 1 shows that four of the six allomorphs of PEN- have rank-frequency curves that lie above the rank-frequency curve of PE-. The curve for PENpeny- crosses the curve for PE- around rank 50, but still shows many more low-frequency formations. The only allomorph that is less productive than PE- is PENpenge-, an allomorph that attaches to monosyllabic words and which appears in the corpus with only 18 types.
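The counts that enter the productivity index can be obtained from a frequency list in a few lines. The sketch below is a minimal illustration, assuming a mapping from word types to token frequencies; the function name and the frequencies are invented for the example and are not corpus counts.

```python
def productivity(frequencies):
    """Return (N, V, V1, P) for a mapping from word type to token frequency.

    N  = number of tokens
    V  = number of types
    V1 = number of hapax legomena (types occurring exactly once)
    P  = V1 / N, the productivity index
    """
    n_tokens = sum(frequencies.values())
    n_types = len(frequencies)
    n_hapaxes = sum(1 for freq in frequencies.values() if freq == 1)
    return n_tokens, n_types, n_hapaxes, n_hapaxes / n_tokens


# Invented toy frequencies for three PE- nouns (NOT taken from the corpus).
toy = {"petani": 6, "petinju": 3, "pesapa": 1}
print(productivity(toy))  # (10, 3, 1, 0.1)
```

Because V1 is divided by N, a prefix with many high-frequency words needs proportionally more hapax legomena to reach the same index.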

Fig. 1 Rank-frequency curves for PE- and PEN- (left panel), and for PE- and the allomorphs of PEN- (right panel). PE- is less productive than PEN-, and it is also less productive than the allomorphs of PEN-, with the exception of PENpenge-, which is attested with only 18 types (Color figure online)

Given the similarity in form of PE- and PEN-, the question arises of whether it makes sense to consider PE- as a low productivity allomorph of PEN-. To address this question, we examined the counts of types and hapax legomena for PE- and the allomorphs of PEN- as a function of the number of base verbs with BER- and base verbs with allomorphs of MEN-. Figure 2 shows that the rate at which base verbs give rise to derived nouns is the same (according to a regression model) for all allomorphs of MEN- and that PE- patterns as an outlier, both with respect to type counts and with respect to hapax legomena. It is remarkable that the rate at which hapaxes and types appear is so constant across the allomorphs of PEN- and MEN-. From this, we draw the conclusion that the outlier PE- is best understood as a formative in its own right. We note here that Indonesian PEN- and MEN- offer a remarkable window on the relation between base productivity and derived productivity.
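The logic of this comparison can be sketched as follows: fit a line to the points for the PEN- allomorphs, then ask how far the point for PE- lies from that line. The text does not specify the regression model, so the ordinary least squares fit below is an assumption, as are the function names. All numbers are invented toy values chosen to make the arithmetic transparent; they are not the counts plotted in Fig. 2.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residual(x, y, slope, intercept):
    """Vertical distance of the point (x, y) from the fitted line."""
    return y - (intercept + slope * x)


# Invented toy counts: base verb types (x) and derived noun types (y)
# for four hypothetical allomorphs that share one rate.
base_types = [100, 200, 400, 800]
noun_types = [50, 100, 200, 400]
slope, intercept = fit_line(base_types, noun_types)

# A hypothetical held-out prefix with many base verbs but few derived nouns.
print(slope, intercept, residual(600, 100, slope, intercept))
```

A large negative residual for the held-out point, relative to the residuals of the fitted points, is what marks it as an outlier in this kind of plot.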

Fig. 2 Counts of types for base verbs (horizontal axis) and counts of types and hapaxes for PE- and PEN- (vertical axis); solid and dashed lines represent regression lines to the PEN- allomorphs for counts of types and counts of hapax legomena respectively

Further evidence that PE- is not an allomorph of PEN- emerges when we take the semantic roles of the derived noun into account. Table 9 cross-tabulates PE- and the allomorphs of PEN- by the semantic roles of these nouns in our database; Fig. 3 provides the corresponding visualisation for the three roles that are most frequent: agent, patient, and instrument. Both PE- and PEN- create agent nouns. PE- shows some productivity for patient nouns, of which there are proportionally very few among the nouns with PEN-. (The numbers are small, but this asymmetry is significant according to a chi-squared test, \(\chi^{2}_{(1)} = 81.32, p<0.0001\); interestingly, the few patient nouns with PEN- are realised with the allomorph pe-. The proportion of patient hapaxes, however, is much lower for PEN- than for PE-: 0.02 vs. 0.13, p<0.015, proportion test.) Conversely, PEN- is productive for instruments, which are virtually absent for PE-. This may be one of the reasons that PEN- is more productive than PE-. For PEN-, a chi-squared test indicates that the ratios of agents to instruments are proportional across all allomorphs (\(\chi^{2}_{(5)} = 1.01, p>0.1\) and \(\chi^{2}_{(5)} = 5.48, p>0.1\) for types and hapax legomena respectively). The uniformity of semantic functions across the allomorphs of PEN- is perfectly in line with the fact that these allomorphs are phonologically conditioned. Conversely, the lack of productivity for instruments that characterizes PE-, and its (limited) productivity for patient nouns that is strongly attenuated for PEN-, are a further indication that PE- is unlikely to be an allomorph of PEN-. Thus, Indonesian PEN- and PE- show the kind of semantic specialisation that led Baayen et al. (2013) to conclude that Russian pere- and pre- are not allomorphs but independent prefixes.
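The chi-squared statistics reported above are computed from cross-tabulations such as Table 9. As a minimal sketch of the computation, the function below derives the statistic and its degrees of freedom from a table of counts. The function name is hypothetical and the counts are invented toy values, not those of Table 9. The sketch applies no continuity correction and does not compute a p-value.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Invented 2 x 2 toy table: rows are two prefixes, columns two semantic roles.
toy_table = [[10, 20],
             [20, 10]]
statistic, df = chi_squared(toy_table)
print(round(statistic, 2), df)  # 6.67 1
```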

Fig. 3 Counts of types (left panel) and hapax legomena (right panel) broken down by semantic role, for PE- and the allomorphs of PEN-. Both prefixes support agents, but PE- shows limited productivity for patient nouns, whereas PEN- shows additional productivity for instruments (Color figure online)

Table 9 Cross-tabulation of PE- and the allomorphs of PEN- by semantic role. Upper table: counts of types; lower table: counts of hapax legomena

The counts underlying Table 9 and Fig. 3 are based on a type definition that distinguishes between forms of the noun with different possessive suffixes or suffixes expressing emphasis, as well as noun plurals. When such variants are collapsed into a single type, the pattern of results on the ratios of agents to instruments across all allomorphs remains similar (\(\chi^{2}_{(5)} = 0.75, p>0.1\) and \(\chi^{2}_{(5)} = 5.11, p>0.1\) for types and hapax legomena respectively). However, the number of distinct types for patient nouns with PE- reduces to 5, each of which occurs more than once. Thus, PE- appears to be well-entrenched for a handful of patient nouns, but does not show real productivity here.

Krott et al. (1999) reported the paradoxical finding that words with less productive affixes tend to be used more as base words for further word formation. A similar observation holds for PE- and PEN-, but now for inflection rather than word formation. Inflectional variation is well illustrated by the noun pengikut ‘follower’, which is attested in the corpus with 9 variants: pengikutku ‘my follower’, pengikutmu ‘your follower’, pengikutnya ‘his/her follower’; reduplication as in pengikut-pengikut ‘followers’; reduplication and affixes as in pengikut-pengikutmu ‘your followers’, pengikut-pengikutnya ‘his/her followers’; affixes and particles as in pengikutmupun ‘your follower’ (contrastive your, i.e., your, not somebody else’s follower), pengikutnyapun ‘his/her follower’ (contrastive), pengikutnyalah ‘his/her follower’ (contrastive in imperative mood). Table 10 shows the counts of the different kinds of inflection types for PE- and PEN-. In our corpus, particles (e.g. -lah, -pun), possessive suffixes (e.g. -ku, -mu, -nya), and plural reduplications are used most often. Figure 4 presents a mosaic plot for the cross-classification of PE- and PEN- by type of inflection. The mosaic plot shows that inflected forms of PE- are overrepresented for particles, plurals, and combinations of plurals and possessives. In other words, words with the less productive prefix, PE-, are used more intensively as input for further inflection than words with PEN-. This is likely to be due to the greater entrenchment of words with PE- in the mental lexicon, which makes them more readily available for further affixation. Thus, the same principles that Krott et al. (1999) reported for derivation in Germanic languages generalize to inflection in Indonesian.

Fig. 4 Mosaic plot for the cross-classification of PE- and PEN- by type of inflection. The colour coding represents the Pearson residuals, which clarify where the observed counts are greater (purple) or smaller (pink) than the expected values. A chi-squared test confirms that PE- and PEN- distribute differently over inflectional types (\(\chi^{2}_{(4)} = 36.59\), p<0.0001) (Color figure online)

Table 10 Counts of variant types for PE- and allomorphs of PEN-. The base represents the non-variant forms. Particles, possessive suffixes, and plural reduplications dominate the counts
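The Pearson residuals that colour the cells of a mosaic plot such as Fig. 4 are the signed, standardized differences between observed and expected counts. The sketch below shows the calculation on an invented toy table; the function name and the counts are assumptions and do not reproduce Table 10.

```python
import math


def pearson_residuals(table):
    """Pearson residual (observed - expected) / sqrt(expected) for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


# Invented toy table: rows are two prefixes, columns two inflection types.
toy_table = [[10, 20],
             [20, 10]]
for row in pearson_residuals(toy_table):
    print([round(r, 2) for r in row])
```

A positive residual marks a cell with more observations than expected under independence, a negative residual a cell with fewer. The squared residuals sum to the chi-squared statistic of the table.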

4.2 The base verbs of PEN- and PE-: MEN- and BER-

Several studies call attention to the tight relation between PE- and PEN- and their verbal base words (Putrayasa 2008; Chaer 2008; Ramlan 2009; Kridalaksana 2007; Darwowidjojo 1983). We therefore inspected the productivity of verb formation, focusing on monomorphemic words as potential base words. In our database, a total of 5581 such monomorphemic words is attested, with 3617 simple nouns, 943 simple adjectives, and 1021 simple verbs. As shown in Table 2, a large number of affixes is available for creating verbs from nouns, adjectives, and verbs. For this study, the number of different complex verb forms will be referred to as a monomorphemic word’s verb family size. The verb family size measure includes inflectional variants of the verbs in its counts. Plots of this verb family size against base frequency show that, as expected, a higher base frequency predicts a greater verb family size. Interestingly, the functional form of this relation is different for base words that give rise to nouns with PEN-, and those that do not. This is illustrated in Fig. 5 (see also Table 11), which present the results of a GAM (Generalized Additive Model, MGCV package version 1.8–17, Wood (2006, 2011)) with a poisson link fitted to the verb family size with centered log base frequency as the predictor. The increase of verb family size with base frequency is greater when PEN- is present, as can be seen by comparing the right panel with the left. In the right panel, we see a linear increase, whereas in the left panel, there is no increase at all for the lowest frequency base words. For the larger part of the range of the base word frequencies, the verb family size is larger if the verb family has a noun with PEN-. 
We also considered the base words with PE- in the verb family, but as the resulting curve was not significantly different from that of base words with verb families that did not have either nominalization, the two sets were merged into one defined by the absence of PEN- in the verb family. Apparently, base productivity and derived productivity are interacting for PEN-, but independent for PE-.
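The modelling set-up described above can be approximated with a small sketch. It is not the GAM reported here: it fits a plain Poisson log-linear model with a linear effect of centered log base frequency, using Newton-Raphson and only the Python standard library. The function names, the variable names, and all counts are invented for illustration.

```python
import math

def center_log(freqs):
    """Return log frequencies centered on their mean."""
    logs = [math.log(f) for f in freqs]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

def fit_poisson(x, y, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson.

    A parametric stand-in for the smooth term of a GAM (assumption:
    a linear effect on the log scale is adequate for the toy data).
    """
    b0 = math.log(sum(y) / len(y))  # start at the log of the mean count
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information, a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Invented toy data: base frequencies and verb family sizes.
base_freqs = [5, 20, 80, 300, 1200, 5000]
family_sizes = [1, 2, 2, 4, 5, 8]

x = center_log(base_freqs)
b0, b1 = fit_poisson(x, family_sizes)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

Fitting such a model separately to base words with and without a PEN- noun in their verb family would give two slopes to compare. A GAM additionally allows the relation to be non-linear, which is what distinguishes the two panels of Fig. 5.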

Fig. 5

Partial effects for verb family size regressed on centered log base frequency, for morphological families without nouns with PEN- but possibly including nouns with PE- (left panel) and for morphological families including derived nouns with PEN- (right panel)

Table 11 GAM summary for partial effects for verb family size regressed on centered log base frequency, for morphological families including derived nouns with PEN- and without PEN- but possibly including nouns with PE-

Figure 6 (Footnote 7) presents mosaic plots for the cross-classification of word category and the presence of PE- or PEN- in a monomorphemic base word’s verb family. The mosaic plot in the left panel concerns base words that have at least one formation in their verb family (i.e. with neither PE- nor PEN-, with PE-, or with PEN-). The plot shows that simple words that give rise to affixed verbs but not to any formations with PE- or PEN- are overrepresented for nouns, and that base words that have PEN- in their verb family are, unsurprisingly, overrepresented for verbs (\(\chi^{2}_{(4)} = 839.97\), p<0.0001). These overrepresentations are indicated by the residuals (Zeileis et al. 2007). The right panel concerns monomorphemic base words for which the verb family size is zero. Again, we see that base words that have PEN- in their verb family are overrepresented for verbs (\(\chi^{2}_{(4)} = 288.58\), p<0.0001). No such overrepresentation is visible for PE-. Whereas the literature on PE- and PEN- holds that PEN- is derived from verbs with MEN-, our corpus data indicate that PEN- actually can attach to simple words that do not have a corresponding verb with MEN-, even though the total number of instances is small (45). It is possible that the relevant MEN- verbs are in use in the language, but not attested in our corpus. Alternatively, it is conceivable that these MEN- verbs only have a virtual existence as possible words.
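To make the role of the Pearson residuals concrete, the following sketch computes them, together with the chi-squared statistic, for a small two-way table. The counts are invented and do not reproduce the corpus data. The table layout (word category by presence of neither prefix, PE-, or PEN-) is an assumption chosen to give four degrees of freedom, as in the tests reported above. The p-value is omitted because the standard library has no chi-squared distribution.

```python
import math

def pearson_residuals(table):
    """Return (chi2, dof, residuals) for a two-way table of counts.

    Each residual is (observed - expected) / sqrt(expected), where the
    expected count assumes independence of rows and columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / math.sqrt(expected)
            res_row.append(r)
            chi2 += r * r  # chi-squared is the sum of squared residuals
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, residuals

# Invented counts. Rows: noun, adjective, verb.
# Columns: neither prefix, PE-, PEN- in the verb family.
toy_table = [
    [60, 10, 30],
    [20, 5, 15],
    [10, 5, 45],
]

chi2, dof, residuals = pearson_residuals(toy_table)
print(f"chi2 = {chi2:.2f} on {dof} degrees of freedom")
for label, row in zip(["noun", "adjective", "verb"], residuals):
    print(label, [round(r, 2) for r in row])
```

A positive residual marks a cell with more observations than expected under independence, and a negative residual marks a cell with fewer. This is the information that the shading of a mosaic plot encodes.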

Fig. 6

Left panel: mosaic plot for the type counts of verbs derived from monomorphemic words cross-classified by the word category of the monomorphemic word and the presence of PE- or PEN- in its verb family. Right panel: corresponding mosaic plot for the type counts of monomorphemic words that do not have any derived verbs attested in the corpus. The colour coding represents the Pearson residuals, which clarify where the observed counts are greater (blue) or smaller (red) than the expected values (Color figure online)

We have seen that PEN- is more productive than PE- and more tightly integrated into the verbal system. This raises the question of whether the reduced productivity of PE- might be due to reduced productivity of the verbal prefix BER-. Indeed, verbs with MEN- are more productive overall than verbs with BER- (2714704 tokens with MEN- vs. 801052 tokens with BER-, 5174 types with MEN- vs. 2869 types with BER-, and 996 hapax legomena with MEN- vs. 760 hapax legomena with BER-); see also Table 12 and the rank-frequency plot for BER- and MEN- in the left panel of Fig. 7. However, when considering the allomorphs of MEN- separately, it turns out that BER- is more productive than any of these allomorphs, as shown in the right panel of Fig. 7. Although BER- is more productive than any of the allomorphs of MEN-, it is not the case that PE- is proportionally more productive than any of the allomorphs of PEN-. It follows that the modest productivity of PE- is not a straightforward consequence of the lack of productivity of BER-. This conclusion receives further support from the presence of a significant correlation between the frequency of the MEN- base and the PEN- nominalization (\(r_{s}=0.4397, p<0.0001\)) and the absence of such a correlation for BER- and PE- (\(r_{s}=0.1908, p=0.1711\)).
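The rank correlations reported above can be illustrated with a short sketch of Spearman's coefficient, computed as the Pearson correlation of average ranks. The helper names and the frequency pairs are invented for illustration. The code returns only the coefficient, not a p-value.

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented frequencies of base verbs and of their nominalizations.
base_verb_freq = [1200, 45, 300, 8, 950, 60]
nominal_freq = [150, 4, 20, 6, 90, 12]

print(f"r_s = {spearman(base_verb_freq, nominal_freq):.4f}")
```

Because the coefficient depends only on ranks, it is insensitive to the strongly skewed shape that word frequency distributions typically have.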

Fig. 7

Rank-frequency plots for MEN- and BER- distributions. The x-axis represents rank and the y-axis represents frequency of occurrence in the corpus. The lines in the left panel illustrate that MEN- is more productive than BER-. However, BER- becomes the most productive prefix when it is compared to the individual allomorphs of MEN- (right panel) (Color figure online)

Table 12 Counts of tokens, types, and hapaxes for six MEN- allomorphs (menge-, meny-, me-, mem-, men-, meng-) and BER-
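The three counts that underlie these comparisons (tokens, types, and hapax legomena) can be obtained from a list of word tokens as sketched below. The function name is hypothetical and the token list is a toy example, not corpus data. Assigning words to a prefix is assumed to have been done beforehand.

```python
from collections import Counter

def productivity_counts(tokens):
    """Return (tokens, types, hapaxes) for a list of word tokens."""
    freq = Counter(tokens)
    n_tokens = sum(freq.values())   # all occurrences
    n_types = len(freq)             # distinct word forms
    n_hapaxes = sum(1 for count in freq.values() if count == 1)
    return n_tokens, n_types, n_hapaxes

# Toy sample of verbs with BER- (invented token frequencies).
toy_ber_tokens = [
    "berjalan", "berjalan", "berlari",
    "bermain", "bermain", "bermain",
]

n_tokens, n_types, n_hapaxes = productivity_counts(toy_ber_tokens)
print(n_tokens, n_types, n_hapaxes)
```

Sorting the values of the frequency table in decreasing order gives the input for a rank-frequency plot of the kind shown in Fig. 7.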

5 General discussion

We have presented a quantitative investigation of the use of two nominalizing prefixes of Indonesian: PE- and PEN-. Although the two are quite similar in form, nouns with PE- are described in the literature as derived from verbs with the prefix BER-. Conversely, nouns with PEN- typically originate from verbs with the prefix MEN-, and show the same allomorphy in the same conditioning contexts as these prefixed verbs. In this paper, we addressed three questions. First, do PE- and PEN- differ with respect to their degree of productivity? Second, how does their productivity relate paradigmatically to the productivity of their base words? Third, given the similarity in form of PE- and PEN-, should they be taken to be allomorphs? To answer these questions, we examined the use of these nominalizations and their base words in a corpus of written Indonesian.

With regard to their productivity, PEN- is clearly more productive than PE- by any measure of productivity. In fact, PE- is less productive than any of the allomorphs of PEN-, the only exception being the allomorph penge-, for which only 18 words are attested. PEN- is productive for agents and instruments, whereas PE- is productive for agent nouns and to some small extent for patient nouns. Nouns with PE- and PEN- reveal the same productivity paradox that was reported by Krott et al. (1999) for derivation and compounding. Krott et al. observed that less productive morphological categories are used more intensively as input for further word formation. In our data, we likewise find that the less productive prefix, PE-, appears with more variants than PEN-.

Whereas words with PE- are more readily accessible for further inflection than words with PEN- (see Fig. 4), words with PEN- emerge as paradigmatically more entrenched. Verbs to which PEN- attaches tend to allow for more verbal affixation than is the case for verbs to which PE- attaches (see Fig. 5). Furthermore, the productivity of the allomorphs of PEN- mirrors the productivity of the allomorphs of their base words with MEN- (see Fig. 2). The proportionalities that govern the types and hapaxes of the allomorphs of MEN- and PEN- do not extend to BER- and PE-. In fact, PE- is surprisingly uncommon with base verbs with BER-, which is not what the standard descriptions in the literature, according to which PEN- is derived from MEN- and PE- from BER- (Chaer 2008; Ramlan 2009; Ermanto 2016; Sneddon et al. 2010; Putrayasa 2008; Darwowidjojo 1983; Benjamin 2009), would lead one to expect.

It is well known that the productivity of an affix can vary depending on the structure of its base words (Aronoff 1976; Baayen and Renouf 1996). Nevertheless, it is surprising to see an almost perfect linear relation between the productivity of the allomorphs of MEN- and the productivity of the allomorphs of PEN-, both with respect to types and with respect to hapax legomena. This linear relationship strongly supports analyses according to which the variant forms of PEN- and MEN- are allomorphs. Our examination of the use of PE- and PEN- in written Indonesian revealed some novel uses that have not been noted in the preceding literature on allomorphy.

This raises the question of whether PE- should be considered to be yet another allomorph of PEN-. Several observations argue against this possibility. First, PE- does not participate in the linear dependence that characterizes the productivity of the allomorphs of MEN- and PEN-. Second, our data indicate that PEN- has a strong preference for verbs as base words, but PE- does not show such a preference. Third, a monomorphemic base word’s verb family tends to be larger when this verb family gives rise to a nominalization with PEN-, but no such tendency is present for PE-. Fourth, the frequencies of words with PEN- enter into a significant correlation with the frequency of the base words, but no such correlation is present for PE-: the formations with PE- have become independent of their base words. Finally, PE- is proportionally overrepresented for patient nouns, whereas PEN- creates primarily instruments in addition to agents.

That allomorphy is to some extent a matter of degree is well known (Baayen et al. 2013; Endresen 2014). Obviously, PE- is highly similar in form to PEN-; in fact, it is identical to one of its allomorphs (although it is possible that phonetically the two are different; see Plag et al. (2017) for durational differences between the realisations of English -s depending on the semantic functions expressed). Yet, even though PE- and PEN- are largely in complementary distribution, they differ substantially in their productivity, both quantitatively and qualitatively, as well as in their entrenchment in the verbal system of Indonesian.