1 Introduction

The use of Machine Translation (MT) to translate everyday, written exchanges is becoming increasingly commonplace; translation tools now regularly appear on chat applications and social networking sites to enable cross-lingual communication. MT systems must therefore be able to handle a wide variety of topics, styles and vocabulary. Importantly, the translation of dialogue requires translating sentences coherently with respect to the conversational flow so that all aspects of the exchange, including speaker intent, attitude and style, are correctly communicated (Bawden 2018).

It is important to have realistic data to evaluate MT models and to guide future MT research for informal, written exchanges. In this article, we present DiaBLa (Dialogue BiLingue ‘Bilingual Dialogue’), a new dataset of English–French spontaneous written dialogues mediated by MT, obtained by crowdsourcing, covering a range of dialogue topics and annotated with fine-grained human judgments of MT quality. To our knowledge, this is the first corpus of its kind. Our data collection protocol is designed to encourage speakers of two languages to interact, using role-play scenarios to provide conversation material. Sentence-level human judgments of translation quality are provided by the participants themselves while they are actively engaged in dialogue. The result is a rich bilingual test corpus of 144 dialogues, which are annotated with sentence-level MT quality evaluations and human reference translations.

We begin by reviewing related work in corpus development, focusing particularly on informal written texts and spontaneous bilingual conversations (Sect. 1.1). We discuss the potential of the corpus and the collection method for MT research in Sect. 2, both for MT evaluation and for the study of language behaviour in informal dialogues. In Sect. 3 we describe the data collection protocol and interface. We describe the basic characteristics of the corpus in Sect. 4, including a description of the annotation layers (normalised versions, human reference translations and human MT quality judgments) and examples. We illustrate the usefulness of the human evaluation by providing a comparison and analysis of the MT systems used (Sect. 4.4): we compare two different types of MT system, a baseline model and a mildly context-aware model, based on a quantitative analysis of the human judgments. We also provide examples of phenomena that are challenging for translation and include a preliminary analysis of a dialogue-level phenomenon, namely the consistent use of formal and informal person pronouns (Sect. 4.5). Finally, we present plans for future work on the corpus in Sect. 5. The corpus, interface, scripts and participation guidelines are freely available under a CC BY-SA 3.0 licence.

1.1 Related work

A number of parallel corpora of informal texts do exist. However, they either cover different domains or are not designed with the same aim in mind. OpenSubtitles (Lison and Tiedemann 2016; Lison et al. 2018) is a large-scale corpus of film subtitles from a variety of domains, making for very heterogeneous content. However, the conversations are scripted rather than spontaneous, they are translations of monolingual texts rather than bilingual conversations, and they are subject to additional constraints, such as sentence length, imposed by the subtitle format. The MSLT corpus (Federmann and Lewis 2016) is designed as a bilingual corpus and is based on oral dialogues produced by bilingual speakers, who understand the other speaker’s original utterances. This means that it is not possible to analyse the impact that using MT has on the interaction between participants. Other bilingual task-orientated corpora exist, for example BTEC (Basic Travel Expression Corpus; Takezawa et al. 2002), SLDB (Spoken Language DataBase; Morimoto et al. 1994) and the Field Experiment Data of Takezawa et al. (2007), the last of which is the most similar corpus to our own in that it contains MT-mediated dialogue. However, these corpora are restricted to the travel/hospitality domains and therefore do not allow the same variety of conversation topics as our corpus. Human judgments of overall quality are provided for the third corpus (Field Experiment Data), but only at a very coarse-grained level. Feedback about the participants’ perception of MT quality is therefore of limited use for MT evaluation, since sentence-level evaluations are not provided. Similarly, the multilingual dialogue corpora from the Verbmobil project (Wahlster 2000) provide mediated interactions between speakers of two languages. However, the topic of conversation is also limited, centred on scheduling meetings. Furthermore, these corpora are not freely available.

2 Motivation

The main aim of our corpus is to serve as a test set to evaluate MT models in an informal setting in which communication is mediated by MT systems. However, the corpus can also be of interest for studying the type of language used in written dialogues, as well as the way in which human interaction is affected by use of MT as a mediation tool. We develop these two motivations here, starting with the corpus’ utility for MT evaluation (Sect. 2.1) and then discussing the corpus’ potential for the analysis of MT-assisted interaction (Sect. 2.2).

2.1 MT evaluation

The corpus is useful for MT evaluation in three ways: as (i) a test set for automatically evaluating new models, (ii) a challenge set for manual evaluation, and (iii) a validation of the effectiveness of the protocol to collect new dialogues and to compare new translation models in the future.

2.1.1 Test set for automatic evaluation

The test set provides an example of spontaneously produced written utterances in an unscripted setting, with high quality human reference translations. It could be particularly useful for evaluating contextual MT models due to the dialogic nature of the utterances, the need to take into account previous MT outputs and the presence of metadata concerning both the dialogue scenario and the speakers involved. While it is true that the quality of the MT systems will influence the dialogue (in terms of translation errors), the dataset is a useful resource as an example of a scenario in which MT has been used for mediation. MT systems will continue to make errors and have not reached the level of human translators (particularly concerning aspects such as style, politeness and formality). It is therefore important to know how to handle errors when they arise, regardless of the system that has produced them, and to study how users of the systems change their language behaviour depending on the limitations of the systems.

2.1.2 Challenge set for manual evaluation

The corpus can also be used as a challenge set for manual evaluation. The sentence-level human judgments provided indicate which sentences were the most challenging for MT. Manual evaluation of new translations of our test set can then be guided towards those sentences whose translations are marked as poor, to provide an informed idea of the quality of the new models on these difficult examples, and therefore to encourage development for particularly challenging phenomena.

2.1.3 Validation of the protocol for the collection of human judgments of MT quality

Human evaluation remains the most accurate form of MT evaluation, especially for understanding which aspects of language pose difficulties for translation. While hand-crafted examples and challenge sets provide the means to test particular phenomena (King and Falkedal 1990; Isabelle et al. 2017), it is also important to observe and evaluate the quality of translation on spontaneously produced texts. Our corpus provides this opportunity, as it contains spontaneous productions by human participants and is richly annotated for MT quality by its end users. In Sect. 4.4, we provide a preliminary comparative evaluation of the two MT systems, in order to show the utility of the human judgments collected. This same collection method can be applied to new MT models for a similar evaluation.

2.2 MT-assisted interaction

As MT systems are becoming more common online, it is important for them to take into account the type of language that may be used and the way in which user behaviour may affect the system’s translation quality. Non-canonical syntactic structures, spelling and typing errors, and text mimicking speech (including pauses and reformulations) must be taken into account if MT systems are to be used for successful communication in more informal environments. The language used in our corpus is relatively clean in terms of spelling. However, participants are encouraged to be natural with their language, and a fruitful direction would therefore be the analysis of the type of language used. Another interesting aspect of human-MT interaction would be to study how users themselves adapt to using such a tool during the dialogues. How do they deal with translation errors, particularly those that make the dialogue incoherent? Do they adjust their language over time, and how do they indicate when they have not understood correctly? An interesting line of research would be to use the corpus to study users’ communication strategies, for example by studying breakdowns in communication as in Higashinaka et al. (2016).

3 Data collection and protocol

We collected the dialogues via a dedicated web interface (shown in Fig. 1) allowing participants to register, log on and chat. Each dialogue involves two speakers, a native French speaker and a native English speaker. Each writes in their native language and the dialogue is mediated by two MT systems, one translating French utterances into English and the other translating English utterances into French.

Fig. 1

The dialogue interface as viewed by the English participant (note that this example shows the middle of a dialogue rather than the beginning). The English participant’s original utterances are shown on the right side in green. The French participant’s utterances (machine translated into English) are on the left in peach colour. The quality of these machine translations has been evaluated by the English speaker using the smiley notation (Cf. Sect. 4.4). Note that the finer-grained evaluation is currently hidden in this example. See Fig. 3 for the more detailed view of sentence-level evaluation

3.1 Participants

Participants were adult volunteers recruited by word of mouth and social media. They participated free of charge, motivated by the fun of taking part in fictional role-play. They provided basic information: age bracket, gender, English and French language ability, other languages spoken and whether they work in research or Natural Language Processing (NLP) (see Table 1 for basic statistics). Participants were evenly distributed between males and females and came from both NLP and non-NLP backgrounds. Although age ranges were the same for English and French speakers, the modal age brackets differed (55–64 for English and 25–34 for French).

Table 1 Some basic characteristics of the dialogue participants, recorded and distributed with the corpus

3.2 Scenarios

To provide inspiration and to encourage a wide variety of different utterances, a role-play scenario is given at the start of each dialogue and roles are randomly assigned to the speakers. We designed twelve scenarios, shown in Fig. 2. They were chosen to reflect a range of everyday situations that people can relate to but that do not restrict the dialogue to being formulaic. The first turn is assigned randomly to one of the speakers to get the dialogue started. This information is indicated at the top of the dialogue screen in the participants’ native languages. A minimum of 15 sentences per speaker is recommended, and participants are informed once this threshold is reached, although they can continue for longer. Participants are told to play fictional characters and not to use personal details. We nevertheless anonymise the corpus prior to distribution to remove usernames mentioned in the text.

Fig. 2

The twelve scenarios and speaker roles. Equivalent descriptions were presented in French to the French participants. Each scenario was presented six times for each MT model to ensure a balanced corpus

3.3 Evaluation method

The participants evaluate each other’s translated sentences from a monolingual point of view. Having the participants themselves provide the MT evaluation is an important part of our protocol: judgments can be collected on the fly, which facilitates the evaluation process, and, importantly, the evaluation is performed from the point of view of participants actively engaged in dialogue. Although some errors may go unnoticed (e.g. a word choice error that nevertheless makes sense in context), many errors can be detected this way through judgments about coherence and understanding of the dialogue flow. Information about which mistakes are perceived could also help to identify those that go unperceived.

MT quality is evaluated twice: (i) during the dialogue and (ii) at the end of the dialogue. Evaluations are stored for later analysis and are not shown to the other participant.

Participants evaluate each translated sentence during the dialogue by first selecting an overall translation quality (perfect, medium or poor) using a smiley notation system. If they select either medium or poor, they are prompted to indicate which types of errors they think occur in the translation: grammar, meaning, style, word choice, coherence and other (see Fig. 3 for an example). Note that several problems can be indicated for the same sentence. If they wish, participants can also write a free comment providing additional information or suggesting corrections. This annotation schema was designed to provide fine-grained evaluations [finer-grained than those of many existing datasets, such as the Field Experiment Data of Takezawa et al. (2007)] without making the task tedious or overly complex for the participants, which could also affect the naturalness of the dialogue. Participants can revise their previous evaluations at any point during the dialogue, any number of times. This may occur if they change their mind, for example because new utterances in the dialogue make it clear that a previous evaluation was not correct. The entire history of the evaluations (including the times of updates) is recorded so that changes in the perception of errors are documented.
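To make the structure of these judgments concrete, the following is a minimal sketch of what a sentence-level judgment record could look like, including the revision history described above. The field names and the layout are illustrative assumptions and do not correspond to the corpus's actual distribution format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Error categories offered to participants for medium/poor translations.
ERROR_TYPES = {"grammar", "meaning", "style", "word choice", "coherence", "other"}

@dataclass
class Judgment:
    """One (possibly revised) evaluation of a single machine-translated sentence."""
    overall: str                                      # "perfect", "medium" or "poor"
    errors: List[str] = field(default_factory=list)   # subset of ERROR_TYPES, empty if "perfect"
    comment: Optional[str] = None                     # optional free comment or correction
    timestamp: Optional[str] = None                   # when the evaluation was (re)submitted

@dataclass
class SentenceEvaluation:
    """All judgments a participant gave for one translated sentence.

    The full history is kept, so later revisions do not overwrite earlier ones.
    """
    sentence_id: str
    history: List[Judgment] = field(default_factory=list)

    def add(self, judgment: Judgment) -> None:
        assert judgment.overall in {"perfect", "medium", "poor"}
        assert all(e in ERROR_TYPES for e in judgment.errors)
        self.history.append(judgment)

    @property
    def current(self) -> Judgment:
        # The most recent judgment is the one in force.
        return self.history[-1]
```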

Fig. 3

The sentence-level evaluation form. The original French sentence was L’entrée c’est juste deux soupes de melon. “The starter is just two melon soups.”

Once the dialogue is finished, participants give overall feedback on the MT quality. They are asked to rate the quality of the translations in terms of grammaticality, meaning, style, word choice and coherence on a five-point scale (excellent, good, average, poor and very poor), and to indicate whether any particular aspects of the translation or of the interface were problematic. Finally, they indicate whether they would use such a system to communicate with a speaker of another language.

Before starting to interact, participants were given instructions (with examples) on how to evaluate MT quality, and these instructions remained available during the dialogues (cf. Appendix B). A certain degree of variation in the participants’ evaluations is to be expected. This subjectivity, inevitable in any human evaluation, is interesting, as it gives an indication of the variation in tolerance for errors and of which types of error are considered most detrimental.

3.4 MT systems

We compare the quality of two MT model types (see Sect. 4.4). Within a dialogue, the same model type is used for both language directions, and each model is used an equal number of times for a given scenario, and therefore for the same number of dialogues. Both models are neural encoder-decoder models with attention (Bahdanau et al. 2015), implemented using Nematus (Sennrich et al. 2017). The first model (baseline) is trained to translate sentences in isolation. The second (2to2) is trained to translate sentences in the context of the previous sentence, as in Tiedemann and Scherrer (2017) and Bawden et al. (2018). This is done by concatenating each sentence with its previous sentence, separated by a special token, and translating both sentences at once. In a post-processing step, only the current sentence is kept. Note that if the previous sentence is written by the same speaker as the current sentence, then the original previous sentence is prepended. If the previous sentence is written by the other speaker (in the opposite language), then the MT output of the previous sentence is prepended to the current sentence. This means that the previous sentence is always in the same language as the current sentence and also corresponds to the context seen by the current speaker, as illustrated in Examples 1 and 2 (a schematic sketch of this step is given after the examples below).

Example 1. Same speaker as previous sentence:

    You would not believe the day I've had. [original English sentence] <CONCAT> I never want to go back to work again! [current English sentence]

Example 2. Different speaker from previous sentence:

    We could go to town for a while [MT translation of French sentence] <CONCAT> I've seen the town and I'm not impressed [current English sentence]
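The concatenation and post-processing steps can be sketched as follows. This is a minimal illustration under simple assumptions: sentences are plain strings, and <CONCAT> is a stand-in for whatever separator token the actual system uses; the real implementation may differ in its details.

```python
CONCAT = "<CONCAT>"  # placeholder for the special separator token

def build_2to2_source(prev_original: str, prev_mt: str, current: str,
                      same_speaker: bool) -> str:
    """Prepend the context sentence to the current source sentence.

    If the previous sentence was written by the same speaker, use their
    original sentence; otherwise use the MT output of the other speaker's
    sentence, so that the context is always in the current speaker's language.
    """
    context = prev_original if same_speaker else prev_mt
    return f"{context} {CONCAT} {current}"

def keep_current_sentence(translated: str) -> str:
    """Post-processing: discard the translated context and keep only the
    translation of the current sentence (assuming the model reproduces the
    separator token in its output)."""
    return translated.split(CONCAT, 1)[-1].strip()

# Example 2 above: the context is the MT output of the French speaker's sentence.
src = build_2to2_source(
    prev_original="On pourrait aller en ville un moment",
    prev_mt="We could go to town for a while",
    current="I've seen the town and I'm not impressed",
    same_speaker=False,
)
```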

3.5 Training data and MT setup

The systems are trained using the OpenSubtitles2016 parallel corpus (Lison and Tiedemann 2016). The data is cleaned, tokenised and truecased using the Moses toolkit (Koehn et al. 2007) and tokens are split into subword units using BPE (Sennrich et al. 2016b). The data is then filtered to exclude poorly aligned or truncated sentences, resulting in a training set of 24,140,225 sentences. Hyper-parameters are given in Appendix A. During the dialogues, the participants’ text is first split into sentences and preprocessed in the same way as the training data. Translation is performed using Marian for fast CPU decoding (Junczys-Dowmunt et al. 2018).
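As an illustration of this preprocessing pipeline, the sketch below uses the sacremoses and subword_nmt Python packages as stand-ins for the Moses scripts and BPE tooling cited above. The file paths, the truecasing model and the filtering thresholds are placeholder assumptions; the exact filtering criteria used for the training data are not reproduced here.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer, MosesTruecaser
from subword_nmt.apply_bpe import BPE

# Placeholder resources: a trained truecasing model and learned BPE codes.
normalizer = MosesPunctNormalizer(lang="en")
tokenizer = MosesTokenizer(lang="en")
truecaser = MosesTruecaser("truecase-model.en")        # hypothetical path
with open("bpe.codes", encoding="utf-8") as codes:     # hypothetical path
    bpe = BPE(codes)

def preprocess(sentence: str) -> str:
    """Clean, tokenise, truecase and BPE-segment one sentence."""
    sentence = normalizer.normalize(sentence)
    sentence = tokenizer.tokenize(sentence, return_str=True, escape=False)
    sentence = truecaser.truecase(sentence, return_str=True)
    return bpe.process_line(sentence)

def keep_pair(src_tokens: list, tgt_tokens: list,
              max_len: int = 100, max_ratio: float = 9.0) -> bool:
    """Illustrative filter for poorly aligned or truncated sentence pairs
    (thresholds are assumptions, not the values used for the corpus)."""
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio
```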

4 Corpus characteristics

Table 2 shows the basic characteristics of the 144 dialogues. 75.7% of dialogues contain more than 35 sentences and the average sentence length is 9.9 tokens, slightly longer than the translations. An extract of dialogue, carried out in scenario 10 (cf. Fig. 2), is given in Fig. 4, providing an example of the type of language used by the participants. The language used is colloquial and contains a number of fixed expressions (e.g. get off your intellectual high-horse, Mr Fancy pants), which can prove difficult for MT, as is the case here. The systems are sometimes robust enough to handle spelling and grammatical errors (e.g. qui ne penses ‘who think (2nd person singular)’ instead of qui ne pense ‘who thinks (3rd person singular)’, and rality instead of reality, translated into French as ralité instead of réalité, conserving the spelling error in translation as a result of subword segmentation). The dialogues also contain cultural references, such as to films and actors. In many cases named entities are well conserved, although they sometimes cause problems, for example Marcel Carné translated as Marcel Carborn. The explanation is that Carné is segmented into two subwords, Car and né, during pre-processing; né is separately translated into English as ‘born’ before the subwords are concatenated back together in a post-processing step.

Table 2 Characteristics of the resulting corpus in terms of dialogue length and sentence length for both original and translated utterances
Fig. 4

A dialogue extract from scenario number 10: ‘You are at home in the evening...’. Shown are the original utterances (“Orig:”), the machine translated versions that were shown to the other participant (“MT:”), the reference translations produced a posteriori (“Ref:”) and some details from the participant evaluation produced during the dialogue (“Eval:”). The MT outputs were produced by the baseline model in this dialogue

4.1 Normalised versions

Although participants were encouraged to use their best spelling and grammar, errors did occur (missing or repeated words, typographical errors, inconsistent use of punctuation). We provide manually normalised versions of the sentences containing errors. The aim of this normalisation is to provide information about the presence of errors (useful for studying their impact on translation) and to provide a basis for the human reference translations, as we do not attempt to reproduce errors in the translations. Corrections are kept to a minimum (i.e. non-canonical use of language was not corrected if linked to colloquial usage), and in practice are therefore limited to the addition of capital letters at the beginning of sentences and full stops at the end of sentences, and to the correction of typographical errors only when the correct form can easily be guessed from the context.

4.2 Machine translations

Each sentence is translated automatically into the other language for the other participant. As previously mentioned, a single type of MT system (baseline or 2to2) is used for all sentences within a dialogue. The use of two different systems is relevant to our analysis of the human evaluations produced by dialogue participants (Sect. 4.4). The choice of MT system does of course affect the quality of the MT. However, even as techniques advance, the corpus will remain relevant and useful as a test set and for analysing human language behaviour in this setup, independently of this choice. The only limiting factor is having an MT model of sufficient quality for the participants to understand each other well enough to communicate basic ideas, to clear up misunderstandings and to provide correctly translated reformulations. Both of our MT models largely surpass this criterion, as indicated by the positive feedback from participants.

4.3 Human reference translations

In order for the corpus to be used as a test set for future MT models, we also produce human reference translations for each language direction. Translators were native speakers of the target language, with very good to bilingual command of the source language, and all translations were further verified and corrected where necessary by a bilingual speaker.

Particular attention was paid to producing natural, spontaneous translations. The translators did not have access to the machine translated versions of the sentences they were translating to avoid any bias towards the MT models or the training data. However, they could see the machine translated sentences of the opposite language direction. This was important to ensure that utterances were manually translated in the context in which they were originally produced (as the speaker would have seen the dialogue) and to ensure cohesive translations (e.g. for discursive phenomena, such as anaphora and lexical repetition). Spelling mistakes and other typographical irregularities (e.g. missing punctuation and capital letters) were not transferred to the translations; the translations are therefore clean (as if no typographical errors had been present). This choice was made because reproducing the same error type when translating is not always possible and so could depend on an arbitrary decision.

4.3.1 Translation difficulties

The setup and the informal nature of the dialogues posed some unique challenges for translation, both for MT and, more fundamentally, for translation in general. We list four of these problems below, with real examples from the corpus, to illustrate the complexity of deciding how to translate when there is no single correct translation.

4.3.1.1 Informal nature of text

Unlike more formal texts such as those from news and parliamentary domains that are typically used in MT, the dialogues are spontaneous productions that contain many colloquialisms and examples of idiomatic speech for which translation equivalents can be difficult to find. We chose idiomatic equivalents based on communicative intention, rather than producing more literal reference translations, as for instance shown in Example 3.

Example 3.

    Well look at you, Mr Fancy pants!

    Eh bien, regarde-toi, M. le snobinard !

4.3.1.2 Language-specific ambiguity

An ambiguity in one language that does not hold in the other can sometimes lead to seemingly nonsensical utterances. This is a theoretical translation problem and not one that can be solved satisfactorily: either extra explanatory information is added concerning the structure of the other language, sacrificing the spontaneous and natural aspect of the dialogue, or the utterance is translated without this extra information, which could leave the receiver of the message confused about the relevance of the utterance. In Example 4, the French speaker corrects his use of the masculine form of the word ‘patient’ to the feminine form (patiente). Given that this gender distinction is not made in English, the choice is either to specify the gender of the patient in the English translation, which would be considered odd by the English speaker, or to translate both words as the same word patient, as done here, which makes the translation of the second utterance appear incoherent. Although this is the only example of this kind in our corpus, the phenomenon could easily recur with other gender-marked nouns. In Example 5, the word ice-cream is automatically translated into its French equivalent glace, which also has a second meaning of ‘mirror’. The French speaker picks up on this ambiguity concerning the word glace ‘ice cream or mirror’ and asks for clarification. However, the distinction that is made, between the glace that is eaten and the one that you look into, is not one that will resonate with the English speaker, for whom no ambiguity exists. The effect is that the translations of the French speaker’s utterances appear nonsensical to the English speaker. In our reference translations, we translated these utterances as faithfully as possible, despite the resulting incoherences in the target language, without adding additional information.

Example 4.

    FR: D’ailleurs il est l’heure de mon patient [male] suivant.

    Ref: Besides, it’s time for my next patient.

    FR: Ou plutôt, de ma patiente [female] suivante, d’ailleurs.

    Ref: Or patient should I say.

Example 5.

    EN: I can’t stop thinking about ice-cream...

    MT: quoi que je fasse , je ne peux pas m’empêcher de penser à une glace...

    FR: Pensez vous en permanence à la glace qui se mange ?

    Ref: Do you always think about ice cream that’s eaten?

    FR: ou bien à une glace pour se regarder ?

    Ref: Or about a mirror to look into?

4.3.1.3 Self-correction

Mistakes in the original utterance may be corrected by the speaker in subsequent utterances (e.g. a poor choice of word or a spelling mistake). In theory, we would want the translation of the erroneous utterance to reflect what was initially intended by the speaker, where this can be safely guessed (i.e. a model that is robust to noisy source sentences). However, this means that a second utterance whose sole purpose is to correct the first would be superfluous if the translation of the first sentence does not contain an error. An example of a spelling error and its correction is given in Example 6. As with the ambiguity cases, we choose to provide reference translations that correct the errors when they are evident, rather than injecting artificial mistakes into the translations, as the types of errors injected would be decided arbitrarily.

Example 6.

    EN: ...to do uor jobs...

    ...pour faire notre travail...

    EN: Typo: ...to do our jobs....

    Typo: ...pour faire notre travail...

Self-corrections were infrequent, with only 2 instances of explicit indication as in Example 6 (using the word typo), but it is a situation that is likely to arise in such written dialogue scenarios.

4.3.1.4 Meta-discussions

Although not common, mistranslations by the MT models occasionally led to coherence problems in the dialogue, which in turn led to meta-discussions about what was originally intended. These meta-discussions may or may not contain elements of the mistranslation, which must then be translated back into the other language. In Example 7, the utterance Ou à la limite thaï was poorly translated by the MT system as the Thai limit rather than at a push Thai. The resulting mistranslation is repeated by the English speaker in What do you mean by the Thai limit?. Ideally, we would want to indicate to the original French speaker that there was a translation problem and therefore not translate back into French using the original expression à la limite thaï. We therefore choose to translate the English sentence as it would most likely have been understood by the English speaker, resulting in a French translation that differs from the original term used.

Example 7.

    Tu connais un restau indonésien?

    Do you know an Indonesian restaurant?

    Ou à la limite thaï?

    Or at a push Thai? (MT: Or the Thai limit)

    What do you mean by the Thai limit?

    Qu’est-ce que tu veux dire par la limite thaïlandaise?

Of the translation difficulties mentioned here, meta-discussions are the most frequent. For example, there are 9 instances of I mean and 26 instances of do you mean related to meta-discussions in the original English sentences.

4.4 Human judgments of MT quality

As described in Sect. 3, participants evaluated the translations from a monolingual (target language) perspective. We provide a preliminary analysis of these judgments to show that they can be used to distinguish the MT quality of different models and that such a protocol is useful for comparing MT models in a setting in which they are used to mediate human communication.

4.4.1 Overall MT quality

Although an in-depth linguistic analysis is beyond the scope of this paper, we look here at overall trends in the evaluation. Figure 5 illustrates the overall differences between the translations of the two models, compared according to the three sentence-level quality labels perfect, medium and poor.

Fig. 5

Percentage of sentences for each language direction and model type marked as perfect/medium/poor by participants

The results show, unsurprisingly, that MT quality is dependent on the language direction; translation into English is perceived as better than into French, with approximately half of all EN→FR sentences being annotated as medium or poor. There is little difference in perceived quality between the baseline and 2to2 models for FR→EN. This contrasts with EN→FR, for which the number of sentences marked as perfect is higher by 4 absolute percentage points for 2to2 than for baseline. An automatic evaluation with BLEU (Papineni et al. 2002) shows that the contextual model scores slightly better than the baseline, particularly for EN→FR. We retranslate all sentences with both models and compare the outputs to the reference translations: for FR→EN, the 2to2 model scores 31.34 (compared to 31.15 for the baseline), and for EN→FR, 2to2 scores 31.60 (compared to 30.99 for the baseline). These scores reflect the tendencies seen in the human evaluations: a smaller relative difference between the two models for FR→EN and a greater difference for EN→FR. For both language directions, 2to2 obtains a higher BLEU score, which is reflected in the smaller percentage of sentences perceived as poor compared to baseline. As for the difference in quality between the two language directions noted by the human participants, BLEU scores cannot offer such insights, since they cannot be compared across languages and across different sets of sentences.
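Corpus-level BLEU scores of this kind can be computed with a standard toolkit on the distributed test set. The sketch below uses the sacrebleu Python package; the file names are placeholders, and the exact tokenisation settings used for the scores reported above are an assumption rather than something documented here.

```python
import sacrebleu

def corpus_bleu_from_files(hyp_path: str, ref_path: str) -> float:
    """Compute corpus-level BLEU for one system output against one reference."""
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hypotheses = [line.strip() for line in h]
        references = [line.strip() for line in r]
    assert len(hypotheses) == len(references)
    # sacrebleu expects a list of reference streams (one per reference set).
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Hypothetical usage: compare baseline and 2to2 retranslations for EN->FR.
for system in ("baseline", "2to2"):
    score = corpus_bleu_from_files(f"{system}.en-fr.hyp", "reference.en-fr.txt")
    print(f"{system}: BLEU = {score:.2f}")
```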

4.4.2 Types of errors encountered

The comparison of the breakdown of problem types for each model and language direction is shown in Fig. 6.

Fig. 6

Percentage of all sentences for each language direction and model type marked as containing each problem type (a sentence can have several problems). Bars are cumulative and show the percentage for sentences marked as medium (orange) and poor (red)

The small number of problems classed as other indicates that our categorisation of MT errors was sufficiently well chosen. The most salient errors for both language directions and models are related to word choice, especially when translating into French, with approximately 16% of sentences deemed to contain a word choice error. As with the overall evaluations, there are few differences between baseline and 2to2 for FR→EN, but some differences can be seen for EN→FR, where the 2to2 model performs better: there are fewer errors for most problem types, except word choice. However, the only error types for which the difference is statistically significant [according to a Fisher exact test (Fisher 1922), based on the presence or absence of each error type for each model type] are style (p ≤ 0.1) and coherence (p ≤ 0.01). The lower frequency of coherence-related errors for 2to2 is particularly notable. Coherence errors also appear to be less serious, as a lower percentage of the corresponding translations are labelled as poor as opposed to medium. These results are encouraging, as they show that our data collection method is a viable way to collect human judgments, and that such judgments can reveal fine-grained differences between MT systems, even when evaluating on different sentence sets.
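A significance test of this kind can be reproduced with a standard implementation. The sketch below uses scipy.stats.fisher_exact on the 2×2 contingency table involved (counts of sentences with and without a given error type, per model type); the counts in the usage example are placeholders, not figures from the corpus.

```python
from scipy.stats import fisher_exact

def error_type_significance(n_error_baseline: int, n_total_baseline: int,
                            n_error_2to2: int, n_total_2to2: int) -> float:
    """Two-sided Fisher exact test on the presence/absence of an error type
    in sentences translated by each model type."""
    table = [
        [n_error_baseline, n_total_baseline - n_error_baseline],
        [n_error_2to2, n_total_2to2 - n_error_2to2],
    ]
    _, p_value = fisher_exact(table, alternative="two-sided")
    return p_value

# Placeholder counts for illustration only.
p = error_type_significance(n_error_baseline=120, n_total_baseline=1000,
                            n_error_2to2=85, n_total_2to2=1000)
print(f"p = {p:.4f}")
```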

4.4.3 Global participant feedback

In spite of the errors, the translation quality is in general good, especially into English, and participant feedback is excellent concerning intelligibility and dialogue flow. As well as sentence-level judgments, participants indicated overall MT quality once the dialogue was complete. Participants indicated that they would use such a system to communicate with a speaker of another language 89% of the time. In 81% of dialogues, grammaticality was marked as either good or excellent. Coherence, style and meaning were all indicated as being good or excellent between 76 and 79% of the time. As a confirmation of the sentence-level evaluations, word choice was the most problematic error type, indicated in only 56% of dialogues as being good or excellent (40% of dialogues had average word choice, leaving a very small percentage in which it was perceived as poor). There were few differences seen between the two model types for these coarse-grained evaluations. One notable difference was seen for style for EN→FR, where 2to2 scores better than the baseline. For baseline, style is marked as average for more dialogues than 2to2 (38% vs. 18%) and as good for fewer dialogues (46% vs. 65%).

4.4.4 Examples of errors and the challenges they pose for the MT of bilingual dialogues

It is important to understand the potential problems that using MT for real-life conversations can present, particularly in terms of misunderstandings resulting in a negative social impact. Some of these problems can be flagged up in our corpus, and we hope that they can be analysed further in the future.

Certain MT errors are easily identifiable, such as in Example 8, where an item has been repeated until the maximum output length is reached. However, the most damaging mistakes are those that go undetected. Example 9 illustrates how a mistranslation can have quite serious consequences: the translation has the opposite meaning to the intended utterance, and the English speaker therefore understands something inappropriate from the French speaker’s original utterance.

Example 8.

    EN: If I check out some cocktail recipes and I’ll buy all the mixers, fruits, mint, lemon, lime etc.

    MT: Si je prends des recettes de cocktail et que j’achète toutes les mixers, fruits, menthe, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, citron, ...

    FR: Alors, je trouve que ça fait beaucoup de citron, mais sinon pas de problème.

    Ref: Well, I think that’s quite a lot of lemon, but otherwise no problem.

Example 9.

    FR: De toute façon ton patron a toujours été antipathique !

    MT: Anyway, your boss has always been fond of you.

    Ref: In any case your boss has always been unpleasant!

    EN: Really? I mean he’s twice my age. And that would be inappropriate...

4.5 Focus on a dialogue-level phenomenon

We study one specific discursive phenomenon: the consistent use of the French pronouns tu and vous. The French translation of singular you is ambiguous between tu (informal) and vous (formal). Their inconsistent use was one of the problems most frequently commented on by French speakers, and a strategy for controlling this choice has previously been suggested for this reason (Sennrich et al. 2016a). Neither of our models explicitly handles this choice, although 2to2 does take into account pairs of consecutive sentences, and could therefore be expected to use the pronouns more consistently across neighbouring sentences. As a proxy for the models’ ability to account for lexical cohesion, we look at their ability to ensure consistent translation of these pronouns across consecutive sentences. For each model, we take translated sentences in which tu or vous appears and for which the previous sentence also contains either tu or vous. By counting how often the current sentence contains the same pronoun as the previous sentence, we can estimate the degree of translation consistency for this particular aspect (a sketch of this count is given after Table 3). The results are shown in Table 3 for both MT models and also for the reference translations. The reference translations show a very high level of consistency in the use of pronouns in consecutive sentences, showing that in most scenarios we expect the same pronoun to be used by both speakers. Comparing the two MT models, although the absolute figures are too low to establish statistical significance, we can see a general trend: the 2to2 model shows greater consistency in the use of the pronouns than the baseline model, with +10% in the consistent use of tu and +6% in the consistent use of vous.

Table 3 For each model, the number of times the current sentence is translated with tu or vous when the previous sentence also contains either of these forms
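A minimal sketch of this consistency count is given below. It assumes the translations are available as an ordered list of target-side sentences per dialogue; the simple token matching (including the elided form t' for tu) and the tie-breaking when both pronouns occur in one sentence are illustrative assumptions rather than the exact procedure used for Table 3.

```python
import re
from typing import Dict, Iterable, List, Optional

TU_RE = re.compile(r"\btu\b|\bt'", re.IGNORECASE)
VOUS_RE = re.compile(r"\bvous\b", re.IGNORECASE)

def pronoun(sentence: str) -> Optional[str]:
    """Return 'tu' or 'vous' if the sentence contains such a pronoun, else None."""
    if TU_RE.search(sentence):
        return "tu"
    if VOUS_RE.search(sentence):
        return "vous"
    return None

def consistency_counts(dialogue: Iterable[str]) -> Dict[str, List[int]]:
    """For consecutive sentence pairs that both contain tu/vous, count how
    often the current sentence repeats the previous sentence's pronoun.

    Returns, for each pronoun, [consistent pairs, total pairs]."""
    counts = {"tu": [0, 0], "vous": [0, 0]}
    prev = None
    for sentence in dialogue:
        curr = pronoun(sentence)
        if prev is not None and curr is not None:
            counts[curr][1] += 1
            if curr == prev:
                counts[curr][0] += 1
        prev = curr
    return counts
```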

5 Conclusion and future work

The corpus presented in this article is an original and useful resource for the evaluation of the MT of dialogues and of dialogue in general. It provides a basis for many future research studies on the interaction between MT and humans: how effective communication can be when using MT systems, how MT systems must adapt to real-life human behaviour, and how humans handle communication errors. The corpus was designed to reflect three main characteristics:

  1. MT-mediated dialogues;

  2. a wide range of non-scripted topics;

  3. fine-grained sentence-level human judgments of MT quality.

Although there exist corpora that satisfy some of these characteristics, our corpus is unique in displaying all three. As described previously, the pre-existing MT-mediated dialogue corpora are typically restricted to specific scenarios, often focused on a structured task. The relatively unrestricted nature of our corpus allows participants to be freer in their language. We also design the dialogues such that the participants do not have access to the original source sentences of the other participant, testing in a realistic setting how well MT models can be used for mediation. This makes it possible to collect human judgments of MT quality that are based on the participants’ ability to spot errors in the continuity of a dialogue situation. The pre-existing corpora do not contain such fine-grained judgments of quality. Certain corpora contain judgments [e.g. the Field Experiment Data of Takezawa et al. (2007)], but only at a global level rather than sentence by sentence. The rich feedback on the translation quality of the MT models evaluated in our corpus makes it a useful tool for both evaluation and analysis of this setup.

We compared the use of two different model types in the collection of dialogues: a baseline NMT model that translated sentences independently of each other, and a lightly contextual model that translated sentences in the context of the previous sentence. Our analysis of the sentence-level human quality judgments for the two MT models revealed interesting differences between the two model types. Whereas there was little difference between the models for FR→EN, for which both models were deemed to produce fewer errors overall, greater differences could be seen for EN→FR. There were notably 4% more sentences considered to be of perfect quality for the contextual model, and there was a reduction in the number and severity of coherence errors, indicating that the contextual nature of the model could be improving translation coherence.

We intend to extend the English–French corpus further in future work and to annotate it with discourse-level information, which will pave the way for future phenomenon-specific evaluation: how individual phenomena are handled by different MT systems and how they are evaluated by the participants. In this direction, we have manually annotated anaphoric phenomena in 27 dialogues (anaphoric pronouns, event coreference, possessives, etc.). Despite the small size of this sample, it already displays interesting characteristics, which could provide a strong basis for future work. Anaphoric references are common in the annotated sample: 250 anaphoric pronouns, 34 possessive pronouns, and 117 instances of event coreference. Their incorrect translation was often a cause of communication problems (see Example 10, in which the French pronoun il is poorly translated as he rather than the inanimate it), the impact of which will be investigated further.

Example 10.

    FR: Je peux m’allonger sur ce canapé?

    MT: Can I lie on this couch?

    Ref: Can I lie down on the sofa?

    FR: Je ne veux pas déranger, il a l’air propre et neuf

    MT: I don’t want to bother, he looks clean and new...

    Ref: I don’t want to be any bother. It looks clean and new...

The protocol presented for collecting data and MT quality judgments provides a useful framework for future evaluation of MT quality. Our preliminary analyses of the sentence-level human judgments show that the evaluation procedure is viable, and we have observed interesting differences between the two types of MT model used in our experiments, providing information complementary to automatic evaluation metrics. The same protocol could be applied to new MT models, extended to include more scenarios, and adapted to other language pairs as part of a larger, international effort.