Automatic detection and correction of discourse marker errors made by Spanish native speakers in Portuguese academic writing

Sepúlveda-Torres, Lianet; Sanches Duran, Magali; Aluísio, Sandra Maria

doi:10.1007/s10579-019-09467-3

Published: 06 May 2019

Language Resources and Evaluation
Volume 53, pages 525–558 (2019)

Abstract

Discourse markers are words and expressions (such as firstly, then, for example, because, as a result, likewise, in comparison, in contrast) that explicitly state the relational structure of the information in a text, for instance by signalling a sequential relationship between the current message and the preceding discourse. Using these markers improves the cohesion and coherence of texts and facilitates reading comprehension. Although they are often included in tools that support the rhetorical structuring of texts, discourse markers have hardly been explored in writing support tools for second-language learners. Yet such learners, including those at advanced levels, have trouble producing these lexical items: they frequently replace them with items from their native language or with literal translations of native-language items, which often do not yield proper lexical items in the second language. In addition, students tend to learn a single marker per function and use it repeatedly, producing monotonous texts. To help reduce these difficulties, this paper presents a lexicon that will be used to support the task of automatically detecting and correcting discourse marker errors. Several heuristics were evaluated to generate different types of errors, and automatic translation methods were used to compile, semi-automatically, the lexicon used in these heuristics. Similarity measures were also combined with these heuristics to correct discourse marker errors. The evaluated methods proved suitable for identifying some types of discourse marker errors and can potentially identify many others, provided that new lexical inputs are incorporated into them.
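The general idea of combining a lexicon lookup with a similarity measure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon entries, the marker list, the function names and the 0.75 threshold are all hypothetical, and the standard-library `difflib.SequenceMatcher` ratio stands in for whichever similarity measures the paper actually uses, which this abstract does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: Spanish markers a learner might carry over,
# mapped to a Portuguese equivalent. Illustrative entries only.
ERROR_LEXICON = {
    "sin embargo": "no entanto",
    "por lo tanto": "portanto",
    "además": "além disso",
}

# Hypothetical list of well-formed Portuguese discourse markers.
PT_MARKERS = ["no entanto", "portanto", "além disso", "por exemplo", "em contraste"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest(candidate: str, threshold: float = 0.75):
    """Return a suggested Portuguese marker, or None if nothing applies."""
    key = candidate.lower().strip()
    if key in PT_MARKERS:
        return None  # already a known Portuguese marker
    if key in ERROR_LEXICON:
        return ERROR_LEXICON[key]  # direct lexicon hit
    # Fall back to the most similar known marker, if it is similar enough.
    best = max(PT_MARKERS, key=lambda marker: similarity(key, marker))
    return best if similarity(key, best) >= threshold else None


if __name__ == "__main__":
    for phrase in ["sin embargo", "por ejemplo", "portanto"]:
        print(phrase, "->", suggest(phrase))
```

In this sketch the lexicon handles errors whose surface form differs substantially from the target marker, while the similarity fallback catches near-miss forms that the lexicon does not list.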

Notes

https://github.com/liseli/HABLA.
A comprehensive list of learner corpora (written and spoken) produced by foreign language learners is available at https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html.
The Espanhol-Acadêmico-Br corpus can be made available to researchers upon request to the authors.
http://www.statmt.org/moses/giza/GIZA++.html.
http://www.statmt.org/moses/.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html.

Acknowledgements

We thank the Brazilian Science Foundation FAPESP for financial support. Many thanks to Aline Evers and Roana Rodrigues for their support during the evaluation process.

Author information

Authors and Affiliations

Interinstitutional Center for Computational Linguistics (NILC), Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Lianet Sepúlveda-Torres, Magali Sanches Duran & Sandra Maria Aluísio

Corresponding author

Correspondence to Lianet Sepúlveda-Torres.

About this article

Cite this article

Sepúlveda-Torres, L., Sanches Duran, M. & Aluísio, S.M. Automatic detection and correction of discourse marker errors made by Spanish native speakers in Portuguese academic writing. Lang Resources & Evaluation 53, 525–558 (2019). https://doi.org/10.1007/s10579-019-09467-3

Published: 06 May 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10579-019-09467-3
