Exploiting languages proximity for part-of-speech tagging of three French regional languages

  • Original Paper
Language Resources and Evaluation

Abstract

This paper presents experiments in part-of-speech tagging of low-resource languages. It addresses the case where neither labeled data in the target language nor a parallel corpus is available. We rely only on the proximity of the target language to a better-resourced language. We conduct experiments on three French regional languages and exploit this proximity with two main strategies: delexicalization and transposition. The general idea is to train a model on the (better-resourced) source language and then apply it to the (regional) target language. Delexicalization deals with the difference in vocabulary by creating abstract representations of the data. Transposition consists of modifying the target corpus so that the source models can be used on it. We compare several methods and propose different strategies to combine them and improve on the state of the art in part-of-speech tagging in this difficult scenario.
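The two strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set in `delexicalize`, the helper `transpose`, and the toy Alsatian-to-German lexicon are all assumptions introduced here for illustration only.

```python
import re


def delexicalize(token: str, suffix_len: int = 3) -> dict:
    """Map a token to vocabulary-independent features.

    The feature set (shape, suffix, length, capitalization) is a
    hypothetical choice, not the one used in the paper.
    """
    # Replace every letter with X (uppercase) or x (lowercase).
    shape = re.sub(
        r"[^\W\d_]",
        lambda m: "X" if m.group().isupper() else "x",
        token,
    )
    # Replace every digit with d, then collapse repeated symbols.
    shape = re.sub(r"\d", "d", shape)
    shape = re.sub(r"(.)\1+", r"\1", shape)
    return {
        "shape": shape,
        "suffix": token[-suffix_len:].lower(),
        "length": min(len(token), 10),
        "is_capitalized": token[:1].isupper(),
    }


def transpose(tokens: list[str], lexicon: dict[str, str]) -> list[str]:
    """Rewrite target-language tokens into source-language forms.

    Tokens missing from the (hypothetical) lexicon are left unchanged,
    so a tagger trained on the source language sees as many known
    forms as possible.
    """
    return [lexicon.get(tok.lower(), tok) for tok in tokens]


# Toy lexicon, assumed for illustration only.
toy_lexicon = {"esch": "ist", "fàscht": "fast"}

features = delexicalize("Strasbourg")
rewritten = transpose(["Es", "esch", "fàscht"], toy_lexicon)
```

In this sketch, delexicalization changes what the model sees on both sides (features instead of word forms), whereas transposition leaves the source model untouched and edits only the target text.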

Notes

  1. http://restaure.unistra.fr/en/presentation/.

  2. http://als.wikipedia.org.

  3. The Alemannic Wikipedia contains articles written in several dialects from the Alemannic linguistic area.

  4. http://redac.univ-tlse2.fr/bateloc/.

  5. http://www.u-picardie.fr/LESCLaP/PICARTEXT/Public/.

  6. At the time of the experiments, the Occitan corpus was much smaller than those for Alsatian and Picard because an annotation phase was still ongoing; it is now the same size as the others.

  7. https://ufal.mff.cuni.cz/w2c.

  8. The numbers differ slightly from Table 1 because these counts include a part of the corpus, consisting only of Picard dictionaries, that was excluded from the POS tagging experiments.

  9. https://github.com/yuanzh/transfer_pos.

  10. https://github.com/yuanzh/transfer_pos.

  11. https://github.com/facebookresearch/fastText.

  12. https://spark.apache.org/docs/2.2.0/mllib-ensembles.html#random-forests.

  13. The typical way to stop such a procedure is to rely on an annotated development set. This would take us away from our initial scenario, as it would mean that annotated data is available for the target language. Another issue with this strategy is the small size of the development set relative to the substantial orthographic variation in our data. Finding a proper sampling method to build an appropriate development set is a research question in itself and is outside the scope of the present paper.

Acknowledgements

This work was supported by the French National Research Agency (ANR) under the RESTAURE project (ANR-14-CE24-0003-01). We also thank our colleagues from the RESTAURE project for their help in describing the languages and corpora.

Author information

Corresponding author

Correspondence to Anne-Laure Ligozat.

About this article

Cite this article

Magistry, P., Ligozat, AL. & Rosset, S. Exploiting languages proximity for part-of-speech tagging of three French regional languages. Lang Resources & Evaluation 53, 865–888 (2019). https://doi.org/10.1007/s10579-019-09463-7

  • DOI: https://doi.org/10.1007/s10579-019-09463-7
