A large English–Thai parallel corpus from the web and machine-generated text

Lowphansirikul, Lalita; Polpanumas, Charin; Rutherford, Attapol T.; Nutanong, Sarana

doi:10.1007/s10579-021-09536-6

Original Paper
Published: 30 March 2021

Language Resources and Evaluation, Volume 56, pages 477–499 (2022)

Lalita Lowphansirikul¹,
Charin Polpanumas²,
Attapol T. Rutherford³,⁴ (ORCID: 0000-0003-2270-6082) &
Sarana Nutanong¹

Abstract

The primary objective of our work is to build a large-scale English–Thai dataset for training neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai machine translation dataset with over 1 million segment pairs, curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and text artificially generated by a pretrained language model. We present the methods for gathering data, aligning texts, and automatically removing preprocessing noise and translation errors. We also train machine translation models on this dataset to assess the quality of the corpus. Our models perform comparably to the Google Translation API (as of May 2020) on Thai–English translation, and outperform it on both Thai–English and English–Thai translation when the Open Parallel Corpus (OPUS) is included in the training data. The dataset is available for public use under the CC-BY-SA 4.0 license. The pre-trained models and the source code to reproduce our work are available under the Apache-2.0 license.

Notes

http://www.casmacat.eu/corpus/news-commentary.html.
http://opus.nlpl.eu/Tanzil.php.
http://opus.nlpl.eu/Ubuntu.php.
http://opus.nlpl.eu/KDE4.php.
http://opus.nlpl.eu/GNOME.php.
https://github.com/vistec-AI/dataset-releases/releases/tag/scb-mt-en-th-2020_v1.0.
https://github.com/vistec-AI/model-releases/releases/tag/SCB_1M+TBASE_v1.0.
https://voice.mozilla.org/en.
https://www.alexa.com/topsites/countries/TH.
https://tika.apache.org/.
See https://github.com/vistec-AI/pdf2parallel.
The source code and thresholds used for the preprocessing can be found at: https://github.com/vistec-AI/thai2nmt_preprocess.
tatoeba.org.
tanzil.net.
The source code used for the experiments can be found at: https://github.com/vistec-AI/thai2nmt.
https://www.aiforthai.in.th.

Acknowledgements

This investigation is partially supported by the Digital Economy Promotion Agency Thailand under the infrastructure project code MP-62-003, Siam Commercial Bank, the Special Task Force for Activating Research (STAR) Ratchadapiseksompoch Fund from Chulalongkorn University, and the Research Grant for New Scholars from the Thailand Research Fund (MRG6280175). We thank our data annotation partners Hope Data Annotations and Wang: Data Market; the Office of the National Economic and Social Development Council (NESDC), through Phannisa Nirattiwongsakorn, for providing government documents; Chonlapat Patanajirasit for training CRFCut sentence segmentation models on new datasets; Witchapong Daroontham for product review classification baselines; and Pined Laohapiengsak for helping with sentence alignment using the universal sentence encoder.

Author information

Authors and Affiliations

School of Information Science and Technology, Vidyasirimedhi Institution of Science and Technology, Rayong, Thailand
Lalita Lowphansirikul & Sarana Nutanong
PyThaiNLP, Bangkok, Thailand
Charin Polpanumas
Department of Linguistics, Chulalongkorn University, Bangkok, Thailand
Attapol T. Rutherford
Teaching and Learning Thai as a Foreign Language Group, Bangkok, Thailand
Attapol T. Rutherford

Corresponding author

Correspondence to Attapol T. Rutherford.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lowphansirikul, L., Polpanumas, C., Rutherford, A.T. et al. A large English–Thai parallel corpus from the web and machine-generated text. Lang Resources & Evaluation 56, 477–499 (2022). https://doi.org/10.1007/s10579-021-09536-6

Accepted: 02 March 2021
Published: 30 March 2021
Issue Date: June 2022
DOI: https://doi.org/10.1007/s10579-021-09536-6
