Abstract
The primary objective of our work is to build a large-scale English–Thai dataset for training neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai machine translation dataset with over 1 million segment pairs, curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and text artificially generated by a pretrained language model. We present the methods for gathering data, aligning texts, and automatically removing preprocessing noise and translation errors. We also train machine translation models on this dataset to assess the quality of the corpus. Our models perform comparably to the Google Translation API (as of May 2020) for Thai–English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai–English and English–Thai translation. The dataset is available for public use under the CC-BY-SA 4.0 License. The pre-trained models and source code to reproduce our work are available under the Apache-2.0 License.
Notes
The source code and thresholds used for the preprocessing can be found at: https://github.com/vistec-AI/thai2nmt_preprocess.
tatoeba.org.
tanzil.net.
The source code used for the experiments can be found at: https://github.com/vistec-AI/thai2nmt.
Acknowledgements
This investigation is partially supported by the Digital Economy Promotion Agency Thailand under the infrastructure project code MP-62-003, Siam Commercial Bank, the Special Task Force for Activating Research (STAR) Ratchadapiseksompoch Fund from Chulalongkorn University, and the Research Grant for New Scholars from the Thailand Research Fund (MRG6280175). We thank our data annotation partners Hope Data Annotations and Wang: Data Market; the Office of the National Economic and Social Development Council (NESDC), through Phannisa Nirattiwongsakorn, for providing government documents; Chonlapat Patanajirasit for training CRFCut sentence segmentation models on new datasets; Witchapong Daroontham for product review classification baselines; and Pined Laohapiengsak for helping with sentence alignment using the universal sentence encoder.
Cite this article
Lowphansirikul, L., Polpanumas, C., Rutherford, A.T. et al. A large English–Thai parallel corpus from the web and machine-generated text. Lang Resources & Evaluation 56, 477–499 (2022). https://doi.org/10.1007/s10579-021-09536-6
DOI: https://doi.org/10.1007/s10579-021-09536-6