
A large English–Thai parallel corpus from the web and machine-generated text

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

The primary objective of our work is to build a large-scale English–Thai dataset for training neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai machine translation dataset with over 1 million segment pairs, curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and text artificially generated by a pretrained language model. We present methods for gathering data, aligning texts, and automatically removing preprocessing noise and translation errors. To assess the quality of the corpus, we also train machine translation models on this dataset. Our models perform comparably to the Google Translation API (as of May 2020) for Thai–English translation, and they outperform it in both directions (Thai–English and English–Thai) when the Open Parallel Corpus (OPUS) is included in the training data. The dataset is available for public use under the CC-BY-SA 4.0 License. The pre-trained models and the source code to reproduce our work are available under the Apache-2.0 License.
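As a rough illustration of what automatic noise removal on segment pairs can look like, the sketch below keeps a pair only if the Thai side is mostly Thai script and the two sides have comparable character lengths. This is not the authors' pipeline: the function names, the Thai-script ratio of 0.5, and the length ratio of 3.0 are hypothetical values chosen for the example, and the actual rules and thresholds are those in the preprocessing repository listed in the Notes.

```python
import re

# Thai Unicode block (U+0E00 to U+0E7F).
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")


def thai_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Thai script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if THAI_CHAR.match(c)) / len(chars)


def keep_pair(en: str, th: str,
              min_thai: float = 0.5,        # hypothetical threshold
              max_len_ratio: float = 3.0    # hypothetical threshold
              ) -> bool:
    """Return True if an (English, Thai) segment pair passes both checks."""
    en, th = en.strip(), th.strip()
    if not en or not th:
        return False  # drop pairs with an empty side
    if thai_ratio(th) < min_thai:
        return False  # Thai side is mostly not Thai script
    # Character-length ratio between the longer and the shorter side.
    ratio = max(len(en), len(th)) / min(len(en), len(th))
    return ratio <= max_len_ratio


def filter_pairs(pairs):
    """Keep only the pairs that pass keep_pair."""
    return [(en, th) for en, th in pairs if keep_pair(en, th)]
```

Simple heuristics of this kind only catch gross misalignments and untranslated segments; catching genuine translation errors needs a semantic signal, such as a multilingual sentence encoder.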

Notes

  1. http://www.casmacat.eu/corpus/news-commentary.html.

  2. http://opus.nlpl.eu/Tanzil.php.

  3. http://opus.nlpl.eu/Ubuntu.php.

  4. http://opus.nlpl.eu/KDE4.php.

  5. http://opus.nlpl.eu/GNOME.php.

  6. https://github.com/vistec-AI/dataset-releases/releases/tag/scb-mt-en-th-2020_v1.0.

  7. https://github.com/vistec-AI/model-releases/releases/tag/SCB_1M+TBASE_v1.0.

  8. https://voice.mozilla.org/en.

  9. https://www.alexa.com/topsites/countries/TH.

  10. https://tika.apache.org/.

  11. See https://github.com/vistec-AI/pdf2parallel.

  12. The source code and thresholds used for the preprocessing can be found at: https://github.com/vistec-AI/thai2nmt_preprocess.

  13. tatoeba.org.

  14. tanzil.net.

  15. The source code used for the experiments can be found at: https://github.com/vistec-AI/thai2nmt.

  16. https://www.aiforthai.in.th.

Acknowledgements

This investigation is partially supported by the Digital Economy Promotion Agency Thailand under infrastructure project code MP-62-003, Siam Commercial Bank, the Special Task Force for Activating Research (STAR) Ratchadapiseksompoch Fund from Chulalongkorn University, and a Research Grant for New Scholars from the Thailand Research Fund (MRG6280175). We thank our data annotation partners, Hope Data Annotations and Wang: Data Market; the Office of the National Economic and Social Development Council (NESDC), through Phannisa Nirattiwongsakorn, for providing government documents; Chonlapat Patanajirasit for training CRFCut sentence segmentation models on new datasets; Witchapong Daroontham for product review classification baselines; and Pined Laohapiengsak for helping with sentence alignment using the universal sentence encoder.

Author information

Corresponding author

Correspondence to Attapol T. Rutherford.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lowphansirikul, L., Polpanumas, C., Rutherford, A.T. et al. A large English–Thai parallel corpus from the web and machine-generated text. Lang Resources & Evaluation 56, 477–499 (2022). https://doi.org/10.1007/s10579-021-09536-6

  • DOI: https://doi.org/10.1007/s10579-021-09536-6
