From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection
arXiv - CS - Computation and Language. Pub Date: 2021-02-24. DOI: arxiv-2102.12162
Quang Huu Pham, Viet Anh Nguyen, Linh Bao Doan, Ngoc N. Tran, Ta Minh Thanh

Natural language processing is a fast-growing field of artificial intelligence. Since the Transformer was introduced by Google in 2017, a large number of language models, such as BERT and GPT, have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding tasks. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate problems with those datasets, such as a lack of training data and class imbalance. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT on our dataset by re-training the model on the Masked Language Model task; then, we employ its encoder for text classification. To preserve pre-trained weights while learning new feature representations, we further apply several training techniques: layer freezing, block-wise learning rates, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with a 0.7221 F1 score.
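To make the weight-preserving techniques named in the abstract concrete, below is a minimal PyTorch/Transformers sketch of the classification stage. It is an illustration, not the authors' released code: the model id "vinai/phobert-base", the number of frozen blocks, the learning rates and decay factor, the number of labels, and the smoothing factor are all assumptions.

import torch
from torch import nn
from transformers import AutoModel

# The pipeline's first step (re-training on the Masked Language Model task
# over the task corpus, e.g. via AutoModelForMaskedLM) would precede this;
# the adapted encoder is then reused for classification as sketched here.
encoder = AutoModel.from_pretrained("vinai/phobert-base")  # assumed model id

# Layer freezing: fix the embeddings and the lowest encoder blocks so their
# pre-trained weights are preserved while the new head learns.
for param in encoder.embeddings.parameters():
    param.requires_grad = False
for block in encoder.encoder.layer[:4]:  # bottom 4 of 12 blocks (assumed)
    for param in block.parameters():
        param.requires_grad = False

# A simple classification head on top of the encoder output.
num_labels = 3  # assumed label count for the task
classifier = nn.Linear(encoder.config.hidden_size, num_labels)

# Block-wise learning rates: deeper (more generic) blocks receive smaller
# updates than upper blocks and the freshly initialized head.
base_lr, decay = 2e-5, 0.9  # assumed values
num_blocks = len(encoder.encoder.layer)
param_groups = [{"params": classifier.parameters(), "lr": base_lr}]
for i, block in enumerate(encoder.encoder.layer):
    trainable = [p for p in block.parameters() if p.requires_grad]
    if trainable:  # frozen blocks contribute no trainable parameters
        param_groups.append(
            {"params": trainable, "lr": base_lr * decay ** (num_blocks - 1 - i)}
        )
optimizer = torch.optim.AdamW(param_groups)

# Label smoothing: soften the one-hot targets (built into PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing factor assumed

The geometric learning-rate decay gives the deepest blocks the smallest updates, which is the usual rationale for discriminative fine-tuning; freezing and smoothing likewise trade some plasticity for robustness on a small, imbalanced dataset.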

Updated: 2021-02-25