Multi-label legal document classification: A deep learning-based approach with label-attention and domain-specific pre-training
Information Systems (IF 3.0), Pub Date: 2021-01-18, DOI: 10.1016/j.is.2021.101718
Dezhao Song, Andrew Vold, Kanika Madan, Frank Schilder

Multi-label document classification applies to a broad range of practical problems, such as news article topic tagging, sentiment analysis, and medical code classification. A variety of approaches (e.g., tree-based methods, neural networks, and deep learning systems built on pre-trained language models) have been developed for multi-label document classification and have achieved satisfactory performance on different datasets. In the legal domain, however, multi-label classification tasks pose several key challenges. One critical challenge is the lack of high-quality human-labeled datasets, which prevents researchers and practitioners from achieving decent performance on the respective tasks. In addition, existing multi-label classification methods typically focus on the majority classes, which results in unsatisfactory performance on other important classes that lack sufficient training samples. To tackle these challenges, we first present POSTURE50K, a novel legal extreme multi-label classification dataset, which we will release to the research community. The dataset contains 50,000 legal opinions and their manually labeled legal procedural postures. Labels in this dataset follow a Zipfian distribution, leaving many of the classes with only a few samples. Furthermore, we propose a deep learning architecture that adopts domain-specific pre-training and a label-attention mechanism for multi-label document classification. We evaluate the proposed architecture on POSTURE50K and another legal multi-label dataset, EUROLEX57K, and show that it outperforms two baseline systems and four other recent state-of-the-art methods on both datasets.
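
The abstract names a label-attention mechanism but does not describe its internals. As a rough illustration only, the sketch below shows one common form of label-wise attention: each label has its own query vector, attends over the token representations of a document, and scores the resulting label-specific summary with a sigmoid, so several labels can be active at once. The function names, the dot-product scoring, and the per-label linear output are all assumptions made for illustration, not the authors' architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def label_attention(tokens, label_queries, label_weights, label_biases):
    """Hypothetical label-wise attention head (illustrative, not the paper's).

    tokens:        list of n token vectors, each of dimension d
    label_queries: one d-dimensional query vector per label
    label_weights: one d-dimensional output weight vector per label
    label_biases:  one scalar bias per label
    Returns (probs, attn): a probability per label, and the attention
    weights over tokens for each label.
    """
    dim = len(tokens[0])
    probs, attn = [], []
    for query, weight, bias in zip(label_queries, label_weights, label_biases):
        # Attention of this label over every token in the document.
        weights = softmax([dot(query, tok) for tok in tokens])
        # Label-specific document summary: attention-weighted sum of tokens.
        context = [
            sum(weights[t] * tokens[t][j] for t in range(len(tokens)))
            for j in range(dim)
        ]
        # Independent sigmoid per label, as is usual in multi-label setups.
        logit = dot(weight, context) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
        attn.append(weights)
    return probs, attn
```

Because every label builds its own summary of the document, a rare label can focus on the few tokens that signal it rather than sharing one pooled representation with the frequent labels, which is the usual motivation for this design on long-tailed label sets.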




Updated: 2021-01-19