Evaluating Document Representations for Content-based Legal Literature Recommendations,arXiv - CS - Information Retrieval

Evaluating Document Representations for Content-based Legal Literature Recommendations
arXiv - CS - Information Retrieval, Pub Date: 2021-04-28, DOI: arxiv-2104.13841
Malte Ostendorff, Elliott Ash, Terry Ruas, Bela Gipp, Julian Moreno-Schneider, Georg Rehm

Recommender systems assist legal professionals in finding relevant literature to support their cases. Despite their importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. At the same time, legal recommender systems are typically evaluated in small-scale user studies without any publicly available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare 27 methods in total using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license, facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations. We make our source code, models, and datasets publicly available at https://github.com/malteos/legal-document-similarity/.
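As a rough illustration of the ideas named in the abstract, the sketch below builds a document vector by averaging word vectors, combines it with a citation embedding, and ranks other documents by cosine similarity. It is not the authors' implementation (see their repository for that). The abstract does not say how the hybrid combination is done, so concatenating the two L2-normalized vectors is an assumption here, as is treating a Poincaré embedding as an ordinary Euclidean vector. All function names, the toy word vectors, and the document IDs are hypothetical.

```python
import math

def average_vectors(tokens, word_vectors):
    """Mean of the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token has a word vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def hybrid_vector(text_vec, citation_vec):
    """Assumed hybrid: concatenate the normalized text and citation vectors."""
    return l2_normalize(text_vec) + l2_normalize(citation_vec)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(seed_id, doc_vectors, k=5):
    """Return the k documents most similar to the seed, best first."""
    seed = doc_vectors[seed_id]
    scored = [(cosine(seed, vec), doc_id)
              for doc_id, vec in doc_vectors.items() if doc_id != seed_id]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy stand-ins for fastText word vectors and citation embeddings (made up).
word_vectors = {
    "court": [1.0, 0.0], "appeal": [0.8, 0.2],
    "patent": [0.0, 1.0], "invention": [0.1, 0.9],
}
docs = {
    "case_a": ["court", "appeal"],
    "case_b": ["appeal", "court", "court"],
    "case_c": ["patent", "invention"],
}
citation_vectors = {
    "case_a": [0.1, 0.2], "case_b": [0.1, 0.25], "case_c": [-0.3, 0.1],
}

doc_vectors = {
    doc_id: hybrid_vector(average_vectors(tokens, word_vectors),
                          citation_vectors[doc_id])
    for doc_id, tokens in docs.items()
}

print(recommend("case_a", doc_vectors, k=2))  # case_b ranks first in this toy setup
```

With real models, the toy dictionaries would be replaced by vectors from a trained fastText model and a trained citation-graph embedding. Hyperbolic embeddings are normally compared with a hyperbolic distance rather than cosine similarity, so the Euclidean treatment above is only a simplification for illustration.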

Updated: 2021-04-29
