Building a training dataset for classification under a cost limitation
The Electronic Library (IF 1.675), Pub Date: 2021-02-24, DOI: 10.1108/el-07-2020-0209
Yen-Liang Chen, Li-Chen Cheng, Yi-Jun Zhang

Purpose

Document classification requires a preprocessing step in which some documents are labeled, so that a classifier can be trained on them and then used to classify the remaining documents. Because documents differ in length and complexity, the cost of labeling differs from one document to the next. This paper considers how to select a subset of documents for labeling under a limited budget, so that the total labeling cost does not exceed the budget while the resulting classifier achieves the best possible classification results.
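
The selection problem described above can be written compactly as a budgeted optimization. The notation here is an assumption made for illustration and does not come from the paper: $D$ is the document collection, $c(d)$ the labeling cost of document $d$, $B$ the budget, $f_S$ a classifier trained on the labeled subset $S$, and $\mathrm{Acc}$ some measure of classification quality.

```latex
\max_{S \subseteq D} \; \mathrm{Acc}(f_S)
\quad \text{subject to} \quad \sum_{d \in S} c(d) \le B
```

Because $\mathrm{Acc}(f_S)$ cannot be evaluated before the documents are labeled, any practical method has to rely on a proxy for it, such as how well $S$ represents $D$.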

Design/methodology/approach

This paper proposes a framework for selecting the instances to label. The framework integrates two clustering algorithms and two centroid selection methods. Five different classifiers were then built from the selected and labeled instances, and their good classification accuracy was used to demonstrate the superiority of the selected instances.
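
The abstract does not name the two clustering algorithms or the two centroid selection methods, so the following is only a minimal sketch of the general idea under stated assumptions: plain k-means stands in for the clustering step, and "the affordable member closest to the cluster centroid" stands in for centroid selection. The function names `kmeans` and `select_for_labeling`, and all parameters, are hypothetical.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in, not the paper's algorithm)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    # Final assignment against the final centroids.
    return centroids, [nearest(p, centroids) for p in points]


def select_for_labeling(points, costs, budget, k, seed=0):
    """Pick at most one representative per cluster without exceeding `budget`."""
    centroids, assign = kmeans(points, k, seed=seed)
    sizes = [assign.count(j) for j in range(k)]
    chosen, spent = [], 0.0
    # Visit larger clusters first so they are covered before the budget runs out.
    for j in sorted(range(k), key=lambda j: -sizes[j]):
        members = [i for i, a in enumerate(assign) if a == j]
        # Most representative first: closest to the cluster centroid.
        members.sort(key=lambda i: math.dist(points[i], centroids[j]))
        for i in members:
            if spent + costs[i] <= budget:
                chosen.append(i)
                spent += costs[i]
                break
    return chosen, spent


# Toy data: two well-separated groups with unequal labeling costs.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
costs = [5, 1, 1, 5, 1, 1]
chosen, spent = select_for_labeling(points, costs, budget=3, k=2)
print(chosen, spent)
```

The sketch trades representativeness against cost in the simplest possible way, by skipping candidates that no longer fit in the budget. A weighted score combining distance and cost would be another reasonable choice.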

Findings

Experimental results show that the method can build a training data set containing the most suitable data while respecting the cost constraint. Because the selection accounts for both “data representativeness” and “data selection cost,” the training data labeled by experts can be used to build a classifier with high accuracy.

Originality/value

No previous research has considered how to establish a training set with a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.




Updated: 2021-02-24