Exploiting partially-labeled data in learning predictive clustering trees for multi-target regression: A case study of water quality assessment in Ireland
Ecological Informatics (IF 5.8), Pub Date: 2020-10-01, DOI: 10.1016/j.ecoinf.2020.101161
Stevanche Nikoloski, Dragi Kocev, Jurica Levatić, David P. Wall, Sašo Džeroski

Many environmental problems give rise to predictive modeling tasks where several dependent variables need to be predicted simultaneously from a given set of independent variables. When the target variables are numeric, the task at hand is called multi-target regression (MTR). An example task of this type is the assessment of the quality of agricultural waters in Ireland according to three indicators: biological water quality, nitrogen concentration and phosphorus concentration.

Multi-target regression models are typically learnt from labeled training examples, where the values of both the dependent variables (labels) and the independent variables are provided, in a setting known as supervised learning. Many different approaches to supervised multi-target regression have been developed, among which predictive clustering trees and ensembles thereof stand out due to their effectiveness and efficiency. Recently, these approaches have been extended to exploit not only labeled examples, but also unlabeled examples, where only the values of the independent variables are provided, a setting known as semi-supervised learning.

In practice, training data can also contain partially labeled examples, where the values of some of the dependent variables are provided and others are missing (in addition to fully labeled examples where all target values are provided and completely unlabeled examples where no target values are provided). For the task of water quality assessment in Ireland, we encounter this kind of partially labeled data. Existing supervised and semi-supervised MTR approaches typically ignore partially labeled data.
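The three kinds of examples described above can be told apart by checking which target values are present. The following is a minimal sketch, assuming missing target values are encoded as NaN; the function name `label_status` and the sample values are hypothetical and not taken from the paper.

```python
import math

def label_status(targets):
    """Classify one example by how many of its target values are present.

    Assumption: a missing target value is encoded as NaN.
    """
    missing = [math.isnan(t) for t in targets]
    if not any(missing):
        return "fully labeled"
    if all(missing):
        return "unlabeled"
    return "partially labeled"

# Hypothetical target vectors in the order
# (biological water quality, nitrogen, phosphorus); values are made up.
examples = [
    (4.0, 2.1, 0.03),                # all three targets observed
    (math.nan, 1.8, 0.05),           # one target missing
    (math.nan, math.nan, math.nan),  # no target observed
]
statuses = [label_status(t) for t in examples]
```

An approach that ignores partially labeled data would keep only the first of these three examples for supervised training.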

In this paper, we propose the use of semi-supervised predictive clustering trees for MTR that can handle partially labeled examples. We apply these to the task of assessment of water quality in Ireland, showing that better performance can be achieved if partially labeled examples are exploited, rather than discarded. We build both local models (collections of single-target models predicting each target separately) and global models (multi-target models simultaneously predicting all targets), showing that, compared to local models, global models are smaller, easier to interpret, less prone to overfitting, and better performing.
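One plausible way for a tree learner to exploit partially labeled examples is to compute each target's variance over the values that are actually observed, so that an example with some targets missing still contributes to the others. The sketch below illustrates that idea only; the abstract does not describe the authors' actual heuristic. The function names are hypothetical, NaN is assumed to mark a missing value, and the sketch omits any per-target normalization and any contribution from the independent variables that a semi-supervised variant might use.

```python
import math

def target_variance(values):
    """Population variance over the observed (non-NaN) values.

    Returns 0.0 when fewer than two values are observed.
    """
    observed = [v for v in values if not math.isnan(v)]
    if len(observed) < 2:
        return 0.0
    mean = sum(observed) / len(observed)
    return sum((v - mean) ** 2 for v in observed) / len(observed)

def multi_target_impurity(rows):
    """Average the per-target variances across all targets (illustrative only)."""
    columns = list(zip(*rows))
    return sum(target_variance(col) for col in columns) / len(columns)

# Made-up data with two targets; NaN marks a missing target value.
rows = [
    (1.0, 10.0),       # fully labeled
    (3.0, math.nan),   # partially labeled
    (math.nan, 14.0),  # partially labeled
]

# Exploiting partially labeled rows: both targets have two observed values.
impurity_all = multi_target_impurity(rows)

# Discarding them leaves a single row, so no spread can be estimated.
impurity_full_only = multi_target_impurity(rows[:1])
```

In this toy case the impurity is 2.5 when the partially labeled rows are used and 0.0 when they are discarded, which shows how discarding them can hide the spread in the targets from the learner.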




Updated: 2020-10-01