Rough set-based feature selection for weakly labeled data

https://doi.org/10.1016/j.ijar.2021.06.005

Abstract

Supervised learning is an important branch of machine learning (ML), which requires a complete annotation (labeling) of the involved training data. This assumption is relaxed in the settings of weakly supervised learning, where labels are allowed to be imprecise or partial. In this article, we study the setting of superset learning, in which instances are assumed to be labeled with a set of possible annotations containing the correct one. We tackle the problem of learning from such data in the context of rough set theory (RST). More specifically, we consider the problem of RST-based feature reduction as a suitable means for data disambiguation, i.e., for the purpose of figuring out the most plausible precise instantiation of the imprecise training data. To this end, we define appropriate generalizations of decision tables and reducts, using tools from generalized information theory and belief function theory. Moreover, we analyze the computational complexity and theoretical properties of the associated computational problems. Finally, we present results of a series of experiments, in which we analyze the proposed concepts empirically and compare our methods with a state-of-the-art dimensionality reduction algorithm, reporting a statistically significant improvement in predictive accuracy.

Introduction

Weakly supervised learning [69] refers to machine learning tasks in which training instances are not required to be associated with a precise target label; instead, annotations may be imprecise or partial. Such tasks can arise from data pre-processing operations such as anonymization [15], [49] or censoring [17], from imprecise measurements or expert opinions, or from the desire to limit data annotation costs [45]. Examples of weakly supervised learning tasks include semi-supervised learning, but also more general tasks such as learning from soft labels [8], [12], [13], [48] (in which partial labels are represented through belief functions), which in turn encompasses both learning from fuzzy labels [14], [28] (in which partial labels are represented through possibility distributions) and superset learning [29], [40], [44]. In the latter setting, which is the focus of this article, each instance x is annotated with a set S of candidate labels that are deemed (equally) possible. In other words, we know that the label of x is an element of S, but nothing more. For example, an image could be tagged with {horse, pony, zebra}, suggesting that the animal shown in the picture is one of these three, though it is not known exactly which.
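
To make the data format concrete, the following is a minimal Python sketch of superset-labeled training data. It is our illustration, not taken from the paper; the names SupersetSample and is_instantiation are hypothetical.

```python
# Minimal sketch (illustrative only) of superset-labeled training data:
# each instance carries a *set* of candidate labels, one of which is correct.
from typing import FrozenSet, List, Tuple

# Hypothetical image-tagging data: feature vectors paired with candidate label sets.
SupersetSample = Tuple[List[float], FrozenSet[str]]

train: List[SupersetSample] = [
    ([0.12, 0.80, 0.33], frozenset({"horse", "pony", "zebra"})),  # ambiguous animal
    ([0.95, 0.10, 0.42], frozenset({"zebra"})),                   # precisely labeled
]

def is_instantiation(labels: List[str], data: List[SupersetSample]) -> bool:
    # A precise "instantiation" of the data selects one candidate label per instance.
    return all(y in S for y, (_, S) in zip(labels, data))

assert is_instantiation(["pony", "zebra"], train)
```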

In recent years, the superset learning task has been widely investigated, both from the classification perspective [19], [30], [64], [66] and from a theoretical standpoint [39]. The latter result is particularly relevant, as it shows that, as in the standard PAC learning model, superset learnability is characterized by combinatorial dimensions (e.g., the Vapnik-Chervonenkis or Natarajan dimension), which in general depend on the dimensionality (i.e., the number of features) of the learning problem. Thus, effective feature selection [24] or dimensionality reduction algorithms would be of critical importance for controlling model capacity and, hence, ensuring proper generalization. Nevertheless, this task has received little attention so far [61].

In this article, which extends our previous article [6], we study the application of rough set theory in the setting of superset learning. In particular, adhering to the generalized risk minimization principle [28], we consider the problem of feature reduction as a means for data disambiguation, i.e., for the purpose of figuring out the most plausible precise instantiation of the imprecise training data. Compared to our previous work, we provide a finer characterization of the theoretical properties of, and the relations among, the proposed definitions of reduct through Theorems 3.4, 3.5, and 3.7, which were previously left as open problems. In the newly added Section 4, we also discuss two computational experiments in which we study the empirical performance of the proposed reduct definitions, also in comparison with the state-of-the-art method for dimensionality reduction in superset learning.
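
For orientation, the generalized risk minimization principle is commonly formalized in the superset learning literature through an "optimistic" superset loss; the following display is a sketch in our own notation, not copied from [28]:

$$\hat{h} \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell^{*}\bigl(S_i, h(x_i)\bigr), \qquad \ell^{*}(S, \hat{y}) = \min_{y \in S} \ell(y, \hat{y}),$$

where $\ell$ is an ordinary loss function. Minimizing $\ell^{*}$ jointly selects a hypothesis and, implicitly, the most plausible precise instantiation $y_i \in S_i$ of each candidate set, which is exactly the disambiguation perspective adopted here.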

Section snippets

Background

In this section, we recall basic notions of rough set theory (RST) and belief function theory, which will be used in the main part of the article.
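
For reference, the standard notions in question are as follows (textbook definitions, stated in our own notation). Given a set of objects $U$, attributes $A$, and a subset $B \subseteq A$, the indiscernibility relation and the lower and upper approximations of $X \subseteq U$ are

$$\mathrm{IND}(B) = \{(x,y) \in U^2 : a(x) = a(y)\ \forall a \in B\}, \qquad \underline{B}X = \{x : [x]_B \subseteq X\}, \qquad \overline{B}X = \{x : [x]_B \cap X \neq \emptyset\},$$

and a reduct is a minimal $B \subseteq A$ that preserves the discernibility (equivalently, the positive region) determined by the full attribute set. From belief function theory, a mass function $m : 2^{\Theta} \to [0,1]$ with $\sum_{S \subseteq \Theta} m(S) = 1$ induces the belief and plausibility measures

$$\mathrm{Bel}(X) = \sum_{S \subseteq X} m(S), \qquad \mathrm{Pl}(X) = \sum_{S \cap X \neq \emptyset} m(S).$$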

Superset decision tables and reducts

In this section, we extend some key concepts of rough set theory to the setting of superset learning.
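
As a toy illustration of how such a generalization can be operationalized, consider the following sketch of one natural consistency-based notion (our illustration, not the paper's belief-theoretic definitions): in a superset decision table each object carries a set of candidate decisions, and an attribute subset B admits a consistent instantiation iff, within every B-indiscernibility class, the candidate decision sets intersect.

```python
# Toy sketch (illustrative only) of a superset decision table: condition
# attributes plus a set of candidate decisions per object.
from itertools import combinations

rows = [
    ({"a": 1, "b": 0}, {"yes"}),
    ({"a": 1, "b": 1}, {"yes", "no"}),
    ({"a": 0, "b": 1}, {"no"}),
]

def consistent_instantiation_exists(B, rows):
    """True iff some choice of one decision per object makes the table
    consistent on attributes B (equal B-values imply equal decisions).
    Within each B-indiscernibility class this holds iff the candidate
    decision sets have a non-empty intersection."""
    classes = {}
    for attrs, cands in rows:
        key = tuple(attrs[a] for a in B)
        classes.setdefault(key, []).append(cands)
    return all(set.intersection(*c) for c in classes.values())

# Naive bottom-up search for minimal consistency-preserving attribute subsets,
# which play the role of reducts under this optimistic (disambiguating) reading.
A = ["a", "b"]
for k in range(1, len(A) + 1):
    hits = [B for B in combinations(A, k) if consistent_instantiation_exists(B, rows)]
    if hits:
        print("minimal consistent attribute subsets:", hits)  # [('a',), ('b',)]
        break
```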

Experiments

In this section, we present a series of experimental studies meant to evaluate the different definitions of reduct in superset learning put forward in this paper, as well as the performance of the proposed algorithms relative to the state of the art in superset dimensionality reduction (the DELIN algorithm; see Section 2). More specifically, our experiments are aimed at studying the following aspects:

  • Reduct approximation: The ability of the different types of reducts to recover the true reducts

Conclusion

Addressing the problem of superset learning in the context of rough set theory, as we did in this paper, appears to be interesting and mutually beneficial:

  • RST provides natural tools for data disambiguation, which is at the core of methods for superset learning; most notable among these tools is the notion of a reduct. Here, the basic idea is that the plausibility of an instantiation of the data is in direct correspondence with the (information-theoretic) complexity it implies for the dependency

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (69)

  • Judea Pearl, Reasoning with belief functions: an analysis of compatibility, Int. J. Approx. Reason. (1990)
  • Razieh Sheikhpour et al., A survey on semi-supervised feature selection methods, Pattern Recognit. (2017)
  • Philippe Smets, Information content of an evidence, Int. J. Man-Mach. Stud. (1983)
  • Philippe Smets et al., The transferable belief model, Artif. Intell. (1994)
  • K. Thangavel et al., Dimensionality reduction based on rough set theory: a review, Appl. Soft Comput. (2009)
  • Y.Y. Yao et al., Interpretations of belief functions in the theory of rough sets, Inf. Sci. (1998)
  • Shao-Pu Zhang et al., Belief function of Pythagorean fuzzy rough approximation space and its applications, Int. J. Approx. Reason. (2020)
  • Yan-Lan Zhang et al., Relationships between relation-based rough sets and belief structures, Int. J. Approx. Reason. (2020)
  • Joaquin Abellan, Combining nonspecificity measures in Dempster–Shafer theory of evidence, Int. J. Gen. Syst. (2011)
  • Joaquin Abellan et al., Completing a total uncertainty measure in the Dempster-Shafer theory, Int. J. Gen. Syst. (1999)
  • Pierre C. Bellec et al., On the prediction loss of the lasso in the partially labeled setting, Electron. J. Stat. (2018)
  • Rafael Bello et al., Rough sets in machine learning: a review
  • Andrea Campagner et al., Feature reduction in superset learning using rough sets and evidence theory
  • Timothee Cour et al., Learning from partial labels, J. Mach. Learn. Res. (2011)
  • Arthur P. Dempster, Upper and lower probabilities induced by a multivalued mapping
  • T. Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory, IEEE Trans. Syst. Man Cybern. (1995)
  • Thierry Denoeux, Maximum likelihood estimation from uncertain data in the belief function framework, IEEE Trans. Knowl. Data Eng. (2011)
  • Adrian Dobra et al., Bounds for cell entries in contingency tables given marginal totals and decomposable graphs, Proc. Natl. Acad. Sci. USA (2000)
  • Bradley Efron, Censored data and the bootstrap, J. Am. Stat. Assoc. (1981)
  • Lei Feng et al., Leveraging latent label distributions for partial label learning
  • Lei Feng et al., Partial label learning with self-guided retraining
  • A. Frank et al., UCI machine learning repository (2010)
  • Bernhard Ganter et al., Conceptual scaling
