当前期刊: Machine Learning Go to current issue    加入关注   
显示样式:        排序: 导出
  • Sum–product graphical models
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-27
    Mattia Desana, Christoph Schnörr

    This paper introduces a probabilistic architecture called sum–product graphical model (SPGM). SPGMs represent a class of probability distributions that combines, for the first time, the semantics of probabilistic graphical models (GMs) with the evaluation efficiency of sum–product networks (SPNs): Like SPNs, SPGMs always enable tractable inference using a class of models that incorporate context specific independence. Like GMs, SPGMs provide a high-level model interpretation in terms of conditional independence assumptions and corresponding factorizations. Thus, this approach provides new connections between the fields of SPNs and GMs, and enables a high-level interpretation of the family of distributions encoded by SPNs. We provide two applications of SPGMs in density estimation with empirical results close to or surpassing state-of-the-art models. The theoretical and practical results demonstrate that jointly exploiting properties of SPNs and GMs is an interesting direction of future research.

  • Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games
    Mach. Learn. (IF 2.809) Pub Date : 2019-07-25
    Vojtěch Kovařík, Viliam Lisý

    Abstract Hannan consistency, or no external regret, is a key concept for learning in games. An action selection algorithm is Hannan consistent (HC) if its performance is eventually as good as selecting the best fixed action in hindsight. If both players in a zero-sum normal form game use a Hannan consistent algorithm, their average behavior converges to a Nash equilibrium of the game. A similar result is known about extensive form games, but the played strategies need to be Hannan consistent with respect to the counterfactual values, which are often difficult to obtain. We study zero-sum extensive form games with simultaneous moves, but otherwise perfect information. These games generalize normal form games and they are a special case of extensive form games. We study whether applying HC algorithms in each decision point of these games directly to the observed payoffs leads to convergence to a Nash equilibrium. This learning process corresponds to a class of Monte Carlo Tree Search algorithms, which are popular for playing simultaneous-move games but do not have any known performance guarantees. We show that using HC algorithms directly on the observed payoffs is not sufficient to guarantee the convergence. With an additional averaging over joint actions, the convergence is guaranteed, but empirically slower. We further define an additional property of HC algorithms, which is sufficient to guarantee the convergence without the averaging and we empirically show that commonly used HC algorithms have this property.

  • Provable accelerated gradient method for nonconvex low rank optimization
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-26
    Huan Li, Zhouchen Lin

    Optimization over low rank matrices has broad applications in machine learning. For large-scale problems, an attractive heuristic is to factorize the low rank matrix to a product of two much smaller matrices. In this paper, we study the nonconvex problem \(\min _{\mathbf {U}\in \mathbb {R}^{n\times r}} g(\mathbf {U})=f(\mathbf {U}\mathbf {U}^T)\) under the assumptions that \(f(\mathbf {X})\) is restricted \(\mu \)-strongly convex and L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\). We propose an accelerated gradient method with alternating constraint that operates directly on the \(\mathbf {U}\) factors and show that the method has local linear convergence rate with the optimal dependence on the condition number of \(\sqrt{L/\mu }\). Globally, our method converges to the critical point with zero gradient from any initializer. Our method also applies to the problem with the asymmetric factorization of \(\mathbf {X}={\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\) and the same convergence result can be obtained. Extensive experimental results verify the advantage of our method.

  • Rankboost $$+$$+ : an improvement to Rankboost
    Mach. Learn. (IF 2.809) Pub Date : 2019-08-12
    Harold Connamacher, Nikil Pancha, Rui Liu, Soumya Ray

    Abstract Rankboost is a well-known algorithm that iteratively creates and aggregates a collection of “weak rankers” to build an effective ranking procedure. Initial work on Rankboost proposed two variants. One variant, that we call Rb-d and which is designed for the scenario where all weak rankers have the binary range \(\{0,1\}\), has good theoretical properties, but does not perform well in practice. The other, that we call Rb-c, has good empirical behavior and is the recommended variation for this binary weak ranker scenario but lacks a theoretical grounding. In this paper, we rectify this situation by proposing an improved Rankboost algorithm for the binary weak ranker scenario that we call Rankboost\(+\). We prove that this approach is theoretically sound and also show empirically that it outperforms both Rankboost variants in practice. Further, the theory behind Rankboost\(+\) helps us to explain why Rb-d may not perform well in practice, and why Rb-c is better behaved in the binary weak ranker scenario, as has been observed in prior work.

  • Combining Bayesian optimization and Lipschitz optimization
    Mach. Learn. (IF 2.809) Pub Date : 2019-08-22
    Mohamed Osama Ahmed, Sharan Vaswani, Mark Schmidt

    Abstract Bayesian optimization and Lipschitz optimization have developed alternative techniques for optimizing black-box functions. They each exploit a different form of prior about the function. In this work, we explore strategies to combine these techniques for better global optimization. In particular, we propose ways to use the Lipschitz continuity assumption within traditional BO algorithms, which we call Lipschitz Bayesian optimization (LBO). This approach does not increase the asymptotic runtime and in some cases drastically improves the performance (while in the worst case the performance is similar). Indeed, in a particular setting, we prove that using the Lipschitz information yields the same or a better bound on the regret compared to using Bayesian optimization on its own. Moreover, we propose a simple heuristics to estimate the Lipschitz constant, and prove that a growing estimate of the Lipschitz constant is in some sense “harmless”. Our experiments on 15 datasets with 4 acquisition functions show that in the worst case LBO performs similar to the underlying BO method while in some cases it performs substantially better. Thompson sampling in particular typically saw drastic improvements (as the Lipschitz information corrected for its well-known “over-exploration” pheonemon) and its LBO variant often outperformed other acquisition functions.

  • Kappa Updated Ensemble for drifting data stream mining
    Mach. Learn. (IF 2.809) Pub Date : 2019-10-02
    Alberto Cano, Bartosz Krawczyk

    Learning from data streams in the presence of concept drift is among the biggest challenges of contemporary machine learning. Algorithms designed for such scenarios must take into an account the potentially unbounded size of data, its constantly changing nature, and the requirement for real-time processing. Ensemble approaches for data stream mining have gained significant popularity, due to their high predictive capabilities and effective mechanisms for alleviating concept drift. In this paper, we propose a new ensemble method named Kappa Updated Ensemble (KUE). It is a combination of online and block-based ensemble approaches that uses Kappa statistic for dynamic weighting and selection of base classifiers. In order to achieve a higher diversity among base learners, each of them is trained using a different subset of features and updated with new instances with given probability following a Poisson distribution. Furthermore, we update the ensemble with new classifiers only when they contribute positively to the improvement of the quality of the ensemble. Finally, each base classifier in KUE is capable of abstaining itself for taking a part in voting, thus increasing the overall robustness of KUE. An extensive experimental study shows that KUE is capable of outperforming state-of-the-art ensembles on standard and imbalanced drifting data streams while having a low computational complexity. Moreover, we analyze the use of Kappa versus accuracy to drive the criterion to select and update the classifiers, the contribution of the abstaining mechanism, the contribution of the diversification of classifiers, and the contribution of the hybrid architecture to update the classifiers in an online manner.

  • Conditional density estimation and simulation through optimal transport
    Mach. Learn. (IF 2.809) Pub Date : 2020-01-13
    Esteban G. Tabak, Giulio Trigila, Wenjun Zhao

    Abstract A methodology to estimate from samples the probability density of a random variable x conditional to the values of a set of covariates \(\{z_{l}\}\) is proposed. The methodology relies on a data-driven formulation of the Wasserstein barycenter, posed as a minimax problem in terms of the conditional map carrying each sample point to the barycenter and a potential characterizing the inverse of this map. This minimax problem is solved through the alternation of a flow developing the map in time and the maximization of the potential through an alternate projection procedure. The dependence on the covariates \(\{z_{l}\}\) is formulated in terms of convex combinations, so that it can be applied to variables of nearly any type, including real, categorical and distributional. The methodology is illustrated through numerical examples on synthetic and real data. The real-world example chosen is meteorological, forecasting the temperature distribution at a given location as a function of time, and estimating the joint distribution at a location of the highest and lowest daily temperatures as a function of the date.

  • Model-based kernel sum rule: kernel Bayesian inference with probabilistic models
    Mach. Learn. (IF 2.809) Pub Date : 2020-01-02
    Yu Nishiyama, Motonobu Kanagawa, Arthur Gretton, Kenji Fukumizu

    Kernel Bayesian inference is a principled approach to nonparametric inference in probabilistic graphical models, where probabilistic relationships between variables are learned from data in a nonparametric manner. Various algorithms of kernel Bayesian inference have been developed by combining kernelized basic probabilistic operations such as the kernel sum rule and kernel Bayes’ rule. However, the current framework is fully nonparametric, and it does not allow a user to flexibly combine nonparametric and model-based inferences. This is inefficient when there are good probabilistic models (or simulation models) available for some parts of a graphical model; this is in particular true in scientific fields where “models” are the central topic of study. Our contribution in this paper is to introduce a novel approach, termed the model-based kernel sum rule (Mb-KSR), to combine a probabilistic model and kernel Bayesian inference. By combining the Mb-KSR with the existing kernelized probabilistic rules, one can develop various algorithms for hybrid (i.e., nonparametric and model-based) inferences. As an illustrative example, we consider Bayesian filtering in a state space model, where typically there exists an accurate probabilistic model for the state transition process. We propose a novel filtering method that combines model-based inference for the state transition process and data-driven, nonparametric inference for the observation generating process. We empirically validate our approach with synthetic and real-data experiments, the latter being the problem of vision-based mobile robot localization in robotics, which illustrates the effectiveness of the proposed hybrid approach.

  • Improved graph-based SFA: information preservation complements the slowness principle
    Mach. Learn. (IF 2.809) Pub Date : 2019-12-26
    Alberto N. Escalante-B., Laurenz Wiskott

    Slow feature analysis (SFA) is an unsupervised learning algorithm that extracts slowly varying features from a multi-dimensional time series. SFA has been extended to supervised learning (classification and regression) by an algorithm called graph-based SFA (GSFA). GSFA relies on a particular graph structure to extract features that preserve label similarities. Processing of high dimensional input data (e.g., images) is feasible via hierarchical GSFA (HGSFA), resulting in a multi-layer neural network. Although HGSFA has useful properties, in this work we identify a shortcoming, namely, that HGSFA networks prematurely discard quickly varying but useful features before they reach higher layers, resulting in suboptimal global slowness and an under-exploited feature space. To counteract this shortcoming, which we call unnecessary information loss, we propose an extension called hierarchical information-preserving GSFA (HiGSFA), where some features fulfill a slowness objective and other features fulfill an information preservation objective. The efficacy of the extension is verified in three experiments: (1) an unsupervised setup where the input data is the visual stimuli of a simulated rat, (2) the localization of faces in image patches, and (3) the estimation of human age from facial photographs of the MORPH-II database. Both HiGSFA and HGSFA can learn multiple labels and offer a rich feature space, feed-forward training, and linear complexity in the number of samples and dimensions. However, the proposed algorithm, HiGSFA, outperforms HGSFA in terms of feature slowness, estimation accuracy, and input reconstruction, giving rise to a promising hierarchical supervised-learning approach. Moreover, for age estimation, HiGSFA achieves a mean absolute error of 3.41 years, which is a competitive performance for this challenging problem.

  • On cognitive preferences and the plausibility of rule-based models
    Mach. Learn. (IF 2.809) Pub Date : 2019-12-24
    Johannes Fürnkranz, Tomáš Kliegr, Heiko Paulheim

    It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpretability, namely the plausibility of models. Roughly speaking, we equate the plausibility of a model with the likeliness that a user accepts it as an explanation for a prediction. In particular, we argue that—all other things being equal—longer explanations may be more convincing than shorter ones, and that the predominant bias for shorter models, which is typically necessary for learning powerful discriminative models, may not be suitable when it comes to user acceptance of the learned models. To that end, we first recapitulate evidence for and against this postulate, and then report the results of an evaluation in a crowdsourcing study based on about 3000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representative heuristic, or the recognition heuristic, and investigate their relation to rule length and plausibility.

  • Distributed block-diagonal approximation methods for regularized empirical risk minimization
    Mach. Learn. (IF 2.809) Pub Date : 2019-12-18
    Ching-pei Lee, Kai-Wei Chang

    Abstract In recent years, there is a growing need to train machine learning models on a huge volume of data. Therefore, designing efficient distributed optimization algorithms for empirical risk minimization (ERM) has become an active and challenging research topic. In this paper, we propose a flexible framework for distributed ERM training through solving the dual problem, which provides a unified description and comparison of existing methods. Our approach requires only approximate solutions of the sub-problems involved in the optimization process, and is versatile to be applied on many large-scale machine learning problems including classification, regression, and structured prediction. We show that our framework enjoys global linear convergence for a broad class of non-strongly-convex problems, and some specific choices of the sub-problems can even achieve much faster convergence than existing approaches by a refined analysis. This improved convergence rate is also reflected in the superior empirical performance of our method.

  • Learning higher-order logic programs
    Mach. Learn. (IF 2.809) Pub Date : 2019-12-03
    Andrew Cropper, Rolf Morel, Stephen Muggleton

    Abstract A key feature of inductive logic programming is its ability to learn first-order programs, which are intrinsically more expressive than propositional programs. In this paper, we introduce techniques to learn higher-order programs. Specifically, we extend meta-interpretive learning (MIL) to support learning higher-order programs by allowing for higher-order definitions to be used as background knowledge. Our theoretical results show that learning higher-order programs, rather than first-order programs, can reduce the textual complexity required to express programs, which in turn reduces the size of the hypothesis space and sample complexity. We implement our idea in two new MIL systems: the Prolog system \(\text {Metagol}_{ho}\) and the ASP system \(\text {HEXMIL}_{ho}\). Both systems support learning higher-order programs and higher-order predicate invention, such as inventing functions for map/3 and conditions for filter/3. We conduct experiments on four domains (robot strategies, chess playing, list transformations, and string decryption) that compare learning first-order and higher-order programs. Our experimental results support our theoretical claims and show that, compared to learning first-order programs, learning higher-order programs can significantly improve predictive accuracies and reduce learning times.

  • Exploiting causality in gene network reconstruction based on graph embedding
    Mach. Learn. (IF 2.809) Pub Date : 2019-12-03
    Gianvito Pio, Michelangelo Ceci, Francesca Prisciandaro, Donato Malerba

    Gene network reconstruction is a bioinformatics task that aims at modelling the complex regulatory activities that may occur among genes. This task is typically solved by means of link prediction methods that analyze gene expression data. However, the reconstructed networks often suffer from a high amount of false positive edges, which are actually the result of indirect regulation activities due to the presence of common cause and common effect phenomena or, in other terms, due to the fact that the adopted inductive methods do not take into account possible causality phenomena. This issue is accentuated even more by the inherent presence of a high amount of noise in gene expression data. Existing methods for the identification of a transitive reduction of a network or for the removal of (possibly) redundant edges suffer from limitations in the structure of the network or in the nature/length of the indirect regulation, and often require additional pre-processing steps to handle specific peculiarities of the networks (e.g., cycles). Moreover, they are not able to consider possible community structures and possible similar roles of the genes in the network (e.g. hub nodes), which may change the tendency of nodes to be highly connected (and with which nodes) in the network. In this paper, we propose the method INLOCANDA, which learns an inductive predictive model for gene network reconstruction and overcomes all the mentioned limitations. In particular, INLOCANDA is able to (i) identify and exploit indirect relationships of arbitrary length to remove edges due to common cause and common effect phenomena; (ii) take into account possible community structures and possible similar roles by means of graph embedding. Experiments performed along multiple dimensions of analysis on benchmark, real networks of two organisms (E. coli and S. cerevisiae) show a higher accuracy with respect to the competitors, as well as a higher robustness to the presence of noise in the data, also when a huge amount of (possibly false positive) interactions is removed. Availability: http://www.di.uniba.it/~gianvitopio/systems/inlocanda/

  • A scalable sparse Cholesky based approach for learning high-dimensional covariance matrices in ordered data
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-04
    Kshitij Khare, Sang-Yun Oh, Syed Rahman, Bala Rajaratnam

    Abstract Covariance estimation for high-dimensional datasets is a fundamental problem in machine learning, and has numerous applications. In these high-dimensional settings the number of features or variables p is typically larger than the sample size n. A popular way of tackling this challenge is to induce sparsity in the covariance matrix, its inverse or a relevant transformation. In many applications, the data come with a natural ordering. In such settings, methods inducing sparsity in the Cholesky parameter of the inverse covariance matrix can be quite useful. Such methods are also better positioned to yield a positive definite estimate of the covariance matrix, a critical requirement for several downstream applications. Despite some important advances in this area, a principled approach to general sparse-Cholesky based covariance estimation with both statistical and algorithmic convergence safeguards has been elusive. In particular, the two popular likelihood based methods proposed in the literature either do not lead to a well-defined estimator in high-dimensional settings, or consider only a restrictive class of models. In this paper, we propose a principled and general method for sparse-Cholesky based covariance estimation that aims to overcome some of the shortcomings of current methods, but retains their respective strengths. We obtain a jointly convex formulation for our objective function, and show that it leads to rigorous convergence guarantees and well-defined estimators, even when \(p > n\) . Very importantly, the approach always leads to a positive definite and symmetric estimator of the covariance matrix. We establish both high-dimensional estimation and selection consistency, and also demonstrate excellent finite sample performance on simulated/real data.

  • Covariance-based dissimilarity measures applied to clustering wide-sense stationary ergodic processes
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-26
    Qidi Peng, Nan Rao, Ran Zhao

    We introduce a new unsupervised learning problem: clustering wide-sense stationary ergodic stochastic processes. A covariance-based dissimilarity measure together with asymptotically consistent algorithms is designed for clustering offline and online datasets, respectively. We also suggest a formal criterion on the efficiency of dissimilarity measures, and discuss an approach to improve the efficiency of our clustering algorithms, when they are applied to cluster particular type of processes, such as self-similar processes with wide-sense stationary ergodic increments. Clustering synthetic data and real-world data are provided as examples of applications.

  • 2D compressed learning: support matrix machine with bilinear random projections
    Mach. Learn. (IF 2.809) Pub Date : 2019-05-23
    Di Ma, Songcan Chen

    Support matrix machine (SMM) is an efficient matrix classification method that can leverage the structure information within the matrix to improve the classification performance. However, its computational and storage costs are still expensive for high-dimensional data. To address these problems, in this paper, we consider a 2D compressed learning paradigm to learn the SMM classifier in some compressed data domain. Specifically, we use the Kronecker compressed sensing (KCS) to obtain the compressive measurements and learn the SMM classifier. We show that the Kronecker product measurement matrices used by KCS satisfies the restricted isometry property (RIP), which is a property to ensure the learnability of the compressed data. We further give a lower bound on the number of measurements required for KCS. Though this lower bound shows that KCS requires more measurements than the regular CS to satisfy the same RIP condition, KCS itself still enjoys lower computational and storage complexities. Then, using the RIP condition, we verify that the learned SMM classifier in the compressed domain can perform almost as well as the best linear classifier in the original uncompressed domain. Finally, our experimental results also demonstrate the feasibility of 2D compressed learning.

  • The kernel Kalman rule
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-18
    Gregor H. W. Gebhardt, Andras Kupcsik, Gerhard Neumann

    Abstract Enabling robots to act in unstructured and unknown environments requires versatile state estimation techniques. While traditional state estimation methods require known models and make strong assumptions about the dynamics, such versatile techniques should be able to deal with high dimensional observations and non-linear, unknown system dynamics. The recent framework for nonparametric inference allows to perform inference on arbitrary probability distributions. High-dimensional embeddings of distributions into reproducing kernel Hilbert spaces are manipulated by kernelized inference rules, most prominently the kernel Bayes’ rule (KBR). However, the computational demands of the KBR do not scale with the number of samples. In this paper, we present two techniques to increase the computational efficiency of non-parametric inference. First, the kernel Kalman rule (KKR) is presented as an approximate alternative to the KBR that estimates the embedding of the state based on a recursive least squares objective. Based on the KKR we present the kernel Kalman filter (KKF) that updates an embedding of the belief state and learns the system and observation models from data. We further derive the kernel forward backward smoother (KFBS) based on a forward and backward KKF and a smoothing update in Hilbert space. Second, we present the subspace conditional embedding operator as a sparsification technique that still leverages from the full data set. We apply this sparsification to the KKR and derive the corresponding sparse KKF and KFBS algorithms. We show on nonlinear state estimation tasks that our approaches provide a significantly improved estimation accuracy while the computational demands are considerably decreased.

  • Speculate-correct error bounds for k -nearest neighbor classifiers
    Mach. Learn. (IF 2.809) Pub Date : 2019-06-18
    Eric Bax, Lingjie Weng, Xu Tian

    Abstract We introduce the speculate-correct method to derive error bounds for local classifiers. Using it, we show that k-nearest neighbor classifiers, in spite of their famously fractured decision boundaries, have exponential error bounds with \(\hbox {O} \left( \sqrt{(k + \ln n)/n} \right) \) range around an estimate of generalization error for n in-sample examples.

  • Logical reduction of metarules
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-20
    Andrew Cropper, Sophie Tourret

    Many forms of inductive logic programming (ILP) use metarules, second-order Horn clauses, to define the structure of learnable programs and thus the hypothesis space. Deciding which metarules to use for a given learning task is a major open problem and is a trade-off between efficiency and expressivity: the hypothesis space grows given more metarules, so we wish to use fewer metarules, but if we use too few metarules then we lose expressivity. In this paper, we study whether fragments of metarules can be logically reduced to minimal finite subsets. We consider two traditional forms of logical reduction: subsumption and entailment. We also consider a new reduction technique called derivation reduction, which is based on SLD-resolution. We compute reduced sets of metarules for fragments relevant to ILP and theoretically show whether these reduced sets are reductions for more general infinite fragments. We experimentally compare learning with reduced sets of metarules on three domains: Michalski trains, string transformations, and game rules. In general, derivation reduced sets of metarules outperform subsumption and entailment reduced sets, both in terms of predictive accuracies and learning times.

  • Inductive general game playing
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-18
    Andrew Cropper, Richard Evans, Mark Law

    General game playing (GGP) is a framework for evaluating an agent’s general intelligence across a wide range of tasks. In the GGP competition, an agent is given the rules of a game (described as a logic program) that it has never seen before. The task is for the agent to play the game, thus generating game traces. The winner of the GGP competition is the agent that gets the best total score over all the games. In this paper, we invert this task: a learner is given game traces and the task is to learn the rules that could produce the traces. This problem is central to inductive general game playing (IGGP). We introduce a technique that automatically generates IGGP tasks from GGP games. We introduce an IGGP dataset which contains traces from 50 diverse games, such as Sudoku, Sokoban, and Checkers. We claim that IGGP is difficult for existing inductive logic programming (ILP) approaches. To support this claim, we evaluate existing ILP systems on our dataset. Our empirical results show that most of the games cannot be correctly learned by existing systems. The best performing system solves only 40% of the tasks perfectly. Our results suggest that IGGP poses many challenges to existing approaches. Furthermore, because we can automatically generate IGGP tasks from GGP games, our dataset will continue to grow with the GGP competition, as new games are added every year. We therefore think that the IGGP problem and dataset will be valuable for motivating and evaluating future research.

  • A survey on semi-supervised learning
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-15
    Jesper E. van Engelen, Holger H. Hoos

    Abstract Semi-supervised learning is the branch of machine learning concerned with using labelled as well as unlabelled data to perform certain learning tasks. Conceptually situated between supervised and unsupervised learning, it permits harnessing the large amounts of unlabelled data available in many use cases in combination with typically smaller sets of labelled data. In recent years, research in this area has followed the general trends observed in machine learning, with much attention directed at neural network-based models and generative learning. The literature on the topic has also expanded in volume and scope, now encompassing a broad spectrum of theory, algorithms and applications. However, no recent surveys exist to collect and organize this knowledge, impeding the ability of researchers and engineers alike to utilize it. Filling this void, we present an up-to-date overview of semi-supervised learning methods, covering earlier work as well as more recent advances. We focus primarily on semi-supervised classification, where the large majority of semi-supervised learning research takes place. Our survey aims to provide researchers and practitioners new to the field as well as more advanced readers with a solid understanding of the main approaches and algorithms developed over the past two decades, with an emphasis on the most prominent and currently relevant work. Furthermore, we propose a new taxonomy of semi-supervised classification algorithms, which sheds light on the different conceptual and methodological approaches for incorporating unlabelled data into the training process. Lastly, we show how the fundamental assumptions underlying most semi-supervised learning algorithms are closely connected to each other, and how they relate to the well-known semi-supervised clustering assumption.

  • Constructing generative logical models for optimisation problems using domain knowledge
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-13
    Ashwin Srinivasan, Lovekesh Vig, Gautam Shroff

    In this paper we seek to identify data instances with a low value of some objective (or cost) function. Normally posed as optimisation problems, our interest is in problems that have the following characteristics: (a) optimal, or even near-optimal solutions are very rare; (b) it is expensive to obtain the value of the objective function for large numbers of data instances; and (c) there is domain knowledge in the form of experience, rules-of-thumb, constraints and the like, which is difficult to translate into the usual constraints for numerical optimisation procedures. Here we investigate the use of Inductive Logic Programming (ILP) to construct models within a procedure that progressively attempts to increase the number of near-optimal solutions. Using ILP in this manner requires a change in focus from discriminatory models (the usual staple for ILP) to generative models. Using controlled datasets, we investigate the use of probability-sampling of solutions based on the estimated cost of clauses found using ILP. Specifically, we compare the results obtained against: (a) simple random sampling; and (b) generative deep network models that use a low-level encoding and automatically construct higher-level features. Our results suggest: (1) Against each of the alternatives, probability-sampling from ILP-constructed models contain more near-optimal solutions; (2) The key to the effectiveness of ILP-constructed models is the availability of domain knowledge. We also demonstrate the use of ILP in this manner on two real-world problems from the area of drug-design (predicting solubility and binding affinity), using domain knowledge of chemical ring structures and functional groups. Taken together, our results suggest that generative modelling using ILP can be very effective for optimisation problems where: (a) the number of training instances to be used is restricted, and (b) there is domain knowledge relevant to low-cost solutions.

  • On some graph-based two-sample tests for high dimension, low sample size data
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-13
    Soham Sarkar, Rahul Biswas, Anil K. Ghosh

    Abstract Testing for equality of two high-dimensional distributions is a challenging problem, and this becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, which can be used for data of arbitrary dimensions. Most of these test statistics are computed using pairwise Euclidean distances among the observations. But, due to concentration of pairwise Euclidean distances, these tests have poor performance in many high-dimensional problems. Some of them can have powers even below the nominal level when the scale-difference between two distributions dominates the location-difference. To overcome these limitations, we introduce some new dissimilarity indices and use them to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage, and as a result, they outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated as well as real data sets, we demonstrate their usefulness in high dimension, low sample size situations.

  • Active deep Q-learning with demonstration
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-08
    Si-An Chen, Voot Tangkaratt, Hsuan-Tien Lin, Masashi Sugiyama

    Reinforcement learning (RL) is a machine learning technique aiming to learn how to take actions in an environment to maximize some kind of reward. Recent research has shown that although the learning efficiency of RL can be improved with expert demonstration, it usually takes considerable efforts to obtain enough demonstration. The efforts prevent training decent RL agents with expert demonstration in practice. In this work, we propose Active Reinforcement Learning with Demonstration, a new framework to streamline RL in terms of demonstration efforts by allowing the RL agent to query for demonstration actively during training. Under the framework, we propose Active deep Q-Network, a novel query strategy based on a classical RL algorithm called deep Q-network (DQN). The proposed algorithm dynamically estimates the uncertainty of recent states and utilizes the queried demonstration data by optimizing a supervised loss in addition to the usual DQN loss. We propose two methods of estimating the uncertainty based on two state-of-the-art DQN models, namely the divergence of bootstrapped DQN and the variance of noisy DQN. The empirical results validate that both methods not only learn faster than other passive expert demonstration methods with the same amount of demonstration and but also reach super-expert level of performance across four different tasks.

  • Principled analytic classifier for positive-unlabeled learning via weighted integral probability metric
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-04
    Yongchan Kwon, Wonyoung Kim, Masashi Sugiyama, Myunghee Cho Paik

    We consider the problem of learning a binary classifier from only positive and unlabeled observations (called PU learning). Recent studies in PU learning have shown superior performance theoretically and empirically. However, most existing algorithms may not be suitable for large-scale datasets because they face repeated computations of a large Gram matrix or require massive hyperparameter optimization. In this paper, we propose a computationally efficient and theoretically grounded PU learning algorithm. The proposed PU learning algorithm produces a closed-form classifier when the hypothesis space is a closed ball in reproducing kernel Hilbert space. In addition, we establish upper bounds of the estimation error and the excess risk. The obtained estimation error bound is sharper than existing results and the derived excess risk bound has an explicit form, which vanishes as sample sizes increase. Finally, we conduct extensive numerical experiments using both synthetic and real datasets, demonstrating improved accuracy, scalability, and robustness of the proposed algorithm.

  • Rank minimization on tensor ring: an efficient approach for tensor decomposition and completion
    Mach. Learn. (IF 2.809) Pub Date : 2019-11-04
    Longhao Yuan, Chao Li, Jianting Cao, Qibin Zhao

    In recent studies, tensor ring decomposition (TRD) has become a promising model for tensor completion. However, TRD suffers from the rank selection problem due to the undetermined multilinear rank. For tensor decomposition with missing entries, the sub-optimal rank selection of traditional methods leads to the overfitting/underfitting problem. In this paper, we first explore the latent space of the TRD and theoretically prove the relationship between the TR-rank and the rank of the tensor unfoldings. Then, we propose two tensor completion models by imposing the different low-rank regularizations on the TR-factors, by which the TR-rank of the underlying tensor is minimized and the low-rank structures of the underlying tensor are exploited. By employing the alternating direction method of multipliers scheme, our algorithms obtain the TR factors and the underlying tensor simultaneously. In experiments of tensor completion tasks, our algorithms show robustness to rank selection and high computation efficiency, in comparison to traditional low-rank approximation algorithms.

  • Asymptotically optimal algorithms for budgeted multiple play bandits
    Mach. Learn. (IF 2.809) Pub Date : 2019-05-16
    Alex Luedtke, Emilie Kaufmann, Antoine Chambaz

    Abstract We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.

  • The Synergy Between PAV and AdaBoost.
    Mach. Learn. (IF 2.809) Pub Date : 2005-11-01
    W John Wilbur,Lana Yeganova,Won Kim

    Schapire and Singer's improved version of AdaBoost for handling weak hypotheses with confidence rated predictions represents an important advance in the theory and practice of boosting. Its success results from a more efficient use of information in weak hypotheses during updating. Instead of simple binary voting a weak hypothesis is allowed to vote for or against a classification with a variable strength or confidence. The Pool Adjacent Violators (PAV) algorithm is a method for converting a score into a probability. We show how PAV may be applied to a weak hypothesis to yield a new weak hypothesis which is in a sense an ideal confidence rated prediction and that this leads to an optimal updating for AdaBoost. The result is a new algorithm which we term PAV-AdaBoost. We give several examples illustrating problems for which this new algorithm provides advantages in performance.

  • A greedy feature selection algorithm for Big Data of high dimensionality.
    Mach. Learn. (IF 2.809) Pub Date : 2019-03-25
    Ioannis Tsamardinos,Giorgos Borboudakis,Pavlos Katsogridakis,Polyvios Pratikakis,Vassilis Christophides

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.

  • Preserving differential privacy in convolutional deep belief networks.
    Mach. Learn. (IF 2.809) Pub Date : 2017-10-01
    NhatHai Phan,Xintao Wu,Dejing Dou

    The remarkable development of deep learning in medicine and healthcare domain presents obvious privacy issues, when deep neural networks are built on users' personal and highly sensitive data, e.g., clinical records, user profiles, biomedical images, etc. However, only a few scientific studies on preserving privacy in deep learning have been conducted. In this paper, we focus on developing a private convolutional deep belief network (pCDBN), which essentially is a convolutional deep belief network (CDBN) under differential privacy. Our main idea of enforcing ϵ-differential privacy is to leverage the functional mechanism to perturb the energy-based objective functions of traditional CDBNs, rather than their results. One key contribution of this work is that we propose the use of Chebyshev expansion to derive the approximate polynomial representation of objective functions. Our theoretical analysis shows that we can further derive the sensitivity and error bounds of the approximate polynomial representation. As a result, preserving differential privacy in CDBNs is feasible. We applied our model in a health social network, i.e., YesiWell data, and in a handwriting digit dataset, i.e., MNIST data, for human behavior prediction, human behavior classification, and handwriting digit recognition tasks. Theoretical analysis and rigorous experimental evaluations show that the pCDBN is highly effective. It significantly outperforms existing solutions.

  • Learning Classification Models of Cognitive Conditions from Subtle Behaviors in the Digital Clock Drawing Test.
    Mach. Learn. (IF 2.809) Pub Date : 2016-04-09
    William Souillard-Mandar,Randall Davis,Cynthia Rudin,Rhoda Au,David J Libon,Rodney Swenson,Catherine C Price,Melissa Lamar,Dana L Penney

    The Clock Drawing Test - a simple pencil and paper test - has been used for more than 50 years as a screening tool to differentiate normal individuals from those with cognitive impairment, and has proven useful in helping to diagnose cognitive dysfunction associated with neurological disorders such as Alzheimer's disease, Parkinson's disease, and other dementias and conditions. We have been administering the test using a digitizing ballpoint pen that reports its position with considerable spatial and temporal precision, making available far more detailed data about the subject's performance. Using pen stroke data from these drawings categorized by our software, we designed and computed a large collection of features, then explored the tradeoffs in performance and interpretability in classifiers built using a number of different subsets of these features and a variety of different machine learning techniques. We used traditional machine learning methods to build prediction models that achieve high accuracy. We operationalized widely used manual scoring systems so that we could use them as benchmarks for our models. We worked with clinicians to define guidelines for model interpretability, and constructed sparse linear models and rule lists designed to be as easy to use as scoring systems currently used by clinicians, but more accurate. While our models will require additional testing for validation, they offer the possibility of substantial improvement in detecting cognitive impairment earlier than currently possible, a development with considerable potential impact in practice.

  • Learning Probabilistic Logic Models from Probabilistic Examples.
    Mach. Learn. (IF 2.809) Pub Date : 2009-11-06
    Jianzhong Chen,Stephen Muggleton,José Santos

    We revisit an application developed originally using abductive Inductive Logic Programming (ILP) for modeling inhibition in metabolic networks. The example data was derived from studies of the effects of toxins on rats using Nuclear Magnetic Resonance (NMR) time-trace analysis of their biofluids together with background knowledge representing a subset of the Kyoto Encyclopedia of Genes and Genomes (KEGG). We now apply two Probabilistic ILP (PILP) approaches - abductive Stochastic Logic Programs (SLPs) and PRogramming In Statistical modeling (PRISM) to the application. Both approaches support abductive learning and probability predictions. Abductive SLPs are a PILP framework that provides possible worlds semantics to SLPs through abduction. Instead of learning logic models from non-probabilistic examples as done in ILP, the PILP approach applied in this paper is based on a general technique for introducing probability labels within a standard scientific experimental setting involving control and treated data. Our results demonstrate that the PILP approach provides a way of learning probabilistic logic models from probabilistic examples, and the PILP models learned from probabilistic examples lead to a significant decrease in error accompanied by improved insight from the learned results compared with the PILP models learned from non-probabilistic examples.

  • Detecting Inappropriate Access to Electronic Health Records Using Collaborative Filtering.
    Mach. Learn. (IF 2.809) Pub Date : 2014-04-01
    Aditya Krishna Menon,Xiaoqian Jiang,Jihoon Kim,Jaideep Vaidya,Lucila Ohno-Machado

    Many healthcare facilities enforce security on their electronic health records (EHRs) through a corrective mechanism: some staff nominally have almost unrestricted access to the records, but there is a strict ex post facto audit process for inappropriate accesses, i.e., accesses that violate the facility's security and privacy policies. This process is inefficient, as each suspicious access has to be reviewed by a security expert, and is purely retrospective, as it occurs after damage may have been incurred. This motivates automated approaches based on machine learning using historical data. Previous attempts at such a system have successfully applied supervised learning models to this end, such as SVMs and logistic regression. While providing benefits over manual auditing, these approaches ignore the identity of the users and patients involved in a record access. Therefore, they cannot exploit the fact that a patient whose record was previously involved in a violation has an increased risk of being involved in a future violation. Motivated by this, in this paper, we propose a collaborative filtering inspired approach to predicting inappropriate accesses. Our solution integrates both explicit and latent features for staff and patients, the latter acting as a personalized "finger-print" based on historical access patterns. The proposed method, when applied to real EHR access data from two tertiary hospitals and a file-access dataset from Amazon, shows not only significantly improved performance compared to existing methods, but also provides insights as to what indicates an inappropriate access.

  • Differential privacy based on importance weighting.
    Mach. Learn. (IF 2.809) Pub Date : 2014-02-01
    Zhanglong Ji,Charles Elkan

    This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations.

  • Ensemble Clustering using Semidefinite Programming with Applications.
    Mach. Learn. (IF 2.809) Pub Date : 2010-05-01
    Vikas Singh,Lopamudra Mukherjee,Jiming Peng,Jinhui Xu

    In this paper, we study the ensemble clustering problem, where the input is in the form of multiple clustering solutions. The goal of ensemble clustering algorithms is to aggregate the solutions into one solution that maximizes the agreement in the input ensemble. We obtain several new results for this problem. Specifically, we show that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches. Our optimization proceeds by first constructing a non-linear objective function which is then transformed into a 0-1 Semidefinite program (SDP) using novel convexification techniques. This model can be subsequently relaxed to a polynomial time solvable SDP. In addition to the theoretical contributions, our experimental results on standard machine learning and synthetic datasets show that this approach leads to improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. In addition, we identify several new application scenarios for this problem. These include combining multiple image segmentations and generating tissue maps from multiple-channel Diffusion Tensor brain images to identify the underlying structure of the brain.

  • Informing sequential clinical decision-making through reinforcement learning: an empirical study.
    Mach. Learn. (IF 2.809) Pub Date : 2011-07-30
    Susan M Shortreed,Eric Laber,Daniel J Lizotte,T Scott Stroup,Joelle Pineau,Susan A Murphy

    This paper highlights the role that reinforcement learning can play in the optimization of treatment policies for chronic illnesses. Before applying any off-the-shelf reinforcement learning methods in this setting, we must first tackle a number of challenges. We outline some of these challenges and present methods for overcoming them. First, we describe a multiple imputation approach to overcome the problem of missing data. Second, we discuss the use of function approximation in the context of a highly variable observation set. Finally, we discuss approaches to summarizing the evidence in the data for recommending a particular action and quantifying the uncertainty around the Q-function of the recommended policy. We present the results of applying these methods to real clinical trial data of patients with schizophrenia.

  • Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation.
    Mach. Learn. (IF 2.809) Pub Date : 2018-11-06
    Ioannis Tsamardinos,Elissavet Greasidou,Giorgos Borboudakis

    Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name the method Bootstrap Bias Corrected with Dropping CV (BBCD-CV) that is both efficient and provides accurate performance estimates.

  • Boosted Multivariate Trees for Longitudinal Data.
    Mach. Learn. (IF 2.809) Pub Date : 2017-12-19
    Amol Pande,Liang Li,Jeevanantham Rajeswaran,John Ehrlinger,Udaya B Kogalur,Eugene H Blackstone,Hemant Ishwaran

    Machine learning methods provide a powerful approach for analyzing longitudinal data in which repeated measurements are observed for a subject over time. We boost multivariate trees to fit a novel flexible semi-nonparametric marginal model for longitudinal data. In this model, features are assumed to be nonparametric, while feature-time interactions are modeled semi-nonparametrically utilizing P-splines with estimated smoothing parameter. In order to avoid overfitting, we describe a relatively simple in sample cross-validation method which can be used to estimate the optimal boosting iteration and which has the surprising added benefit of stabilizing certain parameter estimates. Our new multivariate tree boosting method is shown to be highly flexible, robust to covariance misspecification and unbalanced designs, and resistant to overfitting in high dimensions. Feature selection can be used to identify important features and feature-time interactions. An application to longitudinal data of forced 1-second lung expiratory volume (FEV1) for lung transplant patients identifies an important feature-time interaction and illustrates the ease with which our method can find complex relationships in longitudinal data.

  • The Effect of Splitting on Random Forests.
    Mach. Learn. (IF 2.809) Pub Date : 2015-04-01
    Hemant Ishwaran

    The effect of a splitting rule on random forests (RF) is systematically studied for regression and classification problems. A class of weighted splitting rules, which includes as special cases CART weighted variance splitting and Gini index splitting, are studied in detail and shown to possess a unique adaptive property to signal and noise. We show for noisy variables that weighted splitting favors end-cut splits. While end-cut splits have traditionally been viewed as undesirable for single trees, we argue for deeply grown trees (a trademark of RF) end-cut splitting is useful because: (a) it maximizes the sample size making it possible for a tree to recover from a bad split, and (b) if a branch repeatedly splits on noise, the tree minimal node size will be reached which promotes termination of the bad branch. For strong variables, weighted variance splitting is shown to possess the desirable property of splitting at points of curvature of the underlying target function. This adaptivity to both noise and signal does not hold for unweighted and heavy weighted splitting rules. These latter rules are either too greedy, making them poor at recognizing noisy scenarios, or they are overly ECP aggressive, making them poor at recognizing signal. These results also shed light on pure random splitting and show that such rules are the least effective. On the other hand, because randomized rules are desirable because of their computational efficiency, we introduce a hybrid method employing random split-point selection which retains the adaptive property of weighted splitting rules while remaining computational efficient.

  • Using random forests to diagnose aviation turbulence.
    Mach. Learn. (IF 2.809) Pub Date : 2014-01-01
    John K Williams

    Atmospheric turbulence poses a significant hazard to aviation, with severe encounters costing airlines millions of dollars per year in compensation, aircraft damage, and delays due to required post-event inspections and repairs. Moreover, attempts to avoid turbulent airspace cause flight delays and en route deviations that increase air traffic controller workload, disrupt schedules of air crews and passengers and use extra fuel. For these reasons, the Federal Aviation Administration and the National Aeronautics and Space Administration have funded the development of automated turbulence detection, diagnosis and forecasting products. This paper describes a methodology for fusing data from diverse sources and producing a real-time diagnosis of turbulence associated with thunderstorms, a significant cause of weather delays and turbulence encounters that is not well-addressed by current turbulence forecasts. The data fusion algorithm is trained using a retrospective dataset that includes objective turbulence reports from commercial aircraft and collocated predictor data. It is evaluated on an independent test set using several performance metrics including receiver operating characteristic curves, which are used for FAA turbulence product evaluations prior to their deployment. A prototype implementation fuses data from Doppler radar, geostationary satellites, a lightning detection network and a numerical weather prediction model to produce deterministic and probabilistic turbulence assessments suitable for use by air traffic managers, dispatchers and pilots. The algorithm is scheduled to be operationally implemented at the National Weather Service's Aviation Weather Center in 2014.

  • Enhancing understanding and improving prediction of severe weather through spatiotemporal relational learning.
    Mach. Learn. (IF 2.809) Pub Date : 2014-01-01
    Amy McGovern,David J Gagne,John K Williams,Rodger A Brown,Jeffrey B Basara

    Severe weather, including tornadoes, thunderstorms, wind, and hail annually cause significant loss of life and property. We are developing spatiotemporal machine learning techniques that will enable meteorologists to improve the prediction of these events by improving their understanding of the fundamental causes of the phenomena and by building skillful empirical predictive models. In this paper, we present significant enhancements of our Spatiotemporal Relational Probability Trees that enable autonomous discovery of spatiotemporal relationships as well as learning with arbitrary shapes. We focus our evaluation on two real-world case studies using our technique: predicting tornadoes in Oklahoma and predicting aircraft turbulence in the United States. We also discuss how to evaluate success for a machine learning algorithm in the severe weather domain, which will enable new methods such as ours to transfer from research to operations, provide a set of lessons learned for embedded machine learning applications, and discuss how to field our technique.

  • A unified statistical approach to non-negative matrix factorization and probabilistic latent semantic indexing.
    Mach. Learn. (IF 2.809) Pub Date : 2015-03-31
    Karthik Devarajan,Guoli Wang,Nader Ebrahimi

    Non-negative matrix factorization (NMF) is a powerful machine learning method for decomposing a high-dimensional nonnegative matrix V into the product of two nonnegative matrices, W and H, such that V ∼ W H. It has been shown to have a parts-based, sparse representation of the data. NMF has been successfully applied in a variety of areas such as natural language processing, neuroscience, information retrieval, image processing, speech recognition and computational biology for the analysis and interpretation of large-scale data. There has also been simultaneous development of a related statistical latent class modeling approach, namely, probabilistic latent semantic indexing (PLSI), for analyzing and interpreting co-occurrence count data arising in natural language processing. In this paper, we present a generalized statistical approach to NMF and PLSI based on Renyi's divergence between two non-negative matrices, stemming from the Poisson likelihood. Our approach unifies various competing models and provides a unique theoretical framework for these methods. We propose a unified algorithm for NMF and provide a rigorous proof of monotonicity of multiplicative updates for W and H. In addition, we generalize the relationship between NMF and PLSI within this framework. We demonstrate the applicability and utility of our approach as well as its superior performance relative to existing methods using real-life and simulated document clustering data.

Contents have been reproduced by permission of the publishers.
上海纽约大学William Glover