Current journal: Advances in Data Analysis and Classification
  • The ultrametric correlation matrix for modelling hierarchical latent concepts
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-28
    Carlo Cavicchia, Maurizio Vichi, Giorgia Zaccaria

    Many relevant multidimensional phenomena are defined by nested latent concepts, which can be represented by a tree-structure supposing a hierarchical relationship among manifest variables. The root of the tree is a general concept which includes more specific ones. The aim of the paper is to reconstruct an observed data correlation matrix of manifest variables through an ultrametric correlation matrix

  • Data generation for composite-based structural equation modeling methods
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-26
    Rainer Schlittgen, Marko Sarstedt, Christian M. Ringle

    Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack

  • Simultaneous dimension reduction and clustering via the NMF-EM algorithm
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-25
    Léna Carel, Pierre Alquier

    Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters are large, the estimation of the clusters becomes challenging, as does their interpretation. Restrictions on the parameters can be used to reduce the dimension. An example is given by mixture of factor analyzers (MFA) for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is

  • Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-25
    Laura Anderlucci, Cinzia Viroli

    Topic detection in short textual data is a challenging task due to its representation as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams consists in considering a mixture of multinomial distributions over the

  • Clustering discrete-valued time series
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-20
    Tyler Roick, Dimitris Karlis, Paul D. McNicholas

    There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model

  • Gaussian mixture modeling and model-based clustering under measurement inconsistency
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-05-12
    Shuchismita Sarkar, Volodymyr Melnykov, Rong Zheng

    Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence

  • Semiparametric mixtures of regressions with single-index for model based clustering
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-04-23
    Sijia Xiang, Weixin Yao

    In this article, we propose two classes of semiparametric mixture regression models with single-index for model based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low dimensional predictors, the new semiparametric models can easily incorporate high dimensional predictors into the nonparametric components. The proposed models are very general

  • Mixture modeling of data with multiple partial right-censoring levels
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-04-21
    Semhar Michael, Tatjana Miljkovic, Volodymyr Melnykov

    In this paper, a new flexible approach to modeling data with multiple partial right-censoring points is proposed. This method is based on finite mixture models, a flexible tool for modeling heterogeneity in data. A general framework to accommodate partial censoring is considered. In this setting, it is assumed that a certain portion of data points are censored and the rest are not. This situation occurs

  • Kappa coefficients for dichotomous-nominal classifications
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-04-07
    Matthijs J. Warrens

    Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an ‘absence’ category (for example, no disorder), whereas the second type does include an ‘absence’ category. Cohen’s unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories

  • A cost-sensitive constrained Lasso
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-03-12
    Rafael Blanquero, Emilio Carrizosa, Pepa Ramírez-Cobo, M. Remedios Sillero-Denamiel

    The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the prediction accuracy for certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added

  • Data projections by skewness maximization under scale mixtures of skew-normal vectors
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-03-10
    Jorge M. Arevalillo, Hilario Navarro

    Multivariate scale mixtures of skew-normal distributions are flexible models that account for the non-normality of data by means of a tail weight parameter and a shape vector representing the asymmetry of the model in a directional fashion. Its stochastic representation involves a skew-normal vector and a non negative mixing scalar variable, independent of the skew-normal vector, that injects tail

  • A novel semi-supervised support vector machine with asymmetric squared loss
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-03-10
    Huimin Pei, Qiang Lin, Liran Yang, Ping Zhong

    Laplacian support vector machine (LapSVM), which is based on the semi-supervised manifold regularization learning framework, performs better than the standard SVM, especially for the case where the supervised information is insufficient. However, the use of hinge loss leads to the sensitivity of LapSVM to noise around the decision boundary. To enhance the performance of LapSVM, we present a novel semi-supervised

  • Count regression trees
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-05-10
    Nan-Ting Liu, Feng-Chang Lin, Yu-Shan Shih

    Count data frequently appear in many scientific studies. In this article, we propose a regression tree method called CORE for analyzing such data. At each node, besides a Poisson regression, a count regression such as hurdle, negative binomial, or zero-inflated regression which can accommodate over-dispersion and/or excess zeros is fitted. A likelihood-based procedure is suggested to select split variables

  • A fragmented-periodogram approach for clustering big data time series
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-06-14
    Jorge Caiado, Nuno Crato, Pilar Poncela

    We propose and study a new frequency-domain procedure for characterizing and comparing large sets of long time series. Instead of using all the information available from data, which would be computationally very expensive, we propose some regularization rules in order to select and summarize the most relevant information for clustering purposes. Essentially, we suggest using a fragmented periodogram

  • Data clustering based on principal curves
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-06-11
    Elson Claudio Correa Moraes, Danton Diego Ferreira, Giovani Bernardes Vitor, Bruno Henrique Groenner Barbosa

    In this contribution we present a new method for data clustering based on principal curves. Principal curves consist of a nonlinear generalization of principal component analysis and may also be regarded as continuous versions of 1D self-organizing maps. The proposed method implements the k-segment algorithm for principal curves extraction. Then, the method divides the principal curves into two or

  • Classification using sequential order statistics
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-08-07
    Alexander Katzur, Udo Kamps

    Whereas discrimination methods and their error probabilities were broadly investigated for common data distributions such as the multivariate normal or t-distributions, this paper considers the case when the recorded data are assumed to be observations from sequential order statistics. Random vectors of sequential order statistics describe, e.g., successive failures in a k-out-of-n system or in other

  • Learning a metric when clustering data points in the presence of constraints
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-05-16
    Ahmad Ali Abin, Mohammad Ali Bashiri, Hamid Beigy

    Learning an appropriate distance measure under supervision of side information has become a topic of significant interest within the machine learning community. In this paper, we address the problem of metric learning for constrained clustering by considering three important issues: (1) considering importance degree for constraints, (2) preserving the topological structure of data, and (3) preserving some

  • How well do SEM algorithms imitate EM algorithms? A non-asymptotic analysis for mixture models
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-07-10
    Johannes Blömer, Sascha Brauer, Kathrin Bujna, Daniel Kuntze

    In this paper, we present a theoretical and an experimental comparison of EM and SEM algorithms for different mixture models. The SEM algorithm is a stochastic variant of the EM algorithm. The qualitative intuition behind the SEM algorithm is simple: If the number of observations is large enough, then we expect that an update step of the stochastic SEM algorithm is similar to the corresponding update

  • Ensemble of optimal trees, random forest and random projection ensemble classification
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-06-12
    Zardad Khan, Asma Gul, Aris Perperoglou, Miftahuddin Miftahuddin, Osama Mahmoud, Werner Adler, Berthold Lausen

    The predictive performance of a random forest ensemble is highly associated with the strength of individual trees and their diversity. An ensemble of a small number of accurate and diverse trees, if prediction accuracy is not compromised, will also reduce computational burden. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we utilize out-of-bag observations

  • Clustering genomic words in human DNA using peaks and trends of distributions
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-05-31
    Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

    In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended

  • Optimal arrangements of hyperplanes for SVM-based multiclass classification
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-07-26
    Víctor Blanco, Alberto Japón, Justo Puerto

    In this paper, we present a novel SVM-based approach to construct multiclass classifiers by means of arrangements of hyperplanes. We propose different mixed integer (linear and non linear) programming formulations for the problem using extensions of widely used measures for misclassifying observations where the kernel trick can be adapted to be applicable. Some dimensionality reductions and variable

  • Efficient regularized spectral data embedding
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-02-24
    Lazhar Labiod, Mohamed Nadif

    Data embedding (DE) or dimensionality reduction techniques are particularly well suited to embedding high-dimensional data into a space that in most cases will have just two dimensions. Low-dimensional space, in which data samples (data points) can more easily be visualized, is also often used for learning methods such as clustering. Sometimes, however, DE will identify dimensions that contribute little

  • A combination of k-means and DBSCAN algorithm for solving the multiple generalized circle detection problem
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-02-12
    Rudolf Scitovski, Kristian Sabo

    Motivated by the problem of identifying rod-shaped particles (e.g. bacilliform bacterium), in this paper we consider the multiple generalized circle detection problem. We propose a method for solving this problem that is based on center-based clustering, where cluster-centers are generalized circles. An efficient algorithm is proposed which is based on a modification of the well-known k-means algorithm

  • A robust spatial autoregressive scalar-on-function regression with t-distribution
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-01-29
    Tingting Huang, Gilbert Saporta, Huiwen Wang, Shanshan Wang

    Modelling functional data in the presence of spatial dependence is of great practical importance as exemplified by applications in the fields of demography, economy and geography, and has received much attention recently. However, for the classical scalar-on-function regression (SoFR) with functional covariates and scalar responses, relatively little literature is dedicated to this relevant area

  • From-below Boolean matrix factorization algorithm based on MDL
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2020-01-08
    Tatiana Makhalova, Martin Trnecka

    During the past few years Boolean matrix factorization (BMF) has become an important direction in data analysis. The minimum description length principle (MDL) was successfully adapted in BMF for the model order selection. Nevertheless, a BMF algorithm achieving good results w.r.t. standard measures in BMF is missing. In this paper, we propose a novel from-below Boolean matrix factorization algorithm

  • Interval forecasts based on regression trees for streaming data
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-12-18
    Xin Zhao, Stuart Barber, Charles C. Taylor, Zoka Milan

    In forecasting, we often require interval forecasts instead of just a specific point forecast. To track streaming data effectively, this interval forecast should reliably cover the observed data and yet be as narrow as possible. To achieve this, we propose two methods based on regression trees: one ensemble method and one method based on a single tree. For the ensemble method, we use weighted results

  • Modelling heterogeneity: on the problem of group comparisons with logistic regression and the potential of the heterogeneous choice model
    Adv. Data Anal. Classif. (IF 2.098) Pub Date : 2019-12-13
    Gerhard Tutz

    The comparison of coefficients of logit models obtained for different groups is widely considered problematic because of possible heterogeneity of residual variances in latent variables. It is shown that the heterogeneous logit model can be used to account for this type of heterogeneity by considering reduced models that are identified. A model selection strategy is proposed that can distinguish

  • From here to infinity: sparse finite versus Dirichlet process mixtures in model-based clustering
    Adv. Data Anal. Classif. Pub Date : 2019-04-23
    Sylvia Frühwirth-Schnatter, Gertraud Malsiner-Walli

    In model-based clustering, mixture models are used to group data points into clusters. A useful concept introduced for Gaussian mixtures by Malsiner Walli et al. (Stat Comput 26:303-324, 2016) is that of sparse finite mixtures, where the prior distribution on the weight distribution of a mixture with K components is chosen in such a way that a priori the number of clusters in the data is random and is allowed

  • Ensemble of a subset of kNN classifiers
    Adv. Data Anal. Classif. Pub Date : 2018-01-01
    Asma Gul, Aris Perperoglou, Zardad Khan, Osama Mahmoud, Miftahuddin Miftahuddin, Werner Adler, Berthold Lausen

    Combining multiple classifiers, known as ensemble methods, can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample

  • Improved initialisation of model-based clustering using Gaussian hierarchical partitions
    Adv. Data Anal. Classif. Pub Date : 2016-03-08
    Luca Scrucca, Adrian E Raftery

    Initialisation of the EM algorithm in model-based clustering is often crucial. Various starting points in the parameter space often lead to different local maxima of the likelihood function and so to different clustering partitions. Among the several approaches available in the literature, model-based agglomerative hierarchical clustering is used to provide initial partitions in the popular mclust

  • Assessing and accounting for time heterogeneity in stochastic actor oriented models
    Adv. Data Anal. Classif. Pub Date : 2011-10-18
    Joshua A Lospinoso, Michael Schweinberger, Tom A B Snijders, Ruth M Ripley

    This paper explores time heterogeneity in stochastic actor oriented models (SAOMs) proposed by Snijders (Sociological Methodology. Blackwell, Boston, pp 361-395, 2001), which are meant to study the evolution of networks. SAOMs model social networks as directed graphs with nodes representing people, organizations, etc., and dichotomous relations representing underlying relationships of friendship, advice

Contents have been reproduced by permission of the publishers.