Hierarchical clustering with discrete latent variable models and the integrated classification likelihood Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-04-13
Etienne Côme, Nicolas Jouvin, Pierre Latouche, Charles Bouveyron
Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often dealt with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable

Nonlinear dimension reduction for conditional quantiles Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-03-23
Eliana Christou, Annabel Settle, Andreas Artemiou
In practice, data often display heteroscedasticity, making quantile regression (QR) a more appropriate methodology. Modeling the data, while maintaining a flexible nonparametric fitting, requires smoothing over a high-dimensional space which might not be feasible when the number of the predictor variables is large. This problem makes necessary the use of dimension reduction techniques for conditional

Learning multivariate shapelets with multilayer neural networks for interpretable time-series classification Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-03-04
Roberto Medico, Joeri Ruyssinck, Dirk Deschrijver, Tom Dhaene
Shapelets are discriminative subsequences extracted from time-series data. Classifiers using shapelets have proven to achieve performances competitive to state-of-the-art methods, while enhancing the model’s interpretability. While a lot of research has been done for univariate time-series shapelets, extensions for the multivariate setting have not yet received much attention. To extend shapelets-based

Robust regression with compositional covariates including cellwise outliers Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-02-24
Nikola Štefelová, Andreas Alfons, Javier Palarea-Albaladejo, Peter Filzmoser, Karel Hron
We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise

Sparse principal component regression via singular value decomposition approach Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-02-08
Shuichi Kawano
Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only explanatory variables, the principal components have no information about the response variable. To address this

PCA-KL: a parametric dimensionality reduction approach for unsupervised metric learning Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-01-07
Alexandre L. M. Levada
Dimensionality reduction algorithms are powerful mathematical tools for data analysis and visualization. In many pattern recognition applications, a feature extraction step is often required to mitigate the curse of the dimensionality, a collection of negative effects caused by an arbitrary increase in the number of features in classification tasks. Principal Component Analysis (PCA) is a classical

Functional data clustering by projection into latent generalized hyperbolic subspaces Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-01-07
Alex Sharp, Ryan Browne
We introduce a latent subspace model which facilitates model-based clustering of functional data. Flexible clustering is attained by imposing jointly generalized hyperbolic distributions on projections of basis expansion coefficients into group specific subspaces. The model acquires parsimony by assuming these subspaces are of relatively low dimension. Parameter estimation is done through a multi-cycle

A process framework for inducing and explaining Datalog theories Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-01-05
Mark Gromowski, Michael Siebers, Ute Schmid
With the increasing prevalence of Machine Learning in everyday life, a growing number of people will be provided with Machine-Learned assessments on a regular basis. We believe that human users interacting with systems based on Machine-Learned classifiers will demand and profit from the systems’ decisions being explained in an approachable and comprehensive way. We developed a general process framework

Automatic gait classification patterns in spastic hemiplegia Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2021-01-04
Ana Aguilera, Alberto Subero
Clinical gait analysis and the interpretation of related records are a powerful tool to aid clinicians in the diagnosis, treatment and prognosis of human gait disabilities. The aim of this study is to investigate kinematic, kinetic, and electromyographic (EMG) data from child patients with spastic hemiplegia (SH) in order to discover useful patterns in human gait. Data mining techniques and classification

A bivariate finite mixture growth model with selection Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-12-29
David Aristei, Silvia Bacci, Francesco Bartolucci, Silvia Pandolfi
A model is proposed to analyze longitudinal data where two response variables are available, one of which is a binary indicator of selection and the other is continuous and observed only if the first is equal to 1. The model also accounts for individual covariates and may be considered as a bivariate finite mixture growth model as it is based on three submodels: (i) a probit model for the selection

Adapted single-cell consensus clustering (adaSC3) Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-12-15
Cornelia Fuetterer, Thomas Augustin, Christiane Fuchs
The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists, but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies denote single-cell consensus clustering (SC3), proposed by

A Riemannian geometric framework for manifold learning of non-Euclidean data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-11-27
Cheongjae Jang, Yung-Kyun Noh, Frank Chongwoo Park
A growing number of problems in data analysis and classification involve data that are non-Euclidean. For such problems, a naive application of vector space analysis algorithms will produce results that depend on the choice of local coordinates used to parametrize the data. At the same time, many data analysis and classification problems eventually reduce to an optimization, in which the criteria being

Robust semiparametric inference for polytomous logistic regression with complex survey design Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-11-23
Elena Castilla, Abhik Ghosh, Nirian Martin, Leandro Pardo
Analyzing polytomous response from a complex survey scheme, like stratified or cluster sampling is very crucial in several socioeconomics applications. We present a class of minimum quasi weighted density power divergence estimators for the polytomous logistic regression model with such a complex survey. This family of semiparametric estimators is a robust generalization of the maximum quasi weighted

Predicting brand confusion in imagery markets based on deep learning of visual advertisement content Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-11-19
Atsuho Nakayama, Daniel Baier
In the consumer goods industry, unique brand positionings are assumed to be the road to success. They document product distinctiveness and so justify high prices. However, as products are getting more and more interchangeable, brand positionings must rely—at least partially—on supporting advertisements. Here, especially ads with visual content (e.g. photos, video clips) are able to connect brands with

Clustering of modal-valued symbolic data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-10-24
Nataša Kejžar, Simona Korenjak-Černe, Vladimir Batagelj
Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of

Sparse group fused lasso for model segmentation: a hybrid approach Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-10-22
David Degras
This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments

Better than the best? Answers via model ensemble in density-based clustering Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-10-02
Alessandro Casa, Luca Scrucca, Giovanna Menardi
With the recent growth in data availability and complexity, and the associated outburst of elaborate modelling approaches, model selection tools have become a lifeline, providing objective criteria to deal with this increasingly challenging landscape. In fact, basing predictions and inference on a single model may be limiting if not harmful; ensemble approaches, which combine different models, have

Editable machine learning models? A rule-based framework for user studies of explainability Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-09-11
Stanislav Vojíř, Tomáš Kliegr
So far, most user studies dealing with comprehensibility of machine learning models have used questionnaires or surveys to acquire input from participants. In this article, we argue that compared to questionnaires, the use of an adapted version of a real machine learning interface can yield a new level of insight into what attributes make a machine learning model interpretable, and why. Also, we argue

A comparison of instance-level counterfactual explanation algorithms for behavioral and textual data: SEDC, LIME-C and SHAP-C Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-09-02
Yanou Ramon, David Martens, Foster Provost, Theodoros Evgeniou
Predictive systems based on high-dimensional behavioral and textual data have serious comprehensibility and transparency issues: linear models require investigating thousands of coefficients, while the opaqueness of nonlinear models makes things worse. Counterfactual explanations are becoming increasingly popular for generating insight into model predictions. This study aligns the recently proposed

Mixtures of factor analyzers with scale mixtures of fundamental skew normal distributions Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-09-02
Sharon X. Lee, Tsung-I Lin, Geoffrey J. McLachlan
Mixtures of factor analyzers (MFA) provide a powerful tool for modelling high-dimensional datasets. In recent years, several generalizations of MFA have been developed where the normality assumption of the factors and/or of the errors were relaxed to allow for skewness in the data. However, due to the form of the adopted component densities, the distribution of the factors/errors in most of these models

A novel dictionary learning method based on total least squares approach with application in high dimensional biological data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-09-02
Parvaneh Parvasideh, Mansoor Rezghi
In recent years dictionary learning has become a favorite sparse feature extraction technique. Dictionary learning represents each data as a sparse combination of atoms (columns) of the dictionary matrix. Usually, the input data is contaminated by errors that affect the quality of the obtained dictionary and so sparse features. This effect is especially critical in applications with high dimensional

The GNG neural network in analyzing consumer behaviour patterns: empirical research on a purchasing behaviour processes realized by the elderly consumers Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-08-30
Kamila Migdał-Najman, Krzysztof Najman, Sylwia Badowska
The paper sheds light on the use of a self-learning GNG neural network for identification and exploration of the purchasing behaviour patterns. The test has been conducted on the data collected from consumers aged 60 years and over, with regard to three product purchases. The primary data used to explore the purchasing behaviour patterns was collected during a survey carried out among the elderly students

A perceptually optimised bivariate visualisation scheme for high-dimensional fold-change data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-08-18
André Müller, Ludwig Lausser, Adalbert Wilhelm, Timo Ropinski, Matthias Platzer, Heiko Neumann, Hans A. Kestler
Visualising data as diagrams using visual attributes such as colour, shape, size, and orientation is challenging. In particular, large data sets demand graphical display as an essential step in the analysis. In order to achieve comprehension often different attributes need to be displayed simultaneously. In this work a comprehensible bivariate, perceptually optimised visualisation scheme for high-dimensional

SEM-Tree hybrid models in the preferences analysis of the members of Polish households Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-08-10
Adam Sagan, Mariusz Łapczyński
The purpose of the paper is to identify the dimensions of the strategy of resources allocation of Polish households members and test the hypothesis concerning risky shift effect in the relationship between strategy of family decision making and trade-off in family scarce resources allocation. These dimensions were identified on the basis of nationwide empirical data gathered on a representative sample

Robust archetypoids for anomaly detection in big functional data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-08-03
Guillermo Vinue, Irene Epifanio
Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique to identify extreme observations in the periphery of the data cloud, both in classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose to use robust functional archetypoids and adjusted boxplot to pinpoint

Adaptive sparse group LASSO in quantile regression Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-07-29
Alvaro Mendez-Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo
This paper studies the introduction of sparse group LASSO (SGL) to the quantile regression framework. Additionally, a more flexible version, an adaptive SGL is proposed based on the adaptive idea, this is, the usage of adaptive weights in the penalization. Adaptive estimators are usually focused on the study of the oracle property under asymptotic and double asymptotic frameworks. A key step on the

Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-07-22
Kadri Umbleja, Manabu Ichino, Hiroyuki Yaguchi
Symbolic data is aggregated from bigger traditional datasets in order to hide entry specific details and to enable analysing large amounts of data, like big data, which would otherwise not be possible. Symbolic data may appear in many different but complex forms like intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data

On the use of quantile regression to deal with heterogeneity: the case of multiblock data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-07-19
Cristina Davino, Rosaria Romano, Domenico Vistocco
The aim of the paper is to propose a quantile regression based strategy to assess heterogeneity in a multiblock type data structure. Specifically, the paper deals with a particular data structure where several blocks of variables are observed on the same units and a structure of relations is assumed between the different blocks. The idea is that quantile regression complements the results of the least

A bias-variance analysis of state-of-the-art random forest text classifiers Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-07-19
Thiago Salles, Leonardo Rocha, Marcos Gonçalves
Random forest (RF) classifiers do excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization capabilities under such challenging scenario, including

Active learning of constraints for weighted feature selection Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-07-10
Samah Hijazi, Denis Hamad, Mariam Kalakech, Ali Kalakech
Pairwise constraints, a cheaper kind of supervision information that does not need to reveal the class labels of data points, were initially suggested to enhance the performance of clustering algorithms. Recently, researchers were interested in using them for feature selection. However, in most current methods, pairwise constraints are provided passively and generated randomly over multiple algorithmic

A stochastic block model for interaction lengths Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-18
Riccardo Rastelli, Michael Fop
We propose a new stochastic block model that focuses on the analysis of interaction lengths in dynamic networks. The model does not rely on a discretization of the time dimension and may be used to analyze networks that evolve continuously over time. The framework relies on a clustering structure on the nodes, whereby two nodes belonging to the same latent group tend to create interactions and non-interactions

Regime dependent interconnectedness among fuzzy clusters of financial time series Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-16
Giovanni De Luca, Paola Zuccolotto
We analyze the dynamic structure of lower tail dependence coefficients within groups of assets defined such that assets belonging to the same group are characterized by pairwise high associations between extremely low values. The groups are identified by means of a fuzzy cluster analysis algorithm. The tail dependence coefficients are estimated using the Joe–Clayton copula function, and the 75th percentile

M-estimators and trimmed means: from Hilbert-valued to fuzzy set-valued data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-12
Beatriz Sinova, Stefan Van Aelst, Pedro Terán
Different approaches to robustly measure the location of data associated with a random experiment have been proposed in the literature, with the aim of avoiding the high sensitivity to outliers or data changes typical for the mean. In particular, M-estimators and trimmed means have been studied in general spaces, and can be used to handle Hilbert-valued data. Both alternatives are of interest due to

ParticleMDI: particle Monte Carlo methods for the cluster analysis of multiple datasets with applications to cancer subtype identification Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-12
Nathan Cunningham, Jim E. Griffin, David L. Wild
We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to

Isotonic boosting classification rules Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-12
David Conde, Miguel A. Fernández, Cristina Rueda, Bonifacio Salvador
In many real classification problems a monotone relation between some predictors and the classes may be assumed when higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms are

Chained correlations for feature selection Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-06-09
Ludwig Lausser, Robin Szekely, Hans A. Kestler
Data-driven algorithms stand and fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings (\(n \gg m\)). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as arise in differential diagnosis, are usually

The ultrametric correlation matrix for modelling hierarchical latent concepts Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-28
Carlo Cavicchia, Maurizio Vichi, Giorgia Zaccaria
Many relevant multidimensional phenomena are defined by nested latent concepts, which can be represented by a tree-structure supposing a hierarchical relationship among manifest variables. The root of the tree is a general concept which includes more specific ones. The aim of the paper is to reconstruct an observed data correlation matrix of manifest variables through an ultrametric correlation matrix

Data generation for composite-based structural equation modeling methods Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-26
Rainer Schlittgen, Marko Sarstedt, Christian M. Ringle
Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack

Simultaneous dimension reduction and clustering via the NMF-EM algorithm Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-25
Léna Carel, Pierre Alquier
Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters is large, the estimation of the clusters become challenging, as well as their interpretation. Restriction on the parameters can be used to reduce the dimension. An example is given by mixture of factor analyzers for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is

Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-25
Laura Anderlucci, Cinzia Viroli
Topic detection in short textual data is a challenging task due to its representation as high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the base of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams consists in considering a mixture of multinomial distributions over the

Clustering discrete-valued time series Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-20
Tyler Roick, Dimitris Karlis, Paul D. McNicholas
There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model

Gaussian mixture modeling and model-based clustering under measurement inconsistency Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-05-12
Shuchismita Sarkar, Volodymyr Melnykov, Rong Zheng
Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence

Semiparametric mixtures of regressions with single-index for model based clustering Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-04-23
Sijia Xiang, Weixin Yao
In this article, we propose two classes of semiparametric mixture regression models with single-index for model based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low dimensional predictors, the new semiparametric models can easily incorporate high dimensional predictors into the nonparametric components. The proposed models are very general

Mixture modeling of data with multiple partial right-censoring levels Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-04-21
Semhar Michael, Tatjana Miljkovic, Volodymyr Melnykov
In this paper, a new flexible approach to modeling data with multiple partial right-censoring points is proposed. This method is based on finite mixture models, a flexible tool to model heterogeneity in data. A general framework to accommodate partial censoring is considered. In this setting, it is assumed that a certain portion of data points are censored and the rest are not. This situation occurs

Kappa coefficients for dichotomous-nominal classifications Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-04-07
Matthijs J. Warrens
Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an ‘absence’ category (for example, no disorder), whereas the second type does include an ‘absence’ category. Cohen’s unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories

A cost-sensitive constrained Lasso Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-03-12
Rafael Blanquero, Emilio Carrizosa, Pepa Ramírez-Cobo, M. Remedios Sillero-Denamiel
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the accuracy prediction on certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added

A novel semi-supervised support vector machine with asymmetric squared loss Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-03-10
Huimin Pei, Qiang Lin, Liran Yang, Ping Zhong
Laplacian support vector machine (LapSVM), which is based on the semi-supervised manifold regularization learning framework, performs better than the standard SVM, especially for the case where the supervised information is insufficient. However, the use of hinge loss leads to the sensitivity of LapSVM to noise around the decision boundary. To enhance the performance of LapSVM, we present a novel semi-supervised

Efficient regularized spectral data embedding Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-02-24
Lazhar Labiod, Mohamed Nadif
Data embedding (DE) or dimensionality reduction techniques are particularly well suited to embedding high-dimensional data into a space that in most cases will have just two dimensions. Low-dimensional space, in which data samples (data points) can more easily be visualized, is also often used for learning methods such as clustering. Sometimes, however, DE will identify dimensions that contribute little

A combination of k-means and DBSCAN algorithm for solving the multiple generalized circle detection problem Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-02-12
Rudolf Scitovski, Kristian Sabo
Motivated by the problem of identifying rod-shaped particles (e.g. bacilliform bacterium), in this paper we consider the multiple generalized circle detection problem. We propose a method for solving this problem that is based on center-based clustering, where cluster-centers are generalized circles. An efficient algorithm is proposed which is based on a modification of the well-known k-means algorithm

A robust spatial autoregressive scalar-on-function regression with t-distribution Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-01-29
Tingting Huang, Gilbert Saporta, Huiwen Wang, Shanshan Wang
Modelling functional data in the presence of spatial dependence is of great practical importance as exemplified by applications in the fields of demography, economy and geography, and has received much attention recently. However, for the classical scalar-on-function regression (SoFR) with functional covariates and scalar responses, only a relatively few literature is dedicated to this relevant area

From-below Boolean matrix factorization algorithm based on MDL Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-01-08
Tatiana Makhalova, Martin Trnecka
During the past few years Boolean matrix factorization (BMF) has become an important direction in data analysis. The minimum description length principle (MDL) was successfully adapted in BMF for the model order selection. Nevertheless, a BMF algorithm performing good results w.r.t. standard measures in BMF is missing. In this paper, we propose a novel from-below Boolean matrix factorization algorithm

Interval forecasts based on regression trees for streaming data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-12-18
Xin Zhao, Stuart Barber, Charles C. Taylor, Zoka Milan
In forecasting, we often require interval forecasts instead of just a specific point forecast. To track streaming data effectively, this interval forecast should reliably cover the observed data and yet be as narrow as possible. To achieve this, we propose two methods based on regression trees: one ensemble method and one method based on a single tree. For the ensemble method, we use weighted results

Data projections by skewness maximization under scale mixtures of skew-normal vectors Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2020-03-10
Jorge M. Arevalillo, Hilario Navarro
Multivariate scale mixtures of skew-normal distributions are flexible models that account for the non-normality of data by means of a tail weight parameter and a shape vector representing the asymmetry of the model in a directional fashion. Its stochastic representation involves a skew-normal vector and a non negative mixing scalar variable, independent of the skew-normal vector, that injects tail

Modelling heterogeneity: on the problem of group comparisons with logistic regression and the potential of the heterogeneous choice model Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-12-13
Gerhard Tutz
The comparison of coefficients of logit models obtained for different groups is widely considered as problematic because of possible heterogeneity of residual variances in latent variables. It is shown that the heterogeneous logit model can be used to account for this type of heterogeneity by considering reduced models that are identified. A model selection strategy is proposed that can distinguish

A stable cardinality distance for topological classification Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-11-27
Vasileios Maroulas, Cassie Putman Micucci, Adam Spannaus
This work incorporates topological features via persistence diagrams to classify point cloud data arising from materials science. Persistence diagrams are multisets summarizing the connectedness and holes of given data. A new distance on the space of persistence diagrams generates relevant input features for a classification algorithm for materials science data. This distance measures the similarity

Rank tests for functional data based on the epigraph, the hypograph and associated graphical representations Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-11-27
Alba M. Franco Pereira, Rosa E. Lillo
Visualization techniques are very useful in data analysis. Their aim is to summarize information into a graph or a plot. In particular, visualization is especially interesting when one has functional data, where there is no total order between the data of a sample. Taking into account the information provided by the down–upward partial orderings based on the hypograph and the epigraph indexes, we propose

IsClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-11-27
Khadidja Henni, Pierre-Yves Louis, Brigitte Vannier, Ahmed Moussa
Clustering via marked point processes and influence space, IsClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point process of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional

Mixtures of skewed matrix variate bilinear factor analyzers Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-11-21
Michael P. B. Gallaugher, Paul D. McNicholas
In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate

Sparse classification with paired covariates Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-11-15
Armin Rauschenberger, Iuliana Ciocănea-Teodorescu, Marianne A. Jonker, Renée X. Menezes, Mark A. van de Wiel
This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same

Gaussian parsimonious clustering models with covariates and a noise component Adv. Data Anal. Classif. (IF 1.603) Pub Date: 2019-09-20
Keefe Murphy, Thomas Brendan Murphy
We consider model-based clustering methods for continuous, correlated data that account for external information available in the presence of mixed-type fixed covariates by proposing the MoEClust suite of models. These models allow different subsets of covariates to influence the component weights and/or component densities by modelling the parameters of the mixture as functions of the covariates.