A new three-step method for using inverse propensity weighting with latent class analysis Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-07-23
F. J. Clouth, S. Pauws, F. Mols, J. K. Vermunt
Bias-adjusted three-step latent class analysis (LCA) is widely popular to relate covariates to class membership. However, if the causal effect of a treatment on class membership is of interest and only observational data is available, causal inference techniques such as inverse propensity weighting (IPW) need to be used. In this article, we extend the bias-adjusted three-step LCA to incorporate IPW

A von Mises–Fisher mixture model for clustering numerical and categorical variables Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-07-10
Xavier Bry, Lionel Cucala
This work presents a mixture model allowing to cluster variables of different types. All variables being measured on the same n statistical units, we first represent every variable with a unit-norm operator in \(\mathbb{R}^{n\times n}\) endowed with an appropriate inner product. We propose a von Mises–Fisher mixture model on the unit-sphere containing these operators. The parameters of the mixture

Robust clustering via mixtures of t factor analyzers with incomplete data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-07-05
Wan-Lun Wang, Tsung-I Lin
Mixtures of t factor analyzers (MtFA) are powerful and widely used tools for robust clustering of high-dimensional data in the presence of outliers. However, the occurrence of missing values may cause analytical intractability and computational complexity when fitting the MtFA model. We explicitly derive the score vector and Hessian matrix of the MtFA model with incomplete data to approximate the information

Association measures for interval variables Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-07-03
M. Rosário Oliveira, Margarida Azeitona, António Pacheco, Rui Valadas
Symbolic Data Analysis (SDA) is a relatively new field of statistics that extends conventional data analysis by taking into account intrinsic data variability and structure. Unlike conventional data analysis, in SDA the features characterizing the data can be multivalued, such as intervals or histograms. SDA has been mainly approached from a sampling perspective. In this work, we propose a model that

A fingerprint of a heterogeneous data set Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-07-03
Matteo Spallanzani, Gueorgui Mihaylov, Marco Prato, Roberto Fontana
In this paper, we describe the fingerprint method, a technique to classify bags of mixed-type measurements. The method was designed to solve a real-world industrial problem: classifying industrial plants (individuals at a higher level of organisation) starting from the measurements collected from their production lines (individuals at a lower level of organisation). In this specific application, the

Model-based clustering for random hypergraphs Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-06-28
Tin Lok James Ng, Thomas Brendan Murphy
A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher order interactions among objects in real-world problems. This model is an extension of the latent class analysis model that introduces two clustering structures for hyperedges and captures variation in the size of hyperedges. An expectation maximization algorithm with minorization maximization steps is

Prediction of brand stories spreading on social networks Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-06-18
Thi Bich Ngoc Hoang, Josiane Mothe
Online social network is a major media for many types of information communication. Although the primary purpose of social networks is to connect people, they are more and more used in online marketing to connect businesses with customers as well as to connect customers amongst themselves. Brand stories generated by consumers or businesses can be easily and widely spread. As a result, those stories

Finite mixture modeling of censored and missing data using the multivariate skew-normal distribution Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-06-17
Francisco H. C. de Alencar, Christian E. Galarza, Larissa A. Matos, Victor H. Lachos
Finite mixture models have been widely used to model and analyze data from heterogeneous populations. Moreover, data of this kind can be missing or subject to some upper and/or lower detection limits because of the constraints of experimental apparatuses. Another complication arises when measures of each population depart significantly from normality, such as asymmetric behavior. For such data structures

Model-based two-way clustering of second-level units in ordinal multilevel latent Markov models Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-06-15
Giorgio Eduardo Montanari, Marco Doretti, Maria Francesca Marino
In this paper, an ordinal multilevel latent Markov model based on separate random effects is proposed. In detail, two distinct second-level discrete effects are considered in the model, one affecting the initial probability vector and the other affecting the transition probability matrix of the first-level ordinal latent Markov process. To model these separate effects, we consider a bidimensional

Estimating the class prior for positive and unlabelled data via logistic regression Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-06-03
Małgorzata Łazęcka, Jan Mielniczuk, Paweł Teisseyre
In the paper, we revisit the problem of class prior probability estimation with positive and unlabelled data gathered in a single-sample scenario. The task is important as it is known that in positive unlabelled setting, a classifier can be successfully learned if the class prior is available. We show that without additional assumptions, class prior probability is not identifiable and thus the existing

Consensus among preference rankings: a new weighted correlation coefficient for linear and weak orderings Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-05-28
Antonella Plaia, Simona Buscemi, Mariangela Sciandra
Preference data are a particular type of ranking data where some subjects (voters, judges,...) express their preferences over a set of alternatives (items). In most real life cases, some items receive the same preference by a judge, thus giving rise to a ranking with ties. An important issue involving rankings concerns the aggregation of the preferences into a “consensus”. The purpose of this paper

REMAXINT: a two-mode clustering-based method for statistical inference on two-way interaction Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-04-27
Zaheer Ahmed, Alberto Cassese, Gerard van Breukelen, Jan Schepers
We present a novel method, REMAXINT, that captures the gist of two-way interaction in row by column (i.e., two-mode) data, with one observation per cell. REMAXINT is a probabilistic two-mode clustering model that yields two-mode partitions with maximal interaction between row and column clusters. For estimation of the parameters of REMAXINT, we maximize a conditional classification likelihood in which

Hierarchical clustering with discrete latent variable models and the integrated classification likelihood Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-04-13
Etienne Côme, Nicolas Jouvin, Pierre Latouche, Charles Bouveyron
Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often dealt with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to every discrete latent variable

Nonlinear dimension reduction for conditional quantiles Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-03-23
Eliana Christou, Annabel Settle, Andreas Artemiou
In practice, data often display heteroscedasticity, making quantile regression (QR) a more appropriate methodology. Modeling the data, while maintaining a flexible nonparametric fitting, requires smoothing over a high-dimensional space which might not be feasible when the number of the predictor variables is large. This problem makes necessary the use of dimension reduction techniques for conditional

Learning multivariate shapelets with multilayer neural networks for interpretable time-series classification Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-03-04
Roberto Medico, Joeri Ruyssinck, Dirk Deschrijver, Tom Dhaene
Shapelets are discriminative subsequences extracted from time-series data. Classifiers using shapelets have proven to achieve performances competitive to state-of-the-art methods, while enhancing the model’s interpretability. While a lot of research has been done for univariate time-series shapelets, extensions for the multivariate setting have not yet received much attention. To extend shapelets-based

Robust regression with compositional covariates including cellwise outliers Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-02-24
Nikola Štefelová, Andreas Alfons, Javier Palarea-Albaladejo, Peter Filzmoser, Karel Hron
We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise

Sparse principal component regression via singular value decomposition approach Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-02-08
Shuichi Kawano
Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only explanatory variables, the principal components have no information about the response variable. To address this

PCA-KL: a parametric dimensionality reduction approach for unsupervised metric learning Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-01-07
Alexandre L. M. Levada
Dimensionality reduction algorithms are powerful mathematical tools for data analysis and visualization. In many pattern recognition applications, a feature extraction step is often required to mitigate the curse of the dimensionality, a collection of negative effects caused by an arbitrary increase in the number of features in classification tasks. Principal Component Analysis (PCA) is a classical

Functional data clustering by projection into latent generalized hyperbolic subspaces Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-01-07
Alex Sharp, Ryan Browne
We introduce a latent subspace model which facilitates model-based clustering of functional data. Flexible clustering is attained by imposing jointly generalized hyperbolic distributions on projections of basis expansion coefficients into group specific subspaces. The model acquires parsimony by assuming these subspaces are of relatively low dimension. Parameter estimation is done through a multicycle

A process framework for inducing and explaining Datalog theories Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-01-05
Mark Gromowski, Michael Siebers, Ute Schmid
With the increasing prevalence of Machine Learning in everyday life, a growing number of people will be provided with Machine-Learned assessments on a regular basis. We believe that human users interacting with systems based on Machine-Learned classifiers will demand and profit from the systems’ decisions being explained in an approachable and comprehensive way. We developed a general process framework

Automatic gait classification patterns in spastic hemiplegia Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2021-01-04
Ana Aguilera, Alberto Subero
Clinical gait analysis and the interpretation of related records are a powerful tool to aid clinicians in the diagnosis, treatment and prognosis of human gait disabilities. The aim of this study is to investigate kinematic, kinetic, and electromyographic (EMG) data from child patients with spastic hemiplegia (SH) in order to discover useful patterns in human gait. Data mining techniques and classification

A bivariate finite mixture growth model with selection Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-12-29
David Aristei, Silvia Bacci, Francesco Bartolucci, Silvia Pandolfi
A model is proposed to analyze longitudinal data where two response variables are available, one of which is a binary indicator of selection and the other is continuous and observed only if the first is equal to 1. The model also accounts for individual covariates and may be considered as a bivariate finite mixture growth model as it is based on three submodels: (i) a probit model for the selection

Adapted single-cell consensus clustering (adaSC3) Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-12-15
Cornelia Fuetterer, Thomas Augustin, Christiane Fuchs
The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists, but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies denote single-cell consensus clustering (SC3), proposed by

A Riemannian geometric framework for manifold learning of non-Euclidean data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-11-27
Cheongjae Jang, Yung-Kyun Noh, Frank Chongwoo Park
A growing number of problems in data analysis and classification involve data that are non-Euclidean. For such problems, a naive application of vector space analysis algorithms will produce results that depend on the choice of local coordinates used to parametrize the data. At the same time, many data analysis and classification problems eventually reduce to an optimization, in which the criteria being

Robust semiparametric inference for polytomous logistic regression with complex survey design Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-11-23
Elena Castilla, Abhik Ghosh, Nirian Martin, Leandro Pardo
Analyzing polytomous response from a complex survey scheme, like stratified or cluster sampling is very crucial in several socioeconomics applications. We present a class of minimum quasi weighted density power divergence estimators for the polytomous logistic regression model with such a complex survey. This family of semiparametric estimators is a robust generalization of the maximum quasi weighted

Predicting brand confusion in imagery markets based on deep learning of visual advertisement content Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-11-19
Atsuho Nakayama, Daniel Baier
In the consumer goods industry, unique brand positionings are assumed to be the road to success. They document product distinctiveness and so justify high prices. However, as products are getting more and more interchangeable, brand positionings must rely, at least partially, on supporting advertisements. Here, especially ads with visual content (e.g. photos, video clips) are able to connect brands with

Clustering of modal-valued symbolic data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-10-24
Nataša Kejžar, Simona Korenjak-Černe, Vladimir Batagelj
Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of

Sparse group fused lasso for model segmentation: a hybrid approach Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-10-22
David Degras
This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments

Better than the best? Answers via model ensemble in density-based clustering Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-10-02
Alessandro Casa, Luca Scrucca, Giovanna Menardi
With the recent growth in data availability and complexity, and the associated outburst of elaborate modelling approaches, model selection tools have become a lifeline, providing objective criteria to deal with this increasingly challenging landscape. In fact, basing predictions and inference on a single model may be limiting if not harmful; ensemble approaches, which combine different models, have

Editable machine learning models? A rule-based framework for user studies of explainability Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-09-11
Stanislav Vojíř, Tomáš Kliegr
So far, most user studies dealing with comprehensibility of machine learning models have used questionnaires or surveys to acquire input from participants. In this article, we argue that compared to questionnaires, the use of an adapted version of a real machine learning interface can yield a new level of insight into what attributes make a machine learning model interpretable, and why. Also, we argue

A comparison of instance-level counterfactual explanation algorithms for behavioral and textual data: SEDC, LIME-C and SHAP-C Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-09-02
Yanou Ramon, David Martens, Foster Provost, Theodoros Evgeniou
Predictive systems based on high-dimensional behavioral and textual data have serious comprehensibility and transparency issues: linear models require investigating thousands of coefficients, while the opaqueness of nonlinear models makes things worse. Counterfactual explanations are becoming increasingly popular for generating insight into model predictions. This study aligns the recently proposed

Mixtures of factor analyzers with scale mixtures of fundamental skew normal distributions Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-09-02
Sharon X. Lee, Tsung-I Lin, Geoffrey J. McLachlan
Mixtures of factor analyzers (MFA) provide a powerful tool for modelling high-dimensional datasets. In recent years, several generalizations of MFA have been developed where the normality assumption of the factors and/or of the errors were relaxed to allow for skewness in the data. However, due to the form of the adopted component densities, the distribution of the factors/errors in most of these models

A novel dictionary learning method based on total least squares approach with application in high dimensional biological data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-09-02
Parvaneh Parvasideh, Mansoor Rezghi
In recent years dictionary learning has become a favorite sparse feature extraction technique. Dictionary learning represents each data as a sparse combination of atoms (columns) of the dictionary matrix. Usually, the input data is contaminated by errors that affect the quality of the obtained dictionary and so sparse features. This effect is especially critical in applications with high dimensional

The GNG neural network in analyzing consumer behaviour patterns: empirical research on a purchasing behaviour processes realized by the elderly consumers Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-08-30
Kamila Migdał-Najman, Krzysztof Najman, Sylwia Badowska
The paper sheds light on the use of a self-learning GNG neural network for identification and exploration of the purchasing behaviour patterns. The test has been conducted on the data collected from consumers aged 60 years and over, with regard to three product purchases. The primary data used to explore the purchasing behaviour patterns was collected during a survey carried out among the elderly students

A perceptually optimised bivariate visualisation scheme for high-dimensional fold-change data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-08-18
André Müller, Ludwig Lausser, Adalbert Wilhelm, Timo Ropinski, Matthias Platzer, Heiko Neumann, Hans A. Kestler
Visualising data as diagrams using visual attributes such as colour, shape, size, and orientation is challenging. In particular, large data sets demand graphical display as an essential step in the analysis. In order to achieve comprehension often different attributes need to be displayed simultaneously. In this work a comprehensible bivariate, perceptually optimised visualisation scheme for high-dimensional

SEM-Tree hybrid models in the preferences analysis of the members of Polish households Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-08-10
Adam Sagan, Mariusz Łapczyński
The purpose of the paper is to identify the dimensions of the strategy of resources allocation of Polish households members and test the hypothesis concerning risky shift effect in the relationship between strategy of family decision making and tradeoff in family scarce resources allocation. These dimensions were identified on the basis of nationwide empirical data gathered on a representative sample

Robust archetypoids for anomaly detection in big functional data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-08-03
Guillermo Vinue, Irene Epifanio
Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique to identify extreme observations in the periphery of the data cloud, both in classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose to use robust functional archetypoids and adjusted boxplot to pinpoint

Adaptive sparse group LASSO in quantile regression Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-07-29
Alvaro Mendez-Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo
This paper studies the introduction of sparse group LASSO (SGL) to the quantile regression framework. Additionally, a more flexible version, an adaptive SGL is proposed based on the adaptive idea, this is, the usage of adaptive weights in the penalization. Adaptive estimators are usually focused on the study of the oracle property under asymptotic and double asymptotic frameworks. A key step on the

Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-07-22
Kadri Umbleja, Manabu Ichino, Hiroyuki Yaguchi
Symbolic data is aggregated from bigger traditional datasets in order to hide entry specific details and to enable analysing large amounts of data, like big data, which would otherwise not be possible. Symbolic data may appear in many different but complex forms like intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data

On the use of quantile regression to deal with heterogeneity: the case of multiblock data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-07-19
Cristina Davino, Rosaria Romano, Domenico Vistocco
The aim of the paper is to propose a quantile regression based strategy to assess heterogeneity in a multiblock type data structure. Specifically, the paper deals with a particular data structure where several blocks of variables are observed on the same units and a structure of relations is assumed between the different blocks. The idea is that quantile regression complements the results of the least

A bias-variance analysis of state-of-the-art random forest text classifiers Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-07-19
Thiago Salles, Leonardo Rocha, Marcos Gonçalves
Random forest (RF) classifiers do excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization capabilities under such challenging scenario, including

Active learning of constraints for weighted feature selection Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-07-10
Samah Hijazi, Denis Hamad, Mariam Kalakech, Ali Kalakech
Pairwise constraints, a cheaper kind of supervision information that does not need to reveal the class labels of data points, were initially suggested to enhance the performance of clustering algorithms. Recently, researchers were interested in using them for feature selection. However, in most current methods, pairwise constraints are provided passively and generated randomly over multiple algorithmic

A stochastic block model for interaction lengths Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-18
Riccardo Rastelli, Michael Fop
We propose a new stochastic block model that focuses on the analysis of interaction lengths in dynamic networks. The model does not rely on a discretization of the time dimension and may be used to analyze networks that evolve continuously over time. The framework relies on a clustering structure on the nodes, whereby two nodes belonging to the same latent group tend to create interactions and noninteractions

Regime dependent interconnectedness among fuzzy clusters of financial time series Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-16
Giovanni De Luca, Paola Zuccolotto
We analyze the dynamic structure of lower tail dependence coefficients within groups of assets defined such that assets belonging to the same group are characterized by pairwise high associations between extremely low values. The groups are identified by means of a fuzzy cluster analysis algorithm. The tail dependence coefficients are estimated using the Joe–Clayton copula function, and the 75th percentile

M-estimators and trimmed means: from Hilbert-valued to fuzzy set-valued data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-12
Beatriz Sinova, Stefan Van Aelst, Pedro Terán
Different approaches to robustly measure the location of data associated with a random experiment have been proposed in the literature, with the aim of avoiding the high sensitivity to outliers or data changes typical for the mean. In particular, M-estimators and trimmed means have been studied in general spaces, and can be used to handle Hilbert-valued data. Both alternatives are of interest due to

ParticleMDI: particle Monte Carlo methods for the cluster analysis of multiple datasets with applications to cancer subtype identification Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-12
Nathan Cunningham, Jim E. Griffin, David L. Wild
We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to

Isotonic boosting classification rules Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-12
David Conde, Miguel A. Fernández, Cristina Rueda, Bonifacio Salvador
In many real classification problems a monotone relation between some predictors and the classes may be assumed when higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms are

Chained correlations for feature selection Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-06-09
Ludwig Lausser, Robin Szekely, Hans A. Kestler
Data-driven algorithms stand and fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings (\(n \gg m\)). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as arise in differential diagnosis, are usually

The ultrametric correlation matrix for modelling hierarchical latent concepts Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-28
Carlo Cavicchia, Maurizio Vichi, Giorgia Zaccaria
Many relevant multidimensional phenomena are defined by nested latent concepts, which can be represented by a tree-structure supposing a hierarchical relationship among manifest variables. The root of the tree is a general concept which includes more specific ones. The aim of the paper is to reconstruct an observed data correlation matrix of manifest variables through an ultrametric correlation matrix

Data generation for composite-based structural equation modeling methods Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-26
Rainer Schlittgen, Marko Sarstedt, Christian M. Ringle
Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack

Simultaneous dimension reduction and clustering via the NMF-EM algorithm Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-25
Léna Carel, Pierre Alquier
Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters is large, the estimation of the clusters become challenging, as well as their interpretation. Restriction on the parameters can be used to reduce the dimension. An example is given by mixture of factor analyzers for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is

Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-25
Laura Anderlucci, Cinzia Viroli
Topic detection in short textual data is a challenging task due to its representation as high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the base of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams consists in considering a mixture of multinomial distributions over the

Clustering discrete-valued time series Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-20
Tyler Roick, Dimitris Karlis, Paul D. McNicholas
There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model

Gaussian mixture modeling and model-based clustering under measurement inconsistency Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-05-12
Shuchismita Sarkar, Volodymyr Melnykov, Rong Zheng
Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence

Semiparametric mixtures of regressions with single-index for model based clustering Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-04-23
Sijia Xiang, Weixin Yao
In this article, we propose two classes of semiparametric mixture regression models with single-index for model based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low dimensional predictors, the new semiparametric models can easily incorporate high dimensional predictors into the nonparametric components. The proposed models are very general

Mixture modeling of data with multiple partial right-censoring levels Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-04-21
Semhar Michael, Tatjana Miljkovic, Volodymyr Melnykov
In this paper, a new flexible approach to modeling data with multiple partial right-censoring points is proposed. This method is based on finite mixture models, flexible tool to model heterogeneity in data. A general framework to accommodate partial censoring is considered. In this setting, it is assumed that a certain portion of data points are censored and the rest are not. This situation occurs

Kappa coefficients for dichotomous-nominal classifications Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-04-07
Matthijs J. Warrens
Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an ‘absence’ category (for example, no disorder), whereas the second type does include an ‘absence’ category. Cohen’s unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories

A cost-sensitive constrained Lasso Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-03-12
Rafael Blanquero, Emilio Carrizosa, Pepa Ramírez-Cobo, M. Remedios Sillero-Denamiel
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the accuracy prediction on certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added

A novel semi-supervised support vector machine with asymmetric squared loss Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-03-10
Huimin Pei, Qiang Lin, Liran Yang, Ping Zhong
Laplacian support vector machine (LapSVM), which is based on the semi-supervised manifold regularization learning framework, performs better than the standard SVM, especially for the case where the supervised information is insufficient. However, the use of hinge loss leads to the sensitivity of LapSVM to noise around the decision boundary. To enhance the performance of LapSVM, we present a novel semi-supervised

Efficient regularized spectral data embedding Adv. Data Anal. Classif. (IF 2.134) Pub Date: 2020-02-24
Lazhar Labiod, Mohamed Nadif
Data embedding (DE) or dimensionality reduction techniques are particularly well suited to embedding high-dimensional data into a space that in most cases will have just two dimensions. Low-dimensional space, in which data samples (data points) can more easily be visualized, is also often used for learning methods such as clustering. Sometimes, however, DE will identify dimensions that contribute little