
-
Nonlinear dimension reduction for conditional quantiles Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-03-23 Eliana Christou, Annabel Settle, Andreas Artemiou
In practice, data often display heteroscedasticity, making quantile regression (QR) a more appropriate methodology. Modeling the data, while maintaining a flexible nonparametric fitting, requires smoothing over a high-dimensional space which might not be feasible when the number of the predictor variables is large. This problem makes necessary the use of dimension reduction techniques for conditional
-
Learning multivariate shapelets with multi-layer neural networks for interpretable time-series classification Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-03-04 Roberto Medico, Joeri Ruyssinck, Dirk Deschrijver, Tom Dhaene
Shapelets are discriminative subsequences extracted from time-series data. Classifiers using shapelets have proven to achieve performances competitive to state-of-the-art methods, while enhancing the model’s interpretability. While a lot of research has been done for univariate time-series shapelets, extensions for the multivariate setting have not yet received much attention. To extend shapelets-based
-
Robust regression with compositional covariates including cellwise outliers Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-02-24 Nikola Štefelová, Andreas Alfons, Javier Palarea-Albaladejo, Peter Filzmoser, Karel Hron
We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise
-
Sparse principal component regression via singular value decomposition approach Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-02-08 Shuichi Kawano
Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only explanatory variables, the principal components have no information about the response variable. To address this
-
PCA-KL: a parametric dimensionality reduction approach for unsupervised metric learning Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-01-07 Alexandre L. M. Levada
Dimensionality reduction algorithms are powerful mathematical tools for data analysis and visualization. In many pattern recognition applications, a feature extraction step is often required to mitigate the curse of the dimensionality, a collection of negative effects caused by an arbitrary increase in the number of features in classification tasks. Principal Component Analysis (PCA) is a classical
-
Functional data clustering by projection into latent generalized hyperbolic subspaces Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-01-07 Alex Sharp, Ryan Browne
We introduce a latent subspace model which facilitates model-based clustering of functional data. Flexible clustering is attained by imposing jointly generalized hyperbolic distributions on projections of basis expansion coefficients into group specific subspaces. The model acquires parsimony by assuming these subspaces are of relatively low dimension. Parameter estimation is done through a multicycle
-
A process framework for inducing and explaining Datalog theories Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-01-05 Mark Gromowski, Michael Siebers, Ute Schmid
With the increasing prevalence of Machine Learning in everyday life, a growing number of people will be provided with Machine-Learned assessments on a regular basis. We believe that human users interacting with systems based on Machine-Learned classifiers will demand and profit from the systems’ decisions being explained in an approachable and comprehensive way. We developed a general process framework
-
Automatic gait classification patterns in spastic hemiplegia Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2021-01-04 Ana Aguilera, Alberto Subero
Clinical gait analysis and the interpretation of related records are a powerful tool to aid clinicians in the diagnosis, treatment and prognosis of human gait disabilities. The aim of this study is to investigate kinematic, kinetic, and electromyographic (EMG) data from child patients with spastic hemiplegia (SH) in order to discover useful patterns in human gait. Data mining techniques and classification
-
A bivariate finite mixture growth model with selection Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-12-29 David Aristei, Silvia Bacci, Francesco Bartolucci, Silvia Pandolfi
A model is proposed to analyze longitudinal data where two response variables are available, one of which is a binary indicator of selection and the other is continuous and observed only if the first is equal to 1. The model also accounts for individual covariates and may be considered as a bivariate finite mixture growth model as it is based on three submodels: (i) a probit model for the selection
-
Adapted single-cell consensus clustering (adaSC3) Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-12-15 Cornelia Fuetterer, Thomas Augustin, Christiane Fuchs
The analysis of single-cell RNA sequencing data is of great importance in health research. It challenges data scientists, but has enormous potential in the context of personalized medicine. The clustering of single cells aims to detect different subgroups of cell populations within a patient in a data-driven manner. Some comparison studies denote single-cell consensus clustering (SC3), proposed by
-
A Riemannian geometric framework for manifold learning of non-Euclidean data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-11-27 Cheongjae Jang, Yung-Kyun Noh, Frank Chongwoo Park
A growing number of problems in data analysis and classification involve data that are non-Euclidean. For such problems, a naive application of vector space analysis algorithms will produce results that depend on the choice of local coordinates used to parametrize the data. At the same time, many data analysis and classification problems eventually reduce to an optimization, in which the criteria being
-
Robust semiparametric inference for polytomous logistic regression with complex survey design Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-11-23 Elena Castilla, Abhik Ghosh, Nirian Martin, Leandro Pardo
Analyzing polytomous responses from a complex survey scheme, such as stratified or cluster sampling, is crucial in several socio-economic applications. We present a class of minimum quasi weighted density power divergence estimators for the polytomous logistic regression model with such a complex survey. This family of semiparametric estimators is a robust generalization of the maximum quasi weighted
-
Predicting brand confusion in imagery markets based on deep learning of visual advertisement content Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-11-19 Atsuho Nakayama, Daniel Baier
In the consumer goods industry, unique brand positionings are assumed to be the road to success. They document product distinctiveness and so justify high prices. However, as products are getting more and more interchangeable, brand positionings must rely—at least partially—on supporting advertisements. Here, especially ads with visual content (e.g. photos, video clips) are able to connect brands with
-
Clustering of modal-valued symbolic data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-10-24 Nataša Kejžar, Simona Korenjak-Černe, Vladimir Batagelj
Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of
-
Sparse group fused lasso for model segmentation: a hybrid approach Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-10-22 David Degras
This article introduces the sparse group fused lasso (SGFL) as a statistical framework for segmenting sparse regression models with multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no tuning parameter selection, and is guaranteed to converge to a global minimizer. In numerical experiments
-
Better than the best? Answers via model ensemble in density-based clustering Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-10-02 Alessandro Casa, Luca Scrucca, Giovanna Menardi
With the recent growth in data availability and complexity, and the associated outburst of elaborate modelling approaches, model selection tools have become a lifeline, providing objective criteria to deal with this increasingly challenging landscape. In fact, basing predictions and inference on a single model may be limiting if not harmful; ensemble approaches, which combine different models, have
-
Editable machine learning models? A rule-based framework for user studies of explainability Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-09-11 Stanislav Vojíř, Tomáš Kliegr
So far, most user studies dealing with comprehensibility of machine learning models have used questionnaires or surveys to acquire input from participants. In this article, we argue that compared to questionnaires, the use of an adapted version of a real machine learning interface can yield a new level of insight into what attributes make a machine learning model interpretable, and why. Also, we argue
-
A comparison of instance-level counterfactual explanation algorithms for behavioral and textual data: SEDC, LIME-C and SHAP-C Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-09-02 Yanou Ramon, David Martens, Foster Provost, Theodoros Evgeniou
Predictive systems based on high-dimensional behavioral and textual data have serious comprehensibility and transparency issues: linear models require investigating thousands of coefficients, while the opaqueness of nonlinear models makes things worse. Counterfactual explanations are becoming increasingly popular for generating insight into model predictions. This study aligns the recently proposed
-
Mixtures of factor analyzers with scale mixtures of fundamental skew normal distributions Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-09-02 Sharon X. Lee, Tsung-I Lin, Geoffrey J. McLachlan
Mixtures of factor analyzers (MFA) provide a powerful tool for modelling high-dimensional datasets. In recent years, several generalizations of MFA have been developed where the normality assumption of the factors and/or of the errors were relaxed to allow for skewness in the data. However, due to the form of the adopted component densities, the distribution of the factors/errors in most of these models
-
A novel dictionary learning method based on total least squares approach with application in high dimensional biological data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-09-02 Parvaneh Parvasideh, Mansoor Rezghi
In recent years dictionary learning has become a favorite sparse feature extraction technique. Dictionary learning represents each data as a sparse combination of atoms (columns) of the dictionary matrix. Usually, the input data is contaminated by errors that affect the quality of the obtained dictionary and so sparse features. This effect is especially critical in applications with high dimensional
-
The GNG neural network in analyzing consumer behaviour patterns: empirical research on a purchasing behaviour processes realized by the elderly consumers Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-08-30 Kamila Migdał-Najman, Krzysztof Najman, Sylwia Badowska
The paper sheds light on the use of a self-learning GNG neural network for identification and exploration of the purchasing behaviour patterns. The test has been conducted on the data collected from consumers aged 60 years and over, with regard to three product purchases. The primary data used to explore the purchasing behaviour patterns was collected during a survey carried out among the elderly students
-
A perceptually optimised bivariate visualisation scheme for high-dimensional fold-change data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-08-18 André Müller, Ludwig Lausser, Adalbert Wilhelm, Timo Ropinski, Matthias Platzer, Heiko Neumann, Hans A. Kestler
Visualising data as diagrams using visual attributes such as colour, shape, size, and orientation is challenging. In particular, large data sets demand graphical display as an essential step in the analysis. In order to achieve comprehension often different attributes need to be displayed simultaneously. In this work a comprehensible bivariate, perceptually optimised visualisation scheme for high-dimensional
-
SEM-Tree hybrid models in the preferences analysis of the members of Polish households Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-08-10 Adam Sagan, Mariusz Łapczyński
The purpose of the paper is to identify the dimensions of the strategy of resources allocation of Polish households members and test the hypothesis concerning risky shift effect in the relationship between strategy of family decision making and trade-off in family scarce resources allocation. These dimensions were identified on the basis of nationwide empirical data gathered on a representative sample
-
Robust archetypoids for anomaly detection in big functional data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-08-03 Guillermo Vinue, Irene Epifanio
Archetypoid analysis (ADA) has proven to be a successful unsupervised statistical technique to identify extreme observations in the periphery of the data cloud, both in classical multivariate data and functional data. However, two questions remain open in this field: the use of ADA for outlier detection and its scalability. We propose to use robust functional archetypoids and adjusted boxplot to pinpoint
-
Adaptive sparse group LASSO in quantile regression Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-07-29 Alvaro Mendez-Civieta, M. Carmen Aguilera-Morillo, Rosa E. Lillo
This paper studies the introduction of sparse group LASSO (SGL) to the quantile regression framework. Additionally, a more flexible version, an adaptive SGL, is proposed based on the adaptive idea, that is, the use of adaptive weights in the penalization. Adaptive estimators are usually focused on the study of the oracle property under asymptotic and double asymptotic frameworks. A key step on the
-
Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-07-22 Kadri Umbleja, Manabu Ichino, Hiroyuki Yaguchi
Symbolic data is aggregated from bigger traditional datasets in order to hide entry specific details and to enable analysing large amounts of data, like big data, which would otherwise not be possible. Symbolic data may appear in many different but complex forms like intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data
-
On the use of quantile regression to deal with heterogeneity: the case of multi-block data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-07-19 Cristina Davino, Rosaria Romano, Domenico Vistocco
The aim of the paper is to propose a quantile regression based strategy to assess heterogeneity in a multi-block type data structure. Specifically, the paper deals with a particular data structure where several blocks of variables are observed on the same units and a structure of relations is assumed between the different blocks. The idea is that quantile regression complements the results of the least
-
A bias-variance analysis of state-of-the-art random forest text classifiers Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-07-19 Thiago Salles, Leonardo Rocha, Marcos Gonçalves
Random forest (RF) classifiers do excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization capabilities under such challenging scenario, including
-
Active learning of constraints for weighted feature selection Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-07-10 Samah Hijazi, Denis Hamad, Mariam Kalakech, Ali Kalakech
Pairwise constraints, a cheaper kind of supervision information that does not need to reveal the class labels of data points, were initially suggested to enhance the performance of clustering algorithms. Recently, researchers were interested in using them for feature selection. However, in most current methods, pairwise constraints are provided passively and generated randomly over multiple algorithmic
-
A stochastic block model for interaction lengths Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-18 Riccardo Rastelli, Michael Fop
We propose a new stochastic block model that focuses on the analysis of interaction lengths in dynamic networks. The model does not rely on a discretization of the time dimension and may be used to analyze networks that evolve continuously over time. The framework relies on a clustering structure on the nodes, whereby two nodes belonging to the same latent group tend to create interactions and non-interactions
-
Regime dependent interconnectedness among fuzzy clusters of financial time series Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-16 Giovanni De Luca, Paola Zuccolotto
We analyze the dynamic structure of lower tail dependence coefficients within groups of assets defined such that assets belonging to the same group are characterized by pairwise high associations between extremely low values. The groups are identified by means of a fuzzy cluster analysis algorithm. The tail dependence coefficients are estimated using the Joe–Clayton copula function, and the 75th percentile
-
M-estimators and trimmed means: from Hilbert-valued to fuzzy set-valued data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-12 Beatriz Sinova, Stefan Van Aelst, Pedro Terán
Different approaches to robustly measure the location of data associated with a random experiment have been proposed in the literature, with the aim of avoiding the high sensitivity to outliers or data changes typical for the mean. In particular, M-estimators and trimmed means have been studied in general spaces, and can be used to handle Hilbert-valued data. Both alternatives are of interest due to
-
ParticleMDI: particle Monte Carlo methods for the cluster analysis of multiple datasets with applications to cancer subtype identification Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-12 Nathan Cunningham, Jim E. Griffin, David L. Wild
We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to
-
Isotonic boosting classification rules Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-12 David Conde, Miguel A. Fernández, Cristina Rueda, Bonifacio Salvador
In many real classification problems a monotone relation between some predictors and the classes may be assumed when higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms are
-
Chained correlations for feature selection Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-06-09 Ludwig Lausser, Robin Szekely, Hans A. Kestler
Data-driven algorithms stand and fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings (\(n \gg m\)). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as arise in differential diagnosis, are usually
-
The ultrametric correlation matrix for modelling hierarchical latent concepts Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-28 Carlo Cavicchia, Maurizio Vichi, Giorgia Zaccaria
Many relevant multidimensional phenomena are defined by nested latent concepts, which can be represented by a tree-structure supposing a hierarchical relationship among manifest variables. The root of the tree is a general concept which includes more specific ones. The aim of the paper is to reconstruct an observed data correlation matrix of manifest variables through an ultrametric correlation matrix
-
Data generation for composite-based structural equation modeling methods Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-26 Rainer Schlittgen, Marko Sarstedt, Christian M. Ringle
Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack
-
Simultaneous dimension reduction and clustering via the NMF-EM algorithm Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-25 Léna Carel, Pierre Alquier
Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters are large, the estimation of the clusters becomes challenging, as does their interpretation. Restrictions on the parameters can be used to reduce the dimension. An example is given by mixtures of factor analyzers (MFA) for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is
-
Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-25 Laura Anderlucci, Cinzia Viroli
Topic detection in short textual data is a challenging task due to its representation as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams consists in considering a mixture of multinomial distributions over the
-
Clustering discrete-valued time series Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-20 Tyler Roick, Dimitris Karlis, Paul D. McNicholas
There is a need for the development of models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models. The INAR type models can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model
-
Gaussian mixture modeling and model-based clustering under measurement inconsistency Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-05-12 Shuchismita Sarkar, Volodymyr Melnykov, Rong Zheng
Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence
-
Semiparametric mixtures of regressions with single-index for model based clustering Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-04-23 Sijia Xiang, Weixin Yao
In this article, we propose two classes of semiparametric mixture regression models with single-index for model based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low dimensional predictors, the new semiparametric models can easily incorporate high dimensional predictors into the nonparametric components. The proposed models are very general
-
Mixture modeling of data with multiple partial right-censoring levels Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-04-21 Semhar Michael, Tatjana Miljkovic, Volodymyr Melnykov
In this paper, a new flexible approach to modeling data with multiple partial right-censoring points is proposed. This method is based on finite mixture models, a flexible tool to model heterogeneity in data. A general framework to accommodate partial censoring is considered. In this setting, it is assumed that a certain portion of data points are censored and the rest are not. This situation occurs
-
Kappa coefficients for dichotomous-nominal classifications Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-04-07 Matthijs J. Warrens
Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an ‘absence’ category (for example, no disorder), whereas the second type does include an ‘absence’ category. Cohen’s unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories
-
A cost-sensitive constrained Lasso Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-03-12 Rafael Blanquero, Emilio Carrizosa, Pepa Ramírez-Cobo, M. Remedios Sillero-Denamiel
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, no full control over the accuracy prediction on certain individuals of interest is allowed. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added
-
A novel semi-supervised support vector machine with asymmetric squared loss Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-03-10 Huimin Pei, Qiang Lin, Liran Yang, Ping Zhong
Laplacian support vector machine (LapSVM), which is based on the semi-supervised manifold regularization learning framework, performs better than the standard SVM, especially for the case where the supervised information is insufficient. However, the use of hinge loss leads to the sensitivity of LapSVM to noise around the decision boundary. To enhance the performance of LapSVM, we present a novel semi-supervised
-
Efficient regularized spectral data embedding Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-02-24 Lazhar Labiod, Mohamed Nadif
Data embedding (DE) or dimensionality reduction techniques are particularly well suited to embedding high-dimensional data into a space that in most cases will have just two dimensions. Low-dimensional space, in which data samples (data points) can more easily be visualized, is also often used for learning methods such as clustering. Sometimes, however, DE will identify dimensions that contribute little
-
A combination of k-means and DBSCAN algorithm for solving the multiple generalized circle detection problem Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-02-12 Rudolf Scitovski, Kristian Sabo
Motivated by the problem of identifying rod-shaped particles (e.g. bacilliform bacterium), in this paper we consider the multiple generalized circle detection problem. We propose a method for solving this problem that is based on center-based clustering, where cluster-centers are generalized circles. An efficient algorithm is proposed which is based on a modification of the well-known k-means algorithm
-
A robust spatial autoregressive scalar-on-function regression with t-distribution Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-01-29 Tingting Huang, Gilbert Saporta, Huiwen Wang, Shanshan Wang
Modelling functional data in the presence of spatial dependence is of great practical importance as exemplified by applications in the fields of demography, economy and geography, and has received much attention recently. However, for the classical scalar-on-function regression (SoFR) with functional covariates and scalar responses, only relatively little literature is dedicated to this relevant area
-
From-below Boolean matrix factorization algorithm based on MDL Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-01-08 Tatiana Makhalova, Martin Trnecka
During the past few years Boolean matrix factorization (BMF) has become an important direction in data analysis. The minimum description length principle (MDL) was successfully adapted in BMF for the model order selection. Nevertheless, a BMF algorithm achieving good results w.r.t. standard measures in BMF is missing. In this paper, we propose a novel from-below Boolean matrix factorization algorithm
-
Interval forecasts based on regression trees for streaming data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-12-18 Xin Zhao, Stuart Barber, Charles C. Taylor, Zoka Milan
In forecasting, we often require interval forecasts instead of just a specific point forecast. To track streaming data effectively, this interval forecast should reliably cover the observed data and yet be as narrow as possible. To achieve this, we propose two methods based on regression trees: one ensemble method and one method based on a single tree. For the ensemble method, we use weighted results
-
Data projections by skewness maximization under scale mixtures of skew-normal vectors Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2020-03-10 Jorge M. Arevalillo, Hilario Navarro
Multivariate scale mixtures of skew-normal distributions are flexible models that account for the non-normality of data by means of a tail weight parameter and a shape vector representing the asymmetry of the model in a directional fashion. Its stochastic representation involves a skew-normal vector and a non negative mixing scalar variable, independent of the skew-normal vector, that injects tail
-
Modelling heterogeneity: on the problem of group comparisons with logistic regression and the potential of the heterogeneous choice model Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-12-13 Gerhard Tutz
The comparison of coefficients of logit models obtained for different groups is widely considered as problematic because of possible heterogeneity of residual variances in latent variables. It is shown that the heterogeneous logit model can be used to account for this type of heterogeneity by considering reduced models that are identified. A model selection strategy is proposed that can distinguish
-
A stable cardinality distance for topological classification Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-11-27 Vasileios Maroulas, Cassie Putman Micucci, Adam Spannaus
This work incorporates topological features via persistence diagrams to classify point cloud data arising from materials science. Persistence diagrams are multisets summarizing the connectedness and holes of given data. A new distance on the space of persistence diagrams generates relevant input features for a classification algorithm for materials science data. This distance measures the similarity
-
Rank tests for functional data based on the epigraph, the hypograph and associated graphical representations Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-11-27 Alba M. Franco Pereira, Rosa E. Lillo
Visualization techniques are very useful in data analysis. Their aim is to summarize information into a graph or a plot. In particular, visualization is especially interesting when one has functional data, where there is no total order between the data of a sample. Taking into account the information provided by the down–upward partial orderings based on the hypograph and the epigraph indexes, we propose
-
Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-11-27 Khadidja Henni, Pierre-Yves Louis, Brigitte Vannier, Ahmed Moussa
Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm through adaptive MCMC sampling of a marked point processes of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. It detects clusters of different shapes, sizes and unbalanced local densities. It aims at dealing also with high-dimensional
-
Mixtures of skewed matrix variate bilinear factor analyzers Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-11-21 Michael P. B. Gallaugher, Paul D. McNicholas
In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate
-
Sparse classification with paired covariates Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-11-15 Armin Rauschenberger, Iuliana Ciocănea-Teodorescu, Marianne A. Jonker, Renée X. Menezes, Mark A. van de Wiel
This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same
-
Gaussian parsimonious clustering models with covariates and a noise component Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-09-20 Keefe Murphy, Thomas Brendan Murphy
We consider model-based clustering methods for continuous, correlated data that account for external information available in the presence of mixed-type fixed covariates by proposing the MoEClust suite of models. These models allow different subsets of covariates to influence the component weights and/or component densities by modelling the parameters of the mixture as functions of the covariates.
-
Familywise decompositions of Pearson’s chi-square statistic in the analysis of contingency tables Adv. Data Anal. Classif. (IF 1.603) Pub Date : 2019-09-16 Rosaria Lombardo, Yoshio Takane, Eric J. Beh
Pearson’s chi-square statistic is well established for testing goodness-of-fit of various hypotheses about observed frequency distributions in contingency tables. A general formula for ANOVA-like decompositions of Pearson’s statistic is given under the independence assumption along with their extensions to higher-order tables. Mathematically, it makes the terms in the partitions and orthogonality among