Current journal: Data Mining and Knowledge Discovery
  • NegPSpan: efficient extraction of negative sequential patterns with embedding constraints
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2020-01-21
    Thomas Guyet, René Quiniou

    Abstract Sequential pattern mining is concerned with the extraction of frequent or recurrent behaviors, modeled as subsequences, from a sequence dataset. Such patterns inform about which events are frequently observed in sequences, i.e. events that really happen. Sometimes, knowing that some specific event does not happen is more informative than extracting observed events. Negative sequential patterns (NSPs) capture recurrent behaviors by patterns that take the form of sequences mentioning both observed events and absent events. Few approaches have been proposed to mine such NSPs. In addition, the syntax and semantics of NSPs differ across methods, which makes them difficult to compare. This article provides a unified framework for the formulation of the syntax and the semantics of NSPs. Then, we introduce a new algorithm, NegPSpan, that extracts NSPs using a prefix-based depth-first scheme, enabling maxgap constraints that other approaches do not take into account. The formal framework highlights the differences between the proposed approach and methods from the literature, especially against the state-of-the-art approach eNSP. Intensive experiments on synthetic and real datasets show that NegPSpan can extract meaningful NSPs and that it can process larger datasets than eNSP thanks to significantly lower memory requirements and better computation times.
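    For intuition, the sketch below checks one negative pattern against a single sequence under a simplified, assumed semantics with a maxgap constraint (positive items matched in order, negated items required to be absent from the preceding gap). The pattern syntax and items are illustrative; NegPSpan's embedding constraints and itemset handling are richer.

```python
# Illustrative only: a simplified matcher for one negative sequential pattern
# against a single sequence, under an assumed semantics and maxgap constraint.
def occurs(sequence, pattern, maxgap):
    """pattern: list of (item, negated) pairs, assumed to start and end with a
    positive item and to contain no two consecutive negated items. Positive
    items must match in order with at most `maxgap` events between consecutive
    matches; a negated item must be absent from the gap before the next match."""
    def match_from(start, p):
        if p == len(pattern):
            return True
        forbidden = None
        if pattern[p][1]:                      # negated item: constrain the next gap
            forbidden, p = pattern[p][0], p + 1
        item = pattern[p][0]
        last = len(sequence) if start == 0 else min(start + maxgap + 1, len(sequence))
        for j in range(start, last):
            if sequence[j] == item and (forbidden is None or forbidden not in sequence[start:j]):
                if match_from(j + 1, p + 1):
                    return True
        return False
    return match_from(0, 0)

seq = list("axcbyd")
print(occurs(seq, [("a", False), ("b", True), ("c", False)], maxgap=3))  # True: no 'b' between the matched a and c
print(occurs(seq, [("a", False), ("x", True), ("c", False)], maxgap=3))  # False: 'x' occurs in that gap
```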

    Updated: 2020-01-22
  • Relaxing the strong triadic closure problem for edge strength inference
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2020-01-17
    Florian Adriaens, Tijl De Bie, Aristides Gionis, Jefrey Lijffijt, Antonis Matakos, Polina Rozenshtein

    Abstract Social networks often provide only a binary perspective on social ties: two individuals are either connected or not. While sometimes external information can be used to infer the strength of social ties, access to such information may be restricted or impractical to obtain. Sintos and Tsaparas (KDD 2014) first suggested inferring the strength of social ties from the topology of the network alone, by leveraging the Strong Triadic Closure (STC) property. The STC property states that if person A has strong social ties with persons B and C, B and C must be connected to each other as well (whether with a weak or strong tie). They exploited this property to formulate the inference of the strength of social ties as an NP-hard maximization problem, and proposed two approximation algorithms. We refine and improve this line of work, by developing a sequence of linear relaxations of the problem, which can be solved exactly in polynomial time. Usefully, these relaxations infer more fine-grained levels of tie strength (beyond strong and weak), which also allows one to avoid making arbitrary strong/weak strength assignments when the network topology provides inconclusive evidence. Moreover, these relaxations allow us to easily change the objective function to more sensible alternatives, instead of simply maximizing the number of strong edges. An extensive theoretical analysis leads to two efficient algorithmic approaches. Finally, our experimental results elucidate the strengths of the proposed approach, while at the same time questioning the validity of leveraging the STC property for edge strength inference in practice.
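    As a rough illustration of the relaxation idea, the sketch below assigns each edge a fractional strength in [0, 1], maximises total strength, and constrains every open wedge to carry at most one unit of strength, solved as a linear program with SciPy. The toy graph, objective, and constraint form are illustrative assumptions, not the authors' exact formulations.

```python
# Illustrative LP relaxation sketch (not the authors' exact formulations):
# give each edge a strength in [0, 1], maximise total strength, and require
# every open wedge (u-v and u-w present, v-w absent) to carry at most 1 unit.
import itertools
from scipy.optimize import linprog

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]     # toy graph (assumed)
eset = {frozenset(e) for e in edges}
idx = {frozenset(e): i for i, e in enumerate(edges)}
nodes = {v for e in edges for v in e}

A_ub, b_ub = [], []
for u in nodes:
    nbrs = [v for v in nodes if frozenset((u, v)) in eset]
    for v, w in itertools.combinations(nbrs, 2):
        if frozenset((v, w)) not in eset:                    # the wedge is open
            row = [0.0] * len(edges)
            row[idx[frozenset((u, v))]] = 1.0
            row[idx[frozenset((u, w))]] = 1.0
            A_ub.append(row)
            b_ub.append(1.0)

res = linprog(c=[-1.0] * len(edges), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0)] * len(edges))              # maximise sum of strengths
print(dict(zip(edges, res.x)))                               # fractional tie strengths
```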

    Updated: 2020-01-17
  • A survey and benchmarking study of multitreatment uplift modeling
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2020-01-13
    Diego Olaya, Kristof Coussement, Wouter Verbeke

    Uplift modeling is an instrument used to estimate the change in outcome due to a treatment at the individual entity level. Uplift models assist decision-makers in optimally allocating scarce resources. This allows the selection of the subset of entities for which the effect of a treatment will be largest and, as such, the maximization of the overall returns. The literature on uplift modeling mostly focuses on queries concerning the effect of a single treatment and rarely considers situations where more than one treatment alternative is utilized. This article surveys the current literature on multitreatment uplift modeling and proposes two novel techniques: the naive uplift approach and the multitreatment modified outcome approach. Moreover, a benchmarking experiment is performed to contrast the performances of different multitreatment uplift modeling techniques across eight data sets from various domains. We verify and, if needed, correct the imbalance among the pretreatment characteristics of the treatment groups by means of optimal propensity score matching, which ensures a correct interpretation of the estimated uplift. Conventional and recently proposed evaluation metrics are adapted to the multitreatment scenario to assess performance. None of the evaluated techniques consistently outperforms other techniques. Hence, it is concluded that performance largely depends on the context and problem characteristics. The newly proposed techniques are found to offer similar performances compared to state-of-the-art approaches.
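    For a concrete picture of what a multitreatment uplift estimate looks like, the sketch below fits one outcome model per treatment arm and subtracts the control prediction (a separate-model approach on synthetic data). Data, models, and parameters are placeholders; the naive uplift and modified outcome approaches proposed in the article differ in their details.

```python
# Hedged sketch of a "separate model per treatment" uplift estimate on
# synthetic data; the survey's naive and modified-outcome approaches differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
treatment = rng.integers(0, 3, size=600)            # 0 = control, 1 and 2 = treatments
y = (rng.random(600) < 0.3 + 0.1 * (treatment == 1) * (X[:, 0] > 0)).astype(int)

models = {t: LogisticRegression().fit(X[treatment == t], y[treatment == t])
          for t in np.unique(treatment)}
p_control = models[0].predict_proba(X)[:, 1]
uplift = {t: models[t].predict_proba(X)[:, 1] - p_control for t in models if t != 0}
print({t: round(u.mean(), 3) for t, u in uplift.items()})   # average estimated uplift per treatment
```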

    Updated: 2020-01-13
  • Topical network embedding
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-10-24
    Min Shi, Yufei Tang, Xingquan Zhu, Jianxun Liu, Haibo He

    Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels etc., where structure and content are often correlated but are not always consistent. A typical scenario is the citation relationships in scholarly publications where a paper is cited by others not because they have the same content, but because they share one or multiple subject matters. To date, while many network embedding methods exist to take the node content into consideration, they all treat node content as a simple, flat word/attribute set, and nodes sharing connections are assumed to have dependencies with respect to all words or attributes. In this paper, we argue that considering topic-level semantic interactions between nodes is crucial to learn discriminative node embedding vectors. In order to model pairwise topic relevance between linked text nodes, we propose topical network embedding, where interactions between nodes are built on the shared latent topics. Accordingly, we propose a unified optimization framework to simultaneously learn topic and node representations from the network text contents and structures, respectively. Meanwhile, the structure modeling takes the learned topic representations as conditional context under the principle that two nodes can infer each other contingent on the shared latent topics. Experiments on three real-world datasets demonstrate that our approach can learn significantly better network representations, e.g., a 4.1% improvement over the state-of-the-art methods in terms of Micro-F1 on the Cora dataset. (The source code of the proposed method is available at https://github.com/codeshareabc/TopicalNE.)

    Updated: 2020-01-08
  • Grafting for combinatorial binary model using frequent itemset mining
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-10-28
    Taito Lee, Shin Matsushima, Kenji Yamanishi

    Abstract We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs are of high knowledge interpretability but naïve learning of them from labeled data requires exponentially high computational cost with respect to the length of the conjunctions. On the other hand, in the case of large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the \(L_1\)-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step in the grafting algorithm, which is designed to solve large-scale \(L_1\)-RERM problems by an iterative approach. Furthermore, we experimentally showed that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.

    Updated: 2020-01-08
  • Interactive visual data exploration with subjective feedback: an information-theoretic approach
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-10-03
    Kai Puolamäki, Emilia Oikarinen, Bo Kang, Jefrey Lijffijt, Tijl De Bie

    Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing projection methods for data visualization use predefined criteria to choose the representation of data. There is a lack of methods that (i) use information on what the user has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as “this set of points forms a cluster”, and requires no knowledge of maths. This background knowledge is used to find a maximum entropy distribution of the data, after which the user is provided with data projections for which the data and the maximum entropy distribution differ the most, hence showing the user aspects of data that are maximally informative given the background knowledge. We study the computational performance of our model and present use cases on synthetic and real data. We find that the model allows the user to learn information efficiently from various data sources and works sufficiently fast in practice. In addition, we provide an open source EDA demonstrator system implementing our model with tailored interactive visualizations. We conclude that the information theoretic approach to EDA where patterns observed by a user are formalized as constraints provides a principled, intuitive, and efficient basis for constructing an EDA system.

    Updated: 2020-01-08
  • A comparative study of data-dependent approaches without learning in measuring similarities of data objects
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-10-30
    Sunil Aryal, Kai Ming Ting, Takashi Washio, Gholamreza Haffari

    Abstract Conventional general-purpose distance-based similarity measures, such as Minkowski distance (also known as \(\ell _p\)-norm with \(p>0\)), are data-independent and sensitive to units or scales of measurement. There are existing general-purpose data-dependent measures, such as rank difference, Lin’s probabilistic measure and \(m_p\)-dissimilarity (\(p>0\)), which are not sensitive to units or scales of measurement. Although they have been shown to be more effective than the traditional distance measures, their characteristics and relative performances have not been investigated. In this paper, we study the characteristics and relationships of different general-purpose data-dependent measures. We generalise \(m_p\)-dissimilarity where \(p\ge 0\) by introducing \(m_0\)-dissimilarity and show that it is a generic data-dependent measure with data-dependent self-similarity, of which rank difference and Lin’s measure are special cases with data-independent self-similarity. We evaluate the effectiveness of a wide range of general-purpose data-dependent and data-independent measures in the content-based information retrieval and kNN classification tasks. Our findings show that the fully data-dependent measure of \(m_p\)-dissimilarity is a more effective alternative to other data-dependent and commonly-used distance-based similarity measures as its task-specific performance is more consistent across a wide range of datasets.
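    The contrast between data-independent and data-dependent measures can be illustrated as below: a Minkowski distance next to a simple mass-based dissimilarity that counts how much data falls between two values per attribute, loosely in the spirit of the m_p-dissimilarity (the paper's exact definition and binning differ). Rescaling an attribute changes the former but not the latter.

```python
# Illustrative contrast between a data-independent distance and a simple
# data-dependent, mass-based dissimilarity (loosely in the spirit of
# m_p-dissimilarity; the paper's exact definition and binning differ).
import numpy as np

def minkowski(x, y, p=2.0):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mass_dissimilarity(x, y, data, p=2.0):
    n, d = data.shape
    masses = []
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        masses.append((np.sum((data[:, i] >= lo) & (data[:, i] <= hi)) / n) ** p)
    return np.mean(masses) ** (1.0 / p)

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 2))
print(minkowski(data[0], data[1]), mass_dissimilarity(data[0], data[1], data))
# Rescaling one attribute changes the Minkowski distance but not the mass-based value.
scaled = data * np.array([1000.0, 1.0])
print(minkowski(scaled[0], scaled[1]), mass_dissimilarity(scaled[0], scaled[1], scaled))
```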

    Updated: 2020-01-08
  • A semi-supervised model for knowledge graph embedding
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-09-24
    Jia Zhu, Zetao Zheng, Min Yang, Gabriel Pui Cheong Fung, Yong Tang

    Knowledge graphs have shown increasing importance in broad applications such as question answering, web search, and recommendation systems. The objective of knowledge graph embedding is to encode both entities and relations of knowledge graphs into continuous low-dimensional vector spaces to perform various machine learning tasks. Most of the existing works only focused on the local structure of knowledge graphs when utilizing structural information of entities, which may not faithfully preserve the global structure of knowledge graphs. In this paper, we propose a semi-supervised model by adopting graph convolutional networks to utilize both local and global structural information of entities. Specifically, our model takes textual information of each entity into consideration as entity attributes in the process of learning. We show the effectiveness of our model by applying it to two traditional knowledge graph tasks: entity classification and link prediction. Experimental results on two well-known corpora reveal the advantages of this model compared to state-of-the-art methods on both tasks. Moreover, the results show that even with only 1% of labeled data for training, our model can still achieve good performance.

    Updated: 2020-01-08
  • Matching code and law: achieving algorithmic fairness with optimal transport
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-01
    Meike Zehlike, Philipp Hacker, Emil Wiedemann

    Increasingly, discrimination by algorithms is perceived as a societal and legal problem. As a response, a number of criteria for implementing algorithmic fairness in machine learning have been developed in the literature. This paper proposes the continuous fairness algorithm \((\hbox {CFA}\theta )\) which enables a continuous interpolation between different fairness definitions. More specifically, we make three main contributions to the existing literature. First, our approach allows the decision maker to continuously vary between specific concepts of individual and group fairness. As a consequence, the algorithm enables the decision maker to adopt intermediate “worldviews” on the degree of discrimination encoded in algorithmic processes, adding nuance to the extreme cases of “we’re all equal” and “what you see is what you get” proposed so far in the literature. Second, we use optimal transport theory, and specifically the concept of the barycenter, to maximize decision maker utility under the chosen fairness constraints. Third, the algorithm is able to handle cases of intersectionality, i.e., of multi-dimensional discrimination of certain groups on grounds of several criteria. We discuss three main examples (credit applications; college admissions; insurance contracts) and map out the legal and policy implications of our approach. The explicit formalization of the trade-off between individual and group fairness allows this post-processing approach to be tailored to different situational contexts in which one or the other fairness criterion may take precedence. Finally, we evaluate our model experimentally.
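    A minimal one-dimensional sketch of the interpolation idea is shown below: each group's scores are quantile-mapped toward a pooled target distribution, with theta controlling how far the repair goes. The pooled target is a stand-in for the Wasserstein barycenter and the data are synthetic; the actual CFA-theta formulation is richer.

```python
# Hedged one-dimensional sketch: quantile-map each group's scores toward a
# pooled target distribution (a stand-in for a barycenter), interpolated by
# theta in [0, 1]. The actual CFA-theta formulation is richer.
import numpy as np

def partially_repair(scores, groups, theta):
    repaired = scores.astype(float).copy()
    target = np.sort(scores)                          # pooled target distribution
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        ranks = scores[idx].argsort().argsort()       # within-group ranks
        quantiles = (ranks + 0.5) / len(idx)
        mapped = np.quantile(target, quantiles)       # quantile-mapped scores
        repaired[idx] = (1 - theta) * scores[idx] + theta * mapped
    return repaired

rng = np.random.default_rng(2)
groups = rng.integers(0, 2, 1000)
scores = rng.normal(loc=np.where(groups == 0, 0.4, 0.6), scale=0.1)
for theta in (0.0, 0.5, 1.0):                         # from no repair to full repair
    r = partially_repair(scores, groups, theta)
    print(theta, round(r[groups == 0].mean(), 3), round(r[groups == 1].mean(), 3))
```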

    Updated: 2020-01-08
  • A drift detection method based on dynamic classifier selection
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-10-11
    Felipe Pinagé, Eulanda M. dos Santos, João Gama

    Abstract Machine learning algorithms can be applied to several practical problems, such as spam, fraud and intrusion detection, and customer preferences, among others. In most of these problems, data come in streams, which means that the data distribution may change over time, leading to concept drift. The literature is abundant on providing supervised methods based on error monitoring for explicit drift detection. However, these methods may become infeasible in some real-world applications where no fully labeled data are available, and they may depend on a significant decrease in accuracy to be able to detect drifts. There are also methods based on blind approaches, where the decision model is updated constantly. However, this may lead to unnecessary system updates. In order to overcome these drawbacks, we propose in this paper a semi-supervised drift detector that uses an ensemble of classifiers based on self-training online learning and dynamic classifier selection. For each unknown sample, a dynamic selection strategy is used to choose, among the ensemble's component members, the classifier most likely to be the correct one for classifying it. The prediction assigned by the chosen classifier is used to compute an estimate of the error produced by the ensemble members. The proposed method monitors such a pseudo-error in order to detect drifts and to update the decision model only after drift detection. This method is relevant in that it allows drift detection and reaction and is applicable to several practical problems. The experiments conducted indicate that the proposed method attains high performance and detection rates, while reducing the amount of labeled data used to detect drift.

    Updated: 2020-01-08
  • Parameterized low-rank binary matrix approximation
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2020-01-02
    Fedor V. Fomin, Petr A. Golovach, Fahad Panolan

    Low-rank binary matrix approximation is a generic problem where one seeks a good approximation of a binary matrix by another binary matrix with some specific properties. A good approximation means that the difference between the two matrices in some matrix norm is small. The properties of the approximation binary matrix could be: a small number of different columns, a small binary rank or a small Boolean rank. Unfortunately, most variants of these problems are NP-hard. Due to this, we initiate the systematic algorithmic study of low-rank binary matrix approximation from the perspective of parameterized complexity. We show in which cases and under what conditions the problem is fixed-parameter tractable, admits a polynomial kernel and can be solved in parameterized subexponential time.

    Updated: 2020-01-04
  • Integer programming ensemble of temporal relations classifiers
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2020-01-02
    Catherine Kerr, Terri Hoare, Paula Carroll, Jakub Mareček

    The extraction of temporal events from text and the classification of temporal relations among both temporal events and time expressions are major challenges for the interface of data mining and natural language processing. We present an ensemble method, which reconciles the outputs of multiple heterogeneous classifiers of temporal expressions. We use integer programming, a constrained optimisation technique, to improve on the best result of any individual classifier by choosing consistent temporal relations from among those recommended by multiple classifiers. Our ensemble method is conceptually simple and empirically powerful. It allows us to encode knowledge about the structure of valid temporal expressions as a set of constraints. It obtains new state-of-the-art results on two recent natural language processing challenges, SemEval-2013 TempEval-3 (Temporal Annotation) and SemEval-2016 Task 12 (Clinical TempEval), with F1 scores of 0.3915 and 0.595 respectively.

    Updated: 2020-01-04
  • Mining relaxed functional dependencies from data
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-12-23
    Loredana Caruccio, Vincenzo Deufemia, Giuseppe Polese

    Relaxed functional dependencies (rfds) are properties expressing important relationships among data. Thanks to the introduction of approximations in data comparison and/or validity, they can capture constraints useful for several purposes, such as the identification of data inconsistencies or patterns of semantically related data. Nevertheless, rfds can provide benefits only if they can be automatically discovered from data. In this paper we present an rfd discovery algorithm relying on a lattice structured search space, previously used for fd discovery, new pruning strategies, and a new candidate rfd validation method. An experimental evaluation demonstrates the discovery performances of the proposed algorithm on real datasets, also providing a comparison with other algorithms.
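    As an illustration of one simple relaxation, the sketch below checks whether an approximate functional dependency X -> Y holds on a pandas DataFrame up to a tolerated fraction of violating rows. This is only a validity check for a single candidate under one relaxation; the paper's algorithm discovers rfds over a lattice search space and supports other relaxations (e.g. approximation in data comparison).

```python
# Illustrative validity check for one simple relaxed FD (approximate X -> Y,
# tolerating a fraction of violating rows); not the discovery algorithm itself.
import pandas as pd

def approx_fd_holds(df, lhs, rhs, max_violation=0.05):
    # Within each LHS group, rows not carrying the most frequent RHS value violate the FD.
    group_sizes = df.groupby(lhs)[rhs].size()
    modal_sizes = df.groupby(lhs)[rhs].agg(lambda s: s.value_counts().iloc[0])
    violations = (group_sizes - modal_sizes).sum()
    return violations / len(df) <= max_violation

df = pd.DataFrame({"zip": ["10A", "10A", "10A", "20B"],
                   "city": ["Rome", "Rome", "Milan", "Turin"]})
print(approx_fd_holds(df, ["zip"], "city", max_violation=0.30))   # True  (1 of 4 rows violates)
print(approx_fd_holds(df, ["zip"], "city", max_violation=0.10))   # False
```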

    Updated: 2020-01-04
  • Identifying exceptional (dis)agreement between groups
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-26
    Adnene Belfodil, Sylvie Cazalens, Philippe Lamarre, Marc Plantevit

    Under the term behavioral data, we consider any type of data featuring individuals performing observable actions on entities. For instance, voting data depict parliamentarians who express their votes w.r.t. legislative procedures. In this work, we address the problem of discovering exceptional (dis)agreement patterns in such data, i.e., groups of individuals that exhibit an unexpected (dis)agreement under specific contexts compared to what is observed in overall terms. To tackle this problem, we design a generic approach, rooted in the Subgroup Discovery/Exceptional Model Mining framework, which enables the discovery of such patterns in two different ways. A branch-and-bound algorithm ensures an efficient exhaustive search of the underlying search space by leveraging closure operators and optimistic estimates on the interestingness measures. A second algorithm abandons the completeness by using a sampling paradigm which provides an alternative when an exhaustive search approach becomes unfeasible. To illustrate the usefulness of discovering exceptional (dis)agreement patterns, we report a comprehensive experimental study on four real-world datasets relevant to three different application domains: political analysis, rating data analysis and healthcare surveillance.

    Updated: 2020-01-04
  • SIAS-miner: mining subjectively interesting attributed subgraphs
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-22
    Anes Bendimerad, Ahmad Mel, Jefrey Lijffijt, Marc Plantevit, Céline Robardet, Tijl De Bie

    Abstract Data clustering, local pattern mining, and community detection in graphs are three mature areas of data mining and machine learning. In recent years, attributed subgraph mining has emerged as a new powerful data mining task in the intersection of these areas. Given a graph and a set of attributes for each vertex, attributed subgraph mining aims to find cohesive subgraphs for which (some of) the attribute values have exceptional values. The principled integration of graph and attribute data poses two challenges: (1) the definition of a pattern syntax (the abstract form of patterns) that is intuitive and lends itself to efficient search, and (2) the formalization of the interestingness of such patterns. We propose an integrated solution to both of these challenges. The proposed pattern syntax improves upon prior work in being both highly flexible and intuitive. Plus, we define an effective and principled algorithm to enumerate patterns of this syntax. The proposed approach for quantifying interestingness of these patterns is rooted in information theory, and is able to account for background knowledge on the data. While prior work quantified the interestingness for the cohesion of the subgraph and for the exceptionality of its attributes separately, then combining these in a parameterized trade-off, we instead handle this trade-off implicitly in a principled, parameter-free manner. Empirical results confirm we can efficiently find highly interesting subgraphs.

    Updated: 2020-01-04
  • On normalization and algorithm selection for unsupervised outlier detection
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-21
    Sevvandi Kandanaarachchi, Mario A. Muñoz, Rob J. Hyndman, Kate Smith-Miles

    This paper demonstrates that the performance of various outlier detection methods is sensitive to both the characteristics of the dataset, and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest neighbor structure, and density of the dataset; hence, affecting which observations could be considered outliers. Then, we perform an instance space analysis of combinations of normalization and detection methods. Such analysis enables the visualization of the strengths and weaknesses of these combinations. Moreover, we gain insights into which method combination might obtain the best performance for a given dataset.

    Updated: 2020-01-04
  • FastEE: Fast Ensembles of Elastic Distances for time series classification
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-18
    Chang Wei Tan, François Petitjean, Geoffrey I. Webb

    Abstract In recent years, many new ensemble-based time series classification (TSC) algorithms have been proposed. Each of them is significantly more accurate than its predecessors. The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is currently the most accurate TSC algorithm when assessed on the UCR repository. It is a meta-ensemble of five state-of-the-art ensemble-based classifiers. The time complexity of HIVE-COTE, particularly for training, is prohibitive for most datasets. There is thus a critical need to speed up the classifiers that compose HIVE-COTE. This paper focuses on speeding up one of its components: Ensembles of Elastic Distances (EE), the classifier that leverages decades of research into the development of time-dedicated measures. Training EE can be prohibitive for many datasets. For example, it takes a month on the ElectricDevices dataset with 9000 instances. This is because EE needs to cross-validate the hyper-parameters used for the 11 similarity measures it encompasses. In this work, Fast Ensembles of Elastic Distances is proposed to train EE faster. There are two versions. The exact version makes it possible to train EE 10 times faster. The approximate version is 40 times faster than EE without significantly impacting the classification accuracy. This translates to being able to train EE on ElectricDevices in 13 hours.

    Updated: 2020-01-04
  • Delayed labelling evaluation for data streams
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-16
    Maciej Grzenda, Heitor Murilo Gomes, Albert Bifet

    Abstract A large portion of the stream mining studies on classification rely on the availability of true labels immediately after making predictions. This approach is well exemplified by the test-then-train evaluation, where predictions immediately precede true label arrival. However, in many real scenarios, labels arrive with non-negligible latency. This raises the question of how to evaluate classifiers trained in such circumstances. This question is of particular importance when stream mining models are expected to refine their predictions between acquiring instance data and receiving its true label. In this work, we propose a novel evaluation methodology for data streams when verification latency takes place, namely continuous re-evaluation. It is applied to reference data streams and it is used to differentiate between stream mining techniques in terms of their ability to refine predictions based on newly arriving instances. Our study points out, discusses and shows empirically the importance of considering the delay of instance labels when evaluating classifiers for data streams.

    Updated: 2020-01-04
  • Deep multi-task learning for individuals origin–destination matrices estimation from census data
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-11-12
    Mehdi Katranji, Sami Kraiem, Laurent Moalic, Guilhem Sanmarty, Ghazaleh Khodabandelou, Alexandre Caminada, Fouad Hadj Selem

    Abstract Rapid urbanization has made the estimation of human mobility flows a substantial task for transportation and urban planners. Worker and student mobility flows are among the most regular weekly displacements and consequently generate road congestion issues. With the growing demand for efficient transport planning policies, estimating these commuting flows facilitates decision-making processes for local authorities. Worker and student censuses often contain home locations, work places and educational institutions. This paper proposes a novel approach to estimate individuals' origin–destination matrices from census datasets. We use a multi-task neural network to learn a generic model providing spatio-temporal estimations of commuters' dynamic mobility flows on a daily basis from static censuses. Multi-task learning aims at leveraging functional information incorporated in multiple tasks, which improves the generalization performance across all tasks. We first aggregate individuals' household travel surveys and census databases with working and studying trips. The model learns the temporal distribution of displacements from these static sources and is then applied to student and worker mobility sources to predict the temporal characteristics of commuters' displacements (i.e. origin–destination matrices). Our method yields substantially more stable predictions in terms of accuracy and results in significantly better error control in comparison to single-task learning.

    Updated: 2020-01-04
  • Correction to: Domain agnostic online semantic segmentation for multi-dimensional time series
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-02-14
    Shaghayegh Gharghabi, Chin-Chia Michael Yeh, Yifei Ding, Wei Ding, Paul Hibbing, Samuel LaMunion, Andrew Kaplan, Scott E. Crouter, Eamonn Keogh

    The article Domain agnostic online semantic segmentation for multi-dimensional time series, written by Shaghayegh Gharghabi, Chin-Chia Michael Yeh, Yifei Ding, Wei Ding, Paul Hibbing, Samuel LaMunion, Andrew Kaplan, Scott E. Crouter, Eamonn Keogh was originally published electronically on the publisher’s internet portal (currently SpringerLink) on 25 September 2018 without open access.

    Updated: 2020-01-04
  • Efficient mixture model for clustering of sparse high dimensional binary data
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-06-01
    Marek Śmieja, Krzysztof Hajto, Jacek Tabor

    Clustering is one of the fundamental tools for preliminary analysis of data. While most of the clustering methods are designed for continuous data, sparse high-dimensional binary representations became very popular in various domains such as text mining or cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity as well as in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering of sparse high dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix: is specially designed for the processing of sparse data; can be efficiently realized by an on-line Hartigan optimization algorithm; describes every cluster by the most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with reference grouping than related methods. Moreover, constructed representatives often better reveal the internal structure of data.

    Updated: 2020-01-04
  • catch22: CAnonical Time-series CHaracteristics
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-08-09
    Carl H. Lubba, Sarab S. Sethi, Philip Knaute, Simon R. Schultz, Ben D. Fulcher, Nick S. Jones

    Abstract Capturing the dynamical properties of time series concisely as interpretable feature vectors can enable efficient clustering and classification for time-series applications across science and industry. Selecting an appropriate feature-based representation of time series for a given application can be achieved through systematic comparison across a comprehensive time-series feature library, such as those in the hctsa toolbox. However, this approach is computationally expensive and involves evaluating many similar features, limiting the widespread adoption of feature-based representations of time series for real-world applications. In this work, we introduce a method to infer small sets of time-series features that (i) exhibit strong classification performance across a given collection of time-series problems, and (ii) are minimally redundant. Applying our method to a set of 93 time-series classification datasets (containing over 147,000 time series) and using a filtered version of the hctsa feature library (4791 features), we introduce a set of 22 CAnonical Time-series CHaracteristics, catch22, tailored to the dynamics typically encountered in time-series data-mining tasks. This dimensionality reduction, from 4791 to 22, is associated with an approximately 1000-fold reduction in computation time and near linear scaling with time-series length, despite an average reduction in classification accuracy of just 7%. catch22 captures a diverse and interpretable signature of time series in terms of their properties, including linear and non-linear autocorrelation, successive differences, value distributions and outliers, and fluctuation scaling properties. We provide an efficient implementation of catch22, accessible from many programming environments, that facilitates feature-based time-series analysis for scientific, industrial, financial and medical applications using a common language of interpretable time-series properties.
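    The pipeline such a feature set supports looks roughly like the sketch below: each series is mapped to a short, interpretable feature vector and a standard classifier is trained in that space. The features shown are generic stand-ins chosen for brevity, not the 22 catch22 features, and the synthetic data is only for demonstration.

```python
# Illustrative feature-based time-series classification pipeline on synthetic
# data; the features below are generic stand-ins, not the catch22 set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def simple_features(ts):
    diffs = np.diff(ts)
    ac1 = np.corrcoef(ts[:-1], ts[1:])[0, 1]          # lag-1 autocorrelation
    return [ts.mean(), ts.std(), ac1, np.mean(np.abs(diffs)), ts.max() - ts.min()]

rng = np.random.default_rng(3)
n, length = 200, 100
noise = rng.normal(size=(n, length))
sine = np.sin(np.linspace(0, 8 * np.pi, length)) + rng.normal(scale=0.5, size=(n, length))
X = np.array([simple_features(ts) for ts in np.vstack([noise, sine])])
y = np.array([0] * n + [1] * n)
print(cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean())
```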

    Updated: 2020-01-04
  • A unifying view of explicit and implicit feature maps of graph kernels
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-09-17
    Nils M. Kriege, Marion Neumann, Christopher Morris, Kristian Kersting, Petra Mutzel

    Abstract Non-linear kernel methods can be approximated by fast linear ones using suitable explicit feature maps allowing their application to large scale problems. We investigate how convolution kernels for structured data are composed from base kernels and construct corresponding feature maps. On this basis we propose exact and approximative feature maps for widely used graph kernels based on the kernel trick. We analyze for which kernels and graph properties computation by explicit feature maps is feasible and actually more efficient. In particular, we derive approximative, explicit feature maps for state-of-the-art kernels supporting real-valued attributes including the GraphHopper and graph invariant kernels. In extensive experiments we show that our approaches often achieve a classification accuracy close to the exact methods based on the kernel trick, but require only a fraction of their running time. Moreover, we propose and analyze algorithms for computing random walk, shortest-path and subgraph matching kernels by explicit and implicit feature maps. Our theoretical results are confirmed experimentally by observing a phase transition when comparing running time with respect to label diversity, walk lengths and subgraph size, respectively.
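    The explicit/implicit duality can be seen on a deliberately simple example, a vertex-label histogram kernel: the explicit feature map counts labels, and the implicit kernel value is the inner product of those counts. The kernel and graphs below are illustrative assumptions; the paper treats far richer kernels (e.g. GraphHopper, random walk, shortest-path and subgraph matching kernels).

```python
# A deliberately simple vertex-label histogram kernel: the explicit feature
# map counts labels, and the implicit kernel value is the inner product of
# those counts. Purely illustrative; the paper treats much richer kernels.
from collections import Counter

def phi(vertex_labels):
    return Counter(vertex_labels)                 # explicit (sparse) feature map

def kernel(labels_g, labels_h):
    fg, fh = phi(labels_g), phi(labels_h)
    return sum(fg[l] * fh[l] for l in fg)         # <phi(G), phi(H)> without a kernel matrix

G = ["C", "C", "O", "H", "H"]
H = ["C", "O", "O", "H"]
print(dict(phi(G)), kernel(G, H))                 # {'C': 2, 'O': 1, 'H': 2}  and  2*1 + 1*2 + 2*1 = 6
```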

    Updated: 2020-01-04
  • A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-06-17
    James Large, Jason Lines, Anthony Bagnall

    Abstract Our hypothesis is that building ensembles of small sets of strong classifiers constructed with different learning algorithms is, on average, the best approach to classification for real-world problems. We propose a simple mechanism for building small heterogeneous ensembles based on exponentially weighting the probability estimates of the base classifiers with an estimate of the accuracy formed through cross-validation on the train data. We demonstrate through extensive experimentation that, given the same small set of base classifiers, this method has measurable benefits over commonly used alternative weighting, selection or meta-classifier approaches to heterogeneous ensembles. We also show how an ensemble of five well-known, fast classifiers can produce an ensemble that is not significantly worse than large homogeneous ensembles and tuned individual classifiers on datasets from the UCI archive. We provide evidence that the performance of the cross-validation accuracy weighted probabilistic ensemble (CAWPE) generalises to a completely separate set of datasets, the UCR time series classification archive, and we also demonstrate that our ensemble technique can significantly improve the state-of-the-art classifier for this problem domain. We investigate the performance in more detail, and find that the improvement is most marked in problems with smaller train sets. We perform a sensitivity analysis and an ablation study to demonstrate the robustness of the ensemble and the significant contribution of each design element of the classifier. We conclude that it is, on average, better to ensemble strong classifiers with a weighting scheme rather than perform extensive tuning and that CAWPE is a sensible starting point for combining classifiers.
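    The combination rule described above can be sketched directly: estimate each base classifier's accuracy by cross-validation on the training data, raise it to a power, and use the result to weight the classifiers' probability estimates. The base classifiers, dataset, and exponent value below are illustrative choices, not the exact configuration evaluated in the paper.

```python
# Sketch of the combination rule: weight each base classifier's probability
# estimates by its cross-validated training accuracy raised to a power.
# Base classifiers, dataset and exponent are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
bases = [LogisticRegression(max_iter=5000), GaussianNB(), DecisionTreeClassifier(random_state=0)]
alpha = 4                                                    # exponent on the accuracy estimate

weights, probas = [], []
for clf in bases:
    acc = cross_val_score(clf, X_tr, y_tr, cv=5).mean()      # accuracy estimated on the train data
    weights.append(acc ** alpha)
    probas.append(clf.fit(X_tr, y_tr).predict_proba(X_te))

ensemble = np.average(probas, axis=0, weights=weights)       # weighted probability estimates
print("ensemble accuracy:", (ensemble.argmax(axis=1) == y_te).mean())
```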

    Updated: 2020-01-04
  • SAZED: parameter-free domain-agnostic season length estimation in time series data
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-07-26
    Maximilian Toller, Tiago Santos, Roman Kern

    Abstract Season length estimation is the task of identifying the number of observations in the dominant repeating pattern of seasonal time series data. As such, it is a common pre-processing task crucial for various downstream applications. Inferring season length from a real-world time series is often challenging due to phenomena such as slightly varying period lengths and noise. These issues may, in turn, lead practitioners to dedicate considerable effort to preprocessing of time series data since existing approaches either require dedicated parameter-tuning or their performance is heavily domain-dependent. Hence, to address these challenges, we propose SAZED: spectral and average autocorrelation zero distance density. SAZED is a versatile ensemble of multiple, specialized time series season length estimation approaches. The combination of various base methods selected with respect to domain-agnostic criteria and a novel seasonality isolation technique, allow a broad applicability to real-world time series of varied properties. Further, SAZED is theoretically grounded and parameter-free, with a computational complexity of \(\mathcal {O}(n\log n)\) , which makes it applicable in practice. In our experiments, SAZED was statistically significantly better than every other method on at least one dataset. The datasets we used for the evaluation consist of time series data from various real-world domains, sterile synthetic test cases and synthetic data that were designed to be seasonal and yet have no finite statistical moments of any order.
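    Two of the simple base estimators such an ensemble might combine are sketched below: a spectral one (dominant FFT frequency) and an autocorrelation one (first prominent ACF peak). These are generic illustrations on synthetic data, not SAZED's actual components or its seasonality isolation step.

```python
# Two illustrative base estimators for season length on synthetic data:
# a spectral one (dominant FFT frequency) and an autocorrelation one.
import numpy as np

def season_length_fft(x):
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    k = spectrum[1:].argmax() + 1                 # skip the zero-frequency bin
    return int(round(1.0 / freqs[k]))

def season_length_acf(x):
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    acf = acf / acf[0]
    for lag in range(1, len(acf) - 1):            # first prominent local ACF maximum
        if acf[lag] > acf[lag - 1] and acf[lag] > acf[lag + 1] and acf[lag] > 0.2:
            return lag
    return None

t = np.arange(600)
x = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(4).normal(size=600)
print(season_length_fft(x), season_length_acf(x))   # both should recover the 24-step season
```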

    Updated: 2020-01-04
  • Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-07-20
    Domenico Saccá, Edoardo Serra, Antonino Rullo

    Abstract The development of novel platforms and techniques for emerging “Big Data” applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high quality experimental findings is to synthesize datasets that reflect patterns of real ones using a two-step approach: first a real dataset X is analyzed to derive relevant patterns Z (latent variables) and, then, such patterns are used to reconstruct a new dataset \(X'\) that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining ( \(\texttt {IFM}\) ), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, that are typically the frequent ones. This paper introduces various extensions of \(\texttt {IFM}\) within a uniform framework with the aim to generate artificial datasets that reflect more elaborated patterns (in particular infrequency and duplicate constraints) of real ones. Furthermore, in order to further enlarge the application domain of \(\texttt {IFM}\) , an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.

    Updated: 2020-01-04
  • Contextual bandits with hidden contexts: a focused data capture from social media streams
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-08-10
    Sylvain Lamprier, Thibault Gisselbrecht, Patrick Gallinari

    This paper addresses the problem of real-time data capture from social media. Due to various limitations, it is not possible to collect all the data produced by social networks such as Twitter. Therefore, to be able to gather enough relevant information related to a predefined need, it is necessary to focus on a subset of the information sources. In this work, we focus on user-centered data capture and consider each account of a social network as a source that can be followed at each iteration of a data capture process. This process, whose aim is to maximize the cumulative utility of the captured information for the specified need, is constrained at each time step by the number of users that can be monitored simultaneously. The problem of selecting a subset of accounts to listen to over time is a sequential decision problem under constraints, which we formalize as a bandit problem with multiple selections. In this work, we propose a contextual UCB-like approach that uses the activity of any user during the current step to predict his future behavior. Besides capturing variations in usefulness, considering contexts also improves the efficiency of the process by leveraging some structure in the search space. However, existing contextual bandit approaches do not fit our setting, where most of the contexts are hidden from the agent. We therefore propose a new algorithm, called HiddenLinUCB, which aims at dealing with such missing information via variational inference. Experiments demonstrate the very good behavior of this approach compared to existing methods for tasks of data capture from social networks.
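    For context, the sketch below shows a standard LinUCB-style selection rule with multiple plays per step, the kind of linear contextual bandit the proposed HiddenLinUCB extends; the hidden-context handling via variational inference is not reproduced. The reward model and all parameters are synthetic assumptions.

```python
# Standard LinUCB-style selection with k plays per step (synthetic rewards);
# the hidden-context handling of HiddenLinUCB is not reproduced here.
import numpy as np

class LinUCB:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)                      # ridge-regularised design matrix
        self.b = np.zeros(dim)
        self.alpha = alpha

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)   # mean + exploration bonus

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

rng = np.random.default_rng(5)
n_users, dim, k = 50, 4, 5                        # follow k accounts per iteration
true_theta = rng.normal(size=dim)
bandit = LinUCB(dim)
for step in range(200):
    contexts = rng.normal(size=(n_users, dim))    # observed activity features per account
    ucb = np.array([bandit.score(x) for x in contexts])
    for u in np.argsort(ucb)[-k:]:                # the k accounts with highest UCB
        reward = contexts[u] @ true_theta + rng.normal(scale=0.1)
        bandit.update(contexts[u], reward)
print(np.round(np.linalg.inv(bandit.A) @ bandit.b, 2), np.round(true_theta, 2))
```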

    Updated: 2020-01-04
  • Attributed network embedding via subspace discovery
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-08-26
    Daokun Zhang, Jie Yin, Xingquan Zhu, Chengqi Zhang

    Network embedding aims to learn latent, low-dimensional vector representations of network nodes that are effective in supporting various network analytic tasks. While prior art on network embedding focuses primarily on preserving network topology structure to learn node representations, recently proposed attributed network embedding algorithms attempt to integrate rich node content information with network topological structure for enhancing the quality of network embedding. In reality, networks often have sparse content, incomplete node attributes, as well as a discrepancy between the node attribute feature space and the network structure space, which severely deteriorates the performance of existing methods. In this paper, we propose a unified framework for attributed network embedding, attri2vec, which learns node embeddings by discovering a latent node attribute subspace via a network structure guided transformation performed on the original attribute space. The resultant latent subspace can respect network structure in a more consistent way towards learning high-quality node representations. We formulate an optimization problem which is solved by an efficient stochastic gradient descent algorithm, with time complexity linear in the number of nodes. We investigate a series of linear and non-linear transformations performed on node attributes and empirically validate their effectiveness on various types of networks. Another advantage of attri2vec is its ability to solve out-of-sample problems, where embeddings of newly arriving nodes can be inferred from their node attributes through the learned mapping function. Experiments on various types of networks confirm that attri2vec is superior to state-of-the-art baselines for node classification, node clustering, as well as out-of-sample link prediction tasks. The source code of this paper is available at https://github.com/daokunzhang/attri2vec.

    Updated: 2020-01-04
  • Dynamics reconstruction and classification via Koopman features
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-06-24
    Wei Zhang, Yao-Chi Yu, Jr-Shin Li

    Abstract Knowledge discovery and information extraction of large and complex datasets has attracted great attention in wide-ranging areas from statistics and biology to medicine. Tools from machine learning, data mining, and neurocomputing have been extensively explored and utilized to accomplish such compelling data analytics tasks. However, for time-series data presenting active dynamic characteristics, many of the state-of-the-art techniques may not perform well in capturing the inherited temporal structures in these data. In this paper, integrating the Koopman operator and linear dynamical systems theory with support vector machines, we develop a novel dynamic data mining framework to construct low-dimensional linear models that approximate the nonlinear flow of high-dimensional time-series data generated by unknown nonlinear dynamical systems. This framework then immediately enables pattern recognition, e.g., classification, of complex time-series data to distinguish their dynamic behaviors by using the trajectories generated by the reduced linear systems. Moreover, we demonstrate the applicability and efficiency of this framework through the problems of time-series classification in bioinformatics and healthcare, including cognitive classification and seizure detection with fMRI and EEG data, respectively. The developed Koopman dynamic learning framework then lays a solid foundation for effective dynamic data mining and promises a mathematically justified method for extracting the dynamics and significant temporal structures of nonlinear dynamical systems.
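    The core idea of approximating a nonlinear flow by a linear model can be sketched with a least-squares (DMD-style) fit of X_{t+1} ≈ A X_t, whose fitted matrix then serves as a compact dynamic signature. This is only an illustration under simplifying assumptions; the paper's Koopman-based framework (choice of observables, model reduction, SVM integration) goes well beyond it.

```python
# Least-squares (DMD-style) fit of a linear map X_{t+1} ~ A X_t as a compact
# dynamic signature of a trajectory; only an illustration of the idea.
import numpy as np

def fit_linear_dynamics(X):
    """X: (time, dim) trajectory; returns A minimising ||X[1:] - X[:-1] A^T||."""
    A_T, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    return A_T.T

t = np.linspace(0, 10, 500)
traj = np.column_stack([np.cos(2 * t), np.sin(2 * t)])     # a simple oscillator
A = fit_linear_dynamics(traj)
eigvals = np.linalg.eigvals(A)
print(np.round(A, 3))
print(np.abs(eigvals), np.angle(eigvals))                   # rotation-like eigenvalues near the unit circle
```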

    Updated: 2020-01-04
  • Wrangling messy CSV files by detecting row and type patterns
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-07-26
    G. J. J. van den Burg, A. Nazábal, C. Sutton

    Abstract Data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library. Our measure of data consistency is not specific to the data parsing problem, and has potential for more general applicability.
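    The Python standard library's dialect sniffer is one of the existing approaches the paper compares against; a minimal use is shown below on an assumed messy sample. The paper's own detector instead scores candidate dialects by the consistency of the parsed result.

```python
# The standard library's dialect sniffer, one of the baselines compared
# against in the paper, applied to an assumed messy sample.
import csv
import io

messy = 'name;"title";year\n"Smith, J";"data; science";2019\n'
dialect = csv.Sniffer().sniff(messy)
print("delimiter:", repr(dialect.delimiter), "quotechar:", repr(dialect.quotechar))
print(list(csv.reader(io.StringIO(messy), dialect)))
```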

    Updated: 2020-01-04
  • Domain agnostic online semantic segmentation for multi-dimensional time series.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2019-03-05
    Shaghayegh Gharghabi, Chin-Chia Michael Yeh, Yifei Ding, Wei Ding, Paul Hibbing, Samuel LaMunion, Andrew Kaplan, Scott E. Crouter, Eamonn Keogh

    Unsupervised semantic segmentation in the time series domain is a much studied problem due to its potential to detect unexpected regularities and regimes in poorly understood data. However, the current techniques have several shortcomings, which have limited the adoption of time series semantic segmentation beyond academic settings for four primary reasons. First, most methods require setting/learning many parameters and thus may have problems generalizing to novel situations. Second, most methods implicitly assume that all the data is segmentable and have difficulty when that assumption is unwarranted. Thirdly, many algorithms are only defined for the single dimensional case, despite the ubiquity of multi-dimensional data. Finally, most research efforts have been confined to the batch case, but online segmentation is clearly more useful and actionable. To address these issues, we present a multi-dimensional algorithm, which is domain agnostic, has only one, easily-determined parameter, and can handle data streaming at a high rate. In this context, we test the algorithm on the largest and most diverse collection of time series datasets ever considered for this task and demonstrate the algorithm's superiority over current solutions.

    Updated: 2019-11-01
  • Data-driven generation of spatio-temporal routines in human mobility.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2018-01-01
    Luca Pappalardo, Filippo Simini

    The generation of realistic spatio-temporal trajectories of human mobility is of fundamental importance in a wide range of applications, such as the development of protocols for mobile ad-hoc networks or what-if analysis in urban ecosystems. Current generative algorithms fail to accurately reproduce individuals' recurrent schedules and, at the same time, to account for the possibility that individuals may break the routine during periods of variable duration. In this article we present Ditras (DIary-based TRAjectory Simulator), a framework to simulate the spatio-temporal patterns of human mobility. Ditras operates in two steps: the generation of a mobility diary and the translation of the mobility diary into a mobility trajectory. We propose a data-driven algorithm which constructs a diary generator from real data, capturing the tendency of individuals to follow or break their routine. We also propose a trajectory generator based on the concepts of preferential exploration and preferential return. We instantiate Ditras with the proposed diary and trajectory generators and compare the resulting algorithm with real data and synthetic data produced by other generative algorithms, built by instantiating Ditras with several combinations of diary and trajectory generators. We show that the proposed algorithm reproduces the statistical properties of real trajectories in the most accurate way, taking a step forward in the understanding of the origin of the spatio-temporal patterns of human mobility.
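    The preferential exploration / preferential return mechanism mentioned above can be sketched as a single simulation step: with a probability that decreases as more places become known, visit a new location, otherwise return to a known one in proportion to past visits. The parameter values and location model below are illustrative assumptions, not the fitted generators used by Ditras.

```python
# Illustrative preferential exploration / preferential return step with made-up
# parameter values; not the fitted generators used by Ditras.
import random

def next_location(visits, n_locations, rho=0.6, gamma=0.21):
    s = len(visits)                                   # distinct locations visited so far
    if s == 0 or random.random() < rho * s ** (-gamma):
        # preferential exploration: visit a new location
        new = random.choice([l for l in range(n_locations) if l not in visits])
        visits[new] = 1
        return new
    # preferential return: revisit in proportion to past visit counts
    loc = random.choices(list(visits), weights=list(visits.values()))[0]
    visits[loc] += 1
    return loc

random.seed(0)
visits, trajectory = {}, []
for _ in range(200):
    trajectory.append(next_location(visits, n_locations=500))
print(len(visits), "distinct locations over 200 steps")
```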

    Updated: 2019-11-01
  • The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2017-01-01
    Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh

    In the last 5 years there have been a large number of new time series classification algorithms proposed in the literature. These algorithms have been evaluated on subsets of the 47 data sets in the University of California, Riverside time series classification archive. The archive has recently been expanded to 85 data sets, over half of which have been donated by researchers at the University of East Anglia. Aspects of previous evaluations have made comparisons between algorithms difficult. For example, several different programming languages have been used, experiments involved a single train/test split and some used normalised data whilst others did not. The relaunch of the archive provides a timely opportunity to thoroughly evaluate algorithms on a larger number of datasets. We have implemented 18 recently proposed algorithms in a common Java framework and compared them against two standard benchmark classifiers (and each other) by performing 100 resampling experiments on each of the 85 datasets. We use these results to test several hypotheses relating to whether the algorithms are significantly more accurate than the benchmarks and each other. Our results indicate that only nine of these algorithms are significantly more accurate than both benchmarks and that one classifier, the collective of transformation ensembles, is significantly more accurate than all of the others. All of our experiments and results are reproducible: we release all of our code, results and experimental details and we hope these experiments form the basis for more robust testing of new algorithms in the future.

    Updated: 2019-11-01
  • ECM-Aware Cell-Graph Mining for Bone Tissue Modeling and Classification.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2010-06-15
    Cemal Cagatay Bilgin, Peter Bullough, George E. Plopper, Bülent Yener

    Pathological examination of a biopsy is the most reliable and widely used technique to diagnose bone cancer. However, it suffers from both inter- and intra-observer subjectivity. Techniques for automated tissue modeling and classification can reduce this subjectivity and increase the accuracy of bone cancer diagnosis. This paper presents a graph theoretical method, called extracellular matrix (ECM)-aware cell-graph mining, that combines the ECM formation with the distribution of cells in hematoxylin and eosin (H&E) stained histopathological images of bone tissue samples. This method can identify different types of cells that coexist in the same tissue as a result of its functional state. Thus, it models the structure-function relationships more precisely and classifies bone tissue samples accurately for cancer diagnosis. The tissue images are segmented, using the eigenvalues of the Hessian matrix, to compute spatial coordinates of cell nuclei as the nodes of the corresponding cell-graph. Upon segmentation, a color code is assigned to each node based on the composition of its surrounding ECM. An edge is hypothesized (and established) between a pair of nodes if the corresponding cell membranes are in physical contact and if they share the same color. Hence, multiple colored cell-graphs coexist in a tissue, each modeling a different cell-type organization. Both topological and spectral features of ECM-aware cell-graphs are computed to quantify the structural properties of tissue samples and classify their different functional states as healthy, fractured, or cancerous using support vector machines. In a comparison of classification accuracy to related work, the ECM-aware cell-graph approach yields 90.0%, whereas the Delaunay triangulation and the simple cell-graph approach achieve 75.0% and 81.1% accuracy, respectively.

    更新日期:2019-11-01
  • A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2003-07-01
    Greg Ridgeway, David Madigan

    Markov chain Monte Carlo (MCMC) techniques revolutionized statistical practice in the 1990s by providing an essential toolkit for making the rigor and flexibility of Bayesian analysis computationally practical. At the same time, the increasing prevalence of massive datasets and the expansion of the field of data mining have created the need for statistically sound methods that scale to these large problems. Except for the most trivial examples, current MCMC methods require a complete scan of the dataset for each iteration, eliminating their candidacy as feasible data mining techniques. In this article we present a method for making Bayesian analysis of massive datasets computationally feasible. The algorithm simulates from a posterior distribution that conditions on a smaller, more manageable portion of the dataset. The remainder of the dataset may be incorporated by reweighting the initial draws using importance sampling. Computation of the importance weights requires a single scan of the remaining observations. While importance sampling increases efficiency in data access, it comes at the expense of estimation efficiency. A simple modification, based on the "rejuvenation" step used in particle filters for dynamic systems models, sidesteps the loss of efficiency with only a slight increase in the number of data accesses. To show proof of concept, we demonstrate the method on two examples. The first is a mixture of transition models that has been used to model web traffic and robotics. For this example we show that estimation efficiency is not affected while offering a 99% reduction in data accesses. The second example applies the method to Bayesian logistic regression and yields a 98% reduction in data accesses.
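
    The reweighting idea is easy to illustrate on a toy conjugate model. The sketch below draws from the posterior of a Gaussian mean given a small subset, then incorporates the remaining observations through importance weights computed from sufficient statistics gathered in a single scan; the Gaussian-mean model and the omission of the rejuvenation step are simplifications made for illustration.

```python
# A minimal sketch, assuming a toy Gaussian-mean model with known unit variance, of the
# reweighting idea described above: draw from the posterior given a manageable subset, then
# incorporate the remaining observations through importance weights computed from sufficient
# statistics gathered in a single scan. The "rejuvenation" step is omitted here.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)      # stand-in for a massive dataset
subset, rest = data[:1_000], data[1_000:]

# Posterior for the mean under a flat prior, conditioning on the subset only.
post_mean, post_sd = subset.mean(), 1.0 / np.sqrt(len(subset))
draws = rng.normal(post_mean, post_sd, size=5_000)

# Single scan of the remaining data: collect sufficient statistics once ...
n_rest, sum_rest, sumsq_rest = len(rest), rest.sum(), np.sum(rest ** 2)

# ... then the log importance weight of each draw is the log-likelihood of `rest` at that draw.
log_w = draws * sum_rest - 0.5 * n_rest * draws ** 2 - 0.5 * sumsq_rest
log_w -= log_w.max()                     # stabilise before exponentiating
w = np.exp(log_w)
w /= w.sum()

posterior_mean_estimate = float(np.sum(w * draws))   # importance-weighted posterior mean
```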

    更新日期:2019-11-01
  • FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2012-05-29
    Keith Noto, Carla Brodley, Donna Slonim

    Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach.
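
    A minimal sketch of the feature-modeling idea, not the authors' code, is given below: one predictor per feature is learned from normal data, and a test instance is scored by how strongly its observed values disagree with the predictions. Using linear regressors and squared error as the disagreement measure is an illustrative simplification; FRaC itself uses an ensemble of learners and an information-theoretic (surprisal-based) score.

```python
# A minimal sketch of the feature-modeling idea, not the authors' code: one predictor per
# feature is learned from normal data, and a test instance is scored by how strongly its
# observed values disagree with the predictions. Linear regressors and squared error are
# illustrative simplifications; FRaC uses an ensemble of learners and a surprisal-based score.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_feature_models(X_normal):
    """Learn one model per feature, predicting it from all other features."""
    models = []
    for j in range(X_normal.shape[1]):
        others = np.delete(X_normal, j, axis=1)
        models.append(LinearRegression().fit(others, X_normal[:, j]))
    return models

def anomaly_scores(models, X):
    """Higher score = the instance disagrees more with the learned feature models."""
    scores = np.zeros(len(X))
    for j, model in enumerate(models):
        pred = model.predict(np.delete(X, j, axis=1))
        scores += (X[:, j] - pred) ** 2
    return scores
```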

    更新日期:2019-11-01
  • Sensor Selection to Support Practical Use of Health-Monitoring Smart Environments.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2011-07-16
    Diane J Cook, Lawrence B Holder

    The data mining and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track the activities that people normally perform as part of their daily routines. One question that frequently arises, however, is how many smart home sensors are needed, and where they should be placed, in order to accurately recognize activities. We employ data mining techniques to examine the problem of sensor selection for activity recognition in smart homes, and analyze the results on six data sets collected in five distinct smart home environments.
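
    One simple way to frame the sensor selection question as a data mining problem is sketched below; it is an illustrative baseline, not the authors' system: sensors are ranked by how informative their event counts are for the activity label, and the top-k are retained. The count-per-sensor representation and the mutual-information criterion are assumptions made for this example.

```python
# A minimal sketch, not the authors' system, of one way to frame sensor selection as a data
# mining problem: rank sensors by how informative their event counts are for the activity
# label, then keep the top-k. The count-per-sensor representation and the mutual-information
# criterion are assumptions made for this example.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_sensors(sensor_counts, activity_labels):
    """sensor_counts: (n_windows, n_sensors) event counts per time window."""
    mi = mutual_info_classif(sensor_counts, activity_labels, random_state=0)
    return np.argsort(mi)[::-1]          # sensor indices, most informative first

def select_top_k_sensors(sensor_counts, activity_labels, k):
    return rank_sensors(sensor_counts, activity_labels)[:k]
```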

    更新日期:2019-11-01
  • Generalizing DTW to the multi-dimensional case requires an adaptive approach.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2017-11-07
    Mohammad Shokoohi-Yekta, Bing Hu, Hongxia Jin, Jun Wang, Eamonn Keogh

    In recent years Dynamic Time Warping (DTW) has emerged as the distance measure of choice for virtually all time series data mining applications. For example, virtually all applications that process data from wearable devices use DTW as a core sub-routine. This is the result of significant progress in improving DTW's efficiency, together with multiple empirical studies showing that DTW-based classifiers at least equal (and generally surpass) the accuracy of all their rivals across dozens of datasets. Thus far, most of the research has considered only the one-dimensional case, with practitioners generalizing to the multi-dimensional case in one of two ways, dependent or independent warping. In general, it appears the community believes either that the two ways are equivalent, or that the choice is irrelevant. In this work, we show that this is not the case. The two most commonly used multi-dimensional DTW methods can produce different classifications, and neither one dominates over the other. This seems to suggest that one should learn the best method for a particular application. However, we will show that this is not necessary; a simple, principled rule can be used on a case-by-case basis to predict which of the two methods we should trust at the time of classification. Our method allows us to ensure that classification results are at least as accurate as the better of the two rival methods, and, in many cases, our method is significantly more accurate. We demonstrate our ideas with the most extensive set of multi-dimensional time series classification experiments ever attempted.
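
    The two generalizations contrasted above are easy to state in code. The sketch below shows dependent warping (a single warping path over the vector-valued series, DTW_D) and independent warping (per-dimension DTW summed, DTW_I); no warping window or lower bounding is used, so this illustrates the definitions rather than an efficient implementation.

```python
# A minimal sketch of the two generalizations contrasted above: dependent warping (DTW_D)
# uses a single warping path over the vector-valued series, while independent warping (DTW_I)
# runs DTW on each dimension separately and sums the results. No warping window or
# lower bounding is used, so this illustrates the definitions rather than an efficient solver.
import numpy as np

def dtw(a, b, dist):
    """Classic O(n*m) dynamic-programming DTW with an arbitrary point-wise distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_dependent(A, B):
    """A, B: (length, n_dims) arrays; all dimensions share one warping path."""
    return dtw(A, B, lambda x, y: float(np.sum((x - y) ** 2)))

def dtw_independent(A, B):
    """Each dimension is warped independently; the per-dimension distances are summed."""
    return sum(dtw(A[:, d], B[:, d], lambda x, y: (x - y) ** 2) for d in range(A.shape[1]))
```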

    更新日期:2019-11-01
  • Inhibiting diffusion of complex contagions in social networks: theoretical and experimental results.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2015-03-10
    Chris J Kuhlman, V S Anil Kumar, Madhav V Marathe, S S Ravi, Daniel J Rosenkrantz

    We consider the problem of inhibiting undesirable contagions (e.g. rumors, spread of mob behavior) in social networks. Much of the work in this context has been carried out under the 1-threshold model, where diffusion occurs when a node has just one neighbor with the contagion. We study the problem of inhibiting more complex contagions in social networks where nodes may have thresholds larger than 1. The goal is to minimize the propagation of the contagion by removing a small number of nodes (called critical nodes) from the network. We study several versions of this problem and prove that, in general, they cannot even be efficiently approximated to within any factor ρ ≥ 1, unless P = NP. We develop efficient and practical heuristics for these problems and carry out an experimental study of their performance on three well-known social networks, namely Epinions, Wikipedia and Slashdot. Our results show that these heuristics perform significantly better than five other known methods. We also establish an efficiently computable upper bound on the number of nodes to which a contagion can spread and evaluate this bound on many real and synthetic networks.
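
    For concreteness, the sketch below simulates t-threshold diffusion and removes critical nodes with a simple high-degree heuristic. The heuristics evaluated in the paper are more refined; removing the highest-degree uninfected nodes is only an illustrative baseline.

```python
# A minimal sketch of the setting described above: a t-threshold diffusion simulation and a
# simple high-degree heuristic for choosing critical nodes to remove. The heuristics studied
# in the paper are more refined; removing the highest-degree uninfected nodes is only an
# illustrative baseline.
import networkx as nx

def spread(G, seeds, threshold=2):
    """Simulate complex contagion: a node adopts once >= threshold neighbors are infected."""
    infected = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in G.nodes:
            if v not in infected and sum(1 for u in G.neighbors(v) if u in infected) >= threshold:
                infected.add(v)
                changed = True
    return infected

def remove_critical_nodes(G, seeds, budget, threshold=2):
    """Remove `budget` high-degree non-seed nodes, then report the residual spread."""
    H = G.copy()
    for _ in range(budget):
        candidates = [v for v in H.nodes if v not in seeds]
        if not candidates:
            break
        H.remove_node(max(candidates, key=H.degree))
    return H, spread(H, seeds, threshold)
```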

    更新日期:2019-11-01
  • Visual Semantic Based 3D Video Retrieval System Using HDFS.
    Data Min. Knowl. Discov. (IF 2.879) Pub Date : 2016-12-23
    C Ranjith Kumar, S Suguna

    This paper presents a new framework for visual-semantic 3D video search and retrieval. Existing 3D retrieval applications focus on shape analysis tasks such as object matching, classification and retrieval, rather than on video retrieval as a whole. In this context, we explore the concept of 3D content-based video retrieval (3D-CBVR) for the first time, combining a bag-of-visual-words (BOVW) representation with MapReduce in a 3D framework. Instead of conventional shape-based local descriptors, we combine shape, color and texture for feature extraction, using geometric and topological features for shape and a 3D co-occurrence matrix for color and texture. After the local descriptors are extracted, a Threshold-Based Predictive Clustering Tree (TB-PCT) algorithm generates the visual codebook and the corresponding histograms. Matching is then performed with a soft weighting scheme and the L2 distance function, and the retrieved results are ranked by index value and returned to the user. To handle the large volume of data and make retrieval efficient, the system stores data in HDFS. Experiments on a 3D video dataset show that the proposed approach returns accurate results while reducing time complexity.
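
    The retrieval pipeline can be illustrated with a short bag-of-visual-words sketch, which is not the authors' system: local descriptors are quantized against a visual codebook, a soft-weighted histogram is built per video, and candidates are ranked by L2 distance. KMeans stands in for the TB-PCT codebook algorithm, and the HDFS/MapReduce layer is omitted.

```python
# A minimal sketch of the bag-of-visual-words retrieval pipeline, not the authors' system:
# local descriptors are quantized against a visual codebook, a soft-weighted histogram is
# built per video, and candidates are ranked by L2 distance. KMeans stands in for the TB-PCT
# codebook algorithm, and the HDFS/MapReduce layer is omitted.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=256, seed=0):
    """all_descriptors: (n_descriptors, dim) array pooled from the training videos."""
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_descriptors)

def soft_histogram(descriptors, codebook, sigma=1.0):
    """Each descriptor votes for every visual word, weighted by a Gaussian of its distance."""
    dists = np.linalg.norm(descriptors[:, None, :] - codebook.cluster_centers_[None, :, :], axis=2)
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    hist = weights.sum(axis=0)
    return hist / (np.linalg.norm(hist) + 1e-12)

def rank_videos(query_hist, video_hists):
    """Return video indices ordered by increasing L2 distance to the query histogram."""
    return np.argsort(np.linalg.norm(video_hists - query_hist, axis=1))
```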

    更新日期:2019-11-01
Contents have been reproduced by permission of the publishers.