• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-28
Zhou Yang; Heli Sun; Jianbin Huang; Zhongbin Sun; Hui Xiong; Shaojie Qiao; Ziyu Guan; Xiaolin Jia

Destination prediction is an essential task in various mobile applications, and many methods have been proposed to date. However, existing methods usually suffer from heavy computational burden, data sparsity, and low coverage. Therefore, a novel approach named DestPD is proposed to tackle these problems. Differing from an earlier approach that only considers the starting and current location of a partial trip, DestPD first determines the most likely future location and then predicts the destination. It comprises two phases: offline training and online prediction. During offline training, transition probabilities between two locations are obtained via Markov transition matrix multiplication. To improve the efficiency of matrix multiplication, we propose two data constructs, Efficient Transition Probability (ETP) and Transition Probabilities with Detours (TPD). They are capable of pinpointing the minimum amount of needed computation. During online prediction, we design Obligatory Update Point (OUP) and Transition Affected Area (TAA) to accelerate the frequent update of ETP and TPD for recomputing the transition probabilities. Moreover, a new future trajectory prediction approach is devised. It captures the most recent movement based on a query trajectory. It consists of two components: similarity finding through Best Path Notation (BPN) and best node selection. Our novel BPN similarity finding scheme keeps track of the nodes that induce inefficiency and then finds similarity fast based on these nodes. It is particularly suitable for trajectories with overlapping segments. Finally, the destination is predicted by combining transition probabilities and the most probable future location through Bayesian reasoning. The DestPD method is proved to achieve a one-order reduction in both time and space complexity. Furthermore, the experimental results on real-world and synthetic datasets show that DestPD consistently surpasses the state-of-the-art methods in terms of both efficiency (approximately 100 times faster or more) and accuracy.
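The Markov step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a tiny three-location map with a made-up one-step transition matrix `P`, and it uses a simplified Bayes-style score (prior times reachability). The function names, the matrix, and the scoring rule are illustrative assumptions, not DestPD's ETP/TPD structures or its actual formula.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def k_step(P, k):
    """Return P**k: probabilities of moving between locations in exactly k steps."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result


def destination_posterior(reach_from_current, prior):
    """Simplified Bayes-style score: prior(d) * P(current -> d), normalised over d."""
    scores = [p * r for p, r in zip(prior, reach_from_current)]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


# Hypothetical one-step transition matrix over three locations (rows sum to 1).
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
two_step = k_step(P, 2)
posterior = destination_posterior(two_step[0], prior=[0.2, 0.3, 0.5])
print(two_step[0], posterior)
```

Repeated multiplication like this is the expensive part that the abstract says ETP and TPD are designed to reduce; the sketch makes no attempt at that optimisation.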

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-20
Weiming Hu; Jun Gao; Bing Li; Ou Wu; Junping Du; Stephen Maybank

Current local density-based anomaly detection methods are limited in that the local density estimation and the neighborhood density estimation are not accurate enough for complex and large databases, and the detection performance depends on the size parameter of the neighborhood. In this paper, we propose a new kernel function to estimate samples’ local densities and propose a weighted neighborhood density estimation to increase the robustness to changes in the neighborhood size. We further propose a local kernel regression estimator and a hierarchical strategy for combining information from multiple scale neighborhoods to refine the anomaly factors of samples. We apply our general anomaly detection method to image saliency detection by regarding salient pixels in objects as anomalies relative to the background regions. Local density estimation in the visual feature space and kernel-based saliency score propagation in the image enable the assignment of similar saliency values to homogeneous object regions. Experimental results on several benchmark datasets demonstrate that our anomaly detection methods overall outperform several state-of-the-art anomaly detection methods. The effectiveness of our image saliency detection method is validated by comparison with several state-of-the-art saliency detection methods.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-12-05
Amal Saadallah; Luís Moreira-Matias; Ricardo Sousa; Jihed Khiari; Erik Jenelius; João Gama

Massive data broadcast by GPS-equipped vehicles provide unprecedented opportunities. One of the main tasks in order to optimize our transportation networks is to build data-driven real-time decision support systems. However, the dynamic environments where the networks operate disallow the traditional assumptions required to put in practice many off-the-shelf supervised learning algorithms, such as finite training sets or stationary distributions. In this paper, we propose BRIGHT: a drift-aware supervised learning framework to predict demand quantities. BRIGHT aims to provide accurate predictions for short-term horizons through a creative ensemble of time series analysis methods that handles distinct types of concept drift. By selecting neighborhoods dynamically, BRIGHT reduces the likelihood of overfitting. By ensuring diversity among the base learners, BRIGHT ensures a high reduction of variance while keeping bias stable. Experiments were conducted using three large-scale heterogeneous real-world transportation networks in Porto (Portugal), Shanghai (China), and Stockholm (Sweden), as well as with controlled experiments using synthetic data where multiple distinct drifts were artificially induced. The obtained results illustrate the advantages of BRIGHT in relation to state-of-the-art methods for this task.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-30
Jinyoung Yeo; Seung-won Hwang; Sungchul Kim; Eunyee Koh; Nedim Lipka

As 98 percent of shoppers do not make a purchase on the first visit, we study the problem of predicting whether they would come back for a purchase later (i.e., conversion prediction). This problem is important for strategizing “retargeting”, for example, by sending coupons to customers who are likely to convert. For this goal, we study the following two problems: prediction of market and predictability of customer. First, prediction of market aims at identifying a conversion rate for a given product and modeling its customer behavior, which is an important analytics metric for the retargeting process. Compared to existing approaches that use either customer-level or product-level conversion patterns, we propose a joint modeling of both patterns based on the well-studied buying decision process. Second, we can observe customer-specific behaviors after showing retargeting ads, to predict whether this specific customer follows the market model (high predictability) or not (low predictability). For the former, we apply the market model, and for the latter, we propose a new customer-specific prediction based on dynamic ad behavior features. To evaluate the effectiveness of our methods, we perform extensive experiments on a simulated dataset generated from a set of real-world web logs and retargeting campaign logs. The evaluation results show that conversion predictions and predictability by our approach are consistently more accurate and robust than those by existing baselines in a dynamic market environment.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-30
Yongseok Son; Moonsub Kim; Sunggon Kim; Heon Young Yeom; Nam Sung Kim; Hyuck Han

As flash-based solid-state drive (SSD) becomes more prevalent because of the rapid fall in price and the significant increase in capacity, customers expect better data services than traditional disk-based systems. However, the order of magnitude performance provided and new characteristics of flash require a rethinking of data services. For example, backup and recovery is an important service in a database system since it protects data against unexpected hardware and software failures. To provide backup and recovery, backup/recovery tools or backup/recovery methods by operating systems can be used. However, the tools perform time-consuming jobs, and the methods may negatively affect run-time performance during normal operation even though high-performance SSDs are used. To handle these issues, we propose an SSD-assisted backup/recovery scheme for database systems. Our scheme is to utilize the characteristics (e.g., out-of-place update) of flash-based SSD for backup/recovery operations. To this end, we exploit the resources (e.g., flash translation layer and DRAM cache with supercapacitors) inside SSD, and we call our SSD with new backup/recovery functionality BR-SSD. We design and implement the functionality in the Samsung enterprise-class SSD (i.e., SM843Tn) for more realistic systems. Furthermore, we exploit and integrate BR-SSDs into database systems (i.e., MySQL) in replication and redundant array of independent disks (RAID) environments, as well as a database system in a single BR-SSD. The experimental result demonstrates that our scheme provides fast backup and recovery but does not negatively affect the run-time performance during normal operation.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-23
Shaoxu Song; Yu Sun; Aoqian Zhang; Lei Chen; Jianmin Wang

Incomplete information often occurs in many database applications, e.g., in data integration, data cleaning, or data exchange. The idea of data imputation is often to fill the missing data with the values of its neighbors who share the same or similar information. Such neighbors could either be identified certainly by editing rules or extensively by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially in the presence of data values with variances. To enrich the imputation candidates, a natural idea is to extensively consider the neighbors with similarity relationships. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to utilize similarity rules with tolerance to small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., to impute more of the missing values, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the NP-hardness analysis on solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that the record matching application is indeed improved after applying the proposed imputation.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-19
Yan Yan; Mingkui Tan; Ivor W. Tsang; Yi Yang; Qinfeng Shi; Chengqi Zhang

Matrix factorization has been widely applied to various applications. With the fast development of storage and internet technologies, we have been witnessing a rapid increase of data. In this paper, we propose new algorithms for matrix factorization with an emphasis on efficiency. In addition, most existing methods of matrix factorization only consider a general smooth least square loss. However, many real-world applications have distinctive characteristics, so different losses should be used accordingly. Therefore, it is beneficial to design new matrix factorization algorithms that are able to deal with both smooth and non-smooth losses. To this end, one needs to analyze the characteristics of the target data and use the most appropriate loss based on the analysis. We particularly study two representative cases of low-rank matrix recovery, i.e., collaborative filtering for recommendation and high dynamic range imaging. To solve these two problems, we propose a stage-wise matrix factorization algorithm for each by exploiting manifold optimization techniques. From our theoretical analysis, both are provably guaranteed to converge to a stationary point. Extensive experiments on recommender systems and high dynamic range imaging demonstrate the satisfactory performance and efficiency of our proposed method on large-scale real data.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-28
Alejandro Moreo; Andrea Esuli; Fabrizio Sebastiani

In information retrieval (IR) and related tasks, term weighting approaches typically consider the frequency of the term in the document and in the collection in order to compute a score reflecting the importance of the term for the document. In tasks characterized by the presence of training data (such as text classification) it seems logical that the term weighting function should take into account the distribution (as estimated from training data) of the term across the classes of interest. Although “supervised term weighting” approaches that use this intuition have been described before, they have failed to show consistent improvements. In this article, we analyze the possible reasons for this failure, and call consolidated assumptions into question. Following this criticism, we propose a novel supervised term weighting approach that, instead of relying on any predefined formula, learns a term weighting function optimized on the training set of interest; we dub this approach Learning to Weight (LTW). The experiments that we run on several well-known benchmarks, and using different learning methods, show that our method outperforms previous term weighting approaches in text classification.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-14
Qiang Cui; Shu Wu; Qiang Liu; Wen Zhong; Liang Wang

Sequential recommendation is a fundamental task for network applications, and it usually suffers from the item cold start problem due to insufficient user feedback. There are currently three kinds of popular approaches, which are respectively based on matrix factorization (MF) of collaborative filtering, Markov chain (MC), and recurrent neural network (RNN). Although widely used, they have some limitations. MF-based methods cannot capture users' dynamic interests. The strong Markov assumption greatly limits the performance of MC-based methods. RNN-based methods are still in the early stage of incorporating additional information. Based on these basic models, many methods with additional information only validate incorporating one modality in a separate way. In this work, to make sequential recommendations and deal with the item cold start problem, we propose a Multi-View Recurrent Neural Network (MV-RNN) model. Given the latent feature, MV-RNN can alleviate the item cold start problem by incorporating visual and textual information. First, at the input of MV-RNN, three different combinations of multi-view features are studied: concatenation, fusion by addition, and fusion by reconstructing the original multi-modal data. MV-RNN applies the recurrent structure to dynamically capture the user's interest. Second, we design a separate structure and a united structure on the hidden state of MV-RNN to explore a more effective way to handle multi-view features. Experiments on two real-world datasets show that MV-RNN can effectively generate the personalized ranking list, tackle the missing modalities problem, and significantly alleviate the item cold start problem.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-15
Hai Jin; Changfu Lin; Hanhua Chen; Jiangchuan Liu

Efficient event stream dissemination is a challenging problem in large-scale Online Social Network (OSN) systems due to the costly inter-server communications caused by the per-user view data storage. To solve the problem, previous schemes mainly explore the structures of social graphs to reduce the inter-server traffic. Based on the observation of high cluster coefficients in OSNs, a state-of-the-art social piggyback scheme can save redundant messages by exploiting an intrinsic hub-structure in an OSN graph for message piggybacking. Essentially, finding the best hub-structure for piggybacking is equivalent to finding a variation of the densest sub-graph. The existing scheme computes the best hub-structure by iteratively removing the node with the minimum weighted degree. Such a scheme incurs a worst-case computation cost of $O(n^2)$, making it not scalable to large-scale OSN graphs. Using alternative hub-structures instead of the best hub-structure can speed up the piggyback assignment. However, these alternatives greatly sacrifice the communication efficiency of the assignment schedule. Different from the existing designs, in this work, we propose a QuickPoint algorithm, which removes a fraction of nodes in each iteration in finding the best hub-structure. We mathematically prove that QuickPoint converges in $O(\log_a n)$ ($a>1$) iterations in finding the best hub-structure for efficient piggyback. We implement QuickPoint in parallel atop Pregel, a vertex-centric distributed graph processing platform. Comprehensive experiments using large-scale data from Twitter and Flickr show that our scheme is 38.8× more efficient compared to existing schemes.
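The idea of removing a fraction of nodes per iteration can be sketched with a generic batch-peeling heuristic for the plain (unweighted) densest-subgraph problem. This is an assumption-laden illustration, not QuickPoint: the threshold rule (drop every node whose degree is at most `(1 + eps)` times the current average degree), the `eps` parameter, and the edges-per-node density are choices made here for the example, and the paper's weighted hub-structure objective is not modelled.

```python
def density(adj, nodes):
    """Edges per node of the subgraph induced by `nodes` (0.0 if empty)."""
    if not nodes:
        return 0.0
    # Each undirected edge is seen from both endpoints, hence the division by 2.
    edges = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return edges / len(nodes)


def batch_peel(adj, eps=0.1):
    """Peel low-degree nodes in batches and remember the densest subgraph seen.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns (best_nodes, best_density, rounds).
    """
    nodes = set(adj)
    best, best_density, rounds = set(nodes), density(adj, nodes), 0
    while nodes:
        deg = {u: sum(1 for v in adj[u] if v in nodes) for u in nodes}
        avg = sum(deg.values()) / len(nodes)
        # The minimum degree never exceeds the average, so `drop` is never empty.
        drop = {u for u in nodes if deg[u] <= (1 + eps) * avg}
        nodes -= drop
        rounds += 1
        d = density(adj, nodes)
        if d > best_density:
            best, best_density = set(nodes), d
    return best, best_density, rounds


# Hypothetical graph: a 4-clique {0, 1, 2, 3} with a two-node tail 3-4-5.
graph = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {0, 1, 2, 4}, 4: {3, 5}, 5: {4},
}
print(batch_peel(graph))
```

Removing one minimum-degree node at a time needs up to n rounds, whereas dropping a constant fraction per round needs only logarithmically many, which is the contrast the abstract draws between the existing scheme and QuickPoint.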

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-27
Boyi Hou; Qun Chen; Zhaoqiang Chen; Youcef Nafa; Zhanhuai Li

Even though many approaches have been proposed for entity resolution (ER), it remains very challenging to enforce quality guarantees. To this end, we propose a risk-aware HUman-Machine cOoperation framework for ER, denoted by r-HUMO. Built on the existing HUMO framework, r-HUMO similarly enforces both precision and recall guarantees by partitioning an ER workload between the human and the machine. However, r-HUMO is the first solution that optimizes the process of human workload selection from a risk perspective. It iteratively selects human workload by real-time risk analysis based on the human-labeled results as well as the pre-specified machine metric. In this paper, we first introduce the r-HUMO framework and then present the risk model to prioritize the instances for manual inspection. Finally, we empirically evaluate r-HUMO's performance on real data. Our extensive experiments show that r-HUMO is effective in enforcing quality guarantees, and compared with the state-of-the-art alternatives, it can achieve desired quality control with reduced human cost.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-30
Ioanna Tsalouchidou; Francesco Bonchi; Gianmarco De Francisci Morales; Ricardo Baeza-Yates

Large-scale dynamic interaction graphs can be challenging to process and store, due to their size and the continuous change of communication patterns between nodes. In this work, we address the problem of summarizing large-scale dynamic graphs, while maintaining the evolution of their structure and interactions. Our approach is based on grouping the nodes of the graph in supernodes according to their connectivity and communication patterns. The resulting summary graph preserves the information about the evolution of the graph within a time window. We propose two online algorithms for summarizing this type of graph. Our baseline algorithm $k$C, based on clustering, is fast but rather memory expensive. The second method we propose, named $\mu$C, reduces the memory requirements by introducing an intermediate step that keeps statistics of the clustering of the previous rounds. Our algorithms are distributed by design, and we implement them over the Apache Spark framework, so as to address the problem of scalability for large-scale graphs and massive streams. We apply our methods to several dynamic graphs, and show that we can efficiently use the summary graphs to answer temporal and probabilistic graph queries.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-27
Xu Sun; Xuancheng Ren; Shuming Ma; Bingzhen Wei; Wei Li; Jingjing Xu; Houfeng Wang; Yi Zhang

We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which will reduce the computational cost both in the training and decoding, and potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of the time we only need to update fewer than 5 percent of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we could adaptively simplify the model, which could often be reduced by around 9x, without any loss of accuracy or even with improved accuracy.
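The top-$k$ sparsification described above can be sketched for a single linear layer `y = W x`. The sketch assumes a row-major weight layout and plain SGD; the function names, the learning rate, and the toy numbers are hypothetical and are not taken from the paper's implementation.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of `grad`; zero out the rest."""
    if k <= 0:
        return [0.0] * len(grad)
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def sparse_sgd_step(W, x, grad_y, k, lr=0.1):
    """Update W in place for y = W x using only the top-k entries of dL/dy.

    The full weight gradient is outer(grad_y, x), so zeroing an entry of
    grad_y means the matching row of W is left untouched.
    Returns the indices of the rows that were modified.
    """
    sparse = top_k_sparsify(grad_y, k)
    touched = []
    for i, g in enumerate(sparse):
        if g == 0.0:
            continue  # this row's gradient is zero: skip the work entirely
        touched.append(i)
        for j, xj in enumerate(x):
            W[i][j] -= lr * g * xj
    return touched


# Toy example: 4 output units, 2 inputs, keep the top 2 gradient entries.
W = [[1.0, 1.0] for _ in range(4)]
rows = sparse_sgd_step(W, x=[1.0, 2.0], grad_y=[0.1, -2.0, 0.5, 1.5], k=2)
print(rows, W)
```

Because only `k` rows are visited, the cost of the update scales with `k` rather than with the number of output units, which is the linear reduction the abstract refers to.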

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-19
Jinyoung Yeo; Haeju Park; Sanghoon Lee; Eric Wonhee Lee; Seung-won Hwang

Over the past few years, knowledge bases (KBs) like DBPedia, Freebase, and YAGO have accumulated a massive amount of knowledge from web data. Despite their seemingly large size, however, individual KBs often lack comprehensive information on any given domain. For example, over 70 percent of people on Freebase lack information on place of birth. For this reason, the complementary nature of different KBs motivates their integration through a process of aligning instances. Meanwhile, since application-level machine systems, such as medical diagnosis, have heavily relied on KBs, it is necessary to provide users with trustworthy reasons why the alignment decisions are made. To address this problem, we propose a new paradigm, explainable instance alignment (XINA), which provides user-understandable explanations for alignment decisions. Specifically, given an alignment candidate, XINA replaces the existing scalar representation of an aggregated score with decision- and explanation-vector spaces for machine decision and user understanding, respectively. To validate XINA, we perform extensive experiments on real-world KBs and show that XINA achieves performance comparable to state-of-the-art methods, even with far less human effort.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-02-14
Dan Vilenchik

In this work we ask to what extent simple statistics are useful for making sense of social media data. By simple statistics we mean counting and bookkeeping type features such as the number of likes given to a user's post, a user's number of friends, etc. We find that relying solely on simple statistics is not always a good approach. Specifically, we develop a statistical framework that we term semantic shattering, which allows the detection of semantic inconsistencies in the data that may occur due to relying solely on simple statistics. We apply our framework to simple-statistics data collected from six online social media platforms and arrive at a surprising, counter-intuitive finding in three of them: Twitter, Instagram, and YouTube. We find that overall, the activity of the user is not correlated with the feedback that the user receives on that activity. A hint to understanding this phenomenon may be found in the fact that the activity-feedback shattering did not occur in LinkedIn, Steam, and Flickr. A possible explanation for this separation is the amount of effort required to produce content. The lesser the effort, the lesser the correlation between activity and feedback. The amount of effort may be a proxy for the level of commitment that the users feel towards each other in the network, and indeed sociologists claim that commitment explains consistent human behavior, or lack thereof. However, neither the amount of effort nor the level of commitment is by any means a simple statistic.

Updated: 2020-01-14
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-09
Georgios Giasemidis; Nikolaos Kaplis; Ioannis Agrafiotis; Jason R. C. Nurse

Social media communications are becoming increasingly prevalent; some useful, some false, whether unwittingly or maliciously. An increasing number of rumours flood the social networks daily. Determining their veracity in an autonomous way is a very active and challenging field of research, with a variety of methods proposed. However, most of the models rely on determining the constituent messages’ stance towards the rumour, a feature known as the “wisdom of the crowd.” Although several supervised machine-learning approaches have been proposed to tackle the message stance classification problem, these have numerous shortcomings. In this paper, we argue that semi-supervised learning is more effective than supervised models and use two graph-based methods to demonstrate it. This is not only in terms of classification accuracy, but equally important, in terms of speed and scalability. We use the Label Propagation and Label Spreading algorithms and run experiments on a dataset of 72 rumours and hundreds of thousands of messages collected from Twitter. We compare our results on two available datasets to the state-of-the-art to demonstrate our algorithms’ performance regarding accuracy, speed, and scalability for real-time applications.
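For readers unfamiliar with graph-based semi-supervised learning, the basic label propagation iteration can be sketched as follows. The sketch assumes an unweighted similarity graph given as adjacency sets and a handful of seed labels; how the paper builds its graph from Twitter messages is not stated in the abstract, so the graph, the function name, and the iteration count here are all hypothetical.

```python
def label_propagation(adj, seeds, n_classes, iters=50):
    """Spread seed labels over a graph by repeated neighbour averaging.

    adj maps each node to the set of its neighbours; seeds maps a few nodes
    to known class indices. Seed nodes are clamped to their label on every
    pass. Returns the predicted class index for every node.
    """
    uniform = [1.0 / n_classes] * n_classes
    scores = {u: list(uniform) for u in adj}
    for u, c in seeds.items():
        scores[u] = [1.0 if j == c else 0.0 for j in range(n_classes)]

    for _ in range(iters):
        updated = {}
        for u in adj:
            if u in seeds or not adj[u]:
                updated[u] = scores[u]  # clamp seeds; isolated nodes keep their scores
                continue
            updated[u] = [
                sum(scores[v][j] for v in adj[u]) / len(adj[u])
                for j in range(n_classes)
            ]
        scores = updated

    return {u: max(range(n_classes), key=lambda j: scores[u][j]) for u in adj}


# Hypothetical chain of six messages with one labelled message at each end.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(label_propagation(chain, seeds={0: 0, 5: 1}, n_classes=2))
```

Only two of the six nodes carry labels, yet every node receives a prediction, which is the property that makes this family of methods attractive when labelled stance data is scarce.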

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-06
Asieh Ghanbarpour; Hassan Naderi

Many real-world networks such as Facebook, LinkedIn, and Wikipedia exhibit rich connectivity patterns along with worthwhile content nodes often labeled with meaningful attributes. Keyword search is an effective method to retrieve information from such useful networks. The aim of keyword search is to find a set of answers (subgraphs) covering all or part of the queried keywords. A challenge in keyword search systems is to rank answers according to their relevance to the query. This relevance lies in the textual content and structural compactness of the answers. In this paper, an attribute-specific ranking method is proposed based on language models to rank candidate answers according to their semantic information up to the attribute level. This method scores answers using a model enriched with attribute-specific preferences and integrating both the structure and content of answers. The proposed model is directly estimated on the sub-graphs (answers) and is defined such that it can preserve the local importance of keywords in nodes. Extensive experiments conducted on a standard evaluation framework with three real-world datasets illustrate the superior effectiveness of the proposed ranking method to that of the state-of-the-art methods.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-06
Bolong Zheng; Kai Zheng; Christian S. Jensen; Nguyen Quoc Viet Hung; Han Su; Guohui Li; Xiaofang Zhou

With the proliferation of geo-textual objects on the web, extensive efforts have been devoted to improving the efficiency of top-$k$ spatial keyword queries in different settings. However, comparatively much less work has been reported on enhancing the quality and usability of such queries. In this context, we propose means of enhancing the usability of a top-$k$ group spatial keyword query, where a group of users aim to find $k$ objects that contain given query keywords and are nearest to the users. Specifically, when users receive the result of such a query, they may find that one or more objects that they expect to be in the result are in fact missing, and they may wonder why. To address this situation, we develop a so-called why-not query that is able to minimally modify the original query into a query that returns the expected, but missing, objects, in addition to other objects. Specifically, we formalize the why-not query in relation to the top-$k$ group spatial keyword query, called the Why-not Group Spatial Keyword Query ($\mathsf{WGSK}$), that is able to provide a group of users with a more satisfactory query result. We propose a three-phase framework for efficiently computing the $\mathsf{WGSK}$. The first phase substantially reduces the search space for the subsequent phases by retrieving a set of objects that may affect the ranking of the user-expected objects. The second phase provides an incremental sampling algorithm that generates candidate weightings of more promising queries. The third phase determines the penalty of each refined query and returns the query with minimal penalty, i.e., the minimally modified query. Extensive experiments with real and synthetic data offer evidence that the proposed solution excels over baselines with respect to both effectiveness and efficiency.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-12
Sin G. Teo; Jianneng Cao; Vincent C. S. Lee

Secure multi-party computation (SMC) allows parties to jointly compute a function over their inputs, while keeping every input confidential. It has been extensively applied in tasks with privacy requirements, such as privacy-preserving data mining (PPDM), to learn task output and at the same time protect input data privacy. However, existing SMC-based solutions are ad hoc: they are proposed for specific applications, and thus cannot be applied to other applications directly. To address this issue, we propose a privacy model $\mathsf{DAG}$ (Directed Acyclic Graph) that consists of a set of fundamental secure operators (e.g., +, -, ×, /, and power). Our model is general: its operators, if pipelined together, can implement various functions, even complicated ones like the Naïve Bayes classifier. It is also extendable: new secure operators can be defined to expand the functions that the model supports. As a case study, we have applied our $\mathsf{DAG}$ model to two data mining tasks: kernel regression and Naïve Bayes. Experimental results show that $\mathsf{DAG}$ generates outputs that are almost the same as those of the non-private setting, where multiple parties simply disclose their data. The experimental results also show that our $\mathsf{DAG}$ model runs in acceptable time, e.g., in kernel regression, when the training data size is 683,093, one prediction in the non-private setting takes 5.93 sec, and one by our $\mathsf{DAG}$ model takes 12.38 sec.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-10-30
Seyed Ali Osia; Ali Taheri; Ali Shahin Shamsabadi; Kleomenis Katevas; Hamed Haddadi; Hamid R. Rabiee

We present and evaluate Deep Private-Feature Extractor (DPFE), a deep model which is trained and evaluated based on information theoretic constraints. Using the selective exchange of information between a user's device and a service provider, DPFE enables the user to prevent certain sensitive information from being shared with a service provider, while allowing them to extract approved information using their model. We introduce and utilize the log-rank privacy, a novel measure to assess the effectiveness of DPFE in removing sensitive information and compare different models based on their accuracy-privacy trade-off. We then implement and evaluate the performance of DPFE on smartphones to understand its complexity, resource demands, and efficiency trade-offs. Our results on benchmark image datasets demonstrate that under moderate resource utilization, DPFE can achieve high accuracy for primary tasks while preserving the privacy of sensitive information.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-12
Sutedi Sutedi; Noor Akhmad Setiawan; Teguh Bharata Adji

The inability of relational databases to handle large-scale data has triggered the development of NoSQL databases, which have become part of the big data ecosystem. NoSQL databases have different characteristics from relational databases. However, NoSQL databases require data from relational databases as one of their structured data sources. Therefore, data pre-processing is required to ensure proper data migration from a relational database to a NoSQL database. This data pre-processing is normally called data transformation. One simple and understandable transformation algorithm is the graph transforming algorithm. However, the algorithm has a problem in handling non-simple graphs (multigraphs). This research proposes an algorithm to overcome several multigraph problems. The experimental work confirms that the proposed algorithm is able to transform data from a relational database to a NoSQL schema that has a minimum number of redundant attributes while data completeness is still maintained.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-12
Fan Zhang; Conggai Li; Ying Zhang; Lu Qin; Wenjie Zhang

In social networks, the departure of critical users may significantly break network engagement, i.e., lead a large number of other users to drop out. A popular model to measure social network engagement is the $k$-core, the maximal subgraph in which every vertex has at least $k$ neighbors. To identify critical users, we propose the collapsed $k$-core problem: given a graph $G$, a positive integer $k$ and a budget $b$, we aim to find $b$ vertices in $G$ such that the deletion of the $b$ vertices leads to the smallest $k$-core. We prove the problem is NP-hard and inapproximable. An efficient algorithm is proposed, which significantly reduces the number of candidate vertices. We also study user departure under the $k$-truss model, which further considers tie strength by conducting additional computation w.r.t. $k$-core. We prove the corresponding collapsed $k$-truss problem is also NP-hard and inapproximable. An efficient algorithm is proposed to solve the problem. The advantages and disadvantages of the two proposed models are experimentally compared. Comprehensive experiments on nine real-life social networks demonstrate the effectiveness and efficiency of our proposed methods.
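
The problem statement above can be made concrete with a small sketch. This is not the paper's algorithm: it computes a k-core by simple peeling and then solves the collapsed k-core problem by exhaustive search, which is only feasible on toy graphs. The function names and the adjacency-list representation are assumptions made for illustration.

```python
from itertools import combinations


def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj maps each vertex to a list of its neighbors. Vertices whose
    degree among surviving vertices drops below k are peeled repeatedly.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    return alive


def collapsed_k_core_bruteforce(adj, k, b):
    """Try every set of b vertices and keep the one leaving the smallest k-core.

    Exponential in b; a hypothetical baseline for illustration only.
    """
    best_removed, best_core = None, None
    for removed in combinations(sorted(adj), b):
        # Build the graph that remains after deleting the chosen vertices.
        sub = {
            v: [u for u in nbrs if u not in removed]
            for v, nbrs in adj.items()
            if v not in removed
        }
        core = k_core(sub, k)
        if best_core is None or len(core) < len(best_core):
            best_removed, best_core = removed, core
    return best_removed, best_core


# A 4-clique {a, b, c, d} plus a vertex e attached to a and b.
adj = {
    "a": ["b", "c", "d", "e"],
    "b": ["a", "c", "d", "e"],
    "c": ["a", "b", "d"],
    "d": ["a", "b", "c"],
    "e": ["a", "b"],
}
print(k_core(adj, 3))
print(collapsed_k_core_bruteforce(adj, 3, 1))
```

On this toy graph the 3-core is the 4-clique, and deleting any single clique vertex makes the whole 3-core collapse, which is the kind of critical user the problem is meant to find.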

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-06
Kaixing Dong; Bowen Zhang; Yanyan Shen; Yanmin Zhu; Jiadi Yu

The increasing amount of trajectory data facilitates a wide spectrum of practical applications in which large numbers of trajectory range and similarity queries are issued continuously. This calls for high-throughput trajectory query processing. Traditional in-memory databases lack considerations of the unique features of trajectories, while specialized trajectory query processing systems are typically designed for only one type of trajectory queries. This paper introduces GAT, a unified GPU-accelerated framework to process batch trajectory queries with the objective of high throughput. GAT follows the filtering-and-verification paradigm where we develop a novel index GTIDX for effectively filtering invalid trajectories on the CPU, and exploit the massive parallelism of the GPU for verification. To optimize the performance of GAT, we first greedily partition batch queries to reduce the amortized query processing latency. We then apply the Morton-based encoding method to coalesce data access requests from the GPU cores, and maintain a hash table to avoid redundant data transfer between CPU and GPU. To achieve load balance, we group size-varying cells into balanced blocks with similar numbers of trajectory points. Extensive experiments have been conducted over real-life trajectory datasets. The results show that GAT is efficient, scalable, and achieves high throughput with acceptable indexing cost.
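
As a rough illustration of the Morton-based encoding mentioned above, the sketch below interleaves the bits of a 2D grid cell's coordinates into a single Z-order code. The abstract does not give GAT's exact encoding, so the function name, the 2D setting, and the 16-bit coordinate width are all assumptions.

```python
def morton_encode_2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order code.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1,
    so cells that are close in space tend to get nearby codes.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Sorting cells by Morton code visits a 2x2 block in Z order.
cells = [(1, 1), (0, 1), (1, 0), (0, 0)]
print(sorted(cells, key=lambda c: morton_encode_2d(*c)))
# → [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Laying out cell data in this order is one common way to make neighboring cells adjacent in memory, which is the property that helps coalesce memory accesses.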

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-09
Djamel-Edine Yagoubi; Reza Akbarinia; Florent Masseglia; Themis Palpanas

Indexing is crucial for many data mining tasks that rely on efficient and effective similarity query processing. Consequently, indexing large volumes of time series, along with high performance similarity query processing, have become topics of high interest. For many applications across diverse domains though, the amount of data to be processed might be intractable for a single machine, making existing centralized indexing solutions inefficient. We propose a parallel indexing solution that gracefully scales to billions of time series, and a parallel query processing strategy that, given a batch of queries, efficiently exploits the index. Our experiments, on both synthetic and real world data, illustrate that our index creation algorithm works on four billion time series in less than five hours, while the state-of-the-art centralized algorithms do not scale and reach their limit at 1 billion time series, where they need more than five days. Also, our distributed querying algorithm is able to efficiently process millions of queries over collections of billions of time series, thanks to an effective load balancing mechanism.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-06

Poisson Factorization (PF) is the gold standard framework for recommendation systems with implicit feedback whose variants show state-of-the-art performance on real-world recommendation tasks. However, they do not explicitly take into account the temporal behavior of users which is essential to recommend the right item to the right user at the right time. In this paper, we introduce the Recurrent Poisson Factorization (RPF) framework, which generalizes the classical PF methods by utilizing a Poisson process for modeling the implicit feedback. RPF treats time as a natural constituent of the model, and takes important factors for recommendation into consideration to provide a rich family of time-sensitive factorization models. They include Hierarchical RPF that captures the consumption heterogeneity among users and items, Dynamic RPF that handles dynamic user preferences and item specifications, Social RPF that models the social aspect of product adoption, Item-Item RPF that considers the inter-item correlations, and eXtended Item-Item RPF that utilizes items’ metadata to better infer the correlation among engagement patterns of users with items. We also develop an efficient variational algorithm for approximate inference that scales up to massive datasets. We demonstrate RPF's superior performance over many state-of-the-art methods on a synthetic dataset and a wide variety of large-scale real-world datasets.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-13
Chen Jason Zhang; Lei Chen; H. V. Jagadish; Mengchen Zhang; Yongxin Tong

Schema matching is a central challenge for data integration systems. Inspired by the popularity and the success of crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since crowdsourcing platforms are most effective for simple questions, we assume that each Correspondence Correctness Question (CCQ) asks the crowd to decide whether a given correspondence should exist in the correct matching. Furthermore, members of a crowd may sometimes return incorrect answers with different probabilities. Accuracy rates of individual crowd workers can be attributes of CCQs as well as evaluations of individual workers. We prove that uncertainty reduction equals the entropy of answers minus the entropy of crowds and show how to obtain lower and upper bounds for it. We propose frameworks and efficient algorithms to dynamically manage the CCQs to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely “Single CCQ” and “Multiple CCQ”, which adaptively select, publish, and manage questions. We verify the value of our solutions with simulation and real implementation.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-01
Stijn Heldens; Nelly Litvak; Maarten van Steen

Studying the movements of crowds is important for understanding and predicting the behavior of large groups of people. When analyzing crowds, one is often interested in the long-term macro-level motions of the crowd as a whole, as opposed to the micro-level short-term movements of individuals. A high-level representation of these motions is thus desirable. In this work, we present a scalable method for detection of crowd motion patterns, i.e., spatial areas describing the dominant motions within crowds. For measuring crowd movements, we propose a fast, scalable, and low-cost method based on proximity graphs. For analyzing crowd movements, we utilize a three-stage pipeline: (1) represent the behavior of each person at each moment in time as a low-dimensional data point, (2) cluster these data points based on spatial relations, and (3) concatenate these clusters based on temporal relations. Experiments on synthetic datasets reveal that our method can handle various scenarios including curved lanes and diverging flows. Evaluation on real-world datasets shows our method is able to extract useful motion patterns which could not be properly detected by existing methods. Overall, we see our work as an initial step towards rich pattern recognition.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-09
Xiaojun Chen; Guowen Yuan; Feiping Nie; Zhong Ming

With the rapid increase in data size, there is an increasing demand for selecting features by exploiting both labeled and unlabeled data. In this paper, we propose a novel semi-supervised embedded feature selection method. The new method extends the least square regression model by rescaling the regression coefficients in the least square regression with a set of scale factors, which are used for evaluating the importance of features. An iterative algorithm is proposed to optimize the new model. It has been proved that solving the new model is equivalent to solving a sparse model with a flexible and adaptable $\ell_{2,p}$ norm regularization. Moreover, the optimal solution of scale factors provides a theoretical explanation for why we can use $\lbrace \left\Vert \mathbf{w}^{1} \right\Vert_{2}, \ldots, \left\Vert \mathbf{w}^{d} \right\Vert_{2} \rbrace$ to evaluate the importance of features. Experimental results on eight benchmark data sets show the superior performance of the proposed method.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-19
James S. Okolica; Gilbert L. Peterson; Robert F. Mills; Michael R. Grimaila

Sequence pattern mining (SPM) seeks to find multiple items that commonly occur together in a specific order. One common assumption is that the relevant differences between items are captured through creating distinct items. In some domains, this leads to an exponential increase in the number of items. This paper presents a new SPM, Sequence Mining of Temporal Clusters (SMTC), that allows item differentiation through attribute variables for domains with large numbers of items. It also provides a new technique for addressing interleaving, a phenomenon that occurs when two sequences occur simultaneously, resulting in their items alternating. By first clustering items temporally and only focusing on sequences after the temporal clusters are established, it sidesteps the traditional interleaving issues. SMTC is evaluated on a digital forensics dataset, a domain with a large number of items and frequent interleaving. Its results are compared with Discontinuous Varied Order Sequence Mining (DVSM) with variables added (DVSM-V). By adding variables, both algorithms reduce the data by 96 percent, and identify 100 percent of the events while keeping the false positive rate below 0.03 percent. SMTC mines the data in 20 percent of the time it takes DVSM-V and provides a lower false positive rate even at higher similarity thresholds.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-10-30
Fanhua Shang; Kaiwen Zhou; Hongying Liu; James Cheng; Ivor W. Tsang; Lijun Zhang; Dacheng Tao; Licheng Jiao

In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively. The settings allow us to use much larger learning rates, and also make our convergence analysis more challenging. We also design two different update rules for smooth and non-smooth objective functions, respectively, which means that VR-SGD can tackle non-smooth and/or non-strongly convex problems directly without any reduction techniques. Moreover, we analyze the convergence properties of VR-SGD for strongly convex problems, which show that VR-SGD attains linear convergence. Different from most algorithms that have no convergence guarantees for non-strongly convex problems, we also provide the convergence guarantees of VR-SGD for this case, and empirically verify that VR-SGD with varying learning rates achieves similar performance to its momentum accelerated variant that has the optimal convergence rate $\mathcal{O}(1/T^2)$. Finally, we apply VR-SGD to solve various machine learning problems, such as convex and non-convex empirical risk minimization, and leading eigenvalue computation. Experimental results show that VR-SGD converges significantly faster than SVRG and Prox-SVRG, and usually outperforms state-of-the-art accelerated methods, e.g., Katyusha.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-12-06

This index covers all technical items - papers, correspondence, reviews, etc. - that appeared in this periodical during the year, and items from previous years that were commented upon or corrected in this year. Departments and other items may also be covered if they have been judged to have archival value. The Author Index contains the primary entry for each item, listed under the first author's name. The primary entry includes the co-authors' names, the title of the paper or other item, and its location, specified by the publication abbreviation, year, month, and inclusive pagination. The Subject Index contains entries describing the item under all appropriate subject headings, plus the first author's name, the publication abbreviation, month, and year, and inclusive pages. Note that the item title is found only under the primary entry in the Author Index.

Updated: 2020-01-04
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-11-13
Jingchao Ni; Wei Cheng; Wei Fan; Xiang Zhang

Joint clustering of multiple networks has been shown to be more accurate than performing clustering on individual networks separately. This is because multi-network clustering algorithms typically assume there is a common clustering structure shared by all networks, and different networks can provide compatible and complementary information for uncovering this underlying clustering structure. However, this assumption is too strict to hold in many emerging applications, where multiple networks usually have diverse data distributions. More popularly, the networks in consideration belong to different underlying groups. Only networks in the same underlying group share similar clustering structures. Better clustering performance can be achieved by considering such groups differently. As a result, an ideal method should be able to automatically detect network groups so that networks in the same group share a common clustering structure. To address this problem, we propose a new method, ComClus, to simultaneously group and cluster multiple networks. ComClus is novel in combining the clustering approach of non-negative matrix factorization (NMF) and the feature subspace learning approach of metric learning. Specifically, it treats node clusters as features of networks and learns proper subspaces from such features to differentiate different network groups. During the learning process, the two procedures of network grouping and clustering are coupled and mutually enhanced. Moreover, ComClus can effectively leverage prior knowledge on how to group networks such that network grouping can be conducted in a semi-supervised manner. This will enable users to guide the grouping process using domain knowledge so that network clustering accuracy can be further boosted. Extensive experimental evaluations on a variety of synthetic and real datasets demonstrate the effectiveness and scalability of the proposed method.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-10-01
Jinfei Liu; Juncheng Yang; Li Xiong; Jian Pei

Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Furthermore, we demonstrate two optimizations, data partitioning and lazy merging, to further reduce the computation load. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
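
For readers unfamiliar with skyline queries, the dominance test that the secure protocol has to evaluate can be sketched in plaintext as follows. This is only the unencrypted definition, not the paper's secure dominance protocol; the function names and the convention that smaller values are better are assumptions.

```python
def dominates(p, q):
    """Return True if p dominates q (smaller is better in every dimension).

    p dominates q when p is no worse than q in all dimensions and strictly
    better in at least one.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Example: (price, distance) pairs where lower is better on both criteria.
hotels = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
print(skyline(hotels))
# → [(1, 5), (2, 2), (5, 1)]
```

A secure version has to produce the same result while the server sees only ciphertexts, which is why the pairwise dominance comparison is the natural building block to protect.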

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-10-01
Chencheng Li; Pan Zhou; Li Xiong; Qian Wang; Ting Wang

In the big data era, the generation of data presents some new characteristics, including wide distribution, high velocity, high dimensionality, and privacy concerns. To address these challenges for big data analytics, we develop a privacy-preserving distributed online learning framework on the data collected from distributed data sources. Specifically, each node (i.e., data source) has the capacity of learning a model from its local dataset, and exchanges intermediate parameters with a random subset of its neighboring (logically connected) nodes. Hence, the topology of the communications in our distributed computing framework is unfixed in practice. As online learning is often performed on sensitive data, we introduce the notion of differential privacy (DP) into our distributed online learning algorithm (DOLA) to protect data privacy during learning, which prevents an adversary from inferring any significant sensitive information. Our model is of general value for big data analytics in the distributed setting, because it can provide a rigorous and scalable privacy proof and has much lower computational complexity than classic schemes, e.g., secure multiparty computation (SMC). To tackle high-dimensional incoming data entries, we study a sparse version of the DOLA with novel DP techniques to save computing resources and improve utility. Furthermore, we present two modified private DOLAs to meet the needs of practical applications. One converts the DOLA to distributed stochastic optimization in an offline setting; the other uses the mini-batch approach to reduce the amount of perturbation noise and improve utility. We conduct experiments on real datasets in a configured distributed platform. Numerical experiment results validate the feasibility of our private DOLAs.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-05-21
Jingbo Shang; Jialu Liu; Meng Jiang; Xiang Ren; Clare R Voss; Jiawei Han

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaptation. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-08-23
Yang Cao; Masatoshi Yoshikawa; Yonghui Xiao; Li Xiong

Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call the unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-07-24
Meng Wang; Zhanglong Ji; Hyeon-Eui Kim; Shuang Wang; Li Xiong; Xiaoqian Jiang

Privacy concerns in data sharing, especially for health data, are gaining increasing attention. Some patients now agree to open their information for research use, which gives rise to a new question of how to effectively use the public information to better understand the private dataset without breaching privacy. In this paper, we specialize this question as selecting an optimal subset of the public dataset for M-estimators in the framework of differential privacy (DP) in [1]. From a perspective of non-interactive learning, we first construct the weighted private density estimation from the hybrid datasets under DP. Along the same line as [2], we analyze the accuracy of the DP M-estimators based on the hybrid datasets. Our main contributions are (i) we find that the bias-variance tradeoff in the performance of our M-estimators can be characterized in the sample size of the released dataset; (ii) based on this finding, we develop an algorithm to select the optimal subset of the public dataset to release under DP. Our simulation studies and application to the real datasets confirm our findings and set a guideline for real applications.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2019-02-13
Jingbo Shang; Meng Jiang; Wenzhu Tong; Jinfeng Xiao; Jian Peng; Jiawei Han

In the literature, two series of models have been proposed to address prediction problems including classification and regression. Simple models, such as generalized linear models, have ordinary performance but strong interpretability on a set of simple features. The other series, including tree-based models, organize numerical, categorical and high dimensional features into a comprehensive structure with rich interpretable information in the data. In this paper, we propose a novel Discriminative Pattern-based Prediction framework (DPPred) to accomplish the prediction tasks by combining the advantages of both in effectiveness and interpretability. Specifically, DPPred adopts the concise discriminative patterns that are on the prefix paths from the root to leaf nodes in the tree-based models. DPPred selects a limited number of the useful discriminative patterns by searching for the most effective pattern combination to fit generalized linear models. Extensive experiments show that in many scenarios, DPPred provides accuracy competitive with the state of the art as well as valuable interpretability for developers and experts. In particular, taking a clinical application dataset as a case study, our DPPred outperforms the baselines by using only 40 concise discriminative patterns out of a potentially exponentially large set of patterns.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2016-05-01
Yubao Wu; Ruoming Jin; Xiang Zhang

Top-k proximity query in large graphs is a fundamental problem with a wide range of applications. Various random walk based measures have been proposed to measure the proximity between different nodes. Although these measures are effective, efficiently computing them on large graphs is a challenging task. In this paper, we develop an efficient and exact local search method, FLoS (Fast Local Search), for top-k proximity query in large graphs. FLoS guarantees the exactness of the solution. Moreover, it can be applied to a variety of commonly used proximity measures. FLoS is based on the no local optimum property of proximity measures. We show that many measures have no local optimum. Utilizing this property, we introduce several operations to manipulate transition probabilities and develop tight lower and upper bounds on the proximity values. The lower and upper bounds monotonically converge to the exact proximity value when more nodes are visited. We further extend FLoS to measures having local optimum by utilizing relationship among different measures. We perform comprehensive experiments on real and synthetic large graphs to evaluate the efficiency and effectiveness of the proposed method.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2017-09-26
Bo Li; Yevgeniy Vorobeychik; Muqun Li; Bradley Malin

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2011-05-28
Parisa Rashidi; Diane J Cook; Lawrence B Holder; Maureen Schmitter-Edgecombe

The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In order to monitor the functional health of smart home residents, we need to design technologies that recognize and track activities that people normally perform as part of their daily routines. Although approaches do exist for recognizing activities, the approaches are applied to activities that have been pre-selected and for which labeled training data is available. In contrast, we introduce an automated approach to activity tracking that identifies frequent activities that naturally occur in an individual's routine. With this capability we can then track the occurrence of regular activities to monitor functional health and to detect changes in an individual's patterns and lifestyle. In this paper we describe our activity mining and tracking approach and validate our algorithms on data collected in physical smart environments.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2011-03-05
Thomas A Lasko; Staal A Vinterbo

The goal of data anonymization is to allow the release of scientifically useful data in a form that protects the privacy of its subjects. This requires more than simply removing personal identifiers from the data, because an attacker can still use auxiliary information to infer sensitive individual information. Additional perturbation is necessary to prevent these inferences, and the challenge is to perturb the data in a way that preserves its analytic utility. No existing anonymization algorithm provides both perfect privacy protection and perfect analytic utility. We make the new observation that anonymization algorithms are not required to operate in the original vector-space basis of the data, and many algorithms can be improved by operating in a judiciously chosen alternate basis. A spectral basis derived from the data's eigenvectors is one that can provide substantial improvement. We introduce the term spectral anonymization to refer to an algorithm that uses a spectral basis for anonymization, and we give two illustrative examples. We also propose new measures of privacy protection that are more general and more informative than existing measures, and a principled reference standard with which to define adequate privacy protection.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2009-11-17
E Patrick Shironoshita; Yves R Jean-Mary; Ray M Bradley; Mansur R Kabuka

The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the algebraic manipulation of graph patterns, which in turn restricts the ability to create query plans for distributed processing or query optimization. In this paper, we present semQA, an algebraic extension for the SPARQL query language for RDF, which overcomes this issue by transforming graph patterns through the use of an idempotent disjunction operator Or as a substitute for Union. This permits the application of a set of equivalences that transform a query into distinct forms. We further present an algorithm to derive the solution set of the original query from the solution set of a query where Union has been substituted by Or. We also analyze the combined complexity of SPARQL, proving it to be NP-complete. It is also shown that the SPARQL query language is not, in the general case, fixed-parameter tractable. Experimental results are presented to validate the query evaluation methodology presented in this paper against the SPARQL standard to corroborate the complexity analysis and to illustrate the gains in processing cost reduction that can be obtained through the application of semQA.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2012-09-01
Vladimir Grupcev; Yongke Yuan; Yi-Cheng Tu; Jin Huang; Shaoping Chen; Sagar Pandit; Michael Weng

Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations impose great challenges to database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and requires quadratic time to compute using a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs. While beating the naive solution, such algorithms are still not practical in processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm, and this solution has running time that is unrelated to the system size N. We also develop a mathematical model to analyze the mechanism that leads to errors in the basic approximate algorithm. Our model provides insights on how the algorithm can be improved to achieve higher accuracy and efficiency. Such insights give rise to a new approximate algorithm with improved time/accuracy tradeoff. Experimental results confirm our analysis.
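
The quadratic-time baseline mentioned above can be sketched as follows: every pair of particles is visited once and its distance is counted in a fixed-width bucket. This is the straightforward algorithm, not the paper's approximate method; the function name, the parameters, and the choice to clamp overflow distances into the last bucket are assumptions.

```python
import math


def spatial_distance_histogram(points, bucket_width, num_buckets):
    """Naive O(N^2) SDH: count pairwise distances into fixed-width buckets.

    Bucket i covers distances in [i * bucket_width, (i + 1) * bucket_width);
    distances beyond the last bucket are clamped into it.
    """
    hist = [0] * num_buckets
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            bucket = min(int(d // bucket_width), num_buckets - 1)
            hist[bucket] += 1
    return hist


particles = [(0, 0), (1, 0), (0, 1), (3, 4)]
print(spatial_distance_histogram(particles, bucket_width=2, num_buckets=3))
# → [3, 0, 3]
```

The histogram always sums to N(N-1)/2 pairs, which is why the cost grows quadratically with the system size N and motivates both the exact and approximate speedups.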

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-05-15
Chen Chen, Jingrui He, Nadya Bliss, Hanghang Tong

Networks are prevalent in many high-impact domains. Moreover, cross-domain interactions are frequently observed in many applications, which naturally form dependencies between different networks. Such highly coupled network systems are referred to as multi-layered networks, and have been used to characterize various complex systems, including critical infrastructure networks, cyber-physical systems, collaboration platforms, biological systems and many more. Unlike single-layered networks, where the functionality of the nodes is mainly affected by within-layer connections, multi-layered networks are more vulnerable to disturbance because the impact can be amplified through cross-layer dependencies, leading to cascading failure of the entire system. To manipulate the connectivity in multi-layered networks, some recent methods have been proposed based on two-layered networks with specific types of connectivity measures. In this paper, we address the above challenges in multiple dimensions. First, we propose a family of connectivity measures (SUBLINE) that unifies a wide range of classic network connectivity measures. Third, we reveal that the connectivity measures in the SUBLINE family enjoy the diminishing returns property, which guarantees a near-optimal solution with linear complexity for the connectivity optimization problem. Finally, we evaluate our proposed algorithm on real data sets to demonstrate its effectiveness and efficiency.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2018-02-20
Bryan Minor, Janardhan Rao Doppa, Diane J Cook

Recent progress in Internet of Things (IoT) platforms has allowed us to collect large amounts of sensing data. However, there are significant challenges in converting this large-scale sensing data into decisions for real-world applications. Motivated by applications such as health monitoring and intervention and home automation, we consider a novel problem called Activity Prediction, where the goal is to predict future activity occurrence times from sensor data. In this paper, we make three main contributions. First, we formulate and solve the activity prediction problem in the framework of imitation learning and reduce it to a simple regression learning problem. This approach allows us to leverage powerful regression learners that can reason about the relational structure of the problem with negligible computational overhead. Second, we present several metrics to evaluate activity predictors in the context of real-world applications. Third, we evaluate our approach using real sensor data collected from 24 smart home testbeds. We also embed the learned predictor into a mobile-device-based activity prompter and evaluate the app with 9 participants living in smart homes. Our results indicate that our activity predictor performs better than the baseline methods, and offers a simple approach for predicting activities from sensor data.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2017-12-16
Huan Gui, Jialu Liu, Fangbo Tao, Meng Jiang, Brandon Norick, Lance Kaplan, Jiawei Han

In real-world applications, objects of multiple types are interconnected, forming heterogeneous information networks. In such networks, we make the key observation that many interactions happen due to some event, and the objects in each event form a complete semantic unit. By taking advantage of this property, we propose a generic framework called HyperEdge-Based Embedding (Hebe) to learn object embeddings with events in heterogeneous information networks, where a hyperedge encompasses the objects participating in one event. The Hebe framework models the proximity among objects in each event with two methods: (1) predicting a target object given the other participating objects in the event, and (2) predicting whether the event can be observed given all the participating objects. Since each hyperedge encapsulates more information about a given event, Hebe is robust to data sparseness and noise. In addition, Hebe is scalable when the data size spirals. Extensive experiments on large-scale real-world datasets show the efficacy and robustness of the proposed framework.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2017-11-07
Liangyue Li, Hanghang Tong, Nan Cao, Kate Ehrlich, Yu-Ru Lin, Norbou Buchler

In this paper, we study ways to enhance the composition of teams based on new requirements in a collaborative environment. We focus on recommending team members who can maintain the team's performance by minimizing changes to the team's skills and social structure. Our recommendations are based on computing team-level similarity, which includes skill similarity, structural similarity as well as the synergy between the two. Current heuristic approaches are one-dimensional and not comprehensive, as they consider the two aspects independently. To formalize team-level similarity, we adopt the notion of graph kernel of attributed graphs to encompass the two aspects and their interaction. To tackle the computational challenges, we propose a family of fast algorithms by (a) designing effective pruning strategies, and (b) exploring the smoothness between the existing and the new team structures. Extensive empirical evaluations on real world datasets validate the effectiveness and efficiency of our algorithms.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2014-12-23
Elizabeth Ashley Durham, Murat Kantarcioglu, Yuan Xue, Csaba Toth, Mehmet Kuzu, Bradley Malin

The process of record linkage seeks to integrate instances that correspond to the same entity. Record linkage has traditionally been performed through the comparison of identifying field values (e.g., Surname); however, when databases are maintained by disparate organizations, the disclosure of such information can breach the privacy of the corresponding individuals. Various private record linkage (PRL) methods have been developed to obscure such identifiers, but they vary widely in their ability to balance the competing goals of accuracy, efficiency and security. The tokenization and hashing of field values into Bloom filters (BF) enables greater linkage accuracy and efficiency than other PRL methods, but the encodings may be compromised through frequency-based cryptanalysis. Our objective is to adapt a BF encoding technique to mitigate such attacks with minimal sacrifices in accuracy and efficiency. To accomplish these goals, we introduce a statistically-informed method to generate BF encodings that integrate bits from multiple fields, the frequencies of which are provably associated with a minimum number of fields. Our method enables a user-specified tradeoff between security and accuracy. We compare our encoding method with other techniques using a public dataset of voter registration records and demonstrate that the increases in security come with only minor losses to accuracy.
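
As an illustration of the baseline idea mentioned in the abstract (tokenizing a field value and hashing the tokens into a Bloom filter), the sketch below encodes a single field as character bigrams. It is not the authors' multi-field, statistically-informed method. The function names, the padding character, the filter size, the number of hash functions, the seeded SHA-256 hashing, and the Dice comparison are all assumptions chosen for the example.

```python
import hashlib


def bigrams(value):
    """Split a lower-cased, padded field value into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, num_bits=64, num_hashes=4):
    """Hash each bigram num_hashes times and set the corresponding bits."""
    bits = [0] * num_bits
    for gram in bigrams(value):
        for seed in range(num_hashes):
            # Assumption: seeded SHA-256 stands in for independent hash functions.
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits


def dice_similarity(a, b):
    """Dice coefficient between two equal-length bit vectors."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * common / total if total else 0.0


if __name__ == "__main__":
    print(dice_similarity(bloom_encode("Smith"), bloom_encode("Smyth")))
```

Because similar strings share bigrams, their encodings share set bits, which allows approximate matching without exchanging the raw values. The same regularity is what makes single-field encodings a target for the frequency-based attacks the abstract mentions.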

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2014-11-18
Jiaoyun Yang, Yun Xu, Yi Shang, Guoliang Chen

The multiple longest common subsequence (MLCS) problem, related to the identification of sequence similarity, is an important problem in many fields. Because the problem is NP-hard, exact algorithms have difficulty handling large-scale data, and time- and space-efficient algorithms are required in real-world applications. To deal with time constraints, anytime algorithms have been proposed to generate good solutions within a reasonable time. However, there exists little work on space-efficient MLCS algorithms. In this paper, we formulate the MLCS problem as a graph search problem and present two space-efficient anytime MLCS algorithms, SA-MLCS and SLA-MLCS. SA-MLCS uses an iterative beam widening search strategy to reduce space usage during the iterative process of finding better solutions. Based on SA-MLCS, SLA-MLCS, a space-bounded algorithm, is developed to prevent space usage from exceeding available memory. SLA-MLCS uses a replacing strategy when SA-MLCS reaches a given space bound. Experimental results show SA-MLCS and SLA-MLCS use an order of magnitude less space and time than the state-of-the-art approximate algorithm MLCS-APP while finding better solutions. Compared to the state-of-the-art anytime algorithm Pro-MLCS, SA-MLCS and SLA-MLCS can solve instances an order of magnitude larger. Furthermore, SLA-MLCS can find much better solutions than SA-MLCS on large instances.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2015-01-01
Barnan Das, Narayanan C Krishnan, Diane J Cook

As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not well represented, which leads to high misclassification error. We introduce two Gibbs sampling-based oversampling approaches, namely RACOG and wRACOG, to synthetically generate and strategically select new minority class samples. The Gibbs sampler uses the joint probability distribution of the attributes of the data to generate new minority class samples in the form of a Markov chain. While RACOG selects samples from the Markov chain based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using five UCI datasets that were carefully modified to exhibit class imbalance and one new application domain dataset with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2014-10-21
Yang Mu, Henry Z Lo, Wei Ding, Kevin Amaral, Scott E Crouter

Physical activity consists of complex behavior, typically structured in bouts which can consist of one continuous movement (e.g. exercise) or many sporadic movements (e.g. household chores). Each bout can be represented as a block of feature vectors corresponding to the same activity type. This paper introduces a general distance metric technique that uses this block representation to first predict activity type, and then uses the predicted activity to estimate energy expenditure within a novel framework. This distance metric, dubbed Bipart, learns block-level information from both training and test sets, combining both to form a projection space which materializes block-level constraints. Thus, Bipart provides a space which can improve the bout classification performance of all classifiers. We also propose an energy expenditure estimation framework which leverages activity classification in order to improve estimates. Comprehensive experiments on waist-mounted accelerometer data, comparing Bipart against many similar methods as well as other classifiers, demonstrate the superior activity recognition of Bipart, especially in low-information experimental settings.

Updated: 2019-11-01
• IEEE Trans. Knowl. Data. Eng. (IF 3.857) Pub Date : 2014-09-30
Anand Kumar, Vladimir Grupcev, Yongke Yuan, Jin Huang, Yi-Cheng Tu, Gang Shen

This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). The computation time of an SDH query using a brute-force method is quadratic. Often, such queries are executed continuously over certain time periods, increasing the computation time. We propose a highly efficient approximate algorithm to compute SDH over consecutive time periods with provable error bounds. The key idea of our algorithm is to derive the statistical distribution of distances from the spatial and temporal characteristics of particles. Upon organizing the data into a Quad-tree based structure, the spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles' spatial distribution as well as their temporal locality in consecutive time periods. We report our efforts in implementing and optimizing the above algorithm on Graphics Processing Units (GPUs) as a means to further improve efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical analysis and the results of extensive experiments using data generated from real simulation studies.

Updated: 2019-11-01
Contents have been reproduced by permission of the publishers.
