Abstract

Most existing online distributed machine learning algorithms have been studied in a data-parallel architecture among agents in networks. We study online distributed machine learning from a different perspective, where the features of the same samples are observed by multiple agents that wish to collaborate but do not exchange their raw data with each other. We propose a distributed feature online gradient descent algorithm and prove that the local solution converges to the global minimizer at a sublinear rate. Our algorithm does not require the exchange of raw data or even of model parameters between agents. First, we design an auxiliary variable, which carries the information of the global features, and estimate it at each agent by a dynamic consensus method. Then, the local parameters are updated by an online gradient descent method based on the local data stream. Simulations illustrate the performance of the proposed algorithm.

1. Introduction

With the development of multiagent systems, observed data are generated anywhere and anytime, using different devices and technologies [1–3]. There is great interest in extracting knowledge from this massive amount of data and using it to choose a suitable business strategy [4–6], to generate control commands [7–9], or to make decisions [10–13]. Many applications are required to process incoming data in an online way: for example, a bank monitors the transactions of its clients to detect fraud [2], wireless sensor networks make inferences [14], and a sensor network tracks an uncooperative target [15]. The study of online learning is becoming an important research topic in its own right [16–18].

The success of online machine learning often depends on the entire data stream. In some applications, the observed data may be generated and held by multiple agents [1, 13]. Collecting the data at a central site for training incurs extra management overhead and privacy concerns [1]. As a result, some distributed machine learning algorithms have been proposed to train a model by letting each agent perform local model updates and exchange some information with its neighbors [19–22]. Most of the existing algorithms fall into the data-parallel category [1], where each agent has a local data stream containing the entire set of features. However, in network applications, multiple agents are often used to monitor an environment: the agents are distributed over space and collect different measurements. For example, the observations may be generated by different observation models [8, 9]. It is therefore pressing to develop algorithms that can handle data streams with distributed features over networks.

In batch learning settings, some algorithms have been proposed for distributed features, such as variance-reduced dynamic diffusion (VRD2) [12], feature distributed machine learning (FDML) [1], and ADMM (alternating direction method of multipliers) sharing [23]. VRD2 and FDML obtain the optimal solution in the primal domain, and the local model is trained in a distributed manner based on the local features. The ADMM sharing algorithm formulates distributed feature learning as a distributed primal-dual problem and then obtains the optimal solution by the ADMM algorithm. The algorithms in [1, 12, 23] deal effectively with batch distributed feature learning in a distributed form. However, they need access to the entire dataset and cannot be applied in online settings. As observations arrive continuously and rapidly in networks, it is important to study online feature-distributed machine learning.

In this paper, we consider the situation where the features are split across agents in online settings, either due to privacy considerations or because the data are already physically collected in a distributed manner by means of a networked architecture. We propose a distributed feature online gradient algorithm. Online supervised learning over networks is formulated in a “cost of sum” form. The procedure of the proposed algorithm requires two time scales: one scale is used to update the parameters by gradient descent, and a second, faster scale runs the consensus step multiple times to track an auxiliary term. The main contributions of this paper are summarized as follows.
(1) We propose a distributed feature online gradient (DFOG) descent algorithm. By exchanging some information between neighbors, the local solution can approximate the global solution. Compared with VRD2 [12], FDML [1], and the ADMM sharing algorithm [23], DFOG is applicable to online supervised learning with distributed features over networks.
(2) We first formulate the centralized cost in a “cost of sum” form. Through a dynamic consensus algorithm, each node can track the sum term, which carries the information of the entire feature vector of the sample at each round. Then, with the help of the online gradient descent algorithm, each node locally updates its parameters based on its own data stream.
(3) We prove that the proposed algorithm achieves a sublinear regret bound. That is, the local solution can approach the global solution, which is the best decision trained on the entire dataset. The only transmitted messages are some parameter information; the proposed algorithm does not require prior knowledge of the total number of time steps and does not exchange raw data between neighbors.

The rest of this paper is organized as follows: the problem formulation is discussed in Section 2. In Section 3, we focus on our online optimization algorithm with distributed features over multiagent system, followed by the theoretical results in Section 4. In Section 5, simulations illustrate the effectiveness of our algorithm. Finally, we conclude the paper in Section 6.

Notation and terminology: let $\mathcal{X}$ be the feature space and $\mathcal{Y}$ be the corresponding label set. We denote the $(i,j)$th element of a matrix $A$ by $[A]_{ij}$. For an integer $m$, the set $\{1, \dots, m\}$ is denoted by $[m]$. For a convex function $f$, its gradient at a point $x$ is denoted as $\nabla f(x)$. We denote $n$ as the number of agents in the network. Let $\mathbb{R}^d$ be the $d$-dimensional vector space, and let $\|x\|$ be the Euclidean norm of a vector $x$.

2. Problem Formulation

We consider a multiagent system with $n$ agents. The communication between agents is described by a connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ [24], consisting of a set of nodes $\mathcal{V}$, a set of edges $\mathcal{E}$, and an adjacency matrix $A$ [19]. For each agent $i$, we denote $\mathcal{N}_i$ as the set of neighbors of agent $i$ (including agent $i$ itself).

Assumption 1. The graph $\mathcal{G}$ and the weighted adjacency matrix $A$ satisfy the following [25]:
(i) $A$ is a doubly stochastic matrix with positive diagonal, that is, $A\mathbf{1} = \mathbf{1}$, $\mathbf{1}^{\top}A = \mathbf{1}^{\top}$, and $[A]_{ii} > 0$;
(ii) there exists a scalar $\zeta > 0$ such that $[A]_{ij} \ge \zeta$ if $(j, i) \in \mathcal{E}$;
(iii) the graph $\mathcal{G}$ is strongly connected.

In this work, we focus on binary online supervised learning with distributed features. The features are distributed over a collection of agents, as illustrated in Figure 1.
At each time $t$, the network receives a labeled sample $(x_t, y_t)$. Over the whole horizon, we consider the empirical risk

$J(w) = \sum_{t=1}^{T} \ell(w^{\top}x_t; y_t) + \rho\, r(w)$,  (1)

where the parameters are denoted as $w \in \mathbb{R}^d$, $d$ is the dimension of the features, and $y_t$ is the corresponding scalar label of $x_t$ at time $t$. Moreover, the cost $\ell$ is convex and differentiable. In most problems of interest, the cost function depends on the inner product $w^{\top}x_t$, such as the linear SVM cost $\max\{0, 1 - y_t w^{\top}x_t\}$ and the logistic regression cost $\ln(1 + \exp(-y_t w^{\top}x_t))$. The factor $r(w)$ represents the regularization term. Since the features of $x_t$ are distributed across $n$ agents, we take $w$ and $x_t$ to be column vectors and partition them into subvectors $w_i$ and $x_{i,t}$, respectively, that is,

$w = \operatorname{col}\{w_1, \dots, w_n\}, \quad x_t = \operatorname{col}\{x_{1,t}, \dots, x_{n,t}\}$.  (2)

Each subfeature vector $x_{i,t}$ and subvector $w_i$ is located at agent $i$. Then, cost function (1) can be rewritten as

$J(w) = \sum_{t=1}^{T} \ell\Big(\sum_{i=1}^{n} w_i^{\top}x_{i,t};\; y_t\Big) + \rho \sum_{i=1}^{n} r(w_i)$,  (3)

where the regularization term is assumed to satisfy the additive form

$r(w) = \sum_{i=1}^{n} r(w_i)$.  (4)

This property holds for many popular regularization choices, such as $\|w\|_1$, $\|w\|_2^2$, and the KL-divergence. Problems of this type have been studied before in the literature by using distributed optimization methods [20, 21]. One common way is to formulate problem (3) as a constrained problem, that is,

$\min_{w,\,z} \sum_{t=1}^{T} \ell(z_t; y_t) + \rho \sum_{i=1}^{n} r(w_i) \quad \text{s.t.} \quad z_t = \sum_{i=1}^{n} w_i^{\top}x_{i,t}$.  (5)

Over the whole horizon, problem (5) is a classical “cost of sum” form [20]. An effective way is to design the Lagrangian function by introducing the dual variables $\lambda_t$ [23], namely,

$L(w, z, \lambda) = \sum_{t=1}^{T} \Big[\ell(z_t; y_t) + \lambda_t\Big(\sum_{i=1}^{n} w_i^{\top}x_{i,t} - z_t\Big)\Big] + \rho \sum_{i=1}^{n} r(w_i)$.  (6)

Problem (6) can be solved by a number of distributed primal-dual methods, such as the alternating direction method of multipliers (ADMM) [4, 22, 26] and primal-dual methods [27–29]. These techniques have good convergence properties but suffer from high computational costs and two-time-scale communications.
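The additive structure behind the “cost of sum” form can be checked numerically. The following minimal Python sketch (illustrative, not from the paper; the dimensions and block sizes are arbitrary) partitions one sample's features across agents and verifies that the local inner products sum to the global one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 12, 3                      # total feature dimension, number of agents

w = rng.standard_normal(d)        # global parameter vector
x = rng.standard_normal(d)        # one sample's full feature vector

# Split w and x into n contiguous blocks, one block per agent.
w_blocks = np.array_split(w, n)
x_blocks = np.array_split(x, n)

# Each agent i only sees (w_i, x_i); the global inner product equals the
# sum of the local inner products, which is the "cost of sum" structure.
local_sum = sum(wi @ xi for wi, xi in zip(w_blocks, x_blocks))

assert np.isclose(local_sum, w @ x)
```

Because the cost depends on $w$ and $x_t$ only through this inner product, an agent that learns the sum term learns everything it needs about the other agents' blocks.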
The other way is studied in the primal domain [12]. The algorithm in [12] requires a two-time-scale operation: a faster time scale for the consensus iterations and a slower time scale for the data sampling and the gradient computation. First, a consensus strategy is used to obtain the sum term $\sum_{i=1}^{n} w_i^{\top}x_{i,s}$, namely,

$z_i^{(k+1)} = \sum_{j \in \mathcal{N}_i} a_{ij} z_j^{(k)}, \quad z_i^{(0)} = n\, w_i^{\top}x_{i,s}$,  (7)

where $s$ denotes the index of the sample selected uniformly at random from the dataset. After sufficiently many iterations, it is well known that $z_i^{(k)} \to \sum_{j=1}^{n} w_j^{\top}x_{j,s}$. Then, a stochastic-gradient step is used to update the parameters, where the gradient is evaluated at the randomly selected sample.
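The averaging step at the heart of this strategy can be illustrated with a small numerical sketch. The ring network of six agents and the hand-built doubly stochastic weight matrix below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Ring of n agents with a doubly stochastic weight matrix (self weight 1/2,
# each of the two ring neighbors 1/4), consistent with Assumption 1.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5
    A[i, (i - 1) % n] = 0.25
    A[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(1)
z = rng.standard_normal(n)        # local values, e.g., scaled inner products
target = z.mean()                 # consensus limit: the network average

# Repeated averaging with neighbors drives every entry toward the mean.
for _ in range(200):
    z = A @ z

assert np.allclose(z, target)
```

Since each iteration preserves the average (the matrix is doubly stochastic) while contracting the disagreement, every agent's value converges geometrically to the network average.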
In online settings, since the data are observed one by one, we cannot access the total dataset. The algorithms in [1, 12, 23] therefore cannot be applied to data streams with distributed features over networks. At each time $t$, the multiagent system is endowed with a cost function $\ell_t$, and the goal is to minimize the sum of the cost functions over time. Specifically, we want to minimize the difference between the total cost the multiagent system has incurred and that of the best fixed decision in hindsight, which is called the regret and is defined as

$\mathrm{Reg}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \sum_{t=1}^{T} \ell_t(w^{\ast})$,  (8)

where $w^{\ast}$ is the best decision of problem (1), that is,

$w^{\ast} = \arg\min_{w} \sum_{t=1}^{T} \ell_t(w)$.  (9)

Moreover, we consider the time-varying cost function

$\ell_t(w) = \ell(w^{\top}x_t; y_t) + \rho\, r(w)$.  (10)

Generally speaking, the cost satisfies Assumption 2.

Assumption 2. The loss function $\ell_t$ is convex and differentiable, and its gradient is uniformly bounded, that is, $\|\nabla \ell_t(w)\| \le G$ for some scalar $G > 0$.
Regret is the standard measure of the performance of an online optimization algorithm [19]. An algorithm attains good performance if the regret is sublinear as a function of the total time $T$.

Remark 1. In the multiagent system, since the entries of the feature vector are distributed over agents, each agent only observes its own data stream. We face the following two challenges in solving problem (8):
(1) Distributed challenge: each agent only receives its local data stream and cannot access the entire feature vector $x_t$. Under the condition that no raw data are exchanged between neighbors, each agent needs to obtain some information on the entire feature vector.
(2) Online challenge: at any time $t$, we only have the observations received up to time $t$ and do not know the future ones. It is difficult to store all the observations due to the high-dimensional and high-velocity data stream. We need to update the parameters based on the current sample and the previous parameters and pursue a solution approximating the global solution $w^{\ast}$, which is the best decision based on all the data given as a prior in offline settings.

3. Distributed Feature Online Gradient Descent Algorithm

In this section, we first analyse a dynamic average consensus method for approximating the sum of the local inner products at each agent and then propose an online convex optimization method to update the parameters. The detailed framework is summarised in Figure 2.

Now, we consider the problem of minimizing (5) by means of online convex optimization. Let $z_t = \sum_{i=1}^{n} w_{i,t}^{\top}x_{i,t}$ denote the inner product that is available at time $t$. The cost function can be described as

$\ell_t(w_t) = \ell(z_t; y_t) + \rho \sum_{i=1}^{n} r(w_{i,t})$.  (11)

If each agent $i$ could obtain the auxiliary variable $z_t$ at any time $t$, the parameters could be obtained by minimizing the local cost $\ell_{i,t}$, which is defined as

$\ell_{i,t}(w_{i,t}) = \ell(z_t; y_t) + \rho\, r(w_{i,t})$.  (12)

However, the computation of $z_t$ requires access to all the subfeatures and subvectors over the agents. We denote the average of the local inner products as

$\bar{z}_t = \frac{1}{n}\sum_{i=1}^{n} w_{i,t}^{\top}x_{i,t}$.  (13)

Motivated by the works in [30–34], $\bar{z}_t$ can be approximated by a diffusion-based algorithm. Since the desired variable is proportional to the average value, $z_t = n\bar{z}_t$, the consensus strategy can be used to approximate $z_t$. Specifically, for a total number of iterations $K$, each agent repeats the following step $K$ times:

$z_i^{(k+1)} = \sum_{j \in \mathcal{N}_i} a_{ij} z_j^{(k)}$,  (14)

where $z_i^{(0)} = n\, w_{i,t}^{\top}x_{i,t}$. After each agent obtains the estimator of $z_t$, denoted as $\hat{z}_{i,t} = z_i^{(K)}$, problem (12) is converted into a differentiable dynamic problem. For online convex optimization problems, online gradient descent and its variants achieve optimal dynamic regret in many applications [35]. Recalling that $w_t$ and $x_t$ are partitioned into $n$ blocks, the gradient step can be performed in parallel over the agents. Specifically,

$w_{i,t+1} = w_{i,t} - \eta_t \nabla_{w_i}\ell_{i,t}(w_{i,t})$,  (15)

where the step-size $\eta_t$ should satisfy $\eta_t > 0$, $\sum_{t}\eta_t = \infty$, and $\sum_{t}\eta_t^2 < \infty$.

The full algorithm is summarized in Algorithm 1.

(1) Initialization: set $w_{i,1} = 0$ for all $i \in [n]$.
(2) Repeat for $t = 1, 2, \dots, T$:
(3)  Each agent $i$ observes $(x_{i,t}, y_t)$ and sets $z_i^{(0)} = n\, w_{i,t}^{\top}x_{i,t}$.
(4)  For $k = 0, 1, \dots, K-1$:
(5)   $z_i^{(k+1)} = \sum_{j \in \mathcal{N}_i} a_{ij} z_j^{(k)}$
(6)  End
(7)  $w_{i,t+1} = w_{i,t} - \eta_t \nabla_{w_i}\ell_{i,t}(w_{i,t})$, with $\hat{z}_{i,t} = z_i^{(K)}$ in place of $z_t$.
(8) End

Remark 2. Compared with FDML [1], VRD2 [12], and the ADMM sharing algorithm [23], DFOG is applicable to data streams with distributed features over a multiagent system. At each round, the agents observe the same sample through different features. Each agent can obtain an auxiliary term, which carries the information of the entire feature vector. Then, each agent locally runs a gradient descent step to update its local parameters. The procedure of Algorithm 1 is designed to update the parameters locally.
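The per-round procedure described in this remark can be sketched in Python. This is an illustrative toy, not the authors' implementation: the choice of logistic loss, the function name `dfog_round`, and the omission of the regularization term from the update are all simplifying assumptions.

```python
import numpy as np

def dfog_round(w_blocks, x_blocks, y, A, K, eta):
    """One round of the DFOG sketch: K consensus steps to estimate the
    average local inner product, then a local gradient step per agent.
    The (unregularized) logistic loss l(z; y) = log(1 + exp(-y*z)) is assumed."""
    n = len(w_blocks)
    # Each agent starts from its own local inner product <w_i, x_i>.
    z = np.array([wi @ xi for wi, xi in zip(w_blocks, x_blocks)])
    for _ in range(K):            # consensus: z_i -> average of the <w_j, x_j>
        z = A @ z
    new_blocks = []
    for i in range(n):
        margin = y * n * z[i]     # n * average approximates the full <w, x>
        coeff = -y / (1.0 + np.exp(margin))   # dl/dz for the logistic loss
        # Chain rule: the derivative of <w, x> with respect to w_i is x_i.
        new_blocks.append(w_blocks[i] - eta * coeff * x_blocks[i])
    return new_blocks
```

Note that only the scalars $z_i$ travel over the network; the feature blocks and parameter blocks never leave their agents, matching the privacy property claimed for DFOG.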

4. Algorithm Analysis

4.1. Convergency Analysis

In this section, we analyse the convergence of the proposed algorithm. We first show that the distance between the local estimate $\hat{z}_{i,t}$ and $z_t$ is upper bounded in terms of the initial consensus disagreement, which is stated in Lemma 1 and proved in [25].

Lemma 1. Let Assumption 1 hold. Then, for all agents $i \in [n]$, we have

$|\hat{z}_{i,t} - z_t| \le \sigma^{K}\, \|z^{(0)} - z_t\mathbf{1}\|$,

where $n$ is the total number of agents, $K$ is the number of consensus steps in (14), and $\sigma \in (0, 1)$ is the second-largest singular value of $A$.

Then, we bound the regret of online gradient descent (OGD) by the cumulative difference between the losses of $w_t$ and $w^{\ast}$, which is presented in Lemma 2 and proved in [18].

Lemma 2. Let $\{w_t\}$ denote the sequence of parameters produced by OGD. Then, for any $w^{\ast}$, we have

$\sum_{t=1}^{T}\big(\ell_t(w_t) - \ell_t(w^{\ast})\big) \le \frac{\|w_1 - w^{\ast}\|^2}{2\eta_T} + \frac{G^2}{2}\sum_{t=1}^{T}\eta_t$.

Because the features are distributed across agents, the left-hand side mainly illustrates the difference between the local parameters and the corresponding blocks of the global solution. Based on the above lemma, we derive a sublinear regret bound for DFOG with an additive regularization term.
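For reference, the standard OGD regret argument that a result like Lemma 2 rests on can be sketched as follows (a textbook derivation under Assumption 2, not taken verbatim from [18]; $w_t$, $w^{\ast}$, $\eta_t$, and $G$ are the usual symbols):

```latex
\|w_{t+1}-w^{\ast}\|^2
  \le \|w_t-w^{\ast}\|^2
      - 2\eta_t \langle \nabla \ell_t(w_t),\, w_t - w^{\ast}\rangle
      + \eta_t^2 G^2,
\quad\text{so by convexity}\quad
\ell_t(w_t)-\ell_t(w^{\ast})
  \le \frac{\|w_t-w^{\ast}\|^2-\|w_{t+1}-w^{\ast}\|^2}{2\eta_t}
      + \frac{\eta_t G^2}{2}.
```

Summing over $t = 1, \dots, T$ with the diminishing step-size $\eta_t = \eta/\sqrt{t}$ telescopes the first term and gives a regret of order $\sqrt{T}$.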

Theorem 1. Let Assumptions 1 and 2 hold, and consider running DFOG on a sequence of convex functions, for all $t \in [T]$, with an additive regularization term satisfying (4). Let $\{w_t\}$ be the sequence of vectors produced by DFOG. If $\eta_t = \eta/\sqrt{t}$ and $K$ is chosen sufficiently large, the regret of DFOG satisfies

$\mathrm{Reg}(T) = O(\sqrt{T})$.

The proof is presented in the Appendix.

Remark 3. This theorem indicates that the convergence rate of DFOG depends on the network topology, through the consensus contraction factor $\sigma$, and on the number of consensus steps $K$. The larger $K$ is, or the smaller $\sigma$ is, the faster the convergence. The theorem shows that the proposed algorithm converges to the global solution at a sublinear rate. As the number of data samples increases, the local solution and the global solution become closer.

4.2. Complexity Analysis
4.2.1. Time Complexity

There are two primary operations associated with learning in DFOG: (1) estimating the inner product for each sample at time $t$ and (2) updating the parameters in the gradient descent step. At any time $t$, computing the estimate requires $K$ consensus steps plus one local inner product, and the gradient descent step requires a number of arithmetic operations proportional to the dimension $d_i$ of agent $i$'s feature block. Hence, over $T$ rounds, a single node requires on the order of $T(K + d_i)$ arithmetic operations for DFOG.

4.2.2. Space Complexity

At any time $t$, DFOG needs to store the local parameters $w_{i,t}$ and the estimate $\hat{z}_{i,t}$, which are updated and time-varying. Hence, the space complexity of DFOG at agent $i$ is $O(d_i)$.

4.2.3. Communication Complexity

We denote the average degree of the communication graph as $\bar{d}$. At each consensus step, each node needs to exchange one scalar (float type, 4 bytes) with each of its neighbors. Since the network topology is an undirected graph, this amounts to $4n\bar{d}K$ bytes at any time $t$. Hence, the total communication traffic of DFOG is $4n\bar{d}KT$ bytes over all $T$ rounds.
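The byte count above is a simple product, sketched below. The function name is illustrative, and the convention that each neighbor link carries one 4-byte float per consensus step (rather than one per direction) is an assumption:

```python
def dfog_traffic_bytes(n, avg_degree, K, T, float_bytes=4):
    """Total bytes exchanged by DFOG over T rounds: at each of the K
    consensus steps, every one of the n nodes sends one float to each
    of its avg_degree neighbors on average."""
    return float_bytes * n * avg_degree * K * T

# Example: 6 agents, average degree 2, K = 5 consensus steps, T = 1000 rounds.
print(dfog_traffic_bytes(6, 2, 5, 1000))  # -> 240000
```

This makes the trade-off explicit: traffic grows linearly in the number of consensus steps $K$, which is exactly the parameter that controls the consensus accuracy in Lemma 1.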

5. Simulation

In this section, we test our algorithm by minimizing a norm-regularized logistic regression on two public datasets, a9a and bank from UCI. A multiagent system with 6 agents is considered, and the network is generated by the random geometric graph model. The a9a dataset consists of 32561 training samples, 16281 testing samples, and 123 features. We separate the features into 6 parts, and each agent obtains one part, with 21, 21, 21, 20, 20, and 20 features as its local data, respectively. The bank dataset contains 4068 training samples, 453 testing samples, and 17 features. Similarly, we divide the features into 6 parts, and each agent gets one part, with 3, 3, 3, 3, 3, and 2 features as its local data, respectively. The loss function we use is

$\ell_t(w) = \ln\big(1 + \exp(-y_t w^{\top}x_t)\big) + \frac{\rho}{2}\|w\|_2^2$.  (19)

Generally, $\rho$ is a positive constant. The simulations are implemented in MATLAB to verify the proposed algorithm. Specifically, we use two optimization libraries, SGDLibrary [36] and DADAM [37], to minimize (19). In our simulations, the parameters are set as summarised in Table 1.
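As a reference point for the cost being minimized, a plausible Python version of the L2-regularized logistic loss and its gradient is sketched below (the exact form of (19) was lost in typesetting, so the standard formulation is assumed; the function names are illustrative):

```python
import numpy as np

def l2_logistic_loss(w, x, y, rho):
    """L2-regularized logistic loss, assumed form of cost (19):
    log(1 + exp(-y * <w, x>)) + (rho / 2) * ||w||^2."""
    return np.log1p(np.exp(-y * (w @ x))) + 0.5 * rho * (w @ w)

def l2_logistic_grad(w, x, y, rho):
    """Gradient of the loss above with respect to w."""
    s = -y / (1.0 + np.exp(y * (w @ x)))   # derivative of the logistic term
    return s * x + rho * w                 # plus the regularizer's gradient
```

Either term can be checked against a finite-difference approximation, which is a cheap way to validate any hand-derived gradient before plugging it into the update (15).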

We adopt the dynamic consensus method to obtain the estimate in (14) and use the online gradient descent algorithm to update the parameters locally via (15). In our simulations, we obtain the global model trained in a centralized manner, as if all the features were collected centrally, by the stochastic-gradient descent (SGD) algorithm. Next, we compare our algorithm against the SGD algorithm proposed in [38] and track the loss for different datasets and parameter settings. Figures 3 and 4 present the evolution of the cost during training for the a9a and bank datasets, respectively. In addition, to make a fair comparison, we analyse the convergence curves based on the number of gradient evaluations. Table 2 shows the testing errors for the different datasets and the parameter errors for DFOG and SGD. The results show that DFOG converges to the centralized solution of SGD while keeping each local feature set at the corresponding agent. That is, DFOG can handle the online supervised learning problem posed by distributed features over networks.

We next show how the performance depends on the number of consensus steps $K$. Note that a larger $K$ requires more communication in the consensus step (14). Figure 5 shows the evolution of the cost for different $K$. It can be seen that the larger we set $K$, the faster DFOG approaches the centralized SGD algorithm.

6. Conclusions

In this paper, we considered an online supervised learning problem where the features are split across agents in online settings. We proposed an online supervised learning algorithm with distributed features over a multiagent system. We first formulated the centralized cost in a “cost of sum” form. Through a dynamic consensus algorithm, each agent can effectively estimate the sum term, which is calculated from the entire feature vector at each round. Then, with the help of the online gradient descent algorithm, each agent locally updates its parameters. The designed algorithm does not require prior knowledge of the total number of time steps and does not communicate raw data between neighbors. We proved that the local solution converges to the centralized minimizer, which is the best decision trained on the entire dataset, and that the proposed algorithm achieves a sublinear regret bound. Simulations with real datasets verify these conclusions.

Distributed machine learning algorithms are worthy of further study due to their promising future, including distributed online boosting, distributed decision trees [39], big-data-aided learning [40], and distributed learning over time-varying communication topologies in networks.

Appendix

Proof of Theorem 1: let Assumption 2 hold. For each time $t$, by convexity we have

$\ell_t(w_t) - \ell_t(w^{\ast}) \le \langle \nabla \ell_t(w_t), w_t - w^{\ast} \rangle$,  (A.1)

where $\langle \cdot, \cdot \rangle$ is the inner product between vectors. Moreover, we denote $g_t = \nabla \ell_t(w_t)$.

Using Lemma 2, we obtain

From equation (15), we have $w_{i,t+1} = w_{i,t} - \eta_t \nabla_{w_i}\ell_{i,t}(w_{i,t})$, where the gradient is evaluated at the consensus estimate $\hat{z}_{i,t}$.

We derive the gradient of cost (12) as follows:

$\nabla_{w_i}\ell_{i,t}(w_{i,t}) = \ell'(\hat{z}_{i,t}; y_t)\, x_{i,t} + \rho\, \nabla r(w_{i,t})$.  (A.4)

Substituting (A.4) into (A.3), we obtain:

Let Assumption 2 hold, so that $\|\nabla \ell_t\| \le G$. Using Lemma 1, we have

Denoting the accumulated error terms and summing over $t \in [T]$, we derive

If $\eta_t = \eta/\sqrt{t}$ and $K$ is chosen sufficiently large, then $\mathrm{Reg}(T) = O(\sqrt{T})$. This completes the proof of Theorem 1.

Data Availability

The a9a dataset has been derived from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. The bank dataset has been derived from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.