FL and SL are two collaborative learning paradigms that may respectively facilitate the data aspect and the resource aspect of ubiquitous intelligence. However, training full models on individual devices in FL limits its ability to utilize the highly diverse and heterogeneous computational resources in IoT, while the sequential client-server collaboration in SL restricts its effectiveness in leveraging the big data dispersed across a huge number of IoT devices. Combining FL and SL may therefore fully unleash the advantages of both learning methods while mitigating their respective drawbacks. In this section, we review the state-of-the-art technologies for combining FL and SL to present a broad picture of the latest developments in this area. The majority of the research on combining FL and SL assumes a horizontal partition of user data; a few recent works consider other configurations (such as vertical and sequential configurations), which are reviewed at the end of this section.
4.1. Hybrid Frameworks for Combining Split Learning with Federated Learning
Most of the currently available research on combining FL and SL has been conducted from the perspective of enhancing SL performance by applying the FL mechanism. The sequential relay-based training across multiple clients in SL slows down global model convergence, forming a bottleneck that limits the scalability of multi-client SL frameworks. In addition, the sequential training in SL may cause the "catastrophic forgetting" issue: the model favors the data of the client it most recently trained on and is prone to forget what it learned from previous clients' data [63]. These issues of SL become severe in an IoT environment where a large number of clients have heterogeneous data distributions. In FL, all clients perform model training in parallel and the server aggregates the trained local models into a global model. Therefore, FL removes the training bottleneck caused by the shared server-side model in SL and thus may speed up the training process. FL also ensures that each client's data contribute to the global model through model aggregation, thus resolving the "catastrophic forgetting" problem of SL.
The main approach to combining SL with FL is to enable training parallelization and model aggregation in the SL framework. Since in SL the model training is split into two portions that are respectively executed on the client and server sides, training parallelization and model aggregation may be introduced separately on each side. Parallel training for server-side models may also be deployed on a single central server or distributed to multiple servers.
The SplitFed framework proposed in [64] is an attempt to amalgamate FL and SL by enabling parallel training and aggregation for both client-side and server-side models. In this framework, all clients perform forward propagation on their client-side models in parallel and pass their smashed data to the main server. The main server processes the forward propagation and back-propagation for the server-side model with each client's smashed data separately in parallel. The server then sends the gradients of the smashed data to the respective clients for back-propagation. After receiving the gradients of its smashed data, each client performs back-propagation to update its client-side model parameters. At the end of each epoch, the server updates its model using FedAvg aggregation, i.e., the weighted average of the gradients it computed during back-propagation on each client's smashed data. Similarly, the sub-models trained on all the participating clients are also aggregated, which can be performed by a dedicated server (Fed server) or by the main server (acting as a Fed server). The experiment results reported in [64] for a comparative performance evaluation against regular SL indicate that SplitFed converges faster than SL while achieving a similar level of model accuracy.
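To make the workflow concrete, the following minimal sketch runs one SplitFed round with single linear layers standing in for the client- and server-side sub-models. The layer sizes, learning rate, MSE loss, and sample-count weighting in the FedAvg step are illustrative assumptions, and the per-client loop written sequentially here would run in parallel in an actual deployment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: client-side model = one linear layer (W), server-side model = one
# linear layer (W_server); three clients with random data and an MSE loss.
D_IN, D_CUT, D_OUT, LR = 8, 4, 1, 0.1
clients = [{"W": rng.normal(size=(D_IN, D_CUT)),
            "X": rng.normal(size=(16, D_IN)),
            "y": rng.normal(size=(16, D_OUT))} for _ in range(3)]
W_server = rng.normal(size=(D_CUT, D_OUT))

def splitfed_round(clients, W_server):
    server_updates, n_samples = [], []
    for c in clients:                        # runs in parallel in SplitFed
        smashed = c["X"] @ c["W"]            # client forward -> smashed data
        pred = smashed @ W_server            # server forward on this client's copy
        g_out = 2 * (pred - c["y"]) / len(c["y"])    # MSE gradient
        g_server = smashed.T @ g_out         # server-side gradient (per client)
        g_smashed = g_out @ W_server.T       # gradient sent back to the client
        c["W"] = c["W"] - LR * c["X"].T @ g_smashed  # client-side backprop
        server_updates.append(W_server - LR * g_server)
        n_samples.append(len(c["y"]))
    # FedAvg on the server side: sample-count-weighted average of updated copies
    w = np.array(n_samples) / sum(n_samples)
    W_server = sum(wi * ui for wi, ui in zip(w, server_updates))
    # Fed server: aggregate the client-side models and redistribute the result
    W_global = sum(wi * c["W"] for wi, c in zip(w, clients))
    for c in clients:
        c["W"] = W_global.copy()
    return W_server

W_server = splitfed_round(clients, W_server)
```

After the round, every client holds the same aggregated client-side model, mirroring the role of the Fed server described above.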
The SplitFed framework requires the server to have sufficient computing power for training multiple instances of the server-side model, one per participating client, in parallel. Such a requirement becomes overwhelming as the number of clients increases, especially when a large portion of a complex model is assigned to the server side. Considering the constrained computational resources available on typical edge nodes in IoT, the server in SplitFed might have to be hosted in a cloud data center; however, the communications between client devices and the remote cloud server may then become a bottleneck that deteriorates system performance.
The authors of [64] also proposed a variant of the SplitFed framework (referred to as SplitFedv2) that enables parallel training and aggregation only on the client side. In SplitFedv2, the client-side operation remains the same as in SplitFed. On the server side, the forward-backward propagation is performed sequentially over the clients' smashed data (that is, there is no aggregation for the server-side model). The server receives the smashed data from all participating clients synchronously, and the order in which the corresponding server-side operations are performed for the clients is chosen randomly. Compared to SplitFed, SplitFedv2 keeps the shared server-side model training feature of SL, which may achieve higher model accuracy. The random processing order across clients in SplitFedv2 also mitigates the catastrophic forgetting problem caused by the sequential training of basic SL.
A root cause of catastrophic forgetting in SL lies in the default alternating client training order: in each epoch, a client completes training of the client-side model together with the shared server-side model using its entire dataset before the next client in order starts training. The SplitFedv3 framework proposed in [65] attempts to address the catastrophic forgetting issue by using alternating mini-batch training instead of the regular alternating client training. In SplitFedv3, a client trains its model using one mini-batch of data samples and updates the shared server-side model, after which the next client in order takes over. The advantage of alternating mini-batch training over alternating client training is that it avoids sequential training over an entire client dataset, rendering the model training more randomized and thus mitigating the catastrophic forgetting issue over the learning process.
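The difference between the two training orders can be illustrated with a small scheduling sketch. The scheduler below is a hypothetical illustration of the alternating mini-batch idea, not the SplitFedv3 implementation; in particular, the per-pass client shuffling is an assumption.

```python
import random

def alternating_client_schedule(batches_per_client):
    """Baseline SL order: each client finishes its whole dataset before the next."""
    return [(c, b) for c, n in enumerate(batches_per_client) for b in range(n)]

def alternating_minibatch_schedule(batches_per_client, seed=0):
    """SplitFedv3-style order (a sketch): one mini-batch per client per turn,
    interleaving clients instead of exhausting one client's dataset at a time."""
    rng = random.Random(seed)
    remaining = {c: list(range(n)) for c, n in enumerate(batches_per_client)}
    schedule = []
    while any(remaining.values()):
        active = [c for c, bs in remaining.items() if bs]
        rng.shuffle(active)               # randomize the client order each pass
        for c in active:
            schedule.append((c, remaining[c].pop(0)))
    return schedule

seq = alternating_client_schedule([2, 2])       # [(0,0), (0,1), (1,0), (1,1)]
mix = alternating_minibatch_schedule([2, 2])    # same (client, batch) pairs, interleaved
```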
The Cluster-based Parallel SL (CPSL) framework proposed in [66] enables federated parallel training in SL following a "first-parallel-then-sequential" idea. In CPSL, client devices are grouped into multiple clusters and each training round is divided into two stages: parallel intra-cluster training followed by sequential inter-cluster training. All clients in the same cluster perform parallel training in collaboration with the server in a way similar to SplitFedv2. After a round of intra-cluster training is completed, all the client-side models in the cluster are uploaded to the server for aggregation and update. The updated model is then used to initialize the client models in the next cluster, which starts its own intra-cluster training. That is, inter-cluster training in CPSL works in the same way as in the basic multi-client SL framework, except that each "client" now consists of a cluster of user devices.
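The "first-parallel-then-sequential" structure of a CPSL training round can be sketched as follows, with a single weight vector standing in for the client-side model (the collaboration with the server-side model is abstracted away). The cluster sizes, linear model, and plain averaging are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cpsl_round(clusters, w, lr=0.1):
    """clusters: list of clusters, each a list of (X, y) client datasets.
    w: current client-side model (here a single weight vector)."""
    for cluster in clusters:                 # sequential across clusters
        local_models = []
        for X, y in cluster:                 # parallel within a cluster
            g = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
            local_models.append(w - lr * g)  # each client trains from the same init
        w = np.mean(local_models, axis=0)    # server aggregates the cluster's models
        # the aggregated model then initializes the next cluster's clients
    return w

clusters = [[(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(2)]
            for _ in range(2)]               # 2 clusters of 2 clients each
w = cpsl_round(clusters, np.zeros(3))
```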
To address the challenges that highly diverse IoT devices with heterogeneous resources and data distributions pose to collaborative learning, the Hybrid Split and Federated Learning (HSFL) scheme proposed in [67] combines the two learning methods in a single framework from the perspective of client selection and scheduling. The HSFL framework organizes the participating clients into two groups: one group performs federated learning and the other performs split learning. In each training round, the server chooses the clients for each group based on the clients' current computing and networking status, with the objective of minimizing the total energy consumption on user devices while achieving satisfactory learning performance. The HSFL scheme offers the flexibility to handle the dynamic and diverse system environments in which a hybrid SL-FL framework is deployed. On the other hand, the control functions for client selection and scheduling must be performed by the server in each training round, introducing extra complexity and additional computation overheads on the server.
The aforementioned hybrid SL-FL frameworks (the SplitFed variants, CPSL, and HSFL) all have a centralized server architecture that deploys the full server-side model training (either parallel or sequential) on a single edge node, which does not scale as the number of clients increases. The parallelization of server-side model training may instead be realized with a decentralized architecture in which multiple servers are hosted on different edge nodes, avoiding the potential bottleneck of a single server performing the parallel training of multiple server-side models.
In the Federated SL (FedSL) framework proposed in [68], the number of servers equals the number of clients. Instead of training all server-side models on a single server, the framework trains the server-side model corresponding to each client on an individual server. That is, all client-server pairs train their respective client- and server-side models in parallel. When all data on each client have been used to update the model parameters (i.e., at the end of each training epoch), the server-side models are aggregated by a Fed server and updated on each server. In this framework, each server works with only a single client, decoupling its performance from that of the other clients. Moreover, the communication pattern changes to point-to-point data transfer between each client-server pair, which avoids the potential network congestion caused by data transmissions from multiple user devices to a single edge node in frameworks with a central server.
In [69], the authors conducted a comparative performance evaluation of the FedSL framework against the parallel SL framework proposed in [62]. The obtained experiment results show that the training time of FedSL is consistently shorter than that of parallel SL, which indicates that using multiple servers can better unleash the benefit of parallel training on both the client and server sides. The FedSL architecture can be especially advantageous as the number of clients increases. It was also found that FedSL achieves a similar level of model accuracy to parallel SL if each client has enough data, but parallel SL converges to a better model if the dataset at each client is small.
Although the one-to-one client-server pairing scheme proposed in [68] enables federated parallel training in an SL framework, it limits the flexibility of deploying training jobs on servers. In a typical edge computing-based IoT system, the edge nodes that may host SL servers have highly diverse amounts of computation and communication resources, demanding flexible resource allocation for SL server deployment. Toward this direction, a generalized SplitFed learning framework (SFLG) was proposed in [70] that supports a varying number of server-side models deployed on different edge nodes; one can therefore flexibly choose the number of edge nodes for hosting the servers based on the available edge computing resources.
The architecture of the SFLG framework is illustrated in Figure 5. In SFLG, the clients are divided into multiple groups and the clients in each group share one server-side model. The SFLG framework essentially combines the SplitFed, SplitFedv2/v3, and FedSL schemes in a hierarchical structure. Training among the clients in the same group works as in SplitFedv2, while the server-side models of different groups are trained in parallel and aggregated as in SplitFed. In addition, the server-side model training of different groups may be deployed either on a single server (as in SplitFed) or on multiple servers (as in FedSL). SFLG is thus a generalized form of SplitFed (one client per group and all groups on a single server), SplitFedv2/v3 (all clients in one group served by a single server), and FedSL (one client per group and one group per server). As such, SFLG offers a flexible architecture for combining federated and split learning, with multiple possible configurations that can be chosen based on the learning objective as well as the available computing and networking resources in an IoT environment.
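The reductions of SFLG to the earlier frameworks can be captured in a few lines. The function below is purely illustrative (its name and arguments are not from any published API); it maps an SFLG configuration, described by the numbers of clients, groups, and servers, to the special case it reduces to.

```python
def sflg_special_case(n_clients, n_groups, n_servers):
    """Return which earlier framework a given SFLG configuration reduces to."""
    if n_groups == n_clients and n_servers == 1:
        return "SplitFed"        # one client per group, all groups on one server
    if n_groups == 1 and n_servers == 1:
        return "SplitFedv2/v3"   # all clients in one group, single server
    if n_groups == n_clients and n_servers == n_groups:
        return "FedSL"           # one client per group, one group per server
    return "general SFLG"        # any other grouping/server placement

case = sflg_special_case(4, 4, 1)   # "SplitFed"
```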
The current representative proposals for hybrid FL-SL frameworks can be categorized along two dimensions: (i) the strategy for adopting federated parallel training, either on both the client and server sides or only on the client side; and (ii) the architecture for server deployment, either centralized on a single edge node or distributed across multiple edge nodes.
Table 4 summarizes the key features of the reviewed hybrid SL-FL frameworks along these two dimensions. Enabling federated parallel training on both the client and server sides may fully leverage the advantages of federated learning to enhance the scalability of split learning, allowing a large number of users to be involved in collaborative model training. On the other hand, adopting parallel training only on the client side allows hybrid SL-FL frameworks to inherit split learning's advantage of fast convergence. Distributed deployment of the SL server on multiple edge nodes may improve resource utilization and enhance the deployment flexibility of hybrid SL-FL frameworks. However, distributed server deployment requires more sophisticated mechanisms for resource allocation and inter-server coordination and may therefore introduce additional computation complexity and communication overheads compared to centralized server deployment.
A comparative study on the learning performance and training costs of SL, FL, and hybrid SL-FL frameworks was conducted in [70] for various IoT settings. The learning performance of the frameworks was evaluated under iid, imbalanced, and non-iid data distributions with varying numbers of clients. The obtained experiment results indicate that under iid and imbalanced data distributions SL typically converges faster than FL but suffers from unstable convergence (shown by frequent spikes in the learning curve). It was also found that SL performance is affected by the number of clients, which implies that FL may outperform SL even under iid data in an IoT setting with a large number of clients. Experiment results on non-iid datasets demonstrate that SL is much more sensitive to non-iid data than FL; SL fails to learn in some cases with highly skewed data distributions, whereas the FL model still converges to a certain accuracy level. The learning performance of SplitFed (tested as a representative hybrid SL-FL framework) was found to be close to that of FL under all types of data distributions, which verifies the effectiveness of introducing federation into split learning for enhancing SL performance under non-iid data.
The SL, FL, and SplitFed frameworks were implemented on a Raspberry Pi platform in [70] to evaluate their computation and communication overheads. The results obtained in this experiment setting (training a small model on a limited number of clients) indicate that SL always has lower computation overheads (including CPU/memory usage and power consumption) than FL but suffers longer training time and higher communication costs. It was found that SplitFed greatly reduces the training time while keeping the same level of overheads as SL, demonstrating the benefit of combining SL with FL: shorter training time than SL together with lower computation overheads on client devices than FL.
4.2. Model Decomposition in Hybrid SL-FL Frameworks
A key aspect of SL and its combination with FL is decomposing an ML model and assigning the resulting sub-models to clients and server(s). Proposals of hybrid SL-FL frameworks often focus on the collaboration among sub-models assuming a given cut layer; however, how the model is split has a significant impact on learning performance and training costs in split learning and thus also deserves thorough investigation.
Some recent research on model decomposition considers the architectural features of specific ML models to determine the appropriate split. For example, in [71] the authors observed that the Vision Transformer, a recently developed deep learning model for image processing, comprises three key components: a head for extracting input features, a transformer body for modeling the dependencies between features, and a tail for mapping features to task-specific outputs. Among these components, the transformer body is computing-intensive and shared by various tasks. Based on this observation, the authors proposed a Federated Split Task-Agnostic (FESTA) framework in which multiple clients, each holding its own head and tail parts of the model, share a server hosting the transformer body. The server also aggregates the weights of the local heads and tails from the clients via FedAvg to update the global head and tail, which are then distributed to all clients.
Another model-specific decomposition approach is employed by the FedBERT framework proposed in [72]. FedBERT is a federated SL framework for pre-training the BERT model for natural language processing applications. The BERT model comprises three parts: an embedding layer, a transformer layer, and a head layer. The transformer layer is computing-intensive and therefore should be trained on a server with sufficient computational resources, while the training of the embedding and head layers can be deployed on resource-constrained client devices. In the FedBERT framework, each client performs forward propagation from its embedding layer to the transformer layer on the server. The server receives the forward propagation outputs from all the clients and generates the output of the transformer layer, which is then sent back to each client as the input of its head layer. The client computes the final output of the head layer and calculates the gradients to start backpropagation along the reverse path. The authors proposed two training strategies for FedBERT: parallel training and sequential training. In parallel training, the server trains one transformer layer for each client in parallel and aggregates the transformers of all clients into a global transformer at the end of each training epoch. In sequential training, the server maintains a single transformer layer that is trained by the clients one by one using their own datasets.
The aforementioned methods for model decomposition leverage the design structures of specific models and might not be applicable to general models. These methods also assume a static assignment of training workload between the clients and server, i.e., they split the model at fixed cut layer(s). However, the amounts of computing and networking resources available on IoT devices may vary over time, so a static split between the client- and server-side models lacks the flexibility to adapt to a dynamic IoT environment. As an attempt to address this challenge, FedAdapt was proposed in [73] as a hybrid FL-SL framework that can adaptively determine which portion of a model to offload to a server based on the computational resources of the client devices and the network bandwidth between the clients and the server.
FedAdapt employs reinforcement learning (RL) to dynamically determine an offloading point (OP) for each client in each round of training; all the layers of the model behind the OP are offloaded to a server. To make FedAdapt scalable to a large number of user devices, a clustering-based method divides the clients into multiple groups according to their computational resources, with all client devices in the same group assumed to be homogeneous in computing capability. The RL agent determines the OP per group, i.e., all clients in the same group split their models in the same way. The maximal local training time of the clients in a group is used as the input state of the RL agent. The output action of the RL agent for each group is a value in [0, 1], the percentage of the model workload that stays on the client, which can be used directly to determine the OP (the cut layer for splitting the model). The RL agent uses the average time of each training round as the reward function to minimize the average training time across all devices. The experiment results obtained in [73] indicate that FedAdapt may achieve a substantial reduction in average training time and can adapt to changes in network bandwidth as well as to heterogeneity in IoT devices. On the other hand, FedAdapt introduces extra complexity and overheads due to the RL agent and the clustering algorithm.
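The mapping from the RL agent's action in [0, 1] to a cut layer can be sketched as below. The per-layer workload model (FLOPs per layer) and the rounding rule are illustrative assumptions rather than FedAdapt's actual implementation.

```python
def offloading_point(action, layer_flops):
    """Return the index of the last layer kept on the client, or -1 if every
    layer is offloaded. `action` is the fraction of workload staying local."""
    assert 0.0 <= action <= 1.0
    total = sum(layer_flops)
    budget = action * total          # workload allowed to remain on the client
    cum, op = 0.0, -1
    for i, f in enumerate(layer_flops):
        if cum + f > budget:         # adding this layer would exceed the budget
            break
        cum += f
        op = i
    return op

flops = [10, 30, 40, 20]             # per-layer workload of a toy 4-layer model
op = offloading_point(0.5, flops)    # keep layers 0-1 locally, offload the rest
```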
4.3. Reducing Overheads of Hybrid SL-FL Frameworks
The aforementioned works on combining SL with FL mainly focus on collaboration among workers to enable federated parallel training in split learning, without a major change to the communication aspect of the SL framework. Therefore, the proposed hybrid SL-FL frameworks have basically the same level of communication overheads as regular SL (or even higher, given the additional data transfers for aggregating the client- and/or server-side models). Improving the communication efficiency of hybrid FL-SL frameworks in an IoT environment has thus become an important research topic.
One attempt to reduce the communication overheads of hybrid SL-FL frameworks is the Multi-Head Split Learning (MHSL) scheme proposed in [74]. MHSL essentially allows the clients in the SplitFedv2 framework to perform model training in parallel without synchronization through a federation server, thus removing the communication and computation overheads associated with client-side model aggregation. Empirical study results showed that the performance degradation of MHSL compared to SplitFedv2 varies with the dataset (1–2% on MNIST and about 10% on CIFAR-10). In addition, the experiment results in [74] were obtained using only iid data distributions. Therefore, whether acceptable performance can be guaranteed in SplitFedv2 without client-side synchronization still needs more thorough investigation.
In [75], the authors proposed a local-loss-based scheme called LocalSplitFed to improve communication efficiency and shorten the training latency of hybrid SL-FL. The proposed framework has the same architecture as SplitFed: parallel training and model aggregation on both the client and server sides. Unlike SplitFed, it introduces an auxiliary network on the client side, an extra layer that takes the cut-layer outputs as inputs to compute a local loss function. The client then performs backpropagation to update the client-side model using the gradients derived from this local loss. As a result, the framework only transmits activations from clients to the server and sends no gradients from the server back to the clients, roughly halving the communication overheads. In addition, the client-side model training does not need to wait for gradients to return from the server, which reduces the training latency. This may be particularly beneficial in an IoT environment with long communication delays due to limited bandwidth. The authors proved convergence for both the client-side and server-side models and reported experiment results showing that the proposed framework outperforms SplitFed and FL in terms of both communication overheads and training time.
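A minimal sketch of a LocalSplitFed-style client step is given below: an auxiliary linear head on top of the cut layer supplies a local loss, so the client updates its model without waiting for server gradients and uploads only the smashed data. The layer sizes, MSE loss, and single-layer auxiliary head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

D_IN, D_CUT, LR = 8, 4, 0.05
W_client = rng.normal(size=(D_IN, D_CUT))    # client-side model
W_aux = rng.normal(size=(D_CUT, 1))          # auxiliary head (client side only)

def local_client_step(X, y, W_client, W_aux):
    smashed = X @ W_client                   # cut-layer activations
    pred = smashed @ W_aux                   # auxiliary head output
    g_out = 2 * (pred - y) / len(y)          # gradient of the local MSE loss
    W_aux_new = W_aux - LR * smashed.T @ g_out
    W_client_new = W_client - LR * X.T @ (g_out @ W_aux.T)
    return smashed, W_client_new, W_aux_new  # only `smashed` goes to the server

X, y = rng.normal(size=(16, D_IN)), rng.normal(size=(16, 1))
smashed, W_client, W_aux = local_client_step(X, y, W_client, W_aux)
```

Note that nothing returns from the server in this step; the server trains its own sub-model on the received activations independently, which is the source of the latency saving described above.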
The techniques proposed in [74,75] focus on reducing the communication overheads of transmitting gradients from the server back to the clients while keeping the transmission of smashed data from the clients to the server unchanged. However, the wireless communication systems in an IoT environment often have more constrained bandwidth on the uplinks (from client devices to the server hosted on an edge node) than on the downlinks (from the server to client devices). Therefore, reducing the amount of smashed data transmitted from clients to the server is also critical for improving the communication efficiency of hybrid SL-FL frameworks, especially in an IoT environment.
The FedLite scheme proposed in [76] aims to reduce the uplink communication overheads in hybrid SL-FL while maintaining the accuracy of the learned model. FedLite is based on the observation that, given a mini-batch of data, a client does not need to communicate per-example activations if the activations of different examples in the mini-batch are sufficiently similar. Therefore, FedLite clusters the activations of each mini-batch of training data using an algorithm based on product quantization [77] and communicates only the cluster centroids to the server. Such activation clustering is equivalent to adding a vector quantization layer between the client- and server-side models, which may cause a drop in model accuracy due to the noisy gradients that clients receive back from the server. To mitigate this issue, FedLite employs a gradient correction scheme that approximates the gradient by its first-order Taylor series expansion. The empirical evaluation results show that FedLite achieves a high compression ratio with minimal accuracy loss. Although SplitFedv2 was assumed as the learning framework in [76], the FedLite scheme applies to any hybrid SL-FL framework that can benefit from a vector quantization layer.
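The uplink-compression idea can be sketched as follows, with a few rounds of plain k-means standing in for the product-quantization-based clustering used in the paper; only the centroids and per-example indices would be sent uplink. All sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize_activations(acts, n_centroids=4, iters=10):
    """Cluster a mini-batch of activations; return centroids and assignments."""
    idx = rng.integers(0, n_centroids, size=len(acts))  # random initial clusters
    for _ in range(iters):
        cents = np.array([acts[idx == k].mean(axis=0) if np.any(idx == k)
                          else acts[rng.integers(len(acts))]  # reseed empty cluster
                          for k in range(n_centroids)])
        idx = np.argmin(((acts[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)
    return cents, idx

acts = rng.normal(size=(32, 8))          # 32 examples, 8-dim cut-layer outputs
cents, idx = quantize_activations(acts)
# uplink payload: centroids + indices instead of 32 full activation vectors
ratio = acts.size / (cents.size + len(idx))
```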
Binarization, as an extreme quantization method, has also been explored for reducing overheads in SL. Binarized Neural Networks (BNNs) are neural networks whose weights and activations are constrained to either −1 or +1 [78]. Although BNNs consume fewer resources for computation and communication, they might not achieve accuracy as good as their full-precision counterparts. In [79], the authors proposed binarizing the client-side models during forward propagation to speed up computation on the clients and reduce the communication overheads of transmitting activations, while the server-side model is kept in high-precision computation to retain the accuracy of the global model. In the proposed binarized SL (B-SL) framework, the gradients are passed back to the clients in full precision during backpropagation, and the client-side models are updated using a straight-through estimator [80]. Experiment results demonstrated that B-SL can reduce overheads while maintaining model accuracy; however, the effectiveness of B-SL for overhead reduction and its loss in model accuracy are directly affected by how the model is split and still need to be evaluated thoroughly.
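The binarized forward pass and straight-through estimator can be sketched in a few lines. The clipping rule (passing gradients through only where |w| ≤ 1) is the common STE formulation from the BNN literature and is an assumption about, not a quotation of, the B-SL implementation.

```python
import numpy as np

def binarize(w):
    """Forward pass: replace real-valued weights by their signs (+1 / -1)."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_grad(w_real, grad_binary):
    """Backward pass: pass the full-precision gradient straight through the
    sign function, zeroed where |w| > 1 (the usual STE clipping)."""
    return grad_binary * (np.abs(w_real) <= 1.0)

w_real = np.array([-1.5, -0.3, 0.0, 0.7, 2.0])
w_bin = binarize(w_real)                       # what the client computes with
g = ste_grad(w_real, np.ones_like(w_real))     # update signal for w_real
```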
The existing approaches to combining SL with FL typically employ parallel client-side model training and decouple the training of the client- and server-side models, which potentially suffer from server-client update imbalance and client model detachment [81]. Server-client update imbalance occurs because the server-side model is trained with the aggregated smashed data from multiple clients while each client-side model is trained only on the local dataset of an individual client. Although this imbalance can be mitigated by client-side model aggregation, it may become an issue with a large number of clients, limiting the scalability of hybrid SL-FL frameworks. Decoupling the client- and server-side model training, for example via the local-loss-based client-side training in LocalSplitFed, prevents the client-side model update from utilizing the full capacity of a deep neural network and may therefore cause the client-side model to become detached from the server-side model.
The LocFedMix-SL scheme was proposed in [81] to tackle the problems of imbalance between the client- and server-side updates and detachment of the client and server models. The operations of LocFedMix-SL partly coincide with those of LocalSplitFed [75] but include some additional mechanisms. The key new elements are local regularization of the sub-model at each client and augmentation of the smashed data at the server. The regularized local gradients maximize the mutual information between the raw and smashed data while avoiding extracting too many of the original features on the client side. To address the imbalanced update problem, the server combinatorially superposes the smashed data uploaded from different clients to produce new, augmented smashed data for server-side forward propagation. In addition, considering the crucial role of the global gradients provided by the server to the clients in avoiding model detachment, LocFedMix-SL retains global gradient backpropagation.
Although the LocFedMix-SL scheme offers an approach to addressing the update imbalance and model detachment issues associated with parallel SL, it introduces extra computation overheads (e.g., for smashed-data mix-up and local gradient regularization) as well as communication overheads. On the other hand, reducing communication overhead by removing client model synchronization, as proposed in [74], may sacrifice learning performance due to the loss of information sharing among clients. To achieve communication- and computation-efficient federated SL without sacrificing learning performance, the authors of [82] proposed a framework for split learning with gradient averaging and learning rate splitting (SGLR). In the SGLR framework, the update imbalance issue is addressed by a SplitLr algorithm that adjusts the learning rate of the server-side model according to the concatenated batch size at the server. To solve the model detachment problem, SGLR employs a SplitAvg algorithm at the server, which computes the weighted average of the gradients for a subset of clients and sends the averaged gradients back to those clients for their client-side model updates. The experiment results reported in [82] show that SGLR achieves model accuracy comparable to the baseline FL/SL and SplitFed algorithms while incurring less computation overhead (no smashed-data mix-up) and less communication overhead (no communication for client model aggregation) than the LocFedMix-SL framework.
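The two SGLR ingredients can be sketched as below. The linear scaling of the server-side learning rate with the concatenated batch size and the uniform gradient weighting are illustrative assumptions; the paper's SplitLr and SplitAvg details may differ.

```python
import numpy as np

rng = np.random.default_rng(4)

def split_lr(client_lr, client_batch, server_batch):
    """Scale the server-side learning rate down as the concatenated batch at
    the server grows, to rebalance client- and server-side update magnitudes."""
    return client_lr * client_batch / server_batch

def split_avg(cut_grads, subset):
    """Average the cut-layer gradients of a subset of clients; each client in
    the subset receives the same averaged gradient for its local update."""
    avg = np.mean([cut_grads[i] for i in subset], axis=0)
    return {i: avg for i in subset}

grads = [rng.normal(size=4) for _ in range(3)]     # per-client cut-layer gradients
shared = split_avg(grads, subset=[0, 2])           # clients 0 and 2 share gradients
lr_server = split_lr(0.1, client_batch=32, server_batch=96)  # 3 clients' batches
```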
4.4. Hybrid SL-FL Frameworks with Vertical and Sequential Configurations
Although the majority of research on FL, SL, and their combination assumes a horizontal configuration, vertical FL has recently started receiving more attention. Split learning offers a commonly used approach to vertical federated learning, especially for neural network models.
A vertical SL architecture proposed in [53] is illustrated in Figure 6. In this architecture, each client k holds a segment of features X_k of the same set of data samples X. An ML model F (e.g., a deep neural network) is split into the client- and server-side models, and the client-side model is further vertically split across the clients. That is, the partial model f_k on client k takes X_k as its input. The outputs H_k = f_k(X_k) (k = 1, …, K) of all K clients are transmitted to the server, which aggregates all the client outputs to form an input H to the server-side model F_s, completes the forward propagation, and calculates the loss function and gradients. During backpropagation, the gradients with respect to each H_k are sent back to client k for updating the corresponding partial model f_k. In this architecture, the partial models on all clients are trained in parallel and aggregated together by the server, which handles the training of the rest of the model; therefore, it essentially combines vertical SL and FL in the same architecture.
There are multiple mechanisms available for aggregating client outputs to form the input to the server-side model in a vertical SL-FL framework, for example, concatenation, element-wise average, element-wise sum, and element-wise maximum. Concatenation is a simple approach that comes closest to training a single model on all the input features. However, this method requires all clients to have their intermediate outputs ready in each iteration, making it less robust to stragglers, which are common in IoT where client devices have diverse computing and networking capabilities. The element-wise sum and element-wise average methods are similar to each other. Both require the partial models on all clients to have compatible forms so that their outputs can be combined. On the other hand, these methods allow the use of a secure aggregation protocol to enhance the privacy and security of the learning framework. Element-wise maximum picks the activation with the maximum value for each neuron and discards the rest, which also requires the partial models on all clients to produce compatible outputs.
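A minimal sketch of these aggregation options, with illustrative client counts and activation shapes:

```python
import numpy as np

# Hypothetical cut-layer activations from K=3 clients, each of shape
# (batch_size, cut_dim); the sizes here are illustrative only.
acts = [np.random.randn(4, 8) for _ in range(3)]

def aggregate(outputs, mode="concat"):
    """Combine client cut-layer outputs into the server-side input."""
    if mode == "concat":       # closest to a single joint model
        return np.concatenate(outputs, axis=1)  # (4, 24)
    if mode == "sum":          # requires compatible output shapes
        return np.sum(outputs, axis=0)          # (4, 8)
    if mode == "avg":          # also compatible with secure aggregation
        return np.mean(outputs, axis=0)         # (4, 8)
    if mode == "max":          # element-wise maximum per neuron
        return np.maximum.reduce(outputs)       # (4, 8)
    raise ValueError(mode)

assert aggregate(acts, "concat").shape == (4, 24)
assert aggregate(acts, "avg").shape == (4, 8)
```

Note that only concatenation changes the server-side input width, which is why the other three modes require the client-side partial models to share a compatible output dimension.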
The learning performance achieved by the vertical SL-FL framework using various aggregation mechanisms was evaluated in [
53]. The obtained results indicate no significant difference in the performance of these mechanisms, all of which come close to that of centralized model training. The maximum-pooling method appears to have the best overall performance, while the average-pooling method is also attractive due to its support for secure aggregation protocols.
Since different attribute values of the same data points are distributed across the clients in the vertical SL-FL configuration, it is critical to identify and link the records on different clients that belong to the same data samples. In SL/FL frameworks, such data alignment must be performed without exchanging either raw data or metadata among clients. Private Set Intersection (PSI), a multi-party computation cryptography technique, offers an approach to addressing this challenge. PSI allows two parties, each holding a set of elements, to compute the intersection of these sets without revealing anything to each other except the elements in the intersection [
83].
The PyVertical framework proposed in [
84] employs the PSI technique for data alignment to support vertical FL using split neural networks. PyVertical uses a Diffie–Hellman key exchange-based PSI implementation with Bloom filter compression to reduce the communication complexity [
85]. This PSI protocol works as follows. First, the server runs the PSI protocol independently with each data owner, revealing the intersection of IDs between the server and that owner. The server then computes the global intersection from all the pairwise intersections and communicates it to the data owners. In this setting, the data owners do not communicate with and are not aware of each other, which facilitates client privacy protection.
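The two-phase alignment flow can be illustrated as below; plain Python set intersection stands in for the actual Diffie–Hellman-based cryptographic protocol, so this sketch conveys only the message flow, not the privacy guarantees:

```python
def pairwise_intersection(server_ids, owner_ids):
    # Phase 1: the server runs PSI independently with one data owner;
    # only the intersection of IDs is revealed to both sides.
    return set(server_ids) & set(owner_ids)

def global_intersection(server_ids, owners):
    # Phase 2: the server intersects all pairwise results to obtain the
    # global intersection, which it broadcasts to the data owners.
    result = set(server_ids)
    for owner_ids in owners:
        result &= pairwise_intersection(server_ids, owner_ids)
    return result

# Illustrative sample IDs; owners never see each other's sets.
owner_a = ["id1", "id2", "id3"]
owner_b = ["id2", "id3", "id4"]
server = ["id1", "id2", "id3", "id4"]
print(sorted(global_intersection(server, [owner_a, owner_b])))  # ['id2', 'id3']
```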
Typical vertical FL algorithms assume that data features are split among multiple data owners while label information is held by a single entity (label owner). However, in some practical application scenarios, there might be multiple label owners holding the labels of different samples. For example, multiple specialist hospitals (heart disease hospital, lung disease hospital, cancer hospital, etc.) have different feature data for the same set of patients whose COVID test results (the labels) are available at different testing centers. The Multi-Vertical Federated Learning (Multi-VFL) framework proposed in [
86] aims to enable collaborative learning in such scenarios with multiple data owners and label owners.
The Multi-VFL framework makes use of split learning and adaptive federated optimization algorithms. In this framework, each data owner has a unique part of the client-side model and all label owners hold copies of the same server-side model. Data owners perform forward propagation on their respective partial models up to the cut layer and then send their activations to the label owners. The label owners concatenate the activations coming from the different data owners, complete the forward propagation, compute the loss, and perform backpropagation. They then send the corresponding gradients back to the data owners, which use the respective gradients to complete their backpropagation. The model parameters on the label owners are aggregated by a Fed server using one of the FL algorithms such as FedAvg [
10], FedAdam [
87], FedYogi [
87], etc. Adaptive federated optimization is employed in Multi-VFL to mitigate the model shift problem (local models drifting away from the optimal global model) caused by the non-IID data from different owners. PSI is also applied in this framework to identify the intersection of the datasets vertically split across data owners.
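The aggregation of the label owners' server-side models can be sketched with FedAvg-style sample-weighted averaging; the parameter names, shapes, and weights below are illustrative assumptions:

```python
import numpy as np

# Hypothetical server-side model parameters held by three label owners,
# each a dict of layer-name -> weight array.
owners = [
    {"fc": np.full((4, 2), 1.0)},
    {"fc": np.full((4, 2), 2.0)},
    {"fc": np.full((4, 2), 6.0)},
]
n_samples = [10, 30, 60]  # label counts per owner, used as FedAvg weights

def fedavg(models, weights):
    """Sample-weighted average of the label owners' server-side models."""
    total = sum(weights)
    return {
        name: sum(w * m[name] for w, m in zip(weights, models)) / total
        for name in models[0]
    }

avg = fedavg(owners, n_samples)
# (1*10 + 2*30 + 6*60) / 100 = 4.3
assert np.allclose(avg["fc"], 4.3)
```

FedAdam or FedYogi would replace this plain weighted average with an adaptive server-side optimizer step, which is what Multi-VFL relies on to counter model shift.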
In addition to horizontal and vertical data partitions, sequential data partition also occurs in some ML-based IoT applications. The one-dimensional convolutional neural network (1D CNN) is an ML model that can be used for time-series classification, and the application of SL to a 1D CNN with a single client was investigated in [
54]. The most common ML models to train on sequential data are Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) [
88]. However, existing SL approaches developed for feed-forward neural networks are not directly applicable to RNNs. To train on sequentially partitioned data, RNNs should be split in a way that preserves latent dependencies between the segments of sequential data available on different clients.
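The latent dependency between segments is carried by the recurrent hidden state: each client can process its own segment and hand the resulting hidden state to the client holding the next segment. A minimal sketch, using a single Elman-style recurrent cell shared by all clients and illustrative shapes and data:

```python
import numpy as np

def rnn_cell(x_t, h, Wx, Wh):
    # One Elman-style recurrent step: h' = tanh(x_t Wx + h Wh)
    return np.tanh(x_t @ Wx + h @ Wh)

rng = np.random.default_rng(1)
Wx, Wh = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))

# Each client holds one segment of the same time sequence (illustrative).
segments = [rng.normal(size=(5, 3)), rng.normal(size=(7, 3))]

h = np.zeros(4)          # initial hidden state
for seg in segments:     # clients process their segments in order, passing
    for x_t in seg:      # the hidden state on so dependencies are preserved
        h = rnn_cell(x_t, h, Wx, Wh)

assert h.shape == (4,)
```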
In order to address this issue, a framework that integrates SL and FL for training models on sequential data was proposed in [
89]. In this framework, an RNN is split into sub-networks, and each sub-network is trained on one client holding a single segment of the multi-segment training sequence. The sub-networks communicate with each other through their hidden states in order to preserve the temporal relations between the data segments stored on different clients. Collaborative model training is performed across the clients following the SL pattern, and the sub-networks trained on the clients are aggregated by the server using an FL algorithm. An approach to splitting the LSTM was proposed in [90], which considers another type of sequential data partition in which the entire time-sequence input is stored on each client. The authors proposed to split an LSTM at the c-th layer instead of splitting the network based on the input steps. Despite the recently reported progress on split RNNs and LSTMs, hybrid SL-FL with sequential data partition is still in its infancy and needs more research, as we will discuss in
Section 6.