Introduction

Thanks to the extensive application of Internet technologies, the past decade has witnessed the explosive spread of e-commerce. As a new business form spawned by e-commerce, the O2O (online-to-offline) business has emerged in various fields owing to the maturity of mobile payment [16, 19]. One popular type of O2O business is the online food delivery (OFD) service, which enables customers to order food with only a few taps on a smartphone. Such convenience has attracted an increasing number of customers and created a vast commercial opportunity. In the United States, gross online food sales are expected to grow at a compound annual rate of 16% from 2017, according to Morgan Stanley Research [24]. China leads the way in market share and has achieved great success [16, 17]. The number of online food orderers in China exceeded 450 million in 2019, an increase of 12.7% over 2018 [22]. Moreover, the transaction volume of the OFD market reached 642.3 billion yuan in 2019, nearly five times that of 2015 [21]. The OFD market is expected to keep growing in the future.

There are two categories of OFD operations [37]. The first is the restaurant-to-consumer delivery operation, e.g., McDonald's, Kentucky Fried Chicken, Domino's Pizza, and Xibei, where OFD services are provided by the restaurants themselves or by third-party logistics. The second is the platform-to-consumer delivery operation, e.g., Uber Eats, Meituan, Ele.me, Just Eat, and Swiggy, where the OFD platforms integrate a variety of partner restaurants and offer them delivery services. Taking Meituan, a major OFD platform in China, as an example, an entire OFD process is illustrated in Fig. 1. Customers can search for their favorite restaurants, choose from full menus, place an order, and pay for the food on the OFD platform. Once receiving a food ordering request, the OFD platform informs the restaurant to prepare the food and appoints delivery persons, aka riders, to pick up the food from the restaurant and deliver it to the customer. The benefits of the OFD platforms are twofold. First, the OFD platforms reduce the burden of management and marketing [17] so that the partner restaurants can invest more time and money in food quality. In other words, restaurants can share the professional delivery resources [18] of the OFD platforms to save the cost of employing and managing riders; furthermore, the OFD platforms attract more customers for the restaurants even if they are not situated in popular dining regions. Second, on the other side of the market, the OFD platforms improve the quality of modern life. Customers can enjoy delicious food of their own choice from a wide array of restaurants and menus. Besides, the OFD platforms bring convenience to people who are unwilling to cook or have no time to do so. In consequence, more and more people are becoming accustomed to ordering food online for instant or pre-booked delivery [17], which promotes the prosperity of the OFD market.

Fig. 1 Online food delivery service of the Meituan platform

However, the explosive growth of the OFD market also brings huge challenges to OFD platforms. Firstly, the OFD platforms are faced with high dynamism from continuous order requests received over time, especially at lunch or dinner time. Secondly, the number of orders to be processed is massive. According to statistics from Meituan in Nov. 2019, the average number of orders received per day during lunch time exceeds 6 million. Thirdly, the orders need to be delivered at or ahead of the estimated time of arrival (ETA) committed by the platform. Fourthly, the number of riders is limited, and there are shortages of rider resources during peak meal times or in bad weather. Due to the above difficulties, the optimization problem in OFD is regarded as the ultimate challenge in last-mile logistics [28]. How to increase the punctuality of delivered orders with limited resources becomes a prime issue for OFD companies. Efficient and effective optimization technologies are in demand to solve this large-scale, highly dynamic, and multi-constraint optimization problem so that OFD platforms can provide high-quality services.

In this paper, we investigate the dispatching problem arising in the real food delivery scenario of the Meituan platform and address an OFD problem (OFDP). To deal with high dynamism and urgency, the OFDP is transformed into a static optimization problem within a time window. A matching algorithm with an adaptive tie-breaking strategy (MAATS) is proposed to effectively solve the OFDP by determining the matching relationship between newly arrived orders and riders. The MAATS is mainly composed of a best-matching heuristic, multiple tie-breaking operators, and a machine learning (ML) model. The best-matching heuristic dispatches the orders to the best riders to ensure solution quality. However, the best-matching heuristic may result in ties where more than one order is matched with a single rider. To break such ties effectively, multiple tie-breaking operators are proposed based on different optimization objectives to adapt to different scenarios. An adaptive tie-breaking strategy is designed to dynamically determine the best tie-breaking operator to use in the current scenario, implemented with two ML techniques: an eXtreme Gradient Boosting (XGBoost) model and a modified Deep Factorization Machine (mDeepFM). Problem-specific features are extracted to further improve the prediction precision of the ML models. Offline simulation experiments on real historical data from the Meituan platform are carried out to validate the effectiveness of MAATS for solving the OFDP. Furthermore, we conduct A/B tests on the real online scheduling system to evaluate the practical performance of MAATS. The results of the offline simulations and online A/B tests show that the proposed MAATS is capable of improving delivery efficiency and customer satisfaction.

The remainder of this paper is organized as follows. “Literature review” reviews the literature on the OFDP and on problems similar to the OFDP. In “Problem description”, the OFDP is described in detail. “Methodology” introduces the proposed MAATS. The experimental results of offline simulations and online A/B tests are presented and analyzed in “Experiment”. Finally, “Conclusions” ends this paper with conclusions and future work.

Literature review

The OFDP is an emerging field of research. Although there are publications [4, 39] studying optimization problems in food or meal delivery services, these problems do not originate from the background of OFD. In this section, we focus on works that consider the food delivery problem arising from the background of OFD. Besides, in the literature, the OFDP is most similar to the extensively studied dynamic pickup and delivery problem (DPDP). Therefore, research on DPDPs is reviewed in this section as well.

Online food delivery

Lu et al. [19] considered a simplified case where orders from the same restaurant could only be assigned to one rider and proposed two genetic algorithms to solve the order assignment and the routing problem, respectively. Steever et al. [31] studied a special scenario of the OFDP where a single order might include food from multiple restaurants. The authors formulated a mixed-integer linear programming (MILP) model and designed an auction-based heuristic that incorporated future-looking measurements to determine the order assignment. Liu et al. [18] addressed two versions of the on-demand food delivery problem, i.e., the opportunistic online takeout ordering and delivery (OTOD), where the taxi driver could deliver food while carrying passengers, and the dedicated OTOD, where the taxi driver could only deliver food. To solve the OTOD problems, they proposed a two-stage approach combining a constructive algorithm and an adaptive large neighborhood search algorithm with a simulated annealing mechanism. Liu [17] presented an online optimization-driven algorithm for the on-demand meal delivery problem where the food was carried by drones. Luo et al. [20] proposed a genetic annealing algorithm to schedule the routes of delivery persons and studied the relationship between delivery lateness and the tip that customers offered. Reyes et al. [28] proposed a rolling horizon algorithm to solve the meal delivery routing problem considering dynamic vehicle routing and courier shift scheduling. Yildiz and Savelsbergh [38] presented a novel formulation for the problem addressed in [28] and designed a simultaneous column and row generation method, the efficacy of which was demonstrated by experiments on instances derived from real-life data. Cosmi et al. [7, 8] considered a special scenario with a single courier and a single restaurant, which was modeled as a single machine scheduling problem. The authors formulated this problem as several MILP models solved by Gurobi [7] and proposed a dynamic programming algorithm [8]. Ulmer et al. [33] presented an anticipatory customer assignment policy to deal with the stochasticity of the arrival times of new orders and the ready times of restaurants. Yu et al. [40] derived lower bounds for the online pickup and delivery problem and proposed two online dispatching algorithms. Chen et al. [5] proposed a hybrid differential evolution algorithm with two phases to determine the route planning and order dispatching, respectively. Zheng et al. [43] developed a two-stage algorithm to solve the fuzzy online order dispatching problem considering the uncertainty of food preparation.

Dynamic pickup and delivery

For a comprehensive review of the DPDP before 2010, interested readers may refer to [3]. In this part, we focus on recent publications studying the DPDP. Muñoz-Carpintero et al. [25] considered the DPDP under a dial-a-ride service scenario. The dial-a-ride system was formulated with a hybrid predictive control approach and the problem was solved by a genetic algorithm and a particle swarm optimization algorithm. Zhu et al. [44] proposed a multi-objective memetic algorithm embedded with a locality-sensitive hashing-based local search to simultaneously minimize route length, response time, and workload. Arslan et al. [2] formulated the DPDP as an offline problem by using the rolling horizon method and designed an exact recursive algorithm to solve the matching problem between tasks and drivers in crowdsourced delivery. Aleksandrov et al. [1] proposed several dispatching heuristics and designed a neural-network-based system able to determine the best heuristic to adopt in the current decision process for an online PDP. Fkaier et al. [14] proposed a K-means algorithm to assign clustered requests and used a dynamic programming algorithm to schedule the routes. Ferrucci and Bock [13] presented a real-time control approach to simultaneously handle tour plan execution and tour plan adaptation, where the latter was realized by a tabu search algorithm. Sheridan et al. [30] proposed a dynamic nearest neighbor policy to assign vehicles to customers to optimize the mean system time of the customers. Vonolfen and Affenzeller [34] proposed an intensity-based waiting strategy and a parametrization approach to adapt the strategy to different problem environments.

The literature reviewed above addresses different OFDPs and DPDPs and provides various solution methodologies. However, none of these studies fully accords with the problem in this paper, which originates from the Meituan platform. Moreover, the existing algorithms cannot be adopted directly to solve our OFDP due to different problem characteristics and extremely limited computation time. Therefore, we address the OFDP and propose specially designed methods in this paper.

Problem description

With the notations in Table 1, the OFDP can be transformed into a static optimization problem within the time window [T − \({\Delta }T\), T] by using the rolling horizon strategy, as follows.

Table 1 Notations

At the current time T, a set of n new online orders received in [T − \({\Delta }T\), T] needs to be dispatched to m riders. Each rider qj carries \(n_{j}^{o}\) old orders which have already been dispatched to qj before T but have not been picked up or delivered yet. The assignment of each old order to riders was determined in previous decision-making processes and is known as input at T. Each rider qj has a site node \(l_{j}\) where qj is located at T. Each order \(o\left( i \right)\) is associated with a pair of nodes \(\left( {i_{ + } ,i_{ - } } \right)\) or a single node \(i_{ - }\), where \(i_{ + }\) is the pickup node and \(i_{ - }\) is the delivery node. To be specific, since some of the old orders have already been picked up, it is not necessary to include the pickup nodes of these orders in the problem formulation. Each order has a soft deadline \(DDL_{{i_{ - } }}\) that is either appointed by the customer or committed by the platform. The OFDP can then be defined on a graph \(G = \left( {V,A} \right)\), where \(V = L \cup P \cup D \cup D^{p}\) is the set of nodes and \(A\) is the set of arcs. \(L\) is the set of site nodes of all riders, \(P\) comprises the pickup nodes of all new orders and of the old orders that have not been picked up, \(D\) is the delivery node set of unpicked orders, and \(D^{p}\) is the delivery node set of picked orders. Each arc \(\left( {i_{1} ,i_{2} } \right) \in A \left( {i_{1} ,i_{2} \in V, i_{1} \ne i_{2} } \right)\) is associated with a travel time \(t\left( {i_{1} ,i_{2} } \right)\) and a travel distance \(d\left( {i_{1} ,i_{2} } \right)\). The constraints are listed as follows.

  • The food must be picked up before being delivered.

  • The food cannot be picked up before it has been prepared by the restaurant.

  • A new order can only be dispatched to one rider.

  • Old orders cannot be reassigned to other riders.

  • The total weight of food that a rider carries cannot exceed the trunk capacity.

The objective is to minimize the average dispatching cost (ADC) by determining the matching relationship between new orders and riders:

$$ {\text{Minimize ADC}} = \frac{1}{n}\mathop \sum \limits_{{q_{j} \in Q}} C\left( {O_{j}^{n} } \right), $$
(1)

where \(C\left( {O_{j}^{n} } \right)\) is the dispatching cost after rider qj is dispatched with a set of new orders \(O_{j}^{n}\), which can be calculated as follows.

Consider a given route \(R_{j} = \left[ {r_{j} \left( 0 \right),r_{j} \left( 1 \right), \ldots ,r_{j} \left( {N_{j} } \right)} \right]\) (\(r_{j} \left( 0 \right) = l_{j}\)) of qj, where \(r_{j} \left( i \right) \in V\) is the i-th node of route \(R_{j}\) and the length of \(R_{j}\) is \(N_{j} + 1\). The time \({\text{AT}}_{j,i}\) when qj arrives at node \(r_{j} \left( i \right)\) can be calculated as (2).

$$ {\text{AT}}_{j,i} = T + \mathop \sum \limits_{k = 1}^{i} t\left( {r_{j} \left( {k - 1} \right),r_{j} \left( k \right)} \right) + \mathop \sum \limits_{{r_{j} \left( k \right) \in R_{j} \left( i \right) \cap P}} {\text{WT}}_{j,k} , $$
(2)

where \(R_{j} \left( i \right) = \left[ {r_{j} \left( 0 \right),r_{j} \left( 1 \right), \ldots ,r_{j} \left( i \right)} \right]\) is a partial route of \(R_{j}\) and \({\text{WT}}_{j,k} = \max \left\{ {0,{\text{PT}}_{{r_{j} \left( k \right)}} - {\text{AT}}_{j,k} } \right\}\) is the duration that qj needs to wait at pickup node \(r_{j} \left( k \right)\) for the preparation of the food. \({\text{PT}}_{{r_{j} \left( k \right)}}\) is the time when the food is prepared at pickup node \(r_{j} \left( k \right)\). The tardiness of delivering food to node \(r_{j} \left( i \right)\) is defined as \({\text{TA}}_{j,i} = {\text{max}}\left\{ {0,{\text{AT}}_{j,i} - {\text{DDL}}_{{r_{j} \left( i \right)}} } \right\}\) and a nonlinear penalty on the tardiness is set as (3):

$$ {\text{PE}}_{j,i} = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {{\text{TA}}_{j,i} \le 0} \hfill \\ {\theta {\text{TA}}_{j,i}^{2} ,} \hfill & {0 < {\text{TA}}_{j,i} < \Theta } \hfill \\ {\kappa {\text{TA}}_{j,i} + \sigma , } \hfill & {{\text{TA}}_{j,i} \ge \Theta } \hfill \\ \end{array} } \right., $$
(3)

where \(\theta\), \({\Theta }\), \(\kappa\) and \(\sigma\) are parameters. The time cost \({\text{TC}}_{j}\) of route \(R_{j}\) is defined as the summation of the penalty in (4):

$$ {\text{TC}}_{j} = \mathop \sum \limits_{{r_{j} \left( k \right) \in R_{j} \cap \left( {D \cup D^{p} } \right)}} {\text{PE}}_{j,k} . $$
(4)

The distance cost \({\text{DC}}_{j}\) of route \(R_{j}\) is defined as the total travel distance in (5):

$$ {\text{DC}}_{j} = \mathop \sum \limits_{k = 1}^{{N_{j} }} d\left( {r_{j} \left( {k - 1} \right),r_{j} \left( k \right)} \right). $$
(5)

For qj, the old route is constructed with the pickup and delivery nodes of the orders in \(O_{j}^{o}\), and the new route is constructed with the pickup and delivery nodes of the orders in \(O_{j}^{o} \cup O_{j}^{n}\). Denote the time costs of the old route and the new route as \({\text{TC}}_{j}^{o}\) and \({\text{TC}}_{j}^{n}\), respectively, and the corresponding distance costs as \({\text{DC}}_{j}^{o}\) and \({\text{DC}}_{j}^{n}\), respectively. Given that the riders are unwilling to change their routes too much after being dispatched new orders, the difference between the new route and the old route is considered in constructing the objective function, which yields the dispatching cost in (6):

$$ C\left( {O_{j}^{n} } \right) = \left| {{\text{TC}}_{j}^{n} - {\text{TC}}_{j}^{o} } \right| + \left| {{\text{DC}}_{j}^{n} - {\text{DC}}_{j}^{o} } \right|. $$
(6)

Before calculating \(C\left( {O_{j}^{n} } \right)\), the new route and the old route are scheduled by route planning methods. It should be noted that the route planning methods are basic components for solving the OFDP, since they determine the dispatching costs and thus affect the final dispatching results. However, as space is limited and we mainly concentrate on the dispatching techniques in this paper, the adopted route planning methods are not elaborated; interested readers can refer to our previous work [5, 35, 42] for details.
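To make the cost computation concrete, the following minimal Python sketch evaluates a route according to Eqs. (2)–(5) and the dispatching cost of Eq. (6). The `Node` record and the `travel_time`/`travel_dist` lookups are hypothetical stand-ins for the platform's data structures; the penalty parameters default to the empirical values reported later in the experiment section.

```python
from dataclasses import dataclass

@dataclass
class Node:
    kind: str                # "site", "pickup", or "delivery" (hypothetical record)
    ready_time: float = 0.0  # PT, meaningful for pickup nodes only
    deadline: float = 0.0    # DDL, meaningful for delivery nodes only

def route_costs(route, T, travel_time, travel_dist,
                theta=0.06, Theta=20.0, kappa=8.0, sigma=136.0):
    """Time cost (4) and distance cost (5) of route [site, r(1), ..., r(N)]."""
    at = T                   # arrival time AT of Eq. (2), accumulated node by node
    tc = dc = 0.0
    for prev, cur in zip(route, route[1:]):
        at += travel_time(prev, cur)
        dc += travel_dist(prev, cur)
        if cur.kind == "pickup":
            at = max(at, cur.ready_time)        # waiting WT = max{0, PT - AT}
        elif cur.kind == "delivery":
            ta = max(0.0, at - cur.deadline)    # tardiness TA
            if ta == 0.0:                       # piecewise penalty of Eq. (3)
                pe = 0.0
            elif ta < Theta:
                pe = theta * ta ** 2
            else:
                pe = kappa * ta + sigma
            tc += pe
    return tc, dc

def dispatching_cost(old_route, new_route, T, travel_time, travel_dist):
    """Dispatching cost C(O_j^n) of Eq. (6)."""
    tc_o, dc_o = route_costs(old_route, T, travel_time, travel_dist)
    tc_n, dc_n = route_costs(new_route, T, travel_time, travel_dist)
    return abs(tc_n - tc_o) + abs(dc_n - dc_o)
```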

Methodology

Due to the high dynamism, urgency, and large scale of the OFDP, it is infeasible to solve this problem with exact methods. Indeed, the computation time available in the online scheduling system is so limited that it is not practical to run an online search process for the optimal solution. In this situation, efficient and effective heuristics become the first choice for the OFDP. To obtain a high-quality solution in a very short time, we propose the MAATS framework illustrated in Fig. 2, which mainly consists of a best-matching component and an adaptive tie-breaking component. In the best-matching component, a cost matrix is calculated by tentatively dispatching each new order to each rider. Then a partial solution of certain quality is produced by efficiently matching new orders with their best riders, i.e., the riders with minimum dispatching cost. As for adaptive tie-breaking, when ties occur where multiple orders are matched with a single best rider, the ML model predicts the best operator to break the ties based on the decision information extracted from the input data and the cost matrix. Using the best operator, a specific order is selected and then dispatched to the best rider. Repeating the above steps dispatches all orders efficiently and effectively.

Fig. 2 Framework of MAATS

Best-matching

In best-matching, a cost matrix \(\left( {C_{i,j} } \right)_{n \times m} \) is first generated for decision-making. Each element \(C_{i,j}\) represents the cost of dispatching new order \(o\left( i \right)\) to rider qj according to (6), i.e., \(C_{i,j} = C\left( {O_{j}^{n} } \right)\) with \(O_{j}^{n} = \left\{ {o^{n} \left( i \right)} \right\}\). For each new order, the rider with minimum dispatching cost in the cost matrix is defined as its best rider. After finding the best rider for each order, two types of results may appear, as the example in Fig. 3 shows. The first type is one-to-one best-matching, where a rider is regarded as the best rider by only one order. The second type is many-to-one best-matching, where a rider is regarded as the best rider by more than one order. For the sake of clarity, in many-to-one best-matching we define these orders as candidate orders and their best rider as the critical rider, while the orders and riders in one-to-one best-matching are defined as non-candidate orders and non-critical riders, respectively.

Fig. 3 An example of cost matrix and best-matching

For one-to-one best-matching, each non-candidate order is dispatched to its corresponding best rider (non-critical rider). However, for many-to-one best-matching, only one candidate order is selected and dispatched to the critical rider, instead of dispatching all candidate orders to the critical rider, for the following reasons:

  1. It lacks information for decision-making because the overall cost of dispatching all candidate orders to the critical rider is unknown. In the cost matrix, each element only represents the cost of dispatching one order to one rider. The overall cost cannot be obtained by simply summing the individual costs due to the nonlinearity of the cost function.

  2. Dispatching all candidate orders to the critical rider may not be the best choice. Obviously, it is riskier for a rider to deliver punctually when serving multiple customers simultaneously. Besides, it may result in an imbalance of workloads among riders when critical riders are associated with too many candidate orders. In other words, some riders might be dispatched an excessive number of orders while others are dispatched none.

Apparently, in many-to-one best-matching, the unselected candidate orders are not dispatched to any rider in the current loop. These orders will be dispatched to their best riders by executing the best-matching in the next loop. Before finding the best riders in the next loop, the cost matrix needs to be updated. Specifically, for a rider qj who has been dispatched a new order \(o\left( i \right)\), the old route is tentatively replaced by the current new route, which is constructed with the nodes of the newly dispatched order \(o\left( i \right)\) and the old order set \(O_{j}^{o}\). Then, the costs of dispatching the remaining orders to qj are recalculated. Therefore, the best riders of the unselected candidate orders may differ from those in the previous loop due to the update of the cost matrix. The best-matching is then re-executed based on the updated cost matrix.
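The following Python sketch outlines the best-matching loop described above. The callbacks `dispatch`, `recalc_costs`, and `break_tie` are hypothetical placeholders for the platform's route planning and tie-breaking components, not the actual interfaces of the system.

```python
import numpy as np

def best_matching(cost_matrix, dispatch, recalc_costs, break_tie):
    """Dispatch every new order by repeatedly matching orders to best riders."""
    C = np.asarray(cost_matrix, dtype=float).copy()   # (orders x riders)
    remaining = list(range(C.shape[0]))
    while remaining:
        # group the remaining orders by their current best rider
        groups = {}
        for i in remaining:
            groups.setdefault(int(np.argmin(C[i])), []).append(i)
        dispatched = []
        for j, orders in groups.items():
            if len(orders) == 1:                 # one-to-one best-matching
                chosen = orders[0]
            else:                                # many-to-one: break the tie
                chosen = break_tie(orders, j, C)
            dispatch(chosen, j)
            dispatched.append((chosen, j))
        for i, _ in dispatched:
            remaining.remove(i)
        # riders whose routes changed get recomputed cost columns, so the
        # unselected candidates are re-matched against updated costs
        for _, j in dispatched:
            if remaining:
                C[remaining, j] = recalc_costs(j, remaining)
```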

Tie-breaking operators

For many-to-one best-matching, how to select an appropriate candidate order has a great influence on the final dispatching results. This is because dispatching different candidate orders to the critical rider affects the dispatching process of the remaining candidate orders. This selection procedure is similar to the tie-breaking mechanism for solving permutation-based combinatorial optimization problems, where some insertion-based algorithms need to select one from multiple partial solutions whose objective values are the same. Selecting different partial solutions influences the subsequent procedure of constructing an entire solution, thus resulting in different final solutions. Therefore, how to break ties has a great influence on the performance of insertion-based algorithms [12]. In MAATS, a tie occurs whenever there is many-to-one best-matching. In addition, plenty of ties may occur in peak hours due to numerous orders and a shortage of delivery resources. Therefore, breaking ties well is significant for best-matching. Based on different optimization goals, multiple tie-breaking operators are designed as follows.

  • MIN operator: dispatch the candidate order with minimum dispatching cost to the critical rider.

  • MINT operator: dispatch the candidate order with minimum time cost to the critical rider.

  • MIND operator: dispatch the candidate order with minimum distance cost to the critical rider.

  • MAX operator: dispatch the candidate order with maximum dispatching cost to the critical rider.

  • REG operator: dispatch the candidate order with maximum regret value to the critical rider.

The regret value \({\text{rv}}\left( {o\left( i \right)} \right)\) of dispatching order \(o\left( i \right)\) is defined as the cost difference between the second-best rider and the best rider, which can be calculated according to (7):

$$ {\text{rv}}\left( {o\left( i \right)} \right) = C_{i,\left[ 2 \right]} - C_{i,\left[ 1 \right]} , $$
(7)

where \(C_{i,\left[ 1 \right]}\) and \(C_{i,\left[ 2 \right]}\) are the minimum and second minimum dispatching costs of the i-th row in the cost matrix, respectively.

The former three are greedy operators which consider minimizing the cost from different perspectives. However, greedy methods are easily trapped in local optima. To alleviate this, the latter two operators are designed. The MAX operator considers dispatching the order that seems hardest to deliver (with the largest dispatching cost) first, in case there are no appropriate riders for this order in subsequent loops of best-matching. Note that the critical rider and the order selected by the MAX operator still form a best-matching even if the selected order has the largest cost among the candidate orders. Therefore, the MAX operator will not deteriorate the quality of the final solution too much. The REG operator is inspired by the insertion heuristics in [29], which incorporate a kind of future-looking information. As mentioned before, the critical rider (ranking first) in the current loop may not be the best rider for the remaining candidate orders in the next loop. In this situation, the second-best rider (ranking second) is most likely to become the best rider for the remaining candidate orders after the cost matrix is updated. The regret value measures the difference between the best rider and the second-best rider, and a larger regret value indicates that it is more inappropriate to dispatch the order to the second-best rider. In other words, the decision-maker would regret dispatching the order to an inappropriate second-best rider instead of the best rider. In short, the REG operator can avoid bad cases to some degree.
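A minimal sketch of the five operators, assuming a tie detected for critical rider j and assuming that the time and distance components of each cost are available as separate lookups (`time_cost`, `dist_cost`, both hypothetical):

```python
import numpy as np

def regret_value(C, i):
    """Regret value rv(o(i)) of Eq. (7): second-best minus best cost in row i."""
    two_smallest = np.partition(C[i], 1)[:2]
    return two_smallest.max() - two_smallest.min()

def select_candidate(operator, candidates, j, C, time_cost, dist_cost):
    """Select one candidate order for critical rider j under operator X."""
    if operator == "MIN":
        return min(candidates, key=lambda i: C[i, j])
    if operator == "MINT":
        return min(candidates, key=lambda i: time_cost[i, j])
    if operator == "MIND":
        return min(candidates, key=lambda i: dist_cost[i, j])
    if operator == "MAX":
        return max(candidates, key=lambda i: C[i, j])
    if operator == "REG":
        return max(candidates, key=lambda i: regret_value(C, i))
    raise ValueError(f"unknown operator {operator}")
```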

Adaptive tie-breaking strategy

According to the no free lunch theorem for optimization [36], no algorithm can outperform any other algorithm on every instance. Similarly, the above tie-breaking operators exhibit different performances under different circumstances. However, if we can select the best operator at each time T, the all-day performance can be improved compared with using any single dispatching operator at every time T. To adaptively select the best operator, we propose an adaptive tie-breaking strategy based on ML techniques.

Multi-label classification

The selection of best tie-breaking operator can be transformed into a classification problem where each operator is regarded as a class label. The goal is to predict the label that the current instance (the static optimization problem at T) is associated with. However, there may be multiple tie-breaking operators that result in the same best dispatching solution. Therefore, it is a multi-label classification problem (MLCP) where each instance may be associated with a set of labels.

To deal with the multi-label classification problem [32], there are two categories of methods, i.e., the problem transformation method (PTM) and the algorithm adaptation method (AAM). The former transforms a multi-label classification problem into one or more single-label classification problems or regression problems. The latter adapts specific learning algorithms to solve the multi-label classification problem directly. Briefly speaking, the PTMs try to fit the data to the algorithm, while the AAMs try to fit the algorithm to the data [41].

Feature design

Feature engineering is defined as the process of extracting features from raw data by using domain knowledge, and it plays an important role in data preparation for machine learning. Suitable features can improve the predictive performance of machine learning algorithms. However, useful features usually rely on the domain expertise of data scientists, iterative trial and error, and model evaluation [26], so extracting effective features is not easy. Different from predicting real-life parameters that may have highly relevant factors, it is more difficult to extract highly relevant features for predicting the best operators because the features need to serve as decision information for optimization. Since the best operator is determined and labeled by the final optimization results, the prediction model should have the ability to predict the optimization results in order to distinguish the best operators. Therefore, sufficient information that may affect the optimization results should be provided to the model besides the basic information on the problem characteristics. To achieve this, we integrate information from multiple dimensions into the feature space, where the features are mainly categorized into problem-oriented features (information on the problem characteristics) and operator-oriented features (information affecting the optimization results).

Problem-oriented features
  1) Spatial–temporal features

  • City id: each city is numbered with a unique integer.

  • Day of week: Day \(\in \left\{ {1,2, \ldots ,7} \right\}\) represents the day of the week, where 1 denotes Sunday, 2 denotes Monday, etc.

  • Hour of day: Hour \(\in \left\{ {0,1, \ldots ,23} \right\}\) represents the hour of the day at the current time.

  • Minute of day: Minute \(\in \left\{ {0,1, \ldots ,1439} \right\}\) represents the minute of the day at the current time.

  2) Rider number features

  • The number of critical riders ncr, which equals the number of many-to-one best-matchings in the first loop.

  • The number of non-critical riders nncr, which equals the number of one-to-one best-matchings in the first loop.

  • The number of best riders: ncr \(+\) nncr.

  • The proportion of critical riders among best riders: ncr/(ncr \(+\) nncr).

  3) Order number features

  • The number of candidate orders, which equals the sum of the candidate order numbers of all critical riders.

  • The number of new orders, which equals the sum of the candidate order number and the non-candidate order number. (The number of non-candidate orders is not included as a separate feature because it equals the number of non-critical riders.)

  • The proportion of candidate orders among new orders.

The rider number and order number features only contain overall information on riders and orders; they do not consider the detailed information on each critical rider, each candidate order, and the correlations between them. To describe the relationships between critical riders and candidate orders, we design the following aggregate features. However, it is not reasonable to consider each characteristic of each rider or order as one feature because: (1) this would lead to a feature explosion since there are thousands of riders and orders; (2) the number of features would be uncertain since the number of riders or orders is not fixed. Therefore, the characteristics of riders or orders are aggregated by calculating statistics over features of the same type. In this paper, the mean, sum, median, maximum, and minimum values and the standard deviation are considered as the specific statistics.

  4) Aggregate order number features

The statistics for each critical rider are computed and the statistics of the same type are averaged over the number of critical riders, which finally yields \(3 \times 6 = 18\) features in total (a code sketch follows the list below).

  • Aggregate candidate order number: statistics of \(\left\{ {n_{1}^{c} ,n_{2}^{c} , \ldots ,n_{m}^{c} } \right\}\), where \(n_{j}^{c}\) denotes the number of candidate orders that match rider qj.

  • Aggregate old order number: statistics of \(\left\{ {n_{1}^{o} ,n_{2}^{o} , \ldots ,n_{m}^{o} } \right\}\), where \(n_{j}^{o}\) denotes the number of old orders that rider qj carries.

  • Aggregate candidate-old order ratio: statistics of \(\left\{ {{\text{rco}}_{1} ,{\text{rco}}_{2} , \ldots ,{\text{rco}}_{m} } \right\}\), where \({\text{rco}}_{j} = n_{j}^{c} /\left( {n_{j}^{o} + S} \right)\) is the ratio of the candidate order number \(n_{j}^{c}\) to the old order number of qj, and S is a small positive constant that avoids division by zero.
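The aggregation can be sketched as follows; the helper names and the value S = 0.1 are illustrative assumptions rather than the platform's actual settings.

```python
import numpy as np

def aggregate(values):
    """The six summary statistics used for aggregate features.

    A fixed-length output keeps the feature dimension independent of the
    (varying) number of critical riders."""
    v = np.asarray(values, dtype=float)
    return {"mean": v.mean(), "sum": v.sum(), "median": np.median(v),
            "max": v.max(), "min": v.min(), "std": v.std()}

def aggregate_order_features(n_c, n_o, S=0.1):
    """The 3 x 6 = 18 aggregate order number features.

    n_c[j] / n_o[j] are the candidate / old order counts of critical rider j;
    S guards against division by zero."""
    rco = [c / (o + S) for c, o in zip(n_c, n_o)]
    feats = {}
    for name, vals in (("cand", n_c), ("old", n_o), ("rco", rco)):
        feats.update({f"{name}_{k}": x for k, x in aggregate(vals).items()})
    return feats
```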

  5) Aggregate geography-ETA features

The geographic and ETA relationships between orders are also extracted as features. Indeed, one of the reasons why multiple orders match the same best rider is that the candidate orders are similar either in geographical distribution or in ETA. Apparently, it is reasonable to dispatch two or more orders with close restaurant locations, customer locations, and ETAs to one rider simultaneously. We denote this situation as bundle-dispatching. An operator that leads to more bundle-dispatching results is more likely to outperform the other operators. Therefore, the geographic and temporal relationships between orders are valuable information for predicting the best operators. The spatial–temporal order features can be classified into three main types as follows.

  • ETA features, which consider the similarities in the ETAs of the different candidate orders of a critical rider.

  • Geography features, which consider the similarities in the geographical distributions of the different candidate orders of a critical rider.

  • Geography-ETA cross features, which consider the similarities in the joint relationship between the geographical distributions and the ETAs of the different candidate orders of a critical rider.

  6) Aggregate cost features

The cost matrix reveals the relationship between critical riders and candidate orders, and it also serves as an indispensable information source for classification. The cost features mainly contain the statistics of the route cost, time cost, and distance cost in the cost matrix.

Due to limited space, the specific design of the aggregate geography-ETA features and the aggregate cost features is given in the Appendix.

Operator-oriented features

The tie-breaking operators differ in which candidate orders they select. Hence, incorporating the selection information is helpful for classification. For each operator X \(\in\) {MIN, MAX, MINT, MIND, REG}, the following features are calculated (a sketch of two of them is given after the list), yielding \(5 \times 7 = 35\) features in total.

  • The average regret value of unselected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} \mathop \sum \limits_{{i \in O_{j}^{c} ,i \ne o_{j}^{X} }} \frac{{{\text{rv}}\left( i \right)}}{{\left| {O_{j}^{c} } \right|}}\), where \(R^{c}\) is the set of critical riders, \(O_{j}^{c}\) is the candidate order set of qj, \(o_{j}^{X}\) is the candidate order dispatched to rider qj by operator X and \(\left| \cdot \right|\) represents the cardinality of a set.

  • The average regret value of selected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} {\text{rv}}\left( {o_{j}^{X} } \right)\).

  • The average total regret cost of selection: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} \left( {C_{{o_{j}^{X} ,\left[ 1 \right]}} + \mathop \sum \limits_{{i \in O_{j}^{c} ,i \ne o_{j}^{X} }} C_{i,\left[ 2 \right]} } \right)\).

  • The average best cost of the selected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} C_{{o_{j}^{X} ,\left[ 1 \right]}}\).

  • The average second-best cost of the unselected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} \mathop \sum \limits_{{i \in O_{j}^{c} ,i \ne o_{j}^{X} }} C_{i,\left[ 2 \right]}\).

  • The average candidate rider number of unselected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} \mathop \sum \limits_{{i \in O_{j}^{c} ,i \ne o_{j}^{X} }} m_{i}^{c}\), where \(m_{i}^{c}\) is the number of riders that order \(o\left( i \right)\) can be dispatched to.

  • The average candidate rider number of selected orders: \(\frac{1}{{\left| {R^{c} } \right|}}\mathop \sum \limits_{{j \in R^{c} }} m_{{o_{j}^{X} }}^{c}\).
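As an illustration, the following sketch computes two of the listed features, the average regret values of the selected and unselected orders, directly from the cost matrix; the data structures are hypothetical.

```python
import numpy as np

def regret_features(C, crit_riders, cand_orders, selected):
    """Average regret values of selected / unselected orders for operator X.

    crit_riders lists the critical riders j, cand_orders[j] their candidate
    order sets O_j^c, and selected[j] the order o_j^X chosen by operator X."""
    part = np.partition(C, 1, axis=1)          # per-row best and second-best
    rv = part[:, 1] - part[:, 0]               # regret values, Eq. (7)
    n = len(crit_riders)
    avg_rv_selected = sum(rv[selected[j]] for j in crit_riders) / n
    avg_rv_unselected = sum(
        sum(rv[i] for i in cand_orders[j] if i != selected[j]) / len(cand_orders[j])
        for j in crit_riders
    ) / n
    return avg_rv_selected, avg_rv_unselected
```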

Classification model

Since there is no prior knowledge about which method solves our problem better, both a PTM and an AAM are attempted in this study. For the PTM, we transform the MLCP into five independent binary classification problems using binary relevance [41]. Each classifier corresponds to a specific dispatching operator and predicts whether this operator yields the smallest dispatching cost among all operators, which is realized by the XGBoost algorithm. For the AAM, we extend the DeepFM [15] into a new model named mDeepFM to directly predict the best operators. Both methods belong to supervised learning, which learns from the training data and produces a model to predict the labels of new instances.
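A minimal sketch of the binary relevance approach with the scikit-learn interface of the XGBoost library, using the hyperparameters reported later in the experiment section; the layout of the label matrix is an assumption of this sketch.

```python
import numpy as np
from xgboost import XGBClassifier

OPERATORS = ["MIN", "MINT", "MIND", "MAX", "REG"]

def train_binary_relevance(X, Y):
    """One binary XGBoost classifier per operator.

    X: (instances x features) matrix; Y[:, k] = 1 iff operator k is among
    the best operators for that instance."""
    models = []
    for k in range(len(OPERATORS)):
        clf = XGBClassifier(max_depth=5, min_child_weight=4, learning_rate=0.1)
        clf.fit(X, Y[:, k])
        models.append(clf)
    return models

def predict_operator(models, x):
    """Adopt the operator whose classifier outputs the highest probability."""
    probs = [m.predict_proba(x.reshape(1, -1))[0, 1] for m in models]
    return OPERATORS[int(np.argmax(probs))]
```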

XGBoost

XGBoost, first proposed by Chen [6] in 2016, is a powerful and popular tool in the data mining field. As a member of the tree algorithm family, XGBoost aims at improving the traditional gradient boosting decision tree in both model performance and computational speed. XGBoost reduces the look-up times for creating individual trees and supports parallel computing [11, 27], which makes it much faster than many other existing algorithms. Besides, due to its portability, scalability, and flexibility, XGBoost can be operated on various platforms and provides multiple interfaces for users to define the necessary parameters. Consequently, XGBoost has achieved great success in many areas, especially machine learning competitions such as Kaggle, where 17 out of 29 winning solutions used XGBoost in 2015, and the KDD Cup, where all of the top-10 winning solutions used XGBoost in 2015 [10]. Owing to the above merits, we choose XGBoost as a classification model to solve the MLCP.

XGBoost is a boosting algorithm, which is a kind of ensemble learning technique. A boosting algorithm sequentially generates multiple base models that compensate for each other's deficiencies to achieve better prediction performance [9, 23]. For XGBoost, the base model is a decision tree. A decision tree starts from a single root node. New nodes are generated from the root node by selecting appropriate features and splitting the training data into two sets. Each new node, aka internal node, is associated with a subset of the training data and can then be used to generate further nodes. New nodes are generated repeatedly to split the training data precisely, producing a decision tree. The end node of each branch is denoted as a leaf node and is related to a label. The training data reaching a leaf node are classified into the class of the corresponding label. An example is given in Fig. 4 to illustrate the classification process of a tree ensemble model.

Fig. 4 Example of a tree ensemble model

The specific details of XGBoost are introduced as follows. Given a data set \(S = \left\{ {\left( {{\varvec{x}}_{l} ,Y_{l} } \right)|{\varvec{x}}_{l} \in R^{m} ,Y_{l} \in R} \right\}{ }\left( {\left| S \right| = n} \right)\), where \({\varvec{x}}_{l} \in R^{m}\) is the feature vector and \(Y_{l}\) is the corresponding label (\(Y_{l} = 1\) if the corresponding operator is the best; otherwise, \(Y_{l} = 0\)), and considering an ensemble model with K decision trees, the prediction \(\hat{Y}_{l}\) for an instance (\({\varvec{x}}_{l}\), \(Y_{l}\)) can be calculated by summing up the leaf scores \(f_{k} \left( {{\varvec{x}}_{l} } \right)\) of all trees as (8):

$$ \hat{Y}_{l} = \mathop \sum \limits_{k = 1}^{K} f_{k} \left( {{\varvec{x}}_{l} } \right), $$
(8)

where \(f_{k}\) is a function that maps \({\varvec{x}}_{l}\) to a specific leaf score \(w_{{q\left( {{\varvec{x}}_{l} } \right)}}\), i.e., \(f_{k} \left( {{\varvec{x}}_{l} } \right) = w_{{q\left( {{\varvec{x}}_{l} } \right)}}\), and \(q\left( {{\varvec{x}}_{l} } \right)\) maps an instance \({\varvec{x}}_{l}\) to a leaf index, i.e., \(q\left( {{\varvec{x}}_{l} } \right):R^{m} \to \left\{ {1,2, \ldots ,T} \right\}\). The mapping relationship is determined by the tree structure.

Based on the predictions calculated above, the objective function of XGBoost is defined in (9) which includes a loss function term and a regularization term. The former measures the deviation between predictions and true labels for the data set while the latter serves as a penalty on the model complexity:

$$ O = \mathop \sum \limits_{l = 1}^{n} L\left( {Y_{l} ,\hat{Y}_{l} } \right) + \mathop \sum \limits_{k = 1}^{K} {\Omega }\left( {f_{k} } \right). $$
(9)

As a typical boosting algorithm, XGBoost adds a tree in each step. Without loss of generality, when generating the t-th tree, the objective function can be rewritten as (10) by using the Taylor expansion of the above function to the second order:

$$ \begin{aligned} O_{t} & = \mathop \sum \limits_{l = 1}^{n} L\left( {Y_{l} ,\hat{Y}_{l}^{{\left( {t - 1} \right)}} + f_{t} \left( {{\varvec{x}}_{l} } \right)} \right) + \mathop \sum \limits_{k = 1}^{t} {\Omega }\left( {f_{k} } \right) \\ & \cong \mathop \sum \limits_{l = 1}^{n} \left[ {L\left( {Y_{l} ,\hat{Y}_{l}^{{\left( {t - 1} \right)}} } \right) + g_{l} f_{t} \left( {{\varvec{x}}_{l} } \right) + \frac{1}{2}h_{l} f_{t}^{2} \left( {{\varvec{x}}_{l} } \right)} \right] + {\Omega }\left( {f_{t} } \right) + \mathop \sum \limits_{k = 1}^{t - 1} {\Omega }\left( {f_{k} } \right), \\ & \Leftrightarrow \tilde{O}_{t} = \mathop \sum \limits_{l = 1}^{n} \left[ {g_{l} f_{t} \left( {{\varvec{x}}_{l} } \right) + \frac{1}{2}h_{l} f_{t}^{2} \left( {{\varvec{x}}_{l} } \right)} \right] + {\Omega }\left( {f_{t} } \right) \\ \end{aligned} $$
(10)

where \(g_{l} = \frac{{\partial L\left( {Y_{l} ,\hat{Y}_{l}^{{\left( {t - 1} \right)}} } \right)}}{{\partial \hat{Y}_{l}^{{\left( {t - 1} \right)}} }}\) and \(h_{l} = \frac{{\partial^{2} L\left( {Y_{l} ,\hat{Y}_{l}^{{\left( {t - 1} \right)}} } \right)}}{{\partial \left( {\hat{Y}_{l}^{{\left( {t - 1} \right)}} } \right)^{2} }}\) are the first- and second-order gradient statistics of the loss function. It should be noted that the loss term \(L\left( {Y_{l} ,\hat{Y}_{l}^{{\left( {t - 1} \right)}} } \right)\) and the regularization terms of the first t − 1 trees are removed from the objective function since they are determined in previous steps and can be regarded as constants. By introducing the specific form of the penalty term \({\Omega }\left( {f_{k} } \right) = \gamma T + \frac{1}{2}\lambda \mathop \sum \limits_{j = 1}^{T} \omega_{j}^{2}\) and splitting the data set into subsets \(I_{j} = \{ l|q\left( {{\varvec{x}}_{l} } \right) = j\}\) associated with leaf \(j\), the final form of the objective function can be obtained as follows:

$$ \begin{aligned} \tilde{O}_{t} & = \mathop \sum \limits_{l = 1}^{n} \left[ {g_{l} f_{t} \left( {{\varvec{x}}_{l} } \right) + \frac{1}{2}h_{l} f_{t}^{2} \left( {{\varvec{x}}_{l} } \right)} \right] + \gamma T + \frac{1}{2}\lambda \mathop \sum \limits_{j = 1}^{T} \omega_{j}^{2} \\ & = \mathop \sum \limits_{j = 1}^{T} \left[ {\omega_{j} \mathop \sum \limits_{{l \in I_{j} }} g_{l} + \frac{1}{2}\omega_{j}^{2} \left( {\mathop \sum \limits_{{l \in I_{j} }} h_{l} + \lambda } \right)} \right] + \gamma T. \\ \end{aligned} $$
(11)

Given that minimizing the objective function is a quadratic problem, the optimal leaf score of t-th tree and optimal objective can be easily obtained as follows:

$$ \omega_{j}^{*} = - \frac{{\mathop \sum \nolimits_{{l \in I_{j} }} g_{l} }}{{\mathop \sum \nolimits_{{l \in I_{j} }} h_{l} + \lambda }}, $$
(12)
$$ \tilde{O}_{t}^{*} = - \frac{1}{2}\mathop \sum \limits_{j = 1}^{T} \frac{{\left( {\mathop \sum \nolimits_{{l \in I_{j} }} g_{l} } \right)^{2} }}{{\mathop \sum \nolimits_{{l \in I_{j} }} h_{l} + \lambda }} + \gamma T. $$
(13)

The optimal values are calculated for a specific tree structure. However, it is impracticable to traverse all possible structures. Usually, we start from the root node and add branches iteratively by using greedy methods which evaluate the quality of a split according to the gain of the objective function after the split, as in (14):

$$ {\Delta }O_{{{\text{split}}}} = \frac{1}{2}\left[ {\frac{{\left( {\mathop \sum \nolimits_{{l \in I_{{\text{L}}} }} g_{l} } \right)^{2} }}{{\mathop \sum \nolimits_{{l \in I_{{\text{L}}} }} h_{l} + \lambda }} + \frac{{\left( {\mathop \sum \nolimits_{{l \in I_{{\text{R}}} }} g_{l} } \right)^{2} }}{{\mathop \sum \nolimits_{{l \in I_{{\text{R}}} }} h_{l} + \lambda }} - \frac{{\left( {\mathop \sum \nolimits_{l \in I} g_{l} } \right)^{2} }}{{\mathop \sum \nolimits_{l \in I} h_{l} + \lambda }}} \right] - \gamma , $$
(14)

where IL and IR denote the data sets of the left and right nodes after the split. The larger the gain, the better the split. However, it is time-consuming to enumerate all possible splits. Therefore, XGBoost uses an approximate algorithm [6] for split finding. With the approximate algorithm, the trees can be generated quickly and used for prediction.
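For illustration, the following sketch performs the exact greedy split search over one feature, scoring every threshold with the gain in (14); the default regularization values are placeholders, and ties among equal feature values are ignored for brevity.

```python
import numpy as np

def best_split(feature, g, h, lam=1.0, gamma=0.0):
    """Scan sorted feature values and return the best (gain, threshold)."""
    order = np.argsort(feature)
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = 0.0, None
    for idx in order[:-1]:                      # keep the right child non-empty
        GL += g[idx]
        HL += h[idx]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma   # Eq. (14)
        if gain > best_gain:
            best_gain, best_thr = gain, feature[idx]
    return best_gain, best_thr
```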

mDeepFM

The DeepFM was first proposed to predict the click-through rate (CTR) and exhibits good performance in recommendation systems [15]. The typical characteristic of the DeepFM is that it models low-order and high-order interactions of features by integrating a factorization machine (FM) and a deep neural network (DNN). Compared with its counterpart, the Wide & Deep model, which uses a linear model to learn the low-order interactions of features by considering both raw features and cross-product features, the DeepFM employs the FM to learn from raw features directly and thus relies less on feature engineering. Besides, the DeepFM makes full use of latent vectors to realize end-to-end learning and thus does not need to pre-train the FM layer. Therefore, we choose the DeepFM model for solving the MLCP.

Given that the prediction of CTR is a binary classification problem, the basic DeepFM cannot deal with multi-label classification directly. Therefore, we extend the DeepFM and propose a new model, mDeepFM. By adding a transformation layer, the mDeepFM has two merits: (1) it solves the MLCP directly by outputting a vector where each element denotes the probability that the corresponding operator is the best; (2) it makes more precise predictions by further learning the correlations between low-order and high-order features. The structure of mDeepFM is illustrated in Fig. 5.

Fig. 5 Structure of mDeepFM

Different from the XGBoost, each instance for the mDeepFM is preprocessed by converting the category features into one-hot coding sequences and integrating the label of each operator into a label vector. Consider an instance \(\left( {{\varvec{x}}_{l} ,Y_{l} } \right) \in S\) where \({\varvec{x}}_{l} = \left( {x_{{{\text{field}}_{1} }} , \ldots , x_{{{\text{field}}_{c} }} , x_{{{\text{field}}_{c + 1} }} , \ldots , x_{{{\text{field}}_{a} }} } \right)\) is an a-dimensional vector. \(x_{{{\text{field}}_{1} }} , \ldots ,x_{{{\text{field}}_{c} }}\) represent the category features and \(x_{{{\text{field}}_{c + 1} }} , \ldots ,x_{{{\text{field}}_{a} }}\) represent the continuous features. Each category feature is then extended by one-hot coding, i.e., \(x_{{{\text{field}}_{i} }} \Rightarrow {\varvec{x}}_{{{\text{field}}_{i} }}^{{{\prime}}} = \left( {0, \ldots ,0,1,0, \ldots ,0} \right), \quad i \in \left\{ {1,2, \ldots ,c} \right\}\). By concatenating the one-hot coded category features and the continuous features, the feature vector \({\varvec{x}}_{l}\) can be reconstructed as a d-dimensional vector \({\varvec{x}}_{l}^{{{\prime}}} = \left( {x_{1} , \ldots ,x_{d} } \right) \in R^{d}\). In what follows, the details of each component are introduced.

  1) FM component

The FM is used for modeling the interactions of single and pairwise features. Given an instance \(\left( {{\varvec{x}}_{l}^{{{\prime}}} ,Y_{l} } \right)\), the output of FM component in DeepFM can be calculated as follows:

$$ \hat{Y}_{{{\text{FM}}}} = \mathop \sum \limits_{i = 1}^{d} w_{i} x_{i} + \mathop \sum \limits_{i = 1}^{d} \mathop \sum \limits_{j = i + 1}^{d} \left\langle {{\varvec{v}}_{i} ,{\varvec{v}}_{j} } \right\rangle x_{i} x_{j} , $$
(15)

where the first-order weights \(w_{i} \in R\) and the latent vectors \({\varvec{v}}_{i} = \left[ {v_{i,1} ,v_{i,2} , \ldots ,v_{i,e} } \right] \in R^{e}\) (\(\forall i \in \left\{ {1,2, \ldots ,d} \right\}\)) are parameters, e is a hyperparameter denoting the dimensionality of the factorization, and \(\left\langle {{\varvec{v}}_{i} ,{\varvec{v}}_{j} } \right\rangle = \mathop \sum \limits_{f = 1}^{e} v_{i,f} \cdot v_{j,f}\) is the inner product of vectors \({\varvec{v}}_{i}\) and \({\varvec{v}}_{j}\). Instead of a single weight \(w_{i,j} \in R\), the FM employs the inner product of two latent vectors as the weight of a second-order term. This interaction scheme enables the FM to train the latent vectors \({\varvec{v}}_{i}\) and \({\varvec{v}}_{j}\) even when \(x_{i}\) or \(x_{j}\) equals zero in a sample, whereas a single weight \(w_{i,j}\) would receive no training signal in that case (\(w_{i,j} = 0\)). By expanding the calculation of \(\left\langle {{\varvec{v}}_{i} ,{\varvec{v}}_{j} } \right\rangle\), the prediction \(\hat{Y}_{{{\text{FM}}}}\) can be rewritten as follows:

$$ \hat{Y}_{{{\text{FM}}}} = \mathop \sum \limits_{i = 1}^{d} w_{i} x_{i} + \frac{1}{2}\mathop \sum \limits_{f = 1}^{e} \left( {\left( {\mathop \sum \limits_{i = 1}^{d} v_{i,f} x_{i} } \right)^{2} - \mathop \sum \limits_{i = 1}^{d} v_{i,f}^{2} x_{i}^{2} } \right). $$
(16)
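This identity is what makes the FM efficient: the pairwise term is computed in O(ed) rather than O(ed²) time. A minimal sketch, assuming dense NumPy arrays:

```python
import numpy as np

def fm_output(x, w, V):
    """FM prediction of Eq. (16).

    x: feature vector (d,); w: first-order weights (d,);
    V: latent matrix (d, e) whose i-th row is v_i."""
    linear = w @ x
    s = V.T @ x                  # per factor f: sum_i v_{i,f} x_i
    s2 = (V.T ** 2) @ (x ** 2)   # per factor f: sum_i v_{i,f}^2 x_i^2
    return linear + 0.5 * np.sum(s ** 2 - s2)
```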

To solve the MLCP directly, the outputs of the FM are fed into a transformation layer in mDeepFM. However, feeding all weighted first-order and second-order terms would introduce too many parameters between the FM layer and the transformation layer, while feeding only the summation \(\hat{Y}_{{{\text{FM}}}}\) would lose useful information to some degree. Therefore, the weighted first-order and second-order terms are transformed into a vector \({\varvec{o}}_{{{\text{FM}}}}\) of length a + e before being fed into the next layer:

$$ {\varvec{o}}_{{{\text{FM}}}} = \left[ {{\varvec{o}}_{1} ,{\varvec{o}}_{2} } \right], $$

where \({\varvec{o}}_{1} = \left[ {o_{1}^{\left( 1 \right)} ,o_{1}^{\left( 2 \right)} , \ldots ,o_{1}^{\left( a \right)} } \right]\) is the first-order vector and \({\varvec{o}}_{2} = \left[ {o_{2}^{\left( 1 \right)} ,o_{2}^{\left( 2 \right)} , \ldots ,o_{2}^{\left( e \right)} } \right]\) is the second-order vector. The elements of \({\varvec{o}}_{1}\) and \({\varvec{o}}_{2}\) can be calculated as (17) and (18) respectively:

$$ o_{1}^{\left( i \right)} = \left\{ {\begin{array}{*{20}c} {w_{j} ,} & {x_{j} \in {\varvec{x}}_{{{\text{field}}_{i} }}^{^{\prime}} , x_{j} \ne 0 \forall i = 1,2, \ldots ,c} \\ {w_{j} x_{j} ,} & {x_{j} = x_{{{\text{field}}_{i} }} \forall i = c + 1, \ldots ,a} \\ \end{array} } \right., $$
(17)
$$ o_{2}^{\left( f \right)} = \frac{1}{2}\left( {\left( {\mathop \sum \limits_{i = 1}^{d} v_{i,f} x_{i} } \right)^{2} - \mathop \sum \limits_{i = 1}^{d} v_{i,f}^{2} x_{i}^{2} } \right),\quad \forall f \in \left\{ {1,2, \ldots ,e} \right\}. $$
(18)

To be specific, \({\varvec{o}}_{1}\) is an a-dimensional vector and each element corresponds to one field of features. For category features, \(o_{1}^{\left( i \right)}\) equals the weight related to the dimension with a non-zero value under one-hot encoding; for continuous features, \(o_{1}^{\left( i \right)}\) equals the corresponding first-order term. \({\varvec{o}}_{2}\) is an e-dimensional vector inspired by (16), each element of which is contributed by all features and one dimension of each latent vector.

  2) Embedding

To reduce the dimensionality of sparse data, the embedding layer compresses the one-hot vectors into dense vectors; to balance the contributions of category and continuous features, the embedding layer also maps each continuous feature to a vector. In mDeepFM, the latent vectors of the FM are employed to produce the embeddings, as in the DeepFM. The embedding for each field of features is computed as follows:

$$ {\varvec{e}}_{i} = \left\{ {\begin{array}{*{20}c} {{\varvec{v}}_{j} , } & {x_{j} \in {\varvec{x}}_{{{\text{field}}_{i} }}^{^{\prime}} , x_{j} \ne 0 \forall i = 1,2, \ldots ,c} \\ {x_{j} {\varvec{v}}_{j} ,} & {x_{j} = x_{{{\text{field}}_{i} }} \forall i = c + 1, \ldots ,a} \\ \end{array} } \right.. $$
(19)
  3) Deep component

The deep component is a fully-connected feed-forward network that models high-order feature interactions. The input of the deep component is the output of the embedding layer, \({\varvec{a}}_{{\text{D}}} \left( 0 \right) = \left[ {{\varvec{e}}_{1} ,{\varvec{e}}_{2} , \ldots ,{\varvec{e}}_{a} } \right]\). The output of each hidden layer can be calculated as \({\varvec{a}}_{{\text{D}}} \left( {l + 1} \right) = \sigma \left( {{\varvec{W}}_{{\text{D}}} \left( l \right){\varvec{a}}_{{\text{D}}} \left( l \right) + {\varvec{b}}_{{\text{D}}} \left( l \right)} \right)\), \(0 \le l < H_{{\text{D}}}\), where \({\varvec{W}}_{{\text{D}}} \left( l \right)\) and \({\varvec{b}}_{{\text{D}}} \left( l \right) \) are the weight matrix and bias vector of the l-th layer, respectively, and \(H_{{\text{D}}}\) is a parameter representing the number of hidden layers in the deep component. The output of the deep component is as follows:

$$ {\varvec{o}}_{{{\text{DNN}}}} = {\varvec{a}}_{{\text{D}}} \left( {H_{{\text{D}}} } \right). $$
(20)
  4) Transformation component

The transformation component is also a fully-connected feed-forward network, which further learns the correlations between low-order and high-order features; it concatenates the outputs of the FM \({\varvec{o}}_{{{\text{FM}}}}\) and the deep layer \({\varvec{o}}_{{{\text{DNN}}}}\) as its input \({\varvec{a}}_{{\text{T}}} \left( 0 \right) = \left[ {{\varvec{o}}_{{{\text{FM}}}} ,{\varvec{o}}_{{{\text{DNN}}}} } \right]\). The transformation layer is composed of multiple hidden layers and the output of each layer is \({\varvec{a}}_{{\text{T}}} \left( {l + 1} \right) = \sigma \left( {{\varvec{W}}_{{\text{T}}} \left( l \right){\varvec{a}}_{{\text{T}}} \left( l \right) + {\varvec{b}}_{{\text{T}}} \left( l \right)} \right)\), \(0 \le l < H_{{\text{T}}}\), where \(H_{{\text{T}}}\) is the number of hidden layers in the transformation component. The output of each hidden layer is fed into the next layer, which has fewer neurons, so that the outputs are gradually compressed into a vector whose length equals the number of labels. The output of the transformation layer \({\varvec{a}}_{{\text{T}}} \left( {H_{{\text{T}}} } \right)\) is finally fed into a sigmoid function to obtain the predictions as (21):

$$ \hat{\varvec{Y}}_{{{\text{mDeepFM}}}} = {\text{sigmoid}}\left( {{\varvec{a}}_{{\text{T}}} \left( {H_{{\text{T}}} } \right)} \right). $$
(21)

In accordance with the XGBoost approach, the operator with the largest predicted probability is adopted as the final operator to break the ties at the current time T.
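A compact PyTorch sketch of the transformation component under the layer sizes used later in the experiments (256-128-64-32-5); the module name and input dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class TransformationHead(nn.Module):
    """Compress [o_FM, o_DNN] into one probability per tie-breaking operator."""
    def __init__(self, fm_dim, dnn_dim, n_ops=5):
        super().__init__()
        dims = [fm_dim + dnn_dim, 256, 128, 64, 32, n_ops]
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers[-1] = nn.Sigmoid()        # Eq. (21): sigmoid on the final layer
        self.net = nn.Sequential(*layers)

    def forward(self, o_fm, o_dnn):
        return self.net(torch.cat([o_fm, o_dnn], dim=-1))

# usage: probs = head(o_fm, o_dnn); best_op = int(probs.argmax(dim=-1))
```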

Experiment

In this section, offline and online experiments are conducted to evaluate the performance of the proposed algorithm. As mentioned before, the best operator is selected and executed at each time T so that the all-day performance of the dispatching system can be better than when using a single operator. However, since the predictions of the classifiers cannot be accurate all the time, wrong predictions deteriorate the all-day performance compared with an ideal classifier. Therefore, it is important to investigate whether the MAATS outperforms the best-matching with a single tie-breaking operator, which is the principal intention of the offline experiments. However, the offline experiments only serve as preliminary simulations to demonstrate the effectiveness of MAATS for solving the OFDP; they cannot reflect the performance of the proposed algorithm in practice due to the changing and uncertain real situation. Therefore, rigorous online A/B tests are conducted on the Meituan dispatching system to investigate whether the proposed algorithm is of practical value.

Offline preliminary simulation

Settings of offline experiment

  1) Datasets and parameters

The historical data of the Meituan platform are used to generate the training set and the test set. The data are selected from the whole month of July 2020 in 21 cities based on the daily number of orders (denoted as DNO), including 4 large cities (DNO ≥ 250,000), 12 medium cities (100,000 ≤ DNO < 250,000), and 5 small cities (DNO < 100,000), yielding 2,580,805 instances in total. Each instance is related to a static optimization problem at a specific time T. The many-to-one best-matching instances are adopted as the training and test sets (Table 2). Each many-to-one best-matching instance is run 5 times for each operator independently, and the operators that yield the minimum ADC are defined as the labels.

Table 2 Information of data sets
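A minimal sketch of this multi-label construction is given below. Aggregating the five independent runs by their mean ADC, as well as the function and field names, are our assumptions for illustration; the paper states only that each operator is run five times and that the operators yielding the minimum ADC become positive labels.

```python
# Hedged sketch of the label construction: operators attaining the minimum
# ADC on an instance are marked positive. Aggregating the five runs by
# their mean is an assumption; all names are illustrative.
import statistics

OPERATORS = ["MIN", "MAX", "MINT", "MIND", "REG"]

def make_labels(adc_runs, tol=1e-9):
    """adc_runs: {operator: [ADC of 5 independent runs]} for one instance."""
    mean_adc = {op: statistics.mean(runs) for op, runs in adc_runs.items()}
    best = min(mean_adc.values())
    # Multi-label target: 1 for every operator tied at the minimum ADC.
    return {op: int(mean_adc[op] - best <= tol) for op in OPERATORS}
```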

The main parameters are set as follows. For calculating the penalty of tardiness, \(\theta\), \({\Theta }\), \(\kappa\) and \(\sigma\) are empirically set to 0.06, 20, 8 and 136, respectively. For XGBoost, max_depth (controlling the tree depth), min_child_weight (controlling the splitting) and learning_rate (controlling the step shrinkage) are set to 5, 4 and 0.1, respectively. For mDeepFM, the ReLU function is taken as the activation function, the embedding length is set to e = 10, the structure of the deep component is 512-512-512, the structure of the transformation component is 256-128-64-32-5, and the four spatial–temporal features are treated as categorical features while the remaining features are continuous.
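For concreteness, the XGBoost configuration might look as follows. Training one binary classifier per tie-breaking operator (a one-vs-rest decomposition) is our assumption, consistent with the per-operator AUCs reported later; the variable names are illustrative.

```python
# Hedged sketch of the XGBoost setup with the stated hyper-parameters.
# One binary model per operator (one-vs-rest) is an assumption.
from xgboost import XGBClassifier

OPERATORS = ["MIN", "MAX", "MINT", "MIND", "REG"]

models = {
    op: XGBClassifier(
        max_depth=5,          # controls the tree depth
        min_child_weight=4,   # controls the splitting
        learning_rate=0.1,    # controls the step shrinkage
        eval_metric="auc",
    )
    for op in OPERATORS
}

# Per-operator training on the multi-label targets sketched above:
# models[op].fit(X_train, y_train[op])
```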

2) Evaluation metrics

  • Metric of classification

The area under the ROC (receiver operating characteristic) curve (AUC) is taken as the metric to evaluate the prediction accuracy of the models; it is a widely used metric in the field of binary classification. The value of AUC lies between 0 and 1, and a larger AUC indicates more accurate predictions.

  • Metric of optimization and observation

To test the effectiveness of the adaptive tie-breaking strategy, the best-matching heuristic combined with each single tie-breaking operator, as well as with a random selection of tie-breaking operators, is taken as a comparative algorithm, which yields BM-MIN, BM-MAX, BM-MINT, BM-MIND, BM-REG and BM-RAND. The two versions of MAATS with different ML models are denoted as MAATS-XGB and MAATS-mDeepFM. To evaluate the optimization performance for solving the OFDP, the objective ADC in (1) is taken as the optimization metric. Since delivery efficiency is hard to infer from ADC alone, we define the following two observation metrics to evaluate the delivery efficiency from the perspectives of distance and time:

$$ {\text{AID}} = \frac{1}{n}\mathop \sum \limits_{j = 1}^{m} \left( {{\text{DC}}_{j}^{n} - {\text{DC}}_{j}^{o} } \right), $$
(22)
$$ {\text{ACT}} = \frac{1}{N}\mathop \sum \limits_{j = 1}^{m} \left( {{\text{LLT}}_{j} - {\text{FAT}}_{j} } \right), $$
(23)

where \({\text{DC}}_{j}^{n}\) and \({\text{DC}}_{j}^{o}\) are the distance costs of the new route and the old route, respectively, for rider qj, \({\text{LLT}}_{j}\) is the time when qj leaves the last node, \({\text{FAT}}_{j}\) is the time when qj arrives at the first node, n is the number of new orders and N is the number of all orders. AID represents the average increase in route distance when a rider is dispatched a new order. ACT represents the average time consumed by a rider to serve an order.

The relative percentage deviation (RPD) is adopted to evaluate the performance of an algorithm under each of these metrics:

$$ {\text{RPD}}_{{{\text{Metric}}}} \left( {{\text{alg}}} \right) = \frac{{{\text{Metric}}_{{{\text{alg}}}} - {\text{Metric}}_{{{\text{bst}}}} }}{{{\text{Metric}}_{{{\text{bst}}}} }} \times 100, $$
(24)

where \({\text{Metric}} \in \left\{ {{\text{ADC}},{\text{AID}},{\text{ACT}}} \right\}\), \({\text{Metric}}_{{{\text{alg}}}}\) is the \({\text{Metric}}\) value of a given algorithm and \({\text{Metric}}_{{{\text{bst}}}}\) is the best \({\text{Metric}}\) value obtained among all algorithms. A smaller RPD indicates better performance.
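A minimal sketch of these offline metrics is given below; the per-rider record fields and the reference to scikit-learn for the AUC are our assumptions for illustration.

```python
# Hedged sketch of the offline evaluation metrics; the per-rider record
# fields are illustrative assumptions.

def aid(riders, n_new_orders):
    """Average increased distance per new order, Eq. (22)."""
    return sum(r["dc_new"] - r["dc_old"] for r in riders) / n_new_orders

def act(riders, n_all_orders):
    """Average consumed time per order, Eq. (23)."""
    return sum(r["llt"] - r["fat"] for r in riders) / n_all_orders

def rpd(metric_alg, metric_best):
    """Relative percentage deviation from the best algorithm, Eq. (24)."""
    return (metric_alg - metric_best) / metric_best * 100

# Classification accuracy can be computed per operator, e.g. with
# sklearn.metrics.roc_auc_score(y_true[op], y_prob[op]).
```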

Results of the offline experiment

1) Distribution of the data set

Before testing the performance of the proposed algorithm, we investigate the distribution of the data sets to support the analysis of the subsequent experimental results. The data sets are grouped according to the numbers of critical riders and candidate orders. The number of instances in each group is counted, and the proportion of instances that each group accounts for is illustrated in Figs. 6 and 7. From Fig. 6, it is clear that most many-to-one best-matchings occur with fewer than 100 critical riders; moreover, the instances with 1–10 critical riders account for nearly 60% of all instances. From Fig. 7, it can be seen that most many-to-one best-matchings occur with fewer than 100 candidate orders; the instances with fewer than 10 candidate orders account for nearly 40% of all instances, and those with 11–100 candidate orders account for nearly 50%.

Fig. 6
figure 6

Data distribution by the number of critical riders

Fig. 7
figure 7

Data distribution by the number of candidate orders

Figure 8 illustrates the distribution of the different tie-breaking operators under different numbers of critical riders and candidate orders. For each operator, the instances are classified into positive and negative ones: if the operator performs best on an instance, the instance is marked as positive for this operator, and otherwise as negative. A larger proportion of positive instances means that the corresponding operator is the best operator more frequently. On this basis, the performances of the operators can be ranked as REG > MAX > MINT > MIN \(\approx\) MIND, which reveals that the REG operator outperforms the other operators in general.

2) Importance of features

Fig. 8
figure 8

Distribution of different operators

We analyze the contribution of each type of feature based on the gain in (14), which is denoted as the “Score”; the larger the score, the more important the feature. The scores of the top 50 features are illustrated in Fig. 9 in descending order. It can be seen that a spatial–temporal feature (City id) ranks first, which indicates that different regions require different operators. All rider-number and order-number features rank in the top 50, which reveals the necessity of incorporating the basic problem-characteristic information. Besides, the operator-oriented features exhibit remarkable importance, accounting for nearly half of the top 50 features. Moreover, quite a number of aggregate features appear in the top 50. To sum up, the results indicate that the design of problem-specific features is effective.

3) Prediction accuracy of the models

Fig. 9
figure 9

Importance of features

The AUCs of XGBoost and mDeepFM in predicting each operator are listed in Table 3. It can be seen that the prediction performance of mDeepFM is slightly better than that of XGBoost. Besides, the AUCs vary across operators for both models, while the AUCs of the two models on any specific operator are close. This phenomenon can be explained by Fig. 8. Models naturally tend to be biased toward the class with more instances in order to gain accuracy; in other words, the more unbalanced the numbers of positive and negative instances, the more easily the models distinguish the classes. Therefore, the models predict the performances of the MIN and MINT operators more accurately, since their ratios of positive to negative instances are the most unbalanced, whereas it is more difficult to predict the performance of the REG operator, since its ratio is close to 0.5.

4) Performance of the algorithms

Table 3 AUC on predicting each tie-breaking operator

The RPDs under the different metrics are listed in Tables 4, 5, 6, 7, 8 and 9. To investigate the performances of the algorithms under different numbers of critical riders (denoted as mcr) and candidate orders (denoted as nco), the RPDs are grouped by mcr and nco based on the data distribution discussed above. Each cell in the tables is the average RPD on the corresponding group of instances. Tables 4 and 5 show that BM-REG performs best among all the comparative algorithms, which accords with the analysis of Fig. 8. Besides, MAATS-mDeepFM outperforms MAATS-XGB on most instances, especially when mcr and nco are small, mainly because the prediction accuracy of mDeepFM is better than that of XGBoost according to Table 3. The average RPDs of both MAATS-mDeepFM and MAATS-XGB are smaller than those of the comparative algorithms on the instances with small mcr and nco. However, as mcr and nco increase, the performances of both methods approach that of BM-REG. This is because the unbalanced positive and negative instances cause the models to overfit to the REG operator, especially for XGBoost. Nevertheless, the overall average RPDs in the last line show that both adaptive tie-breaking algorithms surpass all other algorithms, which demonstrates the effectiveness of the adaptive tie-breaking strategy. Besides, the random selection of operators also performs better than any single operator on some instances with small mcr and nco. Therefore, it can be concluded that the hybrid and adaptive utilization of multiple tie-breaking operators is superior to the utilization of a single operator.

Table 4 RPDs on ADC grouped by the number of critical riders
Table 5 RPDs on ADC grouped by the number of candidate orders
Table 6 RPDs on AIDs grouped by the number of critical riders
Table 7 RPDs on AIDs grouped by the number of candidate orders
Table 8 RPDs on ACTs grouped by the number of critical riders
Table 9 RPDs on ACTs grouped by the number of candidate orders

The RPDs of the observation metrics are listed in Tables 6, 7, 8 and 9. Tables 6 and 7 show that MAATS-mDeepFM obtains all the best average RPDs in terms of the average increased distance, which means that MAATS-mDeepFM saves more travel distance than the others when dispatching new orders. However, in Tables 8 and 9, an interesting result is that BM-MAX achieves the best average RPDs on the average consumed time. One possible explanation is that the MAX operator is capable of balancing the burdens of the riders, thus shortening the average travel time of each rider. To be specific, dispatching the candidate order with the largest cost to a critical rider may impose a high burden on this rider; thus, in the next loop, this rider is unlikely to be dispatched any new orders, so the remaining candidate orders are dispatched to other riders. Nevertheless, the gaps on ACT between BM-MAX and the two adaptive tie-breaking algorithms are much smaller than those between BM-MAX and the other algorithms, whereas on ADC and AID, BM-MAX is significantly inferior to the two adaptive tie-breaking algorithms. Therefore, the two adaptive tie-breaking algorithms still outperform the other algorithms in general. From another viewpoint, the OFDP is in fact a multi-objective optimization problem, so it is natural that the metrics conflict with each other.

Online A/B test

According to the analysis of Fig. 8 and the results in Tables 4, 5, 6, 7, 8 and 9, BM-REG performs best among all comparative algorithms in general, and MAATS-mDeepFM is the better of the two adaptive tie-breaking methods on a majority of instances. Therefore, BM-REG and MAATS-mDeepFM are chosen as the representative methods for the online A/B test.

Table 10 Results of online A/B test in City 1

Settings of online A/B test

An A/B test is a commonly used method to compare two versions that differ in a single variable. In this paper, the way of using tie-breaking operators is taken as the variable, while BM-REG and MAATS-mDeepFM share all other algorithmic settings. The online A/B test is rigorously set up as follows. Two cities are selected, each of which is split into a comparison region (C-region) and an experimental region (E-region). To make a fair comparison, the two regions are generated to be as similar as possible with respect to the following metrics during a period of time (denoted as the comparison dates). We implement BM-REG in the C-region and MAATS-mDeepFM in the E-region for a period of experimental time (denoted as the experimental dates). The test starts immediately after the comparison dates, and the experimental dates last as long as the comparison dates. The online observation metrics are categorized into two types as follows.

1) Rider efficiency-based metrics

  • Average normalized travel distance:

$$ {\text{antd}} = \frac{1}{\left| O \right|}\mathop \sum \limits_{o\left( i \right) \in O} \frac{{{\text{ad}}_{i} }}{{{\text{nd}}_{i} }}, $$
(25)
  • Average delivery time (unit: minute):

    $$ {\text{adt}} = \frac{1}{\left| O \right|}\mathop \sum \limits_{o\left( i \right) \in O} \left( {{\text{dt}}_{i} - {\text{rt}}_{i} } \right), $$
    (26)
2) Customer satisfaction-based metrics

  • Punctual rate:

    $$ {\text{pr}} = \frac{{\left| {O_{{{\text{punc}}}} } \right|}}{\left| O \right|} \times 100\% , $$
    (27)
  • Proportion of 55-min orders:

    $$ {\text{pp}}_{55} = \frac{{\left| {O_{55} } \right|}}{\left| O \right|} \times 100\% , $$
    (28)
  • 15-min overtime rate:

    $$ {\text{or}}_{15} = \frac{{\left| {O_{{15^{ + } }} } \right|}}{\left| O \right|} \times 100\% . $$
    (29)

In the above, \(O\) is the set of all orders during a day, \({\text{ad}}_{i}\) is the actual travel distance of the rider to deliver order \(o\left( i \right)\), \({\text{nd}}_{i}\) is the navigation travel distance to deliver \(o\left( i \right)\), \({\text{dt}}_{i}\) is the time when \(o\left( i \right)\) is delivered to the customer, \({\text{rt}}_{i}\) is the time when \(o\left( i \right)\) is received by the platform, \(O_{{{\text{punc}}}}\) is the set of orders delivered punctually, \(O_{55}\) is the set of orders satisfying \({\text{dt}}_{i} - {\text{rt}}_{i} > 55 \,{\text{min}}\), and \(O_{{15^{ + } }}\) is the set of orders delayed by more than 15 min.
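A minimal sketch of these online metrics is given below. The order record fields, and the use of the committed ETA to decide punctuality and delay, are our assumptions for illustration.

```python
# Hedged sketch of the online observation metrics, Eqs. (25)-(29).
# Record fields and the ETA-based punctuality check are assumptions;
# all time fields are taken to be in minutes.
def online_metrics(orders):
    n = len(orders)
    antd = sum(o["ad"] / o["nd"] for o in orders) / n              # Eq. (25)
    adt = sum(o["dt"] - o["rt"] for o in orders) / n               # Eq. (26)
    pr = 100 * sum(o["dt"] <= o["eta"] for o in orders) / n        # Eq. (27)
    pp55 = 100 * sum(o["dt"] - o["rt"] > 55 for o in orders) / n   # Eq. (28)
    or15 = 100 * sum(o["dt"] - o["eta"] > 15 for o in orders) / n  # Eq. (29)
    return antd, adt, pr, pp55, or15
```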

Results of online A/B tests

The results of the A/B tests are shown in Tables 10 and 11. Although the metrics of the C-region and the E-region are kept as close as possible, differences between the two regions still exist. We denote the difference in a metric between the E-region and the C-region during the comparison dates as the inherent difference (\(\Delta_{{{\text{inh}}}}\)) and the difference during the experimental dates as the experimental difference (\(\Delta_{{{\text{exp}}}}\)). To exclude the influence of the inherent differences, each experimental difference is revised by subtracting the inherent difference, which yields the final result (\(\Delta_{{{\text{real}}}}\)). The final results show that MAATS-mDeepFM outperforms BM-REG on all online observation metrics in both cities. Therefore, it can be concluded that the proposed algorithm has practical significance in improving customer satisfaction and delivery efficiency for the platform.
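This correction is a simple difference-in-differences; a minimal sketch (variable names are illustrative):

```python
# Difference-in-differences correction used in the A/B test: the inherent
# regional difference (comparison dates) is subtracted from the difference
# observed during the experimental dates.
def real_difference(e_exp, c_exp, e_cmp, c_cmp):
    delta_exp = e_exp - c_exp      # experimental difference
    delta_inh = e_cmp - c_cmp      # inherent difference
    return delta_exp - delta_inh   # final result, delta_real
```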

Table 11 Results of online A/B test in City 2

Conclusions

This paper addresses an online food delivery problem derived from a real online food delivery platform, Meituan. To cope with the dynamism and urgency, the problem is transformed into a static optimization problem. An effective matching algorithm with an adaptive tie-breaking strategy is proposed for solving the problem. A best-matching heuristic is embedded to quickly generate a partial solution, possibly with ties, while guaranteeing quality. To construct a complete solution of high quality, multiple tie-breaking operators are designed with different intentions. To fit different scenarios, the best operator is adaptively utilized to break ties each time the static optimization problem is solved. The selection of the best operator is converted into a multi-label classification problem, which is solved by specially designed machine learning methods. Offline simulations are conducted on real historical data from Meituan, which demonstrate the effectiveness of the proposed algorithm for solving the OFDP. Online A/B tests are carried out in real-world applications, which validate the practical value of the proposed algorithm in improving customer satisfaction and delivery efficiency. Moreover, the experimental results also reveal the success of the adaptive mechanism that utilizes hybrid operators, the problem-specific strategy that extracts useful knowledge, and the incorporation of ML techniques to assist decision making in a complex optimization system.

In future work, we will further study how to improve the performance of the tie-breaking operators. It is also interesting to develop more efficient and effective algorithms for solving the online food delivery problem by introducing other optimization methods, such as genetic programming, or other ML technologies, such as deep reinforcement learning. Besides, it is valuable to consider the uncertainties arising in the online food delivery process, such as uncertain food preparation times, delivery times, and rider behaviors.