Abstract

Previous studies have shown that training a reinforcement learning model for the sorting problem takes a very long time, even for small sets of data. To study whether transfer learning can improve the training process of reinforcement learning, we employ Q-learning as the base reinforcement learning algorithm, use the sorting problem as a case study, and assess the performance from two aspects: the time expense and the brain capacity. We compare the total number of training steps between the nontransfer and transfer methods to study their efficiency and evaluate their differences in brain capacity (i.e., the percentage of updated Q-values in the Q-table). According to our experimental results, the difference in the total number of training steps becomes smaller as the number of elements to be sorted increases. Our results also show that the brain capacities of transfer and nontransfer reinforcement learning are similar once both reach a similar training level.

1. Introduction

Reinforcement learning (RL) aims at learning policies that map states to actions in order to maximize the expected accumulated reward and reach the goal. In contrast to supervised learning approaches, where models are trained on an input set paired with a given output set, an RL agent has to interact with the environment and learn from these experiences through trial and error to yield the optimal behaviour.

Mathematically, RL can be formulated as a Markov decision process (MDP), which is a framework for modelling decision-making problems [1]. An MDP is represented by the tuple <S, A, T, R>, where S denotes the state space of the environment and A is the set of actions that can be taken in a given state. The transition function T is defined as T(s, a, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a), which indicates the probability of reaching the next state s′ at time step t + 1 given the current state s and the action a taken at time step t. The function R is a reward scheme used to assign a score to the action performed in state s and serves as guidance for the agent to produce suitable behaviours. The objective of the RL agent is then to learn a policy π_θ(a | s), parameterized by θ, which tells the agent the best action to perform while the environment is in state s. In general, there are two main approaches to solving RL problems: model-based and model-free learning. In model-based approaches, the goal is to learn a model of the environment and obtain the optimal policy by relying on past transitions. Model-free approaches, on the other hand, learn the optimal policy directly through trial-and-error interactions without modelling the underlying environment. Model-based approaches are often sample-efficient, but the requirement of specifying a model of a real-world task is often restrictive and difficult to satisfy. Therefore, the model-free approach is commonly preferred over the model-based approach when it is not hard to sample trajectories [2, 3]. Q-learning [4] and SARSA [5] are two well-known model-free RL algorithms that fit the optimal policy by learning the action-value (Q-value) function, where the action-value function Q(s, a) expresses the expected reward for each state-action pair (s, a). In recent years, since the development of deep learning methods has gained significant attention and achieved innovations in many fields, it has become common to adopt deep learning methods in RL algorithms in order to boost performance. A combination of the convolutional neural network (CNN) [6] and Q-learning, called deep Q-networks (DQN) [7, 8], was proposed to handle large state-action spaces. DQNs have been shown to reach or even exceed human-level performance on many games. An alternative double estimator method, double Q-learning [9], was introduced to reduce the overestimation of action values in the Q-learning algorithm. Since double Q-learning was proposed in a tabular setting and the DQN algorithm suffers from overestimation, double DQN combines double Q-learning and DQN to support large-scale function approximation and reduce overestimation [10-12].

RL algorithms usually require a large amount of trial and error and many learning iterations to determine an effective policy in a very large state-action space, making them very time-consuming. Recently, there has been strong interest in developing deep learning models with the ability to transfer experiences across similar tasks. The two representative types of methods are the transfer of trained models and the transfer of learned knowledge [13]. The first type of method transfers neural network layers from the pretrained model to the target model [14, 15], whereas the second type aims at transferring learned knowledge from the trained network to the target network [16, 17]. A Q-learning-based approach has been applied to the sorting problem [18]. However, it takes a large number of training steps to finish the training process, even for small sets of data. Since transfer learning has been widely adopted to speed up training, this motivates us to devise a transfer scheme and compare its training performance with the nontransfer method. In this paper, we conduct a series of experiments using the sorting problem as a case study. We transfer the knowledge learned from task n to task n + 1, where n is the number of elements to be sorted, and continue to train the model with a Q-learning-based method. The total number of training steps and the size of the brain capacity, which denotes the knowledge stored in the Q-table, are the two metrics used to measure the impact of the transfer learning technique.

The rest of this paper is organized as follows. Section 2 reviews the background and related work of this paper. Section 3 describes our training strategies and detailed methodology. Experimental setup and results are presented and discussed in Section 4. In Section 5, we discuss conclusions and future work.

2. Background and Related Work

In this section, we first give an overview of Q-learning, which is the base RL algorithm of this paper. The application of RL to the sorting problem is discussed as well.

Q-learning, a model-free method, is one of the best-known RL algorithms and was initially designed for Markov decision processes. It updates the Q-value with the following rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],  (1)

where Q(s_t, a_t) is the action-value function that computes the expected reward of a state-action pair at time step t, α is the learning rate, γ is the discount factor, and r_{t+1} is the reward obtained after selecting action a_t in state s_t. The max operator in the update rule indicates that the agent evaluates the best action a by taking the maximum Q-value over the next state s_{t+1}. A mechanism that exploits the maximum Q-value while updating is called off-policy, i.e., the action actually taken and the action used in the update target do not follow the same policy. In contrast, SARSA updates the Q-value based on the policy being followed, using the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)].  (2)

When the algorithm uses the same mechanism for the behaviour policy (i.e., the policy that selects a_{t+1}) and the estimation policy (i.e., the policy used in the update target), it is called on-policy [19].
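To make the distinction concrete, the following Python sketch contrasts the two update rules on a dictionary-based Q-table; the function names and the defaultdict representation are our illustrative choices and not part of the original implementation.

from collections import defaultdict

# Q-table represented as a mapping from (state, action) pairs to Q-values.
Q = defaultdict(float)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    # Off-policy (equation (1)): bootstrap from the maximum Q-value of the
    # next state, regardless of which action the behaviour policy takes next.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.05, gamma=0.9):
    # On-policy (equation (2)): bootstrap from the action a_next that the
    # behaviour policy (e.g., epsilon-greedy) has actually selected.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])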

The sorting problem is a quintessential computer science task and has been applied in many fields since its emergence. An analysis of comparison-based sorting algorithms shows that the computation requires Ω(n log n) comparisons in the worst case. An RL-based approach, which applies stability and resiliency ideas from feedback control, was proposed to overcome the limitations of errors and early program termination in traditional computing [20]. An empirical exploration compared the RL model with two traditional sorting algorithms and showed that the RL sorting model completes the task with fewer array manipulations. In order to investigate the effect of two different reward schemes, immediate reward and pure delayed reward, a Q-learning algorithm was implemented to compare the total number of training steps and the average number of sorting steps [18]. A case study of the sorting problem concluded that immediate reward takes far fewer steps to finish the task.

3. Methodology and Learning Design

In this section, we describe the important features of our proposed methodology, including the training level and the brain capacity. We also discuss how we design the RL algorithm to formulate the sorting problem in an RL setting.

3.1. RL-Based Setting for Sorting Problem

We model each state, which consists of n elements, as the list of numbers to be sorted; hence, there are n! possible states, denoted by Sn. For any state s at time step t, an action a(i, j) is defined as the swap of the values in position i and position j. Thus, there are n(n − 1)/2 possible actions in the action set An. Once an action a(i, j) is chosen in state s_t, the next state s_{t+1} is determined by exchanging the element in position i with the element in position j of the state s_t. For example, assuming the initial state is s_t = [1, 3, 4, 2] and the action a(2, 4) is performed, this state-action pair results in the next state s_{t+1} = [1, 2, 4, 3].
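As a concrete illustration of this formulation, the following Python sketch enumerates the state and action sets and applies a swap action; the helper names and the 0-indexed positions are our own choices rather than details from the original implementation.

from itertools import combinations, permutations

def state_space(n):
    # All n! permutations of {1, ..., n}; each state is stored as a tuple.
    return list(permutations(range(1, n + 1)))

def action_set(n):
    # All n(n - 1)/2 unordered position pairs (i, j); positions are 0-indexed.
    return list(combinations(range(n), 2))

def apply_action(state, action):
    # Swapping the elements at positions i and j yields the next state.
    i, j = action
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

# Example: apply_action((1, 3, 4, 2), (1, 3)) returns (1, 2, 4, 3),
# i.e., the action a(2, 4) of the text expressed in 0-indexed form.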

As the previous study [18] suggests that immediate reward performs better than pure delayed reward, we use the immediate reward scheme in this research. We assign the reward by considering whether the action actually improves the number of elements in the correct position. A similarity value is introduced to measure the similarity between the current state and the goal state Sgoal (i.e., the sorted list) as follows:

sim(s) = Σ_{i=1}^{n} eq(s[i], Sgoal[i]),  (3)

where the function eq(s[i], Sgoal[i]) returns one if the two states have the same value at position i and zero otherwise. We then compute the difference between sim(s_{t+1}) and sim(s_t) to assign the reward as follows:

r_{t+1} = r_positive, if sim(s_{t+1}) − sim(s_t) > 0,
          r_zero,     if sim(s_{t+1}) − sim(s_t) = 0,  (4)
          r_negative, if sim(s_{t+1}) − sim(s_t) < 0.

In this paper, r_positive is 1, r_zero is 0, and r_negative is −1. For the aforementioned example, since s_t = [1, 3, 4, 2] receives a similarity value of 1 and s_{t+1} = [1, 2, 4, 3] receives a value of 2, a reward of 1 is given.
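A direct Python rendering of equations (3) and (4), reusing the helpers above, might look as follows; the function and parameter names are our illustrative choices.

def similarity(state, goal):
    # Equation (3): count the positions at which state and goal agree.
    return sum(1 for x, g in zip(state, goal) if x == g)

def reward(state, next_state, goal, r_positive=1, r_zero=0, r_negative=-1):
    # Equation (4): the sign of the change in similarity determines the reward.
    diff = similarity(next_state, goal) - similarity(state, goal)
    if diff > 0:
        return r_positive
    if diff < 0:
        return r_negative
    return r_zero

# For the example above, similarity((1, 3, 4, 2), (1, 2, 3, 4)) == 1 and
# similarity((1, 2, 4, 3), (1, 2, 3, 4)) == 2, so a reward of 1 is given.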

3.2. Learning Algorithm

The objective of the learning algorithm is to sort a given example consisting of n numbers over a series of episodes until the success rate reaches a predefined threshold. Algorithm 1 (RL_Sort) shows how we execute model training on one training instance based on the Q-learning algorithm. The algorithm takes a list Straining and a Q-table Qn[Sn, An] as inputs and produces an updated Q-table and the number of training steps as output. RL_Sort begins with the initialization of upper_bound, train_steps, and success_rate. The variable upper_bound defines the maximum allowed number of swaps for sorting, and we set n + 1 as the threshold. The variable train_steps stores the number of episodes spent on training. The variable success_rate is the criterion for terminating the training process and is set to 0.75 in our experiments. Sgoal is the correct sorting result. The experimental parameters are as follows: α = 0.05, γ = 0.9, and ε = 0.85. In each episode, from line 11 to line 31, the model chooses an action for the current state s based on ε-greedy [21] and receives a new state s′ (lines 12∼13). There are two conditions under which the episode ends. In the first, s′ is Sgoal and a positive reward (reward_win = 1) is given (lines 16∼18). In the second, the number of swaps exceeds upper_bound and a negative reward (reward_lose = −1) is received. Since the first condition reaches a success state, we examine the success rate over the latest 100 episodes to determine whether the training process should stop or a new episode should begin. In the cases where the current episode continues (lines 23∼28), the Q-table is updated with the reward given by equation (4).

input: Straining, Qn[Sn, An]
output: Qn[Sn, An], train_steps
(1) initialize
(2)  upper_bound = n + 1
(3)  train_steps = 0
(4)  success_rate = 0.75
(5)  Sgoal = [1, 2, ..., n]
(6) repeat
(7)  end = FALSE
(8)  swap_times = 0
(9)  s = Straining
(10)  current_rate = 0
(11)  repeat
(12)   Select an action a based on ε-greedy
(13)   Perform the action a and observe s′ and the corresponding reward
(14)   swap_times = swap_times + 1
(15)   if (s′ is Sgoal) then
(16)    Qn[s, a] ⟵ Qn[s, a] + α × (reward_win − Qn[s, a])
(17)    end = TRUE
(18)    Check the success rate for the latest 100 episodes and assign it to current_rate
(19)   elseif (swap_times > upper_bound) then
(20)    Qn[s, a] ⟵ Qn[s, a] + α × (reward_lose − Qn[s, a])
(21)    end = TRUE
(22)   else
(23)    if (dist(s′, Sgoal) > dist(s, Sgoal)) then
(24)     Qn[s, a] ⟵ Qn[s, a] + α × (reward_lose + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(25)    elseif (dist(s′, Sgoal) < dist(s, Sgoal)) then
(26)     Qn[s, a] ⟵ Qn[s, a] + α × (reward_win + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(27)    else
(28)     Qn[s, a] ⟵ Qn[s, a] + α × (0 + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(29)    s ⟵ s′
(30)  until end is TRUE
(31)  train_steps = train_steps + 1
(32) until current_rate >= success_rate
(33) return Qn[Sn, An], train_steps
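For readers who prefer executable code, the following Python sketch mirrors the structure of Algorithm 1, reusing action_set, apply_action, and reward from the earlier sketches; the epsilon_greedy helper, the deque-based success window, and the use of the similarity-based reward of equation (4) in place of the dist test are our assumptions, not details fixed by the pseudocode.

import random
from collections import defaultdict, deque

ALPHA, GAMMA, EPSILON = 0.05, 0.9, 0.85

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    # Exploit the best known action with probability epsilon, otherwise explore.
    if random.random() < epsilon:
        return max(actions, key=lambda a: Q[(state, a)])
    return random.choice(actions)

def rl_sort(s_training, Q, n, success_rate=0.75, window=100):
    # Sketch of Algorithm 1 (RL_Sort): train on one instance and return the
    # updated Q-table together with the number of training episodes spent.
    goal = tuple(range(1, n + 1))
    actions = action_set(n)
    upper_bound = n + 1
    train_steps = 0
    recent = deque(maxlen=window)   # outcomes (1 = solved) of recent episodes
    while True:
        s, swap_times, current_rate = tuple(s_training), 0, 0.0
        while True:
            a = epsilon_greedy(Q, s, actions)
            s_next = apply_action(s, a)
            swap_times += 1
            if s_next == goal:                        # success: reward_win = 1
                Q[(s, a)] += ALPHA * (1 - Q[(s, a)])
                recent.append(1)
                if len(recent) == window:
                    current_rate = sum(recent) / window
                break
            if swap_times > upper_bound:              # failure: reward_lose = -1
                Q[(s, a)] += ALPHA * (-1 - Q[(s, a)])
                recent.append(0)
                break
            r = reward(s, s_next, goal)               # immediate reward, eq. (4)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next
        train_steps += 1
        if current_rate >= success_rate:
            return Q, train_steps

A fresh model can be trained on one instance with, for example, rl_sort((2, 1, 3), defaultdict(float), 3).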

When the training task moves from sorting n numbers to sorting n + 1 numbers, the values in the Q-table are usually set to zero or randomly initialized. In our transfer setting, the knowledge learned from sorting n numbers is migrated to the problem of sorting n + 1 numbers. For the Q-table obtained from sorting n numbers (denoted as Q_source, with n! states and n(n − 1)/2 actions), we expand its state representation by appending the number n + 1 to the end of each state so that it fits the Q-table representation for sorting n + 1 numbers (denoted as Q_target, with (n + 1)! states and n(n + 1)/2 actions). Therefore, each state s in Q_source becomes s.append(n + 1). We are then able to map the Q-value of each state-action pair from Q_source to Q_target. In this way, as the number in position n + 1 is already in the correct position, we encourage the model to exploit the prior knowledge from Q_source and to avoid actions related to position n + 1. For example, when n equals 3 and one of the states is [1, 3, 2] with actions a(1, 2), a(1, 3), and a(2, 3), we transfer these three Q-values from Q_source to Q_target, where the corresponding state is [1, 3, 2, 4] with the same actions a(1, 2), a(1, 3), and a(2, 3). The nontransferable Q-values are set to zero or randomly initialized. Figure 1 demonstrates how we transfer a Q-table from n = 3 to n = 4.
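The transfer step itself can be sketched in a few lines of Python; the dictionary-based Q-table layout and the zero initialization of nontransferable entries are assumptions consistent with the description above.

from collections import defaultdict

def transfer_q_table(Q_source, n):
    # Append the number n + 1 to every state of the n-number task and copy the
    # learned Q-value; the swap actions over the first n positions keep their
    # meaning, and nontransferable entries default to zero.
    Q_target = defaultdict(float)
    for (state, action), value in Q_source.items():
        Q_target[(state + (n + 1,), action)] = value
    return Q_target

# Example: the Q-values stored for state (1, 3, 2) under its three actions are
# copied to state (1, 3, 2, 4) under the same actions when n = 3.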

3.3. Performance Metrics

In this paper, we define three performance metrics which include training level, number of training steps, and brain capacity.

Training level is a performance-oriented indicator that measures how well the model can use its existing knowledge to perform the task during training. After finishing the training procedure for one instance of sorting n numbers, the model is scheduled to sort n! tasks, where each task is a permutation of those n numbers. We then compute the average number of sorting steps over these n! tasks as the model's training level. The number of training steps, denoted as train_steps in Algorithm 1, is the number of episodes that the model spends training on an example. It is an important factor for measuring the effectiveness of the algorithm. Brain capacity is concerned with the status of the Q-table and is an important measure for comparing the knowledge usage of the nontransfer and transfer methods. It is defined as the ratio of entries that have been updated in a Q-table.
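Under the same dictionary-based Q-table assumption, the training level and the brain capacity can be computed as follows; greedy_sort_steps, the step cap, and counting nonzero entries as "updated" are our illustrative choices, reusing action_set and apply_action from the earlier sketches.

from itertools import permutations
from math import factorial

def greedy_sort_steps(Q, start, n, cap=100):
    # Number of swaps the greedy policy (argmax over Q) needs to sort `start`;
    # the cap guards against cycles in a partially trained table.
    goal, actions = tuple(range(1, n + 1)), action_set(n)
    s, steps = tuple(start), 0
    while s != goal and steps < cap:
        a = max(actions, key=lambda act: Q[(s, act)])
        s = apply_action(s, a)
        steps += 1
    return steps

def training_level(Q, n):
    # Average number of greedy sorting steps over all n! start permutations.
    starts = list(permutations(range(1, n + 1)))
    return sum(greedy_sort_steps(Q, s, n) for s in starts) / len(starts)

def brain_capacity(Q, n):
    # Ratio of updated (here: nonzero) entries to the full Q-table size,
    # i.e., n! states times n(n - 1)/2 actions.
    table_size = factorial(n) * (n * (n - 1) // 2)
    return sum(1 for v in Q.values() if v != 0) / table_size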

4. Experimental Setup and Results

In order to compare the difference and efficacy between the nontransfer and transfer methods, a case study of the sorting problem is presented. We conduct a series of experiments for both nontransfer and transfer RL to investigate the difference in training speed and the contrast in knowledge requirements.

4.1. Experimental Setup

We design an experimental setting to train the model to sort lists of n numbers, where each list is a permutation of {1, 2, ..., n}. In order to provide an equitable comparison, we run nontransfer and transfer RL in parallel and propose an algorithm, presented as pseudocode in Algorithm 2, to satisfy this need.

input: Straining, TRQn−1[Sn−1, An−1]
(1) initialize
(2)  new NRQn[Sn, An]
(3)  new TRQn[Sn, An]
(4)  TRQn[Sn, An] ⟵ TRQn−1[Sn−1, An−1]
(5)  upper_bound = n + 1
(6)  Assign Straining to snt and str
(7)  finish = FALSE
(8)  NonTrans_Tr_Steps = 0
(9)  Trans_Tr_Steps = 0
(10) repeat
(11)  NRQn[Sn, An], Stepsnt = RL_Sort(snt , NRQn[Sn, An])
(12)  TRQn[Sn, An] , Stepstr = RL_Sort(str , TRQn[Sn, An])
(13)  NonTrans_Tr_Steps = NonTrans_Tr_Steps + Stepsnt
(14)  Trans_Tr_Steps = Trans_Tr_Steps + Stepstr
(15)  Sort n! lists in Sn by NRQn, compute the average Avgnt and pick the list with max value as snt
(16)  Sort n! lists in Sn by TRQn, compute the average Avgtr and pick the list with max value as str
(17)  if (|Avgnt − Avgtr|/Avgtr <= 0.1) or (Avgnt <= upper_bound and Avgtr <= upper_bound) then
(18)   finish = TRUE
(19) until finish is TRUE

The input of Algorithm 2 consists of a list Straining, which is a permutation of {1, 2, ..., n}, and a Q-table (TRQn−1[Sn−1, An−1]) learned from sorting n − 1 numbers. A Q-table (NRQn[Sn, An]) for nontransfer RL is initialized with all Q-values set to zero, and a Q-table (TRQn[Sn, An]) for transfer RL is transferred from TRQn−1[Sn−1, An−1] using the mechanism discussed in Section 3.2. The variable upper_bound is used as one of the constraints on the training level. The input list is assigned to both snt and str as the initial sorting list for the two methods. Then, the algorithm iteratively solves the sorting tasks. We begin with nontransfer RL. This process consists of training and evaluation. In the training part, we input the current Q-table (NRQn[Sn, An]) and the list snt to Algorithm 1 to train the model (line 11). The number of training steps returned by Algorithm 1 is accumulated in the variable NonTrans_Tr_Steps (line 13). In the evaluation part, the NRQn[Sn, An] returned by Algorithm 1 is used to sort the n! lists formed by the permutations of {1, 2, ..., n}, and the average number of sorting steps is the model's training level, denoted as Avgnt. We then select the list that takes the maximum number of steps to sort as the new snt (line 15). The same procedure is applied to transfer RL, as seen in lines 12, 14, and 16. The above process is repeated until the two models reach a similar training level (i.e., Avgnt and Avgtr are very close or both are lower than upper_bound). This restriction ensures that both methods exhibit comparable abilities to sort the n! lists and affirms that it is fair to further compare the total number of training steps and the brain capacity.
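A compact Python rendering of this paired training loop, reusing rl_sort and greedy_sort_steps from the earlier sketches, could look like the following; the tuple-based tie-breaking when picking the hardest list is our simplification.

from collections import defaultdict
from itertools import permutations

def paired_training(s_training, Q_transferred, n, tolerance=0.1):
    # Train the nontransfer and transfer models side by side (Algorithm 2) and
    # return both Q-tables with their accumulated training steps.
    NRQ, TRQ = defaultdict(float), Q_transferred
    upper_bound = n + 1
    s_nt = s_tr = tuple(s_training)
    nontrans_tr_steps = trans_tr_steps = 0
    starts = list(permutations(range(1, n + 1)))
    while True:
        NRQ, steps_nt = rl_sort(s_nt, NRQ, n)
        TRQ, steps_tr = rl_sort(s_tr, TRQ, n)
        nontrans_tr_steps += steps_nt
        trans_tr_steps += steps_tr
        # Evaluate both models on all n! lists; the hardest list (most greedy
        # sorting steps) becomes the next training instance for that model.
        eval_nt = [(greedy_sort_steps(NRQ, s, n), s) for s in starts]
        eval_tr = [(greedy_sort_steps(TRQ, s, n), s) for s in starts]
        avg_nt = sum(c for c, _ in eval_nt) / len(starts)
        avg_tr = sum(c for c, _ in eval_tr) / len(starts)
        s_nt, s_tr = max(eval_nt)[1], max(eval_tr)[1]
        if (abs(avg_nt - avg_tr) / avg_tr <= tolerance
                or (avg_nt <= upper_bound and avg_tr <= upper_bound)):
            return NRQ, TRQ, nontrans_tr_steps, trans_tr_steps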

4.2. Experimental Results

As an empirical study, we report results for n equal to 5, 6, 7, and 8. In order to produce a fairer comparison, we repeat Algorithm 2 for 30 episodes for each n. The total number of training steps and the brain capacity are the two perspectives used to measure performance. The total number of training steps is abbreviated as NonTrans_Tr_Steps for the nontransfer method and Trans_Tr_Steps for the transfer method; these two variable names are also used in Algorithm 2. We apply similar abbreviations to the brain capacity and denote them by NonTrans_Br_Capacity and Trans_Br_Capacity. NonTrans_Br_Capacity is calculated as the ratio of Q-values that have been updated in NRQn[Sn, An], and Trans_Br_Capacity is the percentage of updated Q-values in TRQn[Sn, An]. The detailed results are reported in Tables 1-4 for the different n. Looking at the comparison of the total number of training steps, we can see that the values of NonTrans_Tr_Steps and Trans_Tr_Steps increase significantly as n increases. It is worth noting that some of these values are less than 100 when n is 5. Therefore, instead of using the latest 100 episodes to check the success rate as described in Section 3.2, we opt for the latest 10 episodes in that case. Regarding the comparison of the brain capacity, the values of NonTrans_Br_Capacity and Trans_Br_Capacity are generally smaller than 0.25, and almost all values are less than 0.1 when n is greater than 6. This implies that the knowledge required to solve the sorting task occupies only a small portion of the Q-table.

For each episode, we also calculate the ratio of the total number of training steps (Ratio_Tr_Steps) as NonTrans_Tr_Steps divided by Trans_Tr_Steps and the ratio of the brain capacity (Ratio_Br_Capacity) as NonTrans_Br_Capacity divided by Trans_Br_Capacity. For Ratio_Tr_Steps, nine values are greater than or equal to 5.00 when n equals 5. However, as n increases, this phenomenon no longer appears and the transfer effect diminishes. For Ratio_Br_Capacity, the range is much narrower and is largely concentrated between 0.75 and 1.25. As described in Algorithm 2, both the nontransfer and transfer methods are required to have very close training levels in order to finish a training episode. Since a close training level means that the two methods have similar abilities and performance in sorting the n! lists, this could explain why the value of Ratio_Br_Capacity is around 1. In general, the transfer method exhibits better performance in terms of training steps. However, in some cases Ratio_Tr_Steps is smaller than 1, which means that the nontransfer method takes fewer steps to complete the training. Since both methods require a similar brain capacity to sort the n! lists, it is possible that the transfer model exploits the transferred knowledge but does not explore enough to expand its knowledge, which leads it to take more training steps to finish the training process.

To explore the distributions of Ratio_Tr_Steps and Ratio_Br_Capacity, boxplots are presented in Figures 2 and 3 for statistical analysis. A boxplot represents the minimum, 25th percentile, median, 75th percentile, and maximum of the given dataset. In Figure 2, we observe that the medians of Ratio_Tr_Steps, shown as the red lines inside the boxes, gradually decrease as n increases. This is in accordance with our previous observation that the growth of n may lower the transfer effect. In Figure 3, the medians of Ratio_Br_Capacity all occur around 1.00, largely aligning with our previous conjecture. In addition to the statistics in the boxplots, we also report the averages of Ratio_Tr_Steps and Ratio_Br_Capacity in Table 5. The average performance shows very similar trends to the boxplots.

5. Conclusions

Prior research has reported that the Q-learning-based approach to the sorting problem requires a large number of training steps. Since transfer learning is able to share knowledge learned from source domains with a target domain, we devised a transfer scheme to investigate the time cost and knowledge usage of nontransfer and transfer models. The Q-table obtained from the prior task serves as the knowledge source to be transferred to the next task. We chose the sorting problem as our case study and analysed two important performance metrics, the number of training steps and the brain capacity. Our experiments show that the brain capacities of the two models are similar after they reach a similar training level. The difference in the total number of training steps between the two models is significant when n is small. However, as n increases, the proportion of transferred knowledge becomes smaller and the difference becomes less pronounced, making the transfer effect insignificant.

As shown in Table 4, the maximum total number of training steps is close to 100,000 when n equals 8. Faster learning would therefore be necessary to handle larger n. Future work will thus be concerned with reducing the state space. State abstraction [22, 23], with its ability to leverage knowledge learned from prior experiences, is worth the effort to improve the scalability of the current approach. Another area of future work is to extend the current tabular representation to deep learning-based methods in order to improve learning stability and computational efficiency.

Data Availability

No data were used to support the findings of the study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are grateful to Timothy Kitterman for helpful discussions.