Abstract

Previous studies have shown that training a reinforcement learning model for the sorting problem takes a very long time, even for small sets of data. To study whether transfer learning can improve the training process of reinforcement learning, we employ Q-learning as the base reinforcement learning algorithm, use the sorting problem as a case study, and assess the performance from two aspects: the time expense and the brain capacity. We compare the total number of training steps between the nontransfer and transfer methods to study their efficiency and evaluate their differences in brain capacity (i.e., the percentage of updated Q-values in the Q-table). According to our experimental results, the difference in the total number of training steps becomes smaller as the number of elements to be sorted increases. Our results also show that the brain capacities of transfer and nontransfer reinforcement learning are similar once both reach a similar training level.

1. Introduction

Reinforcement learning (RL) aims at learning policies that map states to actions in order to maximize the expected accumulated reward and reach the goal. In contrast to supervised learning approaches, where models are trained on an input set paired with a given output set, an RL agent has to interact with the environment and learn from these experiences through trial and error to yield the optimal behaviour.

Mathematically, RL can be formulated as a Markov decision process (MDP), which is a framework for modelling decision-making problems [1]. An MDP is represented by the tuple <S, A, T, R>, where S denotes the state space of the environment and A is the set of actions that can be taken in a given state. The transition function T is defined as T(s, a, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a), which indicates the probability of reaching the next state s′ at time step t + 1 given the current state s and the action a taken at time step t. The function R is a reward scheme used to assign a score to the action performed in state s and serves as guidance for the agent to produce suitable behaviours. The objective of the RL agent is then to learn a policy π_θ(a | s), parameterized by θ, which tells the agent the best action to perform while the environment is in state s. In general, there are two main approaches to solving RL problems: model-based and model-free learning. In model-based approaches, the goal is to learn a model of the environment and obtain the optimal policy by relying on past transitions. Model-free approaches, on the other hand, learn the optimal policy directly through trial-and-error interactions without modelling the underlying environment. Model-based approaches are often sample-efficient, but the requirement of specifying a model of a real-world task is often restrictive and difficult to satisfy. Therefore, the model-free approach is commonly preferred over the model-based approach when it is not hard to sample trajectories [2, 3]. Q-learning [4] and SARSA [5] are two well-known model-free RL algorithms that fit the optimal policy by learning the action-value (Q-value) function, where the action-value function Q(s, a) expresses the expected reward for each state-action pair (s, a). In recent years, since the development of deep learning methods has gained significant attention and achieved innovations in many fields, it has become common to adopt deep learning methods in RL algorithms in order to boost performance. A combination of the convolutional neural network (CNN) [6] and Q-learning, called deep Q-networks (DQN) [7, 8], was proposed to handle large state-action spaces. DQNs have been shown to reach or even exceed human-level performance on many games. An alternative double estimator method, double Q-learning [9], was introduced to reduce the overestimation of action values in the Q-learning algorithm. Since double Q-learning was proposed in a tabular setting and the DQN algorithm suffers from overestimation, double DQN combines double Q-learning and DQN to support large-scale function approximation and reduce overestimation [10-12].

RL algorithms usually require a large amount of trial and error and many learning iterations to determine an effective policy in a very large state-action space, making them very time-consuming. Recently, there has been strong interest in developing deep learning models with the ability to transfer experiences across similar tasks. The two representative types of methods are the transfer of trained models and the transfer of learned knowledge [13]. The first type of method transfers neural network layers from the pretrained model to the target model [14, 15], whereas the second type aims at transferring learned knowledge from the trained network to the target network [16, 17]. A Q-learning-based approach has been applied to the sorting problem [18]. However, it takes a large number of training steps to finish the training process, even for small sets of data. Since transfer learning has been widely adopted to speed up training, this motivates us to devise a transfer scheme and compare its training performance with the nontransfer method. In this paper, we conduct a series of experiments using the sorting problem as a case study. We transfer the knowledge learned from task n to task n + 1, where n is the number of elements to be sorted, and continue to train the model with a Q-learning-based method. The total number of training steps and the size of the brain capacity, which denotes the knowledge stored in the Q-table, are the two metrics used to measure the impact of the transfer learning technique.

The rest of this paper is organized as follows. Section 2 reviews the background and related work of this paper. Section 3 describes our training strategies and detailed methodology. Experimental setup and results are presented and discussed in Section 4. In Section 5, we discuss conclusions and future work.

2. Background and Related Work

In this section, we first give an overview of Q-learning, which is the base RL algorithm of this paper. The application of RL to the sorting problem is discussed as well.

Q-learning, a model-free method, is one of the best-known RL algorithms and was initially designed for Markov decision processes. It updates the Q-value with the following rule:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)],  (1)

where Q(s_t, a_t) is the action-value function that computes the expected reward of a state-action pair at time step t, α is the learning rate, γ is the discount factor, and r_{t+1} is the reward obtained after selecting action a_t in state s_t. The max operator in the update rule indicates that the agent evaluates the best action a by taking the maximum Q-value over the next state s_{t+1}. A mechanism that exploits the maximum Q-value while updating is called off-policy, i.e., the action actually taken and the action used in the update target do not follow the same policy. In contrast, SARSA updates the Q-value based on the policy being followed, using the following equation:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)].  (2)

When the algorithm uses the same mechanism for the behaviour policy (i.e., the policy that selects a_{t+1}) and the estimation policy (i.e., the policy used in the update target), it is called on-policy [19].
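To make the distinction concrete, the following Python sketch contrasts the two update rules on a dictionary-based Q-table; the function names and the defaultdict representation are our illustrative choices and not part of the original implementation.

from collections import defaultdict

# Q-table represented as a mapping from (state, action) pairs to Q-values.
Q = defaultdict(float)

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.9):
    # Off-policy (equation (1)): bootstrap from the maximum Q-value of the
    # next state, regardless of which action the behaviour policy takes next.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.05, gamma=0.9):
    # On-policy (equation (2)): bootstrap from the action a_next that the
    # behaviour policy (e.g., epsilon-greedy) has actually selected.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])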

The sorting problem is a quintessential computer science task and has been applied in many fields since its emergence. An analysis of comparison-based sorting algorithms shows that the computation requires Ω(n log n) comparisons in the worst case. An RL-based approach, which applies stability and resiliency ideas from feedback control, was proposed to overcome the limitations of errors and early program termination in traditional computing [20]. An empirical exploration compared the RL model with two traditional sorting algorithms and showed that the RL sorting model completes the task with fewer array manipulations. In order to investigate the effect of two different reward schemes, immediate reward and pure delayed reward, a Q-learning algorithm was implemented to compare the total number of training steps and the average number of sorting steps [18]. A case study of the sorting problem concluded that immediate reward takes far fewer steps to finish the task.

3. Methodology and Learning Design

In this section, we describe the important features of our proposed methodology, including the training level and the brain capacity. We also discuss how we design the RL algorithm to formulate the sorting problem in an RL setting.

3.1. RL-Based Setting for Sorting Problem

We model each state, which consists of n elements, as the list of numbers to be sorted; hence, there are n! possible states, denoted by Sn. For any state s at time step t, an action a(i, j) is defined as the swap of the values in position i and position j. Thus, there are n(n − 1)/2 possible actions in the action set An. Once an action a(i, j) is chosen in state s_t, the next state s_{t+1} is determined by exchanging the element in position i with the element in position j of the state s_t. For example, assuming the initial state is s_t = [1, 3, 4, 2] and the action a(2, 4) is performed, this state-action pair results in the next state s_{t+1} = [1, 2, 4, 3].
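As a concrete illustration of this formulation, the following Python sketch enumerates the state and action sets and applies a swap action; the helper names and the 0-indexed positions are our own choices rather than details from the original implementation.

from itertools import combinations, permutations

def state_space(n):
    # All n! permutations of {1, ..., n}; each state is stored as a tuple.
    return list(permutations(range(1, n + 1)))

def action_set(n):
    # All n(n - 1)/2 unordered position pairs (i, j); positions are 0-indexed.
    return list(combinations(range(n), 2))

def apply_action(state, action):
    # Swapping the elements at positions i and j yields the next state.
    i, j = action
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)

# Example: apply_action((1, 3, 4, 2), (1, 3)) returns (1, 2, 4, 3),
# i.e., the action a(2, 4) of the text expressed in 0-indexed form.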

As the previous study [18] suggests that immediate reward performs better than pure delayed reward, we use the immediate reward scheme in this research. We assign the reward by considering whether the action actually improves the number of elements in the correct position. A similarity value is introduced to measure the similarity between the current state and the goal state Sgoal (i.e., the sorted list) as follows:

sim(s) = Σ_{i=1}^{n} eq(s[i], Sgoal[i]),  (3)

where the function eq(s[i], Sgoal[i]) returns one if the two states have the same value at position i and zero otherwise. We then compute the difference between sim(s_{t+1}) and sim(s_t) to assign the reward as follows:

r_{t+1} = r_positive, if sim(s_{t+1}) − sim(s_t) > 0,
          r_zero,     if sim(s_{t+1}) − sim(s_t) = 0,  (4)
          r_negative, if sim(s_{t+1}) − sim(s_t) < 0.

In this paper, r_positive is 1, r_zero is 0, and r_negative is −1. For the aforementioned example, since s_t = [1, 3, 4, 2] receives a similarity value of 1 and s_{t+1} = [1, 2, 4, 3] receives a value of 2, a reward of 1 is given.
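A direct Python rendering of equations (3) and (4), reusing the helpers above, might look as follows; the function and parameter names are our illustrative choices.

def similarity(state, goal):
    # Equation (3): count the positions at which state and goal agree.
    return sum(1 for x, g in zip(state, goal) if x == g)

def reward(state, next_state, goal, r_positive=1, r_zero=0, r_negative=-1):
    # Equation (4): the sign of the change in similarity determines the reward.
    diff = similarity(next_state, goal) - similarity(state, goal)
    if diff > 0:
        return r_positive
    if diff < 0:
        return r_negative
    return r_zero

# For the example above, similarity((1, 3, 4, 2), (1, 2, 3, 4)) == 1 and
# similarity((1, 2, 4, 3), (1, 2, 3, 4)) == 2, so a reward of 1 is given.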

3.2. Learning Algorithm

The objective of the learning algorithm is to sort a given example consisting of n numbers over a series of episodes until the success rate reaches a predefined threshold. Algorithm 1 (RL_Sort) shows how we execute model training on one training instance based on the Q-learning algorithm. The algorithm takes a list Straining and a Q-table Qn[Sn, An] as inputs and produces an updated Q-table and the number of training steps as output. RL_Sort begins with the initialization of upper_bound, train_steps, and success_rate. The variable upper_bound defines the maximum allowed number of swaps for sorting, and we set n + 1 as the threshold. The variable train_steps stores the number of episodes spent on training. The variable success_rate is the criterion for terminating the training process and is set to 0.75 in our experiments. Sgoal is the correct sorting result. The experimental parameters are as follows: α = 0.05, γ = 0.9, and ε = 0.85. In each episode, from line 11 to line 31, the model chooses an action for the current state s based on ε-greedy [21] and receives a new state s′ (lines 12∼13). There are two conditions under which the episode ends. In the first, s′ is Sgoal and a positive reward (reward_win = 1) is given (lines 16∼18). In the second, the number of swaps exceeds upper_bound and a negative reward (reward_lose = −1) is received. Since the first condition reaches a success state, we examine the success rate over the latest 100 episodes to determine whether the training process should stop or a new episode should begin. In the cases where the current episode continues (lines 23∼28), the Q-table is updated with the reward given by equation (4).

input: Straining, Qn[Sn, An]
output: Qn[Sn, An], train_steps
(1) initialize
(2)  upper_bound = n + 1
(3)  train_steps = 0
(4)  success_rate = 0.75
(5)  Sgoal = [1, 2, ..., n]
(6) repeat
(7)  end = FALSE
(8)  swap_times = 0
(9)  s = Straining
(10)  current_rate = 0
(11)  repeat
(12)   Select an action a based on ε-greedy
(13)   Perform the action a and observe s′ and the corresponding reward
(14)   swap_times = swap_times + 1
(15)   if (s′ is Sgoal) then
(16)    Qn[s, a] ⟵ Qn[s, a] + α × (reward_win − Qn[s, a])
(17)    end = TRUE
(18)    Check the success rate for the latest 100 episodes and assign it to current_rate
(19)   elseif (swap_times > upper_bound) then
(20)    Qn[s, a] ⟵ Qn[s, a] + α × (reward_lose − Qn[s, a])
(21)    end = TRUE
(22)   else
(23)    if (dist(s′, Sgoal) > dist(s, Sgoal)) then
(24)     Qn[s, a] ⟵ Qn[s, a] + α × (reward_lose + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(25)    elseif (dist(s′, Sgoal) < dist(s, Sgoal)) then
(26)     Qn[s, a] ⟵ Qn[s, a] + α × (reward_win + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(27)    else
(28)     Qn[s, a] ⟵ Qn[s, a] + α × (0 + γ × max_a′ Qn[s′, a′] − Qn[s, a])
(29)    s ⟵ s′
(30)  until end is TRUE
(31)  train_steps = train_steps + 1
(32) until current_rate >= success_rate
(33) return Qn[Sn, An], train_steps
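For readers who prefer executable code, the following Python sketch mirrors the structure of Algorithm 1, reusing action_set, apply_action, and reward from the earlier sketches; the epsilon_greedy helper, the deque-based success window, and the use of the similarity-based reward of equation (4) in place of the dist test are our assumptions, not details fixed by the pseudocode.

import random
from collections import defaultdict, deque

ALPHA, GAMMA, EPSILON = 0.05, 0.9, 0.85

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    # Exploit the best known action with probability epsilon, otherwise explore.
    if random.random() < epsilon:
        return max(actions, key=lambda a: Q[(state, a)])
    return random.choice(actions)

def rl_sort(s_training, Q, n, success_rate=0.75, window=100):
    # Sketch of Algorithm 1 (RL_Sort): train on one instance and return the
    # updated Q-table together with the number of training episodes spent.
    goal = tuple(range(1, n + 1))
    actions = action_set(n)
    upper_bound = n + 1
    train_steps = 0
    recent = deque(maxlen=window)   # outcomes (1 = solved) of recent episodes
    while True:
        s, swap_times, current_rate = tuple(s_training), 0, 0.0
        while True:
            a = epsilon_greedy(Q, s, actions)
            s_next = apply_action(s, a)
            swap_times += 1
            if s_next == goal:                        # success: reward_win = 1
                Q[(s, a)] += ALPHA * (1 - Q[(s, a)])
                recent.append(1)
                if len(recent) == window:
                    current_rate = sum(recent) / window
                break
            if swap_times > upper_bound:              # failure: reward_lose = -1
                Q[(s, a)] += ALPHA * (-1 - Q[(s, a)])
                recent.append(0)
                break
            r = reward(s, s_next, goal)               # immediate reward, eq. (4)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next
        train_steps += 1
        if current_rate >= success_rate:
            return Q, train_steps

A fresh model can be trained on one instance with, for example, rl_sort((2, 1, 3), defaultdict(float), 3).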

When the training task moves from sorting n numbers to sorting n + 1 numbers, the values in the Q-table are usually set to zero or randomly initialized. In our transfer setting, the knowledge learned from sorting n numbers is migrated to the problem of sorting n + 1 numbers. For the Q-table obtained from sorting n numbers (denoted as Q_source, with n! states and n(n − 1)/2 actions), we expand its state representation by appending the number n + 1 to the end of each state so that it fits the Q-table representation for sorting n + 1 numbers (denoted as Q_target, with (n + 1)! states and n(n + 1)/2 actions). Therefore, each state s in Q_source becomes s.append(n + 1). We are then able to map the Q-value of each state-action pair from Q_source to Q_target. In this way, as the number in position n + 1 is already in the correct position, we encourage the model to exploit the prior knowledge from Q_source and to avoid actions related to position n + 1. For example, when n equals 3 and one of the states is [1, 3, 2] with actions a(1, 2), a(1, 3), and a(2, 3), we transfer these three Q-values from Q_source to Q_target, where the corresponding state is [1, 3, 2, 4] with the same actions a(1, 2), a(1, 3), and a(2, 3). The nontransferable Q-values are set to zero or randomly initialized. Figure 1 demonstrates how we transfer a Q-table from n = 3 to n = 4.
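The transfer step itself can be sketched in a few lines of Python; the dictionary-based Q-table layout and the zero initialization of nontransferable entries are assumptions consistent with the description above.

from collections import defaultdict

def transfer_q_table(Q_source, n):
    # Append the number n + 1 to every state of the n-number task and copy the
    # learned Q-value; the swap actions over the first n positions keep their
    # meaning, and nontransferable entries default to zero.
    Q_target = defaultdict(float)
    for (state, action), value in Q_source.items():
        Q_target[(state + (n + 1,), action)] = value
    return Q_target

# Example: the Q-values stored for state (1, 3, 2) under its three actions are
# copied to state (1, 3, 2, 4) under the same actions when n = 3.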

3.3. Performance Metrics

In this paper, we define three performance metrics which include training level, number of training steps, and brain capacity.

Training level is a performance-oriented indicator that measures how well the model can use its existing knowledge to perform the task during training. After finishing the training procedure for one instance of sorting n numbers, the model is scheduled to sort n! tasks, where each task is a permutation of those n numbers. We then compute the average number of sorting steps over these n! tasks as the model's training level. The number of training steps, denoted as train_steps in Algorithm 1, is the number of episodes that the model spends training on an example. It is an important factor for measuring the effectiveness of the algorithm. Brain capacity is concerned with the status of the Q-table and is an important measure for comparing the knowledge usage of the nontransfer and transfer methods. It is defined as the ratio of entries that have been updated in a Q-table.
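Under the same dictionary-based Q-table assumption, the training level and the brain capacity can be computed as follows; greedy_sort_steps, the step cap, and counting nonzero entries as "updated" are our illustrative choices, reusing action_set and apply_action from the earlier sketches.

from itertools import permutations
from math import factorial

def greedy_sort_steps(Q, start, n, cap=100):
    # Number of swaps the greedy policy (argmax over Q) needs to sort `start`;
    # the cap guards against cycles in a partially trained table.
    goal, actions = tuple(range(1, n + 1)), action_set(n)
    s, steps = tuple(start), 0
    while s != goal and steps < cap:
        a = max(actions, key=lambda act: Q[(s, act)])
        s = apply_action(s, a)
        steps += 1
    return steps

def training_level(Q, n):
    # Average number of greedy sorting steps over all n! start permutations.
    starts = list(permutations(range(1, n + 1)))
    return sum(greedy_sort_steps(Q, s, n) for s in starts) / len(starts)

def brain_capacity(Q, n):
    # Ratio of updated (here: nonzero) entries to the full Q-table size,
    # i.e., n! states times n(n - 1)/2 actions.
    table_size = factorial(n) * (n * (n - 1) // 2)
    return sum(1 for v in Q.values() if v != 0) / table_size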

4. Experimental Setup and Results

In order to compare the difference and efficacy between the nontransfer and transfer methods, a case study of the sorting problem is presented. We conduct a series of experiments for both nontransfer and transfer RL to investigate the difference in training speed and the contrast in knowledge requirements.

4.1. Experimental Setup

We design an experimental setting to train the model to sort lists of n numbers, where each list is a permutation of {1, 2, ..., n}. In order to provide an equitable comparison, we run nontransfer and transfer RL in parallel and propose an algorithm, presented as pseudocode in Algorithm 2, to satisfy this need.

input: Straining, TRQn−1[Sn−1, An−1]
(1) initialize
(2)  new NRQn[Sn, An]
(3)  new TRQn[Sn, An]
(4)  TRQn[Sn, An] ⟵ TRQn−1[Sn−1, An−1]
(5)  upper_bound = n + 1
(6)  Assign Straining to snt and str
(7)  finish = FALSE
(8)  NonTrans_Tr_Steps = 0
(9)  Trans_Tr_Steps = 0
(10) repeat
(11)  NRQn[Sn, An], Stepsnt = RL_Sort(snt , NRQn[Sn, An])
(12)  TRQn[Sn, An] , Stepstr = RL_Sort(str , TRQn[Sn, An])
(13)  NonTrans_Tr_Steps = NonTrans_Tr_Steps + Stepsnt
(14)  Trans_Tr_Steps = Trans_Tr_Steps + Stepstr
(15)  Sort n! lists in Sn by NRQn, compute the average Avgnt and pick the list with max value as snt
(16)  Sort n! lists in Sn by TRQn, compute the average Avgtr and pick the list with max value as str
(17)  if (|Avgnt − Avgtr|/Avgtr <= 0.1) or (Avgnt <= upper_bound and Avgtr <= upper_bound) then
(18)   finish = TRUE
(19) until finish is TRUE

The input of Algorithm 2 consists of a list Straining, which is a permutation of {1, 2, ..., n}, and a Q-table (TRQn−1[Sn−1, An−1]) learned from sorting n − 1 numbers. A Q-table (NRQn[Sn, An]) for nontransfer RL is initialized with all Q-values set to zero, and a Q-table (TRQn[Sn, An]) for transfer RL is transferred from TRQn−1[Sn−1, An−1] using the mechanism discussed in Section 3.2. The variable upper_bound is used as one of the constraints on the training level. The input list is assigned to both snt and str as the initial sorting list for the two methods. Then, the algorithm iteratively solves the sorting tasks. We begin with nontransfer RL. This process consists of training and evaluation. In the training part, we input the current Q-table (NRQn[Sn, An]) and the list snt to Algorithm 1 to train the model (line 11). The number of training steps returned by Algorithm 1 is accumulated in the variable NonTrans_Tr_Steps (line 13). In the evaluation part, the NRQn[Sn, An] returned by Algorithm 1 is used to sort the n! lists formed by the permutations of {1, 2, ..., n}, and the average number of sorting steps is the model's training level, denoted as Avgnt. We then select the list that takes the maximum number of steps to sort as the new snt (line 15). The same procedure is applied to transfer RL, as seen in lines 12, 14, and 16. The above process is repeated until the two models reach a similar training level (i.e., Avgnt and Avgtr are very close or both are lower than upper_bound). This restriction ensures that both methods exhibit comparable abilities to sort the n! lists and affirms that it is fair to further compare the total number of training steps and the brain capacity.
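A compact Python rendering of this paired training loop, reusing rl_sort and greedy_sort_steps from the earlier sketches, could look like the following; the tuple-based tie-breaking when picking the hardest list is our simplification.

from collections import defaultdict
from itertools import permutations

def paired_training(s_training, Q_transferred, n, tolerance=0.1):
    # Train the nontransfer and transfer models side by side (Algorithm 2) and
    # return both Q-tables with their accumulated training steps.
    NRQ, TRQ = defaultdict(float), Q_transferred
    upper_bound = n + 1
    s_nt = s_tr = tuple(s_training)
    nontrans_tr_steps = trans_tr_steps = 0
    starts = list(permutations(range(1, n + 1)))
    while True:
        NRQ, steps_nt = rl_sort(s_nt, NRQ, n)
        TRQ, steps_tr = rl_sort(s_tr, TRQ, n)
        nontrans_tr_steps += steps_nt
        trans_tr_steps += steps_tr
        # Evaluate both models on all n! lists; the hardest list (most greedy
        # sorting steps) becomes the next training instance for that model.
        eval_nt = [(greedy_sort_steps(NRQ, s, n), s) for s in starts]
        eval_tr = [(greedy_sort_steps(TRQ, s, n), s) for s in starts]
        avg_nt = sum(c for c, _ in eval_nt) / len(starts)
        avg_tr = sum(c for c, _ in eval_tr) / len(starts)
        s_nt, s_tr = max(eval_nt)[1], max(eval_tr)[1]
        if (abs(avg_nt - avg_tr) / avg_tr <= tolerance
                or (avg_nt <= upper_bound and avg_tr <= upper_bound)):
            return NRQ, TRQ, nontrans_tr_steps, trans_tr_steps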

4.2. Experimental Results

As an empirical study, we report results for n equal to 5, 6, 7, and 8. In order to produce a fairer comparison, we repeat Algorithm 2 for 30 episodes for each n. The total number of training steps and the brain capacity are the two perspectives used to measure performance. The total number of training steps is abbreviated as NonTrans_Tr_Steps for the nontransfer method and Trans_Tr_Steps for the transfer method; these two variable names are also used in Algorithm 2. We apply similar abbreviations to the brain capacity and denote them by NonTrans_Br_Capacity and Trans_Br_Capacity. NonTrans_Br_Capacity is calculated as the ratio of Q-values that have been updated in NRQn[Sn, An], and Trans_Br_Capacity is the percentage of updated Q-values in TRQn[Sn, An]. The detailed results are reported in Tables 1-4 for the different n. Looking at the comparison of the total number of training steps, we can see that the values of NonTrans_Tr_Steps and Trans_Tr_Steps increase significantly as n increases. It is worth noting that some of these values are less than 100 when n is 5. Therefore, instead of using the latest 100 episodes to check the success rate as described in Section 3.2, we opt for the latest 10 episodes in that case. Regarding the comparison of the brain capacity, the values of NonTrans_Br_Capacity and Trans_Br_Capacity are generally smaller than 0.25, and almost all values are less than 0.1 when n is greater than 6. This implies that the knowledge required to solve the sorting task occupies only a small portion of the Q-table.

For each episode, we also calculate the ratio of the total number of training steps (Ratio_Tr_Steps) as NonTrans_Tr_Steps divided by Trans_Tr_Steps and the ratio of the brain capacity (Ratio_Br_Capacity) as NonTrans_Br_Capacity divided by Trans_Br_Capacity. For Ratio_Tr_Steps, nine values are greater than or equal to 5.00 when n equals 5. However, as n increases, this phenomenon no longer appears and the transfer effect diminishes. For Ratio_Br_Capacity, the range is much narrower and is largely concentrated between 0.75 and 1.25. As described in Algorithm 2, both the nontransfer and transfer methods are required to have very close training levels in order to finish a training episode. Since a close training level means that the two methods have similar abilities and performance in sorting the n! lists, this could explain why the value of Ratio_Br_Capacity is around 1. In general, the transfer method exhibits better performance in terms of training steps. However, in some cases Ratio_Tr_Steps is smaller than 1, which means that the nontransfer method takes fewer steps to complete the training. Since both methods require a similar brain capacity to sort the n! lists, it is possible that the transfer model exploits the transferred knowledge but does not explore enough to expand its knowledge, which leads it to take more training steps to finish the training process.

To explore the distributions of Ratio_Tr_Steps and Ratio_Br_Capacity, boxplots are presented in Figures 2 and 3 for statistical analysis. A boxplot represents the minimum, 25th percentile, median, 75th percentile, and maximum of the given dataset. In Figure 2, we observe that the medians of Ratio_Tr_Steps, shown as the red lines inside the boxes, gradually decrease as n increases. This is in accordance with our previous observation that the growth of n may lower the transfer effect. In Figure 3, the medians of Ratio_Br_Capacity all occur around 1.00, largely aligning with our previous conjecture. In addition to the statistics in the boxplots, we also report the averages of Ratio_Tr_Steps and Ratio_Br_Capacity in Table 5. The average performance shows very similar trends to the boxplots.

5. Conclusions

Prior research has reported that the Q-learning-based approach to the sorting problem requires a large number of training steps. Since transfer learning is able to share knowledge learned from source domains with a target domain, we devised a transfer scheme to investigate the time cost and knowledge usage of nontransfer and transfer models. The Q-table obtained from the prior task serves as the knowledge source to be transferred to the next task. We chose the sorting problem as our case study and analysed two important performance metrics, the number of training steps and the brain capacity. Our experiments show that the brain capacities of the two models are similar after they reach a similar training level. The difference in the total number of training steps between the two models is significant when n is small. However, as n increases, the proportion of transferred knowledge becomes smaller and the difference becomes less pronounced, making the transfer effect insignificant.

As shown in Table 4, the maximum total number of training steps is close to 100,000 when n equals 8. Faster learning would therefore be necessary to handle larger n. Future work will thus be concerned with reducing the state space. State abstraction [22, 23], with its ability to leverage knowledge learned from prior experiences, is worth the effort to improve the scalability of the current approach. Another area of future work is to extend the current tabular representation to deep learning-based methods in order to improve learning stability and computational efficiency.

Data Availability

No data were used to support the findings of the study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors are grateful to Timothy Kitterman for helpful discussions.