Inspired by the human learning system, reinforcement learning (RL) has achieved superhuman performance, and a growing number of model-based (MB) RL approaches pursue its potential benefits of higher sample efficiency and faster adaptation. However, recent benchmark studies show that MB RL is not always superior to model-free (MF) RL: it can struggle on tasks that are relatively easy for humans, and the formation of a world model can be hindered by uncertainty in the task's options. To generalize across task conditions, an RL agent should use MB and MF learning strategies in parallel. Recent findings in computational neuroscience provide mounting evidence that a key principle underlying RL in the human brain is meta-control, such as arbitration between MB and MF control based on prediction error (PE). To this end, we propose a novel neuroscience-inspired RL algorithm, Meta-Dyna, which flexibly adapts to frequent environmental changes, including changes in both goals and latent state-transition uncertainty, based on the concept of prefrontal meta-control. We evaluate this approach in three environments: i) the Two-stage MDT, widely used to investigate the characteristics of human RL; ii) GridWorldLoCA, a benchmark environment for MB RL; and iii) a modified Atari-Pong, newly designed on top of OpenAI Gym Atari-Pong, to which we applied the goal conditions and state-transition probabilities of the Two-stage MDT. Experimental results show that our proposal outperforms baseline RL models with respect to average reward, choice optimality, and energy efficiency (p<0.001, independent-sample t-test). By applying meta-control inspired by the prefrontal cortex, Meta-Dyna achieves a favorable performance-speed-efficiency balance, as evidenced by the highest average rewards (Two-stage MDT: 0.61 tabular, 0.71 neural network; Atari-Pong: -0.091), rapid convergence to the optimum (GridWorldLoCA), and lower learning costs (fewer timesteps in Atari-Pong). A deeper insight into these results would allow us not only to advance the computational theory of RL but also to build human-like RL agents.
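To make the arbitration idea concrete, the following is a minimal, illustrative sketch of PE-based arbitration between MF and MB controllers in a tabular setting; it is not the paper's implementation, and all names (ArbitrationAgent, the reliability-update rule, and the learning-rate constants) are hypothetical choices made for this example, assuming rewards roughly in [0, 1].

```python
import numpy as np


class ArbitrationAgent:
    """Illustrative PE-based arbitration between MF (TD) and MB (learned model) values."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, beta=5.0):
        self.q_mf = np.zeros((n_states, n_actions))        # model-free Q-values
        self.q_mb = np.zeros((n_states, n_actions))        # model-based Q-values
        self.trans_counts = np.ones((n_states, n_actions, n_states))  # transition counts
        self.reward_model = np.zeros((n_states, n_actions))
        self.rel_mf = 0.5   # reliability of the MF system, driven by reward PE
        self.rel_mb = 0.5   # reliability of the MB system, driven by state PE
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def act(self, s):
        # Arbitration weight: the more reliable system gets more control over the policy.
        w_mb = self.rel_mb / (self.rel_mb + self.rel_mf + 1e-8)
        q = w_mb * self.q_mb[s] + (1.0 - w_mb) * self.q_mf[s]
        p = np.exp(self.beta * (q - q.max()))
        return np.random.choice(len(q), p=p / p.sum())

    def update(self, s, a, r, s_next):
        # Model-free TD update and reward prediction error (RPE).
        rpe = r + self.gamma * self.q_mf[s_next].max() - self.q_mf[s, a]
        self.q_mf[s, a] += self.alpha * rpe
        # Model learning and state prediction error (SPE = 1 - predicted prob. of observed s').
        spe = 1.0 - self.trans_counts[s, a, s_next] / self.trans_counts[s, a].sum()
        self.trans_counts[s, a, s_next] += 1.0
        self.reward_model[s, a] += self.alpha * (r - self.reward_model[s, a])
        # Each system's reliability drifts toward 1 when its prediction error is small.
        self.rel_mf += self.alpha * ((1.0 - min(abs(rpe), 1.0)) - self.rel_mf)
        self.rel_mb += self.alpha * ((1.0 - spe) - self.rel_mb)
        # One-step model-based backup; a Dyna-style variant would repeat this on sampled (s, a).
        p_next = self.trans_counts[s, a] / self.trans_counts[s, a].sum()
        self.q_mb[s, a] = self.reward_model[s, a] + self.gamma * p_next @ self.q_mb.max(axis=1)
```

In this sketch the arbitration weight shifts toward the MB values when state-transition predictions are accurate (low SPE) and toward the MF values when reward predictions are accurate (low RPE), which captures, in simplified form, the PE-driven meta-control principle described above.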