The current invention relates to multi-agent deep reinforcement learning (MADRL), in which many intelligent agents interact and work together in a setting where they try to learn from their mistakes and develop better decision-making skills.
Recently, MADRL has shown very promising results in cooperative multi-agent systems (MAS) and has proved its importance in this field, particularly in complex tasks such as self-driving vehicles, two-team strategy games (e.g., StarCraft), logistics distribution in a factory, productivity optimization, and cooperative multi-robot exploration. Many different techniques have been introduced to solve these problems.
Deep multi-agent reinforcement learning has shown promising results on many challenging tasks. To demonstrate the viability of the field, Value-Decomposition Networks (VDN) enabled centralized value-function learning to be coupled with decentralized execution. This approach decomposes a central state-action value function into a sum of individual agent terms. VDN, however, can represent only a small class of centralized action-value functions and does not employ additional state information during training. Modern methods such as QMIX follow the centralized training with decentralized execution (CTDE) paradigm. In QMIX, a mixer network factorizes the joint state-action value function of all agents as a monotonic function of the individual agent values. To guarantee the individual-global-max (IGM) condition for each agent, this mixer network computes the joint state-action value of all agents. The monotonic condition is enforced by a hyper-network that takes the current global state as input and predicts non-negative weights for the mixer network; the output of the mixer network therefore also depends on the current state via this hyper-network. The mixing network is trained end to end with the same DQN-style loss used for single-agent Q-learning. Even so, the class of joint action-value functions that QMIX can represent remains restricted.
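For illustration, a minimal sketch of a QMIX-style monotonic mixing network is shown below, assuming PyTorch; the class name, layer sizes, and embedding dimension are illustrative assumptions rather than details taken from this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: combines per-agent Q-values into Q_tot.

    Hyper-networks map the global state to the mixer weights; taking the
    absolute value of those weights keeps dQ_tot/dQ_i >= 0 (monotonicity),
    which preserves the IGM condition.
    """
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hyper-networks conditioned on the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        agent_qs = agent_qs.view(batch, 1, self.n_agents)
        # First mixing layer with non-negative, state-dependent weights
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs, w1) + b1)
        # Second mixing layer, also with non-negative weights
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(batch, 1)
```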
To address this limitation, QTRAN introduced a novel factorization method that can express the complete class of factorizable value functions by relying on IGM consistency. Although it requires more processing effort to implement, this method guarantees a more general factorization than QMIX. Mahajan et al. analyzed QMIX's exploration capabilities and showed its limitations in particular settings; to improve the performance of all agents, they presented a paradigm in which agents' behavior is conditioned on a shared latent space. Achieving effective scalability for MARL therefore remained a difficulty, which QPLEX set out to solve. Although QPLEX performs well, it still needs sophisticated networks to produce these outcomes, and because it employs a greedy policy for the choice of each individual agent's action, it requires many training episodes for a sizable number of agents.
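For reference, the IGM consistency condition mentioned above is commonly stated as follows (standard notation, not notation taken from this work): the joint greedy action of the team must coincide with the agent-wise greedy actions, and QMIX enforces it through monotonicity.

```latex
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u})
=
\begin{pmatrix}
\arg\max_{u_1} Q_1(\tau_1, u_1) \\
\vdots \\
\arg\max_{u_n} Q_n(\tau_n, u_n)
\end{pmatrix},
\qquad
\text{QMIX: } \frac{\partial Q_{tot}}{\partial Q_i} \ge 0 \;\; \forall i .
```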
Additionally, two novel Deep Quality-Value (DQV)-based MARL algorithms, known as QVMix and QVMix-Max, have been developed. Both make use of centralized training and decentralized execution. The reported results show that QVMix outperformed the others because it is less prone to an overestimation bias of the Q function. However, QVMix also needs considerable processing power and training time, because it too employs a greedy method for choosing the actions taken by each individual agent.
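As a rough sketch of why the DQV family is less prone to overestimation (based on the published DQV formulation, not on details specific to this work): the Q-network is bootstrapped from a separately learned state-value network instead of from a max over its own action values, so the maximization bias of standard DQN targets is avoided.

```latex
% DQV-style targets (sketch): both networks bootstrap from V,
% so no max operator appears in the Q-target.
y^{V}_t = r_t + \gamma\, V(s_{t+1}; \phi^{-}), \qquad
y^{Q}_t = r_t + \gamma\, V(s_{t+1}; \phi^{-})
```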
In this work, to overcome these restrictions, we propose a novel hybrid policy that is based on nature-inspired optimization. For the action selection of each individual agent, this policy combines the Grey Wolf Optimizer (GWO) with a greedy policy. Although they require knowledge of the environment, optimization algorithms such as GWO (modeled on how a wolf pack searches for prey) and Ant Colony Optimization (typically used for finding shortest paths) outperform a purely greedy policy. In GWO, agents are trained centrally, with the leader agent assisting the other agents. Because the current innovation uses bio-inspired optimization, it therefore requires fewer computing resources and fewer episodes than legacy methodologies in settings where there are no communication restrictions and the agents cooperate to attain the goal. Additionally, in a known environment, optimization strategies converge more quickly than greedy policies; in an unknown environment the optimization algorithm fails, whereas the greedy policy performs noticeably better. By combining these approaches we therefore attain the best outcomes in both cases. We compared the proposed approach with the state-of-the-art QMIX and QVMix algorithms in the StarCraft II Learning Environment. The experimental results show that our algorithm performs better than QMIX and QVMix in every case and needs fewer training episodes.
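A purely hypothetical sketch of how such a hybrid action-selection policy could be structured is given below. It applies the standard GWO position-update equations when the environment is known and falls back to an epsilon-greedy choice over Q-values otherwise; every function and parameter name here is an illustrative assumption, not the exact policy of the present invention.

```python
import numpy as np

def gwo_step(pos, alpha, beta, delta, a):
    """One standard Grey Wolf Optimizer position update.

    pos: current position of the searching agent (continuous vector);
    alpha/beta/delta: positions of the three leading agents;
    a: coefficient decreased linearly from 2 to 0 over training.
    """
    new_pos = np.zeros_like(pos)
    for leader in (alpha, beta, delta):
        r1, r2 = np.random.rand(*pos.shape), np.random.rand(*pos.shape)
        A = 2 * a * r1 - a
        C = 2 * r2
        D = np.abs(C * leader - pos)
        new_pos += leader - A * D
    return new_pos / 3.0

def hybrid_action(q_values, agent_pos, leaders, a, env_known, epsilon=0.1):
    """Hypothetical hybrid policy: GWO-guided choice in a known
    environment, epsilon-greedy over Q-values otherwise."""
    n_actions = len(q_values)
    if env_known:
        # Move the agent's "position" toward the pack leaders and map the
        # first coordinate onto a discrete action index (illustrative).
        new_pos = gwo_step(agent_pos, *leaders, a)
        action = int(np.clip(new_pos[0], 0, n_actions - 1))
    elif np.random.rand() < epsilon:
        action = np.random.randint(n_actions)   # explore
    else:
        action = int(np.argmax(q_values))       # exploit greedily
    return action
```

In this sketch the coefficient a would be decreased linearly from 2 to 0 over training, mirroring the usual exploration-to-exploitation schedule of the optimizer.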