Abstract
This paper explores the use of deep reinforcement learning to solve the multi-agent aircraft traffic planning (individual path planning) and collision avoidance problem for multiple unmanned aircraft systems (UAS), such as a cargo drone network. Specifically, a Deep Q-Network (DQN) with Hindsight Experience Replay (HER) is adopted and trained on a three-dimensional state space representing a congested urban environment with dynamic obstacles. The problem is formalised as a Markov decision process (MDP), and various flight and control parameters are varied between training simulations to study their effects on agent performance. Both fully observable MDPs (FOMDPs) and partially observable MDPs (POMDPs) are formulated to understand the role of reward-signal shaping in training performance. While conventional traffic planning and optimisation techniques are evaluated on path length or travel time alone, this paper incorporates economic analysis by considering tangible and intangible sources of cost, such as the cost of energy, the value of time (VOT) and the value of reliability (VOR). By comparing outcomes through an integration of these cost sources, the paper is better able to gauge the impact of various parameters on efficiency. To further explore the feasibility of multi-UAS traffic planning, such as cargo drone networks, the trained agents are also subjected to multi-agent point-to-point and hub-and-spoke network environments. In these simulations, delivery orders are generated by a discrete event simulator whose arrival rate is varied to investigate the effect of travel demand on economic costs. Simulation results point to the importance of reward engineering, as reward signals play a crucial role in shaping learned behaviour. The results also reflect increased costs in environments where congestion and arrival-time uncertainty arise from the presence of other agents in the network.
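A minimal sketch of how such a generalised trip cost might be composed, assuming a linear combination of the three sources named above (the functional form, the symbols $c_E$, $E$, $\mathbb{E}[T]$ and $\sigma_T$, and the use of arrival-time standard deviation as the reliability measure are illustrative assumptions, not specifications from the paper):

\[
C_{\text{trip}} = c_E\,E + \mathrm{VOT}\cdot\mathbb{E}[T] + \mathrm{VOR}\cdot\sigma_T ,
\]

where $c_E$ is the unit cost of energy, $E$ the energy consumed along the path, $\mathbb{E}[T]$ the expected travel time, and $\sigma_T$ the standard deviation of arrival time. Under such a form, congestion induced by other agents raises both $\mathbb{E}[T]$ and $\sigma_T$, consistent with the cost increases reported above.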