Nabil BELGASMI
Banque de Tunisie, Tunisia
Title: Multiobjective deep reinforcement learning approach for ATM cash replenishment planning
Abstract
The current reinforcement learning framework is based on single-objective performance optimization: maximizing expected returns computed from scalar rewards, which come either from a univariate environment response to the agent's actions or from a weighted aggregation of a multivariate response. In many real-world situations, however, trade-offs must be made among multiple conflicting objectives that have different orders of magnitude, measurement units, and business-specific contexts related to the problem being solved (e.g. costs, lead time, quality of service, profits). Aggregating such sub-rewards into a scalar reward assumes perfect knowledge of the decision maker's preferences and of how she perceives the importance of each objective. In this study, we consider the problem of learning the best ATM cash replenishment policies in an uncertain multiobjective context, given an arbitrary history of cash withdrawals that may be nonstationary and may contain outliers. We propose a model-free multiobjective deep reinforcement learning approach that allows us to compete against the human decision maker and to find, for each ATM, a policy that outperforms the current human policy. The idea is to disaggregate the performance of a replenishment policy into a vector of objective functions; the performance of the human policy then becomes a multi-dimensional reference point (Rh). The task of the deep reinforcement learning algorithm is to find a policy that generates a set of performance points which Pareto-dominate the human reference point (Rh).
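The core comparison described above (checking whether a learned policy's performance vector Pareto-dominates the human reference point Rh) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the objective names and numbers are hypothetical, and all objectives are assumed to be minimized.

```python
import numpy as np

def pareto_dominates(a, b):
    """True if vector a Pareto-dominates vector b (all objectives minimized):
    a is no worse than b on every objective and strictly better on at least one."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def dominating_policies(performances, reference):
    """Return indices of candidate performance vectors that Pareto-dominate
    the reference point (here, the human policy's performance Rh)."""
    return [i for i, p in enumerate(performances) if pareto_dominates(p, reference)]

# Hypothetical two-objective vectors: (replenishment cost, stock-out rate)
Rh = [120.0, 0.05]               # human policy reference point
candidates = [
    [100.0, 0.04],               # better on both objectives: dominates Rh
    [130.0, 0.03],               # trades cost for service: incomparable with Rh
    [110.0, 0.05],               # lower cost, equal service: dominates Rh
]
print(dominating_policies(candidates, Rh))  # -> [0, 2]
```

A policy whose performance vector lands in the dominating set improves on the human policy in every objective without a preference-weighting step, which is exactly what scalar reward aggregation cannot guarantee.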