Risk-aware multi-armed bandits with refined upper confidence bounds
journal contributionposted on 07.01.2021, 11:52 by Xingchi Liu, Mahsa DerakhshaniMahsa Derakhshani, Sangarapillai Lambotharan, Mihaela Van der Schaar
The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma of the decisionmaking problem and always treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if the variance is high. Hence, the variation of the reward should be considered to make the arm-selection process risk-aware. In this paper, the mean-variance metric is investigated to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem when the reward follows a Gaussian distribution, and a concentration inequality on the variance is developed to design a Gaussian risk aware-upper confidence bound algorithm. Furthermore, we extend this algorithm to a novel asymptotic risk aware-upper confidence bound algorithm by developing an upper confidence bound of the variance based on the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve the O(log(T)) regret. Finally, numerical results demonstrate that our algorithms outperform several risk-aware MAB algorithms.
Royal Academy of Engineering under the Leverhulme Trust Research Fellowship scheme (DerakhshaniLTRF1920\16\67)
- Mechanical, Electrical and Manufacturing Engineering