Posted on 2021-01-07, 11:52. Authored by Xingchi Liu, Mahsa Derakhshani, Sangarapillai Lambotharan, Mihaela Van der Schaar
The classical multi-armed bandit (MAB) framework studies the exploration-exploitation dilemma of the decision-making problem and always treats the arm with the highest expected reward as the optimal choice. However, in some applications, an arm with a high expected reward can be risky to play if its variance is also high. Hence, the variation of the reward should be considered to make the arm-selection process risk-aware. In this paper, the mean-variance metric is investigated to measure the uncertainty of the received rewards. We first study a risk-aware MAB problem in which the rewards follow a Gaussian distribution, and we develop a concentration inequality on the variance to design a Gaussian risk-aware upper confidence bound algorithm. Furthermore, we extend this algorithm to a novel asymptotic risk-aware upper confidence bound algorithm by developing an upper confidence bound on the variance based on the asymptotic distribution of the sample variance. Theoretical analysis proves that both proposed algorithms achieve O(log(T)) regret. Finally, numerical results demonstrate that our algorithms outperform several existing risk-aware MAB algorithms.
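To make the risk-aware arm-selection rule concrete, the sketch below shows one common way such a loop can be structured. It is a minimal sketch under stated assumptions, not the paper's algorithm: it uses the widely adopted mean-variance index MV_i = sigma_i^2 - rho * mu_i (lower is better) and a generic sqrt(log(t)/n_i) exploration bonus. The function name risk_aware_ucb, the bonus constant b, and the optimistic index are illustrative; the paper's Gaussian and asymptotic risk-aware UCB algorithms instead use confidence bounds on the variance derived from the concentration inequality and asymptotic distribution described above.

```python
import numpy as np

# Minimal sketch of a risk-aware (mean-variance) UCB-style bandit loop.
# Assumptions (not from the paper): mean-variance index
# MV_i = var_i - rho * mean_i (lower is better), Gaussian rewards, and an
# illustrative exploration bonus b * sqrt(log(t) / n_i) applied to both the
# sample mean and the sample variance.
def risk_aware_ucb(reward_fns, T, rho=1.0, b=2.0, rng=None):
    rng = np.random.default_rng(rng)
    K = len(reward_fns)
    counts = np.zeros(K)
    means = np.zeros(K)
    m2 = np.zeros(K)          # running sum of squared deviations (Welford)
    history = []
    for t in range(1, T + 1):
        if t <= K:            # play each arm once to initialize statistics
            arm = t - 1
        else:
            var = m2 / np.maximum(counts - 1, 1)
            bonus = b * np.sqrt(np.log(t) / counts)
            # Optimistic (low) estimate of each arm's mean-variance index:
            # lower bound on the variance, upper bound on the mean.
            index = (var - bonus) - rho * (means + bonus)
            arm = int(np.argmin(index))
        x = reward_fns[arm](rng)
        counts[arm] += 1
        delta = x - means[arm]
        means[arm] += delta / counts[arm]
        m2[arm] += delta * (x - means[arm])
        history.append(arm)
    return history

# Example: two arms with equal means but different variances; a risk-aware
# learner should concentrate its plays on the low-variance arm.
arms = [lambda g: g.normal(1.0, 0.1), lambda g: g.normal(1.0, 2.0)]
plays = risk_aware_ucb(arms, T=2000, rho=1.0, rng=0)
print("low-variance arm share:", plays.count(0) / len(plays))
```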
Funding
Royal Academy of Engineering under the Leverhulme Trust Research Fellowship scheme (Derakhshani, LTRF1920\16\67)
History
School
Mechanical, Electrical and Manufacturing Engineering
Published in
IEEE Signal Processing Letters
Volume
28
Pages
269-273
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.