Multi-armed bandit is a widely-studied model for
sequential decision-making problems. The most studied model in
the literature is stochastic bandits wherein the reward of each
arm follows an independent distribution. However, there is a wide
range of applications where the rewards of different alternatives
are correlated to some extent. In this paper, a class of structured bandit problems is studied in which rewards of different
arms are functions of the same unknown parameter vector. To
minimize the cumulative learning regret, we propose a globally informative Thompson sampling algorithm to learn and leverage
the correlation among arms, which can deal with unknown multidimensional parameter and non-monotonic reward functions.
Our studies demonstrate that the proposed algorithm achieves
significant improvement in the learning speed. In particular, the
designed algorithm is used to solve an edge transcoder selection
problem in crowdsourced live video streaming systems and shows
superior performance as compared to the existing schemes.
Funding
Communications Signal Processing Based Solutions for Massive Machine-to-Machine Networks (M3NETs)
Engineering and Physical Sciences Research Council
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works