The multi-armed bandit problem is a simple model of decision-making with uncertainty that lies in the class of classical reinforcement learning problems. Given a set of arms, a learner interacts sequentially with these arms sampling a reward at each round and the objective of the learner is to identify the arm with the largest expected reward while maximizing the total cumulative reward. The problem that the learner faces is a trade-off between exploration and exploitation: one wants to explore all the arms to identify the best one but also wants to exploit those arms that give the best rewards. Bandit algorithms are online, which means that the strategy is adaptive at each round and is learned from previously observed events. Some real-life applications of bandit include clinical trials, dynamic pricing, advertisement recommendation, or online recommender systems among many others. As an example, a news article recommendation system can be modeled as a bandit considering the arms as the articles to recommend and the reward a binary outcome that indicates if the user clicks on the recommended article.

Recently quantum algorithms for the classical multi-armed bandit and also for classical recommendation systems have been proposed. While these focus on quantum algorithms applied to classical bandits or recommendation systems, our work treats multi-armed bandits from a different perspective. We focus on learning tasks where the environment and actions are quantum but the strategies are classical. In our model, which we call the multi-armed quantum bandit (MAQB) model, the arms correspond to different observables or measurements and the corresponding reward is distributed according to Born’s rule for these measurements on an unknown quantum state. The unknown quantum state is called the environment. We are interested in the optimal tradeoff between exploration (learning more about the unknown quantum state) and exploitation (using acquired information about the unknown state to choose the most rewarding but not necessarily most informative measurement).

The figure of merit that we study is the cumulative expected regret, the sum over all rounds of the difference between the maximal expected reward and the expected actual reward associated with the chosen action (observable) at each round. In this work, we give lower and upper bounds on the minimal regret attainable by any strategy in terms of the number of rounds, the number of arms, and the dimension of the Hilbert space for different sets of environments (general and pure quantum states) and actions (discrete and continuous).

We note that one can try to apply techniques from quantum state tomography or shadow tomography in this setting using the observables of the action set in order to estimate the unknown quantum state. If we learn the unknown quantum state (or some approximation thereof) we can choose the best action in order to minimize the regret. However, this strategy is not optimal: quantum state tomography algorithms can be thought of as pure exploration strategies since the algorithm only cares about choosing the action that helps us to learn most about the unknown quantum state, which is not necessarily the action that minimizes the regret.

Our work was presented at AQIS21, BIIDâ€™9, and QTML21 and it was published in Quantum.