The (classical) multi-armed bandit problem
The multi-armed bandit problem is a simple model of decision-making under uncertainty that belongs to the class of classical reinforcement learning problems. Given a set of arms, a learner interacts with them sequentially, sampling a reward at each round; the objective is to identify the arm with the largest expected reward while maximizing the total cumulative reward. The learner thus faces a trade-off between exploration and exploitation: one wants to explore all the arms to identify the best one, but also to exploit the arms that have given the best rewards so far. Bandit algorithms are online, which means that the strategy is adapted at each round based on previously observed events. Real-life applications of bandits include clinical trials, dynamic pricing, advertisement recommendation, and online recommender systems, among many others. As an example, a news article recommendation system can be modeled as a bandit where the arms are the articles to recommend and the reward is a binary outcome indicating whether the user clicks on the recommended article.

The exploration-exploitation dilemma is a fundamental concept that arises in many situations. For example, in the picture above, we can relate exploration to trying new restaurants (where we might find better dishes) and exploitation to returning to known restaurants (which we already know we like).
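To make the setting above concrete, here is a minimal Python sketch (not taken from any of the works discussed here) of the classical UCB1 strategy on a Bernoulli bandit, where each arm could stand for a news article and the reward for a click; the arm means and all parameters are invented for illustration.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given (illustrative) arm means."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms        # number of pulls per arm
    totals = [0.0] * n_arms      # cumulative reward per arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1          # pull each arm once to initialize
        else:
            # empirical mean (exploitation) + confidence bonus (exploration)
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # click / no click
        counts[arm] += 1
        totals[arm] += reward
        regret += best_mean - arm_means[arm]
    return regret

print(ucb1(arm_means=[0.10, 0.12, 0.25], horizon=10_000))
```

The exploration bonus shrinks as an arm is pulled more often, which is one standard way of balancing the two sides of the dilemma.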
How do we approach bandits from a quantum perspective?
We are interested in applying multi-armed bandit theory to quantum learning tasks, as well as in contributing to the classical multi-armed bandit problem. This includes:
- Identifying the quantum tasks for which the exploration-exploitation dilemma is relevant. Learning in quantum information means learning relevant properties of an unknown quantum state through different measurements, and includes tasks such as standard quantum state tomography, quantum hypothesis testing, and shadow tomography, among others. The exploration-exploitation dilemma becomes relevant in tasks where the learning (exploration) is restricted, for example by energy loss constraints.
- Designing new algorithms where the underlying structure of the problem is quantum. Since we are interested in working with classical algorithms, we first need to identify which part of the problem is quantum and then design algorithms for that specific task. The main mathematical tools we use come from probability theory, in particular concentration inequalities for random variables. The main difficulty is that, when dealing with adaptive algorithms, we need techniques suited for non-independent random variables (see the example after this list).
- Conversely, to assess the optimality of our algorithms we need to design information-theoretic constructions (usually lower bounds on some figure of merit) that show their limitations. For this part, we mainly use tools from quantum information theory.
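As a small illustration of the concentration tools mentioned in the second point: for independent samples one typically uses Hoeffding-type bounds, while adaptively collected (non-independent) data calls for martingale versions such as the Azuma-Hoeffding inequality. For independent X_1, \dots, X_n \in [0,1] with mean \mu,

\Pr\left[ \left| \frac{1}{n}\sum_{t=1}^{n} X_t - \mu \right| \geq \epsilon \right] \leq 2\exp\left(-2n\epsilon^2\right),

whereas for a martingale (M_t)_{t \geq 0} with increments bounded as |M_t - M_{t-1}| \leq c_t,

\Pr\left[ \left| M_n - M_0 \right| \geq \epsilon \right] \leq 2\exp\left(-\frac{\epsilon^2}{2\sum_{t=1}^{n} c_t^2}\right).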
Multi-armed quantum bandits

Recently, quantum algorithms for the classical multi-armed bandit problem and for classical recommendation systems have been proposed. While those works apply quantum algorithms to classical bandits or recommendation systems, our work treats multi-armed bandits from a different perspective. In our first work, “Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states”, we focused on learning tasks where the environment and the actions are quantum but the strategies are classical. We defined the multi-armed quantum bandit (MAQB) model, where the arms correspond to different observables or measurements and the reward is distributed according to Born’s rule for these measurements on an unknown quantum state \rho, called the environment. In this setting, at each time step t\in [T] the learner selects an observable O_{a_t} from a known action set, measures a single copy of \rho, and receives a reward x_t sampled from the probability distribution associated with measuring \rho with the chosen observable.
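To make the reward model concrete, here is a small numerical sketch (not code from the paper) of how a single reward is generated by Born’s rule: diagonalize the chosen observable, compute the outcome probabilities on the unknown state, and sample one eigenvalue. The state and observable below are purely illustrative.

```python
import numpy as np

def sample_reward(O, rho, rng):
    """Sample one measurement outcome of observable O on state rho (Born's rule)."""
    eigvals, eigvecs = np.linalg.eigh(O)
    # probability of outcome i is <v_i| rho |v_i>
    probs = np.real([v.conj() @ rho @ v for v in eigvecs.T])
    probs = np.clip(probs, 0, None)
    probs /= probs.sum()
    return rng.choice(eigvals, p=probs)

rng = np.random.default_rng(0)
rho = np.array([[0.8, 0.0], [0.0, 0.2]])   # illustrative environment (a qubit state)
O_x = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli X as an illustrative observable
rewards = [sample_reward(O_x, rho, rng) for _ in range(5)]
print(rewards)                              # each reward is +1 or -1 for Pauli X
```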
We are interested in the optimal tradeoff between exploration (learning more about the unknown quantum state) and exploitation (using acquired information about the unknown state to choose the most rewarding but not necessarily most informative measurement).
The figure of merit that we study is the cumulative expected regret, the sum over all rounds of the difference between the maximal expected reward and the expected actual reward associated with the chosen action (observable) at each round. In this work, we give lower and upper bounds on the minimal regret attainable by any strategy in terms of the number of rounds, the number of arms, and the dimension of the Hilbert space for different sets of environments (general and pure quantum states) and actions (discrete and continuous).
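In symbols (using the notation of the previous paragraphs, where by Born’s rule the expected reward of observable O_a on the environment \rho is \operatorname{Tr}(\rho O_a)), the cumulative expected regret after T rounds can be written as

\text{Regret}(T) = \sum_{t=1}^{T} \left( \max_{a} \operatorname{Tr}(\rho O_a) - \operatorname{Tr}(\rho O_{a_t}) \right).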
This work was presented at AQIS21 (long talk), BIID’9, and QTML21 and it was published in Quantum.
Learning pure quantum states (almost) without regret
After our first work on multi-armed quantum bandits, we looked into a different problem where the exploration-exploitation dilemma arises. We studied a new approach to quantum state tomography where a learner sequentially selects probe states to measure an unknown pure quantum state |\psi\rangle. The goal is to learn the unknown state efficiently while performing measurements that disturb it only minimally (i.e. measurements that align well with the unknown state). Mathematically, this problem can be modeled as a bandit problem, since fundamentally we are interested in optimizing an exploration-exploitation trade-off. In our setting, exploration corresponds to learning the unknown state |\psi\rangle by selecting sufficiently informative probes, and exploitation to performing measurements aligned with |\psi\rangle.
To quantify the penalty that the learner suffers for selecting probes that are poorly aligned with the unknown state, we use the regret as figure of merit, defined as \text{Regret}(T) = \sum_{t=1}^T \left( 1 - \langle \psi |\Pi_t |\psi\rangle \right), where F(\Pi, \Pi_{t}) = \langle \psi | \Pi_{t} | \psi \rangle is the fidelity between the unknown state \Pi = |\psi\rangle \langle \psi | and the selected probe \Pi_t = |\psi_t\rangle \langle \psi_t | at time step t\in[T].
The particular quantum feature of our problem, which sets it apart from standard bandit problems, is that the noise of the measurement outcomes decays linearly as the selected measurements approach the unknown state. This noise model follows directly from Born’s rule, and we had to take this feature into account when designing our algorithm. We introduced a new technique based on a median-of-means least squares estimator (MOMLSE) \widehat{\Pi}_t.
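The MOMLSE construction itself is specific to the paper, but the median-of-means idea behind it can be sketched in a few lines: split the samples into groups, average within each group, and return the median of the group means, which is robust to heavy-tailed or adversarial noise. The snippet below is a generic illustration of that principle, not the estimator from the paper.

```python
import numpy as np

def median_of_means(samples, n_groups=8, seed=0):
    """Generic median-of-means estimate of the mean of 1-D samples."""
    samples = np.asarray(samples, dtype=float)
    shuffled = np.random.default_rng(seed).permutation(samples)
    groups = np.array_split(shuffled, n_groups)   # split into roughly equal groups
    group_means = [g.mean() for g in groups]      # mean within each group
    return float(np.median(group_means))          # median across groups

# Example: heavy-tailed noise around a true mean of 1.0
rng = np.random.default_rng(1)
data = 1.0 + rng.standard_t(df=2, size=1000)
print(median_of_means(data), data.mean())
```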

Our algorithm is fully adaptive: at each time step it outputs a MOMLSE estimate \widehat{\Pi}_t and builds a high-probability confidence region \mathcal{C}_t (shaded region) around the unknown state |\psi\rangle in the Bloch sphere representation. It then selects measurements \Pi^\pm_{a_t} close to the unknown state by projecting onto the Bloch sphere the extreme points of the largest principal axis of \mathcal{C}_t. This particular choice allows optimal learning of |\psi\rangle (exploration) while simultaneously minimizing the regret (exploitation).
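As a rough schematic of this selection step for a single qubit (this is not the procedure from the paper, and the ellipsoid parametrization and variable names are assumptions made for illustration): given a center estimate of the Bloch vector and an ellipsoidal confidence region, take the two extreme points along the largest principal axis, project them back onto the Bloch sphere, and measure the corresponding rank-one projectors.

```python
import numpy as np

# Assumption for this sketch: C_t = {r : (r - r_hat)^T A (r - r_hat) <= 1} in Bloch space.
PAULI = [np.array([[0, 1], [1, 0]], dtype=complex),
         np.array([[0, -1j], [1j, 0]], dtype=complex),
         np.array([[1, 0], [0, -1]], dtype=complex)]

def bloch_to_projector(r):
    """Rank-one projector |psi><psi| with unit Bloch vector r."""
    return 0.5 * (np.eye(2, dtype=complex) + sum(ri * P for ri, P in zip(r, PAULI)))

def select_probes(r_hat, A):
    """Probes at the extreme points of the largest principal axis of the ellipsoid."""
    eigvals, eigvecs = np.linalg.eigh(A)
    v = eigvecs[:, 0]                    # smallest eigenvalue of A <-> largest semi-axis
    half_axis = v / np.sqrt(eigvals[0])  # extreme displacement along that axis
    probes = []
    for sign in (+1, -1):
        r = r_hat + sign * half_axis
        r = r / np.linalg.norm(r)        # project back onto the Bloch sphere
        probes.append(bloch_to_projector(r))
    return probes

r_hat = np.array([0.1, 0.0, 0.9])        # illustrative center estimate
A = np.diag([50.0, 200.0, 200.0])        # illustrative confidence-region shape
Pi_plus, Pi_minus = select_probes(r_hat, A)
print(np.round(Pi_plus, 3))
```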
To carry out the theoretical analysis of the above algorithm we had to rethink and develop completely new techniques and proofs for linear stochastic bandits, as the existing gold-standard methods were insufficient for our problem. This led us to write the work “Linear bandits with polylogarithmic minimax regret”, which was accepted to the proceedings of the Conference on Learning Theory (COLT), the premier and most competitive conference in theoretical machine learning and learning theory: https://proceedings.mlr.press/v247/lumbreras24a.html.
We then adapted this algorithm to our quantum problem in the work “Learning pure quantum states (almost) without regret” and proved that \text{Regret}(T) = \Theta(\text{polylog}\, T). This result establishes the first linear stochastic bandit with a continuous action set that achieves polylogarithmic minimax regret, in stark contrast to the square-root scaling of the regret for typical bandit algorithms.
This work has been accepted as a long talk at AQIS24, and you can find the preprint at https://www.arxiv.org/abs/2406.18370.