The multi-armed stochastic bandits is a problem of decision-making with uncertainty where a learner interacts sequentially with an unknown set of probability distributions, sampling a reward at each round with the objective of maximizing its expected reward. The learner faces a dilemma of exploration versus exploitation, since in order to acquire information they have to explore the set of unknown probability distributions but at the same time want to pick those ones that apparently give the maximal reward.

We extend the classical problem to a quantum learning scenario initiating the study of tradeoffs between exploration and exploitation in online learning of quantum states. In the multi-armed quantum bandits setting we consider a learner that has oracle access to an unknown quantum state and at each round has to choose an observable from a set of actions aiming to maximize its expectation value on the state.

As a figure of merit we consider the expected cumulative regret that characterizes the gap between the maximal reward (given the action set) and the reward received at each round. Our goal is to study the dependence of the expected cumulative regret on the number of rounds, the number of available actions and the dimension of the underlying space. We provide various information-theoretic lower bounds on the expected cumulative regret and exhibit strategies that are optimal if the action set is finite and the unknown quantum state is mixed. If we have a promise that the unknown quantum state is pure and the action set contains all rank-1 projectors the regret can be interpreted as the infidelity with the target state and we provide some support for the conjecture that measurement strategies with less regret than the general case exist.

*Open-access preprint is available here**: **arXiv:2108.13050 *