Reinforcement learning in restless bandits

Scalable regret via Thompson sampling

Aditya Mahajan

McGill University

20 Nov, 2023

Motivation

  • Restless multi-armed bandits are a popular framework for solving resource allocation and scheduling problems.

Salient features

  • Whittle index policy avoids the curse of dimensionality (in # of arms)
  • It is a heuristic, but theoretically motivated and performs well in a large class of models. See Gast et al. (2023) for asymptotic optimality guarantees.

Limitation

  • Assumes perfect knowledge of the system model!

\(\def\EXP{\mathbb{E}} \def\IND{\mathbb{1}} \def\PR{\mathbb{P}} \def\ALPHABET{\mathcal} \def\S{\mathsf{S}} \def\A{\mathsf{A}} \def\REGRET{\mathcal{R}} \def\bs{\boldsymbol} \def\OO{\tilde{\mathcal{O}}}\)

RMAB with unknown system model

  • Suppose the dynamics of the arms are unknown.

  • Our motivation: Using RMABs for operator allocation in human-robot teams.

  • Can we use reinforcement learning (RL) to learn good policies?

Two challenges

  • Naive model-free RL cannot exploit the structure of RMABs … but see Avrachenkov and Borkar (2022) for using Q-learning (QL) to learn the Whittle index (WI).

  • The curse of dimensionality pops up again when we consider regret

Review: Regret of RL in MDPs

  • Consider average cost MDP \(\langle \mathcal{S}, \mathcal A, P, r\rangle\) with optimal performance: \[ J^\star = \sup_{ \pi \in \Pi} \lim_{T \to \infty} \frac 1T \EXP^{\pi} \biggl[ \sum_{t=1}^T r(S_t, A_t) \biggr] \]
  • Regret of a history-dependent and randomized learning policy \(π\): \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J^\star - \sum_{t=1}^T r(S_t, A_t) \biggr]. \]

  • Various characterizations of regret in the literature: UCRL, REGAL, SCAL, TSDE, …

Review: Regret of RL in MDPs

  • Lower bound: \(\tilde Ω(\underline κ \sqrt{\S\A T})\)
  • Upper bound: \(\OO(\bar κ \S \sqrt{\A T})\)
  • \(\underline κ\) and \(\bar κ\) are model-dependent constants.

Implication for RMAB

  • Consider an RMAB with \(n\) arms, of which at most \(m\) can be activated at a time: \[ \text{Regret} = \OO\biggl( \prod_{i=1}^n \S_i \sqrt{ \binom{n}{m} T } \biggr) \]
  • Regret upper bound exponential in \(n\) (curse of dimensionality).
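  • For example, for \(n\) identical arms with \(\S\) states each and \(m = 1\), the bound above reads \[ \text{Regret} = \OO\bigl( \S^n \sqrt{n T} \bigr). \]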

Are we doomed?

Does the regret definition make sense?

  • Why consider regret w.r.t. the optimal policy?

  • Better to consider regret w.r.t. Whittle index policy

    \[ \mathcal R(T; π) = \EXP^{\pi}\biggl[ T J(\mu) - \sum_{t=1}^T r(S_t, A_t) \biggr] \] where \(μ\) is the Whittle index policy.

  • How does this regret scale with number of arms?

Let’s check that experimentally

  • Environment A: Machine maintenance model

    • Machines deteriorate stochastically over time
    • Incur a cost that depends on the state of the machine
    • One repairman, who can replace a machine at a cost.
  • Environment B: Link scheduling model

    • Multiple queues communicating on a common channel
    • Only one queue can be scheduled
    • Others incur a holding cost depending on queue length
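To make Environment A concrete, here is a minimal, hypothetical Python instantiation of a single machine-maintenance arm (state = deterioration level, action 1 = replace). The deterioration probability and costs are illustrative placeholders, not the parameters used in the talk's experiments.

```python
import numpy as np

# Hypothetical instantiation of one arm of Environment A (machine maintenance).
S = 5            # deterioration levels: 0 (new), ..., S-1 (worn out)
p = 0.3          # probability of deteriorating by one level when left alone

# P[a] is the S x S transition matrix under action a (0 = passive, 1 = replace).
P = np.zeros((2, S, S))
for s in range(S):
    P[0, s, min(s + 1, S - 1)] += p     # passive: deteriorate by one level w.p. p
    P[0, s, s] += 1 - p                 # passive: otherwise stay put
    P[1, s, 0] = 1.0                    # replace: reset to the new state

operating_cost = np.arange(S) ** 2      # cost increases with deterioration level
replacement_cost = 10.0
# Reward model A: every arm yields a reward (here, a negative cost) at each step.
r = np.stack([-operating_cost,
              -operating_cost - replacement_cost])    # r[a, s]
```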

Regret of QWI on Env A

[Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\)]

Regret of QWI on Env B

[Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\)]

Implication

  • Whittle-index-aware RL algorithms do well in practice.
  • The general-purpose regret bound for MDPs is too loose.
  • Can we develop Whittle-index-aware regret bounds?
  • This talk: Regret of Thompson sampling for RMAB
    • Why Thompson sampling? Easy to implement (no hyperparameter tuning) and easy to analyze. See Gast et al. (2022) for contemporary work on rested MABs.

System Model

  • \(n\) arms: \(\langle \ALPHABET S^i, \ALPHABET A^i = \{0, 1\}, P^i, r^i \rangle\)

  • Global state \(\boldsymbol S_t = (S^1_t, \dots, S^n_t)\) and global action \(\boldsymbol A_t = (A^1_t, \dots, A^n_t)\).

  • An action is feasible if \(\| \boldsymbol A_t \|_1 = m\).

  • Dynamics: \(\displaystyle \PR(\boldsymbol S_{t+1} \mid \boldsymbol S_{0:t}, \boldsymbol A_{0:t}) = \prod_{i = 1}^n P^i(S^i_{t+1} \mid S^i_t, A^i_t).\)
  • Rewards: Two reward models

    • Model A: All arms yield reward. (queueing networks, machine maintenance, etc.)
    • Model B: Only active arms yield reward. (rested MAB, cognitive radios)

The optimization problem

Let \(\Pi\) denote the set of all randomized history-dependent policies.

\[ \max_{\pi \in \Pi} \liminf_{T \to ∞} \frac 1T \EXP^π\biggl[ \sum_{t=1}^T \boldsymbol r(\boldsymbol S_t, \boldsymbol A_t) \biggr] \quad \text{s.t. } \| \boldsymbol A_t \|_1 = m, \forall t \]

Whittle’s relaxation

  • Replace the hard constraint \(\| \boldsymbol A_t \|_1 = m, \forall t\) by the soft constraint

    \[ \limsup_{T \to ∞} \frac 1T \EXP^π \biggl[ \sum_{t=1}^T \| \boldsymbol A_t \|_1 \biggr] = m. \]

Whittle index policy

Key intuition: The Lagrange relaxation of the soft-constraint problem is equivalent to \(n\) decoupled optimization problems: for all \(i \in [n]\) \[ \max_{ π^i_{λ} \colon \ALPHABET S^i \to \{0, 1\} } \liminf_{T \to ∞} \frac 1T \EXP^{ π^i_{λ} } \biggl[ \sum_{t=1}^T \Bigl[ r^i(S^i_t, A^i_t) - λ A^i_t \Bigr] \biggr]. \]

Let \(π^i_{λ}\) be an optimal policy for this problem.
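The talk does not prescribe a particular solver for this per-arm problem; as a concrete illustration, here is a minimal Python sketch using relative value iteration on the \(λ\)-penalized single-arm MDP (assuming the arm's MDP is unichain so that the method converges; function and variable names are not from the talk).

```python
import numpy as np

def solve_arm(P, r, lam, iters=5000, tol=1e-8):
    """Relative value iteration for the lambda-penalized single-arm problem.

    P:   (2, S, S) transition matrices for actions 0 (passive) and 1 (active).
    r:   (2, S) rewards, r[a, s].
    lam: activation charge lambda.
    Returns (gain J, relative value function h, policy pi) for this arm.
    """
    S = P.shape[1]
    r_lam = np.array(r, dtype=float)
    r_lam[1] -= lam                        # charge lambda for pulling the arm
    h = np.zeros(S)
    for _ in range(iters):
        Q = r_lam + P @ h                  # Q[a, s] = r(s, a) - lam * a + E[h(S')]
        h_new = Q.max(axis=0)
        h_new = h_new - h_new[0]           # renormalize at a reference state
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    Q = r_lam + P @ h
    return Q.max(axis=0)[0], h, Q.argmax(axis=0)   # (gain, values, policy in {0, 1})
```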

Indexability and Whittle index

  • Arm \(i\) is indexable if \(\ALPHABET W^i_{λ} = \{s \in \ALPHABET S^i : π^i_{λ}(s) = 0 \}\) is non-decreasing in \(λ\).
  • Whittle index: \(w^i(s) = \inf \{ λ \in \mathbb{R} : s \in \ALPHABET W^i_{λ} \}\).
  • Whittle index policy: Play the arms with the \(m\) highest Whittle indices
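Given per-arm index tables (computed offline, e.g., by bisection over \(λ\) with a per-arm solver such as the sketch above), the policy itself is a simple top-\(m\) selection. A minimal Python sketch with hypothetical index tables:

```python
import numpy as np

def whittle_index_action(w, state, m):
    """Top-m selection: activate the m arms with the highest current Whittle indices.

    w:     per-arm index tables, w[i][s] = Whittle index of arm i in state s
    state: current per-arm states (s^1, ..., s^n)
    m:     number of arms to activate (ties are broken arbitrarily)
    """
    indices = np.array([w[i][state[i]] for i in range(len(w))])
    action = np.zeros(len(w), dtype=int)
    action[np.argsort(indices)[-m:]] = 1
    return action

# Hypothetical index tables for 3 two-state arms; activate m = 1 arm.
w = [np.array([0.1, 0.8]), np.array([0.3, 0.5]), np.array([0.0, 0.2])]
print(whittle_index_action(w, state=(1, 0, 1), m=1))    # -> [1 0 0]
```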

The learning setup

  • Let \(θ^i_\star \in Θ^i\) denote the unknown parameters of \(P^i(\cdot| \cdot, \cdot)\), where \(Θ^i\) is a known compact set.
  • Assumption A1: For all \(i \in [n]\) and \(θ^i \in Θ^i\), the RB \(\langle \ALPHABET S^i, \{0, 1\}, P^i(θ^i), r^i \rangle\) is indexable.

  • Assumption A2: For \(\pmb θ = (θ^1, \dots, θ^n) \in \pmb Θ\), the Whittle index policy \(μ_{\pmb θ}\) has a well-defined dynamic programming (DP) solution \((J_{\pmb θ}, \boldsymbol V_{\pmb θ})\), which implies that \(J_{\pmb θ}\) is independent of the initial state.

  • The ergodicity coefficient of \(\pmb{P_{θ}}\) (transitions under WI policy) is defined as \[ λ_{\pmb{P_{θ}}} = 1 - \min_{\boldsymbol{s, s' \in \ALPHABET S}} \sum_{\boldsymbol {z \in \ALPHABET S}} \min\bigl\{ \pmb{P_{θ}}(\pmb{z} \mid \pmb{s}), \pmb{P_{θ}}(\pmb{z} \mid \pmb{s'}) \bigr\}. \]

  • Assumption A3: There exists a \(λ^\star < 1\) such that \(\sup_{\pmb{θ} \in \pmb{Θ}} λ_{\boldsymbol{P}_{\pmb θ}} \le λ^\star\)

Bayesian Learning framework

  • \(\{θ^i_{\star}\}_{i \in [n]}\) are independent random variables.
  • \(\phi^i_1\) denotes the prior on \(θ^i\).
  • \(\phi^i_{t}\) denotes the posterior on \(θ^i_\star\) at time \(t\): \[ \phi^i_{t+1}(d θ) = \frac{ P^i(s^i_{t+1} \mid s^i_t, a^i_t; θ) \phi^i_t(d θ) } { \int P^i(s^i_{t+1} \mid s^i_t, a^i_t; \tilde θ) \phi^i_t(d \tilde θ) } \]

  • If the prior is a conjugate distribution on \(Θ^i\), then the posterior can be updated in closed form.

  • The exact form of the posterior and the posterior update are not important for regret analysis.
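As a concrete (assumed) instance of such a conjugate prior for finite-state arms, one can place an independent Dirichlet prior on each row \(P^i(\cdot \mid s, a)\), so that the posterior update reduces to incrementing transition counts. A minimal Python sketch:

```python
import numpy as np

class DirichletArmPosterior:
    """Independent Dirichlet prior on each row P^i(. | s, a) of one arm.

    With a Dirichlet prior, the posterior after observing a transition
    (s, a) -> s' is obtained by incrementing the corresponding count.
    """
    def __init__(self, num_states, prior_count=1.0):
        # counts[a, s, :] are the Dirichlet parameters of the row P(. | s, a)
        self.counts = np.full((2, num_states, num_states), prior_count)

    def update(self, s, a, s_next):
        self.counts[a, s, s_next] += 1.0

    def sample(self, rng):
        """Draw one transition model P ~ posterior (the Thompson sampling step)."""
        return np.array([[rng.dirichlet(row) for row in self.counts[a]]
                         for a in range(2)])             # shape (2, S, S)

rng = np.random.default_rng(0)
post = DirichletArmPosterior(num_states=5)
post.update(s=2, a=0, s_next=3)                          # observed 2 -(passive)-> 3
P_hat = post.sample(rng)                                 # one plausible model for this arm
```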

RB-TSDE Algorithm

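The slides present the algorithm as a figure; below is a hedged Python sketch of the overall loop, following the TSDE template of Ouyang et al. (2017): at the start of each episode, sample one model per arm from its posterior, play the corresponding Whittle index policy, and end the episode when its length exceeds the previous one or when the visit count of some per-arm (state, action) pair doubles. The helpers `env`, `posteriors`, and `compute_whittle_indices` are placeholders, and the exact stopping rule and index solver in the paper may differ in detail.

```python
import numpy as np

def rb_tsde(env, posteriors, compute_whittle_indices, state_sizes, m, horizon, rng):
    """Hedged sketch of RB-TSDE (Thompson sampling with dynamic episodes).

    env:         has reset() -> joint state and step(action) -> (next_state, reward)
    posteriors:  per-arm posteriors with sample(rng) -> arm model, update(s, a, s')
    compute_whittle_indices: maps a sampled arm model to its index table w[s]
    state_sizes: |S^i| for each arm;  m: number of arms to activate per step.
    """
    n = len(posteriors)
    visits = [np.zeros((2, state_sizes[i]), dtype=int) for i in range(n)]  # N^i_t(s, a)
    state = env.reset()
    t, prev_len, total_reward = 0, 0, 0.0
    while t < horizon:
        # Start of an episode: sample one model per arm and compute its Whittle indices.
        w = [compute_whittle_indices(posteriors[i].sample(rng)) for i in range(n)]
        start = [v.copy() for v in visits]
        ep_len = 0
        while t < horizon:
            # Whittle index policy: activate the m arms with the highest current indices.
            idx = np.array([w[i][state[i]] for i in range(n)])
            action = np.zeros(n, dtype=int)
            action[np.argsort(idx)[-m:]] = 1
            next_state, reward = env.step(action)
            total_reward += reward
            # Update per-arm visit counts and posteriors.
            for i in range(n):
                visits[i][action[i], state[i]] += 1
                posteriors[i].update(state[i], action[i], next_state[i])
            state, t, ep_len = next_state, t + 1, ep_len + 1
            # TSDE-style stopping rule: episode longer than the previous one, or the
            # visit count of some per-arm (state, action) pair has doubled.
            if ep_len > prev_len or any((visits[i] > 2 * start[i]).any() for i in range(n)):
                break
        prev_len = ep_len
    return total_reward
```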

Regret of RB-TSDE

Theorem 1: Under assumptions (A1)–(A3): \[ \ALPHABET R(T; \texttt{RB-TSDE}) < 40 α \frac {R_{\max}}{1 - λ^\star} \bar \S_n \sqrt{T \log T} \]

  • \(α = n\) for Model A and \(α = m\) for Model B
  • \(R_{\max} = \sup_{i \in [n]} \| r^i \|_{∞}\)
  • \(\bar \S_n = \sum_{i \in [n]} |\ALPHABET S^i|\)

Corollary 1: Under assumptions (A1)–(A3):

  • Regret of Model A: \(\OO(n^2 \sqrt{T})\) and regret of Model B: \(\OO(n^{1.5} \sqrt{mT})\).
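For example, specializing Theorem 1 to Model A with \(|\ALPHABET S^i| \le \S\) for every arm (so \(α = n\) and \(\bar \S_n \le n \S\)) gives \[ \ALPHABET R(T; \texttt{RB-TSDE}) < 40\, n\, \frac{R_{\max}}{1 - λ^\star}\, n \S \sqrt{T \log T} = \OO(n^2 \sqrt{T}); \] the Model B rate follows similarly using \(α = m\) and \(m \le n\).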

Tighter regret characterization

  • Assumption A4: For each \(\pmb θ \in \pmb Θ\), \(\boldsymbol V_{\pmb θ}\) is \(\boldsymbol L_v\)-Lipschitz.

Theorem 2: Under assumptions (A1)–(A4): \[ \ALPHABET R(T; \texttt{RB-TSDE}) < 12 \max\{ n \sqrt{\bar \S_n}, \bar \S_n \} \max\bigl\{ R_{\max}, D_{\max} \boldsymbol L_v \bigr\} \sqrt{K T \log T} \]

  • \(D_{\max} = \max_{i \in [n]} \text{diam}(\ALPHABET S^i)\).
  • \(K\) is a universal constant

Corollary 2: Under assumptions (A1)–(A4):

  • Regret of Model A and B: \(\OO(n^{1.5} \sqrt{T})\)

Does the theory match the experiments?

Regret of RB-TSDE on Env A

[Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\)]

Regret of RB-TSDE on Env B

[Plots: Regret/\(\sqrt{T}\) vs \(T\); Regret vs \(n\)]

Comparison of RB-TSDE and QWI for Env A

[Plots: regret of RB-TSDE and of QWI]

Comparison of RB-TSDE and QWI for Env B

[Plots: regret of RB-TSDE and of QWI]

Discussion of the results

Why do we need Assumption A3?

Regret of TSDE is decomposed as follows:

\[\begin{align*} \REGRET(T) &= \underbrace{ \EXP\biggl[ T J_\star - \sum_{k=1}^{K_T} T_k J_k \biggr]} _{\text{regret due to sampling error $:= \REGRET_0(T)$}} \\ &\quad + \underbrace{ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bigl[ \bs{V}_k(\bs{S}_{t+1}) - \bs{V}_k(\bs{S}_{t}) \bigr] \biggr]} _{\text{regret due to time-varying policy $:= \REGRET_1(T)$}} \\ &\quad + \underbrace{ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bigl[ \bigl[ \bs{P}_k \bs{V}_k \bigr](\bs{S}_{t}) - \bs{V}_k(\bs{S}_{t+1}) \bigr] \biggr]} _{\text{regret due to model mismatch $:= \REGRET_2(T)$}}. \end{align*}\]

where \(K_T =\) number of episodes until time \(T\).

Why do we need Assumption A3?

  • Bound on \(\REGRET_0(T)\): \(\displaystyle\qquad \EXP\biggl[ T J_\star - \sum_{k=1}^{K_T} T_k J_k \biggr] \le J_{\star} \EXP[K_T]\)

  • Bound on \(\REGRET_1(T)\):

\[ \EXP\biggl[\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1} - 1} \bs{V}_k(\bs{S}_{t+1}) - \bs{V}_k(\bs{S}_{t}) \biggr] \le \EXP\biggl[\sum_{k=1}^{K_T} \text{sp}(\bs V_k) \biggr] \]

  • Bound on inner term in \(\REGRET_2(T)\):

\[ \EXP\Bigl[ \bigl[ \bs{P}_k \bs{V}_k \bigr](\bs{S}_{t}) - \bs{V}_k(\bs{S}_{t+1}) \Bigr] \le \EXP\Bigl[ \tfrac 12 \text{sp}(\bs V_k) \bigl\| \bs P_k(\cdot \mid \bs S_t, \bs A_t) - \bs P_{\star}( \cdot \mid \bs S_t, \bs A_t ) \bigr\|_{1} \Big] \]

Why do we need Assumption A3?

  • Thus, to bound the regret, we need to bound:

    • \(J_{\star}\), \(\quad\text{sp}(\bs V_{k}),\quad\) and \(\quad\bigl\| \bs P_k(\cdot \mid \bs S_t, \bs A_t) - \bs P_{\star}( \cdot \mid \bs S_t, \bs A_t ) \bigr\|_{1}\)
  • General negative result. It is shown in Khun (2023) that there exist RMABs for which \(\text{sp}(\bs V_{\star}) \ge C\) for an arbitrarily large constant \(C\).

  • We impose (A3) to bound \(\text{sp}(\bs V_k)\). Under (A3), the average-cost Bellman operator is a contraction with contraction factor \(λ^{\star}\) and we can show \[ \text{sp}(\bs V_k) \le 2 α R_{\max}/ (1 - λ^{\star}). \]
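One standard way to obtain this bound (a sketch, using the Poisson equation \(J_k \bs 1 + \bs V_k = \bs r + \bs P_k \bs V_k\) from (A2), the span contraction \(\text{sp}(\bs P_k \bs V_k) \le λ^\star \text{sp}(\bs V_k)\) from (A3), and \(\text{sp}(\bs r) \le 2 \| \bs r \|_{∞} \le 2 α R_{\max}\)):

\[ \text{sp}(\bs V_k) = \text{sp}\bigl( \bs r + \bs P_k \bs V_k - J_k \bs 1 \bigr) \le \text{sp}(\bs r) + λ^\star\, \text{sp}(\bs V_k) \;\implies\; \text{sp}(\bs V_k) \le \frac{2 α R_{\max}}{1 - λ^\star}. \]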

Is Assumption A3 sufficient?

  • How can we verify A3? Ergodicity coefficient is upper bounded by contraction factor

    \[ \bar λ_{\pmb θ} = 1 - \min_{\substack{ \bs{s, s' \in \ALPHABET S} \\ \bs {a, a' \in \ALPHABET A}(m)}} \sum_{\bs{z \in \ALPHABET S}} \min\bigl\{ \bs P_{\pmb θ}(\bs z \mid \bs s, \bs a), \bs P_{\pmb θ}(\bs z \mid \bs s', \bs a') \bigr\} \]

  • Similarly define \(\bar λ^i_{θ^i}\) for each arm.

  • Easy to see that if \(\bar λ^i_{θ^i} < 1\) for all \(i \in [n]\), then \(\bar λ_{\bs θ} < 1\).
  • Sufficient condition. For every arm, there exists a distinguished state that can be reached from every state-action pair with positive probability.

  • Negative result. It is shown by Khun (2023) that if \(\bar λ^i_{θ^i} < 1 - ε\) for all \(i \in [n]\), then \(\bar λ_{\pmb θ} \le 1 - ε^n\). (A sketch for computing these coefficients numerically follows.)
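Since the slides do not include code, here is a minimal Python sketch (not from the talk) for checking the per-arm contraction factor \(\bar λ^i_{θ^i}\) from an arm's passive/active transition matrices; the example matrices are hypothetical.

```python
import numpy as np

def contraction_factor(P0, P1):
    """Per-arm contraction factor:
    1 - min over (s, a), (s', a') of sum_z min{ P_a(z | s), P_{a'}(z | s') },
    computed from the passive/active transition matrices P0, P1 (each S x S).
    """
    rows = np.vstack([P0, P1])          # all (state, action) rows of this arm
    worst = 1.0
    for x in rows:
        for y in rows:
            worst = min(worst, np.minimum(x, y).sum())
    return 1.0 - worst

# Hypothetical 3-state arm: state 0 is reachable from every (state, action) pair,
# so the sufficient condition above holds and the factor is strictly below 1.
P0 = np.array([[0.7, 0.3, 0.0],
               [0.1, 0.5, 0.4],
               [0.1, 0.0, 0.9]])
P1 = np.tile([1.0, 0.0, 0.0], (3, 1))   # active action = reset to state 0
lam_bar = contraction_factor(P0, P1)
print(f"per-arm contraction factor: {lam_bar:.2f}")    # -> 0.90
# Per the negative result above, per-arm factors below 1 - eps only guarantee a
# joint factor below 1 - eps^n, so this route degrades quickly with n.
```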

What does this mean for regret?

  • We have shown that \(\REGRET(T) = \OO\biggl(\dfrac{1}{1 - λ^\star}\biggr)\)
  • If the contraction factor of each arm is less than \(1 - ε\), then \[ \frac{1}{1 - λ^\star} \le \frac{1}{1 - \bar λ^\star} \le \biggl( \frac {1}{ε} \biggr)^n \]
  • This doesn’t mean that \(\REGRET(T)\) is exponential in \(n\).
  • But we still don’t know the exact dependence on \(n\).

Relaxation of Assumption A3

  • Assumption (A3’). There exists a \(τ^\star\) and \(λ^\star\) such that for all \(\pmb θ \in \pmb Θ\), \[λ_{\bs P^{\color{red}{τ^*}}_{\pmb θ}} \le λ^\star.\]

  • Under (A3’), \[ \text{sp}(\bs V_{θ}) \le 2 τ^\star α R_{\max}/ (1 - λ^\star). \]

  • Doesn’t really remove potential exponential dependence on \(n\)

  • But may be easier to verify (e.g., reach a distinguished state in \(τ^*\) steps)

Can we circumvent A3 using Assumption A4?

  • Assumption (A4) simplifies the bound on the inner term in \(\REGRET_2(T)\): \[ \EXP\Bigl[ \bigl[ \bs{P}_k \bs{V}_k \bigr](\bs{S}_{t}) - \bs{V}_k(\bs{S}_{t+1}) \Bigr] \le \EXP\Bigl[ \bs L_v \ALPHABET K\bigl(\bs P_k(\cdot \mid \bs S_t, \bs A_t), \bs P_{\star}( \cdot \mid \bs S_t, \bs A_t )\bigr) \Bigr] \] where \(\ALPHABET K\) denotes the Kantorovich (Wasserstein-1) distance. There is no dependence on the span!
  • Moreover \(\text{sp}(\bs V_k) \le \bs L_v n D_{\max}\)

  • No dependence on ergodicity coefficient!

  • We provide some sufficient conditions to verify (A4), but it is hard to check in general.

Future directions

  • Why \(n^{1.5}\) and not \(n\)?

  • Continuous state space?

  • What about non-indexable models?

  • Other low-complexity heuristics

  • More general models (weakly coupled MDPs)

Thank you

What about regret w.r.t. the optimal policy?

  • We characterized regret w.r.t. the Whittle index policy
  • Possible to characterize regret w.r.t. the optimal policy

    • Need to compute the optimal policy at the beginning of each episode.
    • Intractable except for small models.
  • Nonetheless, the theory goes through, with similar bounds on regret.

Why is regret of RB-TSDE better than TSDE?

  • TSDE maintains a joint prior on \(\pmb θ_{\star} = (θ^1_\star, \dots, θ^n_\star)\).
  • RB-TSDE exploits the structure of the dynamics and maintains separate priors on \(\{ θ^i_{\star} \}_{i \in [n]}\)

  • Needs \(2n \S^2\) parameters rather than \(\S^n\) parameters.

    Implications:

    • More nuanced stopping condition for each episode and, therefore, tighter bound on number of episodes

    • Faster convergence of estimates \(\{ \hat θ^i_k \}_{i \in [n]}\) to \(\{ θ^i_{\star} \}_{i \in [n]}\).

References

Avrachenkov, K.E. and Borkar, V.S. 2022. Whittle index based Q-learning for restless bandits with average reward. Automatica 139, 110186. DOI: 10.1016/j.automatica.2022.110186.
Gast, N., Gaujal, B., and Khun, K. 2022. Learning algorithms for Markovian bandits: Is posterior sampling more scalable than optimism? Transactions on Machine Learning Research. Available at: https://inria.hal.science/hal-03262006.
Gast, N., Gaujal, B., and Yan, C. 2023. Exponential asymptotic optimality of Whittle index policy. Queueing Systems 104, 1–2, 107–150. DOI: 10.1007/s11134-023-09875-x.
Khun, K. 2023. Indexability and learning algorithms for Markovian bandits.
Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. 2017. Learning unknown Markov decision processes: A Thompson sampling approach. NeurIPS.