Non-stationary Linear Bandits Revisited

Peng Zhao, Lijun Zhang

In this note, we revisit non-stationary linear bandits, a variant of stochastic linear bandits with a time-varying underlying regression parameter. Existing studies develop various algorithms and show that they enjoy an $\widetilde{O}(T^{2/3}(1+P_T)^{1/3})$ dynamic regret, where $T$ is the time horizon and $P_T$ is the path-length that measures the fluctuation of the evolving unknown parameter. However, we discover that a serious technical flaw makes the argument ungrounded. We revisit the analysis and present a fix. Without modifying original algorithms, we can prove an $\widetilde{O}(T^{3/4}(1+P_T)^{1/4})$ dynamic regret for these algorithms, slightly worse than the rate as was anticipated. We also show some impossibility results for the key quantity concerned in the regret analysis. Note that the above dynamic regret guarantee requires an oracle knowledge of the path-length $P_T$. Combining the bandit-over-bandit mechanism, we can also achieve the same guarantee in a parameter-free way.

Knowledge Graph



Sign up or login to leave a comment