Bandit Online Optimization Over the Permutahedron

Nir Ailon, Kohei Hatano, Eiji Takimoto

The permutahedron is the convex polytope with vertex set consisting of the vectors $(\pi(1),\dots, \pi(n))$ for all permutations (bijections) $\pi$ over $\{1,\dots, n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight weight vector $s_t$, a player chooses a vertex $\pi_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^n \pi(i) s_t(i)$. A previous algorithm CombBand of Cesa-Bianchi et al (2009) guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an $n$-by-$n$ matrix permanent approximation to within improved accuracy as $T$ grows, resulting in a total running time that is super linear in $T$, making it impractical for large time horizons. We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time complexity $O(n^3T)$. The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full information setting. The technical core is a bound on the variance of the Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.

Knowledge Graph



Sign up or login to leave a comment