#### Improving Sample Complexity Bounds for Actor-Critic Algorithms

##### Tengyu Xu, Zhe Wang, Yingbin Liang

The actor-critic (AC) algorithm is a popular method for finding an optimal policy in reinforcement learning. The finite-sample convergence rate for the AC and natural actor-critic (NAC) algorithms has been established recently, but only under independent and identically distributed (i.i.d.) sampling with a single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data at each iteration, and with the actor using general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$. We also show that the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the known sample complexity of natural policy gradient (NPG) by $\mathcal{O}(\frac{1}{\epsilon}/\log(\frac{1}{\epsilon}))$. Our study develops several novel techniques for finite-sample analysis of RL algorithms, including handling the bias error due to mini-batch Markovian sampling and exploiting the self-variance-reduction property to improve the convergence analysis of NAC.
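To make the setting concrete, the following is a minimal sketch of mini-batch actor-critic under Markovian sampling on a hypothetical toy MDP. Everything here (the 2-state MDP, step sizes, batch size) is an illustrative assumption, not the paper's algorithm or analysis; the point is only that each iteration averages a batch of $B$ *consecutive* transitions from the running Markov chain (no i.i.d. resampling) before the critic (TD(0)) and actor (policy-gradient) updates.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only to illustrate the sampling
# scheme (not from the paper): state 1 yields reward 1, state 0 yields 0;
# action 0 stays in place, action 1 switches states.
P = np.zeros((2, 2, 2))                 # P[s, a, s'] transition kernel
P[0, 0, 0] = P[1, 0, 1] = 1.0           # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0           # action 1: switch
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # R[s, a] reward
GAMMA = 0.9

def policy(theta, s):
    """Softmax policy over the two actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def ac_step(theta, V, s, B, rng, alpha=0.05, beta=0.2):
    """One mini-batch AC iteration: collect B consecutive Markovian
    transitions, then apply the averaged TD(0) critic update and the
    averaged policy-gradient actor update."""
    dV = np.zeros_like(V)
    dth = np.zeros_like(theta)
    for _ in range(B):
        pi = policy(theta, s)
        a = rng.choice(2, p=pi)
        s2 = rng.choice(2, p=P[s, a])
        delta = R[s, a] + GAMMA * V[s2] - V[s]  # TD error
        dV[s] += delta                          # critic direction
        g = -pi.copy()
        g[a] += 1.0                             # grad of log-softmax at (s, a)
        dth[s] += delta * g                     # actor direction
        s = s2                                  # chain continues: Markovian sampling
    return theta + alpha * dth / B, V + beta * dV / B, s

rng = np.random.default_rng(0)
theta, V, s = np.zeros((2, 2)), np.zeros(2), 0
for _ in range(2000):
    theta, V, s = ac_step(theta, V, s, B=8, rng=rng)
```

Because the batch is drawn from one unbroken trajectory, its samples are correlated; controlling the resulting bias in the averaged updates is exactly the kind of difficulty the abstract refers to.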
