Training a neural response generation model on data synthesized under the adversarial training framework helps it explore more of the possible responses. However, most data synthesized de novo are of low quality because the response space is vast. In this paper, we propose a counterfactual off-policy method that learns from better synthesized data. Using a structural causal model, it takes a real response as evidence and infers an alternative response that was not taken. Learning on these counterfactual responses helps to explore the high-reward area of the response space. An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model as well as conventional adversarial training approaches.