Memes are pixel-based multimedia documents that combine images and text expressions, usually producing a humorous meaning when the two are read together. Hateful memes, however, are also used to spread hatred through social networks. Automatically detecting hateful memes would help reduce their harmful societal influence. The challenge of hateful meme detection lies in its multimodal information. Unlike conventional multimodal tasks, where the visual and textual information are semantically aligned, the multimodal information in a meme is weakly aligned or even irrelevant, so a model must not only understand the content of the meme but also reason over multiple modalities. In this paper, we propose a novel method that incorporates the image captioning process into the meme detection process. We conduct extensive experiments on meme datasets and demonstrate the effectiveness of our method. Our model also achieves promising results on the Hateful Memes detection challenge.