Deep learning approaches have recently been studied extensively for various problems in chemistry, such as property prediction, virtual screening, and de novo molecule design. Despite these impressive successes, end-to-end training usually requires a separately designed network for each specific task, so it is often difficult to find a unified principle for synergistically combining existing models and training datasets for novel tasks. To address this, we present a multimodal chemical foundation model that can be used for various downstream tasks requiring a simultaneous understanding of structure and property. Specifically, inspired by recent advances in pre-trained multimodal foundation models such as Vision-Language Pretrained (VLP) models, we propose a structure-property multi-modal (SPMM) foundation model that uses a dual-stream transformer with X-shape attention to align molecular structure and chemical properties in a common embedding space. Thanks to the outstanding structure-property unimodal representation, experimental results confirm that SPMM can simultaneously perform tasks such as molecule generation, property prediction, classification, and reaction prediction, which was previously not possible with a single architecture.