Fusing and ranking multimodal information remains a challenging task. A robust decision-level fusion method should not only adapt dynamically when assigning weights to each representation but also incorporate the inter-relationships among different modalities. In this paper, we propose a quantum-inspired model for fusing and ranking visual and textual information that accounts for the dependency between these two modalities. First, we calculate the text-based and image-based similarities individually. We apply two approaches to compute each unimodal similarity. The first uses the bag-of-words model. The second uses a VGG19 model pre-trained on ImageNet to calculate the image similarity and applies query expansion to the text-based query to improve retrieval performance. The local similarity scores are then fed into the proposed quantum-inspired model, in which the inter-dependency between the two modalities is captured implicitly through "quantum interference". Finally, the documents are ranked according to the proposed similarity measurement. We test our approach on the ImageCLEF2007photo data collection and show its effectiveness. We also discuss a series of findings that provide theoretical and empirical foundations for future work in this direction.
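To make the fusion step concrete, the following is a minimal sketch of how an interference term can couple two unimodal similarity scores. It is not the model proposed in this paper, whose exact form is not given here. It assumes the generic quantum-probability pattern of a weighted sum plus a cross term, 2·sqrt(w_t·s_t·w_i·s_i)·cos(θ). The function names, the weight `text_weight`, and the phase parameter `cos_theta` are hypothetical and chosen only for illustration.

```python
import math


def interference_fused_score(text_sim, image_sim, text_weight=0.5, cos_theta=0.0):
    """Fuse two non-negative similarity scores with an interference term.

    ASSUMPTION: this is a generic quantum-inspired form, not the paper's model.
    cos_theta = 0 reduces to a classical weighted sum; positive values model
    constructive interference, negative values destructive interference.
    """
    if text_sim < 0.0 or image_sim < 0.0:
        raise ValueError("similarity scores must be non-negative")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    if not -1.0 <= cos_theta <= 1.0:
        raise ValueError("cos_theta must lie in [-1, 1]")

    image_weight = 1.0 - text_weight
    # Classical part: convex combination of the two unimodal scores.
    classical = text_weight * text_sim + image_weight * image_sim
    # Interference part: cross term that depends on both modalities at once.
    cross = math.sqrt(text_weight * text_sim * image_weight * image_sim)
    return classical + 2.0 * cross * cos_theta


def rank_documents(scores, text_weight=0.5, cos_theta=0.0):
    """Rank document ids by fused score, highest first.

    `scores` maps a document id to a (text_sim, image_sim) pair.
    """
    fused = {
        doc_id: interference_fused_score(t, i, text_weight, cos_theta)
        for doc_id, (t, i) in scores.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for three documents, for illustration only.
    example = {"d1": (0.9, 0.1), "d2": (0.5, 0.5), "d3": (0.1, 0.2)}
    print(rank_documents(example, cos_theta=1.0))
    print(rank_documents(example, cos_theta=-1.0))
```

In this sketch, the sign of `cos_theta` determines whether documents that score well on both modalities are promoted or demoted relative to a plain weighted sum, which is one way a dependency between modalities can change the final ranking.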