Because automatically mapping visual features to semantic descriptors is difficult, state-of-the-art frameworks have exhibited poor coverage and effectiveness when indexing visual content. This prompted us to investigate two directions: using the Web as a large information source from which to extract relevant contextual linguistic information, and using bimodal visual-textual indexing as a technique to enrich the vocabulary of index concepts. Our proposal is based on the Signal/Semantic approach to multimedia indexing, which generates multi-faceted conceptual representations of the visual content. We propose to enrich these image representations with concepts automatically extracted from the visual contextual information. We specifically target the integration of semantic concepts that are more specific than the initial index concepts, since they represent the visual content with greater accuracy and precision. We also aim to correct the faulty indexes that result from automatic semantic tagging. We give the details of the prototype and test the presented technique in a Web-scale evaluation on 30 queries representing elaborate image scenes.