We introduce ConceptVision, a method that aims for high accuracy in categorizing large number of scenes, while keeping the model relatively simpler and efficient for scalability. The proposed method combines the advantages of both low-level representations and high-level semantic categories, and eliminates the distinctions between different levels through the definition of concepts. The proposed framework encodes the perspectives brought through different concepts by considering them in concept groups. Different perspectives are ensembled for the final decision. Extensive experiments are carried out on benchmark datasets to test the effects of different concepts, and methods used to ensemble. Comparisons with state-of-the-art studies show that we can achieve better results with incorporation of concepts in different levels with different perspectives.