The ability to recognize environmental sounds can determine the fitness of an individual, allowing it to avoid dangers or pursue opportunities when critical sound events occur. The fundamental principles by which biological systems achieve such a remarkable ability remain a mystery. Meanwhile, the practical importance of environmental sound recognition (ESR) has attracted increasing research attention, but the chaotic and non-stationary nature of environmental sounds continues to make it a challenging task. In this study, we propose a spike-based framework for the ESR task from a more brain-like perspective. Our framework is a unified system that consistently integrates three major functional parts: sparse encoding, efficient learning and robust readout. We first introduce a simple sparse encoding in which key-points are used for feature representation, and demonstrate that it generalizes to both spike-based and non-spike-based systems. We then evaluate the properties of different learning rules in detail and contribute our own improvements to them. Our results highlight the advantages of multi-spike learning and provide a selection reference for various spike-based developments. Finally, we combine the multi-spike readout with the other parts to form a complete system for ESR. Experimental results show that our framework outperforms the baseline approaches. In addition, we show that our spike-based framework has several advantageous characteristics, including early decision making, learning from small datasets and ongoing dynamic processing. Our framework is the first attempt to apply the multi-spike characteristic of biological neurons to ESR. The outstanding performance of our approach could help draw more research effort toward pushing the boundaries of the spike-based paradigm to a new horizon.