Predicting risk map of traffic accidents is vital for accident prevention and early planning of emergency response. Here, the challenge lies in the multimodal nature of urban big data. We propose a compact neural ensemble model to alleviate overfitting in fusing multimodal features and develop some new features such as fractal measure of road complexity in satellite images, taxi flows, POIs, and road width and connectivity in OpenStreetMap. The solution is more promising in performance than the baseline methods and the single-modality data based solutions. After visualization from a micro view, the visual patterns of the scenes related to high and low risk are revealed, providing lessons for future road design. From city point of view, the predicted risk map is close to the ground truth, and can act as the base in optimizing spatial configuration of resources for emergency response, and alarming signs. To the best of our knowledge, it is the first work to fuse visual and spatio-temporal features in traffic accident prediction while advances to bridge the gap between data mining based urban computing and computer vision based urban perception.