We present NaroNet, a machine learning framework that integrates multiscale, spatial, in situ analysis of the tumor microenvironment (TME) with patient-level predictions in a seamless end-to-end learning pipeline. Trained only with patient-level labels, NaroNet quantifies the phenotypes, neighborhoods, and neighborhood interactions that most strongly influence the predictive task. We validate NaroNet using synthetic data that simulate multiplex-immunostained images with an adjustable probabilistic incidence of different TMEs. We then apply our model to two sets of real patient tumors: one consisting of 336 seven-color multiplex-immunostained images from 12 high-grade endometrial cancers, and the other consisting of 372 35-plex mass cytometry images from 283 breast cancer patients. In both the synthetic and the real datasets, NaroNet provides outstanding predictions while associating those predictions with the presence of specific TMEs. This inherent interpretability could be of great value both in a clinical setting and as a tool for discovering novel biomarker signatures.