An important step for limiting the negative impact of natural disasters is rapid damage assessment after a disaster occurred. For instance, building damage detection can be automated by applying computer vision techniques to satellite imagery. Such models operate in a multi-domain setting: every disaster is inherently different (new geolocation, unique circumstances), and models must be robust to a shift in distribution between disaster imagery available for training and the images of the new event. Accordingly, estimating real-world performance requires an out-of-domain (OOD) test set. However, building damage detection models have so far been evaluated mostly in the simpler yet unrealistic in-distribution (IID) test setting. Here we argue that future work should focus on the OOD regime instead. We assess OOD performance of two competitive damage detection models and find that existing state-of-the-art models show a substantial generalization gap: their performance drops when evaluated OOD on new disasters not used during training. Moreover, IID performance is not predictive of OOD performance, rendering current benchmarks uninformative about real-world performance. Code and model weights are available at https://github.com/ecker-lab/robust-bdd.