We introduce a self-supervised motion-transfer VAE model that disentangles motion and content in videos. Unlike previous work on content-motion disentanglement in videos, we adopt a chunk-wise modeling approach and take advantage of the motion information contained in spatiotemporal neighborhoods. Our model yields per-chunk representations that can be modeled independently while preserving temporal consistency; hence, we reconstruct whole videos in a single forward pass. We extend the ELBO's log-likelihood term with a Blind Reenactment Loss that serves as an inductive bias encouraging motion disentanglement, under the assumption that swapping motion features between two videos yields reenactment. We test our model on recently proposed disentanglement metrics and show that it outperforms a variety of methods for video motion-content disentanglement. Experiments on video reenactment demonstrate the effectiveness of our disentanglement in the input space, where our model outperforms the baselines in reconstruction quality and motion alignment.
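To make the motion-swap assumption concrete, the sketch below illustrates the general idea of chunk-wise encoding into separate content and motion codes, with reenactment obtained by decoding the source's content codes together with the driving video's motion codes. All module names, dimensions, and shapes here are hypothetical placeholders for illustration, not the paper's actual architecture or code.

```python
# Hypothetical sketch of chunk-wise content/motion encoding and motion swapping.
# Names, shapes, and dimensions are illustrative assumptions, not the paper's model.
import math
import torch
import torch.nn as nn


class ChunkEncoder(nn.Module):
    """Toy per-chunk encoder producing a (content, motion) code pair."""

    def __init__(self, chunk_dim: int, content_dim: int, motion_dim: int):
        super().__init__()
        self.content_head = nn.Linear(chunk_dim, content_dim)
        self.motion_head = nn.Linear(chunk_dim, motion_dim)

    def forward(self, chunks: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        flat = chunks.flatten(start_dim=1)              # (num_chunks, chunk_dim)
        return self.content_head(flat), self.motion_head(flat)


class ChunkDecoder(nn.Module):
    """Toy per-chunk decoder mapping (content, motion) codes back to chunks."""

    def __init__(self, chunk_shape, content_dim: int, motion_dim: int):
        super().__init__()
        self.chunk_shape = chunk_shape
        self.net = nn.Linear(content_dim + motion_dim, math.prod(chunk_shape))

    def forward(self, content: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([content, motion], dim=-1))
        return out.view(out.shape[0], *self.chunk_shape)


def reenact(source: torch.Tensor, driving: torch.Tensor,
            encoder: ChunkEncoder, decoder: ChunkDecoder) -> torch.Tensor:
    """Swap motion codes: keep the source's content, use the driving motion.

    `source` and `driving` are videos already split into chunks, shaped
    (num_chunks, C, T, H, W); all chunks are processed in a single pass.
    """
    content_src, _ = encoder(source)       # content codes from the source video
    _, motion_drv = encoder(driving)       # motion codes from the driving video
    return decoder(content_src, motion_drv)


if __name__ == "__main__":
    chunk_shape = (3, 4, 16, 16)            # (C, T, H, W) per chunk (assumed)
    enc = ChunkEncoder(math.prod(chunk_shape), content_dim=32, motion_dim=16)
    dec = ChunkDecoder(chunk_shape, content_dim=32, motion_dim=16)
    src = torch.randn(8, *chunk_shape)      # 8 chunks of the source video
    drv = torch.randn(8, *chunk_shape)      # 8 chunks of the driving video
    print(reenact(src, drv, enc, dec).shape)  # torch.Size([8, 3, 4, 16, 16])
```

Under this reading, the Blind Reenactment Loss would score the decoded motion-swapped output, pushing the motion codes to carry only motion information; the exact form of that term is defined in the paper itself, not in this sketch.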