We propose a cross-attention Transformer for joint decoding of uplink OFDM signals received by multiple coordinated access points. A shared per-receiver encoder learns the time-frequency structure of each receiver's grid, and a token-wise cross-attention module fuses the per-receiver representations into soft log-likelihood ratios (LLRs) for a standard channel decoder, without requiring explicit channel estimates. Trained with a bit-metric objective, the model adapts its fusion to per-receiver reliability and remains robust under degraded links, strong frequency selectivity, and sparse pilots. Over realistic Wi-Fi channels, it outperforms classical pipelines and strong neural baselines, often matching or surpassing a local perfect-CSI reference. It also remains compact and computationally efficient on commodity hardware, making it suitable for next-generation coordinated Wi-Fi receivers.
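To make the token-wise fusion idea concrete, the following is a minimal sketch and not the proposed model. It assumes each receiver has already been encoded into one feature vector per time-frequency token. The names `fuse_token` and `fuse_grid` are hypothetical. The choice of the receiver mean as the query and the absence of learned projections are simplifying assumptions made only for illustration; in a trained model the attention weights would come from learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_token(features):
    """Fuse one token's feature vectors from R receivers.

    features: list of R vectors, each of length d.
    Assumption (illustrative only): the query is the mean over receivers,
    and keys and values are the raw features with no learned projections.
    Returns the fused vector and the per-receiver attention weights.
    """
    num_rx = len(features)
    dim = len(features[0])
    query = [sum(f[i] for f in features) / num_rx for i in range(dim)]
    # Scaled dot-product scores, one per receiver.
    scores = [dot(query, f) / math.sqrt(dim) for f in features]
    weights = softmax(scores)
    # Weighted sum of the receivers' features for this token.
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


def fuse_grid(per_receiver_tokens):
    """Apply fuse_token independently at every token position.

    per_receiver_tokens[r][t] is the feature vector of token t at receiver r.
    """
    num_tokens = len(per_receiver_tokens[0])
    return [
        fuse_token([rx[t] for rx in per_receiver_tokens])[0]
        for t in range(num_tokens)
    ]
```

Because the softmax is taken over receivers separately at each token, the weighting can differ from one resource element to the next, which is the property that lets a fusion of this kind favor whichever receiver is more informative at a given position.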