To effectively control large-scale distributed systems online, model predictive control (MPC) has to swiftly solve the underlying high-dimensional optimization. There are multiple techniques applied to accelerate the solving process in the literature, mainly attributed to software-based algorithmic advancements and hardware-assisted computation enhancements. However, those methods focus on arithmetic accelerations and overlook the benefits of the underlying system's structure. In particular, the existing decoupled software-hardware algorithm design that naively parallelizes the arithmetic operations by the hardware does not tackle the hardware overheads such as CPU-GPU and thread-to-thread communications in a principled manner. Also, the advantages of parallelizable subproblem decomposition in distributed MPC are not well recognized and exploited. As a result, we have not reached the full potential of hardware acceleration for MPC. In this paper, we explore those opportunities by leveraging GPU to parallelize the distributed and localized MPC (DLMPC) algorithm. We exploit the locality constraints embedded in the DLMPC formulation to reduce the hardware-intrinsic communication overheads. Our parallel implementation achieves up to 50x faster runtime than its CPU counterparts under various parameters. Furthermore, we find that the locality-aware GPU parallelization could halve the optimization runtime comparing to the naive acceleration. Overall, our results demonstrate the performance gains brought by software-hardware co-design with the information exchange structure in mind.