Classical computers require large memory resources and computational power to simulate quantum circuits with a large number of qubits. Even supercomputers that can store huge amounts of data face a scalability issue in regard to parallel quantum computing simulations because of the latency of data movements between distributed memory spaces. Here, we apply a cache blocking technique by inserting swap gates in quantum circuits to decrease data movements. We implemented this technique in the open source simulation framework Qiskit Aer. We evaluated our simulator on GPU clusters and observed good scalability.