The decoding performance of polar codes strongly depends on the decoding algorithm used, while also the decoder throughput and its latency mainly depend on the decoding algorithm. In this work, we implement the powerful successive cancellation list (SCL) decoder on a GPU and identify the bottlenecks of this algorithm with respect to parallel computing and its difficulties. The inherent serial decoding property of the SCL algorithm naturally limits the achievable speed-up gains on GPUs when compared to CPU implementations. In order to increase the decoding throughput, we use a hybrid decoding scheme based on the belief propagation (BP) decoder, which can be intra and inter-frame parallelized. The proposed scheme combines excellent decoding performance and high throughput within the signal-to-noise ratio (SNR) region of interest.