Vector Perturbation Precoding (VPP) can speed up downlink data transmissions in Large and Massive Multi-User MIMO systems but is known to be NP-hard. While there are several algorithms in the literature for VPP under total power constraint, they are not applicable for VPP under per-antenna power constraint. This paper proposes a novel, parallel tree search algorithm for VPP under per-antenna power constraint, called \emph{\textbf{TreeStep}}, to find good quality solutions to the VPP problem with practical computational complexity. We show that our method can provide huge performance gain over simple linear precoding like Regularised Zero Forcing. We evaluate TreeStep for several large MIMO~($16\times16$ and $24\times24$) and massive MIMO~($16\times32$ and $24\times 48$) and demonstrate that TreeStep outperforms the popular polynomial-time VPP algorithm, the Fixed Complexity Sphere Encoder, by achieving the extremely low BER of $10^{-6}$ at a much lower SNR.