The P4 language has drastically changed the networking field because it allows new networking applications to be described and implemented quickly. Although a large variety of applications can be described in P4, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet from the results of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large number of interconnections and the architecture must be tailored to each P4 program. As a result, in several works that implement a P4 application on FPGAs, the deparser consumes a significant proportion of chip resources. In this paper, we address this issue by presenting design principles for efficient, high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient, vendor-agnostic deparser architecture from a P4 program. Our design has been validated in simulation with a cocotb-based framework. The resulting architecture is implemented on Xilinx UltraScale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10$\times$ compared to other solutions.