Algorithm audits have become more common in recent years, driven by a growing need to independently assess the performance of automatically curated services that process, filter and rank the large and constantly changing volume of information available on the internet. Among the methodologies available for such audits, virtual agents stand out because they make it possible to run systematic experiments that simulate human behaviour without the costs of recruiting participants. Motivated by the importance of research transparency and the replicability of results, this paper focuses on the challenges of this approach and provides methodological details, recommendations, lessons learned and limitations that researchers should take into consideration when setting up experiments with virtual agents. We demonstrate the successful performance of our research infrastructure across multiple data collections with diverse experimental designs, and point to the changes and strategies that improved the quality of the method. We conclude that virtual agents are a promising avenue for monitoring the performance of algorithms over longer periods of time, and we hope that this paper serves as a basis for broadening research in this direction.