Localizing an object accurately with respect to a robot is a key step for autonomous robotic manipulation. In this work, we propose to tackle this task knowing only 3D models of the robot and object in the particular case where the scene is viewed from uncalibrated cameras -- a situation which would be typical in an uncontrolled environment, e.g., on a construction site. We demonstrate that this localization can be performed very accurately, with millimetric errors, without using a single real image for training, a strong advantage since acquiring representative training data is a long and expensive process. Our approach relies on a classification Convolutional Neural Network (CNN) trained using hundreds of thousands of synthetically rendered scenes with randomized parameters. To evaluate our approach quantitatively and make it comparable to alternative approaches, we build a new rich dataset of real robot images with accurately localized blocks.