This paper presents a new approach to visual zero-shot slot filling. The approach extends previous approaches by reformulating the slot filling task as Question Answering. Slot tags are converted to rich natural language questions that capture the semantics of visual information and lexical text on the GUI screen. These questions are paired with the user's utterance and slots are extracted from the utterance using a state-of-the-art ALBERT-based Question Answering system trained on the Stanford Question Answering dataset (SQuaD2). An approach to further refine the model with multi-task training is presented. The multi-task approach facilitates the incorporation of a large number of successive refinements and transfer learning across similar tasks. A new Visual Slot dataset and a visual extension of the popular ATIS dataset is introduced to support research and experimentation on visual slot filling. Results show F1 scores between 0.52 and 0.60 on the Visual Slot and ATIS datasets with no training data (zero-shot).