Classically, visual object tracking involves following a target object throughout a given video, and it provides us the motion trajectory of the object. However, for many practical applications, this output is often insufficient since additional semantic information is required to act on the video material. Example applications of this are surveillance and target-specific video summarization, where the target needs to be monitored with respect to certain predefined constraints, e.g., 'when standing near a yellow car'. This paper explores, tracking visual objects subjected to additional lingual constraints. Differently from Li et al., we impose additional lingual constraints upon tracking, which enables new applications of tracking. Whereas in their work the goal is to improve and extend upon tracking itself. To perform benchmarks and experiments, we contribute two datasets: c-MOT16 and c-LaSOT, curated through appending additional constraints to the frames of the original LaSOT and MOT16 datasets. We also experiment with two deep models SiamCT-DFG and SiamCT-CA, obtained through extending a recent state-of-the-art Siamese tracking method and adding modules inspired from the fields of natural language processing and visual question answering. Through experimental results, we show that the proposed model SiamCT-CA can significantly outperform its counterparts. Furthermore, our method enables the selective compression of videos, based on the validity of the constraint.