Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data, Aff-Wild2. The videos are downloaded from Youtube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific as well as cross-database experiments are performed in this paper, by utilizing the Aff-Wild2, along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with attention mechanism; thus exploiting both the invariant properties of convolutional features, while modeling temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show premise for utilization of the extended Aff-Wild, as well as of the developed deep neural architectures for visual analysis of human behaviour in terms of continuous emotion dimensions.