Shoulder surfing is an attack vector widely recognized as a real threat - enough to warrant researchers dedicating a considerable effort toward designing novel authentication methods to be shoulder surfing resistant. Despite a multitude of proposed solutions over the years, few have employed empirical evaluations and comparisons between different methods, and our understanding of the shoulder surfing phenomenon remains limited. Barring the challenges in experimental design, the reason for that can be primarily attributed to the lack of objective and comparable vulnerability measures. In this paper, we develop an ensemble of vulnerability metrics, a first endeavour toward a comprehensive assessment of a given method's susceptibility to observational attacks. In the largest on-site shoulder surfing experiment (n = 274) to date, we verify the model on four conceptually different authentication methods in two observation scenarios. On the example of a novel hybrid authentication method based on associations, we explore the effect of input type on the adversary's effectiveness. We provide first empirical evidence that graphical passwords are easier to observe; however, that does not necessarily mean that the observed information will allow the attacker to guess the victim's password easier. An in-depth analysis of individual metrics within the clusters offers insight into many additional aspects of the shoulder surfing attack not explored before. Our comparative framework makes an advancement in evaluation of shoulder surfing and furthers our understanding of observational attacks. The results have important implications for future shoulder surfing studies and the field of Password Security as a whole.