In this paper, we examine the statistical soundness of comparative assessments in the field of recommender systems with respect to reliability and human uncertainty. A controlled experiment shows that users give different ratings to the same items when asked repeatedly. This volatility of user ratings justifies modeling each rating as a probability density rather than as a single score. As a consequence, the well-known accuracy metrics (e.g. MAE, MSE, RMSE) themselves yield a density, which emerges from the convolution of all rating densities. When two systems produce RMSE distributions that overlap substantially, every possible ranking of the two carries a probability of error. As an application, we examine possible ranking errors in the Netflix Prize. We show that all top rankings are subject to considerable probabilities of error and that some rankings may be attributable to mere chance rather than to system quality.
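
To illustrate the idea, the following minimal sketch estimates the probability of a ranking error between two systems by Monte Carlo simulation rather than by analytic convolution. It assumes, purely for illustration, that each rating follows a normal density with a known mean and standard deviation, and that both systems are scored against the same sampled ratings. The function name `ranking_error_probability` and all of its parameters are hypothetical and are not taken from the experiment or from the Netflix Prize data.

```python
import numpy as np


def ranking_error_probability(pred_a, pred_b, rating_means, rating_sds,
                              n_draws=10_000, seed=0):
    """Estimate P(RMSE_A > RMSE_B) when ratings are uncertain.

    Assumption (illustrative only): rating i is drawn from a normal
    density N(rating_means[i], rating_sds[i]**2), independently of
    all other ratings.
    """
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    rating_means = np.asarray(rating_means, dtype=float)
    rating_sds = np.asarray(rating_sds, dtype=float)

    rng = np.random.default_rng(seed)
    # One row per simulated "re-rating" of the whole test set.
    ratings = rng.normal(rating_means, rating_sds,
                         size=(n_draws, rating_means.size))

    # RMSE of each system for every simulated set of ratings.
    rmse_a = np.sqrt(np.mean((ratings - pred_a) ** 2, axis=1))
    rmse_b = np.sqrt(np.mean((ratings - pred_b) ** 2, axis=1))

    # Fraction of draws in which A scores worse than B. If A is ranked
    # ahead of B, this is the estimated probability of a ranking error.
    return float(np.mean(rmse_a > rmse_b))
```

Under these assumptions, two systems whose predictions differ only slightly produce heavily overlapping RMSE distributions, and the returned value is then far from zero; clearly separated systems give a value near zero.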