The present MVS methods with deep learning have an impressive performance than traditional MVS methods. However, the learning-based networks need lots of ground-truth 3D training data, which is not always easy to be available. To relieve the expensive costs, we propose an unsupervised normal-aided multi-metric network, named M^3VSNet, for multi-view stereo reconstruction without ground-truth 3D training data. Our network puts forward: (a) Pyramid feature aggregation to extract more contextual information; (b) Normal-depth consistency to make estimated depth maps more reasonable and precise in the real 3D world; (c) The multi-metric combination of pixel-wise and feature-wise loss function to learn the inherent constraint from the perspective of perception beyond the pixel value. The abundant experiments prove our M^3VSNet state of the arts in the DTU dataset with effective improvement. Without any finetuning, M^3VSNet ranks 1st among all unsupervised MVS network on the leaderboard of Tanks & Temples datasets until April 17, 2020. Our codebase is available at https://github.com/whubaichuan/M3VSNet.