Siamese networks are prevalent in person re-identification (re-id) tasks for modeling the similarity and dissimilarity among video frames. They mainly focus on the inter-video variation between spatio-temporal features extracted from different videos, while the variation among features of the same video has rarely been discussed. In this paper, we introduce the concept of “mean-body” and define an intra-video loss that addresses the variation among spatio-temporal features of the same video. A novel loss is presented to boost the training of re-id networks by combining the proposed intra-video loss with the Siamese loss. Specifically, the intra-video loss uses the unique mean-body of each camera viewpoint to make the features of a video sequence more clustered, while the Siamese loss pushes incorrectly matched videos further apart. To train the whole network, we update the network parameters and the mean-bodies in an iterative manner. As a result, the proposed loss is expected to improve the generalization capability of re-id networks on the testing set. Extensive experiments demonstrate that the presented approach outperforms state-of-the-art algorithms in re-id accuracy on the publicly available PRID2011, iLIDS-VID, and MARS data sets.
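The interplay of the two terms can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so all function names, the squared-Euclidean distance, the contrastive (hinge) form of the Siamese term, and the weighting factor `alpha` are assumptions chosen only to show how a per-camera mean-body could enter a combined loss:

```python
# Illustrative sketch only: the real method's loss terms, distances, and
# weighting are defined in the paper; everything below is an assumption.

def mean_body(features):
    """Mean of a video's frame features (one mean-body per camera viewpoint)."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intra_video_loss(features, mb):
    """Pull each frame feature toward its camera's mean-body (clustering term)."""
    return sum(sq_dist(f, mb) for f in features) / len(features)

def siamese_loss(fa, fb, same_person, margin=2.0):
    """Contrastive term: pull matched videos together, push mismatches apart."""
    d = sq_dist(fa, fb)
    return d if same_person else max(0.0, margin - d)

def combined_loss(seq_a, seq_b, same_person, alpha=0.5):
    """Assumed combination: Siamese term plus weighted intra-video terms."""
    mb_a, mb_b = mean_body(seq_a), mean_body(seq_b)
    intra = intra_video_loss(seq_a, mb_a) + intra_video_loss(seq_b, mb_b)
    return siamese_loss(mb_a, mb_b, same_person) + alpha * intra
```

In an iterative training scheme like the one described, the mean-bodies would be recomputed (or updated as running averages) after each network update, then held fixed while the network parameters are optimized against this loss.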