Temporal shift estimation for stereoscopic videos


Video synchronization is a fundamental computer-vision task necessary for a wide range of applications. A stereoscopic (3D) video comprises two streams that show the scene simultaneously from slightly different viewpoints. Desynchronization between the streams has been shown to cause severe discomfort for viewers of stereo video.

We propose a temporal shift (time difference) estimation method. The method assumes that the temporal shift and geometric distortion between the two streams are constant throughout each scene. The result of the algorithm is a shift value measured in fractions of a frame step (the inverse of the frame rate).

Example of a detected shot with a temporal shift (Drive Angry, #29242)

We approached the task as a regression problem, constructing an equation that describes the spatio-temporal dependency between the motion vectors and the stereo-parallax vectors.

The proposed algorithm consists of the following two main stages:

  1. Calculate the stereo-parallax and motion vectors using block-based matching for each stereo frame;
  2. Estimate model parameters from motion vectors with high confidence using the RANSAC algorithm.
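The first stage can be illustrated with a naive block-matching routine. This is a minimal sketch for intuition only: the function name, block size, and exhaustive SAD search are assumptions, not the authors' optimized implementation.

```python
import numpy as np

def block_match(ref, tgt, block=8, search=4):
    """Naive block matching: for each `block`x`block` patch in `ref`,
    find the integer displacement within +/-`search` pixels that
    minimizes the sum of absolute differences (SAD) in `tgt`.
    Returns one (dy, dx) vector per block, in row-major block order."""
    h, w = ref.shape
    vectors = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = ref[y:y + block, x:x + block]
            best_sad, best_v = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue  # candidate block falls outside the frame
                    sad = np.abs(patch - tgt[yy:yy + block, xx:xx + block]).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_v = sad, (dy, dx)
            vectors.append(best_v)
    return np.array(vectors)
```

Running the same matcher between consecutive frames of one view yields motion vectors, while running it between the left and right frames yields stereo-parallax vectors.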

Stereoscopic video can employ horizontal disparity by design in order to achieve the stereo effect, but vertical disparity is always the result of spatio-temporal misalignment. The algorithm uses this assumption to recover the temporal shift from the vertical components of the vectors. The detailed algorithm description is published in [1].
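The core of the second stage can be sketched as follows: if the vertical parallax component is (to first order) the temporal shift times the vertical motion component, the shift is the slope of a line through the origin, which RANSAC can fit robustly. This is an illustrative simplification, not the authors' exact model from [1], which also accounts for geometric distortion; all names here are hypothetical.

```python
import numpy as np

def estimate_shift_ransac(motion_vy, parallax_vy, iters=200, tol=0.05, seed=0):
    """Robustly fit parallax_vy ~= shift * motion_vy with a minimal
    RANSAC loop: sample one point, hypothesize a slope, count inliers,
    and refine the best hypothesis by least squares on its inliers."""
    rng = np.random.default_rng(seed)
    best_shift, best_count = 0.0, -1
    for _ in range(iters):
        i = rng.integers(len(motion_vy))
        if motion_vy[i] == 0:
            continue  # a zero motion vector gives no slope hypothesis
        shift = parallax_vy[i] / motion_vy[i]
        inliers = np.abs(parallax_vy - shift * motion_vy) < tol
        if inliers.sum() > best_count:
            best_count = inliers.sum()
            # least-squares slope through the origin, inliers only
            mv, pv = motion_vy[inliers], parallax_vy[inliers]
            best_shift = (pv @ mv) / (mv @ mv)
    return best_shift
```

In practice only vectors with high matching confidence would be fed into the fit, as the second stage above specifies.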

A histogram of the found values. The tangent of the slope is the shift value in fractions of a frame


The algorithm was tested on our synthetically created dataset. The video set contained 396 stereoscopic scenes at 30 FPS, taken only from converted stereoscopic movies, since converted videos contain no inherent temporal shift. The frames were subsampled to simulate a temporal shift (e.g., taking only even frames for the left view and odd frames for the right view yields a shift of 0.5 frames). The final dataset consisted of subsampled views with relative temporal shifts of ±{0.25, 0.5, 1.0, 2.0} frames.
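The subsampling trick can be expressed in a few lines. This is a hypothetical recreation of the dataset construction described above; the helper name and signature are assumptions.

```python
def subsample_views(frames, step, offset):
    """Take every `step`-th frame for the left view, and the same
    stride delayed by `offset` original frames for the right view.
    At the resulting (reduced) frame rate this simulates a relative
    temporal shift of offset/step frames."""
    left = frames[::step]
    right = frames[offset::step]
    return left, right
```

For example, `step=2, offset=1` is the even/odd split from the text (a 0.5-frame shift), while `step=4, offset=1` produces a 0.25-frame shift.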

The comparison of the current algorithm with our previous work shows a significant gain. The error was calculated as the absolute difference between the target and estimated shift values, in frame steps. Evaluation was treated as classification of whether the error fell below a threshold. In the experiments, the least noticeable temporal shift was estimated to be 0.10 frames, so this value was used as the error threshold for comparing the algorithms.
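The evaluation protocol above amounts to an accuracy-at-threshold metric; a minimal sketch (function name assumed) follows.

```python
import numpy as np

def accuracy_at_threshold(true_shifts, est_shifts, thr=0.10):
    """Share of scenes whose absolute estimation error, in frame
    steps, falls below `thr` (0.10 frames being the least-noticeable
    shift used as the threshold in the comparison)."""
    err = np.abs(np.asarray(true_shifts) - np.asarray(est_shifts))
    return float((err < thr).mean())
```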

The comparison of the proposed algorithm with our previous work [1]. Left: the relation between the temporal-shift accuracy and the estimation-error threshold; Right: the exact scores for an error threshold of 0.10 frames.

Additionally, we processed 60 full-length stereoscopic movies and detected 198 scenes with a temporal shift of at least 0.10 frames. Further examples can be found in our VQMT3D reports 8 and 9.

Histogram of revealed scenes with temporal shift



1. Ploshkin, A., and Vatolin, D., “Accurate method of temporal-shift estimation for 3D video,” [pdf] 2018 3DTV-Conference: 3D at Any Scale and Any Perspective (3DTV-CON), 2018. doi:10.1109/3DTV.2018.8478431

05 May 2020