As humans, we are remarkably good at making sense of the 3D world from its 2D projections, but the task is far harder for machines. The goal here is to build a system that achieves 3D understanding by computationally recovering geometry and depth from 2D images.
The core difficulty arises when both the camera and the objects in a scene are in motion. A freely moving camera combined with moving objects confuses traditional algorithms, which assume the same object can be observed from more than one viewpoint at the same instant, enabling triangulation. Satisfying that assumption requires either a multi-camera array or a scene in which every object stays stationary while a single camera moves through it.
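To make the triangulation assumption concrete, here is a minimal sketch of classic two-view linear (DLT) triangulation with NumPy. The camera matrices, the test point, and the helper names (`triangulate`, `project`) are all illustrative assumptions, not part of any specific system described above; the example simply shows how a 3D point can be recovered once the same static point is seen from two known viewpoints.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: 2D pixel coordinates of the same point in each view.
    """
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X; stack them and solve A @ X = 0 via SVD.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                # null vector = homogeneous solution
    return X[:3] / X[3]       # dehomogenize

def project(P, X):
    """Project a 3D point X through camera matrix P to pixel coords."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: camera 1 at the origin, camera 2 shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])          # a stationary scene point
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate(P1, P2, x1, x2)
print(np.allclose(X_est, X_true))
```

Note that the derivation only holds because `X_true` is the *same* point in both views; if the point moved between the two observations, the stacked constraints would describe two different points and the recovered position would be meaningless, which is exactly why moving objects break the traditional pipeline.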