In a recent post I wrote about the essential matrix. If the internal parameters (pixel pitch, focal length, principal point) of both cameras in a stereo rig are known, the essential matrix can be obtained from at least eight pairs of corresponding image points of 3D points. Once the essential matrix is known, the rototranslation between the two cameras (external parameters) can be recovered from it, thus completing the calibration (let me set aside lens distortions, or better, assume they have already been corrected). During this process the 3D coordinates of the points are recovered too. That’s why the process is known as self-calibration or autocalibration: there are no separate calibration and measurement phases; rather, the calibration is obtained automatically while measuring.

If the internal parameters are not known, the self-calibration follows a different path, starting from the computation of the fundamental matrix instead of the essential matrix and including a few more steps.

There are many variants of the self-calibration process, all sharing two common features:

• they are linear (or at least closed-form) methods, so they are guaranteed to reach a solution and they don’t need any initial approximation (this is good);
• they reach the solution by minimizing some sort of algebraic closure (this is bad; see e.g. this post to know why, albeit in a different context).

To improve the accuracy, a further nonlinear step is necessary, which accepts the results of self-calibration as an initial approximation and minimizes a geometrically significant quantity. This step is known as bundle adjustment because it tries to adjust the bundles of optical rays by modifying the calibration parameters and the positions of the reconstructed object points, so that each ray ideally passes through an object point, the corresponding image point and the optical centre. The geometrically significant quantity being minimized is usually the reprojection error, that is, the distance between the measured image point and the intersection of the optical ray (through the object point and the optical centre) with the sensor plane.
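For concreteness, the reprojection error can be sketched as follows for a plain pinhole camera (a minimal illustration, assuming simple intrinsics collected in a matrix K and no distortion; `reprojection_error` is a hypothetical helper name, not part of any library):

```python
import numpy as np

def reprojection_error(K, R, t, X, x_meas):
    """Distance in the image plane between a measured point and the
    projection of the object point X (pinhole model, no distortion).

    K: 3x3 intrinsic matrix; R, t: camera pose (world -> camera);
    X: 3-vector object point; x_meas: 2-vector measured pixel."""
    Xc = R @ X + t                  # object point in the camera frame
    x_hom = K @ Xc                  # homogeneous image coordinates
    x_proj = x_hom[:2] / x_hom[2]   # perspective division
    return np.linalg.norm(x_proj - x_meas)
```

In backward bundle adjustment, the sum of the squares of these distances over all points and all cameras is what gets minimized.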

Exactly like the essential matrix, bundle adjustment was born in the photogrammetry community and then propagated to computer vision, where it has been the subject of an impressive amount of work in recent years. The paper Bundle Adjustment – A Modern Synthesis by Triggs, McLauchlan, Hartley and Fitzgibbon takes a synthetic but rather complete snapshot of the state of the art in the year 2000.

Much work has been done even after that date and, in my opinion, two newer ideas are especially worth mentioning. One is the application of interval analysis to bundle adjustment by Fusiello et al. in 2004, about which I may write in a future post(*), and the other is what could be called forward bundle adjustment.

The idea of forward bundle adjustment, to my knowledge, can be traced to a 2006 paper by Schweighofer and Pinz, and has been further elaborated by Mitchell, Warren, McKinnon and Upcroft in 2010.

The basic idea is simple: instead of minimizing the reprojection error, which is a distance in image space, one minimizes the distance in object space (whence ‘forward’) between the object point and the optical ray passing through the image point and the optical centre.

Let $\mathbf{C}$ be the optical centre, $\mathbf{v}$ a vector along the optical ray, $\mathbf{P}$ the object point and $\mathbf{P}^*$ its projection onto the optical ray; one wants to minimize $\left| \mathbf{e} \right|^2=\left| \mathbf{P}-\mathbf{P}^* \right|^2$.

The parametric equation of the optical ray is $\mathbf{L}(\tau)=\mathbf{C}+\mathbf{v} \tau$. The coordinates of $\mathbf{P}$ are $\mathbf{X}=[X\ Y\ Z]^T$ in the object frame and $R \mathbf{X}+\mathbf{t}$ in the camera reference frame. In this same frame, $\mathbf{P}^*$ is at $\mathbf{C}+\mathbf{v}\tau^*$, with $\tau^*=\frac{\mathbf{v}^T (R \mathbf{X}+\mathbf{t}-\mathbf{C})}{\mathbf{v}^T \mathbf{v}}$ minimizing $\left| \mathbf{L} (\tau) - \mathbf{P} \right|^2$; so one finally gathers $\mathbf{e}= \left(I-\frac{\mathbf{v} \mathbf{v}^T}{\mathbf{v}^T \mathbf{v}}\right) (R \mathbf{X} + \mathbf{t} - \mathbf{C})$.
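The formulas above translate directly into code. Here is a minimal sketch (`object_space_error` is a hypothetical helper name; everything is expressed in the camera reference frame, as in the text):

```python
import numpy as np

def object_space_error(R, t, C, v, X):
    """Error vector e between the object point X and its orthogonal
    projection onto the optical ray L(tau) = C + v*tau, all quantities
    expressed in the camera reference frame."""
    V = np.outer(v, v) / (v @ v)   # projector onto the ray direction
    d = R @ X + t - C              # object point relative to C
    return (np.eye(3) - V) @ d     # e = (I - v v^T / v^T v)(R X + t - C)
```

Note that $(I-\frac{\mathbf{v} \mathbf{v}^T}{\mathbf{v}^T \mathbf{v}})$ is the projector onto the plane orthogonal to the ray, so $\mathbf{e}$ is exactly the point-to-ray displacement being minimized.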

Another key point in the work of the cited authors is to split the solution space into rotation $R$, translation $\mathbf{t}$ and structure $\mathbf{X}$, and to solve separately for the three subsets, iterating to convergence; each subproblem has a closed-form solution.

I don’t like this approach very much. Closed-form solutions are necessary when no initial approximation is available, but this is not the case here, as we can start from the output of self-calibration. They have the advantage that the solution is reached without wandering around and possibly dropping into a local minimum, but here this guarantee is lost because the three closed-form solutions are iterated in sequence. It remains to be assessed how much dealing with three smaller sets of variables speeds up convergence with respect to a single larger set.

In the version of forward bundle adjustment I have implemented, all the parameters – internal, external and distortion – are considered together and the nonlinear optimization is performed with a sparse Levenberg-Marquardt algorithm, relying on multithreaded conjugate gradient to solve the linear steps. The symbolic Jacobian matrix is obtained with a CAS library. The rotations are parametrized with Euler angles, thus avoiding the local minima corresponding to reflections, which appear to plague the original formulation of the algorithm.
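Just to give a flavour of how such a sparse optimization can be set up, here is a toy sketch built on SciPy rather than on the multithreaded solver described above. The data are synthetic; for brevity only the object points are optimized, with the camera poses held fixed, and the sparse step is handled by SciPy's trust-region `least_squares`, which accepts a Jacobian sparsity pattern (its pure Levenberg-Marquardt mode does not):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.sparse import lil_matrix

# Synthetic setup (illustrative only): two cameras with known pose
# observing n points; each observation contributes a 3-vector
# object-space residual e = (I - v v^T / v^T v)(R X + t - C),
# with C = 0 since we work in each camera's own frame.
rng = np.random.default_rng(0)
n = 5
X_true = rng.uniform(-1, 1, (n, 3)) + np.array([0, 0, 5])
cams = [(np.eye(3), np.zeros(3)),               # (R, t) pairs
        (np.eye(3), np.array([-1.0, 0, 0]))]    # 1 m baseline

def ray_dirs(R, t, X):
    Xc = X @ R.T + t                             # points in camera frame
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

# "Measured" unit rays, generated noise-free from the true geometry.
rays = [ray_dirs(R, t, X_true) for R, t in cams]

def residuals(p):
    X = p.reshape(n, 3)
    res = []
    for (R, t), V in zip(cams, rays):
        d = X @ R.T + t                          # points relative to C = 0
        proj = (d * V).sum(axis=1, keepdims=True) * V
        res.append((d - proj).ravel())           # e for every ray
    return np.concatenate(res)

# Block sparsity pattern: each residual depends on a single point.
S = lil_matrix((2 * 3 * n, 3 * n))
for c in range(2):
    for i in range(n):
        S[(c * n + i) * 3:(c * n + i) * 3 + 3, 3 * i:3 * i + 3] = 1

sol = least_squares(residuals, (X_true + 0.1).ravel(),
                    jac_sparsity=S, method='trf')
```

In the full problem the internal, external and distortion parameters would be stacked into the parameter vector alongside the points, making the sparsity pattern block-arrow-shaped instead of block-diagonal.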

The typical calibration session consists of nine presentations of a pattern printed on paper and stuck on a roughly planar surface (according to Triggs, the presentations must be at least five). Each presentation supplies 96 object points, for a total of 864 object points, which are matched in the two images of the stereo rig. No preliminary RANSAC selection is performed. The nonlinear system comprises 1728 equations in 357 variables. Its RMS closure drops from 0.074 mm (with initial values from self-calibration) to 0.020 mm in three Levenberg-Marquardt iterations, taking less than 10 seconds on a 2.2 GHz Intel Core i7 (four cores used); then there is a further, slight improvement to 0.017 mm in about 4 minutes. In the same conditions, my implementation of the ‘backward’ bundle adjustment (minimizing error in image space) takes more than half an hour.

———-

(*) To be precise, in the linked paper Fusiello et al. do not describe the application of interval analysis to bundle adjustment but rather to an algorithm for determining internal parameters, based on a nonlinear cost function which has no apparent geometric meaning. What I intended to say is that interval analysis, as described by Fusiello, can be applied to bundle adjustment; that is precisely what I am working on these days.