## Randomized Hough Transform 2: Rototranslations

Here I briefly explained how the RHT works, with reference to a very simple case (finding the directions of rows and columns of a grid). Now I’ll spend some words on a more complex application.

When scanning the surface of an object with an optical scanner, be it based on laser or on binocular photogrammetry, some portions of the surface may not be acquired, due to occlusions (they are hidden by other portions of the surface and therefore are not reached by the laser beam or not seen by both cameras) or simply because the object is larger than the field of view of the scanning device.

The solution is to move the object and/or the scanner, so that the hidden parts become visible, and to repeat the acquisition as many times as needed. Then the rototranslations between the different poses must be determined in order to merge the acquisitions in a single point cloud.

To determine the rototranslations, one possibility is to have the scanner carried by something which keeps track of its position and orientation, eg a robotic arm having position encoders at its joints.

Another possibility is to match at least three (the more the better) non-aligned points across each couple of poses. Three couples of non-aligned points determine a rototranslation. The points can be intrinsic features of the object being scanned, or easily recognizable markers purposely stuck onto the object.

In this second case we have a number of acquisitions, each with a list of 3D reference points. We know that the reference points of, say, acquisition A, as a whole, are in correspondence with those of, say, acquisition B, because they refer to the same 3D entities, but we don’t know in which order the correspondences are. Also, the elements in the A list need not be precisely as many as those in the B list – maybe one or more markers were acquired in A and missed in B, or the reverse. Even worse, maybe some markers were missed both in A and in B, but not the same markers – so that the two lists contain some points which do correspond, even if in unknown order, but also some which do not correspond at all.

Not very simple, is it? Well, RHT can help.

In my previous post, RHT was used to extract shapes (straight lines, circumferences, …) from minimal subsets of samples (couples or triples of points in an image). Here, the concepts of shape and sample must be extended: the shape is the rototranslation, and the minimal subsets are triples of couples of 3D points.

While two non-coincident points always define a straight line and three non-collinear points always define a circumference, it is not true that any three couples of points define a rototranslation. The rototranslation is a rigid transformation. Therefore, if the three couples of points $(A, A'), (B, B'), (C, C')$ are connected by the rototranslation $\begin{bmatrix} R & \mathbf{t} \end{bmatrix}$, so that $A' = R A + \mathbf{t}$, $B' = R B + \mathbf{t}$, $C' = R C + \mathbf{t}$, it must be true that $\left| A-B \right| = \left| A' - B' \right|$, $\left| A - C \right| = \left| A' - C' \right|$, $\left| B - C \right| = \left| B' - C' \right|$.

This fact can be useful to discard a priori minimal subsets not respecting this constraint, avoiding further computation on them. Of course, one must remember that the equality sign is a mathematical abstraction never showing up in the real world. Due to unavoidable acquisition and reconstruction errors, the lengths of corresponding segments will never be precisely equal, so the check must be done with a small tolerance.

As said in my previous post on this subject, the RHT algorithm needs to compare two shapes to determine how much they differ. How can we measure the dissimilarity between two rototranslations?

I found it satisfactory to define the dissimilarity as the root mean square distance between the vertices of a simplex transformed by the two rototranslations. Let

$P_0 = (0, 0, 0)$

$P_1 = (s, 0, 0)$

$P_2 = (0, s, 0)$

$P_3 = (0, 0, s )$

where $s$ is a characteristic length of the problem, eg the linear size of the field of measure. Then I define the dissimilarity between two rototranslations $\begin{bmatrix} R & \mathbf{t} \end{bmatrix}$ and $\begin{bmatrix} R' & \mathbf{t'} \end{bmatrix}$ as

$d (\begin{bmatrix} R & \mathbf{t} \end{bmatrix}, \begin{bmatrix} R' & \mathbf{t'} \end{bmatrix}) = \sqrt{ \dfrac{\sum_{i=0}^{3} \left\| (R P_i + \mathbf{t}) - (R' P_i + \mathbf{t'}) \right\|^2}{4} }$

$d$ is non-negative, it is null for identical rototranslations and, thanks to the characteristic length $s$, it takes into account the effects of both the translation and the rotation, putting them on a common ground.

That said, the RHT algorithm for rototranslations follows the same steps as described in my previous post. Here is the pseudocode:

1. Initialize an empty list of rototranslations.
2. Repeat steps 3..10 up to a maximum number of times:
3. Randomly choose three distinct points from the first acquisition.
4. Randomly choose three distinct points from the second acquisition.
5. If the rigidity constraint is violated, restart from 3.
6. Compute the rototranslation defined by the three couples of points.
7. If the list of rototranslations is empty, add the new rototranslation to the list, keeping track of the three couples of points used to compute it.
8. Otherwise, compute the dissimilarity of the new rototranslation with respect to all those in the list.
9. If the lowest dissimilarity found is under a specified threshold, merge the two rototranslations by merging their associated samples (couples of points) and recomputing the rototranslation. If the merged rototranslation has enough samples, exit the iterations.
10. Otherwise, add the new rototranslation to the list, keeping track of its three couples of points.

Several algorithms are available to compute the rototranslation given three or more couples of points; see eg Lorusso, Eggert and Fisher, A Comparison of Four Algorithms for Estimating 3-D Rigid Transformations.

The choice of the random number generator used to extract the minimal subsets is critical. It must be a good one, with as long a period as possible. I used the Mersenne Twister implementation by Agner Fog. Also, be aware that two independent generators (with different seeds) must be used to extract the samples from the first and from the second acquisition.

The algorithm, as described, needs three parameters:

• maximum number of iterations;
• dissimilarity threshold to decide if the new rototranslation has to be merged with an existing one or must be added to the list;
• minimum number of samples to accept a rototranslation as a valid result and exit the algorithm.

It could also be a good idea to give up the search after a certain number of iterations without finding a new minimal subset different from those already seen.

I haven’t made thorough testing, but I dare say that RHT is competitive with RANSAC for finding rototranslations.

Here you find a short movie (5 min) showing the acquisition and registration of point clouds based on marker matching with the RHT for rototranslations algorithm.

## Gaze tracking as a novel input method

Smartphones and tablets usually have a camera on their back, to take photographs, and a frontal camera for videoconferencing.

In a recent model (Samsung Galaxy S4) the frontal camera can be used as an input device too: it can suspend the current operation if the user looks away or disappears, and it can react to gestures by scrolling or clicking. This is probably the best that can be done with one single frontal camera.

With two frontal cameras a gaze tracking functionality could be added, sufficiently precise to point at specific locations on the screen, so complementing or replacing a touch screen. (The expressions gaze tracking and eye tracking are interchangeable in common use, but to be precise eye tracking means tracking the movements of the eye with respect to the head, while gaze tracking is with respect to a fixed reference frame. So here I am specifically speaking of  gaze tracking, with respect to a reference frame fixed to the device).

The two cameras could be positioned at the top left and top right corners or, to maximize their distance (baseline), at two diagonally opposite corners.

The cameras should be attached to a common rigid frame for stability.

(Meta-note: Keeping text and images aligned in WordPress is a nightmare. Even when everything looks fine, there is no guarantee that it will stay so, due probably to updates to the WordPress platform. Should the post become unreadable due to bad alignment, please tell me in a comment and I will try to adjust it).

Eye/gaze tracking has been used since the beginning of the last century in psychological research. Leaving aside intrusive techniques, requiring special contact lenses or electrodes, there are now a number of commercially available devices which are based on digitally processing the image of the eye; Nuia, SMI, Tobii and LC Technologies are the principal vendors; I also know of an open source application based on OpenCV. Applications range from gaming to disability support. All these products, to my knowledge, need special illumination to create corneal reflections and, in most cases, a calibration is necessary to adapt the system to a new user. The need for special illumination makes gaze tracking not easily applicable to small, portable devices.

I have developed a method for the visual measurement of holes in a metal sheet; it relies on detecting the margins of the hole, and is suitable for an iris too. No corneal reflection is needed. The geometric calibration of the two cameras (internal and external parameters and distortion correction coefficients) must be done only once and is valid for all users; a new user need only be asked to look at a couple of points on the display, in order to determine the dominant eye and the angle between the gaze direction and the normal to the iris.

I made some experiments with what I had at hand, that is a measuring rig I was using for metal sheets. It has two high resolution (5 Mp) cameras with a baseline of about 500 mm. So not exactly the size of a smartphone, but there is no reason why the method shouldn’t work there too.

First, the irides must be found. To this purpose, several algorithms are available. As this is not a central point in this discussion, here I just select by hand the ROIs (Regions Of Interest) around the irides:

Next, edgels are extracted inside the ROIs:

Edgels correspond to abrupt transitions from dark to light, and their position can be computed with sub-pixel accuracy by exploiting the information carried by the grey levels. Here we find many edgels on the border of the irides, which are what we were looking for, but also many spurious edgels on the eyelids, on the eyelashes and around reflections inside the irides. The edgels of interest lie along an almost continuous arc around the irides, while the spurious edgels are organized in short segments; this allows pruning most of them:

The next step is to fit ellipses to the surviving edgels. The fitting is preceded by a statistical selection with RANSAC to reject the few outliers (spurious edgels) which are still there.

It is to be noted that the ellipses are well fitted even if edgels are missing in the portion of their perimeters hidden by eyelids.

Each iris ellipse is the directrix of a cone having vertex in the optical center of the camera; the two cones based on the left and right image of the same iris ellipse intersect in 3D space, and this intersection contains the 3D iris ellipse:

The algorithm to reconstruct a 3D conic from two images was first described by Long Quan in his paper Conic Reconstruction and Correspondence from Two Views, IEEE-PAMI vol. 18, no. 2, Feb. 1996. Cordelia Schmid and Andrew Zisserman in The Geometry and Matching of Lines and Curves over Multiple Views present a substantially similar but formally more elegant approach.

To make sure that the reconstruction of the 3D ellipses has been successful, I reproject them back onto the images:

They fit perfectly.

To make tridimensionality evident, here is an animation of the reconstructed irides in a reference frame fixed with the measuring rig and having origin in front of the nose, X axis to the left, Y upwards, Z towards the face:

| (units: mm) | Left iris | Right iris |
| --- | --- | --- |
| Centre (X; Y; Z) | -28,79; 6,51; 126,37 | 33,14; 7,00; 128,80 |
| Normal (X; Y; Z) | 0,088; 0,044; -0,995 | 0,063; -0,005; -0,998 |
| Major semiaxis | 6,46 | 6,47 |
| Minor semiaxis | 5,84 | 5,98 |
| Average diameter | 12,30 | 12,44 |

The distance between the centres is 61,98 mm.

It remains to be noted that all the algorithms dealing with edgels – extraction, grouping, pruning – are well suited for highly parallel computation: e.g. each pixel can be examined independently to determine if it contains an edgel. RANSAC provides for randomly choosing a number of minimal subsets of edgels and fitting an ellipse to each of them, and here too a high degree of parallelism can be attained because each subset can be dealt with independently. Last, each of the two object ellipses can be reconstructed independently. So the whole process described here is highly parallelizable.

## Randomized Hough Transform

The Hough transform can be used to extract shapes from images.

As an example, consider a binary image (pixel values can be only 0 – black, belonging to the background – or 1 – white, belonging to the shape) from which one wants to extract segments of straight lines.

The 2D polar equation of a straight line is

$x \cos \theta + y \sin \theta - d = 0$

where

$x, y$ are the coordinates of a point on the line;

$\theta$ is the angle between the normal to the line and the $x$ axis;

$d$ is the distance of the line from the origin.

The Hough transform maps the image space $x, y$ into the parameter space $\theta, d$; a straight line in the image space is mapped into a single point in the parameter space, while a single point $x_P, y_P$ in the image space is mapped into the curve $d = x_P \cos \theta + y_P \sin \theta$ in the parameter space.

All the curves in parameter space corresponding to points which are aligned in image space will cross in the point $\theta, d$ corresponding to their joining line. So there will be many more crossings in or near a parameter point corresponding to a straight sequence of many pixels in image space, than in a parameter point corresponding to a line which only contains a few isolated pixels.

The parameter space is quantized into counting bins of size $\Delta \theta, \Delta d$. Each white pixel in the image is examined and its transformed curve is sampled, increasing the counts of the bins crossed by the curve. At the end, bins having high counts will reveal straight sequences of pixels in the image.

The algorithm, as described, is very expensive in terms of memory (all the bins in parameter space must be stored, even those which will reach low or zero count) and time (for each white pixel a curve must be sampled).

The Randomized Hough Transform (RHT) is a variant which worked very well for me. No bin is stored, and white pixels are not examined one at a time. Rather, couples of white pixels are randomly chosen and their joining line is computed. If it is sufficiently similar to a line already computed and stored, the score of this line is increased and its parameters are somehow averaged with those of the new line; otherwise the new line is stored separately with an initial score of 1. When enough couples of white pixels have been examined, lines with high scores do emerge, corresponding to true lines in the image.

Of course, when looking for shapes other than lines, a different number of pixels must be examined at once; eg when looking for a circumference triples of pixels must be chosen, because a circumference is determined by three points.

This modified algorithm is fast and effective, but some details need special care:

• one must be able to tell if two shapes are sufficiently close in parameter space to be considered the same shape. This means, first, defining a distance in parameter space, which is not always straightforward, and second, choosing a suitable threshold value;
• one must somehow decide how many tuples of pixels must be examined.

I adapted this algorithm for finding the directions of rows and columns of a grid in an image. Directions have one single parameter, an angle, so that in this case it is easy to define the distance in parameter space.

I have implemented the algorithm in C++. The obvious data structure is an STL multimap having the score as key and the direction as value, but multimap is highly inefficient here. It is much better to use a list of structures containing both the direction and the score.

There is a certain similarity between RHT and Teuvo Kohonen’s Self-Organizing Maps (SOM); in both cases a new pattern is compared to all the existing patterns and it can be merged with the most similar one (winner); in both cases the winner is adjusted toward the new pattern. An important difference is that in classical SOM the number of existing patterns (neurons) is fixed since the beginning, while in RHT one starts with zero existing patterns and new patterns are added when there is no winner; also, the adjustment rules are more sophisticated in SOM than in RHT.

It looks very interesting to me that algorithms stemming from different fields and having different justifications could be so similar; this reminds me of convergent evolution.

## Reconstruction of 3D points from images in a calibrated binocular system

In a binocular system one can write

$\begin{bmatrix} u \\ v \\ w \end{bmatrix} \propto P \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}$

and

$\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} \propto P' \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}$

where

$\begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}$ is an object point in homogeneous coordinates in mm (*),

$\begin{bmatrix} u \\ v \\ w \end{bmatrix}$ and $\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix}$ are its image points in homogeneous coordinates in pixels,

$P=K \begin{bmatrix} R & \mathbf{t} \end{bmatrix}$ and $P'=K' \begin{bmatrix} R' & \mathbf{t}' \end{bmatrix}$ are the projection matrices of the two cameras,

$K$ and $K'$ are the calibration matrices of the two cameras,

$\begin{bmatrix} R & \mathbf{t} \end{bmatrix}$ and $\begin{bmatrix} R' & \mathbf{t}' \end{bmatrix}$ are the rototranslations from a global reference system to the standard reference systems of the two cameras.

If the system is calibrated, $P$ and $P'$ are known up to scale; if the points are not at infinity, their euclidean coordinates $\frac{X}{W}, \frac{Y}{W}, \frac{Z}{W}$ can be obtained by DLT, in complete analogy with what is described here for homographies:

$\dfrac{X}{W} (P_{31} \dfrac{u}{w} - P_{11}) + \dfrac{Y}{W}(P_{32} \dfrac{u}{w}-P_{12}) + \dfrac{Z}{W}(P_{33} \dfrac{u}{w} - P_{13})=P_{14}-P_{34} \dfrac{u}{w}$

$\dfrac{X}{W} (P_{31} \dfrac{v}{w} - P_{21}) + \dfrac{Y}{W}(P_{32} \dfrac{v}{w}-P_{22}) + \dfrac{Z}{W}(P_{33} \dfrac{v}{w} - P_{23})=P_{24}-P_{34} \dfrac{v}{w}$

$\dfrac{X}{W} (P'_{31} \dfrac{u'}{w'} - P'_{11}) + \dfrac{Y}{W}(P'_{32} \dfrac{u'}{w'}-P'_{12}) + \dfrac{Z}{W}(P'_{33} \dfrac{u'}{w'} - P'_{13})=P'_{14}-P'_{34} \dfrac{u'}{w'}$

$\dfrac{X}{W} (P'_{31} \dfrac{v'}{w'} - P'_{21}) + \dfrac{Y}{W}(P'_{32} \dfrac{v'}{w'}-P'_{22}) + \dfrac{Z}{W}(P'_{33} \dfrac{v'}{w'} - P'_{23})=P'_{24}-P'_{34} \dfrac{v'}{w'}$

This is an inhomogeneous overdetermined linear system of four equations in the three unknowns $\dfrac{X}{W}$, $\dfrac{Y}{W}$ and $\dfrac{Z}{W}$, and can be solved eg by singular value decomposition, but its closure has no apparent geometric meaning. As usual, a method minimizing some sort of geometric error would be preferred.

The traditional way to reconstruct a 3D point with geometric optimization, as described eg by Hartley & Zisserman, Multiple View Geometry in Computer Vision, §12.3, is to find the minimum displacements of the reprojected image points on the image planes, so that the corrected image points do satisfy the epipolar constraint. This is a nonlinear problem whose solution is not straightforward.

As an alternative, one can use here the same concept underlying forward bundle adjustment: that is, minimize a distance in object space instead of image space, as follows:

The optical centre $\mathbf{C}= \begin{bmatrix} \tilde{\mathbf{C}} \\ 1 \end{bmatrix}$ is a null vector of the projection matrix: $P \mathbf{C} = \begin{bmatrix} M & \mathbf{P}_{.4} \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{C}} \\ 1 \end{bmatrix}= \mathbf{0}$, whence $\tilde{\mathbf{C}} = -M^{-1} \mathbf{P}_{.4}$; the direction of the optical ray $\tilde{\mathbf{D}}$ is the projection of the image point onto the plane at infinity: $P \mathbf{D} = \begin{bmatrix} M & \mathbf{P}_{.4} \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{D}} \\ 0 \end{bmatrix} = \mathbf{x}$, whence $\tilde{\mathbf{D}}=M^{-1} \mathbf{x}$.

The parametric equations of the optical rays are

$\tilde{\mathbf{L}}(\tau)=\tilde{\mathbf{C}} + \tau \tilde{\mathbf{D}}$

and

$\tilde{\mathbf{L}}'(\tau')=\tilde{\mathbf{C}}'+ \tau' \tilde{\mathbf{D}}'$

One wants to find a point $\tilde{\mathbf{X}}=\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$ and values of $\tau, \tau'$ such that the sum of squared distances $S=(\tilde{\mathbf{L}}(\tau) - \tilde{\mathbf{X}})^2+(\tilde{\mathbf{L}}'(\tau')-\tilde{\mathbf{X}})^2$ be minimum. By equating to zero the derivatives of $S$ with respect to $X, Y, Z, \tau, \tau'$ one gets a linear system of five equations in five unknowns, whose solution is straightforward.

This method is readily applied to the case of trinocular systems too, giving a linear system of six equations in six unknowns.

———-

(*) $\dfrac{X}{W}$, $\dfrac{Y}{W}$, $\dfrac{Z}{W}$ are in mm.

## Vanishing points in presence of noise

Most self-calibration algorithms require prior knowledge of the camera calibration matrix $K$; for instance, you need it to normalize the image points as $\tilde{\mathbf{x}}=K^{-1} \mathbf{x}$ and therefore to fit the essential matrix $E$.

With most commercial cameras it is safe to assume that the pixels are square and have no skew; in this case, $K= \begin{bmatrix} -fk & 0 & X_c \\ 0 & -fk & Y_c \\ 0 & 0 & 1 \end{bmatrix}$, where

• $X_c, Y_c$ are the coordinates of the principal point in pixel units; it is usually an acceptable first approximation to assume that the principal point is at the centre of the sensor;
• $k$ is the number of pixels per mm; it is usually safe to take the nominal value specified by the sensor vendor;
• $f$ is the focal length in mm. Taking the nominal value specified by the lens vendor may not be sufficient; even worse if, working with a zoom lens, one tries to read the focus scale.

So it would be nice to obtain at least $f$, or better the whole calibration matrix, from the images. The calibration matrix can be computed from the image of the absolute conic which, in the above conditions, is $\omega = \begin{bmatrix} \omega_1 & 0 & \omega_2 \\ 0 & \omega_1 & \omega_3 \\ \omega_2 & \omega_3 & \omega_4 \end{bmatrix}$; it is related to $K$ by $\omega \propto K^{-T}K^{-1}= \begin{bmatrix} \frac{1}{f^2 k^2} & 0 & -\frac{X_c}{f^2 k^2} \\ 0 & \frac{1}{f^2 k^2} & -\frac{Y_c}{f^2 k^2} \\ -\frac{X_c}{f^2 k^2} & -\frac{Y_c}{f^2 k^2} & 1+\frac{X_c^2+Y_c^2}{f^2 k^2} \end{bmatrix}$.

$\omega$, in turn, can be obtained from three couples of vanishing points $\mathbf{v}_1, \mathbf{v}_1'$; $\mathbf{v}_2, \mathbf{v}_2'$; $\mathbf{v}_3, \mathbf{v}_3'$ corresponding to pencils of lines which are orthogonal in the object space; each couple contributes a constraint $\mathbf{v}^T \omega \mathbf{v}'=0$. Three constraints are sufficient because $\omega$ has four distinct non-null entries but, being the matrix of a conic section, it is homogeneous (defined up to a scale factor).

In an ideal world, things are so simple. All the lines of the pencil intersect exactly in the same point, whose coordinates are readily computed and can be used to obtain $\omega$.

In the real world, nothing is perfect and things are not so simple. Straight lines must be obtained by fitting them to points (edgels), which are noisy. As a result, each couple of lines in the pencil will intersect in a different point (though all the points should be very close).

Taking the centroid of the intersections may work, but it is possible to do better by realizing that we are facing a constrained minimum problem: we are looking for a pencil of straight lines, minimizing the sum of squared distances of each line from its own cloud of points, all lines constrained to intersect in the same point.

Let

• $N$ be the number of lines,
• $M_i, i \in 1 \cdots N$ be the number of points in the cloud of the i-th line,
• $x_{ij}, y_{ij}, j \in 1 \cdots M_i$ be the coordinates of the j-th point in the cloud of the i-th line,
• $\theta_i, d_i$ be the parameters of the polar representation of the i-th line,
• $x_I, y_I$ be the coordinates of the intersection;

then the problem can be written as

$\min_{\{\theta_i\},\{d_i\}} \sum_{i=1}^N \sum_{j=1}^{M_i} (x_{ij} \cos \theta_i + y_{ij} \sin \theta_i - d_i)^2$

subject to

$x_I \cos \theta_i + y_I \sin \theta_i - d_i = 0, \quad i \in 1 \cdots N$

The Lagrangian is

$\Lambda (\{\theta_i\}, \{d_i\}, x_I, y_I)=\sum_{i=1}^N \left( \sum_{j=1}^{M_i} (x_{ij} \cos \theta_i + y_{ij} \sin \theta_i - d_i)^2 + \lambda_i (x_I \cos \theta_i + y_I \sin \theta_i - d_i) \right)$

where $\lambda_i$ are the Lagrange multipliers.

This is a nonlinear problem and as such needs an initial approximation to start with. The unconstrained straight lines can be fitted eg with the method of Alciatore and Miranda, which is closed form and fast, and in this context possible outliers can be rejected eg with RANSAC; the intersection can be obtained as the centroid of all the intersections. It is a good idea to check that no intersection be too far from the centroid; this could be a symptom of something gone astray in the previous steps.

Unfortunately, it is not possible to adapt to this problem the linear method I described here (reduction to the subspace of constraints), because the common intersection point is not known in advance.

Some more notes about my implementation of the forward bundle adjustment algorithm (see my previous post).

Monocular systems. According to Mitchell, Warren, McKinnon and Upcroft, the original algorithm does not work with monocular systems, due to a limitation of the sub-algorithm used to solve for the translation. In my implementation the solution involves all the parameters together, so this limitation is not present. I have tried a monocular version of my algorithm just to make sure that it works correctly, but of course I don't use it for my binocular rig, as it does not take into account the rig rigidity constraint.

Gauge freedom. There are seven degrees of gauge freedom which must be eliminated to avoid the solution wandering around (and possibly making bad encounters). The object point cloud is free to move rigidly if the rototranslation changes in the reverse way to compensate this movement: this means six unwanted degrees of freedom. Also, the object point cloud can grow or shrink if the focal lengths, pixel pitches and distances between optical centres change to compensate.

The easiest way to get rid of these unwanted degrees of freedom is:

• fix all three coordinates of two points in the cloud, that is, exclude them from the variables. This removes six degrees of freedom: three translational, two rotational and one of scale;
• fix one coordinate of a third point. This removes the last rotational degree of freedom.

The scale can later be adjusted by setting a global scale factor.

In a recent post I wrote about the essential matrix. If the internal parameters (pixel pitch, focal length, principal point) of both cameras in a stereo rig are known, the essential matrix can be obtained from at least eight couples of images of 3D points. Once the essential matrix is known, it is possible to recover from it the rototranslation between the two cameras (external parameters), so completing the calibration (let me forget the lens distortions, or better, assume they have been already corrected). During this process, the 3D coordinates of the points are recovered too. That's why the process is known as self-calibration or autocalibration: there are no separate calibration and measurement phases but, rather, the calibration is obtained automatically while measuring.

If the internal parameters are not known, the self-calibration follows a different path, starting from the computation of the fundamental matrix instead of the essential matrix and including some more steps.

There are many variants of the self-calibration process, all sharing two common features:

• they are linear (or at least closed-form) methods, so they are guaranteed to reach a solution and they don’t need any initial approximation (this is good);
• they reach the solution by minimizing some sort of algebraic closure (this is bad; see eg this post to know why, even if in a different context).

To improve the accuracy a further, nonlinear step is necessary, accepting the results of self-calibration as an initial approximation and minimizing a geometrically significant quantity. This step is known as bundle adjustment because it tries to adjust the bundles of optical rays by modifying the calibration parameters and the positions of the reconstructed object points so that each ray ideally passes through an object point, the corresponding image point and the optical centre. The geometrically significant quantity being minimized is usually the reprojection error, that is, the distance between the measured image point and the intersection of the optical ray (through the object point and the optical centre) with the plane of the sensor.

Exactly like the essential matrix, bundle adjustment was born in the photogrammetry community and then propagated to computer vision, where it has been the subject of an impressive amount of work in recent years. The paper Bundle Adjustment – A Modern Synthesis by Triggs, McLauchlan, Hartley and Fitzgibbon takes a synthetic but rather complete snapshot of the state of the art in the year 2000.

Much work has been done even after that date and, in my opinion, two new ideas are specially worth mentioning. One is the application of interval analysis to bundle adjustment by Fusiello et al. in 2004, of which I may write in a future post (*), and the other is what could be called forward bundle adjustment.

The idea of forward bundle adjustment, to my knowledge, can be traced to a 2006 paper by Schweighofer and Pinz, and has been further elaborated by Mitchell, Warren, McKinnon and Upcroft in 2010.

The basic idea is simple: instead of minimizing the reprojection error, that is a distance in the image space, what is minimized is the distance in the object space (whence ‘forward’) between the object point and the optical ray, passing through the image point and the optical centre.

Let $\mathbf{C}$ be the optical centre, $\mathbf{v}$ a vector along the optical ray, $\mathbf{P}$ the object point and $\mathbf{P^*}$ its projection onto the optical ray; one wants to minimize $\left| \mathbf{e} \right|^2=\left| \mathbf{P}-\mathbf{P^*} \right|^2$.

The parametric equation of the optical ray is $\mathbf{L}(\tau)=\mathbf{C}+\mathbf{v} \tau$. The coordinates of $\mathbf{P}$ are $\mathbf{X}=[X Y Z]^T$ in the object frame and $R \mathbf{X}+\mathbf{t}$ in the camera reference frame. In this same frame, $\mathbf{P}^*$ is at $\mathbf{C}+\mathbf{v}\tau^*$, with $\tau^*=\frac{\mathbf{v}^T (R \mathbf{X}+\mathbf{t}-\mathbf{C})}{\mathbf{v}^T \mathbf{v}}$ minimizing $\left| \mathbf{L} (\tau) - \mathbf{P} \right|^2$; so at last one gathers $\mathbf{e}= (I-\frac{\mathbf{v} \mathbf{v}^T}{\mathbf{v}^T \mathbf{v}}) (R \mathbf{X} + \mathbf{t} - \mathbf{C})$.

Another key point in the work of the cited authors is to split the solution space into rotation $R$, translation $\mathbf{t}$ and structure $\mathbf{X}$ and to solve separately for the three subsets, iterating to convergence; each subproblem has a closed form solution.

I don’t like this approach very much. Closed form solutions are necessary when no initial approximation is available, but this is not the case, as we can start from the output of self-calibration. They have the advantage that the solution is reached without wandering around and maybe dropping into a local minimum, but here this guarantee is lost because the three closed form solutions are iterated in sequence. It remains to be assessed how much dealing with three smaller sets of variables speeds up the convergence with respect to one single larger set.

In the version of forward bundle adjustment I have implemented, all the parameters – internal, external and distortion – are considered together and the nonlinear optimization is performed with a sparse Levenberg-Marquardt algorithm, relying on multithreaded conjugate gradient to solve the linear steps. The symbolic Jacobian matrix is obtained by a CAS library. The rotations are parametrized with Euler angles, so avoiding the local minima corresponding to reflections which appear to plague the original formulation of the algorithm.

The typical calibration session consists of nine presentations of a pattern printed on paper and stuck onto a roughly planar surface (according to Triggs, the presentations must be at least five). Each presentation supplies 96 object points, for a total of 864 object points, which are matched in the two images of the stereo rig. No preliminary RANSAC selection. The nonlinear system counts 1728 equations in 357 variables. Its RMS closure drops from 0.074 mm (with initial values from self-calibration) to 0.020 mm in three Levenberg-Marquardt iterations, taking less than 10 seconds on a 2.2 GHz Intel Core i7 (four cores used); then there is a further, slight improvement to 0.017 mm in about 4 minutes. In the same conditions, my implementation of the 'backward' bundle adjustment (minimizing error in image space) takes more than half an hour.

———-

(*) For the sake of precision, in the linked paper Fusiello et al. do not describe the application of interval analysis to bundle adjustment but rather to an algorithm for determining internal parameters, based on a nonlinear cost function which has no apparent geometric meaning. What I intended to say is that interval analysis, as described by Fusiello, can be applied to bundle adjustment; that is precisely what I am working on these days.