Overview of image formation in AWARE
Welcome, everyone, to our humble home on the internet.
The AWARE project is a substantial team effort involving many institutions. The team is led by David Brady and his DISP group at Duke University. The electronic infrastructure was designed and built by Distant Focus Corp. The optics were designed by Duke and UC San Diego and fabricated by RPC Photonics. Our role at the University of Arizona was to develop the algorithmic approach and software tools for efficiently combining the individual camera images into the resulting scene.
Image formation in AWARE poses a number of challenges that separate it from other 'image stitching' tasks in the literature:
- The volume of data: Ultimately, the goal is for the sensors in the microcameras to run at 10 Hz (ten frames per second). This means we have to complete image formation in under 100 milliseconds, or we'll be overtaken by new data from the system. Further, a one-gigapixel image is 100-1000 times larger than commonly available displays, so we must provide a way for people to interact with images at this scale.
- Optical constraints: The new optical architecture constrains the optics in the microcamera in unusual ways. The images contain a significant amount of distortion (5-10%) that we must correct. Also, the image from a microcamera is severely vignetted---it dims rapidly as you move from its center to its edge. This must also be corrected, or the combined image will appear very splotchy.
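To give a feel for these two corrections, here is a minimal sketch in Python. The falloff and distortion coefficients are hypothetical placeholders, and the one-parameter radial model is illustrative---it is not the actual calibration model used in AWARE:

```python
import numpy as np

def vignetting_gain(h, w, falloff=0.45):
    # Hypothetical radial falloff: gain is 1.0 at the image center and
    # drops smoothly toward the corners (the 'dimming' described above).
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot((ys - cy) / cy, (xs - cx) / cx)
    return np.clip(1.0 - falloff * r**2, 0.1, 1.0)

def correct_vignetting(img, gain):
    # Divide out the measured dimming so the mosaic isn't splotchy.
    return img / gain

def undistort_points(x, y, k1=-0.08):
    # One-parameter radial distortion model (illustrative only):
    # the distorted radius is r * (1 + k1 * r^2) for normalized coords.
    scale = 1.0 + k1 * (x**2 + y**2)
    return x / scale, y / scale
```

A uniformly lit scene imaged through the vignetting model and then corrected comes back uniform, which is exactly the property the mosaic needs.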
However, unlike traditional image stitching algorithms, we have a tremendous amount of knowledge about how the individual images are interrelated. The microcameras are, after all, mounted in a rigid aluminum structure of our design. It's true that our knowledge isn't perfect---there are manufacturing tolerances, thermal expansion, and vibration to consider---but the situation is substantially constrained.
In developing the approach, we had four overarching goals:
- Scalability: We should be able to tackle larger images and more users by adding hardware resources to the computation. Further, this scaling should be linear or sublinear (the hardware requirements should grow no faster than the problem size).
- High fidelity: A major advantage of the optical architecture is that it delivers optical resolution that is normally very hard to attain at this scale. Our approach should not sacrifice that image fidelity.
- Non-blocking: We want to allow multiple users to interact with the camera simultaneously. Our approach should not allow the actions of one user to affect the images delivered to another.
- Single framework: The same processing pipeline should be able to produce output of varying size and resolution.
Our general approach can be described as 'model based'. Rather than repeatedly solving the image-combination problem from scratch, most of the time we assume we know precisely how the images fit together (we just have to go through the steps of combining them). This builds on our extensive knowledge of how the images relate, as well as how the microcamera optics distort and vignette the scene.
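The model-based split amounts to: precompute the mapping once, then reuse it every frame. A toy one-dimensional sketch (the real model maps 2-D gigapixel coordinates into each microcamera, with distortion and vignetting parameters; here the 'model' is just an integer shift):

```python
import numpy as np

def build_lookup(model_shift, out_len, cam_len):
    # Done once (after registration): for each output pixel, record which
    # camera pixel it samples, and whether the camera sees it at all.
    src = np.arange(out_len) + model_shift
    valid = (src >= 0) & (src < cam_len)
    return np.where(valid, src, 0), valid

def form_frame(cam_pixels, lookup, valid):
    # Done every frame: no solving, just a gather through the lookup.
    out = cam_pixels[lookup].astype(float)
    out[~valid] = np.nan  # pixels outside this camera's view
    return out
```

The per-frame work is a pure memory gather, which is what makes the 100-millisecond budget plausible.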
As mentioned above, our knowledge of exactly how everything goes together isn't perfect. Periodically, we run a registration procedure that updates our model of the system. This works by applying techniques known as 'feature transforms' to find possible correspondences between the cameras in the regions where their views overlap. An optimization procedure then tweaks the parameters in our model until the errors in these correspondences are minimized. The result is an updated model that we use until the next time we do registration. In this way we amortize the cost of the registration process across many frames.
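The core of the update step can be sketched as follows. This assumes matched feature locations are already in hand (e.g. from a feature-transform matcher) and uses a pure-translation model, for which the least-squares answer is just the mean displacement; the actual AWARE model has more parameters and would use a nonlinear optimizer:

```python
import numpy as np

def estimate_offset(pts_a, pts_b):
    # pts_a, pts_b: (N, 2) arrays of matched feature locations in the
    # overlap region of two cameras. For a translation-only model, the
    # least-squares estimate is the mean displacement.
    return (pts_b - pts_a).mean(axis=0)

def update_model(model_offset, pts_a, pts_b, damping=0.5):
    # Nudge the stored model toward the newly measured offset; damping
    # (a hypothetical choice here) keeps bad matches from jerking it around.
    measured = estimate_offset(pts_a, pts_b)
    return model_offset + damping * (measured - model_offset)
```

Because this runs only periodically, its cost is amortized over the many frames formed with the resulting model.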
The model-based approach simplifies the computational load, but we still may be dealing with up to about 1.4 billion pixels that we must combine into the final gigapixel image. Conveniently, with the model-based approach, we can cast the process of working through all these pixels in a form that is compatible with a map/reduce framework. Map/reduce is a method for distributed data processing (popularized by Google) that allows some algorithms to be easily broken into pieces and run on parallel hardware. In our case, we have versions that run on both CPU clusters and GPUs. With this step, the algorithm scales nicely with available hardware.
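In map/reduce terms: the map step tags each camera pixel with the output pixel it lands on, and the reduce step combines all measurements sharing a tag. A minimal single-process sketch (a real deployment would shard this across the cluster or GPU; the simple average here stands in for the estimator described below):

```python
from collections import defaultdict

def map_step(camera_id, pixels, to_output_coord):
    # Emit (output-pixel, measurement) pairs for every camera pixel.
    # to_output_coord encapsulates the registered geometric model.
    for (x, y), value in pixels:
        yield to_output_coord(camera_id, x, y), value

def reduce_step(pairs):
    # Gather every measurement of each output pixel, then combine them.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```

Because each output pixel's bucket is independent, the reduce step parallelizes trivially, which is the property that lets the pipeline scale with added hardware.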
Maximum Likelihood Estimation
For the 'reduce' step of map/reduce, we have to develop a method for combining multiple camera pixels that are looking at the same point in the scene. To maximize the fidelity of the resulting image, we treat each of these camera pixels as an independent measurement of that common point, each corrupted by a different realization of random noise. We have derived a 'maximum likelihood estimator' for combining these measurements: it returns the value of scene brightness that has the highest probability of producing the observed measurements. We use this estimator as the combination method in the reduce step.
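For the textbook case of independent Gaussian noise with known per-pixel variances, the maximum likelihood estimate reduces to the inverse-variance weighted mean---a useful mental model, though the estimator derived for AWARE also has to account for effects like vignetting gain:

```python
import numpy as np

def ml_combine(values, variances):
    # Independent Gaussian measurements v_i of the same scene brightness s,
    # each with variance sigma_i^2. Maximizing the likelihood gives
    #   s_hat = sum(v_i / sigma_i^2) / sum(1 / sigma_i^2),
    # i.e. noisier measurements get proportionally less weight.
    w = 1.0 / np.asarray(variances, dtype=float)
    v = np.asarray(values, dtype=float)
    return (w * v).sum() / w.sum()
```

With equal variances this is just the ordinary mean; when one camera's pixel is noisier (say, dimmed by vignetting and then amplified), its weight shrinks and the estimate leans on the cleaner measurements.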