1 Introduction

Accurately estimating the position or direction of a light source is essential for many physics-based computer vision tasks, such as shape from shading (Horn 1970), photometric stereo (Silver 1980; Woodham 1980), or reflectance and material estimation (Goldman et al. 2010). In these tasks, inaccurate light positions cause errors. For example, Fig. 1 shows the relation between light calibration error and surface normal estimation error in a synthetic experiment with a directional light, a Lambertian sphere as target object, and a basic photometric stereo method (Silver 1980; Woodham 1980). We can clearly see the importance of accurate light calibration. Ideally, the error of a calibration method is so small that developers of physics-based modeling algorithms never need to consider it. Although there are approaches to refine inaccurate light calibration (Quéau et al. 2017) or bypass calibration altogether [uncalibrated photometric stereo (Alldrin et al. 2007; Matsushita et al. 2010; Chen et al. 2019)], they do not make highly accurate calibration obsolete: uncalibrated photometric stereo cannot overcome the generalized bas-relief ambiguity for Lambertian materials, and even in favorable settings it does not reach the accuracy of careful calibration. Despite the importance of accurate light calibration, it remains laborious because researchers have not yet come up with accurate and easy-to-use techniques.

Fig. 1 Light calibration error versus normal estimation error in photometric stereo. Each data point is the average of 100 independent runs

Fig. 2 Clockwise from top left: our calibration target, a camera observing the movement of shadows cast by a point light while the target is moved, our algorithm’s workflow, and the estimation result (Color figure online)

This paper proposes a method for calibrating both distant and near point lights. We introduce a calibration target, shown in Fig. 2, that can be made within 1–2 min from off-the-shelf items for less than five dollars. Instead of specular highlights on spheres, we use a planar board (shadow receiver) and pins (shadow casters) that cast small point shadows on the board. Moving the board around in front of a static camera and light source and observing the pin head shadows under various board poses lets us determine the light position/direction.

The reasons why we operate with shadows on a planar target rather than with specular highlights or spherical targets are the following: A key factor in the overall calibration accuracy is the accuracy with which one can localize a calibration method’s points of interest in the captured images. With the off-the-shelf pins that we use, we can automatically localize shadow centers with an accuracy of \(\sim \) 1–2 px (Fig. 3, left), which is in marked contrast to how accurately we can detect specular highlights (Fig. 3, center and right). Moreover, our planar target translates small shadow localization errors into only small light direction errors. In contrast, mirror sphere methods amplify localization errors since the surface normal, which determines the light reflection angle, varies across the sphere.

From a geometric point of view, point lights are inverse pinhole cameras (Hu et al. 2004) (see Fig. 4). We can thus build upon past studies on multiview projective geometry. In particular, we show that shadows of static objects on a plane follow the principles of epipolar geometry. Further, we show that, analogous to structure from motion (SfM), which jointly estimates camera poses and 3D point locations, we can jointly estimate light position/direction and shadow caster pin positions by moving our calibration target and observing the pin shadows, i.e., we can estimate light and pins via structure from pin motion. Conveniently, this joint estimation of light and pins allows users to place the pins arbitrarily on the board without needing to know their locations; in contrast to most previous works, our calibration target does not need to be carefully manufactured or measured.

To summarize, the primary contributions of our work are as follows. First, we show that shadow projection with a unified light model for both nearby and distant light follows the principles of pinhole camera projection and epipolar geometry. Second, using these principles we show how the joint estimation of light position/direction and 3D shadow caster positions based on shadow observations can be formulated as a bundle adjustment problem and we develop a robust solution technique for accurately achieving this estimation. Finally, we introduce a practical light source calibration method based on an easy-to-make calibration target. Instead of requiring a carefully designed calibration target, our method only uses needle pins that are stuck at unknown locations on a plane.

The benefits of the new calibration target and associated solution method are an extremely simple target construction process, a calibration process that requires no manual intervention other than moving the target since all required information is inferred automatically, and improved accuracy compared to prior work.

Fig. 3 Left: the pin head shadows on our planar target. Center and right: specular highlights on a mirror plane (Shen and Cheng 2011) and a mirror sphere

Fig. 4 Cameras versus point lights. A camera matrix \(\mathbf {P}_i\) projects a scene point \(\mathbf {c}_j\) to an image point \(\mathbf {s}_{ij}\) just like a light matrix \(\mathbf {L}_i\) projects a scene point \(\mathbf {c}_j\) to a shadow \(\mathbf {s}_{ij}\). Conventional SfM estimates \(\mathbf {P}_i\) and \(\mathbf {c}_j\) from \(\{\mathbf {s}_{ij}\}\) and in this paper we show how to estimate \(\mathbf {L}_i\) and \(\mathbf {c}_j\) from \(\{\mathbf {s}_{ij}\}\)

2 Related Work

Light source calibration can be roughly divided into two tasks: geometric calibration and radiant intensity distribution (RID) calibration. This paper is solely concerned with geometric calibration, but in this section we review prior work on both tasks and discuss its relationship to our work.

Geometric Light Source Calibration The goal of geometric light source calibration is to estimate

(a) light source directions in scenes with distant point light sources, or

(b) light source positions in scenes with nearby point light sources.

In category (a), Zhang and Yang (2001) [and Wei (2003) with a more robust implementation] proposed a method to estimate multiple distant lights based on a Lambertian sphere’s shadow boundaries and their intensities. Wang and Samaras (2002) extended this to objects of arbitrary but known shape, combining information from the object’s shading and the shadows cast on the scene. Zhou and Kambhamettu (2002) estimated light directions from stereo images of a reference sphere with specular reflection. Cao and Shah (2005) proposed a method for estimating camera parameters and the light direction from shadows of common vertical objects such as walls instead of special, precisely fabricated objects, which can be used for images from less restricted settings.

In category (b), Powell et al. (2001) triangulated multiple light positions from highlights on three specular spheres at known positions. Other methods also used reflective spheres (Ackermann et al. 2013; Hara et al. 2005; Schnieders and Wong 2013; Takai et al. 2009; Wong et al. 2008) or specially designed geometric objects (Aoto et al. 2012; Bunteong and Chotikakamthorn 2016; Weber and Cipolla 2001). In contrast to these, some methods were based on planar mirrors (Schnieders et al. 2009; Shen and Cheng 2011); they modeled the mirror by perspective projection and inferred its parameters similar to camera calibration. An interesting method, quite similar to ours in its simplicity and usage of shadows, is Bouguet and Perona’s (1999, Sec. 2.3): They captured a pencil standing upright at multiple positions on a plane and triangulated all rays from pencil tip shadow to pencil tip. The core difference from our method is that theirs does not jointly estimate the calibration target (i.e., the pencil) with the light position.

In highlight-based geometric calibration methods, precisely localizing the light source center’s reflection on the specular surface is problematic in practice: Even with the shortest exposure at which one can still barely detect or annotate other parts of the calibration target (pose detection markers, sphere outline, etc.), the highlight is much bigger than an image of the light source (such as a switched-off LED seen in the mirror) would be; see Fig. 3, center and right. Lens flare, noise, etc. further complicate segmenting the highlight. Also, since the highlight is generally not a circle but a conic section on a mirror plane or an even more complicated shape on a mirror sphere, the light source center’s image (i.e., the intersection of the light cone’s axis and the mirror) cannot be computed as the highlight’s centroid, as for example Shen and Cheng (2011) did. We thus argue that it is extremely hard to reliably localize light source centers on specular surfaces with pixel accuracy—even with careful manual annotation. Instead, we employ very small cast shadows for stable localization.

Mirror sphere-based geometric calibration methods suffer from the fact that the sphere curvature amplifies highlight localization errors into larger light direction errors since the surface normal, which determines the reflection angle, differs between erroneous and correct highlight location. Also, the spheres need to be very precise since “even slight geometric inaccuracies on the surface can lead to highlights that are offset by several pixels and markedly influence the stability of the results” (Ackermann et al. 2013). The price of precise spheres (\(\sim \) $ 40 for a high-quality 60 mm bearing ball, of which we need 3–8 for accurate calibration) rules out high-accuracy sphere-based calibration for users on a tight budget.

Further, sphere methods typically require accurate annotation of the sphere outline in the images. Although methods for automatic ellipse detection exist (Pătrăucean et al. 2016), accurately detecting the boundary of the mirror sphere is extremely difficult because the sphere’s exact outline is hard to distinguish from the background, especially in dark images, since the sphere also mirrors the background.

Regarding the triangulation of multiple line constraints for the position of a light source, Hartley and Sturm (1997) and Szeliski (2010, Sec. 7.1) pointed out that, given noisy observations, reprojection error minimization is superior to finding the 3D point closest to each ray in a set of rays. The latter is a popular choice in many methods of category (b), for example Shen and Cheng’s mirror plane method (Shen and Cheng 2011) or most sphere methods prior to Ackermann’s (2013). By contrast, Ackermann et al. (2013) and we follow Hartley and Sturm’s suggestion and minimize reprojection error to obtain a more accurate prediction of light source positions.

The connection between pinhole cameras and point lights that we describe and exploit in the next section has already been shown by others: Hu et al. (2004) use it in a theoretical setting similar to ours with point objects and shadows. However, they do not turn it into a full mathematical formalism for light estimation but only discuss it with geometric sketches (and suggest using the inferior triangulation method mentioned above).

We push the idea further by deriving mathematical solutions, extending it to a unified light model that includes distant light, embedding it in an SfM framework (Snavely et al. 2006; Triggs et al. 2000) that minimizes reprojection error, deriving an initialization for the non-convex minimization, devising a simple calibration target that leverages our method in the real world, and demonstrating our method’s accuracy in simulated and real-world experiments.

Radiant Intensity Distribution Calibration While point light sources are often assumed to emit light uniformly in all directions, practical light sources such as LEDs actually have a non-isotropic lighting distribution (radiant intensity distribution; RID), which is described in terms of light orientation, intensities, and an anisotropy function. Park et al. (2014) and Ma et al. (2019) handle non-isotropic lights and jointly estimate the light position and RID from imagery of shading and specular reflections on a planar calibration target. In the context of photometric stereo, Quéau et al. (2018), Collins and Bartoli (2012), and Song et al. (2018) proposed pipelines that first perform geometric calibration by specularity-based triangulation and then estimate the RID from shading on a planar target. Regarding the geometric calibration, these methods have been shown to obtain more accurate estimates than the plane-based methods of Park et al. (2014) and Ma et al. (2019). Although this paper focuses on geometric calibration, we note that plane-based RID calibration could be combined with our method since it uses a similar planar target.

Another way to handle practical light sources in photometric stereo is “semi-calibrated” approaches (Cho et al. 2018; Logothetis et al. 2017), which assume given light source positions/directions and estimate the non-uniform light intensities simultaneously with the scene shape. These methods take care of part of the RID, i.e., the intensities, and assume isotropic lights or a known anisotropy function.

More generally, incoming light may come not only from a single point source but also from a distribution of many points. Sato et al. (2001, 2003) used shadows of an object of known shape to estimate illumination distributions of area lights, while being restricted to distant light and having to estimate the shadow receiver’s reflectance. Recently, Gardner et al. (2017) proposed a deep neural network that estimates indoor illumination from a single image. By limiting the scenes to indoor environments and training their model on a panorama image dataset, their method does not require any calibration target.

3 Shadow Geometry

As foundation for the later sections, in this section we will lay out the mathematics behind shadow projection. Specifically, we analyze how a point light or distant, parallel light projects shadows of infinitesimal shadow casters on a shadow receiver plane. In Sect. 3.1, we will derive the shadow projection in an entirely static scene. In Sect. 3.2, we will then analyze shadows in a scene where the plane and the shadow casters remain static but the light source moves.

Throughout this paper, we will denote matrices and vectors with bold upper and lower case, respectively, and the homogeneous form of vector \(\mathbf {v}\) with \(\tilde{\mathbf {v}}\). Further, we will sometimes use parentheses and indices to refer to parts of a vector/matrix/tensor: \((\mathbf {v})_i\) denotes vector \(\mathbf {v}\)’s ith element, \((\mathbf {L})_{i, j}\) is the element in row i and column j of matrix \(\mathbf {L}\), \((\mathbf {L})_{i, :}\) refers to the entire row i of \(\mathbf {L}\), and \((\mathbf {L})_{:,i:j}\) refers to all rows of columns i to j of \(\mathbf {L}\).

3.1 Shadow Formation Model

In this section, we show the mathematical relationship between a light source, shadow casters, and their corresponding shadows on a shadow receiver plane \(\Pi \). Let us for now assume that the pose of the plane \(\Pi \) is fixed to the world coordinate system’s xy plane.

Nearby Light Let a nearby point light be located at \(\mathbf {l}= \left[ l_x, l_y, l_z\right] ^\top \in \mathbb {R}^3\) in world coordinates. An infinitesimally small caster located at \(\mathbf {c}\in \mathbb {R}^3\) in world coordinates casts a shadow on the receiver plane \(\Pi \) at \(\mathbf {s}\in \mathbb {R}^2\) in \(\Pi \)’s 2D coordinate system, which is \(\bar{\mathbf {s}} = \left[ \mathbf {s}^\top , 0\right] ^{\top }\) in world coordinates because \(\Pi \) coincides with the world’s xy plane. Since \(\mathbf {l}\), \(\mathbf {c}\), and \(\bar{\mathbf {s}}\) are all on the same line, the lines \(\overline{\mathbf {c}\bar{\mathbf {s}}}\) and \(\overline{\mathbf {l}\bar{\mathbf {s}}}\) must be parallel:

$$\begin{aligned} (\mathbf {c}-\bar{\mathbf {s}})\times (\mathbf {l}-\bar{\mathbf {s}})=\mathbf {0}. \end{aligned}$$
(1)

Inserting \(\mathbf {c}{=} [c_x, c_y, c_z]^\top \), \(\bar{\mathbf {s}}{=}[s_x,s_y,0]^\top \), and \(\mathbf {l}=[l_x, l_y, l_z]^\top \) (all in non-homogeneous 3D global world coordinates) into Eq. (1) yields

$$\begin{aligned} (\mathbf {c}-\bar{\mathbf {s}})\times (\mathbf {l}-\bar{\mathbf {s}})= \begin{bmatrix} c_x-s_x\\ c_y-s_y\\ c_z-0 \end{bmatrix}\times \begin{bmatrix} l_x-s_x\\ l_y-s_y\\ l_z-0 \end{bmatrix} =\mathbf {0}. \end{aligned}$$

Expanding the cross-product yields

$$\begin{aligned}&\left\{ \begin{aligned}&(c_y-s_y)l_z-c_z(l_y-s_y)=0,\\&c_z(l_x-s_x)-(c_x-s_x)l_z=0,\\&(c_x-s_x)(l_y-s_y)-(c_y-s_y)(l_x-s_x)=0, \end{aligned} \right. \\ \Leftrightarrow&\left\{ \begin{aligned} s_x=\frac{c_xl_z-c_zl_x}{l_z-c_z},\\ s_y=\frac{c_yl_z-c_zl_y}{l_z-c_z}. \end{aligned}\right. \end{aligned}$$

We can then rewrite \(\mathbf {s}\) in homogeneous coordinates using scaling parameters \(\gamma \) and \(\lambda \):

$$\begin{aligned} \gamma \tilde{\mathbf {s}} =&\begin{bmatrix} \frac{c_xl_z-c_zl_x}{l_z-c_z}\\ \frac{c_yl_z-c_zl_y}{l_z-c_z}\\ 1 \end{bmatrix}\\ \underbrace{-(l_z-c_z)\gamma }_{\lambda }\tilde{\mathbf {s}} =&\begin{bmatrix} -(c_xl_z-c_zl_x)\\ -(c_yl_z-c_zl_y)\\ -(l_z-c_z) \end{bmatrix}\\ \lambda \tilde{\mathbf {s}} =&\begin{bmatrix} -l_z&{} 0 &{} l_x&{} 0\\ 0 &{} -l_z&{} l_y&{} 0\\ 0 &{} 0 &{} 1 &{} -l_z\end{bmatrix} \begin{bmatrix} c_x\\ c_y\\ c_z\\ 1 \end{bmatrix} = \mathbf {L}\tilde{\mathbf {c}}. \end{aligned}$$

In the following, we will call \(\mathbf {L}\) a light matrix. As we can see, point lights and pinhole cameras can be described by similar mathematical models with the following correspondences: (point light \(\Leftrightarrow \) pinhole camera), (shadow receiver plane \(\Leftrightarrow \) image plane), (shadow caster \(\Leftrightarrow \) scene point), and (light matrix \(\mathbf {L}\)\(\Leftrightarrow \) camera projection matrix \(\mathbf {P}=\mathbf {K}\left[ \,\mathbf {R}\mid \mathbf {t}\,\right] \)), as illustrated in Fig. 4. We can even decompose the light matrix \(\mathbf {L}\) into “intrinsics” and “extrinsics” parameterized by the light location \(\mathbf {l}\) as

$$\begin{aligned} \mathbf {L} = \begin{bmatrix} -l_z&{} 0 &{} l_x&{} 0\\ 0 &{} -l_z&{} l_y&{} 0\\ 0 &{} 0 &{} 1 &{} -l_z\end{bmatrix} = \underbrace{ \begin{bmatrix} -l_z&{} 0 &{} l_x\\ 0 &{} -l_z&{} l_y\\ 0 &{} 0 &{} 1 \end{bmatrix}}_{\mathrm {intrinsics\,\,}\mathbf {K}} \underbrace{ \begin{bmatrix} 1 &{} 0 &{} 0 &{} -l_x\\ 0 &{} 1 &{} 0 &{} -l_y\\ 0 &{} 0 &{} 1 &{} -l_z\end{bmatrix}}_{\mathrm {extrinsics\,\,}\left[ \mathbf {R}\mid \mathbf {t}\right] }. \end{aligned}$$
(2)
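To make this correspondence concrete, the following minimal numpy sketch (with hypothetical numeric values) builds the nearby-light matrix \(\mathbf {L}\) from a light position \(\mathbf {l}\), checks that \(\mathbf {L}\tilde{\mathbf {c}}\) reproduces the geometric shadow obtained by intersecting the ray from \(\mathbf {l}\) through \(\mathbf {c}\) with the receiver plane \(z=0\), and verifies the decomposition of Eq. (2). The helper names and values are illustrative only.

```python
import numpy as np

def light_matrix_nearby(l):
    """Nearby-light matrix L of Sect. 3.1: lambda * s_tilde = L @ c_tilde."""
    lx, ly, lz = l
    return np.array([[-lz, 0.0, lx, 0.0],
                     [0.0, -lz, ly, 0.0],
                     [0.0, 0.0, 1.0, -lz]])

def shadow_by_ray(l, c):
    """Geometric reference: intersect the ray from l through c with the plane z = 0."""
    t = l[2] / (l[2] - c[2])
    return (l + t * (c - l))[:2]

l = np.array([0.3, -0.2, 2.0])    # point light in world coordinates (hypothetical values)
c = np.array([0.1, 0.05, 0.15])   # shadow caster above the receiver plane z = 0

L = light_matrix_nearby(l)
s_h = L @ np.append(c, 1.0)       # homogeneous shadow, lambda * s_tilde
s = s_h[:2] / s_h[2]              # dehomogenize
assert np.allclose(s, shadow_by_ray(l, c))

# Decomposition of Eq. (2): L = K [R | t] with R = I and t = -l
K = np.array([[-l[2], 0.0, l[0]],
              [0.0, -l[2], l[1]],
              [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), -l.reshape(3, 1)])
assert np.allclose(L, K @ Rt)
print("shadow on the receiver plane:", s)
```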

Distant Light For distant light, all light rays in the scene are parallel, \(\mathbf {l}= \left[ l_x, l_y, l_z\right] ^\top \in \mathcal {S}^2\) (\(\mathcal {S}^2 = \{\mathbf {v} \in \mathbb {R}^3: |\mathbf {v}|=1\}\)) is a light direction instead of a position, and the line \(\overline{\mathbf {c}\bar{\mathbf {s}}}\) must be parallel to \(\mathbf {l}\):

$$\begin{aligned} (\mathbf {c}-\bar{\mathbf {s}})\times \mathbf {l}=\mathbf {0}. \end{aligned}$$
(3)

Inserting \(\mathbf {c}\), \(\bar{\mathbf {s}}\), and \(\mathbf {l}\) into Eq. (3) yields

$$\begin{aligned} (\mathbf {c}-\bar{\mathbf {s}})\times \mathbf {l}= \begin{bmatrix} c_x-s_x\\ c_y-s_y\\ c_z-0 \end{bmatrix}\times \begin{bmatrix} l_x\\ l_y\\ l_z\end{bmatrix} =\mathbf {0}. \end{aligned}$$

By expanding the cross-product, we have

$$\begin{aligned}&\left\{ \begin{aligned}&(c_y-s_y)l_z-c_zl_y=0, \\&c_zl_x-(c_x-s_x)l_z=0, \\&(c_x-s_x)l_y-(c_y-s_y)l_x=0, \end{aligned} \right. \\ \Leftrightarrow&\left\{ \begin{aligned} s_x=\frac{c_xl_z-c_zl_x}{l_z},\\ s_y=\frac{c_yl_z-c_zl_y}{l_z}. \end{aligned}\right. \end{aligned}$$

We can then write \(\mathbf {s}\) in homogeneous coordinates as:

$$\begin{aligned} \gamma \tilde{\mathbf {s}} =&\begin{bmatrix} \frac{c_xl_z-c_zl_x}{l_z}\\ \frac{c_yl_z-c_zl_y}{l_z}\\ 1 \end{bmatrix}\\ \underbrace{-l_z\gamma }_{\lambda }\tilde{\mathbf {s}} =&\begin{bmatrix} -(c_xl_z-c_zl_x)\\ -(c_yl_z-c_zl_y)\\ -l_z\end{bmatrix}\\ \lambda \tilde{\mathbf {s}} =&\begin{bmatrix} -l_z&{} 0 &{} l_x&{} 0\\ 0 &{} -l_z&{} l_y&{} 0\\ 0 &{} 0 &{} 0 &{} -l_z\end{bmatrix} \begin{bmatrix} c_x\\ c_y\\ c_z\\ 1 \end{bmatrix} = \mathbf {L}\tilde{\mathbf {c}}. \end{aligned}$$

The difference to the nearby light case is the entry \((\mathbf {L})_{3,3}=0\). This light matrix resembles orthographic projection with a camera matrix \(\left[ {\begin{matrix}1&{}0&{}0&{}0\\ 0&{}1&{}0&{}0\\ 0&{}0&{}0&{}1\end{matrix}}\right] \).

Unifying Nearby and Distant Light Having two different models, a nearby and a distant light model, is a nuisance because it forces users to choose the one that better fits their scene, which can be hard especially for inexperienced users. Further, the real world does not exhibit a sharp transition from nearby to distant light (at a distance of, say, 3 m) but rather a smooth transition. It is thus desirable to have a unified model that also transitions smoothly. The intuition behind a unification is that orthographic projection can be seen as a special case of perspective projection with an infinite focal length.

Since in homogeneous coordinates we consider vectors equivalent if they are equal up to a constant, we can divide the nearby and distant light matrices by \(l_z\):

$$\begin{aligned} \mathbf {L}_\text {nearby}=&\begin{bmatrix} -l_z&{} 0 &{} l_x&{} 0\\ 0 &{} -l_z&{} l_y&{} 0\\ 0 &{} 0 &{} 1 &{} -l_z\end{bmatrix} \rightarrow \begin{bmatrix} -1 &{} 0 &{} l_x/l_z&{} 0\\ 0 &{} -1 &{} l_y/l_z&{} 0\\ 0 &{} 0 &{} 1/l_z&{} -1 \end{bmatrix},\\ \mathbf {L}_\text {distant}=&\begin{bmatrix} -l_z&{} 0 &{} l_x&{} 0\\ 0 &{} -l_z&{} l_y&{} 0\\ 0 &{} 0 &{} 0 &{} -l_z\end{bmatrix} \rightarrow \begin{bmatrix} -1 &{} 0 &{} l_x/l_z&{} 0\\ 0 &{} -1 &{} l_y/l_z&{} 0\\ 0 &{} 0 &{} 0 &{} -1 \end{bmatrix}. \end{aligned}$$

We can see that, if the light source moves towards infinity, \(\mathbf {L}_\text {nearby}\) converges to \(\mathbf {L}_\text {distant}\). Therefore, we use

$$\begin{aligned} \lambda \tilde{\mathbf {s}} = \begin{bmatrix} -1 &{} 0 &{} l_x/l_z&{} 0\\ 0 &{} -1 &{} l_y/l_z&{} 0\\ 0 &{} 0 &{} 1/l_z&{} -1 \end{bmatrix} \begin{bmatrix} c_x\\ c_y\\ c_z\\ 1 \end{bmatrix} = \mathbf {L}\tilde{\mathbf {c}}. \end{aligned}$$
(4)

as unified shadow projection equation for representing both nearby and distant light. Later in this paper, we will see that we can use the unified projection model for determining light positions and directions in synthetic as well as real-world datasets.
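As a small illustration (a sketch with hypothetical values, not part of the calibration pipeline itself), the unified matrix of Eq. (4) can be checked numerically: moving a nearby light further and further away along a fixed direction, the predicted shadow converges to the distant-light shadow of the derivation above.

```python
import numpy as np

def light_matrix_unified(l):
    """Unified light matrix of Eq. (4); l is a nearby position or a scaled direction."""
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project(L, c):
    s_h = L @ np.append(c, 1.0)
    return s_h[:2] / s_h[2]

c = np.array([0.1, 0.05, 0.15])       # caster above the plane z = 0 (hypothetical)
d = np.array([0.3, -0.2, 1.0])
d = d / np.linalg.norm(d)             # light direction for the distant model

# Distant-light shadow from the derivation: s = (c_xy * l_z - c_z * l_xy) / l_z
s_distant = (c[:2] * d[2] - c[2] * d[:2]) / d[2]

# A nearby light moved further and further away along d converges to the distant case.
for dist in [1.0, 10.0, 1000.0]:
    s = project(light_matrix_unified(dist * d), c)
    print(f"distance {dist:7.1f}: shadow {s}, deviation {np.linalg.norm(s - s_distant):.2e}")
```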

3.2 Epipolar Geometry for Shadow Correspondences

We will now look at a scene with a moving light source, static shadow casters, and a static shadow receiver plane. We analyze shadow correspondences, i.e., the relation between shadows belonging to the same caster but different light positions. As we saw, shadow projection is a special case of projective geometry. Thus, shadow correspondences in a static scene with a moving point light should follow the same principles as image point correspondences in pinhole camera projection with a static scene and a moving camera: epipolar geometry.

Fig. 5 A plane has a shadow caster above it at a fixed but unknown position. Given shadow \(\mathbf {s}\) cast by light \(\mathbf {l}_1\), the caster causing this shadow can be anywhere on the line \(\overline{\mathbf {l}_1 \mathbf {s}}\), yielding an infinite series of caster position hypotheses. When the light moves to position \(\mathbf {l}_2\), this series results in a line of shadows, the equivalent of the epipolar line in camera geometry

3.2.1 Fundamental Shadow Matrix

Figure 5 shows a shadow receiver plane \(\Pi \) and two point lights at positions \(\mathbf {l}_1\) and \(\mathbf {l}_2\). A shadow \(\mathbf {s}\), cast by light \(\mathbf {l}_1\) from a shadow caster at an unknown position, has its corresponding shadow \(\mathbf {s}'\) cast by light \(\mathbf {l}_2\) on an epipolar line arising from a fundamental matrix, which we will now derive analogously to the derivation of the standard fundamental matrix.

Let \(\mathbf {c}\) be the caster position, \(\mathbf {l}_1\ne \mathbf {l}_2\) be the light positions in calibration target coordinates, \(\mathbf {L}_i\) be \(\mathbf {l}_i\)’s light matrix, \(\mathbf {L}_1^{+}=\mathbf {L}_1^\top (\mathbf {L}_1\mathbf {L}_1^\top )^{-1}\) be \(\mathbf {L}_1\)’s pseudo inverse, \( {\varvec{\emptyset }}_{\mathbf {L}_1}\in \textsf {null}(\mathbf {L}_1)\) be a non-zero vector in \(\mathbf {L}_1\)’s one-dimensional null space, and \(\eta \) be a scalar. We then have

$$\begin{aligned} \lambda _1\tilde{\mathbf {s}}_1&= \mathbf {L}_1\tilde{\mathbf {c}}\Rightarrow \tilde{\mathbf {c}} = \lambda _1\mathbf {L}_1^{+}\tilde{\mathbf {s}}_1 + \eta \,{\varvec{\emptyset }}_{\mathbf {L}_1}\\ \lambda _2\tilde{\mathbf {s}}_2&= \mathbf {L}_2\tilde{\mathbf {c}}= \lambda _1\mathbf {L}_2\mathbf {L}_1^{+}\tilde{\mathbf {s}}_1 + \eta \,\mathbf {L}_2\,{\varvec{\emptyset }}_{\mathbf {L}_1}. \end{aligned}$$

Multiplying \(\tilde{\mathbf {s}}_2^\top \left[ \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \) (with \(\left[ \cdot \right] _\times \) being the cross-product’s matrix form) from the left, we obtain

$$\begin{aligned} \lambda _2\underbrace{\tilde{\mathbf {s}}_2^\top \left[ \mathbf {L}_2 {\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \tilde{\mathbf {s}}_2}_{0}&= \lambda _1\tilde{\mathbf {s}}_2^\top \left[ \mathbf {L}_2 {\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {L}_2\mathbf {L}_1^{+}\tilde{\mathbf {s}}_1 \nonumber \\&\quad +\eta \,\tilde{\mathbf {s}}_2^\top \underbrace{\left[ \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}}_{\mathbf {0}}\nonumber \\ \Rightarrow \quad 0&= \tilde{\mathbf {s}}_2^\top \underbrace{\left[ \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {L}_2\mathbf {L}_1^{+}}_{\mathbf {F}} \tilde{\mathbf {s}}_1. \end{aligned}$$
(5)

Thus, corresponding shadows \(\tilde{\mathbf {s}}_1\) and \(\tilde{\mathbf {s}}_2\) fulfill a condition with some fundamental matrix \(\mathbf {F}\) that is directly analogous to the regular correspondence condition. For \(\mathbf {L}_{i}=\left[ {\begin{matrix} -1 &{} 0 &{} {l_x^{(i)}}/{l_z^{(i)}} &{} 0\\ 0 &{} -1 &{} {l_y^{(i)}}/{l_z^{(i)}} &{} 0\\ 0 &{} 0 &{} 1/{l_z^{(i)}} &{} -1 \end{matrix}}\right] \), we obtain the null space vector \({\varvec{\emptyset }}_{\mathbf {L}_1} \propto \left[ {l_x^{(1)}}, {l_y^{(1)}}, {l_z^{(1)}}, 1\right] ^\top \) and finally the fundamental shadow matrix

$$\begin{aligned} \mathbf {F} =&\left[ \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {L}_2\mathbf {L}_1^{+}\nonumber \\&\quad \propto \frac{1}{{l_z^{(2)}}} \left[ {\begin{matrix}0 &{} -{l_z^{(1)}} + {l_z^{(2)}} &{} -{l_y^{(1)}}{l_z^{(2)}}+{l_y^{(2)}}{l_z^{(1)}}\\ {l_z^{(1)}}-{l_z^{(2)}}&{}0&{}{l_x^{(1)}}{l_z^{(2)}}-{l_x^{(2)}}{l_z^{(1)}}\\ {l_y^{(1)}}{l_z^{(2)}}-{l_y^{(2)}}{l_z^{(1)}}&{}-{l_x^{(1)}}{l_z^{(2)}}+{l_x^{(2)}}{l_z^{(1)}}&{}0\end{matrix}}\right] \nonumber \\ =&\begin{bmatrix}0&{}f_1&{}f_2\\ -f_1&{}0&{}f_3\\ -f_2&{}-f_3&{}0\end{bmatrix}. \end{aligned}$$
(6)

Interestingly, this matrix is a special case of regular fundamental matrices: It is skew-symmetric. For corresponding shadows \(\tilde{\mathbf {s}}_1 = [u, v, 1]^\top \) and \(\tilde{\mathbf {s}}_2 = [u', v', 1]^\top \) the correspondence condition Eq. (5) becomes

$$\begin{aligned} 0&=\begin{bmatrix}u&v&1\end{bmatrix}\begin{bmatrix} 0&{}f_1&{}f_2\\ -f_1&{}0&{}f_3\\ -f_2&{}-f_3&{} 0\end{bmatrix}\begin{bmatrix}u'\\ v'\\ 1\end{bmatrix}\nonumber \\&=(uv'-vu')f_1+(u-u')f_2+(v-v')f_3 \end{aligned}$$
(7)

We can thus estimate the parameters \(f_1\), \(f_2\), and \(f_3\) of our fundamental shadow matrix up to scale by solving the homogeneous linear system

$$\begin{aligned} \begin{bmatrix} u_1v_1'-v_1u_1' &{} u_1-u_1' &{} v_1-v_1'\\ \vdots &{} \vdots &{} \vdots \\ u_nv_n'-v_nu_n' &{} u_n-u_n' &{} v_n-v_n' \end{bmatrix} \begin{bmatrix} f_1\\ f_2\\ f_3 \end{bmatrix} =\mathbf {0}_{n\times 1} \end{aligned}$$
(8)

with \(n\ge 2\) correspondences using singular value decomposition. This is actually equivalent to estimating regular essential matrices for cameras that only undergo pure translation and no relative rotation (Szeliski 2010, Eq. 7.27). This makes sense since we saw in Eq. (2) that point lights act like cameras with an identity rotation. As a consequence, all properties discussed in the following also hold for essential matrix estimation of cameras with pure translation.

Conveniently, the fundamental shadow matrix has a rank of 2 as a direct result of estimating the parameters of a skew-symmetric matrix and the rank does not need to be enforced in a post-processing step. Further, the matrix can be estimated up to scale from 2 point correspondences. Moreover, in contrast to regular fundamental/essential matrix estimation, fundamental shadow matrix estimation does not suffer from the most common degenerate scene point configuration: all 3D scene points lying in a plane. We will show this now.
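A minimal numpy sketch (synthetic lights and casters with hypothetical values) of both steps: computing \(\mathbf {F}\) directly from the two light matrices as in Eq. (5), and estimating its 3 parameters from 2 shadow correspondences with the linear system of Eq. (8).

```python
import numpy as np

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project(L, c):
    s_h = L @ np.append(c, 1.0)
    return s_h[:2] / s_h[2]

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

rng = np.random.default_rng(0)
l1, l2 = np.array([0.4, -0.3, 2.0]), np.array([-0.5, 0.2, 1.5])   # two light positions
casters = rng.uniform([-0.2, -0.2, 0.05], [0.2, 0.2, 0.25], size=(4, 3))

L1, L2 = light_matrix(l1), light_matrix(l2)
S1 = np.array([project(L1, c) for c in casters])   # shadows under light 1
S2 = np.array([project(L2, c) for c in casters])   # shadows under light 2

# Analytic fundamental shadow matrix, Eq. (5); the null vector of L1 is [l1, 1]
F = skew(L2 @ np.append(l1, 1.0)) @ L2 @ np.linalg.pinv(L1)
assert np.allclose(F + F.T, 0, atol=1e-9)          # skew-symmetric, as Eq. (6) states

# Linear estimation of (f1, f2, f3) from 2 correspondences, Eq. (8)
A = np.stack([S1[:2, 0] * S2[:2, 1] - S1[:2, 1] * S2[:2, 0],
              S1[:2, 0] - S2[:2, 0],
              S1[:2, 1] - S2[:2, 1]], axis=1)
f = np.linalg.svd(A)[2][-1]                        # right singular vector of smallest value
F_est = np.array([[0, f[0], f[1]], [-f[0], 0, f[2]], [-f[1], -f[2], 0]])

for s1, s2 in zip(S1, S2):                         # all correspondences satisfy Eq. (7)
    assert abs(np.append(s1, 1) @ F_est @ np.append(s2, 1)) < 1e-9
f_true = np.array([F[0, 1], F[0, 2], F[1, 2]])
print("estimated F parallel to analytic F:", np.allclose(np.cross(f_true, f), 0, atol=1e-8))
```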

Planar Degeneracy A degeneracy in fundamental matrix estimation means that a correct fundamental matrix \(\mathbf {F}\) exists and could be computed from the projection matrices, but the 3D scene points are in a configuration such that we can find an \(\mathbf {F}'\ne \mathbf {F}\) that also fulfills the correspondence condition \(\tilde{\mathbf {s}}_2^\top \mathbf {F}'\tilde{\mathbf {s}}_1=0\).

In regular SfM, a well-known degeneracy is coplanar scene points. In this case, their projections in both views are related by a homography, \(\mathbf {x}_i'=\mathbf {H}\mathbf {x}_i\). Since \(\mathbf {x}_i^{\prime \top }\mathbf {S}\mathbf {x}_i'=0\) holds for any skew-symmetric \(\mathbf {S}\), any matrix \(\mathbf {F}'=\mathbf {S}\mathbf {H}\) fulfills the correspondence condition \(\mathbf {x}_i^{\prime \top }\mathbf {F}'\mathbf {x}_i=\mathbf {x}_i^{\prime \top }\mathbf {S}\mathbf {H}\mathbf {x}_i=\mathbf {x}_i^{\prime \top }\mathbf {S}\mathbf {x}_i'=0\) and is thus a valid solution (Hartley and Zisserman 2004, Sec. 11.9.2). The planar degeneracy is relevant in practice because a camera may have only captured scene points from a wall, floor, or table surface.

We will now show that fundamental shadow matrices have no general planar degeneracy, i.e., a degeneracy from coplanar casters independent of their configuration towards the lights and image plane: fundamental shadow matrices are skew-symmetric, are thus essential matrices (Hartley and Zisserman 2004, Result 9.17) and can have at most the degeneracies of essential matrices. Further, since essential matrices have the form \([\mathbf {t}]_\times \mathbf {R}\), our skew-symmetric shadow matrices are a special case of essential matrices where we have \(\mathbf {R}=\mathbf {I}\). From Negahdaripour (1990) we know that when observing the projection of a 3D plane, there are 2 sets of translation, rotation, and 3D plane coordinates that satisfy the observations. Let \(\mathbf {R}\) be the true and \(\mathbf {R}'\) be the alternative rotation. Negahdaripour’s lemma (Negahdaripour 1990, p. 5) states that

$$\begin{aligned} \begin{aligned} \mathbf {R}'&= \mathbf {V} \mathbf {R} \text { with}\\ \mathbf {V}&= (1-\cos \bar{\theta })\,\bar{\mathbf {n}}\bar{\mathbf {n}}^\top + \sin \bar{\theta } \,\mathbf {N} +\cos \bar{\theta } \,\mathbf {I}, \end{aligned} \end{aligned}$$
(9)

where \(\mathbf {N}\) is skew-symmetric and \(\bar{\mathbf {n}}\) is a unit vector. Here the precise meanings of \(\bar{\mathbf {n}}\), \(\bar{\theta }\), and \(\mathbf {N}\) do not matter, only the form of Eq. (9) does. Recall that the rotations of fundamental shadow matrices are identities. Inserting \(\mathbf {R}'=\mathbf {R}=\mathbf {I}\) into Eq. (9) yields

$$\begin{aligned} \mathbf {I}&= \mathbf {V}\mathbf {I}=(1-\cos \bar{\theta })\,\bar{\mathbf {n}} \bar{\mathbf {n}}^\top + \sin \bar{\theta }\, \mathbf {N} +\cos \bar{\theta }\, \mathbf {I}\nonumber \\ \Rightarrow \;\mathbf {I}&=\bar{\mathbf {n}}\bar{\mathbf {n}}^\top + \frac{\sin \bar{\theta }}{1-\cos \bar{\theta }} \mathbf {N}. \end{aligned}$$
(10)

As \(\left| \bar{\mathbf {n}}\right| =1\), \(\bar{\mathbf {n}}\bar{\mathbf {n}}^\top \) has at most one diagonal element that is 1. As \(\mathbf {N}\) is skew-symmetric, the diagonal elements of \(\frac{\sin \bar{\theta }}{1-\cos \bar{\theta }} \mathbf {N}\) are all 0. Thus, Eq. (10) cannot be true, implying that constraining the essential matrix to a skew-symmetric matrix removes the planar degeneracy.

In our application, not having a general planar degeneracy is important as we work with approximately coplanar shadow casters (the red pin heads in Fig. 2). Further, it means we can estimate \(\mathbf {F}\) with 3 casters even though 3 casters are necessarily coplanar.
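To illustrate the absence of a planar degeneracy, the following small sketch (hypothetical values) places all casters in a common plane parallel to the receiver and still recovers the correct fundamental shadow matrix with the linear system of Eq. (8).

```python
import numpy as np

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project(L, c):
    s_h = L @ np.append(c, 1.0)
    return s_h[:2] / s_h[2]

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

l1, l2 = np.array([0.4, -0.3, 2.0]), np.array([-0.5, 0.2, 1.5])
L1, L2 = light_matrix(l1), light_matrix(l2)
F_true = skew(L2 @ np.append(l1, 1.0)) @ L2 @ np.linalg.pinv(L1)   # Eq. (5)

# Three coplanar casters (all at height z = 0.15 above the receiver plane)
casters = np.array([[0.1, 0.0, 0.15], [-0.05, 0.12, 0.15], [0.0, -0.1, 0.15]])
S1 = np.array([project(L1, c) for c in casters])
S2 = np.array([project(L2, c) for c in casters])

# Linear estimation, Eq. (8), despite the coplanar caster configuration
A = np.stack([S1[:, 0] * S2[:, 1] - S1[:, 1] * S2[:, 0],
              S1[:, 0] - S2[:, 0], S1[:, 1] - S2[:, 1]], axis=1)
f = np.linalg.svd(A)[2][-1]
f_true = np.array([F_true[0, 1], F_true[0, 2], F_true[1, 2]])
print("coplanar casters still yield the true F up to scale:",
      np.allclose(np.cross(f_true, f), 0, atol=1e-8))
```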

Apart from the lack of a general planar degeneracy, there do exist specialized ones. Most of them are fairly obvious and rather irrelevant in practice, e.g., casters lying on the image plane (their shadows thus always keeping their positions) or all lights and casters being collinear (all shadows thus being projected to the same point). Beside these, the only degeneracy we could find is all casters and both lights being coplanar, in which case all shadows are projected to the line where the image plane and the casters-and-lights plane intersect. However, this configuration is unlikely in reality. To make it non-degenerate, it suffices that 1 caster or 1 light is not in the casters-and-lights plane.

We further found empirically that the estimation of \(\mathbf {F}\) from 2 casters, which are necessarily collinear, is not generally degenerate. Configurations in which both lights and at least one caster are collinear, or both casters and at least one light are collinear, are degenerate. And again, both casters and both lights being coplanar is degenerate, but it suffices that one light or one caster lies outside the casters-and-lights plane to make the configuration non-degenerate.

Hartley Normalization Hartley (1997) proposed normalizing the points \([u_i,v_i]^\top \) and \([u'_i,v'_i]^\top \) to zero mean and unit variance in x- and y-directions to increase the algorithm’s numerical stability in applications with large x and y image coordinates. In regular Hartley normalization, it is admissible to normalize the points of the left and right images independently (with scaling parameters \(s_x\), \(s_y\), \(s'_x\), \(s'_y\) and shifting parameters \(d_x\), \(d_y\), \(d'_x\), \(d'_y\)) to obtain the normalized point coordinates \((u_{\text {n}}, v_{\text {n}})\) and \((u'_{\text {n}}, v'_{\text {n}})\):

$$\begin{aligned}&\left[ \begin{array}{c} u_\text {n}\\ v_\text {n}\\ 1\end{array}\right] = \left[ \begin{array}{ccc} \frac{1}{s_x} &{} 0 &{} -\frac{d_x}{s_x}\\ 0 &{} \frac{1}{s_y} &{} -\frac{d_y}{s_y}\\ 0 &{} 0 &{} 1\end{array}\right] \begin{bmatrix}u\\ v\\ 1\end{bmatrix}\;\text { and }\\&\left[ \begin{array}{c} u'_\text {n}\\ v'_\text {n}\\ 1\end{array}\right] = \left[ \begin{array}{ccc} \frac{1}{s'_x} &{} 0 &{} -\frac{d'_x}{s'_x}\\ 0 &{} \frac{1}{s'_y} &{} -\frac{d'_y}{s'_y}\\ 0 &{} 0 &{} 1\end{array} \right] \begin{bmatrix}u'\\ v'\\ 1\end{bmatrix}. \end{aligned}$$

For fundamental shadow matrices, we need to be more cautious because we want to maintain their special structure. Normalizing both images identically, i.e.,

$$\begin{aligned} \begin{bmatrix}u_\text {n}\\ v_\text {n}\\ 1\end{bmatrix} = \mathbf {N}\begin{bmatrix}u\\ v\\ 1\end{bmatrix}\text { and }\begin{bmatrix} u'_\text {n}\\ v'_\text {n}\\ 1\end{bmatrix} = \mathbf {N} \begin{bmatrix}u'\\ v'\\ 1\end{bmatrix}\text { with }\;\mathbf {N}=\left[ \begin{array}{ccc} \frac{1}{s_x} &{} 0 &{} -\frac{d_x}{s_x}\\ 0 &{} \frac{1}{s_y} &{} -\frac{d_y}{s_y}\\ 0 &{} 0 &{} 1\end{array}\right] \end{aligned}$$

preserves \(\mathbf {F}\)’s skew-symmetry:

$$\begin{aligned}&\mathbf {N}^\top \begin{bmatrix}0&{}f_1&{}f_2\\ -f_1&{}0&{}f_3\\ -f_2&{}-f_3&{}0 \end{bmatrix}\mathbf {N}\\&\quad =\frac{1}{s_xs_y} \begin{bmatrix}0&{}f_1&{}-f_1d_y+f_2s_y\\ -f_1&{}0&{}f_1d_x+f_3s_x\\ f_1d_y-f_2s_y&{}-f_1d_x-f_3s_x&{}0\end{bmatrix}. \end{aligned}$$

Using different matrices for the left and right image would result in a non-skew-symmetric matrix which would require more than 2 point correspondences for estimation.

Analogous to regular Hartley normalization, from the matrix \(\mathbf {F}_\text {n}\) for normalized points, we can compute the matrix \(\mathbf {F}\) for unnormalized points as

$$\begin{aligned} \mathbf {F} = \mathbf {N}^\top \mathbf {F}_\text {n}\,\mathbf {N}. \end{aligned}$$
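A short numerical check (with hypothetical normalization parameters) of the structural argument above: applying one common normalization \(\mathbf {N}\) on both sides keeps the matrix skew-symmetric, whereas two different normalization matrices destroy this property.

```python
import numpy as np

rng = np.random.default_rng(1)
f1, f2, f3 = rng.standard_normal(3)
F_n = np.array([[0, f1, f2], [-f1, 0, f3], [-f2, -f3, 0]])   # skew-symmetric as in Eq. (6)

def norm_matrix(sx, sy, dx, dy):
    """Hartley normalization matrix with scaling (sx, sy) and shift (dx, dy)."""
    return np.array([[1 / sx, 0, -dx / sx], [0, 1 / sy, -dy / sy], [0, 0, 1]])

N1 = norm_matrix(2.0, 3.0, 0.5, -0.2)    # one common normalization for both images
N2 = norm_matrix(1.5, 2.5, 0.1, 0.3)     # a different normalization

F_same = N1.T @ F_n @ N1                  # identical N on both sides
F_diff = N1.T @ F_n @ N2                  # different matrices for left and right image

print("identical N keeps skew-symmetry:   ", np.allclose(F_same + F_same.T, 0))
print("different N, N' keeps skew-symmetry:", np.allclose(F_diff + F_diff.T, 0))
```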

3.2.2 Trifocal Shadow Tensor

For shadow correspondences in 3 images, we have a trifocal shadow tensor (“tritensor” or “shadow tritensor” in the following). The tritensor’s general form for 3 general projection matrices \(\mathbf {P}_1\), \(\mathbf {P}_2\), \(\mathbf {P}_3\) is (Hartley and Zisserman 2004, Eq. (17.12))

$$\begin{aligned} (\mathcal {T})_{p,q,r}=(-1)^{p+1}\text {det}\left[ \begin{array}{c} (\mathbf {P}_1)_{1:p-1, :}\\ (\mathbf {P}_1)_{p+1:n, :}\\ (\mathbf {P}_2)_{q, :}\\ (\mathbf {P}_3)_{r, :} \end{array}\right] \end{aligned}$$
(11)

(recall that the \((\cdot )_{i,j}\) notation extracts parts of a matrix). Inserting 3 light matrices \(\mathbf {L}_{i}=\left[ {\begin{matrix}-1 &{} 0 &{} l_{x}^{(i)} / l_{z}^{(i)} &{} 0\\ 0 &{} -1 &{} l_{y}^{(i)} / l_{z}^{(i)} &{} 0\\ 0 &{} 0 &{} 1 / l_{z}^{(i)} &{} -1\end{matrix}}\right] \) for \(\mathbf {P}_1\), \(\mathbf {P}_2\), and \(\mathbf {P}_3\) yields

$$\begin{aligned} \mathcal {T}= & {} \frac{1}{l_z^{(1)}l_z^{(2)}l_z^{(3)}}\,\cdot \\&\left[ \left[ {\begin{matrix}l^{(1)}_{z} \left( l^{(2)}_{x} l^{(3)}_{z} - l^{(3)}_{x} l^{(2)}_{z}\right) &{} l^{(2)}_{z} \left( l^{(1)}_{y} l^{(3)}_{z} - l^{(3)}_{y} l^{(1)}_{z}\right) &{} l^{(2)}_{z} \left( l^{(3)}_{z} - l^{(1)}_{z}\right) \\ l^{(3)}_{z} \left( l^{(2)}_{y} l^{(1)}_{z} - l^{(1)}_{y} l^{(2)}_{z}\right) &{} 0 &{} 0\\ l^{(3)}_{z} \left( l^{(1)}_{z} - l^{(2)}_{z}\right) &{} 0 &{} 0\end{matrix}}\right] \right. \\&\left[ {\begin{matrix}0 &{} l^{(3)}_{z} \left( l^{(2)}_{x} l^{(1)}_{z} - l^{(1)}_{x} l^{(2)}_{z}\right) &{} 0\\ l^{(2)}_{z} \left( l^{(1)}_{x} l^{(3)}_{z} - l^{(3)}_{x} l^{(1)}_{z}\right) &{} l^{(1)}_{z} \left( l^{(2)}_{y} l^{(3)}_{z} - l^{(3)}_{y} l^{(2)}_{z}\right) &{} l^{(2)}_{z} \left( l^{(3)}_{z} - l^{(1)}_{z}\right) \\ 0 &{} l^{(3)}_{z} \left( l^{(1)}_{z} - l^{(2)}_{z}\right) &{} 0\end{matrix}}\right] \\&\left. \left[ {\begin{matrix}0 &{} 0 &{} l^{(3)}_{z} \left( l^{(2)}_{x} l^{(1)}_{z} - l^{(1)}_{x} l^{(2)}_{z}\right) \\ 0 &{} 0 &{} l^{(3)}_{z} \left( l^{(2)}_{y} l^{(1)}_{z} - l^{(1)}_{y} l^{(2)}_{z}\right) \\ l^{(2)}_{z} \left( l^{(1)}_{x} l^{(3)}_{z} - l^{(3)}_{x} l^{(1)}_{z}\right) &{} l^{(2)}_{z} \left( l^{(1)}_{y} l^{(3)}_{z} - l^{(3)}_{y} l^{(1)}_{z}\right) &{} l^{(1)}_{z} \left( l^{(3)}_{z} - l^{(2)}_{z}\right) \end{matrix}}\right] \right] . \end{aligned}$$

Thus, similar to the fundamental shadow matrix, the shadow tritensor also has a special structure as

$$\begin{aligned} \mathcal {T} = \left[ \mathbf {T}_1,\mathbf {T}_2,\mathbf {T}_3\right] =\left[ \left[ \begin{array}{ccc} t_{1} &{} t_{2} &{} t_{3}\\ t_{4} &{} 0 &{} 0\\ t_{5} &{} 0 &{} 0\end{array}\right] \left[ \begin{array}{ccc} 0 &{} t_{6} &{} 0\\ t_{7} &{} t_{8} &{} t_{3}\\ 0 &{} t_{5} &{} 0 \end{array}\right] \left[ \begin{array}{ccc} 0 &{} 0 &{} t_{6}\\ 0 &{} 0 &{} t_{4}\\ t_{7} &{} t_{2} &{} t_{9} \end{array}\right] \right] . \end{aligned}$$
(12)

It can further be verified that \(\mathbf {T}_1\), \(\mathbf {T}_2\), and \(\mathbf {T}_3\) always have rank 2 each, as required (Hartley and Zisserman 2004, p. 373).

For the shadow tritensor, the correspondence condition for shadows \(\tilde{\mathbf {s}}_1=[u, v, 1]^\top \), \(\tilde{\mathbf {s}}_2=[u', v', 1]^\top \), and \(\tilde{\mathbf {s}}_3=[u'', v'', 1]^\top \) is (Hartley and Zisserman 2004, Eq. (15.7)):

$$\begin{aligned} {[}\tilde{\mathbf {s}}_2]_\times \Big (\sum \nolimits _{i\in \{1, 2, 3\}}(\tilde{\mathbf {s}}_1)_{i}\,\mathbf {T}_i\Big )[\tilde{\mathbf {s}}_3]_\times = \mathbf {0}_{3\times 3}, \end{aligned}$$

which can be reformulated into the system shown in Eq. (13). Its row echelon form, which has the same solution as the original system, is shown in Eq. (14), revealing the system’s rank of 4. We thus need at least 2 shadow correspondences and stack their matrices to estimate the 9 unknowns of the shadow tritensor up to scale. In contrast, general tritensor estimation requires at least 7 correspondences for the linear algorithm (Hartley and Zisserman 2004, Algorithm 16.1).

$$\begin{aligned}&\left[ \begin{array}{ccccccccc} 0 &{} v' &{} v v'' &{} v'' &{} v v' &{} 0 &{} 0 &{} - v &{} - v' v''\\ 0 &{} 0 &{} - u'' v &{} u - u'' &{} - u v' &{} 0 &{} v - v' &{} 0 &{} u'' v'\\ 0 &{} - u'' v' &{} 0 &{} - u v'' &{} v' \left( u v'' - u'' v\right) &{} 0 &{} v'' \left( v' - v\right) &{} u'' v &{} 0\\ 0 &{} u - u' &{} - u v'' &{} 0 &{} - u' v &{} v - v'' &{} 0 &{} 0 &{} u' v''\\ - u &{} 0 &{} u u'' &{} 0 &{} u u' &{} u'' &{} u' &{} 0 &{} - u' u''\\ u v'' &{} u'' \left( u' - u\right) &{} 0 &{} 0 &{} u' \left( u'' v - u v''\right) &{} - u'' v &{} - u' v'' &{} 0 &{} 0\\ 0 &{} - u v' &{} v'' \left( u v' - u' v\right) &{} - u' v'' &{} 0 &{} v' v'' - v v' &{} 0 &{} u' v &{} 0\\ u v' &{} 0 &{} u'' \left( u' v - u v'\right) &{} u' \left( u'' - u\right) &{} 0 &{} - u'' v' &{} - u' v &{} 0 &{} 0\\ - u v' v'' &{} u u'' v' &{} 0 &{} u u' v'' &{} 0 &{} u'' v v' &{} u' v v'' &{} - u' u'' v &{}0 \end{array}\right] \left[ \begin{array}{c} t_{1}\\ t_{2}\\ t_{3}\\ t_{4}\\ t_{5}\\ t_{6}\\ t_{7}\\ t_{8}\\ t_{9}\end{array}\right] =\mathbf {0}_{9\times 1} \end{aligned}$$
(13)
$$\begin{aligned}&\left[ \begin{matrix} - u &{} 0 &{} u u'' &{} 0 &{} u u' \\ 0 &{} - u'' v' &{} 0 &{} - u v'' &{} v' \left( u v'' - u'' v\right) \\ 0 &{} 0 &{} - u'' v &{} u - u'' &{} - u v' \\ 0 &{} 0 &{} 0 &{} - u u'' v'' \left( v \left( u - u'\right) + v' \left( u - u''\right) \right) &{} u u'' v' \left( u v v'' + u v' v'' - u' v v'' - u'' v^{2}\right) \\ \end{matrix}\right. \nonumber \\&\left. \begin{matrix} u'' &{} u' &{} 0 &{} - u' u''\\ 0 &{} v'' \left( - v + v'\right) &{} u'' v &{} 0\\ 0&{} v - v' &{} 0 &{} u'' v'\\ u''^{2} v v' \left( v - v''\right) &{} - u'' v'' \left( v - v'\right) \left( u v' + v \left( u - u'\right) \right) &{} u''^{2} v^{2} \left( u - u'\right) &{} u''^{2} v' v'' \left( - u v' + u' v\right) \\ \end{matrix}\right] \left[ \begin{array}{c} t_{1}\\ t_{2}\\ t_{3}\\ t_{4}\\ t_{5}\\ t_{6}\\ t_{7}\\ t_{8}\\ t_{9} \end{array} \right] =\mathbf {0}_{4\times 1} \end{aligned}$$
(14)
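The following sketch (synthetic lights and one caster with hypothetical values) constructs the shadow tritensor directly from the determinant formula of Eq. (11), checks one of the zero blocks predicted by Eq. (12), and verifies the correspondence condition stated above.

```python
import numpy as np

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project_h(L, c):
    s = L @ np.append(c, 1.0)
    return s / s[2]                        # homogeneous shadow with last entry 1

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def trifocal(P1, P2, P3):
    """Trifocal tensor from the determinant formula of Eq. (11)."""
    T = np.zeros((3, 3, 3))
    for p in range(3):
        rest_P1 = np.delete(P1, p, axis=0)                   # P1 without row p
        for q in range(3):
            for r in range(3):
                M = np.vstack([rest_P1, P2[q], P3[r]])
                T[p, q, r] = (-1.0) ** p * np.linalg.det(M)  # (-1)^(p+1) for 1-based p
    return T

lights = [np.array([0.4, -0.3, 2.0]), np.array([-0.5, 0.2, 1.5]), np.array([0.1, 0.6, 2.5])]
L = [light_matrix(l) for l in lights]
T = trifocal(*L)

# Zero pattern of Eq. (12): the lower-right 2x2 block of T_1 vanishes
assert np.allclose(T[0, 1:, 1:], 0, atol=1e-12)

# Correspondence condition [s2]_x (sum_i (s1)_i T_i) [s3]_x = 0 for one caster's shadows
c = np.array([0.12, -0.07, 0.2])
s1, s2, s3 = (project_h(Li, c) for Li in L)
M = sum(s1[i] * T[i] for i in range(3))
assert np.allclose(skew(s2) @ M @ skew(s3), 0, atol=1e-9)
print("trifocal shadow tensor and correspondence condition verified")
```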

Hartley Normalization When normalizing input points for better numerical stability, we again need to normalize all images identically to maintain the tritensor’s special structure [Eq. (12)] and keep its parameters to 9:

$$\begin{aligned} \begin{bmatrix}u_\text {n}\\ v_\text {n}\\ 1\end{bmatrix}= & {} \mathbf {N}\begin{bmatrix}u\\ v\\ 1\end{bmatrix},\,\,\,\begin{bmatrix}u'_\text {n}\\ v'_\text {n}\\ 1\end{bmatrix} = \mathbf {N}\begin{bmatrix}u'\\ v'\\ 1\end{bmatrix},\,\,\, \text { and }\begin{bmatrix}u''_\text {n}\\ v''_\text {n}\\ 1\end{bmatrix} = \mathbf {N}\begin{bmatrix}u''\\ v''\\ 1\end{bmatrix}\\ \text {with }\mathbf {N}= & {} \begin{bmatrix}\frac{1}{s_x} &{} 0 &{} -\frac{d_x}{s_x}\\ 0 &{} \frac{1}{s_y} &{} -\frac{d_y}{s_y}\\ 0 &{} 0 &{} 1\end{bmatrix}. \end{aligned}$$

3.2.3 Quadrifocal Shadow Tensor

The quadrifocal shadow tensor \(\mathcal {Q}\) (or “shadow quadtensor” or “quadtensor” in the following) describes 4-view shadow correspondences. The quadtensor’s form for 4 general projection matrices \(\mathbf {P}_i\) is (Hartley and Zisserman 2004, Eq. (17.21))

$$\begin{aligned} (\mathcal {Q})_{p,q,r,s}=\text {det}\left[ \begin{array}{c} (\mathbf {P}_1)_{p, :}\\ (\mathbf {P}_2)_{q, :}\\ (\mathbf {P}_3)_{r, :}\\ (\mathbf {P}_4)_{s, :}\\ \end{array}\right] . \end{aligned}$$
(15)

Inserting light matrices for the projections yields \(\mathcal {Q}\) as shown in the left of Fig. 16 in the “Appendix”. Its exact content is less relevant here but we note that it has the following shape with 18 unknowns:

$$\begin{aligned} \mathcal {Q}=\left[ \begin{array}{ccc} \left[ {\begin{matrix}0 &{} 0 &{} 0\\ 0 &{} 0 &{} - q_{1}\\ 0 &{} q_{1} &{} 0\end{matrix}}\right] &{} \left[ {\begin{matrix}0 &{} 0 &{} q_{2}\\ 0 &{} 0 &{} q_{3}\\ q_{4} &{} q_{5} &{} q_{6}\end{matrix}}\right] &{} \left[ {\begin{matrix}0 &{} - q_{2} &{} 0\\ - q_{4} &{} q_{7} &{} q_{8}\\ 0 &{} q_{9} &{} 0\end{matrix}}\right] \\ \left[ {\begin{matrix}0 &{} 0 &{} q_{10}\\ 0 &{} 0 &{} q_{11}\\ q_{12} &{} q_{13} &{} - q_{6}\end{matrix}}\right] &{} \left[ {\begin{matrix}0 &{} 0 &{} - q_{14}\\ 0 &{} 0 &{} 0\\ q_{14} &{} 0 &{} 0\end{matrix}}\right] &{} \left[ {\begin{matrix}q_{15} &{} - q_{13} &{} - q_{8}\\ - q_{11} &{} 0 &{} 0\\ - q_{9} &{} 0 &{} 0\end{matrix}}\right] \\ \left[ {\begin{matrix}0 &{} - q_{10} &{} 0\\ - q_{12} &{} - q_{7} &{} - q_{16}\\ 0 &{} q_{17} &{} 0\end{matrix}}\right] &{} \left[ {\begin{matrix}- q_{15} &{} - q_{5} &{} q_{16}\\ - q_{3} &{} 0 &{} 0\\ - q_{17} &{} 0 &{} 0\end{matrix}}\right] &{} \left[ {\begin{matrix}0 &{} - q_{18} &{} 0\\ q_{18} &{} 0 &{} 0\\ 0 &{} 0 &{} 0\end{matrix}}\right] \end{array}\right] \end{aligned}$$
(16)

The correspondence conditions (Hartley and Zisserman 2004, Eq. (17.20)) for 4 shadows \(\tilde{\mathbf {s}}_1=[u, v, 1]^\top \), \(\tilde{\mathbf {s}}_2=[u', v', 1]^\top \), \(\tilde{\mathbf {s}}_3=[u'', v'', 1]^\top \), \(\tilde{\mathbf {s}}_4=[u''', v''', 1]^\top \) are (with the Levi-Civita symbol \(\varepsilon \))

$$\begin{aligned} \sum _{\begin{array}{c} i,j,k,l,\\ p,q,r,s \end{array}}(\tilde{\mathbf {s}}_1)_{i}(\tilde{\mathbf {s}}_2)_{j}(\tilde{\mathbf {s}}_3)_{k}(\tilde{\mathbf {s}}_4)_{l}\,\varepsilon _{ipw}\varepsilon _{jqx}\varepsilon _{kry}\varepsilon _{lsz}(\mathcal {Q})_{p,q,r,s} = 0. \end{aligned}$$

Iterating over the free variables w, x, y, z yields 81 equations of which 3 are zero independent of the data. The remaining 78 put in a homogeneous system are shown in Fig. 16, right, in the “Appendix”. The system has rank 13 and stacking observations from \(\ge 2\) correspondences suffices to estimate the 18 unknowns up to scale.
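As with the tritensor, the quadtensor and its correspondence conditions can be checked numerically. The sketch below (hypothetical values) builds \(\mathcal {Q}\) from the determinant formula of Eq. (15) and evaluates all 81 quadrilinear equations with an explicit Levi-Civita symbol.

```python
import numpy as np

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project_h(L, c):
    s = L @ np.append(c, 1.0)
    return s / s[2]

# Levi-Civita symbol
eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k], eps[i, k, j] = 1.0, -1.0

lights = [np.array([0.4, -0.3, 2.0]), np.array([-0.5, 0.2, 1.5]),
          np.array([0.1, 0.6, 2.5]), np.array([-0.2, -0.4, 3.0])]
P = [light_matrix(l) for l in lights]

# Quadrifocal tensor from the determinant formula of Eq. (15)
Q = np.zeros((3, 3, 3, 3))
for p in range(3):
    for q in range(3):
        for r in range(3):
            for s in range(3):
                Q[p, q, r, s] = np.linalg.det(np.vstack([P[0][p], P[1][q], P[2][r], P[3][s]]))

c = np.array([0.12, -0.07, 0.2])                 # one shadow caster (hypothetical)
s1, s2, s3, s4 = (project_h(Pi, c) for Pi in P)

# Quadrilinear correspondence conditions: 81 scalar equations, all of which should vanish
cond = np.einsum('i,j,k,l,ipw,jqx,kry,lsz,pqrs->wxyz', s1, s2, s3, s4, eps, eps, eps, eps, Q)
print("max |condition| over all 81 equations:", np.abs(cond).max())
```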

3.2.4 Distant Light

In our calibration method that we describe in Sect. 4, we always use the unified light model for which we derived the shadow matrices/tensors above. Nevertheless, it is interesting to see what properties shadow matrix/tensor estimation has for scenes where we can assume distant light with certainty.

Inserting distant light matrices into our derivations reveals that the shadow matrices/tensors take on specialized shapes: Compared to their counterparts for the unified light model [Eqs. (6), (12), and (16)], they have the same pattern of repeated entries and zero entries, but a few additional entries become zero. For the fundamental shadow matrix we have \(f_1=0\) and the remaining 2 unknowns can be estimated from 1 shadow correspondence. For the tritensor, we have \(t_3=t_5=t_9=0\), and the system [Eq. (13)] stays rank 4; thus, we need 2 correspondences to estimate the remaining 6 unknowns. For the quadtensor, we have \(q_6=q_8=q_9=q_{16}=q_{17}=q_{18}=0\), the rank of the estimation system (Fig. 16, right) drops from 13 to 11 and we need only 1 correspondence to estimate the remaining 12 unknowns.

3.2.5 Shadow Correspondences Through an Uncalibrated Camera

If the shadow observations for which we try to find a shadow matrix/tensor or shadow correspondences are not given in the coordinate system of the shadow receiver plane but only in the image coordinates of a static, uncalibrated pinhole camera that observes the receiver plane, the complexity of shadow matrix/tensor estimation interestingly does not increase: we can still estimate them from just 2 shadow correspondences. A static pinhole camera observing coplanar shadows can be modeled with a homography (an invertible \(3 \times 3\) matrix) \(\mathbf {H}\) in the shadow projection equations:

$$\begin{aligned} \lambda _i\tilde{\mathbf {s}}_i=\mathbf {H}\mathbf {L}_i\tilde{\mathbf {c}}. \end{aligned}$$

Analogous to Eq. (5), we can then derive the fundamental shadow matrix for shadows observed through an uncalibrated camera:

$$\begin{aligned} \mathbf {F}_\text {c}&= \left[ \mathbf {H}\mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {H}\mathbf {L}_1} \right] _\times \mathbf {H}\mathbf {L}_2(\mathbf {H}\mathbf {L}_1)^{+}\\&= \left[ \mathbf {H}\mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {H} \mathbf {L}_2\overbrace{(\mathbf {H}^{+}\mathbf {H}\mathbf {L}_1)^{+}(\mathbf {H} \mathbf {L}_1\mathbf {L}_1^{+})^{+}}^{\text {see Petersen and Pedersen (2012) Eq. (214)}} \\&= \left[ \mathbf {H}\mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {H} \mathbf {L}_2(\mathbf {H}^{-1}\mathbf {H}\mathbf {L}_1)^{+}\mathbf {H}^{+}\\&= \left[ \mathbf {H}\mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {H} \mathbf {L}_2\mathbf {L}_1^{+}\mathbf {H}^{-1}\\&= \overbrace{\det (\mathbf {H})\,\mathbf {H}^{-\top }\left[ \mathbf {L}_2 {\varvec{\emptyset }}_{\mathbf {L}_1}\right] _\times \mathbf {L}_2}^{\begin{array}{c} \text {follows from}\\ (\mathbf {Ha})\times (\mathbf {Hb})=\det (\mathbf {H})\mathbf {H}^{-\top } (\mathbf {a}\times \mathbf {b}) \end{array}}\mathbf {L}_1^{+}\mathbf {H}^{-1}\\&\propto \mathbf {H}^{-\top }\left[ \mathbf {L}_2{\varvec{\emptyset }}_{\mathbf {L}_1} \right] _\times \mathbf {L}_2\mathbf {L}_1^{+}\mathbf {H}^{-1}\\&= \mathbf {H}^{-\top }\mathbf {F}\mathbf {H}^{-1}, \end{aligned}$$

where \(\mathbf {F}\) is the original fundamental matrix of Eq. (5) for the shadows given in shadow receiver plane coordinates. \(\mathbf {F}_\text {c}\) is also skew-symmetric and can thus be estimated from \(\ge 2\) correspondences using Eq. (8). Analogous to Eqs. (11) and (15), the shadow tensors become

$$\begin{aligned}&(\mathcal {T})_{p,q,r}=(-1)^{p+1}\text {det}\left[ \begin{array}{c} (\mathbf {H}\mathbf {P}_1)_{1:p-1, :}\\ (\mathbf {H}\mathbf {P}_1)_{p+1:n, :}\\ (\mathbf {H}\mathbf {P}_2)_{q, :}\\ (\mathbf {H}\mathbf {P}_3)_{r, :} \end{array}\right] \text { and }\\&(\mathcal {Q})_{p,q,r,s}=\text {det}\left[ \begin{array}{c} (\mathbf {H}\mathbf {P}_1)_{p, :}\\ (\mathbf {H}\mathbf {P}_2)_{q, :}\\ (\mathbf {H}\mathbf {P}_3)_{r, :}\\ (\mathbf {H}\mathbf {P}_4)_{s, :}\\ \end{array} \right] . \end{aligned}$$

These have the same patterns of zeros and identical entries as the original tensors [Eqs. (12), (16)] and can thus be estimated from \(\ge 2\) correspondences using Eqs. (13), (14), or (16), respectively. The counterparts of these matrices/tensors for distant light have the same shape as those for nearby light and they all need to be estimated from \(\ge 2\) correspondences.
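The relation \(\mathbf {F}_\text {c}\propto \mathbf {H}^{-\top }\mathbf {F}\mathbf {H}^{-1}\) and the preserved skew-symmetry can again be checked numerically; the homography below is a hypothetical example of a camera observing the receiver plane.

```python
import numpy as np

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

l1, l2 = np.array([0.4, -0.3, 2.0]), np.array([-0.5, 0.2, 1.5])
L1, L2 = light_matrix(l1), light_matrix(l2)

# Fundamental shadow matrix in receiver-plane coordinates, Eq. (5)
F = skew(L2 @ np.append(l1, 1.0)) @ L2 @ np.linalg.pinv(L1)

# Hypothetical homography modeling the uncalibrated camera that views the plane
H = np.array([[800.0, 15.0, 320.0],
              [-10.0, 780.0, 240.0],
              [1e-4, -2e-4, 1.0]])

# Fundamental shadow matrix for observations in camera image coordinates
F_c = skew(H @ L2 @ np.append(l1, 1.0)) @ H @ L2 @ np.linalg.pinv(H @ L1)
F_pred = np.linalg.inv(H).T @ F @ np.linalg.inv(H)     # H^{-T} F H^{-1}

Fc_n = F_c / np.linalg.norm(F_c)                       # compare up to scale
Fp_n = F_pred / np.linalg.norm(F_pred)
same = np.allclose(Fc_n, Fp_n, atol=1e-8) or np.allclose(Fc_n, -Fp_n, atol=1e-8)
print("F_c proportional to H^-T F H^-1:", same)
print("F_c skew-symmetric:", np.allclose(Fc_n + Fc_n.T, 0, atol=1e-9))
```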

Table 1 Minimal number of shadow correspondences required to estimate shadow matrices/tensors

Table 1 sums up the minimal number of correspondences required for shadow matrix/tensor estimation.

4 Proposed Method

Our method estimates a nearby point light’s position or a distant light’s direction using a simple calibration target consisting of a shadow receiver plane and shadow casters above the plane (see Fig. 2). It achieves the light source calibration automatically by observing the calibration target multiple times from a fixed viewpoint under a fixed point light source while changing the calibration target’s pose. The 3D positions of the shadow casters relative to the calibration target are treated as unknown, which makes it particularly easy to build the target while the problem remains tractable, as we will see later in this section. We now describe our proposed calibration method.

4.1 Light Source Calibration as Bundle Adjustment

Our goal is to determine the light \(\mathbf {l}\) in Eq. (4) by observing the shadows cast by unknown casters. A single shadow observation \(\mathbf {s}\) does not provide sufficient information to solve this. We thus let the receiver plane undergo multiple poses \(\{\left[ \mathbf {R}_i |\mathbf {t}_i \right] \}\). In pose i, the light position \(\mathbf {l}_i\) in receiver plane coordinates is related to \(\mathbf {l}\) in world coordinates as

$$\begin{aligned} \mathbf {l}_i = \big [\begin{array}{ccc} l_x^{(i)}&l_y^{(i)}&l_z^{(i)} \end{array}\big ]^\top = \mathbf {R}_i^\top \mathbf {l}- \mathbf {R}_i^\top \mathbf {t}_i. \end{aligned}$$

With this index i the matrices \(\{\mathbf {L}_i\}\) read

$$\begin{aligned} \mathbf {L}_i = \begin{bmatrix} -1 &{} 0 &{} l_x^{(i)}/l_z^{(i)} &{} 0\\ 0 &{} -1 &{} l_y^{(i)}/l_z^{(i)} &{} 0\\ 0 &{} 0 &{} 1/l_z^{(i)} &{} -1 \end{bmatrix}. \end{aligned}$$

If we use not only multiple poses \(\{\left[ \mathbf {R}_i |\mathbf {t}_i \right] \}\) but also multiple shadow casters \(\{\mathbf {c}_j\}\) (to increase the calibration accuracy as we show later), we obtain shadows \(\{\mathbf {s}_{ij}\}\) for each combination of pose i and caster j. Equation (4) then becomes

$$\begin{aligned} \lambda _{ij} \tilde{\mathbf {s}}_{ij} = \mathbf {L}_i \tilde{\mathbf {c}}_j. \end{aligned}$$

Assuming that the target poses \(\{[\mathbf {R}_i|\mathbf {t}_i]\}\) are known, our goal is to estimate the light position \(\mathbf {l}\) in world coordinates and the shadow caster locations \(\{\mathbf {c}_j\}\) in calibration target coordinates. We formulate this as a least-squares objective function of the reprojection error:

$$\begin{aligned} \min _{\mathbf {l}, \mathbf {c}_j, \lambda _{ij}} \sum _{i,j} \left\Vert \lambda _{ij} \tilde{\mathbf {s}}_{ij} - \mathbf {L}_i \tilde{\mathbf {c}}_j\right\Vert _2^2 \quad \text {s.t.} \quad \mathbf {l}= \mathbf {R}_i \mathbf {l}_i + \mathbf {t}_i. \end{aligned}$$
(17)

We solve this nonlinear least-squares problem with Levenberg–Marquardt (Nocedal and Wright 2006). For robust estimation we use RANSAC (Fischler and Bolles 1981): We repeatedly choose a random observation set, estimate \((\mathbf {l},\mathbf {c}_j,\lambda _{ij})\), and select the estimate with the smallest residual.
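The following is a minimal, self-contained sketch (with hypothetical synthetic data) of this bundle adjustment using scipy.optimize.least_squares and its Levenberg–Marquardt solver. For brevity it eliminates the scales \(\lambda _{ij}\) by comparing dehomogenized projections, starts from a crude perturbation of the ground truth instead of the initialization of Sect. 4.2, and omits the RANSAC loop.

```python
import numpy as np
from scipy.optimize import least_squares

def light_matrix(l):
    lx, ly, lz = l
    return np.array([[-1.0, 0.0, lx / lz, 0.0],
                     [0.0, -1.0, ly / lz, 0.0],
                     [0.0, 0.0, 1.0 / lz, -1.0]])

def project(l_i, c):
    s = light_matrix(l_i) @ np.append(c, 1.0)
    return s[:2] / s[2]

def rot(ax, ay):   # target tilts about the x- and y-axes
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    return Ry @ Rx

def light_in_target(l, R, t):          # l_i = R^T l - R^T t
    return R.T @ (l - t)

rng = np.random.default_rng(2)
l_gt = np.array([0.2, -0.1, 1.0])                                     # ground-truth light (world)
C_gt = rng.uniform([-0.15, -0.15, 0.05], [0.15, 0.15, 0.2], (3, 3))   # casters (target coords)
angles = [(0.3, -0.2), (-0.4, 0.1), (0.2, 0.4), (-0.1, -0.3), (0.35, 0.25)]
poses = [(rot(a, b), np.array([0.05 * k, -0.03 * k, 0.02 * k])) for k, (a, b) in enumerate(angles)]

# Observed shadows s_ij for every pose i and caster j
S = np.array([[project(light_in_target(l_gt, R, t), c) for c in C_gt] for R, t in poses])

def residuals(x):
    l, C = x[:3], x[3:].reshape(-1, 3)
    return np.concatenate([project(light_in_target(l, R, t), c) - S[i, j]
                           for i, (R, t) in enumerate(poses) for j, c in enumerate(C)])

x0 = np.concatenate([l_gt + 0.3, (C_gt + 0.05).ravel()])   # crude initialization for this sketch
sol = least_squares(residuals, x0, method='lm')             # Levenberg-Marquardt
print("estimated light:", np.round(sol.x[:3], 6), " ground truth:", l_gt)
```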

4.2 Initializing the Bundle Adjustment

Equation (17) is non-convex and thus affected by the initialization. To find a good initial guess, we relax our problem into a convex one as follows.

Nearby Light For nearby light, we can write the objective analogous to Eq. (1) as \((\mathbf {c}_j - \bar{\mathbf {s}}_{ij}) \times (\mathbf {l}_i - \bar{\mathbf {s}}_{ij}) = \mathbf {0}\) and then, using \(\mathbf {l}_i=\mathbf {R}_i^\top \mathbf {l}- \mathbf {R}_i^\top \mathbf {t}_i\), as

$$\begin{aligned} (\mathbf {c}_j - \bar{\mathbf {s}}_{ij})\times (\mathbf {R}_i^\top \mathbf {l}- \mathbf {R}_i^\top \mathbf {t}_i-\bar{\mathbf {s}}_{ij}) = \mathbf {0}. \end{aligned}$$

With \(\mathbf {c}_j=\left[ c_{j,x}, c_{j,y}, c_{j,z}\right] ^{\top }\), \(\bar{\mathbf {s}}_{ij}=\left[ s_x, s_y, 0\right] ^{\top }\), \(\mathbf {l}=\left[ l_x, l_y, l_z\right] ^{\top }\), \(\mathbf {R}_i^\top =\left[ {\begin{matrix} r_0 &{} r_1 &{} r_2\\ r_3 &{} r_4 &{} r_5\\ r_6 &{} r_7 &{} r_8\\ \end{matrix}}\right] \), and \(-\mathbf {R}_i^\top \mathbf {t}_i=\left[ t_x, t_y, t_z\right] ^\top \), we can rewrite this as

$$\begin{aligned} \mathbf {0}=&\,(\mathbf {c}_j - \bar{\mathbf {s}}_{ij})\times (\mathbf {R}_i^\top \mathbf {l}- \mathbf {R}_i^\top \mathbf {t}_i - \bar{\mathbf {s}}_{ij})\\ =&\begin{bmatrix} c_{j,x}-s_x\\ c_{j,y}-s_y\\ c_{j,z}\\ \end{bmatrix}\times \begin{bmatrix} \left[ r_0,r_1,r_2\right] \mathbf {l}+ t_x-s_x\\ \left[ r_3,r_4,r_5\right] \mathbf {l}+ t_y-s_y\\ \left[ r_6,r_7,r_8\right] \mathbf {l}+ t_z\\ \end{bmatrix}. \end{aligned}$$

Expanding the cross-product yields

$$\begin{aligned} \left\{ \begin{aligned} 0=\,&(c_{j,y}-s_y)(\left[ r_6,r_7,r_8\right] \mathbf {l}+t_z) -c_{j,z}(\left[ r_3,r_4,r_5\right] \mathbf {l}+t_y-s_y),\\ 0=\,&c_{j,z}(\left[ r_0,r_1,r_2\right] \mathbf {l}+t_x-s_x)-(c_{j,x}-s_x)(\left[ r_6,r_7,r_8\right] \mathbf {l}+t_z),\\ 0=\,&(c_{j,x}-s_x)(\left[ r_3,r_4,r_5\right] \mathbf {l}+t_y-s_y)\\&-(c_{j,y}-s_y)(\left[ r_0,r_1,r_2\right] \mathbf {l}+t_x-s_x), \end{aligned}\right. \end{aligned}$$

which we can rewrite as

$$\begin{aligned} \left\{ \begin{aligned}&-s_yt_z=-c_{j,y}(\left[ r_6,r_7,r_8\right] \mathbf {l}+t_z)+s_y\left[ r_6,r_7,r_8\right] \mathbf {l}\\&\quad +c_{j,z}(\left[ r_3,r_4,r_5\right] \mathbf {l}+t_y-s_y),\\&s_xt_z=-c_{j,z}(\left[ r_0,r_1,r_2\right] \mathbf {l}+t_x-s_x),\\&\quad +c_{j,x}(\left[ r_6,r_7,r_8\right] \mathbf {l}+t_z)-s_x\left[ r_6,r_7,r_8\right] \mathbf {l}\\&s_yt_x-s_xt_y=-c_{j,x}(\left[ r_3,r_4,r_5\right] \mathbf {l}+t_y-s_y)+s_x\left[ r_3,r_4,r_5\right] \mathbf {l}\\&\quad +c_{j,y}(\left[ r_0,r_1,r_2\right] \mathbf {l}+t_x-s_x)-s_y\left[ r_0,r_1,r_2\right] \mathbf {l}. \end{aligned}\right. \end{aligned}$$

and then write it in matrix form:

$$\begin{aligned} \left[ \begin{array}{ccc} r_{6} s_y&{} - r_{6} s_x&{} r_{3} s_x- r_{0} s_y\\ r_{7} s_y&{} - r_{7} s_x&{} r_{4} s_x- r_{1} s_y\\ r_{8} s_y&{} - r_{8} s_x&{} r_{5} s_x- r_{2} s_y\\ 0 &{} t_{z} &{} s_y- t_{y}\\ - t_{z} &{} 0 &{} - s_x+ t_{x}\\ - s_y+ t_{y} &{} s_x- t_{x} &{} 0\\ 0 &{} r_{6} &{} - r_{3}\\ - r_{6} &{} 0 &{} r_{0}\\ r_{3} &{} - r_{0} &{} 0\\ 0 &{} r_{7} &{} - r_{4}\\ - r_{7} &{} 0 &{} r_{1}\\ r_{4} &{} - r_{1} &{} 0\\ 0 &{} r_{8} &{} - r_{5}\\ - r_{8} &{} 0 &{} r_{2}\\ r_{5} &{} - r_{2} &{} 0 \end{array}\right] ^{\top } \left[ \begin{array}{c} l_x\\ l_y\\ l_z\\ c_{j,x}\\ c_{j,y}\\ c_{j,z}\\ l_xc_{j,x}\\ l_xc_{j,y}\\ l_xc_{j,z}\\ l_yc_{j,x}\\ l_yc_{j,y}\\ l_yc_{j,z}\\ l_zc_{j,x}\\ l_zc_{j,y}\\ l_zc_{j,z}\end{array}\right] = \left[ \begin{array}{c} -s_yt_{z} \\ s_xt_{z} \\ s_yt_{x} - s_xt_{y} \end{array}\right] . \end{aligned}$$
(18)

Note the matrix transpose used for space reasons. This equation captures one observation, i.e., one combination of pose i and caster j, but we need to stack equations from multiple observations. To simplify the following step, let us first split Eq. (18) into sub-matrices:

$$\begin{aligned} \left[ \mathbf {Q}_{ij}{\tiny \in \mathbb {R}^{3\times 3}}\quad \mathbf {W}_{ij}{\tiny \in \mathbb {R}^{3\times 12}} \right] \left[ \begin{array}{c} \mathbf {l}\,{\tiny \in \mathbb {R}^3}\\ \varvec{\theta }_{j}{\tiny \in \mathbb {R}^{12}} \end{array}\right] = \left[ \begin{array}{c} \mathbf {b}_{ij}{\,\tiny \in \mathbb {R}^3} \end{array}\right] . \end{aligned}$$

Note that here the matrix is not transposed.

Let \(N_p\) and \(N_c\) be the number of target poses and casters. The whole system of stacked equations then is

$$\begin{aligned} \underbrace{ \begin{bmatrix} \mathbf {Q}_{1,1}&{}\mathbf {W}_{1,1}&{}\mathbf {0}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {0}\\ \vdots &{}\vdots &{}\vdots &{}\vdots &{}&{}\vdots \\ \mathbf {Q}_{N_p,1}&{}\mathbf {W}_{N_p,1}&{}\mathbf {0}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {0}\\ \mathbf {Q}_{1,2}&{}\mathbf {0}&{}\mathbf {W}_{1,2}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {0}\\ \vdots &{}\vdots &{}\vdots &{}\vdots &{}&{}\vdots \\ \mathbf {Q}_{N_p,2}&{}\mathbf {0}&{}\mathbf {W}_{N_p,2}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {0}\\ \vdots &{}\vdots &{}\vdots &{}\vdots &{}&{}\vdots \\ \mathbf {Q}_{1,N_c}&{}\mathbf {0}&{}\mathbf {0}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {W}_{1,N_c}\\ \vdots &{}\vdots &{}\vdots &{}\vdots &{}&{}\vdots \\ \mathbf {Q}_{N_p,N_c}&{}\mathbf {0}&{}\mathbf {0}&{}\mathbf {0}&{}\;\cdots &{}\mathbf {W}_{N_p,N_c}\\ \end{bmatrix}}_{\mathbf {A}} \underbrace{ \begin{bmatrix} \mathbf {l}\\ \varvec{\theta }_1\\ \varvec{\theta }_2\\ \vdots \\ \varvec{\theta }_{N_c} \end{bmatrix}}_{\varvec{\theta }} = \underbrace{ \begin{bmatrix} \mathbf {b}_{1,1}\\ \vdots \\ \mathbf {b}_{N_p,1}\\ \mathbf {b}_{1,2}\\ \vdots \\ \mathbf {b}_{N_p,2}\\ \vdots \\ \mathbf {b}_{1,N_c}\\ \vdots \\ \mathbf {b}_{N_p,N_c} \end{bmatrix}}_{\mathbf {b}}. \end{aligned}$$
(19)

We have \(3+12N_c\) unknowns since all observations share \(\mathbf {l}=[l_x,l_y,l_z]^\top \) and each \(\varvec{\theta }_{j}\) has 12 unknowns. We solve

$$\begin{aligned} \varvec{\theta }^* = \mathop {\mathrm{argmin}}\limits _{\varvec{\theta }} \left\Vert \mathbf {A}\varvec{\theta }- \mathbf {b}\right\Vert _1 \end{aligned}$$
(20)

using \(\ell _1\) minimization to be robust against outliers. For the system to be solvable, the number of equations must be at least the number of unknowns:

$$\begin{aligned} \underbrace{3N_pN_c}_{\#\text {equations}}&\ge \, \underbrace{12N_c\quad +3}_{\#\text {variables}} \quad \Leftrightarrow \quad N_p\ge 4 + \frac{1}{N_c}. \end{aligned}$$

Thus, observations from \(N_p=5\) poses suffice to derive a solution regardless of the number of casters, if the problem has a non-degenerate, unique solution. After obtaining \(\varvec{\theta }^*\), we disregard the second-order variables, such as \(l_xc_{j,x}\) and \(l_xc_{j,y}\), and use \(\mathbf {c}_j^*\) and \(\mathbf {l}^*\) as initialization for minimizing Eq. (17).
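As an illustration of this initialization, the following sketch stacks per-observation blocks into the system of Eq. (19) and approximates the \(\ell _1\) problem of Eq. (20) with iteratively reweighted least squares (a linear-programming formulation would work equally well). The helper nearby_light_rows, which would return \(\mathbf {Q}_{ij}\), \(\mathbf {W}_{ij}\), and \(\mathbf {b}_{ij}\) of Eq. (18) for one pose/caster combination, is hypothetical.

```python
import numpy as np

def solve_l1_irls(A, b, n_iter=50, eps=1e-8):
    """Approximate argmin_x ||A x - b||_1 via iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(A @ x - b), eps)   # downweight large residuals
        sw = np.sqrt(w)[:, None]
        x = np.linalg.lstsq(A * sw, b * sw[:, 0], rcond=None)[0]
    return x

def initialize_nearby(poses, shadows, n_casters):
    """Build the stacked system of Eq. (19) and solve the relaxation of Eq. (20)."""
    rows, rhs = [], []
    for (R, t), obs in zip(poses, shadows):
        for j in range(n_casters):
            Q, W, b = nearby_light_rows(R, t, obs[j])   # hypothetical helper, Eq. (18)
            row = np.zeros((3, 3 + 12 * n_casters))
            row[:, :3] = Q
            row[:, 3 + 12 * j:3 + 12 * (j + 1)] = W
            rows.append(row)
            rhs.append(b)
    theta = solve_l1_irls(np.vstack(rows), np.concatenate(rhs))
    l = theta[:3]                                       # light position l*
    cs = np.stack([theta[3 + 12 * j:3 + 12 * j + 3]     # first three entries of each
                   for j in range(n_casters)])          # theta_j are the caster c_j*
    return l, cs
```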

Distant Light For distant light, \(\mathbf {A}\) is rank deficient: \(\text {rank}(\mathbf {A})=3+12N_c-1\), because we modeled the light with 3 degrees of freedom (DoF) whereas distant light only has 2 DoF, so the shadow observations can be explained by an infinite set of light vectors with the same direction but different lengths. We thus cannot use Eq. (18) for distant light, but we can detect this case automatically, switch to equations for distant light, solve them, and switch back to the unified bundle adjustment of Eq. (17). So, again, users do not have to choose a light model for their scene.

We detect distant light by constructing matrix \(\mathbf {A}\) for nearby light and detecting \(\mathbf {A}\)’s rank degeneracy: If the ratio between \(\mathbf {A}\)’s largest and smallest singular value is larger than \(4\times 10^4\) (in Sect. 5.1.4 we will give an analysis of this threshold), we switch to the following distant light equations.
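In code, this check could look as follows:

```python
import numpy as np

def is_distant_light(A, threshold=4e4):
    """Distant light manifests as (near) rank deficiency of the stacked matrix A."""
    s = np.linalg.svd(A, compute_uv=False)    # singular values, descending
    return s[0] / s[-1] > threshold
```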

For distant light we can write the objective analogous to Eq. (3) (using \(\mathbf {l}_i=\mathbf {R}_i^\top \mathbf {l}\)):

$$\begin{aligned} (\mathbf {c}_j - \bar{\mathbf {s}}_{ij})\times \mathbf {R}_i^\top \mathbf {l}= \mathbf {0}. \end{aligned}$$

Keeping the definitions of \(\mathbf {c}_j\), \(\bar{\mathbf {s}}_{ij}\), \(\mathbf {R}_i^\top \), and \(-\mathbf {R}_i^\top \mathbf {t}_i\), we can write this as

$$\begin{aligned} (\mathbf {c}_j - \bar{\mathbf {s}}_{ij})\times \mathbf {R}_i^\top \mathbf {l}= \begin{bmatrix} c_{j,x}-s_x\\ c_{j,y}-s_y\\ c_{j,z}\end{bmatrix}\times \begin{bmatrix} \left[ r_0,r_1,r_2\right] \mathbf {l}\\ \left[ r_3,r_4,r_5\right] \mathbf {l}\\ \left[ r_6,r_7,r_8\right] \mathbf {l}\end{bmatrix} = \mathbf {0}. \end{aligned}$$

Expanding the cross-product yields

$$\begin{aligned} \left\{ \begin{aligned} 0&=(c_{j,y}-s_y)\left[ r_6,r_7,r_8\right] \mathbf {l}-\,\,c_{j,z}\left[ r_3,r_4,r_5\right] \mathbf {l},\\ 0&=c_{j,z}\left[ r_0,r_1,r_2\right] \mathbf {l}-(c_{j,x}-s_x)\left[ r_6,r_7,r_8\right] \mathbf {l},\\ 0&=(c_{j,x}-s_x)\left[ r_3,r_4,r_5\right] \mathbf {l}-(c_{j,y}-s_y)\left[ r_0,r_1,r_2\right] \mathbf {l}. \end{aligned}\right. \end{aligned}$$

Setting \(\mathbf {l}=\left[ l_x,l_y,1\right] ^\top \) to reduce \(\mathbf {l}\) to two DoF yields

$$\begin{aligned} \left\{ \begin{aligned} 0=&(c_{j,y}-s_y)(r_6l_x+r_7l_y+r_8)-c_{j,z}(r_3l_x+r_4l_y+r_5),\\ 0=&c_{j,z}(r_0l_x+r_1l_y+r_2)-(c_{j,x}-s_x)(r_6l_x+r_7l_y+r_8),\\ 0=&(c_{j,x}-s_x)(r_3l_x+r_4l_y+r_5)-(c_{j,y}-s_y)(r_0l_x+r_1l_y+r_2), \end{aligned}\right. \end{aligned}$$

which we can rewrite as

$$\begin{aligned} \left\{ \begin{aligned} -s_yr_8=&-c_{j,y}(r_6l_x+r_7l_y+r_8)+s_y(r_6l_x+r_7l_y)\\&+c_{j,z}(r_3l_x+r_4l_y+r_5),\\ s_xr_8=&-c_{j,z}(r_0l_x+r_1l_y+r_2)\\&+c_{j,x}(r_6l_x+r_7l_y+r_8)-s_x(r_6l_x+r_7l_y),\\ s_yr_2-s_xr_5=&-c_{j,x}(r_3l_x+r_4l_y+r_5)+s_x(r_3l_x+r_4l_y)\\&+c_{j,y}(r_0l_x+r_1l_y+r_2)-s_y(r_0l_x+r_1l_y). \end{aligned}\right. \end{aligned}$$

This can be rewritten in matrix form as

$$\begin{aligned} \left[ \begin{array}{ccc} r_{6} s_y&{} - r_{6} s_x&{} r_{3} s_x- r_{0} s_y\\ r_{7} s_y&{} - r_{7} s_x&{} r_{4} s_x- r_{1} s_y\\ 0 &{} r_{8} &{} - r_{5}\\ - r_{8} &{} 0 &{} r_{2}\\ r_{5} &{} - r_{2} &{} 0\\ 0 &{} r_{6} &{} - r_{3}\\ - r_{6} &{} 0 &{} r_{0}\\ r_{3} &{} - r_{0} &{} 0\\ 0 &{} r_{7} &{} - r_{4}\\ - r_{7} &{} 0 &{} r_{1}\\ r_{4} &{} - r_{1} &{} 0 \end{array}\right] ^{\top } \left[ \begin{array}{c} l_x\\ l_y\\ c_{j,x}\\ c_{j,y}\\ c_{j,z}\\ l_xc_{j,x}\\ l_xc_{j,y}\\ l_xc_{j,z}\\ l_yc_{j,x}\\ l_yc_{j,y}\\ l_yc_{j,z}\end{array}\right] = \left[ \begin{array}{c} -s_yr_{8} \\ s_xr_{8} \\ s_yr_{2} - s_xr_{5} \end{array}\right] . \end{aligned}$$
(21)

Again, note the matrix transpose for space saving. Split into sub-matrices, this reads

$$\begin{aligned} \left[ \mathbf {Q}_{ij}{\tiny \in \mathbb {R}^{3\times 2}}\quad \mathbf {W}_{ij}{\tiny \in \mathbb {R}^{3\times 9}} \right] \left[ \begin{array}{c} \mathbf {l}\,{\tiny \in \mathbb {R}^2}\\ \varvec{\theta }_{j}{\tiny \in \mathbb {R}^{9}} \end{array}\right] = \left[ \begin{array}{c} \mathbf {b}_{ij}{\,\tiny \in \mathbb {R}^3} \end{array}\right] . \end{aligned}$$

Multiple observations are stacked analogously to Eq. (19) and solved using Eq. (20). Again, the number of equations must be at least the number of unknowns:

$$\begin{aligned} \underbrace{3N_pN_c}_{\#\text {equations}}&\ge \, \underbrace{9N_c\quad +2}_{\#\text {variables}} \quad \Leftrightarrow \quad N_p\ge 3 + \frac{2}{3N_c}. \end{aligned}$$

Thus, 4 poses suffice regardless of the number of casters. Since the formulation for distant light [Eq. (20) in conjunction with Eqs. (19) and (21)] returns a light direction for \(\mathbf {l}^*\) but the unified bundle adjustment requires a light position, we need to convert from a direction to a position. We start at one of the casters and move the light source out very far in space:

$$\begin{aligned} \mathbf {l}^*_\text {position} = \mathbf {c}+ \kappa \,h_{\mathbf {c}}\,\mathbf {l}^*_\text {direction}, \end{aligned}$$

where \(\mathbf {c}\) is an arbitrary caster’s position in world coordinates (obtained through the relaxation method), \(\kappa \) is a large constant (we use \(\kappa =10^{10}\)), and \(h_{\mathbf {c}}\) is the caster’s height above the shadow receiver plane (which takes the scale of the scene into account).
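A small sketch of this conversion (names are our own):

```python
import numpy as np

def direction_to_position(l_dir, c_world, h_c, kappa=1e10):
    """Turn an estimated light direction into a far-away light position
    so that the unified bundle adjustment of Eq. (17) can be used."""
    return np.asarray(c_world) + kappa * h_c * np.asarray(l_dir)
```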

4.3 Shadow Correspondence Search

For Eqs. (17), (18), and (21) we need to assign the same index j to all shadows \(\bar{\mathbf {s}}_{ij}\) that belong to the same caster \(\mathbf {c}_j\) in different images \(\{\mathbf {I}_i\}\). This correspondence problem is easier to solve if the input is structured, i.e., we have information about the relation between the input images. If the input is a video for example, then we can track shadows over consecutive frames. However, it is clearly desirable to also be able to handle unstructured input just like regular SfM. Unstructured input occurs if separate images or multiple videos are captured or if we record a video but some tracks break (due to lens flare, noise, shadows leaving the field of view, etc.).

In Sect. 3.2, we developed the basis for finding shadow correspondences in unstructured input: Shadows on a plane from a moving point light obey correspondence conditions with specialized fundamental matrices, tritensors, and quadtensors, which we will use for finding shadow correspondences and rejecting shadow misdetections. Since quadtensors are impractical because correspondence search in four views is very costly (unless some strong prior knowledge about possible correspondences narrows the search down), we will only cover fundamental matrices and tritensors in the following.

Formally, shadow correspondence search means we need to find permutations that match corresponding shadows between images. Let \(\mathbf {S}\) be the shadows \(\{\tilde{\mathbf {s}}_i\}\) and \(\mathbf {S}'\) be the shadows \(\{\tilde{\mathbf {s}}'_i\}\) stacked horizontally into matrices. For fundamental matrices we seek to find

$$\begin{aligned} \mathop {\mathrm{argmin}}\limits _{\mathbf {P}',\mathbf {F}}\sum _{i\in \{1,\dots ,N_c\}} \left\Vert \left( (\mathbf {S}'\mathbf {P}')_{:,i}\right) ^\top \mathbf {F}\left( \mathbf {S}\right) _{:,i}\right\Vert \nonumber \\ \text {s.t. }\mathbf {P}'\text { is a permutation matrix}. \end{aligned}$$
(22)

(Recall that \((\cdot )_{:, i}\) extracts a matrix’s ith column.) For tritensor estimation we need to solve

$$\begin{aligned}&\mathop {\mathrm{argmin}}\limits _{\begin{array}{c} \mathbf {P}', \mathbf {P}'',\\ \mathcal {T}=[\mathbf {T}_1,\mathbf {T}_2,\mathbf {T}_3] \end{array}}\sum _{i\in \{1, \dots ,N_c\}}\bigg \Vert \left[ (\mathbf {S}'\mathbf {P}')_{:,i} \;\right] _\times \cdot \nonumber \\&\Big (\sum _{j\in \{1,2,3\}}(\mathbf {S})_{j,i}\mathbf {T}_j\Big ) \left[ (\mathbf {S}''\mathbf {P}'')_{:,i}\right] _\times \bigg \Vert \nonumber \\&\text {s.t. }\mathbf {P}'\text { and }\mathbf {P}''\text { are permutation matrices.} \end{aligned}$$
(23)

We cannot use feature descriptors to narrow the search down since we want the shadows to be very small (to make the shadow “center” precisely localizable) and thus cannot vary a caster’s shape enough to make its shadows clearly distinguishable from other casters’ shadows. We thus find the minimizers of Eqs. (22) and (23) with branch and bound on the detected shadow points, without descriptors. This procedure returns a significant fraction of wrong correspondences. Inspired by the removal of inconsistent feature tracks in SfM [see, e.g., PhotoTourism (Snavely et al. 2006, Sec. 4.1)], we check the consistency of correspondences across multiple images.
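To illustrate the two-view case, the following sketch enumerates all permutations (feasible for the small number of casters we use, whereas our implementation uses branch and bound) and scores each with a fundamental shadow matrix fitted to it. For the fit we assume the skew-symmetric form \(\mathbf {F}=[\mathbf {e}]_\times \), where \(\mathbf {e}\) is the point in which the line through the two light positions (in target coordinates) pierces the receiver plane; each correspondence then contributes one linear constraint on \(\mathbf {e}\). This is meant as one concrete instance of the specialized matrices from Sect. 3.2, not as our exact parameterization.

```python
import itertools
import numpy as np

def skew(e):
    return np.array([[0., -e[2], e[1]],
                     [e[2], 0., -e[0]],
                     [-e[1], e[0], 0.]])

def fit_shadow_F(S, Sp):
    """Least-squares fit of F = [e]_x to homogeneous shadow points S, Sp (3 x N).
    Each correspondence (s, s') gives one linear constraint (s x s')^T e = 0."""
    rows = np.cross(S.T, Sp.T)                # one row per correspondence
    _, _, Vt = np.linalg.svd(rows)
    return skew(Vt[-1])

def match_two_views(S, Sp):
    """Brute-force minimizer of Eq. (22): try every permutation of the second
    view's shadows and keep the one best explained by a shadow F-matrix."""
    n = S.shape[1]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        Sp_p = Sp[:, list(perm)]
        F = fit_shadow_F(S, Sp_p)
        cost = sum(abs(Sp_p[:, i] @ F @ S[:, i]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm
```

Since \(\mathbf {e}\) has only two degrees of freedom, wrong permutations generically cannot be explained exactly once three or more casters are visible, which is what makes this scoring discriminative.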

Correspondence Consistency Check We work on two image pools: images with established shadow correspondences (“established pool”) and images with unknown correspondences (“unknown pool”). The established pool is initialized with a random unknown image and the goal is to move as many images as possible from the unknown to the established pool.

For fundamental shadow matrices we work in two phases: In phase 1 we randomly pick an image \(\mathbf {I}_e\) from the established pool and k images \(\mathbf {I}_{i_1}, ..., \mathbf {I}_{i_k}\) from the unknown pool.

Let \(\mu (\mathbf {I}_{a},\mathbf {I}_{b},i,j)\) be a binary function that is true iff Eq. (22)’s minimizer for images \(\mathbf {I}_{a}\) and \(\mathbf {I}_{b}\) matches shadow \(\mathbf {s}_{i}\) in \(\mathbf {I}_{a}\) with shadow \(\mathbf {s}_{j}\) in \(\mathbf {I}_{b}\). Then, if

$$\begin{aligned} \mu (\mathbf {I}_{e},\mathbf {I}_{i_1},j_0,j_1) \wedge \mu (\mathbf {I}_{i_1},\mathbf {I}_{i_2},j_1,j_2) \wedge \dots \wedge \mu (\mathbf {I}_{i_{k-1}},\mathbf {I}_{i_k},j_{k-1},j_k) \;\Rightarrow \; \mu (\mathbf {I}_{e},\mathbf {I}_{i_k},j_0,j_k) \end{aligned}$$

holds for all shadow index chains \((j_0,\dots ,j_k)\) (i.e., correspondences through the chain of images are consistent with the direct correspondences from the first to the last image), we move \(\{\mathbf {I}_{i_1}, ..., \mathbf {I}_{i_k}\}\) to the established pool. To make the constraint relatively strict, our implementation uses \(k=3\). Once half of all images are in the established pool, we switch to phase 2.

In phase 2 we assume all images in the established pool to be consistent. Thus, if we consider one unknown image and one shadow caster, all shadows in all established images that correspond to that particular caster should match the same shadow in the unknown image; and this should hold for all shadow casters. We pick a random unknown image, verify this criterion, and if more than half of all established images agree on their correspondences to the unknown image, it is moved to the established set. Otherwise, it is discarded. Phase 2 ends when the unknown pool is empty.
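A compact sketch of this phase-2 vote; match(a, b) is a hypothetical helper that returns the caster-to-shadow assignment between two images, i.e., the minimizer of Eq. (22):

```python
from collections import Counter

def accept_unknown_image(unknown, established, match, n_casters):
    """Accept `unknown` if, for every caster, more than half of the established
    images agree on which shadow in `unknown` that caster corresponds to."""
    votes = [Counter() for _ in range(n_casters)]
    for e in established:
        assignment = match(e, unknown)        # maps caster index -> shadow index
        for j in range(n_casters):
            votes[j][assignment[j]] += 1
    return all(v.most_common(1)[0][1] > len(established) / 2 for v in votes)
```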

For shadow tritensors, we also work in two phases. In phase 1 we randomly select one unknown image as target, one established image, and l unknown images for testing (\(l=15\) in our implementation), iterate over the test images, and use Eq. (23) to compute the trifocal tensors for the target image, the established image and the current test image. If more than \(\frac{l}{2}\) of the tensors agree on the correspondences between target and established image, we move the target image to the established pool. Once half of all images are in the established pool, we switch to phase 2, which works analogously to phase 2 for fundamental matrices.

4.4 Implementation Details

To obtain our target’s poses \(\{[\mathbf {R}_i|\mathbf {t}_i]\}\), we print ArUco markers (Garrido-Jurado et al. 2014) on a piece of paper, attach it to the target (see Fig. 2, left), and use OpenCV 3D pose estimation (Bradski et al. 2000). Our shadow casters are off-the-shelf pins with a length of \(\sim \) 30 mm and a head diameter of \(\sim \) 3 mm, which is big enough for easy detection and small enough for accurate localization. We can place the pins arbitrarily without measuring their positions since the bundle adjustment estimates them.
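For reference, a minimal pose-estimation sketch with OpenCV's ArUco module (pre-4.7 opencv-contrib-python API, a single marker for brevity; the dictionary choice and names are placeholders):

```python
import cv2
import numpy as np

aruco = cv2.aruco
DICTIONARY = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)  # example dictionary

def target_pose(image, camera_matrix, dist_coeffs, marker_length):
    """Return the pose [R|t] of the first detected marker (None if none found)."""
    corners, ids, _ = aruco.detectMarkers(image, DICTIONARY)
    if ids is None:
        return None
    rvecs, tvecs, _ = aruco.estimatePoseSingleMarkers(
        corners, marker_length, camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvecs[0])            # rotation vector -> 3x3 rotation matrix
    return R, tvecs[0].reshape(3)
```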

For shadow detection we developed a simple template matching scheme. For the templates we generated synthetic images of shadows consisting of a line with a circle at the end. To deal with varying projective transformations we use 12 rotation angles with 3 scalings each. We match the templates after binarizing the input image to extract shadowed regions more easily. Further we use the color of the pin heads to distinguish between heads and head shadows.
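A simplified version of this detector could look as follows; template generation, non-maximum suppression of nearby detections, and the pin-head color check are omitted, the templates are assumed to be binarized uint8 images, and the threshold is a placeholder:

```python
import cv2
import numpy as np

def detect_shadow_candidates(image_gray, templates, score_thresh=0.7):
    """Binarize the image (shadows are dark) and run normalized template matching
    with all rotated/scaled shadow templates; returns candidate center points."""
    _, binary = cv2.threshold(image_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    candidates = []
    for tmpl in templates:                    # e.g., 12 rotations x 3 scales
        res = cv2.matchTemplate(binary, tmpl, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(res >= score_thresh)
        h, w = tmpl.shape[:2]
        candidates += [(x + w // 2, y + h // 2) for x, y in zip(xs, ys)]
    return candidates
```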

4.5 Estimating Multiple Lights Jointly

Fig. 6
figure 6

Two lights casting two shadows per pin

Our method can estimate multiple lights jointly. It can work with multiple lights that were captured

  (a) Simultaneously (see Fig. 6) or

  (b) Separately, i.e., we switch on one light at a time and capture its shadows. It is, of course, necessary to use the same calibration target for all lights so that all caster positions stay constant in calibration target coordinates.

From the bundle adjustment’s viewpoint both cases are equivalent since applying the calibration target poses transforms all shadow positions into the same coordinate system, namely the target’s—no matter whether they were captured simultaneously or separately.

For both cases the benefit over single light calibration is improved accuracy: More captured data puts more constraints on the shadow caster positions, therefore the caster position estimates will be more accurate and as a consequence the light position estimates will also be more accurate. An additional benefit of case (a) is that simultaneous light capturing saves time. Note that, although our equations set no theoretical limit for the number of lights to be simultaneously estimated, there are very strong practical limits: Since we need to reliably detect each shadow point in the imagery, we are limited to very few lights in practice due to the low contrast and overlap of shadows under too many lights.

In Sect. 4.3 we discussed a correspondence problem: Finding all shadows \(\bar{\mathbf {s}}_{ij}\) that belong to the same caster \(\mathbf {c}_j\) in different images i. Multiple lights entail another correspondence problem: finding all shadows from the same light to couple the correct shadows \(\bar{\mathbf {s}}_{i,j,k}\), casters \(\mathbf {c}_j\) and lights \(\mathbf {l}_{k}\) in our equations. For separately captured lights this is trivial to solve since we know which light was on when a particular shadow was captured.

Fig. 7
figure 7

A scene with 2 lights and 3 casters. In each pose, fundamental matrices enable us to split the 6 shadows into 2 sets (blue rectangles) of 3 shadows, each set corresponding to one of the lights. However, the fundamental matrices cannot match these shadow sets across poses due to an ambiguity that allows connecting any pair of sets (blue arrows) (Color figure online)

For simultaneously captured light this is harder. Let us consider the example in Fig. 7: We have a scene with 2 lights and 3 casters. The shadow projection process is hinted at with transparent casters and arrows. In each of the two target poses we can use fundamental shadow matrix estimation to separate the 6 shadows into 2 sets of 3 shadows, each corresponding to one of the lights. We, however, cannot match these sets of 3 shadows across images. We can find fundamental shadow matrices \(\mathbf {F}_1\) and \(\mathbf {F}_2\) that connect the shadow set \(\{\mathbf {s}_1, \mathbf {s}_2, \mathbf {s}_3\}\) with \(\{\mathbf {s}_7, \mathbf {s}_8, \mathbf {s}_9\}\) and \(\{\mathbf {s}_4, \mathbf {s}_5, \mathbf {s}_6\}\) with \(\{\mathbf {s}_{10}, \mathbf {s}_{11}, \mathbf {s}_{12}\}\) but we can also find matrices \(\mathbf {F}'_1\ne \mathbf {F}_1\) and \(\mathbf {F}'_2\ne \mathbf {F}_2\) that connect \(\{\mathbf {s}_1, \mathbf {s}_2, \mathbf {s}_3\}\) with \(\{\mathbf {s}_{10}, \mathbf {s}_{11}, \mathbf {s}_{12}\}\) and \(\{\mathbf {s}_4, \mathbf {s}_5, \mathbf {s}_6\}\) with \(\{\mathbf {s}_7, \mathbf {s}_8, \mathbf {s}_9\}\). This is because fundamental shadow matrices cannot distinguish whether the shadow movement resulted from changing the calibration target pose or from changing the light position.

When using only fundamental matrices, tritensors or quadtensors, this is a fundamental ambiguity and not an implementation problem. To overcome this, we require users to capture a video and track each shadow from frame to frame. Thereby we can follow the set \(\{\mathbf {s}_1, \mathbf {s}_2, \mathbf {s}_3\}\) transitioning into \(\{\mathbf {s}_7, \mathbf {s}_8, \mathbf {s}_9\}\) and then assign the same light index k to them.

To sum up, estimating multiple lights jointly increases the calibration accuracy. If multiple lights are captured separately, the data can be completely unstructured. If the lights are captured simultaneously, additional information is needed to resolve an ambiguity. In this case we expect a video as input.

5 Evaluation

We now assess our method’s accuracy using simulation experiments (Sect. 5.1) and real-world scenes (Sect. 5.2).

5.1 Simulation

For all following simulated experiments we randomly sampled target poses, caster positions and light positions (the latter only for near light conditions) from uniform distributions within the ranges shown in Fig. 8. The casters were randomly placed on a target of size \(200\times 200\). For distant light, we sampled the light direction’s polar angle \(\theta \) from \(\left[ 0^{\circ },45^{\circ }\right] \).

Fig. 8
figure 8

The arrows show the value ranges of our simulation experiments

Table 2 Estimation error (mean of ten random trials) in a synthetic, noise-free setting

We evaluated the absolute/angular error of estimated light positions/directions while varying the distance \(t_z\) between light and calibration target and the number of casters \(N_c\). Table 2 shows that each configuration’s mean error is at least 14 orders of magnitude smaller than the scene extent, confirming that our method solves the joint estimation of light position/direction and shadow caster positions accurately in an ideal setup. In our experience, the difference between using the unified light model and using one of the specialized models (nearby or distant) is negligible.

In practice, light source estimates will be degraded by two main error sources: (1) shadow localization and (2) marker-based target pose estimation.

Fig. 9
figure 9

Estimation error for synthetic near and distant light with Gaussian noise added to the shadow positions. Each data point is the median of 500 random trials. Top row: \(N_p=10\) and \(N_c=5\). The noise’s standard deviation \(\sigma \) is on the x-axis. Middle row: \(N_p=5\) and \(N_c\) is on the x-axis. Bottom row: \(N_c=5\) and \(N_p\) is on the x-axis

5.1.1 Shadow Localization Errors

To analyze the influence of shadow localization, we perturbed the shadow positions with Gaussian noise. In this and the following experiments we set \(t_z=500\) so that our synthetic scenes’ proportions match those of our main real-world scene E1 (see Fig. 13) that we introduce later, with 1 synthetic length unit corresponding to 1 mm. Figure 9 shows the estimation accuracy of the convex relaxation [Eq. (20)] compared to the accuracy of full bundle adjustment [Eq. (17) after initialization with convex relaxation] for near and distant light, varying \(N_p\) and \(N_c\), and varying standard deviation \(\sigma \) for the shadow position noise.

Figure 9’s top row confirms that larger shadow position noise results in larger errors and that full bundle adjustment mitigates the error compared to solving only the convex relaxation. Increasing the number of casters or target poses makes Eqs. (20) and (17) more overconstrained and thus reduces the error from noisy shadow locations, as Fig. 9’s middle and bottom rows confirm.

5.1.2 Target Pose Estimation Errors

To simulate errors in the target pose estimation, we performed an experiment where we added Gaussian noise to the target’s roll, pitch, and yaw. Figure 10’s top row shows that the error is again higher for stronger noise and the bundle adjustment mitigates the error of the convex relaxation. In Fig. 10’s middle and bottom row we increased the number of casters and target poses again. Bundle adjustment and increasing the number of poses reduce the error, but increasing the number of casters does not. This is not surprising since adding constraints to our system only helps if the constraints have independent noise. Here, the noises for all shadows \(\bar{\mathbf {s}}_{i,j}\) of the same pose i stem from the same pose noise and are thus highly correlated. Thus, increasing the number of poses is more important for improving the accuracy than increasing the number of casters.

Fig. 10
figure 10

Estimation error for synthetic near and distant light with Gaussian noise added to the target orientation (in deg.). Each data point is the median of 500 random trials. Top row: \(N_p=10\) and \(N_c=5\). The noise’s standard deviation \(\sigma \) is on the x-axis. Middle row: \(N_p=5\) and \(N_c\) is on the x-axis. Bottom row: \(N_c=5\) and \(N_p\) is on the x-axis

5.1.3 Combined Shadow Localization and Target Pose Estimation Errors

Previously, we studied the effect of shadow localization errors and target pose errors separately. In this section we show simulation results where we added both types of noise at the same time. Comparing the top rows of Figs. 9 and 10, we can see that shadow localization noise causes errors roughly twice as big as those from target pose noise. In this experiment we thus set the standard deviation for shadow localization noise to \(\sigma _\text {shadows}=0.01\) and for target pose noise to \(\sigma _\text {pose}=0.005\).

For the number of shadow casters varying from 1 to 9 and the number of poses varying from 5 to 100, Fig. 11 shows color-coded (log-scale) median error in the top row and standard deviation in the bottom row. Again, bundle adjustment and more poses and casters decrease the error. If the application at hand dictates one of the two parameters, e.g., if time restrictions forbid increasing \(N_p\) beyond 20, this can always be countered by increasing the other parameter. Even though the minimal conditions for solving the calibration are 1 caster and 4 or 5 poses, the data suggests that users should probably always use 3 or more casters and 20 or more poses in practice.

Fig. 11
figure 11

Estimation error for synthetic near and distant light with Gaussian noise added to shadow positions (\(\sigma _\text {shadows}=0.01\)) and target orientation in degrees (\(\sigma _\text {pose}=0.005\)). \(N_p\) is on the x-axis and \(N_c\) is on the y-axis. For each data point we performed 500 random trials. Top row: The median of the absolute error (near light)/angular error in degrees (distant light). Bottom row: The error’s standard deviation

5.1.4 Discerning Nearby from Distant Light

As discussed in Sect. 4.2, we can discern nearby and distant light based on the condition number of \(\mathbf {A}\). In a noise-free setup, the condition number becomes larger than \(10^{15}\) for distant light and smaller than \(10^{5}\) for near light even in the hardest setting: \(N_p=5\). Thus, near and distant light can easily be discerned.

Fig. 12
figure 12

Histograms of \(\mathbf {A}\)’s condition number for \(N_p=5\) (top), \(N_p=20\) (middle), and \(N_p=50\) (bottom). Each histogram is the aggregate of 100 random trials with \(\sigma _\text {shadows}=0.01, 0.1, 1.0\). As in the previous experiment of Sect. 5.1.3 we set \(\sigma _\text {pose}=\frac{1}{2}\sigma _\text {shadows}\) (Color figure online)

For noisy input, our method requires more poses for a clear distinction: Fig. 12 shows histograms of the condition numbers for \(N_p=5\), 20, and 50. We can see that for \(N_p=20\) and 50, near and distant light can clearly be separated using a threshold of \(4\times 10^{4}\) while for \(N_p=5\) the blue histogram of near light extends well beyond \(4\times 10^{4}\) and thus cannot be discerned from distant light. Based on this, we set the threshold to \(4\times 10^{4}\) and suggest working with \(N_p\ge 20\) target poses. As already discussed, we recommend increasing \(N_p\) as the primary way of error reduction. 20–50 poses are captured rather quickly.

5.2 Real-World Experiments

We created 4 real-world environments, see Fig. 13. In all experiments we calibrated the intrinsic camera parameters beforehand and removed lens distortions.

Fig. 13
figure 13

Our real-world environments. E1 has 4 LEDs fixed around the camera. In E2 we use a smartphone’s camera and LED. In E3 we observe the target under sun light. E4 has a floodlight fixed about 3 m away from the target

Environments E1 and E2 have near light, and E3 and E4 have distant light. In E1 we fixed four LEDs to positions around the camera with a 3D printed frame and calculated the LEDs’ ground truth locations from the frame geometry. We used a FLIR FL2G-13S2C-C camera with a resolution of 1280\(\times \)960. In E2 we separately calibrated two smartphones (Sony Xperia XZ1 and Huawei P9) to potentially open up the path for inexpensive, end user-oriented physics-based modeling with phones. Both phones have a 1920\(\times \)1080 px camera and an LED light. We assumed that LED and camera are in a plane orthogonal to the camera axis and through the optical center, and measured the distance between LED and camera to obtain the ground truth. In E3 we placed the target under direct sun light and took datasets at three different times to obtain three light directions. In E4 a floodlight was fixed about 3 m away from the target to approximate distant lighting. In both E3 and E4 we used a Canon EOS 5D Mark IV with a \(35\,\mathrm{mm}\) single-focus lens and a resolution of 6720\(\times \)4480 and obtained the ground truth light directions from measured shadow caster positions and hand-annotated shadow positions. In all environments, we used an A5-sized calibration target with \(N_c=5\) pins.

Table 3 Estimation errors in our four real-world scenes

Table 3 shows the achieved estimation results. The light position errors are \(1.5\%\) of the target-camera distance for E1 and \(1.2\%\) for E2, the light direction errors are \(\sim \) 1\(^{\circ }\), and the caster position errors are < 2.5 mm. Figure 14 shows how increasing the number of target poses monotonically decreases the estimation error on two of our real-world scenes.

5.3 Fundamental Shadow Matrix Versus Shadow Tritensor

We now analyze the shadow correspondence search and verification based on fundamental shadow matrices and trifocal shadow tensors described in Sect. 4.3. We picked 200 images from E1-1 and obtained ground truth correspondences through video tracking. The fundamental matrix-based method returned correspondences for 145 images of which 143 were correct, and the tritensor-based method returned 152 images of which 151 were correct. Thus, the found correspondences are almost perfect (because we chose all consistency check criteria and parameters rather strictly, preferring to discard many frames and capture more frames instead), the number of matched images was by far sufficient for the subsequent calibration steps, and both consistency check methods performed almost identically. Note that we cannot deduce from this that fundamental shadow matrices and shadow tritensors themselves perform equally, because they are embedded in different consistency checks. We prefer working with fundamental matrices because their correspondence search runs an order of magnitude faster.

Fig. 14
figure 14

Estimation error for the first light of scene E1 and for scene E4. For each scene we captured 200 images, randomly picked \(N_p\) images from these, and estimated the light and caster positions. The gray bars and error bars represent median and interquartile range of 100 random iterations of this procedure

In certain situations users may, however, prefer tritensors over fundamental matrices or vice versa: The tritensor restricts the positions of shadows more and thus works well if the shadow detector detects the correct shadow but also has misdetections in the correct shadow’s vicinity. The tritensor can rule out almost all of those misdetections. If there are no misdetections but the correct shadow’s detected position has large noise, the tritensor may be too unforgiving and a user may prefer the fundamental matrix instead.

5.4 Estimating Multiple Lights Simultaneously

Capturing and estimating E1’s two top lights (reliably detecting shadows of more than two lights requires a better detector) simultaneously as described in Sect. 4.5 reduces the mean light and caster position errors from 6.9 and 1.1 mm to 5.1 and 0.6 mm, respectively.

As mentioned, we can also jointly calibrate lights whose image sets were captured separately, as long as they were all captured with the same calibration target. This necessitates the use of fundamental shadow matrices/shadow tritensors since there is (currently) no other method for matching shadows across image sets. For E1, joint calibration decreased the errors as shown in Table 4.

We can even jointly estimate nearby and distant light that was captured separately. We put the data from nearby and distant light separately through the convex relaxation, run bundle adjustment on them separately, and then run joint bundle adjustment. Picking one light from E1 and one light from E3, this procedure reduced the light position and direction errors and mean caster position error from 7.1 mm, 0.71 deg. and 1.21 mm to 3.8 mm, 0.53 deg. and 1.15 mm, respectively.

Table 4 Average estimation errors (all units in mm) in scene E1 for 2, 3, and 4 lights captured separately and calibrated separately or simultaneously
Table 5 Estimation error in scene E1 (averaged over E1’s 4 lights) for Ours and Shen

5.5 Comparison with Existing Method

To put our method’s accuracy into perspective, Ackermann et al. (2013) achieved accuracies of about 30–70 mm on scenes 2–3 times as big as ours, despite also minimizing reprojection error (and thus being theoretically more accurate than methods based on simpler triangulation schemes, according to Hartley and Sturm (1997)) and despite very careful experiment execution. We believe this is at least partially due to their use of spheres.

In this section we compare our calibration method—denoted as Ours—with an existing method. Because of Ackermann’s achieved accuracy we ruled out spheres and compared to a reimplementation of a state-of-the-art method based on a planar mirror (Shen and Cheng 2011)—denoted as Shen. Their method observes the specular reflection of the point light in the mirror, also models the mirror with perspective projection, and infers its parameters similarly to camera calibration. In our implementation of Shen we again used ArUco markers to obtain the mirror’s pose and we annotated the highlight positions manually. For a fair comparison we also annotated the shadow positions for our method manually. For all hand annotations, we zoomed in on the specular/shadow area of an image and carefully picked the specularity’s/shadow’s centroid.

In both methods we used an A4-sized target and observed the target while varying its pose \(\sim \) 500 mm away from the light source. We captured 30 poses for each method and annotated the shadows/reflections. Table 5 shows the estimation error of light source positions for Ours and Shen in scene E1. Ours with hand-annotated shadows as well as detected shadows outperforms Shen with annotated highlights.

6 Conclusions

Theoretic Contributions In this paper we explored the connections between point lights and pinhole cameras: Single view shadow projection from point lights follows the same principles as pinhole camera projection but with more specialized projection matrices. We devised a unified light model that smoothly interpolates between projection from a nearby light and distant light and thereby spares users having to choose between light models. As a consequence of point lights behaving like pinhole cameras, we saw that multi-view shadow correspondences follow the principles of epipolar geometry. Their fundamental matrices, trifocal and quadrifocal tensors have specialized shapes that allow estimating them from as few as 2 correspondences, and there is no general degeneracy in estimating fundamental shadow matrices from coplanar scene points. Shadow matrices/tensors allow us to establish point correspondences from unstructured sets of images without the use of tracking or feature matching. We further saw that point lights and shadow caster positions can be simultaneously estimated using structure from motion and bundle adjustment.

Fig. 15
figure 15

a With a small baseline between camera and light, the caster may occlude the shadow as seen from the camera. b To solve this, we use a smaller shadow caster, bring the target closer to the camera and make the target smaller so the camera can capture it fully

We want to add a thought on calibration target design: Ackermann et al. (2013) pointed out that, analogous to the large depth uncertainty in narrow-baseline stereo, narrow baseline calibration targets such as Powell’s (2001) have a large light position uncertainty along the light direction. This can be decreased by either building a static wide-baseline calibration target, or by moving the target in the scene as we do. So, again our method is strongly connected to SfM where camera movement is key to reducing depth uncertainty.

Experimental Results Our noise-free simulation experiments showed that our formulation is correct and the solution method derives accurate estimates with negligible numerical errors. Thus, the solution quality is rather governed by the inaccuracy of target pose estimation and shadow detection. We showed on synthetic and real-world scenes that even with these inaccuracies our method accurately estimates light source positions/directions with measurements from a sufficient number of shadow casters and (more importantly) target poses, which can easily be collected by moving the proposed calibration target in front of the camera. Further, we showed that we can increase the calibration accuracy by estimating multiple lights simultaneously. Regarding the choice between fundamental shadow matrices and trifocal shadow tensors, we saw that both yield approximately equally accurate results.

A comparison with a state-of-the-art method based on highlights on a mirror plane showed our method’s superior accuracy. We believe the reason lies in our pin shadows’ accurate localizability. As discussed in Sect. 2, highlights are hard to localize accurately. In contrast, our pin shadows do not “bleed” into their neighborhood and we can easily control their size through the pin head size. If higher localization accuracy is required, one can choose pins smaller than ours.

Practical Implications In contrast to related work, our method requires no tedious, error-prone hand annotations of, e.g., sphere outlines, no precisely fabricated objects such as precise spheres, and no precise measurements of, e.g., sphere positions. Users need not even choose between the two different light models (nearby and distant) since our calibration method infers this automatically. Further, shadow matrices/tensors even allow unstructured image sets and not just videos to be used as input for the calibration. The construction of our calibration target is simple, fast and cheap and most calibration steps (e.g., target pose estimation and shadow detection/matching) run automatically. The only manual interaction—capturing images or a video while moving the target—is simple. To our knowledge no other method combines such simplicity and accuracy.

Limitations Our method cannot be used for scenes where light and camera are so close together that the caster occludes the image of the shadow (see Fig. 15a). The solution is to effectively increase the baseline between camera and light by using a smaller target and bringing it closer to the camera, as shown in Fig. 15b.

Future Work It may be possible to alleviate the occlusion problem above with a shadow detection that handles partial occlusions. Further, we want to analyze degenerate cases where our equations are rank deficient, e.g., a target with one caster being moved such that its shadow stays on the same spot.

The source code for this project can be downloaded from: github.com/hiroaki-santo/light-structure-from-pin-motion.