CS231A Course Notes 1: Camera Models
Kenji Hata and Silvio Savarese
1 Introduction
The camera is one of the most essential tools in computer vision. It is the
mechanism by which we can record the world around us and use its output -
photographs - for various applications. Therefore, one question we must ask
in introductory computer vision is: how do we model a camera?
2 Pinhole cameras
Figure 1: A simple working camera model: the pinhole camera model. The figure shows the object, the barrier with its aperture, and the film.

Let’s design a simple camera system: a system that can record an image of an object or scene in the 3D world. This camera system can be designed by placing a barrier with a small aperture between the 3D object and a photographic film or sensor. As Figure 1 shows, each point on the 3D object emits multiple rays of light outwards. Without a barrier in place, every point on the film will be influenced by light rays emitted from every point on the 3D object. Due to the barrier, only one (or a few) of these rays of light passes through the aperture and hits the film. Therefore, we can establish a one-to-one mapping between points on the 3D object and the film. The result is that the film gets exposed by an “image” of the 3D object by means of this mapping. This simple model is known as the pinhole camera model.

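Figure 2: A formal construction of the pinhole camera model. The figure shows the pinhole O with the camera axes i, j, k, a 3D point P, its projection P′ on the image plane Π′ at focal length f, and the point C′.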

A more formal construction of the pinhole camera is shown in Figure 2. In this construction, the film is commonly called the image or retinal plane. The aperture is referred to as the pinhole O or center of the camera. The distance between the image plane and the pinhole O is the focal length f. Sometimes, the retinal plane is placed between O and the 3D object at a distance f from O. In this case, it is called the virtual image or virtual retinal plane. Note that the projection of the object in the image plane and the image of the object in the virtual image plane are identical up to a scale (similarity) transformation.

Now, how do we use pinhole cameras? Let $P = \begin{bmatrix} x & y & z \end{bmatrix}^T$ be a point on some 3D object visible to the pinhole camera. $P$ will be mapped or projected onto the image plane $\Pi'$, resulting in point¹ $P' = \begin{bmatrix} x' & y' \end{bmatrix}^T$. Similarly, the pinhole itself can be projected onto the image plane, giving a new point $C'$.

Here, we can define a coordinate system $\begin{bmatrix} i & j & k \end{bmatrix}$ centered at the pinhole $O$ such that the axis $k$ is perpendicular to the image plane and points toward it. This coordinate system is often known as the camera reference system or camera coordinate system. The line defined by $C'$ and $O$ is called the optical axis of the camera system.

Recall that point $P'$ is derived from the projection of 3D point $P$ on the image plane $\Pi'$. Therefore, if we derive the relationship between 3D point $P$ and image plane point $P'$, we can understand how the 3D world imprints itself upon the image taken by a pinhole camera. Notice that triangle $P'C'O$ is similar to the triangle formed by $P$, $O$ and $(0, 0, z)$. Therefore, using the law of similar triangles we find that:

$$
P' = \begin{bmatrix} x' & y' \end{bmatrix}^T = \begin{bmatrix} f\dfrac{x}{z} & f\dfrac{y}{z} \end{bmatrix}^T
$$

¹ Throughout the course notes, let the prime superscript (e.g. $P'$) indicate that this point is a projected or complementary point to the non-superscript version. For example, $P'$ is the projected version of $P$.
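As a minimal sketch of the pinhole projection, assume the similar-triangles result takes the usual form x′ = f·x/z, y′ = f·y/z (the lens model later in these notes reduces to this when z′ = f). The function name `pinhole_project` and the sample values are illustrative only.

```python
from typing import Tuple

def pinhole_project(point: Tuple[float, float, float], f: float) -> Tuple[float, float]:
    """Project a 3D point in camera coordinates onto the image plane.

    Assumes x' = f * x / z and y' = f * y / z, with z measured along the
    optical axis from the pinhole O.
    """
    x, y, z = point
    if z == 0:
        raise ValueError("point lies in the plane of the pinhole (z == 0)")
    return (f * x / z, f * y / z)

# Doubling the depth of a point halves its projected coordinates.
near = pinhole_project((1.0, 2.0, 4.0), f=2.0)
far = pinhole_project((1.0, 2.0, 8.0), f=2.0)
```

The division by z is what makes distant objects appear smaller, and it is also the reason the projection is not a linear map of (x, y, z).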

Figure 4: A setup of a simple lens model (object, lens, film). Notice how the rays of the top point on the tree converge nicely on the film. However, a point at a different distance away from the lens results in rays not converging perfectly on the film.

in the image plane. Therefore, the problem of the majority of the light rays blocked due to a small aperture is removed (Figure 4). However, please note that this property does not hold for all 3D points, but only for some specific point P. Take another point Q which is closer or further from the image plane than P. The corresponding projection into the image will be blurred or out of focus. Thus, lenses have a specific distance for which objects are “in focus”. This property is also related to a photography and computer graphics concept known as depth of field, which is the effective range at which cameras can take clear photos.

Figure 5: Lenses focus light rays parallel to the optical axis into the focal point. Furthermore, this setup illustrates the paraxial refraction model, which helps us find the relationship between points in the image plane and the 3D world in cameras with lenses.

Camera lenses have another interesting property: they focus all light rays traveling parallel to the optical axis to one point known as the focal point (Figure 5). The distance between the focal point and the center of the lens is commonly referred to as the focal length $f$. Furthermore, light rays passing through the center of the lens are not deviated. We thus can arrive at a similar construction to the pinhole model that relates a point $P$ in 3D space with its corresponding point $P'$ in the image plane.

$$
P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} z'\dfrac{x}{z} \\[4pt] z'\dfrac{y}{z} \end{bmatrix}
$$

The derivation for this model is outside the scope of the class. However, please notice that in the pinhole model $z' = f$, while in this lens-based model, $z' = f + z_0$. Additionally, since this derivation takes advantage of the paraxial or “thin lens” assumption², it is called the paraxial refraction model.
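The only change from the pinhole case is the distance z′ that scales the projection. The sketch below illustrates this under the relations stated above (x′ = z′·x/z, y′ = z′·y/z, with z′ = f + z_0); the function name and the numeric values are hypothetical.

```python
from typing import Tuple

def paraxial_project(point: Tuple[float, float, float], f: float, z0: float) -> Tuple[float, float]:
    """Project a 3D point with the paraxial refraction model.

    Uses z' = f + z0, so setting z0 = 0 reproduces the pinhole case z' = f.
    """
    x, y, z = point
    z_prime = f + z0
    return (z_prime * x / z, z_prime * y / z)

lens_image = paraxial_project((1.0, 2.0, 4.0), f=1.5, z0=0.5)     # z' = 2.0
pinhole_image = paraxial_project((1.0, 2.0, 4.0), f=2.0, z0=0.0)  # z' = 2.0
```

Both calls use the same effective z′, so they produce the same image point.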

Figure 6: Demonstrating how pincushion and barrel distortions affect images. The panels show a normal image, a pincushion-distorted image, and a barrel-distorted image.

Because the paraxial refraction model approximates using the thin lens assumption, a number of aberrations can occur. The most common one is referred to as radial distortion, which causes the image magnification to decrease or increase as a function of the distance to the optical axis. We classify the radial distortion as pincushion distortion when the magnification increases and barrel distortion³ when the magnification decreases. Radial distortion is caused by the fact that different portions of the lens have differing focal lengths.

4 Going to digital image space

In this section, we will discuss the details of the parameters we must account for when modeling the projection from 3D space to the digital images we know. All the results derived will use the pinhole model, but they also hold for the paraxial refraction model.

² For the angle θ that incoming light rays make with the optical axis of the lens, the paraxial assumption substitutes θ for any place sin(θ) is used. This approximation of θ for sin θ holds as θ approaches 0.

³ Barrel distortion typically occurs when one uses fish-eye lenses.

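$$
P' = \begin{bmatrix} x' \\ y' \end{bmatrix}
= \begin{bmatrix} f k \dfrac{x}{z} + c_x \\[4pt] f l \dfrac{y}{z} + c_y \end{bmatrix}
= \begin{bmatrix} \alpha \dfrac{x}{z} + c_x \\[4pt] \beta \dfrac{y}{z} + c_y \end{bmatrix}
\tag{4}
$$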

Is there a better way to represent this projection from $P \to P'$? If this projection is a linear transformation, then it can be represented as a product of a matrix and the input vector (in this case, it would be $P$). However, from Equation 4, we see that this projection $P \to P'$ is not linear, as the operation divides by one of the input parameters (namely $z$). Still, representing this projection as a matrix-vector product would be useful for future derivations. Therefore, can we represent our transformation as a matrix-vector product despite its nonlinearity? Homogeneous coordinates are the solution.

4.1.2 Homogeneous Coordinates

One way to solve this problem is to change the coordinate systems. For example, we introduce a new coordinate, such that any point $P' = (x', y')$ becomes $(x', y', 1)$. Similarly, any point $P = (x, y, z)$ becomes $(x, y, z, 1)$. This augmented space is referred to as the homogeneous coordinate system. As demonstrated previously, to convert a Euclidean vector $(v_1, \dots, v_n)$ to homogeneous coordinates, we simply append a 1 in a new dimension to get $(v_1, \dots, v_n, 1)$. Note that the equality between a vector and its homogeneous coordinates only occurs when the final coordinate equals one. Therefore, when converting back from arbitrary homogeneous coordinates $(v_1, \dots, v_n, w)$, we get Euclidean coordinates $\left(\frac{v_1}{w}, \dots, \frac{v_n}{w}\right)$. Using homogeneous coordinates, we can formulate

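$$
P'_h = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix}
= \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
= \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} P_h
\tag{5}
$$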

From this point on, assume that we will work in homogeneous coordinates, unless stated otherwise. We will drop the $h$ index, so any point $P$ or $P'$ can be assumed to be in homogeneous coordinates. As seen from Equation 5, we can represent the relationship between a point in 3D space and its image coordinates by a matrix-vector relationship:

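$$
P' = \begin{bmatrix} x' \\ y' \\ z \end{bmatrix}
= \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
= \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} P
= M P
\tag{6}
$$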

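We can decompose this transformation a bit further into

$$
P' = M P
= \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} I & 0 \end{bmatrix} P
= K \begin{bmatrix} I & 0 \end{bmatrix} P
\tag{7}
$$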

The matrix K is often referred to as the camera matrix.
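To make the homogeneous formulation concrete, the following sketch applies the 3×4 matrix M = K[I 0] to a point and then divides by the last coordinate to return to Euclidean pixel coordinates. The helper names and the numeric values of α, β, c_x, c_y are assumptions chosen for illustration.

```python
from typing import List, Sequence

def to_homogeneous(v: Sequence[float]) -> List[float]:
    """Append a 1 to a Euclidean vector."""
    return list(v) + [1.0]

def from_homogeneous(vh: Sequence[float]) -> List[float]:
    """Divide by the last coordinate and drop it."""
    *head, w = vh
    return [c / w for c in head]

def matvec(M: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical intrinsics.
alpha, beta, cx, cy = 800.0, 600.0, 320.0, 240.0
M = [
    [alpha, 0.0, cx, 0.0],
    [0.0, beta, cy, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

P = [0.5, -0.25, 2.0]                  # point in camera coordinates
p_h = matvec(M, to_homogeneous(P))     # [alpha*x + cx*z, beta*y + cy*z, z]
p = from_homogeneous(p_h)              # [alpha*x/z + cx, beta*y/z + cy]
```

The matrix product itself is linear; the nonlinear division by z is deferred to the conversion back from homogeneous coordinates.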

4.1.3 The Complete Camera Matrix Model

The camera matrix $K$ contains some of the critical parameters that describe a camera’s characteristics and its model, including the $c_x$, $c_y$, $k$, and $l$ parameters as discussed above. Two parameters are currently missing from this formulation: skewness and distortion. We often say that an image is skewed when the camera coordinate system is skewed, meaning that the angle between the two axes is slightly larger or smaller than 90 degrees. Most cameras have zero skew, but some degree of skewness may occur because of sensor manufacturing errors. Deriving the new camera matrix accounting for skewness is outside the scope of this class, so we simply give it below:

$$
K = \begin{bmatrix} \alpha & -\alpha\cot\theta & c_x \\ 0 & \dfrac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix}
\tag{8}
$$

Most methods that we introduce in this class ignore distortion effects; therefore, our class camera matrix $K$ has 5 degrees of freedom: 2 for focal length, 2 for offset, and 1 for skewness. These parameters are collectively known as the intrinsic parameters, as they are unique and inherent to a given camera and relate to essential properties of the camera, such as its manufacturing.
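As a quick check on the skewed camera matrix, the sketch below builds K from its five parameters and confirms that it reduces to the zero-skew form when θ is 90 degrees. The function name `intrinsic_matrix` and the sample values are hypothetical.

```python
import math
from typing import List

def intrinsic_matrix(alpha: float, beta: float, cx: float, cy: float,
                     theta: float = math.pi / 2) -> List[List[float]]:
    """Build the 3x3 camera matrix K with skew angle theta (in radians)."""
    cot_theta = math.cos(theta) / math.sin(theta)
    return [
        [alpha, -alpha * cot_theta, cx],
        [0.0, beta / math.sin(theta), cy],
        [0.0, 0.0, 1.0],
    ]

K_no_skew = intrinsic_matrix(800.0, 600.0, 320.0, 240.0)
K_skewed = intrinsic_matrix(800.0, 600.0, 320.0, 240.0, theta=math.radians(85.0))
```

With θ = 90 degrees, cot θ vanishes and sin θ is 1 (up to floating-point rounding), so the off-diagonal skew term disappears and the second diagonal entry is simply β.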

4.2 Extrinsic Parameters

So far, we have described a mapping between a point $P$ in the 3D camera reference system to a point $P'$ in the 2D image plane using the intrinsic parameters of a camera described in matrix form. But what if the information about the 3D world is available in a different coordinate system? Then, we need to include an additional transformation that relates points from the world reference system to the camera reference system. This transformation is captured by a rotation matrix $R$ and translation vector $T$. Therefore, given a point in a world reference system $P_w$, we can compute its camera coordinates as follows:

$$
P = \begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix} P_w
\tag{9}
$$
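In Euclidean terms, this transformation amounts to P = R·P_w + T. A minimal sketch follows, assuming R is given as a 3×3 nested list and T as a length-3 list; the function name and the example pose are hypothetical.

```python
from typing import List, Sequence

def world_to_camera(R: Sequence[Sequence[float]], T: Sequence[float],
                    Pw: Sequence[float]) -> List[float]:
    """Map a world point into camera coordinates: P = R * Pw + T."""
    return [sum(R[i][j] * Pw[j] for j in range(3)) + T[i] for i in range(3)]

# Hypothetical pose: a 90-degree rotation about the z-axis, then a shift along z.
R = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
T = [0.0, 0.0, 5.0]

P = world_to_camera(R, T, [1.0, 0.0, 0.0])
```

Here the world x-axis is rotated onto the camera y-axis and the point is then pushed 5 units along the optical axis.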

rig usually consists of a simple pattern (e.g. a checkerboard) with known dimensions. Furthermore, the rig defines our world reference frame with origin $O_w$ and axes $i_w, j_w, k_w$. From the rig’s known pattern, we have known points in the world reference frame $P_1, \dots, P_n$. Finding these points in the image we take from the camera gives corresponding points in the image $p_1, \dots, p_n$. We set up a linear system of equations from $n$ correspondences such that for each correspondence $P_i, p_i$ and camera matrix $M$ whose rows are $m_1, m_2, m_3$:

$$
p_i = \begin{bmatrix} u_i \\ v_i \end{bmatrix} = M P_i
= \begin{bmatrix} \dfrac{m_1 P_i}{m_3 P_i} \\[6pt] \dfrac{m_2 P_i}{m_3 P_i} \end{bmatrix}
$$

As we see from the above equation, each correspondence gives us two equations and, consequently, two constraints for solving the unknown parameters contained in $m$. From before, we know that the camera matrix has 11 unknown parameters. Since each correspondence supplies two equations, we need at least 6 correspondences (2 × 6 = 12 ≥ 11) to solve this. However, in the real world, we often use more, as our measurements are often noisy. To explicitly see this, we can derive a pair of equations that relate $u_i$ and $v_i$ with $P_i$.

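$$
\begin{aligned}
u_i (m_3 P_i) - m_1 P_i &= 0 \\
v_i (m_3 P_i) - m_2 P_i &= 0
\end{aligned}
$$

Given $n$ of these corresponding points, the entire linear system of equations becomes

$$
\begin{aligned}
u_1 (m_3 P_1) - m_1 P_1 &= 0 \\
v_1 (m_3 P_1) - m_2 P_1 &= 0 \\
&\;\;\vdots \\
u_n (m_3 P_n) - m_1 P_n &= 0 \\
v_n (m_3 P_n) - m_2 P_n &= 0
\end{aligned}
$$

This can be formatted as a matrix-vector product shown below:

$$
\begin{bmatrix}
P_1^T & 0^T & -u_1 P_1^T \\
0^T & P_1^T & -v_1 P_1^T \\
\vdots & \vdots & \vdots \\
P_n^T & 0^T & -u_n P_n^T \\
0^T & P_n^T & -v_n P_n^T
\end{bmatrix}
\begin{bmatrix} m_1^T \\ m_2^T \\ m_3^T \end{bmatrix}
= \mathbf{P} m = 0
\tag{12}
$$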

When $2n > 11$, our homogeneous linear system is overdetermined. For such a system $m = 0$ is always a trivial solution. Furthermore, even if there were some other $m$ that were a nonzero solution, then $\forall k \in \mathbb{R}$, $km$ is also a solution. Therefore, to constrain our solution, we complete the following minimization:

$$
\begin{aligned}
\underset{m}{\text{minimize}} \quad & \|\mathbf{P} m\|^2 \\
\text{subject to} \quad & \|m\|^2 = 1
\end{aligned}
$$
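The structure of the stacked system can be sketched as follows. The code builds the two rows contributed by each correspondence and checks that the flattened true camera matrix lies in the null space when the image points are noise-free. The camera matrix `M_true`, the calibration points, and all function names are hypothetical; the stacked matrix is called `A` in the code to avoid clashing with the point names.

```python
from typing import List, Sequence, Tuple

def dlt_rows(P_world: Sequence[float], p_image: Tuple[float, float]) -> List[List[float]]:
    """Return the two rows [P^T 0^T -u P^T] and [0^T P^T -v P^T] for one correspondence."""
    Ph = list(P_world) + [1.0]           # homogeneous world point
    u, v = p_image
    zeros = [0.0] * 4
    row_u = Ph + zeros + [-u * c for c in Ph]
    row_v = zeros + Ph + [-v * c for c in Ph]
    return [row_u, row_v]

def project(M: Sequence[Sequence[float]], P_world: Sequence[float]) -> Tuple[float, float]:
    """Project a world point with a 3x4 camera matrix M."""
    Ph = list(P_world) + [1.0]
    x, y, w = (sum(m * c for m, c in zip(row, Ph)) for row in M)
    return (x / w, y / w)

# Hypothetical camera matrix and six non-coplanar calibration points.
M_true = [
    [800.0, 0.0, 320.0, 10.0],
    [0.0, 600.0, 240.0, 20.0],
    [0.0, 0.0, 1.0, 4.0],
]
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]

A = []
for Pw in points:
    A.extend(dlt_rows(Pw, project(M_true, Pw)))

m_true = [c for row in M_true for c in row]   # rows m1, m2, m3 stacked into one vector
residual = [sum(a * m for a, m in zip(row, m_true)) for row in A]
scaled_residual = [sum(a * 3.0 * m for a, m in zip(row, m_true)) for row in A]
```

Both residuals vanish up to rounding, which illustrates why the unit-norm constraint is needed: any scalar multiple of a solution is also a solution.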

To solve this minimization problem, we simply use singular value decomposition. If we let $\mathbf{P} = UDV^T$, with the singular values sorted in decreasing order, then the solution to the above minimization is to set $m$ equal to the last column of $V$, the right singular vector associated with the smallest singular value. The derivation for this solution is outside the scope of this class and you may refer to Section 5.3 of Hartley & Zisserman on pages 592-593 for more details.

After reformatting the vector $m$ into the matrix $M$, we now want to explicitly solve for the extrinsic and intrinsic parameters. We know our SVD-solved $M$ is known up to scale, which means that the true values of the camera matrix are some scalar multiple of $M$:

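$$
\rho M = \begin{bmatrix}
\alpha r_1^T - \alpha\cot\theta\, r_2^T + c_x r_3^T & \alpha t_x - \alpha\cot\theta\, t_y + c_x t_z \\
\dfrac{\beta}{\sin\theta} r_2^T + c_y r_3^T & \dfrac{\beta}{\sin\theta} t_y + c_y t_z \\
r_3^T & t_z
\end{bmatrix}
$$

Here, $r_1^T$, $r_2^T$, and $r_3^T$ are the three rows of $R$. Dividing by the scaling parameter gives

$$
M = \frac{1}{\rho} \begin{bmatrix}
\alpha r_1^T - \alpha\cot\theta\, r_2^T + c_x r_3^T & \alpha t_x - \alpha\cot\theta\, t_y + c_x t_z \\
\dfrac{\beta}{\sin\theta} r_2^T + c_y r_3^T & \dfrac{\beta}{\sin\theta} t_y + c_y t_z \\
r_3^T & t_z
\end{bmatrix}
= \begin{bmatrix} A & b \end{bmatrix}
= \begin{bmatrix} a_1^T & b_1 \\ a_2^T & b_2 \\ a_3^T & b_3 \end{bmatrix}
$$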

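Solving for the intrinsics gives

$$
\begin{aligned}
\rho &= \pm\frac{1}{\|a_3\|} \\
c_x &= \rho^2 (a_1 \cdot a_3) \\
c_y &= \rho^2 (a_2 \cdot a_3) \\
\theta &= \cos^{-1}\left( -\frac{(a_1 \times a_3) \cdot (a_2 \times a_3)}{\|a_1 \times a_3\| \cdot \|a_2 \times a_3\|} \right) \\
\alpha &= \rho^2 \|a_1 \times a_3\| \sin\theta \\
\beta &= \rho^2 \|a_2 \times a_3\| \sin\theta
\end{aligned}
$$

The extrinsics are

$$
\begin{aligned}
r_1 &= \frac{a_2 \times a_3}{\|a_2 \times a_3\|} \\
r_2 &= r_3 \times r_1 \\
r_3 &= \rho\, a_3 \\
T &= \rho K^{-1} b
\end{aligned}
$$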
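These closed-form expressions can be checked numerically. The sketch below builds M = (1/ρ)K[R T] from hypothetical ground-truth parameters and recovers them again. It assumes the positive sign for ρ, positive α and β, and 0 < θ < π; all function names and numeric values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def norm(a):
    return math.sqrt(dot(a, a))

def decompose(M):
    """Recover intrinsics and extrinsics from a 3x4 camera matrix known up to scale."""
    a1, a2, a3 = (row[:3] for row in M)
    b = [row[3] for row in M]
    rho = 1.0 / norm(a3)  # assumption: positive sign of the +/- ambiguity
    cx = rho ** 2 * dot(a1, a3)
    cy = rho ** 2 * dot(a2, a3)
    c13, c23 = cross(a1, a3), cross(a2, a3)
    cos_theta = -dot(c13, c23) / (norm(c13) * norm(c23))
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding
    alpha = rho ** 2 * norm(c13) * math.sin(theta)
    beta = rho ** 2 * norm(c23) * math.sin(theta)
    r1 = [c / norm(c23) for c in c23]
    r3 = [rho * c for c in a3]
    r2 = cross(r3, r1)
    # T = rho * K^{-1} b; K is upper triangular, so back-substitute from the last row.
    cot_theta = math.cos(theta) / math.sin(theta)
    tz = rho * b[2]
    ty = (rho * b[1] - cy * tz) * math.sin(theta) / beta
    tx = (rho * b[0] - cx * tz + alpha * cot_theta * ty) / alpha
    return {"alpha": alpha, "beta": beta, "theta": theta, "cx": cx, "cy": cy,
            "R": [r1, r2, r3], "T": [tx, ty, tz]}

# Hypothetical ground truth: slight skew, rotation about the y-axis, arbitrary scale.
alpha, beta, theta, cx, cy = 800.0, 600.0, math.radians(85.0), 320.0, 240.0
phi = math.radians(30.0)
R = [[math.cos(phi), 0.0, math.sin(phi)],
     [0.0, 1.0, 0.0],
     [-math.sin(phi), 0.0, math.cos(phi)]]
T = [0.1, -0.2, 5.0]
K = [[alpha, -alpha * math.cos(theta) / math.sin(theta), cx],
     [0.0, beta / math.sin(theta), cy],
     [0.0, 0.0, 1.0]]
RT = [R[i] + [T[i]] for i in range(3)]
rho_true = 2.0
M = [[sum(K[i][k] * RT[k][j] for k in range(3)) / rho_true for j in range(4)]
     for i in range(3)]

est = decompose(M)
```

In practice the sign of ρ is usually fixed by a side condition such as requiring the calibration rig to lie in front of the camera; that choice is outside what the formulas above determine.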

of linear equations:

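$$
\begin{aligned}
v_1 (m_1 P_1) - u_1 (m_2 P_1) &= 0 \\
&\;\;\vdots \\
v_n (m_1 P_n) - u_n (m_2 P_n) &= 0
\end{aligned}
$$

Similar to before, this gives a matrix-vector product that we can solve via SVD:

$$
L\mathbf{n} = \begin{bmatrix}
v_1 P_1^T & -u_1 P_1^T \\
\vdots & \vdots \\
v_n P_n^T & -u_n P_n^T
\end{bmatrix}
\begin{bmatrix} m_1^T \\ m_2^T \end{bmatrix}
$$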

Once $m_1$ and $m_2$ are estimated, $m_3$ can be expressed as a nonlinear function of $m_1$, $m_2$, and $\lambda$. This requires solving a nonlinear optimization problem that is much simpler than the original one.

7 Appendix A: Rigid Transformations

The basic rigid transformations are rotation, translation, and scaling. This appendix will cover them for the 3D case, as it is the common case in this class.

Rotating a point in 3D space can be represented by rotating around each of the three coordinate axes respectively. When rotating around the coordinate axes, common convention is to rotate in a counter-clockwise direction. One intuitive way to think of rotations is how much we rotate around each degree of freedom, which is often referred to as Euler angles. However, this methodology can result in what is known as singularities, or gimbal lock, in which certain configurations result in a loss of a degree of freedom for the rotation.

One way to prevent this is to use rotation matrices, which are a more general form of representing rotations. Rotation matrices are square, orthogonal matrices with determinant one. Given a rotation matrix $R$ and a vector $v$, we can compute the resulting vector $v'$ as

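$$
v' = Rv
$$

Since rotation matrices are a very general representation of rotations, we can represent a rotation by angles $\alpha, \beta, \gamma$ around each of the respective axes as follows:

$$
R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}
$$

$$
R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}
$$

$$
R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$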
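The three axis rotations can be written directly in code. The sketch below also composes them and checks the defining property that a rotation matrix times its transpose is the identity. Function names and angles are illustrative.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(b):
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(g):
    c, s = math.cos(g), math.sin(g)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Rotate about z first, then y, then x: the rightmost factor acts first.
R = matmul(rot_x(0.3), matmul(rot_y(-0.2), rot_z(1.1)))
should_be_identity = matmul(R, transpose(R))

# A counter-clockwise quarter turn about z carries the x-axis onto the y-axis.
v = matvec(rot_z(math.pi / 2), [1.0, 0.0, 0.0])
```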

Due to the convention of matrix multiplication, the rotation achieved by first rotating around the z-axis, then y-axis, then x-axis is given by the matrix product $R_x R_y R_z$.

Translations, or displacements, are used to describe the movement in a certain direction. In 3D space, we define a translation vector $t$ with 3 values: the displacements in each of the 3 axes, often denoted as $t_x, t_y, t_z$. Thus, given some point $P$ which is translated to some other point $P'$ by $t$, we can write it as:

$$
P' = P + t = \begin{bmatrix} P_x \\ P_y \\ P_z \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}
$$

In matrix form, translations can be written using homogeneous coordinates. If we construct a translation matrix as

$$
T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$

then we see that $P' = TP$ is equivalent to $P' = P + t$. If we want to combine translation with our rotation matrix multiplication, we can again use homogeneous coordinates to our advantage. If we want to rotate a vector $v$ by $R$ and then translate it by $t$, we can write the resulting vector $v'$ as:

$$
\begin{bmatrix} v' \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ 1 \end{bmatrix}
$$

Finally, if we want to scale the vector in certain directions by some amount $S_x, S_y, S_z$, we can construct a scaling matrix

$$
S = \begin{bmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & S_z \end{bmatrix}
$$

Therefore, if we want to scale a vector, then rotate, then translate, our final transformation matrix would be:

$$
T = \begin{bmatrix} RS & t \\ 0 & 1 \end{bmatrix}
$$
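The combined transformation can be assembled as a 4×4 matrix and applied to a homogeneous vector. The following is a minimal sketch; the helper names and the example scale, rotation, and translation are hypothetical.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose(R, S, t):
    """Build the 4x4 matrix [[R S, t], [0, 1]]: scale, then rotate, then translate."""
    RS = matmul3(R, S)
    return [RS[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def apply(T, v):
    """Apply a 4x4 transform to a 3D vector and return the homogeneous result."""
    vh = list(v) + [1.0]
    return [sum(a * x for a, x in zip(row, vh)) for row in T]

S = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]    # uniform scale by 2
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # quarter turn about z
t = [1.0, 0.0, 0.0]

T = compose(R, S, t)
result = apply(T, [1.0, 0.0, 0.0])
```

The vector (1, 0, 0) is first scaled to (2, 0, 0), then rotated to (0, 2, 0), and finally translated to (1, 2, 0); the trailing 1 is the homogeneous coordinate.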

Figure 9: The weak perspective model: projection onto the image plane

As we see, the last row of $M$ is $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}$ in the weak perspective model, compared to $\begin{bmatrix} v & 1 \end{bmatrix}$ in the normal camera model. We do not prove this result and leave it to you as an exercise. The simplification is clearly demonstrated when mapping the 3D points to the image plane.

$$
P' = M P = \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} P = \begin{bmatrix} m_1 P \\ m_2 P \\ 1 \end{bmatrix}
$$

Thus, we see that the image plane point ultimately becomes a magnification of the original 3D point, irrespective of depth. The nonlinearity of the projective transformation disappears, making the weak perspective transformation a mere magnifier.
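A small numeric sketch can illustrate the loss of depth dependence. It assumes, as the third component of 1 in the equation above implies, that the last row of M is [0 0 0 1], so no division by depth occurs. The matrix entries and function name are hypothetical.

```python
def project_weak(M, P):
    """Weak perspective projection: image coordinates are (m1 P, m2 P)."""
    Ph = list(P) + [1.0]
    x, y, w = (sum(m * c for m, c in zip(row, Ph)) for row in M)
    # With last row [0, 0, 0, 1], w is always 1, so the division is a no-op.
    return (x / w, y / w)

# Hypothetical weak perspective camera: magnification 2 plus an image offset.
M_weak = [
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, -1.0],
    [0.0, 0.0, 0.0, 1.0],
]

near = project_weak(M_weak, (1.0, 1.0, 10.0))
far = project_weak(M_weak, (1.0, 1.0, 20.0))
```

Both points land on the same image location, since their z-coordinates never enter the result.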

Figure 10: The orthographic projection model

Further simplification leads to the orthographic (or affine) projection model. In this case, the optical center is located at infinity. The projection rays are now perpendicular to the retinal plane. As a result, this model ignores depth altogether. Therefore,

$$
\begin{aligned}
x' &= x \\
y' &= y
\end{aligned}
$$

Orthographic projection models are often used for architecture and industrial design. Overall, weak perspective models result in much simpler math, at the cost of being somewhat imprecise. However, they often yield results that are very accurate when the object is small and distant from the camera.