Correlation and Convolution in Image Processing: CMSC 426 Class Notes - Prof. David Jacobs, Study notes of Computer Science

These class notes for cmsc 426, fall 2005, introduce correlation and convolution as basic operations for image processing. The document starts with 1d images and averaging, then moves on to filter construction, correlation, and convolution. It also discusses separable filters and the convolution theorem.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

koofers-user-0fm
koofers-user-0fm 🇺🇸

10 documents

1 / 17

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Correlation and Convolution
Class Notes for CMSC 426, Fall 2005
David Jacobs
Introduction
Correlation and Convolution are basic operations that we will perform to extract
information from images. They are in some sense the simplest operations that we can
perform on an image, but they are extremely useful. Moreover, because they are simple,
they can be analyzed and understood very well, and they are also easy to implement and
can be computed very efficiently. Our main goal is to understand exactly what
correlation and convolution do, and why they are useful. We will also touch on some of
their interesting theoretical properties; though developing a full understanding of them
would take more time than we have.
These operations have two key features: they are shift-invariant, and they are linear.
Shift-invariant means that we perform the same operation at every point in the image.
Linear means that this operation is linear, that is, we replace every pixel with a linear
combination of its neighbors. These two properties make these operations very simple;
it’s simpler if we do the same thing everywhere, and linear operations are always the
simplest ones.
We will first consider the easiest versions of these operations, and then generalize. We’ll
make things easier in a couple of ways. First, convolution and correlation are almost
identical operations, but students seem to find convolution more confusing. So we will
begin by only speaking of correlation, and then later describe convolution. Second, we
will start out by discussing 1D images. We can think of a 1D image as just a single row
of pixels. Sometimes things become much more complicated in 2D than 1D, but luckily,
correlation and convolution do not change much with the dimension of the image, so
understanding things in 1D will help a lot. Also, later we will find that in some cases it is
enlightening to think of an image as a continuous function, but we will begin by
considering an image as discrete, meaning as composed of a collection of pixels.
Notation
We will use uppercase letters such as I and J to denote an image. An image may be
either 2D (as it is in real life) or 1D. We will use lowercase letters, like i and j to denote
indices, or positions, in the image. When we index into an image, we will use the same
conventions as Matlab. First, that means that the first element of an image is indicated by
1 (not 0, as in Java, say). So if I is a 1D image, I(1) is its first element. Second, for 2D
images we give first the row, then the column. So I(3,6) is the pixel in the third row of
the image, and the sixth column.
An Example
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Correlation and Convolution in Image Processing: CMSC 426 Class Notes - Prof. David Jacobs and more Study notes Computer Science in PDF only on Docsity!

Correlation and Convolution

Class Notes for CMSC 426, Fall 2005 David Jacobs

Introduction

Correlation and Convolution are basic operations that we will perform to extract information from images. They are in some sense the simplest operations that we can perform on an image, but they are extremely useful. Moreover, because they are simple, they can be analyzed and understood very well, and they are also easy to implement and can be computed very efficiently. Our main goal is to understand exactly what correlation and convolution do, and why they are useful. We will also touch on some of their interesting theoretical properties; though developing a full understanding of them would take more time than we have.

These operations have two key features: they are shift-invariant , and they are linear. Shift-invariant means that we perform the same operation at every point in the image. Linear means that this operation is linear, that is, we replace every pixel with a linear combination of its neighbors. These two properties make these operations very simple; it’s simpler if we do the same thing everywhere, and linear operations are always the simplest ones.

We will first consider the easiest versions of these operations, and then generalize. We’ll make things easier in a couple of ways. First, convolution and correlation are almost identical operations, but students seem to find convolution more confusing. So we will begin by only speaking of correlation, and then later describe convolution. Second, we will start out by discussing 1D images. We can think of a 1D image as just a single row of pixels. Sometimes things become much more complicated in 2D than 1D, but luckily, correlation and convolution do not change much with the dimension of the image, so understanding things in 1D will help a lot. Also, later we will find that in some cases it is enlightening to think of an image as a continuous function, but we will begin by considering an image as discrete , meaning as composed of a collection of pixels.

Notation

We will use uppercase letters such as I and J to denote an image. An image may be either 2D (as it is in real life) or 1D. We will use lowercase letters, like i and j to denote indices, or positions, in the image. When we index into an image, we will use the same conventions as Matlab. First, that means that the first element of an image is indicated by 1 (not 0, as in Java, say). So if I is a 1D image, I(1) is its first element. Second, for 2D images we give first the row, then the column. So I(3,6) is the pixel in the third row of the image, and the sixth column.

An Example

One of the simplest operations that we can perform with correlation is local averaging. As we will see, this is also an extremely useful operation. Let’s consider a simple averaging operation, in which we replace every pixel in a 1D image by the average of that pixel and its two neighbors. Suppose we have an image I equal to:

Averaging is an operation that takes an image as input, and produces a new image as output. When we average the fourth pixel, for example, we replace the value 3 with the average of 2, 3, and 7. That is, if we call the new image that we produce J we can write: J(4) = (I(3)+I(4)+I(5))/3 = (2+3+7)/3 = 4. Or, for example, we also get: J(3) = (I(2)+I(3)+I(4))/3 = (4+2+3)/3 = 3. Notice that every pixel in the new image depends on the pixels in the old image. A possible error is to use J(3) when calculating J(4). Don’t do this; J(4) should only depend on I(3), I(4) and I(5). Averaging like this is shift- invariant, because we perform the same operation at every pixel. Every new pixel is the average of itself and its two neighbors. Averaging is linear because every new pixel is a linear combination of the old pixels. This means that we scale the old pixels (in this case, we multiply all the neighboring pixels by 1/3) and add them up. This example illustrates another property of all correlation and convolution that we will consider. The output image at a pixel is based on only a small neighborhood of pixels around it in the input image. In this case the neighborhood contains only three pixels. Sometimes we will use slightly larger neighborhoods, but generally they will not be too big.

Boundaries: We still haven’t fully described correlation, because we haven’t said what to do at the boundaries of the image. What is J(1)? There is no pixel on its left to include in the average, ie., I(0) is not defined. There are four common ways of dealing with this issue.

In the first method of handling boundaries, the original image is padded with zeros (in red italics).

The first way is to imagine that I is part of an infinitely long image which is zero everywhere except where we have specified. In that case, we have I(0) = 0, and we can say: J(1) = (I(0) + I(1) + I(2))/3 = (0 + 5 + 4)/3 = 3. Similarly, we have: J(10) = (I(9)+I(10)+I(11))/3 = (3 + 6 + 0)/3 = 3.

In the second method of handling boundaries, the original image is padded with the first and last values (in red italics).

The second way is to also imagine that I is part of an infinite image, but to extend it using the first and last pixels in the image. In our example, any pixel to the left of the first pixel in I would have the value 5, and any pixel to the right of the last pixel would have the value 6. So we would say: J(1) = (I(0) + I(1) + I(2))/3 = (5 + 5 + 4)/3 = 4 2/3, and J(10) = (I(9)+I(10)+I(11))/3 = (3 + 6 + 6)/3 = 5.

To produce the next number in the filtered image, we slide the filter over a pixel, and perform the same operation.

1/3 1/3 1/

5/3 4/3 2/ Σ

We continue doing this until we have produced every pixel in J. With this view of correlation, we can define a new averaging procedure by just defining a new filter. For example, suppose instead of averaging a pixel with its immediate neighbors, we want to average each pixel with immediate neighbors and their immediate neighbors. We can define a filter as (1/5, 1/5, 1/5, 1/5, 1/5). Then we perform the same operation as above, but using a filter that is five pixels wide. The first pixel in the resulting image will then be: J(1) = (I(-1)/5 + I(0)/5 + I(1)/5 + I(2)/5 + I(3)/5) = 1+1+1+4/5 + 2/5 = 4 1/5.

A Mathematical Definition for Correlation

It’s helpful to write this all down more formally. Suppose F is a correlation filter. It will be convenient notationally to suppose that F has an odd number of elements, so we can suppose that as it shifts, its center is right on top of an element of I. So we say that F has 2N+1 elements, and that these are indexed from - N to N , so that the center element of F is F(0). Then we can write:

 =−

N

i N

F  I ( x ) F ( i ) I ( x i )

where the circle denotes correlation. With this notation, we can define a simple box filter as:

0 for 1 , 0 , 1

for 1 , 0 , 1 3

i

i Fi

Constructing an Filter from a Continuous Function

It is pretty intuitive what a reasonable averaging filter should look like. Now we want to start to consider more general strategies for constructing filters. It commonly occurs that we have in mind a continuous function that would make a good filter, and we want to come up with a discrete filter that approximates this continuous function. Some reasons for thinking of filters first as continuous functions will be given when we talk about the

14/3 11/

Fourier transform. But in the mean time we’ll give an example of an important continuous function used for image smoothing, the Gaussian.

A one-dimensional Gaussian is:

= ^ − −

2

2 2

exp ( ) 2

Gx^ x

This is also known as a Normal distribution. Here μ is the mean value, and σ is the variance. Here’s a plot of a Gaussian:

The mean, μ, gives the location of the peak of the function. The parameter σ controls how wide the peak is. As σ gets smaller the peak becomes narrower.

The Gaussian has a lot of very useful properties, some of which we’ll mention later. When we use it as a filter, we can think of it as a nice way to average. The averaging filter that we introduced above replaces each pixel with the average of its neighbors. This means that nearby pixels all play an equal role in the average, and more distant pixels play no role. It is more appealing to use the Gaussian to replace each pixel with a weighted average of its neighbors. In this way, the nearest pixels influence the average more, and more distant pixels play a smaller and smaller role. This is more elegant, because we have a smooth and continuous drop-off in the influence of pixels on the result, instead of a sudden, discontinuous change. We will show this more rigorously when we discuss the Fourier transform. When we use a Gaussian for smoothing, we will set μ =0, because we want each pixel to be the one that has the biggest effect on its new, smoothed value. So our Gaussian has the form:

2 0 exp 2 2

σ π^ σ

G x = −^ x

σ will serve as a parameter that allows us to control how much we smooth the image (that is, how big a neighborhood we use for averaging). The bigger σ is, the more we smooth the image.

We now want to build a discrete filter that looks like a Gaussian. We can do this by just evaluating the Gaussian at discrete locations. That is, although G is defined for any continuous value of x , we will just use its values at discrete locations (… -3, -2, -1, 0, 1, 2, 3…). One way to think about this is that we are approximating our original,

Let’s consider an example. Suppose image intensity grows quadratically with position, so that I(x) = x^2_._ Then if we look at the intensities at positions 1, 2, 3, … they will look like:

If we filter this with a filter of the form (-1/2 0 1/2) we will get:

We know that the derivative of I(x), dI/dx = 2x. And sure enough, we have, for example, that J(2) = 4, J(3) = 6, … Notice that J(1) is not equal to 2 because of the way we handle the boundary, by setting I(0) = 1 instead of I(0) = 0. So at the boundary, our image doesn’t really reflect I(x) = x^2_._

We could just as easily have used a filter like (0 -1 1), which computes the expression: J(i) = (I(i+1)-I(i))/1 , or a filter (-1 1 0), which computes J(i) = (I(i) – I(i-1))/1. These are all reasonable approximations to a derivative. One advantage of the filter (-1/2 0 1/2) is that it treats the neighbors of a pixel symmetrically.

Matching with Correlation

Part of the power of correlation is that we can use it, and related methods, to find locations in an image that are similar to a template. To do this, think of the filter as a template; we are sliding it around the image looking for a location where the template overlaps the image so that values in the template are aligned with similar values in the image.

First, we need to decide how to measure the similarity between the template and the region of the image with which it is aligned. A simple and natural way to do this is to measure the sum of the square of the differences between values in the template and in the image. This increases as the difference between the two increases. For the difference between the filter and the portion of the image centered at x, we can write this as:

( ) ( ( ) ( ) ( ) ( ))

 (^ ( ))^  (^ (^ ))^ (^ ( ) (^ ))

 

= − =− =−

=− =−

= + + − +

N

i N

N

i N

N

i N

N

i N

N

i N

F i I x i FiI x i

F i I x i F i I x i FiIx i

2 2

(^222)

As shown, we can break the Euclidean distance into three parts. The first part depends only on the filter. This will be the same for every pixel in the image. The second part is the sum of squares of pixel values that overlap the filter. And the third part is twice the negative value of the correlation between F and I. We can see that, all things being equal, as the correlation between the filter and the image increases, the Euclidean distance between them decreases. This provides an intuition for using correlation to

match a template with an image. Places where the correlation between the two is high tend to be locations where the filter and image match well.

This also shows one of the weaknesses of using correlation to compare a template and image. Correlation can also be high in locations where the image intensity is high, even if it doesn’t match the template well. Here’s an example. Suppose we correlate the filter

3 7 5

with the image:

3 2 4 1 3 8 4 0 3 8 0 7 7 7 1 2

We get the following result:

40 43 39 34 64 85 52 27 61 65 59 84 105 75 38 27

Notice that we get a high result (85) when we center the filter on the sixth pixel, because (3,7,5) matches very well with (3,8,4). But, the highest correlation occurs at the 13th pixel, where the filter is matched to (7,7,7). Even though this is not as similar to (3,7,5), its magnitude is greater.

One way to overcome this is by just using the sum of square differences between the signals, as given above. This produces the result:

25 26 26 41 29 2 59 54 34 26 78 13 20 32 61 38

We can see that using Euclidean distance, (3,7,5) best matches the part of the image with pixel values (3,8,4). The next best match has values (0,7,7), but this is significantly worse.

Another option is to perform normalized correlation by computing::

(^ (^ ))^ (^ ( ))



=− = −

=−

N

i N

N

i N

N

i N

I x i F i

FiI x i

2 2

This measure of similarity is similar to correlation, but it is invariant to scaling the template or a corresponding region in the image. If we scale I(x-N)…I(x+N) , by a single, constant factor, this will scale the numerator and denominator by the same amount. Consequently, for example, (3,7,5) will have the same normalized correlation with (7,7,7) that it would have with (1,1,1). When we perform normalized correlation between (3,7,5) and the above image we get:

.946 .877 .934 .73 .81 .989 .64 .59 .78 .835 .61 .931 .95 .83 .57.

Example: We can perform averaging of a 2D image using a 2D box filter, which, for a 3x3 filter, will look like:

1/9 1/9 1/ 1/9 1/9 1/ 1/9 1/9 1/ F

Below we show an image, I, and the result of applying the box filter above to I to produce the resulting image, J.

I

I with padded boundaries J = F o I

Separable filters: Generally, 2D correlation is more expensive than 1D correlation because the sizes of the filters we use are larger. For example, if our filter is (N+1)x(N+1) in size, and our image contains MxM pixels, then the total number of multiplications we must perform is (N+1)^2 M^2. However, with an important class of filters, we can save on this computation. These are separable filters, which can be written as a combination of two smaller filters. For example, in the case of the box filter, we have:

That is, the box filter is equal to the correlation of two filters that have a row and a column of equal values, and all other values equal to zero (here we are handling boundaries by using 0 values outside the boundary). When this is the case, correlating an image with the box filter produces the same result as correlating it separately with the two filters on the right side of the equals sign. Intuitively, this makes sense. Correlating with the rightmost filter averages each pixel with its neighbors on the same row. The second correlation then averages these resulting pixels with those neighbors in the same

column, which are already averages of neighbors on the same row. This produces an average of all neighboring pixels.

The advantage of separating a filter in this way is that when we perform correlation with the matrices on the right, we don’t need to actually perform any multiplications where there are zeros. In fact, it is more convenient to simply write these filters as 3x1 and 1x3, ignoring the zeros. This means that for each pixel in the image, we only need to perform 6 multiplications instead of 9. In general, if we can separate a (N+1)x(N+1) filter into two filters that are (N+1)x1 and 1x(N+1), the total work we must do in correlation is reduced from (N+1)^2 M^2 to 2(N+1)M^2.

Gaussian filtering has the same separability. For a 2D Gaussian, we use a function that is rotationally symmetric, where the height of the filter falls off as a Gaussian function of the distance from the center. This has the form:

= ^ − +

2

2 2 0 exp 2 2

σ π^ σ

G x y x^ y

This can be separated because it can be expressed as:

− ^ −

2 2

2 2

2 2 0 exp 2 exp 2 2

exp 2

σ π σ σ π σ^ σ

G x y x y x^ y

It is the product of two components, one of which only depends on x, while the other only depends on y. The first component can be expressed with a filter that has a single row, and the second can be expressed with a filter that has a single column.

Convolution

Convolution is just like correlation, except that we flip over the filter before correlating. For example, convolution of a 1D image with the filter (3,7,5) is exactly the same as correlation with the filter (5,7,3). We can write the formula for this as:

=−

N

i N

F I ( x ) F ( i ) I ( x i )

In the case of 2D convolution we flip the filter both horizontally and vertically. This can be written as:

= − =−

N

j N

N

i N

F I ( x , y ) F ( i , j ) I ( x i , y j )

Notice that correlation and convolution are identical when the filter is symmetric.

As an example of a less attractive basis, supposed we represented points using the two basis vectors (1,0) and (1,1). These are not orthogonal, because <(1,0),(1,1)> = 1. Still, given any point, (x,y), we could always represent it as:

(x,y) = u(1,0) + v(1,1)

for some choice of u and v. We can compute u and v by taking v = y, u = x – y. For example, we would represent the point (7,3) as 4(1,0) + 3(1,1). So rather than represent the point with the Euclidean coordinates (7,3), we can represent this same point in this new basis with the coordinates (4,3). But note that the Pythogorean theorem wouldn’t hold. And, if I give you two basis vectors that are not orthonormal, it is not so easy to come up with a formula for determining the coordinates of a point using these basis vectors (especially as we look at higher dimensions).

The Fourier Series

The Fourier series is a good representation because it provides an orthonormal basis for images. We want to explore it in the simplest possible setting, to get the key intuitions. This is one case in which it is easier to get the idea by working with continuous functions, rather than discrete ones. So, instead of thinking of an image as a list of pixels, we’ll think of it as a continuous function, I(x). To simplify, we’ll also only do things in 1D, but all this extends to 2D images. We will also consider only periodic functions that go from 0 to 2π, and then repeat over and over again. This is equivalent to the third way that we describe above for handling pixels outside of the boundary of images. The main thing I’ll do to make things easier, though, is to be very sloppy mathematically, skipping precise conditions and giving intuitions instead of complete proofs.

The following functions provide an orthonormal basis for functions:

sin( ) ,

cos( ) , 2

fork =

kx kx

These functions form the Fourier Series. The first one is a constant function, and the others are just sines and cosines of increasing frequency.

To define things like whether a set of functions are orthonormal, we need to define an inner product between two functions. We do this with a continuous version of the inner product. With discrete vectors, to compute the inner product we multiply together the values of matching coordinates, and then add up the results. We do the same thing with continuous functions; we multiply them together, and integrate (add) the result. So the inner product between two functions, f(x) and g(x), defined in the domain from 0 to 2π, is:

=

2 π

0

f , g f ( x ) g ( x ) dx

When we say that a function, f, has unit magnitude, we mean that = 1. When we say that two functions are orthogonal, we mean that = 0. One reason that these are sensible definitions is that if we approximate a function by sampling its values, and building them into a vector, then the definitions in the continuous case give us the same result as the limit of what we get in the discrete case, as we sample the function more and more densely.

We can use this definition to show that the functions in the Fourier Series have unit magnitude, by taking the inner product of each function with itself, and showing that this is equal to 1. We can show they are orthogonal by showing that their inner products with each other are zero. Doing this requires solving some integrals of trigonometric functions.

To show that the Fourier Series forms a good coordinate system for all functions we must also show that we can express any function as a linear combination of the elements of the Fourier Series. That is, we must show that for any function, f, we can write:



=

1

0

cos( ) sin( ) 2

k

k k

kx a

kx f x a b

Then the values (a 0 , b 1 , a 1 , b 2 , a 2 , …) are the coordinates of the function in this new coordinate system provided by the Fourier Series. This is a very important fact, but we will skip the proof here. Keep in mind that I’m leaving out some necessary conditions, such as that this only works when f is a continuous function.

This result means that any function can be broken down into the sum of sine waves of different amplitudes and phases. We say that the part of the function that is due to longer frequency sine waves is the low frequency part of the function. The part that is due to high frequency sine waves is the high frequency component of the function. If a function is very bumpy, with small rapid changes in its value, these rapid changes will be due to the high frequency component of the function. If we can eliminate these high-frequency components, we can make the function smoother.

Implications of the orthonormality of the Fourier Series.

As with an orthonormal basis for vectors, the orthonormality of the Fourier Series means that we can use projection and (a generalization of) the Pythagorean theorem.

We can solve the for the values (a 0 , b 1 , a 1 , b 2 , a 2 , …) just by taking inner products. So:

 −

T

T

k k

kT dx kx a

sin( ) 2 cos( )

 −

T

T

k k

kT dx kx b

cos( ) 2 sin( )

This is called the sinc function. If we graph it, we see that it looks like:

This does not decay as fast as a Gaussian. Also it has oscillations that produce odd effects. As a result of these oscillations, some high frequency components of an image will be wiped out by averaging, while some similar high frequency components remain.

Aliasing and the Sampling Theorem

Sampling means that we know the value of a function, f , at discrete intervals, f(n π /T) for

n = 0, +-1, +-T. That is, we have 2T + 1 uniform samples of f(x), assuming f is a periodic function. Suppose now that f is band-limited to T. That means,



T

k

k k

kx a kx f x a b 1

0

cos( ) sin( ) 2

 −

T

T

T

a dx

0

We know the value of f(x) at 2T + 1 positions. If we plug these values into the above equation, we get 2T + 1 equations. The ai and bi give us 2T + 1 unknowns. These are linear equations. So we get a unique solution. This means that we can reconstruct f from its samples. However, if we have fewer samples, reconstruction is not possible. If there is a higher frequency component, then this can play havoc with our reconstruction of all low frequency components, and be interpreted as odd low frequency components.

This means that if we want to sample a signal, the best thing we can do is to low-pass filter it and then sample it. Then, we get a signal that matches our original signal perfectly in the low-frequency components, and this is the best we can do. To shrink images, in fact, we low-pass filter them and then sample them.