Partial Derivatives: Understanding the Concept and Finding the Derivatives of a Function, Study notes of Mathematical Methods for Numerical Analysis and Optimization

The concept of partial derivatives, a multivariable extension of single-variable derivatives. It provides examples and formulas for calculating partial derivatives of a function with respect to each variable. The document also discusses the graphical representation of functions and their level sets.

Typology: Study notes

Pre 2010

Uploaded on 08/26/2009

koofers-user-si0

Note: Since many elementary MVC textbooks cover the relation between partial derivatives and total derivative in less depth than I believe is optimal for this course (or they do it without reference to matrices), I am providing the key issues here as special notes.

Partial derivatives

Definition of partial derivatives

Everybody who can do single-variable derivatives from calculus 1 & 2 can already do partial derivatives. For example, for a 2-variable function f given by the formula f(x, y) = (x^3 + 3x^2 y^2 + y^3) sin(x^2 + 3y), we can choose to treat y like a (fixed) parameter and view the expression as a function of the variable x alone, and we can take the derivative with respect to x. Alternatively, we can view this expression as a function of y alone, treating x like a parameter, and take the derivative with respect to y. These derivatives are called partial derivatives, and the adjective ‘partial’ is fitting because they provide only part of the information that should be contained in the derivative.

The prime notation f′ from single variable calculus won’t serve us here, because the prime doesn’t tell us which variable (x or y) we are dealing with. Instead, we use the Leibniz notation; in single variable calculus that would be df(x)/dx. However, in order to provide a visual sign that these are partial derivatives, we replace the standard d with the ‘curly d’ ∂. The significance of this notation will become more transparent soon; for the moment it’s just a reminder that we are dealing with one variable at a time in a situation where several variables are present.

So we can write an example for partial derivatives, with the f given above:

$$f(x, y) = (x^3 + 3x^2 y^2 + y^3)\,\sin(x^2 + 3y)$$

$$\frac{\partial f(x, y)}{\partial x} = (3x^2 + 6xy^2)\,\sin(x^2 + 3y) + (x^3 + 3x^2 y^2 + y^3)\,\cos(x^2 + 3y)\cdot 2x$$

$$\frac{\partial f(x, y)}{\partial y} = (6x^2 y + 3y^2)\,\sin(x^2 + 3y) + (x^3 + 3x^2 y^2 + y^3)\cdot 3\cos(x^2 + 3y)$$
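The two formulas can be checked numerically by treating one variable as a frozen parameter and differentiating the resulting single variable function. The following is only a sketch: the helper name `central_difference`, the step size, and the sample point (0.7, -0.4) are arbitrary choices made for illustration, not part of the notes.

```python
import math

def f(x, y):
    """The example function f(x, y) = (x^3 + 3x^2 y^2 + y^3) sin(x^2 + 3y)."""
    return (x**3 + 3 * x**2 * y**2 + y**3) * math.sin(x**2 + 3 * y)

def df_dx(x, y):
    """Partial derivative with respect to x (product and chain rule, y frozen)."""
    poly = x**3 + 3 * x**2 * y**2 + y**3
    angle = x**2 + 3 * y
    return (3 * x**2 + 6 * x * y**2) * math.sin(angle) + poly * math.cos(angle) * 2 * x

def df_dy(x, y):
    """Partial derivative with respect to y (product and chain rule, x frozen)."""
    poly = x**3 + 3 * x**2 * y**2 + y**3
    angle = x**2 + 3 * y
    return (6 * x**2 * y + 3 * y**2) * math.sin(angle) + poly * 3 * math.cos(angle)

def central_difference(g, t, step=1e-5):
    """Symmetric difference quotient of a single variable function g at t."""
    return (g(t + step) - g(t - step)) / (2 * step)

# Hypothetical sample point, chosen only for the comparison.
x0, y0 = 0.7, -0.4

# Freeze y = y0 and vary x; then freeze x = x0 and vary y.
numeric_dx = central_difference(lambda x: f(x, y0), x0)
numeric_dy = central_difference(lambda y: f(x0, y), y0)

print("df/dx formula:", df_dx(x0, y0), " numeric:", numeric_dx)
print("df/dy formula:", df_dy(x0, y0), " numeric:", numeric_dy)
```

Agreement up to the accuracy of the difference quotient is what one would expect; it does not replace the symbolic calculation.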

How are the curly d’s read aloud? For instance, ‘partial d f over d x’ or ‘del f over del x’.

Clarification of the function concept and appropriate notation

To explain and interpret partial derivatives, we need to work a bit on a clean language about functions. Mind the mantra that a function is not the same as a formula. Think of a function as a slot machine, that takes certain inputs and assigns a specific output to each legitimate input. The output may be obtained from the input by means of a formula (or by means of several formulas, in the case of piecewise defined functions). But even if one formula provides the output, the function is not the formula, but rather the function is the whole input-output device.

In the example of partial derivatives given above, we are actually talking about three different functions (each of which has its output given by the same formula), but they are distinguished by different input slots. Namely we are talking about one two-variable function, and two single-variable functions.

Firstly, we have the two-variable function which we assigned the name f. A more elaborate (and maybe weird-looking) name would be f (·, ·). The dots represent the input slots. We customarily name the quantity that goes in the first slot as x, and the quantity that goes in the second slot as y. (Not always; in f (1, 2), the quantities are explicit numbers, and they don’t need to be given another name.) The full picture of what the function does is given in the following notation (which pure math folks love for its conceptual clarity and unambiguity, but which is not so often used in calculus because it is lengthy):

$$f : (x, y) \mapsto f(x, y) = (x^3 + 3x^2 y^2 + y^3)\,\sin(x^2 + 3y).$$

Before the colon, you have the name of the function (f); after the colon, the explanation of what the function does. You see its input slots, with (conventional, but arbitrary) names (x and y) given to the variables that go in these slots. Then comes the assignment arrow ↦, which symbolizes the function’s operation of taking the input and converting it into an output. Finally comes the output, with its generic conventional symbol f(x, y) (function with variables filled in the slots), and the actual formula that tells how the function calculates the output from the input.

You’ll save yourself a lot of heartache with partial derivatives if you make sure that your notion of function in your brain represents this pattern and is not reduced to the formula at the end. And yes, I know, the ‘function=formula’ misconception has served you well so far, so it’s hard to get rid of it, but now you’d better dump the misconception anyway, lest it become the source of mysterious confusions down the road. Unlike the neatly groomed textbooks, I will not provide an artificially screened and protected environment designed to cater for survival with the beloved misconception.

In a rough metaphor, the function f is like a coke machine, with the two inputs being a coin and a button-push. The output f (x, y) is the coke.

Now if somebody rigs up the input-output device and keeps the button for diet cherry coke permanently selected, leaving you only one input slot, then this device is a new function. We can give it a completely new name like g, or we can call it f(·, y) with the second slot already filled with some constant y, and only the first slot ready to take an input (called x). This function is now

$$f(\cdot, y) : x \mapsto f(x, y) = (x^3 + 3x^2 y^2 + y^3)\,\sin(x^2 + 3y)$$

It is a single variable function, and its derivative at x is what we called ∂f (x, y)/∂x above. So you could write f ′(·, y) = ∂f (x, y)/∂x. (But writing it this way serves no other purpose than to illustrate a notation that may seem weird to you yet.)

The third function we were considering in our example is

$$f(x, \cdot) : y \mapsto f(x, y) = (x^3 + 3x^2 y^2 + y^3)\,\sin(x^2 + 3y)$$

Three different functions, all from the same formula!
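For readers who program, the slot picture has a direct counterpart: filling one slot of a two-argument function produces a new one-argument function. The sketch below is an illustration only; the helper names `fix_second_slot` and `fix_first_slot` and the frozen values 2.0 and 1.0 are made up for the example.

```python
import math

def f(x, y):
    """The two-variable function f(., .) from the text."""
    return (x**3 + 3 * x**2 * y**2 + y**3) * math.sin(x**2 + 3 * y)

def fix_second_slot(func, y):
    """Return the single variable function f(., y): x -> f(x, y)."""
    return lambda x: func(x, y)

def fix_first_slot(func, x):
    """Return the single variable function f(x, .): y -> f(x, y)."""
    return lambda y: func(x, y)

# Two new functions, each with only one open input slot.
f_dot_y = fix_second_slot(f, 2.0)   # second slot frozen at y = 2.0
f_x_dot = fix_first_slot(f, 1.0)    # first slot frozen at x = 1.0

# All three devices use the same formula, but they accept different inputs.
print(f(1.0, 2.0), f_dot_y(1.0), f_x_dot(2.0))
```

The three printed numbers coincide because all three functions evaluate the same formula at the same pair of values; what differs is which slots are still open.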

You may not see the dot-slot notation too often, and it may be abhorrent to physicists, who might rather call the 2-variable function f (x, y), and the 1-variable functions f (x) and f (y) respectively, using the variables to distinguish which function we are talking about, and not bothering to give a name to the function itself.

observation will justify calling the function f from the graph (totally) differentiable – in contradistinction to partial differentiability, which only refers to the existence of the partial derivatives. Look at the lower right figure: The information that enters into the partial derivatives does not ‘see’ anything the function does in the place where the question marks are. The 2-variable function f could be modified wildly in these quadrants without affecting the single variable functions from which the partial derivatives are calculated. In particular, it could be modified so wildly that neither Π nor any other plane could reasonably be considered to be tangent to the graph. And this brings us to the limitations of the partial derivative:

Limitations of the partial derivative, and what we will do about them

As I have just pointed out: If we were to call a multi-variable function differentiable if merely all partial derivatives exist, we would end up calling some rather wild functions ‘differentiable’ that do not deserve to be called differentiable.

Take for instance our old friend

$$(x, y) \mapsto f(x, y) := \begin{cases} \dfrac{xy}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$

This function is not even continuous at (0, 0); but still the single-variable functions f(·, 0) : x ↦ 0 and f(0, ·) : y ↦ 0 are constant and trivially have the derivative 0.

We will define a notion of ‘differentiable’ that still gives us the theorem that we had in single variables, namely: “If f is differentiable, then it is continuous”. In contrast, existence of the partial derivatives does not even guarantee continuity of a function, as we have just seen.
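A small numerical experiment makes the failure visible. It is a sketch, not a proof: it evaluates the difference quotients of the slice x ↦ f(x, 0) at the origin and compares them with the values of f along the diagonal. The list of step sizes is an arbitrary choice.

```python
def f(x, y):
    """f(x, y) = xy / (x^2 + y^2) away from the origin, and 0 at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x * x + y * y)

def difference_quotient(g, t):
    """Difference quotient (g(t) - g(0)) / t of a single variable function g."""
    return (g(t) - g(0.0)) / t

steps = [10.0 ** (-k) for k in range(1, 8)]

# Slice along the x-axis: every difference quotient is 0, so the partial is 0.
partial_x_quotients = [difference_quotient(lambda x: f(x, 0.0), t) for t in steps]

# Values along the diagonal: they stay at 1/2 and do not approach f(0, 0) = 0.
diagonal_values = [f(t, t) for t in steps]

print(partial_x_quotients)
print(diagonal_values)
```

The partial derivative only looks along the axis, where f is identically 0, so it cannot notice the value 1/2 that f takes arbitrarily close to the origin on the diagonal.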

There is a practical limitation as well, and it is related to the theoretical limitation just mentioned. If the variables x and y are indeed Cartesian coordinates of a point P, with the function depending on the point (geometrically), then the function P ↦ f(P) has a meaning independent of the coordinate directions we choose. We could construct a continuous function like, e.g. (note the square root this time),

$$(x, y) \mapsto g(x, y) := \begin{cases} \dfrac{xy}{\sqrt{x^2 + y^2}} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$

In polar coordinates, $g(x, y) = g(r\cos\varphi, r\sin\varphi) = r\sin\varphi\cos\varphi$. We can slice the graph of g in many directions, not only the two directions that were arbitrarily singled out as coordinate directions, and we get infinitely many single variable functions, just from slices through the origin; for instance $t \mapsto g(t, t) = |t|/\sqrt{2}$, if we slice along the diagonal of the x-y-plane. This time, the function is continuous at the origin, and we also have tangent lines in all coordinate directions, but these tangent lines still do not assemble into a plane.

If the input of the function has a geometric meaning, then partial derivatives single out certain directions arbitrarily as ‘coordinate directions’, at the neglect of other directions. (Think about the other meaning of ‘partial’, whose opposite is not ‘total’ but ‘impartial’). We will also construct a notion of directional derivative, which generalizes partial derivatives in the sense that any direction can be used for getting a single-variable slice of the graph. With this notion, partial derivatives will simply be directional derivatives in coordinate directions. But again, even the existence of all directional derivatives in a point is not sufficient for the existence of a tangent plane, as the example g shows.

Even worse: Even if all directional derivatives at one point exist and are 0 (so that the tangent lines neatly fit together into a plane, namely a plane z = const), this still does not guarantee the continuity of the function at this point. The function from Homework #8, $h(x, y) := x^2 y^4/(x^4 + y^8)$ for (x, y) ≠ (0, 0) and h(0, 0) = 0, is an example of this phenomenon. (Details later.)
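The behavior of h can be explored numerically before the details are worked out. The sketch below is an illustration under stated assumptions: the list of sample directions and the step t = 1e-6 are arbitrary, and a finite sample of directions does not prove anything about all of them. It compares difference quotients along straight lines through the origin with the values of h along the parabola x = y^2.

```python
def h(x, y):
    """h(x, y) = x^2 y^4 / (x^4 + y^8) away from the origin, and 0 at the origin."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x**2 * y**4 / (x**4 + y**8)

def directional_quotient(vx, vy, t):
    """Difference quotient of the slice t -> h(t*vx, t*vy) at t = 0."""
    return (h(t * vx, t * vy) - h(0.0, 0.0)) / t

# A few sample directions (not normalized), chosen only for illustration.
directions = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0), (1.0, 2.0), (2.0, -1.0)]

# Along each straight line the quotient is already tiny for a small step.
line_quotients = [directional_quotient(vx, vy, 1e-6) for vx, vy in directions]

# Along the parabola x = y^2 the function value stays near 1/2.
parabola_values = [h(y**2, y) for y in (0.1, 0.01, 0.001)]

print(line_quotients)
print(parabola_values)
```

Straight lines through the origin see quotients that shrink with t, while the curved approach along the parabola sees values near 1/2, far from h(0, 0) = 0.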

Planes, linear maps, derivatives, and matrices

Graphs that are planes; and linear maps

What does a 2-variable function T look like whose graph is a plane? Well, all the single variable functions T(·, y) : x ↦ T(x, y) should have graphs that are straight lines, and all these lines should have the same slope: So T(x, y) = g(y) + mx. Since the single variable function T(x_0, ·) : y ↦ T(x_0, y) would also have to graph as a line, we need g(y) = a + ny for some constants a and n. In other words, it is just the linear functions T(x, y) = a + mx + ny whose graphs are planes.

In natural generalization, we call an ℓ-variable function T linear inhomogeneous (or: affine) if it is of the form $T(x_1, \dots, x_\ell) = a + m_1 x_1 + \dots + m_\ell x_\ell$ with constants a and $m_j$ (j = 1, …, ℓ). We call a vector-valued function $\vec{T}$ linear inhomogeneous (or: affine) if each component function $T_i$ is a linear inhomogeneous function, i.e., if we can write, with constants $a_i$ and $m_{ij}$:

$$\begin{aligned}
T_1(x_1, \dots, x_\ell) &= a_1 + m_{11} x_1 + \dots + m_{1\ell} x_\ell \\
T_2(x_1, \dots, x_\ell) &= a_2 + m_{21} x_1 + \dots + m_{2\ell} x_\ell \\
&\;\;\vdots \\
T_k(x_1, \dots, x_\ell) &= a_k + m_{k1} x_1 + \dots + m_{k\ell} x_\ell
\end{aligned} \tag{LF}$$

In the case k = 1, ℓ = 2, which we can neatly graph in R^3 , the graph is a plane.

Note: The words ‘linear inhomogeneous’ or ‘affine’ are more popular in Linear Algebra. In Calculus, we often say simply ‘linear’ instead of linear inhomogeneous / affine. In Linear Algebra, the word ‘linear’ alone refers to the special case of (LF) where all ai are zero.

Matrices

Matrices have been invented as a more concise notation for situations like (LF). This notation will actually condense (LF) to a form that makes it look very similar to the scalar valued single variable case T (x) = a + mx.

Earlier we had combined the components $x_1, \dots, x_\ell$ into a vector, abbreviated as $\vec{x}$, which even turns out to have a geometric interpretation, so we are not merely talking about an abbreviation, but rather $\vec{x}$ is the ‘real thing’ and its components $x_j$ are merely pieces of $\vec{x}$ defined in terms of an arbitrarily chosen Cartesian coordinate system. In a similar spirit, we now arrange the coefficients $m_{ij}$ into a rectangular array and call this whole array a matrix.

$$M := \begin{bmatrix}
m_{11} & \cdots & m_{1\ell} \\
m_{21} & \cdots & m_{2\ell} \\
\vdots & & \vdots \\
m_{k1} & \cdots & m_{k\ell}
\end{bmatrix}$$
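To see how the array M condenses the system (LF), here is a minimal sketch that evaluates an affine map component by component from the constants $a_i$ and the rows of M. The function name `affine_map` and the sample numbers (k = 2, ℓ = 3) are assumptions made only for this illustration.

```python
def affine_map(a, M, x):
    """Evaluate T(x) = a + M x, where a has k entries, M is a list of k rows
    with l entries each, and x has l entries (this is (LF) row by row)."""
    return [a_i + sum(m_ij * x_j for m_ij, x_j in zip(row, x))
            for a_i, row in zip(a, M)]

# Hypothetical example with k = 2 components and l = 3 variables.
a = [1.0, -2.0]
M = [[2.0, 0.0, 1.0],
     [0.0, 3.0, -1.0]]
x = [1.0, 1.0, 2.0]

result = affine_map(a, M, x)
print(result)  # [5.0, -1.0]
```

Each output component uses exactly one row of M, which is the bookkeeping that the matrix notation hides.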

draw such a plane in R^3. In other cases, this sentence is merely a way of speaking, which, by metaphor carries over our geometric intuition into ‘dimensions never beheld by human eyes’. More formally, consider this statement a definition of ‘tangent plane’ in these cases of higher dimension.

In the case k = 1, ℓ = 1 of single variable calculus, our definition of total differentiability reduces to the old definition of differentiability for single variable functions. $Df(\vec{x}^*)$ would have to be a 1 × 1 matrix, which we usually identify with the number that is the one and only entry of this matrix; and this number is what was called $f'(x^*)$ in single variable calculus. Indeed, in the SV case, the norms ‖ · ‖ become absolute values | · |, and our definition asserts that the limit

$$\lim_{h \to 0} \frac{f(x^* + h) - f(x^*) - Th}{h}$$

vanishes for some number (‘1 × 1 matrix’) T. But this means that $\lim_{h \to 0} \frac{f(x^* + h) - f(x^*)}{h}$ exists and is T. So T is the derivative $f'(x^*)$.

Let’s drop the ∗ from the notation now. We study differentiability at a point $\vec{x} = [x_1, x_2, \dots, x_\ell]^T$.

Next, we’ll see that total differentiability of $\vec{f}$ implies the existence of the partial derivatives, and that the entries of T are precisely these partial derivatives. We do this by choosing special vectors $\vec{h}$, namely those that point in coordinate directions. For simplicity, assume that f is scalar valued. Let’s choose $\vec{h} = [t, 0, 0, \dots]^T$. We are looking for a (row) matrix $T = [T_1, T_2, \dots, T_\ell]$ that satisfies the definition. Note that $T\vec{h} = tT_1 + 0T_2 + \dots + 0T_\ell = tT_1$. Differentiability requires in particular (for our chosen vector $\vec{h}$) that

$$\lim_{t \to 0} \frac{f(x_1 + t, x_2, \dots, x_\ell) - f(x_1, x_2, \dots, x_\ell) - tT_1}{t} = 0.$$

But this identifies $T_1$ as the partial derivative $\partial f(\vec{x})/\partial x_1$. If we had chosen $\vec{h}$ to be $[0, t, 0, \dots]^T$ instead, we would have selected the second entry $T_2$ of T and identified it with the partial derivative $\partial f(\vec{x})/\partial x_2$, and so on. It is clear from this deliberation that partial derivatives arise from the total derivative by choosing specific vectors $\vec{h}$ in the definition of differentiability. Total differentiability requires that the limit in the definition exists even without any restriction on how $\vec{h}$ goes to 0.

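The same considerations carry over to the vector valued case. For a function $\vec{f}$ with component functions $f_1, \dots, f_k$, the limit (TD) in the above definition will be zero if and only if the corresponding limit for each component function is 0. Our conclusion is:

If $\vec{f}$ is differentiable, then

$$D\vec{f}(\vec{x}) = \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dfrac{\partial f_1}{\partial x_3} & \cdots & \dfrac{\partial f_1}{\partial x_\ell} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dfrac{\partial f_2}{\partial x_3} & \cdots & \dfrac{\partial f_2}{\partial x_\ell} \\
\vdots & \vdots & \vdots & & \vdots \\
\dfrac{\partial f_k}{\partial x_1} & \dfrac{\partial f_k}{\partial x_2} & \dfrac{\partial f_k}{\partial x_3} & \cdots & \dfrac{\partial f_k}{\partial x_\ell}
\end{bmatrix}$$

So different rows of $D\vec{f}(\vec{x})$ correspond to different components of the function $\vec{f}$. For scalar valued functions f, the matrix $Df(\vec{x})$ is made up of only one row. Different columns of $D\vec{f}(\vec{x})$ correspond to the different variables. While the k × ℓ matrix in this formula can always be constructed when the partial derivatives exist, this matrix only deserves the name $D\vec{f}(\vec{x})$ if $\vec{f}$ is totally differentiable at $\vec{x}$.

Only if $\vec{f}$ is totally differentiable does this matrix give an appropriate linear approximation to the function near $\vec{x}$.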
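The row and column bookkeeping can be tried out numerically. The sketch below builds the k × ℓ array of partial derivatives by symmetric difference quotients, one column per variable. The helper `numerical_jacobian`, the sample function `example`, and the point (1, 2) are assumptions chosen for illustration. Note that the code only produces the array of partials; as the text stresses, that array deserves the name of a derivative only when the function is totally differentiable, which holds for the smooth sample function used here.

```python
import math

def numerical_jacobian(func, x, step=1e-6):
    """Approximate the k-by-l array of partial derivatives of func at x.
    Row i belongs to component f_i, column j to variable x_j."""
    k = len(func(x))
    l = len(x)
    jac = [[0.0] * l for _ in range(k)]
    for j in range(l):
        forward = list(x)
        backward = list(x)
        forward[j] += step     # vary only the j-th variable
        backward[j] -= step
        f_plus = func(forward)
        f_minus = func(backward)
        for i in range(k):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return jac

def example(x):
    """Hypothetical smooth function from R^2 to R^3."""
    x1, x2 = x
    return [x1**2 * x2, math.sin(x1) + x2, x1 * x2**2]

def example_jacobian(x):
    """Partial derivatives of `example`, computed by hand."""
    x1, x2 = x
    return [[2 * x1 * x2, x1**2],
            [math.cos(x1), 1.0],
            [x2**2, 2 * x1 * x2]]

point = [1.0, 2.0]
approx = numerical_jacobian(example, point)
exact = example_jacobian(point)
for row_a, row_e in zip(approx, exact):
    print(row_a, row_e)
```

The array has three rows (one per component) and two columns (one per variable), matching the layout of the matrix in the formula.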

From now on, I’ll omit the vector arrow from $\vec{f}$, regardless of whether f is scalar valued or vector valued. I will retain the vector arrow on $\vec{x}$.

Proving Total Differentiability

Before entering into the task outlined in the headline, let’s note a very easy consequence of differentiability:

Theorem: If f is totally differentiable at $\vec{x}$, then it is continuous there.

The proof is easy. If

$$\lim_{\vec{h} \to \vec{0}} \frac{\|f(\vec{x} + \vec{h}) - f(\vec{x}) - T\vec{h}\|}{\|\vec{h}\|} = 0,$$

then in particular the numerator must go to 0. So we get

$$\lim_{\vec{h} \to \vec{0}} \left( f(\vec{x} + \vec{h}) - f(\vec{x}) - T\vec{h} \right) = \vec{0} \text{ or } 0 \text{ (as the case may be).}$$

Since $T\vec{h} \to \vec{0}$ or 0 automatically as $\vec{h} \to \vec{0}$, we conclude $f(\vec{x} + \vec{h}) \to f(\vec{x})$, i.e., f is continuous at $\vec{x}$.

Pedestrian Differentiability Proofs:

In principle, to prove that a function is totally differentiable, you first need to find an appropriate matrix T to be used in definition (TD), then you have to check the limit property that is required in the definition. Finding T is easy, because the matrix formed from the partial derivatives is the only possible candidate, and partial derivatives are easy to calculate. The labor then consists of checking the limit property. We’ll see an example below, and another one is in Hwk. #18.

Easy Differentiability Proofs:

Easy proofs are available if the partial derivatives you have computed as the only possible entries of the matrix T turn out to be continuous functions in a neighborhood of a point $\vec{x}$. (That means, of course, that they have to be continuous in the multi-variable sense; continuity of the single variable functions obtained by freezing all but one variable will not suffice.) In that case, there is a theorem that guarantees that f is differentiable at $\vec{x}$, and we save a lot of work.

Note: when I say ‘continuous in a neighborhood of $\vec{x}$’, I mean: there is a little ball around $\vec{x}$ in which the functions in question are continuous.

A proof of the theorem in question is a very useful exercise to begin understanding the notion of total differentiability; so you do not want to skip over this proof (below).

Example of a pedestrian differentiability proof:

We consider the 2-variable function $f(x, y) = \dfrac{x^2 y^2}{x^2 + y^2}$ for (x, y) ≠ (0, 0) and f(0, 0) = 0. For the sake of comparison, we will also study the function $g(x, y) = \dfrac{xy}{x^2 + y^2}$ for (x, y) ≠ (0, 0) and g(0, 0) = 0.

We prove that f is differentiable at the origin, but g is not.

First we note that $f(x, 0) = x^2 \cdot 0^2/(x^2 + 0^2) = 0$ for x ≠ 0, and of course f(0, 0) = 0 also. So the single variable function x ↦ f(x, 0) is the constant 0. Its derivative at x = 0 is 0. (Its derivative is 0 everywhere, but it is x = 0 we are interested in.) We have concluded $\frac{\partial f}{\partial x}(0, 0) = 0$. The very same argument applies to show $\frac{\partial f}{\partial y}(0, 0) = 0$. The only matrix T that could be Df(0, 0) is [0, 0].
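With the candidate T = [0, 0], the fraction in (TD) at the origin is simply |f(h, k)| divided by the length of (h, k), and the same holds for g, whose partials at the origin also vanish. The following sketch evaluates this fraction along the diagonal for both functions. It only illustrates the expected outcome and does not replace the limit argument; the name `td_ratio` and the step sizes are arbitrary choices.

```python
import math

def f(x, y):
    """f(x, y) = x^2 y^2 / (x^2 + y^2), with f(0, 0) = 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x**2 * y**2 / (x**2 + y**2)

def g(x, y):
    """g(x, y) = x y / (x^2 + y^2), with g(0, 0) = 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x**2 + y**2)

def td_ratio(func, h, k):
    """The fraction in (TD) at the origin for the candidate matrix T = [0, 0]:
    |func(h, k) - func(0, 0) - 0*h - 0*k| / ||(h, k)||."""
    return abs(func(h, k) - func(0.0, 0.0)) / math.hypot(h, k)

steps = [10.0 ** (-n) for n in range(1, 7)]

# Approach the origin along the diagonal (h, k) = (t, t).
f_ratios = [td_ratio(f, t, t) for t in steps]
g_ratios = [td_ratio(g, t, t) for t in steps]

print(f_ratios)  # shrinking towards 0
print(g_ratios)  # growing without bound
```

For f the ratio along the diagonal equals t/(2√2) and shrinks with t, while for g it equals 1/(2√2 t) and blows up, so no choice of approach can rescue g.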

In this layout, the first ‘staircase’ just rewrites the two f terms as a ‘telescoping sum’; the second ‘staircase’ models the partials we have in the formula for Num, but we have changed the arguments to match the ones in the first ‘staircase’. The last line merely corrects for the modifications made in the second ‘staircase’.

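Let’s begin with what this last line contributes to the fraction in (TD); to this end we throw in the denominator $\sqrt{h^2 + k^2 + l^2}$ again:

$$\left( \frac{\partial f(x, y + k, z + l)}{\partial x} - \frac{\partial f(x, y, z)}{\partial x} \right) \frac{h}{\sqrt{h^2 + k^2 + l^2}} + \left( \frac{\partial f(x, y, z + l)}{\partial y} - \frac{\partial f(x, y, z)}{\partial y} \right) \frac{k}{\sqrt{h^2 + k^2 + l^2}}$$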

The fractions have absolute value ≤ 1, and the differences in the parentheses go to 0, because the partials are continuous.

Now we combine matching steps in the two staircases. The first of them contributes

$$\frac{f(x + h, y + k, z + l) - f(x, y + k, z + l) - \dfrac{\partial f(x, y + k, z + l)}{\partial x}\, h}{h} \cdot \frac{h}{\sqrt{h^2 + k^2 + l^2}}$$

to the fraction in (TD). Here we notice that $f(x + h, y + k, z + l) - f(x, y + k, z + l) = \frac{\partial f(*, y + k, z + l)}{\partial x}\, h$ by the mean value theorem for the single variable function f(·, y + k, z + l), where ∗ is some number between x and x + h. Again we have exhibited a contribution that goes to 0 as h → 0 by the continuity of the partials.¹ The same reasoning applies for the other two steps of the staircases. So if we put the quantity Num into formula (TD), we obtain a sum of terms, each of which goes to 0 as (h, k, l) → (0, 0, 0). And this proves total differentiability of f at (x, y, z).

Directional Derivative, and Geometric Interpretation of $Df(\vec{x})$ as ‘Vector Eater’

We have seen that total differentiability implies the existence of partial derivatives. To see this, we merely had to choose for the vector $\vec{h}$ vectors $t[1, 0, 0, \dots]^T$, $t[0, 1, 0, \dots]^T$ etc.: vectors pointing in coordinate directions. Let us instead use vectors $\vec{h} := t\vec{v}$ with $\vec{v}$ a fixed vector pointing in any direction, coordinate or not.

We then get a single variable function $t \mapsto f(\vec{x} + t\vec{v})$, which is obtained by restricting the multi-variable function f to inputs on the line $\{\vec{x} + t\vec{v} \mid t \in \mathbb{R}\}$. If the derivative of this single variable function at t = 0 exists, we call this quantity the directional derivative of f at $\vec{x}$ in direction $\vec{v}$, and denote it as $\partial_{\vec{v}} f(\vec{x})$. In formulas

$$\partial_{\vec{v}} f(\vec{x}) := \left. \frac{d}{dt} f(\vec{x} + t\vec{v}) \right|_{t=0}$$

Some authors use the word ‘directional’ derivative only if $\vec{v}$ has length 1, because the quantity in question depends both on the direction and the length of $\vec{v}$. Only by normalizing (fixing) the length a priori do we get a quantity that depends only on the direction. In this class however, I will not restrict the length of $\vec{v}$ and accept the drawback that the word ‘directional derivative’ could then be slightly misleading.

¹ If you have a really excellent Hons Calc 2 vision, you’ll see that we actually use that the partials are uniformly continuous, which is implied by continuity on a bounded and closed domain. If you don’t see this subtlety, ignore it in peace for now and try again seeing it after the course Math 341.

It is a healthy (and hopefully simple) exercise for you to prove the following Theorem: If f is totally differentiable at $\vec{x}$, then it has a directional derivative in each direction $\vec{v}$, and this derivative equals $Df(\vec{x})\vec{v}$.

As with partial derivatives, even the existence of all directional derivatives in a point does not guarantee total differentiability, as is seen in Homework #13.

I used the symbol $\vec{v}$ for the direction vector and refrained from enforcing length 1 on it. The idea I have in mind is that $\vec{v}$ may be a velocity. Think of a function $f : \vec{x} \mapsto f(\vec{x})$ as a temperature function, depending on a location $\vec{x} \in \mathbb{R}^3$. Now if I start out at $\vec{x}$, thermometer in hand, and move with velocity $\vec{v}$, I’ll be at location $\vec{x} + t\vec{v}$ at time t. My thermometer records the temperature at each time t. That is, it records the temperature at the location where I am at time t. The rate of change of this temperature with respect to time is what we called directional derivative. Of course if I move faster, I’ll experience faster temperature changes: this accounts for the dependence on the length of $\vec{v}$ that is being hidden by the name ‘directional derivative’. But more significantly, the rate of change of the temperature will in general depend on the direction in which I am moving. This issue is absent in single variable calculus, because there is only one direction on the real line. (‘Negative direction’ is nothing but (−1) times positive direction, so it contributes no independent information about rates of change.²)

The notion of directional derivative is useful to understand why the total derivative has to be such a ‘bulky’ object like a matrix: It needs to have many pieces of information incorporated in it. In single variable calculus, every change dx in the input x is a multiple of one standard change +1. To tell how the output changes, all that is needed is one number f′(x) that gives the amplification of the input change dx into an output change dy = f′(x) dx (of course in linear approximation only). In multivariable calculus, if there is a notion of derivative that is to tell you the rate of change of the output $f(\vec{x})$ as you change the input $\vec{x}$, this thing ‘derivative’ must ask back: ‘In which direction do you change the input?’ So it asks for a vector $\vec{v}$, and in response it gives you a rate of change. Seen from this vantage point, it is clear that the derivative $Df(\vec{x})$ is not a vector, even though it has as many entries as a vector. Rather it is a ‘vector eater’: You must feed it a vector and it produces for you a rate of change (which is a number or a vector, depending on whether f is scalar valued or vector valued).

This distinction is reflected in the row vs column distinction: columns represent vectors, rows represent ‘vector eaters’. They are called ‘forms’ in more advanced mathematical contexts, but let’s keep the more descriptive word ‘vector eater’ just for fun for the purposes of this class.

In some MVC textbooks, you will see this distinction omitted ‘for simplicity’. Such simplification is perfectly good for crunching calculational problems, but it comes at the expense of disconnecting the geometric intuition from the calculational formalism.

An outlook far ahead: There are two ‘upgrades’ of MVC that you may encounter in more advanced courses: You may study ‘infinitely many variables’ (called functional analysis). In that context, the distinction between vector eaters and vectors becomes much more substantial and cannot be covered up by simply converting a row into a column.

² In linear algebra language, if you know it, I’d say that there is only one linearly independent direction on the real line.

For reference, let me quote the familiar here:

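$$Df(\vec{x}) = \begin{bmatrix} \dfrac{\partial f(\vec{x})}{\partial x_1} & \dfrac{\partial f(\vec{x})}{\partial x_2} & \cdots & \dfrac{\partial f(\vec{x})}{\partial x_\ell} \end{bmatrix}, \qquad \nabla f(\vec{x}) = Df(\vec{x})^T = \begin{bmatrix} \dfrac{\partial f(\vec{x})}{\partial x_1} \\ \dfrac{\partial f(\vec{x})}{\partial x_2} \\ \vdots \\ \dfrac{\partial f(\vec{x})}{\partial x_\ell} \end{bmatrix}$$

The directional derivative is

$$\partial_{\vec{v}} f(\vec{x}) = Df(\vec{x})\,\vec{v} = \nabla f(\vec{x}) \cdot \vec{v}$$

The product in $Df(\vec{x})\vec{v}$ is a matrix product, the product in $\nabla f(\vec{x}) \cdot \vec{v}$ is the dot product of vectors.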
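The formula can be tested on a concrete smooth function by comparing the derivative of the slice t ↦ f(x + tv) at t = 0 with the dot product of the gradient and v. The sketch below uses a made-up ‘temperature’ function, a made-up location and a made-up velocity; none of them come from the notes. It also shows the dependence on the length of v: doubling the velocity doubles the rate.

```python
import math

def temperature(x, y, z):
    """Hypothetical smooth temperature function on R^3."""
    return x * y + x**2 * math.sin(z)

def gradient(x, y, z):
    """Partial derivatives of `temperature`, computed by hand."""
    return [y + 2 * x * math.sin(z), x, x**2 * math.cos(z)]

def directional_derivative(func, point, v, step=1e-6):
    """Symmetric difference quotient of t -> func(point + t v) at t = 0."""
    plus = [p + step * c for p, c in zip(point, v)]
    minus = [p - step * c for p, c in zip(point, v)]
    return (func(*plus) - func(*minus)) / (2 * step)

def dot(a, b):
    """Dot product of two vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

location = [1.0, 2.0, 0.5]
velocity = [3.0, -1.0, 2.0]          # deliberately not of length 1
double_velocity = [2 * c for c in velocity]

rate_numeric = directional_derivative(temperature, location, velocity)
rate_formula = dot(gradient(*location), velocity)
rate_doubled = directional_derivative(temperature, location, double_velocity)

print(rate_numeric, rate_formula, rate_doubled)
```

The first two numbers agree up to the accuracy of the difference quotient, and the third is twice as large, as expected for a traveler moving twice as fast in the same direction.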

For the moment, we now do fix the length of $\vec{v}$ to be 1, since we will now be interested in effects of the direction of $\vec{v}$ only; we ask the question: In which direction $\vec{v}$ is the rate of change of f largest? You may be inclined to use calculus to answer this question, since it is a maximum problem after all. But algebra does it much more easily: We note, from the Cauchy-Schwarz inequality, that $\nabla f(\vec{x}) \cdot \vec{v} \le \|\nabla f(\vec{x})\|\,\|\vec{v}\| = \|\nabla f(\vec{x})\|$. If $\vec{v}$ actually has the same direction as $\nabla f(\vec{x})$, then the dot product is equal to $\|\nabla f(\vec{x})\|$ by the geometric definition of the dot product, or by direct calculation with $\vec{v} := \nabla f(\vec{x})/\|\nabla f(\vec{x})\|$.

For all other directions, the directional derivative is strictly less than $\|\nabla f(\vec{x})\|$. Geometrically this is because then the cos ϕ in the definition of the dot product is strictly < 1. Algebraically speaking, we can see the same thing from a second look into the proof of Cauchy-Schwarz. If we do this, we see that $\vec{a} \cdot \vec{b} = \|\vec{a}\|\,\|\vec{b}\|$ only if $\vec{a} = t\vec{b}$. (Assuming $\vec{b} \neq \vec{0}$.)

So here is what we conclude: The direction of $\nabla f(\vec{x})$ is the direction in which we have to go from $\vec{x}$ in order to experience the greatest rate of change. The rate of change we experience in this direction is the length (norm) of $\nabla f(\vec{x})$. If we move at right angle to $\nabla f(\vec{x})$, then the rate of change experienced is 0 (because in the dot product, the cosine of the angle is 0).
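This conclusion can be checked by brute force in two variables: scan many unit vectors, compute the rate of change for each, and compare the largest rate with the norm of the gradient. The function f(x, y) = x^2 y + y^3, the point (1, 2), and the grid of 3600 angles in the sketch below are assumptions chosen only for this illustration.

```python
import math

def grad_f(x, y):
    """Gradient of the hypothetical function f(x, y) = x^2 y + y^3."""
    return (2 * x * y, x * x + 3 * y * y)

gx, gy = grad_f(1.0, 2.0)
grad_norm = math.hypot(gx, gy)

# Rate of change in direction (cos a, sin a) is the dot product with the gradient.
angles = [2 * math.pi * n / 3600 for n in range(3600)]
rates = [gx * math.cos(a) + gy * math.sin(a) for a in angles]

best_rate = max(rates)
best_angle = angles[rates.index(best_rate)]
gradient_angle = math.atan2(gy, gx)

# Unit vector at right angle to the gradient, and the rate experienced there.
perp = (-gy / grad_norm, gx / grad_norm)
perp_rate = gx * perp[0] + gy * perp[1]

print("largest sampled rate:", best_rate, " norm of gradient:", grad_norm)
print("best sampled angle:", best_angle, " angle of gradient:", gradient_angle)
print("rate at right angle to the gradient:", perp_rate)
```

No sampled direction beats the norm of the gradient, the best sampled direction is the grid angle closest to the direction of the gradient, and the rate at right angle to the gradient vanishes up to rounding.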

The following discussion is a tad informal and will become more rigorous after we have covered the multi-variable version of the chain rule: Assume we move not along a line $\vec{x} + t\vec{v}$ but along a level set, on which f is constant by definition of level set. The derivative (with respect to time t as we are moving) is therefore 0. At any moment, the velocity vector will be tangent to the level set, because we are moving within the level set. If the fact that we are not actually exploring f along a straight line but along a bent path doesn’t cause trouble (and the chain rule will tell us it doesn’t), we should still observe the directional derivative in direction $\vec{v}$, which is tangential to the level set. Since this directional derivative is 0, we would have to be moving orthogonal to the gradient (unless the gradient vanishes, in which case it does not specify a direction at all).

This means that the gradient will always be orthogonal to the level sets of a function.
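A concrete case where everything can be computed by hand is f(x, y) = x^2 + 4y^2, whose level sets are ellipses. The sketch below is an illustration under that assumption: it walks around one level set with an explicit parametrization (also an assumption of the example) and checks that the velocity of the walk is orthogonal to the gradient at every sampled point.

```python
import math

def f(x, y):
    """Hypothetical example function; its level sets are ellipses."""
    return x**2 + 4 * y**2

def grad_f(x, y):
    """Gradient of f, computed by hand."""
    return (2 * x, 8 * y)

def position(s, c=3.0):
    """Point on the level set f = c^2, parametrized by s."""
    return (c * math.cos(s), 0.5 * c * math.sin(s))

def velocity(s, c=3.0):
    """Derivative of `position` with respect to s: tangent to the level set."""
    return (-c * math.sin(s), 0.5 * c * math.cos(s))

samples = [2 * math.pi * n / 12 for n in range(12)]

# f is constant along the walk, and gradient . velocity vanishes at each sample.
values = [f(*position(s)) for s in samples]
dots = []
for s in samples:
    gx, gy = grad_f(*position(s))
    vx, vy = velocity(s)
    dots.append(gx * vx + gy * vy)

print(values)
print(dots)
```

All sampled values of f equal 9 up to rounding, and all dot products vanish up to rounding, which is the orthogonality statement in this special case.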

The following facts can be proved rigorously with more advanced methods, but can and should be appreciated at this stage: We consider a continuously differentiable function of two or three variables. (Could be more variables also, but I want to refer to your geometric intuition). Continuously differentiable means (a) differentiable and (b) the partial derivatives are continuous functions. In this case, the matrix-valued function ~x 7 → Df (~x) is continuous automatically. Then the following facts hold for level sets of f :

For two variables ~x = [x, y]^T: At any point ~x where ∇f (~x) is not the zero vector, the level set that passes through ~x looks like a smooth curve (graph of a continuously differentiable function y = g(x), or x = h(y)) in some ball around that point ~x. (Look in particular at the level sets that were the solution of Hwk #11.)

For three variables ~x = [x, y, z]^T: At any point ~x where ∇f (~x) is not the zero vector, the level set that passes through ~x looks like a smooth surface (graph of a continuously differentiable function z = g(x, y), or y = h(x, z), or x = k(y, z)) in some ball around that point ~x.

At points where ∇f (~x) = ~0, the level sets may look weird or ‘untypical’: The following list of ‘building blocks’ for level sets in two variables is not exhaustive, but features the most common examples: At points where ∇f (~x) = ~0, the level set may consist of a single isolated point, or it could feature two (or sometimes more) smooth curves that are crossing each other. The level set might also look like a smooth piece of curve, giving no indication of the vanishing gradient.

We call any point where the gradient of f vanishes a critical point of f. The relevance of this notion is the following: If f has a local minimum or a local maximum at an interior point ~x∗ of the domain of f , then the gradient of f vanishes there. (Can you see why? This can be seen using the single variable slice functions only.) Conversely, the vanishing of ∇f (~x∗) is no guarantee that f has a minimum or a maximum at ~x∗. (As in single variables, where the vanishing of the derivative doesn’t guarantee a minimum or a maximum either.) A new alternative to minimum and maximum that occurs with several variables is the possibility of saddle points. A saddle point is one that looks like a single variable maximum in some directions and like a single variable minimum in some other directions. The origin is a saddle point in Hwk. #11. A level line that goes through a saddle point will typically have a crossing there. We will require second derivatives to distinguish minima, maxima, and saddle points, and this will be studied later.
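A minimal sketch of a saddle point, assuming the hypothetical textbook function f(x, y) = x^2 − y^2 (which is not necessarily the function of Hwk. #11): the gradient vanishes at the origin, one slice looks like a minimum, the other like a maximum, and the level set through the origin consists of two crossing lines.

```python
def f(x, y):
    # Hypothetical sample function with a critical point at the origin.
    return x**2 - y**2

def grad_f(x, y):
    # Partial derivatives of f, computed by hand.
    return (2.0 * x, -2.0 * y)

step = 0.1
gradient_at_origin = grad_f(0.0, 0.0)

# Slice along the x-axis: f is larger on both sides, like a minimum.
x_slice = [f(-step, 0.0), f(0.0, 0.0), f(step, 0.0)]
# Slice along the y-axis: f is smaller on both sides, like a maximum.
y_slice = [f(0.0, -step), f(0.0, 0.0), f(0.0, step)]
# Along the diagonals y = x and y = -x, f stays at the level f(0, 0) = 0:
# these are the two level lines that cross at the saddle point.
diagonal_values = [f(step, step), f(step, -step)]

print(gradient_at_origin, x_slice, y_slice, diagonal_values)
```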

Rules for differentiation; in particular the chain rule

The following simple differentiation rules carry over from single variable calculus and are easy to prove.

  • The sum of differentiable functions is differentiable. If h = f + g, then Dh(~x) = Df (~x) + Dg(~x). (Similarly for differences.)
  • The product of scalar valued differentiable functions is differentiable. If h = f g, then Dh(~x) = f (~x)Dg(~x) + Df (~x)g(~x). The products on the right hand side are of course ‘scalar times matrix’.
  • The ratio of scalar valued differentiable functions is differentiable where the denominator doesn’t vanish. If h = f /g, then Dh(~x) = −(f (~x)/g(~x)^2) Dg(~x) + Df (~x) (1/g(~x)).
  • The single-variable product rule carries over to the dot product of vector valued functions. If h = f~ · ~g, then h′(t) = f~ ′(t) · ~g(t) + f~ (t) · ~g′(t).
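As a sanity check of the ratio rule from the list above, read as Dh = −(f /g^2) Dg + (1/g) Df, here is a short sketch that compares it with finite differences. The functions f and g are hypothetical examples chosen only for illustration.

```python
def f(x, y):
    # Hypothetical numerator.
    return x * y + 1.0

def g(x, y):
    # Hypothetical denominator; it never vanishes.
    return x**2 + y**2 + 1.0

def Df(x, y):
    # Row of partial derivatives of f.
    return [y, x]

def Dg(x, y):
    # Row of partial derivatives of g.
    return [2.0 * x, 2.0 * y]

def Dh_quotient_rule(x, y):
    # Dh = -(f / g^2) Dg + (1 / g) Df, entry by entry.
    fv, gv = f(x, y), g(x, y)
    return [-(fv / gv**2) * dg + df / gv for df, dg in zip(Df(x, y), Dg(x, y))]

def Dh_finite_difference(x, y, eps=1e-6):
    # Central differences of h = f / g in each variable.
    def h(a, b):
        return f(a, b) / g(a, b)
    return [(h(x + eps, y) - h(x - eps, y)) / (2 * eps),
            (h(x, y + eps) - h(x, y - eps)) / (2 * eps)]

exact = Dh_quotient_rule(0.5, -1.5)
approx = Dh_finite_difference(0.5, -1.5)
print(exact, approx)   # the two rows agree up to the finite-difference error
```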

The one rule that requires discussion and training is the chain rule. Actually it also carries over without modification from single variable calculus if you rely on the total derivative and matrix multiplication consistently. However, most of the time you will use it in a form that

But apart from this proof detail, this picture makes us understand why (g ◦ f )′(x) = g′(f (x))f ′(x).

Now the punchline is that the very same argument carries over almost literally to the multi-variable setting: This is a benefit of working with the total derivative as the primary object and viewing the partial derivatives as ‘parts’ of the total derivative, rather than viewing the partial derivatives as the primary pieces of information that need to be ‘somehow organized into a matrix or vector or whatever’.

The only changes that we need to make are: f and g may be vector valued, x and y may be vectors now, and instead of f ′, we have chosen to call the derivative Df. The input error ‘amplification’ is not merely achieved by multiplying with a number, but rather by multiplying with a matrix. This distinction is very natural, because deviation in different input variables may have different effects on the output; and matrix multiplication can achieve this effect, whereas multiplication by mere numbers cannot. So let’s redo the previous picture in the new notation:

~x −→ f~ −→ f~ (~x) =: ~y −→ ~g −→ ~g( f~ (~x))

d~x −→ D f~ (~x) · −→ D f~ (~x)d~x =: d~y −→ D~g(~y) · −→ D~g(~y)d~y = D~g( f~ (~x))D f~ (~x)d~x

~x −→ ~g ◦ f~ −→ (~g ◦ f~ )(~x)

d~x −→ D~g( f~ (~x))D f~ (~x) · −→ D~g( f~ (~x))D f~ (~x)d~x

When I put vector symbols over ‘everything’, I do not mean to say that all these quantities must be vectors. The scalar case is included as special case with 1-component vectors. The different vectors may have differently many components, if only the ‘chain’ fits together: For instance, ~x may have 3 components, and f~ (~x) may have 2 components. Then the input variable ~y for ~g must also have two components, else the chain doesn’t fit together; but then the output ~g(~y) may have any number of components. The sizes of the matrices D f~ (~x) and D~g(~y) are accordingly, and the size restriction on matrix multiplication is automatically satisfied!
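To see the matrix sizes fit together in a concrete case, here is a sketch with a hypothetical f from 3 inputs to 2 outputs and a hypothetical g from 2 inputs to 3 outputs (both made up for this illustration). The product D~g( f~ (~x))D f~ (~x) is then a 3 × 3 matrix, and it is compared with a finite-difference approximation of the derivative of the composition.

```python
def f(x):
    # Hypothetical map from 3 inputs to 2 outputs.
    return [x[0] * x[1], x[1] + x[2] ** 2]

def Df(x):
    # 2 x 3 matrix of partial derivatives of f.
    return [[x[1], x[0], 0.0],
            [0.0, 1.0, 2.0 * x[2]]]

def g(y):
    # Hypothetical map from 2 inputs to 3 outputs.
    return [y[0] + y[1], y[0] * y[1], y[0] ** 2]

def Dg(y):
    # 3 x 2 matrix of partial derivatives of g.
    return [[1.0, 1.0],
            [y[1], y[0]],
            [2.0 * y[0], 0.0]]

def matmul(A, B):
    # Plain matrix product; the inner sizes must agree.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def numerical_jacobian(func, x, eps=1e-6):
    # Central differences: column j holds the effect of changing input j.
    columns = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        columns.append([(a - b) / (2 * eps) for a, b in zip(func(up), func(down))])
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]

x0 = [1.0, 2.0, -1.0]
chain_rule = matmul(Dg(f(x0)), Df(x0))              # (3 x 2) times (2 x 3)
direct = numerical_jacobian(lambda x: g(f(x)), x0)  # 3 x 3, no chain rule used
max_error = max(abs(a - b) for ra, rb in zip(chain_rule, direct)
                for a, b in zip(ra, rb))
print(chain_rule, max_error)
```

Note that Dg is evaluated at f (x0), not at x0; swapping the two factors would not even give matrices of compatible sizes for a general chain.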

Now with the theory all neat and slick, all we need to understand is what this matrix form of the chain rule means in practice for the crummy partial derivatives with which we do all the practical calculations. Let’s do this in an example: We take a 3-variable function g (scalar valued), and assume its arguments x, y, z are themselves dependent on parameters s and t: Let’s say x = f_1(s, t), y = f_2(s, t) and z = f_3(s, t). If we insert these into g, we get g(x, y, z) = g(f_1(s, t), f_2(s, t), f_3(s, t)) =: h(s, t). So now h = g ◦ f. Here f is a 2-variable function whose values are 3-vectors (but I will omit the arrow on top of the f ), and they fit into the 3-variable function g, which in turn has numbers as values. Now we want to calculate ∂h/∂s and ∂h/∂t in terms of the partials of the f_i and g. The chain rule says: Dh(s, t) = Dg(f (s, t))Df (s, t), which, written out in detail, means

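\[
\begin{bmatrix}
\dfrac{\partial h}{\partial s}(s,t) & \dfrac{\partial h}{\partial t}(s,t)
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial g}{\partial x}(f(s,t)) & \dfrac{\partial g}{\partial y}(f(s,t)) & \dfrac{\partial g}{\partial z}(f(s,t))
\end{bmatrix}
\begin{bmatrix}
\dfrac{\partial f_1}{\partial s}(s,t) & \dfrac{\partial f_1}{\partial t}(s,t) \\[1ex]
\dfrac{\partial f_2}{\partial s}(s,t) & \dfrac{\partial f_2}{\partial t}(s,t) \\[1ex]
\dfrac{\partial f_3}{\partial s}(s,t) & \dfrac{\partial f_3}{\partial t}(s,t)
\end{bmatrix}
\]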

This can be written out as two equations:

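\[
\frac{\partial h}{\partial s}(s,t)
= \frac{\partial g}{\partial x}(f(s,t))\,\frac{\partial f_1}{\partial s}(s,t)
+ \frac{\partial g}{\partial y}(f(s,t))\,\frac{\partial f_2}{\partial s}(s,t)
+ \frac{\partial g}{\partial z}(f(s,t))\,\frac{\partial f_3}{\partial s}(s,t)
\]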

and a similar equation for the partial with respect to t. Remember that the f (s, t) inside g actually stands for three variables (f_1(s, t), f_2(s, t), f_3(s, t)).
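For completeness, the equation for the partial with respect to t can be spelled out as follows. It is nothing but the second column of the matrix product Dg(f (s, t))Df (s, t), written as a sum in the same notation as the equation for ∂h/∂s:

```latex
\[
\frac{\partial h}{\partial t}(s,t)
= \frac{\partial g}{\partial x}(f(s,t))\,\frac{\partial f_1}{\partial t}(s,t)
+ \frac{\partial g}{\partial y}(f(s,t))\,\frac{\partial f_2}{\partial t}(s,t)
+ \frac{\partial g}{\partial z}(f(s,t))\,\frac{\partial f_3}{\partial t}(s,t)
\]
```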

With the identification x = f_1, y = f_2, z = f_3 (that is usually done with the physicist’s convention about functions) and a common name like u for the output variable of both g and h = g ◦ f , this is often abbreviated as

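\[
\frac{\partial u}{\partial s}
= \frac{\partial u}{\partial x}\,\frac{\partial x}{\partial s}
+ \frac{\partial u}{\partial y}\,\frac{\partial y}{\partial s}
+ \frac{\partial u}{\partial z}\,\frac{\partial z}{\partial s}
\]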

This is how you will find the chain rule in many books and many contexts. I have deliberately started with an involved and detailed notation, and then moved to this succinct and easy-to-remember version. The reason is that this ‘easy’ notation is ambiguous, and it is only the context that resolves the ambiguity. If you come to love the easy notation before having worked through the complicated one, you will find the issue of ambiguity in the curly ∂ notation rather difficult to stomach; and in situations where a hidden ambiguity does cause errors, it will then be very difficult to clear up the confusion. For the moment, let me make one simple comment about this issue: When we write ∂u/∂x, our notation expresses which quantity varies (namely x), but it does not tell us which variables remain fixed (namely y and z). If the ‘duh’ answer “all variables other than x remain fixed” really is clear enough to tell you that the other variables are y and z, then the context has resolved the ambiguity of the notation; and this happens in many cases (but not in all). In thermodynamics, you can study the pressure of a gas as a function of volume and temperature; or you can study it as a function of volume and energy content. And then, if you take a partial with respect to volume, it is no longer clear whether the temperature or the energy content is to remain fixed. And this might make a difference.
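A toy version of this ambiguity, much simpler than the thermodynamic one and entirely hypothetical: take the quantity u = x + y. Described by the variables (x, y), it changes at rate 1 when x varies and y stays fixed. Described by the variables (x, w) with w = x + y, the same quantity is simply u = w, and it does not change at all when x varies and w stays fixed. Both numbers could be written ‘∂u/∂x’.

```python
def u_from_xy(x, y):
    # The quantity u, described in terms of the variables (x, y).
    return x + y

def u_from_xw(x, w):
    # The same quantity u, described in terms of (x, w), where w = x + y.
    return w

def partial_first_slot(func, a, b, eps=1e-6):
    # Central difference in the first argument; the second argument is held fixed.
    return (func(a + eps, b) - func(a - eps, b)) / (2 * eps)

x, y = 1.0, 2.0
w = x + y
du_dx_fixed_y = partial_first_slot(u_from_xy, x, y)   # y held fixed
du_dx_fixed_w = partial_first_slot(u_from_xw, x, w)   # w held fixed

print(u_from_xy(x, y), u_from_xw(x, w))   # same value of u in both descriptions
print(du_dx_fixed_y, du_dx_fixed_w)       # but different 'partials with respect to x'
```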

Here is one obvious thing that can be seen from the above chain rule: curly ∂ terms cannot just be ‘canceled’ as you would do with the dx’s and dy’s in single variable calculus. And this is a very good reason why we use curly ∂’s for partial derivatives: as a reminder that formal cancellation yields wrong results; not just sometimes, but nearly every time!
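A quick illustration of why cancellation fails, using a hypothetical example: let u = xyz with x = y = z = s, so that u = s^3. If the ∂x’s could be canceled, each single product such as (∂u/∂x)(∂x/∂s) would already equal ∂u/∂s. In fact only the sum of the three products does.

```python
def chain_rule_terms(s):
    # Hypothetical example: u = x*y*z with x = y = z = s, so u = s**3.
    x = y = z = s
    dx_ds = dy_ds = dz_ds = 1.0
    # The three products (du/dx)(dx/ds), (du/dy)(dy/ds), (du/dz)(dz/ds).
    return [y * z * dx_ds, x * z * dy_ds, x * y * dz_ds]

s = 2.0
terms = chain_rule_terms(s)
total = sum(terms)
true_derivative = 3.0 * s**2   # derivative of s**3 with respect to s

print(terms, total, true_derivative)
```

Each term is only one third of the true derivative here, so ‘canceling’ would be off by a factor of 3.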

Applications of the chain rule

(1) The statement that the gradient is orthogonal to level lines (which we had discussed heuristically above) follows rigorously from the chain rule. Suppose t ↦ f~ (t) describes a curve within a level set of a function g: then g( f~ (t)) = c for all t. The derivative of this (constant) single-variable function is therefore 0. By the chain rule,

\[
\frac{d}{dt}\, g(\vec f(t)) = Dg(\vec f(t))\, \vec f\,'(t) = \nabla g(\vec f(t)) \cdot \vec f\,'(t)
\]

Now, f~ ′(t) is tangent to the curve described by f~ (t). If we interpret t as a time, f~ ′(t) is actually the velocity vector. For a 2-variable function g, the level set is typically a curve, and so f~ (t) must describe (part of) this curve. For 3 or more variable functions, the level set is a surface (or higher dimensional), and the curve described by t ↦ f~ (t) lies in this surface. But since this argument can be made for any curve within the level surface, we still conclude that

this, because there is no x left in the ‘numerator’ with respect to which I could differentiate. Should I write (∂f/∂x)(2, −3)? Better, because now at least the order of operations is clear: First I take a derivative, then I plug in (2, −3). But still, f is the name of a function, and the generic names for its variables are arbitrary. I could have given the very same function by f (u, v) = u^2 + 2uv^3, and then you would have written the same thing as (∂f/∂u)(2, −3). The best, I think, that I can do with the previous notation is to write ∂f (x, y)/∂x |_{(x,y)=(2,−3)}, and this is clumsy.

While you will see ∂f (x, y)/∂x written as (∂f/∂x)(x, y), this latter is a ‘mixed’ notation. While ∂f/∂x clearly conveys that we take a partial derivative of the function f , which we subsequently evaluate at (x, y), the function f itself does not stipulate that its input variables be given specific names. What we really mean with ∂f/∂x is that we take a partial with respect to the first variable. And it is only because it is customary to call the first variable by the name of x that the notation identifies this fact. There is a ‘pure’ notation to indicate this: we write ∂_1 f for the derivative of f with respect to its first argument. This is analogous to the notation Df for the total derivative and to the Newton notation f ′ for the single variable derivative. Each refers to a function with no regard to what its arguments may be called.

To illustrate this issue, let me give you an example where both notations are needed and where confusion would arise if we didn’t have a clean notation: Some functions have the property that their arguments can be swapped with impunity. For instance f : (x, y) ↦ x + y, and g : (x, y) ↦ xy are such functions. Let’s call them symmetric for the moment. More precisely, a 2-variable function f is called symmetric, iff f (x, y) = f (y, x) for all (x, y). For instance, h(x, y) = xy^2 + yx^2 is symmetric, but p(x, y) = x^y is not symmetric. Now we want to show the following claim: If f is a symmetric function, then the function g defined by g(x, y) := ∂f (x, y)/∂x + ∂f (x, y)/∂y is symmetric. You see, since the very hypothesis reads f (x, y) = f (y, x), you’d be doomed if you tried to identify slots by variable names.

Here is a clean proof:

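\[
g(x,y) = \frac{\partial f(x,y)}{\partial x} + \frac{\partial f(x,y)}{\partial y}
= (\partial_1 f)(x,y) + (\partial_2 f)(x,y)
\]

So g = ∂_1 f + ∂_2 f. We want to show that g(x, y) = g(y, x). Now

\[
g(y,x) = (\partial_1 f)(y,x) + (\partial_2 f)(y,x)
= \frac{\partial f(y,x)}{\partial y} + \frac{\partial f(y,x)}{\partial x}
\overset{*}{=} \frac{\partial f(x,y)}{\partial y} + \frac{\partial f(x,y)}{\partial x}
= (\partial_2 f)(x,y) + (\partial_1 f)(x,y) = g(x,y)
\]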

It is at the = sign marked with ∗ that we used the hypothesis that f is symmetric.
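The ‘pure’ slot notation translates directly into code, where a derivative operator takes a function and a slot number and never looks at variable names. The sketch below is only an illustration under that reading: the helper `partial` is a hypothetical name, it approximates derivatives by central differences, and the claim is checked for the symmetric example h(x, y) = xy^2 + yx^2 from the text.

```python
def partial(func, slot, eps=1e-6):
    # Returns the function "derivative of func with respect to argument number
    # slot" (slot 0 plays the role of ∂_1, slot 1 of ∂_2), by central differences.
    def derivative(*args):
        up = list(args)
        down = list(args)
        up[slot] += eps
        down[slot] -= eps
        return (func(*up) - func(*down)) / (2 * eps)
    return derivative

def h(x, y):
    # Symmetric example from the text: h(x, y) = x*y**2 + y*x**2.
    return x * y**2 + y * x**2

def g(a, b):
    # g = ∂_1 h + ∂_2 h; the argument names a, b play no role.
    return partial(h, 0)(a, b) + partial(h, 1)(a, b)

g_xy = g(1.5, -0.5)
g_yx = g(-0.5, 1.5)
print(g_xy, g_yx)   # equal up to the finite-difference error
```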

There is one more notation you will encounter: Since the notation with ‘fractions’ of curly ∂’s is sometimes bulky, you may see the subscript notation: ∂_x is often used instead of ∂/∂x. So I could have rewritten the above proof as follows:

\[
g(y,x) = (\partial_1 f)(y,x) + (\partial_2 f)(y,x)
= \partial_y f(y,x) + \partial_x f(y,x)
\overset{*}{=} \partial_y f(x,y) + \partial_x f(x,y)
= (\partial_2 f)(x,y) + (\partial_1 f)(x,y) = g(x,y)
\]

Similarly, in the physicist style variable notation, u_x stands for ∂u/∂x. When u = f (x, y), you will also see the ‘mixed’ notation f_x in analogy to ∂f/∂x.

My best advice is that in your own usage you should avoid ‘mixed’ notation altogether, i.e., never identify slots by default variable names, but be tolerant to the frequent occurrences when others use such notation.

I may be uptight on the notation issue, but students do suffer in courses on partial differential equations when they have fuzzy ideas about multi-variable calculus.

Proof of the chain rule

In this proof, x, h, k, g(x), f (g(x)) are all vectors, even though I don’t adorn them with arrows.

In a preliminary consideration, we prove that for a matrix T and a vector h, we have the estimate ‖T h‖ ≤ C‖h‖, where the constant C depends on the entries of the matrix T. For instance we can take

\[
C = \sqrt{\sum_{i,j} T_{ij}^2}\,.
\]

This is a consequence of the Cauchy-Schwarz inequality. The first entry of the vector T h is T_11 h_1 + T_12 h_2 + ... + T_1n h_n, which can be written as a dot product of the vector [T_11, T_12, ..., T_1n]^T with h. Therefore its absolute value is at most the product of the norms, or:

\[
(Th)_1^2 \le \Bigl( \sum_j T_{1j}^2 \Bigr) \|h\|^2
\]

Similarly for the other components of T h. Adding up these, we get

\[
\|Th\|^2 \le \Bigl( \sum_{i,j} T_{ij}^2 \Bigr) \|h\|^2
\]
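The estimate ‖T h‖ ≤ C‖h‖, with C the square root of the sum of all squared entries of T, can be spot-checked numerically. The matrix and the test vectors below are hypothetical, and a check on a few vectors is of course no substitute for the proof.

```python
import math

def mat_vec(T, h):
    # Matrix times vector: each entry is the dot product of a row of T with h.
    return [sum(t * x for t, x in zip(row, h)) for row in T]

def norm(v):
    # Euclidean norm.
    return math.sqrt(sum(x * x for x in v))

def constant_C(T):
    # C = square root of the sum of all squared entries of T.
    return math.sqrt(sum(t * t for row in T for t in row))

T = [[1.0, -2.0, 0.5],
     [3.0, 0.0, -1.0]]          # hypothetical 2 x 3 matrix
test_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0],
                [-2.0, 0.5, 4.0], [0.3, -0.7, 0.1]]

C = constant_C(T)
estimate_holds = all(norm(mat_vec(T, h)) <= C * norm(h) for h in test_vectors)
print(C, estimate_holds)
```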

Next, we want to show that Df (g(x))Dg(x) is the total derivative of f ◦ g. In other words, we have to show that

\[
\lim_{h \to 0} \frac{\| f(g(x+h)) - f(g(x)) - Df(g(x))\,Dg(x)\,h \|}{\|h\|} = 0
\]

Rewriting this using the ε-δ definition of the limit, we have to show: For every ε > 0, there exists δ > 0 such that ‖h‖ < δ implies

‖f (g(x + h)) − f (g(x)) − Df (g(x))Dg(x)h‖ ≤ ε‖h‖ (G)

(eqn (G) for goal). Similarly, we rewrite the hypotheses that (H1) f is differentiable at g(x) and (H2) g is differentiable at x as:

For every ε_1 > 0, there exists δ_1 > 0 such that ‖k‖ < δ_1 implies

‖f (g(x) + k) − f (g(x)) − Df (g(x))k‖ ≤ ε_1 ‖k‖, in particular for k = g(x + h) − g(x) (H1)

For every ε_2 > 0, there exists δ_2 > 0 such that ‖h‖ < δ_2 implies

‖g(x + h) − g(x) − Dg(x)h‖ ≤ ε_2 ‖h‖ (H2)

We now calculate

\[
\begin{aligned}
\| f(g(x+h)) - f(g(x)) - Df(g(x))\,Dg(x)\,h \|
&\le \| f(g(x+h)) - f(g(x)) - Df(g(x))\,(g(x+h) - g(x)) \| \\
&\quad + \| Df(g(x))\,(g(x+h) - g(x) - Dg(x)\,h) \| \\
&\le \| f(g(x+h)) - f(g(x)) - Df(g(x))\,(g(x+h) - g(x)) \| \\
&\quad + M_f\, \| g(x+h) - g(x) - Dg(x)\,h \|
\end{aligned}
\]

where M_f is the constant that comes from the matrix T = Df (g(x)) in the estimate ‖T k‖ ≤ C‖k‖. We aim to show that each of the two terms in the sum on the right is ≤ (1/2) ε‖h‖,