A note on the multivariable chain rule

Teaching machine learning, I have found that many students are unprepared for the level of vector calculus required, particularly when it comes to doing backprop calculations, which require the chain rule. Here I attempt to review the chain rule for computing gradients, and related concepts such as derivatives and Jacobians, in a cohesive way.

Review: single-variable chain rule

In single-variable calculus, the chain rule is often written

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$

where $z$ is some function of $y$, which is itself a function of $x$. Students can remember this easily by "canceling terms" (although they are reminded that this is not technically correct).
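As a quick numerical sanity check (a minimal NumPy sketch; the particular choice $z = \sin(y)$ with $y = x^2$ is an arbitrary example of mine):

```python
import numpy as np

# Check dz/dx = (dz/dy)(dy/dx) for z = sin(y), y = x^2 at a single point.
x = 1.3
y = x**2
dy_dx = 2 * x        # derivative of y = x^2
dz_dy = np.cos(y)    # derivative of z = sin(y)
chain = dz_dy * dy_dx

# Compare against a centered finite difference of z(x) = sin(x^2).
h = 1e-6
fd = (np.sin((x + h)**2) - np.sin((x - h)**2)) / (2 * h)
print(chain, fd)     # the two agree to several decimal places
```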

We would like to extend this type of relationship to vector-valued functions of several variables. But first, we need to reinterpret the derivative.

The derivative as a linear map

The classical derivative is a number, denoted $f'(x)$, which corresponds to the instantaneous rate of change of $f$ at $x$ if you zoom in infinitely close to $x$. This is conceptually equivalent to drawing a tangent line to $f$ at $x$; the slope of the tangent line is the derivative $f'(x)$. The tangent line can be written as the graph of the function

$$\bar{f}(x+\Delta) = f(x) + f'(x)\,\Delta$$

This leads to a more general notion of a derivative as the best local linear approximation of the change of a function with respect to its input. This is typically called the total derivative or differential of a function, or sometimes just the derivative. For a function $f: X \to Y$ where $X$ and $Y$ are normed vector spaces, we say that $f'(x): X \to Y$ is the total derivative of $f$ at $x$ if it is linear and satisfies

$$\lim_{\Delta \to 0} \frac{\left\| f(x+\Delta) - f(x) - f'(x)(\Delta) \right\|_Y}{\left\| \Delta \right\|_X} = 0,$$

assuming such a map exists. In the special case of $f: \mathbb{R} \to \mathbb{R}$, we abuse notation by writing $f'(x)$ for both the scalar and for the linear map, but it holds that

$$\lim_{\Delta \to 0} \frac{\left| f(x+\Delta) - f(x) - f'(x)\,\Delta \right|}{|\Delta|} = \lim_{\Delta \to 0} \left| \frac{f(x+\Delta) - f(x)}{\Delta} - f'(x) \right| = 0$$

so we recover $f'(x)(\Delta) = f'(x)\,\Delta$.
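This "best local linear approximation" property is easy to see numerically. Here is a small sketch (NumPy, with $f = \exp$ chosen arbitrarily): the approximation error divided by $|\Delta|$ goes to zero as $\Delta$ shrinks, which is exactly the limit above.

```python
import numpy as np

# The error of the linear approximation f(x + D) ~ f(x) + f'(x) D is o(|D|):
# the ratio error / |D| tends to 0 as D -> 0.
f = np.exp
x = 0.5
fprime = np.exp(x)   # derivative of exp at x, viewed as the linear map D -> fprime * D

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    err = abs(f(x + delta) - f(x) - fprime * delta)
    print(delta, err / delta)   # the second column shrinks roughly linearly in delta
```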

Jacobian and gradient

For vector-valued functions $f: \mathbb{R}^m \to \mathbb{R}^n$, the matrix representation of the total derivative turns out to be the Jacobian matrix $J_f(x) \in \mathbb{R}^{n \times m}$, which contains all the partial derivatives:

$$f'(x)(\Delta) = J_f(x)\,\Delta \quad \text{where} \quad [J_f(x)]_{ij} = \left.\frac{\partial f_i}{\partial x_j}\right|_{x}$$

In the special case of scalar-valued functions $f: \mathbb{R}^m \to \mathbb{R}$, we can define the gradient $\nabla f(x) \in \mathbb{R}^m$ by

$$[\nabla f(x)]_i = \left.\frac{\partial f}{\partial x_i}\right|_{x}$$

Note that if we interpret $\nabla f(x)$ as a column vector, i.e. an $m \times 1$ matrix, then

$$J_f(x) = \nabla f(x)^\top$$

Or, more generally, for $f: \mathbb{R}^m \to \mathbb{R}^n$,

$$J_f(x) = \begin{bmatrix} \nabla f_1(x)^\top \\ \vdots \\ \nabla f_n(x)^\top \end{bmatrix}$$

This may seem like a coincidence if your only knowledge of the transpose operator is that it flips a matrix across the diagonal. The conceptual interpretation of transposition (operating on a vector) is that it takes a vector and produces a linear functional (also called a linear form), i.e. a linear function that acts on other vectors to produce a scalar. Namely, for any vector $v$ we can define the linear functional $v^\top(w) = v^\top w = \sum_i v_i w_i$. There is essentially no difference between $v$ and $v^\top$; while they are not literally the same "type" of object, they act the same way, and there is a one-to-one correspondence between them.
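To see the stacked-gradient structure of the Jacobian concretely, here is a small finite-difference sketch (NumPy; the function $f: \mathbb{R}^3 \to \mathbb{R}^2$ is an arbitrary example of mine): row $i$ of $J_f(x)$ is the transposed gradient of the component $f_i$.

```python
import numpy as np

# Arbitrary example f: R^3 -> R^2, chosen only for illustration.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]**2])

def jacobian_fd(fun, x, h=1e-6):
    """Central finite-difference Jacobian; column j holds d(fun)/d(x_j)."""
    m = len(x)
    cols = []
    for j in range(m):
        e = np.zeros(m)
        e[j] = h
        cols.append((fun(x + e) - fun(x - e)) / (2 * h))
    return np.stack(cols, axis=1)   # shape (n, m)

x = np.array([1.0, 2.0, 0.3])
J = jacobian_fd(f, x)
print(J.shape)   # (2, 3): one row per output component
print(J[0])      # approx gradient of f_1 = x0*x1:        [x1, x0, 0]
print(J[1])      # approx gradient of f_2 = sin(x2)+x0^2: [2*x0, 0, cos(x2)]
```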

From the equations above, we see that the gradient is the vector corresponding to the directional derivative functional

$$f'(x)(v) = \nabla f(x)^\top v = \sum_i v_i \frac{\partial f}{\partial x_i} = v \cdot \nabla f(x)$$

which quantifies how $f$ changes along the direction $v$ when starting from $x$. Note that the directional derivative – considered as a function of the direction – coincides with the total derivative of $f$ when $f$ is scalar-valued.
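A quick check that the directional derivative really is $v \cdot \nabla f(x)$ (NumPy again, with an arbitrary example function):

```python
import numpy as np

# Directional derivative of f(x) = x0^2 + sin(x1) along v, computed two ways.
def f(x):
    return x[0]**2 + np.sin(x[1])

def grad_f(x):
    return np.array([2 * x[0], np.cos(x[1])])

x = np.array([1.0, 0.5])
v = np.array([0.3, -0.7])

# d/dt f(x + t v) at t = 0, estimated by a centered finite difference.
h = 1e-6
directional_fd = (f(x + h * v) - f(x - h * v)) / (2 * h)
print(directional_fd, v @ grad_f(x))   # the two agree to several decimal places
```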

Generalization: gradients in Hilbert spaces

In fact, the gradient can be defined more generally in a Hilbert space $H$: a vector space equipped with an inner product $\langle \cdot, \cdot \rangle_H$ that generalizes the dot product from $\mathbb{R}^n$ (and which is complete with respect to the induced norm). Any Hilbert space is also a normed vector space with norm $\|v\|_H = \sqrt{\langle v, v \rangle_H}$.

The Riesz representation theorem guarantees that, for any bounded linear functional $\ell: H \to \mathbb{R}$, there exists a unique vector $v_\ell \in H$ such that $\ell(x) = \langle x, v_\ell \rangle_H$. Thus, the gradient of $f: H \to \mathbb{R}$ can be defined as the unique vector $\nabla f(x) \in H$ satisfying

$$f'(x)(v) = \langle v, \nabla f(x) \rangle_H$$
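For a finite-dimensional illustration (my own example, not from the discussion above): take $H = \mathbb{R}^2$ with the weighted inner product $\langle u, v \rangle_M = u^\top M v$ for a symmetric positive-definite $M$. The differential $f'(x)(v) = v^\top g$, where $g$ is the usual Euclidean gradient, is then represented by the vector $M^{-1} g$, since $\langle v, M^{-1} g \rangle_M = v^\top g$. A NumPy sketch:

```python
import numpy as np

# Gradient with respect to a weighted inner product <u, v>_M = u^T M v.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])           # symmetric positive definite

def f(x):
    return x[0]**2 + 3 * x[1]

def euclidean_grad(x):
    return np.array([2 * x[0], 3.0])

x = np.array([1.0, -2.0])
v = np.array([0.4, 0.9])

g = euclidean_grad(x)
grad_M = np.linalg.solve(M, g)       # Riesz representer of f'(x) under <.,.>_M

# Both expressions compute f'(x)(v):
print(v @ g, v @ M @ grad_M)         # equal (up to floating-point error)
```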

The multivariable chain rule

The chain rule generalizes nicely to the total derivative. Let $f: X \to Y$ and $g: Y \to Z$; then their composition $g \circ f: X \to Z$ has derivative

$$(g \circ f)'(x) = g'(f(x)) \circ f'(x)$$

Observe that the order of composition of the derivatives matches the order of composition of the original functions. That is, $g \circ f$ first applies $f$ and then $g$, and similarly $(g \circ f)'(x)$ first applies $f'(x)$ and then $g'(f(x))$.

This can also be expressed in terms of Jacobians:

$$J_{g \circ f}(x) = J_g(f(x))\, J_f(x)$$

Note that if $y = f(x)$ and $z = g(y)$, and we abuse notation by writing $\frac{d\,\cdot}{d\,\cdot}$ for the Jacobian, the above equation becomes

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}$$

just as before.
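It is easy to verify the Jacobian form of the chain rule numerically. Below is a rough finite-difference sketch (NumPy; the maps $f$ and $g$ are arbitrary examples of mine):

```python
import numpy as np

# Verify J_{g o f}(x) = J_g(f(x)) J_f(x) using finite-difference Jacobians.
def f(x):                                      # f: R^3 -> R^2
    return np.array([x[0] * x[1], x[2]**2])

def g(y):                                      # g: R^2 -> R^2
    return np.array([np.sin(y[0]), y[0] + y[1]])

def jac_fd(fun, x, eps=1e-6):
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        cols.append((fun(x + e) - fun(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([1.0, 2.0, 0.5])
lhs = jac_fd(lambda t: g(f(t)), x)             # Jacobian of the composition
rhs = jac_fd(g, f(x)) @ jac_fd(f, x)           # product of the Jacobians
print(np.allclose(lhs, rhs, atol=1e-5))        # True
```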

Example: ordinary least squares

A prototypical example in ML classes is ordinary least squares:

$$\min_w L(w) \quad \text{where} \quad L(w) = \|Xw - y\|_2^2$$

As we know, the solution can be found by identifying critical points of $L$, i.e. $w$ where $\nabla L(w) = 0$. The gradient can be found by expanding the square and applying a couple of common matrix calculus identities:

$$\nabla L(w) = \nabla_w \left( (Xw - y)^\top (Xw - y) \right) = \nabla_w \left( w^\top X^\top X w - 2 y^\top X w + y^\top y \right) = 2 X^\top X w - 2 X^\top y$$

so we arrive at the normal equations $X^\top X w = X^\top y$.
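In code, solving the normal equations is a one-liner (a NumPy sketch on synthetic data, assuming $X$ has full column rank so that $X^\top X$ is invertible):

```python
import numpy as np

# Solve ordinary least squares via the normal equations X^T X w = X^T y
# and compare against NumPy's built-in least-squares solver.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))   # True
```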

However, one can also use the chain rule to obtain $\nabla L$. Let's denote $z = f(w) = Xw - y$ and $L = g(z) = \|z\|_2^2$. We know $\nabla g(z) = 2z$, so $\frac{dL}{dz} = J_g(z) = \nabla g(z)^\top = 2z^\top$, and it should be apparent that $\frac{dz}{dw} = J_f = X$, as this is the best linear approximation, so

$$\frac{dL}{dw} = \frac{dL}{dz}\,\frac{dz}{dw} = 2(Xw - y)^\top X$$

Then

$$\nabla L(w) = \left( \frac{dL}{dw} \right)^\top = 2 X^\top (Xw - y)$$

which is exactly what we got before.
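And, as a final sanity check (NumPy, synthetic data), the chain-rule gradient $2 X^\top (Xw - y)$ matches a finite-difference estimate of $\nabla L$:

```python
import numpy as np

# Compare the chain-rule gradient of L(w) = ||Xw - y||^2 with finite differences.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

def L(w):
    return np.sum((X @ w - y)**2)

grad_chain = 2 * X.T @ (X @ w - y)

h = 1e-6
grad_fd = np.array([(L(w + h * e) - L(w - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad_chain, grad_fd, atol=1e-4))   # True
```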