Notes: Perturbation theory and condition numbers

Let's start by asking how "hard" it is to solve a given linear system, $Ax=b$. You will apply perturbation theory to answer this question.

This notebook is only for your edification. You do not need to submit it, but you are responsible for understanding the concept of a condition number and how to estimate it for a matrix using NumPy. (The code below shows you how!)

Intuition: Continuous functions of a single variable. To build your intuition, consider the simple case of a scalar function in a single continuous variable, $y = f(x)$. Suppose the input is perturbed by some amount, $\Delta x$. The output will also change by some amount, $\Delta y$. How large is $\Delta y$ relative to $\Delta x$?

Supposing $\Delta x$ is sufficiently small, you can approximate the change in the output by a Taylor series expansion of $f(x + \Delta x)$:

$$ y + \Delta y = f(x + \Delta x) = f(x) + \Delta x \frac{df}{dx} + O(\Delta x^2). $$

Since $\Delta x$ is assumed to be "small," we can approximate this relation by

$$ \begin{eqnarray} y + \Delta y & \approx & f(x) + \Delta x \frac{df}{dx} \\ \Delta y & \approx & \Delta x \frac{df}{dx}. \end{eqnarray} $$

This result should not be surprising: the first derivative measures the sensitivity of changes in the output to changes in the input. We will give the derivative a special name: it is the (absolute) condition number. If it is very large in the vicinity of $x$, then even small changes to the input will result in large changes in the output. Put differently, a large condition number indicates that the problem is intrinsically sensitive, so we should expect it may be difficult to construct an accurate algorithm.

In addition to the absolute condition number, we can define a relative condition number for the problem of evaluating $f(x)$.

$$ \begin{eqnarray} \Delta y & \approx & \Delta x \frac{df}{dx} \\ & \Downarrow & \\ \frac{|\Delta y|}{|y|} & \approx & \frac{|\Delta x|}{|x|} \cdot \underbrace{\frac{|df/dx| \cdot |x|}{|f(x)|}}_{\kappa_f(x)}. \end{eqnarray} $$

Here, the underscored factor, defined to be $\kappa_f(x)$, is the relative analogue of the absolute condition number. Again, its magnitude tells us whether the output is sensitive to the input.
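As a quick sanity check, you can estimate $\kappa_f(x)$ empirically by comparing the relative change in the output to a small relative change in the input. The function and perturbation size below are arbitrary choices for illustration; for $f(x) = e^x$, the formula above gives $\kappa_f(x) = |e^x| \cdot |x| / |e^x| = |x|$.

```python
import numpy as np

def kappa_exp(x):
    """Relative condition number of f(x) = exp(x): |f'(x)| * |x| / |f(x)| = |x|."""
    return abs(x)

x = 10.0
dx = 1e-8 * x                 # a small perturbation of the input
rel_in = abs(dx) / abs(x)     # relative change in the input
rel_out = abs(np.exp(x + dx) - np.exp(x)) / abs(np.exp(x))  # relative change in the output
print(rel_out / rel_in)       # ~ kappa_exp(10.0) == 10.0
```

The ratio of relative output change to relative input change recovers the analytical condition number, as the approximation predicts.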

Perturbation theory for linear systems. What if we perturb a linear system? How can we measure its sensitivity or "intrinsic difficulty" to solve?

First, recall the following linear algebraic identities:

  • Triangle inequality: $\|x + y\|_2 \leq \|x\|_2 + \|y\|_2$
  • Norm of a matrix-vector product: $\|Ax\|_2 \leq \|A\|_F\cdot\|x\|_2$
  • Norm of matrix-matrix product: $\|AB\|_F \leq \|A\|_F\cdot\|B\|_F$

To simplify the notation a little, we will drop the "$2$" and "$F$" subscripts.

Suppose all of $A$, $b$, and the eventual solution $x$ undergo additive perturbations, denoted by $A + \Delta A$, $b + \Delta b$, and $x + \Delta x$, respectively. Then, subtracting the original system from the perturbed system, you would obtain the following.

$$ \begin{array}{rrcll} & (A + \Delta A)(x + \Delta x) & = & b + \Delta b & \\ - [& Ax & = & b & ] \\ \hline & \Delta A x + (A + \Delta A) \Delta x & = & \Delta b & \\ \end{array} $$

Now look more closely at the perturbation, $\Delta x$, of the solution. Let $\hat{x} \equiv x + \Delta x$ be the perturbed solution. Then the above can be rewritten as,

$$\Delta x = A^{-1} \left(\Delta b - \Delta A \hat{x}\right),$$

where we have assumed that $A$ is invertible. (That won't be true for our overdetermined system, but let's not worry about that for the moment.)

How large is $\Delta x$? Let's measure it with a norm and bound it using the identities above:

$$ \begin{array}{rcl} \|\Delta x\| & = & \|A^{-1} \left(\Delta b - \Delta A \hat{x}\right)\| \\ & \leq & \|A^{-1}\|\cdot\left(\|\Delta b\| + \|\Delta A\|\cdot\|\hat{x}\|\right). \end{array} $$

Dividing both sides by $\|\hat{x}\|$, and multiplying and dividing the right-hand side by $\|A\|$, you can rewrite this as follows:

$$ \begin{array}{rcl} \frac{\|\Delta x\|} {\|\hat{x}\|} & \leq & \|A^{-1}\| \cdot \|A\| \cdot \left( \frac{\|\Delta A\|} {\|A\|} + \frac{\|\Delta b\|} {\|A\| \cdot \|\hat{x}\|} \right). \end{array} $$
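You can verify this bound numerically. The sketch below is illustrative, with an arbitrary random system and perturbation scale; it uses NumPy's default Frobenius norm for matrices and 2-norm for vectors, consistent with the identities above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)                  # unperturbed solution

dA = 1e-8 * rng.standard_normal((n, n))    # perturbations of A and b
db = 1e-8 * rng.standard_normal(n)
x_hat = np.linalg.solve(A + dA, b + db)    # perturbed solution

# Left-hand side: relative error of the perturbed solution
lhs = np.linalg.norm(x_hat - x) / np.linalg.norm(x_hat)

# Right-hand side: ||A^{-1}|| * ||A|| times the relative perturbations
kappa = np.linalg.norm(np.linalg.inv(A)) * np.linalg.norm(A)
rhs = kappa * (np.linalg.norm(dA) / np.linalg.norm(A)
               + np.linalg.norm(db) / (np.linalg.norm(A) * np.linalg.norm(x_hat)))
print(lhs <= rhs)   # the bound holds
```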

This bound says that the relative error of the perturbed solution, compared to relative perturbations in $A$ and $b$, scales with the product, $\|A^{-1}\| \cdot \|A\|$. This factor is the linear systems analogue of the condition number for evaluating the function $f(x)$! As such, we define

$$\kappa(A) \equiv \|A^{-1}\| \cdot \|A\|$$

as the condition number of $A$ for solving linear systems.
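In NumPy, `np.linalg.cond` computes $\kappa(A)$ directly (by default in the 2-norm, as the ratio of the largest to smallest singular value). A minimal sketch, using an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

kappa = np.linalg.cond(A)  # default: 2-norm condition number

# Equivalent to the definition ||A^{-1}|| * ||A||, taken in the same norm
kappa_manual = np.linalg.norm(np.linalg.inv(A), 2) * np.linalg.norm(A, 2)
print(kappa, kappa_manual)
```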

What values of $\kappa(A)$ are "large?" Generally, you want to compare $\kappa(A)$ to $1/\epsilon$, where $\epsilon$ is machine precision, which is the maximum relative error under rounding. We may look more closely at floating-point representations later on, but for now, a good notional value for $\epsilon$ is about $10^{-7}$ in single-precision and $10^{-15}$ in double-precision. (In Python, the default format for floating-point values is double-precision.)
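NumPy reports the machine epsilon of each floating-point type via `np.finfo`; these are the precise values behind the notional figures above.

```python
import numpy as np

print(np.finfo(np.float32).eps)   # single precision: ~1.19e-07
print(np.finfo(np.float64).eps)   # double precision: ~2.22e-16
```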

This analysis explains why solving the normal equations directly could lead to computational problems. In particular, one can show that $\kappa(X^T X) \approx \kappa(X)^2$, which means forming $X^T X$ explicitly may make the problem significantly harder to solve!
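You can see the squaring effect directly: in the 2-norm, $\kappa(X^T X) = \kappa(X)^2$ exactly for a full-rank $X$. The matrix below is an arbitrary random example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # an arbitrary (well-conditioned) data matrix

k = np.linalg.cond(X)               # condition number of X
k_gram = np.linalg.cond(X.T @ X)    # condition number of the Gram matrix
print(k**2, k_gram)                 # k_gram ~= k**2
```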

Another scenario in which $X$ will have a large condition number is when it has nearly collinear predictors. See the examples below.
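For instance, consider a hypothetical data matrix whose two columns are nearly identical; its condition number blows up as the columns approach exact linear dependence. (The column values and noise scale below are arbitrary choices for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(100)
x1 = x0 + 1e-8 * rng.standard_normal(100)   # nearly a copy of x0
X = np.column_stack([x0, x1])

print(np.linalg.cond(X))   # enormous: the columns are almost linearly dependent
```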

Fin! That's the end of these notes.