What is the Physical Meaning of the Levi-Civita Connection?

A Riemannian manifold is a differentiable manifold endowed with a Riemannian metric. Given a Riemannian metric, the lengths of curves can be defined via integration, and the angle of intersection of two curves is also well defined. However, there is a priori no direct way of aligning neighbouring tangent planes, hence no direct way of measuring acceleration, for example.

A connection defines the rate-of-change of a vector field. There are many possible connections, hence it is natural to ask what connection to use, and whether on a Riemannian metric there is a canonical choice of a connection. The answer to the first question is, of course, that it depends on what you are wanting to do. (On a Lie group there are several interesting connections, for example.) This short note focuses on the second question.

The canonical choice of a connection on a Riemannian manifold is the Levi-Civita connection. Why is it a natural choice though, and what is its physical meaning?

The short answer is that it corresponds to the parallel transport which would naturally occur if we were living on this manifold. For example, if we are told to walk in a straight line with zero acceleration, we could do this on our planet without defining a connection. How? We simply place one foot in front of the other to take our first step, then repeat, each time doing our best to walk in a straight line. What actually happens is gravity pulls our foot back to Earth and hence we end up walking in a great circle.

Similarly, if we walk while holding a lance, and are told not to move the lance as we walk, we are actually defining a rule for parallel transport.

Put simply, the Levi-Civita connection corresponds to the rule for parallel transport that results by walking along the curve without moving the vector/lance. Gravity will cause additional movement to ensure we stay on the surface of the planet and this is what is captured by the Levi-Civita connection.

Technically, it is not gravity but merely projection back onto the surface of the manifold. Specifically, by the Nash embedding theorem, any Riemannian manifold can be isometrically embedded into Euclidean space. Then walking “in a straight line” on the manifold is achieved by taking a step in Euclidean space then projecting back down to the manifold (just as gravity would pull your foot back to the ground). Then take the limit as the step size goes to zero.

For additional details, see my response here: https://mathoverflow.net/questions/376486/what-is-the-levi-civita-connection-trying-to-describe/376593#376593

Categories: Uncategorized

May 25, 2020 1 comment

With COVID-19 causing a transition to online teaching, I am now in the habit of making short videos…

Consider two isolated capacitors, one with a charge on it, one without. Then suddenly bring them together. The charge will distribute equally on the two capacitors. But energy will have been lost! Where did it go?

Categories: Uncategorized

A Note on Duality in Convex Optimisation

December 22, 2018 1 comment

This note elaborates on the “perturbation interpretation” of duality in convex optimisation. Caveat lector: regularity conditions are ignored.

Recall that in convex analysis, a convex set is fully determined by its supporting hyperplanes. Therefore, any result about convex sets should be re-writable in terms of supporting hyperplanes. This is the basis of duality. A simple yet striking example is Fenchel duality which is nicely illustrated in the Wikipedia. Although it relies on the Legendre transform, which I have written about here, the current note is self-contained: it explains the Legendre transform from first principles.

Let’s start with a classic example of duality. In its usual presentation, it is not at all clear that it relates to the aforementioned duality between convex sets and their hyperplanes. Nevertheless, before presenting a general method for taking the dual of a problem, it is useful to have a concrete example at hand.

Consider minimising $f(x)$ subject to $g(x) \leq 0$ where $f, g$ are convex functions and $x$ is a vector in $\mathbb{R}^n$. The dual of this problem is $\sup_{\lambda \geq 0} \inf_x f(x) + \lambda g(x)$. A number of remarks are made below.

Remark 1.1: One way to “understand” this duality is by understanding when it is permissible to replace $\sup \inf$ by $\inf \sup$. Such results go by the name of minimax theorems. Intuitively, if $\psi(x,y)$ has a single critical point that is a saddle point (with the correct orientation) then $\sup \inf$ can be replaced by $\inf \sup$. (And it is always true that $\sup_y \inf_x \phi(x,y) \leq \inf_x \sup_y \phi(x,y)$.) If we assume $\sup_{\lambda \geq 0} \inf_x f(x) + \lambda g(x)$ equals $\inf_x \sup_{\lambda \geq 0} f(x) + \lambda g(x)$ then it becomes straightforward to verify the dual gives the correct answer. Indeed, $\sup_{\lambda \geq 0} \lambda g(x)$ equals $0$ if $g(x) \leq 0$ and equals $\infty$ if $g(x) > 0$. Therefore, $\sup_{\lambda \geq 0} f(x) + \lambda g(x)$ equals $f(x)$ if $g(x) \leq 0$ and equals $\infty$ otherwise. Therefore, the unconstrained minimisation of $\sup_{\lambda \geq 0} f(x) + \lambda g(x)$ is equivalent to the constrained minimisation of $f(x)$ subject to $g(x) \leq 0$. However, this is by no means the end of the story. It is not even the beginning!

Remark 1.2: The dual appears to relate to the method of Lagrange multipliers. (See here for the physical significance of the Lagrange multiplier.) This makes sense because there are only two cases. Either the global minimum of $f(x)$ occurs at a point $x^\star$ which satisfies the constraint, in which case the problem reduces to an unconstrained optimisation problem, or the optimum feasible solution $x^\star$ satisfies the equality constraint $g(x^\star) = 0$, thereby suggesting a Lagrange multiplier approach. Again though, this is not the start of the story, although the following remarks do elaborate in this direction.

Remark 1.3: If the global minimum of $f(x)$ occurs at a point $x^\star$ which satisfies the constraint $g(x^\star) \leq 0$ then we can check what the solution to the dual $\sup_{\lambda \geq 0} \inf_x f(x) + \lambda g(x)$ looks like, as follows. Let $f^\star = f(x^\star)$. If $\lambda = 0$ then $\inf_x f(x) + \lambda g(x) = f^\star$. If $\lambda > 0$ then $\inf_x f(x) + \lambda g(x) \leq f^\star$ because $g(x^\star) \leq 0$ implies $f(x^\star) + \lambda g(x^\star) \leq f^\star$. Therefore, the largest value of $\inf_x f(x) + \lambda g(x)$ occurs when $\lambda = 0$, showing that the dual $\sup_{\lambda \geq 0} \inf_x f(x) + \lambda g(x)$ reduces to the correct solution $\inf_x f(x)$ in this case.

Remark 1.4: If the global minimum of $f(x)$ is infeasible (it does not satisfy the constraint $g(x) \leq 0$), we can reason as follows. The function $f(x) + \lambda g(x)$ is convex. Provided $f,g$ are sufficiently nice (e.g., strictly convex and differentiable), and $f$ is bounded from below, the minimum of $f(x) + \lambda g(x)$ occurs when $\nabla f(x) + \lambda \nabla g(x) = 0$. We recognise this as the same condition occurring in the method of Lagrange multipliers for solving the constrained problem of minimising $f(x)$ subject to $g(x)=0$. This Lagrange multiplier interpretation means that, for a given $\lambda$, solving $\nabla f(x) + \lambda \nabla g(x) = 0$ gives an $x$ which is the minimum of $f(x)$ subject to $g(x) = c$ for some $c$ and not necessarily for $c = 0$. Specifically, if $(\lambda', x')$ satisfies $\nabla f(x') + \lambda' \nabla g(x') = 0$ then $x=x'$ minimises $f(x)$ subject to $g(x) = g(x')$. For reference, denote by $(\lambda^\star, x^\star)$ the optimal solution, that is, $g(x^\star) = 0$ and $\nabla f(x^\star) + \lambda^\star \nabla g(x^\star) = 0$. Treating $x'$ as a function of $\lambda'$, consider the values of $f(x') + \lambda' g(x')$ compared with $f(x^\star) + \lambda^\star g(x^\star) = f(x^\star)$. We want to show $f(x') + \lambda' g(x') \leq f(x^\star)$ so that the dual problem finds the correct solution. Now, $f(x') + \lambda' g(x') = \inf_x f(x) + \lambda' g(x) \leq f(x^\star) + \lambda' g(x^\star) = f(x^\star)$, as required.

Remark 1.5: A benefit of the dual problem is that it has replaced the “complicated” constraint $g(x) \leq 0$ by the “simple” constraint $\lambda \geq 0$.

While we have linked the dual problem with Lagrange multipliers to explain what was going on, it is somewhat messy and requires additional assumptions such as differentiability. Furthermore, the dual problem was written down, rather than derived. The remainder of this note attempts to rectify these issues.

The Legendre Transform and Duality

Let $h \colon \mathbb{R} \rightarrow \mathbb{R}$ be a convex function; in all that follows, the domain can be $\mathbb{R}^n$ (or some other vector space) but we study the $n=1$ case for simplicity, and because no intuition is lost. Shading in the graph of $h$ produces its epigraph — $\{(x,y) \in \mathbb{R} \times \mathbb{R} \mid y \geq h(x) \}$ — which is convex. The supporting hyperplanes of the epigraph of $h$ therefore determine the epigraph of $h$ and therefore $h$ itself is determined by these supporting hyperplanes.

We will be lazy and write “supporting hyperplane of $h$” to mean a supporting hyperplane of the epigraph of $h$.

A hyperplane is of the form $y = \lambda x - c$ for constants $\lambda, c \in \mathbb{R}$. (In higher dimensions, we would write $y = \langle \lambda, x \rangle - c$.) Since for any given “orientation” $\lambda$ there is at most one supporting hyperplane, we can define a function $h^\ast(\lambda)$ by the condition that, for any given orientation $\lambda$, the plane $y = \lambda x$, when shifted down by $h^\ast(\lambda)$ to become $y = \lambda x - h^\ast(\lambda)$, is a supporting hyperplane of (the epigraph of) $h$. (If there is no such hyperplane then we will see presently it is natural to define $h^\ast(\lambda)$ to be infinity.)

It is straightforward to derive a formula for $h^\ast(\lambda)$. Imagine superimposing the line $y = \lambda x$ onto the graph of $h$. If the line intersects the interior of the epigraph of $h$ then it must be shifted down until it just touches the graph of $h$ without entering the interior of the epigraph. If the line does not touch the graph at all, it must be shifted up until it just touches. The amount of shift can be determined by examining, for each $x$, the distance between the line and the graph. This distance is $\lambda x - h(x)$, and the following calculation shows we want to maximise this quantity.

For $y = \lambda x - c$ to be a supporting hyperplane of $h$, it must satisfy two requirements. First, we must have that $h(x) \geq \lambda x - c$ for all $x$. This means $c$ must satisfy $c \geq \lambda x - h(x)$ for all $x$, or in other words, $c \geq \sup_x \lambda x - h(x)$. The second requirement is that the line should intersect the graph, or in other words, $c$ should be the smallest possible constant for which the first requirement holds. That is, $c = \sup_x \lambda x - h(x)$ is such that $y = \lambda x - c$ is a supporting hyperplane of $h$, and indeed, it is the unique supporting hyperplane with orientation $\lambda$.

We therefore define $h^\ast(\lambda) = \sup_x \lambda x - h(x)$. It is called the Legendre transform of $h$.

The original claim is that $h$ can be recovered from its supporting hyperplanes. That is, given $h^\ast$, we should be able to determine $h$. We can do this point by point. Given an $x$, we know that $h(x)$ must lie above all the supporting hyperplanes $y = \lambda x - h^\ast(\lambda)$. That is, we have $h(x) \geq \sup_\lambda \lambda x - h^\ast(\lambda)$. Additionally, we know $h(x)$ should touch at least one supporting hyperplane, therefore, $h(x) = \sup_\lambda \lambda x - h^\ast(\lambda)$.

It is striking that this formula has exactly the same form as the Legendre transform! Indeed, the Legendre transform is self-dual: the above can be rewritten as $h(x) = h^{\ast\ast}(x)$.

At this juncture, it is easy to state the algebra behind taking the dual of an optimisation problem. In my opinion though, there is more to be understood than just the following algebra. The remainder of this note therefore elaborates on the following.

Given a convex function $f(x)$ that we wish to minimise, we can write down a dual optimisation problem in the following way. First, we embed $f(x)$ into a family of convex cost functions $\phi(x,y)$ meaning we choose a function $\phi$ that is convex and for which $\phi(x,0) = f(x)$. Different embeddings can lead to different duals!

For reasons that will be explained later, the “trick” is to define $h(y) = \inf_x \phi(x,y)$ and then write down an expression for $h^{\ast\ast}(0)$. Since the Legendre transform is self-dual,  $h^{\ast\ast}(0) = h(0) = \inf_x \phi(x,0) = \inf_x f(x)$ gives the answer to the original optimisation problem.

By definition, $h^{\ast\ast}(0) = \sup_\lambda - h^{\ast}(\lambda)$. Also by definition, $h^\ast(\lambda) = \sup_y \lambda y - h(y) = \sup_y \lambda y - \inf_x \phi(x,y) = -\inf_y \inf_x \phi(x,y) - \lambda y$. We therefore arrive at the following.

Duality: $\sup_\lambda \inf_{x,y} \phi(x,y) - \lambda y = \inf_x \phi(x,0)$.

While the above algebra is straightforward, a number of questions remain.

1. What is going on geometrically?
2. Why should the left-hand side be easier to evaluate than the right-hand side?
3. How should the family $\phi$ be chosen?

It is remarked that if we are allowed to interchange $\sup$ and $\inf$ then the left-hand side becomes $\inf_x \inf_y \left( \phi(x,y) - \sup_\lambda \lambda y\right)$ and $\sup_\lambda \lambda y$ equals infinity unless $y = 0$, so that $\inf_x \inf_y \left( \phi(x,y) - \sup_\lambda \lambda y\right) = \inf_x \phi(x,0)$.

A Geometrical Explanation

The aim is to understand geometrically what the left-hand side of the equation labelled Duality is computing. Start with the term $\inf_{x,y} \phi(x,y) - \lambda y$. From our discussion about the Legendre transform, we recognise this term as being the amount by which to shift the plane $z = 0 x + \lambda y$ to make it a supporting hyperplane for the graph $z = \phi(x,y)$. (I like to imagine $\phi$ as the quadratic function $\phi(x,y) = x^2 + y^2$ and I think of holding a sheet of paper to the surface of $z = x^2 + y^2$, thereby forming a tangent plane which is the same thing as a supporting hyperplane in this case.) Indeed, $\inf_{x,y} \phi(x,y) - \lambda y = -\phi^\ast(0,\lambda)$ where $\phi^\ast(\nu,\lambda) = \sup_{x,y} \nu x + \lambda y - \phi(x,y)$ is the Legendre transform of $\phi$.

For a given $\lambda$, the supporting hyperplane $z = \lambda y - \phi^\ast(0,\lambda)$ touches (and, if $\phi$ is smooth, is “tangent” to) $\phi$ at points $(x',y')$ that achieve the supremum in the definition of the Legendre transform, that is, at points $(x',y')$ satisfying $\sup_{x,y} \lambda y - \phi(x,y) = \lambda y' - \phi(x',y')$. Equivalently, these are the points $(x',y')$ achieving the infimum in $\inf_{x,y} \phi(x,y) - \lambda y$.

The first key observation is that because the hyperplane $z = \lambda y - \phi^\ast(0,\lambda)$ does not vary in $x$, if it touches the graph of $\phi$ at $(x',y')$ then $x'$ must minimise $x \rightarrow \phi(x,y')$. Intuitively, this is because the derivative of the hyperplane in the $x$ direction being zero implies the derivative of $\phi$ in the $x$ direction (if it exists) must also be zero at the point of contact between the hyperplane and the graph. More formally, it is a straightforward generalisation of the property of the Legendre transform that $\inf_x h(x) = -h^\ast(0)$.  Visually, $\inf_x h(x) = -h^\ast(0)$ states that the minimum of $h$ can be found by starting with a horizontal line and raising or lowering it until it becomes tangent to $h$. Algebraically, it follows immediately from the definition of the Legendre transform: $-h^\ast(0) = -\sup_x -h(x) = \inf_x h(x)$.

To prove the assertion that $x'$ minimises $x \rightarrow \phi(x,y')$, fix an orientation $\lambda$ and assume $(x',y')$ satisfies $\sup_{x,y} \lambda y - \phi(x,y) = \lambda y' - \phi(x',y')$. Define $h(x) = \phi(x,y')$. Then $h^\ast(0) = \sup_x -h(x) = \sup_x -\phi(x,y')$. Now, $\lambda y' - \phi(x',y') = \sup_{x,y} \lambda y - \phi(x,y) = \sup_x \left( \sup_y \lambda y - \phi(x,y) \right) = \sup_x \lambda y' - \phi(x,y')$, showing that $-\phi(x',y') = \sup_x -\phi(x,y')$, implying $h^\ast(0) = -\phi(x',y')$. The previous paragraph tells us that $\inf_x h(x) = -h^\ast(0)$, hence $\inf_x \phi(x,y') = \phi(x',y')$, as claimed.

We are halfway there. Given an orientation $\lambda$, we can implicitly find a point $(x',y')$ by solving the optimisation problem $\inf_{x,y} \phi(x,y) - \lambda y$. We know for any such point $(x',y')$, $x'$ minimises $x \mapsto \phi(x,y')$. Therefore, if we can find a $\lambda$ such that $y' = 0$ then we are done.

To help focus on the implicit function $\lambda \rightarrow y'$, rewrite $\sup_{x,y} \lambda y - \phi(x,y)$ as $\sup_y \left( \lambda y - \inf_x \phi(x,y) \right)$. By defining $h(y) = \inf_x \phi(x,y)$, we get $\sup_{x,y} \lambda y - \phi(x,y) = h^\ast(\lambda)$. We want to find $\lambda$ such that the point $(x',y')$ achieving the supremum in $\sup_{x,y} \lambda y - \phi(x,y)$ satisfies $y' = 0$. That is, we want to find $\lambda$ such that $\sup_{x,y} \lambda y - \phi(x,y) = \sup_x - \phi(x,0) = -\inf_x \phi(x,0)$. Rewriting in terms of $h$ gives the condition $h^\ast(\lambda) = -h(0)$. We can solve this either geometrically or algebraically.

Geometrically, recall that $-h^\ast(\lambda)$ is the location on the vertical axis where the hyperplane with orientation $\lambda$ intersects the vertical axis. Visually consider a tangent line of the convex function $x \mapsto (x-1)^2$. As we vary the tangent point, we observe where the tangent line intersects the $y$-axis. It is visually clear that the maximum intersection point occurs when $x = 0$. We therefore expect that the solution to $h^\ast(\lambda) = -h(0)$ is the value of $\lambda$ that maximises $-h^\ast(\lambda)$.

Algebraically, start by recalling from earlier that we have shown a property of the Legendre transform is that $-g^\ast(0) = \inf_x g(x)$. What we want is the “dual” of this property, obtained via the substitution $g = h^\ast$, namely $-h^{\ast\ast}(0) = \inf_\lambda h^\ast(\lambda)$. Since the Legendre transform is self-dual, $h(0) = h^{\ast\ast}(0)$, and in particular, we have deduced that $\inf_\lambda h^\ast(\lambda) = -h(0)$, agreeing with our geometric intuition in the previous paragraph.

What we have shown is that if $\lambda$ maximises $-h^{\ast}(\lambda) = \inf_{x,y} \phi(x,y) - \lambda y$ then the infimum in $\inf_{x,y} \phi(x,y) - \lambda y$ is achieved at a point $(x',y')$ with $y' = 0$. Indeed, assume $\lambda$ maximises $-h^{\ast}(\lambda)$. Retracing ours steps shows that $-h^{\ast}(\lambda) = h(0)$ and since $h(0) = \inf_x \phi(x,0)$ and $-h^{\ast}(\lambda) = \inf_{x,y} \phi(x,y) - \lambda y$ we have that $\inf_{x,y} \phi(x,y) - \lambda y = \inf_x \phi(x,0)$. If $x'$ minimises $\phi(x,0)$, so that $\inf_x \phi(x,0) = \phi(x',0)$, then $\inf_{x,y} \phi(x,y) - \lambda y = \inf_x \phi(x,0) = \phi(x',0) = \phi(x',0) - \lambda\,0$ showing $(x',0)$ achieves the infimum.

Summary: To find $\inf_x \phi(x,0)$, we first find the intercept point $\inf_{x,y} \phi(x,y) - \lambda y$ of a hyperplane with orientation $(0,\lambda)$. We then vary $\lambda$ until the intercept point is as large as possible: $\sup_\lambda \inf_{x,y} \phi(x,y) - \lambda y$. We know from earlier that the corresponding hyperplane will tangentially intersect the graph of $\phi$ at a point $(x',y')$ where $y' = 0$ (because $\lambda$ has been maximised) and where $x'$ minimises $x \mapsto \phi(x,y')$ (because the hyperplane does not vary in $x$). Since the infimum is achieved at $y =0$, the value of $\sup_\lambda \inf_{x,y} \phi(x,y) - \lambda y$ is readily determined by substituting in $y = 0$. Specifically, $\sup_\lambda \inf_{x,y} \phi(x,y) - \lambda y = \sup_\lambda \inf_x \phi(x,0) = \inf_x \phi(x,0)$. This concludes our geometric interpretation of the Duality equation.

Remark: While introducing points $(x',y')$ at which the infimum is achieved in the Duality equation is essential for giving geometrical intuition, it is a hindrance when it comes to algebraic derivations. Indeed, the infimum need not be achieved at any point, or it may be achieved at multiple points. This is the natural tension in mathematics between intuitive explanations and rigorous (but not necessarily intuitive) algebra.

An Example of When the Dual Problem is Simpler

The Duality equation is only useful if solving $\inf_{x,y} \phi(x,y) - \lambda y$ for various $\lambda$ were easier than solving $\inf_x \phi(x,0)$. Naively, one might assume this is never the case! Indeed, $\inf_{x,y} \phi(x,y) - \lambda y = \inf_y \left( \inf_x \phi(x,y) - \lambda y \right)$ appears to contain the original optimisation problem! But this reasoning is flawed: the optimisation problem could be written equally well as $\inf_x \left( \inf_y \phi(x,y) - \lambda y \right)$, and an example is now given of where $\inf_x \left( \inf_y \phi(x,y) - \lambda y \right)$ is easier to evaluate than $\inf_x \phi(x,0)$.

Consider the constrained optimisation problem introduced near the start of this note: minimise $f(x)$ subject to $g(x) \leq 0$. The “discontinuity” at the boundary $g(x) = 0$ causes difficulties: unconstrained optimisation (of continuous convex functions!) is generally easier than constrained optimisation. For reasons that will be discussed in the next section, we introduce a family of optimisation problems, indexed by $y$, where the objective function $f(x)$ stays the same but the constraints vary with $y$, namely, $g(x) \leq -y$. (The choice of negative sign is for later convenience.)

In convex optimisation, a constrained optimisation problem can be written as an unconstrained (but discontinuous and therefore still “difficult”) optimisation problem simply by defining the value of the function to be infinity at points violating the constraint. One way to write this is as $f(x) + \sup_{\alpha \geq 0} \alpha g(x)$ because the supremum will equal zero if $g(x) \leq 0$, and will equal $\infty$ otherwise. We therefore define our embedding to be $\phi(x,y) = f(x) + \sup_{\alpha \geq 0} \alpha (g(x)+y)$, so that $\inf_x \phi(x,y)$ is the minimum of $f(x)$ subject to $g(x) \leq -y$.

Consider evaluating $\inf_y f(x) + \sup_{\alpha \geq 0} \alpha (g(x)+y) - \lambda y$. The infimum over all $y$ is equal to the minimum of the infimum over $y \leq -g(x)$ and the infimum over $y > -g(x)$. If $y \leq -g(x)$ then $g(x)+y \leq 0$ and $\sup_{\alpha \geq 0} \alpha (g(x)+y) = 0$. If $y > -g(x)$ then $g(x)+y > 0$ and $\sup_{\alpha \geq 0} \alpha (g(x)+y) = \infty$. Therefore, $\inf_y f(x) + \sup_{\alpha \geq 0} \alpha (g(x)+y) - \lambda y = f(x) + \inf_{y,\ y \leq -g(x)} - \lambda y$. The last infimum equals $\lambda g(x)$ if $\lambda \geq 0$ and equals $-\infty$ otherwise. Both $\inf_x f(x) + \lambda g(x)$ and $\inf_x f(x) - \infty$ are easier to evaluate than the original $\inf_x f(x) + \sup_{\alpha \geq 0} \alpha g(x)$.

Finally, it is mentioned that for this example, the left-hand side of the Duality equation becomes $\sup_{\lambda \geq 0} \inf_x f(x) + \lambda g(x)$. This is precisely the dual problem introduced at the start of this note.

How to Choose an Embedding?

First, a remark. One might ask, why not look for a function $\phi(x,y)$ such that $\inf_{x,y} \phi(x,y) = \inf_x \phi(x,0)$? In principle, it seems necessary to know $\inf_x \phi(x,0)$ before such an extension could be made, thereby rendering the extension unhelpful for computing $\inf_x \phi(x,0)$. The benefit of the Duality equation is that $\phi$ just needs to be convex.

Since choosing an extension $\phi$ so that $\inf_{x,y} \phi(x,y) - \lambda y$ is easier to evaluate than $\inf_x \phi(x,0)$ comes down to experience, what we do here is analyse in greater detail the choice in the previous section.

To aid visualisation, assume we wish to minimise $f(x) = (x-1)^2$ subject to $g(x) = x \leq 0$. Therefore, $\phi(x,0)$ equals $(x-1)^2$ for $x \leq 0$ and equals $\infty$ for $x > 0$.

Simply replicating $\phi(x,0)$ for other values of $y$, meaning choosing $\phi(x,y) = \phi(x,0)$, does not simplify the optimisation problem. Since the main difficulty is that $\phi$ contains jumps to infinity, the first thing we might aim for, is ensuring $\inf_y \phi(x,y) < \infty$ for all $x$.

We cannot arbitrarily choose values for $\phi(x,y)$ because $\phi$ must be convex: we must extend $\phi(x,0)$ while maintaining convexity and so $\phi(x,y)$ must depend on $\phi(x,0)$.

Taking $\phi(x,y) = f(x)$ for $y \neq 0$ ensures $\inf_y \phi(x,y) < \infty$ for all $x$, but $\phi$ will not be convex because, for a fixed $x > 0$, $y \mapsto \phi(x,y)$ would equal $f(x)$ for $y \neq 0$ but equal $\infty$ for $y = 0$. At least on one side, say, $y < 0$, we must have $\phi(x,y) = \infty$ if $\phi(x,0) = \infty$.

With these thoughts in mind, we are led to consider choosing $\phi(x,y)$ to be $f(x)$ for $x \leq y$ and $\infty$ for $x > y$. This gives us a convex function satisfying $\inf_y \phi(x,y) < \infty$ for all $x$. Up to a change of sign, it corresponds with the choice in the previous section.

The reader is encouraged to visualise the graph of $\phi$. Note that the graph of $\phi(x,y) - \lambda y$ is obtained simply by rotating the graph of $\phi$ about the $x$-axis, alternatively described as “tilting” $\phi$ forwards or backwards. Therefore, $\inf_{x,y} \phi(x,y)$ is the global minimum of $\phi$, while $\inf_{x,y} \phi(x,y) - \lambda y$ is the global minimum of a tilted version of $\phi$.

There is an interesting interpretation of duality in this particular case. Consider $\inf_{x,y} \phi(x,y) - \lambda y$. When $\lambda = 0$ this corresponds to finding the unconstrained minimum of $f(x)$. If $\lambda$ is now slowly decreased, so that $\phi(x,y) - \lambda y$ tilts forwards, $\inf_{x,y} \phi(x,y) - \lambda y$ is now solving the constrained problem of minimising $f(x)$ subject to a constraint $x \leq c$ where $c$ is smaller than the location of the global minimum of $f$. (As $\lambda$ decreases, so does $c$.) In this way, by continuously varying $\lambda$, we can increase the severity of the constraint and thereby track down the constrained minimum that we seek.

Simple Explanation for Interpretation of Lagrange Multiplier as Shadow Cost

December 10, 2018 1 comment

Minimising $f(x)$ subject to $g(x)=0$ can be solved using Lagrange multipliers, where $f(x) + \lambda g(x)$ is referred to as the Lagrangian. The optimal value $(x^\star, \lambda^\star)$ satisfies both $\nabla f(x^\star) + \lambda^\star \nabla g(x^\star)=0$ and $g(x^\star) = 0$. Although $\lambda$ was added to the problem, it turns out that $\lambda^\star$ has physical significance: it is the rate of change of the optimal value $f(x^\star)$ to perturbations $c$ around zero in the constraint $g(x)=c$. A simple explanation (rather than an algebraic proof) is given below.

First the reader is reminded why the condition $\nabla f(x^\star) + \lambda^\star \nabla g(x^\star)=0$ is introduced. Let $M = \{ x \mid g(x) = 0 \}$ denote the constraint set. Then $f$ restricted to $M$ has a local minimum at a point $x^\star \in M$ if small perturbations that remain on $M$ do not decrease $f$. In other words, the directional derivative of $f$ should be zero in directions tangent to $M$ at $x^\star$. Since $\nabla g(x^\star)$ is normal to the surface $M$ at $x^\star$, the directions $h$ tangent to $M$ at $x^\star$ are those satisfying $\langle \nabla g(x^\star), h \rangle = 0$. The directional derivative at $x^\star$ in the direction $h$ is $\langle \nabla f(x^\star), h \rangle$. The condition is thus that $\langle \nabla f(x^\star), h \rangle = 0$ for all $h$ satisfying $\langle \nabla g(x^\star), h \rangle = 0$. This condition is equivalent to requiring that $\nabla f(x^\star)$ is a scalar multiple of $\nabla g(x^\star)$, that is, $\nabla f(x^\star) = - \lambda^\star \nabla g(x^\star)$ for some scalar $\lambda^\star$.

To understand the physical significance of $\lambda^\star$, it helps to think in terms of contour lines (or level sets) for both $f$ and $g$. For concreteness, visualise the constraint set $\{x \mid g(x)=0\}$ as a circle. Let $f^\star = f(x^\star)$ denote the minimum achievable value of $f(x)$ subject to $g(x)=0$. Then the level sets $\{x \mid f(x)=f^\star + \epsilon \}$ may be visualised as ellipses whose sizes increase with $\epsilon$. Furthermore, if $\epsilon = 0$ then the level set is tangent to the constraint set (the circle), while if $\epsilon < 0$ then the level set $\{x \mid f(x)=f^\star + \epsilon\}$ does not intersect the constraint set (the circle).

The key point is to consider what happens if we change the constraint set (the circle) from $\{x \mid g(x)=0\}$ to $\{x \mid g(x)=c\}$ for some $c$ close to zero. Changing $c$ will perturb the circle. If the circle gets slightly bigger then we can find a new $x^\star$, close to the old one, lying on the perturbed circle for which $f(x^\star)$ is smaller than before. Since the gradient represents the direction of greatest increase, it seems sensible to assume the new minimum is at $x^\star - \epsilon \nabla f(x^\star)$ for some $\epsilon > 0$. We must choose $\epsilon$ so $x^\star - \epsilon \nabla f(x^\star)$ satisfies the new constraint: $g(x^\star - \epsilon \nabla f(x^\star)) = c$. To first order, $g(x^\star - \epsilon \nabla f(x^\star)) = g(x^\star) - \epsilon \langle \nabla g(x^\star), \nabla f(x^\star) \rangle = - \epsilon \langle \nabla g(x^\star), - \lambda^\star \nabla g(x^\star) \rangle = \epsilon \lambda^\star \| \nabla g(x^\star) \|^2$. That is, $\epsilon$ is given by solving $\epsilon \lambda^\star \| \nabla g(x^\star) \|^2 = c$.

The new value of the objective function, to first order, is given by $f(x^\star - \epsilon \nabla f(x^\star)) = f^\star - \epsilon \langle \nabla f(x^\star) , \nabla f(x^\star) \rangle = f^\star - \epsilon \| \nabla f(x^\star) \|^2 = f^\star - \epsilon (\lambda^\star)^2 \| \nabla g(x^\star) \|^2$. Since $\epsilon = c (\lambda^\star)^{-1} \| \nabla g(x^\star) \|^{-2}$, we get that the improvement in $f$ is $\epsilon (\lambda^\star)^2 \| \nabla g(x^\star) \|^2 = c \lambda^\star$. The rate-of-change of the improvement in $f$ with respect to a change $c$ in the constraint is precisely $\lambda^\star$, as we set out to show.

Note that the above is not a proof but rather a rule of thumb that aids in our intuition. An actual proof is straightforward but not so intuitive.

Categories: Uncategorized

Tennis: Attacking versus Defending

What is an attacking shot? Must an attacking shot be hit hard? What about a defensive shot? This essay will consider both the theory and the practice of attacking and defensive shots. And it will show that if you often lose 6-2 to an opponent then only a small amount of improvement might be enough to tip the scales the other way.

Tennis is a Game of Probabilities

Humans are not perfect. Anyone who has tried aiming for a small target by hitting balls fed from a ball machine will know that hitting identical balls will produce different outcomes. What is often not appreciated though is how sensitive the outcome of a tennis match is to small improvements.

Imagine a match played between two robots, where on average, out of every 20 points played, one robot wins just a single more point than the other robot. This is a very small difference in abilities. Mathematically, this translates to the superior robot having a probability of 55% of winning a point against the inferior robot. (I have used robots as a way of ignoring psychological factors. While the effects of psychological factors are relatively small, at a professional level where differences in abilities are so small to begin with, psychological factors can be crucial to determining the outcome of a match.)

Using the “Probability Calculator” found near the bottom of this page, we find that such a subtle difference in abilities translates into the superior robot winning a 5-set match 95.35% of the time. (The probability of winning a 3-set match is 91%.)

The graphs found on this page show that the set score is “most likely” to be 6-2 or 6-3 to the superior robot, so the next time you lose a match 6-2 or 6-3, believe that with only a small improvement you might be able to tip the scales the other way!

The In-Rally Effect of a Shot

While hitting a winner is a good (and satisfying) thing, no professional tries to hit a winner off every ball. Tennis is a game of probabilities, where subtle differences pay large dividends.

The aim of every ball is to increase the probability of eventually winning the point.

So if your opponent has hit a very good ball and you feel you are in trouble, you might think to yourself that you are down 30-70, giving yourself only 30% chance of winning the rally. Your aim with your return shot is to increase your odds. Even if you cannot immediately go back to 50-50, getting to 40-60 will then put 50-50 within reach on the subsequent shot.

So why not go for winners off every ball? If you can maintain a 55-45 advantage off each ball you hit, you will have a 62.31% chance of winning the game, an 81.5% chance of winning the set, and a 91% chance of winning a 3-set match. On the other hand, against a decent player, hitting a winner off a deep ball is difficult and carries with it a significant chance of failure, i.e., hitting the ball out. Studies have shown that even the very best players cannot control whether a ball narrowly falls in or narrowly falls out, meaning that winners hit very close to the line are rarely intentional but rather come from a shot aimed safely within the court that has drifted closer to the line than intended.

Attacking Options

If by an attacking ball we mean playing a shot that gives us the upper hand in the rally, then our shot needs to make it difficult for our opponent to hit a challenging ball for us. How can this be done?

First, our ability to hit an accurate ball relates to how well we are balanced, how much time we have to react, how fast the incoming ball is, the spin on the ball and the height of the ball at contact. So we can make it difficult for opponents in many different ways. The opponent’s balance can be upset by

• wrong-footing her;
• making her cover a relatively long distance in a short amount of time to reach the ball;
• jamming her by hitting straight at her;
• forcing her to move backwards (so she cannot transfer her weight forwards into the ball).

The opponent’s neurones can be given a harder task by

• giving her less time to calculate how to swing the racquet;
• varying the incoming ball on each shot (different spins, heights, depths, speeds);
• making it harder for her to predict what we will do;
• getting her thinking about other things (such as the score).

Due to our physical structure, making contact with the ball outside of our comfort zone (roughly, between knee to shoulder height) decreases the margin for error because it is harder to “flatten out” the trajectory of the racquet about the contact point.

These observations lead to a variety of strategies for gaining the upper hand, including the following basic ones.

• Take the ball early, on the rise, to take time away from the opponent.
• Hit sharp angles, going near the lines.
• Hit with heavy top-spin to get the ball to bounce above the opponent’s shoulder.
• Hit hard and flat.
• Hit very deep balls close to the baseline.

Each of these strategies carries a risk of hitting out, therefore, it is generally advised not to combine strategies: if you take the ball on the rise, do not also aim close to the lines or hit excessively hard, for example.

If you find yourself losing to someone but not knowing why, it is because subtle differences in the balls you are made to hit can increase the chances of making a mistake, and only very small changes in probability (such as dropping from winning 11 balls out of 20 to only 10 balls out of 20) can hugely affect the outcome of the match (such as dropping from a 91% chance of winning the 3-set match to only a 50% chance of winning). To emphasise, if your opponent takes the ball earlier than normal, you are unlikely to notice the subtle time difference, but over the course of the match, you will feel you are playing worse than normal.

If you are hitting hard but all your balls are coming back, remember that it is relatively easy to return a hard flat ball: the opponent can shorten her backswing yet still hit hard by using your pace, making a safe yet effective return. Hit hard and flat to the open court off a high short ball, but otherwise, try hitting with more top-spin.

Defensive Options

Just as an attacking ball need not be a fast ball, a defensive ball need not be a slow ball. Rather, a defensive ball is one where our aim is to return to an approximately 50-50 chance of winning the point. If we are off balance, or are facing a challenging ball, we cannot go for too much or we risk hitting out and immediately losing the point.

Reducing the risk of hitting out can be achieved in various ways.

• Do not swing as fast or with excessive spin.
• Take the ball later after the bounce, giving it more time to slow down.
• Block the ball back (perhaps taking it on the rise) to a safe region well away from the lines.
• Focus on being balanced, even if it means hitting with less power.

Importantly, everyone is different, and you should learn what your preferred defensive shots are. For example, while flat balls inherently have a lower margin for error, nonetheless some people may find they are more accurate hitting flat than hitting with top-spin simply due to their biomechanical structure and past practice.

Placement of a defensive shot is crucial. Because speed, spin and/or time have been sacrificed for greater safety, it is quite likely that if the opponent has time to get into a comfortable position they can punish your ball. This is yet another example of where small differences can have large consequences: if the opponent can step into the ball you might be in trouble, yet hitting just a metre deeper, or a metre further to the side, might prevent this. Moreover, never forget that placement is relative to where the opponent currently is: hitting deep to the backhand is normally a good shot unless the opponent is already there!

Changes in depth can be very effective but they must be relative to where the opponent is. If an opponent is attacking, she might have moved to being on the baseline, looking to take the next ball early. Hitting a deep top-spin shot, not necessarily very hard, is very effective in this scenario because the opponent is forced to move backwards and thus cannot generate as much power. (A skilled opponent can still hit hard, but even a 10 km/h reduction in ball speed makes a large difference.)

Technical Considerations

Watching the slow-motion clips on YouTube of professional players hitting groundstrokes shows that the same player has many different ways of hitting her forehand and backhand. At the point of contact, you can look to see where her feet are, which way her hips are pointing, which way her shoulders are pointing, the angle of her wrist, and the contact point of the ball relative to the body. And as she hits, you can look to see which parts of the body are moving and which have momentarily become static.

While an upcoming article may consider such technical aspects in greater detail, here it is simply noted that while you should be able to hit every kind of shot from any reasonable contact point, each has advantages and disadvantages. And since tennis is a game of probabilities, it comes as no surprise that the top players instinctively know not just what type of shot to hit but also how technically to hit it in the best possible way for them at that particular moment: if they have time, and want to hit a heavy top-spin, they will probably choose a contact point further away from their body and step into the ball, while if they must return a very fast ball, they may instead use a more open stance and hit the ball more in line with their eyes, well out in front.

Next time you are on the court, experiment with how changing the contact point (even over a relatively small range of 10–20 cm) can change how hard you can hit the ball, how accurately you can hit the ball, and how much top-spin you can generate. And do not forget, this may change depending on the type of incoming ball. For example, generating pace off a slow ball is better done using a different technique than returning a fast incoming ball. Failing to recognise this may mean you “feel” you are not hitting the ball well when in fact you are just not using the best technique for the particular type of shot you want to hit.

The other side of the coin is recognising that sometimes it is necessary to play a ball using a non-optimal technique due to a funny bounce, or lack of time (or inherent laziness). In this situation, it is important to adjust the type of shot you hit. If you are forced to return a hard-hit ball at full stretch, you will lose accuracy, so do not go anywhere near the lines. If you are jammed, you are not going to be able to hit as heavy a ball as you may wish, so you may change to hit flatter, opting to take time away from the opponent. Of course, changing from heavy to flat generally means changing where you want your shot to land: a shorter top-spin that bounces above shoulder height is generally good whereas a shorter flat shot is generally bad, for example.

Categories: Uncategorized

Does the Voltage Across an Inductor Immediately Reverse if the Inductor is Suddenly Disconnected?

Consider current flowing from a battery through an inductor then a resistor before returning to the battery again. What happens if the battery is suddenly removed from the circuit? Online browsing suggests that the voltage across the inductor reverses “to maintain current flow” but the explanations for this are either by incomplete analogy or by emphatic assertion. Moreover, one could argue for the opposite conclusion: if an inductor maintains current flow, then since the direction of current determines the direction of the voltage drop, the direction of the voltage drop should remain the same, not change!

To understand precisely what happens, it is important to think in terms of actual electrons. When the battery is connected, there is a stream of electrons being pushed out of the negative terminal of the battery, being pushed through the resistor, being pushed through the inductor then being pulled back into the battery through its positive terminal. The question is what happens if the inductor is ripped from the circuit, thereby disconnecting its ends from the circuit. (The explanation of what happens does not change in any substantial way whether it is the battery or the inductor that is removed.)

The analogy of an inductor is a heavy water wheel. The inductor stores energy in a magnetic field while a water wheel stores energy as rotational kinetic energy. But if we switch off the water supply to a water wheel, and the water wheel keeps turning, what happens? Nothing much! And if we disconnect an inductor, so there is no “circuit” for current to flow in, what can happen?

One trick is to think not of a water wheel but of a (heavy) fan inside a section of pipe. Ripping the inductor out of the circuit corresponds to cutting the piping on either side of the fan and immediately capping the ends of the pipes. This capping mimics the fact that electrons cannot flow past the ends of wires; not taking sparks into consideration. Crucially then, when we disconnect the fan, there is still piping on either side of the fan, and still water left in these pipes.

Consider the water pressure in the capped pipe segments on both sides of the fan. Assume prior to cutting out the fan, water had been flowing from right to left through the fan. (Indeed, when the pump is first switched on, it will cause a pressure difference to build up across the fan. This pressure difference is what causes the fan to start to spin. As the fan spins faster, this pressure difference gets less and (ideally) goes to zero in the limit.) Initially then, there is a higher pressure on the right side of the fan. The fan keeps turning, powered partly by the pressure difference but mainly by its stored rotational kinetic energy. (Think of its blades as being very heavy, therefore not wanting to slow down.) So water gets sucked from the pipe on the right and pushed into the pipe on the left. These pipes are capped, therefore, the pressure on the right decreases while the pressure on the left increases. “Voltage drop” is a difference in pressure, therefore, the “voltage drop” across the “inductor” is changing.

There is no discontinuous change in pressure! The claim that the voltage across an inductor will immediately reverse direction is false!

That said, the pressure difference is changing, and there will come a time when the left pipe will have a higher pressure than the right pipe. Now there are two competing forces: the stored kinetic energy in the fan wants to keep pumping water from right to left, while the larger pressure on the left wants to force water from left to right. The fan will start to slow down and eventually stop, albeit instantaneously. At the very moment the fan stops spinning, there is a much larger pressure on the left than on the right. Therefore, this pressure difference will force the fan to start spinning in the opposite direction!

Under ideal conditions then, the voltage across the inductor will oscillate!

Why should we believe this analogy though? Returning to the electrons, the story goes as follows. Assume an inductor, in a circuit, has a current flowing through it, from left to right. Therefore, electrons are flowing through the inductor from right to left (because Benjamin Franklin had 50% chance of getting the convention of current flow correct). If the inductor is ripped out of the circuit, the magnetic field that had been built up will still “push” electrons through the inductor in an attempt to maintain the same current flow. The density of electrons on the right side of the inductor will therefore decrease, while the density on the left side will therefore increase. Electrons repel each other, so it becomes harder and harder for the inductor to keep pushing electrons from right to left because every electron wants its own space and it is getting more and more crowded on the left side of the inductor. Eventually, the magnetic field has used up all its energy trying to cram as many electrons as possible into the left side of the inductor. The electrons on the left are wanting to get away from each other and are therefore pushing each other over to the right side of the inductor. This “force” induces a voltage drop across the inductor: as electrons want to flow from left to right, we say the left side of the inductor is more negative than the right side. The voltage drop has therefore reversed, but it did not occur immediately, nor will it last forever, because the system will oscillate: as the electrons on the left move to the right, they cause a magnetic field to build up in the inductor, and the process repeats ad infinitum.

Adding to the explanation, we can recognise a build-up of charge as a capacitor. There is always parasitic capacitance because charge can always accumulate in a section of wire. Therefore, there is no such thing as a perfect inductor (for if there were, we could not disconnect it!). Rather, an actual inductor can be modelled by an ideal inductor in parallel with an ideal capacitor. (Technically, there should also be a resistor in series to model the inevitable loss in ordinary inductors.) An inductor and capacitor in parallel form what is known as a resonant “LC” circuit, which, as the name suggests, resonates!

Intuition behind Caratheodory’s Criterion: Think “Sharp Knife” and “Shrink Wrap”!

Despite many online attempts at providing intuition behind Caratheodory’s criterion, I have yet to find an answer to why testing all sets should work.

Therefore, I have taken the liberty of proffering my own intuitive explanation. For the impatient, here is the gist. Justification and background material are given later.

The set U is non-measurable. The set A is measurable.

We will think of our sets as rocks. As explained later, a rock is not “measurable” if its boundary is too jagged. In the above image, the rock $U$ is not measurable because it has a very jagged boundary. On the other hand, the rock $A$ has a very nice boundary and hence we would like to consider $A$ to be “measurable”.

For $A$ to be measurable according to Caratheodory’s criterion, the requirement is for $\mu(U) = \mu(U \cap A^\mathrm{c}) + \mu(U \cap A)$ to hold for all sets $U$. While other references generally make it clear that this is a sensible requirement if $U$ is measurable (and we will go over this below), what makes Caratheodory’s criterion unintuitive is the requirement that $\mu(U) = \mu(U \cap A^\mathrm{c}) + \mu(U \cap A)$ must hold for all sets $U$, including non-measurable sets. However, we argue that a simple change in perspective makes it intuitively clear why we can demand $\mu(U) = \mu(U \cap A^\mathrm{c}) + \mu(U \cap A)$ holds for all $U$.

An outer measure uses shrink wrap to approximate the boundary. If A cuts the rock U cleanly then the same “errors” are made when approximating the boundaries of the three objects in the picture, hence Caratheodory’s criterion holds.

The two ingredients of our explanation are that the boundary of $A$ serves as a knife, and that an outer measure calculates volume by using shrink wrap. (For example, Lebesgue outer measure approximates a set by a finite union of cubes that fully cover the set, and uses the volume of the cubes as an estimate of the volume of the set; precisely, the infimum over all such approximations is used.)

In the above image, the red curve represents the shrink wrap. The outer measure $\mu(U)$ is given by the area enclosed by the red curve. (For rocks, which are three dimensional, we would use the volume encased by the shrink wrap.)

When we take $U \cap A^\mathrm{c}$ and $U \cap A$, we think of using part of the boundary of $A$ (the part that intersects $U$) to cut $U$. If $A$ is a rectangle (or cube) then we produce a very clean straight cut of the set (or rock) $U$.

Consider shrink wrapping  $U \cap A^\mathrm{c}$ and $U \cap A$ individually, as shown by the red curves in the above image. Two observations are critical.

• If the cut made by $A$ is clean then the shrink wrap fits the cut perfectly; no error is made for this part of the boundaries of $U \cap A^\mathrm{c}$ and $U \cap A$.
• For all other parts of the boundaries of $U \cap A^\mathrm{c}$ and $U \cap A$ (i.e., the jagged parts), the errors made by the shrink wrap for $U \cap A^\mathrm{c}$ and $U \cap A$ precisely equal the errors made by the shrink wrap for $U$ (because it is the same jagged boundary being fitted).

It follows that $\mu(U) = \mu(U \cap A^\mathrm{c}) + \mu(U \cap A)$ holds whenever $A$ has a nice boundary.

Regardless of whether $U$ is measurable or not, Caratheodory’s criterion is blind to the boundary of $U$ and instead is only testing how “smooth” the boundary of $A$ is. (Different sets $U$ will test different parts of the boundary of $A$.)

From the Beginning

First, it should be appreciated that Caratheodory’s criterion is not magical: it cannot always produce the exact sigma-algebra that you are thinking of, so it suffices to understand why, in certain situations, it is sensible. In particular, this means we can assume that the outer measure we are given, works by considering approximations of a given set $A$ by larger but nicer sets, such as finite collections of open cubes, whose measure we know how to compute. Taking the infimum of the measures of these approximating sets gives the outer measure of $A$.

In physical terms, think of measuring the volume of rocks. An outer measure works by wrapping shrink wrap as tightly as possible around the rock, then asking for the volume encased by the shrink wrap. The key point is that if the boundary of the rock is piecewise smooth then the shrink wrap computes its exact volume, whereas if the boundary is sufficiently jagged then the shrink wrap cannot follow the contours perfectly and the shrink wrap (i.e., the outer measure) will exaggerate the true volume of the rock.

It is easy to spot rocks with bad boundaries if the volume of the whole space is finite: look at both the rock and its complement. Place shrink wrap over both of them. If the shrink wrap can follow the contours then it will measure the volume of the two interlocking parts perfectly and the sum of the volumes will equal the volume of the whole space. If not, we know the rock is too ill-shaped to be considered measurable.

If the total volume is infinite then we must zoom in on smaller portions of the boundary. For example, we might take an open ball $U$ and look at the boundary of $A \cap U$ using our shrink wrap trick: shrink wrap both $A \cap U$ and its complement in $U$, namely, $A^\mathrm{c} \cap U$, and see if the volumes of $A \cap U$ and $A^\mathrm{c} \cap U$ add up to the volume of $U$. If not, we know $A$ is too ill-shaped to be considered measurable. This seemingly relies on our choice of $U$ being a sufficiently nice/benign shape that it does not interfere with our observation of the boundary of $A$. For Lebesgue outer measure, the choice of an open ball for $U$ seems entirely appropriate, but perhaps the right choice will depend on the particular outer measure under consideration. Caratheodory’s criterion removes the need for such a choice by requiring $\mu(A \cap U) + \mu(A^\mathrm{c} \cap U) = \mu(U)$ to hold for all $U$ and not just nice $U$. The images and explanation given at the start of this note explain why this works: the jaggedness of $U$ appears on both sides of the equation and cancels out, therefore Caratheodory’s criterion is really only testing the boundary of $A$.

Remarks

1. Of course, intuition must be backed up by rigorous mathematics, and it turns out that Caratheodory’s criterion is useful because it is easy to work with and produces the desired sigma-algebra in many situations of interest.
2. Our intuitive argument has focused on the boundary of sets. If we consider a Borel measure (one generated by the open sets of a topological space) then we know that open sets are measurable and hence it is indeed the behaviour at the boundary that matters. That said, intuition need not be perfect to be useful.

An Alternative (and Very Simple) Derivation of the Boltzmann Distribution

This note gives a very simple derivation of the Boltzmann distribution that avoids any mention of temperature or entropy. Indeed, the Boltzmann distribution can be understood as the unique distribution with the property that when a large (or even a small!) number of Boltzmann distributions are added together, all the different ways of achieving the same “energy” have the same probability of occurrence, as now explained.

For the sake of a mental image, consider a small volume of gas. This small volume can take on three distinct energy levels, say, 0, 1 and 2. Assume the choice of energy level is random, with corresponding probabilities $p_0, p_1$ and $p_2$. A large volume of the gas can be partitioned into many such small volumes, and the total energy is simply the sum of the energies of each of the small volumes.

Mathematically, let $N$ denote the number of small volumes and let $E_i$ denote the energy of the $i$th small volume, where $i$ ranges from 1 to $N$. The total energy of the system is $E = \sum_{i=1}^N E_i$.

For a fixed $N$ and a fixed total energy $E$, there may be many outcomes $E_1,\cdots,E_N$ with the prescribed energy $E$. For example, with $N=3$ and $E = 2$, the possibilities are $(2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1),(0,1,1)$. The probabilities of these possibilities are $p_2p_0p_0, p_0p_2p_0, p_0p_0p_2, p_1p_1p_0, p_1p_0p_1,p_0p_1p_1$ respectively. Here, we assume the small volumes are independent of each other (non-interacting particle model). Can we choose $p_0, p_1,p_2$ so that all the above probabilities are the same?

Since the ordering does not change the probabilities, the problem reduces to the following. Let $N_0, N_1$ and $N_2$ denote the number of small volumes which are in energy levels 0, 1 or 2 respectively. Then $N_0 + N_1 + N_2 = N$ and $N_1 + 2N_2 = E$. For fixed $N$ and $E$, we want $p_0^{N_0} p_1^{N_1} p_2^{N_2}$ to be constant for all valid $N_0, N_1, N_2$.

A brute-force computation is informative: write $N_1 = E - 2 N_2$ and $N_0 = N - N_1 - N_2 = N - E + N_2$. Then $p_0^{N_0} p_1^{N_1} p_2^{N_2} = p_0^N p_0^{-E} p_1^E p_0^{N_2} p_1^{-2N_2} p_2^{N_2}$. This will be independent of $N_2$ if and only if $(p_0 p_2 / p_1^2)^{N_2}$ is constant, or in other words, $p_0 p_2 = p_1^2$. Remarkably, this is valid in general, even for small $N$.

Up to a normalising constant, the solutions to $p_0 p_2 = p_1^2$ are parametrised by $p_0 = 1$ and $p_2 = p_1^2$. Varying $p_1$ varies the average energy of the system. Letting $\beta = -\ln p_1$, we can reparametrise the solutions by $p_i = \exp\{- \beta i\}$. (Check: $p_0 = \exp\{0\} = 1$; $p_1 = \exp\{\ln p_1\} = p_1$; and $p_2 = \exp\{2 \ln p_1\} = p_1^2$.) This is indeed the Boltzmann distribution when the energy levels are 0, 1 or 2.

More generally, we can show that the Boltzmann distribution is the unique distribution with the property that different configurations (or “microstates”) having the same $N$ and $E$ have the same probability of occurrence. In one direction, consider the (unnormalised) probabilities $p_i = \exp\{-\beta E_i\}$ that define the Boltzmann distribution. Then $\prod p_i^{N_i} = \exp\{-\beta \sum E_i N_i\} = \exp\{-\beta E\}$ where $E = \sum E_i N_i$ is the total energy of the system. (Here, the $i$th region has energy $E_i$, and there are precisely $N_i$ regions with this energy in the configuration under consideration.) Therefore, the probability of any particular configuration depends only on the total energy and is thus uniformly distributed conditioned on the total energy being known. In the other direction, although we omit the proof, we can derive the Boltzmann distribution in generality along the same lines as was done earlier for three specific energy levels: regardless of the total number of energy levels, it suffices to consider in turn the energy levels $E_0, E_1$ and $E_i$ with probabilities $p_0, p_1$ and $p_i$. Assuming the $N_j$ are zero except for $j \in \{0,1,i\}$ then leads to a constraint on $p_i / p_0$ as a function of $p_1 / p_0$.

Some Comments on the Situation of a Random Variable being Measurable with respect to the Sigma-Algebra Generated by Another Random Variable

If $Y$ is a $\sigma(X)$-measurable random variable then there exists a Borel-measurable function $f \colon \mathbb{R} \rightarrow \mathbb{R}$ such that $Y = f(X)$. The standard proof of this fact leaves several questions unanswered. This note explains what goes wrong when attempting a “direct” proof. It also explains how the standard proof overcomes this difficulty.

First some background. It is a standard result that $\sigma(X) = \{X^{-1}(B) | B \in \mathcal{B}\}$ where $\mathcal{B}$ is the set of all Borel subsets of the real line $\mathbb{R}$. Thus, if $A \in \mathcal{B}$ then there exists an $H \in \mathcal{B}$ such that $Y^{-1}(A) = X^{-1}(H)$. Indeed, this follows from the fact that since $Y$ is $\sigma(X)$-measurable, the inverse image $Y^{-1}(A)$ of any Borel set $A$ must lie in $\sigma(X)$.

A “direct” proof would endeavour to construct $f$ pointwise. The basic intuition (and not difficult to prove) is that $Y$ must be constant on sets of the form $X^{-1}(c)$ for $c \in \mathbb{R}$. This suggests defining $f$ by $f(x) = \sup Y(X^{-1}(x))$. Here, the supremum is used to go from the set $Y(X^{-1}(x))$ to what we believe to be its only element, or to $\infty$ if $X^{-1}(x)$ is empty. Unfortunately, this intuitive approach fails because the range of $X$ need not be Borel. This causes problems because $f^{-1}((-\infty,\infty))$ is the range of $X$ and must be Borel if $f$ is to be Borel-measurable.

Technically, we need a way of extending the definition of $f$ from the range of $X$ to a Borel set containing the range of $X$, and moreover, the extension must result in a measurable function.

Constructing an appropriate extension requires knowing more about $Y$ than simply $Y(X^{-1}(x))$ for each individual $x$. That is, we need a canonical representation of $Y$. Before we get to this though, let us look at two special cases.

Consider first the case when $Y = I_{A}$ for some measurable set $A$, where $I$ is the indicator function. If $Y$ is $\sigma(X)$-measurable then $Y^{-1}(1)$ must lie in $\sigma(X)$ and hence there exists a Borel $H$ such that $Y^{-1}(1) = A = X^{-1}(H)$. Let $f = I_H$. (It is Borel-measurable because $H$ is Borel.) To show $Y = f \circ X$, let $\omega$ be arbitrary. (Recall, random variables are actually functions, conventionally indexed by $\omega$.) If $\omega \in X^{-1}(H)$ then $X(\omega) \in H$ and $(f \circ X)(\omega) = 1$, while $Y(\omega) = 1$ because $\omega \in X^{-1}(H) = Y^{-1}(1)$. Otherwise, if $\omega \not\in X^{-1}(H)$ then analogous reasoning shows both $Y(\omega)$ and $(f \circ X)(\omega)$ equal zero.

How did the above construction avoid the problem of the range of $X$ not necessarily being Borel? The subtlety is that the choice of $H$ need not be unique, and in particular, $H$ may contain values which lie outside the range of $X$. Whereas a choice such as $f(x) = \sup Y(X^{-1}(x))$ assigns a single value (in this case, $\infty$) to values of $x$ not lying in the range of $X$, the choice $f = I_H$ can assign either $0$ or $1$ to values of $x$ not in the range of $X$, and by doing so, it can make $f$  Borel-measurable.

Consider next the case when $Y = I_{A_1} + 2 I_{A_2}$. As above, we can find Borel sets $H_i$ such that $A_i = X^{-1}(H_i)$ for $i=1,2$, and moreover, $f = I_{H_1} + 2 I_{H_2}$ gives a suitable $f$. Here, it can be readily shown that if $x$ is in the range of $X$ then $x$ can lie in at most one $H_i$. Thus, regardless of how the $H_i$ are chosen, $f$ will take on the correct value whenever $x$ lies in the range of $X$. Different choices of the $H_i$ can result in different extensions of $f$, but each such choice is Borel-measurable, as required.

The above depends crucially on having only finitely many indicator functions. A frequently used principle is that an arbitrary measurable function can be approximated by a sequence of bounded functions with each function being a sum of a finite number of indicator functions (i.e., a simple function). Therefore, the general case can be handled by using a sequence $Y_n$ of random variables converging pointwise to $Y$. Each $Y_n$ results in an $f_n$ obtained by replacing the $A_i$ by $H_i$, as was done in the paragraph above. For $x$ in the range of $X$, it turns out as one would expect: the $f_n(x)$ converge, and $f(x) = \lim f_n(x)$ gives the correct value for $f$ at $x$. For $x$ not in the range of $X$, there is no reason to expect the $f_n(x)$ to converge: the choice of the $H_i$ at the $n$th and $(n+1)$th steps are not coordinated in any way when it comes to which values to include from the complement of the image of $X$. Intuitively though, we could hope that each $H_i$ includes a “minimal” extension that is required to make $f$ measurable and that convergence takes place on this “minimal” extension. Thus, by choosing $f(x)$ to be zero whenever $f_n(x)$ does not converge, and choosing $f(x)$ to be the limit of $f_n(x)$ otherwise, we may hope that we have constructed a suitable $f$ despite how the $H_i$ at each step were chosen. Whether or not this intuition is correct, it can be shown mathematically that $f$ so defined is indeed the desired function. (See for example the short proof of Theorem 20.1 in the second edition of Billingsley’s book Probability and Measure.)

Finally, it is remarked that sometimes the monotone class theorem is used in the proof. Essentially, the idea is exactly the same: approximate $Y$ by a suitable sequence $Y_n$. The subtlety is that the monotone class theorem only requires us to work with indicator functions $I_A$ where $A$ is particularly nice (i.e., $A$ lies in a $\pi$-system generating the $\sigma$-algebra of interest). The price of this nicety is that $Y$ must be bounded. For the above problem, as Williams points out in his book Probability with Martingales, we can simply replace $Y$ by $\arctan Y$ to obtain a bounded random variable. On the other hand, there is nothing to be gained by working with particularly nice $A_i$, hence Williams’ admission that there is no real need to use the monotone class theorem (see A3.2 of his book).

Poles, ROCs and Z-Transforms

Why should all the poles be inside the unit circle for a discrete-time digital system to be stable? And why must we care about regions of convergence (ROC)? With only a small investment in time, it is possible to gain a very clear understanding of exactly what is going on — it is not complicated if learnt step at a time, and there are not many steps.

Step 1: Formal Laurent Series

Despite the fancy name, Step 1 is straightforward. Digital systems operate on sequences of numbers: a sequence of numbers goes into a system and a sequence of numbers comes out. For example, if $x[n]$ denotes an input sequence then an example of a discrete-time system is $y[n] = x[n] + \frac12 x[n-1]$.

How can a specific input sequence be written down, and how can the output sequence be computed?

The direct approach is to specify each individual $x[n]$, for example, $x[n] = 0$ for $n < 0$, $x[0] = 1$, $x[1] = 2$, $x[2] = 3$ and $x[n] = 0$ for $n > 2$. Direct substitution then determines the output: $y[0] = x[0] + \frac12 x[-1] = 1 + \frac12 \cdot 0 = 1$, $y[1] = x[1] + \frac12 x[0] = 2 + \frac12 \cdot 1 = 2\frac12$ and so forth.

The system $y[n] = x[n] + \frac12 x[n-1]$ has a number of nice properties, two of which are linearity and time invariance. This is often expressed in textbooks by saying the system is LTI, where the L in LTI stands for Linear and the TI stands for Time Invariant. When first learning about discrete-time digital systems, usually attention is restricted to linear time-invariant systems, and indeed, it is for such systems that the Z-transform proves most useful.

Any LTI system has the property that the output is the convolution of the input with the impulse response. For example, the impulse response of $y[n] = x[n] + \frac12 x[n-1]$ is found by injecting the input $x[0] = 1$ and $x[n] = 0$ for $|n| \geq 1$. This impulse as input produces the output $y[0] = 1$, $y[1] = \frac12$ and $y[n] = 0$ for $n \notin \{0,1\}$. It is common to denote the impulse response using the letter $h$, that is, $h[0] = 1$, $h[1] = \frac12$ and $h[n] = 0$ for $n \notin \{0,1\}$.

By the definition of convolution, the convolution of a sequence $x[n]$ with a sequence $h[n]$ is $c[n] = \sum_{m=-\infty}^{\infty} h[m] x[n-m]$. If $h[0] = 1$, $h[1] = \frac12$ and $h[n] = 0$ for $n \notin \{0,1\}$ then $c[n] = x[n] + \frac12 x[n-1]$ and this is equal to $y[n]$; we have verified in this particular case that the output of the original system is indeed given by convolving the input with the impulse response.

Let’s be honest: writing $h[0] = 1$, $h[1] = \frac12$ and $h[n] = 0$ for $n \notin \{0,1\}$ is tedious! An alternative is to use a formal Laurent series to represent a sequence. Here, “formal” means that we do not try to evaluate this series! A Laurent series just means instead of limiting ourselves to polynomials in a dummy variable $t$, we allow negative powers of $t$, and we allow an infinite number of non-zero terms. (The variable $t$ has nothing to do with time; it will be replaced in Step 3 by $z^{-1}$ where it will be referred to as the Z-transform. To emphasise, $t$ is just a dummy variable, a placeholder.)

Precisely, the sequence $h[0] = 1$, $h[1] = \frac12$ and $h[n] = 0$ for $n \notin \{0,1\}$ can be encoded as $1 + \frac12 t$; isn’t that much more convenient to write? In general, an arbitrary sequence $h[n]$ can be encoded as the formal series $\sum_{n=-\infty}^{\infty} h[n] t^n$. The sole purpose of the $t^n$ term is as a placeholder for placing the coefficient $h[n]$. In other words, if we are given the formal series $2 t^{-5} + 3 + t^7$ then we immediately know that that is our secret code for the sequence that consists of all zeros except for the -5th term which is 2, the 0th term which is 3 and the 7th term which is 1.

Being able to associate letters with encoded sequences is convenient, hence people will write $x(t) = 2 t^{-5} + 3 + t^7$ to mean $x[-5] = 2$, $x[0] = 3$, $x[7] = 1$ and $x[n] = 0$ for $n \notin \{-5,0,7\}$. Do not try to evaluate $x(t)$ for some value of $t$ though – that would make no sense at this stage. (Mathematically, we are dealing with a commutative ring that involves an indeterminant $t$. If you are super-curious, see Formal Laurent series.)

Now, there is an added benefit! If the input to our system is $x(t) = 1 + 2t + 3t^2$ and the impulse response is $h(t) = 1 + \frac12 t$ then there is a “simple” way to determine the output: just multiply these two series together! The multiplication rule for formal Laurent series is equivalent to the convolution rule for two sequences. You can check that $y(t) = h(t)x(t) = (1+\frac12 t) (1 + 2t + 3t^2) = 1 + 2t + 3t^2 + \frac12 t + t^2 + \frac32 t^3 = 1 + 2\frac12 t + 4 t^2 + \frac32 t^3$ (corresponding to $y[0]=1, y[1]=2\frac12, y[2] = 4$ etc) really is the output of our system if the input were $x[n] = 0$ for $n < 0$, $x[0] = 1$, $x[1] = 2$, $x[2] = 3$ and $x[n] = 0$ for $n > 2$.

• A formal Laurent series provides a convenient encoding of a discrete-time sequence.
• Convolution of sequences corresponds to multiplication of formal Laurent series.
• Do not think of formal Laurent series as functions of a variable. (They are elements of a commutative ring.) Just treat the variable as a placeholder.
• In particular, there is no need (and it makes no sense) to talk about ROC (regions of convergence) if we restrict ourselves to formal Laurent series.

Step 2: Analytic Functions

This is where the fun begins. We were told not to treat a formal Laurent series as a function, but it is very tempting to do so… so let’s do just that! 🙂 The price we pay though is that we do need to start worrying about ROC (regions of convergence) for otherwise we could end up with incorrect answers.

To motivate further our desire to break the rules, consider the LTI system $y[n] = \frac12 y[n-1] + x[n]$. Such a system has “memory” or an “internal state” because the current output $y[n]$ depends on a previous output $y[n-1]$ and hence the system must somehow remember previous values of the output. We can assume that in the distant past the system has been “reset to zero” and hence the output will be zero up until the first time the input is non-zero. The impulse response is determined by setting $x[0]$ to 1 and observing the output: $h[0] = 1$, $h[1] = \frac12$, $h[2] = \frac14$ and in general $h[n] = \frac1{2^n}$ for $n \geq 0$. Encoding this using a formal Laurent series gives the alternative representation $h(t) = \sum_{n=0}^\infty \frac1{2^n} t^n = 1 + \frac12 t + \frac14 t^2 + \cdots$.

There is a more “efficient” representation of $h(t) = \sum_{n=0}^\infty \frac1{2^n} t^n$ based on the Taylor series expansion $\frac1{1-\alpha t} = 1 + \alpha t + \alpha^2 t^2 + \alpha^3 t^3 + \cdots$. Indeed, if we were to break the rules and treat $h(t)$ as a function, we might be tempted to write $h(t) = \frac1{1 - \frac12 t}$ instead of $h(t) = \sum_{n=0}^\infty \frac1{2^n} t^n$. It turns out that, with some care, it is mathematically legitimate to do this. (Furthermore, Step 3 will explain why it is beneficial to go to this extra trouble.)

We know we can encode a sequence using a formal Laurent series, and we can reverse this operation to recover the sequence from the formal Laurent series. In Step 2 then, we just have to consider when we can encode a formal Laurent series as some type (what type?) of function, meaning in particular that it is possible to determine uniquely the formal Laurent series given the function.

Power series (and Laurent series) are studied in complex analysis: recall that every power series has a (possibly zero) radius of convergence, and within that radius of convergence, a power series can be evaluated. Furthermore, within the radius of convergence, a power series (with real or complex coefficients) defines a complex analytic function. The basic idea is that a formal Laurent series might (depending on how quickly the coefficients die out to zero) represent an analytic function on a part of the complex plane.

If the formal Laurent series only contains non-negative powers of $t$ then it is a power series $f(t) = \sum_{n=0}^\infty \alpha_n t^n$ and from complex analysis we know that there exists an $R$, the radius of convergence (which is possibly zero or possibly infinite), such that the sum converges if $|t| < R$ and the sum diverges if $|t| > R$. In particular, $f(t)$ is an analytic function on the open disk $\{ t \in \mathbb{C} \mid |t| < R\}$.

If the formal Laurent series only contains non-positive powers, that is, $f(t) = \sum_{n=-\infty}^0 \alpha_n t^n$, then we can consider it to be a power series in $t^{-1}$ and apply the above result. Since $f(t) = \sum_{n=0}^\infty \alpha_{-n} \left(\frac1t\right)^n$, there exists a $\tilde R$ (possibly zero or possibly infinite) such that $f(t)$ is an analytic function on $\{ t \in \mathbb{C} \mid |t| > \tilde R\}$. [From above, the condition would be $|t^{-1}| < R$ which is equivalent to $|t| > \tilde R^{-1}$ hence defining $\tilde R = R^{-1}$ verifies the claim.]

In the general case, a formal Laurent series decomposes as $\sum_{n=-\infty}^{-1} \alpha_n t^n + \sum_{n=0}^\infty \alpha_n t^n$. Both sums must converge if it is to define an analytic function, hence in the general case, a formal Laurent series defines an analytic function on a domain of the form $\{ t \in \mathbb{C} \mid \tilde R < |t| < R\}$.

The encoding of a sequence as an analytic function is therefore straightforward in principle: given a formal Laurent series $f(t) = \sum_{n=-\infty}^\infty \alpha_n t^n$, determine the constants $\tilde R$ and $R$, and provided the region $\{ t \in \mathbb{C} \mid \tilde R < |t| < R\}$ is non-empty, we can use the analytic function $f(t)$ to encode the sequence $\alpha_n$. We must specify the ROC $\{ t \in \mathbb{C} \mid \tilde R < |t| < R\}$ together with $f(t)$ whenever we wish to define a sequence in this way; without knowing the ROC, we do not know the domain of definition of $f(t)$, that is, we would not know for what values of $t$ does $f(t)$ describe the sequence we want. It is possible for a function $f(t)$ to describe different sequences depending on the ROC! An example of this is now given.

If $f(t) = \sum_{n=0}^\infty t^n$ then we have seen that $f(t) = \frac1{1-t}$. But wait! This is not entirely true. If $|t| < 1$ then certainly $\sum_{n=0}^\infty t^n = \frac1{1-t}$.Yet if $|t| > 1$ then $\sum_{n=0}^\infty t^n = \infty \neq \frac1{1-t}$. The function $t \mapsto \frac1{1-t}$ for $|t| > 1$ does not describe the series $\sum_{n=0}^\infty t^n$. It looks like a perfectly good function though, so what series does it describe?

The region $|t| > 1$ is unbounded. (In fact, it is an open disc centred at the point at infinity.) This motivates us to replace $t$ by $\tau = t^{-1}$ so that the domain becomes $|\tau| < 1$ and we can attempt looking for a power series in $\tau$. Precisely, $\frac1{1-t} = \frac1{1-\tau^{-1}} = \frac{-\tau}{1-\tau} = -\tau \sum_{n=0}^\infty \tau^n = -\sum_{n=1}^\infty \tau^n = \sum_{n=-\infty}^{-1} (-1) t^n$. Therefore, the single function $t \mapsto \frac1{1-t}$ actually encodes two different series depending on whether we treat it as a function on $|t|<1$ or a function on $|t|>1$.

Readers remembering their complex analysis will not find this bizarre because a holomorphic (i.e., complex differentiable) function generally requires more than one power series to represent it. A power series about a point is only valid up until a pole is encountered, after which another point must be chosen and a power series around that point used to continue the description of the function. When we change points, the coefficients of the power series will generally change. In the above example, the first series $\sum_{n=0}^\infty t^n$ is a power series around the origin while the second series $\sum_{n=-\infty}^{-1} (-1) t^n$ is a power series around the point at infinity and therefore naturally has different coefficients. (Thankfully there are only these two possibilities to worry about: a power series about a different point $c \in \mathbb{C}$ would look like $\sum_{n=0}^\infty \alpha_n (t-c)^n$ and is not of the form we described in Step 1. Only when $c=0$ or $c = \infty$ does the power series match up with the formal Laurent series in Step 1.)

While it might seem that introducing analytic functions is an unnecessary complication, it actually makes certain calculations simpler! Such a phenomenon happened in Step 1: we discovered convolution of sequences became multiplication of Laurent series (and usually multiplication is more convenient than convolution). In Step 3 we will see how the impulse response of $y[n] = \frac12 y[n-1] + x[n]$ can be written down immediately.

Step 3: The Z-transform

Define $y(t) = \sum_{n=-\infty}^\infty y[n] t^n$ and $x(t) = \sum_{n=-\infty}^\infty x[n] t^n$. Clearly, the constraint $y[n] = \frac12 y[n-1] + x[n]$ implies a relationship between $x(t)$ and $y(t)$, but how can we determine this relationship?

Note that $y[n] = \frac12 y[n-1] + x[n]$ is actually an infinite number of constraints, one for each $n$. The first step is to think of these as a single constraint on the whole sequences $\{\cdots,x[-2],x[-1],x[0],x[1],x[2],\cdots\}$ and $\{\cdots,y[-2],y[-1],y[0],y[1],y[2],\cdots\}$. We can do this either intuitively or rigorously, arriving at the same answer either way.

Intuitively, think of a table (my lack of WordPress skills prevents me from illustrating this nicely). The top row looks like $\cdots \mid y[-1] \mid y[0] \mid y[1] \mid \cdots$. The next row looks like $\cdots \mid \frac12 y[-2] \mid \frac12 y[-1] \mid \frac12 y[0] \mid \cdots$. The last row looks like $\cdots \mid x[-1] \mid x[0] \mid x[1] \mid \cdots$. Stack these three rows on top of each other so they line up properly: the purpose of the table is to line everything up correctly! Then another way of expressing $y[n] = \frac12 y[n-1] + x[n]$ is by saying that the first row of the table is equal to the second row of the table plus the third row of the table, where rows are to be added elementwise.

Rigorously, what has just happened is that we are treating a sequence as a vector in an infinite-dimensional vector space: just like $( y[-1], y[0], y[1] )$ is a vector in $\mathbb{R}^3$, we can think of $( \cdots, y[-1], y[0], y[1], \cdots )$ as a vector in $\mathbb{R}^\infty$. Each of the three rows of the table described above is simply a vector.

To be able to write the table compactly in vector form, we need some way of going from the vector $( \cdots, y[-1], y[0], y[1], \cdots )$ to the shifted-one-place-to-the-right version of it, namely $( \cdots, y[-2], y[-1], y[0], \cdots )$. In linear algebra, we know that a matrix can be used to map one vector to another vector provided the operation is linear, and in fact, shifting a vector is indeed a linear operation. In abstract terms, there exists a linear operator $S \colon \mathbb{R}^\infty \rightarrow \mathbb{R}^\infty$ that shifts a sequence one place to the right: $S(\ ( \cdots, y[-1], y[0], y[1], \cdots ) \ ) = ( \cdots, y[-2], y[-1], y[0], \cdots )$.

Letting $\mathbf{x}$ and $\mathbf{y}$ denote the vectors $( \cdots, x[-1], x[0], x[1], \cdots )$ and $( \cdots, y[-1], y[0], y[1], \cdots )$ respectively, we can rigorously write our system as $\mathbf{y} = \frac12 S \mathbf{y} + \mathbf{x}$. This is precisely the same as what we did when we said the first row of the table is equal to the second row of the table plus the third row of the table.

Our natural instinct is to collect like terms: if $I$ denotes the identity operator (think of a matrix with ones on the diagonal), so that $I\mathbf{y} = \mathbf{y}$, then $\mathbf{y} = \frac12 S \mathbf{y} + \mathbf{x}$ is equivalent to $\left( I - \frac12 S \right) \mathbf{y} = \mathbf{x}$, that is, $\mathbf{y} = \left( I - \frac12 S \right)^{-1} \mathbf{x}$. This equation tells us the output $\mathbf{y}$ as a function of the input $\mathbf{x}$ — but how useful is it? (Does the inverse $\left( I - \frac12 S \right)^{-1}$ exist, and even if it does, how can we evaluate it?)

While in some cases the expression $\mathbf{y} = \left( I - \frac12 S \right)^{-1} \mathbf{x}$ might actually be useful, often it is easier to use series rather than vectors to represent sequences: we will see that the linear operator $S$ is equivalent to multiplication by $t$, which is very convenient to work with!

Precisely, since $y(t) = \sum_{n=-\infty}^\infty y[n] t^n$ and $\mathbf{y} = ( \cdots, y[-1], y[0], y[1], \cdots )$ represent exactly the same sequence, it should be clear that the equation $\mathbf{y} = \frac12 S \mathbf{y} + \mathbf{x}$ is equivalent to the equation $y(t) = \frac12 t\,y(t) + x(t)$. Indeed, returning to the table, $y(t)$ corresponds to the first row, $x(t)$ corresponds to the last row, and $\frac12 t\,y(t) = \frac12 t \sum_{n=-\infty}^\infty y[n] t^n = \sum_{n=-\infty}^\infty \frac12 y[n-1] t^n$ corresponds to the second row.

Now, from Step 2, we know that we can think of $x(t)$ and $y(t)$ as analytic functions (provided we are careful about the ROC). Therefore, we feel entitled to continue manipulating $y(t) = \frac12 t\,y(t) + x(t)$ to obtain $y(t) = \frac1{1-\frac12t} x(t)$. Since by definition the impulse response $h(t)$ satisfies $y(t) = h(t) x(t)$, we seem to have immediately found the impulse response $h(t) = \frac1{1-\frac12t}$ of the original system; at the very least, it agrees with the longer calculation performed at the start of Step 2.

Remark: Actually, the only way to justify rigorously that the above manipulation is valid is to check the answer we have found really is the correct answer. Indeed, to be allowed to perform the manipulations we must assume the existence of a domain on which both $x(t)$ and $y(t)$ are analytic. If we assume $y(t)$ is “nice” then, under that assumption, prove that $y(t)$ is “nice”, that does not mean $y(t)$ is nice. It is like saying “if it is raining then we can show it really is raining, therefore it must be raining right now!”. A rigorous justification would look something like the following. First we must make some assumption about the input sequence for otherwise we have no idea what ROC to use. If we are interested in inputs that are uniformly bounded and which are zero up until time 0 then we can safely assume that $x(t)$ is analytic on $|t| < 1$. Since $1-\frac12t$ is non-zero whenever $|t| < 1$, we know $h(t) = \frac1{1-\frac12t}$ is analytic on $|t| < 1$. Therefore $y(t) = h(t) x(t)$ will be analytic on $|t| < 1$. This means that $y(t)$ can be expanded as a power series in a neighbourhood of the origin, and the coefficients of that power series are what we believe the output of the system will be. To check this really is the output of the system, it suffices to show $y(t) = \frac12 t\,y(t) + x(t)$ for $|t| < 1$. This is straightforward: $\frac12 t\, y(t) + x(t) = \left(\frac12 t\, h(t) + 1\right) x(t) = \left( \frac{\frac12 t}{1-\frac12t} + \frac{1-\frac12 t}{1-\frac12 t}\right) x(t) = \frac1{1-\frac12t} x(t) = h(t)x(t) = y(t)$ as required, where every manipulation can be seen to be valid for $|t| < 1$ (there are no divisions by zero or other bad things occurring).

The remark above shows it is (straightforward but) tedious to verify rigorously that we are allowed to perform the manipulations we want. It is much simpler to go the other way and define a system directly in terms of $h(t)$ and its ROC. Then, provided the ROC of $x(t)$ overlaps with the ROC of $h(t)$, the output $y(t)$ is given by $h(t)x(t)$ on the overlap, which can be correctly expanded as a Laurent series with regard to the ROC, and the output sequence read off.

All that remains is to introduce the Z-transform and explain why engineers treat $z^{-1}$ as a “time shift operator”.

The Z-transform is simply doing the same as what we have been doing, but using $z^{-1}$ instead of $t$. Why $z^{-1}$ rather than $z$? Just convention (and perhaps a bit of convenience too, e.g., it leads to stable systems being those with all poles inside the unit circle, rather than outside). In fact, sometimes $z$ is used instead of $z^{-1}$, hence you should always check what conventions an author or lecturer has decided to adopt.

The rule that engineers are taught is that when you “take the Z-transform” of $y[n] = \frac12 y[n-1] + x[n]$ then you replace $y[n]$ by $Y(z)$, $x(n)$ by $X(z)$ and $y[n-1]$ by $z^{-1} Y(z)$. The reason this rule works was justified at great length above: recall that as an intermediate mental step we can think of the input and output as vectors, and this led to thinking of them instead as series, because multiplication by $t = z^{-1}$ will then shift the sequence one place to the right. Thus, $Y(z) = \frac12 z^{-1} Y(z) + X(z)$ is asserting that the sequence $y[n]$ is equal to the sequence $x[n]$ plus a half times the sequence $y[n]$ shifted to the right by one place, which is equivalent to the original description $y[n] = \frac12 y[n-1] + x[n]$.

Step 4: Poles and Zeros

A straightforward but nonetheless rich class of LTI systems can be written in the form $y[n] = a_1 y[n-1] + a_2 y[n-2] + \cdots + a_p y[n-p] + b_0 x[n] + b_1 x[n-1] + \cdots + b_q x[n-q]$. Importantly, this class is “closed” in that if you connect two such systems together (the output of one connects to the input of the other) then the resulting system can again be written in the same way (but generally with larger values of $p,q$). Another important feature of this class of systems is that they can be readily implemented in hardware.

Applying the Z-transform to such a system shows that the impulse response in the Z-domain is a rational function. Note that the product of two rational functions is again a rational function, demonstrating that this class of systems is closed under composition, as stated above. Most of the time, it suffices to work with rational impulse responses.

If $H(z) = \frac{f(z)}{g(z)}$ is rational (and in reduced form — no common factors) then the zeros of $H(z)$ are the solutions of the polynomial equation $f(z)=0$ while the poles are the solutions of the polynomial equation $g(z)=0$. [More generally, if $H(z)$ is analytic, then the zeros are the solutions of $H(z)=0$ and the poles are the solutions of $\frac1{H(z)} = 0$.] Note that a pole of $H(z)$ is a zero of its inverse: if we invert a system, poles become zeros and zeros become poles.

Poles are important because poles are what determine the regions of convergence and hence they determine when our manipulations in Step 3 are valid. This manifests itself in poles having a physical meaning: as we will see, the closer a pole gets to the unit circle, the less stable a system becomes.

Real-world systems are causal: the output cannot depend on future input. The impulse response $h[n]$ of a causal system is zero for $n < 0$. Therefore, the formal Laurent series (Step 1) representation of $h[n]$ has no negative powers of $t$. Its region of convergence will therefore be of the form $|t| < R$ for some $R > 0$. (If $R=0$ then the system would blow up!) Since $z = t^{-1}$, the ROC of a causal transfer function $H(z)$ will therefore be of the form $|z| > R$ for some $R < \infty$. If $H(z)$ is rational then it has only a finite number of poles, and it suffices to choose $R$ to be the largest of the magnitudes of the poles.

Let’s look at an unstable system: $y[n] = 2 y[n-1] + x[n]$. This system is clearly unstable because its impulse response is readily seen to be $1, 2, 4, 8, 16, 32, \cdots$. We put in a bounded signal ($x[0]=1$ and $x[n] = 0$ for $|n| \geq 1$) yet obtain an unbounded output ($y[n] = 2^n$ for $n \geq 0$). Taking the Z-transform gives $Y(z) = 2 z^{-1} Y(z) + X(z)$, so that $H(z) = Y(z)/X(z) = \frac1{1-2z^{-1}}$. For $|z|>2$, this gives us the correct series representation $\frac1{1-2z^{-1}} = \sum_{n=0}^\infty 2^n z^{-n}$.

If we put a signal that starts at time zero (i.e., $x[n] = 0$ for $n < 0$) into a causal system we will get an output even if the system is unstable. This is because the input $X(z)$ will have a ROC of the form $|z| > R_X$ and $H(z)$ has a ROC of the form $|z| > R_H$, so $Y(z) = H(z)X(z)$ will have a ROC $|z| > \max\{R_X,R_H\}$.

If we put a signal that started at time $-\infty$ into our system, then there might be no solution! (Intuitively, it means the system will have to have blown up before reaching time 0.) For example, $X(z) = \frac1{1-z}$ with ROC $|z| < 1$ represents the signal $x[n] = 1$ for $n \leq 0$ and $x[n]=0$ for $n > 0$. We cannot form the product $Y(z) = H(z) X(z)$ because there is no value of $z$ for which both $H(z)$ and $X(z)$ are valid: one’s ROC is $|z|>2$ while the other’s is $|z| < 1$.

There are many different definitions of a sequence being bounded. Three examples are: 1) there exists an $M$ such that, for all $n$, $|x[n]| < M$; 2) $\sum_{n=-\infty}^\infty | x[n] | < \infty$; and 3) $\sum_{n=-\infty}^{\infty} | x[n] |^2 < \infty$. Out of these three, the easiest to detect whether $x[n]$ is bounded given only $X(z)$ is the second one: if the ROC of $X(z)$ includes $|z|=1$ then, by definition (see a complex analysis textbook), $\sum_{n=-\infty}^\infty x[n] z^{-n}$ is absolutely convergent for $|z|=1$, meaning $\sum_{n=-\infty}^\infty |x[n] z^{-n}| < \infty$, hence $\sum_{n=-\infty}^\infty |x[n]| < \infty$ because $|z|=1$ implies $|z^{-n}| = 1$ for all $n$. This type of boundedness is called “bounded in $l^1$“. For the reverse direction, recall that the radius of convergence $R$ is such that a power series in $t$ will converge for all $|t| < R$ and diverge for all $|t| > R$. Therefore, if the boundary of the largest possible ROC of $X(z)$ does not contain the unit circle then $x[n]$ must be unbounded in $l^1$. (The boundary case is inconclusive: the power series $x(t) = \sum_{n=1}^\infty \frac1{n^2} t^n$ has a ROC $|t| < 1$ yet $x[n] = \frac1{n^2}$ is bounded. On the other hand, $x(t) = \sum_{n=0}^\infty t^n$ has a ROC $|t| <1$ and $x[n] = 1$ is unbounded.)

A sequence $x[n]$ is bounded in $l^1$ if the largest possible ROC of $X(z)$ includes the unit circle $|z|=1$. A sequence $x[n]$ is unbounded in $l^1$ if the closure of the largest possible ROC does not include the unit circle.

If the ROC of the transfer function $H(z)$ includes the unit circle, then an $l^1$-bounded input will produce an $l^1$-bounded output. Indeed, the ROC of both $H(z)$ and $X(z)$ will include the unit circle, therefore, the ROC of $Y(z)$ will include the unit circle and $y[n]$ will be $l^1$-bounded.

If all the poles of a causal $H(z)$ are within the unit circle then the ROC will include the unit circle, hence a bounded input will produce a bounded output. Engineers therefore call a system stable if all the poles of $H(z)$ are inside the unit circle.

Note that it follows from earlier remarks that if $H(z)$ has a pole outside the unit circle then the impulse response will be unbounded. If $H(z)$ has a pole on the unit circle then the impulse response will resonant at the corresponding frequency indefinitely. For example, if $H(z) = \frac1{1-az^{-1}}$ where $|a|=1$ then $h[n] = a^n$ for $n \geq 0$. This does not die away because $|h[n]| = |a^n| = 1$ for all $n \geq 0$. By writing $a = e^{\jmath \omega}$ we get $h[n] = e^{\jmath \omega n}$ and therefore we think of this as a resonance at frequency $\omega$. Furthermore, if $a = r e^{\jmath \omega}$ where $|r| < 1$, then $h[n] = a^n = r^n e^{\jmath \omega n}$ and we see that there is a damped resonance at frequency $\omega$. As the pole gets closer to the unit circle (that is, $r \rightarrow 1$) the impulse response takes longer to die out. Engineers care about pole placement!

Epilogue

• Although often engineers can get away with just following recipes, it is safer in the long run and easier to remember the recipes if you have spent a bit more time to learn the underlying rationale for when and why the rules work.
• Linear algebra and complex analysis are considered core subjects of an undergraduate mathematics degree because they have such broad applicability: here we saw them play integral roles in discrete-time signal processing.
• This is because the “mathematical behaviour” of Laurent series matches the “physical behaviour” of discrete-time linear time-invariant systems.
• There is a close relationship between the Z-transform and the Fourier transform: just make the substitution $z = e^{\jmath \omega}$. This is only allowed if the ROC contains the unit circle. The concept of a pole does not explicitly appear in the Fourier domain because, rather than permit $z$ to be anywhere in the complex plane, the substitution $z = e^{\jmath \omega}$ means we are only looking at the behaviour of $H(z)$ on the unit circle.
• When it comes to system design, understanding where the poles and zeros are can be more productive than trying to understand the whole frequency spectrum.