
## Introduction to the Grassmann Algebra and Exterior Products

Sadly, Grassmann’s mathematical work was not appreciated during his lifetime. Among other things, he introduced what is now called the Grassmann algebra.  It appears that Grassmann did this in part by looking for all possible ways a product structure could be introduced. Although there is strong geometric intuition behind the Grassmann algebra, it is not necessarily straightforward to grasp this intuition quickly from current introductory texts.  For example, if the Grassmann algebra is about lengths, areas and volumes of parallelotopes, why can $v_1$ and $v_2$ be added together to form a new vector $v_3 = v_1 + v_2$ when, in general, the length of $v_3$ will not be the sum of the lengths of $v_1$ and $v_2$?

The key point to keep in mind, and one I have not seen written down elsewhere, is that in the context of Grassmann algebras, lower-dimensional parallelotopes should be considered merely as building blocks for higher-dimensional parallelotopes; some background is required before getting to this point, though.

Stepping back, this note endeavours to re-invent the Grassmann algebra in an intuitive way, motivating the operations of addition and multiplication.  The point of departure is the desire to measure the relative volume of a $d$-dimensional oriented parallelotope in a vector space $V$ of dimension $d$.  Let us initially denote an oriented parallelotope by the ordered set $[v_1,\cdots,v_d]$ of vectors $v_1,\cdots,v_d \in V$ that form the sides of the parallelotope.  (See the wiki for a picture of the three-dimensional case.)  Here, “oriented” just means that the sides of the parallelotope are ordered. In hindsight, it becomes clear that it is simpler to work with oriented parallelotopes than non-oriented ones; a (multi-)linear theory can be developed for the former.  (Perhaps better motivation would come from considering how to define integration on a manifold, but I am endeavouring here to introduce Grassmann algebras without mention of forms from differential geometry.)

Given a metric on $V$, the volume of the parallelotope $[v_1,\cdots,v_d]$ can be computed by choosing an orthonormal basis for $V$ and computing the determinant of the matrix $A$ whose columns are the vectors $v_1,\cdots,v_d$ expressed as linear combinations of the basis vectors; put simply, if we assume $V$ is $\mathbb{R}^d$ and we use the Euclidean inner product then $A$ is the matrix whose $i$th column is $v_i$. Note that negative volumes are permissible, a consequence of working with oriented parallelotopes.  For brevity, parallelotopes will mean oriented parallelotopes and volumes will mean signed volumes.
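As a concrete check of the determinant computation (a minimal sketch; the helper `det3` and `signed_volume` are hypothetical names of my own, and the examples are in $\mathbb{R}^3$ with the Euclidean inner product):

```python
def det3(rows):
    """Determinant of a 3x3 matrix given as a list of rows (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = rows
    return a * (e*i - f*h) - b * (d*i - f*g) + c * (d*h - e*g)

def signed_volume(v1, v2, v3):
    # det(A) = det(A^T), so storing the sides as rows gives the same value
    # as using the matrix whose columns are v1, v2, v3
    return det3([v1, v2, v3])

assert signed_volume((1, 0, 0), (0, 1, 0), (0, 0, 1)) == 1   # unit cube
assert signed_volume((0, 1, 0), (1, 0, 0), (0, 0, 1)) == -1  # reordering the sides flips the sign
```

The sign change under reordering is exactly the “oriented” part of the story.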

If we don’t have a metric — or, precisely, we want to state results that are true regardless of which metric is being used — we can still make sense of one parallelotope being twice as big as another one, at least in certain situations.  For example, the parallelotope $[2v_1,\cdots,v_d]$ is twice as big as $[v_1,\cdots,v_d]$ because, no matter how we choose the metric, the volume of the former really will be twice that of the latter.  A key question to ask is: if $[v_1,\cdots,v_d]$ and $[w_1,\cdots,w_d]$ are two parallelotopes, will the ratio of their volumes be independent of the metric chosen?

Since we have limited ourselves to $d$-dimensional parallelotopes in $d$-dimensional space, it turns out that the answer is in the affirmative.  A proof can be based on the identity $\mathrm{det}(BA) = \mathrm{det}(B)\,\mathrm{det}(A)$: changing the metric changes the orthonormal basis, which multiplies every coordinate matrix by the same invertible matrix $B$, so the volumes of both parallelotopes are scaled by the same factor $\mathrm{det}(B)$ and their ratio is unchanged.
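This invariance can be verified numerically (a sketch with exact integer arithmetic; the particular matrices `A1`, `A2`, `B` are arbitrary choices of mine):

```python
def det3(rows):
    (a, b, c), (d, e, f), (g, h, i) = rows
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

def matmul3(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

A1 = [[1, 0, 2], [0, 1, 0], [1, 0, 1]]   # coordinate matrices of two parallelotopes
A2 = [[2, 1, 0], [0, 1, 1], [0, 0, 1]]
B  = [[1, 2, 0], [0, 1, 0], [3, 0, 1]]   # an invertible change of basis

# cross-multiplied form of det(B A1) / det(B A2) == det(A1) / det(A2):
assert det3(matmul3(B, A1)) * det3(A2) == det3(A1) * det3(matmul3(B, A2))
```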

If we decide that two (oriented) parallelotopes are equivalent whenever their (signed) volume is the same regardless of the metric chosen then it turns out that we can form a vector space structure on the set $P_V$ of all $d$-dimensional parallelotopes up to equivalence in a given $d$-dimensional vector space $V$.  Note that we are working with a quotient space structure; although we use the notation $[v_1,\cdots,v_d]$ to represent an element of $P_V$, different representations may correspond to the same element.  (Precisely, we have a projection $\pi: V \times \cdots \times V \rightarrow P_V$ taking $d$ vectors and returning the corresponding element of $P_V$, where $\pi(v_1,\cdots,v_d) = \pi(w_1,\cdots,w_d)$ if and only if the signed volume of $[v_1,\cdots,v_d]$ equals the signed volume of $[w_1,\cdots,w_d]$ regardless of the metric chosen.)  We choose to define scalar multiplication in $P_V$ by $\alpha \cdot [v_1,\cdots,v_d] \mapsto [\alpha v_1,\cdots,v_d]$.  (Note that the $\alpha$ could have multiplied any one of the $v_i$ because elements of $P_V$ are only distinguished up to differences in volume.)  That is to say, scalar multiplication corresponds to scaling the volume of the parallelotope.

Vector space addition in $P_V$ is worthy of contemplation even if the ultimate definition is straightforward.  (From a pedagogical perspective, having a simple mathematical definition does not imply having an intuitive understanding; Grassmann algebras have a simple mathematical definition, but one that belies the ingenuity required by Grassmann to develop them and one that potentially lacks the intuition required to feel comfortable with them.)  Thinking first in terms of cubes then in terms of parallelotopes, it is clear geometrically that $[v_1,v_2,\cdots,v_d] + [w_1,v_2,\cdots,v_d] = [v_1 + w_1, v_2, \cdots, v_d]$. In other words, if all but one vector are the same, there is an obvious geometric meaning that can be given to vector space addition in $P_V$.  Perhaps other special cases can be found.  Nevertheless, the general rule we wish to follow (if at all possible) is that if $[v_1,\cdots,v_d] + [w_1,\cdots,w_d] = [u_1,\cdots,u_d]$ then this should be taken to mean that the volume of the parallelotope $[v_1,\cdots,v_d]$ plus the volume of the parallelotope $[w_1,\cdots,w_d]$ is equal to the volume of the parallelotope $[u_1,\cdots,u_d]$.  If this is possible, then one way to achieve it is to define $[v_1,\cdots,v_d] + [w_1,\cdots,w_d]$ as follows.  Arbitrarily choose a basis $e_1,\cdots,e_d$ for $V$.  Then we know that there exist constants $\alpha$ and $\beta$ such that the volume of $[v_1,\cdots,v_d]$ is equal to $\alpha$ times the volume of $[e_1,\cdots,e_d]$, and the volume of $[w_1,\cdots,w_d]$ equals $\beta$ times the volume of $[e_1,\cdots,e_d]$.  Then $[v_1,\cdots,v_d] + [w_1,\cdots,w_d]$ is defined to be $(\alpha + \beta) \cdot [e_1,\cdots,e_d]$.  One can check that this indeed works; it endows $P_V$ with a well-defined vector space structure. 
(Precisely, one must first verify that our definitions are consistent — given $x, y \in P_V$, we claim that no matter which parallelotopes $[v_1,\cdots,v_d] \in \pi^{-1}(x)$, $[w_1,\cdots,w_d] \in \pi^{-1}(y)$ and $[e_1,\cdots,e_d]$ we used, the same element $\pi((\alpha + \beta) \cdot [e_1,\cdots,e_d])$ will be obtained — and then verify that the axioms of a vector space are satisfied.)
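The geometric identity $[v_1,v_2,\cdots,v_d] + [w_1,v_2,\cdots,v_d] = [v_1 + w_1, v_2, \cdots, v_d]$ is just multilinearity of the determinant in one slot, which a quick sketch can confirm (the specific vectors are arbitrary choices of mine):

```python
def det3(rows):
    (a, b, c), (d, e, f), (g, h, i) = rows
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

v1, w1 = (1, 2, 0), (3, 0, 1)
v2, v3 = (0, 1, 1), (1, 0, 2)
s = tuple(a + b for a, b in zip(v1, w1))   # the side v1 + w1

# volumes add when all sides but one agree:
assert det3([s, v2, v3]) == det3([v1, v2, v3]) + det3([w1, v2, v3])
```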

After all this effort, one may be disappointed to learn that $P_V$ is one-dimensional.  However, that is to be expected; we wanted $P_V$ to represent the (signed) volume of an (oriented) parallelotope and hence $P_V$ is essentially just the set of real numbers with the usual scalar multiplication and vector addition.  What we have done though is introduce the notation and mindset to pave the way for generalising this reasoning to parallelotopes of arbitrary dimension in $V$.

Importantly, the following approach will not work, in that it will not re-create the Grassmann algebra.  Consider all one-dimensional parallelotopes in $V$, where now $\dim V > 1$.  If $[v_1]$ and $[v_2]$ are two such parallelotopes then one might be tempted to declare that $[v_3] = [v_1] + [v_2]$ if and only if the length of $v_3$ is equal to the sum of the lengths of $v_1$ and $v_2$ with respect to all metrics.  This would lead to an infinite-dimensional vector space though, since it would only be possible to add two vectors that were linearly dependent.

An algebra (in this context) is a vector space that also has defined on it a rule for multiplying two elements, such that the multiplicative structure is consistent with the vector space structure, e.g., the associative and distributive laws hold.  Does multiplication enter the picture in any way when we think of volume?  For a start, the area of a rectangle can be calculated by taking the product of the lengths of two adjoining sides.  We are thus tempted to introduce a symbol $*$ that allows us to construct a higher-dimensional parallelotope from two lower-dimensional ones — namely, $[v_1,\cdots,v_i] * [w_1,\cdots,w_j] = [v_1,\cdots,v_i,w_1,\cdots,w_j]$ — and have some faint hope that this simple concatenation-of-parallelotopes operator behaves in a way expected of a multiplication operator.

Now for the key decision, which I have not seen stated elsewhere yet believe to be the key to understanding Grassmann algebras in a simple way.  Because the paragraph before last pointed out that we cannot treat length in a metric-independent way if we wish to stay in finite dimensions, we must use our definition of metric-independent volume to induce a weaker notion of metric-independent length, area and volume on lower-dimensional parallelotopes of the ambient space $V$. Precisely, we declare that $[v_1,\cdots,v_i]$ is equivalent to $[w_1,\cdots,w_i]$ if and only if, for all vectors $u_1,\cdots,u_{d-i}$, we have that $[v_1,\cdots,v_i,u_1,\cdots,u_{d-i}]$ has the same volume as $[w_1,\cdots,w_i,u_1,\cdots,u_{d-i}]$, where as usual $d$ is the dimension of $V$.  In particular, lower-dimensional parallelotopes are considered merely as building blocks for $d$-dimensional parallelotopes in $d$-dimensional spaces.  Immediate questions to ask are: does this work in theory, and is it useful in practice?  It does work; it leads to the Grassmann algebra. And it has found numerous uses in practice, but that is a different story which will not be told here.
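A small sketch makes this equivalence tangible (my own example): shearing one side of a two-dimensional parallelotope, replacing $v_2$ by $v_2 + c\,v_1$, produces an equivalent building block, because completing either with any third side $u$ yields the same signed volume.

```python
def det3(rows):
    (a, b, c), (d, e, f), (g, h, i) = rows
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

v1, v2, c = (1, 0, 2), (0, 1, 1), 3
v2_sheared = tuple(x + c * y for x, y in zip(v2, v1))   # v2 + c*v1

# [v1, v2] and [v1, v2 + c*v1] complete to equal volumes for every u tried:
for u in [(1, 1, 0), (0, 2, 5), (7, -1, 3)]:
    assert det3([v1, v2, u]) == det3([v1, v2_sheared, u])
```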

It is now a straightforward journey to the finish line.  Let $P_V^d$ denote what was earlier denoted $P_V$, and in general, let $P_V^i$ denote the set of all $i$-dimensional (oriented) parallelotopes up to the aforementioned equivalence relation.  Each of these sets can be made into a vector space with vector space operations relating directly to volumes. Precisely, if $[v_1,\cdots,v_i] \in P_V^i$ then the scalar multiple $\alpha \cdot [v_1,\cdots,v_i]$ is the parallelotope $[w_1,\cdots,w_i]$ (unique up to equivalence) such that, for all vectors $u_1,\cdots,u_{d-i}$, the volume of $[w_1,\cdots,w_i,u_1,\cdots,u_{d-i}]$ is precisely $\alpha$ times the volume of $[v_1,\cdots,v_i,u_1,\cdots,u_{d-i}]$ regardless of which metric is used to measure volume.  (This implies that the volume of $[w_1,\cdots,w_i]$ is precisely $\alpha$ times the volume of $[v_1,\cdots,v_i]$ but the converse is not necessarily true.)  Vector addition can be defined in a similar way.

It can be shown that $P_V^1$ is linearly isomorphic to $V$.  Indeed, if $v_3 = v_1 + v_2$ then $[v_3] = [v_1] + [v_2]$ because, for any vectors $u_1,\cdots,u_{d-1}$, the volume of the parallelotope $[v_3,u_1,\cdots,u_{d-1}]$ will equal the sum of the volumes of $[v_1,u_1,\cdots,u_{d-1}]$ and $[v_2,u_1,\cdots,u_{d-1}]$. Conversely, if $[v_3] = [v_1] + [v_2]$ then one can deduce by strategic choices of  $u_1,\cdots,u_{d-1}$ that the only possibility is $v_3 = v_1 + v_2$.  (Think in terms of determinants of matrices.)

As hinted at before, we expect multiplication to come into play and we expect it to behave nicely with respect to addition because we know, for example, that a rectangle of side lengths $a,c$ and a rectangle of side lengths $b,c$ have total area $ac+bc = (a+b)c$.  In other words, in $P_V^2$ at least, we expect that $[v_1] * [v_3] + [v_2] * [v_3] = ([v_1]+[v_2]) * [v_3]$.  This is indeed the case — for any $u_1,\cdots,u_{d-2}$ it is clear that $[v_1,v_3,u_1,\cdots,u_{d-2}] + [v_2,v_3,u_1,\cdots,u_{d-2}] = [v_1+v_2,v_3,u_1,\cdots,u_{d-2}]$ — and here the point is to explain why $*$ should behave like multiplication rather than prove rigorously that it does.

When it comes to rigorous proofs, it is time to switch from geometric intuition to mathematical precision.  Here, the key step is in recognising that the volume of a $d$-dimensional parallelotope $[v_1,\cdots,v_d]$ in a $d$-dimensional vector space is a multi-linear function of the constituent vectors $v_1,\cdots,v_d$.  In fact, it is not just any multi-linear map but an alternating one, meaning that if two adjacent vectors are swapped then the volume changes sign.  This is the starting point for the modern definition of exterior algebra, also known as the Grassmann algebra.

I intentionally used non-conventional notation because it was important to introduce concepts one by one.  First, because the operator $*$ introduced above is anti-commutative (it is almost as familiar as ordinary multiplication except that the sign can change, e.g., $[v_1] * [v_2] = - [v_2] * [v_1]$) it is common to denote it by the wedge product $\wedge$ instead.  Furthermore, since $P_V^1$ is isomorphic to $V$ it is customary to omit the square brackets, writing $v_1$ for $[v_1]$, writing $v_1 \wedge v_2$ for $[v_1,v_2]$, and so forth.
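The anti-commutativity of the wedge product can be seen in coordinates (a hypothetical sketch of my own: for $v, w \in \mathbb{R}^3$, the coordinates of $v \wedge w$ in the basis $e_1 \wedge e_2$, $e_1 \wedge e_3$, $e_2 \wedge e_3$ are the $2 \times 2$ minors $v_i w_j - v_j w_i$):

```python
def wedge(v, w):
    """Coordinates of v ∧ w for v, w in R^3 (the three 2x2 minors)."""
    return (v[0]*w[1] - v[1]*w[0],   # e1 ∧ e2 coefficient
            v[0]*w[2] - v[2]*w[0],   # e1 ∧ e3 coefficient
            v[1]*w[2] - v[2]*w[1])   # e2 ∧ e3 coefficient

v, w = (1, 2, 3), (4, 5, 6)
assert wedge(v, w) == tuple(-c for c in wedge(w, v))   # v ∧ w = -(w ∧ v)
assert wedge(v, v) == (0, 0, 0)                        # v ∧ v = 0
```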

There are some loose ends which I do not tidy up since the aim of this note is to prepare the reader for a standard account of the exterior algebra; perhaps though the last point to clarify is that the Grassmann algebra is the direct sum of the base field plus $P_V^1$ plus $P_V^2$ up to $P_V^d$.  Thus, if two parallelotopes cannot be added geometrically to form a new parallelotope, either because they are of differing dimensions, or roughly speaking because changing metrics would cause them to change in incongruous ways as building blocks, then they are just left written as a sum.

In summary:

• The exterior algebra of a vector space $V$ is a vector space whose elements represent equivalence classes of linear combinations of oriented parallelotopes in $V$.
• If $d$ is the dimension of $V$ then two $d$-dimensional parallelotopes are equivalent if and only if they have the same $d$-dimensional volume as each other with respect to any and all metrics.
• Multiplying a parallelotope by a scalar just multiplies its volume by the same amount (without changing the subspace in which it lies).
• A higher-dimensional parallelotope is constructed from lower-dimensional ones via the wedge product $\wedge$ which, except for possible sign changes, behaves precisely like a multiplication operator (because, roughly speaking, volume is determined by multiplying one-dimensional lengths together).
• Two $i$-dimensional parallelotopes $x$ and $y$ are equivalent if and only if, when treated as building blocks for constructing parallelotopes $x \wedge t$ and $y \wedge t$ of the same dimension as $V$, the volumes of the resulting parallelotopes $x \wedge t$ and $y \wedge t$ are always the same, regardless of which metric is used and how $t$ is chosen.
• The sum of two $i$-dimensional parallelotopes $x$ and $y$ equals the $i$-dimensional parallelotope $z$ if and only if, for all $(d-i)$-dimensional parallelotopes $t$, the volume of $z \wedge t$ equals the sum of the volumes of $x \wedge t$ and $y \wedge t$ regardless of which metric is used.  (Such a $z$ need not exist, in which case the resulting vector space sum is denoted simply by $x+y$.)

As always, this note may be unnecessarily long because it was written in a linear fashion from start to finish. Hopefully though, the general direction taken has some appeal.

## Differentiating Matrix Expressions The Easy Way, and an Elementary yet Genuine use for the Tensor Product

In many areas of science that require differentiating multivariate functions $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the derivative is often treated as a vector, and the second-order derivative as a matrix. This leads to notation in which sometimes $\frac{df}{dx}$ appears and sometimes its transpose $\left(\frac{df}{dx}\right)^T$. Extending this notation to higher derivatives, or to functions $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$, becomes messier still.

An alternative is to treat derivatives as (multi-)linear maps. If, at some stage, vectors and matrices are required, i.e., gradients and Hessians, these can be easily read off from the derivatives. But often these are not required.  Basically, the difference is working in a particular coordinate system — the gradient and Hessian are only defined with respect to an inner product and that determines the “coordinate system” being used — versus working in a coordinate-free manner.
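As a small illustration of the distinction (the example $f(x) = x^T A x$ and the names `f`, `Df` are my own, hypothetical choices): the derivative at $x$ is the linear map $h \mapsto x^T (A + A^T)\, h$, and the gradient $(A + A^T)x$ only appears once the Euclidean inner product is used to represent this linear map as a vector.

```python
def f(x, A):
    # f(x) = x^T A x on R^2
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

def Df(x, A, h):
    # the derivative at x, as a linear map applied to h: x^T (A + A^T) h
    return sum(x[i] * (A[i][j] + A[j][i]) * h[j] for i in range(2) for j in range(2))

A = [[1.0, 2.0], [0.0, 3.0]]
x, h = [1.0, -1.0], [0.5, 0.25]

# finite-difference check that Df really is the derivative in direction h
t = 1e-6
numeric = (f([x[0] + t*h[0], x[1] + t*h[1]], A) - f(x, A)) / t
assert abs(numeric - Df(x, A, h)) < 1e-4
```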

In Differential Calculus, Tensor Products, and the Importance of Notation, a quick overview is given, but one which points out several subtleties.  (For additional examples, see this earlier post.) Furthermore, it introduces the tensor product as a way of simplifying the notation further. This is an elementary yet genuine application benefitting from the tensor product, and is currently the best way I know of introducing tensor products early on to students in a meaningful way.  (I am not very pleased with my earlier attempt at an introductory article on the tensor product as I don’t feel it is interesting enough.)

## Sets of Measure Zero in Probability

Probability is unique in that, on the one hand, its axioms rely on advanced mathematics, yet on the other hand, it is not only used across all areas of science, it comes up in everyday conversation, especially when the topic is gambling or tomorrow’s weather.  I suspect this is the main reason why one does not have to look very far to find vehement debates about aspects of probability and statistics, whether it be Bayesians versus non-Bayesians, or the incorrect application of hypothesis testing in clinical trials, or of relevance to this article, Borel’s paradox.

When learning probability, people tend to worry about what would happen if an outcome having zero probability of occurring actually occurs.  It does not help that books often use the notation $p(x \mid y)$ for the probability that the random variable $X$ takes the value $x$ given that the random variable $Y$ was observed to equal $y$; before long, this notation is being used when $Y$ has a continuous distribution, so that the probability that $Y$ exactly equals $y$ is zero.  (This occurs even though the book will have warned the reader earlier that conditional probability cannot be defined on sets of measure zero, that is, sets having zero probability of occurring.)

The beauty of Kolmogorov’s axioms of probability theory is that they sidestep irrelevant but otherwise complicating issues.  This is often not appreciated because many textbooks choose to work with classical notation such as $p(x \mid y)$ rather than with measure-theoretic notation such as $E[ g(X) \mid \sigma(Y)]$ where $g$ is a test function and $\sigma(Y)$ is the $\sigma$-algebra generated by $Y$.  This article will:

• emphasise that $p(x \mid y)$ is an abuse of notation and must be used with care;
• emphasise that we should not use Euclidean distance when thinking in terms of probabilities (i.e., while there is not much difference between something having a chance of 0.1 or 0.0999999 of occurring, there is an infinite difference between something having a chance of $10^{-23}$ and a chance of $0$ of occurring);
• conclude that it would be more surprising if Borel’s paradox did not exist.

Tangentially, my observation in this field is that there are essentially just two principles that need to be kept in mind in order to be able to avoid/resolve all debates:

• Think first about the finite case, that is, the case when the total number of possible outcomes is finite.  Then realise that the infinite case is a mathematical construct that should be understood as a “limit” of the finite case, but that sometimes different limits can give different answers. The resolution then is to go back to the original question in order to determine which limit is the right one (or conclude that the original question is ill-posed because different choices of limits will give different answers).
• Never think purely in terms of estimation; always think in terms of the resulting decisions that may be taken in response to being told that estimate.

The standard theory of probability as we know it is designed for answering questions (filtering, hypothesis testing etc) based on the underlying assumption that the quality of the answers will be judged by some sort of expectation.  For example, with respect to the frequentist interpretation, this would mean that we would want to design a filter so that, if the filter were applied every day to a new set of data, then on average, the filter would perform as well as possible (perhaps in the sense of minimising the mean-squared error).  The presence of the expectation operator is crucial, for reasons which will become clear presently.

If the set of possible outcomes is finite, and if we agree that for all intents and purposes we are only interested in what happens on average (e.g., designing a filter so it works well on average), then there is no loss of generality in assuming that all outcomes have a non-zero chance of occurring.  In this case, conditional probability $p(x \mid y)$ is well-defined.

In general, when there are sets of measure zero, one must always remember that according to the rigorous theory, one cannot condition on a set of measure zero. There is a simple reason why: we are not given enough information to do this in any meaningful way.  A simple example will illustrate this.  Assume that I pick a point on a circle uniformly at random.  Choose two distinct points, $p$ and $q$, on the circle.  Intuitively at least, no harm or inconsistency will come if I conclude that, given that the outcome is either the point $p$ or the point $q$, the probability that it was $p$ is precisely one half; this accords with our intuitive understanding of “uniformly at random”. However, assume that instead I use the following procedure to pick a point on the circle: first I choose a point $x$ uniformly at random, then I check whether it is $p$.  If it is $p$, I declare my choice to be $q$; otherwise I declare my choice to be $x$.  Now it seems intuitively clear that the probability that the outcome is $p$, given that it is either $p$ or $q$, is zero. Since in Kolmogorov’s formulation of probability these two scenarios have identical descriptions, it follows that it is impossible to know what the conditional probability is of observing $p$ given that either $p$ or $q$ was chosen.  This is not a shortcoming but an advantage of Kolmogorov’s framework! As long as we agree that we are working inside an expectation operator (remember the earlier discussion), such questions are irrelevant.  While the theory could conceivably be expanded to allow meaningful answers to be given in certain situations, would it have any practical value?

The appearance of $p(x \mid y)$ is an abuse of notation.  The advantage is that it looks easier than the measure-theoretic approach.  The disadvantage is that wrong answers can be obtained if one is not aware of its correct interpretation; see M. Proschan and B. Presnell, “Expect the unexpected from conditional expectation,” Am Stat, vol. 52, no. 3, pp. 248–252, 1998.

The fact of the matter is that (when it exists), $p(x \mid y)$ is not a function, but an equivalence class of functions.  To specify an equivalence class, we must write down a representative function belonging to that equivalence class, hence $p(x \mid y)$ will always look like a function, e.g., $p(x \mid y) = (2\pi)^{-1/2}e^{-(x-y)^2/2}$.  It is easy to start thinking that it should be a function.  And some simple examples can be written down where there seems to be only one obvious choice of $p(x \mid y)$.  But $p(x \mid y)$ is not a function, and Borel’s paradox is the reason why.

Precisely, $p(x \mid y)$  is a Radon-Nikodym derivative.  It satisfies $Pr\{X \in A, Y \in B\} = \int_B p(y)\, \int_A p(x \mid y)\,dx\,dy$.  If $p(x \mid y)$ exists then it will not be unique because integration is blind to what happens on sets of measure zero.  Therefore, $p(x \mid y)$ should only appear inside an integral, since it is only averaged versions of $p(x \mid y)$ that are well-defined. Hence the importance of the expectation operator mentioned earlier; at the end of the day, we are always working under an integral sign, even if it is not always explicitly written down. Therefore, although it may look like $p(x \mid y)$ is being treated as a well-defined function, it isn’t, because there is an implicit if not explicit expectation somewhere.

It is both interesting and wonderful that it is possible to define an equivalence class of functions in such a way that pointwise evaluation is meaningless yet as soon as there is an integral sign then everything is well-defined.  This is at the heart of Borel’s paradox.  And unless one is already familiar with working with equivalence classes from other areas of mathematics, it may well take a bit of getting used to.  In other areas, square brackets are sometimes used to denote equivalence classes.  It might be clearer then to write $[p(x \mid y)]$ to remind ourselves that $[p(1 \mid 2)]$ is not defined if $Pr\{Y=2\}=0$; different members of the same equivalence class need not agree on sets of measure zero. Provided we always work inside an integral, this is not an issue.

Although there is merit in defining a probability to be a number between 0 and 1 inclusive, one should be careful not to think in terms of the usual Euclidean metric.  In some situations, intuition would be better guided if a probability of zero was thought of as $-\infty$ and a probability of one was thought of as $\infty$.  (One way to achieve this is to send a probability $p \in (0,1)$ to the real number $\ln(p) - \ln(1-p)$.)  The reason can be seen by thinking in terms of examining long sequences of outcomes.  If one biased coin had a probability of 0.001 of coming up heads, and another had a probability of 0.0001, then what we readily notice is the factor of ten; we would expect to get ten times as many heads with the first coin as with the second.  Even though the absolute difference $|0.001-0.0001|$ is small, the large ratio $0.001 / 0.0001$ makes the expected sequences very different.  Similarly, a coin with a chance of $10^{-23}$ of heads is infinitely different from a coin with no chance of heads; essentially any countably infinite sequence of outcomes of the first coin will have infinitely many heads whereas essentially any countably infinite sequence of outcomes of the second coin will have no (or at worst a finite number of) heads. These behaviours are very different!
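The log-odds map $p \mapsto \ln(p) - \ln(1-p)$ makes this concrete (a minimal sketch; the function name is mine):

```python
import math

def log_odds(p):
    """Map p in (0,1) to ln(p) - ln(1-p); 0 and 1 are sent towards -inf and +inf."""
    return math.log(p) - math.log(1 - p)

# 0.1 and 0.0999999 are essentially indistinguishable on this scale...
assert abs(log_odds(0.1) - log_odds(0.0999999)) < 1e-4

# ...whereas 1e-3 and 1e-4 differ by about ln(10), reflecting the factor of ten:
assert abs((log_odds(1e-3) - log_odds(1e-4)) - math.log(10)) < 0.01
```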

In other words, one would expect the following consistency in many cases: if an experiment involving the toss of a coin were undertaken repeatedly, and certain observations were made by averaging over many trials, then the difference in observations between a biased coin (with probability $p$ of heads) and an unbiased coin could be made arbitrarily small by taking $p$ arbitrarily close to $0.5$. The preceding paragraph implies that there is no reason whatsoever to expect such consistency between $p \rightarrow 0$ and $p = 0$.  It does not matter how close $p$ is to zero, it is still “infinitely far away” from zero in terms of expected behaviour.  This is the simple reason why Borel’s paradox is not surprising; we should not expect any form of continuity to hold in the limit as $p$ goes to zero.  And as soon as there is no continuity then the possibility exists for different sequences to have different limits: just think of the function $f(x) = \sin(1/x)$ and what happens as $x \rightarrow 0$.  For any $y \in [-1,1]$, a sequence $\{x_n\}$ can be constructed such that $x_n \rightarrow 0$ and $f(x_n) \rightarrow y$. Borel’s paradox is just another demonstration that different sequences can give different answers when continuity does not hold.

In summary:

• There is no reason for there to be any sort of “continuity” when approaching the extremes of probability 0 or probability 1 because something with probability $10^{-23}$ of happening is still infinitely more likely to happen than something with probability 0 and hence does not serve as a good approximation.
• In practice, there is always an expectation operator inside of which we are working. Therefore, while the notation $p(x \mid y)$ may suggest we are conditioning on sets of measure zero, we are not. We are always working with equivalence classes.
• The beauty of Kolmogorov’s axioms is that the lack of continuity that occurs when approaching the extremes of probability can be neatly sidestepped.
• Although in some simple examples it would be possible to extend Kolmogorov’s axioms so that conditioning on sets of measure zero were allowed, would this have any practical use?  For a start, it would require more information be provided than the standard probability triple $(\Omega,\mathfrak{F},\mathcal{P})$.

## When is Independence not Independence?

October 25, 2011

This brief article uses statistical independence as an example of when a mathematical definition is intentionally chosen to be different from the original motivating definition. (Another example comes from topology; the motivating/naive definition of a topological space would involve limits but instead open sets are used to define a topology.) This exemplifies the following messages:

• There is a difference between mathematical manipulations and intuition (and both must be learnt side-by-side); see also the earlier article on The Importance of Intuition.
• Understanding a definition mainly means understanding the usefulness of the definition and how it can be applied in practice.
• This has implications for how to teach and how to learn mathematical concepts.

Two random events, $A$ and $B$, are statistically independent if $\mathcal{P}[A \cap B] = \mathcal{P}[A] \mathcal{P}[B]$. Here is the (small) conundrum. If one were to stare at this definition, it may not make much sense. What is it really telling us about the two events? On the other hand, if one were to learn that if $A$ and $B$ are “unrelated” events that have “nothing to do with each other” then $\mathcal{P}[A \cap B] = \mathcal{P}[A] \mathcal{P}[B]$ must hold, then one might falsely believe they have understood the definition. Indeed, if $A$ and $B$ are related to each other, and $B$ and $C$ are related to each other, then surely $A$ and $C$ are related to each other? Conversely, if event $A$ is defined in terms of event $B$ then surely $A$ is related to $B$? Both these statements are false if ‘related’ is replaced by ‘statistically dependent’.

The true way of understanding statistical independence is to i) acknowledge that while it is motivated from real life by the intuitive notion of unrelated events, it is a different concept that has nevertheless proved to be very useful; and ii) be able to list a number of useful applications. Therefore, upon reading a definition that does not immediately feel comfortable, it may be better to flick through the remainder of the book to see the various uses of the definition than to stare blankly at the definition hoping for divine intuition.

For completeness, two naturally occurring examples of how statistical independence differs from “functional independence” are given. The first comes from the theory of continuous-time Markov chains but can be stated simply. Let $\lambda_1$ and $\lambda_2$ be two positive real numbers representing departure rates. Let $T_1$ and $T_2$ be independent and exponentially distributed random variables with parameters $\lambda_1$ and $\lambda_2$ respectively.  (That is, $\mathcal{P}[ T_i > t ] = e^{-\lambda_i t}$ for $t \geq 0$ and $i = 1,2$.) The rule for deciding where to move to next (in the context of Markov chains) is to see which departure time, $T_1$ or $T_2$, is smaller.  (If $T_1$ is smaller than $T_2$ we move to destination 1, otherwise we move to destination 2.) Let $p$ be the probability that $T_1$ is smaller: $p = \mathcal{P}[T_1 < T_2]$. It can be shown that $p = \lambda_1 / (\lambda_1 + \lambda_2)$, and moreover, that the event $T_1 < T_2$ is statistically independent of the departure time $\min\{T_1,T_2\}$. This may seem strange if one thinks in terms of related events, so it is important to treat statistical independence as a mathematical concept that merely means $\mathcal{P}[A \cap B] = \mathcal{P}[A] \mathcal{P}[B]$ regardless of whether or not $A$ and $B$ are, in any sense of the word, “related” to each other.
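A Monte Carlo sketch of this example (the rates, seed, the threshold event and all names are my own arbitrary choices): it estimates $p = \lambda_1/(\lambda_1+\lambda_2)$ and checks that the event $T_1 < T_2$ is approximately independent of an event about the departure time $\min\{T_1, T_2\}$.

```python
import random

random.seed(0)
lam1, lam2, n = 1.0, 2.0, 200_000
wins = fast = fast_and_win = 0
for _ in range(n):
    t1 = random.expovariate(lam1)           # T1 ~ Exp(lam1)
    t2 = random.expovariate(lam2)           # T2 ~ Exp(lam2)
    win = t1 < t2                           # "move to destination 1"
    early = min(t1, t2) < 0.2               # an arbitrary event about the departure time
    wins += win
    fast += early
    fast_and_win += win and early

p_win = wins / n
assert abs(p_win - lam1 / (lam1 + lam2)) < 0.01      # p ≈ 1/3

# independence check: P(win and early) ≈ P(win) * P(early)
assert abs(fast_and_win / n - p_win * (fast / n)) < 0.01
```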

The second example is that an event $A$ can be statistically independent of itself! In fact, this turns out to be useful: to prove that $A$ is an “extreme” event, by which I merely mean that either $\mathcal{P}[A] = 0$ or $\mathcal{P}[A]=1$, it suffices to prove that $A$ is independent of itself, and sometimes the latter is easier to prove than the former. (Furthermore, having first proved that $\mathcal{P}[A]$ can only be zero or one can then make it easier to prove that it equals one, for instance.)
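The one-line argument behind this fact deserves to be displayed. If $A$ is independent of itself, then taking $B = A$ in the definition gives

$$\mathcal{P}[A] = \mathcal{P}[A \cap A] = \mathcal{P}[A]\,\mathcal{P}[A] = \mathcal{P}[A]^2,$$

so $\mathcal{P}[A]\,(1 - \mathcal{P}[A]) = 0$, and hence $\mathcal{P}[A]$ is either $0$ or $1$.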

In closing, it is remarked that one can always challenge a definition by asking why this particular definition. Perhaps a different definition of statistical independence might be better? The response will always be: try to find a better definition! Sometimes you might be successful; this is how definitions are refined and generalised over time. Just keep in mind that a “good” definition is one that is useful and not necessarily one that mimics perfectly our intuition from the real world.

## Tensors and Matrices

This is a sequel to The Tensor Product in response to a comment posted there. It endeavours to explain the difference between a tensor and a matrix.  It also explains why ‘tensors’ were not mentioned in The Tensor Product.

A matrix is a two-dimensional array of numbers (belonging to a field such as $\mathbb{R}$ or $\mathbb{C}$) which can be used freely for any purpose, including for organising data collected from an experiment. Nevertheless, there are a number of commonly defined operations involving scalars, vectors and matrices, such as matrix addition, matrix multiplication, matrix-by-vector multiplication and scalar multiplication. These operations allow a matrix to be used to represent a linear map from one vector space to another, and it is this aspect of matrices that is the most relevant here.

Although standard usage means that an $n \times m$ matrix over $\mathbb{R}$ defines a linear map from the vector space $\mathbb{R}^m$ to the vector space $\mathbb{R}^n$, this is an artefact of $\mathbb{R}^n$ and $\mathbb{R}^m$ implicitly being endowed with a canonical choice of basis vectors. It is perhaps cleaner to start from scratch and observe that a matrix on its own does not define a linear map between two vector spaces: given two two-dimensional vector spaces $V, W$ and the matrix $A = [1,2;3,4]$ (by which I mean the two-by-two matrix whose elements are, from left-to-right top-to-bottom, $1,2,3,4$), what linear map from $V$ to $W$ does $A$ represent? By convention, a linear map is represented by a matrix in the following way, with respect to a particular choice of basis vectors for $V$ and for $W$. Let $v_1, v_2 \in V$ be a basis for $V$, and $w_1, w_2 \in W$ a basis for $W$. If $f: V \rightarrow W$ is a linear map then it is fully determined once the values of $f(v_1)$ and $f(v_2)$ are revealed. Furthermore, since $f(v_1)$ is an element of $W$, it can be written as $f(v_1) = \alpha_{11} w_1 + \alpha_{21} w_2$ for a unique choice of scalars $\alpha_{11}$ and $\alpha_{21}$. Similarly, $f(v_2) = \alpha_{12} w_1 + \alpha_{22} w_2$. Knowing the scalars $\alpha_{11}, \alpha_{12}, \alpha_{21}$ and $\alpha_{22}$, together with knowing the choice of basis vectors $v_1, v_2, w_1$ and $w_2$, allows one to determine what the linear map $f$ is. (An alternative but essentially equivalent approach would have been to agree that a matrix defines a linear map from $\mathbb{R}^m$ to $\mathbb{R}^n$, and therefore, to represent a linear map $f: V \rightarrow W$ by a matrix, it is first necessary to choose an isomorphism from $V$ to $\mathbb{R}^m$ and an isomorphism from $W$ to $\mathbb{R}^n$.)
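The recipe in the paragraph above can be carried out numerically (the particular map and bases below are invented for this sketch): the same linear map on $\mathbb{R}^2$ is represented by different arrays of numbers once the choice of basis changes, with column $j$ of the array holding the coordinates of $f(v_j)$ in the basis $w_1, w_2$.

```python
import numpy as np

# A fixed linear map f on R^2, defined here via standard coordinates.
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
def f(v):
    return M @ v

def matrix_of(f, basis_V, basis_W):
    # Column j holds the scalars alpha_1j, alpha_2j with
    # f(v_j) = alpha_1j w_1 + alpha_2j w_2, found by solving
    # basis_W @ alpha = f(v_j) (the w_i are the columns of basis_W).
    cols = [np.linalg.solve(basis_W, f(basis_V[:, j]))
            for j in range(basis_V.shape[1])]
    return np.column_stack(cols)

B_std = np.eye(2)                   # standard basis (columns)
B_alt = np.array([[1.0, 1.0],
                  [0.0, 1.0]])      # a different basis (columns)

A_std = matrix_of(f, B_std, B_std)  # recovers M itself
A_alt = matrix_of(f, B_alt, B_alt)  # a different array, yet the same map f
```

With respect to the alternative basis the array is $B^{-1} M B$: different numbers, same linear map.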

Unlike a matrix, which can only represent a linear map between vector spaces once a choice of basis vectors has been made, a tensor is a linear (or multi-linear) map. The generalisation from linear to multi-linear maps is a distraction which may lead one to believe the difference between a tensor and a matrix is that a tensor is a generalisation of a matrix to higher dimensions, but this is missing the key point: the machinery of changes of coordinates, which is external to the definition of a matrix as an array of numbers, is internal to the definition of a tensor.

Examples of tensors are linear maps $f: V \rightarrow \mathbb{R}$ and $g: V \rightarrow V$, and bilinear maps $h: V \times V \rightarrow \mathbb{R}$. They are tensors because they are multi-linear maps between vector spaces. Choices of basis vectors do not enter the picture until one wishes to describe a particular map $f$ to a friend; unless it is possible to define $f$ in terms of known linear maps such as $\mathrm{trace}$, it becomes necessary to write down a set of basis vectors and specify the tensor as an array of numbers with respect to this choice of basis vectors. This leads to the traditional definition of tensors, which is still commonly used in physics and engineering.

For convenience and consistency of notation, usually tensors are re-written as multi-linear maps into $\mathbb{R}$ (or whatever the ground field is). Both $f$ and $h$ above are already of this form, but $g$ is not. This is easily rectified; there is a natural equivalence between linear maps $g: V \rightarrow V$ and bilinear maps $\tilde g: V \times V^\ast \rightarrow \mathbb{R}$ where $V^\ast$ is the dual space of $V$; recall that elements of $V^\ast$ are simply linear functionals on $V$, that is, if $\sigma \in V^\ast$ then $\sigma$ is a linear function $\sigma: V \rightarrow \mathbb{R}$. This equivalence becomes apparent by observing that if, for a fixed $v \in V$, the values of $(\sigma \circ g)(v) = \sigma(g(v))$ are known for every $\sigma \in V^\ast$ then the value of $g(v)$ is readily deduced, and furthermore, the map taking a $v \in V$ and a $\sigma \in V^\ast$ to $(\sigma \circ g)(v) \in \mathbb{R}$ is bilinear; precisely, the correspondence between $g$ and $\tilde g$ is given by $\tilde g(v,\sigma) = (\sigma \circ g)(v)$. (If that’s not immediately clear, an intermediate step is the realisation that if the value of $\sigma(w)$ is known for every $\sigma \in V^\ast$ then the value of $w \in V$ is readily determined. Therefore, if we are unhappy about the range of $g$ being $V$, we can simply use elements of $V^\ast$ to probe the value of $w=g(v)$.)
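The probing argument can be made concrete with coordinates (the particular matrix below is invented for illustration): applying the dual basis functionals to $g(v)$ recovers $g(v)$ itself, and $\tilde g(v,\sigma) = \sigma(g(v))$ is linear in each slot.

```python
import numpy as np

# A linear map g: R^2 -> R^2 in standard coordinates.
G = np.array([[0.0, 1.0],
              [2.0, 3.0]])
def g(v):
    return G @ v

# Elements of the dual space are row covectors: sigma(v) = sigma @ v.
def g_tilde(v, sigma):
    return sigma @ g(v)   # the bilinear map (v, sigma) -> sigma(g(v))

v = np.array([1.0, -2.0])
e1_star = np.array([1.0, 0.0])   # dual basis functional reading off coordinate 1
e2_star = np.array([0.0, 1.0])   # dual basis functional reading off coordinate 2

# Probing with the dual basis functionals recovers g(v) itself:
recovered = np.array([g_tilde(v, e1_star), g_tilde(v, e2_star)])
```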

An equivalent definition of a tensor is therefore a multi-linear map of the form $T: V \times \cdots \times V \times V^\ast \times \cdots \times V^\ast \rightarrow \mathbb{R}$; see the wikipedia for details. (Linear maps between different vector spaces form a slightly more general concept and are not considered here for simplicity.)

It remains to introduce the tensor product (and to give another definition of tensors, this time in terms of tensor products).

First observe that $\mathcal{T}^p_q$, the set of all multi-linear maps $T: V^\ast \times \cdots \times V^\ast \times V \times \cdots \times V \rightarrow \mathbb{R}$ where there are $p$ copies of $V^\ast$ and $q$ copies of $V$, can be made into a vector space in an obvious way; just use pointwise addition and scalar multiplication of the multi-linear maps. Next, observe that $\mathcal{T}^0_1$ and $V^\ast$ are isomorphic. Furthermore, since $V^{\ast\ast}$, the dual of the dual of $V$, is naturally isomorphic to $V$, it is readily seen that $\mathcal{T}^1_0$ is isomorphic to $V$. Can $\mathcal{T}^p_q$ be constructed from multiple copies of $\mathcal{T}^0_1$ and $\mathcal{T}^1_0$?

It turns out that $\mathcal{T}^p_q$ is isomorphic to $\mathcal{T}^1_0 \otimes \cdots \otimes \mathcal{T}^1_0 \otimes \mathcal{T}^0_1 \otimes \cdots \otimes \mathcal{T}^0_1$ where there are $p$ copies of $\mathcal{T}^1_0$, $q$ copies of $\mathcal{T}^0_1$ and $\otimes$ is the tensor product defined in The Tensor Product. Alternatively, one could have invented the tensor product by examining how $\mathcal{T}^2_0$ can be constructed from two copies of $\mathcal{T}^1_0$, then observing that the same construction can be repeated and applied to $\mathcal{T}^1_0$, thereby ‘deriving’ a useful operation denoted by $\otimes$.

Another equivalent definition of a tensor is therefore an element of a vector space of the form $V \otimes \cdots \otimes V \otimes V^\ast \otimes \cdots \otimes V^\ast$, and this too is explained in the wikipedia.

To summarise the discussion so far (and restricting attention to the scalar field $\mathbb{R}$ for simplicity):

• Matrices do not, on their own, define linear maps between vector spaces (although an $n \times m$ matrix does define a linear map between the Euclidean spaces $\mathbb{R}^m$ and $\mathbb{R}^n$, thanks to their canonical bases).
• A tensor is a multi-linear map whose domain and range involve zero or more copies of a vector space $V$ and its dual $V^\ast$.
• Any such map can be re-arranged to be of the form $T: V^\ast \times \cdots \times V^\ast \times V \times \cdots \times V \rightarrow \mathbb{R}$.
• The space of such maps (for a fixed number $q$ of copies of $V$ and $p$ copies of $V^\ast$) forms a vector space denoted $\mathcal{T}^p_q$.
• It turns out that there exists a single operation $\otimes$ which takes two vector spaces and returns a third, such that, for any $p,q,r,s$, $\mathcal{T}^p_q \otimes \mathcal{T}^r_s$ is isomorphic to $\mathcal{T}^{p+r}_{q+s}$.  This operation is called the tensor product.
• Since $\mathcal{T}^1_0$ and $\mathcal{T}^0_1$ are isomorphic to $V$ and $V^\ast$ respectively, an equivalent definition of a tensor is an element of the vector space $V \otimes \cdots \otimes V \otimes V^\ast \otimes \cdots \otimes V^\ast$.

Some loose ends are now tidied up. Without additional reading though, certain remaining parts of this article are unlikely to be self-explanatory; the main purpose is to alert the reader to what to look out for when learning from textbooks. First, it will be explained how an element of $V \otimes V^\ast$ represents a linear map $h: V \rightarrow V$. Then an additional usage will be given of the tensor product symbol: the tensor product of two multi-linear maps results in a new multi-linear map. This additional aspect of tensor products was essentially ignored in The Tensor Product. Lastly, an explanation is given of why I omitted any mention of tensors in The Tensor Product.

Let $x \in V \otimes V^\ast$. The naive way to proceed is as follows. Introduce a basis $\{v_i\}$ for $V$ and $\{\sigma_j\}$ for $V^\ast$; different choices will ultimately lead to the same result. Then $x$ can be written as a linear combination $x = \sum_{i,j} \alpha_{ij} v_i \otimes \sigma_j$ where the $\alpha_{ij}$ are scalars. Recall from The Tensor Product that the $v_i \otimes \sigma_j$ are just formal symbols used to distinguish one basis vector of $V \otimes V^\ast$ from another. Here’s the trick; we now associate to each $v_i \otimes \sigma_j$ the linear map $h_{ij}: V \rightarrow V$ that sends $v \in V$ to $\sigma_j(v) v_i$; clearly $v \mapsto \sigma_j(v) v_i$ is a linear map from $V$ to $V$. (This is relatively easy to remember, for how else could we combine $v_i$ with $\sigma_j$ to obtain a linear map from $V$ to $V$?) Then, we associate to $x = \sum_{i,j} \alpha_{ij} v_i \otimes \sigma_j$ the linear map $h = \sum_{i,j} \alpha_{ij} h_{ij}$. It can be verified that this mapping is an isomorphism from the vector space $V \otimes V^\ast$ to the vector space of linear maps from $V$ to $V$, and moreover, the same mapping results regardless of the original choice of basis vectors for $V$ and $V^\ast$. While this is useful for actual computations, it does not explain how we knew to use the above trick of sending $v \in V$ to $\sigma_j(v) v_i$.
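In coordinates, the trick amounts to nothing more than matrix-vector multiplication. The following sketch (with an invented coefficient array $\alpha$) builds $h = \sum_{i,j} \alpha_{ij} h_{ij}$ directly from the maps $h_{ij}: v \mapsto \sigma_j(v)\, v_i$ and confirms it agrees with $v \mapsto \alpha v$.

```python
import numpy as np

n = 2
alpha = np.array([[1.0, 2.0],    # coefficients of x = sum_ij alpha_ij (v_i ⊗ sigma_j)
                  [3.0, 4.0]])

e = np.eye(n)   # v_i = e[:, i]; the dual basis satisfies sigma_j(v) = v[j]

def h_ij(i, j, v):
    return v[j] * e[:, i]        # the map sending v to sigma_j(v) * v_i

def h(v):
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            out += alpha[i, j] * h_ij(i, j, v)
    return out
```

In this basis $h$ is exactly $v \mapsto \alpha v$, which is why the coefficient array of such a tensor is usually displayed simply as a matrix.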

A more sophisticated way to proceed uses the universal property characterisation of the tensor product and makes it clear why the above construction works. (The universal property characterisation is defined in the wikipedia among other places.) Essentially, under this characterisation, every bilinear map from $V \times V^\ast$ to $\mathbb{R}$ induces a unique linear map from $V \otimes V^\ast$ to $\mathbb{R}$, and conversely, every linear map from $V \otimes V^\ast$ to $\mathbb{R}$ induces a unique bilinear map from $V \times V^\ast$ to $\mathbb{R}$. Now, we already know from earlier that linear maps from $V$ to $V$ are equivalent to bilinear maps from $V \times V^\ast$ to $\mathbb{R}$. As now shown, choosing an element of $V \otimes V^\ast$ is equivalent to choosing a linear map from $V \otimes V^\ast$ to $\mathbb{R}$. Indeed, by definition of the dual, linear maps from $V \otimes V^\ast$ to $\mathbb{R}$ are precisely the elements of $(V \otimes V^\ast)^\ast \cong V^\ast \otimes V^{\ast\ast} \cong V^\ast \otimes V \cong V \otimes V^\ast$, as required.

So far, we have only introduced the tensor product of two vector spaces. However, there is a companion operation which takes elements $v \in V$ and $w \in W$ of vector spaces and returns an element $v \otimes w$ of the vector space $V \otimes W$. (Recall that we have only introduced the formal symbol $v_i \otimes w_j$ to denote a basis vector in the case where $\{v_i\}$ and $\{w_j\}$ are chosen bases for $V$ and $W$; no meaning has been given yet to $v \otimes w$.)  For calculations, it suffices to think of $v \otimes w$ as the element obtained by applying formal algebraic laws such as $(u+v) \otimes w = (u \otimes w) + (v \otimes w)$. Precisely, if $v = \sum_i \alpha_i v_i$ and $w = \sum_j \beta_j w_j$ where $\{v_i\}$ and $\{w_j\}$ are chosen bases for $V$ and $W$ then $v \otimes w$ is defined to be $\sum_{i,j} \alpha_i \beta_j (v_i \otimes w_j)$, as suggested by the formal manipulations $v \otimes w = (\sum_i \alpha_i v_i) \otimes (\sum_j \beta_j w_j) = \sum_i \alpha_i (v_i \otimes \sum_j \beta_j w_j) = \sum_i \alpha_i \sum_j \beta_j (v_i \otimes w_j)$. I mention this only because I want to point out that this leads to the following simple rule for computing the tensor product of two multi-linear maps.
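In coordinates this rule is simply the outer product of coefficient vectors; a short check (with invented coefficients):

```python
import numpy as np

a = np.array([1.0, 2.0])         # v = 1*v1 + 2*v2
b = np.array([3.0, 4.0, 5.0])    # w = 3*w1 + 4*w2 + 5*w3

# v ⊗ w = sum_ij a_i b_j (v_i ⊗ w_j): its coefficient array is the outer product.
coeffs = np.outer(a, b)

# The formal law (u + v) ⊗ w = (u ⊗ w) + (v ⊗ w) becomes
# linearity of the outer product in its first argument.
u = np.array([0.5, -1.0])
lhs = np.outer(u + a, b)
rhs = np.outer(u, b) + np.outer(a, b)
```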

Let $S$ and $T$ be multi-linear maps; in fact, for ease of presentation, only the special case $S,T: V \rightarrow \mathbb{R}$ will be considered. Then a multi-linear map can be formed from $S$ and $T$, namely, $(u,v) \mapsto S(u) T(v)$. This multi-linear map is denoted $S \otimes T$ and is called the tensor product of $S$ and $T$, in agreement with the discussion in the previous paragraph.

It is easier to motivate the tensor product $S \otimes T$ of two tensors than it is to motivate the tensor product of two tensor spaces $\mathcal{T}^p_q \otimes \mathcal{T}^r_s$. Here is an example of such motivation.

What useful multi-linear maps can be formed from the linear maps $S,T: V \rightarrow \mathbb{R}$? Playing around, it seems $v \mapsto S(v)+T(v)$ is a linear map; let’s call it $S+T$. Multiplication does not work because $v \mapsto S(v)T(v)$ is not linear, so $ST$ is not a tensor. The ordinary cross-product would lead to a map $S \times T: V \times V \rightarrow \mathbb{R} \times \mathbb{R}$ which is not of the form of a tensor as stated earlier. What we can do though is form the map $(v,w) \mapsto S(v)T(w)$.  We denote this bilinear map by $S \otimes T: V \times V \rightarrow \mathbb{R}$. Experience has shown the construction $S \otimes T$ to be useful (which, at the end of the day, is the main justification for introducing new definitions and symbols), although for the moment, the choice of symbol $\otimes$ has not been justified save that it should be different from more common notation such as $S+T$, $ST$ and $S \times T$, which mean different things, as observed immediately above. Furthermore, the definition of $\otimes$ as stated above readily extends to general tensors…
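A minimal sketch of this construction (the particular functionals $S$ and $T$ are invented for illustration): $S \otimes T$ is bilinear, whereas the pointwise product $v \mapsto S(v)T(v)$ fails additivity.

```python
# Two linear functionals on V = R^2, written as dot products with fixed covectors.
def S(v):
    return 1.0 * v[0] + 2.0 * v[1]

def T(v):
    return 3.0 * v[0] - 1.0 * v[1]

# Their tensor product: the bilinear map (v, w) -> S(v) T(w).
def S_tensor_T(v, w):
    return S(v) * T(w)

def add(v, w):
    return (v[0] + w[0], v[1] + w[1])

v, vp, w = (1.0, 1.0), (2.0, 0.0), (0.0, 1.0)
# Bilinearity in the first slot: S⊗T(v + v', w) = S⊗T(v, w) + S⊗T(v', w).
lhs = S_tensor_T(add(v, vp), w)
rhs = S_tensor_T(v, w) + S_tensor_T(vp, w)
```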

One could perhaps use the above paragraph as the start of an introduction to tensor products. In The Tensor Product, I chose instead to ignore tensors completely because, although there is nothing ‘difficult’ about them, the above indicates that there are a multitude of small issues that need to be explained, and at the end of the day, it is not even clear at the outset if there is any use in studying tensor products of tensor spaces! What does the fancy symbol $\otimes$ buy us that we could not have obtained for free just by working directly with multi-linear maps and arrays of numbers? (There are benefits, but they appear in more advanced areas of mathematics and hence are hard to motivate concretely at an elementary level. Of course, one could say the tensor product reduces the study of multi-linear maps to linear maps, that is, back to more familiar territory, but multi-linear maps are not that difficult to work with directly in the first place.)

The Tensor Product motivated the tensor product by wishing to construct $\mathbb{R}[x,y]$ from $\mathbb{R}[x]$ and $\mathbb{R}[y]$. It was stated there that this allowed properties of $\mathbb{R}[x,y]$ to be deduced from properties of $\mathbb{R}[x]$. A solid example of this would be localisation of rings: if we have managed to show that the localisation $(\mathbb{R}[x])[1/x]$ is isomorphic to the ring of Laurent polynomials in one variable, then properties of the tensor product allow us to conclude that the localisation $(\mathbb{R}[x,y])[1/xy]$ is isomorphic to the ring of Laurent polynomials in two variables. Even without such examples though, I am more comfortable taking for granted that it is useful to be able to construct $\mathbb{R}[x,y]$ from $\mathbb{R}[x]$ and $\mathbb{R}[y]$ than it is useful to be able to construct $\mathcal{T}^{p+q}_{r+s}$ from $\mathcal{T}^p_r$ and $\mathcal{T}^q_s$.

## The Tensor Product

Various discussions on the internet indicate the concept of tensor product is not always intuitive to grasp on a first reading. Perhaps the reason is that it is harder to motivate the concept of tensor product on the vector space $\mathbb{R}^n$ than it is to motivate the concept of a tensor product on a polynomial ring. The following endeavours to give an easily understood pre-introduction to the tensor product. That is to say, the aim is to motivate the standard introductions that are online and in textbooks. Familiarity is assumed with only linear algebra and basic polynomial manipulations (i.e., adding and multiplying polynomials), hence the first step is to introduce just enough formalism to talk about polynomial rings.

By $\mathbb{R}[x]$ is meant the set of all polynomials in the indeterminate $x$ with real-valued coefficients. With the usual definitions of addition and scalar multiplication, $\mathbb{R}[x]$ becomes a vector space over the field $\mathbb{R}$ of real numbers.  For example, $x^2 + 2x + 3$ and $x^3 - 2x$ are elements of $\mathbb{R}[x]$ and can be added to form $x^3 + x^2 + 3$. An example of scalar multiplication is that $3 \in \mathbb{R}$ times $x^2 + 2x + 3 \in \mathbb{R}[x]$ is $3x^2 + 6x + 9 \in \mathbb{R}[x]$. These definitions of addition and scalar multiplication together satisfy the axioms of a vector space, thereby making $\mathbb{R}[x]$ into a vector space over $\mathbb{R}$. (The reason why $\mathbb{R}[x]$ is called a polynomial ring rather than a polynomial vector space is because it has additional structure — it is a special kind of vector space — in the form of a multiplication operation which is compatible with the addition and scalar multiplication operations. This is not important for us though.)

Polynomials in two indeterminates, say $x$ and $y$, also form a vector space (and indeed, a ring) with respect to the usual operations of polynomial addition and scalar multiplication.  This vector space is denoted $\mathbb{R}[x,y]$.

The cross-product $\times$ of two vector spaces is relatively easy to motivate and understand, so much so that the reader is assumed to be familiar with the cross-product.  Recall that if $U$ and $V$ are vector spaces then the elements of $U \times V$ are merely the pairs $(u,v)$ where $u \in U$ and $v \in V$. Recall too that $\mathbb{R} \times \mathbb{R}$ is isomorphic to $\mathbb{R}^2$. In particular, observe that the cross-product is a means of building a new vector space from other vector spaces.

The tensor product is just like the cross-product in that it too allows one to build a new vector space from other vector spaces. It might be illuminating to think of the other vector spaces as building blocks, and the new vector space as something more complicated built from these simpler building blocks, although of course this need not always be the case. Regardless, what makes such constructions useful is that properties of the new vector space can be deduced from properties of the building-block vector spaces. (The simplest example of such a property is the dimension of a vector space; the dimension of $U \times V$ is the sum of the dimensions of $U$ and $V$.)

Can $\mathbb{R}[x,y]$ be obtained from $\mathbb{R}[x]$ and $\mathbb{R}[y]$? Let’s try the cross-product. The vector space $\mathbb{R}[x] \times \mathbb{R}[y]$ consists of pairs $(p(x),q(y))$ where $p(x)$ represents a polynomial in $\mathbb{R}[x]$ and $q(y)$ a polynomial in $\mathbb{R}[y]$. This does not appear to work, for how should the polynomial $xy^2 + yx^3 + 1$ be represented in the form $(p(x),q(y))$? (If the ring structure were taken into account then it would be possible to multiply two elements of $\mathbb{R}[x] \times \mathbb{R}[y]$ and it would be seen that, in effect, $\mathbb{R}[x] \times \mathbb{R}[y]$ consists of polynomials that can be factored as $p(x)q(y)$, a proper subset of $\mathbb{R}[x,y]$. For this introduction though, all that is important is that $\mathbb{R}[x,y]$ is not the cross-product of $\mathbb{R}[x]$ and $\mathbb{R}[y]$.)

With the belief that $\mathbb{R}[x,y]$ should be constructible from $\mathbb{R}[x]$ and $\mathbb{R}[y]$, let’s try to figure out what is required to get the job done. Favouring simplicity over sophistication, let’s investigate the problem in terms of basis vectors.  The obvious choices of basis vectors are $\{1,x,x^2,\cdots\}$ for $\mathbb{R}[x];$ $\{1,y,y^2,\cdots\}$ for $\mathbb{R}[y];$ and $\{1,x,y,x^2,xy,y^2,\cdots\}$ for $\mathbb{R}[x,y].$ We are in luck as the pattern is easy to spot: the basis vectors of $\mathbb{R}[x,y]$ are just all pairwise products (in the sense of polynomial multiplication) of the basis vectors of $\mathbb{R}[x]$ with the basis vectors of $\mathbb{R}[y]$!
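A tiny computational restatement of this observation (the dict-of-coefficients representation is an invented convenience): encoding a polynomial in $\mathbb{R}[x,y]$ by its coefficients on the monomials $x^i y^j$, every element is a finite linear combination of pairwise products of the basis vectors $\{x^i\}$ of $\mathbb{R}[x]$ and $\{y^j\}$ of $\mathbb{R}[y]$.

```python
# A polynomial in R[x, y] as a dict mapping (i, j) -> coefficient of x^i * y^j.
# The basis vector x^i of R[x] times the basis vector y^j of R[y] is the
# monomial x^i * y^j, i.e. the coefficient dict with a single 1 at (i, j).
def basis_product(i, j):
    return {(i, j): 1.0}

def scale_add(p, c, q):
    # p + c*q on coefficient dicts
    out = dict(p)
    for key, val in q.items():
        out[key] = out.get(key, 0.0) + c * val
    return out

# x*y^2 + y*x^3 + 1, assembled purely from pairwise products of basis vectors:
p = {}
for (i, j), c in [((1, 2), 1.0), ((3, 1), 1.0), ((0, 0), 1.0)]:
    p = scale_add(p, c, basis_product(i, j))
```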

We are therefore motivated to define a construction that takes two vector spaces $U$ and $V$ and creates a new vector space, denoted $U \otimes V$, where the new vector space is defined in terms of its basis vectors; roughly speaking, the basis vectors of $U \otimes V$ comprise all formal pairwise products of basis vectors in a particular basis of $U$ with basis vectors in a particular basis of $V$. How to do this precisely may not be readily apparent but the fact that $\mathbb{R}[x,y]$ is a well-defined vector space that, intuitively at least, is constructible from $\mathbb{R}[x]$ and $\mathbb{R}[y]$ gives us hope that such a construction can be made to work. Needless to say, such a construction is possible and is precisely the tensor product.

There are two equivalent ways of defining the tensor product. The first follows immediately from the above description. If $\{u_1,\cdots,u_i\}$ and $\{v_1,\cdots,v_j\}$ are bases for $U$ and $V$ then $U \otimes V$ is defined to be the vector space formed from all formal linear combinations of the basis vectors $u_1 \otimes v_1, u_1 \otimes v_2, \cdots, u_1 \otimes v_j, u_2 \otimes v_1, \cdots, u_i \otimes v_j$. It is emphasised that $u_1 \otimes v_1$ is just a symbol, a means for identifying a particular basis vector. (At the risk of belabouring the point, I can form a vector space from the basis vectors Fred and Charlie; typical elements would be 2 Fred + 3 Charlie and 5 Fred – 1 Charlie, and the vector space addition of these two elements would be 7 Fred + 2 Charlie.) Of course, choosing a different set of basis vectors for $U$ or $V$ would result in a different vector space $U \otimes V$; however, the resulting vector spaces will always be isomorphic to each other. Therefore, $U \otimes V$ should be thought of as representing any vector space that can be obtained using the above method. (This finer point is not dwelled on here but is a common occurrence in mathematics; a construction can yield a new space which is only unique up to isomorphism. Once tensor products have been understood, the reader is invited to think further about this lack of uniqueness, how it is handled and why it is not an issue in practice.)

The more sophisticated way of defining the tensor product is to consider bilinear maps. As with most things, it is important to understand both methods. The elementary method above provides a basic level of intuition that is easy to grasp, but the method is clumsy to work with and harder to generalise. The more sophisticated method is more convenient to work with and adds an extra layer of intuition, but masks the basic level of intuition and therefore actually offers less in the way of intuition to the neophyte.

The motivated reader may now like to:

1. Read an introduction to the tensor product on vector spaces that uses basis vectors.
2. Read an alternative introduction that uses bilinear maps.
3. Work out why the two methods are equivalent.

A hint is offered as to why there is a relationship between “products” and “bilinear maps”, and why the tensor product can be thought of as “the most general product”. Assume we want to define a product on a vector space.  That is, let $V$ be a vector space and we want to define a function $f: V \times V \rightarrow V$ that we think of as multiplication; given $u,v \in V$, their product is defined to be $f(u,v)$. What do we mean by multiplication? Presumably, we mean that the laws of associativity, distributivity and compatibility with scalar multiplication hold, such as $f(u,f(v,w)) = f(f(u,v),w)$ and $f(u+v,w) = f(u,w) + f(v,w)$. Merely requiring the scalar multiplication and distributive laws to hold is equivalent to requiring that $f$ be bilinear; that is the connection. In general, one can seek to define a multiplication rule between two different vector spaces, which is the level of generality at which the cross-product works. As textbooks will hasten to point out, any bilinear function $f: U \times V \rightarrow W$ can be represented by a linear map from $U \otimes V$ to $W$. Personally, I prefer to think of this in the first instance as a bonus result we get for free from the tensor product rather than as the initial motivation for introducing the tensor product; only with the benefit of hindsight are we motivated to define the tensor product using bilinear maps.
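The link between product laws and bilinearity can be checked concretely for polynomial multiplication in $\mathbb{R}[x]$ (the coefficient-list representation is an invented convenience): fixing one factor, multiplication is linear in the other, exactly the distributive and scalar laws.

```python
# A polynomial in R[x] as a list of coefficients [a0, a1, ...] for a0 + a1*x + ...

def poly_mul(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_add(p, q):
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

# Distributivity f(u + v, w) = f(u, w) + f(v, w) with f = poly_mul:
u, v, w = [1.0, 2.0], [0.0, 1.0], [3.0, 0.0, 1.0]   # 1+2x, x, 3+x^2
lhs = poly_mul(poly_add(u, v), w)
rhs = poly_add(poly_mul(u, w), poly_mul(v, w))
```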

Tensor products turn out to be very convenient in other areas too. As just one example, in homological algebra they are a convenient way of changing rings. A special case is converting a vector space over the real numbers into a vector space over the complex numbers; an engineer would not think twice about doing this — anywhere a real number is allowed, allow now a complex number, and carry on as normal! — but how can it be formalised? Quite simply in fact; if $V$ is a vector space over the real field then $\mathbb{C} \otimes V$ is the complexification of $V$; although technically $\mathbb{C} \otimes V$  is a vector space over the real field according to our earlier definition of tensor product, there is a natural way to treat it as a vector space over the complex field. The reader is invited to contemplate this at leisure, recalling that $\mathbb{C}$ is a two-dimensional vector space over the real field, with basis $\{1,\sqrt{-1}\}$.

In conclusion, the tensor product can be motivated by the desire to construct the vector space of polynomials in two indeterminates out of two copies of a vector space of polynomials in only a single indeterminate. Once the construction is achieved, it is found to have a number of interesting properties, including properties relating to bilinear maps. With the benefit of hindsight, it is cleaner to redefine the tensor product in terms of bilinear maps; this broadens its applicability and tidies up the maths, albeit at the expense of possibly making less visible some of the basic intuition about the tensor product.

## Introduction to the Legendre Transform

I have not come across an introduction to the Legendre transform which was entirely intuitive and satisfying.  The article Making Sense of the Legendre Transform is one such attempt and has a number of attractive features.  However, like other attempts, it immediately links the Legendre transform to re-parametrising a function by its derivative, yet this link is not made crystal clear. Below, I present an alternative introduction to the Legendre transform which takes as its starting point the fact that a convex set is uniquely defined by the collection of its supporting hyperplanes.

My purpose is not to be rigorous, but instead, to present just enough facts for the reader to be comfortable with the basic ideas surrounding the Legendre transform and to tune the reader’s receptiveness to other pedagogic material on Legendre transforms.

### The Legendre Transform

Readers are invited to draw their favourite convex function $f: \mathbb{R} \rightarrow \mathbb{R}$ on a piece of paper.  Treating the graph of $f$ as the boundary of a cup, we can fill the inside of it.  The resulting shape (the boundary plus the inside) is called the epigraph of the convex function.  (Precisely, the epigraph of $f$ is $\{(x,z) \mid z \geq f(x)\}$.)  Observe that the epigraph of a convex function is a convex set.

A line (or more generally, a hyperplane, if we were working in a higher dimension) is called a supporting hyperplane of a convex set in $\mathbb{R}^2$ if it intersects the convex set and the convex set is contained on just one side of it. (This implies that the points of intersection are boundary points.) For example, the horizontal line $y=0$ is a supporting hyperplane for the epigraph of $y=x^2$.  Going further, given a gradient $m$, we can draw the line $y=mx+c$ on the same graph as $y = x^2$, and we observe that if $c$ is a large negative number then the line does not intersect $y=x^2$ (or its epigraph) at all, and as $c$ is increased, there comes a time when $y=mx+c$ first touches $y=x^2$. This value of $c$ makes $y=mx+c$ a supporting hyperplane for the epigraph of $y=x^2$. As $c$ increases further, there will be points of the epigraph of $y=x^2$ on either side of the line $y=mx+c$. For the function $y=e^x$, it is seen that if $m$ is negative, there are no corresponding supporting hyperplanes. In general, given a convex function $f: \mathbb{R} \rightarrow \mathbb{R}$, for each slope $m$ there is at most one value of $c$ making $y=mx+c$ a supporting hyperplane of the epigraph of $f$. Furthermore, the set of values of $m$ for which a supporting hyperplane exists forms a convex set.

Recall that in convex analysis, it is a very useful fact that a convex set is defined by its supporting hyperplanes. Therefore, if we know all the supporting hyperplanes of (the epigraph of) $f$ then we should be able to reconstruct $f$. Shortly, we will endeavour to do this from first principles.

Perhaps the simplest way of “storing” what the supporting hyperplanes of $f$ are, is to graph $c$ versus $m$. Indeed, for each value of $m$, there is at most one value of $c$ for which $y=mx+c$ is a supporting hyperplane, hence plotting $m$ on the horizontal axis and $c$ on the vertical axis produces the graph of a function. In fact, it can be proved that this graph will always be concave if $f$ is convex. To visualise this, pick a point $x$ and draw a tangent line to the graph $y=x^2$ at the point $(x,x^2)$. This is a supporting hyperplane with $m$ equal to $\frac{df}{dx}=2x$ and $c$ equal to the $y$-intercept of the tangent line, namely $-x^2$. Now, as $x$ goes from $-\infty$ to $\infty$, it can be seen that $m$ monotonically increases and $c$ initially increases and then decreases. Furthermore, observe that plotting the points $(m,c)$ corresponding to the supporting hyperplanes is the same as plotting the points $\{(2x,-x^2) \mid x \in \mathbb{R}\}$ which is the same as plotting the graph $\{(m,c) \mid c = -\frac{m^2}{4}\}$ of the function $c = -\frac{m^2}{4}$.
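This picture can be checked numerically using the standard formula for the Legendre transform, $h(m) = \sup_x \big(mx - f(x)\big)$, which is $-c$ for the supporting hyperplane of slope $m$; for $f(x) = x^2$ the supremum sits at $x = m/2$, giving $h(m) = m^2/4$ and hence $c = -m^2/4$ as computed above. A grid-based sketch (the grid bounds and step are invented for illustration):

```python
def legendre(f, m, xs):
    # h(m) = sup_x (m*x - f(x)), approximated by a maximum over a finite grid.
    return max(m * x - f(x) for x in xs)

def f(x):
    return x * x

xs = [i / 1000.0 for i in range(-5000, 5001)]   # grid on [-5, 5], step 0.001

h_at_2 = legendre(f, 2.0, xs)     # exact value is 2^2 / 4 = 1
h_at_m3 = legendre(f, -3.0, xs)   # exact value is (-3)^2 / 4 = 2.25
```

The grid happens to contain the exact maximisers $x = 1$ and $x = -1.5$, so both values come out exactly.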

Plotting $-c$ versus $m$ will produce a convex function if $f$ is convex, and this is the convention which has proved expedient. Clearly, there is no material difference between plotting $-c$ or $c$ as a function of $m$.

Formally, we have just seen that to any convex function $f$ we can associate a function $h$ such that $y=mx+c$ is a supporting hyperplane of (the epigraph of) $f$ if and only if $c = -h(m)$ (where the negative sign is just a matter of convention). The function $h$ thus defined is called the Legendre transform of $f$. As mentioned earlier, $h$ is not necessarily defined on the whole of $\mathbb{R}$ but only on a convex subset of $\mathbb{R}$. To overcome this minor notational inconvenience, it is customary to set $h(m)$ equal to infinity for values of $m$ for which it would otherwise be undefined. This is a standard trick in convex analysis for avoiding the need to keep track explicitly of the set on which a convex function is defined.

Can we think of another way of storing the set of supporting hyperplanes of $f$? Each hyperplane $y=mx+c$ is represented by the pair of coefficients $(m,c)$.  If we had chosen to fix $c$ instead, we would have found that depending on the value of $c$, there could be multiple values of $m$ for which $y=mx+c$ is a supporting hyperplane. It seems sensible then to represent the set of all points $(m,c)$ corresponding to supporting hyperplanes by using the aforementioned function $h$. Therefore, I choose to interpret the Legendre transform as what one would arrive at if charged with the task of writing down in a nice and simple way what the supporting hyperplanes are of the epigraph of a convex function $f$.

As promised earlier, let’s see how a function $f$ can be recovered from its Legendre transform $h$. Recall that for each $m$ (for which $h(m)$ is finite), the line $y=mx-h(m)$ is a supporting hyperplane.  In particular, we know that at least one point of the graph of $f$ lies on this line, and we know that every single point of the graph of $f$ lies on or above this line.  By drawing all the supporting hyperplanes, we can see intuitively that they therefore trace out $f$.  In fact, readers familiar with envelopes of curves will see that this is the same idea here, and for those not familiar, clicking on the link will bring up the Wikipedia entry with a nice figure showing how the supporting hyperplanes trace out the function.

For simplicity, consider the special case when $h$ is differentiable.  (A convex function is differentiable at all but at most a countable number of points, and even at such points, left and right derivatives still exist.) Choose an $m$.  How can we find an $x$ such that the point $(x,mx-h(m))$ on the supporting hyperplane $y = mx-h(m)$ belongs to the graph of $f$?  If we perturb $m$, we get another supporting hyperplane $y = (m+\epsilon)x-h(m+\epsilon)$. This perturbed hyperplane, as we will call it, intersects with the original hyperplane $y = mx-h(m)$ at the point with $x$ coordinate given by $x = \frac{1}{\epsilon}\left[h(m+\epsilon)-h(m)\right]$ which approaches $h'(m)$, the derivative of $h$, as $\epsilon \rightarrow 0$. For positive $\epsilon$, the perturbed hyperplane $y = (m+\epsilon)x-h(m+\epsilon)$ will lie above the original hyperplane whenever $x$ is larger than the point of intersection.  Similarly, if $\epsilon$ is negative, the perturbed hyperplane will lie above the original hyperplane whenever $x$ is smaller than the point of intersection.  Therefore, the point of intersection in the limit $\epsilon \rightarrow 0$ is precisely the point which also lies on the graph of $f$. (If this is not clear, re-read the third sentence of the preceding paragraph.) Thus, the point $(h'(m),mh'(m)-h(m))$ lies on the graph of $f$ for all $m$ for which $h$ is defined and differentiable. (If $h$ is not differentiable at $m$, we would have a range of valid $x$ values, namely, those values lying between the left derivative and the right derivative of $h$ at $m$. We would therefore obtain a line segment belonging to the graph of $f$. For the moment though, this level of detail is a distraction.)
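As a sanity check, here is a small Python sketch of this recovery recipe applied to $h(m) = \frac{m^2}{4}$, the Legendre transform of $f(x)=x^2$ found earlier. The helper names are mine, and the derivative $h'(m)$ is approximated by a central difference:

```python
# For f(x) = x^2 the Legendre transform is h(m) = m^2/4; the text says
# the point (h'(m), m*h'(m) - h(m)) should lie on the graph of f.
def h(m):
    return m * m / 4

def h_prime(m, eps=1e-6):
    # central-difference approximation to h'(m)
    return (h(m + eps) - h(m - eps)) / (2 * eps)

for m in [-3.0, -1.0, 0.5, 2.0]:
    x = h_prime(m)
    y = m * x - h(m)
    assert abs(y - x * x) < 1e-6  # the recovered point lies on y = x^2
```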

For completeness — and because something unexpected will reveal itself — let’s give a formula for the graph of $h$ if we are given only $f$, under the simplifying assumption that $f$ is differentiable. As hinted at earlier, it can be shown that the supporting hyperplanes are the same as the tangent lines of $f$. The tangent line of $f$ at the point $(x_0,f(x_0))$ is simply $y - y_0 = m(x-x_0)$ where $m=f'(x_0)$ and $y_0 = f(x_0)$.  The $y$-intercept is thus $-mx_0+y_0$, and hence the point $(m,mx_0-y_0) = (f'(x_0),x_0f'(x_0)-f(x_0))$ lies on the graph of $h$. Comparing this with the previous paragraph, we see excitedly that going from $f$ to the graph of $h$ has exactly the same form as going from $h$ to the graph of $f$, and indeed, it can be established rigorously that if $h$ is the Legendre transform of a convex function $f$, then $f$ is the Legendre transform of $h$. The Legendre transform is its own inverse.

We close this section by deriving the standard formula for the Legendre transform. Let $f$ be an arbitrary function. (It need not even be convex.) Recall the idea given earlier about computing the Legendre transform, namely, we draw the line $y=mx+c$ for $c$ a very large negative number and gradually increase $c$ until the line first intersects the graph of $f$.  That is to say, we want the smallest value (or, if the smallest value does not exist, the infimum) of $c$ for which $y=mx+c$ intersects the graph of $f$. Therefore, it is equally valid to look for the set of all values of $c$ for which $y=mx+c$ intersects the graph of $f$, then choose the infimum of this set. This set is easily found by fixing $m$ and looking in turn at each point $(x_0,f(x_0))$ on the graph of $f$ and seeing for what value of $c$ the line $y=mx+c$ passes through this point. The line passing through $(x_0,f(x_0))$ with gradient $m$ is simply $y-f(x_0)=m(x-x_0)$ and its $y$-intercept is thus $-mx_0+f(x_0)$. The infimum of $c$ for which $y=mx+c$ intersects the graph of $f$ is therefore $\inf_{x_0} -mx_0+f(x_0)$.  The Legendre transform is defined to be the negative of this, by convention; therefore, we have arrived at the mathematical definition of the Legendre transform: $h(m) = \sup_x mx-f(x)$.
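This formula lends itself to a direct numerical sketch: approximate the supremum by a maximum over a fine grid of $x$ values and compare with the closed form $h(m)=\frac{m^2}{4}$ computed earlier for $f(x)=x^2$. (The grid bounds and tolerance below are arbitrary choices of mine.)

```python
import numpy as np

# h(m) = sup_x (m*x - f(x)), approximated by a maximum over a fine grid.
x = np.linspace(-50.0, 50.0, 200_001)

def legendre(f, m):
    return np.max(m * x - f(x))

f = lambda t: t ** 2
for m in [-4.0, 0.0, 1.0, 3.0]:
    # compare with the closed form h(m) = m^2/4 derived in the text
    assert abs(legendre(f, m) - m ** 2 / 4) < 1e-3
```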

### The Legendre Transform in Physics

Often the Legendre transform is introduced by saying that it allows a function $f(x)$ to be re-parametrised by its derivative, meaning precisely: Given a slope $m$, first find the value of $x$ such that $f'(x)=m$ then return the value of $f$ at this point.  This is written concisely as $f(x(m))$, where $x$ is now considered a function of $m$. This is useful in physics; see for example Making Sense of the Legendre Transform.

Starting from first principles, assume we are given a differentiable and strictly convex function $f: \mathbb{R} \rightarrow \mathbb{R}$ and we are given a slope $m$, and we wish to write down an explicit expression for the function implicitly defined by the requirement that $m$ is mapped to $f(x)$ where $x$ satisfies $f'(x)=m$. With ideas from the previous section fresh in our minds, we know that we can find the point $x$ for which $f'(x)=m$ by finding the supporting hyperplane $y=mx+c$ with slope $m$ for the epigraph of $f$, and seeing where the supporting hyperplane intersects the graph of $f$. If $h$ is the Legendre transform of $f$, then we know the hyperplane is $y = mx-h(m)$.  Therefore, finding $\{x \mid f'(x)=m\}$ is equivalent to finding $\{x \mid mx-h(m)=f(x)\}$. The (possibly) good news is that the latter expression involves the function $f$ and not its derivative, but the bad news is that this expression is still an implicit one for $x$.

The most important observation — and one of the reasons for including the derivations in the previous section — is that provided the derivative exists, we have that $x = h'(m)$ satisfies $f'(x)=m$. (Recall the discussion about how the graph of $f$ can be recovered by thinking about what perturbed hyperplanes tell us.)  Therefore, it is not the Legendre transform but its derivative which allows us to write down an explicit expression for the value of $x$ for which $f'(x)=m$. Furthermore, note from the previous section that $f(h'(m))$ can be written in terms of $h$ alone, as $mh'(m)-h(m)$.

Although this section opened by considering how to write down an explicit expression for $f(x(m))$, and the Legendre transform was found to be a valuable stepping-stone, the fact of the matter is that $f(x(m))$ is not necessarily well-defined unless the derivative of $f$ exists and is injective, whereas the Legendre transform is always well-defined and has useful properties. Therefore, although we may have originally thought we wanted an expression for $f(x(m))$ as an intermediate step in achieving an ultimate objective (e.g., reformulating a physical law in a more convenient way — see Making Sense of the Legendre Transform), it is mathematically advantageous to work with the Legendre transform instead.  In the “worst case” we will need to require that the Legendre transform $h$ of $f$ is differentiable and write $f(x(m))$ as $f(h'(m))$ (or $mh'(m)-h(m)$), but it may be possible that an alternative manipulation can be found for achieving the ultimate objective which uses $h$ directly and does not require $h'(m)$ to exist. That is to say, there is nothing to be lost and a chance of something to be gained by working with $h$ rather than $f(x(m))$ simply because the latter can be obtained from the former in a simple way.

### Summary

• Within convex analysis, a convex set is uniquely defined by its supporting hyperplanes, and some properties of a convex set are better elucidated by examining its supporting hyperplanes.
• A function is convex if and only if its epigraph is a convex set.
• The Legendre transform is what one would come up with if charged with the task of describing the set of all supporting hyperplanes of the epigraph of a convex function.
• The Legendre transform is its own inverse (when applied to a convex function).
• The Legendre transform is related to the concept of the envelope of a curve.
• If $h$ is the Legendre transform of $f$ then, if $h'(m)$ exists, the point $x=h'(m)$ has the property that $f'(x)=m$.
• The Legendre transform (precisely, its derivative) therefore allows a function $f$ to be re-parametrised by its derivative. This is often the motivation given for the usefulness of the Legendre transform.
• In trying to derive from first principles a re-parametrisation for $f$ in terms of its derivative, the Legendre transform arises naturally, as an intermediate step.

## Information Geometry – Coordinate Independence (Lecture 6)

The next few lectures aim to provide an introduction to several basic concepts in differential geometry required for progressing our understanding of information geometry. Rather than commence with a definition of differential geometry, the idea of “coordinate independence” will be studied in the simpler setting of affine geometry first. Roughly speaking, differential geometry combines the notion of coordinate independence with the notion of gluing simpler things together to form more complicated objects.

Let $S$ be a set. It may represent all the points on an infinitely large sheet of paper, in which case one must resist the temptation to think of $S$ as a subset of $\mathbb{R}^3$ but rather, envisage $S$ as the whole universe; there is nothing other than $S$. Alternatively, $S$ might represent the set of all elephants in the world.

Consider first the case when $S$ is the sheet of paper. In fact, assume we all live on $S$; the world is flat. In order to write down where someone lives, we need a coordinate chart: an injective function $f: S \rightarrow \mathbb{R}^2$ which assigns to every point a unique pair of numbers, called the coordinates of the point. In order for this to be successful, everyone needs to use the same coordinate chart $f$. But given just $S$, no two people are likely to choose the same chart. How could they? Just for starters, it would necessitate someone drawing a big cross on the ground and declaring that everyone must consider that point to have coordinates $(0,0)$. Extra information beyond the set itself is required if different people are to construct the same coordinate chart.

Sometimes, as we will now see, there is extra information available but it is not enough to determine a unique coordinate chart. If every person had a magnetic compass and a ruler then they could agree that moving one metre east must correspond to increasing the first coordinate by one, and moving one metre to the north must correspond to increasing the second coordinate by one. Two people’s charts will still differ in general, but only in the choice of origin. Although people would not be able to communicate where they live in absolute terms – saying I live at $(2,4)$ is no good to anyone else with a different coordinate chart – there is still a wealth of information that can be communicated. Saying that the difference in coordinates between my house and your house is $(5,2)$ is enough for you to find your way to my house; although it is likely our coordinate charts differ, the same answer is obtained no matter which chart is used. This is called coordinate independence.

The more possibilities there are for the charts, the fewer the number of coordinate independent properties. For example, if now people’s rulers are confiscated and they only have magnetic compasses, people’s coordinate charts can differ from each other’s in more ways. Saying I live $(5,2)$ away from your home will no longer work; my 5 units east will almost surely differ from your 5 units east. We could however, still agree on whether a collection of trees lies in a straight line or not.

Precisely, every person may decide to define that a collection of trees $s_1,\cdots,s_n \in S$ lies in a straight line if, under their personal coordinate chart $f: S \rightarrow \mathbb{R}^2$, the images of the trees $f(s_1),\cdots,f(s_n)$ lie in a straight line. This definition works because even though two people’s coordinate charts may be different, their definitions of lying in a straight line turn out to be the same. We will see presently that this can be understood in terms of a simple concept called transition functions. Note that earlier we were implicitly thinking in terms of definitions too; we defined the location of your house $s_1$ relative to my house $s_2$ to be the vector $f(s_1) - f(s_2)$, and it was a useful definition whenever it was coordinate independent, as it was when we had both a compass and a ruler but not when we had a compass alone.

If $S$ represents a set of elephants then I might choose a coordinate chart $f: S \rightarrow \mathbb{R}$ by defining $f(s)$ to be the length of the trunk of elephant $s$. Tom might define his chart by measuring the length of the tail. We would not agree on the absolute size of an elephant, but if the lengths of an elephant’s trunk and tail are always in a fixed ratio then we would agree on what it means for one elephant to be twice as big as another elephant.

Let’s formalise the above mathematically. The set $\mathbb{R}^2$ can be given a lot of extra structure. We are used to thinking of it as a vector space – we know how to add two points together in a sensible and consistent way – and we commonly introduce a norm for measuring distance and sometimes even an inner product for measuring angles. If $f: S \rightarrow \mathbb{R}^2$ is a bijection then any structure we have on $\mathbb{R}^2$ can be transferred to $S$. We can make $S$ a vector space simply by defining $s_1 + s_2 = f^{-1}( f(s_1) + f(s_2) )$ and $\alpha s = f^{-1}(\alpha f(s))$, for instance.
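As a toy illustration of transferring structure through a bijection (my own example, not from the lecture), take $S$ to be the positive reals and $f = \log$: the “addition” induced on $S$ by the recipe $s_1 + s_2 = f^{-1}(f(s_1)+f(s_2))$ turns out to be ordinary multiplication.

```python
import math

# S = (0, inf), f = log : S -> R is a bijection, so the vector-space
# structure of R can be pulled back to S:
#   s1 "+" s2   :=  f^{-1}( f(s1) + f(s2) )   (turns out to be multiplication)
#   alpha "*" s :=  f^{-1}( alpha * f(s) )    (turns out to be a power)
f, f_inv = math.log, math.exp

def add_on_S(s1, s2):
    return f_inv(f(s1) + f(s2))

def scale_on_S(alpha, s):
    return f_inv(alpha * f(s))

assert abs(add_on_S(2.0, 3.0) - 6.0) < 1e-12   # "addition" is multiplication
assert abs(scale_on_S(3.0, 2.0) - 8.0) < 1e-12 # scalar action is a power
```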

Let $\mathcal{F}$ be a set of bijective functions of the form $f: S \rightarrow \mathbb{R}^2$. Each element of $\mathcal{F}$ represents a valid coordinate chart, or the way we had introduced it earlier, each person uses their own coordinate chart and $\mathcal{F}$ is the set of all these coordinate charts. Unless $\mathcal{F}$ contains only a single coordinate chart, we can no longer transfer arbitrary structures from $\mathbb{R}^2$ to $S$ in a coordinate independent way; we saw examples of this earlier. What structures can be transferred?

A bit of thought reveals that the key is to study the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. Observe that $f \circ g^{-1}$ is a function from $\mathbb{R}^2$ to $\mathbb{R}^2$. We can therefore use the structure on $\mathbb{R}^2$ to determine what properties $f \circ g^{-1}$ has, for example, it might be that $\mathcal{F}$ is such that $f \circ g^{-1}$ is always a linear function; linearity is a property which can be defined in terms of the vector space structure on $\mathbb{R}^2$.

Recalling the earlier examples, when people had magnetic compasses and rulers, the transition functions $f \circ g^{-1}$ would always have the form $f \circ g^{-1}(x) = x + v$ for some vector $v$. (Changing charts would cause $v$ to change; indeed, $v$ represents the difference in the choice of origin of the two charts.) When people only had magnetic compasses, $f \circ g^{-1}$ would be of the more general form $f \circ g^{-1}(x) = D x + v$ for some positive diagonal matrix $D$. (Here, I have assumed that each person would build their own ruler, so everyone has rulers, they are just of different lengths.)
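A quick sketch of the compass-and-ruler case: two hypothetical charts differing only by a translation (the charts below are invented for the sketch, with $S$ modelled as $\mathbb{R}^2$) assign different coordinates to each house, yet agree on coordinate differences.

```python
# Two hypothetical compass-and-ruler charts on S; their transition function
# f o g^{-1} is a pure translation x |-> x + v, so coordinate differences
# are chart-independent.
def chart_mine(s):
    return (s[0] - 3.0, s[1] + 1.0)

def chart_yours(s):
    return (s[0] + 10.0, s[1] - 7.0)

def diff(chart, s1, s2):
    (a1, a2), (b1, b2) = chart(s1), chart(s2)
    return (a1 - b1, a2 - b2)

my_house, your_house = (5.0, 2.0), (0.0, 0.0)
# Absolute coordinates disagree, but the difference is the same:
assert chart_mine(my_house) != chart_yours(my_house)
assert diff(chart_mine, my_house, your_house) == diff(chart_yours, my_house, your_house)
```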

Linking in with previous lectures, the set $S$ can be made into an affine space by introducing a collection of coordinate charts $\mathcal{F}$ such that for any two charts $f,g \in \mathcal{F}$, their transition function $f \circ g^{-1}$ always has the form $f \circ g^{-1}(x) = A x + b$ for some matrix $A$ and vector $b$. Because the image of a straight line under such a transition function remains a straight line, different people with different coordinate charts will still agree on what is and what is not a straight line in $S$. It is a worthwhile exercise to prove that this definition of an affine space is equivalent to the definition given in earlier lectures.

To summarise, there is interest in playing the following mathematical game:

• We are given a set $S$ and a collection $\mathcal{F}$ of coordinate charts $f: S \rightarrow \mathbb{R}^n$.
• We want to give the set $S$ some structure coming from the structure on $\mathbb{R}^n$.
• We want to do this in a coordinate independent way, meaning that if I use my own coordinate chart and you use your own coordinate chart then we get the same structure on $S$.

The secret is to look at the form of the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. The more general the form of the transition functions, the less structure can be transferred from $\mathbb{R}^n$ to $S$ in a coordinate independent way.

The relevance to information geometry is that the parametrisation used to describe a family of densities $\{ p(x;\theta) \}$ is, to a large extent, irrelevant. Properties that depend on a particular choice of parametrisation are generally not as attractive as properties which are coordinate independent. If $\{ p(x;\mu,\sigma) \mid \sigma > 0 \}$ represents the family of Gaussian random variables parametrised by mean $\mu$ and variance $\sigma^2$ then there is little justification in calling the subfamily $\{ p(x; 2t,5t+7) \mid t > 0\}$ a line segment because there does not appear to be anything special about the parametrisation of Gaussians by mean and variance. (It turns out that for exponential families, statisticians have come up with a set of parametrisations they believe to be nice. Although these parametrisations are not unique, their transition functions are always affine functions; this is why it was possible to introduce an affine structure in earlier lectures. Note that we have not pursued this to the end because we want to move quickly to a more powerful concept coming from differential geometry which will subsume this affine geometry.)

For completeness, note that a parametrisation is just the inverse of a coordinate chart. If we think of defining a family by specifying a function from $\theta \in \mathbb{R}^n$ to $p(.;\theta)$ then we speak of a parametrisation. On the other hand, if someone points to a density $q(x)$ then we can determine its coordinates by asking what value of $\theta$ makes $q(x) = p(x;\theta)$.

## Information Geometry – Using the Fisher Information Matrix to Define an Inner Product (Lecture 5)

### Motivation

In Lecture 3, a local test for determining if a family was exponential was introduced. The last part of the test involved seeing if the second-order derivatives could be written as linear combinations of the first-order derivatives. As will be shown in a subsequent lecture, such calculations are sometimes easier to do if an inner product is introduced; we are therefore motivated to define an inner product on the space spanned by the first-order derivatives $\frac{\partial \log p}{\partial \theta_1},\cdots,\frac{\partial \log p}{\partial \theta_n}$.

Alternatively, we may simply be motivated to introduce some geometry into the family $\{ p(x;\theta) \}$. Precisely, pick a particular density $q$ from the family and consider two curves passing through $q$ at time zero. By this is meant $\gamma_1$ and $\gamma_2$ are two functions from $(-\epsilon,\epsilon)$ to parameter space, so that as $t$ varies, $p(x;\gamma_1(t))$ and $p(x;\gamma_2(t))$ trace out two curves in the space of probability densities, and it is required that the curves intersect at $q$ when $t=0$, namely $p(x;\gamma_1(0)) = p(x;\gamma_2(0)) = q$.

If the space of probability densities had a “geometry” then we should be able to say at what angle any two curves intersect. Mathematically, we wish to compute the inner product of the “velocity vector” of $p(x;\gamma_1(t))$ at $t=0$ with the “velocity vector” of $p(x;\gamma_2(t))$ at $t=0$.

We are inclined to work not with $p(x;\theta)$ directly but with $\log p(x;\theta)$; the mapping is bijective so nothing is lost or gained by doing this. It is simply that we know from previous work that we can interpret log-likelihoods as elements of a vector space (albeit a vector space that has forgotten where it placed its origin) and therefore we are comfortable differentiating $\log p(x;\theta)$ with respect to $\theta$. (See too the previous lecture on the Fisher Information Matrix.)

### An Inner Product

For any two curves $\gamma_1, \gamma_2$ intersecting at time $t=0$ we wish to define their inner product. Since $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is a linear function of $\gamma'(0)$, it suffices to write down the rule for computing the inner product in terms of $\gamma_1'(0)$ and $\gamma_2'(0)$. It is desirable to do this because $\gamma'(0)$ is a finite-dimensional vector whereas $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is infinite-dimensional.

The catch though is that if we work with $\gamma'(0)$ then we must ensure that we get the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ even if we change to a new parametrisation of the family $\{ p(x;\theta) \}$. Precisely, suppose $\{ q(x;\phi) \}$ represents the same family as $\{ p(x;\theta) \}$ but with respect to a different parametrisation $\phi$. (We assume there is a one-to-one correspondence between $\phi$ and $\theta$; given any $\phi$ there is a $\theta$ such that $q(x;\phi) = p(x;\theta)$, and given any $\theta$ there is a $\phi$ such that $q(x;\phi) = p(x;\theta)$ too.) If $\tilde\gamma_1, \tilde\gamma_2$ are such that $q(x;\tilde\gamma_1(t)) = p(x;\gamma_1(t))$ and $q(x;\tilde\gamma_2(t)) = p(x;\gamma_2(t))$ then we must obtain the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ as we do for $\langle \left.\frac{\partial \log q(x;\tilde\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log q(x;\tilde\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$.

It so happens that the Fisher Information Matrix can be used to define an inner product in a coordinate-independent way, meaning the same answer will be obtained regardless of how the family is parametrised. Before considering why we use the Fisher Information Matrix, let’s just see first how it can be used to do this.

Note that on a finite-dimensional vector space, every inner product $\langle x,y \rangle$ is of the form $\langle x,y \rangle = y^T Q x$ for some positive-definite symmetric matrix $Q$. The matrix $Q$ determines the inner product and the idea we will experiment with is using the Fisher Information matrix as the $Q$ matrix. To wit, we define:

$\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle = [\gamma_2'(0)]^T \mathcal{I}(\theta_0) [\gamma_1'(0)]$ where $\theta_0 = \gamma_1(0) = \gamma_2(0)$ is the point of intersection of the two curves at time $t=0$.

From its form, and assuming the Fisher Information matrix is positive-definite (as usual, technical assumptions are being omitted in order to focus attention on the higher-level details), it is clear that we have defined an inner product; the axioms of an inner product are satisfied. What we need to check is whether or not we get the same answer if we change the parametrisation of our family.

First, the way the Fisher Information matrix changes when we change parametrisations must be determined. From the aforementioned correspondence between $\theta$ and $\phi$ we can regard one as a function of the other and write: $\log q(x;\phi) = \log p(x;\theta(\phi))$. Differentiating this yields: $\frac{\partial \log q}{\partial \phi_i} = \sum_k \frac{\partial \log p}{\partial \theta_k} \frac{\partial \theta_k}{\partial \phi_i}$. It follows almost immediately that:

$\mathcal{I}(\phi) = \left[\frac{d\theta}{d\phi}\right]^T\,\mathcal{I}(\theta)\,\left[\frac{d\theta}{d\phi}\right]$ where the $ij$th entry of $\frac{d\theta}{d\phi}$ is $\frac{\partial \theta_i}{\partial \phi_j}$.

(As discussed in class, the way to understand this is to consider the special case of $\theta$ being a linear function of $\phi$ and writing down the relationship between the squared-error of estimating $\theta$ and the squared-error of estimating $\phi$ and recalling that the Fisher Information matrix is essentially the inverse of the (asymptotic) squared-error. That the Fisher Information matrix determines the asymptotic performance explains why higher-order derivatives of $\theta$ with respect to $\phi$ do not appear in the above formula.)

We may write $\tilde\gamma(t) = \phi( \gamma(t) )$ to signify the relationship between $\gamma$ and $\tilde\gamma$; they both trace out the same curve of probability densities, just with respect to different coordinates. Therefore, $\tilde\gamma'(0) = \frac{d\phi}{d\theta} \gamma'(0)$ $= \left[ \frac{d\theta}{d\phi} \right]^{-1} \gamma'(0)$.

Collating the above results shows that an inner product defined with respect to the Fisher Information matrix is indeed coordinate-independent:

$[\tilde\gamma_2'(0)]^T \mathcal{I}(\phi_0) [\tilde\gamma_1'(0)] = \left[ \frac{d\phi}{d\theta} \gamma_2'(0) \right]^T\,\left[ \frac{d\theta}{d\phi} \right]^T\,\mathcal{I}(\theta_0)\,\left[ \frac{d\theta}{d\phi} \right]\,\left[ \frac{d\phi}{d\theta} \gamma_1'(0) \right] = [\gamma_2'(0)]^T\,\mathcal{I}(\theta_0)\,[\gamma_1'(0)]$.
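This invariance is easy to test numerically. The sketch below uses a one-parameter Gaussian family with known $\sigma$ (so $\mathcal{I}(\theta) = 1/\sigma^2$, as computed in the next lecture's notes) and an arbitrarily chosen reparametrisation $\theta = \phi^3$, and checks that both sides of the display agree:

```python
# One-parameter Gaussian family, theta = mean, sigma known: I(theta) = 1/sigma^2.
# Reparametrise by theta = phi^3 (an arbitrary choice for this sketch).
sigma = 2.0
I_theta = lambda th: 1.0 / sigma ** 2

theta_of_phi = lambda ph: ph ** 3
dtheta_dphi = lambda ph: 3.0 * ph ** 2

# Transformation rule: I(phi) = (dtheta/dphi)^T I(theta) (dtheta/dphi)
I_phi = lambda ph: dtheta_dphi(ph) ** 2 * I_theta(theta_of_phi(ph))

phi0 = 1.5
g1, g2 = 0.7, -1.2                     # gamma_1'(0), gamma_2'(0) in theta-coords
dphi_dtheta = 1.0 / dtheta_dphi(phi0)  # Jacobian of the inverse change of coords
t1, t2 = dphi_dtheta * g1, dphi_dtheta * g2  # same velocities in phi-coords

lhs = t2 * I_phi(phi0) * t1                   # inner product in phi-coordinates
rhs = g2 * I_theta(theta_of_phi(phi0)) * g1   # inner product in theta-coordinates
assert abs(lhs - rhs) < 1e-12  # coordinate independence
```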

The reader is urged to think carefully about what has actually been done. We have come close to putting an inner product on the infinite-dimensional space of all probability densities. Given any finite-dimensional family we can define an inner product using the Fisher Information matrix in a consistent way which depends only on the densities themselves and not on their parametrisations. (There is a small lie in the last sentence; arbitrary parametrisations are not permitted but rather, any two parametrisations we consider need to be “reasonably nice” with respect to each other, such as the mapping from one to the other being continuously differentiable. Such finer points will be elaborated on later in the course.)

### Justification

We have not endeavoured to justify why we want to use the Fisher Information matrix as an inner product; we have just demonstrated that because it transforms in the right way, we can use it to define an inner product if we so choose. It is appealing because it is coordinate-independent; it doesn’t matter how we parametrise our family, we get the same inner product being defined. This doesn’t automatically mean that it is a useful or sensible inner product though.

It turns out that it is both useful and sensible. That it is sensible comes from a deep property of being invariant with respect to sufficient statistics (this will be explained later in the course), and indeed, the Fisher metric (as we shall henceforth refer to the inner product coming from the Fisher Information matrix) is the essentially unique metric with this highly desirable property. That it is useful will be seen later when we gain a better understanding of what geometrical results become available by having an inner product structure.

## Information Geometry – Fisher Information Matrix (Lecture 4)

July 25, 2010

As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.

### Definition and Example

Associated with a family of probability densities $\{ p(x;\theta) \}$, where $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^k$, is a function $\mathcal{I}: \mathbb{R}^k \rightarrow \mathbb{R}^{k \times k}$ whose value $\mathcal{I}(\theta)$ at any point $\theta$ is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.

Precisely, the Fisher Information at $\theta$ is defined to be the matrix $\mathcal{I}(\theta)$ whose $ij$th entry equals $\mathbf{E}\left[ \frac{\partial \log p}{\partial \theta_i} \frac{\partial \log p}{\partial \theta_j} \right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons; it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\theta$ but known variance $\sigma^2$.

Let $p(x;\theta) = \frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}$. The log-likelihood is therefore $\log p = -\frac12\log 2\pi - \log\sigma -\frac12\left(\frac{x-\theta}{\sigma}\right)^2$. Differentiating yields $\frac{\partial \log p}{\partial \theta} = \frac{x-\theta}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p(x;\theta)$. Precisely, $\mathbf{E}[ g(x,\theta) ]$ is defined to be $\int g(x,\theta)\,p(x;\theta)\,dx$. Therefore the Fisher Information is:

$\mathcal{I}(\theta) = \int \frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}\,\frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}\,dx$.

That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\theta$ and variance $\sigma^2$, the expectation of $(x-\theta)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}$ is $\frac1{\sigma^2}$. Stated formally:

$\mathcal{I}(\theta) = \frac1{\sigma^2}$.

Had we considered a family of multivariate Gaussian random variables with unknown mean $\theta$ but known covariance matrix $\Sigma$, that is $p(x;\theta) = (2\pi)^{-n/2} \left| \Sigma \right|^{-1/2} e^{-\frac12 (x-\theta)^T \Sigma^{-1} (x-\theta)}$, then the Fisher Information Matrix would have been

$\mathcal{I}(\theta) = \Sigma^{-1}$.
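The multivariate claim can be checked the same way. In this case the score is the vector $\Sigma^{-1}(x-\theta)$, and averaging its outer product with itself should recover $\Sigma^{-1}$. A sketch (the specific $\theta$ and $\Sigma$ below are arbitrary choices for illustration):

```python
import numpy as np

# Sketch: Monte Carlo check that the Fisher Information Matrix of a
# multivariate Gaussian with unknown mean and known covariance is Sigma^{-1}.
rng = np.random.default_rng(1)
theta = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Draw x ~ N(theta, Sigma); each row of `scores` is one score vector.
x = rng.multivariate_normal(theta, Sigma, size=500_000)
scores = (x - theta) @ Sigma_inv

# Monte Carlo estimate of E[score score^T].
fisher_mc = scores.T @ scores / len(x)
print(np.round(fisher_mc, 3))  # should be close to Sigma_inv
```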

It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\theta$. As will be seen presently, it is not a coincidence that the inverse of the Fisher Information Matrix appears to be the variance with which we can expect to be able to estimate $\theta$ given an observation $x$. As the variance decreases, the amount of information increases.

Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the very fact that the definition requires the log-likelihood $\log p$ to be differentiated with respect to $\theta$ means that the family of log-likelihoods was already, implicitly, being treated as living in a vector space.

### Interpretation

Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p(\cdot;\theta)$. This sequence of independent and identically distributed random variables will be denoted by $x_1,x_2,\cdots$. Given the first $N$ random variables $x_1,\cdots,x_N$, can we guess what $\theta$ is, and how accurate is our guess likely to be?

For finite $N$, there is almost never a single best method for estimating $\theta$; roughly speaking, this is because estimators can trade off performance at one value of $\theta$ against performance at another (for instance, the constant estimator $\widehat{\theta} = \theta_0$ is unbeatable if the true parameter happens to be $\theta_0$). However, it is sensible to ask if there is an estimation rule which works well as $N \rightarrow \infty$.

Observe that the density of $x_1,\cdots,x_N$ is just a product of densities: $p(x_1,\cdots,x_N;\theta) = \prod_{i=1}^N p(x_i;\theta)$. It follows that the Fisher Information Matrix for $p(x_1,\cdots,x_N;\theta)$ is simply $N$ times the Fisher Information Matrix for $p(x;\theta)$. [Indeed, one of the reasons why it is called “information” is that it is additive.]
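The additivity follows because the log of the product is the sum of the logs, so the score of the joint density is the sum of the per-sample scores, which are independent with zero mean. A minimal numerical sketch, continuing the Gaussian-mean example (names are mine):

```python
import numpy as np

# Sketch: the score of the joint density of N iid Gaussian samples is the
# sum of the per-sample scores, so the Fisher Information scales by N.
rng = np.random.default_rng(2)
theta, sigma, N = 0.0, 1.0, 10

trials = 200_000
x = rng.normal(theta, sigma, size=(trials, N))

# d/dtheta log prod_i p(x_i; theta) = sum_i (x_i - theta)/sigma^2.
joint_score = ((x - theta) / sigma**2).sum(axis=1)
fisher_joint = np.mean(joint_score**2)
print(fisher_joint)  # should be close to N / sigma^2 = 10
```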

The maximum-likelihood estimate of $\theta$ given the single observation $x$ is the value of $\theta$ which maximises $p(x;\theta)$. (Here, note that the observed value of $x$ is substituted into the expression for $p(x;\theta)$ leaving just a function of $\theta$; it is this function which is maximised.) Let $\widehat{\theta}_N$ denote the maximum-likelihood estimate of $\theta$ based on the $N$ observations $x_1,\cdots,x_N$. (Whether we have $N$ observations or a single observation of dimension $N$ is one and the same thing; the maximum likelihood estimate is the “most likely” value of $\theta$ given by maximising $p(x_1,\cdots,x_N;\theta)$.) Under reasonably mild regularity conditions, it turns out that:

1. $\lim_{N \rightarrow \infty} \widehat{\theta}_N = \theta$;
2. $\lim_{N \rightarrow \infty} N\cdot\mathbf{E}\left[ (\widehat{\theta}_N - \theta) (\widehat{\theta}_N - \theta)^T \right] = \mathcal{I}^{-1}(\theta)$.

That is to say, the maximum-likelihood estimator is consistent, and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac1N \mathcal{I}^{-1}(\theta)$.

Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of the samples goes to infinity.
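For the Gaussian-mean family above, the maximum-likelihood estimate is the sample mean, so the asymptotic claim can be checked by simulation: the scaled mean-squared error $N \cdot \mathbf{E}[(\widehat{\theta}_N - \theta)^2]$ should approach $\mathcal{I}^{-1}(\theta) = \sigma^2$. A sketch (the particular $\theta$, $\sigma$ and $N$ are illustrative):

```python
import numpy as np

# Sketch: for a Gaussian with known variance, the MLE of the mean is the
# sample mean, and N * E[(theta_hat - theta)^2] should approach sigma^2.
rng = np.random.default_rng(3)
theta, sigma, N = 2.0, 3.0, 50

trials = 100_000
x = rng.normal(theta, sigma, size=(trials, N))
theta_hat = x.mean(axis=1)  # MLE of the mean, one estimate per trial

scaled_mse = N * np.mean((theta_hat - theta)**2)
print(scaled_mse)  # should be close to sigma^2 = 9
```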

There is more to the story. The Cramér-Rao Bound states that if $\widehat{\theta}$ is an unbiased estimator of $\theta$ then its covariance must be lower-bounded by the inverse of the Fisher Information Matrix: $\mathbf{E}\left[ (\widehat{\theta} - \theta) (\widehat{\theta} - \theta)^T \right] \geq \mathcal{I}^{-1}(\theta)$ always holds for an unbiased estimator, where the inequality is in the sense of the positive-semidefinite ordering of matrices. Therefore, no other estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramér-Rao Bound.)

Extensive literature is available on the Fisher Information Matrix, including on Wikipedia. I therefore curtail my remarks to the above.