Sadly, Grassmann’s mathematical work was not appreciated during his lifetime. Among other things, he introduced what is now called the Grassmann algebra. It appears that Grassmann did this in part by looking for all possible ways a product structure could be introduced. Although there is strong geometric intuition behind the Grassmann algebra, it is not necessarily straightforward to grasp this intuition quickly from current introductory texts. For example, if the Grassmann algebra is about lengths, areas and volumes of parallelotopes, why can $u$ and $v$ be added together to form a new vector $u+v$ when in general the length of $u+v$ will not be the sum of the lengths of $u$ and $v$?
To my mind, the key point — and one I have not seen written down elsewhere — is that in the context of Grassmann algebras, lower-dimensional parallelotopes should be considered merely as building blocks for higher-dimensional parallelotopes; some background is required before getting to this point though.
Stepping back, this note endeavours to re-invent the Grassmann algebra in an intuitive way, motivating the operations of addition and multiplication. The point of departure is the desire to measure the relative volume of an $n$-dimensional oriented parallelotope in a vector space $V$ of dimension $n$. Let us initially denote an oriented parallelotope by the ordered set of vectors $[v_1, \ldots, v_n]$ that form the sides of the parallelotope. (See the wiki for a picture of the three-dimensional case.) Here, “oriented” just means that the sides of the parallelotope are ordered. In hindsight, it becomes clear that it is simpler to work with oriented parallelotopes than non-oriented ones; a (multi-)linear theory can be developed for the former. (Perhaps better motivation would come from considering how to define integration on a manifold, but I am endeavouring here to introduce Grassmann algebras without mention of forms from differential geometry.)
Given a metric on $V$, the volume of the parallelotope $[v_1, \ldots, v_n]$ can be computed by choosing an orthonormal basis for $V$ and computing the determinant of the matrix whose columns are the vectors $v_1, \ldots, v_n$ expressed as linear combinations of the basis vectors; put simply, if we assume $V$ is $\mathbb{R}^n$ and we use the Euclidean inner product then the volume is $\det A$ where $A$ is the matrix whose $i$th column is $v_i$. Note that negative volumes are permissible, a consequence of working with oriented parallelotopes. For brevity, parallelotopes will mean oriented parallelotopes and volumes will mean signed volumes.
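As a quick sanity check in coordinates, here is a minimal sketch assuming numpy (the helper name `signed_volume` is mine, not standard):

```python
import numpy as np

# Signed volume of the oriented parallelotope with sides v1, ..., vn in R^n:
# the determinant of the matrix whose i-th column is v_i.
def signed_volume(*vectors):
    A = np.column_stack(vectors)  # n x n matrix, columns are the sides
    return np.linalg.det(A)

v1 = np.array([2.0, 0.0])
v2 = np.array([0.0, 3.0])
print(signed_volume(v1, v2))  # 6.0: a 2-by-3 rectangle
print(signed_volume(v2, v1))  # -6.0: reversing the order flips the orientation
```

The sign change under reordering is exactly the “oriented” part of the story.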
If we don’t have a metric — or, precisely, we want to state results that are true regardless of which metric is being used — we can still make sense of one parallelotope being twice as big as another one, at least in certain situations. For example, the parallelotope $[2v_1, v_2, \ldots, v_n]$ is twice as big as $[v_1, v_2, \ldots, v_n]$ because, no matter how we choose the metric, the volume of the former really will be twice that of the latter. A key question to ask is: if $[u_1, \ldots, u_n]$ and $[v_1, \ldots, v_n]$ are two parallelotopes, will the ratio of their volumes be independent of the metric chosen?
Since we have limited ourselves to $n$-dimensional parallelotopes in $n$-dimensional space, it turns out that the answer is in the affirmative. A proof can be based on the equality $\det(MA) = \det(M)\det(A)$ by choosing the matrix $M$ to be the change of orthonormal basis induced by a change of metric.
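The identity $\det(MA) = \det(M)\det(A)$ makes the metric-independence of volume ratios easy to check numerically. A sketch, assuming numpy; the particular matrices are arbitrary illustrative choices:

```python
import numpy as np

# Columns of A and B are the sides of two 3-dimensional parallelotopes.
A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [1.0, 0.0, 1.0]])
B = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 2.0]])

ratio = np.linalg.det(A) / np.linalg.det(B)

# A change of metric amounts to an invertible change of basis M.
# Since det(MA)/det(MB) = det(A)/det(B), the ratio of volumes is unchanged.
rng = np.random.default_rng(0)
for _ in range(5):
    M = rng.standard_normal((3, 3))
    new_ratio = np.linalg.det(M @ A) / np.linalg.det(M @ B)
    assert np.isclose(new_ratio, ratio, rtol=1e-6)
```

Individual volumes change under $M$; only their ratio is metric-independent.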
If we decide that two (oriented) parallelotopes are equivalent whenever their (signed) volume is the same regardless of the metric chosen then it turns out that we can form a vector space structure on the set, denote it $\Lambda V$, of all $n$-dimensional parallelotopes up to equivalence in a given $n$-dimensional vector space $V$. Note that we are working with a quotient space structure; although we use the notation $[v_1, \ldots, v_n]$ to represent an element of $\Lambda V$, different representations may correspond to the same element. (Precisely, we have a projection taking vectors $v_1, \ldots, v_n$ and returning the corresponding element $[v_1, \ldots, v_n]$ of $\Lambda V$, where $[u_1, \ldots, u_n] = [v_1, \ldots, v_n]$ if and only if the signed volume of $[u_1, \ldots, u_n]$ equals the signed volume of $[v_1, \ldots, v_n]$ regardless of the metric chosen.) We choose to define scalar multiplication in $\Lambda V$ by $\alpha \cdot [v_1, v_2, \ldots, v_n] = [\alpha v_1, v_2, \ldots, v_n]$. (Note that the $\alpha$ could have multiplied any one of the $v_i$ because elements of $\Lambda V$ are only distinguished up to differences in volume.) That is to say, scalar multiplication corresponds to scaling the volume of the parallelotope.
Vector space addition in $\Lambda V$ is worthy of contemplation even if the ultimate definition is straightforward. (From a pedagogical perspective, having a simple mathematical definition does not imply having an intuitive understanding; Grassmann algebras have a simple mathematical definition, but one that belies the ingenuity required by Grassmann to develop them and one that potentially lacks the intuition required to feel comfortable with them.) Thinking first in terms of cubes then in terms of parallelotopes, it is clear geometrically that $[u, v_2, \ldots, v_n] + [w, v_2, \ldots, v_n] = [u + w, v_2, \ldots, v_n]$. In other words, if all but one vector are the same, there is an obvious geometric meaning that can be given to vector space addition in $\Lambda V$. Perhaps other special cases can be found. Nevertheless, the general rule we wish to follow (if at all possible) is that if $[u_1, \ldots, u_n] + [v_1, \ldots, v_n] = [w_1, \ldots, w_n]$ then this should be taken to mean that the volume of the parallelotope $[u_1, \ldots, u_n]$ plus the volume of the parallelotope $[v_1, \ldots, v_n]$ is equal to the volume of the parallelotope $[w_1, \ldots, w_n]$. If this is possible, then one way to achieve it is to define addition as follows. Arbitrarily choose a basis $e_1, \ldots, e_n$ for $V$. Then we know that there exist constants $\alpha$ and $\beta$ such that the volume of $[u_1, \ldots, u_n]$ is equal to $\alpha$ times the volume of $[e_1, \ldots, e_n]$, and the volume of $[v_1, \ldots, v_n]$ equals $\beta$ times the volume of $[e_1, \ldots, e_n]$. Then $[u_1, \ldots, u_n] + [v_1, \ldots, v_n]$ is defined to be $[(\alpha + \beta) e_1, e_2, \ldots, e_n]$. One can check that this indeed works; it endows $\Lambda V$ with a well-defined vector space structure. (Precisely, one must first verify that our definitions are consistent — we claim that no matter which representatives of the parallelotopes and which basis we used, the same element will be obtained — and then verify that the axioms of a vector space are satisfied.)
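The special case “all but one vector the same” is just multilinearity of the determinant in one column, which is easy to verify numerically. A sketch assuming numpy, with arbitrary illustrative vectors in $\mathbb{R}^2$:

```python
import numpy as np

# In R^2: vol[u, v] + vol[w, v] = vol[u + w, v]
# (the determinant is linear in its first column).
u = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])
v = np.array([0.5, 4.0])

def vol(a, b):
    return np.linalg.det(np.column_stack([a, b]))

print(vol(u, v) + vol(w, v))  # 15.5
print(vol(u + w, v))          # 15.5: same signed volume
```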
After all this effort, one may be disappointed to learn that $\Lambda V$ is one-dimensional. However, that is to be expected; we wanted $\Lambda V$ to represent the (signed) volume of an (oriented) parallelotope and hence $\Lambda V$ is essentially just the set of real numbers with the usual scalar multiplication and vector addition. What we have done though is introduce the notation and mindset to pave the way for generalising this reasoning to parallelotopes of arbitrary dimension $k$ in $V$.
Importantly, the following approach will not work, in that it will not re-create the Grassmann algebra. Consider all one-dimensional parallelotopes $[v]$ in $V$, where now $\dim V > 1$. If $[u]$ and $[v]$ are two such parallelotopes then one might be tempted to declare that $[u] + [v] = [w]$ if and only if the length of $w$ is equal to the sum of the lengths of $u$ and $v$ with respect to all metrics. This would lead to an infinite-dimensional vector space though, since it would only be possible to add two vectors that were linearly dependent.
An algebra (in this context) is a vector space that also has defined on it a rule for multiplying two elements, such that the multiplicative structure is consistent with the vector space structure, e.g., the associative and distributive laws hold. Does multiplication enter the picture in any way when we think of volume? For a start, the area of a rectangle can be calculated by taking the product of the lengths of two adjoining sides. We are thus tempted to introduce a symbol $\times$ that allows us to construct a higher-dimensional parallelotope from two lower-dimensional ones — namely, $[u_1, \ldots, u_j] \times [v_1, \ldots, v_k] = [u_1, \ldots, u_j, v_1, \ldots, v_k]$ — and have some faint hope that this simple concatenation-of-parallelotopes operator behaves in a way expected of a multiplication operator.
Now for the key decision, which I have not seen stated elsewhere yet believe to be the key to understanding Grassmann algebras in a simple way. Because the paragraph before last pointed out that we cannot treat length in a metric-independent way if we wish to stay in finite dimensions, we must use our definition of metric-independent volume to induce a weaker notion of metric-independent length, area and volume on lower-dimensional parallelotopes of the ambient space $V$. Precisely, we declare that $[u_1, \ldots, u_k]$ is equivalent to $[v_1, \ldots, v_k]$ if and only if, for all vectors $w_{k+1}, \ldots, w_n$, we have that $[u_1, \ldots, u_k, w_{k+1}, \ldots, w_n]$ has the same volume as $[v_1, \ldots, v_k, w_{k+1}, \ldots, w_n]$, where as usual $n$ is the dimension of $V$. In particular, lower-dimensional parallelotopes are considered merely as building blocks for $n$-dimensional parallelotopes in $n$-dimensional spaces. Immediate questions to ask are: does this work in theory, and is it useful in practice? It does work; it leads to the Grassmann algebra. And it has found numerous uses in practice, but that is a different story which will not be told here.
It is now a straightforward journey to the finish line. Let $\Lambda^n V$ denote what was earlier denoted $\Lambda V$, and in general, let $\Lambda^k V$ denote the set of all $k$-dimensional (oriented) parallelotopes up to the aforementioned equivalence relation. Each of these sets can be made into a vector space with vector space operations relating directly to volumes. Precisely, if $[v_1, \ldots, v_k] \in \Lambda^k V$ then the scalar multiple $\alpha \cdot [v_1, \ldots, v_k]$ is the parallelotope $[u_1, \ldots, u_k]$ (unique up to equivalence) such that, for all vectors $w_{k+1}, \ldots, w_n$, the volume of $[u_1, \ldots, u_k, w_{k+1}, \ldots, w_n]$ is precisely $\alpha$ times the volume of $[v_1, \ldots, v_k, w_{k+1}, \ldots, w_n]$ regardless of which metric is used to measure volume. (This implies that the volume of $[u_1, \ldots, u_k]$ is precisely $\alpha$ times the volume of $[v_1, \ldots, v_k]$, but the converse is not necessarily true.) Vector addition can be defined in a similar way.
It can be shown that $\Lambda^1 V$ is linearly isomorphic to $V$. Indeed, if $u, v \in V$ then $[u] + [v] = [u + v]$ because, for any vectors $w_2, \ldots, w_n$, the volume of the parallelotope $[u + v, w_2, \ldots, w_n]$ will equal the sum of the volumes of $[u, w_2, \ldots, w_n]$ and $[v, w_2, \ldots, w_n]$. Conversely, if $[u] = [v]$ then one can deduce by strategic choices of $w_2, \ldots, w_n$ that the only possibility is $u = v$. (Think in terms of determinants of matrices.)
As hinted at before, we expect multiplication to come into play and we expect it to behave nicely with respect to addition because we know, for example, that a rectangle of side lengths $a$ and $b$ and a rectangle of side lengths $a$ and $c$ have total area $a(b+c) = ab + ac$. In other words, in $\Lambda^2 V$ at least, we expect that $[u] \times [v] + [u] \times [w] = [u] \times [v + w]$. This is indeed the case — for any $w_3, \ldots, w_n$ it is clear that the volume of $[u, v, w_3, \ldots, w_n]$ plus the volume of $[u, w, w_3, \ldots, w_n]$ equals the volume of $[u, v + w, w_3, \ldots, w_n]$ — and here the point is to explain why $\times$ should behave like multiplication rather than prove rigorously that it does.
When it comes to rigorous proofs, it is time to switch from geometric intuition to mathematical precision. Here, the key step is in recognising that the volume of an $n$-dimensional parallelotope $[v_1, \ldots, v_n]$ in an $n$-dimensional vector space is a multi-linear function of the constituent vectors $v_1, \ldots, v_n$. In fact, it is not just any multi-linear map but an alternating one, meaning that if two adjacent vectors are swapped then the volume changes sign. This is the starting point for the modern definition of exterior algebra, also known as the Grassmann algebra.
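Both properties — sign change under a swap, and degeneracy when a vector repeats — are properties of the determinant and can be checked directly. A sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
cols = [rng.standard_normal(4) for _ in range(4)]  # sides of a 4-dim parallelotope
volume = np.linalg.det(np.column_stack(cols))

# Swap two adjacent side vectors: the signed volume changes sign (alternating).
swapped = [cols[0], cols[2], cols[1], cols[3]]
volume_swapped = np.linalg.det(np.column_stack(swapped))
assert np.isclose(volume_swapped, -volume)

# A repeated side vector gives a degenerate (flat) parallelotope of zero volume.
degenerate = np.linalg.det(np.column_stack([cols[0], cols[0], cols[2], cols[3]]))
assert np.isclose(degenerate, 0.0)
```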
I intentionally used non-conventional notation because it was important to introduce concepts one by one. First, because the operator $\times$ introduced above is anti-commutative (it is almost as familiar as ordinary multiplication except that the sign can change, e.g., $[u] \times [v] = -[v] \times [u]$) it is common to denote it by the wedge product $\wedge$ instead. Furthermore, since $\Lambda^1 V$ is isomorphic to $V$ it is customary to omit the square brackets, writing $v$ for $[v]$, writing $u \wedge v$ for $[u] \times [v]$, and so forth.
There are some loose ends which I do not tidy up since the aim of this note is to prepare the reader for a standard account of the exterior algebra; perhaps though the last point to clarify is that the Grassmann algebra is the direct sum of the base field $\mathbb{R}$ plus $\Lambda^1 V$ plus $\Lambda^2 V$ up to $\Lambda^n V$. Thus, if two parallelotopes cannot be added geometrically to form a new parallelotope, either because they are of differing dimensions, or roughly speaking because changing metrics would cause them to change in incongruous ways as building blocks, then they are just left written as a sum.
- The exterior algebra of a vector space $V$ is a vector space whose elements represent equivalence classes of linear combinations of oriented parallelotopes in $V$.
- If $n$ is the dimension of $V$ then two $n$-dimensional parallelotopes are equivalent if and only if they have the same $n$-dimensional volume as each other with respect to any and all metrics.
- Multiplying a parallelotope by a scalar just multiplies its volume by the same amount (without changing the subspace in which it lies).
- A higher-dimensional parallelotope is constructed from lower-dimensional ones via the wedge product which, except for possible sign changes, behaves precisely like a multiplication operator (because, roughly speaking, volume is determined by multiplying one-dimensional lengths together).
- Two $k$-dimensional parallelotopes $A$ and $B$ (with $k < n$) are equivalent if and only if, when treated as building blocks for constructing parallelotopes $A \wedge C$ and $B \wedge C$ of the same dimension as $V$, the volumes of the resulting parallelotopes $A \wedge C$ and $B \wedge C$ are always the same, regardless of which metric is used and how $C$ is chosen.
- The sum of two $k$-dimensional parallelotopes $A$ and $B$ equals the $k$-dimensional parallelotope $D$ if and only if, for all $(n-k)$-dimensional parallelotopes $C$, the volume of $D \wedge C$ equals the sum of the volumes of $A \wedge C$ and $B \wedge C$ regardless of which metric is used. (Such a $D$ need not exist, in which case the resulting vector space sum is denoted simply by $A + B$.)
As always, this note may be unnecessarily long because it was written in a linear fashion from start to finish. Hopefully though, the general direction taken has some appeal.
Differentiating Matrix Expressions The Easy Way, and an Elementary yet Genuine use for the Tensor Product
In many areas of science requiring the differentiation of multivariate functions $f \colon \mathbb{R}^n \to \mathbb{R}$, the first-order derivative is often treated as a vector (the gradient) and the second-order derivative as a matrix (the Hessian). This leads to notation in which sometimes the gradient appears and sometimes its transpose appears. Extending this notation to higher derivatives, or to vector- or matrix-valued functions, becomes even messier.
An alternative is to treat derivatives as (multi-)linear maps. If, at some stage, vectors and matrices are required, i.e., gradients and Hessians, these can easily be read off from the derivatives; but often they are not required. Basically, the difference is working in a particular coordinate system — the gradient and Hessian are only defined with respect to an inner product, and that determines the “coordinate system” being used — versus working in a coordinate-free manner.
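To illustrate the linear-map viewpoint, consider $f(X) = X^2$ on square matrices: its derivative at $X$ is the linear map $H \mapsto XH + HX$, with no transposes or vectorisation conventions needed. A sketch assuming numpy, checked against a finite difference:

```python
import numpy as np

def f(X):
    return X @ X

# Derivative of f at X, viewed as a linear map acting on a direction H:
# (X + tH)(X + tH) = X^2 + t(XH + HX) + O(t^2).
def Df(X, H):
    return X @ H + H @ X

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 3))
H = rng.standard_normal((3, 3))

t = 1e-6
finite_diff = (f(X + t * H) - f(X)) / t
assert np.allclose(finite_diff, Df(X, H), atol=1e-4)
```

Note how awkward the same statement becomes if one insists on flattening everything into gradients and Hessians.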
In Differential Calculus, Tensor Products, and the Importance of Notation, a quick overview is given, but one which points out several subtleties. (For additional examples, see this earlier post.) Furthermore, it introduces the tensor product as a way of simplifying the notation further. This is an elementary yet genuine application benefitting from the tensor product, and is currently the best way I know of introducing tensor products early on to students in a meaningful way. (I am not very pleased with my earlier attempt at an introductory article on the tensor product as I don’t feel it is interesting enough.)
Probability is unique in that, on the one hand, its axioms rely on advanced mathematics, yet on the other hand, it is not only used across all areas of science, it comes up in everyday conversation, especially when the topic is gambling or tomorrow’s weather. I suspect this is the main reason why one does not have to look very far to find vehement debates about aspects of probability and statistics, whether it be Bayesians versus non-Bayesians, or the incorrect application of hypothesis testing in clinical trials, or of relevance to this article, Borel’s paradox.
When learning probability, people tend to worry about what would happen if an outcome having zero probability of occurring actually occurs. It does not help that books often use the notation $P(Y = y \mid X = x)$ to mean the probability of the random variable $Y$ taking the value $y$ given that the random variable $X$ was observed to equal $x$, because it is not long before this notation is used when $X$ has a continuous distribution, so that the probability that $X$ exactly equals $x$ is zero. (This occurs even though the book will have warned the reader earlier that conditional probability cannot be defined on sets of measure zero, that is, sets having zero probability of occurring.)
The beauty of Kolmogorov’s axioms of probability theory is that they sidestep irrelevant but otherwise complicating issues. This is often not appreciated because many textbooks choose to work with classical notation such as $P(Y = y \mid X = x)$ rather than with measure-theoretic notation such as $E[g(Y) \mid \sigma(X)]$ where $g$ is a test function and $\sigma(X)$ is the $\sigma$-algebra generated by $X$.
This article endeavours to:
- emphasise that $P(Y = y \mid X = x)$ is an abuse of notation and must be used with care;
- emphasise that we should not use the Euclidean distance when thinking in terms of probabilities (i.e., while there is not much difference between something having a chance of 0.1 or 0.0999999 of occurring, there is an infinite difference between something having a chance of $10^{-100}$ and a chance of $0$ of occurring);
- conclude that it would be more surprising if Borel’s paradox did not exist.
Tangentially, my observation in this field is that there are essentially just two principles that need to be kept in mind in order to be able to avoid/resolve all debates:
- Think first about the finite case, that is, the case when the total number of possible outcomes is finite. Then realise that the infinite case is a mathematical construct that should be understood as a “limit” of the finite case, but that sometimes different limits can give different answers. The resolution then is to go back to the original question in order to determine which limit is the right one (or conclude that the original question is ill-posed because different choices of limits will give different answers).
- Never think purely in terms of estimation; always think in terms of the resulting decisions that may be taken in response to being told that estimate.
The standard theory of probability as we know it is designed for answering questions (filtering, hypothesis testing etc) based on the underlying assumption that the quality of the answers will be judged by some sort of expectation. For example, with respect to the frequentist interpretation, this would mean that we would want to design a filter so that, if the filter were applied every day to a new set of data, then on average, the filter would perform as well as possible (perhaps in the sense of minimising the mean-squared error). The presence of the expectation operator is crucial, for reasons which will become clear presently.
If the set of possible outcomes is finite, and if we agree that for all intents and purposes we are only interested in what happens on average (e.g., designing a filter so it works well on average), then there is no loss of generality in assuming that all outcomes have a non-zero chance of occurring. In this case, conditional probability is well-defined.
In general, when there are sets of measure zero, one must always remember that according to the rigorous theory, one cannot condition on a set of measure zero. There is a simple reason why; we are not given enough information to be able to do this in any meaningful way. A simple example will illustrate this. Assume that I pick a point on a circle uniformly at random. Choose two distinct points, $x$ and $y$, on the circle. Intuitively at least, no harm or inconsistency will come if I conclude that, given that I know that the outcome is either the point $x$ or the point $y$, the probability that it was $x$ is precisely one half; this accords with our intuitive understanding of “uniformly at random”. However, assume that instead I use the following procedure to pick a point on the circle: first I choose a point $z$ uniformly at random, then I check if it is $y$. If it is $y$, I declare my choice to be $x$, otherwise I declare my choice to be $z$. So now, it seems intuitively clear that the probability that the outcome is $y$ given that it is either $x$ or $y$ is zero. Since in Kolmogorov’s formulation of probability these two scenarios have identical descriptions, it follows that it is impossible to know what the conditional probability is of observing $x$ given that either $x$ or $y$ was chosen. This is not a shortcoming but an advantage of Kolmogorov’s framework! As long as we agree that we are working inside an expectation operator (remember the earlier discussion) then such questions are irrelevant. While the theory could conceivably be expanded to allow meaningful answers to be given in certain situations, would it have any practical value?
The appearance of $P(Y = y \mid X = x)$ is an abuse of notation. The advantage is that it looks easier than the measure-theoretic approach. The disadvantage is that wrong answers can be obtained if one is not aware of its correct interpretation; see M. Proschan and B. Presnell, “Expect the unexpected from conditional expectation,” Am Stat, vol. 52, no. 3, pp. 248–252, 1998.
The fact of the matter is that $E[Y \mid X]$ (when it exists) is not a function, but an equivalence class of functions. To specify an equivalence class, we must write down a representative function belonging to that equivalence class, hence $E[Y \mid X]$ will always look like a function, e.g., $E[Y \mid X = x] = x^2$. It is easy to start thinking that it should be a function. And some simple examples can be written down where there seems to be only one obvious choice of representative. But $E[Y \mid X]$ is not a function, and Borel’s paradox is the reason why.
Precisely, $E[Y \mid X]$ is a Radon–Nikodym derivative. It satisfies $\int_A E[Y \mid X] \, dP = \int_A Y \, dP$ for every $A \in \sigma(X)$. If $E[Y \mid X]$ exists then it will not be unique because integration is blind to what happens on sets of measure zero. Therefore, $E[Y \mid X]$ should only appear inside an integral, since it is only averaged versions of $E[Y \mid X]$ that are well-defined. Hence the importance of the expectation operator mentioned earlier; at the end of the day, we are always working under an integral sign, even if it is not always explicitly written down. Therefore, although it may look like $E[Y \mid X]$ is being treated as a well-defined function, it isn’t, because there is an implicit if not explicit expectation somewhere.
It is both interesting and wonderful that it is possible to define an equivalence class of functions in such a way that pointwise evaluation is meaningless yet as soon as there is an integral sign then everything is well-defined. This is at the heart of Borel’s paradox. And unless one is already familiar with working with equivalence classes from other areas of mathematics, it may well take a bit of getting used to. In other areas, square brackets are sometimes used to denote equivalence classes. It might be clearer then to write $[f]$ for the equivalence class of a representative $f$, to remind ourselves that $[f](x)$ is not defined if the set $\{x\}$ has measure zero; different members of the same equivalence class need not agree on sets of measure zero. Provided we always work inside an integral, this is not an issue.
Although there is merit in defining a probability to be a number between 0 and 1 inclusive, one should be careful not to think in terms of the usual Euclidean metric. In some situations, intuition would be better guided if a probability of zero was thought of as $-\infty$ and a probability of one was thought of as $+\infty$. (One way to achieve this is to send a probability $p$ to the real number $\log \frac{p}{1-p}$.) The reason can be seen by thinking in terms of examining long sequences of outcomes. If one biased coin had a probability of 0.001 of coming up heads, and another had a probability of 0.0001, then what we readily notice is the factor of ten; we would expect to get ten times as many heads with the first coin than with the second. Even though the absolute difference is small, the large ratio makes the expected sequences very different. Similarly, a coin with a chance of $10^{-100}$ of heads is infinitely different from a coin with no chance of heads; essentially any countably infinite sequence of outcomes of the first coin will have infinitely many heads whereas essentially any countably infinite sequence of outcomes of the second coin will have no (or at worst a finite number of) heads. These behaviours are very different!
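One concrete choice of such a rescaling is the log-odds map just mentioned. A small sketch (plain Python; the helper name `log_odds` is mine) showing how it stretches the neighbourhoods of 0 and 1:

```python
import math

# Map a probability p in (0, 1) to the extended real line via log-odds.
def log_odds(p):
    return math.log(p / (1 - p))

print(log_odds(0.5))     # 0.0: a fair coin sits at the origin
print(log_odds(0.001))   # about -6.9
print(log_odds(0.0001))  # about -9.2: only log(10) further out than 0.001
print(log_odds(1e-100))  # about -230: vastly further out, heading towards -inf
```

On this scale the gap between 0.001 and 0.0001 is the modest constant $\log 10$, while $10^{-100}$ is far out towards $-\infty$, where a probability of exactly zero lives.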
In other words, one would expect that in many cases, if an experiment were undertaken repeatedly which involved the tossing of a coin, and certain observations were made by averaging over many trials, then whether an unbiased coin was used, or a biased coin with probability $p$ of heads, there would be consistency in that the difference in observations between the biased and the unbiased coin could be made arbitrarily small by taking $p$ arbitrarily close to $\frac12$. The preceding paragraph implies that there should be no reason whatsoever why such consistency should be observed between $p > 0$ and $p = 0$. It does not matter how close $p$ is to zero; it is still “infinitely far away” from zero in terms of expected behaviour. This is the simple reason why Borel’s paradox is not surprising; we should not expect any form of continuity to hold in the limit as $p$ goes to zero. And as soon as there is no continuity, the possibility exists for different sequences to have different limits: just think of the function $f(x, y) = y / x$ and what happens as $(x, y) \to (0, 0)$. For any $c$, a sequence $(x_k, y_k)$ can be constructed such that $(x_k, y_k) \to (0, 0)$ and $f(x_k, y_k) \to c$; simply take $y_k = c \, x_k$. Borel’s paradox is just another demonstration that different sequences can give different answers when continuity does not hold.
- There is no reason for there to be any sort of “continuity” when approaching the extremes of probability 0 or probability 1, because something with probability $10^{-100}$ of happening is still infinitely more likely to happen than something with probability 0, and hence the latter does not serve as a good approximation of the former.
- In practice, there is always an expectation operator inside of which we are working. Therefore, while the notation may suggest we are conditioning on sets of measure zero, we are not. We are always working with equivalence classes.
- The beauty of Kolmogorov’s axioms is that the lack of continuity that occurs when approaching the extremes of probability can be neatly sidestepped.
- Although in some simple examples it would be possible to extend Kolmogorov’s axioms so that conditioning on sets of measure zero were allowed, would this have any practical use? For a start, it would require more information to be provided than the standard probability triple $(\Omega, \mathcal{F}, P)$.
This brief article uses statistical independence as an example of when a mathematical definition is intentionally chosen to be different from the original motivating definition. (Another example comes from topology; the motivating/naive definition of a topological space would involve limits but instead open sets are used to define a topology.) This exemplifies the following messages:
- There is a difference between mathematical manipulations and intuition (and both must be learnt side-by-side); see also the earlier article on The Importance of Intuition.
- Understanding a definition mainly means understanding the usefulness of the definition and how it can be applied in practice.
- This has implications for how to teach and how to learn mathematical concepts.
Two random events, $A$ and $B$, are statistically independent if $P(A \cap B) = P(A)P(B)$. Here is the (small) conundrum. If one were to stare at this definition, it may not make much sense. What is it really telling us about the two events? On the other hand, if one were to learn that if $A$ and $B$ are “unrelated” events that have “nothing to do with each other” then $P(A \cap B) = P(A)P(B)$ must hold, then one might falsely believe to have understood the definition. Indeed, if $A$ and $B$ are related to each other, and $B$ and $C$ are related to each other, then surely $A$ and $C$ are related to each other? Conversely, if event $B$ is defined in terms of event $A$ then surely $A$ is related to $B$? Both these statements are false if ‘related’ is replaced by ‘statistically dependent’.
The true way of understanding statistical independence is to i) acknowledge that while it is motivated from real life by the intuitive notion of unrelated events, it is a different concept that has nevertheless proved to be very useful; and ii) be able to list a number of useful applications. Therefore, upon reading a definition that does not immediately feel comfortable, it may be better to flick through the remainder of the book to see the various uses of the definition than to stare blankly at the definition hoping for divine intuition.
For completeness, two naturally occurring examples of how statistical independence differs from “functional independence” are given. The first comes from the theory of continuous-time Markov chains but can be stated simply. Let $\lambda_1$ and $\lambda_2$ be two positive real numbers representing departure rates. Let $T_1$ and $T_2$ be independent and exponentially distributed random variables with parameters $\lambda_1$ and $\lambda_2$ respectively. (That is, $P(T_i > t) = e^{-\lambda_i t}$ for $t \geq 0$ and $i = 1, 2$.) The rule for deciding where to move to next (in the context of Markov chains) is to see which departure time, $T_1$ or $T_2$, is smaller. (If $T_1$ is smaller than $T_2$ we move to destination 1, otherwise we move to destination 2.) Let $p$ be the probability that $T_1$ is smaller: $p = P(T_1 < T_2)$. It can be shown that $p = \frac{\lambda_1}{\lambda_1 + \lambda_2}$, and moreover, that the event $\{T_1 < T_2\}$ is statistically independent of the departure time $\min\{T_1, T_2\}$. This may seem strange if one thinks in terms of related events, so it is important to treat statistical independence as a mathematical concept that merely means $P(A \cap B) = P(A)P(B)$ regardless of whether or not $A$ and $B$ are, in any sense of the word, “related” to each other.
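This independence is easy to probe by simulation. A sketch assuming numpy, with the illustrative rates $\lambda_1 = 2$, $\lambda_2 = 3$, so $p = 2/5$:

```python
import numpy as np

rng = np.random.default_rng(3)
lam1, lam2 = 2.0, 3.0
N = 200_000
T1 = rng.exponential(1 / lam1, N)  # numpy parameterises by scale = 1/rate
T2 = rng.exponential(1 / lam2, N)

winner1 = T1 < T2            # event: destination 1 is chosen
depart = np.minimum(T1, T2)  # the departure time

print(winner1.mean())  # close to lam1/(lam1+lam2) = 0.4

# Independence check: conditioning on an early or a late departure time
# should leave the probability of destination 1 essentially unchanged.
threshold = np.median(depart)
print(winner1[depart < threshold].mean())   # also close to 0.4
print(winner1[depart >= threshold].mean())  # also close to 0.4
```

Which destination is chosen carries no information about when the departure happens, even though both are functions of the same pair $(T_1, T_2)$.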
The second example is that an event $A$ can be statistically independent of itself! In fact, this turns out to be useful: to prove that $A$ is an “extreme” event, by which I merely mean that either $P(A) = 0$ or $P(A) = 1$, it suffices to prove that $A$ is independent of itself, and sometimes the latter is easier to prove than the former. (Furthermore, having first proved that $P(A)$ can only be zero or one can then make it easier to prove that it equals one, for instance.)
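The calculation behind the self-independence claim is a one-line application of the definition:

```latex
P(A) = P(A \cap A) = P(A)\,P(A) = P(A)^2
\quad\Longrightarrow\quad
P(A)\bigl(1 - P(A)\bigr) = 0
\quad\Longrightarrow\quad
P(A) \in \{0, 1\}.
```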
In closing, it is remarked that one can always challenge a definition by asking why this particular definition. Perhaps a different definition of statistical independence might be better? The response will always be: try to find a better definition! Sometimes you might be successful; this is how definitions are refined and generalised over time. Just keep in mind that a “good” definition is one that is useful and not necessarily one that mimics perfectly our intuition from the real world.
This is a sequel to The Tensor Product in response to a comment posted there. It endeavours to explain the difference between a tensor and a matrix. It also explains why ‘tensors’ were not mentioned in The Tensor Product.
A matrix is a two-dimensional array of numbers (belonging to a field such as $\mathbb{R}$ or $\mathbb{C}$) which can be used freely for any purpose, including for organising data collected from an experiment. Nevertheless, there are a number of commonly defined operations involving scalars, vectors and matrices, such as matrix addition, matrix multiplication, matrix-by-vector multiplication and scalar multiplication. These operations allow a matrix to be used to represent a linear map from one vector space to another, and it is this aspect of matrices that is the most relevant here.
Although standard usage means that an $m \times n$ matrix over $\mathbb{R}$ defines a linear map from the vector space $\mathbb{R}^n$ to the vector space $\mathbb{R}^m$, this is an artefact of $\mathbb{R}^n$ and $\mathbb{R}^m$ implicitly being endowed with a canonical choice of basis vectors. It is perhaps cleaner to start from scratch and observe that a matrix on its own does not define a linear map between two vector spaces: given two two-dimensional vector spaces $V$ and $W$ and the matrix $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ (by which I mean the two-by-two matrix whose elements are, from left-to-right top-to-bottom, $1, 2, 3, 4$), what linear map from $V$ to $W$ does it represent? By convention, a linear map is represented by a matrix in the following way, with respect to a particular choice of basis vectors for $V$ and for $W$. Let $\{v_1, v_2\}$ be a basis for $V$, and $\{w_1, w_2\}$ a basis for $W$. If $f \colon V \to W$ is a linear map then it is fully determined once the values of $f(v_1)$ and $f(v_2)$ are revealed. Furthermore, since $f(v_1)$ is an element of $W$, it can be written as $f(v_1) = a_{11} w_1 + a_{21} w_2$ for a unique choice of scalars $a_{11}$ and $a_{21}$. Similarly, $f(v_2) = a_{12} w_1 + a_{22} w_2$. Knowing the scalars $a_{ij}$, together with knowing the choice of basis vectors $\{v_1, v_2\}$ and $\{w_1, w_2\}$, allows one to determine what the linear map $f$ is. (An alternative but essentially equivalent approach would have been to agree that a matrix defines a linear map from $\mathbb{R}^2$ to $\mathbb{R}^2$, and therefore, to represent a linear map $f \colon V \to W$ by a matrix, it is first necessary to choose an isomorphism from $V$ to $\mathbb{R}^2$ and an isomorphism from $W$ to $\mathbb{R}^2$.)
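The basis-dependence of the matrix representation can be made concrete. A sketch assuming numpy (the helper `matrix_of` and the sample map are mine): the same linear map yields different matrices with respect to different bases.

```python
import numpy as np

# Represent a linear map f with respect to given bases of the domain and codomain.
# Column j holds the coordinates of f(v_j) expressed in the codomain basis.
def matrix_of(f, basis_V, basis_W):
    W = np.column_stack(basis_W)  # codomain basis vectors as columns
    cols = [np.linalg.solve(W, f(v)) for v in basis_V]
    return np.column_stack(cols)

# A sample linear map (here its action is written out in standard coordinates).
f = lambda x: np.array([2 * x[0] + x[1], x[0] - x[1]])

std = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
other = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]

print(matrix_of(f, std, std))      # the familiar matrix [[2, 1], [1, -1]]
print(matrix_of(f, other, other))  # a different matrix: same map, different bases
```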
Unlike a matrix, which can only represent a linear map between vector spaces once a choice of basis vectors has been made, a tensor is a linear (or multi-linear) map. The generalisation from linear to multi-linear maps is a distraction which may lead one to believe the difference between a tensor and a matrix is that a tensor is a generalisation of a matrix to higher dimensions, but this is missing the key point: the machinery of changes of coordinates, which is external to the definition of a matrix as an array of numbers, is internal to the definition of a tensor.
Examples of tensors are linear maps $f\colon V \to \mathbb{R}$ and $g\colon V \to V$, and bilinear maps $h\colon V \times V \to \mathbb{R}$. They are tensors because they are multi-linear maps between vector spaces. Choices of basis vectors do not enter the picture until one wishes to describe a particular map to a friend; unless it is possible to define the map in terms of already known maps, it becomes necessary to write down a set of basis vectors and specify the tensor as an array of numbers with respect to this choice of basis vectors. This leads to the traditional definition of tensors, which is still commonly used in physics and engineering.
For convenience and consistency of notation, usually tensors are re-written as multi-linear maps into $\mathbb{R}$ (or whatever the ground field is). Both $f$ and $h$ above are already of this form, but $g$ is not. This is easily rectified; there is a natural equivalence between linear maps $g\colon V \to V$ and bilinear maps $\tilde g\colon V \times V^* \to \mathbb{R}$, where $V^*$ is the dual space of $V$; recall that elements of $V^*$ are simply linear functionals on $V$, that is, if $\sigma \in V^*$ then $\sigma$ is a linear function $\sigma\colon V \to \mathbb{R}$. This equivalence becomes apparent by observing that if, for a fixed $v$, the values of $\sigma(g(v))$ are known for every $\sigma \in V^*$ then the value of $g(v)$ is readily deduced, and furthermore, the map taking a $v \in V$ and a $\sigma \in V^*$ to $\sigma(g(v))$ is bilinear; precisely, the correspondence between $g$ and $\tilde g$ is given by $\tilde g(v, \sigma) = \sigma(g(v))$. (If that’s not immediately clear, an intermediate step is the realisation that if the value of $\sigma(w)$ is known for every $\sigma \in V^*$ then the value of $w \in V$ is readily determined. Therefore, if we are unhappy about the range of $g$ being $V$ rather than $\mathbb{R}$, we can simply use elements of $V^*$ to probe the values taken by $g$.)
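A quick Python sketch of this correspondence, with $V = \mathbb{R}^2$ and an illustrative map $g$ of my own choosing; covectors are stored as coefficient pairs, so $\sigma = (s_1, s_2)$ acts by $\sigma(v) = s_1 v_1 + s_2 v_2$. Probing $\tilde g$ with the dual basis recovers $g$.

```python
# Sketch of the correspondence g <-> g~(v, sigma) = sigma(g(v)) on V = R^2.
# g is an arbitrary illustrative linear map.

def g(v):
    x, y = v
    return (2 * x + y, x - y)

def apply_covector(sigma, v):
    return sigma[0] * v[0] + sigma[1] * v[1]

def g_tilde(v, sigma):            # the bilinear map associated with g
    return apply_covector(sigma, g(v))

# Probing with the dual basis covectors (1,0) and (0,1) recovers g(v)
# from g~ alone:
v = (3, 5)
recovered = (g_tilde(v, (1, 0)), g_tilde(v, (0, 1)))
print(recovered == g(v))   # True
```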
An equivalent definition of a tensor is therefore a multi-linear map of the form $V \times \cdots \times V \times V^* \times \cdots \times V^* \to \mathbb{R}$; see the wikipedia for details. (Linear maps between different vector spaces is a slightly more general concept and is not considered here for simplicity.)
It remains to introduce the tensor product (and to give another definition of tensors, this time in terms of tensor products).
First observe that $T_{p,q}$, the set of all multi-linear maps $V \times \cdots \times V \times V^* \times \cdots \times V^* \to \mathbb{R}$ where there are $p$ copies of $V$ and $q$ copies of $V^*$, can be made into a vector space in an obvious way; just use pointwise addition and scalar multiplication of the multi-linear maps. Next, observe that $T_{1,0}$ and $V^*$ are isomorphic. Furthermore, since $V^{**}$, the dual of the dual of $V$, is naturally isomorphic to $V$, it is readily seen that $T_{0,1}$ is isomorphic to $V$. Can $T_{p,q}$ be constructed from multiple copies of $V^*$ and $V$?
It turns out that $T_{p,q}$ is isomorphic to $V^* \otimes \cdots \otimes V^* \otimes V \otimes \cdots \otimes V$ where there are $p$ copies of $V^*$, $q$ copies of $V$ and $\otimes$ is the tensor product defined in The Tensor Product. Alternatively, one could have invented the tensor product by examining how $T_{2,0}$ can be constructed from two copies of $T_{1,0}$, then observing that the same construction can be repeated and applied to arbitrary vector spaces, thereby ‘deriving’ a useful operation denoted by $\otimes$.
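As a sanity check on dimensions, the following Python sketch (the bilinear map $h$ is an illustrative choice) confirms that a bilinear map $h\colon \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$, i.e. an element of $T_{2,0}$, is pinned down by the $2 \times 2$ array of numbers $h(e_i, e_j)$, matching the dimension $2 \cdot 2$ of $V^* \otimes V^*$.

```python
# A bilinear map h : R^2 x R^2 -> R is determined by its values on pairs
# of standard basis vectors.  h is an illustrative bilinear map.

def h(u, v):
    return u[0] * v[0] + 3 * u[0] * v[1] - u[1] * v[0]

E = [(1, 0), (0, 1)]
coeffs = [[h(ei, ej) for ej in E] for ei in E]   # the 4 numbers h(e_i, e_j)

def h_from_coeffs(u, v):
    # Reconstruct h by bilinear expansion: u = u0*e0 + u1*e1, etc.
    return sum(coeffs[i][j] * u[i] * v[j] for i in range(2) for j in range(2))

u, v = (2, 5), (-1, 4)
print(coeffs)                        # [[1, 3], [-1, 0]]
print(h(u, v) == h_from_coeffs(u, v))   # True
```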
Another equivalent definition of a tensor is therefore an element of a vector space of the form $V^* \otimes \cdots \otimes V^* \otimes V \otimes \cdots \otimes V$, and this too is explained in the wikipedia.
To summarise the discussion so far (and restricting attention to the scalar field $\mathbb{R}$ for simplicity):
- Matrices do not, on their own, define linear maps between vector spaces (although they do define linear maps between the Euclidean spaces $\mathbb{R}^n$ and $\mathbb{R}^m$).
- A tensor is a multi-linear map whose domain and range involve zero or more copies of a vector space $V$ and its dual $V^*$.
- Any such map can be re-arranged to be of the form $V \times \cdots \times V \times V^* \times \cdots \times V^* \to \mathbb{R}$.
- The space of such maps (for a fixed number $p$ of copies of $V$ and $q$ copies of $V^*$) forms a vector space denoted $T_{p,q}$.
- It turns out that there exists a single operation $\otimes$ which takes two vector spaces and returns a third, such that, for any $p$ and $q$, $T_{p,q}$ is isomorphic to $V^* \otimes \cdots \otimes V^* \otimes V \otimes \cdots \otimes V$ (with $p$ copies of $V^*$ and $q$ copies of $V$). This operation is called the tensor product.
- Since $T_{1,0}$ and $T_{0,1}$ are isomorphic to $V^*$ and $V$ respectively, an equivalent definition of a tensor is an element of a vector space of the form $V^* \otimes \cdots \otimes V^* \otimes V \otimes \cdots \otimes V$.
Some loose ends are now tidied up. Without additional reading though, certain remaining parts of this article are unlikely to be self-explanatory; the main purpose is to alert the reader to what to look out for when learning from textbooks. First, it will be explained how an element of $V^* \otimes W$ represents a linear map $V \to W$. Then an additional usage will be given of the tensor product symbol $\otimes$: the tensor product of two multi-linear maps results in a new multi-linear map. This additional aspect of tensor products was essentially ignored in The Tensor Product. Lastly, an explanation is given of why I omitted any mention of tensors in The Tensor Product.
Let $T \in V^* \otimes W$. The naive way to proceed is as follows. Introduce a basis $\{\sigma_1, \cdots, \sigma_m\}$ for $V^*$ and $\{w_1, \cdots, w_n\}$ for $W$; different choices will ultimately lead to the same result. Then $T$ can be written as a linear combination $T = \sum_{i,j} c_{ij}\, \sigma_i \otimes w_j$ where the $c_{ij}$ are scalars. Recall from The Tensor Product that the $\sigma_i \otimes w_j$ are just formal symbols used to distinguish one basis vector of $V^* \otimes W$ from another. Here’s the trick; we now associate to each $\sigma_i \otimes w_j$ the linear map that sends $v \in V$ to $\sigma_i(v)\, w_j$; clearly this is a linear map from $V$ to $W$. (This is relatively easy to remember, for how else could we combine a $\sigma_i$ with a $w_j$ to obtain a linear map from $V$ to $W$?) Then, we associate to $T$ the linear map $v \mapsto \sum_{i,j} c_{ij}\, \sigma_i(v)\, w_j$. It can be verified that this mapping is an isomorphism from the vector space $V^* \otimes W$ to the vector space of linear maps from $V$ to $W$, and moreover, the same mapping results regardless of the original choice of basis vectors for $V^*$ and $W$. While this is useful for actual computations, it does not explain how we knew to use the above trick of sending $\sigma_i \otimes w_j$ to $v \mapsto \sigma_i(v)\, w_j$.
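The trick can be checked numerically. In the sketch below (standard bases on $V = W = \mathbb{R}^2$, so $\sigma_i(v)$ is just the $i$th coordinate of $v$; the coefficient array is an arbitrary illustrative choice), an element of $V^* \otimes W$ is stored as its coefficient array and applied to vectors as a linear map.

```python
# An element T = sum_ij c[i][j] sigma_i (x) w_j of V* (x) W, stored as the
# coefficient array c, acting on vectors as a linear map V -> W.
# Here c = [[1, 0], [2, -1]] means T = sigma_0(x)w_0 + 2 sigma_1(x)w_0 - sigma_1(x)w_1.

c = [[1, 0], [2, -1]]

def as_linear_map(c, v):
    """Apply the linear map associated with sum_ij c[i][j] sigma_i (x) w_j."""
    out = [0.0, 0.0]
    for i in range(2):
        for j in range(2):
            out[j] += c[i][j] * v[i]    # sigma_i(v) * w_j lands in slot j
    return tuple(out)

print(as_linear_map(c, (1, 0)))   # (1.0, 0.0)
print(as_linear_map(c, (0, 1)))   # (2.0, -1.0)
```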
A more sophisticated way to proceed uses the universal property characterisation of the tensor product and makes it clear why the above construction works. (The universal property characterisation is defined in the wikipedia among other places.) Essentially, under this characterisation, every bilinear map from $V \times W^*$ to $\mathbb{R}$ induces a unique linear map from $V \otimes W^*$ to $\mathbb{R}$, and conversely, every linear map from $V \otimes W^*$ to $\mathbb{R}$ induces a unique bilinear map from $V \times W^*$ to $\mathbb{R}$. Now, we already know from earlier that linear maps from $V$ to $W$ are equivalent to bilinear maps from $V \times W^*$ to $\mathbb{R}$. As now shown, choosing an element of $V^* \otimes W$ is equivalent to choosing a linear map from $V \otimes W^*$ to $\mathbb{R}$. Indeed, by definition of dual, linear maps from $V \otimes W^*$ to $\mathbb{R}$ are precisely the elements of $(V \otimes W^*)^*$, and this space is isomorphic to $V^* \otimes W$, as required.
So far, we have only introduced the tensor product of two vector spaces. However, there is a companion operation which takes elements $v$ and $w$ of vector spaces $V$ and $W$ and returns an element $v \otimes w$ of the vector space $V \otimes W$. (Recall that we have only introduced the formal symbol $v_i \otimes w_j$ to denote a basis vector in the case where $\{v_i\}$ and $\{w_j\}$ are chosen bases for $V$ and $W$; no meaning has been given yet to $v \otimes w$ for arbitrary elements $v$ and $w$.) For calculations, it suffices to think of $v \otimes w$ as the element obtained by applying formal algebraic laws such as $(v_1 + 2 v_2) \otimes w_1 = v_1 \otimes w_1 + 2\, v_2 \otimes w_1$. Precisely, if $v = \sum_i a_i v_i$ and $w = \sum_j b_j w_j$ where $\{v_i\}$ and $\{w_j\}$ are chosen bases for $V$ and $W$ then $v \otimes w$ is defined to be $\sum_{i,j} a_i b_j\, v_i \otimes w_j$, as suggested by the formal manipulations $\left(\sum_i a_i v_i\right) \otimes \left(\sum_j b_j w_j\right) = \sum_{i,j} a_i b_j\, v_i \otimes w_j$. I mention this only because I want to point out that this leads to the following simple rule for computing the tensor product of two multi-linear maps.
Let $f\colon V \to \mathbb{R}$ and $g\colon W \to \mathbb{R}$ be multi-linear maps; in fact, for ease of presentation, only this special case of two linear maps will be considered. Then a multi-linear map $V \times W \to \mathbb{R}$ can be formed from $f$ and $g$, namely, $(v, w) \mapsto f(v)\, g(w)$. This multi-linear map is denoted $f \otimes g$ and is called the tensor product of $f$ and $g$, in agreement with the discussion in the previous paragraph.
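In code the rule is one line; the functionals $f$ and $g$ below are illustrative choices on $\mathbb{R}^2$.

```python
# Tensor product of two linear functionals: (f (x) g)(v, w) = f(v) * g(w).

def tensor(f, g):
    return lambda v, w: f(v) * g(w)

f = lambda v: v[0] + v[1]        # f : R^2 -> R
g = lambda w: 2 * w[0] - w[1]    # g : R^2 -> R

fg = tensor(f, g)
print(fg((1, 2), (3, 4)))        # 3 * 2 = 6

# fg is bilinear, e.g. additive in the first slot for a fixed second slot:
a = fg((1, 0), (3, 4)) + fg((0, 2), (3, 4))
print(a == fg((1, 2), (3, 4)))   # True
```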
It is easier to motivate the tensor product $f \otimes g$ of two tensors than it is to motivate the tensor product $V \otimes W$ of two tensor spaces. Here is an example of such motivation.
What useful multi-linear maps can be formed from two linear maps $f, g\colon V \to \mathbb{R}$? Playing around, it seems $v \mapsto f(v) + g(v)$ is a linear map; let’s call it $f + g$. Multiplication does not work because $v \mapsto f(v)\, g(v)$ is not linear, so $fg$ is not a tensor. The ordinary cross-product would lead to a map which is not of the form of a tensor as stated earlier. What we can do though is form the map $(u, v) \mapsto f(u)\, g(v)$. We denote this bilinear map by $f \otimes g$. Experience has shown the construction $f \otimes g$ to be useful (which, at the end of the day, is the main justification for introducing new definitions and symbols), although for the moment, the choice of symbol $\otimes$ has not been justified save for it should be different from more common symbols such as $+$, $\times$ and $\cdot$ which mean different things, as observed immediately above. Furthermore, the definition of $f \otimes g$ as stated above readily extends to general tensors…
One could perhaps use the above paragraph as the start of an introduction to tensor products. In The Tensor Product, I chose instead to ignore tensors completely because, although there is nothing ‘difficult’ about them, the above indicates that there are a multitude of small issues that need to be explained, and at the end of the day, it is not even clear at the outset if there is any use in studying tensor products of tensor spaces! What does the fancy symbol $\otimes$ buy us that we could not have obtained for free just by working directly with multi-linear maps and arrays of numbers? (There are benefits, but they appear in more advanced areas of mathematics and hence are hard to motivate concretely at an elementary level. Of course, one could say the tensor product reduces the study of multi-linear maps to linear maps, that is, back to more familiar territory, but multi-linear maps are not that difficult to work with directly in the first place.)
The Tensor Product motivated the tensor product by wishing to construct $\mathbb{R}[x, y]$ from $\mathbb{R}[x]$ and $\mathbb{R}[y]$. It was stated there that this allowed properties of $\mathbb{R}[x, y]$ to be deduced from properties of $\mathbb{R}[x]$ and $\mathbb{R}[y]$. A solid example of this would be localisation of rings: if we have managed to show that the localisation of $\mathbb{R}[x]$ is isomorphic to the ring of Laurent polynomials in one variable, then properties of the tensor product allow us to conclude that the localisation of $\mathbb{R}[x, y]$ is isomorphic to the ring of Laurent polynomials in two variables. Even without such examples though, I am more comfortable taking for granted that it is useful to be able to construct $\mathbb{R}[x, y]$ from $\mathbb{R}[x]$ and $\mathbb{R}[y]$ than it is useful to be able to construct $V \otimes W$ from $V$ and $W$.
Various discussions on the internet indicate the concept of tensor product is not always intuitive to grasp on a first reading. Perhaps the reason is it is harder to motivate the concept of tensor product on the vector space $\mathbb{R}^n$ than it is to motivate the concept of a tensor product on a polynomial ring. The following endeavours to give an easily understood pre-introduction to the tensor product. That is to say, the aim is to motivate the standard introductions that are online and in textbooks. Familiarity is assumed with only linear algebra and basic polynomial manipulations (i.e., adding and multiplying polynomials), hence the first step is to introduce just enough formalism to talk about polynomial rings.
By $\mathbb{R}[x]$ is meant the set of all polynomials in the indeterminate $x$ with real-valued coefficients. With the usual definitions of addition and scalar multiplication, $\mathbb{R}[x]$ becomes a vector space over the field $\mathbb{R}$ of real numbers. For example, $1 + 2x$ and $3x + x^2$ are elements of $\mathbb{R}[x]$ and can be added to form $1 + 5x + x^2$. An example of scalar multiplication is that $5$ times $1 + 2x$ is $5 + 10x$. These definitions of addition and scalar multiplication together satisfy the axioms of a vector space, thereby making $\mathbb{R}[x]$ into a vector space over $\mathbb{R}$. (The reason why $\mathbb{R}[x]$ is called a polynomial ring rather than a polynomial vector space is because it has additional structure — it is a special kind of a vector space — in the form of a multiplication operation which is compatible with the addition and scalar multiplication operations. This is not important for us though.)
Polynomials in two indeterminates, say $x$ and $y$, also form a vector space (and indeed, a ring) with respect to the usual operations of polynomial addition and scalar multiplication. This vector space is denoted $\mathbb{R}[x, y]$.
The cross-product of two vector spaces is relatively easy to motivate and understand, so much so that the reader is assumed to be familiar with the cross-product. Recall that if $V$ and $W$ are vector spaces then the elements of $V \times W$ are merely the pairs $(v, w)$ where $v \in V$ and $w \in W$. Recall too that $\mathbb{R}^m \times \mathbb{R}^n$ is isomorphic to $\mathbb{R}^{m+n}$. In particular, observe that the cross-product is a means of building a new vector space from other vector spaces.
The tensor-product is just like the cross-product in that it too allows one to build a new vector space from other vector spaces. It might be illuminating to think of the other vector spaces as building blocks, and the new vector space as something more complicated built from these simpler building blocks, although of course this need not always be the case. Regardless, what makes such constructions useful is that properties of the new vector space can be deduced from properties of the building block vector spaces. (The simplest example of such a property is the dimension of a vector space; the dimension of $V \times W$ is the sum of the dimensions of $V$ and $W$.)
Can $\mathbb{R}[x, y]$ be obtained from $\mathbb{R}[x]$ and $\mathbb{R}[y]$? Let’s try the cross-product. The vector space $\mathbb{R}[x] \times \mathbb{R}[y]$ consists of pairs $(p(x), q(y))$ where $p(x)$ represents a polynomial in $x$ and $q(y)$ a polynomial in $y$. This does not appear to work, for how should the polynomial $x + y$ be represented in the form $(p(x), q(y))$? (If the ring structure were taken into account then it would be possible to multiply two elements of $\mathbb{R}[x] \times \mathbb{R}[y]$ and it would be seen that, in effect, $\mathbb{R}[x] \times \mathbb{R}[y]$ consists of polynomials that can be factored as $p(x)\, q(y)$, a proper subset of $\mathbb{R}[x, y]$. For this introduction though, all that is important is that $\mathbb{R}[x, y]$ is not the cross-product of $\mathbb{R}[x]$ and $\mathbb{R}[y]$.)
With the belief that $\mathbb{R}[x, y]$ should be constructible from $\mathbb{R}[x]$ and $\mathbb{R}[y]$, let’s try to figure out what is required to get the job done. Favouring simplicity over sophistication, let’s investigate the problem in terms of basis vectors. The obvious choices of basis vectors are $1, x, x^2, \cdots$ for $\mathbb{R}[x]$, $1, y, y^2, \cdots$ for $\mathbb{R}[y]$ and $1, x, y, x^2, xy, y^2, \cdots$ for $\mathbb{R}[x, y]$. We are in luck as the pattern is easy to spot: the basis vectors of $\mathbb{R}[x, y]$ are just all pairwise products (in the sense of polynomial multiplication) of the basis vectors of $\mathbb{R}[x]$ with the basis vectors of $\mathbb{R}[y]$!
We are therefore motivated to define a construction that takes two vector spaces $V$ and $W$ and creates a new vector space, denoted $V \otimes W$, where the new vector space is defined in terms of its basis vectors; roughly speaking, the basis vectors of $V \otimes W$ comprise all formal pairwise products of basis vectors in a particular basis of $V$ with basis vectors in a particular basis of $W$. How to do this precisely may not be readily apparent but the fact that $\mathbb{R}[x, y]$ is a well-defined vector space that, intuitively at least, is constructible from $\mathbb{R}[x]$ and $\mathbb{R}[y]$ gives us hope that such a construction can be made to work. Needless to say, such a construction is possible and is precisely the tensor product.
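The polynomial picture can be played with directly in Python, representing a polynomial in $x$ by its coefficient list and a polynomial in $x$ and $y$ by a grid of coefficients of $x^i y^j$; the pairwise products of basis monomials then show up as an outer product of coefficient lists (the particular polynomials below are illustrative choices).

```python
# Polynomials in x as coefficient lists [a0, a1, ...]; a polynomial in
# x and y as a grid c[i][j] of coefficients of x^i y^j.

def outer(p, q):
    """Grid for the two-variable polynomial p(x) * q(y)."""
    return [[a * b for b in q] for a in p]

# x -> [0, 1] and y -> [0, 1], so their product is the monomial x*y:
print(outer([0, 1], [0, 1]))      # [[0, 0], [0, 1]]

# But not every grid is a single outer product: 1 + x*y (grid
# [[1, 0], [0, 1]]) cannot be factored as p(x) * q(y); it is a *sum*
# of two outer products:
def grid_add(c, d):
    return [[ci + di for ci, di in zip(cr, dr)] for cr, dr in zip(c, d)]

one_plus_xy = grid_add(outer([1, 0], [1, 0]), outer([0, 1], [0, 1]))
print(one_plus_xy)    # [[1, 0], [0, 1]]
```

The grid for $1 + xy$ being a sum of outer products, but not an outer product itself, echoes the earlier observation that multiplying pairs in $\mathbb{R}[x] \times \mathbb{R}[y]$ only yields the factorable polynomials.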
There are two equivalent ways of defining the tensor product. The first follows immediately from the above description. If $\{v_1, v_2, \cdots\}$ and $\{w_1, w_2, \cdots\}$ are bases for $V$ and $W$ then $V \otimes W$ is defined to be the vector space formed from all formal linear combinations of the basis vectors $v_i \otimes w_j$. It is emphasised that $v_i \otimes w_j$ is just a symbol, a means for identifying a particular basis vector. (At the risk of belabouring the point, I can form a vector space from the basis vectors Fred and Charlie; typical elements would be 2 Fred + 3 Charlie and 5 Fred – 1 Charlie, and the vector space addition of these two elements would be 7 Fred + 2 Charlie.) Of course, choosing a different set of basis vectors for $V$ or $W$ would result in a different vector space $V \otimes W$; however, the resulting vector spaces will always be isomorphic to each other. Therefore, $V \otimes W$ should be thought of as representing any vector space that can be obtained using the above method. (This finer point is not dwelled on here but is a common occurrence in mathematics; a construction can yield a new space which is only unique up to isomorphism. Once tensor products have been understood, the reader is invited to think further about this lack of uniqueness, how it is handled and why it is not an issue in practice.)
The more sophisticated way of defining the tensor product is to consider bilinear maps. As with most things, it is important to understand both methods. The elementary method above provides a basic level of intuition that is easy to grasp, but it is clumsy to work with and harder to generalise. The more sophisticated method is more convenient to work with and adds an extra layer of intuition, but it masks the basic level of intuition and therefore actually offers less in the way of intuition to the neophyte.
The motivated reader may now like to:
- Read an introduction to the tensor product on vector spaces that uses basis vectors.
- Read an alternative introduction that uses bilinear maps.
- Work out why the two methods are equivalent.
A hint is offered as to why there is a relationship between “products” and “bilinear maps”, and why the tensor product can be thought of as “the most general product”. Assume we want to define a product on a vector space. That is, let $V$ be a vector space and we want to define a function $m\colon V \times V \to V$ that we think of as multiplication; given $u, v \in V$, their product is defined to be $m(u, v)$. What do we mean by multiplication? Presumably, we mean that the laws of associativity, distributivity and scalar multiplication hold, such as $m(u, v + w) = m(u, v) + m(u, w)$ and $m(\alpha u, v) = \alpha\, m(u, v)$. Merely requiring the scalar multiplication and distributive laws to hold is equivalent to requiring that $m$ be bilinear; that is the connection. In general, one can seek to define a multiplication rule $m\colon V \times W \to U$ between two different vector spaces, which is the level of generality at which the cross-product works. As textbooks will hasten to point out, any bilinear function $m\colon V \times W \to U$ can be represented by a linear map from $V \otimes W$ to $U$. Personally, I prefer to think of this in the first instance as a bonus result we get for free from the tensor product rather than as the initial motivation for introducing the tensor product; only with the benefit of hindsight are we motivated to define the tensor product using bilinear maps.
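A small Python sketch of the factorisation (the bilinear map $m$ is an illustrative choice): a bilinear map on $\mathbb{R}^2 \times \mathbb{R}^2$ agrees with a linear functional applied to the grid of products $u_i v_j$, which is the coordinate picture of $u \otimes v$.

```python
# A bilinear m : R^2 x R^2 -> R factors through the tensor product: it
# equals a *linear* functional applied to the outer-product grid of u and v.

def m(u, v):
    return u[0] * v[1] + 4 * u[1] * v[0]

E = [(1, 0), (0, 1)]
M = [[m(ei, ej) for ej in E] for ei in E]       # values on basis pairs

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]  # coordinates of u (x) v

def induced_linear(grid):
    # Linear in the grid entries -- this is the induced map on V (x) W.
    return sum(M[i][j] * grid[i][j] for i in range(2) for j in range(2))

u, v = (2, -1), (5, 3)
print(M)                                        # [[0, 1], [4, 0]]
print(m(u, v) == induced_linear(outer(u, v)))   # True
```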
Tensor products turn out to be very convenient in other areas too. As just one example, in homological algebra they are a convenient way of changing rings. A special case is converting a vector space over the real numbers into a vector space over the complex numbers; an engineer would not think twice about doing this — anywhere a real number is allowed, allow now a complex number, and carry on as normal! — but how can it be formalised? Quite simply in fact; if $V$ is a vector space over the real field then $\mathbb{C} \otimes V$ is the complexification of $V$; although technically $\mathbb{C} \otimes V$ is a vector space over the real field according to our earlier definition of tensor product, there is a natural way to treat it as a vector space over the complex field. The reader is invited to contemplate this at leisure, recalling that $\mathbb{C}$ is a two-dimensional vector space over the real field, with basis $\{1, i\}$.
In conclusion, the tensor product can be motivated by the desire to construct the vector space of polynomials in two indeterminates out of two copies of a vector space of polynomials in only a single indeterminate. Once the construction is achieved, it is found to have a number of interesting properties, including properties relating to bilinear maps. With the benefit of hindsight, it is cleaner to redefine the tensor product in terms of bilinear maps; this broadens its applicability and tidies up the maths, albeit at the expense of possibly making less visible some of the basic intuition about the tensor product.
I have not come across an introduction to the Legendre transform which was entirely intuitive and satisfying. The article Making Sense of the Legendre Transform is one such attempt and has a number of attractive features. However, like other attempts, it immediately links the Legendre transform to re-parametrising a function by its derivative, yet this link is not made crystal clear. Below, I present an alternative introduction to the Legendre transform which takes as its starting point the fact that a convex set is uniquely defined by the collection of its supporting hyperplanes.
My purpose is not to be rigorous, but instead, to present just enough facts for the reader to be comfortable with the basic ideas surrounding the Legendre transform and to tune the reader’s receptiveness to other pedagogic material on Legendre transforms.
The Legendre Transform
Readers are invited to draw their favourite convex function $f\colon \mathbb{R} \to \mathbb{R}$ on a piece of paper. Treating the graph of $f$ as the boundary of a cup, we can fill the inside of it. The resulting shape (the boundary plus the inside) is called the epigraph of the convex function. (Precisely, the epigraph of $f$ is $\{(x, y) \mid y \geq f(x)\}$.) Observe that the epigraph of a convex function is a convex set.
A line (or more generally, a hyperplane, if we were working in a higher dimension) is called a supporting hyperplane of a convex set in $\mathbb{R}^2$ if it intersects the convex set and the convex set is contained on just one side of it. (This implies that the points of intersection are boundary points.) For example, the horizontal line $y = 0$ is a supporting hyperplane for the epigraph of $f(x) = x^2$. Going further, given a gradient $s$, we can draw the line $y = sx + c$ on the same graph as $f$, and we observe that if $c$ is a large negative number then the line does not intersect $f$ (or its epigraph) at all, and as $c$ is increased, there comes a time when $y = sx + c$ first touches $f$. This value of $c$ makes $y = sx + c$ a supporting hyperplane for the epigraph of $f$. As $c$ increases further, there will be points of the epigraph of $f$ on either side of the line $y = sx + c$. For the function $f(x) = e^x$, it is seen that if $s$ is negative, there are no corresponding supporting hyperplanes. In general, given a convex function $f$, for each slope $s$ there is at most one value of $c$ making $y = sx + c$ a supporting hyperplane of the epigraph of $f$. Furthermore, the set of values of $s$ for which a supporting hyperplane exists forms a convex set.
Recall that in convex analysis, it is a very useful fact that a convex set is defined by its supporting hyperplanes. Therefore, if we know all the supporting hyperplanes of (the epigraph of) $f$ then we should be able to reconstruct $f$. Shortly, we will endeavour to do this from first principles.
Perhaps the simplest way of “storing” what the supporting hyperplanes of $f$ are, is to graph $c$ versus $s$. Indeed, for each value of $s$, there is at most one value of $c$ for which $y = sx + c$ is a supporting hyperplane, hence plotting $s$ on the horizontal axis and $c$ on the vertical axis produces the graph of a function. In fact, it can be proved that this graph will always be concave if $f$ is convex. To visualise this, pick a point $x$ and draw a tangent line to the graph of $f$ at the point $(x, f(x))$. This is a supporting hyperplane with $s$ equal to $f'(x)$ and $c$ equal to the $y$-intercept of the tangent line, namely $c = f(x) - x f'(x)$. Now, as $x$ goes from $-\infty$ to $\infty$, it can be seen that $s$ monotonically increases while $c$ initially increases and then decreases. Furthermore, observe that plotting the points $(s, c)$ corresponding to the supporting hyperplanes is the same as plotting the points $(f'(x),\, f(x) - x f'(x))$ for all $x$, which traces out the graph of $c$ as a function of $s$.
Plotting $-c$ versus $s$ will produce a convex function if $f$ is convex, and this is the convention which has proved expedient. Clearly, there is no material difference between plotting $c$ or $-c$ as a function of $s$.
Formally, we have just seen that to any convex function $f$ we can associate a function $f^*$ such that $y = sx + c$ is a supporting hyperplane of (the epigraph of) $f$ if and only if $c = -f^*(s)$ (where the negative sign is just a matter of convention). The function $f^*$ thus defined is called the Legendre transform of $f$. As mentioned earlier, $f^*$ is not necessarily defined on the whole of $\mathbb{R}$ but only on a convex subset of $\mathbb{R}$. To overcome this minor notational inconvenience, it is customary to set $f^*(s)$ equal to infinity for values of $s$ for which it would otherwise be undefined. This is a standard trick in convex analysis for avoiding the need to keep track explicitly of the set on which a convex function is defined.
Can we think of another way of storing the set of supporting hyperplanes of $f$? Each hyperplane is represented by the pair of coefficients $(s, c)$. If we had chosen to fix $c$ instead, we would have found that depending on the value of $c$, there could be multiple values of $s$ for which $y = sx + c$ is a supporting hyperplane. It seems sensible then to represent the set of all points $(s, c)$ corresponding to supporting hyperplanes by using the aforementioned function $f^*$. Therefore, I choose to interpret the Legendre transform as what one would arrive at if charged with the task of writing down in a nice and simple way what the supporting hyperplanes are of the epigraph of a convex function $f$.
As promised earlier, let’s see how a function $f$ can be recovered from its Legendre transform $f^*$. Recall that for each $s$ (for which $f^*(s)$ is finite), the line $y = sx - f^*(s)$ is a supporting hyperplane. In particular, we know that at least one point of the graph of $f$ lies on this line, and we know that every single point of the graph of $f$ lies on or above this line. By drawing all the supporting hyperplanes, we can see intuitively that they therefore trace out $f$. In fact, readers familiar with envelopes of curves will see that this is the same idea here, and for those not familiar, clicking on the link will bring up the Wikipedia entry with a nice figure showing how the supporting hyperplanes trace out the function.
For simplicity, consider the special case when $f^*$ is differentiable. (A convex function is differentiable at all but at most a countable number of points, and even at such points, left and right derivatives still exist.) Choose an $s$. How can we find an $x$ such that the point $(x, sx - f^*(s))$ on the supporting hyperplane belongs to the graph of $f$? If we perturb $s$ by $\epsilon$, we get another supporting hyperplane $y = (s + \epsilon)x - f^*(s + \epsilon)$. This perturbed hyperplane, as we will call it, intersects with the original hyperplane at the point with $x$ coordinate given by $x = \frac{f^*(s + \epsilon) - f^*(s)}{\epsilon}$, which approaches $(f^*)'(s)$, the derivative of $f^*$ at $s$, as $\epsilon \to 0$. For positive $\epsilon$, the perturbed hyperplane will lie above the original hyperplane whenever $x$ is larger than the $x$ coordinate of the point of intersection. Similarly, if $\epsilon$ is negative, the perturbed hyperplane will lie above the original hyperplane whenever $x$ is smaller than the $x$ coordinate of the point of intersection. Therefore, the point of intersection in the limit $\epsilon \to 0$ is precisely the point which also lies on the graph of $f$. (If this is not clear, re-read the third sentence of the preceding paragraph.) Thus, the point $\big((f^*)'(s),\, s\,(f^*)'(s) - f^*(s)\big)$ lies on the graph of $f$ for all $s$ for which $f^*$ is defined and differentiable. (If $f^*$ is not differentiable at $s$, we would have a range of valid $x$ values, namely, those values lying between the left derivative and the right derivative of $f^*$ at $s$. We would therefore obtain a line segment belonging to the graph of $f$. For the moment though, this level of detail is a distraction.)
For completeness — and because something unexpected will reveal itself — let’s give a formula for the graph of $f^*$ if we are given only $f$, under the simplifying assumption that $f$ is differentiable. As hinted at earlier, it can be shown that the supporting hyperplanes are the same as the tangent lines of $f$. The tangent line of $f$ at the point $(x, f(x))$ is simply $y = f(x) + f'(x)\,(\tilde x - x)$, where $\tilde x$ is the free variable. The $y$-intercept is thus $f(x) - x f'(x)$, and hence the point $\big(f'(x),\, x f'(x) - f(x)\big)$ lies on the graph of $f^*$. Comparing this with the previous paragraph, we see excitedly that going from $f$ to the graph of $f^*$ has exactly the same form as going from $f^*$ to the graph of $f$, and indeed, it can be established rigorously that if $f^*$ is the Legendre transform of a convex function $f$, then $f$ is the Legendre transform of $f^*$. The Legendre transform is its own inverse.
We close this section by deriving the standard formula for the Legendre transform. Let $f$ be an arbitrary function. (It need not even be convex.) Recall the idea given earlier about computing the Legendre transform, namely, we draw the line $y = sx + c$ for a very negative number $c$ and gradually increase $c$ until the line first intersects the graph of $f$. That is to say, we want the smallest value (or, if the smallest value does not exist, the infimum) of $c$ for which $y = sx + c$ intersects the graph of $f$. Therefore, it is equally valid to look for the set of all values of $c$ for which $y = sx + c$ intersects the graph of $f$, then choose the infimum of this set. This set is easily found by fixing $s$ and looking in turn at each point $(x, f(x))$ on the graph of $f$ and seeing for what value of $c$ the line $y = sx + c$ passes through this point. The line passing through $(x, f(x))$ with gradient $s$ has $y$-intercept $c = f(x) - sx$. The infimum of $c$ for which $y = sx + c$ intersects the graph of $f$ is therefore $\inf_x \{f(x) - sx\}$. The Legendre transform is defined to be the negative of this, by convention; therefore, we have arrived at the mathematical definition of the Legendre transform: $f^*(s) = \sup_x \{sx - f(x)\}$.
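The formula can be checked numerically with a brute-force search. The sketch below approximates $f^*(s) = \sup_x \{sx - f(x)\}$ on a finite grid for the illustrative choice $f(x) = x^2$, whose Legendre transform is known in closed form to be $f^*(s) = s^2/4$.

```python
# f*(s) = sup_x (s*x - f(x)), approximated by a search over a grid on [-5, 5].
# For f(x) = x^2 the supremum is attained at x = s/2, giving f*(s) = s^2 / 4.

def legendre(f, s, xs):
    return max(s * x - f(x) for x in xs)

xs = [i / 1000.0 for i in range(-5000, 5001)]
f = lambda x: x * x

for s in (-2.0, 0.0, 3.0):
    print(round(legendre(f, s, xs), 3), s * s / 4)   # the two columns match
```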
The Legendre Transform in Physics
Often the Legendre transform is introduced by saying that it allows a function to be re-parametrised by its derivative, meaning precisely: given a slope $s$, first find the value of $x$ such that $f'(x) = s$, then return the value of $f$ at this point. This is written concisely as $s \mapsto f(x(s))$, where $x$ is now considered a function of $s$ via $f'(x(s)) = s$. This is useful in physics; see for example Making Sense of the Legendre Transform.
Starting from first principles, assume we are given a differentiable and strictly convex function $f$ and we are given a slope $s$, and we wish to write down an explicit expression for the function implicitly defined by the requirement that $s$ is mapped to $f(x)$ where $x$ satisfies $f'(x) = s$. With ideas from the previous section fresh in our minds, we know that we can find the point $x$ for which $f'(x) = s$ by finding the supporting hyperplane with slope $s$ for the epigraph of $f$, and seeing where the supporting hyperplane intersects the graph of $f$. If $f^*$ is the Legendre transform of $f$, then we know the hyperplane is $y = sx - f^*(s)$. Therefore, finding $f(x(s))$ is equivalent to finding the point at which $y = sx - f^*(s)$ touches the graph of $f$. The (possibly) good news is that the latter expression involves the function $f^*$ and not its derivative, but the bad news is that this expression is still an implicit one for $x(s)$.
The most important observation — and one of the reasons for including the derivations in the previous section — is that provided the derivative $(f^*)'(s)$ exists, the value $x = (f^*)'(s)$ satisfies $f'(x) = s$. (Recall the discussion about how the graph of $f$ can be recovered by thinking about what perturbed hyperplanes tell us.) Therefore, it is not the Legendre transform but its derivative which allows us to write down an explicit expression for the value of $x$ for which $f'(x) = s$. Furthermore, note from the previous section that $f(x(s))$ can be written in terms of $f^*$ alone, as $f(x(s)) = s\,(f^*)'(s) - f^*(s)$.
Although this section opened by considering how to write down an explicit expression for $x(s)$, and the Legendre transform was found to be a valuable stepping-stone, the fact of the matter is that $x(s)$ is not necessarily well-defined unless the derivative of $f$ exists and is injective, whereas the Legendre transform is always well-defined and has useful properties. Therefore, although we may have originally thought we wanted an expression for $x(s)$ as an intermediate step in achieving an ultimate objective (e.g., reformulating a physical law in a more convenient way — see Making Sense of the Legendre Transform), it is mathematically advantageous to work with the Legendre transform instead. In the “worst case” we will need to require that the Legendre transform $f^*$ of $f$ is differentiable and write $x(s)$ as $(f^*)'(s)$ (or $f(x(s))$ as $s\,(f^*)'(s) - f^*(s)$), but it may be possible that an alternative manipulation can be found for achieving the ultimate objective which uses $f^*$ directly and does not require $(f^*)'$ to exist. That is to say, there is nothing to be lost and a chance of something to be gained by working with $f^*$ rather than $x(s)$ simply because the latter can be obtained from the former in a simple way.
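A numerical check of the key identity, again for the illustrative choice $f(x) = x^2$ (so $f^*(s) = s^2/4$): a finite-difference estimate of $(f^*)'(s)$ does solve $f'(x) = s$.

```python
# For f(x) = x^2: f*(s) = s^2/4, so (f*)'(s) = s/2, and indeed
# f'(s/2) = 2*(s/2) = s -- the derivative of the transform solves f'(x) = s.

fstar = lambda s: s * s / 4.0

def deriv(h, s, eps=1e-6):           # central finite difference
    return (h(s + eps) - h(s - eps)) / (2 * eps)

fprime = lambda x: 2 * x             # derivative of f(x) = x^2

s = 3.0
x_of_s = deriv(fstar, s)             # approx. 1.5
print(abs(fprime(x_of_s) - s) < 1e-6)   # True: f'(x(s)) = s
```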
- Within convex analysis, a convex set is uniquely defined by its supporting hyperplanes, and some properties of a convex set are better elucidated by examining its supporting hyperplanes.
- A function is convex if and only if its epigraph is a convex set.
- The Legendre transform is what one would come up with if charged with the task of describing the set of all supporting hyperplanes of the epigraph of a convex function.
- The Legendre transform is its own inverse (when applied to a convex function).
- The Legendre transform is related to the concept of the envelope of a curve.
- If $f^*$ is the Legendre transform of $f$ then, if $(f^*)'(s)$ exists, the point $x = (f^*)'(s)$ has the property that $f'(x) = s$.
- The Legendre transform (precisely, its derivative) therefore allows a function to be re-parametrised by its derivative. This is often the motivation given for the usefulness of the Legendre transform.
- In trying to derive from first principles a re-parametrisation of $f$ in terms of its derivative, the Legendre transform arises naturally as an intermediate step.
The next few lectures aim to provide an introduction to several basic concepts in differential geometry required for progressing our understanding of information geometry. Rather than commence with a definition of differential geometry, the idea of “coordinate independence” will be studied in the simpler setting of affine geometry first. Roughly speaking, differential geometry combines the notion of coordinate independence with the notion of gluing simpler things together to form more complicated objects.
Let $M$ be a set. It may represent all the points on an infinitely large sheet of paper, in which case one must resist the temptation to think of $M$ as a subset of $\mathbb{R}^3$ but rather, envisage $M$ as the whole universe; there is nothing other than $M$. Alternatively, $M$ might represent the set of all elephants in the world.
Consider first the case when $M$ is the sheet of paper. In fact, assume we all live on $M$; the world is flat. In order to write down where someone lives, we need a coordinate chart. We need an injective function $\varphi \colon M \to \mathbb{R}^2$ which assigns to every point $p \in M$ a unique pair of numbers $\varphi(p)$ which we call the coordinates of the point. In order for this to be successful, everyone needs to use the same coordinate chart $\varphi$. But given just $M$, no two people are likely to choose the same chart. How could they? Just for starters, it would necessitate someone drawing a big cross on the ground and declaring that everyone must consider that point to have coordinates $(0,0)$. Extra information beyond the set $M$ itself is required if different people are to be able to construct the same coordinate chart.
Sometimes, as we will now see, there is extra information available but it is not enough to determine a unique coordinate chart. If every person had a magnetic compass and a ruler then they could agree that moving one metre east must correspond to increasing the first coordinate by one, and moving one metre to the north must correspond to increasing the second coordinate by one. Two people’s charts will still differ in general, but only in the choice of origin. Although people would not be able to communicate where they live in absolute terms – saying I live at such-and-such coordinates is no good to anyone else with a different coordinate chart – there is still a wealth of information that can be communicated. Telling you the difference in coordinates between my house and your house is enough for you to find your way to my house; although it is likely our coordinate charts differ, the same answer is obtained no matter which chart is used. This is called coordinate independence.
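A minimal sketch of this situation, with hypothetical house coordinates chosen purely for illustration: two charts that agree on units and directions but differ in their choice of origin disagree on absolute coordinates yet agree on coordinate differences.

```python
# Two charts of the flat world that agree on "east" and "north" units but
# differ in their choice of origin: phi_you(x) = phi_me(x) + shift.
# Absolute coordinates disagree, but coordinate *differences* agree --
# a coordinate-independent quantity.
shift = (13.0, -4.0)                      # arbitrary difference of origins

def phi_me(x):
    return x                              # my chart: taken as the reference

def phi_you(x):
    return (x[0] + shift[0], x[1] + shift[1])   # your chart: shifted origin

my_house, your_house = (5.0, 3.0), (1.0, 1.0)   # illustrative points of M

diff_me = tuple(a - b for a, b in zip(phi_me(my_house), phi_me(your_house)))
diff_you = tuple(a - b for a, b in zip(phi_you(my_house), phi_you(your_house)))
```

Both charts report the same difference, even though they report different absolute coordinates for each house.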
The more possibilities there are for the charts, the fewer the number of coordinate independent properties. For example, if now people’s rulers are confiscated and they only have magnetic compasses, people’s coordinate charts can differ from each other’s in more ways. Saying I live 5 units east of your home will no longer work; my 5 units east will almost surely differ from your 5 units east. We could however, still agree on whether a collection of trees lies in a straight line or not.
Precisely, every person may decide to define that a collection of trees lies in a straight line if, under their personal coordinate chart $\varphi$, the images of the trees lie in a straight line in $\mathbb{R}^2$. This definition works because even though two people’s coordinate charts may be different, their definitions of lying in a straight line turn out to be the same. We will see presently that this can be understood in terms of a simple concept called transition functions. Note that earlier we were implicitly thinking in terms of definitions too; we defined the location of your house relative to my house to be the difference of the two houses’ coordinates, and it was a useful definition whenever it was coordinate independent, as it was when we had both a compass and a ruler but not when we had a compass alone.
If $M$ represents a set of elephants then I might choose a coordinate chart $\varphi \colon M \to \mathbb{R}$ by defining $\varphi(e)$ to be the length of the trunk of elephant $e$. Tom might define his chart $\psi$ by measuring the length of the tail. We would not agree on the absolute size of an elephant but if the lengths of an elephant’s trunk and its tail are always in a fixed ratio then we would agree on what it means for one elephant to be twice as big as another elephant.
Let’s formalise the above mathematically. The set $\mathbb{R}^n$ can be given a lot of extra structure. We are used to thinking of it as a vector space – we know how to add two points together in a sensible and consistent way – and we commonly introduce a norm for measuring distance and sometimes even an inner product for measuring angles. If $\varphi \colon M \to \mathbb{R}^n$ is a bijection then any structure we have on $\mathbb{R}^n$ can be transferred to $M$. We can make $M$ a vector space simply by defining $p + q = \varphi^{-1}(\varphi(p) + \varphi(q))$ and $\alpha \cdot p = \varphi^{-1}(\alpha\,\varphi(p))$, for instance.
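As a toy illustration (the particular choice of $M$ and $\varphi$ here is mine), take $M$ to be the positive reals and $\varphi = \log \colon M \to \mathbb{R}$; the vector space structure transferred to $M$ through this bijection turns addition into multiplication.

```python
import math

# Transfer the vector space structure of R to M = (0, infinity) through the
# bijection phi = log.  The induced operations on M are
#   p (+) q     = phi^{-1}( phi(p) + phi(q) ) = p * q
#   alpha (.) p = phi^{-1}( alpha * phi(p) )  = p ** alpha
phi = math.log
phi_inv = math.exp

def add_M(p, q):
    # induced "addition" on M
    return phi_inv(phi(p) + phi(q))

def scale_M(alpha, p):
    # induced "scalar multiplication" on M
    return phi_inv(alpha * phi(p))

s = add_M(2.0, 3.0)      # ordinary multiplication: 6
d = scale_M(2.0, 3.0)    # squaring: 9
```

The point is that any single bijection transfers structure perfectly well; the difficulty only appears when many charts are in play at once.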
Let $\mathcal{A}$ be a set of bijective functions of the form $\varphi \colon M \to \mathbb{R}^n$. Each element of $\mathcal{A}$ represents a valid coordinate chart, or, the way we introduced it earlier, each person uses their own coordinate chart and $\mathcal{A}$ is the set of all these coordinate charts. Unless $\mathcal{A}$ contains only a single coordinate chart, we can no longer transfer arbitrary structures from $\mathbb{R}^n$ to $M$ in a coordinate independent way; we saw examples of this earlier. What structures can be transferred?
A bit of thought reveals that the key is to study the transition functions $\varphi \circ \psi^{-1}$ for all pairs $\varphi, \psi \in \mathcal{A}$. Observe that $\varphi \circ \psi^{-1}$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$. We can therefore use the structure on $\mathbb{R}^n$ to determine what properties $\varphi \circ \psi^{-1}$ has; for example, it might be that $\mathcal{A}$ is such that $\varphi \circ \psi^{-1}$ is always a linear function; linearity is a property which can be defined in terms of the vector space structure on $\mathbb{R}^n$.
Recalling the earlier examples, when people had magnetic compasses and rulers, the transition functions $\varphi \circ \psi^{-1}$ would always have the form $x \mapsto x + c$ for some vector $c$. (Changing charts would cause $c$ to change; indeed, $c$ represents the difference in the choice of origin of the two charts.) When people only had magnetic compasses, $\varphi \circ \psi^{-1}$ would be of the more general form $x \mapsto Dx + c$ for some positive diagonal matrix $D$. (Here, I have assumed that each person would build their own ruler, so everyone has rulers, they are just of different lengths.)
Linking in with previous lectures, the set $M$ can be made into an affine space by introducing a collection of coordinate charts $\mathcal{A}$ such that for any two charts $\varphi, \psi \in \mathcal{A}$, their transition function $\varphi \circ \psi^{-1}$ always has the form $x \mapsto Ax + b$ for some matrix $A$ and vector $b$. Because the image of a straight line under such a transition function remains a straight line, different people with different coordinate charts will still agree on what is and what is not a straight line in $M$. It is a worthwhile exercise to prove that this definition of an affine space is equivalent to the definition given in earlier lectures.
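A quick numerical check of this invariance, with an arbitrarily chosen affine transition function: collinearity of three points is preserved under $x \mapsto Ax + b$.

```python
import numpy as np

def collinear(p, q, r, tol=1e-9):
    # Three points are collinear iff the parallelogram spanned by the edge
    # vectors q - p and r - p has zero area (zero 2x2 determinant).
    return abs(np.linalg.det(np.column_stack([q - p, r - p]))) < tol

# Three collinear "trees" in my chart (illustrative coordinates).
p, q, r = np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.5, 5.0])

# An affine transition function x -> A x + b (A invertible, chosen arbitrarily).
A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([-4.0, 7.0])
transition = lambda x: A @ x + b

in_my_chart = collinear(p, q, r)
in_your_chart = collinear(transition(p), transition(q), transition(r))
```

The translation part $b$ cancels in the edge vectors, and the linear part $A$ scales the determinant by $\det A$, so zero area stays zero area; this is the computational content of the claim.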
To summarise, there is interest in playing the following mathematical game:
- We are given a set $M$ and a collection $\mathcal{A}$ of coordinate charts $\varphi \colon M \to \mathbb{R}^n$.
- We want to give the set $M$ some structure coming from the structure on $\mathbb{R}^n$.
- We want to do this in a coordinate independent way, meaning that if I use my own coordinate chart $\varphi \in \mathcal{A}$ and you use your own coordinate chart $\psi \in \mathcal{A}$ then we get the same structure on $M$.
The secret is to look at the form of the transition functions $\varphi \circ \psi^{-1}$ for all pairs $\varphi, \psi \in \mathcal{A}$. The more general the form of the transition functions, the less structure can be transferred from $\mathbb{R}^n$ to $M$ in a coordinate independent way.
The relevance to information geometry is that the parametrisation used to describe a family of densities is, to a large extent, irrelevant. Properties that depend on a particular choice of parametrisation are generally not as attractive as properties which are coordinate independent. If $M$ represents the family of Gaussian random variables parametrised by mean and variance then there is little justification in calling a given subfamily a line segment because there does not appear to be anything special about the parametrisation of Gaussians by mean and variance. (It turns out that for exponential families, statisticians have come up with a set of parametrisations they believe to be nice. Although these parametrisations are not unique, their transition functions are always affine functions; this is why it was possible to introduce an affine structure in earlier lectures. Note that we have not pursued this to the end because we want to move quickly to a more powerful concept coming from differential geometry which will subsume this affine geometry.)
For completeness, note that a parametrisation is just the inverse of a coordinate chart. If we think of defining a family by specifying a function from parameter space to the space of probability densities then we speak of a parametrisation. On the other hand, if someone points to a density $p$ then we can determine its coordinates by asking what value of $\theta$ makes $p(\cdot\,;\theta) = p$.
In Lecture 3, a local test for determining if a family was exponential was introduced. The last part of the test involved seeing if the second-order derivatives $\frac{\partial^2 \log p(x;\theta)}{\partial \theta_i \partial \theta_j}$ could be written as linear combinations of the first-order derivatives $\frac{\partial \log p(x;\theta)}{\partial \theta_i}$. As will be shown in a subsequent lecture, such calculations are sometimes easier to do if an inner product is introduced; we are therefore motivated to define an inner product on the space spanned by the first-order derivatives $\frac{\partial \log p(x;\theta)}{\partial \theta_i}$.
Alternatively, we may simply be motivated to introduce some geometry into the family $\{p(x;\theta)\}$. Precisely, pick a particular density $p(x;\theta_0)$ from the family and consider two curves $\theta_1(t)$ and $\theta_2(t)$ passing through $\theta_0$ at time zero. By this is meant that $\theta_1$ and $\theta_2$ are two functions from $\mathbb{R}$ to parameter space, so that as $t$ varies, $p(x;\theta_1(t))$ and $p(x;\theta_2(t))$ trace out two curves in the space of probability densities, and it is required that the curves intersect at $p(x;\theta_0)$ when $t = 0$, namely $\theta_1(0) = \theta_2(0) = \theta_0$.
If the space of probability densities had a “geometry” then we should be able to say at what angle any two curves intersect. Mathematically, we wish to compute the inner product of the “velocity vector” of $t \mapsto p(x;\theta_1(t))$ at $t = 0$ with the “velocity vector” of $t \mapsto p(x;\theta_2(t))$ at $t = 0$.
We are inclined to work not with $p(x;\theta(t))$ directly but with $\log p(x;\theta(t))$; the mapping $p \mapsto \log p$ is bijective so nothing is lost or gained by doing this, it is simply the case that we know from previous work that we can interpret log-likelihoods as elements of a vector space (albeit a vector space which has forgotten where it placed its origin) and therefore we are comfortable to differentiate $\log p(x;\theta(t))$ with respect to $t$. (See too the previous lecture on the Fisher Information Matrix.)
An Inner Product
For any two curves $\theta_1(t)$ and $\theta_2(t)$ intersecting at time $t = 0$ we wish to define their inner product $\langle \dot\theta_1(0), \dot\theta_2(0) \rangle$. Since $\frac{d}{dt} \log p(x;\theta(t))$ is a linear function of $\dot\theta(t)$, it suffices to write down the rule for computing the inner product in terms of $\dot\theta_1(0)$ and $\dot\theta_2(0)$. It is desirable to do this because $\dot\theta(0)$ is a finite-dimensional vector whereas $\frac{d}{dt} \log p(x;\theta(t))\big|_{t=0}$, being a function of $x$, is infinite-dimensional.
The catch though is that if we work with $\dot\theta$ then we must ensure that we get the same answer for the inner product even if we change to a new parametrisation of the family. Precisely, suppose $\tilde p(x;\eta)$ represents the same family as $p(x;\theta)$ but with respect to a different parametrisation $\eta$. (We assume there is a one-to-one correspondence between $\theta$ and $\eta$; given any $\theta$ there is an $\eta$ such that $\tilde p(x;\eta) = p(x;\theta)$, and given any $\eta$ there is a $\theta$ such that $p(x;\theta) = \tilde p(x;\eta)$ too.) If $\eta_1(t)$ and $\eta_2(t)$ are such that $\tilde p(x;\eta_1(t)) = p(x;\theta_1(t))$ and $\tilde p(x;\eta_2(t)) = p(x;\theta_2(t))$ then we must obtain the same answer for the inner product computed from $\dot\eta_1(0)$ and $\dot\eta_2(0)$ as we do for the one computed from $\dot\theta_1(0)$ and $\dot\theta_2(0)$.
It so happens that the Fisher Information Matrix can be used to define an inner product in a coordinate-independent way, meaning the same answer will be obtained regardless of how the family is parametrised. Before considering why we use the Fisher Information Matrix, let’s just see first how it can be used to do this.
Note that on a finite-dimensional vector space, every inner product is of the form $\langle u, v \rangle = u^T Q v$ for some positive-definite symmetric matrix $Q$. The matrix $Q$ determines the inner product and the idea we will experiment with is using the Fisher Information matrix $G(\theta)$ as the matrix $Q$. To wit, we define:
$$\langle \dot\theta_1(0), \dot\theta_2(0) \rangle = \dot\theta_1(0)^T\, G(\theta_0)\, \dot\theta_2(0),$$
where $\theta_0 = \theta_1(0) = \theta_2(0)$ is the point of intersection of the two curves at time $t = 0$.
From its form, and assuming the Fisher Information matrix is positive-definite (as usual, technical assumptions are being omitted in order to focus attention on the higher-level details), it is clear that we have defined an inner product; the axioms of an inner product are satisfied. What we need to check is whether or not we get the same answer if we change the parametrisation of our family.
First, the way the Fisher Information matrix changes when we change parametrisations must be determined. From the aforementioned correspondence between $\theta$ and $\eta$ we can regard one as a function of the other and write: $\tilde p(x;\eta) = p(x;\theta(\eta))$. Differentiating this yields: $\frac{\partial \log \tilde p(x;\eta)}{\partial \eta_j} = \sum_i \frac{\partial \log p(x;\theta)}{\partial \theta_i} \frac{\partial \theta_i}{\partial \eta_j}$. It follows almost immediately that:
$$\tilde G(\eta) = J^T\, G(\theta(\eta))\, J,$$
where the $ij$th entry of $J$ is $J_{ij} = \frac{\partial \theta_i}{\partial \eta_j}$.
(As discussed in class, the way to understand this is to consider the special case of $\theta$ being a linear function of $\eta$, writing down the relationship between the squared-error of estimating $\theta$ and the squared-error of estimating $\eta$, and recalling that the Fisher Information matrix is essentially the inverse of the (asymptotic) squared-error. That the Fisher Information matrix determines the asymptotic performance explains why higher-order derivatives of $\theta$ with respect to $\eta$ do not appear in the above formula.)
We may write $\theta_i(t) = \theta(\eta_i(t))$, $i = 1, 2$, to signify the relationship between $\theta_i$ and $\eta_i$; they both trace out the same curve of probability densities, just with respect to different coordinates. Therefore, $\dot\theta_i(0) = J\, \dot\eta_i(0)$.
Collating the above results shows that an inner product defined with respect to the Fisher Information matrix is indeed coordinate-independent:
$$\dot\eta_1(0)^T\, \tilde G(\eta_0)\, \dot\eta_2(0) = \dot\eta_1(0)^T J^T\, G(\theta_0)\, J\, \dot\eta_2(0) = \dot\theta_1(0)^T\, G(\theta_0)\, \dot\theta_2(0).$$
The reader is urged to think carefully about what has actually been done. We have come close to putting an inner product on the infinite-dimensional space of all probability densities. Given any finite-dimensional family we can define an inner product using the Fisher Information matrix in a consistent way which depends only on the densities themselves and not on their parametrisations. (There is a small lie in the last sentence; arbitrary parametrisations are not permitted but rather, any two parametrisations we consider need to be “reasonably nice” with respect to each other, such as the mapping from one to the other being continuously differentiable. Such finer points will be elaborated on later in the course.)
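To make the coordinate-independence concrete, here is a small numerical sketch (the reparametrisation $\theta = \eta^3 + \eta$ is an arbitrary choice of mine) using the one-parameter family $N(\theta, 1)$, whose Fisher information is the constant $1$: the Jacobian factor enters the metric squared and each velocity once, and the factors cancel in the inner product.

```python
# Family N(theta, 1): Fisher information G(theta) = 1 (a 1x1 "matrix").
# Reparametrise via theta = eta**3 + eta, a smooth bijection chosen purely
# for illustration.  The metric transforms as G_eta = J^2 * G_theta and
# velocities as theta_dot = J * eta_dot, where J = d(theta)/d(eta).
G_theta = 1.0

eta0 = 0.7                          # point of intersection, in the eta chart
J = 3 * eta0**2 + 1                 # Jacobian d(theta)/d(eta) at eta0

G_eta = J**2 * G_theta              # Fisher information in the eta chart

eta_dot1, eta_dot2 = 2.0, -0.5      # curve velocities in the eta chart
theta_dot1, theta_dot2 = J * eta_dot1, J * eta_dot2  # same velocities, theta chart

ip_theta = theta_dot1 * G_theta * theta_dot2
ip_eta = eta_dot1 * G_eta * eta_dot2
```

Both charts assign the same inner product to the same pair of curves, which is exactly the calculation displayed above in matrix form.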
We have not endeavoured to justify why we want to use the Fisher Information matrix as an inner product; we have just demonstrated that because it transforms in the right way, we can use it to define an inner product if we so choose. It is appealing because it is coordinate-independent; it doesn’t matter how we parametrise our family, we get the same inner product being defined. This doesn’t automatically mean that it is a useful or sensible inner product though.
It turns out that it is both useful and sensible. That it is sensible comes from a deep property of being invariant with respect to sufficient statistics (this will be explained later in the course), and indeed, the Fisher metric (as we shall henceforth refer to the inner product coming from the Fisher Information matrix) is the essentially unique metric with this highly desirable property. That it is useful will be seen later when we gain a better understanding of what geometrical results become available by having an inner product structure.
As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.
Definition and Example
Associated with a family of probability densities $p(x;\theta)$, where $x$ is the observation variable and $\theta \in \mathbb{R}^n$ is the parameter vector, is a matrix-valued function $\theta \mapsto G(\theta)$ whose value at any point is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.
Precisely, the Fisher Information at $\theta$ is defined to be the $n \times n$ matrix $G(\theta)$ whose $ij$th entry equals $E\left[ \frac{\partial \log p(x;\theta)}{\partial \theta_i} \frac{\partial \log p(x;\theta)}{\partial \theta_j} \right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons; it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\theta$ but known variance $\sigma^2$.
Let $p(x;\theta)$ be the density of a $N(\theta, \sigma^2)$ random variable. The log-likelihood is therefore $\log p(x;\theta) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\theta)^2}{2\sigma^2}$. Differentiating with respect to $\theta$ yields $\frac{\partial \log p(x;\theta)}{\partial \theta} = \frac{x-\theta}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p(x;\theta)$. Precisely, $E[g(x)]$ is defined to be $\int g(x)\, p(x;\theta)\, dx$. Therefore the Fisher Information is:
$$G(\theta) = \int \left( \frac{x-\theta}{\sigma^2} \right)^2 p(x;\theta)\, dx = \frac{1}{\sigma^2}.$$
That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\theta$ and variance $\sigma^2$, the expectation of $(x-\theta)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\left(\frac{x-\theta}{\sigma^2}\right)^2$ is $\frac{1}{\sigma^2}$. Stated formally:
$$G(\theta) = E\left[ \left( \frac{x-\theta}{\sigma^2} \right)^2 \right] = \frac{E[(x-\theta)^2]}{\sigma^4} = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}.$$
Had we considered a family of multivariate Gaussian random variables with unknown mean $\theta$ but known covariance matrix $\Sigma$, that is, $x \sim N(\theta, \Sigma)$, then the Fisher Information Matrix would have been
$$G(\theta) = \Sigma^{-1}.$$
It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\theta$. As will be seen presently, it is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the accuracy with which we can expect to be able to estimate $\theta$ given an observation $x$. As the variance decreases, the amount of information increases.
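The result $G(\theta) = 1/\sigma^2$ for the scalar Gaussian family can be checked by Monte Carlo simulation, a sketch of which follows (the parameter values, sample size and seed are arbitrary choices of mine):

```python
import numpy as np

# Monte Carlo check that the Fisher information of N(theta, sigma^2) with
# respect to the mean theta equals 1 / sigma^2.
rng = np.random.default_rng(0)
theta, sigma = 1.5, 2.0
x = rng.normal(theta, sigma, size=1_000_000)

score = (x - theta) / sigma**2       # d/dtheta of log p(x; theta)
fisher_mc = np.mean(score**2)        # sample estimate of E[score^2]
fisher_exact = 1 / sigma**2          # = 0.25
```

With a million samples the Monte Carlo estimate agrees with $1/\sigma^2$ to a few decimal places.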
Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the fact that the definition of the Fisher Information Matrix requires the log-likelihood function to be differentiated means that the space of log-likelihoods was implicitly being treated as a vector space.
Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p(x;\theta)$. This sequence of independent and identically distributed random variables will be denoted by $x_1, x_2, x_3, \dots$. Given the first $n$ random variables $x_1, \dots, x_n$, can we guess what $\theta$ is, and how accurate is our guess likely to be?
For finite $n$, there is almost never a single best method for estimating $\theta$; follow this link for the reason why. However, it is sensible to ask if there is an estimation rule which works well as $n \to \infty$.
Observe that the density of $(x_1, \dots, x_n)$ is just a product of densities: $p(x_1, \dots, x_n; \theta) = \prod_{i=1}^n p(x_i;\theta)$. It follows that the Fisher Information Matrix for $(x_1, \dots, x_n)$ is simply $n$ times the Fisher Information Matrix for $x_1$. [Indeed, one of the reasons why it is called "information" is because it is additive.]
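This additivity can likewise be verified numerically: the score of the joint density of $n$ independent samples is the sum of the individual scores, so its second moment comes out $n$ times larger. A sketch, with arbitrary parameter values:

```python
import numpy as np

# Monte Carlo check of additivity: the Fisher information of n i.i.d.
# observations from N(theta, sigma^2) is n times that of a single one.
rng = np.random.default_rng(2)
theta, sigma, n = 0.0, 1.0, 5

x = rng.normal(theta, sigma, size=(500_000, n))
# The score of the joint density is the sum of the individual scores.
joint_score = ((x - theta) / sigma**2).sum(axis=1)
fisher_joint_mc = np.mean(joint_score**2)   # should be close to n / sigma^2
fisher_single = 1 / sigma**2
```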
The maximum-likelihood estimate of $\theta$ given the single observation $x_1$ is the value of $\theta$ which maximises $p(x_1;\theta)$. (Here, note that the observed value of $x_1$ is substituted into the expression for $p(x_1;\theta)$, leaving just a function of $\theta$; it is this function which is maximised.) Let $\hat\theta_n$ denote the maximum-likelihood estimate of $\theta$ based on the observations $x_1, \dots, x_n$. (Whether we have $n$ observations or a single observation of dimension $n$ is one and the same thing; the maximum likelihood estimate is the “most likely” value of $\theta$ given by maximising $\prod_{i=1}^n p(x_i;\theta)$.) Under reasonably mild regularity conditions, it turns out that:
$$\sqrt{n}\, (\hat\theta_n - \theta) \to N\left(0,\, G(\theta)^{-1}\right) \quad \text{in distribution.}$$
That is to say, the maximum-likelihood estimator is asymptotically unbiased and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac{1}{n} G(\theta)^{-1}$.
Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of samples goes to infinity.
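The following simulation sketch illustrates this for the scalar Gaussian family, where the maximum-likelihood estimate of the mean is the sample mean; its variance over many trials should be close to $(n\,G(\theta))^{-1} = \sigma^2/n$. (All numerical choices here are mine, for illustration.)

```python
import numpy as np

# For x_i ~ N(theta, sigma^2) with sigma known, the MLE of theta is the
# sample mean, and its variance over repeated experiments should be close
# to the inverse Fisher information of n samples:
#   Var(theta_hat) ~ (n * G(theta))^{-1} = sigma^2 / n.
rng = np.random.default_rng(1)
theta, sigma, n, trials = 0.3, 2.0, 100, 20_000

samples = rng.normal(theta, sigma, size=(trials, n))
theta_hat = samples.mean(axis=1)        # the MLE, one per trial

var_mc = theta_hat.var()                # empirical variance of the MLE
var_predicted = sigma**2 / n            # = 0.04
```

The empirical variance matches the Fisher-information prediction closely, and the empirical mean of the estimates is close to the true $\theta$, illustrating (asymptotic) unbiasedness.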
There is more to the story. The Cramer-Rao Bound states that if $\hat\theta$ is an unbiased estimator of $\theta$ then its performance must be lower-bounded by the inverse of the Fisher Information Matrix: $\mathrm{Cov}(\hat\theta) \geq G(\theta)^{-1}$ always holds (for an unbiased estimator), where the inequality means the difference is positive semi-definite. Therefore, no other estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramer-Rao Bound.)
Extensive literature is available on the Fisher Information Matrix, including on Wikipedia. I therefore curtail my remarks to the above.