## Advances in Mathematics

Advances in mathematics occur in one of two ways.

The first occurs by the solution of some outstanding problem, such as the Bieberbach conjecture or Fermat’s conjecture. Such solutions are justly acclaimed by the mathematical community. The solution of every famous mathematical problem is the result of joint effort of a great many mathematicians. It always comes as an unexpected application of theories that were previously developed without a specific purpose, theories whose effectiveness was at first thought to be highly questionable.

Mathematicians realized long ago that it is hopeless to get the lay public to understand the miracle of unexpected effectiveness of theory. The public, misled by two hundred years of Romantic fantasies, clamors for some “genius” whose brain power cracks open the secrets of nature. It is therefore a common public relations gimmick to give the entire credit for the solution of famous problems to the one mathematician who is responsible for the last step.

It would probably be counterproductive to let it be known that behind every “genius” there lurks a beehive of research mathematicians who gradually built up to the “final” step in seemingly pointless research papers. And it would be fatal to let it be known that the showcase problems of mathematics are of little or no interest for the progress of mathematics. We all know that they are dead ends, curiosities, good only as confirmation of the effectiveness of theory. What mathematicians privately celebrate when one of their showcase problems is solved is Polya’s adage: “no problem is ever solved directly.”

There is a second way by which mathematics advances, one that mathematicians are also reluctant to publicize. It happens whenever some commonsense notion that had heretofore been taken for granted is discovered to be wanting, to need clarification or definition. Such foundational advances produce substantial dividends, but not right away. The usual accusation that is leveled against mathematicians who dare propose overhauls of the obvious is that of being “too abstract.” As if one piece of mathematics could be “more abstract” than another, except in the eyes of the beholder (it is time to raise a cry of alarm against the misuse of the word “abstract,” which has become as meaningless as the word “Platonism.”)

An amusing case history of an advance of the second kind is uniform convergence, which first made headway in the latter quarter of the nineteenth century. The late Herbert Busemann told me that while he was a student, his analysis teachers admitted their inability to visualize uniform convergence, and viewed it as the outermost limit of abstraction. It took a few more generations to get uniform convergence taught in undergraduate classes.

The hostility against groups, when groups were first “abstracted” from the earlier “group of permutations” is another case in point. Hadamard admitted to being unable to visualize groups except as groups of permutations. In the thirties, when groups made their first inroad into physics via quantum mechanics, a staunch sect of reactionary physicists, repeatedly cried “Victory!” after convincing themselves of having finally rid physics of the “Gruppenpest.” Later, they tried to have this episode erased from the history of physics.

In our time, we have witnessed at least two displays of hostility against new mathematical ideas. The first was directed against lattice theory, and its virulence all but succeeded in wiping lattice theory off the mathematical map. The second, still going on, is directed against the theory of categories. Grothendieck did much to show the simplifying power of categories in mathematics. Categories have broadened our view all the way to the solution of the Weil conjectures. Today, after the advent of braided categories and quantum groups, categories are beginning to look downright concrete, and the last remaining anticategorical reactionaries are beginning to look downright pathetic.

There is a common pattern to advances in mathematics of the second kind. They inevitably begin when someone points out that items that were formerly thought to be “the same” are not really “the same,” while the opposition claims that “it does not matter,” or “these are piddling distinctions.” Take the notion of species that is the subject of this book. The distinction between “labeled graphs” and “unlabeled graphs” has long been familiar. Everyone agrees on the definition of an unlabeled graph, but until a while ago the notion of labeled graph was taken as obvious and not in need of clarification. If you objected that a graph whose vertices are labeled by cyclic permutations – nowadays called a “fat graph” – is not the same thing as a graph whose vertices are labeled by integers, you were given a strange look and you would not be invited to the next combinatorics meeting.

Excerpt from the Forward by Gian-Carlo Rota (1997) to the book “Combinatorial Species and Tree-like Structures” by F. Bergeron et al.

## “Most Likely” is an All or Nothing Proposition

The principle of maximum likelihood estimation is generally not explained well; readers are made to believe that it should be obvious to them that choosing the “most likely outcome” is the most sensible thing to do. It isn’t obvious, and it need not be the most sensible thing to do.

First, recall the statement I made in an earlier paper:

The author believes firmly that asking for an estimate of a parameter is,

a priori, a meaningless question. It has been given meaning by force of habit. An estimate only becomes useful once it is used to make a decision, serving as a proxy for the unknown true parameter value. Decisions include: the action taken by a pilot in response to estimates from the flight computer; an automated control action in response to feedback; and, what someone decides they hear over a mobile phone (with the pertinent question being whether the estimate produced by the phone of the transmitted message is intelligible).Without knowing the decision to be made, whether an estimator is good or bad is unanswerable.One could hope for an estimator that works well for a large class of decisions, and the author sees this as the context of estimation theory.

Consider the following problem. Assume two coins are tossed, but somehow the outcome of the first coin influences the outcome of the second coin. Specifically, the possible outcomes (H = heads, T = tails) and their probabilities are: HH ; HT ; TH ; TT . Given these probabilities, what is our best guess as to the outcome? We have been conditioned to respond by saying that the *most likely* outcome is the one with the highest probability, namely, HH. What is our best guess as to the outcome of the first coin only? Well, there is chance it will be H and chance it will be T, so the most likely outcome is T. How can it be that the most likely outcome of the first coin is T but the most likely outcome of both coins is HH?

The (only) way to understand this sensibly is to think in terms of how the estimate will be used. What “most likely” really means is that it is the best strategy to use when placing an all-or-nothing bet. If I must bet on the outcome of the two coins, and I win $1 if I guess correctly and win nothing otherwise, my best strategy is to bet on HH. If I must bet on the outcome of the first coin, the best strategy is to bet on T. This is not a contradiction because betting on the first coin being T is the same as betting on the two coins being either TH or TT. I can now win in two cases, not just one; it is a different gamble.

The above is not an idle example. In communications, the receiver must estimate what symbols were sent. A typical mathematical formulation of the problem is estimating the state of a hidden Markov chain. One can choose to estimate the most likely sequence of states or the most likely state at a particular instance. The above example explains the difference and helps determine which is the more appropriate estimate to use.

Finally, it is noted that an all-or-nothing bet is not necessarily the most appropriate way of measuring the performance of an estimator. For instance, partial credit might be given for being close to the answer, so if I guess two coins correctly I win $2, if I guess one coin correctly I win $1, otherwise I win nothing. This can be interpreted as “regularising” the maximum likelihood estimate. Nevertheless, at the end of the day, the only way to understand an estimator is in the broader context of the types of decisions that can be made well by using that estimator.

## Background Information for Continuous-time Filtering and Estimation on Manifolds

The preprint A Primer on Stochastic Differential Geometry in Signal Processing discusses, among other things, the following in simple but rigorous terms:

- How Brownian motion can be generated on Riemannian manifolds;
- How “coloured” (technically, left-invariant) Brownian motion can be generated on Lie groups;
- Ito and Stratonovich integrals, and the transfer principle of Stratonovich integrals making them convenient to use for stochastic differential equations on manifolds;
- The special orthogonal groups SO(n);
- How a “Gaussian random variable” can be generated on a Riemannian manifold;
- How state-space models extend to manifolds;
- How stochastic development provides a convenient framework for understanding stochastic processes on manifolds;
- Whether or not stochastic integrals are “pathwise” computable.

The last section of the paper includes the following:

Several concepts normally taken for granted, such as unbiasedness of an estimator, are not geometric concepts and hence raise the question of their correct generalisations to manifolds. The answer is that the difficulty lies not with manifolds, but with the absence of meaning to ask for an estimate of a parameter. The author believes firmly that asking for an estimate of a parameter is, *a priori*, a meaningless question. It has been given meaning by force of habit. An estimate only becomes useful once it is used to make a decision, serving as a proxy for the unknown true parameter value. Decisions include: the action taken by a pilot in response to estimates from the flight computer; an automated control action in response to feedback; and, what someone decides they hear over a mobile phone (with the pertinent question being whether the estimate produced by the phone of the transmitted message is intelligible). *Without knowing the decision to be made, whether an estimator is good or bad is unanswerable.* One could hope for an estimator that works well for a large class of decisions, and the author sees this as the context of estimation theory.

## Optimisation Geometry

In an invited book chapter (downloadable from arXiv), I made a first attempt at understanding how the *geometry* of a family of cost functions influences the *computational complexity* of the resulting optimisation problem. Importantly, real-time optimisation problems were studied rather than classical “once-off” optimisation problems.

Real-time optimisation problems differ from classical optimisation problems in that the class of cost functions is known beforehand and (considerable) time can be expended beforehand studying this class prior to developing a tailor-made algorithm for solving the particular real-time optimisation problem at hand. Real-time optimisation problems deserve closer attention because there is no reason for classical optimisation methods to perform particularly well for real-time problems.

In addition to demonstrating how an algorithm with guaranteed performance can, in principle, be constructed for any real-time optimisation problem, a geometric framework was given which is hoped will yield, in future work, insight into the computational complexity of real-time optimisation problems.

An embryonic concept is that overall complexity divides into intrinsic complexity and extrinsic complexity. The intrinsic complexity is the unavoidable complexity of the real-time optimisation problem, the best that can be done with infinite resources allocated to simplifying the problem beforehand. The extrinsic complexity is the additional complexity coming from how the optimisation problem is posed; for example, if a quadratic cost function is composed with a complicated diffeomorphism then the resulting optimisation problem is “difficult” whereas the underlying optimisation problem, that of minimising a quadratic function, is “easy”. (This distinction makes less sense for “once-off” optimisation because there is no opportunity to determine beforehand, “free of charge”, whether or not the original problem can be simplified by a suitable change of coordinates.) The coordinate-independent nature of geometry suggests differential topology/geometry is an appropriate tool to be using in this investigation.

## Introduction to the Grassmann Algebra and Exterior Products

Sadly, Grassmann’s mathematical work was not appreciated during his lifetime. Among other things, he introduced what is now called the Grassmann algebra. It appears that Grassmann did this in part by looking for all possible ways a product structure could be introduced. Although there is strong geometric intuition behind the Grassmann algebra, it is not necessarily straightforward to grasp quickly this intuition from current introductory texts. For example, if the Grassmann algebra is about lengths, areas and volumes of parallelotopes, why can and be added together to form a new vector when in general the length of will not be the sum of the lengths of and ?

In my mind, the key point to keep in mind, and which I have not seen written down elsewhere, is that in the context of Grassmann algebras, lower-dimensional parallelotopes should be considered merely as building blocks for higher-dimensional parallelotopes; some background is required before getting to this point though.

Stepping back, this note endeavours to re-invent the Grassmann algebra in an intuitive way, motivating the operations of addition and multiplication. The point of departure is the desire to measure the relative volume of a -dimensional oriented parallelotope in a vector space of dimension . Let us initially denote an oriented parallelotope by the ordered set of vectors that form the sides of the parallelotope. (See the wiki for a picture of the three-dimensional case.) Here, “oriented” just means that the sides of the parallelotope are ordered. In hindsight, it becomes clear that it is simpler to work with oriented parallelotopes than non-oriented ones; a (multi-)linear theory can be developed for the former. (Perhaps better motivation would come from considering how to define integration on a manifold, but I am endeavouring here to introduce Grassmann algebras without mention of forms from differential geometry.)

Given a metric on , the volume of the parallelotope can be computed by choosing an orthonormal basis for and computing the determinant of the matrix whose columns are the vectors expressed as linear combinations of the basis vectors; put simply, if we assume is and we use the Euclidean inner product then is the matrix whose th column is . Note that negative volumes are permissible, a consequence of working with oriented parallelotopes. For brevity, parallelotopes will mean oriented parallelotopes and volumes will mean signed volumes.

If we don’t have a metric — or, precisely, we want to state results that are true regardless of which metric is being used — we can still make sense of one parallelotope being twice as big as another one, at least in certain situations. For example, the parallelotope is twice as big as because, no matter how we choose the metric, the volume of the former really will be twice that of the latter. A key question to ask is: if and are two parallelotopes, will the ratio of their volumes be independent of the metric chosen?

Since we have limited ourselves to -dimensional parallelotopes in -dimensional space, it turns out that the answer is in the affirmative. A proof can be based on the equality by choosing the matrix to be the change of orthonormal basis induced by a change of metric.

If we decide that two (oriented) parallelotopes are equivalent whenever their (signed) volume is the same regardless of the metric chosen then it turns out that we can form a vector space structure on the set of all -dimensional parallelotopes *up to equivalence* in a given -dimensional vector space . Note that we are working with a quotient space structure; although we use the notation to represent an element of , different representations may correspond to the same element. (Precisely, we have a projection taking vectors and returning the corresponding element of , where if and only if the signed volume of equals the signed volume of regardless of the metric chosen.) We choose to define scalar multiplication in by . (Note that the could have multiplied any one of the because elements of are only distinguished up to differences in volume.) That is to say, scalar multiplication corresponds to scaling the volume of the parallelotope.

Vector space addition in is worthy of contemplation even if the ultimate definition is straightforward. (From a pedagogical perspective, having a simple mathematical definition does not imply having an intuitive understanding; Grassmann algebras have a simple mathematical definition, but one that belies the ingenuity required by Grassmann to develop them and one that potentially lacks the intuition required to feel comfortable with them.) Thinking first in terms of cubes then in terms of parallelotopes, it is clear geometrically that . In other words, if all but one vector are the same, there is an obvious geometric meaning that can be given to vector space addition in . Perhaps other special cases can be found. Nevertheless, the general rule we wish to follow (if at all possible) is that if then this should be taken to mean that the volume of the parallelotope plus the volume of the parallelotope is equal to the volume of the parallelotope . If this is possible, then one way to achieve it is to define as follows. Arbitrarily choose a basis for . Then we know that there exist constants and such that the volume of is equal to times the volume of , and the volume of equals times the volume of . Then is defined to be . One can check that this indeed works; it endows with a well-defined vector space structure. (Precisely, one must first verify that our definitions are consistent — given , we claim that no matter which parallelotopes , and we used, the same element will be obtained — and then verify that the axioms of a vector space are satisfied.)

After all this effort, one may be disappointed to learn that is one-dimensional. However, that is to be expected; we wanted to represent the (signed) volume of an (oriented) parallelotope and hence is essentially just the set of real numbers with the usual scalar multiplication and vector addition. What we have done though is introduce the notation and mindset to pave the way for generalising this reasoning to parallelotopes of arbitrary dimension in .

Importantly, the following approach will not work, in that it will not re-create the Grassmann algebra. Consider all one-dimensional parallelotopes in , where now . If and are two such parallelotopes then one might be tempted to declare that if and only if the length of is equal to the sum of the lengths of and with respect to all metrics. This would lead to an infinite-dimensional vector space though, since it would only be possible to add two vectors that were linearly dependent.

An algebra (in this context) is a vector space that also has defined on it a rule for multiplying two elements, such that the multiplicative structure is consistent with the vector space structure, e.g., the associative and distributive laws hold. Does multiplication enter the picture in any way when we think of volume? For a start, the area of a rectangle can be calculated by taking the product of the lengths of two adjoining sides. We are thus tempted to introduce a symbol that allows us to construct a higher-dimensional parallelotope from two lower-dimensional ones — namely, — and have some faint hope that this simple concatenation-of-parallelotopes operator behaves in a way expected of a multiplication operator.

Now for the **key decision**, which I have not seen stated elsewhere yet believe to be the key to understanding Grassmann algebras in a simple way. Because the paragraph before last pointed out that we cannot treat length in a metric-independent way if we wish to stay in finite dimensions, we must use our definition of metric-independent volume to induce a weaker notion of metric-independent length, area and volume on lower-dimensional parallelotopes of the ambient space . Precisely, we declare that is equivalent to if and only if, for all vectors , we have that has the same volume as , where as usual is the dimension of . In particular, lower-dimensional parallelotopes are considered merely as building blocks for -dimensional parallelotopes in -dimensional spaces. Immediate questions to ask are does this work in theory and is it useful in practice. It does work; it leads to the Grassmann algebra. And it has found numerous uses in practice, but that is a different story which will not be told here.

It is now a straightforward journey to the finish line. Let denote what was earlier denoted , and in general, let denote the set of all -dimensional (oriented) parallelotopes up to the aforementioned equivalence relation. Each of these sets can be made into a vector space with vector space operations relating directly to volumes. Precisely, if then the scalar multiple is the parallelotope (unique up to equivalence) such that, for all vectors , the volume of is precisely times the volume of regardless of which metric is used to measure volume. (This implies that the volume of is precisely times the volume of but the converse is not necessarily true.) Vector addition can be defined in a similar way.

It can be shown that is linearly isomorphic to . Indeed, if then because, for any vectors , the volume of the parallelotope will equal the sum of the volumes of and . Conversely, if then one can deduce by strategic choices of that the only possibility is . (Think in terms of determinants of matrices.)

As hinted at before, we expect multiplication to come into play and we expect it to behave nicely with respect to addition because we know, for example, that a rectangle of side lengths and a rectangle of side lengths have total area . In other words, in at least, we expect that . This is indeed the case — for any it is clear that — and here the point is to explain why should behave like multiplication rather than prove rigorously that it does.

When it comes to rigorous proofs, it is time to switch from geometric intuition to mathematical precision. Here, the key step is in recognising that the volume of a -dimensional parallelotope in a -dimensional vector space is a multi-linear function of the constituent vectors . In fact, it is not just any multi-linear map but an alternating one, meaning that if two adjacent vectors are swapped then the volume changes sign. This is the starting point for the modern definition of exterior algebra, also known as the Grassmann algebra.

I intentionally used non-conventional notation because it was important to introduce concepts one by one. First, because the operator introduced above is anti-commutative (it is almost as familiar as ordinary multiplication except that the sign can change, e.g., ) it is common to denote it by the wedge product instead. Furthermore, since is isomorphic to it is customary to omit the square brackets, writing for , writing for , and so forth.

There are some loose ends which I do not tidy up since the aim of this note is to prepare the reader for a standard account of the exterior algebra; perhaps though the last point to clarify is that the Grassmann algebra is the direct sum of the base field plus plus up to . Thus, if two parallelotopes cannot be added geometrically to form a new parallelotope, either because they are of differing dimensions, or roughly speaking because changing metrics would cause them to change in incongruous ways as building blocks, then they are just left written as a sum.

In summary:

- The exterior algebra of a vector space is a vector space whose elements represent equivalence classes of linear combinations of oriented parallelotopes in .
- If is the dimension of then two -dimensional parallelotopes are equivalent if and only if they have the same -dimensional volume as each other with respect to any and all metrics.
- Multiplying a parallelotope by a scalar just multiplies its volume by the same amount (without changing the subspace in which it lies).
- A higher-dimensional parallelotope is constructed from lower-dimensional ones via the wedge product which, except for possible sign changes, behaves precisely like a multiplication operator (because, roughly speaking, volume is determined by multiplying one-dimensional lengths together).
- Two -dimensional parallelotopes and are equivalent if and only if, when treated as building blocks for constructing parallelotopes and of the same dimension as , the volumes of the resulting parallelotopes and are always the same, regardless of which metric is used and how is chosen.
- The sum of two -dimensional parallelotopes and equals the -dimensional parallelotope if and only if, for all -dimensional parallelotopes , the volume of equals the sum of the volumes of and regardless of which metric is used. (Such a need not exist, in which case the resulting vector space sum is denoted simply by .)

As always, this note may be unnecessarily long because it was written in a linear fashion from start to finish. Hopefully though, the general direction taken has some appeal.

## Differentiating Matrix Expressions The Easy Way, and an Elementary yet Genuine use for the Tensor Product

In many areas of science requiring differentiating multivariate functions , the derivative is often treated as a vector, and the second-order derivative treated as a matrix. This leads to notation with sometimes appearing and sometimes its transpose appearing. Extending this notation to higher derivatives, or to functions , becomes even more messy.

An alternative is to treat derivatives as (multi-)linear maps. If, at some stage, vectors and matrices are required, i.e., gradients and Hessians, these can be easily read off from the derivatives. But often these are not required. Basically, the difference is working in a particular coordinate system — the gradient and Hessian are only defined with respect to an inner product and that determines the “coordinate system” being used — versus working in a coordinate-free manner.

In Differential Calculus, Tensor Products, and the Importance of Notation, a quick overview is given, but one which points out several subtleties. (For additional examples, see this earlier post.) Furthermore, it introduces the tensor product as a way of simplifying the notation further. This is an elementary yet genuine application benefitting from the tensor product, and is currently the best way I know of introducing tensor products early on to students in a meaningful way. (I am not very pleased with my earlier attempt at an introductory article on the tensor product as I don’t feel it is interesting enough.)

## Optimisation on Manifolds — Lifting Iterative Algorithms from Euclidean Space to Manifolds

A natural question to ask in the theory of Optimisation on Manifolds is how the Newton method can be generalised to an iterative method on a manifold. Traditionally, the standard way was formulaic: because the Newton method involves a gradient, a Hessian, and vector space addition, generalise these concepts first then apply Newton’s formula. This led to the Riemannian Newton method. Gradients, Hessians and vector addition are not intrinsic though; they are just one realisation of an iterative method that uses the first and second order derivatives to converge locally quadratically to a (non-degenerate) critical point. The standard Newton formula in Euclidean space is designed to be optimal for quadratic cost functions, but a simple change of coordinates will change it to being optimal for another class of cost functions instead.

It turns out that changes of coordinates play a fundamental role in understanding how the Newton method, or any (memoryless) iterative method for that matter, can be lifted to manifolds. In A Framework for Generalising the Newton Method and Other Iterative Methods from Euclidean Space to Manifolds, it was essentially demonstrated that all generalised Newton methods can be understood as applying a different change of coordinates at each iteration, and local quadratic convergence can be assured if and only if the Newton method in Euclidean space is “robust” to these changes, meaning its radius of convergence can be uniformly bounded under all coordinate changes of interest. The Newton method is indeed very robust, leading to a tremendous variety of possibilities for lifting it to manifolds which go well beyond the traditional class of Riemannian Newton methods.

It was also pointed out there is generally no reason to insist that the lift is uniquely defined globally. It suffices to lift locally in response to where the iteration is currently at. Visually, one way to think of this is to re-centre at each step of the iteration: rather than move a point around a sphere searching for a critical point, keep applying rotations to the sphere endeavouring to bring a critical point to the North pole. More generally, arbitrary transformations of the manifold, or of the ambient space containing the manifold, can be used for re-centring at a distinguished point, and the Newton method need only be lifted to a region containing the distinguished point. (A special case is “rolling without slipping”, but there a Riemannian metric is present, which results in the local lifts fitting together globally. This need not be the case for more general re-centring techniques, and may possibly lead to more efficient algorithms that use simpler lifts.)

Many competing factors come into play when designing a Newton method for a particular problem. Hopefully the framework will prove useful, but it is far from being a panacea. It raises more questions than it answers.