The next few lectures aim to provide an introduction to several basic concepts in differential geometry required for progressing our understanding of information geometry. Rather than commence with a definition of differential geometry, the idea of “coordinate independence” will be studied in the simpler setting of affine geometry first. Roughly speaking, differential geometry combines the notion of coordinate independence with the notion of gluing simpler things together to form more complicated objects.
Let $M$ be a set. It may represent all the points on an infinitely large sheet of paper, in which case one must resist the temptation to think of $M$ as a subset of $\mathbb{R}^3$ but rather, envisage $M$ as the whole universe; there is nothing other than $M$. Alternatively, $M$ might represent the set of all elephants in the world.
Consider first the case when $M$ is the sheet of paper. In fact, assume we all live on $M$; the world is flat. In order to write down where someone lives, we need a coordinate chart. We need an injective function $\varphi \colon M \to \mathbb{R}^2$ which assigns to every point $p \in M$ a unique pair of numbers $\varphi(p)$ which we call the coordinates of the point. In order for this to be successful, everyone needs to use the same coordinate chart $\varphi$. But given just $M$, no two people are likely to choose the same chart. How could they? Just for starters, it would necessitate someone drawing a big cross on the ground and declaring that everyone must consider that point to have coordinates $(0,0)$. Extra information beyond the set $M$ itself is required if different people are to be able to construct the same coordinate chart.
Sometimes, as we will now see, there is extra information available but it is not enough to determine a unique coordinate chart. If every person had a magnetic compass and a ruler then they could agree that moving one metre east must correspond to increasing the first coordinate by one, and moving one metre to the north must correspond to increasing the second coordinate by one. Two people’s charts will still differ in general, but only in the choice of origin. Although people would not be able to communicate where they live in absolute terms – saying I live at $(5,3)$ is no good to anyone else with a different coordinate chart – there is still a wealth of information that can be communicated. Saying that the difference in coordinates between my house and your house is $(2,7)$ is enough for you to find your way to my house; although it is likely our coordinate charts differ, the same answer is obtained no matter which chart is used. This is called coordinate independence.
The more possibilities there are for the charts, the fewer the number of coordinate independent properties. For example, if now people’s rulers are confiscated and they only have magnetic compasses, people’s coordinate charts can differ from each other’s in more ways. Saying I live $(5,0)$ away from your home will no longer work; my 5 units east will almost surely differ from your 5 units east. We could, however, still agree on whether a collection of trees lies in a straight line or not.
Precisely, every person may decide to define that a collection of trees lies in a straight line if, under their personal coordinate chart $\varphi$, the images of the trees lie in a straight line in $\mathbb{R}^2$. This definition works because even though two people’s coordinate charts may be different, their definitions of lying in a straight line turn out to be the same. We will see presently that this can be understood in terms of a simple concept called transition functions. Note too that earlier we were implicitly thinking in terms of definitions; we defined the location of your house relative to my house to be the vector $\varphi(\text{your house}) - \varphi(\text{my house})$, and it was a useful definition whenever it was coordinate independent, as it was when we had both a compass and a ruler but not when we had a compass alone.
If $M$ represents a set of elephants then I might choose a coordinate chart $\varphi \colon M \to \mathbb{R}$ by defining $\varphi(p)$ to be the length of the trunk of elephant $p$. Tom might define his chart $\psi$ by measuring the length of the tail. We would not agree on the absolute size of an elephant but if the length of an elephant’s trunk and its tail is always a fixed ratio then we would agree on what it means for one elephant to be twice as big as another elephant.
Let’s formalise the above mathematically. The set $\mathbb{R}^n$ can be given a lot of extra structure. We are used to thinking of it as a vector space – we know how to add two points together in a sensible and consistent way – and we commonly introduce a norm for measuring distance and sometimes even an inner product for measuring angles. If $\varphi \colon M \to \mathbb{R}^n$ is a bijection then any structure we have on $\mathbb{R}^n$ can be transferred to $M$. We can make $M$ a vector space simply by defining $p + q = \varphi^{-1}(\varphi(p) + \varphi(q))$ and $\alpha\,p = \varphi^{-1}(\alpha\,\varphi(p))$, for instance.
Let $\mathcal{A}$ be a set of bijective functions of the form $\varphi \colon M \to \mathbb{R}^n$. Each element of $\mathcal{A}$ represents a valid coordinate chart, or the way we had introduced it earlier, each person uses their own coordinate chart and $\mathcal{A}$ is the set of all these coordinate charts. Unless $\mathcal{A}$ contains only a single coordinate chart, we can no longer transfer arbitrary structures from $\mathbb{R}^n$ to $M$ in a coordinate independent way; we saw examples of this earlier. What structures can be transferred?
A bit of thought reveals that the key is to study the transition functions $\psi \circ \varphi^{-1}$ for all pairs $\varphi, \psi \in \mathcal{A}$. Observe that $\psi \circ \varphi^{-1}$ is a function from $\mathbb{R}^n$ to $\mathbb{R}^n$. We can therefore use the structure on $\mathbb{R}^n$ to determine what properties $\psi \circ \varphi^{-1}$ has; for example, it might be that $\mathcal{A}$ is such that $\psi \circ \varphi^{-1}$ is always a linear function. Linearity is a property which can be defined in terms of the vector space structure on $\mathbb{R}^n$.
Recalling the earlier examples, when people had magnetic compasses and rulers, the transition functions would always have the form $x \mapsto x + c$ for some vector $c$. (Changing charts would cause $c$ to change; indeed, $c$ represents the difference in the choice of origin of the two charts.) When people only had magnetic compasses, $\psi \circ \varphi^{-1}$ would be of the more general form $x \mapsto \Lambda x + c$ for some positive diagonal matrix $\Lambda$. (Here, I have assumed that each person would build their own ruler, so everyone has rulers, they are just of different lengths.)
Linking in with previous lectures, the set $M$ can be made into an affine space by introducing a collection $\mathcal{A}$ of coordinate charts such that for any two charts $\varphi, \psi \in \mathcal{A}$, their transition function $\psi \circ \varphi^{-1}$ always has the form $x \mapsto Ax + b$ for some invertible matrix $A$ and vector $b$. Because the image of a straight line under such a transition function remains a straight line, different people with different coordinate charts will still agree on what is and what is not a straight line in $M$. It is a worthwhile exercise to prove that this definition of an affine space is equivalent to the definition given in earlier lectures.
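The claim that straight lines are coordinate independent under affine transition functions is easy to check numerically. The sketch below is my own illustration (not part of the lecture material): it pushes three collinear points through a hypothetical affine transition function $x \mapsto Ax + b$ and confirms collinearity is preserved, whereas a non-affine map destroys it.

```python
import numpy as np

def collinear(p, q, r, tol=1e-9):
    """True if three points in the plane lie on a common straight line."""
    # The points are collinear iff det[q - p, r - p] vanishes.
    d1, d2 = q - p, r - p
    return abs(d1[0] * d2[1] - d1[1] * d2[0]) < tol

# Three collinear points in my chart (they lie on the line y = 2x).
p, q, r = np.array([0., 0.]), np.array([1., 2.]), np.array([3., 6.])
assert collinear(p, q, r)

# A hypothetical affine transition function psi o phi^{-1} : x -> Ax + b.
A = np.array([[2., 1.], [0., 3.]])  # invertible
b = np.array([5., -1.])
affine = lambda x: A @ x + b

# Both charts agree on what a straight line is.
assert collinear(affine(p), affine(q), affine(r))

# A non-affine transition function would not preserve straight lines.
bent = lambda x: np.array([x[0], x[1] ** 2])
assert not collinear(bent(p), bent(q), bent(r))
```

The same check would succeed for any invertible $A$ and vector $b$; that is precisely the coordinate independence being discussed.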
To summarise, there is interest in playing the following mathematical game:
- We are given a set $M$ and a collection $\mathcal{A}$ of coordinate charts $\varphi \colon M \to \mathbb{R}^n$.
- We want to give the set $M$ some structure coming from the structure on $\mathbb{R}^n$.
- We want to do this in a coordinate independent way, meaning that if I use my own coordinate chart $\varphi \in \mathcal{A}$ and you use your own coordinate chart $\psi \in \mathcal{A}$ then we get the same structure on $M$.
The secret is to look at the form of the transition functions $\psi \circ \varphi^{-1}$ for all pairs $\varphi, \psi \in \mathcal{A}$. The more general the form of the transition functions, the less structure can be transferred from $\mathbb{R}^n$ to $M$ in a coordinate independent way.
The relevance to information geometry is that the parametrisation used to describe a family of densities is, to a large extent, irrelevant. Properties that depend on a particular choice of parametrisation are generally not as attractive as properties which are coordinate independent. If $M$ represents the family of Gaussian random variables $N(\mu, \sigma^2)$ parametrised by mean and variance then there is little justification in calling a subfamily whose parameters $(\mu, \sigma^2)$ trace out a line segment a “line segment” of Gaussians, because there does not appear to be anything special about the parametrisation of Gaussians by mean and variance. (It turns out that for exponential families, statisticians have come up with a set of parametrisations they believe to be nice. Although these parametrisations are not unique, their transition functions are always affine functions; this is why it was possible to introduce an affine structure in earlier lectures. Note that we have not pursued this to the end because we want to move quickly to a more powerful concept coming from differential geometry which will subsume this affine geometry.)
For completeness, note that a parametrisation is just the inverse of a coordinate chart. If we think of defining a family by specifying a function $\theta \mapsto p_\theta$ from a parameter space $\Theta \subset \mathbb{R}^k$ to the space of densities then we speak of a parametrisation. On the other hand, if someone points to a density $p$ in the family then we can determine its coordinates by asking what value of $\theta$ makes $p_\theta = p$.
In Lecture 3, a local test for determining if a family was exponential was introduced. The last part of the test involved seeing if the second-order derivatives $\frac{\partial^2 \log p_\theta}{\partial \theta_i\, \partial \theta_j}$ could be written as linear combinations of the first-order derivatives $\frac{\partial \log p_\theta}{\partial \theta_i}$. As will be shown in a subsequent lecture, such calculations are sometimes easier to do if an inner product is introduced; we are therefore motivated to define an inner product on the space spanned by the first-order derivatives $\frac{\partial \log p_\theta}{\partial \theta_i}$.
Alternatively, we may simply be motivated to introduce some geometry into the family $\{p_\theta\}$. Precisely, pick a particular density $p_{\theta^\ast}$ from the family and consider two curves passing through $p_{\theta^\ast}$ at time zero. By this is meant $\theta_1(\cdot)$ and $\theta_2(\cdot)$ are two functions from $\mathbb{R}$ to parameter space, so that as $t$ varies, $p_{\theta_1(t)}$ and $p_{\theta_2(t)}$ trace out two curves in the space of probability densities, and it is required that the curves intersect at $p_{\theta^\ast}$ when $t = 0$, namely $\theta_1(0) = \theta_2(0) = \theta^\ast$.
If the space of probability densities had a “geometry” then we should be able to say at what angle any two curves intersect. Mathematically, we wish to compute the inner product of the “velocity vector” of $t \mapsto p_{\theta_1(t)}$ at $t = 0$ with the “velocity vector” of $t \mapsto p_{\theta_2(t)}$ at $t = 0$.
We are inclined to work not with $p_{\theta(t)}$ directly but with $\log p_{\theta(t)}$; the mapping $p \mapsto \log p$ is bijective so nothing is lost or gained by doing this, it is simply the case that we know from previous work that we can interpret log-likelihoods as elements of a vector space (albeit a vector space which has forgotten where it placed its origin) and therefore we are comfortable to differentiate $\log p_{\theta(t)}$ with respect to $t$. (See too the previous lecture on the Fisher Information Matrix.)
An Inner Product
For any two curves $\theta_1(t)$ and $\theta_2(t)$ intersecting at time $t = 0$ we wish to define their inner product $\left\langle \frac{d}{dt}\big|_{t=0} \log p_{\theta_1(t)},\ \frac{d}{dt}\big|_{t=0} \log p_{\theta_2(t)} \right\rangle$. Since $\frac{d}{dt} \log p_{\theta(t)} = \sum_i \dot\theta_i\, \frac{\partial \log p_\theta}{\partial \theta_i}$ is a linear function of $\dot\theta(0)$, it suffices to write down the rule for computing the inner product in terms of $\dot\theta_1(0)$ and $\dot\theta_2(0)$. It is desirable to do this because $\dot\theta(0)$ is a finite-dimensional vector whereas $\frac{d}{dt}\big|_{t=0} \log p_{\theta(t)}$ is infinite-dimensional.
The catch though is that if we work with $\dot\theta(0)$ then we must ensure that we get the same answer for the inner product even if we change to a new parametrisation of the family $\{p_\theta\}$. Precisely, suppose $\{\tilde p_\eta\}$ represents the same family as $\{p_\theta\}$ but with respect to a different parametrisation $\eta$. (We assume there is a one-to-one correspondence between $\theta$ and $\eta$; given any $\theta$ there is an $\eta$ such that $\tilde p_\eta = p_\theta$, and given any $\eta$ there is a $\theta$ such that $p_\theta = \tilde p_\eta$ too.) If $\eta_1(t)$ and $\eta_2(t)$ are such that $\tilde p_{\eta_1(t)} = p_{\theta_1(t)}$ and $\tilde p_{\eta_2(t)} = p_{\theta_2(t)}$ then we must obtain the same answer for the inner product whether it is computed from $\dot\eta_1(0)$ and $\dot\eta_2(0)$ or from $\dot\theta_1(0)$ and $\dot\theta_2(0)$.
It so happens that the Fisher Information Matrix can be used to define an inner product in a coordinate-independent way, meaning the same answer will be obtained regardless of how the family is parametrised. Before considering why we use the Fisher Information Matrix, let’s just see first how it can be used to do this.
Note that on a finite-dimensional vector space, every inner product is of the form $\langle u, v \rangle = u^T G v$ for some positive-definite symmetric matrix $G$. The matrix $G$ determines the inner product and the idea we will experiment with is using the Fisher Information matrix $I(\theta)$ as the matrix $G$. To wit, we define:
$$\left\langle \dot\theta_1(0),\ \dot\theta_2(0) \right\rangle = \dot\theta_1(0)^T\, I(\theta^\ast)\, \dot\theta_2(0),$$
where $\theta^\ast$ is the point of intersection of the two curves at time $t = 0$.
From its form, and assuming the Fisher Information matrix is positive-definite (as usual, technical assumptions are being omitted in order to focus attention on the higher-level details), it is clear that we have defined an inner product; the axioms of an inner product are satisfied. What we need to check is whether or not we get the same answer if we change the parametrisation of our family.
First, the way the Fisher Information matrix changes when we change parametrisations must be determined. From the aforementioned correspondence between $\theta$ and $\eta$ we can regard one as a function of the other and write: $\tilde p_\eta = p_{\theta(\eta)}$. Differentiating this yields: $\frac{\partial \log \tilde p_\eta}{\partial \eta_j} = \sum_i \frac{\partial \log p_\theta}{\partial \theta_i}\, \frac{\partial \theta_i}{\partial \eta_j}$. It follows almost immediately that:
$$\tilde I(\eta) = J^T\, I(\theta(\eta))\, J,$$
where the $(i,j)$th entry of $J$ is $\frac{\partial \theta_i}{\partial \eta_j}$.
(As discussed in class, the way to understand this is to consider the special case of $\theta$ being a linear function of $\eta$, writing down the relationship between the squared-error of estimating $\theta$ and the squared-error of estimating $\eta$, and recalling that the Fisher Information matrix is essentially the inverse of the (asymptotic) squared-error. That the Fisher Information matrix determines the asymptotic performance explains why higher-order derivatives of $\theta$ with respect to $\eta$ do not appear in the above formula.)
We may write $\theta_i(t) = \theta(\eta_i(t))$ to signify the relationship between the curves $\theta_i(\cdot)$ and $\eta_i(\cdot)$; they both trace out the same curve of probability densities, just with respect to different coordinates. Therefore, $\dot\theta_i(0) = J\, \dot\eta_i(0)$.
Collating the above results shows that an inner product defined with respect to the Fisher Information matrix is indeed coordinate-independent:
$$\dot\eta_1(0)^T\, \tilde I(\eta^\ast)\, \dot\eta_2(0) = \dot\eta_1(0)^T\, J^T I(\theta^\ast) J\, \dot\eta_2(0) = \dot\theta_1(0)^T\, I(\theta^\ast)\, \dot\theta_2(0).$$
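This coordinate independence can be observed numerically. The sketch below is my own illustration using the Gaussian family parametrised once by $\theta = (\mu, \sigma^2)$ and once by $\eta = (\mu, \sigma)$; the Fisher matrices $\mathrm{diag}(1/\sigma^2,\, 1/(2\sigma^4))$ and $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$ are the standard closed forms for these two parametrisations.

```python
import numpy as np

sigma = 1.7

# Closed-form Fisher Information matrices for the Gaussian family in the
# two parametrisations theta = (mu, sigma^2) and eta = (mu, sigma).
I_theta = np.diag([1 / sigma**2, 1 / (2 * sigma**4)])
I_eta   = np.diag([1 / sigma**2, 2 / sigma**2])

# theta as a function of eta is theta(eta) = (eta_1, eta_2^2), so the
# Jacobian J with (i,j) entry d(theta_i)/d(eta_j) is:
J = np.array([[1., 0.], [0., 2 * sigma]])

# The transformation rule derived above: I_eta = J^T I_theta J.
assert np.allclose(J.T @ I_theta @ J, I_eta)

# Two curves through the same density, with velocities given in
# eta-coordinates; the corresponding theta-velocities are J @ eta_dot.
eta_dot1, eta_dot2 = np.array([1., -2.]), np.array([0.5, 3.])
theta_dot1, theta_dot2 = J @ eta_dot1, J @ eta_dot2

# The Fisher inner product agrees in both coordinate systems.
ip_eta   = eta_dot1 @ I_eta @ eta_dot2
ip_theta = theta_dot1 @ I_theta @ theta_dot2
assert np.isclose(ip_eta, ip_theta)
```

Any smooth reparametrisation would do in place of $\sigma \mapsto \sigma^2$; the cancellation of the Jacobians is exactly the displayed chain of equalities.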
The reader is urged to think carefully about what has actually been done. We have come close to putting an inner product on the infinite-dimensional space of all probability densities. Given any finite-dimensional family we can define an inner product using the Fisher Information matrix in a consistent way which depends only on the densities themselves and not on their parametrisations. (There is a small lie in the last sentence; arbitrary parametrisations are not permitted but rather, any two parametrisations we consider need to be “reasonably nice” with respect to each other, such as the mapping from one to the other being continuously differentiable. Such finer points will be elaborated on later in the course.)
We have not endeavoured to justify why we want to use the Fisher Information matrix as an inner product; we have just demonstrated that because it transforms in the right way, we can use it to define an inner product if we so choose. It is appealing because it is coordinate-independent; it doesn’t matter how we parametrise our family, we get the same inner product being defined. This doesn’t automatically mean that it is a useful or sensible inner product though.
It turns out that it is both useful and sensible. That it is sensible comes from a deep property of being invariant with respect to sufficient statistics (this will be explained later in the course), and indeed, the Fisher metric (as we shall henceforth refer to the inner product coming from the Fisher Information matrix) is the essentially unique metric with this highly desirable property. That it is useful will be seen later when we gain a better understanding of what geometrical results become available by having an inner product structure.
As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.
Definition and Example
Associated with a family of probability densities $p_\theta(x)$, where $x \in \mathbb{R}^n$ and $\theta \in \Theta \subset \mathbb{R}^k$, is a matrix-valued function $I(\theta)$ whose value at any point $\theta$ is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.
Precisely, the Fisher Information at $\theta$ is defined to be the $k \times k$ matrix $I(\theta)$ whose $(i,j)$th entry equals $E\left[ \frac{\partial \log p_\theta(x)}{\partial \theta_i}\, \frac{\partial \log p_\theta(x)}{\partial \theta_j} \right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons; it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\mu$ but known variance $\sigma^2$.
Let $p_\mu(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$. The log-likelihood is therefore $\log p_\mu(x) = -\frac12 \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$. Differentiating yields $\frac{\partial \log p_\mu(x)}{\partial \mu} = \frac{x-\mu}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p_\mu(x)$. Precisely, $E[g(x)]$ is defined to be $\int g(x)\, p_\mu(x)\, dx$. Therefore the Fisher Information is:
$$I(\mu) = \int \left( \frac{x-\mu}{\sigma^2} \right)^2 p_\mu(x)\, dx.$$
That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\mu$ and variance $\sigma^2$, the expectation of $(x-\mu)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\left( \frac{x-\mu}{\sigma^2} \right)^2$ is $\frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}$. Stated formally:
$$I(\mu) = E\left[ \left( \frac{x-\mu}{\sigma^2} \right)^2 \right] = \frac{1}{\sigma^2}.$$
Had we considered a family of multivariate Gaussian random variables with unknown mean $\mu \in \mathbb{R}^n$ but known covariance matrix $\Sigma$, that is $p_\mu(x) = (2\pi)^{-n/2} \det(\Sigma)^{-1/2} \exp\left( -\frac12 (x-\mu)^T \Sigma^{-1} (x-\mu) \right)$, then the Fisher Information Matrix would have been
$$I(\mu) = \Sigma^{-1}.$$
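The defining expectation lends itself to a Monte Carlo sanity check: average the squared score $\left(\frac{x-\mu}{\sigma^2}\right)^2$ over samples drawn from $p_\mu$ and compare with $\frac{1}{\sigma^2}$. The values of $\mu$ and $\sigma$ below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0

# Draw samples from p_mu and average the squared score d(log p)/d(mu),
# approximating the expectation in the definition of Fisher Information.
x = rng.normal(mu, sigma, size=200_000)
score = (x - mu) / sigma**2
fisher_mc = np.mean(score**2)

# The closed-form answer is 1/sigma^2.
assert abs(fisher_mc - 1 / sigma**2) < 2e-3
```

The agreement improves at the usual Monte Carlo rate as the number of samples grows.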
It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\mu$. As will be seen presently, it is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the accuracy with which we can expect to be able to estimate $\mu$ given an observation $x$. As the variance decreases, the amount of information increases.
Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the fact that the definition of the Fisher Information Matrix requires the log-likelihood function to be differentiated means that the space of log-likelihoods was already implicitly being treated as a vector space.
Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p_\theta$. This sequence of independent and identically distributed random variables will be denoted by $x_1, x_2, \cdots$. Given the first $n$ random variables $x_1, \cdots, x_n$, can we guess what $\theta$ is, and how accurate is our guess likely to be?
For finite $n$, there is almost never a single best method for estimating $\theta$; roughly speaking, an estimator can be made to perform better at some values of $\theta$ only at the expense of performing worse at others. However, it is sensible to ask if there is an estimation rule which works well as $n \to \infty$.
Observe that the density of $(x_1, \cdots, x_n)$ is just a product of densities: $p(x_1, \cdots, x_n; \theta) = \prod_{i=1}^n p_\theta(x_i)$. It follows that the Fisher Information Matrix for $(x_1, \cdots, x_n)$ is simply $n$ times the Fisher Information Matrix for $x_1$. [Indeed, one of the reasons why it is called “information” is because it is additive.]
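The additivity can also be observed directly: the score of the joint density is the sum of the individual scores, and its second moment is $n$ times the single-sample Fisher Information. A small numeric sketch of mine, again using the Gaussian-mean family with arbitrarily chosen values:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.5, 2.0, 10

# Score of the joint density of (x_1,...,x_n) at the true mu:
# d/dmu sum_i log p(x_i) = sum_i (x_i - mu)/sigma^2.
trials = 100_000
x = rng.normal(mu, sigma, size=(trials, n))
joint_score = ((x - mu) / sigma**2).sum(axis=1)

# Its second moment approximates the Fisher Information of n samples.
fisher_joint = np.mean(joint_score**2)
fisher_single = 1 / sigma**2

# Fisher Information of n i.i.d. samples is n times that of one sample.
assert abs(fisher_joint - n * fisher_single) < 0.1
```

The key step is that the log of a product is a sum, so scores add; independence then makes the cross terms vanish in expectation.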
The maximum-likelihood estimate of $\theta$ given the single observation $x_1$ is the value of $\theta$ which maximises $p_\theta(x_1)$. (Here, note that the observed value of $x_1$ is substituted into the expression for $p_\theta(x)$, leaving just a function of $\theta$; it is this function which is maximised.) Let $\hat\theta_n$ denote the maximum-likelihood estimate of $\theta$ based on the observations $x_1, \cdots, x_n$. (Whether we have $n$ observations or a single observation of dimension $n$ is one and the same thing; the maximum likelihood estimate is the “most likely” value of $\theta$, given by maximising $\prod_{i=1}^n p_\theta(x_i)$.) Under reasonably mild regularity conditions, it turns out that:
$$\sqrt{n}\, \left( \hat\theta_n - \theta \right) \to N\left( 0,\ I(\theta)^{-1} \right) \quad \text{in distribution.}$$
That is to say, the maximum-likelihood estimator is asymptotically unbiased and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac{1}{n} I(\theta)^{-1}$.
Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of samples goes to infinity.
There is more to the story. The Cramér-Rao Bound states that if $\hat\theta$ is an unbiased estimator of $\theta$ then its performance must be lower-bounded by the inverse of the Fisher Information Matrix: $E\left[ (\hat\theta - \theta)(\hat\theta - \theta)^T \right] \succeq I(\theta)^{-1}$ always holds (for an unbiased estimator). Therefore, no other estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramér-Rao Bound.)
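For the Gaussian-mean family these asymptotic statements can be seen concretely: the maximum-likelihood estimator is the sample mean, and its variance over many trials matches the Cramér-Rao bound $\left(n\, I(\mu)\right)^{-1} = \sigma^2/n$. The following sketch is my own, with arbitrarily chosen values:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 1.0, 2.0, 50, 200_000

# For N(mu, sigma^2) with known sigma, the MLE of mu is the sample mean.
x = rng.normal(mu, sigma, size=(trials, n))
mle = x.mean(axis=1)

# Empirical variance of the MLE versus the Cramer-Rao bound
# (n * I(mu))^{-1} = sigma^2 / n.
var_mle = np.var(mle)
crb = sigma**2 / n

assert abs(np.mean(mle) - mu) < 1e-2      # (asymptotically) unbiased
assert abs(var_mle - crb) < 0.05 * crb    # variance attains the bound
```

For this particular family the sample mean attains the bound exactly for every finite $n$, not merely asymptotically; in general only the asymptotic statement holds.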
Extensive literature is available on the Fisher Information Matrix, including on Wikipedia. I therefore curtail my remarks to the above.
The previous lecture looked at some of the lower-level details of affine spaces. This is a prerequisite for understanding the following higher-level overview, which in turn is a prerequisite for going back and understanding the lower-level details in a finer and more precise manner. Following the review is a section on how calculus can be used to test if a family is affine.
Brief Review of Affine Geometry
It is important to distinguish between an affine space, an affine subspace of a vector space and an affine subspace of an affine space. They are three different but related concepts. If any part of the following, however small, is not understood, the reader is invited to revisit the previous lecture.
An affine subspace of a vector space can be defined directly in terms of vector space concepts. A subset $A$ of a vector space $V$ is an affine subspace of $V$ if there exists a $p \in A$ such that $A - p = \{ a - p : a \in A \}$ is a linear subspace of $V$.
A set $B$ can be made into an affine space if the difference between two points can be made to behave like a vector, meaning among other things that $(p - q) + (q - r) = p - r$. Therefore, there is an auxiliary vector space $V$ and the difference of any two points in $B$ is a point in $V$.
Choosing an origin then makes the affine space into a vector space. In particular, if $V$ is the auxiliary vector space and $o \in B$ then $\varphi \colon B \to V$ defined by $\varphi(p) = p - o$ is a bijective function which makes $B$ into a vector space. I will refer to such a function as a coordinate chart.
Using coordinate charts is a convenient way to work with affine spaces; it should be remembered though that the choice of origin is arbitrary and hence only concepts which do not depend on the choice of origin should be considered.
A subset $C$ of an affine space $B$ is an affine subspace of the affine space if its image under a coordinate chart $\varphi \colon B \to V$ is an affine subspace of the vector space $V$. Note that if $o \in C$ and the coordinate chart $\varphi(p) = p - o$ is used then $C$ is an affine subspace if and only if $\varphi(C)$ is a linear subspace of $V$.
For the space of all (unnormalised) probability densities, we have observed that defining the difference $p - q$ of two densities to be the log-likelihood ratio $\log \frac{p}{q}$ makes the space into an affine space in such a way that a family is exponential if and only if it is affine.
Just as a linear subspace of a vector space is itself a vector space in its own right, an affine subspace of an affine space is itself an affine space. It is therefore convenient at times to refer to affine subspaces as affine spaces.
A Local Test for a Family to be Exponential
Open Subsets of Affine Spaces
As hinted at in the previous lecture, an exponential family might form only an (open and) convex subset of an affine subspace. To develop a straightforward test to see if a family is exponential, it is convenient to test first to see if the family is an open subset of an affine subspace. Openness is a strong enough property to ensure that the subset has the “same dimension” as the affine subspace yet is weak enough to lend itself to the development of a local test involving only derivatives. (Open sets will be explained later in the course; for the moment, little will be lost by merely thinking of an open set as a set having the property that given any point in the set, we can move a little bit in any direction around that point and still stay within the set.)
It is noted tangentially that a prevalent concept in mathematics is the recognition of scenarios where “local implies global”. Anyone who has tried to prove that a (smooth) function is convex directly from the global condition that the straight line segment connecting any two points on the graph of the function must never drop below the function will appreciate the equivalent local condition that the second-order derivative of the function must be non-negative. The first condition is said to be global because it involves testing pairs of distinct points. The second condition is said to be local because testing the condition at a point merely involves examining the function in an arbitrarily small neighbourhood of that point and computing a limit.
A local test for whether or not a subset $A$ of an affine space is an open subset of an affine subspace $B$ is now derived. (Precisely, we are given $A$ and are asked whether or not there exists an affine subspace $B$ such that $A$ is contained in $B$ and $A$ is open relative to $B$.) For the derivation, it suffices to study the special case of when the affine space is $\mathbb{R}^n$. Therefore, assume that $A = \{ f(\theta) : \theta \in \Theta \}$ is a collection of points in $\mathbb{R}^n$ indexed by the parameter $\theta \in \Theta \subset \mathbb{R}^k$. For example:
- $f(\theta) = M\theta + b$ for some matrix $M \in \mathbb{R}^{n \times k}$ and vector $b \in \mathbb{R}^n$;
- $f(\theta) = (\sin\theta_1, \sin\theta_2)$ where $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$;
- $f(\theta) = (\theta, \theta^2, \cdots, \theta^n)$ where $\theta \in \mathbb{R}$.
In the first example, the space is affine whereas in the third example, provided $n \geq 2$, the space is not affine. The second example is perhaps the most interesting of the three.
Choose an arbitrary straight line in the parameter space $\Theta$. As $\theta$ moves along this line, a curve is traced out in $\mathbb{R}^n$. In example one above, a straight line is traced out in $\mathbb{R}^n$; this is the simplest possible case. In example two, a Lissajous curve is traced out. In example three, if $n = 2$ then a parabola is traced out.
The first observation to make is that if $A$ were an open subset of an affine subspace $B$ of $\mathbb{R}^n$ then the velocity vector $\frac{d}{dt} f(\theta(t))$ of any curve traced out in $A$ by varying $\theta$ would be parallel to $B$. Furthermore, under the regularity condition which we will be assuming for convenience, namely that the Jacobian matrix $Df(\theta)$ always has rank $k$, for any point $p = f(\theta^\ast)$ and any vector $v$ in the span of the columns of $Df(\theta^\ast)$ (thought of as a vector whose base is at $p$), it would be possible to construct a curve $\theta(t)$ such that $f(\theta(0)) = p$ and $\frac{d}{dt}\big|_{t=0} f(\theta(t)) = v$. In words, any vector in the span of the first-order derivatives of $f$ can be realised as the velocity vector of some curve in $A$.
In principle, we could choose any two points $\theta_1, \theta_2 \in \Theta$, evaluate all possible velocity vectors of curves passing through $f(\theta_1)$ and compare them with all possible velocity vectors of curves passing through $f(\theta_2)$. If we always obtained the same set of velocity vectors (by first shifting the base points to the origin to enable a comparison, as is always implicitly done in linear algebra) then we would have shown that $A$ is an open (due to the above regularity condition which was snuck in) subset of an affine subspace $B$. However, this is still a global test because we must compare distinct points $\theta_1$ and $\theta_2$.
Since equivalence is transitive, it suffices to test only nearby points with each other. In the limit, this motivates considering the acceleration vector $\frac{d^2}{dt^2} f(\theta(t))$. If $A$ were an open subset of an affine subspace $B$ then the acceleration vector would always be parallel to $B$. Furthermore, a candidate for $B$ is afforded by the span of the velocity vectors at any given point.
This leads to the following test for determining if a family is affine. Let $\Theta$ be a non-empty open subset of $\mathbb{R}^k$ and $f \colon \Theta \to \mathbb{R}^n$ a twice continuously differentiable function whose derivative $Df(\theta)$ is injective for all $\theta \in \Theta$. (In other words, the Jacobian matrix of $f$ has rank $k$ at all points $\theta \in \Theta$.) Then $f(\Theta)$ is an open subset of a $k$-dimensional affine subspace of $\mathbb{R}^n$ if and only if, for all $\theta \in \Theta$ and all indices $i, j$, there exist scalars $c_1, \cdots, c_k$ (depending on $\theta$, $i$ and $j$) such that
$$\frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} = \sum_{l=1}^k c_l\, \frac{\partial f}{\partial \theta_l}.$$
The following examples hopefully serve to clarify the notation. (In words, the test is to see if the second-order partial derivatives of $f$ with respect to $\theta$ can be written as linear combinations of the first-order partial derivatives.)
If $f(\theta) = M\theta + b$ then $\frac{\partial f}{\partial \theta_i} = M e_i$, the $i$th column of $M$, and $\frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} = 0$. The regularity condition that $Df(\theta)$ is injective is satisfied if $M$ has rank $k$. (If $M$ has lower rank then the affine subspace would have dimension strictly lower than $k$.) The test is satisfied by the trivial choice $c_1 = \cdots = c_k = 0$. The conclusion is that the family is affine.
If $f(\theta) = (\theta, \theta^2)$ then $\frac{\partial f}{\partial \theta} = (1, 2\theta)$ and $\frac{\partial^2 f}{\partial \theta^2} = (0, 2)$. The equation $(0, 2) = c\,(1, 2\theta)$ has no solution $c$, for any $\theta$, therefore the family is not affine. (This shows that the general case $n \geq 2$ in the third example above is also not affine because if it were, then its projection onto its first two coordinates would also be affine.)
If $f(\theta_1, \theta_2) = (\sin\theta_1, \sin\theta_2)$ then $\frac{\partial f}{\partial \theta_1} = (\cos\theta_1, 0)$, $\frac{\partial f}{\partial \theta_2} = (0, \cos\theta_2)$, and the only non-zero second-order derivatives are $\frac{\partial^2 f}{\partial \theta_1^2} = (-\sin\theta_1, 0)$ and $\frac{\partial^2 f}{\partial \theta_2^2} = (0, -\sin\theta_2)$. Note that there are values of $\theta$ for which $Df(\theta)$ is not injective; the equation $Df(\theta)\, v = 0$ has non-zero solutions $v$ whenever $\theta$ is such that either $\cos\theta_1$ or $\cos\theta_2$ are zero. We assume that $\Theta$ does not contain any such values. (Otherwise, the family would include boundary points and hence not be open.) Observe then that for any $i$ and $j$ the equation $\frac{\partial^2 f}{\partial \theta_i\, \partial \theta_j} = c_1 \frac{\partial f}{\partial \theta_1} + c_2 \frac{\partial f}{\partial \theta_2}$ has a solution; for instance, for $i = j = 1$ it has the solution $c_1 = -\tan\theta_1$, $c_2 = 0$. Therefore, the family is affine.
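The local test can be implemented numerically: estimate the first- and second-order partial derivatives by finite differences, then check by least squares whether each second derivative lies in the span of the first derivatives. The sketch below is my own (the function name `is_affine_at` is mine, not from the lectures); it flags a parabola as non-affine and a sine-coordinate family as affine at a non-singular point.

```python
import numpy as np

def is_affine_at(f, theta, h=1e-5, tol=1e-3):
    """Do second partials of f lie in the span of its first partials at theta?"""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    e = np.eye(k)
    # First-order partial derivatives (central differences), as columns.
    D1 = np.column_stack([(f(theta + h * e[i]) - f(theta - h * e[i])) / (2 * h)
                          for i in range(k)])
    ok = True
    for i in range(k):
        for j in range(k):
            # Second-order partial derivative d^2 f / d theta_i d theta_j.
            D2 = (f(theta + h * e[i] + h * e[j]) - f(theta + h * e[i] - h * e[j])
                  - f(theta - h * e[i] + h * e[j]) + f(theta - h * e[i] - h * e[j])
                  ) / (4 * h * h)
            # Least-squares residual of writing D2 as a combination of D1's columns.
            c, *_ = np.linalg.lstsq(D1, D2, rcond=None)
            ok &= np.linalg.norm(D1 @ c - D2) < tol
    return bool(ok)

parabola = lambda t: np.array([t[0], t[0] ** 2])           # (theta, theta^2)
sines    = lambda t: np.array([np.sin(t[0]), np.sin(t[1])])

assert not is_affine_at(parabola, [0.7])   # accelerations leave the span
assert is_affine_at(sines, [0.4, 1.2])     # test satisfied away from singular points
```

This is only a pointwise check at the chosen $\theta$; the theorem requires the condition to hold at every point of $\Theta$, together with the rank condition on the Jacobian.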
Gaussian Family (Example)
The above test will be applied to the Gaussian family $p_{\mu, v}(x) = \frac{1}{\sqrt{2\pi v}} \exp\left( -\frac{(x-\mu)^2}{2v} \right)$ where $\theta = (\mu, v)$ and $v = \sigma^2$ denotes the variance. To convert the problem into a vector space setting, first introduce the coordinate chart $f(\theta) = \log p_\theta = -\frac12 \log(2\pi v) - \frac{(x-\mu)^2}{2v}$. (Keep in mind that $f(\theta)$ is a vector in the vector space of all continuous functions of $x$ and is therefore written as a function of $x$.) We want to test if the second-order derivatives can be written as linear combinations of the first-order derivatives. Therefore, we calculate:
$$\frac{\partial f}{\partial \mu} = \frac{x-\mu}{v}, \qquad \frac{\partial f}{\partial v} = -\frac{1}{2v} + \frac{(x-\mu)^2}{2v^2},$$
$$\frac{\partial^2 f}{\partial \mu^2} = -\frac{1}{v}, \qquad \frac{\partial^2 f}{\partial \mu\, \partial v} = -\frac{x-\mu}{v^2}, \qquad \frac{\partial^2 f}{\partial v^2} = \frac{1}{2v^2} - \frac{(x-\mu)^2}{v^3}.$$
We must take into account though the “trick” of working with unnormalised densities. In full, it would mean augmenting $\theta$ with a third parameter $c$ and working with $f(\mu, v, c) = \log p_{\mu, v} + c$. This doesn’t change the above calculations, it merely augments the first-order derivatives with $\frac{\partial f}{\partial c} = 1$. (Note that $1$ represents the constant function whose value always equals $1$.) In other words, when we are endeavouring to write the second-order derivatives in terms of the first-order derivatives, we can augment the first-order derivatives with the constant function $1$. Basic but tedious linear algebra shows:
$$\frac{\partial^2 f}{\partial \mu^2} = -\frac{1}{v} \cdot 1, \qquad \frac{\partial^2 f}{\partial \mu\, \partial v} = -\frac{1}{v}\, \frac{\partial f}{\partial \mu}, \qquad \frac{\partial^2 f}{\partial v^2} = -\frac{2}{v}\, \frac{\partial f}{\partial v} - \frac{1}{2v^2} \cdot 1.$$
Here, keep in mind that $\theta = (\mu, v)$ is fixed and therefore the linear coefficients can depend on $\theta$ (but not on $x$). Since $Df(\theta)$ is injective for $v > 0$, the above calculation shows that the Gaussian family is indeed an exponential family.
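The linear-algebra step can be checked numerically by evaluating both sides on a grid of $x$ values: each second-order derivative of the Gaussian log-likelihood should equal a linear combination of the first-order derivatives and the constant function $1$, with coefficients depending only on $(\mu, v)$. A sketch of mine, with an arbitrary test point:

```python
import numpy as np

mu, v = 0.7, 2.3                  # v = sigma^2; arbitrary test point
x = np.linspace(-10, 10, 2001)    # grid of x values standing in for the function space

# First-order derivatives of log p (as functions of x) and the constant 1.
d_mu = (x - mu) / v
d_v  = -1 / (2 * v) + (x - mu) ** 2 / (2 * v ** 2)
one  = np.ones_like(x)

# Second-order derivatives.
d_mumu = -one / v
d_muv  = -(x - mu) / v ** 2
d_vv   = 1 / (2 * v ** 2) - (x - mu) ** 2 / v ** 3

# Each second-order derivative is a linear combination of d_mu, d_v and 1,
# with coefficients depending only on (mu, v), not on x.
assert np.allclose(d_mumu, (-1 / v) * one)
assert np.allclose(d_muv, (-1 / v) * d_mu)
assert np.allclose(d_vv, (-2 / v) * d_v - 1 / (2 * v ** 2) * one)
```

Repeating the check at other $(\mu, v)$ with $v > 0$ gives the same result, consistent with the family being exponential throughout its parameter range.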
The derivative computations below are carried out for three running examples: $f_1(x) = x^T A x$ for $x \in \mathbb{R}^n$, where $A$ is a fixed $n \times n$ matrix; $f_2(X) = \log \det X$, defined for positive-definite $X \in S$; and $f_3(X) = X^{-1}$, defined for invertible matrices $X$. Here, $S$ is the vector space of symmetric matrices and $I$ denotes the identity matrix.
First some formalities. A convenient level of generalisation is to work with functions between Banach spaces. Recall that a Banach space is a vector space equipped with a norm and which is complete with respect to the norm. (Completeness means every Cauchy sequence converges to a point. Since every finite-dimensional normed vector space is automatically complete, this technical condition can be ignored for the three examples above.)
The vector spaces used above – $\mathbb{R}^n$ and spaces of matrices – can be made into Banach spaces simply by choosing a norm. Any norm will do because on a finite-dimensional vector space, any two norms are equivalent; if a sequence converges with respect to one norm then it converges with respect to any other norm to the same point. In particular, the derivative will be the same. (In general, this need not be the case; changing the norm can change the derivative.)
Although there is an important distinction between the Fréchet derivative and the Gâteaux derivative, the trick in practice is simply to aim to calculate the Gâteaux derivative, that is, the directional derivative. Either showing that the directional derivatives fit together in the right way, or appealing to a higher level of reasoning, then allows the Fréchet derivative to be written down in terms of the Gâteaux derivative. All this will be explained presently. For the moment, let’s compute the directional derivatives of the above functions.
By definition, the directional derivative of $f$ at $x$ in the direction $v$ is
$$Df(x) \cdot v = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h},$$
where $h$ is a real scalar. Although there are standard rules (chain rule, product rule etc) that can be used to expedite the calculation, it is instructive to proceed from first principles.
For the first example, $f(x) = x^T A x$, we have
$$f(x + hv) = x^T A x + h\, \left( x^T A v + v^T A x \right) + h^2\, v^T A v,$$
from which it follows that $Df(x) \cdot v = x^T A v + v^T A x$. [If $A$ were symmetric, this simplifies to $2\, x^T A v$.]
Method 1: Consider $f(X) = \log\det(X)$. Either directly from the definition of determinant (Leibniz formula), or by writing the determinant recursively using the Laplace formula, it is straightforward to see that $\det(I + hV) = 1 + h\, \operatorname{tr}(V) + O(h^2)$. Furthermore, the Taylor series for log is $\log(1 + z) = z - \frac{z^2}{2} + \cdots$. Therefore,
$$\log\det(X + hV) = \log\det X + \log\det\left( I + h X^{-1} V \right) = \log\det X + h\, \operatorname{tr}\left( X^{-1} V \right) + O(h^2),$$
from which it follows that $Df(X) \cdot V = \operatorname{tr}\left( X^{-1} V \right)$.
Method 2: Since the determinant is the product of the eigenvalues and the trace is the sum of the eigenvalues, it is not surprising that $\log\det X = \operatorname{tr} \log X$. Therefore $\log\det(X + hV) = \operatorname{tr} \log(X + hV)$. By definition, matrix log is defined in terms of its Taylor series, $\log(I + W) = W - \frac{W^2}{2} + \cdots$, therefore,
$$\operatorname{tr} \log(X + hV) = \operatorname{tr} \log X + \operatorname{tr} \log\left( I + h X^{-1} V \right) = \operatorname{tr} \log X + h\, \operatorname{tr}\left( X^{-1} V \right) + O(h^2),$$
from which it follows that the directional derivative is $Df(X) \cdot V = \operatorname{tr}\left( X^{-1} V \right)$, in agreement with Method 1.
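Both methods give the same directional derivative, $\operatorname{tr}(X^{-1}V)$, which is easy to confirm against a finite-difference quotient. The quick sketch below is mine, using a randomly generated positive-definite $X$ and symmetric direction $V$:

```python
import numpy as np

rng = np.random.default_rng(3)

# A random symmetric positive-definite X and a symmetric direction V.
B = rng.normal(size=(4, 4))
X = B @ B.T + 4 * np.eye(4)
C = rng.normal(size=(4, 4))
V = C + C.T

f = lambda X: np.log(np.linalg.det(X))

# Central finite-difference directional derivative versus tr(X^{-1} V).
h = 1e-6
fd = (f(X + h * V) - f(X - h * V)) / (2 * h)
exact = np.trace(np.linalg.solve(X, V))

assert abs(fd - exact) < 1e-6
```

Using `np.linalg.solve(X, V)` rather than forming `X^{-1}` explicitly is the usual numerically preferable way to evaluate $X^{-1}V$.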
For the third example, $f(X) = X^{-1}$, the identity
$$(X + hV)\left( X^{-1} - h\, X^{-1} V X^{-1} \right) = I - h^2\, V X^{-1} V X^{-1}$$
shows that $(X + hV)^{-1} = X^{-1} - h\, X^{-1} V X^{-1} + O(h^2)$, from which it follows that $Df(X) \cdot V = -X^{-1} V X^{-1}$.
In all cases, observe that for a fixed $x$, the derivative $Df(x) \cdot v$ is a linear function of the direction $v$. Indeed, the Fréchet derivative is declared to exist at the point $x$ precisely when the directional derivatives are a (continuous and) linear function of the direction $v$. (Since all three functions above are analytic, the directional derivatives must be linearly related and therefore we could have known in advance that the Fréchet derivative exists.)
If $E$ and $F$ are two Banach spaces then $L(E; F)$, the set of continuous linear functions from $E$ to $F$, can itself be made into a Banach space (with the norm being the operator norm). The Fréchet derivative of a function $f \colon E \to F$ is a function $Df \colon E \to L(E; F)$. This notation indicates that the directional derivatives fit together linearly. The fact that $L(E; F)$ is itself a Banach space means that $Df$ itself can be differentiated in the same framework.
One common notation is to use a dot to denote the application of the linear operator $Df(x)$ to the direction $v$; for example, the Fréchet derivative of the function in the first example is $Df(x) \cdot v = x^T A v + v^T A x$.
A subsequent note will look at higher order derivatives and several rules for calculating Fréchet derivatives.