
## Information Geometry – Coordinate Independence (Lecture 6)

The next few lectures aim to provide an introduction to several basic concepts in differential geometry required for progressing our understanding of information geometry. Rather than commence with a definition of differential geometry, the idea of “coordinate independence” will be studied in the simpler setting of affine geometry first. Roughly speaking, differential geometry combines the notion of coordinate independence with the notion of gluing simpler things together to form more complicated objects.

Let $S$ be a set. It may represent all the points on an infinitely large sheet of paper, in which case one must resist the temptation to think of $S$ as a subset of $\mathbb{R}^3$ but rather, envisage $S$ as the whole universe; there is nothing other than $S$. Alternatively, $S$ might represent the set of all elephants in the world.

Consider first the case when $S$ is the sheet of paper. In fact, assume we all live on $S$; the world is flat. In order to write down where someone lives, we need a coordinate chart: an injective function $f: S \rightarrow \mathbb{R}^2$ which assigns to every point a unique pair of numbers, called the coordinates of the point. In order for this to be successful, everyone needs to use the same coordinate chart $f$. But given just $S$, no two people are likely to choose the same chart. How could they? Just for starters, it would necessitate someone drawing a big cross on the ground and declaring that everyone must consider that point to have coordinates $(0,0)$. Extra information beyond the set itself is required if different people are to be able to construct the same coordinate chart.

Sometimes, as we will now see, there is extra information available but it is not enough to determine a unique coordinate chart. If every person had a magnetic compass and a ruler then they could agree that moving one metre east must correspond to increasing the first coordinate by one, and moving one metre to the north must correspond to increasing the second coordinate by one. Two people’s charts will still differ in general, but only in the choice of origin. Although people would not be able to communicate where they live in absolute terms – saying I live at $(2,4)$ is no good to anyone else with a different coordinate chart – there is still a wealth of information that can be communicated. Saying that the difference in coordinates between my house and your house is $(5,2)$ is enough for you to find your way to my house; although it is likely our coordinate charts differ, the same answer is obtained no matter which chart is used. This is called coordinate independence.
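The compass-and-ruler situation can be checked numerically. The sketch below is only an illustration: it models points of $S$ as integer pairs in a hidden reference frame (an assumption made so that the computer has something to compute with; the inhabitants of $S$ have no access to it), and the names `make_chart` and `displacement` are hypothetical helpers, not anything from the lectures.

```python
# Points of S are modelled as integer pairs in a hidden reference frame.
# This frame is an assumption of the sketch; nobody living on S can see it.

def make_chart(origin):
    """Return a chart f: S -> R^2 whose (0, 0) sits at the given origin."""
    def chart(point):
        return (point[0] - origin[0], point[1] - origin[1])
    return chart

def displacement(chart, s1, s2):
    """Coordinates of s1 minus coordinates of s2, computed in a single chart."""
    a, b = chart(s1), chart(s2)
    return (a[0] - b[0], a[1] - b[1])

my_house, your_house = (7, 9), (2, 7)
f = make_chart(origin=(5, 5))    # my chart
g = make_chart(origin=(-3, 1))   # your chart, same compass and ruler

# Absolute coordinates depend on the chart ...
print(f(my_house), g(my_house))
# ... but the difference in coordinates between the two houses does not.
print(displacement(f, my_house, your_house),
      displacement(g, my_house, your_house))
```

In my chart I live at $(2,4)$, in yours I live somewhere else entirely, yet both charts report the same difference $(5,2)$ between the two houses.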

The more possibilities there are for the charts, the fewer coordinate independent properties there are. For example, if now people’s rulers are confiscated and they only have magnetic compasses, people’s coordinate charts can differ from each other’s in more ways. Saying I live $(5,2)$ away from your home will no longer work; my 5 units east will almost surely differ from your 5 units east. We could, however, still agree on whether a collection of trees lies in a straight line or not.

Precisely, every person may decide to define that a collection of trees $s_1,\ldots,s_n \in S$ lies in a straight line if, under their personal coordinate chart $f: S \rightarrow \mathbb{R}^2$, the images of the trees $f(s_1),\ldots,f(s_n)$ lie in a straight line. This definition works because even though two people’s coordinate charts may be different, their definitions of lying in a straight line turn out to be the same. We will see presently that this can be understood in terms of a simple concept called transition functions. Note that earlier we were implicitly thinking in terms of definitions too; we defined the location of your house $s_1$ relative to my house $s_2$ to be the vector $f(s_1) - f(s_2)$ and it was a useful definition whenever it was coordinate independent, as it was when we had both a compass and a ruler but not when we had a compass alone.
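Here is a small sketch of the compass-only case, under the same modelling assumption as before: points of $S$ are integer pairs in a hidden frame, and each person's chart rescales the two axes by their own positive factors and picks their own origin. The helper names `make_chart` and `collinear` are made up for this illustration.

```python
def make_chart(scale, offset):
    """Chart with a private ruler (positive scale per axis) and a private origin."""
    def chart(point):
        return (scale[0] * point[0] + offset[0],
                scale[1] * point[1] + offset[1])
    return chart

def collinear(coords):
    """True if all coordinate pairs lie on one straight line.

    Exact for integer input. Assumes the first two pairs are distinct.
    """
    (x0, y0), (x1, y1) = coords[0], coords[1]
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
               for x, y in coords[2:])

trees_in_line = [(0, 0), (1, 2), (2, 4), (5, 10)]
trees_scattered = [(0, 0), (1, 2), (2, 5)]

f = make_chart(scale=(1, 1), offset=(0, 0))    # my chart
g = make_chart(scale=(2, 3), offset=(1, -4))   # your chart

for trees in (trees_in_line, trees_scattered):
    # Both charts give the same verdict on "lies in a straight line".
    print(collinear([f(s) for s in trees]), collinear([g(s) for s in trees]))

# The displacement between the first two trees, however, is chart dependent.
print(f(trees_in_line[1]), g(trees_in_line[1]), g(trees_in_line[0]))
```

The two charts disagree about how far apart the trees are, but never about whether they are in a line.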

If $S$ represents a set of elephants then I might choose a coordinate chart $f: S \rightarrow \mathbb{R}$ by defining $f(s)$ to be the length of the trunk of elephant $s$. Tom might define his chart by measuring the length of the tail. We would not agree on the absolute size of an elephant, but if the ratio of trunk length to tail length is the same for every elephant then we would agree on what it means for one elephant to be twice as big as another elephant.

Let’s formalise the above mathematically. The set $\mathbb{R}^2$ can be given a lot of extra structure. We are used to thinking of it as a vector space – we know how to add two points together in a sensible and consistent way – and we commonly introduce a norm for measuring distance and sometimes even an inner product for measuring angles. If $f: S \rightarrow \mathbb{R}^2$ is a bijection then any structure we have on $\mathbb{R}^2$ can be transferred to $S$. We can make $S$ a vector space simply by defining $s_1 + s_2 = f^{-1}( f(s_1) + f(s_2) )$ and $\alpha s = f^{-1}(\alpha f(s))$, for instance.
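To make the transfer of structure concrete, the following sketch implements $s_1 + s_2 = f^{-1}( f(s_1) + f(s_2) )$ for a chart given by a choice of origin. As an illustration only, points of $S$ are again modelled as integer pairs in a hidden frame, and `make_chart` and `transferred_add` are hypothetical names. The sketch also runs the same construction with a second chart, to show that the resulting addition depends on which chart was used.

```python
def make_chart(origin):
    """Return a bijective chart f: S -> R^2 and its inverse."""
    def f(point):
        return (point[0] - origin[0], point[1] - origin[1])
    def f_inv(coords):
        return (coords[0] + origin[0], coords[1] + origin[1])
    return f, f_inv

def transferred_add(f, f_inv, s1, s2):
    """Addition on S pulled back through the chart: f^{-1}(f(s1) + f(s2))."""
    a, b = f(s1), f(s2)
    return f_inv((a[0] + b[0], a[1] + b[1]))

f, f_inv = make_chart(origin=(5, 5))
g, g_inv = make_chart(origin=(-3, 1))

s1, s2 = (7, 9), (2, 7)
sum_f = transferred_add(f, f_inv, s1, s2)
sum_g = transferred_add(g, g_inv, s1, s2)
print(sum_f, sum_g)  # two different points of S
```

With a single chart the definition is perfectly consistent; the trouble only starts when two people use different charts and expect to get the same point $s_1 + s_2$.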

Let $\mathcal{F}$ be a set of bijective functions of the form $f: S \rightarrow \mathbb{R}^2$. Each element of $\mathcal{F}$ represents a valid coordinate chart; in the language used earlier, each person uses their own coordinate chart and $\mathcal{F}$ is the set of all these coordinate charts. Unless $\mathcal{F}$ contains only a single coordinate chart, we can no longer transfer arbitrary structures from $\mathbb{R}^2$ to $S$ in a coordinate independent way; we saw examples of this earlier. What structures can be transferred?

A bit of thought reveals that the key is to study the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. Observe that $f \circ g^{-1}$ is a function from $\mathbb{R}^2$ to $\mathbb{R}^2$. We can therefore use the structure on $\mathbb{R}^2$ to determine what properties $f \circ g^{-1}$ has. For example, it might be that $\mathcal{F}$ is such that $f \circ g^{-1}$ is always a linear function; linearity is a property which can be defined in terms of the vector space structure on $\mathbb{R}^2$.

Recalling the earlier examples, when people had magnetic compasses and rulers, the transition functions $f \circ g^{-1}$ would always have the form $(f \circ g^{-1})(x) = x + v$ for some vector $v$. (Changing charts would cause $v$ to change; indeed, $v$ represents the difference in the choice of origin of the two charts.) When people only had magnetic compasses, $f \circ g^{-1}$ would be of the more general form $(f \circ g^{-1})(x) = D x + v$ for some positive diagonal matrix $D$. (Here, I have assumed that each person would build their own ruler, so everyone has a ruler but the rulers are of different lengths.)
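The form of these transition functions can be recovered numerically. The sketch below is an illustration under stated assumptions: points of $S$ are pairs in a hidden frame, a compass-only chart is modelled as a positive rescaling of each axis followed by a shift, and `Chart` and `transition` are hypothetical names. Exact rational arithmetic is used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

class Chart:
    """Chart s -> (a*s0 + v0, b*s1 + v1) with a, b > 0 (a compass-only chart)."""
    def __init__(self, scale, offset):
        self.scale = tuple(Fraction(c) for c in scale)
        self.offset = tuple(Fraction(c) for c in offset)

    def to_coords(self, point):
        return tuple(a * p + v
                     for a, p, v in zip(self.scale, point, self.offset))

    def from_coords(self, coords):
        return tuple((x - v) / a
                     for a, x, v in zip(self.scale, coords, self.offset))

def transition(f, g):
    """Return f o g^{-1} as a function from R^2 to R^2."""
    return lambda x: f.to_coords(g.from_coords(x))

f = Chart(scale=(2, 3), offset=(1, -4))
g = Chart(scale=(4, 1), offset=(0, 2))
T = transition(f, g)

# If T(x) = D x + v with D diagonal, then v = T(0) and diag(D) = T((1, 1)) - v.
v = T((0, 0))
D = tuple(t - c for t, c in zip(T((1, 1)), v))
print(D, v)

# Spot check of the claimed form at another point.
x = (8, 5)
print(T(x) == tuple(d * xi + vi for d, xi, vi in zip(D, x, v)))
```

If both charts are given the same scale (everyone shares a ruler), the recovered $D$ is the identity and only the shift $v$ remains.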

Linking in with previous lectures, the set $S$ can be made into an affine space by introducing a collection of coordinate charts $\mathcal{F}$ such that for any two charts $f,g \in \mathcal{F}$, their transition function $f \circ g^{-1}$ always has the form $(f \circ g^{-1})(x) = A x + b$ for some invertible matrix $A$ and vector $b$. (Invertibility of $A$ is automatic because the charts are bijections.) Because the image of a straight line under such a transition function remains a straight line, different people with different coordinate charts will still agree on what is and what is not a straight line in $S$. It is a worthwhile exercise to prove that this definition of an affine space is equivalent to the definition given in earlier lectures.
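The claim that such a transition function sends straight lines to straight lines takes only one line to verify. Write a line in $\mathbb{R}^2$ as $x(t) = p + t d$ with $d \neq 0$ (the symbols $p$, $d$ and $T$ are introduced here just for this check), and let $T(x) = A x + b$ be the transition function. Then

$$T(x(t)) = A(p + t d) + b = (A p + b) + t\,(A d).$$

Since $T$ is a bijection, $A$ is invertible and so $A d \neq 0$. The image is therefore again a straight line, passing through $A p + b$ with direction $A d$.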

To summarise, there is interest in playing the following mathematical game:

• We are given a set $S$ and a collection $\mathcal{F}$ of coordinate charts $f: S \rightarrow \mathbb{R}^n$.
• We want to give the set $S$ some structure coming from the structure on $\mathbb{R}^n$.
• We want to do this in a coordinate independent way, meaning that if I use my own coordinate chart and you use your own coordinate chart then we get the same structure on $S$.

The secret is to look at the form of the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. The more general the form of the transition functions, the less structure can be transferred from $\mathbb{R}^n$ to $S$ in a coordinate independent way.

The relevance to information geometry is that the parametrisation used to describe a family of densities $\{ p(x;\theta) \}$ is, to a large extent, irrelevant. Properties that depend on a particular choice of parametrisation are generally not as attractive as properties which are coordinate independent. If $\{ p(x;\mu,\sigma) \mid \sigma > 0 \}$ represents the family of Gaussian densities with mean $\mu$ and variance $\sigma^2$ then there is little justification in calling the subfamily $\{ p(x; 2t,5t+7) \mid t > 0\}$ a line segment because there does not appear to be anything special about the parametrisation of Gaussians by mean and variance. (It turns out that for exponential families, statisticians have come up with a set of parametrisations they believe to be nice. Although these parametrisations are not unique, their transition functions are always affine functions; this is why it was possible to introduce an affine structure in earlier lectures. Note that we have not pursued this to the end because we want to move quickly to a more powerful concept coming from differential geometry which will subsume this affine geometry.)
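A quick numerical check illustrates the point. The sketch rests on two assumptions that are not spelled out in the text above: the second argument of $p(x;\mu,\sigma)$ is read as the standard deviation $\sigma$, and the "nice" parametrisation is taken to be the usual natural parametrisation of the Gaussian family, $(\mu,\sigma) \mapsto (\mu/\sigma^2, -1/(2\sigma^2))$. The helper names are hypothetical.

```python
from fractions import Fraction

def natural_params(mu, sigma):
    """Map (mu, sigma) to (mu / sigma^2, -1 / (2 sigma^2)).

    Assumed here to be the 'nice' parametrisation of the Gaussian family.
    """
    var = Fraction(sigma) ** 2
    return (Fraction(mu) / var, Fraction(-1) / (2 * var))

def collinear(coords):
    """True if all coordinate pairs lie on one straight line (exact arithmetic).

    Assumes the first two pairs are distinct.
    """
    (x0, y0), (x1, y1) = coords[0], coords[1]
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
               for x, y in coords[2:])

# Three members of the subfamily { p(x; 2t, 5t + 7) : t > 0 }.
subfamily = [(2 * t, 5 * t + 7) for t in (1, 2, 3)]

in_mean_sd = collinear(subfamily)
in_natural = collinear([natural_params(mu, sigma) for mu, sigma in subfamily])
print(in_mean_sd, in_natural)
```

Under these assumptions the three densities are collinear in the $(\mu,\sigma)$ coordinates but not in the natural coordinates, so "being a line" in the $(\mu,\sigma)$ chart is not a property that survives this change of parametrisation.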

For completeness, note that a parametrisation is just the inverse of a coordinate chart. If we think of defining a family by specifying a function from $\theta \in \mathbb{R}^n$ to $p(\cdot;\theta)$ then we speak of a parametrisation. On the other hand, if someone points to a density $q(x)$ then we can determine its coordinates by asking what value of $\theta$ makes $q(x) = p(x;\theta)$.