
## Information Geometry – An Affine Geometry of Exponential Families (Lecture 2)

Underlying information geometry is the question of what it means to put a geometry on a set of probability distributions. Ultimately, we will be viewing sets of probability distributions as manifolds with metrics and connections having statistical relevance. (Metrics and connections are Riemannian geometric concepts determining, among other things, how distance is measured and which curves should be thought of as “straight lines”, that is, curves with zero acceleration.) As a warm-up exercise, we ask a considerably simpler question: can we impose an affine geometry on (a sufficiently large subset of) the set of all probability distributions such that all exponential families of distributions appear as affine subspaces? (Definitions will be given presently.) Our only motivation for wishing to do this, beyond its being a warm-up for considering what a “geometry” actually is, is that such a geometry leads to a simple geometric test for determining whether a given family is exponential. On the one hand, then, this geometry does have a use, albeit a relatively straightforward one, and there must be something geometrically special about exponential families, because there are other collections of families that could not have been shoehorned into an affine structure. On the other hand, it should not be interpreted as *the* geometry of probability distributions; there are other geometries to consider too.

In a vector space, an affine subspace is what results by taking a linear subspace and translating its origin to another point. The simplest example is a straight line. Recall that a straight line in a vector space is defined to be a curve $\gamma(t)$ of the form $\gamma(t) = u + tv$ where $u$ and $v$ are two vectors from the space. What if we are given a set $S$? How can we define what a straight line is in $S$?

We could arbitrarily decree that certain collections of points in $S$ form straight lines but if these straight lines don’t “fit together” in a nice way, we wouldn’t have a systematic structure to work with, and our definition would not be a useful one. That said, there is not just a single structure which is the right structure for “straight lines”; the message is merely that some structure is required for our definition to be useful. (An example of structure for straight lines is that, in a vector space, two straight lines are either parallel or they intersect at precisely one point; this is because the geometry is Euclidean. There are non-Euclidean geometries in which straight lines have different properties.)

One way to define straight lines (in a useful way) on $S$ is to impose a (useful) vector space structure on $S$ then apply the aforementioned definition of a straight line on a vector space. A vector space has an origin though, so if $S$ represents a set of probability distributions, which probability distribution should we choose to be the origin? Sadly, there is no distinguished probability distribution which would serve well as an origin. (We will see presently there is a way round this though!) An alternative would be to make $S$ into a manifold, define a statistically meaningful connection on the manifold and then use this connection to form curves of zero acceleration which we could think of as straight lines (especially if we restrict attention to submanifolds which are flat with respect to the connection). It is this latter approach which will be pursued in subsequent lectures, but it is overkill for now.

Observe that the definition of a straight line in a vector space doesn’t actually depend on the choice of origin. This should be clear from a simple diagram but it is instructive to write this down rigorously. If $L \subset \mathbb{R}^2$ is a straight line in the vector space formed from $\mathbb{R}^2$ with the origin at $(0,0)$ then $L$ remains a straight line in the vector space formed from $\mathbb{R}^2$ with the origin at $(3,5)$. Although the vector space operations have changed – in the first vector space $(1,2) + (3,5) = (4,7)$ whereas in the second vector space $(1,2) + (3,5) = (1,2)$ – any set of the form $L = \{u+tv\mid t \in \mathbb{R} \}$ in the first vector space can be written in the same form in the second vector space. [If this is not clear, use $\oplus$ and $\odot$ to denote vector space addition and scalar multiplication in the second vector space and show that for any $u,v \in \mathbb{R}^2$ there exist $w,x \in \mathbb{R}^2$ such that $u+tv = w \oplus (t \odot x)$. Hint: If $p$ denotes the origin of the second vector space then $x \oplus y = (x-p)+(y-p)+p = x+y-p$ and $t \odot x = t(x-p)+p$.]
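The bracketed exercise can be checked numerically. Below is a minimal sketch (assuming numpy); the names `oplus` and `odot` are hypothetical helpers standing for $\oplus$ and $\odot$, the operations of the second vector space with origin $p$. Following the hint, choosing $w = u$ and $x = v + p$ makes $u + tv = w \oplus (t \odot x)$ for every $t$:

```python
import numpy as np

p = np.array([3.0, 5.0])  # origin of the second vector space

def oplus(x, y):
    # addition relative to origin p: x (+) y = (x - p) + (y - p) + p = x + y - p
    return x + y - p

def odot(t, x):
    # scalar multiplication relative to origin p: t (.) x = t(x - p) + p
    return t * (x - p) + p

u = np.array([1.0, 2.0])
v = np.array([4.0, -1.0])

# The line u + t v in the first vector space is also a line in the second:
# with w = u and x = v + p we have w (+) (t (.) x) = u + t v for all t.
w, x = u, v + p
for t in [-2.0, 0.0, 0.5, 3.0]:
    assert np.allclose(u + t * v, oplus(w, odot(t, x)))
```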

Roughly speaking, an affine space is a vector space that has forgotten where its origin is. The affine space captures all the structure it possibly can from the absent-minded vector space. It can capture the property of straight lines because we have just seen that these can be defined in a way which does not depend on where the origin actually is.

The sophisticated way to proceed would be to define a collection of coordinate charts, each chart mapping $S$ to a vector space, and every pair of charts related by an affine transformation. This would give us an affine geometry on $S$. Instead though, we will proceed in a conceptually simpler but more arduous way. (In a subsequent lecture we will return to the sophisticated approach because that approach generalises to manifolds and other structures.)

Concretely, let $S$ be the set of all strictly positive and continuous functions on $\mathbb{R}$. It includes, for example, the functions $p(x) = \frac1{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ which are the probability densities for Gaussian random variables with mean $\mu$ and variance $\sigma^2$. How can we put a (sensible) affine structure onto $S$ such that, among other things, the Gaussian random variables lie in a two-dimensional affine subspace of $S$? And firstly, what is an affine structure on a set? (We could have made $S$ larger by including all measurable functions but this is merely a distraction. The fact that $S$ contains functions which do not integrate to $1$ and are therefore not probability densities is a convenience which will turn out later not to affect things in any significant way.)

Taking away the origin from a vector space means vector space addition and scalar multiplication are no longer defined; they change as the origin changes, as we saw earlier. Intuitively, what does not change is our ability to define direction. For example, we can say that $v$ is arrived at by starting at $u$ and moving three steps North and two steps East. Precisely, we express this by saying that the truth of the statement $v-u = y-x$ does not depend on where the origin is; change the origin and both sides change by the same amount. The basic idea (which may or may not work) is that although individual points in an affine space cannot be thought of as vectors, perhaps their differences can be. [It is more efficient to guess that this might work and then check that it really does than to try to guarantee beforehand that it will work.]

This motivates us to try putting an affine structure onto $S$ by introducing an auxiliary vector space $V$ to represent the difference of any two points in $S$. For every pair $p_1, p_2 \in S$ we must associate a point $p_1 - p_2 \in V$. [Precisely, we must define a function $s: S \times S \rightarrow V$. Writing $p_1 - p_2$ is shorthand for $s(p_1,p_2)$. ]

Merely defining $p_1 - p_2$ does not endow $S$ with sufficient structure though. It turns out that we get the full amount of structure possible on $S$ by requiring $p_1 - p_2$ to behave in the following sensible ways: $(p_3 - p_2) + (p_2 - p_1) = p_3 - p_1$ holds for all $p_1,p_2,p_3 \in S$; $p_2 - p_1 = 0$ implies $p_2 = p_1$; for all $v \in V$ and $p_1 \in S$ there exists a $p_2 \in S$ such that $p_2 - p_1 = v$. (Note that “$+$” is vector space addition in $V$ whereas “$-$” is the peculiar operation defined earlier.) [To derive this, assume that $S = V + q$ for some unknown $q \in V$ and write down all the properties you can think of that are independent of $q$, then check what you end up with is sensible.]

Returning to the job at hand, that of defining an affine structure on $S$, what works is the following. For $p_1 , p_2 \in S$, define $p_2 - p_1 = \ln\frac{p_2}{p_1}$. Statisticians will recognise the right-hand side as the log-likelihood ratio which is ubiquitous in statistical inference. This fulfils our requirements as now shown. Let $V$ be the vector space of all continuous functions $f: \mathbb{R} \rightarrow \mathbb{R}$ (where vector addition and scalar multiplication are defined pointwise). Firstly, $(p_3-p_2)+(p_2-p_1) = \ln\frac{p_3}{p_2} + \ln\frac{p_2}{p_1} = p_3 - p_1$ as required. If $p_2 - p_1 = 0$ then $\ln\frac{p_2}{p_1}=0$ so $p_2=p_1$. Given a $v \in V$ and $p_1 \in S$ define $p_2 = e^v p_1$. [This means $p_2(x) = e^{v(x)}p_1(x)$.] Then $p_2 - p_1 = \ln p_2 - \ln p_1 = v$, thus completing the verification that the above axioms for an affine structure are satisfied. Importantly, the choice of the log-likelihood ratio means that the family of Gaussian distributions is affine, as now explained.
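The verification above can also be carried out numerically. Here is a minimal sketch (assuming numpy), representing elements of $S$ and $V$ by their values on a grid; the helper `diff` is a hypothetical name for the operation $p_2 - p_1 = \ln\frac{p_2}{p_1}$:

```python
import numpy as np

# Functions in S (strictly positive on R) represented by values on a grid.
xs = np.linspace(-5, 5, 1001)
p1 = np.exp(-xs**2 / 2)          # unnormalised Gaussian
p2 = np.exp(-(xs - 1)**2 / 4)    # another strictly positive function
p3 = 1.0 / (1.0 + xs**2)         # and another

def diff(q, p):
    # the "difference" q - p in S: the log-likelihood ratio, an element of V
    return np.log(q / p)

# First axiom: (p3 - p2) + (p2 - p1) = p3 - p1
assert np.allclose(diff(p3, p2) + diff(p2, p1), diff(p3, p1))

# Third axiom: given v in V and p1 in S, p2 = e^v p1 satisfies p2 - p1 = v
v = np.sin(xs)
p2_new = np.exp(v) * p1
assert np.allclose(diff(p2_new, p1), v)
```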

A straight line in $S$ is a collection of points of the form $\{q \in S \mid q-p = t v, t \in \mathbb{R}\}$. Indeed, this is the line passing through $p \in S$ in the direction $v \in V$. Generalising this, we define an affine subspace of $S$ to be a subset of the form $\{q \in S \mid q-p \in W\}$ for some linear subspace $W$ of $V$. Take a particular Gaussian density, say the standard Gaussian $p(x) = \frac1{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$. Let $W$ be the three-dimensional subspace of $V$ spanned by the basis functions $1$, $x$ and $x^2$. That is to say, the elements of $W$ are the quadratic polynomials $a x^2 + bx + c$. Choose an element $a x^2 + bx + c \in W$. The axioms of an affine space ensure there is a unique $q$ such that $q-p = ax^2+bx+c$. Indeed, a calculation reveals that $q(x) = \frac1{\sqrt{2\pi}} e^{c+\frac{b^2}{2(1-2a)}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ where $\mu = b/(1-2a)$ and $\sigma^2 = 1/(1-2a)$. That is to say, provided we do not care about the normalising factor required to make a density integrate to unity, we are content to claim that the Gaussian family of unnormalised densities is a three-dimensional affine subspace of $S$. [The third dimension arises because the scaling factor can be freely chosen.] Importantly, every (unnormalised) exponential family forms an affine subspace of $S$, and conversely, every (finite-dimensional) affine subspace of $S$ is an exponential family. With brute force, we could have made the Gaussian densities affine in many different ways; the key feature is that we found a way which works for all exponential families simultaneously. This is only possible because exponential families have an inherent geometric structure: for example, it is a prerequisite for our approach that the intersection of two exponential families be an exponential family, because we know that the intersection of two affine subspaces is affine.
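The calculation producing $q$ can be sanity-checked numerically. A minimal sketch (assuming numpy), comparing $q = e^{ax^2+bx+c}\,p$ against the claimed closed form with $\mu = b/(1-2a)$ and $\sigma^2 = 1/(1-2a)$, for an arbitrary quadratic with $a < 1/2$:

```python
import numpy as np

xs = np.linspace(-10, 10, 2001)
p = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)  # standard Gaussian density

a, b, c = 0.1, 0.7, -0.3                     # arbitrary quadratic with a < 1/2
q = np.exp(a * xs**2 + b * xs + c) * p       # the unique q with q - p = ax^2 + bx + c

# Closed form claimed in the text
mu = b / (1 - 2 * a)
sigma2 = 1 / (1 - 2 * a)
q_closed = (np.exp(c + b**2 / (2 * (1 - 2 * a))) / np.sqrt(2 * np.pi)
            * np.exp(-(xs - mu)**2 / (2 * sigma2)))

assert np.allclose(q, q_closed)
```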

Working with unnormalised densities is not uncommon. Here, it is used to overcome the fact that normalised densities do not form an affine subspace, in the same way that a circle, representing normalised vectors in a plane, is not a linear subspace. In fact, some further unpleasantness has been glossed over. The assignment $\sigma^2 = 1/(1-2a)$ may lead to $\sigma^2$ being negative (or undefined if $a=1/2$). The effect is that the resulting density integrates to infinity. Should we wish to pursue this path, what would save us is the result that the set of parameters on which the densities have a finite integral is convex. This is sufficiently nice to work with. The alternative is to switch to Riemannian geometry, which is powerful enough to allow the existence of extrinsically curved surfaces that are intrinsically flat, the cylinder being just one example. Indeed, in a subsequent lecture we will see that there are statistically meaningful connections with respect to which the exponential families are flat.

The next lecture will show how this affine geometry can be used to derive a test for determining if a family is exponential or not. The eager reader may wish to consider how we could have worked out in advance that we should use the definition $p_2 - p_1 = \ln\frac{p_2}{p_1}$ to define the affine geometry. The first chapter of Murray and Rice’s book on Differential Geometry and Statistics may prove useful.

The best way to think of a tangent vector is as something which represents a direction along which a directional derivative can be computed. In Euclidean space, given a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and a point $p$, we are familiar with the directional derivative $Df(p) \cdot v$ in the direction $v$, where $v$ belongs to $\mathbb{R}^n$. On a manifold $M$, we wish to do the same thing, that is, be able to specify a direction and compute a derivative in that direction of a function $f: M \rightarrow \mathbb{R}$. It turns out we can do this, but the direction $v$ must now belong to the tangent space $T_pM$.
For Euclidean space, $T_p\mathbb{R}^n$ is isomorphic to $\mathbb{R}^n$ and hence it is easy to “confuse” points $p$ and tangent vectors $v$. But they should be treated as two very different things.
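The Euclidean directional derivative $Df(p) \cdot v$ described above can be illustrated concretely. A minimal sketch (assuming numpy; the function $f$ and the finite-difference helper are illustrative choices, not anything from the lecture), comparing a central finite-difference approximation against the analytic gradient:

```python
import numpy as np

def f(p):
    # a smooth function f: R^2 -> R, here f(x, y) = x^2 sin(y)
    return p[0]**2 * np.sin(p[1])

def directional_derivative(f, p, v, h=1e-6):
    # central finite-difference approximation of Df(p) . v
    return (f(p + h * v) - f(p - h * v)) / (2 * h)

p = np.array([1.5, 0.8])   # a point
v = np.array([2.0, -1.0])  # a direction (tangent vector at p)

# Analytic gradient of f: Df = (2x sin y, x^2 cos y)
grad = np.array([2 * p[0] * np.sin(p[1]), p[0]**2 * np.cos(p[1])])
assert np.isclose(directional_derivative(f, p, v), grad @ v)
```

Note that `p` and `v` are both stored as arrays in $\mathbb{R}^2$, which is exactly the "confusion" the text warns about: on a general manifold the point and the direction live in different spaces.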