Home > Informal Classroom Notes > Information Geometry – Affine Geometry (Lecture 3)

Information Geometry – Affine Geometry (Lecture 3)

The previous lecture looked at some of the lower-level details of affine spaces. This is a prerequisite for understanding the following higher-level overview, which in turn is a prerequisite for going back and understanding the lower-level details in a finer and precise manner. Following the review is a section on how calculus can be used to test if a family is affine.

Brief Review of Affine Geometry

It is important to distinguish between an affine space, an affine subspace of a vector space and an affine subspace of an affine space. They are three different but related concepts. If any part of the following, however small, is not understood, the reader is invited to revisit the previous lecture.

An affine subspace of a vector space can be defined directly in terms of vector space concepts. A subset U \subset V of a vector space V is an affine space if there exists a u \in U such that U - u = \{x-u \mid x \in U\} is a linear subspace of V.

A set S can be made into an affine space if the difference between two points can be made to behave like a vector, meaning among other things that (p - q) + (q - r) = (p - r). Therefore, there is an auxillary vector space V and the difference of any two points in S is a point in V.

Choosing an origin then makes the affine space S into a vector space. In particular, if V is the auxillary vector space and q \in S then f: S \rightarrow V defined by f(p) = p-q is a bijective function which makes S into a vector space. I will refer to such a function f as a coordinate chart.

Using coordinate charts is a convenient way to work with affine spaces; it should be remembered though that the choice of origin is arbitrary and hence only concepts which do not depend on the choice of origin should be considered.

A subset A \subset S of an affine space S is an affine subspace of an affine space if its image f(A) under a coordinate chart f: S \rightarrow V is an affine subspace of the vector space V. Note that if q \in A and the coordinate chart f(p) = p-q is used then A is an affine subspace if and only if f(A) is a linear subspace of V.

For the space of all (unnormalised) probability densities S, we have observed that defining p-q to be the log-likelihood ratio \frac{\log p}{\log q} makes S into an affine space in such a way that a family is exponential if and only if it is affine.

Just as a linear subspace of a vector space is itself a vector space in its own right, an affine subspace of an affine space is itself an affine space. It is therefore convenient at times to refer to affine subspaces as affine spaces.

A Local Test for a Family to be Exponential

Open Subsets of Affine Spaces

As hinted at in the previous lecture, an exponential family might form only an (open and) convex subset of an affine subspace. To develop a straightforward test to see if a family is exponential, it is convenient to test first to see if the subset A is an open subset of an affine subspace. Openness is a strong enough property to ensure that the subset has the “same dimension” as the affine subspace yet is weak enough to lend itself to the development of a local test involving only derivatives.  (Open sets will be explained later in the course; for the moment, little will be lost by merely thinking of an open set as a set having the property that given any point in the set, we can move a little bit in any direction around that point and still stay within the set.)

It is noted tangentially that a prevalent concept in mathematics is the recognition of scenarios where “local implies global”. Anyone who has tried to prove that a (smooth) function is convex directly from the global condition that the straight line segment connecting any two points on the graph of the function must never drop below the function will appreciate the equivalent local condition that the second order derivative of the function must be non-negative. The first condition is said to be global because it involves testing pairs of distinct points.  The second condition is said to be local because to test the condition at a point x merely involves examining the function in an arbitrarily small neighbourhood of x and computing a limit.

Derivation

A local test for whether or not a subset A of an affine space S is an open subset of an affine subset B is now derived. (Precisely, we are given A \subset S and are asked whether or not there exists an affine subspace B \subset S such that A is contained in B and A is open relative to B.) For the derivation, it suffices to study the special case of when the affine space is \mathbb{R}^n. Therefore, assume that \{ p(\theta) \} is a collection of points in \mathbb{R}^n indexed by the parameter \theta \in \mathbb{R}^k. For example:

  1. p(\theta) = M \theta + b for some matrix M \in \mathbb{R}^{n \times k} and vector b \in \mathbb{R}^n;
  2. p(\theta) = (\cos \theta_1, \sin \theta_2, \cos \theta_1 + \sin \theta_2, 0, \cdots, 0) where \theta = (\theta_1,\theta_2) \in \mathbb{R}^2;
  3. p(\theta) = (\theta, \theta^2, \theta^3, \cdots) where \theta \in \mathbb{R}.

In the first example, the space is affine whereas in the third example, provided n > 1, the space is not affine. The second example is perhaps the most interesting of the three.

Choose an arbitrary straight line in \mathbb{R}^k.  As \theta moves along this line, a curve p(\theta) is traced out in \mathbb{R}^n.  In example one above, a straight line is traced out in \mathbb{R}^n; this is the simplest possible case. In example two, a Lissajous curve is traced out. In example three, if n=2 then a parabola is traced out.

The first observation to make is that if A were an open subset of an affine subspace B of \mathbb{R}^n then the velocity vector of any curve traced out in A by varying \theta would belong to B. Furthermore, under the regularity condition which we will be assuming for convenience, namely that the Jacobian matrix \frac{dp}{d\theta} always has rank k, then for any point q \in A and any vector v \in B (thought of as a vector whose base is at q), it would be possible to construct a curve \theta(t) such that \theta(0) = q and (p \circ \theta)'(0) = v.  In words, any vector in B can be thought of as a velocity vector of some curve p(\theta(t)).

In principle, we could choose any two points q,s \in A, evaluate all possible velocity vectors of curves passing through q and compare them with all possible velocity vectors of curves passing through s. If we always obtained the same set of velocity vectors (by first shifting the base points to the origin to enable a comparison, as is always implicitly done in linear algebra) then we would have shown that A is an open (due to the above regularity condition which was snuck in) subset of an affine subspace B. However, this is still a global test because we must compare distinct points q and s.

Since equivalence is transitive, it suffices to test only nearby points with each other. In the limit, this motivates considering the acceleration vector (p \circ \theta)''(0). If A were an open subset of an affine subspace B then the acceleration vector would always lie in B.  Furthermore, a candidate B is afforded by the span of the velocity vectors at any given point.

This leads to the following test for determining if a family is affine. Let \Theta be a non-empty open subset of \mathbb{R}^k and p: \Theta \rightarrow \mathbb{R}^n a twice continuously differentiable function whose derivative Dp(\theta) is injective for all \theta \in \Theta. (In other words, the Jacobian matrix of p has rank k at all points \theta \in \Theta.) Then \{ p(\theta) \} is an open subset of a k-dimensional affine subspace of \mathbb{R}^n if and only if, for all \theta \in \Theta and x \in \mathbb{R}^k, there exists a y \in \mathbb{R}^k such that D^2p(\theta) \cdot (x,x) = Dp(\theta) \cdot y.  The following examples hopefully serve to clarify the notation.  (In words, the test is to see if the second-order partial derivatives of p with respect to \theta can be written as linear combinations of the first-order partial derivatives.)

If p(\theta) = M \theta + b then Dp(\theta) \cdot y = My and D^2p(\theta) = 0. The regularity condition that Dp is injective is satisfied if M has rank k.  (If M has lower rank then the affine subspace \{ p(\theta) \} would have dimension strictly lower than k.) The test D^2p(\theta) \cdot (x,x) = Dp(\theta) \cdot y is satisfied by the trivial choice y=0. The conclusion is that the family is affine.

If p(\theta) = (\theta, \theta^2) then Dp(\theta) \cdot y = \lim_{t \rightarrow 0} \frac{p(\theta + t y) - p(\theta)}t = (y, 2 \theta y) and D^2p(\theta) \cdot (x,x) = (0, 2x^2).  The equation (y, 2\theta y) = (0, 2x^2) has no solution y given any \theta and non-zero x, therefore the family is not affine. (This shows that the general case in the third example above is also not affine because if it were, then its projection onto its first two coordinates would also be affine.)

If p(\theta) = (\cos \theta_1, \sin \theta_2, \cos \theta_1 + \sin \theta_2) then Dp(\theta) \cdot y = ( - y_1 \sin \theta_1, y_2 \cos \theta_2, - y_1 \sin \theta_1 + y_2 \cos \theta_2 ) and D^2p(\theta) \cdot (x,x) = -( x_1^2 \cos \theta_1, x_2^2 \sin \theta_2, x_1^2 \cos \theta_1 + x_2^2 \sin \theta_2 ). Note that there are values of \theta for which Dp(\theta) is not injective; the equation Dp(\theta) \cdot y = 0 has non-zero solutions in y whenever \theta is such that either \sin \theta_1 or \cos \theta_2 are zero. We assume that \Theta does not contain any such values.  (Otherwise, the family \{ p(\theta) \} would include boundary points and hence not be open.) Observe then that for any \theta \in \Theta and any x \in \mathbb{R}^2 the equation ( - y_1 \sin \theta_1, y_2 \cos \theta_2, - y_1 \sin \theta_1 + y_2 \cos \theta_2 ) = -( x_1^2 \cos \theta_1, x_2^2 \sin \theta_2, x_1^2 \cos \theta_1 + x_2^2 \sin \theta_2 ) has the solution y_1 = x_1^2 \cot \theta_1, y_2 = -x_2^2 \tan \theta_2. Therefore, the family is affine.

Gaussian Family (Example)

The above test will be applied to the Gaussian family p(x;\theta) = \frac1{\sqrt{2\pi} \sigma} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2 } where \theta = (\mu,\sigma). To convert the problem into a vector space setting, first introduce the coordinate chart f(p) = \frac{\log p(x;\theta)}{\log p(x;(0,1))} = -\frac12( 2 \log \sigma + \left(\frac{x-\mu}{\sigma}\right)^2 + x^2). (Keep in mind that f(p) is a vector in the vector space of all continuous functions on \mathbb{R} and is therefore written as a function in x.) We want to test if the second-order derivatives can be written as linear combinations of the first-order derivatives. Therefore, we calculate:

\frac{\partial f}{\partial \mu} = \frac{x-\mu}\sigma ;

\frac{\partial f}{\partial \sigma} = -\frac1\sigma + \frac{(x-\mu)^2}{\sigma^3};

\frac{\partial^2 f}{\partial \mu^2} = -\frac1{\sigma^2};

\frac{\partial^2 f}{\partial \mu \partial \sigma} = -2 \frac{x-\mu}{\sigma^3} ;

\frac{\partial^2 f}{\partial \sigma^2} = \frac1{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4}.

We must take into account though the “trick” of working with unnormalised densities. In full, it would mean augmenting \theta with a third parameter c and working with \tilde p(x;(\mu,\sigma,c)) = c\,p(x;(\mu,\sigma)). This doesn’t change the above calculations, it merely augments the first-order derivatives with \frac{\partial f}{\partial c} = 1/c. (Note that 1/c represents the constant function whose value always equals 1/c.) In other words, when we are endeavouring to write the second-order derivatives in terms of the first-order derivatives, we can augment the
first-order derivatives with the constant function 1. Basic but tedious linear algebra shows:

\frac{\partial^2 f}{\partial \mu^2} = 0\cdot\frac{\partial f}{\partial \mu} + 0\cdot\frac{\partial f}{\partial \sigma} + \frac{-1}{\sigma^2}\cdot 1;

\frac{\partial^2 f}{\partial \mu \partial \sigma} = \frac{-2}{\sigma}\cdot\frac{\partial f}{\partial \mu} + 0\cdot\frac{\partial f}{\partial \sigma} + 0\cdot1;

\frac{\partial^2 f}{\partial \sigma^2} = 0\cdot\frac{\partial f}{\partial \mu} + \frac{-3}\sigma\cdot \frac{\partial f}{\partial \sigma} + \frac{-2}{\sigma^2}\cdot 1.

Here, keep in mind that \theta is fixed and therefore the linear coefficients can depend on \theta (but not on x). Since Df(\theta) is injective for \sigma > 0, the above calculation shows that the Gaussian family is indeed an exponential family.

Advertisements
  1. Sam Livingstone
    January 23, 2013 at 10:47 pm

    Firstly, thanks for posting these tutorials, they’re very useful! You’ve switched notation slightly between the last lecture and this one, shouldn’t the log-likelihood ratio be log (p/q) rather than log(p)/log(q), or am I missing something? From your Gaussian example at the end it looks like you’ve taken the distance as log(p/q). I think that you should also have a “-x^2” at the end of your log-likelihood ratio for the Gaussian example, rather than a “+x^2”, though it doesn’t actually matter since this term disappears in all of the derivatives.

    I look forward to reading on!

  1. July 26, 2010 at 11:46 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: