Archive for July, 2010

## Information Geometry – Coordinate Independence (Lecture 6)

The next few lectures aim to provide an introduction to several basic concepts in differential geometry required to progress our understanding of information geometry. Rather than commence with a definition of differential geometry, the idea of “coordinate independence” will be studied first in the simpler setting of affine geometry. Roughly speaking, differential geometry combines the notion of coordinate independence with the notion of gluing simpler things together to form more complicated objects.

Let $S$ be a set. It may represent all the points on an infinitely large sheet of paper, in which case one must resist the temptation to think of $S$ as a subset of $\mathbb{R}^3$ but rather, envisage $S$ as the whole universe; there is nothing other than $S$. Alternatively, $S$ might represent the set of all elephants in the world.

Consider first the case when $S$ is the sheet of paper. In fact, assume we all live on $S$; the world is flat. In order to write down where someone lives, we need a coordinate chart: an injective function $f: S \rightarrow \mathbb{R}^2$ which assigns to every point a unique pair of numbers which we call the coordinates of the point. In order for this to be successful, everyone needs to use the same coordinate chart $f$. But given just $S$, no two people are likely to choose the same chart. How could they? Just for starters, it would necessitate someone drawing a big cross on the ground and declaring that everyone must consider that point to have coordinates $(0,0)$. Extra information beyond the set itself is required if different people are to be able to construct the same coordinate chart.

Sometimes, as we will now see, there is extra information available but it is not enough to determine a unique coordinate chart. If every person had a magnetic compass and a ruler then they could agree that moving one metre east must correspond to increasing the first coordinate by one, and moving one metre to the north must correspond to increasing the second coordinate by one. Two people’s charts will still differ in general, but only in the choice of origin. Although people would not be able to communicate where they live in absolute terms – saying I live at $(2,4)$ is no good to anyone else with a different coordinate chart – there is still a wealth of information that can be communicated. Saying that the difference in coordinates between my house and your house is $(5,2)$ is enough for you to find your way to my house; although it is likely our coordinate charts differ, the same answer is obtained no matter which chart is used. This is called coordinate independence.

The more possibilities there are for the charts, the fewer the number of coordinate independent properties. For example, if now people’s rulers are confiscated and they only have magnetic compasses, people’s coordinate charts can differ from each other’s in more ways. Saying I live $(5,2)$ away from your home will no longer work; my 5 units east will almost surely differ from your 5 units east. We could however, still agree on whether a collection of trees lies in a straight line or not.

Precisely, every person may decide to define that a collection of trees $s_1,\cdots,s_n \in S$ lies in a straight line if, under their personal coordinate chart $f: S \rightarrow \mathbb{R}^2$, the images of the trees $f(s_1),\cdots,f(s_n)$ lie in a straight line. This definition works because even though two people’s coordinate charts may be different, their definitions of lying in a straight line turn out to be the same. We will see presently that this can be understood in terms of a simple concept called transition functions. Note too that earlier we were implicitly thinking in terms of definitions; we defined the location of your house $s_1$ relative to my house $s_2$ to be the vector $f(s_1) - f(s_2)$, and it was a useful definition whenever it was coordinate independent, as it was when we had both a compass and a ruler but not when we had a compass alone.

If $S$ represents a set of elephants then I might choose a coordinate chart $f: S \rightarrow \mathbb{R}$ by defining $f(s)$ to be the length of the trunk of elephant $s$. Tom might define his chart by measuring the length of the tail. We would not agree on the absolute size of an elephant, but if the ratio of an elephant’s trunk length to its tail length is always the same fixed constant then we would agree on what it means for one elephant to be twice as big as another elephant.

Let’s formalise the above mathematically. The set $\mathbb{R}^2$ can be given a lot of extra structure. We are used to thinking of it as a vector space – we know how to add two points together in a sensible and consistent way – and we commonly introduce a norm for measuring distance and sometimes even an inner product for measuring angles. If $f: S \rightarrow \mathbb{R}^2$ is a bijection then any structure we have on $\mathbb{R}^2$ can be transferred to $S$. We can make $S$ a vector space simply by defining $s_1 + s_2 = f^{-1}( f(s_1) + f(s_2) )$ and $\alpha s = f^{-1}(\alpha f(s))$, for instance.
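As a toy illustration of transferring structure through a bijection, here is a short Python sketch; the encoding of the points of $S$ as strings is an arbitrary stand-in for an abstract set, and the particular chart $f$ is a made-up example.

```python
import numpy as np

# An abstract set S, modelled here as strings so S itself has no arithmetic.
# The chart f: S -> R^2 and its inverse are illustrative assumptions.

def f(s):
    """Coordinate chart: map a point of S to a pair of coordinates in R^2."""
    x, y = s.split(",")
    return np.array([float(x), float(y)])

def f_inv(v):
    """Inverse chart: map coordinates in R^2 back to the point of S."""
    return f"{v[0]},{v[1]}"

def add(s1, s2):
    """Vector addition transferred to S: s1 + s2 := f^{-1}(f(s1) + f(s2))."""
    return f_inv(f(s1) + f(s2))

def scale(alpha, s):
    """Scalar multiplication transferred to S: alpha.s := f^{-1}(alpha f(s))."""
    return f_inv(alpha * f(s))

print(add("1.0,2.0", "3.0,4.0"))  # the point whose coordinates are (4, 6)
print(scale(2.0, "1.0,2.0"))      # the point whose coordinates are (2, 4)
```

A different bijection $f$ would induce a different vector space structure on $S$; this is exactly the coordinate-dependence problem discussed next.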

Let $\mathcal{F}$ be a set of bijective functions of the form $f: S \rightarrow \mathbb{R}^2$. Each element of $\mathcal{F}$ represents a valid coordinate chart, or the way we had introduced it earlier, each person uses their own coordinate chart and $\mathcal{F}$ is the set of all these coordinate charts. Unless $\mathcal{F}$ contains only a single coordinate chart, we can no longer transfer arbitrary structures from $\mathbb{R}^2$ to $S$ in a coordinate independent way; we saw examples of this earlier. What structures can be transferred?

A bit of thought reveals that the key is to study the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. Observe that $f \circ g^{-1}$ is a function from $\mathbb{R}^2$ to $\mathbb{R}^2$. We can therefore use the structure on $\mathbb{R}^2$ to determine what properties $f \circ g^{-1}$ has, for example, it might be that $\mathcal{F}$ is such that $f \circ g^{-1}$ is always a linear function; linearity is a property which can be defined in terms of the vector space structure on $\mathbb{R}^2$.

Recalling the earlier examples, when people had magnetic compasses and rulers, the transition functions $f \circ g^{-1}$ would always have the form $f \circ g^{-1}(x) = x + v$ for some vector $v$. (Changing charts would cause $v$ to change; indeed, $v$ represents the difference in the choice of origin of the two charts.) When people only had magnetic compasses, $f \circ g^{-1}$ would be of the more general form $f \circ g^{-1}(x) = D x + v$ for some positive diagonal matrix $D$. (Here, I have assumed that each person would build their own ruler, so everyone has rulers, they are just of different lengths.)

Linking in with previous lectures, the set $S$ can be made into an affine space by introducing a collection of coordinate charts $\mathcal{F}$ such that for any two charts $f,g \in \mathcal{F}$, their transition function $f \circ g^{-1}$ always has the form $f \circ g^{-1}(x) = A x + b$ for some matrix $A$ and vector $b$. Because the image of a straight line under such a transition function remains a straight line, different people with different coordinate charts will still agree on what is and what is not a straight line in $S$. It is a worthwhile exercise to prove that this definition of an affine space is equivalent to the definition given in earlier lectures.
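A small numerical sanity check of this invariance; the matrix $A$, the vector $b$ and the coordinates of the trees below are arbitrary choices.

```python
import numpy as np

def collinear(pts, tol=1e-9):
    """Points lie on a straight line iff the differences p_i - p_0 span a
    subspace of dimension at most one."""
    d = np.array(pts) - pts[0]
    return np.linalg.matrix_rank(d, tol=tol) <= 1

# Three trees which, in my chart f, lie on a straight line.
trees_f = [np.array([0.0, 0.0]), np.array([1.0, 2.0]), np.array([2.0, 4.0])]

# Your chart g differs from mine by an affine transition g∘f^{-1}(x) = A x + b.
A = np.array([[2.0, 1.0], [0.5, 3.0]])   # any invertible matrix will do
b = np.array([-7.0, 4.0])
trees_g = [A @ p + b for p in trees_f]

# Both charts agree on collinearity.
print(collinear(trees_f), collinear(trees_g))
```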

To summarise, there is interest in playing the following mathematical game:

• We are given a set $S$ and a collection $\mathcal{F}$ of coordinate charts $f: S \rightarrow \mathbb{R}^n$.
• We want to give the set $S$ some structure coming from the structure on $\mathbb{R}^n$.
• We want to do this in a coordinate independent way, meaning that if I use my own coordinate chart and you use your own coordinate chart then we get the same structure on $S$.

The secret is to look at the form of the transition functions $f \circ g^{-1}$ for all pairs $f,g \in \mathcal{F}$. The more general the form of the transition functions, the less structure can be transferred from $\mathbb{R}^n$ to $S$ in a coordinate independent way.

The relevance to information geometry is that the parametrisation used to describe a family of densities $\{ p(x;\theta) \}$ is, to a large extent, irrelevant. Properties that depend on a particular choice of parametrisation are generally not as attractive as properties which are coordinate independent. If $\{ p(x;\mu,\sigma) \mid \sigma > 0 \}$ represents the family of Gaussian random variables parametrised by mean $\mu$ and variance $\sigma^2$ then there is little justification in calling the subfamily $\{ p(x; 2t,5t+7) \mid t > 0\}$ a line segment because there does not appear to be anything special about the parametrisation of Gaussians by mean and variance. (It turns out that for exponential families, statisticians have come up with a set of parametrisations they believe to be nice. Although these parametrisations are not unique, their transition functions are always affine functions; this is why it was possible to introduce an affine structure in earlier lectures. Note that we have not pursued this to the end because we want to move quickly to a more powerful concept coming from differential geometry which will subsume this affine geometry.)

For completeness, note that a parametrisation is just the inverse of a coordinate chart. If we think of defining a family by specifying a function from $\theta \in \mathbb{R}^n$ to $p(.;\theta)$ then we speak of a parametrisation. On the other hand, if someone points to a density $q(x)$ then we can determine its coordinates by asking what value of $\theta$ makes $q(x) = p(x;\theta)$.

## Information Geometry – Using the Fisher Information Matrix to Define an Inner Product (Lecture 5)

### Motivation

In Lecture 3, a local test for determining if a family was exponential was introduced. The last part of the test involved seeing if the second-order derivatives could be written as linear combinations of the first-order derivatives. As will be shown in a subsequent lecture, such calculations are sometimes easier to do if an inner product is introduced; we are therefore motivated to define an inner product on the space spanned by the first-order derivatives $\frac{\partial \log p}{\partial \theta_1},\cdots,\frac{\partial \log p}{\partial \theta_n}$.

Alternatively, we may simply be motivated to introduce some geometry into the family $\{ p(x;\theta) \}$. Precisely, pick a particular density $q$ from the family and consider two curves passing through $q$ at time zero. By this is meant $\gamma_1$ and $\gamma_2$ are two functions from $(-\epsilon,\epsilon)$ to parameter space, so that as $t$ varies, $p(x;\gamma_1(t))$ and $p(x;\gamma_2(t))$ trace out two curves in the space of probability densities, and it is required that the curves intersect at $q$ when $t=0$, namely $p(x;\gamma_1(0)) = p(x;\gamma_2(0)) = q$.

If the space of probability densities had a “geometry” then we should be able to say at what angle any two curves intersect. Mathematically, we wish to compute the inner product of the “velocity vector” of $p(x;\gamma_1(t))$ at $t=0$ with the “velocity vector” of $p(x;\gamma_2(t))$ at $t=0$.

We are inclined to work not with $p(x;\theta)$ directly but with $\log p(x;\theta)$; the mapping is bijective so nothing is lost or gained by doing this. It is simply the case that we know from previous work that we can interpret log-likelihoods as elements of a vector space (albeit a vector space who has forgotten where he placed his origin) and therefore we are comfortable differentiating $\log p(x;\theta)$ with respect to $\theta$. (See too the previous lecture on the Fisher Information Matrix.)

### An Inner Product

For any two curves $\gamma_1, \gamma_2$ intersecting at time $t=0$ we wish to define their inner product. Since $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is a linear function of $\gamma'(0)$, it suffices to write down the rule for computing the inner product in terms of $\gamma_1'(0)$ and $\gamma_2'(0)$. It is desirable to do this because $\gamma'(0)$ is a finite-dimensional vector whereas $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is infinite-dimensional.

The catch though is that if we work with $\gamma'(0)$ then we must ensure that we get the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ even if we change to a new parametrisation of the family $\{ p(x;\theta) \}$. Precisely, suppose $\{ q(x;\phi) \}$ represents the same family as $\{ p(x;\theta) \}$ but with respect to a different parametrisation $\phi$. (We assume there is a one-to-one correspondence between $\phi$ and $\theta$; given any $\phi$ there is a $\theta$ such that $q(x;\phi) = p(x;\theta)$, and given any $\theta$ there is a $\phi$ such that $q(x;\phi) = p(x;\theta)$ too.) If $\tilde\gamma_1, \tilde\gamma_2$ are such that $q(x;\tilde\gamma_1(t)) = p(x;\gamma_1(t))$ and $q(x;\tilde\gamma_2(t)) = p(x;\gamma_2(t))$ then we must obtain the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ as we do for $\langle \left.\frac{\partial \log q(x;\tilde\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log q(x;\tilde\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$.

It so happens that the Fisher Information Matrix can be used to define an inner product in a coordinate-independent way, meaning the same answer will be obtained regardless of how the family is parametrised. Before considering why we use the Fisher Information Matrix, let’s just see first how it can be used to do this.

Note that on a finite-dimensional vector space, every inner product $\langle x,y \rangle$ is of the form $\langle x,y \rangle = y^T Q x$ for some positive-definite symmetric matrix $Q$. The matrix $Q$ determines the inner product and the idea we will experiment with is using the Fisher Information matrix as the $Q$ matrix. To wit, we define:

$\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle = [\gamma_2'(0)]^T \mathcal{I}(\theta_0) [\gamma_1'(0)]$ where $\theta_0 = \gamma_1(0) = \gamma_2(0)$ is the point of intersection of the two curves at time $t=0$.

From its form, and assuming the Fisher Information matrix is positive-definite (as usual, technical assumptions are being omitted in order to focus attention on the higher-level details), it is clear that we have defined an inner product; the axioms of an inner product are satisfied. What we need to check is whether or not we get the same answer if we change the parametrisation of our family.

First, the way the Fisher Information matrix changes when we change parametrisations must be determined. From the aforementioned correspondence between $\theta$ and $\phi$ we can regard one as a function of the other and write: $\log q(x;\phi) = \log p(x;\theta(\phi))$. Differentiating this yields: $\frac{\partial \log q}{\partial \phi_i} = \sum_k \frac{\partial \log p}{\partial \theta_k} \frac{\partial \theta_k}{\partial \phi_i}$. It follows almost immediately that:

$\mathcal{I}(\phi) = \left[\frac{d\theta}{d\phi}\right]^T\,\mathcal{I}(\theta)\,\left[\frac{d\theta}{d\phi}\right]$ where the $ij$th entry of $\frac{d\theta}{d\phi}$ is $\frac{\partial \theta_i}{\partial \phi_j}$.

(As discussed in class, the way to understand this is to consider the special case of $\theta$ being a linear function of $\phi$ and writing down the relationship between the squared-error of estimating $\theta$ and the squared-error of estimating $\phi$ and recalling that the Fisher Information matrix is essentially the inverse of the (asymptotic) squared-error. That the Fisher Information matrix determines the asymptotic performance explains why higher-order derivatives of $\theta$ with respect to $\phi$ do not appear in the above formula.)

We may write $\tilde\gamma(t) = \phi( \gamma(t) )$ to signify the relationship between $\gamma$ and $\tilde\gamma$; they both trace out the same curve of probability densities, just with respect to different coordinates. Therefore, $\tilde\gamma'(0) = \frac{d\phi}{d\theta} \gamma'(0)$ $= \left[ \frac{d\theta}{d\phi} \right]^{-1} \gamma'(0)$.

Collating the above results shows that an inner product defined with respect to the Fisher Information matrix is indeed coordinate-independent:

$[\tilde\gamma_2'(0)]^T \mathcal{I}(\phi_0) [\tilde\gamma_1'(0)] = \left[ \frac{d\phi}{d\theta} \gamma_2'(0) \right]^T\,\left[ \frac{d\theta}{d\phi} \right]^T\,\mathcal{I}(\theta_0)\,\left[ \frac{d\theta}{d\phi} \right]\,\left[ \frac{d\phi}{d\theta} \gamma_1'(0) \right] = [\gamma_2'(0)]^T\,\mathcal{I}(\theta_0)\,[\gamma_1'(0)]$.
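To make this concrete, the following sketch checks the invariance numerically for the Gaussian family, parametrised once by $\theta = (\mu,\sigma)$ and once by $\phi = (\mu,v)$ with $v = \sigma^2$. The intersection point and the velocity vectors are arbitrary choices; the two Fisher Information matrices are the standard ones for these parametrisations.

```python
import numpy as np

def fisher_theta(mu, sigma):
    """Fisher Information of N(mu, sigma^2) in the (mu, sigma) parametrisation."""
    return np.diag([1 / sigma**2, 2 / sigma**2])

def fisher_phi(mu, v):
    """Fisher Information in the (mu, v) parametrisation, where v = sigma^2."""
    return np.diag([1 / v, 1 / (2 * v**2)])

# Two curves meeting at theta0 = (0, 2); their velocity vectors at t = 0.
theta0 = np.array([0.0, 2.0])
g1 = np.array([1.0, 0.5])    # gamma_1'(0)
g2 = np.array([-0.3, 1.0])   # gamma_2'(0)

# Reparametrise: phi(theta) = (mu, sigma^2), with Jacobian d(phi)/d(theta).
phi0 = np.array([theta0[0], theta0[1]**2])
dphi_dtheta = np.array([[1.0, 0.0], [0.0, 2 * theta0[1]]])
g1t = dphi_dtheta @ g1       # tilde-gamma_1'(0)
g2t = dphi_dtheta @ g2       # tilde-gamma_2'(0)

ip_theta = g2 @ fisher_theta(*theta0) @ g1
ip_phi = g2t @ fisher_phi(*phi0) @ g1t
print(ip_theta, ip_phi)      # the same inner product in both parametrisations
```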

The reader is urged to think carefully about what has actually been done. We have come close to putting an inner product on the infinite-dimensional space of all probability densities. Given any finite-dimensional family we can define an inner product using the Fisher Information matrix in a consistent way which depends only on the densities themselves and not on their parametrisations. (There is a small lie in the last sentence; arbitrary parametrisations are not permitted but rather, any two parametrisations we consider need to be “reasonably nice” with respect to each other, such as the mapping from one to the other being continuously differentiable. Such finer points will be elaborated on later in the course.)

### Justification

We have not endeavoured to justify why we want to use the Fisher Information matrix as an inner product; we have just demonstrated that because it transforms in the right way, we can use it to define an inner product if we so choose. It is appealing because it is coordinate-independent; it doesn’t matter how we parametrise our family, we get the same inner product being defined. This doesn’t automatically mean that it is a useful or sensible inner product though.

It turns out that it is both useful and sensible. That it is sensible comes from a deep property of being invariant with respect to sufficient statistics (this will be explained later in the course), and indeed, the Fisher metric (as we shall henceforth refer to the inner product coming from the Fisher Information matrix) is the essentially unique metric with this highly desirable property. That it is useful will be seen later when we gain a better understanding of what geometrical results become available by having an inner product structure.

## Information Geometry – Fisher Information Matrix (Lecture 4)

July 25, 2010

As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.

### Definition and Example

Associated with a family of probability densities $\{ p(x;\theta) \}$, where $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^k$, is a function $\mathcal{I}: \mathbb{R}^k \rightarrow \mathbb{R}^{k \times k}$ whose value $\mathcal{I}(\theta)$ at any point $\theta$ is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.

Precisely, the Fisher Information at $\theta$ is defined to be the matrix $\mathcal{I}(\theta)$ whose $ij$th entry equals $\mathbf{E}\left[ \frac{\partial \log p}{\partial \theta_i} \frac{\partial \log p}{\partial \theta_j} \right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons; it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\theta$ but known variance $\sigma^2$.

Let $p(x;\theta) = \frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}$. The log-likelihood is therefore $\log p = -\frac12\log 2\pi - \log\sigma -\frac12\left(\frac{x-\theta}{\sigma}\right)^2$. Differentiating yields $\frac{\partial \log p}{\partial \theta} = \frac{x-\theta}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p(x;\theta)$. Precisely, $\mathbf{E}[ g(x,\theta) ]$ is defined to be $\int g(x,\theta)\,p(x;\theta)\,dx$. Therefore the Fisher Information is:

$\mathcal{I}(\theta) = \int \frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}\,\frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}\,dx$.
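This integral can be checked numerically; the values $\theta = 0.7$ and $\sigma = 1.5$ below are arbitrary choices, and a simple Riemann sum on a truncated grid suffices because the integrand decays rapidly.

```python
import numpy as np

theta, sigma = 0.7, 1.5
x = np.linspace(theta - 10 * sigma, theta + 10 * sigma, 200001)
dx = x[1] - x[0]

p = np.exp(-0.5 * ((x - theta) / sigma)**2) / (np.sqrt(2 * np.pi) * sigma)
score = (x - theta) / sigma**2        # d(log p)/d(theta)

fisher = np.sum(score**2 * p) * dx    # Riemann sum approximating the integral
print(fisher, 1 / sigma**2)           # agree to many decimal places
```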

That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\theta$ and variance $\sigma^2$, the expectation of $(x-\theta)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}$ is $\frac1{\sigma^2}$. Stated formally:

$\mathcal{I}(\theta) = \frac1{\sigma^2}$.

Had we considered a family of multivariate Gaussian random variables with unknown mean $\theta$ but known covariance matrix $\Sigma$, that is $p(x;\theta) = (2\pi)^{-n/2} \left| \Sigma \right|^{-1/2} e^{-\frac12 (x-\theta)^T \Sigma^{-1} (x-\theta)}$, then the Fisher Information Matrix would have been

$\mathcal{I}(\theta) = \Sigma^{-1}$.

It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\theta$. As will be seen presently, it is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the accuracy with which we can expect to be able to estimate $\theta$ given an observation $x$. As the variance decreases, the amount of information increases.

Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the fact that the definition requires the log-likelihood function to be differentiated means that $\log p$ was already implicitly being treated as an element of a vector space.

### Interpretation

Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p(\cdot;\theta)$. This sequence of independent and identically distributed random variables will be denoted by $x_1,x_2,\cdots$. Given the first $N$ random variables $x_1,\cdots,x_N$, can we guess what $\theta$ is, and how accurate is our guess likely to be?

For finite $N$, there is almost never a single best method for estimating $\theta$. However, it is sensible to ask if there is an estimation rule which works well as $N \rightarrow \infty$.

Observe that the density of $x_1,\cdots,x_N$ is just a product of densities: $p(x_1,\cdots,x_N;\theta) = \prod_{i=1}^N p(x_i;\theta)$. It follows that the Fisher Information Matrix for $p(x_1,\cdots,x_N;\theta)$ is simply $N$ times the Fisher Information Matrix for $p(x;\theta)$. [Indeed, one of the reasons why it is called “information” is because it is additive.]

The maximum-likelihood estimate of $\theta$ given the single observation $x$ is the value of $\theta$ which maximises $p(x;\theta)$. (Here, note that the observed value of $x$ is substituted into the expression for $p(x;\theta)$ leaving just a function of $\theta$; it is this function which is maximised.) Let $\widehat{\theta}_N$ denote the maximum-likelihood estimate of $\theta$ based on the $N$ observations $x_1,\cdots,x_N$. (Whether we have $N$ observations or a single observation of dimension $N$ is one and the same thing; the maximum likelihood estimate is the “most likely” value of $\theta$ given by maximising $p(x_1,\cdots,x_N;\theta)$.) Under reasonably mild regularity conditions, it turns out that:

1. $\lim_{N \rightarrow \infty} \widehat{\theta}_N = \theta$;
2. $\lim_{N \rightarrow \infty} N\cdot\mathbf{E}\left[ (\widehat{\theta}_N - \theta) (\widehat{\theta}_N - \theta)^T \right] = \mathcal{I}^{-1}(\theta)$.

That is to say, the maximum-likelihood estimator is asymptotically unbiased and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac1N \mathcal{I}^{-1}(\theta)$.

Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of the samples goes to infinity.
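For the scalar Gaussian family with known variance, the maximum-likelihood estimate is the sample mean, so this asymptotic claim can be checked by simulation; the parameter values, sample size and number of trials below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
theta, sigma = 1.0, 2.0        # true mean; known standard deviation
N, trials = 500, 20000         # samples per experiment; number of experiments

# Each row is one experiment of N i.i.d. draws from N(theta, sigma^2).
x = rng.normal(theta, sigma, size=(trials, N))
theta_hat = x.mean(axis=1)     # maximum-likelihood estimate per experiment

# N * E[(theta_hat - theta)^2] should approximate I^{-1}(theta) = sigma^2.
scaled_mse = N * np.mean((theta_hat - theta)**2)
print(scaled_mse, sigma**2)
```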

There is more to the story. The Cramér-Rao Bound states that if $\widehat{\theta}$ is an unbiased estimator of $\theta$ then its performance must be lower-bounded by the inverse of the Fisher Information Matrix: $\mathbf{E}\left[ (\widehat{\theta} - \theta) (\widehat{\theta} - \theta)^T \right] \geq \mathcal{I}^{-1}(\theta)$ always holds (for an unbiased estimator). Therefore, no other estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramér-Rao Bound.)

Extensive literature is available on the Fisher Information Matrix, including on Wikipedia. I therefore curtail my remarks to the above.

## Information Geometry – Affine Geometry (Lecture 3)

The previous lecture looked at some of the lower-level details of affine spaces. This is a prerequisite for understanding the following higher-level overview, which in turn is a prerequisite for going back and understanding the lower-level details in a finer and precise manner. Following the review is a section on how calculus can be used to test if a family is affine.

## Brief Review of Affine Geometry

It is important to distinguish between an affine space, an affine subspace of a vector space and an affine subspace of an affine space. They are three different but related concepts. If any part of the following, however small, is not understood, the reader is invited to revisit the previous lecture.

An affine subspace of a vector space can be defined directly in terms of vector space concepts. A subset $U \subset V$ of a vector space $V$ is an affine subspace of $V$ if there exists a $u \in U$ such that $U - u = \{x-u \mid x \in U\}$ is a linear subspace of $V$.

A set $S$ can be made into an affine space if the difference between two points can be made to behave like a vector, meaning among other things that $(p - q) + (q - r) = (p - r)$. Therefore, there is an auxiliary vector space $V$ and the difference of any two points in $S$ is a vector in $V$.

Choosing an origin then makes the affine space $S$ into a vector space. In particular, if $V$ is the auxiliary vector space and $q \in S$ then $f: S \rightarrow V$ defined by $f(p) = p-q$ is a bijective function which makes $S$ into a vector space. I will refer to such a function $f$ as a coordinate chart.

Using coordinate charts is a convenient way to work with affine spaces; it should be remembered though that the choice of origin is arbitrary and hence only concepts which do not depend on the choice of origin should be considered.

A subset $A \subset S$ of an affine space $S$ is an affine subspace of an affine space if its image $f(A)$ under a coordinate chart $f: S \rightarrow V$ is an affine subspace of the vector space $V$. Note that if $q \in A$ and the coordinate chart $f(p) = p-q$ is used then $A$ is an affine subspace if and only if $f(A)$ is a linear subspace of $V$.

For the space of all (unnormalised) probability densities $S$, we have observed that defining $p-q$ to be the log-likelihood ratio $\log \frac{p}{q}$ makes $S$ into an affine space in such a way that a family is exponential if and only if it is affine.
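As a quick sanity check of this claim for the Gaussian family (an arbitrary but convenient example), the difference of two Gaussian log-densities is exactly a quadratic in $x$, i.e. a linear combination of the three fixed functions $1$, $x$ and $x^2$:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma)**2

x = np.linspace(-5.0, 5.0, 101)
diff = log_gauss(x, 1.0, 2.0) - log_gauss(x, -0.5, 1.5)   # log(p/q) on a grid

# Fit a quadratic; the residual is zero up to rounding error, confirming that
# the log-likelihood ratio stays in the span of 1, x and x^2.
coeffs = np.polyfit(x, diff, 2)
print(np.max(np.abs(np.polyval(coeffs, x) - diff)))       # essentially zero
```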

Just as a linear subspace of a vector space is itself a vector space in its own right, an affine subspace of an affine space is itself an affine space. It is therefore convenient at times to refer to affine subspaces as affine spaces.

## A Local Test for a Family to be Exponential

##### Open Subsets of Affine Spaces

As hinted at in the previous lecture, an exponential family might form only an (open and) convex subset of an affine subspace. To develop a straightforward test to see if a family is exponential, it is convenient to test first to see if the subset $A$ is an open subset of an affine subspace. Openness is a strong enough property to ensure that the subset has the “same dimension” as the affine subspace yet is weak enough to lend itself to the development of a local test involving only derivatives. (Open sets will be explained later in the course; for the moment, little will be lost by merely thinking of an open set as a set having the property that given any point in the set, we can move a little bit in any direction around that point and still stay within the set.)

It is noted tangentially that a prevalent concept in mathematics is the recognition of scenarios where “local implies global”. Anyone who has tried to prove that a (smooth) function is convex directly from the global condition that the straight line segment connecting any two points on the graph of the function must never drop below the function will appreciate the equivalent local condition that the second-order derivative of the function must be non-negative. The first condition is said to be global because it involves testing pairs of distinct points. The second condition is said to be local because to test the condition at a point $x$ merely involves examining the function in an arbitrarily small neighbourhood of $x$ and computing a limit.

##### Derivation

A local test for whether or not a subset $A$ of an affine space $S$ is an open subset of an affine subspace $B$ is now derived. (Precisely, we are given $A \subset S$ and are asked whether or not there exists an affine subspace $B \subset S$ such that $A$ is contained in $B$ and $A$ is open relative to $B$.) For the derivation, it suffices to study the special case of when the affine space is $\mathbb{R}^n$. Therefore, assume that $\{ p(\theta) \}$ is a collection of points in $\mathbb{R}^n$ indexed by the parameter $\theta \in \mathbb{R}^k$. For example:

1. $p(\theta) = M \theta + b$ for some matrix $M \in \mathbb{R}^{n \times k}$ and vector $b \in \mathbb{R}^n$;
2. $p(\theta) = (\cos \theta_1, \sin \theta_2, \cos \theta_1 + \sin \theta_2, 0, \cdots, 0)$ where $\theta = (\theta_1,\theta_2) \in \mathbb{R}^2$;
3. $p(\theta) = (\theta, \theta^2, \theta^3, \cdots)$ where $\theta \in \mathbb{R}$.

In the first example, the space is affine whereas in the third example, provided $n > 1$, the space is not affine. The second example is perhaps the most interesting of the three.

Choose an arbitrary straight line in $\mathbb{R}^k$.  As $\theta$ moves along this line, a curve $p(\theta)$ is traced out in $\mathbb{R}^n$.  In example one above, a straight line is traced out in $\mathbb{R}^n$; this is the simplest possible case. In example two, a Lissajous curve is traced out. In example three, if $n=2$ then a parabola is traced out.

The first observation to make is that if $A$ were an open subset of an affine subspace $B$ of $\mathbb{R}^n$ then the velocity vector of any curve traced out in $A$ by varying $\theta$ would belong to $B$. Furthermore, under the regularity condition which we will be assuming for convenience, namely that the Jacobian matrix $\frac{dp}{d\theta}$ always has rank $k$, for any point $q \in A$ and any vector $v \in B$ (thought of as a vector whose base is at $q$), it would be possible to construct a curve $\theta(t)$ such that $p(\theta(0)) = q$ and $(p \circ \theta)'(0) = v$. In words, any vector in $B$ can be thought of as a velocity vector of some curve $p(\theta(t))$.

In principle, we could choose any two points $q,s \in A$, evaluate all possible velocity vectors of curves passing through $q$ and compare them with all possible velocity vectors of curves passing through $s$. If we always obtained the same set of velocity vectors (by first shifting the base points to the origin to enable a comparison, as is always implicitly done in linear algebra) then we would have shown that $A$ is an open subset of an affine subspace $B$, with openness coming from the regularity condition snuck in above. However, this is still a global test because we must compare distinct points $q$ and $s$.

Since equivalence is transitive, it suffices to test only nearby points with each other. In the limit, this motivates considering the acceleration vector $(p \circ \theta)''(0)$. If $A$ were an open subset of an affine subspace $B$ then the acceleration vector would always lie in $B$.  Furthermore, a candidate $B$ is afforded by the span of the velocity vectors at any given point.

This leads to the following test for determining if a family is affine. Let $\Theta$ be a non-empty open subset of $\mathbb{R}^k$ and $p: \Theta \rightarrow \mathbb{R}^n$ a twice continuously differentiable function whose derivative $Dp(\theta)$ is injective for all $\theta \in \Theta$. (In other words, the Jacobian matrix of $p$ has rank $k$ at all points $\theta \in \Theta$.) Then $\{ p(\theta) \}$ is an open subset of a $k$-dimensional affine subspace of $\mathbb{R}^n$ if and only if, for all $\theta \in \Theta$ and $x \in \mathbb{R}^k$, there exists a $y \in \mathbb{R}^k$ such that $D^2p(\theta) \cdot (x,x) = Dp(\theta) \cdot y$.  The following examples hopefully serve to clarify the notation.  (In words, the test is to see if the second-order partial derivatives of $p$ with respect to $\theta$ can be written as linear combinations of the first-order partial derivatives.)

If $p(\theta) = M \theta + b$ then $Dp(\theta) \cdot y = My$ and $D^2p(\theta) = 0$. The regularity condition that $Dp$ is injective is satisfied if $M$ has rank $k$.  (If $M$ has lower rank then the affine subspace $\{ p(\theta) \}$ would have dimension strictly lower than $k$.) The test $D^2p(\theta) \cdot (x,x) = Dp(\theta) \cdot y$ is satisfied by the trivial choice $y=0$. The conclusion is that the family is affine.

If $p(\theta) = (\theta, \theta^2)$ then $Dp(\theta) \cdot y = \lim_{t \rightarrow 0} \frac{p(\theta + t y) - p(\theta)}t = (y, 2 \theta y)$ and $D^2p(\theta) \cdot (x,x) = (0, 2x^2)$.  The equation $(y, 2\theta y) = (0, 2x^2)$ has no solution $y$ given any $\theta$ and non-zero $x$, therefore the family is not affine. (This shows that the general case in the third example above is also not affine because if it were, then its projection onto its first two coordinates would also be affine.)

If $p(\theta) = (\cos \theta_1, \sin \theta_2, \cos \theta_1 + \sin \theta_2)$ then $Dp(\theta) \cdot y = ( - y_1 \sin \theta_1, y_2 \cos \theta_2, - y_1 \sin \theta_1 + y_2 \cos \theta_2 )$ and $D^2p(\theta) \cdot (x,x) = -( x_1^2 \cos \theta_1, x_2^2 \sin \theta_2, x_1^2 \cos \theta_1 + x_2^2 \sin \theta_2 )$. Note that there are values of $\theta$ for which $Dp(\theta)$ is not injective; the equation $Dp(\theta) \cdot y = 0$ has non-zero solutions in $y$ whenever $\theta$ is such that either $\sin \theta_1$ or $\cos \theta_2$ are zero. We assume that $\Theta$ does not contain any such values.  (Otherwise, the family $\{ p(\theta) \}$ would include boundary points and hence not be open.) Observe then that for any $\theta \in \Theta$ and any $x \in \mathbb{R}^2$ the equation $( - y_1 \sin \theta_1, y_2 \cos \theta_2, - y_1 \sin \theta_1 + y_2 \cos \theta_2 )$ $= -( x_1^2 \cos \theta_1, x_2^2 \sin \theta_2, x_1^2 \cos \theta_1 + x_2^2 \sin \theta_2 )$ has the solution $y_1 = x_1^2 \cot \theta_1$, $y_2 = -x_2^2 \tan \theta_2$. Therefore, the family is affine.
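The test also lends itself to a numerical sanity check: estimate $Dp(\theta)$ and $D^2p(\theta) \cdot (x,x)$ by finite differences and ask, via least squares, whether the second-order term lies in the column space of the Jacobian. The sketch below (NumPy; the helper names are my own, not from the lecture) applies this to the parabola and to the Lissajous family.

```python
import numpy as np

def jacobian(p, theta, h=1e-5):
    """Central finite-difference Jacobian Dp(theta), shape (n, k)."""
    k = theta.size
    cols = []
    for i in range(k):
        e = np.zeros(k)
        e[i] = h
        cols.append((p(theta + e) - p(theta - e)) / (2 * h))
    return np.column_stack(cols)

def second_directional(p, theta, x, h=1e-4):
    """Central finite-difference estimate of D^2 p(theta).(x, x)."""
    return (p(theta + h * x) - 2 * p(theta) + p(theta - h * x)) / h**2

def affine_test(p, theta, x, tol=1e-3):
    """True if D^2 p(theta).(x,x) is (numerically) a linear combination
    of the columns of Dp(theta), i.e. the local affine test passes."""
    J = jacobian(p, theta)
    a = second_directional(p, theta, x)
    y, *_ = np.linalg.lstsq(J, a, rcond=None)
    return np.linalg.norm(J @ y - a) < tol

# The parabola (example three with n = 2) fails the test.
parabola = lambda th: np.array([th[0], th[0]**2])
print(affine_test(parabola, np.array([0.5]), np.array([1.0])))  # False

# The Lissajous family (example two) passes, away from the
# parameter values at which Dp loses rank.
lissajous = lambda th: np.array([np.cos(th[0]), np.sin(th[1]),
                                 np.cos(th[0]) + np.sin(th[1])])
print(affine_test(lissajous, np.array([0.7, 0.3]), np.array([1.0, -2.0])))  # True
```

The least-squares residual plays the role of asking whether a suitable $y$ exists; a residual well above the finite-difference noise floor rules it out.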

##### Gaussian Family (Example)

The above test will be applied to the Gaussian family $p(x;\theta) = \frac1{\sqrt{2\pi} \sigma} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2 }$ where $\theta = (\mu,\sigma)$. To convert the problem into a vector space setting, first introduce the coordinate chart $f(p) = \log \frac{p(x;\theta)}{p(x;(0,1))} = -\frac12\left( 2 \log \sigma + \left(\frac{x-\mu}{\sigma}\right)^2 - x^2\right)$. (Keep in mind that $f(p)$ is a vector in the vector space of all continuous functions on $\mathbb{R}$ and is therefore written as a function in $x$.) We want to test if the second-order derivatives can be written as linear combinations of the first-order derivatives. Therefore, we calculate:

$\frac{\partial f}{\partial \mu} = \frac{x-\mu}{\sigma^2}$;

$\frac{\partial f}{\partial \sigma} = -\frac1\sigma + \frac{(x-\mu)^2}{\sigma^3}$;

$\frac{\partial^2 f}{\partial \mu^2} = -\frac1{\sigma^2}$;

$\frac{\partial^2 f}{\partial \mu \partial \sigma} = -2 \frac{x-\mu}{\sigma^3}$;

$\frac{\partial^2 f}{\partial \sigma^2} = \frac1{\sigma^2} - \frac{3(x-\mu)^2}{\sigma^4}$.

We must take into account though the “trick” of working with unnormalised densities. In full, it would mean augmenting $\theta$ with a third parameter $c$ and working with $\tilde p(x;(\mu,\sigma,c)) = c\,p(x;(\mu,\sigma))$. This doesn’t change the above calculations; it merely augments the first-order derivatives with $\frac{\partial f}{\partial c} = 1/c$. (Note that $1/c$ represents the constant function whose value always equals $1/c$.) In other words, when we are endeavouring to write the second-order derivatives in terms of the first-order derivatives, we can augment the first-order derivatives with the constant function $1$. Basic but tedious linear algebra shows:

$\frac{\partial^2 f}{\partial \mu^2} = 0\cdot\frac{\partial f}{\partial \mu} + 0\cdot\frac{\partial f}{\partial \sigma} + \frac{-1}{\sigma^2}\cdot 1$;

$\frac{\partial^2 f}{\partial \mu \partial \sigma} = \frac{-2}{\sigma}\cdot\frac{\partial f}{\partial \mu} + 0\cdot\frac{\partial f}{\partial \sigma} + 0\cdot1$;

$\frac{\partial^2 f}{\partial \sigma^2} = 0\cdot\frac{\partial f}{\partial \mu} + \frac{-3}\sigma\cdot \frac{\partial f}{\partial \sigma} + \frac{-2}{\sigma^2}\cdot 1$.

Here, keep in mind that $\theta$ is fixed and therefore the linear coefficients can depend on $\theta$ (but not on $x$). Since $Df(\theta)$ is injective for $\sigma > 0$, the above calculation shows that the Gaussian family is affine in these coordinates, that is, the Gaussian family is indeed an exponential family.
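The partial derivatives and the three linear combinations above are easy to verify symbolically. A sketch using SymPy (variable names are mine; the $x^2$ term is constant in $\theta$ and does not affect the derivatives):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# log-density relative to the standard normal
f = -sp.log(sigma) - (x - mu)**2 / (2 * sigma**2) + x**2 / 2

f_mu = sp.diff(f, mu)        # (x - mu)/sigma^2
f_sigma = sp.diff(f, sigma)  # -1/sigma + (x - mu)^2/sigma^3

# each second-order derivative minus its claimed expansion in
# {f_mu, f_sigma, 1}, with coefficients depending on theta only
checks = [
    sp.diff(f, mu, 2) - (-1 / sigma**2),
    sp.diff(f, mu, sigma) - (-2 / sigma) * f_mu,
    sp.diff(f, sigma, 2) - ((-3 / sigma) * f_sigma - 2 / sigma**2),
]
print([sp.simplify(c) for c in checks])  # [0, 0, 0]
```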

## Coordinate-independent Approach to Differentiation

Using partial derivatives is a coordinate-based approach to differentiating functions and can quickly become messy and tedious if matrices are involved. In many cases the following coordinate-independent approach is simpler. This brief note will use the following three examples to demonstrate the elegance of the coordinate-independent approach.
1. $f: \mathbb{R}^{n \times m} \rightarrow \mathbb{R}, f(X) = \mathrm{tr}\{X^T A X\}.$
2. $f: \mathbb{R}^{n \times n} \rightarrow \mathbb{R}, f(X) = \log\det(X).$
3. $f: \mathbb{R}^{n \times m} \rightarrow S^{m \times m}, f(X) = X^T X - I.$

Here, $S^{m \times m} = \{ Y \in \mathbb{R}^{m \times m} \mid Y^T = Y\}$ is the vector space of $m \times m$ symmetric matrices and $I$ denotes the identity matrix.

First some formalities. A convenient level of generalisation is to work with functions between Banach spaces. Recall that a Banach space is a vector space equipped with a norm and complete with respect to that norm. (Completeness means every Cauchy sequence converges to a point. Since every finite-dimensional normed vector space is automatically complete, this technical condition can be ignored for the three examples above.)

The vector spaces used above – $\mathbb{R}, \mathbb{R}^{n \times m}, \mathbb{R}^{n \times n}$ and $S^{m \times m}$ – can be made into Banach spaces simply by choosing a norm. Any norm will do because on a finite-dimensional vector space, any two norms are equivalent; if a sequence converges with respect to one norm then it converges with respect to any other norm to the same point. In particular, the derivative will be the same no matter which norm is chosen.  (In infinite dimensions this need not be the case; replacing a norm by an inequivalent one can change whether a function is differentiable.)

Although there is an important distinction between the Fréchet derivative and the Gâteaux derivative, the trick in practice is simply to aim to calculate the Gâteaux derivative, that is, the directional derivative. Showing that the directional derivatives fit together in the right way, or appealing to a higher level of reasoning, then allows the Fréchet derivative to be written down in terms of the Gâteaux derivative.  All this will be explained presently.  For the moment, let’s compute the directional derivatives of the above functions.

By definition, the directional derivative of $f$ at $X$ in the direction $Z$ is $g'(0) = \lim_{t \rightarrow 0} \frac{g(t)-g(0)}{t}$ where $g(t) = f(X+tZ).$  Although there are standard rules (chain rule, product rule etc) that can be used to expedite the calculation, it is instructive to proceed from first principles.

#### Example 1

$f: \mathbb{R}^{n \times m} \rightarrow \mathbb{R}, f(X) = \mathrm{tr}\{X^T A X\}.$

$g(t) - g(0) = \mathrm{tr}\{(X+tZ)^T A (X+tZ)\} - \mathrm{tr}\{X^T A X\}$ $= t\,\mathrm{tr}\{X^T A Z\} + t\,\mathrm{tr}\{Z^T A X\} + t^2\,\mathrm{tr}\{Z^T A Z\}$ from which it follows that $g'(0) = \mathrm{tr}\{X^T A Z\} + \mathrm{tr}\{Z^T A X\}.$ [If $A$ were symmetric, this simplifies to $2 \mathrm{tr}\{X^T A Z\}$.]
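The closed form can be checked against the definition of $g'(0)$ by a central finite difference. A NumPy sketch (random test matrices; the names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, m))
Z = rng.standard_normal((n, m))  # the direction

f = lambda X: np.trace(X.T @ A @ X)

# central difference approximation of g'(0) with g(t) = f(X + tZ)
t = 1e-6
numeric = (f(X + t * Z) - f(X - t * Z)) / (2 * t)
closed = np.trace(X.T @ A @ Z) + np.trace(Z.T @ A @ X)
print(abs(numeric - closed) < 1e-5)  # True
```

Because $g$ is quadratic in $t$, the central difference is exact up to floating-point rounding.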

#### Example 2

$f: \mathbb{R}^{n \times n} \rightarrow \mathbb{R}, f(X) = \log\det(X).$

Method 1: $g(t) - g(0) = \log \det(X+tZ) - \log \det X$ $= \log \det( X (I + tX^{-1}Z) ) - \log \det X$ $= \log \det(I + t X^{-1} Z).$ Consider $\det(I + t Y).$ Either directly from the definition of determinant (Leibniz formula), or by writing the determinant recursively using the Laplace formula, it is straightforward to see that $\det(I + t Y) = 1 + t\,\mathrm{tr}\{Y\} + O(t^2).$ Furthermore, the Taylor series for log is $\log(1+x) = x + O(x^2).$ Therefore, $\log \det(I + t X^{-1} Z) = t\,\mathrm{tr}\{X^{-1} Z\} + O(t^2)$ from which it follows that $g'(0) = \mathrm{tr}\{X^{-1} Z\}.$

Method 2: Since the determinant is the product of the eigenvalues and the trace is the sum of the eigenvalues, it is not surprising that $\log\det A = \mathrm{tr} \log A.$ Therefore $f(X) = \mathrm{tr} \log X.$  The matrix logarithm is defined (in a neighbourhood of the identity) by its Taylor series, so $\log(I + tX^{-1}Z) = t X^{-1} Z + O(t^2).$ Although $\log(AB) = \log A + \log B$ fails for non-commuting matrices, the identity survives after taking traces, because $\mathrm{tr}\log = \log\det$ and the determinant is multiplicative. Thus $\mathrm{tr} \log(X+tZ) - \mathrm{tr} \log X$ $= \mathrm{tr} \log(I + tX^{-1}Z)$ $= t\,\mathrm{tr}\{ X^{-1} Z \} + O(t^2)$ from which it follows that the directional derivative is $\mathrm{tr}\{X^{-1} Z\}.$
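Both methods predict the directional derivative $\mathrm{tr}\{X^{-1} Z\}$, which is again easy to confirm by a finite difference (NumPy sketch; names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# keep X well-conditioned with positive determinant so log det is defined
X = np.eye(n) + 0.1 * rng.standard_normal((n, n))
Z = rng.standard_normal((n, n))

f = lambda X: np.log(np.linalg.det(X))

t = 1e-6
numeric = (f(X + t * Z) - f(X - t * Z)) / (2 * t)
closed = np.trace(np.linalg.solve(X, Z))  # tr{X^{-1} Z} without forming X^{-1}
print(abs(numeric - closed) < 1e-5)  # True
```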

#### Example 3

$f: \mathbb{R}^{n \times m} \rightarrow S^{m \times m}, f(X) = X^T X - I.$

$g(t) - g(0) = (X+tZ)^T (X+tZ) - X^T X$ $= t (Z^T X + X^T Z) + t^2 (Z^T Z)$ from which it follows that $g'(0) = Z^T X + X^T Z.$
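As before, the matrix-valued derivative $g'(0) = Z^T X + X^T Z$ can be checked entry-wise by a finite difference (NumPy sketch; names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 3
X = rng.standard_normal((n, m))
Z = rng.standard_normal((n, m))

f = lambda X: X.T @ X - np.eye(m)

# central difference of the matrix-valued g(t) = f(X + tZ)
t = 1e-6
numeric = (f(X + t * Z) - f(X - t * Z)) / (2 * t)
closed = Z.T @ X + X.T @ Z
print(np.allclose(numeric, closed, atol=1e-6))  # True
```

Note that the derivative is a symmetric matrix, consistent with $f$ taking values in $S^{m \times m}$.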

In all cases, observe that for a fixed $X$, the derivative is a linear function of the direction $Z$. Indeed, the Fréchet derivative is declared to exist at the point $X$ precisely when the directional derivatives are a (continuous and) linear function of the direction $Z$. (Since all three functions above are analytic, the directional derivatives automatically depend linearly on the direction, and therefore we could have known in advance that the Fréchet derivative exists.)

If $V$ and $W$ are two Banach spaces then $L(V;W)$, the set of continuous linear functions from $V$ to $W$, can itself be made into a Banach space (with the norm being the operator norm). The Fréchet derivative of a function $f: V \rightarrow W$ is a function $Df : V \rightarrow L(V;W).$ This notation indicates that the directional derivatives fit together linearly. The fact that $L(V;W)$ is itself a Banach space means that $Df$ itself can be differentiated in the same framework.

One common notation is to use a dot to denote the application of the linear operator $Df(X)$ applied to the direction $Z$, for example, the Fréchet derivative of the function $f$ in the first example is $Df(X) \cdot Z = \mathrm{tr}\{X^T A Z\} + \mathrm{tr}\{Z^T A X\}.$

A subsequent note will look at higher order derivatives and several rules for calculating Fréchet derivatives.