
## Information Geometry – Using the Fisher Information Matrix to Define an Inner Product (Lecture 5)

### Motivation

In Lecture 3, a local test for determining if a family was exponential was introduced. The last part of the test involved seeing if the second-order derivatives could be written as linear combinations of the first-order derivatives. As will be shown in a subsequent lecture, such calculations are sometimes easier to do if an inner product is introduced; we are therefore motivated to define an inner product on the space spanned by the first-order derivatives $\frac{\partial \log p}{\partial \theta_1},\cdots,\frac{\partial \log p}{\partial \theta_n}$.

Alternatively, we may simply be motivated to introduce some geometry into the family $\{ p(x;\theta) \}$. Precisely, pick a particular density $q$ from the family and consider two curves passing through $q$ at time zero. By this is meant that $\gamma_1$ and $\gamma_2$ are two functions from $(-\epsilon,\epsilon)$ to parameter space, so that as $t$ varies, $p(x;\gamma_1(t))$ and $p(x;\gamma_2(t))$ trace out two curves in the space of probability densities, and it is required that the curves intersect at $q$ when $t=0$, namely $p(x;\gamma_1(0)) = p(x;\gamma_2(0)) = q$.

If the space of probability densities had a “geometry” then we should be able to say at what angle any two curves intersect. Mathematically, we wish to compute the inner product of the “velocity vector” of $p(x;\gamma_1(t))$ at $t=0$ with the “velocity vector” of $p(x;\gamma_2(t))$ at $t=0$.

We are inclined to work not with $p(x;\theta)$ directly but with $\log p(x;\theta)$; the mapping is bijective, so nothing is lost or gained by doing this. It is simply the case that we know from previous work that we can interpret log-likelihoods as elements of a vector space (albeit a vector space that has forgotten where it placed its origin), and we are therefore comfortable differentiating $\log p(x;\theta)$ with respect to $\theta$. (See too the previous lecture on the Fisher Information Matrix.)

### An Inner Product

For any two curves $\gamma_1, \gamma_2$ intersecting at time $t=0$ we wish to define their inner product. Since $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is a linear function of $\gamma'(0)$, it suffices to write down the rule for computing the inner product in terms of $\gamma_1'(0)$ and $\gamma_2'(0)$. It is desirable to do this because $\gamma'(0)$ is a finite-dimensional vector whereas $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0}$ is infinite-dimensional.
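The claimed linearity in $\gamma'(0)$ is just the chain rule: $\left.\frac{\partial \log p(x;\gamma(t))}{\partial t}\right|_{t=0} = [\gamma'(0)]^T \nabla_\theta \log p(x;\theta_0)$. As a quick numerical sketch of this (not part of the lecture; NumPy and the Gaussian family $N(\mu,\sigma^2)$ are used purely as an illustrative stand-in):

```python
import numpy as np

def log_p(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at the point x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# A curve through theta_0 = (mu, sigma) = (0, 1): gamma(t) = (t, 1 + 2t),
# so gamma'(0) = (1, 2).
def gamma(t):
    return (t, 1.0 + 2.0 * t)

x = 0.7
h = 1e-6

# Left-hand side: d/dt log p(x; gamma(t)) at t = 0, by central differences.
lhs = (log_p(x, *gamma(h)) - log_p(x, *gamma(-h))) / (2 * h)

# Right-hand side: gamma'(0) dotted with the theta-gradient of log p at theta_0,
# each partial derivative also estimated by central differences.
dmu = (log_p(x, h, 1.0) - log_p(x, -h, 1.0)) / (2 * h)
dsigma = (log_p(x, 0.0, 1.0 + h) - log_p(x, 0.0, 1.0 - h)) / (2 * h)
rhs = 1.0 * dmu + 2.0 * dsigma

# The time-derivative along the curve is the linear functional of gamma'(0).
assert np.isclose(lhs, rhs, atol=1e-4)
```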

The catch though is that if we work with $\gamma'(0)$ then we must ensure that we get the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ even if we change to a new parametrisation of the family $\{ p(x;\theta) \}$. Precisely, suppose $\{ q(x;\phi) \}$ represents the same family as $\{ p(x;\theta) \}$ but with respect to a different parametrisation $\phi$. (We assume there is a one-to-one correspondence between $\phi$ and $\theta$; given any $\phi$ there is a $\theta$ such that $q(x;\phi) = p(x;\theta)$, and given any $\theta$ there is a $\phi$ such that $q(x;\phi) = p(x;\theta)$ too.) If $\tilde\gamma_1, \tilde\gamma_2$ are such that $q(x;\tilde\gamma_1(t)) = p(x;\gamma_1(t))$ and $q(x;\tilde\gamma_2(t)) = p(x;\gamma_2(t))$ then we must obtain the same answer for $\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$ as we do for $\langle \left.\frac{\partial \log q(x;\tilde\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log q(x;\tilde\gamma_2(t))}{\partial t}\right|_{t=0} \rangle$.

It so happens that the Fisher Information Matrix can be used to define an inner product in a coordinate-independent way, meaning the same answer will be obtained regardless of how the family is parametrised. Before considering why we use the Fisher Information Matrix, let’s just see first how it can be used to do this.

Note that on a finite-dimensional vector space, every inner product $\langle x,y \rangle$ is of the form $\langle x,y \rangle = y^T Q x$ for some positive-definite symmetric matrix $Q$. The matrix $Q$ determines the inner product and the idea we will experiment with is using the Fisher Information matrix as the $Q$ matrix. To wit, we define:

$\langle \left.\frac{\partial \log p(x;\gamma_1(t))}{\partial t}\right|_{t=0}, \left.\frac{\partial \log p(x;\gamma_2(t))}{\partial t}\right|_{t=0} \rangle = [\gamma_2'(0)]^T \mathcal{I}(\theta_0) [\gamma_1'(0)]$ where $\theta_0 = \gamma_1(0) = \gamma_2(0)$ is the point of intersection of the two curves at time $t=0$.

From its form, and assuming the Fisher Information matrix is positive-definite (as usual, technical assumptions are being omitted in order to focus attention on the higher-level details), it is clear that we have defined an inner product; the axioms of an inner product are satisfied. What we need to check is whether or not we get the same answer if we change the parametrisation of our family.
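To make the definition concrete, here is a small numerical sketch (an illustration, assuming NumPy; the two-parameter Gaussian family with $\theta = (\mu, \sigma)$ and its well-known Fisher matrix $\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)$ is not part of the lecture, just a convenient example):

```python
import numpy as np

def fisher_gaussian(mu, sigma):
    """Fisher information matrix of N(mu, sigma^2) in the (mu, sigma) parametrisation."""
    return np.array([[1.0 / sigma**2, 0.0],
                     [0.0, 2.0 / sigma**2]])

# Two curves through theta_0 = (mu, sigma) = (0, 1) at t = 0:
#   gamma_1(t) = (t, 1)      -- moves the mean only
#   gamma_2(t) = (0, 1 + t)  -- moves the standard deviation only
theta0 = (0.0, 1.0)
v1 = np.array([1.0, 0.0])  # gamma_1'(0)
v2 = np.array([0.0, 1.0])  # gamma_2'(0)

I = fisher_gaussian(*theta0)

# The defined inner product: [gamma_2'(0)]^T I(theta_0) [gamma_1'(0)].
ip = v2 @ I @ v1
print(ip)  # 0.0 -- the mean and variance directions are orthogonal here

# Squared "speeds" of the two curves under the same inner product:
print(v1 @ I @ v1, v2 @ I @ v2)  # 1.0 and 2.0 at sigma = 1
```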

First, the way the Fisher Information matrix changes when we change parametrisations must be determined. From the aforementioned correspondence between $\theta$ and $\phi$ we can regard one as a function of the other and write: $\log q(x;\phi) = \log p(x;\theta(\phi))$. Differentiating this yields: $\frac{\partial \log q}{\partial \phi_i} = \sum_k \frac{\partial \log p}{\partial \theta_k} \frac{\partial \theta_k}{\partial \phi_i}$. It follows almost immediately that:

$\mathcal{I}(\phi) = \left[\frac{d\theta}{d\phi}\right]^T\,\mathcal{I}(\theta)\,\left[\frac{d\theta}{d\phi}\right]$ where the $ij$th entry of $\frac{d\theta}{d\phi}$ is $\frac{\partial \theta_i}{\partial \phi_j}$.
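This transformation rule can be checked on a concrete family. The sketch below (NumPy; the scalar Bernoulli family with $\theta = p$ and $\phi = \operatorname{logit}(p)$ is an illustrative choice, not from the lecture) compares the transformed Fisher information with the one computed directly in the new parametrisation:

```python
import numpy as np

# Bernoulli family, reparametrised from theta = p to phi = logit(p).
# In the p-parametrisation, I(theta) = 1 / (p (1 - p)) -- a standard fact.
p = 0.3
phi = np.log(p / (1 - p))

I_theta = 1.0 / (p * (1 - p))

# Jacobian d theta / d phi: since p = 1 / (1 + exp(-phi)), dp/dphi = p (1 - p).
dtheta_dphi = p * (1 - p)

# Transformation rule: I(phi) = [d theta/d phi]^T I(theta) [d theta/d phi].
I_phi_transformed = dtheta_dphi * I_theta * dtheta_dphi

# Direct computation in the phi-parametrisation:
# log q(x; phi) = x phi - log(1 + e^phi), so -d^2/dphi^2 log q = p (1 - p).
I_phi_direct = p * (1 - p)

# Both routes give the same Fisher information.
assert np.isclose(I_phi_transformed, I_phi_direct)
```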

(As discussed in class, the way to understand this is to consider the special case of $\theta$ being a linear function of $\phi$ and writing down the relationship between the squared-error of estimating $\theta$ and the squared-error of estimating $\phi$ and recalling that the Fisher Information matrix is essentially the inverse of the (asymptotic) squared-error. That the Fisher Information matrix determines the asymptotic performance explains why higher-order derivatives of $\theta$ with respect to $\phi$ do not appear in the above formula.)

We may write $\tilde\gamma(t) = \phi( \gamma(t) )$ to signify the relationship between $\gamma$ and $\tilde\gamma$; they both trace out the same curve of probability densities, just with respect to different coordinates. Therefore, $\tilde\gamma'(0) = \frac{d\phi}{d\theta} \gamma'(0)$ $= \left[ \frac{d\theta}{d\phi} \right]^{-1} \gamma'(0)$.

Collating the above results shows that an inner product defined with respect to the Fisher Information matrix is indeed coordinate-independent:

$[\tilde\gamma_2'(0)]^T \mathcal{I}(\phi_0) [\tilde\gamma_1'(0)] = \left[ \frac{d\phi}{d\theta} \gamma_2'(0) \right]^T\,\left[ \frac{d\theta}{d\phi} \right]^T\,\mathcal{I}(\theta_0)\,\left[ \frac{d\theta}{d\phi} \right]\,\left[ \frac{d\phi}{d\theta} \gamma_1'(0) \right] = [\gamma_2'(0)]^T\,\mathcal{I}(\theta_0)\,[\gamma_1'(0)]$.
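The chain of equalities can also be verified numerically. The following sketch (NumPy; again the scalar Bernoulli family with the logit reparametrisation, an illustrative choice) transforms the velocities by $\frac{d\phi}{d\theta}$ and computes the inner product of two curves in both coordinate systems:

```python
import numpy as np

p0 = 0.3  # point of intersection in the p-parametrisation
phi0 = np.log(p0 / (1 - p0))  # the same point in the logit parametrisation

# Scalar curves through p0: gamma_1(t) = p0 + t, gamma_2(t) = p0 + 2t.
g1, g2 = 1.0, 2.0  # gamma_1'(0), gamma_2'(0)

# Fisher "matrices" (scalars here) in each parametrisation.
I_theta = 1.0 / (p0 * (1 - p0))
I_phi = p0 * (1 - p0)

# Velocities transform by d phi / d theta = 1 / (p (1 - p)).
dphi_dtheta = 1.0 / (p0 * (1 - p0))
tg1, tg2 = dphi_dtheta * g1, dphi_dtheta * g2

# The same inner product is obtained in both coordinate systems.
ip_theta = g2 * I_theta * g1
ip_phi = tg2 * I_phi * tg1
assert np.isclose(ip_theta, ip_phi)
print(ip_theta)
```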

The reader is urged to think carefully about what has actually been done. We have come close to putting an inner product on the infinite-dimensional space of all probability densities. Given any finite-dimensional family we can define an inner product using the Fisher Information matrix in a consistent way which depends only on the densities themselves and not on their parametrisations. (There is a small lie in the last sentence; arbitrary parametrisations are not permitted but rather, any two parametrisations we consider need to be “reasonably nice” with respect to each other, such as the mapping from one to the other being continuously differentiable. Such finer points will be elaborated on later in the course.)

### Justification

We have not endeavoured to justify why we want to use the Fisher Information matrix as an inner product; we have just demonstrated that because it transforms in the right way, we can use it to define an inner product if we so choose. It is appealing because it is coordinate-independent: no matter how we parametrise our family, the same inner product is obtained. This doesn’t automatically mean that it is a useful or sensible inner product though.

It turns out that it is both useful and sensible. That it is sensible comes from a deep property of being invariant with respect to sufficient statistics (this will be explained later in the course), and indeed, the Fisher metric (as we shall henceforth refer to the inner product coming from the Fisher Information matrix) is the essentially unique metric with this highly desirable property. That it is useful will be seen later when we gain a better understanding of what geometrical results become available by having an inner product structure.