### Archive

Posts Tagged ‘Cramer-Rao Bound’

## Information Geometry – Fisher Information Matrix (Lecture 4)

July 25, 2010 1 comment

As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.

### Definition and Example

Associated with a family of probability densities $\{ p(x;\theta) \}$, where $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^k$, is a function $\mathcal{I}: \mathbb{R}^k \rightarrow \mathbb{R}^{n \times n}$ whose value $\mathcal{I}(\theta)$ at any point $\theta$ is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.

Precisely, the Fisher Information at $\theta$ is defined to be the matrix $\mathcal{I}(\theta)$ whose $ij$th entry equals $\mathbf{E}\left[ \frac{\partial \log p}{\partial \theta_i} \frac{\partial \log p}{\partial \theta_j} \right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons; it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\theta$ but known variance $\sigma^2$.

Let $p(x;\theta) = \frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}$. The log-likelihood is therefore $\log p = -\frac12\log 2\pi - \log\sigma -\frac12\left(\frac{x-\theta}{\sigma}\right)^2$. Differentiating yields $\frac{\partial \log p}{\partial \theta} = \frac{x-\theta}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p(x;\theta)$. Precisely, $\mathbf{E}[ g(x,\theta) ]$ is defined to be $\int g(x,\theta)\,p(x;\theta)\,dx$. Therefore the Fisher Information is:

$\mathcal{I}(\theta) = \int \frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}\,\frac1{\sqrt{2\pi}\sigma}e^{-\frac12\left(\frac{x-\theta}{\sigma}\right)^2}\,dx$.

That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\theta$ and variance $\sigma^2$, the expectation of $(x-\theta)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\frac{x-\theta}{\sigma^2}\,\frac{x-\theta}{\sigma^2}$ is $\frac1{\sigma^2}$. Stated formally:

$\mathcal{I}(\theta) = \frac1{\sigma^2}$.

Had we considered a family of multivariate Gaussian random variables with unknown mean $\theta$ but known covariance matrix $\Sigma$, that is $p(x;\theta) = (2\pi)^{-n/2} \left| \Sigma \right|^{-1/2} e^{-\frac12 (x-\theta)^T \Sigma^{-1} (x-\theta)}$, then the Fisher Information Matrix would have been

$\mathcal{I}(\theta) = \Sigma^{-1}$.

It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\theta$. As will be seen presently, it is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the accuracy with which we can expect to be able to estimate $\theta$ given an observation $x$. As the variance decreases, the amount of information increases.

Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the fact that the definition of the Fisher Information Matrix requires the log-likelihood function to be differentiated means that $\log p$ was being treated as a vector space.

### Interpretation

Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p(\cdot;\theta)$. This sequence of independent and identically distributed random variables will be denoted by $x_1,x_2,\cdots$. Given the first $N$ random variables $x_1,\cdots,x_N$, can we guess what $\theta$ is, and how accurate is our guess likely to be?

For finite $N$, there is almost never a single best method for estimating $\theta$; follow this link for the reason why. However, it is sensible to ask if there is an estimation rule which works well as $N \rightarrow \infty$.

Observe that the density of $x_1,\cdots,x_N$ is just a product of densities: $p(x_1,\cdots,x_N;\theta) = \prod_{i=1}^N p(x_i;\theta)$. It follows that the Fisher Information Matrix for $p(x_1,\cdots,x_N;\theta)$ is simply $N$ times the Fisher Information Matrix for $p(x;\theta)$. [Indeed, one of the reasons why it is called “information” is because it is additive.]

The maximum-likelihood estimate of $\theta$ given the single observation $x$ is the value of $\theta$ which maximises $p(x;\theta)$. (Here, note that the observed value of $x$ is substituted into the expression for $p(x;\theta)$ leaving just a function of $\theta$; it is this function which is maximised.) Let $\widehat{\theta}_N$ denote the maximum-likelihood estimate of $\theta$ based on the $N$ observations $x_1,\cdots,x_N$. (Whether we have $N$ observations or a single observation of dimension $N$ is one and the same thing; the maximum likelihood estimate is the “most likely” value of $\theta$ given by maximising $p(x_1,\cdots,x_N;\theta)$.) Under reasonably mild regularity conditions, it turns out that:

1. $\lim_{N \rightarrow \infty} \widehat{\theta}_N = \theta$;
2. $\lim_{N \rightarrow \infty} N\cdot\mathbf{E}\left[ (\widehat{\theta}_N - \theta) (\widehat{\theta}_N - \theta)^T \right] = \mathcal{I}^{-1}(\theta)$.

That is to say, the maximum-likelihood estimator is asymptotically unbiased and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac1N \mathcal{I}^{-1}(\theta)$.

Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of the samples goes to infinity.

There is more to the story. The Cramer-Rao Bound states that if $\widehat{\theta}$ is an unbiased estimator of $\theta$ then its performance must be lower-bounded by the inverse of the Fisher Information Matrix: $\mathbf{E}\left[ (\widehat{\theta} - \theta) (\widehat{\theta} - \theta)^T \right] \geq \mathcal{I}^{-1}(\theta)$ always holds (for an unbiased estimator). Therefore, no other estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramer-Rao Bound.)

Extensive literature is available on the Fisher Information Matrix, including on the wikipedia. I therefore curtail my remarks to the above.