Information Geometry – Fisher Information Matrix (Lecture 4)
As we will see in subsequent lectures, the Fisher Information Matrix plays an important role in information geometry.
Definition and Example
Associated with a family of probability densities $\{p(x;\theta)\}$, where $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^k$, is a function $\mathcal{I}$ whose value $\mathcal{I}(\theta)$ at any point $\theta$ is called the Fisher Information Matrix. Although it does have several properties which warrant it being thought of as a measure of information, it would be misleading to read too much into the name “information”. Personally, I would call it the “asymptotic information” matrix to reduce the risk of erroneous intuition.
Precisely, the Fisher Information at $\theta$ is defined to be the matrix $\mathcal{I}(\theta)$ whose $(i,j)$th entry equals $E\left[\frac{\partial \log p(x;\theta)}{\partial \theta_i}\,\frac{\partial \log p(x;\theta)}{\partial \theta_j}\right]$. When stated in this fashion, the Fisher Information Matrix can appear mysterious for two reasons: it is not immediately clear how to calculate it, and it is not at all clear why anything meaningful should result from such a calculation. It is therefore informative to calculate the Fisher Information for the family of Gaussian random variables with unknown mean $\theta$ but known variance $\sigma^2$.
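Let $p(x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right)$. The log-likelihood is therefore $\log p(x;\theta) = -\frac{(x-\theta)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$. Differentiating yields $\frac{\partial}{\partial\theta} \log p(x;\theta) = \frac{x-\theta}{\sigma^2}$. The expectation appearing in the definition of the Fisher Information Matrix is with respect to $x$, where the density of $x$ is taken to be $p(x;\theta)$. Precisely, $E[g(x)]$ is defined to be $\int g(x)\,p(x;\theta)\,dx$. Therefore the Fisher Information is:
$$\mathcal{I}(\theta) = \int_{-\infty}^{\infty} \left(\frac{x-\theta}{\sigma^2}\right)^2 \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right) dx.$$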
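That said, it is often easier to evaluate the Fisher Information by thinking in terms of expectation, and indeed, this is one reason why the Fisher Information is written as an expectation rather than an integration. Thinking of $x$ as a Gaussian random variable with known mean $\theta$ and variance $\sigma^2$, the expectation of $(x-\theta)^2$ is, by definition, $\sigma^2$. Therefore, the expectation of $\frac{(x-\theta)^2}{\sigma^4}$ is $\frac{1}{\sigma^2}$. Stated formally:
$$\mathcal{I}(\theta) = E\left[\left(\frac{x-\theta}{\sigma^2}\right)^2\right] = \frac{E\left[(x-\theta)^2\right]}{\sigma^4} = \frac{1}{\sigma^2}.$$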
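As a quick numerical check of this calculation, the sketch below approximates the integral of the squared score against the Gaussian density on a fine grid and compares it with $1/\sigma^2$. The function name, the grid size, and the truncation of the integral to twelve standard deviations either side of the mean are illustrative choices, not part of the lecture.

```python
import numpy as np

def fisher_information_gaussian_mean(theta, sigma, num_points=200_001, half_width=12.0):
    """Approximate E[(d/dtheta log p(x; theta))^2] for x ~ N(theta, sigma^2).

    Hypothetical helper: integrates score^2 * density with a Riemann sum
    over [theta - half_width*sigma, theta + half_width*sigma].
    """
    x = np.linspace(theta - half_width * sigma, theta + half_width * sigma, num_points)
    density = np.exp(-(x - theta) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    score = (x - theta) / sigma ** 2          # derivative of the log-likelihood w.r.t. theta
    dx = x[1] - x[0]
    return float(np.sum(score ** 2 * density) * dx)

if __name__ == "__main__":
    sigma = 2.0
    print(fisher_information_gaussian_mean(theta=0.0, sigma=sigma), 1.0 / sigma ** 2)
```

The two printed numbers should agree closely, and changing `theta` should leave the result unchanged, since the Fisher Information of this family does not depend on the mean.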
Had we considered a family of multivariate Gaussian random variables with unknown mean $\theta$ but known covariance matrix $\Sigma$, that is $x \sim N(\theta, \Sigma)$, then the Fisher Information Matrix would have been
$$\mathcal{I}(\theta) = \Sigma^{-1}.$$
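The multivariate case can be checked by simulation. For $x \sim N(\theta, \Sigma)$ the gradient of the log-likelihood with respect to $\theta$ is $\Sigma^{-1}(x - \theta)$, so averaging its outer product over many draws should approximate $\Sigma^{-1}$. The following is a minimal Monte Carlo sketch; the function name, sample size, seed, and the particular $\Sigma$ are assumptions made for illustration.

```python
import numpy as np

def monte_carlo_fisher_information(theta, cov, num_samples=200_000, seed=0):
    """Estimate E[score score^T] for x ~ N(theta, cov) by sampling (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(theta, cov, size=num_samples)
    precision = np.linalg.inv(cov)
    # Each row is the score Sigma^{-1} (x - theta); the precision matrix is
    # symmetric, so right-multiplying the row vectors gives the same result.
    scores = (x - theta) @ precision
    return scores.T @ scores / num_samples

if __name__ == "__main__":
    theta = np.array([1.0, -2.0])
    cov = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
    print(monte_carlo_fisher_information(theta, cov))
    print(np.linalg.inv(cov))
```

The two printed matrices should agree up to Monte Carlo error, which shrinks as `num_samples` grows.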
It just so happens that in these cases, the Fisher Information Matrix is constant with respect to $\theta$. As will be seen presently, it is not a coincidence that the Fisher Information Matrix appears to be the reciprocal of the variance, which limits the accuracy with which we can expect to be able to estimate $\theta$ given an observation $x$. As the variance decreases, the amount of information increases.
Before interpreting these results, it is remarked that although the definition of the Fisher Information Matrix came before information geometry, the fact that the definition requires the log-likelihood function to be differentiated with respect to $\theta$ means that the parameter space was being treated as a vector space.
Assume that a friend chooses a value of $\theta$ and leaves it fixed. Once a second, the friend generates a random variable with distribution $p(x;\theta)$. This sequence of independent and identically distributed random variables will be denoted by $x_1, x_2, \ldots$. Given the first $N$ random variables $x_1, \ldots, x_N$, can we guess what $\theta$ is, and how accurate is our guess likely to be?
For finite $N$, there is almost never a single best method for estimating $\theta$; follow this link for the reason why. However, it is sensible to ask if there is an estimation rule which works well as $N \to \infty$.
Observe that the density of $(x_1, \ldots, x_N)$ is just a product of densities: $p(x_1, \ldots, x_N; \theta) = \prod_{n=1}^{N} p(x_n;\theta)$. It follows that the Fisher Information Matrix for $(x_1, \ldots, x_N)$ is simply $N$ times the Fisher Information Matrix for $x_1$. [Indeed, one of the reasons why it is called “information” is because it is additive.]
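The step from the product of densities to the factor of $N$ can be spelled out as follows. This is a sketch which assumes the usual regularity condition that differentiation with respect to $\theta$ may be exchanged with integration over $x$, so that each single-observation score has zero mean; the symbol $\mathcal{I}(\theta)$ for the single-observation Fisher Information Matrix and the sample indices $n, m$ are notation chosen here.

```latex
\begin{align*}
\log p(x_1,\ldots,x_N;\theta) &= \sum_{n=1}^{N} \log p(x_n;\theta), \\
E\!\left[\frac{\partial \log p(x_n;\theta)}{\partial\theta_i}\right]
  &= \int \frac{\partial p(x;\theta)}{\partial\theta_i}\,dx
   = \frac{\partial}{\partial\theta_i}\int p(x;\theta)\,dx = 0, \\
E\!\left[\frac{\partial \log p(x_1,\ldots,x_N;\theta)}{\partial\theta_i}\,
         \frac{\partial \log p(x_1,\ldots,x_N;\theta)}{\partial\theta_j}\right]
  &= \sum_{n=1}^{N}\sum_{m=1}^{N}
     E\!\left[\frac{\partial \log p(x_n;\theta)}{\partial\theta_i}\,
              \frac{\partial \log p(x_m;\theta)}{\partial\theta_j}\right] \\
  &= \sum_{n=1}^{N}
     E\!\left[\frac{\partial \log p(x_n;\theta)}{\partial\theta_i}\,
              \frac{\partial \log p(x_n;\theta)}{\partial\theta_j}\right]
   = N\,[\mathcal{I}(\theta)]_{ij}.
\end{align*}
```

The cross terms with $n \neq m$ vanish because independence lets the expectation factor into a product of two zero-mean scores, leaving only the $N$ identical diagonal terms.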
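The maximum-likelihood estimate of $\theta$ given the single observation $x$ is the value of $\theta$ which maximises $p(x;\theta)$. (Here, note that the observed value of $x$ is substituted into the expression for $p(x;\theta)$, leaving just a function of $\theta$; it is this function which is maximised.) Let $\hat\theta_N$ denote the maximum-likelihood estimate of $\theta$ based on the observations $x_1, \ldots, x_N$. (Whether we have $N$ observations or a single stacked observation $(x_1, \ldots, x_N)$ is one and the same thing; the maximum-likelihood estimate is the “most likely” value of $\theta$, given by maximising $p(x_1, \ldots, x_N; \theta)$.) Under reasonably mild regularity conditions, it turns out that:
$$\lim_{N\to\infty} E\left[\hat\theta_N\right] = \theta, \qquad \lim_{N\to\infty} N\,E\left[(\hat\theta_N - \theta)(\hat\theta_N - \theta)^T\right] = \mathcal{I}(\theta)^{-1}.$$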
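That is to say, the maximum-likelihood estimator is asymptotically unbiased and the leading term in its asymptotic performance, as measured by its covariance matrix, is $\frac{1}{N}\mathcal{I}(\theta)^{-1}$, where $\mathcal{I}(\theta)$ is the Fisher Information Matrix of a single observation.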
Therefore, the inverse of the Fisher Information Matrix determines the asymptotic performance of the maximum-likelihood estimator as the number of samples goes to infinity.
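This behaviour can be observed by simulation on the Gaussian family with unknown mean and known variance $\sigma^2$, for which the maximum-likelihood estimate of the mean is the sample mean and the single-observation Fisher Information is $1/\sigma^2$. The sketch below repeats the experiment many times and reports the empirical bias together with $N$ times the empirical mean squared error, which should be close to $\sigma^2$. The function name, trial count, and seed are assumptions made for illustration.

```python
import numpy as np

def mle_bias_and_scaled_mse(theta, sigma, num_obs, num_trials=20_000, seed=0):
    """Simulate the MLE of a Gaussian mean (hypothetical helper).

    Returns the empirical bias and num_obs times the empirical mean squared error.
    """
    rng = np.random.default_rng(seed)
    samples = rng.normal(theta, sigma, size=(num_trials, num_obs))
    estimates = samples.mean(axis=1)          # sample mean = MLE of the mean, one per trial
    bias = float(estimates.mean() - theta)
    scaled_mse = float(num_obs * np.mean((estimates - theta) ** 2))
    return bias, scaled_mse

if __name__ == "__main__":
    sigma = 2.0
    bias, scaled_mse = mle_bias_and_scaled_mse(theta=1.5, sigma=sigma, num_obs=50)
    print("bias:", bias)
    print("N * MSE:", scaled_mse, "inverse Fisher Information:", sigma ** 2)
```

For this particular family the sample mean has variance exactly $\sigma^2/N$ for every $N$, so the agreement is not only asymptotic; for most other families the match only emerges as $N$ grows.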
There is more to the story. The Cramér-Rao Bound states that if $\hat\theta_N$ is an unbiased estimator of $\theta$ based on $x_1, \ldots, x_N$ then its performance must be lower-bounded by the inverse of the Fisher Information Matrix of those observations: $E\left[(\hat\theta_N - \theta)(\hat\theta_N - \theta)^T\right] \geq \frac{1}{N}\mathcal{I}(\theta)^{-1}$ always holds (for an unbiased estimator), where $\mathcal{I}(\theta)$ is the Fisher Information Matrix of a single observation and the inequality means that the difference of the two sides is positive semi-definite. Therefore, no other unbiased estimator can have a better leading term in its asymptotic performance than the maximum-likelihood estimator. This is important because it implies that the definition of the Fisher Information Matrix is intrinsic; it measures the best possible asymptotic performance rather than merely the performance of the maximum-likelihood estimator. (There is nothing special about the maximum-likelihood estimator except that it is one example of an estimator which is asymptotically efficient, meaning that asymptotically it achieves the Cramér-Rao Bound.)
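To see the bound at work, one can compare two unbiased estimators of a Gaussian mean: the sample mean (the maximum-likelihood estimator) and the sample median, which is unbiased here by symmetry. The sketch below estimates the mean squared error of each and compares it with the bound $\sigma^2/N$ for this family. The function name, trial count, and seed are illustrative assumptions.

```python
import numpy as np

def compare_mean_and_median(theta, sigma, num_obs, num_trials=20_000, seed=0):
    """Empirical MSE of sample mean and sample median versus sigma^2 / N (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(theta, sigma, size=(num_trials, num_obs))
    mean_mse = float(np.mean((samples.mean(axis=1) - theta) ** 2))
    median_mse = float(np.mean((np.median(samples, axis=1) - theta) ** 2))
    bound = sigma ** 2 / num_obs              # Cramer-Rao Bound for N observations
    return mean_mse, median_mse, bound

if __name__ == "__main__":
    mean_mse, median_mse, bound = compare_mean_and_median(theta=0.0, sigma=1.0, num_obs=101)
    print("bound:", bound)
    print("sample mean MSE:", mean_mse)
    print("sample median MSE:", median_mse)
```

The sample mean should sit essentially on the bound, while the sample median should be noticeably worse; its variance is known to approach $\pi\sigma^2/(2N)$ for large $N$, roughly 1.57 times the bound.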
Extensive literature is available on the Fisher Information Matrix, including on Wikipedia. I therefore curtail my remarks to the above.