## Comments on James-Stein Estimation Theory

Part of the reason for this short article is to provide an example which will be relied on in subsequent articles arguing that:

• There are a priori no such things as estimation problems, only decision problems.
• The Bayesian-Frequentist debate is a nonsense (because it is ill-founded).

The James-Stein Estimator has intrinsic interest though, and indeed, has been heralded by some as the most striking result in post-war mathematical statistics.

#### Reference

References to the literature can be found in the bibliography of the following paper:

Manton, J.H., Krishnamurthy, V. and Poor, H.V. (1998). James-Stein State Filtering Algorithms. IEEE Transactions on Signal Processing, 46(9) pp. 2431-2447.

### Introduction

Write $x \sim N(\mu,1)$ to denote that the real-valued random variable $x$ has a Gaussian distribution with unknown mean $\mu$ and unit variance. It is accepted that the “best” estimate of $\mu$ given $x$ is simply $\widehat\mu = x$.  (Without loss of generality it can be assumed we have only a single observation; multiple observations can be averaged, which will reduce the variance, but not change the essence of the discussion to follow.)

Assume now that $\mu$ (and therefore $x$) is a three-dimensional real-valued vector and $x \sim N(\mu,I)$ where $I$ is the identity matrix. In words, each of the three elements of $x$ is a Gaussian random variable with unknown mean and unit variance.  Importantly, the three elements of $x$ are independent of each other.

Prior to the James-Stein estimator, every self-respecting statistician would have argued that estimating the means of three independent random variables is equivalent to estimating the mean of each one in isolation, and in particular, that $\widehat\mu = x$ must remain optimal in this three-dimensional case.

This is not necessarily true though. If we are interested in minimising the mean-square error of our estimate then while $\widehat\mu = x$ is optimal in the one- and two-dimensional cases, the following estimator is always better in the three-dimensional case: $\widehat\mu = \left(1-\frac1{\|x\|^2}\right)x$.  (An obvious improvement, but harder to analyse, is to set the term in brackets to zero whenever it would otherwise be negative.)

This is an example of a shrinkage estimator.  All it does is take the usual estimate $x$ of the mean $\mu$, and shrink it towards the origin by multiplying it by the scalar $(1-\|x\|^{-2})$.
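The shrinkage rule, together with the variant that sets the bracketed term to zero whenever it would be negative, can be sketched in a few lines of Python. This is only an illustrative sketch for the unit-variance, three-dimensional setting of this article; the function names `james_stein` and `james_stein_positive_part` are chosen here for illustration.

```python
def james_stein(x):
    """Shrink the observation x towards the origin by the factor 1 - 1/||x||^2.

    Assumes x is a sequence of three real numbers with unit-variance noise,
    and that x is not exactly the zero vector (an event of probability zero).
    """
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - 1.0 / norm_sq
    return [factor * v for v in x]


def james_stein_positive_part(x):
    """Same rule, but the shrinkage factor is clipped at zero."""
    norm_sq = sum(v * v for v in x)
    factor = max(0.0, 1.0 - 1.0 / norm_sq)
    return [factor * v for v in x]


if __name__ == "__main__":
    # ||x||^2 = 9, so each component is multiplied by 8/9.
    print(james_stein([2.0, 1.0, 2.0]))
    # ||x||^2 = 0.75 < 1, so the plain rule flips the sign of x,
    # while the positive-part rule returns the origin instead.
    print(james_stein([0.5, 0.5, 0.5]))
    print(james_stein_positive_part([0.5, 0.5, 0.5]))
```

The second example shows why the clipped variant is an obvious improvement: when $\|x\| < 1$ the plain factor is negative, so the estimate points away from the observation.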

The resulting Mean-Square Error (MSE) of the James-Stein estimator has been graphed in the figure on the James-Stein wiki page. Regardless of the true value of $\mu$, the MSE of the James-Stein estimator is always lower than the MSE of the usual estimator $\widehat\mu = x$.
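One way to see why this holds is through Stein's identity, $E[(x-\mu)^T g(x)] = E[\nabla \cdot g(x)]$, applied to $g(x) = x/\|x\|^2$, whose divergence in three dimensions is $1/\|x\|^2$. The following is a sketch for the unit-variance, three-dimensional case only, writing $E$ for expectation over $x \sim N(\mu,I)$:

$$E\|\widehat\mu-\mu\|^2 = E\|x-\mu\|^2 - 2E\left[\frac{(x-\mu)^T x}{\|x\|^2}\right] + E\left[\frac1{\|x\|^2}\right] = 3 - 2E\left[\frac1{\|x\|^2}\right] + E\left[\frac1{\|x\|^2}\right] = 3 - E\left[\frac1{\|x\|^2}\right].$$

The usual estimator has MSE $E\|x-\mu\|^2 = 3$, and $E[1/\|x\|^2]$ is strictly positive, so the James-Stein MSE is strictly smaller for every $\mu$. At $\mu = 0$ the expectation equals 1, giving an MSE of 2; as $\|\mu\|$ grows the expectation tends to zero and the two estimators become indistinguishable.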

It is worthwhile emphasising how the performance of the estimators is being assessed.  A graph is drawn with $\|\mu\|$ along the horizontal axis, representing the true value of what it is we wish to estimate.  (Conceptually, $\mu$ should appear along the horizontal axis but this is a little tricky since $\mu$ is three-dimensional.  Fortunately, it can be shown that the MSE depends only on the magnitude of $\mu$ and not on its direction.) For a fixed value of $\mu$, imagine that a computer generates very many realisations of $x \sim N(\mu,I)$, and for each realisation, $\widehat\mu$ is calculated and the squared error $\|\widehat\mu-\mu\|^2$ recorded. The Mean-Square Error (MSE) of the estimate $\widehat\mu$ is the average of these squared errors as the number of realisations goes to infinity.  The MSE is graphed against $\|\mu\|$. The claim that the James-Stein Estimator is superior to the usual estimator means that, regardless of the value of $\mu$, the resulting MSE is smaller for the James-Stein Estimator.
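The thought experiment just described can be run directly. The sketch below uses only Python's standard library and a finite number of realisations, so the printed values are approximations to the MSE rather than the MSE itself. The names `estimate_mse` and `usual_estimator`, the number of trials and the seed are all choices made here for illustration.

```python
import random


def usual_estimator(x):
    """The usual estimate: report the observation itself."""
    return list(x)


def james_stein(x):
    """Shrink x towards the origin by the factor 1 - 1/||x||^2."""
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - 1.0 / norm_sq
    return [factor * v for v in x]


def estimate_mse(mu, estimator, num_trials=20_000, seed=0):
    """Average the squared error ||mu_hat - mu||^2 over many realisations of x."""
    rng = random.Random(seed)  # fixed seed so both estimators see the same x
    total = 0.0
    for _ in range(num_trials):
        x = [rng.gauss(m, 1.0) for m in mu]  # one realisation of x ~ N(mu, I)
        mu_hat = estimator(x)
        total += sum((h - m) ** 2 for h, m in zip(mu_hat, mu))
    return total / num_trials


if __name__ == "__main__":
    # The MSE depends only on ||mu||, so mu is placed along the first axis.
    for norm in (0.0, 1.0, 2.0, 4.0):
        mu = [norm, 0.0, 0.0]
        print(norm, estimate_mse(mu, usual_estimator), estimate_mse(mu, james_stein))
```

Reusing the same seed for both estimators means they are compared on identical realisations of $x$, which makes the comparison less noisy than drawing fresh samples for each.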

Popular articles have appeared hailing the James-Stein estimator as a paradox; one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne!

It is not a paradox for the simple reason that even though the three random variables (that is, the three elements of $x$) are independent, the measure of the performance of the estimator couples them: the squared error $\|\widehat\mu-\mu\|^2$ adds together the errors made on all three means.  Certainly, the James-Stein estimator will not improve the estimate of all three means at once; that would be impossible. What the James-Stein Estimator does is gamble; it gambles that by guessing that all three means are closer to the origin than the observations suggest, the possibly enlarged error it makes on estimating one or two of the means is more than compensated for by the reduction in error that it achieves on the other one or two means.
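The gamble can be made visible by recording the squared error of each component separately. The sketch below is illustrative only: the helper name `componentwise_mse`, the choice $\mu = (3, 0, 0)$, the number of trials and the seed are assumptions made here, and the printed numbers are finite-sample approximations.

```python
import random


def usual_estimator(x):
    """The usual estimate: report the observation itself."""
    return list(x)


def james_stein(x):
    """Shrink x towards the origin by the factor 1 - 1/||x||^2."""
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - 1.0 / norm_sq
    return [factor * v for v in x]


def componentwise_mse(mu, estimator, num_trials=50_000, seed=0):
    """Return the approximate mean-square error of each component separately."""
    rng = random.Random(seed)
    totals = [0.0] * len(mu)
    for _ in range(num_trials):
        x = [rng.gauss(m, 1.0) for m in mu]  # one realisation of x ~ N(mu, I)
        mu_hat = estimator(x)
        for i, (h, m) in enumerate(zip(mu_hat, mu)):
            totals[i] += (h - m) ** 2
    return [t / num_trials for t in totals]


if __name__ == "__main__":
    mu = [3.0, 0.0, 0.0]  # one mean far from the origin, two at the origin
    usual = componentwise_mse(mu, usual_estimator)
    shrunk = componentwise_mse(mu, james_stein)
    print("usual      :", usual, "total", sum(usual))
    print("James-Stein:", shrunk, "total", sum(shrunk))
```

With one mean far from the origin, shrinking should hurt the estimate of that mean and help the other two, while the total still comes out below that of the usual estimator.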

It must be recognised that the James-Stein estimator is good for only some applications; generally, the usual estimate $\widehat\mu = x$ is preferable.  (There are several explanations for this; one is that the James-Stein estimator accepts bias in exchange for lower risk, and it is this bias which is often undesirable in applications. A simpler explanation is that if three random variables are independent of each other, then quite likely, what is actually required in practice is an estimate of their means which is accurate for each and every one of the three random variables.)

The James-Stein estimator is good when it is truly the case that it is the overall Mean-Square Error (and not the individual Mean-Square Errors) that should be minimised. For example, if $\mu_i$ for $i=1,2,3$ represents the financial cost of claims a multi-national insurance company will incur in the next year in three different countries, the company may be less concerned with estimating the values of the individual $\mu_i$ accurately and more concerned with getting an accurate overall estimate.  Therefore, it may well choose to use the James-Stein Estimator.

Why shrink the estimate to the origin (or to some other point, which will also work)? One way to derive the James-Stein estimator is as an empirical Bayes estimate.  If $\mu$ were a Gaussian random vector with zero mean and covariance $\sigma^2 I$ then the optimal estimate would indeed shrink the observation $x$ towards the origin by a factor depending on $\sigma^2$.  Replacing $\sigma^2$ by (a suitable function of) $\|x\|^2$ results in the James-Stein estimator; the spread of the observations about the origin serves as a proxy for $\sigma^2$.
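The empirical Bayes argument can be sketched as follows, under the assumption that the prior is $\mu \sim N(0,\sigma^2 I)$ and that $x$ given $\mu$ is $N(\mu,I)$. The posterior mean, which minimises the mean-square error when $\sigma^2$ is known, is

$$E[\mu \mid x] = \frac{\sigma^2}{1+\sigma^2}\,x = \left(1-\frac1{1+\sigma^2}\right)x.$$

Marginally, $x \sim N(0,(1+\sigma^2)I)$, so $\|x\|^2/(1+\sigma^2)$ has a chi-squared distribution with three degrees of freedom, whose reciprocal has expectation 1. Hence

$$E\left[\frac1{\|x\|^2}\right] = \frac1{1+\sigma^2},$$

that is, $1/\|x\|^2$ is an unbiased estimate of the unknown quantity $1/(1+\sigma^2)$. Substituting it into the posterior mean gives $\widehat\mu = \left(1-\frac1{\|x\|^2}\right)x$, the James-Stein estimator.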

### Moral

The moral of the story is that there is no such thing as an optimal estimator; which estimator is “good” depends on the application.  For the aforementioned multi-national insurance company, the James-Stein Estimator is preferable. For most other applications, $\widehat\mu = x$ is best.

Subsequent articles will elaborate on the key message that there are a priori no such things as estimation problems, only decision problems.