## The Role of Estimates, Estimation Theory and Statistical Inference – Is it what we think it is?

The tenet of this article is that estimation theory is a means to an end and therefore cannot be sensibly considered in isolation. Realising this has pragmatic consequences:

- Pedagogical. When faced with solving a statistical problem, it becomes clearer how to proceed.
- Philosophical. A number of controversies and debates in the literature can be resolved (or become null and void).
- Interpretive. A clearer understanding is gained of how to interpret and use estimates made by others.

Forming estimates is ingrained in us: I estimate the tree is 5 metres high, there are 53 jelly beans in the jar and it will be 25 degrees tomorrow. This can draw us strongly to the belief that forming an estimate is something intrinsic, something that can be done in isolation. It suggests there should be a right way and a wrong way of estimating a quantity; perhaps even an optimal way. Succumbing to this belief, though, is counterproductive.

Once you have an estimate, what will you use it for? Putting aside the amusement or curiosity value some may attach to forming estimates, for all intents and purposes, an estimate is merely an intermediate step used to provide (extra) information in a decision-making process. I estimated the height of the tree in order to know how much rope to buy, I estimated the number of jelly beans in the jar to try to win the prize by being the closest guess, and I estimated the temperature tomorrow to decide what clothes to pack. In all cases, the estimate was nothing more than a stepping stone used to guide a subsequent action.

In general, it is meaningless to speak of a good or a bad estimator because, without knowing what the estimate will be used for, there is no consistent way of ascribing the attribute “good” or “bad” to the estimator. The exception is if the estimator is a sufficient statistic, and indeed, it might be more intuitive if “estimators” were sometimes thought of as “approximate sufficient statistics”. All this will be explained presently.

The James-Stein Estimator exemplifies the assertion that it is generally not possible to declare one estimator better than another. When three or more means are estimated simultaneously, it achieves a lower total mean squared error than the individual sample means, yet it can do worse on any single component; which is better is application dependent. Less striking examples come from situations where the penalties (in terms of making a bad decision) resulting from different types of estimation errors (such as under-estimation or over-estimation) can vary considerably from application to application.
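To make the point about asymmetric penalties concrete, here is a minimal sketch using the tree-and-rope example. Everything numerical in it is an assumption made for illustration: the belief about the tree's height is modelled as a Gaussian centred on 5 metres, and running short of rope is taken to cost five times as much per metre as buying too much. The function names are likewise hypothetical.

```python
import random

def expected_loss(estimate, samples, loss):
    """Average loss of reporting `estimate` over plausible true values."""
    return sum(loss(estimate, x) for x in samples) / len(samples)

def squared_loss(estimate, truth):
    """Symmetric penalty: over- and under-estimation cost the same."""
    return (estimate - truth) ** 2

def rope_loss(estimate, truth):
    """Asymmetric penalty (assumed 5:1): too little rope is worse than too much."""
    if estimate < truth:
        return 5.0 * (truth - estimate)
    return 1.0 * (estimate - truth)

def best_estimate(samples, loss, candidates):
    """Pick the candidate estimate with the smallest expected loss."""
    return min(candidates, key=lambda c: expected_loss(c, samples, loss))

# Assumed belief about the tree's height: roughly 5 m, give or take 1 m.
rng = random.Random(0)
samples = [rng.gauss(5.0, 1.0) for _ in range(2000)]

# Candidate estimates from 2.0 m to 9.0 m in steps of 0.05 m.
candidates = [i / 20 for i in range(40, 181)]

best_sq = best_estimate(samples, squared_loss, candidates)
best_rope = best_estimate(samples, rope_loss, candidates)
print("best under squared loss:", best_sq)
print("best under rope loss:   ", best_rope)
```

The same belief about the tree yields two different "best" estimates: one near the mean under the symmetric penalty, and a noticeably larger one when a shortfall is expensive. Neither is better until the use of the estimate is specified.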

Usually, estimates serve to compress information. Their job is to extract from a large set of data the pertinent pieces of information required to make a good decision. For example, the receiving circuitry of a radar gathers a very large amount of information about what objects are around it, but in a form which is too difficult for humans to process manually. The familiar graphical display produced by a radar results from processing the received signal and extracting the features we are interested in. This is true even of estimating the height of a tree. The full information is the complete sequence of images our eyes see as we look up at the tree; we compress this information into a single number that (we hope) is related to the height of the tree.

Initially then, there is no role for estimation theory. We have data (also commonly referred to as observations) and we wish to make an informed decision. A standard and widely applicable framework for making decisions is to determine first how to measure the goodness of a decision and then endeavour to construct a decision rule (which takes as input the available data and outputs the recommended decision to make) that can be shown, in a probabilistic framework, to make good decisions the majority of the time. A key point is that, theoretically, we should use all the data available to us if we wish to make the best decision possible. (Old habits die hard. It is tempting to reason thus: if I knew what the temperature would be tomorrow then I would know what clothes to pack; therefore, I will base my decision on my “best guess” of tomorrow’s temperature. This is not only sub-optimal, it is also ill-posed because the only way to define what a “best guess” is, is by starting with the decision problem and working backwards.)
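A toy version of the packing decision shows how the "best guess" shortcut can go wrong. The forecast probabilities, the loss table and the choice of "most probable temperature" as the best guess are all hypothetical, picked only to illustrate the argument above.

```python
# Hypothetical forecast: probability of each temperature (degrees) tomorrow.
forecast = {25: 0.7, 10: 0.3}

# Hypothetical losses, indexed as loss[action][temperature].
# Being caught in the cold with light clothes is assumed to be much worse
# than carrying warm clothes on a hot day.
loss = {
    "pack light": {25: 0.0, 10: 10.0},
    "pack warm": {25: 1.0, 10: 0.0},
}

def expected_loss(action):
    """Average loss of an action over the whole forecast."""
    return sum(p * loss[action][temp] for temp, p in forecast.items())

def plug_in_decision():
    """Shortcut: form a 'best guess' first, then act as if it were certain."""
    best_guess = max(forecast, key=forecast.get)  # most probable temperature
    return min(loss, key=lambda action: loss[action][best_guess])

def optimal_decision():
    """Decision rule: choose the action with the smallest expected loss."""
    return min(loss, key=expected_loss)

print("plug-in:", plug_in_decision(), expected_loss(plug_in_decision()))
print("optimal:", optimal_decision(), expected_loss(optimal_decision()))
```

Here the shortcut packs for the most likely temperature and ignores the costly unlikely one, whereas the rule built from the decision problem packs warm clothes. Choosing a different notion of "best guess" (the mean, say) would change the shortcut's answer, which is the sense in which it is ill-posed without the decision problem.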

There are two pertinent questions, one a special case of the other, caused by the undesirability of returning to the full set of data each time we wish to make another decision. (Imagine having to download the global weather observations and process them using a super-computer to decide what clothes to wear tomorrow, only to repeat this with a different decision-making algorithm to decide whether or not to water the garden.)

- Is there a satisfactory (but perhaps sub-optimal) method for processing the data into a compact and more convenient form allowing for many different decisions to be made more easily by virtue of being based only on this compact summary of the original data?
- Are there any conditions under which the data can be processed into a more compact form without inducing a loss of optimality in any subsequent decision rule?

In fact, the mathematics used in estimation theory is precisely the mathematics required to answer the above two questions. The mathematics is the same, the results are the same, but the *interpretation* is different. The true role of estimation theory is to provide answers to these questions. In many situations, though, this seems to have been forgotten or was never known.

The answer to the second question can be found in statistics textbooks under the heading of sufficient statistics. The rest of statistics, by and large, represents our endeavours to answer the first question. Indeed, we routinely go from data to an “estimator” to making a decision. When the Bureau of Meteorology forecasts tomorrow’s weather, they are doing precisely what is described in the first question.
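As a small illustration of the second question, consider a sequence of independent yes/no observations with an unknown success probability `p` (a Bernoulli model, assumed here purely as an example). The number of observations and the number of successes form a sufficient statistic: the likelihood computed from this two-number summary matches the likelihood computed from the full data for every value of `p`, so any decision rule based on the likelihood loses nothing by keeping only the summary. The data and function names below are made up for the sketch.

```python
import math

def log_likelihood_full(data, p):
    """Log-likelihood of success probability p using every observation."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

def summarise(data):
    """Compress the data to (number of observations, number of successes)."""
    return len(data), sum(data)

def log_likelihood_summary(summary, p):
    """Log-likelihood of p computed from the compact summary alone."""
    n, k = summary
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Made-up observations: 1 = success, 0 = failure.
data = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
summary = summarise(data)

for p in (0.2, 0.5, 0.8):
    full = log_likelihood_full(data, p)
    compact = log_likelihood_summary(summary, p)
    print(p, math.isclose(full, compact))
```

Ten numbers have been replaced by two without affecting any likelihood-based decision. When no such exact compression is available, an "estimator" plays the same role approximately.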

In light of the above discussion, I advocate thinking of “estimators” as “approximate sufficient statistics”. They serve to answer the first question above when a sufficiently convenient sufficient statistic (the answer to the second question) cannot be found or does not exist.

I hope to show in subsequent articles that shifting from thinking in terms of “estimators” to thinking in terms of “approximate sufficient statistics” leads to clarity of thought.

Hi Jonathan

The CREB app got me thinking about Bayesian inference, and I have been encountering a similar question from a different perspective.

I link this to a larger question about time and memory. Time is constructed by the formation of memory in the brain. Independent identical events (the very substance of probability) require a model of time (and causality) such that events can be construed as independent. ‘Identical’ requires categorization. Interestingly, we have identified two principal neural processing mechanisms – categorical (discrete) and continuous (spatial population codes) – that continuously interact to apply remembered relations to new signals. Categorical perception and long-term memory consolidation are your information compactors! The generation of temporal relations by sequential spatial coding of new information is critical in these mechanisms.

So it is impossible to apply a probability to a unique event! I have been cogitating on how to write about these ideas – perhaps we should discuss over lunch because there’s way too much to write!