## Information Geometry – An Affine Geometry of Exponential Families (Lecture 2)

Underlying information geometry is the question of what it means to put a geometry on a set of probability distributions. Ultimately, we will be viewing sets of probability distributions as manifolds with metrics and connections having statistical relevance. (Metrics and connections are Riemannian geometric concepts determining, among other things, how distance is measured and what curves should be thought of as “straight lines”, that is, curves with zero acceleration.) As a warm up exercise, we ask a considerably simpler question: Can we impose an affine geometry on (a significantly large subset of) the set of all probability distributions such that all exponential families of distributions appear as affine subspaces? (Definitions will be given presently.) Our only motivation for wishing to do this, beyond being a warm up exercise for considering what a “geometry” actually is, is that such a geometry leads to a simple geometric test for determining if a given family is exponential or not. On the one hand then, this geometry does have a use, albeit a relatively straightforward one, and there must be something a bit geometrically special about exponential families because there are other collections of families who could not have been shoehorned into an affine structure. On the other hand, it should not be interpreted as *the *geometry of probability distributions. There are other geometries to consider too.

In a vector space, an affine subspace is what results by taking a linear subspace and translating its origin to another point. The simplest example is a straight line. Recall that a straight line in a vector space is defined to be a curve of the form where and are two vectors from the space. What if we are given a set ? How can we define what a straight line is in ?

We could arbitrarily decree that certain collections of points in form straight lines but if these straight lines don’t “fit together” in a nice way, we wouldn’t have a systematic structure to work with, and our definition would not be a useful one. That said, there is not just a single structure which is the right structure for “straight lines”; the message is merely that some structure is required for our definition to be useful. (An example of structure for straight lines is that, in a vector space, two straight lines are either parallel or they intersect at precisely one point; this is because the geometry is Euclidean. There are non-Euclidean geometries in which straight lines have different properties.)

One way to define straight lines (in a useful way) on is to impose a (useful) vector space structure on then apply the aforementioned definition of a straight line on a vector space. A vector space has an origin though, so if represents a set of probability distributions, which probability distribution should we choose to be the origin? Sadly, there is no distinguished probability distribution which would serve well as an origin. (We will see presently there is a way round this though!) An alternative would be to make into a manifold, define a statistically meaningful connection on the manifold and then use this connection to form curves of zero acceleration which we could think of as straight lines (especially if we restrict attention to submanifolds which are flat with respect to the connection). It is this latter approach which will be pursued in subsequent lectures, but it is overkill for now.

Observe that the definition of a straight line in a vector space doesn’t actually depend on the choice of origin. This should be clear from a simple diagram but it is instructive to write this down rigorously. If is a straight line in the vector space formed from with the origin at then remains a straight line in the vector space formed from with the origin at . Although the vector space operations have changed – in the first vector space whereas in the second vector space – any set of the form in the first vector space can be written in the same form in the second vector space. [If this is not clear, use and to denote vector space addition and scalar multiplication in the second vector space and show that for any there exist such that . Hint: If denotes the origin of the second vector space then and .]

Roughly speaking, an affine space is a vector space who cannot remember where his origin is. The affine space captures all the structure it possibly can from the absent-minded vector space. It can capture the property of straight lines because we have just seen that these can be defined in a way which does not depend on where the origin actually is.

The sophisticated way to proceed would be to define a collection of coordinate charts, each chart mapping to a vector space, and every pair of charts related by an affine transformation. This would give us an affine geometry on . Instead though, we will proceed in a conceptually simpler but more arduous way. (In a subsequent lecture we will return to the sophisticated approach because that approach generalises to manifolds and other structures.)

Concretely, let be the set of all strictly positive and continuous functions on . It includes, for example, the functions which are the probability densities for Gaussian random variables with mean and variance . How can we put a (sensible) affine structure onto such that, among other things, the Gaussian random variables lie in a two-dimensional affine subspace of ? And firstly, what *is* an affine structure on a set? (We could have made larger by including all measurable functions but this is merely a distraction. The fact that contains functions which do not integrate to and are therefore not probability densities is a convenience which will turn out later not to affect things in any significant way.)

Taking away the origin from a vector space means vector space addition and scalar multiplication are no longer defined; they change as the origin changes, as we saw earlier. Intuitively, what does not change is our ability to define direction. For example, we can say that is arrived at by starting at and moving three steps North and two steps East. Precisely, we express this by saying that the statement does not depend on where the origin is; change the origin and both sides change by the same amount. The basic idea (which may or may not work) is that although individual points in an affine space cannot be thought of as vectors, perhaps their difference can be. [It is more efficient to guess that this might work then check that it really does than try to guarantee beforehand it will work.]

This motivates us to try putting an affine structure onto by introducing an auxillary vector space to represent the difference of any two points in . For every pair we must associate a point . [Precisely, we must define a function . Writing is shorthand for . ]

Merely defining does not endow with sufficient structure though. It turns out that we get the full amount of structure possible on by requiring to behave in the following sensible ways: holds for all ; implies ; for all and there exists a such that . (Note that “” is vector space addition in whereas “” is the peculiar operation defined earlier.) [To derive this, assume that for some unknown and write down all the properties you can think of that are independent of , then check what you end up with is sensible.]

Returning to the job at hand, that of defining an affine structure on , what works is the following. For , define . Statisticians will recognise the right-hand side as the log-likelihood ratio which is ubiquitous in statistical inference. This fulfils our requirements as now shown. Let be the vector space of all continuous functions (where vector addition and scalar multiplication are defined pointwise). Firstly, as required. If then so . Given a and define . [This means .] Then , thus completing the verification that the above axioms for an affine structure are satisfied. Importantly, the choice of the log-likelihood ratio means that the family of Gaussian distributions is affine, as now explained.

A straight line in is a collection of points of the form . Indeed, this is the line passing through in the direction . Generalising this, we define an affine subspace in to be a subset of the form for some linear subspace of . Take a particular Gaussian density, say . Let be the three-dimensional subspace of spanned by the basis functions , and . That is to say, the elements of are the quadratic polynomials . Choose an element . The axioms of an affine space ensure there is a unique such that . Indeed, a calculation reveals that where and . That is to say, provided we do not care about the normalising factor required to make a density integrate to unity, we are content to claim that the Gaussian family of unnormalised densities is a three-dimensional affine subspace of . [The third dimension is because the scaling factor can be freely chosen.] Importantly, every (unnormalised) exponential family forms an affine subspace of , and conversely, every (finite dimensional) affine subspace of is an exponential family. With brute force, we could have made the Gaussian densities affine in many different ways; the key feature is that we found a way which worked for all exponential families simultaneously. This is only possible because exponential families have an inherent geometric structure, for example, it is a prerequisite for our approach that the intersection of two exponential families is an exponential family because we know that the intersection of two affine subspaces is affine.

Working with unnormalised densities is not uncommon. Here, it is used to overcome the fact that normalised densities do not form an affine subspace in the same way that a circle, representing normalised vectors in a plane, is not a linear subspace. In fact, some further unpleasantries have been covered up. The assignment may lead to being negative (or undefined if ). The effect is that the resulting density integrates to infinity. Should we wish to pursue this path, what would save us is the result that the set on which the densities have a finite integral is convex. This is sufficiently nice to work with. The alternative is to switch to Riemannian geometry which is powerful enough to allow the existence of extrinsically curved surfaces which are intrinsically flat, the cylinder being just one example. Indeed, in a subsequent lecture we will see that there are statistically meaningful connections with respect to which the exponential families are flat.

The next lecture will show how this affine geometry can be used to derive a test for determining if a family is exponential or not. The eager reader may wish to consider how we could have worked out in advance that we should use the definition to define the affine geometry. The first chapter of Murray and Rice’s book on Differential Geometry and Statistics may prove useful.

## Measure-theoretic Probability: Still not convinced?

This is a sequel to the introductory article on measure-theoretic probability and accords with my belief that learning should not be one-pass, by which I mean loosely that it is more efficient to learn the basics first at a rough level and then come back to fill in the details soon afterwards. It endeavours to address the questions:

- Why a probability triple at all?
- What if is not a -algebra?
- Why is it important that is countably additive?

In addressing these questions, it also addresses the question:

- Why can’t a uniform probability be defined on the natural numbers ?

Consider a real-life process, such as the population of a family of rabbits at each generation . This gives us a countable family of random variables (Recall that countable means countably infinite; with only a finite number of random variables, matters would be simpler.) We can safely assume that if for some then the population has died out, that is,

*What is the probability that the population dies out?*

The key questions here are the implicit questions of how to actually define and then subsequently calculate this probability of extinction. Intuitively, we want the probability that there exists an such that When trying to formulate this mathematically, we may think to split this up into bits such as “does ?”, “does ?” and so forth. Because these events are not disjoint (if we know then we are guaranteed that ) we realise that we need some way to account for this “connection” between the random variables. Is there any better way of accounting for this “connection” other than by declaring the “full” outcome to be and interpreting each as a function of ? (Only by endeavouring to think of an alternative will the full merit of having an become clear.)

There are (at least) two paths we could take to define the probability of the population dying out. The first was hinted at already; segment into disjoint sets then add up the probabilities of each of the relevant sets. Precisely, the sets , , and so forth are disjoint, and we are tempted to sum the probabilities of each one occurring to arrive at the probability of extinction. This is an infinite summation though, so unless we *believe* that probability is countably additive (recall that this means for disjoint sets ) then this avenue is not available.

Another path is to recognise that the sets are like Russian dolls, one inside the other, namely This means that their probabilities, , form a non-decreasing sequence, and moreover, we are tempted to believe that should equal the probability of extinction. (The limit exists because the form a bounded and monotonic sequence.)

In fact, these paths are equivalent; if is countably additive and the are nested as above then and the converse is true too; if for any sequence of nested sets the probability and the limit operations can be interchanged (which is how the statement should be interpreted) then is countably additive.

Essentially, we have arrived at the conclusion that the only sensible way we can define the probability of extinction is to agree that probability is countably additive and then carry out the calculations above. *Without countable additivity, there does not seem to be any way of defining the probability of extinction in general.*

The above argument in itself is intended to complete the motivation for having a probability triple; the is required to “link” random variables together and countable additivity is required in order to model real-world problems of interest. The following section goes further though by giving an example of when countable additivity does not hold.

### A Uniform Distribution on the Natural Numbers

For argument’s sake, let’s try to define a “probability triple” corresponding to a uniform distribution on the natural numbers . The probability of drawing an even number should be one half, the probability of drawing an integer multiple of 3 should be one third, and so forth. Generalising this principle, it seems entirely reasonable to define to be the limit, as , of the number of elements of less than divided by itself. Since this limit does not necessarily exist, we solve it by declaring to be the set of all for which this limit exists.

It can be shown directly that is not a -algebra. In fact, it is not even an algebra because it is relatively straightforward to construct two subsets of , call them and , which belong to but whose intersection does not, that is, there exist for which .

Does behave nicely? Let and observe that and We know from the earlier discussion about extinction that it is very natural to expect that . However, this is not the case here; since each of the contain only a finite number of elements, it follows that . Therefore, the limit on the left hand side is zero whereas the right hand side is equal to one.

In summary:

- Countable additivity enables us to give meaning to probabilities of real-world events of interest to us (such as probability of extinction).
- Without countable additivity, even very basic results such as for nested need not hold. In other words, there are not enough constraints on for a comprehensive theory to be developed if we drop the requirement of being countably additive over a -algebra .

## Information Geometry – Summary of Lecture 1

In preparation for going through Amari and Nagaoka’s book on “Methods of Information Geometry”, it is valuable to consider the statistical world from the viewpoint of families of distributions. Often, questions that are of interest to us can be (re-)formulated in terms of families of distributions. In class, it was loosely explained that a Kalman filter, in wishing to track the location of an aeroplane, is actually keeping track of the evolution of a particular distribution over time.

From this viewpoint of families of distributions, one pertinent question is, given two distributions, how far apart are they? If we could answer this question by defining an interesting “statistical distance” between two distributions, we could start to build up a geometry. The challenge would be to arrange the distributions, considered as points in space, in a certain shape (on a torus or a sphere, perhaps?) so that the geometric distance between two distributions equalled their statistical distance. For example, the class of all non-degenerate Gaussian random variables can be described by the upper half plane in where we think of the horizontal axis as specifying the mean and the vertical axis as the variance . This is what we mean by a geometric representation, but it is deficient in that the geometric (Euclidean) distance between two distributions and is unrelated to any interesting statistical concept. Does there exist a transformation of the parameters (possibly sending them into a higher-dimensional space) so that the resulting shape of the Gaussian family has geometric significance (meaning that the distance between two points is related to a useful statistical concept)?

While the answer turns out to be “yes”, this is not the end of the story. There are important statistical concepts, Kullback-Leibler divergence being a prime example, which do not fit perfectly into a (Riemannian) geometric framework because they are not true distances. (The Kullback-Leibler divergence is not symmetrical; the Kullback-Leibler divergence (or “distance”) from one distribution to another is not necessarily equal to the Kullback-Leibler divergence in the other direction, from the latter distribution to the former.) Therefore, it is to be expected that new concepts in geometry will be developed for the express purpose of enabling a comprehensive geometric interpretation of statistics. (Indeed, Amari’s introduction of “dual connections” into geometry is such an example.) As is common in many areas of science, this leads to a symbiotic relationship where statistical thinking can lead to advances in geometry and geometric thinking can lead to advances in statistics.

In a nutshell, Information Geometry, at its most basic level, allows us to visualise geometrically:

- certain statistical concepts of significant interest to statisticians (Fisher information, Kullback-Leibler divergence, …);
- certain algorithms used by engineers (Expectation-Maximisation algorithm, turbo decoding, …).

These visualisations are useful because geometric concepts (straight lines, distances between points, projections) have statistical meaning, thereby allowing us to use geometry to reason about statistical problems. An additional perspective on a problem can do no harm and may even be beneficial.

## The Role of Estimates, Estimation Theory and Statistical Inference – Is it what we think it is?

The tenet of this article is that estimation theory is a means to an end and therefore cannot be sensibly considered in isolation. Realising this has pragmatic consequences:

- Pedagogical. When faced with solving a statistical problem, it becomes clearer how to proceed.
- Philosophical. A number of controversies and debates in the literature can be resolved (or become null and void).
- Interpretive. A clearer understanding is gained of how to interpret and use estimates made by others.

Forming estimates is ingrained in us; I estimate the tree is 5 metres high, there are 53 jelly beans in the jar and it will be 25 degrees tomorrow. This can draw us strongly to the belief that forming an estimate is something intrinsic, something that can be done in isolation. It suggests there should be a right way and a wrong way of estimating a quantity; perhaps even an optimal way. Succumbing to this belief though is counterproductive.

Once you have an estimate, what will you use it for? Putting aside the amusement or curiousity value some may attach to forming estimates, for all intents and purposes, an estimate is merely an intermediate step used to provide (extra) information in a decision making process. I estimated the height of the tree in order to know how much rope to buy, I estimated the number of jelly beans in the jar to try to win the prize by being the closest guess, and I estimated the temperature tomorrow to decide what clothes to pack. In all cases, the estimate was nothing more than a stepping stone used to guide a subsequent action.

In general, it is meaningless to speak of a good or a bad estimator because, without knowing what the estimate will be used for, there is no consistent way of ascribing the attribute “good” or “bad” to the estimator. The exception is if the estimator is a sufficient statistic, and indeed, it might be more intuitive if “estimators” were sometimes thought of as “approximate sufficient statistics”. All this will be explained presently.

The James-Stein Estimator exemplifies the assertion that it is generally not possible to declare one estimator better than another. Which is better is application dependent. Less striking examples come from situations where the penalties (in terms of making a bad decision) resulting from different types of estimation errors (such as under-estimation or over-estimation) can vary considerably from application to application.

Usually, estimates serve to compress information. Their job is to extract from a large set of data the pertinent pieces of information required to make a good decision. For example, the receiving circuitry of a radar gathers a very large amount of information about what objects are around it, but in a form which is too difficult for humans to process manually. The familiar graphical display produced by a radar results from processing the received signal and extracting out the features we are interested in. Even in estimating the height of a tree, this is true. The full information is the complete sequence of images our eyes see as we look up at the tree; we compress this information into a single number (we hope is) related to the height of the tree.

Initially then, there is no role for estimation theory. We have data (also commonly referred to as observations) and we wish to make an informed decision. A standard and widely applicable framework for making decisions is to determine first how to measure the goodness of a decision and then endeavour to construct a decision rule (which takes as input the available data and outputs the recommended decision to make) which can be shown, in a probabilistic framework, to make good decisions the majority of the time. A key point is that theoretically, we should use all the data available to us if we wish to make the best decision possible. (Old habits die hard. It is tempting to reason thus: If I knew what the temperature will be tomorrow then I know what clothes to pack, therefore, I will base my decision on my “best guess” of tomorrow’s temperature. This is not only sub-optimal, it is also ill-posed because the only way to define what a “best guess” is, is by starting with the decision problem and working backwards.)

There are two pertinent questions, one a special case of the other, caused by the undesirability of returning to the full set of data each time we wish to make another decision. (Imagine having to download the global weather observations and process them using a super-computer to decide what clothes to wear tomorrow, only to repeat this with a different decision-making algorithm to decide whether or not to water the garden.)

- Is there a satisfactory (but perhaps sub-optimal) method for processing the data into a compact and more convenient form allowing for many different decisions to be made more easily by virtue of being based only on this compact summary of the original data?
- Are there any conditions under which the data can be processed into a more compact form without inducing a loss of optimality in any subsequent decision rule?

In fact, the mathematics used in estimation theory is precisely the mathematics required to answer the above two questions. The mathematics is the same, the results are the same, but the *interpretation* is different. The true role of estimation theory is to provide answers to these questions. There are many situations where it seems that this has been forgotten or is not known though.

The answer to the second question can be found in statistical textbooks under the heading of sufficient statistics. The rest of statistics, by and large, represents our endeavours to answer the first question. Indeed, we routinely go from data to an “estimator” to making a decision. When the Bureau of Meteorology forecasts tomorrow’s weather, they are doing precisely what is described in the first question.

In virtue of the above discussion, I advocate thinking of “estimators” as “approximate sufficient statistics”. They serve to answer the first question above when a sufficiently convenient sufficient statistic (the answer to the second question) cannot be found or does not exist.

By shifting from thinking in terms of “estimators” to “approximate sufficient statistics”, I hope to show in subsequent articles that this leads to clarity of thought.

## Are your research aims the right aims?

Defining the aims of a research project, whether it be for a Masters or PhD thesis, or even a major grant application, requires considerable thought. It is more difficult than it looks for several distinct reasons.

First, one must define a general research area in which to work. The hardest part about research is finding the right problem to work on. The problem needs to be hard enough and interesting enough for others to recognise the value of your contributions. Yet if it is too hard, it would not be possible to make a sufficiently worthwhile contribution in a reasonable amount of time. Several common strategies are:

- The entrepreneurial approach. Identify a problem that has not been considered yet by your peers but which you believe will attract attention once people become aware of it. Looking backwards, it is easy to identify areas of work which were (or are) a hot topic and for which the barrier to entry was not great; it was the right problem at the right time to consider.
- The arbitrage approach. Different disciplines can have different approaches for tackling similar problems; it is not uncommon for one discipline to be completely unaware of related advances in another discipline. Look for how your expertise can be applied outside your discipline in a novel way. (Breakthroughs in an area often come about by the introduction of ideas originally foreign to that area.)
- The team approach. Find a world-class team of experts working on a grand challenge problem and join them. This has many benefits. It will serve as a constant supply of interesting research problems to work on and your research will have greater impact because the team as a whole will work towards consolidating contributions into even bigger ones.
- The tour de force approach. Keeping in mind that even great research is the result of 1% inspiration and 99% perspiration, if you have the determination and single-mindedness to spend many months chipping away at a difficult problem, chances are that you will be rewarded.

Next, it must be appreciated that identifying an area in which to work is just the first step in defining your research aims. Wanting to develop a better algorithm for a certain signal processing problem is not a tangible goal that you can work towards unless you have defined what you mean by better. It is not uncommon for this step to be omitted, with the hope that a new idea will lead to a new algorithm which will be found serendipitously to have some advantage over previous approaches. Yet there is little to recommend this rush-in approach since invariably the small amount of time saved at the beginning is more than lost by the lack of focus and direction during the remainder of the project. Furthermore, few funding agencies would be willing to fund such a cavalier approach, just as few investors would be prepared to invest in a start-up without a business case. It might succeed, but experience suggests it would be a bigger success if time is invested up front to think carefully about what it is you really want to solve. Measure nine times, cut once.

Research aims should be:

- Tangible and measurable. An independent person should be able to come along and judge whether or not you have made significant progress.
- Outcome focused, not output focused. Writing reports and journal papers are outputs, but if no one uses or builds on your work, you have not achieved an
*outcome*. - Achievable. You must be able to give a credible argument as to why you will be able to achieve your aims in the allocated time frame.
- Hierarchical. A hierarchical structure allows for ambitious aims at the top level with shorter-term goals leading up to them. This is generally seen to be:
- focused and efficient – your work builds on itself and grows into something substantial rather than remaining a diverse collection of less substantial contributions;
- mitigating risk – by having lower level aims which are perceived as achievable then there is less concern that you may not achieve all of your more ambitious aims;
- outcome orientated – focused work on ambitious (and appropriately chosen) aims is perhaps the best way of maximising the potential outcomes and impact of your work.

Supporting the research aims should be statements on the significance and innovation of the proposed research and an explanation of the approach and methodology which will be adopted. All this should be described in the context of previous and current international work in the relevant areas.

## One is a boy born on a Tuesday…

My colleague brought this intriguing probability question to my attention which, as the comments following the article show, has a controversial answer. I believe the reason for this is that there is more than one way to model the question in a random framework.

“I have two children. (At least) one is a boy born on a Tuesday. What is the probability I have two boys?”

The puzzle teller then asks, “What has Tuesday got to do with it?”

(We can assume that the probability of a baby being born a male is 1/2.)

Being pragmatic, to answer the above question, we need to think of a way of (at least conceptually) repeating the “experiment” many times, so that we can deduce what the long term average of there being two boys is.

**Model A**: Look at all families in this world who have two children, (at least) one of them a boy born on a Tuesday. How many of these families have two boys?

**Model B**: Look at all families in this world who have two children. Pick one of the children uniformly at random and state the sex and weekday of birth of that child. How many of these families will have both children of the same sex?

**Model C**: Look at all families in this world who have two children. Pick one of the children uniformly at random and state the sex and weekday of birth of that child. How many of these families will have two boys?

**Model D**: Look at all families in this world who have two children. Pick one of the children uniformly at random and state the sex and weekday of birth of that child. Out of only those families who stated they have a boy, how many will have two boys?

**Model E**: Look at all families in this world who have two children, (at least) one a boy. State the weekday of birth of the boy (or if there are two boys, pick one at random and state his weekday of birth). How many of these families will have two boys?

In Models B to D, the term “uniformly at random” is intended to mean that there is equal chance of picking either child. It should be compared with other choices, such as “pick the younger child” or “pick the boy if there is a boy”.

The puzzle teller is actually interested in Model A, and indeed, the answer is an interesting one in this case, as elaborated on below. The controversy over this puzzle arises because people, wittingly or otherwise, consider other models, especially Model E above. Indeed, *we need to ask for clarification from the puzzle teller on how and why he chose to tell us that one of his children was a son born on a Tuesday* before we can solve this puzzle without making any additional assumptions. (That said, the puzzle teller would presumably argue that Model A is the most sensible and therefore the default interpretation of the question. Note though that by being asked what the relevance of Tuesday is, the audience will start to think along the lines of Model E, thus fueling the controversy.)

In Model B, whether or not we learn the sex and weekday of birth of one of the children is irrelevant to the question being asked; one half of all two-children families will have both children of the same sex. Similarly, in Model C, the answer is 1/4 because we are considering the proportion of all families. In Model D, all families with two girls are discarded and there is 50% chance that families with one boy and one girl are discarded, while all families with two boys remain. Therefore, 1/2 of all remaining families will have two boys; it is irrelevant to the question being asked that we are told on which day of the week the boy was born. The answer to Model E is 1/3; learning the weekday of birth is again irrelevant to the question we have been asked.

The difference between Model A and E is that in Model A, we have discarded all families who do not have a son born on a Tuesday, whereas in Model E, we merely learn of the weekday of birth of the son. The distinction is crucial! Would we always be told the same information, or would the information we are told change from experiment to experiment?

### The Answer to Model A

The fail-safe method for answering any probability question concerning only a finite number of outcomes is to enumerate all the outcomes. It may not be illuminating on its own but it is rigorous. The first child could be a boy or a girl (2 possibilities) and it could have been born on any day of the week (7 possibilities). There are therefore 14 possible outcomes. Same for the second child, leading to a total of 14 times 14 equally likely outcomes (but we don’t need to know this number). What information does knowing that “one of the children is a boy born on Tuesday” tell us? Well, it fixes precisely the outcome of either the first child or the second child. To avoid double counting, we need a little care; we consider the three non-overlapping cases:

- First child is a boy born on a Tuesday, second child is not a boy born on a Tuesday
- First child is not a boy born on a Tuesday, second child is a boy born on a Tuesday
- Both children are boys born on a Tuesday

There are 13 possibilities for the first case with 6 being that the second child is a boy; 13 possibilities for the second case with 6 being that the first child is a boy; and only 1 possibility for the third case. So the probability that the other child is a boy is (6+6+1) / (13+13+1) = 13/27.

### An Explanation

In Model A, we are filtering out families that do not meet the criteria of i) at least one boy; and ii) at least one boy born on a Tuesday. What is the difference between filtering out just on (i) versus filtering out on both (i) and (ii)? They key point is that (ii) will filter out more families with just one boy than it will filter out families with two boys. For every family with one and only one boy, we will discard 6 out of 7 of them (equivalently, 42 out of 49) because the boy will fail to be born on a Tuesday. For families with two boys though, they have a greater chance at passing the Tuesday test because, roughly speaking, they get two goes at it! We will discard only 36 out of 49 such families.

Consider a pool of families having at least one boy. We may assume that 2/3 of them have exactly one boy, that is 98 families have exactly one boy and 49 have two boys. Now, we discard from the 98, 42 out of every 49, that is, we discard 84. So by discarding families not having a son born on a Tuesday, we are left with 14 families having exactly one boy. From the two-boy families, we discard 36 out of 49, so we are left with 13 two-boy families. Therefore, 13 out of families having at least one boy born on a Tuesday will have two boys.

### Moral

Probability (and mathematics in general) sometimes gives counter-intuitive results. We choose to accept these results because they are founded on axioms and logic which experience has shown to be extremely useful and conceptually very intuitive. The trick then is to improve our intuition based on such examples so we are better prepared in the future.

Another moral is that English is imprecise. The puzzle is controversial because it can be interpreted in different ways, and in particular, I speculate that although we feel that we understand the question, our subconscious mind doesn’t quite know which way is best to understand it, so it sporadically jumps from one interpretation to another the more we think about it, just as it does in the young woman / old woman optical illusion where the mind sporadically jumps between seeing a young woman or seeing an old woman. (Rather than say “I have two children..”, the puzzle teller should have said, “Pick at random a two-child family who has a son born on a Tuesday.” It is not as catchy though this way…)

## Comments on James-Stein Estimation Theory

Part of the reason for this short article is to provide an example which will be relied on in subsequent articles arguing that:

- There are
*a priori*no such things as estimation problems, only decision problems. - The Bayesian-Frequentist debate is a nonsense (because it is ill-founded).

The James-Stein Estimator has intrinsic interest though, and indeed, has been heralded by some as the most striking result in post-war mathematical statistics.

#### Reference

References to the literate can be found in the bibliography of the following paper:

Manton, J.H., Krishnamurthy, V. and Poor, H.V. (1998). James-Stein State Filtering Algorithms. IEEE Transactions on Signal Processing, 46(9) pp. 2431-2447.

### Introduction

Write to denote that the real-valued random variable has a Gaussian distribution with unknown mean and unit variance. It is accepted that the “best” estimate of given is simply . (Without loss of generality it can be assumed we have only a single observation; multiple observations can be averaged, which will reduce the variance, but not change the essence of the discussion to follow.)

Assume now that (and therefore ) is a three-dimensional real-valued vector and where is the identity matrix. In words, each of the three elements of is a Gaussian random variable with unknown mean and unit variance. Importantly, the three elements of are independent of each other.

Prior to the James-Stein estimator, every self-respecting statistician would have argued that estimating the means of three *independent* random variables is equivalent to estimating the mean of each one in isolation, and in particular, it must follow that must remain optimal in this three-dimensional case.

This is not necessarily true though. If we are interested in minimising the mean-square error of our estimate then while is optimal in the one- and two-dimensional cases, the following estimator is always better in the three-dimensional case: . (An obvious improvement, but harder to analyse, is to set the term in brackets to zero whenever it would otherwise be negative.)

This is an example of a shrinkage estimator. All it does is take the normal estimate of the mean , and shrink it towards the origin by multiplying it by the scalar .

The resulting Mean-Square Error (MSE) of the James-Stein estimator has been graphed in the figure on the James-Stein wiki page. Regardless of the true value of , the MSE of the James-Stein estimator is always lower than the MSE of the usual estimator .

It is worthwhile emphasising how the performance of the estimators is being assessed. A graph is drawn with along the horizontal axis, representing the true value of what it is we wish to estimate. (Conceptually, should appear along the horizontal axis but this is a little tricky since is three-dimensional. Fortunately, it turns out that the graph only depends on the norm of .) For a fixed value of , imagine that a computer generates very many realisations of , and for each realisation, is calculated and the error recorded. The Mean-Square Error (MSE) of the estimate is the average of these errors as the number of realisations goes to infinity. The MSE is graphed against . (It can be shown that the MSE depends only on the magnitude of and not on its direction.) The claim that the James-Stein Estimator is superior than the usual estimator means that, regardless of the value of , the resulting MSE is smaller for the James-Stein Estimator.

### A Paradox?

Popular articles have appeared hailing the James-Stein estimator a paradox; one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne!

It is not a paradox for the simple reason that even though the three random variables (that is, the three elements of ) are independent, the measure of the performance of the estimator is not. Definitely, the James-Stein estimator will not improve the estimate of all three means at once; that would be impossible. What the James-Stein Estimator does is gamble; it gambles that by guessing that all three means are closer to the origin than the observations suggest, the possibly enlarged error it makes on estimating one or two of the means is more than compensated for by the reduction in error that it achieves on the other one or two means.

It must be recognised that the James-Stein estimator is good for only some applications; generally, the normal estimate is preferable. (There are several explanations for this; one is that the James-Stein estimator trades bias for risk and it is this bias which is often undesirable in applications. A simpler explanation is that if three random variables are independent of each other, then quite likely, what is actually required in practice is an estimate of their means which is accurate for each and every one of the three random variables.)

The James-Stein estimator is good when it is truly the case that it is the overall Mean-Square Error (and not the individual Mean-Square Errors) that should be minimised. For example, if for represents the financial cost of claims a multi-national insurance company will incur in the next year in three different countries, the company may be less concerned with estimating the values of the individual accurately and more concerned with getting an accurate overall estimate. Therefore, it may well choose to use the James-Stein Estimator.

Why shrink the estimate to the origin (or to some other point, which will also work)? One way to derive the James-Stein estimator is as an empirical Bayes estimate. If were a Gaussian random variable with zero mean and variance then the optimal estimate would indeed shrink the observation towards the origin by a factor depending on . Replacing by (a suitable function of) results in the James-Stein estimator; the sample variance of the observations serves as a proxy for .

### Moral

The moral of the story is that there is no such thing as an optimal estimator; which estimator is “good” depends on the application. For the aforementioned multi-national insurance company, the James-Stein Estimator is preferable. For most other applications, is best.

Subsequent articles will elaborate on the key message that there are *a priori* no such things as estimation problems, only decision problems.