
Sets of Measure Zero in Probability

June 28, 2012

Probability is unique in that, on the one hand, its axioms rely on advanced mathematics, yet on the other hand, it is not only used across all areas of science but also comes up in everyday conversation, especially when the topic is gambling or tomorrow’s weather.  I suspect this is the main reason why one does not have to look very far to find vehement debates about aspects of probability and statistics, whether it be Bayesians versus non-Bayesians, the incorrect application of hypothesis testing in clinical trials, or, of relevance to this article, Borel’s paradox.

When learning probability, people tend to worry about what would happen if an outcome having zero probability of occurring actually occurs.  It does not help that books often use the notation p(x \mid y) to mean the probability of the random variable X taking the value x given that the random variable Y was observed to equal y.  It is not long before this notation is used when Y has a continuous distribution, so that the probability that Y exactly equals y is zero.  (This occurs even though the book will have warned the reader earlier that one cannot condition on sets of measure zero, that is, sets having zero probability of occurring.)

The beauty of Kolmogorov’s axioms of probability theory is that they sidestep irrelevant but otherwise complicating issues.  This is often not appreciated because many textbooks choose to work with classical notation such as p(x \mid y) rather than with measure-theoretic notation such as E[ g(X) \mid \sigma(Y)] where g is a test function and \sigma(Y) is the \sigma-algebra generated by Y.

This article endeavours to:

  • emphasise that p(x \mid y) is an abuse of notation and must be used with care;
  • emphasise that we should not use Euclidean distance when thinking in terms of probabilities (i.e., while there is not much difference between something having a chance of 0.1 or 0.0999999 of occurring, there is an infinite difference between something having a chance of 10^{-23} and a chance of 0 of occurring);
  • conclude that it would be more surprising if Borel’s paradox did not exist.

Tangentially, my observation in this field is that there are essentially just two principles that need to be kept in mind in order to be able to avoid/resolve all debates:

  • Think first about the finite case, that is, the case when the total number of possible outcomes is finite.  Then realise that the infinite case is a mathematical construct that should be understood as a “limit” of the finite case, but that sometimes different limits can give different answers. The resolution then is to go back to the original question in order to determine which limit is the right one (or conclude that the original question is ill-posed because different choices of limits will give different answers).
  • Never think purely in terms of estimation; always think in terms of the resulting decisions that may be taken in response to being told that estimate.

The standard theory of probability as we know it is designed for answering questions (filtering, hypothesis testing, etc.) based on the underlying assumption that the quality of the answers will be judged by some sort of expectation.  For example, with respect to the frequentist interpretation, this would mean that we would want to design a filter so that, if the filter were applied every day to a new set of data, then on average, the filter would perform as well as possible (perhaps in the sense of minimising the mean-squared error).  The presence of the expectation operator is crucial, for reasons which will become clear presently.

If the set of possible outcomes is finite, and if we agree that for all intents and purposes we are only interested in what happens on average (e.g., designing a filter so it works well on average), then there is no loss of generality in assuming that all outcomes have a non-zero chance of occurring.  In this case, conditional probability p(x \mid y) is well-defined.
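A minimal sketch of this point in the finite case: the joint distribution below is a made-up example (all names and numbers are assumptions for illustration) in which the outcome y = 2 has probability zero. Whatever is chosen for p(x \mid 2), the expectation computed by conditioning on Y comes out the same, so nothing is lost by discarding such outcomes.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(x, y) on a finite set; y = 2 has probability zero.
joint = {
    (0, 0): F(1, 4), (1, 0): F(1, 4),
    (0, 1): F(1, 8), (1, 1): F(3, 8),
    (0, 2): F(0),    (1, 2): F(0),
}
xs = (0, 1)
ys = (0, 1, 2)


def marginal_y(y):
    """p(y), obtained by summing the joint distribution over x."""
    return sum(joint[(x, y)] for x in xs)


def conditional(x, y, fallback):
    """p(x | y) where p(y) > 0; an arbitrary `fallback` distribution where p(y) = 0."""
    py = marginal_y(y)
    return joint[(x, y)] / py if py > 0 else fallback[x]


def expectation_of_g(g, fallback):
    """E[g(X)] computed as sum_y p(y) * sum_x g(x) p(x | y)."""
    return sum(
        marginal_y(y) * sum(g(x) * conditional(x, y, fallback) for x in xs)
        for y in ys
    )


def g(x):
    """An arbitrary test function."""
    return 10 * x + 1


# Two incompatible choices for the undefined conditional p(x | y = 2).
value_a = expectation_of_g(g, fallback={0: F(1), 1: F(0)})
value_b = expectation_of_g(g, fallback={0: F(0), 1: F(1)})
direct = sum(g(x) * p for (x, _), p in joint.items())

print(value_a, value_b, direct)
```

The two fallback choices disagree completely about p(x \mid 2), yet all three printed values coincide because the disputed term is always multiplied by p(2) = 0.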

In general, when there are sets of measure zero, one must always remember that according to the rigorous theory, one cannot condition on a set of measure zero. There is a simple reason why: we are not given enough information to be able to do this in any meaningful way.  A simple example will illustrate this.  Assume that I pick a point on a circle uniformly at random.  Choose two distinct points, p and q, on the circle.  Intuitively at least, no harm or inconsistency will come if I conclude that, given that I know that the outcome is either the point p or the point q, the probability that it was p is precisely one half; this accords with our intuitive understanding of “uniformly at random”. However, assume that instead I use the following procedure to pick a point on the circle: first I choose a point x uniformly at random, then I check if it is p.  If it is p, I declare my choice to be q, otherwise I declare my choice to be x.  So now, it seems intuitively clear that the probability that the outcome is p given that it is either p or q is zero. Since in Kolmogorov’s formulation of probability these two scenarios have identical descriptions (the two procedures differ only on an event of probability zero), it follows that it is impossible to know what the conditional probability is of observing p given that either p or q was chosen.  This is not a shortcoming but an advantage of Kolmogorov’s framework! As long as we agree that we are working inside an expectation operator (remember the earlier discussion) then such questions are irrelevant.  While the theory could conceivably be expanded to allow meaningful answers to be given in certain situations, would it have any practical value?

The appearance of p(x \mid y) is an abuse of notation.  The advantage is that it looks easier than the measure-theoretic approach.  The disadvantage is that wrong answers can be obtained if one is not aware of its correct interpretation; see M. Proschan and B. Presnell, “Expect the unexpected from conditional expectation,” Am Stat, vol. 52, no. 3, pp. 248–252, 1998.

The fact of the matter is that (when it exists), p(x \mid y) is not a function, but an equivalence class of functions.  To specify an equivalence class, we must write down a representative function belonging to that equivalence class, hence p(x \mid y) will always look like a function, e.g., p(x \mid y) = (2\pi)^{-1/2}e^{-(x-y)^2/2}.  It is easy to start thinking that it should be a function.  And some simple examples can be written down where there seems to be only one obvious choice of p(x \mid y).  But p(x \mid y) is not a function, and Borel’s paradox is the reason why.

Precisely, p(x \mid y) is a Radon-Nikodym derivative.  It satisfies Pr\{X \in A, Y \in B\} = \int_B p(y)\, \int_A p(x \mid y)\,dx\,dy for all measurable sets A and B.  If p(x \mid y) exists then it will not be unique because integration is blind to what happens on sets of measure zero.  Therefore, p(x \mid y) should only appear inside an integral, since it is only averaged versions of p(x \mid y) that are well-defined. Hence the importance of the expectation operator mentioned earlier; at the end of the day, we are always working under an integral sign, even if it is not always explicitly written down. Therefore, although it may look like p(x \mid y) is being treated as a well-defined function, it isn’t, because there is an implicit if not explicit expectation somewhere.
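As a sketch of why this non-uniqueness is harmless under the integral, take the Gaussian example p(x \mid y) = (2\pi)^{-1/2}e^{-(x-y)^2/2} and alter it at the single point y = 2. The altered version q, and the choice of the uniform density on [0,1] at y = 2, are illustrative assumptions, as is the assumption that Y has a density p(y) with respect to Lebesgue measure.

```latex
\[
q(x \mid y) =
\begin{cases}
(2\pi)^{-1/2} e^{-(x-y)^2/2}, & y \neq 2, \\
\mathbf{1}_{[0,1]}(x), & y = 2,
\end{cases}
\]
\[
\int_B p(y) \int_A q(x \mid y)\,dx\,dy
= \int_{B \setminus \{2\}} p(y) \int_A p(x \mid y)\,dx\,dy
= \int_B p(y) \int_A p(x \mid y)\,dx\,dy .
\]
```

Both equalities use only that the single point \{2\} has Lebesgue measure zero, so p and q satisfy the defining identity equally well even though p(1 \mid 2) \neq q(1 \mid 2).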

It is both interesting and wonderful that it is possible to define an equivalence class of functions in such a way that pointwise evaluation is meaningless yet as soon as there is an integral sign then everything is well-defined.  This is at the heart of Borel’s paradox.  And unless one is already familiar with working with equivalence classes from other areas of mathematics, it may well take a bit of getting used to.  In other areas, square brackets are sometimes used to denote equivalence classes.  It might be clearer then to write [p(x \mid y)] to remind ourselves that [p(1 \mid 2)] is not defined if Pr\{Y=2\}=0; different members of the same equivalence class need not agree on sets of measure zero. Provided we always work inside an integral, this is not an issue.
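To see concretely why pointwise evaluation on a null event is meaningless, here is a toy analogue of the paradox rather than its classical form. It assumes X and Y are independent and uniform on (0,1), and approximates the null event \{X = Y\} in two ways, by a thin band |X - Y| < \epsilon and by a thin wedge |X/Y - 1| < \epsilon. The function name, sample size, seed and value of \epsilon are all hypothetical choices made for the illustration.

```python
import random


def conditional_means(eps=0.02, n=400_000, seed=0):
    """Monte Carlo estimates of E[X | event] for two events that both
    shrink to the null event {X = Y} as eps -> 0."""
    rng = random.Random(seed)
    band_sum, band_count = 0.0, 0
    wedge_sum, wedge_count = 0.0, 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        # Band around the diagonal: constant width, whatever x is.
        if abs(x - y) < eps:
            band_sum += x
            band_count += 1
        # Wedge around the diagonal: width grows in proportion to x.
        if y > 0.0 and abs(x / y - 1.0) < eps:
            wedge_sum += x
            wedge_count += 1
    return band_sum / band_count, wedge_sum / wedge_count


if __name__ == "__main__":
    band_mean, wedge_mean = conditional_means()
    print(f"E[X | |X - Y| < eps]   ~ {band_mean:.3f}")
    print(f"E[X | |X/Y - 1| < eps] ~ {wedge_mean:.3f}")
```

Under these assumptions the band estimate settles near 1/2 and the wedge estimate near 2/3, so "the expectation of X given X = Y" has no single answer until one says which limit is meant.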

Although there is merit in defining a probability to be a number between 0 and 1 inclusive, one should be careful not to think in terms of the usual Euclidean metric.  In some situations, intuition would be better guided if a probability of zero was thought of as -\infty and a probability of one was thought of as +\infty.  (One way to achieve this is to send a probability p \in (0,1) to the real number \ln(p) - \ln(1-p).)  The reason can be seen by thinking in terms of examining long sequences of outcomes.  If one biased coin had a probability of 0.001 of coming up heads, and another had a probability of 0.0001, then what we readily notice is the factor of ten; we would expect to get ten times as many heads with the first coin as with the second.  Even though the absolute difference |0.001-0.0001| is small, the large ratio 0.001 / 0.0001 makes the expected sequences very different.  Similarly, a coin with a chance of 10^{-23} of heads is infinitely different from a coin with no chance of heads; essentially any countably infinite sequence of outcomes of the first coin will have infinitely many heads whereas essentially any countably infinite sequence of outcomes of the second coin will have no (or at worst a finite number of) heads. These behaviours are very different!
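The change of scale can be sketched numerically. The helper names below are hypothetical; the first implements the map \ln(p) - \ln(1-p) from the text, extended to send 0 and 1 to -\infty and +\infty, and the second computes the chance of seeing at least one head in n independent tosses.

```python
import math


def logit(p):
    """Map a probability in [0, 1] to the extended real line via ln(p) - ln(1 - p)."""
    if p == 0.0:
        return -math.inf
    if p == 1.0:
        return math.inf
    return math.log(p) - math.log1p(-p)


def prob_at_least_one_head(p, n):
    """P(at least one head in n independent tosses) = 1 - (1 - p)^n,
    evaluated with log1p/expm1 so that tiny p does not round away."""
    if p == 0.0:
        return 0.0
    if p == 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    # Nearly equal probabilities stay nearly equal on the new scale.
    print(logit(0.1) - logit(0.0999999))
    # 1e-23 sits at a finite point, while 0 is infinitely far away.
    print(logit(1e-23), logit(0.0))
    # With enough tosses a 1e-23 coin almost surely shows a head; a 0 coin never does.
    print(prob_at_least_one_head(1e-23, 1e25), prob_at_least_one_head(0.0, 1e25))
```

On this scale 0.1 and 0.0999999 are about a millionth apart, 10^{-23} lands near -53, and 0 lands at -\infty, which matches the claim that no positive probability is a good approximation to zero.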

In other words, in many cases one would expect a form of consistency near p = 0.5: if an experiment involving the tossing of a coin were undertaken repeatedly and observations were made by averaging over many trials, then the difference in observations between an unbiased coin and a biased coin with probability p of heads could be made arbitrarily small by taking p arbitrarily close to 0.5. The preceding paragraph implies that there is no reason whatsoever why such consistency should be observed between p \rightarrow 0 and p = 0.  No matter how close p is to zero, it is still “infinitely far away” from zero in terms of expected behaviour.  This is the simple reason why Borel’s paradox is not surprising; we should not expect any form of continuity to hold in the limit as p goes to zero.  And as soon as there is no continuity, the possibility exists for different sequences to have different limits: just think of the function f(x) = \sin(1/x) and what happens as x \rightarrow 0.  For any y \in [-1,1], a sequence \{x_n\} can be constructed such that x_n \rightarrow 0 and f(x_n) \rightarrow y. Borel’s paradox is just another demonstration that different sequences can give different answers when continuity does not hold.
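One possible construction of such a sequence, offered as a sketch, is x_n = 1 / (\arcsin(y) + 2\pi n) for n = 1, 2, \ldots, which gives f(x_n) = y exactly for every n. The function names are hypothetical.

```python
import math


def f(x):
    """The function sin(1/x), which has no limit as x -> 0."""
    return math.sin(1.0 / x)


def sequence_with_limit(y, n_terms=5):
    """Return x_n = 1 / (asin(y) + 2*pi*n) for n = 1, ..., n_terms.

    The terms decrease to 0 as n grows, while f(x_n) = y for every n
    (up to floating-point rounding)."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("y must lie in [-1, 1]")
    base = math.asin(y)
    return [1.0 / (base + 2.0 * math.pi * n) for n in range(1, n_terms + 1)]


if __name__ == "__main__":
    for target in (-1.0, 0.0, 0.3, 1.0):
        points = sequence_with_limit(target)
        print(target, [round(f(x), 6) for x in points])
```

Every target value in [-1,1] is reached along its own sequence tending to zero, so asking for "the" value of f at 0 by taking a limit is ill-posed until the sequence is specified.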

In summary:

  • There is no reason for there to be any sort of “continuity” when approaching the extremes of probability 0 or probability 1 because something with probability 10^{-23} of happening is still infinitely more likely to happen than something with probability 0 and hence does not serve as a good approximation.
  • In practice, there is always an expectation operator inside of which we are working. Therefore, while the notation p(x \mid y) may suggest we are conditioning on sets of measure zero, we are not. We are always working with equivalence classes.
  • The beauty of Kolmogorov’s axioms is that the lack of continuity that occurs when approaching the extremes of probability can be neatly sidestepped.
  • Although in some simple examples it would be possible to extend Kolmogorov’s axioms so that conditioning on sets of measure zero were allowed, would this have any practical use?  For a start, it would require that more information be provided than the standard probability triple (\Omega,\mathfrak{F},\mathcal{P}).