Sets of Measure Zero in Probability

Probability is unique in that its axioms rely on advanced mathematics, yet it is used across all areas of science and even comes up in everyday conversation, especially when the topic is gambling or tomorrow’s weather.  I suspect this is the main reason why one does not have to look very far to find vehement debates about aspects of probability and statistics, whether Bayesians versus non-Bayesians, the incorrect application of hypothesis testing in clinical trials, or, of relevance to this article, Borel’s paradox.

When learning probability, people tend to worry about what would happen if an outcome having zero probability of occurring actually occurs.  It does not help that books often use the notation p(x \mid y) to mean the probability of the random variable X taking the value x given that the random variable Y was observed to equal y.  It is not long before this notation is used when Y has a continuous distribution, so that the probability that Y exactly equals y is zero.  (This occurs even though the book will have warned the reader earlier that conditional probability cannot be defined on sets of measure zero, that is, sets having zero probability of occurring.)

The beauty of Kolmogorov’s axioms of probability theory is that they sidestep irrelevant but otherwise complicating issues.  This is often not appreciated because many textbooks choose to work with classical notation such as p(x \mid y) rather than with measure-theoretic notation such as E[ g(X) \mid \sigma(Y)] where g is a test function and \sigma(Y) is the \sigma-algebra generated by Y.

This article endeavours to:

  • emphasise that p(x \mid y) is an abuse of notation and must be used with care;
  • emphasise that we should not use Euclidean distance when thinking in terms of probabilities (i.e., while there is not much difference between something having a chance of 0.1 or 0.0999999 of occurring, there is an infinite difference between something having a chance of 10^{-23} and a chance of 0 of occurring);
  • conclude that it would be more surprising if Borel’s paradox did not exist.

Tangentially, my observation in this field is that there are essentially just two principles that need to be kept in mind in order to be able to avoid/resolve all debates:

  • Think first about the finite case, that is, the case when the total number of possible outcomes is finite.  Then realise that the infinite case is a mathematical construct that should be understood as a “limit” of the finite case, but that sometimes different limits can give different answers. The resolution then is to go back to the original question in order to determine which limit is the right one (or conclude that the original question is ill-posed because different choices of limits will give different answers).
  • Never think purely in terms of estimation; always think in terms of the resulting decisions that may be taken in response to being told that estimate.

The standard theory of probability as we know it is designed for answering questions (filtering, hypothesis testing etc) based on the underlying assumption that the quality of the answers will be judged by some sort of expectation.  For example, with respect to the frequentist interpretation, this would mean that we would want to design a filter so that, if the filter were applied every day to a new set of data, then on average, the filter would perform as well as possible (perhaps in the sense of minimising the mean-squared error).  The presence of the expectation operator is crucial, for reasons which will become clear presently.

If the set of possible outcomes is finite, and if we agree that for all intents and purposes we are only interested in what happens on average (e.g., designing a filter so it works well on average), then there is no loss of generality in assuming that all outcomes have a non-zero chance of occurring.  In this case, conditional probability p(x \mid y) is well-defined.

In general, when there are sets of measure zero, one must always remember that, according to the rigorous theory, one cannot condition on a set of measure zero. There is a simple reason why: we are not given enough information to be able to do this in any meaningful way.  A simple example will illustrate this.

Assume that I pick a point on a circle uniformly at random.  Choose two distinct points, p and q, on the circle.  Intuitively at least, no harm or inconsistency will come if I conclude that, given that I know that the outcome is either the point p or the point q, the probability that it was p is precisely one half; this accords with our intuitive understanding of “uniformly at random”. However, assume that instead I use the following procedure to pick a point on the circle: first I choose a point x uniformly at random, then I check if it is p.  If it is p, I declare my choice to be q; otherwise I declare my choice to be x.  Now it seems intuitively clear that the probability that the outcome is p given that it is either p or q is zero.

Since in Kolmogorov’s formulation of probability these two scenarios have identical descriptions, it follows that it is impossible to know what the conditional probability is of observing p given that either p or q was chosen.  This is not a shortcoming but an advantage of Kolmogorov’s framework! As long as we agree that we are working inside an expectation operator (remember the earlier discussion), such questions are irrelevant.  While the theory could conceivably be expanded to allow meaningful answers to be given in certain situations, would it have any practical value?
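
The circle example can be made concrete by following the finite-case principle stated earlier: replace the circle by n equally spaced points, where everything is well-defined, and let n grow. The Python sketch below is only an illustration under that assumption (the discretisation, the function names and the choice p = 0, q = 1 are mine, not part of the original argument). It computes, exactly, the conditional probability of p given {p, q} under both procedures, together with the total variation distance between the two laws.

```python
from fractions import Fraction


def uniform_law(n):
    """Procedure 1: pick one of n equally spaced points uniformly at random."""
    return {k: Fraction(1, n) for k in range(n)}


def swapped_law(n, p, q):
    """Procedure 2: pick uniformly, but report q whenever p was picked."""
    law = uniform_law(n)
    law[q] += law[p]
    law[p] = Fraction(0)
    return law


def conditional(law, point, event):
    """Pr{point | event}; only defined when the event has positive probability."""
    total = sum(law[k] for k in event)
    if total == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return law[point] / total


def total_variation(law_a, law_b):
    """Largest possible difference in probability assigned to any event."""
    return sum(abs(law_a[k] - law_b[k]) for k in law_a) / 2


p, q = 0, 1  # hypothetical labels for the two distinguished points
for n in (10, 100, 10000):
    first, second = uniform_law(n), swapped_law(n, p, q)
    print(n,
          conditional(first, p, {p, q}),
          conditional(second, p, {p, q}),
          total_variation(first, second))
```

For every n the two conditional probabilities are 1/2 and 0, yet the distance between the two laws is 1/n, which vanishes as n grows. The limiting description therefore cannot remember which procedure was used.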

The appearance of p(x \mid y) is an abuse of notation.  The advantage is that it looks easier than the measure-theoretic approach.  The disadvantage is that wrong answers can be obtained if one is not aware of its correct interpretation; see M. Proschan and B. Presnell, “Expect the unexpected from conditional expectation,” Am Stat, vol. 52, no. 3, pp. 248–252, 1998.

The fact of the matter is that (when it exists), p(x \mid y) is not a function, but an equivalence class of functions.  To specify an equivalence class, we must write down a representative function belonging to that equivalence class, hence p(x \mid y) will always look like a function, e.g., p(x \mid y) = (2\pi)^{-1/2}e^{-(x-y)^2/2}.  It is easy to start thinking that it should be a function.  And some simple examples can be written down where there seems to be only one obvious choice of p(x \mid y).  But p(x \mid y) is not a function, and Borel’s paradox is the reason why.

Precisely, p(x \mid y)  is a Radon-Nikodym derivative.  It satisfies Pr\{X \in A, Y \in B\} = \int_B p(y)\, \int_A p(x \mid y)\,dx\,dy.  If p(x \mid y) exists then it will not be unique because integration is blind to what happens on sets of measure zero.  Therefore, p(x \mid y) should only appear inside an integral, since it is only averaged versions of p(x \mid y) that are well-defined. Hence the importance of the expectation operator mentioned earlier; at the end of the day, we are always working under an integral sign, even if it is not always explicitly written down. Therefore, although it may look like p(x \mid y) is being treated as a well-defined function, it isn’t, because there is an implicit if not explicit expectation somewhere.
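
The blindness of integration to sets of measure zero can be illustrated numerically. The Python sketch below rests on assumptions of my own: Y is standard normal, p(x \mid y) is the Gaussian representative from the example above, A = (0,1), B = (1,3), and the second representative is deliberately altered at the single point y = 2. Floating-point sampling only approximates a continuous distribution, but a draw landing exactly on 2.0 is for practical purposes never seen, so both representatives should give the same Monte Carlo estimate of Pr\{X \in A, Y \in B\}.

```python
import math
import random


def normal_cdf(z):
    """Cumulative distribution function of a standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inner_integral(y, a=0.0, b=1.0):
    """Integral over A = (a, b) of the representative p(x | y) = N(y, 1) density."""
    return normal_cdf(b - y) - normal_cdf(a - y)


def inner_integral_altered(y, a=0.0, b=1.0):
    """Same equivalence class: differs only at the single point y = 2."""
    if y == 2.0:
        return 0.0  # arbitrary value on a set of measure zero
    return inner_integral(y, a, b)


def estimate(inner, n=100_000, seed=0, lo=1.0, hi=3.0):
    """Monte Carlo estimate of Pr{X in A, Y in B} with B = (lo, hi)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)  # draw Y from its marginal p(y)
        if lo < y < hi:
            total += inner(y)
    return total / n


# Pointwise evaluation distinguishes the two representatives ...
print(inner_integral(2.0), inner_integral_altered(2.0))
# ... but the averaged quantity does not.
print(estimate(inner_integral), estimate(inner_integral_altered))
```

The pointwise values at y = 2 disagree, which is exactly why [p(1 \mid 2)]-style evaluations carry no meaning, while the averaged quantity is unaffected by the alteration.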

It is both interesting and wonderful that it is possible to define an equivalence class of functions in such a way that pointwise evaluation is meaningless yet as soon as there is an integral sign then everything is well-defined.  This is at the heart of Borel’s paradox.  And unless one is already familiar with working with equivalence classes from other areas of mathematics, it may well take a bit of getting used to.  In other areas, square brackets are sometimes used to denote equivalence classes.  It might be clearer then to write [p(x \mid y)] to remind ourselves that [p(1 \mid 2)] is not defined if Pr\{Y=2\}=0; different members of the same equivalence class need not agree on sets of measure zero. Provided we always work inside an integral, this is not an issue.

Although there is merit in defining a probability to be a number between 0 and 1 inclusive, one should be careful not to think in terms of the usual Euclidean metric.  In some situations, intuition would be better guided if a probability of zero was thought of as -\infty and a probability of one was thought of as \infty.  (One way to achieve this is to send a probability p \in (0,1) to the real number \ln(p) - \ln(1-p).)  The reason can be seen by thinking in terms of examining long sequences of outcomes.  If one biased coin had a probability of 0.001 of coming up heads, and another had a probability of 0.0001, then what we readily notice is the factor of ten; we would expect to get ten times as many heads with the first coin as with the second.  Even though the absolute difference |0.001-0.0001| is small, the large ratio 0.001 / 0.0001 makes the expected sequences very different.  Similarly, a coin with a chance of 10^{-23} of heads is infinitely different from a coin with no chance of heads; essentially any countably infinite sequence of outcomes of the first coin will have infinitely many heads, whereas essentially any countably infinite sequence of outcomes of the second coin will have no (or at worst a finite number of) heads. These behaviours are very different!
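
The map p \mapsto \ln(p) - \ln(1-p) mentioned above is straightforward to experiment with. The following Python sketch (the function name `log_odds` and the convention of returning plus or minus infinity at the endpoints are my own choices) compares Euclidean differences with differences on the transformed scale for the probabilities used in this article.

```python
import math


def log_odds(p):
    """Send a probability in [0, 1] to ln(p) - ln(1 - p), with 0 -> -inf and 1 -> +inf."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return -math.inf
    if p == 1.0:
        return math.inf
    return math.log(p) - math.log1p(-p)  # log1p(-p) computes ln(1 - p)


pairs = [(0.1, 0.0999999), (0.001, 0.0001), (1e-23, 0.0)]
for a, b in pairs:
    # columns: the two probabilities, Euclidean distance, distance after the transform
    print(a, b, abs(a - b), abs(log_odds(a) - log_odds(b)))
```

On the transformed scale the first pair is almost indistinguishable, the second pair is separated by roughly \ln 10, and the third pair is infinitely far apart, even though its Euclidean difference is by far the smallest of the three.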

In other words, in many cases one would expect the following kind of consistency.  Suppose an experiment involving the tossing of a coin is undertaken repeatedly, and certain observations are made by averaging over many trials.  Then the difference in observations between an unbiased coin and a biased coin with probability p of heads could be made arbitrarily small by taking p arbitrarily close to 0.5.  The preceding paragraph implies that there is no reason whatsoever why such consistency should be observed between p \rightarrow 0 and p = 0.  It does not matter how close p is to zero, it is still “infinitely far away” from zero in terms of expected behaviour.  This is the simple reason why Borel’s paradox is not surprising; we should not expect any form of continuity to hold in the limit as p goes to zero.  And as soon as there is no continuity then the possibility exists for different sequences to have different limits: just think of the function f(x) = \sin(1/x) and what happens as x \rightarrow 0.  For any y \in [-1,1], a sequence \{x_n\} can be constructed such that x_n \rightarrow 0 and f(x_n) \rightarrow y. Borel’s paradox is just another demonstration that different sequences can give different answers when continuity does not hold.
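
The dependence on the choice of limit can be seen in a small simulation. The sketch below uses a textbook-style illustration that is my own choice rather than anything taken from this article: X and Y are independent standard normals, and the measure-zero event \{X = Y\} is approximated in two ways, by |Y - X| < \epsilon and by |Y/X - 1| < \epsilon. Both approximating events shrink to the same set, yet a short calculation gives a conditional mean of X^2 tending to 1/2 along the first and to 1 along the second. The function name, sample size and value of \epsilon are arbitrary.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def conditional_mean_square(n=400_000, eps=0.05, seed=0):
    """Estimate E[X^2 | X = Y] using two different shrinking neighbourhoods."""
    rng = random.Random(seed)
    by_difference, by_ratio = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if abs(y - x) < eps:
            # neighbourhood of the event {Y - X = 0}
            by_difference.append(x * x)
        if x != 0.0 and abs(y / x - 1.0) < eps:
            # neighbourhood of the event {Y / X = 1}
            by_ratio.append(x * x)
    return mean(by_difference), mean(by_ratio)


difference_limit, ratio_limit = conditional_mean_square()
print(difference_limit, ratio_limit)
```

With the defaults the two estimates should come out near 0.5 and near 1 respectively, up to Monte Carlo error. Neither is “the” value of E[X^2 \mid X = Y]; each answers a different, well-posed question about an event of positive probability.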

In summary:

  • There is no reason for there to be any sort of “continuity” when approaching the extremes of probability 0 or probability 1 because something with probability 10^{-23} of happening is still infinitely more likely to happen than something with probability 0 and hence does not serve as a good approximation.
  • In practice, there is always an expectation operator inside of which we are working. Therefore, while the notation p(x \mid y) may suggest we are conditioning on sets of measure zero, we are not. We are always working with equivalence classes.
  • The beauty of Kolmogorov’s axioms is that the lack of continuity that occurs when approaching the extremes of probability can be neatly sidestepped.
  • Although in some simple examples it would be possible to extend Kolmogorov’s axioms so that conditioning on sets of measure zero were allowed, would this have any practical use?  For a start, it would require more information be provided than the standard probability triple (\Omega,\mathfrak{F},\mathcal{P}).
  1. Stephen Tashiro
    September 30, 2012 at 1:36 am

    We acknowledge that it is impossible in practice to take exact samples from continuous random variables either by measurement or by simulation due to the limitations of measuring instruments and computers. There is a generally held belief that this is not a serious limitation. Has this belief ever been formalized as a mathematical theory? I’m not aware that there are even formal definitions that deal with this.

    For example, one goal of sampling is estimation. I think that certain properties of a probability distribution may have no “good” estimators even if we take exact samples (e.g. whether the mean of the distribution is a rational number). Can there be properties of a continuous distribution that cannot be well estimated when we approximate the distribution with a discrete distribution having a finite number of N outcomes – no matter how large we make N?

    This is a different question than whether we can or cannot estimate certain things from a continuous distribution (e.g. a gaussian) when the sample values are rounded-off. In that scenario there is an actual infinity of possible discrete sample values. The limitation to a finite N is the scenario faced by someone writing a computer simulation to generate samples.

    • jmanton
      October 11, 2012 at 12:20 am

      An alternative viewpoint on the issue is the following, and yes, it is a very important part of the theory, yet is rarely discussed. It is related to the common statement in differential geometry that “every coordinate system is as good as any other”. This common statement is false! There are restrictions on the allowable coordinate systems; the correct statement is that every allowable coordinate system is as good as any other. It is implicitly assumed that if we are interested in estimating a quantity (e.g. temperature) then we measure it in a way (e.g., in Kelvin, Fahrenheit or Celsius) such that if two temperatures are “close” in numerical value then the two temperatures are “close” according to our intuition (i.e., we find it difficult in real life to tell the difference between objects with these two temperatures). Formally, this means we agree on a topology for temperature and we only wish to measure temperature by using a continuous function from temperature to the real numbers. The example you gave with rational numbers is written mathematically using the Dirichlet function; this function is nowhere continuous, explaining to a large degree why it does not work.

      In response to the second part of your comment, approximating continuous distributions by discrete distributions, and having to work with quantised data, have been well-studied in the literature. (Even though quantisation is not a continuous operation, it is nicer than the Dirichlet function and an adequate theory can be developed.) Since my motto is that there are no estimation problems, only decision problems, one way to understand all this is to think in terms of loss functions and the performance of an estimator with respect to a chosen loss function. (Change the decision you want to make and that changes the loss function you want to use, assuming of course the decision is one for which a suitable loss function can be found.) If observations are quantised then the estimator will not be able to perform as well, but depending on the particular problem at hand, the decrease in performance may be acceptable.

  2. Stephen Tashiro
    October 12, 2012 at 4:41 am

    OK, a property of a distribution that is estimable from samples should be a continuous function – but a continuous function of what? We can make up an estimator that is a continuous function of the sample values, but it might be a bad estimator. We could say that an estimable property must be a continuous function “of the distribution”. That would imply the property was some sort of functional. Or we could say the property was a function of “the parameters” of the distribution, making it simply a function. Either way, taking your hint about geometry, the idea of continuity only has meaning if we are imagining varying something. So we must imagine more than just a single distribution. We have to imagine some sort of space of distributions and think of the property as varying as we vary the particular member of the family that is generating the samples. (I hope I’m interpreting your remarks correctly instead of putting words into your mouth.)

    It is traditional to speak of “the parameters” of well known probability distributions, such as a gaussian. But single probability distributions don’t really have a well defined number of parameters. For example, in a family of distributions that is a mixture of a gaussian with a uniform distribution on [0,1], the gaussian with mean 0 and variance 1 needs an additional parameter to specify that it is all gaussian with no uniform distribution mixed-in. From this point of view, it is only possible to define estimable properties of probability distributions with respect to a particular family of distributions.

    Defining continuity “on a family of distributions” brings up the problem of what topology to use on the family. I’ll have to think about that.

    —-

    Can you give some good search keywords to use for either of the following scenarios?

    1) The problem of simulating continuous distributions with finite discrete distributions. (I’d imagine there are papers about simulating the tails of continuous distributions. Are there keywords that deal with more general problems?)

    2) The problem of producing estimates from actual samples from continuous distributions that have been truncated or rounded-off. (It seems like there should be volumes written on this topic and it also seems like this subject should appear in elementary statistics texts, at least for gaussian data. However, I haven’t found any online expositions of this topic – perhaps it is implicit in the topic of “censored data” in a way that’s too general for me to recognize.)

  3. nicolas
    January 1, 2013 at 6:42 am

    very interesting. thanks for sharing.
