
Measure-theoretic Probability: Why it should be learnt and how to get started

Last Friday I gave an informal 2-hour talk at the City University of Hong Kong on measure-theoretic probability. The main points were as follows. Comments on which parts are unclear or how better to explain certain concepts are especially welcome.

Objectives

• Understand why measure-theoretic probability is useful
• Learn enough to get past the initial barrier to self-learning

Recommended Textbooks

The primary textbook I recommend is “Probability with Martingales” by David Williams.  A secondary textbook I recommend, although out of print, is Wong and Hajek’s “Stochastic Processes in Engineering Systems”.

Motivation

One unattractive feature of traditional probability theory is that discrete and continuous random variables are generally treated separately and thus some care is required when studying mixtures of discrete and continuous random variables.  Measure-theoretic probability provides a unified framework which is ultimately easier to work with rigorously.  (In other words, and roughly speaking, fewer lines of mathematics are required and the chance of making a mistake is decreased.)

A simple example of  moving to a more general setting  is given by the real and complex numbers.  Initially, complex numbers were treated with some scepticism.  Ultimately though, by generalising real numbers to complex numbers, a range of fundamental concepts became simpler and more natural.  To state just one, an $n$th degree polynomial has precisely $n$ roots (counting multiplicities) over the complex field,  but possibly fewer over the real field.

Derivation

Journal papers using measure-theoretic probability often start by saying, “Let $(\Omega,\mathfrak{F},\mathbb{P})$ be a probability space”.  This section endeavours to derive (or re-discover) this formalism.

The Outcome ($\omega \in \Omega$)

At least for an engineer, it is benign to assume that even if a variable or process is random and not observed directly, it still has a true and actual outcome in every experiment. For the purposes of measure-theoretic probability, it is convenient (and unrestrictive) to assume that the outcomes of a series of bets placed by a gambler are known beforehand to Tyche, the Goddess of Chance. Formally, the actual outcome is denoted by a point $\omega$ drawn from the set of all possible outcomes $\Omega$.

If the experiment consists of just a single coin toss, $\Omega$ might contain just two elements, say $\Omega = \{H,T\}$.  (There is no reason why $\Omega$ could not contain more elements; they would merely be deemed to occur with probability zero.  While this might be a silly thing to do when the set of possible outcomes is finite, there are sometimes advantages in choosing $\Omega$ larger than it needs to be in the infinite case, such as when working with stochastic processes.)

Rarely though is a single toss of a coin interesting.  Normally, the coin is tossed two or more times.  It is important that $\Omega$ contains as many outcomes as necessary to describe the full sequence of events.  So if the coin is tossed twice and only twice, it suffices to choose $\Omega = \{HH,HT,TH,TT\}$. Generally though, one would want to allow for the coin to be tossed any number of times, in which case $\Omega$ would contain all possible infinitely-long sequences of Heads and Tails.  (This is denoted $\Omega = \{H,T\}^\infty$.)
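
The two-toss outcome space is small enough to enumerate in a few lines of Python (a throwaway sketch; the name `Omega` is purely illustrative):

```python
from itertools import product

# Each omega is a complete record of the experiment: "HT" means the first
# toss came up Heads and the second toss came up Tails.
Omega = {"".join(seq) for seq in product("HT", repeat=2)}

print(sorted(Omega))  # ['HH', 'HT', 'TH', 'TT']
```

For infinitely many tosses, $\Omega = \{H,T\}^\infty$ is uncountable and no such enumeration is possible, which is one reason the machinery below is needed.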

The Probability of the Outcome ($\mathbb{P}$)

We must somehow characterise the fact that Tyche will choose some outcomes $\omega \in \Omega$ more frequently than others.  If $\Omega$ is finite, there is an obvious way to do this: simply define $\mathbb P$ to be a function from $\Omega$ to the set of real numbers between $0$ and $1$ inclusive, the latter denoted by $[0,1]$. This generally does not work though if $\Omega$ contains infinitely many elements.  To see why, assume Tyche will choose a number uniformly between $2$ and $3$.  Then we may take $\Omega = [2,3]$.  We would be forced to assign a probability of zero to every individual outcome $\omega \in \Omega$, and these pointwise values carry too little information: they cannot tell us, for instance, that the probability of Tyche choosing a number in the set $[2.1,2.2]$ is precisely the same as the probability of choosing a number in the set $[2.8,2.9]$.

Going to the other extreme, we may be tempted to solve the problem by defining the probability of occurrence of any conceivable set of outcomes.  So for instance, we can define $\mathbb{P}([2.1,2.2]) = 0.1$ and $\mathbb{P}([2.8,2.9]) = 0.1$, and indeed, for any interval from $a$ to $b$ with $2 \leq a \leq b \leq 3$ we can define $\mathbb{P}([a,b]) = b-a$. Notice that now, we have made $\mathbb P$ a function which takes a subset of $\Omega$ and returns a number between $0$ and $1$. So strictly speaking, we must write $\mathbb P(\{2.5\})$ and not $\mathbb P(2.5)$ for the probability of occurrence of an individual element of $\Omega$ .
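
This interval-based assignment is easy to sketch in Python (illustrative only; here an interval is passed as its two endpoints):

```python
# P takes an interval [a, b] inside Omega = [2, 3] and returns its
# probability b - a, as defined above for the uniform choice on [2, 3].
def P(a, b):
    assert 2 <= a <= b <= 3, "interval must lie inside Omega = [2, 3]"
    return b - a

print(P(2.1, 2.2))  # approximately 0.1 (up to floating-point rounding)
print(P(2.5, 2.5))  # 0.0 -- the single outcome {2.5} has probability zero
```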

Superficially, this seems fine. However, it does not work, for two reasons.

1. How can we define the value of $\mathbb P(A)$  for an arbitrary subset $A$ of $\Omega$ when for some sets, it is not even possible to write down a description of them?  (That is, there are some subsets of the interval $[2,3]$ which we cannot even write down, so how can we even write down a definition of $\mathbb P$ which tells us what value it takes on such indescribable sets?)
2. It can be proved that there exist “bad” sets for which it is impossible to assign a probability to them in any consistent way.

It is very tempting to elaborate on the second point above.  However, my experience is that doing so distracts too much attention from the original aim of understanding measure-theoretic probability. It is therefore better to think that even if we could assign a probability to every possible subset, we do not want to, because it would cause unnecessary trouble and complication; surely, provided we have enough interesting subsets to work with, that will suffice?

Therefore, ultimately we define $\mathbb P$ as a function from $\mathfrak F$ to $[0,1]$ where $\mathfrak F$ is a set of subsets of $\Omega$ which we think of as (some of) the “nice” subsets of $\Omega$, that is, subsets of $\Omega$ to which we can and want to assign probabilities of occurrence. Roughly speaking, $\mathfrak F$ should be just large enough to be useful, and no larger.

The Set of Nice Subsets ($\mathfrak F$)

Referring to what was said just before, how should we choose $\mathfrak F$? Experience suggests that if $\Omega = [2,3]$ then we would generally be interested in all open intervals $(a,b)$ and all closed intervals $[a,b]$ for starters.  (Open intervals do not include their endpoints whereas closed intervals do.) We would also want to be able to take (finite) unions and intersections of such sets.  This may well be enough already. However, we should also look at our requirements on $\mathbb P$ since they will have an effect on how we choose $\mathfrak F$.  They are:

1. $\mathbb{P}(\{\}) = 0$. (The probability of $\omega$ being in the empty set is zero.)
2. $\mathbb{P}(\Omega) = 1$. (The probability of $\omega$ being in the set of all possible outcomes $\Omega$ is one.)
3. $\mathbb{P}( \cup_{i=1}^\infty F_i ) = \sum_{i=1}^\infty \mathbb{P}( F_i)$ whenever the $F_i$ are mutually disjoint subsets of $\Omega$. (Probability is countably additive.)

In order even to be able to state these properties rigorously, we require $\mathfrak F$ to have certain properties. In particular, the first two conditions only make sense if we insist that both the empty set $\{\}$ and $\Omega$ are elements of $\mathfrak F$. (Recall that $\mathbb P$ is only defined on elements of $\mathfrak F$.) The third condition requires that if $F_i \in \mathfrak F$ then $\cup_{i=1}^\infty F_i \in \mathfrak F$. (Technically, we have only argued for this in the special case of the $F_i$ being mutually disjoint, but it ultimately turns out to be no different from requiring it to hold for non-disjoint sets too.)

Note that the third condition implies (finite) additivity; just choose most of the $F_i$ to be the empty set.  Therefore, if $A \in \mathfrak F$ and if $\Omega \backslash A$ (the complement of $A$) is also in $\mathfrak F$ then properties 2 and 3 above would imply that $\mathbb{P}(\Omega \backslash A) = 1 - \mathbb{P}(A)$.  It is easy to believe that this condition is fundamental enough to insist that if $A \in \mathfrak F$ then its complement $\Omega \backslash A$ is also in $\mathfrak F$.  Once we have complements of sets in $\mathfrak F$ also belonging to $\mathfrak F$, then (finite and countable) intersections of sets in $\mathfrak F$ also belong to $\mathfrak F$.  (Recall that $A \cap B = \Omega \backslash ( (\Omega \backslash A) \cup (\Omega \backslash B))$, for example.)
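
On a finite outcome space these consequences can be checked directly. The following sketch uses illustrative values (a fair coin tossed twice, each outcome given probability $1/4$) to verify additivity for disjoint events and the complement rule:

```python
from fractions import Fraction

# Each of the four outcomes of two fair coin tosses gets probability 1/4
# (exact arithmetic via fractions, to avoid floating-point noise).
p = {"HH": Fraction(1, 4), "HT": Fraction(1, 4),
     "TH": Fraction(1, 4), "TT": Fraction(1, 4)}

def P(A):
    # Probability of an event A: sum the probabilities of its outcomes.
    return sum(p[omega] for omega in A)

Omega = set(p)
A = {"HH", "HT"}                            # "first toss is Heads"
assert P(A | {"TT"}) == P(A) + P({"TT"})    # additivity for disjoint events
assert P(Omega - A) == 1 - P(A)             # complement rule
print(P(A), P(Omega - A))  # 1/2 1/2
```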

To summarise the last paragraph, we have endeavoured to show that we require $\mathfrak F$ to satisfy the following conditions.

1. $\Omega \in \mathfrak F$.
2. $A \in \mathfrak F$ implies $\Omega \backslash A \in \mathfrak F$.
3. $A_i \in \mathfrak F$ implies $\cup_{i=1}^\infty A_i \in \mathfrak F$.
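
For a finite $\Omega$, where countable unions reduce to finite ones, these three conditions can be checked mechanically. A sketch in Python (function and variable names are my own):

```python
def is_sigma_algebra(F, Omega):
    """Check the three sigma-algebra conditions on a finite Omega.
    For a finite collection F, closure under pairwise unions already
    gives closure under all finite unions."""
    F = {frozenset(A) for A in F}
    Omega = frozenset(Omega)
    if Omega not in F:                               # condition 1
        return False
    if any(Omega - A not in F for A in F):           # condition 2: complements
        return False
    return all(A | B in F for A in F for B in F)     # condition 3: unions

Omega = {"HH", "HT", "TH", "TT"}
print(is_sigma_algebra([set(), Omega], Omega))                              # True
print(is_sigma_algebra([set(), {"HH", "HT"}, {"TH", "TT"}, Omega], Omega))  # True
print(is_sigma_algebra([{"HH"}, Omega], Omega))   # False: missing complement
```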

These conditions are precisely those required for $\mathfrak F$ to be what is known as a $\sigma$-algebra.  Here, $\sigma$ is used to denote the word “countable” and refers to condition 3 above.  (While the alternative term $\sigma$-field is widely used, the existing definitions of “algebra” and “field” in mathematics make $\sigma$-algebra the preferred term; $\mathfrak F$ is not a “field” in any precise sense.)

If $\Omega = [2,3]$, recall from above that we wished for $\mathfrak F$ to contain all the intervals at the very least.   Therefore, we choose $\mathfrak F$ to be the smallest $\sigma$-algebra containing the intervals.  (Intuitively, one could think of building $\mathfrak F$ up by starting with $\mathfrak F$ equal to the set of all intervals, then adding all complements, then adding all countable unions, then adding all complements of these new sets, then adding all countable unions of new and old sets, and going on like this until finally $\mathfrak F$ grew no larger by repeating this process. Mathematically though, it is constructed by taking the intersection of all $\sigma$-algebras containing the intervals; it can be shown that the (possibly uncountable) intersection of $\sigma$-algebras is still a $\sigma$-algebra.)
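
For a finite $\Omega$ the iterative construction described in parentheses actually terminates, and can be written out directly (a sketch; in the finite case pairwise unions suffice, and the names are illustrative):

```python
def generate_sigma_algebra(generators, Omega):
    """Smallest sigma-algebra on a finite Omega containing the generator
    sets: repeatedly add complements and unions until nothing new appears."""
    Omega = frozenset(Omega)
    F = {frozenset(G) for G in generators} | {Omega, frozenset()}
    while True:
        new = {Omega - A for A in F} | {A | B for A in F for B in F}
        if new <= F:          # closed under complements and unions: done
            return F
        F |= new

F = generate_sigma_algebra([{1}], {1, 2, 3, 4})
print(sorted(sorted(A) for A in F))
# [[], [1], [1, 2, 3, 4], [2, 3, 4]] -- smallest sigma-algebra containing {1}
```

For $\Omega = [2,3]$ no such finite procedure exists, which is why the intersection construction is used instead.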

In general, if $\Omega$ is a topological space then it is common to choose $\mathfrak F$ to be the smallest $\sigma$-algebra containing all the open sets.  This is called the Borel $\sigma$-algebra on $\Omega$, the $\sigma$-algebra generated by the open sets.

The elements of $\mathfrak F$ are called events. Indeed, an event $B \in \mathfrak F$ is a subset of $\Omega$ and therefore represents a set of possible outcomes that we might observe (we might be told that $\omega \in B$), or whose probability we might ask for (we might want to know the value of $\mathbb{P}(B)$).

How to Define $\mathbb P$ on Borel Subsets

One issue remains: in general, it is not possible to write down an arbitrary Borel subset; some Borel sets are indescribable. How then can we define $\mathbb P$ on sets we cannot describe? Fortunately, we can appeal to Carathéodory’s Extension Theorem.  In fact, this is a repeating theme in measure-theoretic probability; it is necessary to learn techniques for avoiding the need to work directly with indescribable sets.

Carathéodory’s Extension Theorem implies that if we assign a probability to every interval in $\Omega = [2,3]$ (in a way which is consistent with the axioms for probability, e.g., respecting countable additivity) then there is one and only one way to extend the assignment of probabilities to arbitrary Borel subsets of $[2,3]$. In other words, by defining $\mathbb P$ just for intervals, we have implicitly defined $\mathbb P$ on all Borel subsets. (This is analogous to defining a linear function at only a handful of points; the value of the function at other points can then be deduced by using the property of linearity.)

Note that $\mathbb P$ is called a probability measure.  It “measures” the probability assigned to certain nice subsets of $\Omega$, or precisely, to the elements of $\mathfrak F$.  (Recall that every element of $\mathfrak F$ is a subset of $\Omega$.)

Random Variables

A (real-valued) random variable is simply a function from $\Omega$ to $\mathbb R$ which satisfies a natural condition of being measurable, which will be defined presently. First though, note that a random variable gives (generally only partial) information about the outcome $\omega$.  For example, if $\Omega=\{HH,HT,TH,TT\}$ and $X_1: \Omega \rightarrow \mathbb R$ is defined by $X_1(HH) = 1$, $X_1(HT) = 1$, $X_1(TH) = 0$ and $X_1(TT) = 0$ then we would describe $X_1$ as the outcome of the first coin toss (with $1$ for Heads and $0$ for Tails).
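
In code, such a random variable is literally just a function on $\Omega$ (a sketch; names illustrative):

```python
Omega = {"HH", "HT", "TH", "TT"}

def X1(omega):
    # Outcome of the first coin toss: 1 for Heads, 0 for Tails.
    return 1 if omega[0] == "H" else 0

print({omega: X1(omega) for omega in sorted(Omega)})
# {'HH': 1, 'HT': 1, 'TH': 0, 'TT': 0}
```

Note that $X_1$ gives only partial information: knowing $X_1(\omega) = 1$ tells us only that $\omega \in \{HH, HT\}$.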

We know from the previous section that when we are dealing with the set of  real numbers $\mathbb R$, we would like to be able to assign a probability to any Borel subset of $\mathbb R$.  Therefore, given a random variable $X: \Omega \rightarrow \mathbb R$ and a Borel subset $B$, we would like to compute the probability that the outcome $\omega$ causes $X$ to take on a value in the set $B$.  Mathematically, this is written as $\mathbb{P}(\{\omega \mid X(\omega) \in B\})$, which is commonly abbreviated as $\mathbb{P}(X^{-1}(B))$. For this to make sense though, we must have $X^{-1}(B) = \{\omega \mid X(\omega) \in B\}$ being an element of $\mathfrak F$. This condition, that the inverse image of a Borel set lies in the $\sigma$-algebra $\mathfrak F$, is precisely the condition of measurability imposed on any random variable.
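
On a finite $\Omega$ the inverse image $X^{-1}(B)$ can be computed by brute force, which makes the measurability condition concrete (a sketch, reusing the two-toss example):

```python
Omega = {"HH", "HT", "TH", "TT"}

def X1(omega):
    # First-toss random variable from the text: 1 for Heads, 0 for Tails.
    return 1 if omega[0] == "H" else 0

def preimage(X, B, Omega):
    """X^{-1}(B) = {omega | X(omega) in B}: the event that X lands in B."""
    return {omega for omega in Omega if X(omega) in B}

print(sorted(preimage(X1, {1}, Omega)))     # ['HH', 'HT']
print(sorted(preimage(X1, {0, 1}, Omega)))  # ['HH', 'HT', 'TH', 'TT']
```

Measurability of $X_1$ then amounts to requiring that every such preimage, e.g. $\{HH, HT\}$, is an element of $\mathfrak F$.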

Expectation is Central

Although just formulating the probability triple $(\Omega, \mathfrak{F}, \mathbb{P})$ is already enough to unify discrete and continuous-valued random variables, there are other differences between measure-theoretic and “classical” probability. In particular, in measure-theoretic probability, emphasis shifts to the expectation and conditional expectation operators. One benefit of doing this is that it avoids certain difficulties associated with defining conditional probability; for example, Bayes’ rule breaks down when the conditioning event has probability zero, since the denominator is then zero.

Note that the probability $\mathbb{P}(B)$ of an event $B \in \mathfrak F$ occurring is equal to the expected value of $I_B(\omega)$ where $I$ denotes the indicator function; $I_B(\omega)$ equals $1$ when $\omega \in B$ and $0$ otherwise. Therefore, the shift from probability being central to expectation being central is merely a change of view; it often provides a nicer view of the same underlying theory.
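
This identity is easy to verify on the two-toss example (a sketch with a fair coin, each outcome weighted $1/4$; exact arithmetic via fractions):

```python
from fractions import Fraction

Omega = ["HH", "HT", "TH", "TT"]
p = {omega: Fraction(1, 4) for omega in Omega}   # fair coin, two tosses

def indicator(B):
    # I_B(omega) equals 1 when omega is in B and 0 otherwise.
    return lambda omega: 1 if omega in B else 0

def expectation(X):
    # E[X] = sum over omega of p(omega) * X(omega), on a finite Omega.
    return sum(p[omega] * X(omega) for omega in Omega)

B = {"HH", "HT"}                       # "first toss is Heads"
P_B = sum(p[omega] for omega in B)
print(P_B, expectation(indicator(B)))  # 1/2 1/2
```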

Concluding Remarks

• Measure-theoretic probability is initially more complicated to learn, but it is rigorous, more natural and therefore ultimately easier to work with.
• Its advantages come from its different and more general viewpoint; the underlying theory is still essentially the same as classical probability.
• (Rather than work with cumulative probability distributions and Riemann-Stieltjes integrals, measure-theoretic probability works with probability measures and Lebesgue integrals which are generally cleaner and easier to work with.)
• When learning measure-theoretic probability:
• Keep in mind that the basic ideas are straightforward; don’t let the technical detail obscure the basic ideas.
• Most of the technical detail comes (at least initially) from having to work with Borel sets but not being able to describe them in general (cf., Caratheodory’s Extension Theorem mentioned earlier).
• Look for and develop your own mapping between the measure-theoretic way of obtaining a result and the classical way.  (For example, Girsanov’s Change of Measure is essentially the measure-theoretic version of Bayes’ rule; it is stated in terms of conditional expectation rather than conditional probability and is therefore neater to work with.)