### Archive

Posts Tagged ‘measure-theoretic probability’

## Some Comments on the Situation of a Random Variable being Measurable with respect to the Sigma-Algebra Generated by Another Random Variable

If $Y$ is a $\sigma(X)$-measurable random variable then there exists a Borel-measurable function $f \colon \mathbb{R} \rightarrow \mathbb{R}$ such that $Y = f(X)$. The standard proof of this fact leaves several questions unanswered. This note explains what goes wrong when attempting a “direct” proof. It also explains how the standard proof overcomes this difficulty.

First some background. It is a standard result that $\sigma(X) = \{X^{-1}(B) | B \in \mathcal{B}\}$ where $\mathcal{B}$ is the set of all Borel subsets of the real line $\mathbb{R}$. Thus, if $A \in \mathcal{B}$ then there exists an $H \in \mathcal{B}$ such that $Y^{-1}(A) = X^{-1}(H)$. Indeed, this follows from the fact that since $Y$ is $\sigma(X)$-measurable, the inverse image $Y^{-1}(A)$ of any Borel set $A$ must lie in $\sigma(X)$.

A “direct” proof would endeavour to construct $f$ pointwise. The basic intuition (which is not difficult to prove) is that $Y$ must be constant on sets of the form $X^{-1}(c)$ for $c \in \mathbb{R}$. This suggests defining $f$ by $f(x) = \sup Y(X^{-1}(x))$. Here, the supremum is used to go from the set $Y(X^{-1}(x))$ to what we believe to be its only element, or to $-\infty$ if $X^{-1}(x)$ is empty (the supremum of the empty set being $-\infty$). Unfortunately, this intuitive approach fails because the range of $X$ need not be Borel. This causes problems because $f^{-1}((-\infty,\infty))$ is the range of $X$ and must be Borel if $f$ is to be Borel-measurable.

Technically, we need a way of extending the definition of $f$ from the range of $X$ to a Borel set containing the range of $X$, and moreover, the extension must result in a measurable function.

Constructing an appropriate extension requires knowing more about $Y$ than simply $Y(X^{-1}(x))$ for each individual $x$. That is, we need a canonical representation of $Y$. Before we get to this though, let us look at two special cases.

Consider first the case when $Y = I_{A}$ for some measurable set $A$, where $I$ is the indicator function. If $Y$ is $\sigma(X)$-measurable then $Y^{-1}(1)$ must lie in $\sigma(X)$ and hence there exists a Borel $H$ such that $Y^{-1}(1) = A = X^{-1}(H)$. Let $f = I_H$. (It is Borel-measurable because $H$ is Borel.) To show $Y = f \circ X$, let $\omega$ be arbitrary. (Recall, random variables are actually functions, conventionally indexed by $\omega$.) If $\omega \in X^{-1}(H)$ then $X(\omega) \in H$ and $(f \circ X)(\omega) = 1$, while $Y(\omega) = 1$ because $\omega \in X^{-1}(H) = Y^{-1}(1)$. Otherwise, if $\omega \not\in X^{-1}(H)$ then analogous reasoning shows both $Y(\omega)$ and $(f \circ X)(\omega)$ equal zero.

How did the above construction avoid the problem of the range of $X$ not necessarily being Borel? The subtlety is that the choice of $H$ need not be unique, and in particular, $H$ may contain values which lie outside the range of $X$. Whereas a choice such as $f(x) = \sup Y(X^{-1}(x))$ assigns a single value (in this case, $-\infty$) to values of $x$ not lying in the range of $X$, the choice $f = I_H$ can assign either $0$ or $1$ to values of $x$ not in the range of $X$, and by doing so, it can make $f$ Borel-measurable.

Consider next the case when $Y = I_{A_1} + 2 I_{A_2}$ with $A_1$ and $A_2$ disjoint. As above, we can find Borel sets $H_i$ such that $A_i = X^{-1}(H_i)$ for $i=1,2$, and moreover, $f = I_{H_1} + 2 I_{H_2}$ gives a suitable $f$. Here, it can be readily shown that if $x$ is in the range of $X$ then $x$ can lie in at most one $H_i$ (otherwise $A_1$ and $A_2$ would share a point). Thus, regardless of how the $H_i$ are chosen, $f$ will take on the correct value whenever $x$ lies in the range of $X$. Different choices of the $H_i$ can result in different extensions of $f$, but each such choice is Borel-measurable, as required.
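The non-uniqueness of the $H_i$ can be seen in a toy finite setting. The sketch below is only an illustration: the sample space, the map `X`, and the value `40` (chosen to lie outside the range of `X`) are all made up, and on a finite space every set is measurable, so nothing subtle about Borel sets is being tested.

```python
# Toy finite setting: Omega = {0, ..., 5}; X takes values in {10, 20, 30}.
# The value 40 is deliberately outside the range of X (hypothetical example).
omega_space = range(6)
X = {0: 10, 1: 10, 2: 20, 3: 20, 4: 30, 5: 30}

def preimage(H):
    """X^{-1}(H) as a frozenset of outcomes."""
    return frozenset(w for w in omega_space if X[w] in H)

def indicator(S):
    """Indicator function of the set S."""
    return lambda t: 1 if t in S else 0

# A_1 and A_2 are disjoint, and each is a union of fibres of X.
A1 = frozenset({0, 1})
A2 = frozenset({4, 5})
Y = {w: indicator(A1)(w) + 2 * indicator(A2)(w) for w in omega_space}

def make_f(H1, H2):
    """f = I_{H1} + 2 I_{H2}."""
    return lambda x: indicator(H1)(x) + 2 * indicator(H2)(x)

# Two valid choices of (H_1, H_2); the second puts 40 in both sets.
f_a = make_f({10}, {30})
f_b = make_f({10, 40}, {30, 40})

for w in omega_space:
    print(w, Y[w], f_a(X[w]), f_b(X[w]))
```

Both choices satisfy $A_i = X^{-1}(H_i)$ and reproduce $Y$ on every outcome, yet `f_a(40)` and `f_b(40)` differ: the two functions are different extensions beyond the range of `X`.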

The above depends crucially on having only finitely many indicator functions. A frequently used principle is that an arbitrary measurable function can be approximated by a sequence of bounded functions with each function being a sum of a finite number of indicator functions (i.e., a simple function). Therefore, the general case can be handled by using a sequence $Y_n$ of random variables converging pointwise to $Y$. Each $Y_n$ results in an $f_n$ obtained by replacing the $A_i$ by $H_i$, as was done in the paragraph above. For $x$ in the range of $X$, it turns out as one would expect: the $f_n(x)$ converge, and $f(x) = \lim f_n(x)$ gives the correct value for $f$ at $x$. For $x$ not in the range of $X$, there is no reason to expect the $f_n(x)$ to converge: the choices of the $H_i$ at the $n$th and $(n+1)$th steps are not coordinated in any way when it comes to which values to include from the complement of the image of $X$. Intuitively though, we could hope that each $H_i$ includes a “minimal” extension that is required to make $f$ measurable and that convergence takes place on this “minimal” extension. Thus, by choosing $f(x)$ to be zero whenever $f_n(x)$ does not converge, and choosing $f(x)$ to be the limit of $f_n(x)$ otherwise, we may hope that we have constructed a suitable $f$ regardless of how the $H_i$ at each step were chosen. Whether or not this intuition is correct, it can be shown mathematically that $f$ so defined is indeed the desired function. (See for example the short proof of Theorem 20.1 in the second edition of Billingsley’s book Probability and Measure.)
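One common way to produce such a sequence $Y_n$ is dyadic rounding. The sketch below is an assumption about which approximating sequence to use, not necessarily the one in Billingsley's proof: it rounds $|y|$ down to a multiple of $2^{-n}$, caps the result at $n$, and restores the sign. Applied to $Y(\omega)$, it yields a $Y_n$ taking only finitely many values. The function name `simple_approx` is hypothetical.

```python
import math

def simple_approx(y, n):
    """n-th dyadic approximation of the real number y.

    Rounds |y| down to a multiple of 2**-n, caps it at n, and restores the
    sign, so the result takes only finitely many values for each fixed n.
    """
    magnitude = min(float(n), math.floor(abs(y) * 2 ** n) / 2 ** n)
    return math.copysign(magnitude, y)

for n in (1, 2, 4, 8, 16):
    print(n, simple_approx(math.pi, n), simple_approx(-2.75, n))
```

Once $n$ exceeds $|y|$, the error is at most $2^{-n}$, which gives the pointwise convergence $Y_n \rightarrow Y$ used in the argument.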

Finally, it is remarked that sometimes the monotone class theorem is used in the proof. Essentially, the idea is exactly the same: approximate $Y$ by a suitable sequence $Y_n$. The subtlety is that the monotone class theorem only requires us to work with indicator functions $I_A$ where $A$ is particularly nice (i.e., $A$ lies in a $\pi$-system generating the $\sigma$-algebra of interest). The price of this nicety is that $Y$ must be bounded. For the above problem, as Williams points out in his book Probability with Martingales, we can simply replace $Y$ by $\arctan Y$ to obtain a bounded random variable. On the other hand, there is nothing to be gained by working with particularly nice $A_i$, hence Williams’ admission that there is no real need to use the monotone class theorem (see A3.2 of his book).

## Measure-theoretic Formulation of the Likelihood Function

Let $P_\theta$ be a family of probability measures indexed by $\theta \in \Theta$. For notational convenience, assume $0 \in \Theta$, so that $P_0$ is one of the probability measures in the family. This short note sketches why $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ is the likelihood function, where the $\sigma$-algebra $\mathcal X$ describes the possible observations and $E_0$ denotes expectation with respect to the measure $P_0$.

First, consider the special case where the probability measure can be described by a probability density function (pdf) $p(x,y;\theta)$. Here, $x$ is a real-valued random variable that we have observed, $y$ is a real-valued unobserved random variable, and $\theta$ indexes the family of joint pdfs. The likelihood function when there is a “hidden variable” $y$ is usually defined as $\theta \mapsto p(x;\theta)$ where $p(x;\theta)$ is the marginalised pdf obtained by integrating out the unknown variable $y$, that is, $p(x;\theta) = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$. Does this likelihood function equal $L(\theta)$ when $\mathcal X$ is the $\sigma$-algebra generated by the random variable $x$?

The correspondence between the measure and the pdf is: $P_\theta(A) = \int_A p(x,y;\theta)\,dx\,dy$ for any (measurable) set $A \subset \mathbb{R}^2$; this is the probability that $(x,y)$ lies in $A$. In this case, the Radon-Nikodym derivative $\frac{dP_\theta}{dP_0}$ is simply the ratio $\frac{p(x,y;\theta)}{p(x,y;0)}$. The conditional expectation given $x$ under the distribution $p(x,y;0)$ is computed with the conditional density $p(y \mid x; 0) = \frac{p(x,y;0)}{p(x;0)}$, so that $E_0\left[ \frac{p(x,y;\theta)}{p(x,y;0)} \mid x \right] = \int_{-\infty}^{\infty} \frac{p(x,y;\theta)}{p(x,y;0)} \frac{p(x,y;0)}{p(x;0)}\, dy = \frac{1}{p(x;0)} \int_{-\infty}^{\infty} p(x,y;\theta)\,dy = \frac{p(x;\theta)}{p(x;0)}$. The factor $p(x;0)$ does not depend on $\theta$, and a likelihood function is only defined up to such a factor, so this verifies in this special case that $L(\theta)$ is indeed the likelihood function.
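The same calculation can be checked in a discrete analogue, with pmfs in place of pdfs and sums in place of integrals. The sketch below uses made-up joint pmfs on $\{0,1\} \times \{0,1\}$ for $\theta = 0$ and $\theta = 1$. It computes the conditional expectation of the ratio under $\theta = 0$ and compares it with the ratio of marginals $p(x;\theta)/p(x;0)$, which differs from the marginal $p(x;\theta)$ only by a factor that does not depend on $\theta$.

```python
from fractions import Fraction as F

# Hypothetical joint pmfs p(x, y; theta), keyed by theta and then by (x, y).
p = {
    0: {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)},
    1: {(0, 0): F(1, 2), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(1, 8)},
}

def marginal(theta, x):
    """p(x; theta): sum the joint pmf over the hidden variable y."""
    return sum(q for (a, _), q in p[theta].items() if a == x)

def cond_exp_ratio(theta, x):
    """E_0[ p_theta / p_0 | x ], using the conditional pmf of y given x under theta = 0."""
    total = F(0)
    for (a, b), q0 in p[0].items():
        if a == x:
            total += (p[theta][(a, b)] / q0) * (q0 / marginal(0, x))
    return total

for x in (0, 1):
    print(x, cond_exp_ratio(1, x), marginal(1, x) / marginal(0, x))
```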

The above verification does not make $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ any less mysterious. Instead, it can be understood directly as follows. From the definition of conditional expectation, it is straightforward to verify that $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$, the Radon-Nikodym derivative of $P_\theta$ with respect to $P_0$ when both measures are restricted to $\mathcal X$, meaning that for any $\mathcal X$-measurable set $A$, $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$. The likelihood function is basically asking for the “probability” that we observed what we did, or, more precisely, we want to take the set $A$ to be our actual observation and see how $P_\theta(A)$ varies with $\theta$. This would work if $P_\theta(A) > 0$ but otherwise it is necessary to look at how $P_\theta(A)$ varies when $A$ is an arbitrarily small but non-negligible set centred on the true observation. (If you like, it is impossible to make a perfect observation correct to infinitely many significant figures; instead, an observation of $x$ usually means we know, for example, that $1.0 \leq x \leq 1.1$, hence $A$ can be chosen to be the event that $1.0 \leq x \leq 1.1$ instead of the negligible event $x = 1.05$.) It follows from the integral representation $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$ that $\left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ describes the behaviour of $P_\theta(A)$, relative to $P_0(A)$, as $A$ shrinks down from a range of outcomes to a single outcome. Importantly, the subscript $\mathcal X$ means $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ is $\mathcal X$-measurable; therefore, $L(\theta)$ depends only on what is observed and not on any other hidden variables.
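The shrinking-set picture can be illustrated numerically. The sketch below assumes, purely for illustration, that $P_\theta$ is the normal distribution with mean $\theta$ and unit variance on the observed variable. The Gaussian family and all function names are assumptions, not part of the argument above. It compares $P_\theta(A)/P_0(A)$ for shrinking intervals $A = [x - \epsilon, x + \epsilon]$ with the density ratio $\exp(\theta x - \theta^2/2)$.

```python
import math

def normal_cdf(t, mean=0.0):
    """CDF of a unit-variance normal distribution with the given mean."""
    return 0.5 * (1.0 + math.erf((t - mean) / math.sqrt(2.0)))

def prob_interval(theta, x, eps):
    """P_theta([x - eps, x + eps]) under the assumed N(theta, 1) model."""
    return normal_cdf(x + eps, theta) - normal_cdf(x - eps, theta)

def interval_ratio(theta, x, eps):
    """P_theta(A) / P_0(A) for the interval A of half-width eps around x."""
    return prob_interval(theta, x, eps) / prob_interval(0.0, x, eps)

def density_ratio(theta, x):
    """Ratio of the N(theta, 1) and N(0, 1) densities at x."""
    return math.exp(theta * x - 0.5 * theta ** 2)

x_obs, theta = 1.05, 0.7
for eps in (0.5, 0.1, 0.01, 0.001):
    print(eps, interval_ratio(theta, x_obs, eps), density_ratio(theta, x_obs))
```

As `eps` decreases, the ratio of interval probabilities approaches the density ratio at the observation, which is the behaviour the paragraph above describes.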

While the above is not a careful exposition, it will hopefully point the interested reader in a sensible direction.

## Measure-theoretic Probability: Still not convinced?

This is a sequel to the introductory article on measure-theoretic probability and accords with my belief that learning should not be one-pass, by which I mean loosely that it is more efficient to learn the basics first at a rough level and then come back to fill in the details soon afterwards. It endeavours to address the questions:

• Why a probability triple $(\Omega,\mathfrak{F},\mathbb{P})$ at all?
• What if $\mathfrak F$ is not a $\sigma$-algebra?
• Why is it important that $\mathbb P$ is countably additive?

• Why can’t a uniform probability be defined on the natural numbers $\{0,1,2,\cdots\}$?

Consider a real-life process, such as the population $X_k$ of a family of rabbits at each generation $k$. This gives us a countable family of random variables $\{X_1,X_2,\cdots\}.$ (Recall that countable means countably infinite; with only a finite number of random variables, matters would be simpler.) We can safely assume that if $X_k = 0$ for some $k$ then the population has died out, that is, $X_{k+1} = X_{k+2} = \cdots = 0.$

What is the probability that the population dies out?

The key questions here are the implicit questions of how to actually define and then subsequently calculate this probability of extinction. Intuitively, we want the probability that there exists an $m$ such that $X_m = 0.$ When trying to formulate this mathematically, we may think to split this up into bits such as “does $X_1 = 0$?”, “does $X_2 = 0$?” and so forth. Because these events are not disjoint (if we know $X_1 = 0$ then we are guaranteed that $X_2 = 0$) we realise that we need some way to account for this “connection” between the random variables. Is there any better way of accounting for this “connection” other than by declaring the “full” outcome to be $\omega \in \Omega$ and interpreting each $X_k$ as a function of $\omega$? (Only by endeavouring to think of an alternative will the full merit of having an $\Omega$ become clear.)

There are (at least) two paths we could take to define the probability of the population dying out. The first was hinted at already: segment $\Omega$ into disjoint sets, then add up the probabilities of each of the relevant sets. Precisely, the sets $F_1 = \{\omega \in \Omega \mid X_1(\omega) = 0\}$, $F_2 = \{\omega \in \Omega \mid X_1(\omega) \neq 0, X_2(\omega) = 0\}$, $F_3 = \{\omega \in \Omega \mid X_2(\omega) \neq 0, X_3(\omega) = 0\}$ and so forth are disjoint, and we are tempted to sum the probabilities of each one occurring to arrive at the probability of extinction. This is an infinite summation though, so unless we believe that probability is countably additive (recall that this means $\mathbb{P}(\cup_{i=1}^\infty F_i) = \sum_{i=1}^\infty \mathbb{P}(F_i)$ for disjoint sets $F_i$) then this avenue is not available.

Another path is to recognise that the sets $B_k = \{\omega \in \Omega \mid X_k(\omega) = 0\}$ are like Russian dolls, one inside the other, namely $B_1 \subset B_2 \subset B_3 \subset \cdots.$ This means that their probabilities, $\mathbb{P}(B_k)$, form a non-decreasing sequence, and moreover, we are tempted to believe that $\lim_{k \rightarrow \infty} \mathbb{P}(B_k)$ should equal the probability of extinction. (The limit exists because the $\mathbb{P}(B_k)$ form a bounded and monotonic sequence.)

In fact, these paths are equivalent. If $\mathbb P$ is countably additive and the $B_k$ are nested as above then $\mathbb{P}(\cup_{k=1}^\infty B_k) = \lim_{k \rightarrow \infty} \mathbb{P}(B_k)$. The converse is true too: if $\mathbb P$ is finitely additive and, for any sequence of nested sets, the probability and the limit operations can be interchanged (which is how the statement $\mathbb{P}(\cup_{k=1}^\infty B_k) = \lim_{k \rightarrow \infty} \mathbb{P}(B_k)$ should be interpreted), then $\mathbb P$ is countably additive.

Essentially, we have arrived at the conclusion that the only sensible way we can define the probability of extinction is to agree that probability is countably additive and then carry out the calculations above. Without countable additivity, there does not seem to be any way of defining the probability of extinction in general.
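To see the two paths agree on a concrete model, the sketch below assumes a Galton-Watson branching process started from a single founder, with a made-up offspring distribution. Neither the model nor the numbers come from the discussion above. Under that assumption, $\mathbb{P}(B_k)$ is obtained by iterating the offspring probability generating function from zero, and $\mathbb{P}(F_k) = \mathbb{P}(B_k) - \mathbb{P}(B_{k-1})$ by finite additivity.

```python
# Hypothetical offspring distribution: each rabbit has 0, 1 or 2 offspring.
OFFSPRING_PMF = {0: 0.25, 1: 0.25, 2: 0.5}

def pgf(s):
    """Probability generating function of the offspring distribution."""
    return sum(prob * s ** n for n, prob in OFFSPRING_PMF.items())

def extinction_by_generation(k_max):
    """Return [P(B_1), ..., P(B_k_max)] with B_k = {X_k = 0}, one founder assumed."""
    probs, q = [], 0.0
    for _ in range(k_max):
        q = pgf(q)  # P(X_k = 0) = G(P(X_{k-1} = 0)) for a Galton-Watson process
        probs.append(q)
    return probs

p_B = extinction_by_generation(200)
# F_k = B_k minus B_{k-1}: the population first hits zero at generation k.
p_F = [p_B[0]] + [b - a for a, b in zip(p_B, p_B[1:])]

print("limit of P(B_k):", p_B[-1])
print("sum of P(F_k):  ", sum(p_F))
```

For this offspring distribution the fixed-point equation $s = \frac14 + \frac14 s + \frac12 s^2$ has roots $\frac12$ and $1$, and both computations approach the smaller root $\frac12$.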

The above argument in itself is intended to complete the motivation for having a probability triple; the $\Omega$ is required to “link” random variables together and countable additivity is required in order to model real-world problems of interest. The following section goes further though by giving an example of when countable additivity does not hold.

### A Uniform Distribution on the Natural Numbers

For argument’s sake, let’s try to define a “probability triple” $(\Omega,\mathfrak{F},\mathbb{P})$ corresponding to a uniform distribution on the natural numbers $\Omega = \{0,1,2,\cdots\}$. The probability of drawing an even number should be one half, the probability of drawing an integer multiple of 3 should be one third, and so forth. Generalising this principle, it seems entirely reasonable to define $\mathbb{P}(F)$ to be the limit, as $N \rightarrow \infty$, of the number of elements of $F$ less than $N$ divided by $N$ itself. Since this limit does not necessarily exist, we sidestep the problem by declaring $\mathfrak F$ to be the set of all $F \subset \Omega$ for which this limit exists.

It can be shown directly that $\mathfrak F$ is not a $\sigma$-algebra.  In fact, it is not even an algebra because it is relatively straightforward to construct two subsets of $\Omega$, call them $A$ and $B$, which belong to $\mathfrak F$ but whose intersection does not, that is, there exist $A, B \in \mathfrak F$ for which $A \cap B \not\in \mathfrak F$.
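One candidate pair can be explored numerically. In the sketch below (an illustrative construction, with hypothetical names), $A$ is the set of even numbers, and $B$ alternates between taking the even and the odd numbers on successive dyadic blocks $[2^m, 2^{m+1})$. Both sets have running fractions near one half, but the running fraction of $A \cap B$ keeps oscillating. A finite computation is evidence rather than proof, though the oscillation between roughly $\frac13$ and $\frac16$ can be confirmed by summing the geometric series by hand.

```python
def in_A(n):
    """A = even numbers."""
    return n % 2 == 0

def in_B(n):
    """B takes evens on blocks [2**m, 2**(m+1)) with m even, odds when m is odd."""
    if n == 0:
        return False
    block = n.bit_length() - 1  # n lies in [2**block, 2**(block + 1))
    return (n % 2 == 0) if block % 2 == 0 else (n % 2 == 1)

def in_AB(n):
    """Membership in the intersection of A and B."""
    return in_A(n) and in_B(n)

def running_fraction(member, N):
    """Fraction of {0, ..., N-1} that belongs to the set."""
    return sum(1 for n in range(N) if member(n)) / N

for m in range(10, 17):
    N = 2 ** m
    print(N, running_fraction(in_A, N), running_fraction(in_B, N),
          running_fraction(in_AB, N))
```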

Does $\mathbb{P}$ behave nicely? Let $B_k = \{0,\cdots,k\}$ and observe that $B_1 \subset B_2 \subset \cdots$ and $\Omega = \cup_{k=1}^{\infty} B_k.$ We know from the earlier discussion about extinction that it is very natural to expect that $\lim_{k \rightarrow \infty} \mathbb{P}(B_k) = \mathbb{P}(\Omega)$. However, this is not the case here; since each of the $B_k$ contains only a finite number of elements, it follows that $\mathbb{P}(B_k) = 0$. Therefore, the limit on the left hand side is zero whereas the right hand side is equal to one.
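Spelled out with the definition of $\mathbb{P}$ given earlier (counting the elements less than $N$): for fixed $k$ and every $N > k$, the set $B_k$ has exactly $k+1$ elements less than $N$, whereas $\Omega$ has $N$ of them. Hence

$$\mathbb{P}(B_k) = \lim_{N \rightarrow \infty} \frac{k+1}{N} = 0 \quad \text{for every } k, \qquad \mathbb{P}(\Omega) = \lim_{N \rightarrow \infty} \frac{N}{N} = 1.$$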

In summary:

• Countable additivity enables us to give meaning to probabilities of real-world events of interest to us (such as probability of extinction).
• Without countable additivity, even very basic results such as $\mathbb{P}(\cup_{k=1}^\infty B_k) = \lim_{k \rightarrow \infty} \mathbb{P}(B_k)$ for nested $B_k$ need not hold. In other words, there are not enough constraints on $\mathbb P$ for a comprehensive theory to be developed if we drop the requirement of $\mathbb{P}$ being countably additive over a $\sigma$-algebra $\mathfrak F$.