## Measure-theoretic Formulation of the Likelihood Function

Let $\{P_\theta\}$ be a family of probability measures indexed by $\theta \in \Theta$. For notational convenience, assume $0 \in \Theta$, so that $P_0$ is one of the probability measures in the family, and assume each $P_\theta$ is absolutely continuous with respect to $P_0$, so that the Radon-Nikodym derivative $\frac{dP_\theta}{dP_0}$ exists. This short note sketches why $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ is the likelihood function, where the $\sigma$-algebra $\mathcal X$ describes the possible observations and $E_0$ denotes expectation with respect to the measure $P_0$.
First, consider the special case where the probability measure can be described by a probability density function (pdf) $p(x,y;\theta)$. Here, $x$ is a real-valued random variable that we have observed, $y$ is a real-valued unobserved random variable, and $\theta$ indexes the family of joint pdfs. The likelihood function when there is a “hidden variable” $y$ is usually defined as $\theta \mapsto p(x;\theta)$ where $p(x;\theta)$ is the marginalised pdf obtained by integrating out the unknown variable $y$, that is, $p(x;\theta) = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$. Does this likelihood function equal $L(\theta)$ when $\mathcal X$ is the $\sigma$-algebra generated by the random variable $x$?
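The correspondence between the measure and the pdf is $P_\theta(A) = \int_A p(x,y;\theta)\,dx\,dy$ for any (measurable) set $A \subset \mathbb{R}^2$; this is the probability that $(x,y)$ lies in $A$. In this case, the Radon-Nikodym derivative $\frac{dP_\theta}{dP_0}$ is simply the ratio $\frac{p(x,y;\theta)}{p(x,y;0)}$, wherever $p(x,y;0) > 0$. The conditional expectation given $x$ under the distribution $p(x,y;0)$ integrates against the conditional pdf $p(y \mid x;0) = \frac{p(x,y;0)}{p(x;0)}$, so $E_0\left[ \frac{p(x,y;\theta)}{p(x,y;0)} \mid x \right] = \int_{-\infty}^{\infty} \frac{p(x,y;\theta)}{p(x,y;0)} \, \frac{p(x,y;0)}{p(x;0)}\, dy = \frac{1}{p(x;0)} \int_{-\infty}^{\infty} p(x,y;\theta)\,dy = \frac{p(x;\theta)}{p(x;0)}$. Since $p(x;0)$ does not depend on $\theta$, this is the marginalised pdf $p(x;\theta)$ up to a factor that is constant in $\theta$. A likelihood function is only determined up to such a factor, so this verifies in this special case that $L(\theta)$ is indeed the likelihood function.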
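The special case above can be checked numerically. The following is a minimal sketch under an assumed toy model that is not part of the original note: the hidden variable is $y \sim N(0,1)$ and the observation is $x \mid y \sim N(\theta + y, 1)$, so that marginally $x \sim N(\theta, 2)$. The function names, the grid, and the observed value are all illustrative choices. The conditional expectation is computed by a Riemann sum against the conditional density $p(y \mid x; 0)$ and compared with the ratio of marginal densities $p(x;\theta)/p(x;0)$.

```python
import numpy as np

def normal_pdf(z, mean=0.0, var=1.0):
    """Density of N(mean, var) evaluated at z."""
    return np.exp(-(z - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def joint_pdf(x, y, theta):
    """Assumed toy model: y ~ N(0, 1) and x | y ~ N(theta + y, 1)."""
    return normal_pdf(y) * normal_pdf(x, mean=theta + y)

def conditional_expectation_of_ratio(x, theta, y_grid):
    """E_0[ p(x,y;theta) / p(x,y;0) | x ], approximated by a Riemann sum over y."""
    dy = y_grid[1] - y_grid[0]
    p0 = joint_pdf(x, y_grid, 0.0)
    p_theta = joint_pdf(x, y_grid, theta)
    # Conditional density p(y | x; 0) = p(x, y; 0) / p(x; 0), normalised on the grid.
    cond_density = p0 / (p0.sum() * dy)
    return np.sum((p_theta / p0) * cond_density) * dy

def marginal_ratio(x, theta):
    """p(x; theta) / p(x; 0), using the closed-form marginal x ~ N(theta, 2)."""
    return normal_pdf(x, mean=theta, var=2.0) / normal_pdf(x, mean=0.0, var=2.0)

y_grid = np.linspace(-12.0, 12.0, 20001)
x_obs = 0.7  # an arbitrary observed value
for theta in (-1.0, 0.5, 2.0):
    lhs = conditional_expectation_of_ratio(x_obs, theta, y_grid)
    rhs = marginal_ratio(x_obs, theta)
    print(f"theta={theta:+.1f}  E_0[ratio | x]={lhs:.6f}  p(x;theta)/p(x;0)={rhs:.6f}")
```

In this sketch the two printed columns should agree to within the accuracy of the Riemann sum, which illustrates that the conditional expectation depends on $\theta$ only through the marginal density of the observation.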
The above verification does not make $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ any less mysterious. Instead, it can be understood directly as follows. From the definition of conditional expectation, it is straightforward to verify that $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$, meaning that for any $\mathcal X$-measurable set $A$, $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$.

The likelihood function asks for the “probability” that we observed what we did. More precisely, we want to take the set $A$ to be our actual observation and see how $P_\theta(A)$ varies with $\theta$. This works if $P_\theta(A) > 0$, but otherwise it is necessary to look at how $P_\theta(A)$ varies when $A$ is an arbitrarily small but non-negligible set centred on the true observation. (If you like, it is impossible to make a perfect observation correct to infinitely many significant figures; instead, an observation of $x$ usually means we know, for example, that $1.0 \leq x \leq 1.1$, hence $A$ can be chosen to be the event that $1.0 \leq x \leq 1.1$ instead of the negligible event $x = 1.05$.)

It follows from the integral representation $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$ that $\left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ describes the behaviour of $P_\theta(A)$ as $A$ shrinks down from a range of outcomes to a single outcome. Importantly, the subscript $\mathcal X$ means that $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ is $\mathcal X$-measurable. Therefore, $L(\theta)$ depends only on what is observed and not on any hidden variables.
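The shrinking-set picture can also be illustrated numerically. The sketch below assumes, purely for illustration, a model with no hidden variable in which the observation is $x \sim N(\theta, 1)$; the function names and the numbers are hypothetical choices. By the integral representation, $P_\theta(A)/P_0(A)$ is the $P_0$-average of $L(\theta)$ over $A$, so for an interval $A$ centred on the observation this ratio should approach the density ratio $p(x;\theta)/p(x;0)$ as the interval shrinks.

```python
import math

def normal_cdf(z, mean, std):
    """Cumulative distribution function of N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((z - mean) / (std * math.sqrt(2.0))))

def normal_pdf(z, mean, std):
    """Density of N(mean, std^2)."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def interval_probability(lo, hi, theta):
    """P_theta(lo <= x <= hi) under the assumed model x ~ N(theta, 1)."""
    return normal_cdf(hi, theta, 1.0) - normal_cdf(lo, theta, 1.0)

def probability_ratio(x, half_width, theta):
    """P_theta(A) / P_0(A) for the interval A = [x - half_width, x + half_width]."""
    lo, hi = x - half_width, x + half_width
    return interval_probability(lo, hi, theta) / interval_probability(lo, hi, 0.0)

def density_ratio(x, theta):
    """Pointwise ratio p(x; theta) / p(x; 0), the candidate limit as A shrinks."""
    return normal_pdf(x, theta, 1.0) / normal_pdf(x, 0.0, 1.0)

x_obs, theta = 1.05, 0.8  # illustrative observation and parameter value
for half_width in (0.5, 0.05, 0.005):
    ratio = probability_ratio(x_obs, half_width, theta)
    print(f"half_width={half_width:<6} P_theta(A)/P_0(A)={ratio:.6f}")
print(f"density ratio at x_obs: {density_ratio(x_obs, theta):.6f}")
```

The interval with half-width 0.05 corresponds to the event $1.0 \leq x \leq 1.1$ used in the text. Dividing by $P_0(A)$ is what keeps the quantity from collapsing to zero as $A$ shrinks, and because $P_0(A)$ does not depend on $\theta$, the division does not change how the result varies with $\theta$.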