## How the Jacobian can Detect Cusps and Self-intersections

Let $M$ be a subset of the plane defined by the vanishing of a smooth (that is, derivatives of all orders exist) function $f\colon \mathbb{R}^2 \rightarrow \mathbb{R}$.  For simplicity, it is assumed throughout that $f$ is not identically zero.

For example, if $f(x,y) = x^2 + y^2 - 1$ then the set $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ is the unit circle. This can be readily visualised using the online Sage Notebook. Sign in to Sage, create a new worksheet, and type: var(‘x,y’); implicit_plot(x^2+y^2-1==0, (x, -2, 2), (y, -2, 2)); NOTE: WordPress automatically converts the first apostrophe in var(‘x,y’); from the correct ${}'$ to the incorrect `.

Next, try: var(‘x,y’); implicit_plot(x^3+y^3-x*y==0, (x, -2, 2), (y, -2, 2));

Note that this second example has a self-intersection; if one considers starting at one end of the line and walking to the other end, there will be a point that will be passed over twice, once when entering the loop and once upon exit.

Thirdly, try: var(‘x,y’); implicit_plot(x^3-y^2==0, (x, -2, 2), (y, -2, 2));

This third example has a cusp; if one endeavours to drive a car from the bottom end of the line to the top end, one will get stuck at the cusp with the car pointing to the left but the “road” continuing off to the right.

Roughly speaking, and as a good introduction to differential geometry, a (concrete) manifold is a subset of $\mathbb{R}^n$ that does not contain cusps (that is, it is smooth) and does not contain self-intersections (that is, it is locally Euclidean). Regardless, it is interesting to ask how difficult it is to determine if the set $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$  has any cusps or self-intersections.  The intuitive principle of driving a car along the curve $M$ to detect cusps or self-intersections may appear difficult to translate into a mathematical tool, especially for detecting self-intersections.

Note though that driving a car corresponds to a parametrisation of the curve: roughly speaking, constructing functions $x(t)$ and $y(t)$ such that $f(x(t),y(t)) = 0$ for all $t$. Finding a self-intersection is equivalent to finding a non-trivial ($t \neq t'$) solution to $x(t)=x(t')$ and $y(t)=y(t')$.  This is a difficult “global” problem. However, the curve has been specified implicitly by $f(x,y) =0$, and perhaps the problem is not global but only local with respect to this implicit formulation?

Detecting cusps and self-intersections really is only a local problem with respect to $f(x,y)$, meaning that knowledge of what $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ looks like anywhere except in a small neighbourhood of a point $(x,y)$ is irrelevant to determining whether there is a singularity or self-intersection at $(x,y)$.  This suggests considering a Taylor series of $f$ about the point $(x,y)$ of interest. [Technically, we would also want $f$ to be analytic if we are relying on a Taylor series argument, but we are focusing here on gaining a feel for the problem in the first instance.] It is also worth noting that this is also an example of “unhelpful intuition”; thinking of driving cars is a hindrance, not a help.

Choose a point $(x',y')$ lying on $M$, that is, $f(x',y') = 0$.  Then $f(x,y) = \alpha (x-x') + \beta (y-y') + \cdots$ where the dots denote higher-order terms.  If either $\alpha$ or $\beta$ is non-zero then $f$ contains a linear term, and the linear term dominates the behaviour of $f$ when $(x,y)$ is sufficiently close to $(x',y')$.  The linear equation $\alpha (x-x') + \beta (y-y') = 0$ describes a straight line; there can be no cusps or self-intersections.  This argument can be made rigorous using the implicit function theorem, even when $f$ is not analytic: if for every point $(x,y) \in M$, either $\frac{\partial f}{\partial x}$ or $\frac{\partial f}{\partial y}$ is non-zero at $(x,y)$, then $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ contains no self-intersections or cusps.  This result, stated more generally in terms of the rank of the Jacobian matrix, appears near the beginning in most differential geometry textbooks.  Without the above considerations though, it may appear mysterious how the Jacobian can detect cusps and self-intersections. (The key point is not to think in terms of a parametrisation of the curve, but instead, break the space $\mathbb{R}^2$ into neighbourhoods $U_i$  which are sufficiently small that it is possible to determine exactly what $M \cap U_i$ looks like. To belabour the point, if a point in $U_i$ is not in $M \cap U_i$ then it is not in $M$.)

Textbooks generally do not mention explicitly though that if $M$ fails this “Jacobian test” then it may still be a manifold. The above makes this clear; if $\alpha=\beta=0$ then it is necessary to examine higher-order terms before making any claims. As a simple example, $f(x,y) = x^3$ fails the test at $(0,0)$ because both $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are zero at the origin, yet $M = \{ (x,y) \in \mathbb{R}^2 \mid x^3 = 0\}$ is the straight line given equivalently by $x=0$.

## Long versus Short Proofs

Proofs are very similar to computer programs. And just like computer programs, there are often many different ways of writing a proof.  Sometimes the difference between proofs is great; one might be based on geometry while another on analysis. The purpose of this short note is to emphasise that different proofs are possible at the “simple” level too.  While this may appear mundane, it is actually an important part of learning mathematics: once you have proved a result, spend a few minutes trying to simplify the proof. It will make the next proof you do easier. The two key messages are as follows.

• There is no substitute for experience when it comes to deriving proofs.
• Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

For lack of a better on-the-spot example, consider the following problem. Let $f\colon\mathbb{C} \rightarrow \mathbb{C}$ be an analytic function having a double zero at the origin and whose first derivative has at most linear growth: $| f'(x) | \leq \alpha |x|$ for some $\alpha \in \mathbb{R}$.  What is the most general form of $f$?

## First Approach

A double zero at the origin means $f(x) = x^2 h(x)$ for some analytic function $h$. Therefore $f'(x) = 2xh(x) + x^2h'(x)$ and $\frac{f'(x)}{x} = 2h(x) + xh'(x)$.  Since $latex 2h(x) + xh'(x)$ is both analytic and bounded, it must be constant, by Liouville’s theorem.  (Here, all analytic functions are entire because they are defined on the whole of the complex plane.) Solving the differential equation $latex 2h(x) + xh'(x) = c$ by substituting in the power series expansion $h(x) = \sum a_i x^i$ shows that $h(x) = \frac{c}2$. [The general solution of $latex 2h(x) + xh'(x) = 0$ is $h(x) = \beta x^{-2}$ where $\beta$ is an arbitrary scalar. The only general solution that is analytic is $h(x)=0$.] The conclusion is that the most general form of $f$ is $f(x) = ax^2$ for some complex scalar $a$.

## Second Approach

The first approach is unattractive and unnecessarily complicated. Instead of starting with the double zero, start instead with the first derivative having a linear bound. (A standard generalisation of Liouville’s theorem is that an entire function with a linear bound must itself be linear. We will pretend here we do not know this fact.) If $f(x)$ has a double zero at the origin then $f'(x)$ has a zero at the origin, therefore $f'(x) = x g(x)$ for some analytic $g(x)$. The linear bound together with Liouville’s theorem means $g(x)$ is constant, that is $f'(x) = 2 a x$ for some scalar $a$. Therefore $f(x)$ must equal $ax^2$ if it is to satisfy both $f'(x) = 2 a x$ and $f(0)=0$.

## What was the Difference?

The first approach expressed $f$ as $f(x) = x^2 h(x)$ while the second approach expressed $f'$ as $f'(x) = x g(x)$. Both approaches resulted in a differential equation, but the second approach resulted in the simpler differential equation $f'(x) = 2ax$. Underlying this example is that a “change of coordinates” can simplify a differential equation. Although both approaches could be made to work in this simple example, there are situations where some approaches are too difficult to follow through to completion.

One could argue that because the linear bound constraint is “harder” than the double-zero constraint, one should start with the linear bound constraint and not the double-zero constraint, and therefore be led to the simpler differential equation. Yet the real messages are as stated at the start:

• There is no substitute for experience when it comes to deriving proofs.
• Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

## Measure-theoretic Formulation of the Likelihood Function

Let $P_\theta$ be a family of probability measures indexed by $\theta \in \Theta$. For notational convenience, assume $0 \in \Theta$, so that $P_0$ is one of the probability measures in the family. This short note sketches why $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ is the likelihood function, where the $\sigma$-algebra $\mathcal X$ describes the possible observations and $E_0$ denotes expectation with respect to the measure $P_0$.

First, consider the special case where the probability measure can be described by a probability density function (pdf) $p(x,y;\theta)$. Here, $x$ is a real-valued random variable that we have observed, $y$ is a real-valued unobserved random variable, and $\theta$ indexes the family of joint pdfs. The likelihood function when there is a “hidden variable” $y$ is usually defined as $\theta \mapsto p(x;\theta)$ where $p(x;\theta)$ is the marginalised pdf obtained by integrating out the unknown variable $y$, that is, $p(x;\theta) = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$. Does this likelihood function equal $L(\theta)$ when $\mathcal X$ is the $\sigma$-algebra generated by the random variable $x$?

The correspondence between the measure and the pdf is: $P_\theta(A) = \int_A p(x,y;\theta)\,dx\,dy$ for any (measurable) set $A \subset \mathbb{R}^2$; this is the probability that $(x,y)$ lies in $A$. In this case, the Radon-Nikodym derivative $\frac{dP_\theta}{dP_0}$ is simply the ratio $\frac{p(x,y;\theta)}{p(x,y;0)}$. The conditional expectation with respect to $X$ under the distribution $p(x,y;0)$ is $E_0\left[ \frac{p(x,y;\theta)}{p(x,y;0)} \mid x \right] = \int_{-\infty}^{\infty} \frac{p(x,y;\theta)}{p(x,y;0)} p(x,y;0)\, dy = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$, verifying in this special case that $L(\theta)$ is indeed the likelihood function.

The above verification does not make $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ any less mysterious. Instead, it can be understood directly as follows. From the definition of conditional expectation, it is straightforward to verify that $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ meaning that for any $\mathcal X$-measurable set $A$, $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$. The likelihood function is basically asking for the “probability” that we observed what we did, or precisely, we want to take the set $A$ to be our actual observation and see how $P_\theta(A)$ varies with $\theta$. This would work if $P_\theta(A) > 0$ but otherwise it is necessary to look at how $P_\theta(A)$ varies when $A$ is an arbitrarily small but non-negligible set centred on the true observation. (If you like, it is impossible to make a perfect observation correct to infinitely many significant figures; instead, an observation of $x$ usually means we know, for example, that $1.0 \leq x \leq 1.1$, hence $A$ can be chosen to be the event that $1.0 \leq x \leq 1.1$ instead of the negligible event $x = 1.05$.) It follows from the integral representation $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$ that $\left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ describes the behaviour of $P_\theta(A)$ as $A$ shrinks down from a range of outcomes to a single outcome. Importantly, the subscript $\mathcal X$ means $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ is $\mathcal X$-measurable, therefore, $L(\theta)$ depends only on what is observed and not on any other hidden variables.

While the above is not a careful exposition, it will hopefully point the interested reader in a sensible direction.

Advances in mathematics occur in one of two ways.
The first occurs by the solution of some outstanding problem, such as the Bieberbach conjecture or Fermat’s conjecture. Such solutions are justly acclaimed by the mathematical community. The solution of every famous mathematical problem is the result of joint effort of a great many mathematicians. It always comes as an unexpected application of theories that were previously developed without a specific purpose, theories whose effectiveness was at first thought to be highly questionable.
Mathematicians realized long ago that it is hopeless to get the lay public to understand the miracle of unexpected effectiveness of theory. The public, misled by two hundred years of Romantic fantasies, clamors for some “genius” whose brain power cracks open the secrets of nature. It is therefore a common public relations gimmick to give the entire credit for the solution of famous problems to the one mathematician who is responsible for the last step.
It would probably be counterproductive to let it be known that behind every “genius” there lurks a beehive of research mathematicians who gradually built up to the “final” step in seemingly pointless research papers. And it would be fatal to let it be known that the showcase problems of mathematics are of little or no interest for the progress of mathematics. We all know that they are dead ends, curiosities, good only as confirmation of the effectiveness of theory. What mathematicians privately celebrate when one of their showcase problems is solved is Polya’s adage: “no problem is ever solved directly.”
There is a second way by which mathematics advances, one that mathematicians are also reluctant to publicize. It happens whenever some commonsense notion that had heretofore been taken for granted is discovered to be wanting, to need clarification or definition. Such foundational advances produce substantial dividends, but not right away. The usual accusation that is leveled against mathematicians who dare propose overhauls of the obvious is that of being “too abstract.” As if one piece of mathematics could be “more abstract” than another, except in the eyes of the beholder (it is time to raise a cry of alarm against the misuse of the word “abstract,” which has become as meaningless as the word “Platonism.”)
An amusing case history of an advance of the second kind is uniform convergence, which first made headway in the latter quarter of the nineteenth century. The late Herbert Busemann told me that while he was a student, his analysis teachers admitted their inability to visualize uniform convergence, and viewed it as the outermost limit of abstraction. It took a few more generations to get uniform convergence taught in undergraduate classes.
The hostility against groups, when groups were first “abstracted” from the earlier “group of permutations” is another case in point. Hadamard admitted to being unable to visualize groups except as groups of permutations. In the thirties, when groups made their first inroad into physics via quantum mechanics, a staunch sect of reactionary physicists, repeatedly cried “Victory!” after convincing themselves of having finally rid physics of the “Gruppenpest.” Later, they tried to have this episode erased from the history of physics.
In our time, we have witnessed at least two displays of hostility against new mathematical ideas. The first was directed against lattice theory, and its virulence all but succeeded in wiping lattice theory off the mathematical map. The second, still going on, is directed against the theory of categories. Grothendieck did much to show the simplifying power of categories in mathematics. Categories have broadened our view all the way to the solution of the Weil conjectures. Today, after the advent of braided categories and quantum groups, categories are beginning to look downright concrete, and the last remaining anticategorical reactionaries are beginning to look downright pathetic.
There is a common pattern to advances in mathematics of the second kind. They inevitably begin when someone points out that items that were formerly thought to be “the same” are not really “the same,” while the opposition claims that “it does not matter,” or “these are piddling distinctions.” Take the notion of species that is the subject of this book. The distinction between “labeled graphs” and “unlabeled graphs” has long been familiar. Everyone agrees on the definition of an unlabeled graph, but until a while ago the notion of labeled graph was taken as obvious and not in need of clarification. If you objected that a graph whose vertices are labeled by cyclic permutations – nowadays called a “fat graph” – is not the same thing as a graph whose vertices are labeled by integers, you were given a strange look and you would not be invited to the next combinatorics meeting.

Excerpt from the Forward by Gian-Carlo Rota (1997) to the book “Combinatorial Species and Tree-like Structures” by F. Bergeron et al.

Categories: Uncategorized Tags: ,

## “Most Likely” is an All or Nothing Proposition

The principle of maximum likelihood estimation is generally not explained well; readers are made to believe that it should be obvious to them that choosing the “most likely outcome” is the most sensible thing to do.  It isn’t obvious, and it need not be the most sensible thing to do.

First, recall the statement I made in an earlier paper:

The author believes firmly that asking for an estimate of a parameter is, a priori, a meaningless question. It has been given meaning by force of habit. An estimate only becomes useful once it is used to make a decision, serving as a proxy for the unknown true parameter value. Decisions include: the action taken by a pilot in response to estimates from the flight computer; an automated control action in response to feedback; and, what someone decides they hear over a mobile phone (with the pertinent question being whether the estimate produced by the phone of the transmitted message is intelligible). Without knowing the decision to be made, whether an estimator is good or bad is unanswerable. One could hope for an estimator that works well for a large class of decisions, and the author sees this as the context of estimation theory.

Consider the following problem.  Assume two coins are tossed, but somehow the outcome of the first coin influences the outcome of the second coin.  Specifically, the possible outcomes (H = heads, T = tails) and their probabilities are: HH $0.35$; HT $0.05$; TH $0.3$; TT $0.3$.  Given these probabilities, what is our best guess as to the outcome?  We have been conditioned to respond by saying that the most likely outcome is the one with the highest probability, namely, HH.  What is our best guess as to the outcome of the first coin only?  Well, there is $0.35 + 0.05 = 0.4$ chance it will be H and $0.3 + 0.3 = 0.6$ chance it will be T, so the most likely outcome is T.  How can it be that the most likely outcome of the first coin is T but the most likely outcome of both coins is HH?

The (only) way to understand this sensibly is to think in terms of how the estimate will be used.  What “most likely” really means is that it is the best strategy to use when placing an all-or-nothing bet.  If I must bet on the outcome of the two coins, and I win $1 if I guess correctly and win nothing otherwise, my best strategy is to bet on HH. If I must bet on the outcome of the first coin, the best strategy is to bet on T. This is not a contradiction because betting on the first coin being T is the same as betting on the two coins being either TH or TT. I can now win in two cases, not just one; it is a different gamble. The above is not an idle example. In communications, the receiver must estimate what symbols were sent. A typical mathematical formulation of the problem is estimating the state of a hidden Markov chain. One can choose to estimate the most likely sequence of states or the most likely state at a particular instance. The above example explains the difference and helps determine which is the more appropriate estimate to use. Finally, it is noted that an all-or-nothing bet is not necessarily the most appropriate way of measuring the performance of an estimator. For instance, partial credit might be given for being close to the answer, so if I guess two coins correctly I win$2, if I guess one coin correctly I win \$1, otherwise I win nothing.  This can be interpreted as “regularising” the maximum likelihood estimate.  Nevertheless, at the end of the day, the only way to understand an estimator is in the broader context of the types of decisions that can be made well by using that estimator.

## Background Information for Continuous-time Filtering and Estimation on Manifolds

The preprint A Primer on Stochastic Differential Geometry in Signal Processing discusses, among other things, the following in simple but rigorous terms:

• How Brownian motion can be generated on Riemannian manifolds;
• How “coloured” (technically, left-invariant) Brownian motion can be generated on Lie groups;
• Ito and Stratonovich integrals, and the transfer principle of Stratonovich integrals making them convenient to use for stochastic differential equations on manifolds;
• The special orthogonal groups SO(n);
• How a “Gaussian random variable” can be generated on a Riemannian manifold;
• How state-space models extend to manifolds;
• How stochastic development provides a convenient framework for understanding stochastic processes on manifolds;
• Whether or not stochastic integrals are “pathwise” computable.

The last section of the paper includes the following:

Several concepts normally taken for granted, such as unbiasedness of an estimator, are not geometric concepts and hence raise the question of their correct generalisations to manifolds. The answer is that the difficulty lies not with manifolds, but with the absence of meaning to ask for an estimate of a parameter. The author believes firmly that asking for an estimate of a parameter is, a priori, a meaningless question. It has been given meaning by force of habit. An estimate only becomes useful once it is used to make a decision, serving as a proxy for the unknown true parameter value. Decisions include: the action taken by a pilot in response to estimates from the flight computer; an automated control action in response to feedback; and, what someone decides they hear over a mobile phone (with the pertinent question being whether the estimate produced by the phone of the transmitted message is intelligible). Without knowing the decision to be made, whether an estimator is good or bad is unanswerable. One could hope for an estimator that works well for a large class of decisions, and the author sees this as the context of estimation theory.