Understanding (real- and complex-valued) Inner Products

This short note addresses two issues.

• How can we intuitively understand a complex-valued inner product?
• If an inner product structure is given to a vector space, how can we understand the resulting geometry?

Since inner products are associated with angles, and since we can understand angles, there is temptation to interpret inner products in terms of angles. I advocate against this being the primary means of interpreting inner products.

An inner product $\langle \cdot, \cdot \rangle$ induces the norm $\| x \| = \sqrt{\langle x,x \rangle}$. Importantly, the inner product can be recovered from this norm by the polarisation identity. Therefore, understanding the geometry of an inner product is the same as understanding the geometry of the norm, and for the latter, it often suffices to consider what the unit ball looks like. For me, the norm is the primary structure giving the space its geometry.

What then is the purpose of the inner product? Not all norms have the same properties. Under some norms, projection onto a closed subspace may not be unique, for example. When interested in shortest-norm optimisation problems, the most desirable situation to be in is for the square of the norm to be quadratic, since then differentiating it produces a linear equation. In infinite dimensions, what does it mean for the square of a norm to be quadratic?

The presence of an inner product structure means the square of the norm is quadratic. Furthermore, the inner product “decomposes” the norm in a way that gives direct access to the derivative of the norm squared.

The remaining issue is how to understand complex-valued inner products. Given the above, the natural starting place is to consider endowing a complex vector space with a norm. Keeping the axioms of a real-valued normed vector space seems sensible; it implies that scaling a vector by $e^{\jmath \theta}$ does not change its norm (because $\| e^{\jmath \theta} x \| = | x |\,\| x \| = \| x \|$).

Then one asks what it means for the square of a norm to be quadratic. From the real-valued case, one guesses that one wants to be able to represent the square of the norm as a bilinear form: $\| x \|^2 = \langle x, x \rangle$, where $\langle \cdot, \cdot \rangle$ is linear in each of its arguments. Following the letter of the law, this would mean $\| \alpha x \|^2 = \langle \alpha x, \alpha x \rangle = \alpha^2 \langle x,x \rangle = \alpha^2 \|x\|^2$. In the complex case though, $\alpha^2$ need not equal $|\alpha|^2$. This explains why one tweaks the definition and instead considers sesquilinear forms which are linear in one argument and conjugate linear in the other: $\langle \alpha x, x \rangle = \alpha \langle x, x \rangle = \langle x, \bar\alpha x \rangle$. Indeed, one then correctly has that $\|\alpha x \|^2 = \langle \alpha x, \alpha x \rangle = \alpha \bar\alpha \langle x , x \rangle = |\alpha|^2 \|x\|^2$. With this tweak, one can verify that the complex-valued case works the same way as the real-valued case.

By treating the norm as the primary structure, one does not have to worry about giving an intuitive meaning to the inner product of two vectors not being a purely real-valued number; the inner product is there merely to expose the square of the norm as being quadratic. A complex-valued inner product is recoverable from its norm and hence no geometric information is lost. (Of course, orthogonality remains an important concept.) If one really wanted, one could play around with examples in $\mathbb{C}^2$ to get a better feel for what it means for $\langle x , y \rangle = \jmath$, for example, however, unless one encounters a particular problem encountering this level of detail, thinking in terms of norms is cleaner and more efficient. (If $\langle x, y \rangle = r e^{\jmath\theta}$ then $\langle x, e^{\jmath \theta} y \rangle = r$, so that by “rotating” a complex-valued vector in the two-dimensional real-valued vector space that it spans, one can always return to thinking about real-valued inner products.)

Are Humans Smart and Computers Dumb? Can Computers Become Better at Helping Humans Solve Challenging Problems?

Computers are generally seen as “dumb”, unable to think for themselves and therefore unable to solve certain complex tasks that humans, being “smart”, are able to solve relatively easily. Humans outperform computers at speech recognition, many image processing problems, and at real-time control problems such as walking – bipedal robots cannot manoeuvre as well as humans. Some may argue computers will never be able to prove mathematical theorems or engineer complex systems as well as mathematicians or engineers.

From a pragmatic perspective, “being smart” is ill-defined.  Roughly speaking, a solution is “smart” if a formulaic solution is not known (or if the solution is obtained much more efficiently than via a formulaic but brute force solution, e.g. one could argue humans are still “smarter” at chess than computers, even though computers can now beat humans).

My current viewpoint is that:-

• we are probably not as smart as we may think;
• computers are dumb primarily because we only know how to program them that way;
• regardless, the future vision should be of challenging problems being solved jointly by humans and computers without the current clunky divide between what humans do and what computers do. We should be aiming for semi-automated design, not fully-automated design.

The following three sections elaborate on the above three points in turn. First a summary though.

Summary: We should not dismiss the idea that computers can help us in significantly better ways to solve challenging problems simply because we see a divide: we are smart and computers are not. Ultimately, smartness probably can be recreated algorithmically provided computers (robots?) and humans start to interact extensively. But well prior to this, computers can become better at assisting us solve challenging problems if we start to understand how much ‘intuition’ and ‘problem solving’ boils down to rules and pattern recognition. Certainly not all intuition would be easy to capture, but often challenging problems involve large amounts of fairly routine manipulations interspersed by ingenuity. Semi-automated design aims to have a human and a computer work together, with the computer handling the more routine manipulations and the human providing high-level guidance which is where the ingenuity comes in.

Readers short of time but interested primarily in a suggested research direction can jump immediately to the last section.

As the focus is on solving mathematical and engineering problems, no explicit consideration is given to philosophical or spiritual considerations. Smartness refers merely to the ability to solve challenging (but not completely unconstrained) problems in mathematics and engineering.

How Smart Are Humans?

A definitive answer is hard to come by, but the following considerations are relevant.

Human evolution traces back to prokaryote cells. At what point did we become “smart”? My favoured hypothesis is that as the brain got more complex and we learnt to supplement our abilities with tools (e.g. by writing things down to compensate for fallible memory), exponential improvement resulted in our capabilities. (I vaguely recall that one theory of how our brain got more complex was by learning to use tools that lead to improved diets, e.g. by breaking bones and accessing the marrow.) It seems much less likely we suddenly picked up an “intelligence gene” that no other creature has. Flies, rats, apes and humans are not that different in terms of basic building blocks.

When comparing humans to computers, one should not forget that humans require 21 or so years of boot-up time. And different humans have different strengths; not all are equally good at solving mathematical problems or engineering systems. An unpleasant a thought as it may be, consider an extraterrestrial who has a choice of building and programming computers or breeding and teaching humans. Which would be a better choice for the extraterrestrial wishing for help in solving problems? (Depending on the initial gene pool and initial population size, it might take 100 years before a human is bred that is good at a particular type of problem. Regardless, if starting from scratch, there is an unavoidable 15-20 year delay.)

Then there is the issue of what precisely smartness is. Smartness is defined by a social phenomenon. At a small scale, one just has to look at review committees and promotion panels to realise there is not always agreement on what is smart and what is not. More striking though, smartness is relative – humans compare with each other and with the world around them. There are infinitely more problems we cannot solve than we can solve, yet no one seems to conclude that we are therefore not smart. We look at what we can do and compare that with what others can do. In a similar vein, a computer can effortlessly produce millions and millions of theorems and proofs in that it could systematically piece axioms together to form true sentences, recording each new sentence as a “theorem” and the trail leading to it as a “proof”. Yet this would be dismissed immediately as not useful. So ‘usefulness to humans’ plays a role in defining smartness.

How do humans solve problems? Following again the principle of Occam’s Razor, the simplest hypothesis that comes to my mind involves the following factors. Most of our abilities come from having to compete in the world (think back thousands of years). Image processing is crucial, and as over half the brain is involved to some extent with vision, a large part of how we reason is visual. (Even if we do not explicitly form images in our minds, our subconscious is likely to be using similar circuitry.) We also need to be able to understand cause and effect — if I stand next to that lion, I will get eaten — which leads to the concept of thinking systematically. So my current hypothesis is that systematic thinking (including understanding cause and effect) and pattern recognition are the main players when it comes to reasoning. The mathematics and engineering we have developed, largely fits into this mould. Rigorous systemisation and writing down and sharing of results have lead to ‘amazing’ results that no single human could achieve (insufficient time!) but at each little step, ideas come from experience and nowhere else. Those that can calculate faster, are more perceptive, are more inquisitive, tend to find the “right” types of patterns, have greater focus and stamina, and are more attuned to what others may find interesting, have significant competitive advantages, but to rule a line and say computers can never do mathematics or engineering is unjustified. (The issue of creativity will be discussed in the next section.)

A final point before moving on. Speech recognition is very challenging for computers. Yet are we smart because we can understand speech? Speech was something created over time by humans for humans. Presumably grunts turned into a small vocabulary of words which grew into phrases and more complicated structures. The point though is that at no time could the process evolve to produce something humans couldn’t understand, because by definition, the aim was to communicate, and if communication was not working, another approach would be taken. If we could learn the languages of all extraterrestrials, perhaps we really are smart, but I’m skeptical.

How can Computers be Made Smarter?

The three main characteristics of a computer system (i.e., something built to do speech recognition or solve mathematical problems) are its size (raw processing power), architecture (e.g., what parts can communicate with what other parts) and the software running on it.

Since humans tend to set the benchmark at what humans can do, a minimum amount of raw processing power is required before a computer can even hope to do something “smart”. Yet the architecture is even more important. Engineers currently do not build computers anything like the way nature builds brains. It is very likely that current computational architectures are ineffective for the types of pattern recognition the human brain engages in.

More interestingly, the architecture of the human brain evolves over time. In simple terms, it learns by changing its architecture. There are two obvious components; via DNA, natural selection increases the chances that a human is born with a ‘good’ architecture to start with. Then there is the long process of refining this architecture through everyday experiences and targeted learning (e.g., attending school).

There is nothing stopping us from building reconfigurable computers that are massively interconnected, thereby very crudely approximating the architecture of a brain. (I suspect this may involve a shift away from ‘reliable’ logic gates to logic gates (or equivalent) that work some but not all of the time, for there is a trade-off between density and reliability.)

With remarkable advances in technology, the real challenge looking forwards is the software side. Because we don’t understand precisely the sort of pattern recognition the brain uses to solve problems, and because until recently the technology was not there to build massively interconnected reconfigurable computers, no one has seriously tried to make a computer “smart” in the way I have been referring to in this essay. (This is not overlooking Artificial Intelligence whose approach was partly hampered by the available technology of the day, and which I suspect never delved deeper and deeper into how the brain does pattern recognition — once artificial neural networks came on the scene, there were sufficiently many research questions for AI researchers to chase that they did not continuously return to the biology to understand better and better how the brain does pattern recognition. And in fairness, as they lacked the computational power for hypothesis testing, it would most likely have been a futile endeavour anyway.)

Summarising this section, several points can be made. Current computing systems (which nevertheless serve the majority of their intended purposes extremely well) seem to be a ‘mismatch’ for doing what humans are good at doing, therefore, they come across as “dumb”. There is no evidence yet to suggest there is a barrier to building (this century) a computing system that is “smart” in some respects, but it will require a whole new approach to computing. It is not clear whether imitating the architecture of the human brain is the best thing to do, or if there is an even more efficient approach waiting to be discovered. Nevertheless, if smartness is being measured against what humans can do, a good starting point is starting with similar resources to what a human has.

Bringing in points from the preceding section, one must not forget though that humans have been ‘trained’ over centuries (natural selection), that each individual then takes an additional 21 years of training, during which time they are communicating with and learning from other individuals, and even then, we tend to work on the problems we believe we can solve and ignore the rest. This suggests we have a narrow definition of “smartness” and perhaps the only real way for a computer to be “smart” in our eyes is if it were to ‘grow up’ with humans and ‘learn’ (through daily feedback over 21 years) what humans value.

Indeed, smartness is usually linked with creativity and being able to solve “completely new” problems. (I would argue though that the problems we can solve are, by definition, not as distant from other problems we have solved before as we would like to think. Who knows how much the subconscious remembers that our conscious mind does not.) Creativity, even more than smartness, is linked to how humans perceive themselves relative to others. A random number generator is not creative. An abstract artist is. Some computer generated pictures coming from fractals or other mathematical constructs can be aesthetically pleasing to look at but is creativity involved once one realises a formulaic pattern is being followed? When it comes to problem solving, creativity and smartness come back largely to usefulness, or more generally, to some notion of value. We solve problems by instinctively knowing which of the thousands of possible approaches are most likely not to work, thereby leaving a manageable few options to work our way through. When we “create” a new theorem or engineer a new system, we are credited with creativity if it is both potentially useful (“sensible”?) and different from what has been done before. A random number generator succeeds at being different; the challenge is teaching a computer a sense of value; I suspect this is achievable by having computers interact with humans as described above (it is all just pattern recognition refined by external feedback!), and perhaps humans can even learn to write down systematically not just an algebraic proof of a mathematical result but also the “intuition” behind the proof, thereby enabling computers to learn like humans by studying how others solve problems.

A Future Vision of Semi-automated Design

Although arguing that if we wanted computers to be smart then we are most likely going to be able to achieve that goal eventually, the more important question is whether we should be trying to head directly there. Personally, a more useful and more easily achievable goal is to work towards semi-automation and not full automation. At the moment there is essentially no way to guide a computer towards a goal; we run something, we wait, we get a result. Often we make a modification and try again. But we are not really interacting with the computer, just using it to shorten the time it takes for us to calculate something. By comparison, two people can work together on a design problem or a mathematical proof; there are periods when they work alone, but there are periods when they come together to discuss and get ideas from each other before going off to think on their own again. Semi-automation lies somewhere between these two extremes: the computer need not be “smart” like a human, but it should come across as being more than just a dumb calculator (even if, at the end of the day, all it is doing are calculations!).

Efficient exchange of information is vital. Computers can store more data than a brain can, but we can probe someone else’s brain for information much more efficiently than we can probe a computer’s memory banks. Largely this is because exchange of information is more efficient when there is shared knowledge that need not be explicitly transmitted.

Semi-automated Theorem Proving

There are so many mistakes, both big and small, in the published literature (even the top journals) that it seems not only highly desirable but inevitable we will ultimately have all mathematical works stored on and verified by computers. This is certainly not a new dream, so how far are we from achieving it?

Although proof verification systems exist, otherwise simple proofs become excessively long when translated into a suitable form for the computer to check. Exchange of information is far from efficient! At the other extreme, automated theorem provers exist, but are yet to prove any theorems of substance. I propose the following stages of research.

1) Choose an area (e.g. basic set theory, or basic set theory and finite-dimensional linear algebra). Work towards devising a notation so that humans can enter a theorem and proof using a similar amount of effort to typing up the theorem and proof for a formal set of lecture notes. To improve efficiency, the notation does not need to be unambiguous since the computer can always ask for clarification if it cannot figure out the correct meaning. Similarly, the gaps between each step of the proof may not always be sufficiently small for the computer to be able to fill in, nevertheless, the computer should have a go, and ask for help (e.g., ask for a sub-step to be inserted) if it fails.

2) Enhance stage one above by having the computer give a list of suggestions whenever it gets stuck; it is easier for the user to click on the correct suggestion than to type in the answer, and if none of the suggestions are correct, the user can always fall back to entering the answer. Here, some method of ranking the suggestions is required (for otherwise there might be too many suggestions). Initially, this can be based on the computer determining “what we already know” and “what we want to find out”, then searching through the axioms and existing theorems to find something that relates either to “what we already know” (a possible ‘starting move’) or “what we want to find out” (a possible ending move).

3) Stage two will eventually prove inefficient as the number of theorems grows and the proofs become more complicated. Some sort of ‘learning’ and ‘intuition’ on behalf of the computer is unavoidable if humans and computers are to interact significantly more efficiently. Perhaps ultimately pattern recognition, as discussed in previous sections, becomes unavoidable, but one should not be too hasty in ruling out the possibility that a fair amount of intuition is more formulaic than we think. Stage three then is supplementing theorems and proofs with basic intuition. The first example that comes to mind is working with bounds, e.g., the triangle inequality. A computer can be taught that if it wants to prove x and y are close, then it can do this if it can find a point z that is close to x and also close to y. Or in another form (i.e., $| \|x\| - \|y\| | \leq \|x-y\|$), if the computer wants to show x and y are some distance apart, it can endeavour to do so by comparing the norm of x with the norm of y.

Certainly some proofs require deep intuition and ingenuity, but even then, a large part of the proof is generally fairly routine. Once a human has specified the “stepping stones”, the routine parts can be filled in (semi-)automatically by the computer working its way through its collection of routine moves and supplemented by basic intuition to rank the order in which it should try the moves.

Stages 1 to 3 are within our grasp and would already have a tremendous impact if they work as expected.

4) Stages 1 to 3 were concerned with a user stating a theorem and sketching a proof, with the aim that the computer can fill in the missing steps of the proof and thereby verify the correctness of the theorem, falling back to asking for more information if it gets stuck. Stage 4 is for the computer to work with a user in proving or disproving a theorem. That is, the user states a conjecture, then through a series of interactions with the computer, the user either finds a proof or a counterexample in a shorter time than if the user were to proceed unassisted. Refinement of the “intuition engine” may or may not be necessary. Some way for the user to suggest efficiently (minimum of keystrokes) to the computer what to try next is required. Some way for the computer to make suggestions to the user as to what to (and what not to) try next is required. [I omit my ideas for how to do this because they will most likely be superseded by the insight gained by anyone working through stages 1 to 3.]

5) The fifth stage is for computers to propose theorems of their own. In the first instance, they can look for repeated patterns used in proofs stored in its database. The examples that come to mind immediately are category theory and reproducing kernel Hilbert spaces. In part, these areas came into existence because mathematicians observed the same types of manipulations being carried out in different situations, identified the common threads and packaged them up into their own theory or “toolbox”.

Semi-automated Chip Design

Engineers have long had the dream of being able to specify what they would like a software program or an electronic circuit to do, push a button, and have a computer generate the software or the circuit. Even more significant than saving time, if done properly, this would ensure the software (and possibly the circuit, although this is a bit harder to guarantee in more complicated situations when extra considerations such as EM interactions must be taken into account) is guaranteed to work correctly. There is a staggering amount of buggy software and hardware out there that this dream is becoming more and more desirable to achieve.

Basically, all the ideas discussed earlier are relevant here. Instead of a library of mathematical theorems, proofs and accompanying intuition being built up, a library of software or a library of electronic circuits is built up. Humans will always be part of the design loop, but more and more of the routine work can be carried out by a computer. Brute force algorithms and heuristics must be replaced by a way of encoding “intuition” and “lessons from past experience”, thereby allowing humans and computers to interact much more efficiently.

Categories: Research

Why does Compactness ensure a Bijective Map is a Homeomorphism?

The motivation for this exposition is threefold:

• It helps illustrate the mantra that intuition is as equally important as algebraic manipulation.
• It relates to a fundamental concept in mathematics, namely morphisms (from category theory).
• It helps explain why compact spaces and Hausdorff spaces are important.

The following facts are readily found in textbooks; the tenet of this exposition is that it is impossible, without further thought, to appreciate any of these facts.

A topological space is compact if every open cover has a finite subcover.

A topological space is Hausdorff if every two points can be separated by neighbourhoods.

A map (i.e., a continuous function) $f \colon X \rightarrow Y$ between topological spaces $X$ and $Y$ is a homeomorphism if it is bijective and its inverse $f^{-1}$ is continuous.

If $X$ is compact and $Y$ is Hausdorff then a bijective map $f \colon X \rightarrow Y$ is automatically a homeomorphism.

Questions:

• How can a bijective map not be a homeomorphism?
• How can compactness possibly have any relevance to whether a map is a homeomorphism?
• Why the Hausdorff assumption on $Y$?
• How does this relate to “morphisms”, as mentioned at the start of this exposition?

Before addressing these questions, a standard proof of the above result is given, the purpose being to emphasise that the proof, on its own, does not directly answer any of these questions. Memorising the proof is essentially worthless. (Important are being able to derive the proof from scratch and being able to understand why the result should indeed be true.)

Showing $f^{-1}$ is continuous means showing the preimage of every open set is open, or equivalently, showing the preimage of every closed set is closed. Since $f$ is bijective, the preimage under $f^{-1}$ of a set $A$ is simply $f(A)$. Hence it suffices to prove that $f$ is closed (the image of every closed set is closed). Let $A \subset X$ be closed. Since $X$ is compact, $A$ must be compact. The image of a compact set under a continuous function is itself compact, that is, $f(A)$ is compact. A compact subset of a Hausdorff space must be closed, that is, $f(A)$ is a closed subset of $Y$, proving $f$ is closed and therefore that $f^{-1}$ is continuous, as required.

While basing topology on open sets works very well from the perspective of being able to give concise proofs such as the one above, it is perhaps the worst way of introducing topology. It is noteworthy that when topology was being developed, it took decades for the importance of “open sets” to be recognised.

Let’s start from scratch and try to understand things intuitively. One approach (but certainly not the only approach) to understanding a topological space is understanding what converges to what. In special cases, such as when the topology comes from a metric, it suffices to consider sequences (the curious reader may wish to read about sequential spaces). In general, nets must be considered instead of sequences, nevertheless, for the purposes of answering the questions above, it suffices to restrict attention to sequences. (It is often enough to understand intuitively why a result is true in special cases, leaving the proof to justify the result in full generality.)

Why a Bijective Map need not be a Homeomorphism

By restricting attention to sequences (and therefore to nice topological spaces), the definition of continuity becomes: $f$ is continuous if, whenever $x_n \rightarrow x$, it holds that $f(x_n) \rightarrow f(x)$. Therefore, a bijective map $f$ is a homeomorphism if and only if $f(x_n) \rightarrow f(y)$ implies $x_n \rightarrow y$.

Whereas bijectivity just tests whether distinct points are mapped to distinct points, being a homeomorphism means nearby points must be mapped to nearby points, in both directions, from $X$ to $Y$ and from $Y$ to $X$. (That is, a homeomorphism preserves the topology.)

This suggests the following example. Let $X$ be the interval $[0,2\pi) \subset \mathbb{R}$ and let $Y$ be the unit circle $Y = \{ (x_1,x_2) \in \mathbb{R}^2 \mid x_1^2 + x_2^2 = 1\}$. Let $f\colon X \rightarrow Y$ be the function sending $\theta \in [0,2\pi)$ to $(\cos\theta,\sin\theta)$. Then $f$ is continuous because it sends nearby points in $X$ to nearby points in $Y$. But $f^{-1}$ is not continuous because the point $(1,0) \in Y$ is sent to the point $0 \in X$ but a point just below $(1,0) \in Y$, say $(\cos\theta,\sin\theta)$ for $\theta$ just less than $2\pi$, is sent to a point far away from $0 \in X$. (The reader should pause to understand this fully; drawing a picture may help.)

The above example is quintessential; an injective map $f$ is a homeomorphism only if it is not possible to map a “line $\longrightarrow$” to a “loop $\circlearrowleft$“. In detail, let $x_n$ be a sequence of points that either does not have a limit point, or that converges to a point $x$ that is different from $x_1$. Such a sequence can be obtained by considering the location at regularly spaced intervals of an ant walking along a path in $X$ that does not eventually return to where it started. It is therefore referred to loosely as a “line”. Let it loosely be said that this “line” gets mapped to a “loop” if $\lim f(x_n) = f(x_1)$. If it is possible for a line to be mapped to a loop, then $f$ cannot be a homeomorphism. (Draw some pictures!)

To summarise, injectivity of $f$ means that if $f(x) = f(y)$ then  $x = y$. For $f$ to be a homeomorphism requires more: if $\lim f(x_n) = f(y)$ then it must be that $\lim x_n = y$. Whereas injectivity just means the preimages of distinct points are distinct, a homeomorphism means the preimages of  convergent sequences with distinct limits are convergent sequences with distinct limits. Perhaps the best way of visualising this though is asking whether a “line” can be mapped to a “loop”.

Relevance of the Domain being Compact and the Codomain being Hausdorff

Let $x_n$ be a “line”, as defined above. There are two cases to consider: either $x_n$ does not have a limit point, in which case it will be seen that compactness of $X$ is relevant, or $x_n$ has a limit, in which case it will be seen that $Y$ being Hausdorff is significant.

When dealing with sequences, the role of compactness is guaranteeing every sequence has a limit point (or equivalently, that a convergent subsequence exists). Therefore, if $X$ is compact, it is not possible for $x_n$ to lack a limit point, and this option is immediately off the table.

If it is possible for a “line” to be mapped to a “loop” despite $X$ being compact then it is possible to achieve this assuming $x_n$ has a limit (by considering a convergent subsequence if necessary). Henceforth, assume $x_n \rightarrow x \neq x_1$ yet $f(x_n) \rightarrow f(x_1)$. Continuity of $f$ implies $f(x_n) \rightarrow f(x)$. When dealing with sequences, the role of the Hausdorff requirement is to ensure limits are unique. (Prove this for yourself, or look up a proof that limits are unique in Hausdorff spaces.) Therefore, $Y$ being Hausdorff means $f(x) = f(x_1)$ since $f(x_n)$ converges to both $f(x)$ and $f(x_1)$. Since $f$ is injective, this means $x = x_1$, contradicting $x_n$ being a “line”. Therefore, it is impossible for a “line” to be mapped to a “loop”; every injective map $f$ from a compact space to a Hausdorff space is automatically a homeomorphism.

The above has hopefully taken away a bit of the mystery of why compactness and the Hausdorff condition enter into the picture.

• An injective map $f$ (on a nice topological space where it suffices to reason with sequences) is a homeomorphism if and only if it is not possible for a “line $\longrightarrow$” to be mapped to a “loop $\circlearrowleft$“. (A line and a loop are topologically distinct whereas the purpose of a homeomorphism is to preserve topological structure.)
• If $y_n$ is a loop in $Y$, that is, $y_n \rightarrow y_1$, then consider its preimage $f^{-1}(y_n)$. Unless there is some reason for $f^{-1}(y_n)$ to converge, the fact that $f$ is injective is powerless to prevent $f^{-1}(y_n)$ from being a “line”, i.e., it may well be the case that $f$ is not a homeomorphism.
• If $X$ is compact then $f^{-1}(y_n)$ has a limit point. By considering a subsequence if necessary, assume without loss of generality that $f^{-1}(y_n) \rightarrow x$. Continuity of $f$ implies $y_n \rightarrow f(x)$.
• If $Y$ is Hausdorff then limits are unique, therefore, $f(x) = y_1$. Therefore, the preimage of a loop is a loop, not a line. The topology has been preserved, hence $f$ is a homeomorphism.
• The above is “intuitively” why the result holds, yet the formal proof based on open sets is shorter, more general but devoid of intuition (at least for beginners); this is the joy of mathematics.

Morphisms

Very briefly, it is remarked that structure-preserving functions are called morphisms (in category theory). Two sets are equivalent if there exists a bijective function between them. As more structure is given to these sets, requirements stronger than just bijectivity are necessary if the structure is to be preserved.

In linear algebra, two vector spaces are the same if there exists a bijective function $f$ between them that is linear and whose inverse is linear; the linearity of $f$ and $f^{-1}$ is what preserves the structure. Interestingly, if $f$ is linear and bijective then its inverse $f^{-1}$ is automatically linear.

In topology, a continuous function preserves structure in one direction, therefore, to preserve structure in both directions requires a bijective $f$ that is continuous and whose inverse $f^{-1}$ is also continuous; as introduced earlier, such a function is called a homeomorphism. Unlike the linear case, $f$ being bijective and continuous is not enough to imply $f^{-1}$ is continuous.

Countable Intersections and Unions of Sets: A Discussion on an Example in Williams “Probability with Martingales”

As homework for students, I recently set the task of understanding the Example/Warning in Section 1.11 of Williams’ book “Probability with Martingales“. I provide here my own ‘solution’ to this homework exercise. There is no need to refer to Williams’ book as the following is self-contained.

Let $V = \{v_1,v_2,\cdots\}$ be the set of rational numbers between 0 and 1 inclusive; because the rational numbers are countable, they can be enumerated as $v_1,v_2,\cdots$. Define the sets $I_{n,k} = (v_n - \frac1{k2^n},v_n+\frac1{k2^n}) \cap [0,1]$. Here, $(v_n-\frac1{k2^n},v_n+\frac1{k2^n})$ is the open interval of width $\frac1{k2^{n-1}}$ centred at the rational number $v_n$, and intersecting it with the closed interval $[0,1]$ simply eliminates any “end effects” by keeping all sets within $[0,1] \subset \mathbb{R}$. For a fixed $n$, note that $I_{n,k}$ monotonically decreases to $\{v_n\}$ as $k \rightarrow \infty$. In particular, $\cup_{n=1}^\infty \cap_{k=1}^\infty I_{n,k} = V$.

Of interest is what the set $H = \cap_{k=1}^\infty \cup_{n=1}^\infty I_{n,k}$ looks like. A straightforward calculation shows $H \supset V$, but do you think $H = V$, or $H = [0,1]$, or something else?  Is $\sqrt{2}-1 \in H$?

For any finite $N$ and $K$$\cap_{k=1}^K \cup_{n=1}^N I_{n,k} = \cup_{n=1}^N \cap_{k=1}^K I_{n,k}$.  Believing this to continue to hold in the limit as $N,K \rightarrow \infty$ is a flawed argument for $H=V$.  (Note: normally, union and intersection cannot be interchanged, but here, the sets $I_{n,k}$ are decreasing in $k$. In particular, $\cap_{k=1}^K I_{n,k} = I_{n,K}$.) To see the difference between the finite and infinite cases, note that $x$ being an element of $\cap_{k=1}^K \cup_{n=1}^N I_{n,k}$ means precisely that $\forall k$, $\exists n_k$ such that $x \in I_{n_k,k}$.  Here, a subscript $k$ adorns $n$ to emphasise that the choice of $n$ is allowed to depend on $k$. That $x$ is an element of $\cup_{n=1}^N \cap_{k=1}^K I_{n,k}$ means precisely that $\exists n$, $\forall k$, $x \in I_{n,k}$.  The key difference is that here, a single $n$ must work for all $k$.  In the finite case, the choice $n_k = n_K$ works for all $k$. If $K$ is infinite though, there is no such thing as $n_\infty$ that could serve in place of $n_K$.

A flawed argument for $H = [0,1]$ can be based on the denseness of rationals: every $x \in [0,1]$ is the limit of a sequence of rational numbers, and since $\cup_n I_{n,k}$ contains not only every rational number, but also a neighbourhood of every rational number, it “must” therefore contain all numbers in $[0,1]$. It is insightful to understand where the flaw is.  Let $x \in [0,1]$ be an irrational number. Let $i_1 < i_2 < \cdots$ be an increasing sequence of indices such that $v_{i_1},v_{i_2},\cdots \rightarrow x$. Whether $x \in \cap_k \cup_n I_{i_n,k}$ depends on which is faster: the convergence of $v_{i_n} \rightarrow x$, or the convergence of $\frac1{k2^{i_n}} \rightarrow 0$. To illustrate, assume that $|v_{i_n} - x | = \frac1{2^n}$. (For pedants, interpret this equality as approximate equality; the difference between a rational and irrational number cannot exactly equal a rational number.) Then even though $v_{i_n} \rightarrow x$, the convergence is sufficiently slow that $x \notin I_{i_n,k}$ for all $k$. (Note $2^n \geq 2^{i_n}$ because $i_n \geq n$.)

The pendulum swings back: perhaps then $H = V$ because no sequence can converge sufficiently quickly to an irrational number? To demonstrate that this is false, consider the following procedure for choosing $v_1,v_2,\cdots$ to ensure that $\sqrt{2}-1 \in H$. It suffices to construct a sequence $v_1,v_3,v_5,\cdots$ that converges quickly to $\sqrt{2}-1 \in H$ because the remaining rational numbers can be assigned (in arbitrary order) to $v_2,v_4,v_6,\cdots$.  Simply choose $v_n$, for $n$ odd, to be a rational number within $\frac1{n2^n}$ of $\sqrt{2}-1$.  (This can always be arranged by taking sufficiently many terms in the decimal expansion $0.414214\cdots$ of $\sqrt2-1$.) Then, no matter how large $k$ is chosen, $\sqrt2-1 \in I_{n,k}$ when $n$ is an odd number greater than $k$, hence $\sqrt2-1 \in H$.

The above has already demonstrated that limits cannot be interchanged; a sequence $v_1,v_2,\cdots$ of rational numbers was constructed such that $\sqrt2-1$ is in $H$ and hence $H \neq V$. This is not the argument used by Williams, but being constructive in nature might mean it is a more transparent argument.

The higher-level argument in Williams’ book reveals more information about $H$.  Williams uses the Baire category theorem to deduce that $H$ must be uncountable, hence $H \supsetneq V$ regardless of how $v_1,v_2,\cdots$ are chosen. Furthermore, the Lebesgue measure of $H$ is easily shown to be zero, therefore, $H$ does not contain any open intervals, and in particular $H \subsetneq [0,1]$. (Proof: the Lebesgue measure of $I_{n,k}$ is at most $\frac1{k2^{n-1}}$.  Since $\sum_{n=1}^\infty \frac1{2^{n-1}} = 2$, the Lebesgue measure of $\cup_n I_{n,k}$ is at most $\frac2k$.  As $k \rightarrow \infty$, this goes to zero, hence the measure of $H$ must be zero.)

For completeness, the fact that $H$ is uncountable will be derived. The trick is to consider the complement $H^c = [0,1] \setminus H$. If it can be shown that the complement $H^c$ is “too small” then $H$ must be uncountable.  De Morgan’s law implies that $H^c = \cup_{k=1}^\infty \left( \cup_{n=1}^\infty I_{n,k} \right)^c$.   (Exercise: prove from first principles that $\left( \cap_{k=1}^\infty G_k \right)^c = \cup_{k=1}^\infty G_k^c$.)  Let $F_n = \left( \cup_{n=1}^\infty I_{n,k} \right)^c$.  Since $\cup_{n=1}^\infty I_{n,k}$ contains all the rational numbers, $F_n$ can contain only irrational numbers.  In particular, $F_n$ does not contain any open interval. Each $I_{n,k}$ is open and hence $\cup_{n=1}^\infty I_{n,k}$ is open. Thus, $F_n$ is a closed set not containing any open intervals. Such sets are called nowhere dense. Since $H^c$ is the countable union of nowhere dense sets, it is a meagre set. If $H$ is countable then it too would be meagre, and hence $H \cup H^c$ would be meagre. Yet $H \cup H^c = [0,1]$, and $[0,1]$ is a Baire space, meaning meagre sets are “small”.  In particular, the whole space $[0,1]$ cannot be meagre, a contradiction.  Therefore, $H$ must be uncountable.

The conclusion then is that $H$ depends on the actual ordering $v_1,v_2,\cdots$ of the rationals, yet regardless of how the rationals are ordered, $H$ will be an uncountable set of Lebesgue measure zero. In particular, $V \subsetneq H \subsetneq [0,1]$.

How the Jacobian can Detect Cusps and Self-intersections

Let $M$ be a subset of the plane defined by the vanishing of a smooth (that is, derivatives of all orders exist) function $f\colon \mathbb{R}^2 \rightarrow \mathbb{R}$.  For simplicity, it is assumed throughout that $f$ is not identically zero.

For example, if $f(x,y) = x^2 + y^2 - 1$ then the set $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ is the unit circle. This can be readily visualised using the online Sage Notebook. Sign in to Sage, create a new worksheet, and type: var(‘x,y’); implicit_plot(x^2+y^2-1==0, (x, -2, 2), (y, -2, 2)); NOTE: WordPress automatically converts the first apostrophe in var(‘x,y’); from the correct ${}'$ to the incorrect `.

Next, try: var(‘x,y’); implicit_plot(x^3+y^3-x*y==0, (x, -2, 2), (y, -2, 2));

Note that this second example has a self-intersection; if one considers starting at one end of the line and walking to the other end, there will be a point that will be passed over twice, once when entering the loop and once upon exit.

Thirdly, try: var(‘x,y’); implicit_plot(x^3-y^2==0, (x, -2, 2), (y, -2, 2));

This third example has a cusp; if one endeavours to drive a car from the bottom end of the line to the top end, one will get stuck at the cusp with the car pointing to the left but the “road” continuing off to the right.

Roughly speaking, and as a good introduction to differential geometry, a (concrete) manifold is a subset of $\mathbb{R}^n$ that does not contain cusps (that is, it is smooth) and does not contain self-intersections (that is, it is locally Euclidean). Regardless, it is interesting to ask how difficult it is to determine if the set $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$  has any cusps or self-intersections.  The intuitive principle of driving a car along the curve $M$ to detect cusps or self-intersections may appear difficult to translate into a mathematical tool, especially for detecting self-intersections.

Note though that driving a car corresponds to a parametrisation of the curve: roughly speaking, constructing functions $x(t)$ and $y(t)$ such that $f(x(t),y(t)) = 0$ for all $t$. Finding a self-intersection is equivalent to finding a non-trivial ($t \neq t'$) solution to $x(t)=x(t')$ and $y(t)=y(t')$.  This is a difficult “global” problem. However, the curve has been specified implicitly by $f(x,y) =0$, and perhaps the problem is not global but only local with respect to this implicit formulation?

Detecting cusps and self-intersections really is only a local problem with respect to $f(x,y)$, meaning that knowledge of what $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ looks like anywhere except in a small neighbourhood of a point $(x,y)$ is irrelevant to determining whether there is a singularity or self-intersection at $(x,y)$.  This suggests considering a Taylor series of $f$ about the point $(x,y)$ of interest. [Technically, we would also want $f$ to be analytic if we are relying on a Taylor series argument, but we are focusing here on gaining a feel for the problem in the first instance.] It is also worth noting that this is also an example of “unhelpful intuition”; thinking of driving cars is a hindrance, not a help.

Choose a point $(x',y')$ lying on $M$, that is, $f(x',y') = 0$.  Then $f(x,y) = \alpha (x-x') + \beta (y-y') + \cdots$ where the dots denote higher-order terms.  If either $\alpha$ or $\beta$ is non-zero then $f$ contains a linear term, and the linear term dominates the behaviour of $f$ when $(x,y)$ is sufficiently close to $(x',y')$.  The linear equation $\alpha (x-x') + \beta (y-y') = 0$ describes a straight line; there can be no cusps or self-intersections.  This argument can be made rigorous using the implicit function theorem, even when $f$ is not analytic: if for every point $(x,y) \in M$, either $\frac{\partial f}{\partial x}$ or $\frac{\partial f}{\partial y}$ is non-zero at $(x,y)$, then $M = \{ (x,y) \in \mathbb{R}^2 \mid f(x,y) = 0\}$ contains no self-intersections or cusps.  This result, stated more generally in terms of the rank of the Jacobian matrix, appears near the beginning in most differential geometry textbooks.  Without the above considerations though, it may appear mysterious how the Jacobian can detect cusps and self-intersections. (The key point is not to think in terms of a parametrisation of the curve, but instead, break the space $\mathbb{R}^2$ into neighbourhoods $U_i$  which are sufficiently small that it is possible to determine exactly what $M \cap U_i$ looks like. To belabour the point, if a point in $U_i$ is not in $M \cap U_i$ then it is not in $M$.)

Textbooks generally do not mention explicitly though that if $M$ fails this “Jacobian test” then it may still be a manifold. The above makes this clear; if $\alpha=\beta=0$ then it is necessary to examine higher-order terms before making any claims. As a simple example, $f(x,y) = x^3$ fails the test at $(0,0)$ because both $\frac{\partial f}{\partial x}$ and $\frac{\partial f}{\partial y}$ are zero at the origin, yet $M = \{ (x,y) \in \mathbb{R}^2 \mid x^3 = 0\}$ is the straight line given equivalently by $x=0$.

Long versus Short Proofs

Proofs are very similar to computer programs. And just like computer programs, there are often many different ways of writing a proof.  Sometimes the difference between proofs is great; one might be based on geometry while another on analysis. The purpose of this short note is to emphasise that different proofs are possible at the “simple” level too.  While this may appear mundane, it is actually an important part of learning mathematics: once you have proved a result, spend a few minutes trying to simplify the proof. It will make the next proof you do easier. The two key messages are as follows.

• There is no substitute for experience when it comes to deriving proofs.
• Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

For lack of a better on-the-spot example, consider the following problem. Let $f\colon\mathbb{C} \rightarrow \mathbb{C}$ be an analytic function having a double zero at the origin and whose first derivative has at most linear growth: $| f'(x) | \leq \alpha |x|$ for some $\alpha \in \mathbb{R}$.  What is the most general form of $f$?

First Approach

A double zero at the origin means $f(x) = x^2 h(x)$ for some analytic function $h$. Therefore $f'(x) = 2xh(x) + x^2h'(x)$ and $\frac{f'(x)}{x} = 2h(x) + xh'(x)$.  Since $latex 2h(x) + xh'(x)$ is both analytic and bounded, it must be constant, by Liouville’s theorem.  (Here, all analytic functions are entire because they are defined on the whole of the complex plane.) Solving the differential equation $latex 2h(x) + xh'(x) = c$ by substituting in the power series expansion $h(x) = \sum a_i x^i$ shows that $h(x) = \frac{c}2$. [The general solution of $latex 2h(x) + xh'(x) = 0$ is $h(x) = \beta x^{-2}$ where $\beta$ is an arbitrary scalar. The only general solution that is analytic is $h(x)=0$.] The conclusion is that the most general form of $f$ is $f(x) = ax^2$ for some complex scalar $a$.

Second Approach

The first approach is unattractive and unnecessarily complicated. Instead of starting with the double zero, start instead with the first derivative having a linear bound. (A standard generalisation of Liouville’s theorem is that an entire function with a linear bound must itself be linear. We will pretend here we do not know this fact.) If $f(x)$ has a double zero at the origin then $f'(x)$ has a zero at the origin, therefore $f'(x) = x g(x)$ for some analytic $g(x)$. The linear bound together with Liouville’s theorem means $g(x)$ is constant, that is $f'(x) = 2 a x$ for some scalar $a$. Therefore $f(x)$ must equal $ax^2$ if it is to satisfy both $f'(x) = 2 a x$ and $f(0)=0$.

What was the Difference?

The first approach expressed $f$ as $f(x) = x^2 h(x)$ while the second approach expressed $f'$ as $f'(x) = x g(x)$. Both approaches resulted in a differential equation, but the second approach resulted in the simpler differential equation $f'(x) = 2ax$. Underlying this example is that a “change of coordinates” can simplify a differential equation. Although both approaches could be made to work in this simple example, there are situations where some approaches are too difficult to follow through to completion.

One could argue that because the linear bound constraint is “harder” than the double-zero constraint, one should start with the linear bound constraint and not the double-zero constraint, and therefore be led to the simpler differential equation. Yet the real messages are as stated at the start:

• There is no substitute for experience when it comes to deriving proofs.
• Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

Measure-theoretic Formulation of the Likelihood Function

Let $P_\theta$ be a family of probability measures indexed by $\theta \in \Theta$. For notational convenience, assume $0 \in \Theta$, so that $P_0$ is one of the probability measures in the family. This short note sketches why $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ is the likelihood function, where the $\sigma$-algebra $\mathcal X$ describes the possible observations and $E_0$ denotes expectation with respect to the measure $P_0$.
First, consider the special case where the probability measure can be described by a probability density function (pdf) $p(x,y;\theta)$. Here, $x$ is a real-valued random variable that we have observed, $y$ is a real-valued unobserved random variable, and $\theta$ indexes the family of joint pdfs. The likelihood function when there is a “hidden variable” $y$ is usually defined as $\theta \mapsto p(x;\theta)$ where $p(x;\theta)$ is the marginalised pdf obtained by integrating out the unknown variable $y$, that is, $p(x;\theta) = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$. Does this likelihood function equal $L(\theta)$ when $\mathcal X$ is the $\sigma$-algebra generated by the random variable $x$?
The correspondence between the measure and the pdf is: $P_\theta(A) = \int_A p(x,y;\theta)\,dx\,dy$ for any (measurable) set $A \subset \mathbb{R}^2$; this is the probability that $(x,y)$ lies in $A$. In this case, the Radon-Nikodym derivative $\frac{dP_\theta}{dP_0}$ is simply the ratio $\frac{p(x,y;\theta)}{p(x,y;0)}$. The conditional expectation with respect to $X$ under the distribution $p(x,y;0)$ is $E_0\left[ \frac{p(x,y;\theta)}{p(x,y;0)} \mid x \right] = \int_{-\infty}^{\infty} \frac{p(x,y;\theta)}{p(x,y;0)} p(x,y;0)\, dy = \int_{-\infty}^{\infty} p(x,y;\theta)\,dy$, verifying in this special case that $L(\theta)$ is indeed the likelihood function.
The above verification does not make $L(\theta) = E_0\left[ \frac{dP_\theta}{dP_0} \mid \mathcal X \right]$ any less mysterious. Instead, it can be understood directly as follows. From the definition of conditional expectation, it is straightforward to verify that $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ meaning that for any $\mathcal X$-measurable set $A$, $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$. The likelihood function is basically asking for the “probability” that we observed what we did, or precisely, we want to take the set $A$ to be our actual observation and see how $P_\theta(A)$ varies with $\theta$. This would work if $P_\theta(A) > 0$ but otherwise it is necessary to look at how $P_\theta(A)$ varies when $A$ is an arbitrarily small but non-negligible set centred on the true observation. (If you like, it is impossible to make a perfect observation correct to infinitely many significant figures; instead, an observation of $x$ usually means we know, for example, that $1.0 \leq x \leq 1.1$, hence $A$ can be chosen to be the event that $1.0 \leq x \leq 1.1$ instead of the negligible event $x = 1.05$.) It follows from the integral representation $P_\theta(A) = \int_A \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}\,dP_0$ that $\left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ describes the behaviour of $P_\theta(A)$ as $A$ shrinks down from a range of outcomes to a single outcome. Importantly, the subscript $\mathcal X$ means $L(\theta) = \left. \frac{dP_\theta}{dP_0}\right|_{\mathcal X}$ is $\mathcal X$-measurable, therefore, $L(\theta)$ depends only on what is observed and not on any other hidden variables.