## Long versus Short Proofs

Proofs are very similar to computer programs. And just like computer programs, there are often many different ways of writing a proof. Sometimes the difference between proofs is great; one might be based on geometry while another on analysis. The purpose of this short note is to emphasise that different proofs are possible at the “simple” level too. While this may appear mundane, it is actually an important part of learning mathematics: once you have proved a result, spend a few minutes trying to simplify the proof. It will make the next proof you do easier. The two key messages are as follows.

- There is no substitute for experience when it comes to deriving proofs.
- Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

For lack of a better on-the-spot example, consider the following problem. Let be an analytic function having a double zero at the origin and whose first derivative has at most linear growth: for some . What is the most general form of ?

## First Approach

A double zero at the origin means for some analytic function . Therefore and . Since $latex 2h(x) + xh'(x)$ is both analytic and bounded, it must be constant, by Liouville’s theorem. (Here, all analytic functions are entire because they are defined on the whole of the complex plane.) Solving the differential equation $latex 2h(x) + xh'(x) = c$ by substituting in the power series expansion shows that . [The general solution of $latex 2h(x) + xh'(x) = 0$ is where is an arbitrary scalar. The only general solution that is analytic is .] The conclusion is that the most general form of is for some complex scalar .

## Second Approach

The first approach is unattractive and unnecessarily complicated. Instead of starting with the double zero, start instead with the first derivative having a linear bound. (A standard generalisation of Liouville’s theorem is that an entire function with a linear bound must itself be linear. We will pretend here we do not know this fact.) If has a double zero at the origin then has a zero at the origin, therefore for some analytic . The linear bound together with Liouville’s theorem means is constant, that is for some scalar . Therefore must equal if it is to satisfy both and .

## What was the Difference?

The first approach expressed as while the second approach expressed as . Both approaches resulted in a differential equation, but the second approach resulted in the simpler differential equation . Underlying this example is that a “change of coordinates” can simplify a differential equation. Although both approaches could be made to work in this simple example, there are situations where some approaches are too difficult to follow through to completion.

One could argue that because the linear bound constraint is “harder” than the double-zero constraint, one should start with the linear bound constraint and not the double-zero constraint, and therefore be led to the simpler differential equation. Yet the real messages are as stated at the start:

- There is no substitute for experience when it comes to deriving proofs.
- Practicing on simple examples to find the “best” proof is training for being able even just to find a proof of a harder result.

## On Learning and Teaching Mathematics: Nothing is Elementary, and the Importance of Intuition

Learning is ineffective if attempted in a linear fashion. Fine details are best learnt, appreciated and remembered by those that can spontaneously describe and answer questions about the coarser details of the subject at hand. Therefore, it can be valuable for more advanced books to revisit “elementary” concepts because rarely is anything sufficiently elementary that nothing more remains to be known. The following quote is apt; the emphasis is my own. (All quotes below are taken from the preface of C. Lanczos (1966) *Discourse on Fourier Series*.)

By the nature of things it was necessary to develop the subject from its early beginnings and this explains the fact that even so-called “elementary” concepts, such as the idea of a function, the meaning of limit, uniform convergence and similar “well-known” subjects of analysis were included in the discussion. Far from being bored, the students found this procedure highly appropriate, because very often exactly the apparently “elementary” ideas of mathematics —

which are in fact “elementary” only because they are relegated to the undergraduate level of instruction, although their true significance cannot be properly grasped on that level— cause great difficulties in proceeding to the more advanced subjects.

There are three interwoven aspects of mathematical knowledge:

- Intuition — the “pictures” one forms (consciously or subconsciously) in one’s head when reasoning about a problem or endeavouring to generalise a concept.
- Rigour — the formalisation and verification of definitions and proofs.
- Communication — the transfer of mathematical knowledge from one person to another.

Pictures, formulae and discussions are generally how mathematical knowledge is communicated from one person to another. It is important though to isolate this from the actual understanding of mathematics itself. The formula is not in itself what is “understood” by someone reading it. Rather, seeing the formula conjures up a wealth of images in one’s subconscious mind which can be then refined further and reasoned with; seeing the formula primes relevant areas of the cortex facilitating subsequent thought. One “understands” because *one can reason with it and answer questions about it*, for instance, one can graph it, differentiate it, find its zeros, write down its Taylor series expansion, draw a relevant right-angled triangle and so forth. [Understanding is therefore relative to the questions one has asked oneself or otherwise encountered to date.] Memorising a result does not immediately lead to understanding. Understanding occurs only after one’s mind has formed associations that link the result with other stored knowledge. The degree of understanding is related to the scope and complexity of such associations.

To distinguish rigour from intuition, consider reading the proof of a theorem. It is possible to check a proof is correct without having any sense of actually “understanding” the proof, or even of “understanding” why the theorem should be true. In fact, it is possible to come up with a proof without “understanding” it! That is to say, by trial-and-error and (subconscious) pattern recognition (e.g., making substitutions and transformations one has seen before without quite being sure one is heading in the right direction), one can write down an algebraic proof of a theorem in convex analysis, say, without being able to offer any geometric picture or other explanation to justify how the proof was found. It is often worth the extra effort to develop a sense of intuition about theorems and their proofs. Intuition and rigour together provide a sense of understanding and increase one’s fluency in mathematics.

In some cases, intuition and rigour go hand in hand; one can translate directly one’s intuition into a proof. That this is not always the case is perhaps the only reason why teaching and learning mathematics is non-trivial: It is easy to convey rigour (at least, no harder than programming a computer), and all too easy for an author to convey only rigour and leave it to the readers’ mathematical maturity and ingenuity to deduce the intuition for themselves. Conveying intuition is not necessarily more difficult, but for an author, there are apparently drawbacks.

The most obvious drawback is verbosity. Stating and proving a theorem in its full level of generality (à la Bourbaki) takes considerably less space than does proving a basic theorem using one technique, then motivating the generalisation of the theorem, then pointing out why the proof of the basic theorem does not generalise, then motivating a new proof technique and finally proving the general theorem. Another drawback is imprecision and inaccuracy; intuition need not be precise nor even accurate for it to be valuable, yet some authors may be uncomfortable committing to paper anything even remotely inaccurate.

Tracing the historical development of a subject can provide a wealth of intuition. Here is what Lanczos has to say on the matter.

… a close tie with the historical development seemed appropriate, although the author is well aware that this exposes him to the charge of datedness. We have to be “modern” and there are those who believe that before the advent of our own blessed era the pursuers of mathematics lived in a kind of no-man’s-land, bumping against each other in the gloomy haze that pervaded everything (“Euclid must go!”). But there are others (and the author belongs to the latter group), who believe that the great masters of the eighteenth and nineteenth centuries, Lagrange, Euler, Gauss, Cauchy, Riemann, Fourier, Dirichlet, and many others, were not necessarily lacking mathematical intelligence and some of them might even be comparable to the geniuses of today.

One wonders if occasionally intuition is purposely withheld, the false reasoning being that the merit of an idea is judged by how complicated it is to understand. Merit should be judged by originality, usefulness and simplicity rather than complexity. A thing is understood when it appears to be simple.

To display formal fireworks, which are so much in the centre of many mathematical treatises — perhaps as a status-symbol by which one gains admission to the august guild of mathematicians — was not the primary aim of the book.

In conclusion, when teaching or learning mathematics, keep in mind that intuition and rigour are distinct aspects whose synergy forms mathematical knowledge. Intuition and rigour should be learnt together because rigour without intuition is like owning a car without the key; you can admire the car for its beauty but you cannot get very far with it.

## The Role of Estimates, Estimation Theory and Statistical Inference – Is it what we think it is?

The tenet of this article is that estimation theory is a means to an end and therefore cannot be sensibly considered in isolation. Realising this has pragmatic consequences:

- Pedagogical. When faced with solving a statistical problem, it becomes clearer how to proceed.
- Philosophical. A number of controversies and debates in the literature can be resolved (or become null and void).
- Interpretive. A clearer understanding is gained of how to interpret and use estimates made by others.

Forming estimates is ingrained in us; I estimate the tree is 5 metres high, there are 53 jelly beans in the jar and it will be 25 degrees tomorrow. This can draw us strongly to the belief that forming an estimate is something intrinsic, something that can be done in isolation. It suggests there should be a right way and a wrong way of estimating a quantity; perhaps even an optimal way. Succumbing to this belief though is counterproductive.

Once you have an estimate, what will you use it for? Putting aside the amusement or curiousity value some may attach to forming estimates, for all intents and purposes, an estimate is merely an intermediate step used to provide (extra) information in a decision making process. I estimated the height of the tree in order to know how much rope to buy, I estimated the number of jelly beans in the jar to try to win the prize by being the closest guess, and I estimated the temperature tomorrow to decide what clothes to pack. In all cases, the estimate was nothing more than a stepping stone used to guide a subsequent action.

In general, it is meaningless to speak of a good or a bad estimator because, without knowing what the estimate will be used for, there is no consistent way of ascribing the attribute “good” or “bad” to the estimator. The exception is if the estimator is a sufficient statistic, and indeed, it might be more intuitive if “estimators” were sometimes thought of as “approximate sufficient statistics”. All this will be explained presently.

The James-Stein Estimator exemplifies the assertion that it is generally not possible to declare one estimator better than another. Which is better is application dependent. Less striking examples come from situations where the penalties (in terms of making a bad decision) resulting from different types of estimation errors (such as under-estimation or over-estimation) can vary considerably from application to application.

Usually, estimates serve to compress information. Their job is to extract from a large set of data the pertinent pieces of information required to make a good decision. For example, the receiving circuitry of a radar gathers a very large amount of information about what objects are around it, but in a form which is too difficult for humans to process manually. The familiar graphical display produced by a radar results from processing the received signal and extracting out the features we are interested in. Even in estimating the height of a tree, this is true. The full information is the complete sequence of images our eyes see as we look up at the tree; we compress this information into a single number (we hope is) related to the height of the tree.

Initially then, there is no role for estimation theory. We have data (also commonly referred to as observations) and we wish to make an informed decision. A standard and widely applicable framework for making decisions is to determine first how to measure the goodness of a decision and then endeavour to construct a decision rule (which takes as input the available data and outputs the recommended decision to make) which can be shown, in a probabilistic framework, to make good decisions the majority of the time. A key point is that theoretically, we should use all the data available to us if we wish to make the best decision possible. (Old habits die hard. It is tempting to reason thus: If I knew what the temperature will be tomorrow then I know what clothes to pack, therefore, I will base my decision on my “best guess” of tomorrow’s temperature. This is not only sub-optimal, it is also ill-posed because the only way to define what a “best guess” is, is by starting with the decision problem and working backwards.)

There are two pertinent questions, one a special case of the other, caused by the undesirability of returning to the full set of data each time we wish to make another decision. (Imagine having to download the global weather observations and process them using a super-computer to decide what clothes to wear tomorrow, only to repeat this with a different decision-making algorithm to decide whether or not to water the garden.)

- Is there a satisfactory (but perhaps sub-optimal) method for processing the data into a compact and more convenient form allowing for many different decisions to be made more easily by virtue of being based only on this compact summary of the original data?
- Are there any conditions under which the data can be processed into a more compact form without inducing a loss of optimality in any subsequent decision rule?

In fact, the mathematics used in estimation theory is precisely the mathematics required to answer the above two questions. The mathematics is the same, the results are the same, but the *interpretation* is different. The true role of estimation theory is to provide answers to these questions. There are many situations where it seems that this has been forgotten or is not known though.

The answer to the second question can be found in statistical textbooks under the heading of sufficient statistics. The rest of statistics, by and large, represents our endeavours to answer the first question. Indeed, we routinely go from data to an “estimator” to making a decision. When the Bureau of Meteorology forecasts tomorrow’s weather, they are doing precisely what is described in the first question.

In virtue of the above discussion, I advocate thinking of “estimators” as “approximate sufficient statistics”. They serve to answer the first question above when a sufficiently convenient sufficient statistic (the answer to the second question) cannot be found or does not exist.

By shifting from thinking in terms of “estimators” to “approximate sufficient statistics”, I hope to show in subsequent articles that this leads to clarity of thought.