The Cameron-Martin Space, and Some Differences Between Finite and Infinite Dimensional Spaces

October 4, 2015

The usefulness of infinite-dimensional vector spaces is that concepts in finite dimensions often extend to infinite dimensions; this allows us to reason about infinite-dimensional problems using finite-dimensional intuition. (This is explained in detail in my primer on RKHS theory, for example.) Nevertheless, infinite-dimensional spaces exhibit behaviours not found in finite dimensions, and working effectively with infinite-dimensional spaces requires understanding how they differ from finite-dimensional spaces.

This article discusses several key differences between a Gaussian measure on Euclidean space \mathbb{R}^n and on the separable Hilbert space l^2, which can be thought of as an infinite-dimensional generalisation of Euclidean space. (Recall l^2 consists of all square-summable sequences.)

A Gaussian Measure on l^2

Let e_i denote the canonical basis vectors of l^2, that is, e_i is the sequence of all zeros except for the ith term which is unity.

Essentially, we wish to construct a Gaussian random variable X on l^2 element-wise by first generating a collection of real-valued independent random variables X_i \sim N(0,\lambda_i) and then setting X = (X_1,X_2,\cdots), or in other words, X = \sum X_i e_i. We assume \lambda_i > 0.

It is essential for \lambda_i to decay to zero sufficiently fast, for otherwise, the sequence X will have a non-zero probability of not lying in l^2.  We require \sum X_i^2 < \infty to hold with probability one.  Since the standard deviation \sqrt{\lambda_i} gives the “typical” size of a random realisation of N(0,\lambda_i), it is not surprising that the requirement be \sum \lambda_i < \infty.  (Throughout, curious readers are invited to consult an introductory book on infinite-dimensional analysis for more details. For example, in general, one starts with a linear operator Q that will serve as the covariance matrix, and requires Q to be of trace class.  Here, I am simply taking Q to be the “diagonal matrix” \mathrm{diag}(\lambda_1,\lambda_2,\cdots).)

It turns out that the above procedure can be made rigorous; provided \sum \lambda_i < \infty, there is indeed a Gaussian random variable X on l^2 such that the X_i = \langle X, e_i \rangle are independent Gaussian random variables N(0,\lambda_i).
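As a quick sanity check of the summability requirement, here is a small numerical sketch (my own, truncating the infinite sequence at a large index). With \lambda_i = 1/i^3 the partial sums of \sum X_i^2 stabilise, whereas with the non-summable choice \lambda_i = 1/i they keep growing (logarithmically).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # truncation level; the true object is an infinite sequence
i = np.arange(1, N + 1)

for label, lam in [("lambda_i = 1/i^3 (summable)    ", 1.0 / i ** 3),
                   ("lambda_i = 1/i   (not summable)", 1.0 / i)]:
    X = rng.normal(0.0, np.sqrt(lam))       # independent X_i ~ N(0, lambda_i)
    partial = np.cumsum(X ** 2)             # partial sums of sum_i X_i^2
    print(label, " sum at N/10:", round(partial[N // 10 - 1], 2),
          " sum at N:", round(partial[-1], 2))
```

In the first case the two printed sums agree up to random fluctuation; in the second they differ by roughly \ln 10, reflecting the logarithmic divergence of \sum 1/i.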

For any (Borel-measurable) subset A \subset l^2, we define the Gaussian measure \mu(A) as the probability that X lies in A.  (To a mathematician, this is putting the cart before the horse, but nevertheless, it is a convenient way to think when it comes to having to evaluate \mu(A) in certain situations.)

Since we insisted all the \lambda_i be positive, the measure \mu is non-zero on any open subset of l^2. Note too that \mu(l^2) = 1.

Some Possibly Surprising Facts

Given that the following do not hold in finite dimensions, they are initially counter-intuitive. The remainder of the article aims to give sufficient explanations to “improve” our intuition, thereby making the following facts unsurprising.

  • Given a Gaussian measure \mu on l^2, there exists a subset A \subset l^2 and an element v \in l^2 such that
    1. A and A+v are disjoint;
    2. \mu(A) = 1; and
    3. \mu(A+v) = 0.

Why should these facts be surprising? By analogy with the finite-dimensional case, one would expect that unless A is pretty much all of l^2 then it would not be possible for \mu(A) = 1. Indeed, since \mu is non-zero on every open subset of l^2 and \mu(A) = 1, the set A must intersect every open set and hence be dense in l^2. So how can the translated version A+v be disjoint from A? And how can the “large” set A, having measure 1, go to the “small” set A+v having measure 0, simply by a translation?

Explanation

For concreteness, choose \lambda_i = \frac1{i^3}.

The subset A will be taken to be the limit of a monotonically increasing sequence of subsets A^{(n)}. The subsets A^{(n)} will be taken to be “rectangles” of the form A^{(n)} = \{ (x_1,x_2,\cdots) \in l^2 \mid -a_i^{(n)} < x_i < a_i^{(n)}\} where the a_i will be strictly positive.

Since the variance of the X_i goes to zero, we hope to be able to choose the a_i^{(n)} so that they decay to zero in i, for fixed n, while ensuring A^{(n)} has measure close to unity. This gives us a chance of constructing an A which is not the whole of l^2 yet has measure equal to unity. The rate of decay i^{-3/2} is too fast because the probability that X_i \sim N(0,i^{-3}) lies between -i^{-3/2} and i^{-3/2} does not depend on i; if this probability is c then c^\infty = 0 would be the measure of the rectangle. This motivates trying a slower rate of decay: a_i^{(n)} = \sqrt{2}\frac{n}{i}. (The numerator n is there to grow the rectangles so the probability of X lying in A^{(n)} goes to unity as n \rightarrow \infty, and the \sqrt{2} is for later convenience.)

The probability that X_i \sim N(0,i^{-3}) lies between \pm \sqrt{2}\frac{n}{i} is conveniently expressed in terms of the error function as \mathrm{erf}(n\sqrt{i}). Hopefully, \mu(A^{(n)}) = \prod_{i=1}^\infty \mathrm{erf}(n\sqrt{i}) > 0 for all n > 0, and \prod_{i=1}^\infty \mathrm{erf}(n\sqrt{i}) \rightarrow 1 as n \rightarrow \infty, so that \mu(A)=1. This is indeed the case.

[Here is the tedious calculation to prove the claims. Let c_i = 1-\mathrm{erf}(n \sqrt{i}). A well-known bound on the complementary error function is 1-\mathrm{erf}(u) = \mathrm{erfc}(u) < \frac1{\sqrt{\pi}u} e^{-u^2}. Therefore, c_i < d_i where d_i = \frac1{n\sqrt{\pi i}} e^{-in^2}. Note 0 < c_i < 1 and 0 < d_i < 1 when n \geq 1 and i \geq 1. Provided \sum \ln(1-c_i) is finite, \mu(A^{(n)}) = \prod (1-c_i) will be strictly positive, as required. Now, 0 > \ln(1-c_i) > \ln(1-d_i) \geq \frac{-d_i}{1-d_i}, hence it suffices to prove \sum\frac{d_i}{1-d_i} is finite. The ratio test for absolute convergence involves the ratio \frac{d_{i+1}}{1-d_{i+1}} \frac{1-d_i}{d_i} = \frac{d_{i+1}}{d_i} (1-d_i) \frac1{1-d_{i+1}}. Since d_i \rightarrow 0, it suffices to show \lim_{i \rightarrow \infty} \frac{d_{i+1}}{d_i} < 1 in order to conclude that \sum\frac{d_i}{1-d_i} is finite. Now, \frac{d_{i+1}}{d_i} = \sqrt{\frac{i}{i+1}} \frac{e^{-(i+1)n^2}}{e^{-in^2}} \rightarrow e^{-n^2} < 1 whenever n > 0, as required. To show \mu(A) = 1, we need to show \lim_{n \rightarrow \infty} \sum_i \frac{d_i}{1-d_i} = 0.  An earlier calculation shows d_{i+1} < e^{-n^2} d_i, so that \sum_i \frac{d_i}{1-d_i} \leq \frac{d_1}{1-d_1}(1+\alpha+\alpha^2+\cdots) = \frac{d_1}{1-d_1} \frac1{1-\alpha} where \alpha = e^{-n^2}.  It is clear that this can be made arbitrarily close to zero by choosing n large.]
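The same claims can be checked numerically. The following sketch (my own, truncating the infinite product at an index beyond which the factors are numerically indistinguishable from 1) suggests that \prod_i \mathrm{erf}(n\sqrt{i}) is strictly positive for each n and increases towards 1 as n grows.

```python
from math import erf, sqrt

def mu_An(n, terms=10_000):
    """Truncated product prod_{i=1}^{terms} erf(n*sqrt(i)), approximating mu(A^(n))."""
    p = 1.0
    for i in range(1, terms + 1):
        p *= erf(n * sqrt(i))
    return p

for n in [1, 2, 3, 5]:
    print(n, mu_An(n))   # strictly positive, and increasing towards 1 in n
```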

  • Had we been in finite dimensions, the sets A^{(n)} would have been open. Here, they are not open.

Since the a_i^{(n)} \rightarrow 0 as i \rightarrow \infty, there is no open ball centred at the origin, with positive radius \rho, that is contained in A^{(n)}.  Indeed, for a given radius \rho > 0 and n, an i can be found such that a_i^{(n)} < \rho, showing A^{(n)} is not fully contained within the ball.

  • Let v_i = i^{-3/4}. The point v = (v_1,v_2,\cdots) \in l^2 is not in A.

This may seem surprising: as n gets bigger, all the sides of the rectangle A^{(n)} get bigger, hence we may expect that it will grow to be the whole of l^2. However, as explained presently, in infinite dimensions, the order in which limits are taken matters, and the above argument is invalid.  It will be seen that while the sides of the rectangle do grow, they grow too slowly.

On the one hand, it is true that if we fix an i, then we can find an n sufficiently large so that the ith side of the rectangle A^{(n)} contains v_i, the ith term of v.  Mathematically, this is true because \lim_{n \rightarrow \infty} a_i^{(n)} = \infty.  However, this is only really telling us that the “truncated” approximations (v_1,0,\cdots), (v_1,v_2,0,\cdots), (v_1,v_2,v_3,0,\cdots),\cdots are all in A, but analytically, we know that if A is not closed then the limit of these approximations need not lie in A. Figuratively speaking, even though an ordinary towel can be stuffed in a suitcase by sequentially pushing in the bits that are hanging out, this reasoning is not valid if the towel were infinitely big; after each push, there may still be an infinite portion of the towel hanging out. Instead, we must think directly of the A^{(n)} as suitcases of increasing size, and ask if the towel fits entirely inside one of these suitcases.

The reason why v does not lie in any of the A^{(n)}, and hence v is not in A, is that the terms v_i of v decrease more slowly to zero than the sides a_i^{(n)} of the rectangles.  Indeed, v_i > a_i^{(n)} precisely when i^{-3/4} > \sqrt{2}\frac{n}{i}, that is, when i > 4n^4; so for each fixed n there are (infinitely many) indices i with v_i > a_i^{(n)}, showing v \not\in A^{(n)}, and since this holds for every n, v \not\in A.

  • But how can v not be in A if every dimension of the suitcase A^{(n)} is expanding with n?

The sides of the suitcase are aligned with the canonical basis vectors e_i.  However, the e_i form a Schauder basis, not a Hamel basis, so it is not true that “every dimension of the suitcase is expanding”.  For example, the point v does not lie in the span of the e_i.  (The span of the e_i consists of all sequences with only a finite number of non-zero terms.) The argument given earlier readily extends to show that, for all \lambda \neq 0, the point \lambda v does not lie in any of the suitcases A^{(n)}. So while the suitcases are getting bigger in the directions e_i, they always have zero width in the direction v.  (It is true that the distance from A^{(n)} to v goes to zero as n \rightarrow \infty, but that just means v is in the closure of A.)
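The parenthetical claim about the distance can also be checked numerically. In the sketch below (a truncated sum; the function name is mine), the nearest point of the closed rectangle A^{(n)} to v has ith coordinate \min(v_i, a_i^{(n)}), so the distance is \left(\sum_i \max(v_i - a_i^{(n)}, 0)^2\right)^{1/2}.

```python
import numpy as np

def dist_v_to_An(n, terms=2_000_000):
    """Distance in l^2 from v = (i^{-3/4}) to the closed rectangle A^{(n)} (truncated sum)."""
    i = np.arange(1, terms + 1, dtype=float)
    v, a = i ** -0.75, np.sqrt(2.0) * n / i
    return np.sqrt(np.sum(np.maximum(v - a, 0.0) ** 2))

for n in [1, 2, 3, 5]:
    print(n, dist_v_to_An(n))   # positive for every n, but decreasing towards 0
```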

  • \mu(A+v) = 0.

This is another phenomenon unique to the infinite-dimensional case. Recall that X_i \sim N(0,i^{-3}). The ith side of A^{(n)} + v is the interval from v_i - a_i^{(n)} to v_i + a_i^{(n)}. We know \frac{v_i}{a_i^{(n)}} \rightarrow \infty as i \rightarrow \infty (the sides of the suitcase decrease much faster than the v_i). And v_i normalised by the standard deviation of X_i also goes to infinity, i.e., \frac{v_i}{i^{-3/2}} = i^{3/4} \rightarrow \infty. Together, these two facts imply that the left endpoint of the interval, measured in standard deviations, satisfies \frac{v_i - a_i^{(n)}}{i^{-3/2}} = i^{3/4}\left(1 - \frac{a_i^{(n)}}{v_i}\right) \rightarrow \infty; the whole interval drifts into the far tail of the distribution of X_i, so the probability that X_i lies inside the interval from v_i - a_i^{(n)} to v_i + a_i^{(n)} goes to zero. While a finite product of positive numbers is positive, an infinite product of positive numbers that decay to zero is zero: there is zero probability that a randomly chosen point will lie in A^{(n)} + v.
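Numerically (again a truncated product, so only suggestive), fixing n = 1 and including more and more factors, the probability of the shifted rectangle collapses towards zero; in double precision it eventually underflows to exactly 0.0.

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal cdf

def shifted_product(n, terms):
    """Truncated product of P(v_i - a_i < X_i < v_i + a_i) with X_i ~ N(0, i^-3),
    v_i = i^(-3/4) and a_i = sqrt(2)*n/i."""
    p = 1.0
    for i in range(1, terms + 1):
        sigma = i ** -1.5
        v, a = i ** -0.75, sqrt(2.0) * n / i
        p *= Phi((v + a) / sigma) - Phi((v - a) / sigma)
    return p

for terms in [10, 20, 30, 50, 100]:
    print(terms, shifted_product(1, terms))   # decreases rapidly towards 0
```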

  • The sets A and A +v are disjoint.

Again, this is a consequence of the order of the limits.  Since the A^{(n)} increase with n, it suffices to show that A^{(n)} and A^{(n)}+v are disjoint for each fixed n.  For a fixed n, the ith edge of the suitcase A^{(n)} decays to zero faster than v_i, so for i sufficiently large, v_i > 2a_i^{(n)} and the ith edges of A^{(n)} and A^{(n)}+v do not overlap; no point can lie in both sets, hence A and A+v are disjoint.

Epilogue

While the above gave one specific example, the general theory goes as follows. Given a Gaussian measure N_{0,Q} on l^2 with mean zero and covariance Q, the shifted Gaussian measure N_{v,Q} is either equivalent to N_{0,Q} (meaning the two measures are absolutely continuous with respect to each other), or the two measures are singular. The Cameron-Martin space is the space of all v for which the two measures are equivalent. There is a simple expression for this space: it is Q^{1/2}(l^2), the image of l^2 under the operator Q^{1/2}. In the example above, Q^{1/2} = \mathrm{diag}(1, 2^{-3/2}, 3^{-3/2}, \cdots), so the Cameron-Martin space consists of the sequences v with \sum i^3 v_i^2 < \infty; our choice v_i = i^{-3/4} fails this test (\sum i^{3/2} = \infty), consistent with the singularity exhibited above.

An Introduction to Normal and Non-normal Subgroups

February 21, 2015

We are currently running several reading groups for engineering students and one is on Lie Groups and Representation Theory.  When elementary group theory was introduced yesterday, not surprisingly, there were many questions concerning normal subgroups.  As in many areas of mathematics, the traditional learning paradigm is to “just memorise the definitions and later things will make sense”.  While it is true that later things will make sense — once a critical mass of ideas and definitions has been accumulated, the connections between them can be seen, and the structure becomes self-supporting — we should still strive to make things as intuitive as possible right from the start.  Below is one such attempt.  It tries to introduce group theory based around several examples, one example at a time. Particular emphasis is placed on explaining normal and non-normal subgroups. Key points will be written in bold. (Caveat: The below captures what first came to mind and is likely to be unduly verbose.)

Free Group with Two Generators

Group theory came from the study of “transformations” of a space, and often, it is useful to think of each element of a group as something which “acts on” or “transforms” something. Later, we will consider permutation groups, which precisely fit this description. (In the reading group, we consider rotation groups, which also fit this description precisely.) Here, we consider a different group first, giving two equivalent definitions of it.

The algebraic definition is that our group G consists of all “words” that can be written using the letters a, b, a^{-1} and b^{-1}, with the cancellation rule that whenever a and a^{-1} appear next to each other, they can be omitted from the word, and the same for b and b^{-1}. Besides the “empty word”, denoted e, the group G contains elements such as aaaba, ababa, ab^{-1}b^{-1}aba^{-1} and so forth. The group operation (often written as multiplication) is just concatenation, so that ab times aabb is abaabb.

Since abb^{-1}a is the same as aa by definition of our cancellation rule (later, a physical interpretation will be given), note that ab times b^{-1}a equals aa.

The axioms of a group require the group operation to be associative, and it is clear that concatenation is associative, i.e., how the multiplications are grouped does not matter: ( (word one) times (word two)) times (word three) gives the same as (word one) times ((word two) times (word three)).  For example, ((ab) \cdot (ab)) \cdot (ab) = ababab = (ab) \cdot ((ab) \cdot (ab)).

There must also be an identity element. In this case, this is the empty word e, and it is clear that adding a “blank” to the start or the end of a word leaves the word unchanged, hence the empty word really is the identity element. (Time precludes me from writing down the group axioms and verifying them in detail.)

Lastly, to verify G is a group, it must be shown that every element has an inverse. This is intuitively clear; for example, the inverse of ab^{-1}ab is b^{-1}a^{-1}ba^{-1} because ab^{-1}ab \cdot b^{-1}a^{-1}ba^{-1} = e. This example shows the inverse can be constructed one letter at a time, going from right to left, writing down the inverse of each letter. (Note that the blank word also results if the inverse had been multiplied on the left rather than the right: b^{-1} a^{-1} b a^{-1} \cdot a b^{-1} a b = e. If this were not true, then G would not be a group, i.e., the axioms require that each element have a left inverse which is also a right inverse.)
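For readers who like to experiment, here is a minimal sketch of this group in code (an encoding of my own, not standard notation: a^{-1} and b^{-1} are written as the capital letters A and B). Multiplication is concatenation followed by the cancellation rule, and inversion reverses the word and inverts each letter, exactly as described above.

```python
def reduce_word(word):
    """Apply the cancellation rule: delete adjacent pairs aA, Aa, bB, Bb until none remain."""
    out = []
    for letter in word:
        if out and out[-1] == letter.swapcase():
            out.pop()            # a letter next to its inverse cancels
        else:
            out.append(letter)
    return "".join(out)

def multiply(w1, w2):
    return reduce_word(w1 + w2)  # group operation: concatenate, then cancel

def inverse(word):
    return word[::-1].swapcase() # reverse the word and invert each letter

print(multiply("ab", "Ba"))               # -> 'aa'   (ab times b^{-1}a equals aa)
print(inverse("aBab"))                    # -> 'BAbA' (the inverse of ab^{-1}ab is b^{-1}a^{-1}ba^{-1})
print(multiply("aBab", inverse("aBab")))  # -> ''     (the empty word e)
```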

While the above algebraic definition shows that this group, known as the free group with two generators, is easy to work with algebraically, it lacks a physical interpretation. This is now remedied by thinking of the free group with two generators as the fundamental group of the plane with two holes removed; don’t worry if that does not make any sense as I will explain the concept intuitively.

Imagine the floor of a room has an X marked on it, and to the left of the X is a vertical pole sticking out of the ground, and to the right there is another pole.  A spider starts at X and can be commanded to do four things.  Command “a” will make the spider walk from X clockwise around the pole on the left and return to X, all the while leaving a thin web behind him. (If you prefer, think of a robot laying down a rope.) Command “a inverse” will make the spider leave X, walk anti-clockwise around the pole on the left then return to X.  Notice that if the spider does “a” then ”a inverse” the web can be shrunk or contracted back to X, so “a” followed by “a inverse” is the equivalent of “do nothing”, because we want to treat two web configurations as the same if one can be deformed into the other without breaking the web. (Technically, the web is allowed to pass through itself so we do not need to worry about knots forming; we simply care about what happens as the web shrinks to as short a length as possible.)  Note that “a” followed by “a” gives a configuration (denoted aa) that is different from just instructing the spider to do “a” (denoted a); the former has the web encircling the pole on the left twice while the latter has the web encircling the pole only once, and no amount of jiggling of the web will transform one configuration into the other.

Similarly, commands “b” and “b inverse” can be defined to mean walk clockwise or anti-clockwise around the pole on the right, respectively. The reader should pause to ensure the algebraic definition and this spider-web definition agree intuitively by doing several thought experiments and visualising the resulting web configuration.

A subgroup can be thought of as what results if the spider is no longer allowed to carry out arbitrary commands but can only carry out certain sequences of commands and their inverses.  For example, perhaps the spider can only carry out the command ab and its inverse b^{-1}a^{-1}.  Then the resulting subgroup consists of \cdots, (b^{-1}a^{-1})^3,(b^{-1}a^{-1})^2,b^{-1}a^{-1},e,ab,(ab)^2,(ab)^3,\cdots . (This subgroup is isomorphic to the group of integers.)  Another subgroup is obtained by simply allowing the spider to carry out command “a” or its inverse.  (This subgroup is also isomorphic to the group of integers.)

Let H denote a subgroup of G. Recall the definition of normality: H is a normal subgroup of G if, for every g \in G and h \in H, the so-called conjugate g^{-1} h g is in H.  This definition will be motivated several times before the end of this article; for now, we start with a high-level, if somewhat vague, explanation: a normal subgroup is one whose elements produce actions that do not get “tangled up” very much with all the other actions, as now discussed.

If H is the subgroup generated by a, meaning the elements of H are precisely \cdots,a^{-3},a^{-2},a^{-1},e,a,a^2,a^3,\cdots then H is not normal. For example, b^{-1} a b cannot be written as an element of H because there is no way for the actions of a and b to be exchanged: if first the spider is given the command to wrap a web around the right pole clockwise, then apply a command from H, then wrap a web around the right pole anti-clockwise, there is no way for the web around the right pole to “cancel out” and leave a web that is only wrapped around the left pole; the subgroup H is not normal.

Why is normality interesting? Before giving the standard explanation (i.e., the left cosets of a subgroup form a quotient group if and only if the subgroup is normal, and essentially equivalently, a subgroup is the kernel of a group homomorphism if and only if it is normal), a high-level pragmatic explanation is proffered: when we are writing down sequences of group operations, we want to be able to simplify things, so we need some way of being allowed to interchange operations. In the simplest of cases, there are so-called Abelian groups which, by definition, are commutative: gh = hg for all g and h in the group. The simplest examples of Abelian groups are vector spaces with the usual vector addition being treated as the group operation (and scalar multiplication is forgotten about, except for multiplication by -1 which gives the group inverse). In an Abelian group, g^{-1} h g = g^{-1} g h = h, hence every subgroup is normal. This is the ultimate in “interchangeability”.

A normal subgroup has a property similar to, but a bit weaker than, full commutativity, in the following sense. If H is normal, and h \in H, then although gh may not equal hg, I can achieve the next best thing: I can find a \tilde h \in H such that g h = \tilde h g.  That is, I have been able to interchange the action g with an action from H provided I do not mind using a possibly different action from H. (Note that this follows immediately from the definition of normality: gh = ghg^{-1}g = \tilde h g where \tilde h = g h g^{-1} is guaranteed to be in H by normality. Note that g can be replaced by g^{-1} in the definition of normality without changing anything.)

At this point, the reader should be comfortable in believing that being normal is a special property not shared by every subgroup, and if a subgroup is normal, then it is a useful thing, because it allows one to manipulate group operations to achieve simplifications.

The Symmetric Group on Three Letters

Consider three distinct balls arranged in a row.  A robot can be asked to change their ordering: one action might be to swap the left-most ball with the right-most ball. The set of all distinct actions forms a group S_3, called the symmetric group on three letters, with group multiplication gh taken to mean “first perform action h then perform action g”, which again is a form of concatenation. Googling will bring up more than enough information about symmetric groups, so here I just wish to emphasise thinking in terms of each element corresponding to asking a robot to perform an action.

Let H denote the subgroup of S_3 formed from the action “interchange the left-most and middle balls”.  Denote this action by (12), and note that it is its own inverse. (I am using cycle notation.) This is another example of a non-normal subgroup: consider (13)^{-1} (12) (13), with the notation meaning, first swap the left and right balls, then swap the left and middle balls, then swap the left and right balls.  If the balls (red, green, blue) start off in the order RGB, then applying (13) gives BGR, applying (12) gives GBR and applying (13)^{-1} = (13) gives RBG.  So overall, it transformed RGB to RBG.  Note that the right ball has been changed, yet no command from H can change the right ball, hence H is not normal.  This again is an illustration of the vague intuition that non-normal subgroups perform actions that tangle other actions up.
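Here is a small sketch of this computation (my own encoding: a permutation is a tuple p with p[i] the position that the ball in position i moves to, 0-indexed, and gh means “first h, then g” as in the text).

```python
def compose(g, h):
    """gh in the text's convention: first perform h, then perform g."""
    return tuple(g[h[i]] for i in range(len(h)))

def act(p, arrangement):
    """Apply the action p to a string such as 'RGB'."""
    out = [None] * len(p)
    for i, target in enumerate(p):
        out[target] = arrangement[i]      # the ball at position i moves to position p[i]
    return "".join(out)

identity, s12, s13, s23 = (0, 1, 2), (1, 0, 2), (2, 1, 0), (0, 2, 1)
H = {identity, s12}                       # the subgroup generated by (12)

conj = compose(s13, compose(s12, s13))    # (13)^{-1}(12)(13); note (13) is its own inverse
print(act(conj, "RGB"))                   # -> 'RBG'
print(conj == s23, conj in H)             # -> True False: the conjugate is (23), which is not in H
```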

Besides the trivial subgroup \{()\} and the whole group, it can be shown that the only other normal subgroup of S_3 is the so-called alternating group A_3 formed from all the even permutations.

Gaining further insight requires looking more closely at applications of normality. Perhaps the most germane to look at is whether a certain way of partitioning a group into equivalence classes leads to a compatible group structure on the equivalence classes. Recall H is the subgroup of S_3 generated by (12). It turns out that any subgroup leads to a partition of the original group: G = \cup_{g \in G} gH where gH = \{gh \mid h \in H\} is a left coset. Crucially, for two distinct g,g' \in G, either gH = g'H, or gH \cap g'H = \emptyset. (An intuitive explanation of why this is so will be given later. For now, an algebraic proof is given: Assume there exist h,h' \in H such that gh = g'h'. For any f \in H, define f' = (g')^{-1}gf, so that gf = g'f'. The result follows if we can show f' belongs to H. Note f' = (g')^{-1}ghh^{-1}f = h'h^{-1}f which is in H because h', h^{-1} and f are all in H.)

If we treat a vector space as a group, and take for H a linear subspace, then gH would correspond to the affine subspaces obtained by translations of H; perhaps this is the simplest way of visualising left cosets.  (Vector spaces form Abelian groups, hence all their subgroups are normal, and simply considering vector spaces as groups does not lead to insight into non-normal subgroups.)

It can be deduced fairly readily that the elements of S_3 are (), (12), (13), (23), (123), (132).  (There are six elements because the left ball can be placed in one of three positions, the middle ball can then be placed in one of two positions, and having done so, there is only one spot remaining for the right ball, i.e., there are 3 \times 2 \times 1 = 6 possible re-arrangements.)

The cosets ()H and (12)H are both just H because the elements of H are () and (12).

The coset (13)H consists of the elements (13) and (13)(12). The latter transforms RGB into BRG and hence is equivalent to (123).  So the cosets (13)H and (123)H are the same; they contain the elements (13) and (123) of S_3.

The coset (23)H consists of the elements (23) and (23)(12) = (132). Hence the cosets (23)H and (132)H are the same; they contain the elements (23) and (132) of S_3.

This H is therefore seen to partition S_3 into three sets called equivalence classes: S_3 = \{ (), (12) \} \cup \{ (13), (123) \} \cup \{ (23), (132) \}.

Consider multiplying two elements from different equivalence classes of the partition. For example, (13) from the second equivalence class times (23) from the third equivalence class gives (13)(23)=(132), which lies in the third equivalence class. However, (123) from the second equivalence class times (132) from the third equivalence class gives (), which lies in the first equivalence class.  Therefore, there is no consistent way of defining what it means to multiply the second equivalence class with the third equivalence class, since choosing different elements to represent each equivalence class can lead to different answers.  (This should be compared with modulo arithmetic, where things work nicely.) Since H is not normal, it tangles things up too much.
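Both the partition and the failure of representative-independent multiplication can be verified mechanically. A small sketch, reusing the tuple encoding and convention from the previous snippet:

```python
from itertools import permutations

def compose(g, h):                        # first h, then g, as before
    return tuple(g[h[i]] for i in range(3))

S3 = list(permutations(range(3)))
identity, s12, s13, s23 = (0, 1, 2), (1, 0, 2), (2, 1, 0), (0, 2, 1)
s123, s132 = (1, 2, 0), (2, 0, 1)
H = [identity, s12]

cosets = {frozenset(compose(g, h) for h in H) for g in S3}
print(len(cosets))                        # -> 3: the three equivalence classes listed above

print(compose(s13, s23) == s132)          # -> True: (13)(23) = (132), in the third class
print(compose(s123, s132) == identity)    # -> True: (123)(132) = (), in the first class
```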

What is required for the equivalence classes to have a group structure?  If f,f',g,g' \in G are such that fH = f'H and gH = g'H, that is, g and g' represent the same equivalence class, and the same for f and f', then we want fgH to be the same as f'g'H. This must be true in the special case when f = (). If f = () then f' must be an element of H.  (If H = f'H then we can find h,h' \in H such that h = f'h', so that f' = h(h')^{-1} \in H.) So a necessary condition is for gH = f'g'H when f' \in H. Recall from earlier that if H were normal then we could interchange the order in which f' and g' are multiplied, albeit by having to change f' to a different element of H. Thus, if H is normal then f'g' = g'f'' for some f'' \in H, hence f'g'H = g'f''H = g'H, as required. With a little more care, it can be shown that a necessary and sufficient condition for left cosets to have a group structure, is for the subgroup to be normal. One way to see why normality suffices is by writing (gH)(fH) = (gff^{-1}H)(fH) = (gf)(f^{-1}Hf)H = gfH because, by normality, f^{-1}Hf will equal H. (This notation requires some justification which I do not elaborate on; treat it as a mnemonic, if you prefer.)

Why do Cosets form a Partition?

For Abelian groups, such as vector spaces, it is relatively clear why left cosets form equivalence classes. Why does this hold for arbitrary groups, when multiplication can tangle things up?

Let H be a subgroup of G.  First note two things: if h \in H then hH = H, and if g \notin H then gH \cap H = \emptyset. Proving the former is left as an easy exercise. The latter holds because, if some gh with h \in H were in H, then ghh^{-1} = g would also be in H, contradicting g \notin H.

We can now appeal to symmetry. Groups were invented to study symmetry, and in the first instance, this means often we understand what is going on by shifting things back to the “origin”, meaning the identity element of the group. If we want to understand what gH and g'H look like relative to each other, then we can premultiply them by g^{-1}. The key point is that this “transformation” is one-to-one and preserves structure, so we are allowed to do this without altering anything important. Thus, comparing gH with g'H is essentially the same as comparing H with g^{-1}g'H. But we know from earlier that either g^{-1}g' is an element of H, in which case H = g^{-1}g'H and hence gH = g'H, or else g^{-1}g' is not an element of H, in which case H and g^{-1}g'H are disjoint, and hence gH and g'H are also disjoint.

Reproducing Kernel Hilbert Spaces

August 6, 2014

Given that reproducing kernel Hilbert spaces (RKHS) behave more like Euclidean space than more general Hilbert spaces, it is somewhat surprising they are not better known. I have uploaded to arXiv a primer on reproducing kernel Hilbert spaces that differs from other introductory material in a number of ways.

It is perhaps the first to start by considering finite-dimensional spaces rather than jumping straight to infinite-dimensional spaces. Surprisingly much can be gained by considering the finite-dimensional case first. It also does not assume any prior knowledge about Hilbert spaces and goes to great lengths to explain clearly the notions of closed and non-closed subspaces and completions of spaces. (For example, it is explained that a space which is not complete can be thought of as having a ‘scratched surface’.)

Whereas other material refers to RKHS theory as providing a “coordinate-free approach”, I argue that the presence of a Hilbert space already provides a coordinate-free approach, with the added benefit of a reproducing kernel Hilbert space being a canonical coordinate system associated with each point in the Hilbert space.

The primer explains that RKHS theory studies the configuration of an inner product space sitting inside a function space; it describes an extrinsic geometry because how the space is oriented inside the larger function space matters. This way of thinking leads to being able to determine when RKHS theory might be relevant to a problem at hand; if the solution of the problem changes continuously as the configuration changes continuously then it is possible the solution might be expressible in terms of the kernel in a nice way.

How RKHS theory can be used to solve linear equations is discussed along with some novel descriptions and extensions, as is the role of RKHS theory in signal detection.

My colleague contributed substantially by describing some of the recent work in the statistical learning theory literature, including explaining how RKHS theory can be used for testing for independence, for nonlinear filtering, and so forth.

Tensor Products Introduced as Most General Multiplication (and as Lazy Evaluation)

July 27, 2014

This note is a fresh attempt at introducing tensor products in a simple and intuitive way. The setting is a vector space V on which we wish to define a multiplication. Importantly, we do not wish to limit ourselves to a multiplication producing an element of V. For example, we would like to think of u^T v and u v^T as two examples of multiplying elements u,v of \mathbb{R}^n. The first multiplication, u,v \mapsto u^T v, is a map from \mathbb{R}^n \times \mathbb{R}^n to \mathbb{R}. The second multiplication, u,v \mapsto uv^T, is a map from \mathbb{R}^n \times \mathbb{R}^n to \mathbb{R}^{n \times n}. In fact, since uv^T is well-defined even if u and v have different dimensions, the most general setting is the following.

Let U and V be fixed vector spaces agreed upon at the start. Assume someone has defined a multiplication rule m\colon U \times V \rightarrow W but has not told us what it is.  Here, m is a function and W is a vector space, both of which are unknown to us. For example, if U = \mathbb{R}^n and V = \mathbb{R}^p, a possible choice might be m(u,v) = uv^T where W = \mathbb{R}^{n \times p}.

What does it mean for m to be a rule for multiplication? We wish it to have certain properties characteristic of multiplication. In a field, multiplication is involved in a number of axioms, including the distributive law x(y+z) = xy+xz. Since U, V and W are vector spaces, potentially distinct from each other, not all the axioms for multiplication in a field make sense in this more general setting. (For instance, commutativity makes no sense if U \neq V.) The outcome is that m is declared to be a rule for multiplication if it is bilinear: m(x,y+z) = m(x,y)+m(x,z), m(x+y,z) = m(x,z)+m(y,z) and m(ax,y) = m(x,ay) = a\,m(x,y).

Not knowing m does not prevent us from working with it; we simply write it as m and are free to manipulate it according to the aforementioned rules. We do the equivalent all the time when we prove general results about a metric; we say “let d(\cdot,\cdot) be a metric” and then make use of only the axioms of a metric, thereby ensuring our derivation is valid regardless of which metric is actually being represented by d.

While the tensor product \otimes is more than this, in the first instance, it suffices to treat it merely as an unspecified rule for multiplication. (It is specified, but no harm comes from forgetting about this in the first instance.) Rather than write m(u,v) it is more convenient to write u \otimes v, and thinking of \otimes as multiplication reminds us that x \otimes (y+z) = x \otimes y + x \otimes z.

The first point is that we can treat \otimes as an unspecified multiplication, simplify expressions by manipulating it in accordance with the rules for multiplication, and at the end of the day, if we are eventually told the rule m, we can evaluate its simplified expression. In computer science parlance, this can be thought of as lazy evaluation.

Whereas there are many metrics “incompatible” with each other, the rules for multiplication are sufficiently rigid that there exists a “most general multiplication”, and all other multiplications are special cases of it, as explained presently. The tensor product \otimes represents this most general multiplication possible.

To consider a specific case first, take U and V to be \mathbb{R}^2. We already know two possible multiplications are u^T v and uv^T. The claim is that the latter represents the most general multiplication possible. This means, among other things, that it must be possible to write u^T v in terms of uv^T, and indeed it is: u^T v = \mathrm{trace}(uv^T). Note that trace is a linear operator. The precise claim is the following. No matter what m\colon \mathbb{R}^2 \times \mathbb{R}^2 \rightarrow W is, we can pretend it is m(u,v) = uv^T, work with this definition, then at the very end, if we are told what m actually is, we can obtain the true answer by applying a particular linear transform. At the risk of belabouring the point, if we wish to simplify m(u+v,w+x) - m(v,x) - m(u,w) then we can first pretend m(u,v) is uv^T to obtain (u+v)(w+x)^T - vx^T - uw^T = ux^T + vw^T. Later when we are told m(u,v) = 2u^Tv we can deduce that the linear map required to convert from uv^T to 2u^Tv is 2u^T v = 2\mathrm{trace}(uv^T). Applying this linear map to ux^T + vw^T yields 2(x^T u + w^Tv) and this is indeed the correct answer because it equals m(u+v,w+x) - m(v,x) - m(u,w).
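A short NumPy sketch of this \mathbb{R}^2 example, illustrating the lazy-evaluation viewpoint (the variable names and the random test vectors are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v, w, x = (rng.standard_normal(2) for _ in range(4))

# Pretend every multiplication is the outer product and simplify lazily:
# m(u+v, w+x) - m(v, x) - m(u, w) simplifies to u x^T + v w^T.
simplified = np.outer(u, x) + np.outer(v, w)

# Later we are told the actual rule is m(u, v) = 2 u^T v; the linear map recovering
# it from the outer product is M -> 2 trace(M).
lift = lambda M: 2.0 * np.trace(M)

direct = 2 * (u + v) @ (w + x) - 2 * (v @ x) - 2 * (u @ w)
print(np.isclose(lift(simplified), direct))   # -> True
```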

Readers wishing to think further about the above example are encouraged to consider that the most general multiplication returning a real-valued number is of the form u,v \mapsto v^T Q u for some matrix Q. How can v^T Q u be obtained from uv^T? What about for multiplication rules that return a vector or a matrix?

The general situation works as follows. Given vector spaces U and V, two things can be constructed, both of which use the same symbol \otimes for brevity. We need a space W for our most general multiplication, and we denote this by W = U \otimes V. We also need a rule for multiplication and we denote this by u \otimes v. By definition, u \otimes v is an element of U \otimes V. (Just like not all matrices can be written as uv^T, there are elements in U \otimes V that cannot be written as u \otimes v. They can, however, be written as linear combinations of such terms.)

At this point, I leave the reader to look up elsewhere the definition of a tensor product and relate it with the above.

Simple Derivation of Neyman-Pearson Lemma for Hypothesis Testing

July 24, 2014

This short note presents a very simple and intuitive derivation that explains why the likelihood ratio is used for hypothesis testing.

Recall the standard setup: observed is a realisation x of a random variable (or vector or process) X generated according to one of two probability densities, p_0 or p_1.

The observation space is to be partitioned into two regions, one region labelled zero and the other unity. If x lies in the region labelled zero then X is assumed to have been generated under p_0. The rough aim is to choose the partition so that if this “game” is repeated many times — choose i to be 0 or 1; repeatedly generate samples from p_i and count how many times they fall in the two regions — then the ratio of “correct outcomes” to “incorrect outcomes” is maximised, regardless of whether i=0 or i=1 was used. On average, we wish to be correct many more times than we are wrong.

Since there is a trade-off between being correct when i=0 and when i=1, Neyman-Pearson’s underlying criterion is to fix the probability of being wrong when i=0 and maximise the probability of being correct when i=1. (This part is assumed to be familiar to the reader.)

Mathematically, the region A of the observation space must be found such that 1) the probability P_0(A) of the observation lying in A under p_0 is some fixed (and small) value, while 2) the probability P_1(A) of the observation lying in A under p_1 is as large as possible. (This region A will be the region labelled unity.) This is a constrained optimisation problem involving a region of space. Pleasantly then, there is a simple way of solving it.

The trick is to imagine the observation space as broken up into many small tiles fitting together to cover the observation space. (If the observation space is drawn as a square then imagine dividing it into hundreds of small squares.) With this quantisation, finding the region A (up to quantisation error) means finding which of the tiles we want to be part of the region A.

Think of starting with A being the empty set and growing it by adding tiles one at a time. The first tile we add will do two things: it will give us a P_0(A) which we want to keep as small as possible and it will give us a P_1(A) which we want to maximise. (At this point, some readers may wish to re-tile the region with tiles of possibly differing sizes, so that the probability under P_0 of landing on any given tile is the same fixed constant regardless of which tile was chosen.) It is intuitively clear (and even clearer after re-tiling) that the very first tile we should pick is the one which will make \frac{P_1(A)}{P_0(A)} as large as possible. If P_0(A) has not exceeded the limit then we are allowed to pick another tile, and we pick the next best tile, that is, out of the tiles T_k remaining, we want the tile that maximises the ratio \frac{P_1(T_k)}{P_0(T_k)}. Each time, we greedily pick the best remaining tile that will help us make P_1(A) as large as possible while not letting P_0(A) grow too quickly.

As the size of the tiles shrinks to zero, the “limit” is the likelihood ratio \frac{p_1(x)}{p_0(x)} for the “infinitesimal” tile T_x centred at x. Choosing the region A = \{x \mid \frac{p_1(x)}{p_0(x)} > c\} for some constant c corresponds precisely to choosing the tiles T_x with the largest ratios; every tile with ratio better than c is included.
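The greedy tile-picking argument is easy to simulate. Below is a toy one-dimensional sketch (my own choices: p_0 = N(0,1), p_1 = N(1,1), constraint P_0(A) \le 0.1, and a uniform tiling of [-6,6]); the greedily chosen tiles reproduce, up to quantisation, the threshold region \{x \mid p_1(x)/p_0(x) > c\}, which here is a right tail \{x > t\} with t \approx 1.28.

```python
import numpy as np
from math import erf, sqrt

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))       # standard normal cdf

edges = np.linspace(-6.0, 6.0, 1201)                   # 1200 small "tiles"
P0 = np.diff([Phi(e) for e in edges])                  # tile probabilities under p0 = N(0,1)
P1 = np.diff([Phi(e - 1.0) for e in edges])            # tile probabilities under p1 = N(1,1)

order = np.argsort(P1 / P0)[::-1]                      # best likelihood ratio first
alpha, size, power, chosen = 0.1, 0.0, 0.0, []
for k in order:                                        # greedily add tiles while P0(A) <= alpha
    if size + P0[k] > alpha:
        break
    size, power, chosen = size + P0[k], power + P1[k], chosen + [k]

print(round(size, 3), round(power, 3))                 # achieved P0(A) and P1(A)
print(round(edges[min(chosen)], 2))                    # left edge of the chosen region, about 1.28
```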

The above is a hand-waving argument and is not the best on which to base a rigorous proof. Often this is the case though: an intuitive derivation is subsequently verified by a less intuitive but more elementary mathematical argument. An elementary proof can be found on Wikipedia.

Understanding (real- and complex-valued) Inner Products

July 10, 2014

This short note addresses two issues.

  • How can we intuitively understand a complex-valued inner product?
  • If an inner product structure is given to a vector space, how can we understand the resulting geometry?

Since inner products are associated with angles, and since we can understand angles, there is a temptation to interpret inner products in terms of angles. I advocate against this being the primary means of interpreting inner products.

An inner product \langle \cdot, \cdot \rangle induces the norm \| x \| = \sqrt{\langle x,x \rangle}. Importantly, the inner product can be recovered from this norm by the polarisation identity. Therefore, understanding the geometry of an inner product is the same as understanding the geometry of the norm, and for the latter, it often suffices to consider what the unit ball looks like. For me, the norm is the primary structure giving the space its geometry.
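The polarisation identity is easy to check numerically. A small sketch in \mathbb{C}^4 (with the convention, assumed here, that the inner product is linear in its first argument and conjugate-linear in its second):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)

inner = lambda a, b: np.vdot(b, a)         # sum_k a_k conj(b_k): linear in a, conjugate-linear in b
norm2 = lambda a: inner(a, a).real         # the induced squared norm ||a||^2

# Polarisation identity: <x, y> = (1/4) sum_{k=0}^{3} i^k ||x + i^k y||^2
recovered = sum((1j ** k) * norm2(x + (1j ** k) * y) for k in range(4)) / 4
print(np.isclose(recovered, inner(x, y)))  # -> True: the inner product is recovered from the norm
```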

What then is the purpose of the inner product? Not all norms have the same properties. Under some norms, projection onto a closed subspace may not be unique, for example. When interested in shortest-norm optimisation problems, the most desirable situation to be in is for the square of the norm to be quadratic, since then differentiating it produces a linear equation. In infinite dimensions, what does it mean for the square of a norm to be quadratic?

The presence of an inner product structure means the square of the norm is quadratic. Furthermore, the inner product “decomposes” the norm in a way that gives direct access to the derivative of the norm squared.

The remaining issue is how to understand complex-valued inner products. Given the above, the natural starting place is to consider endowing a complex vector space with a norm. Keeping the axioms of a real-valued normed vector space seems sensible; it implies that scaling a vector by e^{\jmath \theta} does not change its norm (because \| e^{\jmath \theta} x \| = | e^{\jmath \theta} |\,\| x \| = \| x \|).

Then one asks what it means for the square of a norm to be quadratic. From the real-valued case, one guesses that one wants to be able to represent the square of the norm as a bilinear form: \| x \|^2 = \langle x, x \rangle, where \langle \cdot, \cdot \rangle is linear in each of its arguments. Following the letter of the law, this would mean \| \alpha x \|^2 = \langle \alpha x, \alpha x \rangle = \alpha^2 \langle x,x \rangle = \alpha^2 \|x\|^2. In the complex case though, \alpha^2 need not equal |\alpha|^2. This explains why one tweaks the definition and instead considers sesquilinear forms which are linear in one argument and conjugate linear in the other: \langle \alpha x, x \rangle = \alpha \langle x, x \rangle = \langle x, \bar\alpha x \rangle. Indeed, one then correctly has that \|\alpha x \|^2 = \langle \alpha x, \alpha x \rangle = \alpha \bar\alpha \langle x , x \rangle = |\alpha|^2 \|x\|^2. With this tweak, one can verify that the complex-valued case works the same way as the real-valued case.

By treating the norm as the primary structure, one does not have to worry about giving an intuitive meaning to the inner product of two vectors not being a purely real-valued number; the inner product is there merely to expose the square of the norm as being quadratic. A complex-valued inner product is recoverable from its norm and hence no geometric information is lost. (Of course, orthogonality remains an important concept.) If one really wanted, one could play around with examples in \mathbb{C}^2 to get a better feel for what it means for \langle x , y \rangle = \jmath, for example, however, unless one encounters a particular problem requiring this level of detail, thinking in terms of norms is cleaner and more efficient. (If \langle x, y \rangle = r e^{\jmath\theta} then \langle x, e^{\jmath \theta} y \rangle = r, so that by “rotating” a complex-valued vector in the two-dimensional real-valued vector space that it spans, one can always return to thinking about real-valued inner products.)

Are Humans Smart and Computers Dumb? Can Computers Become Better at Helping Humans Solve Challenging Problems?

July 10, 2014

Computers are generally seen as “dumb”, unable to think for themselves and therefore unable to solve certain complex tasks that humans, being “smart”, are able to solve relatively easily. Humans outperform computers at speech recognition, many image processing problems, and at real-time control problems such as walking – bipedal robots cannot manoeuvre as well as humans. Some may argue computers will never be able to prove mathematical theorems or engineer complex systems as well as mathematicians or engineers.

From a pragmatic perspective, “being smart” is ill-defined.  Roughly speaking, a solution is “smart” if a formulaic solution is not known (or if the solution is obtained much more efficiently than via a formulaic but brute force solution, e.g. one could argue humans are still “smarter” at chess than computers, even though computers can now beat humans).

My current viewpoint is that:-

  • we are probably not as smart as we may think;
  • computers are dumb primarily because we only know how to program them that way;
  • regardless, the future vision should be of challenging problems being solved jointly by humans and computers without the current clunky divide between what humans do and what computers do. We should be aiming for semi-automated design, not fully-automated design.

The following three sections elaborate on the above three points in turn. First a summary though.

Summary: We should not dismiss the idea that computers can help us in significantly better ways to solve challenging problems simply because we see a divide: we are smart and computers are not. Ultimately, smartness probably can be recreated algorithmically provided computers (robots?) and humans start to interact extensively. But well prior to this, computers can become better at assisting us solve challenging problems if we start to understand how much ‘intuition’ and ‘problem solving’ boils down to rules and pattern recognition. Certainly not all intuition would be easy to capture, but often challenging problems involve large amounts of fairly routine manipulations interspersed by ingenuity. Semi-automated design aims to have a human and a computer work together, with the computer handling the more routine manipulations and the human providing high-level guidance which is where the ingenuity comes in.

Readers short of time but interested primarily in a suggested research direction can jump immediately to the last section.

As the focus is on solving mathematical and engineering problems, no explicit consideration is given to philosophical or spiritual considerations. Smartness refers merely to the ability to solve challenging (but not completely unconstrained) problems in mathematics and engineering.

How Smart Are Humans?

A definitive answer is hard to come by, but the following considerations are relevant.

Human evolution traces back to prokaryote cells. At what point did we become “smart”? My favoured hypothesis is that as the brain got more complex and we learnt to supplement our abilities with tools (e.g. by writing things down to compensate for fallible memory), exponential improvement resulted in our capabilities. (I vaguely recall that one theory of how our brain got more complex was by learning to use tools that led to improved diets, e.g. by breaking bones and accessing the marrow.) It seems much less likely we suddenly picked up an “intelligence gene” that no other creature has. Flies, rats, apes and humans are not that different in terms of basic building blocks.

When comparing humans to computers, one should not forget that humans require 21 or so years of boot-up time. And different humans have different strengths; not all are equally good at solving mathematical problems or engineering systems. As unpleasant a thought as it may be, consider an extraterrestrial who has a choice of building and programming computers or breeding and teaching humans. Which would be a better choice for the extraterrestrial wishing for help in solving problems? (Depending on the initial gene pool and initial population size, it might take 100 years before a human is bred that is good at a particular type of problem. Regardless, if starting from scratch, there is an unavoidable 15-20 year delay.)

Then there is the issue of what precisely smartness is. Smartness is defined by a social phenomenon. At a small scale, one just has to look at review committees and promotion panels to realise there is not always agreement on what is smart and what is not. More striking though, smartness is relative – humans compare with each other and with the world around them. There are infinitely more problems we cannot solve than we can solve, yet no one seems to conclude that we are therefore not smart. We look at what we can do and compare that with what others can do. In a similar vein, a computer can effortlessly produce millions and millions of theorems and proofs in that it could systematically piece axioms together to form true sentences, recording each new sentence as a “theorem” and the trail leading to it as a “proof”. Yet this would be dismissed immediately as not useful. So ‘usefulness to humans’ plays a role in defining smartness.

How do humans solve problems? Following again the principle of Occam’s Razor, the simplest hypothesis that comes to my mind involves the following factors. Most of our abilities come from having to compete in the world (think back thousands of years). Image processing is crucial, and as over half the brain is involved to some extent with vision, a large part of how we reason is visual. (Even if we do not explicitly form images in our minds, our subconscious is likely to be using similar circuitry.) We also need to be able to understand cause and effect — if I stand next to that lion, I will get eaten — which leads to the concept of thinking systematically. So my current hypothesis is that systematic thinking (including understanding cause and effect) and pattern recognition are the main players when it comes to reasoning. The mathematics and engineering we have developed largely fits into this mould. Rigorous systemisation and writing down and sharing of results have led to ‘amazing’ results that no single human could achieve (insufficient time!) but at each little step, ideas come from experience and nowhere else. Those that can calculate faster, are more perceptive, are more inquisitive, tend to find the “right” types of patterns, have greater focus and stamina, and are more attuned to what others may find interesting, have significant competitive advantages, but to rule a line and say computers can never do mathematics or engineering is unjustified. (The issue of creativity will be discussed in the next section.)

A final point before moving on. Speech recognition is very challenging for computers. Yet are we smart because we can understand speech? Speech was something created over time by humans for humans. Presumably grunts turned into a small vocabulary of words which grew into phrases and more complicated structures. The point though is that at no time could the process evolve to produce something humans couldn’t understand, because by definition, the aim was to communicate, and if communication was not working, another approach would be taken. If we could learn the languages of all extraterrestrials, perhaps we really are smart, but I’m skeptical.

How can Computers be Made Smarter?

The three main characteristics of a computer system (i.e., something built to do speech recognition or solve mathematical problems) are its size (raw processing power), architecture (e.g., what parts can communicate with what other parts) and the software running on it.

Since humans tend to set the benchmark at what humans can do, a minimum amount of raw processing power is required before a computer can even hope to do something “smart”. Yet the architecture is even more important. Engineers currently do not build computers anything like the way nature builds brains. It is very likely that current computational architectures are ineffective for the types of pattern recognition the human brain engages in.

More interestingly, the architecture of the human brain evolves over time. In simple terms, it learns by changing its architecture. There are two obvious components; via DNA, natural selection increases the chances that a human is born with a ‘good’ architecture to start with. Then there is the long process of refining this architecture through everyday experiences and targeted learning (e.g., attending school).

There is nothing stopping us from building reconfigurable computers that are massively interconnected, thereby very crudely approximating the architecture of a brain. (I suspect this may involve a shift away from ‘reliable’ logic gates to logic gates (or equivalent) that work some but not all of the time, for there is a trade-off between density and reliability.)

With remarkable advances in technology, the real challenge looking forwards is the software side. Because we don’t understand precisely the sort of pattern recognition the brain uses to solve problems, and because until recently the technology was not there to build massively interconnected reconfigurable computers, no one has seriously tried to make a computer “smart” in the way I have been referring to in this essay. (This is not overlooking Artificial Intelligence whose approach was partly hampered by the available technology of the day, and which I suspect never delved deeper and deeper into how the brain does pattern recognition — once artificial neural networks came on the scene, there were sufficiently many research questions for AI researchers to chase that they did not continuously return to the biology to understand better and better how the brain does pattern recognition. And in fairness, as they lacked the computational power for hypothesis testing, it would most likely have been a futile endeavour anyway.)

Summarising this section, several points can be made. Current computing systems (which nevertheless serve the majority of their intended purposes extremely well) seem to be a ‘mismatch’ for doing what humans are good at doing, therefore, they come across as “dumb”. There is no evidence yet to suggest there is a barrier to building (this century) a computing system that is “smart” in some respects, but it will require a whole new approach to computing. It is not clear whether imitating the architecture of the human brain is the best thing to do, or if there is an even more efficient approach waiting to be discovered. Nevertheless, if smartness is being measured against what humans can do, a good starting point is starting with similar resources to what a human has.

Bringing in points from the preceding section, one must not forget though that humans have been ‘trained’ over centuries (natural selection), that each individual then takes an additional 21 years of training, during which time they are communicating with and learning from other individuals, and even then, we tend to work on the problems we believe we can solve and ignore the rest. This suggests we have a narrow definition of “smartness” and perhaps the only real way for a computer to be “smart” in our eyes is if it were to ‘grow up’ with humans and ‘learn’ (through daily feedback over 21 years) what humans value.

Indeed, smartness is usually linked with creativity and being able to solve “completely new” problems. (I would argue though that the problems we can solve are, by definition, not as distant from other problems we have solved before as we would like to think. Who knows how much the subconscious remembers that our conscious mind does not.) Creativity, even more than smartness, is linked to how humans perceive themselves relative to others. A random number generator is not creative. An abstract artist is. Some computer generated pictures coming from fractals or other mathematical constructs can be aesthetically pleasing to look at but is creativity involved once one realises a formulaic pattern is being followed? When it comes to problem solving, creativity and smartness come back largely to usefulness, or more generally, to some notion of value. We solve problems by instinctively knowing which of the thousands of possible approaches are most likely not to work, thereby leaving a manageable few options to work our way through. When we “create” a new theorem or engineer a new system, we are credited with creativity if it is both potentially useful (“sensible”?) and different from what has been done before. A random number generator succeeds at being different; the challenge is teaching a computer a sense of value; I suspect this is achievable by having computers interact with humans as described above (it is all just pattern recognition refined by external feedback!), and perhaps humans can even learn to write down systematically not just an algebraic proof of a mathematical result but also the “intuition” behind the proof, thereby enabling computers to learn like humans by studying how others solve problems.

A Future Vision of Semi-automated Design

Although I have argued that if we wanted computers to be smart then we would most likely be able to achieve that goal eventually, the more important question is whether we should be trying to head directly there. Personally, I believe a more useful and more easily achievable goal is to work towards semi-automation and not full automation. At the moment there is essentially no way to guide a computer towards a goal; we run something, we wait, we get a result. Often we make a modification and try again. But we are not really interacting with the computer, just using it to shorten the time it takes for us to calculate something. By comparison, two people can work together on a design problem or a mathematical proof; there are periods when they work alone, but there are periods when they come together to discuss and get ideas from each other before going off to think on their own again. Semi-automation lies somewhere between these two extremes: the computer need not be “smart” like a human, but it should come across as being more than just a dumb calculator (even if, at the end of the day, all it is doing is calculations!).

Efficient exchange of information is vital. Computers can store more data than a brain can, but we can probe someone else’s brain for information much more efficiently than we can probe a computer’s memory banks. Largely this is because exchange of information is more efficient when there is shared knowledge that need not be explicitly transmitted.

Semi-automated Theorem Proving

There are so many mistakes, both big and small, in the published literature (even in the top journals) that it seems not only highly desirable but inevitable that we will ultimately have all mathematical works stored on, and verified by, computers. This is certainly not a new dream, so how far are we from achieving it?

Although proof verification systems exist, otherwise simple proofs become excessively long when translated into a form suitable for the computer to check. Exchange of information is far from efficient! At the other extreme, automated theorem provers exist, but so far they have proved few theorems of substance. I propose the following stages of research.

1) Choose an area (e.g., basic set theory, or basic set theory and finite-dimensional linear algebra). Work towards devising a notation so that humans can enter a theorem and proof with a similar amount of effort to typing up the theorem and proof for a formal set of lecture notes. To improve efficiency, the notation does not need to be unambiguous, since the computer can always ask for clarification if it cannot figure out the correct meaning; a toy sketch of such a clarification loop follows below. Similarly, the gaps between steps of the proof may not always be small enough for the computer to fill in; nevertheless, the computer should have a go, and ask for help (e.g., ask for a sub-step to be inserted) if it fails.
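To make the “ask only when necessary” behaviour concrete, here is a minimal Python sketch. It is not a real proof-entry system: the token meanings, the context hints and the `resolve` function are all hypothetical, and serve only to illustrate resolving ambiguity by querying the user when the context does not settle the question.

```python
# Toy sketch of the stage-1 clarification loop (hypothetical, for illustration).

# Hypothetical table mapping an ambiguous token to its candidate meanings.
MEANINGS = {
    "+": ["addition of numbers", "union of sets", "concatenation of sequences"],
    "|x|": ["absolute value", "cardinality of a set", "length of a sequence"],
}

def resolve(token: str, context_hint: str | None = None) -> str:
    """Return the intended meaning of `token`, asking the user only if the
    context does not already single out one candidate."""
    candidates = MEANINGS.get(token, [token])
    if context_hint:
        narrowed = [m for m in candidates if context_hint in m]
        if len(narrowed) == 1:
            return narrowed[0]          # context disambiguates; no question asked
        if narrowed:
            candidates = narrowed
    if len(candidates) == 1:
        return candidates[0]
    print(f"The token {token!r} is ambiguous:")
    for i, meaning in enumerate(candidates, 1):
        print(f"  {i}) {meaning}")
    choice = int(input("Which meaning did you intend? "))
    return candidates[choice - 1]

# Example: resolve("+", context_hint="sets") returns "union of sets" silently,
# whereas resolve("+") asks the user to choose.
```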

2) Enhance stage one by having the computer offer a list of suggestions whenever it gets stuck; it is easier for the user to click on the correct suggestion than to type in the answer, and if none of the suggestions is correct, the user can always fall back to entering the answer. Here, some method of ranking the suggestions is required (for otherwise there might be too many). Initially, this can be based on the computer determining “what we already know” and “what we want to find out”, then searching through the axioms and existing theorems for something that relates either to “what we already know” (a possible ‘starting move’) or to “what we want to find out” (a possible ‘ending move’). A toy version of such a ranking appears below.
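As an illustration of the ranking idea (and nothing more), here is a Python sketch in which lemma “statements” are reduced to sets of symbols; the scoring rule and the miniature library are made up for the example, and a real system would of course work with structured terms rather than bags of symbols.

```python
# Toy sketch of stage-2 suggestion ranking (hypothetical, for illustration).

def score(lemma_symbols: set[str], known: set[str], goal: set[str]) -> int:
    # Reward overlap with "what we already know" and with "what we want to find out".
    return len(lemma_symbols & known) + len(lemma_symbols & goal)

def rank_suggestions(library: dict[str, set[str]],
                     known: set[str], goal: set[str], top_k: int = 3):
    """Return the top_k lemma names, best match first."""
    ranked = sorted(library, key=lambda name: score(library[name], known, goal),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical miniature library of results, each described by its symbols.
library = {
    "triangle inequality": {"norm", "add", "leq"},
    "Cauchy-Schwarz":      {"inner", "norm", "leq"},
    "rank-nullity":        {"dim", "kernel", "image"},
}
print(rank_suggestions(library, known={"norm", "inner"}, goal={"leq"}))
# -> ['Cauchy-Schwarz', 'triangle inequality', 'rank-nullity']
```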

3) Stage two will eventually prove inefficient as the number of theorems grows and the proofs become more complicated. Some sort of ‘learning’ and ‘intuition’ on the computer’s behalf is unavoidable if humans and computers are to interact significantly more efficiently. Perhaps pattern recognition, as discussed in previous sections, ultimately becomes unavoidable, but one should not be too hasty in ruling out the possibility that a fair amount of intuition is more formulaic than we think. Stage three, then, is supplementing theorems and proofs with basic intuition. The first example that comes to mind is working with bounds, e.g., the triangle inequality. A computer can be taught that if it wants to prove x and y are close, it suffices to find a point z that is close to x and also close to y. Or, in the reverse form (i.e., | \|x\| - \|y\| | \leq \|x-y\|), if the computer wants to show x and y are some distance apart, it can endeavour to do so by comparing the norm of x with the norm of y. A toy encoding of these two moves is sketched below.
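The two moves just described can be written down quite mechanically. The following Python sketch is purely illustrative (the `CloseGoal`/`FarGoal` representations are invented for the example), but it shows that this piece of “intuition” amounts to little more than a rewrite rule that turns one goal into subgoals.

```python
# Toy encoding of the two triangle-inequality "moves" (hypothetical, for illustration).

from dataclasses import dataclass

@dataclass
class CloseGoal:          # prove dist(x, y) <= eps
    x: str
    y: str
    eps: float

def via_intermediate_point(goal: CloseGoal, z: str) -> list[CloseGoal]:
    """Triangle inequality: dist(x,y) <= dist(x,z) + dist(z,y), so it suffices
    to show x is within eps/2 of z and z is within eps/2 of y."""
    return [CloseGoal(goal.x, z, goal.eps / 2),
            CloseGoal(z, goal.y, goal.eps / 2)]

@dataclass
class FarGoal:            # prove dist(x, y) >= delta
    x: str
    y: str
    delta: float

def via_norm_comparison(goal: FarGoal) -> str:
    """Reverse triangle inequality: | ||x|| - ||y|| | <= ||x - y||, so showing
    the norms differ by at least delta shows x and y are at least delta apart."""
    return f"show | ||{goal.x}|| - ||{goal.y}|| | >= {goal.delta}"

# Example: via_intermediate_point(CloseGoal("x", "y", 0.1), z="z") produces the
# two subgoals dist(x,z) <= 0.05 and dist(z,y) <= 0.05.
```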

Certainly some proofs require deep intuition and ingenuity, but even then, a large part of the proof is generally fairly routine. Once a human has specified the “stepping stones”, the routine parts can be filled in (semi-)automatically by the computer working its way through its collection of routine moves, supplemented by basic intuition to rank the order in which it should try them; a minimal search loop of this kind is sketched below.
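Here is a minimal Python sketch of such a gap-filling loop, under the assumption that the goals, the routine moves and the ranking function are supplied elsewhere (all names are hypothetical).

```python
# Toy sketch of filling in a routine gap between two stepping stones.

from typing import Callable, Optional

Goal = str
Move = Callable[[Goal], Optional[list[Goal]]]   # returns subgoals, or None if inapplicable

def close_gap(goal: Goal, moves: list[Move],
              rank: Callable[[Goal, list[Move]], list[Move]],
              depth: int = 3) -> bool:
    """Return True if the goal can be discharged within `depth` routine moves."""
    if depth == 0:
        return False
    for move in rank(goal, moves):          # "intuition" = the ordering of the moves
        subgoals = move(goal)
        if subgoals is None:
            continue                        # this move does not apply to this goal
        # An empty list of subgoals means the move closes the goal outright.
        if all(close_gap(g, moves, rank, depth - 1) for g in subgoals):
            return True                     # every subgoal was discharged
    return False
```

The depth bound is what keeps the “routine” search manageable, and the ranking function is precisely where the basic intuition of stage 3 would enter.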

Stages 1 to 3 are within our grasp and would already have a tremendous impact if they work as expected.

4) Stages 1 to 3 were concerned with a user stating a theorem and sketching a proof, the aim being for the computer to fill in the missing steps and thereby verify the correctness of the theorem, falling back on asking for more information if it gets stuck. Stage 4 is for the computer to work with a user in proving or disproving a theorem. That is, the user states a conjecture, and then, through a series of interactions with the computer, the user either finds a proof or a counterexample in a shorter time than if proceeding unassisted. Refinement of the “intuition engine” may or may not be necessary. Some way is required for the user to suggest efficiently (with a minimum of keystrokes) what the computer should try next, and for the computer to suggest to the user what to (and what not to) try next. [I omit my ideas for how to do this because they will most likely be superseded by the insight gained by anyone working through stages 1 to 3.]

5) The fifth stage is for computers to propose theorems of their own. In the first instance, they can look for repeated patterns in the proofs stored in their database; a toy sketch of such pattern spotting is given below. The examples that come to mind immediately are category theory and reproducing kernel Hilbert spaces. In part, these areas came into existence because mathematicians observed the same types of manipulations being carried out in different situations, identified the common threads and packaged them up into a theory or “toolbox” of their own.
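As a toy illustration of how a computer might surface candidate “common threads”, the Python sketch below counts how often short sequences of proof-step names recur across stored proofs; the step names and proofs are made up, and a real system would work with structured proof objects rather than strings.

```python
# Toy sketch of stage-5 pattern spotting (hypothetical data, for illustration).

from collections import Counter

def frequent_step_patterns(proofs: list[list[str]], length: int = 2,
                           min_count: int = 2) -> list[tuple[tuple[str, ...], int]]:
    """Count length-`length` runs of consecutive proof steps across all proofs
    and return those occurring at least `min_count` times, most frequent first."""
    counts = Counter()
    for proof in proofs:
        for i in range(len(proof) - length + 1):
            counts[tuple(proof[i:i + length])] += 1
    return [(pat, n) for pat, n in counts.most_common() if n >= min_count]

proofs = [
    ["expand definition", "apply linearity", "take supremum"],
    ["apply linearity", "take supremum", "use compactness"],
    ["expand definition", "apply linearity", "take supremum", "pass to limit"],
]
print(frequent_step_patterns(proofs))
# ('apply linearity', 'take supremum') appears in all three proofs above,
# flagging it as a candidate for packaging into a lemma of its own.
```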

Semi-automated Chip Design

Engineers have long dreamed of being able to specify what they would like a software program or an electronic circuit to do, push a button, and have a computer generate the software or the circuit. Even more significant than saving time, if done properly this would guarantee that the software works correctly (and possibly the circuit too, although that is harder to guarantee in more complicated situations where extra considerations such as EM interactions must be taken into account). There is so much buggy software and hardware out there that this dream is becoming ever more desirable to achieve.

Basically, all the ideas discussed earlier are relevant here. Instead of building up a library of mathematical theorems, proofs and accompanying intuition, one builds up a library of software components or electronic circuits. Humans will always be part of the design loop, but more and more of the routine work can be carried out by a computer. Brute-force algorithms and heuristics must be replaced by a way of encoding “intuition” and “lessons from past experience”, thereby allowing humans and computers to interact much more efficiently; a toy sketch of one such encoding follows.
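One very simple way of encoding “lessons from past experience” is to record which library blocks were used together in past verified designs and rank suggestions accordingly. The Python sketch below is a toy illustration with made-up block names, not a real design workflow.

```python
# Toy sketch of suggesting library blocks from past designs (hypothetical data).

from collections import Counter

past_designs = [
    {"uart", "fifo", "clock_divider"},
    {"uart", "fifo", "crc_checker"},
    {"spi", "fifo", "clock_divider"},
]

def suggest_next(placed: set[str], designs: list[set[str]], top_k: int = 2):
    """Rank library blocks by how often they co-occurred with the blocks
    the engineer has already placed."""
    counts = Counter()
    for design in designs:
        if placed <= design:                 # past design contains everything placed so far
            counts.update(design - placed)
    return [block for block, _ in counts.most_common(top_k)]

print(suggest_next({"uart"}, past_designs))  # -> ['fifo', ...]
```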
