Michael Nielsen / Recurse Center / February 2016

The world's oldest complete piece of surviving music is the 3,400-year-old Hurrian cult hymn. The hymn was found inscribed on clay tablets excavated from the ancient city of Ugarit, in modern-day Syria. Sometime after the hymn was composed, the idea of using notation to record music was lost. Writing two millennia later, in the seventh century CE, the polymath Isidore of Seville stated that “unless [musical] sounds are held by the memory of man, they perish, because they cannot be written down.”

In the ninth century, European musicians began experimenting with notations to record plainchant. The initial notations were rough. They didn't precisely denote either pitch or rhythm, and were intended more as an aid to memory than as a recording medium. But they improved. In the 11th century, Guido of Arezzo introduced the musical staff, making it easy to record pitch. And in the 13th century, Franco of Cologne suggested using a written note's appearance to signify duration. These and many other ideas led to modern musical notation.

Written music originated as a recording medium, but became a
creative medium in its own right. It made it much easier to
compose complex, intricate music for many instruments and
voices. This paved the way for Bach's fugues, Beethoven's
symphonies, and much else. Written music became a medium for
thought, a medium which expanded the range of musical ideas a
composer could have, and thus changed music itself. It's an
example of a *cognitive medium* – a media
environment to support and enable thought.

In the modern day, digital computers have spurred the invention of many new cognitive media. Consider a program such as Photoshop. Adept users of Photoshop internalize tools such as the lasso, the eyedropper, the clone stamp, and so on. The tools become part of the way users think about image manipulation, and expand the range of graphical ideas they can have.

Photoshop is just one of many cognitive media enabled by digital computers. Other examples include musical scorewriting programs such as Finale and Sibelius, modeling programs such as Blender and 3ds Max, and game engines such as Unity and Unreal. Such media enable users to think and create in ways that would otherwise be difficult or impossible. They are ways of augmenting human intellect** This notion has been developed by many researchers. See, for example, Douglas Engelbart's Augmenting Human Intellect: A Conceptual Framework, and Alan Kay and Adele Goldberg's Personal Dynamic Media..

This essay is about the design of cognitive media for mathematics. Examples of such cognitive media include Mathematica, Matlab, Sage, and Coq. These programs are extremely useful, but conservative in that they represent mathematics in ways close to those used by mathematicians in (say) the 1950s, working with paper-and-pencil. By contrast, in recent years many people** A partial list includes Bret Victor, Marc ten Bosch, and Steven Wittens. have experimented with media which don't treat the basic language of mathematics as given, but rather use computers to change the fundamental representations and operations used. Over the long run, I believe such experimentation will lead to radical changes in how we think about mathematics.

As a step in this direction, in this essay I'll show a simple prototype medium which helps the user explore in a particular part of mathematics, namely, linear algebra. The prototype is based on three main ideas:

When using paper-and-pencil (or its close cousin, blackboard-and-chalk), mathematicians often explore using informal, concrete, graphical representations, like those shown in the image to the right. Photo credit: United States Department of Energy. Such exploration is often difficult in existing computer-based cognitive media. While systems such as Mathematica are good at producing graphical output, they're not aimed primarily at direct manipulation of those graphical representations. In this sense, they fall short of paper-and-pencil. Is it possible to strictly improve on paper-and-pencil as a medium for doing mathematics? Our prototype medium will aim to make direct manipulation easy. The intent is to combine the flexibility and concreteness of paper-and-pencil with the automated inference that is the raison d'ĂȘtre of traditional systems for doing mathematics by computer. Think something like “Photoshop for linear algebra”. That grossly overstates what we'll achieve, but conveys the right gist.

Many experimental cognitive media are intended
as *explanations* – see, for a beautiful example,
Vi Hart and Nicky
Case's explanation of
one of the economist Thomas
Schelling's models of social
behavior. By contrast, the prototype medium we'll develop
is intended as part of an open-ended environment for
exploration and discovery. Of course, exploration and
discovery is a very different process to explanation, and so
requires a different kind of medium.

We'll use our prototype to develop a piece of genuine
mathematics, a beautiful result in linear algebra called
the *singular value decomposition*, or *SVD* for
short. This is in contrast to many experimental cognitive
media, which often aim at toy mathematical problems. Aiming
at the SVD makes the essay less accessible than working with a
toy problem, but has the great benefit of forcing us to
confront and overcome the kind of problems which arise in real
mathematical work. In essence, we'll be using the medium to
re-discover the SVD. Of course, ideally we'd go even further
than re-discovering existing mathematics, and use the medium
to discover new mathematics. We won't achieve that here, but
over the long run, that's the right test of an exploratory
cognitive medium: does it enable the development of important
new ideas?

The prototype itself is not the main point of the essay.
Rather, the point is what general lessons we can learn from
the prototype. We'll use our prototype to identify a
fundamental problem in the design of exploratory cognitive
media. This problem arises from desiring a medium that
simultaneously: (1) automates computation and inference, as we
expect a computer-based medium to do; and (2) allows the user
to flexibly change assumptions on the fly, in a manner similar
to paper-and-pencil. We will show that such a medium leads to
mathematical inconsistencies: there is no single ground truth
in the system, but rather multiple ground truths. These
inconsistencies are not an accident or specific to our
particular medium. Rather, they are inevitable in any
cognitive medium that has both the flexibility of
paper-and-pencil and the ability to automatically do
computations and inference. To resolve this problem, we must
develop an internal data model for the medium and a user
interface that lets us reason about these multiple ground
truths. We'll identify a set of principles that enable a
medium to support this kind of *semi-concrete
reasoning*. I believe semi-concrete reasoning will be an
important part of any concrete, exploratory medium for
mathematics.

Before getting to our exploratory medium, it helps to learn a
little about the singular value decomposition (SVD). If
you're not familiar with the SVD, it's a way of breaking a
matrix up into three simple components. It's most easily
understood for a real $2 \times 2$ matrix $M$, where the SVD
guarantees that it's always possible to break $M$ up into a
rotation about the origin, followed by a rescaling of the
co-ordinate axes, followed by a second rotation about the
origin. Each of these three components is easy to reason
about individually, and that often makes it easy to reason
about the matrix as a whole. The great thing about the SVD is
that it guarantees that *every* $2 \times 2$ matrix can
be broken up in this way, for suitable choices of rotations
and rescaling** Actually, the full
statement of the SVD is slightly more complex. I'll explain
it in a moment.. It's similar to the way knowing the
prime factorization of a number can make it easier to reason
about that number.

The SVD doesn't just apply to $2 \times 2$ matrices. For a real $n \times n$ matrix $M$, the SVD says such a matrix can always be decomposed as** There are versions of the SVD which apply to non-square matrices, to infinite matrices and linear operators, and to matrices over fields other than the real numbers. For our purposes, the $n \times n$ real case is a good level of generality. Many of the ideas can, in any case, be applied in more general contexts.

\begin{eqnarray} M = U \, {\rm diag}(s_1, s_2, \ldots, s_n) \, V. \end{eqnarray}

Here, the matrices $U$ and $V$ are real $n \times
n$ *orthogonal* matrices. As you may guess from the
earlier discussion, orthogonal matrices are generalizations of
the 2-dimensional rotations. In particular, they're matrices
which *preserve length* in $n$ dimensions, just as
rotations preserve length in 2 dimensions. That is, if $U$ is
an orthogonal matrix and $v$ is any vector, then the length
$\|U v\|$ is always the same as the length $\|v\|$. Formally,
a square matrix $U$ is defined to be orthogonal if $U^T U =
I$, where $T$ is the transpose operation, and $I$ is the
identity matrix** To see how this
relates to length preservation, observe that $\|v\|^2 = v^T
v$, i.e., the sum of the squares of the components of $v$.
Using this observation we have $\|Uv \|^2 = v^T U^T U v = v^T
v = \|v\|^2$, where we substituted $U^T U = I$ to get the
second equality.. And so in the case of the SVD we
have $U^TU = I$ and $V^T V = I$.

The other part of the singular value decomposition $(1)$ was
the diagonal matrix ${\rm diag}(s_1, s_2, \ldots, s_n)$. The
diagonal entries $s_1 \geq s_2 \geq \ldots \geq s_n \geq 0$
are known as the *singular values* of the matrix $M$.
These singular values simply rescale the corresponding
directions in $n$-dimensional vector space. Putting it all
together, the SVD says any matrix can be broken up into an
orthogonal “rotation”, a rescaling, and another
orthogonal “rotation”** I
told a small lie in my explanation of the $2 \times 2$ case.
The $2 \times 2$ orthogonal matrices include not just
rotations, but also reflections. As an example, consider the
matrix $\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$, which
reflects about the $x$ axis. It's easily verified that this
is an orthogonal matrix, but it's not a rotation. In fact,
you can show that an arbitrary $2 \times 2$ orthogonal matrix
is either a rotation, or a product of $\begin{bmatrix} 1 & 0
\\ 0 & -1 \end{bmatrix}$ with a rotation. An analogous
statement is true in $n$ dimensions. In practice, you can
mostly ignore the reflections, and think of orthogonal
matrices as rotations in $n$ dimensions..

While we're going to prove the SVD, the proof is not the main point of the essay. As such, I will take it for granted that you're willing to believe the SVD is interesting. This means we won't get deeply into the applications or meaning of the SVD. But with that said, it's worth knowing that the SVD is a widely used tool, for applications ranging from natural language processing to quantum computing to comparing protein structures. Indeed, it's one of the most useful tools in linear algebra, and, indeed, in all mathematics.

In this section I explain the proof of the SVD in the case when $M$ is a $2 \times 2$ matrix. This proof contains all the essential ideas needed for the general proof, which we'll discuss later.

Rather than attack the problem head-on, let's start with a simpler and more natural problem. This problem will appear unrelated, but as we attempt to solve this problem our explorations will lead us to the SVD.

The simpler problem is the question of how much a $2 \times 2$
matrix $M$ *stretches* space? To understand what I
mean by stretching space, consider the matrix

If we let this matrix act on the vector $(1, 0) \equiv \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ the vector becomes

$$M \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$The input vector $(1, 0)$ has length $1$, while the output vector $(1, 1)$ has length $\sqrt 2$. So the matrix stretches this particular input vector to become a factor $\sqrt{2}$ longer.

Of course, $(1, 0)$ isn't the only possible input vector. If the input were, say, $(0, 1)$ the output would be $M(0, 1) = (1, 0)$, which is the same length as the input. In that direction, $M$ doesn't stretch space at all.

The question we'll investigate is this: what is
the *maximal* degree of stretching associated to some
given matrix, $M$? In other words, if we consider all unit
vectors $v$, what is the maximal value for the length $\| M
v\|$?

I'm abusing terminology a little here – $M$ might actually shrink unit vectors in some directions, or even in every direction. But the term “stretch” is evocative, so we'll continue using it** Stretching is not a standard mathematical term. Mathematicians often refer to this maximal stretching as the matrix norm, or, a little more formally, as the matrix norm induced by the Euclidean distance..

We won't solve this problem immediately, but will instead do some exploration that helps us understand the problem. The exploration will be illustrated using a rough prototype of a cognitive medium for doing linear algebra. Think of what follows as someone doing real, live mathematical work, trying to better understand the maximal degree of stretching associated to a $2 \times 2$ matrix. This is a record of their explorations. To start the demo, please play the video below. Of course, in a real medium, you'd be able to interrupt, do your own exploration, and so on. But this is just a prototype, to explore ideas, and so I haven't built the medium out to do that. Here it is** Original video may be downloaded here.:

Let's fill in the steps skipped above, to take the proof of the conjecture from geometric idea to rigorous proof. Let $T$ be the tangent vector at $Ms$. Recall that we're trying to prove that $T$ is orthogonal to $Ms$. Our strategy is to assume $T$ has a component in the direction of $Ms$, that is, $Ms \cdot T > 0$, and to try to deduce a contradiction** It might also be that $Ms \cdot T < 0$, in which case we would replace $T$ by $-T$, i.e., use the tangent in the other direction along the curve.. Because $T$ is the tangent to the image curve, vectors on the image curve near $Ms$ can be written as $Ms+\Delta T + O(\Delta^2)$ for some small parameter $\Delta$. The squared length of such a vector is:

$$\|Ms+\Delta T + O(\Delta^2)\|^2 = \|Ms\|^2+ 2 \Delta \, Ms \cdot T + O(\Delta^2)$$

But $Ms \cdot T > 0$, so for small $\Delta$ we have $\|Ms+\Delta T + O(\Delta^2)\|^2 > \|Ms\|^2$, i.e., moving along the image curve does, indeed, give us a vector longer than $Ms$. That's a rigorous proof of what we wanted to show.

We learned from the above explorations that the tangent to the image curve at $Ms$ is orthogonal to $Ms$. We can make this observation more explicit by expressing the tangent $T$ in terms of existing quantities like $M$ and $s$. To do this, recall that the image curve arose by applying $M$ to the unit circle. It should be plausible that the tangent to the image curve at $Ms$ is just $Mt$ where $t$ is the tangent to the unit circle at $s$. This can be proved with a little calculus** I won't go through the details, though it's a good exercise if it's not immediately obvious. Incidentally, this is standard calculus, and could be something the medium automatically “knows”, and suggests to the user through the interface.. It's easiest to see the situation through an illustration:

We can make things even more explicit, noting that the tangent $t$ is orthogonal to $s$, since it's the tangent to a circle. And so the previous observation that the tangent to the image curve at $Ms$ is orthogonal to $Ms$ can be rewritten as:

**Lemma:** *Let $s$ be the principal right
singular vector of $M$, and let $t$ be orthonormal to $s$.
Then $Ms$ is orthogonal to $Mt$.*

The lemma tells us that the principal right singular vector
$s$ is a very special vector. For orthonormal vectors $s$ and
$t$ it usually is *not* the case that $Ms$ is
orthogonal to $Mt$. What we learn from the lemma is that this
is, however, true for the principal right singular vector.
That's a powerful thing to know.

In fact, it actually lets us prove the SVD for $2 \times 2$ matrices. Let's use the lemma to explicitly construct each of the rotation-rescaling-rotation steps, in terms of $s, t, Ms$ and $Mt$. To see how this works, please play the video below. Note that I've moved $t$ and $Mt$ to the origin, emphasizing that $t$ is orthogonal to $s$, and $Mt$ is orthogonal to $Ms$** Original video may be downloaded here..

To recap, we rotate the principal right singular vector $s$ to the first co-ordinate axis. This rotation automatically rotates the vector $t$ (which is orthonormal to $s$) to the second co-ordinate axis. We then rescale the first co-ordinate axis by a factor $\|Ms\|$, and the second co-ordinate axis by a factor $\|Mt\|$. And finally we rotate the rescaled vectors to $Ms$ and $Mt$. This last step makes crucial use of the lemma, since it's only possible because $Ms$ and $Mt$ are orthogonal** A point I glossed over in the visual proof is that it may be necessary to reflect one of the axes. If that's the case the reflection can be absorbed into the rotation, and it'll still be an orthogonal matrix. This is why the SVD is stated in terms of orthogonal matrices, rather than rotations..

By construction, this rotation-rescaling-rotation combination
takes $s$ to $Ms$ and $t$ to $Mt$. In fact, it follows that
the rotation-rescaling-rotation combination must do the same
as $M$ for *all* possible inputs. The reason is that
$s$ and $t$ form a basis for the vector space, and since both
$M$ and the rotation-rescaling-rotation act linearly, they
must therefore have the same action on all vectors. Thus, $M$
is equal to the rotation-rescaling-rotation operation. And so
we've proved the singular value decomposition for $2 \times 2$
matrices.

Well, that's pretty nice! We started out with a question about how much a matrix stretches space, and the singular value decomposition more or less dropped out of our exploration. Indeed, if you carefully review the key ideas in the proof, the SVD is an almost obvious consequence of two simple and easily-visualized geometric ideas: (1) $Ms$ must be orthogonal to $Mt$, otherwise it would be possible to go “further out” along the image curve; and (2) the rotation-rescaling-rotation sequence just visualized. What's more, while our proof of the SVD was in $2$ dimensions, the proof is easily extended to $n$ dimensions. That generalization is incidental to the main point of this essay, so I've sketched it out in Appendix A.

An interesting aspect of the discussion so far is that we
actually *haven't* solved our original problem: to
figure out the maximal degree of stretching. This mirrors
what often happens in real mathematical work – you may
not solve the problem you originally intended to solve, but
you learn something interesting anyway! In fact, it's
possible to use the SVD to solve our original problem, finding
the maximal stretching. Again, this is incidental to the main
point of this essay, and so I describe how to do it
in Appendix B.

I glossed over a very interesting moment in our proof of the SVD. It's the point where we ask what happens if the tangent vector has a component in the direction of $Ms$. Let me remind you what happens. It takes just a few seconds** Original video may be downloaded here.:

This appears simple and natural. It's the kind of thing a mathematician working with paper-and-pencil does almost without thinking. But in a computer-based medium, something sophisticated must be going on under the hood. To understand why, let's think about how the medium would model the actions shown. A natural approach is to start with a matrix $M$ which has some generic values filled in. In fact, the values for $s, t, Ms$ and so on shown above are produced by the following matrix $M$:

$$M = \begin{bmatrix} 0.70 & -0.28 \\ 1.27 & 0.86 \end{bmatrix} $$

So, for example, the image curve is computed by applying this particular $M$ to the points making up the unit circle. And the principal right singular vector $s$ is computed from this $M$. And so on, for all the properties computed by the medium.

In using this specific $M$ we're following a time-honored mathematical tradition: in order to reason about some general class of objects – in this case, all the $2 \times 2$ matrices – our medium has picked out one particular $2 \times 2$ matrix, and is using it as a generic example. The benefit of doing this is that it lets the medium do all computations explicitly, so we can reason from concrete, easily-understood and visualized examples. So, for instance, using a concrete example makes it easy to notice that the tangent to the image curve at $Ms$ is orthogonal to $Ms$, when $s$ is the principal right singular vector. Of course, the medium only shows this for this specific matrix $M$. But the fact that the example matrix $M$ is generic makes it plausible that this phenomenon holds for any $2 \times 2$ matrix. Indeed, it would be easy to check for other choices of $M$, though I didn't explicitly show that happening. This practice of using concrete-but-generic examples to gain insight has been common all through history, but is made especially easy with digital computers** There's an art to picking out generic examples that don't mislead. That's not a problem I'm going to discuss in detail, but it's a fascinating subject. To develop a good concrete medium for doing mathematics, I suspect it's a subject which would need to be investigated in great depth. Incidentally, in the essay Up and Down the Ladder of Abstraction, Bret Victor has described powerful techniques which sometimes make it possible to eschew generic examples, and instead work with concrete visualizations representing the space of all examples. .

Now, what happens when we drag the tangent away from its actual value to a value not orthogonal to $Ms$, as shown in the video just above?

Before the change the medium was keeping track of the following mathematical objects:

M: a 2 by 2 matrix, with entries 0.70, -0.28, ... The unit circle The image of the unit circle under M s: the principal right singular vector, with entries 0.89, 0.46 Ms: the result of applying M to s, with entries 0.49, 1.52 The tangent to the image curve at Ms, with entries -0.57, 0.18

What happens to the internal state of the medium when we change the tangent? Of course, our internal representation of the tangent must change. But there are challenges in updating our description of the other objects. Most glaringly, no matrix $M$ exists which has the desired values for $Ms$ and the changed tangent.

One natural response is to declare the value of $M$ to
be `undefined`

. If we do that, then it's also
natural to declare $s$, the principal right singular vector of
$M$, as `undefined`

. Ditto the image curve. Our
mathematical objects are all vanishing! We started with the
idea of computing everything from underlying concrete objects.
That's difficult to do when our objects no longer exist.

The problem is that we're trying to describe an impossible world. There is no consistent set of mathematical objects that satisfies all the constraints we've imposed in our system. And yet despite this inconsistency, we'd like to continue reasoning. After all, a mathematician working with pencil-and-paper has no trouble continuing at this point. The reason this is possible is that a static medium like paper-and-pencil does no computation or constraint-checking. The medium doesn't tell us it's not possible to make changes to a sketch like

erasing and modifying the tangent and image curve so the sketch becomes

This is despite the fact that there is no matrix $M$ having the depicted values.

This lack of constraint checking is both a feature and a bug. Because paper-and-pencil doesn't enforce this kind of constraint checking, it relies on the mathematician to do the constraint checking in their head in a principled way. That's a substantial cognitive burden. But it also gives the mathematician a lot of freedom to carry out counterfactual reasoning, like we did in the last section.

One possible response is to puritanically say “Oh, well,
we shouldn't do this kind of reasoning in our medium”.
While that would be convenient, it'd be burying our head in
the sand. Real mathematicians engage in this kind of
counterfactual reasoning all the time. They're happy to
imagine impossible worlds *en passant*, and to make
progress by asking “What would happen if this world
actually existed?”

Systems for doing mathematics by computer, such as Matlab and Mathematica, are often terrific at doing computations with concrete objects. They can verify in a flash that $Ms$ is orthogonal to $Mt$ for any specific choice of matrix $M$. But they don't always make it easy to do counterfactual reasoning. There's no easy way of asking Matlab to suppose that $Ms$ is not orthogonal to $Mt$. And this actually makes it harder to prove that $Ms$ is always orthogonal to $Mt$, for any $2 \times 2$ matrix $M$. If we make counterfactual reasoning difficult, then we greatly weaken our medium of discovery.

What we'd ideally like is a medium supporting what we will
call *semi-concrete reasoning*. It would
simultaneously provide: (1) the ability to compute concretely,
to apply constraints, and to make inferences, i.e., all the
benefits we expect a digital computer to apply (and which made
it so easy to notice that $Ms$ is orthogonal to the tangent in
the first place); and (2) the benefits of paper-and-pencil,
notably the flexibility to explore and make inferences about
impossible worlds. As we've seen, there is tension between
these two requirements. Yet it is highly desirable that both
be satisfied simultaneously if we are to build a powerful
exploratory medium for doing mathematics. That is true not
just in the medium I have described, but in any exploratory
medium.

In the next section I sketch an approach to building a system for semi-concrete reasoning. We shall see that this is a challenging problem, with open-ended components involving the psychology of the user, and what constitutes “natural” modes of mathematical inference. We won't completely solve the problem. But we will at least better understand some of the principles to be followed in any system for semi-concrete reasoning.

Before moving to those principles, let me briefly mention some connections between semi-concrete reasoning and existing systems. One connection is to constraint-based media. Pioneered by the graphics medium Sketchpad, such media enable users to build systems by declaring constraints on those systems. The idea is that the medium finds a solution (possibly approximate) satisfying the constraints, often by minimizing some sort of cost function** There is a large literature on this. A good starting point for comparison to the current work is the work on Constraint Hierarchies by Alan Borning, Bjorn Freeman-Benson, and Molly Wilson (1992).. This idea has found application in many areas, including graphics, computer-aided design, simulation, user interface design, and programming languages.

Our medium differs from such systems in that it deliberately maintains, and attempts to make reasoned inferences from, a constraint violation. There's no cost to be minimized, and we're not trying to find a solution, approximate or otherwise. Rather, the goal is to understand the implications of the constraint violation in a step-by-step way, a way where the inferences being made are clearly legible to the user.

Another connection is to proof assistants and systems for automated reasoning, such as Coq and Agda. Such systems do enable proofs by contradiction and other forms of counterfactual reasoning. However, the aim of such systems is different to what I'm describing. They're typically aimed toward helping human users develop formal proofs, rather than presenting a user interface to enable informal, concrete, exploratory reasoning, and thus have very different interface goals.

In our earlier discussion, we implicitly assumed our medium has an internal data model that represents mathematical reality. In particular, having updated the value for the tangent, we assumed it was the job of the medium to find a corresponding consistent value for $M$ and for other quantities in the problem.

It's possible to proceed in a different fashion. Instead of
using our medium's data model to represent mathematical
reality, we can instead *use the medium's data model to
represent the user's current state of mathematical
knowledge*. This makes sense, since in an exploratory
medium we are not trying to describe what is true – by
assumption, we don't know that, and are trying to figure it
out – but rather what the user currently knows, and how
to best support further inference.

Having adopted this point of view, user interface operations correspond to changes in the user's state of mathematical knowledge, and thus also make changes in the medium's model of that state. There is no problem with inconsistency, because the medium's job is only to model the user's current state of knowledge, and that state of knowledge may well be inconsistent. In a sense, we're actually asking the computer to do less, at least in some ways, by ignoring constraints. And that makes for a more powerful medium.

Having adopted this point of view, the question is how the medium can best support the user's mathematical exploration and further inference. The simplest approach is to allow the user to manually edit the mathematical world, with little assistance from the computer in doing inference** Original video may be downloaded here.:

This sequence of operations corresponds roughly to the operations performed by a human mathematician with paper-and-pencil. The internal state of the medium after these operations would be:

M: a 2 by 2 matrix, with entries 0.70, -0.28, ... The unit circle The edited image curve s: the principal right singular vector, with entries 0.89, 0.46 Ms: the result of applying M to s, with entries 0.49, 1.52 The edited tangent to the image curve at Ms, with entries -0.15, 0.5

This is not a consistent mathematical world. It doesn't need to be, since the purpose of our medium's data model is no longer to represent a consistent mathematical world, but to represent the user's current state of mathematical knowledge, and to support further inference.

Of course, it's disappointing to require the user to make all inferences manually, as they would with paper-and-pencil. Ideally, an exploratory medium would help the user make inferences, give the user control over how these inferences are made, and make it easy for the user to understand and track the chain of reasoning. To appreciate the importance of these principles, consider the following medium, which behaves differently again when we modify the tangent** Original video may be downloaded here.:

In this medium, as we move the end of the tangent vector, the medium finds a new corresponding base point $Ms$ on the image curve. It chooses the new base point so the tangent vector is, indeed, a genuine tangent to the image curve (which is presumed to be fixed). At the same time, the medium also recomputes the value of $s$, by multiplying the new value for $Ms$ by $M^{-1}$.

Now, in some proofs these may be exactly the desired inferential steps. But upon consideration, this example illustrates some genuine design challenges. In particular, in general we'd like to use our medium to reason about a large number of mathematical objects. There is potentially an exponentially large number of ways changing one object can cause inferred changes in others. In order for the user to understand what is going on, the way inferred changes propagate needs to clearly communicated. In particular, the medium must be designed so the user can easily select from amongst the many possible chains of inference.

In paper-and-pencil mathematics, how we do inference depends upon what relationships we know between objects in the system. In this case, we know of just a single direct relationship between the tangent vector and other objects: the tangent is a function of the base point and the image curve** Note that later in the proof we learn that the tangent is equal to $Mt$. That's a second important relationship. However, at the time we're discussing that relationship is not yet known to the user, and so isn't part of the medium's model of the user's knowledge.:

If we're using paper-and-pencil, we must use this relationship to deduce the implications of changing the tangent. The reason is that it's the only relevant thing we know. Effectively, we're inverting the relationship:

In this particular case, we want the base vector $Ms$ to
remain unchanged, by assumption. Given that, we can infer
what the image curve “must have been”, given the
modified value for the tangent. Of course, the modified image
curve is not uniquely determined, and so with paper-and-pencil
we sketch an illustrative curve segment. That illustrative
curve segment is *not* supposed to literally be the
implied curve segment; rather, it is a curve segment with
approximately the right shape, and where the base point and
tangent are exactly correct.

The paper-and-pencil reasoning is strikingly different to the medium last shown, where changing the tangent vector caused an immediate change to $s$. With paper-and-pencil any such inference would have required a two-step reasoning process: first, figure out what changing the tangent means for $Ms$, and then figure out what that change to $Ms$ means for $s$.

We could adopt a similar single-step inference approach to our
exploratory medium. In such a medium, for every interface
operation *generating* a new relationship between
objects, there would be a corresponding interface operation
enabling us to infer how changes in any one of those objects
affects the other objects. In the specific case we've been
considering, this means the task of the medium is to present
the user a way of selecting from amongst the space of base
vectors and segments of image
curves** A point I've glossed over is
how the medium knows what shape to use for the curve segment.
One possible answer is to interpolate based on the original
curve and the modified tangent. This is the strategy I used in
the prototype, and it seems to work reasonably well..

With paper-and-pencil, this inversion needs to be performed mechanically. In an exploratory medium, the medium can guide us in doing the inversion. Of course, in this instance the inference isn't especially hard, so this is only a small win, but it nonetheless helps free the user to concentrate on higher-level reasoning.

Using the medium to support only a single stage of inference has several benefits. It naturally makes the chain of inference legible, since it mirrors the way we do inference with paper-and-pencil, every step made explicit, while nonetheless reducing tedious computational work, and helping the user understand what inferences are possible. It's also natural psychologically, since the user is already thinking in terms of these relationships, having defined the objects this way. Finally, and perhaps most importantly, it limits the scope of the interface design problem, since we need not design separate interfaces for the unlimited(!) number of possible inferences. Rather, for every interface operation generating a mathematical object, we need to design a corresponding interface to propagate changes. That's a challenging but finite design problem. Indeed, in the worst case, a “completely manual” interface like that presented earlier may in general be used.

With that said, one could imagine media which perform multiple stages of inference in a single step, such as our medium modifying $s$ in response to changes in the tangent. Designing such a medium would be much more challenging, since potentially many more relationships are involved (meaning more interfaces need to be exposed to the user), and it is also substantially harder to make the chain of reasoning legible to the user.

Even with the simplification of doing single-step inference, there are still many challenging design problems to be solved. Most obviously, we've left open the problem of designing interfaces to support these single stages of inference. In general, solving this interface design problem is an open-ended empirical and psychological question. It's an empirical question insofar as different modes of inference may be useful in different mathematical proofs. And it is a psychological question, insofar as different interfaces may be more or less natural for the user. Every kind of relationship possible in the medium will require its own interface, and thus present a new design challenge. The simplest way to meet that challenge is to use a default-to-manual-editing strategy, mirroring paper-and-pencil. But by providing additional inference options we can support more powerful reasoning, and strictly surpass paper-and-pencil.

A second design problem is the profusion of possible ways in which inference may occur. Later in the proof we connect the tangent not just to $Ms$ and the image curve, but also to $Mt$. At that point the user would be aware of two separate direct relationships involving the tangent vector:

The medium would then need to allow the user to make
inferences about how changes to the tangent
changed *either* $Mt$ *or* $Ms$ and the image
curve. In general, the medium needs to allow the user to make
choices between *multiple types of inference* possible
at any given stage. This seems complex, but it mirrors the
complexity inherent in the problem. The problem solver must
make choices about how to investigate the effects of changes
in assumption. The medium is merely reifying the choices that
have to be made to support those inferences. The benefit of
the medium is that it will help do the computational work in
making those inferences, and help track their impact.

In discussions of systems of reasoning it is sometimes assumed that the informal, intuitive systems used by humans are things to be “fixed up”, turned into so-called proper, rigorous reasoning. If the purpose of reasoning were merely verifying correctness, then that would be a reasonable point of view. But if the purpose of reasoning is exploration and discovery, then it is wrong. Exploration and discovery require a logic that is different to, and at least as valuable as, conventionally “correct” reasoning. The idea of semi-concrete reasoning is a step toward media to support such exploration and discovery, and perhaps toward new ways of thinking about mathematics.

Alan Kay has asked “what
is the carrying capacity for ideas of the computer?” We
may also ask the closely related question: “what is the
carrying capacity for *discovery* of the
computer?” In this essay we've made progress on that
question using a simple strategy: develop a prototype medium
to represent mathematics in a new way, and carefully
investigate what we can learn when we use the prototype to
attack a serious mathematical problem. In future, it'd be of
interest to pursue a similar strategy with other problems, and
with more adventurous interface ideas. And, of course, it
would be of interest to build out a working system that
develops the best ideas fully, not merely prototypes.

To conclude, a personal observation. I began thinking about the design of cognitive media just a few years ago. I've repeatedly found that interface design is deeper than I suspected. A powerful medium reifies the deepest ideas we have about a subject: it becomes an active carrier for those ideas. And to the extent it is successful in reifying those ideas, mastering the medium becomes the same as mastering the subject. In this view, designing exploratory media is about designing tools which can transform and extend our ability to think, create, and discover.

**Acknowledgments:** Thanks to Dave Albert,
Darius Bacon, Kovas Boguta, Ilona Brand, Joe Edelman, Jesse
Tasse Gonzalez, Max McCrea, Chris Olah, Prabhakar Ragde,
Omar Rizwan, John Workman, Katherine Ye, and Devon Zuegel
for helpful conversations and for comments on a draft of
this essay.

In the main body of the essay we learned how to prove the singular value decomposition in 2 dimensions. In fact, the same ideas can be used to prove the SVD in $n$ dimensions. I'll briefly sketch how the proof works in this appendix. It begins with exactly the same argument as we used earlier, showing that if $s$ is the principal right singular vector of $M$, and $t$ is any vector orthonormal to $s$, then $Ms$ is orthogonal to $Mt$. No modification is needed to establish this – the proof goes straight through.

From this, it can be shown using a construction similar to the $2 \times 2$ case that

$$M = U_1 \begin{bmatrix} s_1 & 0 \\ 0 & M' \end{bmatrix} V_1$$

where $U_1$ and $V_1$ are orthogonal matrices, and $M'$ is an $n-1$ by $n-1$ submatrix. Proving this is pretty straightforward, although it may be tedious if you're not so comfortable with linear algebra. The only real difference is that instead of working with a vector $t$ tangent to the principal right singular vector $s$, we instead work with the entire vector subspace tangent to $s$. I won't go through the details.

We can repeat this decomposition, breaking $M'$ up in terms of orthogonal matrices in $n-1$ dimensions and an $n-2$ by $n-2$ submatrix $M''$. Continuing in this way until we reach two dimensions, we obtain

$$M = U_1 U_2 \ldots {\rm diag}(s_1, s_2, \ldots, s_n) \ldots V_2 V_1,$$

where $U_1, V_1, U_2, V_2, \ldots$ are all orthogonal matrices. Since a product of orthogonal matrices is itself an orthogonal matrix, we obtain

$$M = U \, {\rm diag}(s_1, s_2, \ldots, s_n) \, V$$

for orthogonal matrices $U$ and $V$. This is the singular value decomposition.

It's notable that we developed the ideas underlying this proof in a visual environment depicting two spatial dimensions. We've then used algebraic-and-textual reasoning to extend that intuition to higher dimensions. This is a common procedure in paper-and-pencil based mathematical reasoning. Mathematicians develop fairly sophisticated ways of using informal two-dimensional representations to support more rigorous $n$-dimensional reasoning. Consider, for example, the following remark by mathematician Gil Kalai on high-dimensional reasoning:

*Vitali Milman, who works in high-dimensional convexity, likes
to draw high-dimensional convex bodies in a non-convex way.
This is to convey the point that if you take the convex hull
of a few points on the unit sphere of $R^n$, then for large
$n$ very little of the measure of the convex body is anywhere
near the corners, so in a certain sense the body is a bit like
a small sphere with long thin “spikes”.
*

A powerful exploratory medium for mathematics would build in such representational ideas, and would understand how to use them to support the transition from informal reasoning about low-dimensional representations to rigorous high-dimensional proofs.

Our earlier discussion showed that the largest singular value $s_1$ for a $2 \times 2$ matrix $M$ is equal to the maximal degree of stretching $\|Ms\|$. In fact, it's straightforward to adapt the reasoning in Appendix A to show that this is true also for an $n \times n$ matrix. That is, suppose an $n \times n$ matrix $M$ has singular value decomposition

$$M = U \,\mbox{diag}(s_1, s_2, \ldots, s_n) V,$$

with singular values $s_1 \geq s_2 \geq \ldots \geq s_n \geq 0$. Then the maximal degree of stretching is just

$$\max_{\|v \| = 1} \| Mv \| = s_1.$$

While this is a nice connection, it's not obvious it helps us figure out the maximal degree of stretching. After all, at this point we haven't figured out an explicit way of computing the singular values for a given matrix $M$. Fortunately, with a little work we can figure out not just $s_1$, but all the singular values. To do this, observe that from the SVD we have

$$M^T M = V^T \mbox{diag}(s_1^2, s_2^2, \ldots, s_n^2) V$$

since $U$ is an orthogonal matrix, and thus satisfies $U^T U =
I$. Inspecting this equation, we see that the eigenvalues of
$M^TM$ are just the squares of the singular values, $s_1^2,
s_2^2, \ldots$. And so $s_1$ is just the square root of the
largest eigenvalue of $M^TM$. This eigenvalue *is*
easy to compute, which makes the singular value $s_1$ easy to
compute. And that, in turn, makes it easy to compute the
maximal degree of stretching for $M$.

Summing up, the maximal degree of stretching for a matrix $M$ is equal to the square root of the largest eigenvalue of $M^TM$. This gives an easy, computationally feasible way of computing the maximal degree of stretching, and thus solves our original problem. This also illustrates a point made when we first discussed the SVD: knowing the SVD is true sometimes makes it much easier to reason about a matrix.