<h1>Using spaced repetition systems to see through a piece of mathematics</h1>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$']]},
"HTML-CSS":
{scale: 92},
TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="../emm/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p>By <a href="http://michaelnielsen.org">Michael Nielsen</a>, January 2019</p>
<p>What does it mean to understand a piece of mathematics? Naively, we
perhaps think of this in relatively black and white terms: initially
you don’t understand a piece of mathematics, then you go through a
brief grey period where you’re learning it, and with some luck and
hard work you emerge out the other side “understanding” the
mathematics.</p>
<p>In reality, mathematical understanding is much more nuanced. My
experience is that it’s nearly always possible to deepen one’s
understanding of any piece of mathematics. This is even true –
perhaps especially true – of what appear to be very simple
mathematical ideas.</p>
<p>I first really appreciated this after reading an essay by the
mathematician Andrey Kolmogorov. You might suppose a great
mathematician such as Kolmogorov would be writing about some very
complicated piece of mathematics, but his subject was the humble
equals sign: what made it a good piece of notation, and what its
deficiencies were. Kolmogorov discussed this in loving detail, and
made many beautiful points along the way, e.g., that the invention of
the equals sign helped make possible notions such as equations (and
algebraic manipulations of equations).</p>
<p>Prior to reading the essay I thought I understood the equals
sign. Indeed, I would have been offended by the suggestion that I did
not. But the essay showed convincingly that I could understand the
equals sign much more deeply.</p>
<p>This experience suggested three broader points. First, it’s possible
to understand other pieces of mathematics far more deeply than I
assumed. Second, mathematical understanding is an open-ended process;
it’s nearly always possible to go deeper. Third, even great
mathematicians – perhaps, especially, great mathematicians
– thought it worth their time to engage in such deepening.</p>
<p>(I found Kolmogorov’s essay in my University library as a
teenager. I’ve unsuccessfully tried to track it down several times in
the intervening years. If anyone can identify the essay, I’d
appreciate it. I’ve put enough effort into tracking it down that I
must admit I’ve sometimes wondered if I imagined the essay. If so, I
have no idea where the above story comes from.)</p>
<p>How can we make actionable this idea that it’s possible to deepen our
mathematical understanding in an open-ended way? What heuristics can
we use to deepen our understanding of a piece of mathematics?</p>
<p>Over the years I’ve collected many such heuristics. In these notes I
describe a heuristic I stumbled upon a year or so ago that I’ve found
especially helpful (albeit time intensive). I’m still developing the
heuristic, and my articulation will therefore be somewhat
stumbling. I’m certain it can still be much improved upon! But perhaps
it will already be of interest to others.</p>
<p>One caveat is that I’m very uncertain how useful the heuristic will be
to people with backgrounds different to my own. And so it’s perhaps
worth saying a little about what that background is. I’m not a
professional mathematician, but I was trained and worked as a
professional theoretical physicist for many years. As such, I’ve
written dozens of research papers proving mathematical theorems,
mostly in the field of quantum information and computation. Much of my
life has been spent doing mathematics for many hours each day. It’s
possible someone with a different background would find the heuristic
I’m about to describe much less useful. This applies to people with
both much less and much more mathematical background than I have.</p>
<p>It’s also worth noting that my work mostly involves mathematics only
incidentally these days. I still do some mathematics as a hobby, and
occasionally as part of other research projects. But it’s no longer a
central focus of my life in the way it once was. I suspect the
heuristic I will describe would have been tremendously useful to me
when mathematics was a central focus. But I’m honestly not sure.</p>
<p>The heuristic involves the use of <em>spaced-repetition memory
systems</em>. The system I use is a flashcard program called Anki. You
enter flashcards with a question on one (virtual) side of the card,
and the answer on the other. Anki then repeatedly tests you on the
questions. The clever thing Anki does is to manage the schedule. If
you get a question right, Anki increases the time interval until
you’re tested again. If you get a question wrong, the interval is
decreased. The effect of this schedule management is to limit the
total time required to learn the answer to the question. Typically, I
estimate total lifetime study for a card to be in the range 5-10
minutes.</p>
<p>I’ve described many elements of my Anki practice in a <a href="http://augmentingcognition.com/ltm.html">separate essay</a>.
Reading that essay isn’t necessary to understand what follows, but
will shed additional light on some of the ideas. Note that that essay
describes a set of heuristics for reading papers – indeed, of
syntopically reading entire literatures – that are largely
orthogonal to the heuristic I’m about to describe. I find the
heuristics in that essay useful for rapidly getting a broad picture of
a subject, while the heuristics in this essay are for drilling down
deeply.</p>
<p>To explain the heuristic, I need a piece of mathematics to use as an
example. The piece I will use is a beautiful theorem of linear
algebra. The theorem states that a complex normal matrix is always
diagonalizable by a unitary matrix. The converse is also true (and is
much easier to prove, so we won’t be concerned with it): a matrix
diagonalizable by a unitary matrix is always normal.</p>
<p>Unpacking that statement, recall that a matrix $M$ is said to be
normal if $MM^\dagger = M^\dagger M$, where $M^\dagger$ is the conjugate
transpose, $M^\dagger := (M^*)^T$. And a matrix is diagonalizable by a
unitary matrix if there exists a unitary matrix $U$ such that $M = U D
U^\dagger$, where $D$ is a diagonal matrix.</p>
<p>(As shorthand, from now on I will use “diagonalizable” as shorthand to
mean “diagonalizable by a unitary matrix”.)</p>
<p>What’s lovely about this theorem is that the condition $MM^\dagger =
M^\dagger M$ can be checked by simple computation. By contrast,
whether $M$ is diagonalizable seems <em>a priori</em> much harder to check,
since there are infinitely many possible choices of $U$ and $D$. But
the theorem shows that the two conditions are equivalent. So it
converts what seems like a search over an infinite space into simply
checking a small number of algebraic conditions. Furthermore, working
with diagonalizable matrices is often <em>much</em> easier than working with
general matrices, and so it’s extremely useful to have an easy way of
checking whether a matrix is diagonalizable.</p>
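<p>To make the “simple computation” point concrete, here is a small sketch of my own (not part of the original essay) that checks normality directly, using plain Python complex arithmetic; the helper names <code>dagger</code>, <code>matmul</code>, and <code>is_normal</code> are mine:</p>

```python
# Sketch (mine, not from the essay): checking M M† = M† M directly.

def dagger(M):
    """Conjugate transpose: (M†)[j][k] = conj(M[k][j])."""
    return [[M[k][j].conjugate() for k in range(len(M))]
            for j in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[j][l] * B[l][k] for l in range(len(B)))
             for k in range(len(B[0]))]
            for j in range(len(A))]

def is_normal(M, tol=1e-12):
    MMd = matmul(M, dagger(M))
    MdM = matmul(dagger(M), M)
    return all(abs(MMd[j][k] - MdM[j][k]) < tol
               for j in range(len(M)) for k in range(len(M)))

print(is_normal([[0, 1], [1j, 0]]))  # True: a unitary matrix is normal
print(is_normal([[1, 1], [0, 1]]))   # False: a shear is not
```

<p>Any unitary or Hermitian matrix passes this check; the shear fails it, and (by the theorem) is therefore not diagonalizable by a unitary matrix.</p>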
<p>Let me explain the proof. I shall explain it at about the level of
detail I would use with a colleague who is a mathematician or quantum
information theorist; people less comfortable with linear algebra may
need to unpack the proof somewhat.</p>
<p>There are two ideas in the proof.</p>
<p>The first idea is to observe that $MM^\dagger = M^\dagger M$ means the
length of the $j$th row of $M$ is equal to the length of the $j$th
column. It’s easiest to see this for the first row and first column.
Suppose we write $M$ as</p>
<script type="math/tex; mode=display">M = \left[ \begin{array}{c} r \\ M' \end{array} \right]</script>
<p>where $r$ is the first row and $M’$ is the remainder of the
matrix. Then the top-left entry in $MM^\dagger$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
MM^\dagger = \left[ \begin{array}{cc} r r^\dagger & \cdots \\ \cdots & \cdots \end{array} \right]. %]]></script>
<p>Similarly, suppose we write $M$ as:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \left[ \begin{array}{cc} c & M'' \end{array} \right] %]]></script>
<p>where $c$ is the first column and $M’’$ is the remainder of the
matrix. Then the top-leftmost entry in $M^\dagger M$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
M^\dagger M = \left[ \begin{array}{cc} c^\dagger c & \cdots \\ \cdots & \cdots \end{array} \right]. %]]></script>
<p>The normalcy condition $MM^\dagger = M^\dagger M$ then implies that $r
r^\dagger = c^\dagger c$, and thus the length of the first row $r$
must be the same as the length of the first column $c$.</p>
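<p>This row-versus-column observation is easy to check numerically. The sketch below is my own (helper names mine), verifying the claim for every diagonal entry of a sample complex matrix:</p>

```python
# Sketch (mine): diagonal entries of M M† are squared row lengths,
# and diagonal entries of M† M are squared column lengths.

def dagger(M):
    return [[M[k][j].conjugate() for k in range(len(M))]
            for j in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[j][l] * B[l][k] for l in range(len(B)))
             for k in range(len(B[0]))]
            for j in range(len(A))]

M = [[1 + 2j, 3], [0, 1j]]
MMd = matmul(M, dagger(M))
MdM = matmul(dagger(M), M)

for j in range(2):
    row_len_sq = sum(abs(M[j][k]) ** 2 for k in range(2))  # ‖r_j‖²
    col_len_sq = sum(abs(M[k][j]) ** 2 for k in range(2))  # ‖c_j‖²
    assert abs(MMd[j][j] - row_len_sq) < 1e-12
    assert abs(MdM[j][j] - col_len_sq) < 1e-12
```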
<p>The second idea in the proof is to observe that since $M$ is over the
algebraically closed field of complex numbers, the characteristic
equation $|M-\lambda I|=0$ has at least one solution $\lambda$ and so
there is an eigenvalue $\lambda$ and a basis in which $M$ can be
written:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \left[ \begin{array}{cc} \lambda & \cdots \\ 0 & \cdots \end{array} \right]. %]]></script>
<p>But we just saw that normalcy implies the length of the first column
is equal to the length of the first row, so the remaining entries of
the first row must be zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \left[ \begin{array}{cc} \lambda & 0 \\ 0 & \cdots \end{array} \right]. %]]></script>
<p>Recursively applying this to the bottom-right block in the matrix we
can diagonalize $M$. That completes the proof.</p>
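<p>The pivotal step, that normality kills the remainder of the first row, can also be checked concretely. In the following sketch of mine (not from the essay), the top-left entries of $MM^\dagger$ and $M^\dagger M$ for an upper-triangular matrix differ by exactly $\|r\|^2$, so the normalcy condition can hold only if $r = 0$:</p>

```python
# Sketch (mine): for M = [[λ, r], [0, d]], the top-left entries of
# M M† and M† M differ by exactly |r|², so normality forces r = 0.

def dagger(M):
    return [[M[k][j].conjugate() for k in range(len(M))]
            for j in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[j][l] * B[l][k] for l in range(len(B)))
             for k in range(len(B[0]))]
            for j in range(len(A))]

lam, r, d = 2 - 1j, 3 + 4j, 0.5j
M = [[lam, r], [0, d]]

MMd = matmul(M, dagger(M))   # top-left: |λ|² + |r|² (first-row length²)
MdM = matmul(dagger(M), M)   # top-left: |λ|²        (first-column length²)

gap = (MMd[0][0] - MdM[0][0]).real
assert abs(gap - abs(r) ** 2) < 1e-12
print(gap)  # 25.0, i.e. |r|²; the gap vanishes only when r = 0
```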
<p>Alright, so that’s the proof. But that’s not the end of the process. I
then use Anki to go much deeper into the proof; I’ll call this the
(deep) Ankification process. This Ankification process works in
(roughly) two phases.</p>
<p><em>Phase I: understanding the proof:</em> This involves multiple passes over
the proof. Initially, it starts out with what I think of as <em>grazing</em>,
picking out single elements of the proof and converting to Anki
cards. For instance, for the above proof, I have Anki cards like the
following:</p>
<p><em>Q: If $M$ is a complex matrix, how is the top-left entry of $M
M^\dagger$ related to the first row $r$ of the matrix $M$?</em></p>
<p><em>A: It’s the length squared $\|r\|^2$.</em></p>
<p><em>Q: If $M$ is a complex matrix, how is the top-left entry of
$M^\dagger M$ related to the first column $c$ of the matrix $M$?</em></p>
<p><em>A: It’s the length squared $\|c\|^2$.</em></p>
<p>I work hard to restate ideas in multiple ways. For instance, here’s a
restatement of the first question above:</p>
<p><em>Q: If $M$ is a complex matrix, why is the top-left entry of
$MM^\dagger$ equal to the length squared $\|r\|^2$ of the first row?</em></p>
<p><em>A: <script type="math/tex">% <![CDATA[
\left[ \begin{array}{c} r \\ \cdot \end{array} \right]
\left[ \begin{array}{cc} r^\dagger & \cdot \end{array} \right]
= \left[ \begin{array}{cc} \|r\|^2 & \cdot \\ \cdot & \cdot \end{array} \right] %]]></script></em></p>
<p>Indeed, I worked hard to simplify both questions and answers –
the just given question-and-answer pair started out somewhat more
complicated. Part of this was some minor complexity in the question,
which I gradually trimmed down. The answer I’ve stated above, though,
is much better than in earlier versions. Earlier versions mentioned
$M$ explicitly (unnecessary), had more blocks in the matrices, used
$\cdots$ rather than $\cdot$, and so on. You want to aim for the
minimal answer, displaying the core idea as sharply as
possible. Indeed, if it were easy to do I’d de-emphasize the matrix
brackets, and perhaps find some way of highlighting the $r$,
$r^\dagger$ and $\|r\|^2$ entries. Those are the things that really
matter.</p>
<p>I can’t emphasize enough the value of finding multiple different ways
of thinking about the “same” mathematical ideas. Here are a couple more
related restatements:</p>
<p><em>Q: What’s a geometric interpretation of the diagonal entries in the
matrix $MM^\dagger$?</em></p>
<p><em>A: The lengths squared of the respective rows.</em></p>
<p><em>Q: What’s a geometric interpretation of the diagonal entries in the
matrix $M^\dagger M$?</em></p>
<p><em>A: The lengths squared of the respective columns.</em></p>
<p><em>Q: What do the diagonal elements of the normalcy condition
$MM^\dagger = M^\dagger M$ mean geometrically?</em></p>
<p><em>A: The corresponding row and column lengths are the same.</em></p>
<p>What you’re trying to do at this stage is learn your way around the
proof. Every piece should become a comfortable part of your mental
furniture, ideally something you start to really feel. That means
understanding every idea in multiple ways, and finding as many
connections between different ideas as possible.</p>
<p>People inexperienced at mathematics sometimes memorize proofs as
linear lists of statements. A more useful way is to think of proofs is
as interconnected networks of simple observations. Things are rarely
true for just one reason; finding multiple explanations for things
gives you an improved understanding. This is in some sense
“inefficient”, but it’s also a way of deepening understanding and
improving intuition. You’re building out the network of the proof,
making more connections between nodes.</p>
<p>One way of doing this is to explore minor variations. For instance,
you might wonder what the normalcy condition $MM^\dagger = M^\dagger
M$ means on the off-diagonal elements. This leads to questions like
(again, it’s useful to enter many different variations of this
question, I’ll just show a couple):</p>
<p><em>Q: What does the normalcy condition $MM^\dagger = M^\dagger M$ mean
for the $jk$th component, in terms of the rows $r_j$ and columns
$c_j$ of the matrix $M$?</em></p>
<p><em>A: The inner product $r_k \cdot r_j = c_j \cdot c_k$.</em></p>
<p><em>Q: The normalcy condition $MM^\dagger = M^\dagger M$ implies $r_k
\cdot r_j = c_j \cdot c_k$ for rows and columns. What does this mean
for row and column lengths?</em></p>
<p><em>A: They must be the same.</em></p>
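<p>These off-diagonal cards can also be grounded in a quick computation. Here is my own sketch (not from the essay), confirming that the $jk$th entry of $MM^\dagger$ is a row-row inner product and the $jk$th entry of $M^\dagger M$ a column-column inner product:</p>

```python
# Sketch (mine): (M M†)[j][k] is the inner product of row j with
# (conjugated) row k, and (M† M)[j][k] the inner product of
# (conjugated) column j with column k.

def dagger(M):
    return [[M[k][j].conjugate() for k in range(len(M))]
            for j in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[j][l] * B[l][k] for l in range(len(B)))
             for k in range(len(B[0]))]
            for j in range(len(A))]

M = [[1, 2j], [3, 4]]
MMd = matmul(M, dagger(M))
MdM = matmul(dagger(M), M)

j, k = 0, 1
row_inner = sum(M[j][l] * M[k][l].conjugate() for l in range(2))
col_inner = sum(M[l][j].conjugate() * M[l][k] for l in range(2))
assert abs(MMd[j][k] - row_inner) < 1e-12
assert abs(MdM[j][k] - col_inner) < 1e-12
```

<p>Normalcy then says exactly that these two families of inner products agree, entry by entry.</p>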
<p>(By the way, it’s questions like these that make me think it helps to
be fairly mathematically experienced in carrying this Ankification
process out. For someone who has done a lot of linear algebra these
are very natural observations to make, and questions to ask. But I’m
not sure they would be so natural for everyone. The ability to ask the
“right” questions – insight-generating questions – is a
limiting part of this whole process, and requires some experience.)</p>
<p>I’ve been describing the grazing process, aiming to thoroughly
familiarize yourself with every element of the proof. This is useful,
but is also a rather undirected process, with no clear end point, and
not necessarily helping you understand the broader structure of the
proof. I also impose on myself a set of aspirational goals, all
variations on the idea of distilling the entire proof to one question
and (simple) answer. The aim is to fill in the answers to questions
having forms like:</p>
<p><em>Q: In one sentence, what is the core reason a (complex) normal matrix
is diagonalizable?</em></p>
<p>And:</p>
<p><em>Q: What is a simple visual representation of the proof that (complex)
normal matrices are diagonalizable?</em></p>
<p>I think of these question templates as boundary conditions or forcing
functions. They’re things to aim for, and I try to write questions
that will help me move toward answers. That starts with grazing, but
over time moves to more structural questions about the proof, and
about how elements fit together. For instance:</p>
<p><em>Q: How many key ideas are there in the proof that complex normal
matrices are diagonalizable?</em></p>
<p><em>A: Two.</em></p>
<p><em>Q: What are the two key ideas in the proof that complex normal
matrices $M$ are diagonalizable?</em></p>
<p><em>A: (1) Write $M$ in a basis where the first column is all zeroes
except the first entry; and (2) use the normalcy condition to argue
that row lengths are equal to column lengths.</em></p>
<p>The second card here is, in fact, too complicated – it’d be
better to refactor into two or more cards, separating the two ideas,
and sharpening the answers. In general, it’s helpful to make both
questions and answers as atomic as possible; it seems to help build
clarity. That atomicity doesn’t mean the questions and answers can’t
involve quite sophisticated concepts, but they ideally express a
single idea.</p>
<p>In practice, as I understand the proof better and better the
aspirational goal cards change their nature somewhat. Here’s a good
example of such an aspirational card:</p>
<p><em>Q: What is a simple visual representation of the reason that
(complex) normal matrices are diagonalizable?</em></p>
<p><em>A: <script type="math/tex">% <![CDATA[
\left[ \begin{array}{cc} \lambda & r \\ 0 & \cdot \end{array} \right]
\left[ \begin{array}{cc} \lambda^* & 0 \\ r^\dagger & \cdot \end{array} \right] =
\left[ \begin{array}{cc} \lambda^* & 0 \\ r^\dagger & \cdot \end{array} \right]
\left[ \begin{array}{cc} \lambda & r \\ 0 & \cdot \end{array} \right]
\,\, \Rightarrow \,\, |\lambda|^2+r^\dagger r = |\lambda|^2 \,\, \Rightarrow \,\, r = 0. %]]></script></em></p>
<p>This is pretty good – certainly, there’s a sense in which it’s
much better than the original proof! But it’s still somewhat
complicated. What you really want is to feel every element (and the
connections between them) in your bones. Some substantial part of that
feeling comes by actually constructing the cards. That’s a feeling you
can’t get merely by reading an essay, it can only be experienced by
going through the deep Ankification process yourself. Nonetheless, I
find that process, as described up to now, is also not quite
enough. You can improve upon it by asking further questions
elaborating on different parts of the answer, with the intent of
helping you understand the answer better. I <em>haven’t</em> done this nearly
as much as I would like. In part, it’s because the tools I have aren’t
well adapted. For instance, I’d love to have an easy way of
highlighting (say, in yellow) the crucial rows and columns that are
multiplied in the matrices above, and then connecting them to the
crucial inference on the right. But while I can easily imagine
multiple ways of doing that, in practice it’s more effort than I’m
willing to put in.</p>
<p>Another helpful trick is to have multiple ways of writing these
top-level questions. Much of my thinking is non-verbal (especially in
subjects I’m knowledgeable about), but I still find it useful to force
a verbal question-and-answer:</p>
<p><em>Q: In one sentence, what is the core reason a (complex) normal matrix
is diagonalizable?</em></p>
<p><em>A: If an eigenvalue $\lambda$ is in the top-left of $M$, then
normalcy means $|\lambda|^2 + \|r\|^2 = |\lambda|^2$, and so the
remainder $r$ of the first row vanishes.</em></p>
<p>As described, this deep Ankification process can feel rather
wasteful. Inevitably, over time my understanding of the proof
changes. When that happens it’s often useful to rewrite (and sometimes
discard or replace) cards to reflect my improved understanding. And
some of the cards written along the way have the flavor of exhaust,
bad cards that seem to be necessary to get to good cards. I wish I had
a good way of characterizing these, but I haven’t gone through this
often enough to have more than fuzzy ideas about it.</p>
<p>A shortcoming of my description of the Ankification process is that I
cheated in an important way. The proof I wrote above was written
<em>after</em> I’d already gone through the process, and was much clearer
than any proof I could have written before going through the process.
And so part of the benefit is hidden: you refactor and improve your
proof along the way. Indeed, although I haven’t been in the habit of
rewriting the refactored proof after the Ankification process (this
essay is the first time I’ve done it), I suspect it would be a good
practice.</p>
<p><em>The inner experience of mathematics:</em> As I reread the description of
Phase I just given, it is rather unsatisfactory in that it conveys
little of the experience of mathematics one is trying to move
toward. Let me try to explain this in the context not of Anki, but
rather of an experience I’ve sometimes had while doing research, an
experience I dub “being inside a piece of mathematics”.</p>
<p>Typically, my mathematical work begins with paper-and-pen and messing
about, often in a rather <em>ad hoc</em> way. But over time if I really get
into something my thinking starts to change. I gradually internalize
the mathematical objects I’m dealing with. It becomes easier and
easier to conduct (most of) my work in my head. I will go on long
walks, and simply think intensively about the objects of
concern. Those are no longer symbolic or verbal or visual in the
conventional way, though they have some secondary aspects of this
nature. Rather, the sense is somehow of working directly with the
objects of concern, without any direct symbolic or verbal or visual
referents. Furthermore, as my understanding of the objects changes
– as I learn more about their nature, and correct my own
misconceptions – my sense of what I can do with the objects
changes as well. It’s as though they sprout new affordances, in the
language of user interface design, and I get much practice in learning
to fluidly apply those affordances in multiple ways.</p>
<p>This is a very difficult experience to describe in a way that I’m
confident others will understand, but it really is central to my
experience of mathematics – at least, of mathematics that I
understand well. I must admit I’ve shared it with some trepidation; it
seems to be rather unusual for someone to describe their inner
mathematical experiences in these terms (or, more broadly, in the
terms used in this essay).</p>
<p>If you don’t do mathematics, I expect this all sounds rather strange.
When I was a teenager I vividly recall reading a curious letter Albert
Einstein wrote to the mathematician Jacques Hadamard, describing his
(Einstein’s) thought processes. I won’t quote the whole letter, but
here’s some of the flavor:</p>
<blockquote>
<p>The words or the language, as they are written or spoken, do not
seem to play any role in my mechanism of thought. The psychical
entities which seem to serve as elements in thought are certain
signs and more or less clear images which can be “voluntarily”
reproduced and combined… The above-mentioned elements are, in my
case, of visual and some of muscular type. Conventional words or
other signs have to be sought for laboriously only in a secondary
stage, when the mentioned associative play is sufficiently
established and can be reproduced at will.</p>
</blockquote>
<p>When I first read this, I had no idea what Einstein was talking
about. It was so different from my experience of physics and
mathematics that I wondered if I was hopelessly unsuited to do work in
physics or mathematics. But if you’d asked me about Einstein’s letter
a decade (of intensive work on physics and mathematics) later, I would
have smiled and said that while my internal experience wasn’t the same
as Einstein’s, I very much empathized with his description.</p>
<p>In retrospect, I think that what’s going on is what psychologists call
<a href="http://augmentingcognition.com/assets/Simon1974.pdf">chunking</a>. People
who intensively study a subject gradually start to build mental
libraries of “chunks” – large-scale patterns that they recognize
and use to reason. This is why some grandmaster chess players can
remember thousands of games move for move. They’re not remembering the
individual moves – they’re remembering the ideas those games
express, in terms of larger patterns. And they’ve studied chess so
much that those ideas and patterns are deeply meaningful, much as the
phrases in a lover’s letter may be meaningful. It’s why <a href="https://www.youtube.com/watch?v=eNVJFRl6f6s">top basketball
players</a> have extraordinary recall of games. Experts begin to
think, perhaps only semi-consciously, using such chunks. The
conventional representations – words or symbols in mathematics,
or moves on a chessboard – are still there, but they are somehow
secondary.</p>
<p>So, my informal pop-psychology explanation is that when I’m doing
mathematics really well, in the deeply internalized state I described
earlier, I’m mostly using such higher-level chunks, and that’s why it
no longer seems symbolic or verbal or even visual. I’m not entirely
conscious of what’s going on – it’s more a sense of just playing
around a lot with the various objects, trying things out, trying to
find unexpected connections. But, presumably, what’s underlying the
process is these chunked patterns.</p>
<p>Now, the only way I’ve reliably found to get to this point is to get
obsessed with some mathematical problem. I will start out thinking
symbolically about the problem as I become familiar with the relevant
ideas, but eventually I internalize those ideas and their patterns of
use, and can carry out a lot (not all) of the operations inside my head.</p>
<p>What’s all this got to do with the Ankification process? Well, I said
that the only reliable way I’ve found to get to this deeply
internalized state is to obsess over a problem. But I’ve noticed that
when I do the Ankification process, I also start to think less and
less in terms of the conventional representations. The more questions
I write, the more true this seems to be. And so I wonder if the
Ankification process can be used as a kind of deterministic way of
attaining that type of state. (Unfortunately, I can’t get obsessed
with a problem on demand; it’s a decidedly non-deterministic process!)</p>
<p>One consequence of this for the Ankification process is that over time
I find myself more and more wanting to use blank answers: I don’t have
a conventional symbolic or visual representation for the
answer. Instead, I have to bring to mind the former experience of the
answer. Or, I will sometimes use an answer that would be essentially
unintelligible to anyone else, relying on my internal representation
to fill in the blanks. This all tends to occur pretty late in the
process.</p>
<p>Now, unfortunately, this transition to the chunked,
deeply-internalized state isn’t as thorough when I’m Ankifying as it
is when obsessively problem solving. However, I suspect it greatly
enables such a transition. (I rarely obsessively problem solve these
days, so I haven’t yet had a chance to see this happen.) And I do
wonder if there are types of question I can ask that will help me get
more fully to the deeply-internalized state. What seems to be lacking
is a really strongly-felt internalization of the meaning of answers
like that shown above:</p>
<p><em>A: <script type="math/tex">% <![CDATA[
\left[ \begin{array}{cc} \lambda & r \\ 0 & \cdot \end{array} \right]
\left[ \begin{array}{cc} \lambda^* & 0 \\ r^\dagger & \cdot \end{array} \right] =
\left[ \begin{array}{cc} \lambda^* & 0 \\ r^\dagger & \cdot \end{array} \right]
\left[ \begin{array}{cc} \lambda & r \\ 0 & \cdot \end{array} \right]
\,\, \Rightarrow \,\, |\lambda|^2+r^\dagger r = |\lambda|^2 \,\, \Rightarrow \,\, r = 0. %]]></script></em></p>
<p>That type of strongly-felt meaning can, however, be built by using
such representations in many different ways as part of
problem-solving; it builds fluency and familiarity. But I haven’t
actually done it.</p>
<p><em>Phase II: variations, pushing the boundaries:</em> Let’s get back to
details of how the Ankification process works. One way of deepening
your understanding further is to find ways of pushing the boundaries
of the proof and of the theorem. I find it helpful to consider many
different ways of changing the assumptions of the theorem, and to ask
how it breaks down (or generalizes). For instance:</p>
<p><em>Q: Why does the proof that complex normal matrices are diagonalizable
fail for real matrices?</em></p>
<p><em>A: It may not be possible to find an eigenvector for the matrix,
since the real numbers aren’t algebraically closed.</em></p>
<p><em>Q: What’s an example of a real normal matrix that isn’t
diagonalizable by a real orthogonal matrix?</em></p>
<p><em>A: <script type="math/tex">% <![CDATA[
\left[ \begin{array}{cc} 1 & -1 \\ 1 & 1 \end{array} \right] %]]></script></em></p>
<p>As per usual, these questions can be extended and varied in many ways.</p>
<p>Another good strategy is to ask if the conditions can be weakened. For
instance, you might have noticed that we only seemed to use the
normality condition on the diagonal. Can we get away with requiring
$M^\dagger M = MM^\dagger$ just on the diagonal? In fact, some
reflection shows that the answer is no: we need it to be true in a
basis which includes an eigenvector of $M$. So we can add questions
like this:</p>
<p><em>Q: In the proof that normalcy implies diagonalizability, why does it
not suffice to require that $M^\dagger M = MM^\dagger$ only on the
diagonal?</em></p>
<p><em>A: Because we need this to be true in a particular basis, and we
cannot anticipate in advance what that basis will be.</em></p>
<p>Or we can try to generalize:</p>
<p><em>Q: For which fields is it possible to generalize the result that
complex normal matrices are diagonalizable?</em></p>
<p><em>A: [I haven’t checked this carefully!] For algebraically closed
fields.</em></p>
<p>(My actual Anki card doesn’t have the annotation in the last
answer. But it’s true: I haven’t checked the proof carefully. Still,
answering the question helped me understand the original proof and the
result better.)</p>
<p>This second phase really is open-ended: we can keep putting in
variations essentially <em>ad infinitum</em>. The questions are no longer
directly about the proof, but rather are about poking it in various
ways, and seeing what happens. The further I go, and the more I
connect to other results, the better.</p>
<p><em>“The” proof?</em> Having described the two phases in this Ankification
process, let me turn to a few miscellaneous remarks. One complication
is that throughout I’ve referred to “the” proof. Of course,
mathematical theorems often have two or more proofs. Understanding
multiple proofs and how they relate is a good way of deepening one’s
understanding further. It does raise an issue, which is that some of
the Anki questions refer to “the” proof of a result. I must admit, I
don’t have an elegant way of addressing this! But it’s something I
expect I’ll need to address eventually.</p>
<p>A related point is how much context-setting to do in the questions
– do we keep referring, over and over, to “the proof that
$MM^\dagger = M^\dagger M$ implies diagonalizability”, or to “if $M$ is a
complex matrix” (and so on)? In my Anki cards I do (note that I’ve
elided this kind of stuff in some of the questions above), but frankly
find it a bit irritating. However, since the cards are studied at
unknown times in the future, and I like to mix all my cards up in a
single deck, some context-setting is necessary.</p>
<p><em>What have I used this to do?</em> I’ve used this process on
three-and-a-half theorems so far:</p>
<ul>
<li>Complex normal matrices are diagonalizable.</li>
<li>Euler’s theorem that $a^{\phi(n)} \equiv 1 \pmod{n}$ for any integer
$a$ coprime to the positive integer $n$, where $\phi(n)$ is Euler’s
totient function.</li>
<li>Lagrange’s theorem (used in the proof of Euler’s theorem) that the
order of a subgroup of a finite group must divide the order of the
entire group.</li>
<li>I’ve started the process for the fundamental theorem of algebra,
stating that every non-constant polynomial with complex coefficients
has a zero in the complex plane. I was interrupted (I don’t recall
why), and never finished it.</li>
</ul>
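<p>As a quick numerical sanity check on the second theorem above (a sketch of my own, not part of the original Anki process; the helper names are invented), Euler’s theorem can be verified directly for small moduli:</p>

```python
from math import gcd

def phi(n):
    """Euler's totient: the count of 1 <= k <= n coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def check_euler(a, n):
    """Check a^phi(n) == 1 (mod n), assuming gcd(a, n) == 1."""
    return pow(a, phi(n), n) == 1

# Spot-check every coprime pair with modulus up to 50.
assert all(check_euler(a, n)
           for n in range(2, 51)
           for a in range(1, n) if gcd(a, n) == 1)
```

<p>Lagrange’s theorem is at work behind the scenes here: the powers of $a$ form a subgroup of the group of units mod $n$, and that subgroup’s order divides $\phi(n)$.</p>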
<p>It’s quite time-intensive. I don’t have any easy way to count the
number of questions I’ve added for each of these theorems, but I guess
on the order of dozens of cards for each. It takes a few hours
typically, though I expect I could easily add many more questions.</p>
<p>[Note added: in the initial version of this essay I wrote “100 cards
for each”. I looked, and in fact there are fewer – on the order
of dozens, well short of 100. This surprised me – if anything,
I’d have guessed my error was in underestimation. The card-adding
process was intense, however, which perhaps accounts for my badly
mistaken impression.]</p>
<p><em>Seeing through a piece of mathematics:</em> This is all a lot of work!
The result, though, has been a considerable deepening in my
understanding of all these results. There’s a sense of being able to
“see through” the result. Formerly, while I could have written down a
proof that normal matrices are diagonalizable, it was all a bit
murky. Now it appears almost obvious; I can very nearly <em>see</em>
directly that it’s true. The reason, of course, is that I’m far more
familiar with all the underlying objects, and the relationships
between them.</p>
<p>My research experience has been that this ability to see through a
piece of mathematics isn’t just enjoyable, it’s absolutely invaluable;
it can give you a very rare level of understanding of (and flexibility
in using) a particular set of mathematical ideas.</p>
<p><em>Discovering alternate proofs:</em> After going through the Ankification
process described above I had a rather curious experience. I went for
a multi-hour walk along the San Francisco Embarcadero. I found that my
mind simply and naturally began discovering other facts related to the
result. In particular, I found a handful (perhaps half a dozen) of
different proofs of the basic theorem, as well as noticing many
related ideas. This wasn’t done especially consciously – rather,
my mind simply wanted to find these proofs.</p>
<p>At the time these alternate proofs seemed crystalline, almost
obvious. I didn’t bother writing them down in any form, or adding them
to Anki; they seemed sufficiently clear that I assumed I’d remember
them forever. I regret that, for later I did not recall the proofs at
all.</p>
<p>Curiously, however, in the process of writing these notes I have
recalled the ideas for two of the proofs. One was something like the
following: apply the condition $M^\dagger M = MM^\dagger$ directly to
the upper triangular form $M = D+T$ where $D$ is diagonal and $T$ is
strictly upper triangular; the result drops out by considering the
diagonal elements. And another was to apply the normalcy condition to
the singular value decomposition for the matrix $M$; the proof drops
out immediately when the singular values are distinct, and can be
recovered with a little work when the singular values are not.</p>
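<p>The first of these recalled proofs can be made concrete in the $2 \times 2$ case (a toy check of my own, not from the original essay): for upper triangular $M$ with diagonal entries $d_1, d_2$ and off-diagonal entry $t$, the $(1,1)$ entry of $M^\dagger M$ is $|d_1|^2$, while that of $MM^\dagger$ is $|d_1|^2 + |t|^2$, so normalcy forces $t = 0$:</p>

```python
def dagger(M):
    """Conjugate transpose of a square matrix given as nested lists."""
    n = len(M)
    return [[M[j][i].conjugate() for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def is_normal(M, tol=1e-12):
    """Does M commute with its conjugate transpose?"""
    P, Q = matmul(dagger(M), M), matmul(M, dagger(M))
    n = len(M)
    return all(abs(P[i][j] - Q[i][j]) < tol
               for i in range(n) for j in range(n))

# A nonzero strictly upper triangular part breaks normalcy:
# the (1,1) entries of M†M and MM† differ by |t|^2 = 4.
assert not is_normal([[1 + 1j, 2], [0, 3]])
# With t = 0 the matrix is diagonal, hence normal.
assert is_normal([[1 + 1j, 0], [0, 3]])
```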
<p><em>Simplicity of the theorems:</em> The three-and-a-half theorems mentioned
above are all quite elementary mathematics. What about using this
Ankification process to deepen my understanding of more advanced
mathematical ideas? I’ll certainly try it at some point, and am
curious about the effect. I’m also curious to try the process with
networks of related theorems – I suspect there will be some
surprising mutual benefits in at least some cases. But I don’t yet
know.</p>
<p><em>In what sense is this really about Anki flashcards?</em> There’s very
little in the above process that explicitly depended on me using
Anki’s spaced-repetition flashcards. Rather, what I’ve described is a
general process for pulling apart the proof of a theorem and making
much more sense of it, essentially by atomizing the elements. There’s
no direct connection to Anki at all – you could carry out the
process using paper and pencil.</p>
<p>Nonetheless, something I find invaluable is the confidence Anki brings
that I will remember what I learn from this process. It’s not so much
any single fact, but rather a sense of familiarity and fluency with
the underlying objects, an ability to simply see relationships between
them. That sense does fade with time, but far less rapidly than if I
simply didn’t think about the proof again. That’s a large payoff, and
one that I find makes me far more motivated to go through the
process. Perhaps other people, with different motivations, would find
Anki superfluous.</p>
<p>That said, I do have some sense that, as mentioned earlier, some of
the cards I generate are a type of exhaust, and would be better off
excluded from the process. This is especially true of many of the
cards generated early in the process, when I’m still scratching
around, trying to get purchase on the proof. Unfortunately, also as
mentioned above, I don’t yet have much clarity on which cards are
exhaust, and which are crucial.</p>
<p><em>Can I share my deck?</em> When I discuss Anki publicly, some people
always ask if I can share my deck. The answer is no, for reasons I’ve
explained <a href="http://augmentingcognition.com/ltm.html">here</a>. I must admit,
in the present case, I don’t really understand why you’d want to use a
shared deck. In part, that’s because so much of the value is in the
process of constructing the cards. But even more important: I suspect
a deck of dozens of my cards on the proof above would be largely
illegible to anyone else – keep in mind that you’d see the cards
in a randomized order, and without the benefit of <em>any</em> of the context
above. It’d be an incomprehensible mess.</p>
<p><em>Discovery fiction:</em> I’ve described this Ankification process as a
method for more deeply understanding mathematics. Of course, it’s just
one approach to doing that! I want to briefly mention one other
process I find particularly useful for understanding. It’s to write
what I call <em>discovery fiction</em>. Discovery fiction starts with the
question “how would I have discovered this result?” And then you try
to make up a story about how you might have come to discover it,
following simple, almost-obvious steps.</p>
<p>Two examples of discovery fiction are my <a href="http://www.michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/">essay
explaining how you might have come to invent Bitcoin</a>, and my <a href="http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/">essay
explaining how you might have invented an advanced data structure (the
Bloom filter)</a>.</p>
<p>Writing discovery fiction can be tough. For the theorem considered in
this essay, it’s not at all clear how you would have come to the
result in the first place. But maybe you started out already
interested in $M^\dagger$, and in the question of when two matrices
$A$ and $B$ commute. So you ask yourself: “Hmm, I wonder what it
might mean that $M$ and $M^\dagger$ commute?” If you’re willing to
grant that as a starting point, then with some work you can probably
find a series of simple, “obvious” steps whereby you come to wonder if
maybe $M$ is diagonalizable, and then discover a proof.</p>
<p>Any such “discovery fiction” proof will be long – far longer
than the proof above. Even a cleaned-up version will be – should
be! – messy and contain false turns. But I wanted to mention
discovery fiction as a good example of a process which gives rise to a
very different kind of understanding than the Ankification process.</p>
<p><em>What about other subjects?</em> Mathematics is particularly well suited
to deep Ankification, since much of it is about precise relationships
between precisely-specified objects. Although I use Anki extensively
for studying many other subjects, I haven’t used it at anything like
this kind of depth. In the near future, I plan to use a similar
process to study some of the absolute core results about climate
change, and perhaps also to study some of the qualities of good
writing (e.g., I can imagine using a similar process to analyze the
lead sentences from, say, 30 well-written books). I don’t know how
this will go, but am curious to try. I’m a little leery of coming to
rely too much on the process – creative work also requires many
skills at managing uncertainty and vagueness. But as a limited-use
cognitive tool, deep Ankification seems potentially valuable in many
areas.</p>
<p><a href="https://twitter.com/michael_nielsen">Follow me on Twitter</a></p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>Many thanks to everyone who has talked with me about spaced-repetition
memory systems. Especial thanks to Andy Matuschak, whose conversation
has deeply influenced how I think about nearly all aspects of spaced
repetition. And thanks to Kevin Simler for additional initial
encouragement to write about my spaced repetition practice.</p>
<h3 id="citation-and-licensing">Citation and licensing</h3>
<p><em>In academic work, please cite this as: Michael A. Nielsen, “Using
spaced repetition systems to see through a piece of mathematics”
http://cognitivemedium.com/srs-mathematics, 2019.</em></p>
<p><em>This work is licensed under a Creative Commons
Attribution-NonCommercial 3.0 Unported License. This means you’re free
to copy, share, and build on this essay, but not to sell it. If you’re
interested in commercial use, please contact me.</em></p>

<p><em>What does the quantum state mean?</em> (2018-12-13, http://cognitivemedium.com/qm-interpretation)</p>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$']]},
"HTML-CSS":
{scale: 92},
TeX: { equationNumbers: { autoNumber: "AMS" }}});
</script>
<script type="text/javascript" src="../emm/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<p>By <a href="http://michaelnielsen.org">Michael Nielsen</a>, December
2018</p>
<blockquote>
<p><em>We have always had a great deal of difficulty understanding the</em>
<em>world view that quantum mechanics represents. At least I do, because</em>
<em>I’m an old enough man that I haven’t got to the point that this</em>
<em>stuff is obvious to me. Okay, I still get nervous with it…. You</em>
<em>know how it always is, every new idea, it takes a generation or two</em>
<em>until it becomes obvious that there’s no real problem. I cannot</em>
<em>define the real problem, therefore I suspect there’s no real</em>
<em>problem, but I’m not sure there’s no real problem.</em> – Richard Feynman</p>
</blockquote>
<p>In popular articles about quantum computing it’s common to describe
qubits as having the ability to “be in both $|0\rangle$ and
$|1\rangle$ states at once”, and to say things like “quantum computers
get their power because they can simultaneously be in exponentially
many quantum states!”</p>
<p>I must confess, I don’t understand what such articles are talking
about.</p>
<p>What seems to be implied – it’s rarely spelled out, although
some accounts come close – is that quantum computers work by
preparing a superposition $\frac{1}{\sqrt{2^n}} \sum_x
|x\rangle|f(x)\rangle$, with $x$ varying over possible solutions to
the problem – maybe it’s tours in a travelling salesman problem.
And $f(x)$ is some associated quantity of interest, such as the
distance through the tour. Then, somehow, voila!, you get to read out
the desired answer $f(x)$ from the quantum computer.</p>
<p>The only trouble is that this is <a href="https://arxiv.org/abs/quant-ph/9701001">provably impossible to
do in general, or even just in typical cases</a>.</p>
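<p>A toy simulation makes the difficulty vivid (a hedged sketch of my own; the function names are invented for illustration). Measuring $\frac{1}{\sqrt{2^n}} \sum_x |x\rangle|f(x)\rangle$ in the computational basis collapses it to a single pair $(x, f(x))$, with $x$ uniformly random: no more informative than evaluating $f$ at one random classical input.</p>

```python
import random

def measure_uniform_superposition(f, n):
    """Simulate a computational-basis measurement of
    (1/sqrt(2^n)) * sum_x |x>|f(x)>.  Every branch has
    probability 2^(-n), so the outcome is one uniformly
    random pair (x, f(x)); the other branches are lost."""
    x = random.randrange(2 ** n)
    return x, f(x)

# Pretend f(x) is the length of travelling-salesman tour x.
# One measurement reveals f at a single random x; it does
# not reveal the minimising x.
tour_length = lambda x: (37 * x) % 101
x, fx = measure_uniform_superposition(tour_length, 10)
assert fx == tour_length(x)
```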
<p>What I think is going on is this: when people remark that the state
$0.6|0\rangle+0.8|1\rangle$ is simultaneously $0$ and
$1$, they’re trying to explain the quantum state in terms of classical
concepts they’re already familiar with. That sounds sort of okay at
first, and fills a vacuum of meaning for people unfamiliar with
quantum mechanics. But the more you think about it, the worse things
get. Saying $0.6|0\rangle+0.8|1\rangle$ is
simultaneously $0$ and $1$ makes about as much sense as Lewis
Carroll’s nonsense poem <em>Jabberwocky</em>:</p>
<blockquote>
<p>’Twas brillig, and the slithy toves<br /> Did
gyre and gimble in the wabe:<br /> All mimsy were the borogoves,<br />
And the mome raths outgrabe. <br /> …</p>
</blockquote>
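<p>For contrast, the amplitudes $0.6$ and $0.8$ do carry a precise operational meaning via the Born rule: a computational-basis measurement of $0.6|0\rangle+0.8|1\rangle$ yields $0$ with probability $0.36$ and $1$ with probability $0.64$. A toy sampler (a sketch of my own, not from the essay):</p>

```python
import random

def measure(amp0=0.6, amp1=0.8):
    """Born rule: outcome 0 with probability |amp0|^2,
    outcome 1 with probability |amp1|^2."""
    return 0 if random.random() < abs(amp0) ** 2 else 1

samples = [measure() for _ in range(100_000)]
freq0 = samples.count(0) / len(samples)
# The empirical frequency sits near 0.36 -- not "both at once".
assert abs(freq0 - 0.36) < 0.02
```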
<p>I call the implied way of thinking the “word salad interpretation of
quantum mechanics”. The main (sole?) virtue of the word salad
interpretation is that it does fill a vacuum of meaning. Because it is
a genuinely good question: what does the quantum state mean?</p>
<p>For me, it’s also a deeply uncomfortable question. I genuinely don’t
know the answer, despite having spent tens of thousands of hours
thinking about quantum mechanics. And I cannot, with conviction, tell
you what the quantum state means. It’s frankly a pretty strange
situation.</p>
<p>Now, there are some people who will very confidently tell you that
they “know” the correct way to think about the quantum state. Trouble
is, different people will tell you different things! That includes
deeply knowledgeable experts on quantum mechanics. Individually, each
can sound pretty convincing. But when you get them together in a room,
the result is sometimes some pretty unpleasant conflagrations. I’ve
seen physicists shout at one another over the issue, on more than one
occasion.</p>
<p>I’m not alone in my discomfort with the question. A lot of physicists
respond to this discomfort with a sort of reserved agnosticism. A
pretty common approach is what the physicist David Mermin dubbed the
“shut-up-and-calculate interpretation of quantum mechanics”.</p>
<p>In the shut-up-and-calculate interpretation, you think of the
quantum state as a calculational device. At most you have a sort of
vague meaning in mind, perhaps thinking of the quantum state as being
a bit like a probability distribution over states, but satisfying
slightly different mathematical rules (different for reasons that are
never made quite clear). You become fluent in those mathematical
rules, and use them to solve lots of different problems. Gradually,
you build up a library of higher-order tricks and intuitions,
understanding emergent rules hidden inside the rules of quantum
mechanics – ideas like quantum teleportation, or the no-cloning
theorem, for instance. It’s a very instrumental way of making meaning
of the quantum state.</p>
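<p>The “slightly different mathematical rules” can be pinned down with a two-path example (my own illustration): where classical probabilities for alternative paths add, amplitudes add <em>before</em> being squared, so amplitudes of opposite sign can cancel.</p>

```python
import math

# Two indistinguishable paths to the same outcome.
amp1 = 1 / math.sqrt(2)
amp2 = -1 / math.sqrt(2)   # same magnitude, opposite sign

# Probability-style rule: add the squared magnitudes.
p_classical = abs(amp1) ** 2 + abs(amp2) ** 2   # essentially 1.0

# Quantum rule: add the amplitudes, then square.
p_quantum = abs(amp1 + amp2) ** 2               # exactly 0.0

assert abs(p_classical - 1.0) < 1e-12
assert p_quantum < 1e-12   # destructive interference
```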
<p>As a practical matter, and for students starting out, I’m pretty
sympathetic to adopting the shut-up-and-calculate interpretation, at
least most of the time. It builds up many handy skills, as well as
intuition about how quantum mechanics works. That’s extremely useful
background when investigating interpretational issues.</p>
<p>Why does the meaning of the quantum state matter? Sure, maybe people
would feel better if they had a way of interpreting the quantum state
beyond it being a calculational device. But maybe that’s just an
irrelevant human prejudice. Nature doesn’t need to conform to our
prejudices! But I think there’s a genuine problem here, beyond our
prejudices about what our theories should look like. Quantum mechanics
isn’t a final theory. We don’t have a convincing understanding of the
measurement process in quantum mechanics. Nor do we have a convincing
quantum theory of gravity. And maybe those problems are connected to
having a better understanding of what the quantum state means. In which
case having a better understanding of the quantum state may help in
solving those other problems.</p>
<p>I attributed the term “shut-up-and-calculate” to David Mermin. Mermin
is one of the deepest thinkers about interpretational issues, and he
certainly didn’t intend the term as a compliment! But despite that,
I’m somewhat sympathetic to shut-up-and-calculate not just as a
practical strategy, but also as a strategy for (eventually) better
understanding quantum states.</p>
<p>In particular, the situation reminds me of the study of human
consciousness. Many scientists and philosophers spend a great deal of
time pondering consciousness, writing about the “hard problem of
consciousness” and so on. In the meantime, there’s an army of
scientists doing very plain nuts-and-bolts experiments, trying to
understand all the myriad details of action potentials, neural
circuits, and so on. I suspect the latter group will ultimately make
far more of a contribution to our understanding of consciousness than the
former. Sometimes, when you solve enough tiny problems the big
problems just melt away. And I wonder if the same will be true of the
meaning of the quantum state, that we’ll understand it by gradually
building up our detailed knowledge of quantum mechanics, and
eventually understand things like the interpretation of the quantum
state almost <em>en passant</em>. If that’s the case, then the current lack
of a universally-agreed upon interpretation is a nuisance, and
regrettable, but no more.</p>
<p>My own current preference is thus for the this-is-an-open-problem
interpretation of quantum mechanics: I think we don’t yet have enough
evidence to know, and won’t for decades. I know some readers will
dislike this: they’d much prefer if I shouted with conviction that the
right way to interpret the quantum state is <em>etc.</em> But I don’t know, and
I don’t think anyone else does either. I do have opinions about how to
get to such an interpretation, but will omit them in the interests of
brevity. The main thing I want you to take away from this essay is
that determined agnosticism <em>is</em> a possible approach, and is also
consistent with a deep interest in actually solving the problem.</p>
<p>With all that said, there are people who’ve thought long and hard
about the meaning of the quantum state, and who do have definite
opinions about the right way to think about it. As a starting point, I
recommend reading <a href="/assets/qm-interpretation/Everett.pdf">Hugh
Everett</a> and <a href="https://www.amazon.com/Fabric-Reality-Parallel-Universes-Implications/dp/014027541X">David
Deutsch</a> on the many-worlds interpretation of quantum mechanics; <a href="https://arxiv.org/abs/quant-ph/0205039">Chris Fuchs</a> on the
idea that the quantum state is a state of knowledge; <a href="/assets/qm-interpretation/Bohm1952.pdf">David Bohm</a> on the
idea that it’s a sort of pilot wave, guiding particles in the
system. And, although it’s not exactly an interpretation of the
quantum state, I like <a href="/assets/qm-interpretation/Feynman.pdf">Richard Feynman’s</a>
paper recasting quantum mechanics in terms of (sometimes negative!)
probability distributions, rather than quantum states. Those are just
a few ideas, to give you a sample of some of the (very different)
ideas out there. Many more points of view have been put forward! Be
aware that many of these people disagree (or disagreed, while alive)
strongly with one another. Don’t necessarily expect to solve the
problem yourself – although maybe you will make some
contribution. And do come back to just plain working with the theory,
boots on the ground. No matter how you think about the quantum state,
quantum mechanics is a beautiful theory, and remarkably fun to work
with.</p>
<h3 id="addendum">Addendum</h3>
<p>This essay is a preliminary draft version of some material to be
included in a larger project (joint with Andy Matuschak). My thinking
will almost certainly change! In particular, in this draft I’ve
focused on the agnosticism and shut-up-and-calculate angles. One of my
strongly-held general convictions is that holding uncertainty in your
head is a very underrated skill, and so I’ve emphasized that in this
draft. Still, it’d be better if the draft were more opinionated, and
dug more into specific details. It is, of course, particularly
tempting to get more into the details of different
interpretations. Just maybe we can make some progress …</p>
<p>I wrote the essay with some trepidation. The interpretation of the
quantum state arouses strong passions and, for some reason, often
inspires people who know little of quantum mechanics to strong
convictions; it reminds me of cryptocurrencies in that regard. Past
experience suggests I’ll likely get strongly-worded messages telling
me I’m wrong or ignorant, that the messenger knows the right way to
think (and will fill me in). Such messages are usually
well-intentioned, but I do wish such people would pause a moment.</p>
<h3 id="citation-and-licensing">Citation and licensing</h3>
<p><em>In academic work, please cite this as: Michael A. Nielsen, “What does
the quantum state mean?”,
http://cognitivemedium.com/qm-interpretation, 2018.</em></p>
<p><em>This work is licensed under a Creative Commons
Attribution-NonCommercial 3.0 Unported License. This means you’re free
to copy, share, and build on this essay, but not to sell it. If you’re
interested in commercial use, please contact me.</em></p>

<p><em>In what sense is quantum computing a science?</em> (2018-12-12, http://cognitivemedium.com/qc-a-science)</p>

<p>By <a href="http://michaelnielsen.org">Michael Nielsen</a>, December
2018</p>
<blockquote>
<p><em>In natural science, Nature has given us a world and we’re just to</em>
<em>discover its laws. In computers, we can stuff laws into it and</em>
<em>create a world.</em> – Alan Kay</p>
</blockquote>
<p>Quantum computing originated in the 1980s with several papers that
received little fanfare at the time. Even by the mid-1990s, mentioning
quantum computing to a physicist usually resulted in the question:
“What’s a quantum computer?” Answers would often then be greeted
with: “Isn’t that engineering? What’s it got to do with physics?”</p>
<p>Sometimes, these questions were asked with a large dollop of
chauvinism, implying that engineering is somehow – it was never
quite explained how – a pursuit inferior to physics. But remove
that chauvinism and there’s still an interesting underlying question:
in what sense (if any) can quantum computing be considered a science?
And will it lead to the understanding of important new fundamental
truths about the universe?</p>
<p>The roots of these questions go back much further than quantum
computing. They’re reflective of some broad questions described in
Herbert Simon’s book <a href="https://www.amazon.com/Sciences-Artificial-3rd-Herbert-Simon/dp/0262691914">The
Sciences of the Artificial</a>.</p>
<p>Historically, the earliest sciences studied the natural world:
astronomy, physics, chemistry, and biology. Each took extant natural
systems, and tried to uncover the underlying ideas. But many more
recent sciences study systems made by humans. Examples include
computer science, linguistics, synthetic biology, and economics. While
the corresponding systems were made by humans, they have an
extraordinary, rich structure, unanticipated by the humans who made
them. What Simon means by the sciences of the artificial is the
discovery of this structure, i.e., the discovery of deep ideas and
principles such as the invisible hand, comparative advantage,
public-key cryptography, and so on.</p>
<p>This notion of the sciences of the artificial is particularly striking
in the case of computer science, which <a href="https://www.theatlantic.com/science/archive/2018/11/diminishing-returns-science/575665/">began
with its theory of everything</a>, but which has flourished as we
study the emergent consequences of that theory:</p>
<blockquote>
<p>[C]omputer science began in 1936 when Alan Turing developed the
mathematical model of computation we now call the Turing
machine. That model was extremely rudimentary, almost like a child’s
toy. And yet the model is mathematically equivalent to today’s
computer: Computer science actually began with its “theory of
everything.” Despite that, it has seen many extraordinary
discoveries since: ideas such as the cryptographic protocols that
underlie internet commerce and cryptocurrencies; the never-ending
layers of beautiful ideas that go into programming language design;
even, more whimsically, some of the imaginative ideas seen in the
very best video games.</p>
</blockquote>
<p>I’ve used the term <em>emergent</em> here, a term going back to a famous 1972
article by Phil Anderson, entitled “More is Different”. Anderson
argued for the now-commonplace <a href="#Anderson">(1)</a> point that
there may be many levels of behaviour in systems, with each new level
giving rise to deep new ideas. Just because you know the equations
governing a water molecule does not mean you will understand the
principles governing the crash of ocean waves, or the way a rainbow
arcs across the sky. Anderson’s own field of condensed matter physics
is a fount of examples of emergence, such as superconductivity,
superfluidity, and Bose-Einstein condensation. In each case, there are
multiple emergent levels of behaviour, and beautiful ideas to be
discovered at each level.</p>
<p>A different, though parallel, way of looking at the sciences of the
artificial is as examples of what Simon calls <em>design science</em> <a href="#designscience">(2)</a>. Design sciences are about the
invention of new types of object with new types of behaviour.
Examples of such invention range widely: arabic numerals (in
mathematics); the stealth fighter (in aeronautics); the notion of a
layer in software such as <em>Illustrator</em> (in user interface design);
and homoiconicity (in programming language design). The essence in
each case is that of a new type of object, with new kinds of
behaviour.</p>
<p>A challenge in describing what is meant by a design science is that
examples of genuinely new types of object and behaviour are rarely
clearcut. Arabic numerals drew on earlier numeral systems which
introduced ideas like a place-value system. The first stealth
fighters drew on earlier generations of fighters, some of which
attempted to reduce their radar cross section. And so on. Still, the
stealth fighter was a fundamentally new type of object in that
“invisible on radar” was a primary property. And anyone who has ever
tried to multiply numbers represented in roman numerals won’t need much
convincing that arabic numerals are fundamentally different.</p>
<p>In physics, an example of this design science approach is <a href="https://www.sciencedirect.com/science/article/pii/S0003491602000180">Kitaev’s
notion</a> of a topological quantum computer. This is one of the most
radical new ideas of the past hundred years. Rather than building a
computer out of component parts, the aspiration is to create a novel
phase of matter that wants to compute. Fluids want to flow; solids
want to maintain a stable shape; topological quantum computers want to
compute. Indeed, not only do they want to compute, they want to
quantum compute, and to do so in a way that protects the quantum state
against the effects of noise!</p>
<p>Up to now, physics has for the most part not been a design science.
But my guess is that’s going to change in the coming decades. There
are more and more examples where design seems the right way to think:
topological quantum computers; new designer phases of matter; the <a href="https://arxiv.org/abs/gr-qc/0009013">Alcubierre warp drive</a>
and other designer spacetimes; constructor theory and universal
constructors; programmable matter and utility fog. These are not just
about emergence, traditionally construed. Rather they’re about
designing to a target. Indeed, not just to a target, but conceiving of
entirely new types of target, often even more radical than notions
like a stealth fighter or a homoiconic programming language.</p>
<p>I said above that design sciences are about the “invention” of new
types of object. When writing that sentence I equivocated between
using the term “invention” and the term “discovery”. Neither is quite
right. Invention is accurate in the sense that it’s a creation of the
human mind. But it’s a discovery in the sense that it seems as though
it’s a pre-existing property of the universe. Topological quantum
computers, homoiconicity, stealth, arabic numerals, even the idea of
layers: all have a depth and unitary quality that makes it hard to see
them entirely as <em>ad hoc</em> inventions. It’s true that many details are
<em>ad hoc</em>: the specifics of arabic numerals are obviously not
universal! But if we meet aliens I won’t be surprised to find that
they’ve discovered (and perhaps superseded) many of the same ideas
used in the arabic numerals. Indeed, I won’t be surprised if they’ve
also discovered homoiconicity, topological quantum computing, and
perhaps even something like our conceptions of stealth and the idea of
layers.</p>
<p>So, to come back to the question with which I started: in what sense
is quantum computing a basic science? And in what sense is it about
discovering important new fundamental truths about the universe?</p>
<p>I think the answer is that quantum computing will be in considerable
part a design science <a href="#notjustdesignscience">(3)</a>. That
is, it’ll be about discovering new types of object and behaviour.
This is a point of view that is perhaps unusual, even
idiosyncratic. It will take many decades to tell if I am correct. But
I believe it’s a stimulating point of view, and likely to be correct.</p>
<p>What would it mean for quantum computing to be a design science? We
can get some small insight by asking: how does one invent something
like the arabic numerals? Or concepts like homoiconicity, or layers?
The heuristics of discovery used by the designers behind these are
radically different than the traditional ways physicists
work. Physicists often work from the bottom up, understanding simple
systems, or putting things together in “natural” ways (e.g., by
cooling materials down or heating them up). Routine design work is
somewhat similar, taking extant elements and combining them in
standard ways. But the deepest types of imaginative design are very
different, creating fundamentally new types of objects and new types
of behaviour. I won’t try to enumerate the heuristics behind that kind
of work here (though see <a href="http://cognitivemedium.com/tat/index.html">my earlier
essay</a>). But it’s a very different kind of work than traditional
physics.</p>
<p>This point of view contrasts with the conventional point of view that
says quantum computing will mostly be about finding fast new
algorithms. Certainly, it will <em>in part</em> be about finding new
algorithms. But I don’t think it’s likely to just or even primarily be
about algorithms, any more than classical computing has been. Indeed,
I believe the design of new protocols and new interfaces – the
invention of new types of object and behaviour – has been much
more important in classical computing. And so, perhaps, it may
ultimately be for quantum computing.</p>
<h3 id="critical-addendum">Critical Addendum</h3>
<p>This is a draft written as part of the process of writing a much
longer essay covering a wider array of quantum topics. In that sense
it’s been written as a sort of version 0 of a section of that essay,
with a (hopefully much improved) version 1 to be included in the
longer essay. My main critique of the current draft is that it
struggles to adequately convey what it would mean for quantum
computing to be a design science. The notion of designing radically
new classes of object and behaviour hasn’t made it into popular
culture in any really deep way, and it certainly isn’t part of the
culture of physics. Perhaps what’s needed to make the essay work is a
longer discussion – or, at least, a more compelling discussion!
– of what it would mean for quantum computing to be a design
science.</p>
<p>The other main critique of this version 0 is that it focuses so much
on design science that it doesn’t quite do the job of answering the
underlying question: in what sense will quantum computing be a
science, and address fundamental questions? The design science aspects
may be the most unfamiliar (and so need the most explanation), but
they’re only part of a broader picture, which needs to be painted more
convincingly.</p>
<h3 id="notes">Notes</h3>
<p><a name="Anderson"></a> (1) I presume this broad point of view wasn’t
novel when Anderson wrote his article. Still, Anderson crystallized
the point of view, and provided some beautiful examples and useful
terminology. So it seems reasonable to attribute it to his article.</p>
<p><a name="designscience"></a> (2) My notion of what a design science is
has changed considerably since reading Simon, influenced particularly
by the work of Bret Victor and Lev Vygotsky. Rather than revert to
Simon’s definition, the description that follows is my own current way
of thinking.</p>
<p><a name="notjustdesignscience"></a> (3) Of course, it won’t just be a
design science. Quantum computing has also stimulated lines of enquiry
leading to new work about black holes and quantum gravity. The desire
to build quantum computers has stimulated a tremendous amount of work
understanding how many different types of physical system work, and
how to control them. And once quantum computers have been built, they
will be exceptionally useful as tools of understanding, just as
conventional computers have been. All these activities are science,
and don’t fall squarely under the rubric of design science. Still, as
implied in the main text, over the long run I expect quantum computing
will primarily be a design science, in much the same way as
conventional computing has become a design science.</p>
<h3 id="citation-and-licensing">Citation and licensing</h3>
<p><em>In academic work, please cite this as: Michael A. Nielsen, “In what
sense is quantum computing a science?”,
http://cognitivemedium.com/qc-a-science, 2018.</em></p>
<p><em>This work is licensed under a Creative Commons
Attribution-NonCommercial 3.0 Unported License. This means you’re free
to copy, share, and build on this essay, but not to sell it. If you’re
interested in commercial use, please contact me.</em></p>By Michael Nielsen, December 2018What if we had oracles for common machine learning problems?2018-09-30T00:00:00+00:002018-09-30T00:00:00+00:00http://cognitivemedium.com/what-if<p><em>Rough working notes, musing out loud.</em></p>
<p>Much effort in machine learning and AI research is focused on a few
broad classes of problem. Three examples of such classes are:</p>
<ul>
<li>
<p>Classifiers, which do things like classify images according to their
category, generalizing from their training data so they can classify
previously unseen data in the wild;</p>
</li>
<li>
<p>Generative models, which are exposed to data from some distribution
(say, images of houses), and then build a new model which can
generate images of houses not in the training data. In some
very rough sense, such generative models are developing a theory of
the underlying distribution, and then using that theory to
generalize so they can produce new samples from the distribution;</p>
</li>
<li>
<p>Reinforcement learning, where an agent uses actions to explore some
environment, and tries to learn a control policy to maximize
expected reward.</p>
</li>
</ul>
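<p>To make the first of these problem classes concrete, here’s a toy
classifier – a minimal sketch of my own, using a nearest-centroid rule,
not any particular system from the literature – which generalizes from
labelled training data to previously unseen points:</p>

```python
import numpy as np

# Toy classifier problem: learn from labelled training data, then
# predict labels for previously unseen points "in the wild".
rng = np.random.default_rng(0)
train_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))  # class 0 cluster
train_b = rng.normal(loc=3.0, scale=0.5, size=(50, 2))  # class 1 cluster

# "Training" here is just computing the mean of each class
centroids = np.array([train_a.mean(axis=0), train_b.mean(axis=0)])

def classify(x):
    """Assign x to the class whose centroid is nearest."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(dists))

# Previously unseen points
print(classify(np.array([0.2, -0.1])))  # near cluster 0 -> 0
print(classify(np.array([2.8, 3.1])))   # near cluster 1 -> 1
```

<p>Modern classifiers are vastly more sophisticated, of course, but the
shape of the problem – fit on training data, then generalize to unseen
inputs – is the same.</p>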
<p>These are old problem classes, going back to the 1970s or earlier, and
each has seen tens of thousands of papers. Each of these problem
classes is really beautiful: they’re hard, but not so hard it’s
impossible to make progress; they’re precise enough that it’s possible
to say clearly when progress is being made; they’re useful, and seem
genuinely related to essential parts of the problem of AI.</p>
<p>I occasionally wonder, though, what’s the end game for these problem
classes? For instance, what will it mean if, in some future world,
we’re able to solve the classifier problem perfectly? How much would
that help us achieve the goal of general artificial intelligence? What
else would it let us achieve?</p>
<p>In other words, what happens if you skip over (say) the next few
decades of progress in classifiers, or generative models, or
reinforcement learning? And they become things you can just routinely
do essentially perfectly, perhaps even part of some standard library,
much as (say) sorting routines or random number generation can be
regarded as largely solved problems today. What other problems then
become either soluble, or at least tractable, which are intractable
today?</p>
<p><em>Perfect solutions don’t obviously help, even with closely adjacent
problems:</em> One obvious point is that you can make a great deal of
progress on one of these problems and it doesn’t necessarily help you
all that much even with problems which seem closely adjacent.</p>
<p>For instance, suppose you can classify images perfectly.</p>
<p>That doesn’t necessarily mean that you can solve the image
segmentation problem – identifying the different objects in some
general image.</p>
<p>And even if you can solve the image segmentation problem for static
images, that doesn’t mean you can solve it for video. I’ve watched
(static) image segmentation algorithms run on video, and they can be
remarkably unstable, with objects jumping in and out as we move from
frame to frame. In other words, the identity of an object across
frames is not obviously easy to track, even given perfect
classifiers. For instance, something like one object obscuring another
can cause considerable problems in making inferences about the
identity of the objects in a scene.</p>
<p><em>AI-complete problems:</em> The problem classes described above are in
some sense very natural problems, the kind that would occur to anyone
who thought about things like how humans recognize images, how they
create new images, or how they play games. But you can ask a very
different question, a much more top-down question, which is whether
there is some class of problem which, if you could solve that, would
enable you to build a genuinely artificially intelligent machine as a
byproduct?</p>
<p>This notion is called AI-completeness
(<a href="https://en.wikipedia.org/wiki/AI-complete">Wikipedia entry</a>). According to Wikipedia the term was coined by
the researcher Fanya Montalvo in the 1980s.</p>
<p>It’s interesting to read speculation about what problems would be
AI-complete.</p>
<p>The classic Turing test may be viewed as an assertion that the problem
of passing the Turing test – routinely winning the imitation
game against competent humans – is AI-complete.</p>
<p>Another example which is sometimes given is the problem of machine
translation. At first this seems ridiculous: the best machine
translation services can now do a serviceable job translating many
texts, and yet we’re very unlikely to be close to general artificial
intelligence.</p>
<p>Of course, those services don’t yet do excellent translations. And
some of the problems they face in order to do truly superb
translations are very interesting.</p>
<p>For instance: very good translations of a novel or a poem may require
the ability to track allusions, word-play, contrasts in mood,
contrasts in character, and so on, across long stretches of text. It
can require an understanding of quite a bit about the reader’s state
of mind, and perhaps even very complex pieces of folk psychology
– how the author thought the reader would think about the impact
one character’s changing relationship with a second character would
have on a third character. That sounds very complicated, but is
utterly routine in fiction. Certainly, producing excellent
translations is an extremely difficult problem which requires enormous
amounts of understanding.</p>
<p>That said, I’m not sure machine translation is AI-complete. Even if a
machine translation program did all those things, it’s not obvious you
can take what is learned and use it to do other things. This is
evident for certain tasks – learning to do machine translation,
no matter how well, probably will only help a tiny bit with (say)
robotics or machine vision. But I think it may be true even for
problems which seem much more in-domain. For example, suppose your
machine translation system can prepare first-rate translations of
difficult math books. It might be argued that there is some sense in
which they are truly <em>understanding</em> the mathematics. But even if
that’s the case – and it’s not obvious – that
understanding may not be accessible in other ways.</p>
<p>To illustrate this point, let’s grant, for the sake of argument, that
the putative perfect math-translation system really does understand
mathematics deeply. Unfortunately, that doesn’t imply we can make use
of that understanding to do other things. It doesn’t mean we can ask
questions of the system. It doesn’t mean the system can prove
theorems. And it doesn’t mean the system can conjecture new theorems,
conjure up new definitions, and so on. Much of the relevant
understanding of mathematics may well be available inside the
system. But it doesn’t know how to utilize it. Now, it’s potentially
the case that we can use some kind of transfer learning to make it
significantly <em>easier</em> to solve those other problems. But that’d need
to be established in any given context.</p>
<p>For these reasons, I’m skeptical that narrowly-scoped AI-complete
problems exist.</p>
<p><em>Summary points</em></p>
<ul>
<li>
<p>A useful question: given the black-box ability to train a perfect
classifier (or generative model or reinforcement learning system or
<em>[etc]</em>), what other abilities would that give us? I am, I must
admit, disappointed in my ability to give interesting answers to
this question. Worth thinking more about.</p>
</li>
<li>
<p>The Turing Test as an assertion that the Imitation Game is
AI-complete.</p>
</li>
<li>
<p>No narrowly-scoped problem can be AI-complete. The trouble is that
if it’s narrowly scoped then while the system may in some sense have
a deep internal understanding, that doesn’t mean that understanding
can be used to solve other problems, even in closely-adjacent areas.
Put another way: there is still a transfer learning problem, and
it’s not at all obvious that problem will be easy. Put still
another way: interface matters.</p>
</li>
</ul>Rough working notes, musing out loud.The varieties of material existence2018-09-19T00:00:00+00:002018-09-19T00:00:00+00:00http://cognitivemedium.com/varieties-of-material-existence<p>By <a href="http://twitter.com/michael_nielsen">Michael Nielsen</a></p>
<p><em>Status: Rough and speculative working notes, very quickly written
– basically, a little raw thinking and
exploration. Knowledgeable corrections welcome!</em></p>
<p>William James wrote a book with the marvellous title “The
Varieties of Religious Experience”. I like the title because it
emphasizes just how many and varied are the ways in which a human
being can experience religion. And it invites followup questions, like
how aliens would experience religion, whether other animals could have
religious experiences, or what types of religious experience are
possible in principle.</p>
<p>As striking as are the varieties of religious experience, they pale
beside the variety of material <em>things</em> that can possibly exist in the
universe.</p>
<p>Using electrons, protons, and neutrons, it is possible to build: a
waterfall; a superconductor; a living cell; a Bose-Einstein
condensate; a conscious mind; a black hole; a tree; an iPhone; a
Jupiter Brain; a working economy; a von Neumann replicator; an
artificial general intelligence; a Drexlerian universal constructor
(maybe); and much, much else.</p>
<p>Each of these is astounding. And they’re all built from arrangements
of electrons, protons, and neutrons. As many people have observed,
with good enough tweezers and a lot of patience you could reassemble
me (or any other human) into a Bose-Einstein condensate, an iPhone, or
a black hole.</p>
<p>We usually think of all these things as separate phenomena, and we
have separate bodies of knowledge for reasoning about each. Yet all
are answers to the question “What can you build with electrons,
protons, and neutrons?”</p>
<p>For the past decade or so, when friends ask me what is the most
exciting thing happening in science, one of the subjects I often
burble about excitedly is quantum matter – very roughly, the
emerging field in which we’re engineering entirely new states of
matter, with intrinsically quantum mechanical properties. It turns out
there are far more types of matter, with far weirder properties, than
people ever dreamed of.</p>
<p>I’m not an expert on quantum matter, I only follow it from afar. Yet
what I see makes me suspect something really profound and exciting is
going on, something that may, in the decades and centuries to come,
change our conception of what matter is.</p>
<p>Furthermore, it seems to me that many other very interesting nascent
ideas have a similar flavour: things like programmable matter, smart
dust, utility fog, synthetic biology, and so on. In a detailed
technical sense these are very different from the work on quantum
matter (though there are likely overlaps). But in some broader sense
all smell like things that might change our conception of what matter
is.</p>
<p>Because of this, I decided to write some quick notes about how we
think about matter, and what it might be possible to build. It’s a
brain dump of questions for myself, ideas, and pointers, basically
just me thinking out loud, trying to reduce some of my confusion, and
increase my understanding.</p>
<p><em>On the phrase “state (or phase) of matter”:</em> This phrase
has a technical meaning in physics, coming from the theory of
statistical mechanics. In that technical sense, solids, liquids, and
gases are all states of matter (as are superconductors, superfluids,
and numerous other more exotic phases), while things like life or
consciousness or universal computers are not.</p>
<p>Of course, there’s an everyday sense in which something like life
(etc) <em>is</em> a state of matter. To resolve the ambiguity, I’ll use the
phrase “phase of matter” for the physicist’s specific
meaning. And I’ll use the phrase “state of matter” for the
broader sense. I’m interested in both in these notes – I’m not
just interested in new phases of matter, I’m interested in what new
states of matter are possible, broadly speaking.</p>
<p><em>The flux in “phases of matter”:</em> Actually, there’s a
further issue: the meaning of “phase of matter” is in flux
amongst physicists themselves. In the 20th century a pretty good
theory of phases of matter was developed, by Landau, Wilson, Fisher,
Kadanoff, and others. Circa 1980 physicists “knew” what a
phase of matter was. And then things became very exciting, with the
discovery of the Haldane model, the AKLT model, and, especially,
fractional quantum Hall systems. These all showed new phases of
matter, but didn’t fit within the Landau-Wilson <em>et al</em>
understanding. Instead, in the decades since we’ve been trying to
figure out the right way of understanding these new ideas. It turns
out that there are many new “topological” phases of
matter, and we’re just at the beginning of understanding them. We
<em>don’t</em> yet have a good understanding. Even the basic theory and
questions are unclear at this point.</p>
<p><em>What are the most interesting states of matter which have not yet
been imagined?</em> It’s remarkable that human consciousness, universal
computing, superconductors, fractional quantum Hall systems (etc) are
all pretty recent arrivals on planet Earth. Each is an amazing step, a
qualitative change in what is possible with matter. What other states
of matter are possible? What qualitatively new types of phenomena are
possible, going beyond what we’ve yet conceived? Can we invent new
states of matter as different from what came before as something like
consciousness is from other states of matter? What states of matter
are possible, in principle? In a sense, this is really a question
about whether we can develop an overall theory of design.</p>
<p><em>How were the most interesting states of matter created or first
conceived?</em> There are a few common mechanisms: extremizing physical
quantities (black holes, Bose-Einstein condensates, superconductors);
evolution (cells, higher forms of life, consciousness, many forms of
technology, including the iPhone); asking fundamental questions
(universal computers, Drexlerian universal constructors, the Utility
Fog). Design and engineering sometimes play a role, although often as
part of a larger evolutionary process (e.g., you can view the iPhone
as the outcome of a 30+ year-long combination of imaginative design
and memetic, market-driven evolution). More recently, some of the most
interesting work on quantum matter has this flavour – people
like Kitaev, Haldane <em>et al</em>.</p>
<p>(I wish I could be more precise about: “asking fundamental
questions”. There’s lots of fundamental questions which don’t give
rise to ideas like this. But I can’t immediately think of a better
characterization.)</p>
<p><em>What phase of matter is life?</em> It bugs me that I don’t have a really
good answer to this question. Informally, we often think of human
bodies as solids. Certainly, in many everyday respects they behave
much more like solids than they do like liquids or gases, although
they tend to be rather squishy, and there are important exceptions
(like blood, tears, etc). Of course, we’re <a href="https://en.wikipedia.org/wiki/Body_water">filled up</a> with liquid
water! But those liquids are hidden away behind membranes, like the
cytosol inside the cell membrane. Even human bone contains quite a lot of
water.</p>
<p>Much of my confusion is because the standard classification of matter
into phases relies on that matter being at (or near) thermodynamic
equilibrium. Parts of the human body are near thermodynamic
equilibrium. But much is not. The thing that makes it all go, that
makes life life – our metabolism – is all about energy
flows that keep things away from equilibrium.</p>
<p>Unfortunately, I also don’t understand very well when a physical
system should be at thermodynamic equilibrium. The standard story we
teach undergraduates is that if you put a macroscopic system in
contact with a large heat bath, then over time it will gradually
equilibrate.</p>
<p>That’s not a very good story.</p>
<p>Human beings are in contact with a large heat bath – our
external environment is a pretty good approximation to one.
us remain stubbornly away from equilibrium. Certainly, this is true
while swimming in the ocean! (Though swim in water that’s too cold for
too long, and you will eventually equilibrate in a most unpleasant
fashion.)
<p>Put another way, life seems to be a <em>system designed to resist
equilibrium</em>. And yet at the same time it’s also a <em>system designed to
be (surprisingly) stable in important ways</em>.</p>
<p>Except: that also is only partially true! In fact, much of our body
structure <em>is</em> at (or near) equilibrium – much of the fluid,
much of our bone structure, and so on. My guess is that many of the
essentially fixed, static structures in our body are near enough to
equilibrium.</p>
<p>So my very rough picture is that a (living) human body is a system
with the following properties:</p>
<ul>
<li>
<p>Many static components which are near thermodynamic
equilibrium. These are important structural components in the whole.</p>
</li>
<li>
<p>Many energy flows and dynamic components which are far away from
thermodynamic equilibrium (and sometimes driving movement of static
components, too).</p>
</li>
<li>
<p>Despite not being at equilibrium, the system is surprisingly
stable. Scratch your knee or injure a muscle and the injury will
(largely) heal itself. The immune system can fight off many
invaders. Many of the systems in our body are surprisingly
resilient and stable over time. In particular, we have systems which
keep us away from equilibrium in very specific ways.</p>
</li>
</ul>
<p>A big part of the reason this question bothers me is because I have
two broad (and very different) frameworks for thinking about matter.</p>
<p>One of those frameworks is equilibrium statistical mechanics. This is
the framework used by physicists to think about the different phases
of matter, and (often) by chemists and materials scientists to think
about what new materials are possible. It’s a powerful framework, and
most stable matter in the world is of this type.</p>
<p>However, many of the most interesting systems – including
universal computers, conscious minds, cells, economies, and others
– don’t fit well into this framework. Rather, they have the
three properties described above: many static components near
thermodynamic equilibrium; many energy flows and dynamic components
far from equilibrium; and surprising stability and resilience, often
with built in self-healing or error-correction mechanisms.</p>
<p>What, if anything, is the takeaway from all this? Here’s a few
tentative points and questions:</p>
<ul>
<li>
<p>It may be useful to think of “resilient matter” as the
overall class here – types of matter which can be stable
enough that it makes sense to think of objects at all. And that
class can be divided into two types: the stable classes which arise
out of statistical mechanics (equilibrium physics + renormalization
group => appropriate phase of matter); and the stable classes which
arise in some other way (e.g., an immune system, or other types of
built in error-correction and self-healing).</p>
</li>
<li>
<p>Is there a good unified way of thinking about these two approaches
to building resilient classes of matter?</p>
</li>
<li>
<p>Interesting things often happen when you try to move from one domain
into the other. For instance, Kitaev’s ideas about naturally
fault-tolerant quantum computation involved replacing complex
designed forms of error-correction with error-correction that occurs
naturally as a consequence of certain thermal processes. Ideas like
designing a system whose ground state is a quantum error-correcting
code are steps in merging the two domains.</p>
</li>
<li>
<p>Put another way, a good generative question given a designed system
or process may well be: can we find a system in which this same
process occurs intrinsically as a consequence of thermal relaxation?</p>
</li>
</ul>
<p><em>Why is this so disreputable?</em> Something interesting about many of the
ideas I’ve described is that they are (or were) a little
disreputable. Universal constructors, artificial general intelligence,
quantum computers, Jupiter Brains, and so on – all have gone
through periods when they were not regarded as serious subjects.</p>
<p>One interesting example is Eric Drexler’s writing on
nanotechnology. He wrote a <a href="http://e-drexler.com/d/06/00/EOC/EOC_Table_of_Contents.html">remarkable book</a> in 1986. This book has
an interesting status among scientists. For many it’s too far-out,
beyond-the-pale speculation, not backed up by any serious chemistry, a
form of science fiction. At the same time it seems pretty clear to me
that Drexler has helped set the agenda for what many of those people
dream about. Basically: ubiquitous, scalable, rapid, programmable,
atomically precise engineering of atomic systems, and a legitimization
of the question: what could we build if this were all possible and
inexpensive?</p>
<p>There’s a funny thing about norms here. I think it’s pretty common
that two communities, A and B, will do a body of work on overlapping
subjects. Community B will borrow a lot of ideas and inspiration from
Community A. Yet it will feel embarassed to be doing so, and will
often deny doing so, since Community A isn’t playing by what Community
B has internalized as the correct rules. But those very same rules
actually prevented Community B from seeing the things that Community A
saw. I think this is what happened with nanotechnology, and it’s a
common dynamic in all of human life.</p>
<p>(Related: the futurist Peter Schwartz’s observation that the great
thing about being a science fiction writer is that you get to
determine what the <em>next</em> generation of scientists and engineers will
dream of making.)</p>
<p>There are exceptions. Prestigious enough individuals get something of
a pass. Richard Feynman wrote pieces about <a href="assets/matter/Feynman1959.pdf">nanotechnology</a>
and <a href="assets/matter/Feynman1982.pdf">quantum computing</a>, and those were taken much more seriously
than they might otherwise have been (and eventually held up as
validating the fields) <em>because</em> it was Feynman. But even in those
essays, Feynman is somewhat apologetic – he knows he’s doing
something not regarded as entirely okay by his community of peers.</p>
<p>Of course, I’m not immune to this feeling. I feel somewhat embarrassed
thinking in this speculative mode. And yet the question is an
important one: what fundamentally new modes of matter might it be
possible to create? And it’s worth spending at least a little time
exploring the question, from a variety of speculative points of view.</p>
<p><em>What could designer matter mean?</em> One natural and pretty common
conception is that it means the ability to reconfigure shape in real
time. This is central to concepts such as the <a href="assets/matter/Hall1993.pdf">Utility Fog</a>, much
of the work of the <a href="https://tangible.media.mit.edu/">Tangible Media Group</a>, DARPA’s program on
programmable matter (<a href="assets/matter/DARPA2006.pdf">e.g.</a>), and others. I’m fascinated, though, by
questions which go beyond reconfiguring shape and basic quantities
such as density. Ideally, you’d like to be able to program <em>all</em>
macroscopic quantities, things like thermal and electrical
conductivity, brittleness, elasticity, ductility, and so on. How wide
a range of parameters is in principle possible?</p>
<p>It seems likely that, unlike in computation, it’s not possible to
design a single substrate which can reconfigure itself across the
entire possible range for these macroscopic quantities. But you might
be able to design a substrate factory which could, upon being given
specifications for a desired substrate’s range of possible properties,
say whether or not such a substrate was possible, and if so
manufacture it. In that sense, a universal substrate would not be
possible, but a universal substrate factory might be.</p>
<p>I’ve listed out a set of macroscopic quantities. But I want to return
again to the question: what is missing from that list of macroscopic
properties? In a Bose-Einstein condensate the macroscopic property is
the (non-zero!) fraction of particles all simultaneously occupying the
ground state(!); this type of property could perhaps (just) barely
have been conceived 100 years ago, and it certainly couldn’t even have
been conceived 200 years ago. Presumably there are many, many such
properties still waiting to be discovered. What fundamental new types
of property of matter are possible? Apart from the historical
strategies described above, I have few ideas for how to answer that
question!</p>
<ul>
<li>To read: on magnetoresistance (and related effects, like giant
magnetoresistance), where an externally applied magnetic field can
be used to change the resistance of a material.</li>
</ul>
<p><em>Universality in electrostatics:</em> It’s easy to design a programmable
device which is universal for electrostatics in any given closed
region of space. You need two abilities: (1) the ability to create
arbitrary charge densities within the region; and (2) a set of
electrodes bounding the space, to which can be applied arbitrary
potentials. Standard results about boundary-value problems then imply
that both: (1) the electric field is completely determined within the
region; and (2) any electric field which is possible in electrostatics
may be created in this way. It should, in fact, be relatively easy to
build a crude prototype for such a system, although of course there
will be limits on the achievable charge densities and potentials. (I
wouldn’t be surprised if this was routine, and I simply don’t know the
name of this type of device.)</p>
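<p>The underlying mathematics is a boundary-value problem for the Poisson
equation $\nabla^2 \phi = -\rho/\epsilon_0$: fixing the charge density in
the interior and the potential on the boundary determines the field
everywhere inside. Here is a minimal numerical sketch (Jacobi relaxation
on a 2D grid; illustrative only, not a model of any actual device):</p>

```python
import numpy as np

def solve_poisson(rho, boundary, h=1.0, eps0=1.0, iters=5000):
    """Jacobi relaxation for the 2D Poisson equation lap(phi) = -rho/eps0,
    with the potential held fixed on the boundary of the grid."""
    phi = boundary.astype(float).copy()
    for _ in range(iters):
        # each interior point relaxes toward the average of its
        # neighbours, plus the local charge source term
        phi[1:-1, 1:-1] = 0.25 * (
            phi[:-2, 1:-1] + phi[2:, 1:-1] +
            phi[1:-1, :-2] + phi[1:-1, 2:] +
            h * h * rho[1:-1, 1:-1] / eps0
        )
    return phi

n = 33
rho = np.zeros((n, n))       # no interior charge
boundary = np.zeros((n, n))
boundary[0, :] = 1.0         # one electrode held at unit potential
phi = solve_poisson(rho, boundary)
```

<p>With no interior charge and one side at unit potential, a symmetry
argument (summing the four rotated problems gives the all-ones problem)
fixes the potential at the centre of the square at approximately $1/4$,
and the maximum principle guarantees all interior values lie between the
boundary extremes.</p>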
<p><em>Miscellaneous ideas, questions, and observations</em></p>
<ul>
<li>
<p>How useful will the immune system be as a source of design or
engineering ideas?</p>
</li>
<li>
<p>Physics will be gradually reinvented as a design science. It’s
notable that computer science <em>began</em> with its theory of everything
(the Turing machine). And yet it still sees a steady stream of
fundamental advances, new types of abstraction, even entirely new
layers of abstraction, and radical reconceptions of the basics. I
think physics will transition to being a similar kind of design
science over the coming decades and centuries.</p>
</li>
<li>
<p>To what extent is it possible to make properties of matter
composable? So, e.g., you design foglets that can be composed to
achieve some density, and those dense super-foglets can be composed
to achieve some ductility? Etc.</p>
</li>
<li>
<p>Is it possible to imagine life inside an exotic phase of matter,
e.g., life evolving inside a superconductor? Frankly, I’m not
entirely sure what this question even means – as I said
earlier, life seems to be intrinsically an out-of-equilibrium
phenomenon. But perhaps it’s possible for something like this to
happen to the same kind of extent as we often think of human bodies
as solid+liquid hybrids. (Dandelion Mane tells me of <em>Dragon’s
Egg</em>, a novel set on the surface of a neutron star.)</p>
</li>
<li>
<p>Observation: a <em>lot</em> of people are working on quantum matter, and a
great deal is known. To do striking work, you’d need to bring in
some very interesting external ideas.</p>
</li>
<li>
<p>That said, it’s clear there is extraordinary power in the design of
simple, “unrealistic” model systems in quantum
matter. Renormalization and universality means there often are real
systems which exhibit very similar behaviour. So getting a picture
of the zoo of basic model systems may well be extremely
valuable. And developing some skill as a designer of such systems
also seems fun. What design principles are there?</p>
</li>
<li>
<p>It’s notable that engineering conceptions of programmable matter
tend to emphasize actuators, sensors, communication, and power. A
physics conception tends to focus more on physical properties like
density, elasticity, and so on. I’m not sure what this means –
I just wonder about the different cultures present in thinking about
this kind of problem, and the benefits of pushing those cultures up
against one another.</p>
</li>
<li>
<p>To what extent does the notion of fundamental particles even make
sense? It’s extremely common for a theory to have two or more
(equivalent) descriptions in terms of <em>different</em> sets of basic
particles or fields. E.g., the use of
the <a href="http://michaelnielsen.org/blog/archive/notes/fermions_and_jordan_wigner.pdf">Jordan-Wigner transform</a> shows that there is an equivalence
between certain spin chains and systems of free Fermi particles.
The answer to the question “Is the system really a set of spins or a
set of free fermions?” is ambiguous. It depends not on properties
<em>intrinsic</em> to the system, but rather on other external systems to
which it is coupled (for, e.g., state preparation and
measurement). This is absolutely remarkable! It means the question
“what is this system made of?” in some sense <em>depends on the other
systems which interact with it</em>, that is, is not entirely an
intrinsic property of the system itself. Change those other systems,
and there may be a sense in which you change what the system is
built of.</p>
</li>
<li>
<p>To drive this point home, suppose you worked very hard to build a
spin chain which had such a “reinterpretation” in terms of free
Fermions. It’s tempting to think of this reinterpretation as merely
a convenience, or fortuitous coincidence. But then someone hands
you a measurement probe which couples to degrees of freedom in the
Fermi gas, and perhaps allows you to control those degrees of
freedom, reset them, etc. The more powerful and flexible the probe,
the more you’d start to think of the system as “really” being made
of fermions.</p>
</li>
<li>
<p>It’s conventional to write down the action for physics in terms of
the familiar particles and fields – electrons, photons,
quarks, and so on. I wonder, though, what equivalent quasiparticle
descriptions are possible? Maybe this is a silly question, or
obviously not possible, at least for the standard model. But that’s
not at all obvious to me. And if some other quasiparticle
description is possible, then I can imagine doing physics in other
phases of matter where it wasn’t “natural” to discover electrons,
photons, etc, but rather we would naturally discover a very
different set of basic particles and fields. (It was this thought
that motivated me to wonder about life native to other phases of
matter.)</p>
</li>
<li>
<p>Related: the work of Xiao-Gang Wen, e.g. <a href="https://arxiv.org/abs/cond-mat/0404617">this paper</a>, and many
others.</p>
</li>
<li>
<p>What’s the analogue of the Church-Turing thesis for programmable
matter? What’s the analogue of the strong Church-Turing thesis?
Presumably there is some universal factory that can reasonably
efficiently produce near-optimal substrates. What is the nature of
that factory?</p>
</li>
<li>
<p>It’s interesting to think about overarching divisions of matter we
use in the everyday world. Different phases of matter. Living versus
non-living. Conscious versus non-conscious. Systems which process
(or carry) information versus those which do not. When you start to
push hard on the boundaries between these divisions, things get
interesting.</p>
</li>
<li>
<p>I’ve implicitly often made a distinction here between microscopic
and macroscopic scales. I’m uncomfortable with the
dichotomy. Somehow, you want to understand the transition, and
ideally perhaps even have several different layers of intermediate
abstraction.</p>
</li>
</ul>
<p><em>A few things to read, or to read more deeply</em></p>
<ul>
<li>
<p>Some of Kitaev’s early models: <a href="assets/matter/Kitaev2003.pdf">1</a>, <a href="https://arxiv.org/abs/cond-mat/0506438">2</a>.</p>
</li>
<li>
<p>Kitaev and Laumann review on <a href="https://arxiv.org/abs/0904.2771">topological phases and quantum computation</a></p>
</li>
<li>
<p>Kitaev on the Sachdev-Yu-Kitaev (SYK) model, and connections to
holography: <a href="http://online.kitp.ucsb.edu/online/joint98/kitaev/">1</a>, <a href="http://online.kitp.ucsb.edu/online/entangled15/kitaev/">2</a>, <a href="http://online.kitp.ucsb.edu/online/entangled15/kitaev2/">3</a>.</p>
</li>
<li>
<p>Kitaev on a <a href="https://arxiv.org/pdf/0901.2686.pdf">periodic table for topological insulators and superconductors</a>.</p>
</li>
<li>
<p>David Deutsch on <a href="https://arxiv.org/abs/1210.7439">constructor theory</a>.</p>
</li>
</ul>By Michael NielsenRMNIST with annealing and ensembling2017-11-26T00:00:00+00:002017-11-26T00:00:00+00:00http://cognitivemedium.com/rmnist_anneal<p>By <a href="http://twitter.com/michael_nielsen">Michael Nielsen</a></p>
<p>In the <a href="/rmnist">last post</a> I described Reduced MNIST, or RMNIST, a
very stripped-down version of the MNIST training set. As a side
project, I’ve been exploring RMNIST as an entree to the problem of
using machines to generalize from extremely small data sets, as humans
often do. Using just 10 examples of each training digit, in that post
I described how to achieve a classification accuracy of 92.07%.</p>
<p>That 92.07% accuracy was achieved using a simple convolutional neural
network, with dropout and data augmentation to reduce overfitting.</p>
<p>In this post I report the results obtained by using three additional
ideas:</p>
<ol>
<li>
<p>The use of simulated annealing to do hyper-parameter optimization;</p>
</li>
<li>
<p>Voting by an ensemble of neural nets, rather than just a single
neural net; and</p>
</li>
<li>
<p>l2 regularization.</p>
</li>
</ol>
<p>The code is available
in
<a href="https://github.com/mnielsen/rmnist/blob/master/anneal.py">anneal.py</a>.</p>
<p>The experiments in the last post were done on my laptop, using the CPU
– a nice thing about tiny training sets is that you can
experiment using relatively few computational resources. But for
these experiments, it was helpful to use an NVIDIA Tesla P100, run in
the Google Compute cloud. This sped my experiments up by a factor of
about 10.</p>
<p>These changes resulted in an accuracy of 93.81%, a considerable
improvement over the 92.07% obtained previously. I suspect that
further improvements using these ideas, along the lines described
below, will bump that accuracy over 95%, and possibly higher.
Ideally, I’d like to achieve better than 99% accuracy. My guess is
that this would be close to how humans would perform, starting with a
training set of this size.</p>
<h2 id="detailed-working-notes-and-ideas-for-improvement">Detailed working notes and ideas for improvement</h2>
<p>Through the remainder of this post, I assume you’re familiar with the
way annealing works.</p>
<p>The annealing strategy is to make local “moves” in hyper-parameter
space. For instance, a typical move was to increase by 2 the number
of kernels in the first convolutional layer. Another move was to
decrease by 2 the number of kernels. Two more moves were to increase
or decrease the learning rate by a constant factor of
10<sup>¼</sup>.</p>
<p>Overall, the anneal involved modifying four hyper-parameters using
such local moves: the learning rate, the weight decay (for l2
regularization), the number of kernels in the first convolutional
layer, and the number of kernels in the second convolutional layer.</p>
<p>The “energy” associated to hyper-parameter configurations was just the
validation accuracy of an ensemble of nets with those
hyper-parameters. More precisely, I used the negative of the
validation accuracy – the negative since the goal of annealing
is to minimize the energy, and thus to maximize the accuracy.</p>
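The overall structure can be sketched as follows. This is a toy version, not the code in anneal.py; the `moves` list and dict keys (`k1`, `lr`) are hypothetical stand-ins for the moves described above, and in the real anneal the `energy` function trains an ensemble and returns minus its validation accuracy.

```python
import math
import random

def anneal(initial, energy, moves, steps=100, T0=1.0, cooling=0.97):
    """Minimal simulated annealing over a hyper-parameter dict.

    `energy` maps a configuration to the number to be minimized (in
    the post, minus the ensemble's validation accuracy); `moves` is a
    list of functions, each returning a locally modified copy of a
    configuration.
    """
    config, e = initial, energy(initial)
    best, e_best = config, e
    T = T0
    for _ in range(steps):
        candidate = random.choice(moves)(config)
        e_new = energy(candidate)
        # Metropolis rule: always accept improvements; accept
        # worsenings with probability exp(-delta / T).
        if e_new <= e or random.random() < math.exp(-(e_new - e) / T):
            config, e = candidate, e_new
            if e < e_best:
                best, e_best = config, e
        T *= cooling  # gradually lower the temperature
    return best, e_best

# Moves mirroring those described above (hypothetical key names):
moves = [
    lambda c: {**c, "k1": c["k1"] + 2},           # more kernels, layer 1
    lambda c: {**c, "k1": max(2, c["k1"] - 2)},   # fewer kernels, layer 1
    lambda c: {**c, "lr": c["lr"] * 10 ** 0.25},  # raise learning rate
    lambda c: {**c, "lr": c["lr"] / 10 ** 0.25},  # lower learning rate
]
```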
<p>These were first experiments, and it’d likely be easy to considerably
improve the results. To do that, it’d be useful to have monitoring
tools which help us debug and improve the anneal. Such tools could
help us:</p>
<ul>
<li>
<p>Identify which hyper-parameters make a significant difference to
performance, and which do
not. <a href="http://www.jmlr.org/papers/v13/bergstra12a.html">Bergstra and Bengio</a> find
that typically only a few hyper-parameters make much difference.
How can we identify those hyper-parameters and ensure that we
concentrate on those?</p>
</li>
<li>
<p>Identify when we should change the structure of a move. For
instance, instead of changing the number of kernels by 2, perhaps it
would be better to change the number by 5. What step sizes are
best? Should we have a distribution? How sensitive is validation
accuracy to the size of the steps?</p>
</li>
<li>
<p>Identify changes to the way we should sample from the moves. At the
moment I simply choose a move at random. But if statistics are kept
of previous moves, it would be possible to estimate the probability
of a given move improving the validation accuracy, and sample
accordingly. What is the probability distribution with which
particular moves improve the accuracy? What’s a good model for the
size of the expected improvements? These are questions closely
related to the work
of
<a href="http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-l">Snoek, Larochelle, and Adams</a> on
Bayesian hyper-parameter optimization.</p>
</li>
<li>
<p>Identify pairs of moves which work well together. For instance, it
may be that increasing the number of kernels works well <em>provided</em>
the l2 regularization is also increased. But each move on its own
might be unfavourable. Which pairs of moves often produce good
outcomes, even when the individual moves do not? Is it possible for
the annealer to automatically learn such pairs and incorporate them
into the annealing?</p>
</li>
<li>
<p>Identify when we should change the energy scale of the anneal, i.e.,
the effective temperature. A characteristic question here is how
often we accept moves which make the accuracy lower, despite the
fact that a different move would have made the accuracy higher. If
this happens too often it likely means the energy scale should be
made smaller (i.e., the temperature of the anneal should be
decreased).</p>
</li>
<li>
<p>By sampling from the hyper-parameter space can we build a good model
which lets us predict accuracy from the hyper-parameters? And then
use something like gradient ascent to optimize that function?</p>
</li>
</ul>
<p>Each of these ideas suggests good small follow-up projects. Those
projects would be of interest in their own right; I also wouldn’t be
surprised if they resulted in considerable improvement in performance.</p>
<p>Insofar as such tools would change the way we do the anneal, we’d be
doing hyper-parameter optimization optimization.</p>
<p>A few miscellaneous observations:</p>
<p><em>Good performance even with a small number of kernels in the first
layer:</em> I was surprised how well the network performed with just 2 (!)
kernels in the first convolutional layer – it was relatively
easy to get validation accuracies above 93%. What can we learn from
this? What would happen with just 1 kernel? How much is it possible
to reduce the number of kernels in the second convolutional layer? In
a situation where the key problem is overfitting and generalization,
it seems like an important observation that we can get 93% performance
with just 2 kernels.</p>
<p><em>Batch size mattered a lot for speed:</em> As a legacy of my CPU code I
started with a mini-batch size of 10. I changed that to 64, since
increasing mini-batch size often helps with speed, particularly on a
GPU, where these computations are easily parallelized. I was,
however, surprised by the speedup – I didn’t do a detailed
benchmark, but it was easily a factor of 2 or 3. Further
experimentation with mini-batch size would be useful. (Note: I’d
never used the P100 GPU before. I’ve seen speedups with other GPUs
when changing mini-batch size, but I’m pretty sure this is the largest
I’ve seen.)</p>
<p><em>Adding other hyper-parameters:</em> I suspect adding other
hyper-parameters would result in significantly better results. In
rough order of priority, it’d be good to add: initialization
parameters for the weights, different types of data augmentation, size
of the fully-connected layer, the kernel sizes, learning rate decay
rate, and stride length.</p>
<p><em>Understand performance across ensembles of nets:</em> Something I
understand poorly is the behaviour of ensembles of neural nets. What
is the distribution of performance across the ensemble? How much can
aggregating the outputs help? What are the best strategies for
aggregating outputs? How much does it help to increase the size of
the ensemble?</p>
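Two common aggregation strategies can be sketched as follows. This is illustrative only, not the scheme used in anneal.py: soft voting averages each net's class probabilities, while hard majority voting counts each net's predicted labels.

```python
import numpy as np

def ensemble_predict(prob_outputs):
    """Soft voting: average per-net class probabilities, then argmax.

    `prob_outputs` has shape (n_nets, n_examples, n_classes).
    """
    avg = np.mean(prob_outputs, axis=0)
    return np.argmax(avg, axis=1)

def majority_vote(label_outputs):
    """Hard voting: aggregate per-net labels by majority vote.

    `label_outputs` has shape (n_nets, n_examples); ties go to the
    smallest label.
    """
    n_classes = label_outputs.max() + 1
    # Count votes for each class, separately for each example.
    counts = np.apply_along_axis(
        lambda votes: np.bincount(votes, minlength=n_classes),
        0, label_outputs)
    return np.argmax(counts, axis=0)
```

Soft voting is usually the better default when the nets emit reasonably calibrated probabilities, since it preserves information that hard labels throw away.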
<p><em>How stable are the results for large ensembles?</em> The questions in the
last item are all intrinsically interesting. They’re also interesting
for a practical reason: sometimes I found hyper-parameter choices
which did not provide stable performance across repeated training
using those same hyper-parameters. But perhaps with large enough
ensemble sizes that instability could be eliminated. A related point:
I achieved validation accuracies up to 94.39%, but didn’t report them
above, because they were not easy to reproduce while using the same
hyper-parameters.</p>
<p><em>Adding interactivity:</em> Something that’s often frustrating while
annealing is that a question will occur to me, based on observing the
program output, but I have no way to modify the anneal in real time.
It’d be exceptionally helpful to be able to break in, access the REPL,
modify the structure of the anneal, and restart.</p>
<p><em>The addictive psychology of training neural nets:</em> Watching the
outputs flow by – all the ups and downs of performance –
produces a feeling which mirrors the appeal many people (including
myself) feel while watching sport. There’s lots of random
intermittent reward, and the perhaps illusory sense that you’re
watching something important, something which your mind really wants
to find patterns in. Indeed, on occasion you do find patterns, and it
can be helpful. Nonetheless, I wonder if there aren’t healthier ways
of engaging with neural nets.</p>By Michael NielsenReduced MNIST: how well can machines learn from small data?2017-11-15T00:00:00+00:002017-11-15T00:00:00+00:00http://cognitivemedium.com/rmnist<p>By <a href="http://twitter.com/michael_nielsen">Michael Nielsen</a></p>
<p><em>Status: Exploratory working notes. Intended as preliminary
exploration to get familiar with the problem, not as a survey of prior
literature, with which I am only very incompletely familiar. Caveat
emptor.</em></p>
<p>For many years, the MNIST database of handwritten digits was a staple
of introductions to image recognition. Here are a few MNIST training
digits:</p>
<p><img src="/assets/rmnist/digits.png" alt="MNIST digits" /></p>
<p>In recent years, many people have come to regard MNIST as too small
and simple to be taken seriously. It has “only” 60,000 training
images, each 28 by 28 grayscale pixels, and is divided into 10 classes
(0, 1, 2, …, 9). By comparison, modern image recognition systems may
be trained on more than a million full-color, high-resolution images,
with far more classes.</p>
<p>For many applications it’s desirable to train using larger and more
complex data sets. But from a scientific point of view it’s also
extremely interesting to understand how to train machines using small,
simple data sets. After all, human beings don’t need to see 60,000
examples to learn to recognize handwritten digits. Rather, we’re
shown a few examples and rapidly learn to generalize. What principles
underlie that ability to generalize? Can machines learn to generalize
from small data sets?</p>
<p>In these notes, I explore several simple ways of training machine
learning algorithms using tiny subsets of the original MNIST data.
We’ll call these subsets <em>reduced MNIST</em>, or RMNIST. As said in the
introductory note, the notes aren’t at all complete, and I’m certainly
not thoroughly familiar with prior work. Rather, this is me getting
familiar with the problem by doing some basic hands-on work. Frankly,
I also wanted an excuse to experiment with the scikit-learn and
pytorch libraries.</p>
<p>The examples are based on the code in <a href="http://github.com/mnielsen/rmnist">this repository</a>.</p>
<p>Let’s define a few different training data sets. RMNIST/N will mean
reduced MNIST with N examples for each digit class. So, for instance,
RMNIST/1 has 1 training example for each digit, for a total of 10
training examples. RMNIST/5 has 5 examples of each digit. And so on.
When I say MNIST, I mean the full set of images (50,000 in total, once
10,000 are held apart for validation). Here are the digits in
RMNIST/1:</p>
<p><img src="/assets/rmnist/rmnist_1.png" alt="RMNIST/1" /></p>
<p>RMNIST/5:</p>
<p><img src="/assets/rmnist/rmnist_5.png" alt="RMNIST/5" /></p>
<p>And RMNIST/10:</p>
<p><img src="/assets/rmnist/rmnist_10.png" alt="RMNIST/10" /></p>
<p>These data sets are created by the program <a href="https://github.com/mnielsen/rmnist/blob/master/data_loader.py">data_loader.py</a> in the
repository linked above.</p>
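Constructing such a subset is straightforward. Here is a sketch of the idea (the actual construction is in data_loader.py, which may differ in details); like the sets in the post, it selects the N examples per class at random:

```python
import numpy as np

def make_rmnist(images, labels, n, seed=0):
    """Build RMNIST/n: n examples of each digit class, drawn at
    random (without replacement) from the full training set.
    """
    rng = np.random.RandomState(seed)
    idx = []
    for digit in range(10):
        digit_idx = np.flatnonzero(labels == digit)
        idx.extend(rng.choice(digit_idx, size=n, replace=False))
    idx = np.array(idx)
    return images[idx], labels[idx]
```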
<p>Additionally, we’ll use 10,000 images from MNIST as validation data.</p>
<h2 id="baselines">Baselines</h2>
<p>To get some baseline results, we’ll use the
program <a href="https://github.com/mnielsen/rmnist/blob/master/baseline.py">baseline.py</a>. It uses the scikit-learn machine learning
library, which makes it easy to implement the baselines in just a few
lines of Python.</p>
<p>The classifiers we use include support vector machines (SVMs), with
both linear and radial basis function (RBF) kernels. We also use
k-nearest neighbors, decision trees, random forests, and a simple
neural network. For details, see the program <a href="https://github.com/mnielsen/rmnist/blob/master/baseline.py">baseline.py</a>. Results
are shown in the table below. Classification accuracy is reported for
the 10,000 validation examples.</p>
<p>By the way, please don’t regard this as a genuine comparison of the
various techniques. I put minimal effort into configuring these, and
it’s quite likely the poor performance of any given classifier is due
to poor configuration or an error in my understanding, not to a defect
in that type of classifier. These are baselines as a starting point
for later experiments, not serious comparisons.</p>
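With scikit-learn the comparison really is just a few lines. Here is a sketch, using scikit-learn's small built-in digits data as a stand-in for RMNIST, and default parameters that may differ from those in baseline.py:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# The six classifier types compared in the table below.
classifiers = {
    "SVM RBF": SVC(kernel="rbf"),
    "SVM linear": SVC(kernel="linear"),
    "k-NN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "neural network": MLPClassifier(max_iter=1000),
}

# A tiny training set, standing in for RMNIST.
digits = datasets.load_digits()
x_train, x_val, y_train, y_val = train_test_split(
    digits.data, digits.target, train_size=100, random_state=0)

for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    print(f"{name}: {100 * clf.score(x_val, y_val):.2f}%")
```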
<table>
<thead>
<tr>
<th>Data set</th>
<th style="text-align: center">SVM RBF</th>
<th style="text-align: center">SVM linear</th>
<th style="text-align: center">k-NN</th>
<th style="text-align: center">decision tree</th>
<th style="text-align: center">random forest</th>
<th style="text-align: center">neural network</th>
</tr>
</thead>
<tbody>
<tr>
<td>RMNIST/1</td>
<td style="text-align: center">41.85</td>
<td style="text-align: center">41.85</td>
<td style="text-align: center">41.85</td>
<td style="text-align: center">16.13</td>
<td style="text-align: center">41.56</td>
<td style="text-align: center">42.00</td>
</tr>
</tbody>
<tbody>
<tr>
<td>RMNIST/5</td>
<td style="text-align: center">69.73</td>
<td style="text-align: center">69.43</td>
<td style="text-align: center">65.08</td>
<td style="text-align: center">34.09</td>
<td style="text-align: center">65.70</td>
<td style="text-align: center">69.47</td>
</tr>
</tbody>
<tbody>
<tr>
<td>RMNIST/10</td>
<td style="text-align: center">75.46</td>
<td style="text-align: center">75.09</td>
<td style="text-align: center">70.14</td>
<td style="text-align: center">41.09</td>
<td style="text-align: center">72.87</td>
<td style="text-align: center">75.33</td>
</tr>
</tbody>
<tbody>
<tr>
<td>MNIST</td>
<td style="text-align: center">97.34</td>
<td style="text-align: center">94.81</td>
<td style="text-align: center">97.12</td>
<td style="text-align: center">87.51</td>
<td style="text-align: center">88.56</td>
<td style="text-align: center">97.01</td>
</tr>
</tbody>
</table>
<p><br />Except for decision trees, all the classifiers achieved accuracies
above 40% when trained on just a single training digit from each class
(i.e., RMNIST/1). Increase the number of training examples to 5 of
each digit, and the classification performance of several classifiers
rose to near 70%. With 10 of each digit, performance rose to near
75%.</p>
<p>However, all these are still a long way from performance when trained
on the full MNIST training data. There, several of our baselines
achieved performance above 97%. Indeed, state-of-the-art classifiers
trained on MNIST can achieve in the neighbourhood of 99.7% or 99.8%.
That’s human-level performance, since quite a few examples in the
validation data are genuinely ambiguous, and there is arguably no
“true” classification.</p>
<p>Unfortunately, I don’t know how well human beings do when trained
using just a very small number of example digits. As far as I know, the
experiment has never been done. It would certainly be interesting to
find someone who does not know Arabic numerals, and see how well they
could learn to classify such numerals, after being exposed to just a
few examples.</p>
<p>With that said, I believe human beings generalize much better than our
baseline classifiers. Show a small child their first giraffe and they
are likely to do well at identifying later giraffes.</p>
<p>Can we find training strategies which let us get higher classification
accuracies for RMNIST/1, RMNIST/5, and RMNIST/10?</p>
<p>I conjecture that it should be possible to achieve above 95% for
RMNIST/1, and above 99.5% for RMNIST/10 and (perhaps) RMNIST/5, i.e.,
near-human performance from a small handful of training examples.</p>
<p>Let’s see if we can make some progress toward those goals.</p>
<p><em>Spoiler:</em> <em>We won’t get there. But we’ll make some progress.</em></p>
<h2 id="convolutional-network-with-dropout">Convolutional network with dropout</h2>
<p>As a step toward better performance, let’s use a simple convolutional
neural net, with dropout. The use of dropout acts as a regularizer,
reducing overfitting. We can expect this to be particularly important
for very small data sets. And the convolutional nature of the network
reduces the number of parameters, which also helps reduce overfitting.</p>
<p>The convolutional network we’ll try is similar to the
well-known <a href="/assets/rmnist/LeCun1998.pdf">LeNet-5</a> architecture. It uses two
convolutional layers, with pooling, and then two fully-connected layers. For
details see <a href="https://github.com/mnielsen/rmnist/blob/master/conv.py">conv.py</a>.
We achieve classification accuracies of:</p>
<ul>
<li>RMNIST/1: 56.91%</li>
<li>RMNIST/5: 76.65%</li>
<li>RMNIST/10: 86.53%</li>
<li>MNIST: 99.11%</li>
</ul>
<p>We’re doing much better than our simple baselines! But we’re still well short
of where we’d like to be.</p>
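The architecture described above can be sketched in a few lines of pytorch. The layer sizes here are illustrative and may differ from those in conv.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleConvNet(nn.Module):
    """A LeNet-5-style network with dropout: two convolutional layers
    with max-pooling, then two fully-connected layers."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)   # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(20, 40, kernel_size=5)  # 12x12 -> 8x8
        self.fc1 = nn.Linear(40 * 4 * 4, 100)
        self.fc2 = nn.Linear(100, 10)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 20 x 12 x 12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 40 x 4 x 4
        x = x.view(x.size(0), -1)
        x = self.dropout(F.relu(self.fc1(x)))  # dropout as regularizer
        return F.log_softmax(self.fc2(x), dim=1)
```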
<h2 id="algorithmically-expand-the-training-data">Algorithmically expand the training data</h2>
<p>Another idea is to algorithmically expand the training data, by doing
things like making small rotations of the training images, displacing
them slightly, and so on. In some sense this mirrors human learning:
when a human being is shown a digit for the first time they can look
at it from different angles, different distances, different positions
in their field of view, and so on.</p>
<p>As an attempt in that direction, let’s expand the RMNIST data sets by
translating them by ± 1 pixel in both the horizontal and vertical
directions, and again train our convolutional network. The expansion is done
by
<a href="https://github.com/mnielsen/rmnist/blob/master/expand_rmnist.py">expand_rmnist.py</a>.
The resulting performance is:</p>
<ul>
<li>RMNIST/1: 55.25%</li>
<li>RMNIST/5: 84.38%</li>
<li>RMNIST/10: 92.07%</li>
<li>MNIST: 99.34%</li>
</ul>
<p>This helped significantly! In particular, we’ve exceeded 92% for
RMNIST/10. That’s bad compared to modern classifiers trained on the
full MNIST data set, but frankly I’m not absolutely certain a human
child would do much better. However, I certainly suspect a human child
would do better, and I’d very much hope we could do better with our
machine classifiers.</p>
<p>One oddity is that performance on RMNIST/1 is not helped by expanding
the training data. In fact, I did some experiments with translations
of up to ± 2 pixels, and performance on RMNIST/1 was
substantially improved, up to about 60%. But the results on other data
sets weren’t much changed by this further expansion of the training
data. It’d be good to understand this difference.</p>
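The translation expansion can be sketched as follows. This is a toy version of the idea; the post's actual expansion is done by expand_rmnist.py, which may differ in details:

```python
import numpy as np

def expand_by_translation(images, labels, shift=1):
    """Expand a training set with copies translated by +/- `shift`
    pixels vertically and horizontally.

    `images` has shape (n, height, width). Pixels shifted in from the
    edge are filled with zeros (the MNIST background).
    """
    expanded_x, expanded_y = [images], [labels]
    for axis in (1, 2):  # 1: vertical, 2: horizontal
        for s in (-shift, shift):
            shifted = np.roll(images, s, axis=axis)
            # Zero out the rows/columns that wrapped around.
            if axis == 1:
                if s > 0: shifted[:, :s, :] = 0
                else:     shifted[:, s:, :] = 0
            else:
                if s > 0: shifted[:, :, :s] = 0
                else:     shifted[:, :, s:] = 0
            expanded_x.append(shifted)
            expanded_y.append(labels)
    return np.concatenate(expanded_x), np.concatenate(expanded_y)
```

With `shift=1` this turns each training image into five: the original plus four translated copies.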
<p><em>Problem:</em> Can we get further improvement if we expand the training data by
adding some jitter to the intensity of individual pixels?</p>
<p><em>Problem:</em> Can we get further improvement if we add some small
rotations to the training data?</p>
<p><em>Problem:</em> Can we get further improvement if we expand the data using
the transformations in the
paper
<a href="/assets/rmnist/Simard.pdf">Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis</a>,
by Simard, Steinkraus, and Platt (2003)? Note, for instance, the
transformations they introduce intended to mimic the natural jitter
associated to vibrations of hand muscles while writing.</p>
<p><em>Problem:</em> Are there other useful transformations one might perform to
expand the training data? <a href="https://arxiv.org/abs/1711.04340">This</a> is
a fun-looking recent paper.</p>
<h2 id="transfer-learning">Transfer learning</h2>
<p>So far, all our approaches to training start from the RMNIST data
alone. That unfairly disadvantages the computer, since human beings
don’t learn to recognize new image classes from scratch. Rather, they
take advantage of what their minds already know about vision, both
from experience and from evolutionary history.</p>
<p>We can do something similar by taking a neural network trained on some
other task – something not involving MNIST – and trying to
use the knowledge in that network to help us on RMNIST.</p>
<p>This idea is called transfer learning.</p>
<p>There are many approaches to transfer learning. We’ll approach it by
using the pre-trained <a href="https://arxiv.org/abs/1512.03385">ResNet-18</a> network, which is built into
pytorch. ResNet-18 is a deep convolutional neural network, trained on
1.28 million ImageNet training images, coming from 1000 classes. It
has thus learnt an enormous amount about how to classify images in
general, but not about RMNIST in particular.</p>
<p>We’ll take the RMNIST training and validation sets, run them through
ResNet-18, and extract the high-level features in the second-last
layer. The intuition is that these features contain the essential
high-level information about the image, but not unimportant
details. With some luck, these features will help in classifying
RMNIST images.</p>
<p>We generate these training data sets – the high-level features
for RMNIST – using the
program <a href="https://github.com/mnielsen/rmnist/blob/master/generate_abstract_features.py">generate_abstract_features.py</a>. We then
use <a href="https://github.com/mnielsen/rmnist/blob/master/transfer.py">transfer.py</a> to build RMNIST classifiers based on these learnt
features. The classifier we use is a fully-connected neural network
with a single hidden layer containing 300 neurons. Here are the
results:</p>
<ul>
<li>RMNIST/1: 51.01%</li>
<li>RMNIST/5: 72.81%</li>
<li>RMNIST/10: 82.95%</li>
</ul>
<p>We see that transfer learning does give a considerable improvement
over our baseline classifiers. However, it is well below the results
we obtained earlier using our purpose-built convolutional networks.</p>
<p>What happens if we algorithmically expand the training data, as
before, and then apply transfer learning? In that case the results
get a little better, but still don’t do as well as our earlier
convolutional network, even trained without the help of additional
data:</p>
<ul>
<li>RMNIST/1: 52.84%</li>
<li>RMNIST/5: 75.27%</li>
<li>RMNIST/10: 84.66%</li>
</ul>
<p>Of course, this is just one approach to transfer learning. It might be
that other approaches would perform better, and it’d be worth
exploring to find out. Here’s a few ideas in this vein:</p>
<p><em>Problem:</em> Can we improve the classifier used to learn from the
features derived from ResNet-18? In the experiments reported, I just
used the neural net classifier built in to scikit-learn. I did some
less systematic experiments using pytorch, and got to roughly 90%
accuracy on RMNIST/10. It’d be good to investigate this more
systematically.</p>
<p><em>Problem:</em> What if we used networks other than ResNet-18 to do the
transfer learning?</p>
<p><em>Problem:</em> What if we used features from earlier layers in the network
to do the transfer learning?</p>
<p><em>Problem:</em> What if we used the features learned by an
unsupervised network, such as some kind of autoencoder? This has the
advantage that it removes the need for labelled training data.</p>
<p><em>Problem:</em> What if we use an ensembling approach to combine transfer
learning with convolutional networks not using any kind of transfer
learning?</p>
<h2 id="concluding-thoughts">Concluding thoughts</h2>
<p>Our best approach to RMNIST was to use a simple convolutional net with
dropout and algorithmic data expansion. That gave results of 92% on
RMNIST/10, 84% on RMNIST/5, and 55% (60% with more data expansion) on
RMNIST/1.</p>
<p>I expect it’d be easy to drive these numbers much higher just by doing
more experimentation using obvious techniques. Perhaps more fun would
be to explore more radical approaches to achieving high classification
accuracies.</p>
<p>Another fun question is whether we can find <em>super-trainers</em>, i.e.,
small training sets which give rise to particularly good performance?
I chose the data for RMNIST at random from within MNIST. Might it be
possible to choose subsets which result in significantly improved
performance? This seems related to the problem
of <a href="/assets/rmnist/Bengio2009.pdf">curriculum learning</a>.</p>
<p>Even better, might it be possible to artificially synthesize very
small training sets which give rise to particularly good performance?
These would be true super-trainers, canonical examples from which to
learn. It’d be fascinating to see what such super-trainers look like,
assuming they exist.</p>By Michael Nielsen