From charlesreid1


Philip Johnson-Laird is an academic who sits at the intersection of philosophy and psychology. He studies cognition and the inner workings of the brain. My first exposure to his work came through his book "Mental Models," which I used when writing my dissertation to help articulate what, exactly, a model is, and to understand what models can and cannot do.

This book is particularly apt, given the recent resurgence in machine learning and artificial intelligence. When the book was originally published in 1988, the idea of a neural network was still undergoing development, and many foundational ideas are discussed here. The book is not written in the voice of a computer scientist teaching how to do X in Y, and it does not assume the reader can follow graduate-level linear algebra. Rather, it reads like a cognitive scientist carefully devising an experiment to uncover the mechanisms of the brain.

The organization of the book is in six parts, each focusing equally on aspects of how our brains work, and how that can be replicated through computation.

Part 1, Computation and the Mind, starts by talking about the concept of computability, what it means to compute something, and how we might replicate some of the computing functions of the brain. It answers some basic questions that any non-expert would have, like how do you study the mind?

The remaining parts each focus on a particular aspect of our mental machinery:

Part 2: Vision

Part 3: Learning, Memory, and Action

Part 4: Cogitation

Part 5: Communication

Part 6: The Conscious and Unconscious Mind


Part 1: Computation and the Mind

Since Descartes, theorists have assumed that there is no problem in understanding how machines work. Indeed, Lord Kelvin, the eminent Victorian physicist, even turned this argument around, and wrote in a letter to a colleague: "I can never satisfy myself until I can make a mechanical model of a thing. If I can make a mechanical model I can understand it. As long as I cannot make a mechanical model all the way through I cannot understand."

p. 24

On the Meaning of Symbols

Any system of external symbols, such as numerals or an alphabet, is capable of symbolizing many different domains. Thus, the binary numeral 1100 can stand for many things. It may stand for the number twelve, for the letter Z as in Morse code, or for a particular person, artifact, 3D shape, region of the earth's surface, or many other entities. Numerals are potent because they are each distinct from one another, and there is a simple structural recipe for constructing an unlimited supply of them.

Even if a domain contains a potentially infinite number of entities, then a numerical system can be used to symbolize it provided that there is some way to relate the numerals to what they signify. The simplest link is an arbitrary pairing of each symbol to one referent, and each referent to one symbol, as in a numerical code for the rooms of a hotel. A symbol may be well formed, e.g., the Roman numeral XIII, but fail to designate anything (no room with number 13). Rather than arbitrary pairings, it is usually convenient to have some principles for assigning interpretations to symbols. These principles may be a matter of rules, conventions or habits. If symbols are assembled out of primitives according to structural rules, then the structure of the symbol may, or may not, be relevant to its interpretation. A Roman numeral has a structure that is relevant to its interpretation as a number. A pile of sand in an hourglass has a structure that is not relevant to its interpretation as an interval of time - only the volume of sand matters.

p. 31-32

Further reading:

The idea of treating the mind as a symbol-manipulating device can be found in Craik (1943) and in the work of Turing (see Hodges, 1983). Newell and Simon (1976) provide a recent formulation of the concept of a physical symbol system. Different types of symbolic representation are discussed by the philosopher Nelson Goodman (1968). Sutherland and Mackintosh (1971) discuss the ability of animals to learn to discriminate symbols.

p. 35

A grammar is a set of rules for a domain of symbols (or language) that characterizes all the properly formed constructions, and provides a description of their structure. Grammars so defined, as first suggested by the linguist Noam Chomsky, are intimately related to programs...

...the robot that moves in one dimension... Forward Forward Back Forward Forward Back Back Back...

Now we run into a major difficulty. There are no bounds on the number of steps in a journey. Given any acceptable journey, no matter how long, we can always preface it with a step forwards and end it with a corresponding step backwards, and the result will still be acceptable. We can state rules that capture these two possibilities directly. Thus,

3. JOURNEY = Forward JOURNEY Back

4. JOURNEY = Back JOURNEY Forward


p. 47-48
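As a rough illustration of the journeys quoted above, the sketch below checks a sequence of steps using a stack as its only memory. It assumes that an "acceptable" journey is one that brings the robot back to where it started; that reading, and the function name `is_acceptable`, are assumptions made for illustration and are not taken from the book.

```python
def is_acceptable(journey):
    """Return True if every step is eventually undone by an opposite step.

    Assumption: "acceptable" means the robot ends where it started.
    """
    opposite = {"Forward": "Back", "Back": "Forward"}
    stack = []  # steps that have not been cancelled yet
    for step in journey:
        if step not in opposite:
            raise ValueError(f"unknown step: {step!r}")
        if stack and stack[-1] == opposite[step]:
            stack.pop()          # this step cancels the most recent unmatched step
        else:
            stack.append(step)   # remember the step until it is cancelled
    return not stack             # acceptable only if nothing is left over


journey = "Forward Forward Back Forward Forward Back Back Back".split()
print(is_acceptable(journey))
print(is_acceptable(["Forward"] + journey + ["Back"]))  # wrapped as in rule 3
print(is_acceptable(["Forward", "Forward", "Back"]))
```

Wrapping an acceptable journey in a leading Forward and a trailing Back, as rule 3 does, leaves one extra item on the stack for the duration of the journey and removes it at the end, so the result is still accepted under this reading.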

The obvious question is: how can memory be still further improved?

A natural step is to remove the constraint that memory operates like a stack, and to allow unlimited access to any amount of memory.

p. 48

Computability and Mental Processes

Computers work in a very different way from Turing machines: their memories are not just one-dimensional tapes, and they have a much richer set of basic operations. But a computer program is analogous to a particular Turing machine, and the computer is analogous to a universal machine because it can execute any program that is written in an appropriate code. Anything that can be computed by a digital computer can be computed by a Turing machine.

Not everything, however, can be computed. There are many problems that can be stated but that have no computable solution. It is impossible, for example, to design a universal machine that determines whether any arbitrarily selected Turing machine, given some arbitrarily selected data, will come to a halt or go on computing for ever. Hence, there is no test guaranteed to decide whether or not a problem has a computable solution.

p. 51

There are three morals to be drawn for cognitive science.

First, since there is an infinity of different programs for carrying out any computable task, observations of human performance can never eliminate all but the correct theory...

Second, if a theory of mental processes turns out to be equivalent in power to a universal machine, then it will be difficult to refute.

Third, theories of the mind should be expressed in a form that can be modeled in a computer program. A theory may fail to satisfy this criterion for several reasons: it may be radically incomplete; it may rely on a process that is not computable; it may be inconsistent, incoherent, or, like a mystical doctrine, take so much for granted that it is understood only by its adherents. These flaws are not always obvious. Students of the mind do not always know that they do not know what they are talking about. The surest way to find out is to try to devise a computer program that models the theory. A working computer model places a minimal reliance on intuition: the theory it embodies may be false, but at least it is coherent, and does not assume too much. Computer programs model the interactions of fundamental particles, the mechanisms of molecular biology, and the economy of the country. The rest of the book is devoted to computable theories of the human mind.

p. 52

Part Two: Vision

The Visual Image

Consider three beliefs about vision:

  • The eye is like a television camera - you point it at a scene, it registers the scene, and it projects the image inside your head.
  • Vision is impossible. Different arrangements of things can produce the same image, so the brain does not know what particular arrangement you are looking at.
  • Vision is easy for the brain to do, but hard for us to understand.

Three different levels of explanation are needed:

  • Theory of what is computed
  • Theory of how the system carries out computations
  • Theory of underlying neurophysiology (the "hardware")

Three stages of vision:

  • Vision stage 1: grayscale images (brightness value for pixels)
  • Vision stage 2: changes in intensity (gradients between pixels)
  • Vision stage 3: the primal sketch

Locating Gradients in Intensity

Gray-level array has certain amounts of noise - random fluctuations. How to differentiate between small scale changes and large scale changes?

In order to get a sensible measure of where the gradient undergoes major, significant changes, we need to apply a filter, or a spatial average.

Simple technique for reducing noise is to replace each value in the array by its local average - applying a 2D filter to smooth changes in the intensity.

Let's talk more about this filtering concept.

A crude local averaging stencil would just be an even weighted average of neighboring points:

 \frac{\Delta x}{3} \left( x_{i-1} + x_{i} + x_{i+1} \right)

More generally - the left hand rule, right hand rule, midpoint rule approximate the function between two points as a constant (1 unknown), requires 1 point

The trapezoid rule approximates the function between two points as a line (2 unknowns, slope and intercept) and requires 2 points

Can get increasingly better stencils by using things like Simpson's Rule, approximates function over an interval with a quadratic (3 unknowns, 3 coefficients) and requires 3 points

\frac{\Delta x}{2} \left( \frac{1}{3} x_{i-1} + \frac{4}{3} x_{i} + \frac{1}{3} x_{i+1} \right)
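The two stencils above can be tried out on a noisy row of gray levels. The sketch below is a minimal illustration, assuming NumPy is available; it divides each formula by the interval width \Delta x so that the weights sum to 1 and act as a local average. The names `UNIFORM`, `SIMPSON`, and `smooth`, and the test data, are made up for this example.

```python
import numpy as np

# Three-point averaging weights; each set sums to 1, so flat regions are unchanged.
UNIFORM = np.array([1.0, 1.0, 1.0]) / 3.0   # even weighting of the three points
SIMPSON = np.array([1.0, 4.0, 1.0]) / 6.0   # Simpson-style weighting, centre counts most


def smooth(values, weights):
    """Replace each value by a weighted average of itself and its two neighbours."""
    # Repeat the end values so the output has the same length as the input.
    padded = np.pad(np.asarray(values, dtype=float), 1, mode="edge")
    return np.convolve(padded, weights, mode="valid")


rng = np.random.default_rng(0)                       # fixed seed, illustrative data only
row = np.where(np.arange(40) < 20, 10.0, 50.0)       # one dark region, one bright region
noisy = row + rng.normal(0.0, 2.0, size=row.size)    # add random fluctuations
print(np.round(smooth(noisy, UNIFORM), 1))
print(np.round(smooth(noisy, SIMPSON), 1))
```

Both stencils leave a constant region untouched; they differ only in how much weight the centre point keeps relative to its neighbours.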

Applying a filter and removing local irregularities reveals large scale changes. (Another way to think about this: the SPECTRAL content of the image shifts to being larger scale changes.)

To extend this idea to the most general form, can apply averaging operator using a particular weighting function, the Gaussian (normal) distribution.

Once the averaging operator is applied, how to detect intensity boundaries? A simple way to measure the steepness of the gradient is to multiply the left value by -1 and the right value by +1 and sum the results. If there is no gradient, the sum is 0. If there is a gradient, the sum is nonzero, and its size measures how steep the change is: applied across a step in intensity, it gives a spike at the step.

To explore this further, the boundaries between regions of different intensity can be found by calculating the gradient of the gradient and finding where it crosses zero (corresponding to a location where the gradient is locally steepest). The zero-crossing is a strong indicator of a boundary between regions of different intensities.
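A one-dimensional sketch of this procedure, assuming NumPy, is given below: smooth a row of gray levels, take the difference of neighbouring values, take the difference again, and look for a sign change. The row of data, the smoothing weights, and the helper name `zero_crossings` are illustrative choices only.

```python
import numpy as np


def zero_crossings(values):
    """Indices i where values[i] and values[i + 1] have opposite signs."""
    signs = np.sign(values)
    return np.where(signs[:-1] * signs[1:] < 0)[0]


# A row of gray levels with one boundary: dark up to sample 9, bright from sample 10.
row = np.zeros(20)
row[10:] = 1.0

# Local averaging, with end values repeated so the ends do not create false boundaries.
smoothed = np.convolve(np.pad(row, 1, mode="edge"), [0.25, 0.5, 0.25], mode="valid")

gradient = np.diff(smoothed)    # right value times +1 plus left value times -1
curvature = np.diff(gradient)   # gradient of the gradient

peak = int(np.argmax(np.abs(gradient)))   # gradient[i] sits between samples i and i + 1
crossings = zero_crossings(curvature)     # curvature[i] is centred on sample i + 1
print(peak, crossings)
```

The gradient is largest between samples 9 and 10, and the single sign change in the second difference falls at the same place, which is where the dark and bright regions meet.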

It is possible to combine the two operations, local averaging and finding changes in the gradient, into a single operation in 2D. The result is the Mexican Hat filter. It has the shape of the Laplacian of the Gaussian (up to sign and scale), which works for an arbitrary number of dimensions.

Each value in the gray-level array is averaged with its neighbors using the Mexican Hat filter. Weights are positive for very near neighbors (values of near-neighbor points are weighted more heavily), and are negative for distant neighbors. When the Mexican Hat filter is applied to a gray-level image, the result will be a set of positive and negative values, and a resulting zero-crossings map.

Visual filtering is possible by adjusting the width of the Mexican Hat filter. Larger hat extending over many elements reveals gradual changes in intensity over larger areas. It may be useful to use multiple filter sizes to obtain multiple zero-crossings maps for different filter sizes.

NOTE: The gradient of the intensity is equivalent to the first spatial derivative, while the change in that gradient (the gradient of the gradient) is equivalent to the second derivative. The second derivative can be applied in two dimensions isotropically (equally weighted in all directions away from the pixel). This is the Laplacian operator.

The Mexican Hat function is a combination of the Gaussian normal distribution, to smooth the data (importance/weight decreases with distance) and the Laplacian (of the Gaussian). So, if you see reference to the LoG (Laplacian of Gaussian), it's equivalent to the Mexican Hat function.
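To make the Mexican Hat filter concrete, the sketch below samples its weights on a small grid and applies them to an image with a single vertical boundary. It assumes NumPy; the function names `mexican_hat` and `filter2d`, the filter width, and the test image are choices made for this example, and the mean is subtracted from the sampled weights (an extra step, not part of the definition) so that flat regions give exactly no response.

```python
import numpy as np


def mexican_hat(sigma, radius):
    """Sampled Mexican Hat weights: positive centre, negative surround, zero sum."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    u = (x**2 + y**2) / (2.0 * sigma**2)
    kernel = (1.0 - u) * np.exp(-u)   # negative Laplacian of a Gaussian, up to scale
    return kernel - kernel.mean()     # force zero sum so flat regions give no response


def filter2d(image, kernel):
    """Weighted sum over each pixel's neighbourhood (edge values repeated at borders)."""
    r = kernel.shape[0] // 2
    padded = np.pad(np.asarray(image, dtype=float), r, mode="edge")
    out = np.zeros(np.shape(image), dtype=float)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(padded[i:i + 2 * r + 1, j:j + 2 * r + 1] * kernel)
    return out


# Gray-level image: dark left half, bright right half (boundary between columns 7 and 8).
image = np.zeros((16, 16))
image[:, 8:] = 1.0
response = filter2d(image, mexican_hat(sigma=1.5, radius=4))
print(np.round(response[8], 2))
```

Along any row, the response is zero far from the boundary, negative just on the dark side, and positive just on the bright side, so the zero-crossing sits on the boundary itself. A larger `sigma` (with a matching `radius`) spreads the same pattern over more pixels.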

Neurophysiology of Vision

Trying to understand vision by studying only nerve cells, as Marr remarked, is like trying to understand bird flight by studying only feathers.

p. 72

A brief review of eye physiology:

  • The pupil is the black part of the eye, through which light enters. In camera terminology, this is equivalent to the eye's aperture.
  • The iris is the colored portion that surrounds the pupil. It controls the size of the pupil and how much light enters. It is equivalent to the diaphragm (f-stop) controlling the aperture. The pigmentation absorbs light and prevents excess light from reaching the retina, essentially making the eye more efficient.
  • The retina is the back of the eye, where incoming light is received by nerve cells. This light is converted into electrical and chemical signals that are forwarded on to the brain.

The retina is a layer of cells that coats the inside of the eye. It includes photoreceptor cells, which respond to light, and ganglion cells, which carry the resulting signals on toward the brain.

Some ganglion cells are excited by light that falls directly on them, and inhibited by light that falls on the cells that surround them. Other ganglion cells are inhibited by light that falls directly on the center of the cell, and excited by light that falls on neighboring cells. This mechanism provides the necessary signal addition and subtraction to apply a Mexican Hat filter biologically. Excitation by direct light supplies the positive central weights of the filter, while inhibition by surrounding light supplies the negative weights further away. The cells normally fire at a specific frequency; when they are excited they fire at a faster rate, and when inhibited they fire at a slower rate. The zero crossings (where the second derivative changes sign), which correspond to the locations of edges, are linked to locations where these two sorts of ganglion cells have equal activity.

Neurophysiologists David Hubel and Torsten Wiesel studied mechanisms of perception, found cells in visual cortex excited by bright lines or bars at particular orientation (Marr's theory suggests these correspond to zero-crossings).

Third Stage of Vision: Primal Sketch

The eye is applying a filter equivalent to the Laplacian of the Gaussian, but with structures of ganglionic nerve cells each applying different size filters. Thin bars and details that are far away may give two zero-crossings when applying a small filter but be blurred together by a larger filter. The brain would thus find it useful to be able to compare the results of filters of different sizes - when zero-crossings are detected across multiple filter sizes, it is a "real" result.

Marr believes the zero-crossings are the key, while others (Roger Watt and Michael Morgan) believe it is the peaks and troughs.

Breaking down the visual perception of the world into a map of bars, edges, and blobs is how the macro-scale image of the world can be represented (so-called "primal sketch"). However, the mechanisms behind how the brain forms these are difficult to study.

Usually, focusing on the primal sketch and ignoring details will lead to a loss of information. However, you can also gain information. Example: checkerboard image of Lincoln (image by Leon D. Harmon).

Cost of Visual Processing

Major challenges of visual processing with onboard computers: computers have far fewer interconnections (electronic nerves) than biological systems, so slower bus speeds and bandwidth. Crucial to work fast enough for the task at hand - e.g., self-driving car can't take two seconds to process an image.

The major computational cost is filtering the gray-level array. For a 1000 x 1000 array, the filter must be applied to every pixel, for every filter size. Specialty hardware can help, but the cost is still significant.
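A back-of-the-envelope count shows how quickly this adds up. The sketch below assumes each filter is a square stencil applied directly at every pixel (no shortcuts such as separable or frequency-domain filtering); the filter widths are hypothetical values picked only to show the scaling.

```python
def filter_cost(height, width, filter_widths):
    """Multiply-add count for applying square filters of the given widths at every pixel."""
    return sum(height * width * w * w for w in filter_widths)


# Hypothetical filter widths in pixels: a fine, a medium, and a coarse filter.
widths = [9, 17, 33]
total = filter_cost(1000, 1000, widths)
print(f"{total:,} multiply-adds per image")
```

With these assumed widths the count is well over a billion multiply-adds for a single image, and it has to be repeated for every new image.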

Workarounds include limiting vision of the world to a primal sketch - much like the housefly, which does not need to perform 3D extrapolation from 2D images, everything boils down to a set of algorithms. Landing algorithm: if visual field expands at high speed, fly turns feet toward center of expanding plane, and stops flying when its feet hit the surface. Mate tracking: find small black patch moving against a background. Left and right wing power governed by patch position in visual field and by angular velocity, so fly keeps the target centered in its visual field and flies toward it.

For a fly, vision is, in fact, impossible. But the mechanism is tuned to work for specific scenarios with limited information. Thus there are many tasks a fly cannot accomplish.


Harmon, L. D. "The recognition of faces." Scientific American, November 1973, p 75.

Mayhew Frisby 1984 (technical account of computer vision)

Watt 1988 (advanced monograph on the initial stages of human vision)

Marroquin, J. L. "Human visual perception of structure." Master's thesis, Dept of EE and CS, MIT, 1976.

Seeing the World in Depth

Stereopsis - fusion of disparate images so as to see the world in depth

Problem of disparity: slight differences in images from eye to eye. Once the size and direction of the disparity are known, trigonometry can be used to find the relative depth of a point.
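The trigonometry is simplest under an idealized setup, which the sketch below assumes: two pinhole viewpoints with parallel lines of sight, separated by a known baseline. Similar triangles then give depth = baseline x focal length / disparity. Real eyes converge on a point, so this is only a rough model, and the numbers used here are illustrative rather than measured.

```python
def depth_from_disparity(baseline, focal_length, disparity):
    """Depth of a point from its disparity, for two parallel pinhole views.

    baseline: separation of the two viewpoints (sets the unit of the result)
    focal_length, disparity: measured in the same image units (e.g. pixels)
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the viewer")
    return baseline * focal_length / disparity


# Illustrative values: 0.065 m between viewpoints, focal length of 800 pixels.
for disparity in (13, 26, 52):
    print(disparity, depth_from_disparity(0.065, 800, disparity))
```

Doubling the disparity halves the depth, which is why relative depth can be read off from relative disparity even without knowing the baseline exactly.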

Puzzling problem: how to match two images, if neither image has any particular location on the retina that is the same between the two images?

Could use top-down image processing, identifying objects in the scene and matching up key points. But this requires two separate images being held in the head simultaneously, identification of objects and orientations of each independently, so seems unlikely.

Alternative is bottom-up processing, matching the intensity values in a pair of gray-level arrays. However, this also presents a challenge, in that intensities can be different between eyes (e.g. sunglasses over one eye).

If we want to write a program to do this, we have to ask: how does nature do it?

Run experiments to better understand the mechanism:

  • First, see how the visual system responds when there is no high-level knowledge that can be exploited (in which case, we're isolating the bottom-up mechanism)
  • Second, modify input data so they are badly corrupted, so process is not entirely data-driven.

Bottom up Stereopsis

The first approach, bottom-up stereopsis, in which we remove high-level knowledge, has proved invaluable in studying stereopsis. The stereoscope is a Victorian invention that combines two photographs of the same location taken from slightly different perspectives, using mirrors to present each eye with its own image so that the brain receives a particular stereo pair. The images used in these experiments are like "Magic Eye" images, in which there is no high-level detail for the brain to latch onto. And yet the brain can still perceive shapes and see the hidden shape "pop out".

John Frisby and John Clatworthy - found that explicit high-level knowledge cannot speed up the process of perceiving stereopsis pictures.

How does the brain do this? Several hypotheses:

  • First, one thing cannot be in two places at the same time - so there is a uniqueness constraint (one point can be matched with one and only one point)
  • Surfaces are typically opaque and smooth, so depth from observer varies continuously rather than undergoing sudden changes (continuity constraint)

Imagine each eye shooting out rays toward the points it sees. If each eye perceives the same three points, that's 3^2 or 9 possible intersection points where those points could be located.

  • The first constraint - the uniqueness constraint - implies that the point can only be along one line of sight for each eye. So out of the nine total locations, really there are only three possible, at a given time.
  • The second constraint - the continuity constraint - implies the points must be roughly the same depth from the observer. This leads to other points being eliminated.
  • Constraints act in an exclusionary way - if X, then not Y or Z

Program for random dot stereograms:

  • Constructing arrays of fragments/sequences
  • Use large number of processors, applying calculation to local values.
  • Relaxation - system starts at starting point, continues to compute, uses output from each cycle as input to next cycle, until it gradually relaxes into stable configuration of values
  • Relaxation program for stereograms: takes input rows from two stereograms, array of processors work out possible fusions.
  • Array is 3D, with each 2D slice corresponding to one horizontal row from two stereograms.
  • 3D volume represents all different depths of possible fusions of dots in white rows
  • The two constraints, uniqueness and continuity, are implemented in the program: excitation from neighbors (continuity constraint) and inhibition from processors in the same line of sight (uniqueness constraint) are summed and subtracted. If the combination of excitation and inhibition exceeds a threshold value, it is chosen as a possible solution and the program moves on to the next step with it.
  • Only local information (from neighbors) is required, meaning this scales well.
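The steps above can be sketched for a single pair of rows. This is a simplified illustration in the spirit of the relaxation program described, not a reproduction of it: it assumes NumPy, treats each row as cyclic to keep the indexing short, and uses made-up values for the neighbourhood size, inhibition weight, and threshold. The function name `cooperative_stereo` is also an assumption.

```python
import numpy as np


def cooperative_stereo(left, right, max_disp=3, radius=2, eps=0.2, theta=2.5, iters=10):
    """Relax a grid of match hypotheses for one pair of rows.

    Cell (k, x) stands for "left pixel x fuses with right pixel x + disps[k]".
    All parameter values are illustrative assumptions.
    """
    disps = np.arange(-max_disp, max_disp + 1)
    # Initial hypotheses: 1 wherever the two pixels have the same colour.
    c0 = np.array([(left == np.roll(right, -d)).astype(float) for d in disps])
    c = c0.copy()
    for _ in range(iters):
        # Continuity: excitation from neighbours at the same depth.
        excite = sum(np.roll(c, s, axis=1) for s in range(-radius, radius + 1) if s != 0)
        # Uniqueness: inhibition from rivals on the left eye's line of sight (same x).
        inhibit_left = c.sum(axis=0, keepdims=True) - c
        # Uniqueness: inhibition from rivals on the right eye's line of sight (same x + d).
        by_right = np.array([np.roll(c[k], d) for k, d in enumerate(disps)])
        rivals = by_right.sum(axis=0, keepdims=True) - by_right
        inhibit_right = np.array([np.roll(rivals[k], -d) for k, d in enumerate(disps)])
        # Sum, subtract, and threshold; the output feeds the next cycle.
        c = ((excite - eps * (inhibit_left + inhibit_right) + c0) >= theta).astype(float)
    return disps, c


rng = np.random.default_rng(1)
left = rng.integers(0, 2, size=64)   # one row of random dots
right = np.roll(left, 2)             # the same row displaced by two pixels
disps, c = cooperative_stereo(left, right)
for d, support in zip(disps, c.sum(axis=1)):
    print(f"disparity {int(d):+d}: {int(support)} active cells")
```

Every update uses only a cell's neighbours and the cells that share one of its two lines of sight, so each cell could in principle be handled by its own processor. With the parameter values assumed here, the layer at the true displacement (+2) keeps all of its cells active, and the printout shows how much support the rival layers retain.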

Real Stereopsis

Questions still remain: how does the brain match up visual points? Can't be based on illumination or light, since that does not interfere with the mechanism.

Hypothesis: the brain begins with zero-crossings, specifically those from a very coarse filter (fewer zero-crossings, so fewer edges to match). Edges going from dark-to-light or light-to-dark are matched up. Once areas are matched up, this gives an initial registration of the coarse-filtered images, and the brain then repeats the process on zero-crossings from a smaller filter, with more details (but a better "initial guess").

Physiologically more plausible than the computationally complex random dot method.

Eric Grimson has applied this to aerial stereophotographs.

Colin Blakemore reported cells in visual cortex that corresponded to discrepancies in the same line of sight, but discrepancies were not in zero-crossings. John Mayhew and John Frisby demonstrated in some cases it is nearby peaks in change of gradient. Robert Gregory found edges of objects do not always correspond to elements matched in stereopsis.

Watt and Morgan advocated peaks-and-troughs approach, think zero-crossings do not play major role.

James Gibson found gradients in texture can be source of information about depth and orientation of surfaces: textures consisting of similar shapes repeated on a surface. Gradient can indicate orientation of surface. There are computer programs that can interpret textured gradient images, but that does not mean the human mechanism for perceiving them is well-known.

Other cues to depth: distant objects are hazier, bluer, higher in visual field; parallel edges converge with distance; motion of an object affords multiple views; visible boundaries of object against background are also cues to shape.

Rotating cylinders of random dots: without the motion of two coaxial cylinders painted with random dots, there is no ability to identify a pattern. But when dots begin moving, or are projected on screen, becomes clear.

Importance of rigidity: Gunnar Johansson ran experiments in a pitch dark room, with lights attached to human limbs at the joints, and attached at the middle of the limbs. In the former case, people could perceive someone walking via the lights alone. In the latter case, people could not sense motion. The visual system assumes adjacent lights/points are connected by rigid entities.

Contour and Shape

I hold my hand between a light and a wall, manipulate it appropriately, and what you see on the wall is the shadow of a rabbit. The phenomenon raises again the argument about the impossibility of vision: there are an infinite number of different three-dimensional shapes that can give rise to the same two-dimensional shadow. How is it, then, that you see a rabbit?

The same question arises not just with shadows, but with silhouettes and the visible contours of objects against their backgrounds... Once again, the issue is whether vision relies on a knowledge of such things as rabbits or uses low-level assumptions to work bottom up from the visual data.


S. Ullman, "The Interpretation of Visual Motion." MIT Press, 1979.

Marr, "Vision" (1982)

Special issues of the journals Artificial Intelligence (1981, Volume 17) and Cognition (1984, Volume 18) are devoted to vision

Koenderink (14) - advanced account of role of contour as cue to shape

Brady (1983) - analysis of shape from standpoint of machine vision

Scenes, Shapes, and Images

In what pattern does light reflect from objects on to a surface? This problem in optics is well-formed: it can be solved much as a set of equations can be solved. Vision is inverse optics. It has to establish what objects caused the patterns of light projected on to the retinae. This problem is not well-formed: it is almost impossible to solve, because there are too many unknowns...

When the mind solves a seemingly impossible task, it must have a secret weapon, and, as I remarked, that secret weapon is knowledge. Knowledge comes in two main varieties that are used in two main ways, top down or bottom up.

One sort of knowledge arises from evolution, and its wisdom is built into the process of the nervous system. This knowledge is not really knowledge at all...

The other sort of knowledge accrues during the lifetime of an individual... In fact, you are not always aware either of using such knowledge or of its nature.

Part 3: Learning, Memory, and Action

We can now see why the problem of central control in brains is linked with that of long-term memory. It is obvious that no animal, let alone man, carries out a simple and predictable sequence of operations on the inputs which stimulate it. We must suppose therefore that the control exercised over the brain's operations is one which varies not only with the nature of the input but also with the results of past operations.

Donald Broadbent

Learning and Learnability

Some organisms are born with an innate repertoire of behaviors for coping with their particular "niche" in the environment. They can survive and reproduce on the basis of inborn responses that are automatically triggered by specific events. The advantage of such responses is that they do not have to be learned and so can be ready as soon as the organism enters the world. Their disadvantage is that apart from some fine tuning, they may be mindlessly repeated whenever the circumstances that trigger them reoccur. A sand wasp constructs its nest by performing a chain of innate responses, and if part of the nest is removed the wasp automatically returns to the appropriate earlier stage in the chain to build it again. As long as an experimenter repeats the partial destruction, the wasp repeats the construction: it never learns that its Sisyphean efforts are futile.

- page 129

It is a profoundly erroneous truism... that we should cultivate the habit of thinking what we are doing. The precise opposite is the case. Civilization advances by extending the number of important operations which we can perform without thinking about them.

- Alfred North Whitehead

...inborn constraints on what a species can learn. Some behaviors are easily learned, whereas others are not. For example, if a rat drinks a sweet tasting liquid and later becomes ill, it learns immediately never to drink that liquid again. But, if the liquid is tasteless and its poisonousness is signaled by some other concurrent event, such as a flashing light, then the animal fails to learn to avoid the liquid.

- p. 131

Hence, in general terms, learning is the construction of new programs out of elements of experience...

Ultimately, learning must depend on innate programs that make programs.

- p. 133

Components of Memory

The components of the computer communicate by three separate highways, or buses. The address bus carries the binary addresses of locations in memory; the data bus carries information to and from these locations; and the control bus carries instructions generated by the control unit of the processor to synchronize such transfers.

- p. 146

Broadbent, "Perception and Communication." Oxford. 1958.

  • Flow of information from the senses to memory
  • Information enters the senses
  • Information passes through the senses into short term memory store
  • Information from short term memory store passes through a selective filter
  • Information that passes through the selective filter enters a limited capacity channel
  • Information passing through the limited capacity channel feeds back to short-term store
  • Limited capacity channel passes information on to two places:
    • System for varying output until some input is secured (effectors)
    • Store of conditional probabilities of past events (feeds back to the selective filter)

The next innovation was splitting the short-term memory into two separate components. The phenomenon that led to the split is easy to demonstrate. If you are in your living room at night and switch the light out, you will retain an evanescent visual impression of the room for a fraction of a second afterwards. In an experimental analogue of the situation, George Sperling showed that although people are unable to report the entire contents of an array of letters flashed momentarily before them, they have actually seen and briefly registered the entire array. If they are prompted to recall any particular part of it immediately after its presentation, they can do so.


There is evidently a sensory memory for a visual image that persists for about a quarter of a second. This memory might be useful for reading a newspaper in a lightning storm...

Similar effects have been demonstrated for hearing - you retain a brief "echo" of what someone has just said, which fades rapidly. Each sense must have its own memory system whose contents are continuously replaced by new incoming information and cannot be rehearsed.

- p. 148

Plans and Productions

Breadth-first search: number of routes to explore grows exponentially, doubling at each step for even the simplest decision tree (two decisions).
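The doubling is easy to see by expanding a small decision tree level by level. The sketch below is only an illustration of the growth; the function name `routes_per_depth` and the choice of ten steps are arbitrary.

```python
def routes_per_depth(branching, max_depth):
    """Expand a decision tree breadth-first; return the number of routes at each depth."""
    counts = []
    frontier = [()]                      # start with the single empty route
    for depth in range(max_depth + 1):
        counts.append(len(frontier))
        if depth == max_depth:
            break
        # Every route on the current level is extended by each possible decision.
        frontier = [route + (choice,) for route in frontier for choice in range(branching)]
    return counts


print(routes_per_depth(branching=2, max_depth=10))
```

With two decisions per step, ten steps already give 1024 routes on the last level, and each further step doubles that again.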

It has been proved that in certain domains any search procedure may fail to discover that there is no successful route: the procedure, in effect, goes into the problem space, gets lost, and never emerges with an answer. Hence, no matter what procedure is used, constraints are needed to keep the search to a manageable size. Once again, they play a critical role in a mental task. - p. 159

Newell and Simon devised a program based on an idea of the mathematician George Polya (he was anticipated by Plato). It looks for an operation that reduces the difference between the goal and the initial state. Thus, if you are trying to find a plan to mend the hole in your bucket, a relevant operation for reducing the difference between goal and initial state is to put a stopper in the hole. However, it may be impossible to carry out such an operation because one of its preconditions is not satisfied, e.g., you do not have a stopper. Newell and Simon introduced an ingenious idea: you create a new sub-goal - to find a stopper - and you put this sub-goal on the stack above your main goal.

- p. 161
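The sub-goal idea can be sketched with the bucket example. This is a toy illustration of a goal stack, not Newell and Simon's program: the `operators` table, the condition names, and the function name `plan` are all invented for the example, and each condition is assumed to be achieved by at most one action.

```python
def plan(goal, state, operators):
    """Return the list of actions that achieves goal, or None if no plan is found.

    operators maps a condition to the action that achieves it and to the
    preconditions that action needs (a hypothetical format for this sketch).
    """
    state = set(state)           # work on a copy of the initial state
    stack = [goal]               # main goal at the bottom, sub-goals above it
    steps = []
    while stack:
        current = stack[-1]
        if current in state:
            stack.pop()          # this difference has already been removed
            continue
        operator = operators.get(current)
        if operator is None:
            return None          # no operation reduces this difference
        missing = [p for p in operator["needs"] if p not in state]
        if missing:
            if missing[0] in stack:
                return None      # circular preconditions, give up
            stack.append(missing[0])   # new sub-goal goes on top of the stack
            continue
        steps.append(operator["action"])
        state.add(current)
        stack.pop()
    return steps


operators = {
    "hole mended": {"action": "put stopper in hole", "needs": ["have stopper"]},
    "have stopper": {"action": "find a stopper", "needs": []},
}
print(plan("hole mended", set(), operators))
```

The sub-goal "have stopper" is pushed above the main goal, dealt with first, and popped, after which the original operation can be carried out. The check for a sub-goal that is already on the stack stops the search when preconditions depend on each other in a circle.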

Parallel Distributed Processing