Audio demonstrations of auditory scene analysis


Note: The following material originally accompanied an audio compact disk of demonstrations,

Bregman, A.S., & Ahad, P. (1996) Demonstrations of Auditory Scene Analysis: The Perceptual Organization of Sound.

That is why the word "disk" appears occasionally in the explanations. In the present context, this word should be understood as "demonstration" or "set of demonstrations".

The demonstrations on this disk illustrate principles that lie behind the perceptual organization of sound. The need for such principles is shown by the following argument: Sound is a pattern of pressure waves moving through the air, each sound-producing event creating its own wave pattern. The human brain recognizes these patterns as indicative of the events that give rise to them: a car going by, a violin playing, a woman speaking, and so on. Unfortunately, by the time the sound has reached the ear, the wave patterns arising from the individual events have been added together in the air so that the pressure wave that reaches the eardrum is the sum of the pressure patterns coming from the individual events. This summed pressure wave need not resemble the wave patterns of the individual sounds.

As listeners, we are not interested in this summed pattern, but in the individual wave patterns arising from the separate events. Therefore our brains have to solve the problem of creating separate descriptions of the individual happenings. But the brain doesn't even know, at the outset, how many sounds there are, never mind what their wave patterns are; so the discovery of the number and nature of the sound sources is analogous to the following mathematical problem: "The number 837 is the sum of an unknown number of other numbers; what are they?" There is no unique answer.

To deal with this scene analysis problem, the first thing the brain does is to analyze the incoming array of sound into a large number of frequency components. But this does not solve the problem; it only changes it. Now the problem is this: how much energy from each of the frequency components, present at a given moment, has arisen from a particular source of sound, such as the voice of a particular person continuing over time? Only by solving this problem can the identity of the signals be recognized.
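A simple discrete Fourier transform can make this point concrete. In the sketch below (a toy illustration, not a model of the cochlea; the frequencies and signal length are arbitrary choices), the analysis of a two-tone mixture recovers the frequency components, but nothing in the resulting spectrum says which component came from which source:

```python
import cmath
import math

SAMPLE_RATE = 1000
N = 1000  # one second of signal, giving 1 Hz frequency resolution

# A mixture of two tones, standing in for two simultaneous sources
# (50 Hz and 120 Hz are arbitrary illustrative choices).
signal = [math.sin(2 * math.pi * 50 * n / SAMPLE_RATE) +
          math.sin(2 * math.pi * 120 * n / SAMPLE_RATE)
          for n in range(N)]

def dft_magnitude(x, k):
    """Magnitude of the k-th DFT bin (k Hz, given the 1 Hz resolution)."""
    return abs(sum(s * cmath.exp(-2j * math.pi * k * n / len(x))
                   for n, s in enumerate(x)))

# The frequency analysis recovers the component frequencies as peaks...
peaks = [k for k in range(200) if dft_magnitude(signal, k) > N / 4]
print(peaks)  # the two component frequencies appear as spectral peaks

# ...but the spectrum alone does not say which peak belongs to which
# source; that allocation is the scene-analysis problem.
```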

For example, particular talkers can be recognized, in part, by the frequency composition of their voices. However, there are many more frequencies arriving at the ear than just the ones coming from a single voice. Unless the spectrum of the voice can be isolated from the rest of the spectrum, the voice cannot be recognized. Furthermore, the recognition of what it is saying - its linguistic message - depends on the sequence of sounds coming from that voice over time. But when two people are talking in the same room, a large set of acoustic components will be generated. These have to be stitched together in the right way. Otherwise illusory syllables could be perceived by grouping components derived from both voices into a single stream of sound.

The name given to the set of methods employed by the auditory system to solve this problem is "auditory scene analysis", abbreviated ASA. This name emphasizes the analogy with "scene analysis", a term used by researchers in machine vision to refer to the computational process that decides which regions of a picture to treat as parts of the same object. It has been argued by Bregman (1990) that there exists a body of methods for accomplishing auditory scene analysis that are not specific to particular domains of sound such as speech, music, machinery, traffic, animal sounds, and so on, but cut across all domains.

These methods take advantage of certain regularities that are likely to be present in the total spectrum whenever it has been created by multiple events. The regularities include such things as harmonicity: the tendency of many important types of acoustic event to generate a set of frequency components that are all multiples of the same fundamental frequency. Here is an example of how the auditory system uses this environmental regularity: if it detects two different sets of harmonics (related to different fundamentals), it will decide that each set represents a different sound. There are many other kinds of regularities in the world that the brain can exploit as it tries to undo the mixture of sounds and decide which frequency components to fit together. They include the facts that all the acoustic components from any single sonic event (such as a voice saying a word) tend to rise and fall together in frequency and in amplitude, that they tend to come from the same spatial location, and that the frequency profile (spectrum) of the particular event does not change too rapidly.
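To see how the harmonicity regularity could in principle be used, here is a hypothetical sketch (a toy procedure for illustration only, not the auditory system's actual mechanism; the frequencies, fundamentals, and tolerance are made-up values). Given a set of detected frequency components, it assigns each one to the candidate fundamental of which it is a near-integer multiple:

```python
def group_by_harmonicity(components_hz, fundamentals_hz, tolerance=0.03):
    """Assign each frequency component to the first candidate fundamental
    of which it is (within tolerance) an integer multiple.  A toy sketch
    of the harmonicity cue, not a model of the auditory system."""
    groups = {f0: [] for f0 in fundamentals_hz}
    for comp in components_hz:
        for f0 in fundamentals_hz:
            ratio = comp / f0
            if round(ratio) >= 1 and abs(ratio - round(ratio)) < tolerance:
                groups[f0].append(comp)
                break
    return groups

# A mixture of the harmonics of 100 Hz and 130 Hz (made-up values):
mixture = [100, 130, 200, 260, 300, 390, 400, 520]
print(group_by_harmonicity(mixture, [100, 130]))
```

The interleaved components fall cleanly into two harmonic series, which is the sense in which two detected sets of harmonics suggest two distinct sounds.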

Illustrations of some of these regularities and how they affect grouping are given by the demonstrations on this disk. They are meant to illustrate the principles of perceptual organization described in the book, Auditory Scene Analysis: The Perceptual Organization of Sound (Bregman, 1990), published by the MIT Press. This book will be mentioned fairly often; so its title will be abbreviated as "ASA-90". The phenomenon of auditory scene analysis, itself, will be abbreviated simply as "ASA".

ASA-90 attempts to integrate the phenomena of the perceptual organization of sound by interpreting them as parts of ASA. It also applies the same framework to the study of music and speech, and connects the problem of auditory grouping to the "scene analysis" problem encountered in machine vision.

The research described in "ASA-90" has shown that the well-known Gestalt principles of grouping, conceived in the early part of this century to describe the perceptual organization of visual stimuli, can also be found, in a modified form, in auditory perception, where they facilitate the grouping together of the auditory components that have been created by the same sound source. While the Gestalt principles have been shown to be useful, it is the contention of "ASA-90" that they are merely a subset of a larger set of scene analysis principles, some of which are unique to particular sense modalities.

Choice of demonstrations.

For the present disk, we tried to choose demonstrations that a listener should be able to hear without special training or conditions of reproduction. For this reason, they do not always correspond directly to the stimulus patterns used in the research discussed in "ASA-90", many of which require training of the listener, presentation against a quiet background, and statistical evaluation before regularities can be seen. However, the present examples illustrate the same principles.


In each description in the booklet, there is a section entitled "Reading". It refers to chapters and page numbers in the "ASA-90" book, and to other publications. The articles cited in the description of each demonstration are collected at the end of the booklet. Many of these are discussed in "ASA-90".

The use of cycles as stimuli.

Many of the demonstrations use a repeating cycle of sounds to illustrate principles of perceptual organization. While not typical of our acoustic environment, cycles have a number of advantages. One is that a short sequence of sounds can be repeated to make sequences of any desired length. Although the resulting sequences vary in length, they remain simple to describe and to generate. When we explain the demonstrations, we use the ellipsis symbol (...) to mean "repeated over and over", as in "ABAB...".

A second reason for using cycles is that segregation increases over time. Cyclic presentation allows us to drive, to unnaturally high levels, the segregative effects of the stimulus properties that we are examining. Furthermore, the use of cycles gives the listener repeated chances to observe the ensuing perceptual effects, allowing stable judgments to be made. By using sequences composed of a large number of repetitions, we can also minimize the special effects that occur at the beginnings and ends of sequences (e.g., echoic memory) so that purer effects can be observed.

In many of the demonstrations, we present high (H) and low (L) tones in a galloping sequence, a pattern first used by van Noorden (1975) to study the segregation of auditory streams. When the sequence HLH-HLH-HLH-... segregates into a high and a low stream, the galloping rhythm seems to disappear. Instead we hear a regular rhythm of the high tones H-H-H-H-H-H-... and a slower regular rhythm of the low tones -L---L---L--. This change in rhythm and melodic pattern makes it easy for listeners to recognize that stream segregation has taken place.
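The galloping pattern and its two rhythms can be sketched symbolically. In this illustrative Python fragment (the tone and gap durations are arbitrary choices, not the values used on the disk), each HLH- group is laid out as timed events; note that the H onsets come out evenly spaced, while the L onsets form a slower regular series, which is what the listener hears when the streams segregate:

```python
def gallop(n_repeats, tone_ms=100, gap_ms=100):
    """Build the van Noorden galloping pattern HLH-HLH-... as a list of
    (label, onset_ms) events.  The silent slot completes each HLH- group.
    Durations are arbitrary illustrative choices."""
    events = []
    t = 0
    for _ in range(n_repeats):
        for label in ("H", "L", "H"):
            events.append((label, t))
            t += tone_ms
        t += gap_ms  # the silent slot after each group
    return events

seq = gallop(3)
print("".join(label for label, _ in seq))  # → "HLHHLHHLH"

# When stream segregation occurs, the H events are heard as one evenly
# spaced stream and the L events as a slower evenly spaced stream:
high = [t for label, t in seq if label == "H"]  # onsets every 200 ms
low = [t for label, t in seq if label == "L"]   # onsets every 400 ms
```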

Standard and comparison patterns.

In order to clarify how you are organizing the sequence of sounds, many of the demonstrations ask you to listen for a particular pattern, A, inside a larger pattern, B.

The pattern that you are to listen for, A, is presented first, in the form of a "standard". Then, right afterwards, the larger pattern, B, is played as a "comparison sequence", which is always more complex than the standard. It may have more sequential components or more simultaneous ones. If some standard (A1) can be heard more easily than other standards (A2, A3, etc.) in a given comparison pattern, this implies that sub-pattern A1 is more strongly isolated from the rest of B, by principles of grouping, than the other sub-patterns are.

If you concentrate very hard, you may be able to hear the standard whether or not perceptual organization favors its isolation. So try to listen to the standards in all conditions with the same degree of attention. Then you should be able to tell whether the isolation of the standard has been helped by the grouping cues whose effects are being examined in that demonstration.


There is a set of conventions that governs the format of the figures. We will list them now. Should this format be altered for a particular figure, the change will be explained in the text.

  1. In most of the figures there are two or more panels which are referred to, in the text, as Panel 1, Panel 2, etc. The panel numbers are not included in the figure itself, but the order of numbering is consistent, going from left to right and then from top to bottom; i.e., as in normal reading.
  2. Most of the displays are schematic spectrograms, with time on the horizontal axis and frequency on the vertical. Tones are usually represented as horizontal black bars. A noise burst appears as a rectangle with a gray pattern filling it; its horizontal extent indicates its duration and its vertical extent, the range of included frequencies.

Monophonic versus stereophonic presentation.

Most of the demonstrations are monophonic (same signal on both channels). This is because spatial location is only one of many cues used by ASA. The mono demonstrations, 1 to 37, will work when listened to over loudspeakers as long as there is little reverberation in the room ("dry" listening conditions). If the room is too reverberant, headphones should be used. Only Demonstrations 38 to 41 are in stereo. They are grouped at the end of the disk for convenience. Although listening to these stereo examples over loudspeakers may reproduce some of the effects, headphones are strongly recommended.

Track numbers for demonstrations and calibration signals.

For simplicity, the first 41 track numbers on the disk correspond to the 41 demonstration numbers in this booklet. However, there are two extra tracks at the end of the disk. The first one, track 42, contains the loudness calibration signal described later in this section. The second one, track 43, is a signal for calibrating the stereo balance of your playback equipment. It is described on page 81 of this booklet.

Copyright ©2008 Al Bregman