Computational auditory scene analysis (CASA)

Computer systems are getting very good at "recognizing" spoken language, in the limited sense of outputting a distinct symbol for every different spoken word that they receive, and outputting the same symbol for words that would be heard as the same by a human listener. However, even the most successful of the systems in current use require that the speech be produced against a fairly quiet background. This is because their programs cannot distinguish the sounds they are supposed to recognize from other, irrelevant sounds, so they respond as if the mixture of the signal with its background were a single sound.

Because of this problem, there has been great interest in programming computers to segregate the important signal from other co-occurring sounds. Not all of the methods that have been used approach the problem the way the human auditory system does; engineers have been able to achieve very good sound-source separation by other means. For example, suppose we have a talker who stays in one place and a set of competing sound sources that also maintain their locations. If this set of sources can be surrounded with an array of stationary microphones, the distinct sources can be separated with great accuracy. This would be very useful for such applications as teleconferencing. However, this recording setup is rather inflexible. Compare it to human listeners, who do remarkably well with only two ears, with heads and bodies that are in motion, listening to a talker who may also be in motion, in a landscape populated by irrelevant sound sources that may themselves be moving around. Even if the sound comes around a corner, losing its spatial information, the human listener can do a pretty good job of distinguishing one sound from another.
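
To give a concrete flavour of the engineering approach mentioned above, the following is a minimal sketch of delay-and-sum beamforming, one classic microphone-array technique for emphasizing a source at a known position. The two-microphone setup, the signals, and the five-sample delay are invented purely for illustration; a real array would derive the delays from its geometry and the source's location.

    import numpy as np

    def delay_and_sum(mic_signals, delays_samples):
        # Align each microphone channel by its known arrival delay (in samples)
        # and average; the steered source adds up coherently, other sounds do not.
        aligned = [np.roll(x, -d) for x, d in zip(mic_signals, delays_samples)]
        return np.mean(aligned, axis=0)

    # Toy example: a 300 Hz "target" reaches the second microphone five samples
    # later than the first, and each channel carries its own uncorrelated noise.
    fs = 8000
    t = np.arange(fs) / fs
    target = np.sin(2 * np.pi * 300 * t)
    mic1 = target + 0.5 * np.random.randn(fs)
    mic2 = np.roll(target, 5) + 0.5 * np.random.randn(fs)

    enhanced = delay_and_sum([mic1, mic2], delays_samples=[0, 5])

Averaging the two aligned channels leaves the target intact while roughly halving the power of the uncorrelated noise; with many microphones the improvement is correspondingly larger.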

The observation that millions of years of evolution have produced such a flexible system for auditory scene analysis (ASA) in humans and other animals has encouraged some researchers in sound separation to incorporate the strategies employed by human listeners into computer systems for the recognition of speech or of other signals of interest, such as music. In speech recognition systems, researchers have designed a preliminary process to separate the concurrent sounds, after which the separated signals are passed on to recognition processes. This preliminary sound-separation stage has been called "computational auditory scene analysis" (CASA).

CASA researchers have described the problem of sound segregation as an example of the binding problem. Neuroscientists believe that the recognition of sensory inputs involves the actions of a large number of brain circuits, each of which becomes active when an input has some particular characteristic. For example, one circuit might be active whenever the input sound was tonal (periodic) and had a clearly evidenced fundamental that was close to 200 Hz [200 cycles per second]. Another might be active when a moment of sound seemed to be coming from about 30 degrees to the left of the listener. Computer scientists have designed automatic recognition processes using the same approach: detection of the individual features of the signal.
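
To make the idea of a feature detector concrete, here is a small illustrative sketch (in Python) of a detector that becomes "active" when a short stretch of sound is periodic with a fundamental near 200 Hz, using autocorrelation. The frame length, sampling rate, and thresholds are arbitrary choices made for the example, not values taken from any actual neural or computer model.

    import numpy as np

    def detects_f0_near(frame, fs, target_hz=200.0, tolerance_hz=20.0):
        # "Fires" (returns True) if the frame is clearly periodic and its
        # strongest periodicity corresponds to a fundamental near target_hz.
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / 400), int(fs / 60)        # plausible voice-pitch lags
        best_lag = lo + np.argmax(ac[lo:hi])
        f0 = fs / best_lag
        periodic_enough = ac[best_lag] > 0.3 * ac[0]  # crude periodicity test
        return periodic_enough and abs(f0 - target_hz) < tolerance_hz

    fs = 16000
    t = np.arange(int(0.04 * fs)) / fs              # a 40-millisecond frame
    complex_tone = sum(np.sin(2 * np.pi * 200 * k * t) for k in (1, 2, 3))
    print(detects_f0_near(complex_tone, fs))        # prints True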

However, because a scene might have several sounds, each with its own fundamental frequency and location, the mere detection of the raw features of all the sounds is not good enough. The right combination of features must be put together to correctly represent each individual sound. Suppose we encounter a low-pitched sound on our right, accompanied by a high-pitched sound on our left. One neural circuit (which knows nothing about location) detects a low-pitched sound, while another circuit of the same type detects a high-pitched sound. At the same time, one circuit (which knows nothing about pitch) detects something on the left, while another circuit of that type detects something on the right. The brain has to know which location goes with which pitch before we can understand the perceptual input. The connections between high and left, and between low and right, are known as bindings. A particular location is temporarily "bound" to a particular pitch so as to represent the current sensory input.
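
The following toy sketch (in Python) illustrates what a binding amounts to computationally: independent pitch and location detectors each report their own feature, and the scene is described correctly only once the right pitch is paired with the right location. The pairing rule used here, overlap in the frequency channels that support each feature, is invented purely for this illustration; it is not a claim about how the brain actually does it.

    from dataclasses import dataclass

    @dataclass
    class Feature:
        kind: str            # "pitch" or "location"
        value: str           # e.g. "low", "high", "left", "right"
        channels: frozenset  # frequency channels that support this feature

    def bind(features):
        # Pair each pitch feature with the location feature whose supporting
        # channels overlap it most, yielding bound "sound objects".
        pitches = [f for f in features if f.kind == "pitch"]
        locations = [f for f in features if f.kind == "location"]
        return [(p.value,
                 max(locations, key=lambda l: len(p.channels & l.channels)).value)
                for p in pitches]

    scene = [
        Feature("pitch", "low",  frozenset({1, 2, 3})),
        Feature("pitch", "high", frozenset({7, 8, 9})),
        Feature("location", "right", frozenset({1, 2, 3, 4})),
        Feature("location", "left",  frozenset({6, 7, 8, 9})),
    ]
    print(bind(scene))   # [('low', 'right'), ('high', 'left')]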

We know that, to a rough approximation, the different frequencies present in the incoming signal are registered individually by the auditory system at each moment of time. But which bit of frequency-by-time data came from the same environmental source as which other one? This can be viewed as another example of the binding problem. The frequencies that have arisen from the same environmental source (such as a voice) have to be bound together as a single sound, and the frequency components at one moment of time have to be connected to ("bound to") components coming later to create a coherent auditory stream. Only by knowing what goes with what can our brains correctly recognize the set of sounds that are present and the properties of each one.
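
Here is a toy sketch (in Python) of the "what goes with what" problem for components spread over time: each time-frequency component is linked to the stream whose most recent component is closest in frequency, and otherwise starts a new stream. The components and the 50 Hz linking threshold are invented for the example; real CASA systems use a much richer set of grouping cues (harmonicity, common onset, common location, and so on).

    def group_into_streams(components, max_jump_hz=50.0):
        # components: (time_frame, frequency_hz) pairs, sorted by time.
        # Each component joins the stream whose last frequency is closest,
        # provided the jump is small; otherwise it starts a new stream.
        streams = []
        for t, f in components:
            nearby = [s for s in streams if abs(s[-1][1] - f) <= max_jump_hz]
            if nearby:
                min(nearby, key=lambda s: abs(s[-1][1] - f)).append((t, f))
            else:
                streams.append([(t, f)])
        return streams

    # Two interleaved "voices": one near 200 Hz, one near 400 Hz.
    components = [(0, 200), (0, 400), (1, 205), (1, 395), (2, 210), (2, 390)]
    for stream in group_into_streams(components):
        print(stream)
    # [(0, 200), (1, 205), (2, 210)]  and  [(0, 400), (1, 395), (2, 390)]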

One proposed mechanism for how the brain binds different features together involves the simultaneity of activity in different neural circuits (e.g., Wang and Brown, 2006, Ch. 10). Each "feature detector" is thought to be a neural circuit that acts as an oscillator, possessing a particular frequency and phase. Whenever two circuits oscillate in exact synchrony (with respect to frequency and phase), some part of the brain that can detect this synchrony will treat the two circuits as signalling features of the same sound. For simplicity, I have described the case of only two synchronized neural circuits. But the reasoning applies just as well when there are more than two.
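
The following small sketch (in Python) illustrates the synchrony idea in its barest form: each feature detector is represented by an oscillator with a frequency and a phase, and any detectors whose oscillators agree in both are treated as signalling features of the same sound. The particular numbers are arbitrary; for worked-out oscillatory-correlation models, see Wang and Brown (2006, Ch. 10).

    import math

    def synchronized(osc_a, osc_b, tol=1e-6):
        # Two detectors "signal the same sound" if their oscillators agree
        # in both frequency and phase (here, to within a small tolerance).
        return (abs(osc_a["freq"] - osc_b["freq"]) < tol and
                abs(osc_a["phase"] - osc_b["phase"]) < tol)

    detectors = {
        "pitch near 200 Hz":  {"freq": 40.0, "phase": 0.0},
        "location: 30° left": {"freq": 40.0, "phase": 0.0},      # in sync
        "pitch near 450 Hz":  {"freq": 40.0, "phase": math.pi},  # out of phase
    }

    names = list(detectors)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if synchronized(detectors[names[i]], detectors[names[j]]):
                print(names[i], "and", names[j], "are bound together")
    # Only the 200 Hz pitch and the leftward location are reported as bound.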

Speech recognition is only one of the problems to which CASA has been applied. For example, CASA can be a component of automatic music transcription (the automatic conversion of recorded music into a printed representation). In this sub-area, methods analogous to those used in speech recognition have been employed.

A number of scientific meetings have been devoted to CASA. Two are reported in the following books:

Rosenthal, D. F., and Okuno, H. G. (Eds.) (1998). Computational Auditory Scene Analysis. Mahwah, NJ: Lawrence Erlbaum.

Divenyi, P. (Ed.) (2004). Speech Separation by Humans and Machines. Kluwer Academic Publishers.

The following book is intended to be a comprehensive account of CASA, with each chapter written by an expert on that topic.

Wang, D., and Brown, G. J. (Eds.) (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: John Wiley.

To find out more about CASA, use the following search terms (without the angle brackets) in any Internet search engine:

<"computational auditory scene analysis">

<CASA "speech recognition">

<CASA "music transcription">
