SONIC REALISM
PHOTO 1. A signal that consists of 3,000 Hz and 1,000 Hz
sine waves. This is the “proper” signal and is what the ear
hears if the acoustic path lengths of the tweeter and woofer
are the same.
PHOTO 2. The same signal as Photo 1 except that the acoustic
path length for the 3,000 Hz sine wave has changed by about
two inches (about 180 degrees) relative to the 1,000 Hz sine
wave. The shape is considerably different from Photo 1.
This makes sense only if there is a
difference between a person’s voice
and a reproduction of a person’s
voice. The only significant difference
is phase. The reflections of a person’s
voice do not suffer from phase distortion because all the signals come from
a single point. There is no break-up of
the voice signal into multiple frequency channels where each channel has a
different physical location.
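The effect shown in the photos is easy to reproduce numerically. Here is a minimal sketch (Python with NumPy; the sample rate and unit amplitudes are my own assumptions): flipping the 3,000 Hz component by 180 degrees changes the waveform's shape completely, yet the amplitude spectrum is unchanged.

```python
import numpy as np

fs = 48_000                     # sample rate in Hz (assumed)
t = np.arange(fs) / fs          # one second of samples

# Photo 1: 1,000 Hz and 3,000 Hz sine waves with aligned phase
proper = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t)

# Photo 2: the same two tones, but the 3,000 Hz component shifted
# 180 degrees, as a two-inch path-length difference would do
shifted = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 3000 * t + np.pi)

# The waveforms differ substantially, sample by sample...
print(np.max(np.abs(proper - shifted)))           # ≈ 2.0

# ...but their amplitude spectra are identical
spec_a = np.abs(np.fft.rfft(proper))
spec_b = np.abs(np.fft.rfft(shifted))
print(np.allclose(spec_a, spec_b, atol=1e-5))     # True
```

Amplitude spectra alone, in other words, cannot capture what a phase shift between drivers does to the waveform.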
Then there is the strange effect
heard when recording a conversation
in an ordinary room. If the
microphone is some distance from the
person speaking, it often sounds as if
the recording was made in a tin can or
at the bottom of a well. (Police
undercover recordings are a typical
example.) Sometimes the speech is
nearly unintelligible. However, if you
were present during the recording,
you didn’t notice anything odd. Why?
There are two factors at work in
this second example: reflection and
auto-correlation. Sound reflects well
from hard surfaces. Often, there is
only about 1 dB or so of loss
(about 90 percent of the sound pressure reflected). If a microphone is held about a foot from the
mouth of someone speaking, the
sound level is about 65 to 70 dB for
normal speech. This level follows an
inverse-square relationship as the
distance increases. At two feet, the
sound pressure is halved (6 dB down), and so
forth. If there is a wall four feet away
from the person speaking, the
echo returns over a path of roughly seven feet
(four feet out, about three feet back to the
microphone), so it arrives about 17 dB down
(excluding wall losses) and delayed by about 5 ms. This is a
very substantial amplitude difference.
However, if the microphone is four
feet from the person and there is a wall
four feet beyond the microphone, the echo
path is about 12 feet against a four-foot direct
path, so the echo will be only about 10 dB down
(and delayed by about 7 ms). This is not a very
large volume difference when you consider that speech varies considerably
in amplitude. Thus, the reflection is
perceived as nearly as loud as
the person speaking. This can be
compared to one person speaking
from four feet and another speaking
from 12 feet. It is clear that in such a
situation, both people will be heard at
a reasonably similar loudness. So, the
tape recording presents the speaker
and the echo at nearly the same
volume. That explains the tin can effect:
it is reverb. This reasoning seems
acceptable but it doesn’t explain why
the reverb isn’t heard by people in the
room when the recording was made.
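The arithmetic above can be sketched in a few lines (Python; the helper names are mine, and it assumes pure inverse-square spreading, no wall losses, and a speed of sound of about 1,130 ft/s; exact figures depend on the geometry assumed):

```python
import math

C = 1130.0  # approximate speed of sound in ft/s at room temperature

def level_drop_db(direct_ft: float, echo_ft: float) -> float:
    """dB drop of the echo relative to the direct sound,
    assuming inverse-square spreading only (wall losses ignored)."""
    return 20 * math.log10(echo_ft / direct_ft)

def delay_ms(direct_ft: float, echo_ft: float) -> float:
    """Extra arrival time of the echo over the direct sound, in ms."""
    return (echo_ft - direct_ft) / C * 1000

# Mic a foot from the mouth, wall four feet away: echo path ~ 4 + 3 ft
print(round(level_drop_db(1, 7), 1), round(delay_ms(1, 7), 1))

# Mic four feet away, wall four feet beyond it: direct 4 ft, echo ~12 ft
print(round(level_drop_db(4, 12), 1), round(delay_ms(4, 12), 1))
```

The first case comes out roughly 17 dB down; the second only about 9.5 dB, the same ratio as two talkers at 4 and 12 feet.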
Auto-Correlation
The reason reverb isn’t heard by
people in the room is because the ear
apparently possesses an auto-correlation function (or its equivalent).
This is clearly demonstrated by the fact
that humans do not perceive short-term echoes (delays of less than about
40 ms). But we certainly do perceive
echoes with delays longer than that. This
seems to conflict with the demonstrated sensitivity of the ear to delays of
about 14 µs. That is a difference of a
factor of nearly 3,000. It implies that
something is happening to sounds that
are delayed from 14 µs to about 40 ms.
Auto-correlation is a technique
that allows the removal of echoes. (This
is an important issue for the
telephone companies, who have done
a lot of work on the subject.) The concept is fairly simple. If a delayed signal
is identical to an earlier signal, then the
second signal must be an echo and
can be removed. The method of
determining “identical” is a statistical
process called auto-correlation. It
compares, or correlates, different parts
of the signal, looking for similarities. If
the different signal parts are similar,
then there is a high auto-correlation.
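A small numerical sketch shows the idea (Python with NumPy; random noise stands in for speech, and the amplitudes and delay are my own choices): correlating a signal against delayed copies of itself produces a sharp peak at the echo's lag, which is exactly the information needed to subtract the echo back out.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                           # sample rate in Hz (assumed)
x = rng.standard_normal(fs)         # one second of noise standing in for speech

# Simulate a microphone picking up the voice plus an 8 ms echo
delay = int(0.008 * fs)
y = x.copy()
y[delay:] += 0.5 * x[:-delay]       # echo at half amplitude

# Correlate the signal with itself at each candidate lag (1-50 ms);
# a strong off-zero peak marks an echo and reveals its delay
lags = np.arange(1, int(0.050 * fs))
ac = np.array([np.dot(y[lag:], y[:-lag]) for lag in lags])
found = lags[np.argmax(ac)]
print(found / fs * 1000)            # 8.0 (ms)
```

With the lag and relative amplitude known, an echo canceller can subtract a scaled, delayed copy of the signal from itself, which is essentially the approach the telephone work mentioned above takes.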