Catalogue of Artificial Intelligence Techniques

   


Sinewave synthesis

Keywords: haskins, pisoni, remez, rubin, sinewave, sws, synthesis

Categories: Speech


Author(s): Maciej Nowakowski

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (the main bands of energy) with pure-tone whistles. The first sinewave synthesis program (SWS), built to create stimuli for perceptual experiments automatically, was developed by Philip Rubin at Haskins Laboratories in the 1970s. Robert Remez, Philip Rubin, David Pisoni, and colleagues subsequently used this program to show that listeners can perceive continuous speech without traditional speech cues.

Most familiar synthetic speech aims to copy natural acoustic elements meticulously, which is why it sounds voicelike despite the mechanical quality of its articulation. In contrast, sinewave replication discards all of the acoustic attributes of natural speech except one: the changing pattern of vocal resonances. With three or four sinusoids fitted to the pattern of resonance changes, the resulting signals preserve the dynamic properties of utterances without replicating the short-term acoustic products of vocalization.
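The idea of tracking resonances with a few sinusoids can be illustrated with a minimal sketch. It is not the original SWS program; it assumes that formant frequency and amplitude tracks have already been estimated (for example by a separate formant analysis, not shown) and are given as one value per analysis frame. The function name, the parameters frame_rate and sample_rate, and the track values in the usage lines are hypothetical choices made for illustration.

```python
import math


def synthesize_sinewaves(freq_tracks, amp_tracks, frame_rate=100, sample_rate=8000):
    """Sum one time-varying sinusoid per resonance track.

    freq_tracks[k][i] is the frequency in Hz of tone k at frame i, and
    amp_tracks[k][i] is its linear amplitude at that frame. Each track
    needs at least two frames. These inputs are assumed to come from a
    separate formant analysis; nothing here estimates them.
    """
    n_frames = len(freq_tracks[0])
    samples_per_frame = sample_rate // frame_rate
    n_samples = (n_frames - 1) * samples_per_frame
    output = [0.0] * n_samples

    for freqs, amps in zip(freq_tracks, amp_tracks):
        phase = 0.0  # accumulated phase keeps the tone continuous across frames
        for n in range(n_samples):
            frame, offset = divmod(n, samples_per_frame)
            t = offset / samples_per_frame
            # Linearly interpolate frequency and amplitude between frames.
            freq = freqs[frame] + t * (freqs[frame + 1] - freqs[frame])
            amp = amps[frame] + t * (amps[frame + 1] - amps[frame])
            output[n] += amp * math.sin(phase)
            phase += 2.0 * math.pi * freq / sample_rate
    return output


# Hypothetical tracks (not measured from speech): three tones, 11 frames.
f1 = [300 + 40 * i for i in range(11)]   # rising first tone
f2 = [2200 - 90 * i for i in range(11)]  # falling second tone
f3 = [2900] * 11                         # steady third tone
amps = [[0.5] * 11, [0.3] * 11, [0.2] * 11]
signal = synthesize_sinewaves([f1, f2, f3], amps)
```

Accumulating phase sample by sample, rather than computing sin(2*pi*f*t) directly, avoids clicks when a tone's frequency changes from one frame to the next. The output contains only the gliding tones; no voicing, noise bursts, or harmonic structure is modeled, which matches the description above.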

If speech perception depended upon the particular sounds produced by talkers (the pop of the "p", the hiss of the "s", the hum of the "m", the click of the "k", or the buzz of the "z"), then sinusoidal signals lacking these attributes should not evoke impressions of consonants, vowels, or words. Indeed, listeners who were simply asked to identify sinewave signals reported "bad electronic music," "radio interference," and the like, with no speechlike qualities. However, when asked to transcribe a "strangely-synthesized sentence," listeners readily reported the words of the natural utterances on which the sinewave signals were modeled.

The spectrogram above illustrates natural speech acoustics.

A sinewave replica of a natural utterance (shown above) discards the fine-grain acoustic properties of speech, retaining only the coarse-grain changes in the spectra over time.

