Motor Theory of Speech Perception (A. Liberman 1985)

Speech Perception
Module by: David Lane

For most of us, listening to speech is an effortless task. Generally speaking, speech perception proceeds through a series of stages in which acoustic cues are extracted and stored in sensory memory and then mapped onto linguistic information. When air from the lungs is pushed into the larynx, across the vocal cords, and into the mouth and nose, different types of sounds are produced. The different qualities of the sounds are represented in formants, which can be pictured on a graph that has time on the x-axis and frequency on the y-axis. Perception of the sound will vary as the frequency with which the air vibrates varies across time. Because vocal tracts vary somewhat between people (just as shoe size or height do), one person's vocal cords may be shorter than another's, or the roof of someone's mouth may be higher than another's, and the end result is that there are individual differences in how various sounds are produced. You probably know someone whose voice is slightly lower or higher in pitch than yours. Pitch is the psychological correlate of the physical acoustic cue of frequency. The more frequently the vibrations of air occur for a particular sound, the higher in pitch it will be perceived. Less frequent vibrations are perceived as being lower in pitch. When language is the sound being processed, the formants are mapped onto phonemes, which are the smallest units of sound in a language. For example, in English the phonemes in the word "glad" are /g/, /l/, /a/, and /d/.
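The relation between vibration frequency and pitch described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the pure tones stand in for real voices, the function names `sine_wave` and `estimate_frequency` are hypothetical, and counting zero crossings is a deliberately crude way to estimate frequency.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an arbitrary choice for this sketch)

def sine_wave(frequency_hz, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Return samples of a pure tone, a simplified stand-in for a voice."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * frequency_hz * t / sample_rate) for t in range(n)]

def estimate_frequency(samples, sample_rate=SAMPLE_RATE):
    """Estimate frequency in Hz by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / sample_rate)

low_voice = estimate_frequency(sine_wave(100))   # fewer vibrations per second
high_voice = estimate_frequency(sine_wave(200))  # more vibrations per second
print(f"lower-pitched tone: about {low_voice:.0f} Hz")
print(f"higher-pitched tone: about {high_voice:.0f} Hz")
```

The tone with more vibrations per second yields the larger frequency estimate, which corresponds to the higher perceived pitch.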

The nature of speech, however, has provided researchers of language with a number of puzzles, some of which have been researched for more than forty years.

Note: To demonstrate one of these problems, click here. The waveform you see shows speech as a function of amplitude, which is measured in decibels (dB), and frequency of the sound waves, measured in hertz (Hz). As the cursor passes over the waveform, you may notice various sections that correspond to the words and individual sounds you hear; for example, you can detect where the word "show" begins and where the word "money" ends. After a bit of experimentation, however, you notice that it is difficult to pinpoint precisely where one phoneme ends and another begins. Try to find the "th" sound in the word "the", for example; and where can the "uh" sound in "the" be located? Often the acoustic features of one sound will spread across those of another sound, leading to the problem of linearity; that is, if phonemes were produced one at a time, or linearly, there should be a single corresponding section in the waveform for each phoneme. As "the" shows, however, speech is not linear.

Another problem that investigators have studied is the problem of invariance. Invariance refers to a particular phoneme having one and only one waveform representation; that is, the phoneme /i/ (the "ee" sound in "me") should have the same amplitude and frequency as the same phoneme in "money". As you can see again, that is not the case; the two differ. The plosives, or stop consonants, /b/, /d/, /g/, /k/, provide particular problems for the invariance assumption.

Note: To download free sound-processing software and record your own sentences, in order to see the problems of linearity and invariance in your own speech, click here.

The problems of linearity and invariance are brought about by co-articulation, the influence of the articulation (pronunciation) of one phoneme on that of another phoneme. Because phonemes cannot always be isolated in a spectrogram and can vary from one context to another depending on neighboring phonemes, speakers' rate of speech, and loudness, perceptually identifying one phoneme among a stream of others, the process of segmentation, also seems like a daunting task. Theories and models of speech perception have to be able to account for how segmentation occurs in order to provide an adequate account of speech perception. We will discuss some accounts of speech perception below.

Some clues as to how phonemes are identified arise from investigations into the ability to perceive voiced consonants, or consonants in which the vocal cords vibrate. To understand the concept of voicing, say the phoneme /p/ followed by the phoneme /b/ while touching your throat. You will feel the vibration of your vocal cords during /b/ but not during /p/. Both of these phonemes are bilabial; that is, they are produced by pressing the lips together, and are released with a puff of air. Since the discriminating difference between these two phonemes relevant to English is in their voicing, the ability to adequately perceive voicing is crucial for an adept listener; for example, as the rate of speech increases, listeners are able to shift their criterion of what constitutes a voiceless phoneme. The criterion shift allows them to accept phonemes that are pronounced with shorter voice onset times (VOT), that is, shorter delays before the vocal cords begin to vibrate. Although shifting criteria during the perception of phonemes may be one process that allows accurate identification of phonemes despite changing conditions, what supports the criterion shifts is still a matter of investigation. These skills become highly automatic and are probably acquired and fine-tuned during early childhood, a topic we talk about in infant speech perception.
Infant language study: Introduction
Infant language study: High Amplitude Sucking Method
Infant language study: Head Turn Method
Infant language study: Preferential Looking Method
(Video clips courtesy of the late Peter W. Jusczyk and the Johns Hopkins University).

Is speech special?
In visual perception, people discriminate among colors based on the frequency of light waves. Low frequencies are perceived as red and high frequencies are perceived as violet.
Figure 1

As we move from low to high frequencies, we perceive a continuum of colors from red to violet. Notice that as we move from red to orange, we pass through a middle ground that we call "red orange." Speech sounds lie on a physical continuum as well. For example, an important dimension in speech perception is voice onset time. This refers to the time between the beginning of the pronunciation of the word and the onset of the vibration of the vocal cords. For example, when you say "ba" your vocal cords vibrate right from the start. When you say "pa" your vocal cords do not vibrate until after a short delay. To see this for yourself, put one of your fingers on your throat over your vocal cords and say "ba" and then "pa."

The only difference between the sound "ba" and the sound "pa" is that the voice onset time for "ba" is shorter than the voice onset time for "pa". An important difference between speech perception and visual perception is that we do not hear speech sounds as falling halfway between a "ba" and a "pa." We hear a sound one way or the other. This means that one range of voice onset times is perceived as "ba" and a different range of voice onset times is perceived as "pa". This phenomenon is called categorical perception and is very helpful for understanding speech.
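The idea that ranges of voice onset times map onto discrete labels can be sketched as a simple threshold rule. This is a minimal sketch, not a model of the listener: the 25 ms boundary and the name `label_syllable` are assumptions made for illustration, since the text does not give a specific boundary value.

```python
# Hypothetical category boundary in milliseconds (assumed for this sketch).
BOUNDARY_MS = 25

def label_syllable(vot_ms, boundary_ms=BOUNDARY_MS):
    """Label a syllable as "ba" or "pa" from its voice onset time alone."""
    return "ba" if vot_ms < boundary_ms else "pa"

# A continuum of voice onset times in equal 10 ms steps.
continuum = list(range(0, 61, 10))
labels = [label_syllable(vot) for vot in continuum]

for vot, label in zip(continuum, labels):
    print(f"VOT {vot:2d} ms -> heard as {label!r}")
```

Although the input changes in equal physical steps, the output switches abruptly from one label to the other, with no in-between category.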

The sounds "ba" and "pa" differ on the continuous dimension of voice onset time. The sounds "ga" and "da" also differ on a continuous dimension. However, the continuous dimension for these stimuli is more complex than the dimension of voice onset time (it is called the second formant but that is a little beyond the scope of this text). What is important here is that there is a continuum of sounds from "da" to "ga." The following demonstration uses computer generated speech sounds. Ten sounds were generated in equal steps from "da" to "ga." The experiment uses sounds numbered 1, 4, 7, and 10. Sounds 1 and 4 are both heard as "da" whereas sounds 7 and 10 are heard as "ga." In the task, subjects are presented with a randomly-ordered series of sound pairs and asked, for each pair, to judge whether the sounds are the same or different. Since sounds 1 and 4 are both heard as "da" it should be very hard to tell them apart. Therefore, subjects usually judge these sounds as identical. By contrast, Sound 4 is heard as "da" while Sound 7 is heard as "ga." Since Sound 4 and Sound 7 are on opposite sides of the "categorical boundary" it is easier to hear the difference between these sounds than the difference between Sounds 1 and 4. This occurs even though the physical difference between Sounds 1 and 4 is the same as the difference between Sounds 4 and 7. By similar logic, the difference between Sounds 7 and 10 should be hard to hear.
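The logic of the demonstration, in which pairs are easy to tell apart only when they straddle the category boundary, can be written out as a short sketch. The exact position of the boundary is an assumption here (the text says only that it lies somewhere between Sounds 4 and 7), and the function names are hypothetical.

```python
# Assumed boundary: sounds 1-5 are heard as "da", sounds 6-10 as "ga".
LAST_DA_SOUND = 5

def category(sound_number, last_da=LAST_DA_SOUND):
    """Return the label a listener is assumed to give to a sound on the continuum."""
    return "da" if sound_number <= last_da else "ga"

def predicted_judgment(first, second):
    """Predict a same/different judgment from the category labels alone."""
    return "same" if category(first) == category(second) else "different"

pairs = [(1, 4), (4, 7), (7, 10)]  # every pair is three steps apart
predictions = {pair: predicted_judgment(*pair) for pair in pairs}

for (first, second), judgment in predictions.items():
    print(f"Sounds {first} and {second}: predicted judgment is {judgment!r}")
```

All three pairs are physically three steps apart, yet only the pair that crosses the assumed boundary is predicted to be heard as different.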
The results from one subject in this demonstration experiment are shown below and can be interpreted as follows: When the comparison was between Sounds 1 and 4, the subject judged them to be different once and the same 4 times. When the comparison was between Sounds 4 and 7 (which cross the border), the subject correctly judged them to be different all 5 times. Finally, in comparing Sounds 7 and 10, the subject always judged the sounds to be the same. Thus, the only time this subject heard a difference between sounds that were three steps apart was for Sounds 4 and 7.
Sound Pair    Judged different    Judged same
1 and 4       1                   4
4 and 7       5                   0
7 and 10      0                   all trials

Not all results are as clear cut as those shown above. Many people need more time to become familiar with the task than is possible in this demonstration. In any case, you should get a sense of how this kind of experiment works.

Note: Try this categorical discrimination task yourself.
The hypothesis that speech is perceptually special has arisen from this phenomenon of categorical perception. Listeners can differentiate between /p/ and /b/; however, distinguishing between different types of /p/ sounds is difficult and, for some, impossible. This pattern is consistent with the pragmatic demands of language; there is a meaning distinction between /p/ and /b/, while the distinction between two variations of /p/ carries no meaning. (There are languages in which two different /p/ sounds are used, and, in such cases, perception would be categorical.)

The first experiment to demonstrate categorical perception was conducted by Liberman, Harris, Hoffman, and Griffith (1957), and in it they presented consonant-vowel syllables along a continuum. The consonants were the stop consonants, or plosives, /b/, /d/, and /g/, followed by /a/; for example, /ba/. When asked to say whether two syllables were the same or different, the participants reported various forms of the same syllable (for example, /ba/) to be the same, whereas syllables from different categories (for example, /ba/ and /da/) were easily discriminated.
Another categorical perception task presents two syllables followed by a probe syllable, and participants have to say which of the first two syllables the probe matches. If the first two sounds are from two different categories (for example, /da/ and /ga/), participants accurately match the probe syllable. If the first two syllables are taken from the same category, however, participants cannot differentiate them well enough to do the matching task, and their performance is at chance.
Does the categorical perception of speech mean that speech is perceived via a specialized speech processor? Kewley-Port and Luce (1984) did not find categorical perception in some non-speech stimuli, indicating that there may be something special about speech.

For there to be a specialized speech processor, categorical perception should occur during the perception of all phonemes. However, Fry, Abramson, Eimas, and Liberman (1962) failed to find categorical perception with a vowel continuum. So vowels and consonants do not behave the same way in that respect. Additionally, chinchillas have been shown to categorically perceive speech, despite their obvious lack of a speech-processing mechanism (Kuhl, 1987).
How is speech perceived?
One theory of how speech is perceived is the Motor Theory of speech perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). The motor theory postulates that speech is perceived by reference to how it is produced; that is, when perceiving speech, listeners access their own knowledge of how phonemes are articulated. Articulatory gestures such as rounding or pressing the lips together are units of perception that directly provide the listener with phonetic information. The motor theory can account for the invariance problem; that is, the ways that phonemes are produced and perceived have more in common than the ways they are acoustically represented and perceived.

What would be the evidence that listeners use articulatory features when perceiving speech? Here, an accidental discovery made by two film technicians led to one of the most robust and widely discussed findings in language processing. A researcher, Harry McGurk, was interested in whether auditory or visual modalities are differentially dominant during infants' perceptual development. To find out, he asked his technician to create a film to test which modality captured infants' attention. In this film, an actor pronounced the syllable "ga" while an auditory "ba" was dubbed over the tape. Would babies pay attention to the "ga" or the "ba"? The process of making the film, however, led to a surprising finding about adults. The technician (and others) did not perceive either a "ga" or a "ba". Rather, the technician perceived a "da".

In an experiment that formally tested this observation, McGurk and MacDonald (1976) showed research participants a video of a person saying a syllable that began with a consonant formed in the back of the mouth at the velum (that is, a velar consonant, "ga") while playing an auditory tape of a consonant that is formed in the front of the mouth at the two lips (that is, a bilabial, "ba"). When viewers were asked what they heard, like the film technician, they replied "da". Perceiving a "da" was the result of combining articulatory information from both visually and auditorily presented stimuli.

Note: You can experience the McGurk effect by clicking here.
(To return to the question Harry McGurk originally asked about infants, neither modality seems to have dominance; infants as young as 5 months old take in the visual and auditory information about words in the same way as adults: both influence perception.)

Although the McGurk effect has been interpreted as evidence that listeners perceive phonetic gestures, an alternative account based on memory has also been raised. Because perceivers have ample experience with both hearing and seeing people speak, they may have built memories of these events that have subsequently become associated with the phoneme's mental representation, so that when the phoneme is perceived, memories based on the visual information are recalled (Massaro, 1987).

To test this possibility, Fowler and Dekle (1991) assigned research participants to one of two experimental conditions. In one, the participants were presented with either a printed "ba" or a printed "ga" syllable while listening to a syllable from the auditory /ba/-/ga/ continuum. In the other, the printed syllables were replaced with their haptic presentations; that is, participants were able to feel how the syllables were being produced. Since listeners have no previously formed associations with how syllables feel when a speaker produces them, the memory account predicts no McGurk-like effect in the haptic condition. The experimenters found no effect of the printed syllables on the auditory ones, as expected, and they found that the feel of how a syllable is produced affected the perception of the auditory syllables, indicating that articulatory gestures are indeed perceived by listeners.

The TRACE model of speech perception (TRACE 1), developed by Jay McClelland and Jeff Elman (1986; Elman & McClelland, 1988), depicts speech perception as a process in which speech units are arranged into levels and interact with each other. There are three levels: features, phonemes, and words. The levels are composed of processing units, or nodes; for example, within the feature level, there are individual nodes that detect voicing.

Nodes that are consistent with each other share excitatory activation; for example, to perceive a /k/ in "cake", the /k/ phoneme and corresponding featural units share excitatory connections. Nodes that are inconsistent with each other share inhibitory links. Inhibitory links connect nodes within the same level. In this example, /k/ would have an inhibitory connection with the vowel sound in "cake", /eI/.

To perceive speech, the featural nodes are activated initially, followed in time by the phoneme and then word nodes. Thus, activation is bottom-up. Activation can also spread top-down, however, and TRACE can model top-down effects such as the fact that context can influence the perception of individual phonemes.
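The interplay of bottom-up activation, within-level inhibition, and top-down feedback can be illustrated with a toy sketch. This is not the TRACE model itself: it collapses the feature level into a fixed evidence value per phoneme, handles only a word's first phoneme, and uses an update rule, rate, and function names (`run_trace_sketch`, `clip`) that are all assumptions made for illustration.

```python
def clip(value):
    """Keep an activation within the range 0 to 1."""
    return max(0.0, min(1.0, value))

def run_trace_sketch(bottom_up, lexicon, steps=10, rate=0.2):
    """Toy interactive activation over one phoneme slot.

    bottom_up: dict mapping phoneme -> featural evidence (0 to 1).
    lexicon:   dict mapping word -> its first phoneme (hypothetical entries).
    """
    phonemes = dict.fromkeys(bottom_up, 0.0)
    words = dict.fromkeys(lexicon, 0.0)
    for _ in range(steps):
        # Bottom-up: each word node is excited by its first phoneme.
        new_words = {w: clip(words[w] + rate * phonemes[p])
                     for w, p in lexicon.items()}
        new_phonemes = {}
        for p in phonemes:
            # Top-down: words that contain the phoneme feed activation back.
            top_down = sum(words[w] for w, first in lexicon.items() if first == p)
            # Within-level inhibition from competing phonemes.
            inhibition = sum(act for q, act in phonemes.items() if q != p)
            new_phonemes[p] = clip(
                phonemes[p] + rate * (bottom_up[p] + top_down - inhibition))
        phonemes, words = new_phonemes, new_words
    return phonemes, words

# An acoustically ambiguous first sound: equal evidence for /k/ and /g/.
ambiguous = {"k": 0.5, "g": 0.5}
phonemes, words = run_trace_sketch(ambiguous, {"cake": "k"})
print(phonemes, words)
```

With equal featural evidence for both phonemes, the only thing that separates them in this sketch is feedback from the word node, so /k/ ends up more active than /g/; with an empty lexicon the two stay tied.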

Perception of speech can be influenced by contextual information, indicating that perception is not strictly bottom-up but can receive feedback from semantic levels of knowledge. In 1970, Warren and Warren took simple sentences, such as "It was found that the wheel was on the axle", removed the /w/ sound from "wheel", and replaced it with a cough. They found that listeners were unable to detect that the phoneme was missing. They found the same effect with the following sentences as well:
It was found that the *eel was on the shoe.
It was found that the *eel was on the orange.
It was found that the *eel was on the table.
Listeners perceived heel, peel, and meal, respectively. Because the perception of the word with the missing phoneme depends on the last word of the sentence, their finding indicates that perception is highly interactive.

Gating Task: A task developed to show the effect of context on spoken word recognition is Gating (Grosjean, 1980). In this task, participants are presented with fragments of a word, of gradually increasing duration (such as 50 ms increments); for example, t - tr - tre - tres - tresp - trespa. Upon hearing each fragment, the participant makes a guess at what the whole word might be. (Have a go at this gating task yourself.) The point at which the person guesses the whole word is called the isolation point. Gating shows the effect of context on spoken word recognition: there is a time difference between identifying a word in isolation and identifying it in a sentence. The time to identify a word in context is about a fifth of a second, whereas it takes a third of a second in isolation. It is thought that the grammar and meaning of the preceding part of the sentence limit the range of possibilities for the gated word, such that it can be identified sooner in a sentence than on its own. The point at which there is only one possible candidate is called the uniqueness point. The uniqueness point and the isolation point need not correspond: on the one hand, the word may be recognized before there is one remaining candidate, if the context is helpful (i.e., strongly biasing); on the other hand, there may be a delay in isolating the word. There is a third point, called the recognition point. This is the point at which the person is confident in their identification of the gated word.

The guesses people make on this task indicate that the perceptual identity of the word is also important to spoken word recognition, even before the context has its effect. In other words, people's early guesses resemble the perceptual aspects of the word and not the contextually signaled candidate.
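The uniqueness point from the gating task, and the way context can narrow the candidate set, can be sketched as follows. The sketch rests on several assumptions: letters stand in for gated stretches of speech, the word lists are hypothetical, and the function name `uniqueness_point` is invented for illustration. It computes only the uniqueness point, not the isolation or recognition points, which depend on the listener's actual guesses.

```python
def uniqueness_point(target, candidates):
    """Return the length of the shortest fragment of `target` that matches
    exactly one candidate word, or None if no fragment is unique."""
    for length in range(1, len(target) + 1):
        fragment = target[:length]
        matches = [word for word in candidates if word.startswith(fragment)]
        if len(matches) == 1:
            return length
    return None

# Hypothetical mini-lexicon for a word heard in isolation.
LEXICON = ["trespass", "tress", "trestle", "treasure", "trend", "trap"]

# Hypothetical candidates left after a sentence context has ruled some out.
CONTEXT_CANDIDATES = ["trespass", "treasure", "trap"]

in_isolation = uniqueness_point("trespass", LEXICON)
in_context = uniqueness_point("trespass", CONTEXT_CANDIDATES)
print("fragment needed in isolation:", "trespass"[:in_isolation])
print("fragment needed in context:  ", "trespass"[:in_context])
```

Because the context-restricted list contains fewer competitors, a shorter fragment is enough to leave a single candidate, which parallels the earlier identification of words heard in a sentence.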

Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-368.
Kewley-Port, D., & Luce, P. A. (1984). Time-varying features of initial stop consonants in auditory running spectra: A first report. Perception and Psychophysics, 35, 353-360.
Fry, D. B., Abramson, A. S., Eimas, P. D., & Liberman, A. M. (1962). The identification and discrimination of synthetic vowels. Language and Speech, 5, 171-189.
Kuhl, P. K. (1987). The special mechanisms debate in speech research: Categorization tests on animals and infants. In S. Harnad (Ed.), Categorical perception: The groundwork of cognition (pp. 355-386). Cambridge: Cambridge University Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.
Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17, 816-828.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
Elman, J. L., & McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-165.
Warren, R. M., & Warren, R. P. (1970). Auditory illusions and confusions. Scientific American, 223, 30-36.
Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception and Psychophysics, 28, 267-283.