Current theories that try to model vowel and consonant identification are centred around the segment proper and the transitions between segments. Papers that discuss these theories indeed do not mention speech beyond the nearest consonant-vowel transition (e.g., for vowels Strange, 1989; Fox, 1989; Andruski and Nearey, 1992; see also Van Son, 1993a,b; Harrington and Cassidy, 1994; for consonants Pickett et al., 1994). In general, these papers use a subdivision of the speech signal into "proper segments", whose articulation is dominated by a single underlying phoneme, and the transitions between these proper segments, whose articulatory properties are determined by a changing mix of the two flanking phonemes. In papers on vowel production and recognition, the "segment proper" is usually called the vowel kernel. The boundaries between the vowel kernel and the transitions are ill-defined (see, e.g., Benguerel and McFadden, 1989).
It has long been known that the transitions to and from neighbouring segments, especially vowels, are often essential for the correct identification of consonants. Many studies have investigated the contributions of acoustic cues from neighbouring vocalic transitions to consonant identification and their importance relative to cues from the consonantal segment proper (e.g., Cooper et al., 1952; Delattre et al., 1955; Ohde and Sharf, 1977; Pols and Schouten, 1978; Pols, 1979; Miller and Bear, 1983; Mack and Blumstein, 1983; Tarter et al., 1983; Polka and Strange, 1985; Diehl and Walsh, 1989; Mann and Soli, 1991; Nossair and Zahorian, 1991; Ohde, 1994; Cassidy and Harrington, 1995). From these studies it becomes clear that human listeners are quite able to identify many consonants from vocalic transitions alone. It is also clear that listeners, when such cues are present, do use cues from outside the consonantal segment proper to identify it. However, only rarely is the influence of the next vowel itself on consonant identification, as opposed to only the vocalic transitions, investigated or acknowledged (Ohde and Sharf, 1977; Mann and Soli, 1991; Ohde, 1994). An exception might be made for voicing contrasts, where the duration and F0 trajectory of the whole neighbouring vowel are generally used to explain identification results (Lisker, 1986; Van Santen, 1993).
There has been a long-standing discussion about whether consonant-vowel transitions (both CV and VC) are used in vowel identification. Many studies do not acknowledge a role for the transitions (Lehiste and Peterson, 1961; Nearey and Assmann, 1986; Miller, 1989; Nearey, 1989; Andruski and Nearey, 1992). However, other studies stress the importance of these transitions for correct vowel identification (Lindblom and Studdert-Kennedy, 1967; Strange et al., 1976; Gottfried and Strange, 1980; Strange et al., 1983; Pols et al., 1984; Verbrugge and Rakerd, 1986; Benguerel and McFadden, 1989; Di Benedetto, 1989; Fox, 1989; Strange, 1989a,b; Jenkins et al., 1994; see also Van Son, 1993a,b). Even these latter studies do not assess the importance of speech beyond the vocalic part of the transition.
From studies of speech production, it is clear that the effects of coarticulation and assimilation affect complete phoneme segments. The often profound changes induced by coarticulation do not seem to bother the listeners. Identification does not seem to deteriorate from coarticulation (Gottfried and Strange, 1980; Macchi, 1980; Strange and Gottfried, 1980; Strange, 1989a). Even the spread of features like rounding and nasality over other segments seems not to deteriorate identification (Manuel, 1995). This is remarkable, because influences are often strong and can range well beyond the neighbouring segments (e.g., Öhman, 1966, 1967; Keating et al., 1994). Complementing the variation in the pronunciation of individual phonetic segments are regularities in the interaction between phonemic segments that could be used in the "reconstruction" of the intended phonemes. For example, the duration of a vowel is linked to the voicing of a following consonant (Lisker, 1986; Van Santen, 1993). Vowels can interact across intervening consonant clusters (Öhman, 1966, 1967; Benguerel and McFadden, 1989), and the articulation of intervocalic consonants can be described as a perturbation of the vowel-vowel trajectory (Öhman, 1966, 1967; Keating et al., 1994). One of the more consistent regularities of Consonant-Vowel sequences can be described by Locus Equations that correlate F2 values in the centre of the vowel realization with the values found at the start of the CV transition (e.g., Schouten and Pols, 1979a,b, 1981; Sussman et al., 1991, 1993, 1995).
All studies point towards a universal occurrence of across-segment regularities in speech production. It is natural to ask whether these regularities are used by listeners when they try to identify the individual segments.
From a recent review of the literature on vowel identification (Van Son, 1993a,b), it became clear that the experiments that are generally thought to support the importance of the transitions for vowel identification in fact could not distinguish between the effects of the vowel on- or offset transitions and the effects of the consonantal context itself (from Lindblom and Studdert-Kennedy, 1967 to Andruski and Nearey, 1992). Where such a distinction would have been possible (e.g., Pols and Van Son, 1993; Van Son and Pols, 1993), the results could show a detrimental effect of the presence of (synthetic) transitions without an appropriate context. Studies on speaker-normalization do show an influence of sentence context (Verbrugge et al., 1976), although in this case too it seems that identification deteriorated from it.
Abstracting from specific theories on coarticulation and assimilation, the question is whether listeners can, and do, use contextual speech from beyond the nearest transitions to identify individual phonemes. In an experimental study, this translates into the question whether the presence or absence of (the original) context influences identification. Ohde and Sharf (1977, see also the comments in Pols and Schouten, 1978; Pols, 1979; and the thesis of Klaassen-Don, 1983) investigate this question using plosives and a few vowels in CV and VC tokens uttered in isolation. They do find a relatively small effect of context on the identification of vowels and consonants. However, due to the limited inventory and the fact that the (non-word) syllables were pronounced in isolation, it is possible that the context effects could be much more salient in "normal" speech with a considerably larger phoneme inventory and more coarticulation and reduction.
There are other studies that supply information about the influence of context on the identification of vowels (e.g., Benguerel and McFadden, 1989; Kuwabara, 1983, 1985, 1993; Huang, 1991, 1992) or consonants (Mann and Soli, 1991; Ohde, 1994). The results of these studies too suggest that listeners do indeed use cues from speech originating beyond the nearest transitions. These latter studies were not designed to answer this particular question, so there are confounding factors that make extrapolation difficult. The most problematic factor is generally a lack of information on segmentation procedures.
                           Large          Large         Small
                           ΔF2 <= -175    ΔF2 >= 175    -85 <= ΔF2 <= 85    Total
Large   ΔF1 <= -125             1              -              -                 1
Large   ΔF1 >= 125             19             20             20                59
Small   -65 <= ΔF1 <= 65       20             20             20                60
Total                          40             40             40               120
In the present study, we tried to avoid these problems by using connected read speech (from a long, meaningful text). The downside of this approach is that it is nearly impossible to control all factors. In a normal text, the distribution of individual phonemes and phoneme combinations is highly unbalanced and many phonotactically allowed combinations will be missed. Still, it is the best way to ensure that the results are relevant to natural speech situations.
In natural speech, listeners are quite good at inferring the context of a segment. Very small fractions of neighbouring segments are often sufficient to identify the context with high reliability (Ohde and Sharf, 1977; Pols and Schouten, 1978; Pols, 1979). On the other hand, in a full sentence, or even in words or syllables, the intended words can often be guessed, even when individual segments are not intelligible. This "lexical" information can be used to "correct" the identification of the individual segments. In an experiment that aims at assessing the use of acoustical information from the context for the identification of individual phonemic segments, enough speech must be presented to allow for the identification of the context, but not enough to allow for the identification of the original syllable or word. This can be achieved by using fragments of the context beyond the transitions, and at the same time ignoring word and syllable boundaries.
In this paper we will try to answer the question whether listeners indeed use speech sounds beyond the consonant-vowel transitions to identify both the vowel and the consonant. As a first step in this direction we will limit ourselves to speech fragments from the nearest neighbours of the target segment. Our experimental design is comparable to the design used by Ohde and Sharf (1977). However, we will use a larger inventory of CVC fragments which are taken from continuous read speech.
Cons     C1 (+/-)       C2 (+/-)       Vowel    V (+/-)
f/v       14 (4/10)      11 (4/7)      A         10 (4/6)
s/z        8 (3/5)        9 (5/4)      a:        34 (18/16)
S          4 (3/1)        0 (0/0)      E         16 (7/9)
x          3 (2/1)        5 (1/4)      i         19 (13/6)
h          5 (1/4)        1 (1/0)      u          1 (0/1)
p/b       10 (6/4)        2 (1/1)      o:        37 (14/23)
t/d       16 (7/9)       12 (8/4)      y          3 (3/0)
k          4 (0/4)        5 (3/2)
m         12 (7/5)        1 (1/0)
n         17 (11/6)      11 (4/7)
r         10 (7/3)       35 (13/22)
l          3 (3/0)       18 (10/8)
w          7 (2/5)        6 (6/0)
j          2 (1/1)        0 (0/0)
N          0 (0/0)        3 (1/2)
#          5 (2/3)        1 (1/0)
Total    120 (59/61)    120 (59/61)             120 (59/61)
Tokens were constructed using vowels and their context from a pre-existing corpus (Van Son and Pols, 1990). The segments were taken from two readings of a long, informative text (844 words), read by a single, professional speaker. The speech was recorded on a commercial Sony PCM recorder, low-pass filtered at 4.5 kHz and digitized at 10 kHz, with 12 bit resolution. Subsequent storage, handling, and editing were done in digital form only.
Token     Pre-Voc.   On-Glide     Vowel     Off-Glide    Post-Voc.   Min.       Median
          Cons.      Transition   Kernel    Transition   Cons.       Duration   Duration
Kernel*   -          -            50        -            -            50         50
V*        -          >=15         50        >=15         -            80        112
CVC*      10         >=25         50        >=25         10          120        152
CT+       10         >=25         -         -            -            35         41
CCT+      25         >=25         -         -            -            50         56
CV*+      10         >=25         50        -            -            85         91
CCV+      25         >=25         50        -            -           100        106
TC+       -          -            -         >=25         10           35         41
TCC+      -          -            -         >=25         25           50         56
VC*+      -          -            50        >=25         10           85         91
VCC+      -          -            50        >=25         25          100        106
The actual tokens presented to the listeners were constructed from these speech segments (see table 3). For the vowel identification experiment, the vowel kernel was represented by the central 50 ms of the vowel realization (Kernel). From the complete vowel segment (with a median duration of 132 ms), 10 ms was removed from both sides to eliminate audible traces of the consonant. This was the Isolated Vowel token (V). The CVC token was constructed by adding 10 ms of context to both sides of the original vowel realizations (20 ms with respect to V). CV and VC tokens were constructed by removing, respectively, the vowel off-glide or on-glide transitions from the CVC tokens (leaving the Kernel part, or the central 50 ms, intact).
For both consonant identification experiments, tokens were constructed around the CV and VC boundary, respectively. The shortest tokens contained only the vowel on- or off-glide transition up to, but not including, the central 50 ms (i.e., excluding the Kernel part) and 10 ms of the consonantal context (CT and TC). Longer tokens were constructed by adding either the central 50 ms of the vowel to the transitions (CV and VC, identical to those used in the vowel identification experiment) or an extra 15 ms of the consonant (CCT and TCC, the CC indicates an extended, 25 ms, consonant fragment), or additions at both the vowel and the consonant side (CCV and VCC). Before being presented, all these fragments were windowed with a 2 ms Hanning window at both sides to smooth the on- and off-set of the sounds.
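The token definitions in the two paragraphs above can be summarized in a short sketch. This is an illustration, not the editing procedure actually used. The function names `token_bounds` and `cut_token`, the representation of the signal as a list of samples, and the reading of "a 2 ms Hanning window at both sides" as a 2 ms raised-cosine ramp at each edge are all assumptions made here. Only the durations (50, 10 and 25 ms) and the 10 kHz sampling rate are taken from the text.

```python
import math

SAMPLE_RATE_HZ = 10_000  # digitization rate stated in the text
KERNEL_MS = 50.0         # central part of the vowel (Kernel)
SHORT_C_MS = 10.0        # consonant fragment in C tokens
LONG_C_MS = 25.0         # extended consonant fragment in CC tokens
TRIM_MS = 10.0           # removed from each vowel edge for the V token


def token_bounds(vowel_start_ms, vowel_end_ms):
    """Return hypothetical (start, end) times in ms for every token type."""
    centre = (vowel_start_ms + vowel_end_ms) / 2.0
    k0, k1 = centre - KERNEL_MS / 2.0, centre + KERNEL_MS / 2.0
    return {
        "Kernel": (k0, k1),
        "V": (vowel_start_ms + TRIM_MS, vowel_end_ms - TRIM_MS),
        "CVC": (vowel_start_ms - SHORT_C_MS, vowel_end_ms + SHORT_C_MS),
        "CT": (vowel_start_ms - SHORT_C_MS, k0),
        "CCT": (vowel_start_ms - LONG_C_MS, k0),
        "CV": (vowel_start_ms - SHORT_C_MS, k1),
        "CCV": (vowel_start_ms - LONG_C_MS, k1),
        "TC": (k1, vowel_end_ms + SHORT_C_MS),
        "TCC": (k1, vowel_end_ms + LONG_C_MS),
        "VC": (k0, vowel_end_ms + SHORT_C_MS),
        "VCC": (k0, vowel_end_ms + LONG_C_MS),
    }


def ms_to_samples(ms, rate=SAMPLE_RATE_HZ):
    """Convert a time in ms to a sample index."""
    return int(round(ms * rate / 1000.0))


def cut_token(signal, start_ms, end_ms, ramp_ms=2.0, rate=SAMPLE_RATE_HZ):
    """Cut a fragment and fade both edges with a raised-cosine (Hann) ramp.

    Assumption: the 2 ms window is applied as a ramp at each edge.
    """
    token = list(signal[ms_to_samples(start_ms, rate):ms_to_samples(end_ms, rate)])
    n_ramp = ms_to_samples(ramp_ms, rate)
    for i in range(min(n_ramp, len(token) // 2)):
        gain = 0.5 * (1.0 - math.cos(math.pi * (i + 0.5) / n_ramp))
        token[i] *= gain        # fade in
        token[-1 - i] *= gain   # fade out
    return token


# Hypothetical vowel of 100 ms, the shortest that fits a 50 ms kernel
# plus two 25 ms transitions.
bounds = token_bounds(100.0, 200.0)
durations = {name: end - start for name, (start, end) in bounds.items()}
kernel_token = cut_token([1.0] * 3000, *bounds["Kernel"])
```

For this shortest admissible vowel, the computed durations coincide with the Min. Duration column of table 3, which gives some support to this reading of the token definitions.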
All subjects that participated in these experiments were students and staff members of our institute. Participation was voluntary and no rewards of any kind were offered. None of the subjects reported hearing problems. None of the subjects had heard the stimuli before and none was acquainted with the structure or construction of the stimuli. Tokens were presented separately for vowel identification, consonant identification in pre-vocalic position (CV-type tokens), and in post-vocalic position (VC-type tokens). For each subject, there was always more than a week between experiments.
There were 600 tokens in the Vowel identification experiment and 480 tokens in both consonant identification experiments. These tokens were presented in a pseudo-random order that was different for each subject. Each experiment was preceded by a sequence of 10 practice tokens, taken to be the last 10 tokens of the particular sequence of the subject.
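The presentation scheme can be sketched as follows. The function name `presentation_sequence`, the use of integer token identifiers, and the use of a subject number as random seed are hypothetical choices made for illustration. The text does not say how the pseudo-random orders were generated.

```python
import random


def presentation_sequence(token_ids, subject_seed, n_practice=10):
    """Return the practice tokens followed by the full randomized sequence.

    The practice tokens are the last n_practice tokens of the subject's
    own sequence, as described in the text. The seeding scheme is an
    assumption of this sketch.
    """
    rng = random.Random(subject_seed)   # separate generator per subject
    order = list(token_ids)
    rng.shuffle(order)                  # subject-specific pseudo-random order
    practice = order[-n_practice:]
    return practice + order


# Vowel identification experiment: 600 tokens plus 10 practice tokens.
sequence = presentation_sequence(range(600), subject_seed=1)
```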
The results of the vowel identification experiment are displayed in figure 1 for all vowel realizations pooled as well as for accented and unaccented vowels separately. All differences between token classes are statistically significant (McNemar's test, p <= 0.01, two-tailed), except for the difference between the CV tokens and either the V (for +/- Accent) or the CVC type tokens (for All and -Accent). The differences between accented and non-accented vowels are statistically significant for the V, CV, and CVC token types (χ² >= 12, df = 1, p <= 0.01).
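For readers unfamiliar with McNemar's test, the paired comparison between two token classes can be sketched as below. The test uses only the discordant responses, i.e., cases where the same realization was identified correctly in one token class and incorrectly in the other. The counts in the example are hypothetical, and the use of a continuity correction is an assumption, since the text does not specify the variant of the test that was used.

```python
import math


def mcnemar(b, c, continuity=True):
    """McNemar's chi-square (df = 1) and its two-tailed p-value.

    b: responses correct in token class A but wrong in class B
    c: responses wrong in token class A but correct in class B
    The continuity correction is an assumption of this sketch.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    chi2 = diff ** 2 / (b + c)
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical discordant counts, not taken from the experiments.
chi2, p = mcnemar(40, 15)
```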
There is a large and statistically significant difference between the error rates for long- and short-vowel realizations in VC-type tokens. There is no such difference for the equally long CV-type tokens. This indicates that the (lack of) difference in intelligibility between long- and short-vowel realizations from different types of tokens cannot be attributed to only the duration of the tokens. The difference in intelligibility between long- and short-vowel tokens can most likely be attributed to diphthongization of the long-vowel realizations (see also Peeters, 1991; Andruski and Nearey, 1993). From figure 2 it is clear that the V-, CV-, and CVC-type stimuli contain enough dynamic "diphthong" information to blur the distinction in "intelligibility" between the realizations of long- and short-vowels. The Kernel- and VC-type stimuli are less adequate for the identification of such diphthongized vowels.
                Kernel    VC     V     CV    CVC    Mean
ΔF2 <= -175       39      30    18     19     17      25
ΔF2 >= 175        31      18     8      7      5      14
|ΔF2| <= 85       29      22    19     16     14      20
Mean              33      23    15     14     12      19
Both CV- and VC-type tokens are better recognized than the central 50 ms alone (Kernel). It is clear that vowel identification benefits more from speech added in front of the kernel (CV-type tokens) than from speech added at the back of the kernel (VC-type tokens), with the error rate of the former almost half that of the latter. The intermediate position of the CV-type tokens between V- and CVC-type tokens in the error rates suggests that the reduction of the error rates in the V- and CVC-type tokens is largely due to the added token onsets. The offset parts of the tokens seem to play only a minor role in reducing the error rate.
The vowel tokens were balanced with respect to the formant excursion sizes, ΔF1 and ΔF2 (see table 1). There was no detectable effect of first formant excursion size on the error rate (ΔF1, not shown). However, there were large differences in identification errors due to differences in the second formant excursion size (ΔF2, see table 4). Differences between the three sets of excursion sizes of the second formant were highly significant for all token types (χ² >= 16, df = 2, p <= 0.01). For each excursion size, the tokens followed the same pattern of error rates as was found for the vowels as a whole (3 rows in table 4, Friedman's Q = 11.5, p <= 0.05, Kendall's concordance, i.e., mean rank correlation coefficient, W = 0.956). In general, the vowel realizations with large negative excursion sizes, ΔF2 <= -175 Hz, induced the highest error rates, and those with large positive excursion sizes, ΔF2 >= +175 Hz, induced the lowest error rates. On average, the vowel realizations with small excursion sizes, |ΔF2| <= 85 Hz, scored in between. This pattern was very consistent over token types (5 columns in table 4, Friedman's Q = 6.4, p <= 0.05, Kendall's concordance W = 0.64).
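The rank statistics quoted above can be recomputed from the error rates in table 4 (without the Mean row and column). The sketch below uses the textbook formulas for Friedman's Q and Kendall's W without a correction for ties. Whether the original analysis used exactly these formulas is an assumption, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from low to high (1-based), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def friedman_q_and_w(blocks):
    """Friedman's Q and Kendall's W for n blocks of k treatments (no tie correction)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, rank in enumerate(average_ranks(block)):
            rank_sums[j] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = q / (n * (k - 1))
    return q, w


# Error rates from table 4; columns are Kernel, VC, V, CV, CVC.
table4 = [
    [39, 30, 18, 19, 17],  # large negative F2 excursion
    [31, 18, 8, 7, 5],     # large positive F2 excursion
    [29, 22, 19, 16, 14],  # small F2 excursion
]

# Three excursion classes as blocks, five token types as treatments.
q_rows, w_rows = friedman_q_and_w(table4)

# Five token types as blocks, three excursion classes as treatments.
q_cols, w_cols = friedman_q_and_w([list(col) for col in zip(*table4)])
```

With these inputs, `q_rows` and `w_rows` come out at about 11.47 and 0.956, and `q_cols` and `w_cols` at 6.4 and 0.64, in agreement with the values reported in the text.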
From these results it is clear that the absolute size of the formant excursions (i.e., formant dynamics, either F1 or F2) was not related to vowel intelligibility in our tokens. The excursion size of the first formant had no effect whatsoever on the error rates. Large positive excursions of the F2 were correlated with low error rates whereas large negative excursions were correlated with high error rates. Small F2 excursions were in between. A possible explanation of this somewhat odd pattern can be found when the distribution of vowels over formant excursions is taken into account. The excursion size of the second formant (ΔF2) correlates strongly with vowel height (Pols and Van Son, 1993; Van Son, 1993a). The strong and consistent effect of the F2 excursion size on vowel identification can therefore be described as a correlation between vowel height and intelligibility. Higher vowels seem to induce fewer errors in our experiment.
The consonant identification results were evaluated with respect to the correctness of the identification using different criteria. In figure 3, the error rates for consonant identification are presented, ignoring voicing errors. All differences between the different classes of tokens (CT, CV, CCT, and CCV) are statistically significant except for the consonants preceding accented vowels in CT- versus CV-type tokens and CCT- versus CCV-type tokens (McNemar's test, p <= 0.01, two-tailed). The differences between consonants preceding accented and non-accented vowels are significant for all token classes (χ² >= 18, df = 1, p <= 0.001). Both identification per se, and identification of only place or manner of articulation, showed the same pattern of errors as identification based on ignoring voicing errors (not shown, Place: Labio-dental /fvpbmw/, Alveolar /sztdnl/, Palatal /Sj/, Velar-Uvular /kxNr/, Glottal /h/; Manner: Fricative /fvszSgh/, Plosive /pbtdk/, Nasal /mnN/, Vowel-like /wljr/). The absolute error rates varied between the error criteria.
                CT     CV    CCT    CCV    Mean
ΔF2 <= -175     51     48     42     38      45
ΔF2 >= 175      34     30     25     19      27
|ΔF2| <= 85     60     57     39     36      49
Mean            48     45     35     31      40
As with the vowels, the excursion size of the first formant of the vowel, ΔF1, had no effect on consonant identification (not shown). There were large differences in consonant identification between tokens with different F2 excursion sizes (χ² >= 40, df = 2, p <= 0.01). All three groups of tokens with different ΔF2 (see table 1) showed the same pattern of identification error rates with respect to token classes (Friedman's Q = 9, p <= 0.05, Kendall's concordance W = 1). Pre-vocalic consonants followed by a vowel with a large positive F2 excursion (ΔF2 >= 175 Hz) always induced the lowest error rate.
The pattern of identification errors for the post-vocalic consonants is similar to that of pre-vocalic consonants (figure 4). All differences between the token types are statistically significant except for the consonants following accented vowels in VC- versus TCC-type tokens (McNemar's test, p <= 0.01, two-tailed). The differences between consonants following accented and non-accented vowels are significant for all token types (χ² >= 14, df = 1, p <= 0.001). Again, there is no difference in the pattern of error rates for the different error types (i.e., including or excluding voicing errors and errors regarding manner or place of articulation, not shown). Error rates for both place and manner of articulation were high, around 36% and 31% respectively, and were strongly correlated for consonants following both accented and non-accented vowels (r >= 0.994, p <= 0.01, not shown).
                TC     VC    TCC    VCC    Mean
ΔF2 <= -175     60     43     44     31      45
ΔF2 >= 175      48     34     47     28      39
|ΔF2| <= 85     62     48     55     39      51
Mean            57     42     49     33      45
In both figures 3 and 4 there is a strong effect of sentence accent (carried by the vowel) on consonant identification in the same token. This difference seems to be independent of the token type, i.e., the effect of sentence accent is present even when the vowel kernel is absent. Part of this difference is expected to be due to a difference in the consonant sets used (see table 2). However, the fact that both pre- and post-vocalic consonants (originating from different sets, see table 2) are identified worse when accompanying a non-accented vowel indicates that sentence accent is indeed an important factor for consonant identification. Again, this was also found for identification of place and manner of articulation.
In figure 5, the correctness of identification of one segment in a token is related to the correctness of identification of the other phoneme in the same token. The differences in error rate are statistically significant for the CV tokens only (χ² = 28.7, df = 1, p <= 0.01). For the VC-type tokens we see no relation at all. For the CV tokens the difference in error rate is large indeed. The vowel identification error rate almost doubles when the consonant is identified incorrectly compared to when it is identified correctly. The results in figure 5 were obtained by ignoring long/short vowel and voicing errors (for vowel and consonant identification respectively). The same results were found when using consonant identification per se, or only errors in the place and manner of articulation (not shown).
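The relation between vowel and consonant correctness within a token amounts to a test of independence in a 2x2 table. A minimal sketch follows, using the standard Pearson chi-square without continuity correction. The cell counts are hypothetical and were chosen only to mimic a vowel error rate that roughly doubles when the consonant is misidentified. They are not the counts underlying figure 5, and the exact procedure used for the reported value is not specified in the text.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (df = 1) and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one response")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical response counts for CV-type tokens:
# rows: consonant correct / consonant wrong
# columns: vowel correct / vowel wrong
chi2, p = chi_square_2x2(500, 60, 250, 70)

vowel_error_given_c_correct = 60 / (500 + 60)
vowel_error_given_c_wrong = 70 / (250 + 70)
```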
Our experiments are very much like those reported by Ohde and Sharf (1977). Contrary to this earlier paper, we do find strong effects of the presence of context on phoneme identification for vowels, pre-, and post-vocalic consonants. A likely explanation for this difference is a combination of two factors: First, in our experiments we used a larger inventory of consonants and vowels taken from CVC sequences, not only plosives and point vowels (/uia/) taken from CV- or VC-like combinations. Second, our stimuli were taken from a long, meaningful text read aloud instead of spoken in isolation. It is known that syllables spoken in isolation show less reduction and coarticulation than when reading aloud a long text. Therefore, it is to be expected that our stimuli show a larger variation in the strength and presence of cues to identity. This difference seems to force our subjects to rely more on contextual cues than the subjects of the earlier experiments.
When more of each vowel realization was present in the isolated vowel type tokens (V-type, containing around 85% of the original realization), the error rates decreased considerably (figures 1 and 2). This can be attributed to the presence of formant dynamics in the vowel realizations. Such a behaviour is expected if subjects use "target-overshoot" to compensate for the effects of coarticulation and reduction (Lindblom and Studdert-Kennedy, 1967; Strange, 1989a; Di Benedetto, 1989). However, irrespective of the token classes, the absolute size of F1 and F2 excursions, which are proxies for coarticulation, had no effect on error rates. This means that any "compensation" for coarticulation or any perceptual "target-overshoot" had to work equally well on the Kernel-type tokens as on the CVC-type tokens. But the former lack most, if not all, dynamic information. Therefore, an effect of the absolute formant excursion size in itself on our identification results seems implausible. Furthermore, it has been shown before that synthetic vowels with formant dynamics added without consideration for the central formant values or context, are identified "worse" (i.e., by "undershooting" instead of "overshooting" the target) than those without such synthetic formant dynamics (Pols and Van Son, 1993; Van Son, 1993a; Van Son and Pols, 1993).
There is undoubtedly a beneficial effect of the presence of transitions on vowel identification. However, "simple" target-overshoot related models cannot account for the fact that the absolute size of the formant excursions seems to be irrelevant. A possible explanation could be that the spectral change in the transition regions is a (redundant) independent cue to the identity of a vowel. That is, the spectral change in itself is recognized as a separate feature of a certain vowel and context.
For consonants, it was shown that identification benefited from adding the central part of the neighbouring vowel. For post-vocalic consonants, the effects of adding the vowel kernel in front of the token were larger than those of adding more of the consonant at the back. These results were found for both the place and the manner of articulation (not shown). This indicates that the high error rates were caused not only by a lack of structural, or "manner", information (e.g., plosives versus fricatives) but also by a lack of spectral, or "place", information. The high correlations between error rates calculated with respect to manner and place of articulation indicate that these two "dimensions" of consonant articulation are not independent from a perceptual point of view, at least not with respect to the manipulations we used to construct our tokens.
The presence or absence of sentence accent on the vowels had a strong effect on identification, both for the vowels themselves and for the consonants. For vowels, the difference between the error rates of accented and unaccented vowels increased with the amount of context (figure 1). This indicates that this difference in vowel intelligibility is due to the outer parts of the vowel realizations and possibly the context. For consonants, we could not find such a difference between token classes.
As with consonants, vowel identification also benefited from speech added at the periphery of the realization, crossing the segment boundaries. For the CVC-type tokens, the consonants were audible and this in itself seemed to have helped the listeners identify the vowels, as was shown by the strong relation between correct identification of vowels and consonants in CV-type tokens (figures 5 and 6).
In conclusion, our results show that our listeners used the transition parts between vowels and consonants to identify both vowel and consonant realizations. If present, speech beyond these transitions was used too. In all experiments it could be shown that speech added in front of the target phoneme improved identification more than speech added at the back of the target phoneme. This was found even when the added speech originated from another phoneme (e.g., from the vowel when identification of post-vocalic consonants was at stake, see figure 4). Asymmetries of this kind have been reported before but there seems to be no consensus about an explanation (Ohde and Sharf, 1977; Pols, 1979, but see also: Di Benedetto, 1989; Mann and Soli, 1991; Pols and Van Son, 1993; Van Son, 1993a; Van Son and Pols, 1993; Van Wieringen and Pols, 1991, 1995; Van Wieringen, 1995). The report of Mann and Soli (1991) is interesting in this respect because it states that the asymmetry is reversed: it is the vowel following /fS/ that contributes the strongest cues to fricative identification. Our data support the explanation proposed by Ohde and Sharf (1977) and Mann and Soli (1991) that this asymmetry can be attributed to the order in which context and target phoneme are presented. In our experiments, speech preceding the target phoneme is always a stronger cue to its identity than speech following the target phoneme. The CV-VC asymmetry was also found when the correlation between correct identification of vowels and consonants was investigated. Correct identification of vowels and consonants was strongly correlated in CV-type tokens. There was no correlation whatsoever between vowel and consonant identification error rates in VC-type tokens.
An important question remains. Is it actually the original sound that is important for the listener, or would any speech sound do? The latter possibility could be expected when listeners use the preceding sound to "normalize" for the speaker or to give time for the ear to adapt to speech. Our own results do not differentiate between the two possibilities. But earlier work showed that adding just synthetic consonants to synthetic vowels had very little effect on the identification of the vowels (Pols and Van Son, 1993; Van Son, 1993a; Van Son and Pols, 1993). From the present study with natural speech, and from our earlier study with synthetic speech, it can be concluded that segment identification benefits from the presence of context if this context is appropriate for the segment.
Listeners use all speech available to identify vowels and consonants, even when this speech is beyond the transitions to and from a neighbouring phoneme. The presence of speech preceding the target segment benefits identification more than that of speech following the target segment.
The authors wish to thank dr. Astrid van Wieringen for her advice and help in all stages of the experiments. This research was made possible by grant 300-173-029 of the Dutch Organization of Research (NWO).