R.J.J.H. van Son, Florien J. Koopmans-van Beinum, and Louis C.W. Pols
University of Amsterdam, Institute of Phonetic Sciences/IFOTT
Herengracht 338, 1016 CG Amsterdam, The Netherlands
tel: +31 20 5252183; fax: +31 20 5252197; email: {rob, florienk, pols}@fon.hum.uva.nl


A large part of the variation in natural speech appears along the dimensions of articulatory precision / perceptual distinctiveness. We propose that this variation is the result of an effort to communicate efficiently. Speaking is considered efficient if the speech sound contains only the information needed to understand it. This efficiency is tested by means of a corpus of spontaneous and matched read speech, and syllable and word frequencies as measures of information content (12007 syllables, 8046 word forms, 1582 intervocalic consonants, and 2540 vowels). It is indeed found that the duration and spectral reduction of consonants and vowels correlate with the frequency of syllables and words in this corpus. Consonant intelligibility correlates with both the acoustic factors and the syllable and word frequencies. It is concluded that the principle of efficient communication organizes at least some aspects of speech production.


Speech can be considered an efficient means of communication. Speakers will not articulate more accurately than they consider necessary to be understood. This means that the speech signal will only contain the information needed to understand the message: "speech is the missing information" [10]. Much of the variation that is normally found in speech can be interpreted as a way to increase the efficiency of communication, especially variation resulting from speaking style, assimilation, coarticulation, and reduction.
The use of the term "efficient" implies a cost/benefit trade-off: the maximal amount of information transmitted at the minimum "cost". To achieve this, the speaker must estimate how easily the listener can understand her: "speaking for listening" [4]. Different estimates lead to different speaking styles, ranging from over-articulated word lists to mumbled courtesies.
One aspect of efficiency, the effect of (semantic) predictability on duration and intelligibility, has been the target of previous research [2,3,4,5,6,8,9,17]. In the context of the current paper, the results of these studies can be summarized as follows: on the one hand, listeners tend to identify utterances better the more predictable they are; on the other hand, speakers seem to compensate for this by pronouncing unpredictable words more clearly.
The actual "level of efficiency" is less important than which factors are used to determine the optimal level of "articulatory faithfulness" and how speakers "manipulate" the speech sound to achieve optimal intelligibility [1,11]. Note that a full investigation of these questions constitutes a major research effort.
In this paper we will start with a demonstration of principle. First, we will develop a quantitative basis for determining the level of efficiency. Then we will present results of an evaluation of the extent to which syllable and word frequencies affect the acoustic realization of consonants and vowels, and the intelligibility of consonants. Finally, we will discuss these results.


Measures of information content are derived from Bayes' equation:
(1) Prob(ei,ci) = Prob(ci|ei) · Prob(ei) = Prob(ei|ci) · Prob(ci)
In which ei is a certain speech element, say a word or a phoneme, in a certain context ci. Prob(x) is the probability of encountering x. Prob(x|y) is the conditional probability of encountering x if y is present and Prob(x,y) is the probability of encountering both x and y together. Using equation 1, we can rewrite the probability of encountering ei as the product of the probability of finding ei in a certain context, ci, and the probability of finding this context:
(2) Prob(ei) = Prob(ei|ci) · Prob(ci) / Prob(ci|ei)
Prob(ei|ci) is the probability measured in missing word or cloze tests, i.e., the probability of observing a word in a specific context. The information associated with the presence of a certain entity x is: I(x) = -log2(Prob(x)) (in bits). Using this we obtain equation 3:
(3) I(ei) = I(ei|ci) + I(ci) - I(ci|ei)
Averaging equation 3 over all possible elements, ei, and contexts, ci in a language, we obtain the conventional average information content:
(4) H(e) = H(e|c) + H(c) - H(c|e) = H(e|c) + T(e,c)
In which

H(x) = -Σ Prob(xi) · log2(Prob(xi)),

summed over all possible xi, is the average information of x. H(e|c) is the logarithm of the perplexity of the language and T(e,c) = H(e) + H(c) - H(e,c) is called the Mutual Information. Equation 4 describes a way to divide the average information, H(e), needed to identify an element e (e.g., a word) into a part carried by the element itself, i.e., H(e|c), and a part carried by the context, i.e., T(e,c).
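The decomposition in equation 4 can be illustrated with a small numeric sketch. The (element, context) pairs below are toy counts, not our corpus data; the entropies are estimated from their relative frequencies:

```python
import math
from collections import Counter

def entropy(counts):
    """Average information H(x) = -sum Prob(xi) * log2(Prob(xi)), in bits."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical corpus of (element, context) pairs, e.g. (word, preceding word).
pairs = [("nine", "saves"), ("nine", "saves"), ("time", "in"),
         ("time", "in"), ("stitch", "a"), ("nine", "cloud")]

H_e = entropy(Counter(e for e, c in pairs))    # H(e)
H_c = entropy(Counter(c for e, c in pairs))    # H(c)
H_ec = entropy(Counter(pairs))                 # joint entropy H(e,c)

T = H_e + H_c - H_ec        # Mutual Information T(e,c)
H_e_given_c = H_ec - H_c    # conditional entropy H(e|c)

# Equation 4: H(e) = H(e|c) + T(e,c)
assert abs(H_e - (H_e_given_c + T)) < 1e-9
```

The final assertion checks equation 4 term by term: the information carried by the element itself, H(e|c), plus the part carried by the context, T(e,c), recovers H(e).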
For example, to identify a word, not all information has to be present in the word itself, part of it can be extracted from the context. An extreme example is the sentence "A stitch in time saves nine". The last word "nine" can be very reliably predicted from the preceding words [9]. The word itself is hardly informative, I(nine|A...saves_) ~ 0.

Table 1: Dutch consonants used in this paper and the number of matched Read/Spontaneous VCV pairs (ignoring voicing differences). 308 pairs were from syllables carrying lexical syllable stress, 483 from unstressed syllables.

Speech communication is efficient if the speech signal contains enough information to be identified, and not more. This means that, after accounting for acoustic disturbances and speaking style, each element should contain an amount of information essentially proportional to I(ei|ci). For content words this has been found qualitatively [2,3,4,5,6,7,9]. Therefore, applying equation 3 to the pronunciation and intelligibility of words in utterances seems feasible. However, it is unlikely that speakers and listeners process smaller entities, like phonemes in syllables, in the same way as words in an utterance. If we ignore the effects of context, the amount of information needed to identify an element is just the negative logarithm of its frequency of occurrence (I(ei) in equation 3). There is evidence that this is an important factor at the level of syllables [17] and we will use this measure in the present study.
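The frequency-based measure I(ei) used below can be sketched as follows; the syllable counts are hypothetical, standing in for counts estimated from a transcribed corpus:

```python
import math
from collections import Counter

# Hypothetical syllable counts from a transcribed corpus.
syllable_counts = Counter({"de": 600, "en": 400, "spraak": 5, "klank": 3})
total = sum(syllable_counts.values())

def information(syllable):
    """I(s) = -log2(relative frequency), in bits (equation 3, context ignored)."""
    return -math.log2(syllable_counts[syllable] / total)

# Rare syllables carry more information than frequent ones.
assert information("spraak") > information("de")
```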

Figure 1: Correlation coefficients between I(syllable) and phoneme Duration, Spectral Center of Gravity (CoG), F1/F2 distance, and Confusion of Identification (i.e., H(responses) per token, used with switched signs). The differences between conditions and measures were statistically not significant (p > 0.01). Top: Consonants (n=1582, +:308 --:483), bottom: Vowels (n=2540, +:471 --:799). Read: read, Spont: spontaneous speech, +:stressed, --:unstressed syllables, All: combined realizations. +:p <= 0.01, *:p <= 0.001.

If speech is indeed organized efficiently, we predict that speakers adapt their speaking effort to "match" the expected effort needed for recognition. We should therefore find a correlation between acoustic measures of effort and information content. The effect of this correlation is acoustic reduction of phonemes in "predictable" positions in the utterance and strengthening in "unpredictable" positions. The intelligibility of the isolated phonemes should follow this acoustic reduction and strengthening. As acoustic measures of the effort spent on speech, we use Duration and two measures of spectral reduction: the Spectral Center of Gravity for consonants (CoG, i.e., the "mean" frequency, weighted by spectral power) and, for vowels, the F1/F2 distance, in semitones, to the center of vowel reduction (300 Hz, 1450 Hz). These measures have been shown to be related to speaking effort and intelligibility [12,13,14,15,16]. The entropy of the responses to single stimulus tokens was used as a measure of unintelligibility, i.e., confusion. This is equivalent to the logarithm of the perplexity of the responses and measures the amount of information missing from the acoustic signal.
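These acoustic measures can be sketched as follows. This is a minimal illustration only: the windowing and formant extraction of [15,16] are not reproduced, and combining the F1 and F2 semitone distances Euclideanly is an assumption made here for the sketch:

```python
import math

def center_of_gravity(freqs_hz, power):
    """Spectral Center of Gravity: mean frequency weighted by spectral power."""
    return sum(f * p for f, p in zip(freqs_hz, power)) / sum(power)

def semitones(f_hz, ref_hz):
    """Distance between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_hz / ref_hz)

def vowel_distance(f1_hz, f2_hz, center=(300.0, 1450.0)):
    """F1/F2 distance (in semitones) to the center of vowel reduction.
    Euclidean combination of the two semitone distances is assumed here."""
    return math.hypot(semitones(f1_hz, center[0]),
                      semitones(f2_hz, center[1]))

# A vowel at the reduction center has zero distance, i.e., maximal reduction.
assert vowel_distance(300.0, 1450.0) == 0.0
```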


For this study we selected recordings of a single male speaker who read aloud a transliteration of spontaneous speech recorded earlier (20 minutes of speech each, 12007 syllables and 8046 word forms). The orthographic script was transcribed into phonetic symbols, and each recording was checked against this transcription and marked for sentence accent by one of us [15,16]. The original transcribed text was used to estimate word and syllable frequencies, circumventing the scarcity of data on spontaneous word frequencies. From the phonetic transcription, all Vowel-Consonant-Vowel (VCV) segments were located in the speech recordings (read and spontaneous). We selected 791 VCV pairs whose two realizations originated from corresponding positions in the utterances, with identical syllable structure, syllable boundary type, sentence accent, and lexical syllable stress (see table 1; 1770 distinct vowel pairs; these are the same realizations as used in [15,16]). Monosyllabic function words were marked as unstressed. Word-medial consonants were considered syllable initial (maximal onset). The VCV pairs were selected to cover all consonants and stress conditions present (except for /h/). The pairs were selected randomly for each individual consonant and stress condition (lexical syllable stress only; 308 pairs from stressed and 483 from unstressed syllables). The Duration and extreme CoG frequency of all vowel and consonant realizations were measured [15,16].
22 subjects, all native speakers of Dutch, were asked to identify these 1582 intervocalic consonant realizations in their original VCV context. The outer 10 ms of the VCV tokens were removed and the edges smoothed with 2 ms Hanning windows to prevent interference from the adjacent consonants and transient clicks. The order of presentation was (pseudo-)random and different for each subject. The subjects had to select, on a computer CRT screen, the Dutch orthographic symbol that corresponded to the sound heard (this causes no ambiguity in Dutch). For each token, the entropy of the 22 responses was calculated and used as a measure of confusion (i.e., missing information).
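The per-token confusion measure is simply the entropy of the response distribution; a sketch, with hypothetical response lists:

```python
import math
from collections import Counter

def confusion(responses):
    """Entropy of listener responses to one token, in bits.
    0 bits: all listeners agree; log2(n) bits: maximal confusion."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# 22 hypothetical responses to a single VCV token.
assert confusion(["p"] * 22) == 0.0                      # perfect agreement
assert abs(confusion(["p"] * 11 + ["b"] * 11) - 1.0) < 1e-9  # 50/50 split: 1 bit
```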

Figure 2: As figure 1 but now using I(word).


Neither I(phonemei|syllablei) nor I(phonemei|wordi) (equation 3) was correlated with phoneme duration, CoG, or intelligibility (not shown). Therefore, we will use the frequency-related measures I(syllablei) and I(wordi) as the parameters of interest in the remainder of this paper.
To compensate for the large variation in intrinsic values between our phonemes, we calculated the correlation coefficients after subtracting the individual mean values from each quasi-homogeneous group of phoneme realizations (homogeneous with respect to phoneme identity, speaking style and syllable stress, but ignoring voicing). The degrees of freedom in the statistical tests were reduced accordingly to compensate for this procedure.
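This normalization amounts to centering each quasi-homogeneous group on its own mean before correlating. A minimal sketch with toy numbers (the function names and data are illustrative, not from our analysis):

```python
from statistics import mean

def center_by_group(values, groups):
    """Subtract each group's mean value, normalizing for intrinsic differences
    (here: phoneme identity, speaking style, and syllable stress)."""
    means = {g: mean([v for v, gg in zip(values, groups) if gg == g])
             for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Toy data: two "phoneme" groups with different intrinsic durations.
dur = [80, 90, 100, 40, 50, 60]   # ms
info = [4, 6, 8, 3, 5, 7]         # bits
group = ["t", "t", "t", "d", "d", "d"]

r = pearson(center_by_group(dur, group), center_by_group(info, group))
# One degree of freedom is lost per subtracted group mean, so the
# statistical test uses fewer degrees of freedom than the raw n would give.
```

In this toy example the raw durations differ between groups, but after centering the within-group covariation with information content emerges cleanly (r = 1 here by construction).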

Figure 3: As figure 1 but now correlating the Duration, CoG, I(syllable) and I(word) with the Confusion of consonants. Note the reversed vertical axis.

The results are presented in figures 1-3. Figure 1 shows the correlation between the negative logarithm of the syllable frequency, I(syllable), and Duration, spectral reduction (CoG and F1/F2 distance, respectively), and the Confusion of our listeners, for both consonants and vowels (Confusion for consonants only). Figure 2 shows the corresponding correlations with the negative logarithm of the word frequency, I(word). This correlation was largely limited to the stressed syllables (p <= 0.01, R+ vs. R-). Figure 3 shows the correlation of all other values with the confusion in the listening experiment. From figure 3 it becomes clear that duration was most strongly linked with intelligibility (p <= 0.001).


Although the correlation coefficients found in our data are generally statistically significant, they are also quite small (R2 < 0.07). There are several explanations for this weak correlation. First of all, estimating syllable and word frequencies from such a small corpus, and intelligibility from only 22 listeners, introduces large errors. Measuring phoneme durations in connected (spontaneous) speech is also error-prone. Together with the small number of realizations, this led us to expect a large amount of "noise" in our data. More importantly, the frequency of occurrence is only a first step in evaluating predictability. To really express the importance of a word, its position in the utterance should be evaluated using models of grammar, prosody, and, preferably, semantics.
On the whole, our results support the idea that the individual components of speech contain only the information needed to identify them. When the predictability of syllables and words is expressed in terms of information content, it correlates with the duration, spectral reduction, and intelligibility of individual phonemes. This correlation was found after normalization for the effects of phoneme identity, speaking style, and lexical stress. We know that predictability is strongly correlated with ease of identification [5,7]. Therefore, we can conclude that speakers anticipate the effort listeners have to expend in recognition and try to strike a balance with their own effort. The result can be seen as efficient communication.
Figure 2 shows that the effects of word frequency (I(word)) are limited to the stressed syllables. This can be partly explained by noting that rare syllables tend to occur as the stressed syllables of rare words, and unstressed syllables are generally high-frequency syllables [5,18]. This indicates that syllable frequencies dominate any effect of word frequency.
On a more detailed level, it was found that phoneme duration was the factor most strongly related to both information content and intelligibility. Speakers seem to smooth the distribution of information over time by varying segmental (or syllabic) durations. The weaker correlations of both spectral reduction and intelligibility with information content, and with each other, suggest that time is the limiting factor in speech [16]. Speakers seem reluctant to spend time speaking, while listeners need some time to understand the speech [1] but should not be bored either [4]. If listeners do not have enough time to identify all aspects of a phoneme, it does not "pay" to articulate them carefully. Hence, the spectral reduction found in predictable syllables and words could be a secondary result of the shorter durations.
Combining our data with those presented in the literature, we can conclude that speakers anticipate the efforts needed to understand their message. They adapt some aspects of their speech to strike a balance between their own efforts and those of their audience. This adaptation increases the efficiency of communication.