The intelligibility of speech degraded by a speech-communication system has been the topic of many studies in the past 70 years. Between 1920 and 1930, Fletcher and Steinberg had already developed several methods to determine intelligibility. They found a relation between the transmission quality and several physical aspects of the transmission channel, mainly bandwidth and signal-to-noise ratio. The Second World War had a great impact on the evaluation of speech communication. Many papers on the subjective and objective assessment of speech-communication systems appeared during and just after the war (Egan, 1944; French & Steinberg, 1947; Beranek, 1947; Fletcher & Galt, 1950).
The REXY system is able to recognise continuously spoken (Dutch) sentences. Since the system is trained with phones, any Dutch word can easily be incorporated in the system, although only 238 different words occurred in the training sentences. Furthermore, for practical reasons the present system is trained and tested as a speaker-dependent recogniser for one male speaker only. In Chapter 6 (and Appendix D) the database with speech utterances is described. With the REXY system we systematically evaluated various system components. Details about the experimental procedures can also be found in Chapter 6. The conclusions that can be drawn from these experiments are elaborated in Chapter 7 and are summarised below.
One of the components we varied was the acoustic preprocessing. We have investigated
two types of analysis, and we have experimented with several feature vectors (a
description of the preprocessing is given in Chapter 2).
* The two types of analysis are a filterbank and an LPC analysis. The overall performance of the filterbank analysis turns out to be the better of the two.
* Our experiments showed that the performance of the recognition system could benefit from combining different feature vectors. The best performing combination is the filterbank preprocessing together with three feature vectors: the "slope" vector (frequency derivative of the filterbank spectrum), the time derivative of the slope, and the time derivative of the energy (see Chapter 2 for details).
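The three feature vectors named above can be illustrated with a minimal sketch. It assumes that a frame is a list of log energies per filterbank channel, that derivatives are plain first-order differences, and that frame energy is the sum over channels; none of these details are given in this summary (Chapter 2 has the actual definitions), and the function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]  # one filterbank spectrum: (log) energy per channel


def slope(frame: Frame) -> Frame:
    """Frequency derivative: difference between adjacent filterbank channels."""
    return [hi - lo for lo, hi in zip(frame, frame[1:])]


def time_derivative(track: List[Frame]) -> List[Frame]:
    """First-order difference between consecutive frames.

    The first frame has no predecessor and is given zeros (an assumption).
    """
    if not track:
        return []
    deltas = [[0.0] * len(track[0])]
    for prev, cur in zip(track, track[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def energy(frame: Frame) -> float:
    """Frame energy, taken here as the sum over channels (an assumption)."""
    return sum(frame)


def features(spectra: List[Frame]) -> Tuple[List[Frame], List[Frame], List[float]]:
    """Return (slope, time derivative of slope, time derivative of energy)."""
    slopes = [slope(f) for f in spectra]
    d_slopes = time_derivative(slopes)
    d_energy = [d[0] for d in time_derivative([[energy(f)] for f in spectra])]
    return slopes, d_slopes, d_energy


# Two toy frames with three channels each.
spectra = [[1.0, 3.0, 6.0], [2.0, 3.0, 6.0]]
slopes, d_slopes, d_energy = features(spectra)
```

In a discrete HMM system each of these vectors would typically be vector-quantised separately before classification; that step is not shown here.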
The classification algorithm we used is based on discrete HMM (hidden Markov
modelling) technology. In three subsequent chapters this classification algorithm is
described, as well as a dynamic programming technique used to perform an integrated
search. Markov theory (Markov chains and Markov processes) is introduced in Chapter
3. In Chapter 4 we expand the Markov models to hidden Markov models and in
Chapter 5 we adapt the hidden Markov models to model speech. As the unit of modelling
we chose the Dutch phonemes. By applying dynamic programming, the Markov
models are combined with a word-duration model and a grammar model. The
experimental results allowed us to draw the following conclusions:
* Initialisation of the HMM parameters must be done with care. We compared a uniform initialisation (all parameters initially have the "same" value) with a sophisticated initialisation (based on hand-segmented data). Sophisticated initialisation yields a system with better recognition performance.
* As long as the HMM parameters are not well trained (which is almost always the case in actual conditions and which was also the case in our experiments), smoothing of the parameters is important. The smoothing technique we implemented is called "cooccurrence smoothing" (this technique smooths the probability density functions of the Markov models).
* Because the HMM's do not model duration very well (only implicitly), we tried to model the word duration explicitly with a Gaussian distribution. The recognition benefit of this kind of duration modelling turned out to be limited.
* Dynamic programming integrates knowledge about the spoken words (in the HMM's) with a simple grammar model. Different "bigram" grammars have been implemented with "perplexities" 60, 20, and 2.4 (lower perplexity implies a stricter grammar). The effect of the grammars is large: the error rate reduced from 25.8% (for the "no grammar" case with a perplexity of 110) to 15.5% (perplexity is 60), 5.3% (perplexity is 20), and 0.9% (perplexity is 2.4) given filterbank preprocessing.
* The grammar model and the word-duration models can simply be integrated with the Markov models (Viterbi search). This means that at recognition time an integrated search is performed with many knowledge sources: acoustic and phonetic knowledge from the HMM's, lexical knowledge from the (word) pronunciation dictionary, word duration, and syntactical knowledge from the grammar.
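The link between grammar strictness and perplexity can be made concrete with a small sketch. It uses the common definition of perplexity as 2 raised to the negative average log2 probability per word under a bigram model; whether REXY computed perplexity on test sentences or directly from the grammar is not stated in this summary, so this is an assumption. The vocabulary, the `<s>` start symbol, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Bigram = Dict[Tuple[str, str], float]  # P(word | previous word)


def perplexity(sentence: List[str], bigram: Bigram, start: str = "<s>") -> float:
    """Perplexity of a word sequence: 2 ** (-average log2 probability per word)."""
    if not sentence:
        raise ValueError("sentence must contain at least one word")
    log_sum = 0.0
    prev = start
    for word in sentence:
        log_sum += math.log2(bigram[(prev, word)])
        prev = word
    return 2.0 ** (-log_sum / len(sentence))


def uniform_bigram(vocab: List[str], start: str = "<s>") -> Bigram:
    """A 'no grammar' model: every word is equally likely after any word."""
    p = 1.0 / len(vocab)
    return {(prev, w): p for prev in vocab + [start] for w in vocab}


vocab = ["een", "twee", "drie", "vier"]
sentence = ["een", "drie", "twee"]

# Without grammar constraints the perplexity equals the vocabulary size.
loose = perplexity(sentence, uniform_bigram(vocab))

# A fully deterministic grammar (one allowed successor) has perplexity 1.
strict_grammar = {("<s>", "een"): 1.0, ("een", "twee"): 1.0}
strict = perplexity(["een", "twee"], strict_grammar)
```

A lower perplexity means fewer plausible successors per word on average, which is consistent with the drop in error rate reported above as the grammar becomes stricter.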
The experiments we performed with the REXY system indicate that high recognition performance can only be achieved if preprocessing and classification are both performed adequately. In designing a recognition system, both preprocessing and classification have to be optimised and tuned to each other.
There are numerous indications that people extract more information from speech
than simply the message itself. We are able to identify speakers by their voice and
pronunciation, to recognize their regional background, their mood, and several other
Generally, we can also identify the sex of the speaker from his/her voice and/or pronunciation. Women speak with a relatively high-pitched voice and men with a low-pitched voice. The differences regarding pitch height are related to differences between the sexes in the anatomy and physiology of the vocal apparatus. However, apart from pitch height, little is known about phonetically-related differences between men and women.
The reason why some people speak more quickly, more melodiously, more broadly, or with more authority than others seems to lie in environmental factors rather than in biological factors. People tend to adapt to their role in society in their clothing, their way of acting, and also their way of speaking. It is common knowledge that men and women play, or at least are more or less expected to play, different roles in our society. For example, child care is done mainly by women, while jobs with management aspects are most frequently held by men. Such expectations or norms towards men and women may also influence the speech production and speech perception behaviour of men and women.
The distinction between speech of men and women is also apparent if one considers the developments in speech technology. In speech synthesis as well as automatic speech recognition there is a clear preference to use 'male-like' voices, whereas it is not clear at all, except for a few characteristics such as pitch, to what extent the voice and pronunciation characteristics of men and women differ.
The main aim of the present study follows from the arguments above. The aim was to obtain more insight into the voice and pronunciation characteristics of men and women, while distinguishing between attributed and actual characteristics of men and women (ch. 1). The attributed characteristics were measured by means of introspective judgments, whereas the actual characteristics were measured by means of perceptual or acoustic analyses.
Three main topics were chosen with respect to possible differences between speech of men and women. The first topic was the evaluation of voice and pronunciation characteristics by means of semantic scales. The second topic was pitch/fundamental frequency and the third topic was the intelligibility on the level of words and phonemes.
The description of our study starts with two experiments in which the importance
of non-verbal cues in speech was tested (ch. 2). Firstly, an identification experiment is
described in which the ability of listeners to extract information about age and sex from
voice and pronunciation cues alone was examined. It appeared that the listeners were
very well able to identify the sex of the speaker, but also to classify the age (which is
Secondly, an introspective experiment is described in which judges gave their opinion about ideal and average voice and pronunciation characteristics of men and women, by means of semantic scales (without actual presentation of speech). Regarding the characteristics of ideal voice and pronunciation, it was found that the differences between men and women were restricted to the fact that the ideal female voice should be higher and softer than the ideal male voice. Regarding the characteristics of average voice and pronunciation, the judges indicated far more differences between men and women. Also, it was found that the expected average characteristics for male speakers appeared to be closer to their ideal characteristics than those for female speakers.
Introspective judgments reveal insight into the norms and expectations with respect to voice and pronunciation of men and women. However, it could very well be that those ideas are based on sex-related stereotypes and not necessarily due to actual speech performance. Therefore, a listening experiment was carried out in which 40 listeners evaluated voice and pronunciation of 30 men and 30 women, again by means of semantic scales (ch. 3).
Apart from the variables 'sex of speaker' and 'sex of listener', a third variable, 'profession of speaker', was included in order to analyse the influence on voice and pronunciation of a factor that is specifically socio-culturally determined. Each speaker belonged to one of the following profession categories: nurses, managers, and information agents (with equal numbers of male and female speakers in this experiment). These professions differ with respect to socio-economic status (SES) as well as with respect to the actual distribution of men and women over the three professions.
A number of characteristics appeared to differentiate between male and female speakers. However, these distinctions were not always in agreement with the literature or with the introspective judgments mentioned above. In the literature it is e.g. suggested that women speak in a more polished way than men and men speak with more authority than women. In contrast to this, our perceptual data reveal that male and female speakers sounded equally polished and authoritative. The data further indicate that the professions were clearly differentiated from one another with respect to characteristics of voice and pronunciation. Moreover, the significant differences are in agreement with stereotypes of these professions (e.g. managers speaking in a distinguished way and nurses speaking sweetly).
From the foregoing it is clear that the listeners had differentiated between the sexes and the professions without any other clues than voice and pronunciation. Subsequently, an identification experiment was carried out in order to examine whether or not listeners are able to classify the professions correctly. The results show that this is indeed possible.
Apart from the perceptual evaluation, an introspective evaluation was also carried out of voice and pronunciation characteristics in the three profession categories, separately for men and women. Those results show for instance that women were supposed to speak in a more polished way than men, whereas this tendency was not at all present in the perceptual evaluation. Regarding the different professions, the tendencies are only partly the same as those found in the perceptual evaluations.
In addition to the perceptual and introspective evaluation by a large group of judges, the 60 speakers themselves were also asked for their opinion about their own voice and pronunciation. The results of that evaluation show no significant differences, neither between male and female speakers nor between the professions. So, the speakers themselves seem not to be aware of their distinctive voice and pronunciation characteristics.
The second topic was pitch/fundamental frequency (ch. 4). (We use the term 'pitch'
when considering the perceptual domain; the term 'fundamental frequency' is used
when referring to the acoustic domain).
In the literature, as well as by our listeners and judges, it was reported that pitch is the most salient factor for distinguishing between speech of men and of women. However, is this restricted to mean pitch/fundamental frequency or do the range and variation of pitch/fundamental frequency also play a role? From the above mentioned evaluation experiments, the general tendency in this respect was that female speakers sounded more melodious than male speakers. This might imply that more fundamental frequency variation is present in speech of women.
Acoustic analyses were carried out for several read speech samples of groups of male and female speakers. As was expected, the data reveal a clear difference in mean fundamental frequency between male and female speakers (approximately 120 Hz versus approximately 200 Hz, respectively). No significant differences in mean fundamental frequency were found between speakers with a different educational level or different profession. It is striking that the different speech conditions under study (sentences and text) also resulted in similar mean fundamental frequency values.
Although considerable differences were found between the individual speakers with respect to fundamental frequency range or variability, no differences were found between the two sexes. Also, with respect to the factors 'educational level' or 'profession' no differences were found in fundamental frequency range or variability. Of course, our results are restricted to the reading condition. The relationship between acoustics and perception is rather clear as far as pitch height is concerned. However, only low correlations were found between fundamental frequency range and variability on the one hand and judgments regarding melodiousness and expressiveness on the other hand. Did we choose the wrong acoustic parameters for obtaining useful information about pitch variation (intonational) aspects? In order to verify the difference in fundamental frequency patterns between men and women, a perception experiment was carried out in which manipulated speech was presented to listeners.
The results indicated that the subjects had not been able to identify the sex of the speakers by means of information about fundamental frequency range and variability alone. So, the conclusion must be that at sentence level, fundamental frequency variability plays a minor role for sex identification.
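The acoustic measures discussed in this topic (mean, range, and variability of fundamental frequency) can be sketched as follows. This is only an illustration under assumptions that are not taken from the study: the F0 track is a list of values in Hz with 0 marking unvoiced frames, and range and variability are expressed in semitones relative to the speaker's own mean so that speakers with different mean F0 can be compared. The function name and the toy tracks are hypothetical.

```python
import math
import statistics
from typing import Dict, List


def f0_summary(f0_hz: List[float]) -> Dict[str, float]:
    """Mean F0 in Hz, plus range and standard deviation in semitones."""
    # Keep voiced frames only; 0 marks an unvoiced frame (assumption).
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    mean_hz = statistics.fmean(voiced)
    # Semitones relative to the speaker's own mean F0.
    semitones = [12.0 * math.log2(f / mean_hz) for f in voiced]
    return {
        "mean_hz": mean_hz,
        "range_st": max(semitones) - min(semitones),
        "sd_st": statistics.stdev(semitones),
    }


# Toy tracks: same relative movement around different mean values.
low_voice = f0_summary([0.0, 100.0, 120.0, 140.0, 0.0])
high_voice = f0_summary([0.0, 170.0, 204.0, 238.0, 0.0])
```

On a logarithmic scale the two toy tracks have the same range although their means differ, which mirrors the finding above that the sexes differed in mean fundamental frequency but not in range or variability.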
With regard to the third topic, i.e. the effect of speaker sex on intelligibility,
contrasting suggestions have been found. For instance, a strong preference exists for
male voices in speech technology applications, while on the other hand there is a
preference for female voices in actual announcement situations (e.g. in department
Intelligibility was measured in several noise conditions (ch. 5). Ten male and ten female speakers of Standard Dutch were selected. For Consonant-Vowel-Consonant (CVC) words, the group results for male and female speakers show equal word and phoneme intelligibility under all noise conditions. The differences between the individual speakers were rather large. Evaluation of the intelligibility of all speakers by means of the semantic scale 'low intelligibility - high intelligibility' revealed similar results with respect to the rank order of the different speakers. The phoneme confusions were also analysed. However, no fundamentally different patterns were found for male as opposed to female speaker data. Most confusions took place between phonemes that differed in only one distinctive feature. The aforementioned results do not indicate any striking difference between men and women with respect to voice and/or pronunciation.
In general, it can be concluded from our study that fewer actual (perceptual or acoustic) differences between men and women in voice and pronunciation characteristics were found than were indicated in the literature or attributed by judges (ch. 6). Regarding the socio-culturally determined characteristics, the differences between male and female voices and pronunciation that were actually (perceptually or acoustically) found seem to be of the same order as the differences found between the professions under study. In that case, the distinction between male and female speakers is only one of several possible distinctions.
The restriction in our study to the use of read speech meant a clear abstraction from real-life speech situations. We chose this abstraction so as not to be overwhelmed by uncontrollable variables. However, we hope that future studies in the field of male and female speech will proceed more and more towards natural speech situations.
Kloosterman, S. (1992): 'Classification of vowel segments using neural networks'. Master's thesis, IFA report 119, 71 pp. (in Dutch: 'Herkenning van klinkersegmenten met neurale netwerken').
Abstract of paper to be published in:
Journal of Speech Communications
The effect of sentence accent, word stress, and word class (function words versus content words) on the acoustic properties of 9 Dutch vowels in fluent speech was investigated. A list of sentences was read aloud by 15 male speakers. Each sentence contained one syllable of interest. This could be a monosyllabic function word, an unstressed syllable of a content word, or a stressed syllable of a content word. The same syllable occurred in all three conditions. Sentence accent was manipulated with questions that preceded the sentences. A total of 3465 vowels were segmented from the syllables and analysed. It was found that all three factors mentioned above had a significant effect both on the steady-state formant frequencies (F1 and F2) and on the duration of the vowels. Word stress and word class had a stronger effect on the vowels than sentence accent. A listening experiment showed the perceptual significance of the acoustic measurements. It appeared that spectral vowel reduction could be better interpreted as the result of increased contextual assimilation than as a tendency to centralize. We also studied changes in the dynamics of the formant tracks due to the experimental conditions. It was found that formant tracks of reduced vowels became flatter, which supports the view of increased contextual assimilation. Three simple models of vowel reduction are discussed.