The intelligibility of speech degraded by a speech-communication system has been the topic of many studies in the past 70 years. As early as the period between 1920 and 1930, Fletcher and Steinberg developed several methods to determine intelligibility. They found a relation between the transmission quality and several physical aspects of the transmission channel, mainly bandwidth and signal-to-noise ratio. The Second World War had a great impact on the evaluation of speech communication. Many papers appeared just after the war on the subjective and objective assessment of speech-communication systems (Egan, 1944; French & Steinberg, 1947; Beranek, 1947; Fletcher & Galt, 1950).
The REXY system is able to recognise continuously spoken (Dutch) sentences. Since the system is trained with phones, any Dutch word can easily be incorporated in the system, although in the training sentences only 238 different words occurred. Furthermore, for practical reasons the present system is trained and tested as a speaker-dependent recogniser for one male speaker only. In Chapter 6 (and Appendix D) the database with speech utterances is described. With the REXY system we systematically evaluated various system components. Details about the experimental procedures can also be found in Chapter 6. The conclusions that can be drawn from these experiments are elaborated in Chapter 7 and are summarised below.
One of the components we varied was the acoustic preprocessing. We have investigated
two types of analysis, and we have experimented with several feature vectors (a
description of the preprocessing is given in Chapter 2).
* The two types of analysis are a filterbank and an LPC analysis. The overall
performance of the filterbank analysis turns out to be the better of the two.
* Our experiments showed that the performance of the recognition system could
benefit from the cooperation of different feature vectors. The best performing
combination is the filterbank preprocessing together with three feature vectors: the "slope"
vector (frequency derivative of the filterbank spectrum), the time derivative of the
slope, and the time derivative of the energy (see Chapter 2 for details).
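The three feature vectors named above can be illustrated with a minimal sketch. It assumes that each frame is a list of filterbank channel values, that the frequency derivative is a simple difference between adjacent channels, that the time derivatives are first-order differences between successive frames, and that frame energy is approximated by the sum over channels. None of these choices is confirmed by the summary (Chapter 2 gives the actual definitions), and the function names are hypothetical.

```python
def frequency_slope(frame):
    """Discrete frequency derivative: difference between adjacent filterbank channels."""
    return [frame[k + 1] - frame[k] for k in range(len(frame) - 1)]


def extract_features(frames):
    """Return one (slope, delta_slope, delta_energy) tuple per frame, from the second frame on.

    frames: list of filterbank output vectors, one per time step.
    """
    slopes = [frequency_slope(f) for f in frames]
    # Assumption: frame energy is approximated by the sum over all channels.
    energies = [sum(f) for f in frames]
    features = []
    for t in range(1, len(frames)):
        # First-order time differences (assumed form of the time derivative).
        delta_slope = [cur - prev for prev, cur in zip(slopes[t - 1], slopes[t])]
        delta_energy = energies[t] - energies[t - 1]
        features.append((slopes[t], delta_slope, delta_energy))
    return features
```

In a discrete HMM system each of these vectors would typically be vector-quantised separately before classification; that step is left out here.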
The classification algorithm we used is based on discrete HMM (hidden Markov
modelling) technology. In three subsequent chapters this classification algorithm is
described, as well as a dynamic programming technique used to perform an integrated
search. Markov theory (Markov chains and Markov processes) is introduced in Chapter
3. In Chapter 4 we expand the Markov models to hidden Markov models and in
Chapter 5 we adapt the hidden Markov models to model speech. As the unit of modelling
we chose the Dutch phonemes. By applying dynamic programming, the Markov
models are combined with a word-duration model and a grammar model. The
experimental results allowed us to draw the following conclusions:
* Initialisation of the HMM parameters must be done with care. We compared a
uniform initialisation (all parameters initially have the "same" value) with a sophisticated
initialisation (based on hand-segmented data). Sophisticated initialisation yields a
system that has a better recognition performance.
* As long as the HMM parameters are not well trained (which is almost always the
case in actual conditions and which was also the case in our experiments), smoothing
of the parameters is important. The smoothing technique we implemented is called
"cooccurrence smoothing" (this technique smooths the probability density functions of
the Markov models).
* Because the HMMs do not model duration very well (only implicitly), we tried to
model the word duration explicitly with a Gaussian distribution. The recognition benefit
of this kind of duration modelling turned out to be limited.
* Dynamic programming integrates knowledge about the spoken words (in the
HMMs) with a simple grammar model. Different "bigram" grammars have been
implemented with "perplexities" 60, 20, and 2.4 (lower perplexity implies a stricter
grammar). The effect of the grammars is large: the error rate fell from 25.8% (for
the "no grammar" case with a perplexity of 110) to 15.5% (perplexity 60), 5.3%
(perplexity 20), and 0.9% (perplexity 2.4), given filterbank preprocessing.
* The grammar model and the word-duration models can simply be integrated with
the Markov models (Viterbi search). This means that at recognition time an integrated
search is performed with many knowledge sources: acoustic and phonetic knowledge
from the HMMs, lexical knowledge from the (word) pronunciation dictionary, word
duration, and syntactical knowledge from the grammar.
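The perplexity figures quoted for the bigram grammars can be made concrete with a minimal sketch. It uses the standard definition of test-set perplexity, 2 to the power of the average negative log2 probability per word. The summary does not state which definition was used for the reported values, so this is an assumption, and the function name, the start symbol, and the toy vocabulary are hypothetical.

```python
import math


def bigram_perplexity(sentences, bigram_prob, start="<s>"):
    """Perplexity of a bigram grammar over a set of word sequences.

    sentences:   list of word lists.
    bigram_prob: dict mapping (previous_word, word) to P(word | previous_word).
    start:       hypothetical sentence-start symbol used as the first context.
    """
    log_sum = 0.0
    n_words = 0
    for sentence in sentences:
        prev = start
        for word in sentence:
            log_sum += math.log2(bigram_prob[(prev, word)])
            n_words += 1
            prev = word
    # Average negative log2 probability per word, turned back into a branching factor.
    return 2.0 ** (-log_sum / n_words)


# Toy grammars: every transition has probability 1/4 (loose) or 1/2 (strict).
loose = {("<s>", "een"): 0.25, ("een", "twee"): 0.25}
strict = {("<s>", "een"): 0.5, ("een", "twee"): 0.5}
print(bigram_perplexity([["een", "twee"]], loose))   # 4.0
print(bigram_perplexity([["een", "twee"]], strict))  # 2.0
```

Under this definition the perplexity can be read as the average number of equally likely words the recogniser has to choose from at each position, which is why a lower value corresponds to a stricter grammar.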
The experiments we performed with the REXY system indicate that high recognition performance can only be achieved if preprocessing and classification are both performed adequately. In designing a recognition system, both preprocessing and classification have to be optimised and tuned to each other.
There are numerous indications that people extract more information from speech
than simply the message itself. We are able to identify speakers by their voice and
pronunciation, to recognize their regional background, their mood, and several other
characteristics.
Generally, we can also identify the sex of the speaker from his/her voice and/or
pronunciation. Women speak with a relatively high-pitched voice and men with a low-
pitched voice. The differences regarding pitch height are related to differences between
the sexes in the anatomy and physiology of the vocal apparatus. However, apart from
pitch height, little is known about phonetically-related differences between men and
women.
Why some people speak more quickly, more melodiously, more broadly,
or with more authority than others seems to be determined by environmental factors
rather than by biological factors. People tend to adapt to their role in society regarding
their clothing, their way of acting, and also their way of speaking.
It is common knowledge that men and women play, or at least are more or less
expected to play, different roles in our society. For example, child care is done mostly
by women, while jobs with management aspects are most frequently held by men.
Such expectations or norms towards men and women may also influence the speech
production and speech perception behaviour of men and women.
The distinction between speech of men and women is also apparent if one considers
the developments in speech technology. In speech synthesis as well as automatic
speech recognition there is a clear preference to use 'male-like' voices, whereas it is not
clear at all, except for a few characteristics such as pitch, to what extent the voice and
pronunciation characteristics of men and women differ.
The main aim of the present study follows from the above-mentioned
arguments. The aim was to obtain more insight into the voice and pronunciation
characteristics of men and women, while distinguishing between attributed and actual
characteristics of men and women (ch. 1). The attributed characteristics were measured
by means of introspective judgments, whereas the actual characteristics were measured
by means of perceptual or acoustic analyses.
Three main topics were chosen with respect to possible differences between speech
of men and women. The first topic was the evaluation of voice and pronunciation
characteristics by means of semantic scales. The second topic was pitch/fundamental
frequency and the third topic was the intelligibility on the level of words and phonemes.
The description of our study starts with two experiments in which the importance
of non-verbal cues in speech was tested (ch. 2). Firstly, an identification experiment is
described in which the ability of listeners to extract information about age and sex from
voice and pronunciation cues alone was examined. It appeared that the listeners were
very well able to identify the sex of the speaker, but also to classify the age (which is
less obvious).
Secondly, an introspective experiment is described in which judges gave their
opinion about ideal and average voice and pronunciation characteristics of men and
women, by means of semantic scales (without actual presentation of speech).
Regarding the characteristics of ideal voice and pronunciation, it was found that the
differences between men and women were restricted to the fact that the ideal female
voice should be higher and softer than the ideal male voice. Regarding the
characteristics of average voice and pronunciation, the judges indicated far more
differences between men and women. Also, it was found that the expected average
characteristics for male speakers appeared to be closer to their ideal characteristics than
those for female speakers.
Introspective judgments give insight into the norms and expectations with respect
to voice and pronunciation of men and women. However, it could very well be that
those ideas are based on sex-related stereotypes rather than on actual speech
performance. Therefore, a listening experiment was carried out in which 40 listeners
evaluated voice and pronunciation of 30 men and 30 women, again by means of
semantic scales (ch. 3).
Apart from the variables 'sex of speaker' and 'sex of listener', a third variable,
'profession of speaker', was included in order to analyse the influence on voice and
pronunciation of a factor that is specifically socio-culturally determined. Each
speaker belonged to one of the following profession categories: nurses,
managers, and information agents (with equal numbers of male and female speakers in
this experiment). These professions differ with respect to socio-economic status (SES)
as well as with respect to the actual distribution of men and women over the three
professions.
A number of characteristics appeared to differentiate between male and female
speakers. However, these distinctions were not always in agreement with the literature
or with the introspective judgments mentioned above. In the literature it is e.g.
suggested that women speak in a more polished way than men and men speak with
more authority than women. In contrast to this, our perceptual data reveal that male and
female speakers sounded equally polished and authoritative. The data further indicate
that the professions were clearly differentiated from one another with respect to
characteristics of voice and pronunciation. Moreover, the significant differences are in
agreement with stereotypes of these professions (e.g. managers speaking in a
distinguished way and nurses speaking sweetly).
From the foregoing it is clear that the listeners differentiated between the sexes
and the professions without any cues other than voice and pronunciation.
Subsequently, an identification experiment was carried out in order to examine whether
or not listeners are able to classify the professions correctly. The results show that this
is indeed possible.
In addition to the perceptual evaluation, an introspective evaluation was carried out of
voice and pronunciation characteristics in the three profession categories, separately for
men and women. The results show, for instance, that women were supposed to speak
in a more polished way than men, whereas this tendency was not at all present in the
perceptual evaluation. Regarding the different professions, the tendencies found
correspond only partly to those of the perceptual evaluations.
In addition to the perceptual and introspective evaluation by a large group of judges,
the 60 speakers themselves were also asked for their opinion about their own voice and
pronunciation. The results of that evaluation show no significant differences, either
between male and female speakers or between the professions. So, the speakers
themselves seem not to be aware of their distinctive voice and pronunciation
characteristics.
The second topic was pitch/fundamental frequency (ch. 4). (We use the term 'pitch'
when considering the perceptual domain; the term 'fundamental frequency' is used
when referring to the acoustic domain.)
In the literature, as well as by our listeners and judges, it was reported that pitch is
the most salient factor for distinguishing between speech of men and of women.
However, is this restricted to mean pitch/fundamental frequency or do the range and
variation of pitch/fundamental frequency also play a role? From the above mentioned
evaluation experiments, the general tendency in this respect was that female speakers
sounded more melodious than male speakers. This might imply that more fundamental
frequency variation is present in speech of women.
Acoustic analyses were carried out for several read speech samples of groups of
male and female speakers. As was expected, the data reveal a clear difference in mean
fundamental frequency between male and female speakers (approximately 120 Hz versus
approximately 200 Hz, respectively). No significant differences in mean fundamental frequency were found
between speakers with a different educational level or different profession. It is striking
that the different speech conditions under study (sentences and text) also resulted in
similar mean fundamental frequency values.
Although considerable differences were found between the individual speakers with
respect to fundamental frequency range or variability, no differences were found
between the two sexes. Also, with respect to the factors 'educational level' or
'profession' no differences were found in fundamental frequency range or variability.
Of course, our results are restricted to the reading condition.
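The kind of measures discussed here (mean fundamental frequency, range, and variability) can be sketched as follows. The summary does not specify how range and variability were computed in the study, so the choices below are assumptions made for illustration only: range is taken as the distance between the highest and lowest voiced value in semitones, variability as the standard deviation in semitones, and a value of 0 is assumed to mark an unvoiced frame. The function name is hypothetical.

```python
import math
import statistics


def f0_summary(f0_hz):
    """Summarise a fundamental-frequency contour given in Hz, one value per frame."""
    # Assumption: frames with value 0 are unvoiced and are ignored.
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    # Semitones relative to 1 Hz; the reference cancels out in range and deviation.
    semitones = [12.0 * math.log2(f) for f in voiced]
    return {
        "mean_hz": statistics.mean(voiced),
        "range_st": max(semitones) - min(semitones),
        "sd_st": statistics.pstdev(semitones),
    }
```

A logarithmic (semitone) scale is used for range and variability so that speakers with different mean fundamental frequencies can be compared on the same footing; with measures in Hz, higher voices would show larger values for the same musical interval.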
The relationship between acoustics and perception is rather clear as far as pitch
height is concerned. However, only low correlations were found between fundamental
frequency range and variability on the one hand and judgments regarding
melodiousness and expressiveness on the other hand. Did we choose the wrong acoustic
parameters for obtaining useful information about pitch variation (intonational) aspects?
In order to verify the difference in fundamental frequency patterns between men and
women, a perception experiment was carried out in which manipulated speech was
presented to listeners.
The results indicated that the subjects had not been able to identify the sex of the
speakers by means of information about fundamental frequency range and variability
alone. So, the conclusion must be that at sentence level, fundamental frequency
variability plays a minor role for sex identification.
With regard to the third topic, i.e. the effect of speaker sex on intelligibility,
contrasting suggestions have been found. For instance, a strong preference exists for
male voices in speech technology applications, while on the other hand there is a
preference for female voices in actual announcement situations (e.g. in department
stores).
Intelligibility was measured in several noise conditions (ch. 5). Ten male and ten
female speakers of Standard Dutch were selected. In terms of Consonant-Vowel-
Consonant (CVC) words, it appears that the group results for male and female speakers
show equal word and phoneme intelligibility under all noise conditions. The differences
between the individual speakers were rather large. Evaluation of the intelligibility of all
speakers by means of the semantic scale 'low intelligibility - high intelligibility'
revealed similar results with respect to the rank order of the different speakers.
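Word and phoneme intelligibility for CVC material can be scored along the following lines. This is a minimal sketch that assumes each stimulus and each response is a triple of phoneme symbols, and that a word counts as correct only when all three phonemes are correct. The scoring rules actually used in the study are not given in this summary, and the function name is hypothetical.

```python
def cvc_scores(pairs):
    """Return (word_score, phoneme_score) as fractions between 0 and 1.

    pairs: list of (stimulus, response) tuples, each a 3-tuple of phoneme symbols.
    """
    words_correct = 0
    phonemes_correct = 0
    for stimulus, response in pairs:
        hits = sum(s == r for s, r in zip(stimulus, response))
        phonemes_correct += hits
        # Assumption: a word is correct only if all three phonemes match.
        if hits == 3:
            words_correct += 1
    n = len(pairs)
    return words_correct / n, phonemes_correct / (3 * n)
```

For example, if a listener hears "kat" correctly but reports "pom" for "bom", the word score is 1/2 and the phoneme score is 5/6.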
The phoneme confusions were also analysed. However, no fundamentally different
patterns were found for male as opposed to female speaker data. Most confusions took
place between phonemes that differed only with respect to one distinctive feature.
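The observation about distinctive features can be illustrated by grouping confusions according to the number of features in which stimulus and response differ. The sketch below assumes a simple three-feature description (voicing, place, manner) for six consonants. The feature system used in the study is not given in this summary, and the confusion counts in the example are hypothetical, not data from the study.

```python
# Illustrative feature table: (voicing, place, manner).
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "t": ("voiceless", "alveolar", "plosive"),
    "d": ("voiced", "alveolar", "plosive"),
    "m": ("voiced", "bilabial", "nasal"),
    "n": ("voiced", "alveolar", "nasal"),
}


def feature_distance(a, b):
    """Number of distinctive features in which phonemes a and b differ."""
    return sum(x != y for x, y in zip(FEATURES[a], FEATURES[b]))


def confusions_by_distance(confusions):
    """Total confusion counts per feature distance.

    confusions: dict mapping (stimulus, response) to a count.
    """
    totals = {}
    for (stimulus, response), count in confusions.items():
        d = feature_distance(stimulus, response)
        totals[d] = totals.get(d, 0) + count
    return totals


# Hypothetical counts, for illustration only.
example = {("p", "b"): 5, ("m", "n"): 3, ("p", "d"): 1}
print(confusions_by_distance(example))  # {1: 8, 2: 1}
```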
The aforementioned results do not indicate any striking difference between men and
women with respect to voice and/or pronunciation. In general, it can be concluded from
our study that fewer actual (perceptual or acoustic) differences with respect to voice and
pronunciation characteristics of men and women were found than were indicated in the
literature or attributed by judges (ch. 6).
Regarding the socio-culturally determined characteristics, the differences between
male and female voices and pronunciation which were actually (perceptually or
acoustically) found, seem to be of the same order as the differences found between the
professions under study. In that case, the distinction of speakers between males and
females is only one out of several other possible distinctions.
The restriction in our study to the use of read speech meant a clear abstraction from
real-life speech situations. We chose this abstraction in order not to be overwhelmed by
uncontrollable variables. However, we hope that future studies in the field of male and
female speech will proceed more and more towards natural speech situations.
Kloosterman, S. (1992): 'Classification of vowel segments using neural networks'. Master's thesis, IFA report 119, 71 pp. (in Dutch: 'Herkenning van klinkersegmenten met neurale netwerken').
Abstract of paper to be published in:
Journal of Speech Communications
The effect of sentence accent, word stress, and word class (function words versus content words) on the acoustic properties of 9 Dutch vowels in fluent speech was investigated. A list of sentences was read aloud by 15 male speakers. Each sentence contained one syllable of interest. This could be a monosyllabic function word, an unstressed syllable of a content word, or a stressed syllable of a content word. The same syllable occurred in all three conditions. Sentence accent was manipulated with questions that preceded the sentences. A total of 3465 vowels were segmented from the syllables and analysed. It was found that all three factors mentioned above had a significant effect both on the steady-state formant frequencies (F1 and F2) and on the duration of the vowels. Word stress and word class had a stronger effect on the vowels than sentence accent. A listening experiment showed the perceptual significance of the acoustic measurements. It appeared that spectral vowel reduction could be better interpreted as the result of increased contextual assimilation than as a tendency to centralize. We also studied changes in the dynamics of the formant tracks due to the experimental conditions. It was found that formant tracks of reduced vowels became flatter, which supports the view of increased contextual assimilation. Three simple models of vowel reduction are discussed.