Problem:
Vowels are traditionally characterized by low-frequency (i.e., below 3 kHz) spectral peaks known as formants. It is commonly assumed that perceptual cues from the low-frequency region of the speech spectrum provide the requisite information for vowel identity, with little consideration given to the high-frequency region. Similarly, identification of speaker gender from an audio signal is assumed to rely heavily on the low-frequency region of the speech spectrum. The purpose of the present study is to determine whether, and how well, the high-pass filtered vowel signals recorded and identified by human listeners in our previous study can be separated using Mel-frequency cepstral coefficients (MFCCs) in a supervised learning approach.
Solution:
To assess how well the high-pass filtered vowel signals recorded and identified by human listeners in our previous study could be separated using MFCCs, four experiments were designed; a schematic sketch of the four splits follows this paragraph. The first experiment combined all vowel samples (male, female, and child), producing six vowel classes with 30 samples per class. The second experiment used the same dataset, but classification accuracy was computed separately for the male, female, and child vowels. The third experiment classified the filtered vowels by speaker type: all vowel samples were combined to create three classes (male, female, and child) with 60 samples per class. The fourth experiment classified speaker type within each vowel category.
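The four splits can be expressed compactly in terms of which tag serves as the class label and how the pool is partitioned. The sketch below is purely illustrative: the field names ("vowel", "speaker") and vowel identifiers are assumptions of this summary, not the authors' data format.

VOWELS = ["ae", "i", "u", "open_o", "er", "e"]   # /ae i u open-o r-colored e/
SPEAKER_TYPES = ["male", "female", "child"]

def split_experiments(samples):
    """samples: list of dicts like {"mfcc": ..., "vowel": "ae", "speaker": "male"}."""
    exps = {}
    # Exp. 1: six vowel classes, all speaker types pooled (30 samples/class).
    exps[1] = [(s, s["vowel"]) for s in samples]
    # Exp. 2: vowel classification run separately within each speaker type.
    exps[2] = {t: [(s, s["vowel"]) for s in samples if s["speaker"] == t]
               for t in SPEAKER_TYPES}
    # Exp. 3: three speaker-type classes, all vowels pooled (60 samples/class).
    exps[3] = [(s, s["speaker"]) for s in samples]
    # Exp. 4: speaker-type classification run separately within each vowel.
    exps[4] = {v: [(s, s["speaker"]) for s in samples if s["vowel"] == v]
               for v in VOWELS}
    return exps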
Method:
Classification proceeds in several steps. First, MFCCs are extracted from each vowel signal, forming a temporal sequence. Each sequence is modeled as the output of a dynamical system, so classification reduces to comparing dynamical system models. Among the various metrics for computing similarity between dynamical systems, the family of Binet-Cauchy kernels has been shown to perform well in a range of applications. These kernels are then used to train a support vector machine (SVM) classifier in a supervised learning framework.
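The following is a minimal sketch of this pipeline under stated assumptions: each MFCC sequence is fit with a linear dynamical system (LDS) via the standard SVD-based method, pairwise similarities are computed with a discounted Binet-Cauchy trace kernel, and the resulting Gram matrix trains an SVM. The state dimension, discount factor lam, and function names are illustrative choices, not the authors' exact settings.

import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_sequence(signal, sr, n_mfcc=13):
    """Extract an MFCC sequence of shape (n_mfcc, T) from a vowel signal."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

def fit_lds(Y, n_states=5):
    """Fit an LDS x_{t+1} = A x_t, y_t = C x_t to the feature sequence Y (d x T)
    via SVD, as in the dynamic-textures literature."""
    Y = Y - Y.mean(axis=1, keepdims=True)        # center the features
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                          # observation matrix (d x n)
    X = np.diag(s[:n_states]) @ Vt[:n_states]    # state sequence (n x T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])     # transition matrix (n x n)
    return A, C, X[:, 0]                         # x0 = initial state

def binet_cauchy_trace_kernel(lds1, lds2, lam=0.9):
    """Discounted Binet-Cauchy trace kernel between two LDS models:
    k = x0_1^T P x0_2, where P solves P = C1^T C2 + lam * A1^T P A2.
    Convergence assumes lam is small enough for the spectra of A1, A2."""
    A1, C1, x01 = lds1
    A2, C2, x02 = lds2
    n = A1.shape[0]
    # Vectorize the Sylvester-type equation and solve it as a linear system.
    M = np.eye(n * n) - lam * np.kron(A2.T, A1.T)
    p = np.linalg.solve(M, (C1.T @ C2).flatten(order="F"))
    P = p.reshape((n, n), order="F")
    return float(x01 @ P @ x02)

def gram_matrix(models, lam=0.9):
    """Pairwise kernel matrix over fitted LDS models."""
    return np.array([[binet_cauchy_trace_kernel(mi, mj, lam)
                      for mj in models] for mi in models])

# Usage sketch: `signals` and `labels` are the filtered vowel clips and classes.
# models = [fit_lds(mfcc_sequence(x, sr)) for x in signals]
# K = gram_matrix(models)
# clf = SVC(kernel="precomputed").fit(K, labels)

In practice the kernel matrix is often normalized (k(x, y) / sqrt(k(x, x) k(y, y))) and may need a positive semidefinite correction before SVM training; the sketch omits both for brevity.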
Dataset:
Vowels were recorded from two males, two females, and two children (one male and one female, both age 10) at 96 kHz and 24-bit resolution in an /hVd/ context using a high-fidelity microphone (Lawson 251). Stimuli for the experiment consisted of five productions of six naturally produced, high-pass filtered hVd signals: /æ/ as in “had”, /i/ as in “heed”, /u/ as in “who’d”, /ɔ/ as in “hawd”, /ɝ/ as in “herd”, and /e/ as in “hayed.” The hVd signals were recorded in connected speech using the carrier phrase “I say (hVd) again,” and the individual hVd tokens were extracted from the audio file for processing.
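Preparation of a single hVd token might look like the sketch below, assuming a zero-phase Butterworth high-pass filter. The cutoff frequency and segment placement are placeholders: the actual values come from the previous listening study and are not specified in this summary, and the input is assumed to be mono.

import numpy as np
from scipy.signal import butter, sosfiltfilt
import soundfile as sf

def prepare_token(path, cutoff_hz, seg_ms=100):
    x, sr = sf.read(path)                    # 96 kHz / 24-bit recording
    sos = butter(8, cutoff_hz, btype="highpass", fs=sr, output="sos")
    x_hp = sosfiltfilt(sos, x)               # zero-phase high-pass filtering
    n = int(sr * seg_ms / 1000)
    mid = len(x_hp) // 2
    return x_hp[mid - n // 2 : mid + n // 2]  # e.g., 100 ms from the vowel center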
Results:
With a dataset containing multiple productions of vowels from two male, two female, and two child speakers, the classification results suggest that the high-frequency region contains usable information for identifying both vowel category and speaker type in a supervised learning framework. Overall accuracy for the full set of hVd signals (combined male, female, and child signals) was at least 90% correct, indicating good classification performance on a combined set of high-pass filtered vowel productions from the three speaker types. Accuracy of this order was found for the full hVd signals at 48 and 96 kHz sample rates as well as for the 100 ms segments at a 48 kHz sample rate (see Table 1). Results also showed good performance (accuracy rates of 85-92%) for classifying speaker type (male, female, or child) from the high-pass filtered vowels.