While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
More than a decade has passed since research on the automatic recognition of emotion from speech became a field of research in its own right, alongside its ‘big brothers’, speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there, and what this can tell us about where to go next and how we could get there. In the first part, we address the basic phenomenon, reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then shift to automatic processing, including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge, organised by (some of) the authors, including a description of the Challenge’s database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results, through to the lessons actually learnt, before we finally address the ever-lasting problems and promising future attempts.
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future.
The database, designed to evaluate text-dependent speaker verification systems under different durations and lexical constraints, has been collected and released by the Human Language Technology (HLT) department at the Institute for Infocomm Research (I2R) in Singapore. English speakers were recorded with a balanced diversity of accents commonly found in Singapore. More than 151 h of speech data were recorded using mobile devices. The pool of speakers consists of 300 participants (143 female and 157 male speakers) between 17 and 42 years old, making the database one of the largest publicly available databases targeted at text-dependent speaker verification. We provide an evaluation protocol for each of the three parts of the database, together with the results of two speaker verification systems: the HiLAM system, based on a three-layer acoustic architecture, and an i-vector/PLDA system. We thus provide the research community with a reference evaluation scheme and reference performance on the database. The HiLAM system outperforms the state-of-the-art i-vector system in most of the scenarios.
This paper is the first review into the automatic analysis of speech for use as an objective predictor of depression and suicidality. Both conditions are major public health concerns; depression has long been recognised as a prominent cause of disability and burden worldwide, whilst suicide is a misunderstood and complex cause of death that strongly impacts the quality of life and mental health of the families and communities left behind. Despite this prevalence, the diagnosis of depression and the assessment of suicide risk are, due to their complex clinical characterisations, difficult tasks, nominally achieved by the categorical assessment of a set of specific symptoms. However, many of the key symptoms of either condition, such as altered mood and motivation, are not physical in nature; therefore assigning a categorical score to them introduces a range of subjective biases to the diagnostic procedure. Due to these difficulties, research into finding a set of biological, physiological and behavioural markers to aid clinical assessment is gaining in popularity. This review starts by building the case for speech to be considered a key objective marker for both conditions, reviewing current diagnostic and assessment methods for depression and suicidality, including key non-speech biological, physiological and behavioural markers, and highlighting the expected cognitive and physiological changes associated with both conditions which affect speech production. We then review the key characteristics (size, associated clinical scores and collection paradigm) of active depressed and suicidal speech databases. The main focus of this paper is on how common paralinguistic speech characteristics are affected by depression and suicidality and the application of this information in classification and prediction systems.
The paper concludes with an in-depth discussion of the key challenges that will shape the future research directions of this rapidly growing field of speech processing research: improving generalisability through greater research collaboration and increased standardisation of data collection, and mitigating unwanted sources of variability.
Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. In this paper, we present a survey that focuses on automatic speech recognition (ASR) for these languages. Under-resourced languages are first defined and the challenges associated with them are described. The main part of the paper is a literature review of the recent (last 8 years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested in initiating research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here apply to speech technology in general (text-to-speech synthesis, for instance).
► A class of block based MFCC computation techniques is investigated. ► A formant specific subband partitioning scheme is shown to be more efficient. ► A technique for computing relative subband information is proposed. ► Multiple systems have been successfully fused to improve performance. ► The proposed system is more robust than MFCC in the presence of noise. The standard Mel frequency cepstrum coefficient (MFCC) computation technique utilizes the discrete cosine transform (DCT) for decorrelating the log energies of the filter bank output. The use of the DCT is reasonable here, as the covariance matrix of the Mel filter bank log energy (MFLE) can be compared with that of a highly correlated Markov-I process. This full-band based MFCC computation technique, where each filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, the full-band based MFCC feature is severely degraded when the speech signal is corrupted with narrow-band channel noise, even though a few filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block wise transformation of the MFLE, which effectively decorrelate the filter bank log energies and also capture speech information in an efficient manner. A thorough study has been carried out on the block based transformation approach by investigating a new partitioning technique that highlights the associated advantages. This article also reports a novel feature extraction scheme which captures information complementary to the wide band information, which otherwise remains undetected by the standard MFCC and the proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system.
We have obtained significant performance improvements over the baseline features for both matched and mismatched conditions, and for both standard and narrow-band noises. The proposed method achieves a significant performance improvement in the presence of narrow-band noise when combined with a missing feature theory based score computation scheme.
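The contrast drawn above between full-band and block wise decorrelation can be illustrated with a minimal sketch. It assumes an unnormalised DCT-II over 20 filter bank log energies and a hypothetical split into two equal blocks; the block boundaries, the function names `dct2` and `block_dct`, and the toy values are illustrative assumptions, not the partitioning scheme proposed in the paper.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a list of floats."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        for k in range(n)
    ]

def block_dct(log_energies, boundaries):
    """Apply a separate DCT-II to each block of filter bank log energies.

    `boundaries` lists the start index of every block (an assumed layout).
    """
    edges = list(boundaries) + [len(log_energies)]
    coeffs = []
    for start, stop in zip(edges[:-1], edges[1:]):
        coeffs.extend(dct2(log_energies[start:stop]))
    return coeffs

def count_changed(a, b, tol=1e-9):
    """Number of coefficients that differ between two feature vectors."""
    return sum(abs(u - v) > tol for u, v in zip(a, b))

# Toy log energies for 20 filters (made-up values).
clean = [1.0 + 0.1 * i for i in range(20)]

# Narrow-band noise that corrupts only channels 15 to 17.
noisy = list(clean)
for ch in (15, 16, 17):
    noisy[ch] += 2.0

full_changed = count_changed(dct2(clean), dct2(noisy))
block_clean = block_dct(clean, [0, 10])
block_noisy = block_dct(noisy, [0, 10])
block_changed = count_changed(block_clean, block_noisy)

print("full-band coefficients affected:", full_changed)
print("block-wise coefficients affected:", block_changed)
```

In this toy setting the coefficients of the first block are computed only from channels 0 to 9, so they are untouched by the noise, whereas the full-band transform mixes every channel into every coefficient.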
► We propose a hierarchical structure for multiclass emotion recognition tasks. ► The structure is designed to first operate on the most differentiable binary task. ► The framework shows promising accuracy across two emotional databases. ► The structure won the 2009 Interspeech Emotion Challenge (classifier sub-challenge). Automated emotion state tracking is a crucial element in the computational study of human communication behaviors. It is important to design robust and reliable emotion recognition systems that are suitable for real-world applications, both to enhance analytical abilities to support human decision making and to design human–machine interfaces that facilitate efficient communication. We introduce a hierarchical computational structure to recognize emotions. The proposed structure maps an input speech utterance into one of the multiple emotion classes through subsequent layers of binary classifications. The key idea is that the levels in the tree are designed to solve the easiest classification tasks first, allowing us to mitigate error propagation. We evaluated the classification framework on two different emotional databases using acoustic features, the AIBO database and the USC IEMOCAP database. In the case of the AIBO database, we obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall on the evaluation data set improves by 3.37% absolute (8.82% relative) over a Support Vector Machine baseline model. In the USC IEMOCAP database, we obtain an absolute improvement of 7.44% (14.58% relative) over a baseline Support Vector Machine model. The results demonstrate that the presented hierarchical approach is effective for classifying emotional utterances in multiple database contexts.
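The idea of resolving the easiest binary decisions first can be sketched as a simple chain of binary tests, which is a degenerate case of the tree described above. Everything in this sketch is a hypothetical illustration: the two-element feature vector, the threshold rules standing in for trained binary classifiers, the class labels, and the function name `classify` are assumptions and do not reproduce the structure used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A stage pairs a binary decision function with the label emitted
# when that decision is positive (both are illustrative assumptions).
Stage = Tuple[Callable[[Sequence[float]], bool], str]

def classify(features: Sequence[float], stages: List[Stage], fallback: str) -> str:
    """Walk the stages in order and return the label of the first positive one.

    Stages are assumed to be ordered from the easiest binary task to the
    hardest, so ambiguous utterances are only judged at the later stages.
    """
    for is_member, label in stages:
        if is_member(features):
            return label
    return fallback

# Hypothetical features: [normalised energy, normalised pitch].
stages: List[Stage] = [
    (lambda f: f[0] > 0.8, "angry"),     # assumed easiest split: very high energy
    (lambda f: f[1] > 0.7, "positive"),  # assumed harder split: high pitch
]

print(classify([0.9, 0.1], stages, fallback="neutral"))
print(classify([0.2, 0.9], stages, fallback="neutral"))
print(classify([0.2, 0.1], stages, fallback="neutral"))
```

In a real system each threshold rule would be replaced by a trained binary classifier, and the ordering of the stages would be chosen from measured separability rather than fixed by hand.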
Grapheme-to-phoneme conversion is the task of finding the pronunciation of a word given its written form. It has important applications in text-to-speech and speech recognition. Joint-sequence models are a simple and theoretically stringent probabilistic framework that is applicable to this problem. This article provides a self-contained and detailed description of this method. We present a novel estimation algorithm and demonstrate high accuracy on a variety of databases. Moreover, we study the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion. Our software implementation of the method proposed in this work is available under an Open Source license.
In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained using an auditory filterbank and a modulation filterbank for speech analysis, the representation captures both acoustic frequency and temporal modulation frequency components, thereby conveying information that is important for human speech perception but missing from conventional short-term spectral features. In an experiment assessing classification of discrete emotion categories, the MSFs show promising performance in comparison with features that are based on mel-frequency cepstral coefficients and perceptual linear prediction coefficients, two commonly used short-term spectral representations. The MSFs further yield a substantial improvement in recognition performance when used to augment prosodic features, which have been extensively used for emotion recognition. Using both types of features, an overall recognition rate of 91.6% is obtained for classifying seven emotion categories. Moreover, in an experiment assessing recognition of continuous emotions, the proposed features in combination with prosodic features attain estimation performance comparable to human evaluation.
In this paper we give an overview of emotional speech recognition with three goals in mind. The first goal is to provide an up-to-date record of the available emotional speech data collections. The number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed. The second goal is to present the most frequent acoustic features used for emotional speech recognition and to assess how emotion affects them. Typical features are the pitch, the formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review appropriate techniques for classifying speech into emotional states. We examine classification techniques that exploit timing information separately from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed.
The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a ‘silent speech interface’, to be used as an aid for the speech-handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telecommunications privacy issues, and then follows with a presentation of demonstrator systems based on seven different types of technologies. A concluding section underlining some of the common challenges faced by silent speech interface researchers, and ideas for possible future directions, is also provided.
Deep learning has recently been used successfully in speech recognition; however, it has not yet been carefully explored or widely accepted for speaker verification. To incorporate deep learning into speaker verification, this paper proposes novel approaches to extracting and using features from deep learning models for text-dependent speaker verification. In contrast to traditional short-term spectral features, such as MFCC or PLP, in this paper the outputs from the hidden layers of various deep models are employed as deep features for text-dependent speaker verification. Four types of deep models are investigated: deep Restricted Boltzmann Machines, speech-discriminant Deep Neural Network (DNN), speaker-discriminant DNN, and multi-task joint-learned DNN. Once deep features are extracted, they may be used within either the GMM-UBM framework or the identity vector (i-vector) framework. Joint linear discriminant analysis and probabilistic linear discriminant analysis are proposed as effective back-end classifiers for identity vector based deep features. These approaches were evaluated on the RSR2015 data corpus. Experiments showed that deep feature based methods can obtain significant performance improvements compared to the traditional baselines, whether they are applied directly in the GMM-UBM system or utilized as identity vectors. The EER of the best system using the proposed identity vector is 0.10%, only one fifteenth of that of the GMM-UBM baseline.
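For readers unfamiliar with the equal error rate (EER) quoted above, the following minimal sketch shows one common way to estimate it from verification scores: sweep a decision threshold and report the point where the false-accept and false-reject rates are closest. The function name `equal_error_rate` and the toy scores are illustrative assumptions; the paper's own scoring tools are not described in the abstract.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is at or above the threshold.
    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejects: genuine trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accepts: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Made-up scores: one genuine trial scores below one impostor trial.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, impostors))
```

With only a handful of trials the estimate is coarse; evaluations on large trial lists usually interpolate between neighbouring thresholds instead of picking the closest one.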
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. A subset of VT, voice conversion (VC), specifically aims to change a source speaker’s speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically while simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
Making meaningful comparisons between the performance of the various speech enhancement algorithms proposed over the years has been elusive due to lack of a common speech database, differences in the types of noise used and differences in the testing methodology. To facilitate such comparisons, we report on the development of a noisy speech corpus suitable for evaluation of speech enhancement algorithms. This corpus is subsequently used for the subjective evaluation of 13 speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms. The subjective evaluation was performed by Dynastat, Inc., using the ITU-T P.835 methodology designed to evaluate the speech quality along three dimensions: signal distortion, noise distortion and overall quality. This paper reports the results of the subjective tests.
During the past three decades, the issue of processing spectral phase has been largely neglected in speech applications. Nevertheless, there is no doubt that the interest of the speech processing community in the use of phase information across a wide range of speech technologies, from automatic speech and speaker recognition to speech synthesis, and from speech enhancement and source separation to speech coding, is constantly increasing. In this paper, we elaborate on why phase was believed to be unimportant in each application. We provide an overview of advancements in phase-aware signal processing with applications to speech, showing that phase-aware speech processing can be beneficial in many cases and can complement the solutions that magnitude-only methods suggest. Our goal is to show that phase-aware signal processing is an important emerging field with high potential in current speech communication applications. The paper provides an extended and up-to-date bibliography on the topic of phase-aware speech processing, aiming to provide interested readers with the necessary background for following recent advancements in the area. Our review builds on the special session we organized and exemplifies the usefulness of spectral phase information in a wide range of speech processing applications. Finally, the overview suggests some directions for future work.
A novel speech enhancement method based on a Weighted Denoising Auto-encoder (WDA) and noise classification is proposed in this paper. A weighted reconstruction loss function is introduced into the conventional Denoising Auto-encoder (DA), and the relationship between the power spectra of clean speech and the noisy observation is described by the WDA model. First, the sub-band power spectrum of clean speech is estimated by the WDA model from the noisy observation. Then, the a priori SNR is estimated by the a Posteriori SNR Controlled Recursive Averaging (PCRA) approach. Finally, the clean speech is obtained by a Wiener filter in the frequency domain. In addition, in order to make the proposed method suitable for various kinds of noise conditions, a Gaussian Mixture Model (GMM) based noise classification method is employed, and the corresponding WDA model is used in the enhancement process. The test results under ITU-T G.160 show that, in comparison with the reference method, which is the Wiener filtering method with the decision-directed approach for a priori SNR estimation, the WDA-based speech enhancement methods achieve better objective speech quality, regardless of whether the noise conditions are included in the training set. A similar amount of noise reduction and SNR improvement can also be obtained with smaller distortion of the speech level.
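The reference method named above, Wiener filtering with decision-directed SNR estimation, can be sketched for a single frequency bin as follows. This illustrates the widely used textbook form of the decision-directed rule and the Wiener gain G = ξ / (1 + ξ), not the WDA or PCRA components of the proposed method. The smoothing factor `alpha = 0.98`, the assumption of a known and constant noise power, and the function names are illustrative assumptions.

```python
def wiener_gain(xi):
    """Wiener gain for a given a priori SNR xi (linear scale, xi >= 0)."""
    return xi / (1.0 + xi)

def decision_directed_wiener(noisy_power, noise_power, alpha=0.98):
    """Per-frame Wiener gains for one frequency bin.

    noisy_power: list of |Y|^2 values, one per frame.
    noise_power: assumed known, constant noise power for this bin.
    alpha: smoothing factor of the decision-directed rule (assumed value).
    """
    prev_clean_power = 0.0  # no clean-speech estimate before the first frame
    gains = []
    for y2 in noisy_power:
        gamma = y2 / noise_power  # a posteriori SNR
        # Blend the previous frame's clean-speech estimate with the
        # instantaneous estimate max(gamma - 1, 0).
        xi = (alpha * prev_clean_power / noise_power
              + (1.0 - alpha) * max(gamma - 1.0, 0.0))
        gain = wiener_gain(xi)
        gains.append(gain)
        prev_clean_power = (gain ** 2) * y2  # |S_hat|^2 for the next frame
    return gains

# Frames at the noise floor are suppressed; strong frames are mostly kept.
print(decision_directed_wiener([1.0, 1.0, 1.0], noise_power=1.0))
print(decision_directed_wiener([100.0, 100.0, 100.0], noise_power=1.0))
```

The heavy weight on the previous frame smooths the SNR estimate over time, which is the usual motivation for the decision-directed rule, at the cost of reacting slowly to speech onsets.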
Typical speech enhancement methods, based on the short-time Fourier analysis-modification-synthesis (AMS) framework, modify only the magnitude spectrum and keep the phase spectrum unchanged. In this paper our aim is to show that by modifying the phase spectrum in the enhancement process the quality of the resulting speech can be improved. For this we use analysis windows of 32 ms duration and investigate a number of approaches to phase spectrum computation. These include the use of matched or mismatched analysis windows for magnitude and phase spectra estimation during AMS processing, as well as the phase spectrum compensation (PSC) method. We consider four cases and conduct a series of objective and subjective experiments that examine the importance of the phase spectrum for speech quality in a systematic manner. In the first (oracle) case, our goal is to determine the maximum speech quality improvements achievable when accurate phase spectrum estimates are available, but when no enhancement is performed on the magnitude spectrum. For this purpose speech stimuli are constructed, where (during AMS processing) the phase spectrum is computed from clean speech, while the magnitude spectrum is computed from noisy speech. While such a situation does not arise in practice, it does provide us with useful insight into how much precise knowledge of the phase spectrum can contribute towards speech quality. In this first case, matched and mismatched analysis window approaches are investigated. Particular attention is given to the choice of analysis window type used during phase spectrum computation, where the effect of spectral dynamic range on speech quality is examined. In the second (non-oracle) case, we consider a more realistic scenario where only the noisy spectra (observable in practice) are available. We study the potential of the mismatched window approach for speech quality improvements in this non-oracle case.
We would also like to determine how much room for improvement exists between this case and the best (oracle) case. In the third case, we use the PSC algorithm to enhance the phase spectrum. We compare this approach with the oracle and non-oracle matched and mismatched window techniques investigated in the preceding cases. While in the first three cases we consider the usefulness of various approaches to phase spectrum computation within the AMS framework when noisy magnitude spectrum is used, in the fourth case we examine the usefulness of these techniques when enhanced magnitude spectrum is employed. Our aim (in the context of traditional magnitude spectrum-based enhancement methods) is to determine how much benefit in terms of speech quality can be attained by also processing the phase spectrum. For this purpose, the minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimates are employed instead of noisy magnitude spectra. The results of the oracle experiments show that accurate phase spectrum estimates can considerably contribute towards speech quality, as well as that the use of mismatched analysis windows (in the computation of the magnitude and phase spectra) provides significant improvements in both objective and subjective speech quality – especially, when the choice of analysis window used for phase spectrum computation is carefully considered. The mismatched window approach was also found to improve speech quality in the non-oracle case. While the improvements were found to be statistically significant, they were only modest compared to those observed in the oracle case. This suggests that research into better phase spectrum estimation algorithms, while a challenging task, could be worthwhile. The results of the PSC experiments indicate that the PSC method achieves better speech quality improvements than the other non-oracle methods considered. 
The results of the MMSE experiments suggest that accurate phase spectrum estimates have a potential to significantly improve performance of existing magnitude spectrum-based methods. Out of the non-oracle approaches considered, the combination of the MMSE STSA method with the PSC algorithm produced significantly better speech quality improvements than those achieved by these methods individually.
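The oracle construction described above, in which the magnitude spectrum comes from noisy speech and the phase spectrum from clean speech, can be sketched for a single frame as follows. This is a simplified illustration using a naive DFT from the standard library; windowing, the 32 ms frame length, overlap-add synthesis, and the mismatched-window variants are all omitted, and the function names and toy frames are assumptions.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def oracle_phase_frame(noisy_frame, clean_frame):
    """Combine the noisy magnitude spectrum with the clean phase spectrum."""
    noisy_spec = dft(noisy_frame)
    clean_spec = dft(clean_frame)
    mixed = [
        cmath.rect(abs(y), cmath.phase(s))  # |Y| * exp(j * angle(S))
        for y, s in zip(noisy_spec, clean_spec)
    ]
    return idft(mixed)

# Toy 8-sample frames (made-up values).
clean = [1.0, 0.5, -0.3, 0.8, -0.6, 0.2, 0.4, -0.9]
noise = [0.2, -0.1, 0.3, -0.2, 0.1, 0.25, -0.3, 0.15]
noisy = [c + d for c, d in zip(clean, noise)]

stimulus = oracle_phase_frame(noisy, clean)
print([round(v, 3) for v in stimulus])
```

Such a stimulus cannot be produced in practice because the clean phase is unknown; it serves only to bound what a perfect phase estimator could contribute.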