Publications

Refine Results

(Filters Applied) Clear All

Dialect recognition using adapted phonetic models

Published in:
INTERSPEECH 2008, 22-26 September 2008, p. 763-766.

Summary

In this paper, we introduce a dialect recognition method that makes use of phonetic models adapted per dialect without phonetically labeled data. We show that this method can be implemented efficiently within an existing PRLM system. We compare the performance of this system with other state-of-the-art dialect recognition methods (both acoustic and token-based) on the NIST LRE 2007 English and Mandarin dialect recognition tasks. Our experimental results indicate that this system can perform better than baseline GMM and adapted PRLM systems, and also results in consistent gains of 15-23% when combined with other systems.
READ LESS

Summary

In this paper, we introduce a dialect recognition method that makes use of phonetic models adapted per dialect without phonetically labeled data. We show that this method can be implemented efficiently within an existing PRLM system. We compare the performance of this system with other state-of-the-art dialect recognition methods (both...

READ MORE

Eigen-channel compensation and discriminatively trained Gaussian mixture models for dialect and accent recognition

Published in:
Proc. INTERSPEECH 2008, 22-26 September 2008, pp. 723-726.

Summary

This paper presents a series of dialect/accent identification results for three sets of dialects with discriminatively trained Gaussian mixture models and feature compensation using eigen-channel decomposition. The classification tasks evaluated in the paper include: 1)the Chinese language classes, 2) American and Indian accented English and 3) discrimination between three Arabic dialects. The first two tasks were evaluated on the 2007 NIST LRE corpus. The Arabic discrimination task was evaluated using data derived from the LDC Arabic set collected by Appen. Analysis is performed for the English accent problem studied and an approach to open set dialect scoring is introduced. The system resulted in equal error rates at or below 10% for each of the tasks studied.
READ LESS

Summary

This paper presents a series of dialect/accent identification results for three sets of dialects with discriminatively trained Gaussian mixture models and feature compensation using eigen-channel decomposition. The classification tasks evaluated in the paper include: 1)the Chinese language classes, 2) American and Indian accented English and 3) discrimination between three Arabic...

READ MORE

The MITLL NIST LRE 2007 language recognition system

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2007 Language Recognition Evaluation. This system consists of a fusion of four core recognizers, two based on tokenization and two based on spectral similarity. Results for NIST?s 14-language detection task are presented for both the closed-set and open-set tasks and for the 30, 10 and 3 second durations. On the 30 second 14-language closed set detection task, the system achieves a 1% equal error rate.
READ LESS

Summary

This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2007 Language Recognition Evaluation. This system consists of a fusion of four core recognizers, two based on tokenization and two based on spectral similarity. Results for NIST?s 14-language detection task are presented for...

READ MORE

Improved GMM-based language recognition using constrained MLLR transforms

Author:
Published in:
Proc. 33rd IEEE Int. Conf. on Acoustics, Speech, and SIgnal Processing, ICASSP, 30 March - 4 April 2008, pp. 4149-4152.

Summary

In this paper we describe the application of a feature-space transform based on constrained maximum likelihood linear regression for unsupervised compensation of channel and speaker variability to the language recognition problem. We show that use of such transforms can improve baseline GMM-based language recognition performance on the 2005 NIST Language Recognition Evaluation (LRE05) task by 38%. Furthermore, gains from CMLLR are additive with other modeling enhancements such as vocal tract length normalization (VTLN). Further improvement is obtained using discriminative training, and it is shown that a system using only CMLLR adaption produces state-of-the-art accuracy with decreased test-time computational cost than systems using VTLN.
READ LESS

Summary

In this paper we describe the application of a feature-space transform based on constrained maximum likelihood linear regression for unsupervised compensation of channel and speaker variability to the language recognition problem. We show that use of such transforms can improve baseline GMM-based language recognition performance on the 2005 NIST Language...

READ MORE

Classification methods for speaker recognition

Published in:
Chapter in Springer Lecture Notes in Artificial Intelligence, 2007.

Summary

Automatic speaker recognition systems have a foundation built on ideas and techniques from the areas of speech science for speaker characterization, pattern recognition and engineering. In this chapter we provide an overview of the features, models, and classifiers derived from these areas that are the basis for modern automatic speaker recognition systems. We describe the components of state-of-the-art automatic speaker recognition systems, discuss application considerations and provide a brief survey of accuracy for different tasks.
READ LESS

Summary

Automatic speaker recognition systems have a foundation built on ideas and techniques from the areas of speech science for speaker characterization, pattern recognition and engineering. In this chapter we provide an overview of the features, models, and classifiers derived from these areas that are the basis for modern automatic speaker...

READ MORE

Speaker verification using support vector machines and high-level features

Published in:
IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 7, September 2007, pp. 2085-2094.

Summary

High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although a significant amount of work has been done in finding novel high-level features, less work has been done on modeling these features. We describe a method of speaker modeling based upon support vector machines. Current high-level feature extraction produces sequences or lattices of tokens for a given conversation side. These sequences can be converted to counts and then frequencies of -gram for a given conversation side. We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log likelihood ratio scoring system. Generalizations of this method are shown to produce excellent results on a variety of high-level features. We demonstrate that our methods produce results significantly better than standard log-likelihood ratio modeling. We also demonstrate that our system can perform well in conjunction with standard cesptral speaker recognition systems.
READ LESS

Summary

High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although...

READ MORE

A comparison of speaker clustering and speech recognition techniques for air situational awareness

Author:
Published in:
INTERSPEECH 2007, 27-31 August 2007, pp. 2421-2424.

Summary

In this paper we compare speaker clustering and speech recognition techniques to the problem of understanding patterns of air traffic control communications. For a given radio transmission, our goal is to identify the talker and to whom he/she is speaking. This information, in combination with knowledge of the roles (i.e. takeoff, approach, hand-off, taxi, etc.) of different radio frequencies within an air traffic control region could allow tracking of pilots through various stages of flight, thus providing the potential to monitor the airspace in great detail. Both techniques must contend with degraded audio channels and significant non-native accents. We report results from experiments using the nn-MATC database showing 9.3% and 32.6% clustering error for speaker clustering and ASR methods respectively.
READ LESS

Summary

In this paper we compare speaker clustering and speech recognition techniques to the problem of understanding patterns of air traffic control communications. For a given radio transmission, our goal is to identify the talker and to whom he/she is speaking. This information, in combination with knowledge of the roles (i.e...

READ MORE

Improving phonotactic language recognition with acoustic adaptation

Author:
Published in:
INTERSPEECH 2007, 27-31 August 2007, pp. 358-361.

Summary

In recent evaluations of automatic language recognition systems, phonotactic approaches have proven highly effective. However, as most of these systems rely on underlying ASR techniques to derive a phonetic tokenization, these techniques are potentially susceptible to acoustic variability from non-language sources (i.e. gender, speaker, channel, etc.). In this paper we apply techniques from ASR research to normalize and adapt HMM-based phonetic models to improve phonotactic language recognition performance. Experiments we conducted with these techniques show an EER reduction of 29% over traditional PRLM-based approaches.
READ LESS

Summary

In recent evaluations of automatic language recognition systems, phonotactic approaches have proven highly effective. However, as most of these systems rely on underlying ASR techniques to derive a phonetic tokenization, these techniques are potentially susceptible to acoustic variability from non-language sources (i.e. gender, speaker, channel, etc.). In this paper we...

READ MORE

Robust speaker recognition in noisy conditions

Published in:
IEEE. Trans. Speech Audio Process., Vol. 15, No. 5, July 2007, pp. 1711-1723.

Summary

This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Multicondition training is conducted using simulated noisy data with limited noise variation, providing a coarse compensation for the noise, and missing-feature theory is applied to refine the compensation by ignoring noise variation outside the given training conditions, thereby reducing the training and testing mismatch. This paper is focused on several issues relating to the implementation of the new model for real-world applications. These include the generation of multicondition training data to model noisy speech, the combination of different training data to optimize the recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.
READ LESS

Summary

This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the...

READ MORE

Text-independent speaker recognition

Published in:
Springer Handbook of Speech Processing and Communication, 2007, pp. 763-81.

Summary

In this chapter, we focus on the area of text-independent speaker verification, with an emphasis on unconstrained telephone conversational speech. We begin by providing a general likelihood ratio detection task framework to describe the various components in modern text-independent speaker verification systems. We next describe the general hierarchy of speaker information conveyed in the speech signal and the issues involved in reliably exploiting these levels of information for practical speaker verification systems. We then describe specific implementations of state-of-the-art text-independent speaker verification systems utilizing low-level spectral information and high-level token sequence information with generative and discriminative modeling techniques. Finally, we provide a performance assessment of these systems using the National Institute of Standards and Technology (NIST) speaker recognition evaluation telephone corpora.
READ LESS

Summary

In this chapter, we focus on the area of text-independent speaker verification, with an emphasis on unconstrained telephone conversational speech. We begin by providing a general likelihood ratio detection task framework to describe the various components in modern text-independent speaker verification systems. We next describe the general hierarchy of speaker...

READ MORE