Publications

The MIT-LL/AFRL IWSLT-2007 MT System

Published in:
Int. Workshop on Spoken Language Translation, IWSLT, 15-16 October 2007.

Summary

The MIT-LL/AFRL MT system implements a standard phrase-based, statistical translation model. It incorporates a number of extensions that improve performance for speech-based translation. During this evaluation our efforts focused on the rapid porting of our SMT system to a new language (Arabic) and on novel approaches to translation from speech input. This paper discusses the architecture of the MIT-LL/AFRL MT system, improvements over our 2006 system, and experiments we ran during the IWSLT-2007 evaluation. Specifically, we focus on 1) experiments comparing the performance of confusion network decoding and direct lattice decoding techniques for speech machine translation, 2) the application of lightweight morphology for Arabic MT pre-processing, and 3) improved confusion network decoding.
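
As a rough illustration of the confusion network decoding idea mentioned above, the sketch below decodes a toy confusion network by combining ASR word posteriors with a stand-in translation-model score. The network, the scores, and the helper names are illustrative assumptions, not the MIT-LL/AFRL system.

from math import log

# A confusion network is a sequence of "slots"; each slot holds alternative
# words with ASR posterior probabilities. An epsilon entry models a deletion.
confusion_network = [
    [("the", 0.7), ("a", 0.3)],
    [("flight", 0.6), ("fright", 0.4)],
    [("leaves", 0.9), ("*DELETE*", 0.1)],
]

def phrase_table_score(word):
    """Stand-in for a real translation-model score (hypothetical)."""
    return 0.0 if word != "*DELETE*" else -0.5

def decode(cn, asr_weight=1.0, tm_weight=1.0):
    """Pick the best word in each slot under a weighted log-linear score."""
    chosen = []
    for slot in cn:
        scored = [(asr_weight * log(p) + tm_weight * phrase_table_score(w), w)
                  for w, p in slot]
        chosen.append(max(scored)[1])
    return [w for w in chosen if w != "*DELETE*"]

print(decode(confusion_network))  # ['the', 'flight', 'leaves']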

Classification methods for speaker recognition

Published in:
Chapter in Springer Lecture Notes in Artificial Intelligence, 2007.

Summary

Automatic speaker recognition systems have a foundation built on ideas and techniques from the areas of speech science for speaker characterization, pattern recognition and engineering. In this chapter we provide an overview of the features, models, and classifiers derived from these areas that are the basis for modern automatic speaker recognition systems. We describe the components of state-of-the-art automatic speaker recognition systems, discuss application considerations and provide a brief survey of accuracy for different tasks.
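
For readers unfamiliar with the GMM-UBM classifier that underlies many of the systems surveyed, below is a minimal scoring sketch using scikit-learn and synthetic data. The simple re-fit from the UBM means stands in for the MAP adaptation used in practice, so treat this as an illustration rather than the chapter's reference system.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(5000, 20))           # pooled frames from many speakers
target     = rng.normal(0.3, 1.0, size=(500, 20))  # enrollment frames, claimed speaker
test       = rng.normal(0.3, 1.0, size=(300, 20))  # frames from the test utterance

# Universal background model (UBM).
ubm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
ubm.fit(background)

# Simplified "adaptation": re-fit starting from the UBM means
# (real systems MAP-adapt the UBM means instead).
spk = GaussianMixture(n_components=16, covariance_type="diag",
                      means_init=ubm.means_, random_state=0)
spk.fit(target)

# Verification score: average log-likelihood ratio over the test frames.
llr = spk.score(test) - ubm.score(test)
print(f"LLR score: {llr:.3f} (accept if above a tuned threshold)")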

Speaker verification using support vector machines and high-level features

Published in:
IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 7, September 2007, pp. 2085-2094.

Summary

High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although a significant amount of work has been done in finding novel high-level features, less work has been done on modeling these features. We describe a method of speaker modeling based upon support vector machines. Current high-level feature extraction produces sequences or lattices of tokens for a given conversation side. These sequences can be converted to counts and then to n-gram frequencies for a given conversation side. We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log-likelihood-ratio scoring system. Generalizations of this method are shown to produce excellent results on a variety of high-level features. We demonstrate that our methods produce results significantly better than standard log-likelihood-ratio modeling. We also demonstrate that our system can perform well in conjunction with standard cepstral speaker recognition systems.
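
A short sketch of the n-gram-frequency SVM modeling described above, using toy phone-token conversation sides; the bigram order, the pooled background estimate, and the square-root weighting (the log-likelihood-ratio linearization the summary mentions) are simplified assumptions.

from collections import Counter
import numpy as np
from sklearn.svm import LinearSVC

def ngram_frequencies(tokens, vocab, n=2):
    """Relative frequencies of token n-grams over a fixed n-gram vocabulary."""
    grams = list(zip(*[tokens[i:] for i in range(n)]))
    counts = Counter(grams)
    total = max(len(grams), 1)
    return np.array([counts[g] / total for g in vocab])

# Hypothetical phone-token "conversation sides" for one target and two impostors.
target_sides   = [list("aababba"), list("abababa")]
impostor_sides = [list("bbbaabb"), list("babbbab")]
sides  = target_sides + impostor_sides
labels = [1, 1, 0, 0]

vocab = sorted({g for s in sides for g in zip(s, s[1:])})
X = np.array([ngram_frequencies(s, vocab) for s in sides])

# Weight each frequency by 1/sqrt(background frequency): this turns a
# log-likelihood-ratio score into (approximately) a linear inner product.
background = X.mean(axis=0)
Xw = X / np.sqrt(np.maximum(background, 1e-6))

clf = LinearSVC(C=1.0).fit(Xw, labels)
print(clf.decision_function(Xw))  # per-side verification scores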

Construction of a phonotactic dialect corpus using semiautomatic annotation

Summary

In this paper, we discuss rapid, semiautomatic annotation techniques of detailed phonological phenomena for large corpora. We describe the use of these techniques for the development of a corpus of American English dialects. The resulting annotations and corpora will support both large-scale linguistic dialect analysis and automatic dialect identification. We delineate the semiautomatic annotation process that we are currently employing and a set of experiments we ran to validate this process. From these experiments, we learned that the use of ASR techniques could significantly increase the throughput and consistency of human annotators.

A comparison of speaker clustering and speech recognition techniques for air situational awareness

Published in:
INTERSPEECH 2007, 27-31 August 2007, pp. 2421-2424.

Summary

In this paper we compare speaker clustering and speech recognition techniques for understanding patterns of air traffic control communications. For a given radio transmission, our goal is to identify the talker and to whom he/she is speaking. This information, in combination with knowledge of the roles (e.g., takeoff, approach, hand-off, taxi) of different radio frequencies within an air traffic control region, could allow tracking of pilots through various stages of flight, thus providing the potential to monitor the airspace in great detail. Both techniques must contend with degraded audio channels and significant non-native accents. We report results from experiments using the nn-MATC database showing 9.3% and 32.6% clustering error for speaker clustering and ASR methods, respectively.
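
A minimal sketch of the speaker-clustering side of this comparison, applied to synthetic per-transmission features; the feature dimensionality, distance threshold, and linkage are illustrative assumptions rather than the paper's nn-MATC configuration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
# One feature vector per radio transmission (two talkers, ten transmissions each).
talker_a = rng.normal(0.0, 0.2, size=(10, 8))
talker_b = rng.normal(1.0, 0.2, size=(10, 8))
X = np.vstack([talker_a, talker_b])

# Cluster transmissions without fixing the number of talkers in advance.
clust = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                                linkage="average")
labels = clust.fit_predict(X)
print(labels)  # transmissions grouped by estimated talker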

A new kernel for SVM MLLR based speaker recognition

Published in:
INTERSPEECH, 27-31 August 2007.

Summary

Speaker recognition using support vector machines (SVMs) with features derived from generative models has been shown to perform well. Typically, a universal background model (UBM) is adapted to each utterance, yielding a set of features that are used in an SVM. We consider the case where the UBM is a Gaussian mixture model (GMM), and maximum likelihood linear regression (MLLR) adaptation is used to adapt the means of the UBM. We examine two possible SVM feature expansions that arise in this context: in the first, a GMM supervector is constructed by stacking the means of the adapted GMM; the second consists of the elements of the MLLR transform. We examine several kernels associated with these expansions. We show that both expansions are equivalent given an appropriate choice of kernels. Experiments performed on the NIST SRE 2006 corpus clearly highlight that our kernels, which are motivated by distance metrics between GMMs, outperform ad hoc ones. We also apply SVM nuisance attribute projection (NAP) to the kernels as a form of channel compensation and show that, with a proper choice of kernel, we achieve results comparable to existing SVM-based recognizers.
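
To make the two feature expansions concrete, the sketch below builds both from a toy diagonal-covariance UBM. The single global MLLR transform and the sqrt(weight)/standard-deviation scaling are simplifying assumptions intended to convey the metric-motivated kernel idea, not the paper's exact formulation.

import numpy as np

n_mix, dim = 4, 3
rng = np.random.default_rng(0)
ubm_means   = rng.normal(size=(n_mix, dim))
ubm_weights = np.full(n_mix, 1.0 / n_mix)
ubm_std     = np.ones((n_mix, dim))            # diagonal covariances

# MLLR adaptation of the UBM means: mu' = A @ mu + b (one global transform here).
A = np.eye(dim) + 0.05 * rng.normal(size=(dim, dim))
b = 0.1 * rng.normal(size=dim)
adapted_means = ubm_means @ A.T + b

# Expansion 1: GMM supervector of adapted means, scaled so that a linear kernel
# approximates a distance between GMMs.
supervector = (np.sqrt(ubm_weights)[:, None] * adapted_means / ubm_std).ravel()

# Expansion 2: the MLLR transform parameters themselves as the feature vector.
mllr_features = np.concatenate([A.ravel(), b])

print(supervector.shape, mllr_features.shape)  # (12,) (12,)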

Improving phonotactic language recognition with acoustic adaptation

Published in:
INTERSPEECH 2007, 27-31 August 2007, pp. 358-361.

Summary

In recent evaluations of automatic language recognition systems, phonotactic approaches have proven highly effective. However, as most of these systems rely on underlying ASR techniques to derive a phonetic tokenization, these techniques are potentially susceptible to acoustic variability from non-language sources (e.g., gender, speaker, channel). In this paper we apply techniques from ASR research to normalize and adapt HMM-based phonetic models to improve phonotactic language recognition performance. Experiments we conducted with these techniques show an EER reduction of 29% over traditional PRLM-based approaches.
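
For context, here is a minimal sketch of the PRLM-style scoring such phonotactic systems perform once a phone tokenization is available; the phone strings, bigram order, and smoothing are toy assumptions, and the paper's acoustic normalization and adaptation of the phonetic models is not shown.

from collections import Counter
from math import log

def bigram_model(training_seqs, alpha=0.5):
    """Add-alpha smoothed bigram log-probability model over phone tokens."""
    counts, context = Counter(), Counter()
    for seq in training_seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    vocab_size = len({p for seq in training_seqs for p in seq})
    def logprob(seq):
        return sum(log((counts[(a, b)] + alpha) /
                       (context[a] + alpha * vocab_size))
                   for a, b in zip(seq, seq[1:]))
    return logprob

# Hypothetical phone strings per language, as produced by a phone recognizer.
models = {
    "english": bigram_model(["abbaab", "ababba"]),
    "spanish": bigram_model(["bbabba", "babbab"]),
}
test_phones = "ababab"  # tokenization of the test utterance (assumed given)
scores = {lang: lm(test_phones) for lang, lm in models.items()}
print(max(scores, key=scores.get), scores)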

Variable projection and unfolding in compressed sensing

Published in:
Proc. 14th IEEE/SP Workshop on Statistical Signal Processing, 26-28 August 2007, pp. 358-362.

Summary

The performance of linear programming techniques that are applied in the signal identification and reconstruction process in compressed sensing (CS) is governed by both the number of measurements taken and the number of nonzero coefficients in the discrete basis used to represent the signal. To enhance the capabilities of CS, we have developed a technique called Variable Projection and Unfolding (VPU). VPU extends the identification and reconstruction capability of linear programming techniques to signals with a much greater number of nonzero coefficients in the basis in which the signals are compressible, while achieving significantly better reconstruction error.
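
As a concrete reference point, the sketch below shows the standard linear-programming formulation (basis pursuit) that such compressed-sensing reconstruction relies on; the problem sizes and random measurement matrix are arbitrary, and VPU itself is not reproduced here.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 64, 24, 3                        # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x_true                             # compressed measurements

# min ||x||_1 subject to Ax = y, written as an LP over x = u - v with u, v >= 0.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n), method="highs")
x_hat = res.x[:n] - res.x[n:]
print("reconstruction error:", np.linalg.norm(x_hat - x_true))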

Robust speaker recognition in noisy conditions

Published in:
IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 5, July 2007, pp. 1711-1723.

Summary

This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Multicondition training is conducted using simulated noisy data with limited noise variation, providing a coarse compensation for the noise, and missing-feature theory is applied to refine the compensation by ignoring noise variation outside the given training conditions, thereby reducing the training and testing mismatch. This paper is focused on several issues relating to the implementation of the new model for real-world applications. These include the generation of multicondition training data to model noisy speech, the combination of different training data to optimize the recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.
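
The two ideas combined in this work can be sketched briefly: multicondition training pools clean and noise-corrupted copies of the training data, and missing-feature scoring evaluates the model only on dimensions judged reliable. The single-Gaussian-per-dimension model, noise levels, and hand-set reliability mask below are illustrative assumptions standing in for the GMM-based system described.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(1000, 10))          # stand-in spectral features
# Multicondition training: pool copies corrupted at several noise levels.
multicondition = np.vstack([clean + rng.normal(0.0, s, clean.shape)
                            for s in (0.0, 0.2, 0.5)])

# One Gaussian per dimension keeps the marginalization explicit.
mu, sigma = multicondition.mean(axis=0), multicondition.std(axis=0)

def missing_feature_loglik(frame, reliable):
    """Marginalize unreliable dimensions by scoring only the reliable ones."""
    return norm.logpdf(frame[reliable], mu[reliable], sigma[reliable]).sum()

test_frame = rng.normal(0.0, 1.0, size=10)
reliable = np.ones(10, dtype=bool)
reliable[7:] = False          # pretend the last bands are noise-dominated
print(missing_feature_loglik(test_frame, reliable))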

PANEMOTO: network visualization of security situational awareness through passive analysis

Summary

To maintain effective security situational awareness, administrators require tools that present up-to-date information on the state of the network in the form of 'at-a-glance' displays, and that enable rapid assessment and investigation of relevant security concerns through drill-down analysis capability. In this paper, we present a passive network monitoring tool we have developed to address these important requirements, known as Panemoto (PAssive NEtwork MOnitoring TOol). We show how Panemoto enumerates, describes, and characterizes all network components, including devices and connected networks, and delivers an accurate representation of the function of devices and logical connectivity of networks. We provide examples of Panemoto's output in which the network information is presented in two distinct but related formats: as a clickable network diagram (through the use of NetViz, a commercially available graphical display environment) and as statically linked HTML pages, viewable in any standard web browser. Together, these presentation techniques enable a more complete understanding of the security situation of the network than each does individually.
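
An illustrative first step of this kind of passive enumeration, written here with scapy as an assumption (Panemoto's actual implementation is not described at this level in the summary): sniff frames without transmitting anything and record which IP addresses each MAC address is observed using. Running it requires packet-capture privileges.

from collections import defaultdict
from scapy.all import sniff, Ether, IP   # third-party: scapy

devices = defaultdict(set)   # MAC address -> set of observed IP addresses

def record(pkt):
    # Purely passive: we only read frames that arrive on the interface.
    if Ether in pkt and IP in pkt:
        devices[pkt[Ether].src].add(pkt[IP].src)

sniff(prn=record, store=False, count=100)

for mac, ips in sorted(devices.items()):
    print(mac, sorted(ips))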