Publications

Text-independent speaker recognition

Published in:
Springer Handbook of Speech Processing and Communication, 2007, pp. 763-81.

Summary

In this chapter, we focus on the area of text-independent speaker verification, with an emphasis on unconstrained telephone conversational speech. We begin by providing a general likelihood ratio detection task framework to describe the various components in modern text-independent speaker verification systems. We next describe the general hierarchy of speaker information conveyed in the speech signal and the issues involved in reliably exploiting these levels of information for practical speaker verification systems. We then describe specific implementations of state-of-the-art text-independent speaker verification systems utilizing low-level spectral information and high-level token sequence information with generative and discriminative modeling techniques. Finally, we provide a performance assessment of these systems using the National Institute of Standards and Technology (NIST) speaker recognition evaluation telephone corpora.

ILR-based MT comprehension test with multi-level questions

Published in:
Human Language Technology, North American Chapter of the Association for Computational Linguistics, HLT/NAACL, 22-27 April 2007.

Summary

We present results from a new Interagency Language Roundtable (ILR) based comprehension test. This new test design presents questions at multiple ILR difficulty levels within each document. We incorporated Arabic machine translation (MT) output from three independent research sites, arbitrarily merging these materials into one MT condition. We contrast the MT condition, for both text and audio data types, with high quality human reference Gold Standard (GS) translations. Overall, subjects achieved 95% comprehension for GS and 74% for MT, across all genres and difficulty levels. Interestingly, comprehension rates do not correlate highly with translation error rates, suggesting that we are measuring an additional dimension of MT quality.

Surveillance improvement algorithms for Airport Surface Detection Equipment Model X (ASDE-X) at Dallas-Fort Worth Airport

Published in:
MIT Lincoln Laboratory Report ATC-333

Summary

Operational testing of the Runway Status Lights (RWSL) system at the Dallas-Fort Worth (DFW) airport has detected a number of cases where faults in the ASDE-X/DFW surveillance data have led to erroneous operation of the status lights. Among the surveillance problems noted during testing at DFW were: (a) false tracks, (b) track positional jumps to false locations, (c) Mode S track splits, (d) ATCRBS track splits, (e) invalid Mode C altitudes, (f) invalid track velocities, and (g) spurious Mode 3/A 06078 code tracks. The RWSL surveillance improvement algorithms package in this document is placed between the ASDE-X/DFW surveillance data source and the RWSL safety logic. The surveillance improvement algorithms perform a variety of reasonableness and consistency checks on the input data and set validity flags and report status values for each input report, which are then passed on to the RWSL safety logic. These flags and status values allow the RWSL to ignore erroneous reports and to avoid using questionable report components in the subsequent RWSL logic. This document illustrates the performance of the RWSL surveillance improvement algorithms package with examples from DFW analysis. It is shown that the RWSL surveillance improvement algorithms package substantially reduces the impact of the known ASDE-X/DFW surveillance anomalies on the performance of the RWSL safety logic. The RWSL surveillance improvement algorithms package may also host future algorithms necessary to mitigate further problems that might be detected in the surveillance data.

A new approach to achieving high-performance power amplifier linearization

Published in:
IEEE Radar Conf., 17-20 April 2007. doi: 10.1109/RADAR.2007.374329

Summary

Digital baseband predistortion (DBP) is not particularly well suited to linearizing wideband power amplifiers (PAs); this is due to the exorbitant price paid in computational complexity. One of the underlying reasons for the computational complexity of DBP is the inherent inefficiency of using a sufficiently deep memory and a high enough polynomial order to span the multidimensional signal space needed to mitigate PA-induced nonlinear distortion. Therefore we have developed a new mathematical method to efficiently search for and localize those regions in the multidimensional signal space that enable us to invert PA nonlinearities with a significant reduction in computational complexity. Using a wideband code division multiple access (CDMA) signal we demonstrate and compare the PA linearization performance and computational complexity of our algorithm to that of conventional DBP techniques using measured results.

Language recognition with word lattices and support vector machines

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 15-20 April 2007, Vol. IV, pp. 989-992.

Summary

Language recognition is typically performed with methods that exploit phonotactics--a phone recognition language modeling (PRLM) system. A PRLM system converts speech to a lattice of phones and then scores a language model. A standard extension to this scheme is to use multiple parallel phone recognizers (PPRLM). In this paper, we modify this approach in two distinct ways. First, we replace the phone tokenizer by a powerful speech-to-text system. Second, we use a discriminative support vector machine for language modeling. Our goals are twofold. First, we explore the ability of a single speech-to-text system to distinguish multiple languages. Second, we fuse the new system with an SVM PRLM system to see if it complements current approaches. Experiments on the 2005 NIST language recognition corpus show the new word system accomplishes these goals and has significant potential for language recognition.

An evaluation of audio-visual person recognition on the XM2VTS corpus using the Lausanne protocols

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-237 - 240.

Summary

A multimodal person recognition architecture has been developed for the purpose of improving overall recognition performance and for addressing channel-specific performance shortfalls. This multimodal architecture includes the fusion of a face recognition system with the MIT/LL GMM/UBM speaker recognition architecture. This architecture exploits the complementary and redundant nature of the face and speech modalities. The resulting multimodal architecture has been evaluated on the XM2VTS corpus using the Lausanne open set verification protocols, and demonstrates excellent recognition performance. The multimodal architecture also exhibits strong recognition performance gains over the performance of the individual modalities.

Robust speaker recognition with cross-channel data: MIT-LL results on the 2006 NIST SRE auxiliary microphone task

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-49 - IV-52.

Summary

One particularly difficult challenge for cross-channel speaker verification is the auxiliary microphone task introduced in the 2005 and 2006 NIST Speaker Recognition Evaluations, where training uses telephone speech and verification uses speech from multiple auxiliary microphones. This paper presents two approaches to compensate for the effects of auxiliary microphones on the speech signal. The first compensation method mitigates session effects through Latent Factor Analysis (LFA) and Nuisance Attribute Projection (NAP). The second approach operates directly on the recorded signal with noise reduction techniques. Results are presented that show a reduction in the performance gap between telephone and auxiliary microphone data.

Multisensor dynamic waveform fusion

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-577 - IV-580.

Summary

Speech communication is significantly more difficult in severe acoustic background noise environments, especially when low-rate speech coders are used. Non-acoustic sensors, such as radar sensors, vibrometers, and bone-conduction microphones, offer significant potential in these situations. We extend previous work on fixed waveform fusion from multiple sensors to an optimal dynamic waveform fusion algorithm that minimizes both additive noise and signal distortion in the estimated speech signal. We show that a minimum mean squared error (MMSE) waveform matching criterion results in a generalized multichannel Wiener filter, and that this filter will simultaneously perform waveform fusion, noise suppression, and cross-channel noise cancellation. Formal intelligibility and quality testing demonstrate significant improvement from this approach.

The MIT-LL/IBM 2006 speaker recognition system: high-performance reduced-complexity recognition

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-217 - IV-220.

Summary

Many powerful methods for speaker recognition have been introduced in recent years--high-level features, novel classifiers, and channel compensation methods. A common arena for evaluating these methods has been the NIST speaker recognition evaluation (SRE). In the NIST SRE from 2002-2005, a popular approach was to fuse multiple systems based upon cepstral features and different linguistic tiers of high-level features. With enough enrollment data, this approach produced dramatic error rate reductions and showed conceptually that better performance was attainable. A drawback in this approach is that many high-level systems were being run independently requiring significant computational complexity and resources. In 2006, MIT Lincoln Laboratory focused on a new system architecture which emphasized reduced complexity. This system was a carefully selected mixture of high-level techniques, new classifier methods, and novel channel compensation techniques. This new system has excellent accuracy and has substantially reduced complexity. The performance and computational aspects of the system are detailed on a NIST 2006 SRE task.

Triage framework for resource conservation in a speaker identification system

Published in:
Proc. 32nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-69 - IV-72.

Summary

We present a novel framework for triaging (prioritizing and discarding) data to conserve resources for a speaker identification (SID) system. Our work is motivated by applications that require a SID system to process an overwhelming volume of audio data. We design a triage filter whose goal is to conserve recognizer resources while preserving relevant content. We propose triage methods that use signal quality assessment tools, a scaled-down version of the main recognizer itself, and a fusion of these measures. We define a new precision-based measure of effectiveness for our triage framework. Our experimental results with the 35-speaker tactical SID corpus bear out the validity of our approach.