Publications


Auditory signal processing as a basis for speaker recognition

Published in:
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 19-22 October, 2003, pp. 111-114.

Summary

In this paper, we exploit models of auditory signal processing at different levels along the auditory pathway for use in speaker recognition. A low-level nonlinear model, at the cochlea, provides accentuated signal dynamics, while a high-level model, at the inferior colliculus, provides frequency analysis of modulation components that reveals additional temporal structure. A variety of features are derived from the low-level dynamic and high-level modulation signals. Fusion of likelihood scores from feature sets at different auditory levels with scores from standard mel-cepstral features provides an encouraging speaker recognition performance gain over use of the mel-cepstrum alone with corpora from land-line and cellular telephone communications.

System adaptation as a trust response in tactical ad hoc networks

Published in:
IEEE MILCOM 2003, 13-16 October 2003, pp. 209-214.

Summary

While mobile ad hoc networks offer significant improvements for tactical communications, these networks are vulnerable to node capture and other forms of cyberattack. In this paper we evaluate via simulation the impact of a passive attacker, a denial-of-service (DoS) attack, and a data swallowing attack. We compare two different adaptive network responses to these attacks against a baseline of no response for 10- and 20-node networks. Each response reflects a level of trust assigned to the captured node. Our simulation uses a responsive variant of the ad hoc on-demand distance vector (AODV) routing algorithm and focuses on response performance; we assume that the attacks have already been detected and reported. We compare performance tradeoffs of attack, response, and network size by focusing on metrics such as "goodput", i.e., the percentage of messages that reach the intended destination untainted by the captured node. We show, for example, that under general conditions a DoS attack response should minimize attacker impact, while a response to a data swallowing attack should minimize risk to the system and the trust placed in the compromised node while retaining most of the response benefit. We show that the best network response depends on the mission goals, network configuration, density, network performance, attacker skill, and degree of compromise.

Acoustic, phonetic, and discriminative approaches to automatic language identification

Summary

Formal evaluations conducted by NIST in 1996 demonstrated that systems that used parallel banks of tokenizer-dependent language models produced the best language identification performance. Since that time, other approaches to language identification have been developed that match or surpass the performance of phone-based systems. This paper describes and evaluates three techniques that have been applied to the language identification problem: phone recognition, Gaussian mixture modeling, and support vector machine classification. A recognizer that fuses the scores of three systems that employ these techniques produces a 2.7% equal error rate (EER) on the 1996 NIST evaluation set and a 2.8% EER on the NIST 2003 primary condition evaluation set. An approach to dealing with the problem of out-of-set data is also discussed.
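For readers unfamiliar with the equal error rate (EER) metric cited above, a minimal sketch of how it can be computed from target and impostor scores follows. This is a generic illustration, not the paper's evaluation code; the function and variable names are our own.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Return the rate at the threshold where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = np.inf, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors accepted at this threshold
        frr = np.mean(target_scores < t)     # true speakers rejected at this threshold
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

Sweeping only the observed scores as thresholds is sufficient because the error rates change only at those points.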

Fusing high- and low-level features for speaker recognition

Summary

The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information beyond low-level acoustics that convey speaker information. Recently published works have demonstrated that such high-level information can be used successfully in automatic speaker recognition systems by improving accuracy and potentially increasing robustness. Wide ranging high-level-feature-based approaches using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%-a 71% relative reduction in error over the previous state of the art.
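Score-level fusion of the kind described above can be as simple as a weighted linear combination of per-trial scores from the individual classifiers. The sketch below is a generic illustration, not the SuperSID fusion system; in practice the weights would be trained, not hand-set.

```python
import numpy as np

def fuse_scores(score_sets, weights):
    """Weighted linear fusion of per-trial scores from several subsystems."""
    score_sets = np.asarray(score_sets, dtype=float)  # shape: (n_systems, n_trials)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize so weights sum to 1
    return weights @ score_sets                       # one fused score per trial
```

A single fused score per trial is then thresholded exactly like any single-classifier score.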

Person authentication by voice: a need for caution

Published in:
8th European Conf. on Speech Communication and Technology, EUROSPEECH, 1-4 September 2003.

Summary

Because of recent events and as members of the scientific community working in the field of speech processing, we feel compelled to publicize our views concerning the possibility of identifying or authenticating a person from his or her voice. The need for a clear and common message was indeed shown by the diversity of information that has been circulating on this matter in the media and general public over the past year. In a press release initiated by the AFCP and further elaborated in collaboration with the SpLC ISCA-SIG, the two groups herein discuss and present a summary of the current state of scientific knowledge and technological development in the field of speaker recognition, in accessible wording for nonspecialists. Our main conclusion is that, despite the existence of technological solutions to some constrained applications, at the present time, there is no scientific process that enables one to uniquely characterize a person's voice or to identify with absolute certainty an individual from his or her voice.

Integration of speaker recognition into conversational spoken dialogue systems

Summary

In this paper we examine the integration of speaker identification/verification technology into two dialogue systems developed at MIT: the Mercury air travel reservation system and the Orion task delegation system. These systems both utilize information collected from registered users that is useful in personalizing the system to specific users and that must be securely protected from imposters. Two speaker recognition systems, the MIT Lincoln Laboratory text independent GMM based system and the MIT Laboratory for Computer Science text-constrained speaker-adaptive ASR-based system, are evaluated and compared within the context of these conversational systems.

Model compression for GMM based speaker recognition systems

Published in:
EUROSPEECH 2003, 1-4 September 2003.

Summary

For large-scale deployments of speaker verification systems, model size can be an important issue, not only for minimizing storage requirements but also for reducing the transfer time of models over networks. Model size is also critical for deployments to small, portable devices. In this paper we present a new model compression technique for Gaussian Mixture Model (GMM) based speaker recognition systems. For GMM systems using adaptation from a background model, the compression technique exploits the fact that speaker models are adapted from a single speaker-independent model, so not all parameters need to be stored. We present results on the 2002 NIST speaker recognition evaluation cellular telephone corpus and show that the compression technique provides a good tradeoff of compression ratio to performance loss. We are able to achieve a 56:1 compression (624KB -> 11KB) with only a 3.2% relative increase in EER (9.1% -> 9.4%).
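The core idea, storing only the parameters that MAP adaptation actually moved away from the shared background model, can be sketched as follows. This is a minimal illustration on mean vectors only; the function names and threshold are our own, not the paper's algorithm.

```python
import numpy as np

def compress(ubm_means, spk_means, threshold=1e-3):
    """Store only the components whose adapted means differ from the background model."""
    diffs = spk_means - ubm_means
    moved = np.where(np.abs(diffs).max(axis=1) > threshold)[0]  # indices of changed components
    return moved, diffs[moved]

def decompress(ubm_means, moved, stored_diffs):
    """Rebuild the speaker model from the shared background model plus stored deltas."""
    spk_means = ubm_means.copy()
    spk_means[moved] += stored_diffs
    return spk_means
```

Since MAP adaptation leaves components with little speaker data essentially untouched, most rows never need to be stored, which is where the compression comes from.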

Measuring the readability of automatic speech-to-text transcripts

Summary

This paper reports initial results from a novel psycholinguistic study that measures the readability of several types of speech transcripts. We define a four-part figure of merit to measure readability: accuracy of answers to comprehension questions, reaction-time for passage reading, reaction-time for question answering and a subjective rating of passage difficulty. We present results from an experiment with 28 test subjects reading transcripts in four experimental conditions.

Combining cross-stream and time dimensions in phonetic speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 4, 6-10 April 2003, pp. IV-800 - IV-803.

Summary

Recent studies show that phonetic sequences from multiple languages can provide effective features for speaker recognition. So far, only pronunciation dynamics in the time dimension, i.e., n-gram modeling on each of the phone sequences, have been examined. In the JHU 2002 Summer Workshop, we explored modeling the statistical pronunciation dynamics across streams in multiple languages (cross-stream dimensions) as an additional component to the time dimension. We found that bigram modeling in the cross-stream dimension achieves improved performance over that in the time dimension on the NIST 2001 Speaker Recognition Evaluation Extended Data Task. Moreover, a linear combination of information from both dimensions at the score level further improves the performance, showing that the two dimensions contain complementary information.

Channel robust speaker verification via feature mapping

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. II, 6-10 April 2003, pp. II-53 - II-56.

Summary

In speaker recognition applications, channel variability is a major cause of errors. Techniques in the feature, model and score domains have been applied to mitigate channel effects. In this paper we present a new feature mapping technique that maps feature vectors into a channel-independent space. The feature mapping learns mapping parameters from a set of channel-dependent models derived from a channel-independent model via MAP adaptation. The technique is developed primarily for speaker verification, but can be applied for feature normalization in speech recognition applications. Results are presented on NIST landline and cellular telephone speech corpora where it is shown that feature mapping provides significant performance improvements over baseline systems and similar performance to Hnorm and Speaker-Model-Synthesis (SMS).
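As a rough illustration of the mapping idea (not the paper's exact algorithm), a feature vector can be assigned to its best-matching channel-dependent Gaussian and then shifted and scaled toward the corresponding component of the channel-independent model. All names and the nearest-component criterion here are simplifying assumptions of ours.

```python
import numpy as np

def map_feature(x, cd_means, cd_stds, ci_means, ci_stds):
    """Map x toward channel-independent space via its best-matching CD component."""
    # pick the channel-dependent component closest to x in normalized distance
    # (a stand-in for selecting the top-likelihood Gaussian)
    i = np.argmin(np.sum(((x - cd_means) / cd_stds) ** 2, axis=1))
    # re-express x relative to the matched CD component, then place it
    # at the corresponding channel-independent component
    return (x - cd_means[i]) / cd_stds[i] * ci_stds[i] + ci_means[i]
```

Because the same transformation is defined for every channel-dependent model, features from different channels land in one shared space before scoring.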