Publications

Forensic speaker recognition: a need for caution

Summary

There has long been a desire to be able to identify a person on the basis of his or her voice. For many years, judges, lawyers, detectives, and law enforcement agencies have wanted to use forensic voice authentication to investigate a suspect or to confirm a judgment of guilt or innocence. Challenges, realities, and cautions regarding the use of speaker recognition applied to forensic-quality samples are presented.

Cognitive services for the user

Published in:
Chapter 10, Cognitive Radio Technology, 2009, pp. 305-324.

Summary

Software-defined cognitive radios (CRs) use voice as a primary input/output (I/O) modality and are expected to have substantial computational resources capable of supporting advanced speech- and audio-processing applications. This chapter extends previous work on speech applications (e.g., [1]) to cognitive services that enhance military mission capability by capitalizing on automatic processes such as speech information extraction and understanding the environment. Such capabilities go beyond interaction with the intended user of the software-defined radio (SDR): they extend to speech and audio applications applied to information extracted from the voice and acoustic noise of other users and entities in the environment. For example, in a military environment, situational awareness and understanding could be enhanced by informing users based on processing voice and noise from both friendly and hostile forces operating in a given battle space.

This chapter provides a survey of a number of speech- and audio-processing technologies and their potential applications to CR, including:

- A description of the technology and its current state of practice.
- An explanation of how the technology is currently being applied, or could be applied, to CR.
- Descriptions and concepts of operations for how the technology can be applied to benefit users of CRs.
- A description of relevant future research directions for both the speech and audio technologies and their applications to CR.

A pictorial overview of many of the core technologies, and some of the applications presented in the following sections, is shown in Figure 10.1, along with overlapping components shared between technologies; for example, Gaussian mixture models (GMMs) and support vector machines (SVMs) are used in both speaker and language recognition [2]. The speech and concierge cognitive services covered in the following sections include speaker recognition, language identification (LID), text-to-speech (TTS) conversion, speech-to-text (STT) conversion, machine translation (MT), background noise suppression, speech coding, speaker characterization, noise management, noise characterization, and concierge services. These technologies and their potential applications to CR are discussed at levels of detail commensurate with their innovation and utility.

Proficiency testing for imaging and audio enhancement: guidelines for evaluation

Published in:
Int. Assoc. of Forensic Sciences, IAFS, 21-26 July 2008.

Summary

Proficiency tests in the forensic sciences are vital to the accreditation and quality assurance process. Most commercially available proficiency testing targets examiners in the traditional, identification-based forensic disciplines, such as latent prints, drug analysis, DNA, and questioned documents. There are other forensic disciplines, however, in which the output of the examination is not an identification of a person or substance. Two such disciplines are audio enhancement and video/image enhancement.

Bridging the gap between linguists and technology developers: large-scale, sociolinguistic annotation for dialect and speaker recognition

Published in:
Proc. 6th Int. Conf. on Language Resources and Evaluation, LREC, 28 May 2008.

Summary

Recent years have seen increased interest within the speaker recognition community in high-level features including, for example, lexical choice, idiomatic expressions, and syntactic structures. The promise of speaker recognition in forensic applications drives development toward systems robust to channel differences, achieved by selecting features that are inherently robust to those differences. Within the language recognition community, there is growing interest in differentiating not only languages but also mutually intelligible dialects of a single language. Decades of research in dialectology suggest that high-level features can enable systems to cluster speakers according to the dialects they speak. The Phanotics (Phonetic Annotation of Typicality in Conversational Speech) project seeks to identify high-level features characteristic of American dialects, annotate a corpus for these features, use the data to train dialect recognition systems, and also use the categorization to create better models for speaker recognition. The data, once published, should be useful to other developers of speaker and dialect recognition systems and to dialectologists and sociolinguists. We expect the methods will generalize well beyond the speakers, dialects, and languages discussed here and should, if successful, provide a model for how linguists and technology developers can collaborate in the future, for the benefit of both groups and toward a deeper understanding of how languages vary and change.

Speaker verification using support vector machines and high-level features

Published in:
IEEE Trans. on Audio, Speech, and Language Process., Vol. 15, No. 7, September 2007, pp. 2085-2094.

Summary

High-level characteristics such as word usage, pronunciation, phonotactics, prosody, etc., have seen a resurgence for automatic speaker recognition over the last several years. With the availability of many conversation sides per speaker in current corpora, high-level systems now have the amount of data needed to sufficiently characterize a speaker. Although a significant amount of work has been done in finding novel high-level features, less work has been done on modeling these features. We describe a method of speaker modeling based upon support vector machines. Current high-level feature extraction produces sequences or lattices of tokens for a given conversation side. These sequences can be converted to counts and then to n-gram frequencies for a given conversation side. We use support vector machine modeling of these n-gram frequencies for speaker verification. We derive a new kernel based upon linearizing a log-likelihood-ratio scoring system. Generalizations of this method are shown to produce excellent results on a variety of high-level features. We demonstrate that our methods produce results significantly better than standard log-likelihood-ratio modeling. We also demonstrate that our system can perform well in conjunction with standard cepstral speaker recognition systems.
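
As an illustration of the modeling approach described above, the following minimal sketch (not the authors' implementation) builds n-gram frequency vectors from hypothetical phone-token transcripts of conversation sides and trains a one-versus-background linear SVM with scikit-learn; the token data and parameter choices are invented for the example.

```python
# Minimal sketch of SVM modeling of n-gram frequencies for speaker
# verification; data and settings are illustrative, not from the paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical token sequences: each conversation side is a string of
# phone tokens produced by a recognizer.
target_sides = ["aa b k aa t", "aa k b t aa"]            # target speaker
background_sides = ["b b t k s", "s t k b b", "k s t"]   # background speakers

# Unigram and bigram counts per conversation side.
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+")
X = vectorizer.fit_transform(target_sides + background_sides).toarray().astype(float)
X /= X.sum(axis=1, keepdims=True)          # counts -> n-gram frequencies

y = np.array([1] * len(target_sides) + [0] * len(background_sides))

# One linear SVM per target speaker, trained against the background set.
model = LinearSVC(C=1.0).fit(X, y)

# Score a test side as the signed distance to the separating hyperplane.
test = vectorizer.transform(["aa b k t aa"]).toarray().astype(float)
test /= test.sum(axis=1, keepdims=True)
print(model.decision_function(test))
```

In the paper's approach the kernel itself encodes a likelihood-ratio-style weighting of the frequencies; the plain linear kernel above is only a stand-in for that idea.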

Construction of a phonotactic dialect corpus using semiautomatic annotation

Summary

In this paper, we discuss rapid, semiautomatic techniques for annotating detailed phonological phenomena in large corpora. We describe the use of these techniques for the development of a corpus of American English dialects. The resulting annotations and corpora will support both large-scale linguistic dialect analysis and automatic dialect identification. We delineate the semiautomatic annotation process that we are currently employing and a set of experiments we ran to validate this process. From these experiments, we learned that the use of ASR techniques could significantly increase the throughput and consistency of human annotators.

Understanding scores in forensic speaker recognition

Summary

Recent work in forensic speaker recognition has introduced many new scoring methodologies. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross entropy, has resulted in a systematic manner of evaluating and designing these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types: words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them "only as numbers."
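
To make the confidence-score evaluation concrete, here is a small sketch of a normalized cross entropy computation in the spirit of the NIST-style measure mentioned above; the exact formulation used in the paper may differ, and the scores below are invented.

```python
# Sketch of normalized cross entropy (NCE) for confidence scores.
# Formulation and numbers are illustrative, not taken from the paper.
import numpy as np

def normalized_cross_entropy(posteriors, labels):
    """posteriors: system posterior probability that each trial is a target.
    labels: 1 for target trials, 0 for non-target trials."""
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(labels, dtype=float)
    # Empirical cross entropy of the system's confidences (bits per trial).
    h_sys = -np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p))
    # Entropy of a trivial system that always reports the target prior.
    prior = y.mean()
    h_prior = -(prior * np.log2(prior) + (1 - prior) * np.log2(1 - prior))
    return (h_prior - h_sys) / h_prior

print(normalized_cross_entropy([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

A value near 1 indicates confidences that are both discriminative and well calibrated; a value at or below 0 indicates the scores carry no more information than the target prior.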

The Mixer and Transcript Reading corpora: resources for multilingual, crosschannel speaker recognition research

Summary

This paper describes the planning and creation of the Mixer and Transcript Reading corpora, their properties and yields, and reports on the lessons learned during their development.

Support vector machines for speaker and language recognition

Published in:
Comput. Speech Lang., Vol. 20, No. 2-3, April/July 2006, pp. 210-229.

Summary

Support vector machines (SVMs) have proven to be a powerful technique for pattern classification. SVMs map inputs into a high-dimensional space and then separate classes with a hyperplane. A critical aspect of using SVMs successfully is the design of the inner product, the kernel, induced by the high-dimensional mapping. We consider the application of SVMs to speaker and language recognition. A key part of our approach is the use of a kernel that compares sequences of feature vectors and produces a measure of similarity. Our sequence kernel is based upon generalized linear discriminants. We show that this strategy has several important properties. First, the kernel uses an explicit expansion into SVM feature space; this property makes it possible to collapse all support vectors into a single model vector and yields low computational complexity. Second, the SVM builds upon a simpler mean-squared-error classifier to produce a more accurate system. Finally, the system is competitive with, and complementary to, other approaches such as Gaussian mixture models (GMMs). We give results for the system on the 2003 NIST speaker and language evaluations and also show fusion with the traditional GMM approach.
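
The "collapse all support vectors into a single model vector" property mentioned above is straightforward to demonstrate for any kernel with an explicit linear expansion b(x); the sketch below uses random feature vectors and scikit-learn purely for illustration and is not the paper's sequence kernel.

```python
# Sketch of collapsing support vectors into one model vector for a kernel
# with an explicit linear expansion b(x): the SVM decision function
# sum_i alpha_i y_i <b(x_i), b(x)> + d  reduces to  <w, b(x)> + d.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
B = rng.normal(size=(40, 10))              # expanded features b(x_i) per utterance
y = np.repeat([1, -1], 20)                 # target vs. background labels

svm = SVC(kernel="linear", C=1.0).fit(B, y)

# Collapse all support vectors into a single model vector w.
w = (svm.dual_coef_ @ svm.support_vectors_).ravel()
d = svm.intercept_[0]

b_test = rng.normal(size=10)               # expansion of a test utterance
score_fast = w @ b_test + d                # one inner product at test time
score_svm = svm.decision_function(b_test[None, :])[0]
print(np.allclose(score_fast, score_svm))  # True: identical score, lower cost
```

At test time this reduces per-utterance scoring to a single inner product with w, regardless of how many support vectors the trained SVM retains.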

Exploiting nonacoustic sensors for speech encoding

Summary

The intelligibility of speech transmitted through low-rate coders is severely degraded when high levels of noise are present in the acoustic environment. Recent advances in nonacoustic sensors, including microwave radar, skin vibration, and bone conduction sensors, provide the exciting possibility of glottal excitation and, more generally, vocal tract measurements that are relatively immune to acoustic disturbances and can supplement the acoustic speech waveform. We are currently investigating methods of combining the outputs of these sensors for use in low-rate encoding, according to their capability in representing specific speech characteristics in different frequency bands. Nonacoustic sensors can reveal certain speech attributes lost in the noisy acoustic signal, for example, low-energy consonant voice bars, nasality, and glottalized excitation. By fusing nonacoustic low-frequency and pitch content with acoustic-microphone content, we have achieved significant intelligibility gains, measured with the diagnostic rhyme test (DRT), over the government-standard 2400-bps MELPe coder across a variety of environments. By also fusing quantized 4-to-8-kHz high-band speech, requiring only an additional 116 bps, we obtain further DRT performance gains by exploiting the ear's insensitivity to fine spectral detail in this frequency region.
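
As a rough illustration of band-wise sensor fusion (not the authors' MELPe-based coder), the sketch below keeps the low band of a noise-immune nonacoustic signal and the high band of a noisy acoustic-microphone signal using simple crossover filters; the signals, sample rate, and crossover frequency are all invented for the example.

```python
# Illustrative band-split fusion of a nonacoustic sensor and an acoustic
# microphone; signals and parameters are hypothetical.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 8000                      # sample rate (Hz), assumed
crossover_hz = 700             # hypothetical band split

t = np.arange(fs) / fs
acoustic = np.sin(2 * np.pi * 1500 * t) + 0.5 * np.random.randn(fs)  # noisy mic
nonacoustic = np.sin(2 * np.pi * 120 * t)   # e.g., glottal/bone-conduction sensor

low = butter(4, crossover_hz, btype="lowpass", fs=fs, output="sos")
high = butter(4, crossover_hz, btype="highpass", fs=fs, output="sos")

# Keep the robust low band from the nonacoustic sensor, the rest from the mic.
fused = sosfiltfilt(low, nonacoustic) + sosfiltfilt(high, acoustic)
print(fused.shape)
```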