Publications

Bayesian estimation of PLDA in the presence of noisy training labels, with applications to speaker verification

Published in:
IEEE/ACM Trans. Audio, Speech, Language Process., Vol. 30, 2022, pp. 414-28.

Summary

This paper presents a Bayesian framework for estimating a Probabilistic Linear Discriminant Analysis (PLDA) model in the presence of noisy labels. True class labels are interpreted as latent random variables, which are transmitted through a noisy channel and received as observed speaker labels. The labeling process is modeled as a Discrete Memoryless Channel (DMC). PLDA hyperparameters are interpreted as random variables, and their joint posterior distribution is derived using mean-field Variational Bayes, allowing maximum a posteriori (MAP) estimates of the PLDA model parameters to be determined. The proposed solution, referred to as VB-MAP, is presented as a general framework but is studied in the context of speaker verification, and a variety of use cases are discussed. Specifically, VB-MAP can be used for PLDA estimation with unreliable labels, for unsupervised PLDA estimation, and to infer the reliability of a PLDA training set. Experimental results show that the proposed approach provides significant performance improvements on a variety of NIST Speaker Recognition Evaluation (SRE) tasks, both for data sets with simulated mislabels and for data sets with naturally occurring missing or unreliable labels.
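To make the channel view of labeling concrete, the Python sketch below assumes the simplest case, a symmetric DMC over K speaker labels: the true label is passed through unchanged with probability 1 - eps and is otherwise replaced by one of the other K - 1 labels chosen uniformly. Bayes' rule then gives the posterior over the true label for a given observed label. The symmetric channel, the value of eps, and the function names are assumptions for illustration only; this is not the paper's VB-MAP inference, which treats the labels jointly with the PLDA hyperparameters.

```python
def symmetric_dmc(num_labels: int, eps: float) -> list[list[float]]:
    """Hypothetical symmetric label-noise channel.

    Returns P with P[true][observed] = probability of observing `observed`
    when the true label is `true`. Each row sums to 1.
    """
    off_diagonal = eps / (num_labels - 1)
    return [[1.0 - eps if i == j else off_diagonal for j in range(num_labels)]
            for i in range(num_labels)]


def true_label_posterior(prior: list[float], channel: list[list[float]],
                         observed: int) -> list[float]:
    """p(true = k | observed) is proportional to prior[k] * P[k][observed]."""
    joint = [p * row[observed] for p, row in zip(prior, channel)]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    channel = symmetric_dmc(num_labels=4, eps=0.2)
    prior = [0.25] * 4
    # Observing label 2: the true label is most likely 2, but not certainly.
    print(true_label_posterior(prior, channel, observed=2))
```

With a uniform prior, the posterior simply reproduces the `observed` column of the channel matrix.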

Unsupervised Bayesian adaptation of PLDA for speaker verification

Published in:
Interspeech, 30 August - 3 September 2021.

Summary

This paper presents a Bayesian framework for unsupervised domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA). By interpreting class labels as latent random variables, Variational Bayes (VB) is used to derive a maximum a posteriori (MAP) solution of the adapted PLDA model when labels are missing, referred to as VB-MAP. The VB solution iteratively infers class labels and updates PLDA hyperparameters, offering a systematic framework for dealing with unlabeled data. While presented as a general solution, this paper includes experimental results for domain adaptation in speaker verification. VB-MAP estimation is applied to the 2016 and 2018 NIST Speaker Recognition Evaluations (SREs), both of which included small, unlabeled in-domain data sets, and is shown to provide performance improvements over a variety of state-of-the-art domain adaptation methods. Additionally, VB-MAP estimation is used to train a fully unsupervised PLDA model, which suffers only minor performance degradation relative to conventional supervised training, offering promise for training PLDA models when no relevant labeled data exist.

Speaker separation in realistic noise environments with applications to a cognitively-controlled hearing aid

Summary

Future wearable technology may provide for enhanced communication in noisy environments and for the ability to pick out a single talker of interest in a crowded room simply by the listener shifting their attentional focus. Such a system relies on two components: speaker separation and decoding the listener's attention to acoustic streams in the environment. To address the former, we present a system for joint speaker separation and noise suppression, referred to as the Binaural Enhancement via Attention Masking Network (BEAMNET). The BEAMNET system is an end-to-end neural network architecture based on self-attention. Binaural input waveforms are mapped to a joint embedding space via a learned encoder, and separate multiplicative masking mechanisms are included for noise suppression and speaker separation. Pairs of output binaural waveforms are then synthesized using learned decoders, each capturing a separated speaker while maintaining spatial cues. A key contribution of BEAMNET is that the architecture contains a separation path, an enhancement path, and an autoencoder path. This paper proposes a novel loss function which simultaneously trains these paths, so that disabling the masking mechanisms during inference causes BEAMNET to reconstruct the input speech signals. This allows dynamic control of the level of suppression applied by BEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speaker separation. This paper also proposes a perceptually-motivated waveform distance measure. Using objective speech quality metrics, the proposed system is demonstrated to perform well at separating two equal-energy talkers, even in high levels of background noise. Subjective testing shows an improvement in speech intelligibility across a range of noise levels, for signals with artificially added head-related transfer functions and background noise.
Finally, when used as part of an auditory attention decoder (AAD) system using existing electroencephalogram (EEG) data, BEAMNET is found to maintain the decoding accuracy achieved with ideal speaker separation, even in severe acoustic conditions. These results suggest that this enhancement system is highly effective at decoding auditory attention in realistic noise environments, and could possibly lead to improved speech perception in a cognitively controlled hearing aid.
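The abstract does not say exactly how the minimum gain level enters the masking step. One simple reading, assumed here, is that each multiplicative mask value is floored at a user-set minimum gain before it is applied to the encoded signal, so that a floor of 1 disables masking and leaves the encoding unchanged for the decoder (the autoencoder path). The function name and the flooring rule are hypothetical, and the sketch operates on a single frame of plain Python floats rather than on BEAMNET's learned embeddings.

```python
def apply_mask_with_floor(embedding: list[float], mask: list[float],
                          min_gain: float) -> list[float]:
    """Multiply an encoded frame by a mask whose values are floored at min_gain.

    Hypothetical illustration: min_gain = 0.0 applies the mask as-is (maximum
    suppression); min_gain = 1.0 turns every gain into 1 for masks in [0, 1],
    i.e. masking is effectively disabled and the encoding passes through.
    """
    if not 0.0 <= min_gain <= 1.0:
        raise ValueError("min_gain must lie in [0, 1]")
    return [e * max(m, min_gain) for e, m in zip(embedding, mask)]


frame = [0.5, -1.2, 2.0]
mask = [0.9, 0.05, 0.0]  # e.g. keep dimension 0, suppress dimensions 1 and 2
full_suppression = apply_mask_with_floor(frame, mask, 0.0)
partial_suppression = apply_mask_with_floor(frame, mask, 0.1)
pass_through = apply_mask_with_floor(frame, mask, 1.0)
```

Raising `min_gain` between 0 and 1 trades suppression strength for fidelity to the input, which is the kind of run-time control the abstract describes.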

Implicitly-defined neural networks for sequence labeling

Published in:
Annual Meeting of the Assoc. for Computational Linguistics, 31 July 2017.

Summary

In this work, we propose a novel, implicitly defined neural network architecture and describe a method to compute its components. The proposed architecture forgoes the causality assumption previously used to formulate recurrent neural networks and allows the hidden states of the network to be coupled together, potentially improving performance on problems with complex, long-distance dependencies. Initial experiments demonstrate that the new architecture outperforms both the Stanford Parser and a baseline bidirectional network on the Penn Treebank Part-of-Speech tagging task, and outperforms a baseline bidirectional network on an additional artificial random biased walk task.
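As an illustration of non-causal coupling, the sketch below uses scalar hidden states that each depend on both the previous and the next state, h_t = tanh(w x_t + u h_{t-1} + v h_{t+1}), and solves for all of them at once by fixed-point (Jacobi) iteration. The scalar weights w, u, v and the fixed-point solver are assumptions chosen for brevity; the paper describes its own method for computing the network's components, which this sketch does not reproduce.

```python
import math


def implicit_hidden_states(xs: list[float], w: float, u: float, v: float,
                           tol: float = 1e-10, max_iter: int = 1000) -> list[float]:
    """Solve h_t = tanh(w*x_t + u*h_{t-1} + v*h_{t+1}) jointly for all t.

    Boundary states outside the sequence are taken as 0. Each sweep recomputes
    every h_t from the previous sweep's neighbours (Jacobi iteration). Because
    tanh is 1-Lipschitz, |u| + |v| < 1 makes the update a contraction, so the
    iteration converges to the unique solution.
    """
    n = len(xs)
    h = [0.0] * n
    for _ in range(max_iter):
        new_h = []
        for t in range(n):
            left = h[t - 1] if t > 0 else 0.0       # coupling to the past
            right = h[t + 1] if t < n - 1 else 0.0  # coupling to the future
            new_h.append(math.tanh(w * xs[t] + u * left + v * right))
        delta = max((abs(a - b) for a, b in zip(new_h, h)), default=0.0)
        h = new_h
        if delta < tol:
            return h
    raise RuntimeError("fixed-point iteration did not converge")
```

Setting v = 0 removes the dependence on future states, and the iteration then reproduces an ordinary causal recurrence; a nonzero v is what couples each state to its successor.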

Adaptive noise cancellation in a fighter cockpit environment

Published in:
ICASSP'84, IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 19-21 March 1984.

Summary

In this paper we discuss some preliminary results on using Widrow's Adaptive Noise Cancelling (ANC) algorithm to reduce the background noise present in a fighter pilot's speech. With a dominant noise source present and with the pilot wearing an oxygen facemask, we demonstrate that good (>10 dB) cancellation of the additive noise and little speech distortion can be achieved by having the reference microphone attached to the outside of the facemask and by updating the filter coefficients only during silence intervals.
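The core of Widrow's ANC is a least-mean-squares (LMS) adaptive filter that predicts the noise in the primary signal from a reference microphone and subtracts it. The sketch below adds the gating described above: coefficients are updated only where a per-sample flag marks a silence interval. The `speech_active` flags stand in for a voice activity detector, which is assumed rather than taken from the paper, and the tap count and step size `mu` are illustrative values.

```python
def lms_anc(primary: list[float], reference: list[float],
            speech_active: list[bool], num_taps: int = 8,
            mu: float = 0.01) -> list[float]:
    """Widrow-style LMS adaptive noise canceller with silence-gated updates.

    primary:       microphone signal = speech + noise
    reference:     noise pickup (e.g. a microphone outside the facemask)
    speech_active: per-sample flag; coefficients adapt only where it is False
    Returns the error signal e[n] = primary[n] - y[n], i.e. the speech estimate.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps  # most recent reference samples, newest first
    output = []
    for d, x, active in zip(primary, reference, speech_active):
        history = [x] + history[:-1]
        y = sum(w * r for w, r in zip(weights, history))  # noise estimate
        e = d - y
        output.append(e)
        if not active:  # adapt only during silence intervals
            # Widrow-Hoff LMS update: w <- w + 2 * mu * e * x
            weights = [w + 2 * mu * e * r for w, r in zip(weights, history)]
    return output
```

Freezing adaptation during speech prevents speech in the primary signal from disturbing the coefficient updates, a plausible reason this gating helps keep speech distortion low.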

The effects of microphones and facemasks on LPC vocoder performance

Published in:
Proc. of IEEE Int. Conf. on Acoustics, Speech & Signal Processing, 30 March - 1 April 1981.

Summary

The effects of oxygen facemasks and noise cancelling microphones on LPC vocoder performance were analyzed and evaluated. Likely sources of potential vocoder performance degradation included the non-ideal frequency response characteristics of the microphone and the possible presence of additional resonances in the speech waveform due to the addition of the facemask cavity. Examination of vowel spectra revealed that spurious resonances do not occur in the vocoder frequency band for speech generated using the facemask and microphone. Also observed was a vowel-dependent reduction in the bandwidths of the upper formants, a result which can be predicted from acoustic theory. Finally, it is shown that the low frequency emphasis associated with small enclosures is not relevant when using a pressure gradient (noise cancelling) microphone. Diagnostic Rhyme Tests involving three subjects indicated that the presence of the oxygen facemask and noise cancelling microphone did not result in a significant increase in the LPC vocoder processing loss.

A split band adaptive predictive coding (SBAPC) speech system

Published in:
IEEE Int. Conf. on Acoustics, Speech, & Signal Processing, 9-11 April 1980.

Summary

As developed by Atal and Schroeder [1], conventional Adaptive Predictive Coding (APC) of speech employs both vocal tract and pitch prediction to achieve a low energy, spectrally flattened residual. Errors in the pitch predictor can result in clipping errors which can propagate in the system for relatively long periods of time and degrade the quality of the synthesized speech. Makhoul and Berouti [2] have developed a high quality 16 kbps APC system which eliminates the pitch predictor by using a multi-level variable rate quantizer. In order to achieve comparable quality at even lower data rates, a split band APC (SBAPC) structure is proposed which employs the multi-level quantizer on the low frequency portion of the residual and a 1-bit quantizer on the high frequency portion of the residual.
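The abstract does not give the band-split filters, quantizer levels, or bit allocation, so the sketch below only illustrates the split-band idea on an already computed residual. It assumes a crude two-tap complementary band split (the low and high bands sum back to the input), an 8-level uniform quantizer for the low band, and a sign quantizer scaled by the block RMS for the high band. None of these choices is taken from the paper, and the sketch quantizes open-loop, without the predictor feedback loop or any decimation a real APC coder would use.

```python
import math


def split_bands(residual: list[float]) -> tuple[list[float], list[float]]:
    """Crude complementary split: low[n] + high[n] == residual[n]."""
    prev = [0.0] + residual[:-1]
    low = [0.5 * (r + p) for r, p in zip(residual, prev)]
    high = [0.5 * (r - p) for r, p in zip(residual, prev)]
    return low, high


def uniform_quantize(x: list[float], levels: int, peak: float) -> list[float]:
    """Mid-rise uniform quantizer with an even number of levels over [-peak, peak]."""
    if levels < 2 or levels % 2:
        raise ValueError("levels must be an even number >= 2")
    step = 2 * peak / levels
    lo_idx, hi_idx = -(levels // 2), levels // 2 - 1
    out = []
    for v in x:
        idx = max(lo_idx, min(hi_idx, math.floor(v / step)))  # clip to range
        out.append((idx + 0.5) * step)
    return out


def one_bit_quantize(x: list[float]) -> list[float]:
    """Sign quantizer scaled by the block RMS (the scale would be side info)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0
    return [rms if v >= 0 else -rms for v in x]


def quantize_split_band(residual: list[float], levels: int = 8) -> list[float]:
    """Multi-level quantizer on the low band, 1-bit quantizer on the high band."""
    low, high = split_bands(residual)
    peak = max((abs(v) for v in low), default=0.0) or 1.0
    low_q = uniform_quantize(low, levels, peak)
    high_q = one_bit_quantize(high)
    return [a + b for a, b in zip(low_q, high_q)]
```

Spending several bits per sample on the low band and one bit on the high band is what lets the total rate drop below that of a full-band multi-level quantizer.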