This paper proposes the Speech Enhancement via Attention Masking Network (SEAMNET), a neural network-based end-to-end single-channel speech enhancement system designed for joint suppression of noise and reverberation. It formalizes an end-to-end network architecture, referred to as b-Net, which accomplishes noise suppression through attention masking in a learned embedding space. A key contribution of SEAMNET is that the b-Net architecture contains both an enhancement and an autoencoder path. This paper proposes a novel loss function which simultaneously trains both the enhancement and the autoencoder paths, so that disabling the masking mechanism during inference causes SEAMNET to reconstruct the input speech signal. This allows dynamic control of the level of suppression applied by SEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speech enhancement. This paper also proposes a perceptually-motivated waveform distance measure. In addition to the b-Net architecture, this paper proposes a novel method for designing target waveforms for network training, so that joint suppression of additive noise and reverberation can be performed by an end-to-end enhancement system, which has not been previously possible. Experimental results show that SEAMNET outperforms a variety of state-of-the-art baseline systems, in terms of both objective speech quality measures and subjective listening tests. Finally, this paper draws parallels between SEAMNET and conventional statistical model-based enhancement approaches, offering interpretability of many network components.
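
The minimum-gain control mentioned in this abstract can be illustrated with a toy sketch. It assumes, hypothetically, that the mask takes values in [0, 1], is applied element-wise to an embedding, and is floored at a chosen minimum gain. The function name `apply_floored_mask`, the array shapes, and the flooring rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def apply_floored_mask(embedding, mask, min_gain):
    """Scale an embedding by a mask that is never allowed below min_gain.

    Hypothetical illustration: min_gain = 1.0 disables suppression, so the
    embedding passes through unchanged, and min_gain = 0.0 applies the
    mask in full.
    """
    if not 0.0 <= min_gain <= 1.0:
        raise ValueError("min_gain must lie in [0, 1]")
    # Clip the mask to [0, 1], then raise any value below the floor.
    floored = np.maximum(np.clip(mask, 0.0, 1.0), min_gain)
    return embedding * floored

# Made-up embedding and mask values, for illustration only.
embedding = np.array([[1.0, -2.0], [0.5, 4.0]])
mask = np.array([[0.0, 0.5], [1.0, 0.1]])

full_suppression = apply_floored_mask(embedding, mask, 0.0)
limited = apply_floored_mask(embedding, mask, 0.25)
bypass = apply_floored_mask(embedding, mask, 1.0)
```

In this sketch, raising `min_gain` toward 1.0 moves the output smoothly from the fully masked embedding toward the unmodified one, which is the kind of run-time trade-off the abstract describes.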


The Speech Enhancement via Attention Masking Network (SEAMNET): an end-to-end system for joint suppression of noise and reverberation [early access]

Speech enhancement using sparse convolutive non-negative matrix factorization with basis adaptation

September 9, 2012

Conference Paper

Author:

Michael A. Carlin

…

Published in:

INTERSPEECH 2012: 13th Annual Conf. of the Int. Speech Communication Assoc., 9-13 September 2012.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

We introduce a framework for speech enhancement based on convolutive non-negative matrix factorization that leverages available speech data to enhance arbitrary noisy utterances with no a priori knowledge of the speakers or noise types present. Previous approaches have shown the utility of a sparse reconstruction of the speech-only components of an observed noisy utterance. We demonstrate that an underlying speech representation which, in addition to applying sparsity, also adapts to the noisy acoustics improves overall enhancement quality. The proposed system performs comparably to a traditional Wiener filtering approach, and the results suggest that the proposed framework is most useful in moderate- to low-SNR scenarios.


Linear prediction modulation filtering for speaker recognition of reverberant speech

June 25, 2012

Conference Paper

Author:

Bengt J. Borgstrom

…

Alan V. McCree

Published in:

Odyssey 2012, The Speaker and Language Recognition Workshop, 25-28 June 2012.

Topic:

speaker recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper proposes a framework for spectral enhancement of reverberant speech based on inversion of the modulation transfer function. All-pole models of the modulation spectra of clean and degraded speech are used to derive the linear prediction inverse modulation transfer function (LP-IMTF) solution as a low-order IIR filter in the modulation envelope domain. By considering spectral estimation under speech presence uncertainty, speech presence probabilities are derived for the case of reverberation. Aside from enhancement, the LP-IMTF framework allows for blind estimation of reverberation time by extracting a minimum phase approximation of the short-time spectral channel impulse response. The proposed speech enhancement method is used as a front-end processing step for speaker recognition. When applied to the microphone condition of the NIST SRE 2010 with artificially added reverberation, the proposed spectral enhancement method yields significant improvements across a variety of performance metrics.


Multi-pitch estimation by a joint 2-D representation of pitch and pitch dynamics

September 26, 2010

Conference Paper

Author:

Tianyu Tom Wang

…

Thomas F. Quatieri

Published in:

INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 26-30 September 2010, pp. 645-648.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Summary

Multi-pitch estimation of co-channel speech is especially challenging when the underlying pitch tracks are close in pitch value (e.g., when pitch tracks cross). Building on our previous work, we demonstrate the utility of a two-dimensional (2-D) analysis method of speech for this problem by exploiting its joint representation of pitch and pitch-derivative information from distinct speakers. Specifically, we propose a novel multi-pitch estimation method consisting of 1) a data-driven classifier for pitch candidate selection, 2) local pitch and pitch-derivative estimation by k-means clustering, and 3) a Kalman filtering mechanism for pitch tracking and assignment. We evaluate our method on a database of all-voiced speech mixtures and illustrate its capability to estimate pitch tracks in cases where pitch tracks are separate and when they are close in pitch value (e.g., at crossings).
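
The idea of tracking pitch jointly with its derivative can be sketched with a textbook constant-velocity Kalman filter. This is a single-track illustration only, not the paper's multi-speaker tracking and assignment mechanism. The function name `kalman_track`, the noise variances, and the synthetic pitch ramp are assumptions made for the example.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, process_var=1e-2, meas_var=1.0):
    """Track the state [pitch, pitch derivative] from pitch measurements.

    Generic constant-velocity Kalman filter; all parameter values are
    illustrative assumptions.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])             # only pitch is observed
    Q = process_var * np.eye(2)            # process noise covariance
    R = np.array([[meas_var]])             # measurement noise covariance
    x = np.array([measurements[0], 0.0])   # initial state estimate
    P = 10.0 * np.eye(2)                   # initial state covariance
    states = []
    for z in measurements:
        # Predict the next state and its covariance.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the new measurement.
        innovation = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(2) - K @ H) @ P
        states.append(x.copy())
    return np.array(states)

# Synthetic pitch track rising by 2 Hz per frame from 100 Hz.
frames = np.arange(50)
pitch = 100.0 + 2.0 * frames
states = kalman_track(pitch)
```

Because the state carries the pitch derivative, two tracks that cross in pitch value can still differ in state, which is the intuition behind using pitch dynamics to disambiguate crossings.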


Towards co-channel speaker separation by 2-D demodulation of spectrograms

October 18, 2009

Conference Paper

Author:

Tianyu Tom Wang

…

Thomas F. Quatieri

Published in:

WASPAA 2009, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 18-21 October 2009, pp. 65-68.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper explores a two-dimensional (2-D) processing approach for co-channel speaker separation of voiced speech. We analyze localized time-frequency regions of a narrowband spectrogram using 2-D Fourier transforms and propose a 2-D amplitude modulation model based on pitch information for single and multi-speaker content in each region. Our model maps harmonically-related speech content to concentrated entities in a transformed 2-D space, thereby motivating 2-D demodulation of the spectrogram for analysis/synthesis and speaker separation. Using a priori pitch estimates of individual speakers, we show through a quantitative evaluation 1) the utility of the model for representing speech content of a single speaker and 2) its feasibility for speaker separation. For the separation task, we also illustrate benefits of the model's representation of pitch dynamics relative to a sinusoidal-based separation system.


Time-varying autoregressive tests for multiscale speech analysis

September 6, 2009

Conference Paper

Author:

Daniel Rudoy

…

Published in:

INTERSPEECH 2009, 10th Annual Conf. of the International Speech Communication Association, pp. 2839-2842.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper we develop hypothesis tests for speech waveform nonstationarity based on time-varying autoregressive models, and demonstrate their efficacy in speech analysis tasks at both segmental and sub-segmental scales. Key to the successful synthesis of these ideas is our employment of a generalized likelihood ratio testing framework tailored to autoregressive coefficient evolutions suitable for speech. After evaluating our framework on speech-like synthetic signals, we present preliminary results for two distinct analysis tasks using speech waveform data. At the segmental level, we develop an adaptive short-time segmentation scheme and evaluate it on whispered speech recordings, while at the sub-segmental level, we address the problem of detecting the glottal flow closed phase. Results show that our hypothesis testing framework can reliably detect changes in the vocal tract parameters across multiple scales, thereby underscoring its broad applicability to speech analysis.


Cognitive services for the user

January 1, 2009

Book Chapter

Author:

Joseph P. Campbell Jr

…

Published in:

Chapter 10, Cognitive Radio Technology, 2009, pp. 305-324.

Topic:

biometrics

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Software-defined cognitive radios (CRs) use voice as a primary input/output (I/O) modality and are expected to have substantial computational resources capable of supporting advanced speech- and audio-processing applications. This chapter extends previous work on speech applications (e.g., [1]) to cognitive services that enhance military mission capability by capitalizing on automatic processes, such as speech information extraction and understanding the environment. Such capabilities go beyond interaction with the intended user of the software-defined radio (SDR) - they extend to speech and audio applications that can be applied to information that has been extracted from voice and acoustic noise gathered from other users and entities in the environment. For example, in a military environment, situational awareness and understanding could be enhanced by informing users based on processing voice and noise from both friendly and hostile forces operating in a given battle space.

This chapter provides a survey of a number of speech- and audio-processing technologies and their potential applications to CR, including:

- A description of the technology and its current state of practice.
- An explanation of how the technology is currently being applied, or could be applied, to CR.
- Descriptions and concepts of operations for how the technology can be applied to benefit users of CRs.
- A description of relevant future research directions for both the speech and audio technologies and their applications to CR.

A pictorial overview of many of the core technologies with some applications presented in the following sections is shown in Figure 10.1. Also shown are some overlapping components between the technologies. For example, Gaussian mixture models (GMMs) and support vector machines (SVMs) are used in both speaker and language recognition technologies [2]. These technologies and components are described in further detail in the following sections.

Speech and concierge cognitive services and their corresponding applications are covered in the following sections. The services covered include speaker recognition, language identification (LID), text-to-speech (TTS) conversion, speech-to-text (STT) conversion, machine translation (MT), background noise suppression, speech coding, speaker characterization, noise management, noise characterization, and concierge services. These technologies and their potential applications to CR are discussed at varying levels of detail commensurate with their innovation and utility.


Adaptive short-time analysis-synthesis for speech enhancement

March 31, 2008

Conference Paper

Author:

Prabahan Basu

…

Published in:

2008 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, 31 March - 4 April 2008.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper we propose a multiresolution short-time analysis method for speech enhancement. It is well known that fixed-resolution methods such as the traditional short-time Fourier transform do not generally match the time-frequency structure of the signal being analyzed, resulting in poor estimates of the speech and noise spectra required for enhancement. This can lead to the reduction of quality in the enhanced signal through the introduction of artifacts such as musical noise. To counter these limitations, we propose an adaptive short-time analysis-synthesis scheme for speech enhancement in which the adaptation is based on a measure of local time-frequency concentration. Synthesis is made possible through a modified overlap-add procedure. Empirical results using voiced speech indicate a clear improvement over a fixed time-frequency resolution enhancement scheme both in terms of mean-square error and as indicated by informal listening tests.
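
For context, the fixed-resolution baseline that this abstract contrasts against can be sketched as a standard overlap-add round trip. This is the conventional procedure with a single frame length, not the paper's modified, adaptive version. The function name `overlap_add_roundtrip`, the frame length, and the choice of a periodic Hann window are assumptions for the example.

```python
import numpy as np

def overlap_add_roundtrip(signal, frame_len=8):
    """Analyze with a periodic Hann window at 50% overlap, then overlap-add.

    With this window and hop, overlapping windows sum to one, so samples
    covered by two frames are reconstructed exactly when no processing
    is applied to the spectrum.
    """
    hop = frame_len // 2
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)  # periodic Hann
    output = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        # An enhancement system would modify `spectrum` at this point.
        output[start:start + frame_len] += np.fft.irfft(spectrum, n=frame_len)
    return output

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
y = overlap_add_roundtrip(x)
```

Here every frame uses the same length, which is the fixed time-frequency resolution the abstract identifies as a limitation; an adaptive scheme would vary the frame length from segment to segment.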


Multisensor dynamic waveform fusion

April 1, 2007

Conference Paper

Author:

Alan V. McCree

…

Published in:

Proc. 32nd Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, April 2007, pp. IV-577 - IV-580.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Speech communication is significantly more difficult in severe acoustic background noise environments, especially when low-rate speech coders are used. Non-acoustic sensors, such as radar sensors, vibrometers, and bone-conduction microphones, offer significant potential in these situations. We extend previous work on fixed waveform fusion from multiple sensors to an optimal dynamic waveform fusion algorithm that minimizes both additive noise and signal distortion in the estimated speech signal. We show that a minimum mean squared error (MMSE) waveform matching criterion results in a generalized multichannel Wiener filter, and that this filter will simultaneously perform waveform fusion, noise suppression, and cross-channel noise cancellation. Formal intelligibility and quality testing demonstrate significant improvement from this approach.
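
The link between an MMSE criterion and a Wiener-type solution can be illustrated in a heavily simplified setting. This sketch uses real-valued, instantaneous (single-tap) weights that are fixed over time, fitted from sample statistics, whereas the paper describes a dynamic, generalized multichannel filter. The function names, the two synthetic sensors, and their noise levels are assumptions.

```python
import numpy as np

def mmse_fusion_weights(observations, target):
    """Return the w that minimizes mean((target - observations @ w) ** 2).

    observations: (num_samples, num_channels) sensor signals.
    target: (num_samples,) reference signal used to fit the weights.
    The solution has the Wiener form w = R_yy^{-1} r_ys.
    """
    num_samples = observations.shape[0]
    r_yy = observations.T @ observations / num_samples  # channel correlation matrix
    r_ys = observations.T @ target / num_samples        # cross-correlation with target
    return np.linalg.solve(r_yy, r_ys)

def single_channel_mse(channel, target):
    """MSE of the best scalar gain applied to one channel alone."""
    gain = (channel @ target) / (channel @ channel)
    return np.mean((target - gain * channel) ** 2)

rng = np.random.default_rng(0)
speech = rng.standard_normal(20000)  # stand-in for a speech waveform
# Two hypothetical sensors: one attenuated with little noise, one noisy.
sensors = np.stack([
    0.5 * speech + 0.1 * rng.standard_normal(20000),
    1.0 * speech + 1.0 * rng.standard_normal(20000),
], axis=1)

weights = mmse_fusion_weights(sensors, speech)
fused = sensors @ weights
mse_fused = np.mean((speech - fused) ** 2)
mse_best_single = min(single_channel_mse(sensors[:, k], speech) for k in range(2))
```

By construction the fused estimate can do no worse than the best single channel on the data used to fit the weights, since each single-channel solution is a special case with one weight set to zero.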


Missing feature theory with soft spectral subtraction for speaker verification

September 17, 2006

Conference Paper

Author:

Michael T. Padilla

…

Published in:

Interspeech 2006, ICSLP, 17-21 September 2006.

Topic:

speech enhancement

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper considers the problem of training/testing mismatch in the context of speaker verification and, in particular, explores the application of missing feature theory in the case of additive white Gaussian noise corruption in testing. Missing feature theory allows for corrupted features to be removed from scoring, the initial step of which is the detection of these features. One method of detection, employing spectral subtraction, is studied in a controlled manner and it is shown that with missing feature compensation the resulting verification performance is improved as long as a minimum number of features remain. Finally, a blending of "soft" spectral subtraction for noise mitigation and missing feature compensation is presented. The resulting performance improves on the constituent techniques alone, reducing the equal error rate by about 15% over an SNR range of 5 - 25 dB.
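
The two ingredients named in this abstract, spectral subtraction and detection of corrupted features, can be sketched in their most basic textbook forms. This is not the paper's "soft" blending scheme. The function names, the spectral floor, the SNR threshold, and the example power values are all assumptions.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract a noise power estimate, keeping at least floor * noisy_power."""
    return np.maximum(noisy_power - noise_power, floor * noisy_power)

def reliable_mask(noisy_power, noise_power, snr_threshold_db=0.0):
    """Flag bins whose estimated SNR after subtraction exceeds the threshold.

    Bins flagged False would be treated as missing and left out of scoring.
    """
    clean_estimate = np.maximum(noisy_power - noise_power, 1e-12)
    snr_db = 10.0 * np.log10(clean_estimate / noise_power)
    return snr_db > snr_threshold_db

# Made-up power values for four spectral bins.
noisy = np.array([10.0, 1.2, 3.0, 0.8])
noise = np.array([1.0, 1.0, 1.0, 1.0])

cleaned = spectral_subtraction(noisy, noise)
mask = reliable_mask(noisy, noise)
```

In this sketch a hard threshold decides which bins count as reliable; a softer variant would replace the boolean mask with a weight between 0 and 1.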
