Publications

Multi-pitch estimation by a joint 2-D representation of pitch and pitch dynamics

Published in:
INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, 26-30 September 2010, pp. 645-648.

Summary

Multi-pitch estimation of co-channel speech is especially challenging when the underlying pitch tracks are close in pitch value (e.g., when pitch tracks cross). Building on our previous work, we demonstrate the utility of a two-dimensional (2-D) analysis method of speech for this problem by exploiting its joint representation of pitch and pitch-derivative information from distinct speakers. Specifically, we propose a novel multi-pitch estimation method consisting of 1) a data-driven classifier for pitch candidate selection, 2) local pitch and pitch-derivative estimation by k-means clustering, and 3) a Kalman filtering mechanism for pitch tracking and assignment. We evaluate our method on a database of all-voiced speech mixtures and illustrate its capability to estimate pitch tracks both when they are well separated and when they are close in pitch value (e.g., at crossings).
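
As a rough illustration of the tracking stage, the sketch below runs a constant-velocity Kalman filter over per-frame observations of pitch and pitch derivative, the pair of quantities the 2-D analysis supplies. It is a minimal sketch in Python; the matrices, noise levels, frame step, and synthetic track are illustrative assumptions, not values from the paper.

    import numpy as np

    # Minimal constant-velocity Kalman filter over the state [pitch, pitch derivative].
    # All matrices and noise levels are illustrative assumptions, not the paper's values.
    dt = 0.01                                  # frame step in seconds (assumed)
    F = np.array([[1.0, dt], [0.0, 1.0]])      # transition: pitch advances by its derivative
    H = np.eye(2)                              # both pitch and its derivative are observed
    Q = np.diag([1.0, 10.0])                   # process noise (assumed)
    R = np.diag([4.0, 40.0])                   # measurement noise (assumed)

    def kalman_track(observations, x0):
        """Track one pitch trajectory from per-frame [pitch, pitch-derivative] observations."""
        x = np.asarray(x0, dtype=float)
        P = np.eye(2) * 100.0                  # initial state uncertainty
        track = []
        for z in observations:
            x = F @ x                          # predict
            P = F @ P @ F.T + Q
            S = H @ P @ H.T + R                # update with the new observation
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ (np.asarray(z) - H @ x)
            P = (np.eye(2) - K @ H) @ P
            track.append(x[0])
        return np.array(track)

    # Toy usage: a rising synthetic pitch track observed with noise.
    t = np.arange(100) * dt
    obs = np.stack([180 + 40 * t + np.random.randn(100), 40 + np.random.randn(100)], axis=1)
    print(kalman_track(obs, x0=[180.0, 40.0])[:5])

In a multi-speaker setting one would run a filter per track and use the innovation term to assign candidates to tracks; the single-track version above only shows the filtering mechanics.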

High-pitch formant estimation by exploiting temporal change of pitch

Published in:
IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 1, January 2010, pp. 171-186.

Summary

This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological studies implicating the use of pitch dynamics in speech by humans. We develop and assess signal processing schemes aimed at exploiting temporal change of pitch to address the high-pitch formant frequency estimation problem. Specifically, we propose a 2-D analysis framework using 2-D transformations of the time-frequency space. In one approach, we project changing spectral harmonics over time to a 1-D function of frequency. In a second approach, we draw upon previous work of Quatieri and Ezzat et al. [1], [2], with similarities to the auditory modeling efforts of Chi et al. [3], where localized 2-D Fourier transforms of the time-frequency space provide improved source-filter separation when pitch is changing. Our methods show quantitative improvements for synthesized vowels with stationary formant structure in comparison to traditional and homomorphic linear prediction. We also demonstrate the feasibility of applying our methods on stationary vowel regions of natural speech spoken by high-pitch females of the TIMIT corpus. Finally, we show improvements afforded by the proposed analysis framework in formant tracking on examples of stationary and time-varying formant structure.
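
To make the localized 2-D transform idea concrete, here is a minimal Python sketch: build a narrowband spectrogram of a synthetic vowel with rising pitch, window out a small time-frequency patch, and take its 2-D Fourier magnitude, where harmonics that change over time appear as tilted stripes and hence concentrate away from the origin. The window lengths, patch location, and synthetic signal are assumptions for illustration only.

    import numpy as np
    from scipy.signal import stft

    fs = 8000
    t = np.arange(0, 0.5, 1 / fs)
    f0 = 220 + 80 * t                          # rising pitch (the changing-pitch case)
    x = sum(np.sin(2 * np.pi * k * np.cumsum(f0) / fs) for k in range(1, 9))

    # Narrowband spectrogram: a long analysis window resolves individual harmonics.
    f, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=448)
    S = np.log(np.abs(Z) + 1e-8)

    # Localized patch and its 2-D Fourier magnitude: the tilted harmonic stripes
    # concentrate off-axis in the transformed space.
    region = S[20:52, 10:42]
    patch = (region - region.mean()) * np.outer(np.hanning(32), np.hanning(32))
    G = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    print(G.shape, np.unravel_index(np.argmax(G), G.shape))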

Towards co-channel speaker separation by 2-D demodulation of spectrograms

Published in:
WASPAA 2009, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 18-21 October 2009, pp. 65-68.

Summary

This paper explores a two-dimensional (2-D) processing approach for co-channel speaker separation of voiced speech. We analyze localized time-frequency regions of a narrowband spectrogram using 2-D Fourier transforms and propose a 2-D amplitude modulation model based on pitch information for single- and multi-speaker content in each region. Our model maps harmonically related speech content to concentrated entities in a transformed 2-D space, thereby motivating 2-D demodulation of the spectrogram for analysis/synthesis and speaker separation. Using a priori pitch estimates of individual speakers, we show through a quantitative evaluation 1) the utility of the model for representing speech content of a single speaker and 2) its feasibility for speaker separation. For the separation task, we also illustrate the benefits of the model's representation of pitch dynamics relative to a sinusoidal-based separation system.
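
The following toy sketch conveys the demodulation idea in Python: a patch holding two superimposed harmonic stripe patterns (a crude stand-in for two speakers' harmonics) is separated by masking the patch's 2-D Fourier transform around one pattern's concentrated component and inverting. The stripe spacings and mask geometry are assumptions for demonstration and do not reproduce the paper's pitch-informed model.

    import numpy as np

    n = 64
    rows = np.arange(n)[:, None]                  # "frequency" axis of the patch
    a = np.cos(2 * np.pi * rows / 8.0)            # speaker A: stripe spacing 8 bins
    b = np.cos(2 * np.pi * rows / 4.0)            # speaker B: stripe spacing 4 bins
    patch = np.tile(a + b, (1, n))                # stationary pitch: no tilt over time

    G = np.fft.fftshift(np.fft.fft2(patch))

    # Speaker A's stripes concentrate at +/- n/8 cycles along the frequency axis;
    # keep only a small neighborhood around those two components.
    r, c = np.indices(G.shape)
    ctr = n // 2
    mask = ((r - (ctr + n // 8))**2 + (c - ctr)**2 <= 4) | \
           ((r - (ctr - n // 8))**2 + (c - ctr)**2 <= 4)

    recovered = np.real(np.fft.ifft2(np.fft.ifftshift(G * mask)))
    print(np.corrcoef(recovered.ravel(), np.tile(a, (1, n)).ravel())[0, 1])  # ~1.0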

2-D processing of speech for multi-pitch analysis

Published in:
INTERSPEECH 2009, 6-10 September 2009.

Summary

This paper introduces a two-dimensional (2-D) processing approach for the analysis of multi-pitch speech sounds. Our framework invokes the short-space 2-D Fourier transform magnitude of a narrowband spectrogram, mapping harmonically related signal components to multiple concentrated entities in a new 2-D space. First, localized time-frequency regions of the spectrogram are analyzed to extract pitch candidates. These candidates are then combined across multiple regions for obtaining separate pitch estimates of each speech-signal component at a single point in time. We refer to this as multi-region analysis (MRA). By explicitly accounting for pitch dynamics within localized time segments, this separability is distinct from that which can be obtained using short-time autocorrelation methods typically employed in state-of-the-art multi-pitch tracking algorithms. We illustrate the feasibility of MRA for multi-pitch estimation on mixtures of synthetic and real speech.
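
As an illustration of the multi-region idea, the Python sketch below pulls one pitch candidate from each localized spectrogram patch (the dominant off-origin peak of the patch's 2-D Fourier magnitude reflects the harmonic spacing) and pools the candidates across regions, loosely in the spirit of MRA. The patch geometry and peak-picking rule are simplifying assumptions, not the paper's procedure.

    import numpy as np
    from scipy.signal import stft

    fs = 8000
    t = np.arange(0, 0.5, 1 / fs)
    x = sum(np.sin(2 * np.pi * k * 200 * t) for k in range(1, 11))   # 200 Hz synthetic vowel

    f, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=448)
    S = np.abs(Z)
    df = f[1] - f[0]                              # Hz per spectrogram row

    def pitch_candidate(patch):
        """Estimate harmonic spacing (Hz) from one patch's 2-D Fourier magnitude."""
        G = np.abs(np.fft.fft2(patch - patch.mean()))
        vertical = G[1 : patch.shape[0] // 2, 0]  # components along the frequency axis
        k = 1 + np.argmax(vertical)               # dominant cycles per patch height
        return patch.shape[0] * df / k            # cycles -> harmonic spacing in Hz

    # Pool candidates from many localized regions, then combine (median here).
    candidates = [pitch_candidate(S[0:64, j : j + 16])
                  for j in range(0, S.shape[1] - 16, 8)]
    print(np.median(candidates))                  # ~200 Hz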

Exploiting temporal change in pitch in formant estimation

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 31 March - 4 April 2008, pp. 3929-3932.

Summary

This paper considers the problem of obtaining an accurate spectral representation of speech formant structure when the voicing source exhibits a high fundamental frequency. Our work is inspired by auditory perception and physiological modeling studies implicating the use of temporal changes in speech by humans. Specifically, we develop and assess signal processing schemes aimed at exploiting temporal change of pitch as a basis for formant estimation. Our methods are cast in a generalized framework of two-dimensional processing of speech and show quantitative improvements under certain conditions over representations derived from traditional and homomorphic linear prediction. We conclude by highlighting potential benefits of our framework in the particular application of speaker recognition with preliminary results indicating a performance gender-gap closure on subsets of the TIMIT corpus.
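
The intuition can be shown in a few lines of Python: when pitch changes, the harmonics sweep across frequency and sample the vocal-tract envelope far more densely than any single frame does. The synthetic formant envelope and pitch trajectory below are illustrative assumptions.

    import numpy as np

    def envelope(freq_hz):
        """One synthetic formant resonance near 500 Hz (illustrative only)."""
        return 1.0 / (1.0 + ((freq_hz - 500.0) / 80.0) ** 2)

    # Pool (frequency, amplitude) samples of the envelope taken by the harmonics
    # of a rising f0 over 20 frames.
    samples = []
    for f0 in np.linspace(220.0, 260.0, 20):
        harmonics = f0 * np.arange(1, 10)
        samples += [(h, envelope(h)) for h in harmonics if h < 2000]

    freqs, amps = map(np.array, zip(*sorted(samples)))
    # With stationary pitch we would see ~9 distinct sample frequencies; with
    # changing pitch the pooled samples trace the formant peak far more densely.
    print(len(np.unique(np.round(freqs))), freqs[np.argmax(amps)])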

Auditory modeling as a basis for spectral modulation analysis with application to speaker recognition

Published in:
MIT Lincoln Laboratory Report TR-1119

Summary

This report explores auditory modeling as a basis for robust automatic speaker verification. Specifically, we have developed feature-extraction front-ends that incorporate (1) time-varying, level-dependent filtering, (2) variations in analysis filterbank size, and (3) nonlinear adaptation. Our methods are motivated both by a desire to better mimic auditory processing relative to traditional front-ends (e.g., the mel-cepstrum) and by reported gains in automatic speech recognition robustness from exploiting similar principles. Traditional mel-cepstral features in automatic speaker recognition are derived from ~20 invariant band-pass filter weights, thereby discarding temporal structure from phase. In contrast, cochlear frequency decomposition can be more precisely modeled as the output of ~3500 time-varying, level-dependent filters. Auditory signal processing is therefore more finely resolved in frequency than mel-cepstral analysis and also captures temporal information. Furthermore, loss of level-dependence has been suggested to reduce human speech reception in adverse acoustic environments. We were thus motivated to employ a recently proposed level-dependent compressed gammachirp filterbank in feature extraction, as well as to vary the number of filters or filter weights to improve frequency resolution. We are also simulating nonlinear adaptation models of inner-hair-cell function along the basilar membrane that presumably mimic temporal masking effects. Auditory-based front-ends are being evaluated with the Lincoln Laboratory Gaussian mixture model recognizer on the TIMIT database under clean and noisy (additive Gaussian white noise) conditions. Preliminary results from features derived from our auditory models suggest that they provide information complementary to the mel-cepstrum under clean and noisy conditions, resulting in speaker recognition performance improvements.
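
As a generic stand-in for such a front-end, the Python sketch below computes log channel energies from a small ERB-spaced gammatone filterbank. It uses a plain gammatone rather than the level-dependent compressed gammachirp studied in the report, and the channel count, bandwidth constants, and frame settings are illustrative assumptions.

    import numpy as np

    fs = 8000

    def erb(fc):
        """Equivalent rectangular bandwidth (Glasberg-Moore) in Hz."""
        return 24.7 * (4.37 * fc / 1000.0 + 1.0)

    def gammatone_ir(fc, dur=0.064):
        """4th-order gammatone impulse response at center frequency fc."""
        t = np.arange(int(dur * fs)) / fs
        g = t**3 * np.exp(-2 * np.pi * 1.019 * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
        return g / np.sqrt(np.sum(g**2) + 1e-12)

    # Center frequencies on a roughly logarithmic (ERB-like) scale.
    centers = np.geomspace(100.0, 3500.0, 24)
    bank = [gammatone_ir(fc) for fc in centers]

    def features(x, frame=200, hop=80):
        """Log channel energies per frame (a crude auditory spectrogram)."""
        outs = np.stack([np.convolve(x, h, mode="same") for h in bank])
        n = 1 + (outs.shape[1] - frame) // hop
        blocks = np.stack([outs[:, i*hop : i*hop + frame] for i in range(n)], axis=1)
        return np.log(np.mean(blocks**2, axis=2) + 1e-10)   # (channels, frames)

    x = np.random.randn(fs)                                 # 1 s of noise as a stand-in signal
    print(features(x).shape)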
