Publications

Speech recognition by humans and machines under conditions with severe channel variability and noise

Published in:
SPIE, Vol. 3077, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, pp. 46-57.

Summary

Despite dramatic recent advances in speech recognition technology, speech recognizers still perform much worse than humans. The difference in performance between humans and machines is most dramatic when variable amounts and types of filtering and noise are present during testing. For example, humans readily understand speech that is low-pass filtered below 3 kHz or high-pass filtered above 1 kHz. Machines trained with wide-band speech, however, degrade dramatically under these conditions. An approach to compensate for variable unknown sharp filtering and noise is presented which uses mel-filter-bank magnitudes as input features, estimates the signal-to-noise ratio (SNR) for each filter, and uses missing-feature theory to dynamically modify the probability computations performed by Gaussian mixture or radial basis function neural network classifiers embedded within hidden Markov model (HMM) recognizers. The approach was successfully demonstrated on a talker-independent digit recognition task, where recognition accuracy across many conditions rose from below 50% to above 95%. These promising results suggest future work to dynamically estimate SNRs and to explore the dynamics of human adaptation to channel and noise variability.
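The dynamic modification of probability computations described above can be sketched as missing-feature marginalization: filter-bank dimensions whose estimated SNR falls below a threshold are dropped from a diagonal-covariance Gaussian mixture likelihood. The threshold, model parameters, and feature values below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gmm_log_likelihood(x, reliable, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM,
    marginalizing out dimensions flagged as unreliable (missing-feature theory).
    For a diagonal Gaussian, marginalizing a dimension simply drops its term."""
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # Per-dimension Gaussian log-densities
        ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        # Keep only the reliable (high-SNR) dimensions
        log_probs.append(np.log(w) + ll[reliable].sum())
    m = max(log_probs)  # log-sum-exp over mixture components
    return m + np.log(sum(np.exp(p - m) for p in log_probs))

# Illustrative 4-band filter-bank features and a 2-component GMM
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 2.0]])
variances = np.ones((2, 4))
x = np.array([0.1, 0.9, -5.0, 2.8])          # third band corrupted by noise
snr_db = np.array([20.0, 15.0, -3.0, 12.0])  # hypothetical per-filter SNR estimates
reliable = snr_db > 0.0                      # marginalize low-SNR bands
print(gmm_log_likelihood(x, reliable, weights, means, variances))
```

Marginalizing the corrupted band keeps the score driven by the bands the noise left intact, which is the mechanism the abstract credits for the accuracy recovery.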

AM-FM separation using auditory-motivated filters

Published in:
IEEE Trans. Speech Audio Process., Vol. 5, No. 5, September 1997, pp. 465-480.

Summary

An approach to the joint estimation of sine-wave amplitude modulation (AM) and frequency modulation (FM) is described, based on the transduction of frequency modulation into amplitude modulation by linear filters and motivated by the hypothesis that the auditory system uses a similar transduction mechanism in measuring sine-wave FM. An AM-FM estimator is described that uses the amplitude envelopes of the outputs of two transduction filters of piecewise-linear spectral shape. The piecewise-linear constraint is then relaxed, allowing a wider class of transduction-filter pairs for AM-FM separation under a monotonicity constraint on the filters' quotient. The particular cases of Gaussian filters and measured auditory filters, although not leading to closed-form solutions, allow iterative AM-FM estimation. Solution stability analysis and error evaluation are performed, and the FM transduction method is compared with the energy separation algorithm, based on the Teager energy operator, and with the Hilbert transform method for AM-FM estimation. Finally, a generalization to two-dimensional (2-D) filters is described.
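The piecewise-linear transduction idea can be sketched with a complementary filter pair whose magnitude responses are linear in frequency, so that the envelope sum removes the FM dependence and the envelope ratio removes the AM. The specific responses G1(f) = f/c and G2(f) = 1 - f/c below are illustrative assumptions, not the paper's filters.

```python
def amfm_separate(e1, e2, c=1.0):
    """Recover AM and FM from the amplitude envelopes of two transduction
    filters with complementary linear magnitude responses G1(f) = f/c and
    G2(f) = 1 - f/c. The envelope of a slowly modulated sine through filter i
    is approximately A(t) * Gi(F(t)), so the pair can be solved for A and F."""
    a = e1 + e2             # A*(f/c) + A*(1 - f/c) = A
    f = c * e1 / (e1 + e2)  # (A*f/c) / A * c = f
    return a, f

# A sine wave with amplitude A = 2.0 and normalized instantaneous frequency 0.3
A, f_true, c = 2.0, 0.3, 1.0
e1 = A * (f_true / c)        # envelope at the output of filter 1
e2 = A * (1 - f_true / c)    # envelope at the output of filter 2
print(amfm_separate(e1, e2, c))  # recovers (A, f_true)
```

The monotonicity constraint mentioned in the abstract generalizes exactly this step: as long as G1/G2 is monotonic in frequency, the envelope ratio still determines F uniquely.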

Automated English-Korean translation for enhanced coalition communications

Summary

This article describes our progress on automated, two-way English-Korean translation of text and speech for enhanced military coalition communications. Our goal is to improve multilingual communications by producing accurate translations across a number of languages. Therefore, we have chosen an interlingua-based approach to machine translation that readily extends to multiple languages. In this approach, a natural-language-understanding system transforms the input into an intermediate-meaning representation called a semantic frame, which serves as the basis for generating output in multiple languages. To produce useful, accurate, and effective translation systems in the short term, we have focused on limited military-task domains, and have configured our system as a translator's aid so that the human translator can confirm or edit the machine translation. We have obtained promising results in translation of telegraphic military messages in a naval domain, and have successfully extended the system to additional military domains. The system has been demonstrated in a coalition exercise and at Combined Forces Command in the Republic of Korea. From these demonstrations we learned that the system must be robust enough to handle new inputs, which is why we have developed a multistage robust translation strategy, including a part-of-speech tagging technique to handle new words, and a fragmentation strategy for handling complex sentences. Our current work emphasizes ongoing development of these robust translation techniques and extending the translation system to application domains of interest to users in the military coalition environment in the Republic of Korea.
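The interlingua pipeline described above, understanding into a semantic frame, then generation into any target language, can be sketched with a toy example. The frame layout, the hard-coded parse, and the verb-final Korean template are hypothetical illustrations only; the actual TINA and GENESIS modules use rich grammars, not string templates.

```python
def understand(english):
    """Toy natural-language-understanding stage: map one telegraphic message
    pattern (subject verb object) to an interlingual semantic frame.
    The frame layout here is a hypothetical stand-in for TINA's output."""
    subject, verb, obj = english.lower().rstrip(".").split(" ", 2)
    return {"predicate": verb, "agent": subject, "theme": obj}

def generate(frame, language):
    """Toy generation stage: realize the same frame in either output language.
    The Korean branch only illustrates verb-final (SOV) ordering; GENESIS
    performs real morphological and syntactic generation."""
    if language == "english":
        return f"{frame['agent']} {frame['predicate']} {frame['theme']}."
    return f"{frame['agent']} {frame['theme']} {frame['predicate']}."

frame = understand("Ship reported contact.")
print(generate(frame, "english"))  # → ship reported contact.
print(generate(frame, "korean"))
```

The point of the architecture is visible even in the toy: adding a language means adding one generation routine, while the understanding stage is shared.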

Automatic English-to-Korean text translation of telegraphic messages in a limited domain

Published in:
Proc. Int. Conf. on Computational Linguistics, 5-9 August 1996, pp. 705-710.

Summary

This paper describes our work in progress on automatic English-to-Korean text translation. This work is an initial step toward the ultimate goal of text and speech translation for enhanced multilingual and multinational operations. For this purpose, we have adopted an interlingual approach with natural language understanding (TINA) and generation (GENESIS) modules at the core. We tackle the ambiguity problem by incorporating syntactic and semantic categories in the analysis grammar. Our system is capable of producing accurate translations of complex sentences (38 words) and sentence fragments as well as average-length (12-word) grammatical sentences. Two types of system evaluation have been carried out: one for grammar coverage and the other for overall performance. For system robustness, integration of two subsystems is under way: (i) a rule-based part-of-speech tagger to handle unknown words/constructions, and (ii) a word-for-word translator to handle other system failures.

Improving wordspotting performance with artificially generated data

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 9 May 1996, pp. 526-529.

Summary

Lack of training data is a major problem that limits the performance of speech recognizers. Performance can often only be improved by expensive collection of data from many different talkers. This paper demonstrates that artificially transformed speech can increase the variability of training data and increase the performance of a wordspotter without additional expensive data collection. This approach was shown to be effective on a high-performance whole-word wordspotter on the Switchboard Credit Card database. The proposed approach, used in combination with a discriminative training approach, increased the Figure of Merit of the wordspotting system by 9.4 percentage points (62.5% to 71.9%). The increase in performance provided by artificially transforming speech was roughly equivalent to the increase that would have been provided by doubling the amount of training data. The performance of the wordspotter was also compared to that of human listeners, who achieved lower error rates because of improved consonant recognition.
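The abstract does not spell out which artificial transformations were used, so the sketch below shows just one plausible example of the general idea: a duration (speaking-rate) perturbation produced by resampling the waveform's time axis, which yields extra training variants of each utterance at no collection cost.

```python
import numpy as np

def perturb_speed(signal, factor):
    """Artificially transform a speech waveform by linearly resampling its
    time axis (an illustrative transformation; the paper's exact transforms
    may differ). factor > 1 speeds the talker up, factor < 1 slows them down."""
    n_out = int(round(len(signal) / factor))
    t_out = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(t_out, np.arange(len(signal)), signal)

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)  # 1 s of stand-in "speech" at 16 kHz
# Three training variants of the same utterance: slow, original, fast
augmented = [perturb_speed(utterance, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])
```

Each variant is a legitimate rendition of the same word at a different rate, so the trained models see talker-like variability without new talkers.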

Automatic dialect identification of extemporaneous, conversational, Latin American Spanish speech

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Vol. 2, ICASSP, 7-10 May 1996, pp. 777-780.

Summary

A dialect identification technique is described that takes as input extemporaneous, conversational speech spoken in Latin American Spanish and produces as output a hypothesis of the dialect. The system has been trained to recognize the Cuban and Peruvian dialects of Spanish, but could easily be extended to other dialects (and languages) as well. Building on our experience in automatic language identification, the dialect-ID system uses an English phone recognizer trained on the TIMIT corpus to tokenize training speech spoken in each Spanish dialect. Phonotactic language models generated from this tokenized training speech are used during testing to compute dialect likelihoods for each unknown message. This system has an error rate of 16% on the Cuban/Peruvian two-alternative forced-choice test. We also introduce the new "Miami" Latin American Spanish speech corpus, which will support this research into the future.

Fine structure features for speaker identification

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, Speech (Part II), 7-10 May 1996, pp. 689-692.

Summary

The performance of speaker identification (SID) systems can be improved by the addition of the rapidly varying "fine structure" features of formant amplitude and/or frequency modulation and multiple excitation pulses. This paper shows how the estimation of such fine-structure features can be improved further by obtaining better estimates of formant frequency locations and by uncovering various sources of error in the feature-extraction systems. Most female telephone speech showed "spurious" formants caused by distortion in the telephone network. Nevertheless, SID performance was highest when these spurious formants were used as the formant estimates. A new feature has also been identified that can increase SID performance: cepstral coefficients from noise in the estimated excitation waveform. Finally, statistical tools have been developed to explore the relative importance of features used for SID, with the ultimate goal of uncovering the source of the features that provide the SID performance improvement.

Low rate coding of the spectral envelope using channel gains

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 7-10 May 1996, pp. 769-772.

Summary

A dual-rate embedded sinusoidal transform coder is described in which a core 14th-order all-pole coder operating at 2400 b/s is augmented with a set of channel-gain residuals in order to operate at the higher 4800 b/s rate. The channel gains are a set of non-uniformly spaced samples of the spline envelope and constitute a lowpass estimate of the short-time vocal tract magnitude spectrum. The channel-gain residuals represent the difference between the spline envelope and the quantized 14th-order all-pole spectrum at the channel-gain frequencies. The channel-gain residuals are coded using pitch-dependent scalar quantization. Informal listening indicates that the quality of the embedded coder at 4800 b/s is comparable to that of an existing high-quality 4800 b/s all-pole coder.
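The embedded, dual-rate structure (a coarse core layer that quantized residuals refine at the higher rate) can be sketched generically. The dB samples, step sizes, and uniform quantizers below are illustrative assumptions and not the paper's pitch-dependent scalar quantization.

```python
import numpy as np

def embedded_encode(envelope_db, core_step=3.0, residual_step=1.0):
    """Two-stage embedded quantization: a coarse core layer (the 2400 b/s
    analogue) plus quantized residuals that refine it at the higher rate.
    A decoder can stop after the core or add the refinement layer."""
    core = np.round(envelope_db / core_step) * core_step
    residual = np.round((envelope_db - core) / residual_step) * residual_step
    return core, residual

envelope_db = np.array([12.4, 10.1, 7.7, 3.2, -1.6])  # spectral samples (dB)
core, residual = embedded_encode(envelope_db)
low_rate = core                # decoded at the core rate only
high_rate = core + residual    # decoded with the refinement layer
print(np.abs(high_rate - envelope_db).max() <= np.abs(low_rate - envelope_db).max())
```

The key property, visible in the comparison printed above, is that the refinement layer never increases the reconstruction error: the core stream stays decodable on its own, and extra bits only sharpen it.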

The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 113-116.

Summary

This paper presents an empirical study of the effects of handset variability on text-independent speaker recognition performance using the Switchboard corpus. Handset variability occurs when training speech is collected using one type of handset but a different handset is used for collecting test speech. For the Switchboard corpus, the calling telephone number associated with a file is used to infer the handset used. Analysis of experiments designed to focus on handset variability, on the SPIDRE database and the May95 NIST speaker recognition evaluation database, shows that a performance gap between matched and mismatched handset tests persists even after several standard channel compensation techniques are applied. Error rates for the mismatched tests are over 4 times those for the matched tests. Lastly, a new energy-dependent cepstral mean subtraction technique is proposed to compensate for nonlinear distortions, but it is not found to improve performance on the databases used.
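Among the standard channel compensation techniques referenced above is cepstral mean subtraction (CMS), sketched below with synthetic data. A fixed linear channel adds a constant vector in the cepstral domain, so subtracting each coefficient's per-utterance mean removes it; the paper's energy-dependent variant for nonlinear handset distortion is not reproduced here.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove a stationary convolutional channel from a (frames x coeffs)
    cepstral matrix by subtracting each coefficient's utterance mean.
    The same utterance through two different handsets (modeled as two
    constant offsets) maps to the same CMS output."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
clean = rng.standard_normal((200, 13))  # synthetic cepstral frames
handset = rng.standard_normal(13)       # fixed channel offset per coefficient
observed = clean + handset              # same speech through one handset
compensated = cepstral_mean_subtraction(observed)
# The handset offset cancels exactly under the fixed-linear-channel model
print(np.allclose(compensated, cepstral_mean_subtraction(clean)))  # → True
```

This also shows why a gap can persist after compensation, as the paper reports: real handset differences are partly nonlinear, so they are not a constant cepstral offset and CMS cannot cancel them.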

Unsupervised topic clustering of Switchboard speech messages

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 1, 7-10 May 1996, pp. 315-318.

Summary

This paper presents a statistical technique which can be used to automatically group speech data records based on the similarity of their content. A tree-based clustering algorithm is used to generate a hierarchical structure for the corpus. This structure can then be used to guide the search for similar material in data from other corpora. The SWITCHBOARD Speech Corpus was used to demonstrate these techniques, since it provides sets of speech files which are nominally on the same topic. Excellent automatic clustering was achieved on the truth text transcripts provided with the SWITCHBOARD corpus, with an average cluster purity of 97.3%. Degraded clustering was achieved using the output transcriptions of a speech recognizer, with a clustering purity of 61.4%.
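The cluster-purity figures quoted above can be computed as the size-weighted average, over clusters, of the fraction of each cluster's messages that share the cluster's majority topic. That is a common definition; whether the paper weights by cluster size in exactly this way is an assumption here, and the data below are a toy example.

```python
from collections import Counter

def cluster_purity(assignments, topics):
    """Size-weighted average, over clusters, of the fraction of members that
    carry the cluster's most common topic label. 1.0 means every cluster is
    topically pure; chance level depends on the topic distribution."""
    clusters = {}
    for cluster_id, topic in zip(assignments, topics):
        clusters.setdefault(cluster_id, []).append(topic)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(topics)

# 6 messages in 2 clusters; one message lands in the wrong cluster
assignments = [0, 0, 0, 1, 1, 1]
topics = ["credit", "credit", "pets", "pets", "pets", "pets"]
print(cluster_purity(assignments, topics))  # → 0.8333333333333334
```

Under this metric, the drop the paper reports from 97.3% (truth transcripts) to 61.4% (recognizer output) reflects recognition errors scattering same-topic messages across clusters.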