Publications

Embedded dual-rate sinusoidal transform coding

Published in:
Proc. IEEE Workshop on Speech Coding for Telecommunications: Back to Basics: Attacking Fundamental Problems in Speech Coding, 7-10 September 1997, pp. 33-34.

Summary

This paper describes the development of a dual-rate Sinusoidal Transform Coder in which a 2400 b/s coder is embedded as a separate packet in the 4800 b/s bit stream. The underlying coding structure provides the flexibility necessary for multirate speech coding and multimedia applications.
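
To make the embedded-bitstream idea concrete, here is a minimal sketch of the packetization it implies. The frame duration and field sizes below are assumptions for illustration, not details from the paper: each frame carries a core packet that is decodable on its own at 2400 b/s, followed by an enhancement packet that raises the total rate to 4800 b/s.

```python
FRAME_MS = 20                                        # assumed frame duration
CORE_BYTES = 2400 * FRAME_MS // 1000 // 8            # 6 bytes of core bits
ENH_BYTES = (4800 - 2400) * FRAME_MS // 1000 // 8    # 6 bytes of enhancement

def pack_frame(core: bytes, enh: bytes) -> bytes:
    # Core packet comes first so a low-rate decoder can stop after it.
    assert len(core) == CORE_BYTES and len(enh) == ENH_BYTES
    return core + enh

def unpack_frame(frame: bytes, low_rate: bool):
    core = frame[:CORE_BYTES]
    enh = None if low_rate else frame[CORE_BYTES:]   # 2400 b/s decoder drops enh
    return core, enh
```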

Ambiguity resolution for machine translation of telegraphic messages

Published in:
Proc. 35th Annual Meeting of the Assoc. for Computational Linguistics, 7-12 July 1997, pp. 120-127.

Summary

Telegraphic messages with numerous instances of omission pose a new challenge to parsing in that a sentence with omission causes a higher degree of ambiguity than a sentence without omission. Misparsing induced by omissions has a far-reaching consequence in machine translation: a misparse of the input often leads to a translation into the target language which has incoherent meaning in the given context. This is more frequently the case if the structures of the source and target languages are quite different, as in English and Korean. Thus, the question of how to parse telegraphic messages accurately and efficiently becomes a critical issue in machine translation. In this paper we describe a technical solution to the issue, and present the performance evaluation of a machine translation system on telegraphic messages before and after adopting the proposed solution. The solution lies in a grammar design in which lexicalized grammar rules defined in terms of semantic categories and syntactic rules defined in terms of parts of speech are utilized together. The proposed grammar achieves a higher parsing coverage without increasing the amount of ambiguity/misparsing when compared with a purely lexicalized semantic grammar, and achieves a lower degree of ambiguity/misparsing without decreasing the parsing coverage when compared with a purely syntactic grammar.
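
As a rough illustration of the mixed grammar design, the toy context-free grammar below combines a lexicalized semantic rule layer (SHIP, LOC) with ordinary syntactic rules over parts of speech (V, P). All rules and vocabulary here are invented for the sketch, not taken from the paper; the point is only that restricting nouns to semantic classes prunes parses a purely syntactic grammar would allow.

```python
import nltk

# Toy mixed grammar: SHIP and LOC are lexicalized semantic categories,
# while V and P are conventional part-of-speech rules.
grammar = nltk.CFG.fromstring("""
    S -> SHIP VP
    VP -> V PP | V
    PP -> P LOC
    SHIP -> 'ship'
    LOC -> 'port'
    V -> 'arrived'
    P -> 'at'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("ship arrived at port".split()):
    print(tree)
```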

Speech recognition by machines and humans

Published in:
Speech Commun., Vol. 22, No. 1, July 1997, pp. 1-15.

Summary

This paper reviews past work comparing modern speech recognition systems and humans to determine how far recent dramatic advances in technology have progressed towards the goal of human-like performance. Comparisons use six modern speech corpora with vocabularies ranging from 10 to more than 65,000 words and content ranging from read isolated words to spontaneous conversations. Error rates of machines are often more than an order of magnitude greater than those of humans for quiet, wideband, read speech. Machine performance degrades further below that of humans in noise, with channel variability, and for spontaneous speech. Humans can also recognize quiet, clearly spoken nonsense syllables and nonsense sentences with little high-level grammatical information. These comparisons suggest that the human-machine performance gap can be reduced by basic research on improving low-level acoustic-phonetic modeling, on improving robustness with noise and channel variability, and on more accurately modeling spontaneous speech.

HTIMIT and LLHDB: speech corpora for the study of handset transducer effects

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 21-24 April 1997, pp. 1535-1538.

Summary

This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and to produce speech differing predominantly in handset transducer effects only. The speech is recorded directly from a telephone unit in a sound booth using prompted text and extemporaneous photograph descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a handset (HTIMIT). A comparison of analysis and results between the two corpora will address the realism of artificially creating handset-degraded speech by playing recorded speech through handsets. The corpora are designed primarily for speaker recognition experimentation (in terms of amount of speech and level of transcription), but since both speaker and speech recognition systems operate on the same acoustic features affected by the handset, knowledge gleaned is directly transferable to speech recognizers.
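
For intuition, a crude software approximation of handset degradation is to band-limit wideband speech to the classic 300-3400 Hz telephone band. The sketch below does only that; it does not model the nonlinear transducer effects these corpora were built to capture, which is precisely why real handset recordings are needed.

```python
from scipy.signal import butter, lfilter

def telephone_band(wave, fs=16000):
    # Band-limit wideband speech to the 300-3400 Hz telephone band.
    # A linear filter cannot reproduce nonlinear handset effects.
    b, a = butter(4, [300, 3400], btype="bandpass", fs=fs)
    return lfilter(b, a, wave)
```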

Speech recognition by humans and machines under conditions with severe channel variability and noise

Published in:
SPIE, Vol. 3077, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, pp. 46-57.

Summary

Despite dramatic recent advances in speech recognition technology, speech recognizers still perform much worse than humans. The difference in performance between humans and machines is most dramatic when variable amounts and types of filtering and noise are present during testing. For example, humans readily understand speech that is low-pass filtered below 3 kHz or high-pass filtered above 1kHz. Machines trained with wide-band speech, however, degrade dramatically under these conditions. An approach to compensate for variable unknown sharp filtering and noise is presented which uses mel-filter-bank magnitudes as input features, estimates the signal-to-noise ratio (SNR) for each filter, and uses missing feature theory to dynamically modify the probability computations performed using Gaussian Mixture or Radial Basis Function neural network classifiers embedded within Hidden Markov Model (HMM) recognizers. The approach was successfully demonstrated using a talker-independent digit recognition task. It was found that recognition accuracy across many conditions rises from below 50 % to above 95 % with this approach. These promising results suggest future work to dynamically estimate SNR's and to explore the dynamics of human adaptation to channel and noise variability.
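
A minimal sketch of the core probability modification follows, assuming diagonal-covariance Gaussian mixtures; the threshold, function name, and interface are illustrative assumptions rather than the paper's implementation. The key fact it relies on is that marginalizing a diagonal Gaussian over a missing dimension simply removes that dimension's factor, so unreliable (low-SNR) filter-bank channels can be dropped from the score.

```python
import numpy as np

def gmm_loglik_missing(x, snr_db, weights, means, variances, thresh_db=0.0):
    """Frame log-likelihood under a diagonal-covariance GMM, treating
    low-SNR feature dimensions as missing (missing-feature theory)."""
    reliable = snr_db > thresh_db                 # per-filter reliability mask
    xr = x[reliable]
    mu = means[:, reliable]                       # (n_mix, n_reliable)
    var = variances[:, reliable]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * var)
                               + (xr - mu) ** 2 / var, axis=1))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())  # log-sum-exp over mixtures
```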

AM-FM separation using auditory-motivated filters

Published in:
IEEE Trans. Speech Audio Process., Vol. 5, No. 5, September 1997, pp. 465-480.

Summary

An approach to the joint estimation of sine-wave amplitude modulation (AM) and frequency modulation (FM) is described, based on the transduction of frequency modulation into amplitude modulation by linear filters and motivated by the hypothesis that the auditory system uses a similar transduction mechanism in measuring sine-wave FM. An AM-FM estimation method is described that uses the amplitude envelopes of the outputs of two transduction filters of piecewise-linear spectral shape. The piecewise-linear constraint is then relaxed, allowing a wider class of transduction-filter pairs for AM-FM separation under a monotonicity constraint on the filters' quotient. The particular cases of Gaussian filters and measured auditory filters, although not leading to closed-form solutions, provide for iterative AM-FM estimation. Solution stability analysis and error evaluation are performed, and the FM transduction method is compared with the energy separation algorithm, based on the Teager energy operator, and with the Hilbert transform method for AM-FM estimation. Finally, a generalization to two-dimensional (2-D) filters is described.
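
As a sketch of the transduction idea under the simplest (linear) filter shapes: if two filters have magnitude responses |H1(f)| = f/fn and |H2(f)| = 1 - f/fn, the quotient of their output envelopes is monotonic in frequency, so it can be inverted for the FM, after which the AM follows by dividing out the filter gain. The test signal, filter lengths, and modulation rates below are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np
from scipy.signal import firwin2, hilbert, lfilter

fs = 8000
fn = fs / 2
t = np.arange(0, 0.5, 1 / fs)
a = 1.0 + 0.3 * np.cos(2 * np.pi * 3 * t)        # slow amplitude modulation
f_inst = 1000 + 400 * np.sin(2 * np.pi * 5 * t)  # slow frequency modulation
x = a * np.cos(2 * np.pi * np.cumsum(f_inst) / fs)

# Linear-phase FIR pair with magnitude responses rising/falling linearly in f
h1 = firwin2(301, [0, fn], [0, 1], fs=fs)        # |H1(f)| ~ f / fn
h2 = firwin2(301, [0, fn], [1, 0], fs=fs)        # |H2(f)| ~ 1 - f / fn
e1 = np.abs(hilbert(lfilter(h1, [1.0], x)))      # envelope of filter-1 output
e2 = np.abs(hilbert(lfilter(h2, [1.0], x)))      # envelope of filter-2 output

r = e1 / np.maximum(e2, 1e-9)                    # monotonic function of f
f_hat = fn * r / (1 + r)                         # invert the quotient -> FM
a_hat = e1 / np.maximum(f_hat / fn, 1e-9)        # divide out |H1(f_hat)| -> AM
```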

Automated English-Korean translation for enhanced coalition communications

Summary

This article describes our progress on automated, two-way English-Korean translation of text and speech for enhanced military coalition communications. Our goal is to improve multilingual communications by producing accurate translations across a number of languages. Therefore, we have chosen an interlingua-based approach to machine translation that readily extends to multiple languages. In this approach, a natural-language-understanding system transforms the input into an intermediate-meaning representation called a semantic frame, which serves as the basis for generating output in multiple languages. To produce useful, accurate, and effective translation systems in the short term, we have focused on limited military-task domains, and have configured our system as a translator's aid so that the human translator can confirm or edit the machine translation. We have obtained promising results in translation of telegraphic military messages in a naval domain, and have successfully extended the system to additional military domains. The system has been demonstrated in a coalition exercise and at Combined Forces Command in the Republic of Korea. From these demonstrations we learned that the system must be robust enough to handle new inputs, which is why we have developed a multistage robust translation strategy, including a part-of-speech tagging technique to handle new words, and a fragmentation strategy for handling complex sentences. Our current work emphasizes ongoing development of these robust translation techniques and extending the translation system to application domains of interest to users in the military coalition environment in the Republic of Korea.
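
To make the interlingua idea concrete, here is a toy sketch of a semantic frame and a surface generator. The frame fields and the example sentence are invented for illustration; the real system uses dedicated understanding and generation modules rather than anything this simple.

```python
# Toy semantic frame: a language-neutral meaning representation.
# Field names and values are hypothetical.
frame = {
    "clause": "statement",
    "predicate": "depart",
    "agent": "ship",
    "time": "0800",
}

# A per-language generator reads the same frame; adding a language means
# adding surface-realization rules, not changing the representation.
def generate_english(f):
    return f"The {f['agent']} departs at {f['time']}."

print(generate_english(frame))   # -> The ship departs at 0800.
```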

Automatic English-to-Korean text translation of telegraphic messages in a limited domain

Published in:
Proc. Int. Conf. on Computational Linguistics, 5-9 August 1996, pp. 705-710.

Summary

This paper describes our work in progress in automatic English-to-Korean text translation. This work is an initial step toward the ultimate goal of text and speech translation for enhanced multilingual and multinational operations. For this purpose, we have adopted an interlingual approach with natural language understanding (TINA) and generation (GENESIS) modules at the core. We tackle the ambiguity problem by incorporating syntactic and semantic categories in the analysis grammar. Our system is capable of producing accurate translation of complex sentences (38 words) and sentence fragments, as well as average-length (12 words) grammatical sentences. Two types of system evaluation have been carried out: one for grammar coverage and the other for overall performance. For system robustness, integration of two subsystems is under way: (i) a rule-based part-of-speech tagger to handle unknown words/constructions, and (ii) a word-for-word translator to handle other system failures.
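
The planned robustness stages suggest a simple fallback cascade. The sketch below is a hypothetical outline only: the stage functions are stubs standing in for the real TINA/GENESIS path, the POS-tagger-assisted retry, and the word-for-word translator.

```python
def full_parse_translate(sentence):
    return None    # stub: succeeds only when the grammar covers the input

def tagger_assisted_translate(sentence):
    return None    # stub: POS-tags unknown words, then retries the parse

def word_for_word(sentence):
    lexicon = {}   # hypothetical bilingual lexicon (entries elided)
    return " ".join(lexicon.get(w, w) for w in sentence.split())

def translate(sentence):
    # Try the highest-quality path first, falling back stage by stage.
    for stage in (full_parse_translate, tagger_assisted_translate):
        result = stage(sentence)
        if result is not None:
            return result
    return word_for_word(sentence)   # last-resort gloss never fails
```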

Improving wordspotting performance with artificially generated data

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 9 May 1996, pp. 526-529.

Summary

Lack of training data is a major problem that limits the performance of speech recognizers. Performance can often be improved only by expensive collection of data from many different talkers. This paper demonstrates that artificially transformed speech can increase the variability of training data and increase the performance of a wordspotter without additional expensive data collection. This approach was shown to be effective on a high-performance whole-word wordspotter on the Switchboard Credit Card database. The proposed approach, used in combination with a discriminative training approach, increased the Figure of Merit of the wordspotting system by 9.4 percentage points (62.5% to 71.9%). The increase in performance provided by artificially transforming speech was roughly equivalent to the increase that would have been provided by doubling the amount of training data. The performance of the wordspotter was also compared to that of human listeners, who were able to achieve lower error rates because of improved consonant recognition.
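
As a rough sketch of artificially transforming speech to add training variability: the specific transforms below (random gain, additive noise, and a mild tempo change) are illustrative assumptions, not necessarily those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(wave):
    """Return a randomly transformed copy of a waveform (hypothetical
    transforms: gain, additive noise, linear-interpolation resampling)."""
    out = wave * rng.uniform(0.5, 2.0)                       # random gain
    out = out + rng.normal(0.0, 0.01 * (out.std() + 1e-12),  # additive noise
                           out.size)
    rate = rng.uniform(0.9, 1.1)                             # mild tempo change
    n = int(out.size / rate)
    return np.interp(np.linspace(0, out.size - 1, n),        # resample
                     np.arange(out.size), out)
```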

Automatic dialect identification of extemporaneous, conversational, Latin American Spanish speech

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, ICASSP, Vol. 2, 7-10 May 1996, pp. 777-780.

Summary

A dialect identification technique is described that takes as input extemporaneous, conversational speech spoken in Latin American Spanish and produces as output a hypothesis of the dialect. The system has been trained to recognize the Cuban and Peruvian dialects of Spanish, but could easily be extended to other dialects (and languages) as well. Building on our experience in automatic language identification, the dialect-ID system uses an English phone recognizer trained on the TIMIT corpus to tokenize training speech spoken in each Spanish dialect. Phonotactic language models generated from this tokenized training speech are used during testing to compute dialect likelihoods for each unknown message. This system has an error rate of 16% on the Cuban/Peruvian two-alternative forced-choice test. We also introduce the new "Miami" Latin American Spanish speech corpus, which will support this line of research in the future.
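
A toy sketch of phonotactic scoring follows, assuming utterances have already been tokenized into phone sequences by a phone recognizer. The bigram model, add-one smoothing, and the tiny training strings are all illustrative assumptions rather than details from the paper; the structure (one model per dialect, highest likelihood wins) matches the approach described above.

```python
import math
from collections import Counter

def train_bigram(sequences):
    """Count phone bigrams and histories from tokenized training speech."""
    bigrams, histories, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            histories[a] += 1
    return bigrams, histories, vocab

def loglik(seq, model):
    """Add-one-smoothed bigram log-likelihood of a phone sequence."""
    bigrams, histories, vocab = model
    v = len(vocab)
    return sum(math.log((bigrams[(a, b)] + 1) / (histories[a] + v))
               for a, b in zip(seq, seq[1:]))

# Hypothetical phone strings; real models are trained on recognizer output.
cuban = train_bigram([list("samasama"), list("lama")])
peruvian = train_bigram([list("solosolo"), list("polo")])
test = list("sama")
print("Cuban" if loglik(test, cuban) > loglik(test, peruvian) else "Peruvian")
```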