Publications

Corpora design and score calibration for text dependent pronunciation proficiency recognition

September 20, 2019

Conference Paper

Author:

Frederick S. Richardson

…

Published in:

8th ISCA Workshop on Speech and Language Technology in Education, SLaTe 2019, 20-21 September 2019.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This work investigates methods for improving a pronunciation proficiency recognition system, both in terms of phonetic-level posterior probability calibration and in ordinal utterance-level classification, for Modern Standard Arabic (MSA), Spanish and Russian. To support this work, utterance-level labels were obtained by crowd-sourcing the annotation of language learners' recordings. Phonetic posterior probability estimates extracted using automatic speech recognition systems trained in each language were calibrated using a beta calibration approach [1], and language proficiency level was estimated using an ordinal regression [2]. Fusion with language recognition (LR) scores from an i-vector system [3] trained on 23 languages is also explored. Initial results were promising for all three languages, and it was demonstrated that the calibrated posteriors were effective for predicting pronunciation proficiency. Significant relative gains of 16% mean absolute error for the ordinal regression and 17% normalized cross entropy for the binary beta regression were achieved on MSA through fusion with LR scores.
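
As a rough illustration of the beta calibration step mentioned in this summary, the sketch below applies the standard three-parameter beta calibration map and fits it by gradient descent on log loss. It is a minimal sketch, not the paper's implementation: the function names, learning rate, and step count are assumptions chosen for illustration.

```python
import math

_EPS = 1e-12  # assumed clipping floor so the logarithms stay finite


def _clip(s):
    """Keep a raw score strictly inside (0, 1)."""
    return min(max(s, _EPS), 1.0 - _EPS)


def beta_calibrate(s, a, b, c):
    """Map a raw posterior s to sigmoid(a*ln(s) - b*ln(1-s) + c)."""
    s = _clip(s)
    z = a * math.log(s) - b * math.log(1.0 - s) + c
    return 1.0 / (1.0 + math.exp(-z))


def fit_beta_calibration(scores, labels, lr=0.1, steps=2000):
    """Fit (a, b, c) by plain gradient descent on the mean log loss."""
    a, b, c = 1.0, 1.0, 0.0  # a=b=1, c=0 is the identity map
    n = len(scores)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for s, y in zip(scores, labels):
            err = beta_calibrate(s, a, b, c) - y  # gradient of log loss w.r.t. z
            s = _clip(s)
            ga += err * math.log(s)           # dz/da = ln(s)
            gb += err * -math.log(1.0 - s)    # dz/db = -ln(1 - s)
            gc += err                         # dz/dc = 1
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c
```

With a = b = 1 and c = 0 the map leaves scores unchanged, which makes the identity a convenient starting point for the fit.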

NetProf iOS pronunciation feedback demonstration

December 13, 2015

Conference Paper

Author:

Tamas Marius

…

Published in:

IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, 13 December 2015.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

One of the greatest challenges for an adult learning a new language is gaining the ability to distinguish and produce foreign sounds. The US Government trains 3,600 enlisted soldiers a year at the Defense Language Institute Foreign Language Center (DLIFLC) in languages critical to national security, most of which are not widely studied in the U.S. Many students struggle to attain speaking fluency and proper pronunciation. Teaching pronunciation is a time-intensive task for teachers that requires them to give individual feedback to students during classroom hours. This limits the time teachers can spend imparting other information, and students may feel embarrassed or inhibited when they practice with their classmates. Given the demand for students educated in foreign languages and the limited number of qualified teachers in languages of interest, there is a growing need for computer-based tools students can use to practice and receive feedback at their own pace and schedule. Most existing tools are limited to listening to pre-recorded audio with limited or nonexistent support for pronunciation feedback. MIT Lincoln Laboratory has developed a new tool, Net Pronunciation Feedback (NetProF), to address these challenges and improve student pronunciation and general language fluency.

Discrimination between singing and speech in real-world audio

December 7, 2014

Conference Paper

Author:

Brian J. Thompson

Published in:

SLT 2014, IEEE Spoken Language Technology Workshop, 7-10 December 2014.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The performance of a spoken language system suffers when non-speech is incorrectly classified as speech. Singing is particularly difficult to discriminate from speech, since both are natural language. However, singing conveys a melody, whereas speech does not; in particular, a singer's fundamental frequency should not deviate significantly from an underlying sequence of notes, while a speaker's fundamental frequency is freer to deviate about a mean value. This work presents a novel approach to discrimination between singing and speech that exploits the distribution of such deviations. The melody in singing is typically not known a priori, so the distribution cannot be measured directly. Instead, an approximation to its Fourier transform is proposed that allows the unknown melody to be treated as multiplicative noise. This feature vector is shown to be highly discriminative between speech and singing segments when coupled with a simple maximum likelihood classifier, outperforming prior work on real-world data.

Comparing a high and low-level deep neural network implementation for automatic speech recognition

November 17, 2014

Conference Paper

Author:

Jessica M. Ray

…

Published in:

1st Workshop for High Performance Technical Computing in Dynamic Languages, HPTCDL 2014, 17 November 2014.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

The use of deep neural networks (DNNs) has improved performance in several fields including computer vision, natural language processing, and automatic speech recognition (ASR). The increased use of DNNs in recent years has been largely due to performance afforded by GPUs, as the computational cost of training large networks on a CPU is prohibitive. Many training algorithms are well-suited to the GPU; however, writing hand-optimized GPGPU code is a significant undertaking. More recently, high-level libraries have attempted to simplify GPGPU development by automatically performing tasks such as optimization and code generation. This work utilizes Theano, a high-level Python library, to implement a DNN for the purpose of phone recognition in ASR. Performance is compared against a low-level, hand-optimized C++/CUDA DNN implementation from Kaldi, a popular ASR toolkit. Results show that the DNN implementation in Theano has CPU and GPU runtimes on par with those of Kaldi, while requiring approximately 95% fewer lines of code.

The MIT-LL/AFRL IWSLT-2010 MT system

December 2, 2010

Conference Paper

Author:

Wade Shen

…

Published in:

Proc. Int. Workshop on Spoken Language Translation, IWSLT, 2 December 2010.

Topic:

machine translation

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper describes the MIT-LL/AFRL statistical MT system and the improvements that were developed during the IWSLT 2010 evaluation campaign. As part of these efforts, we experimented with a number of extensions to the standard phrase-based model that improve performance on the Arabic and Turkish to English translation tasks. We also participated in the new French to English BTEC and English to French TALK tasks. We discuss the architecture of the MIT-LL/AFRL MT system, improvements over our 2008 system, and experiments we ran during the IWSLT-2010 evaluation. Specifically, we focus on 1) cross-domain translation using MAP adaptation, 2) Turkish morphological processing and translation, 3) improved Arabic morphology for MT preprocessing, and 4) system combination methods for machine translation.

Query-by-example spoken term detection using phonetic posteriorgram templates

December 13, 2009

Conference Paper

Author:

Timothy J. Hazen

…

Published in:

Proc. IEEE Workshop on Automatic Speech Recognition & Understanding, ASRU, 13-17 December 2009, pp. 421-426.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper examines a query-by-example approach to spoken term detection in audio files. The approach is designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable. Instead of using word or phone strings as search terms, the user presents the system with audio snippets of desired search terms to act as the queries. Query and test materials are represented using phonetic posteriorgrams obtained from a phonetic recognition system. Query matches in the test data are located using a modified dynamic time warping search between query templates and test utterances. Experiments using this approach are presented using data from the Fisher corpus.
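
The template matching idea in this summary can be sketched with ordinary dynamic time warping over posteriorgram frames. This is a minimal sketch under stated assumptions: the paper describes a modified DTW search whose details are not given here, so the code below is plain DTW, and the negative-log inner-product frame distance, the probability floor, and the function names are all illustrative choices.

```python
import math


def frame_distance(p, q, floor=1e-10):
    """Distance between two posterior vectors: -log of their inner product.

    The floor (an assumed value) avoids log(0) for non-overlapping frames.
    """
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return -math.log(max(dot, floor))


def dtw_cost(query, test):
    """Total cost of the best monotonic alignment of query to test.

    Both arguments are lists of frames; each frame is a list of phone
    posteriors. Standard DTW recursion with unit-weight steps.
    """
    n, m = len(query), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], test[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # query frame advances alone
                cost[i][j - 1],      # test frame advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A lower cost means the test segment follows the same phonetic trajectory as the query, even when one is spoken more slowly than the other.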

A comparison of query-by-example methods for spoken term detection

September 6, 2009

Conference Paper

Author:

Wade Shen

…

Published in:

INTERSPEECH 2009, 6-10 September 2009.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

In this paper we examine an alternative interface for phonetic search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods. We develop three methods that compare query lattices derived from example audio against a standard n-gram-based phonetic index and we analyze factors affecting the performance of these systems. We show that the best systems under this paradigm are able to achieve 77% precision when retrieving utterances from conversational telephone speech and returning 10 results from a single query (performance that is better than a similar dictionary-based approach), suggesting significant utility for applications requiring high precision. We also show that these systems can be further improved using relevance feedback: by incorporating four additional queries the precision of the best system can be improved by 13.7% relative. Our systems perform well despite high phone recognition error rates (> 40%) and make use of no pronunciation or letter-to-sound resources.
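
The two metrics quoted in this summary, precision over the top 10 returned results and relative improvement, can be written down as follows. This is a small sketch of the usual definitions; the function names are hypothetical and the code does not reproduce the paper's figures.

```python
def precision_at_k(ranked_hits, k=10):
    """Fraction of the top-k returned items that are relevant.

    ranked_hits is a list of booleans in rank order (True = relevant).
    Dividing by k treats missing results as misses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for hit in ranked_hits[:k] if hit) / k


def relative_gain(baseline, improved):
    """Relative improvement of a score over a nonzero baseline."""
    return (improved - baseline) / baseline
```

Note that a relative gain is measured against the baseline value, so it is not the same as a difference in percentage points.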

Cognitive services for the user

January 1, 2009

Book Chapter

Author:

Joseph P. Campbell Jr

…

Published in:

Chapter 10, Cognitive Radio Technology, 2009, pp. 305-324.

Topic:

biometrics

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

Software-defined cognitive radios (CRs) use voice as a primary input/output (I/O) modality and are expected to have substantial computational resources capable of supporting advanced speech- and audio-processing applications. This chapter extends previous work on speech applications (e.g., [1]) to cognitive services that enhance military mission capability by capitalizing on automatic processes, such as speech information extraction and understanding the environment. Such capabilities go beyond interaction with the intended user of the software-defined radio (SDR); they extend to speech and audio applications that can be applied to information that has been extracted from voice and acoustic noise gathered from other users and entities in the environment. For example, in a military environment, situational awareness and understanding could be enhanced by informing users based on processing voice and noise from both friendly and hostile forces operating in a given battle space. This chapter provides a survey of a number of speech- and audio-processing technologies and their potential applications to CR, including:
- A description of the technology and its current state of practice.
- An explanation of how the technology is currently being applied, or could be applied, to CR.
- Descriptions and concepts of operations for how the technology can be applied to benefit users of CRs.
- A description of relevant future research directions for both the speech and audio technologies and their applications to CR.

A pictorial overview of many of the core technologies with some applications presented in the following sections is shown in Figure 10.1. Also shown are some overlapping components between the technologies. For example, Gaussian mixture models (GMMs) and support vector machines (SVMs) are used in both speaker and language recognition technologies [2]. These technologies and components are described in further detail in the following sections.

Speech and concierge cognitive services and their corresponding applications are covered in the following sections. The services covered include speaker recognition, language identification (LID), text-to-speech (TTS) conversion, speech-to-text (STT) conversion, machine translation (MT), background noise suppression, speech coding, speaker characterization, noise management, noise characterization, and concierge services. These technologies and their potential applications to CR are discussed at varying levels of detail commensurate with their innovation and utility.

Efficient speech translation through confusion network decoding

November 1, 2008

Journal Article

Author:

Nicola Bertoldi

…

Published in:

IEEE Trans. Audio Speech Lang. Proc., Vol. 16, No. 8, November 2008, pp. 1696-1705.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper describes advances in the use of confusion networks as an interface between automatic speech recognition and machine translation. In particular, it presents a decoding algorithm for confusion networks that extends a state-of-the-art phrase-based text translation decoder. The confusion network decoder significantly improves on previous work in this direction in both efficiency and performance, and outperforms the background text translation system. Experimental results in terms of translation accuracy and decoding efficiency are reported for the task of translating plenary speeches of the European Parliament from Spanish to English and from English to Spanish.
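
For readers unfamiliar with the data structure, a confusion network can be pictured as a sequence of slots, each holding competing word hypotheses with posterior probabilities. The sketch below only illustrates that structure by reading off the single most probable path; it is not the decoding algorithm from the paper, which translates over the network rather than over one path. The `<eps>` empty-word marker and the function name are assumptions made for this example.

```python
EPS = "<eps>"  # assumed marker for "no word in this slot"


def best_path(network):
    """Return the most probable word sequence and its probability.

    network is a list of slots; each slot maps a word hypothesis to its
    posterior probability. Slots are treated as independent, so the best
    path takes the top hypothesis in every slot.
    """
    words, prob = [], 1.0
    for slot in network:
        word, p = max(slot.items(), key=lambda kv: kv[1])
        prob *= p
        if word != EPS:  # empty-word arcs contribute probability but no token
            words.append(word)
    return words, prob
```

A toy network such as `[{"the": 0.9, "a": 0.1}, {"cat": 0.6, "cap": 0.4}]` encodes four candidate transcriptions in two slots, which is what makes the representation compact compared with an n-best list.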

Two protocols comparing human and machine phonetic discrimination performance in conversational speech

September 22, 2008

Conference Paper

Author:

Wade Shen

…

Published in:

INTERSPEECH 2008, 22-26 September 2008, pp. 1630-1633.

Topic:

speech recognition

R&D area:

Cyber Security and Information Sciences

R&D group:

Artificial Intelligence Technology and Systems

Summary

This paper describes two experimental protocols for direct comparison of human and machine phonetic discrimination performance in continuous speech. These protocols attempt to isolate phonetic discrimination while controlling for language and segmentation biases. Results of two human experiments are described, including comparisons with automatic phonetic recognition baselines. Our experiments suggest that in conversational telephone speech, human performance on these tasks exceeds that of machines by 15%. Furthermore, in a related controlled language model experiment, human subjects were better able to correctly predict words in conversational speech by 45%.
