Publications

Artificial intelligence: short history, present developments, and future outlook, final report

Summary

The Director's Office at MIT Lincoln Laboratory (MIT LL) requested a comprehensive study on artificial intelligence (AI) focusing on present applications and future science and technology (S&T) opportunities in the Cyber Security and Information Sciences Division (Division 5). This report elaborates on the main results from the study. Because the AI field is evolving so rapidly, the study scope was limited to the recent past and ongoing developments leading to a set of findings and recommendations. It was important to begin with a short AI history and a lay of the land on representative developments across the Department of Defense (DoD), intelligence community (IC), and Homeland Security. These areas are addressed in more detail within the report.

A main deliverable from the study was the formulation of an end-to-end AI canonical architecture suitable for a range of applications. This canonical architecture serves as the guiding framework for all the sections in this report. Although the study focused primarily on cyber security and information sciences, the enabling technologies are broadly applicable to many other areas; therefore, Section 3 is dedicated to enabling technologies. The discussion of enabling technologies helps the reader distinguish among AI, machine learning algorithms, and the specific techniques needed to make an end-to-end AI system viable.

To understand the lay of the land in AI, study participants reached out broadly within MIT LL and external to the Laboratory (government, commercial companies, the defense industrial base, peers, academia, and AI centers). In addition to the study participants (listed in the acknowledgements in the next section), we also assembled an internal review team (IRT). The IRT was extremely helpful in providing feedback and in shaping the study briefings as we transitioned from data-gathering mode to study synthesis. The format followed throughout the study was to highlight relevant content that substantiates the study findings and to identify a set of recommendations.

An important finding is the significant AI investment by the so-called "big 6" commercial companies: Google, Amazon, Facebook, Microsoft, Apple, and IBM. These companies dominate AI research and development (R&D) investment within the U.S. ecosystem. According to a recent McKinsey Global Institute report, R&D investment in AI amounts to about $30 billion per year, substantially higher than the R&D investment within the DoD, IC, and Homeland Security. Therefore, the DoD will need to be very strategic about investing where needed, while leveraging the technologies already developed and available from a wide range of commercial applications.

As we discuss in Section 1 as part of the AI history, MIT LL has been instrumental in developing advanced AI capabilities. For example, MIT LL has a long history in the development of human language technologies (HLT), successfully applying machine learning algorithms to difficult problems in speech recognition, machine translation, and speech understanding. Section 4 elaborates on prior applications of these technologies, as well as newer applications in the context of multiple modalities (e.g., speech, text, images, and video). An end-to-end AI system is very well suited to enhancing the capabilities of human language analysis.

Section 5 discusses AI's nascent role in cyber security. There have been cases where AI has already provided important benefits; however, much more research is needed both on the application of AI to cyber security and on the associated vulnerability to so-called adversarial AI. Adversarial AI is critically important to the DoD, IC, and Homeland Security because malicious adversaries can disrupt AI systems and render them untrusted in operational environments. This report concludes with specific recommendations formulating the way forward for Division 5 and a discussion of S&T challenges and opportunities. These challenges and opportunities are centered on the key elements of the AI canonical architecture to strengthen AI capabilities across the DoD, IC, and Homeland Security in support of national security.

Multi-lingual deep neural networks for language recognition

Published in:
SLT 2016, IEEE Spoken Language Technology Workshop, 13-16 December 2016.

Summary

Multi-lingual feature extraction using bottleneck layers in deep neural networks (BN-DNNs) has proven to be an effective technique for low-resource speech recognition and, more recently, for language recognition. In this work, we investigate how the multi-lingual BN-DNN architecture and training configuration affect language recognition performance on the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best-performing multi-lingual BN-DNN configuration yields relative performance gains of 50% on LRE11 and 40% on LRE15 compared to a standard MFCC/SDC baseline system, and of 17% on LRE11 and 7% on LRE15 relative to a single-language BN-DNN system. A detailed performance analysis using data from all 24 Babel languages, Fisher Spanish, and Switchboard English shows the impact of language selection and the amount of training data on overall BN-DNN performance.
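
As a concrete illustration of the bottleneck idea, the following sketch shows how frame-level features can be taken from a narrow hidden layer of a feed-forward DNN. This is a minimal, untrained toy model in numpy; the layer sizes, the 64-dimensional bottleneck, and the senone output dimension are illustrative assumptions, not the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckDNN:
    """Feed-forward DNN with a narrow bottleneck hidden layer.

    In a real system the weights are trained jointly on senone targets
    from several languages; here they are random placeholders.
    """

    def __init__(self, dims=(60, 1024, 1024, 64, 1024, 3000)):
        # dims: stacked acoustic frames -> hidden -> 64-dim bottleneck -> hidden -> senone outputs
        self.weights = [rng.standard_normal((i, o)) * 0.01
                        for i, o in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(o) for o in dims[1:]]
        self.bottleneck_index = 2  # layer whose activations serve as features

    def bottleneck_features(self, frames):
        """Return bottleneck-layer activations for a (T, dims[0]) frame matrix."""
        h = frames
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = relu(h @ w + b)
            if i == self.bottleneck_index:
                return h  # (T, 64) features passed to the language recognition back-end
        return h

# Example: 200 frames of 60-dimensional acoustic features
dnn = BottleneckDNN()
feats = dnn.bottleneck_features(rng.standard_normal((200, 60)))
print(feats.shape)  # (200, 64)
```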

Speaker recognition using real vs synthetic parallel data for DNN channel compensation

Published in:
INTERSPEECH 2016: 16th Annual Conf. of the Int. Speech Communication Assoc., 8-12 September 2016.

Summary

Recent work has shown large performance gains using denoising DNNs for speech processing tasks under challenging acoustic conditions. However, training these DNNs requires large amounts of parallel multichannel speech data, which can be impractical or expensive to collect. The effective use of synthetic parallel data as an alternative has been demonstrated for several speech technologies, including automatic speech recognition and speaker recognition (SR). This paper demonstrates that denoising DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same performance gains are achieved using synthetic data generated with a limited number of room impulse responses (RIRs) and noise sources derived from Mixer 2. Using RIRs from the three publicly available sources in the Kaldi ASpIRE recipe yields somewhat lower pooled gains of 34% EER and 25% min DCF. These results confirm the effective use of synthetic parallel data for DNN channel compensation, even when the RIRs used for synthesizing the data are not particularly well matched to the task.
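
The following is a minimal sketch of how a synthetic parallel training pair might be generated: convolve a clean utterance with a room impulse response and add background noise at a target SNR. The sampling rate, reverberation decay, and SNR values are illustrative assumptions, not the corpus settings used in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)

def add_noise_at_snr(signal, noise, snr_db):
    """Scale a noise segment so the signal-to-noise ratio equals snr_db, then add it."""
    noise = np.resize(noise, signal.shape)
    sig_pow = np.mean(signal ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

def synthesize_channel(clean, rir, noise, snr_db=15.0):
    """Create a synthetic far-field copy of a clean utterance: convolve with a
    room impulse response, then add background noise at the requested SNR."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    return add_noise_at_snr(reverberant, noise, snr_db)

# Illustrative signals standing in for real clean speech, a measured RIR, and a noise recording
fs = 8000
clean = rng.standard_normal(fs * 3)                                # 3 s of "speech"
rir_len = int(0.3 * fs)                                            # 300 ms synthetic impulse response
rir = np.exp(-np.arange(rir_len) / (0.05 * fs)) * rng.standard_normal(rir_len)
noise = rng.standard_normal(fs * 3)

degraded = synthesize_channel(clean, rir, noise, snr_db=10.0)
# (clean, degraded) form one parallel training pair for the denoising DNN
```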

Channel compensation for speaker recognition using MAP adapted PLDA and denoising DNNs

Published in:
Odyssey 2016, The Speaker and Language Recognition Workshop, 21-24 June 2016.

Summary

Over several decades, speaker recognition performance has steadily improved for applications using telephone speech. A big part of this improvement has been the availability of large quantities of speaker-labeled data from telephone recordings. For new data applications, such as audio from room microphones, we would like to effectively use existing telephone data to build systems with high accuracy while maintaining good performance on existing telephone tasks. In this paper, we compare and combine approaches to compensate model parameters and features for this purpose. For model adaptation we explore MAP adaptation of hyper-parameters, and for feature compensation we examine the use of denoising DNNs. On a multi-room, multi-microphone speaker recognition experiment, we show a reduction of 61% in EER with a combination of these approaches while slightly improving performance on telephone data.
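
Below is a minimal sketch of the MAP-style hyper-parameter adaptation idea, assuming a simplified PLDA model described by an across-class and a within-class covariance matrix. The relevance factor, matrix sizes, and random placeholder covariances are illustrative assumptions, not the paper's recipe.

```python
import numpy as np

def map_adapt_plda(phi_out, sigma_out, phi_in, sigma_in, n_in, r=50.0):
    """MAP-style interpolation of simplified-PLDA hyper-parameters.

    phi_*   : across-class (speaker) covariance matrices
    sigma_* : within-class (channel) covariance matrices
    n_in    : number of in-domain (e.g., microphone) speakers behind the in-domain estimates
    r       : relevance factor controlling how strongly to trust the in-domain data
    """
    alpha = n_in / (n_in + r)
    phi_adapted = alpha * phi_in + (1.0 - alpha) * phi_out
    sigma_adapted = alpha * sigma_in + (1.0 - alpha) * sigma_out
    return phi_adapted, sigma_adapted

# Illustrative low-dimensional covariances standing in for telephone-trained
# (out-of-domain) and microphone-estimated (in-domain) PLDA parameters
rng = np.random.default_rng(0)
def rand_cov(d=4):
    a = rng.standard_normal((d, d))
    return a @ a.T

phi, sigma = map_adapt_plda(rand_cov(), rand_cov(), rand_cov(), rand_cov(), n_in=200)
print(phi.shape, sigma.shape)  # (4, 4) (4, 4)
```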

The MITLL NIST LRE 2015 Language Recognition System

Summary

In this paper we describe the most recent MIT Lincoln Laboratory language recognition system developed for the NIST 2015 Language Recognition Evaluation (LRE). The submission features a fusion of five core classifiers, with most systems developed in the context of an i-vector framework. The 2015 evaluation presented new paradigms. First, the evaluation included fixed training and open training tracks for the first time; second, language classification performance was measured across 6 language clusters using 20 language classes instead of an N-way language task; and third, performance was measured across a nominal 3-30 second range. Results are presented for the overall performance across the six language clusters for both the fixed and open training tasks. On the 6-cluster metric the Lincoln system achieved overall costs of 0.173 and 0.168 for the fixed and open tasks respectively.
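
For readers unfamiliar with classifier fusion, the sketch below shows a common way to fuse core-classifier scores with linear logistic regression, reduced to a binary target/non-target view. The classifier count, synthetic scores, and use of scikit-learn are illustrative assumptions, not the fusion back-end actually used in the submission.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic development scores from three hypothetical core classifiers:
# one row per trial, one column per classifier (log-likelihood-ratio-like scores).
n_trials = 1000
dev_scores = rng.standard_normal((n_trials, 3))
# Labels loosely correlated with the scores so the fusion has something to learn
dev_labels = (dev_scores.sum(axis=1) + rng.standard_normal(n_trials) > 0).astype(int)

# Linear logistic regression learns one weight per classifier plus an offset,
# acting as the calibration/fusion stage applied before costs are computed.
fuser = LogisticRegression()
fuser.fit(dev_scores, dev_labels)

# Fused score for new trials: weighted sum of the individual classifier scores
eval_scores = rng.standard_normal((10, 3))
fused = eval_scores @ fuser.coef_.ravel() + fuser.intercept_[0]
print(fused.shape)  # (10,)
```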

A unified deep neural network for speaker and language recognition

Published in:
INTERSPEECH 2015: 15th Annual Conf. of the Int. Speech Communication Assoc., 6-10 September 2015.

Summary

Significant performance gains have been reported separately for speaker recognition (SR) and language recognition (LR) tasks using either DNN posteriors of sub-phonetic units or DNN feature representations, but the two techniques have not been compared on the same SR or LR task or across SR and LR tasks using the same DNN. In this work we present the application of a single DNN for both tasks using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained on Switchboard data we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in Cavg on the LRE11 30s test condition. Score fusion and feature fusion are also investigated as is the performance of the DNN technologies at short durations for SR.
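
A minimal sketch of how DNN outputs can stand in for GMM-UBM component posteriors when accumulating the zeroth- and first-order statistics that feed an i-vector extractor. The feature and senone dimensions are illustrative assumptions; this is not the paper's exact recipe.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dnn_bw_stats(features, senone_logits):
    """Accumulate zeroth- and first-order (Baum-Welch) statistics for an
    i-vector extractor, using per-frame DNN senone posteriors in place of
    GMM-UBM component posteriors.

    features      : (T, D) acoustic feature frames
    senone_logits : (T, C) per-frame DNN outputs for C senone classes
    """
    gamma = softmax(senone_logits)  # (T, C) frame-level posteriors
    n = gamma.sum(axis=0)           # (C,)   zeroth-order statistics
    f = gamma.T @ features          # (C, D) first-order statistics
    return n, f

# Illustrative sizes: 300 frames, 40-dim features, 500 senones
rng = np.random.default_rng(0)
n, f = dnn_bw_stats(rng.standard_normal((300, 40)), rng.standard_normal((300, 500)))
print(n.shape, f.shape)  # (500,) (500, 40)
```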

Deep neural network approaches to speaker and language recognition

Published in:
IEEE Signal Process. Lett., Vol. 22, No. 10, October 2015, pp. 1671-5.

Summary

The impressive gains in performance obtained using deep neural networks (DNNs) for automatic speech recognition (ASR) have motivated the application of DNNs to other speech technologies such as speaker recognition (SR) and language recognition (LR). Prior work has shown performance gains for separate SR and LR tasks using DNNs for direct classification or for feature extraction. In this work we present the application of a single DNN to both SR and LR using the 2013 Domain Adaptation Challenge speaker recognition (DAC13) and the NIST 2011 language recognition evaluation (LRE11) benchmarks. Using a single DNN trained for ASR on Switchboard data, we demonstrate large gains in performance on both benchmarks: a 55% reduction in EER for the DAC13 out-of-domain condition and a 48% reduction in Cavg on the LRE11 30 s test condition. It is also shown that further gains are possible using score or feature fusion, leading to the possibility of a single i-vector extractor producing state-of-the-art SR and LR performance.
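
As a small illustration of the feature-fusion option mentioned above, the sketch below simply stacks cepstral features with DNN bottleneck features frame by frame so that a single i-vector extractor could be trained on the combined representation. The dimensions and the normalization step are illustrative assumptions.

```python
import numpy as np

def fuse_features(cepstral, bottleneck):
    """Frame-level feature fusion: stack cepstral and DNN bottleneck features so
    a single i-vector extractor can be trained on the combined representation."""
    assert cepstral.shape[0] == bottleneck.shape[0], "frame counts must match"
    fused = np.hstack([cepstral, bottleneck])
    # Simple per-utterance mean/variance normalization of the fused features
    return (fused - fused.mean(axis=0)) / (fused.std(axis=0) + 1e-8)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((300, 20))  # 20-dim cepstral frames
bnf = rng.standard_normal((300, 64))   # 64-dim bottleneck features
print(fuse_features(mfcc, bnf).shape)  # (300, 84)
```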

The MITLL NIST LRE 2011 language recognition system

Summary

This paper presents a description of the MIT Lincoln Laboratory (MITLL) language recognition system developed for the NIST 2011 Language Recognition Evaluation (LRE). The submitted system consisted of a fusion of four core classifiers, three based on spectral similarity and one based on tokenization. Additional system improvements were achieved following the submission deadline. In a major departure from previous evaluations, the 2011 LRE task focused on closed-set pairwise performance so as to emphasize a system's ability to distinguish confusable language pairs. Results are presented for the 24-language confusable pair task at test utterance durations of 30, 10, and 3 seconds. Results are also shown using the standard detection metrics (DET, minDCF), and it is demonstrated that the previous metrics adequately cover difficult pair performance. On the 30 s, 24-language confusable pair task, the submitted and post-evaluation systems achieved average costs of 0.079 and 0.070, respectively, and standard detection costs of 0.038 and 0.033.
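
The sketch below illustrates the idea behind a pairwise cost: for each ordered pair of languages, a detection cost is computed using only trials from those two languages, and the results are averaged over pairs. This is a simplified toy version with synthetic scores and equal error costs, not the official NIST LRE 2011 scoring tool or its exact cost parameters.

```python
import numpy as np

def pair_cost(target_scores, nontarget_scores, p_target=0.5):
    """Minimum average detection cost for one language pair: sweep a threshold
    over scores of target-language trials vs. trials of one confusable
    non-target language (equal miss/false-alarm costs assumed)."""
    best = 1.0
    for t in np.concatenate([target_scores, nontarget_scores]):
        p_miss = np.mean(target_scores < t)
        p_fa = np.mean(nontarget_scores >= t)
        best = min(best, p_target * p_miss + (1.0 - p_target) * p_fa)
    return best

def average_pair_cost(scores, labels, languages):
    """Average the minimum pair cost over all ordered language pairs.

    scores : dict mapping language -> detector scores (one per trial) for that language's detector
    labels : true language of each trial
    """
    costs = []
    for lt in languages:
        for ln in languages:
            if lt != ln:
                costs.append(pair_cost(scores[lt][labels == lt], scores[lt][labels == ln]))
    return float(np.mean(costs))

rng = np.random.default_rng(0)
langs = ["eng", "spa", "ara"]
labels = rng.choice(langs, 300)
scores = {l: rng.standard_normal(300) + 2.0 * (labels == l) for l in langs}
print(average_pair_cost(scores, labels, langs))
```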

NAP for high level language identification

Published in:
ICASSP 2011, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 22-27 May 2011, pp. 4392-4395.

Summary

Varying channel conditions present a difficult problem for many speech technologies such as language identification (LID). Channel compensation techniques have been shown to significantly improve performance in LID for acoustic systems. For high-level token systems, nuisance attribute projection (NAP) has been shown to perform well in the context of speaker identification. In this work, we describe a novel approach to dealing with the high dimensional sparse NAP training problem as applied to a 4-gram phonotactic LID system run on the NIST 2009 Language Recognition Evaluation (LRE) task. We demonstrate performance gains on the Voice of America (VOA) portion of the 2009 LRE data.
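
A minimal dense sketch of nuisance attribute projection: estimate the top directions of within-class (session/channel) variability and project them out of the feature vectors. The dimensions here are small and dense for clarity; the paper's contribution concerns the high-dimensional sparse training problem, which this sketch does not address.

```python
import numpy as np

def nap_projection(vectors, class_labels, k=10):
    """Estimate a rank-k nuisance subspace from within-class variability and
    return the projection P = I - U U^T that removes it.

    vectors      : (N, D) expanded feature vectors (e.g., phonotactic n-gram statistics)
    class_labels : (N,) class label for each vector (the language, for LID)
    """
    x = vectors.astype(float).copy()
    # Remove each class mean so the remaining variation is nuisance (channel/session)
    for lab in np.unique(class_labels):
        idx = class_labels == lab
        x[idx] -= x[idx].mean(axis=0)
    # Top-k directions of within-class variability form the nuisance basis U
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    u = vt[:k].T                                   # (D, k)
    return np.eye(vectors.shape[1]) - u @ u.T      # projection matrix P

rng = np.random.default_rng(0)
vecs = rng.standard_normal((200, 50))
labels = rng.integers(0, 5, 200)
p = nap_projection(vecs, labels, k=5)
compensated = vecs @ p    # nuisance-compensated vectors passed to the classifier
print(compensated.shape)  # (200, 50)
```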

The MIT LL 2010 speaker recognition evaluation system: scalable language-independent speaker recognition

Published in:
Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP, 22-27 May 2011, pp. 5272-5275.

Summary

Research in the speaker recognition community has continued to address methods of mitigating variational nuisances. Telephone and auxiliary-microphone recorded speech emphasize the need for a robust way of dealing with unwanted variation. The design of the recent 2010 NIST Speaker Recognition Evaluation (SRE) reflects this research emphasis. In this paper, we present the MIT submission applied to the tasks of the 2010 NIST SRE, with two main goals: language-independent scalable modeling and robust nuisance mitigation. For modeling, exclusive use of inner product-based and cepstral systems produced a language-independent, computationally scalable system. For robustness, systems were implemented that captured spectral and prosodic information, modeled nuisance subspaces using multiple novel methods, and fused scores of multiple systems. The performance of the system is presented on a subset of the NIST SRE 2010 core tasks.
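
As a small illustration of the inner product-based scoring the summary refers to, the sketch below computes a cosine similarity between two length-normalized i-vectors; in a full system both vectors would first be nuisance-compensated (e.g., LDA/WCCN or NAP-projected). The vector dimension is an illustrative assumption.

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Inner-product (cosine similarity) score between two length-normalized
    i-vectors, the kind of computationally scalable scoring used in
    inner product-based systems."""
    e = enroll_ivec / np.linalg.norm(enroll_ivec)
    t = test_ivec / np.linalg.norm(test_ivec)
    return float(e @ t)

rng = np.random.default_rng(0)
print(cosine_score(rng.standard_normal(400), rng.standard_normal(400)))
```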