Publications

Unsupervised Bayesian adaptation of PLDA for speaker verification

Published in:
Interspeech, 30 August - 3 September 2021.

Summary

This paper presents a Bayesian framework for unsupervised domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA). By interpreting class labels as latent random variables, Variational Bayes (VB) is used to derive a maximum a posteriori (MAP) solution of the adapted PLDA model when labels are missing, referred to as VB-MAP. The VB solution iteratively infers class labels and updates PLDA hyperparameters, offering a systematic framework for dealing with unlabeled data. While presented as a general solution, this paper includes experimental results for domain adaptation in speaker verification. VB-MAP estimation is applied to the 2016 and 2018 NIST Speaker Recognition Evaluations (SREs), both of which included small and unlabeled in-domain data sets, and is shown to provide performance improvements over a variety of state-of-the-art domain adaptation methods. Additionally, VB-MAP estimation is used to train a fully unsupervised PLDA model, suffering only minor performance degradation relative to conventional supervised training, offering promise for training PLDA models when no relevant labeled data exists.
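The alternating scheme described in the abstract (infer latent class labels, then update model parameters toward a prior) can be sketched with a toy EM-style loop. This is an illustration of the idea only, not the paper's VB-MAP derivation for PLDA: the 1-D Gaussian model, the function name, and the pseudo-count prior weight `tau` below are all hypothetical.

```python
import numpy as np

def toy_unsupervised_map_adapt(x, mu_prior, sigma=1.0, tau=1.0, n_iter=20):
    """Toy EM-style loop in the spirit of the abstract: alternately infer
    latent class labels (E-step) and update parameters with a MAP estimate
    that shrinks toward an out-of-domain prior (M-step). NOT the paper's
    VB-MAP PLDA update; a 1-D two-class Gaussian illustration only."""
    mu = np.array(mu_prior, dtype=float)      # start at the prior (out-of-domain) means
    for _ in range(n_iter):
        # E-step: soft posterior responsibilities of each class for each sample
        logp = -0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: MAP mean update; the prior acts like tau pseudo-observations
        n_k = r.sum(axis=0)
        mu = (r.T @ x + tau * np.array(mu_prior)) / (n_k + tau)
    return mu, r

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.2, 1.0, 200), rng.normal(2.1, 1.0, 200)])
mu, r = toy_unsupervised_map_adapt(x, mu_prior=[-1.0, 1.0])
```

With a weak prior (small `tau`) the adapted means track the in-domain data; a large `tau` keeps them near the out-of-domain starting point, mirroring the supervised-to-unsupervised trade-off the abstract describes.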

PATHATTACK: attacking shortest paths in complex networks

Summary

Shortest paths in complex networks play key roles in many applications. Examples include routing packets in a computer network, routing traffic on a transportation network, and inferring semantic distances between concepts on the World Wide Web. An adversary with the capability to perturb the graph might make the shortest path between two nodes route traffic through advantageous portions of the graph (e.g., a toll road he owns). In this paper, we introduce the Force Path Cut problem, in which there is a specific route the adversary wants to promote by removing a minimum number of edges in the graph. We show that Force Path Cut is NP-complete, but also that it can be recast as an instance of the Weighted Set Cover problem, enabling the use of approximation algorithms. The size of the universe for the set cover problem is potentially factorial in the number of nodes. To overcome this hurdle, we propose the PATHATTACK algorithm, which via constraint generation considers only a small subset of paths (at most 5% of the number of edges in 99% of our experiments). Across a diverse set of synthetic and real networks, the linear programming formulation of Weighted Set Cover yields the optimal solution in over 98% of cases. We also demonstrate a time/cost tradeoff using two approximation algorithms and greedy baseline methods. This work provides a foundation for addressing similar problems and expands the area of adversarial graph mining beyond recent work on node classification and embedding.
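The Weighted Set Cover reduction described in the abstract can be illustrated with a plain greedy cover over a handful of hypothetical paths: each competing path must be "covered" (broken) by removing at least one of its edges, edges are the sets, and removal costs are the weights. PATHATTACK itself uses an LP formulation with constraint generation rather than this exhaustive greedy sketch.

```python
def greedy_edge_cut(paths, edge_cost):
    """Greedy Weighted Set Cover sketch of the abstract's reduction.
    Illustrative only -- PATHATTACK avoids enumerating all paths via
    constraint generation and an LP relaxation."""
    uncovered = [set(p) for p in paths]   # each competing path as a set of edges
    removed = set()
    while any(uncovered):
        best, best_ratio = None, float("inf")
        for e, c in edge_cost.items():
            # pick the edge with the lowest cost per newly broken path
            gain = sum(1 for p in uncovered if e in p)
            if gain and c / gain < best_ratio:
                best, best_ratio = e, c / gain
        removed.add(best)
        uncovered = [p for p in uncovered if best not in p]
    return removed

# Three hypothetical competing paths (as edge sets) and edge removal costs:
paths = [{"ab", "bc"}, {"ab", "bd"}, {"ce", "ef"}]
costs = {"ab": 1.0, "bc": 1.0, "bd": 1.0, "ce": 2.0, "ef": 1.0}
cut = greedy_edge_cut(paths, costs)
```

Here edge `ab` is chosen first because it breaks two paths at once, then `ef` breaks the remaining path more cheaply than `ce`.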

Combating Misinformation: HLT Highlights from MIT Lincoln Laboratory

Published in:
Human Language Technology Conference (HLTCon), 16-18 March 2021.

Summary

Dr. Joseph Campbell shares several human language technologies highlights from MIT Lincoln Laboratory. These include key enabling technologies in combating misinformation to link personas, analyze content, and understand human networks. Developing operationally relevant technologies requires access to corresponding data with meaningful evaluations, as Dr. Douglas Reynolds presented in his keynote. As Dr. Danelle Shah discussed in her keynote, it’s crucial to develop these technologies to operate at deeper levels than the surface. Producing reliable information from the fusion of missing and inherently unreliable information channels is paramount. Furthermore, the dynamic misinformation environment and the coevolution of allied methods with adversarial methods represent additional challenges.

Combating Misinformation: What HLT Can (and Can't) Do When Words Don't Say What They Mean

Published in:
Human Language Technology Conference (HLTCon), 16-18 March 2021.

Summary

Misinformation, disinformation, and “fake news” have been used as a means of influence for millennia, but the proliferation of the internet and social media in the 21st century has enabled nefarious campaigns to achieve unprecedented scale, speed, precision, and effectiveness. In the past few years, there has been significant recognition of the threats posed by malign influence operations to geopolitical relations, democratic institutions and processes, public health and safety, and more. At the same time, the digitization of communication offers tremendous opportunities for human language technologies (HLT) to observe, interpret, and understand this publicly available content. The ability to infer intent and impact, however, remains much more elusive.

Speaker separation in realistic noise environments with applications to a cognitively-controlled hearing aid

Summary

Future wearable technology may provide for enhanced communication in noisy environments and for the ability to pick out a single talker of interest in a crowded room simply by the listener shifting their attentional focus. Such a system relies on two components, speaker separation and decoding the listener's attention to acoustic streams in the environment. To address the former, we present a system for joint speaker separation and noise suppression, referred to as the Binaural Enhancement via Attention Masking Network (BEAMNET). The BEAMNET system is an end-to-end neural network architecture based on self-attention. Binaural input waveforms are mapped to a joint embedding space via a learned encoder, and separate multiplicative masking mechanisms are included for noise suppression and speaker separation. Pairs of output binaural waveforms are then synthesized using learned decoders, each capturing a separated speaker while maintaining spatial cues. A key contribution of BEAMNET is that the architecture contains a separation path, an enhancement path, and an autoencoder path. This paper proposes a novel loss function which simultaneously trains these paths, so that disabling the masking mechanisms during inference causes BEAMNET to reconstruct the input speech signals. This allows dynamic control of the level of suppression applied by BEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speaker separation. This paper also proposes a perceptually-motivated waveform distance measure. Using objective speech quality metrics, the proposed system is demonstrated to perform well at separating two equal-energy talkers, even in high levels of background noise. Subjective testing shows an improvement in speech intelligibility across a range of noise levels, for signals with artificially added head-related transfer functions and background noise. 
Finally, when used as part of an auditory attention decoder (AAD) system using existing electroencephalogram (EEG) data, BEAMNET is found to maintain the decoding accuracy achieved with ideal speaker separation, even in severe acoustic conditions. These results suggest that this enhancement system is highly effective at decoding auditory attention in realistic noise environments, and could possibly lead to improved speech perception in a cognitively controlled hearing aid.
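The minimum-gain control described above can be illustrated in a few lines: because the network is trained so that disabled masks reconstruct the input, flooring the learned mask at a minimum gain interpolates between full suppression and pass-through. The array shapes, function name, and element-wise floor below are assumptions for illustration, not the BEAMNET implementation.

```python
import numpy as np

def apply_mask_with_floor(embedding, mask, g_min):
    """Sketch of dynamic suppression control: floor the learned mask at
    g_min, so g_min=0 gives full suppression and g_min=1 passes the
    signal through (the trained autoencoder path reconstructs the input).
    Names and shapes are illustrative only."""
    return np.maximum(mask, g_min) * embedding

emb = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([0.0, 0.1, 0.9, 1.0])                    # learned suppression mask
full = apply_mask_with_floor(emb, mask, g_min=0.0)       # full suppression
passthru = apply_mask_with_floor(emb, mask, g_min=1.0)   # pass-through
```

This is the property the abstract highlights as unavailable in other end-to-end separation systems: suppression depth becomes a single inference-time knob rather than a fixed training-time behavior.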

Seasonal Inhomogeneous Nonconsecutive Arrival Process Search and Evaluation

Published in:
25th International Conference on Pattern Recognition [submitted]

Summary

Time series often exhibit seasonal patterns, and identification of these patterns is essential to understanding the data and predicting future behavior. Most methods train on large datasets and can fail to predict far past the training data. This limitation becomes more pronounced when data is sparse. This paper presents a method to fit a model to seasonal time series data that maintains predictive power when data is limited. This method, called SINAPSE, combines statistical model fitting with an information criterion to search for disjoint, and possibly nonconsecutive, regimes underlying the data, allowing for a sparse representation resistant to overfitting.
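The model-fitting-plus-information-criterion search can be illustrated by the criterion step alone: fit several candidate regime structures, score each, and keep the one that best trades likelihood against parameter count. The BIC form is standard, and the candidate names and numbers below are hypothetical, not taken from the paper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better. The abstract's
    search combines statistical model fitting with an information
    criterion to pick a sparse regime structure; this shows only how
    such a criterion penalizes extra regimes."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical candidates: (log-likelihood, number of parameters)
candidates = {"1 regime": (-120.0, 2), "2 regimes": (-100.0, 4), "8 regimes": (-97.0, 16)}
n_obs = 100
scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
```

The penalty term is what makes the representation sparse: the 8-regime model fits slightly better but loses to the 2-regime model once its extra parameters are charged for.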

The Speech Enhancement via Attention Masking Network (SEAMNET): an end-to-end system for joint suppression of noise and reverberation [early access]

Published in:
IEEE/ACM Trans. on Audio, Speech, and Language Processing, Vol. 29, 2021, pp. 515-526.

Summary

This paper proposes the Speech Enhancement via Attention Masking Network (SEAMNET), a neural network-based end-to-end single-channel speech enhancement system designed for joint suppression of noise and reverberation. It formalizes an end-to-end network architecture, referred to as b-Net, which accomplishes noise suppression through attention masking in a learned embedding space. A key contribution of SEAMNET is that the b-Net architecture contains both an enhancement and an autoencoder path. This paper proposes a novel loss function which simultaneously trains both the enhancement and the autoencoder paths, so that disabling the masking mechanism during inference causes SEAMNET to reconstruct the input speech signal. This allows dynamic control of the level of suppression applied by SEAMNET via a minimum gain level, which is not possible in other state-of-the-art approaches to end-to-end speech enhancement. This paper also proposes a perceptually-motivated waveform distance measure. In addition to the b-Net architecture, this paper proposes a novel method for designing target waveforms for network training, so that joint suppression of additive noise and reverberation can be performed by an end-to-end enhancement system, which has not been previously possible. Experimental results show the SEAMNET system to outperform a variety of state-of-the-art baseline systems, both in terms of objective speech quality measures and subjective listening tests. Finally, this paper draws parallels between SEAMNET and conventional statistical model-based enhancement approaches, offering interpretability of many network components.
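The two-path training objective described above can be caricatured as a weighted sum of an enhancement term and an autoencoder (reconstruction) term, so the network learns both to denoise and to reproduce its input when masking is disabled. Plain MSE stands in for the paper's perceptually-motivated waveform distance, and all names and the weighting scheme below are illustrative assumptions.

```python
import numpy as np

def joint_loss(enhanced, target, autoencoded, noisy_input, alpha=0.5):
    """Caricature of a two-path objective: one term drives the enhancement
    path toward clean speech, the other drives the autoencoder path to
    reconstruct the (noisy) input, so a minimum-gain control behaves
    sensibly at inference. Plain MSE is used here; the paper proposes a
    perceptually-motivated waveform distance instead."""
    enh_term = np.mean((enhanced - target) ** 2)
    ae_term = np.mean((autoencoded - noisy_input) ** 2)
    return alpha * enh_term + (1.0 - alpha) * ae_term

clean = np.array([0.0, 1.0, 0.0, -1.0])
noisy = clean + np.array([0.1, -0.1, 0.1, -0.1])
# Perfect behavior on both paths drives the loss to zero:
loss = joint_loss(enhanced=clean, target=clean, autoencoded=noisy, noisy_input=noisy)
```

Training both paths jointly is what distinguishes this design from a single denoising objective: a network trained only to enhance has no incentive to pass the signal through faithfully when its mask is floored.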

The 2019 NIST Speaker Recognition Evaluation CTS Challenge

Published in:
The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted a leaderboard style speaker recognition challenge using conversational telephone speech (CTS) data extracted from the unexposed portion of the Call My Net 2 (CMN2) corpus previously used in the 2018 Speaker Recognition Evaluation (SRE). The SRE19 CTS Challenge was organized in a similar manner to SRE18, except it offered only the open training condition. In addition, similar to the NIST i-vector challenge, the evaluation set consisted of two subsets: a progress subset, and a test subset. The progress subset comprised 30% of the trials and was used to monitor progress on the leaderboard, while the remaining 70% of the trials formed the test subset, which was used to generate the official final results determined at the end of the challenge. Which subset (i.e., progress or test) a trial belonged to was unknown to challenge participants, and each system submission had to contain outputs for all of the trials. The CTS Challenge also served as a prerequisite for entrance to the main SRE19 whose primary task was audio-visual person recognition. A total of 67 organizations (forming 51 teams) from academia and industry participated in the CTS Challenge and submitted 1347 valid system outputs. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in the CTS Challenge. Compared to the CTS track of the SRE18, the SRE19 CTS Challenge results indicate remarkable improvements in performance which are mainly attributed to 1) the availability of large amounts of in-domain development data from a large number of labeled speakers, 2) speaker representations (aka embeddings) extracted using extended and more complex end-to-end neural network frameworks, and 3) effective use of the provided large development set.

The 2019 NIST Audio-Visual Speaker Recognition Evaluation

Published in:
The Speaker and Language Recognition Workshop: Odyssey 2020, 1-5 November 2020.

Summary

In 2019, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). There were two components to SRE19: 1) a leaderboard style Challenge using unexposed conversational telephone speech (CTS) data from the Call My Net 2 (CMN2) corpus, and 2) an Audio-Visual (AV) evaluation using video material extracted from the unexposed portions of the Video Annotation for Speech Technologies (VAST) corpus. This paper presents an overview of the Audio-Visual SRE19 activity, including the task, performance metric, data, evaluation protocol, results, and system performance analyses. The Audio-Visual SRE19 was organized in a similar manner to the audio from video (AfV) track in SRE18, except it offered only the open training condition. In addition, instead of extracting and releasing only the AfV data, unexposed multimedia data from the VAST corpus was used to support the Audio-Visual SRE19. It featured two core evaluation tracks, namely audio only and audio-visual, as well as an optional visual only track. A total of 26 organizations (forming 14 teams) from academia and industry participated in the Audio-Visual SRE19 and submitted 102 valid system outputs. Evaluation results indicate: 1) notable performance improvements for the audio only speaker recognition task on the challenging amateur online video domain due to the use of more complex neural network architectures (e.g., ResNet) along with soft margin losses, 2) state-of-the-art speaker and face recognition technologies provide comparable person recognition performance on the amateur online video domain, and 3) audio-visual fusion results in remarkable performance gains (greater than 85% relative) over the audio only or visual only systems.

Attacking Embeddings to Counter Community Detection

Published in:
Network Science Society Conference 2020 [submitted]

Summary

Community detection can be an extremely useful data triage tool, enabling a data analyst to split a large network into smaller portions for a deeper analysis. If, however, a particular node wanted to avoid scrutiny, it could strategically create new connections that make it seem uninteresting. In this work, we investigate the use of a state-of-the-art attack against node embedding as a means of countering community detection while being blind to the attributes of others. The attack proposed in [1] attempts to maximize the loss function being minimized by a random-walk-based embedding method (where two nodes are made closer together the more often a random walk starting at one node ends at the other). We propose using this method to attack the community structure of the graph, specifically attacking the community assignment of an adversarial vertex. Since nodes in the same community tend to appear near each other in a random walk, their continuous-space embeddings also tend to be close. Thus, we aim to use the general embedding attack in an attempt to shift the community membership of the adversarial vertex. To test this strategy, we adopt an experimental framework as in [2], where each node is given a “temperature” indicating how interesting it is. A node’s temperature can be “hot,” “cold,” or “unknown.” A node can perturb itself by adding new edges to any other node in the graph. The node’s goal is to be placed in a community that is cold, i.e., where the average node temperature is less than 0. Of the 5 attacks proposed in [2], we use 2 in our experiments. The simpler attack is Cold and Lonely, which first connects to cold nodes, then unknown, then hot, and connects within each temperature in order of increasing degree. The more sophisticated attack is Stable Structure.
The procedure for this attack is to (1) identify stable structures (containing nodes assigned to the same community each time for several trials), (2) connect to nodes in order of increasing average temperature of their stable structures (randomly within a structure), and (3) connect to nodes with no stable structure in order of increasing temperature. As in [2], we use the Louvain modularity maximization technique for community detection. We slightly modify the embedding attack of [1] by only allowing addition of new edges and requiring that they include the adversary vertex. Since the embedding attack is blind to the temperatures of the nodes, experimenting with these attacks gives insight into how much this attribute information helps the adversary. Experimental results are shown in Figure 1. Graphs considered in these experiments are (1) a 500-node Erdos-Renyi graph with edge probability p = 0.02, (2) a stochastic block model with 5 communities of 100 nodes each and edge probabilities of p_in = 0.06 and p_out = 0.01, (3) the network of Abu Sayyaf Group (ASG), a violent non-state Islamist group operating in the Philippines, where two nodes are linked if they both participate in at least one kidnapping event, with labels derived from stable structures (nodes together in at least 95% of 1000 Louvain trials), and (4) the Cora machine learning citation graph, with 7 classes based on subject area. Temperature is assigned to the Erdos-Renyi nodes randomly with probability 0.25, 0.5, and 0.25 for hot, unknown, and cold, respectively. For the other graphs, nodes with the same label as the target are hot, unknown, and cold with probability 0.35, 0.55, and 0.1, respectively, and the hot and cold probabilities are swapped for other labels.
The results demonstrate that, even without the temperature information, the embedding method is about as effective as Cold and Lonely when there is community structure to exploit, though it is not as effective as Stable Structure, which leverages both community structure and temperature information.
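The stable-structure notion used in these experiments (nodes co-assigned to the same community across repeated Louvain trials) can be sketched directly from a list of partitions. The helper below is hypothetical: it omits the community detection itself and takes partitions as plain node-to-community dicts.

```python
def stable_structures(partitions, threshold=0.95):
    """Sketch of the 'stable structure' notion: two nodes share a stable
    structure if they land in the same community in at least `threshold`
    of the detection trials. `partitions` is a list of dicts mapping
    node -> community id from repeated (e.g., Louvain) runs; the
    detection algorithm itself is not shown."""
    nodes = list(partitions[0])
    n_trials = len(partitions)
    structures, assigned = [], set()
    for u in nodes:
        if u in assigned:
            continue
        group = {u}
        for v in nodes:
            if v == u or v in assigned:
                continue
            # fraction of trials in which u and v were co-assigned
            together = sum(1 for p in partitions if p[u] == p[v])
            if together / n_trials >= threshold:
                group.add(v)
        assigned |= group
        structures.append(group)
    return structures

# Toy example: a and b always co-assigned; c moves between communities.
parts = [{"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 2}, {"a": 0, "b": 0, "c": 1}]
groups = stable_structures(parts)
```

In the Stable Structure attack sketched above, these groups would then be ranked by average temperature to decide which nodes the adversary connects to first.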