Publications


Poisoning network flow classifiers [e-print]

Summary

As machine learning (ML) classifiers increasingly oversee the automated monitoring of network traffic, studying their resilience against adversarial attacks becomes critical. This paper focuses on poisoning attacks, specifically backdoor attacks, against network traffic flow classifiers. We investigate the challenging scenario of clean-label poisoning, where the adversary's capabilities are constrained to tampering only with the training data, without the ability to arbitrarily modify the training labels or any other component of the training process. We describe a trigger crafting strategy that leverages model interpretability techniques to generate trigger patterns that are effective even at very low poisoning rates. Finally, we design novel strategies to generate stealthy triggers, including an approach based on generative Bayesian network models, with the goal of minimizing the conspicuousness of the trigger, and thus making detection of an ongoing poisoning campaign more challenging. Our findings provide significant insights into the feasibility of poisoning attacks on network traffic classifiers used in multiple scenarios, including detecting malicious communication and application classification.
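The clean-label constraint described above can be sketched as follows. This is a minimal illustration, not the paper's method: `poison_clean_label`, the `trigger` dictionary, and the poisoning `rate` are hypothetical names, and the paper crafts real triggers from model-interpretability scores rather than fixing feature values by hand.

```python
import random

def poison_clean_label(samples, labels, target, trigger, rate, rng):
    """Plant a trigger in a fraction of TARGET-class training samples.
    Labels are never modified (the clean-label constraint): only the
    feature values of the selected samples change."""
    candidates = [i for i, y in enumerate(labels) if y == target]
    n_poison = max(1, int(rate * len(candidates)))
    chosen = rng.sample(candidates, n_poison)
    for i in chosen:
        for feat, value in trigger.items():
            samples[i][feat] = value  # overwrite trigger features in place
    return sorted(chosen)
```

Note that because the labels are untouched, the poisoned samples must still look plausible for their true class, which is why trigger conspicuousness matters.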

Improving long-text authorship verification via model selection and data tuning

Published in:
Proc. 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, LaTeCH-CLfL2023, 5 May 2023, pp. 28-37.

Summary

Authorship verification is used to link texts written by the same author without needing a model per author, making it useful for deanonymizing users spreading text with malicious intent. Recent advances in Transformer-based language models hold great promise for author verification, though short context lengths and non-diverse training regimes present challenges for their practical application. In this work, we investigate the effect of these challenges in the application of a Cross-Encoder Transformer-based author verification system under multiple conditions. We performed experiments with four Transformer backbones using differently tuned variants of fanfiction data and found that our BigBird pipeline outperformed Longformer, RoBERTa, and ELECTRA and performed competitively against the official top-ranked system from the PAN evaluation. We also examined the effect of authors and fandoms not seen in training on model performance. Through this, we found that fandom has the greatest influence on true trials (pairs of text written by the same author), and that a training dataset balanced in terms of class and fandom performed the most consistently.

A generative approach to condition-aware score calibration for speaker verification

Published in:
IEEE/ACM Trans. Audio, Speech, Language Process., Vol. 31, 2023, pp. 891-901.

Summary

In speaker verification, score calibration is employed to transform verification scores to log-likelihood ratios (LLRs) which are statistically interpretable. Conventional calibration techniques apply a global score transform. However, in condition-aware (CA) calibration, information conveying signal conditions is provided as input, allowing calibration to be adaptive. This paper explores a generative approach to condition-aware score calibration. It proposes a novel generative model for speaker verification trials, each of which includes a trial score, a trial label, and the associated pair of speaker embeddings. Trials are assumed to be drawn from a discrete set of underlying signal conditions which are modeled as latent Categorical random variables, so that trial scores and speaker embeddings are drawn from condition-dependent distributions. An Expectation-Maximization (EM) algorithm for parameter estimation of the proposed model is presented, which does not require condition labels and instead discovers relevant conditions in an unsupervised manner. The generative condition-aware (GCA) calibration transform is then derived as the log-likelihood ratio of a verification score given the observed pair of embeddings. Experimental results show the proposed approach to provide performance improvements on a variety of speaker verification tasks, outperforming static and condition-aware baseline calibration methods. GCA calibration is observed to improve the discriminative ability of the speaker verification system, as well as provide good calibration performance across a range of operating points. The benefits of the proposed method are observed for task-dependent models where signal conditions are known, for universal models which are robust across a range of conditions, and when facing unseen signal conditions.
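The marginalization behind a condition-aware LLR can be sketched as below. This is a simplified illustration, assuming Gaussian condition-dependent score distributions and a precomputed condition posterior; the paper infers that posterior from the pair of speaker embeddings via the EM-trained generative model, and the function and parameter names here are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian, used for per-condition score models."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gca_llr(score, cond_posterior, target_params, nontarget_params):
    """Condition-aware LLR: marginalize condition-dependent score
    likelihoods over the posterior of the latent condition.
    target_params[c] / nontarget_params[c] are (mean, var) pairs."""
    num = sum(p * gaussian_pdf(score, *target_params[c])
              for c, p in enumerate(cond_posterior))
    den = sum(p * gaussian_pdf(score, *nontarget_params[c])
              for c, p in enumerate(cond_posterior))
    return math.log(num / den)
```

With a degenerate condition posterior (all mass on one condition), this reduces to an ordinary single-condition calibration transform, which is one way to sanity-check the marginalization.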

Backdoor poisoning of encrypted traffic classifiers

Summary

Significant recent research has focused on applying deep neural network models to the problem of network traffic classification. At the same time, much has been written about the vulnerability of deep neural networks to adversarial inputs, both during training and inference. In this work, we consider launching backdoor poisoning attacks against an encrypted network traffic classifier. We consider attacks based on padding network packets, which has the benefit of preserving the functionality of the network traffic. In particular, we consider a handcrafted attack, as well as an optimized attack leveraging universal adversarial perturbations. We find that poisoning attacks can be extremely successful if the adversary has the ability to modify both the labels and the data (dirty label attacks) and somewhat successful, depending on the attack strength and the target class, if the adversary perturbs only the data (clean label attacks).
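The key property of padding-based triggers, that packet sizes can only grow and so payload semantics survive, can be sketched as follows. The function name, the fixed MTU cap, and the per-packet pad pattern are illustrative assumptions, not the paper's exact attack (which also includes an optimized variant based on universal adversarial perturbations).

```python
MTU = 1500  # assumed maximum transmission unit: padded sizes must not exceed it

def apply_padding_trigger(packet_sizes, pad_pattern):
    """Apply a backdoor trigger by padding the first packets of a flow.
    Padding can only increase packet sizes, never alter payload bytes,
    so the traffic's functionality is preserved."""
    padded = [min(size + pad, MTU)
              for size, pad in zip(packet_sizes, pad_pattern)]
    # packets beyond the trigger pattern pass through unchanged
    return padded + packet_sizes[len(pad_pattern):]
```

A classifier trained on poisoned flows would then associate this size pattern with the attacker's chosen class, in either the dirty-label or the (harder) clean-label setting.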

Advances in cross-lingual and cross-source audio-visual speaker recognition: The JHU-MIT system for NIST SRE21

Summary

We present a condensed description of the joint effort of JHUCLSP/HLTCOE, MIT-LL and AGH for NIST SRE21. NIST SRE21 consisted of speaker detection over multilingual conversational telephone speech (CTS) and audio from video (AfV). Besides the regular audio track, the evaluation also contains visual (face recognition) and multi-modal tracks. This evaluation exposes new challenges, including cross-source (i.e., CTS vs. AfV) and cross-language trials. Each speaker can speak two or three languages among English, Mandarin and Cantonese. For the audio track, we evaluated embeddings based on Res2Net and ECAPA-TDNN, where the former performed the best. We used PLDA-based back-ends trained on previous SRE and VoxCeleb data and adapted to a subset of Mandarin/Cantonese speakers. Some novel contributions of this submission are: the use of neural bandwidth extension (BWE) to reduce the mismatch between the AfV and CTS conditions; and invariant representation learning (IRL) to make the embeddings from a given speaker invariant to language. Res2Net with neural BWE was the best monolithic system. We used a pre-trained RetinaFace face detector and ArcFace embeddings for the visual track, following our NIST SRE19 work. We also included a new system using a deep pyramid single shot face detector and face embeddings trained on Crystal loss and probabilistic triplet loss, which performed the best. The number of face embeddings in the test video was reduced by agglomerative clustering or by weighting the embeddings based on the face detection confidence. Cosine scoring was used to compare embeddings. For the multi-modal track, we simply added the calibrated likelihood ratios of the audio and visual conditions, assuming independence between modalities. The multi-modal fusion improved Cprimary by 72% w.r.t. audio.
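The multi-modal fusion step described above (adding calibrated likelihood ratios under an independence assumption) is simple enough to sketch directly; the function names here are illustrative, and the posterior conversion is a standard Bayes step, not something specific to this submission.

```python
import math

def fuse_modalities(llrs):
    """Under the conditional-independence assumption, calibrated
    per-modality log-likelihood ratios combine by simple addition."""
    return sum(llrs)

def llr_to_posterior(llr, prior):
    """Convert a (fused) LLR into a target posterior for a given
    target prior, via Bayes' rule in the log-odds domain."""
    logit = llr + math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-logit))
```

The additive form is exactly why calibration matters: the fusion is only valid if each modality's score is already a well-calibrated LLR.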

Advances in speaker recognition for multilingual conversational telephone speech: the JHU-MIT system for NIST SRE20 CTS challenge

Published in:
Speaker and Language Recognition Workshop, Odyssey 2022, pp. 338-345.

Summary

We present a condensed description of the joint effort of JHUCLSP/HLTCOE and MIT-LL for NIST SRE20. NIST SRE20 CTS consisted of multilingual conversational telephone speech. The set of languages included in the evaluation was not provided, encouraging the participants to develop systems robust to any language. We evaluated x-vector architectures based on ResNet, squeeze-excitation ResNets, Transformers and EfficientNets. Though squeeze-excitation ResNets and EfficientNets provide superior performance on in-domain tasks like VoxCeleb, regular ResNet34 was more robust in the challenge scenario. In contrast, squeeze-excitation networks over-fitted to the training data, mostly in English. We also proposed a novel PLDA mixture and k-NN PLDA back-ends to handle the multilingual trials. The former clusters the x-vector space, expecting that each cluster will correspond to a language family. The latter trains a PLDA model adapted to each enrollment speaker using the nearest speakers (i.e., those with a similar language/channel). The k-NN back-end improved Act. Cprimary (Cp) by 68% in SRE16-19 and 22% in SRE20 Progress w.r.t. a single adapted PLDA back-end. Our best single system achieved Act. Cp=0.110 in SRE20 progress. Meanwhile, our best fusion obtained Act. Cp=0.110 in the progress set (8% better than the single system) and Cp=0.087 in the eval set.
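The neighbor-selection step of the k-NN PLDA back-end can be sketched as below. This is a simplified stand-in: the function names are hypothetical, cosine similarity is used here for illustration, and in the actual system the selected training speakers would be used to adapt a per-enrollment-speaker PLDA model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two x-vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def nearest_speakers(enroll_xvec, train_xvecs, k):
    """Select the k training speakers whose x-vectors are closest to
    the enrollment speaker, i.e. those likely sharing its
    language/channel condition."""
    ranked = sorted(train_xvecs.items(),
                    key=lambda item: cosine(enroll_xvec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

Adapting the back-end on only these neighbors is what lets each enrollment speaker get a PLDA matched to its own language/channel condition.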

Quantifying bias in a face verification system

Summary

Machine learning models perform face verification (FV) for a variety of highly consequential applications, such as biometric authentication, face identification, and surveillance. Many state-of-the-art FV systems suffer from unequal performance across demographic groups, which is commonly overlooked by evaluation measures that do not assess population-specific performance. Deployed systems with bias may result in serious harm against individuals or groups who experience underperformance. We explore several fairness definitions and metrics, attempting to quantify bias in Google’s FaceNet model. In addition to statistical fairness metrics, we analyze clustered face embeddings produced by the FV model. We link well-clustered embeddings (well-defined, dense clusters) for a demographic group to biased model performance against that group. We present the intuition that FV systems underperform on protected demographic groups because they are less sensitive to differences between features within those groups, as evidenced by clustered embeddings. We show how this performance discrepancy results from a combination of representation and aggregation bias.
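The link between "well-clustered embeddings" and bias can be illustrated with a simple per-group compactness measure. This is a sketch under stated assumptions: mean pairwise cosine similarity is used as a stand-in for the clustering analysis, and the function names are hypothetical rather than taken from the paper.

```python
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two face embeddings."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def group_compactness(embeddings, groups):
    """Mean pairwise cosine similarity within each demographic group.
    Higher values mean tighter clusters, i.e. the model is less
    sensitive to differences between members of that group."""
    scores = {}
    for group in set(groups):
        members = [e for e, g in zip(embeddings, groups) if g == group]
        pairs = list(itertools.combinations(members, 2))
        scores[group] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return scores
```

A group whose embeddings are markedly more compact than others would be flagged for the kind of underperformance the abstract describes.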

Bayesian estimation of PLDA in the presence of noisy training labels, with applications to speaker verification

Published in:
IEEE/ACM Trans. Audio, Speech, Language Process., Vol. 30, 2022, pp. 414-28.

Summary

This paper presents a Bayesian framework for estimating a Probabilistic Linear Discriminant Analysis (PLDA) model in the presence of noisy labels. True class labels are interpreted as latent random variables, which are transmitted through a noisy channel, and received as observed speaker labels. The labeling process is modeled as a Discrete Memoryless Channel (DMC). PLDA hyperparameters are interpreted as random variables, and their joint posterior distribution is derived using mean-field Variational Bayes, allowing maximum a posteriori (MAP) estimates of the PLDA model parameters to be determined. The proposed solution, referred to as VB-MAP, is presented as a general framework, but is studied in the context of speaker verification, and a variety of use cases are discussed. Specifically, VB-MAP can be used for PLDA estimation with unreliable labels, unsupervised PLDA estimation, and to infer the reliability of a PLDA training set. Experimental results show the proposed approach to provide significant performance improvements on a variety of NIST Speaker Recognition Evaluation (SRE) tasks, both for data sets with simulated mislabels, and for data sets with naturally occurring missing or unreliable labels.
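The Discrete Memoryless Channel view of label noise can be sketched as below: each true label passes independently through a fixed confusion distribution. This is only an illustration of the noise model (useful, e.g., for simulating mislabels); the function name is hypothetical and the paper's contribution is the variational inference over this model, not the simulation itself.

```python
import random

def corrupt_labels(true_labels, channel, rng):
    """Pass true labels through a discrete memoryless channel:
    channel[y][z] is the probability of observing label z when the
    true label is y. Each label is corrupted independently."""
    observed = []
    for y in true_labels:
        r = rng.random()
        cumulative = 0.0
        for z, p in enumerate(channel[y]):  # inverse-CDF sampling
            cumulative += p
            if r < cumulative:
                observed.append(z)
                break
    return observed
```

An identity channel (probability 1 on the diagonal) reproduces the labels exactly, which is a convenient degenerate check of the model.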

Tools and practices for responsible AI engineering

Summary

Responsible Artificial Intelligence (AI)—the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability—represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries—hydra-zen and the rAI-toolbox—that address critical needs for responsible AI engineering. hydra-zen dramatically simplifies the process of making complex AI applications configurable, and their behaviors reproducible. The rAI-toolbox is designed to enable methods for evaluating and enhancing the robustness of AI-models in a way that is scalable and that composes naturally with other popular ML frameworks. We describe the design principles and methodologies that make these tools effective, including the use of property-based testing to bolster the reliability of the tools themselves. Finally, we demonstrate the composability and flexibility of the tools by showing how various use cases from adversarial robustness and explainable AI can be concisely implemented with familiar APIs.

Adapting deep learning models to new meteorological contexts using transfer learning

Published in:
2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 4169-4177, doi: 10.1109/BigData52589.2021.9671451.

Summary

Meteorological applications such as precipitation nowcasting, synthetic radar generation, statistical downscaling and others have benefited from deep learning (DL) approaches; however, several challenges remain for widespread adoption of these complex models in operational systems. One of these challenges is adequate generalizability: deep learning models trained on datasets collected in specific contexts should not be expected to perform as well when applied to the different contexts required by large operational systems. One obvious mitigation for this is to collect massive amounts of training data that cover all expected meteorological contexts; however, this is not only costly and difficult to manage, but is also not possible in many parts of the globe where certain sensing platforms are sparse. In this paper, we describe an application of transfer learning to perform domain transfer for deep learning models. We demonstrate a transfer learning algorithm called weight superposition to adapt a Convolutional Neural Network trained in a source context to a new target context. Weight superposition is a method for storing multiple models within a single set of parameters, thus greatly simplifying model maintenance and training. This approach also addresses the issue of catastrophic forgetting, where a model, once adapted to a new context, performs poorly in the original context. We apply weight superposition to the problem of synthetic weather radar generation and show that in scenarios where the target context has less data, a model adapted with weight superposition is better at maintaining performance when compared to simpler methods. Conversely, the simple adapted model performs better on the source context when the source and target contexts have comparable amounts of data.
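The idea of storing multiple models in one parameter set can be sketched in the general superposition style of binding each task's weights with a per-context key and summing; this is an assumption-laden toy (the function names and the +/-1 keys are illustrative), not the paper's exact formulation for CNNs.

```python
def superpose(weight_sets, context_keys):
    """Bind each context's weight vector with its +/-1 key (elementwise)
    and sum, storing several models in one parameter vector."""
    combined = [0.0] * len(weight_sets[0])
    for weights, key in zip(weight_sets, context_keys):
        for i, (w, c) in enumerate(zip(weights, key)):
            combined[i] += w * c
    return combined

def retrieve(combined, context_key):
    """Unbind with the matching key; the other stored models contribute
    only a residual term that shrinks as the keys decorrelate."""
    return [v * c for v, c in zip(combined, context_key)]
```

Because the original context's weights remain recoverable from the shared parameters, adapting to a new context does not overwrite them, which is how this family of methods sidesteps catastrophic forgetting.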