Antibody Sequences Generated from Machine Learning-Enabled Antibody Design

More Information

To support protein representation learning, we are publishing a second dataset containing antibody sequences and quantitative binding measurements to a known antigen. Please refer to our paper "Machine Learning Optimization of Candidate Antibodies Yields Highly Diverse Sub-nanomolar Affinity Antibody Libraries" for additional information on the method and design of these sequences.

The initial AlphaSeq Antibody Dataset 1 can be found here. Additional information about the design of Dataset 1 and experimental setup for quantitative binding measurements can be found in our Data Descriptor paper.

Overview

The dataset presented here contains quantitative binding scores of scFv-format antibodies against a SARS-CoV-2 target peptide, collected via an AlphaSeq assay. We integrate target-specific binding affinities with information from millions of natural protein sequences in a probabilistic machine learning framework to design thousands of scFvs, which are then measured empirically. This is the second dataset we are releasing that contains antibody sequences, the antigen sequence, and quantitative measurements; it can serve as a benchmark for evaluating antibody-specific representation models for machine learning.

The dataset is a CSV file with the following columns:

Variable Name     Description
POI¹              Alphanumeric label corresponding to the amino acid sequence.
Sequence          Single-letter amino acid sequence of the scFv measured.
Target            Text label for the protein target against which the antibody was measured. Values are the target or one of the non-target negative controls 1-3.
Assay             Unique assay identifier. (All sequences here come from Assay B.)
Replicate         Unique replicate identifier, ranging from 1 to 6.
Pred_affinity     Score from the AlphaSeq assay, as described in the methods section. Values estimate the protein-protein dissociation constant in nanomolar, on a log scale. Lower values indicate stronger binding. Blank values indicate poor binding.
HC, LC            Single-letter amino acid sequence of the heavy chain (HC) or light chain (LC).
CDR[H/L][1/2/3]   Single-letter amino acid sequence of a CDR region, where H indicates heavy chain, L indicates light chain, and the number indicates CDR 1, CDR 2, or CDR 3.

¹Note: The POI naming convention may capture only one of the methods used to generate a sequence. A sequence may have been generated using multiple methods, not all of which are reflected in its POI. Please see LINK for more information about the particular method used for a given sequence.
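
As a rough illustration of how the columns above might be used, the following Python sketch averages Pred_affinity across replicates for each POI and Target pair, treating blank values as missing. It assumes pandas is available; the toy rows, their values, and the function name summarize_replicates are invented for illustration and are not part of the dataset or of any published analysis. Averaging the log-scale values directly is one possible choice, not necessarily the procedure used in the paper.

```python
import io

import pandas as pd

# Toy rows shaped like the columns described above.
# All labels and values are invented for illustration only.
TOY_CSV = """POI,Sequence,Target,Assay,Replicate,Pred_affinity
seq_a,EVQLV,target,B,1,1.0
seq_a,EVQLV,target,B,2,2.0
seq_a,EVQLV,target,B,3,
seq_b,QVQLQ,target,B,1,
seq_b,QVQLQ,target,B,2,
"""


def summarize_replicates(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize replicate measurements per (POI, Target) pair.

    Blank Pred_affinity entries are read as NaN and are skipped by the mean,
    so a sequence with no measured replicates gets a NaN mean.
    """
    return (
        df.groupby(["POI", "Target"])
        .agg(
            mean_pred_affinity=("Pred_affinity", "mean"),  # mean of log-scale scores
            n_measured=("Pred_affinity", "count"),         # replicates with a value
            n_replicates=("Replicate", "count"),           # all replicates listed
        )
        .reset_index()
    )


# Blank fields in the CSV become NaN when parsed.
toy = pd.read_csv(io.StringIO(TOY_CSV))
summary = summarize_replicates(toy)
print(summary)
```

To apply the same summary to the released file, the toy frame could be replaced with pd.read_csv called on the local path of the downloaded CSV. Keeping n_measured next to n_replicates makes it possible to tell sequences with a few missing replicates apart from sequences that never produced a score.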