Determining epitope specificity of T-cell receptors with Transformers

Abstract

Transformers have dominated the field of natural language processing due to their competency in learning complex relationships within a sequence. Reusing a pre-trained transformer for a downstream task is known as transfer learning. Transfer learning restricts the transformer to a fixed vocabulary; modifying the transformer implementation extends its utility. Implementing transformers for complex biological problems can help address the complexities of biological sequences. One such problem is capturing the specificity of a diverse T-cell repertoire to unique antigens (i.e., immunogenic pathogenic elements). Using transformers to assess the relationship between T-cell receptors (TCRs) and antigens at the sequence level can provide better insights into the processes involved in these precise and complex immune responses in humans and mice. In this work, we determined the specificity of multiple TCRs to unique antigens by classifying the CDR3 regions of TCR sequences to a particular antigen. For this problem, we used three pre-trained auto-encoding transformer models (ProtBERT, ProtALBERT, ProtELECTRA) and one pre-trained auto-regressive model (ProtXLNet), and implemented modifications in each to adapt to the challenges of this complex biological problem. We used VDJdb to obtain the biological data for training and testing the selected transformers. After pre-processing the data, we predicted TCR specificity for 25 antigens (classes) in a multi-class setting. The transformers could predict the specificity of TCRs to an antigen from just the CDR3 sequences of the TCR beta chain (weighted F1 score 0.48), data that was unseen by the transformers. With additional features incorporated, i.e., gene names for the TCRs, the weighted F1 score improved to 0.55 in the best-performing transformer. With these results, we demonstrated that the different modifications to the transformers allowed them to recognize out-of-vocabulary features. When comparing the AUC of the transformer models to previously developed methods for the same biological problem, such as TCRGP, TCRDist and DeepTCR, we observed that the transformers outperformed the previously available methods. For example, the MCMV epitope family, which suffered from restricted performance in TCRGP due to few training samples (~100), showed a 10% improvement in AUC with transformers under a similar number of training samples. The transformers' proficiency in learning from limited data, combined with holistic modifications to their implementation, shows that their capabilities can be extended to other biological settings. Further ingenuity in utilizing the full potential of transformers, whether through attention-head visualization or the introduction of additional features, can further extend T-cell research avenues.
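
The sketch below illustrates the general approach described above: fine-tuning a pre-trained protein language model to classify CDR3 beta-chain sequences into antigen classes. It is a minimal example, not the report's exact pipeline; the choice of Rostlab/prot_bert from Hugging Face, the hyperparameters, and the example sequence are illustrative assumptions.

```python
# Minimal sketch (assumption: Hugging Face ProtBERT checkpoint "Rostlab/prot_bert"),
# not the authors' exact implementation or modifications.
import re
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_ANTIGEN_CLASSES = 25  # one class per epitope, as in the multi-class setting above

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "Rostlab/prot_bert", num_labels=NUM_ANTIGEN_CLASSES
)

def encode_cdr3(cdr3: str):
    """ProtBERT expects space-separated residues; rare amino acids are mapped to X."""
    spaced = " ".join(re.sub(r"[UZOB]", "X", cdr3))
    return tokenizer(spaced, return_tensors="pt", padding=True, truncation=True)

# Example prediction for a single (hypothetical) CDR3 beta sequence
inputs = encode_cdr3("CASSLAPGATNEKLFF")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_antigen_class = logits.argmax(dim=-1).item()
```

In practice, the classification head would first be fine-tuned on labelled CDR3-antigen pairs (e.g., from VDJdb) before such a prediction is meaningful; additional categorical features such as V/J gene names would require further input engineering, which is one of the modifications the report refers to.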

Files

Main_Report.pdf
(pdf | 1.68 Mb)
Unknown license