Determining epitope specificity of T-cell receptors with Transformers

Abstract

Transformers have dominated the field of natural language processing due to their competency in learning complex relationships within a sequence. Reusing a pre-trained transformer for a downstream task is known as transfer learning. Transfer learning restricts the transformer to a fixed vocabulary; modifying the transformer implementation extends its utility. Implementing transformers for complex biological problems can help address the complexities of biological sequences. One such problem is capturing the specificity of a diverse T-cell repertoire to unique antigens (i.e., immunogenic pathogenic elements). Using transformers to assess the relationship between T-cell receptors (TCRs) and antigens at the sequence level can provide better insights into the processes involved in these precise and complex immune responses in humans and mice. In this work, we determined the specificity of multiple TCRs to unique antigens by classifying the CDR3 regions of TCR sequences to a particular antigen. For this problem, we used three pre-trained auto-encoding transformer models (ProtBERT, ProtALBERT, ProtELECTRA) and one pre-trained auto-regressive model (ProtXLNet), and implemented modifications in each to adapt to the challenges of this complex biological problem. We used VDJdb to obtain the biological data for training and testing the selected transformers. After pre-processing the data, we predicted TCR specificity for 25 antigens (classes) in a multi-class setting. The transformers could predict the specificity of TCRs to an antigen from just the CDR3 sequences of the TCR beta chain (weighted F1 score 0.48), data that was unseen by the transformers. With additional features incorporated, i.e., gene names for the TCRs, the weighted F1 score improved to 0.55 in the best-performing transformer. With these results, we demonstrated that the different modifications to the transformers allowed them to recognize out-of-vocabulary features. When comparing the AUC of the transformer models to previously developed methods for the same biological problem, such as TCRGP, TCRDist and DeepTCR, we observed that the transformers outperformed the previously available methods. For example, the MCMV epitope family, which suffered from restricted performance in TCRGP due to few training samples (~100), showed a 10% improvement in AUC with transformers under a similar number of training samples. The transformers' proficiency in learning from limited data, combined with holistic modifications to their implementation, shows that their capabilities can be extended to other biological settings. Further ingenuity in utilizing the full potential of transformers, whether through attention-head visualization or the introduction of additional features, can further extend T-cell research avenues.
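
The sketch below illustrates the general approach described above: fine-tuning a pre-trained protein language model to classify CDR3 beta-chain sequences into antigen classes. It is a minimal example, not the report's exact pipeline; the choice of Rostlab/prot_bert from Hugging Face, the hyperparameters, and the example sequence are illustrative assumptions.

```python
# Minimal sketch (assumption: Hugging Face ProtBERT checkpoint "Rostlab/prot_bert"),
# not the authors' exact implementation or modifications.
import re
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_ANTIGEN_CLASSES = 25  # one class per epitope, as in the multi-class setting above

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = AutoModelForSequenceClassification.from_pretrained(
    "Rostlab/prot_bert", num_labels=NUM_ANTIGEN_CLASSES
)

def encode_cdr3(cdr3: str):
    """ProtBERT expects space-separated residues; rare amino acids are mapped to X."""
    spaced = " ".join(re.sub(r"[UZOB]", "X", cdr3))
    return tokenizer(spaced, return_tensors="pt", padding=True, truncation=True)

# Example prediction for a single (hypothetical) CDR3 beta sequence
inputs = encode_cdr3("CASSLAPGATNEKLFF")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_antigen_class = logits.argmax(dim=-1).item()
```

In practice, the classification head would first be fine-tuned on labelled CDR3-antigen pairs (e.g., from VDJdb) before such a prediction is meaningful; additional categorical features such as V/J gene names would require further input engineering, which is one of the modifications the report refers to.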

Files

Main_Report.pdf
(pdf | 1.68 Mb)
Unknown license