Next-Generation Sequencing (NGS) has made it possible to perform metagenomic sequencing of environmental microbiome samples. Colorectal cancer (CRC) benefits from early detection, and many studies find correlations between disease presence and the abundance of species in microbiome samples. However, these studies are hard to reproduce and even harder to build diagnostic tools from. A major reason is the inherent bias in the collected datasets, the so-called batch effect.
To investigate the extent to which batch effect impacts the generalization of binary classifiers, we benchmarked eleven batch correctors: four existing tools, three transformations and three encoders. For each corrector, we assessed the subsequent performance of seven supervised binary classifiers using leave-one-dataset-out (LODO) validation. In addition, we measured batch effect both visually (t-SNE) and numerically (linear models) before and after applying each corrector, and we measured classification performance at different dataset counts.
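As an illustration of the LODO scheme described above, the following is a minimal sketch using only the Python standard library, not the implementation from this work. Each fold holds out every sample from one dataset, trains on the remaining datasets, and scores the held-out dataset with AUROC. The function names (`lodo_splits`, `auroc`, `evaluate_lodo`), the toy centroid classifier and the toy data are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def lodo_splits(batches):
    """Yield (held_out, train_idx, test_idx), one fold per dataset label."""
    by_batch = defaultdict(list)
    for i, b in enumerate(batches):
        by_batch[b].append(i)
    for held_out in sorted(by_batch):
        test_idx = by_batch[held_out]
        train_idx = [i for i, b in enumerate(batches) if b != held_out]
        yield held_out, train_idx, test_idx


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes in the held-out dataset")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate_lodo(X, y, batches, fit, predict_score):
    """Train on all datasets but one and report AUROC on the held-out one."""
    results = {}
    for held_out, train_idx, test_idx in lodo_splits(batches):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [predict_score(model, X[i]) for i in test_idx]
        results[held_out] = auroc([y[i] for i in test_idx], scores)
    return results


# Hypothetical stand-in classifier: direction between the two class centroids.
def fit_centroid(X, y):
    n_features = len(X[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]

    mu0 = mean([x for x, label in zip(X, y) if label == 0])
    mu1 = mean([x for x, label in zip(X, y) if label == 1])
    return [a - b for a, b in zip(mu1, mu0)]


def score_centroid(weights, x):
    return sum(w * v for w, v in zip(weights, x))


# Toy data (invented): three datasets with a per-dataset offset acting as batch effect.
X, y, batches = [], [], []
for name, offset in [("A", 0.0), ("B", 10.0), ("C", 20.0)]:
    for value, label in [(1.0, 0), (2.0, 0), (5.0, 1), (6.0, 1)]:
        X.append([value + offset])
        y.append(label)
        batches.append(name)

results = evaluate_lodo(X, y, batches, fit_centroid, score_centroid)
for dataset, score in results.items():
    print(f"held out {dataset}: AUROC = {score:.2f}")
```

In a real benchmark, the batch corrector would be fitted inside each fold on the training datasets, and `fit_centroid` would be replaced by the supervised classifier under test; the splitting logic stays the same.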
Batch effect was shown to be present in the shotgun metagenomic data; some correction tools reduced it, while others strengthened it. Evaluations using AUROC showed that combining datasets without correction improved generalization, even at an equivalent number of samples. Combining batch correctors with different classifiers did not significantly improve performance over the baseline. Despite ComBat's popularity as a batch corrector, performance worsened significantly when it was applied before training each of the binary classifiers.
Thus, even though batch correctors reduce batch effect within our taxonomic count data, they do not significantly improve classification performance when generalizing to separate datasets. We therefore advise against focusing on the choice of batch corrector when building tools for predicting a CRC diagnosis, and instead recommend improving the pool of datasets to learn from.
The code for reproducing the results and figures in this work has been made available at https://github.com/AbeelLab/ngs-batch-evaluation