Rapid and efficient analysis of 20,000 RNA-seq samples with Toil
Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can run scientific workflows securely, reproducibly, and efficiently at large scale.
To demonstrate Toil, University of California Santa Cruz researchers processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets, free of computational batch effects, which they make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.
(Left) A dependency graph of the RNA-seq pipeline we developed (called CGL). CutAdapt was used to remove extraneous adapters, STAR was used for alignment and read coverage, and RSEM and Kallisto were used to produce quantification data. (Right) A scatter plot showing the Pearson correlation between the results of the TCGA best-practices pipeline and the CGL pipeline. 10,000 randomly selected sample/gene pairs were drawn from the entire TCGA cohort and their normalized counts were plotted against each other; this process was repeated five times with no change in Pearson correlation. Counts are shown as log2(normalized counts + 1).
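The subsampling comparison described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the dictionary layout (a mapping from a (sample, gene) pair to a normalized count), and the fixed random seed are all assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log2p1(count):
    """Transform a normalized count to log2(count + 1)."""
    return math.log2(count + 1.0)


def subsampled_correlations(counts_a, counts_b, n_pairs=10_000, repeats=5, seed=0):
    """Correlate two pipelines on repeated random draws of sample/gene pairs.

    counts_a, counts_b: hypothetical dicts mapping (sample_id, gene_id)
    to a normalized count, one dict per pipeline.
    Returns one Pearson coefficient per repeat.
    """
    rng = random.Random(seed)
    # Only pairs quantified by both pipelines can be compared.
    shared = sorted(set(counts_a) & set(counts_b))
    k = min(n_pairs, len(shared))
    results = []
    for _ in range(repeats):
        keys = rng.sample(shared, k)  # draw without replacement
        xs = [log2p1(counts_a[key]) for key in keys]
        ys = [log2p1(counts_b[key]) for key in keys]
        results.append(pearson(xs, ys))
    return results
```

Calling subsampled_correlations(tcga_counts, cgl_counts), where both arguments are hypothetical dictionaries loaded from the two pipelines' outputs, returns five coefficients; close agreement among them corresponds to the "no change in Pearson correlation" reported in the caption.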
Availability – An up-to-date version of Toil’s documentation can be found here: http://toil.readthedocs.org/
The Toil source code is freely viewable at: https://github.com/BD2KGenomics/toil
A repository of Toil pipelines is available at: https://github.com/BD2KGenomics/toil-scripts