From FASTQ to Differential Expression: Building a Reproducible RNA-seq Pipeline
During my PhD at CUHK, RNA-seq was one of those things I kept coming back to. Different projects, different organisms, different experimental designs — but the analysis steps were always the same. Download data, check quality, trim, align, quantify, normalize, test for differential expression, run enrichment. Every time, I found myself re-running dozens of commands, tweaking parameters, and trying to remember what I did last time.
That repetition was the motivation behind RNA-seq-analysis-pipeline, a Snakemake-based workflow that takes you from raw FASTQ files all the way to differential expression results and functional enrichment analysis.
What the pipeline does
The workflow covers the full RNA-seq analysis chain:
- Data acquisition — Fetch raw sequencing data from SRA automatically.
- Quality control and trimming — fastp handles adapter removal, quality filtering, and generates QC reports.
- Alignment — STAR and HISAT2 run in parallel, giving you two independent alignments to cross-validate results.
- Quantification — featureCounts and STAR gene-level counts are both generated, again for cross-comparison.
- Differential expression — DESeq2 with ashr log-fold-change shrinkage, producing publication-ready results.
- Co-expression networks — WGCNA identifies gene modules and their correlations with traits.
- Functional enrichment — genekitr runs GO and KEGG enrichment on your DE gene sets.
Every step produces reports and visualizations along the way, so you can inspect quality at each stage without digging through log files.
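To make one link of that chain concrete, the trimming step boils down to a fastp invocation along these lines (paths, sample name, and thread count are illustrative; the real values come from the config):

```sh
# Paired-end QC + trimming with fastp (illustrative paths)
fastp \
  --in1 reads/SRR1234567_1.fastq.gz --in2 reads/SRR1234567_2.fastq.gz \
  --out1 trimmed/SRR1234567_1.fastq.gz --out2 trimmed/SRR1234567_2.fastq.gz \
  --qualified_quality_phred 20 \
  --length_required 50 \
  --json qc/SRR1234567.fastp.json \
  --html qc/SRR1234567.fastp.html \
  --thread 4
```

The JSON and HTML reports are what feed the per-stage QC summaries mentioned above.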
Why Snakemake
I chose Snakemake because it hits a sweet spot between flexibility and reproducibility. Each rule in the workflow gets its own conda environment, which means the tooling for one step doesn’t clash with another. There are also pre-built Docker containers for full reproducibility — someone else can run the exact same pipeline without worrying about dependency hell.
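Concretely, each rule points at a small environment file. A hypothetical one for the trimming step might look like this (the pinned version is illustrative):

```yaml
# envs/fastp.yaml — pinned environment for the trimming rule
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastp =0.23
```

Snakemake creates and activates this environment only for the rules that reference it, so STAR, HISAT2, and the R stack never have to coexist in one environment.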
The pipeline is registered in the Snakemake Workflow Catalog, which was a nice milestone. It means others can discover and use it with minimal setup.
Config-driven design
The key design decision was making everything configurable through two files. You don’t need to edit the workflow itself to analyze a new dataset. Here’s what setup looks like:
```yaml
# config.yaml
samples: "config/samples.tsv"
outdir: "results"

genome:
  source: "ensembl"
  organism: "homo_sapiens"
  release: 110
  fasta: "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
  gtf: "Homo_sapiens.GRCh38.110.gtf"

params:
  fastp:
    qualified_quality_phred: 20
    length_required: 50
  star:
    outFilterMismatchNmax: 10
```
```text
# samples.tsv
sample_id	condition	platform	layout
SRR1234567	treated	ILLUMINA	PAIRED
SRR1234568	control	ILLUMINA	PAIRED
```
Then run it:
```sh
snakemake --use-conda --cores 16
```
That’s it. Change the config for a new project, and you’re ready to go. The pipeline supports both paired-end and single-end data, and you can configure any reference genome available on Ensembl.
What I learned
Building this pipeline forced me to go deep into areas I had only touched on before.
RNA-seq theory — I had to understand why each step matters: why trimming affects alignment rates, why splice-aware aligners are essential for eukaryotic RNA, how different quantification strategies introduce different biases, and what normalization actually does to your counts.
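To make the normalization point concrete: DESeq2's median-of-ratios size factors can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the package's actual implementation:

```python
import math

def size_factors(counts):
    """DESeq2-style median-of-ratios size factors (simplified sketch).

    counts: one inner list of gene counts per sample, all over the same
    genes. Genes with a zero in any sample are skipped, since their
    geometric mean across samples would be zero.
    """
    n_genes = len(counts[0])
    # Geometric mean of each gene across samples: a pseudo-reference sample.
    ref = []
    for g in range(n_genes):
        vals = [s[g] for s in counts]
        if all(v > 0 for v in vals):
            ref.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            ref.append(None)
    factors = []
    for s in counts:
        # Each sample's factor is the median ratio to the pseudo-reference.
        ratios = sorted(s[g] / ref[g] for g in range(n_genes) if ref[g])
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors

# A sample sequenced twice as deeply gets a size factor twice as large,
# so dividing counts by the factors puts samples on a common scale.
print(size_factors([[10, 20, 30], [20, 40, 60]]))
```

The key intuition: the median ratio is robust to the minority of genes that are genuinely differentially expressed, which is exactly why naive library-size scaling can mislead while median-of-ratios usually does not.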
The R/Bioconductor ecosystem — Working with DESeq2, WGCNA, and genekitr taught me a lot about how the R community approaches genomic data analysis. The S4 object system, the SummarizedExperiment abstraction, and the design philosophy behind Bioconductor packages all clicked while building the downstream analysis portions.
Reproducible research practices — Per-rule conda environments, containerization, and config-driven workflows are ideas borrowed from software engineering. Applying DevOps thinking to scientific analysis was a perspective shift. It’s not just about getting results — it’s about making sure someone else (or future you) can get the same results.
Workflow management — Snakemake’s rule-based approach, with its input/output declarations and automatic DAG construction, is a fundamentally different way of thinking about data processing. It took some time to internalize, but now I can’t imagine going back to manual shell scripts.
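For readers unfamiliar with the model: a rule declares what it needs and what it produces, and Snakemake wires the rules into a DAG by matching filenames. A stripped-down alignment rule might look like this (wildcard names and paths are illustrative, not the pipeline's actual rules):

```python
# Snakemake: a minimal splice-aware alignment rule (illustrative)
rule star_align:
    input:
        r1="trimmed/{sample}_1.fastq.gz",
        r2="trimmed/{sample}_2.fastq.gz",
        index="genome/star_index",
    output:
        bam="aligned/{sample}.Aligned.sortedByCoord.out.bam",
    conda:
        "envs/star.yaml"
    threads: 8
    shell:
        "STAR --runThreadN {threads} --genomeDir {input.index} "
        "--readFilesIn {input.r1} {input.r2} --readFilesCommand zcat "
        "--outSAMtype BAM SortedByCoordinate "
        "--outFileNamePrefix aligned/{wildcards.sample}."
```

Because downstream rules name this BAM as their input, Snakemake infers the execution order and re-runs only what is stale when an input changes.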
Reflections
This pipeline is very much a product of my PhD experience. It reflects the kind of work I did daily and the frustrations I wanted to solve. The parallel alignment and quantification steps, for instance, came from a period where I was debugging inconsistent results and needed to verify whether an alignment artifact was causing a problem.
If you’re doing RNA-seq analysis regularly, I’d encourage you to invest time in building (or adopting) a standardized workflow. The upfront cost is real, but the long-term payoff in consistency, reproducibility, and your own sanity is worth it.
The code is on GitHub. Feedback and contributions are welcome.