NanoRNAMod: A Snakemake Pipeline for Benchmarking RNA Modification Detection from Nanopore Data

The Problem: No Gold Standard for RNA Modification Detection

During my PhD at CUHK School of Life Sciences, I spent a lot of time thinking about RNA modifications. Not just the biology of how m6A, pseudouridine, and other modifications regulate gene expression, but a more practical question: how do we actually detect them reliably from Oxford Nanopore direct RNA sequencing (DRS) data?

The answer, it turns out, is complicated. There is no single gold-standard tool. Instead, the field has produced a growing collection of detection tools, each built on different statistical assumptions and signal processing strategies:

  • xpore models per-position signal distributions
  • nanocompore compares signal features (current intensity and dwell time) between conditions using Gaussian mixture models and non-parametric tests
  • baleen uses a k-mer-based normalization approach
  • differr compares basecalling error profiles between conditions
  • drummer tests for differences in basecall error distributions between conditions
  • eligos2 compares per-position basecalling error rates against a control
  • epinano uses a support vector machine trained on basecalling error features

Each tool has its own strengths, biases, and blind spots. Some are sensitive but noisy. Others are conservative but may miss real modifications. When I started my research, I found myself running these tools one by one, manually managing dependencies, and trying to compare results across wildly different output formats. It was tedious and error-prone.

So I built NanoRNAMod — a Snakemake pipeline that integrates all seven tools under a single, reproducible workflow.

What NanoRNAMod Does

NanoRNAMod takes a comparison-based approach: you provide Native samples (RNA carrying its endogenous modifications) and Control samples (e.g., demethylated or in vitro transcribed RNA, in which the modification is reduced or absent), and each tool identifies positions where the signal differs significantly between the two conditions. This design matters because most of these detection algorithms compare modified and unmodified contexts rather than calling modifications de novo from a single sample.
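To make the comparison idea concrete, here is a deliberately simplified sketch. It is not how any of the seven tools works internally: it just computes Welch's t statistic on per-position current values and flags positions where the two conditions differ strongly. The function names, the threshold of 4.0, and the toy current values are all made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(native, control):
    """Welch's t statistic for two samples of current values (needs >= 2 values each)."""
    se = sqrt(variance(native) / len(native) + variance(control) / len(control))
    if se == 0:
        # Degenerate case: no spread in either sample.
        return 0.0 if mean(native) == mean(control) else float("inf")
    return (mean(native) - mean(control)) / se


def flag_positions(native_by_pos, control_by_pos, threshold=4.0):
    """Return positions covered in both conditions where |t| exceeds the threshold."""
    flagged = []
    for pos in sorted(native_by_pos.keys() & control_by_pos.keys()):
        t = welch_t(native_by_pos[pos], control_by_pos[pos])
        if abs(t) > threshold:
            flagged.append(pos)
    return flagged


# Toy data: position -> current values (pA) from reads covering that position.
native = {10: [80.1, 79.8, 80.4, 80.0], 11: [95.2, 96.1, 94.8, 95.5]}
control = {10: [80.0, 80.3, 79.9, 80.2], 11: [90.1, 90.6, 89.8, 90.3]}

print(flag_positions(native, control))  # [11]
```

Real detectors use far richer models and correct for multiple testing, but the input shape is the same: two conditions, one decision per position.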

The pipeline handles everything from raw fast5 files through basecalling, alignment, event alignment, and finally modification detection. Here’s the general flow:

  1. Basecalling (Guppy) converts raw electrical signals to nucleotide sequences
  2. Read alignment (Minimap2) maps basecalled reads to the reference transcriptome
  3. Event alignment (f5c) assigns raw signal events to k-mers of the reference, guided by the read alignments
  4. Modification detection runs all seven tools in parallel, each in its own isolated conda environment
  5. Downsampling iterations optionally repeat the analysis at reduced sequencing depths for benchmarking
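The ordering of the steps above can be sketched as a small dependency graph, which is essentially what Snakemake derives from each rule's inputs and outputs. This is only an illustration using Python's standard `graphlib` module. The stage names are labels I chose for the example, and I assume, following the simplified flow above, that every detector waits for event alignment. The optional downsampling step is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

TOOLS = ["xpore", "nanocompore", "baleen", "differr", "drummer", "eligos2", "epinano"]

# Each key maps to the set of stages that must finish before it can start.
dependencies = {
    "basecalling": set(),
    "alignment": {"basecalling"},
    "eventalign": {"alignment"},
}
for tool in TOOLS:
    # Assumption for this sketch: every detector depends on event alignment.
    dependencies[tool] = {"eventalign"}

# static_order() yields the stages in an order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because the seven detectors only depend on the shared upstream stages and not on each other, a scheduler is free to run them in parallel once event alignment has finished.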

The Downsampling Feature

One thing I’m particularly proud of is the configurable read downsampling system. A major challenge in the field is understanding how sequencing depth affects detection sensitivity and specificity. NanoRNAMod lets you specify any number of downsampling depths and iterations:

# In your config.yaml
downsampling:
  enabled: true
  depths: [100000, 500000, 1000000, 5000000]
  iterations: 3

For each depth, the pipeline randomly subsamples reads the specified number of times and runs the full detection workflow. This gives you a robust picture of how each tool’s performance scales with data volume — something that is essential for experimental planning and for fair tool comparisons.
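The core of this idea, drawing a reproducible random subset of reads for each depth and iteration, can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code: the function name, the seeding scheme, and the toy read IDs are assumptions.

```python
import random


def subsample_read_ids(read_ids, depth, iteration, base_seed=42):
    """Draw `depth` read IDs without replacement, reproducibly per (depth, iteration)."""
    if depth >= len(read_ids):
        return list(read_ids)  # nothing to downsample
    # A distinct, deterministic seed per (depth, iteration) keeps replicates
    # independent of each other but identical across reruns.
    rng = random.Random(f"{base_seed}-{depth}-{iteration}")
    return rng.sample(read_ids, depth)


# Toy example: 1,000 hypothetical read IDs, two depths, three iterations each.
read_ids = [f"read_{i}" for i in range(1000)]
for depth in (100, 500):
    for iteration in range(3):
        subset = subsample_read_ids(read_ids, depth, iteration)
        print(depth, iteration, len(subset))
```

Seeding from the depth and iteration means a rerun reproduces exactly the same subsets, which matters when you later compare tools on the same downsampled data.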

Isolated Conda Environments

One of the biggest headaches in bioinformatics is dependency hell. These seven tools depend on different versions of Python, R, and many libraries. NanoRNAMod solves this by providing 14 separate conda environment YAML files: for each of the seven tools, one for the detection step and one for the post-processing step. Snakemake automatically activates the correct environment for each rule:

rule run_xpore:
    input:
        eventalign="results/eventalign/{sample}/eventalign.txt",
        bam="results/alignment/{sample}.bam"
    output:
        # directory() marks the output as a directory; the wildcard must match the inputs
        dir=directory("results/xpore/{sample}/")
    conda:
        "envs/xpore.yaml"
    shell:
        """
        python scripts/run_xpore.py {input.eventalign} {input.bam} {output.dir}
        """

This means that on a machine with Snakemake and conda installed, a single snakemake --use-conda --cores N command builds the environments and runs the entire pipeline.

Running the Pipeline

Getting started is straightforward:

# Clone the repository
git clone https://github.com/loganylchen/NanoRNAMod.git
cd NanoRNAMod

# Edit the configuration file
cp config/config.example.yaml config/config.yaml
# Edit config.yaml with your sample paths and parameters

# Run the pipeline
snakemake --use-conda --cores 32

The config file controls sample names, paths to fast5 directories, reference sequences, and all downsampling parameters. The pipeline will automatically download and create all conda environments on the first run.
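Since a typo in the config can waste hours of compute, it can help to sanity-check the downsampling block before launching a run. The sketch below is a hypothetical helper, not part of NanoRNAMod. It assumes the config has already been loaded into a Python dict (for example with a YAML parser) and uses the key names from the downsampling example earlier in this post.

```python
def validate_downsampling(config):
    """Return a list of problems found in the `downsampling` block (empty if none)."""
    block = config.get("downsampling", {})
    if not block.get("enabled", False):
        return []  # nothing to check when the feature is switched off

    problems = []
    depths = block.get("depths", [])
    if not depths:
        problems.append("depths must list at least one read count")
    for depth in depths:
        if not isinstance(depth, int) or depth <= 0:
            problems.append(f"depth {depth!r} is not a positive integer")

    iterations = block.get("iterations")
    if not isinstance(iterations, int) or iterations < 1:
        problems.append("iterations must be an integer >= 1")
    return problems


# The dict mirrors the downsampling example shown earlier.
config = {
    "downsampling": {
        "enabled": True,
        "depths": [100000, 500000, 1000000, 5000000],
        "iterations": 3,
    }
}
print(validate_downsampling(config))  # []
```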

What I Learned

Building NanoRNAMod taught me several things that went well beyond just running bioinformatics tools:

Snakemake workflow management. I went from writing scattered shell scripts to understanding proper DAG-based workflow design. Learning to write rules with proper input/output declarations, wildcard constraints, and checkpoint logic was transformative for how I approach reproducible research.

Nanopore signal processing. Digging into how f5c performs event alignment and understanding the relationship between raw current signals, squiggle data, and basecalled sequences gave me a much deeper appreciation for what these detection tools are actually doing under the hood.

Statistical benchmarking methodology. Designing fair comparisons across tools with different statistical frameworks forced me to think carefully about precision, recall, F1 scores, and how to define ground truth in a field where perfect truth sets rarely exist.
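For reference, the three metrics reduce to a few lines once each tool's calls and the truth set are expressed as sets of sites. This is a minimal sketch with made-up transcript names and positions. It treats a call as correct only on an exact position match, which is itself a design decision (real benchmarks often allow a small window around each site).

```python
def precision_recall_f1(predicted, truth):
    """Compute precision, recall, and F1 for two collections of (transcript, position) sites."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical truth set and tool output, keyed by (transcript, position).
truth = {("tx1", 105), ("tx1", 230), ("tx2", 57), ("tx2", 412)}
calls = {("tx1", 105), ("tx1", 231), ("tx2", 57)}

precision, recall, f1 = precision_recall_f1(calls, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Note how the call at position 231 counts as a false positive even though it sits one base away from a true site, so the matching rule can shift a tool's apparent precision considerably.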

Reproducible research practices. Isolating each tool in its own conda environment and making the entire workflow configurable and runnable with one command drove home the importance of making science reproducible. If someone else cannot rerun your analysis, is it really science?

Closing Thoughts

RNA modification detection from nanopore data is still a rapidly evolving field. No single tool captures the full picture, and that is precisely why tools like NanoRNAMod matter. By making it easy to run and compare multiple detectors under identical conditions, we can move toward a more nuanced understanding of what each method detects — and what it misses.

If you are working with nanopore DRS data and RNA modifications, give NanoRNAMod a try. Contributions and feedback are always welcome.