NanoInsight: Multi-Tool Fusion Transcript Detection from Nanopore Direct RNA Sequencing

From RNA Modifications to Fusion Transcripts

After building NanoRNAMod for RNA modification detection, I found myself drawn to another challenge in nanopore direct RNA sequencing (DRS) analysis: fusion transcript detection.

Gene fusions occur when two separate genes become aberrantly joined, producing a chimeric transcript. They are clinically important: think of BCR-ABL in chronic myeloid leukemia or EML4-ALK in lung cancer. Accurate fusion detection can directly impact diagnosis and treatment decisions. Long-read sequencing from Oxford Nanopore is well suited to this task because a single read can span an entire fusion junction, which removes much of the ambiguity that short-read approaches face when they have to infer a junction from fragments.

But just like with RNA modifications, there is no single tool that gets it right every time. Different fusion callers use fundamentally different strategies, and each has its own failure modes. So I built NanoInsight — a Snakemake workflow that integrates multiple fusion detection tools and lets them reinforce each other through consensus.

The Tools and Their Strategies

NanoInsight integrates four fusion detection tools, each with a distinct algorithmic approach:

JAFFAL takes a two-alignment strategy: it aligns reads to both the transcriptome and the genome. The transcriptome alignment captures known fusion isoforms, while the genome alignment helps identify novel breakpoints. This dual approach is thorough but computationally demanding.

Genion uses a k-mer based approach. Instead of performing full alignment first, it breaks reads into k-mers and looks for reads whose k-mers map to different genes. This makes it fast and particularly good at detecting fusions that full aligners might miss due to complex junction structures.

LongGF employs a graph-based method. It builds an alignment graph from the read-to-genome mappings and identifies edges that connect disparate genomic loci. This approach handles complex rearrangements well.

Aeron rounds out the toolkit with yet another alignment-based strategy, providing an additional independent line of evidence.
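To make the k-mer idea described for Genion more concrete, here is a toy sketch. It is only an illustration of the general principle, not Genion's actual implementation: the gene names, sequences, `k`, and the `min_hits` threshold are all made up for the example.

```python
def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(gene_seqs, k):
    """Map each k-mer to the set of genes it occurs in."""
    index = {}
    for gene, seq in gene_seqs.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(gene)
    return index

def genes_hit(read, index, k, min_hits=2):
    """Return genes supported by at least min_hits gene-specific k-mers of the read."""
    counts = {}
    for km in kmers(read, k):
        genes = index.get(km, ())
        if len(genes) == 1:  # ignore k-mers shared between genes
            gene = next(iter(genes))
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, n in counts.items() if n >= min_hits}

# Hypothetical toy sequences, far shorter than real transcripts.
gene_seqs = {"GENE_A": "ACGTACGGTCA", "GENE_B": "TTGACCTAGGA"}
index = build_index(gene_seqs, k=4)

chimeric_read = "ACGTACGG" + "CCTAGGA"  # start of GENE_A joined to end of GENE_B
normal_read = "ACGTACGGTCA"             # entirely from GENE_A

chimeric_hits = genes_hit(chimeric_read, index, k=4)
normal_hits = genes_hit(normal_read, index, k=4)
```

A read whose k-mers support two different genes, like `chimeric_read` here, becomes a fusion candidate; a real caller then has to weed out artifacts such as read-through transcripts and paralogous genes.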

The key insight is that these tools are not redundant. A fusion missed by one approach might be confidently called by another. Running all of them and looking for overlap gives you much higher confidence in the results.
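The "looking for overlap" step can be sketched as a simple vote over gene pairs. This is a minimal illustration, not NanoInsight's actual consensus code: the function names, the `min_tools` threshold, and the example calls are assumptions, and sorting the gene names deliberately ignores which gene is the 5' partner.

```python
from collections import defaultdict

def normalize_pair(gene_a, gene_b):
    """Treat A/B and B/A as the same fusion by sorting the gene names."""
    return tuple(sorted((gene_a.upper(), gene_b.upper())))

def consensus_fusions(calls_by_tool, min_tools=2):
    """Return {gene_pair: supporting tools} for pairs called by >= min_tools tools."""
    support = defaultdict(set)
    for tool, calls in calls_by_tool.items():
        for gene_a, gene_b in calls:
            support[normalize_pair(gene_a, gene_b)].add(tool)
    return {
        pair: sorted(tools)
        for pair, tools in support.items()
        if len(tools) >= min_tools
    }

# Made-up calls purely for illustration; not real tool output.
example_calls = {
    "jaffal": [("BCR", "ABL1"), ("EML4", "ALK")],
    "genion": [("ABL1", "BCR")],
    "longgf": [("BCR", "ABL1"), ("GENE1", "GENE2")],
    "aeron": [],
}

result = consensus_fusions(example_calls, min_tools=2)
```

In this toy input only the BCR/ABL1 pair is reported by more than one tool, so it is the only one that survives the vote. Single-tool calls are not necessarily wrong, they just carry less support.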

Tool-Specific Alignment Matters

One of the trickiest parts of building NanoInsight was handling the alignment step. Each fusion caller expects its input in a slightly different format and works best with different Minimap2 parameters. For example:

# JAFFAL needs both transcriptome and genome alignments
jaffal:
  transcriptome_align:
    minimap2_params: "-ax splice -uf -k14"
  genome_align:
    minimap2_params: "-ax splice"

# Genion works with genome alignment using specific sensitivity settings
genion:
  minimap2_params: "-ax splice -k14 --secondary=no"

# LongGF prefers a more permissive alignment to capture split reads
longgf:
  minimap2_params: "-ax splice -k14 -G 200k"

Rather than forcing all tools to share a single alignment, NanoInsight runs tool-specific Minimap2 alignment for each caller. This means each tool gets input that is optimized for its algorithm, which significantly improves detection accuracy. Yes, it means more computation, but the improvement in results is worth it.
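As a rough sketch of what "tool-specific alignment" means in practice, the snippet below turns the parameter strings from the config above into one Minimap2 command per caller. The dictionary keys, the helper name, the thread count, and the file paths are assumptions for illustration; the real workflow does this inside Snakemake rules.

```python
import shlex

# Parameter strings copied from the config shown above; the keys are hypothetical.
TOOL_PARAMS = {
    "jaffal_transcriptome": "-ax splice -uf -k14",
    "jaffal_genome": "-ax splice",
    "genion": "-ax splice -k14 --secondary=no",
    "longgf": "-ax splice -k14 -G 200k",
}

def minimap2_command(tool, reference, reads, threads=4):
    """Build the argument list for a tool-specific minimap2 run."""
    params = shlex.split(TOOL_PARAMS[tool])
    return ["minimap2", *params, "-t", str(threads), reference, reads]

# Hypothetical paths; print each command rather than running it.
commands = {
    tool: minimap2_command(tool, "reference/genome.fa", "reads/sample1.fastq")
    for tool in TOOL_PARAMS
}
for tool, cmd in commands.items():
    print(tool, "->", shlex.join(cmd))
```

Keeping the parameters in one mapping makes it easy to see at a glance how the alignments differ between callers, and to change one tool's settings without touching the others.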

Containerized for Reproducibility

Following the lessons I learned from NanoRNAMod, NanoInsight uses per-tool Docker containers instead of conda environments. Each fusion caller runs inside its own container with all dependencies pre-installed:

# In the Snakemake rules
rule run_jaffal:
    input:
        bam="results/alignment/jaffal/{sample}.bam",
        reference="reference/genome.fa"
    output:
        # Snakemake requires directory outputs to be wrapped in directory()
        dir=directory("results/jaffal/{sample}")
    container:
        "docker://loganylchen/jaffal:latest"
    shell:
        """
        python /opt/jaffal/run_jaffal.py {input.bam} {input.reference} {output.dir}
        """

Docker containers provide stronger isolation than conda environments because the tool's whole runtime environment travels with it, which makes it far more likely that the pipeline behaves the same way regardless of the host system. This is especially important for clinical applications where reproducibility is not optional.

Reference Genome Filtering

Nanopore DRS produces reads from the polyadenylated transcriptome, so the vast majority of reads map to the primary chromosomes rather than to the unplaced contigs and alternative haplotypes that ship with a full reference. Aligning and calling fusions against all of those extra sequences wastes computation. NanoInsight includes a reference genome filtering step that extracts only the specified contigs:

# Only include these chromosomes in the analysis reference
reference:
  genome: "reference/GRCh38_full.fa"
  contigs: ["chr1", "chr2", "chr3", "chr4", "chr5",
            "chr6", "chr7", "chr8", "chr9", "chr10",
            "chr11", "chr12", "chr13", "chr14", "chr15",
            "chr16", "chr17", "chr18", "chr19", "chr20",
            "chr21", "chr22", "chrX", "chrY", "chrM"]

This shrinks the reference by excluding unplaced contigs and alternative haplotypes, which speeds up every alignment step in the pipeline.
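The filtering step itself is conceptually simple. Here is a minimal sketch of the idea, assuming a plain FASTA input where the contig name is the first word of each header line; the function name and the toy records are made up, and this is not necessarily how NanoInsight implements it.

```python
def filter_fasta(lines, keep_contigs):
    """Yield only the FASTA lines that belong to contigs listed in keep_contigs."""
    keep = set(keep_contigs)
    writing = False
    for line in lines:
        if line.startswith(">"):
            # Contig name is the first whitespace-delimited word after ">"
            parts = line[1:].split()
            writing = bool(parts) and parts[0] in keep
        if writing:
            yield line

# Toy records; "chrUn_example" is a hypothetical unplaced contig.
fasta_lines = [
    ">chr1 primary assembly\n", "ACGTACGT\n",
    ">chrUn_example\n", "TTTTTTTT\n",
    ">chrM\n", "GGCCGGCC\n",
]
filtered = list(filter_fasta(fasta_lines, ["chr1", "chrM"]))
```

Because the function works on any iterable of lines, it can stream a large reference from an open file handle without loading it into memory. For indexed references, `samtools faidx` with a list of region names achieves the same result.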

What I Learned

Building NanoInsight expanded my skill set in several directions:

Fusion gene biology. I had to deeply understand how gene fusions arise (chromosomal rearrangements, trans-splicing, read-through transcription) and what makes a fusion clinically significant versus a technical artifact.

Long-read fusion detection algorithms. Each tool’s paper reads like a mini-course in algorithm design. Understanding k-mer indexing, splice graph construction, and split-read analysis gave me a much richer view of computational genomics.

Multi-tool consensus design. Designing a system where multiple tools reinforce each other taught me about voting-based approaches, confidence scoring, and how to aggregate results from methods with different error profiles. A call supported by several callers is usually more trustworthy than a call from any single one.

Containerized workflow design. Moving from conda environments to Docker containers taught me about container best practices, image optimization, and the trade-offs between reproducibility and flexibility in workflow design.

The Bigger Picture

There is a philosophical thread connecting NanoRNAMod and NanoInsight. In computational biology, we are often tempted to search for the “best” tool for a given task. But biology is messy, algorithms have biases, and real data rarely matches the clean benchmarks in papers. The multi-tool approach acknowledges this reality: instead of betting on one horse, you run several and look for agreement.

If you are analyzing fusion transcripts from nanopore DRS data, give NanoInsight a try. As always, feedback and contributions are welcome.