Pipeline Overview
Baleen turns raw nanopore signal into per-site RNA modification calls by contrasting a native sample against an IVT (in vitro transcribed, unmodified) control. The core assumption: at an unmodified position the native and IVT current signals are interchangeable; at a modified position the native signal diverges from the IVT distribution.
```mermaid
flowchart TD
    subgraph Input
        N[Native BAM / FASTQ / BLOW5]
        I[IVT BAM / FASTQ / BLOW5]
        R[Reference FASTA]
    end
    N --> X[Read-ID intersection<br/>BAM ∩ FASTQ ∩ BLOW5]
    I --> X
    X --> EA[krill eventalign]
    R --> EA
    EA --> SG[Signal grouping by<br/>genomic position]
    SG --> DTW[Pairwise DTW distance<br/>matrices per position]
    DTW --> V1["V1: Empirical-Bayes null<br/>+ hierarchical shrinkage"]
    V1 --> V2["V2: anchored two-component<br/>mixture EM + soft gating"]
    V2 --> V3["V3: gap-aware HMM<br/>forward–backward smoothing"]
    V3 --> AGG[Beta-Binomial site aggregation<br/>+ Benjamini-Hochberg FDR]
    AGG --> TSV[site_results.tsv]
    AGG --> BAM[read_results.bam<br/>mod-BAM MM/ML tags]
```
Stages
0. Read-ID intersection
Before any signal work, Baleen computes
reads(BAM) ∩ reads(FASTQ) ∩ reads(BLOW5) independently for each condition.
eventalign silently drops BAM reads whose UUIDs are absent from the BLOW5 signal
file; without the intersection, depth statistics, subsampling, and the
--min-depth filter would all be computed against a read set larger than the
one that actually yields signals. Every downstream stage is gated on this
intersection. Disable with --no-read-intersection. See
Inputs › Read-ID intersection.
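
The effect of the intersection on depth can be illustrated with a minimal sketch. The function name `usable_read_ids` and the toy read IDs are hypothetical and are not part of Baleen's API; the point is only that depth computed from the BAM alone can overstate the number of reads that actually yield signal.

```python
def usable_read_ids(bam_ids, fastq_ids, blow5_ids):
    """Return the read IDs present in all three inputs of one condition."""
    return set(bam_ids) & set(fastq_ids) & set(blow5_ids)


# Toy read IDs standing in for real UUIDs (hypothetical data).
native_bam = ["r1", "r2", "r3", "r4"]
native_fastq = ["r1", "r2", "r3"]
native_blow5 = ["r2", "r3", "r5"]

# The intersection is computed per condition; only native is shown here.
native_usable = usable_read_ids(native_bam, native_fastq, native_blow5)

bam_depth = len(native_bam)        # depth if only the BAM were consulted
usable_depth = len(native_usable)  # depth that can actually yield signal
print(sorted(native_usable), bam_depth, usable_depth)
```

In this toy case a depth filter applied to the BAM alone would see four reads, while only two can contribute signal.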
1. Event alignment (krill eventalign)
Each read's raw signal is aligned to its mapped reference subsequence, producing a table that maps reference positions to segments of the current signal. Baleen uses the krill engine (RNA mode by default). The alignment is HMM-free and forced-dense: every signal sample is assigned to a reference position, with no read-vs-reference skips. krill emits an f5c-format TSV, so downstream stages that expect f5c output work unchanged. It reads the BLOW5 signal directly via pyslow5, so no separate event-alignment index step is required.
2. Signal grouping
Per-position signal segments are collected across reads and grouped by genomic
position. Only positions covered in both native and IVT are retained. The
--padding option concatenates flanking positions into each window to give DTW
more context.
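
The grouping step can be sketched as follows. This is an illustration under stated assumptions, not Baleen's implementation: the event tuple layout `(contig, position, read_id, samples)`, the helper names, and the exact way padding concatenates flanking positions are all hypothetical.

```python
def group_by_position(events):
    """Collect signal samples per (contig, position), keyed by read ID.

    `events` is an iterable of (contig, position, read_id, samples) tuples,
    a simplified stand-in for rows of an event-alignment table.
    """
    grouped = {}
    for contig, pos, read_id, samples in events:
        per_read = grouped.setdefault((contig, pos), {})
        per_read.setdefault(read_id, []).extend(samples)
    return grouped


def shared_positions(native, ivt):
    """Keep only positions covered in both conditions."""
    return sorted(native.keys() & ivt.keys())


def padded_window(grouped, contig, pos, read_id, padding):
    """Concatenate one read's samples over pos-padding .. pos+padding.

    Positions the read does not cover contribute nothing (an assumption).
    """
    window = []
    for p in range(pos - padding, pos + padding + 1):
        window.extend(grouped.get((contig, p), {}).get(read_id, []))
    return window


native = group_by_position([
    ("tx1", 10, "n1", [80.0, 81.0]),
    ("tx1", 11, "n1", [95.0]),
    ("tx1", 12, "n1", [70.0]),
])
ivt = group_by_position([
    ("tx1", 10, "i1", [79.5]),
    ("tx1", 11, "i1", [94.0]),
])

kept = shared_positions(native, ivt)
window = padded_window(native, "tx1", 11, "n1", padding=1)
print(kept, window)
```

Position 12 is dropped because the IVT sample does not cover it, while the padded window for position 11 still draws on the native read's flanking samples.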
3. Pairwise DTW
For every retained position, Baleen computes a pairwise DTW distance matrix between native and IVT signal segments. DTW (Dynamic Time Warping) is robust to the local time-warping inherent in nanopore translocation. The computation runs on krill's GPU backend when available, with an automatic CPU fallback.
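
For intuition, a textbook DTW distance and the resulting native-vs-IVT matrix can be sketched in plain Python. This is the classic O(n·m) dynamic programme with an absolute-difference local cost and is not krill's GPU or CPU kernel; the function names, the local cost, and the toy segments are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences (absolute-difference cost)."""
    # prev[j] holds the best cumulative cost of aligning the processed
    # prefix of `a` with the first j samples of `b`.
    prev = [0.0] + [math.inf] * len(b)
    for x in a:
        curr = [math.inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def pairwise_dtw(native_segments, ivt_segments):
    """Distance matrix with one row per native read and one column per IVT read."""
    return [[dtw_distance(s, t) for t in ivt_segments] for s in native_segments]


# Toy current segments at a single position (hypothetical values).
native_segments = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
ivt_segments = [[1.0, 2.0, 2.0, 3.0], [1.0, 3.0]]

matrix = pairwise_dtw(native_segments, ivt_segments)
print(matrix)
```

The first native segment has zero distance to the first IVT segment even though their lengths differ, which is the robustness to local time-warping described above.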
4. Three-stage hierarchical modification calling
The per-position distance structure is converted into per-read modification probabilities through three nested models:
| Stage | Name | What it does |
|---|---|---|
| V1 | Empirical-Bayes null | Estimates the IVT-vs-IVT null distance distribution per position, with coverage-adaptive three-level shrinkage (position → local window → global) so low-coverage positions borrow strength from their neighbours. |
| V2 | Anchored mixture EM | Fits a two-component (modified / unmodified) mixture to the native distances, anchored on the V1 null, using continuous soft gating rather than a hard binary threshold. |
| V3 | Gap-aware HMM | A Hidden Markov Model runs forward–backward smoothing along each read's trajectory, sharing information between neighbouring positions and tolerating coverage gaps. Produces the final per-read p_mod_hmm. |
HMM parameters can be the built-in unsupervised defaults, or learned in semi-supervised / supervised modes — see HMM Training Modes.
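
To make the V3 idea concrete, here is a minimal two-state forward–backward smoother over one read. It is a sketch under explicit assumptions, not Baleen's model: the per-position input probabilities are treated as soft emission evidence, the transition matrix is symmetric with a single `stay` probability, and a gap of `g` positions is handled by taking the `g`-step transition of that chain. The function name and default values are hypothetical.

```python
def smooth_read(positions, p_mod, stay=0.9, prior_mod=0.5, eps=1e-6):
    """Posterior P(modified) at each covered position of one read."""
    n = len(positions)
    if n == 0:
        return []
    # Emission likelihoods for (unmodified, modified), clamped away from 0 and 1.
    emit = [(1.0 - min(max(p, eps), 1.0 - eps), min(max(p, eps), 1.0 - eps))
            for p in p_mod]

    def trans(gap):
        # g-step transition of a symmetric 2-state chain: coupling decays with gap.
        same = 0.5 * (1.0 + (2.0 * stay - 1.0) ** gap)
        return ((same, 1.0 - same), (1.0 - same, same))

    def normalise(v):
        total = v[0] + v[1]
        return [v[0] / total, v[1] / total]

    # Forward pass.
    fwd = [normalise([(1.0 - prior_mod) * emit[0][0], prior_mod * emit[0][1]])]
    for t in range(1, n):
        T = trans(positions[t] - positions[t - 1])
        fwd.append(normalise([
            emit[t][k] * (fwd[-1][0] * T[0][k] + fwd[-1][1] * T[1][k])
            for k in (0, 1)
        ]))

    # Backward pass.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        T = trans(positions[t + 1] - positions[t])
        bwd[t] = normalise([
            sum(T[k][j] * emit[t + 1][j] * bwd[t + 1][j] for j in (0, 1))
            for k in (0, 1)
        ])

    # Combine both passes into per-position posteriors.
    return [normalise([fwd[t][0] * bwd[t][0], fwd[t][1] * bwd[t][1]])[1]
            for t in range(n)]


adjacent = smooth_read([1, 2, 3], [0.9, 0.4, 0.9])
distant = smooth_read([1, 50], [0.9, 0.5])
print(adjacent, distant)
```

In the first call the ambiguous middle position is pulled towards its confident neighbours; in the second, the 49-position gap leaves the uninformative position close to 0.5, which is the behaviour a gap-aware transition is meant to give.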
5. Site aggregation
Per-read probabilities are aggregated to per-site calls:
- mod_ratio: MAP estimate of modification stoichiometry from a Beta-Binomial model, with a 95% credible interval (ci_low, ci_high).
- p-value: one-sided Fisher's exact test on the binary modified/unmodified calls (native vs IVT), thresholded at --mod-threshold.
- padj: Benjamini-Hochberg FDR correction across all tested sites.
- effect_size, stoichiometry, coverage counts.
See Outputs for the complete schema.
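
The three statistics can be sketched with the standard library alone. The helper names, the uniform Beta(1, 1) prior, and the toy counts are assumptions for illustration; Baleen's actual prior and interval computation are not specified here, and the credible interval is omitted because it needs an inverse incomplete beta function.

```python
from math import comb


def map_mod_ratio(n_mod, n_total, alpha=1.0, beta=1.0):
    """Posterior mode of Beta(alpha + k, beta + n - k); assumes alpha, beta >= 1 and n > 0."""
    return (alpha + n_mod - 1.0) / (alpha + beta + n_total - 2.0)


def fisher_greater(nat_mod, nat_unmod, ivt_mod, ivt_unmod):
    """One-sided Fisher's exact test: P(at least nat_mod modified native reads)."""
    n_native = nat_mod + nat_unmod
    n_modified = nat_mod + ivt_mod
    n_all = n_native + ivt_mod + ivt_unmod
    tail = sum(
        comb(n_modified, x) * comb(n_all - n_modified, n_native - x)
        for x in range(nat_mod, min(n_native, n_modified) + 1)
    )
    return tail / comb(n_all, n_native)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    padj = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        padj[i] = running_min
    return padj


# Toy per-read probabilities at one site (hypothetical values).
threshold = 0.5
native_p = [0.9, 0.8, 0.7, 0.2, 0.1]
ivt_p = [0.1, 0.2, 0.1, 0.3, 0.2]
nat_mod = sum(p >= threshold for p in native_p)
ivt_mod = sum(p >= threshold for p in ivt_p)

ratio = map_mod_ratio(nat_mod, len(native_p))
pval = fisher_greater(nat_mod, len(native_p) - nat_mod,
                      ivt_mod, len(ivt_p) - ivt_mod)
padj = benjamini_hochberg([pval, 0.04, 0.03, 0.5])
print(ratio, pval, padj)
```

Because the p-value depends on the thresholded calls, changing the threshold changes the 2×2 table and therefore every downstream adjusted value.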
Outputs
| File | Level | Description |
|---|---|---|
| site_results.tsv | per-site | Modification calls with statistics. |
| read_results.bam | per-read | Standard mod-BAM with MM/ML tags. |
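
When reading the ML tag it helps to recall how the SAM tags specification quantises probabilities: the range 0.0 to 1.0 is split into 256 equal bins, so byte value N covers N/256 up to (N + 1)/256. The helpers below sketch that mapping; their names are hypothetical, and exactly how Baleen converts its per-read probabilities into ML bytes is not stated here.

```python
def prob_to_ml(p):
    """Quantise a probability in [0, 1] to an ML byte (0..255)."""
    return min(255, max(0, int(p * 256)))


def ml_to_prob_range(value):
    """Return the half-open probability interval an ML byte represents."""
    return (value / 256, (value + 1) / 256)


for p in (0.0, 0.5, 0.93, 1.0):
    byte = prob_to_ml(p)
    print(p, byte, ml_to_prob_range(byte))
```

A consumer that needs a single number per call often uses the bin midpoint, (N + 0.5) / 256, which is a convention rather than part of the specification.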
Two ways to run
- baleen run: the full pipeline end to end.
- baleen aggregate: re-run HMM and/or site aggregation on previously saved results without recomputing DTW (the expensive step). Useful for sweeping --mod-threshold or applying trained HMM parameters.
Both are documented in the CLI Reference.