Changelog¶
Baleen is pre-1.0 (Development Status: Alpha). This page summarises notable changes by theme; see the full commit history for detail.
v0.4.0¶
Changed¶
- krill engine replaces f5c and the in-tree CUDA DTW extension. Both event
alignment and DTW now run through the krill package. GPU DTW is
bit-identical to the old in-tree kernel; eventalign is HMM-free forced-dense
and emits an f5c-format TSV, so every downstream stage is unchanged. krill
reads BLOW5 directly via pyslow5, so the old f5c index FASTQ-index step is
gone (you still need slow5tools index for the .blow5.idx).
- baleen is now pure Python: no C extension, no nvcc build. krill installs
from a project index (cu122 GPU wheel or plain CPU wheel), not PyPI.
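Because the eventalign output keeps the f5c TSV layout, a downstream consumer can keep reading it by column name. The sketch below is illustrative only: it assumes the standard f5c/nanopolish eventalign header, and the rows, the EXAMPLE_TSV constant, and the mean_level_per_position helper are made up for the example rather than taken from baleen or krill.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows in the f5c/nanopolish eventalign layout (tab-separated).
# The header is assumed to match f5c's default; the values are invented.
EXAMPLE_TSV = (
    "contig\tposition\treference_kmer\tread_index\tstrand\tevent_index\t"
    "event_level_mean\tevent_stdv\tevent_length\tmodel_kmer\tmodel_mean\t"
    "model_stdv\tstandardized_level\n"
    "tx1\t10\tACGTA\t0\tt\t5\t100.0\t1.5\t0.002\tACGTA\t101.0\t2.0\t-0.5\n"
    "tx1\t10\tACGTA\t0\tt\t6\t102.0\t1.2\t0.001\tACGTA\t101.0\t2.0\t0.5\n"
    "tx1\t11\tCGTAC\t0\tt\t7\t90.0\t1.0\t0.003\tCGTAC\t91.0\t2.0\t-0.5\n"
)


def mean_level_per_position(handle):
    """Unweighted mean of event_level_mean per (contig, position)."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["contig"], int(row["position"]))
        sums[key][0] += float(row["event_level_mean"])
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


levels = mean_level_per_position(io.StringIO(EXAMPLE_TSV))
```

Reading by header name rather than column index keeps a consumer like this working if optional columns are appended.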
Added¶
- --pore: select the krill pore model for eventalign (default rna002).
Removed¶
- --f5c-threads: krill eventalign runs in-process, not as a separate multithreaded subprocess.
- BALEEN_NO_CUDA / BALEEN_CUDA_ARCHS build-time flags: the GPU/CPU split is now decided by which krill wheel is installed.
Unreleased (dev)¶
Features¶
- Read-ID intersection: every stage is gated on reads(BAM) ∩ reads(FASTQ) ∩ reads(BLOW5) per condition, so f5c silently dropping reads absent from the signal file no longer biases depth statistics, the --min-depth filter, or subsampling. Disable with --no-read-intersection.
- --resume: interrupted runs reuse per-contig slices already on disk, guarded by a .run_params.json input/parameter fingerprint.
- --depth-mode: choose how --min-depth is interpreted; the default is now read_count (total mapped reads on the contig) rather than mean_coverage (breaking default change).
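The read-ID gate amounts to a three-way set intersection per condition. The following is a minimal sketch of that idea, not baleen's implementation: the intersect_read_ids helper and the toy read IDs are hypothetical, and how the IDs are actually collected from the BAM, FASTQ, and BLOW5 files is left out.

```python
def intersect_read_ids(bam_ids, fastq_ids, blow5_ids):
    """Return the read IDs present in all three inputs for one condition."""
    return set(bam_ids) & set(fastq_ids) & set(blow5_ids)


# Toy read IDs for one condition; r3 is mapped but has no signal record.
bam_ids = ["r1", "r2", "r3"]
fastq_ids = ["r1", "r2", "r3", "r4"]
blow5_ids = ["r1", "r2"]

shared = intersect_read_ids(bam_ids, fastq_ids, blow5_ids)

# Reads that would have inflated BAM-derived depth without the gate.
dropped = set(bam_ids) - shared

depth_before = len(bam_ids)   # depth counted from the BAM alone
depth_after = len(shared)     # depth counted after the intersection
```

In this toy case the BAM-only count overstates usable depth by one read, which is the kind of bias the gate is meant to remove before depth filtering and subsampling.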
Performance¶
- Streaming per-contig flush — DTW → HMM → aggregation are fused per contig and written to disk immediately, bounding peak memory regardless of transcriptome size.
- cuDTW++ warp-shuffle kernel replaces the previous wavefront DTW kernel.
- Numba-JIT EM loops in the HMM calibration path (_calibrate_beta, _anchored_mixture_em), roughly a 19× per-call speedup on calibration.
- emission_source gating: the default p_mod_knn path skips the V1/V2 computation entirely.
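The streaming flush pattern can be sketched as a loop that finishes one contig, writes its rows, and keeps only the output path. This is an illustration under stated assumptions: process_contig is a hypothetical stand-in for the fused DTW, HMM, and aggregation step, and the TSV layout and file naming are invented for the example.

```python
import csv
import tempfile
from pathlib import Path


def process_contig(contig, signals):
    """Hypothetical stand-in for the fused DTW -> HMM -> aggregation step."""
    return [
        (contig, pos, sum(vals) / len(vals))
        for pos, vals in sorted(signals.items())
    ]


def stream_contigs(contig_iter, out_dir):
    """Process one contig at a time and write its rows before moving on."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for contig, signals in contig_iter:
        rows = process_contig(contig, signals)
        path = out_dir / f"{contig}.tsv"
        with path.open("w", newline="") as fh:
            csv.writer(fh, delimiter="\t").writerows(rows)
        # Only the path is retained; the rows are rebound on the next pass.
        written.append(path)
    return written


def toy_contigs():
    """Yield (contig, {position: [signal values]}) lazily."""
    yield "tx1", {1: [100.0, 102.0], 2: [90.0]}
    yield "tx2", {5: [80.0, 82.0]}


with tempfile.TemporaryDirectory() as tmp:
    paths = stream_contigs(toy_contigs(), tmp)
    names = [p.name for p in paths]
    first = paths[0].read_text().splitlines()
```

Since the input is a generator and each contig's rows are written before the next contig is requested, peak memory in this sketch tracks the largest single contig rather than the whole transcriptome.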
Fixes¶
- Chunked merge_contig_bams to survive thousands of per-contig slices.
- Closed path-traversal gaps in per-contig filename handling.
- Per-position buffer stride fix in the multi-position CUDA DTW kernel.
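For the path-traversal item, one common way to guard per-contig filenames is to sanitise the contig name and then confirm the resolved path still sits under the output directory. The sketch below shows that general technique; safe_contig_path, its whitelist, and the .bam suffix are assumptions for illustration and may differ from what baleen does.

```python
import re
from pathlib import Path


def safe_contig_path(base_dir, contig, suffix=".bam"):
    """Map a contig name to a file inside base_dir, refusing escapes."""
    # Replace anything outside a conservative whitelist, including / and \.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", contig)
    # Reject names that are empty or consist only of dots ("." or "..").
    if not name.strip("."):
        raise ValueError(f"unusable contig name: {contig!r}")
    base = Path(base_dir).resolve()
    candidate = (base / f"{name}{suffix}").resolve()
    # Second check: the resolved path must still be under base_dir.
    if base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {contig!r}")
    return candidate


hostile = safe_contig_path("out", "../../etc/passwd")
versioned = safe_contig_path("out", "ENST0001.2|gene")

try:
    safe_contig_path("out", "..")
    rejected = False
except ValueError:
    rejected = True
```

Sanitising can map two different contig names to the same filename, so a real pipeline would also need a collision check or a reversible encoding.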
Build¶
- GPU Docker build now fails loudly if the _cuda_dtw extension silently falls back to CPU.
Reverted¶
- The Sakoe-Chiba band DTW constraint was reverted — the soft-band implementation added overhead without reducing thread/diagonal count and was measured slower.
API note¶
run_pipeline_streaming(...) returns a 2-tuple (output_paths, metadata),
where output_paths is a dict with keys site_tsv, read_bam,
per_contig_dir, n_total_sites, and n_significant. (Earlier internal
revisions returned a 3-tuple; that shape is no longer used.)
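A caller can unpack the 2-tuple as follows. The arguments of run_pipeline_streaming are not shown on this page, so the sketch uses fake_run_pipeline_streaming, a stand-in that only mimics the documented return shape; the paths, counts, value types, and empty metadata dict are invented for the example.

```python
def fake_run_pipeline_streaming():
    """Stand-in with the documented return shape; not the real baleen call."""
    output_paths = {
        "site_tsv": "results/sites.tsv",          # invented path
        "read_bam": "results/reads.bam",          # invented path
        "per_contig_dir": "results/per_contig",   # invented path
        "n_total_sites": 1200,                    # invented count
        "n_significant": 37,                      # invented count
    }
    metadata = {}  # contents are not documented here; assumed dict-like
    return output_paths, metadata


# Unpack exactly two values; code written for the old 3-tuple must change.
output_paths, metadata = fake_run_pipeline_streaming()

summary = (
    f"{output_paths['n_significant']}/"
    f"{output_paths['n_total_sites']} sites significant"
)
```

Unpacking into exactly two names makes any leftover 3-tuple assumption fail immediately with a ValueError instead of silently misassigning values.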