Performance & Scaling
DTW dominates the wall-clock time of a Baleen run; everything after it (HMM, aggregation) is comparatively cheap. This page covers the DTW backend, memory behaviour, and the knobs that control throughput.
DTW backend
The baleen._dtw module is a thin shim over the krill engine. The backend is
selected by which krill wheel is installed plus device presence, not by a
compile-time flag:
- GPU when krill's cu122 wheel is installed and a CUDA device is present.
- CPU otherwise (krill's plain wheel, or no device).
Check which one is active:
from baleen._dtw import backend, is_available
print("DTW backend:", backend()) # "gpu" or "cpu"
print("GPU available:", is_available())
Force a backend per run:
| Flag | Effect |
|---|---|
| --cuda 0 / --cuda 0,1 / --cuda all | Use the listed GPU device(s). |
| --no-cuda | Force the CPU backend. |
| --gpu-memory-limit BYTES | Cap the GPU memory budget for concurrent DTW workers. |
CUDA kernel characteristics
- FP32 only. The kernel is templated on float; FP16 is deliberately not used because it cripples Pascal consumer GPUs (1/64 FP32 throughput).
- Wavefront parallelism: one thread per row of the cost matrix sweeping the anti-diagonal, blockDim.x = 1024, three rolling diagonals in shared memory.
- One block per pair in pairwise mode; the grid spans all comparisons.
- No Sakoe-Chiba band. A soft-band variant was tried and reverted: marking out-of-band cells as infinite without shrinking the thread count or diagonal count is pure overhead on a latency-bound kernel.
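The wavefront schedule in the list above can be illustrated on the CPU. What follows is a minimal pure-Python sketch, not the krill kernel: the function name dtw_wavefront and the absolute-difference cost are assumptions made for illustration. Each pass of the inner loop stands in for the work one GPU thread would do for its row, and only three diagonal buffers are alive at any time, mirroring the rolling diagonals kept in shared memory.

```python
import math


def dtw_wavefront(a, b):
    """DTW distance computed anti-diagonal by anti-diagonal.

    Hypothetical illustration: absolute difference is assumed as the local
    cost. Buffers are indexed by row i; on diagonal d the column is j = d - i.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = math.inf
    prev2 = [inf] * n  # diagonal d - 2
    prev1 = [inf] * n  # diagonal d - 1

    for d in range(n + m - 1):
        cur = [inf] * n  # diagonal d
        lo = max(0, d - (m - 1))  # first row with a valid column
        hi = min(n - 1, d)        # last row with a valid column
        # On a GPU every i in [lo, hi] would be handled by its own thread.
        for i in range(lo, hi + 1):
            j = d - i
            cost = abs(a[i] - b[j])
            if i == 0 and j == 0:
                best = 0.0
            else:
                up = prev1[i - 1] if i > 0 else inf              # cell (i-1, j)
                left = prev1[i] if j > 0 else inf                # cell (i, j-1)
                diag = prev2[i - 1] if i > 0 and j > 0 else inf  # cell (i-1, j-1)
                best = min(up, left, diag)
            cur[i] = cost + best
        prev2, prev1 = prev1, cur  # roll the buffers forward

    # The final diagonal holds only the bottom-right cell (n-1, m-1).
    return prev1[n - 1]
```

Cells on one anti-diagonal depend only on the two preceding diagonals, which is why all rows can advance in lockstep and why three buffers are enough.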
Streaming architecture & memory
DTW → HMM → aggregation are fused per contig: each worker computes a
contig, writes its site_results rows and mod-BAM slice to per_contig/, then
frees the in-memory result before returning a lightweight summary. The main
process merges the slices at the end.
The practical consequence: peak memory stays bounded regardless of transcriptome size — you do not accumulate every contig's per-position read-name lists in RAM. This is what lets Baleen process thousands of contigs without OOM.
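The pattern described above (write per contig, return only a summary, merge at the end) can be sketched as follows. This is an illustration under stated assumptions, not Baleen's code: process_contig, merge_slices, run_contigs, the TSV layout, and the site_results.tsv file name are all hypothetical, and the per-contig computation is replaced by placeholder rows.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def process_contig(contig, out_dir):
    """Hypothetical worker: compute one contig, persist it, return a summary."""
    # Placeholder for the fused DTW -> HMM -> aggregation result.
    rows = [(contig, pos, 0.5) for pos in range(3)]

    slice_path = Path(out_dir) / "per_contig" / f"{contig}.tsv"
    slice_path.parent.mkdir(parents=True, exist_ok=True)
    with slice_path.open("w") as fh:
        for name, pos, score in rows:
            fh.write(f"{name}\t{pos}\t{score}\n")

    n_rows = len(rows)
    del rows  # drop the heavy result; only the small summary crosses processes
    return {"contig": contig, "n_rows": n_rows, "slice": str(slice_path)}


def merge_slices(summaries, merged_path):
    """Concatenate per-contig slices by streaming them line by line."""
    with open(merged_path, "w") as out:
        for summary in summaries:
            with open(summary["slice"]) as fh:
                for line in fh:
                    out.write(line)


def run_contigs(contigs, out_dir, workers=2, executor_cls=ProcessPoolExecutor):
    """Fan contigs out to workers, then merge their on-disk slices."""
    with executor_cls(max_workers=workers) as pool:
        summaries = list(
            pool.map(process_contig, contigs, [str(out_dir)] * len(contigs))
        )
    merged = Path(out_dir) / "site_results.tsv"  # assumed file name
    merge_slices(summaries, merged)
    return summaries, merged
```

Because each worker returns a small dictionary rather than its full result, the main process holds one summary per contig, and the merge step reads one line at a time.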
Throughput knobs
| Flag | Effect on performance |
|---|---|
| --threads N | Parallel contig workers (ProcessPoolExecutor). More workers = more concurrency. |
| --subsample-n N | Caps reads per condition per contig (default 300). Fewer reads → fewer DTW pairs → faster, at some statistical cost. |
| --no-subsample | Disables the cap; slower and more memory-hungry on deep data. |
| --min-depth / --depth-mode | Skip shallow contigs entirely. |
| --target | Restrict to specific contigs. |
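To get a feel for why the subsample cap matters, the rough estimate below counts DTW pairs for one contig. It rests on an assumption not confirmed by this page: that reads from two conditions are pooled and compared all-vs-all, so the pair count grows quadratically with depth. The function name estimated_dtw_pairs is hypothetical.

```python
from math import comb


def estimated_dtw_pairs(depth_a, depth_b, subsample_n=300):
    """Rough pair count for one contig, assuming all-vs-all comparison.

    subsample_n=None models --no-subsample; the default mirrors the
    documented cap of 300 reads per condition.
    """
    if subsample_n is not None:
        depth_a = min(depth_a, subsample_n)
        depth_b = min(depth_b, subsample_n)
    return comb(depth_a + depth_b, 2)


print(estimated_dtw_pairs(2000, 2000))        # 179700
print(estimated_dtw_pairs(2000, 2000, None))  # 7998000
```

Under that assumption, removing the cap on a contig with 2000 reads per condition raises the pair count by roughly 45 times.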
Note: --threads controls contig parallelism.
krill eventalign runs in-process, not as a separate multithreaded
subprocess, so there is no per-call thread budget to balance. --threads
simply sets how many contig workers run in parallel.
Resuming long runs
--resume reuses per-contig slices already under <output_dir>/per_contig/,
skipping their workers entirely. A .run_params.json fingerprint of the inputs
and run-affecting parameters guards correctness: if anything drifted, the resume
aborts and lists the mismatches rather than silently mixing incompatible
results. See the CLI Reference.
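The fingerprint check can be pictured as a dictionary comparison. The sketch below is an assumption-laden illustration, not Baleen's implementation: it treats .run_params.json as a flat JSON object, and the helper names diff_params and check_resume as well as the error wording are hypothetical.

```python
import json
from pathlib import Path

_MISSING = "<missing>"


def diff_params(stored, current):
    """Return (key, stored_value, current_value) for every parameter that differs."""
    mismatches = []
    for key in sorted(set(stored) | set(current)):
        old = stored.get(key, _MISSING)
        new = current.get(key, _MISSING)
        if old != new:
            mismatches.append((key, old, new))
    return mismatches


def check_resume(output_dir, current):
    """Abort with a list of mismatches if the stored fingerprint has drifted."""
    path = Path(output_dir) / ".run_params.json"
    stored = json.loads(path.read_text())
    mismatches = diff_params(stored, current)
    if mismatches:
        details = "\n".join(
            f"  {key}: stored={old!r}, current={new!r}"
            for key, old, new in mismatches
        )
        raise RuntimeError(f"cannot resume, parameters changed:\n{details}")
```

Listing every mismatch at once, rather than failing on the first one, lets the user fix the command line in a single pass.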