Performance & Scaling

DTW dominates the wall-clock time of a Baleen run; everything after it (HMM, aggregation) is comparatively cheap. This page covers the DTW backend, memory behaviour, and the knobs that control throughput.

DTW backend

The baleen._dtw module is a thin shim over the krill engine. The backend is selected by which krill wheel is installed plus device presence — not by a compile-time flag:

  • GPU when krill's cu122 wheel is installed and a CUDA device is present.
  • CPU otherwise (krill's plain wheel, or no device).
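
The selection rule above can be sketched as a small pure function. This is only an illustration of the logic as described: select_backend and its parameters are hypothetical names, not part of baleen or krill, and the wheel variant is passed in as a plain string rather than detected.

```python
def select_backend(wheel_variant: str, cuda_device_count: int) -> str:
    """Return "gpu" or "cpu" following the rule described above.

    Hypothetical helper for illustration only; not a baleen or krill API.
    """
    has_cuda_wheel = wheel_variant == "cu122"   # krill's CUDA wheel
    has_device = cuda_device_count > 0          # at least one CUDA device
    # Both conditions must hold; anything else falls back to the CPU.
    return "gpu" if (has_cuda_wheel and has_device) else "cpu"


if __name__ == "__main__":
    for wheel, devices in [("cu122", 1), ("cu122", 0), ("plain", 1)]:
        print(wheel, devices, "->", select_backend(wheel, devices))
```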

Check which one is active:

from baleen._dtw import backend, is_available
print("DTW backend:", backend())     # "gpu" or "cpu"
print("GPU available:", is_available())

Force a backend per run:

Flag                                  Effect
--cuda 0 / --cuda 0,1 / --cuda all    Use the listed GPU device(s).
--no-cuda                             Force the CPU backend.
--gpu-memory-limit BYTES              Cap the GPU memory budget for concurrent DTW workers.

CUDA kernel characteristics

  • FP32 only. The kernel is templated on float; FP16 is deliberately not used because consumer Pascal GPUs run FP16 at 1/64 of their FP32 throughput.
  • Wavefront parallelism — one thread per row of the cost matrix sweeping the anti-diagonal, blockDim.x = 1024, three rolling diagonals in shared memory.
  • One block per pair in pairwise mode; the grid spans all comparisons.
  • No Sakoe-Chiba band. A soft-band variant was tried and reverted: marking out-of-band cells as infinite without shrinking the thread count or diagonal count is pure overhead on a latency-bound kernel.
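
To make the wavefront scheme concrete, here is a minimal CPU sketch in pure Python. It is not the krill kernel: the inner loop over rows stands in for the one-thread-per-row layout, the three lists stand in for the rolling diagonals, and the absolute-difference local cost is an assumption made for illustration. Block size and shared memory are not modelled. A conventional row-by-row implementation is included only to check the result.

```python
import math


def dtw_wavefront(a, b):
    """DTW cost computed by sweeping anti-diagonals with three rolling buffers."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = math.inf
    # One slot per row of the cost matrix (one "thread" per row).
    prev2 = [inf] * n   # diagonal k - 2
    prev1 = [inf] * n   # diagonal k - 1
    cur = [inf] * n     # diagonal k
    for k in range(n + m - 1):          # k = i + j indexes the anti-diagonal
        lo = max(0, k - (m - 1))        # first row that has a cell on diagonal k
        hi = min(n - 1, k)              # last row that has a cell on diagonal k
        for i in range(n):              # on a GPU, each i is an independent thread
            if i < lo or i > hi:
                cur[i] = inf            # row has no cell on this diagonal
                continue
            j = k - i
            if k == 0:
                best = 0.0              # origin cell has no predecessor
            else:
                up = prev1[i - 1] if i > 0 else inf     # cell (i-1, j)
                left = prev1[i]                         # cell (i, j-1)
                diag = prev2[i - 1] if i > 0 else inf   # cell (i-1, j-1)
                best = min(up, left, diag)
            cur[i] = abs(a[i] - b[j]) + best
        # Rotate buffers: the oldest diagonal becomes scratch space.
        prev2, prev1, cur = prev1, cur, prev2
    return prev1[n - 1]                 # cell (n-1, m-1) is on the last diagonal


def dtw_reference(a, b):
    """Conventional row-major DTW, used only to check the wavefront version."""
    n, m = len(a), len(b)
    inf = math.inf
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = abs(a[i - 1] - b[j - 1]) + step
    return d[n][m]


if __name__ == "__main__":
    x, y = [1, 2, 3, 4, 2], [1, 3, 4, 1]
    print(dtw_wavefront(x, y), dtw_reference(x, y))
```

The sketch also shows why a soft band does not help: masking cells as infinite changes neither the n + m - 1 diagonal steps of the outer loop nor the number of rows visited per step.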

Streaming architecture & memory

DTW → HMM → aggregation are fused per contig: each worker computes a contig, writes its site_results rows and mod-BAM slice to per_contig/, then frees the in-memory result before returning a lightweight summary. The main process merges the slices at the end.

The practical consequence: peak memory stays bounded regardless of transcriptome size — you do not accumulate every contig's per-position read-name lists in RAM. This is what lets Baleen process thousands of contigs without OOM.
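
The pattern can be sketched in a few lines. Everything here is illustrative rather than Baleen's code: process_contig, run, the JSON-lines slice format, and the merged file name are hypothetical, and the sketch uses ThreadPoolExecutor so that it runs without a __main__ guard on any platform, whereas Baleen's workers are processes.

```python
import json
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_contig(contig, reads, out_dir):
    """Compute one contig, persist its slice, and return only a small summary."""
    # Stand-in for the fused DTW -> HMM -> aggregation work on one contig.
    rows = [{"contig": contig, "read": name} for name in reads]

    slice_path = Path(out_dir) / "per_contig" / f"{contig}.jsonl"
    slice_path.parent.mkdir(parents=True, exist_ok=True)
    with slice_path.open("w") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    n_rows = len(rows)
    del rows  # drop the bulky result before returning
    return {"contig": contig, "n_rows": n_rows, "slice": str(slice_path)}


def run(contigs, out_dir, workers=2):
    """Fan contigs out to workers, then merge the on-disk slices at the end."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_contig, c, r, out_dir)
                   for c, r in contigs.items()]
        summaries = [f.result() for f in futures]

    merged_path = Path(out_dir) / "site_results.jsonl"  # hypothetical name
    with merged_path.open("wb") as out:
        for summary in summaries:
            with open(summary["slice"], "rb") as src:
                shutil.copyfileobj(src, out)  # stream bytes, never hold a slice in RAM
    return summaries, merged_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = {"tx1": ["r1", "r2"], "tx2": ["r3"]}
        summaries, merged = run(demo, tmp)
        print(sum(s["n_rows"] for s in summaries), "rows merged into", merged.name)
```

The main process only ever holds the summaries, so its footprint grows with the number of contigs, not with the number of reads per contig.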

Throughput knobs

Flag                          Effect on performance
--threads N                   Number of parallel contig workers (ProcessPoolExecutor); more workers process more contigs concurrently.
--subsample-n N               Caps reads per condition per contig (default 300). Fewer reads → fewer DTW pairs → faster, at some statistical cost.
--no-subsample                Disables the cap; slower and more memory-hungry on deep data.
--min-depth / --depth-mode    Skip shallow contigs entirely.
--target                      Restrict to specific contigs.
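
A rough calculation shows why the read cap matters so much. The pairing scheme is not specified on this page, so the sketch below assumes, purely for illustration, that reads from two conditions are pooled and every read is compared with every other; pooled_pair_count is a hypothetical helper, not a Baleen function.

```python
import math


def pooled_pair_count(reads_per_condition: int, conditions: int = 2) -> int:
    """Number of unordered read pairs under the all-vs-all assumption."""
    total_reads = reads_per_condition * conditions
    return math.comb(total_reads, 2)


if __name__ == "__main__":
    for n in (75, 150, 300):
        print(f"{n} reads per condition -> {pooled_pair_count(n)} pairs")
```

Under that assumption the pair count grows quadratically, so halving the cap cuts the DTW work per contig to roughly a quarter.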

--threads controls contig parallelism

krill eventalign runs in-process, not as a separate multithreaded subprocess, so there is no per-call thread budget to balance. --threads simply sets how many contig workers run in parallel.

Resuming long runs

--resume reuses per-contig slices already under <output_dir>/per_contig/, skipping their workers entirely. A .run_params.json fingerprint of the inputs and run-affecting parameters guards correctness: if anything drifted, the resume aborts and lists the mismatches rather than silently mixing incompatible results. See the CLI Reference.
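
The fingerprint check can be pictured as follows. This is a sketch under assumptions: the real contents of .run_params.json are not documented here, so the example treats it as a flat JSON object, and diff_params and check_resume are hypothetical names.

```python
import json
import tempfile
from pathlib import Path

MISSING = "<missing>"


def diff_params(saved: dict, current: dict) -> list:
    """List every key whose value differs between the saved and current runs."""
    mismatches = []
    for key in sorted(set(saved) | set(current)):
        old, new = saved.get(key, MISSING), current.get(key, MISSING)
        if old != new:
            mismatches.append(f"{key}: saved={old!r} current={new!r}")
    return mismatches


def check_resume(output_dir, current: dict) -> None:
    """Abort with a list of mismatches if the stored fingerprint has drifted."""
    path = Path(output_dir) / ".run_params.json"
    if not path.exists():
        # First run in this sketch: record the fingerprint for later resumes.
        path.write_text(json.dumps(current, sort_keys=True))
        return
    mismatches = diff_params(json.loads(path.read_text()), current)
    if mismatches:
        raise RuntimeError(
            "cannot resume, parameters changed:\n" + "\n".join(mismatches)
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        check_resume(tmp, {"subsample_n": 300, "min_depth": 20})
        try:
            check_resume(tmp, {"subsample_n": 100, "min_depth": 20})
        except RuntimeError as err:
            print(err)
```

Failing loudly with the full list of differences lets the user decide whether to restore the original parameters or start a fresh output directory.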