Docker

Baleen ships two Dockerfiles and a CI workflow that builds and pushes both images on every push to main or dev. Both variants live in a single repository, py-baleen; the variant is selected by a tag suffix (-cpu or -gpu):

Dockerfile	Tag suffix	Base
`Dockerfile.cpu`	`-cpu`	`python:3.11-slim`, krill CPU wheel.
`Dockerfile.gpu`	`-gpu`	`nvidia/cuda:12.2.2-runtime-ubuntu22.04`, krill cu122 GPU wheel.

Tags follow the pattern <ref>-<variant>: latest-* tags are published only from main, while branch tags (for example dev-*) and long-SHA tags are published for every build. Both images bundle the krill engine and slow5tools, set ENTRYPOINT ["baleen"], and use /data as the working directory.

Published to two registries:

Docker Hub — btrspg/py-baleen
GHCR (public) — ghcr.io/loganylchen/py-baleen
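
The tag pattern and the two repository names combine into a full image reference. A minimal sketch of that composition follows; image_ref is a hypothetical helper written for illustration and is not part of Baleen or its CI workflow.

```shell
# Hypothetical helper (not part of Baleen): compose "<repo>:<ref>-<variant>".
image_ref() {
    local repo="$1" ref="$2" variant="$3"
    case "$variant" in
        cpu|gpu) ;;  # the two variants named in the table above
        *) echo "unknown variant: $variant" >&2; return 1 ;;
    esac
    printf '%s:%s-%s\n' "$repo" "$ref" "$variant"
}

image_ref btrspg/py-baleen latest cpu              # btrspg/py-baleen:latest-cpu
image_ref ghcr.io/loganylchen/py-baleen dev gpu    # ghcr.io/loganylchen/py-baleen:dev-gpu
```

The same pattern applies to a long-SHA tag: pass the full commit SHA as the ref to pin a specific build.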

Pull a published image

# Docker Hub
docker pull btrspg/py-baleen:latest-cpu
docker pull btrspg/py-baleen:latest-gpu      # requires the NVIDIA Container Toolkit
docker pull btrspg/py-baleen:dev-gpu         # latest dev build

# GHCR (public)
docker pull ghcr.io/loganylchen/py-baleen:latest-gpu

Build locally

If you prefer to build from source, or are working from a fork, build the image from the matching Dockerfile:

# CPU
docker build -f Dockerfile.cpu -t py-baleen:cpu .

# GPU
docker build -f Dockerfile.gpu -t py-baleen:gpu .

Both builds are pure Python (no C-extension compilation): each runs pip install baleen, then installs the matching krill wheel (CPU or cu122) from the project index. The krill wheel in the GPU image uses the GPU only when a device is visible at run time; see the verification step below.

Run the pipeline in a container

The entrypoint is baleen, so pass sub-command arguments directly. Mount your data into the container's /data working directory:

# CPU
docker run --rm \
    -v "$PWD":/data \
    py-baleen:cpu run \
        --native-bam native.bam --native-fastq native.fq.gz --native-blow5 native.blow5 \
        --ivt-bam ivt.bam --ivt-fastq ivt.fq.gz --ivt-blow5 ivt.blow5 \
        --ref ref.fa -o results/

# GPU — add --gpus all
docker run --rm --gpus all \
    -v "$PWD":/data \
    py-baleen:gpu run \
        --native-bam native.bam --native-fastq native.fq.gz --native-blow5 native.blow5 \
        --ivt-bam ivt.bam --ivt-fastq ivt.fq.gz --ivt-blow5 ivt.blow5 \
        --ref ref.fa -o results/

File ownership

Add -u $(id -u):$(id -g) to the docker run command so that output files under results/ are owned by your host user rather than root.
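
One way to avoid retyping the user and mount flags is a small shell function. This is only a sketch: baleen_run is a hypothetical name, not something Baleen provides, and it assumes the locally built image tags used above.

```shell
# Hypothetical wrapper (not part of Baleen): run an image as the calling host
# user, with the current directory mounted at the container's /data.
baleen_run() {
    local image="$1"
    shift
    docker run --rm \
        -u "$(id -u):$(id -g)" \
        -v "$PWD":/data \
        "$image" "$@"
}

# Example (requires Docker and the locally built CPU image):
# baleen_run py-baleen:cpu run \
#     --native-bam native.bam --native-fastq native.fq.gz --native-blow5 native.blow5 \
#     --ivt-bam ivt.bam --ivt-fastq ivt.fq.gz --ivt-blow5 ivt.blow5 \
#     --ref ref.fa -o results/
```

The wrapper does not pass --gpus all; for the GPU image, that flag would need to be added to the docker run line inside the function.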

Verify the GPU image sees the device

docker run --rm --gpus all --entrypoint python3 py-baleen:gpu \
    -c "from baleen._dtw import backend, is_available; \
print('backend:', backend(), 'gpu:', is_available())"
# Expected: backend: gpu gpu: True

If it prints backend: cpu, the container cannot see the GPU. Check that the NVIDIA Container Toolkit is installed and that you passed --gpus all.
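
For scripted use, the check above can be turned into an exit status. The sketch below is hedged: gpu_backend_ok is a hypothetical helper, and it assumes that backend() prints as the bare string gpu or cpu, which the expected output above suggests but the source does not state outright.

```shell
# Hypothetical helper (not part of Baleen): succeed only if the image reports
# the "gpu" backend. Returns 2 if the container itself fails to run.
gpu_backend_ok() {
    local image="${1:-py-baleen:gpu}" out
    out=$(docker run --rm --gpus all --entrypoint python3 "$image" \
        -c "from baleen._dtw import backend; print(backend())") || return 2
    [ "$out" = "gpu" ]
}

# Example (requires Docker and a built GPU image):
# if gpu_backend_ok py-baleen:gpu; then
#     echo "GPU visible"
# else
#     echo "GPU not visible"
# fi
```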