Quick Start
This page runs the full pipeline on a native/IVT pair and inspects the output. For a conceptual walkthrough, see the Pipeline Overview.
1. Prepare inputs
You need, for each condition (native and IVT control):
| File | Description |
|---|---|
| BAM (sorted + indexed) | Alignments to the reference transcriptome. |
| FASTQ (.gz) | Basecalled reads. |
| BLOW5 | Raw signal. |
…plus a single reference FASTA (indexed with .fai). See
Inputs for exact requirements and indexing commands.
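Before launching a long run, it can help to confirm that every input and index file is in place. The following is a minimal sketch, not part of baleen itself: `missing_inputs` is a hypothetical helper, the file names are the ones used in the example command in this guide, and the `.bam.bai` index names are an assumption about how the BAMs were indexed.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# File names follow the example command in this guide.
inputs = [
    "native.bam", "native.fq.gz", "native.blow5",
    "ivt.bam", "ivt.fq.gz", "ivt.blow5",
    "ref.fa", "ref.fa.fai",
]
# Assumption: each BAM index sits next to its BAM as <name>.bam.bai.
inputs += ["native.bam.bai", "ivt.bam.bai"]

missing = missing_inputs(inputs)
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All inputs found.")
```

The check only tests for existence; it does not verify that a BAM is sorted or that an index is up to date.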
2. Run the pipeline
baleen run \
--native-bam native.bam \
--native-fastq native.fq.gz \
--native-blow5 native.blow5 \
--ivt-bam ivt.bam \
--ivt-fastq ivt.fq.gz \
--ivt-blow5 ivt.blow5 \
--ref ref.fa \
-o results/
This produces:
| Output | Description |
|---|---|
| results/site_results.tsv | Per-site modification calls (mod ratio, credible interval, p-value, BH-adjusted p-value, effect size, coverage, stoichiometry). |
| results/read_results.bam | Per-read modification probabilities in standard mod-BAM format (MM/ML tags). |
See Outputs for the full column reference and mod-BAM tag layout.
3. Inspect site calls
A minimal significance filter (see Interpreting & Filtering Results for why p-value alone is not enough at high coverage):
# Keep the header row, plus rows with padj < 0.05 (column 8) and effect_size > 0 (column 9)
awk -F'\t' 'NR==1 || ($8 < 0.05 && $9 > 0)' results/site_results.tsv > significant_sites.tsv
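The awk filter above selects columns by position, which silently breaks if the column order differs. A sketch of the same filter keyed on header names is shown here. It assumes the header labels are `padj` and `effect_size` (taken from the comment on the awk command; check the Outputs page for the actual names). `filter_sites` is a hypothetical helper, and the demo rows, including the `site` column, are made up purely to exercise it.

```python
import csv
import io


def filter_sites(rows, padj_col="padj", effect_col="effect_size", alpha=0.05):
    """Yield rows with adjusted p-value below alpha and a positive effect size."""
    for row in rows:
        try:
            padj = float(row[padj_col])
            effect = float(row[effect_col])
        except ValueError:
            # Skip rows whose values are not numeric (e.g. "NA").
            continue
        if padj < alpha and effect > 0:
            yield row


# Made-up demo table; a real run would open results/site_results.tsv instead.
demo = io.StringIO(
    "site\tpadj\teffect_size\n"
    "s1\t0.01\t0.8\n"
    "s2\t0.20\t1.2\n"
    "s3\t0.03\t-0.4\n"
    "s4\tNA\t0.5\n"
)
kept = list(filter_sites(csv.DictReader(demo, delimiter="\t")))
print([row["site"] for row in kept])  # ['s1']
```

A wrong column name raises `KeyError` rather than producing an empty result, which makes a header mismatch visible immediately.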
4. Inspect read-level calls
# Summarise per-read modification calls
modkit summary results/read_results.bam
# Or load into pandas via the Python API
python - <<'PY'
from baleen import load_read_results
df = load_read_results("results/read_results.bam")
print(df.head())
PY
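When reading the mod-BAM directly, note that the ML tag stores probabilities as integers from 0 to 255. In the SAM tags specification, a value N stands for the probability range N/256 to (N+1)/256. The sketch below converts raw ML values to the midpoint of that range; `ml_to_probability` is a hypothetical helper, and whether baleen's own loader uses the midpoint or another convention is not stated here.

```python
def ml_to_probability(ml_values):
    """Convert raw ML tag integers (0-255) to bin-midpoint probabilities."""
    probabilities = []
    for value in ml_values:
        if not 0 <= value <= 255:
            raise ValueError(f"ML value out of range: {value}")
        # Value N covers [N/256, (N+1)/256); report the midpoint of that bin.
        probabilities.append((value + 0.5) / 256)
    return probabilities


print(ml_to_probability([0, 127, 255]))
```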
Common adjustments
# Use 16 workers and cap each contig at 100 reads per condition
baleen run ... --threads 16 --subsample-n 100
# Restrict to a few transcripts
baleen run ... --target ENST00000001,ENST00000002
# Force CPU (no GPU available)
baleen run ... --no-cuda
# Resume an interrupted run
baleen run ... -o results/ --resume
The full flag list is in the CLI Reference.
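When runs are launched from a script, the adjustments above can be assembled programmatically. This is a sketch only: `build_run_command` is a hypothetical helper, it uses just the flags shown on this page, and whether every combination of them is accepted together is an assumption to check against the CLI Reference.

```python
import shlex


def build_run_command(input_args, threads=None, subsample_n=None,
                      targets=None, use_cuda=True, resume=False):
    """Build a `baleen run` argument list from the flags shown in this guide."""
    cmd = ["baleen", "run", *input_args]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if subsample_n is not None:
        cmd += ["--subsample-n", str(subsample_n)]
    if targets:
        # Transcript IDs are passed as one comma-separated value.
        cmd += ["--target", ",".join(targets)]
    if not use_cuda:
        cmd.append("--no-cuda")
    if resume:
        cmd.append("--resume")
    return cmd


# Required inputs, as in the example command in section 2.
inputs = [
    "--native-bam", "native.bam",
    "--native-fastq", "native.fq.gz",
    "--native-blow5", "native.blow5",
    "--ivt-bam", "ivt.bam",
    "--ivt-fastq", "ivt.fq.gz",
    "--ivt-blow5", "ivt.blow5",
    "--ref", "ref.fa",
    "-o", "results/",
]
cmd = build_run_command(inputs, threads=16, subsample_n=100)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with unusual file names.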
Python API equivalent
from baleen import run_pipeline_streaming
output_paths, metadata = run_pipeline_streaming(
    native_bam="native.bam",
    native_fastq="native.fq.gz",
    native_blow5="native.blow5",
    ivt_bam="ivt.bam",
    ivt_fastq="ivt.fq.gz",
    ivt_blow5="ivt.blow5",
    ref_fasta="ref.fa",
    output_dir="results/",
    threads=8,
)
print(output_paths["site_tsv"]) # results/site_results.tsv
print(output_paths["read_bam"]) # results/read_results.bam
print(output_paths["n_significant"]) # number of padj < 0.05 sites
Return shape
run_pipeline_streaming returns a 2-tuple (output_paths, metadata).
output_paths is a dict with keys site_tsv, read_bam,
per_contig_dir, n_total_sites, and n_significant.
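One way to consume the returned dict is to check its keys and report the fraction of significant sites. The sketch below assumes `n_total_sites` and `n_significant` are integer counts; `summarize_outputs` is a hypothetical helper, and the example dict holds placeholder values (including the `per_contig_dir` path), not output from a real run.

```python
EXPECTED_KEYS = {
    "site_tsv", "read_bam", "per_contig_dir",
    "n_total_sites", "n_significant",
}


def summarize_outputs(output_paths):
    """Return a one-line summary of a run from its output_paths dict."""
    missing = EXPECTED_KEYS - output_paths.keys()
    if missing:
        raise KeyError(f"output_paths is missing keys: {sorted(missing)}")
    total = output_paths["n_total_sites"]
    significant = output_paths["n_significant"]
    # Guard against division by zero when no sites were tested.
    fraction = significant / total if total else 0.0
    return f"{significant} of {total} sites significant ({fraction:.1%})"


# Placeholder values for illustration only.
example = {
    "site_tsv": "results/site_results.tsv",
    "read_bam": "results/read_results.bam",
    "per_contig_dir": "results/per_contig/",
    "n_total_sites": 1000,
    "n_significant": 25,
}
print(summarize_outputs(example))  # 25 of 1000 sites significant (2.5%)
```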