Outputs

A run produces two primary artifacts in the output directory:

| File | Level | Description |
|---|---|---|
| site_results.tsv | per-site | Tab-separated modification calls with statistics. |
| read_results.bam | per-read | Standard mod-BAM with MM/ML/RG tags. |

A per_contig/ directory of intermediate slices is also written during the run and removed on completion unless you pass --keep-intermediate.

site_results.tsv

One row per tested genomic position. Columns, in order:

| # | Column | Type | Description |
|---|---|---|---|
| 1 | contig | str | Reference/transcript name. |
| 2 | position | int | 0-based position on the contig. |
| 3 | kmer | str | Reference k-mer centred on the position. |
| 4 | mod_ratio | float | MAP estimate of modification stoichiometry (Beta-Binomial). |
| 5 | ci_low | float | 2.5th percentile of the Beta posterior. |
| 6 | ci_high | float | 97.5th percentile of the Beta posterior. |
| 7 | pvalue | float | One-sided Fisher's exact test on binary modified/unmodified calls (native vs IVT). |
| 8 | padj | float | Benjamini-Hochberg FDR-adjusted p-value across all tested sites. |
| 9 | effect_size | float | median(native p_mod_hmm) − median(IVT p_mod_hmm). |
| 10 | n_native | int | Native reads covering the position. |
| 11 | n_ivt | int | IVT reads covering the position. |
| 12 | mean_p_mod | float | Mean of native p_mod_hmm. |
| 13 | stoichiometry | float | Fraction of native reads with p_mod_hmm > 0.5. |
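As an illustration of how the read-derived columns relate to per-read probabilities, here is a minimal sketch that recomputes effect_size, mean_p_mod, and stoichiometry from two lists of p_mod_hmm values. The function name summarise_site and the example values are made up for this sketch; it follows the column definitions in the table and is not Baleen's own code.

```python
from statistics import mean, median


def summarise_site(native, ivt, threshold=0.5):
    """Recompute read-derived columns for one position (illustrative helper)."""
    return {
        "n_native": len(native),
        "n_ivt": len(ivt),
        # median(native p_mod_hmm) - median(IVT p_mod_hmm)
        "effect_size": median(native) - median(ivt),
        # mean of native p_mod_hmm
        "mean_p_mod": mean(native),
        # fraction of native reads with p_mod_hmm > 0.5
        "stoichiometry": sum(p > threshold for p in native) / len(native),
    }


# Made-up per-read probabilities for a single position.
native_reads = [0.9, 0.8, 0.7, 0.2, 0.1]
ivt_reads = [0.1, 0.2, 0.1, 0.3]

site = summarise_site(native_reads, ivt_reads)
print(site)
```

Note that mod_ratio and the credible interval come from the Beta posterior and are not reproduced here, since the prior is not described in this section.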

p-value vs effect size

pvalue and padj answer “is native different from IVT?”, and at high coverage even tiny differences become significant. effect_size, mod_ratio, and the credible interval (ci_low, ci_high) tell you how large the difference is and how certain the estimate is. Use both; see Interpreting & Filtering Results.
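To make the coverage effect concrete, the sketch below computes a one-sided Fisher's exact p-value for the same 60% vs 50% modified-call split at two coverages. It assumes the tested alternative is "native has more modified calls than IVT" and uses made-up counts; the helper fisher_one_sided is written for this example and is not taken from Baleen.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided (greater) Fisher's exact test on a 2x2 table.

    a, b: native modified / unmodified calls
    c, d: IVT modified / unmodified calls
    Returns P(native modified count >= a) under the hypergeometric null.
    """
    n_native = a + b
    n_mod = a + c
    total = a + b + c + d
    upper = min(n_mod, n_native)
    # Sum the hypergeometric tail with exact integer arithmetic.
    tail = sum(
        comb(n_mod, k) * comb(total - n_mod, n_native - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_native)


# Same proportions (60% vs 50% modified), very different coverage.
p_low_cov = fisher_one_sided(6, 4, 5, 5)
p_high_cov = fisher_one_sided(600, 400, 500, 500)
print(p_low_cov, p_high_cov)
```

The difference in proportions is identical in both cases, yet only the high-coverage table yields a small p-value, which is why the effect-size columns are needed alongside pvalue and padj.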

Loading in pandas

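import pandas as pd
# Load the per-site table (tab-separated, one row per tested position).
df = pd.read_csv("results/site_results.tsv", sep="\t")
# Keep sites that pass FDR control and are more modified in native than in IVT.
sig = df[(df["padj"] < 0.05) & (df["effect_size"] > 0)]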
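Where pandas is not available, a similar filter can be sketched with the standard library. The example below combines significance with effect size and the lower credible bound, as the p-value vs effect size note suggests. The thresholds (0.05, 0.1, 0.1), the function name confident_sites, and the inline rows are illustrative assumptions, not recommended defaults.

```python
import csv
import io

# Tiny made-up excerpt containing only the columns used by the filter.
example_tsv = (
    "contig\tposition\tpadj\teffect_size\tci_low\n"
    "tx1\t10\t0.001\t0.40\t0.25\n"
    "tx1\t57\t0.001\t0.02\t0.01\n"
    "tx2\t8\t0.300\t0.35\t0.05\n"
)


def confident_sites(handle, max_padj=0.05, min_effect=0.1, min_ci_low=0.1):
    """Return (contig, position) for rows passing all three cut-offs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["contig"], int(row["position"]))
        for row in reader
        if float(row["padj"]) < max_padj
        and float(row["effect_size"]) >= min_effect
        and float(row["ci_low"]) >= min_ci_low
    ]


hits = confident_sites(io.StringIO(example_tsv))
print(hits)
```

On a real run, pass an open file handle for results/site_results.tsv instead of the in-memory string.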

read_results.bam

A standard mod-BAM: the original alignments copied through with per-read modification probabilities encoded in SAM tags. Compatible with modkit, modbamtools, and IGV.

| Tag | Type | Meaning |
|---|---|---|
| MM:Z | string | Delta-encoded modified-base positions. Baleen uses N+? (an unknown modification on any base). |
| ML:B:C | uint8 array | Per-position modification probability, quantised to 0–255 (round(p_mod_hmm × 255)). |
| RG:Z | string | Read group: native or ivt. |

The p_mod_hmm written here is the final per-read probability from the V3 gap-aware HMM stage.
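The 0–255 quantisation means a probability read back from ML is only accurate to about half a step (roughly 0.002). A minimal sketch of the round trip, assuming the round(p_mod_hmm × 255) encoding stated above and its direct inverse (the function names are made up for this example):

```python
def quantise(p_mod):
    """Encode a probability in [0, 1] as an ML byte, per round(p * 255)."""
    if not 0.0 <= p_mod <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return int(round(p_mod * 255))


def dequantise(ml_value):
    """Map an ML byte back to a probability (inverse of the encoding above)."""
    if not 0 <= ml_value <= 255:
        raise ValueError("ML values are unsigned 8-bit integers")
    return ml_value / 255


original = 0.73
stored = quantise(original)
recovered = dequantise(stored)
print(stored, recovered, abs(recovered - original))
```

Thresholds applied to read-level probabilities recovered from the BAM should allow for this rounding.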

Reading read-level calls

# Quick summary with modkit (shell)
modkit summary results/read_results.bam

# Load read-level calls (Python)
from baleen import load_read_results
df = load_read_results("results/read_results.bam")
# columns: contig, position, read_name, is_native, p_mod_hmm
print(df.head())

load_read_results_iter yields the same records one at a time for streaming over large files. Both reconstruct p_mod_hmm by parsing the MM/ML tags (falling back to a legacy MP:f tag if MM is absent).
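For readers who want to see what the delta encoding in MM looks like, here is a simplified sketch that turns an MM value into 0-based read positions and pairs them with ML probabilities. It is not Baleen's loader: decode_mm_positions is a made-up helper, the tag values are invented, and it assumes the base code N (so every base counts towards the skip) and a forward-strand read. Mapping read positions to contig positions additionally requires the alignment, which is omitted here.

```python
def decode_mm_positions(mm_value):
    """Return {group header: [0-based read positions]} for an MM tag value."""
    result = {}
    for group in mm_value.split(";"):
        if not group:
            continue  # trailing ';' leaves an empty chunk
        header, *deltas = group.split(",")
        positions = []
        pos = -1
        for delta in deltas:
            # Each number is how many candidate bases to skip before the next call.
            pos += int(delta) + 1
            positions.append(pos)
        result[header] = positions
    return result


# Invented tag values for a single read.
mm_tag = "N+?,5,12,0;"
ml_tag = [204, 13, 255]

calls = decode_mm_positions(mm_tag)
per_base = [(pos, q / 255) for pos, q in zip(calls["N+?"], ml_tag)]
print(per_base)
```

In practice load_read_results or modkit should be preferred; the sketch only shows why positions in MM cannot be read off directly without accumulating the skips.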

Intermediate files (per_contig/)

During the streaming run each worker writes <contig>.tsv (rows only) and <contig>.bam into per_contig/, which are then merged into the two top-level outputs. A .run_params.json fingerprint of the inputs and run-affecting parameters is written there to support --resume. Pass --keep-intermediate to retain this directory after the merge.
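The general idea behind a parameter fingerprint can be sketched as follows: hash a canonical serialisation of the inputs and parameters, and only reuse intermediate files when the stored and current fingerprints match. The keys shown, the use of SHA-256, and the function names are all hypothetical; the actual contents of .run_params.json are not specified in this section.

```python
import hashlib
import json


def fingerprint(params):
    """Hash a parameter dict in a key-order-independent way."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_resume(stored_params, current_params):
    """Intermediate results are reusable only if nothing run-affecting changed."""
    return fingerprint(stored_params) == fingerprint(current_params)


# Hypothetical parameter sets; the key names are placeholders.
previous = {"native_bam": "native.bam", "ivt_bam": "ivt.bam", "min_coverage": 20}
same_run = {"min_coverage": 20, "ivt_bam": "ivt.bam", "native_bam": "native.bam"}
changed = {"native_bam": "native.bam", "ivt_bam": "ivt.bam", "min_coverage": 30}

print(can_resume(previous, same_run), can_resume(previous, changed))
```

Sorting keys before hashing keeps the comparison stable when the same parameters are supplied in a different order.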