Interpreting & Filtering Results¶
site_results.tsv gives you several statistics per position. Choosing how to
filter them is the difference between a clean shortlist and a flood of
borderline calls. This page explains what each column means for filtering and
why p-value alone is not enough at high coverage.
Why p-value alone fails at high coverage¶
The site pvalue is a one-sided Fisher's exact test asking “is the
modified fraction higher in native than in IVT?”. Like any frequentist test,
its power grows with sample size. At deep coverage even a biologically trivial
difference — say 2% vs 1% modified reads — becomes "significant".
So padj < 0.05 answers whether native differs from IVT, but says nothing
about how much. Pair it with an effect-size or stoichiometry threshold.
The columns that matter for filtering¶
| Column | Use as a filter to… |
|---|---|
padj |
Control the false-discovery rate across all tested sites (BH-corrected). |
effect_size |
Require a minimum native−IVT separation: median(native p_mod_hmm) − median(IVT p_mod_hmm). Drops "significant but tiny" sites. |
mod_ratio + ci_low/ci_high |
Require a minimum stoichiometry with confidence: filter on ci_low to demand the lower bound clears a threshold. |
stoichiometry |
Fraction of native reads called modified (p_mod_hmm > 0.5); a coarse alternative to mod_ratio. |
n_native, n_ivt |
Drop low-coverage sites where any estimate is noisy. |
Recommended filter¶
A robust shortlist combines significance and effect size and coverage:
awk -F'\t' '
NR==1 || ($8 < 0.05 && $9 > 0.1 && $10 >= 30 && $11 >= 30)
' results/site_results.tsv > significant_sites.tsv
# padj<0.05 effect>0.1 n_native>=30 n_ivt>=30
Column indices: $8 = padj, $9 = effect_size, $10 = n_native,
$11 = n_ivt. See the full schema.
In pandas:
import pandas as pd
df = pd.read_csv("results/site_results.tsv", sep="\t")
hits = df[
(df.padj < 0.05)
& (df.effect_size > 0.1)
& (df.ci_low > 0.1) # lower CI bound of mod_ratio clears 10%
& (df.n_native >= 30)
& (df.n_ivt >= 30)
]
What you can and cannot filter post-hoc¶
| Want to change | Possible on the existing TSV? |
|---|---|
Stricter padj, effect_size, mod_ratio, ci_low, stoichiometry, coverage |
Yes — pure row filtering, no re-run. |
A different --mod-threshold (recounts modified reads → changes mod_ratio, pvalue, stoichiometry) |
No — re-run baleen aggregate with the new threshold. |
| Different HMM parameters | No — re-run baleen aggregate --hmm-params .... |
| Different DTW / subsampling / depth | No — full baleen run. |
The cheap path for threshold sweeps is baleen aggregate, which reuses the saved
DTW results and only re-does HMM + aggregation. See the
CLI Reference.
Benchmarking confidence¶
At high coverage, controlling false positives needs more than padj:
- Fix the depth. Use
--subsample-nso every site is tested at comparable read counts; this prevents power inflation from a few ultra-deep positions. - Calibrate a threshold against ground truth. If you have known modified
sites, sweep
effect_size/mod_ratiocutoffs and pick the one at your target precision.benchmarks/bench.pycomputes AUPRC/AUROC against a known-mods file. - Run an IVT-vs-IVT negative control. Feed two IVT replicates as native/IVT; every call is a false positive, giving an empirical FPR to calibrate against.