Building a Reproducible m6A Detection Pipeline with Snakemake and GLORI-seq

Epitranscriptomics, the study of chemical modifications on RNA, is one of the most exciting frontiers in molecular biology. Among the more than 170 known RNA modifications, N6-methyladenosine (m6A) stands out as the most abundant internal modification on eukaryotic messenger RNA. It affects everything from mRNA stability and splicing to translation efficiency, and its dysregulation has been linked to cancer, neurological disorders, and developmental defects.

But detecting m6A at single-nucleotide resolution has historically been difficult. Most earlier techniques (like MeRIP-seq) could only map m6A to regions of ~100-200 nucleotides. That changed with GLORI, a chemical biology method that enables quantitative, single-nucleotide-resolution mapping of m6A. During my PhD, I needed a reliable, reproducible pipeline to process GLORI-seq data — so I built snakemake-epitranscriptome.

Why Snakemake?

If you have ever tried to process next-generation sequencing data with a collection of ad-hoc shell scripts, you know the pain: scripts break when you move them between machines, intermediate files get overwritten, and reproducing a published analysis becomes nearly impossible. Snakemake solves this by expressing the analysis as a directed acyclic graph (DAG) of rules, where each rule defines its inputs, outputs, and how to produce one from the other. Snakemake automatically determines which steps need to run, handles parallelization, and ensures reproducibility.
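To make the DAG idea concrete, here is a toy sketch in plain Python. It is not Snakemake's scheduler and not code from my pipeline: the file names are hypothetical, and the logic ignores things Snakemake really checks (timestamps, wildcards, working backward from a target). It only shows how declaring inputs and outputs is enough to derive an execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each output file maps to the input files it needs.
rules = {
    "trimmed.fastq": ["raw.fastq"],
    "genome.index": ["genome.fa"],
    "aligned.bam": ["trimmed.fastq", "genome.index"],
    "m6a_sites.tsv": ["aligned.bam"],
}


def execution_order(rules):
    """Return all files in an order where every input precedes its output."""
    return list(TopologicalSorter(rules).static_order())


def outputs_to_build(rules, existing):
    """Return rule outputs that are not on disk yet, in a valid build order.

    Simplification: only checks for presence, not for outdated files.
    """
    return [f for f in execution_order(rules) if f in rules and f not in existing]


if __name__ == "__main__":
    print(execution_order(rules))
    print(outputs_to_build(rules, {"raw.fastq", "genome.fa", "trimmed.fastq"}))
```

Because the order falls out of the declared dependencies, nothing in the rules themselves has to say "run trimming first".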

The GLORI Method: A Quick Primer

GLORI (glyoxal and nitrite-mediated deamination of unmethylated adenosines) works on a clever principle. The treatment chemically deaminates unmodified adenosine (A) residues to inosine (I), which is read as guanosine (G) during sequencing. m6A-modified adenosines are protected from this conversion, so reads that still show A after treatment come from methylated molecules. By comparing the A-to-G conversion rate at each position between treated and control samples, you can quantify the m6A fraction at single-nucleotide resolution.
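The arithmetic behind that readout is simple enough to sketch. The function below is my own illustration, not GLORI-tools code: it assumes you already have per-site counts of reads showing A and reads showing G at a reference adenosine in the treated sample, and it does not correct for incomplete conversion.

```python
def m6a_level(a_reads, g_reads):
    """Estimate the methylated fraction at one reference adenosine.

    a_reads: treated-sample reads that still show A (protected, so likely m6A)
    g_reads: treated-sample reads that show G (converted, so unmethylated)
    Returns None when the site has no coverage.
    """
    total = a_reads + g_reads
    if total == 0:
        return None
    return a_reads / total


if __name__ == "__main__":
    # 30 protected reads out of 100 suggests roughly 30% methylation.
    print(m6a_level(30, 70))
```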

This means the bioinformatics pipeline needs to do something unusual: build a modified reference genome where all A positions are converted to G, simulating what the chemical treatment does at the sequence level.

The Pipeline

The workflow is organized into three major stages:

1. Data Preparation

Raw FASTQ files go through quality control, adapter trimming, and deduplication. Because GLORI-seq often uses unique molecular identifiers (UMIs) to distinguish true biological duplicates from PCR artifacts, the pipeline includes proper UMI handling and extraction to avoid inflated read counts.
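As a rough illustration of why UMIs matter, here is a minimal sketch of sequence-level deduplication. It assumes the UMI sits at the 5' end of each read and has a fixed length; both are assumptions for this example, and the function names are hypothetical rather than taken from the pipeline or any UMI tool.

```python
def extract_umi(read, umi_len):
    """Split a read into (UMI, remaining insert sequence)."""
    return read[:umi_len], read[umi_len:]


def deduplicate(reads, umi_len):
    """Keep one read per (UMI, insert) pair, preserving first-seen order.

    Identical inserts with different UMIs are kept, because they come
    from different original molecules rather than from PCR copies.
    """
    seen = set()
    unique = []
    for read in reads:
        key = extract_umi(read, umi_len)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    reads = ["AAAACGTACGT", "AAAACGTACGT", "CCCCCGTACGT"]
    # The first two reads are PCR duplicates; the third is a distinct molecule.
    print(deduplicate(reads, umi_len=4))
```

Without the UMI, all three reads above would collapse into one and the site's coverage would be undercounted; without any deduplication, the PCR copy would be counted twice.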

2. Index Building

This is where things get interesting. Instead of aligning reads to the standard reference genome, the pipeline builds AG-conversion references — modified versions of the genome where adenosines are replaced with guanosines. This simulates the chemical conversion that happens during GLORI treatment and allows the aligner to properly map converted reads. Separate indices are built for the treated and control conditions.
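The conversion itself is a simple string operation. The sketch below is an illustration under my own assumptions, not the pipeline's code: it treats the reference as a list of FASTA lines, and it adds a T-to-C variant on the assumption that reads from the reverse strand show the conversion as T-to-C in forward coordinates (the same trick three-letter aligners use for bisulfite data).

```python
def convert_forward(seq):
    """A-to-G conversion, as seen by reads from the forward strand."""
    return seq.replace("A", "G").replace("a", "g")


def convert_reverse(seq):
    """T-to-C conversion: A-to-G on the reverse strand, in forward coordinates."""
    return seq.replace("T", "C").replace("t", "c")


def convert_fasta(lines, convert):
    """Apply a conversion to sequence lines only, leaving '>' headers intact."""
    return [line if line.startswith(">") else convert(line) for line in lines]


if __name__ == "__main__":
    fasta = [">chr1 A test", "GATTACA"]
    print(convert_fasta(fasta, convert_forward))
    print(convert_fasta(fasta, convert_reverse))
```

Skipping header lines matters more than it looks: a sequence name containing an A would otherwise be rewritten too, and the names would no longer match the original reference when mapping coordinates back.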

3. m6A Calling

With reads aligned to the appropriate converted references, the pipeline uses GLORI-tools to call m6A sites. At each adenosine position, it compares the conversion rates between treated and control samples, applies statistical testing, and outputs a list of high-confidence m6A sites with their modification levels.
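To give a feel for what "statistical testing" can mean here, the sketch below uses a one-sided binomial test: given a background rate at which unmethylated A fails to convert, how surprising is the number of unconverted reads at this site? This is one common choice and not necessarily the test GLORI-tools applies. The thresholds (alpha, min_cov) and the background value are hypothetical, and a real analysis would also correct for multiple testing.

```python
from math import comb


def binomial_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def call_site(a_reads, total_reads, background, alpha=0.05, min_cov=10):
    """Test one adenosine site for methylation above the background.

    a_reads: reads still showing A after treatment
    total_reads: all reads covering the site (A + G)
    background: assumed probability that an unmethylated A escapes conversion
    Returns None for sites below the coverage threshold.
    """
    if total_reads < min_cov:
        return None
    p_value = binomial_tail(a_reads, total_reads, background)
    return {
        "level": a_reads / total_reads,
        "p_value": p_value,
        "is_m6a": p_value < alpha,
    }


if __name__ == "__main__":
    print(call_site(30, 100, background=0.01))  # far above background
    print(call_site(1, 100, background=0.01))   # consistent with background
```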

Published in the Snakemake Workflow Catalog

One of my goals was to make this pipeline easy for others to use. I am happy to say that snakemake-epitranscriptome has been published in the Snakemake Workflow Catalog, which means anyone can find it, clone it, and run it on their own data with minimal configuration. Getting it accepted required following Snakemake’s best practices for workflow structure, documentation, and testing — a worthwhile exercise in itself.

Why Reproducibility Matters

In a field as young as epitranscriptomics, reproducibility is everything. Different labs use slightly different parameters for trimming, alignment, and statistical thresholds. Without a standardized pipeline, comparing results across studies becomes an apples-to-oranges exercise. By packaging the entire analysis into a Snakemake workflow with version-pinned dependencies, I wanted to make it possible for anyone to reproduce (and extend) the analysis with confidence.

What I Learned

Building this pipeline taught me a lot on several fronts:

  • GLORI method theory: Understanding the chemistry behind A-to-G conversion and how it translates into bioinformatics decisions (like building converted reference genomes) gave me a much deeper appreciation for how wet-lab methods constrain computational analysis.
  • Modified reference genomes: Constructing AG-converted references is not something most standard pipelines do. It required careful handling of genome indexing, strand information, and coordinate mapping back to the original reference.
  • Snakemake best practices: Structuring a workflow for the Snakemake Catalog taught me about rule modularization, configuration schemas, conda environment management per rule, and writing meaningful integration tests.
  • Publishing to workflow catalogs: The process of preparing a workflow for public distribution — documentation, example data, continuous integration testing — was itself a valuable exercise in scientific software engineering.

If you are working with GLORI-seq data or just curious about m6A detection, check out snakemake-epitranscriptome. Issues and pull requests are welcome.