Introduction

Recording everything.

Introduction — RNA-seq analysis

RNA sequencing (RNA‑seq) measures transcriptome-wide RNA abundance and sequence, enabling discovery and quantification of expressed genes, isoforms, and sequence variants. This chapter provides a concise practical introduction to the computational steps, core concepts, and best practices used in typical RNA‑seq studies.

Goals

Quantify gene and transcript expression across samples
Identify differentially expressed genes and pathways
Detect alternative splicing, fusion transcripts, and RNA editing when relevant
Produce reproducible, well-documented results ready for interpretation

Key concepts

Reads (FASTQ), alignments (BAM), and quantified counts/tpm
Library type: strandedness, read length, single/paired-end
Normalization to account for sequencing depth and composition
Experimental design and covariates (batch, replication)

Typical computational workflow

Project setup: metadata, sample sheet, versioned code and environment (conda/Docker/Singularity)
Raw-data QC: fastp, MultiQC
Alignment: STAR
Quantification: featureCounts, S
Normalization and exploratory QC: PCA, sample clustering, library size checks
Differential expression: DESeq2, edgeR, limma-voom
Functional analysis: GSEA, pathway enrichment, gene-set visualization
Optional: isoform analysis, fusion detection, variant calling
Reporting: figures, tables, reproducible notebooks/workflows

Outputs and quality checks

Per-sample QC reports and alignment metrics
Count matrix and normalized expression table
Differential expression results with effect sizes and adjusted p-values
Diagnostic plots: MA, PCA, heatmaps, volcano plots

Best practices

Record sample metadata and processing parameters early
Use workflow managers (Snakemake, Nextflow) and containerized environments
Include replication and model covariates in statistical tests
Validate key findings with orthogonal methods if possible

This chapter will expand each step with commands, recommended tools, example workflows, and interpretation guidance.

Raw Fastq Quality Control

To perform quality control on raw FASTQ files using fastp, follow these steps:

Step 1: Install fastp

Using conda or mamba to install fastp:

# conda
conda install -c bioconda fastp
# mamba
mamba install -c bioconda fastp

Step 2: Run fastp for Quality Control

For paired-end reads, run fastp with the following command:

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
      -h sample_fastp.html -j sample_fastp.json \
      --thread 8

Parameters:

-i: Input R1 FASTQ file
-I: Input R2 FASTQ file (for paired-end)
-o: Output R1 FASTQ file after QC
-O: Output R2 FASTQ file after QC
-h: HTML report file
-j: JSON report file
--thread: Number of threads to use

Step 3: Review Quality Control Report

fastp automatically performs several QC steps including:

Adapter trimming
Quality filtering
Length filtering
Per-read quality pruning
Base correction for overlapped paired-end reads

Open the HTML report (sample_fastp.html) to review detailed quality metrics before and after filtering.

Step 4: Additional Options (Optional)

You can customize fastp with additional parameters:

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
      -h sample_fastp.html -j sample_fastp.json \
      --qualified_quality_phred 20 \
      --length_required 50 \
      --thread 8

--qualified_quality_phred: Minimum quality value (default: 15)
--length_required: Minimum read length to keep (default: 15)

Paper: https://academic.oup.com/bioinformatics/article/34/17/i884/5093234

RNA-seq Mapping

To map paired-end (PE) FASTQ reads to a reference genome using STAR (Spliced Transcripts Alignment to a Reference) (Paper: https://academic.oup.com/bioinformatics/article/29/1/15/272537?login=false), follow these steps:

Step 1: Install STAR

Using conda or mamba to install STAR is your best choice:

# conda
conda install -c bioconda star
# mamba
mamba install -c bioconda star

Step 2: Generate Genome Index

Before mapping, you need to generate a genome index. This step is done once per reference genome. (You should know this step if you encounter similar tasks in the future, but this time, I will build the index for you.)

STAR --runMode genomeGenerate \
     --genomeDir /path/to/genomeDir \
     --genomeFastaFiles /path/to/genome.fa \
     --sjdbGTFfile /path/to/annotations.gtf \
     --runThreadN 8
     
# Above is the general command, I will also show you how I run the index building
cd /home/shared/reference # move into the reference folder
STAR --runMode genomeGenerate \
		--genomeDir STAR_INDEX \
		--genomeFastaFiles ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
		--sjdbGTFfile ./Homo_sapiens.GRCh38.104.gtf \
		--runThreadN 8

Parameters:

--genomeDir: Directory where the genome index will be stored
--genomeFastaFiles: Path to the reference genome FASTA file
--sjdbGTFfile: Path to the gene annotation GTF file
--runThreadN: Number of threads to use

So the index is in /home/shared/reference/STAR_INDEX, you can use it directly in the later steps.

Step 3: Map Paired-End Reads

Run STAR to align your paired-end FASTQ files to the indexed genome:

STAR --genomeDir /path/to/genomeDir \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix output_prefix_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 
     
# For the mapping step, I will just show you the above command, you need to work
# out how the command works for your samples
# You can use STAR --help for help.

Parameters:

--genomeDir: Path to the genome index directory
--readFilesIn: Paired-end FASTQ files (R1 and R2)
--readFilesCommand zcat: For compressed .gz files (omit if uncompressed)
--outFileNamePrefix: Prefix for output files
--outSAMtype BAM SortedByCoordinate: Output sorted BAM file
--runThreadN: Number of threads

Step 4: Output Files

STAR generates several output files:

Aligned.sortedByCoord.out.bam: Sorted BAM file with aligned reads
Log.final.out: Summary mapping statistics
Log.out: Detailed log of the run
Log.progress.out: Progress information
SJ.out.tab: Splice junction information

Step 5: Index BAM File (Optional)

For downstream analysis, you may need to index the BAM file:

samtools index output_prefix_Aligned.sortedByCoord.out.bam

QC on the alignments

To perform quality control on RNA-seq alignments using QualiMap, follow these steps:

Step 1: Install QualiMap

Using conda or mamba to install QualiMap:

# conda
conda install -c bioconda qualimap
# mamba
mamba install -c bioconda qualimap

Step 2: Run QualiMap on BAM File

Use QualiMap's RNA-seq mode to analyze your sorted BAM file:

qualimap rnaseq \
    -bam output_prefix_Aligned.sortedByCoord.out.bam \
    -gtf /path/to/annotations.gtf \
    -outdir qualimap_results \
    --java-mem-size=4G

Parameters:

-bam: Input sorted BAM file from STAR alignment
-gtf: Path to the gene annotation GTF file (same as used in STAR indexing)
-outdir: Output directory for QualiMap results
--java-mem-size: Amount of memory to allocate (adjust based on your system)

Step 3: Additional Options

You can customize QualiMap with additional parameters:

qualimap rnaseq \
    -bam output_prefix_Aligned.sortedByCoord.out.bam \
    -gtf /path/to/annotations.gtf \
    -outdir qualimap_results \
    -pe \
    --java-mem-size=4G

-pe: Specify paired-end mode
-p: Sequencing protocol (strand-specific or non-strand-specific)

Step 4: Review Quality Control Report

QualiMap generates an HTML report with comprehensive quality metrics including:

Coverage profile along genes
Junction analysis
Reads genomic origin (exonic, intronic, intergenic)
Coverage per chromosome
Insert size distribution (for paired-end reads)
Mapping quality distribution

Open the HTML report in the output directory (qualimap_results/qualimapReport.html) to review detailed quality metrics of your RNA-seq alignments.

Paper: https://academic.oup.com/bioinformatics/article/28/20/2678/206551

Quantification

To quantify gene expression from RNA-seq alignments using featureCounts, follow these steps:

Step 1: Install featureCounts

featureCounts is part of the Subread package. Install it using conda or mamba:

# conda
conda install -c bioconda subread
# mamba
mamba install -c bioconda subread

Step 2: Run featureCounts

Use featureCounts to count reads mapped to genomic features (genes) from your BAM file:

featureCounts -a /path/to/annotations.gtf \
    -o gene_counts.txt \
    -t exon \
    -g gene_id \
    -T 8 \
    -p \
    output_prefix_Aligned.sortedByCoord.out.bam

Parameters:

-a: Path to the gene annotation GTF file
-o: Output file name for the count matrix
-t: Feature type to count (default: exon)
-g: Attribute type in GTF file to use as feature ID (default: gene_id)
-T: Number of threads to use
-p: Specify that input data is paired-end

Step 3: Additional Options

You can customize featureCounts with additional parameters for strand-specific RNA-seq:

featureCounts -a /path/to/annotations.gtf \
    -o gene_counts.txt \
    -t exon \
    -g gene_id \
    -T 8 \
    -p \
    -s 2 \
    output_prefix_Aligned.sortedByCoord.out.bam

-s: Specify strand-specificity (0: unstranded, 1: stranded, 2: reversely stranded)
-M: Count multi-mapping reads
-O: Assign reads to overlapping features
--fraction: Assign fractional counts to features

Step 4: Output Files

featureCounts generates two main output files:

gene_counts.txt: Tab-delimited file with gene counts per sample
gene_counts.txt.summary: Summary statistics including assigned, unassigned, and ambiguous reads

Step 5: Review Count Results

The output count file contains columns for:

Geneid: Gene identifier
Chr: Chromosome
Start: Gene start position
End: Gene end position
Strand: Gene strand
Length: Gene length
Sample columns: Read counts for each BAM file

Review the summary file to check the proportion of successfully assigned reads and identify potential issues with your alignment or annotation.

Paper: https://academic.oup.com/bioinformatics/article/30/7/923/232889

DEG analysis with DESeq2 (input: featureCounts output)

Summary: create a conda R environment with RStudio, install DESeq2, then run a reproducible R script that reads featureCounts output, performs QC, differential expression, shrinkage, and exports results.

Please refer to this tutorial DESeq2

Step1 Create conda env and install R / RStudio

Run in a terminal:

conda install -c conda-forge -c bioconda r-base=4.2 rstudio r-essentials -y
# launch RStudio (optional)
rstudio

Note: Bioconductor packages are best installed from within R using BiocManager (below).

Step2 Install R packages (from inside R or RStudio)

Open R and run:

install.packages("BiocManager")
BiocManager::install(c("DESeq2","apeglm","tximport","pheatmap","RColorBrewer","vsn"))

DESeq2: core DEA
apeglm: LFC shrinkage
tximport: only if importing transcript-level results (not needed for featureCounts gene counts)
pheatmap / RColorBrewer / vsn: plotting / normalization helpers

Step3 Prepare featureCounts input

Try by yourself Finally generate a counts_file (featureCounts_counts.txt) for next analysis

Step4 FeatureCounts merged file

Save as deseq2_featurecounts_DEA.R and edit counts_file and coldata accordingly.

# deseq2_featurecounts_DEA.R
library(DESeq2)
library(apeglm)
library(pheatmap)
library(RColorBrewer)
library(vsn)

# ---- user settings ----
counts_file <- "featureCounts_counts.txt" # merged output
# Define sample table: sample names (matching cols) and condition
coldata <- data.frame(
    sample = c("sampleA_rep1","sampleA_rep2","sampleB_rep1","sampleB_rep2"),
    condition = factor(c("A","A","B","B"))
)
rownames(coldata) <- coldata$sample
# ------------------------

# Read featureCounts merged file (skip comment lines starting with '#')
fc <- read.table(counts_file, header = TRUE, sep = "\t", comment.char="#", stringsAsFactors=FALSE)

# Typical featureCounts layout: Geneid, Chr, Start, End, Strand, Length, sample1, sample2, ...
# Find first counts column (usually where column names match sample names or numeric)
# If counts start at col 7:
counts <- as.matrix(fc[, 7:ncol(fc)])
rownames(counts) <- fc$Geneid
# Optionally subset columns to match coldata
counts <- counts[, rownames(coldata)]

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts,
                                        colData = coldata,
                                        design = ~ condition)

# Prefiltering: remove very low counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Run DESeq
dds <- DESeq(dds)

# Variance-stabilizing transformation for QC and PCA
vsd <- vst(dds, blind=FALSE)


# Results table
res <- results(dds, alpha=0.05)           # default: condition B vs A depending on level order
res <- lfcShrink(dds, coef=2, type="apeglm")  # adjust coef number to match your design
resOrdered <- res[order(res$padj), ]

# Export results
write.csv(as.data.frame(resOrdered), file="deseq2_results.csv")

# Top genes heatmap
topGenes <- head(order(rowVars(assay(vsd)), decreasing=TRUE), 50)
mat <- assay(vsd)[topGenes, ]
mat <- mat - rowMeans(mat)


# Save transformed counts for downstream plots
write.csv(as.data.frame(assay(vsd)), file="vsd_normalized_counts.csv")

Notes:

Adjust coldata to reflect your samples and experimental design (use exact column names).
The coef in lfcShrink depends on the ordering of levels in your design; check resultsNames(dds) to pick correct coefficient.
Filtering threshold (here rowSums >= 10) can be changed depending on depth/replicates.

This provides a concise, reproducible pipeline for DESeq2 starting from featureCounts output. Adjust sample metadata and design formula to your experiment.