Introduction
Recording everything.
Introduction — RNA-seq analysis
RNA sequencing (RNA‑seq) measures transcriptome-wide RNA abundance and sequence, enabling discovery and quantification of expressed genes, isoforms, and sequence variants. This chapter provides a concise practical introduction to the computational steps, core concepts, and best practices used in typical RNA‑seq studies.
Goals
- Quantify gene and transcript expression across samples
- Identify differentially expressed genes and pathways
- Detect alternative splicing, fusion transcripts, and RNA editing when relevant
- Produce reproducible, well-documented results ready for interpretation
Key concepts
- Reads (FASTQ), alignments (BAM), and quantified counts/tpm
- Library type: strandedness, read length, single/paired-end
- Normalization to account for sequencing depth and composition
- Experimental design and covariates (batch, replication)
Typical computational workflow
- Project setup: metadata, sample sheet, versioned code and environment (conda/Docker/Singularity)
- Raw-data QC: fastp, MultiQC
- Alignment: STAR
- Quantification: featureCounts, S
- Normalization and exploratory QC: PCA, sample clustering, library size checks
- Differential expression: DESeq2, edgeR, limma-voom
- Functional analysis: GSEA, pathway enrichment, gene-set visualization
- Optional: isoform analysis, fusion detection, variant calling
- Reporting: figures, tables, reproducible notebooks/workflows
Outputs and quality checks
- Per-sample QC reports and alignment metrics
- Count matrix and normalized expression table
- Differential expression results with effect sizes and adjusted p-values
- Diagnostic plots: MA, PCA, heatmaps, volcano plots
Best practices
- Record sample metadata and processing parameters early
- Use workflow managers (Snakemake, Nextflow) and containerized environments
- Include replication and model covariates in statistical tests
- Validate key findings with orthogonal methods if possible
This chapter will expand each step with commands, recommended tools, example workflows, and interpretation guidance.
Raw Fastq Quality Control
To perform quality control on raw FASTQ files using fastp, follow these steps:
Step 1: Install fastp
Using conda or mamba to install fastp:
# conda
conda install -c bioconda fastp
# mamba
mamba install -c bioconda fastp
Step 2: Run fastp for Quality Control
For paired-end reads, run fastp with the following command:
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
-h sample_fastp.html -j sample_fastp.json \
--thread 8
Parameters:
-i
: Input R1 FASTQ file-I
: Input R2 FASTQ file (for paired-end)-o
: Output R1 FASTQ file after QC-O
: Output R2 FASTQ file after QC-h
: HTML report file-j
: JSON report file--thread
: Number of threads to use
Step 3: Review Quality Control Report
fastp automatically performs several QC steps including:
- Adapter trimming
- Quality filtering
- Length filtering
- Per-read quality pruning
- Base correction for overlapped paired-end reads
Open the HTML report (sample_fastp.html
) to review detailed quality metrics before and after filtering.
Step 4: Additional Options (Optional)
You can customize fastp with additional parameters:
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
-h sample_fastp.html -j sample_fastp.json \
--qualified_quality_phred 20 \
--length_required 50 \
--thread 8
--qualified_quality_phred
: Minimum quality value (default: 15)--length_required
: Minimum read length to keep (default: 15)
Paper: https://academic.oup.com/bioinformatics/article/34/17/i884/5093234
RNA-seq Mapping
To map paired-end (PE) FASTQ reads to a reference genome using STAR (Spliced Transcripts Alignment to a Reference) (Paper: https://academic.oup.com/bioinformatics/article/29/1/15/272537?login=false), follow these steps:
Step 1: Install STAR
Using conda or mamba to install STAR is your best choice:
# conda
conda install -c bioconda star
# mamba
mamba install -c bioconda star
Step 2: Generate Genome Index
Before mapping, you need to generate a genome index. This step is done once per reference genome. (You should know this step if you encounter similar tasks in the future, but this time, I will build the index for you.)
STAR --runMode genomeGenerate \
--genomeDir /path/to/genomeDir \
--genomeFastaFiles /path/to/genome.fa \
--sjdbGTFfile /path/to/annotations.gtf \
--runThreadN 8
# Above is the general command, I will also show you how I run the index building
cd /home/shared/reference # move into the reference folder
STAR --runMode genomeGenerate \
--genomeDir STAR_INDEX \
--genomeFastaFiles ./Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile ./Homo_sapiens.GRCh38.104.gtf \
--runThreadN 8
Parameters:
--genomeDir
: Directory where the genome index will be stored--genomeFastaFiles
: Path to the reference genome FASTA file--sjdbGTFfile
: Path to the gene annotation GTF file--runThreadN
: Number of threads to use
So the index is in /home/shared/reference/STAR_INDEX
, you can use it directly in the later steps.
Step 3: Map Paired-End Reads
Run STAR to align your paired-end FASTQ files to the indexed genome:
STAR --genomeDir /path/to/genomeDir \
--readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix output_prefix_ \
--outSAMtype BAM SortedByCoordinate \
--runThreadN 8
# For the mapping step, I will just show you the above command, you need to work
# out how the command works for your samples
# You can use STAR --help for help.
Parameters:
--genomeDir
: Path to the genome index directory--readFilesIn
: Paired-end FASTQ files (R1 and R2)--readFilesCommand zcat
: For compressed .gz files (omit if uncompressed)--outFileNamePrefix
: Prefix for output files--outSAMtype BAM SortedByCoordinate
: Output sorted BAM file--runThreadN
: Number of threads
Step 4: Output Files
STAR generates several output files:
Aligned.sortedByCoord.out.bam
: Sorted BAM file with aligned readsLog.final.out
: Summary mapping statisticsLog.out
: Detailed log of the runLog.progress.out
: Progress informationSJ.out.tab
: Splice junction information
Step 5: Index BAM File (Optional)
For downstream analysis, you may need to index the BAM file:
samtools index output_prefix_Aligned.sortedByCoord.out.bam
QC on the alignments
To perform quality control on RNA-seq alignments using QualiMap, follow these steps:
Step 1: Install QualiMap
Using conda or mamba to install QualiMap:
# conda
conda install -c bioconda qualimap
# mamba
mamba install -c bioconda qualimap
Step 2: Run QualiMap on BAM File
Use QualiMap's RNA-seq mode to analyze your sorted BAM file:
qualimap rnaseq \
-bam output_prefix_Aligned.sortedByCoord.out.bam \
-gtf /path/to/annotations.gtf \
-outdir qualimap_results \
--java-mem-size=4G
Parameters:
-bam
: Input sorted BAM file from STAR alignment-gtf
: Path to the gene annotation GTF file (same as used in STAR indexing)-outdir
: Output directory for QualiMap results--java-mem-size
: Amount of memory to allocate (adjust based on your system)
Step 3: Additional Options
You can customize QualiMap with additional parameters:
qualimap rnaseq \
-bam output_prefix_Aligned.sortedByCoord.out.bam \
-gtf /path/to/annotations.gtf \
-outdir qualimap_results \
-pe \
--java-mem-size=4G
-pe
: Specify paired-end mode-p
: Sequencing protocol (strand-specific or non-strand-specific)
Step 4: Review Quality Control Report
QualiMap generates an HTML report with comprehensive quality metrics including:
- Coverage profile along genes
- Junction analysis
- Reads genomic origin (exonic, intronic, intergenic)
- Coverage per chromosome
- Insert size distribution (for paired-end reads)
- Mapping quality distribution
Open the HTML report in the output directory (qualimap_results/qualimapReport.html
) to review detailed quality metrics of your RNA-seq alignments.
Paper: https://academic.oup.com/bioinformatics/article/28/20/2678/206551
Quantification
To quantify gene expression from RNA-seq alignments using featureCounts, follow these steps:
Step 1: Install featureCounts
featureCounts is part of the Subread package. Install it using conda or mamba:
# conda
conda install -c bioconda subread
# mamba
mamba install -c bioconda subread
Step 2: Run featureCounts
Use featureCounts to count reads mapped to genomic features (genes) from your BAM file:
featureCounts -a /path/to/annotations.gtf \
-o gene_counts.txt \
-t exon \
-g gene_id \
-T 8 \
-p \
output_prefix_Aligned.sortedByCoord.out.bam
Parameters:
-a
: Path to the gene annotation GTF file-o
: Output file name for the count matrix-t
: Feature type to count (default: exon)-g
: Attribute type in GTF file to use as feature ID (default: gene_id)-T
: Number of threads to use-p
: Specify that input data is paired-end
Step 3: Additional Options
You can customize featureCounts with additional parameters for strand-specific RNA-seq:
featureCounts -a /path/to/annotations.gtf \
-o gene_counts.txt \
-t exon \
-g gene_id \
-T 8 \
-p \
-s 2 \
output_prefix_Aligned.sortedByCoord.out.bam
-s
: Specify strand-specificity (0: unstranded, 1: stranded, 2: reversely stranded)-M
: Count multi-mapping reads-O
: Assign reads to overlapping features--fraction
: Assign fractional counts to features
Step 4: Output Files
featureCounts generates two main output files:
gene_counts.txt
: Tab-delimited file with gene counts per samplegene_counts.txt.summary
: Summary statistics including assigned, unassigned, and ambiguous reads
Step 5: Review Count Results
The output count file contains columns for:
- Geneid: Gene identifier
- Chr: Chromosome
- Start: Gene start position
- End: Gene end position
- Strand: Gene strand
- Length: Gene length
- Sample columns: Read counts for each BAM file
Review the summary file to check the proportion of successfully assigned reads and identify potential issues with your alignment or annotation.
Paper: https://academic.oup.com/bioinformatics/article/30/7/923/232889
DEG analysis with DESeq2 (input: featureCounts output)
Summary: create a conda R environment with RStudio, install DESeq2, then run a reproducible R script that reads featureCounts output, performs QC, differential expression, shrinkage, and exports results.
Please refer to this tutorial DESeq2
Step1 Create conda env and install R / RStudio
Run in a terminal:
conda install -c conda-forge -c bioconda r-base=4.2 rstudio r-essentials -y
# launch RStudio (optional)
rstudio
Note: Bioconductor packages are best installed from within R using BiocManager (below).
Step2 Install R packages (from inside R or RStudio)
Open R and run:
install.packages("BiocManager")
BiocManager::install(c("DESeq2","apeglm","tximport","pheatmap","RColorBrewer","vsn"))
- DESeq2: core DEA
- apeglm: LFC shrinkage
- tximport: only if importing transcript-level results (not needed for featureCounts gene counts)
- pheatmap / RColorBrewer / vsn: plotting / normalization helpers
Step3 Prepare featureCounts input
Try by yourself Finally generate a counts_file (featureCounts_counts.txt) for next analysis
Step4 FeatureCounts merged file
Save as deseq2_featurecounts_DEA.R
and edit counts_file
and coldata
accordingly.
# deseq2_featurecounts_DEA.R
library(DESeq2)
library(apeglm)
library(pheatmap)
library(RColorBrewer)
library(vsn)
# ---- user settings ----
counts_file <- "featureCounts_counts.txt" # merged output
# Define sample table: sample names (matching cols) and condition
coldata <- data.frame(
sample = c("sampleA_rep1","sampleA_rep2","sampleB_rep1","sampleB_rep2"),
condition = factor(c("A","A","B","B"))
)
rownames(coldata) <- coldata$sample
# ------------------------
# Read featureCounts merged file (skip comment lines starting with '#')
fc <- read.table(counts_file, header = TRUE, sep = "\t", comment.char="#", stringsAsFactors=FALSE)
# Typical featureCounts layout: Geneid, Chr, Start, End, Strand, Length, sample1, sample2, ...
# Find first counts column (usually where column names match sample names or numeric)
# If counts start at col 7:
counts <- as.matrix(fc[, 7:ncol(fc)])
rownames(counts) <- fc$Geneid
# Optionally subset columns to match coldata
counts <- counts[, rownames(coldata)]
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = coldata,
design = ~ condition)
# Prefiltering: remove very low counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
# Run DESeq
dds <- DESeq(dds)
# Variance-stabilizing transformation for QC and PCA
vsd <- vst(dds, blind=FALSE)
# Results table
res <- results(dds, alpha=0.05) # default: condition B vs A depending on level order
res <- lfcShrink(dds, coef=2, type="apeglm") # adjust coef number to match your design
resOrdered <- res[order(res$padj), ]
# Export results
write.csv(as.data.frame(resOrdered), file="deseq2_results.csv")
# Top genes heatmap
topGenes <- head(order(rowVars(assay(vsd)), decreasing=TRUE), 50)
mat <- assay(vsd)[topGenes, ]
mat <- mat - rowMeans(mat)
# Save transformed counts for downstream plots
write.csv(as.data.frame(assay(vsd)), file="vsd_normalized_counts.csv")
Notes:
- Adjust
coldata
to reflect your samples and experimental design (use exact column names). - The
coef
in lfcShrink depends on the ordering of levels in your design; checkresultsNames(dds)
to pick correct coefficient. - Filtering threshold (here rowSums >= 10) can be changed depending on depth/replicates.
This provides a concise, reproducible pipeline for DESeq2 starting from featureCounts output. Adjust sample metadata and design formula to your experiment.