Lab 5: Quantification with Salmon & Exploratory Analysis


In this lab you will quantify gene expression for all GSE136366 samples using Salmon quasi-mapping against the chromosome 11 transcriptome, then import the results into R, apply variance-stabilising transformation (VST), and produce a PCA plot and sample-distance heatmap to assess sample quality.

Using chr11 data in this lab

The input files (trimmed/${SRR}_chr11_trimmed_1/2.fastq.gz) are the chr11-filtered trimmed reads from Lab 3 (~3–4 M reads per sample). Salmon is built against a chr11-only transcriptome index, so quantification completes in under 1 minute per sample. The resulting count matrices contain only chr11 genes — enough for meaningful exploratory analysis and visualization. From Lab 6 onwards, the full dataset is used.

Learning Objectives

By the end of this lab you should be able to:
  • Build a Salmon index from a transcriptome FASTA
  • Quantify paired-end samples with salmon quant and interpret quant.sf
  • Import transcript-level estimates and summarise them to gene level with tximport
  • Apply VST and assess sample quality with a PCA plot and a sample-distance heatmap

Part 1: Setup on Ibex

Connect to Ibex, start an interactive session, and load the tools needed for this lab.

Connect and start an interactive session

# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive session (never run analyses on login nodes!)
# Chr11 Salmon index is small; 4G RAM and 30 min is sufficient
srun --pty --time=00:30:00 --mem=4G --cpus-per-task=8 bash

Load required modules

# Load Salmon for quantification
module load salmon

# Load R for downstream analysis
module load R

# Verify loaded modules
module list

# Confirm tool versions
salmon --version
R --version | head -1
Tip: Check Available Module Versions

If module load salmon does not find a match, search for the exact module name first:

module avail salmon

Use the full versioned name shown in the output (e.g., module load salmon/1.10.1).

Navigate to your workshop directory and create output folders

# Go to your workshop scratch space
cd /ibex/user/$USER/workshop

# Create the output directory for Salmon counts
mkdir -p counts/salmon

# Create a results directory for plots
mkdir -p results

# Verify workspace structure
ls -la

Part 2: Build a Salmon Transcriptome Index

Salmon performs quasi-mapping against a transcriptome FASTA (not the genome). This makes it much faster than alignment-based methods such as STAR, while still producing accurate quantification estimates.

Why a Transcriptome FASTA?

Salmon does not align reads to the genome. Instead it uses a k-mer index built from all annotated transcript sequences. Reads are matched against this index in memory, allowing Salmon to process a typical RNA-seq sample in a few minutes rather than hours. Because it works at transcript level, Salmon natively handles multi-mapping reads and produces both transcript-level and gene-level estimates.
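As a toy illustration of the idea (a sketch, not Salmon's actual data structures), a k-mer index maps each k-mer to the transcripts that contain it, so a read can be matched to candidate transcripts without genome alignment:

```python
from collections import defaultdict

K = 5  # toy k-mer size; Salmon's default is k = 31

# Two made-up transcripts sharing a common subsequence
transcripts = {
    "tx1": "ATGCGTACGTTAGC",
    "tx2": "GGGCGTACGTTAAA",
}

# Index: k-mer -> set of transcripts containing it
index = defaultdict(set)
for name, seq in transcripts.items():
    for i in range(len(seq) - K + 1):
        index[seq[i:i + K]].add(name)

def candidates(read):
    """Count shared k-mers between a read and each indexed transcript."""
    hits = defaultdict(int)
    for i in range(len(read) - K + 1):
        for tx in index.get(read[i:i + K], ()):
            hits[tx] += 1
    return dict(hits)

# A read from the shared region hits both transcripts (multi-mapping),
# which Salmon resolves probabilistically during quantification
print(candidates("CGTACGTTA"))
```

A read unique to one transcript would hit only that transcript; ambiguous reads like the one above are shared evidence that Salmon's statistical model apportions between transcripts.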

Copy the chr11 transcriptome from the shared course directory

cd /ibex/user/$USER/workshop

SHARED=/biocorelab/BIX/resources/genomes/workshops/human-chr11

# Ensure the reference directory exists (created in earlier labs)
mkdir -p references/chr11

# Chr11 cDNA (transcript sequences only, ~5 MB)
cp ${SHARED}/GRCh38.cdna.chr11.fa references/chr11/

ls -lh references/chr11/
What is a transcriptome FASTA?

The chr11 cDNA file contains the spliced transcript sequences for genes on chromosome 11 only (~5 MB, vs ~200 MB for the full human transcriptome). This is exactly what Salmon's k-mer index is built from, and it makes index building near-instantaneous.

Build the Salmon chr11 index

salmon index \
    -t references/chr11/GRCh38.cdna.chr11.fa \
    -i references/salmon_index_chr11 \
    --threads 8

This completes in under 1 minute for the chr11 transcriptome. The resulting references/salmon_index_chr11/ directory is reused for all samples.

Part 3: Quantify with Salmon

Now that the index is ready, run Salmon on each trimmed sample. We will loop over all sample IDs listed in raw_data/SRR_Acc_List.txt.

Key Salmon Parameters
Parameter            Meaning
-i                   Path to the Salmon index directory
-l A                 Auto-detect library/strand type (recommended for most datasets)
-1 / -2              R1 and R2 FASTQ files (paired-end)
--validateMappings   More accurate mapping using alignment-based validation; recommended
--gcBias             Correct for GC-content bias in fragment sequencing
--seqBias            Correct for sequence-specific biases at read ends
--threads            Number of CPU threads to use
-o                   Output directory for this sample

Run Salmon quantification for all samples

cd /ibex/user/$USER/workshop

mkdir -p counts/salmon

for r1 in trimmed/*_trimmed_1.fastq.gz; do
    sample=$(basename "$r1" _trimmed_1.fastq.gz)
    echo "=== Quantifying $sample ==="
    salmon quant \
        -i references/salmon_index_chr11 \
        -l A \
        -1 trimmed/${sample}_trimmed_1.fastq.gz \
        -2 trimmed/${sample}_trimmed_2.fastq.gz \
        --validateMappings \
        --gcBias \
        --seqBias \
        --threads 8 \
        -o counts/salmon/${sample}
    echo "Done: $sample"
done
Expected runtime

Each chr11 sample (~3–4 M reads against the chr11 index) takes well under 1 minute. The full loop for all 6 GSE136366 samples completes in a few minutes.

Inspect Salmon output files

Each sample gets its own output directory containing several files. The most important is quant.sf:

# View the quant.sf file for the first sample
head counts/salmon/KO_1_SRR10045016/quant.sf
quant.sf Column Descriptions
Column           Description
Name             Transcript identifier (Ensembl ENST...)
Length           Transcript length in bases
EffectiveLength  Length corrected for fragment size and biases
TPM              Transcripts Per Million — normalised expression estimate
NumReads         Estimated number of reads originating from this transcript
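The TPM column is derived from NumReads and EffectiveLength. A quick sketch of the computation with made-up numbers (not real Salmon output):

```python
import pandas as pd

# Toy quant.sf-style table (invented values)
df = pd.DataFrame({
    "Name":            ["ENST01", "ENST02", "ENST03"],
    "EffectiveLength": [1000.0, 2000.0, 500.0],
    "NumReads":        [100.0, 400.0, 50.0],
})

# TPM: reads per effective base, rescaled so the column sums to one million
rate = df["NumReads"] / df["EffectiveLength"]
df["TPM"] = rate / rate.sum() * 1e6

print(df)
print("TPM sum:", df["TPM"].sum())  # ~1,000,000 by construction
```

Because TPM always sums to one million within each sample, it is useful for comparing transcripts within a sample, but it is not a substitute for proper between-sample normalisation (which DESeq2 handles via size factors in later labs).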

Check detected library strandedness

Salmon automatically detects the library type. Review what it found for your data:

# View the library format counts JSON for the first sample
cat counts/salmon/KO_1_SRR10045016/lib_format_counts.json
What Does the Strandedness Detection Tell You?

The lib_format_counts.json file reports how many reads matched each possible library orientation. For paired-end data, a high count for ISF or ISR indicates a stranded library; a dominant IU (inward, unstranded) count indicates an unstranded library. For GSE136366, expect an unstranded paired-end library (IU). The strandedness detected from chr11 reads should match the full dataset.
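A minimal sketch of summarising those counts in Python. The dictionary below is invented for illustration; the real file has additional fields, so check your own lib_format_counts.json for the exact keys Salmon wrote:

```python
# Invented orientation counts in the style of lib_format_counts.json
lib_counts = {"ISF": 120, "ISR": 115, "IU": 3_500_000}

total = sum(lib_counts.values())
for fmt, n in sorted(lib_counts.items(), key=lambda kv: -kv[1]):
    print(f"{fmt}: {n:>9} ({100 * n / total:5.1f}%)")

# A dominant IU count means unstranded paired-end; a dominant ISF or ISR
# would indicate a stranded library
detected = max(lib_counts, key=lib_counts.get)
print("Detected library type:", detected)
```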

Verify all samples completed successfully

# Confirm each sample has a quant.sf file
for r1 in trimmed/*_trimmed_1.fastq.gz; do
    sample=$(basename "$r1" _trimmed_1.fastq.gz)
    f="counts/salmon/${sample}/quant.sf"
    if [ -f "$f" ]; then
        echo "OK: $sample  ($(($(wc -l < "$f") - 1)) transcripts)"  # subtract header line
    else
        echo "MISSING: $sample"
    fi
done

Part 4: Exploratory Analysis in Python (Jupyter Notebook)

Run the notebook lab5_exploratory_analysis.ipynb interactively — either through the Ibex Open OnDemand portal (recommended) or locally on your laptop.

What the notebook covers
  • Parsing the chr11 GTF to build a transcript-to-gene mapping
  • Loading all six Salmon quant.sf files with pandas
  • Aggregating transcript-level TPM to gene level
  • log₂(TPM + 1) transformation (a simple stand-in for VST)
  • PCA plot with scikit-learn and matplotlib
  • Sample-distance heatmap with seaborn
  • Exporting gene-level count and TPM matrices for downstream labs
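The core of the notebook can be sketched as follows. Here a small synthetic gene-level TPM matrix stands in for the real Salmon import (sample names match the course data; everything else is simulated):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic gene x sample TPM matrix: 200 genes, 3 KO and 3 WT samples
samples = ["KO_1", "KO_2", "KO_3", "WT_1", "WT_2", "WT_3"]
tpm = pd.DataFrame(
    rng.gamma(2.0, 50.0, size=(200, 6)),
    index=[f"gene{i}" for i in range(200)],
    columns=samples,
)
tpm.iloc[:40, :3] *= 4  # shift 40 genes up in KO to mimic a real signal

# log2(TPM + 1): compresses the dynamic range (a simple stand-in for VST)
log_tpm = np.log2(tpm + 1)

# PCA expects samples as rows, so transpose (sklearn centres the data itself)
pca = PCA(n_components=2)
coords = pca.fit_transform(log_tpm.T)
for s, (pc1, pc2) in zip(samples, coords):
    print(f"{s}: PC1 = {pc1:7.2f}  PC2 = {pc2:7.2f}")
print("Variance explained:", pca.explained_variance_ratio_.round(2))

# Sample-to-sample Euclidean distances (the input to the heatmap)
dist = pd.DataFrame(squareform(pdist(log_tpm.T)), index=samples, columns=samples)
print(dist.round(1))
```

With real data, the tpm frame comes from reading each sample's quant.sf with pandas and summing transcript-level TPM by gene using the GTF-derived transcript-to-gene map.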

Option A: Ibex Open OnDemand (recommended)

  1. Go to ondemand.ibex.kaust.edu.sa and log in with your Ibex credentials.
  2. Click Interactive Apps → JupyterLab.
  3. Set the resources: 4 CPUs, 16 GB RAM, 4 hours, partition batch.
  4. In the Extra Jupyter arguments field enter: --notebook-dir=/ibex/user/$USER/workshop
  5. Click Launch, wait for the session to start (usually < 1 minute), then click Connect to JupyterLab.
  6. In the file browser on the left, open lab5_exploratory_analysis.ipynb.
First time: set the kernel

If the notebook asks you to select a kernel, choose rnaseq (the conda environment you created during setup). If it does not appear, activate it first in a terminal: conda activate rnaseq && python -m ipykernel install --user --name rnaseq, then refresh the page.

Option B: Run locally on your laptop

If the Salmon output files are available on your laptop (or you copied them from Ibex), you can run the notebook entirely offline.

# Activate the rnaseq conda environment
conda activate rnaseq

# Navigate to the directory containing the notebook
cd /path/to/workshop   # adjust to where you placed the notebook and Salmon output

# Launch JupyterLab — opens automatically in your browser
jupyter lab

Open lab5_exploratory_analysis.ipynb from the file browser and run all cells with Shift + Enter or Run → Run All Cells.

Copy Salmon output from Ibex to your laptop
# Run this on your LOCAL machine
scp -r username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/counts/salmon ./counts/

Part 5: Exploratory Analysis in R

This part replicates the same exploratory analysis from Part 4 using R and Bioconductor. It uses tximport to import Salmon output, DESeq2 for variance-stabilising transformation (VST), and ggplot2 / pheatmap for visualisation.

Prerequisites

R and Bioconductor packages must be available on Ibex. See the Setup page for installation instructions. Required packages: tximport, DESeq2, ggplot2, pheatmap, GenomicFeatures.

Start an interactive R session

# Request an interactive session with enough RAM for R
srun --pty --time=01:00:00 --mem=16G --cpus-per-task=4 bash

# Load the R module
module load R

# Start R
R
Tip: RStudio on Ibex Open OnDemand

You can also run this analysis in RStudio via the Ibex Open OnDemand portal at ood.ibex.kaust.edu.sa. Select RStudio Server, request 4 CPUs and 16 GB RAM, then paste each step into the console or open a new R script.

Load required R packages

library(tximport)        # import Salmon/kallisto transcript-level estimates
library(DESeq2)          # differential expression and VST transformation
library(GenomicFeatures) # build transcript database from GTF
library(ggplot2)         # grammar-of-graphics plotting
library(pheatmap)        # annotated heatmaps

Build a transcript-to-gene mapping from the chr11 GTF

tximport needs a two-column data frame mapping each transcript ID (TXNAME) to its gene ID (GENEID) to summarise transcript-level Salmon estimates to gene level.

# Note: $USER does not expand inside an R string, so build the path explicitly
gtf_path <- paste0("/ibex/user/", Sys.getenv("USER"), "/workshop/references/chr11/GRCh38.chr11.gtf")

# Parse the GTF into a TxDb (transcript database) object
txdb    <- makeTxDbFromGFF(gtf_path, format = "gtf")

# Extract all transcript IDs as the key set
k       <- keys(txdb, keytype = "TXNAME")

# Build a two-column data frame: TXNAME → GENEID
tx2gene <- select(txdb, k, "GENEID", "TXNAME")

head(tx2gene)
cat("Transcript-to-gene mappings:", nrow(tx2gene), "\n")

Import Salmon output with tximport

# Set base workshop directory using the current user's home path
WORKSHOP <- paste0("/ibex/user/", Sys.getenv("USER"), "/workshop")

# Define the six sample IDs matching the Salmon output directory names
sample_ids <- c(
    "KO_1_SRR10045016", "KO_2_SRR10045017", "KO_3_SRR10045018",
    "WT_1_SRR10045019", "WT_2_SRR10045020", "WT_3_SRR10045021"
)

# Build paths to each sample's quant.sf file
quant_files <- file.path(WORKSHOP, "counts/salmon", sample_ids, "quant.sf")
names(quant_files) <- sample_ids  # names are used as column names in the output matrix

# Stop early with a clear error if any file is missing
stopifnot(all(file.exists(quant_files)))

# Import and summarise from transcript to gene level using the tx2gene map
# ignoreTxVersion strips the version suffix (e.g. ENST00000123456.3 → ENST00000123456)
txi <- tximport(
    quant_files,
    type            = "salmon",
    tx2gene         = tx2gene,
    ignoreTxVersion = TRUE
)

cat("Genes imported:", nrow(txi$counts), "\n")
cat("Samples:", ncol(txi$counts), "\n")
What tximport does

tximport reads each quant.sf file, applies the tx2gene mapping to sum transcript-level estimates to the gene level, and corrects for average transcript length differences between samples. The txi object contains three matrices: $counts (estimated read counts), $abundance (TPM), and $length (effective gene length).

Build sample metadata (coldata)

# Derive condition labels from sample names (KO_* → knockdown, WT_* → control)
condition <- ifelse(grepl("^KO", sample_ids), "knockdown", "control")

# Build a sample metadata table (coldata) required by DESeq2
# Levels set so "control" is the reference level in downstream contrasts
coldata <- data.frame(
    sample    = sample_ids,
    condition = factor(condition, levels = c("control", "knockdown")),
    row.names = sample_ids
)

print(coldata)

Create DESeqDataSet and apply variance-stabilising transformation

# Create a DESeqDataSet from tximport output, incorporating sample metadata
# design = ~ condition tells DESeq2 which variable to model
dds <- DESeqDataSetFromTximport(
    txi,
    colData = coldata,
    design  = ~ condition
)

# Remove genes with very low total counts — they carry no statistical power
# and slow down downstream steps; ≥10 counts across all samples is a common threshold
dds <- dds[rowSums(counts(dds)) >= 10, ]
cat("Genes after low-count filter:", nrow(dds), "\n")

# Apply variance-stabilising transformation (VST) for visualisation
# blind = TRUE means the transformation ignores the sample groups (fair QC)
vsd <- vst(dds, blind = TRUE)
cat("VST complete\n")
Why VST?

Raw counts have variance that scales with the mean: highly expressed genes dominate PCA and distance calculations. VST brings all genes onto a comparable variance scale, giving a fairer representation of the transcriptome-wide structure. We use blind = TRUE during QC so that the transformation is not informed by the experimental design.
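You can see the problem VST addresses with a quick simulation of negative-binomial counts (toy parameters, not estimated from your data) across a range of expression levels:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1000 replicate counts per gene at four expression levels, using a
# negative binomial with dispersion 1/n = 0.1 (a typical RNA-seq shape)
means = np.array([5.0, 50.0, 500.0, 5000.0])
n = 10.0
counts = rng.negative_binomial(n, n / (n + means[:, None]), size=(4, 1000))

raw_var = counts.var(axis=1)
log_var = np.log2(counts + 1).var(axis=1)

for m, rv, lv in zip(means, raw_var, log_var):
    print(f"mean {m:>6.0f}: raw variance {rv:>12.1f}   log2 variance {lv:.2f}")
```

The raw variance grows roughly with the square of the mean, so highly expressed genes would dominate Euclidean distances and PCA; after a log transform (and, more carefully, VST) the per-gene variances sit on a comparable scale.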

PCA plot

# Use DESeq2's built-in PCA function; returnData = TRUE gives us the coordinates
# instead of plotting directly so we can customise with ggplot2
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)

# Extract the percentage of variance explained by each PC for axis labels
pct_var  <- round(100 * attr(pca_data, "percentVar"), 1)

# Build a customised PCA scatter plot
p_pca <- ggplot(pca_data, aes(x = PC1, y = PC2, colour = condition, label = name)) +
    geom_point(size = 4) +                                         # one dot per sample
    geom_text(vjust = -0.8, hjust = 0.5, size = 3.5) +            # sample labels above dots
    scale_colour_manual(values = c(control = "#0d7377", knockdown = "#e07b4c")) +
    xlab(paste0("PC1: ", pct_var[1], "% variance")) +
    ylab(paste0("PC2: ", pct_var[2], "% variance")) +
    ggtitle("PCA — GSE136366  (VST, chr11)") +
    theme_bw(base_size = 13)

print(p_pca)

# Save to a PDF in the results directory
results_dir <- file.path(WORKSHOP, "results")
dir.create(results_dir, showWarnings = FALSE)  # create if it doesn't exist yet
ggsave(file.path(results_dir, "pca_plot_R.pdf"), p_pca, width = 7, height = 5)

Sample-distance heatmap

# Extract the VST-transformed count matrix (genes × samples)
vst_mat   <- assay(vsd)

# Transpose so samples are rows, then compute Euclidean distances between samples
dist_mat  <- dist(t(vst_mat))

# Convert to a symmetric data frame for pheatmap (requires a matrix/data frame input)
dist_df   <- as.data.frame(as.matrix(dist_mat))

# Build a one-column annotation data frame for the colour sidebar on rows and columns
ann_col <- data.frame(
    condition = coldata$condition,
    row.names = rownames(coldata)
)
# Define colours for the annotation sidebar to match the PCA plot palette
ann_colors <- list(condition = c(control = "#0d7377", knockdown = "#e07b4c"))

# Draw the heatmap and save directly to PDF via the filename argument
pheatmap(
    dist_df,
    annotation_col    = ann_col,           # colour sidebar on columns
    annotation_row    = ann_col,           # colour sidebar on rows
    annotation_colors = ann_colors,
    color             = colorRampPalette(c("#2166ac", "white"))(100),  # blue→white gradient
    display_numbers   = TRUE,              # show distance values in each cell
    number_format     = "%.1f",            # one decimal place
    main              = "Sample-to-sample distances (VST, chr11)",
    filename          = file.path(results_dir, "sample_distance_heatmap_R.pdf"),
    width = 8, height = 7
)

Exercises

Work through these exercises using the commands and concepts from the lab. Refer to the sections above if you need a hint.

Exercise 1: Genes with Non-zero Counts

How many genes have a non-zero count in at least one sample? Filter txi$counts and count the remaining rows.

Hint: Use rowSums(txi$counts > 0) >= 1 as a logical filter, then count with sum() or nrow().

Exercise 2: Salmon Mapping Rate

What is the mapping rate for each sample? Salmon reports this in its run log and in the auxiliary output directory.

Hint: Check counts/salmon/<SRR>/aux_info/meta_info.json for the percent_mapped field. A healthy mapping rate for a human RNA-seq sample against a well-annotated transcriptome is typically 70–90%. Rates below 50% may indicate contamination, adapter content, wrong reference, or non-coding RNA libraries. You can also check the SLURM log for the line Mapping rate = ... printed by Salmon at the end of each run.

Exercise 3: PCA Sample Clustering

Do the GSE136366 samples cluster by condition (control vs knockdown) along PC1 in your PCA plot? If they do not, what biological or technical factors could explain the discrepancy?

Consider: batch effects, library size differences, outlier samples, or a weak transcriptional response to the TDP-43 knockdown. If you see an unexpected outlier, check whether its mapping rate or total count differs markedly from the other samples.

Exercise 4: Highest Average TPM Gene

Which gene has the highest average TPM across all samples? TPM values are stored in txi$abundance.

Hint: Calculate row means with rowMeans(txi$abundance), then use which.max() to find the index of the maximum value, and retrieve the gene name. Highly expressed genes are often housekeeping genes such as ACTB (beta-actin) or ribosomal protein genes.

Summary

In this lab, you have:
  • Built a chr11 Salmon index and quantified all six GSE136366 samples
  • Inspected quant.sf output and the detected library type
  • Imported transcript-level estimates to gene level with tximport and applied VST
  • Produced a PCA plot and a sample-distance heatmap to assess sample quality

These quality-control visualisations are an essential checkpoint before proceeding to differential expression testing. Samples that cluster with the wrong group, show unexpectedly large distances from their replicates, or have very low mapping rates should be investigated and potentially excluded.

What to Check Before Moving On
  • All samples have a Salmon mapping rate above 70%
  • Replicates cluster together in both the PCA and heatmap
  • Conditions are separated along a principal component
  • No obvious outlier samples

Previous: Lab 4: Read Alignment with STAR   |   Next: Lab 6: Running nf-core/rnaseq