Lab 5: Quantification with Salmon & Exploratory Analysis
In this lab you will quantify gene expression for all GSE136366 samples using Salmon quasi-mapping against the chromosome 11 transcriptome, then import the results into R, apply variance-stabilising transformation (VST), and produce a PCA plot and sample-distance heatmap to assess sample quality.
The input files (trimmed/${SRR}_chr11_trimmed_1/2.fastq.gz) are the chr11-filtered trimmed reads from Lab 3 (~3–4 M reads per sample). Salmon is built against a chr11-only transcriptome index, so quantification completes in under 1 minute per sample. The resulting count matrices contain only chr11 genes — enough for meaningful exploratory analysis and visualization. From Lab 6 onwards, the full dataset is used.
- Build a Salmon transcriptome index from the Ensembl cDNA FASTA
- Run Salmon quasi-mapping quantification on all GSE136366 paired-end samples
- Import Salmon output into R using tximport (transcript → gene summarisation)
- Apply variance-stabilising transformation (VST) with DESeq2
- Produce a PCA plot and sample-distance heatmap to evaluate sample quality and clustering
Part 1: Setup on Ibex
Connect to Ibex, start an interactive session, and load the tools needed for this lab.
Connect and start an interactive session
# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive session (never run analyses on login nodes!)
# Chr11 Salmon index is small; 4G RAM and 30 min is sufficient
srun --pty --time=00:30:00 --mem=4G --cpus-per-task=8 bash
Load required modules
# Load Salmon for quantification
module load salmon
# Load R for downstream analysis
module load R
# Verify loaded modules
module list
# Confirm tool versions
salmon --version
R --version | head -1
If module load salmon does not find a match, search for the exact module name first:
module avail salmon
Use the full versioned name shown in the output (e.g., module load salmon/1.10.1).
Navigate to your workshop directory and create output folders
# Go to your workshop scratch space
cd /ibex/user/$USER/workshop
# Create the output directory for Salmon counts
mkdir -p counts/salmon
# Create a results directory for plots
mkdir -p results
# Verify workspace structure
ls -la
Part 2: Build a Salmon Transcriptome Index
Salmon performs quasi-mapping against a transcriptome FASTA (not the genome). This makes it much faster than alignment-based methods such as STAR, while still producing accurate quantification estimates.
Salmon does not align reads to the genome. Instead it uses a k-mer index built from all annotated transcript sequences. Reads are matched against this index in memory, allowing Salmon to process a typical RNA-seq sample in a few minutes rather than hours. Because it works at transcript level, Salmon natively handles multi-mapping reads and produces both transcript-level and gene-level estimates.
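The k-mer lookup idea can be sketched as a toy in Python (a drastic simplification with made-up sequences; Salmon's real index and selective-alignment machinery are far more sophisticated):

```python
# Toy illustration of k-mer-based transcript lookup (NOT Salmon's real algorithm).
# Build a k-mer -> transcript-set index, then see which transcripts a read hits.
from collections import defaultdict

K = 5  # Salmon's default k is 31; a tiny k keeps the toy readable

transcripts = {
    "ENST_toy_1": "ACGTACGTTTGACCA",
    "ENST_toy_2": "TTGACCAGGGCATCA",
}

# Index: every k-mer points at the set of transcripts containing it
index = defaultdict(set)
for tx_id, seq in transcripts.items():
    for i in range(len(seq) - K + 1):
        index[seq[i:i + K]].add(tx_id)

def candidate_transcripts(read):
    """Union of transcripts sharing any k-mer with the read."""
    hits = set()
    for i in range(len(read) - K + 1):
        hits |= index.get(read[i:i + K], set())
    return hits

print(candidate_transcripts("TTGACCA"))  # shared k-mers hit both toy transcripts
```

Because each read resolves to a set of candidate transcripts rather than one genomic position, multi-mapping is handled naturally: ambiguous reads are apportioned across transcripts during Salmon's quantification step.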
Copy the chr11 transcriptome from the shared course directory
cd /ibex/user/$USER/workshop
SHARED=/biocorelab/BIX/resources/genomes/workshops/human-chr11
# Chr11 cDNA (transcript sequences only, ~5 MB)
mkdir -p references/chr11
cp ${SHARED}/GRCh38.cdna.chr11.fa references/chr11/
ls -lh references/chr11/
The chr11 cDNA file contains spliced transcript sequences for genes on chromosome 11 only (~5 MB, vs ~200 MB for the full human transcriptome). This is exactly the transcript-level input Salmon needs, and it makes index building near-instantaneous.
Build the Salmon chr11 index
salmon index \
-t references/chr11/GRCh38.cdna.chr11.fa \
-i references/salmon_index_chr11 \
--threads 8
This completes in under 1 minute for the chr11 transcriptome. The resulting references/salmon_index_chr11/ directory is reused for all samples.
Part 3: Quantify with Salmon
Now that the index is ready, run Salmon on each trimmed sample. We will loop over all sample IDs listed in raw_data/SRR_Acc_List.txt.
| Parameter | Meaning |
|---|---|
| -i | Path to the Salmon index directory |
| -l A | Auto-detect library/strand type (recommended for most datasets) |
| -1 / -2 | R1 and R2 FASTQ files (paired-end) |
| --validateMappings | More accurate mapping using alignment-based validation; recommended |
| --gcBias | Correct for GC-content bias in fragment sequencing |
| --seqBias | Correct for sequence-specific biases at read ends |
| --threads | Number of CPU threads to use |
| -o | Output directory for this sample |
Run Salmon quantification for all samples
cd /ibex/user/$USER/workshop
mkdir -p counts/salmon
for r1 in trimmed/*_trimmed_1.fastq.gz; do
sample=$(basename "$r1" _trimmed_1.fastq.gz)
echo "=== Quantifying $sample ==="
salmon quant \
-i references/salmon_index_chr11 \
-l A \
-1 trimmed/${sample}_trimmed_1.fastq.gz \
-2 trimmed/${sample}_trimmed_2.fastq.gz \
--validateMappings \
--gcBias \
--seqBias \
--threads 8 \
-o counts/salmon/${sample}
echo "Done: $sample"
done
Each chr11 sample (~3–4 M reads against the chr11 index) takes well under 1 minute. The full loop for all 6 GSE136366 samples completes in a few minutes.
Inspect Salmon output files
Each sample gets its own output directory containing several files. The most important is quant.sf:
# View the quant.sf file for the first sample
head counts/salmon/KO_1_SRR10045016/quant.sf
| Column | Description |
|---|---|
| Name | Transcript identifier (Ensembl ENST...) |
| Length | Transcript length in bases |
| EffectiveLength | Length corrected for fragment size and biases |
| TPM | Transcripts Per Million, a normalised expression estimate |
| NumReads | Estimated number of reads originating from this transcript |
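The TPM column can be reproduced from NumReads and EffectiveLength. A minimal Python sketch with invented numbers for three hypothetical transcripts:

```python
# How Salmon's TPM column relates to NumReads and EffectiveLength:
# rate_i = NumReads_i / EffectiveLength_i, then TPM_i = rate_i / sum(rates) * 1e6.
# All values below are made up for illustration.
reads   = {"ENST_A": 500.0,  "ENST_B": 1500.0, "ENST_C": 1000.0}
eff_len = {"ENST_A": 1000.0, "ENST_B": 3000.0, "ENST_C": 500.0}

rates = {tx: reads[tx] / eff_len[tx] for tx in reads}
total = sum(rates.values())
tpm = {tx: rates[tx] / total * 1e6 for tx in rates}

print(tpm)
print(sum(tpm.values()))  # TPM always sums to 1,000,000 across all transcripts
```

Dividing by effective length first is what makes TPM comparable between long and short transcripts; the final scaling guarantees every sample's TPM column sums to one million.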
Check detected library strandedness
Salmon automatically detects the library type. Review what it found for your data:
# View the library format counts JSON for the first sample
cat counts/salmon/KO_1_SRR10045016/lib_format_counts.json
The lib_format_counts.json file reports how many reads matched each possible library orientation. A high count for SF or SR indicates a stranded library; roughly equal counts for IU (inward, unstranded) indicate an unstranded library. For GSE136366, expect an unstranded paired-end library (IU). The strandedness detected from chr11 reads should match the full dataset.
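The decision rule can be sketched in Python. The orientation counts below are hypothetical, and the exact field names in lib_format_counts.json vary between Salmon versions, so inspect your own file before adapting this:

```python
# Judge library strandedness from per-orientation fragment counts.
# Counts are invented; extract the real numbers from lib_format_counts.json
# for your sample (field names there vary by Salmon version).
orientation_counts = {"ISF": 49800, "ISR": 50100, "MSF": 60, "MSR": 40}

fwd = orientation_counts.get("ISF", 0)
rev = orientation_counts.get("ISR", 0)
frac_fwd = fwd / (fwd + rev)

# Roughly 0.5 -> unstranded (IU); near 1.0 -> forward-stranded; near 0.0 -> reverse-stranded
if 0.4 <= frac_fwd <= 0.6:
    call = "unstranded (IU)"
elif frac_fwd > 0.6:
    call = "stranded, forward (ISF)"
else:
    call = "stranded, reverse (ISR)"

print(f"forward fraction = {frac_fwd:.3f} -> {call}")
```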
Verify all samples completed successfully
# Confirm each sample has a quant.sf file
for r1 in trimmed/*_trimmed_1.fastq.gz; do
sample=$(basename "$r1" _trimmed_1.fastq.gz)
f="counts/salmon/${sample}/quant.sf"
if [ -f "$f" ]; then
    echo "OK: $sample ($(( $(wc -l < "$f") - 1 )) transcripts)"  # subtract the header line
else
echo "MISSING: $sample"
fi
done
Part 4: Exploratory Analysis in Python (Jupyter Notebook)
Run the notebook lab5_exploratory_analysis.ipynb interactively — either through the Ibex Open OnDemand portal (recommended) or locally on your laptop. The notebook covers:
- Parsing the chr11 GTF to build a transcript-to-gene mapping
- Loading all six Salmon quant.sf files with pandas
- Aggregating transcript-level TPM to gene level
- log₂(TPM + 1) normalisation (a lightweight stand-in for VST)
- PCA plot with scikit-learn and matplotlib
- Sample-distance heatmap with seaborn
- Exporting gene-level count and TPM matrices for downstream labs
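The first two notebook steps can be sketched in plain Python. The GTF lines and TPM values below are fabricated stand-ins; the notebook itself parses the real chr11 files:

```python
import re
from collections import defaultdict

# Step 1: transcript -> gene mapping from GTF attribute fields.
# Two invented GTF lines stand in for references/chr11/GRCh38.chr11.gtf.
gtf_lines = [
    'chr11\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG_X"; transcript_id "ENST_X1";',
    'chr11\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG_X"; transcript_id "ENST_X2";',
]
tx2gene = {}
for line in gtf_lines:
    g = re.search(r'gene_id "([^"]+)"', line)
    t = re.search(r'transcript_id "([^"]+)"', line)
    if g and t:
        tx2gene[t.group(1)] = g.group(1)

# Step 2: sum transcript-level TPM up to gene level via the mapping.
tx_tpm = {"ENST_X1": 12.0, "ENST_X2": 3.5}
gene_tpm = defaultdict(float)
for tx, tpm in tx_tpm.items():
    gene_tpm[tx2gene[tx]] += tpm

print(dict(gene_tpm))  # {'ENSG_X': 15.5}
```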
Option A: Ibex Open OnDemand (recommended)
- Go to ondemand.ibex.kaust.edu.sa and log in with your Ibex credentials.
- Click Interactive Apps → JupyterLab.
- Set the resources: 4 CPUs, 16 GB RAM, 4 hours, partition batch.
- In the Extra Jupyter arguments field enter: --notebook-dir=/ibex/user/$USER/workshop
- Click Launch, wait for the session to start (usually < 1 minute), then click Connect to JupyterLab.
- In the file browser on the left, open lab5_exploratory_analysis.ipynb.
If the notebook asks you to select a kernel, choose rnaseq (the conda environment you created during setup). If it does not appear, activate it first in a terminal: conda activate rnaseq && python -m ipykernel install --user --name rnaseq, then refresh the page.
Option B: Run locally on your laptop
If the Salmon output files are available on your laptop (or you copied them from Ibex), you can run the notebook entirely offline.
# Activate the rnaseq conda environment
conda activate rnaseq
# Navigate to the directory containing the notebook
cd /path/to/workshop # adjust to where you placed the notebook and Salmon output
# Launch JupyterLab — opens automatically in your browser
jupyter lab
Open lab5_exploratory_analysis.ipynb from the file browser and run all cells with Shift + Enter or Run → Run All Cells.
If you have not yet copied the Salmon output to your laptop, fetch it from Ibex first:
# Run this on your LOCAL machine, not on Ibex
scp -r username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/counts/salmon ./counts/
Part 5: Exploratory Analysis in R
This part replicates the same exploratory analysis from Part 4 using R and Bioconductor. It uses tximport to import Salmon output, DESeq2 for variance-stabilising transformation (VST), and ggplot2 / pheatmap for visualisation.
R and Bioconductor packages must be available on Ibex. See the Setup page for installation instructions. Required packages: tximport, DESeq2, ggplot2, pheatmap, GenomicFeatures.
Start an interactive R session
# Request an interactive session with enough RAM for R
srun --pty --time=01:00:00 --mem=16G --cpus-per-task=4 bash
# Load the R module
module load R
# Start R
R
You can also run this analysis in RStudio via the Ibex Open OnDemand portal at ood.ibex.kaust.edu.sa. Select RStudio Server, request 4 CPUs and 16 GB RAM, then paste each step into the console or open a new R script.
Load required R packages
library(tximport) # import Salmon/kallisto transcript-level estimates
library(DESeq2) # differential expression and VST transformation
library(GenomicFeatures) # build transcript database from GTF
library(ggplot2) # grammar-of-graphics plotting
library(pheatmap) # annotated heatmaps
Build a transcript-to-gene mapping from the chr11 GTF
tximport needs a two-column data frame mapping each transcript ID (TXNAME) to its gene ID (GENEID) to summarise transcript-level Salmon estimates to gene level.
# Note: shell variables like $USER are not expanded inside R strings
gtf_path <- file.path("/ibex/user", Sys.getenv("USER"), "workshop/references/chr11/GRCh38.chr11.gtf")
# Parse the GTF into a TxDb (transcript database) object
txdb <- makeTxDbFromGFF(gtf_path, format = "gtf")
# Extract all transcript IDs as the key set
k <- keys(txdb, keytype = "TXNAME")
# Build a two-column data frame: TXNAME → GENEID
# Namespace the call: dplyr, if loaded, masks select()
tx2gene <- AnnotationDbi::select(txdb, k, "GENEID", "TXNAME")
head(tx2gene)
cat("Transcript-to-gene mappings:", nrow(tx2gene), "\n")
Import Salmon output with tximport
# Set base workshop directory using the current user's home path
WORKSHOP <- paste0("/ibex/user/", Sys.getenv("USER"), "/workshop")
# Define the six sample IDs matching the Salmon output directory names
sample_ids <- c(
"KO_1_SRR10045016", "KO_2_SRR10045017", "KO_3_SRR10045018",
"WT_1_SRR10045019", "WT_2_SRR10045020", "WT_3_SRR10045021"
)
# Build paths to each sample's quant.sf file
quant_files <- file.path(WORKSHOP, "counts/salmon", sample_ids, "quant.sf")
names(quant_files) <- sample_ids # names are used as column names in the output matrix
# Stop early with a clear error if any file is missing
stopifnot(all(file.exists(quant_files)))
# Import and summarise from transcript to gene level using the tx2gene map
# ignoreTxVersion strips the version suffix (e.g. ENST00000123456.3 → ENST00000123456)
txi <- tximport(
quant_files,
type = "salmon",
tx2gene = tx2gene,
ignoreTxVersion = TRUE
)
cat("Genes imported:", nrow(txi$counts), "\n")
cat("Samples:", ncol(txi$counts), "\n")
tximport reads each quant.sf file, applies the tx2gene mapping to sum transcript-level estimates to the gene level, and corrects for average transcript length differences between samples. The txi object contains three matrices: $counts (estimated read counts), $abundance (TPM), and $length (effective gene length).
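The gene-level summary can be sketched in a few lines of Python (simplified, with invented numbers; the notebook in Part 4 uses Python, and the tximport vignette documents the exact rules, e.g. for zero-TPM genes):

```python
# Sketch of tximport-style summarisation for one gene with two transcripts.
# Counts sum directly; the effective gene length is the abundance-weighted
# mean of the transcript lengths. All numbers are invented.
tx = {
    #            counts   TPM    eff. length
    "ENST_A1": ( 300.0,   10.0,  1500.0),
    "ENST_A2": ( 100.0,   30.0,   500.0),
}

gene_counts = sum(c for c, _, _ in tx.values())          # plain sum of counts
total_tpm   = sum(t for _, t, _ in tx.values())
# Effective gene length: abundance-weighted mean of transcript lengths
gene_length = sum(t * l for _, t, l in tx.values()) / total_tpm

print(gene_counts, gene_length)
```

The weighting matters: if expression shifts between a long and a short isoform across samples, the effective gene length changes too, and DESeq2 uses the $length matrix as a per-sample offset to correct for this.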
Build sample metadata (coldata)
# Derive condition labels from sample names (KO_* → knockdown, WT_* → control)
condition <- ifelse(grepl("^KO", sample_ids), "knockdown", "control")
# Build a sample metadata table (coldata) required by DESeq2
# Levels set so "control" is the reference level in downstream contrasts
coldata <- data.frame(
sample = sample_ids,
condition = factor(condition, levels = c("control", "knockdown")),
row.names = sample_ids
)
print(coldata)
Create DESeqDataSet and apply variance-stabilising transformation
# Create a DESeqDataSet from tximport output, incorporating sample metadata
# design = ~ condition tells DESeq2 which variable to model
dds <- DESeqDataSetFromTximport(
txi,
colData = coldata,
design = ~ condition
)
# Remove genes with very low total counts — they carry no statistical power
# and slow down downstream steps; ≥10 counts across all samples is a common threshold
dds <- dds[rowSums(counts(dds)) >= 10, ]
cat("Genes after low-count filter:", nrow(dds), "\n")
# Apply variance-stabilising transformation (VST) for visualisation
# blind = TRUE means the transformation ignores the sample groups (fair QC)
vsd <- vst(dds, blind = TRUE)
cat("VST complete\n")
Raw counts have variance that scales with the mean: highly expressed genes dominate PCA and distance calculations. VST brings all genes onto a comparable variance scale, giving a fairer representation of the transcriptome-wide structure. We use blind = TRUE during QC so that the transformation is not informed by the experimental design.
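A quick numerical illustration of the mean-variance problem, in Python for portability (simulated data, not your samples):

```python
import math
import random

random.seed(0)

def var(xs):
    """Sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Simulated replicate counts for a low- and a high-expression gene.
# Gaussian noise with variance equal to the mean mimics the Poisson trend
# (real RNA-seq counts are overdispersed, but the effect is the same).
low  = [random.gauss(10, 10 ** 0.5) for _ in range(1000)]
high = [random.gauss(10_000, 10_000 ** 0.5) for _ in range(1000)]

log_low  = [math.log2(max(x, 0.0) + 1) for x in low]
log_high = [math.log2(max(x, 0.0) + 1) for x in high]

print(f"raw  variance: low={var(low):.1f}  high={var(high):.1f}")
print(f"log2 variance: low={var(log_low):.4f}  high={var(log_high):.4f}")
```

On the raw scale the high-expression gene has vastly larger variance and would dominate PCA; after a plain log2 the situation reverses and low counts become the noisiest. VST flattens the variance across the whole expression range, which is why it is preferred over a simple log for QC plots.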
PCA plot
# Use DESeq2's built-in PCA function; returnData = TRUE gives us the coordinates
# instead of plotting directly so we can customise with ggplot2
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
# Extract the percentage of variance explained by each PC for axis labels
pct_var <- round(100 * attr(pca_data, "percentVar"), 1)
# Build a customised PCA scatter plot
p_pca <- ggplot(pca_data, aes(x = PC1, y = PC2, colour = condition, label = name)) +
geom_point(size = 4) + # one dot per sample
geom_text(vjust = -0.8, hjust = 0.5, size = 3.5) + # sample labels above dots
scale_colour_manual(values = c(control = "#0d7377", knockdown = "#e07b4c")) +
xlab(paste0("PC1: ", pct_var[1], "% variance")) +
ylab(paste0("PC2: ", pct_var[2], "% variance")) +
ggtitle("PCA — GSE136366 (VST, chr11)") +
theme_bw(base_size = 13)
print(p_pca)
# Save to a PDF in the results directory
results_dir <- file.path(WORKSHOP, "results")
dir.create(results_dir, showWarnings = FALSE) # create if it doesn't exist yet
ggsave(file.path(results_dir, "pca_plot_R.pdf"), p_pca, width = 7, height = 5)
Sample-distance heatmap
# Extract the VST-transformed count matrix (genes × samples)
vst_mat <- assay(vsd)
# Transpose so samples are rows, then compute Euclidean distances between samples
dist_mat <- dist(t(vst_mat))
# Convert to a symmetric data frame for pheatmap (requires a matrix/data frame input)
dist_df <- as.data.frame(as.matrix(dist_mat))
# Build a one-column annotation data frame for the colour sidebar on rows and columns
ann_col <- data.frame(
condition = coldata$condition,
row.names = rownames(coldata)
)
# Define colours for the annotation sidebar to match the PCA plot palette
ann_colors <- list(condition = c(control = "#0d7377", knockdown = "#e07b4c"))
# Draw the heatmap and save directly to PDF via the filename argument
pheatmap(
dist_df,
annotation_col = ann_col, # colour sidebar on columns
annotation_row = ann_col, # colour sidebar on rows
annotation_colors = ann_colors,
color = colorRampPalette(c("#2166ac", "white"))(100), # blue→white gradient
display_numbers = TRUE, # show distance values in each cell
number_format = "%.1f", # one decimal place
main = "Sample-to-sample distances (VST, chr11)",
filename = file.path(results_dir, "sample_distance_heatmap_R.pdf"),
width = 8, height = 7
)
Exercises
Work through these exercises using the commands and concepts from the lab. Refer to the sections above if you need a hint.
Exercise 1: Genes with Non-zero Counts
How many genes have a non-zero count in at least one sample? Filter txi$counts and count the remaining rows.
Hint: Use rowSums(txi$counts > 0) >= 1 as a logical filter, then count with sum() or nrow().
Exercise 2: Salmon Mapping Rate
What is the mapping rate for each sample? Salmon reports this in its run log and in the auxiliary output directory.
Hint: Check counts/salmon/<SRR>/aux_info/meta_info.json for the percent_mapped field. A healthy mapping rate for a human RNA-seq sample against a well-annotated transcriptome is typically 70–90%. Rates below 50% may indicate contamination, adapter content, wrong reference, or non-coding RNA libraries. You can also check the SLURM log for the line Mapping rate = ... printed by Salmon at the end of each run.
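A sketch of the per-sample check in Python. The JSON snippets below are fabricated stand-ins for meta_info.json contents; in practice you would json.load() each real file from counts/salmon/&lt;sample&gt;/aux_info/:

```python
import json

# Fabricated stand-ins for each sample's aux_info/meta_info.json.
# In real use, replace with: json.load(open(path)) for each sample directory.
meta = {
    "KO_1_SRR10045016": '{"percent_mapped": 86.4, "num_processed": 3500000}',
    "WT_1_SRR10045019": '{"percent_mapped": 42.1, "num_processed": 3300000}',
}

for sample, raw in sorted(meta.items()):
    info = json.loads(raw)
    rate = info["percent_mapped"]
    flag = "" if rate >= 70 else "  <-- investigate: low mapping rate"
    print(f"{sample}: {rate:.1f}% mapped{flag}")
```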
Exercise 3: PCA Sample Clustering
Do the GSE136366 samples cluster by condition (control vs knockdown) along PC1 in your PCA plot? What biological or technical factors could explain it if they do not?
Consider: batch effects, library size differences, outlier samples, or a weak transcriptional response to the TDP-43 knockdown. If you see an unexpected outlier, check whether its mapping rate or total count differs markedly from the other samples.
Exercise 4: Highest Average TPM Gene
Which gene has the highest average TPM across all samples? TPM values are stored in txi$abundance.
Hint: Calculate row means with rowMeans(txi$abundance), then use which.max() to find the index of the maximum value, and retrieve the gene name. Highly expressed genes are often housekeeping genes such as ACTB (beta-actin) or ribosomal protein genes.
Summary
In this lab, you have:
- Copied the chr11 Ensembl cDNA transcriptome from the shared course directory and built a Salmon index
- Quantified all GSE136366 samples with Salmon quasi-mapping, using bias correction and automatic library-type detection
- Inspected quant.sf output files and verified library strandedness via lib_format_counts.json
- Imported Salmon transcript-level estimates into R with tximport, summarising to gene level using a tx2gene mapping
- Built a DESeqDataSet, filtered low-count genes, and applied variance-stabilising transformation (VST)
- Produced a PCA plot and sample-distance heatmap to assess sample quality and condition separation
These quality-control visualisations are an essential checkpoint before proceeding to differential expression testing. Samples that cluster with the wrong group, show unexpectedly large distances from their replicates, or have very low mapping rates should be investigated and potentially excluded. Before moving on, confirm that:
- All samples have a Salmon mapping rate above 70%
- Replicates cluster together in both the PCA and heatmap
- Conditions are separated along a principal component
- No obvious outlier samples
Previous: Lab 4: Read Alignment with STAR | Next: Lab 6: Running nf-core/rnaseq