Lab 4b: Pseudo-alignment (Salmon Only)

Choose Your Path

This lab runs Salmon pseudo-alignment directly on FASTQ files. Choose this approach when you:

Only need gene/transcript expression counts
Want faster processing (no genome alignment step)
Have limited computational resources
Don't need BAM files for visualization

If you need BAM files for IGV visualization, splice junction analysis, or variant calling, use Lab 4a: Genome Alignment (STAR + Salmon) instead.

Learning Objectives

Understand pseudo-alignment vs traditional alignment
Build a Salmon transcriptome index
Quantify transcript abundance with Salmon
Interpret Salmon output and quality metrics
Aggregate transcript counts to gene level with tximport

Why Salmon for RNA-seq Quantification?

Salmon: Fast and Accurate Quantification

Salmon uses a pseudo-alignment approach that is fundamentally different from traditional alignment:

Pseudo-alignment: Maps reads to transcripts without full base-by-base alignment
Very fast: 10-100x faster than STAR alignment
Memory efficient: Requires only ~4 GB RAM (vs ~32 GB for STAR on the human genome)
Accurate: Comparable or better accuracy for gene-level counts
Bias correction: Built-in GC and sequence-specific bias correction

Pseudo-alignment vs Traditional Alignment

Feature	Traditional (STAR)	Pseudo-alignment (Salmon)
Speed	Slower	10-100x faster
Memory	~32 GB (human)	~4 GB
Output	BAM files + counts	Counts only
Visualization	IGV compatible	No BAM files
Splice info	Yes	No
Use case	Full analysis	Expression quantification

When to Use Pseudo-alignment

Salmon pseudo-alignment is ideal for differential expression studies where you only need gene counts. Salmon is also the quantification tool in widely used RNA-seq pipelines such as nf-core/rnaseq, which can run it either after STAR alignment or directly on FASTQ files.

Check Reference Files

Make sure Parts 3-5 of Lab 1 completed successfully. You need the transcriptome FASTA file and the GTF annotation file:

ls -lh ~/genomics/references/transcriptome_chr11.fa
ls -lh ~/genomics/references/Homo_sapiens.GRCh38.110.chr11.gtf

Part 1: Build Salmon Transcriptome Index

Salmon requires an index built from transcript sequences (cDNA), not the genome.

Create the Salmon index

cd ~/genomics

# Activate environment
conda activate genomics

# Build Salmon index from chr11 transcriptome
# This takes 1-2 minutes
salmon index \
    -t references/transcriptome_chr11.fa \
    -i references/salmon_index \
    --threads 4

# Check the index files
ls -lh references/salmon_index/
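
Before or after indexing, it can help to know how many transcript sequences the FASTA contains, since the number of rows in quant.sf later should be close to it (Salmon may drop duplicate or very short sequences while indexing). The sketch below counts FASTA headers in a toy file; the file name and transcript IDs are made up for illustration, and in the lab you would point grep at references/transcriptome_chr11.fa instead.

```shell
# Toy FASTA standing in for references/transcriptome_chr11.fa (hypothetical content)
workdir=$(mktemp -d)
cat > "$workdir/toy_transcriptome.fa" <<'EOF'
>ENST_TOY1
ACGTACGTACGT
>ENST_TOY2
GGGTTTAAACCC
GGGTTT
>ENST_TOY3
TTTTAAAACCCC
EOF

# Every FASTA record starts with '>', so counting header lines counts transcripts
n_transcripts=$(grep -c '^>' "$workdir/toy_transcriptome.fa")
echo "Transcripts in FASTA: $n_transcripts"
```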

Transcriptome Reference

Unlike STAR, which uses the genome, Salmon uses transcript sequences (cDNA). This is part of why it is faster: it searches only the annotated transcript sequences, not the entire genome.

Part 2: Quantify Samples with Salmon

Quantify one sample

cd ~/genomics

# Create output directory
mkdir -p salmon_quant

# Quantify KO_1 sample
salmon quant \
    -i references/salmon_index \
    -l A \
    -1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
    -2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    -o salmon_quant/KO_1 \
    --threads 4 \
    --validateMappings \
    --gcBias \
    --seqBias

# Check output
ls -lh salmon_quant/KO_1/

Key Parameters

-l A: Auto-detect library type (stranded/unstranded)
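--validateMappings: Enables selective alignment, which scores candidate mappings for better accuracy (already the default in Salmon 1.0 and later, where the flag is redundant but harmless)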
--gcBias: Correct for GC content bias
--seqBias: Correct for sequence-specific bias at read starts

Quantify all samples

cd ~/genomics

# Create logs directory
mkdir -p logs

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    # Extract short name (KO_1, KO_2, etc.)
    SHORT_NAME=$(echo "$SAMPLE" | cut -d'_' -f1,2)
    echo "Quantifying $SHORT_NAME..."

    salmon quant \
        -i references/salmon_index \
        -l A \
        -1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
        -2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        -o salmon_quant/${SHORT_NAME} \
        --threads 4 \
        --validateMappings \
        --gcBias \
        --seqBias \
        2>> logs/salmon.log
done

echo "Quantification complete!"
ls salmon_quant/
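
Because the loop sends Salmon's messages to logs/salmon.log, a failed sample can go unnoticed. The sketch below shows one way to check that every sample produced a quant.sf file, and also derives the short name with shell parameter expansion instead of cut. It runs on a toy directory created with mktemp (one sample is left incomplete on purpose); in the lab you would set outdir to salmon_quant.

```shell
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 WT_1_SRR10045019"

# Toy stand-in for salmon_quant/; WT_1 deliberately has no quant.sf
outdir=$(mktemp -d)
mkdir -p "$outdir/KO_1" "$outdir/KO_2" "$outdir/WT_1"
touch "$outdir/KO_1/quant.sf" "$outdir/KO_2/quant.sf"

missing=""
for SAMPLE in $SAMPLES; do
    # Strip the trailing "_SRR..." accession: KO_1_SRR10045016 -> KO_1
    SHORT_NAME=${SAMPLE%_*}
    if [ -f "$outdir/$SHORT_NAME/quant.sf" ]; then
        echo "$SHORT_NAME: OK"
    else
        echo "$SHORT_NAME: quant.sf missing"
        missing="$missing $SHORT_NAME"
    fi
done
```

If any sample is reported as missing, the end of logs/salmon.log is the first place to look.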

Part 3: Understanding Salmon Output

Explore the quantification file

cd ~/genomics

# View the quant.sf file
head -10 salmon_quant/KO_1/quant.sf

# Column descriptions:
# Name: Transcript ID (Ensembl format)
# Length: Transcript length in bp
# EffectiveLength: Length adjusted for fragment size and bias
# TPM: Transcripts Per Million (normalized expression)
# NumReads: Estimated number of reads mapping to this transcript

Understanding Salmon Metrics

Metric	Description	Use
NumReads	Estimated read count	Input for DESeq2
TPM	Transcripts Per Million (normalized for length and depth)	Comparing expression levels, visualization
EffectiveLength	Bias-corrected length	Internal calculation
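
To see how the three metrics relate, TPM can be recomputed from NumReads and EffectiveLength: each transcript's rate is NumReads / EffectiveLength, and the rates are rescaled so they sum to one million. The sketch below does this with awk on a toy quant.sf; the transcript names and values are made up, and only the column layout follows Salmon's output.

```shell
workdir=$(mktemp -d)
quant="$workdir/quant.sf"
# Toy quant.sf with Salmon's five columns (made-up values; the TPM column
# is left at 0 because it is recomputed below)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  "$quant"
printf 'txA\t1000\t800\t0\t400\n'                      >> "$quant"
printf 'txB\t2000\t1800\t0\t900\n'                     >> "$quant"
printf 'txC\t500\t300\t0\t0\n'                         >> "$quant"

# Pass 1 sums NumReads/EffectiveLength over all transcripts;
# pass 2 rescales each transcript's rate so the column sums to one million.
tpm_table=$(awk -F'\t' '
    NR == FNR { if (FNR > 1) total += $5 / $3; next }
    FNR > 1   { printf "%s\t%.1f\n", $1, 1e6 * ($5 / $3) / total }
' "$quant" "$quant")
printf '%s\n' "$tpm_table"
```

Here txA and txB have different read counts but the same reads-per-base rate, so they receive the same TPM. This is why TPM, not NumReads, reflects relative abundance.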

Check mapping rates

cd ~/genomics

# View the log file for mapping statistics
cat salmon_quant/KO_1/logs/salmon_quant.log

# Extract mapping rate for all samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
    sample=$(basename "$dir")
    rate=$(grep "Mapping rate" "${dir}logs/salmon_quant.log" | awk '{print $NF}')
    echo "$sample: $rate"
done
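
As an alternative to parsing the log, the mapping rate can be read from aux_info/meta_info.json. The sketch below assumes your Salmon version writes a percent_mapped field there (check the file you viewed in this lab to confirm) and uses sed on a toy JSON file with made-up numbers. If jq is installed, it would be a more robust way to read JSON.

```shell
# Toy stand-in for salmon_quant/KO_1/aux_info/meta_info.json (values are made up)
workdir=$(mktemp -d)
mkdir -p "$workdir/KO_1/aux_info"
cat > "$workdir/KO_1/aux_info/meta_info.json" <<'EOF'
{
    "num_processed": 1000000,
    "num_mapped": 812345,
    "percent_mapped": 81.2345
}
EOF

for dir in "$workdir"/*/; do
    sample=$(basename "$dir")
    # Print only the number that follows the "percent_mapped" key
    rate=$(sed -n 's/.*"percent_mapped": *\([0-9.]*\).*/\1/p' "${dir}aux_info/meta_info.json")
    echo "$sample: $rate%"
done
```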

Expected Mapping Rates

Mapping Rate	Quality	Notes
>70%	Good	Expected for most RNA-seq
50-70%	Acceptable	May indicate some issues
<50%	Investigate	Check data quality or reference

Note: With a chr11-only reference, reads that originate from other chromosomes have no target to map to, so mapping rates may be lower than with a whole-transcriptome reference.
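
The thresholds in the table can be applied automatically. The helper below, classify_rate, is a hypothetical function written for this lab (it is not part of Salmon); it accepts a percentage with or without a trailing % sign, which matches the value the mapping-rate loop extracts from the log.

```shell
classify_rate() {
    # $1: mapping rate as a percentage, e.g. "85.3%" or "85.3"
    awk -v r="${1%\%}" 'BEGIN {
        if (r > 70)       print "Good"
        else if (r >= 50) print "Acceptable"
        else              print "Investigate"
    }'
}

for rate in "85.3%" "61.0%" "42.7%"; do
    echo "$rate -> $(classify_rate "$rate")"
done
```

Treat the labels as a prompt to look closer, not a verdict, especially with the chr11-only reference used here.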

Explore additional output files

cd ~/genomics

# List all output files
ls -la salmon_quant/KO_1/

# View the metadata
cat salmon_quant/KO_1/aux_info/meta_info.json

# View library type detection
cat salmon_quant/KO_1/lib_format_counts.json

Part 4: Aggregate to Gene Level with tximport

Salmon outputs transcript-level counts. For differential expression with DESeq2, we aggregate to gene level using tximport.
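
Conceptually, aggregation means summing the estimated reads of all transcripts that belong to the same gene. The sketch below illustrates only that core idea with awk on toy files (made-up transcript and gene IDs). It is a simplified illustration, not a replacement for tximport, which also carries along abundance and length information that DESeq2 can use.

```shell
workdir=$(mktemp -d)
# Toy transcript-to-gene map: transcript ID, then gene ID (hypothetical IDs)
printf 'tx1\tgeneA\ntx2\tgeneA\ntx3\tgeneB\n' > "$workdir/tx2gene.tsv"
# Toy quant.sf with Salmon's column layout (made-up values)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  "$workdir/quant.sf"
printf 'tx1\t1000\t800\t0\t10.5\n'                     >> "$workdir/quant.sf"
printf 'tx2\t1200\t1000\t0\t4.5\n'                     >> "$workdir/quant.sf"
printf 'tx3\t900\t700\t0\t7\n'                         >> "$workdir/quant.sf"

# First file: remember each transcript's gene.
# Second file: add NumReads (column 5) to the running total of that gene.
gene_counts=$(awk -F'\t' '
    NR == FNR               { gene[$1] = $2; next }
    FNR > 1 && ($1 in gene) { sum[gene[$1]] += $5 }
    END { for (g in sum) printf "%s\t%.1f\n", g, sum[g] }
' "$workdir/tx2gene.tsv" "$workdir/quant.sf" | sort)
printf '%s\n' "$gene_counts"
```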

Download the tximport script

We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.

cd ~/genomics


# Create scripts directory
mkdir -p scripts

# Download the tximport script from GitHub
wget -O scripts/run_tximport.R https://raw.githubusercontent.com/bioinfo-kaust/academy-stage3-2026/refs/heads/main/scripts/run_tximport.R
    
# Make it executable (optional)
chmod +x scripts/run_tximport.R

# View script help
Rscript scripts/run_tximport.R --help

Script Features

Automatically generates tx2gene mapping from your GTF file
Auto-detects sample directories in salmon_quant folder
Outputs count matrices, TPM values, and sample info
Creates tximport.rds object ready for DESeq2
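
To make the tx2gene step less of a black box, here is one way such a mapping could be derived from a GTF by hand. This is a sketch under assumptions, not necessarily how run_tximport.R does it: it reads the "transcript" rows and pulls transcript_id and gene_id out of the attribute column. The GTF lines and IDs are toy data. Whatever method is used, the transcript IDs must match the Name column of quant.sf exactly.

```shell
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"
# Toy GTF (9 tab-separated columns; hypothetical genes and transcripts)
{
  printf '11\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_TOY1"; gene_name "TOY1";\n'
  printf '11\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_TOY1"; transcript_id "ENST_TOY1A";\n'
  printf '11\ttoy\ttranscript\t100\t700\t.\t+\t.\tgene_id "ENSG_TOY1"; transcript_id "ENST_TOY1B";\n'
  printf '11\ttoy\ttranscript\t2000\t2900\t.\t-\t.\tgene_id "ENSG_TOY2"; transcript_id "ENST_TOY2A";\n'
} > "$gtf"

# For each transcript row, extract the quoted values after gene_id and transcript_id
tx2gene=$(awk -F'\t' '$3 == "transcript" {
    match($9, /gene_id "[^"]+"/);       g = substr($9, RSTART + 9,  RLENGTH - 10)
    match($9, /transcript_id "[^"]+"/); t = substr($9, RSTART + 15, RLENGTH - 16)
    print t "\t" g
}' "$gtf")
printf '%s\n' "$tx2gene"
```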

Run tximport

cd ~/genomics

# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
    --gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --salmon_dir salmon_quant \
    --outdir counts

# Check the output files
ls -lh counts/

Script Parameters

--gtf: Path to GTF annotation file (used to generate tx2gene mapping)
--salmon_dir: Directory containing Salmon output (one subdirectory per sample)
--outdir: Output directory for count matrices
--tx2gene: (Optional) Use existing tx2gene.tsv instead of generating from GTF
--samples: (Optional) Comma-separated sample list (auto-detected if not provided)

Explore the expression matrix

cd ~/genomics

# View the first few genes in the count matrix
head -10 counts/gene_counts.tsv | column -t

# How many genes have counts? (wc -l counts every line, so subtract 1 if the file has a header row)
wc -l counts/gene_counts.tsv

# View TPM values (normalized)
head -10 counts/gene_tpm.tsv | column -t

# View sample info (auto-generated)
cat counts/sample_info.tsv

# Open the count matrix in MS Excel and get familiar with the content
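
The matrix can also be ranked on the command line. The sketch below assumes a layout that you should verify against your own gene_tpm.tsv: a header row, gene IDs in the first column, and one column per sample. It builds a toy file with made-up values, averages the two KO columns, and lists the two genes with the highest mean TPM.

```shell
workdir=$(mktemp -d)
tpm="$workdir/gene_tpm.tsv"
# Toy TPM matrix (assumed layout: gene ID, then one column per sample)
printf 'gene_id\tKO_1\tKO_2\tWT_1\n' >  "$tpm"
printf 'geneA\t10\t20\t5\n'          >> "$tpm"
printf 'geneB\t100\t300\t90\n'       >> "$tpm"
printf 'geneC\t0\t2\t1\n'            >> "$tpm"
printf 'geneD\t50\t50\t400\n'        >> "$tpm"

# Skip the header, average the KO columns (2 and 3), sort by the mean, keep the top 2
top_ko=$(tail -n +2 "$tpm" \
    | awk -F'\t' '{ printf "%s\t%.1f\n", $1, ($2 + $3) / 2 }' \
    | sort -k2,2nr \
    | head -n 2)
printf '%s\n' "$top_ko"
```

For the real matrix, adjust the column numbers to match the header line and change head -n 2 to the number of genes you want.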

Ready for Differential Expression

You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:

gene_counts.tsv - Raw counts for DESeq2
gene_tpm.tsv - TPM values for visualization
sample_info.tsv - Sample metadata
tx2gene.tsv - Transcript-to-gene mapping
tximport.rds - R object for DESeq2

Exercises

Exercise 1: Salmon Analysis

Which sample has the highest mapping rate?
How many transcripts were quantified in total?
What library type did Salmon detect? (Check lib_format_counts.json)

Exercise 2: Gene Expression

Find the top 5 most highly expressed genes (by TPM) in KO samples
Find the top 5 most highly expressed genes in WT samples
Are they the same genes? What might this tell you?

Exercise 3: Compare Approaches

Consider these questions:

When would you choose STAR + Salmon over Salmon only?
What information is lost when you skip genome alignment?
For a differential expression study, does it matter which approach you use?

Summary

In this lab, you have:

Understood the difference between pseudo-alignment and traditional alignment
Built a Salmon transcriptome index
Quantified all samples using Salmon's fast pseudo-alignment
Interpreted Salmon output files and quality metrics
Aggregated transcript counts to gene level with tximport
Prepared gene-level count matrices for DESeq2

Next: Lab 5 - Differential Expression Analysis