Lab 4b: Pseudo-alignment (Salmon Only)

Choose Your Path

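This lab runs Salmon pseudo-alignment directly on FASTQ files. Choose this approach when you only need expression quantification (for example, gene counts for differential expression) and do not need BAM files.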

If you need BAM files for IGV visualization, splice junction analysis, or variant calling, use Lab 4a: Genome Alignment (STAR + Salmon) instead.

Learning Objectives

Why Salmon for RNA-seq Quantification?

Salmon: Fast and Accurate Quantification

Salmon uses a pseudo-alignment approach that is fundamentally different from traditional alignment:

  • Pseudo-alignment: Maps reads to transcripts without full base-by-base alignment
  • Very fast: 10-100x faster than STAR alignment
  • Memory efficient: Requires only ~4 GB RAM (vs 32 GB for STAR)
  • Accurate: Comparable or better accuracy for gene-level counts
  • Bias correction: Built-in GC and sequence-specific bias correction

Pseudo-alignment vs Traditional Alignment

| Feature | Traditional (STAR) | Pseudo-alignment (Salmon) |
| --- | --- | --- |
| Speed | Slower | 10-100x faster |
| Memory | ~32 GB (human) | ~4 GB |
| Output | BAM files + counts | Counts only |
| Visualization | IGV compatible | No BAM files |
| Splice info | Yes | No |
| Use case | Full analysis | Expression quantification |

When to Use Pseudo-alignment

Salmon pseudo-alignment is ideal for differential expression studies where you only need gene counts. It's the preferred method in many modern RNA-seq pipelines (e.g., nf-core/rnaseq).

Check Reference Files

Make sure the steps in Lab 1, Parts 3-5 were successful. You need the transcriptome FASTA file and the GTF annotation file:

ls -lh ~/genomics/references/transcriptome_chr11.fa
ls -lh ~/genomics/references/Homo_sapiens.GRCh38.110.chr11.gtf
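
A quick look with ls can miss an empty or half-downloaded file. The sketch below shows one way to check that each reference file exists and is non-empty, and to count the FASTA records. The helper names check_refs and count_transcripts are hypothetical and not part of the lab materials; the demo runs on a toy FASTA file with made-up sequences.

```shell
# Report whether each file given as an argument exists and is non-empty.
# Returns non-zero if any file is missing or empty.
check_refs() {
    local missing=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "OK       $f"
        else
            echo "MISSING  $f"
            missing=1
        fi
    done
    return "$missing"
}

# Count FASTA records (each record starts with a '>' header line).
count_transcripts() {
    grep -c '^>' "$1"
}

# Demo on a toy FASTA file (made-up sequences, not lab data).
demo_dir=$(mktemp -d)
printf '>tx1\nACGT\n>tx2\nGGCC\n' > "$demo_dir/toy.fa"
check_refs "$demo_dir/toy.fa"
count_transcripts "$demo_dir/toy.fa"
```

In this lab you would pass the two reference paths listed above to check_refs, and the transcriptome FASTA to count_transcripts.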

Part 1: Build Salmon Transcriptome Index

Salmon requires an index built from transcript sequences (cDNA), not the genome.

Create the Salmon index

cd ~/genomics

# Activate environment
conda activate genomics

# Build Salmon index from chr11 transcriptome
# This takes 1-2 minutes
salmon index \
    -t references/transcriptome_chr11.fa \
    -i references/salmon_index \
    --threads 4

# Check the index files
ls -lh references/salmon_index/

Transcriptome Reference

Unlike STAR, which aligns to the genome, Salmon maps reads to transcript sequences (cDNA). This is part of why it is faster: it searches only annotated transcript sequences rather than the entire genome, so it does not need to perform splice-aware alignment.

Part 2: Quantify Samples with Salmon

Quantify one sample

cd ~/genomics

# Create output directory
mkdir -p salmon_quant

# Quantify KO_1 sample
salmon quant \
    -i references/salmon_index \
    -l A \
    -1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
    -2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    -o salmon_quant/KO_1 \
    --threads 4 \
    --validateMappings \
    --gcBias \
    --seqBias

# Check output
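ls -lh salmon_quant/KO_1/

Key Parameters
  • -l A: Auto-detect library type (stranded/unstranded)
  • --validateMappings: Enables selective alignment, which scores candidate mappings for more accurate quasi-mapping (in Salmon 1.0 and later this is already the default, so the flag is redundant but harmless)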
  • --gcBias: Correct for GC content bias
  • --seqBias: Correct for sequence-specific bias at read starts

Quantify all samples

cd ~/genomics

# Create logs directory
mkdir -p logs

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    # Extract short name (KO_1, KO_2, etc.)
    SHORT_NAME=$(echo "$SAMPLE" | cut -d'_' -f1,2)
    echo "Quantifying $SHORT_NAME..."

    # Salmon's stderr is appended to one shared log; a failure is reported but does not stop the loop
    salmon quant \
        -i references/salmon_index \
        -l A \
        -1 "trimmed_data/${SAMPLE}_1.trimmed.fastq.gz" \
        -2 "trimmed_data/${SAMPLE}_2.trimmed.fastq.gz" \
        -o "salmon_quant/${SHORT_NAME}" \
        --threads 4 \
        --validateMappings \
        --gcBias \
        --seqBias \
        2>> logs/salmon.log \
        || echo "WARNING: salmon quant failed for $SHORT_NAME (see logs/salmon.log)"
done

echo "Quantification complete!"
ls salmon_quant/
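
Before starting a long loop, it can help to confirm that the short names come out as intended and that both mate files exist for every sample. The sketch below is one way to do that without running Salmon. The helper names short_name and have_both_mates are hypothetical; the file naming pattern is taken from the commands above.

```shell
# Derive the short sample name (e.g. KO_1) from the full sample ID
# using shell parameter expansion instead of echo | cut.
short_name() {
    local sample=$1
    local condition=${sample%%_*}      # text before the first underscore, e.g. KO
    local rest=${sample#*_}            # text after the first underscore, e.g. 1_SRR10045016
    echo "${condition}_${rest%%_*}"    # e.g. KO_1
}

# Succeed only if both trimmed mate files exist and are non-empty.
have_both_mates() {
    local dir=$1 sample=$2
    [ -s "${dir}/${sample}_1.trimmed.fastq.gz" ] &&
    [ -s "${dir}/${sample}_2.trimmed.fastq.gz" ]
}

# Dry run: report what the loop would find, without calling Salmon.
for SAMPLE in KO_1_SRR10045016 WT_3_SRR10045021; do
    if have_both_mates trimmed_data "$SAMPLE"; then
        echo "$(short_name "$SAMPLE"): ready"
    else
        echo "$(short_name "$SAMPLE"): missing FASTQ file(s)"
    fi
done
```

Run from ~/genomics, every sample should report "ready" if trimming completed in the earlier lab.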

Part 3: Understanding Salmon Output

Explore the quantification file

cd ~/genomics

# View the quant.sf file
head -10 salmon_quant/KO_1/quant.sf

# Column descriptions:
# Name: Transcript ID (Ensembl format)
# Length: Transcript length in bp
# EffectiveLength: Length adjusted for fragment size and bias
# TPM: Transcripts Per Million (normalized expression)
# NumReads: Estimated number of reads mapping to this transcript

Understanding Salmon Metrics

| Metric | Description | Use |
| --- | --- | --- |
| NumReads | Estimated read count | Input for DESeq2 |
| TPM | Transcripts Per Million | Cross-sample comparison |
| EffectiveLength | Bias-corrected length | Internal calculation |
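
To see how the three metrics relate, TPM can be recomputed from NumReads and EffectiveLength: each transcript's rate is NumReads / EffectiveLength, and TPM is that rate divided by the sum of all rates, times one million. The sketch below applies this standard definition to a toy table with made-up values. The function name recompute_tpm is hypothetical, and the column order is assumed to match the quant.sf description above.

```shell
# Recompute TPM from a quant.sf-style, tab-separated table:
#   rate_i = NumReads_i / EffectiveLength_i
#   TPM_i  = 1e6 * rate_i / sum_j(rate_j)
recompute_tpm() {
    awk -F'\t' '
        NR == 1 { next }                                   # skip the header row
        $3 > 0  { name[NR] = $1; rate[NR] = $5 / $3; total += rate[NR] }
        END {
            if (total == 0) exit
            for (i = 2; i <= NR; i++)
                if (i in rate) printf "%s\t%.1f\n", name[i], 1e6 * rate[i] / total
        }
    ' "$1"
}

# Toy table with made-up values (not real Salmon output).
toy=$(mktemp)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  "$toy"
printf 'txA\t300\t100\t0\t100\n'                        >> "$toy"
printf 'txB\t500\t300\t0\t300\n'                        >> "$toy"
printf 'txC\t500\t300\t0\t600\n'                        >> "$toy"
recompute_tpm "$toy"
```

In the toy data, txA has fewer reads than txB but the same TPM because it is shorter, while txC has twice the TPM of txB because it has twice the reads at the same effective length.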

Check mapping rates

cd ~/genomics

# View the log file for mapping statistics
cat salmon_quant/KO_1/logs/salmon_quant.log

# Extract mapping rate for all samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
    sample=$(basename "$dir")
    # $dir already ends with a slash; the rate is the last field of the "Mapping rate" line
    rate=$(grep "Mapping rate" "${dir}logs/salmon_quant.log" | awk '{print $NF}')
    echo "$sample: $rate"
done

Expected Mapping Rates

| Mapping Rate | Quality | Notes |
| --- | --- | --- |
| >70% | Good | Expected for most RNA-seq |
| 50-70% | Acceptable | May indicate some issues |
| <50% | Investigate | Check data quality or reference |

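Note: Mapping rates against a chromosome-specific index may be lower than against a whole transcriptome, because reads from genes on other chromosomes have no matching transcript in the chr11 index.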
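
The thresholds in the table above can be applied automatically to each extracted rate. The sketch below is one way to do it. The function name classify_rate is hypothetical, and values exactly at 50 or 70 are treated as "Acceptable", which is an assumption about how the table's boundaries should be read.

```shell
# Classify a mapping rate (with or without a trailing %) using the
# thresholds from the table: >70 Good, 50-70 Acceptable, <50 Investigate.
classify_rate() {
    # ${1%\%} strips a trailing percent sign, so "85.3%" becomes "85.3".
    awk -v r="${1%\%}" 'BEGIN {
        if (r > 70)       print "Good"
        else if (r >= 50) print "Acceptable"
        else              print "Investigate"
    }'
}

# Example values (made up, not from the lab data).
classify_rate 85.3%
classify_rate 62
classify_rate 41.7%
```

Inside the loop above, the echo line could become echo "$sample: $rate ($(classify_rate "$rate"))" to print the label next to each rate.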

Explore additional output files

cd ~/genomics

# List all output files
ls -la salmon_quant/KO_1/

# View the metadata
cat salmon_quant/KO_1/aux_info/meta_info.json

# View library type detection
cat salmon_quant/KO_1/lib_format_counts.json
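
Rather than reading the whole JSON file, a single field can be pulled out on the command line. The sketch below uses grep, cut, and tr on a mock file. The helper name json_field is hypothetical, the mock values are made up, and the approach only suits flat "key": value lines whose values contain no spaces. The field names expected_format (in lib_format_counts.json) and percent_mapped (in aux_info/meta_info.json) are what Salmon is expected to write; check your own files if they differ.

```shell
# Print the value of a simple "key": value line from a JSON file.
# A text-based sketch, not a JSON parser: nested or multi-line values are not handled.
json_field() {
    grep "\"$2\"" "$1" | head -n 1 | cut -d: -f2- | tr -d ' ",'
}

# Mock file with made-up values (not real Salmon output).
mock=$(mktemp)
cat > "$mock" <<'EOF'
{
    "expected_format": "IU",
    "percent_mapped": 85.31
}
EOF

json_field "$mock" expected_format
json_field "$mock" percent_mapped
```

If jq is installed, jq -r .expected_format salmon_quant/KO_1/lib_format_counts.json is a more robust alternative.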

Part 4: Aggregate to Gene Level with tximport

Salmon outputs transcript-level counts. For differential expression with DESeq2, we aggregate to gene level using tximport.
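
At its core, aggregation means summing the estimated reads of all transcripts that belong to the same gene. The sketch below illustrates only that summing step, using awk on toy files with made-up values. tximport does more than this (for example, it also tracks per-gene average transcript length), so treat this as a simplified illustration, not a replacement. The function name sum_by_gene is hypothetical, and the mapping file is assumed to be tab-separated with no header row.

```shell
# Sum transcript-level NumReads per gene.
#   $1 = mapping file: transcript_id <TAB> gene_id (assumed: no header row)
#   $2 = quant.sf-style file with NumReads in column 5
sum_by_gene() {
    awk -F'\t' '
        NR == FNR  { gene[$1] = $2; next }       # first file: load transcript-to-gene map
        FNR == 1   { next }                      # second file: skip the header row
        $1 in gene { total[gene[$1]] += $5 }     # add NumReads to the parent gene
        END { for (g in total) printf "%s\t%.1f\n", g, total[g] }
    ' "$1" "$2" | sort
}

# Toy inputs with made-up values.
work=$(mktemp -d)
printf 'txA\tgeneX\ntxB\tgeneX\ntxC\tgeneY\n'           >  "$work/tx2gene.tsv"
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  "$work/quant.sf"
printf 'txA\t300\t100\t250000\t100\n'                   >> "$work/quant.sf"
printf 'txB\t500\t300\t250000\t300\n'                   >> "$work/quant.sf"
printf 'txC\t500\t300\t500000\t600\n'                   >> "$work/quant.sf"
sum_by_gene "$work/tx2gene.tsv" "$work/quant.sf"
```

Here txA and txB both map to geneX, so their reads are added together, while geneY keeps the reads of its single transcript.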

Download the tximport script

We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.

cd ~/genomics

# Create scripts directory
mkdir -p scripts

# Download the tximport script from GitHub
wget -O scripts/run_tximport.R https://raw.githubusercontent.com/bioinfo-kaust/academy-stage3-2026/refs/heads/main/scripts/run_tximport.R
    
# Make it executable (optional)
chmod +x scripts/run_tximport.R

# View script help
Rscript scripts/run_tximport.R --help

Script Features
  • Automatically generates tx2gene mapping from your GTF file
  • Auto-detects sample directories in salmon_quant folder
  • Outputs count matrices, TPM values, and sample info
  • Creates tximport.rds object ready for DESeq2
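
To make the tx2gene step less of a black box, the sketch below shows one way to extract transcript-to-gene pairs from a GTF file with awk. How run_tximport.R builds its mapping is not shown in this lab, so this is an illustration under assumptions, not the script's method. The function name gtf_to_tx2gene is hypothetical, the toy GTF lines are made up, and the IDs are emitted without version suffixes.

```shell
# Print "transcript_id <TAB> gene_id" for every transcript feature in a GTF file.
gtf_to_tx2gene() {
    awk -F'\t' '
        $3 == "transcript" {
            tx = ""; gene = ""
            # match() sets RSTART/RLENGTH; the offsets skip the key and quotes.
            if (match($9, /transcript_id "[^"]+"/)) tx   = substr($9, RSTART + 15, RLENGTH - 16)
            if (match($9, /gene_id "[^"]+"/))       gene = substr($9, RSTART + 9,  RLENGTH - 10)
            if (tx != "" && gene != "") print tx "\t" gene
        }
    ' "$1"
}

# Toy GTF with made-up IDs: one gene, one transcript, one exon.
toy_gtf=$(mktemp)
printf '#!genome-build toy\n'                                                                        >  "$toy_gtf"
printf '11\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_TOY1"; gene_name "TOY1";\n'                  >> "$toy_gtf"
printf '11\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_TOY1"; transcript_id "ENST_TOY1";\n'   >> "$toy_gtf"
printf '11\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "ENSG_TOY1"; transcript_id "ENST_TOY1";\n'         >> "$toy_gtf"
gtf_to_tx2gene "$toy_gtf"
```

Only the transcript line produces output, so each transcript appears once. If the transcript names in quant.sf carry version suffixes (such as .2), the IDs would need to be matched accordingly.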

Run tximport

cd ~/genomics

# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
    --gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --salmon_dir salmon_quant \
    --outdir counts

# Check the output files
ls -lh counts/

Script Parameters
  • --gtf: Path to GTF annotation file (used to generate tx2gene mapping)
  • --salmon_dir: Directory containing Salmon output (one subdirectory per sample)
  • --outdir: Output directory for count matrices
  • --tx2gene: (Optional) Use existing tx2gene.tsv instead of generating from GTF
  • --samples: (Optional) Comma-separated sample list (auto-detected if not provided)

Explore the expression matrix

cd ~/genomics

# View the first few genes in the count matrix
head -10 counts/gene_counts.tsv | column -t

# How many genes are in the matrix? (the line count includes the header row, if the file has one)
wc -l counts/gene_counts.tsv

# View TPM values (normalized)
head -10 counts/gene_tpm.tsv | column -t

# View sample info (auto-generated)
cat counts/sample_info.tsv

# Open the count matrix in MS Excel and get familiar with the content
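
The line count above includes every gene in the matrix, even those with zero reads in all samples. The sketch below counts only genes with a nonzero count in at least one sample. The function name count_expressed is hypothetical, and the layout is an assumption: tab-separated, one header row, gene ID in the first column, and one column per sample. The toy matrix uses made-up values.

```shell
# Count genes with a nonzero value in at least one sample column.
count_expressed() {
    awk -F'\t' '
        NR == 1 { next }                                    # skip the header row
        { for (i = 2; i <= NF; i++) if ($i > 0) { n++; break } }
        END { print n + 0 }                                 # print 0 if nothing matched
    ' "$1"
}

# Toy count matrix with made-up values.
toy_counts=$(mktemp)
printf 'gene_id\tKO_1\tWT_1\n'  >  "$toy_counts"
printf 'geneX\t400\t350\n'      >> "$toy_counts"
printf 'geneY\t0\t0\n'          >> "$toy_counts"
printf 'geneZ\t0\t12.5\n'       >> "$toy_counts"
count_expressed "$toy_counts"
```

In this lab the function would be called as count_expressed counts/gene_counts.tsv, provided the file matches the assumed layout.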

Ready for Differential Expression

You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:

  • gene_counts.tsv - Raw counts for DESeq2
  • gene_tpm.tsv - TPM values for visualization
  • sample_info.tsv - Sample metadata
  • tx2gene.tsv - Transcript-to-gene mapping
  • tximport.rds - R object for DESeq2

Exercises

Exercise 1: Salmon Analysis

  1. Which sample has the highest mapping rate?
  2. How many transcripts were quantified in total?
  3. What library type did Salmon detect? (Check lib_format_counts.json)

Exercise 2: Gene Expression

  1. Find the top 5 most highly expressed genes (by TPM) in KO samples
  2. Find the top 5 most highly expressed genes in WT samples
  3. Are they the same genes? What might this tell you?

Exercise 3: Compare Approaches

Consider these questions:

  1. When would you choose STAR + Salmon over Salmon only?
  2. What information is lost when you skip genome alignment?
  3. For a differential expression study, does it matter which approach you use?

Summary

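In this lab, you have:

  • Built a Salmon index from the chr11 transcriptome
  • Quantified all six samples directly from the trimmed FASTQ files
  • Inspected quant.sf, mapping rates, and library type detection
  • Aggregated transcript-level estimates to gene level with tximport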

Next: Lab 5 - Differential Expression Analysis