Lab 4b: Pseudo-alignment (Salmon Only)
This lab uses Salmon pseudo-alignment directly on FASTQ files. Choose this approach when you:
- Only need gene/transcript expression counts
- Want faster processing (no genome alignment step)
- Have limited computational resources
- Don't need BAM files for visualization
If you need BAM files for IGV visualization, splice junction analysis, or variant calling, use Lab 4a: Genome Alignment (STAR + Salmon) instead.
Learning Objectives
- Understand pseudo-alignment vs traditional alignment
- Build a Salmon transcriptome index
- Quantify transcript abundance with Salmon
- Interpret Salmon output and quality metrics
- Aggregate transcript counts to gene level with tximport
Why Salmon for RNA-seq Quantification?
Salmon: Fast and Accurate Quantification
Salmon uses a pseudo-alignment approach that is fundamentally different from traditional alignment:
- Pseudo-alignment: Maps reads to transcripts without full base-by-base alignment
- Very fast: 10-100x faster than STAR alignment
- Memory efficient: Requires only ~4 GB RAM (vs 32 GB for STAR)
- Accurate: Comparable or better accuracy for gene-level counts
- Bias correction: Built-in GC and sequence-specific bias correction
Pseudo-alignment vs Traditional Alignment
| Feature | Traditional (STAR) | Pseudo-alignment (Salmon) |
|---|---|---|
| Speed | Slower | 10-100x faster |
| Memory | ~32 GB (human) | ~4 GB |
| Output | BAM files + counts | Counts only |
| Visualization | IGV compatible | No BAM files |
| Splice info | Yes | No |
| Use case | Full analysis | Expression quantification |
Salmon pseudo-alignment is ideal for differential expression studies where you only need gene counts. It's the preferred method in many modern RNA-seq pipelines (e.g., nf-core/rnaseq).
Check Reference Files
Make sure the steps in Lab 1 Part 3-5 were successful. You need the transcriptome FASTA file:
ls -lh ~/genomics/references/transcriptome_chr11.fa
ls -lh ~/genomics/references/Homo_sapiens.GRCh38.110.chr11.gtf
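As a quick sanity check, you can also count how many transcript records the FASTA contains, since every FASTA record header begins with `>`:

```shell
# Each FASTA record starts with ">", so this counts transcripts in the reference
grep -c "^>" ~/genomics/references/transcriptome_chr11.fa
```

If this prints 0, the file is probably not a transcriptome FASTA; revisit Lab 1.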
Part 1: Build Salmon Transcriptome Index
Salmon requires an index built from transcript sequences (cDNA), not the genome.
Create the Salmon index
cd ~/genomics
# Activate environment
conda activate genomics
# Build Salmon index from chr11 transcriptome
# This takes 1-2 minutes
salmon index \
-t references/transcriptome_chr11.fa \
-i references/salmon_index \
--threads 4
# Check the index files
ls -lh references/salmon_index/
Unlike STAR, which indexes the entire genome, Salmon indexes transcript sequences (cDNA). This is part of why it is faster: it searches only annotated transcripts, not the whole genome.
Part 2: Quantify Samples with Salmon
Quantify one sample
cd ~/genomics
# Create output directory
mkdir -p salmon_quant
# Quantify KO_1 sample
salmon quant \
-i references/salmon_index \
-l A \
-1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
-2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
-o salmon_quant/KO_1 \
--threads 4 \
--validateMappings \
--gcBias \
--seqBias
# Check output
ls -lh salmon_quant/KO_1/
- -l A: Auto-detect library type (stranded/unstranded)
- --validateMappings: More accurate selective alignment (the default in recent Salmon versions)
- --gcBias: Correct for GC content bias
- --seqBias: Correct for sequence-specific bias at read starts
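With one sample quantified, you can already peek at the most highly expressed transcripts. This uses the quant.sf column layout described below (transcript ID in column 1, TPM in column 4):

```shell
# Top 5 transcripts by TPM in KO_1 (skip the header line; TPM is column 4)
tail -n +2 salmon_quant/KO_1/quant.sf | sort -t$'\t' -k4,4 -nr | head -5 | cut -f1,4
```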
Quantify all samples
cd ~/genomics
# Create logs directory
mkdir -p logs
# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"
# Process each sample
for SAMPLE in $SAMPLES; do
# Extract short name (KO_1, KO_2, etc.)
SHORT_NAME=$(echo $SAMPLE | cut -d'_' -f1,2)
echo "Quantifying $SHORT_NAME..."
salmon quant \
-i references/salmon_index \
-l A \
-1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
-2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
-o salmon_quant/${SHORT_NAME} \
--threads 4 \
--validateMappings \
--gcBias \
--seqBias \
2>> logs/salmon.log
done
echo "Quantification complete!"
ls salmon_quant/
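Before moving on, it is worth confirming that every sample actually produced a non-empty quant.sf (a failed run can leave an output directory behind without one):

```shell
# Check that each sample directory contains a non-empty quant.sf
for s in KO_1 KO_2 KO_3 WT_1 WT_2 WT_3; do
    if [ -s "salmon_quant/$s/quant.sf" ]; then
        echo "$s: OK"
    else
        echo "$s: MISSING quant.sf" >&2
    fi
done
```

Any sample flagged as missing should be re-run after checking logs/salmon.log.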
Part 3: Understanding Salmon Output
Explore the quantification file
cd ~/genomics
# View the quant.sf file
head -10 salmon_quant/KO_1/quant.sf
# Column descriptions:
# Name: Transcript ID (Ensembl format)
# Length: Transcript length in bp
# EffectiveLength: Length adjusted for fragment size and bias
# TPM: Transcripts Per Million (normalized expression)
# NumReads: Estimated number of reads mapping to this transcript
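A useful property to verify: TPM is normalized so that each sample's values sum to approximately 1,000,000, which is what makes TPM comparable across samples. You can check this with awk on column 4:

```shell
# TPM values should sum to ~1,000,000 per sample by construction (column 4)
awk -F'\t' 'NR > 1 { tpm += $4 } END { printf "TPM sum: %.0f\n", tpm }' \
    salmon_quant/KO_1/quant.sf
```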
Understanding Salmon Metrics
| Metric | Description | Use |
|---|---|---|
| NumReads | Estimated read count | Input for DESeq2 |
| TPM | Transcripts Per Million | Cross-sample comparison |
| EffectiveLength | Bias-corrected length | Internal calculation |
Check mapping rates
cd ~/genomics
# View the log file for mapping statistics
cat salmon_quant/KO_1/logs/salmon_quant.log
# Extract mapping rate for all samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
sample=$(basename $dir)
rate=$(grep "Mapping rate" $dir/logs/salmon_quant.log | awk '{print $NF}')
echo "$sample: $rate"
done
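The same numbers are also available in machine-readable form. As a sketch, the snippet below reads the mapping rate from each sample's aux_info/meta_info.json; "percent_mapped" is the field name written by recent Salmon versions (inspect one of your own meta_info.json files if it differs):

```shell
# Read the mapping rate from each sample's JSON metadata
python3 -c '
import json, glob, os
for path in sorted(glob.glob("salmon_quant/*/aux_info/meta_info.json")):
    sample = path.split(os.sep)[1]
    with open(path) as fh:
        meta = json.load(fh)
    print("%s: %.2f%%" % (sample, meta.get("percent_mapped", float("nan"))))
'
```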
Expected Mapping Rates
| Mapping Rate | Quality | Notes |
|---|---|---|
| >70% | Good | Expected for most RNA-seq |
| 50-70% | Acceptable | May indicate some issues |
| <50% | Investigate | Check data quality or reference |
Note: Mapping rates for chromosome-specific data may be lower than whole-genome data.
Explore additional output files
cd ~/genomics
# List all output files
ls -la salmon_quant/KO_1/
# View the metadata
cat salmon_quant/KO_1/aux_info/meta_info.json
# View library type detection
cat salmon_quant/KO_1/lib_format_counts.json
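Rather than reading the whole JSON, you can pull out just the detected library type. The field names below ("expected_format", "compatible_fragment_ratio") are those written by current Salmon versions; check your own lib_format_counts.json if yours differs:

```shell
# Extract the auto-detected library type for KO_1
python3 -c '
import json
with open("salmon_quant/KO_1/lib_format_counts.json") as fh:
    info = json.load(fh)
print("Detected library type:", info.get("expected_format"))
print("Compatible fragment ratio:", info.get("compatible_fragment_ratio"))
'
```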
Part 4: Aggregate to Gene Level with tximport
Salmon outputs transcript-level counts. For differential expression with DESeq2, we aggregate to gene level using tximport.
Download the tximport script
We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.
cd ~/genomics
# Create scripts directory
mkdir -p scripts
# Download the tximport script from GitHub
wget -O scripts/run_tximport.R https://raw.githubusercontent.com/bioinfo-kaust/academy-stage3-2026/refs/heads/main/scripts/run_tximport.R
# Make it executable (optional)
chmod +x scripts/run_tximport.R
# View script help
Rscript scripts/run_tximport.R --help
The script:
- Automatically generates tx2gene mapping from your GTF file
- Auto-detects sample directories in salmon_quant folder
- Outputs count matrices, TPM values, and sample info
- Creates tximport.rds object ready for DESeq2
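If you are curious what the tx2gene step does under the hood, a minimal transcript-to-gene table can be sketched directly from the GTF with awk. This assumes standard Ensembl attribute formatting (gene_id "..."; transcript_id "...";); the R script handles edge cases such as transcript versions more robustly, so prefer it for real analysis:

```shell
# Pull transcript_id -> gene_id pairs from "transcript" feature lines
awk -F'\t' '$3 == "transcript" {
    match($9, /transcript_id "[^"]+"/); tx = substr($9, RSTART+15, RLENGTH-16)
    match($9, /gene_id "[^"]+"/);       g  = substr($9, RSTART+9,  RLENGTH-10)
    print tx "\t" g
}' references/Homo_sapiens.GRCh38.110.chr11.gtf | sort -u | head
```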
Run tximport
cd ~/genomics
# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
--gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
--salmon_dir salmon_quant \
--outdir counts
# Check the output files
ls -lh counts/
- --gtf: Path to GTF annotation file (used to generate the tx2gene mapping)
- --salmon_dir: Directory containing Salmon output (one subdirectory per sample)
- --outdir: Output directory for count matrices
- --tx2gene: (Optional) Use an existing tx2gene.tsv instead of generating one from the GTF
- --samples: (Optional) Comma-separated sample list (auto-detected if not provided)
Explore the expression matrix
cd ~/genomics
# View the first few genes in the count matrix
head -10 counts/gene_counts.tsv | column -t
# How many genes have counts?
wc -l counts/gene_counts.tsv
# View TPM values (normalized)
head -10 counts/gene_tpm.tsv | column -t
# View sample info (auto-generated)
cat counts/sample_info.tsv
# Open the count matrix in MS Excel and get familiar with the content
You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:
- gene_counts.tsv - Raw counts for DESeq2
- gene_tpm.tsv - TPM values for visualization
- sample_info.tsv - Sample metadata
- tx2gene.tsv - Transcript-to-gene mapping
- tximport.rds - R object for DESeq2
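To get a head start on Exercise 2, you can rank genes by expression from the shell. This assumes the matrix layout the script produces (gene IDs in column 1, one column per sample after that); adjust the `-k2,2` sort key to look at a different sample column:

```shell
# Top 5 genes by TPM in the first sample column
tail -n +2 counts/gene_tpm.tsv | sort -t$'\t' -k2,2 -nr | head -5 | cut -f1,2
```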
Exercises
Exercise 1: Salmon Analysis
- Which sample has the highest mapping rate?
- How many transcripts were quantified in total?
- What library type did Salmon detect? (Check lib_format_counts.json)
Exercise 2: Gene Expression
- Find the top 5 most highly expressed genes (by TPM) in KO samples
- Find the top 5 most highly expressed genes in WT samples
- Are they the same genes? What might this tell you?
Exercise 3: Compare Approaches
Consider these questions:
- When would you choose STAR + Salmon over Salmon only?
- What information is lost when you skip genome alignment?
- For a differential expression study, does it matter which approach you use?
Summary
In this lab, you have:
- Understood the difference between pseudo-alignment and traditional alignment
- Built a Salmon transcriptome index
- Quantified all samples using Salmon's fast pseudo-alignment
- Interpreted Salmon output files and quality metrics
- Aggregated transcript counts to gene level with tximport
- Prepared gene-level count matrices for DESeq2