Lab 4a: Genome Alignment (STAR + Salmon)
This lab uses genome alignment with STAR followed by Salmon quantification. Choose this approach when you need:
- BAM files for visualization in IGV
- Read-level information (splice junctions, mutations, etc.)
- Integration with variant calling or other genome-based analyses
If you only need gene expression counts and want faster processing, use Lab 4b: Pseudo-alignment (Salmon only) instead.
- Build a STAR genome index for chromosome 11
- Align RNA-seq reads to the reference genome using STAR
- Understand BAM file format and alignment statistics
- Use samtools to inspect and manipulate BAM files
- Quantify transcripts with Salmon and aggregate to gene level
Why STAR for RNA-seq Alignment?
STAR (Spliced Transcripts Alignment to a Reference)
STAR is the gold-standard aligner for RNA-seq data because it:
- Handles splicing: Can align reads across exon-exon junctions
- Fast: Uses uncompressed suffix arrays for rapid alignment
- Accurate: High mapping accuracy even for novel splice junctions
- Flexible: Outputs BAM files compatible with downstream tools
STAR requires significant RAM (~32 GB for full human genome). Using chromosome 11 only reduces this to ~4 GB, making it suitable for laptops.
STAR has many parameters and options, always revisit STAR manual (current version 2.7.11b) for details
Check Genome Reference Files
Make sure the steps in Lab 1 Part 3-5 were successful, Check contents of the reference folderls -lh ~/genomics/references
Part 1: Build STAR Genome Index
Before alignment, STAR requires a pre-built genome index.
Create the STAR index
cd ~/genomics
# Activate environment
conda activate genomics
# Create directory for STAR index
mkdir -p references/star_index
# Build STAR index for chromosome 11
# This takes 5-10 minutes and ~4 GB RAM
STAR --runMode genomeGenerate \
--runThreadN 4 \
--genomeDir references/star_index \
--genomeFastaFiles references/Homo_sapiens.GRCh38.dna.chromosome.11.fa \
--sjdbGTFfile references/Homo_sapiens.GRCh38.110.chr11.gtf \
--sjdbOverhang 69 \
--genomeSAindexNbases 11
# Check the generated index files
ls -lh references/star_index/
--sjdbOverhang 69: ReadLength - 1 (our reads are 70 bp)--genomeSAindexNbases 11: Reduced for small genomes (chr11 only)
Part 2: Align Reads with STAR
Align one sample
cd ~/genomics
# Align KO_1 sample
STAR --runThreadN 4 \
--genomeDir references/star_index \
--readFilesIn trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
--outFileNamePrefix alignment/KO_1_ \
--outSAMattributes Standard \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
# Check output files
ls -lh alignment/KO_1_*
If you have issues running STAR,please download this zip file to get pre-aligned BAM files for all samples (chr11)https://drive.google.com/drive/folders/1VHhugmCIAog3HgoTaODQWS_8rLVObIFo?usp=sharing
--readFilesCommand zcat: You might need to change this depending what extraction method is available on your system- In case STAT does not finish normally, Review
alignment/sample_Log.outfor information. On Mac it might have issues with the CPU architecture - The
--sjdbGTFfileoption must have been used during star_index generation when--quantMode GeneCountsis enabled.
Align all samples
cd ~/genomics
# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"
# Align each sample
for SAMPLE in $SAMPLES; do
echo "Aligning $SAMPLE..."
STAR --runThreadN 4 \
--genomeDir references/star_index \
--readFilesIn trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
--outFileNamePrefix alignment/${SAMPLE}_ \
--outSAMattributes Standard \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
2>> logs/star.log
done
echo "Alignment complete!"
ls -lh alignment/*.bam
Part 3: Understanding STAR Output
Review alignment log
cd ~/genomics
# View the alignment summary (Log.final.out)
cat alignment/KO_1_Log.final.out
# Key metrics to look for:
# - Uniquely mapped reads %: Should be >80% for good data
# - % of reads mapped to multiple loci: High values may indicate repetitive sequences
# - % of reads unmapped: Should be relatively low
Expected Alignment Metrics
| Metric | Good | Acceptable | Investigate |
|---|---|---|---|
| Uniquely mapped % | >80% | 70-80% | <70% |
| Multi-mapped % | <10% | 10-20% | >20% |
| Unmapped % | <5% | 5-15% | >15% |
Compare alignment rates across samples
Go through each file manually, see alignment/sampleName_Log.final.out and look for Sample,Input_Reads,Uniquely_Mapped,Unique%,Multi_Mapped,Unmapped rows
Part 4: Working with BAM Files
Index BAM files with samtools
cd ~/genomics/
# Index all BAM files (required for IGV and many tools)
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
echo "Indexing $bam..."
samtools index $bam
done
# Verify index files were created
ls -lh alignment/*.bai
Explore BAM file structure
# View BAM header
samtools view -H alignment/KO_1_Aligned.sortedByCoord.out.bam | head -20
# View first 5 aligned reads
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam | head -5
# Count total aligned reads
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam
# Count properly paired reads
samtools view -c -f 2 alignment/KO_1_Aligned.sortedByCoord.out.bam
| Col | Field | Description |
|---|---|---|
| 1 | QNAME | Read name |
| 2 | FLAG | Bitwise flag (paired, mapped, etc.) |
| 3 | RNAME | Reference chromosome |
| 4 | POS | Mapping position |
| 5 | MAPQ | Mapping quality |
| 6 | CIGAR | Alignment description |
| 10 | SEQ | Read sequence |
| 11 | QUAL | Quality scores |
To know what flag to use when inspecting BAM files using samtools, visit Explain flags check the desired boxes and use the resulting number from the top with -f or -F flags in samtools
Generate alignment statistics
# Detailed statistics with samtools flagstat
samtools flagstat alignment/KO_1_Aligned.sortedByCoord.out.bam
# Generate stats for all samples
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
echo "--- $bam ---"
samtools flagstat $bam | head -5
echo ""
done
Part 5: Visualizing Aligned Reads
View in IGV (Integrative Genomics Viewer)
# Launch IGV (if installed via conda)
igv &
# In IGV:
# 1. Genomes > Load Genome from File > references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# 2. File > Load from File > references/references/Homo_sapiens.GRCh38.110.chr11.gtf
# 3. File > Load from File > alignment/KO_1_Aligned.sortedByCoord.out.bam
# 4. Navigate to a gene: e.g., search for "MDK" (Midkine gene on chr11)
# 5. Check reads mapping to each exon in "Midkine" gene
# 6. Load BAM file for a WT sample and see if there are any differences with the KO sample above
Alternative Approach: use samtools to investigate BAM files
# View reads at a specific position (add chr:start-end to samtools command)
# MDK gene is approximately at chr11:46380756–46383837
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837 | head -5
# Count reads in the MDK gene region
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837
# Compare expression between conditions
echo "=== Reads in MDK gene region ==="
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
echo $bam
samtools view -c $bam chr11:46380756–46383837
done
Part 6: Transcript Quantification with Salmon
After alignment, we use Salmon to quantify transcript abundance. Salmon provides accurate counts that work well with DESeq2.
Why Salmon?
- Fast: Quasi-mapping is much faster than traditional alignment
- Accurate: Handles multi-mapping reads and transcript isoforms well
- Bias correction: Built-in GC and sequence bias correction
- Compatible: Output works directly with DESeq2 via tximport
Build Salmon transcriptome index
cd ~/genomics
# Build Salmon index from chr11 transcriptome
salmon index \
-t references/transcriptome_chr11.fa \
-i references/salmon_index \
--threads 4
# Check the index
ls -lh references/salmon_index/
Salmon uses a transcriptome reference (cDNA sequences) rather than the genome. This is faster and more appropriate for RNA-seq quantification.
Quantify all samples with Salmon
cd ~/genomics
# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"
# Process each sample
for SAMPLE in $SAMPLES; do
# Extract short name (KO_1, KO_2, etc.)
SHORT_NAME=$(echo $SAMPLE | cut -d'_' -f1,2)
echo "Quantifying $SHORT_NAME..."
salmon quant \
-i references/salmon_index \
-l A \
-1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
-2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
-o salmon_quant/${SHORT_NAME} \
--threads 4 \
--validateMappings \
--gcBias \
--seqBias \
2>> logs/salmon.log
done
echo "Quantification complete!"
ls salmon_quant/
-l A: Auto-detect library type--validateMappings: More accurate mapping--gcBias: Correct for GC content bias--seqBias: Correct for sequence-specific bias
Explore Salmon output
# View the quantification file
head -20 salmon_quant/KO_1/quant.sf
# The columns are:
# Name: Transcript ID
# Length: Transcript length
# EffectiveLength: Length adjusted for bias
# TPM: Transcripts Per Million
# NumReads: Estimated read count
# Check mapping rate
grep "Mapping rate" salmon_quant/KO_1/logs/salmon_quant.log
# Compare mapping rates across samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
sample=$(basename $dir)
rate=$(grep "Mapping rate" $dir/logs/salmon_quant.log | awk '{print $NF}')
echo "$sample: $rate"
done
Part 7: Aggregate to Gene Level with tximport
Salmon provides transcript-level counts. We use tximport in R to aggregate these to gene level for DESeq2.
Download the tximport script
We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.
cd ~/genomics
# Install required R packages (if not already installed)
Rscript -e 'install.packages(c("tximport", "optparse"), repos="https://cloud.r-project.org")'
# Create scripts directory
mkdir -p scripts
# Download the tximport script from GitHub
wget -O scripts/run_tximport.R \
https://raw.githubusercontent.com/kaust-academy/academy-stage3-2026/main/scripts/run_tximport.R
# Make it executable (optional)
chmod +x scripts/run_tximport.R
# View script help
Rscript scripts/run_tximport.R --help
- Automatically generates tx2gene mapping from your GTF file
- Auto-detects sample directories in salmon_quant folder
- Outputs count matrices, TPM values, and sample info
- Creates tximport.rds object ready for DESeq2
Run tximport
cd ~/genomics
# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
--gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
--salmon_dir salmon_quant \
--outdir counts
# Check the output files
ls -lh counts/
--gtf: Path to GTF annotation file (used to generate tx2gene mapping)--salmon_dir: Directory containing Salmon output (one subdirectory per sample)--outdir: Output directory for count matrices--tx2gene: (Optional) Use existing tx2gene.tsv instead of generating from GTF--samples: (Optional) Comma-separated sample list (auto-detected if not provided)
Explore the expression matrix
cd ~/genomics
# View the first few genes in the count matrix
head -20 counts/gene_counts.tsv | column -t
# How many genes have counts?
wc -l counts/gene_counts.tsv
# View TPM values (normalized)
head -20 counts/gene_tpm.tsv | column -t
# View sample info (auto-generated)
cat counts/sample_info.tsv
# Open the count matrix in MS Excel and get familiar with the content
You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:
gene_counts.tsv- Raw counts for DESeq2gene_tpm.tsv- TPM values for visualizationsample_info.tsv- Sample metadatatx2gene.tsv- Transcript-to-gene mappingtximport.rds- R object for DESeq2
Exercises
Exercise 1: Alignment Analysis
- Which sample has the highest mapping rate? Which has the lowest?
- How many reads map to multiple locations? Why might this occur?
- Why did we use
--sjdbOverhang 69in building the STAR index? Hint: use STAR manual to answer.
Exercise 2: BAM Exploration
Use samtools:
# How many reads have mapping quality >= 30?
samtools view -c -q 30 alignment/KO_1_Aligned.sortedByCoord.out.bam
# How many reads are on the forward strand?
samtools view -c -F 16 alignment/KO_1_Aligned.sortedByCoord.out.bam
# Extract reads from a specific region to a new BAM
samtools view -b alignment/KO_1_Aligned.sortedByCoord.out.bam 11:1000000-2000000 > test_region.bam
Question: Which condition (WT vs. KO) have more reads in "TTC17" gene?
Summary
In this lab, you have:
- Built a STAR genome index for chromosome 11
- Aligned RNA-seq reads using STAR with splice-aware alignment
- Interpreted alignment logs and metrics
- Used samtools to index, view, and analyze BAM files
- Explored read coverage at specific genomic regions
- Quantified transcripts with Salmon
- Aggregated transcript counts to gene level with tximport