Lab 4a: Genome Alignment (STAR + Salmon)

Choose Your Path

This lab uses genome alignment with STAR followed by Salmon quantification. Choose this approach when you need:

If you only need gene expression counts and want faster processing, use Lab 4b: Pseudo-alignment (Salmon only) instead.

Learning Objectives

Why STAR for RNA-seq Alignment?

STAR (Spliced Transcripts Alignment to a Reference)

STAR is the gold-standard aligner for RNA-seq data because it:

  • Handles splicing: Can align reads across exon-exon junctions
  • Fast: Uses uncompressed suffix arrays for rapid alignment
  • Accurate: High mapping accuracy even for novel splice junctions
  • Flexible: Outputs BAM files compatible with downstream tools
Memory Requirements

STAR requires significant RAM (~32 GB for full human genome). Using chromosome 11 only reduces this to ~4 GB, making it suitable for laptops.

STAR has many parameters and options, always revisit STAR manual (current version 2.7.11b) for details

Check Genome Reference Files

Make sure the steps in Lab 1 Part 3-5 were successful, Check contents of the reference folderls -lh ~/genomics/references

Part 1: Build STAR Genome Index

Before alignment, STAR requires a pre-built genome index.

Create the STAR index

cd ~/genomics

# Activate environment
conda activate genomics

# Create directory for STAR index
mkdir -p references/star_index

# Build STAR index for chromosome 11
# This takes 5-10 minutes and ~4 GB RAM
STAR --runMode genomeGenerate \
    --runThreadN 4 \
    --genomeDir references/star_index \
    --genomeFastaFiles references/Homo_sapiens.GRCh38.dna.chromosome.11.fa \
    --sjdbGTFfile references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --sjdbOverhang 69 \
    --genomeSAindexNbases 11

# Check the generated index files
ls -lh references/star_index/
Key Parameters
  • --sjdbOverhang 69: ReadLength - 1 (our reads are 70 bp)
  • --genomeSAindexNbases 11: Reduced for small genomes (chr11 only)

Part 2: Align Reads with STAR

Align one sample

cd ~/genomics

# Align KO_1 sample
STAR --runThreadN 4 \
    --genomeDir references/star_index \
    --readFilesIn trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
                  trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    --outFileNamePrefix alignment/KO_1_ \
    --outSAMattributes Standard \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts

# Check output files
ls -lh alignment/KO_1_*
Download aligned chr11 BAM file

If you have issues running STAR,please download this zip file to get pre-aligned BAM files for all samples (chr11)https://drive.google.com/drive/folders/1VHhugmCIAog3HgoTaODQWS_8rLVObIFo?usp=sharing

Troubleshooting
  • --readFilesCommand zcat: You might need to change this depending what extraction method is available on your system
  • In case STAT does not finish normally, Review alignment/sample_Log.out for information. On Mac it might have issues with the CPU architecture
  • The --sjdbGTFfile option must have been used during star_index generation when --quantMode GeneCounts is enabled.

Align all samples

cd ~/genomics

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Align each sample
for SAMPLE in $SAMPLES; do
    echo "Aligning $SAMPLE..."

    STAR --runThreadN 4 \
        --genomeDir references/star_index \
        --readFilesIn trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
                      trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        --outFileNamePrefix alignment/${SAMPLE}_ \
        --outSAMattributes Standard \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode GeneCounts
        2>> logs/star.log
done

echo "Alignment complete!"
ls -lh alignment/*.bam

Part 3: Understanding STAR Output

Review alignment log

cd ~/genomics

# View the alignment summary (Log.final.out)
cat alignment/KO_1_Log.final.out

# Key metrics to look for:
# - Uniquely mapped reads %: Should be >80% for good data
# - % of reads mapped to multiple loci: High values may indicate repetitive sequences
# - % of reads unmapped: Should be relatively low

Expected Alignment Metrics

MetricGoodAcceptableInvestigate
Uniquely mapped %>80%70-80%<70%
Multi-mapped %<10%10-20%>20%
Unmapped %<5%5-15%>15%

Compare alignment rates across samples

Go through each file manually, see alignment/sampleName_Log.final.out and look for Sample,Input_Reads,Uniquely_Mapped,Unique%,Multi_Mapped,Unmapped rows

Part 4: Working with BAM Files

Index BAM files with samtools

cd ~/genomics/

# Index all BAM files (required for IGV and many tools)
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo "Indexing $bam..."
    samtools index $bam
done

# Verify index files were created
ls -lh alignment/*.bai

Explore BAM file structure

# View BAM header
samtools view -H alignment/KO_1_Aligned.sortedByCoord.out.bam | head -20

# View first 5 aligned reads
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam | head -5

# Count total aligned reads
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam

# Count properly paired reads
samtools view -c -f 2 alignment/KO_1_Aligned.sortedByCoord.out.bam
BAM Format - Key Columns
ColFieldDescription
1QNAMERead name
2FLAGBitwise flag (paired, mapped, etc.)
3RNAMEReference chromosome
4POSMapping position
5MAPQMapping quality
6CIGARAlignment description
10SEQRead sequence
11QUALQuality scores
Samtools flags

To know what flag to use when inspecting BAM files using samtools, visit Explain flags check the desired boxes and use the resulting number from the top with -f or -F flags in samtools

Generate alignment statistics

# Detailed statistics with samtools flagstat
samtools flagstat alignment/KO_1_Aligned.sortedByCoord.out.bam

# Generate stats for all samples
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo "--- $bam ---"
    samtools flagstat $bam | head -5
    echo ""
done

Part 5: Visualizing Aligned Reads

View in IGV (Integrative Genomics Viewer)

# Launch IGV (if installed via conda)
igv &

# In IGV:
# 1. Genomes > Load Genome from File > references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# 2. File > Load from File > references/references/Homo_sapiens.GRCh38.110.chr11.gtf
# 3. File > Load from File > alignment/KO_1_Aligned.sortedByCoord.out.bam
# 4. Navigate to a gene: e.g., search for "MDK" (Midkine gene on chr11)
# 5. Check reads mapping to each exon in "Midkine" gene
# 6. Load BAM file for a WT sample and see if there are any differences with the KO sample above
            

Alternative Approach: use samtools to investigate BAM files

# View reads at a specific position (add chr:start-end to samtools command)
# MDK gene is approximately at chr11:46380756–46383837
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837 | head -5

# Count reads in the MDK gene region
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837

# Compare expression between conditions
echo "=== Reads in MDK gene region ==="
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo $bam
    samtools view -c $bam chr11:46380756–46383837
done

Part 6: Transcript Quantification with Salmon

After alignment, we use Salmon to quantify transcript abundance. Salmon provides accurate counts that work well with DESeq2.

Why Salmon?

  • Fast: Quasi-mapping is much faster than traditional alignment
  • Accurate: Handles multi-mapping reads and transcript isoforms well
  • Bias correction: Built-in GC and sequence bias correction
  • Compatible: Output works directly with DESeq2 via tximport

Build Salmon transcriptome index

cd ~/genomics

# Build Salmon index from chr11 transcriptome
salmon index \
    -t references/transcriptome_chr11.fa \
    -i references/salmon_index \
    --threads 4

# Check the index
ls -lh references/salmon_index/
Transcriptome vs Genome

Salmon uses a transcriptome reference (cDNA sequences) rather than the genome. This is faster and more appropriate for RNA-seq quantification.

Quantify all samples with Salmon

cd ~/genomics

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    # Extract short name (KO_1, KO_2, etc.)
    SHORT_NAME=$(echo $SAMPLE | cut -d'_' -f1,2)
    echo "Quantifying $SHORT_NAME..."

    salmon quant \
        -i references/salmon_index \
        -l A \
        -1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
        -2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        -o salmon_quant/${SHORT_NAME} \
        --threads 4 \
        --validateMappings \
        --gcBias \
        --seqBias \
        2>> logs/salmon.log
done

echo "Quantification complete!"
ls salmon_quant/
Salmon Parameters
  • -l A: Auto-detect library type
  • --validateMappings: More accurate mapping
  • --gcBias: Correct for GC content bias
  • --seqBias: Correct for sequence-specific bias

Explore Salmon output

# View the quantification file
head -20 salmon_quant/KO_1/quant.sf

# The columns are:
# Name: Transcript ID
# Length: Transcript length
# EffectiveLength: Length adjusted for bias
# TPM: Transcripts Per Million
# NumReads: Estimated read count

# Check mapping rate
grep "Mapping rate" salmon_quant/KO_1/logs/salmon_quant.log

# Compare mapping rates across samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
    sample=$(basename $dir)
    rate=$(grep "Mapping rate" $dir/logs/salmon_quant.log | awk '{print $NF}')
    echo "$sample: $rate"
done

Part 7: Aggregate to Gene Level with tximport

Salmon provides transcript-level counts. We use tximport in R to aggregate these to gene level for DESeq2.

Download the tximport script

We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.

cd ~/genomics

# Install required R packages (if not already installed)
Rscript -e 'install.packages(c("tximport", "optparse"), repos="https://cloud.r-project.org")'

# Create scripts directory
mkdir -p scripts

# Download the tximport script from GitHub
wget -O scripts/run_tximport.R \
    https://raw.githubusercontent.com/kaust-academy/academy-stage3-2026/main/scripts/run_tximport.R

# Make it executable (optional)
chmod +x scripts/run_tximport.R

# View script help
Rscript scripts/run_tximport.R --help
Script Features
  • Automatically generates tx2gene mapping from your GTF file
  • Auto-detects sample directories in salmon_quant folder
  • Outputs count matrices, TPM values, and sample info
  • Creates tximport.rds object ready for DESeq2

Run tximport

cd ~/genomics

# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
    --gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --salmon_dir salmon_quant \
    --outdir counts

# Check the output files
ls -lh counts/
Script Parameters
  • --gtf: Path to GTF annotation file (used to generate tx2gene mapping)
  • --salmon_dir: Directory containing Salmon output (one subdirectory per sample)
  • --outdir: Output directory for count matrices
  • --tx2gene: (Optional) Use existing tx2gene.tsv instead of generating from GTF
  • --samples: (Optional) Comma-separated sample list (auto-detected if not provided)

Explore the expression matrix

cd ~/genomics

# View the first few genes in the count matrix
head -20 counts/gene_counts.tsv | column -t

# How many genes have counts?
wc -l counts/gene_counts.tsv

# View TPM values (normalized)
head -20 counts/gene_tpm.tsv | column -t

# View sample info (auto-generated)
cat counts/sample_info.tsv

# Open the count matrix in MS Excel and get familiar with the content
Ready for Differential Expression

You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:

  • gene_counts.tsv - Raw counts for DESeq2
  • gene_tpm.tsv - TPM values for visualization
  • sample_info.tsv - Sample metadata
  • tx2gene.tsv - Transcript-to-gene mapping
  • tximport.rds - R object for DESeq2

Exercises

Exercise 1: Alignment Analysis

  1. Which sample has the highest mapping rate? Which has the lowest?
  2. How many reads map to multiple locations? Why might this occur?
  3. Why did we use--sjdbOverhang 69 in building the STAR index? Hint: use STAR manual to answer.

Exercise 2: BAM Exploration

Use samtools:

# How many reads have mapping quality >= 30?
samtools view -c -q 30 alignment/KO_1_Aligned.sortedByCoord.out.bam

# How many reads are on the forward strand?
samtools view -c -F 16 alignment/KO_1_Aligned.sortedByCoord.out.bam

# Extract reads from a specific region to a new BAM
samtools view -b alignment/KO_1_Aligned.sortedByCoord.out.bam 11:1000000-2000000 > test_region.bam
Question: Which condition (WT vs. KO) have more reads in "TTC17" gene?

Summary

In this lab, you have:

  • Built a STAR genome index for chromosome 11
  • Aligned RNA-seq reads using STAR with splice-aware alignment
  • Interpreted alignment logs and metrics
  • Used samtools to index, view, and analyze BAM files
  • Explored read coverage at specific genomic regions
  • Quantified transcripts with Salmon
  • Aggregated transcript counts to gene level with tximport

Next: Lab 5 - Differential Expression Analysis