Lab 4a: Genome Alignment (STAR + Salmon)

Choose Your Path

This lab uses genome alignment with STAR followed by Salmon quantification. Choose this approach when you need:

BAM files for visualization in IGV
Read-level information (splice junctions, mutations, etc.)
Integration with variant calling or other genome-based analyses

If you only need gene expression counts and want faster processing, use Lab 4b: Pseudo-alignment (Salmon only) instead.

Learning Objectives

Build a STAR genome index for chromosome 11
Align RNA-seq reads to the reference genome using STAR
Understand BAM file format and alignment statistics
Use samtools to inspect and manipulate BAM files
Quantify transcripts with Salmon and aggregate to gene level

Why STAR for RNA-seq Alignment?

STAR (Spliced Transcripts Alignment to a Reference)

STAR is the gold-standard aligner for RNA-seq data because it:

Handles splicing: Can align reads across exon-exon junctions
Fast: Uses uncompressed suffix arrays for rapid alignment
Accurate: High mapping accuracy even for novel splice junctions
Flexible: Outputs BAM files compatible with downstream tools

Memory Requirements

STAR requires significant RAM (~32 GB for full human genome). Using chromosome 11 only reduces this to ~4 GB, making it suitable for laptops.

STAR has many parameters and options, always revisit STAR manual (current version 2.7.11b) for details

Check Genome Reference Files

Make sure the steps in Lab 1 Part 3-5 were successful, Check contents of the reference folderls -lh ~/genomics/references

Part 1: Build STAR Genome Index

Before alignment, STAR requires a pre-built genome index.

Create the STAR index

cd ~/genomics

# Activate environment
conda activate genomics

# Create directory for STAR index
mkdir -p references/star_index

# Build STAR index for chromosome 11
# This takes 5-10 minutes and ~4 GB RAM
STAR --runMode genomeGenerate \
    --runThreadN 4 \
    --genomeDir references/star_index \
    --genomeFastaFiles references/Homo_sapiens.GRCh38.dna.chromosome.11.fa \
    --sjdbGTFfile references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --sjdbOverhang 69 \
    --genomeSAindexNbases 11

# Check the generated index files
ls -lh references/star_index/

Key Parameters

--sjdbOverhang 69: ReadLength - 1 (our reads are 70 bp)
--genomeSAindexNbases 11: Reduced for small genomes (chr11 only)

Part 2: Align Reads with STAR

Align one sample

cd ~/genomics

# Align KO_1 sample
STAR --runThreadN 4 \
    --genomeDir references/star_index \
    --readFilesIn trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
                  trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    --outFileNamePrefix alignment/KO_1_ \
    --outSAMattributes Standard \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --quantMode GeneCounts

# Check output files
ls -lh alignment/KO_1_*

Download aligned chr11 BAM file

If you have issues running STAR,please download this zip file to get pre-aligned BAM files for all samples (chr11)https://drive.google.com/drive/folders/1VHhugmCIAog3HgoTaODQWS_8rLVObIFo?usp=sharing

Troubleshooting

--readFilesCommand zcat: You might need to change this depending what extraction method is available on your system
In case STAT does not finish normally, Review alignment/sample_Log.out for information. On Mac it might have issues with the CPU architecture
The --sjdbGTFfile option must have been used during star_index generation when --quantMode GeneCounts is enabled.

Align all samples

cd ~/genomics

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Align each sample
for SAMPLE in $SAMPLES; do
    echo "Aligning $SAMPLE..."

    STAR --runThreadN 4 \
        --genomeDir references/star_index \
        --readFilesIn trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
                      trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        --outFileNamePrefix alignment/${SAMPLE}_ \
        --outSAMattributes Standard \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --quantMode GeneCounts
        2>> logs/star.log
done

echo "Alignment complete!"
ls -lh alignment/*.bam

Part 3: Understanding STAR Output

Review alignment log

cd ~/genomics

# View the alignment summary (Log.final.out)
cat alignment/KO_1_Log.final.out

# Key metrics to look for:
# - Uniquely mapped reads %: Should be >80% for good data
# - % of reads mapped to multiple loci: High values may indicate repetitive sequences
# - % of reads unmapped: Should be relatively low

Expected Alignment Metrics

Metric	Good	Acceptable	Investigate
Uniquely mapped %	>80%	70-80%	<70%
Multi-mapped %	<10%	10-20%	>20%
Unmapped %	<5%	5-15%	>15%

Compare alignment rates across samples

Go through each file manually, see alignment/sampleName_Log.final.out and look for Sample,Input_Reads,Uniquely_Mapped,Unique%,Multi_Mapped,Unmapped rows

Part 4: Working with BAM Files

Index BAM files with samtools

cd ~/genomics/

# Index all BAM files (required for IGV and many tools)
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo "Indexing $bam..."
    samtools index $bam
done

# Verify index files were created
ls -lh alignment/*.bai

Explore BAM file structure

# View BAM header
samtools view -H alignment/KO_1_Aligned.sortedByCoord.out.bam | head -20

# View first 5 aligned reads
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam | head -5

# Count total aligned reads
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam

# Count properly paired reads
samtools view -c -f 2 alignment/KO_1_Aligned.sortedByCoord.out.bam

BAM Format - Key Columns

Col	Field	Description
1	QNAME	Read name
2	FLAG	Bitwise flag (paired, mapped, etc.)
3	RNAME	Reference chromosome
4	POS	Mapping position
5	MAPQ	Mapping quality
6	CIGAR	Alignment description
10	SEQ	Read sequence
11	QUAL	Quality scores

Samtools flags

To know what flag to use when inspecting BAM files using samtools, visit Explain flags check the desired boxes and use the resulting number from the top with -f or -F flags in samtools

Generate alignment statistics

# Detailed statistics with samtools flagstat
samtools flagstat alignment/KO_1_Aligned.sortedByCoord.out.bam

# Generate stats for all samples
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo "--- $bam ---"
    samtools flagstat $bam | head -5
    echo ""
done

Part 5: Visualizing Aligned Reads

View in IGV (Integrative Genomics Viewer)

# Launch IGV (if installed via conda)
igv &

# In IGV:
# 1. Genomes > Load Genome from File > references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# 2. File > Load from File > references/references/Homo_sapiens.GRCh38.110.chr11.gtf
# 3. File > Load from File > alignment/KO_1_Aligned.sortedByCoord.out.bam
# 4. Navigate to a gene: e.g., search for "MDK" (Midkine gene on chr11)
# 5. Check reads mapping to each exon in "Midkine" gene
# 6. Load BAM file for a WT sample and see if there are any differences with the KO sample above



            
                Alternative Approach: use samtools to investigate BAM files
                # View reads at a specific position (add chr:start-end to samtools command)
# MDK gene is approximately at chr11:46380756–46383837
samtools view alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837 | head -5

# Count reads in the MDK gene region
samtools view -c alignment/KO_1_Aligned.sortedByCoord.out.bam chr11:46380756–46383837

# Compare expression between conditions
echo "=== Reads in MDK gene region ==="
for bam in alignment/*_Aligned.sortedByCoord.out.bam; do
    echo $bam
    samtools view -c $bam chr11:46380756–46383837
done



        
            Part 6: Transcript Quantification with Salmon
            After alignment, we use Salmon to quantify transcript abundance. Salmon provides accurate counts that work well with DESeq2.

            
                Why Salmon?
                
                    Fast: Quasi-mapping is much faster than traditional alignment
                    Accurate: Handles multi-mapping reads and transcript isoforms well
                    Bias correction: Built-in GC and sequence bias correction
                    Compatible: Output works directly with DESeq2 via tximport
                
            

            
                Build Salmon transcriptome index
                cd ~/genomics

# Build Salmon index from chr11 transcriptome
salmon index \
    -t references/transcriptome_chr11.fa \
    -i references/salmon_index \
    --threads 4

# Check the index
ls -lh references/salmon_index/

                
                    Transcriptome vs Genome
                    Salmon uses a transcriptome reference (cDNA sequences) rather than the genome. This is faster and more appropriate for RNA-seq quantification.
                
            

            
                Quantify all samples with Salmon
                cd ~/genomics

# Define samples
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    # Extract short name (KO_1, KO_2, etc.)
    SHORT_NAME=$(echo $SAMPLE | cut -d'_' -f1,2)
    echo "Quantifying $SHORT_NAME..."

    salmon quant \
        -i references/salmon_index \
        -l A \
        -1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
        -2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        -o salmon_quant/${SHORT_NAME} \
        --threads 4 \
        --validateMappings \
        --gcBias \
        --seqBias \
        2>> logs/salmon.log
done

echo "Quantification complete!"
ls salmon_quant/

                
                    Salmon Parameters
                    
                        -l A: Auto-detect library type
                        --validateMappings: More accurate mapping
                        --gcBias: Correct for GC content bias
                        --seqBias: Correct for sequence-specific bias
                    
                
            

            
                Explore Salmon output
                # View the quantification file
head -20 salmon_quant/KO_1/quant.sf

# The columns are:
# Name: Transcript ID
# Length: Transcript length
# EffectiveLength: Length adjusted for bias
# TPM: Transcripts Per Million
# NumReads: Estimated read count

# Check mapping rate
grep "Mapping rate" salmon_quant/KO_1/logs/salmon_quant.log

# Compare mapping rates across samples
echo "=== Salmon Mapping Rates ==="
for dir in salmon_quant/*/; do
    sample=$(basename $dir)
    rate=$(grep "Mapping rate" $dir/logs/salmon_quant.log | awk '{print $NF}')
    echo "$sample: $rate"
done
            
        

        
            Part 7: Aggregate to Gene Level with tximport
            Salmon provides transcript-level counts. We use tximport in R to aggregate these to gene level for DESeq2.

            
                Download the tximport script
                We provide a reusable R script that handles both tx2gene generation from GTF and tximport aggregation.
                cd ~/genomics

# Install required R packages (if not already installed)
Rscript -e 'install.packages(c("tximport", "optparse"), repos="https://cloud.r-project.org")'

# Create scripts directory
mkdir -p scripts

# Download the tximport script from GitHub
wget -O scripts/run_tximport.R \
    https://raw.githubusercontent.com/kaust-academy/academy-stage3-2026/main/scripts/run_tximport.R

# Make it executable (optional)
chmod +x scripts/run_tximport.R

# View script help
Rscript scripts/run_tximport.R --help

                
                    Script Features
                    
                        Automatically generates tx2gene mapping from your GTF file
                        Auto-detects sample directories in salmon_quant folder
                        Outputs count matrices, TPM values, and sample info
                        Creates tximport.rds object ready for DESeq2
                    
                
            

            
                Run tximport
                cd ~/genomics

# Run the tximport script with your GTF and Salmon output
Rscript scripts/run_tximport.R \
    --gtf references/Homo_sapiens.GRCh38.110.chr11.gtf \
    --salmon_dir salmon_quant \
    --outdir counts

# Check the output files
ls -lh counts/

                
                    Script Parameters
                    
                        --gtf: Path to GTF annotation file (used to generate tx2gene mapping)
                        --salmon_dir: Directory containing Salmon output (one subdirectory per sample)
                        --outdir: Output directory for count matrices
                        --tx2gene: (Optional) Use existing tx2gene.tsv instead of generating from GTF
                        --samples: (Optional) Comma-separated sample list (auto-detected if not provided)
                    
                
            

            
                Explore the expression matrix
                cd ~/genomics

# View the first few genes in the count matrix
head -20 counts/gene_counts.tsv | column -t

# How many genes have counts?
wc -l counts/gene_counts.tsv

# View TPM values (normalized)
head -20 counts/gene_tpm.tsv | column -t

# View sample info (auto-generated)
cat counts/sample_info.tsv

# Open the count matrix in MS Excel and get familiar with the content

                
                    Ready for Differential Expression
                    You now have gene-level count matrices ready for DESeq2 analysis in Lab 5. The script generated:
                    
                        gene_counts.tsv - Raw counts for DESeq2
                        gene_tpm.tsv - TPM values for visualization
                        sample_info.tsv - Sample metadata
                        tx2gene.tsv - Transcript-to-gene mapping
                        tximport.rds - R object for DESeq2
                    
                
            
        

        
            Exercises
            
                Exercise 1: Alignment Analysis
                
                    Which sample has the highest mapping rate? Which has the lowest?
                    How many reads map to multiple locations? Why might this occur?
                    Why did we use--sjdbOverhang 69 in building the STAR index? Hint: use STAR manual to answer.
                
            

            
                Exercise 2: BAM Exploration
                Use samtools:
                # How many reads have mapping quality >= 30?
samtools view -c -q 30 alignment/KO_1_Aligned.sortedByCoord.out.bam

# How many reads are on the forward strand?
samtools view -c -F 16 alignment/KO_1_Aligned.sortedByCoord.out.bam

# Extract reads from a specific region to a new BAM
samtools view -b alignment/KO_1_Aligned.sortedByCoord.out.bam 11:1000000-2000000 > test_region.bam

Question: Which condition (WT vs. KO) have more reads in "TTC17" gene?
            
        

        
            Summary
            In this lab, you have:
            
                Built a STAR genome index for chromosome 11
                Aligned RNA-seq reads using STAR with splice-aware alignment
                Interpreted alignment logs and metrics
                Used samtools to index, view, and analyze BAM files
                Explored read coverage at specific genomic regions
                Quantified transcripts with Salmon
                Aggregated transcript counts to gene level with tximport
            
            Next: Lab 5 - Differential Expression Analysis