Lab 4: Read Alignment with STAR

Part 1Setup Part 2Reference & STAR Index Part 3STAR Alignment Part 4BAM Format Part 5BAM Post-processing Part 6Alignment QC Part 7IGV Visualization

In this lab, you will align the trimmed chromosome 11-filtered paired-end reads from GSE136366 (a TDP-43 RNA-seq study in human cells) to the chromosome 11 reference using STAR. You will then post-process and index the resulting BAM files, assess alignment quality, and visualize the alignments in IGV.

Using chr11 data in this lab

The input files (trimmed/${SRR}_chr11_trimmed_1/2.fastq.gz) were produced in Lab 3 from the chr11-filtered FASTQs (~3–4 M reads per sample). Because we align to a single-chromosome reference, the STAR index takes only ~4 GB of RAM (vs ~32 GB for the full genome) and alignment completes in under 1 minute per sample. All commands are identical to a full-scale analysis — only the reference scope differs. From Lab 6 onwards, the full GRCh38 dataset and genome are used via nf-core.

Learning Objectives

Part 1: Setup

Connect to Ibex, start an interactive session with enough resources for alignment, and load the required tools.

Connect to Ibex and start an interactive session

# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive session with 8 CPUs and 8 GB RAM
# (chr11 STAR index is ~4 GB; never run analyses on the login node)
srun --cpus-per-task=8 --mem=8G --time=01:00:00 --pty /bin/bash
Why only 8 GB RAM?

STAR loads the entire genome index into memory during alignment. A chromosome 11-only index requires only ~4 GB of RAM (versus ~32 GB for the full GRCh38 genome), making it practical to run within a standard interactive session. Alignment of ~3–4 M reads completes in under 1 minute per sample.

Load required modules

# Load alignment and BAM processing tools
module load star
module load samtools
module load picard

# Verify that all tools are loaded
module list

# Check versions
STAR --version
samtools --version | head -1
picard MarkDuplicates --version 2>&1 | head -1

Navigate to your workshop directory and create the output folder

# Navigate to your workshop directory on scratch
cd /ibex/user/$USER/workshop

# Create the directory for alignment outputs
mkdir -p aligned logs

# Confirm your directory structure
ls -la
Tip: Expected Directory Structure

By this point your workshop directory should contain:

/ibex/user/$USER/workshop/
├── raw_data/          # original full SRA FASTQs and SRR_Acc_List.txt
├── chr11_raw_data/    # chr11 FASTQs from Zenodo (KO/WT naming)
├── trimmed/           # fastp-trimmed chr11 FASTQs (*_trimmed_1/2.fastq.gz) from Lab 3
├── references/        # chr11 FASTA, GTF, and STAR index
├── aligned/           # will be created now
└── logs/              # SLURM log files

Part 2: Get the Chr11 Reference Files and Build the STAR Index

STAR requires a pre-built genome index that combines the reference FASTA and the gene annotation (GTF). For this lab we use a chromosome 11-only reference, which builds in under 5 minutes and requires only ~4 GB of RAM — suitable for an interactive session.

Copy the chr11 reference files from the shared course directory

cd /ibex/user/$USER/workshop

SHARED=/biocorelab/BIX/resources/genomes/workshops/human-chr11
mkdir -p references/chr11

# Chr11 genome FASTA and GTF
cp ${SHARED}/GRCh38.dna.chromosome.11.fa   references/chr11/
cp ${SHARED}/GRCh38.chr11.gtf  references/chr11/
cp ${SHARED}/GRCh38.cdna.chr11.fa  references/chr11/

# Verify
ls -lh references/chr11/
What these files contain

GRCh38.dna.chromosome.11.fa — the chromosome 11 DNA sequence only (~135 MB, vs ~3 GB for the full genome). GRCh38.chr11.gtf — gene annotations restricted to chromosome 11 (~3 MB). GRCh38.cdna.chr11.fa — chr11 transcript sequences for the Salmon index. Using a single-chromosome reference means STAR only needs ~4 GB RAM for the index and alignment is very fast.

Build the STAR index for chr11 (interactive)

cd /ibex/user/$USER/workshop

mkdir -p references/star_index_chr11

STAR --runMode genomeGenerate \
    --genomeDir references/star_index_chr11 \
    --genomeFastaFiles references/chr11/GRCh38.chr11.fa \
    --sjdbGTFfile references/chr11/GRCh38.chr11.gtf \
    --runThreadN 8 \
    --genomeSAindexNbases 11
Key STAR Index Parameters
ParameterValueMeaning
--runMode genomeGenerateTells STAR to build an index rather than align reads
--genomeDirreferences/star_index_chr11Output directory for the index files
--genomeFastaFilesGRCh38.chr11.faSingle-chromosome FASTA
--sjdbGTFfileGRCh38.chr11.gtfAnnotation used to add known splice junctions to the index
--genomeSAindexNbases11Reduced from 14 (full genome) to 11 because the reference is a single chromosome — STAR recommends min(14, log2(GenomeLength)/2 - 1)
--runThreadN8Number of CPU threads to use

Index building takes approximately 3–5 minutes. Verify it completed:

ls -lh references/star_index_chr11/
# Expect: Genome, SA, SAindex, chrName.txt, genomeParameters.txt, etc.

Part 3: Align Reads with STAR

With the chr11 index ready, align all samples interactively. Each sample (~3–4 M reads against a single-chromosome index) finishes in under 1 minute.

Run STAR alignment for all samples

cd /ibex/user/$USER/workshop

for r1 in trimmed/*_trimmed_1.fastq.gz; do
    sample=$(basename "$r1" _trimmed_1.fastq.gz)
    echo "=== Aligning $sample ==="
    mkdir -p aligned/${sample}

    STAR --runThreadN 8 \
        --genomeDir references/star_index_chr11 \
        --readFilesIn trimmed/${sample}_trimmed_1.fastq.gz trimmed/${sample}_trimmed_2.fastq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes NH HI AS NM MD \
        --outFileNamePrefix aligned/${sample}/ \
        --outSAMstrandField intronMotif \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --quantMode GeneCounts \
        --runMode alignReads

    echo "Done: $sample"
done
Expected runtime

Each chr11 sample (~3–4 M read pairs against a single-chromosome index) takes under 1 minute. The loop processes all 6 samples in about 5 minutes total. STAR prints progress to the terminal; look for ... done mapping and the appearance of Aligned.sortedByCoord.out.bam to confirm success for each sample.

Understand the key STAR alignment parameters

STAR Parameter Reference
ParameterValue UsedExplanation
--readFilesCommand zcatzcatTells STAR to decompress gzipped FASTQ files on the fly
--outSAMtype BAM SortedByCoordinateOutput a coordinate-sorted BAM file directly (no need for a separate sort step)
--outSAMattributes NH HI AS NM MDTags added to each read: NH=number of hits, HI=hit index, AS=alignment score, NM=mismatches, MD=mismatch positions
--outSAMstrandField intronMotifAdds strand information based on intron splice-site motifs; required for unstranded libraries
--alignSJoverhangMin8Minimum overhang on each side of a novel splice junction
--alignSJDBoverhangMin1Minimum overhang for annotated (database) splice junctions
--quantMode GeneCountsOutputs a per-gene read count table alongside the BAM (useful for downstream DEG analysis)

Check expected output files for each sample

# Check the output directory for one sample
ls -lh aligned/KO_1_SRR10045016/
STAR Output Files
FileDescription
Aligned.sortedByCoord.out.bamCoordinate-sorted BAM file with aligned reads
Log.final.outSummary alignment statistics (mapping rate, multi-mappers, etc.)
Log.outDetailed STAR run log
Log.progress.outReal-time progress during alignment
ReadsPerGene.out.tabPer-gene read count table (from --quantMode GeneCounts)
SJ.out.tabSplice junctions detected in this sample

Part 4: BAM Format Overview

Before post-processing your alignments, take a moment to understand what a BAM file actually contains. BAM (Binary Alignment Map) is the compressed binary version of SAM (Sequence Alignment Map) — it is the standard output of alignment tools like STAR and the central file format for all downstream analysis.

SAM/BAM Structure
  • Header section (lines starting with @): @HD (format/sort order), @SQ (reference sequence names and lengths), @RG (read group metadata), @PG (programs and command lines used)
  • Alignment section: one line per read, 11 mandatory tab-separated fields — read name, FLAG, chromosome, position (1-based), MAPQ, CIGAR string, mate chromosome, mate position, insert size, read sequence, and base qualities
  • CIGAR string: encodes the alignment operation per base — M = match/mismatch, N = intron skip (splice junction), I/D = insertion/deletion, S = soft clip, H = hard clip
  • FLAG field: a bitwise integer encoding read properties — paired, properly paired, unmapped, mate unmapped, reverse strand, first/second in pair, secondary/supplementary alignment, duplicate
  • Optional tags: tool-specific key:type:value fields appended after the mandatory columns — e.g. NH:i:1 (number of alignments), AS:i:... (alignment score), MD:Z:... (mismatch positions)

Inspect a BAM file from your alignment output

Use the BAM you produced in Part 3 to explore the format directly.

cd /ibex/user/$USER/workshop

BAM=aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam

# View the full BAM header (@HD, @SQ, @PG lines)
samtools view -H $BAM

# View the first 5 alignment records in SAM text format
samtools view $BAM | head -5

# Summary alignment statistics
samtools flagstat $BAM

# Count all reads in the BAM
samtools view -c $BAM

# Count only mapped reads (exclude FLAG bit 0x4 = unmapped)
samtools view -c -F 4 $BAM
Decode any FLAG value

Use the Broad Institute's interactive FLAG decoder: broadinstitute.github.io/picard/explain-flags.html — paste any integer FLAG and it lists what each bit means.

Examine the CIGAR string and optional tags

Pick one alignment line and decode its CIGAR string to understand how the read spans a splice junction.

# Show one alignment record with all optional tags, formatted for readability
samtools view $BAM | head -1 | tr '\t' '\n' | cat -n

# Find reads that span a splice junction (N in CIGAR = intron skip)
samtools view $BAM | awk '$6 ~ /N/' | head -3
Reading a CIGAR string

A CIGAR like 50M400N50M means: 50 bases match the genome, then 400 bases are skipped (an intron), then 50 more bases match. This is how STAR encodes a read that spans a splice junction. The total length of M + I + S + = + X operations must equal the read length.

Part 5: BAM Post-processing

After alignment, BAM files need to be indexed for fast random access, and PCR duplicates should be flagged using Picard MarkDuplicates. These steps operate on the chr11 BAMs produced in Part 3 — the commands are identical to a full-scale workflow.

Index the BAM files with samtools

A BAM index (.bai file) allows tools like IGV and samtools to jump directly to any genomic region without reading the entire file.

cd /ibex/user/$USER/workshop

# Index the BAM file for every sample
for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
    echo "Indexing $bam ..."
    samtools index "$bam"
done

# Verify — each BAM should now have a matching .bai file
ls -lh aligned/KO_1_SRR10045016/

Mark PCR duplicates with Picard

for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
    sample=$(basename $(dirname "$bam"))
    echo "Marking duplicates for $sample ..."
    picard MarkDuplicates \
        I=aligned/${sample}/Aligned.sortedByCoord.out.bam \
        O=aligned/${sample}/marked_dup.bam \
        M=aligned/${sample}/dup_metrics.txt
    samtools index aligned/${sample}/marked_dup.bam
done
What is Duplicate Marking and Why Does It Matter?

During library preparation, PCR amplification can create multiple identical copies of the same original DNA molecule. These duplicates are not independent observations of gene expression — they are technical artifacts. Picard MarkDuplicates identifies read pairs that align to exactly the same position and flags them in the BAM file (it does not delete them by default). Downstream tools like GATK and many DEG pipelines can then ignore flagged duplicates.

Key output: dup_metrics.txt reports the duplication rate. For RNA-seq libraries, rates of 10–40% are common. Very high rates (>60%) may indicate poor library complexity. With chr11-only data the ESTIMATED_LIBRARY_SIZE will appear smaller than a full-dataset run — this is expected.

Inspect the Picard duplication metrics

# View the duplication report for one sample
cat aligned/KO_1_SRR10045016/dup_metrics.txt
Tip: Reading Picard MarkDuplicates Output

The key metric to check is PERCENT_DUPLICATION in the metrics file. Also note ESTIMATED_LIBRARY_SIZE — a low value suggests the library was sequenced to near-saturation. With chr11-only data this estimate will be lower than a full-genome run; the duplication rate itself remains a meaningful quality indicator.

Part 6: Alignment Quality Control

Before proceeding to quantification, assess the quality of the alignments to make sure the data is suitable for downstream analysis. These steps work directly on the chr11 BAMs — alignment stats and log inspection complete in seconds.

Run samtools flagstat on all samples

cd /ibex/user/$USER/workshop

# Print flagstat output for every sample
for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
    sample=$(basename $(dirname "$bam"))
    echo "=== $sample ==="
    samtools flagstat "$bam"
    echo ""
done
Key flagstat Fields to Check
FieldWhat to Look For
totalTotal reads in the BAM; with chr11 data expect ~6–8 M (3–4 M read pairs × 2 reads each)
mappedNumber and % of reads that aligned; aim for >80%
properly pairedBoth reads in a pair aligned concordantly; should be high (>90% of mapped)
singletonsReads whose mate did not map; a few percent is normal, high values may indicate quality issues

Inspect the STAR alignment summary log

# View the STAR summary log for the first sample
cat aligned/KO_1_SRR10045016/Log.final.out

# Loop to print key stats for all samples
for log in aligned/*/Log.final.out; do
    sample=$(basename $(dirname "$log"))
    echo "=== $sample ==="
    grep -E "Uniquely mapped|mapped to multiple|unmapped" "$log"
    echo ""
done
STAR Log.final.out Field Reference
FieldExplanation
Number of input readsTotal read pairs submitted to STAR
Uniquely mapped reads number / %Reads with exactly one best alignment location — the gold standard metric
% of reads mapped to multiple lociMulti-mapping reads; typically 5–15% for RNA-seq
% of reads unmapped: too shortReads too short after soft-clipping to align; may indicate adapter contamination or aggressive trimming
% of reads unmapped: too many mismatchesReads with too many mismatches; may indicate wrong reference genome or RNA editing
% of reads unmapped: otherMiscellaneous unmapped reads
Number of splices: TotalTotal number of splice junctions observed
Number of splices: Annotated sjdbSplices matching known junctions from the GTF
Insertion / Deletion average lengthAverage indel size; should be small for short-read RNA-seq
Mismatch rate per base, %Base error rate; typically <1% for good Illumina data
Warning: What to Do If the Mapping Rate is Low

If uniquely mapped reads fall below 70%, investigate the following:

  • Wrong reference genome: Confirm you are using the correct species and genome build (GRCh38 for human).
  • Adapter contamination: Check FastQC reports from Lab 3. Run fastp again with more aggressive adapter trimming if needed.
  • rRNA contamination: A large fraction of unmapped reads may be ribosomal RNA not depleted during library preparation.
  • Incorrect library type: Check whether reads are stranded or unstranded; adjust --outSAMstrandField accordingly.
  • Short reads after trimming: If many reads are flagged as "too short", relax the minimum length threshold in fastp.
Unique vs. Multi-mapping Reads

Uniquely mapped reads align to exactly one location in the genome. They are the most reliable for quantification because their genomic origin is unambiguous.

Multi-mapping reads align equally well to two or more locations (e.g., reads from gene families or repeated elements). Different tools handle them differently: some discard them, some distribute them probabilistically. STAR outputs them in the BAM with an NH tag > 1. For standard differential expression analysis with tools like DESeq2, it is common to use only uniquely mapped reads.

Part 7: [Optional] Visualize Alignments in IGV

The Integrative Genomics Viewer (IGV) lets you visually inspect read alignments at any genomic locus — essential for sanity-checking results and exploring specific genes of interest. Because these BAMs contain chr11-only reads, coverage is lower than a full-genome dataset but gene structure, splice junctions, and strand orientation are clearly visible for chr11 genes.

To run IGV locally, copy BAM files to your local machine or mount Ibex on your local machine

# Run these commands from your LOCAL terminal (not Ibex)
# Replace "username" and "SRRxxxxxxxx" with real values

scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam ./
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam.bai ./
Tip: Subsampled BAMs are small and fast to transfer

A chr11 BAM (~3–4 M read pairs) is typically 100–300 MB, so scp to your laptop takes under a minute. A full-dataset BAM can be 5–20 GB, where Option A (Open OnDemand) is strongly preferred.

Download IGV from https://igv.org/doc/desktop/ if not already installed.

Load the genome and BAM file in IGV

  1. In IGV, select Genomes → Load Genome from Server and choose Human (GRCh38/hg38)
  2. Select File → Load from File and open the BAM file (Aligned.sortedByCoord.out.bam)
  3. In the search box at the top, type GAPDH and press Enter to navigate to that gene

Navigate to the TARDBP gene locus

Since this dataset is a TDP-43 (TARDBP) study, navigate to the gene of interest:

# Type this gene name in the IGV search box to navigate:
TARDBP

# You can also navigate by coordinates:
# chr1:11,012,655-11,045,430  (GRCh38)
What to Look for in IGV
  • Coverage track: The histogram at the top of the alignment track shows read depth per position. Exons should show peaks of coverage; introns should be nearly empty. With chr11-only data, depth at expressed genes is still meaningful (~20–100× for moderately expressed genes on chr11).
  • Splice junctions: Curved arcs connecting exons indicate reads that span splice junctions. You should see arcs connecting TARDBP exons.
  • Strand orientation: Read colors indicate which strand a read aligns to. For a stranded library, reads on each gene should be predominantly one color.
  • Mismatches: Colored bars within reads indicate mismatches to the reference. A few scattered mismatches are normal; clustering may indicate a variant or sequencing error.
Tip: Useful IGV Keyboard Shortcuts
  • + / - — zoom in and out
  • Ctrl+F — jump forward along the chromosome
  • Right-click on a read — see full alignment details, flags, and tags
  • Right-click on the track — change color scheme, squish/expand reads, sort by various criteria

Exercises

Use the commands and output files from this lab to answer the following questions.

Exercise 1: Unique Mapping Rate

What percentage of reads mapped uniquely for each sample? Read the STAR Log.final.out files for all samples and compare the results.

Hint: Use the loop from Part 6, Step 2 to grep the "Uniquely mapped reads %" field from each sample's log. Are there any samples with notably lower mapping rates?

Exercise 2: Duplication Rate

What is the PCR duplication rate reported by Picard MarkDuplicates for each sample? Look at the dup_metrics.txt files.

Hint: The key column is PERCENT_DUPLICATION. You can extract it with:

for metrics in aligned/*/dup_metrics.txt; do
    sample=$(basename $(dirname "$metrics"))
    echo -n "$sample: "
    grep -A2 "ESTIMATED_LIBRARY_SIZE" "$metrics" | tail -1 | cut -f9
done

Exercise 3: Properly Paired Reads

For one sample, how many reads are properly paired versus how many are singletons (mate unmapped)? Use samtools flagstat and calculate the percentage of each.

Hint: Run samtools flagstat aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam and look at the "properly paired" and "singletons" lines. Divide each by the total read count to get percentages.

Exercise 4: TARDBP Coverage in IGV

Navigate to the TARDBP gene locus in IGV (chromosome 1). Describe what you observe: How many exons are covered? Do you see evidence of alternative splicing (multiple arc patterns)? Is coverage uniform across exons?

Hint: TARDBP encodes TDP-43, the key protein in this study (GSE136366). Compare a control sample versus a TDP-43 knockdown sample if available. Look for differences in exon coverage and splice-junction arcs.

Summary

In this lab, you have:

Previous: Lab 2: Public Data Retrieval  —  Next: Lab 5: Gene Quantification with Salmon