Lab 4: Read Alignment with STAR
In this lab, you will align the trimmed chromosome 11-filtered paired-end reads from GSE136366 (a TDP-43 RNA-seq study in human cells) to the chromosome 11 reference using STAR. You will then post-process and index the resulting BAM files, assess alignment quality, and visualize the alignments in IGV.
The input files (trimmed/${SRR}_chr11_trimmed_1/2.fastq.gz) were produced in Lab 3 from the chr11-filtered FASTQs (~3–4 M reads per sample). Because we align to a single-chromosome reference, the STAR index takes only ~4 GB of RAM (vs ~32 GB for the full genome) and alignment completes in under 1 minute per sample. All commands are identical to a full-scale analysis — only the reference scope differs. From Lab 6 onwards, the full GRCh38 dataset and genome are used via nf-core.
- Build a STAR genome index from the GRCh38 FASTA and GTF annotation
- Align paired-end reads to GRCh38 using STAR with appropriate parameters
- Post-process BAM files: sort, index, and mark PCR duplicates with Picard
- Assess alignment quality using
samtools flagstatand the STAR alignment log - Visualize aligned reads at a gene locus using IGV
Part 1: Setup
Connect to Ibex, start an interactive session with enough resources for alignment, and load the required tools.
Connect to Ibex and start an interactive session
# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive session with 8 CPUs and 8 GB RAM
# (chr11 STAR index is ~4 GB; never run analyses on the login node)
srun --cpus-per-task=8 --mem=8G --time=01:00:00 --pty /bin/bash
STAR loads the entire genome index into memory during alignment. A chromosome 11-only index requires only ~4 GB of RAM (versus ~32 GB for the full GRCh38 genome), making it practical to run within a standard interactive session. Alignment of ~3–4 M reads completes in under 1 minute per sample.
Load required modules
# Load alignment and BAM processing tools
module load star
module load samtools
module load picard
# Verify that all tools are loaded
module list
# Check versions
STAR --version
samtools --version | head -1
picard MarkDuplicates --version 2>&1 | head -1
Navigate to your workshop directory and create the output folder
# Navigate to your workshop directory on scratch
cd /ibex/user/$USER/workshop
# Create the directory for alignment outputs
mkdir -p aligned logs
# Confirm your directory structure
ls -la
By this point your workshop directory should contain:
/ibex/user/$USER/workshop/
├── raw_data/ # original full SRA FASTQs and SRR_Acc_List.txt
├── chr11_raw_data/ # chr11 FASTQs from Zenodo (KO/WT naming)
├── trimmed/ # fastp-trimmed chr11 FASTQs (*_trimmed_1/2.fastq.gz) from Lab 3
├── references/ # chr11 FASTA, GTF, and STAR index
├── aligned/ # will be created now
└── logs/ # SLURM log files
Part 2: Get the Chr11 Reference Files and Build the STAR Index
STAR requires a pre-built genome index that combines the reference FASTA and the gene annotation (GTF). For this lab we use a chromosome 11-only reference, which builds in under 5 minutes and requires only ~4 GB of RAM — suitable for an interactive session.
Copy the chr11 reference files from the shared course directory
cd /ibex/user/$USER/workshop
SHARED=/biocorelab/BIX/resources/genomes/workshops/human-chr11
mkdir -p references/chr11
# Chr11 genome FASTA and GTF
cp ${SHARED}/GRCh38.dna.chromosome.11.fa references/chr11/
cp ${SHARED}/GRCh38.chr11.gtf references/chr11/
cp ${SHARED}/GRCh38.cdna.chr11.fa references/chr11/
# Verify
ls -lh references/chr11/
GRCh38.dna.chromosome.11.fa — the chromosome 11 DNA sequence only (~135 MB, vs ~3 GB for the full genome). GRCh38.chr11.gtf — gene annotations restricted to chromosome 11 (~3 MB). GRCh38.cdna.chr11.fa — chr11 transcript sequences for the Salmon index. Using a single-chromosome reference means STAR only needs ~4 GB RAM for the index and alignment is very fast.
Build the STAR index for chr11 (interactive)
cd /ibex/user/$USER/workshop
mkdir -p references/star_index_chr11
STAR --runMode genomeGenerate \
--genomeDir references/star_index_chr11 \
--genomeFastaFiles references/chr11/GRCh38.chr11.fa \
--sjdbGTFfile references/chr11/GRCh38.chr11.gtf \
--runThreadN 8 \
--genomeSAindexNbases 11
| Parameter | Value | Meaning |
|---|---|---|
--runMode genomeGenerate | Tells STAR to build an index rather than align reads | |
--genomeDir | references/star_index_chr11 | Output directory for the index files |
--genomeFastaFiles | GRCh38.chr11.fa | Single-chromosome FASTA |
--sjdbGTFfile | GRCh38.chr11.gtf | Annotation used to add known splice junctions to the index |
--genomeSAindexNbases | 11 | Reduced from 14 (full genome) to 11 because the reference is a single chromosome — STAR recommends min(14, log2(GenomeLength)/2 - 1) |
--runThreadN | 8 | Number of CPU threads to use |
Index building takes approximately 3–5 minutes. Verify it completed:
ls -lh references/star_index_chr11/
# Expect: Genome, SA, SAindex, chrName.txt, genomeParameters.txt, etc.
Part 3: Align Reads with STAR
With the chr11 index ready, align all samples interactively. Each sample (~3–4 M reads against a single-chromosome index) finishes in under 1 minute.
Run STAR alignment for all samples
cd /ibex/user/$USER/workshop
for r1 in trimmed/*_trimmed_1.fastq.gz; do
sample=$(basename "$r1" _trimmed_1.fastq.gz)
echo "=== Aligning $sample ==="
mkdir -p aligned/${sample}
STAR --runThreadN 8 \
--genomeDir references/star_index_chr11 \
--readFilesIn trimmed/${sample}_trimmed_1.fastq.gz trimmed/${sample}_trimmed_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS NM MD \
--outFileNamePrefix aligned/${sample}/ \
--outSAMstrandField intronMotif \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--quantMode GeneCounts \
--runMode alignReads
echo "Done: $sample"
done
Each chr11 sample (~3–4 M read pairs against a single-chromosome index) takes under 1 minute. The loop processes all 6 samples in about 5 minutes total. STAR prints progress to the terminal; look for ... done mapping and the appearance of Aligned.sortedByCoord.out.bam to confirm success for each sample.
Understand the key STAR alignment parameters
| Parameter | Value Used | Explanation |
|---|---|---|
--readFilesCommand zcat | zcat | Tells STAR to decompress gzipped FASTQ files on the fly |
--outSAMtype BAM SortedByCoordinate | Output a coordinate-sorted BAM file directly (no need for a separate sort step) | |
--outSAMattributes NH HI AS NM MD | Tags added to each read: NH=number of hits, HI=hit index, AS=alignment score, NM=mismatches, MD=mismatch positions | |
--outSAMstrandField intronMotif | Adds strand information based on intron splice-site motifs; required for unstranded libraries | |
--alignSJoverhangMin | 8 | Minimum overhang on each side of a novel splice junction |
--alignSJDBoverhangMin | 1 | Minimum overhang for annotated (database) splice junctions |
--quantMode GeneCounts | Outputs a per-gene read count table alongside the BAM (useful for downstream DEG analysis) |
Check expected output files for each sample
# Check the output directory for one sample
ls -lh aligned/KO_1_SRR10045016/
| File | Description |
|---|---|
Aligned.sortedByCoord.out.bam | Coordinate-sorted BAM file with aligned reads |
Log.final.out | Summary alignment statistics (mapping rate, multi-mappers, etc.) |
Log.out | Detailed STAR run log |
Log.progress.out | Real-time progress during alignment |
ReadsPerGene.out.tab | Per-gene read count table (from --quantMode GeneCounts) |
SJ.out.tab | Splice junctions detected in this sample |
Part 4: BAM Format Overview
Before post-processing your alignments, take a moment to understand what a BAM file actually contains. BAM (Binary Alignment Map) is the compressed binary version of SAM (Sequence Alignment Map) — it is the standard output of alignment tools like STAR and the central file format for all downstream analysis.
- Header section (lines starting with
@):@HD(format/sort order),@SQ(reference sequence names and lengths),@RG(read group metadata),@PG(programs and command lines used) - Alignment section: one line per read, 11 mandatory tab-separated fields — read name, FLAG, chromosome, position (1-based), MAPQ, CIGAR string, mate chromosome, mate position, insert size, read sequence, and base qualities
- CIGAR string: encodes the alignment operation per base —
M= match/mismatch,N= intron skip (splice junction),I/D= insertion/deletion,S= soft clip,H= hard clip - FLAG field: a bitwise integer encoding read properties — paired, properly paired, unmapped, mate unmapped, reverse strand, first/second in pair, secondary/supplementary alignment, duplicate
- Optional tags: tool-specific key:type:value fields appended after the mandatory columns — e.g.
NH:i:1(number of alignments),AS:i:...(alignment score),MD:Z:...(mismatch positions)
Inspect a BAM file from your alignment output
Use the BAM you produced in Part 3 to explore the format directly.
cd /ibex/user/$USER/workshop
BAM=aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam
# View the full BAM header (@HD, @SQ, @PG lines)
samtools view -H $BAM
# View the first 5 alignment records in SAM text format
samtools view $BAM | head -5
# Summary alignment statistics
samtools flagstat $BAM
# Count all reads in the BAM
samtools view -c $BAM
# Count only mapped reads (exclude FLAG bit 0x4 = unmapped)
samtools view -c -F 4 $BAM
Use the Broad Institute's interactive FLAG decoder: broadinstitute.github.io/picard/explain-flags.html — paste any integer FLAG and it lists what each bit means.
Examine the CIGAR string and optional tags
Pick one alignment line and decode its CIGAR string to understand how the read spans a splice junction.
# Show one alignment record with all optional tags, formatted for readability
samtools view $BAM | head -1 | tr '\t' '\n' | cat -n
# Find reads that span a splice junction (N in CIGAR = intron skip)
samtools view $BAM | awk '$6 ~ /N/' | head -3
A CIGAR like 50M400N50M means: 50 bases match the genome, then 400 bases are skipped (an intron), then 50 more bases match. This is how STAR encodes a read that spans a splice junction. The total length of M + I + S + = + X operations must equal the read length.
Part 5: BAM Post-processing
After alignment, BAM files need to be indexed for fast random access, and PCR duplicates should be flagged using Picard MarkDuplicates. These steps operate on the chr11 BAMs produced in Part 3 — the commands are identical to a full-scale workflow.
Index the BAM files with samtools
A BAM index (.bai file) allows tools like IGV and samtools to jump directly to any genomic region without reading the entire file.
cd /ibex/user/$USER/workshop
# Index the BAM file for every sample
for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
echo "Indexing $bam ..."
samtools index "$bam"
done
# Verify — each BAM should now have a matching .bai file
ls -lh aligned/KO_1_SRR10045016/
Mark PCR duplicates with Picard
for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
sample=$(basename $(dirname "$bam"))
echo "Marking duplicates for $sample ..."
picard MarkDuplicates \
I=aligned/${sample}/Aligned.sortedByCoord.out.bam \
O=aligned/${sample}/marked_dup.bam \
M=aligned/${sample}/dup_metrics.txt
samtools index aligned/${sample}/marked_dup.bam
done
During library preparation, PCR amplification can create multiple identical copies of the same original DNA molecule. These duplicates are not independent observations of gene expression — they are technical artifacts. Picard MarkDuplicates identifies read pairs that align to exactly the same position and flags them in the BAM file (it does not delete them by default). Downstream tools like GATK and many DEG pipelines can then ignore flagged duplicates.
Key output: dup_metrics.txt reports the duplication rate. For RNA-seq libraries, rates of 10–40% are common. Very high rates (>60%) may indicate poor library complexity. With chr11-only data the ESTIMATED_LIBRARY_SIZE will appear smaller than a full-dataset run — this is expected.
Inspect the Picard duplication metrics
# View the duplication report for one sample
cat aligned/KO_1_SRR10045016/dup_metrics.txt
The key metric to check is PERCENT_DUPLICATION in the metrics file. Also note ESTIMATED_LIBRARY_SIZE — a low value suggests the library was sequenced to near-saturation. With chr11-only data this estimate will be lower than a full-genome run; the duplication rate itself remains a meaningful quality indicator.
Part 6: Alignment Quality Control
Before proceeding to quantification, assess the quality of the alignments to make sure the data is suitable for downstream analysis. These steps work directly on the chr11 BAMs — alignment stats and log inspection complete in seconds.
Run samtools flagstat on all samples
cd /ibex/user/$USER/workshop
# Print flagstat output for every sample
for bam in aligned/*/Aligned.sortedByCoord.out.bam; do
sample=$(basename $(dirname "$bam"))
echo "=== $sample ==="
samtools flagstat "$bam"
echo ""
done
| Field | What to Look For |
|---|---|
| total | Total reads in the BAM; with chr11 data expect ~6–8 M (3–4 M read pairs × 2 reads each) |
| mapped | Number and % of reads that aligned; aim for >80% |
| properly paired | Both reads in a pair aligned concordantly; should be high (>90% of mapped) |
| singletons | Reads whose mate did not map; a few percent is normal, high values may indicate quality issues |
Inspect the STAR alignment summary log
# View the STAR summary log for the first sample
cat aligned/KO_1_SRR10045016/Log.final.out
# Loop to print key stats for all samples
for log in aligned/*/Log.final.out; do
sample=$(basename $(dirname "$log"))
echo "=== $sample ==="
grep -E "Uniquely mapped|mapped to multiple|unmapped" "$log"
echo ""
done
| Field | Explanation |
|---|---|
| Number of input reads | Total read pairs submitted to STAR |
| Uniquely mapped reads number / % | Reads with exactly one best alignment location — the gold standard metric |
| % of reads mapped to multiple loci | Multi-mapping reads; typically 5–15% for RNA-seq |
| % of reads unmapped: too short | Reads too short after soft-clipping to align; may indicate adapter contamination or aggressive trimming |
| % of reads unmapped: too many mismatches | Reads with too many mismatches; may indicate wrong reference genome or RNA editing |
| % of reads unmapped: other | Miscellaneous unmapped reads |
| Number of splices: Total | Total number of splice junctions observed |
| Number of splices: Annotated sjdb | Splices matching known junctions from the GTF |
| Insertion / Deletion average length | Average indel size; should be small for short-read RNA-seq |
| Mismatch rate per base, % | Base error rate; typically <1% for good Illumina data |
If uniquely mapped reads fall below 70%, investigate the following:
- Wrong reference genome: Confirm you are using the correct species and genome build (GRCh38 for human).
- Adapter contamination: Check FastQC reports from Lab 3. Run fastp again with more aggressive adapter trimming if needed.
- rRNA contamination: A large fraction of unmapped reads may be ribosomal RNA not depleted during library preparation.
- Incorrect library type: Check whether reads are stranded or unstranded; adjust
--outSAMstrandFieldaccordingly. - Short reads after trimming: If many reads are flagged as "too short", relax the minimum length threshold in fastp.
Uniquely mapped reads align to exactly one location in the genome. They are the most reliable for quantification because their genomic origin is unambiguous.
Multi-mapping reads align equally well to two or more locations (e.g., reads from gene families or repeated elements). Different tools handle them differently: some discard them, some distribute them probabilistically. STAR outputs them in the BAM with an NH tag > 1. For standard differential expression analysis with tools like DESeq2, it is common to use only uniquely mapped reads.
Part 7: [Optional] Visualize Alignments in IGV
The Integrative Genomics Viewer (IGV) lets you visually inspect read alignments at any genomic locus — essential for sanity-checking results and exploring specific genes of interest. Because these BAMs contain chr11-only reads, coverage is lower than a full-genome dataset but gene structure, splice junctions, and strand orientation are clearly visible for chr11 genes.
To run IGV locally, copy BAM files to your local machine or mount Ibex on your local machine
# Run these commands from your LOCAL terminal (not Ibex)
# Replace "username" and "SRRxxxxxxxx" with real values
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam ./
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam.bai ./
A chr11 BAM (~3–4 M read pairs) is typically 100–300 MB, so scp to your laptop takes under a minute. A full-dataset BAM can be 5–20 GB, where Option A (Open OnDemand) is strongly preferred.
Download IGV from https://igv.org/doc/desktop/ if not already installed.
Load the genome and BAM file in IGV
- In IGV, select Genomes → Load Genome from Server and choose Human (GRCh38/hg38)
- Select File → Load from File and open the BAM file (
Aligned.sortedByCoord.out.bam) - In the search box at the top, type
GAPDHand press Enter to navigate to that gene
Navigate to the TARDBP gene locus
Since this dataset is a TDP-43 (TARDBP) study, navigate to the gene of interest:
# Type this gene name in the IGV search box to navigate:
TARDBP
# You can also navigate by coordinates:
# chr1:11,012,655-11,045,430 (GRCh38)
- Coverage track: The histogram at the top of the alignment track shows read depth per position. Exons should show peaks of coverage; introns should be nearly empty. With chr11-only data, depth at expressed genes is still meaningful (~20–100× for moderately expressed genes on chr11).
- Splice junctions: Curved arcs connecting exons indicate reads that span splice junctions. You should see arcs connecting TARDBP exons.
- Strand orientation: Read colors indicate which strand a read aligns to. For a stranded library, reads on each gene should be predominantly one color.
- Mismatches: Colored bars within reads indicate mismatches to the reference. A few scattered mismatches are normal; clustering may indicate a variant or sequencing error.
+/-— zoom in and outCtrl+F— jump forward along the chromosome- Right-click on a read — see full alignment details, flags, and tags
- Right-click on the track — change color scheme, squish/expand reads, sort by various criteria
Exercises
Use the commands and output files from this lab to answer the following questions.
Exercise 1: Unique Mapping Rate
What percentage of reads mapped uniquely for each sample? Read the STAR Log.final.out files for all samples and compare the results.
Hint: Use the loop from Part 6, Step 2 to grep the "Uniquely mapped reads %" field from each sample's log. Are there any samples with notably lower mapping rates?
Exercise 2: Duplication Rate
What is the PCR duplication rate reported by Picard MarkDuplicates for each sample? Look at the dup_metrics.txt files.
Hint: The key column is PERCENT_DUPLICATION. You can extract it with:
for metrics in aligned/*/dup_metrics.txt; do
sample=$(basename $(dirname "$metrics"))
echo -n "$sample: "
grep -A2 "ESTIMATED_LIBRARY_SIZE" "$metrics" | tail -1 | cut -f9
done
Exercise 3: Properly Paired Reads
For one sample, how many reads are properly paired versus how many are singletons (mate unmapped)? Use samtools flagstat and calculate the percentage of each.
Hint: Run samtools flagstat aligned/KO_1_SRR10045016/Aligned.sortedByCoord.out.bam and look at the "properly paired" and "singletons" lines. Divide each by the total read count to get percentages.
Exercise 4: TARDBP Coverage in IGV
Navigate to the TARDBP gene locus in IGV (chromosome 1). Describe what you observe: How many exons are covered? Do you see evidence of alternative splicing (multiple arc patterns)? Is coverage uniform across exons?
Hint: TARDBP encodes TDP-43, the key protein in this study (GSE136366). Compare a control sample versus a TDP-43 knockdown sample if available. Look for differences in exon coverage and splice-junction arcs.
Summary
In this lab, you have:
- Built a STAR genome index for chr11 and aligned all GSE136366 paired-end samples
- Explored the BAM/SAM format: header sections, alignment fields, CIGAR strings, FLAG values, and optional tags using
samtools - Indexed BAM files with
samtools indexfor fast random access - Marked PCR duplicates with
picard MarkDuplicatesand inspected duplication rates - Assessed alignment quality using
samtools flagstatand the STARLog.final.outsummary - Visualized aligned reads at the TARDBP gene locus in IGV
Previous: Lab 2: Public Data Retrieval — Next: Lab 5: Gene Quantification with Salmon