Lab 3: QC and Preprocessing


In this lab you will assess the quality of the chromosome 11-filtered reads from GSE136366, trim adapters and low-quality bases using fastp, re-assess quality after trimming, and aggregate all QC reports into a single interactive summary with MultiQC.

Why chromosome 11?

Working with chr11-filtered reads (~3–4 million reads per sample) gives enough data for biologically meaningful QC, alignment, and quantification results while keeping runtimes to a few minutes per sample on Ibex. Chromosome 11 encodes many well-studied neuronal genes relevant to TDP-43 biology, making it a suitable subset for this dataset.
From Lab 6 onwards, the full dataset is used since nf-core pipelines are optimised for large-scale data.

Learning Objectives

Run FastQC on raw reads and interpret the key report modules
Understand Phred quality scores and what PASS / WARN / FAIL mean
Trim adapters and low-quality bases with fastp using a loop over all samples
Re-run FastQC on trimmed reads to verify improvement
Aggregate all QC outputs into one report with MultiQC

Part 1: Setup on Ibex

All work in this lab is performed on the KAUST Ibex HPC cluster. Start an interactive session and load the required tools before running any commands.

Connect to Ibex and start an interactive session

# Connect to the Ibex login node
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive compute session (never run analyses on login nodes!)
srun --pty --time=4:00:00 --mem=16G --cpus-per-task=4 bash

Tip: Interactive vs. batch jobs

For this lab the commands are fast enough to run interactively. For full-size datasets, submit them as SLURM batch jobs (sbatch) so they can run unattended overnight.

Load required modules

# Load the three tools used in this lab
module load fastqc
module load fastp
module load multiqc

# Confirm all three are loaded
module list

# Quick version checks
fastqc --version
fastp --version
multiqc --version

Module names on Ibex

If module load fastqc fails, search for the correct module name with module avail fastqc. Module names are sometimes case-sensitive or include a version suffix (e.g., fastqc/0.12.1).

Navigate to your workspace and create output directories

# Go to your workshop directory
cd /ibex/user/$USER/workshop

# Create subdirectories for trimmed reads and QC results
mkdir -p trimmed
mkdir -p qc/raw
mkdir -p qc/trimmed
mkdir -p qc/fastp
mkdir -p qc/multiqc_report

# Verify the structure
ls -la

Copy the chr11 FASTQ files from the shared course directory

For Labs 3–5 we work with a chromosome 11-filtered subset of GSE136366. These files are pre-prepared on Ibex shared storage — copying takes only a few seconds.

cd /ibex/user/$USER/workshop

# Create the dedicated directory for chr11 data
mkdir -p chr11_raw_data

# Copy all chr11 FASTQs from the shared Ibex path
cp /biocorelab/BIX/resources/datasets/rnaseq/GSE136366_KO_chr11/*.fastq.gz chr11_raw_data/

# Verify — you should see 12 files (6 samples × 2 paired-end reads)
ls -lh chr11_raw_data/

Expected files

KO_1_SRR10045016_1.fastq.gz   KO_1_SRR10045016_2.fastq.gz
KO_2_SRR10045017_1.fastq.gz   KO_2_SRR10045017_2.fastq.gz
KO_3_SRR10045018_1.fastq.gz   KO_3_SRR10045018_2.fastq.gz
WT_1_SRR10045019_1.fastq.gz   WT_1_SRR10045019_2.fastq.gz
WT_2_SRR10045020_1.fastq.gz   WT_2_SRR10045020_2.fastq.gz
WT_3_SRR10045021_1.fastq.gz   WT_3_SRR10045021_2.fastq.gz
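
As an optional sanity check, the pairing can also be verified programmatically. The sketch below is not part of the original lab: check_pairs is a hypothetical helper that reports any R1 file in a directory without a matching R2 file, assuming the _1.fastq.gz / _2.fastq.gz suffixes used in the listing above.

```shell
# Hypothetical helper (not part of the course material):
# report R1 files that have no matching R2 file in the given directory.
check_pairs() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"        # swap the read suffix
        if [ ! -e "$r2" ]; then
            echo "Missing mate for $(basename "$r1")"
            missing=$((missing + 1))
        fi
    done
    echo "$missing unpaired file(s)"
    [ "$missing" -eq 0 ]                         # non-zero exit status if any mate is missing
}
```

Running check_pairs chr11_raw_data should print "0 unpaired file(s)" when every R1 file has its R2 partner.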

Naming convention: {condition}_{replicate}_{SRR}_{read}.fastq.gz, where KO = TDP-43 knockout and WT = wild type (control).
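
Because the file names follow a fixed pattern, the fields can be recovered with plain shell tools. The following is a minimal sketch, assuming exactly four underscore-separated fields as in the convention above; parse_fastq_name is a hypothetical helper name, not something provided by the course.

```shell
# Hypothetical helper: split a FASTQ file name into its four fields.
parse_fastq_name() {
    base=$(basename "$1" .fastq.gz)      # drop directory and extension
    # Split on underscores: condition, replicate, accession, read number
    IFS=_ read -r condition replicate srr read_no <<EOF
$base
EOF
    printf 'condition=%s replicate=%s srr=%s read=%s\n' \
        "$condition" "$replicate" "$srr" "$read_no"
}

parse_fastq_name chr11_raw_data/KO_1_SRR10045016_1.fastq.gz
# prints: condition=KO replicate=1 srr=SRR10045016 read=1
```

The same idea is used later in the trimming loop, where basename strips the _1.fastq.gz suffix to obtain the sample name.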

Verify read counts

module load seqkit
seqkit stats chr11_raw_data/*_1.fastq.gz

Expect approximately 3–4 million reads per sample — sufficient for biologically meaningful QC, alignment, and quantification on chromosome 11.

Part 2: Quality Control with FastQC

FastQC is the most widely used tool for assessing the quality of raw sequencing reads. It reads one or more FASTQ files and produces an HTML report with a series of diagnostic plots and summary flags (PASS / WARN / FAIL) for each quality module.

What FastQC checks

Module	What to look for
Per Base Sequence Quality	Quality should stay above Q28 across the full read length. A drop at the 3′ end is normal for Illumina reads.
Per Sequence Quality Scores	The distribution should be unimodal and shifted towards high scores (Q30+).
Adapter Content	Adapter signal above ~5% means many inserts are shorter than the read length, so the reads run into the adapter. Trim before alignment.
Sequence Duplication Levels	RNA-seq libraries commonly show high duplication due to highly expressed transcripts. WARN/FAIL here is expected.
Overrepresented Sequences	WARN/FAIL often caused by adapter dimers or rRNA — worth investigating if >1% of reads.
Per Base Sequence Content	The first 10–15 bp may show biased content due to random hexamer priming — this is expected in RNA-seq.

Phred Quality Scores (Q-scores)

Each base in a FASTQ file has a quality score encoded as a single ASCII character. The numeric Phred score Q is defined as:

Q = -10 * log10(P)   where P = probability of a wrong base call

Phred Score	Error probability	Accuracy	Interpretation
Q10	1 in 10	90%	Poor — discard or trim
Q20	1 in 100	99%	Minimum acceptable
Q30	1 in 1,000	99.9%	Good quality
Q40	1 in 10,000	99.99%	Excellent quality

Most downstream tools (aligners, variant callers) recommend at least Q20 per-base quality. Aim for ≥80% of bases at Q30 or above.
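
To make the formula concrete, the conversion can be tried on the command line. This is a small sketch rather than part of the lab: phred_to_prob and char_to_phred are hypothetical helper names, and the decoding assumes the standard Phred+33 ASCII offset used by current Illumina FASTQ files.

```shell
# Convert a Phred score to an error probability: P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Decode one FASTQ quality character to its numeric score (assumes Phred+33)
char_to_phred() {
    code=$(printf '%d' "'$1")     # ASCII code of the character
    echo $((code - 33))
}

phred_to_prob 20      # prints 0.0100 (1 error in 100)
phred_to_prob 30      # prints 0.0010 (1 error in 1,000)
char_to_phred I       # prints 40 ("I" is ASCII 73)
```

In a FASTQ record the fourth line holds one such character per base, so a run of "I" characters corresponds to consistently excellent calls.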

Run FastQC on all raw chr11 FASTQ files

cd /ibex/user/$USER/workshop

# Run FastQC on all chr11 R1 and R2 files simultaneously
fastqc chr11_raw_data/*_1.fastq.gz chr11_raw_data/*_2.fastq.gz \
    -o qc/raw/ \
    --threads 4

# Check the output — one .html and one .zip per file
ls -lh qc/raw/

Tip: FastQC key options

Option	Description
`-o`	Output directory for reports
`--threads`	Number of files to process in parallel (set to match `--cpus-per-task`)
`--extract`	Unzip the output ZIP archives automatically
`--quiet`	Suppress progress messages (useful in scripts)

View the FastQC HTML report

FastQC produces one .html report per input file. To view it, either transfer it to your local computer or open it through a file explorer connected to Ibex.

# Option A: Copy a report to your local machine (run this on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/raw/KO_1_SRR10045016_1_fastqc.html ~/Desktop/

# Option B: browse to the report in a file explorer connected to Ibex



            
Interpret key modules for GSE136366

When reviewing the FastQC report for this dataset, pay attention to:

Per Base Sequence Quality: Look for a quality drop at the 3′ end of reads. Values consistently below Q28 warrant trimming.
Adapter Content: Check whether Illumina Universal Adapter or TruSeq adapter sequences appear. If any sample shows >5% adapter contamination, trimming is essential.
Sequence Duplication Levels: RNA-seq libraries almost always show high duplication levels (FAIL) because highly expressed genes contribute many identical reads. This is normal and does not need to be corrected at this stage.
Per Base Sequence Content: Biased nucleotide composition in the first ~10 bp is a known artefact of random hexamer priming used during library preparation and is expected in RNA-seq data.

Part 3: Adapter Trimming and Quality Filtering with fastp

fastp is an all-in-one FASTQ preprocessing tool that detects and removes adapter sequences, can trim low-quality bases from read ends, filters out low-quality and short reads, and produces an HTML and JSON quality report, all in a single pass over the data.

What fastp does

Adapter auto-detection: For paired-end libraries, adapters are found from the overlap between R1 and R2; with --detect_adapter_for_pe, fastp additionally infers the adapter sequences from the data itself, so they do not need to be specified manually.
Quality trimming: Removes low-quality bases from read ends using a sliding window. This is opt-in: it runs only when enabled with an option such as --cut_tail (3′ end) or --cut_front (5′ end).
Quality filtering: Discards reads in which too many bases fall below --qualified_quality_phred (by default, more than 40% of the read).
Length filtering: Discards reads that become too short after trimming (shorter than 50 bp in this lab; fastp's default is 15 bp).
Low-complexity filtering: Optionally removes reads dominated by repetitive sequences (e.g., poly-A tails).
QC reports: Produces an HTML report and a machine-readable JSON file per sample, which MultiQC can aggregate.

Run fastp on all samples using a loop

cd /ibex/user/$USER/workshop

# Loop over every R1 file, derive the sample name, and trim each read pair
for r1 in chr11_raw_data/*_1.fastq.gz; do
    sample=$(basename "$r1" _1.fastq.gz)
    echo "Trimming $sample ..."
    fastp \
        -i "chr11_raw_data/${sample}_1.fastq.gz" \
        -I "chr11_raw_data/${sample}_2.fastq.gz" \
        -o "trimmed/${sample}_trimmed_1.fastq.gz" \
        -O "trimmed/${sample}_trimmed_2.fastq.gz" \
        --json "qc/fastp/${sample}_fastp.json" \
        --html "qc/fastp/${sample}_fastp.html" \
        --thread 4 \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50
done

echo "Trimming complete."
ls -lh trimmed/

Understand the key fastp parameters

fastp parameter reference

Parameter	Description
`-i` / `-I`	Input R1 / R2 FASTQ files
`-o` / `-O`	Output trimmed R1 / R2 FASTQ files
`--json`	Path for the machine-readable JSON report (used by MultiQC)
`--html`	Path for the human-readable HTML report
`--thread`	Number of CPU threads (match your `--cpus-per-task`)
`--detect_adapter_for_pe`	Automatically detect adapter sequences for paired-end data; recommended for most Illumina libraries
`--qualified_quality_phred 20`	A base counts as “qualified” if its Phred score is at least Q20. fastp discards reads in which more than 40% of bases are unqualified (the limit is set with `--unqualified_percent_limit`). Sliding-window trimming uses a separate threshold, `--cut_mean_quality`.
`--length_required 50`	Discard reads shorter than 50 bp after trimming (prevents very short reads from causing misalignments)
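
The effect of the quality threshold on a single read can be illustrated with a short script. This is a simplified sketch of the unqualified-base rule only, assuming Phred+33 encoding and fastp's documented default limit of 40%; passes_quality_filter is a hypothetical helper, and fastp itself applies further checks (for example on N bases) that are not modelled here.

```shell
# Simplified sketch of a fastp-style per-read quality filter (hypothetical helper).
# $1 = quality string (Phred+33), $2 = qualified Phred score (default 20),
# $3 = maximum percentage of unqualified bases allowed (default 40).
passes_quality_filter() {
    QUAL=$1 awk -v min_q="${2:-20}" -v max_pct="${3:-40}" 'BEGIN {
        qual = ENVIRON["QUAL"]
        # Build a character -> ASCII code lookup for printable characters
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        n = length(qual)
        low = 0
        for (j = 1; j <= n; j++)
            if (ord[substr(qual, j, 1)] - 33 < min_q) low++
        # Fail (exit status 1) when more than max_pct percent of bases are unqualified
        exit ((low * 100 > max_pct * n) ? 1 : 0)
    }'
}

passes_quality_filter 'IIIIII####' && echo "kept"        # 40% low-quality bases: kept
passes_quality_filter 'IIIII#####' || echo "discarded"   # 50% low-quality bases: discarded
```

Note that this rule removes whole reads; it does not shorten them. Trimming low-quality ends requires one of the --cut_* options.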
                    
                
                
Tip: Additional fastp options worth knowing

--cut_tail: Enable 3′ sliding-window quality trimming. The window moves from the 3′ end towards the 5′ end, and bases are dropped while the window's mean quality is below --cut_mean_quality.
--low_complexity_filter: Remove low-complexity reads. fastp defines complexity as the percentage of bases that differ from the next base and discards reads below the threshold (30% by default).
--trim_poly_x: Trim poly-A / poly-T / poly-G / poly-C tails at the 3′ end.
--dedup: Remove duplicate reads with identical sequences (use cautiously for RNA-seq, where duplicates often come from highly expressed transcripts).

Inspect a fastp HTML report

Each sample produces its own fastp HTML report. Copy one to your local machine or browse to it, then review the key sections described under "Reading the fastp HTML report".

# Option A: Copy a fastp report to your local machine (run on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/fastp/KO_1_SRR10045016_fastp.html ~/Desktop/

# Option B: browse to the report in a file explorer connected to Ibex

Reading the fastp HTML report

Summary table: Shows total reads before and after filtering, the percentage of reads passing filters, adapter trimming rate, and the percentage of bases at Q20 and Q30. A passing rate below 80% may indicate a library quality issue.
Filtering result: Breaks down why reads were discarded: low quality, too short after trimming, too many Ns, or low complexity. “Too short” is the most common category when adapter contamination is present.
Insert size distribution: Shows the estimated fragment length. A bimodal or very short peak may indicate adapter dimers in the original library.
Quality plots (before/after): Side-by-side per-base and per-read quality distributions show whether processing improved the data.

Part 4: Post-trimming QC with FastQC

Re-running FastQC on the trimmed reads lets you confirm that adapter contamination has been removed and that per-base quality has improved. Always compare pre- and post-trimming reports before proceeding to alignment.

Run FastQC on the trimmed reads

cd /ibex/user/$USER/workshop

# Run FastQC on all trimmed chr11 files
fastqc trimmed/*_trimmed_1.fastq.gz trimmed/*_trimmed_2.fastq.gz \
    -o qc/trimmed/ \
    --threads 4

# Verify outputs
ls -lh qc/trimmed/

Compare pre- and post-trimming reports

Open the FastQC report for the same sample from qc/raw/ and qc/trimmed/ side by side. Key differences to expect after trimming:

Adapter Content: Should change from WARN/FAIL to PASS; adapter signal should be gone or negligible.
Per Base Sequence Quality: The 3′ quality drop should be reduced, because adapter read-through is removed and reads with many low-quality bases are discarded. If a clear drop remains, add --cut_tail to the fastp command to trim low-quality 3′ ends explicitly.
Read count: Slightly lower because reads that failed the quality filter or were shorter than 50 bp after trimming were discarded (--length_required 50).
Sequence Duplication Levels: May remain FAIL; this is expected for RNA-seq and is not corrected by trimming.
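
The comparison can also be done on the command line. Each FastQC .zip archive contains a summary.txt file with one tab-separated line per module (status, module name, file name). The sketch below assumes two such files have already been extracted; compare_flags is a hypothetical helper that lists the modules whose flag changed.

```shell
# Hypothetical helper: list FastQC modules whose flag changed after trimming.
# $1 = summary.txt from the raw report, $2 = summary.txt from the trimmed report.
# Each line of summary.txt is: STATUS <tab> module name <tab> file name
compare_flags() {
    awk -F '\t' '
        NR == FNR { raw[$2] = $1; next }     # first file: store status per module
        ($2 in raw) && raw[$2] != $1 {       # second file: report changed modules
            printf "%s: %s -> %s\n", $2, raw[$2], $1
        }
    ' "$1" "$2"
}
```

One way to obtain the inputs is unzip -p qc/raw/KO_1_SRR10045016_1_fastqc.zip '*/summary.txt' > raw_summary.txt, and likewise for the trimmed report, followed by compare_flags raw_summary.txt trimmed_summary.txt. The output file names here are placeholders.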
                
                
Tip: When trimming is not enough

If FastQC still shows high adapter content after trimming, check whether the correct adapter sequences were detected. You can specify adapters manually in fastp using --adapter_sequence and --adapter_sequence_r2. Illumina TruSeq adapter sequences are widely documented.

Part 5: Aggregate Reports with MultiQC

MultiQC scans a directory tree for output files from many bioinformatics tools (FastQC, fastp, STAR, Salmon, Picard, and dozens more) and compiles them into a single interactive HTML report. This makes it easy to compare quality metrics across all samples at a glance.

Run MultiQC over all QC outputs

cd /ibex/user/$USER/workshop

# Aggregate FastQC (raw + trimmed) and fastp reports into one report
multiqc qc/ trimmed/ \
    --outdir qc/multiqc_report \
    --filename multiqc_report \
    --title "GSE136366 RNA-seq QC Report"

# List the output files
ls -lh qc/multiqc_report/

Tip: Useful MultiQC options

Option	Description
`--outdir`	Directory for the output report
`--filename`	Base name for the output HTML file
`--title`	Title displayed in the report header
`--ignore`	Exclude files matching a pattern (e.g., `--ignore "*_raw_*"`)
`--sample-names`	Provide a TSV file to rename samples in the report
`-f`	Force overwrite if a report already exists

Open the MultiQC report

# Option A: Copy the MultiQC report directory to your local machine (run on your LOCAL terminal)
# Use your Ibex username in the path; $USER would expand to your local username here
scp -r username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/multiqc_report ~/Desktop/

# Option B: browse to the report in a file explorer connected to Ibex
# Navigate to /ibex/user/<username>/workshop/qc/multiqc_report/
# Click multiqc_report.html to open it in the browser

Interpret the MultiQC report

The MultiQC report is organized into collapsible sections, one per tool. Here is what to look for:

Key MultiQC sections for this lab

General Statistics table (top of report): A summary row per sample with columns such as % duplicates, % GC, and total sequences (from FastQC), plus reads after filtering, % reads passing filters, and % adapter-trimmed reads (from fastp). Use this table to spot any sample that looks like an outlier.
FastQC: Per Sequence Quality Scores: All samples should cluster together. A sample shifted far to the left (lower quality) may need special attention.
FastQC: Adapter Content: Compare the raw vs. trimmed adapter plots side by side. After trimming, all lines should be flat near 0%.
fastp: Filtering Results: A stacked bar chart showing reads that passed vs. were discarded per sample. If a sample loses >20% of reads, investigate the reason (adapter dimers, poor quality, short insert sizes).
fastp: Insert Size: The insert size distribution across samples. For RNA-seq, a median insert size of 150–300 bp is typical.

Tip: Using MultiQC throughout the course

MultiQC recognizes outputs from STAR (alignment), Salmon (quantification), Picard (duplication), RSeQC (BAM QC), and many other tools. As you complete later labs, simply re-run MultiQC pointing to the entire workshop directory to get an up-to-date combined report covering all analysis steps.

Exercises

Work through these exercises using the commands and concepts you have practised in the lab above. Use the tool reports and the MultiQC summary to find the answers.

Exercise 1: Filtering rate per sample

What percentage of reads were removed by fastp for one sample of your choice? Check both the fastp HTML report and the JSON file to find this value.

Hint: Open qc/fastp/KO_1_SRR10045016_fastp.html and look at the "Filtering result" summary table. Alternatively, inspect the JSON file:

grep -A5 '"filtering_result"' qc/fastp/KO_1_SRR10045016_fastp.json

Exercise 2: Adapter detection in raw reads

Do any samples show notable adapter content in the raw FastQC reports? Which adapter sequence(s) were detected by fastp? Are the same adapters present in all samples or do they differ between samples?

Hint: Check the "Adapter Content" module in the FastQC reports for the raw files in qc/raw/. Cross-reference with the "Adapter" section in the corresponding fastp HTML report.

Exercise 3: Quality improvement after trimming

Open the FastQC "Per Base Sequence Quality" plot for the same sample before trimming (qc/raw/) and after trimming (qc/trimmed/). What changed? Did trimming improve the 3′ quality drop? Were any other modules upgraded from WARN/FAIL to PASS?

Hint: The MultiQC report can display pre- and post-trimming FastQC results together. Look for sample names like KO_1_SRR10045016_1 (raw) vs. KO_1_SRR10045016_trimmed_1 (trimmed) in the per-sample plots.

Exercise 4: Read counts after trimming

How many reads remain in the trimmed files across all samples? Use seqkit stats on the trimmed R1 files to produce a summary table.

module load seqkit
seqkit stats trimmed/*_trimmed_1.fastq.gz

Compare the read counts to the chr11 input files (chr11_raw_data/*_1.fastq.gz). What is the average retention rate across samples?
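
The retention rate is simply the trimmed read count divided by the raw read count. The snippet below is a minimal sketch of that arithmetic; retention_rate is a hypothetical helper, and the two counts in the example call are made-up numbers, not results from this dataset.

```shell
# Hypothetical helper: percentage of reads retained after trimming.
# $1 = reads before trimming, $2 = reads after trimming
retention_rate() {
    awk -v raw="$1" -v kept="$2" 'BEGIN {
        if (raw <= 0) { print "NA"; exit 1 }     # avoid division by zero
        printf "%.2f\n", 100 * kept / raw
    }'
}

retention_rate 4000000 3800000    # prints 95.00 (illustrative counts only)
```

seqkit prints read counts with thousands separators by default; running seqkit stats -T gives plain tab-separated numbers that can be passed to a calculation like this directly.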
            
        

        
        
        
        
Summary

In this lab you have:

Loaded fastqc, fastp, and multiqc modules on the Ibex HPC cluster
Run FastQC on all chr11-filtered FASTQ files from GSE136366 and interpreted the per-base quality, adapter content, and duplication modules
Understood Phred quality scores and the meaning of the PASS / WARN / FAIL flags in FastQC
Trimmed adapters and filtered low-quality reads from all samples using fastp in a loop, producing trimmed FASTQ files and per-sample QC reports
Re-run FastQC on the trimmed reads and confirmed quality improvements
Aggregated all FastQC and fastp reports into a single interactive summary using MultiQC

Your trimmed reads in /ibex/user/$USER/workshop/trimmed/ are now ready for alignment.

Files produced in this lab

Location	Contents
`trimmed/*_trimmed_1.fastq.gz`, `trimmed/*_trimmed_2.fastq.gz`	Adapter-trimmed, quality-filtered chr11 paired FASTQ files (input for Lab 4)
`qc/raw/`	FastQC reports for raw reads
`qc/trimmed/`	FastQC reports for trimmed reads
`qc/fastp/`	fastp HTML and JSON reports per sample
`qc/multiqc_report/multiqc_report.html`	Aggregated MultiQC report
