Lab 3: QC and Preprocessing

Part 1: Setup on Ibex | Part 2: FastQC | Part 3: fastp Trimming | Part 4: Post-trim FastQC | Part 5: MultiQC Report

In this lab you will assess the quality of the chromosome 11-filtered reads from GSE136366, trim adapters and low-quality bases using fastp, re-assess quality after trimming, and aggregate all QC reports into a single interactive summary with MultiQC.

Why chromosome 11?

Working with chr11-filtered reads (~3–4 million reads per sample) gives enough data for biologically meaningful QC, alignment, and quantification results while keeping runtimes under a few minutes per sample on Ibex. Chromosome 11 encodes many well-studied neuronal genes relevant to TDP-43 biology, making it an ideal subset for this dataset.
From Lab 6 onwards, the full dataset is used since nf-core pipelines are optimised for large-scale data.

Learning Objectives

Part 1: Setup on Ibex

All work in this lab is performed on the KAUST Ibex HPC cluster. Start an interactive session and load the required tools before running any commands.

Connect to Ibex and start an interactive session

# Connect to the Ibex login node
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive compute session (never run analyses on login nodes!)
srun --pty --time=4:00:00 --mem=16G --cpus-per-task=4 bash
Tip: Interactive vs. batch jobs

For this lab the commands are fast enough to run interactively. For full-size datasets, submit them as SLURM batch jobs (sbatch) so they can run unattended overnight.
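As a rough sketch of what such a batch job could look like, the snippet below writes a small SLURM script for the FastQC step used later in this lab. The file name fastqc_raw.sbatch, the job name, the log file pattern, and the resource values are illustrative assumptions, not course-provided settings; adjust them to your own allocation.

```shell
# Write a minimal SLURM batch script (hypothetical file name: fastqc_raw.sbatch).
# The quoted 'EOF' keeps $USER and $SLURM_CPUS_PER_TASK unexpanded until the job runs.
cat > fastqc_raw.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_raw
#SBATCH --time=01:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4
#SBATCH --output=fastqc_raw_%j.log

# Same steps as the interactive session, but unattended
module load fastqc
cd /ibex/user/$USER/workshop
mkdir -p qc/raw
fastqc chr11_raw_data/*.fastq.gz -o qc/raw/ --threads "$SLURM_CPUS_PER_TASK"
EOF
```

You would then submit it with sbatch fastqc_raw.sbatch and check progress with squeue -u $USER.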

Load required modules

# Load the three tools used in this lab
module load fastqc
module load fastp
module load multiqc

# Confirm all three are loaded
module list

# Quick version checks
fastqc --version
fastp --version
multiqc --version
Module names on Ibex

If module load fastqc fails, search for the correct module name with module avail fastqc. Module names are sometimes case-sensitive or include a version suffix (e.g., fastqc/0.12.1).

Navigate to your workspace and create output directories

# Go to your workshop directory
cd /ibex/user/$USER/workshop

# Create subdirectories for trimmed reads and QC results
mkdir -p trimmed
mkdir -p qc/raw
mkdir -p qc/trimmed
mkdir -p qc/fastp
mkdir -p qc/multiqc_report

# Verify the structure
ls -la

Copy the chr11 FASTQ files from the shared course directory

For Labs 3–5 we work with a chromosome 11-filtered subset of GSE136366. These files are pre-prepared on Ibex shared storage — copying takes only a few seconds.

cd /ibex/user/$USER/workshop

# Create the dedicated directory for chr11 data
mkdir -p chr11_raw_data

# Copy all chr11 FASTQs from the shared Ibex path
cp /biocorelab/BIX/resources/datasets/rnaseq/GSE136366_KO_chr11/*.fastq.gz chr11_raw_data/

# Verify — you should see 12 files (6 samples × 2 paired-end reads)
ls -lh chr11_raw_data/
Expected files
KO_1_SRR10045016_1.fastq.gz   KO_1_SRR10045016_2.fastq.gz
KO_2_SRR10045017_1.fastq.gz   KO_2_SRR10045017_2.fastq.gz
KO_3_SRR10045018_1.fastq.gz   KO_3_SRR10045018_2.fastq.gz
WT_1_SRR10045019_1.fastq.gz   WT_1_SRR10045019_2.fastq.gz
WT_2_SRR10045020_1.fastq.gz   WT_2_SRR10045020_2.fastq.gz
WT_3_SRR10045021_1.fastq.gz   WT_3_SRR10045021_2.fastq.gz

Naming convention: {condition}_{replicate}_{SRR}_{read}.fastq.gz
KO = TDP-43 knockout, WT = wild type (control).
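Because every sample should have both a _1 and a _2 file, a quick pairing check can catch an incomplete copy before any tool is run. The helper below is a minimal sketch; check_pairs is a hypothetical name, not a course or Ibex utility.

```shell
# check_pairs DIR
# Prints the number of complete R1/R2 pairs in DIR and reports any R1 file
# without a matching R2 on stderr. Exit status is non-zero if a mate is missing.
# The body runs in a subshell ( ... ) so its variables do not leak into your session.
check_pairs() (
    dir=$1
    pairs=0
    missing=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"     # swap the read suffix
        if [ -e "$r2" ]; then
            pairs=$((pairs + 1))
        else
            echo "Missing mate for $r1" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$pairs"
    [ "$missing" -eq 0 ]
)
```

For this lab, check_pairs chr11_raw_data should print 6.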

Verify read counts

module load seqkit
seqkit stats chr11_raw_data/*_1.fastq.gz

Expect approximately 3–4 million reads per sample — sufficient for biologically meaningful QC, alignment, and quantification on chromosome 11.
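If seqkit is not available, the same count can be approximated with standard tools, since each FASTQ record normally spans exactly four lines. The function below is a small sketch under that assumption (it would miscount multi-line FASTQ records); count_reads is a hypothetical helper name.

```shell
# count_reads FILE.fastq.gz
# Decompress to stdout and divide the line count by 4 (one record = 4 lines).
count_reads() {
    gzip -dc < "$1" | awk 'END { print NR / 4 }'
}
```

For example, count_reads chr11_raw_data/KO_1_SRR10045016_1.fastq.gz should agree with the num_seqs column reported by seqkit stats.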

Part 2: Quality Control with FastQC

FastQC is the most widely used tool for assessing the quality of raw sequencing reads. It reads one or more FASTQ files and produces an HTML report with a series of diagnostic plots and summary flags (PASS / WARN / FAIL) for each quality module.

What FastQC checks
Module | What to look for
Per Base Sequence Quality | Quality should stay above Q28 across the full read length. A drop at the 3′ end is normal for Illumina reads.
Per Sequence Quality Scores | The distribution should be unimodal and shifted towards high scores (Q30+).
Adapter Content | Adapter signal above 5% indicates that many inserts are shorter than the read length, so reads run into the adapter. Trim before alignment.
Sequence Duplication Levels | RNA-seq libraries commonly show high duplication due to highly expressed transcripts. WARN/FAIL here is expected.
Overrepresented Sequences | WARN/FAIL is often caused by adapter dimers or rRNA; worth investigating if >1% of reads.
Per Base Sequence Content | The first 10–15 bp may show biased content due to random hexamer priming; this is expected in RNA-seq.
Phred Quality Scores (Q-scores)

Each base in a FASTQ file has a quality score encoded as a single ASCII character. The numeric Phred score Q is defined as:

Q = -10 * log10(P)   where P = probability of a wrong base call
Phred Score | Error probability | Accuracy | Typical threshold
Q10 | 1 in 10 | 90% | Poor: discard or trim
Q20 | 1 in 100 | 99% | Minimum acceptable
Q30 | 1 in 1,000 | 99.9% | Good quality
Q40 | 1 in 10,000 | 99.99% | Excellent quality

Most downstream tools (aligners, variant callers) recommend at least Q20 per-base quality. Aim for ≥80% of bases at Q30 or above.
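To make the formula concrete, the two helpers below convert in both directions: from a Phred score to an error probability (P = 10^(-Q/10)), and from a FASTQ quality character to its Phred score. This is a sketch that assumes the Phred+33 encoding used by current Illumina data; the function names are hypothetical.

```shell
# phred_to_error Q: error probability for a Phred score, P = 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

# char_to_phred C: Phred score of one FASTQ quality character (Phred+33 assumed)
char_to_phred() {
    code=$(printf '%d' "'$1")     # a leading quote makes printf return the ASCII code
    echo $((code - 33))
}
```

For instance, char_to_phred I gives 40 and phred_to_error 20 gives 0.01, matching the table above.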

Run FastQC on all raw chr11 FASTQ files

cd /ibex/user/$USER/workshop

# Run FastQC on all chr11 R1 and R2 files simultaneously
fastqc chr11_raw_data/*_1.fastq.gz chr11_raw_data/*_2.fastq.gz \
    -o qc/raw/ \
    --threads 4

# Check the output — one .html and one .zip per file
ls -lh qc/raw/
Tip: FastQC key options
Option | Description
-o | Output directory for reports
--threads | Number of files to process in parallel (set to match --cpus-per-task)
--extract | Unzip the output ZIP archives automatically
--quiet | Suppress progress messages (useful in scripts)
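When the archives are unpacked (with --extract, or by unzipping them yourself), each <name>_fastqc/ directory contains a tab-separated summary.txt with one line per module: flag, module name, file name. The sketch below lists every module carrying a given flag, which can be quicker than opening twelve HTML reports. list_flags is a hypothetical helper, and the command in this lab does not pass --extract, so that step is an assumption here.

```shell
# list_flags STATUS FILE...
# Print "file<TAB>module" for every module whose flag equals STATUS (PASS, WARN or FAIL).
# Each FILE is a FastQC summary.txt with columns: flag, module name, input file name.
list_flags() {
    status=$1
    shift
    awk -F '\t' -v s="$status" '$1 == s { print $3 "\t" $2 }' "$@"
}
```

A typical call would be list_flags FAIL qc/raw/*_fastqc/summary.txt.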

View the FastQC HTML report

FastQC produces one .html report per input file. To view a report, either copy it to your local computer or browse to it with a file explorer connected to Ibex.

# Option A: Copy a report to your local machine (run this on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/raw/KO_1_SRR10045016_1_fastqc.html ~/Desktop/

# Option B: connect to Ibex from your file explorer and open the report from there

Interpret key modules for GSE136366

When reviewing the FastQC report for this dataset, pay attention to:

  • Per Base Sequence Quality: Look for a quality drop at the 3′ end of reads. Values consistently below Q28 warrant trimming.
  • Adapter Content: Check whether Illumina Universal Adapter or TruSeq adapter sequences appear. If any sample shows >5% adapter contamination, trimming is essential.
  • Sequence Duplication Levels: RNA-seq libraries almost always show high duplication levels (FAIL) because highly expressed genes contribute many identical reads. This is normal and does not need to be corrected at this stage.
  • Per Base Sequence Content: Biased nucleotide composition in the first ~10 bp is a known artefact of random hexamer priming used during library preparation and is expected in RNA-seq data.

Part 3: Adapter Trimming and Quality Filtering with fastp

fastp is an all-in-one FASTQ preprocessing tool that automatically detects and removes adapter sequences, trims low-quality bases from read ends, filters short reads, and produces a rich HTML + JSON quality report — all in a single fast pass over the data.

What fastp does
  • Adapter auto-detection: Infers adapter sequences from the data itself for paired-end libraries — no need to specify adapter sequences manually.
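  • Quality trimming and filtering: Can trim low-quality bases from the 3′ end of reads with a sliding window (--cut_tail or --cut_right, off by default), and discards reads in which too many bases fall below a per-base quality threshold.
  • Length filtering: Discards reads that become too short after trimming (<50 bp in this lab; fastp's default is 15 bp).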
  • Low-complexity filtering: Optionally removes reads dominated by repetitive sequences (e.g., poly-A tails).
  • QC reports: Produces an HTML report and a machine-readable JSON file per sample, which MultiQC can aggregate.

Run fastp on all samples using a loop

cd /ibex/user/$USER/workshop

# Loop over every SRR accession and trim each pair
for r1 in chr11_raw_data/*_1.fastq.gz; do
    sample=$(basename "$r1" _1.fastq.gz)
    echo "Trimming $sample ..."
    fastp \
        -i chr11_raw_data/${sample}_1.fastq.gz \
        -I chr11_raw_data/${sample}_2.fastq.gz \
        -o trimmed/${sample}_trimmed_1.fastq.gz \
        -O trimmed/${sample}_trimmed_2.fastq.gz \
        --json qc/fastp/${sample}_fastp.json \
        --html qc/fastp/${sample}_fastp.html \
        --thread 4 \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50
done

echo "Trimming complete."
ls -lh trimmed/

Understand the key fastp parameters

fastp parameter reference
Parameter | Description
-i / -I | Input R1 / R2 FASTQ files
-o / -O | Output trimmed R1 / R2 FASTQ files
--json | Path for the machine-readable JSON report (used by MultiQC)
--html | Path for the human-readable HTML report
--thread | Number of CPU threads (match your --cpus-per-task)
--detect_adapter_for_pe | Automatically detect adapter sequences for paired-end data; recommended for most Illumina libraries
--qualified_quality_phred 20 | A base counts as unqualified if its Phred score is below Q20. Reads in which more than 40% of bases are unqualified (the --unqualified_percent_limit default) are discarded. Sliding-window trimming uses a separate threshold, --cut_mean_quality.
--length_required 50 | Discard reads shorter than 50 bp after trimming (prevents very short reads from causing misalignments)
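To see how the per-base threshold turns into a per-read decision, the sketch below computes the percentage of bases in one quality string that fall below a given Phred score. It only illustrates the idea behind --qualified_quality_phred and --unqualified_percent_limit and is not fastp's own code; Phred+33 encoding is assumed and unqualified_percent is a hypothetical helper.

```shell
# unqualified_percent QUALITY_STRING [THRESHOLD]
# Percentage of bases whose Phred score is below THRESHOLD (default 20).
unqualified_percent() {
    printf '%s\n' "$1" | awk -v q="${2:-20}" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
        {
            n = length($0); low = 0
            for (i = 1; i <= n; i++)
                if (ord[substr($0, i, 1)] - 33 < q) low++
            if (n > 0) printf "%.1f\n", 100 * low / n
        }'
}
```

A read with quality string IIII#### scores 50.0, which is above the default 40% limit, so fastp would be expected to discard it as low quality.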
Tip: Additional fastp options worth knowing
  • --cut_tail: Enable 3′ sliding-window quality trimming (the window moves in from the 3′ end and bases are dropped while its mean quality is below --cut_mean_quality).
  • --low_complexity_filter: Remove low-complexity reads, i.e. reads in which fewer than 30% of bases (the default threshold) differ from the base that follows them.
  • --trim_poly_x: Trim poly-A / poly-T / poly-G / poly-C tails at the 3′ end.
  • --dedup: Remove duplicate reads based on exact sequence match (use cautiously for RNA-seq).

Inspect a fastp HTML report

Each sample produces its own fastp HTML report. Copy one to your local machine, then review the key sections described below.

# Copy a fastp report to your local machine (run on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/fastp/KO_1_SRR10045016_fastp.html ~/Desktop/

# Alternatively, connect to Ibex from your file explorer and open the report from there
Reading the fastp HTML report
  • Summary table: Shows total reads before and after filtering, the percentage of reads passing filters, adapter trimming rate, and the percentage of bases at Q20 and Q30. A passing rate below 80% may indicate a library quality issue.
  • Filtering result: Breaks down why reads were discarded — low quality, too short after trimming, too many Ns, or low complexity. “Too short” is the most common category when adapter contamination is present.
  • Insert size distribution: Shows the estimated fragment length. A bimodal or very short peak may indicate adapter dimers in the original library.
  • Quality plots (before/after): Side-by-side per-base and per-read quality distributions confirm that trimming improved the data.

Part 4: Post-trimming QC with FastQC

Re-running FastQC on the trimmed reads confirms that adapter contamination has been removed and that per-base quality has improved. Always compare pre- and post-trimming reports before proceeding to alignment.

Run FastQC on the trimmed reads

cd /ibex/user/$USER/workshop

# Run FastQC on all trimmed chr11 files
fastqc trimmed/*_trimmed_1.fastq.gz trimmed/*_trimmed_2.fastq.gz \
    -o qc/trimmed/ \
    --threads 4

# Verify outputs
ls -lh qc/trimmed/

Compare pre- and post-trimming reports

Open the FastQC report for the same sample from qc/raw/ and qc/trimmed/ side by side. Key differences to expect after trimming:

  • Adapter Content: Should change from WARN/FAIL to PASS — adapter signal should be gone or negligible.
  • Per Base Sequence Quality: The 3′ quality drop should be reduced or eliminated.
  • Read count: Slightly lower because very short reads were discarded (--length_required 50).
  • Sequence Duplication Levels: May remain FAIL — this is expected for RNA-seq and is not corrected by trimming.
Tip: When trimming is not enough

If FastQC still shows high adapter content after trimming, check whether the correct adapter sequences were detected. You can specify adapters manually in fastp using --adapter_sequence and --adapter_sequence_r2. Illumina TruSeq adapter sequences are widely documented.

Part 5: Aggregate Reports with MultiQC

MultiQC scans a directory tree for output files from many bioinformatics tools (FastQC, fastp, STAR, Salmon, Picard, and dozens more) and compiles them into a single interactive HTML report. This makes it easy to compare quality metrics across all samples at a glance.

Run MultiQC over all QC outputs

cd /ibex/user/$USER/workshop

# Aggregate FastQC (raw + trimmed) and fastp reports into one report
multiqc qc/ trimmed/ \
    --outdir qc/multiqc_report \
    --filename multiqc_report \
    --title "GSE136366 RNA-seq QC Report"

# List the output files
ls -lh qc/multiqc_report/
Tip: Useful MultiQC options
Option | Description
--outdir | Directory for the output report
--filename | Base name for the output HTML file
--title | Title displayed in the report header
--ignore | Exclude files matching a pattern (e.g., --ignore "*_raw_*")
--sample-names | Provide a TSV file to rename samples in the report
-f | Force overwrite if a report already exists

Open the MultiQC report

# Copy the MultiQC report to your local machine (run on your LOCAL terminal)
scp -r username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/multiqc_report/ ~/Desktop/multiqc_report/

# Alternatively, connect to Ibex from your file explorer
# Navigate to /ibex/user/<username>/workshop/qc/multiqc_report/
# Click multiqc_report.html to open it in the browser

Interpret the MultiQC report

The MultiQC report is organized into collapsible sections, one per tool. Here is what to look for:

Key MultiQC sections for this lab
  • General Statistics table (top of report): A summary row per sample showing % duplicates, % GC, average read quality, % reads passing FastQC, total reads after fastp filtering, and % adapter trimmed. Use this table to spot any sample that looks like an outlier.
  • FastQC: Per Sequence Quality Scores: All samples should cluster together. A sample shifted far to the left (lower quality) may need special attention.
  • FastQC: Adapter Content: Compare the raw vs. trimmed adapter plots side by side. After trimming, all lines should be flat near 0%.
  • fastp: Filtering Results: A stacked bar chart showing reads that passed vs. were discarded per sample. If a sample loses >20% of reads, investigate the reason (adapter dimers, poor quality, short insert sizes).
  • fastp: Insert Size: The insert size distribution across samples. For RNA-seq, a median insert size of 150–300 bp is typical.
Tip: Using MultiQC throughout the course

MultiQC recognizes outputs from STAR (alignment), Salmon (quantification), Picard (duplication), RSeQC (BAM QC), and many other tools. As you complete later labs, simply re-run MultiQC pointing to the entire workshop directory to get an up-to-date combined report covering all analysis steps.

Exercises

Work through these exercises using the commands and concepts you have practised in the lab above. Use the tool reports and the MultiQC summary to find the answers.

Exercise 1: Filtering rate per sample

What percentage of reads were removed by fastp for one sample of your choice? Check both the fastp HTML report and the JSON file to find this value.

Hint: Open qc/fastp/KO_1_SRR10045016_fastp.html and look at the "Filtering result" summary table. Alternatively, inspect the JSON file:

grep -A5 '"filtering_result"' qc/fastp/KO_1_SRR10045016_fastp.json
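Once you have the read totals before and after filtering (the summary.before_filtering.total_reads and summary.after_filtering.total_reads fields of the fastp JSON), the percentage removed is simple arithmetic. The helper below is a minimal sketch with a hypothetical name; it assumes the "before" count is non-zero.

```shell
# percent_removed BEFORE AFTER
# Percentage of reads discarded: 100 * (BEFORE - AFTER) / BEFORE
percent_removed() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", 100 * (before - after) / before }'
}
```

For example, percent_removed 1000 950 prints 5.00. If jq happens to be installed (an assumption; it is not loaded anywhere in this lab), jq '.summary.before_filtering.total_reads' on the JSON file is one way to pull out the first number.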

Exercise 2: Adapter detection in raw reads

Do any samples show notable adapter content in the raw FastQC reports? Which adapter sequence(s) were detected by fastp? Are the same adapters present in all samples or do they differ between samples?

Hint: Check the "Adapter Content" module in the FastQC reports for the raw files in qc/raw/. Cross-reference with the "Adapter" section in the corresponding fastp HTML report.

Exercise 3: Quality improvement after trimming

Open the FastQC "Per Base Sequence Quality" plot for the same sample before trimming (qc/raw/) and after trimming (qc/trimmed/). What changed? Did trimming improve the 3′ quality drop? Were any other modules upgraded from WARN/FAIL to PASS?

Hint: The MultiQC report can display pre- and post-trimming FastQC results together. Look for sample names like KO_1_SRR10045016_1 (raw) vs. KO_1_SRR10045016_trimmed_1 (trimmed) in the per-sample plots.

Exercise 4: Read counts after trimming

How many reads remain in the trimmed files across all samples? Use seqkit stats on the trimmed R1 files to produce a summary table.

module load seqkit
seqkit stats trimmed/*_trimmed_1.fastq.gz

Compare the read counts to the chr11 input files (chr11_raw_data/*_1.fastq.gz). What is the average retention rate across samples?
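One way to get the retention rate is to save both seqkit tables in tab-separated form (seqkit stats -T) and divide the num_seqs columns row by row. The function below is a sketch under the assumption that both tables list the samples in the same order, which holds when the same glob pattern order is used for raw and trimmed files; retention_table and the .tsv file names are hypothetical.

```shell
# retention_table RAW.tsv TRIMMED.tsv
# Inputs: outputs of `seqkit stats -T` (header line, num_seqs in column 4).
# Prints "trimmed_file<TAB>percent_retained" per sample, then a final "mean" line.
retention_table() {
    awk -F '\t' '
        FNR == 1  { next }                                         # skip each header
        NR == FNR { gsub(/,/, "", $4); raw[FNR] = $4; next }       # first file: raw counts
        {
            gsub(/,/, "", $4)                                      # tolerate thousands separators
            pct = 100 * $4 / raw[FNR]
            sum += pct; n++
            printf "%s\t%.2f\n", $1, pct
        }
        END { if (n) printf "mean\t%.2f\n", sum / n }
    ' "$1" "$2"
}
```

A possible workflow: seqkit stats -T chr11_raw_data/*_1.fastq.gz > raw.tsv, then seqkit stats -T trimmed/*_trimmed_1.fastq.gz > trimmed.tsv, then retention_table raw.tsv trimmed.tsv.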

Summary

In this lab you have:

  • Loaded fastqc, fastp, and multiqc modules on the Ibex HPC cluster
  • Run FastQC on all chr11-filtered FASTQ files from GSE136366 and interpreted the per-base quality, adapter content, and duplication modules
  • Understood Phred quality scores and the meaning of the PASS / WARN / FAIL flags in FastQC
  • Trimmed adapters and low-quality bases from all samples using fastp in a loop, producing trimmed FASTQ files and per-sample QC reports
  • Re-run FastQC on the trimmed reads and confirmed quality improvements
  • Aggregated all FastQC and fastp reports into a single interactive summary using MultiQC

Your trimmed reads in /ibex/user/$USER/workshop/trimmed/ are now ready for alignment.

Files produced in this lab
LocationContents
trimmed/*_trimmed_1/2.fastq.gzAdapter-trimmed, quality-filtered chr11 paired FASTQ files (input for Lab 4)
qc/raw/FastQC reports for raw reads
qc/trimmed/FastQC reports for trimmed reads
qc/fastp/fastp HTML and JSON reports per sample
qc/multiqc_report/multiqc_report.htmlAggregated MultiQC report

RNA-seq Data Analysis Course — KAUST Bioinformatics Platform