Lab 3: QC and Preprocessing
In this lab you will assess the quality of the chromosome 11-filtered reads from GSE136366, trim adapters and low-quality bases using fastp, re-assess quality after trimming, and aggregate all QC reports into a single interactive summary with MultiQC.
Working with chr11-filtered reads (~3–4 million reads per sample) gives enough data for biologically meaningful QC, alignment, and quantification results while keeping runtimes under a few minutes per sample on Ibex. Chromosome 11 encodes many well-studied neuronal genes relevant to TDP-43 biology, making it an ideal subset for this dataset.
From Lab 6 onwards, the full dataset is used since nf-core pipelines are optimised for large-scale data.
- Run FastQC on raw reads and interpret the key report modules
- Understand Phred quality scores and what PASS / WARN / FAIL mean
- Trim adapters and low-quality bases with fastp using a loop over all samples
- Re-run FastQC on trimmed reads to verify improvement
- Aggregate all QC outputs into one report with MultiQC
Part 1: Setup on Ibex
All work in this lab is performed on the KAUST Ibex HPC cluster. Start an interactive session and load the required tools before running any commands.
Connect to Ibex and start an interactive session
# Connect to the Ibex login node
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive compute session (never run analyses on login nodes!)
srun --pty --time=4:00:00 --mem=16G --cpus-per-task=4 bash
For this lab the commands are fast enough to run interactively. For full-size datasets, submit them as SLURM batch jobs (sbatch) so they can run unattended overnight.
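As a rough illustration of the batch route, the sketch below writes a minimal SLURM script that would run the raw-read FastQC step from Part 2 unattended. The resource values mirror the interactive `srun` request above. The file name `fastqc_raw.sbatch`, the job name, and the log file pattern are placeholders chosen for this example, not course conventions.

```shell
# Write a minimal batch script (file name and job name are hypothetical)
cat > fastqc_raw.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_raw
#SBATCH --time=4:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4
#SBATCH --output=fastqc_raw_%j.log

module load fastqc
cd /ibex/user/$USER/workshop
fastqc chr11_raw_data/*_1.fastq.gz chr11_raw_data/*_2.fastq.gz \
    -o qc/raw/ \
    --threads "$SLURM_CPUS_PER_TASK"
EOF

# Check the script for shell syntax errors without running it
bash -n fastqc_raw.sbatch && echo "syntax OK"

# Submit on Ibex (left commented so the sketch has no side effects):
# sbatch fastqc_raw.sbatch
```

The heredoc delimiter is quoted so that `$USER` and `$SLURM_CPUS_PER_TASK` are expanded when the job runs, not when the file is written.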
Load required modules
# Load the three tools used in this lab
module load fastqc
module load fastp
module load multiqc
# Confirm all three are loaded
module list
# Quick version checks
fastqc --version
fastp --version
multiqc --version
If module load fastqc fails, search for the correct module name with module avail fastqc. Module names are sometimes case-sensitive or include a version suffix (e.g., fastqc/0.12.1).
Navigate to your workspace and create output directories
# Go to your workshop directory
cd /ibex/user/$USER/workshop
# Create subdirectories for trimmed reads and QC results
mkdir -p trimmed
mkdir -p qc/raw
mkdir -p qc/trimmed
mkdir -p qc/fastp
mkdir -p qc/multiqc_report
# Verify the structure
ls -la
Copy the chr11 FASTQ files from the shared course directory
For Labs 3–5 we work with a chromosome 11-filtered subset of GSE136366. These files are pre-prepared on Ibex shared storage — copying takes only a few seconds.
cd /ibex/user/$USER/workshop
# Create the dedicated directory for chr11 data
mkdir -p chr11_raw_data
# Copy all chr11 FASTQs from the shared Ibex path
cp /biocorelab/BIX/resources/datasets/rnaseq/GSE136366_KO_chr11/*.fastq.gz chr11_raw_data/
# Verify — you should see 12 files (6 samples × 2 paired-end reads)
ls -lh chr11_raw_data/
KO_1_SRR10045016_1.fastq.gz KO_1_SRR10045016_2.fastq.gz
KO_2_SRR10045017_1.fastq.gz KO_2_SRR10045017_2.fastq.gz
KO_3_SRR10045018_1.fastq.gz KO_3_SRR10045018_2.fastq.gz
WT_1_SRR10045019_1.fastq.gz WT_1_SRR10045019_2.fastq.gz
WT_2_SRR10045020_1.fastq.gz WT_2_SRR10045020_2.fastq.gz
WT_3_SRR10045021_1.fastq.gz WT_3_SRR10045021_2.fastq.gz
Naming convention: {condition}_{replicate}_{SRR}_{read}.fastq.gz, where KO = TDP-43 knockout and WT = wild type (control).
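Because every later step assumes complete read pairs, it can help to confirm that each `_1` file has a matching `_2` file. The helper below is a small sketch that relies only on the naming convention above; the function name `check_pairs` is made up for this example.

```shell
# Count complete R1/R2 pairs in a directory and report any R1 without a mate.
# Returns non-zero if at least one mate is missing.
check_pairs() {
    local dir="$1" pairs=0 missing=0 r1 r2
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_1.fastq.gz}_2.fastq.gz"        # swap the _1 suffix for _2
        if [ -e "$r2" ]; then
            pairs=$((pairs + 1))
        else
            echo "Missing mate for: $r1"
            missing=$((missing + 1))
        fi
    done
    echo "$pairs complete pairs, $missing missing mates"
    [ "$missing" -eq 0 ]
}

check_pairs chr11_raw_data
```

For the course data the expected result is 6 complete pairs and no missing mates.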
Verify read counts
module load seqkit
seqkit stats chr11_raw_data/*_1.fastq.gz
Expect approximately 3–4 million reads per sample — sufficient for biologically meaningful QC, alignment, and quantification on chromosome 11.
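If seqkit is not available, read counts can be cross-checked with standard tools. This sketch assumes conventional four-line FASTQ records (no wrapped sequence lines); `count_reads` is a name chosen for the example.

```shell
# Count reads in a gzipped FASTQ: each record spans exactly four lines.
count_reads() {
    gzip -dc "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print one "file <tab> read count" line per R1 file
for fq in chr11_raw_data/*_1.fastq.gz; do
    [ -e "$fq" ] || continue
    printf '%s\t%s\n' "$(basename "$fq")" "$(count_reads "$fq")"
done
```

The counts should agree with the num_seqs column that seqkit reports for the same files.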
Part 2: Quality Control with FastQC
FastQC is the most widely used tool for assessing the quality of raw sequencing reads. It reads one or more FASTQ files and produces an HTML report with a series of diagnostic plots and summary flags (PASS / WARN / FAIL) for each quality module.
| Module | What to look for |
|---|---|
| Per Base Sequence Quality | Quality should stay above Q28 across the full read length. A drop at the 3′ end is normal for Illumina reads. |
| Per Sequence Quality Scores | The distribution should be unimodal and shifted towards high scores (Q30+). |
| Adapter Content | Adapter signal above ~5% means many inserts are shorter than the read length, so the reads run into the adapter. Trim before alignment. |
| Sequence Duplication Levels | RNA-seq libraries commonly show high duplication due to highly expressed transcripts. WARN/FAIL here is expected. |
| Overrepresented Sequences | WARN/FAIL often caused by adapter dimers or rRNA — worth investigating if >1% of reads. |
| Per Base Sequence Content | The first 10–15 bp may show biased content due to random hexamer priming — this is expected in RNA-seq. |
Each base in a FASTQ file has a quality score encoded as a single ASCII character. The numeric Phred score Q is defined as:
Q = -10 * log10(P) where P = probability of a wrong base call
| Phred Score | Error probability | Accuracy | Typical threshold |
|---|---|---|---|
| Q10 | 1 in 10 | 90% | Poor — discard or trim |
| Q20 | 1 in 100 | 99% | Minimum acceptable |
| Q30 | 1 in 1,000 | 99.9% | Good quality |
| Q40 | 1 in 10,000 | 99.99% | Excellent quality |
Most downstream tools (aligners, variant callers) recommend at least Q20 per-base quality. Aim for ≥80% of bases at Q30 or above.
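To make the formula concrete, the sketch below converts a Phred score to its error probability and decodes a single FASTQ quality character. It assumes the Phred+33 encoding used by current Illumina output; both function names are illustrative.

```shell
# Error probability for a Phred score: P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score for one FASTQ quality character, assuming Phred+33 encoding
char_to_phred() {
    local ascii
    ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
    echo $(( ascii - 33 ))
}

phred_to_prob 20    # 0.01
phred_to_prob 30    # 0.001
char_to_phred 'I'   # 40
char_to_phred '5'   # 20
```

So a quality string made entirely of `I` characters corresponds to Q40 at every position, the top row of the table above.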
Run FastQC on all raw chr11 FASTQ files
cd /ibex/user/$USER/workshop
# Run FastQC on all chr11 R1 and R2 files simultaneously
fastqc chr11_raw_data/*_1.fastq.gz chr11_raw_data/*_2.fastq.gz \
-o qc/raw/ \
--threads 4
# Check the output — one .html and one .zip per file
ls -lh qc/raw/
| Option | Description |
|---|---|
| -o | Output directory for reports |
| --threads | Number of files to process in parallel (set to match --cpus-per-task) |
| --extract | Unzip the output ZIP archives automatically |
| --quiet | Suppress progress messages (useful in scripts) |
View the FastQC HTML report
FastQC produces one .html report per input file. To view it, either transfer it to your local computer or open it through a file explorer connected to Ibex.
# Option A: Copy a report to your local machine (run this on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/raw/KO_1_SRR10045016_1_fastqc.html ~/Desktop/
# Option B: connect to Ibex with a file explorer and open the report from there
Interpret key modules for GSE136366
When reviewing the FastQC report for this dataset, pay attention to:
- Per Base Sequence Quality: Look for a quality drop at the 3′ end of reads. Values consistently below Q28 warrant trimming.
- Adapter Content: Check whether Illumina Universal Adapter or TruSeq adapter sequences appear. If any sample shows >5% adapter contamination, trimming is essential.
- Sequence Duplication Levels: RNA-seq libraries almost always show high duplication levels (FAIL) because highly expressed genes contribute many identical reads. This is normal and does not need to be corrected at this stage.
- Per Base Sequence Content: Biased nucleotide composition in the first ~10 bp is a known artefact of random hexamer priming used during library preparation and is expected in RNA-seq data.
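Opening twelve HTML reports one by one is tedious, so a quick text summary of the flags can help decide which reports to inspect first. Each FastQC ZIP archive contains a tab-separated `summary.txt` (status, module, file name). The sketch below lists every module that is not PASS; `flagged_modules` is a hypothetical helper name.

```shell
# Print every module that is not PASS from FastQC summary.txt content.
# Each input line is: STATUS <tab> module name <tab> input file name
# Reads the files given as arguments, or standard input if none are given.
flagged_modules() {
    awk -F'\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$@"
}

# Read summary.txt straight out of each report archive, without unpacking it
for zip in qc/raw/*_fastqc.zip; do
    [ -e "$zip" ] || continue
    unzip -p "$zip" '*/summary.txt' | flagged_modules
done
```

For this dataset, expect Sequence Duplication Levels and Per Base Sequence Content to show up often, for the reasons given above.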
Part 3: Adapter Trimming and Quality Filtering with fastp
fastp is an all-in-one FASTQ preprocessing tool that automatically detects and removes adapter sequences, trims low-quality bases from read ends, filters short reads, and produces a rich HTML + JSON quality report — all in a single fast pass over the data.
- Adapter auto-detection: Infers adapter sequences from the data itself for paired-end libraries — no need to specify adapter sequences manually.
- Quality trimming: Can remove low-quality bases from read ends using a sliding window. This is optional and is switched on with flags such as --cut_tail; it is not applied by default.
- Length filtering: Discards reads that become too short after trimming (typically <50 bp).
- Low-complexity filtering: Optionally removes reads dominated by repetitive sequences (e.g., poly-A tails).
- QC reports: Produces an HTML report and a machine-readable JSON file per sample, which MultiQC can aggregate.
Run fastp on all samples using a loop
cd /ibex/user/$USER/workshop
# Loop over every SRR accession and trim each pair
for r1 in chr11_raw_data/*_1.fastq.gz; do
    sample=$(basename "$r1" _1.fastq.gz)
    echo "Trimming $sample ..."
    fastp \
        -i "chr11_raw_data/${sample}_1.fastq.gz" \
        -I "chr11_raw_data/${sample}_2.fastq.gz" \
        -o "trimmed/${sample}_trimmed_1.fastq.gz" \
        -O "trimmed/${sample}_trimmed_2.fastq.gz" \
        --json "qc/fastp/${sample}_fastp.json" \
        --html "qc/fastp/${sample}_fastp.html" \
        --thread 4 \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50
done
echo "Trimming complete."
ls -lh trimmed/
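Before launching the loop it can be useful to see exactly which file names it will produce. The sketch below repeats the loop's name derivation for a single R1 path and prints the resulting paths without calling fastp; `plan_pair` is a name invented for this illustration.

```shell
# Show how the loop derives each name from an R1 path, without running fastp
plan_pair() {
    local r1="$1" sample
    sample=$(basename "$r1" _1.fastq.gz)   # strip the directory and the _1.fastq.gz suffix
    echo "sample: $sample"
    echo "in:     chr11_raw_data/${sample}_1.fastq.gz chr11_raw_data/${sample}_2.fastq.gz"
    echo "out:    trimmed/${sample}_trimmed_1.fastq.gz trimmed/${sample}_trimmed_2.fastq.gz"
    echo "report: qc/fastp/${sample}_fastp.json qc/fastp/${sample}_fastp.html"
}

plan_pair chr11_raw_data/KO_1_SRR10045016_1.fastq.gz
```

The sample name here is `KO_1_SRR10045016`, which is why the trimmed files and fastp reports carry that prefix.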
Understand the key fastp parameters
| Parameter | Description |
|---|---|
| -i / -I | Input R1 / R2 FASTQ files |
| -o / -O | Output trimmed R1 / R2 FASTQ files |
| --json | Path for the machine-readable JSON report (used by MultiQC) |
| --html | Path for the human-readable HTML report |
| --thread | Number of CPU threads (match your --cpus-per-task) |
| --detect_adapter_for_pe | Automatically detect adapter sequences for paired-end data; recommended for most Illumina libraries |
| --qualified_quality_phred 20 | A base counts as qualified if its Phred score is at least Q20. Reads with too many unqualified bases (more than 40% by default, set with --unqualified_percent_limit) are discarded. Sliding-window trimming uses a separate threshold, --cut_mean_quality. |
| --length_required 50 | Discard reads shorter than 50 bp after trimming (prevents very short reads from causing misalignments) |
Optional parameters worth knowing:
- --cut_tail: Enable 3′ sliding-window quality trimming (drops trailing windows whose mean quality is below --cut_mean_quality).
- --low_complexity_filter: Remove low-complexity reads, that is, reads whose complexity (the percentage of bases that differ from the next base) falls below the threshold, 30% by default.
- --trim_poly_x: Trim poly-A / poly-T / poly-G / poly-C tails at the 3′ end.
- --dedup: Remove duplicate reads based on exact sequence match (use cautiously for RNA-seq).
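To see what counting unqualified bases means in practice, the sketch below computes the percentage of bases in one quality string that fall below a Phred threshold. It is a simplified illustration of the idea behind `--qualified_quality_phred` and `--unqualified_percent_limit`, not fastp's own code, and it assumes Phred+33 encoding; `unqualified_percent` is a made-up name.

```shell
# Percentage of bases in a quality string with Phred below a threshold
# (default 20). Assumes Phred+33 encoding.
unqualified_percent() {
    local threshold="${2:-20}"
    printf '%s\n' "$1" | awk -v t="$threshold" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }   # character -> ASCII code
        {
            n = length($0); low = 0
            for (i = 1; i <= n; i++)
                if (ord[substr($0, i, 1)] - 33 < t) low++
            printf "%.1f\n", (n ? 100 * low / n : 0)
        }'
}

unqualified_percent 'IIIIII####'   # 40.0
unqualified_percent 'IIIII#####'   # 50.0
```

With fastp's default limit of 40%, the first read would be kept and the second discarded as low quality.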
Inspect a fastp HTML report
Each sample produces its own fastp HTML report. Copy one to your local machine or open it directly on Ibex:
# Option A: copy a fastp report to your local machine (run on your LOCAL terminal)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/fastp/KO_1_SRR10045016_fastp.html ~/Desktop/
# Option B: connect to Ibex with a file explorer and open the report from there
Key sections to review:
- Summary table: Shows total reads before and after filtering, the percentage of reads passing filters, adapter trimming rate, and the percentage of bases at Q20 and Q30. A passing rate below 80% may indicate a library quality issue.
- Filtering result: Breaks down why reads were discarded — low quality, too short after trimming, too many Ns, or low complexity. “Too short” is the most common category when adapter contamination is present.
- Insert size distribution: Shows the estimated fragment length. A bimodal or very short peak may indicate adapter dimers in the original library.
- Quality plots (before/after): Side-by-side per-base and per-read quality distributions confirm that trimming improved the data.
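The passing rate in the summary table can also be pulled from the JSON reports for all samples at once. The sketch below is a text-matching shortcut, not a JSON parser: it assumes that the first two `"total_reads"` entries in a fastp JSON file are the before-filtering and after-filtering totals from the summary section, in that order. If `jq` is installed, parsing the file properly is the more robust option. `pass_rate` is a name chosen for this example.

```shell
# Percentage of reads that passed filtering, read from a fastp JSON report.
# Assumption: the first two "total_reads" entries are the summary's
# before_filtering and after_filtering values, in that order.
pass_rate() {
    grep -o '"total_reads": *[0-9]*' "$1" | head -n 2 | grep -oE '[0-9]+$' |
        awk 'NR == 1 { before = $1 } NR == 2 { after = $1 }
             END { if (before > 0) printf "%.2f\n", 100 * after / before }'
}

# One "sample <tab> pass rate" line per fastp report
for json in qc/fastp/*_fastp.json; do
    [ -e "$json" ] || continue
    printf '%s\t%s%%\n' "$(basename "$json" _fastp.json)" "$(pass_rate "$json")"
done
```

Any sample that prints a value below 80% is worth a closer look in its HTML report.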
Part 4: Post-trimming QC with FastQC
Re-running FastQC on the trimmed reads confirms that adapter contamination has been removed and that per-base quality has improved. Always compare pre- and post-trimming reports before proceeding to alignment.
Run FastQC on the trimmed reads
cd /ibex/user/$USER/workshop
# Run FastQC on all trimmed chr11 files
fastqc trimmed/*_trimmed_1.fastq.gz trimmed/*_trimmed_2.fastq.gz \
-o qc/trimmed/ \
--threads 4
# Verify outputs
ls -lh qc/trimmed/
Compare pre- and post-trimming reports
Open the FastQC report for the same sample from qc/raw/ and qc/trimmed/ side by side. Key differences to expect after trimming:
- Adapter Content: Should change from WARN/FAIL to PASS — adapter signal should be gone or negligible.
- Per Base Sequence Quality: The 3′ quality drop should be reduced or eliminated.
- Read count: Slightly lower because very short reads were discarded (--length_required 50).
- Sequence Duplication Levels: May remain FAIL. This is expected for RNA-seq and is not corrected by trimming.
If FastQC still shows high adapter content after trimming, check whether the correct adapter sequences were detected. You can specify adapters manually in fastp using --adapter_sequence and --adapter_sequence_r2. Illumina TruSeq adapter sequences are widely documented.
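A crude way to double-check for leftover adapter is to count reads that still contain the start of the adapter sequence. The sketch below searches sequence lines for `AGATCGGAAGAGC`, the 13 bases shared by the Illumina TruSeq read 1 and read 2 adapters. It finds exact matches only, so a handful of chance hits are possible and partial adapters at read ends are missed. `count_adapter_reads` is a hypothetical helper name.

```shell
# Count reads whose sequence line contains an adapter prefix.
# Default: AGATCGGAAGAGC, the 13-base start shared by the Illumina TruSeq adapters.
count_adapter_reads() {
    local fastq="$1" adapter="${2:-AGATCGGAAGAGC}"
    gzip -dc "$fastq" |
        awk -v a="$adapter" 'NR % 4 == 2 && index($0, a) { n++ } END { print n + 0 }'
}

# One "file <tab> reads containing the adapter prefix" line per trimmed R1 file
for fq in trimmed/*_trimmed_1.fastq.gz; do
    [ -e "$fq" ] || continue
    printf '%s\t%s\n' "$(basename "$fq")" "$(count_adapter_reads "$fq")"
done
```

Running the same function on the matching file in chr11_raw_data/ gives a before-and-after comparison.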
Part 5: Aggregate Reports with MultiQC
MultiQC scans a directory tree for output files from many bioinformatics tools (FastQC, fastp, STAR, Salmon, Picard, and dozens more) and compiles them into a single interactive HTML report. This makes it easy to compare quality metrics across all samples at a glance.
Run MultiQC over all QC outputs
cd /ibex/user/$USER/workshop
# Aggregate FastQC (raw + trimmed) and fastp reports into one report
multiqc qc/ trimmed/ \
--outdir qc/multiqc_report \
--filename multiqc_report \
--title "GSE136366 RNA-seq QC Report"
# List the output files
ls -lh qc/multiqc_report/
| Option | Description |
|---|---|
| --outdir | Directory for the output report |
| --filename | Base name for the output HTML file |
| --title | Title displayed in the report header |
| --ignore | Exclude files matching a pattern (e.g., --ignore "*_raw_*") |
| --sample-names | Provide a TSV file to rename samples in the report |
| -f | Force overwrite if a report already exists |
Open the MultiQC report
# Copy the MultiQC report to your local machine (run on your LOCAL terminal)
scp -r username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/qc/multiqc_report/ ~/Desktop/multiqc_report/
# Option B: connect to Ibex with a file explorer
# Navigate to /ibex/user/<username>/workshop/qc/multiqc_report/
# Click multiqc_report.html to open in the browser
Interpret the MultiQC report
The MultiQC report is organized into collapsible sections, one per tool. Here is what to look for:
- General Statistics table (top of report): A summary row per sample with columns such as % duplicates, % GC, total sequences, reads after fastp filtering, % reads passing fastp filters, and % adapter-trimmed reads. Use this table to spot any sample that looks like an outlier.
- FastQC: Per Sequence Quality Scores: All samples should cluster together. A sample shifted far to the left (lower quality) may need special attention.
- FastQC: Adapter Content: Compare the raw vs. trimmed adapter plots side by side. After trimming, all lines should be flat near 0%.
- fastp: Filtering Results: A stacked bar chart showing reads that passed vs. were discarded per sample. If a sample loses >20% of reads, investigate the reason (adapter dimers, poor quality, short insert sizes).
- fastp: Insert Size: The insert size distribution across samples. For RNA-seq, a median insert size of 150–300 bp is typical.
MultiQC recognizes outputs from STAR (alignment), Salmon (quantification), Picard (duplication), RSeQC (BAM QC), and many other tools. As you complete later labs, simply re-run MultiQC pointing to the entire workshop directory to get an up-to-date combined report covering all analysis steps.
Exercises
Work through these exercises using the commands and concepts you have practised in the lab above. Use the tool reports and the MultiQC summary to find the answers.
Exercise 1: Filtering rate per sample
What percentage of reads were removed by fastp for one sample of your choice? Check both the fastp HTML report and the JSON file to find this value.
Hint: Open qc/fastp/KO_1_SRR10045016_fastp.html and look at the "Filtering result" summary table. Alternatively, inspect the JSON file:
grep -A5 '"filtering_result"' qc/fastp/KO_1_SRR10045016_fastp.json
Exercise 2: Adapter detection in raw reads
Do any samples show notable adapter content in the raw FastQC reports? Which adapter sequence(s) were detected by fastp? Are the same adapters present in all samples or do they differ between samples?
Hint: Check the "Adapter Content" module in the FastQC reports for the raw files in qc/raw/. Cross-reference with the "Adapter" section in the corresponding fastp HTML report.
Exercise 3: Quality improvement after trimming
Open the FastQC "Per Base Sequence Quality" plot for the same sample before trimming (qc/raw/) and after trimming (qc/trimmed/). What changed? Did trimming improve the 3′ quality drop? Were any other modules upgraded from WARN/FAIL to PASS?
Hint: The MultiQC report can display pre- and post-trimming FastQC results together. Look for sample names like KO_1_SRR10045016_1 (raw) vs. KO_1_SRR10045016_trimmed_1 (trimmed) in the per-sample plots.
Exercise 4: Read counts after trimming
How many reads remain in the trimmed files across all samples? Use seqkit stats on the trimmed R1 files to produce a summary table.
module load seqkit
seqkit stats trimmed/*_trimmed_1.fastq.gz
Compare the read counts to the chr11 input files (chr11_raw_data/*_1.fastq.gz). What is the average retention rate across samples?
Summary
In this lab you have:
- Loaded the fastqc, fastp, and multiqc modules on the Ibex HPC cluster
- Run FastQC on all chr11-filtered FASTQ files from GSE136366 and interpreted the per-base quality, adapter content, and duplication modules
- Understood Phred quality scores and the meaning of the PASS / WARN / FAIL flags in FastQC
- Trimmed adapters and low-quality bases from all samples using fastp in a loop, producing trimmed FASTQ files and per-sample QC reports
- Re-run FastQC on the trimmed reads and confirmed quality improvements
- Aggregated all FastQC and fastp reports into a single interactive summary using MultiQC
Your trimmed reads in /ibex/user/$USER/workshop/trimmed/ are now ready for alignment.
| Location | Contents |
|---|---|
| trimmed/*_trimmed_1/2.fastq.gz | Adapter-trimmed, quality-filtered chr11 paired FASTQ files (input for Lab 4) |
| qc/raw/ | FastQC reports for raw reads |
| qc/trimmed/ | FastQC reports for trimmed reads |
| qc/fastp/ | fastp HTML and JSON reports per sample |
| qc/multiqc_report/multiqc_report.html | Aggregated MultiQC report |