Lab 3: Quality Control Hands-on

Learning Objectives

Run FastQC to assess raw read quality
Interpret FastQC reports and identify quality issues
Use fastp for quality trimming and adapter removal
Compare pre- and post-trimming quality with MultiQC

Why Quality Control?

Common Quality Issues in RNA-seq Data

Low-quality bases: Most common toward the 3' end of reads, where base-call quality declines over sequencing cycles
Adapter contamination: Adapter sequence read through at the 3' end when the insert is shorter than the read length
Overrepresented sequences: PCR duplicates, rRNA, or contamination
GC bias: Non-uniform GC content distribution
N bases: Positions where the sequencer could not make a confident base call

Quality control ensures reliable downstream analysis by identifying and addressing these issues.

Part 1: FastQC Analysis

FastQC provides a comprehensive quality report for each FASTQ file.

Run FastQC on raw data

cd ~/genomics

# Activate environment
conda activate genomics

# Create the output directory (FastQC does not create it), then run FastQC on all raw FASTQ files
mkdir -p qc_reports/fastqc_raw
fastqc raw_data/*.fastq.gz -o qc_reports/fastqc_raw -t 4

# List the generated reports
ls -la qc_reports/fastqc_raw/

FastQC generates two files per input FASTQ file: an HTML report and a ZIP archive containing the underlying data.

View FastQC reports

Open the HTML reports in qc_reports/fastqc_raw/ in a web browser.

Alternatively, read the summary.txt file inside each ZIP archive in qc_reports/fastqc_raw/ for a plain-text PASS/WARN/FAIL status per module.
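
For a quick look without a browser, the module statuses can be read straight from the ZIP archive. The sketch below assumes the usual FastQC layout, in which summary.txt is tab-separated (status, module name, file name). flag_modules is a hypothetical helper name, and the archive name in the usage comment is inferred from the sample names used later in this lab.

```shell
# Print every FastQC module that did not PASS.
# Reads a FastQC summary.txt (status<TAB>module<TAB>filename) on stdin.
flag_modules() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example usage (archive name assumed from the FastQC naming scheme):
#   unzip -p qc_reports/fastqc_raw/KO_1_SRR10045016_1_fastqc.zip '*/summary.txt' | flag_modules
```

An empty result means every module passed for that file.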

Part 2: Interpreting FastQC Reports

FastQC Modules Overview

Module	What it Shows	Common Issues
Per base sequence quality	Quality scores across read positions	Low quality at 3' end
Per sequence quality scores	Distribution of mean quality per read	Bimodal distribution
Per base sequence content	A/T/G/C proportions per position	Bias at read start (normal for RNA-seq)
Per sequence GC content	GC% distribution across reads	Multiple peaks (contamination)
Sequence duplication levels	Read duplication rate	High duplication (normal for RNA-seq)
Overrepresented sequences	Frequently occurring sequences	Adapters, rRNA
Adapter content	Adapter sequence presence	Adapter contamination
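
The quality scores in the first two modules are Phred-scaled: a score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that a base call is wrong and Q30 means 1 in 1000. A minimal sketch of that conversion follows; phred_to_prob is an illustrative helper and not part of FastQC.

```shell
# Convert a Phred quality score Q to the probability that the base call is wrong.
# P = 10^(-Q/10)
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_prob 20   # prints 0.01
phred_to_prob 30   # prints 0.001
```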

RNA-seq Specific Considerations

Some FastQC warnings are expected for RNA-seq data:

Per base sequence content: Often shows bias in the first 10-15 bases due to random hexamer priming; this is normal
Sequence duplication: Higher duplication is expected due to highly expressed genes
GC content: May show slight deviations from theoretical distribution

Part 3: Quality Trimming with fastp

fastp is an all-in-one tool for quality control, trimming, and filtering.

Run fastp on one sample

cd ~/genomics

# Create the output directories if they do not exist yet
mkdir -p trimmed_data qc_reports/fastp

# Process one sample first to understand the output
fastp \
    --in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
    --in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
    --out1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
    --out2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    --qualified_quality_phred 20 \
    --length_required 36 \
    --detect_adapter_for_pe \
    --overrepresentation_analysis \
    --thread 4 \
    --json qc_reports/fastp/KO_1.json \
    --html qc_reports/fastp/KO_1.html

# View the HTML report (open is macOS-specific; on Linux use xdg-open)
open qc_reports/fastp/KO_1.html

fastp Parameters Explained

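--qualified_quality_phred 20: Bases below Q20 count as unqualified; a read is discarded when too many of its bases are unqualified (more than 40% by default, set by --unqualified_percent_limit)
--length_required 36: Discard reads shorter than 36 bp after trimming
--detect_adapter_for_pe: Auto-detect adapters for paired-end data
--overrepresentation_analysis: Report overrepresented sequences
--thread 4: Use 4 CPU threads
--json / --html: Paths for the JSON and HTML reports

Note that the quality setting filters whole reads; it does not trim individual low-quality bases. Per-base quality trimming requires fastp's sliding-window options such as --cut_right or --cut_tail, which are not used in this lab.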
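
To make the per-read quality filter concrete, the sketch below computes the percentage of bases in one Phred+33 quality string that fall below a threshold, which is the quantity fastp compares against --unqualified_percent_limit (40% by default). pct_unqualified is a hypothetical helper written for illustration; it is not how fastp is implemented internally.

```shell
# Percentage of bases in a Phred+33 quality string that fall below a threshold.
# $1 = quality string, $2 = Phred threshold (defaults to 20)
pct_unqualified() {
    printf '%s\n' "$1" | awk -v t="${2:-20}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        {
            n = length($0); low = 0
            for (i = 1; i <= n; i++)
                if (ord[substr($0, i, 1)] - 33 < t) low++
            printf "%.1f\n", (n ? 100 * low / n : 0)
        }'
}

pct_unqualified 'IIIIIIII##' 20   # 'I' is Q40, '#' is Q2; prints 20.0
```

A read with this quality string would pass the default 40% limit, since only 20% of its bases are below Q20.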

Process all samples with a loop

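cd ~/genomics

# Create the output and log directories if they do not exist yet
mkdir -p trimmed_data qc_reports/fastp logs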

# Define sample IDs
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    fastp \
        --in1 raw_data/${SAMPLE}_1.fastq.gz \
        --in2 raw_data/${SAMPLE}_2.fastq.gz \
        --out1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
        --out2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        --qualified_quality_phred 20 \
        --length_required 36 \
        --detect_adapter_for_pe \
        --overrepresentation_analysis \
        --thread 4 \
        --json qc_reports/fastp/${SAMPLE}.json \
        --html qc_reports/fastp/${SAMPLE}.html \
        2>> logs/fastp.log
done

ls -lh trimmed_data/
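
Because the loop sends stderr to a log file, a failed sample is easy to miss. One way to catch this is to confirm that every sample produced two non-empty trimmed files. The sketch below assumes the file naming used in the loop; check_trimmed is a hypothetical helper name.

```shell
# Report samples whose trimmed R1/R2 files are missing or empty.
# $1 = output directory, remaining arguments = sample IDs.
check_trimmed() {
    dir=$1; shift
    rc=0
    for sample in "$@"; do
        for mate in 1 2; do
            f="$dir/${sample}_${mate}.trimmed.fastq.gz"
            if [ ! -s "$f" ]; then
                echo "MISSING: $f"
                rc=1
            fi
        done
    done
    return "$rc"
}

# Example usage with the sample list defined for the loop:
#   check_trimmed trimmed_data $SAMPLES && echo "All samples trimmed"
```

If anything is reported as missing, logs/fastp.log is the first place to look.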

Check trimming results

# Compare file sizes before and after trimming
echo "=== Raw vs Trimmed File Sizes ==="
ls -lh raw_data/*_1.fastq.gz
ls -lh trimmed_data/*_1.trimmed.fastq.gz

# Get read counts before and after
seqkit stats raw_data/*_1.fastq.gz trimmed_data/*_1.trimmed.fastq.gz
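
The read counts reported by seqkit can also be derived by hand, because each FASTQ record occupies exactly four lines. The sketch below assumes standard four-line FASTQ records (no wrapped sequences); count_reads and pct_retained are hypothetical helper names.

```shell
# Count reads in a gzipped FASTQ file (4 lines per record).
count_reads() {
    gzip -cd "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Percentage of reads retained after trimming.
# $1 = raw FASTQ, $2 = trimmed FASTQ
pct_retained() {
    raw=$(count_reads "$1")
    kept=$(count_reads "$2")
    awk -v r="$raw" -v k="$kept" 'BEGIN { printf "%.2f\n", (r ? 100 * k / r : 0) }'
}

# Example usage:
#   pct_retained raw_data/KO_1_SRR10045016_1.fastq.gz trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz
```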

Part 4: Post-Trimming Quality Check

Run FastQC on trimmed data

cd ~/genomics

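# Create the output directory, then run FastQC on trimmed files
mkdir -p qc_reports/fastqc_trimmed
fastqc trimmed_data/*.fastq.gz -o qc_reports/fastqc_trimmed -t 4

# Compare a sample before and after trimming by opening its HTML reports
# in qc_reports/fastqc_raw/ and qc_reports/fastqc_trimmed/ side by side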
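
Besides browsing the HTML, the module statuses before and after trimming can be compared from the summary.txt files. The sketch below assumes both summaries have been extracted to plain text files; compare_summaries is a hypothetical helper, and the archive names in the usage comments are inferred from the FastQC naming scheme.

```shell
# List FastQC modules whose status changed between two summary.txt files.
# $1 = summary from the raw run, $2 = summary from the trimmed run
compare_summaries() {
    awk -F'\t' '
        NR == FNR { before[$2] = $1; next }      # first file: remember status per module
        ($2 in before) && before[$2] != $1 {     # second file: report changed statuses
            print $2 ": " before[$2] " -> " $1
        }' "$1" "$2"
}

# Example usage:
#   unzip -p qc_reports/fastqc_raw/KO_1_SRR10045016_1_fastqc.zip '*/summary.txt' > raw_summary.txt
#   unzip -p qc_reports/fastqc_trimmed/KO_1_SRR10045016_1.trimmed_fastqc.zip '*/summary.txt' > trimmed_summary.txt
#   compare_summaries raw_summary.txt trimmed_summary.txt
```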

Part 5: MultiQC Aggregated Report

MultiQC combines results from multiple samples into a single interactive report.

Generate MultiQC reports

cd ~/genomics

# Generate a comprehensive report with all QC data
multiqc qc_reports/ -o qc_reports -n multiqc_all --force

# Open the combined report in a browser (open is macOS-specific; on Linux use xdg-open)
open qc_reports/multiqc_all.html

Key MultiQC Sections to Review

General Statistics: Overview table of all samples
FastQC: Per-base quality, GC content, sequence duplication
fastp: Filtering statistics, adapter content, quality distribution

Use MultiQC to quickly identify outlier samples that may need special attention.

Exercises

Exercise 1: Quality Assessment

Which sample has the highest percentage of reads passing filters?
Compare the "Per base sequence quality" plots before and after trimming. What improved?
Are there any overrepresented sequences? What might they be?

Exercise 2: Parameter Exploration

Try running fastp with different parameters on one sample:

# More stringent quality filtering
fastp --in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
      --in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
      --out1 test_q30_1.fastq.gz \
      --out2 test_q30_2.fastq.gz \
      --qualified_quality_phred 30 \
      --length_required 50

# Compare read counts
seqkit stats trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz test_q30_1.fastq.gz
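
# Clean up the test output, including the default fastp.json and fastp.html reports
rm -f test_q30_*.fastq.gz fastp.json fastp.html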

How do the stricter quality threshold and the longer minimum read length affect the number of reads retained?

Summary

In this lab, you have:

Run FastQC to assess raw read quality
Interpreted FastQC reports and understood common metrics
Used fastp to remove adapters and filter out low-quality reads
Verified quality improvement after trimming
Generated aggregated MultiQC reports for easy comparison

Next: Lab 4 - Genome Alignment Hands-on