Lab 3: Quality Control Hands-on

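Learning Objectives

By the end of this lab, you should be able to:

  • Run FastQC on raw FASTQ files and interpret its reports
  • Recognize FastQC warnings that are expected for RNA-seq data
  • Trim and filter paired-end reads with fastp
  • Aggregate QC results across samples with MultiQC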

Why Quality Control?

Common Quality Issues in RNA-seq Data

  • Low quality bases: Especially toward the 3' end of reads, where base-call quality tends to decline
  • Adapter contamination: Sequencing adapters not fully removed
  • Overrepresented sequences: PCR duplicates or contamination
  • GC bias: Non-uniform GC content distribution
  • N bases: Positions where the sequencer could not call a base

Quality control ensures reliable downstream analysis by identifying and addressing these issues.
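
Two of the issues above, N bases and GC content, can be checked directly from a FASTQ file before running any QC tool. The following is a minimal sketch and not part of the lab pipeline; the function name fastq_base_summary and the two demo reads are made up for illustration, and GC% is computed over all bases including N.

```shell
# Hypothetical helper: summarise N bases and GC content of FASTQ records read from stdin.
fastq_base_summary() {
    awk 'NR % 4 == 2 {                        # sequence is the 2nd line of each 4-line record
             reads++
             total += length($0)
             n  += gsub(/[Nn]/, "")            # gsub returns the number of matches replaced
             gc += gsub(/[GCgc]/, "")
         }
         END {
             if (total == 0) { print "no reads"; exit }
             printf "reads=%d bases=%d N=%d GC=%.1f%%\n", reads, total, n, 100 * gc / total
         }'
}

# Demo on two made-up reads
printf '@r1\nACGTN\n+\nIIII#\n@r2\nGGCCA\n+\nIIIII\n' | fastq_base_summary

# On a real file (path taken from this lab):
# gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | fastq_base_summary
```

FastQC reports the same quantities in much more detail; a quick check like this is mainly useful for confirming that a file is readable and roughly as expected.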

Part 1: FastQC Analysis

FastQC provides a comprehensive quality report for each FASTQ file.

Run FastQC on raw data

cd ~/genomics

# Activate environment
conda activate genomics

# Create the output directory (FastQC does not create it), then run FastQC on all raw FASTQ files
mkdir -p qc_reports/fastqc_raw
fastqc raw_data/*.fastq.gz -o qc_reports/fastqc_raw -t 4

# List the generated reports
ls -la qc_reports/fastqc_raw/

FastQC generates two files per input FASTQ file: an HTML report and a ZIP archive with the underlying data.

View FastQC reports

Open each HTML report in qc_reports/fastqc_raw/ in a web browser.

Alternatively, read the summary.txt file inside the matching ZIP archive in qc_reports/fastqc_raw/. It lists a PASS, WARN, or FAIL status for each module.
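
With many samples, scanning summary.txt by hand gets tedious. The sketch below prints only the WARN and FAIL modules. It assumes each summary.txt line has three tab-separated fields (status, module name, file name); fastqc_flags is a hypothetical helper name and the demo lines are made up.

```shell
# Hypothetical helper: print WARN/FAIL modules from FastQC summary.txt content on stdin.
# Assumes three tab-separated columns: status, module name, file name.
fastqc_flags() {
    awk -F'\t' '$1 == "WARN" || $1 == "FAIL" { printf "%s\t%s\t%s\n", $3, $1, $2 }'
}

# Demo on made-up summary lines
printf 'PASS\tBasic Statistics\tsample_1.fastq.gz\nWARN\tPer base sequence content\tsample_1.fastq.gz\nFAIL\tSequence Duplication Levels\tsample_1.fastq.gz\n' | fastqc_flags

# With real reports (assumes unzip is installed):
# for z in qc_reports/fastqc_raw/*_fastqc.zip; do
#     unzip -p "$z" '*/summary.txt' | fastqc_flags
# done
```

Keep in mind that some WARN and FAIL flags are expected for RNA-seq data, so this list is a starting point for inspection rather than a verdict.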

Part 2: Interpreting FastQC Reports

FastQC Modules Overview

Module                       | What it Shows                          | Common Issues
Per base sequence quality    | Quality scores across read positions   | Low quality at 3' end
Per sequence quality scores  | Distribution of mean quality per read  | Bimodal distribution
Per base sequence content    | A/T/G/C proportions per position       | Bias at read start (normal for RNA-seq)
Per sequence GC content      | GC% distribution across reads          | Multiple peaks (contamination)
Sequence duplication levels  | Read duplication rate                  | High duplication (normal for RNA-seq)
Overrepresented sequences    | Frequently occurring sequences         | Adapters, rRNA
Adapter content              | Adapter sequence presence              | Adapter contamination

RNA-seq Specific Considerations

Some FastQC warnings are expected for RNA-seq data:

  • Per base sequence content: Often shows bias in the first 10-15 bases due to random hexamer priming; this is normal
  • Sequence duplication: Higher duplication is expected due to highly expressed genes
  • GC content: May show slight deviations from theoretical distribution

Part 3: Quality Trimming with fastp

fastp is an all-in-one tool for quality control, trimming, and filtering.

Run fastp on one sample

cd ~/genomics

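# Create the output directories if they do not exist yet (fastp does not create them)
mkdir -p trimmed_data qc_reports/fastp logs

# Process one sample first to understand the output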
fastp \
    --in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
    --in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
    --out1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
    --out2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
    --qualified_quality_phred 20 \
    --length_required 36 \
    --detect_adapter_for_pe \
    --overrepresentation_analysis \
    --thread 4 \
    --json qc_reports/fastp/KO_1.json \
    --html qc_reports/fastp/KO_1.html

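# View the HTML report (open is the macOS command; on Linux use xdg-open)
open qc_reports/fastp/KO_1.html

fastp Parameters Explained

  • --qualified_quality_phred 20: A base counts as qualified only if its quality is Q20 or higher; by default, fastp discards reads in which more than 40% of bases are unqualified
  • --length_required 36: Discard reads shorter than 36 bp after trimming
  • --detect_adapter_for_pe: Auto-detect adapter sequences for paired-end data
  • --overrepresentation_analysis: Include overrepresented sequences in the report
  • --thread 4: Use 4 CPU threads
  • --json / --html: Paths for the JSON and HTML reports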
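
The threshold passed to --qualified_quality_phred is a Phred score, which encodes the probability that a base call is wrong as P = 10^(-Q/10). As a small worked example, the sketch below converts a score to its error probability; phred_to_error is a hypothetical helper, not a fastp feature.

```shell
# Hypothetical helper: convert a Phred score Q to an error probability, P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "Q%d -> %g\n", q, 10 ^ (-q / 10) }'
}

phred_to_error 20
phred_to_error 30
```

Q20 corresponds to a 1 in 100 chance of an incorrect base call and Q30 to 1 in 1000, which is why raising the threshold from 20 to 30 is a substantially stricter filter.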

Process all samples with a loop

cd ~/genomics

# Define sample IDs
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Process each sample
for SAMPLE in $SAMPLES; do
    fastp \
        --in1 raw_data/${SAMPLE}_1.fastq.gz \
        --in2 raw_data/${SAMPLE}_2.fastq.gz \
        --out1 trimmed_data/${SAMPLE}_1.trimmed.fastq.gz \
        --out2 trimmed_data/${SAMPLE}_2.trimmed.fastq.gz \
        --qualified_quality_phred 20 \
        --length_required 36 \
        --detect_adapter_for_pe \
        --overrepresentation_analysis \
        --thread 4 \
        --json qc_reports/fastp/${SAMPLE}.json \
        --html qc_reports/fastp/${SAMPLE}.html \
        2>> logs/fastp.log
done

ls -lh trimmed_data/
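
A loop like the one above fails partway through if one mate file is missing or misnamed. One option is a quick pre-flight check before starting. The sketch below is an illustration only: missing_pairs is a hypothetical helper, and it assumes the _1.fastq.gz / _2.fastq.gz naming used in this lab.

```shell
# Hypothetical helper: report samples whose _1 or _2 FASTQ file is missing in a directory.
# Usage: missing_pairs DIR SAMPLE [SAMPLE ...]
missing_pairs() {
    dir=$1
    shift
    for sample in "$@"; do
        for mate in 1 2; do
            [ -f "${dir}/${sample}_${mate}.fastq.gz" ] ||
                echo "missing: ${dir}/${sample}_${mate}.fastq.gz"
        done
    done
}

# Demo in a throwaway directory with one complete sample (A) and one incomplete sample (B)
demo=$(mktemp -d)
touch "$demo/A_1.fastq.gz" "$demo/A_2.fastq.gz" "$demo/B_1.fastq.gz"
missing_pairs "$demo" A B
rm -r "$demo"
```

In the lab directory this could be called as missing_pairs raw_data $SAMPLES (unquoted, so the sample list splits on spaces); no output means every sample has both files.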

Check trimming results

# Compare file sizes before and after trimming
echo "=== Raw vs Trimmed File Sizes ==="
ls -lh raw_data/*_1.fastq.gz
ls -lh trimmed_data/*_1.trimmed.fastq.gz

# Get read counts before and after
seqkit stats raw_data/*_1.fastq.gz trimmed_data/*_1.trimmed.fastq.gz
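
Read counts are easier to compare as a percentage retained per sample. The sketch below assumes the tab-separated output of seqkit stats -T, with the file name in column 1 and num_seqs in column 4 as a plain integer; retention_table is a hypothetical helper and the demo numbers are made up, not results from this dataset.

```shell
# Hypothetical helper: percent of reads retained per sample, from `seqkit stats -T` output on stdin.
# Assumes column 1 = file path, column 4 = num_seqs, and the file naming used in this lab.
retention_table() {
    awk -F'\t' '
        NR == 1 { next }                               # skip the header row
        {
            name = $1
            sub(/^.*\//, "", name)                     # drop the directory part
            trimmed = (name ~ /\.trimmed\.fastq\.gz$/)
            sub(/_1(\.trimmed)?\.fastq\.gz$/, "", name)   # reduce to the sample ID
            if (trimmed) after[name] = $4; else before[name] = $4
        }
        END {
            for (s in before)
                if ((s in after) && before[s] > 0)
                    printf "%s\t%.1f\n", s, 100 * after[s] / before[s]
        }' | sort
}

# Demo on made-up counts
printf 'file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\nraw_data/S1_1.fastq.gz\tFASTQ\tDNA\t2000\t200000\t100\t100\t100\ntrimmed_data/S1_1.trimmed.fastq.gz\tFASTQ\tDNA\t1900\t180000\t36\t94.7\t100\n' | retention_table

# On the lab data:
# seqkit stats -T raw_data/*_1.fastq.gz trimmed_data/*_1.trimmed.fastq.gz | retention_table
```

A sample whose retention is far below the others is worth a closer look in its fastp HTML report.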

Part 4: Post-Trimming Quality Check

Run FastQC on trimmed data

cd ~/genomics

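# Create the output directory, then run FastQC on the trimmed files
mkdir -p qc_reports/fastqc_trimmed
fastqc trimmed_data/*.fastq.gz -o qc_reports/fastqc_trimmed -t 4

# Compare a sample before and after trimming by browsing the HTML reports
# in qc_reports/fastqc_raw/ and qc_reports/fastqc_trimmed/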

Part 5: MultiQC Aggregated Report

MultiQC combines results from multiple samples into a single interactive report.

Generate MultiQC reports

cd ~/genomics

# Generate a comprehensive report with all QC data
multiqc qc_reports/ -o qc_reports -n multiqc_all --force

# Open the combined report in the browser (open is the macOS command; on Linux use xdg-open)
open qc_reports/multiqc_all.html

Key MultiQC Sections to Review

  • General Statistics: Overview table of all samples
  • FastQC: Per-base quality, GC content, sequence duplication
  • fastp: Filtering statistics, adapter content, quality distribution

Use MultiQC to quickly identify outlier samples that may need special attention.

Exercises

Exercise 1: Quality Assessment

  1. Which sample has the highest percentage of reads passing filters?
  2. Compare the "Per base sequence quality" plots before and after trimming. What improved?
  3. Are there any overrepresented sequences? What might they be?

Exercise 2: Parameter Exploration

Try running fastp with different parameters on one sample:

# More stringent quality filtering
fastp --in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
      --in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
      --out1 test_q30_1.fastq.gz \
      --out2 test_q30_2.fastq.gz \
      --qualified_quality_phred 30 \
      --length_required 50

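# Compare read counts
seqkit stats trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz test_q30_1.fastq.gz

# Remove the test files, including the default fastp.json and fastp.html reports
# that fastp writes to the current directory when --json/--html are not given
rm test_q30_*.fastq.gz fastp.json fastp.html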

How does changing the quality threshold affect the number of reads retained?

Summary

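In this lab, you have:

  • Run FastQC on raw and trimmed reads
  • Interpreted FastQC modules, including warnings that are expected for RNA-seq data
  • Trimmed and filtered paired-end reads with fastp
  • Aggregated the QC reports with MultiQC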

Next: Lab 4 - Genome Alignment Hands-on