Lab 3: Quality Control Hands-on
- Run FastQC to assess raw read quality
- Interpret FastQC reports and identify quality issues
- Use fastp for quality trimming and adapter removal
- Compare pre- and post-trimming quality with MultiQC
Why Quality Control?
Common Quality Issues in RNA-seq Data
- Low quality bases: Especially toward the 3' end of reads, where base quality typically declines
- Adapter contamination: Sequencing adapters not fully removed
- Overrepresented sequences: PCR duplicates or contamination
- GC bias: Non-uniform GC content distribution
- N bases: Positions where the sequencer could not call a base confidently
Quality control ensures reliable downstream analysis by identifying and addressing these issues.
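As a rough illustration of one of these issues, the sketch below counts how many reads in a FASTQ stream contain at least one uncalled base (N). It assumes standard 4-line FASTQ records, and the function name count_n_reads is made up for this example. FastQC reports related information in its "Per base N content" module.

```shell
# Count reads that contain at least one N (uncalled base).
# Reads FASTQ from stdin; assumes 4 lines per record (no wrapped sequences).
# Prints: <total reads> TAB <reads containing N>
count_n_reads() {
    awk 'NR % 4 == 2 { total++; if ($0 ~ /N/) with_n++ }
         END { printf "%d\t%d\n", total, with_n }'
}

# Hypothetical usage on one of the lab files:
# gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | count_n_reads
```

A large share of N-containing reads would be worth a closer look before trimming.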
Part 1: FastQC Analysis
FastQC provides a comprehensive quality report for each FASTQ file.
Run FastQC on raw data
cd ~/genomics
# Activate environment
conda activate genomics
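# Create the output directory (FastQC does not create it for you)
mkdir -p qc_reports/fastqc_raw
# Run FastQC on all raw FASTQ files using 4 threads
fastqc raw_data/*.fastq.gz -o qc_reports/fastqc_raw -t 4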
# List the generated reports
ls -la qc_reports/fastqc_raw/
FastQC generates two files per sample: an HTML report and a ZIP archive with data.
View FastQC reports
Open the HTML reports under qc_reports/fastqc_raw/ in a web browser.
Alternatively, read the summary.txt file inside each ZIP archive in qc_reports/fastqc_raw/ for a per-module PASS/WARN/FAIL summary.
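To scan many reports without opening each one, the summary.txt files can be filtered on the command line. The sketch below assumes the usual summary.txt layout (tab-separated status, module name, and file name); flagged_modules is a hypothetical helper name, not part of FastQC.

```shell
# List every module that FastQC flagged as WARN or FAIL.
# Input: summary.txt content on stdin (tab-separated: status, module, file name).
# Output: file name, status, module (tab-separated).
flagged_modules() {
    awk -F'\t' '$1 == "WARN" || $1 == "FAIL" { printf "%s\t%s\t%s\n", $3, $1, $2 }'
}

# Usage on the real reports (assumes each ZIP contains <name>_fastqc/summary.txt):
# for z in qc_reports/fastqc_raw/*_fastqc.zip; do
#     unzip -p "$z" '*/summary.txt'
# done | flagged_modules
```

Reading from stdin keeps the helper independent of how the archives are unpacked.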
Part 2: Interpreting FastQC Reports
FastQC Modules Overview
| Module | What it Shows | Common Issues |
|---|---|---|
| Per base sequence quality | Quality scores across read positions | Low quality at 3' end |
| Per sequence quality scores | Distribution of mean quality per read | Bimodal distribution |
| Per base sequence content | A/T/G/C proportions per position | Bias at read start (normal for RNA-seq) |
| Per sequence GC content | GC% distribution across reads | Multiple peaks (contamination) |
| Sequence duplication levels | Read duplication rate | High duplication (normal for RNA-seq) |
| Overrepresented sequences | Frequently occurring sequences | Adapters, rRNA |
| Adapter content | Adapter sequence presence | Adapter contamination |
Some FastQC warnings are expected for RNA-seq data:
- Per base sequence content: Often shows bias at first 10-15 bases due to random hexamer priming - this is normal
- Sequence duplication: Higher duplication is expected due to highly expressed genes
- GC content: May show slight deviations from theoretical distribution
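The quality values plotted in the "Per base sequence quality" module are Phred scores, which relate to the estimated base-call error probability as P = 10^(-Q/10). The sketch below converts a few scores to make the scale concrete; phred_to_error is a made-up helper name.

```shell
# Convert a Phred quality score Q to the estimated error probability:
# P = 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Print the error probability for some commonly quoted thresholds
for q in 10 20 30 40; do
    printf 'Q%s\t%s\n' "$q" "$(phred_to_error "$q")"
done
```

Q20 corresponds to roughly 1 error in 100 base calls and Q30 to 1 in 1,000, which is why these two values are common filtering thresholds.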
Part 3: Quality Trimming with fastp
fastp is an all-in-one tool for quality control, trimming, and filtering.
Run fastp on one sample
cd ~/genomics
# Make sure the output directories exist
mkdir -p trimmed_data qc_reports/fastp
# Process one sample first to understand the output
fastp \
--in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
--in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
--out1 trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz \
--out2 trimmed_data/KO_1_SRR10045016_2.trimmed.fastq.gz \
--qualified_quality_phred 20 \
--length_required 36 \
--detect_adapter_for_pe \
--overrepresentation_analysis \
--thread 4 \
--json qc_reports/fastp/KO_1.json \
--html qc_reports/fastp/KO_1.html
# View the HTML report ("open" is macOS-specific; on Linux use xdg-open or open the file in a browser)
open qc_reports/fastp/KO_1.html
- --qualified_quality_phred 20: Bases below Q20 are considered low quality
- --length_required 36: Discard reads shorter than 36 bp after trimming
- --detect_adapter_for_pe: Auto-detect adapters for paired-end data
- --thread 4: Use 4 CPU threads
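To see what "below Q20" means for an individual read, the sketch below computes the share of low-quality bases per read from the FASTQ quality strings. It assumes 4-line records and Phred+33 encoding; low_quality_fraction is a hypothetical helper and not how fastp is implemented. According to the fastp documentation, fastp discards a read when the share of unqualified bases exceeds --unqualified_percent_limit (40% by default).

```shell
# Report the share of bases below a Phred threshold for each read.
# Usage: low_quality_fraction [MIN_Q] < reads.fastq   (MIN_Q defaults to 20)
# Assumes 4-line FASTQ records and Phred+33 quality encoding.
low_quality_fraction() {
    awk -v min_q="${1:-20}" '
        # Build a lookup table from character to ASCII code
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Header line: keep the read name without the leading "@"
        NR % 4 == 1 { name = substr($1, 2) }
        # Quality line: count bases whose Phred score is below the threshold
        NR % 4 == 0 {
            low = 0
            n = length($0)
            for (i = 1; i <= n; i++)
                if (ord[substr($0, i, 1)] - 33 < min_q) low++
            printf "%s\t%d\t%d\t%.1f\n", name, n, low, (n ? 100 * low / n : 0)
        }'
}

# Hypothetical usage on the first 1,000 reads of one lab file:
# gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -n 4000 | low_quality_fraction 20
```

The output columns are read name, read length, number of bases below the threshold, and the percentage of such bases.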
Process all samples with a loop
cd ~/genomics
# Make sure the output and log directories exist
mkdir -p trimmed_data qc_reports/fastp logs
# Define sample IDs
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"
# Process each sample; fastp prints its run summary to stderr, so append it to a log file
for SAMPLE in $SAMPLES; do
    fastp \
        --in1 "raw_data/${SAMPLE}_1.fastq.gz" \
        --in2 "raw_data/${SAMPLE}_2.fastq.gz" \
        --out1 "trimmed_data/${SAMPLE}_1.trimmed.fastq.gz" \
        --out2 "trimmed_data/${SAMPLE}_2.trimmed.fastq.gz" \
        --qualified_quality_phred 20 \
        --length_required 36 \
        --detect_adapter_for_pe \
        --overrepresentation_analysis \
        --thread 4 \
        --json "qc_reports/fastp/${SAMPLE}.json" \
        --html "qc_reports/fastp/${SAMPLE}.html" \
        2>> logs/fastp.log
done
ls -lh trimmed_data/
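Instead of typing the sample IDs by hand, they can be derived from the file names. The sketch below assumes the naming scheme used in this lab (<sample>_1.fastq.gz and <sample>_2.fastq.gz); list_samples is a hypothetical helper name.

```shell
# Print one sample ID per line for every *_1.fastq.gz file in a directory.
# Assumes paired files named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
list_samples() {
    for r1 in "$1"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        sample=$(basename "$r1" _1.fastq.gz)
        # Only report samples whose mate file is present as well
        if [ -e "$1/${sample}_2.fastq.gz" ]; then
            echo "$sample"
        else
            echo "warning: missing mate file for $sample" >&2
        fi
    done
}

# Usage: SAMPLES=$(list_samples raw_data)
```

Deriving the list this way avoids typos in accession numbers and warns about incomplete pairs before fastp runs.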
Check trimming results
# Compare file sizes before and after trimming
echo "=== Raw vs Trimmed File Sizes ==="
ls -lh raw_data/*_1.fastq.gz
ls -lh trimmed_data/*_1.trimmed.fastq.gz
# Get read counts before and after
seqkit stats raw_data/*_1.fastq.gz trimmed_data/*_1.trimmed.fastq.gz
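The read counts can also be turned into a retention percentage per sample. The sketch below uses only awk and assumes 4-line FASTQ records; count_reads and percent_retained are made-up helper names, and seqkit stats remains the more complete tool.

```shell
# Count reads in a FASTQ stream on stdin (4 lines per record).
count_reads() {
    awk 'END { print int(NR / 4) }'
}

# Percentage of reads retained, given raw and trimmed read counts.
percent_retained() {
    awk -v raw="$1" -v kept="$2" 'BEGIN {
        if (raw <= 0) { print "NA"; exit }    # avoid division by zero
        printf "%.2f\n", 100 * kept / raw
    }'
}

# Hypothetical usage for one sample:
# raw=$(gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | count_reads)
# kept=$(gzip -dc trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz | count_reads)
# percent_retained "$raw" "$kept"
```

Comparing this percentage across samples is one way to spot a library that lost unusually many reads during filtering.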
Part 4: Post-Trimming Quality Check
Run FastQC on trimmed data
cd ~/genomics
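# Create the output directory, then run FastQC on the trimmed files
mkdir -p qc_reports/fastqc_trimmed
fastqc trimmed_data/*.fastq.gz -o qc_reports/fastqc_trimmed -t 4
# Compare a sample before and after trimming by browsing the HTML reports in
# qc_reports/fastqc_raw/ and qc_reports/fastqc_trimmed/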
Part 5: MultiQC Aggregated Report
MultiQC combines results from multiple samples into a single interactive report.
Generate MultiQC reports
cd ~/genomics
# Generate a comprehensive report with all QC data
multiqc qc_reports/ -o qc_reports -n multiqc_all --force
# Open the combined report in a browser:
# qc_reports/multiqc_all.html
Key MultiQC Sections to Review
- General Statistics: Overview table of all samples
- FastQC: Per-base quality, GC content, sequence duplication
- fastp: Filtering statistics, adapter content, quality distribution
Use MultiQC to quickly identify outlier samples that may need special attention.
Exercises
Exercise 1: Quality Assessment
- Which sample has the highest percentage of reads passing filters?
- Compare the "Per base sequence quality" plots before and after trimming. What improved?
- Are there any overrepresented sequences? What might they be?
Exercise 2: Parameter Exploration
Try running fastp with different parameters on one sample:
# More stringent quality filtering
fastp --in1 raw_data/KO_1_SRR10045016_1.fastq.gz \
--in2 raw_data/KO_1_SRR10045016_2.fastq.gz \
--out1 test_q30_1.fastq.gz \
--out2 test_q30_2.fastq.gz \
--qualified_quality_phred 30 \
--length_required 50
# Compare read counts
seqkit stats trimmed_data/KO_1_SRR10045016_1.trimmed.fastq.gz test_q30_1.fastq.gz
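# Clean up the test outputs (without --json/--html, fastp writes fastp.json and fastp.html to the current directory)
rm -f test_q30_*.fastq.gz fastp.json fastp.html
How do the stricter quality threshold and the longer minimum length affect the number of reads retained?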
Summary
In this lab, you have:
- Run FastQC to assess raw read quality
- Interpreted FastQC reports and understood common metrics
- Used fastp to remove adapters and filter out low-quality and too-short reads
- Verified quality improvement after trimming
- Generated aggregated MultiQC reports for easy comparison