Lab 2: Hands-on with Transcriptomics Data
- Understand the experimental dataset we will analyze
- Understand the FASTQ file format in depth
- Obtain statistics on FASTQ files using SeqKit
- Explore genome FASTA files with SeqKit
- Use SeqKit for sequence manipulation and searching
Part 1: Understand The Dataset: TDP-43 Knockout Study
Throughout this course, we will analyze RNA-seq data from GSE136366 - a study investigating the role of TDP-43 protein in human cells.
About TDP-43
TDP-43 (TAR DNA-binding protein 43) is an RNA-binding protein involved in RNA processing and regulation. Mutations in TDP-43 are associated with neurodegenerative diseases including:
- Amyotrophic Lateral Sclerosis (ALS)
- Frontotemporal Dementia (FTD)
This study examines gene expression changes when TDP-43 is knocked out versus restored (wildtype rescue) in HeLa cells.
Experimental Design
- Organism: Homo sapiens (Human)
- Cell line: HeLa cells
- Conditions:
- KO (Knockout): TDP-43 gene deleted
- WT (Wildtype/Rescue): TDP-43 re-expressed
- Replicates: 3 biological replicates per condition
- Sequencing: Paired-end Illumina HiSeq 2500, 70 bp reads
Reference: Roczniak-Ferguson A, Ferguson SM. "Pleiotropic requirements for human TDP-43 in the regulation of cell and organelle homeostasis." Life Sci Alliance 2019 Oct;2(5). PMID: 31527135
| Sample Name | SRA Accession | Condition | Replicate |
|---|---|---|---|
| KO_1 | SRR10045016 | TDP-43 Knockout | 1 |
| KO_2 | SRR10045017 | TDP-43 Knockout | 2 |
| KO_3 | SRR10045018 | TDP-43 Knockout | 3 |
| WT_1 | SRR10045019 | Wildtype (Rescue) | 1 |
| WT_2 | SRR10045020 | Wildtype (Rescue) | 2 |
| WT_3 | SRR10045021 | Wildtype (Rescue) | 3 |
For this workshop, we use reads mapped to chromosome 11 only (~3-4 million reads per sample instead of >20 million). This allows all analyses to complete within the workshop time while still providing biologically meaningful results.
Part 3: Understand FASTQ File Format
FASTQ is the standard format for storing sequencing reads along with their quality scores. Each read consists of 4 lines:
FASTQ Structure
@SRR10045016.1 1 length=70
NTGCAGTGCTGAGTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
+
#AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
| Line | Content | Description |
|---|---|---|
| 1 | @SRR10045016.1... | Read identifier (starts with @) |
| 2 | NTGCAGTG... | DNA sequence (A, T, G, C, N) |
| 3 | + | Separator (sometimes repeats ID) |
| 4 | #AAFFFJJ... | Quality scores (ASCII encoded) |
Phred Quality Scores
Quality scores indicate the probability of a base call being incorrect:
| Phred Score | Error Probability | Accuracy | ASCII (Phred+33) |
|---|---|---|---|
| 10 | 1 in 10 | 90% | + |
| 20 | 1 in 100 | 99% | 5 |
| 30 | 1 in 1,000 | 99.9% | ? |
| 40 | 1 in 10,000 | 99.99% | I |
Formula: Q = -10 × log₁₀(P) where P is error probability
Part 4: Exploring FASTQ Files
View raw FASTQ data
cd ~/genomics
# View first 12 lines (3 reads) of a compressed FASTQ file
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -12
# gzip -dc works on both macOS and Linux
# Count total line in a file (each read = 4 lines)
echo "Total lines in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l
# Count total number of reads in a file (each read = 4 lines)
echo "Total reads in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l | awk '{print $1/4}'
# View a random sample of reads
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -400 | tail -8
Understand paired-end reads
# Compare read IDs between R1 and R2 (should match)
echo "=== First read from R1 (forward) ==="
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -4
echo ""
echo "=== First read from R2 (reverse) ==="
gzip -dc raw_data/KO_1_SRR10045016_2.fastq.gz | head -4
# The read IDs should be identical (or differ only by /1 and /2)
In paired-end sequencing, each DNA fragment is sequenced from both ends:
- R1 (_1.fastq): Forward read
- R2 (_2.fastq): Reverse read
Part 5: Using SeqKit for FASTQ Exploration
SeqKit is a powerful toolkit for FASTA/FASTQ file manipulation.
Get basic statistics with SeqKit
cd ~/genomics
# Get statistics for one sample
seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8
# Get more detailed statistics for one sample
seqkit stats -a raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8
# Get more detailed statistics for All samples
seqkit stats -a raw_data/*.fastq.gz --threads 8 > seqkit_stats.tsv
Understand Stats on the fastq files
SeqKit reports multiple metrics on raw fastq files, open the seqkit_stats.tsv file and investigate the columns.
Part 6: Exploring Genome FASTA with SeqKit
SeqKit can also be used to explore and manipulate reference genome FASTA files.
Get genome statistics
cd ~/genomics
# Get statistics on the chromosome 11 genome file
seqkit stats references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# Get detailed statistics including GC content
seqkit stats -a references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# View the sequence header (chromosome name)
seqkit seq -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
Extract sequence regions
# Get the first 1000 bases of chromosome 11
seqkit subseq -r 1:1000 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# Extract a specific region (e.g., around the INS gene: 2159779-2161209)
seqkit subseq -r 2159779:2161209 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
# Get the length of each sequence in a FASTA file
seqkit fx2tab -l -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
Calculate GC content and composition
# Calculate GC content for the genome
seqkit fx2tab -n -g references/Homo_sapiens.GRCh38.dna.chromosome.11.fa
Part 7: Useful SeqKit Commands
Here are additional SeqKit commands useful for bioinformatics workflows.
Subsample reads
cd ~/genomics
# Randomly sample 10000 reads from a FASTQ file
seqkit sample -n 10000 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10k.fastq.gz
# Sample a proportion (10%) of reads
seqkit sample -p 0.1 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10pct.fastq.gz
# Sample with a specific seed for reproducibility
seqkit sample -n 10000 -s 42 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_seed42.fastq.gz
# Clean up test files
rm -f subsampled_*.fastq.gz
Subsampling is useful for:
- Testing pipelines quickly before running on full data
- Debugging issues with smaller datasets
- Creating balanced datasets when samples have unequal depths
Search for sequence patterns
# Search for reads containing a specific sequence motif
seqkit grep -s -p "AGATCGGAAG" raw_data/KO_1_SRR10045016_1.fastq.gz | head -20
# Count reads containing adapter sequence (Illumina universal adapter)
seqkit grep -s -p "AGATCGGAAGAGC" raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats
Sequence manipulation
# Get reverse complement of sequences
seqkit seq -r -p raw_data/KO_1_SRR10045016_1.fastq.gz | head -8
# Extract only sequence IDs
seqkit seq -n raw_data/KO_1_SRR10045016_1.fastq.gz | head -10
# Get sequences within a length range
seqkit seq -m 60 -M 80 raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats
Compare and deduplicate sequences
# Remove duplicate sequences (by sequence content)
seqkit rmdup -s raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats
# Find common sequences between two files (useful for checking paired-end sync)
seqkit common -n raw_data/KO_1_SRR10045016_1.fastq.gz raw_data/KO_1_SRR10045016_2.fastq.gz | seqkit stats
SeqKit Quick Reference
| Command | Description |
|---|---|
seqkit stats | Get sequence statistics |
seqkit sample | Randomly subsample sequences |
seqkit grep | Search by ID or sequence pattern |
seqkit subseq | Extract subsequences by region |
seqkit seq | Transform sequences (reverse, complement) |
seqkit fx2tab | Convert FASTA/Q to tabular format |
seqkit fq2fa | Convert FASTQ to FASTA |
seqkit rmdup | Remove duplicate sequences |
seqkit common | Find common sequences between files |
For full documentation: seqkit --help or visit SeqKit documentation
Exercises
Exercise 1: Read Analysis
- What is the average read length across all samples?
- Which sample has the highest number of reads?
- Calculate the total sequencing depth (total bases) for KO vs WT groups.
Exercise 2: Quality Investigation
- Find reads that start with 'N' (unknown base). How many are there in KO_1 and WT_1?
gzip -dc raw_data/KO_1_*_1.fastq.gz | awk 'NR%4==2 && /^N/' | wc -l - What percentage of reads have a quality score below 20?
- Compare the GC content between KO and WT samples. Is there a difference?
Exercise 3: SeqKit Exploration
Use seqkit --help to discover more commands. Try:
seqkit --help-Revise the parametersseqkit grep -p "PATTERN"- Search for specific sequences
Summary
In this lab, you have:
- Covered the experimental dataset that we use in this course
- Learned the FASTQ file format and Phred quality scores
- Used gzip -dc to view compressed files
- Applied SeqKit for FASTQ and FASTA statistics
- Explored genome sequences and extracted regions with SeqKit
- Learned to subsample, search, and manipulate sequences