Lab 2: Hands-on with Transcriptomics Data

Learning Objectives

Understand the experimental dataset we will analyze
Understand the FASTQ file format in depth
Obtain statistics on FASTQ files using SeqKit
Explore genome FASTA files with SeqKit
Use SeqKit for sequence manipulation and searching

Part 1: Understand The Dataset: TDP-43 Knockout Study

Throughout this course, we will analyze RNA-seq data from GSE136366 - a study investigating the role of TDP-43 protein in human cells.

About TDP-43

TDP-43 (TAR DNA-binding protein 43) is an RNA-binding protein involved in RNA processing and regulation. Mutations in TDP-43 are associated with neurodegenerative diseases including:

Amyotrophic Lateral Sclerosis (ALS)
Frontotemporal Dementia (FTD)

This study examines gene expression changes when TDP-43 is knocked out versus restored (wildtype rescue) in HeLa cells.

Experimental Design

Organism: Homo sapiens (Human)
Cell line: HeLa cells
Conditions:
- KO (Knockout): TDP-43 gene deleted
- WT (Wildtype/Rescue): TDP-43 re-expressed
Replicates: 3 biological replicates per condition
Sequencing: Paired-end Illumina HiSeq 2500, 70 bp reads

Reference: Roczniak-Ferguson A, Ferguson SM. "Pleiotropic requirements for human TDP-43 in the regulation of cell and organelle homeostasis." Life Sci Alliance 2019 Oct;2(5). PMID: 31527135

Sample Name	SRA Accession	Condition	Replicate
KO_1	SRR10045016	TDP-43 Knockout	1
KO_2	SRR10045017	TDP-43 Knockout	2
KO_3	SRR10045018	TDP-43 Knockout	3
WT_1	SRR10045019	Wildtype (Rescue)	1
WT_2	SRR10045020	Wildtype (Rescue)	2
WT_3	SRR10045021	Wildtype (Rescue)	3

Workshop Dataset - Already downlpad in Lab 1 (~/genomics/raw_data)

For this workshop, we use reads mapped to chromosome 11 only (~3-4 million reads per sample instead of >20 million). This allows all analyses to complete within the workshop time while still providing biologically meaningful results.

Part 3: Understand FASTQ File Format

FASTQ is the standard format for storing sequencing reads along with their quality scores. Each read consists of 4 lines:

FASTQ Structure

@SRR10045016.1 1 length=70
NTGCAGTGCTGAGTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
+
#AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

Line	Content	Description
1	@SRR10045016.1...	Read identifier (starts with @)
2	NTGCAGTG...	DNA sequence (A, T, G, C, N)
3	+	Separator (sometimes repeats ID)
4	#AAFFFJJ...	Quality scores (ASCII encoded)

Phred Quality Scores

Quality scores indicate the probability of a base call being incorrect:

Phred Score	Error Probability	Accuracy	ASCII (Phred+33)
10	1 in 10	90%	+
20	1 in 100	99%	5
30	1 in 1,000	99.9%	?
40	1 in 10,000	99.99%	I

Formula: Q = -10 × log₁₀(P) where P is error probability

Part 4: Exploring FASTQ Files

View raw FASTQ data

cd ~/genomics

# View first 12 lines (3 reads) of a compressed FASTQ file
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -12
# gzip -dc works on both macOS and Linux

# Count total line in a file (each read = 4 lines)
echo "Total lines in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l

# Count total number of reads in a file (each read = 4 lines)
echo "Total reads in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l | awk '{print $1/4}'

# View a random sample of reads
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -400 | tail -8

Understand paired-end reads

# Compare read IDs between R1 and R2 (should match)
echo "=== First read from R1 (forward) ==="
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -4

echo ""
echo "=== First read from R2 (reverse) ==="
gzip -dc raw_data/KO_1_SRR10045016_2.fastq.gz | head -4

# The read IDs should be identical (or differ only by /1 and /2)

Paired-End Sequencing

In paired-end sequencing, each DNA fragment is sequenced from both ends:

R1 (_1.fastq): Forward read
R2 (_2.fastq): Reverse read

Part 5: Using SeqKit for FASTQ Exploration

SeqKit is a powerful toolkit for FASTA/FASTQ file manipulation.

Get basic statistics with SeqKit

cd ~/genomics

# Get statistics for one sample
seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8

# Get more detailed statistics for one sample
seqkit stats -a raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8


# Get more detailed statistics for All samples
seqkit stats -a raw_data/*.fastq.gz --threads 8 > seqkit_stats.tsv

Understand Stats on the fastq files

SeqKit reports multiple metrics on raw fastq files, open the seqkit_stats.tsv file and investigate the columns.

Part 6: Exploring Genome FASTA with SeqKit

SeqKit can also be used to explore and manipulate reference genome FASTA files.

Get genome statistics

cd ~/genomics

# Get statistics on the chromosome 11 genome file
seqkit stats references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Get detailed statistics including GC content
seqkit stats -a references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# View the sequence header (chromosome name)
seqkit seq -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Extract sequence regions

# Get the first 1000 bases of chromosome 11
seqkit subseq -r 1:1000 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Extract a specific region (e.g., around the INS gene: 2159779-2161209)
seqkit subseq -r 2159779:2161209 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Get the length of each sequence in a FASTA file
seqkit fx2tab -l -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Calculate GC content and composition

# Calculate GC content for the genome
seqkit fx2tab -n -g references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Part 7: Useful SeqKit Commands

Here are additional SeqKit commands useful for bioinformatics workflows.

Subsample reads

cd ~/genomics

# Randomly sample 10000 reads from a FASTQ file
seqkit sample -n 10000 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10k.fastq.gz

# Sample a proportion (10%) of reads
seqkit sample -p 0.1 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10pct.fastq.gz

# Sample with a specific seed for reproducibility
seqkit sample -n 10000 -s 42 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_seed42.fastq.gz

# Clean up test files
rm -f subsampled_*.fastq.gz

When to Subsample?

Subsampling is useful for:

Testing pipelines quickly before running on full data
Debugging issues with smaller datasets
Creating balanced datasets when samples have unequal depths

Search for sequence patterns

# Search for reads containing a specific sequence motif
seqkit grep -s -p "AGATCGGAAG" raw_data/KO_1_SRR10045016_1.fastq.gz | head -20

# Count reads containing adapter sequence (Illumina universal adapter)
seqkit grep -s -p "AGATCGGAAGAGC" raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

Sequence manipulation

# Get reverse complement of sequences
seqkit seq -r -p raw_data/KO_1_SRR10045016_1.fastq.gz | head -8

# Extract only sequence IDs
seqkit seq -n raw_data/KO_1_SRR10045016_1.fastq.gz | head -10

# Get sequences within a length range
seqkit seq -m 60 -M 80 raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

Compare and deduplicate sequences

# Remove duplicate sequences (by sequence content)
seqkit rmdup -s raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

# Find common sequences between two files (useful for checking paired-end sync)
seqkit common -n raw_data/KO_1_SRR10045016_1.fastq.gz raw_data/KO_1_SRR10045016_2.fastq.gz | seqkit stats

SeqKit Quick Reference

Command	Description
`seqkit stats`	Get sequence statistics
`seqkit sample`	Randomly subsample sequences
`seqkit grep`	Search by ID or sequence pattern
`seqkit subseq`	Extract subsequences by region
`seqkit seq`	Transform sequences (reverse, complement)
`seqkit fx2tab`	Convert FASTA/Q to tabular format
`seqkit fq2fa`	Convert FASTQ to FASTA
`seqkit rmdup`	Remove duplicate sequences
`seqkit common`	Find common sequences between files

For full documentation: seqkit --help or visit SeqKit documentation

Exercises

Exercise 1: Read Analysis

What is the average read length across all samples?
Which sample has the highest number of reads?
Calculate the total sequencing depth (total bases) for KO vs WT groups.

Exercise 2: Quality Investigation

Find reads that start with 'N' (unknown base). How many are there in KO_1 and WT_1?
```
gzip -dc raw_data/KO_1_*_1.fastq.gz | awk 'NR%4==2 && /^N/' | wc -l
```
What percentage of reads have a quality score below 20?
Compare the GC content between KO and WT samples. Is there a difference?

Exercise 3: SeqKit Exploration

Use seqkit --help to discover more commands. Try:

seqkit --help -Revise the parameters
seqkit grep -p "PATTERN" - Search for specific sequences

Summary

In this lab, you have:

Covered the experimental dataset that we use in this course
Learned the FASTQ file format and Phred quality scores
Used gzip -dc to view compressed files
Applied SeqKit for FASTQ and FASTA statistics
Explored genome sequences and extracted regions with SeqKit
Learned to subsample, search, and manipulate sequences

Next: Lab 3 - Quality Control Hands-on