Lab 2: Hands-on with Transcriptomics Data

Learning Objectives

Part 1: Understand The Dataset: TDP-43 Knockout Study

Throughout this course, we will analyze RNA-seq data from GSE136366 - a study investigating the role of TDP-43 protein in human cells.

About TDP-43

TDP-43 (TAR DNA-binding protein 43) is an RNA-binding protein involved in RNA processing and regulation. Mutations in TDP-43 are associated with neurodegenerative diseases including:

  • Amyotrophic Lateral Sclerosis (ALS)
  • Frontotemporal Dementia (FTD)

This study examines gene expression changes when TDP-43 is knocked out versus restored (wildtype rescue) in HeLa cells.

Experimental Design

  • Organism: Homo sapiens (Human)
  • Cell line: HeLa cells
  • Conditions:
    • KO (Knockout): TDP-43 gene deleted
    • WT (Wildtype/Rescue): TDP-43 re-expressed
  • Replicates: 3 biological replicates per condition
  • Sequencing: Paired-end Illumina HiSeq 2500, 70 bp reads

Reference: Roczniak-Ferguson A, Ferguson SM. "Pleiotropic requirements for human TDP-43 in the regulation of cell and organelle homeostasis." Life Sci Alliance 2019 Oct;2(5). PMID: 31527135

Sample Name SRA Accession Condition Replicate
KO_1SRR10045016TDP-43 Knockout1
KO_2SRR10045017TDP-43 Knockout2
KO_3SRR10045018TDP-43 Knockout3
WT_1SRR10045019Wildtype (Rescue)1
WT_2SRR10045020Wildtype (Rescue)2
WT_3SRR10045021Wildtype (Rescue)3
Workshop Dataset - Already downlpad in Lab 1 (~/genomics/raw_data)

For this workshop, we use reads mapped to chromosome 11 only (~3-4 million reads per sample instead of >20 million). This allows all analyses to complete within the workshop time while still providing biologically meaningful results.

Part 3: Understand FASTQ File Format

FASTQ is the standard format for storing sequencing reads along with their quality scores. Each read consists of 4 lines:

FASTQ Structure

@SRR10045016.1 1 length=70
NTGCAGTGCTGAGTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
+
#AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
LineContentDescription
1@SRR10045016.1...Read identifier (starts with @)
2NTGCAGTG...DNA sequence (A, T, G, C, N)
3+Separator (sometimes repeats ID)
4#AAFFFJJ...Quality scores (ASCII encoded)

Phred Quality Scores

Quality scores indicate the probability of a base call being incorrect:

Phred ScoreError ProbabilityAccuracyASCII (Phred+33)
101 in 1090%+
201 in 10099%5
301 in 1,00099.9%?
401 in 10,00099.99%I

Formula: Q = -10 × log₁₀(P) where P is error probability

Part 4: Exploring FASTQ Files

View raw FASTQ data

cd ~/genomics

# View first 12 lines (3 reads) of a compressed FASTQ file
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -12
# gzip -dc works on both macOS and Linux

# Count total line in a file (each read = 4 lines)
echo "Total lines in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l

# Count total number of reads in a file (each read = 4 lines)
echo "Total reads in KO_1 R1:"
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | wc -l | awk '{print $1/4}'

# View a random sample of reads
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -400 | tail -8

Understand paired-end reads

# Compare read IDs between R1 and R2 (should match)
echo "=== First read from R1 (forward) ==="
gzip -dc raw_data/KO_1_SRR10045016_1.fastq.gz | head -4

echo ""
echo "=== First read from R2 (reverse) ==="
gzip -dc raw_data/KO_1_SRR10045016_2.fastq.gz | head -4

# The read IDs should be identical (or differ only by /1 and /2)
Paired-End Sequencing

In paired-end sequencing, each DNA fragment is sequenced from both ends:

  • R1 (_1.fastq): Forward read
  • R2 (_2.fastq): Reverse read

Part 5: Using SeqKit for FASTQ Exploration

SeqKit is a powerful toolkit for FASTA/FASTQ file manipulation.

Get basic statistics with SeqKit

cd ~/genomics

# Get statistics for one sample
seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8

# Get more detailed statistics for one sample
seqkit stats -a raw_data/KO_1_SRR10045016_1.fastq.gz --threads 8


# Get more detailed statistics for All samples
seqkit stats -a raw_data/*.fastq.gz --threads 8 > seqkit_stats.tsv

Understand Stats on the fastq files

SeqKit reports multiple metrics on raw fastq files, open the seqkit_stats.tsv file and investigate the columns.

Part 6: Exploring Genome FASTA with SeqKit

SeqKit can also be used to explore and manipulate reference genome FASTA files.

Get genome statistics

cd ~/genomics

# Get statistics on the chromosome 11 genome file
seqkit stats references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Get detailed statistics including GC content
seqkit stats -a references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# View the sequence header (chromosome name)
seqkit seq -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Extract sequence regions

# Get the first 1000 bases of chromosome 11
seqkit subseq -r 1:1000 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Extract a specific region (e.g., around the INS gene: 2159779-2161209)
seqkit subseq -r 2159779:2161209 references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

# Get the length of each sequence in a FASTA file
seqkit fx2tab -l -n references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Calculate GC content and composition

# Calculate GC content for the genome
seqkit fx2tab -n -g references/Homo_sapiens.GRCh38.dna.chromosome.11.fa

Part 7: Useful SeqKit Commands

Here are additional SeqKit commands useful for bioinformatics workflows.

Subsample reads

cd ~/genomics

# Randomly sample 10000 reads from a FASTQ file
seqkit sample -n 10000 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10k.fastq.gz

# Sample a proportion (10%) of reads
seqkit sample -p 0.1 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_10pct.fastq.gz

# Sample with a specific seed for reproducibility
seqkit sample -n 10000 -s 42 raw_data/KO_1_SRR10045016_1.fastq.gz -o subsampled_seed42.fastq.gz

# Clean up test files
rm -f subsampled_*.fastq.gz
When to Subsample?

Subsampling is useful for:

  • Testing pipelines quickly before running on full data
  • Debugging issues with smaller datasets
  • Creating balanced datasets when samples have unequal depths

Search for sequence patterns

# Search for reads containing a specific sequence motif
seqkit grep -s -p "AGATCGGAAG" raw_data/KO_1_SRR10045016_1.fastq.gz | head -20

# Count reads containing adapter sequence (Illumina universal adapter)
seqkit grep -s -p "AGATCGGAAGAGC" raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

Sequence manipulation

# Get reverse complement of sequences
seqkit seq -r -p raw_data/KO_1_SRR10045016_1.fastq.gz | head -8

# Extract only sequence IDs
seqkit seq -n raw_data/KO_1_SRR10045016_1.fastq.gz | head -10

# Get sequences within a length range
seqkit seq -m 60 -M 80 raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

Compare and deduplicate sequences

# Remove duplicate sequences (by sequence content)
seqkit rmdup -s raw_data/KO_1_SRR10045016_1.fastq.gz | seqkit stats

# Find common sequences between two files (useful for checking paired-end sync)
seqkit common -n raw_data/KO_1_SRR10045016_1.fastq.gz raw_data/KO_1_SRR10045016_2.fastq.gz | seqkit stats

SeqKit Quick Reference

CommandDescription
seqkit statsGet sequence statistics
seqkit sampleRandomly subsample sequences
seqkit grepSearch by ID or sequence pattern
seqkit subseqExtract subsequences by region
seqkit seqTransform sequences (reverse, complement)
seqkit fx2tabConvert FASTA/Q to tabular format
seqkit fq2faConvert FASTQ to FASTA
seqkit rmdupRemove duplicate sequences
seqkit commonFind common sequences between files

For full documentation: seqkit --help or visit SeqKit documentation

Exercises

Exercise 1: Read Analysis

  1. What is the average read length across all samples?
  2. Which sample has the highest number of reads?
  3. Calculate the total sequencing depth (total bases) for KO vs WT groups.

Exercise 2: Quality Investigation

  1. Find reads that start with 'N' (unknown base). How many are there in KO_1 and WT_1?
    gzip -dc raw_data/KO_1_*_1.fastq.gz | awk 'NR%4==2 && /^N/' | wc -l
  2. What percentage of reads have a quality score below 20?
  3. Compare the GC content between KO and WT samples. Is there a difference?

Exercise 3: SeqKit Exploration

Use seqkit --help to discover more commands. Try:

  1. seqkit --help -Revise the parameters
  2. seqkit grep -p "PATTERN" - Search for specific sequences

Summary

In this lab, you have:

Next: Lab 3 - Quality Control Hands-on