Lab 5: Differential Expression Analysis

Learning Objectives

Understand the DESeq2 statistical framework
Run differential expression analysis comparing KO vs WT
Interpret DE results (log2 fold change, p-values, adjusted p-values)
Filter and export significant differentially expressed genes

DESeq2 Overview

Why DESeq2?

DESeq2 is the gold-standard tool for differential expression analysis because it:

Models count data with negative binomial distribution
Performs size factor normalization to account for library size differences
Uses shrinkage estimators for dispersion and fold changes
Properly controls false discovery rate (FDR) with Benjamini-Hochberg correction

Key Concepts

Term	Description
log2 Fold Change	log2(KO/WT) - positive means upregulated in KO
p-value	Probability of observing the difference by chance
padj (adjusted p-value)	p-value corrected for multiple testing (FDR)
baseMean	Average normalized count across all samples

Output - Lab 4b: Download counts.zip

In case you have not obtained the results from Lab 4 (the gene count table), you can download lab4-output folder and unzip it. It contains all files that are required for this lab and the remaining ones.

Part 1: Setup & Start Jupyter Notebook

Install required pacakges and setup the kernel

conda activate genomics

mamba install -c conda-forge -c bioconda jupyter jupyterlab r-irkernel -y

R -e "IRkernel::installspec(name='genomics', displayname='R (DESeq2)')"

# Verify installation
jupyter --version

Start jupyter notebook

cd ~/genomics

# Launch Jupyter Notebook
jupyter notebook

Create a new notebook:

In Jupyter (or JupyterLab), choose File → New → Notebook, then select the R kernel.
Set working directory: In the first cell, run setwd("~/genomics") so paths below work.
Copy each code block below into a separate cell and run them in order (Shift+Enter).

Part 2: Run DESeq2 Analysis

Load libraries

library(DESeq2)
library(tximport)

Load data

txi <- readRDS("~/genomics/counts/tximport.rds")
sample_info <- read.table("~/genomics/counts/sample_info.tsv", header = TRUE, sep = "\t")

# Set factor levels (WT as reference)
sample_info$condition <- factor(sample_info$condition, levels = c("WT", "KO"))

cat(paste0("Samples: ", nrow(sample_info), "\n"))
cat(paste0("Genes: ", nrow(txi$counts), "\n"))
cat(paste0("Conditions: ", paste(levels(sample_info$condition), collapse = " vs "), "\n"))

Create DESeq2 dataset and pre-filter

dds <- DESeqDataSetFromTximport(txi, colData = sample_info, design = ~ condition)

# Pre-filtering: Remove genes with very low counts
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
cat(paste0("Genes after filtering (>=10 counts in >=3 samples): ", nrow(dds), "\n"))

Run DESeq2 and get results

dds <- DESeq(dds)
                res <- results(dds, contrast = c("condition", "KO", "WT"))
res <- res[order(res$padj), ]

summary(res)

# Significant: padj < 0.05 and |log2FC| > 1
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
sig_genes <- sig_genes[order(sig_genes$padj), ]

cat(paste0("Total significant genes: ", nrow(sig_genes), "\n"))
cat(paste0("Upregulated in KO: ", sum(sig_genes$log2FoldChange > 0, na.rm = TRUE), "\n"))
cat(paste0("Downregulated in KO: ", sum(sig_genes$log2FoldChange < 0, na.rm = TRUE), "\n"))

Save results

dir.create("~/genomics/results/tables", showWarnings = FALSE, recursive = TRUE)

# All results
all_res <- as.data.frame(res)
all_res$gene_id <- rownames(all_res)
all_res <- all_res[, c("gene_id", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]
write.csv(all_res, "~/genomics/results/tables/deseq2_all_results.csv", row.names = FALSE)

# Significant genes only
sig_res <- as.data.frame(sig_genes)
sig_res$gene_id <- rownames(sig_res)
sig_res <- sig_res[, c("gene_id", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]
write.csv(sig_res, "~/genomics/results/tables/deseq2_significant.csv", row.names = FALSE)

# Normalized counts
norm_counts <- as.data.frame(counts(dds, normalized = TRUE))
norm_counts$gene_id <- rownames(norm_counts)
write.csv(norm_counts, "~/genomics/results/tables/normalized_counts.csv", row.names = FALSE)

# Save DESeq2 object for later use
saveRDS(dds, "~/genomics/results/deseq2_object.rds")

Optional: Alternative Approach - Run from Command Line

To run as a script instead of a notebook, save all blocks above into deseq2_analysis.R, then:

cd ~/genomics
Rscript scripts/deseq2_analysis.R
ls -lh results/tables/

Part 3: Understanding the Results

Open the results/tables/deseq2_significant.csv file in MS Excel and check the output

Interpreting the Output

Positive log2FC: Gene is upregulated in KO (higher expression when TDP-43 is knocked out)
Negative log2FC: Gene is downregulated in KO (lower expression when TDP-43 is knocked out)
padj < 0.05: Statistically significant after multiple testing correction
|log2FC| > 1: At least 2-fold change in expression

Biological Interpretation

Since TDP-43 is an RNA-binding protein involved in RNA processing:

Upregulated genes in KO may be normally suppressed by TDP-43
Downregulated genes in KO may depend on TDP-43 for proper expression/stability

Exercises

Exercise 1: Results Exploration

How many genes pass the significance threshold of padj < 0.01?
What are the genes with the most positive and most negative fold change?
Look up one of the top genes in a gene database (e.g., GeneCards, NCBI Gene)

Summary

In this lab, you have:

Run DESeq2 differential expression analysis
Identified genes differentially expressed between KO and WT
Learned to interpret fold changes and adjusted p-values

Next: Lab 6 - Visualization & QC Plots