Lab 5: Differential Expression Analysis
Learning Objectives
- Understand the DESeq2 statistical framework
- Run differential expression analysis comparing KO vs WT
- Interpret DE results (log2 fold change, p-values, adjusted p-values)
- Filter and export significant differentially expressed genes
DESeq2 Overview
Why DESeq2?
DESeq2 is the gold-standard tool for differential expression analysis because it:
- Models count data with negative binomial distribution
- Performs size factor normalization to account for library size differences
- Uses shrinkage estimators for dispersion and fold changes
- Properly controls false discovery rate (FDR) with Benjamini-Hochberg correction
Key Concepts
| Term | Description |
|---|---|
| log2 Fold Change | log2(KO/WT) - positive means upregulated in KO |
| p-value | Probability of observing the difference by chance |
| padj (adjusted p-value) | p-value corrected for multiple testing (FDR) |
| baseMean | Average normalized count across all samples |
Output - Lab 4b: Download counts.zip
In case you have not obtained the results from Lab 4 (the gene count table), you can download lab4-output folder and unzip it. It contains all files that are required for this lab and the remaining ones.
Part 1: Setup & Start Jupyter Notebook
Install required pacakges and setup the kernel
conda activate genomics
mamba install -c conda-forge -c bioconda jupyter jupyterlab r-irkernel -y
R -e "IRkernel::installspec(name='genomics', displayname='R (DESeq2)')"
# Verify installation
jupyter --version
Start jupyter notebook
cd ~/genomics
# Launch Jupyter Notebook
jupyter notebook
-
Create a new notebook:
- In Jupyter (or JupyterLab), choose File → New → Notebook, then select the R kernel.
- Set working directory: In the first cell, run
setwd("~/genomics")so paths below work. - Copy each code block below into a separate cell and run them in order (Shift+Enter).
Part 2: Run DESeq2 Analysis
Load libraries
library(DESeq2)
library(tximport)
Load data
txi <- readRDS("~/genomics/counts/tximport.rds")
sample_info <- read.table("~/genomics/counts/sample_info.tsv", header = TRUE, sep = "\t")
# Set factor levels (WT as reference)
sample_info$condition <- factor(sample_info$condition, levels = c("WT", "KO"))
cat(paste0("Samples: ", nrow(sample_info), "\n"))
cat(paste0("Genes: ", nrow(txi$counts), "\n"))
cat(paste0("Conditions: ", paste(levels(sample_info$condition), collapse = " vs "), "\n"))
Create DESeq2 dataset and pre-filter
dds <- DESeqDataSetFromTximport(txi, colData = sample_info, design = ~ condition)
# Pre-filtering: Remove genes with very low counts
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
cat(paste0("Genes after filtering (>=10 counts in >=3 samples): ", nrow(dds), "\n"))
Run DESeq2 and get results
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "KO", "WT"))
res <- res[order(res$padj), ]
summary(res)
# Significant: padj < 0.05 and |log2FC| > 1
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
sig_genes <- sig_genes[order(sig_genes$padj), ]
cat(paste0("Total significant genes: ", nrow(sig_genes), "\n"))
cat(paste0("Upregulated in KO: ", sum(sig_genes$log2FoldChange > 0, na.rm = TRUE), "\n"))
cat(paste0("Downregulated in KO: ", sum(sig_genes$log2FoldChange < 0, na.rm = TRUE), "\n"))
Save results
dir.create("~/genomics/results/tables", showWarnings = FALSE, recursive = TRUE)
# All results
all_res <- as.data.frame(res)
all_res$gene_id <- rownames(all_res)
all_res <- all_res[, c("gene_id", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]
write.csv(all_res, "~/genomics/results/tables/deseq2_all_results.csv", row.names = FALSE)
# Significant genes only
sig_res <- as.data.frame(sig_genes)
sig_res$gene_id <- rownames(sig_res)
sig_res <- sig_res[, c("gene_id", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")]
write.csv(sig_res, "~/genomics/results/tables/deseq2_significant.csv", row.names = FALSE)
# Normalized counts
norm_counts <- as.data.frame(counts(dds, normalized = TRUE))
norm_counts$gene_id <- rownames(norm_counts)
write.csv(norm_counts, "~/genomics/results/tables/normalized_counts.csv", row.names = FALSE)
# Save DESeq2 object for later use
saveRDS(dds, "~/genomics/results/deseq2_object.rds")
Optional: Alternative Approach - Run from Command Line
To run as a script instead of a notebook, save all blocks above into deseq2_analysis.R, then:
cd ~/genomics
Rscript scripts/deseq2_analysis.R
ls -lh results/tables/
Part 3: Understanding the Results
Open the results/tables/deseq2_significant.csv file in MS Excel and check the outputInterpreting the Output
- Positive log2FC: Gene is upregulated in KO (higher expression when TDP-43 is knocked out)
- Negative log2FC: Gene is downregulated in KO (lower expression when TDP-43 is knocked out)
- padj < 0.05: Statistically significant after multiple testing correction
- |log2FC| > 1: At least 2-fold change in expression
Biological Interpretation
Since TDP-43 is an RNA-binding protein involved in RNA processing:
- Upregulated genes in KO may be normally suppressed by TDP-43
- Downregulated genes in KO may depend on TDP-43 for proper expression/stability
Exercises
Exercise 1: Results Exploration
- How many genes pass the significance threshold of padj < 0.01?
- What are the genes with the most positive and most negative fold change?
- Look up one of the top genes in a gene database (e.g., GeneCards, NCBI Gene)
Summary
In this lab, you have:
- Run DESeq2 differential expression analysis
- Identified genes differentially expressed between KO and WT
- Learned to interpret fold changes and adjusted p-values