Lab 7: Enrichment Analysis with g:Profiler

Learning Objectives

What is Enrichment Analysis?

From Gene Lists to Biological Insight

Enrichment analysis helps answer: "What biological processes or pathways are over-represented in my gene list?"

  • Gene Ontology (GO): Standardized vocabulary for gene function
    • Biological Process (BP) - What the gene does
    • Molecular Function (MF) - How it does it
    • Cellular Component (CC) - Where it acts
  • KEGG Pathways: Biological pathway maps linking genes to cellular processes

g:Profiler

g:Profiler is a widely-used tool for functional enrichment analysis. The Python package gprofiler-official provides programmatic access to the g:Profiler web service.

  • Supports multiple organisms and gene ID types
  • Queries GO, KEGG, Reactome, and other databases
  • Returns statistical significance with multiple testing correction

Part 1: Setup Jupyter Notebook

Activate environment and start Jupyter

conda activate genomics

cd ~/genomics

# Launch Jupyter Notebook
jupyter notebook

Create a new Python 3 notebook and copy each code block below into a cell.

Part 2: Load DE Results and Prepare Gene Lists

Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from gprofiler import GProfiler
import os

# Create output directory
os.makedirs('results/enrichment', exist_ok=True)

print("Libraries loaded successfully!")

Load DESeq2 results from Lab 5

# Load all DE results
de_results = pd.read_csv('results/tables/deseq2_all_results.csv')

print(f"Total genes in DE results: {len(de_results)}")
print(f"Columns: {list(de_results.columns)}")
de_results.head()

Define significance thresholds and create gene lists

# Significance thresholds
PADJ_THRESHOLD = 0.05
LOG2FC_THRESHOLD = 1  # |log2FC| > 1 means >2-fold change

# Filter for significant genes
sig_genes = de_results[
    (de_results['padj'] < PADJ_THRESHOLD) &
    (de_results['log2FoldChange'].abs() > LOG2FC_THRESHOLD)
].copy()

# Separate upregulated and downregulated genes
up_genes = sig_genes[sig_genes['log2FoldChange'] > 0]['gene_id'].tolist()
down_genes = sig_genes[sig_genes['log2FoldChange'] < 0]['gene_id'].tolist()
all_sig_genes = sig_genes['gene_id'].tolist()

print(f"Significant genes (padj < {PADJ_THRESHOLD}, |log2FC| > {LOG2FC_THRESHOLD}):")
print(f"  Total: {len(all_sig_genes)}")
print(f"  Upregulated in KO: {len(up_genes)}")
print(f"  Downregulated in KO: {len(down_genes)}")
Gene ID Format

g:Profiler accepts Ensembl gene IDs (ENSG...) which is what we have from our analysis. It will automatically convert them internally.

Part 3: Run GO Enrichment Analysis

Initialize g:Profiler and define helper function

# Initialize g:Profiler
gp = GProfiler(return_dataframe=True)

def run_enrichment(gene_list, name, sources=['GO:BP', 'GO:MF', 'GO:CC']):
    """
    Run g:Profiler enrichment analysis on a gene list.

    Parameters:
    - gene_list: list of gene IDs
    - name: description for logging
    - sources: databases to query (GO:BP, GO:MF, GO:CC, KEGG, REAC, etc.)

    Returns:
    - DataFrame with enrichment results
    """
    if len(gene_list) == 0:
        print(f"Warning: {name} gene list is empty!")
        return pd.DataFrame()

    print(f"\nRunning enrichment for {name} ({len(gene_list)} genes)...")

    results = gp.profile(
        organism='hsapiens',
        query=gene_list,
        sources=sources,
        user_threshold=0.05,  # Significance threshold
        significance_threshold_method='fdr',  # Multiple testing correction
        no_evidences=False
    )

    if len(results) > 0:
        print(f"  Found {len(results)} significant terms")
    else:
        print(f"  No significant terms found")

    return results

Run GO enrichment for all three gene sets

# Run GO enrichment analysis
go_all = run_enrichment(all_sig_genes, "All significant genes", sources=['GO:BP', 'GO:MF', 'GO:CC'])
go_up = run_enrichment(up_genes, "Upregulated genes", sources=['GO:BP', 'GO:MF', 'GO:CC'])
go_down = run_enrichment(down_genes, "Downregulated genes", sources=['GO:BP', 'GO:MF', 'GO:CC'])

Explore GO results

# View columns in results
if len(go_all) > 0:
    print("Columns in enrichment results:")
    print(go_all.columns.tolist())
    print("\nTop 10 GO terms for all significant genes:")
    display(go_all[['source', 'native', 'name', 'p_value', 'intersection_size']].head(10))

Part 4: Run KEGG Pathway Analysis

Run KEGG enrichment

# Run KEGG pathway enrichment
kegg_all = run_enrichment(all_sig_genes, "All significant genes (KEGG)", sources=['KEGG'])
kegg_up = run_enrichment(up_genes, "Upregulated genes (KEGG)", sources=['KEGG'])
kegg_down = run_enrichment(down_genes, "Downregulated genes (KEGG)", sources=['KEGG'])

View KEGG results

# Display KEGG results
if len(kegg_all) > 0:
    print("Top KEGG pathways for all significant genes:")
    display(kegg_all[['native', 'name', 'p_value', 'intersection_size']].head(10))
else:
    print("No significant KEGG pathways found for all genes")

Part 5: Visualize Enrichment Results

Define plotting functions

def plot_enrichment_bar(results, title, top_n=15, figsize=(10, 8)):
    """
    Create a bar plot of top enriched terms.
    """
    if len(results) == 0:
        print(f"No results to plot for: {title}")
        return None

    # Get top N results
    plot_data = results.head(top_n).copy()
    plot_data = plot_data.sort_values('p_value', ascending=True)

    # Create -log10(p-value) for better visualization
    plot_data['-log10(p)'] = -np.log10(plot_data['p_value'])

    # Create figure
    fig, ax = plt.subplots(figsize=figsize)

    # Color by source (GO:BP, GO:MF, GO:CC, KEGG)
    colors = {'GO:BP': '#1f77b4', 'GO:MF': '#ff7f0e', 'GO:CC': '#2ca02c', 'KEGG': '#d62728'}
    bar_colors = [colors.get(s, '#7f7f7f') for s in plot_data['source']]

    # Create horizontal bar plot
    bars = ax.barh(range(len(plot_data)), plot_data['-log10(p)'], color=bar_colors)

    # Customize plot
    ax.set_yticks(range(len(plot_data)))
    ax.set_yticklabels(plot_data['name'], fontsize=9)
    ax.set_xlabel('-log10(p-value)', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')

    # Add gene count labels
    for i, (bar, count) in enumerate(zip(bars, plot_data['intersection_size'])):
        ax.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2,
                f'n={count}', va='center', fontsize=8)

    # Add legend
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor=c, label=s) for s, c in colors.items() if s in plot_data['source'].values]
    if legend_elements:
        ax.legend(handles=legend_elements, loc='lower right')

    plt.tight_layout()
    return fig


def plot_enrichment_dot(results, title, top_n=15, figsize=(10, 8)):
    """
    Create a dot plot of top enriched terms (similar to clusterProfiler style).
    """
    if len(results) == 0:
        print(f"No results to plot for: {title}")
        return None

    plot_data = results.head(top_n).copy()

    # Calculate gene ratio
    plot_data['GeneRatio'] = plot_data['intersection_size'] / plot_data['query_size']
    plot_data['-log10(p)'] = -np.log10(plot_data['p_value'])

    fig, ax = plt.subplots(figsize=figsize)

    scatter = ax.scatter(
        plot_data['GeneRatio'],
        range(len(plot_data)),
        s=plot_data['intersection_size'] * 10,  # Size by gene count
        c=plot_data['-log10(p)'],  # Color by significance
        cmap='viridis',
        alpha=0.7,
        edgecolors='black',
        linewidths=0.5
    )

    ax.set_yticks(range(len(plot_data)))
    ax.set_yticklabels(plot_data['name'], fontsize=9)
    ax.set_xlabel('Gene Ratio', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')

    # Add colorbar
    cbar = plt.colorbar(scatter, ax=ax, label='-log10(p-value)')

    # Add size legend
    sizes = [5, 10, 20]
    for size in sizes:
        ax.scatter([], [], s=size*10, c='gray', alpha=0.7, label=f'{size} genes')
    ax.legend(loc='lower right', title='Gene Count')

    plt.tight_layout()
    return fig

Plot GO enrichment for the significant genes (up, down, and all)

# Plot GO enrichment - All significant genes
fig = plot_enrichment_bar(go_all, 'GO Enrichment - All Significant DE Genes')
if fig:
    fig.savefig('results/enrichment/go_all_genes_bar.png', dpi=150, bbox_inches='tight')
    plt.show()
    
# Plot GO enrichment - Upregulated genes
fig = plot_enrichment_bar(go_up, 'GO Enrichment - Upregulated in KO')
if fig:
    fig.savefig('results/enrichment/go_upregulated_bar.png', dpi=150, bbox_inches='tight')
    plt.show()
# Plot GO enrichment - Downregulated genes
fig = plot_enrichment_bar(go_down, 'GO Enrichment - Downregulated in KO')
if fig:
    fig.savefig('results/enrichment/go_downregulated_bar.png', dpi=150, bbox_inches='tight')
    plt.show()

Create dot plots for publication

# Create dot plots (publication-style)
fig = plot_enrichment_dot(go_all, 'GO Enrichment - All DE Genes (Dot Plot)')
if fig:
    fig.savefig('results/enrichment/go_all_genes_dot.png', dpi=150, bbox_inches='tight')
    plt.show()

Plot KEGG pathways

# Plot KEGG pathways for all gene sets
for results, name, filename in [
    (kegg_all, 'KEGG Pathways - All DE Genes', 'kegg_all'),
    (kegg_up, 'KEGG Pathways - Upregulated', 'kegg_up'),
    (kegg_down, 'KEGG Pathways - Downregulated', 'kegg_down')
]:
    fig = plot_enrichment_bar(results, name, top_n=10)
    if fig:
        fig.savefig(f'results/enrichment/{filename}_bar.png', dpi=150, bbox_inches='tight')
        plt.show()

Part 6: Save Results to TSV Files

Save all enrichment results

def save_enrichment_results(results, filename):
    """Save enrichment results to TSV file."""
    if len(results) == 0:
        print(f"No results to save for {filename}")
        return

    # Select and rename columns for clarity
    cols_to_save = ['source', 'native', 'name', 'p_value', 'significant',
                    'term_size', 'query_size', 'intersection_size', 'intersections']

    # Only include columns that exist
    cols_to_save = [c for c in cols_to_save if c in results.columns]

    output = results[cols_to_save].copy()
    output.to_csv(f'results/enrichment/{filename}.tsv', sep='\t', index=False)
    print(f"Saved: results/enrichment/{filename}.tsv ({len(output)} terms)")

# Save GO results
save_enrichment_results(go_all, 'go_all_significant_genes')
save_enrichment_results(go_up, 'go_upregulated_genes')
save_enrichment_results(go_down, 'go_downregulated_genes')

# Save KEGG results
save_enrichment_results(kegg_all, 'kegg_all_significant_genes')
save_enrichment_results(kegg_up, 'kegg_upregulated_genes')
save_enrichment_results(kegg_down, 'kegg_downregulated_genes')

print("\nAll results saved to results/enrichment/")

Create summary report

# Create a summary of enrichment analysis
summary = f"""
============================================
Enrichment Analysis Summary
============================================

Input:
  Total significant genes: {len(all_sig_genes)}
  Upregulated genes: {len(up_genes)}
  Downregulated genes: {len(down_genes)}

Significance thresholds:
  padj < {PADJ_THRESHOLD}
  |log2FoldChange| > {LOG2FC_THRESHOLD}

GO Enrichment Results:
  All genes: {len(go_all)} significant terms
  Upregulated: {len(go_up)} significant terms
  Downregulated: {len(go_down)} significant terms

KEGG Pathway Results:
  All genes: {len(kegg_all)} significant pathways
  Upregulated: {len(kegg_up)} significant pathways
  Downregulated: {len(kegg_down)} significant pathways

Output files saved to: results/enrichment/
============================================
"""
print(summary)

# Save summary to file
with open('results/enrichment/analysis_summary.txt', 'w') as f:
    f.write(summary)

Part 7: Interpreting Results

Key Questions to Answer

  1. What biological processes are affected? Look at GO:BP terms
  2. What molecular functions are enriched? Look at GO:MF terms
  3. Which pathways are dysregulated? Look at KEGG results
  4. Are results consistent with known TDP-43 biology?
    • TDP-43 is involved in RNA processing, splicing, and transport
    • Expect enrichment in RNA-related terms
Biological Interpretation

When interpreting enrichment results:

  • Upregulated in KO: These genes may be normally suppressed by TDP-43
  • Downregulated in KO: These genes may depend on TDP-43 for expression
  • Compare your results to published TDP-43 studies to validate findings

Exercises

Exercise 1: Explore GO Terms

  1. What is the most significantly enriched GO term for each category (BP, MF, CC)?
  2. How many genes contribute to the top GO:BP term?
  3. Are the enriched terms consistent with TDP-43's known functions?

Exercise 2: Compare Up vs Down

  1. Are the enriched GO terms different between up- and down-regulated genes?
  2. What does this tell you about TDP-43's regulatory role?
  3. Which KEGG pathways appear in both sets?

Exercise 3: Web-based Comparison

  1. Go to g:Profiler web interface
  2. Paste your upregulated gene list and run enrichment
  3. Compare web results with your Python analysis
  4. Export a Manhattan plot from the web interface

Summary

In this lab, you have:

Congratulations!

You have completed the full RNA-seq analysis pipeline, from raw FASTQ files to biological insights about TDP-43 function!

Return to: Labs Overview