Lab 1: Warm-up for Large-scale Analysis & Genomic File Formats

Learning Objectives

Set up a structured project directory for RNA-seq analysis
Navigate NCBI GEO and SRA to find public RNA-seq datasets
Browse Ensembl to locate and download reference files
Understand genome reference file formats (FASTA and GTF)

Environment Setup

Make sure you have the environment setup, see setup instructions

Mac Users: if wget is not installed on your system please replacte it with curl -O in the commands below. or install wget in your conda environmentconda install wget

Part 1: Project Directory Setup

A well-organized directory structure is essential for reproducible research.

Create project directories

# Activate your environment
conda activate genomics

# Create main project directory
mkdir ~/genomics
cd ~/genomics

# Create subdirectories for each analysis step
mkdir subsampled_data
mkdir trimmed_data
mkdir references
#you can create multiple folders in the same command:
mkdir -p qc_reports/fastqc_raw qc_reports/fastqc_trimmed qc_reports/fastp
mkdir alignment
mkdir salmon_quant
mkdir counts
mkdir -p results/tables results/figures results/enrichment
mkdir logs

# View the structure
tree

Directory Structure Explained

raw_data/ - Original FASTQ files from sequencing
subsampled_data/ - Subsampled reads for faster processing
trimmed_data/ - Quality-trimmed reads
references/ - Genome, transcriptome, and annotation files
qc_reports/ - Quality control reports (FastQC, fastp, MultiQC)
alignment/ - BAM files from STAR alignment
salmon_quant/ - Transcript quantification results
counts/ - Gene-level count matrices
results/ - Final analysis outputs (tables, figures, enrichment)

Part 2: Finding Raw Genomics Datasets on NCBI

Before working with any public dataset, it's important to know how to find and access it from NCBI. Here's a step-by-step guide to locating GSE136366 and its raw sequencing files.

Step 1: Navigate to NCBI GEO

NCBI GEO (Gene Expression Omnibus) is a public repository for gene expression data, including RNA-seq datasets.

Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
You will see the GEO homepage with a search box at the top

Step 2: Search for the Dataset

In the search box, type the accession number: GSE136366
Click "Search" or press Enter
The search results will show the dataset: "TDP-43 regulates LC3ylation..."
Click on the accession number GSE136366 to view the full record

Step 3: Explore the GEO Series Page

The GEO Series page contains important information about the dataset:

Title: Full study title and description
Summary: Experimental design and objectives
Overall design: Sample information and conditions
Contributors: Authors and their affiliations
Citation: Link to the published paper
Samples: List of all individual samples (GSM accessions)

Step 4: Find the SRA Data

To download the raw FASTQ files, we need to access the SRA (Sequence Read Archive):

Scroll down to the "Relations" section on the GEO page
Look for the link labeled "SRA" with accession SRP219898
Click on the SRA link to navigate to the SRA page

Alternatively, scroll to the "Samples" section and click on any sample (e.g., GSM4047073) to see its individual SRA run.

Step 5: Navigate the SRA Run Selector

The SRA Run Selector allows you to view and download individual sequencing runs:

On the SRA page, click "Send results to Run selector" (if not already there)
You will see a table listing all 6 runs (SRR10045016 - SRR10045021)
The table shows important metadata:
- Run: The SRR accession number
- Sample: Sample name from the study
- LibraryLayout: PAIRED (paired-end sequencing)
- Bases: Total sequencing bases
- Spots: Number of reads

Step 6: Identify Sample-to-Accession Mapping

From the SRA Run Selector, you can determine which SRR accession corresponds to each sample:

GEO Sample	SRA Run	Sample Name	Condition
GSM4047073	SRR10045016	TDP43_KO1	Knockout Rep 1
GSM4047074	SRR10045017	TDP43_KO2	Knockout Rep 2
GSM4047075	SRR10045018	TDP43_KO3	Knockout Rep 3
GSM4047076	SRR10045019	TDP43_WT1	Wildtype Rep 1
GSM4047077	SRR10045020	TDP43_WT2	Wildtype Rep 2
GSM4047078	SRR10045021	TDP43_WT3	Wildtype Rep 3

Step 7: Download Options

There are several ways to download the FASTQ files, including:

Option A: Using SRA Toolkit (Recommended for large files)

Use command-line tools prefetch and fasterq-dump as shown in the "Alternative" section below.

Option B: Direct Download from ENA

The European Nucleotide Archive (ENA) often provides direct FASTQ download links:

Go to ENA
Search for the SRR accession
Click "FASTQ files (FTP)" to download directly

Tip: Always Check the Metadata!

Before downloading any dataset, review:

Sequencing platform (Illumina, PacBio, Nanopore, etc.)
Library layout (single-end vs paired-end)
Read length and total read count
Any quality or processing notes

This information affects your analysis pipeline choices!

Part 3: Finding Reference Genome Files on Ensembl

Ensembl is a genome browser and database that provides reference sequences and annotations for many species. Here's how to navigate Ensembl to find the files we need.

Step 1: Navigate to Ensembl

Open your browser and go to https://www.ensembl.org
The homepage shows featured species and a search box
Click on "Human" in the popular species section, or search for "Homo sapiens"

Step 2: Access the FTP Download Site

Reference files are available through Ensembl's FTP server:

From the Ensembl homepage, click "Downloads" in the top navigation menu
Select "Download data via FTP"
Alternatively, go directly to: https://ftp.ensembl.org/pub/

Step 3: Understand the FTP Directory Structure

The Ensembl FTP site is organized by release version and data type:

ftp.ensembl.org/pub/
├── release-110/              # Release version (we use 110)
│   ├── fasta/                # Sequence files
│   │   └── homo_sapiens/
│   │       ├── dna/          # Genomic DNA sequences
│   │       ├── cdna/         # Transcriptome (cDNA)
│   │       └── pep/          # Protein sequences
│   ├── gtf/                  # Gene annotation files
│   │   └── homo_sapiens/
│   └── gff3/                 # Alternative annotation format
└── current_fasta/            # Symlink to latest release

Step 4: Find the Genome FASTA Files

To download genomic DNA sequences:

Navigate to: release-110/fasta/homo_sapiens/dna/
You will see several file types:
- Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz - (primary assembly)
- Homo_sapiens.GRCh38.dna.toplevel.fa.gz - Includes alternate haplotypes
- Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz - Individual chromosomes
For this workshop, we download: Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

Download reference genome - chr11 only

cd ~/genomics/references

# Download human chromosome 11 sequence
echo "Downloading chromosome 11 genome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

File Naming Convention

Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

Homo_sapiens - Species name
GRCh38 - Genome assembly version (Genome Reference Consortium Human Build 38)
dna - DNA sequence type
chromosome.11 - Specific chromosome
.fa.gz - FASTA format, gzip compressed

Step 5: Find the GTF Annotation Files

Gene annotations provide coordinates and metadata for genes, transcripts, and exons:

Navigate to: release-110/gtf/homo_sapiens/
The main annotation file is: Homo_sapiens.GRCh38.110.gtf.gz
You may also see:
- Homo_sapiens.GRCh38.110.chr.gtf.gz - Chromosomes only (no scaffolds)
- Homo_sapiens.GRCh38.110.abinitio.gtf.gz - Ab initio predictions

GTF vs GFF3

Both formats contain gene annotations. GTF (Gene Transfer Format) is more commonly used with RNA-seq tools like STAR and featureCounts. GFF3 is the newer standard but not all tools support it equally.

Download reference genome and annotation

cd ~/genomics/references
# Download full GTF annotation
echo "Downloading gene annotation..."
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz

# Extract chromosome 11 annotations only
echo "Extracting chr11 annotations..."
grep -E "^#|^11	" Homo_sapiens.GRCh38.110.gtf > Homo_sapiens.GRCh38.110.chr11.gtf

Step 6: Find the Transcriptome (cDNA) Files

The transcriptome contains sequences of all annotated transcripts:

Navigate to: release-110/fasta/homo_sapiens/cdna/
Download: Homo_sapiens.GRCh38.cdna.all.fa.gz
This file is used for transcript-level quantification with tools like Salmon and Kallisto

Download reference genome and annotation

cd ~/genomics/references
# Download transcriptome and extract chr11 transcripts
echo "Downloading transcriptome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip -c Homo_sapiens.GRCh38.cdna.all.fa.gz | awk '/^>/ {keep = /chromosome:GRCh38:11:/} keep' > transcriptome_chr11.fa

Step 7: Check Release Notes

Before downloading, it's good practice to check release notes for any important changes:

Go to Ensembl News
Review what's new in your chosen release
Check for known issues or updates to gene annotations

Choosing the Right Release

When starting a project, consider:

Consistency: Use the same release for all analyses in a project
Reproducibility: Document the exact release version used
Compatibility: Ensure your tools support the chosen genome build (GRCh38 vs GRCh37)
Publication: Check what version was used in related publications

Part 5: Exploring Genomic File Formats

Explore FASTA format (genome sequence)

cd ~/genomics/references

# View the first 10 lines of the genome file
head -10 Homo_sapiens.GRCh38.dna.chromosome.11.fa

# The first line starts with > and contains the sequence name
# Subsequent lines contain the DNA sequence (A, T, G, C, N)

# Count total bases (excluding header)
grep -v "^>" Homo_sapiens.GRCh38.dna.chromosome.11.fa | tr -d '\n' | wc -c

# Chromosome 11 is approximately 135 million base pairs

FASTA Format

>chromosome_name description
ATGCATGCATGCATGCATGCATGCATGC...
GCTAGCTAGCTAGCTAGCTAGCTAGCTA...

Header lines start with >, followed by sequence on subsequent lines (typically 60-80 characters per line).

Explore GTF format (gene annotation)

cd ~/genomics/references

# View header comments and first few data lines
head -50 Homo_sapiens.GRCh38.110.chr11.gtf

# Count different feature types
echo "Feature types in GTF:"
cut -f3 Homo_sapiens.GRCh38.110.chr11.gtf | grep -v "^#" | sort | uniq -c | sort -rn

# Count genes on chromosome 11
echo "Number of genes:"
awk '$3 == "gene"' Homo_sapiens.GRCh38.110.chr11.gtf | wc -l

# Find a specific gene (e.g., TARDBP - the TDP-43 gene, though it's on chr1)
# Let's look at a chr11 gene: INS (insulin)
grep "gene_name \"INS\"" Homo_sapiens.GRCh38.110.chr11.gtf | head -5

GTF Format - 9 Tab-Separated Columns

Column	Description	Example
1	Chromosome	11
2	Source	ensembl_havana
3	Feature type	gene, transcript, exon, CDS
4	Start position	2159779
5	End position	2161209
6	Score	.
7	Strand	+ or -
8	Frame	0, 1, 2, or .
9	Attributes	gene_id; gene_name; etc.

Part 6 (Optional): Create transcript-to-gene mapping file - will be used later

# Generate transcript ID to gene ID mapping from GTF
# This is needed for tximport later

cd ~/genomics/references

awk -F'\t' '$3=="transcript" {
  if (match($9, /gene_id "[^"]+"/)) {
    gene = substr($9, RSTART, RLENGTH)
    sub(/^gene_id "/, "", gene)
    sub(/"$/, "", gene)
  }
  if (match($9, /transcript_id "[^"]+"/)) {
    tx = substr($9, RSTART, RLENGTH)
    sub(/^transcript_id "/, "", tx)
    sub(/"$/, "", tx)
  }
  if (gene && tx) print tx "\t" gene
}' Homo_sapiens.GRCh38.110.chr11.gtf > tx2gene.tsv


# Add header
printf "transcript_id\tgene_id\n" | cat - tx2gene.tsv > tmp && mv tmp tx2gene.tsv

# View first few lines
head tx2gene.tsv
wc -l tx2gene.tsv

Part 7: Download Workshop Data - Prepare for Lab2

The pre-processed chr11 subset data is provided for the workshop.

Download the FASTQ raw files

# The data is available under the following path

cd ~/genomics

# Run wget to download
wget -bc https://zenodo.org/records/18462432/files/raw_data.zip

unzip raw_data.zip

# Verify the files
ls -lh raw_data/

# You should see 12 files (6 samples × 2 paired-end files):
# KO_1_SRR10045016_1.fastq.gz  KO_1_SRR10045016_2.fastq.gz
# KO_2_SRR10045017_1.fastq.gz  KO_2_SRR10045017_2.fastq.gz
# KO_3_SRR10045018_1.fastq.gz  KO_3_SRR10045018_2.fastq.gz
# WT_1_SRR10045019_1.fastq.gz  WT_1_SRR10045019_2.fastq.gz
# WT_2_SRR10045020_1.fastq.gz  WT_2_SRR10045020_2.fastq.gz
# WT_3_SRR10045021_1.fastq.gz  WT_3_SRR10045021_2.fastq.gz

Alternative: Download Full Dataset from NCBI SRA

Note: This downloads the complete dataset (~20 million reads per sample). Only use this if you have sufficient time and disk space (~50 GB).

# Configure SRA Toolkit (first time only)
mkdir -p ~/ncbi/public
vdb-config --root -s "/repository/user/main/public/root=$HOME/ncbi/public"

# Download all 6 samples using prefetch (faster, downloads .sra files) - more than >100GB of disk is needed
cd ~/genomics/raw_data

# Download in parallel (6 samples simultaneously)
printf "%s\n" SRR10045016 SRR10045017 SRR10045018 SRR10045019 SRR10045020 SRR10045021 \
    | xargs -n 1 -P 6 prefetch --progress

# Convert .sra to FASTQ format (paired-end)
fasterq-dump --split-files --threads 8 --progress SRR*/*.sra

# Rename files to include sample names
mv SRR10045016_1.fastq KO_1_SRR10045016_1.fastq
mv SRR10045016_2.fastq KO_1_SRR10045016_2.fastq
mv SRR10045017_1.fastq KO_2_SRR10045017_1.fastq
mv SRR10045017_2.fastq KO_2_SRR10045017_2.fastq
mv SRR10045018_1.fastq KO_3_SRR10045018_1.fastq
mv SRR10045018_2.fastq KO_3_SRR10045018_2.fastq
mv SRR10045019_1.fastq WT_1_SRR10045019_1.fastq
mv SRR10045019_2.fastq WT_1_SRR10045019_2.fastq
mv SRR10045020_1.fastq WT_2_SRR10045020_1.fastq
mv SRR10045020_2.fastq WT_2_SRR10045020_2.fastq
mv SRR10045021_1.fastq WT_3_SRR10045021_1.fastq
mv SRR10045021_2.fastq WT_3_SRR10045021_2.fastq

# Compress the files
gzip *.fastq

Exercises

Check all files are in place

revisit earlier commands to know what to use.

Exercise 1: Explore the GTF file

Use command-line tools to answer:

How many exons are annotated on chromosome 11?
How many protein-coding genes are there? (Hint: look for gene_biotype "protein_coding")
What is the longest gene on chromosome 11?

Exercise 2: NCBI GEO Exploration

Visit GSE136366 on NCBI GEO and find:

The original publication title and authors
The library preparation method used
Any supplementary files provided by the authors

Summary

In this lab, you have:

Created an organized project directory structure
Learned how to navigate NCBI GEO and SRA to find RNA-seq datasets
Learned how to browse Ensembl to find reference genome and annotation files
Downloaded reference genome and annotation files for chromosome 11
Explored FASTA and GTF file formats

Next: Lab 2 - Hands-on with Transcriptomics Data