Lab 1: Warm-up for Large-scale Analysis & Genomic File Formats
- Set up a structured project directory for RNA-seq analysis
- Navigate NCBI GEO and SRA to find public RNA-seq datasets
- Browse Ensembl to locate and download reference files
- Understand genome reference file formats (FASTA and GTF)
Environment Setup
Make sure your environment is set up; see the setup instructions.
Mac users: if wget is not installed on your system, replace it with curl -O in the commands below, or install wget in your conda environment with conda install wget.
Part 1: Project Directory Setup
A well-organized directory structure is essential for reproducible research.
Create project directories
# Activate your environment
conda activate genomics
# Create main project directory
mkdir ~/genomics
cd ~/genomics
# Create subdirectories for each analysis step
mkdir subsampled_data
mkdir trimmed_data
mkdir references
# You can create multiple folders, including nested ones, in the same command:
mkdir -p qc_reports/fastqc_raw qc_reports/fastqc_trimmed qc_reports/fastp
mkdir alignment
mkdir salmon_quant
mkdir counts
mkdir -p results/tables results/figures results/enrichment
mkdir logs
# View the structure (if tree is not installed, try: find . -type d)
tree
- raw_data/ - Original FASTQ files from sequencing
- subsampled_data/ - Subsampled reads for faster processing
- trimmed_data/ - Quality-trimmed reads
- references/ - Genome, transcriptome, and annotation files
- qc_reports/ - Quality control reports (FastQC, fastp, MultiQC)
- alignment/ - BAM files from STAR alignment
- salmon_quant/ - Transcript quantification results
- counts/ - Gene-level count matrices
- results/ - Final analysis outputs (tables, figures, enrichment)
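One way to confirm that the layout is in place is a small check function. The following is a minimal sketch: the function name `check_layout` is hypothetical, and the directory list is copied from the `mkdir` commands above, so it does not include `raw_data/`.

```shell
# Hypothetical helper: report which expected project directories are missing.
# The directory names mirror the mkdir commands in this part.
check_layout() {
    root="$1"
    missing=0
    for d in subsampled_data trimmed_data references \
             qc_reports/fastqc_raw qc_reports/fastqc_trimmed qc_reports/fastp \
             alignment salmon_quant counts \
             results/tables results/figures results/enrichment logs; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $d"
            missing=$((missing + 1))
        fi
    done
    echo "missing directories: $missing"
}

# Demo in a throwaway directory so nothing under ~/genomics is touched
demo_root=$(mktemp -d)
mkdir -p "$demo_root/references" "$demo_root/logs"
check_layout "$demo_root"
```

On the real project, the call would be `check_layout ~/genomics`, and a complete layout reports zero missing directories.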
Part 2: Finding Raw Genomics Datasets on NCBI
Before working with any public dataset, it's important to know how to find and access it from NCBI. Here's a step-by-step guide to locating GSE136366 and its raw sequencing files.
Step 1: Navigate to NCBI GEO
NCBI GEO (Gene Expression Omnibus) is a public repository for gene expression data, including RNA-seq datasets.
- Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
- You will see the GEO homepage with a search box at the top
Step 2: Search for the Dataset
- In the search box, type the accession number: GSE136366
- Click "Search" or press Enter
- The search results will show the dataset: "TDP-43 regulates LC3ylation..."
- Click on the accession number GSE136366 to view the full record
Step 3: Explore the GEO Series Page
The GEO Series page contains important information about the dataset:
- Title: Full study title and description
- Summary: Experimental design and objectives
- Overall design: Sample information and conditions
- Contributors: Authors and their affiliations
- Citation: Link to the published paper
- Samples: List of all individual samples (GSM accessions)
Step 4: Find the SRA Data
To download the raw FASTQ files, we need to access the SRA (Sequence Read Archive):
- Scroll down to the "Relations" section on the GEO page
- Look for the link labeled "SRA" with accession SRP219898
- Click on the SRA link to navigate to the SRA page
Alternatively, scroll to the "Samples" section and click on any sample (e.g., GSM4047073) to see its individual SRA run.
Step 5: Navigate the SRA Run Selector
The SRA Run Selector allows you to view and download individual sequencing runs:
- On the SRA page, click "Send results to Run selector" (if not already there)
- You will see a table listing all 6 runs (SRR10045016 - SRR10045021)
- The table shows important metadata:
  - Run: The SRR accession number
  - Sample: Sample name from the study
  - LibraryLayout: PAIRED (paired-end sequencing)
  - Bases: Total sequencing bases
  - Spots: Number of spots (for paired-end data, one spot is one read pair)
Step 6: Identify Sample-to-Accession Mapping
From the SRA Run Selector, you can determine which SRR accession corresponds to each sample:
| GEO Sample | SRA Run | Sample Name | Condition |
|---|---|---|---|
| GSM4047073 | SRR10045016 | TDP43_KO1 | Knockout Rep 1 |
| GSM4047074 | SRR10045017 | TDP43_KO2 | Knockout Rep 2 |
| GSM4047075 | SRR10045018 | TDP43_KO3 | Knockout Rep 3 |
| GSM4047076 | SRR10045019 | TDP43_WT1 | Wildtype Rep 1 |
| GSM4047077 | SRR10045020 | TDP43_WT2 | Wildtype Rep 2 |
| GSM4047078 | SRR10045021 | TDP43_WT3 | Wildtype Rep 3 |
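Keeping this mapping in a small tab-separated file makes it easy to look up a sample from the command line. The sketch below is illustrative only: the file name `samples.tsv` and the function `sample_for_run` are hypothetical, and the last column shortens the condition to KO/WT for brevity.

```shell
# Write the Step 6 table to a tab-separated file (hypothetical name: samples.tsv).
# Columns: GEO sample, SRA run, sample name, condition (shortened to KO/WT here).
workdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\n' \
    GSM4047073 SRR10045016 TDP43_KO1 KO \
    GSM4047074 SRR10045017 TDP43_KO2 KO \
    GSM4047075 SRR10045018 TDP43_KO3 KO \
    GSM4047076 SRR10045019 TDP43_WT1 WT \
    GSM4047077 SRR10045020 TDP43_WT2 WT \
    GSM4047078 SRR10045021 TDP43_WT3 WT \
    > "$workdir/samples.tsv"

# Hypothetical helper: print the sample name for a given SRR accession
sample_for_run() {
    awk -F'\t' -v run="$1" '$2 == run { print $3 }' "$2"
}

sample_for_run SRR10045019 "$workdir/samples.tsv"

# Count replicates per condition
cut -f4 "$workdir/samples.tsv" | sort | uniq -c
```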
Step 7: Download Options
There are several ways to download the FASTQ files, including:
Option A: Using SRA Toolkit (Recommended for large files)
Use the command-line tools prefetch and fasterq-dump as shown in the "Alternative: Download Full Dataset from NCBI SRA" section below.
Option B: Direct Download from ENA
The European Nucleotide Archive (ENA) often provides direct FASTQ download links:
- Go to ENA
- Search for the SRR accession
- Click "FASTQ files (FTP)" to download directly
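The same FASTQ links can usually be listed from the command line. The sketch below only builds a query URL for the ENA Portal API `filereport` endpoint; the function name is hypothetical, and the chosen fields (`run_accession`, `fastq_ftp`, `fastq_md5`) are an assumption about what is useful here, so check the ENA documentation before relying on them.

```shell
# Hypothetical helper: build an ENA Portal API file-report URL for one run accession.
# Nothing is downloaded here; the function only prints the URL.
ena_filereport_url() {
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp,fastq_md5&format=tsv\n' "$1"
}

ena_filereport_url SRR10045016

# To fetch the report itself (requires network access):
#   curl -s "$(ena_filereport_url SRR10045016)"
```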
Before downloading any dataset, review:
- Sequencing platform (Illumina, PacBio, Nanopore, etc.)
- Library layout (single-end vs paired-end)
- Read length and total read count
- Any quality or processing notes
This information affects your analysis pipeline choices!
Part 3: Finding Reference Genome Files on Ensembl
Ensembl is a genome browser and database that provides reference sequences and annotations for many species. Here's how to navigate Ensembl to find the files we need.
Step 1: Navigate to Ensembl
- Open your browser and go to https://www.ensembl.org
- The homepage shows featured species and a search box
- Click on "Human" in the popular species section, or search for "Homo sapiens"
Step 2: Access the FTP Download Site
Reference files are available through Ensembl's FTP server:
- From the Ensembl homepage, click "Downloads" in the top navigation menu
- Select "Download data via FTP"
- Alternatively, go directly to: https://ftp.ensembl.org/pub/
Step 3: Understand the FTP Directory Structure
The Ensembl FTP site is organized by release version and data type:
ftp.ensembl.org/pub/
├── release-110/              # Release version (we use 110)
│   ├── fasta/                # Sequence files
│   │   └── homo_sapiens/
│   │       ├── dna/          # Genomic DNA sequences
│   │       ├── cdna/         # Transcriptome (cDNA)
│   │       └── pep/          # Protein sequences
│   ├── gtf/                  # Gene annotation files
│   │   └── homo_sapiens/
│   └── gff3/                 # Alternative annotation format
└── current_fasta/            # Symlink to latest release
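Because the layout is regular, directory URLs can be assembled from the release number, data type, and species. A minimal sketch, assuming the structure shown above; the function name `ensembl_dir_url` is hypothetical.

```shell
# Hypothetical helper: build an Ensembl FTP directory URL.
# Arguments: release number, data type (fasta or gtf), species, optional subdirectory.
ensembl_dir_url() {
    url="https://ftp.ensembl.org/pub/release-$1/$2/$3/"
    if [ -n "${4:-}" ]; then
        url="$url$4/"
    fi
    echo "$url"
}

ensembl_dir_url 110 fasta homo_sapiens dna
ensembl_dir_url 110 gtf homo_sapiens
```

Keeping the release number as an explicit argument makes it harder to mix files from different releases by accident.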
Step 4: Find the Genome FASTA Files
To download genomic DNA sequences:
- Navigate to: release-110/fasta/homo_sapiens/dna/
- You will see several file types:
  - Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz - Primary assembly
  - Homo_sapiens.GRCh38.dna.toplevel.fa.gz - Includes alternate haplotypes
  - Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz - Individual chromosomes
- For this workshop, we download: Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Download reference genome - chr11 only
cd ~/genomics/references
# Download human chromosome 11 sequence
echo "Downloading chromosome 11 genome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
File name breakdown for Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz:
- Homo_sapiens - Species name
- GRCh38 - Genome assembly version (Genome Reference Consortium Human Build 38)
- dna - DNA sequence type
- chromosome.11 - Specific chromosome
- .fa.gz - FASTA format, gzip compressed
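The same breakdown can be done mechanically by splitting the file name on dots. This is only a sketch: the function name `describe_reference` is hypothetical, and it assumes the seven-part pattern of the per-chromosome files, so other Ensembl file names would need different handling.

```shell
# Hypothetical helper: split a per-chromosome Ensembl FASTA file name into its parts.
# Assumes the pattern Species.Assembly.seqtype.chromosome.N.fa.gz (7 dot-separated fields).
describe_reference() {
    echo "$1" | awk -F'[.]' '{
        printf "species=%s\n", $1
        printf "assembly=%s\n", $2
        printf "seqtype=%s\n", $3
        printf "region=%s.%s\n", $4, $5
        printf "format=%s\n", $6
        printf "compression=%s\n", $7
    }'
}

describe_reference Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
```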
Step 5: Find the GTF Annotation Files
Gene annotations provide coordinates and metadata for genes, transcripts, and exons:
- Navigate to: release-110/gtf/homo_sapiens/
- The main annotation file is: Homo_sapiens.GRCh38.110.gtf.gz
- You may also see:
  - Homo_sapiens.GRCh38.110.chr.gtf.gz - Chromosomes only (no scaffolds)
  - Homo_sapiens.GRCh38.110.abinitio.gtf.gz - Ab initio predictions
GTF vs GFF3: both formats contain gene annotations. GTF (Gene Transfer Format) is more commonly used with RNA-seq tools like STAR and featureCounts. GFF3 is the newer standard, but not all tools support it equally.
Download gene annotation and extract chr11
cd ~/genomics/references
# Download full GTF annotation
echo "Downloading gene annotation..."
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz
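# Extract chromosome 11 annotations only
# (keep the header comments; column 1 must be exactly "11" in the tab-separated file)
echo "Extracting chr11 annotations..."
awk -F'\t' '/^#/ || $1 == "11"' Homo_sapiens.GRCh38.110.gtf > Homo_sapiens.GRCh38.110.chr11.gtf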
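After filtering, it is worth confirming which sequence names remain in column 1. The sketch below runs on a toy GTF with made-up gene IDs; the filter uses awk with an exact match on column 1, which is one way to express the chr11 selection, and the helper name `count_by_seqname` is hypothetical.

```shell
tmp=$(mktemp -d)

# Toy GTF: one header comment plus three tab-separated records (IDs are made up)
printf '#!genome-build GRCh38\n' > "$tmp/toy.gtf"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    1  havana gene 100 200 . + . 'gene_id "G1";' \
    11 havana gene 300 400 . - . 'gene_id "G2";' \
    11 havana exon 300 350 . - . 'gene_id "G2";' \
    >> "$tmp/toy.gtf"

# Keep header comments and records whose first column is exactly "11"
awk -F'\t' '/^#/ || $1 == "11"' "$tmp/toy.gtf" > "$tmp/toy.chr11.gtf"

# Hypothetical helper: tabulate records per sequence name (name, then count)
count_by_seqname() {
    grep -v '^#' "$1" | cut -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

count_by_seqname "$tmp/toy.chr11.gtf"
```

On the real chr11 file, the only sequence name listed should be 11.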
Step 6: Find the Transcriptome (cDNA) Files
The transcriptome contains sequences of all annotated transcripts:
- Navigate to: release-110/fasta/homo_sapiens/cdna/
- Download: Homo_sapiens.GRCh38.cdna.all.fa.gz
- This file is used for transcript-level quantification with tools like Salmon and Kallisto
Download transcriptome and extract chr11 transcripts
cd ~/genomics/references
# Download transcriptome and extract chr11 transcripts
echo "Downloading transcriptome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip -c Homo_sapiens.GRCh38.cdna.all.fa.gz | awk '/^>/ {keep = /chromosome:GRCh38:11:/} keep' > transcriptome_chr11.fa
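The awk filter works by setting a flag on every header line and printing lines only while the flag is true. A toy example makes this easier to see; the transcript names and sequences below are invented, and only the `chromosome:GRCh38:11:` part of the header matters to the filter.

```shell
tmp=$(mktemp -d)

# Toy cDNA FASTA with made-up records on chromosomes 11, 1, and 11
cat > "$tmp/toy_cdna.fa" <<'EOF'
>TX_A cdna chromosome:GRCh38:11:100:200:1 gene:GENE_A
ACGTACGT
ACGT
>TX_B cdna chromosome:GRCh38:1:500:900:-1 gene:GENE_B
TTTTGGGG
>TX_C cdna chromosome:GRCh38:11:300:700:1 gene:GENE_C
CCCCAAAA
EOF

# On each header line, keep becomes 1 if the header mentions chromosome 11, else 0.
# The bare pattern "keep" then prints every line (header or sequence) while it is 1.
filter_chr11() {
    awk '/^>/ { keep = /chromosome:GRCh38:11:/ } keep' "$1"
}

filter_chr11 "$tmp/toy_cdna.fa"

# Count how many records were kept
filter_chr11 "$tmp/toy_cdna.fa" | grep -c '^>'
```

Note that the trailing colon in the pattern prevents chromosome 1 headers from matching.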
Step 7: Check Release Notes
Before downloading, it's good practice to check release notes for any important changes:
- Go to Ensembl News
- Review what's new in your chosen release
- Check for known issues or updates to gene annotations
When starting a project, consider:
- Consistency: Use the same release for all analyses in a project
- Reproducibility: Document the exact release version used
- Compatibility: Ensure your tools support the chosen genome build (GRCh38 vs GRCh37)
- Publication: Check what version was used in related publications
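For the reproducibility point, one lightweight habit is to log each reference file together with its release and a checksum. This is a sketch under assumptions: the function `record_reference` and the log name `reference_versions.tsv` are hypothetical, and `cksum` is used only because it is available on most systems.

```shell
# Hypothetical helper: append file name, release, checksum, and date to a log.
# Arguments: file, Ensembl release number, log file.
record_reference() {
    file="$1"
    release="$2"
    log="$3"
    checksum=$(cksum < "$file" | awk '{ print $1 }')
    printf '%s\t%s\t%s\t%s\n' \
        "$(basename "$file")" "release-$release" "$checksum" "$(date +%Y-%m-%d)" >> "$log"
}

# Demo on a toy file in a throwaway directory
tmp=$(mktemp -d)
printf '>toy\nACGT\n' > "$tmp/toy.fa"
record_reference "$tmp/toy.fa" 110 "$tmp/reference_versions.tsv"
cat "$tmp/reference_versions.tsv"
```

In the project, the log could live in `~/genomics/logs/` so the release information travels with the analysis.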
Part 5: Exploring Genomic File Formats
Explore FASTA format (genome sequence)
cd ~/genomics/references
# View the first 10 lines of the genome file
head -10 Homo_sapiens.GRCh38.dna.chromosome.11.fa
# The first line starts with > and contains the sequence name
# Subsequent lines contain the DNA sequence (A, T, G, C, N)
# Count total bases (excluding header)
grep -v "^>" Homo_sapiens.GRCh38.dna.chromosome.11.fa | tr -d '\n' | wc -c
# Chromosome 11 is approximately 135 million base pairs
>chromosome_name description
ATGCATGCATGCATGCATGCATGCATGC...
GCTAGCTAGCTAGCTAGCTAGCTAGCTA...
Header lines start with >, followed by sequence on subsequent lines (typically 60-80 characters per line).
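Since a sequence can span many lines, per-record lengths have to be accumulated between headers. A minimal sketch on a toy FASTA (the records are invented, and the function name `fasta_lengths` is hypothetical):

```shell
# Hypothetical helper: print each record name and its total sequence length.
fasta_lengths() {
    awk '/^>/ {
             if (name != "") print name "\t" len   # flush the previous record
             name = substr($1, 2)                  # drop the leading ">"
             len = 0
             next
         }
         { len += length($0) }                     # add up sequence lines
         END { if (name != "") print name "\t" len }' "$1"
}

tmp=$(mktemp -d)
cat > "$tmp/toy.fa" <<'EOF'
>seqA toy description
ACGTACGT
ACGT
>seqB
NNNNAC
EOF

fasta_lengths "$tmp/toy.fa"
```

Run on the chromosome 11 file, this should report a single record, with a length matching the base count from the `wc -c` command above.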
Explore GTF format (gene annotation)
cd ~/genomics/references
# View header comments and first few data lines
head -50 Homo_sapiens.GRCh38.110.chr11.gtf
# Count different feature types
echo "Feature types in GTF:"
cut -f3 Homo_sapiens.GRCh38.110.chr11.gtf | grep -v "^#" | sort | uniq -c | sort -rn
# Count genes on chromosome 11
echo "Number of genes:"
awk '$3 == "gene"' Homo_sapiens.GRCh38.110.chr11.gtf | wc -l
# Find a specific gene. TARDBP (the TDP-43 gene) is on chr1, so it is not in this file;
# instead, look at a chr11 gene: INS (insulin)
grep "gene_name \"INS\"" Homo_sapiens.GRCh38.110.chr11.gtf | head -5
| Column | Description | Example |
|---|---|---|
| 1 | Chromosome | 11 |
| 2 | Source | ensembl_havana |
| 3 | Feature type | gene, transcript, exon, CDS |
| 4 | Start position | 2159779 |
| 5 | End position | 2161209 |
| 6 | Score | . |
| 7 | Strand | + or - |
| 8 | Frame | 0, 1, 2, or . |
| 9 | Attributes | gene_id; gene_name; etc. |
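To connect the columns to something concrete, the sketch below parses one GTF record built from the example values in the table. The `gene_id` value is a placeholder, the strand is chosen for illustration, and the function name `describe_feature` is hypothetical. Because GTF coordinates are 1-based and inclusive, the feature length is end - start + 1.

```shell
tmp=$(mktemp -d)

# One tab-separated GTF record using the example values from the table
# (gene_id is a placeholder, not the real Ensembl identifier)
printf '11\tensembl_havana\tgene\t2159779\t2161209\t.\t-\t.\tgene_id "GENE_X"; gene_name "INS"; gene_biotype "protein_coding";\n' \
    > "$tmp/one_gene.gtf"

# Hypothetical helper: print gene name, feature type, location, and length
describe_feature() {
    awk -F'\t' '!/^#/ {
        name = "NA"
        if (match($9, /gene_name "[^"]+"/)) {
            # skip the 11 characters of the prefix and drop the closing quote
            name = substr($9, RSTART + 11, RLENGTH - 12)
        }
        print name "\t" $3 "\t" $1 ":" $4 "-" $5 "\t" ($5 - $4 + 1)
    }' "$1"
}

describe_feature "$tmp/one_gene.gtf"
```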
Part 6 (Optional): Create a transcript-to-gene mapping file (used later)
# Generate transcript ID to gene ID mapping from GTF
# This is needed for tximport later
cd ~/genomics/references
awk -F'\t' '$3=="transcript" {
    # Reset both values so nothing carries over from a previous line
    gene = ""; tx = ""
    if (match($9, /gene_id "[^"]+"/)) {
        gene = substr($9, RSTART, RLENGTH)
        sub(/^gene_id "/, "", gene)
        sub(/"$/, "", gene)
    }
    if (match($9, /transcript_id "[^"]+"/)) {
        tx = substr($9, RSTART, RLENGTH)
        sub(/^transcript_id "/, "", tx)
        sub(/"$/, "", tx)
    }
    # Print only when both IDs were found on this line
    if (gene != "" && tx != "") print tx "\t" gene
}' Homo_sapiens.GRCh38.110.chr11.gtf > tx2gene.tsv
# Add header
printf "transcript_id\tgene_id\n" | cat - tx2gene.tsv > tmp && mv tmp tx2gene.tsv
# View first few lines
head tx2gene.tsv
wc -l tx2gene.tsv
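A quick structural check on the mapping file can catch problems before it is used downstream. The sketch below assumes a two-column, tab-separated file with one header line, as produced above; the function name `check_tx2gene` and the toy IDs are hypothetical.

```shell
# Hypothetical helper: count data rows, rows without exactly 2 fields,
# and transcript IDs that appear more than once.
check_tx2gene() {
    awk -F'\t' 'NR == 1 { next }       # skip the header line
        NF != 2 { bad++ }
        seen[$1]++ { dup++ }           # true from the second occurrence onward
        END { printf "rows=%d malformed=%d duplicate_tx=%d\n", NR - 1, bad, dup }' "$1"
}

# Toy mapping with one duplicated transcript ID
tmp=$(mktemp -d)
printf 'transcript_id\tgene_id\n' > "$tmp/toy_tx2gene.tsv"
printf '%s\t%s\n' TX1 G1 TX2 G1 TX1 G1 >> "$tmp/toy_tx2gene.tsv"

check_tx2gene "$tmp/toy_tx2gene.tsv"
```

For a clean file, both `malformed` and `duplicate_tx` should be 0.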
Part 7: Download Workshop Data - Prepare for Lab 2
The pre-processed chr11 subset data is provided for the workshop.
Download the FASTQ raw files
# The data is available under the following path
cd ~/genomics
# Run wget in the foreground so the download finishes before unzip
# (-c resumes a partial download)
wget -c https://zenodo.org/records/18462432/files/raw_data.zip
unzip raw_data.zip
# Verify the files
ls -lh raw_data/
# You should see 12 files (6 samples × 2 paired-end files):
# KO_1_SRR10045016_1.fastq.gz KO_1_SRR10045016_2.fastq.gz
# KO_2_SRR10045017_1.fastq.gz KO_2_SRR10045017_2.fastq.gz
# KO_3_SRR10045018_1.fastq.gz KO_3_SRR10045018_2.fastq.gz
# WT_1_SRR10045019_1.fastq.gz WT_1_SRR10045019_2.fastq.gz
# WT_2_SRR10045020_1.fastq.gz WT_2_SRR10045020_2.fastq.gz
# WT_3_SRR10045021_1.fastq.gz WT_3_SRR10045021_2.fastq.gz
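Beyond counting files by eye, a short loop can confirm that every `_1` file has its `_2` mate. This sketch runs on empty placeholder files in a temporary directory; the function name `check_pairs` is hypothetical, and it assumes the `_1.fastq.gz` / `_2.fastq.gz` naming shown above.

```shell
# Hypothetical helper: report read-1 files that have no matching read-2 file.
check_pairs() {
    dir="$1"
    unpaired=0
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # no matches at all
        r2="${r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "no mate for $(basename "$r1")"
            unpaired=$((unpaired + 1))
        fi
    done
    echo "unpaired: $unpaired"
}

# Demo with empty placeholder files: one complete pair, one file missing its mate
demo=$(mktemp -d)
touch "$demo/KO_1_SRR10045016_1.fastq.gz" "$demo/KO_1_SRR10045016_2.fastq.gz"
touch "$demo/WT_1_SRR10045019_1.fastq.gz"
check_pairs "$demo"
```

On the workshop data, `check_pairs ~/genomics/raw_data` should report `unpaired: 0`.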
Alternative: Download Full Dataset from NCBI SRA
# Configure SRA Toolkit (first time only)
mkdir -p ~/ncbi/public
vdb-config --root -s "/repository/user/main/public/root=$HOME/ncbi/public"
# Download all 6 samples using prefetch (faster, downloads .sra files); more than 100 GB of disk space is needed
mkdir -p ~/genomics/raw_data
cd ~/genomics/raw_data
# Download in parallel (6 samples simultaneously)
printf "%s\n" SRR10045016 SRR10045017 SRR10045018 SRR10045019 SRR10045020 SRR10045021 \
| xargs -n 1 -P 6 prefetch --progress
# Convert .sra to FASTQ format (paired-end), one run at a time
for sra in SRR*/*.sra; do
    fasterq-dump --split-files --threads 8 --progress "$sra"
done
# Rename files to include sample names
mv SRR10045016_1.fastq KO_1_SRR10045016_1.fastq
mv SRR10045016_2.fastq KO_1_SRR10045016_2.fastq
mv SRR10045017_1.fastq KO_2_SRR10045017_1.fastq
mv SRR10045017_2.fastq KO_2_SRR10045017_2.fastq
mv SRR10045018_1.fastq KO_3_SRR10045018_1.fastq
mv SRR10045018_2.fastq KO_3_SRR10045018_2.fastq
mv SRR10045019_1.fastq WT_1_SRR10045019_1.fastq
mv SRR10045019_2.fastq WT_1_SRR10045019_2.fastq
mv SRR10045020_1.fastq WT_2_SRR10045020_1.fastq
mv SRR10045020_2.fastq WT_2_SRR10045020_2.fastq
mv SRR10045021_1.fastq WT_3_SRR10045021_1.fastq
mv SRR10045021_2.fastq WT_3_SRR10045021_2.fastq
# Compress the files
gzip *.fastq
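The twelve `mv` commands follow one pattern, so they can also be generated from the run-to-sample mapping. The sketch below only prints the commands for review (a dry run); the function name `rename_commands` is hypothetical, and the prefixes are the ones used in the file names above.

```shell
# Hypothetical helper: print one mv command per run and mate, without executing it.
rename_commands() {
    while read -r run prefix; do
        for mate in 1 2; do
            echo "mv ${run}_${mate}.fastq ${prefix}_${run}_${mate}.fastq"
        done
    done <<'EOF'
SRR10045016 KO_1
SRR10045017 KO_2
SRR10045018 KO_3
SRR10045019 WT_1
SRR10045020 WT_2
SRR10045021 WT_3
EOF
}

# Print the 12 commands for inspection
rename_commands

# Once the list looks right, it could be executed in the FASTQ directory with:
#   rename_commands | sh
```

Generating the commands from a single table keeps the sample labels in one place, which reduces the chance of a typo in one of the twelve lines.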
Exercises
Check that all files are in place.
Revisit the earlier commands to work out what to use.
Exercise 1: Explore the GTF file
Use command-line tools to answer:
- How many exons are annotated on chromosome 11?
- How many protein-coding genes are there? (Hint: look for gene_biotype "protein_coding")
- What is the longest gene on chromosome 11?
Exercise 2: NCBI GEO Exploration
Visit GSE136366 on NCBI GEO and find:
- The original publication title and authors
- The library preparation method used
- Any supplementary files provided by the authors
Summary
In this lab, you have:
- Created an organized project directory structure
- Learned how to navigate NCBI GEO and SRA to find RNA-seq datasets
- Learned how to browse Ensembl to find reference genome and annotation files
- Downloaded reference genome and annotation files for chromosome 11
- Explored FASTA and GTF file formats