Lab 1: Warm-up for Large-scale Analysis & Genomic File Formats

Learning Objectives

Environment Setup

Make sure you have the environment setup, see setup instructions

Mac Users: if wget is not installed on your system please replacte it with curl -O in the commands below. or install wget in your conda environmentconda install wget

Part 1: Project Directory Setup

A well-organized directory structure is essential for reproducible research.

Create project directories

# Activate your environment
conda activate genomics

# Create main project directory
mkdir ~/genomics
cd ~/genomics

# Create subdirectories for each analysis step
mkdir subsampled_data
mkdir trimmed_data
mkdir references
#you can create multiple folders in the same command:
mkdir -p qc_reports/fastqc_raw qc_reports/fastqc_trimmed qc_reports/fastp
mkdir alignment
mkdir salmon_quant
mkdir counts
mkdir -p results/tables results/figures results/enrichment
mkdir logs

# View the structure
tree
Directory Structure Explained
  • raw_data/ - Original FASTQ files from sequencing
  • subsampled_data/ - Subsampled reads for faster processing
  • trimmed_data/ - Quality-trimmed reads
  • references/ - Genome, transcriptome, and annotation files
  • qc_reports/ - Quality control reports (FastQC, fastp, MultiQC)
  • alignment/ - BAM files from STAR alignment
  • salmon_quant/ - Transcript quantification results
  • counts/ - Gene-level count matrices
  • results/ - Final analysis outputs (tables, figures, enrichment)

Part 2: Finding Raw Genomics Datasets on NCBI

Before working with any public dataset, it's important to know how to find and access it from NCBI. Here's a step-by-step guide to locating GSE136366 and its raw sequencing files.

Step 1: Navigate to NCBI GEO

NCBI GEO (Gene Expression Omnibus) is a public repository for gene expression data, including RNA-seq datasets.

  1. Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
  2. You will see the GEO homepage with a search box at the top

Step 2: Search for the Dataset

  1. In the search box, type the accession number: GSE136366
  2. Click "Search" or press Enter
  3. The search results will show the dataset: "TDP-43 regulates LC3ylation..."
  4. Click on the accession number GSE136366 to view the full record

Step 3: Explore the GEO Series Page

The GEO Series page contains important information about the dataset:

  • Title: Full study title and description
  • Summary: Experimental design and objectives
  • Overall design: Sample information and conditions
  • Contributors: Authors and their affiliations
  • Citation: Link to the published paper
  • Samples: List of all individual samples (GSM accessions)

Step 4: Find the SRA Data

To download the raw FASTQ files, we need to access the SRA (Sequence Read Archive):

  1. Scroll down to the "Relations" section on the GEO page
  2. Look for the link labeled "SRA" with accession SRP219898
  3. Click on the SRA link to navigate to the SRA page

Alternatively, scroll to the "Samples" section and click on any sample (e.g., GSM4047073) to see its individual SRA run.

Step 5: Navigate the SRA Run Selector

The SRA Run Selector allows you to view and download individual sequencing runs:

  1. On the SRA page, click "Send results to Run selector" (if not already there)
  2. You will see a table listing all 6 runs (SRR10045016 - SRR10045021)
  3. The table shows important metadata:
    • Run: The SRR accession number
    • Sample: Sample name from the study
    • LibraryLayout: PAIRED (paired-end sequencing)
    • Bases: Total sequencing bases
    • Spots: Number of reads

Step 6: Identify Sample-to-Accession Mapping

From the SRA Run Selector, you can determine which SRR accession corresponds to each sample:

GEO Sample SRA Run Sample Name Condition
GSM4047073SRR10045016TDP43_KO1Knockout Rep 1
GSM4047074SRR10045017TDP43_KO2Knockout Rep 2
GSM4047075SRR10045018TDP43_KO3Knockout Rep 3
GSM4047076SRR10045019TDP43_WT1Wildtype Rep 1
GSM4047077SRR10045020TDP43_WT2Wildtype Rep 2
GSM4047078SRR10045021TDP43_WT3Wildtype Rep 3

Step 7: Download Options

There are several ways to download the FASTQ files, including:

Option A: Using SRA Toolkit (Recommended for large files)

Use command-line tools prefetch and fasterq-dump as shown in the "Alternative" section below.

Option B: Direct Download from ENA

The European Nucleotide Archive (ENA) often provides direct FASTQ download links:

  1. Go to ENA
  2. Search for the SRR accession
  3. Click "FASTQ files (FTP)" to download directly
Tip: Always Check the Metadata!

Before downloading any dataset, review:

  • Sequencing platform (Illumina, PacBio, Nanopore, etc.)
  • Library layout (single-end vs paired-end)
  • Read length and total read count
  • Any quality or processing notes

This information affects your analysis pipeline choices!

Part 3: Finding Reference Genome Files on Ensembl

Ensembl is a genome browser and database that provides reference sequences and annotations for many species. Here's how to navigate Ensembl to find the files we need.

Step 1: Navigate to Ensembl

  1. Open your browser and go to https://www.ensembl.org
  2. The homepage shows featured species and a search box
  3. Click on "Human" in the popular species section, or search for "Homo sapiens"

Step 2: Access the FTP Download Site

Reference files are available through Ensembl's FTP server:

  1. From the Ensembl homepage, click "Downloads" in the top navigation menu
  2. Select "Download data via FTP"
  3. Alternatively, go directly to: https://ftp.ensembl.org/pub/

Step 3: Understand the FTP Directory Structure

The Ensembl FTP site is organized by release version and data type:

ftp.ensembl.org/pub/
├── release-110/              # Release version (we use 110)
│   ├── fasta/                # Sequence files
│   │   └── homo_sapiens/
│   │       ├── dna/          # Genomic DNA sequences
│   │       ├── cdna/         # Transcriptome (cDNA)
│   │       └── pep/          # Protein sequences
│   ├── gtf/                  # Gene annotation files
│   │   └── homo_sapiens/
│   └── gff3/                 # Alternative annotation format
└── current_fasta/            # Symlink to latest release

Step 4: Find the Genome FASTA Files

To download genomic DNA sequences:

  1. Navigate to: release-110/fasta/homo_sapiens/dna/
  2. You will see several file types:
    • Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz - (primary assembly)
    • Homo_sapiens.GRCh38.dna.toplevel.fa.gz - Includes alternate haplotypes
    • Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz - Individual chromosomes
  3. For this workshop, we download: Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

Download reference genome - chr11 only

cd ~/genomics/references

# Download human chromosome 11 sequence
echo "Downloading chromosome 11 genome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
gunzip Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
File Naming Convention

Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

  • Homo_sapiens - Species name
  • GRCh38 - Genome assembly version (Genome Reference Consortium Human Build 38)
  • dna - DNA sequence type
  • chromosome.11 - Specific chromosome
  • .fa.gz - FASTA format, gzip compressed

Step 5: Find the GTF Annotation Files

Gene annotations provide coordinates and metadata for genes, transcripts, and exons:

  1. Navigate to: release-110/gtf/homo_sapiens/
  2. The main annotation file is: Homo_sapiens.GRCh38.110.gtf.gz
  3. You may also see:
    • Homo_sapiens.GRCh38.110.chr.gtf.gz - Chromosomes only (no scaffolds)
    • Homo_sapiens.GRCh38.110.abinitio.gtf.gz - Ab initio predictions
GTF vs GFF3

Both formats contain gene annotations. GTF (Gene Transfer Format) is more commonly used with RNA-seq tools like STAR and featureCounts. GFF3 is the newer standard but not all tools support it equally.

Download reference genome and annotation

cd ~/genomics/references
# Download full GTF annotation
echo "Downloading gene annotation..."
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz

# Extract chromosome 11 annotations only
echo "Extracting chr11 annotations..."
grep -E "^#|^11	" Homo_sapiens.GRCh38.110.gtf > Homo_sapiens.GRCh38.110.chr11.gtf
                        
                    

Step 6: Find the Transcriptome (cDNA) Files

The transcriptome contains sequences of all annotated transcripts:

  1. Navigate to: release-110/fasta/homo_sapiens/cdna/
  2. Download: Homo_sapiens.GRCh38.cdna.all.fa.gz
  3. This file is used for transcript-level quantification with tools like Salmon and Kallisto

Download reference genome and annotation

cd ~/genomics/references
# Download transcriptome and extract chr11 transcripts
echo "Downloading transcriptome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
gunzip -c Homo_sapiens.GRCh38.cdna.all.fa.gz | awk '/^>/ {keep = /chromosome:GRCh38:11:/} keep' > transcriptome_chr11.fa
                

Step 7: Check Release Notes

Before downloading, it's good practice to check release notes for any important changes:

  1. Go to Ensembl News
  2. Review what's new in your chosen release
  3. Check for known issues or updates to gene annotations
Choosing the Right Release

When starting a project, consider:

  • Consistency: Use the same release for all analyses in a project
  • Reproducibility: Document the exact release version used
  • Compatibility: Ensure your tools support the chosen genome build (GRCh38 vs GRCh37)
  • Publication: Check what version was used in related publications

Part 5: Exploring Genomic File Formats

Explore FASTA format (genome sequence)

cd ~/genomics/references

# View the first 10 lines of the genome file
head -10 Homo_sapiens.GRCh38.dna.chromosome.11.fa

# The first line starts with > and contains the sequence name
# Subsequent lines contain the DNA sequence (A, T, G, C, N)

# Count total bases (excluding header)
grep -v "^>" Homo_sapiens.GRCh38.dna.chromosome.11.fa | tr -d '\n' | wc -c

# Chromosome 11 is approximately 135 million base pairs
FASTA Format
>chromosome_name description
ATGCATGCATGCATGCATGCATGCATGC...
GCTAGCTAGCTAGCTAGCTAGCTAGCTA...

Header lines start with >, followed by sequence on subsequent lines (typically 60-80 characters per line).

Explore GTF format (gene annotation)

cd ~/genomics/references

# View header comments and first few data lines
head -50 Homo_sapiens.GRCh38.110.chr11.gtf

# Count different feature types
echo "Feature types in GTF:"
cut -f3 Homo_sapiens.GRCh38.110.chr11.gtf | grep -v "^#" | sort | uniq -c | sort -rn

# Count genes on chromosome 11
echo "Number of genes:"
awk '$3 == "gene"' Homo_sapiens.GRCh38.110.chr11.gtf | wc -l

# Find a specific gene (e.g., TARDBP - the TDP-43 gene, though it's on chr1)
# Let's look at a chr11 gene: INS (insulin)
grep "gene_name \"INS\"" Homo_sapiens.GRCh38.110.chr11.gtf | head -5
GTF Format - 9 Tab-Separated Columns
ColumnDescriptionExample
1Chromosome11
2Sourceensembl_havana
3Feature typegene, transcript, exon, CDS
4Start position2159779
5End position2161209
6Score.
7Strand+ or -
8Frame0, 1, 2, or .
9Attributesgene_id; gene_name; etc.

Part 6 (Optional): Create transcript-to-gene mapping file - will be used later

# Generate transcript ID to gene ID mapping from GTF
# This is needed for tximport later

cd ~/genomics/references

awk -F'\t' '$3=="transcript" {
  if (match($9, /gene_id "[^"]+"/)) {
    gene = substr($9, RSTART, RLENGTH)
    sub(/^gene_id "/, "", gene)
    sub(/"$/, "", gene)
  }
  if (match($9, /transcript_id "[^"]+"/)) {
    tx = substr($9, RSTART, RLENGTH)
    sub(/^transcript_id "/, "", tx)
    sub(/"$/, "", tx)
  }
  if (gene && tx) print tx "\t" gene
}' Homo_sapiens.GRCh38.110.chr11.gtf > tx2gene.tsv


# Add header
printf "transcript_id\tgene_id\n" | cat - tx2gene.tsv > tmp && mv tmp tx2gene.tsv

# View first few lines
head tx2gene.tsv
wc -l tx2gene.tsv

Part 7: Download Workshop Data - Prepare for Lab2

The pre-processed chr11 subset data is provided for the workshop.

Download the FASTQ raw files

# The data is available under the following path

cd ~/genomics

# Run wget to download
wget -bc https://zenodo.org/records/18462432/files/raw_data.zip

unzip raw_data.zip

# Verify the files
ls -lh raw_data/

# You should see 12 files (6 samples × 2 paired-end files):
# KO_1_SRR10045016_1.fastq.gz  KO_1_SRR10045016_2.fastq.gz
# KO_2_SRR10045017_1.fastq.gz  KO_2_SRR10045017_2.fastq.gz
# KO_3_SRR10045018_1.fastq.gz  KO_3_SRR10045018_2.fastq.gz
# WT_1_SRR10045019_1.fastq.gz  WT_1_SRR10045019_2.fastq.gz
# WT_2_SRR10045020_1.fastq.gz  WT_2_SRR10045020_2.fastq.gz
# WT_3_SRR10045021_1.fastq.gz  WT_3_SRR10045021_2.fastq.gz

Alternative: Download Full Dataset from NCBI SRA

Note: This downloads the complete dataset (~20 million reads per sample). Only use this if you have sufficient time and disk space (~50 GB).
# Configure SRA Toolkit (first time only)
mkdir -p ~/ncbi/public
vdb-config --root -s "/repository/user/main/public/root=$HOME/ncbi/public"

# Download all 6 samples using prefetch (faster, downloads .sra files) - more than >100GB of disk is needed
cd ~/genomics/raw_data

# Download in parallel (6 samples simultaneously)
printf "%s\n" SRR10045016 SRR10045017 SRR10045018 SRR10045019 SRR10045020 SRR10045021 \
    | xargs -n 1 -P 6 prefetch --progress

# Convert .sra to FASTQ format (paired-end)
fasterq-dump --split-files --threads 8 --progress SRR*/*.sra

# Rename files to include sample names
mv SRR10045016_1.fastq KO_1_SRR10045016_1.fastq
mv SRR10045016_2.fastq KO_1_SRR10045016_2.fastq
mv SRR10045017_1.fastq KO_2_SRR10045017_1.fastq
mv SRR10045017_2.fastq KO_2_SRR10045017_2.fastq
mv SRR10045018_1.fastq KO_3_SRR10045018_1.fastq
mv SRR10045018_2.fastq KO_3_SRR10045018_2.fastq
mv SRR10045019_1.fastq WT_1_SRR10045019_1.fastq
mv SRR10045019_2.fastq WT_1_SRR10045019_2.fastq
mv SRR10045020_1.fastq WT_2_SRR10045020_1.fastq
mv SRR10045020_2.fastq WT_2_SRR10045020_2.fastq
mv SRR10045021_1.fastq WT_3_SRR10045021_1.fastq
mv SRR10045021_2.fastq WT_3_SRR10045021_2.fastq

# Compress the files
gzip *.fastq

Exercises

Check all files are in place

revisit earlier commands to know what to use.

Exercise 1: Explore the GTF file

Use command-line tools to answer:

  1. How many exons are annotated on chromosome 11?
  2. How many protein-coding genes are there? (Hint: look for gene_biotype "protein_coding")
  3. What is the longest gene on chromosome 11?

Exercise 2: NCBI GEO Exploration

Visit GSE136366 on NCBI GEO and find:

  1. The original publication title and authors
  2. The library preparation method used
  3. Any supplementary files provided by the authors

Summary

In this lab, you have:

Next: Lab 2 - Hands-on with Transcriptomics Data