Lab 2: Public Data Retrieval


In this lab, you will find, download, and organize the course dataset (GSE136366) and the human reference genome on the Ibex cluster, ready for all downstream analysis.

Learning Objectives

Part 1: Setup on Ibex

Connect to Ibex, start an interactive session, and load the tools needed for this lab.

Connect and start an interactive session

# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive session (never run analyses on login nodes!)
srun --cpus-per-task=4 --mem=32G --time=04:00:00 --pty /bin/bash

Load required modules

# Load bioinformatics tools
module load sratoolkit
module load seqkit

# Verify tools are loaded
module list

Navigate to your workspace

# Navigate to your scratch directory
cd /ibex/user/$USER

# Create the workshop directory
mkdir -p workshop
cd workshop

# Create subdirectories for this lab
mkdir -p raw_data references

# Verify
ls -la

Part 2: Finding Data on NCBI

Before downloading data, you need to know how to find it. NCBI hosts two key resources:

  1. GEO (Gene Expression Omnibus) for expression data: https://www.ncbi.nlm.nih.gov/geo
  2. SRA (Sequence Read Archive) for raw sequencing data: https://www.ncbi.nlm.nih.gov/sra

Step 1: Navigate to NCBI GEO

NCBI GEO is a public repository for gene expression and genomics datasets.

  1. Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
  2. You will see the GEO homepage with a search box at the top

Step 2: Search for a Dataset

As an example, search for accession GSE136366 (a TDP-43 RNA-seq study):

  1. In the search box, type: GSE136366
  2. Click "Search" or press Enter
  3. Click on the accession number GSE136366 to view the full record

Step 3: Explore the GEO Series Page

The GEO Series page contains important information about the dataset:

  • Title: Full study title and description
  • Summary: Experimental design and objectives
  • Overall design: Sample information and conditions
  • Samples: List of all individual samples (GSM accessions)
  • Citation: Link to the published paper

Step 4: Navigate to SRA from GEO

To access the raw FASTQ files, you need to go to the SRA:

  1. Scroll down to the "Relations" section on the GEO page
  2. Look for the link labeled "SRA" and click on it
  3. Alternatively, click on any individual sample (GSM accession) to find its SRA run

Step 5: SRA Run Selector

The SRA Run Selector provides metadata about each sequencing run:

  1. Click "Send results to Run selector" if not already there
  2. The table shows important metadata columns:
     Column         Description
     Run            SRR accession number (used for download)
     Sample         Sample name from the study
     LibraryLayout  SINGLE or PAIRED (single-end vs paired-end)
     Bases          Total sequencing bases
     Spots          Number of reads (or read pairs)
     Platform       Sequencing platform (e.g., ILLUMINA)

Tip: Always Check the Metadata!

Before downloading any dataset, review the sequencing platform, library layout (single-end vs paired-end), read length, and total read count. This information determines which analysis pipeline to use.

Course Dataset: GSE136366

For this course, we use GSE136366 — an RNA-seq study investigating the role of TDP-43 (TARDBP) in gene regulation in human cells:

Property        Value
GEO Accession   GSE136366
Organism        Homo sapiens
Study           TDP-43 (TARDBP) RNA-seq
Library Layout  Paired-end
Platform        Illumina
Reference       GRCh38 (human)

Tip: Get the SRR accession list from SRA Run Selector

On the GSE136366 GEO page, click the SRA link and then "Send results to Run Selector". In the Run Selector, click Accession List to download a text file (SRR_Acc_List.txt) with all SRR IDs. Use this file for batch downloading below.
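
Accession lists saved through a browser can pick up Windows line endings, blank lines, or duplicate entries, and any of these can break a batch download. The sketch below normalizes a list before you use it. The function name clean_accessions is illustrative and not part of the SRA Toolkit, and it assumes one accession per line, as in SRR_Acc_List.txt.

```shell
# Print the valid run accessions found in the list file $1, one per line.
# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
clean_accessions() {
    tr -d '\r' < "$1" |                 # drop Windows carriage returns
        grep -E '^[SED]RR[0-9]+$' |     # keep well-formed accessions only
        sort -u                         # remove duplicates
}
```

For example, clean_accessions SRR_Acc_List.txt > runs.txt writes a checked list (runs.txt is a placeholder name). Recent SRA Toolkit releases can read such a file with prefetch --option-file runs.txt; check prefetch --help on Ibex to confirm the option is available in the installed version.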

Part 3: Downloading Raw Sequencing Data with sra-tools

The SRA Toolkit provides command-line tools to download and convert SRA data. The two main commands are prefetch (downloads compressed .sra files) and fasterq-dump (converts .sra to FASTQ).

Navigate to your raw data directory

cd /ibex/user/$USER/workshop/raw_data

Option A: Copy from the shared course directory (recommended)

The full GSE136366 FASTQs are already available on Ibex at a shared course path. The files already follow the condition/replicate prefix convention (KO_1_, WT_1_, and so on), so no download or renaming is needed; just create symbolic links to them.

ln -s /biocorelab/BIX/resources/datasets/rnaseq/GSE136366/raw_fastq/*.fastq.gz /ibex/user/$USER/workshop/raw_data/

ls -lh /ibex/user/$USER/workshop/raw_data/
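
A symbolic link is created even when its target does not exist. If the shared path is mistyped, bash passes the unmatched pattern through unchanged and ln creates a single broken link literally named *.fastq.gz. The sketch below lists such broken links; broken_links is a hypothetical helper name, not an Ibex or course utility.

```shell
# Print every symbolic link directly inside directory $1 whose target is missing.
broken_links() {
    # test -e follows the link, so it fails when the target does not exist
    find "$1" -maxdepth 1 -type l ! -exec test -e {} \; -print
}
```

Running broken_links /ibex/user/$USER/workshop/raw_data should print nothing when every link resolves.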

Skip to Part 4. The NCBI download option below is provided for reference — it demonstrates how to retrieve public RNA-seq data from scratch, which is the workflow you would use for any dataset not already available on Ibex.

Option B: Download a full dataset from NCBI using prefetch

Download GSE136366 (or data from PRJNA507634) with prefetch and convert to FASTQ

Pass the GEO accession directly to prefetch — it resolves all associated SRR runs automatically and downloads them into a GSE136366/ directory.

# Download all runs for GSE136366 in one command (this can take several hours)
prefetch GSE136366

# Convert each downloaded .sra file to paired FASTQ
# globstar lets ** match .sra files at any directory depth
shopt -s globstar
for sra in ./**/*.sra; do
    echo "Converting $sra ..."
    fasterq-dump --split-files --threads 4 "$sra"
done

Storage tip

Compress FASTQ files immediately after conversion (gzip). Raw FASTQ files are very large; gzip reduces their size 4–5×. Use /ibex/user/$USER for all large files — home directories have limited quota.
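
The compression step can be sketched as a small loop that also tests each archive, so a truncated file is noticed before the uncompressed copy is forgotten. compress_fastqs is an illustrative name; the function assumes the uncompressed files end in .fastq, as fasterq-dump writes them.

```shell
# Compress every .fastq file in directory $1, then test each resulting archive.
compress_fastqs() {
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue        # glob matched nothing: no FASTQs to compress
        # gzip replaces file.fastq with file.fastq.gz; gzip -t checks integrity
        gzip "$fq" && gzip -t "$fq.gz" || echo "FAILED: $fq" >&2
    done
}
```

Call it as compress_fastqs . from inside raw_data. Any file that could not be compressed or verified is reported on standard error.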

Submit the download as an sbatch job (recommended)

Downloading and converting all GSE136366 samples can take several hours. Running this interactively risks losing work if your SSH session disconnects. Instead, submit it as a SLURM batch job on Ibex using sbatch.

Interactive vs. batch jobs

Use an interactive session (srun --pty bash) for short exploratory tasks (< 30 min) where you need immediate feedback. Use sbatch for anything that runs longer — downloads, alignments, workflow runs. The job continues even if you close your terminal.

Understanding sbatch directives

Every sbatch script starts with a shebang line (#!/bin/bash) followed by #SBATCH comment lines. SLURM reads these before executing the script. Here is what each directive means:

Directive        Meaning
--job-name       A human-readable label for the job (shown in squeue output).
--ntasks         Number of MPI tasks (processes). For serial or multi-threaded jobs, keep this at 1.
--cpus-per-task  CPU cores allocated per task. Match this to the --threads argument of your tool.
--mem            Total RAM for the job (e.g. 16G). Request enough: too little causes an out-of-memory kill; too much wastes queue priority.
--time           Wall-clock time limit in HH:MM:SS. The job is killed if it runs longer. Estimate generously for downloads.
--partition      The queue to submit to. Use batch for general compute jobs on Ibex.
--output         File path for standard output (stdout). %j is replaced by the job ID at runtime.
--error          File path for standard error (stderr). Separate from stdout so errors are easy to find.
--mail-type      When to email you: BEGIN, END, FAIL, or ALL.
--mail-user      Your email address for job notifications.

Create the submission script

Save the script below as download_sra.sh in your raw_data directory:

#!/bin/bash
#SBATCH --job-name=sra_download       # Name shown in squeue
#SBATCH --ntasks=1                    # Single task (not MPI)
#SBATCH --cpus-per-task=8             # 8 threads for fasterq-dump
#SBATCH --mem=16G                     # Total RAM for the job; raise it if the job is killed for running out of memory
#SBATCH --time=12:00:00               # Up to 12 hours for all samples
#SBATCH --partition=batch             # General compute queue on Ibex
#SBATCH --output=logs/sra_download_%j.out   # stdout log; %j = job ID
#SBATCH --error=logs/sra_download_%j.err    # stderr log
#SBATCH --mail-type=END,FAIL          # Email when done or if it fails
#SBATCH --mail-user=your.email@kaust.edu.sa # Replace with your email

# ---------- environment ----------
cd /ibex/user/$USER/workshop/raw_data

module load sratoolkit

mkdir -p logs

# ---------- step 1: prefetch (download all runs for GSE136366) ----------
echo "Starting prefetch ..."
prefetch GSE136366

# ---------- step 2: convert .sra to paired FASTQ ----------
echo "Starting fasterq-dump ..."
shopt -s globstar                     # let ** match .sra files at any directory depth
for sra in ./**/*.sra; do
    echo "  Converting $sra ..."
    fasterq-dump --split-files --threads 8 "$sra"
done
gzip *.fastq

# ---------- step 3: rename to add condition/replicate prefix ----------
echo "Renaming files to match shared folder convention ..."
mv SRR10045016_1.fastq.gz KO_1_SRR10045016_1.fastq.gz
mv SRR10045016_2.fastq.gz KO_1_SRR10045016_2.fastq.gz
mv SRR10045017_1.fastq.gz KO_2_SRR10045017_1.fastq.gz
mv SRR10045017_2.fastq.gz KO_2_SRR10045017_2.fastq.gz
mv SRR10045018_1.fastq.gz KO_3_SRR10045018_1.fastq.gz
mv SRR10045018_2.fastq.gz KO_3_SRR10045018_2.fastq.gz
mv SRR10045019_1.fastq.gz WT_1_SRR10045019_1.fastq.gz
mv SRR10045019_2.fastq.gz WT_1_SRR10045019_2.fastq.gz
mv SRR10045020_1.fastq.gz WT_2_SRR10045020_1.fastq.gz
mv SRR10045020_2.fastq.gz WT_2_SRR10045020_2.fastq.gz
mv SRR10045021_1.fastq.gz WT_3_SRR10045021_1.fastq.gz
mv SRR10045021_2.fastq.gz WT_3_SRR10045021_2.fastq.gz

echo "All downloads complete."
ls -lh *.fastq.gz

Submit and monitor the job

# Create the log directory first: SLURM opens the --output/--error files
# before the script starts, so logs/ must already exist at submission time
cd /ibex/user/$USER/workshop/raw_data
mkdir -p logs

# Submit to SLURM (sbatch prints the job ID)
sbatch download_sra.sh

# Check job status
squeue -u $USER

# Watch live stdout while the job runs (replace JOBID with the number printed by sbatch)
tail -f logs/sra_download_JOBID.out

# Cancel a job if needed
scancel JOBID

After the job finishes

Check the .err log for errors, then verify that all expected FASTQ files are present: ls *.fastq.gz | wc -l should equal 2 × the number of samples (12 for the six paired-end samples renamed in the script). If a sample failed, re-run fasterq-dump for that SRR ID only rather than re-running the whole job.
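
A file count alone does not show which sample is incomplete. The sketch below reports every FASTQ whose mate is missing. It assumes the _1.fastq.gz / _2.fastq.gz suffixes used throughout this lab, and unpaired_fastqs is a hypothetical helper name.

```shell
# Print each FASTQ in directory $1 whose mate file (_1 <-> _2) is missing.
unpaired_fastqs() {
    for f in "$1"/*_1.fastq.gz "$1"/*_2.fastq.gz; do
        [ -e "$f" ] || continue         # skip a pattern that matched nothing
        case "$f" in
            *_1.fastq.gz) mate="${f%_1.fastq.gz}_2.fastq.gz" ;;
            *)            mate="${f%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        [ -e "$mate" ] || echo "$f"
    done
}
```

Running unpaired_fastqs . in raw_data prints nothing when every sample has both files. Each name it does print identifies the SRR run to convert again.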

Part 4: Downloading Reference Genomes

Reference genomes and gene annotations are available from two major sources: NCBI and Ensembl. Both provide the same GRCh38 assembly but differ in naming conventions, gene annotation sources, and access tools. Both approaches are covered below.

Navigate to your references directory

cd /ibex/user/$USER/workshop/references

Option A: Download from NCBI using datasets

NCBI provides the datasets command-line tool – a modern, fast way to download genomes, annotations, and gene data directly from the NCBI database.

Get the NCBI datasets tool on Ibex

# Load the NCBI datasets tool
module load ncbi_datasets_tools

# Check the help message
datasets --help

Search for a genome assembly

You can search for genomes by organism name or assembly accession:

# Search for human genome assemblies
datasets summary genome taxon "Homo sapiens" --reference | head -5

# Search by specific assembly accession (GRCh38 / hg38)
datasets summary genome accession GCF_000001405.40
Understanding Assembly Accessions
  • GCF_ prefix – NCBI RefSeq assembly (curated, recommended)
  • GCA_ prefix – GenBank assembly (submitted by researchers)

For most analyses, prefer RefSeq (GCF_) assemblies as they are more standardized.
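
When accessions come from mixed sources, the prefix rule above can be applied in a script before anything is downloaded. This is a minimal sketch; assembly_source is an illustrative name and not part of the datasets tool.

```shell
# Report which NCBI database an assembly accession belongs to, based on its prefix.
assembly_source() {
    case "$1" in
        GCF_*) echo "RefSeq" ;;
        GCA_*) echo "GenBank" ;;
        *)     echo "unknown" ;;
    esac
}
```

For the accession used in this lab, assembly_source GCF_000001405.40 prints RefSeq.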

Download genome and annotation

The datasets download command fetches the genome FASTA, annotation (GTF/GFF3), and metadata in a single zip archive:

# Download human GRCh38 reference genome with annotations
# Note: This is a large download (~3 GB)
datasets download genome accession GCF_000001405.40 \
    --include genome,gtf,seq-report

# This creates a file called ncbi_dataset.zip
ls -lh ncbi_dataset.zip

--include Options

Option      Downloads
genome      Genome FASTA sequence
gtf         Gene annotation in GTF format
gff3        Gene annotation in GFF3 format
rna         Transcript sequences (RNA FASTA)
protein     Protein sequences (amino acid FASTA)
cds         Coding sequences (CDS FASTA)
seq-report  Sequence report with chromosome info

Extract and organize the downloaded files

# Extract the zip archive
unzip ncbi_dataset.zip

# See the directory structure
ls ncbi_dataset/data/GCF_000001405.40/

# The extracted files will include:
#   GCF_000001405.40_GRCh38.p14_genomic.fna  - genome FASTA (chromosomes)
#   genomic.gtf                               - gene annotation (GTF)
#   sequence_report.jsonl                     - sequence metadata

# Copy files to your references directory with clear names
cp ncbi_dataset/data/GCF_000001405.40/*.fna GRCh38_genome.fa
cp ncbi_dataset/data/GCF_000001405.40/genomic.gtf GRCh38_genes.gtf

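# Verify the genome (expect a little over 3 Gbp in total; the full assembly also contains
# unplaced scaffolds, alternate loci, and patches, so num_seqs will be well above 25)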
seqkit stats GRCh38_genome.fa

# Clean up the zip and extracted directory
rm -r ncbi_dataset ncbi_dataset.zip

Option B: Download from Ensembl using wget

Ensembl provides reference genomes and gene annotations for thousands of species. Files are organized on their FTP server by release version, species, and data type.

Understanding the Ensembl FTP Structure

The Ensembl FTP site is organized as follows:

ftp.ensemblgenomes.org/pub/bacteria/
├── release-57/
│   ├── fasta/                    # Sequence files (genome, cDNA, protein)
│   │   └── bacteria_0_collection/
│   │       └── escherichia_coli_.../
│   │           └── dna/          # Genomic DNA sequences
│   ├── gff3/                     # Gene annotations (GFF3 format)
│   │   └── bacteria_0_collection/
│   └── gtf/                      # Gene annotations (GTF format)
└── current/                      # Symlink to latest release
Ensembl vs Ensembl Genomes
  • ftp.ensembl.org – Vertebrate species (human, mouse, zebrafish, etc.)
  • ftp.ensemblgenomes.org – Non-vertebrate species (bacteria, plants, fungi, etc.)

Use the appropriate server for your organism.

Download human GRCh38 reference genome from Ensembl

# Download human GRCh38 primary assembly (chromosomes plus unplaced scaffolds; it excludes
# alternate haplotypes and patches, so it is smaller than the full assembly)
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Download gene annotation

# Download GTF annotation (Ensembl release 110, GRCh38)
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz

Decompress and verify

# Decompress all downloaded files
gunzip *.gz

# Check reference genome statistics (human GRCh38 primary assembly: ~3.1 Gbp; 25 chromosome-level
# sequences plus unplaced and unlocalized scaffolds, so num_seqs will exceed 25)
seqkit stats Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Rename for convenience
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_genome.fa
mv Homo_sapiens.GRCh38.110.gtf GRCh38_genes.gtf

# List files with sizes
ls -lh

NCBI vs Ensembl: Which to Use?

Feature           NCBI (datasets)                       Ensembl (wget)
Download tool     datasets tool                         wget / curl
Naming            RefSeq accessions (GCF_/GCA_)         Ensembl-specific names
Annotations       NCBI RefSeq gene models               Ensembl/GENCODE gene models
Chromosome names  NC_ accessions or chromosome numbers  Chromosome numbers (1, 2, X, MT)
Best for          Prokaryotes, clinical/NCBI pipelines  Eukaryotes, RNA-seq, GENCODE consistency

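Both are valid choices. The key is consistency: always use the genome and annotation from the same source to avoid mismatches in chromosome names. Options A and B both write GRCh38_genome.fa and GRCh38_genes.gtf, so run only one of them in a given references directory.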
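
One way to catch a source mismatch early is to compare the sequence names in the annotation against the FASTA headers. The sketch below prints every name used in column 1 of the GTF that has no matching FASTA record. gtf_names_missing_from_fasta is a hypothetical helper, and it reads the whole FASTA once, so expect it to take a few minutes on a human genome.

```shell
# Usage: gtf_names_missing_from_fasta genome.fa genes.gtf
# Prints GTF sequence names (column 1) that are absent from the FASTA headers.
gtf_names_missing_from_fasta() {
    awk -F'\t' '
        NR == FNR {                         # first file: the FASTA
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)    # keep the ID, drop the description
                in_fasta[name] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }            # second file: skip GTF comments and blank lines
        !($1 in in_fasta) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

An empty result means every annotated sequence exists in the genome. Output such as chr1 or NC_000001.11 points to a genome and annotation taken from different sources.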

Part 5: Organizing Your Workspace

Create the full directory structure

mkdir -p /ibex/user/$USER/workshop/{raw_data,trimmed,aligned,counts,results,references,logs}

ls /ibex/user/$USER/workshop/

Directory conventions

Directory    Contents                                         Used in
raw_data/    Original full FASTQ files downloaded from SRA    Lab 6 onwards (nf-core)
trimmed/     fastp-trimmed FASTQs                             Labs 3–5
aligned/     STAR BAM files                                   Lab 4
counts/      Salmon quantification output                     Lab 5
references/  Reference genome, annotation, and indexes        Labs 4–5
results/     Final analysis outputs (DEA, plots, reports)     Labs 7–10
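
Later labs assume this layout, so a quick check before moving on can help. The sketch below lists any expected directory that is missing. missing_dirs is an illustrative name, and the directory list is copied from the mkdir command above.

```shell
# Print the name of each expected workshop subdirectory that is missing under $1.
missing_dirs() {
    for d in raw_data trimmed aligned counts results references logs; do
        [ -d "$1/$d" ] || echo "$d"
    done
}
```

missing_dirs /ibex/user/$USER/workshop prints nothing when the workspace is complete.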

Part 6: Exploring Downloaded Data

Before proceeding with analysis, always inspect your downloaded data to make sure everything looks correct.

Inspect raw reads

cd /ibex/user/$USER/workshop

# Quick look at the first few reads from one GSE136366 sample (12 lines = 3 reads)
zcat raw_data/KO_1_SRR10045016_1.fastq.gz | head -12

FASTQ Format Reminder

Each read consists of 4 lines:

  1. @ followed by the read identifier
  2. The DNA sequence
  3. + (separator line)
  4. Quality scores (ASCII-encoded Phred scores)
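
Because every read occupies exactly four lines, the read count is the line count divided by four. Counting lines that start with @ is unreliable, since a quality string may also begin with @. The sketch below counts reads and spot-checks the four-line structure of the first 100 records; both function names are illustrative.

```shell
# Print the number of reads in a gzip-compressed FASTQ (line count / 4).
count_fastq_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Return 0 if the first 100 records follow the 4-line layout, non-zero otherwise.
check_fastq_head() {
    gzip -dc "$1" | head -n 400 | awk '
        NR % 4 == 1 && !/^@/  { bad = 1 }              # line 1: @identifier
        NR % 4 == 2           { seqlen = length($0) }  # line 2: sequence
        NR % 4 == 3 && !/^\+/ { bad = 1 }              # line 3: + separator
        NR % 4 == 0 && length($0) != seqlen { bad = 1 } # line 4: one quality char per base
        END { exit (bad || NR % 4 != 0) }'
}
```

For example, count_fastq_reads raw_data/KO_1_SRR10045016_1.fastq.gz reports the read count for that file. seqkit stats gives the same figure in its num_seqs column.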

Reference genome statistics

# Get stats for the human GRCh38 reference genome
seqkit stats references/GRCh38_genome.fa
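
If seqkit is not loaded, the same two numbers can be approximated with awk. This is a minimal sketch: fasta_summary is a hypothetical helper, and it counts every non-header character, including N, as a base.

```shell
# Print the number of sequences and the total number of bases in FASTA file $1.
fasta_summary() {
    awk '
        /^>/ { n++; next }                               # header line: one more sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }     # sequence line: add its length
        END { printf "%d sequences, %.0f bases\n", n, bases }
    ' "$1"
}
```

fasta_summary references/GRCh38_genome.fa should agree with the num_seqs and sum_len columns reported by seqkit stats.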

Preview the annotation file

# View the first 10 lines of the GTF annotation (skip comment lines)
grep -v "^#" references/GRCh38_genes.gtf | head -10

# Count annotated genes
grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l
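
grep -P depends on GNU grep built with PCRE support, which is not available on every system. An awk version of the same count matches the feature type in column 3 exactly and works for any feature, not only genes. count_gtf_feature is an illustrative name.

```shell
# Usage: count_gtf_feature genes.gtf gene
# Counts GTF records whose feature type (column 3) equals the second argument.
count_gtf_feature() {
    awk -F'\t' -v type="$2" '
        !/^#/ && $3 == type { n++ }     # skip comment lines, match column 3 exactly
        END { print n + 0 }             # n + 0 prints 0 when nothing matched
    ' "$1"
}
```

For example, count_gtf_feature references/GRCh38_genes.gtf gene counts genes, and replacing gene with exon or transcript counts those features instead.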

Exercises

Exercise 1: Read Counts

How many reads are in the forward-read (_1) FASTQ for one sample? Compare it with a second sample in raw_data/.

Hint
seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz
seqkit stats raw_data/WT_1_SRR10045019_1.fastq.gz

Exercise 2: Genome Size

How large (in bases) is the human GRCh38 primary assembly? How many sequences does it contain?

Hint
seqkit stats references/GRCh38_genome.fa

Exercise 3: Gene Count

How many genes are annotated in the GRCh38 GTF file?

Hint
grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l

Summary

In this lab, you have:

Next: Lab 3: FastQC, fastp, and MultiQC