Lab 2: Public Data Retrieval


In this lab, you will find, download, and organize the course dataset (GSE136366) and the human reference genome on the Ibex cluster, ready for all downstream analysis.

Learning Objectives

Navigate NCBI GEO and SRA to find datasets
Download raw sequencing data using sra-tools (prefetch, fasterq-dump)
Batch-download multiple SRA files using an accession list
Download reference genomes from NCBI using the datasets CLI tool
Download reference genomes from Ensembl using wget
Use seqkit to summarize and subsample data
Organize data with proper directory structure on Ibex

Part 1: Setup on Ibex

Connect to Ibex, start an interactive session, and load the tools needed for this lab.

Connect and start an interactive session

# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive session (never run analyses on login nodes!)
srun --cpus-per-task=4 --mem=32G --time=04:00:00 --pty /bin/bash

Load required modules

# Load bioinformatics tools
module load sratoolkit
module load seqkit

# Verify tools are loaded
module list
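
`module list` shows which modules are loaded, but it does not confirm that the executables are actually reachable on your PATH. A minimal sketch of such a check follows; `check_tools` is a hypothetical helper name introduced here for illustration, not something provided by Ibex.

```shell
# check_tools (hypothetical helper): report whether each named command is on PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if command -v "$ct_tool" >/dev/null 2>&1; then
            echo "OK       $ct_tool"
        else
            echo "MISSING  $ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# The three commands this lab relies on
check_tools prefetch fasterq-dump seqkit || echo "Load the missing modules before continuing."
```

If any line reads MISSING, reload the corresponding module before moving on.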

Navigate to your workspace

# Navigate to your scratch directory
cd /ibex/user/$USER

# Create the workshop directory (-p avoids an error if it already exists)
mkdir -p workshop
cd workshop

# Create directories for this lab
mkdir -p raw_data references

# Verify
ls -la

Part 2: Finding Data on NCBI

Before downloading data, you need to know how to find it. NCBI hosts two key resources:

GEO (Gene Expression Omnibus) for expression data: https://www.ncbi.nlm.nih.gov/geo
SRA (Sequence Read Archive) for raw sequencing data: https://www.ncbi.nlm.nih.gov/sra

Step 1: Navigate to NCBI GEO

NCBI GEO is a public repository for gene expression and genomics datasets.

Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
You will see the GEO homepage with a search box at the top

Step 2: Search for a Dataset

As an example, search for accession GSE136366 (a TDP-43 RNA-seq study):

In the search box, type: GSE136366
Click "Search" or press Enter
Click on the accession number GSE136366 to view the full record

Step 3: Explore the GEO Series Page

The GEO Series page contains important information about the dataset:

Title: Full study title and description
Summary: Experimental design and objectives
Overall design: Sample information and conditions
Samples: List of all individual samples (GSM accessions)
Citation: Link to the published paper

Step 4: Navigate to SRA from GEO

To access the raw FASTQ files, you need to go to the SRA:

Scroll down to the "Relations" section on the GEO page
Look for the link labeled "SRA" and click on it
Alternatively, click on any individual sample (GSM accession) to find its SRA run

Step 5: SRA Run Selector

The SRA Run Selector provides metadata about each sequencing run:

Click "Send results to Run selector" if not already there

The table shows important metadata columns:

Column	Description
Run	SRR accession number (used for download)
Sample	Sample name from the study
LibraryLayout	SINGLE or PAIRED (single-end vs paired-end)
Bases	Total sequencing bases
Spots	Number of reads (or read pairs)
Platform	Sequencing platform (e.g., ILLUMINA)

Tip: Always Check the Metadata!

Before downloading any dataset, review the sequencing platform, library layout (single-end vs paired-end), read length, and total read count. This information determines which analysis pipeline to use.

Course Dataset: GSE136366

For this course, we use GSE136366 — an RNA-seq study investigating the role of TDP-43 (TARDBP) in gene regulation in human cells:

Property	Value
GEO Accession	`GSE136366`
Organism	Homo sapiens
Study	TDP-43 (TARDBP) RNA-seq
Library Layout	Paired-end
Platform	Illumina
Reference	GRCh38 (human)

Tip: Get the SRR accession list from SRA Run Selector

On the GSE136366 GEO page, click the SRA link and then "Send results to Run Selector". In the Run Selector, click Accession List to download a text file (SRR_Acc_List.txt) with all SRR IDs. Use this file for batch downloading below.

Part 3: Downloading Raw Sequencing Data with sra-tools

The SRA Toolkit provides command-line tools to download and convert SRA data. The two main commands are prefetch (downloads compressed .sra files) and fasterq-dump (converts .sra to FASTQ).

Navigate to your raw data directory

cd /ibex/user/$USER/workshop/raw_data

Option A: Copy from the shared course directory (recommended)

The full GSE136366 FASTQs are already available on Ibex at a shared course path. The files already follow the KO_1_/WT_1_ prefix naming convention, so no download or renaming is needed: just create symbolic links to them in your raw_data directory.

ln -s /biocorelab/BIX/resources/datasets/rnaseq/GSE136366/raw_fastq/*.fastq.gz /ibex/user/$USER/workshop/raw_data/

ls -lh /ibex/user/$USER/workshop/raw_data/

Skip to Part 4. The NCBI download option below is provided for reference — it demonstrates how to retrieve public RNA-seq data from scratch, which is the workflow you would use for any dataset not already available on Ibex.

Option B: Download a full dataset from NCBI using `prefetch`

Download GSE136366 (or data from PRJNA507634) with prefetch and convert to FASTQ

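Pass the accession to prefetch, which downloads each run's .sra file into its own subdirectory. If your sra-tools version cannot resolve a GEO series accession, pass the SRR accession list from the Run Selector instead (prefetch --option-file SRR_Acc_List.txt).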

# Download all runs for GSE136366 in one command (can take several hours)
prefetch GSE136366

# Convert each downloaded .sra file to paired FASTQ
for sra in ./*/*.sra; do
    echo "Converting $sra ..."
    fasterq-dump --split-files --threads 4 "$sra"
done
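
The accession list from the Run Selector can also drive the download directly. The sketch below assumes that SRR_Acc_List.txt sits in the current directory with one run accession per line; `clean_accessions` and the intermediate file name accessions.clean.txt are hypothetical names chosen for this example. It relies on the `--option-file` flag of prefetch, which reads accessions from a file.

```shell
# Assumption: SRR_Acc_List.txt was downloaded from the SRA Run Selector.
acc_list="SRR_Acc_List.txt"

# clean_accessions (hypothetical helper): keep only well-formed run accessions
# (SRR/ERR/DRR followed by digits), dropping blank lines, Windows line endings,
# and duplicates.
clean_accessions() {
    tr -d '\r' < "$1" | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Run only when the list and the SRA Toolkit are both available
if [ -f "$acc_list" ] && command -v prefetch >/dev/null 2>&1; then
    clean_accessions "$acc_list" > accessions.clean.txt
    echo "Runs to download: $(wc -l < accessions.clean.txt)"

    # Download every run named in the cleaned list
    prefetch --option-file accessions.clean.txt

    # Convert each run to paired FASTQ files
    while read -r srr; do
        fasterq-dump --split-files --threads 4 "$srr"
    done < accessions.clean.txt
fi
```

Cleaning the list first guards against a stray header line or carriage returns, which can appear if the file was opened and re-saved on a Windows machine.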

Storage tip

Compress FASTQ files immediately after conversion (gzip). Raw FASTQ files are very large; gzip reduces their size 4–5×. Use /ibex/user/$USER for all large files — home directories have limited quota.
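
The compression step can be sketched as a small loop that also tests each archive after writing it. `compress_fastq` is a hypothetical helper name; the sketch uses plain gzip and assumes the uncompressed files end in .fastq.

```shell
# compress_fastq (hypothetical helper): gzip every .fastq file in a directory
# and test each resulting archive with gzip -t.
compress_fastq() {
    for cf_file in "$1"/*.fastq; do
        [ -e "$cf_file" ] || continue      # the glob matched nothing
        gzip "$cf_file"                    # replaces file.fastq with file.fastq.gz
        if gzip -t "$cf_file.gz"; then
            echo "OK      $cf_file.gz"
        else
            echo "CORRUPT $cf_file.gz"
        fi
    done
}

# Compress everything in the current directory
compress_fastq .
```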

Submit the download as an sbatch job (recommended)

Downloading and converting all GSE136366 samples can take several hours. Running this interactively risks losing work if your SSH session disconnects. Instead, submit it as a SLURM batch job on Ibex using sbatch.

Interactive vs. batch jobs

Use an interactive session (srun --pty bash) for short exploratory tasks (< 30 min) where you need immediate feedback. Use sbatch for anything that runs longer — downloads, alignments, workflow runs. The job continues even if you close your terminal.

Understanding sbatch directives

Every sbatch script starts with a shebang line (#!/bin/bash) followed by #SBATCH comment lines. SLURM reads these before executing the script. Here is what each directive means:

Directive	Meaning
`--job-name`	A human-readable label for the job (shown in `squeue` output).
`--ntasks`	Number of MPI tasks (processes). For serial or multi-threaded jobs, keep this at `1`.
`--cpus-per-task`	CPU cores allocated per task. Match this to the `--threads` argument of your tool.
`--mem`	Total RAM for the job (e.g. `16G`). Request enough — too little causes an out-of-memory kill; too much wastes queue priority.
`--time`	Wall-clock time limit in `HH:MM:SS`. The job is killed if it runs longer. Estimate generously for downloads.
`--partition`	The queue to submit to. Use `batch` for general compute jobs on Ibex.
`--output`	File path for standard output (stdout). `%j` is replaced by the job ID at runtime.
`--error`	File path for standard error (stderr). Separate from stdout so errors are easy to find.
`--mail-type`	When to email you: `BEGIN`, `END`, `FAIL`, or `ALL`.
`--mail-user`	Your email address for job notifications.

Create the submission script

Save the script below as download_sra.sh in your raw_data directory:

#!/bin/bash
#SBATCH --job-name=sra_download       # Name shown in squeue
#SBATCH --ntasks=1                    # Single task (not MPI)
#SBATCH --cpus-per-task=8             # 8 threads for fasterq-dump
#SBATCH --mem=16G                     # Total RAM for the job; raise it if the job is killed for running out of memory
#SBATCH --time=12:00:00               # Up to 12 hours for all samples
#SBATCH --partition=batch             # General compute queue on Ibex
#SBATCH --output=logs/sra_download_%j.out   # stdout log; %j = job ID
#SBATCH --error=logs/sra_download_%j.err    # stderr log
#SBATCH --mail-type=END,FAIL          # Email when done or if it fails
#SBATCH --mail-user=your.email@kaust.edu.sa # Replace with your email

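# ---------- environment ----------
# Note: SLURM opens the --output/--error files before this script starts,
# so the logs/ directory must already exist in the directory you submit from.
cd /ibex/user/$USER/workshop/raw_data

module load sratoolkit

mkdir -p logs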

# ---------- step 1: prefetch (download all runs for GSE136366) ----------
echo "Starting prefetch ..."
prefetch GSE136366

# ---------- step 2: convert .sra to paired FASTQ ----------
echo "Starting fasterq-dump ..."
shopt -s globstar nullglob   # let ** match nested directories; skip the loop if nothing matches
for sra in ./**/*.sra; do
    echo "  Converting $sra ..."
    fasterq-dump --split-files --threads 8 "$sra"
done
gzip *.fastq

# ---------- step 3: rename to add condition/replicate prefix ----------
echo "Renaming files to match shared folder convention ..."
mv SRR10045016_1.fastq.gz KO_1_SRR10045016_1.fastq.gz
mv SRR10045016_2.fastq.gz KO_1_SRR10045016_2.fastq.gz
mv SRR10045017_1.fastq.gz KO_2_SRR10045017_1.fastq.gz
mv SRR10045017_2.fastq.gz KO_2_SRR10045017_2.fastq.gz
mv SRR10045018_1.fastq.gz KO_3_SRR10045018_1.fastq.gz
mv SRR10045018_2.fastq.gz KO_3_SRR10045018_2.fastq.gz
mv SRR10045019_1.fastq.gz WT_1_SRR10045019_1.fastq.gz
mv SRR10045019_2.fastq.gz WT_1_SRR10045019_2.fastq.gz
mv SRR10045020_1.fastq.gz WT_2_SRR10045020_1.fastq.gz
mv SRR10045020_2.fastq.gz WT_2_SRR10045020_2.fastq.gz
mv SRR10045021_1.fastq.gz WT_3_SRR10045021_1.fastq.gz
mv SRR10045021_2.fastq.gz WT_3_SRR10045021_2.fastq.gz

echo "All downloads complete."
ls -lh *.fastq.gz
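
The twelve mv commands in step 3 can also be driven by a small mapping file, which is easier to check and to extend. The sketch below is one way to do that; the file name sample_map.txt and the helper `rename_with_prefix` are hypothetical, and the mapping simply mirrors the accession-to-prefix pairs used in the script above.

```shell
# Assumption: each line holds a run accession and the prefix to prepend.
cat > sample_map.txt <<'EOF'
SRR10045016 KO_1
SRR10045017 KO_2
SRR10045018 KO_3
SRR10045019 WT_1
SRR10045020 WT_2
SRR10045021 WT_3
EOF

# rename_with_prefix (hypothetical helper): prepend the prefix to both mates
# (_1 and _2) of every run listed in the mapping file.
rename_with_prefix() {
    while read -r rp_srr rp_prefix; do
        [ -n "$rp_srr" ] || continue       # skip blank lines
        for rp_mate in 1 2; do
            rp_src="${rp_srr}_${rp_mate}.fastq.gz"
            if [ -e "$rp_src" ]; then
                mv "$rp_src" "${rp_prefix}_${rp_src}"
            else
                echo "WARNING: $rp_src not found, skipping" >&2
            fi
        done
    done < "$1"
}

rename_with_prefix sample_map.txt
```

A missing file produces a warning instead of stopping the loop, so one failed conversion does not block the renaming of the other samples.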

Submit and monitor the job

# Create the log directory first: SLURM does not create it, and the job fails if it is missing
cd /ibex/user/$USER/workshop/raw_data
mkdir -p logs

# Submit to SLURM (prints "Submitted batch job JOBID")
sbatch download_sra.sh

# Check job status
squeue -u $USER

# Watch live stdout while the job runs (replace JOBID with the number printed by sbatch)
tail -f logs/sra_download_JOBID.out

# Cancel a job if needed
scancel JOBID

After the job finishes

Check the .err log for any errors, then verify all expected FASTQ pairs are present: ls -lh *.fastq.gz | wc -l should equal 2 × number of samples. If a sample failed, re-run fasterq-dump for just that SRR ID rather than re-running the whole job.
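
Counting files confirms the total but not that every R1 file has its R2 mate. A minimal sketch of a pair check follows; `check_pairs` is a hypothetical helper name, and the sketch assumes the _1.fastq.gz / _2.fastq.gz suffixes produced by fasterq-dump --split-files.

```shell
# check_pairs (hypothetical helper): report every *_1.fastq.gz whose
# *_2.fastq.gz mate is missing. Returns 0 when all mates exist, 1 otherwise.
check_pairs() {
    cp_status=0
    for cp_r1 in "$1"/*_1.fastq.gz; do
        [ -e "$cp_r1" ] || continue        # the glob matched nothing
        cp_r2="${cp_r1%_1.fastq.gz}_2.fastq.gz"
        if [ ! -e "$cp_r2" ]; then
            echo "MISSING mate for $cp_r1"
            cp_status=1
        fi
    done
    return "$cp_status"
}

check_pairs . && echo "All R1 files have an R2 mate."
```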

Part 4: Downloading Reference Genomes

Reference genomes and gene annotations are available from two major sources: NCBI and Ensembl. Both provide the same biological data but differ in naming conventions, annotation sources, and access tools. We will cover both approaches.

Navigate to your references directory

cd /ibex/user/$USER/workshop/references

Option A: Download from NCBI using `datasets`

NCBI provides the datasets command-line tool – a modern, fast way to download genomes, annotations, and gene data directly from the NCBI database.

Get the NCBI datasets tool on Ibex

# Load the NCBI datasets tool
module load ncbi_datasets_tools

# Check the help message
datasets --help

Search for a genome assembly

You can search for genomes by organism name or assembly accession:

# Search for human genome assemblies
datasets summary genome taxon "Homo sapiens" --reference | head -5

# Search by specific assembly accession (GRCh38 / hg38)
datasets summary genome accession GCF_000001405.40

Understanding Assembly Accessions

GCF_ prefix – NCBI RefSeq assembly (curated, recommended)
GCA_ prefix – GenBank assembly (submitted by researchers)

For most analyses, prefer RefSeq (GCF_) assemblies as they are more standardized.

Download genome and annotation

The datasets download command fetches the genome FASTA, annotation (GTF/GFF3), and metadata in a single zip archive:

# Download human GRCh38 reference genome with annotations
# Note: This is a large download (~3 GB)
datasets download genome accession GCF_000001405.40 \
    --include genome,gtf,seq-report

# This creates a file called ncbi_dataset.zip
ls -lh ncbi_dataset.zip

--include Options

Option	Downloads
`genome`	Genome FASTA sequence
`gtf`	Gene annotation in GTF format
`gff3`	Gene annotation in GFF3 format
`rna`	Transcript sequences (RNA FASTA)
`protein`	Protein sequences (amino acid FASTA)
`cds`	Coding sequences (CDS FASTA)
`seq-report`	Sequence report with chromosome info

Extract and organize the downloaded files

# Extract the zip archive
unzip ncbi_dataset.zip

# See the directory structure
ls ncbi_dataset/data/GCF_000001405.40/

# The extracted files will include:
#   GCF_000001405.40_GRCh38.p14_genomic.fna  - genome FASTA (chromosomes)
#   genomic.gtf                               - gene annotation (GTF)
#   sequence_report.jsonl                     - sequence metadata

# Copy files to your references directory with clear names
cp ncbi_dataset/data/GCF_000001405.40/*.fna GRCh38_genome.fa
cp ncbi_dataset/data/GCF_000001405.40/genomic.gtf GRCh38_genes.gtf

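# Verify the genome (sum_len should be a little over 3 Gbp; num_seqs is well above 25
# because the full assembly includes alternate loci, patches, and unplaced scaffolds)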
seqkit stats GRCh38_genome.fa

# Clean up the zip and extracted directory
rm -r ncbi_dataset ncbi_dataset.zip

Option B: Download from Ensembl using `wget`

Ensembl provides reference genomes and gene annotations for thousands of species. Files are organized on their FTP server by release version, species, and data type.

Understanding the Ensembl FTP Structure

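The Ensembl FTP sites are organized by release, data type, and species. The example below shows the bacteria section of the Ensembl Genomes site: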

ftp.ensemblgenomes.org/pub/bacteria/
├── release-57/
│   ├── fasta/                    # Sequence files (genome, cDNA, protein)
│   │   └── bacteria_0_collection/
│   │       └── escherichia_coli_.../
│   │           └── dna/          # Genomic DNA sequences
│   ├── gff3/                     # Gene annotations (GFF3 format)
│   │   └── bacteria_0_collection/
│   └── gtf/                      # Gene annotations (GTF format)
└── current/                      # Symlink to latest release

Ensembl vs Ensembl Genomes

ftp.ensembl.org – Vertebrate species (human, mouse, zebrafish, etc.)
ftp.ensemblgenomes.org – Non-vertebrate species (bacteria, plants, fungi, etc.)

Use the appropriate server for your organism.

Download human GRCh38 reference genome from Ensembl

# Download the human GRCh38 primary assembly (chromosomes plus unplaced scaffolds, without
# alternate haplotypes or patches, so it is smaller than the full assembly)
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Download gene annotation

# Download GTF annotation (Ensembl release 110, GRCh38)
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz

Decompress and verify

# Decompress all downloaded files
gunzip *.gz

# Check reference genome statistics (human GRCh38 primary assembly: about 3.1 Gbp; num_seqs is
# above 25 because unplaced scaffolds are included alongside chromosomes 1-22, X, Y, and MT)
seqkit stats Homo_sapiens.GRCh38.dna.primary_assembly.fa

# Rename for convenience
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_genome.fa
mv Homo_sapiens.GRCh38.110.gtf GRCh38_genes.gtf

# List files with sizes
ls -lh

NCBI vs Ensembl: Which to Use?

Feature	NCBI (datasets)	Ensembl (wget)
Download tool	`datasets` tool	`wget` / `curl`
Naming	RefSeq accessions (GCF_/GCA_)	Ensembl-specific names
Annotations	NCBI RefSeq gene models	Ensembl/GENCODE gene models
Chromosome names	NC_ accessions or chromosome numbers	Chromosome numbers (1, 2, X, MT)
Best for	Prokaryotes, clinical/NCBI pipelines	Eukaryotes, RNA-seq, GENCODE consistency

Both are valid choices. The key is to be consistent: always use the genome and annotation from the same source to avoid mismatches in chromosome names.
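
A naming mismatch between genome and annotation can be detected before any alignment is run. The sketch below compares the sequence names in the FASTA headers with the names in the first column of the GTF; `gtf_names_missing_in_fasta` is a hypothetical helper name, and the file names are the ones used when renaming the references above.

```shell
# gtf_names_missing_in_fasta (hypothetical helper): print sequence names that
# appear in column 1 of the GTF but have no matching header in the FASTA.
gtf_names_missing_in_fasta() {
    gn_tmp=$(mktemp -d)
    # FASTA names: header text after '>' up to the first space
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "$gn_tmp/fasta.txt"
    # GTF names: first tab-separated column of the non-comment lines
    grep -v '^#' "$2" | cut -f1 | sort -u > "$gn_tmp/gtf.txt"
    # comm -13 keeps only the lines unique to the second file
    comm -13 "$gn_tmp/fasta.txt" "$gn_tmp/gtf.txt"
    rm -r "$gn_tmp"
}

# Run from the references directory; no output means every GTF name exists in the FASTA
if [ -f GRCh38_genome.fa ] && [ -f GRCh38_genes.gtf ]; then
    gtf_names_missing_in_fasta GRCh38_genome.fa GRCh38_genes.gtf
fi
```

Any name printed here (for example a GTF that uses chr1 while the FASTA uses 1) indicates that the two files come from different sources or naming schemes.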

Part 5: Organizing Your Workspace

Create the full directory structure

mkdir -p /ibex/user/$USER/workshop/{raw_data,trimmed,aligned,counts,results,references,logs}

ls /ibex/user/$USER/workshop/

Directory conventions

Directory	Contents	Used in
`raw_data/`	Original full FASTQ files downloaded from SRA	Lab 6 onwards (nf-core)
`trimmed/`	fastp-trimmed FASTQs	Labs 3–5
`aligned/`	STAR BAM files	Lab 4
`counts/`	Salmon quantification output	Lab 5
`references/`	Reference genome, annotation, and indexes	Labs 4–5
`results/`	Final analysis outputs (DEA, plots, reports)	Labs 7–10

Part 6: Exploring Downloaded Data

Before proceeding with analysis, always inspect your downloaded data to make sure everything looks correct.

Inspect raw reads

cd /ibex/user/$USER/workshop

# Quick look at the first few reads from one GSE136366 sample (12 lines = 3 reads)
zcat raw_data/KO_1_SRR10045016_1.fastq.gz | head -12
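
For quick tests it is often convenient to work on a subsample of the reads. The sketch below uses seqkit sample with a proportion (-p) and a fixed random seed (-s); `subsample_pair`, the seed value 11, and the sub_ output prefix are arbitrary choices made for this example. It assumes both mate files list their reads in the same order, which is the case for fasterq-dump output.

```shell
# subsample_pair (hypothetical helper): sample the same fraction of reads from
# both mates. Using an identical seed on R1 and R2 keeps the pairs in sync.
subsample_pair() {
    sp_r1="$1"; sp_r2="$2"; sp_frac="$3"
    seqkit sample -p "$sp_frac" -s 11 "$sp_r1" -o "sub_$(basename "$sp_r1")"
    seqkit sample -p "$sp_frac" -s 11 "$sp_r2" -o "sub_$(basename "$sp_r2")"
}

# Run only when seqkit and the example files are available
if command -v seqkit >/dev/null 2>&1 && [ -f raw_data/KO_1_SRR10045016_1.fastq.gz ]; then
    # Keep roughly 10% of the read pairs; output lands in the current directory
    subsample_pair raw_data/KO_1_SRR10045016_1.fastq.gz \
                   raw_data/KO_1_SRR10045016_2.fastq.gz 0.1
    seqkit stats sub_KO_1_SRR10045016_*.fastq.gz
fi
```

The two output files should report the same number of reads in seqkit stats; if they differ, the mates were not sampled consistently.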

FASTQ Format Reminder

Each read consists of 4 lines:

@ followed by the read identifier
The DNA sequence
+ (separator line)
Quality scores (ASCII-encoded Phred scores)
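
Because every read occupies exactly four lines, the number of reads in a file is its line count divided by four. A minimal sketch, assuming the standard unwrapped 4-line layout (`count_reads` is a hypothetical helper name):

```shell
# count_reads (hypothetical helper): number of reads in a gzip-compressed FASTQ,
# computed as total lines / 4.
count_reads() {
    cr_lines=$(gzip -dc "$1" | wc -l)
    echo $((cr_lines / 4))
}

# Run from the workshop directory
if [ -f raw_data/KO_1_SRR10045016_1.fastq.gz ]; then
    count_reads raw_data/KO_1_SRR10045016_1.fastq.gz
fi
```

The result should agree with the num_seqs column reported by seqkit stats for the same file.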

Reference genome statistics

# Get stats for the human GRCh38 reference genome
seqkit stats references/GRCh38_genome.fa

Preview the annotation file

# View the first 10 lines of the GTF annotation (skip comment lines)
grep -v "^#" references/GRCh38_genes.gtf | head -10

# Count annotated genes
grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l
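
The same idea extends to every feature type in the file. The sketch below tallies column 3 of the GTF (gene, transcript, exon, and so on); `gtf_feature_counts` is a hypothetical helper name.

```shell
# gtf_feature_counts (hypothetical helper): count the lines of each feature
# type (column 3) in a GTF, most frequent first.
gtf_feature_counts() {
    grep -v '^#' "$1" | cut -f3 | sort | uniq -c | sort -k1,1nr
}

# Run from the workshop directory
if [ -f references/GRCh38_genes.gtf ]; then
    gtf_feature_counts references/GRCh38_genes.gtf
fi
```

The count on the gene row should match the output of the grep command above.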

Exercises

Exercise 1: Read Counts

How many reads are in the R1 FASTQ for one sample? Compare it with the R1 FASTQ of a second sample in raw_data/.

Hint

seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz
seqkit stats raw_data/WT_1_SRR10045019_1.fastq.gz

Exercise 2: Genome Size

How large (in bases) is the human GRCh38 primary assembly? How many sequences does it contain?

Hint

seqkit stats references/GRCh38_genome.fa

Exercise 3: Gene Count

How many genes are annotated in the GRCh38 GTF file?

Hint

grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l

Summary

In this lab, you have:

Learned how to navigate NCBI GEO and SRA to find public sequencing datasets
Linked the raw GSE136366 FASTQs from the shared course directory (Option A) and learned how to download them from NCBI SRA with prefetch and fasterq-dump (Option B)
Learned how to obtain an SRR accession list from the SRA Run Selector for batch downloading
Downloaded a reference genome using the NCBI datasets CLI (Option A) or wget from the Ensembl FTP site (Option B)
Used seqkit to summarize FASTQ and reference genome data
Organized all data with a clear directory structure on Ibex (raw_data/, trimmed/, etc.)

Next: Lab 3: FastQC, fastp, and MultiQC
