Lab 2: Public Data Retrieval
In this lab, you will find, download, and organize the course dataset (GSE136366) and the human reference genome on the Ibex cluster, ready for all downstream analysis.
- Navigate NCBI GEO and SRA to find datasets
- Download raw sequencing data using sra-tools (prefetch, fasterq-dump)
- Batch-download multiple SRA files using an accession list
- Download reference genomes from NCBI using the datasets CLI tool
- Download reference genomes from Ensembl using wget
- Use seqkit to summarize and subsample data
- Organize data with proper directory structure on Ibex
Part 1: Setup on Ibex
Connect to Ibex, start an interactive session, and load the tools needed for this lab.
Connect and start an interactive session
# Connect to Ibex
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive session (never run analyses on login nodes!)
srun --cpus-per-task=4 --mem=32G --time=04:00:00 --pty /bin/bash
Load required modules
# Load bioinformatics tools
module load sratoolkit
module load seqkit
# Verify tools are loaded
module list
Navigate to your workspace
# Navigate to your scratch directory
cd /ibex/user/$USER
# Create the course directory (-p avoids an error if it already exists)
mkdir -p workshop
cd workshop
# Create subdirectories for this lab
mkdir -p raw_data references
# Verify
ls -la
Part 2: Finding Data on NCBI
Before downloading data, you need to know how to find it. NCBI hosts two key resources:
- GEO (Gene Expression Omnibus) for expression data: https://www.ncbi.nlm.nih.gov/geo
- SRA (Sequence Read Archive) for raw sequencing data: https://www.ncbi.nlm.nih.gov/sra
Step 1: Navigate to NCBI GEO
NCBI GEO is a public repository for gene expression and genomics datasets.
- Open your web browser and go to https://www.ncbi.nlm.nih.gov/geo/
- You will see the GEO homepage with a search box at the top
Step 2: Search for a Dataset
As an example, search for accession GSE136366 (a TDP-43 RNA-seq study):
- In the search box, type: GSE136366
- Click "Search" or press Enter
- Click on the accession number GSE136366 to view the full record
Step 3: Explore the GEO Series Page
The GEO Series page contains important information about the dataset:
- Title: Full study title and description
- Summary: Experimental design and objectives
- Overall design: Sample information and conditions
- Samples: List of all individual samples (GSM accessions)
- Citation: Link to the published paper
Step 4: Navigate to SRA from GEO
To access the raw FASTQ files, you need to go to the SRA:
- Scroll down to the "Relations" section on the GEO page
- Look for the link labeled "SRA" and click on it
- Alternatively, click on any individual sample (GSM accession) to find its SRA run
Step 5: SRA Run Selector
The SRA Run Selector provides metadata about each sequencing run:
- Click "Send results to Run selector" if not already there
- The table shows important metadata columns:
| Column | Description |
|---|---|
| Run | SRR accession number (used for download) |
| Sample | Sample name from the study |
| LibraryLayout | SINGLE or PAIRED (single-end vs paired-end) |
| Bases | Total sequencing bases |
| Spots | Number of reads (or read pairs) |
| Platform | Sequencing platform (e.g., ILLUMINA) |
Before downloading any dataset, review the sequencing platform, library layout (single-end vs paired-end), read length, and total read count. This information determines which analysis pipeline to use.
Course Dataset: GSE136366
For this course, we use GSE136366 — an RNA-seq study investigating the role of TDP-43 (TARDBP) in gene regulation in human cells:
| Property | Value |
|---|---|
| GEO Accession | GSE136366 |
| Organism | Homo sapiens |
| Study | TDP-43 (TARDBP) RNA-seq |
| Library Layout | Paired-end |
| Platform | Illumina |
| Reference | GRCh38 (human) |
On the GSE136366 GEO page, click the SRA link and then "Send results to Run Selector". In the Run Selector, click Accession List to download a text file (SRR_Acc_List.txt) with all SRR IDs. Use this file for batch downloading below.
Part 3: Downloading Raw Sequencing Data with sra-tools
The SRA Toolkit provides command-line tools to download and convert SRA data. The two main commands are prefetch (downloads compressed .sra files) and fasterq-dump (converts .sra to FASTQ).
Navigate to your raw data directory
cd /ibex/user/$USER/workshop/raw_data
Option A: Copy from the shared course directory (recommended)
The full GSE136366 FASTQs are already available on Ibex at a shared course path. The files already follow the KO_1_/WT_1_ prefix convention, so no download or renaming is needed; just create symbolic links to them.
ln -s /biocorelab/BIX/resources/datasets/rnaseq/GSE136366/raw_fastq/*.fastq.gz /ibex/user/$USER/workshop/raw_data/
ls -lh /ibex/user/$USER/workshop/raw_data/
Skip to Part 4. The NCBI download option below is provided for reference — it demonstrates how to retrieve public RNA-seq data from scratch, which is the workflow you would use for any dataset not already available on Ibex.
Option B: Download a full dataset from NCBI using prefetch
Download GSE136366 (or data from PRJNA507634) with prefetch and convert to FASTQ
Pass the GEO accession directly to prefetch — it resolves all associated SRR runs automatically and downloads them into a GSE136366/ directory.
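# Download all runs for GSE136366 in one command (can take several hours)
prefetch GSE136366
# Convert each downloaded .sra file to paired FASTQ
# globstar makes ** match any directory depth, so the loop finds the .sra
# files whether they sit one or two levels below the current directory
shopt -s globstar nullglob
for sra in ./**/*.sra; do
    echo "Converting $sra ..."
    fasterq-dump --split-files --threads 4 "$sra"
done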
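If you saved SRR_Acc_List.txt from the Run Selector, the same download can also be driven run by run from that file. The sketch below is one possible way to do this, not an official course script. The names download_runs, run, and DRY_RUN are illustrative assumptions. With DRY_RUN=1 (the default here) it only prints the prefetch and fasterq-dump commands it would execute, so you can check the list before starting a long transfer.

```shell
#!/bin/bash
# Sketch: loop over an accession list and fetch each run.
# download_runs, run and DRY_RUN are hypothetical names used for illustration.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: download_runs ACCESSION_LIST [THREADS]
download_runs() {
    local list="$1" threads="${2:-4}" acc
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc="$(printf '%s' "$acc" | tr -d '\r')"   # tolerate Windows line endings
        [ -z "$acc" ] && continue                   # skip blank lines
        run prefetch "$acc"
        run fasterq-dump --split-files --threads "$threads" "$acc"
    done < "$list"
}

# Example (dry run):  download_runs SRR_Acc_List.txt 4
# Example (real run): DRY_RUN=0 download_runs SRR_Acc_List.txt 4
```

Processing one accession at a time keeps failures isolated: if a single run fails, only that accession needs to be repeated.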
Compress FASTQ files immediately after conversion (gzip). Raw FASTQ files are very large; gzip reduces their size 4–5×. Use /ibex/user/$USER for all large files — home directories have limited quota.
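A minimal sketch of the compression step, assuming the uncompressed files end in .fastq and sit in a single directory. The helper name compress_fastqs is hypothetical. It compresses each file and then runs gzip -t, which only tests the integrity of the archive without extracting it.

```shell
#!/bin/bash
# Sketch: gzip every uncompressed FASTQ in a directory and test each archive.
# compress_fastqs is a hypothetical helper name.

# Usage: compress_fastqs DIR
compress_fastqs() {
    local dir="${1:-.}" fq
    for fq in "$dir"/*.fastq; do
        [ -e "$fq" ] || continue        # no matches: the glob stays literal
        gzip "$fq" || return 1          # replaces file.fastq with file.fastq.gz
        if gzip -t "$fq.gz"; then       # integrity check of the new archive
            echo "OK      $fq.gz"
        else
            echo "CORRUPT $fq.gz" >&2
            return 1
        fi
    done
}

# Example: compress_fastqs /ibex/user/$USER/workshop/raw_data
```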
Submit the download as an sbatch job (recommended)
Downloading and converting all GSE136366 samples can take several hours. Running this interactively risks losing work if your SSH session disconnects. Instead, submit it as a SLURM batch job on Ibex using sbatch.
Use an interactive session (srun --pty bash) for short exploratory tasks (< 30 min) where you need immediate feedback. Use sbatch for anything that runs longer — downloads, alignments, workflow runs. The job continues even if you close your terminal.
Understanding sbatch directives
Every sbatch script starts with a shebang line (#!/bin/bash) followed by #SBATCH comment lines. SLURM reads these before executing the script. Here is what each directive means:
| Directive | Meaning |
|---|---|
| --job-name | A human-readable label for the job (shown in squeue output). |
| --ntasks | Number of MPI tasks (processes). For serial or multi-threaded jobs, keep this at 1. |
| --cpus-per-task | CPU cores allocated per task. Match this to the --threads argument of your tool. |
| --mem | Total RAM for the job (e.g. 16G). Request enough: too little causes an out-of-memory kill; too much wastes queue priority. |
| --time | Wall-clock time limit in HH:MM:SS. The job is killed if it runs longer. Estimate generously for downloads. |
| --partition | The queue to submit to. Use batch for general compute jobs on Ibex. |
| --output | File path for standard output (stdout). %j is replaced by the job ID at runtime. |
| --error | File path for standard error (stderr). Separate from stdout so errors are easy to find. |
| --mail-type | When to email you: BEGIN, END, FAIL, or ALL. |
| --mail-user | Your email address for job notifications. |
Create the submission script
Save the script below as download_sra.sh in your raw_data directory:
#!/bin/bash
#SBATCH --job-name=sra_download # Name shown in squeue
#SBATCH --ntasks=1 # Single task (not MPI)
#SBATCH --cpus-per-task=8 # 8 threads for fasterq-dump
#SBATCH --mem=16G # Total RAM for the job (shared by all threads)
#SBATCH --time=12:00:00 # Up to 12 hours for all samples
#SBATCH --partition=batch # General compute queue on Ibex
#SBATCH --output=logs/sra_download_%j.out # stdout log; %j = job ID
#SBATCH --error=logs/sra_download_%j.err # stderr log
#SBATCH --mail-type=END,FAIL # Email when done or if it fails
#SBATCH --mail-user=your.email@kaust.edu.sa # Replace with your email
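# ---------- environment ----------
# Note: SLURM opens the log files before this script starts, so logs/ must
# already exist in the directory you submit from (create it before sbatch).
cd /ibex/user/$USER/workshop/raw_data
module load sratoolkit
mkdir -p logs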
# ---------- step 1: prefetch (download all runs for GSE136366) ----------
echo "Starting prefetch ..."
prefetch GSE136366
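# ---------- step 2: convert .sra to paired FASTQ ----------
echo "Starting fasterq-dump ..."
# globstar makes ** match any directory depth; nullglob skips the loop if nothing matches
shopt -s globstar nullglob
for sra in ./**/*.sra; do
    echo " Converting $sra ..."
    fasterq-dump --split-files --threads 8 "$sra"
done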
gzip *.fastq
# ---------- step 3: rename to add condition/replicate prefix ----------
echo "Renaming files to match shared folder convention ..."
mv SRR10045016_1.fastq.gz KO_1_SRR10045016_1.fastq.gz
mv SRR10045016_2.fastq.gz KO_1_SRR10045016_2.fastq.gz
mv SRR10045017_1.fastq.gz KO_2_SRR10045017_1.fastq.gz
mv SRR10045017_2.fastq.gz KO_2_SRR10045017_2.fastq.gz
mv SRR10045018_1.fastq.gz KO_3_SRR10045018_1.fastq.gz
mv SRR10045018_2.fastq.gz KO_3_SRR10045018_2.fastq.gz
mv SRR10045019_1.fastq.gz WT_1_SRR10045019_1.fastq.gz
mv SRR10045019_2.fastq.gz WT_1_SRR10045019_2.fastq.gz
mv SRR10045020_1.fastq.gz WT_2_SRR10045020_1.fastq.gz
mv SRR10045020_2.fastq.gz WT_2_SRR10045020_2.fastq.gz
mv SRR10045021_1.fastq.gz WT_3_SRR10045021_1.fastq.gz
mv SRR10045021_2.fastq.gz WT_3_SRR10045021_2.fastq.gz
echo "All downloads complete."
ls -lh *.fastq.gz
Submit and monitor the job
# Submit to SLURM from the raw_data directory
# (logs/ must exist before submission, because SLURM writes the log files there)
cd /ibex/user/$USER/workshop/raw_data
mkdir -p logs
sbatch download_sra.sh
# Check job status
squeue -u $USER
# Watch live stdout while the job runs (replace JOBID with the number printed by sbatch)
tail -f /ibex/user/$USER/workshop/raw_data/logs/sra_download_JOBID.out
# Cancel a job if needed
scancel JOBID
Check the .err log for any errors, then verify all expected FASTQ pairs are present: ls -lh *.fastq.gz | wc -l should equal 2 × number of samples. If a sample failed, re-run fasterq-dump for just that SRR ID rather than re-running the whole job.
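The pair check described above can also be done per accession, which tells you exactly which file is missing rather than just the total. This is a sketch under stated assumptions: check_pairs is a hypothetical helper, and the file pattern assumes the naming used in this lab (an optional prefix such as KO_1_, then the SRR accession, then _1 or _2).

```shell
#!/bin/bash
# Sketch: report which runs are missing an R1 or R2 FASTQ file.
# check_pairs is a hypothetical helper name.

# Usage: check_pairs ACCESSION_LIST FASTQ_DIR
check_pairs() {
    local list="$1" dir="$2" acc mate missing=0
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc="$(printf '%s' "$acc" | tr -d '\r')"   # tolerate Windows line endings
        [ -z "$acc" ] && continue                   # skip blank lines
        for mate in 1 2; do
            # ls succeeds only if at least one file matches the pattern
            if ! ls "$dir"/*"${acc}_${mate}.fastq.gz" > /dev/null 2>&1; then
                echo "MISSING ${acc}_${mate}.fastq.gz"
                missing=$((missing + 1))
            fi
        done
    done < "$list"
    echo "$missing file(s) missing"
    [ "$missing" -eq 0 ]   # exit status 0 only when every pair is complete
}

# Example: check_pairs SRR_Acc_List.txt /ibex/user/$USER/workshop/raw_data
```

Because the function returns a non-zero status when anything is missing, it can gate a later step, for example `check_pairs list.txt raw_data && echo "ready"`.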
Part 4: Downloading Reference Genomes
Reference genomes and gene annotations are available from two major sources: NCBI and Ensembl. Both provide the same biological data but differ in naming conventions, annotation sources, and access tools. We will cover both approaches.
Navigate to your references directory
cd /ibex/user/$USER/workshop/references
Option A: Download from NCBI using datasets
NCBI provides the datasets command-line tool – a modern, fast way to download genomes, annotations, and gene data directly from the NCBI database.
Get the NCBI datasets tool on Ibex
# Load the NCBI datasets tool
module load ncbi_datasets_tools
# Check the help message
datasets --help
Search for a genome assembly
You can search for genomes by organism name or assembly accession:
# Search for human genome assemblies
datasets summary genome taxon "Homo sapiens" --reference | head -5
# Search by specific assembly accession (GRCh38 / hg38)
datasets summary genome accession GCF_000001405.40
- GCF_ prefix – NCBI RefSeq assembly (curated, recommended)
- GCA_ prefix – GenBank assembly (submitted by researchers)
For most analyses, prefer RefSeq (GCF_) assemblies as they are more standardized.
Download genome and annotation
The datasets download command fetches the genome FASTA, annotation (GTF/GFF3), and metadata in a single zip archive:
# Download human GRCh38 reference genome with annotations
# Note: This is a large download (~3 GB)
datasets download genome accession GCF_000001405.40 \
--include genome,gtf,seq-report
# This creates a file called ncbi_dataset.zip
ls -lh ncbi_dataset.zip
--include Options
| Option | Downloads |
|---|---|
| genome | Genome FASTA sequence |
| gtf | Gene annotation in GTF format |
| gff3 | Gene annotation in GFF3 format |
| rna | Transcript sequences (RNA FASTA) |
| protein | Protein sequences (amino acid FASTA) |
| cds | Coding sequences (CDS FASTA) |
| seq-report | Sequence report with chromosome info |
Extract and organize the downloaded files
# Extract the zip archive
unzip ncbi_dataset.zip
# See the directory structure
ls ncbi_dataset/data/GCF_000001405.40/
# The extracted files will include:
# GCF_000001405.40_GRCh38.p14_genomic.fna - genome FASTA (chromosomes)
# genomic.gtf - gene annotation (GTF)
# sequence_report.jsonl - sequence metadata
# Copy files to your references directory with clear names
cp ncbi_dataset/data/GCF_000001405.40/*.fna GRCh38_genome.fa
cp ncbi_dataset/data/GCF_000001405.40/genomic.gtf GRCh38_genes.gtf
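# Verify the genome (expect a little over 3 Gbp in total; besides the 25 chromosome-level
# sequences, the RefSeq download also contains unplaced scaffolds, alternate loci, and
# patches, so the sequence count is well above 25)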
seqkit stats GRCh38_genome.fa
# Clean up the zip and extracted directory
rm -r ncbi_dataset ncbi_dataset.zip
Option B: Download from Ensembl using wget
Ensembl provides reference genomes and gene annotations for thousands of species. Files are organized on their FTP server by release version, species, and data type.
Understanding the Ensembl FTP Structure
The Ensembl FTP site is organized as follows:
ftp.ensemblgenomes.org/pub/bacteria/
├── release-57/
│ ├── fasta/ # Sequence files (genome, cDNA, protein)
│ │ └── bacteria_0_collection/
│ │ └── escherichia_coli_.../
│ │ └── dna/ # Genomic DNA sequences
│ ├── gff3/ # Gene annotations (GFF3 format)
│ │ └── bacteria_0_collection/
│ └── gtf/ # Gene annotations (GTF format)
└── current/ # Symlink to latest release
- ftp.ensembl.org – Vertebrate species (human, mouse, zebrafish, etc.)
- ftp.ensemblgenomes.org – Non-vertebrate species (bacteria, plants, fungi, etc.)
Use the appropriate server for your organism.
Download human GRCh38 reference genome from Ensembl
# Download human GRCh38 primary assembly (excludes alternate haplotypes and patches, so it is smaller than the full toplevel assembly)
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Download gene annotation
# Download GTF annotation (Ensembl release 110, GRCh38)
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
Decompress and verify
# Decompress all downloaded files
gunzip *.gz
# Check reference genome statistics (GRCh38 primary assembly: about 3.1 Gbp; chromosomes 1-22, X, Y, MT plus unplaced scaffolds)
seqkit stats Homo_sapiens.GRCh38.dna.primary_assembly.fa
# Rename for convenience
mv Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_genome.fa
mv Homo_sapiens.GRCh38.110.gtf GRCh38_genes.gtf
# List files with sizes
ls -lh
| Feature | NCBI (datasets) | Ensembl (wget) |
|---|---|---|
| Download tool | datasets tool | wget / curl |
| Naming | RefSeq accessions (GCF_/GCA_) | Ensembl-specific names |
| Annotations | NCBI RefSeq gene models | Ensembl/GENCODE gene models |
| Chromosome names | NC_ accessions or chromosome numbers | Chromosome numbers (1, 2, X, MT) |
| Best for | Prokaryotes, clinical/NCBI pipelines | Eukaryotes, RNA-seq, GENCODE consistency |
Both are valid choices. The key is to be consistent: always use the genome and annotation from the same source to avoid mismatches in chromosome names.
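A quick way to catch a naming mismatch is to compare the sequence names in the FASTA headers with column 1 of the GTF. The sketch below assumes an uncompressed FASTA and GTF. The function name gtf_names_missing_from_fasta is hypothetical. It prints every sequence name that appears in the annotation but not in the genome, so empty output means the two files agree.

```shell
#!/bin/bash
# Sketch: list sequence names used in a GTF that are absent from a FASTA.
# gtf_names_missing_from_fasta is a hypothetical helper name.

# Usage: gtf_names_missing_from_fasta GENOME.fa GENES.gtf
gtf_names_missing_from_fasta() {
    local fasta="$1" gtf="$2" fa_names gtf_names
    fa_names="$(mktemp)"
    gtf_names="$(mktemp)"
    # FASTA: the name is the first word after ">" on each header line
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u > "$fa_names"
    # GTF: the name is column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$gtf_names"
    # comm -13 prints lines that appear only in the second file
    LC_ALL=C comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}

# Example: gtf_names_missing_from_fasta GRCh38_genome.fa GRCh38_genes.gtf
```

Mixing an NCBI genome with an Ensembl annotation, for instance, would list names such as 1, 2, or X here, because the NCBI FASTA uses NC_ accessions for the same chromosomes.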
Part 5: Organizing Your Workspace
Create the full directory structure
mkdir -p /ibex/user/$USER/workshop/{raw_data,trimmed,aligned,counts,results,references,logs}
ls /ibex/user/$USER/workshop/
| Directory | Contents | Used in |
|---|---|---|
| raw_data/ | Original full FASTQ files downloaded from SRA | Lab 6 onwards (nf-core) |
| trimmed/ | fastp-trimmed FASTQs | Labs 3–5 |
| aligned/ | STAR BAM files | Lab 4 |
| counts/ | Salmon quantification output | Lab 5 |
| references/ | Reference genome, annotation, and indexes | Labs 4–5 |
| results/ | Final analysis outputs (DEA, plots, reports) | Labs 7–10 |
Part 6: Exploring Downloaded Data
Before proceeding with analysis, always inspect your downloaded data to make sure everything looks correct.
Inspect raw reads
cd /ibex/user/$USER/workshop
# Quick look at the first few reads from one GSE136366 sample (12 lines = 3 reads)
zcat raw_data/KO_1_SRR10045016_1.fastq.gz | head -12
Each read consists of 4 lines:
- @ followed by the read identifier
- The DNA sequence
- + (separator line)
- Quality scores (ASCII-encoded Phred scores)
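Because every read occupies exactly four lines, the number of reads in a file is its line count divided by four. The sketch below shows that arithmetic. The helper name count_reads is hypothetical, and it assumes the standard 4-line layout with no wrapped sequence lines.

```shell
#!/bin/bash
# Sketch: count reads in a gzipped FASTQ by dividing its line count by 4.
# count_reads is a hypothetical helper name.

# Usage: count_reads FILE.fastq.gz
count_reads() {
    local lines
    lines="$(gzip -dc "$1" | wc -l)"   # decompress to stdout and count lines
    echo $((lines / 4))
}

# Example: count_reads raw_data/KO_1_SRR10045016_1.fastq.gz
```

For a paired-end sample, the _1 and _2 files should report the same number. seqkit stats reports the same count in its num_seqs column.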
Reference genome statistics
# Get stats for the human GRCh38 reference genome
seqkit stats references/GRCh38_genome.fa
Preview the annotation file
# View the first 10 lines of the GTF annotation (skip comment lines)
grep -v "^#" references/GRCh38_genes.gtf | head -10
# Count annotated genes
grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l
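The gene count is one row of a broader summary: column 3 of a GTF holds the feature type (gene, transcript, exon, and so on). As a sketch, the hypothetical helper gtf_feature_counts below tallies every feature type in one pass. It matches on column 3 only, so text elsewhere on a line cannot be counted by accident.

```shell
#!/bin/bash
# Sketch: tally GTF records by feature type (column 3).
# gtf_feature_counts is a hypothetical helper name.

# Usage: gtf_feature_counts GENES.gtf
gtf_feature_counts() {
    # Skip "#" comment lines, then count each distinct value in column 3
    awk -F'\t' '!/^#/ { n[$3]++ } END { for (f in n) print f "\t" n[f] }' "$1" \
        | LC_ALL=C sort
}

# Example: gtf_feature_counts references/GRCh38_genes.gtf
```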
Exercises
Exercise 1: Read Counts
How many reads are in the R1 FASTQ for one sample? Compare it with the R1 FASTQ of a second sample in raw_data/.
Hint
seqkit stats raw_data/KO_1_SRR10045016_1.fastq.gz
seqkit stats raw_data/WT_1_SRR10045019_1.fastq.gz
Exercise 2: Genome Size
How large (in bases) is the human GRCh38 primary assembly? How many sequences does it contain?
Hint
seqkit stats references/GRCh38_genome.fa
Exercise 3: Gene Count
How many genes are annotated in the GRCh38 GTF file?
Hint
grep -P "\tgene\t" references/GRCh38_genes.gtf | wc -l
Summary
In this lab, you have:
- Learned how to navigate NCBI GEO and SRA to find public sequencing datasets
- Copied raw GSE136366 FASTQs from the shared course directory (Option A) and learned how to download them from NCBI SRA using prefetch/fasterq-dump (Option B)
- Learned to batch-download multiple SRA files using an accession list file
- Downloaded a reference genome using the NCBI datasets CLI (Option A) or Ensembl FTP using wget (Option B)
- Used seqkit to summarize FASTQ data
- Organized all data with a clear directory structure on Ibex (raw_data/, trimmed/, etc.)