Lab 6: Running nf-core/rnaseq on Ibex
In this lab you will run the nf-core/rnaseq pipeline on the KAUST Ibex HPC cluster using the chr11-filtered GSE136366 reads. The pipeline automates the full preprocessing and quantification workflow — from raw reads to expression matrices — in a reproducible and scalable way.
Like Labs 3–5, this lab uses the chromosome 11-filtered GSE136366 reads (~3–4 M read pairs per sample). nf-core/rnaseq manages its own SLURM job scheduling, so we pass the chr11 FASTA and GTF directly instead of the full-genome KAUST profile registry entry. All pipeline steps are identical to a full-scale run — only the reference scope differs. The full-scale run would simply replace the --fasta/--gtf flags with --genome GRCh38.p14.
By the end of this lab, you will be able to:
- Set up Nextflow and nf-core tools on Ibex
- Create a properly formatted samplesheet for GSE136366
- Configure nf-core/rnaseq for the Ibex SLURM cluster
- Submit the pipeline as a SLURM batch job
- Monitor pipeline progress and interpret Nextflow logs
- Inspect the MultiQC report and understand nf-core output structure
- Locate Salmon quantification files for downstream differential expression analysis
nf-core/rnaseq is a community-maintained Nextflow workflow that runs the following steps automatically, in order, for every sample:
- FastQC — raw read quality assessment before trimming
- fastp — adapter trimming and quality filtering (or Trim Galore)
- FastQC — post-trimming quality assessment
- STAR — splice-aware alignment to the reference genome (produces BAM files)
- Salmon — quantification of transcript-level expression from alignments (star_salmon mode) or directly from reads (pseudo-alignment mode)
- Picard / SAMtools — post-alignment processing: duplicate marking, sorting, indexing
- RSeQC, QualiMap, dupRadar, PreSeq — alignment-level quality metrics
- MultiQC — aggregates all QC reports into a single interactive HTML report
All processes run in containers (Singularity on Ibex), so no manual software installation is needed beyond Nextflow itself.
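The steps above amount to a single per-sample chain of tools. The sketch below is a conceptual illustration only, not the pipeline's actual commands: the flags are typical defaults, the index and reference paths are hypothetical, and the real pipeline adds many more options. The DRY_RUN guard prints each command instead of executing it, since on Ibex these tools live inside the pipeline's Singularity containers.

```shell
#!/usr/bin/env bash
# Conceptual sketch of the per-sample steps nf-core/rnaseq automates.
# With DRY_RUN=1 each command is printed rather than executed.
DRY_RUN=1
run() { if [ "${DRY_RUN}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

sample=KO_1_SRR10045016
# 1. Raw-read QC
run fastqc "${sample}_1.fastq.gz" "${sample}_2.fastq.gz"
# 2. Adapter trimming and quality filtering
run fastp -i "${sample}_1.fastq.gz" -I "${sample}_2.fastq.gz" \
    -o "${sample}_1.trim.fastq.gz" -O "${sample}_2.trim.fastq.gz"
# 3. Splice-aware alignment (star_index is a hypothetical pre-built index dir)
run STAR --genomeDir star_index \
    --readFilesIn "${sample}_1.trim.fastq.gz" "${sample}_2.trim.fastq.gz"
# 4. Alignment-based quantification from the resulting BAM
run salmon quant -t transcripts.fa -l A -a "${sample}.bam" -o "${sample}_salmon"
```

In the real run, nf-core/rnaseq also interleaves post-trimming FastQC, Picard/SAMtools processing, the alignment-QC tools, and MultiQC, and parallelizes everything across samples via SLURM.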
Part 1: Setup on Ibex
Connect to Ibex, start an interactive session, and prepare the environment for running Nextflow.
Connect to Ibex and start an interactive session
# Connect to Ibex login node
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive session for setup tasks
# (never run heavy jobs directly on the login node)
srun --pty --time=2:00:00 --mem=8G --cpus-per-task=4 bash
Load Nextflow and Singularity
# Clear any previously loaded modules, then load Nextflow and Singularity
module purge
module load nextflow
module load singularity
# Confirm the versions
nextflow -version
singularity --version
nf-core/rnaseq requires Nextflow ≥ 23.04. Both nextflow and singularity must be loaded — the KAUST profile uses Singularity containers for all pipeline tools.
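If you want to check the version requirement programmatically, a small sketch using GNU sort's version ordering works; the version_ge helper below is our own, not part of Nextflow, and the version string is hard-coded here for illustration (on Ibex you would parse the real `nextflow -version` output).

```shell
# Return success if version $1 >= version $2, using sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example version string; substitute the value reported by `nextflow -version`.
nf_version="23.10.1"
if version_ge "$nf_version" "23.04"; then
    echo "Nextflow ${nf_version} satisfies the >= 23.04 requirement"
else
    echo "Nextflow ${nf_version} is too old for nf-core/rnaseq"
fi
```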
Create the pipeline working directory
# Create a dedicated directory for this pipeline run
mkdir -p /ibex/user/$USER/workshop/nfcore_rnaseq
cd /ibex/user/$USER/workshop/nfcore_rnaseq
# Confirm your location
pwd
Part 2: Create the Samplesheet
nf-core/rnaseq takes a CSV samplesheet as its primary input. Each row describes one sample and its FASTQ files. The samplesheet must have exactly these four columns:
| Column | Description |
|---|---|
| sample | A unique sample name (no spaces) |
| fastq_1 | Absolute path to the R1 (forward) FASTQ file |
| fastq_2 | Absolute path to the R2 (reverse) FASTQ file. Leave empty for single-end data. |
| strandedness | Library strandedness: auto, forward, reverse, or unstranded |
- auto — nf-core/rnaseq will detect strandedness automatically using Salmon. Recommended when unsure.
- forward — use for stranded libraries where read 1 is in the same orientation as the transcript.
- reverse — use for stranded libraries where read 1 is reverse-complementary to the transcript (e.g., Illumina TruSeq Stranded mRNA and other dUTP-based or ligation-based kits).
- unstranded — use for non-stranded protocols (e.g., standard poly-A capture without strand preservation).
Example samplesheet structure
Below is an example showing two samples from GSE136366. Replace YOUR_USERNAME with your Ibex username:
sample,fastq_1,fastq_2,strandedness
KO_1_SRR10045016,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_1_SRR10045016_1.fastq.gz,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_1_SRR10045016_2.fastq.gz,auto
KO_2_SRR10045017,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_2_SRR10045017_1.fastq.gz,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_2_SRR10045017_2.fastq.gz,auto
Generate the samplesheet automatically with a bash script
Rather than typing each row manually, use this script to generate the samplesheet for all six GSE136366 chr11-filtered samples:
cd /ibex/user/$USER/workshop/nfcore_rnaseq
RAWDIR="/ibex/user/$USER/workshop/chr11_raw_data"
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"
# Write the CSV header
echo "sample,fastq_1,fastq_2,strandedness" > samplesheet.csv
# Add one row per sample
for sample in $SAMPLES; do
echo "${sample},${RAWDIR}/${sample}_1.fastq.gz,${RAWDIR}/${sample}_2.fastq.gz,auto"
done >> samplesheet.csv
# Review the generated samplesheet
cat samplesheet.csv
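Before launching, it is worth confirming that every FASTQ path listed in the samplesheet actually exists; otherwise the pipeline fails later at file-staging time. The sketch below is self-contained (it fabricates a temporary directory, dummy FASTQ files, and a one-row samplesheet so it can run anywhere); on Ibex you would run only the while loop against your real samplesheet.csv.

```shell
# Fabricate a toy samplesheet and dummy FASTQ files for demonstration.
RAWDIR=$(mktemp -d)
touch "${RAWDIR}/KO_1_SRR10045016_1.fastq.gz" "${RAWDIR}/KO_1_SRR10045016_2.fastq.gz"
cat > "${RAWDIR}/samplesheet.csv" <<EOF
sample,fastq_1,fastq_2,strandedness
KO_1_SRR10045016,${RAWDIR}/KO_1_SRR10045016_1.fastq.gz,${RAWDIR}/KO_1_SRR10045016_2.fastq.gz,auto
EOF

# Verify that every fastq_1/fastq_2 path in the samplesheet exists.
missing=0
while IFS=, read -r sample fq1 fq2 strand; do
    [ "$sample" = "sample" ] && continue   # skip the header row
    for f in "$fq1" "$fq2"; do
        if [ -n "$f" ] && [ ! -e "$f" ]; then
            echo "MISSING: $f (sample $sample)"
            missing=$((missing + 1))
        fi
    done
done < "${RAWDIR}/samplesheet.csv"
echo "$missing file(s) missing"
```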
Part 3: The KAUST Profile
Instead of writing a manual nextflow.config, we use the pre-built KAUST institutional profile by passing -profile kaust to Nextflow. This single flag activates a configuration file maintained by the KAUST Bioinformatics Platform that handles all cluster-specific settings automatically.
What -profile kaust configures automatically
| Setting | Value | Meaning |
|---|---|---|
| Job scheduler | SLURM | Each pipeline process is submitted as a SLURM job to the batch partition on Ibex |
| Container runtime | Singularity | All tools run inside Singularity containers — no manual module load needed per process |
| Container library | Central shared library | Pre-downloaded images shared across all users; new images go to your personal cache |
| Resource limits | Ibex-tuned defaults | Per-process CPU, memory, and time limits sized for Ibex compute nodes |
| Reference genomes | --fasta / --gtf | Pass explicit chr11 FASTA and GTF paths; nf-core builds STAR and Salmon indexes automatically |
The KAUST profile includes a genome registry with reference files for common organisms already available on Ibex. For human data, you can pass --genome GRCh38.p14 — this resolves the reference genome FASTA, GTF annotation, and pre-built STAR/Salmon indexes from shared paths on Ibex without needing to download or specify reference files manually.
For this chr11 lab, we pass explicit --fasta and --gtf paths pointing to the chromosome 11-only reference files. nf-core/rnaseq will build the STAR and Salmon indexes automatically from these files. The full-genome equivalent would simply replace those two flags with --genome GRCh38.p14.
The KAUST profile also includes a dedicated rnaseq sub-profile with tuned defaults for the nf-core/rnaseq pipeline specifically.
Part 4: Run nf-core/rnaseq
With the samplesheet and configuration in place, you are ready to launch the pipeline.
Full pipeline run command
cd /ibex/user/$USER/workshop/nfcore_rnaseq
nextflow run nf-core/rnaseq -r 3.23.0 \
--input samplesheet.csv \
--outdir results \
--fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
--gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
--aligner star_salmon \
--pseudo_aligner salmon \
--trimmer fastp \
--genome_size 135086622 \
--star_index_bases 11 \
-profile kaust \
-resume
| Parameter | Value used | Description |
|---|---|---|
| --input | samplesheet.csv | Path to the CSV samplesheet you created |
| --outdir | results | Directory where all output files will be written |
| --fasta | chr11 FASTA path | Path to the chromosome 11 reference genome FASTA file |
| --gtf | chr11 GTF path | Path to the chromosome 11 GTF annotation file; used for STAR index building and gene-level quantification |
| --genome_size | 135086622 | Effective genome size for chromosome 11 (~135 Mb); used by some QC tools |
| --star_index_bases | 11 | STAR genomeSAindexNbases parameter, reduced for the smaller chr11 genome |
| --aligner | star_salmon | Align with STAR, then quantify with Salmon in alignment-based mode |
| --pseudo_aligner | salmon | Also run Salmon in pseudo-alignment mode directly from reads |
| --trimmer | fastp | Use fastp for adapter trimming (alternative: trimgalore) |
| -profile | kaust | Activates the KAUST institutional config: SLURM scheduler, Singularity containers, shared image library, and Ibex resource defaults |
| -resume | — | Resume a previous run from where it left off (uses Nextflow cache) |
The -resume flag and Nextflow caching
Nextflow caches the output of every successfully completed process in a hidden work/ directory. If the pipeline fails or you add new samples, running with -resume skips all already-completed steps and only re-runs what is needed. This can save hours of compute time. Cache entries are invalidated automatically when inputs or parameters change.
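The caching behaviour can be illustrated with a tiny content-addressed sketch. This is a conceptual analogy only, not Nextflow's actual implementation: a task's output is stored under a hash of its inputs, so the task re-runs only when an input or parameter changes.

```shell
# Conceptual sketch of -resume: cache a task's output under an input hash.
workdir=$(mktemp -d)

run_cached() {   # usage: run_cached "<inputs-and-params-as-a-string>"
    local hash out
    hash=$(printf '%s' "$1" | md5sum | cut -c1-8)
    out="${workdir}/${hash}/result.txt"
    if [ -f "$out" ]; then
        echo "cached [$hash]"                # output already exists: skip
    else
        mkdir -p "${workdir}/${hash}"
        echo "processed: $1" > "$out"        # the "expensive" step
        echo "ran    [$hash]"
    fi
}

run_cached "sampleA params=v1"   # first run: executes
run_cached "sampleA params=v1"   # identical inputs: cache hit, skipped
run_cached "sampleA params=v2"   # changed parameter: new hash, re-runs
```

Changing any input changes the hash, which is why editing a parameter or a FASTQ file automatically invalidates the relevant cache entries on a real -resume run.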
Submit the pipeline as a SLURM batch job (recommended)
For runs that will take several hours, submit the Nextflow head process itself as a SLURM job; Nextflow will then submit all sub-jobs to SLURM automatically from within that job. Copy the following into a new file named run_rnaseq.sh:
#!/bin/bash
#SBATCH --job-name=nfcore_rnaseq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=24:00:00
#SBATCH --partition=batch
#SBATCH --output=nextflow_%j.log
#SBATCH --error=nextflow_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@kaust.edu.sa
cd /ibex/user/$USER/workshop/nfcore_rnaseq
module purge
module load nextflow
module load singularity
export NXF_SINGULARITY_CACHEDIR=/ibex/user/$USER/singularity_cache
nextflow run nf-core/rnaseq -r 3.23.0 \
--input samplesheet.csv \
--outdir results \
--fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
--gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
--aligner star_salmon \
--pseudo_aligner salmon \
--trimmer fastp \
--genome_size 135086622 \
--star_index_bases 11 \
-profile kaust \
-resume
# Submit the script
sbatch run_rnaseq.sh
# Check job status
squeue -u $USER
The Nextflow head job itself only needs 2 CPUs and 8 GB of memory — it just manages job submission and monitors progress. All actual computation happens in the sub-jobs Nextflow submits to SLURM. Keep the head job running for the full expected runtime (24 hours is safe).
Part 5: Monitor Pipeline Progress
Watch the Nextflow log in real time
# If running interactively, Nextflow writes to stdout and .nextflow.log
# If submitted via sbatch, tail the log file:
tail -f nextflow_*.log
# Also watch the Nextflow-specific log
tail -f .nextflow.log
Monitor SLURM jobs submitted by Nextflow
# List all your running and pending jobs
squeue -u $USER
# More detail (job name, state, time used, node)
squeue -u $USER -o "%.18i %.30j %.8T %.10M %.6D %R"
# Watch the queue refreshing every 5 seconds
watch -n 5 squeue -u $USER
Understanding the Nextflow progress output
A typical Nextflow run looks like this in the log:
executor > slurm (42)
[5e/3f2a1b] NFCORE_RNASEQ:RNASEQ:FASTQC_UMITOOLS_TRIMGALORE:FASTQC (KO_1_SRR10045016) [100%] 6 of 6 ✔
[a1/88cd2f] NFCORE_RNASEQ:RNASEQ:FASTP (KO_1_SRR10045016) [100%] 6 of 6 ✔
[b3/12ef45] NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN (KO_1_SRR10045016) [ 83%] 5 of 6
[--/------] NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT -
...
- The hex code (e.g., 5e/3f2a1b) is the work directory hash for that process instance.
- [100%] 6 of 6 ✔ — all 6 samples completed successfully for this step.
- [ 83%] 5 of 6 — 5 samples done, 1 still running.
- A bare - means the step has not started yet (waiting for upstream steps to finish).
- A red ✗ indicates a failed process.
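These progress lines are regular enough to parse with standard tools. The sketch below hard-codes two sample lines (so it runs anywhere) and pulls out the work-directory hash and completion percentage of each process; on Ibex you would feed it the saved nextflow_*.log instead.

```shell
# Two sample Nextflow progress lines, as they appear in a saved log.
log='[5e/3f2a1b] NFCORE_RNASEQ:RNASEQ:FASTP (KO_1) [100%] 6 of 6 ✔
[b3/12ef45] NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN (KO_1) [ 83%] 5 of 6'

# Split each line on the [ and ] brackets: field 2 is the work-dir hash,
# field 4 is the completion percentage (e.g. "100%" or " 83%").
printf '%s\n' "$log" | awk -F'[][]' '{
    split($4, pct, "%")
    gsub(/^ +/, "", pct[1])       # strip the leading space padding
    print $2, "->", pct[1] "%"
}'
```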
What to do if a step fails
# Step 1: Find the work directory of the failed process from the log
# The hash is shown in the progress line, e.g. [b3/12ef45]
ls work/b3/12ef45*/
# Step 2: Read the error file
cat work/b3/12ef45*/.command.err
# Step 3: Read the full command that was run
cat work/b3/12ef45*/.command.sh
# Step 4: Read stdout output
cat work/b3/12ef45*/.command.out
# After fixing the issue, re-run with -resume to continue from where it stopped
nextflow run nf-core/rnaseq \
--input samplesheet.csv \
--outdir results \
--fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
--gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
--aligner star_salmon \
--genome_size 135086622 \
--star_index_bases 11 \
-profile kaust \
-resume
- Exit status 137 (OOM) — the process ran out of memory. The KAUST profile sets process-level limits automatically, but you can override them by creating a local nextflow.config with a withName block and passing -c nextflow.config alongside -profile kaust.
- Exit status 140 (walltime) — the job exceeded its time limit. Override the time limit in a local config as above.
- Singularity: Failed to create image — NXF_SINGULARITY_CACHEDIR is not set or the target filesystem is full. Check available space with df -h /ibex/user/$USER.
- WARN: Task was cached but the cached output is missing — the work directory was cleaned or moved. Remove the -resume flag to re-run from scratch, or restore the deleted work directory.
- Cannot stage input file — a FASTQ file listed in the samplesheet does not exist. Double-check all paths with ls.
- Process killed (SIGKILL) — usually OOM or disk quota exceeded. Check df -h and quota -s.
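For the memory and walltime overrides mentioned above, a minimal local config might look like the following. The process name and resource values here are illustrative assumptions; take the real process name from the failing progress line and size the values to the actual failure.

```shell
# Write a local override config (example values only).
cat > custom.config <<'EOF'
process {
    withName: 'STAR_ALIGN' {
        memory = '64.GB'
        time   = '12.h'
    }
}
EOF

# Then re-run with the override alongside the institutional profile:
#   nextflow run nf-core/rnaseq ... -profile kaust -c custom.config -resume
```

Settings in a -c config are merged on top of the profile, so only the named process is affected; everything else keeps the Ibex-tuned defaults.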
Part 6: Inspect Pipeline Outputs
Once the pipeline completes, explore the results directory to understand what was produced.
Output directory structure
# List the top-level output structure
ls -lh results/
# A successful run produces directories like:
# results/
# ├── fastqc/ -- FastQC reports for raw reads
# ├── fastp/ -- Trimming reports and trimmed FASTQs
# ├── star_salmon/ -- STAR alignments + Salmon quantification
# │ ├── KO_1_SRR10045016/ -- Per-sample BAM files and Salmon output
# │ ├── KO_2_SRR10045017/
# │ ├── ...
# │ └── salmon/ -- Merged Salmon quant.sf files and count matrices
# ├── multiqc/ -- Aggregated QC HTML report
# └── pipeline_info/ -- Nextflow execution report and timeline
Explore the STAR alignment output
# List output for one sample
ls -lh results/star_salmon/KO_1_SRR10045016/
# The BAM file contains the genome alignments
samtools flagstat results/star_salmon/KO_1_SRR10045016/KO_1_SRR10045016.Aligned.sortedByCoord.out.bam
# View the alignment log (STAR mapping statistics)
cat results/star_salmon/KO_1_SRR10045016/Log.final.out
Find the Salmon quantification files
# Each sample has a quant.sf file with per-transcript TPM values
ls results/star_salmon/KO_1_SRR10045016/
# View the quant.sf header and first few lines
head results/star_salmon/KO_1_SRR10045016/quant.sf
# The merged count matrix across all samples is here:
ls results/star_salmon/salmon/
# salmon.merged.gene_counts.tsv -- raw counts per gene (for DESeq2)
# salmon.merged.gene_tpm.tsv -- TPM per gene (for visualization)
head results/star_salmon/salmon/salmon.merged.gene_counts.tsv | cut -f1-4
For downstream DEA with DESeq2 or edgeR, use salmon.merged.gene_counts.tsv (raw integer counts). Do not use TPM or RPKM values as input to count-based statistical models — they have already been normalized in ways that conflict with DESeq2's internal normalization.
You can also import the per-sample quant.sf files directly into R using tximeta or tximport, which is the recommended approach for propagating uncertainty from transcript-level quantification to the gene level.
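To make the quant.sf format concrete, the sketch below fabricates a tiny tab-separated file (the transcript IDs and numbers are invented; real files come from the pipeline run) and checks one useful sanity property: the TPM column of each sample sums to roughly one million.

```shell
# Fabricate a minimal quant.sf-style file (tab-separated, header + 2 rows).
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n'        >  quant.sf
printf 'ENST00000000001\t1500\t1350.0\t600000.0\t900.0\n'      >> quant.sf
printf 'ENST00000000002\t900\t750.0\t400000.0\t400.0\n'        >> quant.sf

# Column 4 of quant.sf is the TPM estimate; TPMs sum to ~1e6 per sample.
awk -F'\t' 'NR > 1 { total += $4 } END { printf "total TPM = %.0f\n", total }' quant.sf
# prints: total TPM = 1000000
```

The same one-liner run on a real quant.sf is a quick check that a sample's quantification completed sanely.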
Open the MultiQC report
# The aggregated QC report is here:
ls -lh results/multiqc/
# To view it on your local machine, copy it using scp:
# (run this command from your LOCAL terminal, not Ibex)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/nfcore_rnaseq/results/multiqc/multiqc_report.html ~/Desktop/
# Or use a web browser on Ibex via X11 forwarding (if available):
# firefox results/multiqc/multiqc_report.html
Key sections of the MultiQC report
Open the MultiQC HTML report and review these sections:
| Section | What to look for |
|---|---|
| General Statistics | Overview table: total reads, % aligned, % duplicates, % GC content per sample |
| FastQC (Raw) | Per-base quality scores, adapter content before trimming |
| fastp | % reads passing filters, adapter trimming rates, insert size distribution |
| STAR | Uniquely mapped reads %, multi-mapper %, unmapped reads % — aim for >70% unique mapping |
| Salmon | Mapping rate from Salmon pseudo-alignment — should agree with STAR |
| dupRadar | Gene-level duplication: low-expression genes with high duplication may indicate over-sequencing or contamination |
| RSeQC | Read distribution across genomic features (exons, introns, intergenic) |
| Picard | Insert size distribution — should be unimodal for typical poly-A RNA-seq |
Check the pipeline execution report
# Nextflow generates an HTML execution report with resource usage per process
ls results/pipeline_info/
# Open execution_report_*.html for a visual breakdown of:
# - Wall time per process
# - CPU and memory utilization
# - Total number of tasks executed
# - Process-level success/failure status
# Copy it locally to open in a browser
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/nfcore_rnaseq/results/pipeline_info/execution_report*.html ~/Desktop/
Exercises
Exercise 1: Mapping Rate
Open the MultiQC report and find the General Statistics table. What is the overall STAR unique mapping rate for each sample? Are there any samples with noticeably lower mapping rates? What could cause a low mapping rate in an RNA-seq experiment?
Exercise 2: Pipeline Process Count
How many individual processes (tasks) did nf-core/rnaseq execute in total? Open the pipeline execution report (results/pipeline_info/execution_report_*.html) and find the total task count. Note how many of those were cached if you used -resume.
Exercise 3: Locate Salmon quant.sf Files
Where are the per-sample Salmon quant.sf output files located in the results directory? Write the full path pattern. How many columns does a quant.sf file have, and what does each column represent?
Hint: Use ls results/star_salmon/KO_*/ and head to inspect one file.
Exercise 4: Strandedness in the Samplesheet
Your collaborator informs you that the GSE136366 libraries were prepared with the Illumina TruSeq Stranded mRNA kit, which produces reverse-stranded libraries. What value would you set for the strandedness column in your samplesheet? How would you update the samplesheet CSV file to reflect this?
Hint: Use sed or a text editor to replace auto with the correct strandedness value in all rows.
Summary
In this lab, you have:
- Loaded Nextflow with module load nextflow and set up nf-core tooling on Ibex
- Configured a Singularity cache directory in scratch space
- Created a valid CSV samplesheet for all GSE136366 samples
- Used -profile kaust to activate the KAUST institutional config (SLURM, Singularity, shared genomes)
- Submitted nf-core/rnaseq as a SLURM batch job using sbatch with explicit --fasta and --gtf chr11 paths
- Monitored pipeline progress with tail -f and squeue
- Navigated the output directory and located STAR BAMs, Salmon quant files, and the MultiQC report
- Understood how to diagnose and recover from pipeline failures using -resume and work directory inspection
The merged gene count matrix at results/star_salmon/salmon/salmon.merged.gene_counts.tsv is the primary input for the next lab.
The Salmon count matrix produced by this pipeline will be used directly in the differential expression analysis lab. You do not need to run any additional quantification steps — nf-core/rnaseq has already done it all.
Previous: Lab 5: Quantification | Next: Lab 7: Differential Expression Analysis