Lab 6: Running nf-core/rnaseq on Ibex


In this lab you will run the nf-core/rnaseq pipeline on the KAUST Ibex HPC cluster using the chr11-filtered GSE136366 reads. The pipeline automates the full preprocessing and quantification workflow — from raw reads to expression matrices — in a reproducible and scalable way.

Using chr11-filtered data in this lab

Like Labs 3–5, this lab uses the chromosome 11-filtered GSE136366 reads (~3–4 M read pairs per sample). nf-core/rnaseq manages its own SLURM job scheduling, so we pass the chr11 FASTA and GTF directly instead of the full-genome KAUST profile registry entry. All pipeline steps are identical to a full-scale run — only the reference scope differs. The full-scale run would simply replace the --fasta/--gtf flags with --genome GRCh38.p14.

Learning Objectives

By the end of this lab, you will be able to:

  • Set up Nextflow and Singularity on Ibex and prepare a pipeline working directory
  • Build an nf-core samplesheet describing paired-end FASTQ files
  • Launch nf-core/rnaseq with the KAUST institutional profile
  • Monitor Nextflow progress and the SLURM jobs it submits
  • Locate and interpret the pipeline's QC reports and count matrices

What does nf-core/rnaseq do under the hood?

nf-core/rnaseq is a community-maintained Nextflow workflow that runs the following steps automatically, in order, for every sample:

  1. FastQC — raw read quality assessment before trimming
  2. fastp — adapter trimming and quality filtering (or Trim Galore)
  3. FastQC — post-trimming quality assessment
  4. STAR — splice-aware alignment to the reference genome (produces BAM files)
  5. Salmon — quantification of transcript-level expression from alignments (star_salmon mode) or directly from reads (pseudo-alignment mode)
  6. Picard / SAMtools — post-alignment processing: duplicate marking, sorting, indexing
  7. RSeQC, QualiMap, dupRadar, PreSeq — alignment-level quality metrics
  8. MultiQC — aggregates all QC reports into a single interactive HTML report

All processes run in containers (Singularity on Ibex), so no manual software installation is needed beyond Nextflow itself.

Part 1: Setup on Ibex

Connect to Ibex, start an interactive session, and prepare the environment for running Nextflow.

Connect to Ibex and start an interactive session

# Connect to Ibex login node
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive session for setup tasks
# (never run heavy jobs directly on the login node)
srun --pty --time=2:00:00 --mem=8G --cpus-per-task=4 bash

Load Nextflow and Singularity

# Clear any previously loaded modules, then load Nextflow and Singularity
module purge
module load nextflow
module load singularity

# Confirm the versions
nextflow -version
singularity --version
Tip: Nextflow version

nf-core/rnaseq requires Nextflow ≥ 23.04. Both nextflow and singularity must be loaded — the KAUST profile uses Singularity containers for all pipeline tools.
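If you script your setup, a small guard can catch an outdated Nextflow module early. This is a generic sketch using GNU `sort -V`; the `version_ge` helper is not part of Nextflow, and the 23.04 threshold comes from the requirement above.

```shell
# version_ge VER MIN -- succeeds when VER >= MIN (relies on GNU `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On Ibex, after `module load nextflow`, something like:
#   nf_ver=$(nextflow -version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
#   version_ge "$nf_ver" "23.04.0" || echo "Nextflow too old for nf-core/rnaseq"

# Demonstration with fixed strings:
version_ge "23.10.1" "23.04.0" && echo "23.10.1 is new enough"
version_ge "22.10.6" "23.04.0" || echo "22.10.6 is too old"
```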

Create the pipeline working directory

# Create a dedicated directory for this pipeline run
mkdir -p /ibex/user/$USER/workshop/nfcore_rnaseq
cd /ibex/user/$USER/workshop/nfcore_rnaseq

# Confirm your location
pwd

Part 2: Create the Samplesheet

nf-core/rnaseq takes a CSV samplesheet as its primary input. Each row describes one sample and its FASTQ files. The samplesheet must have exactly these four columns:

  • sample — a unique sample name (no spaces)
  • fastq_1 — absolute path to the R1 (forward) FASTQ file
  • fastq_2 — absolute path to the R2 (reverse) FASTQ file; leave empty for single-end data
  • strandedness — library strandedness: auto, forward, reverse, or unstranded
Strandedness options explained
  • auto — nf-core/rnaseq will detect strandedness automatically using Salmon. Recommended when unsure.
  • forward — Use for stranded libraries where read 1 is in the same orientation as the transcript (e.g., Illumina TruSeq Stranded mRNA, dUTP method, ligation-based kits).
  • reverse — Use for stranded libraries where read 1 is reverse-complementary to the transcript.
  • unstranded — Use for non-stranded protocols (e.g., standard poly-A capture without strand preservation).

Example samplesheet structure

Below is an example showing two samples from GSE136366. Replace YOUR_USERNAME with your Ibex username:

sample,fastq_1,fastq_2,strandedness
KO_1_SRR10045016,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_1_SRR10045016_1.fastq.gz,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_1_SRR10045016_2.fastq.gz,auto
KO_2_SRR10045017,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_2_SRR10045017_1.fastq.gz,/ibex/user/YOUR_USERNAME/workshop/chr11_raw_data/KO_2_SRR10045017_2.fastq.gz,auto

Generate the samplesheet automatically with a bash script

Rather than typing each row manually, use this script to generate the samplesheet for all six GSE136366 chr11-filtered samples:

cd /ibex/user/$USER/workshop/nfcore_rnaseq

RAWDIR="/ibex/user/$USER/workshop/chr11_raw_data"
SAMPLES="KO_1_SRR10045016 KO_2_SRR10045017 KO_3_SRR10045018 WT_1_SRR10045019 WT_2_SRR10045020 WT_3_SRR10045021"

# Write the CSV header
echo "sample,fastq_1,fastq_2,strandedness" > samplesheet.csv

# Add one row per sample
for sample in $SAMPLES; do
    echo "${sample},${RAWDIR}/${sample}_1.fastq.gz,${RAWDIR}/${sample}_2.fastq.gz,auto"
done >> samplesheet.csv

# Review the generated samplesheet
cat samplesheet.csv
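Before launching the pipeline, it is worth validating the samplesheet: a bad path only surfaces much later as a "Cannot stage input file" error. The sketch below builds a throwaway samplesheet in a temp directory so the checks are demonstrable anywhere; on Ibex, point the two checks at your real samplesheet.csv instead.

```shell
# Demo setup (illustration only): one sample with files present, one without.
demo=$(mktemp -d)
touch "$demo/S1_1.fastq.gz" "$demo/S1_2.fastq.gz"
cat > "$demo/samplesheet.csv" <<EOF
sample,fastq_1,fastq_2,strandedness
S1,$demo/S1_1.fastq.gz,$demo/S1_2.fastq.gz,auto
S2,$demo/S2_1.fastq.gz,$demo/S2_2.fastq.gz,auto
EOF

# Check 1: every row has exactly 4 comma-separated fields.
awk -F',' 'NF != 4 {print "Bad row " NR ": " $0}' "$demo/samplesheet.csv"

# Check 2: every listed FASTQ file exists (flags both S2 files here).
tail -n +2 "$demo/samplesheet.csv" |
while IFS=',' read -r sample fq1 fq2 strandedness; do
    for f in "$fq1" "$fq2"; do
        [ -e "$f" ] || echo "MISSING ($sample): $f"
    done
done

rm -rf "$demo"
```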

Part 3: The KAUST Profile

Instead of writing a manual nextflow.config, we use the pre-built KAUST institutional profile by passing -profile kaust to Nextflow. This single flag activates a configuration file maintained by the KAUST Bioinformatics Platform that handles all cluster-specific settings automatically.

What -profile kaust configures automatically

  • Job scheduler (SLURM) — each pipeline process is submitted as a SLURM job to the batch partition on Ibex
  • Container runtime (Singularity) — all tools run inside Singularity containers; no manual module load is needed per process
  • Container library (central shared library) — pre-downloaded images are shared across all users; new images go to your personal cache
  • Resource limits (Ibex-tuned defaults) — per-process CPU, memory, and time limits sized for Ibex compute nodes
  • Reference genomes (--fasta / --gtf) — pass explicit chr11 FASTA and GTF paths; nf-core/rnaseq builds the STAR and Salmon indexes automatically
Using the KAUST genome registry

The KAUST profile includes a genome registry with reference files for common organisms already available on Ibex. For human data, you can pass --genome GRCh38.p14 — this resolves the reference genome FASTA, GTF annotation, and pre-built STAR/Salmon indexes from shared paths on Ibex without needing to download or specify reference files manually.

For this chr11 lab, we pass explicit --fasta and --gtf paths pointing to the chromosome 11-only reference files. nf-core/rnaseq will build the STAR and Salmon indexes automatically from these files. The full-genome equivalent would simply replace those two flags with --genome GRCh38.p14.

The KAUST profile also includes a dedicated rnaseq sub-profile with tuned defaults for the nf-core/rnaseq pipeline specifically.
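For orientation only, here is a rough sketch of the kind of settings such a profile applies. The real kaust profile is maintained centrally and is considerably more detailed — never copy this sketch over it:

```groovy
// Illustrative sketch only -- NOT the actual kaust profile.
process {
    executor = 'slurm'      // submit each task as a SLURM job
    queue    = 'batch'      // Ibex batch partition
}
singularity {
    enabled    = true       // run every tool in a Singularity container
    autoMounts = true       // bind host filesystems into containers
}
```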

Part 4: Run nf-core/rnaseq

With the samplesheet and configuration in place, you are ready to launch the pipeline.

Full pipeline run command

cd /ibex/user/$USER/workshop/nfcore_rnaseq

nextflow run nf-core/rnaseq -r 3.23.0 \
    --input samplesheet.csv \
    --outdir results \
    --fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
    --gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
    --aligner star_salmon \
    --pseudo_aligner salmon \
    --trimmer fastp \
    --genome_size 135086622 \
    --star_index_bases 11 \
    -profile kaust \
    -resume
Key parameter reference

  • --input samplesheet.csv — path to the CSV samplesheet you created
  • --outdir results — directory where all output files will be written
  • --fasta <chr11 FASTA path> — path to the chromosome 11 reference genome FASTA file
  • --gtf <chr11 GTF path> — path to the chromosome 11 GTF annotation file; used for STAR index building and gene-level quantification
  • --genome_size 135086622 — effective genome size for chromosome 11 (~135 Mb); used by some QC tools
  • --star_index_bases 11 — STAR genomeSAindexNbases parameter, reduced for the smaller chr11 genome
  • --aligner star_salmon — align with STAR, then quantify with Salmon in alignment-based mode
  • --pseudo_aligner salmon — also run Salmon in pseudo-alignment mode directly from the reads
  • --trimmer fastp — use fastp for adapter trimming (alternative: trimgalore)
  • -profile kaust — activates the KAUST institutional config: SLURM scheduler, Singularity containers, shared image library, and Ibex resource defaults
  • -resume — resume a previous run from where it left off (uses the Nextflow cache)
The -resume flag and Nextflow caching

Nextflow caches the output of every successfully completed process in a hidden work/ directory. If the pipeline fails or you add new samples, running with -resume skips all already-completed steps and only re-runs what is needed. This can save hours of compute time. Cache entries are invalidated automatically when inputs or parameters change.
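The invalidation rule can be illustrated without Nextflow: task cache keys are derived from hashes of the task script, parameters, and input files, so changing any input changes its hash and forces a re-run. A toy sketch of just the input-hash part (md5sum stands in for Nextflow's internal hashing):

```shell
# Toy illustration: an edited input file gets a new content hash, which is
# (part of) why Nextflow re-runs tasks whose inputs changed under -resume.
f=$(mktemp)
echo "ACGT"  > "$f"; h1=$(md5sum "$f" | cut -d' ' -f1)
echo "ACGTT" > "$f"; h2=$(md5sum "$f" | cut -d' ' -f1)
[ "$h1" != "$h2" ] && echo "input changed -> affected tasks re-run"
rm -f "$f"
```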

Submit the pipeline as a SLURM batch job (recommended)

For runs that will take several hours, submit the Nextflow head process itself as a SLURM job. Nextflow will then submit all sub-jobs to SLURM automatically from within that job. Copy the following into a new file named run_rnaseq.sh:

#!/bin/bash
#SBATCH --job-name=nfcore_rnaseq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=24:00:00
#SBATCH --partition=batch
#SBATCH --output=nextflow_%j.log
#SBATCH --error=nextflow_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@kaust.edu.sa

cd /ibex/user/$USER/workshop/nfcore_rnaseq

module purge
module load nextflow
module load singularity

export NXF_SINGULARITY_CACHEDIR=/ibex/user/$USER/singularity_cache

nextflow run nf-core/rnaseq -r 3.23.0 \
    --input samplesheet.csv \
    --outdir results \
    --fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
    --gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
    --aligner star_salmon \
    --pseudo_aligner salmon \
    --trimmer fastp \
    --genome_size 135086622 \
    --star_index_bases 11 \
    -profile kaust \
    -resume

# Submit the script
sbatch run_rnaseq.sh

# Check job status
squeue -u $USER
Tip: Head job resource requirements

The Nextflow head job itself only needs 2 CPUs and 8 GB of memory — it just manages job submission and monitors progress. All actual computation happens in the sub-jobs Nextflow submits to SLURM. Keep the head job running for the full expected runtime (24 hours is safe).

Part 5: Monitor Pipeline Progress

Watch the Nextflow log in real time

# If running interactively, Nextflow writes to stdout and .nextflow.log
# If submitted via sbatch, tail the log file:
tail -f nextflow_*.log

# Also watch the Nextflow-specific log
tail -f .nextflow.log

Monitor SLURM jobs submitted by Nextflow

# List all your running and pending jobs
squeue -u $USER

# More detail (job name, state, time used, node)
squeue -u $USER -o "%.18i %.30j %.8T %.10M %.6D %R"

# Watch the queue refreshing every 5 seconds
watch -n 5 squeue -u $USER
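With dozens of pipeline jobs in flight, a per-state count is often more readable than the full queue listing. The sketch below runs on canned text so it works anywhere; on Ibex, replace the printf with the squeue command shown in the comment.

```shell
# Count jobs per SLURM state. Demo input is hard-coded; on Ibex use:
#   squeue -u $USER -h -o "%T" | sort | uniq -c | sort -rn
printf 'RUNNING\nRUNNING\nRUNNING\nPENDING\nPENDING\n' |
    sort | uniq -c | sort -rn
```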

Understanding the Nextflow progress output

A typical Nextflow run looks like this in the log:

executor >  slurm (42)
[5e/3f2a1b] NFCORE_RNASEQ:RNASEQ:FASTQC_UMITOOLS_TRIMGALORE:FASTQC (KO_1_SRR10045016) [100%] 6 of 6 ✔
[a1/88cd2f] NFCORE_RNASEQ:RNASEQ:FASTP (KO_1_SRR10045016)                              [100%] 6 of 6 ✔
[b3/12ef45] NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN (KO_1_SRR10045016)              [ 83%] 5 of 6
[--/------] NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT               -
...
Reading Nextflow progress lines
  • The hex code (e.g., 5e/3f2a1b) is the work directory hash for that process instance.
  • 100% 6 of 6 ✔ — all 6 samples completed successfully for this step.
  • 83% 5 of 6 — 5 samples done, 1 still running.
  • - — step has not started yet (waiting for upstream steps to finish).
  • A red ✗ indicates a failed process.

What to do if a step fails

# Step 1: Find the work directory of the failed process from the log
# The hash is shown in the progress line, e.g. [b3/12ef45]
ls work/b3/12ef45*/

# Step 2: Read the error file
cat work/b3/12ef45*/.command.err

# Step 3: Read the full command that was run
cat work/b3/12ef45*/.command.sh

# Step 4: Read stdout output
cat work/b3/12ef45*/.command.out

# After fixing the issue, re-run with -resume to continue from where it stopped.
# Use exactly the same parameters as the original run -- changing parameters
# invalidates the cache for the affected steps.
nextflow run nf-core/rnaseq -r 3.23.0 \
    --input samplesheet.csv \
    --outdir results \
    --fasta /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.dna.chromosome.11.fa \
    --gtf /biocorelab/BIX/resources/genomes/workshops/human-chr11/GRCh38.chr11.gtf \
    --aligner star_salmon \
    --pseudo_aligner salmon \
    --trimmer fastp \
    --genome_size 135086622 \
    --star_index_bases 11 \
    -profile kaust \
    -resume
Common Nextflow errors on HPC clusters
  • Exit status 137 (OOM) — The process ran out of memory. The KAUST profile sets process-level limits automatically, but you can override them by creating a local nextflow.config with a withName block and passing -c nextflow.config alongside -profile kaust.
  • Exit status 140 (walltime) — The job exceeded its time limit. Override the time limit in a local config as above.
  • Singularity: Failed to create image — The NXF_SINGULARITY_CACHEDIR is not set or the target filesystem is full. Check available space with df -h /ibex/user/$USER.
  • WARN: Task was cached but the cached output is missing — Work directory was cleaned or moved. Remove the -resume flag to re-run from scratch, or restore the deleted work directory.
  • Cannot stage input file — A FASTQ file listed in the samplesheet does not exist. Double-check all paths with ls.
  • Process killed (SIGKILL) — Usually OOM or disk quota exceeded. Check df -h and quota -s.
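The local-override approach mentioned for exit codes 137 and 140 looks roughly like this. The process selector and values below are placeholders — find the failing process's exact name in the Nextflow log first:

```groovy
// nextflow.config -- pass with:  -c nextflow.config  (alongside -profile kaust)
// Selector and values are examples only; match the process name in your log.
process {
    withName: '.*:STAR_ALIGN' {
        memory = '64 GB'   // raise for exit status 137 (OOM)
        time   = '12 h'    // raise for exit status 140 (walltime)
    }
}
```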

Part 6: Inspect Pipeline Outputs

Once the pipeline completes, explore the results directory to understand what was produced.

Output directory structure

# List the top-level output structure
ls -lh results/

# A successful run produces directories like:
# results/
# ├── fastqc/          -- FastQC reports for raw reads
# ├── fastp/           -- Trimming reports and trimmed FASTQs
# ├── star_salmon/     -- STAR alignments + Salmon quantification
# │   ├── KO_1_SRR10045016/ -- Per-sample BAM files and Salmon output
# │   ├── KO_2_SRR10045017/
# │   ├── ...
# │   └── salmon/      -- Merged Salmon quant.sf files and count matrices
# ├── multiqc/         -- Aggregated QC HTML report
# └── pipeline_info/   -- Nextflow execution report and timeline

Explore the STAR alignment output

# List output for one sample
ls -lh results/star_salmon/KO_1_SRR10045016/

# The BAM file contains the genome alignments
samtools flagstat results/star_salmon/KO_1_SRR10045016/KO_1_SRR10045016.Aligned.sortedByCoord.out.bam

# View the alignment log (STAR mapping statistics)
cat results/star_salmon/KO_1_SRR10045016/Log.final.out

Find the Salmon quantification files

# Each sample has a quant.sf file with per-transcript TPM values
ls results/star_salmon/KO_1_SRR10045016/

# View the quant.sf header and first few lines
head results/star_salmon/KO_1_SRR10045016/quant.sf

# The merged count matrix across all samples is here:
ls results/star_salmon/salmon/

# salmon.merged.gene_counts.tsv  -- raw counts per gene (for DESeq2)
# salmon.merged.gene_tpm.tsv     -- TPM per gene (for visualization)
head results/star_salmon/salmon/salmon.merged.gene_counts.tsv | cut -f1-4
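A quick sanity check on the merged matrix is to compute each sample's column sum (its total assigned counts). The awk sketch below is demonstrated on a tiny inline table with the same layout (gene_id, gene_name, then one column per sample); on Ibex, run the same awk on the real file instead of the printf demo input.

```shell
# Per-sample column sums = library sizes (columns 3..NF are samples).
# Demo input via printf; on Ibex run the same awk program on:
#   results/star_salmon/salmon/salmon.merged.gene_counts.tsv
printf 'gene_id\tgene_name\tKO_1\tWT_1\nG1\tGENEA\t10\t20\nG2\tGENEB\t5\t7\n' |
awk -F'\t' '
    NR == 1 { for (i = 3; i <= NF; i++) name[i] = $i; next }
            { for (i = 3; i <= NF; i++) sum[i] += $i }
    END     { for (i = 3; i <= NF; i++) print name[i], sum[i] }
'
# Prints:  KO_1 15  and  WT_1 27
```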
Which file to use for differential expression analysis?

For downstream DEA with DESeq2 or edgeR, use salmon.merged.gene_counts.tsv (raw integer counts). Do not use TPM or RPKM values as input to count-based statistical models — they have already been normalized in ways that conflict with DESeq2's internal normalization.

You can also import the per-sample quant.sf files directly into R using tximeta or tximport, which is the recommended approach for propagating uncertainty from transcript-level quantification to the gene level.

Open the MultiQC report

# The aggregated QC report is here:
ls -lh results/multiqc/

# To view it on your local machine, copy it using scp.
# Run this from your LOCAL terminal, not Ibex, and replace "username" with
# your Ibex username ($USER would expand to your *local* username here):
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/nfcore_rnaseq/results/multiqc/multiqc_report.html ~/Desktop/

# Or use a web browser on Ibex via X11 forwarding (if available):
# firefox results/multiqc/multiqc_report.html

Key sections of the MultiQC report

Open the MultiQC HTML report and review these sections:

  • General Statistics — overview table: total reads, % aligned, % duplicates, % GC content per sample
  • FastQC (Raw) — per-base quality scores and adapter content before trimming
  • fastp — % reads passing filters, adapter trimming rates, insert size distribution
  • STAR — uniquely mapped reads %, multi-mapper %, unmapped reads %; aim for >70% unique mapping
  • Salmon — mapping rate from Salmon pseudo-alignment; should agree with STAR
  • dupRadar — gene-level duplication: low-expression genes with high duplication may indicate over-sequencing or contamination
  • RSeQC — read distribution across genomic features (exons, introns, intergenic)
  • Picard — insert size distribution; should be unimodal for typical poly-A RNA-seq

Check the pipeline execution report

# Nextflow generates an HTML execution report with resource usage per process
ls results/pipeline_info/

# Open execution_report_*.html for a visual breakdown of:
# - Wall time per process
# - CPU and memory utilization
# - Total number of tasks executed
# - Process-level success/failure status

# Copy it locally to open in a browser
# (run from your LOCAL terminal; replace "username" with your Ibex username)
scp username@ilogin.ibex.kaust.edu.sa:/ibex/user/username/workshop/nfcore_rnaseq/results/pipeline_info/execution_report*.html ~/Desktop/

Exercises

Exercise 1: Mapping Rate

Open the MultiQC report and find the General Statistics table. What is the overall STAR unique mapping rate for each sample? Are there any samples with noticeably lower mapping rates? What could cause a low mapping rate in an RNA-seq experiment?

Exercise 2: Pipeline Process Count

How many individual processes (tasks) did nf-core/rnaseq execute in total? Open the pipeline execution report (results/pipeline_info/execution_report_*.html) and find the total task count. Note how many of those were cached if you used -resume.

Exercise 3: Locate Salmon quant.sf Files

Where are the per-sample Salmon quant.sf output files located in the results directory? Write the full path pattern. How many columns does a quant.sf file have, and what does each column represent?

Hint: Use ls results/star_salmon/KO_*/ and head to inspect one file.

Exercise 4: Strandedness in the Samplesheet

Your collaborator informs you that the GSE136366 libraries were prepared with the Illumina TruSeq Stranded mRNA kit, which produces reverse-stranded libraries. What value would you set for the strandedness column in your samplesheet? How would you update the samplesheet CSV file to reflect this?

Hint: Use sed or a text editor to replace auto with the correct strandedness value in all rows.

Summary

In this lab, you have:

  • Set up Nextflow and Singularity on Ibex and created a pipeline working directory
  • Built a CSV samplesheet for the six chr11-filtered GSE136366 samples
  • Launched nf-core/rnaseq with the KAUST institutional profile, interactively and as a SLURM batch job
  • Monitored the Nextflow log and the SLURM jobs it submitted, and learned how to debug failed processes
  • Explored the pipeline outputs: STAR alignments, Salmon quantification files, count matrices, and the MultiQC report

The merged gene count matrix at results/star_salmon/salmon/salmon.merged.gene_counts.tsv is the primary input for the next lab.

Next steps

The Salmon count matrix produced by this pipeline will be used directly in the differential expression analysis lab. You do not need to run any additional quantification steps — nf-core/rnaseq has already done it all.

Previous: Lab 5: Quantification   |   Next: Lab 7: Differential Expression Analysis