Lab 1: Genomic File Formats

Linux Command Reference

Not sure about a command? Open the Linux Command Line Cheatsheet on the Setup page — it covers navigation, file operations, searching, pipes, and more. Keep it open in another tab throughout the lab.

Learning Objectives

Part 1: Log in to Ibex

Before we begin, make sure you have completed the Environment Setup. All work in this lab is performed on the KAUST Ibex HPC cluster.

Connect to Ibex via SSH

# Open your terminal and connect to Ibex
# Replace "username" with your actual Ibex username
# Enter your password when prompted
ssh username@ilogin.ibex.kaust.edu.sa

# Start an interactive job (4 CPUs, 32 GB of memory, 4 hours)
srun --cpus-per-task=4 --mem=32G --time=04:00:00 --pty /bin/bash

# Confirm your location
pwd

# List with human-readable file sizes
ls -lh

Create Your Workshop Workspace

# Create a directory for today's lab
mkdir -p ~/workshop/day1
cd ~/workshop/day1

# Confirm your location
pwd

# List with human-readable file sizes
ls -lh

Part 2: Sample Data

Download data from the web and extract (wget and unzip)

# wget downloads content from a given URL
# Read the help page for wget
wget -h

# Download the sample data using wget
wget https://raw.githubusercontent.com/bioinfo-kaust/rnaseq-course/main/materials/genome_data.zip

# Check that there is a new file named genome_data.zip after the download
ls -lh

# This file is compressed; uncompress it with the unzip command
unzip genome_data.zip

Check the contents of the sample data (ls and wc)

# View what we have
ls -lh

# Check the genome_data folder
ls genome_data

# How many items does it contain?
ls genome_data | wc -l

Explore the Data Directory Structure

# See sizes of all files, including those in subdirectories
ls -lhR genome_data

# List the files in each subdirectory (append "| wc -l" to count them)
ls genome_data/reference/
ls genome_data/fastq/
ls genome_data/bam/

What is in the sample data?

The sample data contains small example files for each genomic format we will explore today: FASTA reference sequences, FASTQ read files (including intentionally broken ones for debugging practice), and a BAM alignment file.

Part 3: Exploring FASTA Files

FASTA is the standard format for storing nucleotide or protein sequences. Each record has a header line starting with > followed by the sequence on subsequent lines.

View the Raw FASTA Format

# Look at the raw file content
head -20 genome_data/reference/mini_ref.fasta

FASTA Format Structure

>sequence_name optional description
ATGCATGCATGCATGCATGCATGC
GCTAGCTAGCTAGCTAGCTAGCTA
...
  • Line 1: Header starting with >, followed by sequence ID and optional description
  • Lines 2+: The sequence itself, typically wrapped at 60-80 characters per line
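
Because a sequence may be wrapped across several lines, the length of a record is the sum of all lines between one header and the next. The following is a minimal sketch of that idea using awk; the file toy.fasta, its contents, and the function name fasta_lengths are made up for illustration and are not part of the sample data.

```shell
# Create a tiny FASTA file for illustration (hypothetical content)
workdir=$(mktemp -d)
cat > "$workdir/toy.fasta" <<'EOF'
>seq1 first toy record
ATGCATGC
GCTA
>seq2 second toy record
TTGCA
EOF

# Print "ID <tab> length" for every record, joining wrapped sequence lines
fasta_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # flush the previous record
      id = substr($1, 2)                # ID = first word after ">"
      len = 0
      next
    }
    { len += length($0) }               # add up wrapped sequence lines
    END { if (id != "") print id "\t" len }
  ' "$1"
}

fasta_lengths "$workdir/toy.fasta"
```

For the toy file this prints seq1 with length 12 and seq2 with length 5. On real data, seqkit (introduced below) reports the same information much faster.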

Read the file


# Read the file using nano (press Ctrl+X to exit)
nano genome_data/reference/mini_ref.fasta

# Count the number of lines
wc -l genome_data/reference/mini_ref.fasta

# How would you count the number of records in this FASTA file?
grep ...

Use seqkit to read genomic files


# Search for a specific tool
module avail seqkit

# Load the tool for use in this session
module load seqkit

# Verify that the tool is loaded
module list

# Test the tool
seqkit version

# Get basic statistics
seqkit stats genome_data/reference/mini_ref.fasta

# Detailed statistics (includes N50, GC content, etc.)
seqkit stats -a genome_data/reference/mini_ref.fasta

Extract Sequence Headers

# View headers only (sequence names)
seqkit seq -n genome_data/reference/mini_ref.fasta

Explore Compressed Genome Files - Let's get a bigger file (human chr11 DNA)

# Download human chromosome 11 sequence
echo "Downloading chromosome 11 genome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

# Check what was downloaded
ls -lh

Real genome files are typically compressed with gzip. Bioinformatics tools like seqkit can read compressed files directly.

# Get statistics for compressed genome files
seqkit stats Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

# You can also look at them without decompressing
zcat Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz | head -5

Tip: Compressed File Handling

Most bioinformatics tools (seqkit, samtools) can read .gz files directly. Use zcat or zless to view compressed files without decompressing them.
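
The same streaming idea works with ordinary text tools: decompress to standard output and pipe the result onward, so no uncompressed copy is written to disk. A small sketch on a made-up file follows (toy.fa.gz and the function name count_bases_gz are hypothetical). It uses gzip -dc, which on Linux behaves like zcat.

```shell
# Build a tiny gzip-compressed FASTA file for illustration (hypothetical content)
workdir=$(mktemp -d)
printf '>toy\nATGCATGC\nGCTA\n' > "$workdir/toy.fa"
gzip -c "$workdir/toy.fa" > "$workdir/toy.fa.gz"

# Peek at the first line without creating an uncompressed file
gzip -dc "$workdir/toy.fa.gz" | head -1

# Count sequence characters: drop header lines, drop newlines, count what is left
count_bases_gz() {
  gzip -dc "$1" | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' '
}

count_bases_gz "$workdir/toy.fa.gz"
```

Note that this counts every sequence character, including N (unknown bases), which real chromosome files contain in large numbers.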

File Naming Convention

Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz

  • Homo_sapiens - Species name
  • GRCh38 - Genome assembly version (Genome Reference Consortium Human Build 38)
  • dna - DNA sequence type
  • chromosome.11 - Specific chromosome
  • .fa.gz - FASTA format, gzip compressed
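
Since the fields are separated by dots, the name can be split into its parts directly in the shell. This is only a sketch: the variable names are made up, and file names with a different number of dot-separated fields would need different handling.

```shell
fname="Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz"

# Split the name on "." into named fields (variable names are illustrative)
IFS=. read -r species assembly seqtype region_type region_name format compression <<EOF
$fname
EOF

echo "species:     $species"
echo "assembly:    $assembly"
echo "sequence:    $seqtype"
echo "region:      $region_type $region_name"
echo "file format: $format ($compression compressed)"
```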

Exercise 1: Explore the DNA file

Use command-line tools to answer:

  1. How many nucleotides are on chromosome 11?
  2. What is the size of the chr11 FASTA file?

Part 4: Exploring FASTQ Files

FASTQ is the standard format for storing sequencing reads with their quality scores. It is the primary output format of Illumina sequencing instruments.

View the Raw FASTQ Format

# View the first 3 reads (each read = 4 lines)
head -12 genome_data/fastq/sample_R1.fastq

FASTQ Format: 4 Lines Per Read

Line  Content                                      Example
1     Header (starts with @)                       @SEQ_ID instrument:run:flowcell:lane:tile:x:y
2     DNA sequence                                 ATCGATCGATCG...
3     Separator (always +)                         +
4     Quality scores (ASCII-encoded Phred scores)  IIIIIIIHGFED...

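Each ASCII character on line 4 encodes the quality score of the base at the same position on line 2. In the Phred+33 encoding used by current Illumina instruments, the score is the character's ASCII code minus 33, so characters with higher codes (like I, Phred 40) mean higher confidence in the base call. A Phred score of 30 (?) corresponds to 99.9% base-call accuracy, that is, a 1 in 1,000 chance of error.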
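
To make the encoding concrete, the sketch below decodes a quality string into numeric scores and converts a score into an accuracy percentage using P(error) = 10^(-Q/10). It assumes the Phred+33 offset (the one under which ? equals 30); the function names phred_scores and phred_accuracy are made up for this example.

```shell
# Decode a Phred+33 quality string into space-separated numeric scores
phred_scores() {
  printf '%s' "$1" | od -An -tu1 |
    awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
         END { print "" }'
}

# Convert a Phred score Q into base-call accuracy in percent: 100 * (1 - 10^(-Q/10))
phred_accuracy() {
  awk -v q="$1" 'BEGIN { printf "%.1f\n", 100 * (1 - 10 ^ (-q / 10)) }'
}

phred_scores 'II?5'   # prints: 40 40 30 20
phred_accuracy 30     # prints: 99.9
phred_accuracy 20     # prints: 99.0
```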

Get FASTQ Statistics

# Basic statistics for each read file
seqkit stats genome_data/fastq/sample_R1.fastq
seqkit stats genome_data/fastq/sample_R2.fastq

# Detailed statistics (includes Q20/Q30 percentages)
seqkit stats -a genome_data/fastq/sample_R1.fastq
seqkit stats -a genome_data/fastq/sample_R2.fastq

Understanding Paired-End Reads

In paired-end sequencing, each DNA fragment is read from both ends (R1 = forward, R2 = reverse). Read pairs share the same ID.

# Compare read IDs between R1 and R2
# They should match, indicating paired reads
seqkit seq -n genome_data/fastq/sample_R1.fastq | head -5
seqkit seq -n genome_data/fastq/sample_R2.fastq | head -5

Paired-End Sequencing

R1 and R2 files must always be kept in sync -- the first read in R1 is paired with the first read in R2, and so on. If the files get out of order, downstream alignment will either fail or, worse, silently pair the wrong reads.
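
One way to check synchronisation is to pull the read ID from the header of each record in both files and compare them position by position. The sketch below does this on two made-up files; the file names, the function names, and the assumption that IDs differ only by a trailing /1 or /2 (or by text after the first space) are illustrative.

```shell
# Two tiny paired files for illustration (hypothetical content)
workdir=$(mktemp -d)
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGC\n+\nIIII\n' > "$workdir/toy_R1.fastq"
printf '@read1/2\nTGCA\n+\nIIII\n@read2/2\nGCAA\n+\nIIII\n' > "$workdir/toy_R2.fastq"

# Extract read IDs: first word of every header line, without "@" and without /1 or /2
read_ids() {
  awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Count positions at which the IDs of the two files disagree
count_mismatched_pairs() {
  read_ids "$1" > "$workdir/ids_R1.txt"
  read_ids "$2" > "$workdir/ids_R2.txt"
  paste "$workdir/ids_R1.txt" "$workdir/ids_R2.txt" |
    awk -F'\t' '$1 != $2 { n++ } END { print n + 0 }'
}

count_mismatched_pairs "$workdir/toy_R1.fastq" "$workdir/toy_R2.fastq"
```

A result of 0 means every read in R1 lines up with its mate in R2. A file with extra or missing reads also shows up as mismatches, because paste leaves the shorter column empty.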

Debugging: Broken and Truncated FASTQ Files

Real-world data can sometimes be corrupted. The sample data includes intentionally broken files for practice.

# Try running stats on the broken file -- what happens?
seqkit stats genome_data/fastq/broken.fastq

# Try the truncated file
seqkit stats genome_data/fastq/truncated.fastq

# Inspect the broken file manually to find the issue
head -20 genome_data/fastq/broken.fastq

# Inspect the truncated file
tail -10 genome_data/fastq/truncated.fastq

Tip: Validating FASTQ Files

Always run a quick validation on your FASTQ files before starting analysis. Corrupted files can cause silent errors in downstream tools. Use seqkit stats as a fast sanity check -- if it reports an error, investigate before proceeding.
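
The kind of structural problem such a check catches can be illustrated with a few lines of awk. This is a simplified sketch, not how seqkit validates files: it assumes four-line, unwrapped records, and the function name check_fastq and the toy files are hypothetical.

```shell
# One intact and one truncated toy file (hypothetical content)
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGC\n+\nIIII\n' > "$workdir/good.fastq"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGC\n+\n' > "$workdir/truncated.fastq"

# Report the first structural problem found; exit status is non-zero on failure
check_fastq() {
  awk '
    NR % 4 == 1 && !/^@/  { print "line " NR ": header does not start with @"; bad = 1; exit }
    NR % 4 == 2           { seqlen = length($0) }
    NR % 4 == 3 && !/^\+/ { print "line " NR ": separator does not start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen {
      print "line " NR ": quality length differs from sequence length"; bad = 1; exit
    }
    END {
      if (!bad && NR % 4 != 0) { print "file ends mid-record (" NR " lines)"; bad = 1 }
      if (!bad) print "OK: " NR / 4 " reads"
      exit bad
    }
  ' "$1"
}

check_fastq "$workdir/good.fastq"
check_fastq "$workdir/truncated.fastq" || echo "-> investigate before proceeding"
```

The first call reports two reads; the second reports that the file ends in the middle of a record.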

Batch Statistics on Multiple FASTQ Files

# Get statistics for all .fastq files at once
seqkit stats genome_data/fastq/*.fastq

# You can also use wildcards for subsets
seqkit stats genome_data/fastq/sample_*.fastq
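
The wildcard is expanded by the shell before the command runs, so the command simply receives a list of matching file names. The sketch below shows this with a small loop that counts reads per file as lines divided by four; the toy files and the function name reads_per_file are made up, and the count is only valid for four-line records.

```shell
# Three toy files for illustration (hypothetical content)
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGC\n+\nIIII\n' > "$workdir/sample_R1.fastq"
printf '@r1\nTGCA\n+\nIIII\n' > "$workdir/sample_R2.fastq"
printf '@x1\nAC\n+\nII\n' > "$workdir/other.fastq"

# Print "file name <tab> number of reads" for every file given as an argument
reads_per_file() {
  for f in "$@"; do
    printf '%s\t%s\n' "$(basename "$f")" "$(( $(wc -l < "$f") / 4 ))"
  done
}

# The shell expands each pattern into the matching file names
reads_per_file "$workdir"/*.fastq          # all three files
reads_per_file "$workdir"/sample_*.fastq   # only the two sample_ files
```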

Exercises

Try these exercises on your own using the commands and tools you have learned. Refer to the sections above if you get stuck.

Exercise 1: Count Sequences

How many sequences are in genome_data/reference/mini_ref.fasta?

Hint: Use seqkit stats or grep -c "^>"

Exercise 2: Average Read Length

What is the average read length in genome_data/fastq/sample_R1.fastq?

Hint: Use seqkit stats -a and look at the "avg_len" column.

Exercise 3: Pattern Search

Find all reads in genome_data/fastq/sample_R1.fastq that contain the pattern "TTGC". How many are there?

Hint: Use seqkit grep -s -r -p "TTGC" and pipe to seqkit stats

Summary

In this lab, you have:

Next: Lab 2: Public Data Retrieval