Lab 1: Genomic File Formats
Not sure about a command? Open the Linux Command Line Cheatsheet on the Setup page — it covers navigation, file operations, searching, pipes, and more. Keep it open in another tab throughout the lab.
- Explore FASTA and FASTQ files using bioinformatics tools
- Understand the structure and purpose of each genomic file format
- Use seqkit for sequence file inspection
- Debug common issues with corrupted sequencing files
Part 1: Log in to Ibex
Before we begin, make sure you have completed the Environment Setup. All work in this lab is performed on the KAUST Ibex HPC cluster.
Connect to Ibex via SSH
# Open your terminal and connect to Ibex
# Replace "username" with your actual Ibex username
# Enter your password when prompted
ssh username@ilogin.ibex.kaust.edu.sa
# Start an interactive job
srun --cpus-per-task=4 --mem=32G --time=04:00:00 --pty /bin/bash
# Confirm your location
pwd
# List with human-readable file sizes
ls -lh
Create Your Workshop Workspace
# Create a directory for today's lab
mkdir -p ~/workshop/day1
cd ~/workshop/day1
# Confirm your location
pwd
# List with human-readable file sizes
ls -lh
Part 2: Sample Data
Download data from the web and extract (wget and unzip)
# wget downloads content from a given URL
# Read the help page for wget
wget -h
# Download the sample data using wget:
wget https://raw.githubusercontent.com/bioinfo-kaust/rnaseq-course/main/materials/genome_data.zip
# Check that a new file named genome_data.zip exists after the download
ls -lh
# The file is compressed; extract it with the unzip command
unzip genome_data.zip
Check contents of the sample data (use ls) and cd
# View what we have
ls -lh
# Check the genome_data folder:
ls genome_data
# How many entries does it contain?
ls genome_data | wc -l
Explore the Data Directory Structure
# List the contents of each subdirectory (add -lh to also see file sizes)
ls genome_data/reference/
ls genome_data/fastq/
ls genome_data/bam/
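The listings above show file names only. As a rough sketch of how the files in each subdirectory could be counted, the loop below builds a throwaway directory tree and counts regular files with find. The tree and every file name in it are hypothetical placeholders, not the actual contents of genome_data.

```shell
# Build a throwaway directory tree so the sketch runs anywhere.
# All names under "$demo" are hypothetical placeholders.
demo=$(mktemp -d)
mkdir -p "$demo/reference" "$demo/fastq" "$demo/bam"
touch "$demo/reference/ref.fasta"
touch "$demo/fastq/a_R1.fastq" "$demo/fastq/a_R2.fastq"
touch "$demo/bam/a.bam"

# Count the regular files in each subdirectory
for dir in "$demo"/*/; do
    n=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "$(basename "$dir"): $n file(s)"
done

# Keep one count in a variable for later use, then clean up
fastq_count=$(find "$demo/fastq" -type f | wc -l | tr -d ' ')
rm -rf "$demo"
```

To try it on the lab data, replace "$demo" with genome_data and drop the setup and cleanup lines.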
The sample data contains small example files for each genomic format we will explore today: FASTA reference sequences, FASTQ read files (including intentionally broken ones for debugging practice), and a BAM alignment file.
Part 3: Exploring FASTA Files
FASTA is the standard format for storing nucleotide or protein sequences. Each record has a header line starting with > followed by the sequence on subsequent lines.
View the Raw FASTA Format
# Look at the raw file content
head -20 genome_data/reference/mini_ref.fasta
>sequence_name optional description
ATGCATGCATGCATGCATGCATGC
GCTAGCTAGCTAGCTAGCTAGCTA
...
- Line 1: Header starting with >, followed by sequence ID and optional description
- Lines 2+: The sequence itself, typically wrapped at 60-80 characters per line
Read the file
# Read the file using nano
nano genome_data/reference/mini_ref.fasta
#count number of lines
wc -l genome_data/reference/mini_ref.fasta
# How do you calculate the number of records in this FASTA file?
grep ...
Use seqkit to read genomic files
#search for a specific tool:
module avail seqkit
#load a tool to use in this session
module load seqkit
# Verify that the tools are loaded
module list
# Test each tool
seqkit version
# Run the tool to get Basic statistics
seqkit stats genome_data/reference/mini_ref.fasta
# Detailed statistics (includes N50, GC content, etc.)
seqkit stats -a genome_data/reference/mini_ref.fasta
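As a cross-check on the numbers seqkit reports, the same counts can be derived with standard tools, because every FASTA record has exactly one header line. The sketch below uses a made-up two-record file rather than mini_ref.fasta.

```shell
# Write a two-record toy FASTA (made-up sequences, not from the lab data)
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first toy record
ATGCATGCAT
GCTAGC
>seq2 second toy record
TTGCAA
EOF

# One header line per record, so count the lines that start with ">"
n_records=$(grep -c '^>' "$fasta")

# Drop header lines, strip newlines, and count the remaining characters
n_bases=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

echo "records: $n_records, bases: $n_bases"
rm -f "$fasta"
```

For this toy file the record count is 2 and the base count is 22, which should agree with the num_seqs and sum_len columns that seqkit stats would print for the same input.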
Extract Sequence Headers
# View headers only (sequence names)
seqkit seq -n genome_data/reference/mini_ref.fasta
Explore Compressed Genome Files - Let's get a bigger file (human chr11 DNA)
# Download human chromosome 11 sequence
echo "Downloading chromosome 11 genome..."
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
# Check what was downloaded
ls .
Real genome files are typically compressed with gzip. Bioinformatics tools like seqkit can read compressed files directly.
# Get statistics for compressed genome files
seqkit stats Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
# You can also look at them without decompressing
zcat Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz | head -5
Most bioinformatics tools (seqkit, samtools) can read .gz files directly. Use zcat or zless to view compressed files without decompressing them.
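A small sketch of what "without decompressing" means in practice: the compressed file stays on disk, and its contents are streamed through a pipe. The file name toy.fa is a hypothetical placeholder, and gzip -dc is used here because it behaves the same as zcat on Linux.

```shell
# Create a small FASTA and compress it (toy data, hypothetical file name)
workdir=$(mktemp -d)
printf '>chrToy\nACGTACGTAC\nGGCCTTAA\n' > "$workdir/toy.fa"
gzip "$workdir/toy.fa"            # replaces toy.fa with toy.fa.gz

# Stream the decompressed text to stdout; the .gz file on disk is unchanged
first_line=$(gzip -dc "$workdir/toy.fa.gz" | head -n 1)
echo "$first_line"

# Count bases straight from the compressed file
n_bases=$(gzip -dc "$workdir/toy.fa.gz" | grep -v '^>' | tr -d '\n' | wc -c | tr -d ' ')
echo "bases: $n_bases"

rm -rf "$workdir"
```

The same pipeline pattern applies to larger compressed FASTA files, although it will take noticeably longer on a full chromosome.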
Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
- Homo_sapiens - Species name
- GRCh38 - Genome assembly version (Genome Reference Consortium Human Build 38)
- dna - DNA sequence type
- chromosome.11 - Specific chromosome
- .fa.gz - FASTA format, gzip compressed
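Because the fields are separated by dots, they can be pulled apart with cut. This is only a sketch for this particular file name. Other files may use a different number of fields, so the field positions below are an assumption.

```shell
f="Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz"

# Split on "." and pick fields by position (positions assumed for this name only)
species=$(echo "$f" | cut -d. -f1)
assembly=$(echo "$f" | cut -d. -f2)
seq_type=$(echo "$f" | cut -d. -f3)
chrom=$(echo "$f" | cut -d. -f5)

echo "species=$species assembly=$assembly type=$seq_type chromosome=$chrom"
```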
Exercise 1: Explore the DNA file
Use command-line tools to answer:
- How many nucleotides are on chromosome 11?
- What is the size of the chr11 FASTA file?
Part 4: Exploring FASTQ Files
FASTQ is the standard format for storing sequencing reads with their quality scores. It is the primary output format of Illumina sequencing instruments.
View the Raw FASTQ Format
# View the first 3 reads (each read = 4 lines)
head -12 genome_data/fastq/sample_R1.fastq
| Line | Content | Example |
|---|---|---|
| 1 | Header (starts with @) | @SEQ_ID instrument:run:flowcell:lane:tile:x:y |
| 2 | DNA sequence | ATCGATCGATCG... |
| 3 | Separator (always +) | + |
| 4 | Quality scores (ASCII-encoded Phred scores) | IIIIIIIHGFED... |
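Each ASCII character on line 4 encodes one quality score. In the Phred+33 encoding used by current Illumina instruments, the score is the character's ASCII code minus 33, so higher characters (like I, which is Q40) mean higher confidence in the base call. A Phred score of 30 (the character ?) corresponds to 99.9% base-call accuracy.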
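The decoding can be worked through by hand in the shell. The sketch below assumes Phred+33 encoding and uses the standard relation P(error) = 10^(-Q/10). The helper name phred_of is made up for this illustration.

```shell
# Decode one quality character, assuming Phred+33 encoding.
# phred_of is a hypothetical helper name, not part of any tool.
phred_of() {
    # printf with a leading quote yields the character's ASCII code
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

for ch in '?' 'I' '5'; do
    q=$(phred_of "$ch")
    # Error probability is 10^(-Q/10); awk handles the floating-point maths
    p=$(awk -v q="$q" 'BEGIN { printf "%.5f", 10 ^ (-q / 10) }')
    echo "char=$ch  Q=$q  P(error)=$p"
done
```

The three characters decode to Q30, Q40 and Q20, that is, error probabilities of 1 in 1,000, 1 in 10,000 and 1 in 100.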
Get FASTQ Statistics
# Basic statistics for each read file
seqkit stats genome_data/fastq/sample_R1.fastq
seqkit stats genome_data/fastq/sample_R2.fastq
# Detailed statistics (includes Q20/Q30 percentages)
seqkit stats -a genome_data/fastq/sample_R1.fastq
seqkit stats -a genome_data/fastq/sample_R2.fastq
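As a sanity check on the seqkit output, the read count and average length can be estimated from the 4-line structure alone. This sketch assumes strictly four lines per record and uses a made-up two-read file. Counting lines that start with @ is not reliable, because @ is also a valid quality character.

```shell
# Toy FASTQ with two reads (made-up data)
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
ACGTAC
+
IIIIII
@read2
ACGT
+
IIII
EOF

# Four lines per record, so reads = lines / 4
n_lines=$(wc -l < "$fq" | tr -d ' ')
n_reads=$((n_lines / 4))

# Sequence lines are lines 2, 6, 10, ... (NR % 4 == 2)
avg_len=$(awk 'NR % 4 == 2 { total += length($0); n++ }
               END { if (n) printf "%.1f", total / n }' "$fq")

echo "reads: $n_reads, average length: $avg_len"
rm -f "$fq"
```

For the toy file this gives 2 reads with an average length of 5.0.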
Understanding Paired-End Reads
In paired-end sequencing, each DNA fragment is read from both ends (R1 = forward, R2 = reverse). Read pairs share the same ID.
# Compare read IDs between R1 and R2
# They should match, indicating paired reads
seqkit seq -n genome_data/fastq/sample_R1.fastq | head -5
seqkit seq -n genome_data/fastq/sample_R2.fastq | head -5
R1 and R2 files must always be kept in sync -- the first read in R1 is paired with the first read in R2, and so on. If the files get out of order, downstream aligners may fail or, worse, silently pair the wrong reads.
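One way to check synchronisation is to extract the read IDs from both files and compare them in order. The sketch below uses two made-up files and assumes IDs either match exactly or differ only by a trailing /1 or /2. The ID convention in the sample data may differ, so treat the stripping rule as an assumption.

```shell
# Two toy paired files (made-up reads)
dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > "$dir/R1.fastq"
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' > "$dir/R2.fastq"

# Print read IDs: header lines only, without the leading "@",
# without any description after the first space, without a trailing /1 or /2
ids() {
    awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

ids "$dir/R1.fastq" > "$dir/R1.ids"
ids "$dir/R2.fastq" > "$dir/R2.ids"

# cmp -s is silent and returns 0 only if the two ID lists are identical
if cmp -s "$dir/R1.ids" "$dir/R2.ids"; then
    pair_status="in sync"
else
    pair_status="OUT OF SYNC"
fi
echo "R1/R2 are $pair_status"
rm -rf "$dir"
```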
Debugging: Broken and Truncated FASTQ Files
Real-world data can sometimes be corrupted. The sample data includes intentionally broken files for practice.
# Try running stats on the broken file -- what happens?
seqkit stats genome_data/fastq/broken.fastq
# Try the truncated file
seqkit stats genome_data/fastq/truncated.fastq
# Inspect the broken file manually to find the issue
head -20 genome_data/fastq/broken.fastq
# Inspect the truncated file
tail -10 genome_data/fastq/truncated.fastq
Always run a quick validation on your FASTQ files before starting analysis. Corrupted files can cause silent errors in downstream tools. Use seqkit stats as a fast sanity check -- if it reports an error, investigate before proceeding.
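To see what such a check looks for, here is a minimal structural validator written in awk. It is only a sketch: it assumes strictly four lines per record and stops at the first problem. The function name check_fastq and the toy files are hypothetical.

```shell
# Return 0 if the file looks like well-formed 4-line FASTQ, 1 otherwise.
# check_fastq is a hypothetical helper, not a seqkit command.
check_fastq() {
    awk '
        NR % 4 == 1 && !/^@/ { printf "line %d: header does not start with @\n", NR; bad = 1; exit }
        NR % 4 == 2          { seqlen = length($0) }
        NR % 4 == 3 && !/^\+/ { printf "line %d: separator does not start with +\n", NR; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            # A complete file has a multiple of four lines
            if (!bad && NR % 4 != 0) { printf "file ends mid-record (%d lines)\n", NR; bad = 1 }
            exit bad
        }
    ' "$1"
}

# Try it on three toy files (made-up data)
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/good.fastq"
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n' > "$tmp/cut_short.fastq"
printf '@r1\nACGT\n+\nIII\n' > "$tmp/short_qual.fastq"

for name in good cut_short short_qual; do
    if check_fastq "$tmp/$name.fastq"; then
        echo "$name: OK"
    else
        echo "$name: FAILED"
    fi
done
rm -rf "$tmp"
```

Only the first toy file passes. The second ends in the middle of a record and the third has a quality string shorter than its sequence.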
Batch Statistics on Multiple FASTQ Files
# Get statistics for all .fastq files at once
# (this also matches broken.fastq and truncated.fastq, so expect errors for those)
seqkit stats genome_data/fastq/*.fastq
# You can also use wildcards for subsets
seqkit stats genome_data/fastq/sample_*.fastq
Exercises
Try these exercises on your own using the commands and tools you have learned. Refer to the sections above if you get stuck.
Exercise 1: Count Sequences
How many sequences are in genome_data/reference/mini_ref.fasta?
Hint: Use seqkit stats or grep -c "^>"
Exercise 2: Average Read Length
What is the average read length in genome_data/fastq/sample_R1.fastq?
Hint: Use seqkit stats -a and look at the "avg_len" column.
Exercise 3: Pattern Search
Find all reads in genome_data/fastq/sample_R1.fastq that contain the pattern "TTGC". How many are there?
Hint: Use seqkit grep -s -r -p "TTGC" and pipe to seqkit stats
Summary
In this lab, you have:
- Loaded bioinformatics modules using the HPC module system
- Explored FASTA files using seqkit to view sequences, headers, and statistics
- Explored FASTQ files, understood the 4-line format, and examined paired-end reads
- Practiced debugging with intentionally broken FASTQ files