Analyze SARS-CoV-2 Genomes with the ARTIC Workflow: A Comprehensive Guide

Are you working with SARS-CoV-2 genomic data from Oxford Nanopore sequencing? Do you need a robust and reliable workflow for generating consensus sequences and analyzing variants? This guide will walk you through the ARTIC SARS-CoV-2 workflow (wf-artic), a powerful tool for analyzing multiplexed MinION, GridION, and PromethION data.

What is the ARTIC SARS-CoV-2 Workflow (wf-artic)?

The ARTIC SARS-CoV-2 workflow is a bioinformatics pipeline designed to process sequencing data from SARS-CoV-2 genomes. It leverages a modified ARTIC FieldBioinformatics workflow to create consensus sequences from samples sequenced using a pooled tiling amplicon strategy. With this workflow, researchers can process demultiplexed sequence reads from instruments like MinION or GridION.

Key Benefits of Using wf-artic:

Standardized Analysis: Ensures consistent and reproducible results across different datasets.
Simplified Workflow: Automates complex bioinformatics tasks, saving you time and effort.
Comprehensive Reporting: Generates detailed reports with key information about your samples.
Flexibility: Supports various primer schemes and basecaller configurations.

System Requirements

Before diving in, ensure your system meets the following:

Recommended: 4 CPUs, 8GB Memory
Minimum: 2 CPUs, 4GB Memory
Please note that the workflow does not currently support ARM processors.
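
A quick preflight check against these requirements can be sketched as below. This is only an illustration and assumes a Linux host where nproc, uname, and /proc/meminfo are available; the function names are hypothetical and not part of wf-artic.

```shell
# Preflight sketch: compare this machine against the stated minimums.
# Assumes Linux (nproc, uname, /proc/meminfo); helper names are illustrative.

MIN_CPUS=2      # minimum CPUs from the requirements above
MIN_MEM_GB=4    # minimum memory (GB) from the requirements above

# Succeeds if the architecture string is not an ARM variant.
is_supported_arch() {
    case "$1" in
        arm*|aarch64*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeeds if integer $1 is at least integer $2.
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# A value of 0 means the figure could not be detected.
cpus=$(nproc 2>/dev/null || echo 0)
# MemTotal is reported in kB; convert to whole GB, rounded down, so a
# nominal 4 GB machine may report 3 here.
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null || echo 0)
arch=$(uname -m)

is_supported_arch "$arch" || echo "WARNING: ARM architecture ($arch) is not supported"
meets_minimum "$cpus" "$MIN_CPUS" || echo "WARNING: $cpus CPUs detected (minimum $MIN_CPUS)"
meets_minimum "$mem_gb" "$MIN_MEM_GB" || echo "WARNING: ${mem_gb} GB memory detected (minimum $MIN_MEM_GB)"
echo "Detected: $cpus CPUs, ${mem_gb} GB memory, $arch"
```

The script only prints warnings, so it can be run safely before installing anything.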

How to Install and Run the ARTIC Workflow

Here's a step-by-step guide to getting the workflow up and running:

Install Nextflow: The workflow relies on Nextflow, a workflow management system.
Choose a Containerization Method: Select either Docker or Singularity for software isolation. Docker is generally easier to set up, while Singularity is often preferred on HPC clusters.
Obtain the Workflow: Use the following command to download the workflow and view available parameters:

nextflow run epi2me-labs/wf-artic --help

Download the Demo Dataset (Optional): For testing, download the demo dataset:

wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-artic/wf-artic-demo.tar.gz
tar -xzvf wf-artic-demo.tar.gz

Run the Workflow: Execute the workflow, pointing --fastq at your own reads or at the extracted demo data:

nextflow run epi2me-labs/wf-artic \
--fastq test_data/reads.fastq.gz \
-profile standard

The -profile standard option tells Nextflow to run the workflow's software in Docker containers. To use Singularity instead, pass -profile singularity.

Input Data: Preparing Your FASTQ Files

The workflow accepts FASTQ files (or gzipped FASTQ files) as input, supporting three different scenarios:

(i) Single FASTQ File: Use the --sample parameter to specify a sample name.
(ii) Directory of FASTQ Files: All FASTQ files in the directory belong to a single sample. Again, use --sample to specify the sample name.
(iii) Directory with Subdirectories (Multiplexed Data): Each subdirectory represents a different barcode. Provide a sample sheet using --sample_sheet to map barcodes to sample names.

Example Input Structures

(i)                     (ii)                 (iii)
input_reads.fastq   ─── input_directory  ─── input_directory
                        ├── reads0.fastq     ├── barcode01
                        └── reads1.fastq     │   ├── reads0.fastq
                                             │   └── reads1.fastq
                                             ├── barcode02
                                             │   ├── reads0.fastq
                                             │   ├── reads1.fastq
                                             │   └── reads2.fastq
                                             └── barcode03
                                                 └── reads0.fastq
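
To decide which of the three layouts a given path matches before launching a run, something like the following sketch may help. The classify_input function is a hypothetical helper, not part of wf-artic, and it assumes that any subdirectory inside the input directory indicates the barcode-per-directory layout (iii).

```shell
# Classify an input path as one of the three layouts described above.
# Prints: single_file, single_sample_dir, multiplexed_dir, or unknown.
# classify_input is an illustrative helper, not a wf-artic command.
classify_input() {
    path="$1"
    if [ -f "$path" ]; then
        # Layout (i): a single FASTQ file.
        echo "single_file"
    elif [ -d "$path" ]; then
        # Look for at least one immediate subdirectory.
        first_subdir=$(find "$path" -mindepth 1 -maxdepth 1 -type d | head -n 1)
        if [ -n "$first_subdir" ]; then
            # Layout (iii): one subdirectory per barcode.
            echo "multiplexed_dir"
        else
            # Layout (ii): FASTQ files directly inside the directory.
            echo "single_sample_dir"
        fi
    else
        echo "unknown"
    fi
}
```

For example, classify_input input_directory prints multiplexed_dir for layout (iii), which is the case where a sample sheet is passed with --sample_sheet; the other two layouts take --sample.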

Key Input Parameters Explained

Understanding the input parameters is crucial for optimizing the workflow for your specific data. Here's a breakdown of some important parameters:

--fastq: Specifies the path to your FASTQ file(s) or directory.
--scheme_name: Defines the primer scheme used (e.g., SARS-CoV-2, spike-seq).
--scheme_version: Specifies the version of the primer scheme (e.g., ARTIC/V3, ONT/V1).
--sample_sheet: Provides a CSV file for mapping barcodes to sample names in multiplexed data.
--out_dir: Specifies the output directory for all results.
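
As an illustration of how these parameters combine for multiplexed data, the sketch below assembles a full command and prints it rather than running it. The three paths are hypothetical placeholders, the scheme values are simply the examples listed above, and the run_wf_artic wrapper with its DRY_RUN switch is an assumption made for this example.

```shell
# Dry-run sketch: assemble a wf-artic command from the parameters above.
# All three paths are hypothetical placeholders; replace them with your own.
FASTQ_DIR="input_directory"        # multiplexed layout (iii)
SAMPLE_SHEET="sample_sheet.csv"    # CSV mapping barcodes to sample names
OUT_DIR="artic_output"             # where results should be written

# Prints the command by default; set DRY_RUN=0 to actually execute it.
run_wf_artic() {
    # Positional parameters set here are local to the function.
    set -- nextflow run epi2me-labs/wf-artic \
        --fastq "$FASTQ_DIR" \
        --sample_sheet "$SAMPLE_SHEET" \
        --scheme_name SARS-CoV-2 \
        --scheme_version ARTIC/V3 \
        --out_dir "$OUT_DIR" \
        -profile standard
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_wf_artic
```

Printing first makes it easy to confirm the scheme name and version match the primers used in the lab before committing to a full run.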

Analyzing the Output

The workflow generates several key output files:

wf-artic-report.html: A comprehensive HTML report summarizing the analysis.
all_consensus.fasta: Contains the final consensus sequences for all samples.
lineage_report.csv: Pangolin lineage assignments for each sample.
nextclade.json: Nextclade results (clade and variant analysis).
{{ alias }}.pass.named.vcf.gz: A VCF file containing high-confidence variants, one per sample, where {{ alias }} stands for the sample name.
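
For a quick look at consensus quality outside the HTML report, a small helper like the one below can tabulate the length and the number of N (masked or uncalled) bases per sequence. This is a sketch only: consensus_stats is a hypothetical function, and the path artic_output/all_consensus.fasta assumes the file sits directly in an output directory of that name.

```shell
# Print "name<TAB>length<TAB>N count" for every record in a FASTA file.
# consensus_stats is an illustrative helper, not part of wf-artic.
consensus_stats() {
    awk '
        /^>/ {
            # New record: flush the previous one, then reset the counters.
            if (name != "") print name "\t" len "\t" n
            name = substr($1, 2); len = 0; n = 0
            next
        }
        {
            len += length($0)         # count bases before gsub alters the line
            n += gsub(/[Nn]/, "")     # gsub returns the number of N bases removed
        }
        END { if (name != "") print name "\t" len "\t" n }
    ' "$1"
}

# Hypothetical output location; adjust to match your --out_dir.
if [ -f artic_output/all_consensus.fasta ]; then
    consensus_stats artic_output/all_consensus.fasta
fi
```

A sample whose N count is a large share of its length likely had poor amplicon coverage and deserves a closer look in the report.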

Conclusion

The ARTIC SARS-CoV-2 workflow (wf-artic) offers a streamlined and reliable solution for analyzing SARS-CoV-2 sequencing data. By following this guide, you can quickly set up the workflow, process your data, and gain valuable insights into viral genomes and variants.