Analyze COVID-19 Genomes Faster: A Guide to the ARTIC SARS-CoV-2 Workflow

Do you work with SARS-CoV-2 sequencing data? The ARTIC SARS-CoV-2 workflow (wf-artic) generates consensus sequences from pooled amplicon sequencing data. This guide covers its system requirements, installation, input formats, key parameters, and output files.

What is the ARTIC SARS-CoV-2 Workflow?

wf-artic is a bioinformatics pipeline designed for analyzing SARS-CoV-2 genomes sequenced using the ARTIC network's amplicon-based approach. It's specifically tailored for data from Oxford Nanopore Technologies (ONT) sequencing platforms like MinION, GridION, and PromethION.

Here's why you should consider using it:

Standardized analysis: Implements a consistent and validated methodology for SARS-CoV-2 genome analysis.
Amplicon-based sequencing: Optimized for the ARTIC FieldBioinformatics workflow.
Consensus sequence generation: Creates high-quality consensus sequences for downstream analysis.

System Requirements to Run wf-artic

Before you dive in, ensure your system meets these requirements:

CPUs: Recommended 4, Minimum 2
Memory: Recommended 8 GB, Minimum 4 GB
Containerization: Docker or Singularity

Note: ARM processors are currently not supported.
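
The requirements above can be checked before installing anything. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` is available. The thresholds are the minimums listed above; the variable names are illustrative and not part of wf-artic.

```shell
#!/bin/sh
# Pre-flight check against the minimum requirements (Linux only).
min_cpus=2
min_mem_gb=4

cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
# MemTotal is reported in kB; convert to GB and round to the nearest integer,
# since a nominal 4 GB machine reports slightly less than 4 GB.
mem_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024 + 0.5}' /proc/meminfo)
arch=$(uname -m)

if [ "$cpus" -lt "$min_cpus" ]; then
    echo "WARNING: $cpus CPU(s) found, at least $min_cpus needed"
fi
if [ "$mem_gb" -lt "$min_mem_gb" ]; then
    echo "WARNING: ${mem_gb} GB RAM found, at least ${min_mem_gb} GB needed"
fi

# ARM machines usually report arm* or aarch64.
case "$arch" in
    arm*|aarch64) echo "WARNING: ARM architecture ($arch) is not supported" ;;
esac

if command -v docker >/dev/null 2>&1; then
    echo "Container engine: docker"
elif command -v singularity >/dev/null 2>&1; then
    echo "Container engine: singularity"
else
    echo "WARNING: neither docker nor singularity found on PATH"
fi
```

The script only prints warnings and never aborts, so it can be run safely on any candidate machine.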

How to Install and Run the Workflow

The wf-artic workflow leverages Nextflow to manage the software dependencies. You can either clone the git repository for the workflow, or access the workflow via the EPI2ME application.

To install and run wf-artic, follow these steps:

Install Nextflow: If you haven't already, install Nextflow.
Obtain the Workflow: Use the following command to pull the workflow:
```
nextflow run epi2me-labs/wf-artic --help
```
This command downloads the workflow and provides a list of available parameters with descriptions.

Grab the Demo Dataset (Optional): Use it for an initial test run to confirm that the installation works.

```
wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-artic/wf-artic-demo.tar.gz
tar -xzvf wf-artic-demo.tar.gz
```

Run the Workflow: Execute the workflow with your data:

```
nextflow run epi2me-labs/wf-artic \
    --fastq test_data/reads.fastq.gz \
    -profile standard
```

Understanding Input Data

wf-artic requires demultiplexed FASTQ files as input. There are three ways to provide FASTQ input:

(i) Single FASTQ: Path to a single FASTQ file. Use --sample to specify the sample name.
(ii) Directory of FASTQs: Path to a directory containing FASTQ files. Use --sample to specify the sample name.
(iii) Multiplexed Directory: Path to a directory containing sub-directories, where each sub-directory represents a barcode and contains FASTQ files. Use --sample_sheet to provide a sample sheet mapping barcodes to sample names.
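
To make option (iii) concrete, here is a small sketch that builds a toy multiplexed layout and a matching sample sheet, then checks that every barcode in the sheet has a directory. The column names `barcode` and `alias`, the directory names, and the sample names are assumptions made for illustration; confirm the expected sample sheet columns with `nextflow run epi2me-labs/wf-artic --help`.

```shell
# Build a toy multiplexed input layout in a temporary directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/fastq/barcode01" "$demo_dir/fastq/barcode02"
: > "$demo_dir/fastq/barcode01/reads0.fastq"   # empty placeholder files
: > "$demo_dir/fastq/barcode02/reads0.fastq"

# Sample sheet mapping each barcode directory to a sample name.
# The column names "barcode" and "alias" are assumed here.
cat > "$demo_dir/sample_sheet.csv" <<'EOF'
barcode,alias
barcode01,sample_A
barcode02,sample_B
EOF

# Sanity check: every barcode in the sheet has a matching directory.
missing=0
while IFS=, read -r barcode alias; do
    if [ "$barcode" = "barcode" ]; then
        continue   # skip the header row
    fi
    if [ ! -d "$demo_dir/fastq/$barcode" ]; then
        echo "no directory for $barcode ($alias)"
        missing=$((missing + 1))
    fi
done < "$demo_dir/sample_sheet.csv"
echo "barcodes without a directory: $missing"
```

With real data, the top-level `fastq` directory would be passed to --fastq and the CSV file to --sample_sheet.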

Key Input Parameters Explained

To tailor the workflow to your specific needs, here's a breakdown of the most important parameters:

--fastq: Specifies the path to a FASTQ file, or to a directory of FASTQ files, containing your sequencing reads.
--scheme_name: Sets the primer scheme, such as SARS-CoV-2 or spike-seq.
--scheme_version: Defines the primer scheme version (e.g., ARTIC/V3).
--sample_sheet: CSV file for mapping barcodes to sample names in multiplexed data.
--out_dir: Sets the output directory for all results.
--basecaller_cfg: Determines the basecaller configuration to use for model selection.
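
The parameters above combine into a single command line. The sketch below only prints such a command for review rather than running it, so it works without Nextflow installed. The helper name `print_wf_artic_cmd` and the paths `fastq_pass`, `sample_sheet.csv`, and `artic_results` are hypothetical; the scheme values are the examples mentioned above and must be changed to match your primers.

```shell
# Print (not run) a wf-artic command line for review.
# Arguments: <fastq path> <sample sheet> <output dir>
print_wf_artic_cmd() {
    printf 'nextflow run epi2me-labs/wf-artic \\\n'
    printf '    --fastq %s \\\n' "$1"
    printf '    --sample_sheet %s \\\n' "$2"
    printf '    --scheme_name SARS-CoV-2 \\\n'
    printf '    --scheme_version ARTIC/V3 \\\n'
    printf '    --out_dir %s \\\n' "$3"
    printf '    -profile standard\n'
}

# Hypothetical paths, for illustration only.
print_wf_artic_cmd fastq_pass sample_sheet.csv artic_results
```

Printing first makes it easy to spot a wrong path or scheme before a long run starts; the printed text can then be pasted into a terminal.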

Essential Output Files

Once the workflow completes, you'll find several important output files in your specified output directory:

wf-artic-report.html: A comprehensive HTML report summarizing the analysis.
all_consensus.fasta: Contains the final consensus sequences for all samples.
lineage_report.csv: Pangolin lineage assignments for each sample.
nextclade.json: Nextclade results for clade assignment and mutation analysis.
{{ alias }}.pass.named.vcf.gz: A VCF file containing high-confidence variants, written once per sample; {{ alias }} is a placeholder for the sample name.
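
After a run, a quick check can confirm that the files listed above were produced. This is a minimal sketch: the function name `check_wf_artic_outputs` is hypothetical, and it assumes the four fixed-name files sit directly in the output directory. The per-sample VCF files are not checked because their names depend on the sample names.

```shell
# Report which of the expected output files are present in a results directory.
# The exit status is the number of missing files (0 means all were found).
check_wf_artic_outputs() {
    out_dir=$1
    n_missing=0
    for f in wf-artic-report.html all_consensus.fasta lineage_report.csv nextclade.json; do
        if [ -f "$out_dir/$f" ]; then
            echo "found    $f"
        else
            echo "MISSING  $f"
            n_missing=$((n_missing + 1))
        fi
    done
    # Each FASTA record starts with '>', so this counts consensus sequences.
    if [ -f "$out_dir/all_consensus.fasta" ]; then
        echo "consensus sequences: $(grep -c '^>' "$out_dir/all_consensus.fasta")"
    fi
    return "$n_missing"
}
```

Call it with the directory given to --out_dir, for example `check_wf_artic_outputs artic_results`. Comparing the reported sequence count with the number of samples in the sample sheet is a simple way to notice dropped samples.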

Optimizing Your Workflow

Consider these tips for optimal performance:

Choose the correct primer scheme: Ensure the --scheme_name and --scheme_version parameters match the primers used in your experiment.
Provide a sample sheet for multiplexed data: Accurate sample mapping is crucial for correct analysis.
Adjust compute resources: Set the number of threads used by ARTIC and Pangolin (--artic_threads, --pangolin_threads) to suit the number of CPU cores available on your system.
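
One way to pick thread values is to derive them from the core count of the machine. The sketch below uses an illustrative policy (use every core, capped at 4); the cap is an assumption, not a wf-artic recommendation, and should be tuned to your hardware and the number of samples.

```shell
# Derive thread settings from the number of available cores.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Illustrative policy only: cap at 4 threads, never go below 1.
threads=$cores
if [ "$threads" -gt 4 ]; then
    threads=4
fi
if [ "$threads" -lt 1 ]; then
    threads=1
fi

# Print the flags so they can be appended to a nextflow run command.
echo "--artic_threads $threads --pangolin_threads $threads"
```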

With these steps, you can analyze your SARS-CoV-2 sequencing data efficiently using the ARTIC SARS-CoV-2 workflow, accelerating your research and contributing to global genomic surveillance efforts.