GHGA AQuA Nextflow Pipeline

Introduction

GHGA AQuA (Automatic Quality Assessment) Pipeline is a bioinformatics pipeline that performs basic quality control over input datasets without altering the raw data. It accepts three main input types:

Raw FastQ files
Aligned BAM/CRAM files
Variant called VCF/BCF files

The pipeline automatically selects the appropriate quality control tools based on your provided analysis method (e.g., WGS, RNA-Seq, Nanopore) and compiles the results into a single MultiQC report.

Tool & Analysis Matrix

The following table details which tools are executed based on the analysis method and input data type provided in the samplesheet.

Analysis Method	Read QC (FastQ)	Alignment QC (BAM/CRAM)	Variant QC (VCF)
WGS	FastQC, FastP, SeqFU	Mosdepth, Samtools Stats, Picard, VerifyBamID, NGS-Bits*, Preseq	BCFTools Stats
WES / TES	FastQC, FastP, SeqFU	Mosdepth, Samtools Stats, Picard, VerifyBamID, Preseq	BCFTools Stats
ATAC-Seq	FastQC, FastP, SeqFU	Mosdepth, Samtools Stats, Picard, Ataqv, Preseq	BCFTools Stats
ChIP-Seq	FastQC, FastP, SeqFU	Mosdepth, Samtools Stats, Phantompeakqualtools, Preseq	BCFTools Stats
RNA-Seq / smRNA	FastQC, FastP, SeqFU	RSeQC, Preseq	BCFTools Stats
Nanopore	FastQC, NanoPlot	-	BCFTools Stats
PacBio	FastPLong	-	BCFTools Stats
MethylSeq	FastQC, FastP, SeqFU	Samtools Stats	BCFTools Stats

Usage

The pipeline can be started in two ways: by providing a manual samplesheet or by providing GHGA-compliant metadata.

Option A: Manual Samplesheet (CSV)

Create a samplesheet.csv with your data. The pipeline auto-detects the starting step based on which columns are populated.

You must create a samplesheet.csv containing your input data. The structure requires a step column to tell the pipeline which type of file you are providing:

step 1: FastQ files (Read QC)
step 2: BAM/CRAM files (Alignment QC)
step 3: VCF files (Variant QC)

A samplesheet containing a mix of raw data, mapped bams, and variant files would look like this:

sample,lane,individual_id,sex,experiment_method,fastq_1,fastq_2,bam,bai,vcf
SAMPLE_FASTQ,L001,ind_1,MALE,wgs,s1_R1.fastq.gz,s1_R2.fastq.gz,,,
SAMPLE_BAM,L001,ind_2,FEMALE,wgs,,,s2.bam,s2.bam.bai,
SAMPLE_VCF,L001,ind_3,NA,wgs,,,,,s3.vcf.gz

Column	Description
`sample`	Required. Custom sample name. This identifier is used to group multiple sequencing runs (lanes) from the same sample. Spaces are automatically converted to underscores (`_`).
`lane`	Required. identifier for the sequencing lane or library (e.g., L001, L002). Must not contain spaces.
`individual_id`	Identifier for the individual (patient/subject).
`sex`	Biological sex of the individual (e.g., MALE, FEMALE, NA).
`status`	Disease status as an integer: `0` (Normal/Control) or `1` (Tumor/Case).
`phenotype`	Phenotypic terms or descriptions associated with the individual.
`sample_type`	The type of sample (e.g., GENOMIC_DNA, TOTAL_RNA).
`disease_status`	Text description of the disease status (e.g., Healthy, Tumor).
`case_control_status`	Status in the study design (e.g., CASE, CONTROL).
`tissue`	The source tissue of the specimen (e.g., blood, tissue).
`experiment_method`	The sequencing method used. Supported values: `wgs`, `wes`, `rna`, `atac`, `nanopore`, `pacbio`.
`analysis_method`	The type of analysis performed (e.g., `varcall`).
`fastq_1`	Path to the Read 1 FastQ file. Must end in `.fastq.gz` or `.fq.gz`.
`fastq_2`	Path to the Read 2 FastQ file for paired-end data. Optional for single-end.
`single_end`	Boolean (`true`/`false`) indicating if the sequencing is single-end.
`bam`	Path to the aligned BAM file.
`bai`	Path to the corresponding BAM index file.
`cram`	Path to the aligned CRAM file.
`crai`	Path to the corresponding CRAM index file.
`vcf`	Path to the Variant Call Format file. Must end in `.vcf` or `.vcf.gz`.
`data_files`	Semicolon-separated list of any other relevant data files not covered by specific columns.

An example samplesheet has been provided with the pipeline.

Option B: GHGA Metadata (JSON)

If you already have a metadata.json following the GHGA metadata model, you can provide it directly. The pipeline will automatically convert the JSON into the required internal format, eliminating the need to create a manual samplesheet.

2. Run the Pipeline

Run the pipeline using the command below with input samplesheet.csv.

nextflow run main.nf \
   -profile docker/singularity/conda \
   --input samplesheet.csv \
   --outdir ./results

or using the command below with input metadata.json.

nextflow run main.nf \
   -profile docker/singularity/conda \
   --metadata metadata.json \
   --outdir ./results

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Supported Tools

Read QC

FastQC: Basic quality control checks for raw sequence data.
FastP: A fast processor used to generate read metrics for quality control.
SeqFU: Tools for gathering sequence statistics and metadata.
NanoPlot: Specialized plotting for long read sequencing data.
FastPLong: Quality control specifically for PacBio and other long read data.

Alignment QC

Mosdepth: Fast depth calculation for BAM or CRAM files using target intervals for specific assays.
Samtools Stats: Comprehensive statistics for alignment files.
Picard CollectMultipleMetrics: Metrics for DNA library quality and fragmentation.
Ataqv: Specialized quality control for ATAC sequencing experiments.
Phantompeakqualtools: Tools for quality control of ChIP sequencing datasets.
RSeQC: A quality control package designed for RNA sequencing experiments.
Preseq: Software to estimate library complexity and duplication.
VerifyBamID: Estimation of DNA contamination using ancestry agnostic methods.
NGSBits SampleGender: Determination of biological sex based on sequencing coverage.

Variant QC

BCFTools Stats: Detailed statistics and metrics for VCF and BCF files.

Reporting

MultiQC: A tool that combines all quality control results into one interactive report.
MultiQC-mapper: A tool unifies QC reports to report back to the GHGA dataportal.

Credits

GHGA AQuA nextflow pipeline was originally written by Kubra Narci @kubranarci.

Current development team: - Manuel Kösters - Virag Sharma - Ruchi Tanavade

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.devcontainer		.devcontainer
.github		.github
.vscode		.vscode
assets		assets
bin		bin
conf		conf
docs		docs
modules		modules
subworkflows		subworkflows
tests		tests
workflows		workflows
.gitignore		.gitignore
.gitpod.yml		.gitpod.yml
.nf-core.yml		.nf-core.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
.prettierignore		.prettierignore
.prettierrc.yml		.prettierrc.yml
CHANGELOG.md		CHANGELOG.md
CITATIONS.md		CITATIONS.md
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
modules.json		modules.json
nextflow.config		nextflow.config
nextflow_schema.json		nextflow_schema.json
nf-test.config		nf-test.config
ro-crate-metadata.json		ro-crate-metadata.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GHGA AQuA Nextflow Pipeline

Introduction

Tool & Analysis Matrix

Usage

Option A: Manual Samplesheet (CSV)

Option B: GHGA Metadata (JSON)

2. Run the Pipeline

Supported Tools

Read QC

Alignment QC

Variant QC

Reporting

Credits

Contributions and Support

Citations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GHGA AQuA Nextflow Pipeline

Introduction

Tool & Analysis Matrix

Usage

Option A: Manual Samplesheet (CSV)

Option B: GHGA Metadata (JSON)

2. Run the Pipeline

Supported Tools

Read QC

Alignment QC

Variant QC

Reporting

Credits

Contributions and Support

Citations

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages