Skip to content

ghga-de/AQuA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GHGA AQuA Nextflow Pipeline

GitHub Actions CI Status GitHub Actions Linting Status nf-test

Nextflow run with conda run with docker run with singularity

Introduction

GHGA AQuA (Automatic Quality Assessment) Pipeline is a bioinformatics pipeline that performs basic quality control over input datasets without altering the raw data. It accepts three main input types:

  1. Raw FastQ files
  2. Aligned BAM/CRAM files
  3. Variant called VCF/BCF files

The pipeline automatically selects the appropriate quality control tools based on your provided analysis method (e.g., WGS, RNA-Seq, Nanopore) and compiles the results into a single MultiQC report.

Tool & Analysis Matrix

The following table details which tools are executed based on the analysis method and input data type provided in the samplesheet.

Analysis Method Read QC (FastQ) Alignment QC (BAM/CRAM) Variant QC (VCF)
WGS FastQC, FastP, SeqFU Mosdepth, Samtools Stats, Picard, VerifyBamID, NGS-Bits*, Preseq BCFTools Stats
WES / TES FastQC, FastP, SeqFU Mosdepth, Samtools Stats, Picard, VerifyBamID, Preseq BCFTools Stats
ATAC-Seq FastQC, FastP, SeqFU Mosdepth, Samtools Stats, Picard, Ataqv, Preseq BCFTools Stats
ChIP-Seq FastQC, FastP, SeqFU Mosdepth, Samtools Stats, Phantompeakqualtools, Preseq BCFTools Stats
RNA-Seq / smRNA FastQC, FastP, SeqFU RSeQC, Preseq BCFTools Stats
Nanopore FastQC, NanoPlot - BCFTools Stats
PacBio FastPLong - BCFTools Stats
MethylSeq FastQC, FastP, SeqFU Samtools Stats BCFTools Stats

Usage

The pipeline can be started in two ways: by providing a manual samplesheet or by providing GHGA-compliant metadata.

Option A: Manual Samplesheet (CSV)

Create a samplesheet.csv with your data. The pipeline auto-detects the starting step based on which columns are populated.

You must create a samplesheet.csv containing your input data. The structure requires a step column to tell the pipeline which type of file you are providing:

  • step 1: FastQ files (Read QC)
  • step 2: BAM/CRAM files (Alignment QC)
  • step 3: VCF files (Variant QC)

A samplesheet containing a mix of raw data, mapped bams, and variant files would look like this:

sample,lane,individual_id,sex,experiment_method,fastq_1,fastq_2,bam,bai,vcf
SAMPLE_FASTQ,L001,ind_1,MALE,wgs,s1_R1.fastq.gz,s1_R2.fastq.gz,,,
SAMPLE_BAM,L001,ind_2,FEMALE,wgs,,,s2.bam,s2.bam.bai,
SAMPLE_VCF,L001,ind_3,NA,wgs,,,,,s3.vcf.gz
Column Description
sample Required. Custom sample name. This identifier is used to group multiple sequencing runs (lanes) from the same sample. Spaces are automatically converted to underscores (_).
lane Required. identifier for the sequencing lane or library (e.g., L001, L002). Must not contain spaces.
individual_id Identifier for the individual (patient/subject).
sex Biological sex of the individual (e.g., MALE, FEMALE, NA).
status Disease status as an integer: 0 (Normal/Control) or 1 (Tumor/Case).
phenotype Phenotypic terms or descriptions associated with the individual.
sample_type The type of sample (e.g., GENOMIC_DNA, TOTAL_RNA).
disease_status Text description of the disease status (e.g., Healthy, Tumor).
case_control_status Status in the study design (e.g., CASE, CONTROL).
tissue The source tissue of the specimen (e.g., blood, tissue).
experiment_method The sequencing method used. Supported values: wgs, wes, rna, atac, nanopore, pacbio.
analysis_method The type of analysis performed (e.g., varcall).
fastq_1 Path to the Read 1 FastQ file. Must end in .fastq.gz or .fq.gz.
fastq_2 Path to the Read 2 FastQ file for paired-end data. Optional for single-end.
single_end Boolean (true/false) indicating if the sequencing is single-end.
bam Path to the aligned BAM file.
bai Path to the corresponding BAM index file.
cram Path to the aligned CRAM file.
crai Path to the corresponding CRAM index file.
vcf Path to the Variant Call Format file. Must end in .vcf or .vcf.gz.
data_files Semicolon-separated list of any other relevant data files not covered by specific columns.

An example samplesheet has been provided with the pipeline.

Option B: GHGA Metadata (JSON)

If you already have a metadata.json following the GHGA metadata model, you can provide it directly. The pipeline will automatically convert the JSON into the required internal format, eliminating the need to create a manual samplesheet.

2. Run the Pipeline

Run the pipeline using the command below with input samplesheet.csv.

nextflow run main.nf \
   -profile docker/singularity/conda \
   --input samplesheet.csv \
   --outdir ./results

or using the command below with input metadata.json.

nextflow run main.nf \
   -profile docker/singularity/conda \
   --metadata metadata.json \
   --outdir ./results

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Supported Tools

Read QC

  • FastQC: Basic quality control checks for raw sequence data.
  • FastP: A fast processor used to generate read metrics for quality control.
  • SeqFU: Tools for gathering sequence statistics and metadata.
  • NanoPlot: Specialized plotting for long read sequencing data.
  • FastPLong: Quality control specifically for PacBio and other long read data.

Alignment QC

  • Mosdepth: Fast depth calculation for BAM or CRAM files using target intervals for specific assays.
  • Samtools Stats: Comprehensive statistics for alignment files.
  • Picard CollectMultipleMetrics: Metrics for DNA library quality and fragmentation.
  • Ataqv: Specialized quality control for ATAC sequencing experiments.
  • Phantompeakqualtools: Tools for quality control of ChIP sequencing datasets.
  • RSeQC: A quality control package designed for RNA sequencing experiments.
  • Preseq: Software to estimate library complexity and duplication.
  • VerifyBamID: Estimation of DNA contamination using ancestry agnostic methods.
  • NGSBits SampleGender: Determination of biological sex based on sequencing coverage.

Variant QC

  • BCFTools Stats: Detailed statistics and metrics for VCF and BCF files.

Reporting

  • MultiQC: A tool that combines all quality control results into one interactive report.
  • MultiQC-mapper: A tool unifies QC reports to report back to the GHGA dataportal.

Credits

GHGA AQuA nextflow pipeline was originally written by Kubra Narci @kubranarci.

Current development team: - Manuel Kösters - Virag Sharma - Ruchi Tanavade

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

About

Automated Quality Assesment nextflow pipeline for GHGA datasets

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors