Downloading deliverables

Gencove provides a number of deliverables for each sample that is processed as part of a project. In case a sample fails processing due to quality control, only the original input files are provided as deliverables.

Downloading using the CLI

$ gencove download <local-destination-path> --project-id <project-id>

Downloads all deliverables for all samples in the specified project, with the following default naming scheme:

<local-destination-path>/<client-id>/<gencove-id>/<gencove-id>_<file-type>.<file-extension>

This naming scheme reflects the fact that uniqueness of client-ids is not enforced, while uniqueness of gencove-id is enforced.

Customizing download naming scheme

The default naming scheme outlined above can be customized by providing the --download-template flag and a custom file naming template that may contain {client_id}, {gencove_id}, {file_type}, {file_extension} and {default_filename} tokens.

$ gencove download . --project-id <project-id> --download-template '{client_id}.{file_extension}'

When using this feature, make sure to specify download templates that result in unique filenames across all samples.
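
One way to check a template for collisions before downloading is to render it for a few representative samples and look for repeated names. The sketch below is a local illustration only: render_template and duplicate_names are hypothetical helpers, not part of the gencove CLI, and the sample values are made up. It does not handle the {default_filename} token.

```shell
# Render a download template for one file by substituting the documented tokens.
# Hypothetical helper; values containing "|", "&" or "\" are not supported.
render_template() {
  # $1 template, $2 client_id, $3 gencove_id, $4 file_type, $5 file_extension
  printf '%s\n' "$1" | sed \
    -e "s|{client_id}|$2|g" \
    -e "s|{gencove_id}|$3|g" \
    -e "s|{file_type}|$4|g" \
    -e "s|{file_extension}|$5|g"
}

# Read rendered names on stdin and print each name that occurs more than once.
duplicate_names() {
  sort | uniq -d
}

# Example (made-up values): two samples that share a client id collide.
#   template='{client_id}.{file_extension}'
#   {
#     render_template "$template" client-a id-1 impute-vcf vcf.gz
#     render_template "$template" client-a id-2 impute-vcf vcf.gz
#   } | duplicate_names
# prints: client-a.vcf.gz
```

Because uniqueness is enforced only for gencove-id, including {gencove_id} and {file_type} in the template is the simplest way to avoid collisions.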

The {default_filename} token provides access to the API's default file naming scheme, which takes into account different bioinformatics conventions across a subset of file types. Current exceptions to the default {gencove_id}_{file_type}.{file_extension} scheme are:

  • fastq-r1: {gencove_id}_R1.fastq.gz
  • fastq-r2: {gencove_id}_R2.fastq.gz
  • alignment-bam: {gencove_id}.bam
  • alignment-bai: {gencove_id}.bam.bai
  • impute-vcf: {gencove_id}.vcf.gz
  • impute-tbi: {gencove_id}.vcf.gz.tbi
  • impute-csi: {gencove_id}.vcf.gz.csi
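
The exceptions listed above can be expressed as a small lookup. The function below is a sketch that mirrors the list as documented here; default_filename is a hypothetical local helper, not a CLI command, and the list of exceptions on the API side may change.

```shell
# Return the default filename for a deliverable, following the exceptions
# listed above and falling back to {gencove_id}_{file_type}.{file_extension}.
default_filename() {
  # $1 gencove_id, $2 file_type, $3 file_extension
  case "$2" in
    fastq-r1)      printf '%s_R1.fastq.gz\n' "$1" ;;
    fastq-r2)      printf '%s_R2.fastq.gz\n' "$1" ;;
    alignment-bam) printf '%s.bam\n' "$1" ;;
    alignment-bai) printf '%s.bam.bai\n' "$1" ;;
    impute-vcf)    printf '%s.vcf.gz\n' "$1" ;;
    impute-tbi)    printf '%s.vcf.gz.tbi\n' "$1" ;;
    impute-csi)    printf '%s.vcf.gz.csi\n' "$1" ;;
    *)             printf '%s_%s.%s\n' "$1" "$2" "$3" ;;
  esac
}
```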

Continuing previous downloads

When downloading, a file that already exists on the local filesystem is not overwritten if it has the same size in bytes as the file that would be downloaded. This behavior can be changed with the --no-skip-existing flag.
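
The skip rule described above can be sketched as a size comparison. This is an illustration of the rule, not the CLI's actual implementation; should_skip is a hypothetical helper, and the expected size is assumed to be known to the caller.

```shell
# Return 0 (skip) when the local file exists and its size in bytes equals
# the expected size; return 1 (download) otherwise.
should_skip() {
  # $1 local path, $2 expected size in bytes
  [ -f "$1" ] || return 1
  # wc -c may pad its output with spaces on some systems, so strip them.
  local_size=$(wc -c < "$1" | tr -d ' ')
  [ "$local_size" -eq "$2" ]
}
```

A size match does not guarantee identical content, so a file that was corrupted without changing its size would still be skipped under this rule.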

Downloading subsets of deliverables

$ gencove download . --sample-ids sample-id-1,sample-id-2,sample-id-3 --file-types impute-vcf,impute-tbi

The download can also be restricted in the following ways:

  1. Download only a specific set of sample ids by providing the --sample-ids flag instead of the --project-id flag
  2. Download only a specific set of file types by providing the --file-types flag. Currently available file types are listed below (not all file types may be available for every project).
fastq-r1
original input FASTQ file with raw sequencing reads, containing the first read of a read pair when using paired-end sequencing
fastq-r2
original input FASTQ file with raw sequencing reads, containing the second read of a read pair when using paired-end sequencing
alignment-bam
BAM file with reads aligned to the target genome (includes all reads from original FASTQ files)
alignment-bai
BAI index file accompanying the BAM file
cnv-cnr
CNR file with bin-level log2 ratios for copy-number variation calls
cnv-cns
CNS file with segmented log2 ratios for copy-number variation calls
cnv-pdf
Portable Document Format (PDF) file with copy-number variation plot
cnv-png
Portable Network Graphics (PNG) file with copy-number variation plot (commonly used when PDFs are too large).
impute-vcf
VCF file with imputed variant calls
impute-tbi
Tabix index file accompanying the VCF file
impute-csi
CSI index file accompanying the VCF file
kraken-report
Kraken report for sequencing reads that didn't map to the target genome
ancestry-json

JavaScript Object Notation (JSON) file with ancestry estimates for subpopulations. It contains the following keys:

  • ancestry - contains ancestry estimates
  • ancestry_raw - may contain additional entries for ambiguous groupings in situations where specific subgroups cannot be consistently identified
  • ancestry_metadata_id - legacy key (should be disregarded)
traits-json

JSON file with polygenic risk score calculations

  • each key represents a polygenic score outlined in the "Data analysis configurations" section below
  • each polygenic score object contains the following keys:
    • score - calculated value of polygenic score
    • nsnp - number of single-nucleotide polymorphisms (SNPs) taken into account
    • score_percentile - percentile of individual's score relative to scores calculated for individuals in the reference dataset used to generate the score
call_capture-vcf
VCF file with variant calls from target capture regions, corresponding with the deliverable labeled Target capture, VCF file in the web interface.
call_capture-csi
index accompanying the target capture VCF file, corresponding with the deliverable labeled Target capture, CSI file in the web interface.
call_capture-vcf_pathogenic
VCF file with pathogenic variant calls from target capture regions
call_capture-forced_vcf
VCF file with variant calls at a set of pre-determined variants, corresponding with the deliverable labeled Target capture (pre-defined variants), VCF file in the web interface.
call_capture-forced_csi
index accompanying the VCF file with variant calls at a set of pre-determined variants, corresponding with the deliverable labeled Target capture (pre-defined variants), CSI file in the web interface.
qc

JSON file with sample quality control metrics, containing the following quality_control_types:

  • format - FASTQ format validity
  • r1_eq_r2 - number of bases in R1 file equal to number of bases in R2 file
  • r1_r2_ids_match - R1 read identifiers match R2 read identifiers
  • bases_min - minimum number of total bases sequenced
  • bases_max - maximum number of total bases sequenced
  • bases_dedup_min - minimum number of deduplicated bases
  • bases_dedup_mapped_min - minimum number of deduplicated bases that have aligned to the target genome
  • fraction_contamination_max - maximum contamination by DNA from another sample of the same species
  • snps_min - number of variants in reference panel that are covered by at least one sequencing read
  • effective_coverage_min - minimum effective coverage
  • hzy_max - maximum heterozygosity
  • cc_min - minimum "call confidence", i.e., imputation algorithm variant calling confidence across all sites
  • nhref_min - minimum number of homozygous reference calls
  • nhet_max - maximum number of heterozygous calls
  • nhalt_min - minimum number of homozygous alt calls
  • pct_target_bases_30x_min - minimum percentage of target capture bases with 30x coverage
  • pathogenic_min - number of pathogenic variants detected
metadata
JSON file with user-specified metadata that has been assigned to a sample

Downloading checksum files

$ gencove download <local-destination-path> --project-id <project-id> --checksums

Includes SHA-256 checksum files that can be used to verify that deliverables are valid. For instance, for a file named file.vcf.gz, a file named file.vcf.gz.sha256 is downloaded as well.

To verify the integrity of a file, run:

$ shasum -c file.vcf.gz.sha256
# or
$ sha256sum -c file.vcf.gz.sha256

This reports whether the checksum of the downloaded file matches the one provided by Gencove.

Note

Only projects that were created after July 6, 2022 have checksums available.

Warning

The CLI does NOT validate deliverables against their checksums, even when the --checksums flag is provided.
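
Since validation is left to the user, every downloaded checksum file can be checked in one pass with a small loop. The sketch below assumes that each .sha256 file is in the format accepted by the -c option shown above and refers to its deliverable by a name relative to its own directory. verify_checksums is a hypothetical helper, not part of the gencove CLI.

```shell
# Check every *.sha256 file under a directory and report the ones that fail.
# Returns 0 when all checks pass and 1 when at least one fails.
verify_checksums() {
  # $1 directory to search, e.g. the local destination path of the download
  if command -v sha256sum >/dev/null 2>&1; then
    check=sha256sum
  else
    check=shasum   # fallback for systems without sha256sum
  fi
  rc=0
  list=$(mktemp)
  find "$1" -type f -name '*.sha256' > "$list"
  while IFS= read -r sum_file; do
    dir=$(dirname "$sum_file")
    name=$(basename "$sum_file")
    # Run the check from the file's own directory so relative names resolve.
    if ! (cd "$dir" && "$check" -c "$name" >/dev/null 2>&1); then
      echo "FAILED: $sum_file"
      rc=1
    fi
  done < "$list"
  rm -f "$list"
  return "$rc"
}

# Example:
#   verify_checksums <local-destination-path> && echo "all checksums match"
```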