Downloading deliverables

Gencove provides a number of deliverables for each sample that is processed as part of a project. If a sample fails quality control during processing, only the original input files are provided as deliverables.

Downloading using the CLI

$ gencove download <local-destination-path> --project-id <project-id>

Downloads all deliverables for all samples in the specified project, with the following default naming scheme:

<local-destination-path>/<client-id>/<gencove-id>/<gencove-id>_<file-type>.<file-extension>

The path includes the gencove-id because uniqueness of gencove-ids is enforced, while uniqueness of client-ids is not.

Customizing download naming scheme

The default naming scheme outlined above can be customized by providing the --download-template flag and a custom file naming template that may contain {client_id}, {gencove_id}, {file_type}, {file_extension} and {default_filename} tokens.

$ gencove download . --project-id <project-id> --download-template '{client_id}.{file_extension}'

When using this feature, make sure to specify download templates that result in unique filenames across all samples.
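One way to check a template before downloading is to expand it for every sample and look for repeated names. The following is a minimal sketch of that idea, not part of the Gencove CLI: the helper name find_duplicate_names and the client IDs are hypothetical, and the loop expands the '{client_id}.{file_extension}' template by hand for a single extension.

```shell
# Hypothetical helper: read candidate file names (one per line) on stdin
# and print every name that occurs more than once.
find_duplicate_names() {
  sort | uniq -d
}

# Hypothetical client IDs; two samples share the ID "mouse-7".
printf '%s\n' mouse-1 mouse-7 mouse-7 |
  while IFS= read -r client_id; do
    # Manual expansion of '{client_id}.{file_extension}' for vcf.gz files.
    printf '%s.%s\n' "$client_id" "vcf.gz"
  done | find_duplicate_names
# prints: mouse-7.vcf.gz
```

Any name printed by the helper would be written by more than one sample, so adding a unique token such as {gencove_id} to the template avoids the collision.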

The {default_filename} token provides access to the API's default file naming scheme, which takes into account different bioinformatics conventions across a subset of file types. Current exceptions to the default {gencove_id}_{file_type}.{file_extension} scheme are:

fastq-r1: {gencove_id}_R1.fastq.gz
fastq-r2: {gencove_id}_R2.fastq.gz
alignment-bam: {gencove_id}.bam
alignment-bai: {gencove_id}.bam.bai
impute-vcf: {gencove_id}.vcf.gz
impute-tbi: {gencove_id}.vcf.gz.tbi
impute-csi: {gencove_id}.vcf.gz.csi
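The exceptions listed above can be written out as a small lookup. This is only an illustrative sketch of the naming rules as described here, not the API's implementation; the function name default_filename, the sample ID abc123, and the json extension used for the qc file type are assumptions.

```shell
# Hypothetical helper: print the default file name for a deliverable.
# Arguments: gencove_id, file_type, file_extension.
default_filename() {
  gencove_id="$1"
  file_type="$2"
  file_extension="$3"
  case "$file_type" in
    # Exceptions that follow common bioinformatics naming conventions.
    fastq-r1)      printf '%s_R1.fastq.gz\n' "$gencove_id" ;;
    fastq-r2)      printf '%s_R2.fastq.gz\n' "$gencove_id" ;;
    alignment-bam) printf '%s.bam\n' "$gencove_id" ;;
    alignment-bai) printf '%s.bam.bai\n' "$gencove_id" ;;
    impute-vcf)    printf '%s.vcf.gz\n' "$gencove_id" ;;
    impute-tbi)    printf '%s.vcf.gz.tbi\n' "$gencove_id" ;;
    impute-csi)    printf '%s.vcf.gz.csi\n' "$gencove_id" ;;
    # Everything else: {gencove_id}_{file_type}.{file_extension}
    *)             printf '%s_%s.%s\n' "$gencove_id" "$file_type" "$file_extension" ;;
  esac
}

default_filename abc123 impute-vcf vcf.gz   # prints: abc123.vcf.gz
default_filename abc123 qc json             # prints: abc123_qc.json
```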

Continuing previous downloads

When downloading, a file that already exists on the local filesystem is not overwritten if it has the same size in bytes as the file that would be downloaded. This behavior can be disabled with the --no-skip-existing flag.
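The skip rule described above amounts to a size comparison. The sketch below illustrates that rule only; it is not the CLI's actual code, and the function name should_skip and the byte counts are hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) when LOCAL_PATH exists and its
# size in bytes equals REMOTE_SIZE, i.e. when the download would be skipped.
should_skip() {
  local_path="$1"
  remote_size="$2"
  [ -f "$local_path" ] || return 1
  # Arithmetic expansion strips any padding that wc may print.
  local_size=$(( $(wc -c < "$local_path") ))
  [ "$local_size" -eq "$remote_size" ]
}

# Demonstration with a temporary 5-byte file.
tmp=$(mktemp)
printf 'hello' > "$tmp"
should_skip "$tmp" 5 && echo "skip"             # prints: skip
should_skip "$tmp" 6 || echo "download again"   # prints: download again
rm -f "$tmp"
```

Because only sizes are compared, a local file that was modified without changing its length would still be skipped under this rule.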

Downloading subsets of deliverables

$ gencove download . --sample-ids sample-id-1,sample-id-2,sample-id-3 --file-types impute-vcf,impute-tbi

The download can also be restricted in the following ways:

Download only a specific set of sample ids by providing the --sample-ids flag instead of the --project-id flag.
Download only a specific set of file types by providing the --file-types flag. Currently available file types are listed below (not all file types may be available for every project).

fastq-r1

original input FASTQ file with raw sequencing reads, containing the first read of a read pair when using paired-end sequencing

fastq-r2

original input FASTQ file with raw sequencing reads, containing the second read of a read pair when using paired-end sequencing

alignment-bam

BAM file with reads aligned to the target genome (includes all reads from original FASTQ files)

alignment-bai

BAI index file accompanying the BAM file

cnv-cnr

CNR file with bin-level log2 ratios for copy-number variation calls

cnv-cns

CNS file with segmented log2 ratios for copy-number variation calls

cnv-pdf

Portable Document Format (PDF) file with copy-number variation plot

cnv-png

Portable Network Graphics (PNG) file with copy-number variation plot (commonly used when PDFs are too large).

impute-vcf

VCF file with imputed variant calls

impute-tbi

Tabix index file accompanying the VCF file

impute-csi

CSI index file accompanying the VCF file

kraken-report

Kraken report for sequencing reads that didn't map to the target genome

ancestry-json

JavaScript Object Notation (JSON) file with ancestry estimates for subpopulations. It contains the following keys:

ancestry - contains ancestry estimates
ancestry_raw - may contain additional entries for ambiguous groupings in situations where specific subgroups cannot be consistently identified
ancestry_metadata_id - legacy key (should be disregarded)

traits-json

JSON file with polygenic risk score calculations

each key represents a polygenic score outlined in the "Data analysis configurations" section below
each polygenic score object contains the following keys:
- score - calculated value of polygenic score
- nsnp - number of single-nucleotide polymorphisms (SNPs) taken into account
- score_percentile - percentile of individual's score relative to scores calculated for individuals in the reference dataset used to generate the score

call_capture-vcf

VCF file with variant calls from target capture regions, corresponding with the deliverable labeled Target capture, VCF file in the web interface.

call_capture-csi

index accompanying the target capture VCF file, corresponding with the deliverable labeled Target capture, CSI file in the web interface.

call_capture-vcf_pathogenic

VCF file with pathogenic variant calls from target capture regions

call_capture-forced_vcf

VCF file with variant calls at a set of pre-determined variants, corresponding with the deliverable labeled Target capture (pre-defined variants), VCF file in the web interface.

call_capture-forced_csi

index accompanying the VCF file with variant calls at a set of pre-determined variants, corresponding with the deliverable labeled Target capture (pre-defined variants), CSI file in the web interface.

qc

JSON file with sample quality control metrics, containing the following quality_control_types:

format - FASTQ format validity
r1_eq_r2 - number of bases in R1 file equal to number of bases in R2 file
r1_r2_ids_match - R1 read identifiers match R2 read identifiers
bases_min - minimum number of total bases sequenced
bases_max - maximum number of total bases sequenced
bases_dedup_min - minimum number of deduplicated bases
bases_dedup_mapped_min - minimum number of deduplicated bases that have aligned to the target genome
fraction_contamination_max - maximum contamination by DNA from another sample of the same species
snps_min - number of variants in reference panel that are covered by at least one sequencing read
effective_coverage_min - minimum effective coverage
hzy_max - maximum heterozygosity
cc_min - minimum "call confidence", i.e., imputation algorithm variant calling confidence across all sites
nhref_min - minimum number of homozygous reference calls
nhet_max - maximum number of heterozygous calls
nhalt_min - minimum number of homozygous alt calls
pct_target_bases_30x_min - minimum percentage of target capture bases with 30x coverage
pathogenic_min - number of pathogenic variants detected

metadata

JSON file with user-specified metadata that has been assigned to a sample
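The --sample-ids flag shown at the start of this section takes a comma-separated list. When the IDs are kept one per line, for example in a text file, they can be joined as sketched here. The helper name join_sample_ids and the IDs are hypothetical, and the gencove command is left as a comment so the snippet runs without the CLI installed.

```shell
# Hypothetical helper: join sample IDs (one per line on stdin) into a
# single comma-separated string.
join_sample_ids() {
  paste -sd, -
}

ids=$(printf '%s\n' sample-id-1 sample-id-2 sample-id-3 | join_sample_ids)
echo "$ids"   # prints: sample-id-1,sample-id-2,sample-id-3

# The result can then be passed to the CLI, for example:
#   gencove download . --sample-ids "$ids" --file-types impute-vcf,impute-tbi
```

Blank lines in the input would produce empty entries in the list, so they should be removed first.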

Downloading checksum files

$ gencove download <local-destination-path> --project-id <project-id> --checksums

The --checksums flag also downloads a SHA-256 checksum file for each deliverable so that its integrity can be verified. For instance, alongside file.vcf.gz, a file named file.vcf.gz.sha256 is downloaded as well.

To verify the integrity of a file, run:

$ shasum -c file.vcf.gz.sha256
# or
$ sha256sum -c file.vcf.gz.sha256

This reports whether the checksum of the downloaded file matches the one provided by Gencove.

Note

Only projects that were created after July 6, 2022 have checksums available.

Warning

The CLI does NOT validate deliverables against their checksums, even when the --checksums flag is provided.
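Since validation is left to the user, every downloaded checksum file can be checked in one pass. The following is a minimal sketch rather than a Gencove-provided tool: the function names are hypothetical, and it assumes that each .sha256 file is in the format accepted by the -c option and names the deliverable stored next to it. File names containing newlines are not handled.

```shell
# Hypothetical helper: check one checksum file with whichever tool exists.
sha256_check() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum -c "$1"
  else
    shasum -a 256 -c "$1"
  fi
}

# Hypothetical helper: verify every *.sha256 file below the given directory.
# Returns non-zero if at least one check fails.
verify_checksums() {
  status=0
  while IFS= read -r sum_file; do
    [ -n "$sum_file" ] || continue
    # Run the check from the file's own directory so that the file name
    # recorded inside the checksum file resolves correctly.
    ( cd "$(dirname "$sum_file")" && sha256_check "$(basename "$sum_file")" ) || {
      echo "FAILED: $sum_file" >&2
      status=1
    }
  done <<EOF
$(find "$1" -type f -name '*.sha256')
EOF
  return "$status"
}

# Usage (the path is a placeholder):
#   verify_checksums <local-destination-path>
```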