Downloading deliverables
Gencove provides a number of deliverables for each sample that is processed as part of a project. If a sample fails quality control during processing, only the original input files are provided as deliverables.
Downloading using the CLI
The gencove download command, given a local destination path and the --project-id flag, downloads all deliverables for all samples in the specified project, with the following default naming scheme:
<local-destination-path>/<client-id>/<gencove-id>/<gencove-id>_<file-type>.<file-extension>
This naming scheme reflects the fact that uniqueness of client-id values is not enforced, while uniqueness of gencove-id values is enforced.
Customizing download naming scheme
The default naming scheme outlined above can be customized by providing the --download-template flag and a custom file naming template that may contain the {client_id}, {gencove_id}, {file_type}, {file_extension}, and {default_filename} tokens.
When using this feature, make sure to specify download templates that result in unique filenames across all samples.
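One way to check a template before downloading is to expand it for every expected file and look for repeated paths. The sketch below assumes the tokens behave like Python str.format placeholders, which this page does not state; the function names and sample values are hypothetical and are not part of the Gencove CLI.

```python
from collections import Counter


def expand_template(template, client_id, gencove_id, file_type, file_extension):
    """Expand a download template (assumption: tokens act like str.format fields)."""
    return template.format(
        client_id=client_id,
        gencove_id=gencove_id,
        file_type=file_type,
        file_extension=file_extension,
    )


def find_collisions(template, files):
    """Return the expanded paths that more than one file would be written to."""
    counts = Counter(expand_template(template, **f) for f in files)
    return sorted(path for path, n in counts.items() if n > 1)


# Hypothetical input: two different samples that share a client ID.
files = [
    {"client_id": "c1", "gencove_id": "g1", "file_type": "qc", "file_extension": "json"},
    {"client_id": "c1", "gencove_id": "g2", "file_type": "qc", "file_extension": "json"},
]

# Omitting {gencove_id} lets both samples map to the same path.
print(find_collisions("{client_id}_{file_type}.{file_extension}", files))
# Including {gencove_id} keeps the paths distinct.
print(find_collisions("{client_id}/{gencove_id}_{file_type}.{file_extension}", files))
```

Because only gencove-id uniqueness is enforced, a template that includes {gencove_id} is the simplest way to keep paths distinct across samples.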
The {default_filename} token provides access to the API's default file naming scheme, which takes into account different bioinformatics conventions across a subset of file types. Current exceptions to the default {gencove_id}_{file_type}.{file_extension} scheme are:
- fastq-r1: {gencove_id}_R1.fastq.gz
- fastq-r2: {gencove_id}_R2.fastq.gz
- alignment-bam: {gencove_id}.bam
- alignment-bai: {gencove_id}.bam.bai
- impute-vcf: {gencove_id}.vcf.gz
- impute-tbi: {gencove_id}.vcf.gz.tbi
- impute-csi: {gencove_id}.vcf.gz.csi
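The behavior of {default_filename} can be pictured as a table of exceptions with a fallback to the general scheme. The following is a minimal sketch of that lookup, not the API's actual implementation; the function name and the example file extensions are assumptions, while the patterns are copied from the list above.

```python
# Exceptions to the default scheme, copied from the list above.
DEFAULT_FILENAME_EXCEPTIONS = {
    "fastq-r1": "{gencove_id}_R1.fastq.gz",
    "fastq-r2": "{gencove_id}_R2.fastq.gz",
    "alignment-bam": "{gencove_id}.bam",
    "alignment-bai": "{gencove_id}.bam.bai",
    "impute-vcf": "{gencove_id}.vcf.gz",
    "impute-tbi": "{gencove_id}.vcf.gz.tbi",
    "impute-csi": "{gencove_id}.vcf.gz.csi",
}
DEFAULT_SCHEME = "{gencove_id}_{file_type}.{file_extension}"


def default_filename(gencove_id, file_type, file_extension):
    """Hypothetical helper: use the exception if one is listed, else the default."""
    pattern = DEFAULT_FILENAME_EXCEPTIONS.get(file_type, DEFAULT_SCHEME)
    # Unused keyword arguments are ignored by str.format.
    return pattern.format(
        gencove_id=gencove_id, file_type=file_type, file_extension=file_extension
    )


# "abc123" and the extensions below are made-up example values.
print(default_filename("abc123", "impute-vcf", "vcf.gz"))
print(default_filename("abc123", "qc", "json"))
```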
Continuing previous downloads
When downloading, a file that already exists on the local filesystem is not overwritten if it has the same size in bytes as the file that would be downloaded. This behavior can be turned off with the --no-skip-existing flag.
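The skip rule described above amounts to a size comparison. A minimal sketch of such a check follows; it is not the CLI's code, and the function name and the remote_size parameter are hypothetical.

```python
import os
import tempfile


def should_skip(local_path, remote_size):
    """Return True when a local file exists and has the expected size in bytes."""
    return os.path.isfile(local_path) and os.path.getsize(local_path) == remote_size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.vcf.gz")
    with open(path, "wb") as handle:
        handle.write(b"x" * 10)
    print(should_skip(path, 10))  # same size, so the file would be skipped
    print(should_skip(path, 11))  # size differs, so the file would be downloaded
    print(should_skip(os.path.join(tmp, "missing"), 10))  # no local file yet
```

A size match does not prove that the contents are identical, so a check like this is a shortcut for resuming interrupted downloads rather than an integrity check.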
Downloading subsets of deliverables
$ gencove download . --sample-ids sample-id-1,sample-id-2,sample-id-3 --file-types impute-vcf,impute-tbi
- Download only a specific set of sample IDs by providing the --sample-ids flag instead of the --project-id flag.
- Download only a specific set of file types by providing the --file-types flag. Currently available file types are listed below (not all file types may be available for every project).
fastq-r1
- original input FASTQ file with raw sequencing reads, containing the first read of a read pair when using paired-end sequencing
fastq-r2
- original input FASTQ file with raw sequencing reads, containing the second read of a read pair when using paired-end sequencing
alignment-bam
- BAM file with reads aligned to the target genome (includes all reads from original FASTQ files)
alignment-bai
- BAI index file accompanying the BAM file
cnv-cnr
- CNR file with bin-level log2 ratios for copy-number variation calls
cnv-cns
- CNS file with segmented log2 ratios for copy-number variation calls
cnv-pdf
- Portable Document Format (PDF) file with copy-number variation plot
cnv-png
- Portable Network Graphics (PNG) file with copy-number variation plot (commonly used when PDFs are too large).
impute-vcf
- VCF file with imputed variant calls
impute-tbi
- Tabix index file accompanying the VCF file
impute-csi
- CSI index file accompanying the VCF file
kraken-report
- Kraken report for sequencing reads that didn't map to the target genome
ancestry-json
- JavaScript Object Notation (JSON) file with ancestry estimates for subpopulations, containing the following keys:
  - ancestry: contains ancestry estimates
  - ancestry_raw: may contain additional entries for ambiguous groupings in situations where specific subgroups cannot be consistently identified
  - ancestry_metadata_id: legacy key (should be disregarded)
traits-json
- JSON file with polygenic risk score calculations
  - each key represents a polygenic score outlined in the "Data analysis configurations" section below
  - each polygenic score object contains the following keys:
    - score: calculated value of polygenic score
    - nsnp: number of single-nucleotide polymorphisms (SNPs) taken into account
    - score_percentile: percentile of individual's score relative to scores calculated for individuals in the reference dataset used to generate the score
call_capture-vcf
- VCF file with variant calls from target capture regions, corresponding to the deliverable labeled "Target capture, VCF file" in the web interface
call_capture-csi
- index accompanying the target capture VCF file, corresponding to the deliverable labeled "Target capture, CSI file" in the web interface
call_capture-vcf_pathogenic
- VCF file with pathogenic variant calls from target capture regions
call_capture-forced_vcf
- VCF file with variant calls at a set of predetermined variants, corresponding to the deliverable labeled "Target capture (pre-defined variants), VCF file" in the web interface
call_capture-forced_csi
- index accompanying the VCF file with variant calls at a set of predetermined variants, corresponding to the deliverable labeled "Target capture (pre-defined variants), CSI file" in the web interface
qc
- JSON file with sample quality control metrics, containing the following quality_control_type values:
  - format: FASTQ format validity
  - r1_eq_r2: number of bases in R1 file equal to number of bases in R2 file
  - r1_r2_ids_match: R1 read identifiers match R2 read identifiers
  - bases_min: minimum number of total bases sequenced
  - bases_max: maximum number of total bases sequenced
  - bases_dedup_min: minimum number of deduplicated bases
  - bases_dedup_mapped_min: minimum number of deduplicated bases that have aligned to the target genome
  - fraction_contamination_max: maximum contamination by DNA from another sample of the same species
  - snps_min: number of variants in reference panel that are covered by at least one sequencing read
  - effective_coverage_min: minimum effective coverage
  - hzy_max: maximum heterozygosity
  - cc_min: minimum "call confidence", i.e., imputation algorithm variant calling confidence across all sites
  - nhref_min: minimum number of homozygous reference calls
  - nhet_max: maximum number of heterozygous calls
  - nhalt_min: minimum number of homozygous alt calls
  - pct_target_bases_30x_min: minimum percentage of target capture bases with 30x coverage
  - pathogenic_min: number of pathogenic variants detected
metadata
- JSON file with user-specified metadata that has been assigned to a sample
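When the download is scripted, it can help to check the requested file types against the list above before invoking the CLI. The sketch below only assembles the argument list for the command shown at the start of this section; the helper name is hypothetical, and the set of known file types is simply transcribed from this page, so it may lag behind what a given project offers.

```python
# File types transcribed from the list above.
KNOWN_FILE_TYPES = {
    "fastq-r1", "fastq-r2", "alignment-bam", "alignment-bai",
    "cnv-cnr", "cnv-cns", "cnv-pdf", "cnv-png",
    "impute-vcf", "impute-tbi", "impute-csi",
    "kraken-report", "ancestry-json", "traits-json",
    "call_capture-vcf", "call_capture-csi", "call_capture-vcf_pathogenic",
    "call_capture-forced_vcf", "call_capture-forced_csi",
    "qc", "metadata",
}


def build_download_command(destination, sample_ids, file_types):
    """Hypothetical helper: validate file types and build the argument list."""
    unknown = sorted(set(file_types) - KNOWN_FILE_TYPES)
    if unknown:
        raise ValueError(f"unknown file types: {', '.join(unknown)}")
    return [
        "gencove", "download", destination,
        "--sample-ids", ",".join(sample_ids),
        "--file-types", ",".join(file_types),
    ]


command = build_download_command(
    ".",
    ["sample-id-1", "sample-id-2", "sample-id-3"],
    ["impute-vcf", "impute-tbi"],
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run; building it as a list rather than a single string avoids shell quoting issues.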
Downloading checksum files
Include sha256 checksum files in the download to verify that deliverables are valid. For instance, for the file file.vcf.gz, a file named file.vcf.gz.sha256 will be downloaded as well.
To verify the integrity of a file, compute its sha256 checksum and compare it with the contents of the accompanying .sha256 file. This shows whether the checksum of the downloaded file matches the one provided by Gencove.
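A check of this kind can also be scripted. The following is a minimal sketch using Python's standard hashlib module; it assumes that the .sha256 file starts with the hexadecimal digest (optionally followed by whitespace and a file name, as in sha256sum output), which this page does not confirm. The function names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, checksum_path=None):
    """Compare a file's digest with the one stored in <path>.sha256."""
    checksum_path = checksum_path or path + ".sha256"
    with open(checksum_path) as handle:
        # Assumption: the first whitespace-separated token is the hex digest.
        expected = handle.read().split()[0].lower()
    return sha256_of(path) == expected


# Self-contained demonstration with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    deliverable = os.path.join(tmp, "file.vcf.gz")
    with open(deliverable, "wb") as handle:
        handle.write(b"example contents")
    with open(deliverable + ".sha256", "w") as handle:
        handle.write(hashlib.sha256(b"example contents").hexdigest() + "\n")
    print(checksum_matches(deliverable))
```

Reading in chunks keeps memory use constant, which matters for large deliverables such as BAM files.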
Note
Only projects that were created after July 6, 2022 have checksums available.
Warning
The CLI does NOT validate deliverables against their checksums, even when the checksum flag is provided.