FAQ¶

General¶

Which browsers does Gencove support?¶

The Gencove dashboard supports Safari, Chrome, Firefox, and Edge. Generally, any modern browser should work without issues. If you encounter compatibility issues, please let us know at support@gencove.com.

How long does it take to get my results back?¶

Results are returned within 6-8 weeks from the time samples arrive in our lab. Let us know if you have special requirements for turnaround time and we'll work with you to make sure your samples are done on time!

How do you deliver the resulting data?¶

Data can be accessed via the Gencove data management website or REST API. For downloading data in bulk, we recommend using the API or our command-line tool.

Do you have an API?¶

Yes, the API can be used to track sample status and automate data delivery. Check out the docs.

Which sequencing machines do you use for low-pass sequencing?¶

We always use the latest Illumina and BGI sequencing machines. Depending on the specifics of your project, we'll use Illumina (NovaSeq, HiSeq X, NextSeq) or BGI (DNBseq) sequencers.

How does sample naming work?¶

Gencove projects contain samples, and each sample has both 1) a unique Gencove identifier, and 2) a user-supplied client ID. Samples are derived from FASTQ uploads or imports.

Gencove ID
- This is a UUID value and is guaranteed to be unique. Every sample in a project will have a unique Gencove ID. This value is automatically generated by the Gencove platform when a sample is created.
Client ID
- This value is supplied by users OR derived from the FASTQ file names used to create the sample. This value is not guaranteed to be unique.

Client IDs are primarily derived from input FASTQ file names. When deciding on names for FASTQ files, consider how they will translate into final client IDs. Note that client IDs are not required to be unique; duplicate values are allowed.
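
Because duplicate client IDs are allowed, it can be useful to check a batch for repeats before uploading. The following is a minimal sketch, not part of any Gencove tooling; the function name and the example IDs are hypothetical.

```python
from collections import Counter


def find_duplicate_client_ids(client_ids):
    """Return a dict of client IDs that occur more than once, with their counts."""
    counts = Counter(client_ids)
    return {cid: n for cid, n in counts.items() if n > 1}


# Hypothetical client IDs, for illustration only.
ids = ["sampleA", "sampleB", "sampleA", "sampleC"]
print(find_duplicate_client_ids(ids))
```

An empty result means every client ID in the batch is distinct.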

Misnaming files can lead to issues in automatically pairing paired-end reads. Please see the file naming convention section for details on how to properly name FASTQ files.

In cases where users would prefer to supply custom client IDs, a CSV mapping file can be created, linking FASTQ files and custom client IDs. For details, see the custom file names section.

Technical¶

What tools can I use to parse these VCFs?¶

We recommend using bcftools and htslib to parse VCFs. Bindings for htslib are available in a wide variety of programming languages.

What is the meaning of the different fields in my VCF?¶

Please see the VCF specification for details on the technical specifications of the Variant Call Format (VCF).

What is the meaning of the `FORMAT` fields `GT:RC:AC:GP:DS` in my VCF?¶

These fields are defined in the header of the VCF, and are copied below for convenience:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage">
##FORMAT=<ID=RC,Number=1,Type=Integer,Description="Count of reads with REF allele">
##FORMAT=<ID=AC,Number=1,Type=Integer,Description="Count of reads with ALT allele">
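
As an illustration of how these fields fit together, the sketch below splits one sample column according to the `FORMAT` string and converts each value to the type declared in the header. It is plain Python and assumes that every field is present (no `.` missing values). The sample values are made up. The `expected_dosage` helper uses the conventional definition of dosage at a diploid, biallelic site; this is an assumption, and the source does not state how `DS` is computed.

```python
def parse_sample_field(format_str, sample_str):
    """Pair FORMAT keys with one sample's values and convert them to typed values."""
    raw = dict(zip(format_str.split(":"), sample_str.split(":")))
    return {
        "GT": raw["GT"],                                  # genotype string, e.g. "0/1"
        "RC": int(raw["RC"]),                             # reads with REF allele
        "AC": int(raw["AC"]),                             # reads with ALT allele
        "GP": [float(x) for x in raw["GP"].split(",")],   # genotype probabilities
        "DS": float(raw["DS"]),                           # alternate allele dosage
    }


def expected_dosage(gp):
    """Conventional dosage for a diploid, biallelic site: P(het) + 2 * P(hom-alt).

    Assumption: gp holds three probabilities ordered as genotypes 0, 1, 2.
    """
    return gp[1] + 2.0 * gp[2]


# Made-up values for illustration; not taken from a real VCF.
fields = parse_sample_field("GT:RC:AC:GP:DS", "0/1:1:2:0.02,0.95,0.03:1.01")
print(fields)
print(round(expected_dosage(fields["GP"]), 2))
```

For production use, bcftools or an htslib binding is a more robust choice than hand-parsing.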

What does the `LOWCONF` filter mean for a variant?¶

The LOWCONF filter indicates that the variant was not imputed with high confidence. More precisely, it means that none of the posterior probabilities for the possible genotypes (0, 1, or 2) for that variant exceeded 0.9.
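
The rule above can be expressed as a short check on the `GP` values of a record. This is a sketch of the stated criterion only, not the pipeline's actual code; the function name and the example probabilities are hypothetical.

```python
def is_low_confidence(gp, threshold=0.9):
    """Return True when no genotype posterior probability exceeds the threshold."""
    return max(gp) <= threshold


# Made-up genotype probabilities for genotypes 0, 1, and 2.
print(is_low_confidence([0.02, 0.95, 0.03]))  # False: 0.95 exceeds 0.9
print(is_low_confidence([0.45, 0.50, 0.05]))  # True: no value exceeds 0.9
```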

Can I get access to the variants called directly from the BAMs before imputation?¶

Unlike other variant calling + imputation workflows, we do not perform an intermediate variant calling step before imputation. Rather, we impute genotypes directly from the reads; as such, there are no intermediate genotypes or genotype likelihoods generated.

How do you calculate contamination for a sample?¶

Contamination is estimated via examination of reads mapping to the mitochondrial genome. More precisely, reads deviating from the consensus sequence at rare alleles present in the target individual are used to estimate the proportion of reads deriving from a contaminating individual, and therefore the contamination present in the overall sample.

Why are there multiple entries for a given chromosome and position in my VCF? Are these duplicate variants?¶

These are not duplicate variants but multi-allelic sites (i.e., sites with more than one alternate allele) split into multiple biallelic records (i.e., one alternate allele per line). Multiple such records representing a single multi-allelic site can be joined into a single multi-allelic record using standard tools such as bcftools or vt.
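
To see which positions in a VCF carry several biallelic records, the records can be grouped by chromosome and position. The sketch below only identifies such sites; it does not merge genotypes, which is what `bcftools norm` with its `--multiallelics` option is typically used for. The function name and the example records are hypothetical.

```python
from collections import defaultdict


def group_by_site(records):
    """Group (chrom, pos, ref, alt) tuples by (chrom, pos).

    Returns only the sites that have more than one record.
    """
    sites = defaultdict(list)
    for chrom, pos, ref, alt in records:
        sites[(chrom, pos)].append((ref, alt))
    return {site: alleles for site, alleles in sites.items() if len(alleles) > 1}


# Made-up records: two alternate alleles at chr1:100, one at chr1:200.
records = [
    ("chr1", 100, "A", "C"),
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "A"),
]
print(group_by_site(records))
```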

What is this `*` allele in my VCF?¶

The * allele denotes a spanning deletion. For more information, please see this article from the Broad Institute and item number 5 on page 8 of the VCF specification.

How do you impute genotypes when there are no reads at a site?¶

Genotype imputation takes advantage of shared haplotypes between a target individual and haplotypes within a reference panel. Imputation estimates partially unobserved haplotypes in the target individual, so its results are based not only on local evidence but also on non-local evidence, such as reads at linked sites.

For a more detailed description of the statistical model underlying imputation from low-pass sequencing, please see the Supplementary Note in Wasik et al. 2021.

What program is used for genotype imputation?¶

For most production pipelines, GLIMPSE is used for imputation from low-pass sequence data.

For a subset of production pipeline configurations, loimpute is used for imputation. For a more detailed description of the statistical model implemented by loimpute, please see the Supplementary Note in Wasik et al. 2021. A copy of loimpute can be requested for academic use at this link.

What tool is used for CNV calling?¶

We use CNVkit in wgs mode to call CNVs from low-pass sequence data. An explanation of the output file formats can be found at the following link.

Tool Versions¶

We use the following versions of open-source bioinformatics tools in our production pipelines:

samtools: v1.8
htslib: v1.8
bcftools: v1.8
bedtools: v2.26.0
bwa: v0.7.17
cnvkit: v0.9.6
glimpse: v1.0.0
kraken: v1.1