Introduction
Welcome to the Gencove API docs!
The Gencove REST API makes it easy to:
- upload low-pass sequencing FASTQ files to the Gencove analysis pipeline
- download analysis results
- track sample status
- automate data delivery.
Read on to get started and try out the examples on the side along the way.
Additional documentation is available here:
- API reference for publicly available endpoints: API reference
- Command-line interface (CLI) tool reference: CLI reference
Gencove data
Genomic data is organized into “projects”. Each project contains “samples”. Each sample has an id (generated by Gencove) and a client_id (provided to Gencove by clients).
In most cases, a user account and project will be created for you by our team.
If you would like to explore the Gencove data delivery dashboard yourself, you can create an account and a project as follows:
- Create a free Gencove account using the dashboard
- Create a project by going to My Projects -> Add new project
The Gencove CLI
The Gencove command-line interface (CLI) can be used to easily access the API.
It is mostly used for:
- Uploading FASTQ files for analysis
- Downloading analysis results
Quickstart
$ pip install gencove
$ gencove upload <local-directory-path>
Install the Gencove CLI using the Python package manager pip and upload files to your Gencove account.
Hint: for the newest pre-release versions, check: PyPI
Video demo
Setup
Installation
$ pip install gencove
The Gencove CLI can be installed using the Python package manager pip. The source code is available on GitLab.
Python and pip are preinstalled on most Mac and Linux systems. If you do need to install Python, commonly used instructions are available here.
In production environments, we highly recommend using virtualenv and/or virtualenvwrapper for installing the Gencove CLI in a dedicated Python environment.
Mac OS notes
Due to a known issue with the Python that ships with Mac OS, the Gencove CLI should be installed in the user’s home directory (not system-wide) as follows: pip install --user gencove
Make sure that ~/bin is present in your $PATH environment variable.
For advanced users, we highly recommend virtualenvwrapper and installing the Gencove CLI within a dedicated virtualenv.
If you absolutely must install the Gencove CLI system-wide using sudo, the following command can be used as a last resort: sudo pip install gencove --ignore-installed six
Configuration
$ export GENCOVE_EMAIL='<your-email>'
$ export GENCOVE_PASSWORD='<your-password>'
Your credentials can be provided to the Gencove CLI via the environment variables $GENCOVE_EMAIL and $GENCOVE_PASSWORD.
Uploading FASTQ files
To enable FASTQ uploads for your account, log into your account and go to My FASTQs, where instructions will be provided (in case you do not already have access). You can expect a response from Gencove support within 24 hours.
Once uploads are enabled, users can upload files to the Gencove upload area using the Gencove CLI and assign the files to projects using the Gencove Dashboard. Once files are assigned to a project, they will be processed by the Gencove analysis pipeline. Analysis results will be available via the Gencove API and Dashboard once analysis is complete.
Naming files
We highly recommend using the standard Illumina naming convention for FASTQ files. If files are named in this manner, Gencove systems will automatically detect:
- the sample identifier (and use it as the sample’s client_id)
- R1/R2 designations of files
A summary of the naming convention is:
SAMPLE ID + _ + … + _ + (R1 or R2) + _ + … + .fastq.gz
The table below shows file names that use this convention, together with the sample identifiers and read designations detected from them:
File name | Sample ID | Read pair |
---|---|---|
SAMPLE1_R1.fastq.gz | SAMPLE1 | R1 |
SAMPLE1_R2.fastq.gz | SAMPLE1 | R2 |
SAMPLE2_LANE1_SEQUENCER1_R1.fastq.gz | SAMPLE2 | R1 |
SAMPLE3_R1_L001.fastq.gz | SAMPLE3 | R1 |
SAMPLE4_R1.fq.gz | SAMPLE4 | R1 |
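To check file names locally before uploading, the convention above can be approximated in a few lines of Python. This is a minimal sketch, not the detection logic that Gencove systems actually run: the helper name parse_fastq_name is hypothetical, and treating the text before the first underscore as the sample identifier is an assumption based on the table rows.

```python
import re

# Extensions accepted by this sketch (both appear in the table above).
_FASTQ_SUFFIX = re.compile(r"\.(fastq|fq)\.gz$")


def parse_fastq_name(file_name):
    """Return (sample_id, read) for a file name, or None if it does not fit.

    Hypothetical helper: it mirrors the convention summarized above and is
    not the detection logic used by Gencove systems.
    """
    stem, replaced = _FASTQ_SUFFIX.subn("", file_name)
    if not replaced:
        return None  # not a gzipped FASTQ file name
    sample_id, *rest = stem.split("_")
    # The read designation is the first R1/R2 token after the sample ID.
    read = next((token for token in rest if token in ("R1", "R2")), None)
    if not sample_id or read is None:
        return None
    return sample_id, read


for name in ["SAMPLE1_R1.fastq.gz", "SAMPLE2_LANE1_SEQUENCER1_R1.fastq.gz"]:
    print(name, parse_fastq_name(name))
```

Names that return None here are worth renaming before upload, since they may not be detected automatically.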
Grouping files
Gencove software currently supports one pair of FASTQ files per sample. If you have reads spread across multiple files (e.g., from multiple sequencing lanes), they need to be merged into one R1 file and one R2 file. Luckily, gzipped files can be easily merged without decompressing, so it’s simply a matter of concatenating the compressed files.
Splitting files can also be avoided upstream in the demultiplexing phase by providing the --no-lane-splitting flag to bcl2fastq.
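Merging gzipped lane files by concatenation can be sketched as follows. The function name merge_gzipped and the file names in the commented usage lines are hypothetical; the approach relies only on the fact that a gzip file may contain several concatenated members.

```python
import shutil


def merge_gzipped(parts, merged_path):
    """Concatenate gzipped files byte-for-byte into merged_path.

    A gzip file may consist of several concatenated members, so the result
    decompresses to the inputs' contents in order. Nothing is decompressed
    or recompressed here.
    """
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as source:
                shutil.copyfileobj(source, merged)


# Hypothetical lane-split files for one sample; uncomment to run.
# merge_gzipped(["S1_L001_R1.fastq.gz", "S1_L002_R1.fastq.gz"], "S1_R1.fastq.gz")
# merge_gzipped(["S1_L001_R2.fastq.gz", "S1_L002_R2.fastq.gz"], "S1_R2.fastq.gz")
```

The R1 and R2 parts should be listed in the same lane order so that read pairs stay aligned between the two merged files.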
Uploading using the CLI
gencove upload <source-path> [<destination-path>]
Syncs local directories to directories in your Gencove upload area. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.
$ gencove upload my-fastq-files/
The example command will recursively copy all files in the my-fastq-files/ directory on your host system to a directory with an automatically generated name in the Gencove upload area.
$ gencove upload my-fastq-files/ gncv://my-fastq/batch-1/
In case more control is needed over the upload destination, a destination path prefixed with gncv:// may be provided. This pattern is commonly used for separating upload batches when continuously uploading data to your Gencove account and is useful for easily filtering files in the Gencove Dashboard. A common directory structure for batching uploads is:
gncv://<project-name>/<batch-name>/
Details of upload behavior:
- In case a file in the local directory already exists in the destination, it will not be overwritten
- In case a file exists in the destination, but not the local directory, it will not be deleted
Automatically starting analysis
$ gencove upload my-fastq-files/ gncv://my-fastq/batch-1/ --run-project-id b1edbb20-ee77-4be0-9944-e8e3a593cc83
To automatically assign uploads to a project and run analysis, provide the --run-project-id flag and the destination project id to the Gencove CLI.
When utilizing this feature, make sure uploaded files are named according to the file naming convention outlined above to avoid sample identifier detection issues.
Downloading deliverables
Gencove provides a number of deliverables for each sample that is processed as part of a project. In case a sample fails processing due to quality control, only the original input files are provided as deliverables.
Downloading using the CLI
$ gencove download . --project-id my-project-id
gencove download <local-destination-path> --project-id <project-id>
Downloads all deliverables for all samples in the specified project, with the following default naming scheme:
<local-destination-path>/<client-id>/<gencove-id>/<gencove-id>_<file-type>.<file-extension>
This naming scheme reflects the fact that uniqueness of client-ids is not enforced, while uniqueness of gencove-ids is enforced.
Customizing download naming scheme
$ gencove download . --project-id my-project-id --download-template '{client_id}_'
The default naming scheme outlined above can be customized by providing the --download-template flag and a custom file naming template that may contain {client_id} and {gencove_id} tokens.
The Gencove CLI will always append <file-type>.<file-extension> to any template provided by the user.
When using this feature, make sure to specify download templates that result in unique filenames across all samples.
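One way to check a template for uniqueness before downloading is to expand it for every sample and look for repeated names. The sketch below is illustrative only: the helper names, the sample tuples, and the vcf.gz extension are hypothetical, and it assumes the tokens are substituted as plain strings with <file-type>.<file-extension> appended directly after the result.

```python
from collections import Counter


def download_file_name(template, client_id, gencove_id, file_type, extension):
    """Expand a download template for one deliverable.

    Assumption: tokens are substituted as plain strings and
    "<file-type>.<file-extension>" is appended directly after the result.
    """
    prefix = template.format(client_id=client_id, gencove_id=gencove_id)
    return "{}{}.{}".format(prefix, file_type, extension)


def colliding_names(template, samples, file_type, extension):
    """Return file names that more than one sample would produce."""
    names = Counter(
        download_file_name(template, client_id, gencove_id, file_type, extension)
        for client_id, gencove_id in samples
    )
    return sorted(name for name, count in names.items() if count > 1)


# Hypothetical samples: two share a client_id, which is allowed.
samples = [("animal-7", "id-a"), ("animal-7", "id-b"), ("animal-9", "id-c")]
print(colliding_names("{client_id}_", samples, "impute-vcf", "vcf.gz"))
print(colliding_names("{client_id}_{gencove_id}_", samples, "impute-vcf", "vcf.gz"))
```

Because client_id values are not guaranteed to be unique, a template built only from {client_id} can collide; including {gencove_id} avoids this.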
Continuing previous downloads
When downloading, a file that already exists on the local filesystem is not overwritten if it has the same size in bytes as the file that would be downloaded. This behavior can be changed with the --no-skip-existing flag.
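The skip rule described above amounts to a simple size comparison. The following is a minimal sketch of that rule, not the CLI's implementation; the function name and the remote_size_bytes parameter are hypothetical, and the remote size is assumed to be known before the download starts.

```python
import os


def should_skip_download(local_path, remote_size_bytes):
    """True when a local file exists with the same size as the remote file.

    Sketch of the skip rule described above; remote_size_bytes is assumed to
    be known before downloading (how the CLI obtains it is not covered here).
    """
    return (
        os.path.isfile(local_path)
        and os.path.getsize(local_path) == remote_size_bytes
    )
```

Under this rule a partially downloaded file has a different size and is fetched again, which is what makes interrupted downloads resumable. A size match does not verify file contents.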
Downloading subsets of deliverables
$ gencove download . --sample-ids sample-id-1,sample-id-2,sample-id-3 --file-types impute-vcf,impute-tbi
The behavior of the download can also be adjusted in the following ways:
- Download only a specific set of sample ids by providing the --sample-ids flag instead of the --project-id flag
- Download only a specific set of file types by providing the --file-types flag. Currently available file types are listed below (not all file types may be available for every project).
fastq-r1
fastq-r2
alignment-bam
alignment-bai
cnv-cnr
cnv-cns
cnv-pdf
impute-vcf
impute-tbi
impute-csi
ancestry-json
kraken-report
traits-scores
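When download commands are generated by a script, it can help to validate the requested file types against the list above before invoking the CLI. The helper below is a hypothetical sketch that only assembles the flags shown in this section and does not run anything.

```python
# File types copied from the list above.
KNOWN_FILE_TYPES = {
    "fastq-r1", "fastq-r2", "alignment-bam", "alignment-bai",
    "cnv-cnr", "cnv-cns", "cnv-pdf", "impute-vcf", "impute-tbi",
    "impute-csi", "ancestry-json", "kraken-report", "traits-scores",
}


def build_download_args(destination, sample_ids, file_types):
    """Build an argument list for `gencove download` using --sample-ids.

    Hypothetical helper: it only assembles the flags shown above and does
    not call the CLI.
    """
    unknown = sorted(set(file_types) - KNOWN_FILE_TYPES)
    if unknown:
        raise ValueError("unknown file types: {}".format(", ".join(unknown)))
    return [
        "gencove", "download", destination,
        "--sample-ids", ",".join(sample_ids),
        "--file-types", ",".join(file_types),
    ]


args = build_download_args(
    ".", ["sample-id-1", "sample-id-2"], ["impute-vcf", "impute-tbi"]
)
print(" ".join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) once the CLI is installed and configured.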
Automated data delivery
Data delivery can be automated using webhooks.
Users can specify a webhook URL for a project. Once a webhook is specified, events relating to that project will be submitted to the webhook URL in JSON format via HTTP POST requests. The content of the webhook contains the following keys:
- event, describing the type of event
- payload, containing webhook contents
- object_id, a unique identifier for the originating object of the webhook
- timestamp
Together, object_id and event should be considered unique, and duplicates should be handled by the receiver.
{
  "event": "analysis_complete",
  "object_id": "99573a16-98a8-48fc-8caf-e3b4dcdf34e6",
  "timestamp": "2018-11-18T14:09:59.741183",
  "payload": {
    "project_id": "1d6daca6-475a-4961-9841-57aac36cbd0f",
    "sample_ids": [
      "45273390-64bd-4a07-a1be-8514d3ba7750"
    ]
  }
}
Currently, the following events are available:
- analysis_complete: this event describes the completion of analysis on a batch of samples belonging to a project. The webhook payload will contain the respective project_id and a list of sample_ids.
Once a webhook is received, the receiver is responsible for querying the Gencove API for more details on each object that is referenced. For example, upon receiving an analysis_complete webhook for a project, the receiver should query the Gencove API for sample details, status, and fresh download URLs for deliverables related to those samples.
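A receiver can be sketched as a function that parses the request body, drops duplicates keyed on object_id and event, and returns the sample ids that still need to be looked up. The function name and the in-memory set are assumptions made for illustration; a production receiver would persist the seen keys and perform the follow-up API queries itself.

```python
import json

_seen = set()  # in-memory only; a real receiver would persist this


def handle_webhook(body):
    """Parse a webhook body and return sample ids to look up.

    Returns an empty list for duplicate deliveries and for event types this
    sketch does not know. The follow-up Gencove API queries are left to the
    caller.
    """
    message = json.loads(body)
    key = (message["object_id"], message["event"])
    if key in _seen:
        return []  # duplicate delivery, already handled
    _seen.add(key)
    if message["event"] == "analysis_complete":
        return list(message["payload"]["sample_ids"])
    return []  # unknown event types are ignored in this sketch
```

Calling handle_webhook twice with the same body returns the sample ids the first time and an empty list the second time.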
Webhook signatures
Gencove can optionally sign webhook events it sends to endpoints. This is done by including a signature in each event’s Gencove-Signature header, allowing you to verify that the events were sent by Gencove.
Before verifying signatures, webhooks need to be enabled and the secret needs to be retrieved for each project via the Gencove API (API reference). Note that each project has a separate unique secret.
After this setup, Gencove automatically starts signing each webhook event it sends to the endpoint of the related project.
Verifying webhook signatures
The Gencove-Signature header contains a timestamp and a signature:
- the timestamp is an integer representing UNIX time and is prefixed by t=
- the signature is prefixed by a scheme, which starts with v and is followed by an integer. Currently, the only valid signature scheme is v1.
Example signature: Gencove-Signature: t=1492774577,v1=5257a869e7ecebeda32affa62cdca3fa51cad7e77a0e56ff536d0ce8e108d8bd
Gencove generates signatures using a hash-based message authentication code (HMAC) with SHA-512. To prevent downgrade attacks, you should ignore all schemes that are not v1.
export SECRET='super-secret'
export TIMESTAMP='123456'
export PAYLOAD='{"k":"v"}'
python3 -c \
'import hmac, hashlib, os; print(hmac.new(os.environ["SECRET"].encode("utf-8"), "{}.{}".format(os.environ["TIMESTAMP"], os.environ["PAYLOAD"]).encode("utf-8"), hashlib.sha512).hexdigest())'
import hmac, hashlib

def calculate_signature(secret, timestamp, payload):
    signature_message = "{}.{}".format(timestamp, payload).encode("utf-8")
    return hmac.new(
        secret.encode("utf-8"),
        signature_message,
        hashlib.sha512
    ).hexdigest()
Step 1: Extract the timestamp and signatures from the header
Split the header, using the , character as the separator, to get a list of elements. Then split each element, using the = character as the separator, to get a prefix and value pair.
The value for the prefix t corresponds to the timestamp, and v1 corresponds to the signature.
Step 2: Prepare signature_message
This is achieved by concatenating:
- The timestamp (as a string)
- The character .
- The actual JSON payload (i.e., the request’s body)
Step 3: Determine the expected signature
Compute an HMAC with the SHA-512 hash function. Use the endpoint’s signing secret as the key and the signature_message string as the message.
Step 4: Compare signatures
Compare the signature in the header to the expected signature. If a signature matches, compute the difference between the current timestamp and the received timestamp, then decide if the difference is within your tolerance.
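The four steps can be combined into a single check, sketched below. The function name verify_signature, the tolerance_seconds default of 300, and the now parameter (used to make the check testable) are illustrative assumptions rather than values prescribed by Gencove; body is expected to be the raw request body, unmodified.

```python
import hashlib
import hmac
import time


def verify_signature(header, body, secret, tolerance_seconds=300, now=None):
    """Return True if the Gencove-Signature header value matches the body.

    Sketch of steps 1-4; tolerance_seconds and now are illustrative
    parameters, not values prescribed by Gencove.
    """
    # Step 1: split "t=...,v1=..." into prefix/value pairs.
    timestamp = None
    signatures = []
    for element in header.split(","):
        prefix, _, value = element.strip().partition("=")
        if prefix == "t":
            timestamp = value
        elif prefix == "v1":  # ignore every scheme other than v1
            signatures.append(value)
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False

    # Steps 2 and 3: HMAC-SHA512 over "<timestamp>.<body>".
    message = "{}.{}".format(timestamp, body).encode("utf-8")
    expected = hmac.new(
        secret.encode("utf-8"), message, hashlib.sha512
    ).hexdigest()

    # Step 4: constant-time comparison, then the timestamp tolerance check.
    matches = any(
        hmac.compare_digest(expected.encode("utf-8"), s.encode("utf-8"))
        for s in signatures
    )
    if not matches:
        return False
    current = time.time() if now is None else now
    return abs(current - int(timestamp)) <= tolerance_seconds
```

Using hmac.compare_digest rather than == avoids leaking how many leading characters matched through timing differences.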
Testing environment
Developers may use the Gencove staging environment for development and testing.
The staging developer website URL is: https://web-stage.gencove.com
The staging API URL is: https://api-stage.gencove.com
Data analysis configurations
Each Gencove project is pinned to a ‘configuration’ that specifies the species, reference datasets (e.g. a reference genome and haplotype reference panel), and specific deliverables. These configurations can be private to a specific set of individuals, or public. The datasets underlying the public configurations are as follows:
- Chicken low-pass
  - Reference genome: galGal6
  - Imputation reference panel: Sequencing data was downloaded from the European Nucleotide Archive project PRJEB30270. These data are described in detail in Qanbari et al. (2019). Raw sequences were processed using GATK4 and 26M variants were identified and converted into a haplotype reference panel.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Cattle low-pass
  - Reference genome: ARS-UCD1.2
  - Imputation reference panel: We used sequence data from 484 samples primarily from B. taurus breeds and processed these data using GATK4 into a reference panel of 49M bi-allelic SNPs.
  - Breed analysis reference panel: We report ancestry proportions from 13 breeds: Angus, Brahman, Charolais, Gelbvieh, Hereford, Holstein, Jersey, Limousin, Red Angus, Simmental, Braunvieh, Santa Gertrudis, and Shorthorn.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Cattle low-pass v2
  - Reference genome: ARS-UCD1.2
  - Imputation reference panel: We used sequence data from 946 samples from B. taurus and B. indicus-related breeds and processed these data using GATK4 into a reference panel of 70M bi-allelic SNPs.
  - Breed analysis reference panel: We report ancestry proportions from 12 breeds: Angus (including red and black Angus), Brahman, Charolais, Gelbvieh, Hereford, Holstein, Jersey, Limousin, Maine Anjou, Simmental, Braunvieh, and Shorthorn.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Dog low-pass
  - Reference genome: canFam3
  - Imputation reference panel: A reference panel of 435 sequenced dogs and 46M sites.
  - Breed analysis reference panel: This panel contains data from 91 breeds.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Dog low-pass v2
  - Reference genome: canFam3
  - Imputation reference panel: A reference panel of 676 sequenced dogs and 53M sites.
  - Breed analysis reference panel: This panel contains data from 91 breeds.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Human low-pass
  - Reference genome: hs37-1kg
  - Imputation reference panel: 1000 Genomes Phase 3, after removing all sites with a minor allele count less than three, with more than two alleles, or on the sex chromosomes.
  - Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, polygenic scores
- Human low-pass v2
  - Reference genome: hs37-1kg
  - Imputation reference panel: 1000 Genomes Phase 3. Relative to v1, this includes all sites (including normalized multiallelic sites and the X chromosome).
  - Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here.
  - Polygenic risk scores:
    - Coronary artery disease: Inouye et al. 2018
    - Breast cancer: Mavaddat et al. 2018
    - Prostate cancer: Schumacher et al. 2018
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, polygenic scores
- Maize low-pass
  - Reference genome: AGPv4
  - Imputation reference panel: Maize 282 association panel genotypes (7x, AGPv4 coordinates)
  - Strain reference panel: Each strain in the imputation reference panel was considered a separate population. Those that appear particularly similar are merged downstream into related groups such as “NC262RELATED”.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), strain analysis
- Mouse low-pass
  - Reference genome: GRCm38_68
  - Imputation reference panel: The Mouse Genomes Project contains ~59M SNPs discovered in 36 sequenced lines.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Rat low-pass
  - Reference genome: rn6
  - Imputation reference panel: 42 rat genomes described in Hermsen et al. (2015) and lifted over to rn6, containing 8.7M variants.
  - Strain analysis panel: The same 42 rat genomes used for the imputation reference panel.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), strain analysis
- Soy low-pass
  - Reference genome: Wm82.a2.v1
  - Imputation reference panel: GmHapMap contains ~11M bi-allelic SNPs identified in around 1000 sequenced accessions.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Domestic cat low-pass
  - Reference genome: felCat9
  - Imputation reference panel: We used 78 WGS samples from Felis catus breeds from the 99lives project and processed these data using GATK4 into a reference panel of 49M SNPs and small indels.
  - Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
API Reference
The full API reference for publicly available endpoints is available here: API reference
CLI Reference
The full CLI reference is available here: CLI reference
Terms
We reserve the right to remove your access to our API for any reason at our sole discretion.
FAQ
Support
Contact us at support@gencove.com