The Gencove CLI

The Gencove command-line interface (CLI) can be used to easily access the API.

It is mostly used for:

  1. Uploading FASTQ files for analysis
  2. Downloading analysis results

Quickstart

$ pip install gencove
$ gencove upload <local-directory-path>

Install the Gencove CLI using the Python package manager pip and upload files to your Gencove account.

For more detailed installation instructions, please see the Installation section below.

Setup

Installation

$ pip install gencove

Warning

Please note that Python 2 has reached its end of life and we highly recommend using Python 3. The Gencove CLI will not work on Python 2. For further details about the Gencove CLI migration path and where to find precompiled executables, see the "Python 2 end of life" section below.

The Gencove CLI can be installed using the Python package manager pip. The source code is mirrored to a public repository on GitHub.

Python 3 and pip are commonly available on many operating systems. In case you do need to install Python 3, straightforward instructions are available here.

In production environments, we highly recommend using virtualenv and/or virtualenvwrapper for installing the Gencove CLI in a dedicated Python environment.

Python 2 end of life

Due to Python 2 reaching its end of life, Gencove migrated towards supporting exclusively Python 3 for the CLI.

We understand this may be disruptive to existing user workflows and Gencove will take the following steps to make transitioning as easy as possible:

  • Maintain Python 2 support through June 2020
  • Provide precompiled executables for Mac OS, Linux, and Windows operating systems in order to support users without Python 3
    • beta builds currently available on GitHub
  • As of July 2020, Gencove will officially support only Python 3

Mac OS notes

Due to a known issue with Python that ships with Mac OS, the Gencove CLI should be installed in the user's home directory (not system-wide) as follows: pip install --user gencove. Make sure to have ~/bin present in your $PATH environment variable.

For advanced users, we highly recommend virtualenvwrapper and installing the Gencove CLI within a dedicated virtualenv.

If you absolutely must install the Gencove CLI system-wide using sudo, the following command can be used as a last resort: sudo pip install gencove --ignore-installed six.

Configuration

Your credentials can be provided to the Gencove CLI via environment variables:

  • $GENCOVE_EMAIL and $GENCOVE_PASSWORD
  • $GENCOVE_API_KEY
    • API keys can be generated and revoked using the Gencove Dashboard under Account Settings -> API Keys

$ export GENCOVE_EMAIL='<your-email>'
$ export GENCOVE_PASSWORD='<your-password>'
$ export GENCOVE_API_KEY='<your-api-key>'

Please note that you cannot use $GENCOVE_EMAIL+$GENCOVE_PASSWORD and $GENCOVE_API_KEY at the same time.

$ curl -H "Authorization: Api-Key <your-api-key>" https://api.gencove.com/api/v2/projects/

import requests

r = requests.get(
  "https://api.gencove.com/api/v2/projects/",
  headers={"Authorization": "Api-Key <your-api-key>"}
)

API keys can also be used to authenticate with the API directly by setting the Authorization HTTP header to Api-Key <your-api-key>.
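The credential rules above can be sketched in Python. This is a minimal illustration and not part of the Gencove CLI: the function names gencove_auth_mode and api_key_headers are hypothetical, and the check only mirrors the documented rule that the two credential types cannot be combined.

```python
def gencove_auth_mode(env):
    """Return "api_key" or "email_password" depending on which variables are set.

    `env` is any mapping of environment variables, for example os.environ.
    Raises ValueError when both credential types are set, or when none is.
    """
    has_key = bool(env.get("GENCOVE_API_KEY"))
    has_login = bool(env.get("GENCOVE_EMAIL")) and bool(env.get("GENCOVE_PASSWORD"))
    if has_key and has_login:
        raise ValueError("set either GENCOVE_API_KEY or email + password, not both")
    if has_key:
        return "api_key"
    if has_login:
        return "email_password"
    raise ValueError("no Gencove credentials found")


def api_key_headers(env):
    """Build the Authorization header for direct API access from GENCOVE_API_KEY."""
    if gencove_auth_mode(env) != "api_key":
        raise ValueError("GENCOVE_API_KEY is required for direct API access")
    return {"Authorization": "Api-Key " + env["GENCOVE_API_KEY"]}
```

The returned dictionary can be passed as the headers argument of requests.get, in the same way as the literal header in the example above.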

Uploading FASTQ files

To enable FASTQ uploads for your account, log into your account and go to My FASTQs, where instructions will be provided (in case you do not already have access). You can expect a response from Gencove support within 24 hours.

Once uploads are enabled, users can upload files to the Gencove upload area using the Gencove CLI and assign the files to projects using the Gencove Dashboard. Once files are assigned to a project, they will be processed by the Gencove analysis pipeline. Analysis results will be available via the Gencove API and Dashboard once analysis is complete.

Warning

The Gencove upload area should be considered temporary storage and should not be used as permanent storage space for your files. Once files are assigned to a project, they will be stored according to your data retention agreement with Gencove.

File naming convention

We highly recommend using the standard Illumina naming convention for FASTQ files. If files are named in this manner, Gencove systems will automatically detect:

  1. the sample identifier (and use it as the sample's client_id)
  2. R1/R2 designations of files

A summary of the naming convention is:

SAMPLE ID + _ + ... + _ + (R1 or R2) + _ + ... + .fastq.gz

The table below shows examples of file names using this convention, with the corresponding detected sample identifiers and read designations:

File name                               Sample ID   Read pair
SAMPLE1_R1.fastq.gz                     SAMPLE1     R1
SAMPLE1_R2.fastq.gz                     SAMPLE1     R2
SAMPLE2_LANE1_SEQUENCER1_R1.fastq.gz    SAMPLE2     R1
SAMPLE3_R1_L001.fastq.gz                SAMPLE3     R1
SAMPLE4_R1.fq.gz                        SAMPLE4     R1
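To check file names before uploading, the naming convention summarized above can be approximated in a few lines of Python. This is only a sketch of the convention as described here, not Gencove's actual detection code: the function name parse_fastq_name is hypothetical, and it assumes the sample identifier is everything before the first underscore and that exactly one later underscore-delimited token is R1 or R2.

```python
def parse_fastq_name(file_name):
    """Return (sample_id, read) guessed from an Illumina-style FASTQ file name."""
    # Strip the compressed FASTQ extension; both forms appear in the table above.
    for suffix in (".fastq.gz", ".fq.gz"):
        if file_name.endswith(suffix):
            stem = file_name[: -len(suffix)]
            break
    else:
        raise ValueError("expected a .fastq.gz or .fq.gz file: " + file_name)

    tokens = stem.split("_")
    # Assumption: the first token is the sample ID, so only later tokens are
    # considered as read designations.
    reads = [token for token in tokens[1:] if token in ("R1", "R2")]
    if len(reads) != 1:
        raise ValueError("expected exactly one R1/R2 token in: " + file_name)
    return tokens[0], reads[0]
```

A name that raises ValueError here is a candidate for the explicit mapping file described in the next section.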

Custom file names

To bypass the default convention outlined above and explicitly specify sample identifiers and R1/R2 designations for FASTQ files, a file ending with .fastq-map.csv can be provided as the SOURCE to the gencove upload command. The format of the file is shown in the example below.

The following validation is performed on the .fastq-map.csv file:

  • the file header is client_id,r_notation,path
  • values in the client_id column cannot contain _
  • values in the r_notation column can only be "r1" or "r2"
  • files listed in the path column must:
    • exist
    • be gzip-compressed

Example:

client_id,r_notation,path
<sample_id_1>,<r_notation_1>,<path_to_fastq_file_1>
<sample_id_2>,<r_notation_2>,<path_to_fastq_file_2>
<sample_id_3>,<r_notation_3>,<path_to_fastq_file_3>
...
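The validation rules listed above can be checked locally before running an upload. The following is a rough sketch, not the CLI's own validator: the function name validate_fastq_map is hypothetical, relative paths are resolved against the current working directory (an assumption), and gzip compression is detected only by the two-byte gzip magic number.

```python
import csv
import os

EXPECTED_HEADER = ["client_id", "r_notation", "path"]
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def validate_fastq_map(map_path):
    """Return a list of human-readable problems found in a .fastq-map.csv file."""
    problems = []
    with open(map_path, newline="") as handle:
        reader = csv.reader(handle)
        if next(reader, None) != EXPECTED_HEADER:
            return ["header must be " + ",".join(EXPECTED_HEADER)]
        for line_no, row in enumerate(reader, start=2):
            if not row:
                continue  # ignore blank lines
            if len(row) != 3:
                problems.append(f"line {line_no}: expected 3 columns")
                continue
            client_id, r_notation, path = row
            if "_" in client_id:
                problems.append(f"line {line_no}: client_id contains '_'")
            if r_notation not in ("r1", "r2"):
                problems.append(f"line {line_no}: r_notation must be r1 or r2")
            if not os.path.isfile(path):
                problems.append(f"line {line_no}: file does not exist: {path}")
                continue
            with open(path, "rb") as fastq:
                if fastq.read(2) != GZIP_MAGIC:
                    problems.append(f"line {line_no}: file is not gzip-compressed")
    return problems
```

An empty list means the file passed these local checks; the CLI may still apply checks that are not listed here.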

Grouping files

By default, Gencove systems expect one pair of FASTQ files per sample.

If sequencing reads for a single sample are spread across multiple FASTQ files, they need to be merged into one R1 file and one R2 file. This can be accomplished in several ways:

  1. Listing multiple files for the same client_id and r_notation in the .fastq-map.csv file (outlined in the previous section) results in the files being concatenated on the fly during upload with the Gencove CLI (see the example below).
  2. Manually concatenate the files. Since gzip-compressed files can be merged without decompressing, it's simply a matter of concatenating the compressed files.
  3. By providing the --no-lane-splitting flag to bcl2fastq, splitting reads into multiple FASTQ files can be avoided upstream in the demultiplexing phase.

Example:

client_id,r_notation,path
sampleid1,r1,sample1_part1_r1.fastq.gz
sampleid1,r1,sample1_part2_r1.fastq.gz
sampleid1,r1,sample1_part3_r1.fastq.gz
sampleid1,r2,sample1_part1_r2.fastq.gz
sampleid1,r2,sample1_part2_r2.fastq.gz
sampleid1,r2,sample1_part3_r2.fastq.gz
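For the manual option, gzip files can be joined byte-for-byte because a gzip file may contain several compressed members in sequence. A minimal Python sketch follows; the function name concatenate_gzip_files is hypothetical, and the caller is responsible for passing R1 parts and R2 parts in the same order.

```python
import gzip
import shutil


def concatenate_gzip_files(part_paths, merged_path):
    """Append gzip files, in the given order, into merged_path without decompressing."""
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)


def read_gzip(path):
    """Decompress a (possibly multi-member) gzip file, e.g. to spot-check a merge."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

Decompressing the merged file yields the decompressed contents of the parts, one after another.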

Uploading using the CLI

$ gencove upload <source-path> [<destination-path>]

Syncs local directories to directories in your Gencove upload area. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.

$ gencove upload my-fastq-files/

This example command will recursively copy all files in the my-fastq-files/ directory on your host system to a directory with an automatically generated name in the Gencove upload area.

$ gencove upload input.fastq-map.csv

If there are multiple input FASTQ files per sample, or the file names do not follow the conventions described above, a manifest describing the relationship between the sample identifiers and the input FASTQ files must be provided in a CSV file in the format described above.

$ gencove upload my-fastq-files/ gncv://my-fastq/batch-1/

In case more control is needed over the upload destination, a destination path prefixed with gncv:// may be provided. This pattern is commonly used for separating upload batches when continuously uploading data to your Gencove account and is useful for easily filtering files in the Gencove Dashboard. A common directory structure for batching uploads is:

gncv://<project-name>/<batch-name>/

If specifying a destination path, it is recommended to have at least one level of directories to separate batches of uploaded data. In other words, avoid placing all files in the root directory gncv://.

Details of upload behavior:

  • In case a file in the local directory already exists in the destination, it will not be overwritten
  • In case a file exists in the destination, but not the local directory, it will not be deleted

Automatically starting analysis

To automatically assign uploads to a project and run analysis, provide the --run-project-id flag and destination project id to the Gencove CLI.

$ gencove upload my-fastq-files/ gncv://my-fastq/batch-1/ --run-project-id b1edbb20-ee77-4be0-9944-e8e3a593cc83

When this feature is used, the Gencove CLI will check to make sure that contents of SOURCE and DESTINATION are identical in order to avoid analysis of unwanted samples. This will always be the case if DESTINATION is omitted, i.e., autogenerated by the Gencove CLI.

It is also important to ensure uploaded files follow naming conventions outlined above to avoid sample identifier detection issues.
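When uploads are scripted, the command line can be assembled programmatically. The sketch below only builds the argument list from the options described in this section; the helper name build_upload_command is hypothetical, and the gncv:// prefix check is a local convenience rather than documented CLI behavior.

```python
def build_upload_command(source, destination=None, run_project_id=None):
    """Return the argument list for a `gencove upload` invocation."""
    command = ["gencove", "upload", source]
    if destination is not None:
        # Destination paths in the upload area are prefixed with gncv://
        if not destination.startswith("gncv://"):
            raise ValueError("destination must start with gncv://")
        command.append(destination)
    if run_project_id is not None:
        command += ["--run-project-id", run_project_id]
    return command
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where the Gencove CLI is installed and credentials are configured.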

Downloading deliverables

Gencove provides a number of deliverables for each sample that is processed as part of a project. In case a sample fails processing due to quality control, only the original input files are provided as deliverables.

Downloading using the CLI

$ gencove download <local-destination-path> --project-id <project-id>

Downloads all deliverables for all samples in the specified project, with the following default naming scheme:

<local-destination-path>/<client-id>/<gencove-id>/<gencove-id>_<file-type>.<file-extension>

This naming scheme reflects the fact that uniqueness of client-ids is not enforced, while uniqueness of gencove-id is enforced.

Customizing download naming scheme

The default naming scheme outlined above can be customized by providing the --download-template flag and a custom file naming template that may contain {client_id}, {gencove_id}, {file_type}, {file_extension} and {default_filename} tokens.

$ gencove download . --project-id my-project-id --download-template '{client_id}.{file_extension}'

When using this feature, make sure to specify download templates that result in unique filenames across all samples.

The {default_filename} token provides access to the API's default file naming scheme, which takes into account different bioinformatics conventions across a subset of file types. Current exceptions to the default {gencove_id}_{file_type}.{file_extension} scheme are:

  • fastq-r1: {gencove_id}_R1.fastq.gz
  • fastq-r2: {gencove_id}_R2.fastq.gz
  • alignment-bam: {gencove_id}.bam
  • alignment-bai: {gencove_id}.bam.bai
  • impute-vcf: {gencove_id}.vcf.gz
  • impute-tbi: {gencove_id}.vcf.gz.tbi
  • impute-csi: {gencove_id}.vcf.gz.csi
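A template can be checked for filename collisions before a large download. The sketch below assumes the tokens behave like Python str.format fields, which matches their syntax but is not confirmed by this document; the function names and the sample records used with them are hypothetical.

```python
from collections import Counter

# Exceptions to the default scheme, copied from the list above.
DEFAULT_FILENAME_EXCEPTIONS = {
    "fastq-r1": "{gencove_id}_R1.fastq.gz",
    "fastq-r2": "{gencove_id}_R2.fastq.gz",
    "alignment-bam": "{gencove_id}.bam",
    "alignment-bai": "{gencove_id}.bam.bai",
    "impute-vcf": "{gencove_id}.vcf.gz",
    "impute-tbi": "{gencove_id}.vcf.gz.tbi",
    "impute-csi": "{gencove_id}.vcf.gz.csi",
}


def default_filename(gencove_id, file_type, file_extension):
    """Apply the default naming scheme, including the listed exceptions."""
    pattern = DEFAULT_FILENAME_EXCEPTIONS.get(
        file_type, "{gencove_id}_{file_type}.{file_extension}"
    )
    return pattern.format(
        gencove_id=gencove_id, file_type=file_type, file_extension=file_extension
    )


def render(template, deliverable):
    """Fill a download template from a dict with the four basic tokens."""
    return template.format(
        default_filename=default_filename(
            deliverable["gencove_id"],
            deliverable["file_type"],
            deliverable["file_extension"],
        ),
        **deliverable,
    )


def find_collisions(template, deliverables):
    """Return rendered names that more than one deliverable would map to."""
    counts = Counter(render(template, item) for item in deliverables)
    return sorted(name for name, count in counts.items() if count > 1)
```

For instance, '{client_id}.{file_extension}' collides whenever two samples share a client_id, whereas a template that includes {gencove_id} or {default_filename} does not.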

Continuing previous downloads

When downloading, a file that already exists on the local filesystem is not overwritten if it has the same size in bytes as the file that would be downloaded. This behavior can be changed with the --no-skip-existing flag.
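The skip rule can be expressed as a single check. This is an illustration of the rule as stated here, not the CLI's implementation; the function name should_skip_download and its parameters are hypothetical.

```python
import os


def should_skip_download(local_path, remote_size_bytes):
    """Return True when local_path exists and matches the expected size in bytes."""
    return os.path.isfile(local_path) and os.path.getsize(local_path) == remote_size_bytes
```

Because only the size is compared, a local file that was modified without changing its length would still be skipped under this rule.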

Downloading subsets of deliverables

$ gencove download . --sample-ids sample-id-1,sample-id-2,sample-id-3 --file-types impute-vcf,impute-tbi

Behavior of the download can also be tweaked in the following manner:

  1. Download only a specific set of sample ids by providing the --sample-ids flag instead of the --project-id flag
  2. Download only a specific set of file types by providing the --file-types flag. Currently available file types are listed below (not all file types may be available for every project).

fastq-r1
original input FASTQ file with raw sequencing reads, containing the first read of a read pair when using paired-end sequencing
fastq-r2
original input FASTQ file with raw sequencing reads, containing the second read of a read pair when using paired-end sequencing
alignment-bam
BAM file with reads aligned to the target genome (includes all reads from original FASTQ files)
alignment-bai
BAI index file accompanying the BAM file
cnv-cnr
CNR file with bin-level log2 ratios for copy-number variation calls
cnv-cns
CNS file with segmented log2 ratios for copy-number variation calls
cnv-pdf
Portable Document Format (PDF) file with copy-number variation plot
cnv-png
Portable Network Graphics (PNG) file with copy-number variation plot (commonly used when PDFs are too large).
impute-vcf
VCF file with imputed variant calls
impute-tbi
Tabix index file accompanying the VCF file
impute-csi
CSI index file accompanying the VCF file
kraken-report
Kraken report for sequencing reads that didn't map to the target genome
ancestry-json

JavaScript Object Notation (JSON) file with ancestry estimates for subpopulations, contains the following keys:

  • ancestry - contains ancestry estimates
  • ancestry_raw - may contain additional entries for ambiguous groupings in situations where specific subgroups cannot be consistently identified
  • ancestry_metadata_id - legacy key (should be disregarded)
traits-json

JSON file with polygenic risk score calculations

  • each key represents a polygenic score outlined in the "Data analysis configurations" section below
  • each polygenic score object contains the following keys:
    • score - calculated value of polygenic score
    • nsnp - number of single-nucleotide polymorphisms (SNPs) taken into account
    • score_percentile - percentile of individual's score relative to scores calculated for individuals in the reference dataset used to generate the score
call_capture-vcf
VCF file with variant calls from target capture regions
call_capture-csi
index accompanying the target capture VCF file
call_capture-vcf_pathogenic
VCF file with pathogenic variant calls from target capture regions
qc

JSON file with sample quality control metrics, containing the following quality_control_types:

  • format - FASTQ format validity
  • r1_eq_r2 - number of bases in R1 file equal to number of bases in R2 file
  • r1_r2_ids_match - R1 read identifiers match R2 read identifiers
  • bases_min - minimum number of total bases sequenced
  • bases_max - maximum number of total bases sequenced
  • bases_dedup_min - minimum number of deduplicated bases
  • bases_dedup_mapped_min - minimum number of deduplicated bases that have aligned to the target genome
  • fraction_contamination_max - maximum contamination by DNA from another sample of the same species
  • snps_min - number of variants in reference panel that are covered by at least one sequencing read
  • effective_coverage_min - minimum effective coverage
  • hzy_max - maximum heterozygosity
  • cc_min - minimum "call confidence", i.e., imputation algorithm variant calling confidence across all sites
  • nhref_min - minimum number of homozygous reference calls
  • nhet_max - maximum number of heterozygous calls
  • nhalt_min - minimum number of homozygous alt calls
  • pct_target_bases_30x_min - minimum percentage of target capture bases with 30x coverage
  • pathogenic_min - number of pathogenic variants detected
metadata
JSON file with user-specified metadata that has been assigned to a sample
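As an illustration of the traits-json structure described above, the following sketch flattens the file into rows. It assumes the top-level JSON object maps each polygenic score name directly to an object with the score, nsnp and score_percentile keys; the function name summarize_traits is hypothetical.

```python
import json


def summarize_traits(traits_json_text):
    """Return (name, score, nsnp, score_percentile) tuples sorted by score name."""
    traits = json.loads(traits_json_text)
    rows = []
    for name, result in sorted(traits.items()):
        rows.append(
            (name, result["score"], result["nsnp"], result["score_percentile"])
        )
    return rows
```

The text of a downloaded traits-json file can be passed in directly, for example via open(path).read().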

The Gencove Archive

The Gencove Archive automatically transitions samples older than 30 days from hot storage to the Archive. Once a sample is in the Archive, its deliverables are not immediately available for download. Instead, users need to explicitly restore them from the Archive using the Gencove web dashboard, command-line interface (CLI), or API. Sample restoration can take up to 50 hours. Upon restoration, sample deliverables are available to download for 12 days, after which they return to the Archive.

$ gencove projects restore-samples my-project-id --sample-ids sample-id-1,...,sample-id-N
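The Archive timings can be turned into concrete dates when planning downloads. The sketch below is plain date arithmetic under two assumptions that this document does not spell out: the 30 days are counted from the sample's creation time, and the 12-day availability window starts when restoration completes. The function name archive_timeline is hypothetical.

```python
from datetime import datetime, timedelta

ARCHIVE_AFTER = timedelta(days=30)          # samples older than this are archived
MAX_RESTORE_TIME = timedelta(hours=50)      # upper bound on restoration time
RESTORED_AVAILABILITY = timedelta(days=12)  # download window after restoration


def archive_timeline(sample_created, restore_requested):
    """Return worst-case dates for archiving, restoration, and re-archiving."""
    restored_by = restore_requested + MAX_RESTORE_TIME
    return {
        "archived_on": sample_created + ARCHIVE_AFTER,
        "restored_no_later_than": restored_by,
        "re_archived_no_later_than": restored_by + RESTORED_AVAILABILITY,
    }
```

If restoration finishes sooner than 50 hours, the download window presumably starts and ends correspondingly earlier.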

Note that default views in the Gencove web dashboard and CLI only display samples that are immediately available for download. To view archived samples, set the view filter to either:

  • all: display available and archived samples
  • archived: display only archived samples

Listing projects, samples and uploads

Listing projects

All projects can be listed using the gencove projects list command.

$ gencove projects list

Listing project samples

All samples in a project can be listed using the gencove projects list-samples command.

$ gencove projects list-samples my-project-id

Project samples can also be filtered by status and searched. A metadata substring can also be used as the search query.

$ gencove projects list-samples my-project-id --status completed
$ gencove projects list-samples my-project-id --search my-client-id

Listing uploads

Uploads can be listed using the gencove uploads list command.

$ gencove uploads list

Uploads can also be filtered by status and searched.

$ gencove uploads list --status assigned
$ gencove uploads list --search gncv://upload/path

Sample metadata and files

Gencove supports assigning metadata to a sample in JavaScript Object Notation (JSON) format.

Information commonly stored as sample metadata:

  • phenotypes (characteristics) of the individual represented by the sample
  • batch identifiers
  • alternative or auxiliary sample identifiers

Each sample has many different files assigned to it that can be retrieved using the CLI.

The following CLI commands can be used to set and get metadata:

Assigning sample metadata

Metadata can be assigned to a sample using the gencove samples set-metadata command. Specifying sample id and the --json flag together with a JSON string is mandatory.

$ gencove samples set-metadata my-sample-id --json '{"example-key": "example-value"}'
$ gencove samples set-metadata my-sample-id --json '1234567'
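When metadata comes from a program rather than being typed by hand, serializing it with a JSON library avoids quoting mistakes. A minimal sketch follows; the helper name build_set_metadata_command is hypothetical, and it only assembles the arguments shown in the commands above.

```python
import json


def build_set_metadata_command(sample_id, metadata):
    """Return the argument list for `gencove samples set-metadata`.

    `metadata` may be any JSON-serializable value (dict, list, number, string).
    """
    return [
        "gencove",
        "samples",
        "set-metadata",
        sample_id,
        "--json",
        json.dumps(metadata),
    ]
```

Passing the list to subprocess.run without shell=True sends the JSON string as a single argument, so no shell quoting is needed.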

Retrieving sample metadata

Sample metadata can be retrieved by using the gencove samples get-metadata command. Optionally, --output-filename my-filename can be used to specify the filename where the metadata will be output. If not specified, metadata will be printed to stdout.

$ gencove samples get-metadata my-sample-id

Downloading single sample file

Download and save file

A single sample file can be downloaded using the gencove samples download-file command.

$ gencove samples download-file sample-id-1 impute-vcf destination.vcf

Download and stream file to stdout

A single sample file can be downloaded and streamed to stdout using the gencove samples download-file command.

$ gencove samples download-file sample-id-1 impute-vcf -
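Streaming to stdout makes it possible to process a deliverable without saving it first. The sketch below counts variant records in such a stream. It assumes the impute-vcf stream is gzip-compressed, which is inferred from the .vcf.gz default filename and not stated explicitly here; the function name count_vcf_records is hypothetical.

```python
import gzip


def count_vcf_records(binary_stream):
    """Count non-header, non-empty lines in a gzip-compressed VCF stream."""
    records = 0
    # gzip.open accepts an already-open binary file object as well as a path.
    with gzip.open(binary_stream, mode="rt") as vcf:
        for line in vcf:
            if line.strip() and not line.startswith("#"):
                records += 1
    return records
```

In a script, count_vcf_records(sys.stdin.buffer) could receive the output of the command above through a shell pipe.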

Merged VCF file

Gencove supports generating a merged VCF file containing variant calls from all successful samples in a project.

Generating a merged VCF file is initiated from the Gencove Dashboard, by opening a project and clicking the "Merge VCFs" button. Once the merge operation is complete, a download button will appear on the project page.

Please keep in mind:

  • merging is only possible for projects with two or more successful samples
  • not all project configurations support merging
    • in case you need a merged VCF and a project configuration you are using does not support it, please let us know at support@gencove.com
  • depending on the number of samples in your project, merging may take anywhere between several minutes and several hours
  • if multiple samples have the same client_id, the merged VCF file will only contain the newest sample

In addition to the web interface, the following CLI commands can be used to access merged VCF functionality:

Creating a merged VCF

A merged VCF file can be created using the gencove projects create-merged-vcf command.

$ gencove projects create-merged-vcf my-project-id

Checking the status of a merged VCF

Status of the merging job can be checked using the gencove projects status-merged-vcf command.

$ gencove projects status-merged-vcf my-project-id

Downloading the merged VCF

The merged VCF file can be downloaded using the gencove projects get-merged-vcf command. Optionally, --output-filename my-filename can be used to override the default filename.

$ gencove projects get-merged-vcf my-project-id

Backwards-compatible array deliverables

Backwards-compatible genotyping array deliverables can be generated for batches of samples in projects that support this functionality. Each project configuration can support multiple batch types that correspond to different array types.

More information about these deliverables is available in this blog post.

Listing batch types

Available batch types for a project can be listed using the gencove projects list-batch-types command.

$ gencove projects list-batch-types my-project-id

Creating a batch

A new batch can be created using the gencove projects create-batch command.

$ gencove projects create-batch --batch-type illuminasnp50 --batch-name batch-001 --sample-ids sample-id-1,...,sample-id-N my-project-id

Omitting --sample-ids results in all samples belonging to the project being used for the batch.

$ gencove projects create-batch --batch-type illuminasnp50 --batch-name batch-001 my-project-id

Successful generation of a batch deliverable will also trigger a webhook associated with the project.

Listing project batches

Project batches can be listed using the gencove projects list-batches command.

$ gencove projects list-batches my-project-id

Downloading batch deliverable

Once the batch deliverable is generated, it is available for download using the gencove projects get-batch command.

$ gencove projects get-batch my-batch-id --output-filename batch.zip