Shortcut Library¶
The Gencove Explorer Library is a Python package that contains a collection of pre-made “shortcuts” that represent commonly used genomic analysis workflows like subsetting and annotating VCF files.
There are two main types of shortcuts:
- Local: execute locally
- Remote: execute on a cluster
Local shortcuts¶
These shortcuts execute locally and typically have modest resource requirements. They usually provide visualizations and summaries of various statistics.
Local shortcuts provide:
- run() method for running the shortcut
- result() method for accessing shortcut results (if applicable)
- save() and load() methods for saving and reloading shortcut state from local file storage
IGV¶
This shortcut is a version of the Integrative Genomics Viewer (IGV) integrated into the broader Gencove platform, making it easy to visually observe various aspects of a sample like BAM file read coverage relative to the reference genome.
from gencove_explorer_library.shortcuts.igv import IGV
IGV().run(
    sample="c721a787-3550-4f2c-8324-97ba4686ef4c",
    region="chr2:56,804,074-56,811,712",
)
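The region argument above uses the contig:start-stop notation with thousands separators. The following is a minimal sketch of how such a string breaks down into its parts. The parse_region helper is hypothetical and not part of the Gencove Explorer Library, and how the IGV shortcut itself interprets the string is not specified here.

```python
def parse_region(region: str) -> tuple[str, int, int]:
    """Split a 'contig:start-stop' string into (contig, start, stop).

    Hypothetical helper for illustration only; not part of the library.
    """
    contig, _, span = region.partition(":")
    start_text, _, stop_text = span.partition("-")
    # Drop the thousands separators before converting to integers.
    start = int(start_text.replace(",", ""))
    stop = int(stop_text.replace(",", ""))
    if start > stop:
        raise ValueError(f"start {start} is greater than stop {stop}")
    return contig, start, stop


contig, start, stop = parse_region("chr2:56,804,074-56,811,712")
print(contig, start, stop)
```

A check like this can catch a malformed region string before it is passed to a shortcut.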
Remote shortcuts¶
These shortcuts execute remotely on the cluster and commonly represent workloads with large resource requirements that cannot reasonably complete in a local environment.
Remote shortcuts provide:
- input_helper() method to generate input for the shortcut in a simple and user-friendly manner; this is a static method, so it can be called without instantiating the shortcut
- run() method for scheduling execution of the shortcut on the cluster
- status() method for checking shortcut execution status
- result() method for accessing shortcut results
- analyses() method for returning the Analysis objects on which downstream shortcuts depend
- save() and load() methods for saving and reloading shortcut state from local file storage
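The call pattern used throughout the examples on this page (a static input_helper() whose returned dictionary is unpacked into the constructor with `**`) can be sketched in plain Python. ToyShortcut is a hypothetical stand-in, and the "samples" key name is an assumption; neither reflects the library's actual implementation.

```python
class ToyShortcut:
    """Hypothetical stand-in that only mimics the call pattern described above."""

    def __init__(self, samples, regions=None):
        self.samples = samples
        self.regions = regions or []

    @staticmethod
    def input_helper(sample_ids):
        # A static method receives no instance, so it is callable on the class itself.
        # It returns keyword arguments for the constructor (assumed key: "samples").
        return {"samples": list(sample_ids)}


# No ToyShortcut object exists yet when input_helper() is called.
input_parameters = ToyShortcut.input_helper(["sample-a", "sample-b"])

# The dictionary is unpacked into keyword arguments alongside other parameters.
shortcut = ToyShortcut(regions=["region-1"], **input_parameters)
print(shortcut.samples, shortcut.regions)
```

Returning a dictionary keeps the helper's output independent of any other constructor arguments, which can still be passed explicitly.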
Subset VCFs¶
This shortcut enables the user to subset a collection of VCF files to a set of genomic regions.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.subset import SubsetVCFs
from gencove_explorer.helpers import GenomicRegion
input_parameters = SubsetVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
subset = SubsetVCFs(
    regions=[GenomicRegion(contig=1, start=860000, stop=880000)],
    **input_parameters,
).run()
Annotate VCFs¶
This shortcut enables the user to annotate a collection of VCF files with a specific version of ClinVar.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.annotate import AnnotateVCFs, AnnotationClinVar
input_parameters = AnnotateVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
annotated = AnnotateVCFs(
    annotation=AnnotationClinVar(genome="GRCh37"),
    **input_parameters,
).run()
Shard VCFs¶
This shortcut enables the user to shard a collection of VCF files to a set of genomic regions.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.shard_vcfs import ShardVCFs
from gencove_explorer.helpers import GenomicRegion
input_parameters = ShardVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
sharded = ShardVCFs(
    regions=[GenomicRegion(contig=1), GenomicRegion(contig=2)],
    **input_parameters,
).run()
Merge VCFs¶
This shortcut enables the user to merge a collection of VCF files from non-overlapping sample sets to create one multi-sample file.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of lists of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of lists of Sample or VCF objects to the shortcut without using input_helper().
VCFs from each sublist of the list will be merged into a single VCF file.
from gencove_explorer_library.shortcuts.merge_vcfs import MergeVCFs
input_parameters = MergeVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
merged = MergeVCFs(
    **input_parameters,
).run()
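Because the merged inputs are expected to come from non-overlapping sample sets, it can be useful to check sample names before scheduling a merge. The sketch below works on plain collections of sample names. The overlapping_samples helper is hypothetical, and obtaining the names from Sample or VCF objects is left out because that API is not described here.

```python
from itertools import combinations


def overlapping_samples(sample_sets):
    """Return the sample names that appear in more than one of the given sets.

    Hypothetical helper for illustration only; not part of the library.
    """
    shared = set()
    # Compare every pair of inputs and collect the names they have in common.
    for left, right in combinations(sample_sets, 2):
        shared |= set(left) & set(right)
    return shared


names_per_vcf = [["NA001", "NA002"], ["NA003"], ["NA004", "NA002"]]
print(sorted(overlapping_samples(names_per_vcf)))
```

An empty result means the inputs are pairwise disjoint.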
Concatenate VCFs¶
This shortcut enables the user to concatenate a collection of VCF files from the same set of samples. All source files must have the same sample columns appearing in the same order.
The user may provide a list of lists of VCF objects to the shortcut. VCFs from each sublist of the list will be concatenated into a single VCF file.
from gencove_explorer_library.shortcuts.concatenate_vcfs import ConcatenateVCFs
concatenated = ConcatenateVCFs(
    vcfs=[
        [shard1_vcf, shard2_vcf, shard3_vcf],
        [shard4_vcf, shard5_vcf, shard6_vcf],
    ],
).run()
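The requirement that all source files share the same sample columns in the same order can be checked from the `#CHROM` header line of each VCF, where sample names follow the nine fixed columns. The sketch below operates on header lines supplied as strings. Both helpers are hypothetical, and reading the header out of a VCF object is not shown because that API is not described here.

```python
def sample_columns(header_line: str) -> list[str]:
    """Return the sample names from a VCF '#CHROM' header line."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then samples.
    return fields[9:]


def same_samples_in_same_order(header_lines) -> bool:
    """Check that every header lists identical samples in identical order."""
    columns = [sample_columns(line) for line in header_lines]
    return all(current == columns[0] for current in columns[1:])


fixed = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
headers = [fixed + "\tNA001\tNA002", fixed + "\tNA001\tNA002"]
print(same_samples_in_same_order(headers))
```

Comparing lists rather than sets makes the check sensitive to column order, which matters for concatenation.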
Shard, merge, and concatenate VCFs¶
This shortcut enables the user to shard a collection of VCF files to a set of genomic regions, merge the shards, and concatenate the merged shards into large shards. It is essentially a distributed version of MergeVCFs that works for large sample numbers and large VCF files by composing the ShardVCFs, MergeVCFs, and ConcatenateVCFs shortcuts.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
The regions input is a list of lists: the elements of each sublist define the shards, and each sublist determines which shards are concatenated into one final output VCF.
from gencove_explorer_library.shortcuts.shard_merge_concatenate_vcfs import ShardMergeConcatenateVCFs
from gencove_explorer.helpers import GenomicRegion
step = int(5e6)
regions = [
    # These shards are concatenated into the first output VCF
    [GenomicRegion(contig="chr22", start=s, stop=s+step) for s in range(int(10e6), int(30e6), step)],
    # These shards are concatenated into the second output VCF
    [GenomicRegion(contig="chr22", start=s, stop=s+step) for s in range(int(30e6), int(55e6), step)],
]
input_parameters = ShardMergeConcatenateVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
sharded_merged_concatenated = ShardMergeConcatenateVCFs(
    regions=regions,
    **input_parameters,
).run()
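To make the nested structure of regions concrete, the windowing arithmetic from the example above can be reproduced with plain tuples in place of GenomicRegion objects. The windows helper is hypothetical and used only for illustration. Consecutive windows share a boundary coordinate, as in the example above; whether the library treats coordinates as inclusive is not stated here.

```python
def windows(contig, start, stop, step):
    """Return (contig, start, stop) tuples, one per multiple of `step` in range(start, stop)."""
    return [(contig, s, s + step) for s in range(start, stop, step)]


step = 5_000_000
regions = [
    windows("chr22", 10_000_000, 30_000_000, step),  # shards of the first output VCF
    windows("chr22", 30_000_000, 55_000_000, step),  # shards of the second output VCF
]

for index, group in enumerate(regions, start=1):
    print(f"output VCF {index}: {len(group)} shards, {group[0][1]}-{group[-1][2]}")
# output VCF 1: 4 shards, 10000000-30000000
# output VCF 2: 5 shards, 30000000-55000000
```

With a step of 5 Mb, the first sublist holds four shards and the second holds five, so nine shards feed into two output VCFs.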
Exporting sample deliverables to S3¶
This shortcut enables the user to export all or a subset of the sample deliverables created by Gencove's analysis pipeline to AWS S3.
The shortcut's input_helper() method accepts:
- Gencove project id
- Optional list of file types
- Optional list of sample statuses; if not specified, only succeeded samples are used
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample objects to the shortcut without using input_helper().
If the user is copying the files to a bucket outside of the Explorer workspace, standard AWS credentials must be provided.
from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToS3
input_parameters = ExportSampleDeliverablesToS3.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
export = ExportSampleDeliverablesToS3(
    s3_path="s3://bucket/prefix/",
    aws_session_configuration={
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "123...",
    },
    **input_parameters,
).run()
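Rather than writing credentials into a notebook, the aws_session_configuration dictionary can be assembled from environment variables. The sketch below assumes the conventional AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variable names. The helper function is hypothetical, and only the two dictionary keys shown in the example above are taken from this page.

```python
import os


def aws_session_configuration_from_env(env=None):
    """Build the credentials dictionary from environment variables.

    Hypothetical helper for illustration only; not part of the library.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


# Demonstration with a plain dictionary standing in for os.environ.
example_env = {"AWS_ACCESS_KEY_ID": "AKIA...", "AWS_SECRET_ACCESS_KEY": "123..."}
aws_session_configuration = aws_session_configuration_from_env(example_env)
print(sorted(aws_session_configuration))
```

Called without arguments, the helper reads from the real process environment, and the result can be passed as the aws_session_configuration argument.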
Exporting sample deliverables to Azure¶
This shortcut enables the user to export all or a subset of the sample deliverables created by Gencove's analysis pipeline to Microsoft Azure Storage.
The shortcut's input_helper() method accepts:
- Gencove project id
- Optional list of file types
- Optional list of sample statuses; if not specified, only succeeded samples are used
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample objects to the shortcut without using input_helper().
To upload to Azure Storage, the user must provide a connection string.
from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToAzureStorage
input_parameters = ExportSampleDeliverablesToAzureStorage.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
export = ExportSampleDeliverablesToAzureStorage(
    azure_container_name="my-container",
    azure_blob_path="foo/bar/baz",
    azure_connection_string="DefaultEndpointsProtocol=https;AccountName=storagesample;AccountKey=<account-key>",
    **input_parameters,
).run()
Exporting sample deliverables to GCP¶
This shortcut enables the user to export all or a subset of the sample deliverables created by Gencove's analysis pipeline to GCP Cloud Storage.
The shortcut's input_helper() method accepts:
- Gencove project id
- Optional list of file types
- Optional list of sample statuses; if not specified, only succeeded samples are used
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample objects to the shortcut without using input_helper().
To upload to GCP Storage, the user must provide a path to a GCP service account JSON credentials file.
from gencove_explorer_library.shortcuts.export_sample_deliverables import ExportSampleDeliverablesToGCPStorage
input_parameters = ExportSampleDeliverablesToGCPStorage.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
export = ExportSampleDeliverablesToGCPStorage(
    storage_bucket="my-bucket",
    storage_path="foo/bar/baz",
    gcp_service_account_json_path="credentials.json",
    **input_parameters,
).run()
Composing remote shortcuts¶
One important aspect of these shortcuts is that they can be easily composed, assuming the respective inputs and outputs are compatible.
The example below subsets a collection of VCF files to a genomic region and annotates the resulting VCF files with ClinVar annotations.
from gencove_explorer_library.shortcuts.annotate import AnnotateVCFs, AnnotationClinVar
from gencove_explorer_library.shortcuts.subset import SubsetVCFs
from gencove_explorer.helpers import GenomicRegion
input_parameters = SubsetVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
subset = SubsetVCFs(
    regions=[GenomicRegion(contig=1, start=860000, stop=880000)],
    **input_parameters,
)
annotated_subset = AnnotateVCFs(
    vcfs=subset,
    annotation=AnnotationClinVar(genome="GRCh37"),
).run()