VCF Management
Subset VCFs
This shortcut enables the user to subset a collection of VCF files to a set of genomic regions.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.subset import SubsetVCFs
from gencove_explorer.helpers import GenomicRegion

# Resolve the project id into the keyword arguments the shortcut expects
input_parameters = SubsetVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
# Subset every VCF to positions 860000-880000 on contig 1
subset = SubsetVCFs(
    regions=[GenomicRegion(contig=1, start=860000, stop=880000)],
    **input_parameters,
).run()
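The **input_parameters idiom used above is ordinary Python keyword unpacking: the dictionary returned by input_helper() is expanded into keyword arguments of the shortcut's constructor. The sketch below illustrates the pattern with stand-in classes only. DemoShortcut, DemoSample, and the "samples" key are hypothetical and are not part of the Gencove library, whose actual keyword names may differ.

```python
from dataclasses import dataclass


@dataclass
class DemoSample:
    """Hypothetical stand-in for a platform Sample object."""
    sample_id: str


class DemoShortcut:
    """Hypothetical stand-in for a shortcut class; not the Gencove API."""

    def __init__(self, samples, regions=None):
        self.samples = samples
        self.regions = regions or []

    @staticmethod
    def input_helper(sample_ids):
        # Return a dict whose keys match the constructor's keyword arguments.
        # The key name "samples" is an assumption made for this sketch.
        return {"samples": [DemoSample(s) for s in sample_ids]}


input_parameters = DemoShortcut.input_helper(["sample-a", "sample-b"])

# These two constructions are equivalent:
via_helper = DemoShortcut(regions=["1:860000-880000"], **input_parameters)
direct = DemoShortcut(
    regions=["1:860000-880000"],
    samples=[DemoSample("sample-a"), DemoSample("sample-b")],
)
```

Seen this way, skipping input_helper() simply means supplying the same keyword arguments by hand.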
Annotate VCFs
This shortcut enables the user to annotate a collection of VCF files with a specific version of ClinVar.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.annotate import AnnotateVCFs, AnnotationClinVar

# Resolve the project id into the keyword arguments the shortcut expects
input_parameters = AnnotateVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
# Annotate every VCF with ClinVar for the GRCh37 genome build
annotated = AnnotateVCFs(
    annotation=AnnotationClinVar(genome="GRCh37"),
    **input_parameters,
).run()
Shard VCFs
This shortcut enables the user to shard a collection of VCF files by a set of genomic regions.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
from gencove_explorer_library.shortcuts.shard_vcfs import ShardVCFs
from gencove_explorer.helpers import GenomicRegion

# Resolve the project id into the keyword arguments the shortcut expects
input_parameters = ShardVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
# Shard by two regions: contig 1 and contig 2
sharded = ShardVCFs(
    regions=[GenomicRegion(contig=1), GenomicRegion(contig=2)],
    **input_parameters,
).run()
Merge VCFs
This shortcut enables the user to merge a collection of VCF files from non-overlapping sample sets to create one multi-sample file.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of lists of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of lists of Sample or VCF objects to the shortcut without using input_helper().
The VCFs in each sublist are merged into a single VCF file, so the shortcut produces one merged VCF per sublist.
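from gencove_explorer_library.shortcuts.merge_vcfs import MergeVCFs

# Resolve the project id into the keyword arguments the shortcut expects
input_parameters = MergeVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
# Merge the VCFs of each sublist into one multi-sample VCF
merged = MergeVCFs(
    **input_parameters,
).run()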
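Because each sublist is merged into its own multi-sample VCF and the sample sets must not overlap, it can help to build the sublists programmatically. The following is a minimal sketch in plain Python, assuming sample ids are strings. batch_sample_ids is a hypothetical helper, not part of the Gencove library, and how the resulting list of lists is handed to MergeVCFs depends on the library's actual parameters.

```python
def batch_sample_ids(sample_ids, batch_size):
    """Split sample ids into consecutive sublists of at most batch_size.

    Raises ValueError on duplicate ids, since the VCFs being merged are
    expected to come from non-overlapping sample sets.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seen = set()
    for sample_id in sample_ids:
        if sample_id in seen:
            raise ValueError(f"duplicate sample id: {sample_id}")
        seen.add(sample_id)
    return [
        list(sample_ids[i:i + batch_size])
        for i in range(0, len(sample_ids), batch_size)
    ]


batches = batch_sample_ids(["s1", "s2", "s3", "s4", "s5"], batch_size=2)
print(batches)  # [['s1', 's2'], ['s3', 's4'], ['s5']]
```

Each inner list here corresponds to one merged output file.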
Concatenate VCFs
This shortcut enables the user to concatenate a collection of VCF files from the same set of samples. All source files must have the same sample columns appearing in the same order.
The user provides a list of lists of VCF objects to the shortcut. The VCFs in each sublist are concatenated into a single VCF file.
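from gencove_explorer_library.shortcuts.concatenate_vcfs import ConcatenateVCFs

# shard1_vcf ... shard6_vcf are placeholders for VCF objects defined elsewhere
concatenated = ConcatenateVCFs(
    vcfs=[
        [shard1_vcf, shard2_vcf, shard3_vcf],  # concatenated into the first output VCF
        [shard4_vcf, shard5_vcf, shard6_vcf],  # concatenated into the second output VCF
    ],
).run()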
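The requirement that all source files share the same sample columns in the same order can be checked before concatenation by comparing the #CHROM header line of each file. The following is a minimal sketch that works on VCF header text directly. sample_columns and same_sample_columns are hypothetical helpers, and they do not use the Gencove VCF object, whose interface is not shown here.

```python
def sample_columns(vcf_text):
    """Return the sample names from the #CHROM header line of VCF text."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # The first nine tab-separated columns are fixed
            # (CHROM through FORMAT); the rest are sample names.
            return line.split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def same_sample_columns(vcf_texts):
    """True if every VCF lists the same samples in the same order."""
    columns = [sample_columns(text) for text in vcf_texts]
    return all(c == columns[0] for c in columns)


header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
shard_a = header + "s1\ts2\n"
shard_b = header + "s1\ts2\n"
shard_c = header + "s2\ts1\n"  # same samples, different order

print(same_sample_columns([shard_a, shard_b]))  # True
print(same_sample_columns([shard_a, shard_c]))  # False
```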
Shard, merge, and concatenate VCFs
This shortcut enables the user to shard a collection of VCF files by a set of genomic regions, merge the shards across samples, and concatenate the merged shards into larger output VCFs. It is essentially a distributed version of MergeVCFs that works for large sample numbers and large VCF files by composing the ShardVCFs, MergeVCFs, and ConcatenateVCFs shortcuts.
The shortcut's input_helper() method accepts either:
- Gencove project id
- List of Gencove sample ids
and returns a dictionary containing a list of Sample objects from the Gencove platform.
Alternatively, the user may provide a list of Sample or VCF objects to the shortcut without using input_helper().
The regions input is a list of lists: the elements of each sublist define the shards, and each sublist defines which shards are concatenated into one final output VCF.
from gencove_explorer_library.shortcuts.shard_merge_concatenate_vcfs import ShardMergeConcatenateVCFs
from gencove_explorer.helpers import GenomicRegion

step = int(5e6)  # shard width in base pairs
regions = [
    # These shards are concatenated into the first output VCF
    [GenomicRegion(contig="chr22", start=s, stop=s + step) for s in range(int(10e6), int(30e6), step)],
    # These shards are concatenated into the second output VCF
    [GenomicRegion(contig="chr22", start=s, stop=s + step) for s in range(int(30e6), int(55e6), step)],
]
# Resolve the project id into the keyword arguments the shortcut expects
input_parameters = ShardMergeConcatenateVCFs.input_helper("aa3a46e0-c390-4943-b613-26f9908367d5")
sharded_merged_concatenated = ShardMergeConcatenateVCFs(
    regions=regions,
    **input_parameters,
).run()
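The nested regions structure can also be built with a small helper that tiles an interval into fixed-width shards. The following is a minimal sketch using a stand-in dataclass so that it runs without the Gencove library. Region and tile are hypothetical names, and with the library installed GenomicRegion would take the place of Region. Unlike the list comprehensions above, this helper clips the last shard at the requested end coordinate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for GenomicRegion (contig, start, stop)."""
    contig: str
    start: int
    stop: int


def tile(contig, start, stop, step):
    """Tile the interval from start to stop into shards of width step.

    The final shard is clipped so that it never extends past stop.
    """
    return [
        Region(contig, s, min(s + step, stop))
        for s in range(start, stop, step)
    ]


step = 5_000_000
regions = [
    tile("chr22", 10_000_000, 30_000_000, step),  # first output VCF
    tile("chr22", 30_000_000, 52_000_000, step),  # second output VCF
]

print([len(group) for group in regions])  # [4, 5]
print(regions[1][-1])  # last shard, clipped to stop=52000000
```

Each inner list maps to one final output VCF, so the number of sublists controls how many files are produced, while step controls how finely the work is split.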