Data analysis configurations¶
Each Gencove project is pinned to a 'configuration' that specifies the species, reference datasets (e.g. a reference genome and haplotype reference panel), and specific deliverables. These configurations can be private to a specific set of individuals, or public. The datasets underlying the public configurations are as follows:
Chicken¶
- Chicken low-pass v1.0
-
- Reference genome: galGal6
- Imputation reference panel: Sequencing data were downloaded from the European Nucleotide Archive project PRJEB30270. These data are described in detail in Qanbari et al. (2019). Raw sequences were processed using GATK4, and the 26M variants identified were converted into a haplotype reference panel.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Cattle¶
- Cattle low-pass v1.0
-
- Reference genome: ARS-UCD1.2
- Imputation reference panel: We used sequence data from 484 samples primarily from B. taurus breeds and processed these data using GATK4 into a reference panel of 49M bi-allelic SNPs.
- Breed analysis reference panel: We report ancestry proportions from 13 breeds: Angus, Brahman, Charolais, Gelbvieh, Hereford, Holstein, Jersey, Limousin, Red Angus, Simmental, Braunvieh, Santa Gertrudis, and Shorthorn.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Cattle low-pass v2.0
-
- Reference genome: ARS-UCD1.2
- Imputation reference panel: We used sequence data from 946 samples from B. taurus and B. indicus-related breeds and processed these data using GATK4 into a reference panel of 59M bi-allelic SNPs.
- Breed analysis reference panel: We report ancestry proportions from 12 breeds: Angus (including red and black Angus), Brahman, Charolais, Gelbvieh, Hereford, Holstein, Jersey, Limousin, Maine Anjou, Simmental, Braunvieh, and Shorthorn.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
- Cattle low-pass v2.2
-
- Same configuration as Cattle low-pass v2.0, but includes the Y chromosome
- Cattle low-pass v2.3
-
- Same configuration as Cattle low-pass v2.2, but with performance improvements
- Cattle low-pass v2.4
-
- Same configuration as Cattle low-pass v2.3, but the reference panel has been updated to include additional polymorphisms in the imputed VCF
- Cattle low-pass v3.0
-
- Reference genome: ARS-UCD1.2
- Imputation reference panel: We used sequence data from 1,987 animals with publicly available data, and processed these data using GATK into an imputation reference panel. We then subset this reference panel to the set of variants either 1) segregating in annotated Bos taurus samples or 2) present on public genotyping array manifests. This resulted in a reference panel of 59M variants (SNPs and small indels).
- Breed analysis reference panel: We report ancestry proportions from 13 breeds: Angus, Brahman, Charolais, Gelbvieh, Hereford, Holstein, Jersey, Limousin, Red Angus, Simmental, Braunvieh, Santa Gertrudis, and Shorthorn.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), breed analysis
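The variant-subsetting rule described for Cattle low-pass v3.0 amounts to keeping the union of two variant sets. The following is a minimal sketch of that idea only, not the actual pipeline: the `Variant` key, the `segregating_in_taurus` and `array_manifest` inputs, and the function name are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant key: chromosome, 1-based position, alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def subset_panel(
    panel: Iterable[Variant],
    segregating_in_taurus: Iterable[Variant],
    array_manifest: Iterable[Variant],
) -> List[Variant]:
    """Keep panel variants that meet either criterion, preserving panel order."""
    # A variant is kept if it is in at least one of the two sets.
    keep = set(segregating_in_taurus) | set(array_manifest)
    return [v for v in panel if v in keep]
```

For example, a panel variant that is absent from both input sets is dropped, while one present in either set is retained.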
Dog¶
- Dog low-pass v1.0
- Dog low-pass v2.0
- Dog low-pass v3.0
-
- Reference genome: canFam4. Note that the version used does not include a Y chromosome.
- Imputation reference panel: A reference panel of 765 sequenced dogs and 45M variants.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Human¶
- Human low-pass v1.0
-
- Reference genome: hs37-1kg
- Imputation reference panel: 1000 Genomes Phase 3, with all sites with a minor allele count less than three, with more than two alleles, or on the sex chromosomes removed.
- Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, polygenic scores
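The site filter described for the Human low-pass v1.0 imputation reference panel can be written as a single predicate. This is a minimal sketch under stated assumptions: the `Site` record, its field names, and the chromosome labels `X` and `Y` are hypothetical, and only the three exclusion criteria come from the description above.

```python
from typing import NamedTuple, Tuple

# Assumed chromosome naming; a real panel may use "chrX"/"chrY" instead.
SEX_CHROMOSOMES = {"X", "Y"}


class Site(NamedTuple):
    """Hypothetical summary of one reference-panel site."""
    chrom: str
    alt_alleles: Tuple[str, ...]
    minor_allele_count: int


def keep_site(site: Site, min_minor_allele_count: int = 3) -> bool:
    """Return True if the site passes all three v1.0-style filters."""
    if site.minor_allele_count < min_minor_allele_count:
        return False  # minor allele count less than three
    if len(site.alt_alleles) > 1:
        return False  # more than two alleles (reference plus >1 alternate)
    if site.chrom in SEX_CHROMOSOMES:
        return False  # sex chromosomes are excluded
    return True
```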
- Human low-pass v2.0
-
- Reference genome: hs37-1kg
- Imputation reference panel: 1000 Genomes Phase 3. Relative to v1.0, this includes all sites (including normalized multiallelic sites and the X chromosome).
- Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here
- Polygenic risk scores:
- Coronary artery disease: Inouye et al. 2018 (delivery key: cad)
- Breast cancer: Mavaddat et al. 2018 (delivery key: brca)
- Prostate cancer: Schumacher et al. 2018 (delivery key: prca)
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, polygenic scores
- Human low-pass v2.1
-
- Same configuration as Human low-pass v2.0, but with duplicate sites removed (see 1000 Genomes website for details)
- Human low-pass GRCh37 v2.2
-
- Same configuration as Human low-pass v2.1, but with bugfixes and performance improvements
- Human low-pass GRCh37 v2.3
-
- Same configuration as Human low-pass GRCh37 v2.2, but imputation now takes into account varying recombination rates across the genome. In particular, a recombination map derived from the HapMap II project is used to interpolate recombination rates across all sites in the haplotype reference panel. This results in increased imputation accuracy compared to Human low-pass GRCh37 v2.2.
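To illustrate what interpolating a recombination map across panel sites can look like, here is a minimal sketch using linear interpolation of genetic position (in centimorgans) between map markers. It is not the pipeline's implementation: the function name, the choice of linear interpolation, and the clamping of sites that fall outside the map are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence


def interpolate_cm(
    map_bp: Sequence[int],
    map_cm: Sequence[float],
    site_bp: Sequence[int],
) -> List[float]:
    """Linearly interpolate genetic positions (cM) at each site position.

    map_bp must be strictly increasing and the same length as map_cm.
    Sites outside the map range are clamped to the nearest map endpoint.
    """
    out: List[float] = []
    for bp in site_bp:
        if bp <= map_bp[0]:
            out.append(map_cm[0])
        elif bp >= map_bp[-1]:
            out.append(map_cm[-1])
        else:
            # Index of the first map marker strictly to the right of bp,
            # so that map_bp[i - 1] <= bp < map_bp[i].
            i = bisect_right(map_bp, bp)
            left_bp, right_bp = map_bp[i - 1], map_bp[i]
            frac = (bp - left_bp) / (right_bp - left_bp)
            out.append(map_cm[i - 1] + frac * (map_cm[i] - map_cm[i - 1]))
    return out
```

The difference in interpolated genetic position between two adjacent sites then gives a per-interval recombination distance that an imputation model can use in place of a single genome-wide rate.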
- Human low-pass GRCh37 v2.4
-
- Same configuration as Human low-pass GRCh37 v2.3, but the CNV calling part of the pipeline now uses a panel of normals comprising 59 male normal samples. Previously, the CNV calling step did not normalize against any normal human samples.
- Human low-pass GRCh37 v2.5
-
- Same configuration as Human low-pass GRCh37 v2.4, but with additional imputation QC metrics calculated.
- Human low-pass GRCh37 v2.6
-
- Same configuration as Human low-pass GRCh37 v2.5, but with optimized CNV calling parameters.
- Human low-pass GRCh37 v3.0
-
- Same configuration as Human low-pass GRCh37 v2.6, but with upgrades to the underlying imputation algorithm.
- Human low-pass GRCh37 v3.1
-
- Same configuration as Human low-pass GRCh37 v3.0, but with performance improvements.
- Human low-pass GRCh37 GLIMPSE v0.1 (testing)
-
- Utilizes the GLIMPSE imputation engine v1.0
- Imputation reference panel: 1000 Genomes Phase 3 filtered to contain only SNPs and indels
- Otherwise, same configuration as Human low-pass GRCh37 v2.3
- Human low-pass GRCh37 GLIMPSE v0.2 (testing)
-
- Same configuration as Human low-pass GRCh37 GLIMPSE v0.1, but with the full 1000 Genomes Phase 3 reference panel
- Human low-pass GRCh37 GLIMPSE v0.3 (testing)
-
- Same configuration as Human low-pass GRCh37 GLIMPSE v0.2, but with performance improvements.
- Human low-pass GRCh37 v6.0
-
- Reference genome: hs37-1kg
- Imputation reference panel: ~79M variants from 4091 individuals from the 1KGP3 + HGDP gnomAD v3.1.2 release, lifted over from the b38 release; includes X, Y, and MT calls from the 1000 Genomes Phase 3 panel
- Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here
- CNV analysis: CNV calls (cnr, cns, png) from CNVkit; normalization performed using a panel of normals comprising 59 male normal samples.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, CNV analysis
- Human low-pass GRCh38 (beta):
-
- Reference genome: GRCh38 with alternative sequences, plus decoys and HLA here.
- Imputation reference panel: Variant calls from 1000 Genomes Phase 3 samples resequenced at high depth by the New York Genome Center (processing pipeline described here), after removing singletons (variants with a minor allele count of 1 in the sample), for a total of ~62M variants.
- Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis.
- Human low-pass GRCh38 v1.0:
-
- Same configuration as the Human low-pass GRCh38 (beta) except for the imputation reference panel.
- Imputation reference panel: Lifted-over panel from the 1000 Genomes Phase 3 GRCh37 release.
- Human low-pass GRCh38 v1.1:
-
- Same configuration as Human low-pass GRCh38 v1.0 but with bugfixes and performance improvements.
- Human low-pass GRCh38 v2.0:
-
- Same reference genome and deliverables as Human low-pass GRCh38 v1.1 but with a new imputation reference panel comprising the phased release of genotype calls from the New York Genome Center's resequencing efforts of individuals from the 1000 Genomes Project. The panel comprises 3202 individuals: the original 2504 from Phase 3 and an additional 698 relatives. See the preprint here for more details.
- Human low-pass GRCh38 v2.1:
-
- Same configuration as Human low-pass GRCh38 v2.0 but annotated with variant IDs deriving from dbSNP build 151.
- Human low-pass GRCh38 v2.2:
-
- Same configuration as Human low-pass GRCh38 v2.1 but with additional imputation QC metrics calculated.
- Human low-pass GRCh38 v2.3:
-
- Same configuration as Human low-pass GRCh38 v2.2 but with a reference genome excluding ALT contigs.
- Human low-pass GRCh38 v3.0:
-
- Same configuration as Human low-pass GRCh38 v2.3 but with upgrades to the underlying imputation algorithm.
- Human low-pass GRCh38 v3.1:
-
- Same configuration as Human low-pass GRCh38 v3.0 but with performance improvements, and an updated reference genome, which can be found here.
- Human low-pass GRCh38 v3.2:
-
- Same configuration as Human low-pass GRCh38 v3.1 but with a new sex detection QC metric.
- Human low-pass GRCh38 v3.3:
-
- Same configuration as Human low-pass GRCh38 v3.2 but with sex detection improvements.
- Human low-pass GRCh38 v6.0:
-
- Reference genome: Version of GRCh38 reference genome recommended for use by Heng Li (https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use), containing no ALT contigs.
- Imputation reference panel: ~79M variants from 4091 individuals from the 1KGP3 + HGDP gnomAD v3.1.2 release; includes X, Y, and MT calls lifted over from the b37 release of the 1000 Genomes Phase 3 panel
- Ancestry reference panel: We provide an ancestry analysis based on 26 reference populations described here.
- CNV analysis: CNV calls (cnr, cns, png) from CNVkit; normalization performed using a panel of normals comprising 59 male normal samples.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), ancestry analysis, CNV analysis
Maize¶
- Maize low-pass v1.0
-
- Reference genome: AGPv4
- Imputation reference panel: Maize 282 association panel genotypes (7x, AGPv4 coordinates), for a total of ~82M variants.
- Strain reference panel: Each strain in the imputation reference panel was considered a separate population. Those that appear particularly similar are merged downstream into related groups such as "NC262RELATED".
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), strain analysis
- Maize low-pass v1.1
-
- Same as Maize low-pass v1.0 but with multi-allelic SNPs included in the imputation reference panel.
Mouse¶
- Mouse low-pass v1.0
-
- Reference genome: GRCm38_68
- Imputation reference panel: The Mouse Genomes Project contains ~59M SNPs discovered in 36 sequenced lines.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Rat¶
- Rat low-pass v1.0
-
- Reference genome: rn6
- Imputation reference panel: 42 rat genomes described in Hermsen et al. (2015) and lifted over to rn6, containing 8.7M variants.
- Strain analysis panel: The same 42 rat genomes used for the imputation reference panel.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index), strain analysis
Soy¶
- Soy low-pass Wm82.a2 v1.0
-
- Reference genome: Wm82.a2.v1
- Imputation reference panel: GmHapMap contains ~11M bi-allelic SNPs identified in around 1000 sequenced accessions.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Soy low-pass Wm82.a2 v2.0
-
- Same reference genome as v1.0
- Imputation reference panel: We used 478 samples from the USDA-GRIN germplasm collection, and processed these data using GATK4 into a reference panel of 32M SNPs and short indels.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Soy low-pass Wm82.a4 v1.0
-
- Reference genome: Wm82.a4.v1
- Imputation reference panel: We used 478 samples from the USDA-GRIN germplasm collection, and processed these data using GATK4 into a reference panel of 25M SNPs and short indels.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Swine¶
- Swine low-pass v1.0
-
- Reference genome: susScr11
- Imputation reference panel: We used 414 samples from the publicly available swine sequence data (PRJEB39374, PRJNA343658, PRJNA414091, PRJNA482384, PRJNA506339, and PRJNA553106), and processed these data using GATK4 into a reference panel of 53M SNPs and short indels.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Domestic cat¶
- Domestic cat low-pass v1.0
-
- Reference genome: felCat9
- Imputation reference panel: We used 78 WGS samples from Felis catus breeds from the 99 Lives project and processed these data using GATK4 into a reference panel of 49M SNPs and small indels.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
- Domestic cat low-pass v2.0
-
- Reference genome: same as Domestic cat low-pass v1.0
- Imputation reference panel: We used 185 WGS samples from Felis catus breeds from the 99 Lives project (a partial list can be found here) and processed these data using GATK4 into a reference panel comprising 55M SNPs and small indels.
- Deliverables: original FASTQ, aligned BAM (and index), imputed VCF (and index)
Tomato¶
- Tomato low-pass SL4.0 v1.0