This page summarizes key findings from a wide range of analyses performed on the Zimmerome by Mark Gerstein’s lab at Yale. Note that we refer to the Zimmerome as subjectZ in the text below as well as in our presentations.
The schema of data generation and analyses is summarized in the following presentation: schema.pptx
1) BAM files as digital representations of the genome
a) Original BAM from Illumina with Isaac aligner
A BAM file is a binary file that stores an individual’s sequencing reads together with their alignment to a reference genome. Illumina provided the original BAM file to Carl upon request; it was generated by aligning Carl’s sequencing reads to the human reference genome.
b) Alternate BAM based on remapping with BWA-mem
To allow comparisons with two other personal genomes, we followed a uniform protocol for aligning each genomic sequence to the reference genome (GRCh37). Reads were extracted from the Illumina-provided BAM files and remapped to the reference genome using the BWA-MEM algorithm with default parameters.
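The remapping step could be scripted roughly as follows. This is a minimal sketch, not the lab's actual pipeline: the file names, the thread count, and the helpers `build_remap_commands` and `as_shell_pipeline` are hypothetical, and the code only assembles the command lines rather than running them. It assumes `samtools` and `bwa` are installed and that the reference has already been indexed with `bwa index`.

```python
import shlex

def build_remap_commands(in_bam, ref_fasta, out_bam, threads=4):
    """Return the four stages of a BAM -> FASTQ -> BWA-MEM -> sorted BAM pipe.

    All paths are placeholders. BWA-MEM keeps its default alignment
    parameters; -p only declares interleaved pairs and -t sets threads.
    """
    collate = ["samtools", "collate", "-u", "-O", in_bam]  # put mates next to each other
    to_fastq = ["samtools", "fastq", "-"]                  # interleaved FASTQ on stdout
    align = ["bwa", "mem", "-p", "-t", str(threads), ref_fasta, "-"]
    sort = ["samtools", "sort", "-o", out_bam, "-"]        # coordinate-sorted BAM
    return [collate, to_fastq, align, sort]

def as_shell_pipeline(stages):
    """Join the stages into a single shell pipeline string for inspection."""
    return " | ".join(shlex.join(stage) for stage in stages)

# Hypothetical file names, for illustration only.
stages = build_remap_commands("subjectZ.illumina.bam", "GRCh37.fa", "subjectZ.bwamem.bam")
print(as_shell_pipeline(stages))
```

Printing the pipeline instead of executing it makes it easy to review the exact commands before committing to a long-running alignment job.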
associated data files: BAM files provided by Illumina (large file); Index of the BAM file provided by Illumina; BAM file generated by BWA-mem (large file); Index of the BAM file generated by BWA-mem
2) Summary of genomic variants in subjectZ
Slides providing more details on the variant analysis: variants.pdf
a) SNP/Indel calling with standard pipelines
We settled on a variant set using standard aligners and callers (e.g., GATK) consistent with best practices. This call set has ~3.5M SNPs and ~750K indels. We also ran a number of comparisons between SNP and indel callers (e.g., GATK vs. the Illumina caller) and found that the call sets differed by ~5%.
We compared subjectZ with two other European individuals whose genomes we have extensively analyzed under the same protocol. 77% of the SNPs (2.7M) and 78% of the indels (~590K) were shared by at least one of the two. Among the 3.5 million SNPs, roughly 9% (320,000) are considered rare, insofar as they are detected at a frequency of less than 0.05% among the genomes sequenced as part of the 1000 Genomes Project.
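The sharing percentages above come down to set overlap between call sets. The following is a minimal sketch of that computation, assuming each variant is keyed by a (chromosome, position, ref, alt) tuple; the function name `shared_fraction` and the toy variants are hypothetical and are not taken from the actual call sets.

```python
def shared_fraction(subject, others):
    """Fraction of the subject's variants seen in at least one other genome."""
    if not subject:
        return 0.0
    union_of_others = set().union(*others)
    return len(subject & union_of_others) / len(subject)

# Toy call sets: (chromosome, position, reference allele, alternate allele).
subject_z = {
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 500, "G", "A"),
    ("3", 42, "T", "C"),
}
european_1 = {("1", 1000, "A", "G"), ("2", 500, "G", "A")}
european_2 = {("1", 2000, "C", "T"), ("2", 500, "G", "A")}

print(shared_fraction(subject_z, [european_1, european_2]))  # 0.75
```

Keying on the full tuple rather than position alone avoids counting two different alternate alleles at the same site as a shared variant.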
associated data files: Germline SNP call set for subjectZ (VCF format); Germline indel call set for subjectZ (VCF format); Germline SNP call set for subjectZ (PLINK format)
Van der Auwera GA, et al. (2013) From fastQ data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics
b) SV calling (CNVnator, Somatator)
We also found ~1800 large deletions, although the precise number of structural variants depends greatly on the post-filtering method used. About 25% of these were shared with at least one of the two other Europeans in our analyzed set. We identified some interesting insertions and deletions, including a HERV deletion in the C4 gene (which is thought to result in a lower risk of schizophrenia) and a partial deletion of a gene that may protect against macular degeneration.
associated data files: CNVnator-based deletion calls (VCF format); CNVnator-based duplication calls (VCF format); Somatator-based CNV calls (BED format)
Link to CNVnator
Abyzov A, Urban AE, Snyder M, Gerstein M (2011) CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 21(6):974–984
Sudmant PH, et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526(7571):75–81
c) Variable retroelements (RDVs) & pseudogenes
The human reference genome contains around 14,000 pseudogenes, and about 8,000 of these are processed. subjectZ lacks 14 of these and has 12 novel processed pseudogenes. A small majority of the variable pseudogenes are shared with at least one of the two other Europeans.
A similar story is evident in the case of Alu elements: subjectZ has ~1000 Alu insertions that are not among the ~1.3M Alu elements in the reference genome, with a bit more than 50% of these insertions shared with at least one of the two other Europeans.
associated data files: Processed pseudogenes present in subjectZ (BED format); Processed pseudogenes absent in subjectZ (BED format); Alu insertions in subjectZ (BED format)
Abyzov A, et al. (2013) Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division. Genome Res 23(12):2042–2052
Schrider DR, et al. (2013) Gene Copy-Number Polymorphism Caused by Retrotransposition in Humans. PLoS Genet 9(1). doi:10.1371/journal.pgen.1003242.
3) Functional analysis of SNPs affecting coding regions
Slides providing more details on coding variants: coding.pdf
Slides providing more details on LOF variants: LOF.pdf
a) Frequency analysis & mapping onto 3D structures (STRESS)
We identified ~23K SNPs in coding regions, of which about 11K give rise to amino acid changes. Approximately half a percent of the rare SNPs (~1800) were located within the coding regions of the genome, and about 1,000 of these were non-synonymous. Some of these are disease-associated SNPs, such as those in ATM and FOXP4. We also show examples of exactly how non-synonymous SNPs may affect the 3D structures of their corresponding proteins (e.g., an I=>T change at position 114 in a transferase).
associated data file: Coding SNPs in subjectZ (text file)
Link to STRESS
Clarke D, et al. (2016) Identifying Allosteric Hotspots with Dynamics: Application to Inter- and Intra-species Conservation. Structure.
b) Loss-of-Function (LOF) variants (VAT/ALOFT)
We identified 90 LOF variants in subjectZ. These variants are enriched in olfactory genes. Four of them are very likely to be deleterious based on their ALOFT scores: namely, those impacting the FAM8A1, TIAM2, CCDC47, and MISP genes. In addition, 10 of the LOF variants are very rare alleles, which also suggests deleterious potential. This group of deleterious LOFs is very different in subjectZ compared with the two other European genomes.
associated data files: Predicted LOF variants in subjectZ (BED format); Annotations of LOF variants (BED format)
Link to ALOFT
1000 Genomes Project Consortium, et al. (2015) A global reference for human genetic variation. Nature 526(7571):68–74.
4) Functional analysis of largely noncoding SNPs
A number of new computational tools have been developed to prioritize noncoding variants in terms of their likely functional significance. One such tool is FunSeq, developed in the Gerstein Lab. Like other noncoding variant prioritization tools, FunSeq combines data types to approach the question of functional significance from multiple perspectives. It uses the ENCODE annotation to relate the variant to known functional elements. It assesses the evolutionary pressure on the region in which the variant occurs using both inter- and intra-species measures of conservation. It uses the network centrality of associated genes as a proxy for the importance of the genes regulated by the functional elements containing the variant. FunSeq also identifies whether the variant leads to the gain or loss of a functional motif. FunSeq assigns subscores along each of these dimensions and integrates them into a composite score representing the likely functional significance of each noncoding variant.
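The idea of integrating per-dimension subscores into one composite score can be illustrated with a simple weighted sum. This is only a sketch of the general pattern: the field names, the equal weights in `HYPOTHETICAL_WEIGHTS`, and the function `composite_score` are assumptions made for illustration and do not reproduce FunSeq's actual features or weighting scheme.

```python
from dataclasses import dataclass

@dataclass
class Subscores:
    """One illustrative subscore per dimension named in the text (0.0 to 1.0)."""
    annotated_element: float   # overlap with an ENCODE-annotated functional element
    conservation: float        # inter- and intra-species conservation
    network_centrality: float  # centrality of the associated gene(s)
    motif_change: float        # gain or loss of a functional motif

# Hypothetical equal weights; a real tool would derive its own weighting.
HYPOTHETICAL_WEIGHTS = {
    "annotated_element": 1.0,
    "conservation": 1.0,
    "network_centrality": 1.0,
    "motif_change": 1.0,
}

def composite_score(subscores, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted sum of the subscores named in `weights`."""
    return sum(w * getattr(subscores, name) for name, w in weights.items())

variant = Subscores(annotated_element=1.0, conservation=0.5,
                    network_centrality=0.0, motif_change=1.0)
print(composite_score(variant))  # 2.5
```

Keeping the subscores separate until the final step makes it possible to see which dimension drives a high composite score for a given variant.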
One illustrative example of such a variant in subjectZ is a G to C transversion in the promoter of Van-Gogh-Like Protein 2, a hub gene that helps to maintain the polarity of cells in the developing heart, nervous system, and auditory conduction system. This transversion leads to a gain of a BCL motif, which creates a new docking site for FGFR-family transcription factors, and could be expected to alter the transcription of VGLP2, although, in subjectZ’s case, apparently not by enough to cause a major developmental disorder, such as Tetralogy of Fallot.
Slides providing details on non-coding variant annotations in subjectZ: noncoding.pptx
a) High impact non-coding variants (FunSeq)
Our pipeline identified thousands of rare variants affecting annotated regions of the non-coding portions of the genome, with approximately 80K rare variants in introns, ~15K in enhancers, ~3000 in promoters, and ~42K in other annotations. These numbers are very similar to those found in the other two European individuals. Nine of the non-coding variants (which collectively affect 11 genes) are predicted to have particularly high functional impact, insofar as they i) are rare variants that occur in highly conserved regions, ii) result in altered TF binding sites, and iii) are in regions that regulate genes that serve as hubs within a regulatory interaction network.
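The three criteria for high functional impact amount to a conjunctive filter over annotated variants. The sketch below shows that logic on toy records; the dictionary keys, the variant identifiers, and the function `is_high_impact` are hypothetical and do not correspond to actual FunSeq output columns.

```python
def is_high_impact(variant):
    """True only if a variant meets all three criteria listed in the text."""
    return (
        variant["rare"] and variant["conserved"]  # i) rare, in a highly conserved region
        and variant["alters_tf_binding"]          # ii) alters a TF binding site
        and variant["regulates_hub_gene"]         # iii) regulates a network hub gene
    )

# Toy annotations, for illustration only.
variants = [
    {"id": "v1", "rare": True, "conserved": True,
     "alters_tf_binding": True, "regulates_hub_gene": True},
    {"id": "v2", "rare": True, "conserved": False,
     "alters_tf_binding": True, "regulates_hub_gene": True},
    {"id": "v3", "rare": True, "conserved": True,
     "alters_tf_binding": True, "regulates_hub_gene": False},
]

high_impact_ids = [v["id"] for v in variants if is_high_impact(v)]
print(high_impact_ids)  # ['v1']
```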
associated data file: FunSeq annotations for rare variants (BED format)
Link to FunSeq
Khurana E, et al. (2013) Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342(6154):1235587.
Fu Y, et al. (2014) FunSeq2: A framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 15(10):480.
b) SNPs associated with allelic activity (AlleleDB)
subjectZ has ~40K heterozygous SNPs that have been assessed for allele-specific expression in AlleleDB, and for about one fifth of these we have some evidence of allelic behavior. 173 SNPs show a fairly consistent pattern of allelic activity (present in >100 individuals, >25% of the database). Many of these consistently allelic SNPs are associated with known imprinted genes such as SNURF.
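Testing a heterozygous SNP for allelic behavior means asking whether reads from the two alleles depart from the 50:50 ratio expected by chance. The sketch below uses an exact two-sided binomial test as a simplified stand-in; it is an assumption for illustration, not the statistical model behind AlleleDB, and the function name `allelic_imbalance_pvalue` is hypothetical.

```python
from math import comb

def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Exact two-sided binomial p-value against a 50:50 allelic ratio.

    With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so the
    p-value is the total probability of outcomes no more likely than the
    observed one. Integer arithmetic keeps the sum exact.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    observed = comb(n, ref_reads)
    as_extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return as_extreme / 2 ** n

print(allelic_imbalance_pvalue(5, 5))   # 1.0
print(allelic_imbalance_pvalue(10, 0))  # 0.001953125
```

In practice, read counts at heterozygous sites tend to be more variable than a plain binomial allows, so a test like this would likely overcall allelic SNPs on real data.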
associated data file: List of allelic SNPs in subjectZ (BED format)
Link to AlleleDB
Chen J, et al. (2016) A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals. Nat Commun 7:11101.
5) Searching for SNVs associated with rare disorders
Slides on rare nonsynonymous coding SNVs: RareVariantVis_slides.pptx
The RareVariantVis package identifies harmful variants associated with extremely rare disorders using whole genome sequencing data. When RareVariantVis was applied to subjectZ, no major chromosomal abnormalities were detected (such abnormalities would manifest as regions exceeding 1Mb that exhibit loss-of-heterozygosity). We therefore focused on 1033 rare nonsynonymous coding SNVs.
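The loss-of-heterozygosity criterion mentioned above can be illustrated as a scan for long runs of homozygous calls. This is a minimal sketch under simplifying assumptions (one chromosome, calls sorted by position, any heterozygous call ends a run); the function `loh_regions` is hypothetical and is not RareVariantVis's implementation.

```python
def loh_regions(calls, min_length=1_000_000):
    """Find spans of consecutive homozygous calls longer than min_length bp.

    calls: list of (position, is_heterozygous) for one chromosome,
           sorted by position.
    Returns a list of (start, end) positions.
    """
    regions = []
    run_start = run_end = None

    def close_run():
        # Record the current run only if it exceeds the length threshold.
        if run_start is not None and run_end - run_start > min_length:
            regions.append((run_start, run_end))

    for position, is_heterozygous in calls:
        if is_heterozygous:
            close_run()
            run_start = run_end = None
        else:
            if run_start is None:
                run_start = position
            run_end = position
    close_run()
    return regions

# Toy calls: one 1.3 Mb homozygous run and one short run.
toy_calls = [
    (100, True), (200_000, False), (900_000, False), (1_500_000, False),
    (1_600_000, True), (2_000_000, False), (2_100_000, False),
]
print(loh_regions(toy_calls))  # [(200000, 1500000)]
```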
434 (42%) of these SNVs are not present in the Snyder or NA12878 genomes. 37 of those 434 are located in the major histocompatibility complex on chromosome 6, which is the largest cluster of rare SNVs in subjectZ. There are 2 other rare SNV-enriched areas (these are often observed in other samples): one on chromosome 1 (16 SNVs in the area around NBPF10) and another on chromosome 7 (10 SNVs in the area around MUC12). NBPF10 and MUC12 are repetitive genes whose function is not well described. Chromosomes X and Y have historically been difficult to evaluate for technical reasons, so results obtained on these chromosomes should be treated with greater care.
additional images: RareVariantVis SNVs in subjectZ (zip file)
Stokowy T, et al. (2016) RareVariantVis: new tool for visualization of causative variants in rare monogenic disorders using whole genome sequencing data. Bioinformatics btw359.
6) Advanced read analysis
Slides describing microbiome-related analysis for subjectZ: exDNA.pptx
a) Personal genome construction
We built a personal diploid genome of subjectZ from the call set.
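Conceptually, a personal diploid genome is the reference sequence with the individual's alleles substituted onto each of two haplotypes. The sketch below shows this for phased single-nucleotide variants only; the function `build_haplotypes` and the toy sequence are hypothetical, and real construction must also handle indels and structural variants, which shift coordinates between the reference and the personal genome.

```python
def build_haplotypes(reference, phased_snvs):
    """Apply phased SNVs to a reference sequence, returning two haplotypes.

    reference:   reference sequence as a string
    phased_snvs: list of (zero_based_pos, ref_base, (hap1_base, hap2_base))
    """
    hap1, hap2 = list(reference), list(reference)
    for pos, ref_base, (base1, base2) in phased_snvs:
        # Guard against coordinate or reference-build mismatches.
        if reference[pos] != ref_base:
            raise ValueError(f"reference mismatch at position {pos}")
        hap1[pos] = base1
        hap2[pos] = base2
    return "".join(hap1), "".join(hap2)

# Toy example: one heterozygous and one homozygous-alternate SNV.
toy_reference = "ACGTACGT"
toy_snvs = [(1, "C", ("T", "C")), (6, "G", ("A", "A"))]
haplotype1, haplotype2 = build_haplotypes(toy_reference, toy_snvs)
print(haplotype1)  # ATGTACAT
print(haplotype2)  # ACGTACAT
```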
associated data files: Personal genome sequence of haplotype1 (zipped fastq file); Personal genome sequence of haplotype2 (zipped fastq file)
Link to AlleleSeq
Rozowsky J, et al. (2011) AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol Syst Biol 7:522.
b) Microbiome-associated reads (exceRpt)
Approximately 0.03% of the total reads for subjectZ can be attributed to bacterial genomes. Of these, about 25% could be mapped to the Proteobacteria phylum and a further 16% to the Acinetobacter genus. However, it is likely that much, if not all, of this signal is due to laboratory contamination. In other ‘dirtier’ samples, especially those obtained from sputum, we typically observe a greater diversity of bacterial species.
associated data files: Exogenous DNA alignment results for subjectZ (text file)