Research

Gene conversion


De novo gene conversions plotted on the genome. Gene conversions transmitted by males are indicated by blue arrows, and those transmitted by females by red arrows.

Currently, I am studying the rate and dynamics of meiotic gene conversion. This work is in preparation for publication and I plan to post it on bioRxiv by mid-April.

Slides from a recent talk I gave describing this research are available here. In summary, I estimated the rate at which a base in the genome is affected by meiotic gene conversion (~8×10⁻⁶ per base pair per generation), found evidence for GC bias (~70% of gene conversions transmit G or C alleles rather than A or T), and found that females transmit more gene conversions than males (~1.5× more).

SIGMA Type 2 Diabetes Project

I led the analysis for the SIGMA Type 2 Diabetes Project, which examined more than 8,000 individuals of Mexican and other Latin American descent to identify type 2 diabetes susceptibility loci. We identified a novel locus that confers risk for type 2 diabetes and has high frequency in Mexicans and other Latin Americans. The paper describing this work is currently in press at Nature and will appear online before the end of 2013.

Inferring haplotype phase in large genotype datasets of unrelated individuals and trios/duos


Figure 3a from Williams et al. – Switch error rates of HAPI-UR 3x, HAPI-UR, and other methods decrease with sample size.

I developed the software HAPI-UR for inferring phase in large datasets of unrelated and/or trio or duo samples. This work is described in an available paper.

A key insight underlying the methodology employed in HAPI-UR is that haplotype phase accuracy increases with sample size. HAPI-UR uses a computationally efficient method that is more than 18 times faster than other phasing methods. Thus HAPI-UR is efficient and effective at phasing very large datasets and will be especially applicable to the increasingly large datasets being generated and now available.
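Phase accuracy in comparisons like the one in Figure 3a is reported as a switch error rate. The following is a minimal sketch of how such a rate can be computed, assuming the commonly used definition (the fraction of adjacent heterozygous site pairs whose relative phase disagrees with the truth). The function name and input encoding are illustrative assumptions, not part of HAPI-UR or the evaluation code used in the paper.

```python
from typing import Sequence


def switch_error_rate(truth: Sequence[int], inferred: Sequence[int]) -> float:
    """Fraction of adjacent heterozygous-site pairs with a phase switch.

    Both inputs give the allele (0 or 1) carried on one haplotype at each
    heterozygous site of a single individual; the other haplotype is the
    complement. This encoding is an assumption made for illustration.
    """
    if len(truth) != len(inferred):
        raise ValueError("haplotypes must cover the same heterozygous sites")
    if len(truth) < 2:
        return 0.0  # no adjacent pairs, so no switch is possible

    # True where the inferred haplotype carries the opposite allele.
    flipped = [t != i for t, i in zip(truth, inferred)]

    # A switch error is a change in the flipped/not-flipped state
    # between two consecutive heterozygous sites.
    switches = sum(a != b for a, b in zip(flipped, flipped[1:]))
    return switches / (len(truth) - 1)


truth = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 0]  # phase switches after site 2 and after site 5
print(switch_error_rate(truth, inferred))
```

Because only changes in the flipped state are counted, an inferred haplotype that is the exact complement of the truth scores zero, which matches the fact that the labeling of the two haplotypes is arbitrary.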

Inferring haplotype phase in family datasets


Haplotype transmissions in a nuclear family with 11 children. The father’s transmissions are on the left in 11 colored columns, and the mother’s transmissions are on the right. Each column represents the haplotype transmitted to one child from one parent, and switches in color (from blue to red and vice versa) are recombination events.

I developed the software HAPI for inferring haplotypes in family data. HAPI is described here.

Other methods for inferring haplotype phase in family genotype data have runtime that scales exponentially in the number of individuals in the family. HAPI uses a novel state formulation that leverages the fact that real genetic data contain relatively few recombination events and, in so doing, obtains polynomial runtime on real genetic data. The problem of inferring haplotypes in family data has been shown to be NP-hard, but in practice the state formulation that HAPI uses enables it to merge an exponential number of states for realistic inputs.

When run on a dataset containing 103 nuclear families, HAPI was more than 300 times faster than other methods. When analyzing a family with 11 children, HAPI used an average of 4.2 states per marker, with a maximum of 48 states at any marker. In contrast, other methods use 2^(2c) states per marker, where c is the number of children in a nuclear family, and thus for an 11-child family, other methods build about 4.2 million (2^22) states per marker.
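The 4.2 million figure is 2 raised to the power 2c with c = 11. A short check of that arithmetic, assuming the usual reading that each child inherits one of two haplotypes from each parent (two binary choices per child); the function name is illustrative only:

```python
def naive_states_per_marker(num_children: int) -> int:
    """States enumerated per marker when every transmission pattern is kept.

    Assumes two binary choices per child: which paternal haplotype and
    which maternal haplotype was transmitted.
    """
    return 2 ** (2 * num_children)


c = 11
naive = naive_states_per_marker(c)
print(naive)  # 4194304, i.e. about 4.2 million

# Compare with the average of 4.2 states per marker reported for HAPI.
hapi_average = 4.2
print(f"roughly {naive / hapi_average:,.0f} times more states per marker")
```

The comparison with HAPI's average of 4.2 states per marker for the same family comes to about a million-fold difference in states per marker.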

HAPI is currently only able to handle nuclear families, but I plan to extend it to apply to general pedigrees so that haplotype-based genetic analyses of family data will not be computationally limited.

Local ancestry inference in Latinos


Figure 1a from Fejerman et al. (2012). The x axis shows physical position in the genome; the y axis is the -log10 P-value for association at the site based on local ancestry information. Significant association occurs in the 6q25 region.

I developed an extension to HapMix to enable it to infer local ancestry in multi-way admixed populations, including Latinos. This extension is described in the supplement to the 1000 Genomes Phase I paper. I aided in applying this method to identify a breast cancer risk locus in Latinas; the paper describing this work is available here.
