Statistical Genetics and Population Genetics

Singapore Skyline (picture from wikipedia under the GNU Free Documentation License)


GMMAT stands for "Generalized linear Mixed Model Association Test". It is an R package for performing association tests based on generalized linear mixed models (i.e. modelling outcomes with exponential family distributions). The package implements a series of algorithms that improve computational speed, making genome-wide scans efficient in large-scale genetic studies (e.g. case-control disease studies). GMMAT is useful for controlling for family relatedness, population structure, and complex study designs in genome-wide association studies. Dr. Han Chen is the lead developer of this R package.
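
To give a feel for why a per-variant test can be cheap, here is a minimal Python sketch of a score test for a binary trait. It is a deliberate simplification and not GMMAT's implementation: it assumes an intercept-only logistic null model with no covariates and no random effects, so it does not correct for relatedness or population structure. The function name `score_test` and the toy data are hypothetical.

```python
import math

def score_test(genotypes, phenotypes):
    """Score test for one variant under an intercept-only logistic null model.

    Assumption: no covariates and no random effects (unlike a mixed model).
    Returns the 1-df chi-square statistic and its p-value.
    """
    n = len(phenotypes)
    mu = sum(phenotypes) / n          # fitted null probability of being a case
    g_bar = sum(genotypes) / n        # mean genotype dosage
    # Score: genotype weighted by the null-model residuals.
    score = sum(g * (y - mu) for g, y in zip(genotypes, phenotypes))
    # Variance of the score after adjusting for the intercept.
    variance = mu * (1 - mu) * sum((g - g_bar) ** 2 for g in genotypes)
    if variance == 0:
        raise ValueError("monomorphic variant or constant phenotype")
    stat = score ** 2 / variance
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy data: genotype dosages (0/1/2) and case-control status (0/1).
toy_genotypes = [0, 1, 2, 0, 1, 2, 0, 0]
toy_phenotypes = [0, 0, 1, 0, 1, 1, 0, 0]
stat, p_value = score_test(toy_genotypes, toy_phenotypes)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

The null model is fitted once and reused for every variant, so each test costs only a few sums. In a mixed-model setting the residuals and the score variance would come from a null model that also includes a random effect for relatedness.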

The GMMAT R package and manual can be downloaded here.

Citation for GMMAT:

  • H Chen*, C Wang*, MP Conomos, AM Stilp, Z Li, T Sofer, AA Szpiro, W Chen, JM Brehm, JC Celedon, S Redline, GJ Papanicolaou, TA Thornton, CC Laurie, K Rice, X Lin (2016). Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. American Journal of Human Genetics, 98: 653-666. [link]


    LASER stands for "Locating Ancestry from SEquence Reads". This package includes two C++ programs, laser and trace, for estimating individual ancestry in a reference ancestry space using either shotgun sequence reads (laser) or genotype data (trace). Both programs are implemented under a unified framework based on principal components analysis (PCA) and projection Procrustes analysis. Given a shared reference panel, laser and trace can place sequenced and genotyped samples into the same ancestry space.

    LASER can also perform standard PCA on genotype data to explore population structure and to create the reference ancestry space. Different options to compute PC scores and PC loadings have been implemented in the LASER program (version 2.01 or later).
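
The idea of placing a study sample into a reference PCA space can be sketched in Python with NumPy. This is an illustration under simplifying assumptions, not the laser or trace implementation: it uses hard genotype calls, the same number of PCs on both sides (an ordinary Procrustes fit rather than a projection Procrustes fit), and randomly generated genotypes. All function names are hypothetical.

```python
import numpy as np

def pca_coordinates(genotypes, k):
    """Top-k PC scores of a samples-by-markers genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def fit_procrustes(source, target):
    """Scale, orthogonal matrix and shift minimising
    ||scale * source @ rotation + shift - target||^2."""
    mean_s, mean_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mean_s, target - mean_t
    u, s, vt = np.linalg.svd(src.T @ tgt)
    rotation = u @ vt                    # may include a reflection
    scale = s.sum() / (src ** 2).sum()
    shift = mean_t - scale * mean_s @ rotation
    return scale, rotation, shift

def place_sample(reference_genotypes, reference_coords, sample_genotypes):
    """Place one study sample into the reference ancestry space."""
    k = reference_coords.shape[1]
    # PCA on the reference individuals plus the one study sample.
    combined = np.vstack([reference_genotypes, sample_genotypes])
    coords = pca_coordinates(combined, k)
    # Align the reference individuals' new coordinates to the reference space,
    # then apply the same transformation to the study sample (last row).
    scale, rotation, shift = fit_procrustes(coords[:-1], reference_coords)
    return scale * coords[-1] @ rotation + shift

# Simulated data, for illustration only: 50 reference individuals, 200 markers.
rng = np.random.default_rng(0)
reference = rng.integers(0, 3, size=(50, 200)).astype(float)
reference_coords = pca_coordinates(reference, k=2)
sample = rng.integers(0, 3, size=200).astype(float)
print(place_sample(reference, reference_coords, sample))
```

Because every study sample is aligned to the same reference coordinates, samples analysed separately end up in one common space, which is what makes their ancestry estimates comparable.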

    The LASER program and a detailed manual can be downloaded here.

    Citation for LASER:

  • C Wang*, X Zhan*, J Bragg-Gresham, HM Kang, D Stambolian, E Chew, K Branham, J Heckenlively, The FUSION Study, RS Fulton, RK Wilson, ER Mardis, X Lin, A Swaroop, S Zöllner, GR Abecasis (2014). Ancestry estimation and control of population stratification for sequence-based association studies. Nature Genetics, 46: 409-415. [link]

  • C Wang, X Zhan, L Liang, GR Abecasis, X Lin (2015). Improved ancestry estimation for both genotyping and sequencing data using projection Procrustes analysis and genotype imputation. American Journal of Human Genetics, 96: 926-937. [link]

  • LASER Server

    This is a web server that provides a unified framework to estimate ancestry using either genotyping or sequencing data. The server is based on the LASER algorithm (Wang et al. 2014 Nature Genetics, Wang et al. 2015 AJHG). We provide a series of built-in ancestry reference panels on the server so that users do not need to prepare their own panels. By using the same ancestry reference panel on the server, researchers can directly compare ancestry estimates across different studies. We also provide interactive graphical visualization to facilitate quick exploration of the ancestry background of samples.

    Please try our LASER Server and have fun!

    Citation for LASER Server:

  • D Taliun, S Chothani, S Schonherr, L Forer, M Boehnke, GR Abecasis, C Wang (2017). LASER server: ancestry tracing with genotypes or sequence reads. Bioinformatics, 33: 2056-2058. [link]

  • MicroDrop

    MicroDrop is a C++ program for estimating and correcting for allelic dropout in microsatellite data when replicated genotypes are not available. Based on an allele frequency model, the program implements an expectation-maximization algorithm to search for maximum-likelihood estimates of the allele frequencies, sample-specific and locus-specific dropout rates, and an inbreeding coefficient. With the estimated parameter values, an empirical Bayesian strategy is used to prepare multiple imputed data sets to circumvent allelic dropout in downstream data analyses.
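
The imputation step can be illustrated with a small Python sketch. It rests on a simplified dropout model that is an assumption here, not a description of MicroDrop's code: each allele copy drops out independently with probability `dropout`, and genotype priors follow allele frequencies with an inbreeding coefficient. Parameter values are taken as given rather than estimated by EM, and the function names are hypothetical.

```python
import random

def posterior_true_homozygote(p_allele, dropout, inbreeding=0.0):
    """P(true genotype is AA | observed genotype is AA) under the assumed model."""
    if not 0.0 < p_allele <= 1.0:
        raise ValueError("p_allele must be in (0, 1]")
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    # Prior genotype probabilities given allele frequency and inbreeding.
    prior_hom = p_allele ** 2 + inbreeding * p_allele * (1 - p_allele)
    prior_het = 2 * p_allele * (1 - p_allele) * (1 - inbreeding)  # AB, summed over B != A
    # A true homozygote is still observed as AA unless both copies drop out.
    like_hom = 1 - dropout ** 2
    # A true heterozygote AB is observed as AA when only the B copy drops out.
    like_het = dropout * (1 - dropout)
    hom = prior_hom * like_hom
    het = prior_het * like_het
    return hom / (hom + het)

def impute_true_genotypes(p_allele, dropout, inbreeding, n_draws, seed=0):
    """Draw imputed genotype classes for one observed homozygote."""
    rng = random.Random(seed)
    post = posterior_true_homozygote(p_allele, dropout, inbreeding)
    return ["hom" if rng.random() < post else "het" for _ in range(n_draws)]

print(posterior_true_homozygote(p_allele=0.5, dropout=0.5))
print(impute_true_genotypes(p_allele=0.5, dropout=0.5, inbreeding=0.0, n_draws=5))
```

Each draw corresponds to one imputed data set, so downstream analyses can be run on every draw and combined, which carries the uncertainty about dropout through to the final results.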

    The MicroDrop program and a detailed manual can be downloaded here.

    Citation for MicroDrop:

  • C Wang, KB Schroeder, NA Rosenberg (2012). A maximum-likelihood method to correct for allelic dropout in microsatellite data with no replicate genotypes. Genetics, 192: 651-669. [link]


    SEEKIN stands for "SEquence-based Estimation of KINship". This is a C++ program for estimating pairwise kinship coefficients in both homogeneous samples and heterogeneous samples with population structure and admixture. The method was initially developed to analyze sparse sequencing data, such as off-target data from targeted sequencing experiments, in which genotypes are uncertain, but it can also be applied to high-quality genotyping data. The program is computationally efficient, supports multithreading, and takes standard VCF files as input.
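
For readers unfamiliar with kinship coefficients, the Python sketch below shows a classical moment estimator for a homogeneous sample with confidently called genotypes. It is not SEEKIN's estimator: it ignores genotype uncertainty, population structure, and admixture, and it assumes the allele frequencies are known. The function name `kinship` and the toy genotypes are hypothetical.

```python
def kinship(genotypes_i, genotypes_j, allele_freqs):
    """Moment estimate of the kinship coefficient between two individuals.

    genotypes_i, genotypes_j: allele counts (0, 1 or 2) at each marker.
    allele_freqs: frequency of the counted allele at each marker.
    """
    # Covariance of genotypes centred at their expected value 2p.
    numerator = sum(
        (gi - 2 * p) * (gj - 2 * p)
        for gi, gj, p in zip(genotypes_i, genotypes_j, allele_freqs)
    )
    # Scaling chosen so a non-inbred individual has self-kinship 0.5.
    denominator = 4 * sum(p * (1 - p) for p in allele_freqs)
    if denominator == 0:
        raise ValueError("all markers are monomorphic")
    return numerator / denominator

freqs = [0.5, 0.5, 0.5, 0.5]
person_a = [0, 1, 2, 1]
person_b = [2, 1, 0, 1]
print(kinship(person_a, person_a, freqs))  # self-kinship
print(kinship(person_a, person_b, freqs))  # pairwise kinship
```

With only four markers the estimates are very noisy. In practice the sums run over many thousands of markers, and estimates near 0.25 indicate first-degree relatives while estimates near 0 indicate unrelated pairs.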

    The SEEKIN software package is available on GitHub.

    Citation for SEEKIN:

  • J Dou*, B Sun*, X Sim, JD Hughes, DF Reilly, ES Tai, J Liu, C Wang (2017). Estimation of kinship coefficient in structured and admixed populations using sparse sequencing data. PLOS Genetics, 13: e1007021. [link]