Statistical And Population Genetics
Medical Genomics

We are located in Wuhan, a metropolitan city in Central China beside the Yangtze River.

CLoMAT

CLoMAT stands for "Conditional Logistic Model Association Tests". This R package implements three rare-variant association tests for matched case-control data under the conditional logistic regression (CLR) framework, namely CLR-Burden, CLR-SKAT, and CLR-MiST, as well as a fast heuristic matching algorithm. CLoMAT provides a general solution to control for population stratification by matching cases and controls based on their ancestry background. It can increase the power of genetic association studies when a large pool of common controls is available.
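To illustrate the idea behind a burden test in the CLR framework, here is a minimal Python sketch of a score test for a single burden covariate. It is not CLoMAT's code or API: it assumes each matched stratum contains exactly one case, uses an unweighted burden score (for example, a count of rare alleles per individual), and omits other covariates. The function name and input layout are hypothetical.

```python
import math


def clr_burden_score_test(strata):
    """Score test of beta = 0 in a conditional logistic model.

    strata: list of (case_burden, control_burdens) pairs, one case per
    stratum (an assumption of this sketch). Returns (U, V, Z, p), where
    U is the score, V the information at beta = 0, Z = U / sqrt(V),
    and p the two-sided normal p-value.
    """
    score = 0.0
    info = 0.0
    for case_burden, control_burdens in strata:
        values = [case_burden] + list(control_burdens)
        n = len(values)
        mean = sum(values) / n
        # Score contribution: case burden minus the stratum mean.
        score += case_burden - mean
        # Information contribution: within-stratum (population) variance.
        info += sum((v - mean) ** 2 for v in values) / n
    if info == 0.0:
        raise ValueError("no within-stratum variation in burden scores")
    z = score / math.sqrt(info)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return score, info, z, p_value


# Two matched strata, each with one case and two ancestry-matched controls.
u, v, z, p = clr_burden_score_test([(2, [0, 0]), (1, [1, 1])])
print(f"U={u:.3f}, V={v:.3f}, Z={z:.3f}, p={p:.3f}")
```

Because the test conditions on each stratum, only within-stratum differences in burden contribute. This is why matching on ancestry can guard against population stratification.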

The CLoMAT R package and manual can be downloaded from GitHub.

Citation for CLoMAT:

  • S Cheng*, J Lyu*, X Shi, K Wang, Z Wang, M Deng, B Sun, C Wang (2022). Rare variant association tests for ancestry-matched case-control data based on conditional logistic regression. Briefings in Bioinformatics, 23(2): bbab572. [link]

  • GMMAT

    GMMAT stands for "Generalized linear Mixed Model Association Test". This is an R package to perform association tests based on generalized linear mixed models (i.e., modeling outcomes with exponential family distributions). The package implements a series of algorithms to improve computational speed, making it efficient for genome-wide scans in large-scale genetic studies (e.g., case-control disease studies). GMMAT is useful to control for family relatedness, population structure, and complex study design in genome-wide association studies. Dr. Han Chen is the lead developer of this R package.

    The GMMAT R package and manual can be downloaded here.

    Citation for GMMAT:

  • H Chen*, C Wang*, MP Conomos, AM Stilp, Z Li, T Sofer, AA Szpiro, W Chen, JM Brehm, JC Celedon, S Redline, GJ Papanicolaou, TA Thornton, CC Laurie, K Rice, X Lin (2016). Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. American Journal of Human Genetics, 98: 653-666. [link]

  • LASER

    LASER stands for "Locating Ancestry from SEquence Reads". This package includes two C++ programs, laser and trace, for estimating individual ancestry in a reference ancestry space using either shotgun sequence reads (laser) or genotype data (trace). Both programs were implemented under a unified framework based on principal components analysis (PCA) and projection Procrustes analysis. Given a shared reference panel, laser and trace can place sequenced and genotyped samples into the same ancestry space.

    LASER can also perform standard PCA on genotype data to explore population structure and to create the reference ancestry space. Different options to compute PC scores and PC loadings have been implemented in the LASER program (version 2.01 or later).
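As a rough illustration of the Procrustes step mentioned above, the following Python sketch fits an ordinary Procrustes transformation (scaling, orthogonal rotation or reflection, and translation) that maps one set of PC coordinates onto another with the same number of dimensions. It is not LASER's implementation and does not cover the projection variant used when the two spaces differ in dimension. The function names are hypothetical and NumPy is assumed to be available.

```python
import numpy as np


def procrustes_fit(X, Y):
    """Find rho, A, b minimizing ||rho * X @ A + b - Y||_F with A orthogonal.

    X, Y: (n_samples, k) coordinates of the same samples in two spaces.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # The SVD of the cross-product matrix gives the optimal orthogonal map.
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    A = U @ Vt
    rho = S.sum() / np.sum(Xc ** 2)
    b = y_mean - rho * x_mean @ A
    return rho, A, b


def procrustes_apply(X, rho, A, b):
    """Map coordinates X into the target space."""
    return rho * np.asarray(X, dtype=float) @ A + b


# Toy example: Y is X rotated by 90 degrees, scaled by 2, and shifted.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A_true = np.array([[0.0, 1.0], [-1.0, 0.0]])
Y = 2.0 * X @ A_true + np.array([3.0, -1.0])

rho, A, b = procrustes_fit(X, Y)
print(rho)
print(np.round(procrustes_apply(X, rho, A, b) - Y, 6))
```

In an ancestry setting, X would hold the reference individuals' coordinates from a sample-specific PCA and Y their coordinates in the reference ancestry space. The fitted transformation would then be applied to the study sample's coordinates.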

    The LASER program and a detailed manual can be downloaded here.

    Citation for LASER:

  • C Wang*, X Zhan*, J Bragg-Gresham, HM Kang, D Stambolian, E Chew, K Branham, J Heckenlively, The FUSION Study, RS Fulton, RK Wilson, ER Mardis, X Lin, A Swaroop, S Zöllner, GR Abecasis (2014). Ancestry estimation and control of population stratification for sequence-based association studies. Nature Genetics, 46: 409-415. [link]

  • C Wang, X Zhan, L Liang, GR Abecasis, X Lin (2015). Improved ancestry estimation for both genotyping and sequencing data using projection Procrustes analysis and genotype imputation. American Journal of Human Genetics, 96: 926-937. [link]

  • LASER Server

    This is a web server that provides a unified framework to estimate ancestry using either genotyping or sequencing data. The server is based on the LASER algorithm (Wang et al. 2014 Nature Genetics, Wang et al. 2015 AJHG). We provide a series of built-in ancestry reference panels on the server so that users do not need to prepare their own panels. By using the same ancestry reference panel on the server, researchers can directly compare ancestry estimates across different studies. We also provide interactive graphical visualization to facilitate quick exploration of the ancestry background of samples.

    Please try our LASER Server and have fun!

    Citation for LASER Server:

  • D Taliun, S Chothani, S Schonherr, L Forer, M Boehnke, GR Abecasis, C Wang (2017). LASER server: ancestry tracing with genotypes or sequence reads. Bioinformatics, 33: 2056-2058. [link]

  • MicroDrop

    MicroDrop is a C++ program for estimating and correcting for allelic dropout in microsatellite data when replicated genotypes are not available. Based on an allele frequency model, the program implements an expectation-maximization algorithm to search for maximum-likelihood estimates of the allele frequencies, sample-specific and locus-specific dropout rates, and an inbreeding coefficient. With the estimated parameter values, an empirical Bayesian strategy is used to prepare multiple imputed data sets to circumvent allelic dropout in downstream data analyses.

    The MicroDrop program and a detailed manual can be downloaded here.

    Citation for MicroDrop:

  • C Wang, KB Schroeder, NA Rosenberg (2012). A maximum-likelihood method to correct for allelic dropout in microsatellite data with no replicate genotypes. Genetics, 192: 651-669. [link]

  • SEEKIN

    SEEKIN stands for "SEquence-based Estimation of KINship". This is a C++ program to estimate pairwise kinship coefficients for both homogeneous samples and heterogeneous samples with population structure and admixture. The method was initially developed to analyze sparse sequencing data, such as off-target data from targeted sequencing experiments, in which genotypes are uncertain, but it can also be applied to high-quality genotyping data. The program is computationally efficient, supports multithreading, and takes standard VCF files as input.
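For readers unfamiliar with kinship coefficients, the Python sketch below shows a classical moment estimator for a homogeneous sample with known allele frequencies and hard genotype calls. It is only a baseline illustration and not SEEKIN's estimator, which additionally handles genotype uncertainty, population structure, and admixture. The function name and inputs are hypothetical.

```python
def kinship_homogeneous(genotypes_i, genotypes_j, allele_freqs):
    """Classical moment estimator of the kinship coefficient.

    genotypes_i, genotypes_j: allele counts (0, 1, or 2) per variant.
    allele_freqs: frequency p of the counted allele per variant.
    Returns sum((g_i - 2p)(g_j - 2p)) / sum(4p(1 - p)).
    """
    numerator = 0.0
    denominator = 0.0
    for g_i, g_j, p in zip(genotypes_i, genotypes_j, allele_freqs):
        numerator += (g_i - 2.0 * p) * (g_j - 2.0 * p)
        denominator += 4.0 * p * (1.0 - p)
    if denominator == 0.0:
        raise ValueError("all variants are monomorphic")
    return numerator / denominator


# Toy input with three variants; real analyses use many thousands.
freqs = [0.5, 0.5, 0.5]
print(kinship_homogeneous([0, 1, 2], [0, 1, 2], freqs))
print(kinship_homogeneous([0, 1, 2], [2, 1, 0], freqs))
```

With many variants, this estimator is expected to be close to 0.5 for an individual paired with themselves, 0.25 for parent-offspring or full-sibling pairs, and 0 for unrelated pairs. With only three variants, as in the toy input, the estimates are very noisy.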

    The SEEKIN software package is available on GitHub.

    Citation for SEEKIN:

  • J Dou*, B Sun*, X Sim, JD Hughes, DF Reilly, ES Tai, J Liu, C Wang (2017). Estimation of kinship coefficient in structured and admixed populations using sparse sequencing data. PLOS Genetics, 13: e1007021. [link]

  • WEScall

    WEScall is a genotype calling pipeline for both whole-exome sequencing (WES) and whole-genome sequencing (WGS) data. It was designed to utilize linkage disequilibrium (LD) information within the study sample and from an external WGS reference panel (such as the 1000 Genomes Project) to improve genotype calling accuracy. For WES, the pipeline makes use of the shallow off-target sequencing data, allowing for relatively accurate genotyping across non-coding regions and thus improving downstream association analysis and polygenic risk prediction. For more details, please see the reference listed below.

    The WEScall software pipeline is available on GitHub.

    Citation for WEScall:

  • J Dou*, D Wu*, L Ding, K Wang, M Jiang, X Chai, DF Reilly, ES Tai, J Liu, X Sim, S Cheng, C Wang (2021). Using off-target data from whole-exome sequencing to improve genotyping accuracy, association analysis, and polygenic risk prediction. Briefings in Bioinformatics, 22(3): bbaa084. [link]