MethylPCA - A toolkit to control for confounders in methylome-wide association studies
lga972 - A cross-platform application for optimizing multistage association studies
In methylome-wide association studies (MWAS) there are many possible differences between cases and controls (e.g. related to lifestyle, diet, and medication use) that may affect the methylome and produce false positive findings. An effective approach to control for these confounders is to first capture the major sources of variation in the methylation data and then regress out these components in the association analyses. This approach is, however, computationally very challenging because the human genome comprises over 30 million possible methylation sites. We introduce MethylPCA, which is specifically designed to handle this problem. MethylPCA can:
1) Create blocks. Reducing the total number of sites has computational and statistical advantages (e.g., a decreased risk of false discoveries and less redundancy in the PCA), and the sum of substantially inter-correlated measurements is a more reliable indicator of the underlying signal than the individual measurements taken separately. Rather than using a sliding window of a pre-determined fixed length, MethylPCA combines adjacent sites adaptively based on the observed inter-correlations.
2) Perform PCA. The PCA is based on input methylation data and the output is PC scores, eigenvalues and loadings. The PCA is performed through eigen-decomposition of a much smaller inner product matrix calculated from the methylation data.
3) Perform association tests. The association tests are run with user-supplied covariates, typically the PC scores calculated in the PCA procedure. The output comprises the test statistics and p-values, as well as a QQ plot.
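The inner-product approach in step 2 can be illustrated with a minimal sketch. It is not MethylPCA's C++ implementation; the function name `pca_via_gram`, the sites-by-samples layout, and the use of NumPy are assumptions made for illustration. With p sites and n samples (p much larger than n), the n x n inner product matrix has the same non-zero eigenvalues as the p x p matrix of site cross-products, so only the small matrix needs to be decomposed.

```python
import numpy as np


def pca_via_gram(X, n_components):
    """PCA of a sites-by-samples matrix X (p x n) via the n x n inner product matrix.

    Hypothetical helper for illustration. Requires n_components <= n - 1,
    because centering removes one dimension from the sample space.
    """
    n = X.shape[1]
    # Center each site (row) across samples.
    Xc = X - X.mean(axis=1, keepdims=True)
    # n x n inner product matrix. It is a sum over sites, so it could also be
    # accumulated chunk by chunk (e.g. one chromosome at a time).
    G = Xc.T @ Xc
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    evals, evecs = np.linalg.eigh(G)
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]
    # PC scores for the samples (n x k).
    scores = evecs * np.sqrt(evals)
    # Loadings for the sites (p x k), recovered from the small decomposition.
    loadings = Xc @ evecs / np.sqrt(evals)
    # Eigenvalues on the covariance scale.
    return scores, evals / (n - 1), loadings
```

The returned scores could then be supplied as covariates in the association tests of step 3. The sketch only ever decomposes an n x n matrix, which is what makes the approach feasible when the number of sites runs into the millions.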
To speed up calculations, data from different chromosomes can be processed simultaneously and the PCA input matrix can be computed in parallel. Statistics that are used repeatedly (e.g. means in the entire sample) are calculated only once and stored to further increase efficiency. MethylPCA consists of separate components that can be run individually or as a pipeline. A user-friendly interface is provided in which a parameter file controls which procedures are performed and how. The software is described in the paper:
Wenan Chen, Guimin Gao, Karolina A Aberg, Srilaxmi Nerella, Swedish Schizophrenia Consortium, Christina M Hultman, Patrik KE Magnusson, Patrick F Sullivan, Edwin JCG van den Oord (2013). MethylPCA: A toolkit to control for confounders in methylome-wide association studies. BMC Bioinformatics, In press.
The computationally and I/O intensive part of MethylPCA is implemented in C++, and the R package serves as the user interface. The documentation, source code, executables, and example can be downloaded for Windows (WinZip format), Mac OS X (Zip format), and Linux (tar.gz format).
Because of the assay costs and the large sample sizes that are required to discover effects while controlling false discoveries, large-scale genetic association studies can be very expensive. Two-stage designs can be used to design these studies in the most cost-effective way. In two-stage designs all the markers are assayed and tested in a first stage. Only the promising markers are subsequently assayed in the second stage using additional samples. Compared to single-stage studies, optimized multistage designs can achieve the same goals in terms of true and false discoveries with a 50-70% saving in the amount of genotyping. Furthermore, rather than relying on arbitrary rules (e.g. P-values smaller than 0.05 suggest a replication), multistage designs can provide statistically motivated decision rules for declaring significance.
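The source of the genotyping saving can be seen with some simple arithmetic. The sketch below only counts genotypes (marker by sample assays); it does not model power or false discoveries, which is what lga972 actually optimizes. The function names and all input values are hypothetical and chosen for illustration.

```python
def two_stage_genotypes(n_markers, n_stage1, n_stage2, frac_carried):
    """Genotypes assayed in a two-stage design (hypothetical helper).

    All markers are typed in the stage-1 samples; only the fraction
    `frac_carried` of markers is typed in the additional stage-2 samples.
    """
    stage1 = n_markers * n_stage1
    stage2 = round(n_markers * frac_carried) * n_stage2
    return stage1 + stage2


def saving_vs_single_stage(n_markers, n_total, frac_stage1, frac_carried):
    """Relative genotyping saving compared with typing every marker in every sample."""
    n_stage1 = round(n_total * frac_stage1)
    n_stage2 = n_total - n_stage1
    two_stage = two_stage_genotypes(n_markers, n_stage1, n_stage2, frac_carried)
    single_stage = n_markers * n_total
    return 1 - two_stage / single_stage


# Made-up example: 500,000 markers, 2,000 samples, 40% of the samples in
# stage 1, and 5% of the markers carried forward to stage 2.
saving = saving_vs_single_stage(500_000, 2_000, 0.40, 0.05)
print(f"genotyping saving: {saving:.0%}")  # about 57% under these inputs
```

Whether a design with these proportions reaches the same true and false discovery goals as the single-stage study depends on the effect sizes and the thresholds used at each stage, which this count deliberately ignores.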
lga972 is a cross-platform application with a graphical interface that uses a genetic algorithm to determine the design features of two-stage genetic association studies that minimize the genotyping burden. The user can choose among a variety of case-control and family-based tests in which the outcome may be scored as present versus absent or as a continuous variable. The text-based output can easily be exported to other programs such as word processors and spreadsheets.
lga972 is described in:
Robles, J & Van den Oord, EJCG (2004). lga972: A cross-platform application for optimizing LD studies via the genetic algorithm. Bioinformatics, 20, 3244-3245.
Van den Oord, EJCG & Sullivan, PF (2003). False discoveries and models for gene discovery. Trends in Genetics, 19, 537-542.
Van den Oord, EJCG & Sullivan, PF (2003). A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Human Heredity, 188-199.
Van den Oord, EJCG (2005). Controlling false discoveries in candidate gene studies. Molecular Psychiatry, 10, 230-231.
Bukszár, J & Van den Oord, EJCG (2006). Optimization of two-stage genetic designs where data are combined using an accurate and efficient approximation for Pearson's statistic. Biometrics, 62, 1132-1137.
Download lga972 in tar.gz format
lga972 is distributed as freeware. To install, download the lga972 distribution and extract the archive on your system. The lga972 distribution includes the program (Java jar file), the program manual (PDF), and the user license (text). You need the Java Runtime Environment (or Java Development Kit) Standard Edition, version 1.3.1 or later. Check your system for an existing installation, or download the JRE and follow the installation instructions at the Sun Microsystems Java site.