Frederick Boehm
Frederick Boehm
Biostatistics, genetics, medicine, open science
Mar 19, 2018 4 min read

Getting started with the reticulate R package

thumbnail for this post

Motivation

My colleague Rene Welch brought to my attention a recent pre-print, “LiMMBo: a simple, scalable approach for linear mixed models in high-dimensional genetic association studies” by Hannah Meyer, Francesco Casale, Oliver Stegle, and Ewan Birney. The biorxiv preprint is at https://www.biorxiv.org/content/early/2018/01/30/255497

The authors present a bootstrap-based strategy for fitting multivariate linear mixed effects models in genetics studies with up to hundreds of phenotypes. They use the model:

\[vec(Y) = Xvec(B) + vec(G) + vec(E)\] where G and E are independent n by p matrices of random effects and random errors, respectively; X is a \(np\) by \(p(f + c)\) block-diagonal matrix (with p blocks of dimension \(n\) by \(f + c\) on the diagonal) containing both the genetic marker data and the covariate values; B is a \(f + c\) by \(p\) matrix of covariate and allele effects. They also assume that G and E are matrix-variate normally distributed:

\[G \sim MN(0, K, V_g)\]

\[E \sim MN(0, I_n, V_e)\]

They use the bootstrap in a clever way in finding an estimate for the \(np\) by \(np\) covariance matrix. A single bootstrap sample of \(s\) phenotypes is randomly drawn from the \(p\) phenotypes for all subjects. They then estimate the marginal covariance matrix - and the matrices \(V_g\) and \(V_e\) - for the \(s\)-variate phenotype and repeat \(b\) times. They then have \(b\) estimates of marginal covariance matrices. By the nature of the sampling procedure, there may be multiple bootstrap samples in which traits \(i\) and \(j\) (for \(1 \le i, j \le p\)) both appear. They average each covariate trait pair \(i\), \(j\), then use least-squares to find the closest covariance matrix. The last step makes use of the Broyden-Fletcher-Goldfarb-Shanno algorithm.

I’m particularly interested in the methodology of this paper for reasons that I’ll develop in a future blog post. For now, I want to document my initial attempts at using the R package reticulate

reticulate R package

The goal of reticulate R package, as I understand it, is to enable calling of python code from R and vice-versa.

The authors share freely their python module limmbo and their analysis code on Github.

Recognizing that the methods would be useful in my research, and not wanting to work in python, I sought a way to use the existing python code within R code wrappers that I would write. An internet search guided me to the reticulate R package, which is on both CRAN and Github.

Initial R code with reticulate

I first installed Anaconda2 and limmbo as described by Hannah Meyer in the limmbo README file. I then loaded the reticulate package in a new R session.

library(reticulate)

Within reticulate, there is a function import to load a python module (ie, like a package in the R terminology). But you want to be sure to store the output as something (don’t just run import() without assigning its output to something).

import("limmbo", convert = TRUE) -> limmbo

Now, we can look at the object limmbo & its contents. Typing merely limmbo at the R prompt isn’t terribly helpful:

limmbo
## Module(limmbo)

However, we can use the $ operator to explore the contents of limmbo.

limmbo$core$vdsimple$vd_reml
## <function vd_reml at 0x1a2583e5f0>

It would be nice to know a little about this function. Let’s look at its help page. To do that, we use the reticulate R package’s function py_help.

py_help(limmbo$core$vdsimple$vd_reml)

If you’re working in Rstudio in an interactive session, the Python help file will pop up in a new tab. The help file has 3 main sections: 1. Arguments, 2. Returns, and 3. Examples.

I’ll work through the python code in the Examples section of the help file for the above function.

import numpy
from numpy.random import RandomState
from numpy.linalg import cholesky as chol
from limmbo.core.vdsimple import vd_reml
from limmbo.io.input import InputData
random = RandomState(15)
N = 100
S = 1000
P = 3
snps = (random.rand(N, S) < 0.2).astype(float)
kinship = numpy.dot(snps, snps.T) / float(10)
y  = random.randn(N, P)
pheno = numpy.dot(chol(kinship), y)
pheno_ID = [ 'PID{}'.format(x+1) for x in range(P)]
samples = [ 'SID{}'.format(x+1) for x in range(N)]
datainput = InputData()
datainput.addPhenotypes(phenotypes = pheno,
phenotype_ID = pheno_ID, pheno_samples = samples)
datainput.addRelatedness(relatedness = kinship,
relatedness_samples = samples)
Cg, Cn, ptime = vd_reml(datainput, verbose=True)
## Estimate covariance matrices based on standard REML
## Variance decomposition via REML converged
## Minimum Eigenvalue 0.6904
## Minimum Eigenvalue 0.0001
print(Cg)
## [[ 0.94744745  0.0377506   0.08489522]
##  [ 0.0377506   0.98454127 -0.21465495]
##  [ 0.08489522 -0.21465495  0.89726338]]

Now, we see that we created the objects Cn, Cg and ptime using python. We can access them with R code by using the object py:

py$Cg
py$Cn
py$ptime

Examine vdbootstrap code

(limmbo$core$vdbootstrap$LiMMBo(py$datainput, timing = TRUE, iterations = 100, S = 2) -> foo)
limmbo$core$vdbootstrap$LiMMBo$runBootstrapCovarianceEstimation(foo, cpus = 1, seed = 12345, minCooccurrence = 10)

Other resources

Rstudio has a reticulate site: https://rstudio.github.io/reticulate/index.html

They include many examples and discussions of more technical issues.

comments powered by Disqus