# Getting started with the reticulate R package

## Motivation

My colleague Rene Welch brought to my attention a recent pre-print, “LiMMBo: a simple, scalable approach for linear mixed models in high-dimensional genetic association studies” by Hannah Meyer, Francesco Casale, Oliver Stegle, and Ewan Birney. The biorxiv preprint is at https://www.biorxiv.org/content/early/2018/01/30/255497

The authors present a bootstrap-based strategy for fitting multivariate linear mixed effects models in genetics studies with up to hundreds of phenotypes. They use the model:

\[vec(Y) = Xvec(B) + vec(G) + vec(E)\] where G and E are independent n by p matrices of random effects and random errors, respectively; X is a \(np\) by \(p(f + c)\) block-diagonal matrix (with p blocks of dimension \(n\) by \(f + c\) on the diagonal) containing both the genetic marker data and the covariate values; B is a \(f + c\) by \(p\) matrix of covariate and allele effects. They also assume that G and E are matrix-variate normally distributed:

\[G \sim MN(0, K, V_g)\]

\[E \sim MN(0, I_n, V_e)\]

They use the bootstrap in a clever way in finding an estimate for the \(np\) by \(np\) covariance matrix. A single bootstrap sample of \(s\) phenotypes is randomly drawn from the \(p\) phenotypes for all subjects. They then estimate the marginal covariance matrix - and the matrices \(V_g\) and \(V_e\) - for the \(s\)-variate phenotype and repeat \(b\) times. They then have \(b\) estimates of marginal covariance matrices. By the nature of the sampling procedure, there may be multiple bootstrap samples in which traits \(i\) and \(j\) (for \(1 \le i, j \le p\)) both appear. They average each covariate trait pair \(i\), \(j\), then use least-squares to find the closest covariance matrix. The last step makes use of the Broyden-Fletcher-Goldfarb-Shanno algorithm.

I’m particularly interested in the methodology of this paper for reasons that I’ll develop in a future blog post. For now, I want to document my initial attempts at using the R package `reticulate`

`reticulate`

R package

The goal of `reticulate`

R package, as I understand it, is to enable calling of `python`

code from `R`

and vice-versa.

The authors share freely their `python`

module `limmbo`

and their analysis code on Github.

Recognizing that the methods would be useful in my research, and not wanting to work in `python`

, I sought a way to use the existing `python`

code within `R`

code wrappers that I would write. An internet search guided me to the `reticulate`

R package, which is on both CRAN and Github.

## Initial R code with `reticulate`

I first installed Anaconda2 and `limmbo`

as described by Hannah Meyer in the `limmbo`

README file. I then loaded the `reticulate`

package in a new R session.

`library(reticulate)`

Within `reticulate`

, there is a function `import`

to load a `python`

module (ie, like a *package* in the `R`

terminology). But you want to be sure to store the output as something (don’t just run `import()`

without assigning its output to something).

`import("limmbo", convert = TRUE) -> limmbo`

Now, we can look at the object `limmbo`

& its contents. Typing merely `limmbo`

at the R prompt isn’t terribly helpful:

`limmbo`

`## Module(limmbo)`

However, we can use the `$`

operator to explore the contents of `limmbo`

.

`limmbo$core$vdsimple$vd_reml`

`## <function vd_reml at 0x1a2b7086e0>`

It would be nice to know a little about this function. Let’s look at its help page. To do that, we use the `reticulate`

R package’s function `py_help`

.

`py_help(limmbo$core$vdsimple$vd_reml)`

If you’re working in Rstudio in an interactive session, the `Python`

help file will pop up in a new tab. The help file has 3 main sections: 1. Arguments, 2. Returns, and 3. Examples.

I’ll work through the `python`

code in the Examples section of the help file for the above function.

```
import numpy
from numpy.random import RandomState
from numpy.linalg import cholesky as chol
from limmbo.core.vdsimple import vd_reml
from limmbo.io.input import InputData
random = RandomState(15)
N = 100
S = 1000
P = 3
snps = (random.rand(N, S) < 0.2).astype(float)
kinship = numpy.dot(snps, snps.T) / float(10)
y = random.randn(N, P)
pheno = numpy.dot(chol(kinship), y)
pheno_ID = [ 'PID{}'.format(x+1) for x in range(P)]
samples = [ 'SID{}'.format(x+1) for x in range(N)]
datainput = InputData()
datainput.addPhenotypes(phenotypes = pheno,
phenotype_ID = pheno_ID, pheno_samples = samples)
datainput.addRelatedness(relatedness = kinship,
relatedness_samples = samples)
Cg, Cn, ptime = vd_reml(datainput, verbose=True)
```

```
## Estimate covariance matrices based on standard REML
## Variance decomposition via REML converged
## Minimum Eigenvalue 0.6904
## Minimum Eigenvalue 0.0001
```

`print(Cg)`

```
## [[ 0.94744745 0.0377506 0.08489522]
## [ 0.0377506 0.98454127 -0.21465495]
## [ 0.08489522 -0.21465495 0.89726338]]
```

Now, we see that we created the objects `Cn`

, `Cg`

and `ptime`

using `python`

. We can access them with R code by using the object `py`

:

```
py$Cg
py$Cn
py$ptime
```

## Examine `vdbootstrap`

code

`(limmbo$core$vdbootstrap$LiMMBo(py$datainput, timing = TRUE, iterations = 100, S = 2) -> foo)`

`limmbo$core$vdbootstrap$LiMMBo$runBootstrapCovarianceEstimation(foo, cpus = 1, seed = 12345, minCooccurrence = 10)`

## Other resources

Rstudio has a `reticulate`

site: https://rstudio.github.io/reticulate/index.html

They include many examples and discussions of more technical issues.